When the term high availability is brought up, most of the DBAs that I know reflexively think of technical solutions that we have implemented to help keep a system or database available for users – things like pureScale (“data sharing” for you mainframe DBAs), HADR and cluster managers like HACMP or TSA. However, there is so much more to availability than simply ensuring that a database is available for a user to connect to.
In the book High Availability: Design, Techniques and Processes, Floyd Piedad and Michael W. Hawkins state the following when defining availability:
"A system is available when users can work with it without experiencing outages. Note that for as long as the user does not perceive or feel the outage, the system is available to him. Availability is measured from the user’s point of view."
This idea – that availability must be considered from the user’s point of view – is something that is often forgotten.
Quantifying High Availability
Availability is usually measured using the a metric called “class of nines” – the more nines, the more available the system is.
When my customers talk about high availability, they are frequently looking for systems that have a minimum of four 9s (which allows for only 4.3 minutes of downtime per month), and often are aiming for at least 5 nines (allowing only 25.9 seconds of downtime per month).
As IT practitioners, we will usually add what can be a very large caveat when they are talking about availability: We measure it against the Service Level Agreement (or SLA). “Planned” outages are not generally considered when calculating how much downtime a system has, so the durations shown in the downtime per month column above apply for unplanned outages only.
What Causes Downtime?
We tend to think that the most common causes of an outage are hardware failures (whether it’s a power outage, a component in the server failing or hard drives failing), or software failures like the operating system or (gasp!) even the DB2 engine failing.
While this may have been true 10–15 years ago, in my experience these types of failures – especially hardware failures – have become increasingly rare. Hardware and software quality both continue to improve, eliminating many unexpected failures.
Even though many of the traditional problems that cause outages have been reduced, we still receive calls from our users on a far-too-regular basis indicating that something is not truly available.
What Really Causes Downtime?
User (or operator) error can be a big problem. If you are depending on systems that are not fully automated for loading data into your warehouse (and let’s face it, this is more common than we would all like to admit), you can end up with duplicate data, incomplete data, etc. Sometimes users (or support staff) make mistakes and can cause inadvertent problems with table availability.
Similarly, failures in the normal ETL processes that load data can cause outages – whether it’s a failure or delay in the automated process – can also result in a situation where data is not available when users are expecting it to be.
From a DBA perspective, these may not seem like outages – the database is still available, after all – but from the users’ perspective, when the data that was loaded is wrong or incomplete, they can’t “use” the system like they need.
There are still plenty of database maintenance activities that can’t be done completely online that can also cause availability issues for users. While you may be able to conduct some of these activities only in planned outage windows, sometimes this is not feasible. Users are always pushing for outage windows to be smaller – especially when users are spread around the world in varying time zones. Even during outage windows, it’s not uncommon to find users on the system (even when they know that their reports may be killed).
There are also a set of circumstances that I like to refer to as “Logical Downtime”. This occurs when there are inconsistencies in the data within your database that can result in incorrect (or misleading) query results.
For example, if your database has materialized query tables (MQTs) or aggregate tables that have a deferred refresh, there will be a period of time after data has been loaded into the base table(s) where the data present in the aggregate tables does not match what is stored in the base table. If a user executes a report that uses the aggregate tables they will see one set of data, but drilling down to the base fact level can result in different results than a user expects because the fact table already contains data that is not yet present in the aggregates.
A similar situation can occur when loading data into a series of fact tables, and the loads do not all complete at the same time. This means that some tables will have “today’s” data, but other tables still have “yesterday’s” data. This can create confusion or problems for users if they execute a report that requires data from more than one fact table.
How to Avoid Downtime
How can we combat this downtime? Unfortunately, there aren’t packaged products like Tivoli System Automation (TSA) that can be implemented to prevent these issues. The solution must to be built into the design of your data warehouse.
One of the most common ways to solve this problem involves duplicating some or all of the data in the data warehouse. The goal is to create an “online” copy of the data that users report on, and an “offline” copy of the data that is used for ETL processes and other database maintenance activities.
There are 3 ways to create a second copy for online/offline purposes:
- Duplicating the entire system (hardware, storage, etc.)
- Creating a second copy of the database
- Duplicating only specific tables within the database
Each of these options has varying levels of cost, but may provide additional benefits. (For example, duplicating the entire system can also provides for disaster recovery if the second system is located in a different location).
Each option also adds varying levels of complexity to your ETL processes. With 2 copies of the data, ensuring that both copies have correct data may mean adding additional checks to your ETL processes.
You will have to determine how you will identify which copy of the data is online for users vs. offline for ETL and maintenance activities, as well as how you will “swap” which copy is in which role.
Making a database highly available means much more than simply ensuring that DB2 is running and that users can connect to it. As business analytics becomes more critical to your business’ bottom line, demands for system availability are constantly increasing.
When making your data warehouse or data marts highly available, make sure that you consider availability from your users’ point of view. Look for holes in your application design and database maintenance processes that can result in inconsistencies that may be perceived as outages by your users, and design a solution to prevent these.