High Availability and Outage Avoidance

By John Maenpaa posted Aug 06, 2021 09:00 AM

Businesses and their customers rely on applications being consistently available along with the databases that store their information. Database outages can cause enormous business problems. If there is a failure that requires us to recover a very large database, we are looking at a near disaster for the business. Db2 and z/OS work together with hardware features to minimize the need to recover our data while keeping our applications available.

It may sound silly, but the key to high availability is to avoid outages. Avoiding those outages requires a certain amount of redundancy. Each hardware component that may fail (and that's all of them) needs to have at least one backup that can take over immediately. In some cases, you can use the redundant component together with the primary component to improve performance, as long as you remain aware that performance will be degraded during a failure. But then, being up and slow is usually better than being down.

Hardware Failures

Let's look at hardware failures first. With z/Architecture, z/OS, and modern storage systems we have redundant components keeping our data intact. The storage systems maintain logical disk-level redundancy, and we use multiple connections to ensure access.

| Failure Type                | Frequency | Impact                     | Recovery Type                  | Data Recovery Needed |
|-----------------------------|-----------|----------------------------|--------------------------------|----------------------|
| Single physical disk        | High      | Internal to storage system | RAID6 array new/spare disk     | No                   |
| Multiple physical disk      | Medium    | Storage system loss        | HyperSwap                      | No                   |
| Storage system server       | Low       | Degraded performance       | HyperSwap                      | No                   |
| Storage system failure      | Very Low  | Storage system loss        | HyperSwap                      | No                   |
| Fiber disconnect            | Very Low  | Reduced performance        | Redundant connections          | No                   |
| Mainframe processor failure | Medium    | Internal to mainframe      | Spare processor activation     | No                   |
| Single mainframe outage     | Very Low  | All LPARs down             | Failover to surviving systems  | No                   |
| Multiple mainframe outage   | Very Low  | Entire data center         | Failover to backup data center | Possibly             |

With two storage systems in a Metro Mirror configuration, we have synchronous data propagation from the primary storage system to the secondary. If the primary storage system fails, we can take advantage of HyperSwap to immediately make use of the secondary (making it the new primary).
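The behavior described above can be sketched in a few lines. This is a minimal illustration only, with invented class and method names; real Metro Mirror and HyperSwap operate inside the storage subsystem and z/OS I/O layer, not in application code.

```python
# Illustrative sketch of synchronous (Metro Mirror-style) replication
# with a HyperSwap-style role swap. All names here are hypothetical.

class ReplicatedVolume:
    def __init__(self):
        self.primary = {}    # block -> data on the primary storage system
        self.secondary = {}  # block -> data on the secondary storage system

    def write(self, block, data):
        # Synchronous propagation: the write completes only after BOTH
        # copies are updated, so the secondary is always current.
        self.primary[block] = data
        self.secondary[block] = data

    def hyperswap(self):
        # On primary failure, swap roles: the secondary becomes the new
        # primary with no data loss and no recovery step needed.
        self.primary, self.secondary = self.secondary, self.primary


vol = ReplicatedVolume()
vol.write(1, "payroll-record")
vol.hyperswap()              # primary "fails"; secondary takes over
print(vol.primary[1])        # the data is intact on the new primary
```

The point of the sketch is the ordering guarantee: because every write lands on both copies before it is acknowledged, the swap needs no data recovery at all.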

To handle mainframe processor failure, we have parallel sysplex and Db2 data sharing to allow failover to already running systems on a secondary box. Configuring this takes a bit of time and engineering, but the benefits are difficult to get in any other configuration.

System Software Failures

In the Linux and Windows environments, system software failures are common and often cause severe outages. Thousands of programmers are out there finding and fixing bugs, but there are always more to be found. Operating systems that are intentionally engineered to work in a cluster are more likely to remain available if there is a problem on only a single system. The ability to shift work away from a failing system will usually require some form of automation that can take into account the needs of our applications.

Application Software Failures

Application bugs more frequently cause partial outages where a transaction will abend or a batch job may need to be re-run. Sometimes, a newly implemented version of a program will update the database incorrectly. Similar situations include cases where jobs are run out of sequence. The main issue here is that the data within the database has been modified improperly and the data is no longer completely valid. We know the data is incorrect and need to revert to a point prior to the execution of the programs that made the changes.

| Failure Type            | Frequency | Impact                   | Recovery Type          | Data Recovery Needed |
|-------------------------|-----------|--------------------------|------------------------|----------------------|
| Application logic error | Medium    | One or more data objects | Analyze logs and undo  | Partial              |
| Application major error | Medium    | One or more data objects | Point in time recovery | Yes                  |
| Major batch error       | Medium    | One or more data objects | FlashCopy recovery     | Yes                  |

When software failures occur, the data may contain logical content errors as a result. The failure could have been incorrect code, a memory overlay due to a bug, or some other processing error. Regardless, our best bet (from an availability viewpoint) is to use a log analyzer tool to identify the bad updates and undo them while applications continue to run against the good portions of the data. In severe cases, it may be necessary to keep portions of the applications unavailable while the data is being corrected. In very large, high-volume situations, we have been more likely to write a special data-fix program when the data corruption is minor.
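The undo approach can be sketched as follows. This is a toy model with an invented log-record layout; real log-analyzer products read before-images from the Db2 log itself.

```python
# Hedged sketch of log-based undo: reverse only the updates made by a
# misbehaving program, leaving good updates in place.

database = {"acct1": 100, "acct2": 200, "acct3": 300}

# Each log record: (program, key, before_image, after_image).
log = [
    ("GOODPGM", "acct1", 100, 150),
    ("BADPGM",  "acct2", 200, 0),     # erroneous update
    ("BADPGM",  "acct3", 300, -50),   # erroneous update
]

# Normal processing applies the logged changes.
for _, key, _, after in log:
    database[key] = after

# Undo pass: walk the log backwards and restore the before-image for
# records written by the bad program only; good updates survive.
for program, key, before, _ in reversed(log):
    if program == "BADPGM":
        database[key] = before

print(database)   # acct2 and acct3 restored; acct1 keeps its good update
```

Because only the bad program's changes are backed out, the rest of the database stays available and current the whole time.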

If the majority of database updates are done in batch, using FlashCopy to recover to the beginning of the batch cycle may be possible. This requires more storage resources, setup, and testing, but it can maximize availability.
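The pattern amounts to "snapshot before the batch, restore on failure," which can be sketched like this. The names are illustrative; real FlashCopy captures volumes inside the storage system, not application objects.

```python
# Sketch of FlashCopy-style point-in-time recovery for a batch cycle.
import copy

data = {"balance": 1000}

snapshot = copy.deepcopy(data)      # "FlashCopy" taken at start of batch

try:
    data["balance"] -= 5000         # batch runs and corrupts the data...
    raise RuntimeError("batch job abended mid-cycle")
except RuntimeError:
    data = copy.deepcopy(snapshot)  # recover to the beginning of the cycle

print(data["balance"])   # back to the pre-batch value: 1000
```

The trade-off stated above shows up even in the sketch: the snapshot doubles the storage held during the batch window, in exchange for a near-instant restore point.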

Network Failures

Modern applications rely heavily on our networks. There are many things that can go wrong with network services. Designing the physical network to support redundant connections can avoid some issues. Using multiple network cards in each system can improve availability in other scenarios. Maintaining redundant critical network services (e.g., DNS) can mean the difference between a minor slow-down and a total outage.

From an application/database perspective, we want to be able to route our work between systems/servers. This way, if our primary server cannot be reached, we can try a secondary server. With a Db2 data sharing group spread across multiple machines in a parallel sysplex, we can set it up so that we have multiple network segments supporting our databases.
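Client-side failover between members of the group can be sketched as below. The server names and the `connect` behavior are hypothetical stand-ins for connecting to a Db2 data sharing group over multiple network segments.

```python
# Sketch of client-side failover: try the primary server first, then
# fall back to a secondary if the primary cannot be reached.

def connect(server, up_servers):
    # Stand-in for a real network connection attempt.
    if server not in up_servers:
        raise ConnectionError(f"{server} unreachable")
    return f"connected to {server}"

def connect_with_failover(servers, up_servers):
    # Try each member of the data sharing group in order.
    for server in servers:
        try:
            return connect(server, up_servers)
        except ConnectionError:
            continue
    raise ConnectionError("no server reachable")

# The primary (db2a) is down; the secondary (db2b) accepts the work.
print(connect_with_failover(["db2a", "db2b"], up_servers={"db2b"}))
```

In practice this routing is usually handled by the driver or sysplex distributor rather than hand-written application code, but the decision logic is the same: exhaust the member list before reporting an outage.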

Conclusion

Redundancy is key to high availability. We need to engineer our infrastructure and applications to take advantage of redundancy while reducing any single points of failure. The ability to shift from a primary component to a secondary component automatically is necessary to take advantage of the redundancy.

With controlled failover, we can avoid outages. Avoiding outages is better than recovering from them, and very much better than explaining them.