The Book of Pacemaker - Chapter 4: Quorumania

Posted By: Hao Qi

If you landed here, chances are you've read through the previous blogs in the Db2 Pacemaker series, and perhaps you've already gone and deployed a Db2 HADR or Mutual Failover cluster. “What's next”, you may be wondering. Is the cluster ready for testing or production?

Well ... first of all, congratulations on getting this far! However, you need to make sure the cluster does not pull an Avengers-style Civil War where both sides try to prove each other wrong. That leads us to: Quorum considerations!

What do you mean by Quorum?

Through the life of a distributed cluster, things are bound to go wrong somewhere. Whether that's caused by software or hardware failures, the goal of a highly available system is to tolerate those faults, thereby allowing the cluster to remain operational with the highest level of availability.

Fundamentally speaking, "Quorum" allows the cluster to make decisions on a unified front. Take a network communication split, for example: what happens when network communication breaks down between cluster nodes and an agreement cannot be reached? Who stays and who gets kicked out? Therein lies the rub...

Why should you be concerned about Quorum?

Again, using the network split-brain scenario as an illustration: say we have a distributed system consisting of two cluster nodes performing work, and suddenly the network breaks down between the nodes. This is a split-brain scenario by definition. If both nodes are allowed to continue to perform I/O on the same set of data (a.k.a. the “Double Primary” phenomenon), it may lead to data corruption - a nightmare that must be avoided at all costs. Configuring a suitable quorum device in the cluster allows the cluster manager to make the hard decision to cut its losses on the failed resources, thereby ensuring data consistency and integrity. Though in certain cases, fencing is required alongside the quorum device configuration.

What is Fencing?

"Fencing" refers to a mechanism that can be applied in conjunction with quorum to isolate a problematic node and prevent further I/O, thereby guaranteeing data integrity. It can take the form of hardware or software fencing devices that act through various means such as power loss, shutdown, or even I/O blocking. If a node loses quorum, the fencing device performs its configured function and prevents further damage to the cluster or application.

How does Db2 HA implement Quorum and Fencing?

Read on, my friends: in the subsequent sections, I will describe what we had before and what's in the works.


The Old vs the New


Quorum types supported by TSA/RSCT

In the old days, when TSA/RSCT was the integrated cluster manager for the Db2 High Availability features, there were three supported quorum types, each with its own pros and cons:

  • Network IP Tiebreaker quorum

    The Network IP Tiebreaker wins with its simplicity; the only requirement is a pingable IP address for it to be configured. That said, it is also the most unreliable quorum mechanism: a split-brain scenario is still possible in certain edge cases, making it the least desirable quorum device overall. Deployment with this in production should be avoided - it is NOT a Db2 best practice for production systems.

  • Majority Node Set quorum

    MNS quorum is arguably the most reliable quorum device: a tie is not possible because it requires an odd number of cluster nodes. Of course, the requirement of an extra node is also its main drawback – it increases the total cost of ownership. This has been supported for a long time and has been our recommendation for any 2-node HA setup in production.

  • Disk Tiebreaker quorum

    Disk Tiebreaker quorum support was added a few years ago. It is a reliable tiebreaking mechanism based on disk reservation. In a nutshell, every cluster host should have access to this special disk. The disk is only used in a split-brain scenario, where each host attempts to reserve it. The host that can successfully reserve the disk in exclusive mode remains online, while the other host gets evicted. The main drawback of this mechanism is that it requires a special type of disk that supports the reservation scheme and must also be accessible by all hosts in the cluster - not exactly a simple setup.

Quorum types supported by Pacemaker/Corosync

Now, how about our new cluster manager of choice? What does Pacemaker/Corosync support in terms of quorum devices?

  • Two-node quorum

    The Two-node quorum device is the default quorum mechanism for Corosync. However, it is not meant to be used in a production environment, since there is no tie-breaking support in a split-brain scenario.

  • Majority quorum

    Similar to the TSA/RSCT MNS quorum, Corosync supports the Majority quorum type with an odd number of cluster nodes.

  • QDevice quorum

    Lastly, QDevice quorum is arguably the most reliable quorum device supported by the Pacemaker/Corosync stack. It has low requirements, supports multiple clusters, and is highly reliable. To name a few specific benefits:
    • Support for different combinations of distributions and architectures.
    • Does not need to be integrated in the cluster.
    • One QDevice host can support multiple clusters concurrently.
    • Only needs TCP/IP access.
    • Reliable tie-breaking mechanism through voting.

That's neat! But... how is quorum decided if both nodes can reach the QDevice at the same time? Which node stays up then? The answer lies within the node ID designated by Corosync at the time of domain creation (not to be confused with the Db2 member ID) - the node with the lowest node ID that can reach the QDevice gets to stay alive!

If you want to be 100% sure and check node IDs in your cluster, this is the command to do so:

$ crm_node -l
1 devlnxps01 member
2 devlnxps02 lost

In this case, in a split scenario where both nodes can access the QDevice, the one with the lowest node ID - devlnxps01 - gets to survive.

Db2's quorum type of choice

From the above three types, the choice of the officially supported quorum device for our integrated HA implementations including HADR and Mutual Failover is obvious: the QDevice Quorum, given its reliability, flexibility, and simplicity in configuration.

This diagram depicts 2 clusters configured with QDevice quorum, sharing the same QDevice host:

[Diagram: two 2-node clusters, each connecting over TCP/IP to the same QDevice host]

Again, to reiterate, the main reasons QDevice quorum was chosen for Db2, and its key benefits:

  • Flexible in platform and architecture
  • Small footprint requirements (CPU, disk, and memory)
  • Does not require a complicated setup; only needs the corosync-qnetd package
  • Does not need to be a part of the cluster
  • Only needs a TCP/IP connection from the cluster nodes
  • Can be shared by multiple clusters

Db2 HA implementations' quorum configurations


With the above information in mind, let's dive into a bit more detail on how quorum is configured for the Db2 HA implementations.

Configuring Quorum for Db2 HADR and Mutual Failover deployments

First, choose a host to be your QDevice host. Remember that it does not need to be a part of the cluster; TCP/IP network access to the actual cluster nodes is sufficient.

Package installation

Depending on your distribution of choice, install the bundled corosync-qdevice packages using dnf or zypper for RHEL and SLES distributions, respectively. The packages themselves are bundled within the Db2 install image media, located at:

<Db2_image>/db2/<platform>/pcmk/Linux/<OS_distribution>/<architecture>/corosync-qdevice*

Once the install is completed, verify using the rpm -qa | grep corosync-qdevice command.

Note: Make sure to only use the Db2 supplied packages.
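As a concrete illustration, on a RHEL host the install and verification could look like the following sketch (the path simply mirrors the placeholder layout above, so substitute your actual image location, distribution, and architecture; on SLES, use zypper instead of dnf):

# Install the Db2-bundled QDevice packages from the install image (RHEL)
dnf install <Db2_image>/db2/<platform>/pcmk/Linux/<OS_distribution>/<architecture>/corosync-qdevice*

# Confirm the packages are now present
rpm -qa | grep corosync-qdevice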

Configure QDevice quorum

Incorporating the QDevice into the Pacemaker/Corosync cluster is quite a straightforward procedure. Once the host has the required packages installed, simply configure it using the db2cm utility on one of the cluster nodes (not the QDevice host):

<Db2_install_path>/bin/db2cm -create -qdevice <hostname>

Upon the successful completion of the db2cm command, use the corosync-qnetd-tool -l command on the QDevice host to make sure the service is running correctly, and the corosync-qdevice-tool -s command on the cluster nodes to verify the configuration.

For example, to check the QDevice configuration on a cluster node:

[root@cuisses1 ~]# corosync-qdevice-tool -s 
Qdevice information 
-------------------
Model: Net
Node ID: 1
Configured node list:
0 Node ID = 1
1 Node ID = 2
Membership node list: 1, 2
Qdevice-net information
----------------------
Cluster name: hadom
QNetd host: frizzly1:5403
Algorithm: LMS
Tie-breaker: Node with lowest node ID
State: Connected

Note the QDevice information listed in the "Qdevice-net information" section.

Similarly, this shows the status of the QNetd service on the QDevice host:

[root@frizzly1 ~]# corosync-qnetd-tool -l 
Cluster "hadom":
Algorithm: LMS
Tie-breaker: Node with lowest node ID
Node ID 2:
Client address: ::ffff:9.21.110.42:55568
Configured node list: 1, 2
Membership node list: 1, 2
Vote: ACK (ACK)
Node ID 1:
Client address: ::ffff:9.21.110.22:51400
Configured node list: 1, 2
Membership node list: 1, 2
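For the curious, the values reported by these tools come from the quorum stanza that db2cm writes into /etc/corosync/corosync.conf on each cluster node. Here is a minimal sketch of what such a stanza can look like, with the field values mirrored from the sample output above (the file db2cm actually generates may contain additional settings, so treat this as illustrative only):

quorum {
    # Corosync vote quorum provider
    provider: corosync_votequorum
    device {
        # The QDevice contributes one extra vote to the cluster
        votes: 1
        model: net
        net {
            # QNetd host and port, as reported in "QNetd host: frizzly1:5403" above
            host: frizzly1
            port: 5403
            # Matches "Algorithm: LMS" in the tool output
            algorithm: lms
            # Matches "Tie-breaker: Node with lowest node ID" in the tool output
            tie_breaker: lowest
        }
    }
}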

Mutual Failover specific fencing considerations

Db2 HADR and Mutual Failover are both 2-node HA implementations; however, they are quite different: HADR utilizes log shipping, while Mutual Failover utilizes filesystem mounts. Therefore, fencing must be set up for Mutual Failover to avoid data corruption in a split-brain scenario.

Mutual Failover utilizes SBD, short for STONITH Block Device, to perform fencing operations when a node loses quorum. In essence, if communication breaks down and a split-brain occurs, QDevice quorum plus diskless SBD will instruct the node without quorum to self-fence through a preconfigured software watchdog (/dev/watchdog).

The good news is: there is nothing further you need to do! The db2cm utility is smart enough to detect the instance type and automatically configure diskless SBD fencing during the QDevice creation step above, as well as the software watchdog if it is not already set up by the Db2 installer. If a split-brain incident were to occur, the node without quorum will be rebooted by the software watchdog automatically.
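If you would like to double-check the pieces yourself, a couple of generic Linux and Pacemaker commands (standard tooling rather than anything Db2-specific, and shown here only as a suggestion) can confirm the watchdog device and the cluster property that diskless SBD relies on:

# Confirm the software watchdog device exists on each cluster node
ls -l /dev/watchdog

# Query the Pacemaker cluster property that enables watchdog-based self-fencing
crm_attribute -t crm_config -n stonith-watchdog-timeout -G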

Unconfigure QDevice quorum and clean up the QDevice host

If, for any reason, you need to unconfigure the QDevice quorum in the cluster, simply run:

sqllib/bin/db2cm -delete -qdevice

This command will unconfigure the QDevice quorum in the cluster upon successful completion.

However, this does not mean the QDevice host is cleaned up. If you are certain the QDevice host is no longer required, then issue these commands on the QDevice host to clean it up:

systemctl stop corosync-qnetd
rm -rf /etc/corosync/qnetd/nssdb

Failure Scenarios

Let's turn our focus to how the cluster behaves in various failure scenarios once QDevice quorum is configured. You are welcome to experiment and see if these match your observations.

If there is a split brain

The QDevice picks the node with the lowest node ID to survive, and the other node loses quorum, with its instance taken down and fenced accordingly.

If there is no split brain, but one or both nodes lose access to the QDevice host

Both nodes remain in quorum, and the instance(s) remain operational as normal.

If there is a split brain, and one of the nodes loses access to the QDevice host

It's quite intuitive: the node that still has access to the QDevice host remains in quorum, and the other node loses quorum, with its instance taken down and fenced accordingly.

If there is a split brain, and all access to the QDevice host is lost

Then the cluster loses quorum, and both hosts and their associated instances are taken down.
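When running through these experiments, two generic commands are handy for watching quorum from the cluster nodes (standard Corosync/Pacemaker tools, offered as a suggestion rather than a required step):

# Show vote counts and whether this node is currently quorate, including the QDevice vote
corosync-quorumtool -s

# One-shot snapshot of cluster membership and resource state
crm_mon -1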

Alternatives for deployments in the cloud


What about cloud configurations? Simply put, the answer is yes, you can still use the "vanilla" setup described above. However, there is an alternative fencing setup available that utilizes cloud-vendor-specific fencing agents. In that case, the cluster is set up with Two-node quorum along with the cloud vendor's fencing agent packages.

There are pros and cons to this approach. The pro is obvious: you no longer need a QDevice host to break ties. The con, however, needs to be taken into consideration when choosing the desired deployment configuration: the extra fencing step may prolong the recovery time from a host failure by up to 6 times, based on test results in a controlled environment.

Detailed setup steps

Here are some useful links to the IBM Documentation site regarding the cloud alternative configurations; for detailed steps and the latest updates, please refer to these links:

IBM Documentation

Amazon Web Services

Microsoft Azure

Google Cloud


Wrap-up


That's it for the high-level overview of everything quorum in a Db2 Pacemaker/Corosync cluster. I hope you find the information helpful. Please do leave a comment or reach out if you have any questions or concerns. Otherwise, please stay tuned for the next chapter: The Book of Db2 Pacemaker – Chapter 5: Pacemaker Configuration Parameters – Across the Db2-Verse.

Quick reference to this series:

Chapter 1 - Red pill or Blue pill ?
Chapter 2 - Pacemaker Cluster … Assemble!
Chapter 3 - Pacemaker Resource Model: Into the Pacemaker-Verse


Hao Qi joined IBM in 2013, working in the Db2 SVT team, and transitioned to High Availability development soon after. He specializes in various aspects of distributed systems, including cluster managers, shared storage, networking, and databases. Throughout the years, Hao has worked on many Db2 HA features, including Db2 HADR, DPF, Mutual Failover, and pureScale, and he served as the manager of the Db2 High Availability Cluster Services team. Currently he is a senior software developer on the Db2 HA team.