The Book of Db2 Pacemaker – Chapter 2: Pacemaker Cluster … Assemble!

Posted By: Gerry Sommerville

First off, I strongly recommend readers review Chapter 1 of this series, The Book of Db2 Pacemaker – Chapter 1: Red pill or Blue pill? It kicks off the series by covering the many reasons and benefits of adopting Pacemaker in your environment, how to determine whether your current deployment is a good fit for the move, and more. In Chapter 2, I will shift into a more technical gear and walk through the nuts and bolts of assembling a common HADR cluster with Pacemaker.

The Goal
The goal is to walk through the setup of a Pacemaker cluster using the db2cm (Db2 Cluster Manager) utility, capable of performing an automatic takeover to the standby in the event of a failure. As the old saying goes, “a picture is worth a thousand words”, so the resulting cluster will look like the one below.

The “Chicken or Egg” dilemma
If you are new to Pacemaker, you might be wondering: “Do we set up the Db2 instance and HADR databases first, or do we set up Pacemaker first?” Well, like everything else in life, there is no “One Ring To Rule Them All”. The answer is “it depends!”. It depends on the actual high availability configuration you are trying to set up. As of this writing, Db2 supports only HADR and Mutual Failover configurations with Pacemaker. So the answer is very straightforward: set up your Db2 instance first, then set up the Pacemaker cluster. Going forward, when pureScale joins the Pacemaker world, the answer will differ.

Focusing back on the HADR cluster with Pacemaker: Pacemaker should be set up after the instances are created on both hosts, but it can be done either before or after any HADR-enabled databases exist. For the purpose of illustration, I won’t digress into the database-level setup of HADR; the focus here is on the cluster manager side. That said, let me first point out a few critical database configuration parameters that can impact the overall failover behaviour:

HADR_SYNCMODE and HADR_PEER_WINDOW

[db2inst1@pcmker-srv-1 ~]$ db2 get db cfg for HADRDB | grep HADR

       Database Configuration for Database HADRDB

 HADR database role                                                                              = PRIMARY

 HADR local host name                                   (HADR_LOCAL_HOST) = pcmker-srv-1

 HADR local service name                                (HADR_LOCAL_SVC) = 60000

 HADR remote host name                            (HADR_REMOTE_HOST) = pcmker-srv-2

 HADR remote service name                         (HADR_REMOTE_SVC) = 60000

 HADR instance name of remote server      (HADR_REMOTE_INST) = db2inst1

 HADR timeout value                                                (HADR_TIMEOUT) = 120

 HADR target list                                                 (HADR_TARGET_LIST) =

 HADR log write synchronization mode           (HADR_SYNCMODE) = NEARSYNC

 HADR spool log data limit (4KB)                    (HADR_SPOOL_LIMIT) = AUTOMATIC(25600)

 HADR log replay delay (seconds)              (HADR_REPLAY_DELAY) = 0

 HADR peer window duration (seconds)  (HADR_PEER_WINDOW) = 120

 HADR SSL certificate label                                   (HADR_SSL_LABEL) =

 HADR SSL Hostname Validation                (HADR_SSL_HOST_VAL) = OFF

The HADR_SYNCMODE must be either SYNC or NEARSYNC, as automation is not supported for the ASYNC or SUPERASYNC values.

The HADR_PEER_WINDOW should be set to a value that allows enough time (in seconds) for Pacemaker to detect a failure and issue the takeover command. 60 seconds is the minimum value that should be used, but a value of 120 seconds is suggested.
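
If either value needs to change, both parameters can be updated with a standard database configuration update as the instance owner on each database. The following is just an illustrative sketch using the values shown above; keep in mind that HADR configuration changes generally require HADR (or the database) to be restarted before they take effect.

[db2inst1@pcmker-srv-1 ~]$ db2 update db cfg for HADRDB using HADR_SYNCMODE NEARSYNC

[db2inst1@pcmker-srv-1 ~]$ db2 update db cfg for HADRDB using HADR_PEER_WINDOW 120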

With that out of the way, let’s dig into specific components in the above diagram.

1. Primary and Standby Hosts
The HADR cluster consists of hosts ‘pcmker-srv-1’ and ‘pcmker-srv-2’ in the primary and standby roles respectively. Both hosts are located on the same network subnet, which is a prerequisite for setting up the virtual IP. For best-in-class availability, the hosts should reside on different physical machines, with redundancy for network, storage, and power built in at their respective layers.

We will be automating takeover to standby on the ‘HADRDB’ database.

 

2. VIP Address
The IP address 10.21.98.34 will be used as the ‘floating’ VIP that is collocated with the primary role. The VIP allows clients to transparently reconnect (with no additional configuration) to the new primary after a failover. For on-premises clusters, both hosts must be on the same subnet for the VIP to function correctly. Clusters on AWS can use an Overlay IP instead, which allows a floating IP to exist across the different subnets of the availability zones.
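
If you ever need to check which host currently holds the VIP, a quick way (assuming the eth0 adapter used throughout this example) is to look at the interface directly. The host carrying the primary role should list 10.21.98.34 among its addresses once the VIP resource is created in Step 4 below.

[root@pcmker-srv-1 ~]# ip -br addr show eth0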

 

3. Qdevice Host
The host ‘qdevice-srv’ will be used as the Qdevice host, which provides additional protection against split-brain scenarios where communication is lost between the two cluster nodes. The Qdevice host needs only TCP/IP connectivity to the two cluster hosts; there are no further network requirements. For best-in-class availability, the Qdevice host should reside in a third location, not collocated with either cluster host.
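
A simple sanity check of that connectivity from each cluster host looks like the following. The port check on TCP 5403 is only meaningful after the Qdevice has been created in Step 6 (that is the QNetd port shown in the corosync-qdevice-tool output later in this post).

[root@pcmker-srv-1 ~]# ping -c 3 qdevice-srv

[root@pcmker-srv-1 ~]# nc -zv qdevice-srv 5403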

 

Prerequisites

HOSTNAME RESOLUTION

The configuration for hostname resolution should align with Db2 and system administration best practices. The /etc/hosts file on each cluster host (pcmker-srv-1 and pcmker-srv-2) should have the following form:

<IP Address of host A> <FQDN of host A> <Shortname of host A>

<IP Address of host B> <FQDN of host B> <Shortname of host B>
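
For example (the IP addresses and domain below are placeholders for illustration only):

10.21.98.21   pcmker-srv-1.example.com   pcmker-srv-1

10.21.98.22   pcmker-srv-2.example.com   pcmker-srv-2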


The /etc/nsswitch.conf file should also specify that the local hosts file is used before DNS. Using the local hosts file prevents DNS failures from impacting both Db2 and Pacemaker functionality.

hosts: files dns nisplus

 

Passwordless root ssh or db2locssh

In the Db2 11.5 release stream, Pacemaker must be configured on each node by root, and as such the db2cm utility must be able to ssh passwordlessly as the root user between the cluster hosts. (Spoiler alert: this will change in the next major release!) Alternatively, you can configure db2locssh, which uses passwordless ssh as a ‘db2sshid’ user instead, as documented in the link below.

Setting up db2locssh

Once db2locssh is configured, you can tell db2cm to use db2locssh for ssh via the -remote_cmd option, and the corresponding secure copy utility via the -remote_scp option.
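
For example, the create cluster command from Step 1 below would then be invoked along these lines (the db2locssh and db2scp paths are placeholders; use the locations from your own db2locssh setup):

[root@pcmker-srv-1 ~]# db2cm -create -cluster -domain hadomain -host pcmker-srv-1 -publicEthernet eth0 -host pcmker-srv-2 -publicEthernet eth0 -remote_cmd <path to db2locssh> -remote_scp <path to db2scp>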

Note that passwordless ssh must also be configured to the Qdevice host for db2cm to work. It is suggested to allow passwordless ssh for the root user while configuring the Qdevice, and to remove it afterwards.
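
If you do go the passwordless root ssh route, the setup is the usual key exchange between all three hosts. A minimal sketch, assuming root login over ssh is permitted on each host:

[root@pcmker-srv-1 ~]# ssh-keygen -t rsa

[root@pcmker-srv-1 ~]# ssh-copy-id root@pcmker-srv-2

[root@pcmker-srv-1 ~]# ssh-copy-id root@qdevice-srv

[root@pcmker-srv-1 ~]# ssh root@pcmker-srv-2 hostname

Repeat the same from pcmker-srv-2, and verify that each ssh test returns without prompting for a password.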

Pacemaker Cluster Setup and Configuration
A resource model defines resources, where they run (location constraints), which resources they run with (colocation constraints), and the order in which resources are started or stopped (order constraints). Together, the defined resources and constraints determine the actions that Pacemaker will take to recover the cluster state in various failure scenarios. A deep dive into the resource model will be the subject of the next blog post; for now, all we need are the commands to create the resource model for HADR.
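
Once the cluster has been built (Steps 1 through 6 below), you can inspect the resulting resources and constraints at any time with the crm shell; we will make sense of that output in the next chapter:

[root@pcmker-srv-1 ~]# crm configure show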

Previously, with TSA, clusters were created and managed via the db2haicu tool, which used an interactive menu on the command line. With Pacemaker, the new db2cm (Db2 Cluster Manager) utility is used to manage the cluster. It is command-line based, which makes it more flexible and simpler to automate.

Step 1: Create the cluster.

[root@pcmker-srv-1 ~]# db2cm -create -cluster -domain hadomain -host pcmker-srv-1 -publicEthernet eth0 -host pcmker-srv-2 -publicEthernet eth0

Created db2_pcmker-srv-1_eth0 resource.

Created db2_pcmker-srv-2_eth0 resource.

Cluster created successfully.

After HADR has been configured, it’s time to set up Pacemaker automation. The first command to run is the ‘create cluster’ command shown above. The command takes several parameters which will have long-term implications for the automation of the cluster.

Breaking the command down, there are two important parts.

First, -domain hadomain specifies that the cluster name will be “hadomain”. While this does not have any immediate impact on HADR automation within the cluster, it's important that the cluster name be unique between clusters that may share a common QDevice.

Second, each “-host <server> -publicEthernet <interface>” pair indicates a host to be added to the cluster, along with its associated “public” Ethernet adapter. The adapters specified should be the ones that clients use to connect to the cluster. In this case, both hosts use eth0 as the client-connectable adapter.
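
If you are unsure which adapter name to pass, list the interfaces and their addresses on each host first and pick the one carrying the client-facing address (eth0 in this example):

[root@pcmker-srv-1 ~]# ip -br addr show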

Step 2: Create the instance resources.

The instance resources allow Pacemaker to automate the Db2 instances. If the db2sysc process fails at any point, Pacemaker will use this resource to restart the instance and activate the databases.

[root@pcmker-srv-1 ~]# /home/db2inst1/sqllib/bin/db2cm -create -instance db2inst1 -host pcmker-srv-1

Created db2_pcmker-srv-1_db2inst1_0 resource.

Instance resource for db2inst1 on pcmker-srv-1 created successfully.

 

[root@pcmker-srv-1 ~]# /home/db2inst1/sqllib/bin/db2cm -create -instance db2inst1 -host pcmker-srv-2

Created db2_pcmker-srv-2_db2inst1_0 resource.

Instance resource for db2inst1 on pcmker-srv-2 created successfully.

Step 3: Create the database resource.

The database resource allows Pacemaker to ‘promote’ the standby by issuing the takeover command.

The -instance parameter in this command specifies the local instance that the database belongs to; the remote instance is determined by the database’s HADR_REMOTE_INST configuration parameter.

[root@pcmker-srv-1 ~]# /home/db2inst1/sqllib/bin/db2cm -create -db HADRDB -instance db2inst1

Database resource for HADRDB created successfully.
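
At this point it does not hurt to confirm that HADR itself is healthy from the Db2 side; as the instance owner, db2pd reports the role, state, and connection status for the database (you should see the primary in PEER state and connected to the standby):

[db2inst1@pcmker-srv-1 ~]$ db2pd -db HADRDB -hadr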

Step 4: Create the Virtual IP (VIP) address resource.

This step is optional, but it is commonly used and recommended because it makes failover transparent at the client connectivity level.

The primary VIP resource is a floating IP address collocated with the primary database role. If the primary database fails and Pacemaker issues a takeover to the standby, it will also move the VIP address to the new primary host. This allows clients to transparently connect to the new primary via the same address without additional client-side configuration.

[root@pcmker-srv-1 ~]# /home/db2inst1/sqllib/bin/db2cm -create -primaryVIP 10.21.98.34 -db HADRDB -instance db2inst1

Primary VIP resource created successfully.

If your HADR configuration enables the reads on standby feature, you can also use the same approach to set up a standby VIP and associate it with the standby database role. A takeover will then move the standby VIP together with the standby role. For the purpose of this illustration, a standby VIP is not shown.

Step 5: Review the resource model via crm status.

All resources should be in the ‘Started’ state, and the database should be in the ‘Master’ state on the primary host and the ‘Slave’ state on the standby. The VIP address will be running on the same host as the primary database.

[root@pcmker-srv-1 ~]# crm status

Cluster Summary:

  * Stack: corosync

  * Current DC: pcmker-srv-1 (version 2.1.2-4.db2pcmk.el8-ada5c3b36e2) - partition with quorum

  * Last updated: Sun Sep 10 05:54:37 2023

  * Last change:  Sat Sep  9 09:47:15 2023 by root via cibadmin on pcmker-srv-1

  * 2 nodes configured

  * 7 resource instances configured

Node List:

  * Online: [ pcmker-srv-1 pcmker-srv-2 ]

Full List of Resources:

  * db2_pcmker-srv-1_eth0             (ocf::heartbeat:db2ethmon):      Started pcmker-srv-1

  * db2_pcmker-srv-2_eth0            (ocf::heartbeat:db2ethmon):      Started pcmker-srv-2

  * db2_pcmker-srv-1_db2inst1_0  (ocf::heartbeat:db2inst):              Started pcmker-srv-1

  * db2_pcmker-srv-2_db2inst1_0 (ocf::heartbeat:db2inst):              Started pcmker-srv-2

  * Clone Set: db2_db2inst1_db2inst1_HADRDB-clone [db2_db2inst1_db2inst1_HADRDB] (promotable):

    * Masters: [ pcmker-srv-1 ]

    * Slaves: [ pcmker-srv-2 ]

  * db2_db2inst1_db2inst1_HADRDB-primary-VIP    (ocf::heartbeat:IPaddr2):        Started pcmker-srv-1

Step 6: Create the Qdevice

You can create the Qdevice at any point after the cluster has been created. The Qdevice will prevent split-brain scenarios from occurring by actively participating in the cluster quorum voting.

[root@pcmker-srv-1 ~]# /home/db2inst1/sqllib/bin/db2cm -create -qdevice rgshadr-srv-1

Successfully configured qdevice on nodes pcmker-srv-1 and pcmker-srv-2

Attempting to start qdevice on rgshadr-srv-1

Quorum device rgshadr-srv-1 added successfully.

Note that each cluster can only have a single Qdevice as it acts as the single source of truth regarding quorum. If you were able to configure multiple Qdevice hosts, you would need a Qdevice for your Qdevice hosts!

Validate that both hosts are successfully connected to the Qdevice using the corosync-qdevice-tool command. The output should show the state as “Connected” on the last line.

[root@pcmker-srv-1 ~]# corosync-qdevice-tool -s

Qdevice information

-------------------

Model:                  Net

Node ID:                1

Configured node list:

    0   Node ID = 1

    1   Node ID = 2

Membership node list:   1, 2

Qdevice-net information

----------------------

Cluster name:         hadomain

QNetd host:             rgshadr-srv-1:5403

Algorithm:                LMS

Tie-breaker:            Node with lowest node ID

State:                       Connected

Step 7: Validation

To perform a quick validation, the easiest approach is to induce a Db2 failure by killing the Db2 instance, or to induce a host failure. We will walk through the steps and the expected behaviour below for your reference.

Inducing a Db2 software failure

1. Find the PID for the instance user’s db2sysc process.

[root@pcmker-srv-1 ~]# ps -ef | grep db2inst1 |  grep db2sysc

root     2097883  140514  0 06:17 pts/0    00:00:00 grep --color=auto db2sysc

db2inst1 3490264 3490260  1 Aug29 ?        04:19:15 db2sysc 0

2. Kill the db2sysc process.

[root@pcmker-srv-1 ~]# kill -9 3490264

Pacemaker will detect the instance and database are no longer running and attempt to bring them back online.

This often results in a takeover to standby, but the behaviour depends on the timing with which Pacemaker detects the instance failure versus the database failure. On small systems, it is possible for Pacemaker to detect the instance failure and bring it back online (along with the primary database) before it even has a chance to detect the database failure.
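
If you want to see exactly what Pacemaker decided to do and when, you can watch its log while inducing the failure (this is the default log location on most distributions; adjust if yours differs):

[root@pcmker-srv-1 ~]# tail -f /var/log/pacemaker/pacemaker.log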

However, in this case we can see that Pacemaker issued a takeover to the standby. Note that the VIP also failed over to the new primary host.

[root@pcmker-srv-1 ~]# crm status

Cluster Summary:

  * Stack: corosync

  * Current DC: pcmker-srv-1 (version 2.1.2-4.db2pcmk.el8-ada5c3b36e2) - partition with quorum

  * Last updated: Sun Sep 10 06:19:28 2023

  * Last change:  Sun Sep 10 06:18:33 2023 by root via crm_attribute on pcmker-srv-1

  * 2 nodes configured

  * 7 resource instances configured

 

Node List:

  * Online: [ pcmker-srv-1 pcmker-srv-2 ]

 

Full List of Resources:

  * db2_pcmker-srv-1_eth0             (ocf::heartbeat:db2ethmon):      Started pcmker-srv-1

  * db2_pcmker-srv-2_eth0            (ocf::heartbeat:db2ethmon):      Started pcmker-srv-2

  * db2_pcmker-srv-1_db2inst1_0  (ocf::heartbeat:db2inst):              Started pcmker-srv-1

  * db2_pcmker-srv-2_db2inst1_0 (ocf::heartbeat:db2inst):              Started pcmker-srv-2

  * Clone Set: db2_db2inst1_db2inst1_HADRDB-clone [db2_db2inst1_db2inst1_HADRDB] (promotable):

    * Masters: [ pcmker-srv-2 ]

    * Slaves: [ pcmker-srv-1 ]

  * db2_db2inst1_db2inst1_HADRDB-primary-VIP    (ocf::heartbeat:IPaddr2):        Started pcmker-srv-2
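
As an additional check, the database configuration on the new primary host should now report the PRIMARY role, mirroring the output we looked at earlier in this post (you should see “HADR database role = PRIMARY”):

[db2inst1@pcmker-srv-2 ~]$ db2 get db cfg for HADRDB | grep "HADR database role"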

Inducing a host failure via `shutdown -h now`

To simulate a hardware or power failure on a host, run the `shutdown -h now` command. Note that it must be an immediate shutdown to simulate a power failure; a ‘graceful’ shutdown often results in a takeover to standby as Pacemaker stops its resources in an orderly fashion rather than reacting to a sudden failure.

[root@pcmker-srv-2 ~]# shutdown -h now

Connection to pcmker-srv-2 closed by remote host.

Connection to pcmker-srv-2 closed.

 

As a result, we can see the database and VIP failed back to pcmker-srv-1, while pcmker-srv-2 is offline.

 

[root@pcmker-srv-1 ~]# crm status

Cluster Summary:

  * Stack: corosync

  * Current DC: pcmker-srv-1 (version 2.1.2-4.db2pcmk.el8-ada5c3b36e2) - partition with quorum

  * Last updated: Sun Sep 10 06:27:44 2023

  * Last change:  Sun Sep 10 06:27:37 2023 by root via crm_attribute on pcmker-srv-1

  * 2 nodes configured

  * 7 resource instances configured

 

Node List:

  * Online: [ pcmker-srv-1 ]

  * OFFLINE: [ pcmker-srv-2 ]

 

Full List of Resources:

  * db2_pcmker-srv-1_eth0             (ocf::heartbeat:db2ethmon):      Started pcmker-srv-1

  * db2_pcmker-srv-2_eth0            (ocf::heartbeat:db2ethmon):      Stopped

  * db2_pcmker-srv-1_db2inst1_0  (ocf::heartbeat:db2inst):              Started pcmker-srv-1

  * db2_pcmker-srv-2_db2inst1_0 (ocf::heartbeat:db2inst):              Stopped

  * Clone Set: db2_db2inst1_db2inst1_HADRDB-clone [db2_db2inst1_db2inst1_HADRDB] (promotable):

    * Masters: [ pcmker-srv-1 ]

    * Stopped: [ pcmker-srv-2 ]

  * db2_db2inst1_db2inst1_HADRDB-primary-VIP    (ocf::heartbeat:IPaddr2):        Started pcmker-srv-1

Bringing pcmker-srv-2 back online, we see the instance come back up and the former primary database successfully reintegrate as the standby.

 

[root@pcmker-srv-1 ~]# crm status

Cluster Summary:

  * Stack: corosync

  * Current DC: pcmker-srv-1 (version 2.1.2-4.db2pcmk.el8-ada5c3b36e2) - partition with quorum

  * Last updated: Sun Sep 10 06:33:29 2023

  * Last change:  Sun Sep 10 06:33:00 2023 by root via crm_attribute on pcmker-srv-2

  * 2 nodes configured

  * 7 resource instances configured

 

Node List:

  * Online: [ pcmker-srv-1 pcmker-srv-2 ]

Full List of Resources:

  * db2_pcmker-srv-1_eth0       (ocf::heartbeat:db2ethmon):      Started pcmker-srv-1

  * db2_pcmker-srv-2_eth0       (ocf::heartbeat:db2ethmon):      Started pcmker-srv-2

  * db2_pcmker-srv-1_db2inst1_0 (ocf::heartbeat:db2inst):        Started pcmker-srv-1

  * db2_pcmker-srv-2_db2inst1_0 (ocf::heartbeat:db2inst):        Started pcmker-srv-2

  * Clone Set: db2_db2inst1_db2inst1_HADRDB-clone [db2_db2inst1_db2inst1_HADRDB] (promotable):

    * Masters: [ pcmker-srv-1 ]

    * Slaves: [ pcmker-srv-2 ]

  * db2_db2inst1_db2inst1_HADRDB-primary-VIP    (ocf::heartbeat:IPaddr2):        Started pcmker-srv-1

And that’s a wrap! It takes as few as five db2cm commands to have your HADR Pacemaker cluster set up. If you want more details, refer to Creating an HADR Db2 instance on a Pacemaker-managed Linux cluster in the IBM documentation. I have also assembled the following FAQ section covering common questions raised during setup; hopefully you will find it useful. If you are wondering what the next chapter is about, allow me to provide a teaser for Chapter 3: “Pacemaker Resource Model: Into the Pacemaker-Verse”!

Frequently Asked Questions

Can a single QDevice be used for multiple clusters?

Absolutely! Not only can you use (or reuse) the same QDevice host for multiple clusters, it is also not restricted to any particular type of Db2 high availability configuration. In other words, you can potentially have a mix of one or more HADR clusters, one or more Mutual Failover clusters, and in the future, one or more DPF clusters, all using the same QDevice host as the third host for tiebreaking decisions. After all, the ability to share the same Qdevice host among your development and test clusters across different distros, OS levels, and hardware architectures is one of its biggest benefits and use cases. That said, Db2 always recommends a dedicated Qdevice host for each of your production clusters. Ultimately, the decision is yours alone, based on the cost versus availability requirements mandated by your organization.

 

How many HADR databases and Db2 instances can be automated within a single Pacemaker domain?

There is no hard limit to the number of resources that can be configured and automated by Pacemaker; however, you must consider the amount of CPU available for the Db2 workload and for Pacemaker automation together. Pacemaker resources are scripts configured to perform certain actions within specific timeouts. It is imperative that the cluster hosts have enough CPU to run the Db2 workload (including ‘heavy’ operations like backup) and still be responsive enough for Pacemaker automation to complete within the configured timeouts.

 

Can I make custom configuration changes to the Db2 Pacemaker resource model?

The frequency with which this question is raised, compared to the TSA era, probably speaks volumes about the simplicity of the Pacemaker setup, which, once again, is a great testament to Db2 making the right technology choice 😊. The short answer is “No, but maybe … yes?”. I have recently responded with the following analogy. Imagine you are the owner of a car dealership, and a customer who recently purchased a pricey high-end SUV comes back asking: “I really like the car. I made some modifications to add additional cameras on all sides for better monitoring; will you continue to support my SUV the same way you would have before the modifications?” I think the common answer is pretty straightforward. Going back to the context of this question, the default answer is also very straightforward. Db2 has no control over what might be altered, directly or indirectly, in a way that negatively impacts runtime behaviour. It can be as big as an overall timeout change that impacts every resource, or as subtle as a resource name collision with Db2 that impacts Db2-specific logic within the engine. Of course, we do recognize that some customers would like to leverage existing hosts for other applications to be monitored and acted on by Pacemaker. As a result, we are working on a staged approach to best accommodate this in a timely manner. The first stage will likely publish a list of configuration parameters that cannot be modified, along with other restrictions that must be adhered to if such a shared resource model is to be used. The second stage will likely add automation to perform these validations. Stay tuned!

 

Gerry Sommerville joined IBM in 2015, and has been an active developer on the Db2 pureScale and High Availability team since 2017. In 2019, he joined a highly focused team investigating a variety of cluster manager technologies, which went on to build the early prototype for Automated HADR with Pacemaker. In 2020 he took on the role of technical lead for integrating Pacemaker into the Db2 engine and continues today to work with the team in order to improve the solution.