Db2 and CDC in HADR governed by TSA

By Eddy Coppens Unpublished

  

I was working at a customer who wanted to have a database replicating data bi-directional towards another database by means of InfoSphere Data Replication (IIDR aka CDC), but also wanted to have a failover mechanism in place in conjunction with the bought replication solution.

An ‘out of the box’ setup for a Db2 cluster would be defined as follows: a Primary Db2 database server put in an HADR setup. If you want to have fail-overs automated, TSA add on top of it to act as a guardian. In case the database needs to replicate / receive data to / from another databases, the replication software is installed and points to the Primary. As a consequence, when the Standby acts as a new Primary – due to whatever reason, e.g. failure, upgrade, network outage, etc. – data is no longer replicated.

A possible solution to have also CDC continuously running is thoroughly enough described in the past – even though not officially supported by IBM. If you want to get up to speed before diving in the setup I’m going to describe, you would benefit a lot from reading first the following documentation:

  • Developer works:
  • Resource 1: https://www.ibm.com/developerworks/community/files/form/anonymous/api/library/7b8aba61-512c-4040-8718-4aca3b3bd8a3/document/1f4ba58e-d9a3-4470-9a3f-7335f1831a69/media/IIDR_CDC_DB2_HADR_TSA-2016-Jan-11.pdf
  • Resource 2: https://www.ibm.com/developerworks/community/files/app?lang=en#/file/1e86e44c-fa06-415b-83b7-d0e8857d75d7
  • IDUG material:
  • Resource 3: Frederik Engelen “A DBA’s guide to using TSA”
    https://www.idug.org/p/cm/ld/fid=144&tid=101&fileid=2650

 

  1. Some terms explained before take off

 

If you are new to the subject, you should know upfront a few terms before continuing. I will describe the most important terms in just a few words, but will add where possible a URL referring to a page containing the complete and official description.

 

  • CDC – this is a tool designed to replicate data from one database to another, as far as I know it does that by replaying on the receiving database all what is found in the (active) LOG files. You might know this product also as “Change Data Capture” or “InfoSphere Data Replication” (abbreviated to IIDR).

https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.mcadminguide.doc/concepts/overview_of_cdc.html

  • CDC instance – is a daemon (dmts process) continuously running and containing a configuration towards a remote database amongst a few other settings. A CDC instance is not the same as a Db2 instance.
  • CDC subscription – a list of logically grouped tables sharing the same source and target CDC instance
  • dmts process – see [CDC instance]
  • Pointbase database – the CDC instance keeps all its information safe in a Pointbase database. When subscriptions are defined, the information of them are stored in the CDC instance operating as Source.
  • HADR – High Availibility and Disaster Recovery. By setting HADR up for a Db2 database, the content – from the Primary database – gets copied to a second node or more, also known as Standby database(s). When the Primary database has an issue, a manual take-over can be done by a Standby.

https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.admin.ha.doc/doc/c0011267.html

  • TSA (MP) – Tivoli System Automation for Multiplaforms. TSA has the capability to facilitate automated take-over for an HADR configured database.
  1. Starting point

 

The above-mentioned documentation describes a setup with a single Db2 instance containing multiple databases and whenever a fail-over is done, all databases switch from Primary to Standby. On fail-over the CDC instance will follow the databases and will guarantee the fact that replication will continue despite of the fact the databases are now running on the Standby.

The scope of the project changed and we wished to be able to fail-over each database separately. When such a fail-over occurred, only the related CDC instance should follow.

Just copying and pasting someone else’s work was not enough anymore and it was time to dig in for real. The architectural design I’m about to describe is not only mine, the ideas are probably far from new but it would be a shame not to share it and comment on those things I learned while setting it up and adding new features to the concept.

 eddy fig1.PNG

                 The setup as shown in Figure 1:

  • A Db2 instance containing at least one database. This database sends and/or receives replicated data by means of CDC
  • An HADR setup between a Primary and a Standby governed by a TSA doing automatic failover in case of need
  • CDC installed on a shared drive, so both Primary and Standby can use the Pointbase database

 

 

 

 

  1. The final goal

 eddy fig2.PNG

                Figure 2 describes the following:

  • Multiple instances just containing a single database sending / receiving data by means of CDC. The goal is to be able to fail-over a single database with its related CDC instance without interruption of any other database
  • An HADR setup between a Primary and a Standby governed by TSA doing automatic failover in case of need for multiple instances separately
  • CDC installed on a shared drive, so both Primary as Standby can use the same Pointbase database

 

  1. What is TSA expected to do?

 

In short what TSA does, boils down to the following:

  • Start a resource set when it is down and should be online
  • Stop a resource set when it has to be offline
  • Monitor the status of the Resource Set and try to act on this status, e.g. fail-over between Primary and Standby

Each of the actions above can be handled by scripts, e.g. START.sh, STOP.sh and MONITOR.sh and when the Resource Set is created within TSA, each of the three actions gets assigned one of these scripts. The original scripts (see "Resource 1") held the parameters which identified the hard-coded CDC Resource Set. That wouldn’t work if we were to re-use the same script for databases within multiple instances.

 

  1. Emulating TSA before setting it up

 

A topic I struggled with the most, was understanding what TSA performs under the covers as it seemed that TSA made decisions on its own and I had little control on it. I found some useful ideas in the scripts in /use/sbin/rsct/sapolicies/db2 and started adding more logging into the START/MONITOR/STOP scripts.

 

Note: TSA is set up as a root process but it runs every script as a particular user (su -). Testing the scripts can thus be done by logging on as this user and executing the scripts manually. By checking whether processes are started / stopped, what the return code of the script is, and looking at the content of your logging scripts, you get a good idea what is happening and what should be the result within TSA as it executes the scripts on its own. TSA does no decent logging on its own!

Hint 1: I wanted to test the changes within the original CDC-scripts without setting up a cluster and installing the CDC software on a shared drive. To emulate a continuously running daemon process I choose to replace the dmts process by a Unix script doing basically nothing but sleeping. I adapted the TSA scripts as such so they would use this dmts.sh script instead of the dmts binary.

 

Hint 2: I kept the logging within the START/MONITOR/STOP scripts, but reduced it to a strict minimal … which is still a big deal more than what TSA produces. I want it to be logged which Resource Set is doing what on which server. This is particularly a good idea to incorporate within the monitor script.

 

Some snippets of logging corporated within the START script:

scriptID="${HOSTNAME} : {${CDCUSER},${CDCDATABASE}} : ${script}"

DATE=$( date '+%Y-%m-%d %H:%M:%S' )

echo "${DATE} ${scriptID} : Start script is started" >> ${LOGNAME}

DATE=$( date '+%Y-%m-%d %H:%M:%S' )

echo "${DATE} ${scriptID} : Start script Clear CDC staging store" >> ${LOGNAME}

DATE=$( date '+%Y-%m-%d %H:%M:%S' )

echo "${DATE} ${scriptID} : Start script start subscriptions" >> ${LOGNAME}

 

 


 

Examples of used scripts / script calls:

> cat dmts.sh

#!/bin/sh

#

# Do a whole lot of nothing but wasting CPU and sleeping

#

 

while [ true ]

do

  maximumNumber=1000

  currentNumber=0

 

  while [ ${currentNumber} -le ${maximumNumber} ]

  do

    sleep ${maximumNumber}

    currentNumber=$(( currentNumber + 1 ))

  done

done

                > /work/CDCTSA/CDCstart.sh "

  • should start CDC instance related dmts process

(verify with ps -ef | grep dmts | grep "

  • should return with a status 0 (OK!)

                > /work/CDCTSA/CDCmonitor.sh "

  • should return with a status 1 (OK!)
  • when it returns with a status 2, something went wrong

                > /work/CDCTSA/CDCstop.sh "

  • should stop the CDC instance related dmts process

(verify with ps -ef | grep dmts | grep "

  • should return with a status 0 (OK!)

 

 

6. The preparation continues: create a CDC Resource Set with CDCsetup.sh

 eddy fig3.PNG

The possibility to add parameters when creating a Resource Set, opens up the opportunity to add e.g. instance name, database name, and CDC instance. The newly build scripts are made to capture in such a way they capture these parameters and are already thoroughly tested (previous paragraph).

eddy fig4.PNG

The hardest parts are done once the scripts for START/MONITOR/STOP and SETUP (make TSA aware of what to monitor and what to do) are figured out. Looking at the original SETUP script, you’ll notice the link between the recently created CDC Resource Set and the related database is not yet established:

  • Create a CDC Resource Group with an indicative name:

resourceGroup="CDC_

mkrg ${resourceGroup}

  • Add the CDC Resource Set to the CDC Resource Group:

resourceGroup="CDC_

resourceSet="CDC_

addrgmbr -g ${resourceGroup} IBM.Application:${resourceSet}

  • Make sure the just create CDC Resource Group is OFFLINE!
  • Create the relation between the CDC Resource Set and the Database Resource Set:

resourceGroup="CDC_

resourceSet="CDC_

dbResourceSet="db_

dependencyName="${resourceSet}_DependsOn_${dbResourceSet}"

mkrel -p DependsOn -S IBM.Application:${resourceSet} -G \

  IBM.Application:${dbResourceSet} ${dependencyName}

  • Bring the Resource Group online:

resourceGroup="CDC_

chrg -o Online ${resourceGroup}

 

I repeated the filling of the variables per bullet point to maintain readability. You’ll probably keep one single set of value assignments to variables in your SETUP script. I came up with my own convention on how to name the Resource Group as Sets; make sure those names help you to quickly identify the resources.

Do not lose sight of the CDCremove script – this script is the opposite of the CDCsetup.sh script and removes the Resource Group and Set from the TSA conguration. When removing a Resource Group for one single instance, you should not remove by default the CDC related stuff all at once of the other CDC definitions. Removing the CDC fail-over mechanism for one database does not imply you need to remove those of others as well.

 

  1. Putting the puzzle together: setting up TSA to handle multiple CDC replicated databases in different instances on a same server

 

Note: You’ve got the choice to continue using the placebo dmts daemon process in a clustered environment if you still do not feel confident enough to already start installing and setting up CDC, at least that is what I did. As soon I was able to separately fail-over my own developed dmts daemon, I repeated all my steps, but this time with the CDC daemons.

 

Once the SETUP/START/MONITOR/STOP/REMOVE scripts are mature enough to put them into action, make sure:

  • Have your DB2 cluster set up as described in paragraph 2
  • Replace the placebo dmts with the correct one
  • Adjust the additional parameters if necessary
  • Copy the START/MONITOR/STOP scripts to the /usr/rsct/sapolicies/db2 directory – on BOTH the Primary as the Standby – and align the rights on the files to those scripts already available in the directory
  • If you keep the logging (= writing to disk, because to screen does not make sense) enabled, make sure your CDC instance user can write to it
  • Create the TSA resource set by executing the adapted CDCsetup.sh script

Reminder: When TSA performs a fail-over for a Resource Set, it fails-over all related Resources. When an instance is due for fail-over, all databases within that instance (Resource Group) are taken into account as well as the associated CDC Resource Group. To enable the ability to fail-over a single CDC replicated database – via TSA! – use different instances.

 

  1. Bumps on the road?
    • As you might imagine there were many, but one in particular I wish to share with you: our first scripts were kind of brutal when it came to stopping CDC resources, we “kill”ed them even when there are other possibilities to gracefully stop the subscriptions and engine. We changed this part into steps that each become less graceful the more time it takes to stop a CDC resource:First, we attempt all the different options of ending the replication (dmendreplication)
    • Secondly, we perform a shutdown of the CDC instance and wait a bit to give the applications the time to gracefully stop (dmshutdown). Dmshutdown has some levels of its own, and we try them one by one:
      • -c: Stop normal
      • -i: Stop immediate
      • -a: Abort all
    • Finally, if applications related to the CDC instance are still running, perform a kill of the concerned processes (kill)

 

Over time I also came to realize that big chunks of code are shared by the START/MONITOR/STOP scripts. Since TSA is capable of running the scripts with parameters, I think it could be beneficial to combine the three scripts into a single one. Using the “mkrsrc” command would then be changed into something like:

    mkrsrc IBM.Application \

      Name="${resourceSet}" \

      ResourceType=1 \

      StartCommand="/usr/sbin/rsct/sapolicies/db2/CDC_TSA.sh "START"  "${lAdminUser}" "${lDb2Database}" "${lInstance}"" \

      StopCommand="/usr/sbin/rsct/sapolicies/db2/CDC_TSA.sh "STOP"  "${lAdminUser}" "${lDb2Database}" "${lInstance}"" \

      MonitorCommand="/usr/sbin/rsct/sapolicies/db2/CDC_TSA.sh "CHECK"  "${lAdminUser}" "${lDb2Database}" "${lInstance}"" \

      MonitorCommandPeriod=30 \

      MonitorCommandTimeout=30 \

      StartCommandTimeout=600 \

      StopCommandTimeout=30 \

      UserName="${lAdminUser}" \

      RunCommandsSync=1 \

      ProtectionMode=0 \

      NodeNameList={${lNodeList}}

 

I didn’t go through with this change as my set up was already running and the added value of the change was not big enough. If you are about to start from scratch, you still might take it into consideration.  

The testing was done on a isolated cluster to test the idea to have one single cluster hosting multiple CDC replicated databases which can be failed-over completely separate from each other. The documentation found on Developer works and the IDUG site were fundamental tools when implementing this particular setup. With this extra information you’ll get a good idea on how to proceed should you want to set this up yourself … but it isn’t really spelt out. A lot of trying and testing will be needed as adding too much detail would not make this document readable.

 Eddy Coppens

BiteStrike

0 comments
2 views