by: Michael Roecken and David Sciaraffa
With the many new features introduced in 11.1.2.2, not to be overlooked are some of the enhancements made in the area of Recovery and Availability. The team has been hard at work:
- Trying to provide earlier access to your data after a crash recovery or HADR takeover through online recovery
- pureScale support for rebuilding your database using table space backup images
- Better monitoring of table space errors on an HADR standby and an easier way to repair them
- Avoiding lock escalation situations between concurrent applications
While the above addresses some of the larger and more frequent requests we have had from customers of late, additional effort has also been made to improve other areas such as:
- Recovery performance
- Change history event monitor extensions
- Clarity on Db2 version compatibility
This article will attempt to give the reader a quick overview of some of the above mentioned features and introduce you to some of the new capabilities that Db2 Version 11.1.2.2 can offer.
Access your database sooner after a system failure -- introducing DB2_ONLINERECOVERY
System failures happen. An enterprise database must be extremely resilient to system failures. Data loss or corruption are impermissible.
Db2 recovery from a system failure begins when the database is restarted, and utilizes a process called 'crash recovery'. During crash recovery, any database changes which were previously committed by the application but not yet persisted to disk, are re-applied (we refer to this as the Forward-phase of crash recovery). Next, any database changes which were persisted to disk but not yet committed by the application, are undone (we refer to this as the Backward-phase of crash recovery). The Db2 transaction logs are processed to facilitate all this work.
Crash recovery processing is not isolated to system failures. In a database configured for disaster recovery via HADR, a TAKEOVER HADR operation on the Standby database must also perform crash recovery in order to undo any changes which were persisted to disk but not yet committed to the application.
While a database is performing crash recovery, it is not connectable by any application.
While system failures seldom happen, most customers expect crash recovery to complete quickly. A common customer request is for quicker accessibility to the database after a system failure, during crash recovery.
Our development team decided upon a solution which promised increased availability, without risking the distinguished resilience that Db2 has achieved over the decades – Db2 online recovery is here!
Off by default in Db2 Version 11.1.2.2, the feature can be enabled through the registry variable DB2_ONLINERECOVERY. The feature allows for the database to become connectable once the forward phase of crash recovery has completed, while the backward phase of crash recovery is progressing. Tables which contain data changes that need to be undone during the backward phase of crash recovery will not be accessible to queries until the last of their data changes is undone. All other tables are fully accessible to queries throughout the backward phase of crash recovery.
For more detailed information about this cool new feature, please see:
Database accessibility during backward phase of crash recovery or HADR takeover
https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.admin.ha.doc/doc/c0070281.html
Restore Rebuild and Rollforward your database using table space backup images in pureScale
Rebuilding a database is the process of restoring a database or a subset of its table spaces using a set of restore operations. The functionality provided with database rebuild makes Db2 database products more robust and versatile, and provides you with a more complete recovery solution.
The ability to rebuild a database from table space backup images means that you no longer have to take as many full database backups. As databases grow, opportunities for taking a full database backup are becoming limited. With table space backup as an alternative, you no longer need to take full database backups as frequently. Instead, you can take more frequent table space backups and plan to use them, along with log files, in case of a disaster.
In a recovery situation, if you need to bring a subset of table spaces online faster than others, you can use rebuild to accomplish this. The ability to bring only a subset of table spaces online is especially useful in a test and production environment.
Rebuilding a database involves a series of potentially many restore operations. A rebuild operation can use a database image, or table space images, or both. It can use full backups, or incremental backups, or both. The initial restore operation restores the target image, which defines the structure of the database that can be restored (such as the table space set, the storage groups and the database configuration). Rebuilding allows you to build a database that is connectable and that contains the subset of table spaces that you need to have online, while keeping table spaces that can be recovered at a later time offline.
Prior to Db2 Version 11.1.2.2, pureScale required a database backup to perform a database recovery, although the ability to rebuild a database using table space backup images has long been available in non-pureScale deployments.
As of Db2 Version 11.1.2.2, the team has brought pureScale on par and so now one can recover a database from a set of table space backup images.
For more details and information on limitations, please see:
RESTORE DATABASE
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.admin.cmd.doc/doc/r0001976.html
Don't get caught unprepared - monitor invalid table spaces on your HADR standby database and repair them efficiently
The Db2 High Availability Disaster Recovery feature, better known as HADR, is a database replication solution for disaster recovery. There are few database administrators in the world unfamiliar with HADR, being one of the most successful and widely deployed Db2 features.
The recommended setup stipulates that both the primary and standby should be configured alike in case of failover. But, there can be cases where this may not always be in place and so errors can arise with the replay of certain log records.
File system full errors could occur during create table space or even loads may not be properly replayed if the load copy image cannot be found on the standby. In such cases, the table space on the standby will be placed into rollforward or restore pending and the primary database will be unaware that this has occurred. Applications on the primary will continue with business as usual, but replay of the data for that table space on the standby has now stopped. Because of this assumed health of the standby database, a future takeover operation can result in the unexpected unavailability of table space data when this standby database becomes the primary database.
To address this pain point, as of Db2 Version 11.1.1.1 monitoring this condition became much easier, as a new flag STANDBY_TABLESPACE_ERROR was added to the HADR_FLAGS field of db2pd -hadr (or the MON_GET_HADR() table function). This will now tell you that some table space on the standby is in error and searching the db2diag.log can provide details into the error.
For more details about monitoring this error condition, please see:
Monitoring and identifying tablespaces in invalid or error state, or tables in Inoperative state on the HADR Standby database
https://www-01.ibm.com/support/docview.wss?uid=swg21993013
Now with monitoring available the next step would be how to resolve such an error condition with a table space.
Previous versions required a full re-initialization of the standby database (a full database backup image of the primary database must be restored on the standby database and HADR restarted). For large databases, this could take a considerable amount of time, and leave the primary database exposed without a disaster recovery counterpart.
New in Db2 Version 11.1.2.2, a full database backup image is not required. A table space backup taken on the primary can now be restored on the standby. Yes, an offline table space restore can now be issued on the standby in order to repair a bad error state, while all along keeping the HADR role as a standby.
For more details on how to recover from this condition using a table space backup image, please see:
How to recover from tablespace errors on an HADR Standby database
http://www-01.ibm.com/support/docview.wss?uid=swg21993389
Improve application concurrency by preventing lock escalations
Lock escalation is the process by which many row locks on the same table and associated with an application, are released and replaced by a single table lock associated to that application. Thus, locklist memory becomes available for reuse.
While lock escalation is beneficial in that it allows the database to continue to grant new locks, it also has a negative side affect. By escalating to a single table lock associated to that application, other applications may no longer be able to access rows on this table (depending on their configured isolation levels).
For this reason, the frequency of lock escalations are monitored by experienced database administrators, and tuning is performed. Lock escalations can be monitored by querying the lock_escals value of the MON_GET_ACTIVITY() table function (as well as many other granular table functions).
When lock escalation problems are observed, a database administrator should always explore the cause and look for any tuning opportunities. Sometimes, queries and workloads are adhoc and beyond the control of the database administrator. In such scenarios, lock escalations are unpredictable and difficult to tune.
Starting in Db2 Version 11.1.2.2, a new feature called Lock Escalation Avoidance provides control over lock escalation behaviour. This feature is configured through the dynamic registry variable
DB2_AVOID_LOCK_ESCALATION=ON
When this feature is enabled and an application encounters the conditions where lock escalation would normally occur (for example locklist memory is full, or MAXLOCKS limit reached) instead of performing lock escalation the application will receive an SQL0912N error. The application then has an opportunity to either perform a COMMIT or ROLLBACK to release the locks held by this application.
For more information on Lock Escalation Avoidance, please see:
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.admin.regvars.doc/doc/r0005665.html#P_DB2_AVOID_LOCK_ESCALATION
Recovery performance improvements
Who doesn’t want their runtime rollback or crash recovery to run faster? Ask anyone and they would always like to see things run more quickly, especially when trying to gain access to their data quicker.
In an ongoing effort by the team to find ways to make both forward and backward recovery quicker, as new findings are discovered they are investigated and constantly being delivered into the code. Db2 Version 11.1.2.2 is another delivery where we have added some internal tweaks that will help some workloads gain a little bit of extra performance.
During the forward phase of recovery, the team noticed that local pending list log records that only contain declared global temporary table (DGTT) drop table actions can be blocked at the table space level instead of requiring database blocking. This allows better parallelism during replay of such log records.
Not to be left out, during backward phase of recovery, the team improved the response rate of log record read requests needed for compensation by moving conflicting work that delayed the response rate to a new thread. In addition, more substantially the team altered the memory buffers available for log record reading. The biggest benefit comes with having more larger buffers to accommodate multiple concurrent rollbacks. Internally, tests showed improvements in some cases of up to 4x faster rollback performance1.
1 Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
Change history event monitor extensions
The change history event monitor captures information about events on the database that might impact the running of your regular database workload. In particular, it tracks online backups. The syntax of the command is stored in the monitor element utility_detail of both, the monitoring table function mon_get_utility and the change history event monitor. For the latter, this monitor element is persisted in the table UTILSTART_
Prior to Db2 Version 11.1.2.2, the monitor element utility_detail shown in both the mon_get_utility output and in the information recorded by change event monitors looks like the following:
db2 "select utility_detail from TABLE(MON_GET_UTILITY(-2)) AS T"
UTILITY_DETAIL
---------------------------------
BACKUP DATABASE DM ONLINE USE TSM
BACKUP DATABASE DM ONLINE USE TSM
2 record(s) selected.
This does not contain any of the parameters that would allow for a detailed history of backup activity. The following parameters are not recorded:
- exclude / include logs
- number of sessions
- parallelism
- size of the buffers
- number of buffers
- UTIL_IMPACT_PRIORITY
- usage of deduplication
- usage of compression with which compression library and which options
- usage of encryption with which encryption library and which options
- snapshot backup with which vendor library and which options
- scripted snapshot with which script
As of Db2 version 11.1.2.2, the above missing details are added to the utility details in the mon_get_utility output and the event monitor for change history for backup.
Understanding fallback compatibility of Db2 Version 11.1 Fix Packs and Mod Packs
As Db2 moves towards a continuous delivery model, more frequent deliverables will occur through modification packs and fix packs and less frequent through new major release/versions.
Modification Pack
A Db2 Modification Pack is a collection of significant feature and functionality. However, it differs from a new version in that no upgrade procedure is required. Db2 modification packs are usually fallback compatible. Db2 Version 11.1 Modification Pack 2 Fix Pack 2 is an example. Prior to Db2 Version 11.1, Db2 did not use modification pack nomenclature.
Fix Pack
A Db2 Fix Pack is a collection of fixes on a specific release or modification pack. The scope of these fixes are generally not significant and there is usually no new feature or functionality. Fix Packs are released on a periodic basis, usually approximately every 3-4 months for brand new releases, and stretching to approximately every 12 months for older mature releases. Prior to Db2 Version 11.1 (before Db2 adopted the modification pack nomenclature), some Fix Packs did contain significant feature and because of this may not have been fallback compatible.
Planning for the deployment of a new version, modification pack, or fix pack will require a back-out strategy or plan. Once a database is running on a newer version/mod/fix pack, consideration is needed for whether the database data structures, backup images, and transaction log files are still compatible with the previous version/mod/fix pack.
A detailed technote has been published to describe the many nuances of fallback compatibility, and should help in the assembly of a back-out strategy:
Compatibility between DB2 for LUW Version 11.1 Mod-Packs and Fix-Packs
http://www-01.ibm.com/support/docview.wss?uid=swg22003131