Lessons Learned of Locking and Latches
Locking and latching are both key mechanisms employed by relational database management systems to ensure data integrity, consistency, and the proper serialization of resources. In the Db2 for z/OS world, IRLM manages the locks, but both Db2 and IRLM have internal latches whose performance can degrade based on the level on contention for those latches. This article will delve into some recent discussions with customers around techniques to improve lock avoidance and reduce latch contention.
Locking and lock management are vital to ensure data integrity whenever multiple concurrent processes require update access to a shared object. It is the combination of concurrency, updates, and the need for consistency provided by the DBMS and application that drive granularity and locking. This desire for finer granularity has caused some application developers to go ‘too far’ in efforts to avoid taking a lock. Db2 has used internal lock avoidance techniques for many releases, and this is especially crucial in data sharing environments where locks may have to be propagated to the coupling facility as well as de-conflicted through lock negotiation.
A brief aside into lock avoidance:
The goal of lock avoidance is both to restrict access to uncommitted data, as well as reduce the overall lock volume. Db2 uses two levels of checks to avoid calling IRLM to obtain a lock. First, when a data page is accessed the RBA (relative byte address) of the last update in the header is compared to the RBA of the oldest unit of recovery (UR) running in the system. If the header RBA is older Db2 does not need a lock. If that test fails, the second check is against the PUNC (Possibly UNCommitted) bit for each qualifying row. If the PUNC bit is off then the row has not been updated ‘recently’, and no lock is taken. Hence controlling long running URs should be of utmost importance to Db2 DBAs.
The first line of defense in reducing the scope of the locks taken is the isolation level under which the application runs. In descending order of concurrency, you have UR (uncommitted read), CS (cursor stability), RS (read-stability), and finally RR (repeatable read). For absolute consistency with the underlying data a developer could use Repeatable Read to ensure if he went back and read the data again during the same execution of his program the data would be identical, no more, and no less rows. The next most consistent option would be Read Stability where he ensures none of the rows he returned in the first scan of the data would change, but there could have been new rows inserted. Cursor Stability ensures the data remains the same based on the position of the cursor in the application during the unit of work, thus allowing much more concurrency for other applications. Finally we have Uncommitted Read, the least restrictive as far as concurrency, but the application could be looking at rows that are being updated, or deleted, as well as missing new rows which are being inserted. Many DBAs would prefer application developers use UR whenever possible to avoid unnecessary locking. In the case where the application must do a searched update, or delete, Db2 ensures the update/delete is not lost so the transaction becomes CS behind the scenes. However the developer may have to take extra steps to ensure the result set does not change prior to altering the row. Hence cursors and cursor stability are powerful tools when doing these surgical updates to the data. But, when a FOR UPDATE cursor cannot be used with a query it is common for application programmers to use a READ ONLY cursor with a subsequent searched UPDATE or DELETE. As an example:
What is the issue? Well the row may no longer be ABC.DEF.GHI by the time the UPDATE executes. Another transaction may have gotten in and updated it to ABC.DEF.XYZ, hence the application is updating a different version of the row than was previously returned. This is an exposure even when using ISOLATION(CS) regardless of the CURRENTDATA option, but the use of CURRENTDATA(NO) will increase the exposure. What is CURRENTDATA again? CURRENTDATA(NO) implies that all rows, qualified or not, are eligible for lock avoidance, while CURRENTDATA(YES) implies only non-qualified rows are eligible for lock avoidance. With CURRENTDATA(YES) though, the lock is only held on any qualifying row until the next FETCH. A singleton SELECT with CS will, aside from a few exceptions, always try for lock avoidance irrespective of the value of the CURRENTDATA parameter in the bind statement.In the example we use a cursor, but it goes without saying that this exposure exists for non-cursor selects as well. The cursor is generally used in order keep track of the result set within the application itself. One clue that the exposure exists for the application can be found in the combination of bind parameters used. For instance using a simple catalog query:
An inquisitive DBA could find:
| PACK_ISO | PACK_CD | COUNT |
| UR | Y | 64 |
CURRENTDATA(YES) is irrelevant if combined with Uncommitted Read . Here it would seem some applications are trying to keep their result set ‘current’ even though they hold no locks on the base data, and could be looking at uncommitted, or ‘rolled-back’ rows. These packages should be investigated to understand the developer’s intent and what locking strategy is in place.
What are some possible solutions?
Option #1: Use additional WHERE predicates on the searched UPDATEs and DELETEs to enforce data currency i.e. ensure the data has not changed since the cursor first selected the data. This concept of an ‘over-weighted’ WHERE clause means you include all columns that logically determined if the update was necessary as WHERE predicates, instead of updating based solely on the key column value. In this case if row ABC.DEF.GHI became ABC.DEF.XYZ between the initial select and the update the application would receive "row not found“, and the transaction could be retried.
Option #2: Use of a timestamp or token column in the table to record the last update, or ROW CHANGE expression in the update statement. This way an application does not need to know all the old values which are marked for update, just compare the token or timestamp column to its previous value. The timestamp is rather self-explanatory, and the ROW CHANGE TOKEN was introduced to return a token that represents a relative point in the modification sequence of a row.
If altering the table to add such a column is unreasonable you can use a ROW CHANGE TIMESTAMP expression which returns a timestamp value that reflects changes made to the page instead of to the row. This timestamp value indicates that at least one row in the page has changed, but does not indicate which row, or even how many rows, have changed. In this case you may be safe if you are using page-level locking. But if you are only concerned about adding the column because of affecting applications using a ‘SELECT *’, you can always define the column as IMPLICITLY HIDDEN forcing applications to use the column name in order to retrieve it.
Latches are what you might call Db2 ‘internal locks’ as we use them to serialize access to resources to control physical consistency. There are over 30 different types of latches (latch classes) in Db2 which govern everything from buffer pool chain management to log output buffer space usage. IRLM uses latches as well to serialize requests for locks. These latches are hashed together in chains and there is one main latch that all of the secondary latches hang off of. Traversing through this main latch is necessary for some tasks like deadlock detection, query requests (e.g. DISPLAY DATABASE LOCKS, or MODIFY irlmproc, STATUS), and for tracing. IRLM tracing as well as IFCID 150 (Global Lock Conflict) can be quite expensive and heavily impact the efficiency with which IRLM manages those latches. We want to monitor and avoid unnecessary IRLM latch suspensions.
The following is a case study to determine the reason for elevated IRLM latch suspensions.
This customer was experiencing intermittent transaction response time issues, elevated lock/latch contention and CPU starvation. One looming question was whether or not the Db2 transactions were worsening the situation or merely the victims of the CPU starvation. The customer had been looking through statistics and accounting reports already, and the assertion was made that the lock/latch counts were adding to the CPU utilization. The timeframe with the worst response times was from 10am to just after 12pm. During most of this period the 15 minute RMF interval reported the LPAR’s CPU utilization at about 85%. However, in order to distinguish ‘good’ from ‘bad’ response times, it was necessary to collect more granular RMF Monitor III records, which ran at 60 second intervals. Using this data we could see the CPU utilization was actually at 100% with latent demand for up to 10 minutes at a time. Focusing on this smaller window I walked through the Db2 statistics trace at 1 minute intervals. This was possible with the use of the CSV generator option in Omegamon Performance Expert.
What I found were many 1-2 minute intervals with high IRLM latch suspensions. The graph below shows these spikes, but just as importantly we can see there are NO Lock suspensions.
Prior to this performance investigation the customer had noticed one transaction in particular taking more locks than it had historically. The package had inadvertently been bound with Repeatable Read isolation level, instead of Uncommitted Read (which was its design point). This discovery and the increased lock/latch contention during the CPU spikes lead the customer down the road of trying to rebind any and all packages with UR to avoid taking locks. The graph above shows beyond a doubt that there is no lock contention during the period of CPU starvation. When discussing this with the customer, they referenced several transactions that had high class 3 LOCK/LATCH(DB2+IRLM) time. This was a key point to clarify as Db2 broke out the IRLM vs. DB2 lock and latch time in Db2 10. The monitoring tool the customer was using to diagnose this issue still showed IRLM and Db2 suspensions in the same counter. Below is a current example of the Class 3 suspension piece of the Db2 accounting report.
With this finding we could rule out the potential for application locking changes causing the CPU utilization to climb. I then overlayed the RMF Monitor III data with the IRLM latch suspension counts in the statistics data, and came up with the chart below.
This chart conclusively shows the correlation between the CPU utilization and the elevated IRLM latch suspensions. During times of 100% CPU utilization threads will remain in the system longer, higher levels of concurrency exist, and more simultaneous IRLM requests will result. If you combine this with serious CPU constraint it stands to reason that IRLM is trying to run its latch chains more frequently with less CPU resources, and thus leading to the contention we see. In the scenario depicted above any time the CPU was over 90% utilized we see the IRLM latch contention, and below 90% it disappears. As a general rule of thumb, IRLM latch suspensions should represent no more than 5% of all the IRLM requests.
IRLM Latch Contention = SUSPENSIONS (IRLM LATCH) (A)
Total IRLM Requests = LOCK + UNLOCK + QUERY + CHANGE REQUESTS (B)
IRLM Latch Contention Rate (%) = (A)/(B)*100 ⇐ should be <=5%
During this period in our example the IRLM suspensions are 2-3x our rule of thumb for the total requests.
In the introduction to this section I listed several reasons IRLM needs to run through its latch chains. When IRLM is CPU constrained all the effects of these processes are magnified. We often see extreme latch contention when IRLM is not running at the SYSSTC dispatching priority in WLM. However even if it is classified correctly, on an LPAR running at >100% with latent demand and new threads coming in, the queue of waiters grows and contention appears out of nowhere. If the contention here could not be correlated to CPU constraint the next step would be to look at the WLM policy, then possibly adjust the deadlock detection cycle up, and also look for in-depth lock tracing. During problem determination it is of utmost importance to rule out CPU utilization as the root cause. Lowering the relative importance of other CPU intensive tasks, or even adding CPU capacity on demand may be necessary. Only then can you move to a lower level, application focused, investigation. In the case of this customer there was no need to rebind plans or packages with a different isolation level.
Dispatching suspended lock and latch processes when the resource becomes available is a critical task in Db2 and IRLM. Using the techniques described above should help avoid unnecessary locking and latching, while still ensuring data integrity.