Long latch time

Shay Miller

Long latch time

Hello everyone.

We have a table that is used by applications in order to get ever ascending keys.

This table is been around for many years - prior to the invention of sequences and many programs are still using this table.

Every row in the table holds a key and its value and is defined with row level locking.

The process of getting a new key is updating the requested key to the current value + 1 and then selecting the current (which is the new) value.

 

Last week we had some timeouts in production on this table.

We found that there was a single CICS transaction that ran for 38 seconds (more than the 30 seconds limit for timeout), and that 99.75% of the elapsed time was waiting for a latch.

Screen snippet from Mainview/Db2 :

"

STOP.....03OCT 15.41.37.95 PLAN..............CICSPPLN TYPE..............ALLIED

START....03OCT 15.40.59.76 AUTHID.............CICCK3P CONNECT....CICSOLIB/CICS

ELAPSED..............38 s  ORIG PRIM AUTH.....CICCK3P CORR ID.....POOLOI310130

TERM........NORMAL/DEALLOC COMMITS..................1 ROLLBACKS..............0

------------------------------------------------------------------------------

RUNTIME ANALYSIS   IN DB2     IN APPL.      TOTAL      %IN DB2(=)     TOTAL(*)

----------------   --------   --------   --------     0 ...25...50...75..100% 

ELAPSED TIME          38 s       82 ms      38 s      | ===================* |

 

 --TOTALS--                             38 s   100.00 | ******************** |

WAITS IN DB2 (LOCAL)                                  |                      |

  LOCK                   0      0 us     0 us    0.00 |                      |

  LATCH                  2     19 s     38 s    99.75 | *******************  |

"

1. What can be the cause of the long latch time?
   in RMF I didn't saw high load on the system or delays for IRLM.

2. How can I see which latch was the wait for?

Further suggestions/ ideas will be welcomed.

Thanks,

Shay

 

Edited By:
Shay Miller[Organization Members] @ Oct 08, 2019 - 11:29 AM (Asia/Jerusalem)

Roy Boxwell

Long latch time
(in response to Shay Miller)
Well, here’s my Latch docu that I have scraped together over the years...



WAIT LATCH

A latch is an in-memory device used by Db2 to not use the IRLM to manage a lock. You can view it is a mini-lock, normally very fast and transient in nature, related to Buffer Pools and control areas in pages of data. Where one process must gain access to e.g. a header page and “latch” it to stop any other process for the duration of the quick update. Some “known” latches are LC06 Index Split, LC14 Bufferpool LRU & Hash chain update, LC19 Log, LC24 Pre-fetch and EDM LRU processing. There are lots of latches but only 33 latch classes in use and most are not well, if at all, documented as they normally cause no trouble!

Here is a list of them all:

QVLSLC01 DS F /* Infrequently used */

QVLSLC02 DS F /* Global authorization cache */

QVLSLC03 DS F /* DDF disconnect */

QVLSLC04 DS F /* SYSSTRING cache */

QVLSLC05 DS F /* IRLM data sharing exits or RLF */

QVLSLC06 DS F /* Data sharing index split */

QVLSLC07 DS F /* Index latch and OBD allocation */

QVLSLC08 DS F /* Query parallelism */

QVLSLC09 DS F /* Utilities or stored procs URIDs */

QVLSLC10 DS F /* Allied agent chain or seq. desc.*/

QVLSLC11 DS F /* DGTT allocation or Sequence/Identity */

QVLSLC12 DS F /* Global transaction ID table */

QVLSLC13 DS F /* Pageset operations */

QVLSLC14 DS F /* Bufferpool Hash chain and LRU Chain */

QVLSLC15 DS F /* ARCHIVE LOG MODE(QUIESCE) */

QVLSLC16 DS F /* UR chain */

QVLSLC17 DS F /* RURE chain */

QVLSLC18 DS F /* DDF resynch list */

QVLSLC19 DS F /* Log write */ Effectively gone in Db2 12!

QVLSLC20 DS F /* System checkpoint */

QVLSLC21 DS F /* Accounting rollup */

QVLSLC22 DS F /* Internal checkpoint */

QVLSLC23 DS F /* Buffer manager:

Add page latch waiter on timer queue

Add remove to/from deferred write queue for GBP-dependent objects

QVLSLC24 DS F /* EDM pool LRU chain and Buffer manager page unlatch and prefetch */

QVLSLC25 DS F /* Workfile allocation and EDM hash chain */

QVLSLC26 DS F /* Dynamic statement cache */

QVLSLC27 DS F /* Stored procedures queue */

QVLSLC28 DS F /* Stored procedures or auth cache */

QVLSLC29 DS F /* Field procs and DDF trans manag */

QVLSLC30 DS F /* Agent services */

QVLSLC31 DS F /* Storage manager */

QVLSLC32 DS F /* Storage manager */

QVLSLC254 DS F /* Index latch */



IBM state that high numbers of internal DB2 latches in most cases points to DB2 internal problems and should be reported to DB2 service. The exceptions are Latch classes 06, 07, 11, 14, 19, and 24, where user tuning can address the problem.

- LC06: Latch type x46 70. In non-datasharing use type xFE 254. Index tree P-lock latch contention caused most likely by splits of GBP-dependent index pages, we can specify large PCTFREE for indexes that are causing the problem and reorganize them more often to reestablish the free space and reduce index splits at subsequent inserts. Also using a larger index page size can help.

- LC07: Using a larger index page size can help.

- LC11: Updating the MAXASSIGNEDVAL column in SYSSEQUENCES to figure out the next available value for an identity column is done under internal DB2 latch Class 11, Generating Identity Column. For data sharing it is additionally done under page P-lock. Using the CACHE option should be strongly recommended for high insert rate workloads, especially in data sharing.

- LC14: For BP LRU chain, we can isolate small highly accessed tables into separate pools.

- LC19: We can reduce the number of class 19 latch contentions by speeding up the log write activity allocating logs at the fastest devices, turning z/OS, avoiding log DASD contention, rearranging columns to minimize the number of log data written and so on. This Latch is effectively gone in Db2 12.

- LC24: It can be either EDM LRU latch or Buffer Manager latch. If latch contention trace, IFCID 51 and 56, is collected, QW0051LC/QW0056LC will show latch type x18 (24) for EDM LRU latch and x38 (56) for Buffer Manager latch. If 24, the best way to fix is to use EDMBFIT DSNPARM of NO as an order of magnitude reduction in LC 24 latch contention has been observed. If 56, there is a number of cases Buffer Manager uses latching. For example, there is one latch per data set in prefetch scheduling. This has been noticeable in dynamic prefetch in star join and also in CPU parallelism with many degrees. Having more partitions can reduce this latch contention. Real Time Statistics also uses LC 56 latch.





Hope that helps!



Roy Boxwell

SOFTWARE ENGINEERING GmbH and SEGUS Inc.
-Product Development-

Vagedesstrasse 19
40479 Dusseldorf/Germany
Tel. +49 (0)211 96149-675
Fax +49 (0)211 96149-32
Email: <mailto:[login to unmask email]> [login to unmask email]
Web http://www.seg.de http://www.seg.de

https://www.seg.de/corporate/rechtliche-hinweise/datenschutz Link zur Datenschutzerklärung


Software Engineering GmbH
Amtsgericht Düsseldorf, HRB 37894
Geschäftsführung: Gerhard Schubert, Ulf Heinrich



From: Shay Miller [mailto:[login to unmask email]
Sent: Tuesday, October 8, 2019 10:27 AM
To: [login to unmask email]
Subject: [DB2-L] - Long latch time



Hello everyone.

We have a table that is used by applications in order to get ever ascending keys.

This table is been around for many years - prior to the invention of sequences and many programs are still using this table.

Every row in the table holds a key and its value and is defined with row level locking.

The process of getting a new key is updating the requested key to the current value + 1 and then selecting the current (which is the new) value.



Last week we had some timeouts on this table and we found that there was a single CICS transaction that ran for 38 seconds (more than the 30 seconds limit for timeout), and that 99.75% of the elapsed time was waiting for a latch.

Screen snippet from Mainview/Db2 :

"

STOP.....03OCT 15.41.37.95 PLAN..............CICSPPLN TYPE..............ALLIED

START....03OCT 15.40.59.76 AUTHID.............CICCK3P CONNECT....CICSOLIB/CICS

ELAPSED..............38 s ORIG PRIM AUTH.....CICCK3P CORR ID.....POOLOI310130

TERM........NORMAL/DEALLOC COMMITS..................1 ROLLBACKS..............0

------------------------------------------------------------------------------

RUNTIME ANALYSIS IN DB2 IN APPL. TOTAL %IN DB2(=) TOTAL(*)

---------------- -------- -------- -------- 0 ...25...50...75..100%

ELAPSED TIME 38 s 82 ms 38 s | ===================* |



--TOTALS-- 38 s 100.00 | ******************** |

WAITS IN DB2 (LOCAL) | |

LOCK 0 0 us 0 us 0.00 | |

LATCH 2 19 s 38 s 99.75 | ******************* |

"

1. What can be the cause of the long latch time?
in RMF I didn't saw high load on the system or delays for IRLM.

2. How can I see which latch was the wait for?

Further suggestions/ ideas will be welcomed.

Thanks,

Shay





-----End Original Message-----

Attachments

  • smime.p7s (5.1k)

Tushar Jha

RE: Long latch time
(in response to Roy Boxwell)

Hi Roy,

Any thoughts on LC25 ( Workfile allocation and EDM hash chain ) . Can Tuning address this.

What are the things which can cause high number of LC25. And would things like doing frequent commits,having  more number of workfile tablespaces ( both 4K and 32K ) help . And , in what conditions will this be caused by DB2 internal problems.

I would greatly appreciate , if you can provide any insights on this.

Thank you.

 

Regards,

Tushar Jha

Roy Boxwell

Long latch time
(in response to Tushar Jha)
I would probably make sure you have at least 32 pretty big 32K files and four not so big 4k files

Make sure two of each has zero as secondary quantity and review the extents of all the others on a regular basis!



Db2 Internal problems are naturally internal Db2 problems and I have no idea about that – But remember Db2 is just a really huge program and all programs have bugs....



Roy Boxwell

SOFTWARE ENGINEERING GmbH and SEGUS Inc.
-Product Development-

Vagedesstrasse 19
40479 Dusseldorf/Germany
Tel. +49 (0)211 96149-675
Fax +49 (0)211 96149-32
Email: <mailto:[login to unmask email]> [login to unmask email]
Web http://www.seg.de http://www.seg.de

https://www.seg.de/corporate/rechtliche-hinweise/datenschutz Link zur Datenschutzerklärung


Software Engineering GmbH
Amtsgericht Düsseldorf, HRB 37894
Geschäftsführung: Gerhard Schubert, Ulf Heinrich



From: Tushar Jha [mailto:[login to unmask email]
Sent: Wednesday, October 9, 2019 11:07 AM
To: [login to unmask email]
Subject: [DB2-L] - RE: Long latch time



Hi Roy,

Any thoughts on LC25 ( Workfile allocation and EDM hash chain ) . Can Tuning address this.

What are the things which can cause high number of LC25. And would things like doing frequent commits,having more number of workfile tablespaces ( both 4K and 32K ) help . And , in what conditions will this be caused by DB2 internal problems.

I would greatly appreciate , if you can provide any insights on this.

Thank you.



Regards,

Tushar Jha



-----End Original Message-----

Attachments

  • smime.p7s (5.1k)

Adrian Burke

RE: Long latch time
(in response to Shay Miller)

Hi Shay,
If you are still looking for the type of latch which was elevated I would start with a statistics long report (SMF100) for the minutes before, during, and after the incident. So maybe 15:40, 15:41,15:42, with each report covering a 1 minute interval. The Latch Count section can show you which one was elevated. There are often more than 1 latch that is mapped to each latch counter but this is a good start.

Adrian Burke
DB2 SWAT Team

Tushar Jha

RE: Long latch time
(in response to Roy Boxwell)

Thanks a lot, Roy!!

Shay Miller

RE: Long latch time
(in response to Adrian Burke)

Hello Adrian.

 

which kind of report do you mean?

Can you please add an example?

 

what we saw was not a high number of latches , it was a long wait time for a latch.

Adrian Burke

RE: Long latch time
(in response to Shay Miller)

Hi Shay,

There is a statistics long report option for any batch reporting tool run against your SMF100 data. BMC/CA/OMPE all have various forms of statistics reports.

If you are looking in the accounting records, or online in the thread detail if the thread is waiting on the latch you will see the wait time, while in the stats reports the number of latch contentions shows the number of times different processes fought for control of the latch. So if there is lots of contention and you are in the back of the queue you may wait a long time and only see 1 latch attained by that thread. However at the system level the latch may be held and released thousands of times a second.

The output of the latch section looks like this in Omegamon:

 

LATCH CNT   /SECOND
---------            --------
LC01-LC04      0.00
LC05-LC08      0.23
LC09-LC12      0.00
LC13-LC16      0.06
LC17-LC20      0.00
LC21-LC24      0.06
LC25-LC28     19.61
LC29-LC32     30.92
LC254          0.75

Michael Hannan

RE: Long latch time
(in response to Shay Miller)

Like others, I think it becomes important to know the type of Latch and not just the delay time, but how many latches caused it. So large number of fast latches, or not? Latch is intended to brief while DB2 reads some info or updates it. Latch should be released quickly after that is done. Not long, like an exclusive lock held till commit.

Once a system gets a suspend delay problem, you get a chain reaction of others processes held up. So although the key number allocation table may not be ideal (possible cause of lock delays waiting for commit), it may be a victim or the initial problem delay rather than a culprit. Still it is possible to look at amendments to improve the number allocation scheme later. First have to find the root cause of your problem.

There are so many latch types and possible several reasons for high latch delay. Hard to guess right.

Michael Hannan,
DB2 Application Performance Specialist
CPT Global Ltd