DB2 LUW LOB Internals & Performance, part 2: LOB storage internals

The second installment of helpful information on LOB internals from performance expert Russ Stocker.

How does DB2 allocate and manage space for LOBs?

 DB2 stores most row data in fixed size (4k, 8k, 16k or 32k) pages. This data is referenced by queries via bufferpool access, i.e. it is read from disk into bufferpool memory to be worked on.  LOB (large object) data is handled differently. A LOB (CLOB - character large object, BLOB – binary large object, or DBCLOB - double byte large object) is stored outside regular row based storage. A LOB can be up to 2GB in size (1GB for a logged LOB). The necessity of handling large amounts of data of potentially very disparate sizes efficiently is what drives the LOB storage mechanism.

A LOB column is represented in the row by a descriptor that points to the actual LOB data itself. The size of this descriptor varies from 60 to 312 bytes, depending on the size defined for the LOB in the definition of the table. As we discussed in a previous blog post, if a LOB column is defined with an inline size, and if the entire LOB will fit, then it will be inlined into the regular data page.  If the actual LOB size is no larger than the descriptor size, it will be implicitly inlined, even if no inlining is defined in the table definition.  If the LOB data is larger than the inline size, none of the LOB will be inlined. 

LOB data that does not get inlined as described above is represented internally by two objects, LBA and LB. The LBA object contains space allocation information, and the LB object contains the actual LOB data itself. If the tablespace is defined as using System Managed Storage (SMS), then these two objects can be seen as separate files, with .LBA and .LB file extensions. Space allocation information in the LBA object passes through the bufferpool. LOB data in the LB object is only referenced by DB2 using direct IO, and is not stored in the bufferpool. Some memory buffering of LB data can be obtained by utilizing the file system cache.

LOB space storage management uses what is called a Buddy Space mechanism. There are two types of buddy space - Super Buddy Space (SBS) and Regular Buddy Space (RBS). An SBS controls allocation of large segments of data, from 256K to 64M bytes, and can also break down into RBSs. An RBS controls allocation of segments of size 1k (the minimum LOB allocation unit) to 128K.

In an empty table, there is just one SBS. If we then insert a 50k LOB, the SBS gets subdivided (always by a factor of 2) until, eventually, we have an RBS segment of a size that can completely contain the LOB – in this case, the RBS segment will be 64K in size.  See the diagram below.  Since we always subdivide by factors of 2, at this stage we will have the occupied segment, and a free buddy segment of 64k (the occupied segment’s twin), as well as free buddy segments back up the chain.

LOB storage.JPG



DB2 maintains a hint of where free space of a particular size is located, and so another insert of, say, 60k, would go straight to the buddy segment of the first insert. If we do a third insert (55k this time), we have no pointer to free space of the right size, so we will look for a segment of the next largest size, i.e. 128k, and we will subdivide that into two 64k segments, and use one of them. The other (its buddy) will be free space.

The space remaining in a segment beyond the user data is not used, i.e. for our first insert of 50k, the last 14k is not used. There is a COMPACT option which can be specified for LOB columns. The effect of specifying this option is to subdivide the LOB data into smaller pieces to derive a closer 'fit' with the buddy segment sizes. For example, for our first insert of 50k of data, with COMPACT we would divide data into a 32k, 16k and 2k segment, and would thereby not 'waste' any space at all. However, these pieces will not necessarily be contiguous and may even be widely separated, and so performance of retrieving the data may suffer. Update operations are also affected by COMPACT. Generally with LOBs, an update will result in a delete followed by an insert. This will always be the case when COMPACT is specified. However, in the case of our 50k LOB insert example, a subsequent operation that appended, say, 12k of data would result in an in-place update of the existing segment, and result in less 'waste' at the end of the segment. Use of COMPACT is not recommended in most scenarios, unless storage efficiency is the top priority.

Deletion of a LOB will put its buddy space into a delete-pending state. After commit takes place the segment may be eligible to move to a free state and be reused. If the buddy space of the freed segment is also free, the two will be combined to form a single segment of the next higher size, for example two 64k buddy segments would become a 128k buddy segment, and so on up the chain.  Deleted LOB space cannot be reused until the deleting transaction commits.  However other things can delay reuse as well, such as a lock on the table itself, perhaps held by another transaction.

As already mentioned, DB2 maintains hints of where the next piece of space of a particular size is located. These hints can be maintained by both insert and delete operations. They will also indicate when there is no free space, i.e. a new insert needs to extend the LOB object. This hinting mechanism is however just a single indication. If we assume a situation where some deletes have created some free buddy space segments of various sizes in the “middle” of the table, the hints will be pointing to this newly freed area. When multiple concurrent inserts try to use this space, only one will get it, requiring the other inserts to have to search for free space. The space search could be lengthy – the next free buddy space segment of the right size could be a long way away. Once new free space is found, the hint mechanism will be reestablished, with the likely result that inserts will become quick once again. This kind of situation can lead to erratic insert times – normally quick, but occasionally long, or even very long.



Recent Stories
Tips for getting the best INSERT performance

Statistics Event Monitors - terrible name, terrific feature

Measuring OVERHEAD and TRANSFERRATE on DB2 LUW - the why & how