Blogs

Quick Guide to Compression for Db2 Column-Organized Tables

By Christopher Drexelius posted Sep 01, 2020 06:02 AM

  


IBM Db2 with BLU Acceleration is commonly used in warehouse environments to facilitate industry-leading analytical processing of data.  This technology is composed of many valuable pieces, but two that we regularly highlight are extreme compression and advanced query performance.  While both of these features are extremely valuable, they are also tightly coupled.  In many cases, the amazing query performance is directly tied to the level of compression for the column-organized tables involved in the queries.  For this reason, it is critical to maximize compression for column-organized tables in order to significantly improve query performance.  This article provides an overview of basic compression concepts for column-organized tables with links to documentation on best practices for maximizing compression/storage savings and consequently, query performance.

Actionable Compression

As you may expect, increased compression leads to less I/O, which often translates into improved query performance.  However, in addition to the I/O savings from compression, Db2 BLU provides actionable compression via encoding.  This means that compressed and encoded values can be evaluated without first being decoded for certain operations such as predicate, join, and group-by processing.

Figure 1: Column-level dictionary that supports actionable compression

In the figure above, we see a column-level dictionary for a column containing the fifty US states separated into different partitions by frequency.  More frequent values are placed into dictionary partitions where shorter codes are assigned.  Each partition is order-preserving which enables us to efficiently apply range predicates by forming upper and/or lower bounds for predicates in code space without decoding any values in a partition.  The column dictionary is not order-preserving across partitions, which allows values to be partitioned based on their actual frequency in the column data.

When extreme compression is combined with SIMD (Single Instruction Multiple Data) processing and other features within the Db2 BLU query engine, this enables significant query performance gains.  Db2 BLU query processing strongly favors encoded values, so the higher the compression, the higher the query performance.

Relationship between Compression and Encoding

Although compression and encoding commonly go together in Db2 with BLU Acceleration, that is not always the case.  The column-level dictionaries discussed above are used to apply encoding for all the values included in the dictionary.  There may be some situations though where we choose to compress and encode values at the column level even with minimal space savings in order to achieve actionable compression and thus improved query performance.

Beyond column-level compression, Db2 BLU also supports page-level compression.  Page-level compression focuses on storage savings rather than query performance.  The goal of page-level compression is to save space and ultimately reduce the number of data pages required for a column.  There are two primary types: specialization that supports actionable compression and non-actionable page compression.  Page-level compression may differ across data pages for a column, so is dynamic based on the data in the page.  Page-level dictionaries are stored in each data page as needed if they provide storage savings.

Figure 2: Page-level compression dictionary

The purpose of specialization page-level compression is to take advantage of clustering within a page to reduce the number of bits required to store an encoded value.  Let’s take a concrete example to better understand this type of page compression.  In the fifty US states column-level dictionary above, we see that the least frequent values in the third dictionary partition such as Alabama and Alaska require 6-bit codes.  If we have a data page for this column that contains only values of Alabama and Alaska in the third dictionary partition, we are able to reduce the encode size for those values from six bits to one bit using a specialization page-level dictionary.  Since there is a direct mapping from the column-level dictionary to the specialization page-level dictionary, this type of page-level compression is still fully actionable.

Non-actionable page compression is used to provide compression for values that are not included in our column-level dictionary.  This technique is used in two important cases.  First, if the table does not yet contain enough data to build a column-level dictionary that is representative of the data in the column, we use a default column-level dictionary that may not provide any encoding.  However, we may still be able to reduce the number of required bits for certain values depending on the page contents and data type.  For example, for an integer column if a data page contains only the values 0 and 1, we do not need to store four bytes per value in the page.  We can instead store one bit per value and still represent all values in the page.  We do support limited predicate evaluation on such page-level compressed values without needing to decompress, but this is not fully actionable compression as described above since it cannot be leveraged for group-by and join operations.

The second type of non-actionable page compression is applied after we see enough data and evolve the dictionary based on the column contents.  The evolved column-level dictionary still may not include all values in the column, especially for higher-cardinality columns.  For unencoded string data type values, we are able to apply Page-Based String Compression, which is also known as Improved Compression for String Data Types.  This feature was first introduced in Db2 11.5.1 and enhanced to provide coverage for more classes of string data in Db2 11.5.4.  Although savings are ultimately data-dependent, customers have experienced significant storage savings with this feature including 3X or more savings beyond all of the other compression techniques already discussed.  This type of page compression primarily looks for patterns in the string data within a page and applies compression.  Since we do not apply this type of page-level compression to values that are encodable with the column-level dictionary, this type of page-level compression is not actionable.

Figure 3: Actual customer storage savings with Page-Based String Compression

Managing Compression for Tables

For Db2 BLU tables, an evolved table-level dictionary is created based on data that is inserted/loaded into a table.  The table-level dictionary is composed of a separate column-level dictionary for each column.  Once an evolved table-level dictionary is created, it is static and not further updated.

Figure 4: Table-level dictionary

An important measure of storage savings is compression ratio which is the ratio of raw data outside of Db2 to size of data stored in data pages.  However, when it comes to actionable compression and query performance, a column’s PCTENCODED value in SYSCAT.COLUMNS is a key indicator. PCTENCODED represents the percentage of values in a column that are encoded using a column-level dictionary and does not account for the storage savings provided by the column-level dictionary or page compression applied to data pages of the column.  Analyzing the PCTENCODED statistic for a table’s columns is discussed more in the Db2 BLU Compression Best Practices.

As described in the previous section, Db2 BLU does include important compression technologies that focus on non-actionable compression.  These page-based compression techniques are often able to maintain compression ratios as data starts to skew since they adjust to the data within a single page.  However, due to the fact that such page-based compression is not actionable, over time query performance may be impacted.

These factors indicate that it is very important to properly manage compression for tables over time in order to maximize both storage savings from compression and query performance.  The Compression Best Practices document below outlines recommendations to properly manage Db2 BLU compression.

Db2 BLU Compression Best Practices

A new paper, “Best Practices to Compress Db2 Column-Organized Tables”, was recently published in April 2020.  The paper gives detailed instructions on measuring the current compression ratio, determining the effectiveness of the column-level dictionaries, and building more effective dictionaries. Steps are provided to maximize compression while populating new tables as well as for managing compression for existing tables. 

Two versions are available for different Db2 form factors:

These best practices apply to Db2 11.5 and later releases.  They are quite similar and primarily differ slightly due to the fact that additional tooling is available for some operations in IIAS.  In the future, these documents will be updated as new features become available that may impact the best practices recommendations.

Summary

Storage savings via compression and query performance are two critical components of Db2 with BLU Acceleration.  These features are in fact tightly coupled, and maximizing compression leads to notably increased query performance.  For this reason, it is critical to generate quality dictionaries for new tables as well as manage compression for existing tables.  The steps provided in the Db2 BLU Compression Best Practices document above enable customers to maximize compression and therefore leverage actionable compression for improved query performance.

 

0 comments
27 views