Db2 Columnar Database Page-Based String Compression
WRITTEN BY RON LIU
For many customers, string data dominates storage cost. Frequency-based compression is not effective for high cardinality datasets. So percentage of values encoded for string data might be low, i.e., < 10%. The following is an example of the common string data that your database deals with.
Product Description VARCHAR |
New Blue dress with unicorns girls size 6X |
3 Red dress with hearts girls size 6X |
1 Red dress with bears girls size 6X |
Many Blue uniform shirt boys size 5 |
Old Red uniform shirt boys size 5 |
As shown in the table above, although there are repeating words in the column data, each value in the column is distinct hence full string compression can’t apply. Also the values don’t share the same prefix. Hence prefix encoding can’t apply either. So encoding using column dictionary will not have good coverage for these data.
Some string numeric and hex data that is also challenging for encoding using the column dictionary. Some examples of these type of data look like the following.
- HEX: 59721B038CB9AC5A5DD09055529A64CA4D000000
- Postal-codes: 95133-7670
- Telephone Numbers: 1-408-775-6978
- Date: 12/21/2003 (or other date formats)
- Time: 11:32:02 (or other time formats)
- CDR (Call Detail Record for telecommunications): 00650068D34B41799911903603
These data don’t have repeating patterns. But they have no more than 16 distinct characters.
To address these limitations with encoding, Page-based string compression was introduced in Db2 v11.5.4. It uses two different schemes. Page-based string compression type 1 that uses LZ4 compression and type 2 that uses nibble compression (aka simplified Huffman). Supported data types include the following.
- CHAR, CHAR FOR BIT DATA, GRAPHIC
- VARCHAR, VARCHAR FOR BIT DATA, VARGRAPHIC
- BINARY
- VARBINARY
Page-based string compression type 1 works well with string data with repeating patterns that are of length at least 4 bytes long. Free flowing text as seen in comments, articles, manuals, descriptions, etc. normally work well with type 1.
Page-based string compression type 2 is a new simplified Huffman compressor that encodes each byte using a 4-bit code, which provides 2X compression. It is very effective for compressing high cardinality numeric, hex, date, time, and timestamp data stored in string data types. For type 2, data must have no more than 16 distinct characters per compressed data page.
Db2 automatically determines which compression scheme to use to have the best possible compression.
Registry Variables
Db2 warehouse does not support fallback so new features are often enabled by default. The registry variable for page-based string compression is DB2_COL_STRING_COMPRESSION . But for Db2 customers, if you are on 11.5.7 or later and have no plan to fallback prior to 11.5.7, it is recommended that you set the following registry variables together for overall best performance and compression. See the IDUG blog COLUMNAR DATA ENGINE (CDE) STORAGE NEW FEATURE.
db2set DB2_COL_INSERT_GROUPS=YES
db2set DB2_COL_STRING_COMPRESSION="UNENCODED_STRING:YES"
db2set DB2_COL_SYNOPSIS_SETTINGS="DEFER_FIRST_SYNOPSIS_TUPLE:YES"
Note, after these settings taking effect, all new data picks up the new compression algorithms. But the existing data stays as it.
Performance
Page-based string compression has insignificant impact on query performance. The CPU overhead for decompressing the data is offset by reduced IO and saving on bufferpool space. Verified with the "Simple sales analytics workload" and "Complex sales analytics workload" benchmarks, performance difference with code-based string compression enabled vs. disabled is no more than 3%, within test result deviation. Minimal additional memory requirement for holding decompressed data. Decompression buffers are allocated per page and only exist while processing data compressed with the page-based string compression enabled.
With page-based string compression enabled, as proven in the lab, compression ratio for string data normally falls in the range of 2x to 8x. Some significant compression improvements were also observed in customers.
Customer | Storage Saving | Page-based String Compression Type | Additional Info |
A Commercial Bank |
26% (49.5 TB to 36.5 TB) |
1 & 2 |
|
A Commercial Bank |
Type 1: 49% (24.3 TB to 13.4 TB) Type 2: 17% |
1 / 2 |
Type 2 saving additional 16.5%, estimated with a subset of typical tables |
A Government Tax Agency |
54% |
1 |
|
A Countrywide Marketplace Owner |
55% (1.7 TB to 800 GB) |
1 & 2 |
How to Verify if Page-based String Compression is in Effect
The following shell script can be used to verify if page-based string compression is in effect.
# change the schema and table name accordingly
schema=RONLIU
table=CUSTOMER
# change the db name accordingly
database=BLUDB
# provide the output file name
rptn=dart_report
$ db2 connect to $database
Database Connection Information
Database server = DB2/LINUXX8664 11.5.8
SQL authorization ID = RONLIU
Local database alias = BLUDB
oi=`db2 -x "select TABLEID from syscat.tables where TABSCHEMA='$schema' and TABNAME='$table'"`
tsi==`db2 -x "select TBSPACEID from syscat.tables where TABSCHEMA='$schema' and TABNAME='$table'"`
$ time db2dart $database /DC CIO /OI $oi /TSI $tsi /PS 0 /NP 0 /V N /RPTN $rptn 2>&1 > /dev/null
real 0m2.34s
user 0m0.52s
sys 0m0.75s
# check command execution is successful, it will generate a report, showing the number of String Compressed pages.
$ grep "Number of Pages String Compressed" $rptn
Number of Pages String Compressed = 0
Number of Pages String Compressed = 0
Number of Pages String Compressed = 0
Number of Pages String Compressed = 0
Number of Pages String Compressed = 0
Number of Pages String Compressed = 0
Number of Pages String Compressed = 74
Number of Pages String Compressed = 0
Number of Pages String Compressed = 0
Number of Pages String Compressed = 832
# if it complaints object id not found, it is because the table doesn't reside on the current node. You can specify another node to connect to.
# another node for db2dart to connect to. For example
#
# export DB2NODE=2
#
# then rerun the db2dart command
Conclusion
As observed in lab testing and reported by customers, page-base string compression drastically improves storage savings on string data while not impacting and in some cases improving performance. The feature should be turned on in Db2 for best compression.
Ron Liu is a long time IBMer and a developer in the IBM Db2 Columnar Data Engine (CDE) Data Service team. Prior to his current role, he was a developer for IBM Federation Server, performance engineer for IBM Information Server, performance engineer and Java developer for the IBM data integration products in cloud. In his free time, he loves to go hiking in the nature and do wildlife photography.