Db2 Columnar Database Page-Based String Compression

Posted By: Ron Liu Technical Content,

WRITTEN BY RON LIU


For many customers, string data dominates storage cost. Frequency-based compression is not effective for high cardinality datasets.  So percentage of values encoded for string data might be low, i.e., < 10%. The following is an example of the common string data that your database deals with.

Product Description VARCHAR
New Blue dress with unicorns girls size 6X
3 Red dress with hearts girls size 6X
1 Red dress with bears girls size 6X
Many Blue uniform shirt boys size 5
Old Red uniform shirt boys size 5

As shown in the table above, although there are repeating words in the column data, each value in the column is distinct hence full string compression can’t apply. Also the values don’t share the same prefix. Hence prefix encoding can’t apply either. So encoding using column dictionary will not have good coverage for these data.

Some string numeric and hex data that is also challenging for encoding using the column dictionary. Some examples of these type of data look like the following.

  • HEX: 59721B038CB9AC5A5DD09055529A64CA4D000000
  • Postal-codes: 95133-7670
  • Telephone Numbers: 1-408-775-6978
  • Date: 12/21/2003 (or other date formats)
  • Time: 11:32:02 (or other time formats)
  • CDR (Call Detail Record for telecommunications): 00650068D34B41799911903603

These data don’t have repeating patterns. But they have no more than 16 distinct characters.

To address these limitations with encoding, Page-based string compression was introduced in Db2 v11.5.4. It uses two different schemes. Page-based string compression type 1 that uses LZ4 compression and type 2 that uses nibble compression (aka simplified Huffman). Supported data types include the following.

  • CHAR, CHAR FOR BIT DATA, GRAPHIC
  • VARCHAR, VARCHAR FOR BIT DATA, VARGRAPHIC
  • BINARY
  • VARBINARY

Page-based string compression type 1 works well with string data with repeating patterns that are of length at least 4 bytes long. Free flowing text as seen in comments, articles, manuals, descriptions, etc. normally work well with type 1.

Page-based string compression type 2 is a new simplified Huffman compressor that encodes each byte using a 4-bit code, which provides 2X compression. It is very effective for compressing high cardinality numeric, hex, date, time, and timestamp data stored in string data types. For type 2, data must have no more than 16 distinct characters per compressed data page. 

Db2 automatically determines which compression scheme to use to have the best possible compression.


Registry Variables

Db2 warehouse does not support fallback so new features are often enabled by default. The registry variable for page-based string compression is DB2_COL_STRING_COMPRESSION . But for Db2 customers, if you are on 11.5.7 or later and have no plan to fallback prior to 11.5.7, it is recommended that you set the following registry variables together for overall best performance and compression.  See the IDUG blog COLUMNAR DATA ENGINE (CDE) STORAGE NEW FEATURE.

db2set DB2_COL_INSERT_GROUPS=YES 

db2set DB2_COL_STRING_COMPRESSION="UNENCODED_STRING:YES" 

db2set DB2_COL_SYNOPSIS_SETTINGS="DEFER_FIRST_SYNOPSIS_TUPLE:YES" 

 

Note, after these settings taking effect, all new data picks up the new compression algorithms. But the existing data stays as it.


Performance

Page-based string compression has insignificant impact on query performance. The CPU overhead for decompressing the data is offset by reduced IO and saving on bufferpool space.  Verified with the "Simple sales analytics workload" and "Complex sales analytics workload" benchmarks, performance difference with code-based string compression enabled vs. disabled is no more than 3%, within test result deviation. Minimal additional memory requirement for holding decompressed data. Decompression buffers are allocated per page and only exist while processing data compressed with the page-based string compression enabled.

With page-based string compression enabled, as proven in the lab, compression ratio for string data normally falls in the range of 2x to 8x. Some significant compression improvements were also observed in customers.

Customer Storage Saving Page-based String Compression Type Additional Info

A Commercial Bank

26% (49.5 TB to 36.5 TB)

1 & 2

First IIAS customer upgraded to Db2 v11.5.4

A Commercial Bank

Type 1: 49% (24.3 TB to 13.4 TB)

Type 2: 17%

1 / 2

Type 2 saving additional 16.5%, estimated with a subset of typical tables

A Government Tax Agency

54%

1

A Countrywide Marketplace Owner

55% (1.7 TB to 800 GB)

1 & 2

Saving over other database vendor for large VARCHAR column


How to Verify if Page-based String Compression is in Effect

The following shell script can be used to verify if page-based string compression is in effect.

# change the schema and table name accordingly

schema=RONLIU

table=CUSTOMER

 
# change the db name accordingly

database=BLUDB

 
# provide the output file name

rptn=dart_report

 
$ db2 connect to $database

 
     Database Connection Information

 
 Database server             = DB2/LINUXX8664 11.5.8

 SQL authorization ID   = RONLIU

 Local database alias      = BLUDB

 
oi=`db2 -x "select TABLEID from syscat.tables where TABSCHEMA='$schema' and TABNAME='$table'"`

tsi==`db2 -x "select TBSPACEID from syscat.tables where TABSCHEMA='$schema' and TABNAME='$table'"`


 
$ time db2dart $database /DC CIO /OI $oi /TSI $tsi /PS 0 /NP 0 /V N /RPTN $rptn 2>&1 > /dev/null

 
real    0m2.34s

user    0m0.52s

sys     0m0.75s

 
# check command execution is successful, it will generate a report, showing the number of String Compressed pages.

 
$ grep "Number of Pages String Compressed" $rptn

         Number of Pages String Compressed = 0

         Number of Pages String Compressed = 0

         Number of Pages String Compressed = 0

         Number of Pages String Compressed = 0

         Number of Pages String Compressed = 0

         Number of Pages String Compressed = 0

         Number of Pages String Compressed = 74

         Number of Pages String Compressed = 0

         Number of Pages String Compressed = 0

         Number of Pages String Compressed = 832

 
# if it complaints object id not found, it is because the table doesn't reside on the current node. You can specify another node to connect to.

# another node for db2dart to connect to. For example

#

#        export DB2NODE=2

#

# then rerun the db2dart command

 


Conclusion

As observed in lab testing and reported by customers, page-base string compression drastically improves storage savings on string data while not impacting and in some cases improving performance. The feature should be turned on in Db2 for best compression.


Ron Liu is a long time IBMer and a developer in the IBM Db2 Columnar Data Engine (CDE) Data Service team. Prior to his current role, he was a developer for IBM Federation Server, performance engineer for IBM Information Server, performance engineer and Java developer for the IBM data integration products in cloud. In his free time, he loves to go hiking in the nature and do wildlife photography.