Finding "Empty" Columns in Large Database using Column Statistics

Philip Nelson

Finding "Empty" Columns in Large Database using Column Statistics
Folks,

I've got a bit of a challenge surrounding the use of column statistics to
determine empty columns in a database.

I've got a database of about 450 tables, with nearly 9000 columns, of which
nearly 3000 are numeric. The data comes from external sources and the
table structures are all generated from an external definition of the
data files. In total, we have around 10 billion rows of data.

We know that some of the data columns are "empty". But, unfortunately, I
don't just mean that they are always null. In many cases, the input data
had either low values (which we've loaded as nulls) or a packed decimal
zero (which we've loaded as zero).

While we could generate SQL to test each column individually for only being
null or zero, this would take a long while to run.

So we thought we were being clever and would look at the statistics we've
gathered. And that's where we ran into a problem with interpreting the
statistics. Here's a typical example.

We've got a column that is defined as nullable and DECIMAL(9,2). If we
look at this column, it is exactly the type of column we want to identify -
it only ever has values of zero or null. But the HIGH2KEY and LOW2KEY values
are 'FF0000000000' and '00F000000000'.

I really don't understand why this is the case. Is it because there is
only one unique value (apart from null), and therefore, because these are the
SECOND highest / lowest values, Db2 doesn't have anything to report here?
Or is it because of the nulls?

Am I wasting my time here, and should I go back to evaluating each column
individually?

Any insights gratefully received.

Phil

Roy Boxwell

Finding "Empty" Columns in Large Database using Column Statistics
(in response to Philip Nelson)
The first byte is the NULL indicator, so hex x'FF' means the rest of the data *must* be ignored because the value is really NULL, whereas x'00' means the value is a real value. In both cases the NULL indicator byte must not be processed as part of the number.



Plus with DECIMAL you have nearly hit the jackpot in converting back to a real number. Here’s how my assembler does it:



DECIMAL is stored in a strange way. The sign nibble is moved to the front, so the first hex nibble after the NULL indicator is the SIGN: an x'F' there becomes an x'C' in the output (positive) and an x'0' becomes an x'D' (negative). The digit nibbles then follow, shifted one nibble to the left compared with normal packed decimal.

So, to convert back: if the sign nibble was x'F', shift the digits back and move an x'C' into the sign position, and we are finished. If it was x'0' (negative), we must first invert all the bits and then move an x'D' into the sign position.
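A minimal SQL sketch of that decode, assuming a nullable DECIMAL(9,2) column (so HEX() of the key gives 12 characters: two for the NULL indicator, one for the sign nibble, nine for the digits) and made-up schema/table/column names; negative values would additionally need their bits inverted, which this does not attempt:

SELECT NAME
     , HEX(LOW2KEY)  AS LOW2KEY_HEX
     , HEX(HIGH2KEY) AS HIGH2KEY_HEX
     , CASE
         WHEN SUBSTR(HEX(LOW2KEY), 1, 2) = 'FF' THEN 'NULL'    -- NULL indicator set
         WHEN SUBSTR(HEX(LOW2KEY), 3, 1) = 'F'                 -- positive sign nibble
              THEN '+' CONCAT SUBSTR(HEX(LOW2KEY), 4, 9)       -- the nine digits
         ELSE 'negative (bits still inverted)'
       END AS LOW2KEY_DECODED
FROM SYSIBM.SYSCOLUMNS
WHERE TBCREATOR = 'MYSCHEMA'       -- hypothetical schema
  AND TBNAME    = 'MYTABLE'        -- hypothetical table
  AND NAME      = 'MYCOLUMN'       -- hypothetical column
WITH UR;

For the values Phil quoted, LOW2KEY ('00F000000000') decodes to +000000000, i.e. a packed zero, and the same CASE applied to HIGH2KEY ('FF0000000000') reports NULL.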



This is only topped by converting DECFLOAT numbers…





Roy Boxwell

SOFTWARE ENGINEERING GmbH and SEGUS Inc.
-Product Development-




Michael Hannan

RE: Finding "Empty" Columns in Large Database using Column Statistics
(in response to Roy Boxwell)

Philip,

I guess you can use RUNSTATS without sampling and check the COLCARDF value first. Obviously any nullable column with COLCARDF > 2 must contain at least one value beyond NULL and zero (or empty), so it can be ruled out immediately; for a NOT NULL column, any COLCARDF > 1 does the same.

For the remaining columns with COLCARDF of 1 or 2, you can probably use the LOW2KEY and HIGH2KEY values, which may contain two values or just one, as long as you know how to interpret them. See the previous answer. I have built SQL to unpick these columns for various data types excepting perhaps DECFLOAT and GRAPHIC types, ha ha. Even normal FLOAT is quite tricky.
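A first pass over the catalog along those lines might look something like this (just a sketch; it assumes RUNSTATS has populated the statistics and simply lists the low-cardinality candidates with their key values in hex for manual interpretation; the schema name is made up):

SELECT TBCREATOR, TBNAME, NAME, COLTYPE, NULLS, COLCARDF
     , HEX(LOW2KEY)  AS LOW2KEY_HEX
     , HEX(HIGH2KEY) AS HIGH2KEY_HEX
FROM SYSIBM.SYSCOLUMNS
WHERE TBCREATOR = 'MYSCHEMA'          -- hypothetical schema
  AND COLCARDF BETWEEN 0 AND 2        -- -1 would mean no statistics collected
ORDER BY TBCREATOR, TBNAME, NAME
WITH UR;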

I would probably generate the SQL myself, so if columns A, B and C are CHAR, DECIMAL and INTEGER, and each has a low COLCARDF (2 or less), then something like

SELECT COUNT(*)              Rows
     , MAX(IFNULL(A, X'00')) MAX_A
     , MAX(IFNULL(B, -1))    MAX_B
     , MAX(IFNULL(C, -1))    MAX_C
FROM table
WHERE (   A <> ''
       OR B <> 0
       OR C <> 0
      )
WITH UR;

Note the predicates are Stage 1 for reasonable efficiency, but the access path will be a tablespace (TS) scan, hopefully.

Run with Degree ANY for zIIP offload (would be worth a try on a large PBR), or run it as remote SQL via TCP/IP to get 60% offload to zIIP.

The SQL answer is a little more exact than the statistics.
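If you wanted to mass-produce those per-table SELECTs, something along these lines could generate the aggregate expressions from the catalog (again just a sketch: it only distinguishes character from numeric columns, uses a made-up schema and table name, and the output still needs the SELECT COUNT(*) / FROM / WHERE skeleton wrapped around it by hand):

SELECT CASE
         WHEN COLTYPE IN ('CHAR', 'VARCHAR')
           THEN ',MAX(IFNULL("' CONCAT NAME CONCAT '", X''00'')) AS MAX_' CONCAT NAME
         ELSE ',MAX(IFNULL("' CONCAT NAME CONCAT '", -1)) AS MAX_' CONCAT NAME
       END AS SELECT_LINE
FROM SYSIBM.SYSCOLUMNS
WHERE TBCREATOR = 'MYSCHEMA'          -- hypothetical schema
  AND TBNAME    = 'MYTABLE'           -- hypothetical table
  AND COLCARDF BETWEEN 0 AND 2
ORDER BY COLNO
WITH UR;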

Michael Hannan,

DB2 Application Performance Specialist

Philip Nelson

Finding "Empty" Columns in Large Database using Column Statistics
(in response to Michael Hannan)
Thanks Roy and Michael.

I've now coded this up successfully for DECIMAL, INTEGER and SMALLINT.

Thankfully I have no DECFLOATs to worry about.

Phil
