I've got a bit of a challenge around using column statistics to identify
empty columns in a database.
I've got a database of about 450 tables, with nearly 9000 columns, of which
nearly 3000 are numeric. The data comes from external sources, and the
table structures are all generated from an external definition of the
data files. In total, we have around 10 billion rows of data.
We know that some of the data columns are "empty". But unfortunately I
don't just mean that they are always null. In many cases the input data
had either low-values (which we've loaded as nulls) or a packed-decimal
zero (which we've loaded as zero).
While we could generate SQL to test each column individually for containing
only nulls and zeros, running that against roughly 10 billion rows would
take a very long time.
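For what it's worth, the generated SQL would be one probe of roughly this
shape per column (the schema, table, and column names here are hypothetical):

```sql
-- One probe per numeric column; a count of 0 means the column is
-- "empty" in our sense, i.e. it holds nothing but nulls and zeros.
SELECT COUNT(*) AS non_empty_rows
FROM   MYSCHEMA.MYTABLE
WHERE  MYCOLUMN IS NOT NULL
  AND  MYCOLUMN <> 0;
```

Multiply that by nearly 3000 numeric columns over 10 billion rows and you
can see why we'd rather avoid it.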
So we thought we were being clever and would look at the statistics we've
already gathered. And that's where we ran into a problem interpreting them.
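This is roughly the query we're using (a sketch, assuming Db2 for z/OS,
where HIGH2KEY and LOW2KEY are stored in the column's internal format,
hence the HEX(); on Db2 for LUW the equivalent view is SYSCAT.COLUMNS and
the values are character renderings instead):

```sql
-- Sketch of the catalog query; MYSCHEMA is a hypothetical schema name.
SELECT TBNAME, NAME, COLTYPE, COLCARDF,
       HEX(HIGH2KEY) AS HIGH2KEY_HEX,
       HEX(LOW2KEY)  AS LOW2KEY_HEX
FROM   SYSIBM.SYSCOLUMNS
WHERE  TBCREATOR = 'MYSCHEMA'
  AND  COLTYPE   = 'DECIMAL';
```

Here's a typical example of what it turned up.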
We've got a column that is defined as nullable and DECIMAL(9,2). If we
look at the actual data, it is exactly the type of column we want to
identify: every row is either zero or null. But its HIGH2KEY and LOW2KEY
values are 'FF0000000000' and '00F000000000'.
I really don't understand why this is the case. Is it because there is
only one distinct value (apart from null), and since HIGH2KEY and LOW2KEY
are the SECOND highest and lowest values, Db2 has nothing meaningful to
report? Or is it because of the nulls?
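One way I can think of to test the first theory (again a sketch assuming
Db2 for z/OS; the table and column names are hypothetical) is to check the
cardinality statistic for a column we already know is all zeros and nulls:

```sql
-- If the "second highest/lowest" theory holds, a zero-or-null column
-- should report a very low COLCARDF: 1 distinct non-null value (zero),
-- or possibly 2 if nulls are counted as a distinct value.
SELECT NAME, COLCARDF
FROM   SYSIBM.SYSCOLUMNS
WHERE  TBCREATOR = 'MYSCHEMA'
  AND  TBNAME    = 'MYTABLE'
  AND  NAME      = 'MYCOLUMN';
```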
Am I wasting my time here? Should I just go back to evaluating each column
individually?
Any insights gratefully received.