Selectivity Overrides – Fixing Your Filter Factor Problems
By Joe Geller
Scenario – your query is performing poorly. Checking an Explain, you see that DB2 picked an access path which you think is less than optimal. First, you make sure there are appropriate indexes. Then you look to see if the statement is written in a way that enables DB2 to use those indexes (e.g., expressions on indexed columns prevent index usage, although DB2 for z/OS V11 can rewrite some of those). The code looks good to you. So, what else could be causing the poor choice of access path?
DB2 has a cost based Optimizer, which calculates the estimated cost of each possible access path. One of the major components of cost is the number of rows that are qualified at each step. This in turn is determined by the Filter Factor of each predicate.
DB2 filters out rows that do not meet the predicates; the filter factor of a predicate is the fraction of rows that satisfy it. The term selectivity is used to mean the same thing, and different parts of the DB2 documentation use each term.
- SELECT … FROM CUST WHERE LASTNAME = 'PURCELL'
- If there are 1,000,000 rows in the CUST table and 100 are named PURCELL, then we are selecting 100/1,000,000 = .0001 of the rows (that is 1/10,000)
- The filter factor is .0001
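You can check DB2's estimate against the actual data yourself. A quick way, using the example table and value above, is to compute the real fraction of qualifying rows:

```sql
-- Actual filter factor for LASTNAME = 'PURCELL' on the example CUST table
-- (the DECIMAL casts avoid integer division)
SELECT DECIMAL(SUM(CASE WHEN LASTNAME = 'PURCELL' THEN 1 ELSE 0 END), 15, 6)
       / DECIMAL(COUNT(*), 15, 6) AS actual_filter_factor
FROM CUST;
```

Compare this number with the filter factor shown in Visual Explain; a large discrepancy is your cue to improve the statistics.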
If DB2’s estimate of the filter factor is incorrect, that can clearly lead to an inaccurate cost estimate and therefore DB2 may pick a different access path. Looking at the explain (Visual Explain will clearly show you the filter factors for every predicate), you can see if they match the actual data in the tables. If they don’t, you have to find a way to give DB2 more accurate information.
- If the predicates are independent, you multiply the filter factors
- Sex = ‘M’ AND Salary_grade = 23
- 1/2 * 1/20 = 1/40 (.025), which is 25,000 of 1,000,000 rows
- If they are dependent (such as city and state)
- If DB2 knows they are dependent, then it does not multiply. DB2 will know that columns are not independent if the predicate columns are the leading columns of an index or if you have collected colgroup statistics on those columns.
- LUW – index stats have FIRSTKEYCARD, FIRST2KEYCARD, FIRST3KEYCARD, FIRST4KEYCARD, FULLKEYCARD
- z/OS – index stats have FIRSTKEYCARDF and FULLKEYCARDF, but all intermediate combinations are gathered too and stored in SYSCOLDIST
- BETWEEN – even though BETWEEN is equivalent to a >= predicate AND a <= predicate, DB2 does not treat the two parts as independent
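On z/OS you can see which multi-column cardinalities have already been gathered by querying the catalog. A sketch (the schema and table names are placeholders):

```sql
-- TYPE = 'C' rows hold column-group cardinality statistics;
-- NUMCOLUMNS tells you how many columns are in the group
SELECT NAME, COLGROUPCOLNO, NUMCOLUMNS, TYPE, CARDF
FROM SYSIBM.SYSCOLDIST
WHERE TBOWNER = 'MYSCHEMA'
  AND TBNAME  = 'CUST';
```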
Matchcols or Filter Factors – Which is more important?
Matchcols (start keys in LUW) are the number of leading columns of an index that are used to position within the index. This means there are predicates (usually equality predicates) on each of these columns. We have long been told that higher Matchcols is good for performance. But is that always true? Within a single index, more matching columns is better, but when DB2 has a choice of indexes, then it depends. Consider this example:
A table has 2 indexes:
- IX1 on cola, colb, colc
- IX2 on cold
A query has this Where clause – Where cola=? and colb=? and colc=? and cold=?
- IX1 -> 3 matchcols; IX2 -> 1 matchcol
- Which one is better?
It depends on the filter factors:
- The better index is the one that examines fewer entries and qualifies fewer rows
- Let’s say the table was the people in the world
- If IX1 were country, state, city and the predicates were
- country=‘US’ and state=‘NY’ and city=‘New York’
- 7 million rows would qualify
- If IX2 were SSN (social security number) and the predicate was
- SSN = '123-04-0567'
- 1 row would qualify.
- IX2 is much better
The first and best step is to make sure that you have gathered as many relevant statistics (through Runstats) as possible.
- Run Runstats on a regular basis. It really is important to keep your statistics up to date. Equally important is that the statistics of related objects are in synch. If you create a new index, you should run Runstats on the table and all of its indexes, not just on the new index. And, you should gather statistics on related tables if they are not recent.
- Column cardinality statistics. This is the default for LUW, but not for z/OS. This is the key statistic for accurate filter factors.
- Distribution statistics. If a column has a non-uniform distribution and if you have a predicate on that column and the predicate uses a literal value, then column distribution statistics can be invaluable.
- COLGROUP statistics. If two columns are dependent on each other (such as city and state), then the filter factor should be very different than if they were independent. The way for DB2 to know they are related is to gather colgroup statistics for the set of columns.
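On LUW, the cardinality, distribution, and colgroup statistics described above can be gathered in a single Runstats invocation. A sketch with placeholder names (check the Command Reference for your release for the exact clause order):

```sql
-- Collect cardinality and distribution stats on all columns, plus a
-- column-group statistic on the correlated pair (CITY, STATE)
RUNSTATS ON TABLE MYSCHEMA.CUST
  ON ALL COLUMNS AND COLUMNS ((CITY, STATE))
  WITH DISTRIBUTION
  AND INDEXES ALL;
```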
Literals vs Host Variables (parameter markers)
Generally speaking, using variables in your SQL statements (especially dynamic SQL) is a good technique. For static SQL, it allows you to use one statement with different values passed in. For dynamic SQL it fosters statement reuse. DB2 has a cache of recently executed statements. If the exact same dynamic statement is executed and found in the cache, DB2 does not have to do a prepare and determine the access path. This can save substantial CPU time on each query. However, if a column has a highly skewed distribution, then using a literal in the predicate can enable DB2 to use the distribution statistics (if you have collected them) and may produce a much better access path than it would for a uniform distribution (which DB2 has to assume if you use a variable). Similarly, range predicates can benefit from using a literal in some cases.
A rule of thumb is to use variables most of the time, but use literals where it could make a difference. You should always use variables for ID columns. Otherwise every execution will be a different statement and there will be no reuse.
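As an illustration of the difference (the ORDERS table, the STATUS column, and its skew are hypothetical, not from a real schema):

```sql
-- With a parameter marker, DB2 must assume a uniform distribution:
SELECT ORDER_ID FROM ORDERS WHERE STATUS = ?;

-- With a literal, frequency (distribution) statistics on STATUS can be used.
-- If only 1% of rows are 'OPEN', DB2 can choose an index-based path:
SELECT ORDER_ID FROM ORDERS WHERE STATUS = 'OPEN';
```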
- Problem – program uses variables for static SQL or parameter markers for dynamic SQL because different values are passed in. But, some columns have highly skewed data. Different values would benefit from different access paths. You don’t want to convert to dynamic SQL.
- Solution – REOPT. REOPT is a bind option with 3 choices:
- NONE – process as normal
- ALWAYS – determine the access path for each execution.
- ONCE – determine the access path on the first execution (the assumption here is that the first value is typical)
- Is this good? Well, of course, that depends.
- PLUS – customized access paths
- MINUS – execution time overhead; with REOPT ALWAYS, access path selection is done for every execution
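The option is specified at bind time. A hedged sketch for z/OS (the collection and package names are placeholders):

```sql
-- Redetermine the access path at each execution,
-- using the actual variable values passed in
REBIND PACKAGE(MYCOLL.MYPKG) REOPT(ALWAYS)

-- On LUW the equivalent is a bind option as well,
-- e.g. from the CLP: BIND myprog.bnd REOPT ONCE
```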
On DB2 LUW, there is a mechanism to gather statistics for more complicated relationships. A statistical view is:
- A view you create and collect statistics on
- It is used to capture statistics for more complex relationships – such as joins, subqueries, expressions
- The statistics are used by the Optimizer when you issue a query that has a similar structure and can be matched by the Optimizer
- Create VIEW schema.v1 (cols…) AS ……
- ALTER VIEW schema.v1 enable query optimization;
- Runstats on view schema.v1 with distribution on all columns
- Examples covering a number of situations can be found in this article: http://www.ibm.com/developerworks/data/library/techarticle/dm-1305leverage/
- Statistical Views can be quirky and difficult for more complex cases with multiple query blocks. You may need multiple stat views to cover subsets of the query
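Putting the steps above together, here is a sketch of a statistical view that captures a join relationship (all object names are placeholders):

```sql
-- 1. Create a view embodying the relationship you want the Optimizer to learn
CREATE VIEW MYSCHEMA.SV_ORD_CUST AS
  SELECT C.REGION, O.ORDER_DATE
  FROM MYSCHEMA.ORDERS O, MYSCHEMA.CUST C
  WHERE O.CUST_ID = C.CUST_ID;

-- 2. Mark it as a statistical view
ALTER VIEW MYSCHEMA.SV_ORD_CUST ENABLE QUERY OPTIMIZATION;

-- 3. Collect statistics on it (from the CLP):
-- RUNSTATS ON VIEW MYSCHEMA.SV_ORD_CUST WITH DISTRIBUTION
```

Once the statistics are collected, queries joining ORDERS and CUST on CUST_ID can be matched against the view's statistics by the Optimizer.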
Sometimes it is impossible to provide DB2 with enough information through Runstats. Range predicates with host variables are particular problems. How big a range are you asking for? DB2 has no way of knowing. Other cases include:
- Temp tables
- Expressions and subqueries (DB2 for z/OS does not have Statistical Views, and even though LUW does, they can get complicated to use).
- Joins – It is hard for DB2 to know how many rows will match between the outer and inner tables. RI helps, and statistical views can sometimes provide enough information, but not always. Each join key from the outer table may match a different number of inner rows.
- Highly correlated columns, but colgroup stats have not been collected.
When the access path is not what you want (and of course, the access path that you want may not actually be a better access path), you can tell DB2 what access path to use (access path hint on z/OS or optimization guideline on LUW). Alternatively (and more simply) you can tell DB2 what the filter factor of one or more predicates should be. With this type of hint (or override), you are giving DB2 more information, but DB2 will still pick the access path.
DB2 for z/OS and DB2 LUW have implemented selectivity overrides in completely different fashions. On z/OS, the overrides are done externally to the program and are put in the (fairly new) Access Path Repository. On LUW, the override goes right in the program as part of the SQL statement.
To enable the use of the selectivity clause, a registry variable (DB2_SELECTIVITY) has to be set. The manuals actually only list values of YES or NO for this registry variable; ALL is the same as YES, but enables the use of selectivity in more places.
The selectivity override is a clause in the statement. It goes right after the predicate.
SELECT LASTNME, FIRSTNAME FROM EMPLOYEE
WHERE WORKDEPT = ?
DB2 doesn’t know the distribution for WORKDEPT = ? Is it 10% or 90% of the rows?
SELECT LASTNME, FIRSTNAME FROM EMPLOYEE
WHERE WORKDEPT = ? SELECTIVITY .4
Now DB2 knows that it is 40% (actually, DB2 has been told by the programmer that it is 40%).
The Selectivity clause cannot be used for all predicates. For example, it can’t be used on a BETWEEN. You need to convert a BETWEEN to >= and <= predicates – adding Selectivity to each.
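For example (the .05 values are purely illustrative; choose numbers that reflect your own data):

```sql
-- Instead of: WHERE ORDER_DATE BETWEEN ? AND ?
SELECT ORDER_ID FROM ORDERS
WHERE ORDER_DATE >= ? SELECTIVITY .05
  AND ORDER_DATE <= ? SELECTIVITY .05;
```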
z/OS – The Access Path Repository
The Access Path Repository for access path hints was introduced in V10. Selectivity overrides via the Access Path Repository (APR) came in V11.
APR allows statement level hints based on matching the statement text. This can be done at two levels: Global – any statement in the system that matches; Package based – limited to a specific package.
There are 3 types of hints:
- Access path hints (as was done via the PLAN_TABLE)
- Optimization hints (bind parameters such as REOPT)
- Selectivity overrides
This article will only be discussing the selectivity overrides.
Two sets of tables are used in the process.
- SYSIBM tables
- User tables to feed into the APR tables
There are additional tables used for the other types of hints.
- Populate yourschema.DSN_USERQUERY_TABLE with the statement text
- Do an Explain of the query. This populates the tables DSN_PREDICAT_TABLE and DSN_PREDICATE_SELECTIVITY (the latter with the filter factors that DB2 determined)
- Update the selectivities of the predicates you want to change
- The set of selectivities is a selectivity instance. You can create additional selectivity instances with different overrides for various predicates by inserting additional rows. Each instance is given a weight representing the percentage that will occur at execution time. The full set of instances is called a selectivity profile
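The update step might look like this sketch (the QUERYNO value is made up, and the key columns follow my reading of the DSN_PREDICATE_SELECTIVITY layout; verify them against your own Explain output):

```sql
-- Override the optimizer's estimate for predicate 1 of query 1234,
-- in the first (and here, only) selectivity instance
UPDATE YOURSCHEMA.DSN_PREDICATE_SELECTIVITY
   SET SELECTIVITY = 0.0001
 WHERE QUERYNO  = 1234
   AND PREDNO   = 1
   AND INSTANCE = 1;
```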
Issue BIND QUERY
- This will populate the SYSIBM tables (the APR) with the data from all queries in the DSN… tables. It is recommended to clear old entries out of these user tables before working on a new override, to prevent inadvertently re-applying an unintended change.
- Do an Explain of your query (or bind your package with EXPLAIN(YES)) to verify that the selectivity overrides were used (i.e., a match for the statement was found) and to check the access path to see if you accomplished what you intended
- Execute the query and monitor the performance to see if your new access path actually is better than what you had before
You can get the statement text from your source code, but IBM recommends getting it from SYSIBM.SYSPACKSTMT (for static SQL) or from the dynamic statement cache for dynamic SQL.
Creating multiple instances with different weights allows very detailed modeling of the occurrence rate of execution. However, the optimizer can still only select one access path, regardless of how many instances are specified. This access path may be the best for one instance or another, or it may not be the best for any one combination of filter factors. It may not be worth this much effort.
All the details can be found in
Access Path Repository or Selectivity Clause – Which Do You Prefer?
- APR Advantages:
- External to program. Can modify values without changing and redeploying the code
- Can affect all occurrences of the matching statement, or just those for a particular package
- Can be more detailed
- Selectivity Clause Advantages:
- Internal to the program. Developer is aware of its existence
- If the statement changes, the APR must be kept in synch – redo the query in DSN_USERQUERY_TABLE and BIND QUERY (will someone notice?). Information hiding may be a good attribute for inter-module communication, but it is not a good attribute when the information is hidden from the developer.
- Much simpler process
Limitations and Drawbacks to Selectivity Overrides
Selectivity can be a moving target. As the table grows and the data distribution changes, the filter factors change too. If the table is growing but the number of qualifying rows stays constant over time, the percentage is really going down. For example, if data is never purged, the table size keeps increasing; if the daily volume stays the same, the number of "open" orders stays the same, but their percentage of the total keeps shrinking. The reverse can also happen – data is purged on a regular basis, but the business is growing, so the number of open orders is increasing and may now be a higher percentage of the total rows.
Join predicates are a little trickier to handle. If you want to indicate that an outer row will match 1 row on average, then the selectivity will be 1/cardinality of the inner table (this assumes you know which is the inner table). The default filter factor is 1/max(colcard(t1.joincol), colcard(t2.joincol)).
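For instance, if CUST (the inner table) has 1,000,000 rows and you know each order matches exactly one customer, a sketch on LUW might look like this (assuming your release accepts the clause on a join predicate, which the ALL setting of the registry variable is meant to allow; the table names are hypothetical):

```sql
-- 1/1,000,000 = .000001: each outer ORDERS row matches one CUST row
SELECT O.ORDER_ID, C.REGION
FROM ORDERS O, CUST C
WHERE O.CUST_ID = C.CUST_ID SELECTIVITY 0.000001;
```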
Any access path hint, whether you are just giving DB2 more information (selectivity) or telling DB2 what to do (access path hint) can actually hurt performance if not done with care and with performance testing. You may think you are smarter than DB2, but you may not be.
About the author: Joe Geller is an IBM Champion for Information Management and has been a DB2 consultant since Version 1.2. He's been a Data Architect, DBA, Data Modeler and Software Engineer, often simultaneously. He feels that the best way to learn one aspect of a system is to work on all aspects. Joe specializes in performance tuning. Joe is the author of two books on database systems – DB2 Performance and Development Guide and IMS Administration, Programming and Data Base Design. Currently Joe is working as a database performance specialist at Moody's.