SQL Tuning for Db2 - Where to Start by Eddy Coppens
Even on the school benches I noticed that my teachers were not well versed themselves when it came to SQL. They just copied the text from their books onto the blackboard, not giving us much insight into the matter. Once working for a company, I had to learn programming languages which promised I would never have to code a single query myself … until I started writing batch programs in ANSI C on an AIX box. I constructed the most brutal queries I could imagine, and as nights got too short to finish my programs, I finally found the courage to talk to ‘THE DBA’.
I found the talk inspiring and went on searching for the manual that would explain to me how to write proper SQL. I never found it, and the more I talked to people about which strategy I should follow, the less useful the information I seemed to get. The same was true whenever I asked for general rules to decide which index was needed to satisfy a particular query.
Now that I’m a DBA myself, I see many people still struggling with the same issues. Developers hide behind frameworks and once again do not get in touch with SQL. They would rather cache within their programming environment than ask for query tuning or, eventually, an index. Companies don’t seem to realize that query tuning can be beneficial for the overall performance of a database.
The list of actions I will mention for SQL tuning comes from my own experience and is therefore guidance rather than absolute truth …
Keep it simple
Just because a table can be joined with multiple others does not mean you need to write one query that gathers all the data at once. It is sometimes beneficial to chop a gigantic query up into multiple smaller ones and loop through the result sets in your program.
Working for a car leasing company, I created a query that found all related information for each lease contract and later decided in the application whether I could use some of this information. My program was slow until I reworked my queries in a different way: I created a first result set containing only those lease contracts which were relevant and, while looping through them, conditions made clear whether extra data needed to be collected.
The queries I sent to the database were more efficient and relevant and as a bonus my program finished a lot faster.
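The reworked approach can be sketched as follows; the table and column names here are hypothetical, invented only to illustrate the pattern of one lean driver query plus small, targeted follow-up queries:

```sql
-- Driver query: fetch only the contracts that are actually relevant.
SELECT contract_id, status, driver_id
FROM   lease_contract
WHERE  status = 'ACTIVE';

-- Only when the application logic decides extra data is needed for a
-- given contract does it issue a small, targeted follow-up query:
SELECT damage_date, repair_cost
FROM   damage_claim
WHERE  contract_id = ?;   -- parameter marker filled in per contract
```

Each follow-up query touches a single contract, so it can be satisfied by an index lookup instead of being folded into one sprawling multi-way join.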
Understand what you are about to do
It seems trivial, but I have encountered multiple clients stating that a DBA does not need to know the business. I’m fine with that as long as I’m not asked to do query tuning. Whenever I get a query to tune, I want it explained to me what the purpose of the query is and how we want to get there. Like anyone else, I can read the query I’ve been provided, but I do not assume the query is correct to begin with.
I also found out that some of the queries were created without knowing what the capabilities of SQL or the database are, e.g. types of aggregation, pivot tables …
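As one illustration of such a capability, Db2 supports GROUPING SETS, which lets a single pass over the data produce several aggregation levels at once; the table and column names below are assumed for the example:

```sql
-- One pass over the data produces counts per language, counts per
-- year, and a grand total -- no need to UNION three separate queries.
SELECT lang, YEAR(created) AS yr, COUNT(*) AS msgs
FROM   messages
GROUP  BY GROUPING SETS ((lang), (YEAR(created)), ());
```

Developers unaware of this tend to fire three queries, or worse, aggregate in application code.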
Know your data
When talking to developers for the first time, I often let them imagine a table filled with data about people. The question I then ask is whether it is better to search on gender via a character representation or via a numeric one. Some people answer that it would be better to search on a numeric representation of gender. At that time, I can admit that it is indeed better to search on numeric values (1) … next I raise the topic of selectivity. That is the moment I get the attention of all. (I do keep my examples – for simplicity’s sake – unrealistic by telling that chances are big that half of the table’s population is male or female, and leave any other type of gender out of the picture.) A query returning a small subset of the data by specifying a well-chosen WHERE clause is in most cases preferable.
Additionally, I like to point out that fetching result sets from the database that are (too) large is, in an OLTP environment, most of the time wrong by definition. I give the example that people rarely want to browse through more than one screen when looking for something with their favorite internet search engine, and thus it rarely makes sense to have a result set of multiple thousands of records returned.
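Db2 offers clauses for exactly this situation; a sketch, with an assumed orders table, of limiting a result set to one "screen" of data:

```sql
-- Return at most one screenful; the optimizer is also told to favor
-- a plan that delivers the first rows quickly.
SELECT order_id, order_date, total_amount
FROM   orders
ORDER  BY order_date DESC
FETCH FIRST 20 ROWS ONLY
OPTIMIZE FOR 20 ROWS;
```

FETCH FIRST caps what is sent to the client, while OPTIMIZE FOR influences the access plan toward fast delivery of the initial rows.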
Know your SQL
Over the years this statement has proven to be as bold as true. Many years ago, I thought I knew SQL and still I learn more every day. Examples are quickly given – of course depending on the needs of your query:
- in many cases it is more beneficial to use an EXISTS clause instead of an IN subquery
- rather use an equals sign than a LIKE clause
- avoid casting on search fields in the WHERE clause
- if the framework you are using generates badly performing SQL and there is no way to intervene, create a view hiding the complexity and use the view instead
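The last point can be sketched as follows; the view and table names are made up for the example:

```sql
-- Hide the join and filter complexity behind a view; the framework
-- can then generate its trivial SELECT against the view instead.
CREATE VIEW v_open_orders AS
  SELECT o.order_id, c.customer_name, o.order_date
  FROM   orders o
  JOIN   customers c ON c.customer_id = o.customer_id
  WHERE  o.status = 'OPEN';

-- The framework-generated query stays simple:
SELECT * FROM v_open_orders WHERE customer_name = ?;
```

The view keeps the hand-tuned SQL under DBA control while the framework remains blissfully unaware of it.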
Check database statistics for queries already in use
Whenever a query is already in use, Db2 monitoring tools can tell you a lot about how the particular query is doing out in the field. You can check out – amongst many others:
- the total execution time per execution
- the ratio between the number of records read versus those that are fetched, i.e. really needed and returned to the client. The larger the gap between records read and records fetched, the higher the chance that the tables are being scanned, and this is to be avoided.
- whether RUNSTATS has been run on the tables that are part of the query (check out the great article on this topic written by George Baklarz, recently posted on the IDUG Content Blog)
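A sketch of how such numbers can be pulled from the package cache with the MON_GET_PKG_CACHE_STMT table function (column availability can vary by Db2 version, so treat this as a starting point):

```sql
-- Statements with the widest gap between rows read and rows returned
-- are the first candidates for tuning.
SELECT SUBSTR(stmt_text, 1, 60)                     AS stmt,
       num_executions,
       total_act_time / NULLIF(num_executions, 0)   AS avg_act_time,
       rows_read,
       rows_returned
FROM   TABLE(MON_GET_PKG_CACHE_STMT('D', NULL, NULL, -2))
ORDER  BY rows_read DESC
FETCH FIRST 10 ROWS ONLY;
```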
1. Source data given a name
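The original article presents the source tables as an image; based on the discussion that follows, the schema might look something along these lines (all column names and types are assumed):

```sql
-- Small lookup table of supported languages.
CREATE TABLE languages (
  code  CHAR(2)      NOT NULL PRIMARY KEY,   -- e.g. 'EN', 'NL'
  name  VARCHAR(50)  NOT NULL
);

-- Large table of messages, each written in one language.
CREATE TABLE messages (
  id        INTEGER       NOT NULL PRIMARY KEY,
  language  CHAR(2)       NOT NULL,          -- matches languages.code
  msg_text  VARCHAR(1000)
);
```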
2. Rather use “[not] exists” instead of “[not] … in”
Suppose we need to calculate the number of messages written per supported language
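The two query variants are shown in the original as an image; a sketch of what they could look like, with column names assumed:

```sql
-- "Do not": the IN subquery retrieves all LANGUAGES first, and every
-- MESSAGES record is matched against that list.
SELECT language, COUNT(*) AS nbr_messages
FROM   messages
WHERE  language IN (SELECT code FROM languages)
GROUP  BY language;

-- "Do": for each MESSAGES row, EXISTS probes the small LANGUAGES
-- table, which most likely still sits in the bufferpool.
SELECT m.language, COUNT(*) AS nbr_messages
FROM   messages m
WHERE  EXISTS (SELECT 1 FROM languages l WHERE l.code = m.language)
GROUP  BY m.language;
```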
- The difference between the two queries explained
The result of both queries is the same, and if the table in the subquery contains but a few records, it may appear that the performance of both is the same; yet there is a huge difference between the two queries. The “do not” query will first retrieve all LANGUAGES available within the database and match each record of the table MESSAGES against that list for a matching language. The “do” query will check, for each row within MESSAGES, whether a matching language is found in the table LANGUAGES. As the result set of the subquery is much smaller, the result might still be found in the bufferpool, and a read from disk is therefore avoided.
3. Performing necessary functions on columns used to filter on
- When the contents of the field to be searched on are undefined, it can be beneficial to add a generated column to which the function is applied and create an index on this generated column
- It would be much better to know upfront which variations the content of a field can contain, check for these when the data is inserted into the database, and perform the same mutations on the entered search criterion
- An index on expression could also be an option
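A sketch of the non-indexable predicate and the two index-based alternatives, using the assumed MESSAGES table (exact syntax may differ per Db2 version; indexes on expressions arrived in Db2 LUW 10.5):

```sql
-- Non-indexable: UPPER() must be applied to every row before comparing.
SELECT id FROM messages WHERE UPPER(msg_text) = 'ERROR';

-- Alternative 1: a generated column, with an index on it.
ALTER TABLE messages
  ADD COLUMN msg_text_u VARCHAR(1000)
      GENERATED ALWAYS AS (UPPER(msg_text));
CREATE INDEX ix_messages_textu ON messages (msg_text_u);

-- Alternative 2: an index on expression.
CREATE INDEX ix_messages_upper ON messages (UPPER(msg_text));
```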
- The difference between the two queries explained
The function has as a consequence that every single row of the table MESSAGES is read and the function applied to the field before its contents are checked against the searched value. This is because applying the function to the column makes it non-indexable.
4. A real-life example
Finding a good example to which all of this advice is applicable is nearly impossible, but this one is a good starting point.
As one can imagine, some of my own rules are hard to apply here, and keeping it simple is not one of them. I rewrote it as follows, gaining three times the performance it had before:
- I talked with the developer before I started as I didn’t understand what the outcome of the query had to be
- I changed the IN clause into an EXISTS
- The ordering within didn’t make sense and had to go. The way the ORDER BY was written can only lead to slow performance, as last_updated has to be calculated, and maybe replaced by the value of creationdate, before the ordering can start.
- Instead of using the max-select as a subquery, I moved it to the WHERE clause. I found this useful for better readability
Many people may find my rules of thumb too strict or not representative (enough) for their environment, and they are probably right, but now you have the possibility to compare the results of your test cases against what I’ve found. Each of the described cases led to a conclusion after severe testing and trying to solve the problem at hand. I do hope these scenarios give you a starting point for your own set of conclusions which do the magic for you.
(1) Many debates have been held on this topic and, as can be read in many different opinions, you only really know once you set up a complete test set. I wish to keep those remarks which seemed logical and relevant to me:
- Usage of numeric versus alphabetic primary keys: make them numeric and thus application independent
- Usage of numeric versus alphabetic fields in joins: as the length of a primary key grows faster when choosing an alphabetic one, it will slow you down over time
- As the numeric representations of e.g. primary keys are smaller in the database, they are the better choice