The Rise and Fall of Dynamic SQL (Performance!)

By George Baklarz posted Nov 14, 2016 01:21 AM

  


In this article we are going to examine some tuning options that a DBA has available to help improve dynamic SQL performance.

The Life and Times of a Database Administrator (DBA)

The life of a database administrator is tough.  There is never enough time to tune all of the aspects of a production database system. First there are the physical aspects of the machine that the database runs on:

  • Machine resources including CPU and memory
  • Disk performance and capacity
  • Network bandwidth

Of course, a few of you will laugh and say “We would be lucky to be able to get access to the real resources!” And once you’re satisfied you have enough capacity, you need to tune within the instance and database including:

  • Instance and Database settings
  • Buffer pool sizes
  • Sort heap and other memory structures

Now when you finally think you have the system ready to run, you have to deal with the applications.  There are endless tuning options available to you including creating:

  • Indexes
  • Statistical views
  • Multidimensional clustering
  • Columnar versus Row-based tables

To make matters worse, sometimes you have to go down to the statement level to get things to work! DB2 can automate a lot of the performance aspects of a database (like using the Self-Tuning Memory Manager, or STMM), but there isn't a lot that can be done with the SQL itself other than relying on the wisdom of the DB2 optimizer.

Optimizing SQL That You Can't See

While creating optimization objects (like Indexes) can help many workloads, sometimes a DBA needs to look at individual SQL statements or optimization levels in order to get suitable performance.

While many of us can optimize statements by hand, sometimes we can't get access to the actual SQL itself. This is one area where we must use indirect approaches to improve the optimization of a statement, especially with dynamic SQL. One way we influence performance is through the SQL optimization setting:

SET CURRENT QUERY OPTIMIZATION = 0;

DB2 has a variety of optimization levels, with level 9 being the most aggressive and level 0 the most minimal. Most of you probably use the default of 5, which is good for the majority of SQL statements. Some workloads are heavily OLTP oriented, which means that doing any kind of advanced optimization is probably a waste of CPU time (i.e. the optimization costs more than the actual run time of the statement!). However, using an optimization level below 2 will result in no query re-write occurring. Workloads that are heavily decision-support oriented benefit from higher optimization levels.
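
The setting is a special register, so before changing it you can check what level a session is currently running with. A quick sketch:

-- Display the optimization class in effect for this session
VALUES CURRENT QUERY OPTIMIZATION;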

To illustrate the impact of optimization levels, you can perform a simple test against the DB2 SAMPLE database. Note: All of the examples are available in a companion file called concentrate.sql for you to try out yourself.

Connect to your DB2 SAMPLE database and create the EXPLAIN tables in the event you don't have them already.

CALL SYSPROC.SYSINSTALLOBJECTS('EXPLAIN', 'C', 
     CAST (NULL AS VARCHAR(128)), CURRENT USER);

The following SQL will force DB2 to optimize a SQL statement with minimal query re-write performed:

SET CURRENT QUERY OPTIMIZATION = 0;
SET CURRENT EXPLAIN SNAPSHOT = YES;

The EXPLAIN SNAPSHOT will allow us to capture the information we need to determine how DB2 rewrote the SQL statement.

This SQL will return three employees from the EMPLOYEE table using the OR operator.

SELECT LASTNAME, SALARY FROM EMPLOYEE 
WHERE
 EMPNO = '000100' OR EMPNO = '000110' OR EMPNO = '000120';

SET CURRENT EXPLAIN SNAPSHOT = NO;

The rewritten SQL can be retrieved by selecting from the EXPLAIN_STATEMENT table:

SELECT STATEMENT_TEXT FROM EXPLAIN_STATEMENT 
WHERE
 EXPLAIN_LEVEL = 'P'; 

The explain results:

SELECT Q1.LASTNAME AS "LASTNAME",
       Q1.SALARY AS "SALARY"
FROM BAKLARZ.EMPLOYEE AS Q1
WHERE
 ((
  (Q1.EMPNO = '000100') OR (Q1.EMPNO = '000110')) OR
  (Q1.EMPNO = '000120')) 

Aside from a number of identifiers added to the predicates, not much has changed from the original statement. That’s probably what you would expect with the minimal optimization level. Now let’s change the optimization level to 9 and reissue the query.
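
For reference, the sequence mirrors the earlier level-0 run. (You may want to clear the explain tables first so the two captures don't mix; deleting from EXPLAIN_INSTANCE cascades to the other explain tables.)

DELETE FROM EXPLAIN_INSTANCE;

SET CURRENT QUERY OPTIMIZATION = 9;
SET CURRENT EXPLAIN SNAPSHOT = YES;

SELECT LASTNAME, SALARY FROM EMPLOYEE 
WHERE
 EMPNO = '000100' OR EMPNO = '000110' OR EMPNO = '000120';

SET CURRENT EXPLAIN SNAPSHOT = NO;

Selecting again from EXPLAIN_STATEMENT returns the rewritten statement: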

SELECT Q3.LASTNAME AS "LASTNAME",
       Q3.SALARY AS "SALARY"
FROM BAKLARZ.EMPLOYEE AS Q3
WHERE Q3.EMPNO IN ('000100', '000110', '000120') 

The original SQL statement has been re-written from an OR statement into an IN list. What would cause the optimizer to change the statement from an OR to an IN list?

First of all, we know that these statements are logically equivalent. What is not the same is how DB2 will handle the processing of the OR statement versus the IN list.

OR processing is done at a higher logical level rather than at the lower physical level. From a very simplistic point of view, think of there being two levels of comparison operators in DB2.

  • One layer is the “logical” level which takes a value and compares it to another one. In the OR example, a value returned from disk would be compared to '000100', then '000110', and finally to '000120'. DB2 would only need to check all three values if a match wasn't found earlier.
  • The second layer is the “physical” layer which retrieves column values from a table. The second layer can be told which patterns to look for before passing the results back to the logical layer. The physical layer can look for the patterns '000100', '000110', and '000120' before passing control back to the logical layer. This is much more efficient than having the logical layer compare the result retrieved from disk to three different values.

This simple example illustrates that there are sometimes things that we can do to influence the optimization of a SQL statement, even though we may not be able to see the actual source code. Although this may result in some improvements in "problem" SQL, it does increase compile time and may not necessarily result in better throughput. 

Optimizing Transactional Dynamic SQL

The general rule of thumb is that analytic SQL statements benefit from higher optimization levels. The reasoning is that the increased compile time will be more than offset by a reduced runtime. Analytic SQL tends to scan hundreds or thousands of rows, so even small optimization improvements add up over many rows.

A transactional (OLTP) statement doesn't have the same luxury of having the optimization overhead amortized over the retrieval cost of hundreds of rows. For OLTP statements, it makes sense to use lower optimization levels and not pay for the extended optimization time. For simple dynamic statements that run against well-indexed tables, an optimization level of zero is sufficient, but if you want DB2 to re-write the SQL you should set it to at least 2 (recall that below level 2 no query re-write occurs).
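
If most of the dynamic SQL in a database is transactional, you can lower the default for every session instead of issuing SET statements in each one. A sketch using the DFT_QUERYOPT database configuration parameter (shown here for the SAMPLE database):

-- Make level 2 the default optimization class for all dynamic SQL
UPDATE DB CONFIG FOR SAMPLE USING DFT_QUERYOPT 2;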

So what can be done to improve the performance of short, transactional queries? This is one type of workload where the default DB2 settings may result in sub-optimal performance.

DB2 has a package cache that keeps frequently run packages and statements. DB2 uses this cache to find statements that were previously executed so that the need to recompile and optimize the statement can be eliminated. The efficiency of the cache is directly proportional to the ability to find a statement in it (the hit ratio). But DB2 isn't that clever when it comes to dynamic SQL.
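
One way to see how well the cache is doing is to compute a rough hit ratio from the database snapshot counters. A sketch using the SYSIBMADM.SNAPDB administrative view (a lookup that does not force a cache insert is counted as a hit; this assumes at least one lookup has occurred):

SELECT DEC(100 * (1 - DOUBLE(PKG_CACHE_INSERTS) / DOUBLE(PKG_CACHE_LOOKUPS)), 5, 2)
       AS PKG_CACHE_HIT_RATIO
FROM   SYSIBMADM.SNAPDB;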

The new generation of web applications and development frameworks is built entirely around dynamic SQL. This means that all of the SQL is generated "on the fly", and programmers tend to write SQL (if they even get to see the SQL) that has the format:

SELECT columns FROM tables 
  WHERE  column1 = 'value1' and column2 = 'value2' …

Every DBA reading this article will instantly say, "That should be re-written with parameter markers!" The statement would be transformed to:

SELECT columns FROM tables WHERE column1 = ? and column2 = ?

However, many of these programming languages don't have the concept of binding parameter markers to a statement. In fact, many programmers aren't aware of parameter markers, so they are quite happy to write their SQL with predicates compared to constants. In addition, the SQL may only be executed once in the application, so preparing and then executing the statement would not make the application any more efficient.

The effect on DB2 performance can be discouraging. DB2 attempts to cache these dynamic SQL statements as they come in, while at the same time checking whether an identical statement is already in the cache that it can reuse. From a DB2 perspective, the following two statements are different because they have different constants in them:

SELECT LASTNAME FROM EMPLOYEE WHERE LASTNAME='GEOFFREY';
SELECT LASTNAME FROM EMPLOYEE WHERE LASTNAME='GEORGE';

We all know that these statements should generate an identical access plan, but DB2 sees them as two separate statements. This results in a package cache full of statements that are never reused: a huge amount of overhead for no benefit.

DB2 9.7 introduced a statement concentrator to try to improve access plan reuse. The concept is very simple:

  • As DB2 scans a statement it removes all constants
  • The constants are replaced with parameter markers
  • The rewritten statement is executed with the original constants used as the parameters for the statement

The two statements that we originally had would be (internally) rewritten to:

SQL='SELECT LASTNAME FROM EMPLOYEE WHERE LASTNAME=?';
PREPARE RUNSQL FROM :SQL;

V1='GEOFFREY';
EXECUTE RUNSQL USING :V1;

V1='GEORGE';
EXECUTE RUNSQL USING :V1;

At this point DB2 can reuse the package cache entry, since the SELECT statement is now in a form that matches the second statement. We eliminate the need to parse and prepare the statement and end up with less overhead associated with the dynamic SQL.

The performance difference when using parameter markers can be significant. The following example shows how the statement concentrator can be turned on and used within a database.

Enabling the Statement Concentrator

To enable the statement concentrator, you must update the database configuration parameter STMT_CONC to LITERALS. For this example, we start by turning it OFF so we can compare the results of running dynamic SQL in both environments.

UPDATE DB CONFIG FOR SAMPLE USING STMT_CONC OFF IMMEDIATE;

A note about the UPDATE command above: even though IMMEDIATE is specified, the setting will only take effect once everyone has disconnected from the database. If you run this example while there is an existing connection, the setting will not change.
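
You can confirm whether the new value is active, or still deferred waiting for the last connection to drop, by querying the SYSIBMADM.DBCFG administrative view:

-- VALUE is the in-memory setting; DEFERRED_VALUE takes effect on next activation
SELECT NAME, VALUE, DEFERRED_VALUE
FROM   SYSIBMADM.DBCFG
WHERE  NAME = 'stmt_conc';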

The SQL example will mimic the tables that are used in the SAMPLE database. There are two tables that are used in the example. The first table (SC.EMPLOYEE) contains the names of the employees working in a company and the second (SC.DEPARTMENTS) has the names of the departments they work in.

CREATE TABLE SC.EMPLOYEE
(
  EMPID INTEGER GENERATED ALWAYS AS IDENTITY,
  FIRSTNAME VARCHAR(30),
  LASTNAME  VARCHAR(30),
  SALARY    DECFLOAT(16),
  DEPTID    INTEGER
) NOT LOGGED INITIALLY; 

CREATE TABLE SC.DEPARTMENTS
(
  DEPTID INTEGER,
  NAME VARCHAR(30)
); 
INSERT INTO SC.departments VALUES
  (123, 'Information Development'),
  (234, 'Quality Assurance'),
  (345, 'Development'),
  (456, 'Support'),
  (567, 'Software'),
  (678, 'Hardware'),
  (789, 'Consulting'),
  (890, 'Human Resources'),
  (901, 'CIO Office'); 

There is no need to create an index on the DEPARTMENTS table because it is so small that a scan will be just as quick as an index lookup. If the department table were larger, it would probably make sense to create one.

To populate the EMPLOYEE table, we use the following SQL that generates approximately 90,000 names.

ALTER TABLE SC.employee ACTIVATE NOT LOGGED INITIALLY;

INSERT INTO SC.employee(firstname, lastname, salary, deptid)
SELECT first1 || '-' || first2 || first3, last1 || last2,
    ROUND(DECFLOAT(RAND()) * 1000000, 2),
   (SELECT deptid FROM SC.departments WHERE deptid/100 =
      INTEGER(RAND(1) * 9 + 1) fetch first row only)
 FROM
   (VALUES 'John','Michael','Andy','Richard','George','Burt',
           'Rex','Monty','Marvin','Morton','Jessy','Boris')
           AS T(first1),
   (VALUES 'Christopher','Anton','Charles','Charlie','Chuck',
           'Tom','Wiederanders','Lester','Claus','Leopold',
           'Theodore','Barney')
           AS S(first2),
   (VALUES '','II.','III.','Sr.','Jr.') AS U(first3),
   (VALUES 'O''','de ','van ','van der ','Mc','du ','von der ','del')
           AS Q(last1),
   (VALUES 'John','Taylor','Smith','Falkner','Holsten','Bryant',
           'Allister','Allen','Miller','Doyle','Green','Schwenk','Linden',
           'Culligan','Cuthbert','Guinty')
           AS R(last2); 

We add an index to the employee table for faster lookups.

CREATE INDEX SC.idx ON SC.employee(empid);
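
With roughly 90,000 rows loaded and the index in place, it is worth collecting statistics so that the optimizer costs the statements accurately. One way to do this from SQL is through ADMIN_CMD (a sketch; RUNSTATS can equally be issued from the command line):

-- Refresh table and index statistics for the optimizer
CALL SYSPROC.ADMIN_CMD(
  'RUNSTATS ON TABLE SC.EMPLOYEE WITH DISTRIBUTION AND INDEXES ALL');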

The next step is to add a table that we can use to count the number of times an UPDATE occurs against the EMPLOYEE table.

CREATE TABLE SC.results
(
  run VARCHAR(40),
  iterations INTEGER
); 

The first run counts the number of times we can execute an update statement without the statement concentrator enabled. The SQL that we are going to run is found below. The procedure will run for 5 seconds and update the salary of random employees in the table.

[ 1] CREATE PROCEDURE SC.workload(IN RUNNO VARCHAR(40))
[ 2]   BEGIN
[ 3]      DECLARE text VARCHAR(250);
[ 4]      DECLARE salary DECFLOAT(16);
[ 5]      DECLARE numrows INTEGER;
[ 6]      DECLARE fullname VARCHAR(128);
[ 7]      DECLARE empid INTEGER;
[ 8]      DECLARE starttime TIMESTAMP;
[ 9]      DECLARE run_count INTEGER DEFAULT 0;
[10]      DECLARE stmt STATEMENT;
[11]      SET numrows = (SELECT COUNT(*) FROM SC.employee);
[12]      SET starttime = CURRENT TIMESTAMP;
[13]      WHILE starttime + 5 seconds > CURRENT TIMESTAMP DO
[14]         SET empid = INTEGER(RAND() * numrows + 1);
[15]         SET text = 'UPDATE SC.employee
              SET salary = salary * 1.02 WHERE empid = ' || empid;
[16]         PREPARE stmt FROM text;
[17]         EXECUTE stmt;
[18]         SET text = 'SET (?, ?, ?) =
            (SELECT empid, salary, firstname || lastname '
                || 'FROM SC.employee WHERE empid = ' || empid || ')';
[19]         PREPARE stmt FROM text;
[20]         EXECUTE stmt INTO empid, salary, fullname;
[21]      COMMIT;
[22]      SET run_count = run_count + 1;
[23]   END WHILE;
[24]   INSERT INTO SC.RESULTS
[25]     VALUES (RUNNO, RUN_COUNT);
[26]   COMMIT;
[27] END

A description of the code is found below. 

[1]         The procedure requires the run description to update the results table when the program completes.

[3-10]    These are the variables that are used by the procedure.

[11]       We need to know the number of rows in the EMPLOYEE table so we don't select an employee outside the range.

[12]       The program will run for 5 seconds, so it needs to know the current timestamp value.

[13]       The WHILE loop will continue until the 5 seconds are up.

[14]       We select a random employee to update the salary (I have a feeling that this is consistent with most salary increase programs).

[15]       The SQL is generated that will update the employee record with a 2% raise.

[16-17] The SQL is prepared and then executed.

[18]       Now we want to get information about the employee we just updated. The SQL will return the employee ID, the salary, and the first and last names concatenated together.

[19-20] The SQL is prepared and executed again.

[21]       Commit the transaction.

[22]       Increment the execution count.

[23]       End of the while loop.

[24-25] Update the table with the total execution count.

The lines that we are concerned about are [15] and [18]. Both of these statements are using literals in the SQL rather than parameter markers. Now that we have the routine created, we can run the procedure with and without the statement concentrator feature turned on.

To invoke the procedure, we just need to use the CALL statement:

CALL SC.WORKLOAD('Concentrator OFF (Simple)');

When the procedure completes (after 5 seconds), you can select from the SC.RESULTS table:

SELECT * FROM SC.RESULTS;

RUN                                       ITERATIONS
---------------------------------------- -----------
Concentrator OFF (Simple)                       1304

DB2 includes an administrative view (SYSIBMADM.TOP_DYNAMIC_SQL) that can tell us which dynamic statements are being run most often. The SQL below has been restricted to return only the statement with the highest number of executions.

SELECT NUM_EXECUTIONS, STMT_TEXT
FROM   SYSIBMADM.TOP_DYNAMIC_SQL
ORDER BY NUM_EXECUTIONS DESC
FETCH FIRST 1 ROWS ONLY; 

The result will be different for your system, but in all cases you will see that the top statement will have only been executed one time.

COUNT          TEXT                                     
-------------- -----------------------------------------
            1 CALL SC.WORKLOAD(?)        

In other words, there is no statement that is being reused in our system. 
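
If you prefer the newer monitoring interfaces, the MON_GET_PKG_CACHE_STMT table function provides a similar view of the dynamic cache. A sketch ('D' restricts the output to dynamic statements; -2 means the current member):

SELECT NUM_EXECUTIONS, VARCHAR(STMT_TEXT, 80) AS STMT_TEXT
FROM   TABLE(MON_GET_PKG_CACHE_STMT('D', NULL, NULL, -2)) AS T
ORDER BY NUM_EXECUTIONS DESC
FETCH FIRST 5 ROWS ONLY;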

Next we turn on the statement concentrator. Note that the option is LITERALS, not ON as one might expect!

UPDATE DB CONFIG FOR SAMPLE USING STMT_CONC LITERALS IMMEDIATE;
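
To keep the comparison fair, you can also discard the entries left over from the first run before starting the second one:

-- Clear all cached dynamic statements so the next run starts clean
FLUSH PACKAGE CACHE DYNAMIC;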

The procedure is re-run with the statement concentrator on.

CALL SC.WORKLOAD('Concentrator ON  (Simple)');

Selecting the results from the SC.RESULTS table will highlight the performance improvements when running with the statement concentrator on:

RUN                                       ITERATIONS
---------------------------------------- -----------
Concentrator OFF (Simple)                       1304
Concentrator ON  (Simple)                       1717

Running with the statement concentrator ON results in roughly 32% more throughput. Examining the dynamic cache explains why this is true:

COUNT          TEXT                                                                            
-------------- --------------------------------------------------------------
         1717 UPDATE SC.employee SET salary = salary * :L0 WHERE empid = :L1      

The statement concentrator has re-used the UPDATE statement 1717 times! All of the overhead of continually preparing this statement has been eliminated.

Triggers, Views, and Additional Complexity

The procedure that we have been using is relatively simple from a SQL perspective. There are no views, triggers, or functions used within the body of the procedure. Usually applications will do the data validation before inserting data into the database, but if you cannot depend on the application, you will need to create triggers to perform the validation. User-defined functions (UDFs) can also be used to simplify SQL coding and perform data validation.

In order to show the additional cost involved with preparing statements that need to include triggers, views, and UDFs, we will add a few additional objects.

First we add three triggers to handle data validation. You could use column level check clauses to accomplish the same task, but that would defeat the purpose of this example!

CREATE TRIGGER SC.trg1
   BEFORE UPDATE ON SC.employee
  REFERENCING NEW AS N FOR EACH ROW
  WHEN (N.SALARY IS NULL)
     SIGNAL SQLSTATE '78000' SET MESSAGE_TEXT = 'Bad salary'; 

CREATE TRIGGER SC.trg2
  BEFORE UPDATE ON SC.employee
  REFERENCING NEW AS N FOR EACH ROW
  WHEN (N.EMPID < 0)
     SIGNAL SQLSTATE '78000' SET MESSAGE_TEXT = 'Bad empid';

CREATE TRIGGER SC.trg3
  BEFORE UPDATE ON SC.employee
  REFERENCING NEW AS N FOR EACH ROW
  WHEN (N.FIRSTNAME IS NULL)
     SIGNAL SQLSTATE '78000' SET MESSAGE_TEXT = 'Bad firstname'; 
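
For comparison, the column-level check clause mentioned earlier would look something like the following sketch, which mirrors the rule in trg2 (shown for illustration only, since using it would defeat the purpose of this example; the constraint name is arbitrary):

-- Declarative alternative to trg2
ALTER TABLE SC.employee
  ADD CONSTRAINT valid_empid CHECK (empid >= 0);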

A view will be created that uses the base EMPLOYEE table and adds an additional column that includes the department name.

CREATE VIEW SC.employeev AS
   SELECT empid,
         firstname,
         lastname,
         salary,
         (select name FROM SC.departments D WHERE D.deptid = E.deptid)
            as deptname
  FROM SC.employee E;

Finally, a user-defined function is added that concatenates the first and last name of the employee to be used in the body of the procedure.

CREATE FUNCTION SC.fullname(firstname VARCHAR(128), lastname VARCHAR(128))
   RETURNS VARCHAR(128)
   BEGIN ATOMIC
    RETURN firstname || lastname;
  END
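
A quick way to sanity-check the function before using it in the procedure (the input values here are arbitrary):

VALUES SC.fullname('George', 'Baklarz');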

Referring back to the original procedure, the following changes are made to the code to include the view and the UDF.

[11]      SET numrows = (SELECT COUNT(*) FROM SC.employeev);
[15]         SET text = 'UPDATE SC.employeev
               SET salary = salary * 1.02 WHERE empid = ' || empid;
[18]         SET text = 'SET (?, ?, ?) = 
            (SELECT empid, salary, SC.fullname(firstname, lastname) '
                 || 'FROM SC.employeev WHERE empid = ' || empid || ')';

A description of the changes is found below.

[11]       The EMPLOYEE table is replaced with the view EMPLOYEEV.

[15]       The UPDATE statement now uses the EMPLOYEEV view.

[18]       The EMPLOYEE table is replaced with the view EMPLOYEEV and the concatenation of the first and last name has been replaced by the FULLNAME function.

The procedure is now re-run twice, once with the concentrator off, and then with the concentrator on.

CALL SC.WORKLOAD('Concentrator OFF (Complex)');

CALL SC.WORKLOAD('Concentrator ON  (Complex)');

Retrieving the contents of the SC.RESULTS table gives the following execution counts:

RUN                                       ITERATIONS
---------------------------------------- -----------
Concentrator OFF (Simple)                       1304
Concentrator ON  (Simple)                       1717
Concentrator OFF (Complex)                       555
Concentrator ON  (Complex)                      1054

Now when running in this more complex environment, the performance improvement from the statement concentrator is a remarkable 90%. Your results will probably not be exactly the same as this: these tests were done in an uncontrolled environment on a MacBook Pro running Windows 7 in a 4GB virtual machine with DB2 11.1.

Conclusions and Additional Thoughts

The statement concentrator can help improve the throughput of your transactional, dynamic SQL. Turning it on is as simple as updating a database parameter and then watching the package cache reuse.

What the statement concentrator is not good for is complex SQL. If you have an analytical or reporting system, the SQL statements tend to be very different from one another, and removing the literals and constants may actually result in worse access paths. If DB2 is optimizing a statement that contains parameter markers, it needs to make assumptions about the parameter (i.e. the potential values it could have). If the optimizer knows the actual value, it can make a more informed decision.

So the decision to use the statement concentrator is dependent on the workloads running on your system. You may not be able to intercept the SQL (or train developers how to write better SQL code), but you can get DB2 to rewrite it so that the package cache can be reused and reduce the amount of overhead involved in optimizing statements. It's worth a try!

George Baklarz
Program Director, WW Core Database Technical Sales
IBM Analytic Platform Group
baklarz@ca.ibm.com

 
