Understanding DB2’s Backup and Restore Statistics
In an effort to help customers determine where performance bottlenecks exist in DB2 backup and restore operations, DB2 writes an extremely useful entry to the db2diag.log file. This entry was first introduced in version 9.7 under registry variable control (DB2_BAR_STATS) and is enabled by default as of 10.1 Fix Pack 2 (the registry variable is no longer required).
Each successful backup and restore operation generates a single record in the db2diag.log file, which provides information on the performance of that operation. The log record is informational and is written at diaglevel 3 (the default) as well as diaglevel 4. With the level of detail provided, one can get a very clear picture of where each thread spent its time while the operation was running. An example of a backup performance statistics db2diag.log entry is as follows:
2016-01-20-11.07.26.882298-300 E243498A1912 LEVEL: Info
PID : 11927628 TID : 2058 PROC : db2sysc
INSTANCE: dmcinnis NODE : 000 DB : PERF
APPHDL : 0-28 APPID: *LOCAL.dmcinnis.160120154344
AUTHID : DMCINNIS HOSTNAME: db2hadev
EDUID : 2058 EDUNAME: db2agent (PERF)
FUNCTION: DB2 UDB, database utilities, sqluxLogDataStats, probe:395
MESSAGE : Performance statistics
DATA #1 : String, 1414 bytes
Parallelism = 10
Number of buffers = 10
Buffer size = 10489856 (2561 4kB pages)
BM# Total I/O MsgQ WaitQ Buffers MBytes
--- -------- -------- -------- -------- -------- --------
000 1420.79 27.63 1269.31 122.62 789 7860
001 1420.74 53.90 1363.71 0.45 1683 16830
002 1420.74 51.27 1331.79 35.27 1508 15074
003 1420.74 39.19 1301.87 77.89 1118 11171
004 1420.74 32.95 1280.14 106.14 951 9507
005 1420.74 39.45 1223.28 156.55 932 9311
006 1420.74 31.31 1214.64 173.34 912 9114
007 1420.74 44.01 1260.33 114.55 1162 11615
008 1420.74 37.83 1246.14 135.36 897 8964
009 1420.74 27.64 1242.88 149.02 768 7676
--- -------- -------- -------- -------- -------- --------
TOT 14207.48 385.25 12734.14 1071.23 10720 107125
MC# Total I/O MsgQ WaitQ Buffers MBytes
--- -------- -------- -------- -------- -------- --------
000 1421.51 1420.33 0.41 0.00 10721 107231
--- -------- -------- -------- -------- -------- --------
TOT 1421.51 1420.33 0.41 0.00 10721 107231
The command issued to generate this db2diag.log entry was:
backup db perf use TSM
The meanings of the various columns in the DB2 utility performance statistics are as follows:
BM#      Buffer manipulator (database reader) thread number
MC#      Media controller (target writer) thread number
Total    Total elapsed time, in seconds, for the thread
I/O      Time spent performing I/O (BM threads read from the database; MC threads write to the target device)
MsgQ     Time spent waiting for an I/O buffer
WaitQ    Time spent waiting for control messages
Buffers  Number of buffers processed by the thread
MBytes   Megabytes of data processed by the thread
How can this information be interpreted?
By using the I/O, MsgQ and WaitQ values we can easily determine where the operation spent most of its time: I/O divided by Total gives the percentage of time spent performing I/O, MsgQ divided by Total gives the percentage of time spent waiting for buffers, and WaitQ divided by Total gives the percentage of time spent waiting for control messages. A simple spreadsheet makes the data easy to interpret. For the TOT rows above, the spreadsheet would look like this:
          I/O %   MsgQ %   WaitQ %
BM TOT     2.71    89.63      7.54
MC TOT    99.92     0.03      0.00
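The same arithmetic can be sketched in a few lines of Python (illustrative only; the figures are copied from the TOT rows of the log entry above):

```python
# Share of thread time spent in I/O, waiting for buffers (MsgQ), and
# waiting for control messages (WaitQ), from the TOT rows above.
def time_split(total, io, msgq, waitq):
    """Return (I/O %, MsgQ %, WaitQ %) of total thread time."""
    return tuple(round(100.0 * t / total, 2) for t in (io, msgq, waitq))

bm = time_split(14207.48, 385.25, 12734.14, 1071.23)  # BM TOT row
mc = time_split(1421.51, 1420.33, 0.41, 0.00)         # MC TOT row

print("BM threads: I/O %.2f%%, MsgQ %.2f%%, WaitQ %.2f%%" % bm)
print("MC thread:  I/O %.2f%%, MsgQ %.2f%%, WaitQ %.2f%%" % mc)
```

This reproduces the 2.71% / 89.63% split for the BM threads and the 99.92% I/O share for the MC thread discussed below.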
What does this data tell us?
This example shows that, on average, each BM thread spent only 2.71% of its time reading data from the database, whereas these same threads spent 89.63% of their time waiting for buffers. We also see that the MC thread spent 99.92% of its time waiting for I/O (in this case, for TSM to respond). Since an MC cannot free a buffer until TSM confirms it has been written, and the BMs have to wait for the MCs to free the buffers, we can conclude that the bottleneck is in writing to the target device. Based on the backup start and backup stop db2diag.log entries, we know that the overall elapsed time for the backup was 23 minutes 41 seconds.
How can you use this information to improve backup performance?
In the previous example there are two approaches that could improve backup performance: allocating more buffers, so the BM threads do not have to wait as long, or increasing the number of target devices (in this case, providing more parallel TSM sessions). Let's look at the effect of adding one additional TSM session, which can be done by executing a command like this:
backup db perf use TSM open 2 sessions
Backup performance statistics recorded in the db2diag.log file would now look something like this:
2016-01-20-11.18.08.651897-300 E252008A2571 LEVEL: Info
PID : 11927628 TID : 16557 PROC : db2sysc
INSTANCE: dmcinnis NODE : 000 DB : PERF
APPHDL : 0-43 APPID: *LOCAL.dmcinnis.160120160849
AUTHID : DMCINNIS HOSTNAME: db2hadev
EDUID : 16557 EDUNAME: db2agent (PERF)
FUNCTION: DB2 UDB, database utilities, sqluxLogDataStats, probe:395
MESSAGE : Performance statistics
DATA #1 : String, 2073 bytes
Parallelism = 19
Number of buffers = 19
Buffer size = 5246976 (1281 4kB pages)
BM# Total I/O MsgQ WaitQ Buffers MBytes
--- -------- -------- -------- -------- -------- --------
000 557.12 0.15 0.14 556.81 4 0
001 556.66 88.66 464.83 0.45 3366 16830
002 556.66 86.43 459.99 7.80 3015 15074
003 556.66 68.08 435.45 51.31 2235 11171
004 556.66 58.99 416.10 80.02 1902 9507
005 556.66 61.79 412.72 80.62 1863 9311
006 556.66 58.42 406.24 90.53 1796 8979
007 556.66 54.30 393.00 107.95 1705 8521
008 556.66 49.11 357.02 149.31 1488 7435
009 556.66 33.14 286.32 236.34 1048 5236
010 556.66 27.67 256.55 271.71 882 4406
011 556.66 17.47 192.77 345.92 598 2985
012 556.66 14.87 163.41 377.97 489 2440
013 556.66 9.95 114.45 431.98 329 1643
014 556.66 8.44 101.43 446.52 306 1529
015 556.66 7.99 103.17 445.25 291 1450
016 556.66 3.05 31.20 522.32 94 469
017 556.66 1.16 8.63 546.84 27 134
018 556.66 0.13 0.46 556.06 1 0
--- -------- -------- -------- -------- -------- --------
TOT 10577.00 649.89 4603.99 5305.79 21439 107125
MC# Total I/O MsgQ WaitQ Buffers MBytes
--- -------- -------- -------- -------- -------- --------
000 557.12 553.32 3.01 0.01 10628 53176
001 557.14 553.07 2.85 0.39 10813 54097
--- -------- -------- -------- -------- -------- --------
TOT 1114.26 1106.39 5.87 0.41 21441 107273
When we put these numbers in a spreadsheet, we get:
          I/O %   MsgQ %   WaitQ %
BM TOT     6.14    43.53     50.16
MC TOT    99.29     0.53      0.04
The elapsed time decreased to 9 minutes 17 seconds, and the average time the BM threads spent waiting for buffers decreased to 43.53%. This is definitely an improvement; however, the BM threads are still spending a lot of time waiting for buffers. One approach that could provide some relief is to increase the number of TSM sessions further.
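As a rough illustration of the gain, overall throughput can be derived from the elapsed times and the roughly 107,125 MB of data moved in each run (figures as quoted above):

```python
# Rough end-to-end throughput for the one-session and two-session runs,
# computed from the elapsed times and the BM MBytes totals reported above.
data_mb = 107125                              # MBytes moved (BM TOT column)
elapsed_s = {"1 TSM session": 23 * 60 + 41,   # 23m41s
             "2 TSM sessions": 9 * 60 + 17}   # 9m17s
for run, secs in elapsed_s.items():
    print("%s: %.1f MB/s" % (run, data_mb / secs))
print("speedup: %.2fx" %
      (elapsed_s["1 TSM session"] / elapsed_s["2 TSM sessions"]))
```

The second session raised throughput from roughly 75 MB/s to roughly 192 MB/s, a better-than-2x speedup, because the MCs were no longer the sole serialization point.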
Does increasing the number of target sessions always help?
We can increase the number of sessions again to see whether it lowers the overall elapsed time even further. The command used to do this is:
backup db perf use TSM open 4 sessions
Now, backup performance statistics recorded in the db2diag.log file might look like this:
2016-01-20-12.02.38.004368-300 E261177A2703 LEVEL: Info
PID : 11927628 TID : 16814 PROC : db2sysc
INSTANCE: dmcinnis NODE : 000 DB : PERF
APPHDL : 0-56 APPID: *LOCAL.dmcinnis.160120164909
AUTHID : DMCINNIS HOSTNAME: db2hadev
EDUID : 16814 EDUNAME: db2agent (PERF)
FUNCTION: DB2 UDB, database utilities, sqluxLogDataStats, probe:395
MESSAGE : Performance statistics
DATA #1 : String, 2205 bytes
Parallelism = 19
Number of buffers = 19
Buffer size = 5246976 (1281 4kB pages)
BM# Total I/O MsgQ WaitQ Buffers MBytes
--- -------- -------- -------- -------- -------- --------
000 806.36 0.26 0.04 806.05 4 0
001 805.81 75.13 727.40 0.62 3366 16830
002 805.81 69.72 722.34 11.37 3015 15074
003 805.81 53.14 686.56 64.31 2235 11171
004 805.81 45.70 630.90 127.67 1902 9507
005 805.81 48.07 654.78 101.47 1863 9311
006 805.81 45.90 654.31 104.14 1796 8979
007 805.81 45.55 605.23 153.63 1705 8521
008 805.81 40.52 486.75 277.29 1488 7435
009 805.81 33.03 301.06 470.84 1048 5236
010 805.81 28.26 222.79 554.00 882 4406
011 805.81 20.67 157.57 627.03 598 2985
012 805.81 17.59 127.94 659.82 489 2440
013 805.81 12.35 94.41 698.75 329 1643
014 805.81 12.50 85.82 707.20 306 1529
015 805.81 10.74 87.29 707.50 291 1450
016 805.81 4.42 27.02 774.28 94 469
017 805.81 1.63 9.11 795.03 27 134
018 805.81 0.40 0.08 805.32 1 0
--- -------- -------- -------- -------- -------- --------
TOT 15311.08 565.66 6281.50 8446.40 21439 107125
MC# Total I/O MsgQ WaitQ Buffers MBytes
--- -------- -------- -------- -------- -------- --------
000 806.39 805.45 0.16 0.00 5062 25319
001 806.36 805.09 0.08 0.33 6522 32630
002 806.32 804.92 0.06 0.31 4972 24874
003 806.34 804.74 0.06 0.30 4887 24449
--- -------- -------- -------- -------- -------- --------
TOT 3225.43 3220.21 0.36 0.95 21443 107273
Increasing the number of TSM sessions to 4 actually had a negative effect, with an elapsed time of 13 minutes and 27 seconds. Note that the DB2 autonomics did not alter the parallelism, the number of buffers, or the buffer size when the number of TSM sessions was increased; it did, however, further reduce the average time the BMs spent waiting for buffers. When we put these numbers in a spreadsheet, we get:
          I/O %   MsgQ %   WaitQ %
BM TOT     3.69    41.03     55.17
MC TOT    99.84     0.01      0.03
We can see that the average time BMs spent waiting for buffers decreased, and the MCs continued to spend all of their time writing. If increasing the number of sessions does not help, then perhaps we need to increase the size of the buffers.
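The trend across the three runs can be checked directly from the TOT rows of the three log entries (an illustrative calculation; the figures are copied from the logs above):

```python
# MsgQ (waiting-for-buffers) share of total BM time across the three runs.
runs = [("1 session",  14207.48, 12734.14),
        ("2 sessions", 10577.00, 4603.99),
        ("4 sessions", 15311.08, 6281.50)]
for name, total, msgq in runs:
    print("%s: %.2f%% of BM time waiting for buffers"
          % (name, 100 * msgq / total))
```

Waiting for buffers drops sharply from one to two sessions (89.63% to 43.53%) but barely moves from two to four (41.03%), while elapsed time gets worse, which is why more sessions alone cannot help here.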
What effect does increasing the util_heap_sz have?
The util_heap_sz configuration parameter affects both the number of buffers and the size of each buffer. By altering the database configuration to double util_heap_sz from 30000 to 60000 4 kB pages, for example with
update db cfg for perf using util_heap_sz 60000
the following results were achieved:
Using 2 TSM sessions completed in 15 minutes and 8 seconds
Using 4 TSM sessions completed in 8 minutes and 2 seconds
In both cases the new buffer size chosen was 10489856 bytes or 2561 4K pages.
Conclusion
The DB2 autonomics engine does an outstanding job of selecting the number of buffers, the buffer size, and the parallelism (the number of BM threads); however, there are factors you can tune that greatly affect the performance of the utility. The two most important tunable factors were discussed in this paper; the factor described next is just as important, but cannot be controlled by tuning parameters.
Although not demonstrated in this paper, one of the most influential factors is the distribution of the data. Each BM thread is assigned an entire tablespace; this single-threaded-per-tablespace architecture has had, and continues to have, the largest impact on backup performance. Ideally, the data in the database would be evenly distributed across all of the tablespaces so that all of the BM threads are kept equally busy. To improve backup performance, use the ADMIN_MOVE_TABLE procedure to distribute the data more evenly across the table spaces. Ideally, all of the BM threads should complete at approximately the same time; with unevenly distributed data, a few threads will continue to run while the remaining threads sit idle. The DB2 autonomic engine examines the number of CPUs on the system and the number of tablespaces defined in the database to set the parallelism appropriately.
The second most influential factor is the size of the utility heap (util_heap_sz). By default, an online backup will consume no more than 50% of the available space in the utility heap, whereas an offline backup will consume up to 95% of it. This value influences both the number of buffers and the size of each buffer allocated. Unless the DB2 server is memory constrained, a util_heap_sz of at least 50,000 pages should be used.
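The buffer choices seen in the earlier examples are consistent with this budget. A quick sanity check (the 95% offline figure is from the paragraph above; util_heap_sz is in 4 kB pages, and the buffer counts and sizes are taken from the logs):

```python
# Do the autonomically chosen backup buffers fit the utility heap budget?
# An offline backup may use up to ~95% of util_heap_sz (in 4 kB pages).
PAGE = 4096

def fits(heap_pages, n_buffers, buffer_bytes, share=0.95):
    """True if n_buffers buffers of buffer_bytes fit in the heap share."""
    return n_buffers * buffer_bytes <= share * heap_pages * PAGE

print(fits(30000, 19, 5246976))    # original heap, 19 smaller buffers
print(fits(60000, 19, 10489856))   # doubled heap, doubled buffer size
print(fits(30000, 19, 10489856))   # the larger buffers would NOT fit before
```

Both observed configurations fit their heap, and the larger 10489856-byte buffers only become possible once the heap is doubled, which matches the results reported above.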
The third most influential factor is the number of target sessions or devices used. DB2 cannot determine whether the target device for a backup is a sequential- or random-access device, so it does not autonomically compute an optimal number of devices to use; the utilities rely on the caller to specify a "reasonable" value. As the previous examples demonstrated, increasing the number of target devices will not always result in a lower overall elapsed time, and experimentation is the only way to determine the optimal number of devices. If backing up to a SAN, there can be a benefit to having DB2 write to multiple files on the same file system. This is achieved by specifying the same target location multiple times, e.g.:
backup db perf to /db2/backup, /db2/backup, /db2/backup
This command results in 3 backup objects being created in the /db2/backup file system, similar to opening multiple TSM sessions as demonstrated earlier in this paper.