Measuring OVERHEAD and TRANSFERRATE on DB2 LUW - the why & how

First, what are OVERHEAD and TRANSFERRATE?

These parameters are used by the DB2 optimizer to help determine the estimated cost of access plans (and we're usually talking about complex, expensive plans, more than simple transactional plans.)   More specifically, OVERHEAD (let's call it OH) is the average time in milliseconds that the tablespace container disk takes to seek to a new location.   This gives the optimizer an idea of the cost of doing a random disk read (for example, driven by an index-access based plan.)   The current default value for OH is 6.725 ms.

TRANSFERRATE (TR for short) is the amount of time required to read one page of data, assuming the disk head is already in the right place and no seek is required.  (Personally, I think TR is not a "rate", per se, it's a time.  But maybe TRANSFERTIME seemed non-intuitive?   But I digress.)  The current default for TR is 0.32 ms for a 32KB page.   In fact, TR was largely superseded by DEVICE READ RATE (DRR) when automatic storage was introduced.  DRR is a true rate, measured in MB/s, and is independent of page size (which TR is not.)

Simplistically, low values for OH (meaning fast, efficient disk seeks) may encourage the optimizer to pick plans using random reads; the opposite goes for high values of OH.  Low values for TR (meaning bulk disk reads are fast & efficient) could encourage the optimizer to pick plans with large scans in them.   There are many factors in how the optimizer chooses plans, and these are just two - but having the proper values can help avoid bad plans.
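If you're curious what your system is running with right now, the settings are visible in the catalog views.  Something along these lines should show them (a quick sketch - I believe SYSCAT.TABLESPACES carries OVERHEAD and TRANSFERRATE, and SYSCAT.STOGROUPS carries OVERHEAD and DEVICEREADRATE, but double-check the column names on your level of DB2):

# check the current settings for tablespaces and storage groups
db2 "SELECT TBSPACE, OVERHEAD, TRANSFERRATE FROM SYSCAT.TABLESPACES"
db2 "SELECT SGNAME, OVERHEAD, DEVICEREADRATE FROM SYSCAT.STOGROUPS"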

Don't forget - if you have different types of physical storage for different storage groups (or tablespaces, if you're still on DMS), they may have significantly different OH and TR/DRR characteristics.  If they're all using the default settings, then some (particularly your fast & expensive flash or SSD storage) may be suffering more than others (for example, spinning disks, which typically have performance a bit closer to the defaults.)   In cases of mixed storage types, repeating the measurement process for each major type of storage can pay off.

How to determine values for your environment

The default values of these parameters are based on "typical" storage, and they get updated periodically as storage technology gets faster.  Way back in the early v9 days, the default overhead was more than twice as high as it is now.  But the real question is, are the values you're using today (the current defaults?  old defaults?  custom settings from somewhere...?) correct for your system or not?  Due to the relative affordability of solid-state drives (SSDs) and flash storage, many database systems have much faster storage now than they did just a few years ago.   DB2 can't simply change these parameter defaults to line up with SSDs, because many customers still have spinning disks.   It's a problem.

Estimates & ideas sometimes circulate around the user community about what 'good' values might be for a particular type of storage, but I think it's a better idea to just measure it. Example: you'd think that flash storage, with no moving parts, would have basically zero seek time (OVERHEAD = 0).  However, if you measure it, you'll see it's fast, but definitely not zero.  There are other factors, too, that impact OH and TR in unpredictable ways, such as whether the storage is directly attached vs. accessed over a network, or is physically present vs. virtual.   Best to measure what you've got.

It would be nice if DB2 could measure these itself, kind of like CPU_SPEED - but it doesn't.  Instead, we need an IO driver program that can mimic DB2 IO, and report the performance of seek & read operations.  When we choose an IO driver, the 'mimic DB2 IO' part is important.   Some important requirements include:

  • it reads (and writes, but that's less important) in DB2 page sizes of 4KB up to 32 KB
  • it can read and write either randomly (similar to index access) or sequentially (similar to table or index scans)
  • (very important for reads) It can avoid filesystem caching by using DIO / CIO like DB2 does

I'm sure there are many tools that meet these requirements, but the one I'm most familiar with is an open source tool called IOzone (http://www.iozone.org). I won't go into details here of how to build the IOzone executable, mainly because it's a piece of cake - just follow the directions. If you don't have a C compiler, there are a variety of sites with binaries of IOzone for different platforms, ready for download.
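For what it's worth, the build on Linux boils down to something like this (a rough sketch - the tarball name here is made up, and the make target depends on your platform, e.g. linux-AMD64 or AIX):

# build IOzone from source (adjust the tarball name and make target for your platform)
tar xf iozone3_xxx.tar
cd iozone3_xxx/src/current
make linux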

Measuring TRANSFERRATE with IOzone

TR is easier to measure, so let's start with that. This parameter  is about big scans, so we're going to ask IOzone to read <pagesize> x <prefetchsize> bytes sequentially from disk.   In my example, this means 32KB pages and a 32-page extent size, for a total of 1MB per read, but you could choose other combinations. To get a good-sized scan, we're going to ask it to read 1GB (1024 1MB reads) in total.

Note that this test will put a temporary file 1GB in size into the same filesystem as the tablespace container we want to measure, so it's important to make sure that there is sufficient free space there to enable that.
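A quick check with the usual filesystem tools before kicking it off doesn't hurt:

# confirm the container filesystem has at least 1GB free for the temporary file
df -h <container path>        # Linux
df -g <container path>        # AIX (free space in GB blocks)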

iozone -T -t 1 -F <container path>/iozone.tmp -s 1024m -r 1024k -i 0 -i 1 -I
# -t 1 -- run one thread
# -F <path>/<file> -- tells IOzone the location & name of the temporary file to write to / read from
# -s -- file size (1024m = 1GB)
# -r -- record size (1024k = 1MB)
# -i 0 -i 1 -- tests to run: write data out to temp file first, then sequential read
# -I -- use DIO / CIO, to mimic DB2's NO FILESYSTEM CACHING tablespace default

IOzone produces a lot of output, but the line we're looking for looks like this:

 Children see throughput for  1 readers          =  798027.88 kB/sec

In my case, with fast flash storage, I'm getting almost 800 MB/s throughput on these large scans.  This is actually DEVICE READ RATE (DRR), and we can turn it into TRANSFERRATE (TR) by working out the time to read a single 32KB page:

800 MB/s DRR  ->  TR = 32KB / (800 x 1024 KB/s) = 0.0000391 s, or roughly 0.04 ms per 32KB page

That's eight times faster than the default of 0.32ms/32KB page (or 100 MB/s DRR).  Of course, your mileage will vary, but this is a good illustration that defaults aren't necessarily close to actual behavior.
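If you'd rather not do the arithmetic by hand, a bit of awk does the conversion.  Here's a sketch, fed with the raw number from the IOzone output above (it treats IOzone's kB as 1024 bytes, so the result is slightly more conservative than my rounded 800 MB/s):

# convert IOzone's kB/sec throughput into DRR (MB/s) and TR (ms per page)
echo "798027.88" | awk -v page=32768 '{
   drr = $1 / 1024                       # MB/s
   tr  = page / ($1 * 1024) * 1000       # ms to read one page of "page" bytes
   printf "DRR: %.0f MB/s   TR: %.4f ms per %dKB page\n", drr, tr, page/1024 }'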

Measuring OVERHEAD with IOzone (and company)

We can find OVERHEAD in a similar way (with a bit of a plot twist at the end - but we'll come back to that.)   For OH, we want to measure how long a disk seek takes, and the closest thing to that is a random read.   So we'll ask IOzone to do a whole bunch of random reads, and look at how long they take.   Some foreshadowing to the plot twist: we actually want the seek itself, not seek+read, which is what actually gets timed when we do a random read.  Since the read is irrelevant, we're going to use a small read size of 4KB, to minimize the amount of time we spend doing reads we're not interested in, and then get the seek time another way.

# close, but not quite there, for OVERHEAD ...

iozone -T -t 1 -F <container path>/iozone.tmp -s 64m -r 4k -i 0 -i 2 -I
# -t 1 -- run one thread
# -F <path>/<file> -- tells IOzone the location & name of the temporary file to write to / read from
# -s -- file size (64MB)
# -r -- record size (4KB)
# -i 0 -i 2 -- tests to run: write data out to temp file first, then random read
# -I -- use DIO / CIO, to mimic DB2's NO FILESYSTEM CACHING tablespace default

Note - the next bit gets into a little gory detail, so feel free to skip straight to the commands in the following two grey boxes (AIX and Linux).

The above IOzone command is correct, but not quite complete.  As mentioned above, we're interested in the seek time - but IOzone doesn't time or report the seek time on its own.  Instead, it reports a throughput for these random reads, which is really not what we're after.  If you're a coding nerd at all, you could easily change IOzone to measure and report seek performance, but you can also just run it a bit differently - wrapped in another tool that will monitor system calls like seek, etc.

# Step 1 (AIX): run IOzone inside of truss
truss -D -o /tmp/truss.out iozone -T -t 1 -F <container path>/iozone.tmp -s 64m -r 4k -i 0 -i 2 -I

# Step 2 (AIX): grep the truss report for seek times, and summarize
grep "klseek(3, [0-9]*, [0-9]" truss.out
| sed -e 's/:.*$//'
| awk '{TOT=TOT+$1;}END{print "OVERHEAD: " TOT/NR*1000 " (sample size = " NR " reads)" }'

All we did in step 1 was launch the IOzone command with another tool called truss, which "looks over IOzone's shoulder" as it were, and records system call timings to a report file (in this case, in /tmp/truss.out.)  The truss report contains the timings we're interested in - we're no longer interested in the IOzone output for the OH test, so we can ignore it.  Step 2 does the search for seek calls & calculates the average time for us, and in my case, with fast flash storage, gives something like this

OVERHEAD: 0.323102 (sample size = 32768 reads)    [AIX with flash storage]

Even though IOzone is portable across platforms, we have to make a slight adjustment for Linux because truss is an AIX tool. On Linux, we use its cousin, strace.

# Step 1 (Linux): run IOzone inside of strace
strace -o /tmp/strace.out -r -tt -f iozone -T -t 1 -F <container path>/iozone.tmp -s 64m -r 4k -i 0 -i 2 -I

# Step 2 (Linux): grep the strace report for seek times, and summarize
grep "seek.*SEEK_SET" /tmp/strace.out
| sed -e 's/:.*$//'
| awk '{TOT=TOT+$2;}END{print "OVERHEAD: " TOT/NR*1000 " (sample size = " NR " reads)" }'

On my slow Linux desktop machine, OH is understandably MUCH higher than on my big AIX server with flash storage:

OVERHEAD: 6.02563 (sample size = 32768 reads)   [Linux desktop with spinning disk]

A couple of finer points.

First, you may notice that if you run these IOzone measurements a few times, different runs produce slightly different results.  That's pretty normal, at least for modest amounts of variation.  In general it's a good idea to run enough times to get an idea of what a typical / average value is, rather than just taking the first number you get.
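Something as simple as this does the trick - just the same sequential read test from earlier, wrapped in a loop:

# repeat the TR measurement a few times and eyeball the spread
for i in 1 2 3 4 5
do
   iozone -T -t 1 -F <container path>/iozone.tmp -s 1024m -r 1024k -i 0 -i 1 -I | grep "Children see throughput"
done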

Second, you might wonder if it matters when you run this (and be tempted to run it when the system is as idle as possible, to avoid any impact to DB2's disk IO.)  That's perfectly understandable - yet I'd still recommend these tests be run when the system is active and at least reasonably busy.  You want IOzone and hence the optimizer to have a realistic view of storage performance (ie, under typical load) rather than idealized (when IOzone is the only thing running.)

You've measured new OH and TR / DRR - now what?

It's a simple matter of ALTER TABLESPACE / ALTER STOGROUP to apply the new values, but a certain amount of caution is suggested.  Just because the numbers are probably more 'technically correct' than what you have now doesn't mean you'll necessarily get better access plans, or even that you'll see a change at all.  It's kind of like runstats - it's a good idea to have up-to-date, accurate statistics, but it can sometimes happen that a statistics update results in a plan change you don't want.   So, as with other factors that affect the optimizer, testing is highly desirable.
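For reference, applying the values from my flash measurements above would look something like this (the tablespace and storage group names are placeholders, obviously):

# apply the measured values
db2 "ALTER TABLESPACE <tablespace name> OVERHEAD 0.32 TRANSFERRATE 0.04"
db2 "ALTER STOGROUP <storage group name> OVERHEAD 0.32 DEVICE READ RATE 800"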

If you can try out the new values in a test environment first, that's a good idea (of course, in the test environment you'd apply the OH & TR/DRR measured there, not the production values.)  Keep in mind, though, that test storage is rarely identical to production storage, so such a test is not usually 100% reflective of what will happen in production - and the more the test environment differs from production, the less this test tells you.

If you're still feeling cautious when applying new values to the production system, you can apply them to individual tablespaces one-by-one, at least initially, rather than changing them at the STOGROUP level and having them be inherited by all tablespaces in that STOGROUP.  This way you can look at the impact on the subset of access plans that touch the affected tablespaces, and decide if you want to proceed with rolling out the changes.
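One way to gauge the impact is to explain a handful of representative queries before and after the change and compare the plans.  A rough sketch with db2expln (the database name, query file, and tablespace name are placeholders):

# compare plans before and after changing OH / TR on one tablespace
db2expln -d <dbname> -f queries.sql -z \; -g -o plans_before.txt
db2 "ALTER TABLESPACE <tablespace name> OVERHEAD 0.32 TRANSFERRATE 0.04"
db2expln -d <dbname> -f queries.sql -z \; -g -o plans_after.txt
diff plans_before.txt plans_after.txt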
