Tips for getting the best INSERT performance

While the LOAD utility definitely provides the biggest and fastest "pipe" to get data into DB2, it has some obvious limitations - the main one being that the data is not accessible until the LOAD completes.  Even the ALLOW READ ACCESS option (which allowed access to pre-existing data in the table) is now deprecated.  Among the utilities, INGEST is probably the best choice if you need access to the data while it's being brought in, but it has its own set of restrictions too.

So that often leaves good old INSERT - easy to use, flexible, portable, ubiquitous.  When you've got a lot of data to insert and performance really matters, there are some steps you can take to make things go quicker.  A couple of these are up to the DBA, but most involve application or schema changes.  The recommendations below probably have the biggest impact on bulk insert operations, but they're still worth paying attention to even if INSERT is just one part of a bigger performance-sensitive transaction.

Basic Stuff

Inserted data goes into the bufferpool and is written out to disk by the page cleaners at a later time.  This is good for performance, since the disk write isn't part of the INSERT operation itself.  However, all inserted (and updated and deleted) data gets written into the transaction log buffer, and this operation is part of the INSERT.  The contents of the transaction log buffer may get written to disk during the INSERT, but only if the buffer becomes full.  Otherwise, the log buffer isn't flushed until COMMIT time.  Disk IO isn't the fastest thing in the world, so we want to minimize both the time each log write takes and the number of log writes we do.  This leads us to our first couple of recommendations.

Make sure your log disk performance is up to scratch

A good target is 2-4 ms, but the lower, the better.  Average log write time can be calculated from MON_GET_TRANSACTION_LOG() as LOG_WRITE_TIME divided by NUM_LOG_WRITE_IO.  Also, if your system has been around a long time, double-check that the log buffer size (LOGBUFSZ) is reasonable.  2048 pages is usually a good size.
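
For example, a minimal monitoring query along these lines gives the average log write time per member (the -2 argument to MON_GET_TRANSACTION_LOG means all members):

-- Average log write time (ms) per member; aim for low single digits
SELECT MEMBER,
       DOUBLE(LOG_WRITE_TIME) / NULLIF(NUM_LOG_WRITE_IO, 0) AS AVG_LOG_WRITE_MS,
       NUM_LOG_WRITE_IO
FROM   TABLE(MON_GET_TRANSACTION_LOG(-2)) AS t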

Don't COMMIT too frequently

Each COMMIT drives a log write, so it's best to include many inserted rows in each unit of work, to amortize the cost of the log write.  Something on the order of 1,000 to 10,000 rows is usually about right.  You certainly don't want ODBC or JDBC AUTOCOMMIT left enabled!

This is especially important with BLU (column-organized) tables.  Optimum performance is achieved when many rows (thousands) are inserted into the target table in the same unit of work.  Single-row INSERT transactions into BLU tables can suffer from quite poor performance - typically much more so than row-based tables.  That said, newer releases like v11.1.1.1 have been offering steady improvements - although maybe not enough to ignore this effect.

Don't have an excessive number of indexes or constraints on the target table

Any existing indexes on the table will be updated synchronously for each row inserted.  While the cost of this is negligible for 1 or 2 or 3 indexes, once you get up to 8 or 10 or more, it can really add up.  A discussion of how to find unused indexes is probably best left for another blog entry, so for now let's just say that if you've got 10 or more indexes on a heavily-inserted table, it's worth reviewing why each index was created, and what statements would be impacted by removing them.
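
If you want a rough starting point for that review, the monitoring table functions can help.  Here's a minimal sketch (MYSCHEMA and MYTABLE are made-up names; the MON_GET_INDEX counters reset at database activation, so treat a low number as a hint, not proof):

-- Indexes on a table ordered by how often they've actually been scanned since activation
SELECT i.INDNAME, SUM(m.INDEX_SCANS) AS INDEX_SCANS
FROM   TABLE(MON_GET_INDEX('MYSCHEMA', 'MYTABLE', -2)) AS m
       JOIN SYSCAT.INDEXES AS i
         ON i.TABSCHEMA = m.TABSCHEMA AND i.TABNAME = m.TABNAME AND i.IID = m.IID
GROUP BY i.INDNAME
ORDER BY 2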

While the performance of the log is the DBA's responsibility (as often is the indexing strategy), the rate of COMMITs is up to the application developer.  Let's throw in another really important one for the app folks to think about:

Don't use literals in INSERT statements

DB2 can re-use the compiled form of SQL statements that are executed repeatedly.  But if the statement text differs at all (even by, say, the value of a literal in an INSERT statement), then the statement needs to be compiled from scratch.  Because INSERTs are typically very simple to execute, the time spent compiling them can represent a big chunk of total execution time.  Spending a bit more effort in the application to PREPARE the statement once and then re-execute it with new parameter values can make a huge difference in performance.
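
Just to illustrate the shape of the difference (not a full program): every literal-laden INSERT below is a different statement text and gets compiled separately, while the parameter-marker version is one statement text, prepared once and executed over and over with new values.

-- Literals: each row produces a brand new statement to compile
INSERT INTO t(c1, c2) VALUES (1, 2)
INSERT INTO t(c1, c2) VALUES (3, 4)

-- Parameter markers: one statement, prepared once, executed many times
INSERT INTO t(c1, c2) VALUES (?, ?)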

(What about the statement concentrator, i.e. STMT_CONC=LITERALS?  This may help if the application can't be changed, but it's nowhere near as effective as using parameters.  Plus, it needs to be used carefully to avoid side-effects.  If you're going this route, best to turn it on at the connection level, rather than at the database configuration level.)

These are good, basic "table stakes" things to do, and they can make quite a difference (see below for some measurements to prove the point.)  But we can do much more.

Slightly more sophisticated stuff

The idea of not committing too frequently is to take an expensive operation (the log write) and increase its 'payload' - make it do much more for only a small increase in cost.  We can apply the same technique to the INSERT itself.

Insert multiple rows at a time

Although few people tend to do this in the SQL itself, it's the easiest way to show what I mean.  Instead of

insert into t(c1,c2) values (1,2)
insert into t(c1,c2) values (3,4)

we do

insert into t(c1,c2) values (1,2),(3,4)

(and please forgive me for using literals here instead of parameter markers - I'm just trying to show the basic principle.)

In practice, you might have one prepared statement (with parameter markers, of course) to do batches of, say, 10 rows at a time, and another one to do one row at a time, when you have less than 10 rows left to insert.  You get the idea.

If you're like most people and use JDBC or ODBC, then it's a little different.  The SQL only specifies one row, but you use ODBC / JDBC constructs to tell the driver how many rows' worth of data you want to push down in a single execution.

/* Multiple rows at a time in ODBC / CLI */
SQLCHAR *stmt = (SQLCHAR *) "INSERT INTO t(c1, c2) VALUES(?, ?)";
SQLSMALLINT parameter1[] = { 121,       131,         141 };
char parameter2[][20]    = { "Toronto", "Vancouver", "Ottawa" };
int row_array_size = 3;
:
cliRC = SQLPrepare(hstmt, stmt, SQL_NTS);
:
/* Tell the driver each execution carries row_array_size rows' worth of parameter values */
cliRC = SQLSetStmtAttr(hstmt, SQL_ATTR_PARAMSET_SIZE, (SQLPOINTER) row_array_size, 0);
:
cliRC = SQLBindParameter(hstmt, 1, SQL_PARAM_INPUT, SQL_C_SHORT, SQL_SMALLINT, 0, 0, parameter1, 0, NULL);
cliRC = SQLBindParameter(hstmt, 2, SQL_PARAM_INPUT, SQL_C_CHAR, SQL_VARCHAR, 20, 0, parameter2, 20, NULL);
cliRC = SQLExecute(hstmt);

/* Multiple rows at a time in JDBC */
PreparedStatement prepStmt = con.prepareStatement("INSERT INTO t VALUES (?,?)");
for (MyObject object : objects) {   // MyObject / objects stand in for whatever holds your input rows
   prepStmt.setInt   (1, object.parameter1());
   prepStmt.setString(2, object.parameter2());
   prepStmt.addBatch ();
}
int[] numUpdates = prepStmt.executeBatch(); // Generally you would also do this periodically inside the loop

Parallelize bulk insert work

Using parameter markers and batching rows together into a single INSERT are both important ways to improve insert throughput on a single connection when ingesting large volumes of data.  These improvements can be significantly boosted by running multiple jobs in parallel, giving each a disjoint set or range of input data to work on.  (Note that when I talk about 'parallelizing' here, I don't mean query-level SMP parallelism, aka INTRA_PARALLEL etc., which generally has no real effect on INSERT.)  Of course, running multiple jobs in parallel can introduce some additional contention, due to multiple streams needing to update the table's and indexes' data structures concurrently.  It can be helpful to start with a small degree of parallelism (e.g. two jobs in parallel) and grow from there, monitoring total throughput as parallelism increases, to find the optimal configuration.  The 'simpler' the target table is (fewer indexes and constraints, few or no triggers, etc.), the more this approach will scale.

Consider using range partitioning on the target table

As mentioned above, parallelized, high-throughput inserts can create contention, both in the space allocation mechanism of the table, as well as in the key insertion code of the indexes.  One way to alleviate that is by range partitioning the table, and using local (partitioned) indexes where possible.  Each partition has its own free space control structures, and these can be operated on efficiently in parallel, helping improve the scalability of parallelizing the insert work.  Likewise, when partitioned indexes can be used, each per-partition local index will have its own root page & other structures, independent from indexes on other partitions.

To be effective in this case, you'll need a partitioning strategy where all partitions are equally active in terms of the volume of rows being inserted at any given time.  An example approach is to partition by the lowest digit of something like order number or timestamp, giving ten partitions which all see plenty of activity at the same time.  On the other hand, partitioning by a strictly increasing value like date or time (or full timestamp) would not be a good choice to distribute heavy insert activity, since only one partition (the latest) would see activity at a time.
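
Here's a minimal sketch of the lowest-digit idea (a hypothetical ORDERS table; a generated column exposes the last digit of the order number as the partitioning column - check the restrictions on generated columns in table partitioning keys at your level):

-- Ten equally-active ranges based on the last digit of the order number, plus a partitioned (local) index
CREATE TABLE orders (
   order_no     BIGINT NOT NULL,
   order_ts     TIMESTAMP NOT NULL,
   order_detail VARCHAR(200),
   last_digit   SMALLINT GENERATED ALWAYS AS (SMALLINT(MOD(order_no, 10)))
)
PARTITION BY RANGE (last_digit)
   (STARTING FROM 0 ENDING AT 9 EVERY 1);

CREATE INDEX ix_orders_no ON orders (order_no) PARTITIONED;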

Use insert buffering with Database Partitioning Feature (DPF) databases

Normally, when a row is inserted into a DPF database, it is forwarded from the coordinator node to the data node it hashes to, as part of the INSERT operation.  This can potentially be fairly expensive, since it may incur an extra network transfer, depending on how the cluster is laid out.  When insert buffering is enabled, the coordinator maintains a buffer for each target data partition.  These buffers are filled with the appropriate rows (ones which hash to that particular data partition), until either the buffer is full, or one of a list of 'triggering' operations is performed by the application (e.g. COMMIT, etc.)   The extra hop from coordinator to data node still has to happen, but now it's much more efficient since it passes a buffer full of rows each time, instead of just one row.  The performance improvement using this technique can be very large.

Insert buffering is enabled for static SQL applications with the INSERT BUF precompile or bind option.  For ODBC or JDBC applications, you can rebind the CLI packages with the INSERT BUF option and put them under a different collection ID, which your application can choose at run time.  IBM Support has a document with more details.

# Create the CLI package collection with INSERT BUF enabled
db2 bind @db2cli.lst blocking all grant public insert buf collection INSBUF

// Once the new collection is created, specify it as a property of the data source when connecting
javax.sql.DataSource ds = null;
ds = new com.ibm.db2.jcc.DB2SimpleDataSource();
((com.ibm.db2.jcc.DB2BaseDataSource) ds).setServerName(argv[0]);
((com.ibm.db2.jcc.DB2BaseDataSource) ds).setPortNumber(prt);
((com.ibm.db2.jcc.DB2BaseDataSource) ds).setDatabaseName("sample");
((com.ibm.db2.jcc.DB2BaseDataSource) ds).setDriverType(4);
((com.ibm.db2.jcc.DB2BaseDataSource) ds).setCurrentPackageSet("INSBUF");
con = ds.getConnection(argv[2], argv[3]);

Insert buffering can also be enabled at the statement level in ODBC by setting a statement attribute.

Possibly consider APPEND MODE

When APPEND MODE is enabled for a table, DB2 no longer keeps track of space freed up by deleted rows; on INSERT, it just goes to the end of the table to add new rows, regardless of whether there might be free space elsewhere in the table.  Avoiding the processing time and possible contention of the table's free-space control structures can give a significant boost, especially in highly parallelized environments.

Sometimes this technique can be a really big win, but APPEND MODE should nonetheless be approached with some caution.  Any space freed up by DELETE or an overflowing UPDATE cannot be reused, so the table may grow faster than it would have with APPEND MODE off.  If DELETEs don't happen (or are at least very rare), or if the table is just temporary (e.g. for staging of inserted data), then maintaining and searching free-space data isn't really valuable, and APPEND would have little or no downside.  If you do turn on APPEND MODE and later find you need to get that deleted space back, you'll need to do a REORG.
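
For reference, append mode is just a table attribute (the table name below is made up; the REORG, shown here via ADMIN_CMD, is only needed if you want to reclaim deleted space afterward):

-- Turn append mode on for the bulk insert, and back off again when it's done
ALTER TABLE stage_orders APPEND ON;
ALTER TABLE stage_orders APPEND OFF;

-- If deletes happened while append mode was on, reclaim that space with a REORG
CALL SYSPROC.ADMIN_CMD('REORG TABLE stage_orders');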

If APPEND MODE is too heavy-handed for you, another option is to dial back the effort spent searching for (possibly non-existent) free space, by setting the registry variable DB2MAXFSCRSEARCH to something lower than the default of 5.  Try 2 or 3 instead.  But be advised that this applies to the whole instance, not just one table as APPEND does.

A bit about DB2 pureScale

All the advice above (except the part about buffered insert in DPF) applies here as well, but there are some pureScale-specific tips for better insert performance.  Like DB2 Parallel Sysplex on the mainframe, pureScale is a shared-data clustered architecture, geared toward transactional systems with very high throughput demands.  Most pureScale systems are set up to have intense, concurrent activity on all members, meaning that the kind of heavy bulk insert activity we're talking about here is running in parallel across the whole cluster. This creates a potential for more contention that we need to consider.

Use a large extent size

In a bulk insert scenario, the tablespace extent size determines the rate at which new extents need to be acquired for the target table. The more frequently we need new extents, the more interruption there is to the smooth flow of inserting rows, due to needing to contact the Cluster Caching Facility (CF) to reserve a new extent.  Extent size also controls the size of a per-member, per-table cache that's used to store new rows being inserted into the table.  A larger extent size means a bigger cache, and less contention across the cluster.  The default size of 32 pages is usually sufficient, but for very high insert rates, larger sizes (up to 256 pages) can be beneficial.
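
Extent size is set when the tablespace is created and can't be changed afterward.  A minimal sketch, assuming an automatic storage database and a made-up tablespace name:

-- Larger extents mean fewer trips to the CF to reserve space during heavy insert activity
CREATE TABLESPACE ts_hot_inserts EXTENTSIZE 256;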

Make sure your CF interconnect performance is good

Heavy insert work will drive a lot of traffic from the members to the CF, and then back again for castout.  Just like with high transaction log response times, if CF response times are high, then INSERT throughput will suffer. With an RDMA-based interconnect (e.g. RoCE Ethernet) you should see overall average CF message times under around 200 microseconds, and no serious spike in times during heavy insert activity.  A sockets-based interconnect often provides very good everyday performance, but might not be up to the task for very heavy insert activity.

For indexes with strictly increasing key values, consider using RANDOM

Many applications have indexes based on strictly increasing values, like timestamp or order number.  This tends to focus insert activity on the 'top end' of the index, which can create contention between members as the highest index page is reclaimed back and forth between members wanting to insert new rows.  The RANDOM keyword hashes the column value, which spreads out the hot spot and can reduce contention.  On the plus side, it's compatible with unique indexes (useful for primary keys!), but on the potential minus side, it can affect scan plans on the table.  So it's a good idea to verify that there's no heavy dependence on start/stop key scan plans on the table.  In typical use cases, the net effect (benefit vs. cost) is very positive.
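
The keyword goes on the column in the index definition.  A minimal sketch, assuming a level that supports RANDOM ordering on index columns and a hypothetical ORDERS table with a strictly increasing ORDER_NO column:

-- Spread inserts of ever-increasing key values across the index instead of piling onto the last leaf page
CREATE UNIQUE INDEX ix_orders_no_rnd ON orders (order_no RANDOM);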

Consider range partitioning by CURRENT MEMBER

This variant on the common technique of range partitioning is tailored for pureScale.  By range partitioning the table based on CURRENT MEMBER (the number of the member executing the INSERT statement), each member has its own partition to insert rows into, eliminating a lot of the cross-member contention that can arise.  Again, using partitioned indexes where possible will help get the most out of this.
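
A minimal sketch of the pattern (hypothetical table; this assumes your level supports CURRENT MEMBER as a column default, and a four-member cluster - adjust the ranges to match your member numbers):

-- Each member inserts into its own range; the hidden column is never seen by applications
CREATE TABLE orders_ps (
   order_no   BIGINT NOT NULL,
   order_ts   TIMESTAMP NOT NULL,
   insert_mbr INTEGER NOT NULL DEFAULT CURRENT MEMBER IMPLICITLY HIDDEN
)
PARTITION BY RANGE (insert_mbr)
   (STARTING FROM 0 ENDING AT 3 EVERY 1);

CREATE INDEX ix_orders_ps_no ON orders_ps (order_no) PARTITIONED;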

Maybe just one member, and Explicit Hierarchical Locking?

If you happen to be running your insert job on just one member, you don't have to worry about cross-member contention, which can simplify things a lot.  But if there is no activity on that table on other members, one additional thing you can do is to enable EHL mode with the OPT_DIRECT_WRKLD database config parameter.  This allows pureScale to detect when a table is only being accessed by one member and put it into "non-shared" mode, thereby eliminating all CF message traffic for that table.  This can be helpful if your CF interconnect isn't the fastest.  Be aware that there is some temporary throughput impact involved if & when the table comes back into "shared" mode.
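
Enabling EHL is a database configuration change.  A sketch via ADMIN_CMD (the YES value is my assumption - check the exact keyword accepted for OPT_DIRECT_WRKLD at your level):

-- Allow pureScale to detect single-member workloads and switch eligible tables into non-shared mode
CALL SYSPROC.ADMIN_CMD('UPDATE DB CFG USING OPT_DIRECT_WRKLD YES');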

Some results for ESE and DPF

I always feel compelled to back up tuning suggestions with numbers where I can.  So here are some measurements of the impact of some of the main ideas above.  The numbers below are all relative, meaning that I take the basic insert throughput (one connection, using SQL which inserts one row at a time, using literals) and compare how many times faster it gets as we make some of the improvements discussed earlier.  Note - these are just examples - your mileage will vary.

First, for a simple ESE case:

[Figure: INSERT ESE.JPG]

From bar #1 to #2, you can see the benefit of parameter markers vs. literals (11x speed increase in this case!)   From #2 to #3 (which uses array insert), we increase the payload by pushing in multiple rows at a time, achieving another big boost.  #4 and #5 show that there is some benefit to moving the multiple row approach all the way into the SQL itself (vs. just applying it with array insert as in #3) - however the benefit from #3 to #5 isn't huge.

Next, for DPF, we re-baseline with the same basic test, and re-apply our optimizations:

[Figure: INSERT DPF.JPG]

The first take-away here is that, as with ESE, using parameter markers and inserting multiple rows each time (either via array insert or multi-row SQL) are big wins - although in terms of factors, not as big as on ESE.  Why is that?  Simply because there is another (in this case, bigger) problem here - the extra 'hop' from coordinator to data node discussed above.  We can see this being solved with buffered INSERT in bars #3 and #6 - HUGE benefits due to not sending rows one-by-one from the coordinator to the data partition.

For a final set of numbers, let's look at the benefit of running multiple parallel insert streams.  These numbers are from ESE, taking three of our insert strategies (single row + literals, single row + parameters, and multiple rows + parameters), and in each case, growing from a single stream up to eight concurrent streams.  In each case, the table was partitioned with eight ranges, each with two local indexes.

[Figure: INSERT ESE scaling.JPG]

If we look at the basic naive case (single row + literals, the red line), you can see very little growth in throughput as we go from one to eight streams, with a peak at about 3,000 rows/s.  Switching to parameter markers (blue line) gives us 20x more throughput at eight streams, with a peak of around 61,000 rows/s.  But when we use parameters and multiple rows per insert (purple line), we can get up to a staggering 140,000 rows/s at eight streams.

Wrapping up

As you can see, the most important basic techniques to improve insert performance are

  1. use parameter markers
  2. process multiple rows with each INSERT (either with array / batch processing or multi-row SQL) and many INSERTs per COMMIT
  3. scale up with multiple parallel jobs, to really get large volumes of data moving
  4. if you're in a DPF environment, don't forget buffered inserts
  5. if you're using pureScale - well, I didn't give a bunch of test results for that, but from lots of experience, I can say that the biggest hitter there tends to be extent size (plus the others that I talked about for ESE)