Designing Large OLTP Databases for Performance

I know the trend is to use NoSQL for a lot of big data storage these days, but it isn’t necessarily the best way to go for every situation. At least not yet. For highly-available high-performance transactional systems we also need a great deal of reliability. DB2 on the mainframe has been growing and meeting our needs (without mysterious data losses) for more than 30 years now as we’ve continually increased the size of “big”. And that’s good news because our databases often get bigger than we expect despite all the planning we put into designing and maintaining them.

There are a lot of things to consider when designing and building a database that is going to hold a lot of transactional data. Some considerations have to be weighed against each other to come to a middle ground.

Support for Searching

You’ll want to be able to enable queries against your database. Because the data is large you will need indexes, but you can’t index every field. Each index takes time and CPU to manage as processes add and remove data. If processes update the fields that are indexed you get even more overhead.

Beyond that, each time you bind or prepare an SQL statement to the database DB2 will need to evaluate each index to determine the optimal access path and more indexes means potentially longer bind times. If you aren’t careful, you could spend more query time binding than actually executing. Not a problem for COBOL so much as Java/JDBC and ad hoc queries. So, if you’re high volume programs all have static embedded SQL you can probably afford a few more indexes.

Indexes, of course, require maintenance. So you also need to consider the impact of reorganization processes and their frequency. Let’s consider a header table with a logical design like the following.

1 - Header Logical.jpg

The table shows three candidate primary keys related to this table plus six additional alternate keys. We expect the Header table to contain at least 1 billion rows. In this case, Header ID is our new primary key and we’ll want to designate it as such. The other two candidate primary keys would have supporting unique indexes. Now, let’s think a bit about these six alternate keys. Are they useful? Definitely, but as they are defined at the logical level they are each potentially going to return a large number of rows making them expensive to use. Our volume for the rows that make up these alternate keys may look something like:

  • Subject One = 10 million entities
  • Subject Two = 1 million entities
  • Subject Three = 2 million entities

No matter which index is used in support of a query, that query could potentially be returning 100-1,000 rows. So, we might consider adding more information to the indexes to reduce the number of rows they return.

2 - Header and Indexes.jpg

In this case, we are increasing the CPU for our inserts while giving our indexes at little more to grab onto. But this probably isn’t enough. From interviews with the business, we know they are more interested in recent rows than in older rows. This means we can add the Received Timestamp as a descending column to each of these indexes. That way, we can retrieve the more interesting rows first and only get the others if the requestor needs to go back farther.

3 - Header and Indexes 2.jpg

We have now defined 9 potentially unique indexes on our tables. Do we need to make them all unique? Certainly for the primary business and technical indexes, but not necessarily for the six alternate indexes. For the alternate indexes we may be able to use Data Partitioned Secondary Indexes (DPSIs) instead. These would be physically partitioned into separate datasets so that we have one index partition for each data partition. Creating a DPSI and using the partitioning column in our SQL statements allows DB2 to determine which index partitions it needs to look at when searching the alternate indexes. Omitting the partitioning column in our SQL forces DB2 to search all of the index partitions in parallel, with potentially high CPU costs.

Now let’s look at the table definition a little more closely where we can see there are other columns that would likely need indexes. With 9 indexes on the table already, we may need to figure out a way to keep the maintenance on this table down a bit. One solution for doing this is to create a “search” table. The alternate indexes for our table here break down well into three subject areas. We can then create a search table for each subject area.

4 - Header and Search.jpeg

Now we can perform maintenance on tables with fewer indexes. The cost for this is duplication of data into three new tables. Programming efforts can be reduced, and data consistency assured if triggers are utilized to keep the subject area tables in sync with the header table.

Partitioning Considerations

With large data, you may be able to use a Partition-By-Growth approach if you’re transactions are always being added to the end. But with big transactional data, you’ll probably want to manage your partitioning to keep overall resource usage under control.

Date Partitioning

One fairly common partitioning option is to use a date or timestamp that is part of the table structure. Our sample Header table above has two timestamps we could use for this. The Received Timestamp is likely to be good for organizing our data as it comes in because it’ll keep our active data in one place. With the active data in one place we’ll need less memory and possibly fewer open datasets if frequent processes aren’t searching through historical data. On the other hand, if you can separate active data from completed data, the Completed Timestamp makes an excellent choice for partitioning historical tables. If you can use partition rotation to purge off old history then you’ll (hopefully) be able to come to a fairly steady operational state in terms of performance.

Date partitioning can work well if most of your data is inserted. If you have significant updates you might get a lot of row relocations which will increase your object maintenance costs. These relocations can occur if you use compression, variable length data types, or update the partitioning columns.

Business Key Partitioning

Another partitioning option is to try to divide your table into some number of partitions based upon a column related to business regions or entities, trying to spread the activity across the datasets. This will reduce contention, but will likely require more memory to support the index trees for each open partition (assuming DPSI usage). It’ll also require more CPU for the execution of maintenance to keep free space distributed throughout the objects. On the plus side, it keeps rows that are often used together clustered together and that’ll keep your primary access costs down.

If you have a good business key (like Business ID) you can use ranges of possible values for the partitioning key. If you’re data is anything like ours, you’ll find there may be difficulty figuring out how to distribute them evenly. If you get an even distribution, consider yourself lucky.

Assuming the Business ID is an alphanumeric string, then one choice that looks like it will work well is to use an algorithm against the last 2-3 characters of the string and assign ranges to them. In one design I’ve done recently, using the last two characters worked well for partitions with ’00’ thru ’99’ for the first hundred partitions. There are some cases of alphabetic characters, so ‘0x’ - ‘9x’ goes into partition 101 and ‘xx’ goes into partition 102. One last partition was necessary, partition 103, to handle a hotspot where the Business ID string ended in ’001’ and there were enough occurances for its own partition. This yielded a 103 partition distribution that looks like it was consistent for many months of raw data.

The next concern when designing a high volume transaction database is planning for growth. What if we can’t purge enough data and outgrow the partitions? How can we redistribute the data quickly? With a very large database in a highly available environment, the answer is that you can’t just fix it. You must plan for it. So, in this case we added a Partition ID column to contain the target partition number for each row. We calculate the Partition ID only on insert and we can do it in a ‘before’ trigger to ensure consistent use. The algorithm used to determine the Partition ID looks like:

-- Possible UDF implementation (first-cut)
GetPartition(bus_id) {

  -- first partition to use
  set p1 = 1
  -- Internal variables used
  set c1 = substr(bus_id, length(bus_id) - 1, 1)
  set c2 = substr(bus_id, length(bus_id), 1)

  if c1 between '0' and '9' and c2 between '0' and '9'
     then set p2 = smallint(substr( bus_id, length(bus_id) - 1, 2))
  elseif c1 between '0' and '9'
     then set p2 = 100
  else set p2 = 101
  end if

  -- handle '001' hotspot
  if c1 = '0' and c2 = '1'
     if substr(bus_id, length(bus_id) - 2, 1) = '0'
        then set p2 = 102
     end if
  end if

return p1 + p2
}

Key points about this routine:

  • We can add 103 more partitions just by changing p1 to 104
  • We have a code structure that can add support for additional hotspots

I should stress here that you really need to analyze your data. Don’t assume you will have an even distribution of data, and do what you can to verify it. If you can’t verify it, you’ll need to watch it as it comes in and be ready to react quickly. Either way, expect things to change and be ready for those changes.

This partitioning method will require that your purge activities scan through the data looking for rows that can be purged. You can optimize the purge using an index on the Completed Timestamp or let it run slowly and use a tablespace scan. Either works but the resources for maintaining the index may be more overhead than you want. You’d need to measure and compare the options in your environment.

Partitioning with Separate Tables

Yet another option for partitioning is to do it yourself with separate tables. You can then bring it all together using Views that Union the tables together. This is very similar to one of the design concepts used for big data: many separate datasets that get saved in a distributed file system only to be brought together again to resolve queries.

This method could be handy if your table is going to grow larger than the maximum table size. It also provides you with a lot of flexibility and control. A lot of the considerations mentioned above happen here as well, both within each “table” and across them all.

Statistics

When you have a lot of data, you’re going to need good statistics in the catalog. If you’re mostly adding data at the end, then you’ll be adding partitions or rotating the use of existing partitions. Before you are done with either of those, make sure you have statistics in the catalog that represent the empty partitions as if they had data. This will be true for the Header table as well as every single one of its children. If DB2 thinks the partitions are empty or if you have default stats, your access paths will not be what you want.

Object Maintenance

When we design physical databases, we take the logical model and try and figure out how to best implement it in DB2. This covers partitioning, indexing, splitting or merging tables, denormalized table creation and other similar activities. We take into account the applications that will use the tables and their access requirements then add in support for ad hoc query users. It isn’t trivial work and the larger the quantity of data the more important it is for us to put good thought into the design.

One key trade off in physical design is the consideration of the frequency of reorganization (and other database maintenance activities) with the growth and activity of the data from the applications. We need to make sure there is space in the database to hold the incoming data without performance degradation. This space gets used up as new data is inserted. Our regular reorg jobs have the task of preparing the tables for the next round of inserts. But we also need to consider that our reorg jobs may not complete successfully. With DB2 for z/OS we are blessed with the ability to reorganize our tables while the application is still running. If our timing is right, our jobs will run when there is low database activity and there will be little contention with applications. But timing won’t always be right and we should allow for the possibility that we have to go through two or three scheduled maintenance intervals between successful reorgs. That means allocating more free space than you’d think you would need.

If your partitioning is based on dates, you’ll be able to minimize the amount of maintenance you do against the older partitions. This is great in that it reduces the cost of managing infrequently changed (and hopefully infrequently used) data. If your partitioning is based on a business key, you’ll need to plan a much more extensive set of maintenance jobs. In either case, you’ll still need to keep current backups of the entire database. You’ll need to consider the use of full copies versus incremental copies. Can dataset/volume level snapshots work for you backups? Lots to consider.

Data Migration

It is pretty rare that you are going to start off a new large DB2 for z/OS database from scratch. If you’re lucky enough to be doing that, feel free to skip this section.

Now for the rest of us, we have data to convert. How you convert is going to depend a lot on circumstances, the design of your system(s), the amount of time you have to convert, the amount of money you can spend converting.

Converting to a New System

If you are converting your data from one system into another then you’ll probably have some serious time restrictions on getting the active data moved. This might be a good situation to structure your database into separate tables containing active data versus historical data. That way, hopefully, you just need to move the active data quickly and can take more time to move the historical data. You may even be able to move the historical data over in advance of implementing the new design until that historical data is synchronized, and then move the active data at the last minute. Either way, you’ll want to look at the amount of data that needs to be moved and figure out how long it’ll take. If you think you can move your data over a long weekend then I don’t think your data is “large”.

Converting within a Live System

If you, like me, have to keep a live system running and simply change the data storage underneath that system then you’ll probably want to consider a long-term phased approach. We were lucky enough to have a VSAM I/O layer with a limited number of interface programs. This allowed us to tie our new DB2 layer into the existing layer and migrate while the application was up and running. Here’s an overview of the approach we used:

  1. Added DB2 write support to the I/O layer so that we were writing to both VSAM and DB2 with every transaction. This synchronized the data going forward.
  2. Ran batch jobs to extract the information from the legacy VSAM record for the Header table and load that into DB2 so that we had one row for every VSAM record. This gave us control over what had and had not been converted.
  3. Developed a background CICS transaction that would read a VSAM record (actually a set of records from several files) and insert that into the child tables, updating the Header to mark it as converted.
  4. Ran the background conversion process. This took 9 months to convert almost 2 billion records.
  5. Verified the data matched between VSAM and DB2.
  6. Switched the applications to read from DB2 instead of VSAM.
  7. Turn off the VSAM writes. This step hasn’t actually been done yet due to the complexity of getting all of the reading applications switched over.

Notice the length of time it took us to convert our data in Step 4. Part of this time was due to bugs that were not noticed until after we’d converted a significant number of records. The majority of the time was due to the decision to keep mainframe CPU usage as low as possible. We had the luxury of time so our data conversion transactions ran at a very low priority during prime time hours to keep the main application responsive to the millions of user transactions running throughout the day. In the evening, we’d ramp them up a little to get more done. During the weekends, we ramped up concurrent activity to about 5 times the weekday rates. One holiday weekend we processed over 27 million “records” in a 24-hour period, resulting in more than 27 billion inserts (over 100 inserts per “record”). Not only did this prove that DB2 for z/OS could handle the load and perform well beyond our expectations, but that our physical design could handle the projected 2 million “records” per business day.

I know for some of you 2 billion records won’t seem like a large number. In reality, each of the “records” uses about 140 tables to store the data due to the number of repeating elements. Some of those are great-grandchild tables with thousands of rows per original VSAM record. In all, the raw size of the data is about 250 TB.

If you run into a similar situation, take the long view and plan it out so that you can run your existing system while you do your migration. Make sure every change allows you to fall back to the prior state without data loss. And triple check the content of your tables for accuracy in as many different ways as you can afford.

Write vs Read

As you design your large database, you’ll need to consider whether to optimize for read access or for write access. Be objective about this. You don’t want to optimize for reads if you never read the historical data. You don’t want to optimize completely for write activity if you read it even more frequently. If you have serious limits to the amount of time you can spend writing and you have very high read activity, then you may need to consider ways to optimize the writes and post-processing the data to make it available for reads.

In our database, we designed for write performance storing the most recent incoming data at the end. The incoming business record rate was about 1 million per day, in addition to our planned conversion rate of about 5 million per day. The inquiry rate was significantly lower. At least that was what we were led to believe. As we were converting, we learned some new things about the application and we had to adapt to changing requirements. The write activity went from once per business record to an average of more than 10 (1 insert and 9 updates). The architects came up with the need to replicate a subset of the data into a second database (in DB2 for LUW) for inquiry purposes, increasing our database activity by 10 million full reads of the business record. Yes, we’ve had a lot of opportunity to tune and learn and find things we could have done differently.

Read performance is actually pretty good considering the number of tables involved. There are outliers that take a lot longer than we’d like but overall we are pleased with the performance of the database. We’ve added only two denormalized tables to improve inquiry performance. If DB2 supported arrays we’d have been able to have a much smaller number of tables and that would have improved our read efficiency tremendously.

As our application evolves from the old mainframe CICS/COBOL/VSAM code it currently uses now into the Java/DB2 code we are developing, we have opportunities to improve even more. Our application almost always needs and uses the entire business record. Our Java object model works best if we bring the whole thing into memory and execute the business logic around it. Yes, with over 2,000 columns in the database you can expect some seriously convoluted business logic. So, we’re looking at XML columns to improve both our read and write performance as we move into the future. Like the original migration from VSAM to DB2, this won’t be a quick migration either. Our tests indicate we’ll save a lot of mainframe CPU reading from XML columns even though it’ll cost a little bit more for any updates we might do. That, of course, calls for other design tradeoffs.

Purging Data

Database Administrators can often be heard complaining about development teams that never build and implement purge processes. Developers don’t understand why the Database Administrators can’t just delete the old data. I’ve lived on both sides of this problem. There are many reasons to purge data from our databases.

  • Reduced legal risk from old documents
  • Sustained good performance levels for online users
  • Reduced query times
  • Reduced CPU usage
  • Quicker backup and recovery

Purge processes are often left to the end of development cycle when the database is going to be empty upon implementation. It’ll be years before anyone needs the purge and the development team already has a backlog of user functionality that the business needed available 3 months ago.

Having been through all this, we made sure we designed our database with a purge process from the beginning. DB2 provided us with partition rotation capability, all we needed to do was implement a process to control it across those 140 tables. Yes, easier said than done. But this too takes time and planning. Our Header is key to being able to do the rotation. Once we delete a row from the Header our application no longer finds the business record and does not look for the children. With a set of search tables (as described above) we’d need to make sure those were deleted as well. We can do this because all search paths come through that Header table, even ad hoc queries. The rotation-based purge works when the database is organized and partitioned by a date/timestamp column. In our case, Received Timestamp was the chosen partitioning key. Once the oldest partition is empty we are able to run the rotations.

Partition rotation is quite a bit cheaper in CPU compared to 20 million deletes (times 140 tables). It works very well too, until the legal department tells you to stop purging. Then what?

The ability to purge data from your database is not simply a technical decision. The operational business area and legal area must agree on how long data is to be retained and when (or even if) it can be purged. It is unfortunate, but our data is often involved in legal cases of multiple varieties. Each legal case could affect just a few rows of data or many thousands of rows. We designed a solution that keeps data in a secondary set of tables when there is a legal hold. This enabled the use of partition rotation while meeting the needs of the legal department. Make sure you plan accordingly.

Surrogate Keys

There has been a trend in application development where each table gets assigned its own surrogate key. This is handy for consistent design in smaller databases. In larger databases it has a major trade off in the need to support foreign keys in each and every child table. The support for those foreign keys will add significantly to your write costs over billions of rows and millions of transactions. Every microsecond counts. I once had to back out and redesign due to a 6 microsecond increase in CPU time because the program executed so frequently that it brought the system to a crawl.

Conclusion

These are just a few thoughts on the design considerations that take place when designing large databases. It is important that you understand the characteristics of the data as well as the current/future behavior of the applications that use it. Without this, you may make the wrong choice and it can take a long time to do it over again. That makes planning and design well worth the time taken up front.

Recent Stories
SQL Tuning for Developers – Part 1

The Key To Mastering Db2 Locking On Row-Store Tables

Lessons Learned of Locking and Latches by Adrian Burke