Understanding the Complexity of Time Travel
Daniel L Luksetich, DanL Database Consulting. firstname.lastname@example.org
I like to joke that time travel in SQL is fraught with peril, kind of like that movie when the guy goes back in time. If you recall the movie then you’ll remember that he ends up traveling back in time and kissing his own mother! So, we need to be careful when time traveling but more importantly we need to have an understanding of how time travel applies to our business. This is critical in getting the query to match the data as well as match the business need.
Time travel applies to DB2 automated temporal and archive tables, as well as to any home grown temporal data storage. Prior to understanding the issues surrounding time travel queries you must have knowledge of temporal data storage and the automated features of DB2. Great references can be found using these links, and I recommend gaining this knowledge before continuing with this article.
There are many other great articles and presentations out on the web worth reading, and one should definitely experiment with DB2 automated temporal tables in advance of any implementation. My customers are jumping on to system-period temporal tables as an automated solution to maintaining their own history tables. We’ve discovered some issues along the way, especially surrounding certain time travel queries.
Time travel is enabled via something called a period-specification. The period-specification is part of the FROM clause of an SQL statement, and is used to query data relative to either a point in time or a range of time. In addition to a period-specification in the FROM clause of a query there can also be a period-clause following the table reference of a DELETE or UPDATE statement against an application-period temporal table (covered in articles reference at the beginning of this article). A period-specification for an application-period temporal table to query data for a specific point in time could look something like this:
FROM ACT FOR BUSINESS_TIME AS OF '2015-12-30'
A period-specification for a system-period temporal table to query data for a range of time could look something like this:
FROM EMP FOR SYSTEM_TIME BETWEEN '2015-03-27-18.00.00.000000000000' AND '2015-12-27-18.00.00.000000000000'
Even though the period specification is part of the FROM clause the query transformation component of the DB2 optimizer will rewrite the period-specification into a compound predicate and merged with other predicates in the WHERE clause of the query that is passed internally to the access path selection component of the optimizer. For example, the following query:
FROM ACT FOR BUSINESS_TIME FROM '2015-05-01' TO '2015-06-01'
WHERE ACTNO = 10;
Will be transformed internally within DB2 as the following, with predicates on the START_DT and END_DT columns generated from the period-specification in order to retrieve a set of rows that intersect the period of time in the period-specification:
WHERE ( '2015-05-01' < '2015-06-01'
AND ACT.ACTNO = 10
AND ACT.END_DT > '2015-05-01'
AND ACT.START_DT < '2015-06-01');
While the inclusion of a period-specification in a query against an application-period temporal table, due to the fact that the application period is considered part of the key, is quite common, it is also quite common that applications search for “current” version of data in system-period temporal tables or archive tables. In these situations it is unwise to use a period-specification or to always request archive access. This is due to the fact that if these conditions are true then the query transformation component of the DB2 optimizer will rewrite those queries into a union of base and history or base and archive table. So, in situations where only current data is desired for system-period temporal tables then the period-specification should be left off the statement. For both system-period temporal and archive tables there are special registers and global variables to control whether or not historical or archived data is accessed. If using DB2 10 for z/OS the system period temporal special registers are not available and for performance reasons you should code separate current versus time travel queries.
What Data and When? The Understanding of Time Travel
There is a little bit of a culture shock when moving to an automated temporal design. This is due to the fact that data was traditionally stored relative to points in time using create and update timestamp columns, and are now stored for periods of time using start and end time columns. Periods of time were previously derived or laid upon data updated as of points in time, but now periods of time are queried using periods of time or points in time. As an example an employee table that may have been traditionally designed as follows:
With an EMPNO column as the primary key an application developer would query the table using that primary key value to get the current row for an employee obviously. An historical query, such as “return all employees created from January 1st of 2015 until today” would look like this:
SELECT * FROM EMPLOYEE
WHERE CREATE_TS BETWEEN ‘2015-01-01-00.00.00.000000’ AND CURRENT TIMESTAMP;
This is easy for application developers to understand as well as users. Likewise, a query returning rows from the employee table that were updated over a period of time is also easy to understand.
SELECT * FROM EMPLOYEE
WHERE UPD_TS BETWEEN ‘2015-01-01-00.00.00.000000’ AND CURRENT TIMESTAMP;
If the employee table is converted to a system-period temporal table then the UPDATE_TS column is not really needed anymore as the column representing the start of the system period reflects the same value.
We have to make sure when it comes to history that we are giving what it is that the user wants. The way this system-period temporal design works is that any change creates an additional history row for a given key. Any temporal query returns the intersection of the temporal period with that of the history. If you ask for a specific point in time you get the image of the data at that point in time. If you ask for a period of time you may get multiple row images that were “current” during the period of time specified in the query. So, what the users want has to be more refined than “give me the history between date1 and date2”. They may get confused by the answer if the start of an image of data was before date1 because the data did change between date 1 and date 2, or it was created before date 1 and is still “current” (i.e. has no end date). Only if we really know what it is the users want can we deliver accurate results. If there is no understanding then needed data could be missing from the result set and that can lead to mistakes made by the users interpreting the data. Here is a diagram that I hope demonstrates. Once we know which query needs to be answered then we can formulate a query. The application developer needs to know which of these 3 questions is right for a given request for historical data.
What is the business requirement for historical views of this data?
- All employees whose data was modified between 7/1/2015 and 9/15/2015? All three rows will be returned. Did the business really want the first row?
SELECT * FROM EMPLOYEE FOR SYSTEM_TIME BETWEEN ‘2015-07-01-00.00.00.000000’ AND ‘2015-09-16-00.00.00.000000’;
- All employees created on or after 7/1/2015 and modified between 7/1/2015 and 9/15/2015? No rows would be returned because the employee row was created back on 6/1/2015 as represented by the CREATE_TS value which does not change over time.
SELECT * FROM EMPLOYEE FOR SYSTEM_TIME BETWEEN ‘2015-07-01-00.00.00.000000’ AND ‘2015-09-16-00.00.00.000000’ WHERE CREATE_TS >= ‘2015-07-01-00.00.00.000000’;
- All employees modified between 7/1/2015 and 9/15/2015 and modified after 7/1/2015? The second two rows would be returned, but not the first row.
SELECT * FROM EMPLOYEE FOR SYSTEM_TIME BETWEEN ‘2015-07-01-00.00.00.000000’ AND ‘2015-09-16-00.00.00.000000’ WHERE START_TS >= ‘2015-07-01-00.00.00.000000’;
This means that business analysts need to properly communicate between the developers and users to determine what exactly is desired. Ultimately then the business analysts need to understand temporal data stores and the fact that the concept of history changes dramatically. Maybe the word “modified” should not be used and replaced with the word “active”.
Temporal tables and time travel queries can require a shift from traditional thinking about how data is stored as we as retrieved from the database. Proper education and testing will help business analysts and developers understand how to match user requirements with proper database access. Communication with users regarding requirements and understanding of temporal storage will aid in proper application design, as well as enhanced application design.