Db2 Iceberg – Chapter 2: Time Travel Queries
By Cameron van Ravens
Time travel is a unique feature of Iceberg, allowing for historic versions of a table to be queried, which is possible due to the snapshot-based architecture of Iceberg tables. As discussed in the first installment of this blog post series on Db2 Iceberg, called Tip of the (Db2) Iceberg, each transactional operation on an Iceberg table creates a snapshot representing the state and contents of the table at that moment. Using time travel, we can reference these snapshots to query the table’s data at that point in time or even combine multiple time travel queries across different snapshots or tables for more complex operations across the history of these tables.
Time Travel Queries
Let’s look at the syntax of a time travel query. The time travel period which specifies the point in time we wish to query, is provided as a clause on the target table reference. There are two ways to specify the time travel period, which provides some flexibility in how we can specify the target time period to query based on the requirements of the application.
FOR SYSTEM_VERSION AS OF <snapshot_id or snapshot_reference>
The SYSTEM_VERSION period allows us to specify a target snapshot to query. We can provide either the snapshot ID as an integer or the snapshot reference as a string to specify the target snapshot. The snapshot ID must be provided as an integer type to distinguish a snapshot ID from a snapshot reference that looks like an integer.
FOR SYSTEM_TIME AS OF <timestamp>
The SYSTEM_TIME period allows us to specify a timestamp to query the table at a point in time. The timestamp will be resolved to the latest snapshot created prior to the given timestamp. The timestamp must be provided as a type that can be cast to a timestamp.
Let’s look at some examples of how we can use time travel on a simple Iceberg table. First, let’s create and populate the table and create a named reference for the first snapshot:
We can view information on the table’s snapshots using the TABLE_SNAPSHOTS table function:
And the table’s snapshot references using the TABLE_SNAPSHOT_REFS table function:
To query our example table as of the first snapshot, we could use any of the following statements:
All three of these methods will yield the same result:
If we try to query a non-existent snapshot ID or reference, or a timestamp that does not correspond to any snapshot, we will get an error that looks like the following:
SQL20524N with reason code 10 will always mean that a snapshot could not be resolved for the given time travel period. For other reason codes please see the Db2 documentation for SQL20524N.
Alternatives for Specifying the Target Value
There is some flexibility in how we can specify the target snapshot or timestamp in the time travel clause. In addition to specifying the time travel target as literal values, we can also specify the target in a variety of ways such as using SQL variables or special registers. The restrictions on what can be used to specify the time travel target are as follows:
- The input must be a type that can be cast to a timestamp, except for integer types when specifying a snapshot ID
- No column references
- No sub-queries
- No user defined functions
- No non-deterministic functions
For example, if we wanted to provide ID of the first snapshot by querying the TABLE_SNAPSHOTS table function, trying to do this using a sub-query will fail:
Instead, this can be achieved by using a SQL variable to store the result of the sub-query:
Time Travel Behaviour involving Schema Changes
Continuing with our example table, let’s add a new column to this table and insert some new data to create a new snapshot:
If we query the table at a point in time just before we added this column (you can use the TABLE_SNAPSHOTS table function to help determine a timestamp to use), you’ll notice that the column still exists in the result set. Db2 will always use the table’s current schema, regardless of the schema originally used by the target snapshot.
Querying Changes to a Table using Time Travel
Let’s look at a more complex example where we can leverage time travel. If we wanted to see what changes have been made to an Iceberg table between two points in time, we can query the table at each point in time and take the difference to see the changes. By adjusting the order in which we compare the table’s datasets, we can determine which rows were added and which were removed. Let’s first delete a couple rows from our example table:
We can use a query like the following to outline the rows that were added or removed between the first and fourth snapshots (referenced by ‘tag1’ and ‘tag4’ respectively):
Running this gives us the following result set on our example table (note the absence of ID=3, which was added after ‘tag1’ but deleted before ‘tag4’):
Snapshot Rollback
While time travel allows us to peek at historic versions of a table, we also have the ability to rollback or set a table’s current snapshot to one in the past. By setting the current snapshot to a different snapshot we can achieve a similar effect to time travel, however all operations on the table will create new snapshots stemming from the snapshot we rolled back to, creating a new fork in the table’s snapshot history. Additionally, rolling back to a previous snapshot will affect all other database engines’ view of this table, so this should only be done when we want to go back to a previous version of the table and not as an alternative form of time travel.
We can rollback to a snapshot (meaning the target snapshot must be an ancestor of the current snapshot) using the ROLLBACK_TO_SNAPSHOT or ROLLBACK_TO_TIMESTAMP stored procedures, or we can set the snapshot (no restriction on the relation of the target snapshot to the current snapshot) using the SET_TO_SNAPSHOT or SET_TO_TIMESTAMP stored procedures. Similar to the behaviour previously mentioned for time travel, the table schema will not change to the schema used by the target snapshot.
Looking at our example table, we can rollback to the first snapshot as follows:
Since rolling back only changes which snapshot is considered “current” and doesn’t create a new snapshot, we can leverage time travel with the special register CURRENT TIMESTAMP to query the most recent version of the table without needing to first determine the latest snapshot. Here we can see that we retrieve all the data we inserted prior to rolling the table back to the first snapshot:
There are cases where using CURRENT TIMESTAMP or CURRENT DATE may fail to resolve to a snapshot. These special registers return the time-of-day of the current Db2 server, which may differ from the time-of-day of the server which wrote the snapshots.
Differences with Db2 Temporal Tables
Db2 users who are familiar with time travel on Db2 temporal tables may recognize the similar syntax used for Iceberg time travel, particularly when using the SYSTEM_TIME period. However, despite sharing some syntax Iceberg time travel has many differences when compared to Db2 temporal tables:
- Queries using Iceberg time travel functionality read the data files associated with the snapshot referenced, whereas time travel queries using Db2 temporal tables read the data that existed or was valid for a given time period based on special timestamp columns in the table.
- Iceberg time travel is baked into the design of Iceberg tables, where Db2 temporal tables must be created with these specific timestamp rows and correlated to a history table in some cases.
- The period used for Iceberg time travel refers to how we should resolve the target snapshot based on the supplied input, where the period in Db2 temporal tables has different semantic meanings: System-period tables track when rows were created, updated, and deleted, and application-period tables track when a row is considered to be valid.
- Iceberg time travel also introduces the SYSTEM_VERSION period, which is not supported by Db2 temporal tables.
- Iceberg time travel only supports the AS OF range, where time travel using Db2 temporal tables additionally support the BETWEEN and FROM .. TO ranges.
- The CURRENT TEMPORAL SYSTEM_TIME and CURRENT TEMPORAL BUSINESS_TIME special registers have no implicit effect on Iceberg time travel.
- Iceberg time travel can still reference the value of the special register, i.e. FOR SYSTEM_TIME AS OF CURRENT TEMPORAL SYSTEM_TIME
Additional Resources
- https://www.ibm.com/docs/en/db2/12.1?topic=tables-iceberg-time-travel-queries
- https://iceberg.apache.org/spec/#point-in-time-reads-time-travel
Cameron van Ravens joined IBM in 2020 as a Software Developer at the IBM Toronto Lab. He has a BEng in Software Engineering from McMaster University. Cam has worked in Db2 LUW, Db2 Big SQL and Data Virtualization and is the technical lead for the statistics collection tool used for open data format tables called ANALYZE. Most recently, he was a key contributor for the new support of Datalake tables in Db2, including support for Iceberg tables.