In the preceding blog posts we went through the basics of how to store and retrieve XML documents in DB2 table. Before we continue to more complex XML data manipulation techniques, we will spend some time in this post with the pureXML storage concept, discussing how DB2 stores XML documents in the database. You may think this is too theoretical; that it is something happening in the background that you don’t need to know. But give it a try! I am sure this will help you to understand later topics better.
It is worth remembering that prior to DB2 9, when pureXML was introduced, XML documents could be stored in the DB2 database, in a LOB column. If you think about the examples in previous posts and replace an XML column with a LOB column, you could use an XML document stored in a LOB column for very similar operations. However the XML document would be stored in its string representation — a bunch of text to DB2 — which is pretty inefficient.
The biggest change since DB2 9 with pureXML is that DB2 physically stores and accesses XML documents in the inherent XML hierarchical tree structure, not as text. The pureXML provides query languages, storage technologies, indexing technologies and other features to support this new XML data type in DB2. With this enhancement, DB2 became a hybrid data server, providing data management for traditional relational data as well as providing native XML data management. Let’s take a closer look at the pureXML storage concept and how the XML data is actually stored in the DB2 database.
We will go through the major steps that DB2 executes when a simple XML document from the example below is inserted into an XML column.
<?xml version= "1.0" ?>
First, DB2 parses the document to produce a hierarchical representation of the XML document that is stored on pages in a table space; i.e., to produce a tree. The reversed processed of parsing is called serialization. If the document is not well-formed (cannot be parsed by the XML parser), DB2 refuses the XML document and throws the SQL -20398 error. This tree is defined with the XQuery data model (XDM) and consists of nodes. A node can be a root node, an element node, an attribute node, a text node etc. The exact shape of a parsed tree in the storage can vary and depends on each individual instance of the document because DB2 performs a variety of optimizations when storing the tree.
If you are using XML Schema (more about XML schemas in a future post), the XML document can be validated against the appropriate XML schema as the next step. Use of XML Schema is optional and DB2’s XML storage is independent of any particular XML Schema. Note that DB2 does not store the XML declaration element. The document node is generated automatically on retrieval and represents the root node.
Element values reside in a separate nodes while attribute values do not. That is because an attribute has exactly one value and never any child nodes, thus it does not make any sense to keep the attribute value in a separate node. The example below shows the XML document’s parsed tree representation.
This “mapping” is stored in the SYSIBM.SYSXMLSTRINGS catalog table, which contains a stringID and string couple representing each tag. This mapping is catalog-wide and is unified for all XML documents on the subsystem, regardless of the table/database where the document is stored. The example below shows one possible mapping of the parsed tree.
When stringID mapping is done, DB2 assigns Node ID values and local node id values to nodes of the parsed tree. Every child under every parent is assigned a local node id value. The local node id is encoded using variable-length bytes. It starts with 02x for the first child and increments by 2 for each sibling. It ends with A0x, then it starts using two bytes, going from A10x and so on. The example below shows the local node id assignments of the parsed tree. Once all local node ids are assigned, every node is given a Node ID value. Node ID is used to uniquely represent each node externally. Moreover, the Node ID value also holds the information about the structure where the node is located within the tree. The Node ID value for a node is the concatenation of local node ids collected along the path from the root to the node. Such a composition ensures uniqueness of the node within the whole XML document regardless the name and the value of the node. For instance, the tag “nickname” (green color in the example below) has Node ID value of 02020604. The last ‘04’ is the local node ID value.
The XML document is almost ready to be stored into the table. As it could be quite big, up to 2GB, DB2 divides the tree into smaller subtrees and stores those subtrees into rows. The row is in VARBINARY format, 15850 bytes in length. Every root node of such a subtree is marked as a proxy node (green color in the example below). Proxy nodes and Node ID values are used on retrieval to reconstruct the tree. If a single node cannot fit into a single row, DB2 splits the node into several nodes and marks the lead node and continuation nodes.
Once all subtries are determined and the tree is optimized, DB2 stores the node data into the rows of the XML table as ilustrated in the example below. Each node appears before its children.
Let’s summarize all those steps that DB2 does when storing an XML document:
- Parse an XML document with the XML parser to ensure it is well-formed, and convert the document into the tree structure
- Validate the XML document against XML Schema if requested
- Translate all tags into integer values and store the mapping in the SYSIBM.SYSXMLSTRINGS catalog table
- Assign local node id values and Node ID values
- Divide (cut) the XML tree into smaller subtrees that can be stored into one row of the XML table
- Stores nodes one by one for each subtree of the XML document
Multiple versioning format
Before we discuss the structure of the objects which hold XML documents, we will discuss the concept of XML document versioning. This enhancement was added in DB2 version 10 and is available only for Universal Tablespaces. The main point is, that with the multiple versioning format, you can update part of the XML document instead of the entire document. This updated part is stored as a new row in the XML table. The old version is preserved until the next reorg. On retrieval, DB2 returns the latest version of the document.
This concept has several advantages. First, you can use the XMLMODIFY function to modify only a part of the XML Document. Second, it improves concurrency. As illustrated in the picture below, while application number one is reading the document, application number two can read it and update it at the same time. Finally, DB2 saves memory by storing just the part which is being updated. The savings are significant, especially for large XML document updates.
XML catalog objects
It may now be obvious to you that the XML document is not physically stored in a table space of the base table that has the XML column. An XML document is stored in its own table, where each row stores a subtree of the whole XML document, as described above. So when an XML column is created, DB2 implicitly creates the following objects:
- DB2_GENERATED_DOCID_FOR_XML column in the base table
- Table space for XML table (type P in SYSTABLESPACE)
- Table for XML column (type P in SYSTABLES)
- Following indexes on the XML table
- On DocID column
- Used to join the XML table with DB2_GENERATED_DOCID_FOR_XML column in the base table
- On DocID column + the largest NodeID value + Start and End Timestamp (ST, ET) columns
- Used for efficient access from base table to regions in the XML table
The names of all implicitly created objects are automatically assigned and you cannot change them. The XMLDATA column (VARBINARY, 15850 bytes) stores XML documents. The XML table space has 16k pages. Every single row contains either the whole XML tree or the document’s subtree. The DocID column of the XML table contains the document id value (BIGINT) for every row that belongs to the same XML document. To create a link to the XML column in the base table, DB2 creates the DB2_GENERATED_DOCID_FOR_XML column in the base table, which has a generated always attribute. The document id is generated in sequence. This might cause unwanted locks when concurrently inserting a large number of documents. To resolve this issue, a new ZPARM (XML_RANDOMIZE_DOCID) to generate the document id randomly was introduced in DB2 11 and retrofitted to version 9 and 10.
The minimum Node ID value – the root of the subtree which is stored in a particular row (a proxy node) – is stored in the min_NodeID column. If the multiple versioning format is used, there are two more timestamp columns in the XML table: start timestamp and end timestamp. Values in those columns indicate the time range when the version of the particular subtree of the XML document was current. The XML column in the base table contains the timestamp value of the last update. This timestamp, together with the start and end timestamps of the XML table, helps DB2 to retrieve the current version of the whole XML document.
The opposite process to parsing an XML document is called serialization. On retrieval, DB2 uses the DocID column to read all rows that belong to the XML document that is being retrieved. Start timestamp and end timestamp columns are used to retrieve the latest versions of the rows. The Node ID values with proxy nodes from min_NodeID column are used to reconstruct the tree representation of the XML document. Once the stringIDs are replaced with the tag strings as mapped in the SYSIBM.SYSXMLSTRINGS catalog table, DB2 can serialize the parsed XML tree into its string representation and retrieve it.
Despite the fact that this part is more theoretical, I hope you enjoyed it and that you have a good base for understanding more complex XQuery and XPath expressions. The awareness, somewhere in your brain, of what is happening in the background might help you work with XML documents.