XML, JSON and Schemas

XML isn’t quite dead. Why not? Like many tools created by a robust bureaucracy of software vendors, it feels bloated. It is difficult to work with. It is language agnostic, so parsing it requires complex libraries that are often inefficient. Files containing XML take much more space than the data they contain. But it has serious advantages to what we had before XML came along. You can open up an XML document in an editor and even make a quick change to it. You can store an object hierarchy together in one place instead of across several files. When you need to share complex data, the verbosity is actually a benefit because it ensures clarity. 

JSON, on the other hand, has its roots in the JavaScript programming language. Its overall structure and datatypes are simpler. It feels more natural when you open it up in an editor and make a few changes. The popularity of JSON increased when some prominent developers picked it up and used it with other programming languages during the Web 2.0 era. JSON met their needs because it was simpler to use than XML and the web browsers they were interacting with already had support for it. Importantly, JSON could handle the complex data structures with less overhead and more efficient parsing. What JSON really lacks is vendor tooling. 

As Philip Nelson stated in his recent article, “JSON and XML are not just slightly different ways of doing the same thing.” Both can be used to hold complex data structures and they do have a lot in common. The effort that went into the development of XML included a lot of work in developing the ability to define schemas for XML documents. In fact, there are at least three different XML schema languages: Document Type Definition (DTD), RELAX NG, and XML Schema Definition (XSD). Most of the people I work with today use the XML Schema Definition (XSD) language. XSD schemas are stored as XML documents, and the elements describe the objects, elements and attribute characteristics of each field that can be stored in a particular XML document. Java developers can use an XSD to generate data access objects and the methods necessary to load and unload the data from those objects.

Would you be surprised to learn that JSON has a schema definition standard as well? You should be, because it doesn’t. What JSON does have is a draft schema definition specification that doesn’t get enough attention. Like JSON itself, JSON schemas are simple and easy to implement. They can be used to validate the structure and content of a JSON document using validation routines available in several common languages. There may be tools that can take a JSON schema and generate data access objects, but I haven’t found them yet. The more common solution is to simply parse the JSON into a dynamically created object structure and hope it has what you expected.

To summarize, we have XML with complex schema definition capability and good tool support and JSON with simple draft schema definition capability and almost no tool support. Both can be used to hold complex hierarchical data structures. Both can be edited manually if you are careful. In DB2, we have many years of development that went into the use and exploitation of XML and we are still very near the beginning of similar support for JSON.

By now, you’re probably wondering where I’m going with this. Why is it important for the developers to be able to define schemas for their documents? The answer lies with the growth in complexity of our data. Early relational databases consisted of a few simple tables, almost like the pages in a spreadsheet. In fact, our user communities have continued developing spreadsheets to collect and aggregate their information. Some dabbled with Microsoft Access and build user interfaces. Eventually, they needed something more and we had to develop DB2 databases and applications to support their growing needs. The database schemas became more complicated. We had to bring in people that specialize in data modelling. We needed logical models that could define the business structures and data. We needed physical models that could put that data into tables with enough performance to make them useful.

Today, we have business people and data scientists interested in having all of their data in one place and accessible. Many of them are still using the spreadsheet model in their heads. After all, spreadsheets are just simple data files. We can simply add another column to it, right? They’ll start by asking for an extract of the database. Just a few fields will do. Later, they’ll ask for everything. Everything will be more than they really need, but initially they won’t know what they do need.

We also have developers that want to work with tools that make their work easier. If you have to share data across development teams then you’re going to need to understand the data elements and how they relate to each other. Is your metadata still in a set of spreadsheets too? Do you have tables that define the mapping between record formats or Word documents that need an interpreter?

What we need is documentation that can be actively used by programmers to generate solutions. That means we need standardized ways to define our schemas. And those schemas are going to be complex. Since JSON isn’t there yet, the best option is to use one of the XML Schema languages. We can use an XSD to generate our Java object model. We can use the Java objects to export data into XML documents. We can even use the same Java objects to export the data into JSON documents. With a little programming magic, we can even take the XSD and create a JSON Schema definition that can be used to validate those JSON documents.

But we are also going to need to make sure we do it right. Many organizations leave XSD creation up to the developers. But those developers are working from business requirements and likely have an incomplete understanding of the data. They certainly don’t have a data modelling background. We are already passed the point where our data modelling teams should have been designing our XML and JSON schemas.

Creating a logical model for a document is similar to creating a model for a relational database. You still need to define the entities (complex elements, objects) and the relationships between them. Repeating structures (arrays) aren’t all that different from a parent table with a one-to-many relationship to a child table. The entities still have attributes (elements) and some of those attributes would be candidate keys that we might want to use when searching across documents. You should also make sure your model is in at least 3rd normal form.

One consideration here is the potential to nest one model within another. For example, you can create a Person model that has child elements containing Name, Addresses, Phone Numbers. This Person model can then be packaged up in its own XSD for use in a larger model without needing to replicate the entire tree. Then, within the larger model, you can include a Person in one or more of its entities. This is sort of like creating your own “domain” in the relational model sense of the term.

 customer-model.jpg

Once you’ve done the modelling, you can review the content with the business analysts and developers to make sure everyone understands what the documents will look like, just like we do with our database models. 

Then, the physical modelling starts and we can work on creating a useful XML Schema. In the XML Schema, you can eliminate attributes that get propagated from parent entities to their children. In groups that become arrays, you don’t need sequence numbers since the position within the array is just as effective (unless you need another portion of the document tree to be able to reference an element of this group). Whether a field is required (from the logical model) is often used to determine if it is nullable in our database models. In an XSD, you could define the element to be “nillable” but that actually makes your XML larger because the “null” element must be present and defined as such. It is simpler to make its presence optional using minOccurs="0" instead and making sure this translates into an “unknown” value any programs that need to handle that. Just as we’ve built up skills and common usage in our database designs, we will need to build up our skills in modelling document schemas. 

A portion of the XSD would look like:

<complexType name="Address">
 <element name="AddressLine" maxOccurs="5" minOccurs="0" type="string"/>
 <element name="StateCode"   maxOccurs="1" minOccurs="0" type="string"/>
 <element name="CountryCode" maxOccurs="1" minOccurs="0" type="string"/>
 <element name="PostalCode"  maxOccurs="1" minOccurs="0" type="string"/>
</complexType>

<complexType name="Customer">
 <element name="AccountNumber" maxOccurs="1" minOccurs="1" type="string"/>
 <element name="CustomerTypeCode" maxOccurs="1" minOccurs="1" type="string"/>
 <element name="Address" maxOccurs="5" minOccurs="0" type="Address"/>
</complexType> 

The same schema information in JSON Schema format would look like:

 { "$schema": "http://json-schema.org/draft-04/schema#",
 "definitions": {
   "address": { "type": "object", "properties":
       { "AddressLine": {"type": "array", "items":
        { "type": "string", "maxLength": 55 }, "maxItems": 5 },
         "StateCode": { "type": "string", "maxLength": 3 }
         "CountryCode": { "type": "string", "maxLength": 3 },
         "PostalCode": { "type": "string", "maxLength": 15 },
       }},
   },
   "type": "object", "properties":
   {
     "Customer": { "type": "object", "properties": {
     "AccountNumber": {"type": "string" }
     "CustomerTypeCode": {"type": "string" }
     { "Addresses": {"type": "array", "items":
         { "Address": { "$ref": "#/definitions/address" }},
         }}

Both schema definitions contain most of the information we need for dealing with our data fields. XSD is more thorough in what it can handle, so it is probably best to work with it and generate a JSON schema from the XSD.

Once you have the final XSD, you’ll need to make sure it is available and kept up to date. Programmers will need to make sure their programs create documents that comply with the schema, regardless of the programming language. Business analysts will be able to refer to the schema in their future requirements and user stories. Data scientists will be able to use the schema to build better analytical processes. Compatible XML and JSON documents that conform to the schema can be stored and transferred across platforms. Everyone benefits from a well-designed schema. 

Recent Stories
Db2 for z/OS locking basics

A Beginners Guide to Locks, Latches, Claims and Drains in DB2 for z/OS by Steve Thomas

Fun with Date Ranges