Validating JSON Documents

Using JSON is a lot easier than using XML. Manually editing a document is much more manageable and adjusting the structure is quick and easy. It is also easier to create an invalid JSON document without knowing it.

About a year and a half ago, I put together an article talking a little about the ability to define document schemas (https://www.idug.org/p/bl/et/blogaid=644) for XML and JSON documents. At that time, the JSON Schema draft specification was at Draft-04. It is now at Draft-07 and becoming more robust. Tools and libraries are becoming available to use the JSON schemas.

Over the last year, I've been working a lot on the design aspects around managing very complex documents and the schemas that define them. To try to eliminate errors in the schema encoding, I've actually built tools that generate the schema from a known hierarchy and the fields in that hierarchy. Those tools also generate correspond data access objects (DAOs) to be used in various languages, providing a level of consistency that might not be there if you used the schema itself to generate the DAOs.

In this article, I'm going to provide a working example of a JSON Schema, the document structure and a program that can use the schema to validate our documents.

The Object Structure

For the examples in this article, I'm using a fairly simple structure for the hierarchy. Each document represents a martial arts style within a dojo. I've used dojo as an outer container with the main content being within the style and its children. The expectation is that we're collecting one document per style and storing them in a file or in a JSON document within separate rows of a table. Each style has an array of ranks that describe the levels that can be attained and an array of classes that are taught. Each class then has an array of sessions identifying when the classes are held, class size, etc.

A quick shorthand we use for identifying the paths within the hierarchy looks like the following:

dojo
dojo.style
dojo.style.ranks[]
dojo.style.classes[]
dojo.style.classes[].sessions[]

Here is the expanded list showing fields within the hierarchy.

dojo
dojo.style
dojo.style.styleName
dojo.style.countryOfOrigin
dojo.style.description
dojo.style.instructors[]
dojo.style.ranks[]
dojo.style.ranks[].rankSequence
dojo.style.ranks[].description
dojo.style.ranks[].rankName
dojo.style.classes[]
dojo.style.classes[].className
dojo.style.classes[].description
dojo.style.classes[].instructor
dojo.style.classes[].sessions[]
dojo.style.classes[].sessions[].dayOfWeek
dojo.style.classes[].sessions[].endTime
dojo.style.classes[].sessions[].startTime
dojo.style.classes[].sessions[].studentLimit

Looking at the hierarchy and fields this way is easier for humans to manage and we can use the full path to a field to dynamically get to the specific fields we are interested in using.

Assuming you've been working with Db2 for at least some period of time, you'll notice this isn't a proper normalized structure. In this case, imagine the data will stored in JSON columns in Db2 that are read in by a website when a potential martial arts student has clicked on the style they are interested in viewing. The web server can retrieve the information about that style much more quickly than assembling it from four separate relational tables. That wouldn't work for a generic business application but it'll perform quite well here.

The JSON Schema

The JSON Schema definition is more verbose. In part, this is due to the choice to use JSON structures to define the schema. Still, it is a lot easier to figure out than trying to read an XSD.

In the following schema structure, I include a definitions section that defines the 2nd level structures then I define the highest level to include the second level structures. In this somewhat simple document structure, the benefits are not as obvious, but being able to re-use portions of schema definitions in other locations is extremely beneficial. Think of an Address object where you want the same fields incorporated in multiple locations. You'd define the "address" in the definitions and refer to that as I refer to the "style" entry within my "dojo" specification.

File: dojo-schema.json

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "dojo",
  "definitions": {
    "style": {
      "type": "object",
      "properties": {
        "countryOfOrigin": {
          "type": "string"
        },
        "description": {
          "type": "string"
        },
        "instructors": {
          "type": "array",
          "maxItems": 99,
          "items": {
            "type": "string"
          }
        },
        "styleName": {
          "type": "string"
        },
        "classes": {
          "type": "array",
          "maxItems": 10,
          "items": {
            "type": "object",
            "properties": {
              "className": {
                "type": "string"
              },
              "description": {
                "type": "string"
              },
              "instructor": {
                "type": "string"
              },
              "sessions": {
                "type": "array",
                "maxItems": 99,
                "items": {
                  "type": "object",
                  "properties": {
                    "dayOfWeek": {
                      "type": "string"
                    },
                    "endTime": {
                      "type": "string",
                      "format": "time"
                    },
                    "startTime": {
                      "type": "string",
                      "format": "time"
                    },
                    "studentLimit": {
                      "type": "number"
                    }
                  }
                }
              }
            }
          }
        },
        "ranks": {
          "type": "array",
          "maxItems": 99,
          "items": {
            "type": "object",
            "properties": {
              "description": {
                "type": "string"
              },
              "rankName": {
                "type": "string"
              },
              "rankSequence": {
                "type": "number"
              }
            }
          }
        }
      }
    }
  },
  "properties": {
    "dojo": {
      "type": "object",
      "properties": {
        "style": {
          "$ref": "#/definitions/style"
        }
      }
    }
  }
}

The specification at the top "$schema": "http://json-schema.org/draft-07/schema#" identifies the version of the JSON Schema specification we are using for our schema. Draft 07 is current. Draft 08 is expected to be published in early 2019.

Within the schema, we define our entities and fields as JSON objects. The attributes of each field is defined within the object. Supported field types are:

  • null
  • boolean
  • object
  • array
  • number
  • string

For dates, times and timestamps, we can provide additional format information to indicate the field is expected to contain a date, time, or date-time. Use of these are strongly recommended to ensure programs are using the correct date/time formats. Without this, date/time strings could contain almost anything. Or, worse, the programmers creating the JSON documents from Java might take a short cut and simply store them as millisecond integers.

You can find current information about the JSON Schema draft at https://json-schema.org/.

The Sample Documents

I have created two sample documents. For ease of reading, I'll include them here with line-breaks for readability.

Record 1

{
  "dojo": {
    "style": {
      "styleName": "Aikido",
      "countryOfOrigin": "Japan",
      "description": "Aikido is a Japanese martial art developed by Morihei Ueshiba (often referred to by his title 'O Sensei' or 'Great Teacher'). On a purely physical level it is an art involving some throws and joint locks that are derived from Jujitsu and some throws and other techniques derived from Kenjutsu. Aikido focuses not on punching or kicking opponents, but rather on using their own energy to gain control of them or to throw them away from you. It is not a static art, but places great emphasis on motion and the dynamics of movement. ",
      "instructors": [
        "Morihei Ueshiba",
        "Newby Senpai",
        "Aaron Shodan"
      ],
      "classes": [
        {
          "className": "Introduction to Aikido",
          "instructor": "Newby Senpai",
          "description": "Class designed for absolute beginners with no experience.",
          "sessions": [
            {
              "dayOfWeek": "Saturday",
              "startTime": "10:00",
              "endTime": "11:00",
              "studentLimit": 20
            }
          ]
        },
        {
          "className": "Intermediate Aikido",
          "instructor": "Aaron Shodan",
          "description": "Class designed for general practice. All levels welcome.",
          "sessions": [
            {
              "dayOfWeek": "Monday",
              "startTime": "18:00",
              "endTime": "19:00",
              "studentLimit": 12
            },
            {
              "dayOfWeek": "Tuesday",
              "startTime": "18:00",
              "endTime": "19:00",
              "studentLimit": 12
            },
            {
              "dayOfWeek": "Wednesday",
              "startTime": "18:00",
              "endTime": "19:00",
              "studentLimit": 12
            },
            {
              "dayOfWeek": "Saturday",
              "startTime": "12:00",
              "endTime": "14:00",
              "studentLimit": 20
            }
          ]
        },
        {
          "className": "Advanced Aikido",
          "instructor": "Morihei Ueshiba",
          "description": "For those working on perfecting their black belt skills",
          "sessions": [
            {
              "dayOfWeek": "Sunday",
              "startTime": "09:00",
              "endTime": "12:00",
              "studentLimit": 10
            }
          ]
        }
      ],
      "ranks": [
        {
          "rankSequence": 0,
          "rankName": "white belt",
          "description": "Absolute Beginner"
        },
        {
          "rankSequence": 2,
          "rankName": "yellow belt",
          "description": "Might be getting the hang of it."
        },
        {
          "rankSequence": 3,
          "rankName": "red belt",
          "description": "Getting a little better."
        },
        {
          "rankSequence": 9,
          "rankName": "brown belt",
          "description": "Getting serious here."
        },
        {
          "rankSequence": 11,
          "rankName": "shodan",
          "description": "First degree black belt."
        },
        {
          "rankSequence": 12,
          "rankName": "nidan",
          "description": "Second degree black belt."
        }
      ]
    }
  }
}

Record 2

{
  "dojo": {
    "style": {
      "styleName": "Karate",
      "countryOfOrigin": "Japan",
      "description": "Karate is a Japanese martial art that a lot of people know about.",
      "instructors": [
        "Sumo Wrestler"
      ],
      "classes": [
        {
          "className": "Introduction to Karate",
          "instructor": "Waldo Senpai",
          "description": "Class designed for absolute beginners with no experience.",
          "sessions": [
            {
              "dayOfWeek": "Saturday",
              "startTime": "08:00",
              "endTime": "09:00",
              "studentLimit": "30"
            }
          ]
        },
        {
          "className": "Intermediate Karate",
          "instructor": "Aaron Shodan",
          "description": "Class designed for general practice. All levels welcome.",
          "sessions": [
            {
              "dayOfWeek": "Monday",
              "startTime": "17:00",
              "endTime": "18:00",
              "studentLimit": 12
            },
            {
              "dayOfWeek": "Tuesday",
              "startTime": "17:00",
              "endTime": "18:00",
              "studentLimit": 20
            },
            {
              "dayOfWeek": "Wednesday",
              "startTime": "17:00",
              "endTime": "18:00",
              "studentLimit": 20
            },
            {
              "dayOfWeek": "Saturday",
              "startTime": "12:00",
              "endTime": "14:00",
              "studentLimit": 20
            }
          ]
        }
      ],
      "ranks": [
        {
          "rankSequence": 0,
          "rankName": "white belt",
          "description": "Absolute Beginner"
        },
        {
          "rankSequence": 2,
          "rankName": "yellow belt",
          "description": "Might be getting the hang of it."
        },
        {
          "rankSequence": 3,
          "rankName": "red belt",
          "description": "Getting a little better."
        },
        {
          "rankSequence": 9,
          "rankName": "brown belt",
          "description": "Getting serious here."
        },
        {
          "rankSequence": 11,
          "rankName": "shodan",
          "description": "First degree black belt."
        },
        {
          "rankSequence": 12,
          "rankName": "nidan",
          "description": "Second degree black belt."
        }
      ]
    }
  }
}

In the file, it is important that the records not be formatted prettily like that and that they have only the correct line separate (usually a newline character) between each of the records. The file I'm storing these in is called dojo-working.json.

The Validation Program

Next, we have a simple Python validation program. First, it reads in the schema we've built and validates it against the JSON Schema specification. The line with v = Draft7Validator(schema) performs this action and throws an error if the schema has an issue. The return from successful execution is a validator v that is used later to prevent re-compiling the schema as we process each record.

The program reads the JSON documents as records from the specified input file and validates each record against the schema. The line with errors = sorted(v.iter_errors(json_record), key=str) does this and returns a sorted list of errors if any are found.

All the heavy lifting is done using a library named jsonschema that can be found at https://github.com/Julian/jsonschema with documentation at https://python-jsonschema.readthedocs.io/en/latest/. There are many other available libraries, but this one worked well for my needs and gets the job done with minimal code on my part.

Program validate_json_records.py

#!/usr/bin/env python3
# Validate JSON records in a file against schema
# Expects each record to be a complete JSON document.

import
argparse
import os
import pathlib
import json
from jsonschema import Draft7Validator, FormatChecker, ValidationError, SchemaError, validate

# Initialization
count_records = 0

# Arguments with default values
parser = argparse.ArgumentParser(description = "Validate JSON Records against JSON schema")
parser.add_argument(
"filename", help = "File containing JSON records", default = "john.json")
parser.add_argument(
"-S", "--schema", help = "JSON schema file", default = "json/dojo-schema.json")
parser.add_argument(
"-F", "--formatted", action="store_true", help="File contains formatted JSON with CRLF line delimiters")

args
= parser.parse_args()
if args.formatted:
delimiter
= '\r\n'
else:
delimiter
= '\n'

# Our Schema file gets validated first
print("Using JSON schema file:",args.schema)

if
os.path.exists(args.schema):
schemaText
= open(args.schema,'r').read()
schema
= json.loads(schemaText)
try:
v
= Draft7Validator(schema)
except SchemaError as e:
print("Schema validation failed")
print(e)
else:
print("Error: Schema file does not exist")

# Our Records then get read and validated against the schema
if os.path.exists(args.filename):
with open(args.filename, 'r', newline=delimiter, encoding='utf-8-sig') as filein:
for record in filein:
count_records
+= 1
try:
try:
json_record
= json.loads(record)
except json.decoder.JSONDecodeError as e:
print("Error decoding JSON in record",count_records,"...Quitting")
exit()
errors
= sorted(v.iter_errors(json_record), key=str)
if errors:
for error in errors:
print("Record",count_records,error.path,error.message)
else:
print("Record",count_records,"is valid")
except ValidationError as e:
print(e.message)
else:
print("Error: File does not exist")

The line with json_record = json.loads(record) is critical to the use of the jsonschema library. The validation library doesn't validate a JSON text string. It validates a python structure that the JSON data has been loaded into. This is an easy point to miss in the documentation.

Running the Validator

I'm running on macOS, so those of you on Linux would see something very much like I have here. If you're on Windows 10, you can do the same using its integrated Linux environment. If you're on an earlier version of Windows, the command prompt and use of path name separators will be a little different.

From the project directory where I have the program and files save, I simply run:

Magni:dojo-samples john$ ./validate_json_records.py -S json/dojo-schema.json json/dojo-working.json
Using JSON schema file: json/dojo-schema.json
Record 1 is valid
Record 2 is valid
Magni:dojo-samples john$

It worked. That's boring. Let's create an error and see what happens. I'm going to change the second record from "styleName": "Karate", to "styleName": Karate, and re-run the validator.

Using JSON schema file: json/dojo-schema.json
Record 1 is valid
Error decoding JSON in record 2 ...Quitting
Magni:dojo-samples john$

That gave us an error from the JSON loads() function because it isn't valid JSON syntax. Good, but not what we're looking for in a schema validation error. Let's change that back and change "studentLimit": 30 to "studentLimit": "30" which is valid JSON.

Magni:dojo-samples john$ ./validate_json_records.py -S json/dojo-schema.json json/dojo-working.json
Using JSON schema file: json/dojo-schema.json
Record 1 is valid
Record 2 deque(['dojo', 'style', 'classes', 0, 'sessions', 0, 'studentLimit']) '30' is not of type 'number'
Magni:dojo-samples john$

There we go. That is an actual schema validation error because the schema specifies studentLimit as being a number and we stored it as a string. You can see the path to the field with the error. At some point, I'll enhance my program to output the path using the notation we commonly use dojo.style.classes[0].sessions[0].studentLimit but it works well enough as it is for now.

Errors in the JSON Schema

For my next trick, I'm going to leave that error in place in the second record and I'm going to modify the schema slightly.

File dojo-bad-schema.json

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "dojo",
  "definitions": {
    "style": {
      "type": "object",
      "properties": {
        "countryOfOrigin": {
          "type": "string"
        },
        "description": {
          "type": "string"
        },
        "instructors": {
          "type": "array",
          "maxItems": 99,
          "items": {
            "type": "string"
          }
        },
        "styleName": {
          "type": "string"
        },
        "classes": {
          "type": "array",
          "maxItems": 10,
          "items": {
            "type": "object",
            "properties": {
              "className": {
                "type": "string"
              },
              "description": {
                "type": "string"
              },
              "instructor": {
                "type": "string"
              },
              "sessions": {
                "type": "array",
                "maxItems": 99,
                "items": {
                  "type": "object",
                  "properties": {
                    "dayOfWeek": {
                      "type": "string"
                    },
                    "endTime": {
                      "type": "string",
                      "format": "time"
                    },
                    "startTime": {
                      "type": "string",
                      "format": "time"
                    },
                    "studentLimit": {
                      "type": "number"
                    }
                  }
                }
              }
            }
          }
        },
        "ranks": {
          "type": "array",
          "maxItems": 99,
          "items": {
            "type": "object",
            "properties": {
              "description": {
                "type": "string"
              },
              "rankName": {
                "type": "string"
              },
              "rankSequence": {
                "type": "number"
              }
            }
          }
        }
      }
    }
  },
  "dojo": {
    "type": "object",
    "properties": {
      "style": {
        "$ref": "#/definitions/style"
      }
    }
  }
}

Can you spot what I changed? It is such a simple error that I made in my original JSON schema for the project I was working on. I had omitted the "properties": { } around the final dojo object definition. This is still valid JSON and passes the schema validation. Unfortunately, when I use it, here is the result:

Magni:dojo-samples john$ ./validate_json_records.py -S json/dojo-bad-schema.json json/dojo-working.json
Using JSON schema file: json/dojo-bad-schema.json
Record 1 is valid
Record 2 is valid
Magni:dojo-samples john$

In fact, every valid JSON document succeeds when validated against the bad schema.

Conclusion

If you are working with JSON, start developing JSON schemas for your documents. Design the schema by modeling the data and adjusting it to a document structure. Don't just model some tables and throw them into a document schema. Re-think the way the data should be organized based on the way your documents will be used.

Once you have a schema, make sure someone is running validation routines against your documents. JSON is easier to work with than XML. It is also much easier to get your documents wrong.

Recent Stories
Installing Db2 the Easy Way: Docker

March Spotlight – Db2 Developer C

Calling Db2 12 for z/OS, Are you available ?