Db2 JSON Support Update and Some More Python Fun

As Db2 becomes a better database for the implementation of hybrid datastores, we're starting to see support for JSON become more substantial.

Original Db2 Support for JSON

More than 5 years ago, Db2 introduced the ability to store JSON documents (https://developer.ibm.com/articles/dm-1306nosqlforjson1/). This worked in both DB2 for z/OS V10 and DB2 for Linux, UNIX, and Windows 10.5.

A wire listener (IBM NoSQL Wire Listener for DB2) was supplied to allow the direct migration of MongoDB applications to Db2 using their existing client software. MongoDB clients talk to the server over a custom wire protocol in which nearly everything is expressed as a document. JSON itself, however, is a limited format for storing documents: it defines only a handful of data types, and its character-based structure wastes storage. MongoDB uses BSON (Binary JSON) to get past those limitations; details on BSON can be found at http://bsonspec.org/.

The available client modules build on this by hiding the protocol details and letting the programmer work with native language structures. The pymongo module, one of the more popular Python clients, translates Python data structures to and from BSON when interacting with MongoDB, so there is no need to manage the JSON directly. The Db2 wire listener was written to support the same clients using the same features. It hides the mechanics of data storage from the developer, creating tables behind the scenes to store document collections. More information on the Db2 wire listener can be found at https://www.ibm.com/developerworks/data/library/techarticle/dm-1306nosqlforjson4/.
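To see one of JSON's type limitations concretely, here is a stdlib-only illustration (the field names are made up): plain JSON has no date/time type, so Python's json module refuses to serialize a datetime, and the usual workaround is a lossy conversion to a string. BSON, by contrast, has a native date type.

```python
import json
from datetime import datetime

record = {"className": "Karate", "startTime": datetime(2019, 1, 7, 18, 30)}

# Plain JSON has no date/time type -- every value must be mapped onto
# strings, numbers, booleans, objects, arrays, or null -- so the
# standard library refuses to serialize a datetime as-is.
try:
    json.dumps(record)
except TypeError as e:
    print("JSON serialization failed:", e)

# The usual workaround is lossy: convert to an ISO-8601 string and
# remember, by convention, that the field is really a timestamp.
record["startTime"] = record["startTime"].isoformat()
print(json.dumps(record))
```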

Unfortunately, the listener has not kept pace with the MongoDB wire protocol and does not work with current client versions. You'll receive the following error if you try to use it from a current pymongo client:

pymongo.errors.ConfigurationError: Server at centos64:27017 reports wire version 0, but this version of PyMongo requires at least 2 (MongoDB 2.6).

Along with the wire listener, Db2 included a JSON command-line interface (https://www.ibm.com/developerworks/data/library/techarticle/dm-1306nosqlforjson2/) used to manage the JSON databases used by the wire listener and a Java API (https://www.ibm.com/developerworks/data/library/techarticle/dm-1307nosqlforjson3/). The documents were stored in BLOBs using BSON format. At the time these features were added to Db2, MongoDB was a 32-bit database with major limitations on the objects it could support stemming from that implementation.

This was all well and good for people who wanted to store JSON in Db2 without changing their (back-level) client programs, but it didn't add much reason to choose Db2 for document storage beyond that.

Db2 Functions for using JSON

The next level of support allowed us to access and manipulate JSON documents from SQL statements. The documents still needed to be stored as BSON in BLOB columns, but now they could be in any table. The supported functions included:

BSON2JSON - Converts a BSON-formatted document into a JSON string
BSON_VALIDATE - Checks that a BSON field in a BLOB column is correctly formatted
JSON2BSON - Converts a JSON string into BSON document format
JSON_GET_POS_ARR_INDEX - Finds a value within an array
JSON_LEN - Returns the number of elements in an array field inside a document
JSON_TABLE - Returns a table of values for an array field
JSON_TYPE - Returns the data type of a specific field within a JSON document
JSON_UPDATE - Updates a field or document using set syntax
JSON_VAL - Extracts data from a JSON document into SQL data types
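To get a feel for how these first-generation functions fit together, here is a sketch of the SQL a client program might issue. The table and column names (DOJO.DOCUMENTS, DOC) are hypothetical, the statements are shown as Python strings as they might be handed to a driver such as ibm_db, and they have not been run against a live database; treat the exact shapes as my reading of the documentation, not gospel.

```python
# Sketch only (not executed against Db2): the first-generation JSON
# functions live in the SYSTOOLS schema. DOJO.DOCUMENTS and DOC are
# hypothetical names for a table with a BLOB column holding BSON.

# Store a JSON string as BSON in a BLOB column.
insert_sql = (
    "INSERT INTO DOJO.DOCUMENTS (DOC) "
    "VALUES (SYSTOOLS.JSON2BSON(?))"
)

# Pull a scalar out of each stored document: JSON_VAL takes the BSON
# column, a field path, and a result-type code ('s:64' = VARCHAR(64)).
select_sql = (
    "SELECT SYSTOOLS.JSON_VAL(DOC, 'style.styleName', 's:64') "
    "FROM DOJO.DOCUMENTS"
)

# Round-trip a stored document back to readable JSON text.
fetch_sql = "SELECT SYSTOOLS.BSON2JSON(DOC) FROM DOJO.DOCUMENTS"

for sql in (insert_sql, select_sql, fetch_sql):
    print(sql)
```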

This was a good step and provided some useful ways to get at the content of our documents. In theory, we could even create indexes on these functions to provide search capabilities equivalent to a native document datastore. In practice, it made Db2 a good option for storing documents when an application also uses relational tables and really only needs to store and retrieve the full JSON document.

New JSON Support Functions

With Db2 version 11.1.4.4, JSON support has been enhanced to follow the now-published SQL standard for dealing with JSON documents. If you like reading standards documents, you can find it at https://www.iso.org/standard/67367.html.

Oddly, the standard does not define a JSON datatype. Instead, we are to use standard string and binary columns to store our JSON documents and standard functions are defined to allow us to access the content of the documents. We get to decide whether the data will be stored in character (JSON) or binary (BSON) format.

Our new functions, which are under the SYSIBM schema, include:

BSON_TO_JSON - Converts a BSON-formatted document into a JSON string
JSON_TO_BSON - Converts a JSON string into BSON document format
JSON_QUERY - Extracts a JSON object from a JSON object
JSON_VALUE - Extracts an SQL scalar value from a JSON object
JSON_EXISTS - Determines whether a value exists in a document
JSON_ARRAY - Creates a JSON array from input values
JSON_OBJECT - Creates a JSON object from input key-value pairs
JSON_TABLE - Creates relational output from a JSON object
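For comparison with the older functions, here is a sketch of queries using the SQL-standard functions. Again, the table and column names (DOJO.DOCUMENTS, DOC, DOJO.STYLES) are hypothetical, the statements have not been run against Db2, and details such as the FORMAT BSON clause and the SQL/JSON path syntax are my reading of the documentation rather than tested code.

```python
# Sketch only (not executed against Db2 11.1.4.4): SQL-standard JSON
# functions. Documents are assumed to be stored as BSON, hence the
# FORMAT BSON clause; paths use SQL/JSON path syntax ($.field.field).

# Extract a scalar as an SQL type.
value_sql = (
    "SELECT JSON_VALUE(DOC FORMAT BSON, '$.style.styleName' "
    "RETURNING VARCHAR(64)) "
    "FROM DOJO.DOCUMENTS"
)

# Filter on whether a field is present at all.
exists_sql = (
    "SELECT COUNT(*) FROM DOJO.DOCUMENTS "
    "WHERE JSON_EXISTS(DOC FORMAT BSON, '$.style.ranks')"
)

# Going the other way: build a JSON document from relational values.
object_sql = (
    "SELECT JSON_OBJECT(KEY 'styleName' VALUE S.STYLE_NAME) "
    "FROM DOJO.STYLES S"
)

for sql in (value_sql, exists_sql, object_sql):
    print(sql)
```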

Because these are official functions, we'll be able to use them to create indexes on our JSON data. I've been looking forward to playing with them.

The official Db2 documentation is at https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.swg.im.dbclient.json.doc/doc/c0070285.html.

The Python JSON Struggle for Good Data Access Classes

Since my article earlier this month (https://www.idug.org/p/bl/et/blogaid=813) I've been preparing by trying to get my JSON documents in and out of proper class data structures. I'd already prepared my Dojo JSON Schema, and the next step was to generate classes that would do more than simply import the JSON data into generic Python structures (which is what json.loads(my_json) does in the sample validation program).

This turned out to be more work than I had expected. There are a number of Python libraries that could be used, but they don't handle complex nested data structures well. I tried several before I stumbled on a reference to a relatively new Python feature: Data Classes (https://docs.python.org/3/library/dataclasses.html). Finally, a nice declarative way to manage my classes.
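For anyone who hasn't met data classes yet, here is a minimal stdlib-only taste (the Rank class is a simplified stand-in for the generated classes shown below): the fields are declared once, and __init__, __repr__, and __eq__ are generated for us.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Declare the fields once; the dataclass decorator generates
# __init__, __repr__, and __eq__ from the declarations.
@dataclass
class Rank:
    rankName: Optional[str] = None
    rankSequence: Optional[int] = None

rank = Rank(rankName="Shodan", rankSequence=1)
print(rank)          # Rank(rankName='Shodan', rankSequence=1)
print(asdict(rank))  # back to a plain dict, ready for json.dumps
```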

"""DOJO document model using Python Data Classes

   This module was generated to match the DOJO schema
   generated at 2018-12-31 14:38
"""

from typing import List, Optional
from datetime import time
from dataclasses import dataclass
from mashumaro import DataClassJSONMixin

@dataclass
class Sessions(DataClassJSONMixin):
    """Class Sessions generated ODM"""

    dayOfWeek: Optional[str] = None
    endTime: Optional[time] = None
    startTime: Optional[time] = None
    studentLimit: Optional[int] = None
# end of class Sessions

@dataclass
class Classes(DataClassJSONMixin):
    """Class Classes generated ODM"""

    sessions: Optional[List[Optional[Sessions]]] = None
    className: Optional[str] = None
    description: Optional[str] = None
    instructor: Optional[str] = None
# end of class Classes

@dataclass
class Ranks(DataClassJSONMixin):
    """Class Ranks generated ODM"""

    description: Optional[str] = None
    rankName: Optional[str] = None
    rankSequence: Optional[int] = None
# end of class Ranks

@dataclass
class Style(DataClassJSONMixin):
    """Class Style generated ODM"""

    classes: Optional[List[Optional[Classes]]] = None
    ranks: Optional[List[Optional[Ranks]]] = None
    countryOfOrigin: Optional[str] = None
    description: Optional[str] = None
    instructors: Optional[List[Optional[str]]] = None
    styleName: Optional[str] = None
# end of class Style

@dataclass
class Dojo(DataClassJSONMixin):
    """Class Dojo generated ODM"""

    style: Style
# end of class Dojo
# end of module dojo
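The DataClassJSONMixin base comes from the mashumaro library, which generates from_dict/to_dict/from_json/to_json conversions for these classes. If you just want to see the shape of that round trip without installing the dependency, here is a stdlib-only sketch using simplified stand-in classes (the hand-written from_dict is my own illustration of what mashumaro generates, not its actual code):

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional

@dataclass
class Rank:
    rankName: Optional[str] = None
    rankSequence: Optional[int] = None

@dataclass
class Style:
    styleName: Optional[str] = None
    ranks: Optional[List[Rank]] = None

    @classmethod
    def from_dict(cls, d):
        # Hand-rolled stand-in for the conversion mashumaro generates.
        ranks = [Rank(**r) for r in d.get("ranks", [])]
        return cls(styleName=d.get("styleName"), ranks=ranks)

doc = json.loads(
    '{"styleName": "Judo",'
    ' "ranks": [{"rankName": "Shodan", "rankSequence": 1}]}'
)
style = Style.from_dict(doc)      # generic dict -> typed classes
print(style.ranks[0].rankName)    # typed attribute access
print(json.dumps(asdict(style)))  # typed classes -> dict -> JSON text
```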

On its own, this didn't get me very far; it took a lot of exploration and the addition of the following library:

from mashumaro import DataClassJSONMixin

This line, and the addition of having my classes inherit from DataClassJSONMixin, finally got me to the point I can move my data between JSON documents, standard Python structures, and these classes. Here's a sample test program:

#!/usr/bin/env python3
"""Testing with JSON Documents, dataclass, and mashumaro
   --> testing conversions
"""

import os
import pathlib
import argparse
import datetime
import json
import mashumaro
from jsonschema import Draft7Validator, FormatChecker, ValidationError, SchemaError, validate
from lib import dojo_dataclass as dojo # my data classes

in_file_name = 'json/dojo-working.json'
count_records = 0

# --- Helper Functions
def path_from_deque(in_path):
    """Build our path_to_field from deque structure"""
    new_path = ''
    path_separator = ''
    while True:
        try:
            new_node = in_path.popleft()
            if isinstance(new_node,int):
                new_path += '[' + str(new_node) + ']'
            else:
                new_path += path_separator + new_node
            path_separator = '.'
        except IndexError:
            break
    return new_path

# --- Prepare for schema validation
in_schema_name = 'json/dojo-schema.json'
print("Using JSON schema file:",in_schema_name)
if os.path.exists(in_schema_name):
    schemaText = open(in_schema_name,'r').read()
    schema = json.loads(schemaText)
    try:
        # check_schema() is needed here; the constructor alone does not
        # validate the schema document itself
        Draft7Validator.check_schema(schema)
        validator = Draft7Validator(schema)
    except SchemaError as e:
        print("Schema validation failed")
        print(e)
        raise SystemExit(1)
else:
    print("Error: Schema file does not exist")
    raise SystemExit(1)
# --- Process Dojo Records
dojo_instance = None
if os.path.exists(in_file_name):
    filein = open(in_file_name,'r', newline='\n',encoding='utf-8-sig')
    for record in filein:
        count_records += 1

        # -- Validate incoming record against schema
        try:
            try:
                json_record = json.loads(record)
            except json.decoder.JSONDecodeError as e:
                print("Error decoding JSON in record",count_records,"skipping further tests")
                continue
            errors = sorted(validator.iter_errors(json_record), key=str)
            if errors:
                print("Error validating JSON in record",count_records,"...skipping")
                for error in errors:
                    print("- Record",count_records,path_from_deque(error.path),error.message)
                continue
            else:
                print("Original JSON is valid",count_records)
        except ValidationError as e:
            print("Error with JSON structure in record",count_records,"...skipping")
            print(e.message)
            continue

        # -- Python dict variation
        print("Trying JSON -> dict -> dojo")
        try:
            json_record = json.loads(record)
            dojo_from_dict = dojo.Dojo.from_dict(json_record['dojo'])
            dict_from_dojo = { "dojo": dojo_from_dict.to_dict() }
            json_from_dict  = json.dumps(dict_from_dojo)
            print("debug, record", count_records,"styleName",dict_from_dojo['dojo']['style']['styleName'])
            print(json_from_dict[:80])
            errors = sorted(validator.iter_errors(dict_from_dojo), key=str)
            if errors:
                print("Error validating JSON in record",count_records)
                for error in errors:
                    print("- Record",count_records,path_from_deque(error.path),error.message)
            else:
                print("Reconstructed Dojo dict is valid")
        except json.decoder.JSONDecodeError as e:
            # TODO: Better error handling
            print("Error, record", count_records)
            print("- Error decoding original JSON in record",count_records)
        except mashumaro.exceptions.MissingField as e:
            # TODO: Better error handling
            print("Error, record", count_records)
            print("- Missing Field Error converting to object structure")

        # -- Python JSON variation
        print("Trying JSON -> dojo -> dict/JSON")
        try:
            # Find second object start position
            start = record.find('{',1)
            # Ignore trailing ' }' -- NOTE: This is bad code
            end   = len(record) - 2
            # Pull just the inner objects from the JSON string
            json_dojo = record[start:end]
            # Load Dojo from JSON
            dojo_from_json = dojo.Dojo.from_json(json_dojo)
            # Convert Dojo back to new JSON
            json_from_dojo = '{ "dojo": ' + dojo_from_json.to_json() + '}'
            # Convert Dojo object to another "dojo" dict
            dict_from_dojo = { "dojo": dojo_from_json.to_dict() }
            # Validation of reconstructed Dojo dict
            errors = sorted(validator.iter_errors(dict_from_dojo), key=str)       
            if errors:
                print("Error validating JSON in record",count_records)
                for error in errors:
                    print("- Record",count_records,path_from_deque(error.path),error.message)
            else:
                print("Reconstructed Dojo dict via Dojo is valid")
                # It would be nice to be able to compare the JSON strings,
                # but the whitespace will be different
                # Also, the output JSON string includes NULL fields
                try:
                    # Load new JSON to generic dict structure and validate again
                    new_dict = json.loads(json_from_dojo)
                except json.decoder.JSONDecodeError as e:
                    print("Error decoding JSON in record",count_records,"skipping further tests")
                    continue
                # Validation of reconstructed Dojo
                errors = sorted(validator.iter_errors(new_dict), key=str)         
                if errors:
                    print("Error validating JSON in record",count_records)
                    for error in errors:
                        print("- Record",count_records,path_from_deque(error.path),error.message)
                else:
                    print("Reconstructed Dojo dict via Dojo JSON is valid")
        except mashumaro.exceptions.MissingField as e:
            # TODO: Better error handling
            print("Error, record", count_records)
            print("- Missing Field Error converting to object structure")

    filein.close()
else:
    print("Error: Dojo file does not exist")

# -- Done
print("Done")
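If you want a quick sanity check of the path_from_deque helper without running the whole program, it can be exercised on its own. The function below is copied from the listing above; the sample path is made up to look like what jsonschema reports (a deque of string keys and integer array indexes).

```python
from collections import deque

def path_from_deque(in_path):
    """Build a dotted path string from a jsonschema error deque
       (copy of the helper in the listing above)."""
    new_path = ''
    path_separator = ''
    while True:
        try:
            new_node = in_path.popleft()
            if isinstance(new_node, int):
                # Array indexes render as [n], with no dot before them
                new_path += '[' + str(new_node) + ']'
            else:
                new_path += path_separator + new_node
            path_separator = '.'
        except IndexError:
            # Deque exhausted: the path is complete
            break
    return new_path

# jsonschema reports error locations as a deque of keys and indexes;
# the helper turns that into a familiar dotted path.
path = deque(["dojo", "style", "classes", 0, "className"])
print(path_from_deque(path))  # dojo.style.classes[0].className
```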

You will notice that the outer "dojo" object is removed or added at various points: we strip it from the incoming JSON text after validation and before converting the JSON to the class structure, and we add it back when creating JSON text directly from the data classes. The JSON Schema validation needs it to be part of the document, but the classes (and MongoDB storage) don't want the outermost object to be present. Adding and removing the outer wrapper lets the program work with the classes while still doing what MongoDB expects; in MongoDB, the outer object is essentially the collection and isn't stored as part of the document.
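A safer way to peel off and restore that wrapper than the string slicing in the listing above (which I flagged as bad code) is to parse first and re-serialize just the inner object. A stdlib-only sketch, using a made-up one-line record:

```python
import json

record = '{ "dojo": { "style": { "styleName": "Aikido" } } }'

# Parse the full record, then serialize only the inner object: no
# fragile find('{', 1) / len-2 arithmetic on the raw string.
outer = json.loads(record)
inner_json = json.dumps(outer["dojo"])  # wrapper stripped
print(inner_json)

# Going back out: rewrap under "dojo" before validating or storing.
rewrapped = json.dumps({"dojo": json.loads(inner_json)})
print(rewrapped)
```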

Here is the equivalent set of data classes using the PyMODM object-document mapping library:

"""DOJO document model using PyMODM for use with MongoDB collection

   This module was generated to match the DOJO schema
   generated at 2018-12-31 14:38
"""

from pymodm import MongoModel, EmbeddedMongoModel, fields
from pymongo.write_concern import WriteConcern
from pymongo.read_concern import ReadConcern

class Sessions(EmbeddedMongoModel):
    """Class Sessions generated ODM"""

    dayOfWeek = fields.CharField()
    endTime = fields.CharField()
    startTime = fields.CharField()
    studentLimit = fields.IntegerField()

# end of class Sessions

class Classes(EmbeddedMongoModel):
    """Class Classes generated ODM"""

    sessions = fields.EmbeddedDocumentListField(Sessions)
    className = fields.CharField()
    description = fields.CharField()
    instructor = fields.CharField()

# end of class Classes

class Ranks(EmbeddedMongoModel):
    """Class Ranks generated ODM"""

    description = fields.CharField()
    rankName = fields.CharField()
    rankSequence = fields.IntegerField()

# end of class Ranks

class Style(EmbeddedMongoModel):
    """Class Style generated ODM"""

    classes = fields.EmbeddedDocumentListField(Classes)
    ranks = fields.EmbeddedDocumentListField(Ranks)
    countryOfOrigin = fields.CharField()
    description = fields.CharField()
    instructors = fields.ListField(fields.CharField())
    styleName = fields.CharField()

# end of class Style

class Dojo(MongoModel):
    """Class Dojo generated ODM"""

    style = fields.EmbeddedDocumentField(Style)

    class Meta:
        """Characteristics for connection (database, collection, read/write concern)"""
        connection_alias = 'dojo'
        collection_name  = 'dojo'
        ignore_unknown_fields = True
        final = True
# end of class Dojo
# end of module dojo

The next step is to get my Db2 development copy up to 11.1.4.4 so I can try all of this against Db2. I had hoped to have that working for this article, but there have been too many delays getting the upgrade to behave. In the end, I expect to have client code that can work with the same JSON documents across Db2 and MongoDB datastores. A well-thought-out schema and consistent data classes will be critical to making this worthy of a properly designed application.

Conclusion

With Db2 support for JSON documents improving, we will need to evaluate the true differences between our datastores and the designs that can take advantage of each one's features. A well-designed document schema is a critical part of the picture, and we need corresponding object classes in our programs to ensure production readiness.
