Big Data and SQL

By John Maenpaa posted Oct 27, 2016 06:26 AM


My introduction to DB2 and SQL occurred over 30 years ago. It changed my life.

Before that, I’d been working with IMS databases. Every ad hoc process/report required us to write a program. Every program required a predetermined database access module before we could run that program. Hierarchies of data segments were not only supported, they were required.

Disk space was expensive so bigger datasets were in sequential master files, on magnetic tape. Updates to the master files required reading the file, applying updates from a transaction file, and writing a new master file. Our CICS programs were more about capturing the new transactions than actually applying updates. The master files tended to contain buckets for monthly information so that we could analyze limited historical trends.

True, that was a fairly small shop, but then so was the one I went to after that, where we used DB2. The ability to embed SQL inside a program was liberating. My programs weren't stuck with predetermined access paths. Since then, the power of SQL has only grown, as has the amount of data we maintain in our online relational databases. We still organize our data in hierarchies, normalizing the tables so that each contains properly discrete groups of data elements. We organize our historical data differently, allowing for longer retention and better understanding. We design data warehouses that can handle the heavy lifting of enterprise reporting while minimizing the impact on our transactional databases.

Now, disk has become much less expensive. Files that would have been kept on tape can now be kept on disk in abundance, and we can keep many more copies for much longer durations. With the introduction of data lakes using big data technologies like HDFS, we can expand the storage capability across thousands of servers.

Big data can take all of that transactional data we used to throw away and keep it. It thrives on individual transactions where we can look for patterns in activity. The records showing this activity might be as simple as individual URL hits for a web site or as complex as a full business document in XML or JSON. The simpler activity records are more common, fitting into a flat record-oriented structure fairly easily.

The big data promise is the ability to parallelize operations against large datasets so that we can get results in hours instead of days or weeks. If we can get the same reports out of our transactional data without waiting for the completion of the monthly data warehouse load jobs, then we’ve provided more current data to our business. Along the way we may even have reduced the cost of producing those reports.

We've seen the upswing of NoSQL because purpose-built processes can perform more efficiently than generic RDBMS engines when scaling to large amounts of data. Of course, those purpose-built processes have very limited flexibility. They don't solve all of the business data issues, just the subset that requires large amounts of data. When it works, it delivers, but it also carries a fairly high cost in development effort compared with writing SQL.

Back in that early job working with tape master files, we spent a lot of time writing what my boss used to call “Lawrence Welk jobs” consisting of extract, sort, and report steps. Mainframe processor time was expensive, and we did as much batch bulk processing as possible to optimize our use of resources. In the big data world, we have MapReduce. Conceptually, these are not that different, but the distribution of the data and programs creates more moving pieces to deal with, and the people who deal with that complexity are expensive.

Now we’re back to the tradeoff between custom build-to-order processes and the need for flexible tools that make development easier. SQL is an obvious solution to incorporate here because so many people already know how to use it and we have tools built over the last 30+ years that make it an ideal language for interaction between people and databases.

Apache Hive provides the ability to issue queries (and updates) against HDFS files and HBase data stores. It is extensible, allowing it to be used with various other file types. It is primarily intended for bulk data activities like those usually done in a data warehouse environment. As you move your data into a data lake, you will likely want to try Hive to see how it fits your data and reporting requirements. For those simple extract/sort/report tasks against very large datasets, this is a good approach as long as you don't expect instant results.
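As a sketch of what that looks like, here is a hypothetical Hive example. The table, columns, and HDFS path are all made up for illustration; the point is that an old extract/sort/report batch job collapses into one external-table definition plus one query.

```sql
-- Hypothetical example: expose raw tab-delimited web-log files already
-- sitting in HDFS as a Hive table. No data is moved or converted.
CREATE EXTERNAL TABLE web_hits (
  hit_ts   TIMESTAMP,
  user_id  STRING,
  url      STRING,
  status   INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/lake/web_hits';

-- Monthly hit counts per URL: the classic extract/sort/report job,
-- expressed as a single query that Hive runs in parallel on the cluster.
SELECT url,
       date_format(hit_ts, 'yyyy-MM') AS hit_month,
       COUNT(*)                       AS hits
FROM   web_hits
GROUP BY url, date_format(hit_ts, 'yyyy-MM')
ORDER BY hit_month, hits DESC;
```

Because the table is EXTERNAL, dropping it removes only the metadata; the underlying files stay in the lake for other tools to use.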

Apache HAWQ is an example of a tool that adds more traditional SQL capability to your big data environment. HAWQ runs as a query engine that can access HDFS files and HBase tables. Unlike DB2, it does not actually store any user data. It does, however, maintain its own “catalog” and looks like PostgreSQL with a different way of distributing the work across the nodes in your Hadoop cluster. It will let you create tables that store data on your data nodes and distribute its operations across those nodes.
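To make the distribution idea concrete, here is a hypothetical HAWQ-managed table. The table and column names are invented; the DISTRIBUTED BY clause is the piece that tells HAWQ how to spread rows across the cluster's data nodes.

```sql
-- Hypothetical sketch: rows are hashed on customer_id and spread
-- across the segments running on the cluster's data nodes.
CREATE TABLE order_summary (
  order_id    BIGINT,
  customer_id BIGINT,
  order_total NUMERIC(12,2),
  order_date  DATE
)
DISTRIBUTED BY (customer_id);

-- Aggregations keyed on the distribution column can run largely
-- node-local, each segment working on its own slice of the data.
SELECT customer_id, SUM(order_total) AS lifetime_total
FROM   order_summary
GROUP BY customer_id;
```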

HAWQ uses an installed manager instance, much like DB2 does, for interaction with users and programs. Multiple client tools are available, and it comes with the PostgreSQL interactive client (psql). The HAWQ instance can hold and manage multiple databases, and from a client tool it looks very much like any other PostgreSQL server.

Actual access to HDFS files in HAWQ requires a plugin framework called PXF. With the PXF plugins, you create tables using an extended SQL CREATE EXTERNAL TABLE statement that defines the file location and its format. The format can be simple, like CSV, or complex, like JSON or Avro. Since this maps a single logical table to a document record, it would be of limited benefit for large, complex business documents.
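A minimal sketch of such a definition, with an invented host, path, and column list (the exact LOCATION syntax and profile name vary by PXF version, so treat this as illustrative):

```sql
-- Hypothetical PXF external table over CSV files in HDFS.
-- The LOCATION URL names the PXF service endpoint, the HDFS path,
-- and the profile that tells PXF how to read the files.
CREATE EXTERNAL TABLE sales_ext (
  sale_id  BIGINT,
  region   TEXT,
  amount   NUMERIC(10,2)
)
LOCATION ('pxf://namenode:51200/data/sales/*.csv?Profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- Once defined, the files can be queried like any other table:
SELECT region, SUM(amount) FROM sales_ext GROUP BY region;
```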

With its mix of capabilities, HAWQ will allow you to build very complex data transformation processes that integrate the best of relational data with large external data. Consider the implications and potential benefits and drawbacks before making that a permanent leap.

You will likely need to combine your big data use with data that is stored in your existing RDBMS. Many of us run the business on DB2 for z/OS, DB2 for LUW, or both. Those systems aren't going to just disappear; we built them for good reasons. Bringing in new technology options doesn't negate the old technology. So we're left looking for ways to replicate the data across the platforms, or to use federation products and tools that allow us to access multiple platforms from our “queries”.

On the replication side, tools to unload our relational tables and save the data into files are commonly available. Apache Sqoop, for instance, parallelizes the retrieval of data from a relational database table and stores it in a set of files in HDFS. It even provides Java classes for use in MapReduce processing and can push results back to the tables when processing is done. I see a lot of potential here for integrating the data lake with our data warehouse activities. If you want, you could try using this to replicate a fairly large set of transactional tables, but you'd need to deal with the likely loss of data integrity across tables.

A second replication option would be to take your tables and extract them as business records, or export them from your applications (in XML or JSON) and send those files to the data lake. These could be used in this format directly, or reorganized into more useful data marts. I gravitate toward this approach for the bulk of my data because it gives me a valid consistent copy of the data from the original transactional system. I’ve seen too many cases where inconsistently replicated data is found to have missing information.

On the federation side, we’re starting to see a large number of options coming from various vendors. IBM developers have shown us how to use table functions in DB2 to access our data lake assets and they’re starting to provide RESTful ways to retrieve data for use in our distributed jobs running in the data lake environment.
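As a purely illustrative sketch of the table-function pattern: HDFS_READ below is a placeholder name, not a built-in DB2 function — your environment would supply its own user-defined table function (and its column definitions). The point is that lake data, once wrapped in a table function, can be joined to ordinary DB2 tables with plain SQL.

```sql
-- Hypothetical: join a summary file in the data lake (read through a
-- placeholder table function named HDFS_READ) to a DB2 customer table.
SELECT c.customer_name,
       h.hits
FROM   TABLE(HDFS_READ('/data/lake/web_hits_summary.csv')) AS h
       JOIN customers c
         ON c.customer_id = h.customer_id;
```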

We like to learn new things. We like to keep our skills relevant. Playing with new stuff brings fun and excitement to our work. And we tend to worry that, if we don’t jump on the bandwagon, we’ll be relegated to a basement office maintaining dying systems. But technology must fit the purposes for which we'll use it and no single solution can handle the variety of needs we will run across. SQL in a big data environment is not going to replace the need for DB2 and other RDBMS options. What SQL will do is make that environment more familiar.