IDUG’s DB2 and Big Data Primer

Have you been hearing about big data recently in advertisements, webcasts or magazine articles? Maybe you’ve seen it mentioned in a blog post. I don’t think I’ve ever witnessed the advent of a “new thing” surrounded by as much excitement, and as much vagueness, as big data. You aren’t cool if you’re not harnessing the power of big data! You’re missing out if you don’t know what it is. You’ll be going out of business if you’re not into big data. Confused? You’re certainly not alone.

Search for information about big data and you will find a lot of content, much of which doesn’t make sense. With this article, I’m going to attempt to give the inexperienced a starting understanding of what big data is, show some of the ways it can be used, and point to additional resources. Just as with all the new concepts that have emerged in the past, this one is going to continue to evolve as time goes on and will eventually be just a part of our everyday existence.

Big Data Historically Speaking


Most of you reading this article are probably in one way or another involved with big data. If you’re working regularly with DB2 then you’re probably working for a mid-range to large organization that is processing large amounts of data and/or high transaction volumes. I was in attendance for David Barnes’ keynote presentation at the IDUG DB2 Tech Conference in Berlin last November, and he started off by saying something very similar. Big data starts with something, and that’s the data.

I myself have been involved with applications and databases that processed one hundred million or more transactions per day, as well as databases containing many terabytes of data. Traditionally, these applications and databases stored data relevant to current business activities. If we accumulated data, we generally used only the most current data, or we archived the data off to data warehouses. The warehouses were the places where business analytics took place, and some of them could be quite large. In many situations the warehouses were expensive to build and maintain, and the questions they answered took a significant amount of time from conception to result.

Things are changing rapidly. Most significantly, the cost of processors, memory and disk storage has dropped considerably. In reaction, businesses are retaining much more information, and management is demanding answers to complex questions quickly in an effort to react to market conditions in a fast-moving world. These demands, and the evolution of social media, have spawned the need for a different processing paradigm, and that leads us to this concept of big data.

Big Data Today


There is a tremendous amount of discussion around what big data is and what it can be applied to, but there is no direct answer. In the most general terms, it means that you have a set of data in which the quantity of information exceeds your ability to effectively process it. Doug Laney of the Gartner Group defined the challenges as the “3Vs”: volume, velocity and variety. So, you have a lot of data, it’s coming and going very quickly, and it’s not homogeneous. Now, add in the fact that your boss wants to know how a certain aspect of the business is doing and how customers are reacting – and needs an answer now.

If that isn’t enough of a challenge, then let’s throw in a little more complexity. A certain portion of the data that can be used in the analysis is something that some have called democratic data. That is, there are now data stores in the form of several social media platforms that can be used together with your in-house data to get more meaningful results from your data analytics. This democratic data, which is drawn from varied and multiple sources, can have severe data quality issues on top of its extreme variety.

In some discussions on big data, the definition also includes the fact that the quantity and complexity of the data exceed the capability of traditional relational database systems. I think that point is arguable, but one thing that is hard to argue is that traditional design methodologies, in which months or years are spent on data administration and design along with rigid sets of rules concerning data types and relationships, will not be acceptable or applicable in a big data implementation.

For years, organizations have been splitting up large sets of data and running processes in parallel to get results from large data stores in a reduced amount of time. People began developing technology that was able to collect and tag varieties of data quickly using this parallel processing technology. This type of metadata tagging in parallel streams resulted in technology referred to as NoSQL. However, it was the introduction of an open source technology from the Apache Software Foundation called Hadoop that provided the base technology used in most big data implementations. What Hadoop provides is a framework for tagging the data (enabling “schema-less” storage), storing the data and aggregating the data in a massively parallel architecture. Using Hadoop, or similar technologies, organizations can now very quickly build massive data stores and then very quickly begin analytics. Hello big data! The contributions to the open source community continue, and there are many tools built upon Hadoop to help programmers build applications quickly. There are even tools that interpret SQL requests issued against a Hadoop file system (see the sketch below). Super cool!
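To make that last point concrete, here is a minimal sketch of what such a query might look like in Apache Hive, one of the SQL-on-Hadoop tools; the table, columns and file location are all hypothetical:

    -- A minimal Hive sketch; the table, columns and HDFS location are
    -- hypothetical. Hive compiles SQL-like statements into parallel jobs
    -- that run across the Hadoop cluster.
    CREATE EXTERNAL TABLE clickstream (
      user_id    STRING,
      page_url   STRING,
      click_time STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/clickstream';  -- files already sitting in HDFS

    -- Aggregate the raw files without ever "loading" them into a database.
    SELECT page_url, COUNT(*) AS clicks
    FROM clickstream
    GROUP BY page_url
    ORDER BY clicks DESC
    LIMIT 10;

The appeal is that the schema is applied at read time over files already sitting in the Hadoop file system, so there is no lengthy design and load phase before the analytics can begin.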

IBM, Big Data and DB2


IBM has been investing a considerable amount of resources in big data solutions recently. They also offer several solutions for customers interested in moving away from traditional data warehouses and towards big data solutions. Several of these solutions, to the best of my knowledge, take advantage of IBM’s Netezza massively parallel processing appliance. This includes DB2!

InfoSphere BigInsights and InfoSphere Streams


These are solutions that more closely match the NoSQL and Hadoop model than a solution provided with or as part of DB2. BigInsights is IBM’s Hadoop-based solution, and Streams is a platform that captures and aggregates (or transforms) data that is in motion.

DB2 and SQL


I’m an SQL guy myself, and I would be remiss if I didn’t talk about DB2’s capabilities in the area of business analytics, which in my opinion falls into the realm of big data if not by definition then certainly by association. DB2 has the capability of storing massive amounts of data and processing that data very quickly. DB2 has built-in parallelism that can split a query into separate parallel processes across a single data server, or across multiple data servers (the DB2 for LUW partitioned database feature). The SQL query language has become robust, and complex analytics can be performed within a single statement. Recently expanded OLAP expressions allow for multiple variations of aggregate functions applied over windows of data and/or across a range of rows (see the sketch below). These types of queries can run together with your OLTP applications in a workload-managed environment with no intrusion, or you can run them against massive data warehouses. This is a more traditional approach to analytics and not truly big data by definition. However, the purpose of big data is to get analysis of massive data stores quickly, and DB2 has that capability built in.
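Here is a minimal sketch of the kind of analysis a single statement can perform with OLAP expressions; the SALES table and its columns are hypothetical:

    -- Hypothetical SALES table; the OLAP expressions do all the work.
    SELECT region,
           sale_date,
           amount,
           -- running total of sales within each region, in date order
           SUM(amount) OVER (PARTITION BY region
                             ORDER BY sale_date
                             ROWS UNBOUNDED PRECEDING) AS running_total,
           -- moving average over the current row and the two preceding rows
           AVG(amount) OVER (PARTITION BY region
                             ORDER BY sale_date
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg,
           -- rank each sale within its region by amount
           RANK() OVER (PARTITION BY region
                        ORDER BY amount DESC) AS amount_rank
    FROM sales;

Each OVER clause defines its own window, so one pass over the data can produce a running total, a moving average and a ranking at the same time.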

IBM DB2 Analytics Accelerator


This architecture utilizes the parallel processing capabilities of the Netezza appliance as an accelerator to process DB2 for z/OS queries with incredibly low latency. The accelerator is integrated with DB2 in such a way that data can be replicated from an OLTP database or warehouse to the accelerator in real time, and queries are submitted directly to DB2, where the optimizer decides whether to run the query under the DB2 engine or on the accelerator (see the sketch below). This solution enables users to quickly utilize existing production data stores accessed via OLTP applications for data analytics, with extremely fast response times and no interruption to the OLTP processing. This is the DB2 for z/OS answer to the need for faster analysis of large data stores that is driving the big data movement.
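As a minimal sketch of how transparent this is to the application, a program can set the CURRENT QUERY ACCELERATION special register and then issue ordinary SQL; the table and columns here are hypothetical:

    -- Tell the DB2 for z/OS optimizer that eligible queries may be
    -- routed to the accelerator for this connection.
    SET CURRENT QUERY ACCELERATION = ENABLE;

    -- Ordinary SQL against a hypothetical table; if the optimizer
    -- decides this query is eligible and cheaper on the accelerator,
    -- it runs there; otherwise it runs in the DB2 engine as usual.
    SELECT cust_region,
           SUM(order_amount) AS total_sales
    FROM order_history
    GROUP BY cust_region;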

IBM PureData


These are pre-built systems that include hardware and software configurations bundled together based upon a specific need, be it transactions or analytics. DB2 for LUW with pureScale is utilized in these machines, as are the Netezza appliance and the IBM Smart Analytics System (utilizing some of the big data concepts I touched on earlier, as well as other technologies), depending upon the application.

DB2 RDF Support for NoSQL Storage


In my earlier discussion of NoSQL and Hadoop I mentioned that methods existed for tagging data, but didn’t go into detail. One reason is that there are many options and methodologies, and already a number of resources available with details on how information is stored in a NoSQL database. Resource Description Framework (RDF) is a specification that can support a “schema-less” data store similar to what’s going on in various NoSQL implementations such as Hadoop. The specification consists of something called a triple, which represents information in the form of a subject, predicate and object (e.g. Bob Rady, is-a, Madman). This method allows the data store to form a network, thus enabling high-performance searches to find patterns in the data (i.e. analytics). DB2 10.1 for LUW supports RDF data stores as well as their query language, SPARQL (pronounced “sparkle” and a cool recursive acronym, look it up!), via a jar file API that is shipped as part of all DB2 clients (a sketch of a SPARQL query follows below). This API provides for the creation of tables in support of the RDF data store, and the querying of the data via SPARQL. Now you can store your NoSQL data in your SQL database. Based upon what I’ve read it works with DB2 for LUW, but since its support is via the DB2 client Java API, I wonder whether it would work with DB2 for z/OS as well!
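To give a feel for the language, here is a minimal SPARQL sketch that reuses the triple from the example above; the prefix and property names are hypothetical:

    # Hypothetical prefix for our vocabulary.
    PREFIX ex: <http://example.org/people#>

    # Find every subject connected to Madman by the isA predicate.
    SELECT ?who
    WHERE {
      ?who ex:isA ex:Madman .
    }

The query is itself a triple pattern: fix the predicate and the object, and the variable ?who matches any subject in the network.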

DB2 and JSON


If you haven’t heard of JSON, you’ll probably hear about JavaScript Object Notation in the near future. Although JSON has been in use for over a decade, you’re likely to start hearing more about it in reference to big data applications. JSON is an open standard used for representing data structures, and it is based upon a subset of the JavaScript language. JSON is used by some of our favorite social networking sites, and can thus easily find its way into our big data streams. JSON is most useful in the area of exchange of unstructured data, and especially text. Sound familiar? Yes, it sounds a little like XML, and it serves much the same purpose. DB2 has a very robust implementation of XML called pureXML, which allows for the native storage of XML documents in a DB2 table. In support of JSON in DB2, IBM provides something called the JSONx Bundle. JSONx is an XML format that represents JSON data, and the JSONx Bundle provides the code to map JSON to JSONx, thus enabling you to store your JSON data in DB2 (see the sketch below)!
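As a minimal sketch (the document content is made up), here is a small JSON object:

    { "name": "Bob Rady", "role": "Madman", "posts": 42 }

And roughly the same data expressed in JSONx, ready to land in a pureXML column:

    <json:object xmlns:json="http://www.ibm.com/xmlns/prod/2009/jsonx">
      <json:string name="name">Bob Rady</json:string>
      <json:string name="role">Madman</json:string>
      <json:number name="posts">42</json:number>
    </json:object>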

More Information on Big Data


You’ve read the primer. Are you interested in more? Follow these links! Also, look for more content related to big data posted at the IDUG website as we receive it during the month of January 2013.

Big Data

OLAP Expressions

  • I recently posted an introduction to OLAP expressions on the IDUG DB2 News blog.
  • I’ll also be presenting a full session on OLAP expressions at the 2013 IDUG NA Technical Conference in Orlando.

IDAA

If you attended the IDUG DB2 Tech Conference in Berlin last November, make sure to access the conference proceedings as there were quite a few detailed sessions on IDAA. You can easily expect many sessions on this topic at the 2013 IDUG NA Technical Conference in Orlando.

Acknowledgements


Special thanks to the people who responded to my request for big data information. There is no way I could have written this article without them. In no special order: Bjarne Nelson, Hans Miseur, Jim Reed, Troy Coleman, Erin Thornburg, Tim Brown, Mark Simmonds, Luc Vandaele and Cindy Russell. If I missed you on the list because I was overwhelmed with information, please send me hate mail. We can always add you online!
