DB2 z/OS as a Modern Enterprise Data Warehouse

The sturgeon may be a prehistoric animal, but there can be great caviar inside

by Ludovic Janssens, Infocura

Introduction: what makes an EDW future proof?

The subtitle of this article refers to Alan Radding’s blog[1], which calls the mainframe a ‘Dancing Dinosaur’. The sturgeon, which produces caviar, is one of the oldest animals on our planet. Just as DB2 for z/OS is still regarded as a high-quality RDBMS, caviar is still sought after by those who appreciate good quality; hence the metaphor.

In order to explain why the ‘ancient’ DB2 for z/OS system is still very precious in a modern IT architecture, we first need to assess what an Enterprise Data Warehouse should be able to do nowadays.

Searching for these requirements, I came across a white paper by Tamara Dull[2], who explains several interesting strategies for using Hadoop to build a strong analytics environment.

One of the major insights from the white paper is that Hadoop can be used for much more than Big Data. You can use Hadoop for data integration, as a data archive or even as an analytics platform. It also becomes clear that both structured and unstructured data are valuable for analytics.

A modern Enterprise Data Warehouse needs to be able:

  • To deal with a wide variety of data sources, whether structured, semi-structured or unstructured
  • To process data as close to real time as possible (real-time data analysis)
  • To store as much data as is needed by the business (data archive)

Obviously, all of this raises a series of questions that call the current implementations into question.

  • Do you still need data integration software if you can use Hadoop?


That one is easily answered: the question is not whether or not to use extra software, but how your existing software can work with unstructured data from Hadoop and become part of its ecosystem. Most data integration software packages now support Hadoop next to RDBMS and other data sources.

 

  • Do you still need an RDBMS if you can use Hadoop?

 

That question is the subject of this article. The relevance of an RDBMS can reside on many levels, but it also depends on how well it can respond to the modern needs identified in the paragraphs above.

The possible role of DB2 for z/OS in this story

Integration of Structured, Semi-Structured and Unstructured Data

The current implementation of DB2 for z/OS includes the ability to store relational data, XML data, JSON and large objects. DB2 is categorized as a Hybrid Relational Database Management System (Hybrid because XML is natively supported).
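
To illustrate this hybrid character, the sketch below shows how a single DB2 table can combine relational columns with a natively stored XML column, and how one query can mix a relational predicate with an XPath predicate. Table and column names are purely illustrative.

  -- A hypothetical table mixing relational columns with a native XML column
  CREATE TABLE ORDERS
        (ORDER_ID  INTEGER     NOT NULL PRIMARY KEY,
         CUSTOMER  VARCHAR(40) NOT NULL,
         ORDER_DOC XML);

  -- A relational predicate combined with an XPath predicate on the XML column
  SELECT ORDER_ID, CUSTOMER
  FROM   ORDERS
  WHERE  CUSTOMER LIKE 'A%'
    AND  XMLEXISTS('$d/order/item[@sku = "4711"]'
                   PASSING ORDER_DOC AS "d");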

Hadoop is known for its ability to store large quantities of heterogeneous data and to provide answers by structuring that data at analysis time.

If you combine the right components of the Hadoop ecosystem, you will definitely achieve many of your goals. You need to take several caveats into account, however: Hadoop is not meant to replace DB2 as a DBMS. Hadoop and DB2 may look much alike, but they were made with different uses in mind.

Hadoop is a data platform, not a database as such. In order to organize the heterogeneous data into analyzable material, you need to make use of several ecosystem components that can live independently from each other.

The fact that DB2 is made for structured data reinforces its capabilities with regard to security, performance and data integrity. DB2 can provide you with the exact answer from the exact data set.

In this article, an overview is provided of how Hadoop and DB2 relate to each other. Although results in Hadoop may seem similar to those in DB2, you will be confronted with important differences. If we add the many out-of-the-box solutions that have been developed and improved in DB2 throughout the past three decades, we can clearly identify how this ‘ancient’ software remains indispensable in the new context that has emerged since the advent of Big Data and Hadoop.

Reason 1: relational rules!

A first reason to make use of DB2 is simply the reason why DB2 came to be in the first place: people wanted to store and retrieve data as efficiently as possible. The relational theory established by E. F. Codd originally described 12 rules[3], all of which were set up to achieve these two goals: storing data efficiently and retrieving that data efficiently.

Hadoop, for its part, violates rule 8, which requires independence from the physical representation of the data: each time the format of the data changes, the Hadoop logic must be adapted. Hadoop also violates rule 12, the non-subversion rule: the data in Hadoop can be altered in many ways that bypass the defined interfaces.

DB2 for z/OS was designed with these rules in mind. Up to the present day, few systems other than DB2 have achieved such a high level of compliance. This compliance guarantees a minimal footprint with maximal retrieval speed.

Reason 2: Security

If there is one aspect of data that needs attention in the current age of the Internet of Things and cloud storage, it is security.

Hadoop is deemed to have many vulnerabilities, but luckily a lot of great solutions have appeared in the past few years. Open-source solutions such as Sentry and Knox allow you to control authentication and authorization. Even data encryption and other security features we are familiar with can be applied to Hadoop.[4] Additionally, a few of the market’s leading auditing platforms integrate Hadoop fully, providing options similar to those available for an RDBMS.

No matter how great the evolution with regard to data security on Hadoop is, it will never beat the ability to secure your data as you can in DB2 with Label Based Access Control (LBAC) or Row and Column Access Control (RCAC).

Label Based Access Control (LBAC)[5] allows you to secure your data at the lowest possible level. The security administrator creates security policies, labels for users and the data groups to which the policies need to be applied. LBAC also allows the creation of exemptions for people who need access beyond what their security profile would normally permit.

I make this statement because these security mechanisms go beyond securing physical objects: they secure both the physical and the logical interface to the data, and they even allow you to transform or encrypt data depending on the person requesting the information.
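
As a minimal sketch of what this looks like in practice (all object, column and group names below are hypothetical), Row and Column Access Control lets you filter rows and mask column values based on who is asking:

  -- Only HR administrators and the employee concerned see a given row
  CREATE PERMISSION HR.EMP_ROW_ACCESS ON HR.EMPLOYEE
     FOR ROWS WHERE VERIFY_GROUP_FOR_USER(SESSION_USER, 'HR_ADMIN') = 1
                OR  HR.EMPLOYEE.USER_ID = SESSION_USER
     ENFORCED FOR ALL ACCESS
     ENABLE;

  -- Everyone outside payroll sees a masked salary value
  CREATE MASK HR.SALARY_MASK ON HR.EMPLOYEE
     FOR COLUMN SALARY
     RETURN CASE WHEN VERIFY_GROUP_FOR_USER(SESSION_USER, 'PAYROLL') = 1
                 THEN SALARY
                 ELSE 0
            END
     ENABLE;

  ALTER TABLE HR.EMPLOYEE ACTIVATE ROW ACCESS CONTROL;
  ALTER TABLE HR.EMPLOYEE ACTIVATE COLUMN ACCESS CONTROL;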

LBAC and RCAC are already quite a challenge on well-structured data. I do not believe this granularity can be achieved in the short term within the realm of Hadoop’s unstructured or semi-structured data stores. If you structure the data for security anyway, you had better make use of a structured data store such as DB2.

Real-Time Analysis

Reason 3: Concurrency

DB2 and other relational database systems offer isolation levels such as Cursor Stability and Repeatable Read, enforced through locking. Hadoop does not have such a locking mechanism, although data versioning addresses some of the same issues.

Hadoop will primarily do dirty reads; for typical Big Data workloads such as stream computing, sentiment analysis and predictive analysis, concurrency is simply not a concern.

Nevertheless, for classic OLAP on structured data, concurrency is still important when the data is being updated regularly. In this respect Hadoop and DB2 serve two completely different purposes: Hadoop will analyze and structure data, while DB2 will query the structured data from any source.
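
As a small illustration (the table name is hypothetical), DB2 lets each statement pick its own isolation level, which is precisely the knob Hadoop does not offer:

  -- Dirty read: no read locks, uncommitted changes may be seen (Hadoop-style)
  SELECT REGION, SUM(AMOUNT) AS TOTAL
  FROM   SALES
  GROUP  BY REGION
  WITH   UR;

  -- Repeatable Read: the data read stays stable for the duration of the query,
  -- even while other transactions are updating the table
  SELECT REGION, SUM(AMOUNT) AS TOTAL
  FROM   SALES
  GROUP  BY REGION
  WITH   RR;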

Archiving Capabilities

Reason 4: IDAA

There was a time when IBM had to admit that DB2 could not cope with the largest amounts of data that reside within the enterprise, as the boundaries of the relational paradigm had been reached. Recently, IBM came up with a feature that addresses this issue and wipes those arguments away.

IDAA, the IBM DB2 Analytics Accelerator, places a Netezza-based appliance as an extra engine next to the DB2 z/OS engine. Large amounts of data can be replicated with change data capture from your DB2 system right into the appliance. It is the DB2 optimizer that decides whether the data is taken from the appliance or from the online DB2 data.
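
As a minimal sketch (assuming an accelerator has been paired with the subsystem and the table has been added to it; the table name is illustrative), an application can influence this routing through a special register and otherwise keep its SQL unchanged:

  -- Let the optimizer offload eligible queries to the accelerator
  SET CURRENT QUERY ACCELERATION = ENABLE;

  -- A typical analytical query; DB2 decides whether it runs on the
  -- accelerator or against the online DB2 data
  SELECT PRODUCT_ID, SUM(QUANTITY) AS UNITS_SOLD
  FROM   SALES_HISTORY
  GROUP  BY PRODUCT_ID;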

In the Hadoop ecosystem similar technologies exist. A common argument is the availability of SQL-on-Hadoop products such as Hive, HAWQ, BigSQL and Impala. These are all capable of addressing the huge amounts of data residing in Hadoop’s HDFS and make use of the ANSI SQL standard.
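
As a sketch of how that side works (file paths and names are illustrative), a HiveQL external table simply projects a schema over files that already sit in HDFS, after which the data can be queried with familiar SQL:

  -- HiveQL: a schema projected over raw, tab-delimited files in HDFS
  CREATE EXTERNAL TABLE CLICKSTREAM (
    TS       STRING,
    USER_ID  STRING,
    URL      STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
  LOCATION '/data/raw/clickstream';

  SELECT URL, COUNT(*) AS HITS
  FROM   CLICKSTREAM
  GROUP  BY URL;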

Which technology should you choose? As always with DB2: it depends. The old adage is still valid. IDAA will be interesting for analytics within the scope of applications that already have their data on the mainframe; it makes absolutely no sense to move carefully structured and protected data out of DB2 and put it on a platform where structure and security have to be redefined.

In my opinion SQL-on-Hadoop will serve broader analytics, whereas IDAA should allow you to keep as much data in DB2 as possible and reduce the maintenance costs of your archive data. SQL-on-Hadoop, on the other hand, should be able to address archive data through DB2.

Reason 5: DB2 z/OS as Application Requester

Many people are not aware that DB2 for z/OS can act as a federation server for other databases, in which case DB2 for z/OS is the Application Requester. Conversely, DB2 for z/OS can be federated through a wide variety of other vendors’ interfaces, which makes DB2 for z/OS the Application Server.

As we are already familiar with DB2 as Application Server, only DB2 as an Application Requester is discussed in this article.

DB2 for z/OS can communicate with all databases that support DRDA client calls. Most industry RDBMS products, such as Oracle[6] and SQL Server[7], have DRDA support for Application Requesters. When you set up DDF to point to these resources, your mainframe applications will be able to connect to them. This opens up a wide range of options.
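
As a minimal sketch (the location, schema and table names are hypothetical, and the remote location must first be defined to DDF in the communications database), a remote object can then be hidden behind an alias and queried like a local table:

  -- An alias that points to a table at a remote DRDA location
  CREATE ALIAS SALES_EU FOR ORASRV1.SALESAPP.SALES;

  -- Mainframe applications query it as if it were local;
  -- DDF routes the request to the remote server
  SELECT REGION, SUM(AMOUNT) AS TOTAL
  FROM   SALES_EU
  GROUP  BY REGION;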

Most people think of DB2 federation as a means to move off the mainframe, but why shouldn’t we consider the reverse?

The mainframe is an ideal place to centrally build a data mart. Unlike Hadoop, the mainframe (and DB2) may already be present in your Enterprise and you may very well already have the skills and infrastructure to build reports on it. When data becomes too big, IDAA is there to help, as we mentioned above.

When DB2 for z/OS becomes a federation server, the federated resources can serve as an operational data store from which you can pick the data without having to know on which platform the actual source data resides. This definitely facilitates analytics and may also make the replication setup a lot easier.

Other Benefits

SQL in its own realm

Maybe it is needless to say to an audience of DB2 users, but what can be better than using SQL in its own environment? SQL was initially designed (by IBM) for relational database systems and was only later adapted to work on Hadoop.

Although the end-user experience does not differ, the backend processing of SQL-on-Hadoop is quite different from that of SQL on an RDBMS. SQL-on-Hadoop is essentially a workaround that translates queries into Hadoop MapReduce[8] processing, whereas SQL on an RDBMS makes use of a mathematical model to calculate the most efficient path to a given result.

In the long term, the difference will grow smaller, but the quality of the results in SQL-on-Hadoop depends on the quality of the analysis made by the people who programmed the layers between the raw data source and the SQL interface, whereas SQL on an RDBMS will always provide an exact answer.

The Role of Hadoop in the Enterprise

In the introduction I mentioned that Hadoop could serve other purposes than the ones envisaged at first sight. What should be the exact role of Hadoop?

Hadoop as a ‘Data Lake’

You will find arguments on how Hadoop could be used as a ‘data lake’[9]. Some people argue that everyone should be using Hadoop and its resources, while others point out that the analysis of unstructured data can be cumbersome and should be in the hands of specialists.

An interesting webinar on this subject was held by the International Institute for Analytics, featuring Tamara Dull and Anne Buff[10]. Obviously, the perspective I am defending in this article is Anne Buff’s point of view. She states that data lakes are not essential and that not everyone should be able to use them directly; Tamara Dull, for her part, argues that data lakes can be crucial.

If we were to concur with Tamara Dull’s vision, Hadoop would become the main source for analytics. In my opinion this is wrong. As argued before, the data within the relational realm has been carefully structured and secured, and its platform is designed with analytics in mind, so there is no reason to move it out of that realm. Hadoop will not provide the end user with the same qualities as a relational database such as DB2 for z/OS offers.

Anne Buff’s vision, on the other hand, positions Hadoop as a way to structure unstructured data and deliver the results to a platform that is fit for purpose, namely a relational database system. As we are looking for an enterprise database, DB2 for z/OS is the ideal target.

Hadoop as a Data Integrator and Provider

In her article on the Hadoop plays[11], Tamara Dull does not take the hard viewpoint she expressed in the data lake debate. On the contrary, she opens up a world of opportunities. The key point of her article is that Hadoop can be used for more than a mere data lake.

One of the most interesting plays is the one in which Hadoop takes the role of data integrator. Hadoop is a platform designed to process huge amounts of data and transform it into other formats. In that regard its functionality parallels classic data integration software. Hadoop has the additional asset that it can address and structure semi-structured or unstructured data and still derive value from it.

As such, Hadoop can be used to extract knowledge from huge amounts of unstructured data, integrate this knowledge with structured data and persist it on a platform that is suited to SQL analytics on structured data, namely a relational database system such as DB2.

Current Hadoop and DB2 z/OS integration

Integration between Hadoop and DB2 z/OS does already exist. There are vendor-supplied connectors that allow you to connect to DB2 from within Hadoop. Conversely, there are user-defined functions that can be installed in DB2 in order to address Hadoop using a combination of SQL and JAQL (JSON Query Language).

This integration allows you to import Hadoop analytics results into DB2 and combine them with relational data. Unfortunately it is not yet a smooth process (some of the functions are based on JAQL, a language with which we as DBAs are not really familiar). I hope that future developments will extend the integration capabilities.
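
Purely as a hypothetical sketch (HDFS_OUTPUT is an invented name; the real function names, signatures and options depend on the connector that is installed), such an integration typically exposes a Hadoop result set through a table UDF so that it can be joined with relational data:

  -- HDFS_OUTPUT is a hypothetical table UDF exposing a Hadoop result file
  SELECT C.CUSTOMER_ID, C.NAME, H.SENTIMENT_SCORE
  FROM   CUSTOMER C,
         TABLE(HDFS_OUTPUT('hdfs://cluster/results/sentiment.csv')) AS H
  WHERE  C.CUSTOMER_ID = H.CUSTOMER_ID;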

Conclusion: The future’s bright, the future is hybrid

Throughout this article I have laid out several arguments in favor of both DB2 and Hadoop. Hadoop has great value, just as DB2 has had for the past 30 years. Each lives in its own realm, however. Both realms are relevant and should cooperate, but they should not be confused.

Hadoop will never replace an RDBMS, as Hadoop is primarily a file system with a set of products that can be used for analytics. What you can do with Hadoop depends on the layers you put on top of it, and those technologies, by the way, often make use of relational technology.

Hadoop will not replace DB2 for z/OS. DB2 offers a wide range of possibilities that provide you with a technically high-quality data store. Moreover, DB2 can be integrated with other technologies to extend its realm to other domains.

The future of DB2 does not lie in abandoning it, but in embracing the reasons why we have been using it all these years. Personally, I hope that IBM will provide more integration possibilities, as they did with IDAA. IBM does already provide some integration options using user-defined functions, but why shouldn’t we be able to use DB2 as an application requester for Hadoop? Wouldn’t it be great to have DB2 for z/OS do SQL-on-Hadoop as if you were working with Hive, HAWQ or Impala? This would integrate DB2 into the Hadoop ecosystem and provide end users with a familiar, secure and powerful analytics environment.

[1] Dancing Dinosaur can be found on https://dancingdinosaur.wordpress.com/

[2] DULL, Tamara (2014) A Non-Geek’s Big Data Playbook, Hadoop and the Enterprise Data Warehouse, a SAS Best Practices white paper: http://www.sas.com/en_us/whitepapers/non-geeks-big-data-playbook-106947.html

[3] A reference to these rules can be found on http://www.w3resource.com/sql/sql-basic/codd-12-rule-relation.php

[4] More information on Hadoop security can be found in the following articles: http://www.infoq.com/articles/HadoopSecurityModel and http://www.esecurityplanet.com/network-security/hadoop-security-still-evolving.html

[5] The best explanation of LBAC can be found in the DB2 LUW documentation (see http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.sec.doc/doc/c0021486.html). The feature is also part of the DB2 for z/OS implementation.

[6] Please refer to http://docs.oracle.com/cd/E49086_01/doc.12/e27316/introduction.htm#DRDAS114 to find out how DB2 can act both as Application Server and Application Requester for Oracle installations.

[7] Please refer to http://www.platformmodernization.org/microsoft/Lists/ResearchPapers/Attachments/37/Microsoft_Service_for_DRDA_Whitepaper.pdf to find out how DB2 can act both as Application Server and Application Requester for SQL Server installations.

[8] MapReduce is a methodology used by Hadoop, in which data is indexed (mapped) and then reduced to tuples which can then be interpreted as data records.

[9] A data lake is the denomination of a set of data resources that is centrally stored and can be used for analysis whatever their format.

[10] Please have a look at http://iianalytics.com/analytics-resources/the-data-lake-debate

[11] DULL, Tamara (2014) A Non-Geek’s Big Data Playbook, Hadoop and the Enterprise Data Warehouse, a SAS Best Practices white paper: http://www.sas.com/en_us/whitepapers/non-geeks-big-data-playbook-106947.html
