A data lake provides data to an organization for a variety of analytics processing, including discovery and exploration of data, simple ad hoc analytics, complex analysis for business decisions, traditional reporting, and real-time predictive analytics. Analytics can also be deployed back into the data lake to generate additional insight. Tools manage the shared repositories of information for analytical purposes, and each data lake repository is optimized for a particular type of processing, such as SQL, NoSQL, MapReduce, machine learning (ML), or graph. As such, a state-of-the-art data lake approach does not necessarily mean building an HDFS-compatible data store in which all enterprise data is stored without knowing its usage intent. Data values may be replicated in multiple repositories in the data lake, and information in the data lake can be accessed through the different types of interfaces and provisioning mechanisms that the data lake services provide.
A data lake supports a governed and managed collection of existing and, in some cases, new data repositories; enables self-service data access; and facilitates analytics development and deployment. It also delivers an ecosystem for continual analytical discovery and modeling across an organization.
The data lake comprises three parts (Figure 1):
- Data lake repositories: Contain collections of information potentially useful for analytics. They are hosted on one or more data platforms that are capable of both storing data and running analytics close to the data.
- Data lake services: Surround the data lake repositories. They can locate, access, prepare, transform, process and move data into and out of the data lake repositories.
- Information management and governance fabric: Provides the engines and libraries that govern, secure and manage the data in the data lake.
Figure 1: The data lake is a three-part solution for analytics development and deployment.
The data lake supports the integration and management of a heterogeneous data landscape that spans structured, semi-structured (for example, XML or JSON documents) and unstructured data. Depending on use-case scenarios, data can be stored on a variety of data platforms, including traditional relational database management systems (RDBMSs) such as DB2, or NoSQL data stores of various kinds (for example, HBase or Hive, both included in IBM BigInsights). The data lake gives organizations the flexibility to discover, explore and innovate using data and analytics. Data lake services support multiple user communities (Figure 2).
Figure 2: Data lake services and user communities.
Sample Analytics Usage Scenarios
Many data lake analytics usage scenarios are enabled by offerings that reduce data movement, leave source data largely in place, and limit the number of data repositories. These offerings include IBM z Systems with DB2 for z/OS and the DB2 Analytics Accelerator, the new Machine Learning (ML) for z/OS with Spark on z/OS, and QMF; offerings on the distributed platform, such as the IBM Open Platform (IOP) with Apache Spark and Apache Hadoop, IBM Big SQL, and IBM BigInsights; and Operational Decision Management (ODM).
Following is a subset of some key analytics usage scenarios:
- Scenario 1: Federated SQL Queries and Analytics
- Scenario 2: Real-Time Analytics with Machine Learning (ML) for z/OS
- Scenario 3: Integrated ODM Rules Management into Apache Spark Environment
Specific usage scenarios in client environments may well vary from these and become more concrete. It is also possible to build data lakes that combine several scenarios. Regardless of the specific usage intent, z Systems-centric data lakes can be deployed to accommodate the needs and characteristics of numerous use case scenarios.
Federated SQL Queries and Analytics
Federation itself has been around for quite some time. However, the breadth of federation across different platforms and data stores has increased significantly in recent years. Federated SQL queries can be submitted via IBM Big SQL, either as part of IBM BigInsights on the distributed platform or running directly on a Hortonworks HDP cluster. Starting with V4.2, IBM Big SQL provides access to DB2 for z/OS as a federated data source. Since IBM DB2 QMF V11.2.1, federated SQL support is provided to z/OS data sources, non-z/OS relational data sources (for example, DB2 LUW, Oracle, and MS SQL Server), and also to Hadoop clusters.
In addition, integration between Big SQL and Apache Spark as part of the ODPi-compliant IBM Open Platform (IOP), and integration between DB2 QMF and Apache Spark on z/OS, provide additional analytics options across the distributed and mainframe platforms. Based on these capabilities, z-centric data lake scenarios can be deployed by leaving data where it originates, with only a relevant subset of the data moved to the requesting application (Figure 3).
Figure 3: Federated SQL queries.
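The idea of a federated query can be illustrated with a minimal sketch. It does not use Big SQL or QMF: two attached in-memory SQLite databases stand in for DB2 for z/OS and a Hadoop cluster, and the schema names (mainframe, hadoop), tables and rows are invented for illustration. The point is only that one SQL statement joins data held in two separate stores and returns just the relevant subset to the requesting application.

```python
import sqlite3

# Stand-in for a federation engine: one connection that can see two
# separate stores. Both are in-memory SQLite databases (an assumption
# made for this sketch), not DB2 for z/OS or Hadoop.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS mainframe")
conn.execute("ATTACH DATABASE ':memory:' AS hadoop")

# Hypothetical tables: account master data and clickstream events.
conn.execute("CREATE TABLE mainframe.accounts (account_id INTEGER, segment TEXT)")
conn.execute("CREATE TABLE hadoop.clicks (account_id INTEGER, page TEXT)")
conn.executemany(
    "INSERT INTO mainframe.accounts VALUES (?, ?)",
    [(1, "retail"), (2, "corporate"), (3, "retail")],
)
conn.executemany(
    "INSERT INTO hadoop.clicks VALUES (?, ?)",
    [(1, "loans"), (1, "cards"), (2, "loans"), (3, "cards")],
)

# One statement spans both stores; only the aggregated subset is returned.
federated_sql = """
    SELECT a.segment, COUNT(*) AS clicks
    FROM mainframe.accounts AS a
    JOIN hadoop.clicks AS c ON c.account_id = a.account_id
    WHERE c.page = 'loans'
    GROUP BY a.segment
    ORDER BY a.segment
"""
result = conn.execute(federated_sql).fetchall()
print(result)
```

In a real deployment the federation engine decides how much of the filtering and joining is pushed down to each source; this sketch does not model that.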
Real-Time Analytics with ML for z/OS
IBM Watson Machine Learning (ML) for z/OS is an end-to-end enterprise ML platform that helps to simplify the deployment of machine learning models, and to significantly reduce the time it takes, by integrating all the tools and functions needed for machine learning and automating the machine learning workflow. It provides a platform for better collaboration across the different personas involved in a successful machine learning project, including data scientists, data engineers, business analysts and application developers. ML for z/OS brings cognitive capabilities into the ML workflow to help determine when model results deteriorate and need to be tuned, and to suggest suitable changes. ML for z/OS leverages Spark MLlib and the Spark on z/OS in-memory compute engine. By using Spark on z/OS, access to z/OS and non-z/OS data can be provided via the Data Virtualization layer.
With ML for z/OS and Spark on z/OS, data lakes can be deployed by exposing z/OS and non-z/OS data for data scientist tasks, for instance the development of analytical models. Scoring is performed on z Systems via a RESTful API. For any analytics-related usage intent of predominantly z/OS data, ML for z/OS enables efficient z-centric data lake deployments with limited data movement required (Figure 4).
Figure 4: Real-time analytics with Machine Learning (ML) for z/OS.
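As a rough sketch of what calling a scoring service over a RESTful API can look like from an application, the snippet below builds an HTTP POST request with a JSON body. The URL, the model name and the payload shape are all hypothetical and are not taken from the ML for z/OS documentation; the request is built but deliberately not sent, so the sketch runs without a server.

```python
import json
import urllib.request

# Hypothetical endpoint: NOT the documented ML for z/OS scoring URL.
SCORING_URL = "https://zhost.example.com/scoring/models/churn"


def build_scoring_request(features: dict) -> urllib.request.Request:
    """Build a POST request that carries one record to be scored."""
    # The {"features": ...} envelope is an assumed payload shape.
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        SCORING_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_scoring_request({"balance": 1200.0, "tenure_months": 18})
print(request.get_method(), request.full_url)
# Sending is left out on purpose; against a real service it would be:
# with urllib.request.urlopen(request) as response:
#     prediction = json.load(response)
```

Because only the record to be scored travels to the service, the training data and the model can stay on the platform where they reside.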
ODM with Apache Spark
IBM Operational Decision Management (ODM) is well integrated into the Apache Spark environment. Using Apache Spark, a decision request resilient distributed dataset (RDD) can be read from a data store such as IMS. A Map transformation can be applied to this decision request RDD, which executes the required decision service by calling the ODM API to load and execute the corresponding ruleset. After validating the ruleset, the resulting decision RDD can be written back into a data store such as a Hadoop HDFS cluster for further processing and action. Spark will be used as a compute engine.
This ODM and Spark integration capability will enable z-centric data lake deployments by executing a ruleset close to the originating data (Figure 5).
Figure 5: Integrated ODM rules management into Apache Spark environment.
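The map step described in this scenario can be sketched without Spark or ODM installed. In the snippet below, a plain Python list stands in for the decision request RDD, the built-in map stands in for the Spark Map transformation, and loan_ruleset is an invented function standing in for the decision service that the ODM API would load and execute. None of these names come from the ODM or Spark APIs.

```python
def loan_ruleset(request: dict) -> dict:
    """Hypothetical ruleset: approve when the amount is at most 5x income."""
    approved = request["amount"] <= 5 * request["income"]
    return {"id": request["id"], "approved": approved}


def execute_decisions(requests, ruleset):
    """Apply the ruleset to every request, one decision per request.

    Comparable in spirit to decision_request_rdd.map(ruleset) in Spark,
    where the work would be distributed across executors.
    """
    return list(map(ruleset, requests))


# Stand-in for the decision request RDD read from a data store.
decision_requests = [
    {"id": "r1", "amount": 20000, "income": 6000},
    {"id": "r2", "amount": 50000, "income": 8000},
]

# Stand-in for the decision RDD that would be written back to a data store.
decisions = execute_decisions(decision_requests, loan_ruleset)
print(decisions)
```

Each request is decided independently of the others, which is what makes the step a natural fit for a map over a distributed dataset.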
Depending on the data sources, the data volume, required data transformation and preparation tasks, and latency requirements, DB2 for z/OS and the Accelerator can play a vital role in all three scenarios.
Deployments with DB2 for z/OS and the DB2 Analytics Accelerator
Without the Accelerator, z Systems data is extracted and copied to the data lake. The result is that z Systems sits outside of the analytics lifecycle and its data is used for analytics that are deployed and used in other systems. There are three reasons for this:
- The analytic queries issued against repositories in a data lake tend to be complex and irregular. This is in direct contrast to regular, rapid and row-based queries that come from the online transaction processing applications. Supporting these two very different workloads in a single system can be difficult, so they are separated into two systems, each with its own copy of the data.
- z Systems data stores often do not have the historical data depth data scientists need. An external system is continuously fed with data from the z Systems stores to accumulate the historical data.
- Deploying new analytics into the z Systems environment and collecting results can be a challenge.
The Accelerator enables more choices for making the data in DB2 for z/OS available for analytics while keeping z Systems within the analytics lifecycle. The following sections discuss four deployment options and show how the Accelerator enables z Systems data to remain within the analytics lifecycle:
- Option 1: DB2 for z/OS as a source
- Option 2: DB2 for z/OS as a data platform for data lake repositories
- Option 3: DB2 for z/OS as a consumer of insight from the data lake
- Option 4: DB2 for z/OS as a downstream system from the data lake
Specific deployments will naturally see some variations of these models, including hybrid models with a mixture of these deployment options. The revolutionary performance characteristics of the Accelerator can cause the boundary between z Systems transactional and analytical workloads to become blurred. This means both these workloads may coexist on the same systems.
Option 1: DB2 for z/OS as a Source
DB2 for z/OS functions as a source of data for the data lake in option 1 (Figure 6). Data from selected DB2 for z/OS database schemas that may be used for analytics is copied into operational history repositories on a regular basis. Analytics teams access operational history repositories to build new analytical models that may be deployed into the Accelerator to create new derived data for z Systems.
This option moves the workload for analytics discovery, exploration and modeling to the data platforms inside the data lake, but it requires z Systems data to be regularly copied into the data lake. When new analytics models are created, they may be deployed into the Accelerator for use by the z Systems application.
Figure 6: Data Lake acts as a historical store for z Systems data.
Option 2: DB2 for z/OS as a Data Platform for Data Lake Repositories
An alternative deployment is to define DB2 for z/OS as one of the platforms for the data lake. Selected schemas are mapped as data lake repositories so they appear in the data lake’s catalog, enabling the data scientist to select them and create the sandboxes needed for discovery, exploration and modeling (Figure 7). The data scientist is never given direct access to the data lake repositories. The DB2 for z/OS platform will experience irregular, large queries as data scientists pull data into sandboxes. However, as long as the required data is accessible to the Accelerator, the transactional workload is not affected.
Figure 7: Data Lake with DB2 for z/OS as one of the data platforms.
Option 3: DB2 for z/OS as a Consumer of Insight from the Data Lake
The data lake may support APIs that can be called from z Systems applications, or through a database stored procedure, to access additional data and insight generated by analytics running in the data lake (Figure 8). For this deployment option, the data lake must support the availability requirements of the z Systems platform.
Figure 8: The Data Lake delivers data and analytical insight to z Systems.
Option 4: DB2 for z/OS as a Downstream System from the Data Lake
In the last deployment option, new insights generated by the data lake and selected supporting data are fed to DB2 for z/OS and accessed through data schema extensions supported by the Accelerator (Figure 9). Operation of z Systems and the data lake is decoupled. The z Systems environment has the advantage of analytical insight local to its processing. However, there may be a publishing delay between generating the insight and the time it arrives in the Accelerator.
Figure 9: The Data Lake feeds insight and supporting data to supplement z Systems data.
The four deployment options represent alternatives for integrating z Systems with the data lake by taking advantage of the Accelerator. The next section introduces some imperatives for implementing a successful data lake.
Imperatives for Implementing a Successful Data Lake
The following imperatives for implementing a successful data lake have been derived from numerous global client engagements in recent years. These recommendations or best practices should be applied wherever possible. The goal is to reduce data movement, reduce the number of data containers, and move analytics applications to where the data originates and resides:
- Reduce the complexity of the information supply chain: This includes avoiding unnecessary data movement and optimizing the entire information supply chain. To simplify data transformation, in-database transformation and temporary table structures based on the Accelerator-only tables (AOTs) of the Accelerator can be used.
- Use federation techniques whenever possible: Leaving data in place can easily be achieved with federated SQL queries as outlined above. Irrespective of where the application resides, federated analytical processing can be implemented with applications executing either on the distributed or mainframe platforms.
- Leverage state-of-the-art technology: From hardware accelerators to special-purpose appliances, numerous fit-for-purpose capabilities can be used to optimize a data lake deployment. In-memory processing and open source offerings, such as Apache Spark with its integration with Big SQL and QMF, add to the flexibility.
- Adhere to innovative and novel analytics concepts: In the past, data marts have been generated to overcome performance limitations. With the Accelerator, it is possible to limit the number of materialized data marts and data cubes and to use aggregation on the fly. This allows much more agile data lake usage and deployment patterns.
These imperatives align very well with the well-known strengths of IBM z Systems.
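The difference between a materialized data mart and aggregation on the fly can be shown with a small sketch. It uses an in-memory SQLite database as a stand-in, not DB2 for z/OS or the Accelerator, and the table and view names (sales, sales_mart, sales_live) are invented. The mart is a physical copy of the aggregate that goes stale as soon as new rows arrive, while the view recomputes the aggregate on every query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("east", 100), ("west", 50)])

# Materialized data mart: the aggregate is computed once and stored.
conn.execute(
    "CREATE TABLE sales_mart AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)

# Aggregation on the fly: the view is evaluated each time it is queried.
conn.execute(
    "CREATE VIEW sales_live AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)

# A new transaction arrives after the mart was built.
conn.execute("INSERT INTO sales VALUES ('east', 25)")

mart_total = conn.execute(
    "SELECT total FROM sales_mart WHERE region = 'east'"
).fetchone()[0]
live_total = conn.execute(
    "SELECT total FROM sales_live WHERE region = 'east'"
).fetchone()[0]
print(mart_total, live_total)
```

Aggregating on the fly is only practical when the engine answers such queries fast enough, which is the role the text above assigns to the Accelerator.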
A data lake is often associated with Hadoop, for instance an HDFS-compatible data repository designed to store vast amounts of structured and unstructured data without any pre-defined data schema or usage intent for downstream consuming systems. More often than not, enterprises find themselves building up a fairly large infrastructure with limited knowledge of its usage intent. This comes with the effort of implementing and operating an information supply chain to move data from relevant source systems into the data lake, which quickly becomes complex and unmanageable.
This blog has described some z Systems-centric data lake usage scenarios and deployment options that will help overcome some of these challenges.