Apache Spark and IBM Machine Learning on z/OS

By George Wang posted Feb 27, 2017 11:06 AM


Article by George Wang, IBM and Kewei Wei, IBM


Analytics is increasingly an integral part of day-to-day operations at today's leading businesses, and transformation is also occurring through huge growth in mobile and digital channels. Enterprise organizations are attempting to leverage analytics in new ways and to transition existing analytics capabilities to respond with more flexibility while making the most efficient use of highly valuable data science skills. The recent growth and adoption of Apache Spark as an analytics framework and platform is very timely and helps meet these challenging demands.

The Apache Spark environment on IBM z/OS and Linux on IBM z Systems platforms allows this analytics framework to run on the same enterprise platform as the originating sources of data and transactions that feed it. If most of the data that will be used for Apache Spark analytics, or the most sensitive or quickly changing data, originates on z/OS, then a z/OS-based Apache Spark environment will often be the optimal choice for performance, security, governance, and cost.

This article explores the enterprise analytics market, use of Apache Spark on IBM z Systems platforms, integration between Apache Spark and other enterprise data sources, and IBM’s new offering for Machine Learning on z/OS. It is of interest to data scientists, data engineers, enterprise architects, or anybody looking to better understand how to combine an analytics framework and platform on enterprise systems.

Overview of Apache Spark on IBM z Systems

You might hear much about the volume, variety, veracity, and volatility of data coming from sources that are external to the enterprise (such as social media, blogs, Twitter, and others). However, insight that can be gained by combining the analytic results from external data with high value (and highly sensitive) data held within the enterprise can deliver superior results to the business. Often, this high-value data for enterprise customers resides on the z Systems platform.

In the past, leading practices for gaining insight from multiple sets of data necessitated moving all this data into one location for purposes of simplifying the analytic programming environment and correlating across multiple data environments. This data centralization strategy resulted in some negative business consequences caused by these vast data transfers: data latency, analytic latency, data security, governance, cost, availability, auditability, and of course risk of exposure to data breaches. However, leading practices can and should change when a better solution presents itself.

Spark offers many advantages in its federated analytics approach. However, think about the greater potential advantages of Resilient Distributed Datasets (RDDs) that reside in memory governed by a secure z Systems environment: performance that is optimized not only for data access but also for running the Spark analytics required on that data. Conversely, consider the potential risk of accessing z Systems high-value business data remotely and having this data available across many distributed systems through RDD memory structures. With the unified analytic framework described, enabling Spark natively on z Systems platforms can allow these emerging analytic applications to use data in place, in an environment that offers locality of reference, secure governance, and optimized implementation.

These key values to enterprises are why IBM Systems has enabled Spark natively for both z/OS and Linux on z Systems. Because Apache Spark is enabled for both operating system environments supported on z Systems hardware, clients can choose the configuration that best fits their needs. The suggestion is to consider the originating sources of data and transactions that will feed the Spark analytics. If most of the data that will be used for Spark analytics, or the most sensitive or quickly changing data, originates on z/OS, then a z/OS-based Spark environment will often be the optimal choice for performance, security, and governance. If most of the data that will be used for Spark analytics originates on Linux on z Systems, then Spark on Linux on z Systems is a viable approach. Of course, not all data needs to be hosted on one platform. In fact, the strength of Spark is that it can combine data from a wide variety of heterogeneous data sources and provide a clean data abstraction layer.


Consider a case (shown in the figure above) that uses Spark-based analytics for determining whether the consumer that is associated with a credit card transaction is a good candidate for a promotional offer, combining insight from sensitive transactions and account information analytics with insight from sentiment analysis based on the consumer's social media content. With this method, z/OS transaction and business data is analyzed in place securely, and the relevant information is associated with insight from unstructured analytics on external social media data. The insight is exchanged without the movement of data. With Spark, you can federate the insight, not centralize the data, to achieve superior business results.
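The idea of federating insight rather than centralizing data can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, thresholds, and sample records are hypothetical, and ordinary dictionaries stand in for what would be Spark jobs running next to each data source. Each source reduces its own raw records to a small per-customer summary, and only those summaries are joined.

```python
from collections import defaultdict


def spend_insight(transactions):
    """Runs where the transaction data lives; returns only per-customer totals."""
    totals = defaultdict(float)
    for customer_id, amount in transactions:
        totals[customer_id] += amount
    return dict(totals)


def sentiment_insight(posts):
    """Runs where the social data lives; returns only a mean score per customer."""
    scores = defaultdict(list)
    for customer_id, score in posts:
        scores[customer_id].append(score)
    return {cid: sum(vals) / len(vals) for cid, vals in scores.items()}


def offer_candidates(spend, sentiment, min_spend=500.0, min_sentiment=0.5):
    """Joins the two small summaries; raw records never leave their source."""
    return sorted(
        cid
        for cid in spend.keys() & sentiment.keys()
        if spend[cid] >= min_spend and sentiment[cid] >= min_sentiment
    )


# Made-up sample data, for illustration only.
transactions = [("c1", 400.0), ("c1", 250.0), ("c2", 90.0), ("c3", 700.0)]
posts = [("c1", 0.8), ("c1", 0.6), ("c2", 0.9), ("c3", 0.1)]

candidates = offer_candidates(spend_insight(transactions), sentiment_insight(posts))
print(candidates)
```

In this toy data only customer c1 clears both thresholds, and the only things exchanged between the two sides are the two summary dictionaries.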

By leveraging Spark's consistent interfaces and rich analytics libraries for the creation of analytic content, data scientists and programmers can quickly build high-value analytic applications across multiple environments (including z Systems platforms) that use data in place and federate the analytic processing to the best-fit environments. Spark analytics can offer both batch and real-time capabilities, depending on the desired qualities associated with the analytics.


The structure shown in the figure above demonstrates one Spark environment running natively on z/OS. Spark can also be clustered across more than one JVM, and these Spark environments can be dispersed across an IBM Parallel Sysplex. Because Spark is based on Java, the potential exists for z Systems transactional environments, customer-provided applications, and IBM and other vendor applications to leverage the consistent Spark interfaces, with almost all of the resulting MIPS being zIIP-eligible. In this way, analytics processing on z/OS becomes extremely affordable. With the IBM z13 system, IBM supports up to 10 TB of memory, which can hold the in-memory Spark RDD structures for optimal performance. Through the Spark SQL interfaces, access to DB2 for z/OS and IMS can be facilitated through standard JDBC type 2 and type 4 connections. Depending on the level of data integration that you want, access to VSAM, physical sequential, SMF, SYSLOG, and other environments is also possible.

Specific to Apache Spark on z/OS, Rocket Software created a function named Rocket Software Mainframe Data Service for Apache Spark z/OS. This capability can give Apache Spark on z/OS optimized, virtualized, and parallelized access to a wide variety of z Systems and non-z Systems data sources.

Overview of Federated Analytics with Apache Spark

One of the main advantages of Apache Spark lies in its ability to perform federated analytics over a heterogeneous source data landscape. Most enterprises store data in heterogeneous environments with a mix of data sources. With Spark, it has become easier than ever to ingest data from disparate data sources and perform fast in-memory analytics on data combined from multiple sources, giving a 360-degree view of enterprise-wide data. Big data is not all about unstructured data. In fact, most real-world problems are solved using some form of structured or semi-structured data.

Integrating Spark with DB2 is an obvious next step in the evolution of big data. Enterprises store petabytes of transactional data in DB2 and run their mission-critical applications on that data. Customers often need to perform analytics by aggregating DB2 data with other data sources to derive additional business insights. For example, a business might want to aggregate transactional data in DB2 with social media data, such as Twitter data stored in HDFS, to establish patterns in consumer sentiment and take actions such as offering targeted discounts. Combining Spark and DB2 simplifies integration of mission-critical transaction data with contextual data from other sources to derive additional big data insights.

Spark provides easy integration with structured data through Spark SQL, a module in Apache Spark that integrates relational processing with Spark’s functional programming API. The Spark SQL Data Sources support makes it simpler to connect to relational databases, load data into Spark, and access Spark data as though it were stored in a relational database. Spark SQL lets Spark programmers leverage the benefits of relational processing and lets SQL users call complex analytics libraries in Spark, such as machine learning.

Spark SQL is viewed as the unified way to access structured data in the Spark ecosystem. DataFrames support a wide variety of data formats that are ready to use, such as JSON and Hive, and can also read from and write to external relational data sources through a JDBC interface. This ability of DataFrames to support a wide variety of data sources and formats enables rich federation of data across many sources. DB2 offers integration with Spark SQL using the DB2 JDBC driver. The DataFrames API allows Spark to load DB2 data through the JDBC driver, making it very easy to expose DB2 data as Spark DataFrames. SQL queries can be run on a DataFrame instantiated with DB2 data. DataFrames also provide abstractions for selecting columns, joining different data sources, aggregation, and filtering. After DB2 data is loaded into Spark as DataFrames, those DataFrames can be joined with data from other sources, or transformations can be applied to generate new DataFrames. Transformed DataFrames can even be written back into DB2 and persisted. All of this can be done through SQL, or through rich language bindings in Python, Scala, Java, and R. Data scientists can go beyond joins, aggregation, and filtering on DataFrames created from DB2 data: they can also apply complex user-defined functions to DataFrames for advanced analytics and use MLlib’s machine learning pipeline API for machine learning.
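As a rough illustration of the DB2 integration described above, the following PySpark-style sketch loads a DB2 table through Spark's generic JDBC data source, queries it with Spark SQL, and writes a result back. It makes several assumptions: the host, port, location name, table names, column names, and credentials are all hypothetical placeholders; `spark` is an existing `SparkSession`; and the DB2 JDBC driver jar is already on Spark's classpath. Only the option-building helper runs without a Spark installation.

```python
def db2_jdbc_options(host, port, location, table, user, password):
    """Build the option map for Spark's generic JDBC data source."""
    return {
        "url": f"jdbc:db2://{host}:{port}/{location}",
        "driver": "com.ibm.db2.jcc.DB2Driver",  # class name of the IBM JDBC driver
        "dbtable": table,
        "user": user,
        "password": password,
    }


def high_value_accounts(spark, options, min_balance):
    """Load a DB2 table as a DataFrame and filter it with Spark SQL.

    `spark` is an existing pyspark.sql.SparkSession. ACCT_ID and BALANCE
    are hypothetical column names used only for illustration.
    """
    accounts = spark.read.format("jdbc").options(**options).load()
    accounts.createOrReplaceTempView("accounts")
    return spark.sql(
        "SELECT ACCT_ID, BALANCE FROM accounts "
        f"WHERE BALANCE >= {float(min_balance)}"
    )


def write_back(df, options, target_table):
    """Append a transformed DataFrame to another DB2 table over JDBC."""
    target = {**options, "dbtable": target_table}
    df.write.format("jdbc").options(**target).mode("append").save()


# Placeholder connection details; replace with real values before use.
options = db2_jdbc_options(
    "zos.example.com", 446, "DB2LOC", "BANK.ACCOUNTS", "user", "secret"
)
print(options["url"])
```

With a live session, `high_value_accounts(spark, options, 10000)` would return a DataFrame that can be joined with DataFrames from other sources before `write_back` persists the result.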

Overview of IBM Machine Learning on z/OS

Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. It evolved from the study of pattern recognition and computational learning theory in artificial intelligence, and explores the study and construction of algorithms that can learn from and make predictions on data. Rather than following strictly static program instructions, such algorithms make data-driven predictions or decisions by building a model from sample data inputs.

Machine learning projects generally include tasks such as data cleansing and ingestion, data feature extraction and selection, data transformation, model training, model evaluation, and model deployment and prediction. Many of these tasks need to be performed iteratively to get to the desired results. Each task requires heavy engagement from experienced analytics personas across the organization, from data scientists and software or data engineers to application developers. As such, a machine learning project usually takes months or even years before a usable model can be generated and deployed in production.

In fact, machine learning was first introduced in 1959. Why has it taken almost 60 years to become a focus of businesses again?

First, it’s about economics. Machine learning is a resource-intensive task, and it was too expensive for most businesses in the past. However, as hardware and software costs have kept declining over the decades, machine learning has become more affordable.

Secondly, it’s driven by business demand. Competition among businesses has continued to increase. An enterprise’s ability to utilize business analytics in operations is one key to success in this competitive marketplace.

Finally, new technologies such as big data and cloud provide a solid infrastructure foundation for machine learning.

However, there are still challenges for enterprises adopting machine learning.

Traditionally, a machine learning project is a one-time effort. Data scientists analyze data, train a model, and deploy it. Then the model is used for years. Unfortunately, this approach is no longer adequate given the speed of business change. Today, new business rules go online every week. Mobile apps are updated several times a day. If a model is not continuously monitored and adjusted, its accuracy can decline very quickly. Therefore, a feedback loop is needed to complete the circle of machine learning.
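One simple way to picture such a feedback loop is a monitor that compares recent predictions with the outcomes that eventually become known and flags the model for retraining when accuracy over a sliding window drops below a threshold. The sketch below is a generic illustration in plain Python, not a description of how any particular product implements monitoring; the class name, window size, and threshold are assumptions chosen for the example.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks prediction accuracy over the most recent `window` outcomes."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # True where prediction matched
        self.threshold = threshold

    def record(self, predicted, actual):
        """Store whether one prediction matched the observed outcome."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Return accuracy over the current window, or None if it is empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """Flag retraining only once the window is full and accuracy is low."""
        return (
            len(self.outcomes) == self.outcomes.maxlen
            and self.accuracy() < self.threshold
        )


monitor = AccuracyMonitor(window=5, threshold=0.8)

# Early on, every prediction matches what actually happened.
for predicted, actual in [(1, 1), (0, 0), (1, 1), (1, 1), (0, 0)]:
    monitor.record(predicted, actual)
healthy = monitor.needs_retraining()

# Business conditions change and recent predictions start to miss.
for predicted, actual in [(1, 0), (0, 1), (1, 0)]:
    monitor.record(predicted, actual)
drifted = monitor.needs_retraining()

print(healthy, drifted, monitor.accuracy())
```

In practice the retraining flag would trigger a new training run on fresh data, which is the step that closes the loop.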


Another challenge is the gap between data scientists and application developers. With the new machine learning flow, data scientists and application developers can no longer work in isolation from one another. They must collaborate in ways that are consistent with DevOps methodologies, and they need tools that connect these constituencies seamlessly.

Last, but not least, machine learning should be integrated into business operations. To provide timely business responses, model prediction needs to be embedded in transactional flows and co-located with the data for optimal efficiency and security.

IBM Machine Learning for z/OS is an end-to-end enterprise machine learning platform addressing these challenges. It helps to simplify and significantly reduce the time to deployment of machine learning models by:

  • Integrating all the tools and functions needed for machine learning and automating the machine learning workflow.
  • Providing a platform for better collaboration across different personas including data scientists, data engineers, business analysts and application developers, for a successful machine learning project.
  • Infusing cognitive capabilities into the machine learning workflow to help determine when model results deteriorate and the model needs to be tuned, and to provide suggestions for updates or changes.



Machine Learning for z/OS provides a simple framework to manage the entire machine learning workflow. Key functions are delivered through an intuitive graphical web user interface, a RESTful API, and other programming APIs:

  • Ingest data from many different data sources including DB2, IMS, VSAM or even from outside of z Systems.
  • Transform and cleanse data as the algorithm input.
  • Train a model for the selected algorithm with the prepared data.
  • Evaluate the trained model.
  • Intelligent and automatic algorithm/model selection and model parameter optimization based on IBM Watson Cognitive Assistant for Data Science (CADS) and Hyper Parameter Optimization technology.
  • Model management.
  • Rapid model development and deployment into production.
  • Provide a RESTful API for applications to embed predictions from the model.
  • Monitor model status, accuracy, and resource consumption.
  • An integrated notebook interface for data scientists to use Watson Machine Learning APIs to process interactively.
  • An intuitive GUI wizard to guide users to easily train, evaluate and deploy a model.
  • Security control through integration with z Systems authentication and authorization.
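To make the idea of embedding predictions through a RESTful API more concrete, here is a minimal sketch of how an application might build a scoring request with the Python standard library. The article does not document the actual API, so everything specific here is an assumption for illustration: the base URL, the `/deployments/{id}/predict` path, the bearer-token header, the payload shape, and the field names are all hypothetical. The request is constructed but not sent.

```python
import json
import urllib.request


def build_scoring_request(base_url, deployment_id, token, record):
    """Build (but do not send) a POST request asking a model for a prediction.

    The URL path, header scheme, and payload layout are hypothetical.
    """
    body = json.dumps({"record": record}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/deployments/{deployment_id}/predict",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


# Placeholder values; none of these refer to a real service.
request = build_scoring_request(
    "https://mlz.example.com/api",
    "churn-model",
    "example-token",
    {"ACCT_ID": 1001, "BALANCE": 2500.0},
)
print(request.get_method(), request.full_url)

# Sending the request would look like this:
# with urllib.request.urlopen(request) as response:
#     prediction = json.load(response)
```

Keeping request construction separate from the network call makes the payload easy to inspect and test before it is wired into a transactional flow.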

Existing models can be imported and/or deployed to Machine Learning for z/OS with several clicks through the intuitive interface, allowing you to visibly manage and monitor your models in the framework. Creating new models can be as easy as following a simple and intuitive wizard. Within a few minutes, you will have a model that has been intelligently selected for you among several possible candidates. This model will be ready for continuous training, including a feedback loop that allows the client to ingest new data/predictions for further improvements.


The developer story across this IBM Analytics Platform stack is also notable because common code can be run in multiple settings. As examples, SPSS models can be run on Streams, Hadoop, or IBM PureData for Analytics; AQL text extractors can be run on InfoSphere Streams or Hadoop; R applications can be run on Streams, Hadoop, or PureData for Analytics.

The big story here is that, to enable a vision for big data, the entire IBM stack for analytics integrates and supports Spark. This enables users to benefit from the scalability and flexibility of technologies like Hadoop and Spark, while still enjoying their familiar interfaces (like SQL, Cognos, or R) and advanced analytic tools (like Watson Analytics). These business users and data scientists need worry only about their analytic and data exploration activities, without having to learn new and evolving technologies.