How to get started with Db2 for z/OS and Apache Spark on a distributed platform

The need for organized, automated and user-friendly data wrangling and data visualization has grown exponentially in the data community. If you allow me to stretch the term, anyone who works with data and needs to show, manage or protect it is already part of the “data community”, which would make Mainframe professionals the first data community ever. However, an important and popular movement is expanding next door: the distributed community, whose data processing needs are now comparable to those of our Mainframe systems. That creates a most pressing need, the need to share information between systems with ease, and that is how Apache Spark has become increasingly popular over the last few years.

At any given time, you can call the Data Warehouse department and chances are their most requested reports contain the same information that is stored inside Mainframe systems (making no distinction between z/OS or Linux on z as the source of the information). Looking at the daily production z/OS jobs, there is an increasing number of transmissions in and out of the Mainframe, defending its position as the main source of information no matter what kind of user looks at the data. You can bet the same situation is happening in many other departments, where the combined need to work with data toward some final goal is what data science is about. IBM identified this need some time ago and is working on the Apache Spark initiative to use Spark as the means of data extraction and manipulation, just as people working with Spark on the distributed platform would, while at the same time attracting data scientists from the distributed world to z/OS.

However, there is an important failure factor in this initiative, and that is what I am addressing here. The z/OS side is usually restricted when it comes to changes and new installations, so z/OS people have a difficult time getting their hands on this Spark technology just to prove its worth and value to the business (unless their z/OS information is already in a data lake, where other options exist and the cultural shift toward openness has already occurred). The purpose of this article, therefore, is to show a couple of ways you can work with Spark and Db2 for z/OS while keeping your Apache Spark installation (standalone or cluster) outside of z/OS, and at the same time enable a feature of Db2 for z/OS that most people are not aware of.

Requirements list

If you decide to follow along with this article, this is the requirements list to get started:

On the developer side:

  • An Apache Spark installation. This includes any platform.
  • A Python data science environment. Spark provides support for other languages such as Java or Scala, but for this task I will use Python 2.7, as it is easy to learn and easy to read even if you haven't worked with Python before. Since we need a data science environment, you can obtain Anaconda Python or Enthought Canopy.
  • The JDBC jars to connect to Db2 for z/OS.

On the z/OS side:

  • An application user with access to the RACF resource <SSID>.DIST in class DSNR. (Remember, this is a remote Db2 application, and therefore we need the usual RACF profiles.)
  • A table to extract information from, with SELECT privileges for the application user.
  • Nothing else.

(And I remind you again, from a Db2 DBA perspective, a distributed Spark application that reads from Db2 on z/OS is no different from any other Db2 online application).

It is very important to note the advantages of working with your own Spark/Python environment (usually referred to as a pyspark environment) outside of z/OS. At the time of this writing, the Spark version available on z/OS supports Scala and, very recently, Python. As a developer or data scientist, however, you will face incompatibilities, and you will most likely be working with a mix of libraries that are not available on z/OS, such as Pandas, which provides a lot of flexibility for handling data, or visualization libraries such as Seaborn, Plotly or Bokeh. The conclusion is that at some point, even if you are working with Spark inside z/OS, the data scientist (congratulations, you are a data scientist) will face the need to move results outside of z/OS just to reach data science libraries not available there. Working outside of z/OS from the start lets you adapt your work, include it as an add-on for any other application, and address virtually any business need thanks to the flexibility of Python or the programming language of your choice.

Accessing information from Db2 for z/OS

Of course, there are many options when it comes to accessing Db2 for z/OS like using ODBC access with IBM Db2 Connect or unloading the information from within z/OS using a DSNTIAUL, DSNTEP2 or UNLOAD JCL, or even Db2 query support over FTP, which will be used later in this article to demonstrate how Spark can read from that kind of output.

One very important point to consider is the difference between a standard Db2 application and a data science application: the latter is designed to extract data and perform calculations with the objective of presenting a report or a chart that contains meaningful information about a system or application. It is by no means a way to create an OLTP application.

Apache Spark JDBC support

Apache Spark provides a way to interact with databases through JDBC and create a Spark DataFrame to work with, which you can then join or merge with information from other sources, or transform into a Pandas DataFrame and continue working with your newly accessed data. The flow is simple:

  • Access the DBMS server of your choice, which in this case is Db2 for z/OS.
  • Extract the information contained in <schema>.<table>.
  • Return the information in a Spark DataFrame.

Before getting into the code, you might have already observed its pros and cons:

  • It can retrieve information in a single code statement.
  • It will retrieve the full contents of the object defined under <schema>.<name>.
  • Any table, view or materialized query table can be the object extracted.
  • It requires intervention from someone with enough authorization to create the required objects in Db2.

Now, let’s get into the code. For the following demonstration we will be using the Apache Spark shell for Python called pyspark and you will need the Db2 JDBC jars to connect to your Db2 for z/OS server. Let’s begin by adding the required JDBC jars to the application, for which we have the following options:

Option 1. Execute the pyspark shell and include the required JDBC jars in the --driver-class-path option and again in the --jars option.

Option 2. Edit the file <your-spark-path>/conf/spark-defaults.conf. You would be looking at the properties spark.driver.extraClassPath and spark.executor.extraClassPath to add the JDBC jar files.

Option 3. Set your custom properties through the SparkConf object.

Option 4. Set the required variables on the file <your-spark-path>/conf/spark-env.sh.

Option 5. Set the CLASSPATH system environment variable before calling pyspark.

*These last two options are already deprecated; however, just for the sake of trying them out, you can still enable them by setting the CLASSPATH system environment variable.

For more information, see the official Apache Spark configuration documentation.

Once you are inside your environment you can access this feature through just one line of code, which is:

jdbcDF = spark.read.jdbc('jdbc:db2://<your-IP>:<port>/<ssid>', '<schema>.<table>', properties={'user': '<your-user>', 'password': '<your-password>'})

If you have set the required driver jars and they are visible in your CLASSPATH at execution time, you will obtain a full Spark DataFrame containing the information from the object you specified in the <schema>.<table> section; otherwise you'll get a "java.sql.SQLException: No suitable driver found" message.

From here on is where the fun starts, because the use cases are bounded only by your imagination and creativity. You can collect batch reports from your z/OS monitor (Tivoli OMEGAMON, CA SYSVIEW, BMC MAINVIEW or even the RMF reports), load them into a database to keep historical data, and then extract that information from Apache Spark; or perhaps you have a Db2 application whose statistics you want to display. You name it: this is where the magic happens, because you make it happen.

Working with Spark DataFrames and Spark RDDs, a use case

It’s time for a demonstration of how you actually work with your data once you get it.

Before I continue, I need to clarify the terms Spark DataFrame and Spark RDD, which are the objects we will be working on:

Spark RDD: As most people define it (and I will as well), a Spark Resilient Distributed Dataset is an immutable and distributed collection of elements, and when you start working with Spark you will first obtain an RDD object by reading from a source or creating an RDD from an already existing collection (a list or an array, if you will).

Spark DataFrame: The difference with a Spark DataFrame is that it is a distributed collection of data organized into named columns. That is why we obtain a DataFrame when we work with JDBC, and we can turn this DataFrame into a Pandas DataFrame (a similar object from a different, very useful Python library that I strongly encourage you to check out).

Reading Db2 data using Spark DataFrames

Taking the sample table EMPPROJACT as the source of information, after executing the read function, if you just enter the name of the object, which is jdbcDF, this is what you'll see:

DataFrame[EMPNO: string, PROJNO: string, ACTNO: int, EMPTIME: decimal(5,2), EMSTDATE: date, EMENDATE: date]

** Bear with me, as now we will get into the Python code...

Once we have our Spark DataFrame containing the information, we then cache its contents as we will be working with it:

>>> jdbcDF.cache() 
DataFrame[EMPNO: string, PROJNO: string, ACTNO: int, EMPTIME: decimal(5,2), EMSTDATE: date, EMENDATE: date]

Then we register a table name that we will use to call this object:

>>> jdbcDF.createOrReplaceTempView("projact")

From then on we can use a SELECT statement to extract, for example, the information about employee number ‘000130’:

>>> projs = spark.sql("SELECT * FROM projact WHERE empno='000130'")
>>> projs.show()
+------+------+-----+-------+----------+----------+
| EMPNO|PROJNO|ACTNO|EMPTIME|  EMSTDATE|  EMENDATE|
+------+------+-----+-------+----------+----------+
|000130|IF1000|   90|   1.00|1982-01-01|1982-10-01|
|000130|IF1000|  100|   0.50|1982-10-01|1983-01-01|
+------+------+-----+-------+----------+----------+

When we create the projs object we are calling what Spark knows as a transformation, which gives us another DataFrame to work with. We display its contents with the show() function, which gives a console-style presentation; calling other methods such as collect() or take() would instead return the string representation of its Row objects, which I'll show for the sake of curiosity:

>>> projs.take(10)
[Row(EMPNO=u'000130', PROJNO=u'IF1000', ACTNO=90, EMPTIME=Decimal('1.00'), EMSTDATE=datetime.date(1982, 1, 1), EMENDATE=datetime.date(1982, 10, 1)), Row(EMPNO=u'000130', PROJNO=u'IF1000', ACTNO=100, EMPTIME=Decimal('0.50'), EMSTDATE=datetime.date(1982, 10, 1), EMENDATE=datetime.date(1983, 1, 1))]

Congratulations! Now you're learning Spark SQL, a way to work with your data using SQL-like statements. Spark SQL is a growing trend in the Spark world, and it can also read from non-tabular sources such as text files. It gives a lot of flexibility and lets the scientist call functions that would otherwise take a lot more coding.

Let’s run a simple query using built-in functions, for example, if you wanted to execute the following query:

SELECT PROJNO,COUNT(*) FROM EMPPROJACT GROUP BY PROJNO;

You can do it like so:

>>> jdbcDF.groupBy('projno').count().orderBy('projno').show()
+------+-----+
|projno|count|
+------+-----+
|AD3100|    1|
|AD3110|    1|
|AD3111|    7|
|AD3112|   10|
|AD3113|   14|
|IF1000|    4|
|IF2000|    5|
|MA2100|    2|
|MA2110|    1|
|MA2111|    3|
|MA2112|    6|
|OP1010|    5|
|OP2010|    2|
|OP2011|    2|
|OP2012|    2|
|OP2013|    2|
|PL2100|    1|
+------+-----+

What happens on Db2 when you invoke the JDBC call?

If you follow the previous example on your Db2 for z/OS monitor, here’s what you’ll see upon the creation of your DataFrame:

SELECT  *
   FROM  <your-object>
  WHERE 1 = 0 

This query yields no results; however, Spark needs it so that it can tell you what the DataFrame rows will contain. You'll also see a call to DESCRIBE INPUT.

Side note: In Spark theory, the previous point occurs because of the principle of lazy evaluation. You can keep working with your DataFrame and Spark won't perform any calculation until you execute an action; only when you actually retrieve the information does Spark compute the best possible plan for the calls you gave it. Otherwise Spark might have to hold your data in memory before you ask for it, wasting resources.

Once you retrieve the information (in the previous case using projs.take(10)), here is what you can see on the z/OS side:

SELECT <COL1>, <COL2>, .... <COLn>
  FROM <your-object>

Sending queries to Db2 for z/OS using Apache Spark

There is a drawback to the previous example: we had to read the full table into a Spark DataFrame. You could use an MQT or a view to encapsulate a report query into a single object you can then retrieve, but perhaps you can't define objects for every report you'll present, so you need to send the real query to the Db2 server. As you might have noticed from the previous explanation, you can send the query instead of a table name; however, it must be enclosed in parentheses.

Let’s work with this. We’ll use the following query:

SELECT PROJNO, ACTNO, COUNT(*) AS EMPLOYEE_COUNT
  FROM <your-sampleDB-schema>.EMPPROJACT
 GROUP BY PROJNO,ACTNO               

We then create the DataFrame:

>>> jdbcDF = spark.read.format('jdbc')\
... .option('url','jdbc:db2://<ip>:<port>/<SSID>')\
... .option('dbtable','(select projno,actno,count(*) as employee_count from <your-sampleDB-schema>.empprojact group by projno,actno)')\
... .option('user','<your-user>')\
... .option('password','<your-password>')\
... .load()   

Note that this is another way to execute a JDBC call, and here is another way to display a sample of your results:

>>> jdbcDF.show()
+------+-----+--------------+
|PROJNO|ACTNO|EMPLOYEE_COUNT|
+------+-----+--------------+
|AD3100|   10|             1|
|AD3110|   10|             1|
|AD3111|   60|             2|
|AD3111|   70|             2|
|AD3111|   80|             2|
|AD3111|  180|             1|
|AD3112|   60|             4|
|AD3112|   70|             3|
|AD3112|   80|             2|
|AD3112|  180|             1|
|AD3113|   60|             3|
|AD3113|   70|             4|
|AD3113|   80|             4|
|AD3113|  180|             3|
|IF1000|   10|             1|
|IF1000|   90|             2|
|IF1000|  100|             1|
|IF2000|   10|             1|
|IF2000|  100|             2|
|IF2000|  110|             2|
+------+-----+--------------+

Now you can play with your data. I need to remind you that once you have a Spark DataFrame, you can transform it into virtually anything you desire, or merge it with any other information you need to present. That is the advantage we have with Spark: we could merge this table with a DataFrame that comes from another DBMS, or with information from a text file, JSON, a file in HDFS, or any other source you can reach.

Reading Db2 data using Spark RDDs

How about we now extract the information using Db2 query support over FTP? For the following example, you will need to create a member containing the same query we used last time:

SELECT PROJNO,COUNT(*) FROM EMPPROJACT GROUP BY PROJNO;

The next steps are simple, and you could also build a script in your favorite language to automate them, or create the text file containing the query in your client environment, upload it, and then retrieve the results of its execution. Once you execute the query using an FTP call (assuming Db2 query support is configured on the FTP server on the z/OS side), we should see a file with the following contents:

PROJNO         COL002
AD3100              1
AD3110              1
AD3111              7
AD3112             10
AD3113             14
IF1000              4
IF2000              5
MA2100              2
MA2110              1
MA2111              3
MA2112              6
MA2113              5
OP1000              1
OP1010              5
OP2010              2
OP2011              2
OP2012              2
OP2013              2
PL2100              1

As it is, the file is ready to be read using Spark's textFile() function, so let's get into the code.

This time we will use the SparkContext object "sc", which is created automatically when you invoke the pyspark shell, or which you can create yourself for a standalone application with the following lines:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("SparkExample")
sc = SparkContext(conf = conf)

Then we invoke the function textFile to read directly from the file:

file_proj = sc.textFile("file:///<path-to-file-to-read>")

Now we split the lines over the blank spaces:

proj_RDD = file_proj.map( lambda line: line.split() )

If we wish for example, to display the same contents from the file without the headings (however, this time the information is obtained from the Spark RDD object), we can write the following lines:

for i in proj_RDD.collect()[1:]:
      print('{0}   {1}'.format(i[0], i[1]))

And this is what we obtain:

AD3100  1
AD3110  1
AD3111  7
AD3112  10
AD3113  14
IF1000  4
IF2000  5
MA2100  2
MA2110  1
MA2111  3
MA2112  6
MA2113  5
OP1000  1
OP1010  5
OP2010  2
OP2011  2
OP2012  2
OP2013  2
PL2100  1

Showing your data to the world

Now that we have seen how to read the data using Spark, we should talk about the end goal, which is, of course, presenting your data. There are a lot of options when it comes to visualization libraries; perhaps the most popular are Matplotlib and Seaborn. For this purpose, however, I will use a different library called Bokeh, since it is easy to use and lets the user play with the built-in interactions Bokeh offers (I would also suggest you take a look at Plotly, another powerful visualization library that includes user interaction).

The code snippet to have the fastest bar chart in the west would be something similar to this:

from bokeh.plotting import show, output_file
# Note: bokeh.charts was deprecated and later removed from Bokeh
# (it moved to the separate bkcharts package).
from bokeh.charts import Bar

# we are using the same proj_RDD object from the last example
projs = []
counts = []
for i in proj_RDD.collect()[1:]:
    projs.append(i[0])
    counts.append(int(i[1]))
data = {'projects': projs, 'emp counts': counts}
output_file('Bar chart example 1.html')
p = Bar(data, values='emp counts', label='projects',
        title='Example chart using Bokeh', plot_width=1100)
show(p)

Here is your resulting chart:

[Bar chart: employee counts per project, rendered with Bokeh]

It is important to note that Bokeh is popular because of the flexibility it provides when it comes to user interaction, data formats, and a lot of other options. It is definitely worth exploring to create a final web application, reports in PDF format or any other way you wish to present and share your data.

The output is automatically shown in your default browser, and from there you can insert the file into a web application, save it in PNG format, create a dashboard, and cover many more use cases.
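If you prefer Matplotlib (mentioned above as perhaps the most popular option), a minimal equivalent bar chart could look like the sketch below. The project/count lists are a hypothetical subset of the query results, and the output file name is my own choice:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display; drop this line to open a window
import matplotlib.pyplot as plt

# Hypothetical subset of the projno/count results from the earlier query.
projs = ["AD3100", "AD3110", "AD3111", "IF1000", "IF2000"]
counts = [1, 1, 7, 4, 5]

plt.figure(figsize=(11, 4))
plt.bar(range(len(projs)), counts)
plt.xticks(range(len(projs)), projs, rotation=45)
plt.title("Example chart using Matplotlib")
plt.tight_layout()
plt.savefig("bar_chart_example.png")
```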

Closing thoughts

Here’s a short summary of what was discussed on this article:

  • We revisited a couple of options to collect information from Db2 for z/OS.
  • We then learned how to read data using Apache Spark from any distributed platform.
  • Finally, we displayed the information extracted from point 1 into a bar chart using a data visualization library.

Perhaps you’re looking for a way to prove what Spark can do, or you want to show its potential so you can present it and promote the need to work with Spark on z/OS, or maybe you need to learn about clustering technologies so that you can speak the same language as your colleagues on the data science department, or even better, perhaps you are starting a data science department. In this age we, the Mainframers, need to prove that we have always been data scientists, and we can have the tools to empower the position of the Mainframe as the main source of information, and what’s better than proving your ideas with the same language that people are going for? All you need is an idea and enough curiosity to show it.
