Apache Spark: the product, the hype and alternatives

by Ludovic Janssens, Infocura

Introduction

If you haven’t heard of Spark, it is about time you read an article on it. It may be this one or another, as long as you are aware that a lot is going on with this software. But do we actually know what Apache Spark is exactly? Why should(n’t) we use it? Is it fit for purpose or not? Many things are still unclear to a broad audience; I hope to shed a little more light on the subject with this article.

The product: what is Spark?

It did take me some time before I grasped the essence of Spark. As I am no expert in the technical depths of the software, I refer to MapR’s architectural overview[1] for the exact details. In this article I will only point out some interesting facts.

Spark is a ‘stack’

Spark is neither a simple API nor a so-called Hadoop asparagus[2]. Spark is an engine that can be used against a wide variety of data sources outside Hadoop and has several libraries that support different processes:

  • Spark SQL: lets you write SQL against the resources targeted by Spark (see the sketch after this list)
  • Spark Streaming: facilitates the processing of streaming data
  • MLlib: a library for machine learning
  • GraphX: enables graph analytics
  • SparkR: allows you to run R statistics on top of Spark
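To make this more concrete, here is a minimal Scala sketch of the Spark SQL library in action. It assumes a Spark 2.x SparkSession (on Spark 1.x the entry point would be SQLContext instead), and the file path and column names are purely hypothetical.

import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    // Entry point for the Spark SQL library (Spark 2.x style)
    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[*]")                       // run locally for the sake of the example
      .getOrCreate()

    // Hypothetical JSON file with sales records
    val sales = spark.read.json("/data/sales.json")

    // Expose the data to plain SQL
    sales.createOrReplaceTempView("sales")
    val topRegions = spark.sql(
      "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY total DESC")

    topRegions.show()
    spark.stop()
  }
}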

Spark makes use of in-memory storage

Spark makes use of ‘Resilient Distributed Datasets’ (RDDs): in-memory areas for data storage. Spark performs its calculations and transformations on these RDDs. For each intermediate step a new RDD is allocated (Write Once, Read Many).

Transformations are evaluated lazily: they are not executed until an action actually needs their result. As everything is done in memory, this results in so-called micro-batch processes. The performance gain is significant, as most objects are still in memory when they are reused by subsequent processes.
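The following sketch, again with hypothetical file names, illustrates this lazy behaviour: the filter is only executed when an action (count) asks for a result, and caching keeps the intermediate RDD in memory for the next action.

import org.apache.spark.{SparkConf, SparkContext}

object LazyRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-rdd").setMaster("local[*]"))

    val lines  = sc.textFile("/data/events.log")               // nothing is read yet
    val errors = lines.filter(_.contains("ERROR")).cache()     // lazy transformation, marked for in-memory reuse

    println(errors.count())                                    // first action: executes the chain and fills the cache
    println(errors.filter(_.contains("DB2")).count())          // second action: reuses the cached RDD

    sc.stop()
  }
}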

Spark supports a wide variety of programming languages and frameworks

Spark has broad support for many languages, which makes it usable for many companies. Most commonly Spark is used with Java, Python, Scala, SQL and R.

Scala

Scala in particular is interesting, as the Spark stack itself is based on this language. If you are an IDUG Content Committee fan, you may remember the first contributions I made for the committee on Scala[3] and its benefits.

Spark is another framework built upon core Scala, but it can be combined with the ecosystem around it, Akka for example. Akka is a toolkit for building highly concurrent, distributed and fault-tolerant event-driven applications[4]. With its actor model of computation and ‘let it crash’ approach to resiliency, it adheres to the ‘Reactive Manifesto’[5], which describes an approach to building flexible programs that are easy to operate and maintain.
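For readers who have never seen Akka, the snippet below is a minimal sketch using the classic (untyped) actor API; the LineCounter actor and its messages are purely illustrative.

import akka.actor.{Actor, ActorSystem, Props}

// A minimal actor: its state is only touched by messages it receives, and if it
// crashes its supervisor can simply restart it ('let it crash').
class LineCounter extends Actor {
  var count = 0
  def receive: Receive = {
    case line: String =>
      count += 1
      println(s"processed $count lines so far")
  }
}

object AkkaSketch extends App {
  val system  = ActorSystem("demo")
  val counter = system.actorOf(Props[LineCounter](), "counter")
  counter ! "first message"   // asynchronous, non-blocking send
  Thread.sleep(500)           // give the actor time to process before shutting down
  system.terminate()
}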

Spark runs on ALL (common) operating systems

Spark can run on Linux, Unix and Windows and was recently released to run on z/OS too.

The hype: what can we really use it for?

Why do we love it?

Spark is being promoted by many parties. At the last IDUG conference there was a complete track dedicated to Spark, with IBM presenting the sessions. MapR has a complete tutorial and documentation set for Spark, and Hortonworks is proud to include Spark in the latest releases of its HDP (Hortonworks Data Platform). The industry is clearly happy with this product.

Reason 1: the Virtual Warehouse

A first reason is quite obvious: Spark is a solution that brings current programming standards (Java, SQL, …) into a lightweight framework that allows you to analyze data in a uniform way, whatever the provenance of the data and whatever programming logic you like to apply.

Spark integrates all kinds of data sources fairly seamlessly, without the high cost of ETL. As such, it establishes a virtual federation/data warehouse.

Spark can run on z/OS and/or zLinux and access DB2 for z/OS data very quickly without having to move the data. This represents serious potential cost savings when integrating mainframe data into distributed processes and opens up mainframe data even more to a broader audience.
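As an illustration of this virtual-warehouse idea, the sketch below joins a DB2 table, read over JDBC, with a JSON file on HDFS in one uniform DataFrame program. The JDBC URL, credentials, table, file and column names are all hypothetical, and a DB2 JDBC driver is assumed to be on the classpath.

import org.apache.spark.sql.SparkSession

object VirtualWarehouseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("virtual-warehouse").getOrCreate()

    // Read a DB2 table over JDBC, without moving the data first (all values hypothetical)
    val customers = spark.read.format("jdbc")
      .option("url", "jdbc:db2://mainframe.example.com:446/DSNV11P1")
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .option("dbtable", "PROD.CUSTOMER")
      .option("user", "dbuser")
      .option("password", "secret")
      .load()

    // Read semi-structured data from the distributed side
    val clicks = spark.read.json("hdfs:///logs/clickstream.json")

    // One uniform join across both worlds
    val joined = customers.join(clicks, Seq("CUSTOMER_ID"))
    joined.groupBy("REGION").count().show()

    spark.stop()
  }
}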

Reason 2: Ease of use

Future programmers will be able to run their programs against Hadoop and DB2 with the same type of queries for both data sources, and hence move their code from one platform to the other without having to change it.

What about performance?

One of the primary reasons to start using Spark is its performance. We do know that many of the claims have commercial origins. So, what can we retain from this fabulous story?

Spark is Fast

The primary reason why Spark performs well is that it does a lot in memory. Spark is fast, but not always the fastest. Much also depends on how you use the framework (and, accordingly, memory).

It is, for example, documented that Java on Spark is in many cases much slower than Scala on Spark. As with other analytical frameworks, it also depends on the type of analytics you envisage. So in a sense we are comparing chalk and cheese.

As the ‘it depends’ adage still applies, it is important to validate whether your use case is a fit for Spark rather than blindly following the masses.

Spark and Map Reduce

Spark’s performance is often compared to that of Map/Reduce. In its purest form Map/Reduce performs the following sequence on data: first it maps the data, then it shuffles the data into a given order, and finally it reduces the data by counting the duplicates. Spark adds filter, join, sort and other actions to this logic, which makes it perform much better.
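A classic word count shows the difference in a nutshell: the map, shuffle and reduce phases are all expressed in one short Spark program, and extra operators such as filter and sort are added without spawning separate jobs. The file name and its content are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount").setMaster("local[*]"))

    val counts = sc.textFile("/data/articles.txt")   // hypothetical input
      .flatMap(_.split("\\s+"))                      // 'map' phase
      .filter(_.nonEmpty)                            // an extra operator Spark adds cheaply
      .map(word => (word.toLowerCase, 1))
      .reduceByKey(_ + _)                            // 'shuffle' and 'reduce' phases
      .sortBy(_._2, ascending = false)               // sorting without a separate job

    counts.take(10).foreach(println)
    sc.stop()
  }
}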

There is a nuance to this comparison, however. Recent implementations of Map/Reduce are enriched with Apache Tez[6]. Apache Tez combines the many Map-Shuffle-Reduce sequences into a single process and reorders the activities so that resources are used optimally. Combined with YARN[7] (resource management for Map/Reduce), this improves performance vastly.

Note, however, that Tez on its own does not replace Spark, but it can be complemented with other software products that mirror Spark’s functionality (for example Storm, see below). Spark, on the other hand, can also run on YARN.

What about SQL integration?

Having APIs for a framework is great, but unfortunately the Spark API is quite specific. Its SQL does not comply with the ANSI standard and is embedded more in the Java tradition, where any class can be called from anywhere. As such, I would not recommend using Spark for SQL only.

If you are purely interested in analytics with SQL, I would consider other analytical tools that exist for both DB2 and Hadoop[8] and that do support the full and latest ANSI SQL standard.

On the other hand, Spark is useful because it has a full-blown JDBC interface and supports data sources such as JSON and Hive. Through its DataFrames API (Spark SQL), Spark can expose DB2 data to its logic. From a maintenance perspective, this favors code reusability and answers the need for a Service-Oriented Architecture.

Next, Spark is also capable of persisting transformed data back into DB2, using one of the programming languages Spark provides, such as Java, Scala, R and Python (as in Slick[9], the Spark syntax tree is matched with the JDBC (SQL) syntax tree).
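A hedged sketch of such a write-back, using Spark’s generic JDBC writer; again, the URL, credentials and table names are hypothetical and a DB2 JDBC driver is assumed to be available.

import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

object WriteBackSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("write-back").getOrCreate()

    // Hypothetical source data and a simple transformation
    val orders = spark.read.json("hdfs:///landing/orders.json")
    val totals = orders.groupBy("customer_id").sum("amount")

    // Persist the result back into DB2 through the generic JDBC writer
    val props = new Properties()
    props.setProperty("user", "dbuser")
    props.setProperty("password", "secret")
    props.setProperty("driver", "com.ibm.db2.jcc.DB2Driver")

    totals.write
      .mode(SaveMode.Append)
      .jdbc("jdbc:db2://mainframe.example.com:446/DSNV11P1", "PROD.ORDER_TOTALS", props)

    spark.stop()
  }
}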

What about real time analytics?

Well, Spark does not do real-time analytics; it does micro-batches. This may sound bad, but in practice the result is good enough for 99% of us. Is it really important to you that the data from the latest millisecond is taken into account? For some organizations it will be, but many industries will not care, especially as we are talking about analytics and not online transaction processing.
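The micro-batch idea is visible in the classic Spark Streaming API: you choose a batch interval, and at every interval the data received so far is processed as one small batch. The socket source and five-second interval below are only an example.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch").setMaster("local[2]")
    // Every 5 seconds the current micro-batch is closed and processed as a small RDD
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical source
    lines.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}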

Those who do need real-time processing should look into the possibilities of another Apache project, Apache Storm[10], which can be used with classic messaging systems and databases. According to the official website, Apache Storm is a distributed real-time computation system that makes it easy to reliably process unbounded streams of data. Alternatively, many proprietary products exist to achieve similar goals[11].

Is Spark really it?

Of course it is not. Spark will evolve and, like everything, it will eventually be replaced by something else. Is it a good choice at the moment? As mentioned above, that depends on the situation you are in. Obviously, it will improve your analytics, and the many features it provides make it a good and robust framework to work with, but it does not answer all use cases…

Personally, I am somewhat concerned about the durability of a Spark implementation. A few emerging projects could change the game in the area of analytics APIs. The evolution is especially fast in the area of in-memory analytics.

Apache Geode, for example, is a new Apache project derived from what was previously Pivotal GemFire. This in-memory database allows you to pool memory across many nodes, thus enlarging the capacity enormously. It is already being used in many Java projects that process large amounts of data. The arrival of this project at Apache will hopefully open possibilities for other projects (maybe Spark?) to benefit from its capabilities.

One of the use cases Spark was not intended for is simple ETL/ELT transformation. As Spark involves an extensive framework, it is better to look for other solutions that allow you to do simple transformations without all this overhead. The appropriate solution could be a classic ETL/ELT tool or a classic database interface, depending on whether you need to address a single database or several data sources.

Another use case where Spark does not really apply is in-depth data analytics. Apache Spark is designed to process moderate amounts of data in memory, and it is only logical that an in-memory process cannot hold infinite amounts of data. If you want to do in-depth analytics using the ANSI SQL standards, you are better off using an MPP implementation such as IDAA.
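Spark does offer a safety valve for data that does not fit in memory: a storage level that spills partitions to local disk. The sketch below shows how that is requested; as argued later in the conclusion, relying heavily on this spill defeats the purpose of an in-memory engine. The input path is hypothetical.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SpillSketch {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("spill").setMaster("local[*]"))
    val rdd = sc.textFile("/data/large-extract.txt")   // hypothetical large input

    // MEMORY_AND_DISK lets partitions that no longer fit in memory spill to local disk
    // instead of being recomputed when needed again
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    println(rdd.count())

    sc.stop()
  }
}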

As mentioned before, there are many proprietary solutions that offer robust alternatives to Spark.

With DB2 as our focus, we may prefer to stick with plain DB2 as well. Why should we add these new, exotic programming layers if IBM provides great extensions such as the DB2 Analytics Accelerator (IDAA) and BLU Acceleration, which allow us to achieve similar effects within the context of a classic relational warehouse environment?

This question has been raised before, and a recent Redpaper on Spark and DB2 for z/OS[12] addresses it. IDAA is a high-performance accelerator for the z Systems platform that supports data-intensive and complex queries. Complex multidimensional queries can run as much as 2,000 times faster than the same query running natively on IBM DB2 for z/OS. This already covers most use cases of mainframe analytics, as long as the data is stored in DB2 only.

Spark comes in when additional data sources such as VSAM, IMS or SMF logs are involved in the analysis. The ‘DB2 Analytics Accelerator Loader’[13] does allow loading these data types; the Spark framework, however, can be seen as an enhancement to this. Not only does Spark provide the means to structure and ingest this external data into DB2 and the IDAA, where it can in turn be analyzed with DB2; Spark also provides the means to query these resources directly.

IBM has also focused on integrating data seamlessly through DB2’s federation capabilities. If you are targeting ‘classic’ data stores such as Oracle, Teradata, SQL Server, Sybase…, you may not benefit from Spark, as the federation can be done by DB2 itself and the in-memory aspects can be addressed differently. On DB2 for z/OS in particular, the DRDA standard allows you to catalog all data sources that support it. If you do not intend to broaden the scope of your analytics to other resources, Spark does not provide you with many additional features beyond the advantage of unified coding (but isn’t Java and its JVM (running Java/Scala/Groovy/…) meant as a unified coding approach as well?).

Finally, IBM also stays continuously compliant with ANSI SQL, unlike Spark, so in this regard too we are in very good company.

Conclusion

Apache Spark is a great project to look into. I would like to stress that there is great value in it, especially when integrating multiple types of data sources.

Spark can be the basis of a standard analytical approach, integrating Hadoop, mainframe and other environments and adding (not replacing!) great features. As such it establishes the idea of a virtual data warehouse.

One should, however, take note of a few caveats and consider this article as an antidote to all the hyped information on Spark; it isn’t bad to get back on our feet and recapitulate what the repercussions of such a paradigm shift are.

Spark has great potential, but unfortunately also a few drawbacks. Spark is a realm of its own, building upon the features of other paradigms. Although Spark is very flexible, relying on a single technology is always dangerous; remember what happens when using language generators such as APS, EGL or CA Gen. Although they were all successful, shifting from one paradigm to another is not that simple. Nobody knows what the future will bring, as data platforms (and the ways data is used) are evolving rapidly.

Finally, several use cases are not (yet?) covered by Spark: simple ETL performs less well due to the framework overhead, and complex queries are not supported due to the in-memory restrictions, unless you extend the configuration with external MPP functionality to which preliminary queries are delegated. Spark can spill data from memory to disk, but this must remain limited, as we know from our DB2 experience.

Spark only has real added value when you need to query various types of data sources. If you need to address a single database type, you will lose performance because of the overhead generated by the Spark framework. You may benefit from the special libraries, but if you do not integrate with other Spark implementations, you may want to consider alternative technologies that do the same.

Consider carefully whether Spark is fit for purpose and future-proof before you start to use it. Make sure to consider possible alternatives, as there are many. In the end, it is still the use case that should dictate the software to use, and not the other way around.

[1] See https://www.mapr.com/ebooks/spark/03-apache-spark-architecture-overview.html

[2] Hortonworks calls software projects that run on Hadoop ‘asparagus’, as they grow on top of HDFS and YARN

[3] See http://www.idug.org/p/bl/et/blogaid=198 and http://www.idug.org/p/bl/et/blogaid=199

[4] See http://doc.akka.io/docs/akka/snapshot/intro/what-is-akka.html

[5] See http://www.reactivemanifesto.org/

[6] See https://tez.apache.org/

[7] See http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

[8] As part of an IDUG publication, no product names are mentioned here, please contact the sales representatives of our sponsors if you are not aware of the many great products they produce or support.

[9] Slick is functional-relational mapping for Scala; see http://www.idug.org/p/bl/et/blogaid=199

[10] See http://storm.apache.org/

[11] As part of an IDUG publication, no product names are mentioned here, please contact the sales representatives of our sponsors if you are not aware of the many great products they produce or support.

[12] Apache Spark for the Enterprise: Setting the Business Free; See http://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/redp5336.html

[13] This is an IBM software facilitating loads in the DB2 Analytics Accelerator; see http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=AN&subtype=CA&htmlfid=897/ENUS213-471&appname=USN
