F10 - Clash of the Titans: Apache Spark vs. MapReduce

Topic: Cross Platform DB2 for z/OS & LUW

Subtopic: 2016

New technologies have multiplied the number of data sources. Web server logs, machine log files, social media activity, website clickstreams, and many other sources have driven exponential growth in data volume. Individually, each user's contribution may be small, but across billions of users it adds up to terabytes or petabytes. Such massive data sets, which include structured, semi-structured, and unstructured data, fall under the umbrella term Big Data. With the ability to capture and manage big data came the need to perform analytics on it. Apache Spark and MapReduce are two popular open source cluster computing frameworks for large-scale data analytics. This talk will introduce Apache Spark and its core concepts, the ecosystem of services built on top of it, the types of problems it can solve, and its similarities to and differences from Apache Hadoop/MapReduce.

While both MapReduce and Spark can be used for iterative and batch processing workloads, there are use cases where one is better than the other. This presentation focuses on the major architectural components of the MapReduce and Spark frameworks, including shuffle, execution model, and caching, and attributes the performance differences between the two frameworks to these components. In addition, Spark SQL brings native SQL support to Spark and streamlines querying data stored both in RDDs (Spark’s distributed datasets) and in external sources. Unifying the abstractions of RDDs and relational tables makes it easy for developers to intermix SQL commands that query external data with complex analytics, all within a single application. In this presentation I will share my experience of using Spark to access and analyse data stored in DB2.

In summary, this presentation will focus on:
1) What is Apache Spark and why is it important?
2) Understanding Spark internals and ecosystem
3) Studying the performance of Apache Spark and MapReduce for different analytics workloads
4) Spark SQL: A new relational data processing paradigm
5) Using Spark SQL to access and process data from IBM DB2
