
F10 - Clash of the Titans : Apache Spark Vs. MapReduce

Session Number: 3278
Track: Ala Carte I
Session Type: Podium Presentation
Primary Presenter: Saurabh Agrawal [University of Southern California]
Room(s)/Time(s): Pecos => Wed, May 25, 2016 (09:15 AM - 10:15 AM)

Speaker Bio: Saurabh Agrawal is an IBM Certified Database Associate and a Data Science graduate student at the University of Southern California, Los Angeles. He worked for 2 years with ACI Payment Systems and has experience with DB2 LUW for highly transactional OLTP databases.

Saurabh enjoys collaborating and sharing his knowledge with the DB2 community via his blog, www.rideondata.com. He ranked 4th in "DB2's Got Talent 2014". He is the author of the IBM developerWorks article "Data Purge Algorithm: Efficiently delete terabytes of data from DB2 database", published in January 2015. He was also a speaker at IDUG NA 2015, which was his first IDUG conference.
Audience experience level: Beginner, Intermediate
Presentation Category: Emerging Technology, Big Data
Audiences this presentation will apply to: Application Developers, Data Architects
Objective 1: What is Apache Spark and why is it important?
Objective 2: Understanding the architectural differences between Spark and MapReduce
Objective 3: Studying the performance of Apache Spark and MapReduce for different analytics workloads
Objective 4: Spark SQL: A new relational data processing paradigm
Objective 5: Using Spark SQL to access and process data from IBM DB2

Abstract: With the advent of new technologies, the number of data sources has grown rapidly. Web server logs, machine log files, user activity on social media, clickstreams from websites, and many other sources have caused an exponential growth of data. Individually this content may not be very large, but taken across billions of users it amounts to terabytes or petabytes of data. Such massive volumes of structured, semi-structured, and unstructured data are collectively referred to as Big Data. With the ability to capture and manage big data came the need to perform analytics on top of it. Apache Spark and MapReduce are two very popular open-source cluster computing frameworks for large-scale data analytics. This talk will introduce Apache Spark and its core concepts, the ecosystem of services built on top of it, the types of problems it can solve, and its similarities to and differences from Apache Hadoop/MapReduce.
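To make Spark's core abstraction concrete, here is a minimal sketch of the classic word count expressed with Spark's RDD API in Scala. The application name, local master setting, and input file path "logs.txt" are illustrative placeholders, not material from the presentation.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Run locally for illustration; on a cluster the master setting would differ.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("logs.txt")       // load lines into an RDD
      .flatMap(_.split("\\s+"))                // split each line into words
      .map(word => (word, 1))                  // emit (word, 1) pairs, akin to a map phase
      .reduceByKey(_ + _)                      // aggregate per key, akin to a reduce phase

    counts.take(10).foreach(println)
    sc.stop()
  }
}

Unlike a chain of MapReduce jobs, intermediate RDDs can be cached in memory and reused across iterations, which is a large part of Spark's advantage for iterative workloads.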

While both MapReduce and Spark can be used for iterative and batch processing workloads, there are use cases where one is better suited than the other. This presentation focuses on the major architectural components of the MapReduce and Spark frameworks, including the shuffle, the execution model, and caching, and attributes the performance differences between the two to these components. Spark SQL brings native SQL support to Spark and streamlines the process of querying data stored both in RDDs (Spark's distributed datasets) and in external sources. Unifying the powerful abstractions of RDDs and relational tables makes it easy for developers to intermix SQL queries over external data with complex analytics, all within a single application. In this presentation I will share my experience of using Spark to access and analyze data stored in DB2.
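As a rough illustration of that last point, the sketch below reads a DB2 table into Spark SQL through the generic JDBC data source and mixes a SQL query with the DataFrame API. The host, database, schema, table, column names, and credentials are placeholders, and the IBM DB2 JDBC driver (db2jcc) is assumed to be on the Spark classpath; the presentation itself may take a different approach.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object Db2ViaSparkSql {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Db2ViaSparkSql").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Load a DB2 table as a DataFrame through Spark's JDBC data source.
    val orders = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:db2://dbhost:50000/SAMPLE")   // placeholder host and database
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .option("dbtable", "SALES.ORDERS")                 // placeholder schema.table
      .option("user", "db2user")                         // placeholder credentials
      .option("password", "secret")
      .load()

    // Register the DataFrame as a temporary table, then mix SQL with further analytics.
    orders.registerTempTable("orders")
    sqlContext.sql("SELECT REGION, SUM(AMOUNT) AS TOTAL FROM orders GROUP BY REGION").show()

    sc.stop()
  }
}

Once the DB2 data is exposed as a DataFrame, the same dataset can feed SQL queries, machine learning pipelines, or plain RDD transformations within a single application.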

Summing up, this presentation will broadly focus on:
1) What is Apache Spark and why is it important?
2) Understanding Spark internals and ecosystem
3) Studying the performance of Apache Spark and MapReduce for different analytics workloads
4) Spark SQL: A new relational data processing paradigm
5) Using Spark SQL to access and process data from IBM DB2
