spark-notes


Spark as a successful contender to MapReduce

It's not uncommon for a beginner to think of Spark as a replacement for Hadoop. The term "Hadoop" is used interchangeably to refer to the Hadoop ecosystem, Hadoop MapReduce, or Hadoop HDFS. Apache Spark came in as a very strong contender to replace the Hadoop MapReduce computation engine.

[Figure: MapReduce vs Spark]

This blog aims to give a better understanding of what motivated Spark and how it evolved into a strong contender to MapReduce.

We will divide this blog into 4 parts:

  1. MapReduce computation engine in a nutshell

  2. Cons of MapReduce as motivation for Spark
    • Look at the drawbacks of MapReduce
    • How Spark addressed them
  3. How Spark works
    • Behind the scenes of a Spark application running in a cluster
  4. Appendix
    • Look at other attempts, like Corona, made to address the downsides of the MapReduce engine.

1. MapReduce (MR) computation in a nutshell

I'll not go deep into the details, but let's take a bird's-eye view of how Hadoop MapReduce works. The figure below shows a typical Hadoop cluster running two MapReduce applications. Each application's Map (M) and Reduce (R) tasks are marked in black and white respectively.

[Figure: Hadoop MapReduce cluster running two applications]
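To make the Map (M) and Reduce (R) phases concrete, here is a tiny word-count sketch in plain Scala (not the actual Hadoop API); the input lines are made up purely for illustration:

```scala
// Word count expressed as a Map phase followed by a Reduce phase (plain Scala collections)
val lines = Seq("spark is fast", "mapreduce writes to hdfs", "spark keeps data in memory")

// Map phase: each line is turned into (word, 1) pairs
val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Shuffle + Reduce phase: pairs are grouped by word and their counts are summed
val counts = mapped.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }

counts.foreach(println)   // e.g. (spark,2), (is,1), ...
```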

2. Cons of Map-Reduce as motivation for Spark

One can say that Spark took direct motivation from the downsides of the MapReduce computation system. Let's look at the drawbacks of the MapReduce computation engine and how Spark addressed them:

  1. Parallelism via processes:
    • MapReduce: MapReduce does not run its Map and Reduce tasks as threads. They are processes, which are heavyweight compared to threads.
    • Spark: Spark runs its tasks as threads spawned inside long-lived executor JVMs, which makes launching them much cheaper.
  2. CPU Utilization:
    • MapReduce: The slots within a TaskTracker, where Map and Reduce tasks get executed, are not generic slots that can run either kind of task. They are split into two fixed pools: one for Map tasks and one for Reduce tasks. Why does this matter? When you start a MapReduce application, it might spend hours in just the Map phase, during which none of the Reduce slots are used. If you check CPU utilization during that time, it will be low because all the Reduce slots are sitting idle. Facebook came up with an improvement to address this; if you are interested, please check the Appendix section below.
    • Spark: Similar to the TaskTracker in MapReduce, Spark has an executor JVM on each machine. But unlike the hardcoded Map and Reduce slots in a TaskTracker, these slots are generic: any task can run in any of them.
  3. Extensive reads and writes:
    • MapReduce: A whole lot of intermediate results are written to HDFS and then read back by the next job. The data handshake between any two chained jobs happens via reads and writes to HDFS.
    • Spark: Spark is an in-memory processing engine. Data and intermediate results are kept in memory wherever possible, which is one of the reasons Spark can be 10-100x faster. See the sketch after this list.
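To make points 1-3 concrete, here is a minimal Scala sketch. The master URL, core counts, and input path are illustrative assumptions rather than values from this post; the point is that Spark tasks run as threads in generic slots inside an executor JVM, and that cached intermediate results are reused from memory instead of being written to and re-read from HDFS.

```scala
import org.apache.spark.sql.SparkSession

object SparkVsMrSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-vs-mr-sketch")
      .master("local[4]")                   // 4 generic task slots == 4 threads in one JVM
      .config("spark.executor.cores", "4")  // on a cluster: 4 generic slots per executor JVM
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input path; the intermediate result below is NOT written back to HDFS
    val lines  = spark.read.textFile("hdfs:///tmp/sample-logs.txt")
    val errors = lines.filter(_.contains("ERROR")).cache()   // kept in memory after first use

    // Both actions reuse the in-memory 'errors' dataset; no intermediate HDFS round trip
    println(s"error count: ${errors.count()}")
    errors.map(_.toLowerCase).show(5)

    spark.stop()
  }
}
```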

Note: Facebook came up with Corona to address some of these cons, and it achieved a ~17% performance improvement on MapReduce jobs. I've detailed it in the Appendix.

3. How Spark works:

Now that we have seen the disadvantages of MapReduce and how Spark addressed them, it's time to jump in and briefly look at the internals of Spark. In particular, I'll try to cover what a Spark application running in a cluster looks like. The picture below depicts a Spark cluster in standalone mode:

[Figure: Spark cluster in standalone mode]

Let's look at the different components shown in the above picture:
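The component walkthrough is not reproduced here, but in standalone mode the moving parts are a driver program, a master (cluster manager), workers, and the executors they launch. As a rough sketch of where an application fits in, here is a minimal driver program; the master URL and application name are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

// Minimal driver program: the driver builds the SparkSession, asks the (standalone) master
// for resources, and the master's workers launch executor JVMs that run this app's tasks.
object ClusterAppSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cluster-app-sketch")
      .master("spark://master-host:7077")   // hypothetical standalone master URL
      .getOrCreate()

    val numbers = spark.range(1, 1000000)   // work is split into tasks across executors
    println(s"sum = ${numbers.selectExpr("sum(id)").first().getLong(0)}")

    spark.stop()
  }
}
```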

Conclusion:

We've seen how the MapReduce computation engine works, its drawbacks, and how Spark addressed them to emerge as a strong contender.


4. Appendix:

4.1 Corona - An attempt to make up for the downsides of MapReduce and improve CPU Utilization

Facebook came up with Corona to address the CPU utilization problem that MapReduce has. In their Hadoop cluster, Facebook was running hundreds of MapReduce (MR) jobs, with many more in the backlog waiting to run because all the MR slots were occupied, yet they noticed that CPU utilization was pretty low (~60%). This was surprising, given that all the Map (M) and Reduce (R) slots were full and a whole lot of jobs were waiting for a free slot. What they found was that in traditional MR, once a Map task finishes, the TaskTracker has to let the JobTracker know that a slot has freed up, and the JobTracker then allots that slot to the next job. This handshake between TaskTracker and JobTracker took ~15-20 seconds before the next job picked up the freed slot, because the TaskTracker reports free slots to the JobTracker via a heartbeat that fires only once every ~3 seconds, and it is not guaranteed that the next job is assigned on the very next heartbeat. So Facebook built Corona, a more aggressive job scheduler that sits on top of the JobTracker. Plain MapReduce took 66 seconds to refill a freed slot, while Corona took about 55 seconds (66 → 55 is a ~17% improvement). Slots here are M or R process ids.

4.2. Legend:
