10.
Spark is faster because it avoids reading data from disk multiple times.

11.
[Diagram: the same group-by queries run on Hive/MapReduce and on Spark]

Hive/MapReduce
  Group By Country: Read from HDFS, Decompress, Deserialize
  Group By State: Read from HDFS, Decompress, Deserialize
  Group By Video: Read from HDFS, Decompress, Deserialize
  10s of Group Bys …

Spark
  Group By Country: Read from HDFS, Decompress, Deserialize, Cache data in memory
  Group By State: Read data from memory
  Group By Video: Read data from memory
  10s of Group Bys …

Callouts:
  Cache only columns of interest
  Hive/MapReduce startup overhead
  Overhead of flushing intermediate data to disk
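
The comparison above can be illustrated with a small, self-contained Python sketch. It is a toy model only and does not use the Spark or Hive APIs: the `Source` class, the sample records, and the `run_without_cache` / `run_with_cache` helpers are hypothetical names made up for this illustration, and a zlib-compressed JSON blob stands in for a file on HDFS. It counts how many times the stored data has to be read, decompressed, and deserialized when each group-by reloads it versus when the first load is cached in memory.

```python
import json
import zlib
from collections import Counter

# Hypothetical sample records standing in for a file stored on HDFS.
RECORDS = [
    {"country": "US", "state": "CA", "video": "v1", "bitrate": 1200},
    {"country": "US", "state": "NY", "video": "v2", "bitrate": 800},
    {"country": "IN", "state": "KA", "video": "v1", "bitrate": 600},
    {"country": "US", "state": "CA", "video": "v2", "bitrate": 1500},
]
STORED = zlib.compress(json.dumps(RECORDS).encode("utf-8"))

# Columns the group-by queries actually need ("cache only columns of interest").
KEYS = ["country", "state", "video"]


class Source:
    """Toy stand-in for on-disk data; counts every full load."""

    def __init__(self, blob):
        self.blob = blob
        self.loads = 0

    def load(self, columns):
        # Read + decompress + deserialize: the work repeated per query
        # in the uncached flow.
        self.loads += 1
        rows = json.loads(zlib.decompress(self.blob).decode("utf-8"))
        # Keep only the requested columns.
        return [{c: row[c] for c in columns} for row in rows]


def group_count(rows, key):
    """Count rows per distinct value of `key`."""
    return Counter(row[key] for row in rows)


def run_without_cache(source):
    # Every group-by goes back to the stored copy.
    return {k: group_count(source.load(KEYS), k) for k in KEYS}


def run_with_cache(source):
    # Load once, keep the rows in memory, and reuse them for every group-by.
    cached = source.load(KEYS)
    return {k: group_count(cached, k) for k in KEYS}


if __name__ == "__main__":
    uncached, cached = Source(STORED), Source(STORED)
    same = run_without_cache(uncached) == run_with_cache(cached)
    print("same results:", same)
    print("loads without cache:", uncached.loads)
    print("loads with cache:", cached.loads)
```

Both flows return the same counts; the difference is only in how often the load step runs, which grows with the number of group-bys in the uncached flow and stays at one in the cached flow.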