Understanding the Role of Spark in Data Science

Unsurprisingly, a lot of data analysis involves sorting. ZIP codes, income levels, hair colors, serial numbers, dates… you name it, data processed for human consumption invariably requires some kind of sorting and ordering to make it easily understandable.

So data scientists spend a lot of time sorting data, and thinking about better ways of sorting data, because when you are dealing with data sets with potentially billions of individual items, even the fastest computers can take a long time to put them in an order that makes sense.

Being the fun bunch that they are, data scientists even started a contest to see who could sort things the fastest: the Daytona GraySort (named for legendary Turing Award winner Jim Gray, a Microsoft computer scientist who maintained the records for the contest until he disappeared at sea in 2007).

Up through 2013, Apache’s Hadoop was the strongest contender, turning in sort rates of a little more than one terabyte per minute. But in 2014, a new challenger showed up and blew the doors off, clocking in at triple the best rate Hadoop had managed. That challenger? Another open-source Apache project, named Spark.

Spark is not technically a competitor to Hadoop (which is in fact a loose confederation of four different components commonly used together for cluster computing and data storage) but rather a complement to it. As with most ground-breaking technologies, almost as soon as Hadoop hit it big, programmers started finding ways to improve on it.

Spark represents one of those improvements, and it’s a big one.

Spark Puts Hadoop Data Stores on Steroids

Hadoop continues to garner the most name recognition in big data processing, but Spark is, appropriately, beginning to ignite Hadoop’s utility as a vehicle for data analysis and processing rather than simply data storage.

Hadoop consists of four core components:

  • Hadoop Common – Essential utilities and tools referenced by the other modules
  • Hadoop Distributed File System (HDFS) – The high-throughput file storage system
  • Hadoop YARN – The job-scheduling framework for distributed process allocation
  • Hadoop MapReduce – The parallel processing module based on YARN

Spark replaces only the processing side of that stack: MapReduce, and optionally YARN (Spark can run on top of YARN or manage the cluster itself). According to a February 2016 article in Information Week, many Spark implementations chug happily away on top of Hadoop Common code and HDFS. Thanks to that integration, many major companies that have implemented Hadoop clusters to deal with insane amounts of data – the likes of Amazon and Facebook – have kept the data storage elements and simply swapped in Spark as a high-performance alternative to MapReduce.
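
To make the swap concrete, here is a minimal sketch of a PySpark job running the classic word count against a file on HDFS; the file path is a hypothetical placeholder, not a reference to any particular cluster.

```python
from pyspark.sql import SparkSession

# Build a Spark session; on a Hadoop cluster this would typically be launched
# with spark-submit so the session picks up the cluster configuration.
spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

# "hdfs:///data/logs/input.txt" is a hypothetical HDFS path.
lines = sc.textFile("hdfs:///data/logs/input.txt")

counts = (lines.flatMap(lambda line: line.split())    # split each line into words
               .map(lambda word: (word, 1))           # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b))      # sum the counts per word

for word, total in counts.take(10):                   # bring a small sample back to the driver
    print(word, total)

spark.stop()
```

The same logic written as a MapReduce job would require separate mapper and reducer classes plus a job driver; here it is a handful of chained calls against data that never leaves HDFS.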

But Hadoop is not the only data store that Spark can work with, making it more flexible than MapReduce in many ways. In addition to Hadoop, Spark supports the following (one of them is sketched in the example just after this list):

  • MapR Filesystem
  • Cassandra
  • Amazon S3
  • Kudu

Spark also has native cluster management built in so it can be deployed without YARN or other third-party cluster managers.
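
For illustration, choosing Spark’s built-in standalone cluster manager over YARN comes down to the master URL the application is configured with; the host and port below are hypothetical.

```python
from pyspark.sql import SparkSession

# "spark://spark-master:7077" is a hypothetical standalone master URL;
# swapping it for "yarn" would hand scheduling back to Hadoop's YARN instead.
spark = (SparkSession.builder
         .appName("StandaloneExample")
         .master("spark://spark-master:7077")
         .getOrCreate())

print(spark.range(1000).count())   # trivial job to confirm executors are reachable
spark.stop()
```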

An In-memory Approach Produces Blistering Speeds

MapReduce was designed to support massive disk clusters, the most prevalent and economical data storage technology available at the time it was being developed. The parallel processing module read data in from disk, mapped a function across it, reduced the results, and wrote the data back to disk again.

Although it could be scattered across a huge cluster of machines, the linear aspect of this operation was unavoidable.

Disk reads and writes are, in computing terms, expensive and time-consuming operations. In-memory data manipulation is considerably faster, and that is what Spark leverages to gain an advantage.

Instead of working with linear disk operations, Spark implements a feature called “resilient distributed datasets,” or RDDs. These read-only, fault-tolerant collections of data items are partitioned across the cluster to provide rapid access for iterative processes – just the sort of processing that data scientists often need when performing analysis on large datasets.
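
A rough sketch of why that matters, using a small made-up dataset: once an RDD is cached, every subsequent pass over it is served from cluster memory rather than re-read from disk, and iterative analysis is nothing but repeated passes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("IterativeSketch").getOrCreate()
sc = spark.sparkContext

# Made-up numeric dataset; in a real job this would come from HDFS, S3, etc.
samples = sc.parallelize(range(1, 100_001)).map(float)
samples.cache()                                  # pin the RDD in cluster memory

# Repeated scans over the same data; the access pattern that sent MapReduce
# back to disk on every pass is served from memory here.
for _ in range(5):
    cutoff = samples.sample(False, 0.01).mean()                 # pass 1: estimate a cutoff
    kept = samples.filter(lambda x, c=cutoff: x > c).count()    # pass 2: scan with that cutoff
    print(f"kept {kept} values above {cutoff:.1f}")

spark.stop()
```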

In fact, this is exactly what Spark was designed for.

The Needs of Data Scientists Drove the Development of Spark

University of California, Berkeley’s AMPLab (AMP stands for algorithms, machines, and people) is a hotbed of machine learning research. Like most other big data-dependent operations in 2009, it was running experimental machine learning processes on a Hadoop cluster and taking advantage of MapReduce to execute algorithms on hundreds of different machines.

But researchers noticed that some types of algorithms were performing poorly on the cluster, despite all the theoretical computing power at hand. In particular, repeated scans over a given data set sometimes executed faster on an average laptop than on the Hadoop cluster.

Matei Zaharia, a PhD candidate at the lab, thought he could come up with something better, and sat down to begin writing Spark. He decided it would take advantage of fast, in-memory datasets that wouldn’t require a lot of disk reads and could be distributed across machines in the cluster. And, unlike MapReduce, it would offer programmers more than 80 specialized operators, giving them more fine-grained control over how cluster processing was handled.
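
As a small illustration of that finer-grained control (the records and field meanings here are made up), a few of those operators can express a per-key aggregation and a join that would otherwise take several chained MapReduce jobs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("OperatorsSketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical (user_id, purchase_amount) records and a (user_id, region) lookup table.
purchases = sc.parallelize([(1, 20.0), (2, 35.5), (1, 12.0), (3, 7.25)])
regions   = sc.parallelize([(1, "west"), (2, "east"), (3, "west")])

totals = purchases.reduceByKey(lambda a, b: a + b)        # total spend per user
by_region = (totals.join(regions)                         # attach each user's region
                   .map(lambda kv: (kv[1][1], kv[1][0]))  # reshape to (region, total)
                   .reduceByKey(lambda a, b: a + b))      # total spend per region

print(sorted(by_region.collect()))
spark.stop()
```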

The result was a hit with data scientists. Zaharia turned the project over to the Apache Software Foundation, and adoption and expansion exploded.

Inside Big Data reported in November 2015 that Spark was the most active big data project in open source, with more than 600 contributors over the preceding 12 months adding to or improving the code.

And data scientists have continued to build on Spark to provide even more specialized solutions that take advantage of its speed …

Spark SQL allows the seamless integration of standard SQL queries into Spark programs, introducing a common DataFrame object to allow data scientists to work with stored information in a format they already understand well.
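
A brief sketch of that integration, with made-up rows and column names: a DataFrame registered as a temporary view can be queried with plain SQL, and the result comes back as another DataFrame.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLSketch").getOrCreate()

# Hypothetical rows; a real job would load these from Parquet, CSV, a Hive table, etc.
df = spark.createDataFrame(
    [("alice", 34, "west"), ("bob", 41, "east"), ("carol", 29, "west")],
    ["name", "age", "region"],
)

df.createOrReplaceTempView("people")                 # expose the DataFrame to SQL

adults_by_region = spark.sql(
    "SELECT region, COUNT(*) AS n FROM people WHERE age >= 30 GROUP BY region"
)
adults_by_region.show()                              # the result is itself a DataFrame
spark.stop()
```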

MLlib is a dedicated machine learning library for Spark that allows data scientists to implement popular machine learning algorithms on top of Spark directly within Scala, Java, Python, and R programs.
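
Here is a hedged sketch of what that looks like from Python, using a tiny made-up training set and MLlib’s DataFrame-based API to fit a logistic regression model.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("MLlibSketch").getOrCreate()

# Tiny made-up training set: (feature vector, label).
train = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1]), 0.0),
     (Vectors.dense([2.0, 1.0]), 1.0),
     (Vectors.dense([2.2, 1.5]), 1.0),
     (Vectors.dense([0.1, 0.9]), 0.0)],
    ["features", "label"],
)

lr = LogisticRegression(maxIter=10, regParam=0.01)   # a popular algorithm from MLlib
model = lr.fit(train)                                # training is distributed across the cluster
model.transform(train).select("features", "prediction").show()
spark.stop()
```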

SparkR is an R package that provides a lightweight frontend for using Spark from within that popular data analysis language.

Even though Spark didn’t win the 2016 Daytona GraySort, it’s continuing to win over data scientists around the world.
