Understanding How MapReduce is Used in Data Science

Sponsored School Search

If Hadoop is the lifeblood of the Big Data revolution, then MapReduce is its beating heart.

As much a programming paradigm as an actual code implementation, MapReduce is a deceptive name for the magic that actually happens in Hadoop and other massively parallel computing clusters.

As the name suggests, MapReduce is actually two steps– Mapping and Reducing the data:

  • Mapping sorts and filters a given data set
  • Reducing it performs a calculation of some sort on the resulting information

These steps are not exceptional and aren’t particularly new to the programming world. In and of themselves, these steps can be used just as easily with almost any functional programming language without resulting in any great breakthrough in data processing.

The real key to the Hadoop MapReduce operation was in the implementation. The ability to break apart the two steps and send them out in pieces to multiple servers in a cluster for parallel operation was what made MapReduce special.

And without that insight into implementation, the modern field of data science might never have come to be. In 2013, more than a quarter of data scientists responding to O’Reilly’s annual survey indicated they used Hadoop regularly– mapping and reducing like crazy.

With Massive Data Assets Come Great Processing Requirements

In 2002, Google had just passed Yahoo to become the most popular search engine in the world. The company was still two years from its Initial Public Offering and was experiencing the heyday of freedom that came with exploding revenues and few investors to satisfy.

It was also swamped with information to process and expectations of accuracy and performance. Although the dot-com bubble had just burst, the growth of the Internet had barely paused– more than 750 million users were surfing the web, and by 2004 the number of host computers would skyrocket to nearly 300 million.

Like a lot of startups, Google’s systems were a patchwork created by independent hackers approaching different problems in different ways. The processes that indexed the web pages the company’s web spiders had crawled were accurate, but slow. The company was implementing parallel processing to bring many servers to bear on the process, but had to figure out how to get them all to work together efficiently.

By 2003, they’d put together a cluster framework consisting of the Google File System to store all that data, and come up with a new paradigm to index it: MapReduce.

Hadoop Brings MapReduce Power to the People

Google’s breakthrough inspired external researchers working on parallel processing, too. In particular, Doug Cutting, director of the Internet Archive, and University of Washington graduate student Mike Cafarella had been banging their heads trying to create an open-source search engine for more than a year. They, too, could crawl the web and retrieve huge volumes of page data, but processing it proved prohibitive.

The MapReduce approach was just what they needed. It provided a way to automate much of the work they had been forced to do manually. Cutting and Cafarella dug in to create an open-source implementation to power their search engine indexing.

Today, the search engine they built is all but forgotten. The system for indexing the web, on the other hand (which Cutting named “Hadoop” after his son’s plush toy elephant), has taken over the Big Data world.

It Can Take Sweat and Tears to Map and Reduce with the Best

Hadoop is the best known MapReduce implementation but data scientists use the algorithm with many packages, including:

  • Couchdb
  • Infinispan
  • Riak

Although the cluster software supports MapReduce, it’s up to data scientists to actually write both the Map and Reduce functions that will be executed. The cluster manager takes care of distributing the data and code for processing.

The Map function takes a set of input key/value pairs and executes code on them to produce the desired output key/value pairs. The classic Map function searches sets of documents to count instances of words in those documents– an index. The corresponding Reduce function then totals the counts of those instances, across the broad set of documents, to return totals for the number of occurrences of each word. Both sets of instructions must be stateless; the process of distributing them across the machines in the cluster means that no part of the results can be dependent on any other part until the final combination.

This, of course, describes the simple step of indexing the world wide web–the purpose for which MapReduce was built.

But the two procedures can be used on almost any large set of information, and can be iterated and expanded to perform complex searches and calculations that might take conventional computers, even supercomputers, months or years to process.

As Data Scientists Learn More About Processing, MapReduce is Reduced to a Niche Player

Although MapReduce revolutionized data science on very large information sets, it represented only the first baby steps in using massively parallel clusters to perform intensive data sorting and computation. Almost immediately, some users ran into limitations at the type of problems that the functions could address. Database experts have taken MapReduce to task for its inability to function with conventional RDBMS (Relational Database Management System).

Despite having a powerful cluster behind it, there’s no guarantee a MapReduce operation will necessarily be blazingly fast– or even faster than the same calculation performed on a single workstation. For data scientists, the key to getting the most out of MapReduce is in understanding the tradeoffs inherent in how it runs. The communication costs of transmitting, writing, and reading data within the cluster can dominate the effectiveness of a MapReduce function, eating up the advantages of parallel processing without providing discernible benefits.

Fault tolerance is also a consideration for MapReduce programmers. Redundancy on cheap hardware demands multiple copies of data be stored, but the overhead associated with tracking and writing that data is significant. With certain data sets, this can far outweigh the speed of parallel processing.

So MapReduce has been reduced from the general purpose hammer to just another tool in the data science toolbox. All the same, MapReduce is nothing to sneeze at. Google used it to reindex the entire World Wide Web daily for years, crunching through terabytes of spider results on their massive cluster. For one-time calculations or simple sorts of truly massive data sets, it remains the preferred solution.

This powerful approach to certain types of analytical problems that almost all data scientists face at some point or another means that MapReduce will continue to be a mandatory part of the education of anyone entering the field for years to come.

Back to Top