Understanding the Role of NoSQL in Data Science

Craigslist might be the most unlikely classified advertising site in the world. Serving more than 570 cities in 50 different countries, the plain blue-text on white-background site hasn’t changed much visually since founder Craig Newmark took the trouble of creating it as a web-based alternative to his popular San Francisco area email list back in 1996.

But behind the scenes, Craigslist has changed enormously since the mid-nineties. The site was initially served from a single MySQL database, but its popularity and geographic expansion pushed its archives onto a MySQL cluster. With 1.5 million new ads arriving per day, even that high-performance cluster was insufficient: a single ALTER TABLE statement took months to execute as indexes on billions of records were laboriously rebuilt. No other updates to the archive could occur while the alteration was running, so the overflow backed up in the production database and bogged it down.

Craigslist engineers realized that any relational database would suffer from the same problems: the very way the data was being stored was the bottleneck. So they turned to MongoDB, a document-based NoSQL database. MongoDB came without the complexity of relational indexes and did not require records to share the same fields, so updates could occur rapidly with no broad schema to adjust.
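
To make the schema-free idea concrete, here is a minimal sketch in plain Python, with dictionaries standing in for documents. The sample ads, field names, and the find helper are all invented for illustration; this is not Craigslist's data model and it does not use the MongoDB API.

```python
# Hypothetical classified ads. Each "document" carries only the fields it needs.
ads = [
    {"_id": 1, "category": "cars", "title": "1998 Honda Civic", "price": 1500, "mileage": 182000},
    {"_id": 2, "category": "housing", "title": "Sunny studio", "rent": 1200, "pets_ok": True},
    {"_id": 3, "category": "free", "title": "Moving boxes"},
]


def find(collection, **criteria):
    """Return documents whose fields match every criterion.

    A document that lacks a requested field simply does not match.
    """
    missing = object()  # sentinel that never equals a real value
    return [
        doc for doc in collection
        if all(doc.get(field, missing) == value for field, value in criteria.items())
    ]


# A new kind of field needs no schema change: just store a document that has it.
ads.append({"_id": 4, "category": "cars", "title": "2005 Ford Focus", "price": 2200, "color": "blue"})

cars = find(ads, category="cars")
print([doc["title"] for doc in cars])
```

The older documents are untouched when the "color" field appears, which is the property that spares a document store the months-long table alteration described above.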

And without internal consistency requirements, MongoDB also made it easier to split the data into shards for further distribution across commodity hardware clusters. With auto-sharding in place, the initial Craigslist NoSQL deployment could support 5 billion documents and 10 terabytes of data.
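
The idea behind sharding can be sketched in a few lines of Python. This is a simplified hash-based router, assuming a fixed cluster of four shards held in memory; real systems such as MongoDB also handle chunk splitting, rebalancing, and replication, none of which is modeled here. The names NUM_SHARDS, shard_for, put, and get are hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # assumed cluster size for this example


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a document key to a shard number using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Each shard is an independent dictionary; in production each would be a separate server.
shards = defaultdict(dict)


def put(key: str, document: dict) -> None:
    """Store a document on the shard that owns its key."""
    shards[shard_for(key)][key] = document


def get(key: str):
    """Read a document by asking only the shard that owns its key."""
    return shards[shard_for(key)].get(key)


for i in range(1000):
    put(f"ad-{i}", {"title": f"Listing {i}"})

# Show how many documents landed on each shard.
print({shard_id: len(docs) for shard_id, docs in sorted(shards.items())})
print(get("ad-42"))
```

Because each document stands alone, no shard needs to consult another to answer a lookup by key, which is what makes this kind of split practical on commodity hardware.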

Today, NoSQL databases are an integral part of operations at many major websites and increasingly at academic institutions and companies of all stripes. Data scientists working with Big Data can expect to use some variant of NoSQL for decades to come.

NoSQL isn’t New, But it Fills a New Niche

If you work in technology long enough, you will find that some old things become new again as they are repurposed for new applications. Such is the case with non-relational database structures, which toss every advance in database design from the last 40 years into the trash bin in favor of pure, blinding speed.

For the thousands of years of recorded history before 1970, no data store was relational. All intentionally documented information, from Sumerian beer distribution records to U.S. Army pay ledgers, was basically kept as big, long lists. Cross-referencing and indexing were laborious manual operations, even when those lists were stored digitally.

The Rise of Relational Databases

But a researcher at IBM, E.F. Codd, realized that it could be easier and faster to look up certain results if information were stored relationally, that is, with keys allowing lists to be interrelated according to different types of links. SQL (pronounced "sequel") was the language created to query those Relational Database Management System (RDBMS) stores.

The concept revolutionized data storage. Today, almost every application a person uses on any electronic device is powered by a relational database at some level.

As powerful as the relational paradigm is for storing and querying data, it suffers from drawbacks in scaling. Since queries rely on the indexes kept on the relationships between various tables, every time any of those tables is updated, the index must be updated too. This can make for relatively slow writing and updating operations. And the index must be singular, a master reference; there is no easy way to split it apart for distributed computing.
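
The write cost of indexing can be illustrated with a toy table in Python. This is a sketch under simplifying assumptions, not how any particular RDBMS implements its indexes: the IndexedTable class and its index_writes counter are hypothetical, and the indexes are plain in-memory dictionaries.

```python
from collections import defaultdict


class IndexedTable:
    """A toy table that keeps one secondary index per indexed column."""

    def __init__(self, indexed_columns):
        self.rows = {}
        self.indexes = {col: defaultdict(set) for col in indexed_columns}
        self.index_writes = 0  # counts index entries touched by inserts

    def insert(self, row_id, row):
        """Store the row, then update every index; this is the extra write cost."""
        self.rows[row_id] = row
        for col, index in self.indexes.items():
            index[row.get(col)].add(row_id)
            self.index_writes += 1

    def lookup(self, col, value):
        """Indexed read: jump straight to matching rows without scanning the table."""
        return [self.rows[row_id] for row_id in sorted(self.indexes[col][value])]


table = IndexedTable(indexed_columns=["city", "category"])
table.insert(1, {"city": "SF", "category": "cars"})
table.insert(2, {"city": "SF", "category": "housing"})
table.insert(3, {"city": "NYC", "category": "cars"})

print(table.lookup("city", "SF"))
print(table.index_writes)  # three rows times two indexes
```

Reads are fast because the index already knows where the matching rows are, but every insert pays once per index, and that cost grows with the number of indexes and the size of the data.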

In a world where the volume of data being generated second by second is only increasing, that slowness put a real ceiling on the growth of high-traffic websites. Netflix, for example, found that request traffic expanded by more than 3000 percent between 2010 and 2011, far outstripping the company’s ability to build out more Oracle-based data centers to keep up. Instead, the company moved to Cassandra, a NoSQL database, and found latency dropping and downtime eliminated.

Letting Data be Abnormal is Key to NoSQL

The integrity of an RDBMS is dependent on the singularity of the data within it. Companies ran into chronic problems once their database requirements outgrew the capacity of single servers; it was impossible to coherently store relational data on large clusters because there was no efficient way to keep all the indexes synchronized.

NoSQL offers an architectural approach with fewer constraints. In general, this makes it easier to break apart NoSQL data stores, but more difficult to query them for complex results.

For data scientists, NoSQL is a double-edged sword. Although the technology makes it almost trivially easy to rapidly accumulate massive sets of data and rapidly scale data stores to meet demand, it also involves committing a cardinal sin of data analysis: de-normalizing data.

Normalized data demands that a unitary item of information be stored only once; relationships between entities allow the retrieval of that item when needed while simultaneously ensuring that it is recorded properly. Without any enforcement of atomicity, a NoSQL database has no way to ensure that data is not duplicated, or even that it is collected in the first place. It’s also difficult to ensure that all relevant records are updated or deleted when modifications are made. NoSQL databases usually strive for something called “eventual consistency,” in which copies of the data converge on the same value only after some delay. That isn’t good news if you want accurate results from your query immediately.
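
The difference is easiest to see side by side. The following Python sketch uses invented customer and order records (all names and addresses are hypothetical) to show how a single change stays consistent in a normalized layout but can drift apart when the same fact is copied into several documents.

```python
# Normalized: the customer's email is stored once and referenced by id.
customers = {101: {"name": "Ada", "email": "ada@example.com"}}
orders_normalized = [
    {"order_id": 1, "customer_id": 101, "item": "keyboard"},
    {"order_id": 2, "customer_id": 101, "item": "monitor"},
]

# De-normalized: each order document embeds its own copy of the customer data.
orders_denormalized = [
    {"order_id": 1, "customer": {"name": "Ada", "email": "ada@example.com"}, "item": "keyboard"},
    {"order_id": 2, "customer": {"name": "Ada", "email": "ada@example.com"}, "item": "monitor"},
]

# Normalized update: one write, and every order sees the new value.
customers[101]["email"] = "ada@newmail.example"

# De-normalized update: every copy must be found and changed. Here only the
# first copy is updated, on purpose, to show how copies can disagree.
orders_denormalized[0]["customer"]["email"] = "ada@newmail.example"

normalized_emails = {customers[o["customer_id"]]["email"] for o in orders_normalized}
denormalized_emails = {o["customer"]["email"] for o in orders_denormalized}

print(sorted(normalized_emails))    # a single, agreed-upon email
print(sorted(denormalized_emails))  # two conflicting emails for the same customer
```

Any analysis that groups the de-normalized orders by customer email would now count one person as two, which is the kind of silent error that makes cleaning NoSQL data a routine part of the job.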

NoSQL databases are also frequently single-purpose. A good relational database describes the entities it contains with a schema that should closely model the reality from which the data is collected. That schema, quite independent of the original purpose, can be used to fill any other need that might arise in which that information would prove useful: a corporate accounting database can also feed a generalized ERP (Enterprise Resource Planning) system, for instance.

A NoSQL database might have to be re-built from scratch to fill a new role, however, or at least extensively transformed.

There’s More Than One Way to Not Relate Data

Theoretically, of course, any data store that isn’t relational could be called “NoSQL.” In practice, there are a few common types of NoSQL database:

  • Key-Value Store
  • Column
  • Document
  • Graph

Each of these techniques has strengths in different applications:

  • Column stores, for instance, are ideal for near-constant streams of data like those generated by server log files.
  • Document stores, on the other hand, are good for scenarios where it’s difficult to anticipate the exact nature of the data to be stored—they store each item as a unique object, which doesn’t necessarily have to conform to a predefined schema.
  • Key-value stores strip data storage down to the bare bones, essentially dumping data items into a set of rows like a spreadsheet. The lightweight nature means it is easy to keep key-value data in memory, where it can be rapidly read or written.
  • Graph stores are conceptually awkward for linear thinkers to grasp, as opposed to key-value and document stores, which are eminently recognizable to anyone who has ever looked at a spreadsheet. Graphs use a series of nodes and edges to represent data items and the relationships between them, resulting in a very visual schema.
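
The four models can be compared by storing the same small set of facts in each shape. The Python sketch below uses only built-in data structures; the users, column names, and the neighbors helper are hypothetical, and real products in each category add persistence, indexing, and query languages on top of these basic shapes.

```python
import json

# Key-value: the store sees only an opaque value behind each key.
key_value = {
    "user:alice": json.dumps({"name": "Alice", "city": "Seattle"}),
    "user:bob": json.dumps({"name": "Bob", "city": "Austin"}),
}

# Column (wide-column): row key -> column name -> value; rows may differ in columns.
column_family = {
    "alice": {"profile:name": "Alice", "profile:city": "Seattle", "stats:logins": 42},
    "bob": {"profile:name": "Bob", "profile:city": "Austin"},
}

# Document: self-describing objects that can nest lists and other objects.
documents = [
    {"_id": "alice", "name": "Alice", "city": "Seattle", "follows": ["bob"]},
    {"_id": "bob", "name": "Bob", "city": "Austin", "follows": []},
]

# Graph: nodes for the items, edges for the relationships between them.
nodes = {"alice": {"name": "Alice"}, "bob": {"name": "Bob"}, "seattle": {"kind": "city"}}
edges = [("alice", "FOLLOWS", "bob"), ("alice", "LIVES_IN", "seattle")]


def neighbors(node_id, relation):
    """Follow every edge of one relation type that leaves the given node."""
    return [dst for src, rel, dst in edges if src == node_id and rel == relation]


print(json.loads(key_value["user:alice"])["city"])  # the value must be decoded by the caller
print(sorted(column_family["alice"]))               # columns present for this row only
print(documents[0]["follows"])                      # nested data travels with the document
print(neighbors("alice", "LIVES_IN"))               # relationships are first-class in a graph
```

The facts are the same in every case; what changes is which questions are cheap to ask. A lookup by key is trivial in the key-value form, while "who does Alice follow, and where do they live?" is most natural in the graph form.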
