Preparing for a Data Science Career in Government Surveillance with a Master’s Degree

In 2012, a handful of journalists began receiving encrypted e-mails from an anonymous source who claimed to have inside information on U.S. government surveillance efforts. For a year, they tried to verify the identity of the source and the veracity of the information.

In May of 2013, both those things were established: the source was National Security Agency (NSA) contractor Edward Snowden, and the information concerned PRISM, an unprecedented program for the mass interception of domestic communications data, all carried out by an agency that was supposedly prohibited from spying on Americans.

The word was out and the low-key agency behind PRISM would never be seen the same way again.

The NSA was created through a confidential presidential directive in 1952 to consolidate signals intelligence gathering and analysis. The agency was so carefully concealed that wags who knew the acronym insisted that it stood for “No Such Agency.”

The secrecy was vital. The Allies had won World War II less than a decade earlier largely on the strength of superior cryptanalysis, a fact the general public wouldn’t learn for decades. The decryption capabilities of the United States were considered as much of a vital strategic advantage as nuclear weapons, if not more.

Big Data brought the agency out of the shadows. PRISM was a response to a world where encryption had become ubiquitous and electronic communications had surged to the point that only collection and analysis on a massive scale could reveal foreign spies, terrorist plots, and other threats. According to Senator Dianne Feinstein, NSA intercepts were key in preventing more than 50 terrorist attacks.

Yet it was also an enormous threat to individual privacy, and that threat is what prompted Snowden’s revelations.

Today, the NSA walks a fine line between monitoring the nation’s enemies and preserving civil liberties. It employs some of the most talented data analysts in the world to strike that delicate balance. Only the best master’s-educated data scientists need apply.

Out of the Shadows and Into the Limelight: Statistical Spying Is Sexy at the NSA

The NSA is no longer skulking in the shadows. In fact, the agency touts its work in data science and makes no bones about applying technology to crack systems and to “…dominate the global computing and communications network.”

Today it’s known that the NSA has long recruited some of the finest minds in data science for top-secret programs.

The NSA is the largest employer in the state of Maryland. It is also the largest consumer of electricity in the state, burning through as much juice in 2007 as the entire population of Annapolis. Although the details may never be revealed, it’s a sure bet that all that power is being put to work on some enormous sets of data, all at the hands of some of the world’s brightest data scientists.

Using Data to Crack Codes

Traditionally, the agency has focused on breaking encryption codes. And code-breaking has always been a game of data collection and statistical analysis.

Among the NSA’s forerunners is Station HYPO, a secret U.S. Navy decryption center in Hawaii before and during World War II. HYPO used primitive punch card machines to analyze some 90,000 possible words and phrases from the Japanese Navy’s primary codebook to find patterns and flaws in the code. That work led to the turning point of the war in the Pacific: Midway.
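
As a rough illustration of the statistical idea at work, the sketch below (in Java, with made-up code groups, and no claim about HYPO’s actual methods) counts how often each code group appears across a set of intercepted messages; the most frequent groups are the natural starting points for guessing routine words such as place names and dates.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch only: a modern recreation of the statistical idea behind
// punch-card tabulation. Counting how often each code group appears in a body
// of intercepts exposes the frequency patterns cryptanalysts exploited.
public class CodeGroupFrequency {
    public static void main(String[] args) {
        // Hypothetical intercepts, each a sequence of five-digit code groups.
        List<String> intercepts = List.of(
            "48933 17640 48933 90215",
            "17640 48933 33421 48933");

        // Tally every code group across all intercepted messages.
        Map<String, Long> counts = intercepts.stream()
            .flatMap(msg -> Arrays.stream(msg.split("\\s+")))
            .collect(Collectors.groupingBy(g -> g, Collectors.counting()));

        // Print code groups from most to least frequent; the most common ones
        // are the first candidates for routine, guessable vocabulary.
        counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .forEach(e -> System.out.println(e.getKey() + " : " + e.getValue()));
    }
}
```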

During the Cold War, NSA satellites monitored Soviet communications over the air. Various technical programs, including undersea cable taps, brought in more intercepts, most of which were encoded to various degrees. The greater the number of intercepts, the more reliably statistical analysis could detect flaws in the codes. Quietly, in their cloistered buildings at Fort Meade, NSA cryptanalysts worked at the cutting edge of mathematical and computing theory to break the codes and pass the message contents along to other intelligence agencies.

The NSA remains at the forefront of modern cryptanalysis. In May 2013, the agency broke ground on a new High Performance Computing center, an addition to what is already widely supposed to be the largest collection of supercomputing equipment in the world.

The NSA Uses Big Data for More Than Codebreaking

But while the agency is known for its sexy big-iron decryption work, the upsurge in cryptographic security across the modern Internet has forced a new approach, one that relies on big data. Indeed, two years before construction began on the new High Performance Computing center, another facility was already going up in Utah: the Utah Data Center. As described in a November 2015 article in The Atlantic, it is presumed to be the destination for enormous volumes of intercepted voice and data traffic.

Although the agency isn’t talking, it’s popularly assumed that it has the ability to break the Secure Sockets Layer (SSL) encryption that is now commonly used to protect everything from Web traffic to e-mail. But doing so is computationally intensive, and decrypting the flood of more than one exabyte of Internet data per day for subsequent analysis is far beyond the capacity of any human agency.

Instead, the NSA has been forced to fall back on an older strategy, one that is made more successful than ever by modern data science: signals analysis.

When the contents of messages cannot be accessed, it is still possible to infer certain things from the timing, source, and destination of those messages. During the First World War, simple Radio Direction Finding (RDF) sets allowed the British Admiralty to deduce German submarine locations and patrol areas just by noting when and where the boats transmitted, without breaking any of the message codes themselves.

Today, these patterns can be assessed by computer, and their accuracy increases with the sheer volume of messages. Every packet that crosses the Internet, including, increasingly, voice traffic, inevitably contains information about its source and destination.

According to a 2013 article in Ars Technica, the NSA began installing network taps at global telecommunications providers as early as 2006. Each tap was capable of generating 1.5 gigabits of intercept data per second. For the most part, the information is not analyzed directly. Instead, it is mined for social network analysis: determining which people are talking to one another, at what times, and for how long.

Tracing these relationships back and forth through time leads to more contacts, and the identification of any single link in the chain draws more in-depth attention from cryptanalysts.
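
The sketch below illustrates that contact-chaining idea with a few invented metadata records; it says nothing about the NSA’s actual systems. It builds an undirected contact graph from (source, destination) pairs and walks outward a fixed number of hops from a seed identifier, never looking at message contents.

```java
import java.util.*;

// Minimal sketch of contact chaining over communication metadata. Record
// contents are never inspected; only who contacted whom. All identifiers
// here are invented for illustration.
public class ContactChain {
    public static void main(String[] args) {
        // Hypothetical call-detail records: {source, destination}.
        String[][] records = {
            {"A", "B"}, {"B", "C"}, {"C", "D"}, {"B", "E"}, {"X", "Y"}
        };

        // Build an undirected adjacency list from the metadata.
        Map<String, Set<String>> graph = new HashMap<>();
        for (String[] r : records) {
            graph.computeIfAbsent(r[0], k -> new HashSet<>()).add(r[1]);
            graph.computeIfAbsent(r[1], k -> new HashSet<>()).add(r[0]);
        }

        // Breadth-first search outward from a seed identifier, up to two hops,
        // collecting everyone in the seed's extended contact network.
        String seed = "A";
        int maxHops = 2;
        Set<String> seen = new HashSet<>(Set.of(seed));
        List<String> frontier = List.of(seed);
        for (int hop = 0; hop < maxHops; hop++) {
            List<String> next = new ArrayList<>();
            for (String node : frontier) {
                for (String neighbor : graph.getOrDefault(node, Set.of())) {
                    if (seen.add(neighbor)) next.add(neighbor);
                }
            }
            frontier = next;
        }
        seen.remove(seed);
        System.out.println("Contacts within " + maxHops + " hops of " + seed + ": " + seen);
    }
}
```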

Forging New Alliances to Face New Threats

The NSA wasn’t alone in facing the problem of how to store and analyze all that data. At the same time, companies like Amazon and Google were dealing with civilian versions of the same problem.

Reaching across the aisle, the NSA adopted the design behind Google’s BigTable along with the open-source Hadoop framework for distributed storage and processing of massive data sets. After tweaking it to its own needs, the agency gave the result back: an open-source Apache Software Foundation project called Accumulo.
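
Accumulo’s best-known addition to the BigTable design is cell-level security: each key carries a visibility expression that a reader’s authorizations must satisfy before the cell is returned. The sketch below uses the Accumulo 1.x Java client API to write a single labeled cell; the instance name, ZooKeeper host, table, credentials, and visibility labels are all placeholders, and connection details vary with deployment and client version.

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

// Sketch only: connection parameters, table name, and labels are placeholders,
// and the target table ("contacts") is assumed to already exist.
public class AccumuloSketch {
    public static void main(String[] args) throws Exception {
        Instance instance = new ZooKeeperInstance("exampleInstance", "zk1.example.com:2181");
        Connector connector = instance.getConnector("analyst", new PasswordToken("secret"));

        // One row per identifier; each contact is a cell guarded by a
        // visibility expression that a reader's authorizations must satisfy.
        Mutation mutation = new Mutation(new Text("subject-123"));
        mutation.put(new Text("contact"), new Text("subject-456"),
                new ColumnVisibility("ANALYST&COUNTERTERRORISM"),
                new Value("2013-05-01T12:00Z".getBytes()));

        BatchWriter writer = connector.createBatchWriter("contacts", new BatchWriterConfig());
        writer.addMutation(mutation);
        writer.close();
    }
}
```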

The NSA works with big companies like Google and Yahoo on more practical aspects of its mission too. PRISM and other data collection programs work via direct taps placed into the systems of large email and telecommunications providers, usually by court order. The agency also routinely works with cellular phone companies, harvesting the bounty of information coming in from their networks. Former CIA Chief Technology Officer Ira Hunt put it in perspective in a March 2013 article in Business Insider: “You’re already a walking sensor platform,” he said, referring to the amount of information generated by the average person just walking down the street today with a cell phone in their pocket.

And with or without the consent of those providers, the agency also piggybacks on a number of private data collection systems, such as the cookies placed on computers by online advertisers. NSA programmers are also hard at work developing strains of malware that can infect target PCs and report information directly back to the agency. There are suggestions that NSA scientists were part of the team behind the famous Stuxnet worm that sabotaged and destroyed a significant portion of the centrifuges in Iran’s uranium enrichment program.

All of these efforts suggest an even greater influx of data and a need for even more data scientists to figure out how to analyze it. And, with media attention on the once-secretive agency continuing to flare, those same scientists will face enormous challenges in collecting and sifting that data in ways that continue to provide the government and military with vital information about our enemies while protecting the basic rights of American citizens.
