Preparing for a Data Science Career in Subatomic Physics with a Master’s Degree

A 12-foot-long silver tube tumbles out of an apparently empty sky. Fifty feet before the tube hits the ground, the sky explodes. Unimaginable forces rend the atmosphere, releasing a blast wave capable of pulverizing concrete, flame hotter than the sun, and radiation moving at the speed of light that bathes everything in view with deadly waves of ionization.

Overhead, the characteristic bloom of an ash-gray mushroom cloud emerges from the fireball, ascending slowly into a sky that now seems yellow and dim by comparison. A nuclear detonation has occurred.

But it’s all simulated, fortunately, a virtual dream speeding through the 142,272 cores of Los Alamos National Laboratory’s Cielo supercomputer at more than 1,000 teraflops. Every atom of uranium and plutonium involved in the complex sequence of events necessary to initiate a nuclear explosion has to be plotted and tracked, and the effects of their exponentially expanding collisions modeled in excruciating detail, to obtain an accurate picture of the detonation.

Wrangling this massive stack of information is only one of the many ways data science is being used in subatomic physics.

The sheer volume of particles at the subatomic level demands extraordinary levels of storage and data manipulation, at a scale that only data scientists can manage:

  • Tracking particle collisions in CERN’s Large Hadron Collider and deriving results from the more than 20 petabytes of data generated there each year
  • Calculating some of the fundamental processes of the Big Bang at Lawrence Livermore National Laboratory
  • Simulating nuclear explosions at Los Alamos National Laboratory
  • Probing the weirdness of quantum computing by studying interactions at the subatomic level at Google, IBM, and other companies

Some of the greatest mysteries of the universe are being unlocked today by master’s-prepared data scientists at these organizations and others.

Nuclear Explosions Without All the Mess: Simulation is the Future of Nuclear Testing

No one knew whether the first nuclear bomb would work before it was detonated. J. Robert Oppenheimer, head of the Manhattan Project laboratory that designed and built the weapon, bet another physicist on his team $10 that it wouldn’t. Edward Teller, another project physicist, thought the bomb might ignite the atmosphere itself in an unstoppable chain reaction, destroying the Earth.

In July of 1945, the only way to find out for sure was to give it a shot. They did, and the nuclear age dawned.

For the next 47 years, the only way to check the theoretical calculations of nuclear physicists was to actually build and detonate their creations to monitor the results. But atmospheric testing quickly became undesirable when its effects on the environment and humanity were better understood. Detonations moved underground, but concerns remained.

The Comprehensive Nuclear Test Ban Treaty of 1996 changed all that. Although it has yet to enter into force because several countries, including the United States, have not ratified it, the treaty has largely been honored by its signatories. The U.S. has not conducted a live nuclear detonation since 1992.

Despite this limitation, the United States recently embarked on a $1 trillion program to modernize its nuclear weapons. And the only way to do that without live detonations is through data science.

The B61 nuclear bomb is among the first in the arsenal due for revisions. A gravity-dropped “dial-a-yield” bomb (it can be set to different kiloton ranges depending on the target), the B61 is among the oldest weapons in the U.S. military inventory.

The simulations run through Cielo to model upgrades to the stock B61 are displayed in a virtual environment called the CAVE, or Cave Automatic Virtual Environment, allowing physicists and non-physicists alike to visualize the results of the testing. Wearing 3-D goggles, experimenters can step inside the stop-motion simulation to explore the data in every minute detail. These experiments allow today’s scientists to understand the reactions behind nuclear explosions better than the scientists who originally designed the weapons.

It’s not just the explosions themselves that are open to simulated tests today. Sandia National Laboratories also ran impact tests for the bomb casing to ensure that it would hold up long enough for the detonation sequence.

Plumbing the Composition and Origins of the Universe with Data Science

Different sorts of collisions are on the menu at CERN in Switzerland at the Large Hadron Collider (LHC). There, subatomic particles are accelerated through twin magnetic beam pipes to speeds near that of light before being smashed into each other to see what kind of interesting stuff happens when they break.

Protons smashing into one another at those speeds can generate energies in excess of 13 trillion electron-volts and produce around 600 million events per second, or about 25 gigabytes of data per second, to be stored and analyzed.

Of those 600 million discrete collision events per second, only about 100 are ultimately of interest to the physicists studying the tests. It’s up to data scientists to devise algorithms to filter the dataset down to those events without accidentally excluding anything of interest. They do this in a two-stage process: broad algorithms make a first pass that narrows the 600 million events per second down to 100,000 or so, and then more specialized algorithms get down to the 100 that the physicists want to look at in more detail.
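The two-stage idea described above can be illustrated with a toy sketch in Python. This is not CERN's actual trigger software: the Event fields, the thresholds, and the function names are all hypothetical, chosen only to show a cheap, broad cut feeding a narrower, more specialized one.

```python
import random
from dataclasses import dataclass


@dataclass
class Event:
    """A toy collision record; both fields are hypothetical stand-ins."""
    energy: float      # coarse quantity a fast first pass could read
    track_count: int   # finer detail a slower second pass would inspect


def first_pass(events, energy_cut):
    """Broad, cheap cut: keep every event at or above a loose energy threshold."""
    return [e for e in events if e.energy >= energy_cut]


def second_pass(events, min_tracks):
    """Narrower cut, applied only to the events that survived the first pass."""
    return [e for e in events if e.track_count >= min_tracks]


def two_stage_filter(events, energy_cut=0.99, min_tracks=20):
    """Run both passes and return (first-pass survivors, final selection)."""
    survivors = first_pass(events, energy_cut)
    selected = second_pass(survivors, min_tracks)
    return survivors, selected


# Generate made-up events; real detector data looks nothing like this.
rng = random.Random(42)
events = [Event(rng.random(), rng.randint(0, 20)) for _ in range(100_000)]

survivors, selected = two_stage_filter(events)
print(len(events), len(survivors), len(selected))
```

The design point is that the expensive second pass only ever sees the small fraction of events the first pass lets through, which is what makes filtering at these rates tractable. Loosening or tightening either threshold trades the risk of discarding an interesting event against the volume that has to be stored.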

Those events are providing physicists with information that is unveiling the building blocks of matter and the nature of reality itself.

Modeling the Big Bang by Looking Back in Time

As complicated as the LHC monitoring and calculations are, at least they involve direct observations of contemporary events. At Lawrence Livermore National Laboratory (LLNL) in Livermore, California, scientists looking into the Big Bang that formed our Universe have a slightly more complex problem: how do you find data to account for the unimaginable forces that spawned the whole of creation more than 13 billion years ago?

It turns out that the magic of light speed offers just that opportunity. Looking far enough away is the same thing as looking back in time. Astronomy, then, is a critical data source for subatomic physicists. And there is a lot of data up in the sky to be tracked and parsed.

According to a 2012 article in The Atlantic, advances in telescope technology roughly double the amount of data harvested from the sky each year. Improvements in digital imaging increase the number of megapixels to be stored and cataloged.

Some of that data finds its way into LLNL’s five-petaflop Vulcan supercomputer, where it was used recently to calculate some of the conditions of the Universe in the first second after the Big Bang. Scientists need to understand those conditions in order to evaluate some of the other experimental results being generated today, including those at the Large Hadron Collider.

Quantum Computing Uses and Bolsters Big Data

Quantum computing, or the use of what Einstein called “spooky action” between apparently unconnected subatomic particles for computational purposes, has been the next big thing in computing since 2000. In that year, the first primitive quantum computer, with 5 qubits (quantum-state bits) measured by nuclear magnetic resonance (NMR), debuted at the Technical University of Munich.

For data scientists, quantum computing represents both challenge and promise.

The challenge is building a data bus capable of interfacing with a novel type of computation. A qubit can exist in a superposition of 0 and 1, rather than in just one of the two states (on or off) held by a traditional bit. But a qubit may hold that state for only fractions of a second, creating an enormous read/write problem.
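To make the read problem a little more concrete, here is a minimal sketch using the standard textbook description of a single qubit as a pair of complex amplitudes. It simulates the arithmetic on an ordinary computer and says nothing about real hardware; the function names are illustrative assumptions, not part of any quantum library.

```python
import math
import random


def normalize(alpha, beta):
    """Scale the two amplitudes so their squared magnitudes sum to 1."""
    norm = math.sqrt(abs(alpha) ** 2 + abs(beta) ** 2)
    return alpha / norm, beta / norm


def measurement_probabilities(alpha, beta):
    """Probability of reading 0 and of reading 1 from a normalized state."""
    return abs(alpha) ** 2, abs(beta) ** 2


def measure(alpha, beta, rng):
    """Simulate one read: the result is a single classical bit, 0 or 1."""
    p0, _ = measurement_probabilities(alpha, beta)
    return 0 if rng.random() < p0 else 1


# An equal superposition of the 0 and 1 states.
alpha, beta = normalize(1 + 0j, 1 + 0j)
p0, p1 = measurement_probabilities(alpha, beta)
print(p0, p1)

# Each simulated read yields only one bit, even though the state
# itself is described by two continuous amplitudes.
rng = random.Random(0)
print([measure(alpha, beta, rng) for _ in range(10)])
```

In this picture a read returns one ordinary bit no matter how much information the amplitudes encode, which hints at why moving data in and out of a quantum processor is a hard interface problem.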

The promise is in the tremendous processing power embodied in the qubit. According to a 2016 article, researchers at MIT, USC, and the University of Waterloo have pioneered approaches to use quantum computing to accelerate the processing of massive data sets.

Data scientists have long understood that topological relationships within data sets can be of supreme importance to understanding the patterns represented in the information. Processing very large sets of data using algebraic topology is too computationally intensive to be efficient, however. A 300-point data set would require more conventional bits to calculate topologically than there are discrete particles in the universe.

The technique unveiled by MIT could run the same algorithm on the same data set using only 300 qubits. With quantum computing, the answer would be reached almost instantaneously.
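The scale of that comparison can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that are not stated in the text: that the conventional cost grows as 2^n for n data points (one unit per possible subset of points), and that the number of particles in the observable universe is on the order of 10^80, a commonly quoted rough estimate.

```python
import math

N_POINTS = 300

# Assumption: rough order-of-magnitude estimate, not a measured value.
PARTICLES_IN_UNIVERSE = 10 ** 80

# Assumption: conventional cost scales as 2^n, one unit per subset of points.
conventional_units = 2 ** N_POINTS

# Per the text, the quantum approach needs one qubit per data point.
qubits_needed = N_POINTS

exponent = math.log10(conventional_units)
print(f"2^{N_POINTS} is roughly 10^{exponent:.1f}")
print("Exceeds particle estimate:", conventional_units > PARTICLES_IN_UNIVERSE)
print("Qubits needed:", qubits_needed)
```

Under those assumptions the conventional requirement overshoots the particle estimate by many orders of magnitude, while the qubit count grows only linearly with the size of the data set.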
