The Study of Predictive Analytics in a Data Science Master's Program

Predictive analytics is a method of using statistical and data processing techniques on sets of historical information to generate predictions about future or unknown facts or events.

Surgical site infections (SSIs) are a common hazard in both elective and emergency surgeries. Cutting through the skin breaches the biggest barrier to infectious bacteria that the body possesses, and decontamination of surgical tools and surgical teams can never be perfect. The annual incidence of SSIs in the United States is thought to be around 1 percent, but the consequences can be deadly: 8,000 patients die from SSIs each year, and treatment costs for survivors total around $10 billion.

At the University of Iowa Hospital, though, your odds are a little better than average. After three years following a new risk reduction program for SSIs, infections had dropped by 74 percent.

How did they do it? Through predictive analytics.

It’s just one way that predictive analytics is changing the world, and one way that data scientists are delivering life-changing data in places where it can make all the difference.

Predictive Analytics Has Already Reshaped the World

The key to predictive analytics is finding and studying the explanatory variables behind the occurrences you hope to predict. These are measurable factors in the data that drive the underlying trend being identified. If those variables can be captured and measured before, or more easily than, the events they cause, then they can be used to estimate the likelihood of those events. In the case of the SSI occurrences in Iowa, certain vital sign measurements, a large amount of blood loss, and preexisting conditions such as diabetes or hypertension were found to explain the chances of post-operative SSI.

Great care has to be taken to identify variables that are genuinely explanatory, however. A variable can be highly correlated with an event purely by coincidence, or because both are driven by some third factor, without having any causal relationship to it.
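
To make the point concrete, here is a minimal sketch in Python with NumPy. The data are synthetic and the two series are hypothetical: they are generated independently and share nothing but an upward trend over time, yet their raw correlation comes out high. Removing the shared trend should leave a correlation close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
months = np.arange(120)

# Two hypothetical monthly series that share nothing but an upward time trend.
series_a = 50 + 0.8 * months + rng.normal(0, 5, size=months.size)
series_b = 200 + 1.5 * months + rng.normal(0, 10, size=months.size)

def detrend(series, t):
    """Remove a fitted straight-line trend, leaving only the residuals."""
    slope, intercept = np.polyfit(t, series, 1)
    return series - (slope * t + intercept)

# Correlation of the raw series: inflated because both rise over time.
raw_corr = np.corrcoef(series_a, series_b)[0, 1]

# Correlation of the residuals once the shared trend is removed.
resid_corr = np.corrcoef(detrend(series_a, months),
                         detrend(series_b, months))[0, 1]

print(f"raw correlation:       {raw_corr:.2f}")
print(f"detrended correlation: {resid_corr:.2f}")
```

Detrending is only one way to check for a hidden common cause; in practice, domain experts are usually needed to decide which relationships are plausible.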

In the Iowa case, data scientists had to work closely with clinicians who could review the data and make medically valid links between the predictive variables and the SSIs.

Data scientists working for the hospital built a data warehouse that combined medical records data with information collected during the surgeries themselves. By creating models to analyze vital signs, variables such as the use of blood transfusions during surgery, and patients’ preexisting conditions, they were able to identify key correlations that could predict the likelihood that certain patients would develop an SSI.

Using that information, doctors modified their post-operative treatments on those patients and were able to reduce the actual occurrence of those SSIs.
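
The article does not say which modeling technique the Iowa team used. As an illustration only, the sketch below fits a logistic regression, one common way to turn patient variables into a risk estimate, on synthetic data. The feature names, effect sizes, and the 0.2 risk cutoff are all assumptions made up for this example, not clinical values.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Synthetic, hypothetical patient features (not real clinical data).
blood_loss = rng.normal(0.0, 1.0, n)     # standardized blood loss
diabetes = rng.integers(0, 2, n)         # 1 if the patient has diabetes
hypertension = rng.integers(0, 2, n)     # 1 if the patient has hypertension
X = np.column_stack([np.ones(n), blood_loss, diabetes, hypertension])

# Assumed "true" effects, used only to simulate infection outcomes.
true_w = np.array([-3.0, 1.0, 1.2, 0.8])
p_true = 1.0 / (1.0 + np.exp(-X @ true_w))
y = (rng.random(n) < p_true).astype(float)

def fit_logistic(X, y, lr=0.1, steps=5000):
    """Fit logistic regression weights by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

w = fit_logistic(X, y)

# Predicted infection risk per patient, and a flag for closer follow-up.
risk = 1.0 / (1.0 + np.exp(-X @ w))
flagged = risk > 0.2  # hypothetical cutoff

print("fitted weights:", np.round(w, 2))
print("patients flagged as high risk:", int(flagged.sum()))
```

In a real setting the flagged list would be reviewed by clinicians, who decide whether and how to adjust treatment.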

Modern Financial Systems Thrive on Predictive Analytics

While diagnostic uses of predictive analytics score high on the feel-good scale, the type of predictive analytics that most people are likely to be familiar with revolves around credit cards.

Individual credit scores may be one of the earliest and most widely applied forms of predictive analytics. Invented in the 1950s, when the most prevalent way of getting credit was to walk into a bank and have a loan officer eyeball you and your bank documents, credit scores were a way for banks to reduce their risk and overhead by making predictions about who was likely to default based on five discrete variables.

Developed by engineer Bill Fair and mathematician Earl Isaac, this method became known as the Fair-Isaac score or, as it is abbreviated today, FICO (Fair, Isaac, and Company). The method was also intended to keep racial or other biases from entering into loan decisions.

Card companies also use predictive analytics in near real time to identify likely fraud cases. By building predictive models of the typical behavior of individual card members, the companies can estimate with a high degree of accuracy what sort of charges each member is likely to make, and flag transactions that fall outside that model.
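
Production fraud systems are far more elaborate, but the core idea of flagging charges that fall outside a member's usual pattern can be sketched in a few lines. The example below is a simplified stand-in, not any card company's actual method: it models one hypothetical cardholder's behavior as the mean and standard deviation of past charge amounts and flags new charges more than three standard deviations away. The function name, amounts, and threshold are all illustrative assumptions.

```python
import numpy as np

def flag_unusual(history, new_charges, threshold=3.0):
    """Flag charges whose z-score against past spending exceeds the threshold."""
    history = np.asarray(history, dtype=float)
    mean = history.mean()
    std = history.std(ddof=1)  # sample standard deviation
    z_scores = np.abs((np.asarray(new_charges, dtype=float) - mean) / std)
    return z_scores > threshold

# Hypothetical past charges for one cardholder, in dollars.
history = [23.5, 41.0, 18.2, 36.9, 29.4, 44.1, 31.0, 27.8]
new_charges = [35.0, 950.0, 22.0]

flags = flag_unusual(history, new_charges)
for amount, flagged in zip(new_charges, flags):
    print(f"${amount:>7.2f}  {'FLAG' if flagged else 'ok'}")
```

Here only the $950 charge is flagged. A real model would also weigh merchant type, location, and timing rather than amount alone.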

Of course, if companies have the data to make predictions about what purchases consumers are not likely to make, it also follows that they can make predictions about what purchases those consumers probably will make. This is the flip side of predictive analytics in sales, which categorizes consumers according to likely purchases and then focuses marketing efforts on those targets.

How Data Scientists Learn To Use Predictive Analytics To See The Future

Data science master’s programs don’t skimp on teaching predictive analytics techniques to future data scientists. Much of the value of the field, after all, is in its ability to predict events or behavior before they happen and allow the consumers of those predictions to make adjustments accordingly.

There are three types of models used in predictive analytics:

  • Predictive models – This type of model analyzes a sample of data with known outcomes to learn how the predictive variables relate to those outcomes, then uses that relationship to predict how similar cases will turn out.
  • Descriptive models – The descriptive model attempts to group sample data according to similarities between variables; for example, customer segments based on retail sales purchases.
  • Decision models – Decision models are used in forecasting and optimization and incorporate all known data together with extrapolated data based on various decisions that might be applied. They support simulation of possible decision branches that can be applied to a single event or to optimize repeated decisions required in a business process.
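
To give a feel for the last two model types, the following sketch pairs a tiny descriptive model with a tiny decision model. It is a toy under stated assumptions: the spending figures, response rates, margin, and mailing cost are invented, and the two helper functions are hypothetical names written for this example. The descriptive step groups customers into two segments by annual spend using a simple k-means with k = 2; the decision step then compares the expected profit of contacting each segment against doing nothing.

```python
import numpy as np

def two_segment_kmeans(values, iters=20):
    """Descriptive model: split 1-D values into two groups with k-means."""
    values = np.asarray(values, dtype=float)
    centers = np.array([values.min(), values.max()])  # deterministic start
    for _ in range(iters):
        # Assign each value to its nearest center, then recompute the centers.
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        centers = np.array([values[labels == k].mean() for k in range(2)])
    return labels, centers

def should_contact(response_prob, margin_per_sale, cost_per_contact):
    """Decision model: contact only if the expected profit is positive."""
    return response_prob * margin_per_sale - cost_per_contact > 0

# Hypothetical annual spend per customer, in dollars.
annual_spend = [120, 150, 135, 980, 1100, 1040, 160, 990]
labels, centers = two_segment_kmeans(annual_spend)

# Assumed response rates for each segment (0 = low spend, 1 = high spend).
response_prob = {0: 0.02, 1: 0.10}
decisions = {
    segment: should_contact(p, margin_per_sale=40.0, cost_per_contact=1.5)
    for segment, p in response_prob.items()
}

print("segment labels: ", labels.tolist())
print("segment centers:", centers.tolist())
print("contact segment?", decisions)
```

With these made-up numbers, only the high-spend segment is worth contacting. Changing the assumed cost or response rates changes the decision, which is exactly the kind of what-if comparison decision models are meant to support.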

Choosing the right model for the scenario at hand is only the first step, however. Fitting the model to the variables and testing the results are where the real work happens, and data scientists have to dig deep into their toolbox to get valid predictions.
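
One routine way to test a fitted model is to hold back part of the data and check predictions against it. The sketch below, on synthetic data assumed for illustration, fits a straight line and a more flexible degree-5 polynomial to the same training sample and reports the error of each on both the training data and the held-out data. The more flexible fit will match the training data at least as well as the simpler one, which is why the held-out error is the number to watch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: a straight-line relationship plus noise (an assumption).
x = rng.uniform(-1.0, 1.0, size=60)
y = 2.0 * x + 0.5 + rng.normal(0.0, 0.3, size=x.size)

# Hold out the last third of the (already randomly ordered) sample.
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]

def mse(coeffs, xs, ys):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

results = {}
for degree in (1, 5):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = {
        "train": mse(coeffs, x_train, y_train),
        "test": mse(coeffs, x_test, y_test),
    }

for degree, errors in results.items():
    print(f"degree {degree}: train MSE {errors['train']:.3f}, "
          f"test MSE {errors['test']:.3f}")
```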

Tools and Techniques in Predictive Analytics

Predictive analytics relies heavily on machine learning and regression techniques. These are statistical analysis methods that are also commonly taught as core parts of data science master’s program curricula.

Unsurprisingly, this heavily mathematical work is often performed using the R programming language, which is designed for statistical analysis of data, or Python, a more general programming language that has been extended with libraries such as NumPy to provide additional support for statistical operations.
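
As a small taste of what that looks like in Python, here is a sketch of an ordinary least-squares regression with two predictors, using NumPy's `np.linalg.lstsq`. The data are simulated from assumed coefficients (1.0, 2.0, and -3.0) so that the fitted values can be compared with the ones used to generate them.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500

# Simulated predictors and outcome; the coefficients are assumptions.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(0.0, 0.5, size=n)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones(n), x1, x2])

# Solve the least-squares problem: find coef that minimizes ||X @ coef - y||.
coef, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)

print("intercept, x1, x2 coefficients:", np.round(coef, 2))
```

The fitted coefficients should land close to the values used in the simulation, with small differences caused by the added noise.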

A large number of predictive analytics software packages have also been released over the past few decades, and data scientists may use these instead of attempting to roll their own solutions. These include:

  • Apache Mahout
  • Statistica
  • MATLAB
  • SAP HANA
  • Oracle Advanced Analytics

Data scientists who work heavily in predictive analytics are also likely to use PMML, the Predictive Model Markup Language. PMML is an addition to the data science toolkit that uses an XML-based format to describe predictive models so that they can be easily inspected, adjusted, or shared between tools.
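
To show roughly what such a description looks like, the sketch below uses Python's standard `xml.etree.ElementTree` module to write a small linear regression model as PMML-style XML. The helper function and field names are hypothetical, the version and namespace are assumed to be those of PMML 4.4, and the output has not been validated against the official schema, so treat it as an approximation of the format rather than a conforming document.

```python
import xml.etree.ElementTree as ET

def linear_model_to_pmml(intercept, coefficients, target="y"):
    """Describe y = intercept + sum(coef * x) as a PMML-style XML string."""
    root = ET.Element("PMML", {
        "version": "4.4",
        "xmlns": "http://www.dmg.org/PMML-4_4",  # assumed PMML 4.4 namespace
    })
    ET.SubElement(root, "Header", {"description": "Illustrative linear model"})

    # Declare every input field and the target field.
    data_dict = ET.SubElement(root, "DataDictionary")
    for name in list(coefficients) + [target]:
        ET.SubElement(data_dict, "DataField", {
            "name": name, "optype": "continuous", "dataType": "double",
        })

    model = ET.SubElement(root, "RegressionModel",
                          {"functionName": "regression"})
    schema = ET.SubElement(model, "MiningSchema")
    for name in coefficients:
        ET.SubElement(schema, "MiningField", {"name": name})
    ET.SubElement(schema, "MiningField",
                  {"name": target, "usageType": "target"})

    # The regression table holds the intercept and one entry per predictor.
    table = ET.SubElement(model, "RegressionTable",
                          {"intercept": str(intercept)})
    for name, value in coefficients.items():
        ET.SubElement(table, "NumericPredictor",
                      {"name": name, "coefficient": str(value)})

    return ET.tostring(root, encoding="unicode")

# Hypothetical model: y = 1.0 + 2.0 * x1 - 3.0 * x2
pmml_text = linear_model_to_pmml(1.0, {"x1": 2.0, "x2": -3.0})
print(pmml_text)
```

Because the model is plain XML, a colleague can open it in a text editor, change a coefficient, and load it into another tool that reads PMML without needing the original code.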

Whether or not they hold a position dedicated to predictive analytics, though, most data scientists find themselves using data to make predictions as part of their job at some point.