Numbers are everywhere. We’re constantly generating and searching through enormous amounts of data, yet we rarely recognize just how much of it we use.
Statisticians, on the other hand, are keenly aware of these complexities in our daily lives. From recommendation systems to clinical trials, these experts can clearly see how our Netflix likes and health data turn into a huge matrix of information.
Yet explaining how an enormous dataset or a complicated statistical method affects real life can be a challenge. Naimin Jing, PhD ’21, and Michael Power, PhD ’20, an assistant professor of instruction in statistical science at the Fox School, both found unique ways to communicate their research findings and explain their practical uses for businesses.
A clearer picture
When Jing explains her research in one sentence, she admits it sounds scary. “My research focus is to develop robust statistical methods for estimating matrix parameters for large and complex data,” says Jing.
But instead of being intimidated, she says, imagine a grainy black and white photograph. It’s old, faded and blurred due to sunlight or water damage. Jing’s statistical method makes it possible to turn that picture into a clearer image.
“In computers, an image consists of pixels,” she explains. “Each pixel can be represented by a number, so an image can be represented by a matrix. That damaged image is a partially observed matrix, with the damaged part being blank. Matrix completion fills in the blanks.”
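The general idea can be sketched in a few lines of code. The example below is a minimal illustration under simplifying assumptions, not Jing’s method: it treats a grayscale image as a matrix with blank entries and fills them in by repeatedly snapping the matrix to a low-rank approximation while holding the observed pixels fixed.

```python
import numpy as np

def complete_matrix(M, observed, rank=2, iters=200):
    """Fill in the blank entries of M by repeated low-rank SVD
    projection. `observed` is a boolean mask marking the pixels we
    can still see; everything else is treated as damaged/blank."""
    X = np.where(observed, M, M[observed].mean())  # start blanks at the mean
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s[rank:] = 0                      # keep only the top `rank` singular values
        X = (U * s) @ Vt                  # best rank-`rank` approximation
        X[observed] = M[observed]         # snap known pixels back to their values
    return X

# A tiny "image": a smooth low-rank pattern with ~30% of pixels damaged.
rng = np.random.default_rng(0)
truth = np.outer(np.linspace(0, 1, 8), np.linspace(1, 2, 8))
mask = rng.random(truth.shape) > 0.3
recovered = complete_matrix(truth, mask)
print(np.abs(recovered - truth).max())   # small: the blanks were filled in
```

Jing’s work tackles the harder case in which the observed entries themselves may be wrong, not just missing.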
A major consideration of her research is the robustness, or strength, of the model. “Ideally, the data collected would contain no error and perfectly satisfy the assumptions,” Jing says. “However, the reality is that modern large datasets inevitably contain outliers, wrong records or data that doesn’t follow the model’s assumptions.”
For example, that sunspot on the photograph would create discoloration, or pixels lighter in color than reality. Jing’s research identifies and evaluates those errors to determine whether they’re critical to understanding the big picture. But instead of disregarding them, her model expects such errors and reduces their influence on the outcome.
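One textbook way to expect errors and reduce their influence, and only a stand-in for the estimators Jing actually develops, is a Huber-type weighting scheme: observations near the current estimate count fully, while distant ones are downweighted rather than thrown away. A minimal sketch for estimating a center:

```python
import numpy as np

def huber_mean(x, delta=1.0, iters=50):
    """Robust estimate of the center of `x` using Huber weights:
    points within `delta` of the current estimate get full weight;
    outliers get weight delta/|residual| instead of being dropped."""
    mu = np.median(x)                          # robust starting point
    for _ in range(iters):
        a = np.maximum(np.abs(x - mu), 1e-12)  # avoid divide-by-zero
        w = np.where(a <= delta, 1.0, delta / a)
        mu = np.sum(w * x) / np.sum(w)         # re-solve the weighted mean
    return mu

data = np.array([2.1, 1.9, 2.0, 2.2, 1.8, 50.0])  # one wild outlier
print(np.mean(data))     # 10.0: the ordinary mean is dragged far off
print(huber_mean(data))  # ~2.2: the outlier is downweighted, not deleted
```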
Jing, who joins the University of Pennsylvania this year as a postdoctoral fellow, used this picture analogy in a Three Minute Thesis (3MT) competition among Fox doctoral students during her final year in the PhD program. Her clear, concise way of describing her research, using language and examples that non-statisticians could understand, earned her a tie for first place in the “third year and above” category.
Jing’s research also has a practical impact for better predicting people’s preferences in recommender systems like Netflix or e-commerce sites like Amazon.
“For example, many people share Netflix profiles,” she says. But with millions of Netflix users, analysts can’t manually parse that much data to find the errors or outliers that come from rogue ratings. With her research, however, Jing can analyze the history of a user’s behavior to better understand which ratings are likely to be outliers.
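A toy version of that idea, hypothetical and far simpler than Jing’s model, is to score each new rating against the user’s own history and flag the ones that sit implausibly far from it:

```python
import numpy as np

def flag_rogue_ratings(history, new_ratings, z_cut=2.0):
    """Flag ratings that sit far outside a user's own history.
    Uses the median and MAD (median absolute deviation), which stay
    stable even if the history already contains a few bad ratings."""
    center = np.median(history)
    mad = np.median(np.abs(history - center)) or 1.0  # guard against MAD == 0
    z = np.abs(new_ratings - center) / (1.4826 * mad) # MAD rescaled to ~std dev
    return z > z_cut

history = np.array([4, 5, 3, 4, 5, 4, 3])   # a viewer who rates around 4
new = np.array([5, 4, 1])                   # the 1-star likely isn't them
print(flag_rogue_ratings(history, new))     # [False False  True]
```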
“The idea of my method is to make more accurate predictions, even when there are errors in the data,” she summarizes. For businesses, this translates into better recommendations for everything from TV shows and movies to restaurants on Yelp to suggested purchases on e-commerce websites, and thus more satisfied customers.
Efficient analysis
Businesses know that they can learn from the data their customers or users generate in order to improve their functionality, increase user-friendliness or fix glitches. “But with the amount of data generated online, dimension reduction becomes an essential first step in being able to visualize or use the data in any kind of meaningful fashion,” says Michael Power.
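What that first step can look like in practice: principal component analysis (PCA), one common dimension-reduction technique, compresses hundreds of correlated variables into a few components that preserve most of the variation. The sketch below is a generic illustration, not Power’s method:

```python
import numpy as np

def pca_reduce(X, k=2):
    """Project an (n_samples, n_features) matrix onto its top-k
    principal components, the directions of greatest variance."""
    Xc = X - X.mean(axis=0)                 # center each variable
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                    # coordinates in k dimensions

# 500 observations of 100 variables that secretly vary along 2 directions.
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 100)) + 0.05 * rng.normal(size=(500, 100))
Z = pca_reduce(X)
print(Z.shape)   # (500, 2): 100 columns reduced to 2, ready to plot or model
```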
Power, who joined the Fox School as an assistant professor of instruction after graduating with his PhD last year, uses his research to help lessen that analytical burden. He explains its practical uses with examples from the medical and pharmaceutical industries.
“For example, take medical or clinical trial data,” he explains. These trials will collect relevant health data like blood pressure, weight, height, age or smoking status. “However, you may also have some demographic information, like income bracket or length of a work commute.”
“In most real-world data sets, there are variables included that are unrelated to the response,” Power continues. Bayesian Model Averaging (BMA), on which his research is based, avoids overstating the importance of those irrelevant variables. His research combines BMA’s accuracy with methods that reduce the dimensions, or the number of variables, without sacrificing overall performance. Like Jing, Power was recognized for communicating his work clearly: he won the People’s Choice Award in the 2019 3MT competition at the Fox School.
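In rough outline, BMA fits many candidate models, weights each one by an approximation to its posterior probability, and averages their predictions instead of betting everything on a single model. The sketch below uses a standard BIC-based weight approximation and brute-force enumeration of variable subsets; it is a generic illustration under simplifying assumptions, not Power’s implementation:

```python
import numpy as np
from itertools import combinations

def fit_ols(X, y):
    """Least-squares fit; returns coefficients and the model's BIC."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    bic = n * np.log(resid @ resid / n) + k * np.log(n)
    return beta, bic

def bma_predict(X, y, X_new):
    """Average predictions over every subset of predictors, weighting
    each model by exp(-BIC/2), a standard posterior approximation."""
    p = X.shape[1]
    preds, bics = [], []
    for size in range(1, p + 1):
        for cols in combinations(range(p), size):
            idx = list(cols)
            beta, bic = fit_ols(X[:, idx], y)
            preds.append(X_new[:, idx] @ beta)
            bics.append(bic)
    w = np.exp(-(np.array(bics) - min(bics)) / 2)  # stabilized BIC weights
    w /= w.sum()
    return sum(wi * pi for wi, pi in zip(w, preds))

# Three candidate predictors; only the first two actually matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200)
print(bma_predict(X, y, X[:5]))  # tracks y[:5]; weight concentrates on good models
```

Enumerating every subset only works for a handful of variables (30 variables already mean over a billion candidate models), which is why reducing the dimensions first is what keeps the approach tractable.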
Power’s research is part of a data evolution that businesses can use to learn and improve. “With advancements in machine learning and artificial intelligence, computers can keep learning from data in order to self-update and make better decisions,” says Power. But those insights are only as good as the data that goes into them. His method of reducing data improves the efficiency of those algorithms.
“This saves on computation time and the end results will likely be much better.” From clinical trials to web traffic, a business that successfully collects and analyzes its existing data will be able to learn from its past and improve its performance for the future.