SANTA CRUZ >> From purchasing habits to political and sexual preferences, what can be inferred from seemingly innocuous personal data can be shocking. Certainly that was the case for one Minnesota father who angrily confronted Target employees about baby-related coupons sent to his teenage daughter — only to find out that the company’s algorithms had correctly sleuthed out what he had failed to see.

The rise of what’s known as ‘big data’ poses a number of difficult questions about privacy, ownership, accuracy and bias. Armed with a new data science center and a $1.5 million grant, a cross-disciplinary team of UC Santa Cruz researchers is hoping to find some answers and create a framework for what they call “responsible data science.”

The researchers come from the disciplines of computer science, statistics and mathematics. By combining the most effective tools and techniques from each discipline — and forging new models where needed — the team hopes to build a toolkit that could have applications in social sciences, biology, and computer networking and security.

“We’re putting it all together specifically for this crazy, interconnected world that we’re in to understand how these models can all be brought to bear to understand what it means to be accurate, what it means to be biased, and what it means to preserve privacy,” said project lead Lise Getoor, a professor of computer science at UCSC.

Another goal of the three-year initiative is to build educational materials that can help introduce young people to the key concepts and concerns involved in data science, Getoor said.

“There’s a real need to educate the public and students to be aware of even the most basic versions of bias,” she said.

Data can be biased in a number of ways, according to Abel Rodriguez, a professor of applied mathematics and statistics at UCSC and the project’s co-lead.

And the stakes can be much higher. As the criminal justice system increasingly turns to algorithms to assess the likelihood that a convicted criminal will re-offend, an investigation by ProPublica uncovered a bias in one of the leading commercial tools, which incorrectly judged black defendants to be more likely to re-offend than white defendants.

In terms of personal privacy, a 2009 paper co-authored by Getoor found that sensitive information about social media users who cloak their personal information behind privacy settings can be easily uncovered using inferential models based on their friends and group memberships.

Another element of personal privacy the team aims to tackle is figuring out better ways to anonymize big data sets of, for example, tax information, while maintaining accuracy in the data.

The $1.5 million grant from the National Science Foundation was announced at the end of August, and will fund the team’s work to develop a theoretical foundation for data science over the next three years. The funding comes as part of a broader National Science Foundation initiative spanning 14 institutions in 11 states that aims to build a stable foundation for the burgeoning field of data science.

The next step for the initiative would be to partner with more institutions and build a data science institute on a college campus, either at UCSC or, more likely, within Silicon Valley, according to Rodriguez.

Earlier this summer, UCSC launched a data science research center under Getoor’s direction called Data, Discovery and Decisions — or D3 for short — to provide a platform for collaboration between researchers and private industry as part of the campus’s focus on the field of data science.