Data Science Ethics

When Rachel and I started this class, we realized that there was a wide range of topics that we wanted to cover. Ethics was one of the topics that kept popping up. However, for a variety of reasons, we didn’t get to cover the topic in class. So I thought I’d write about ethics here.

A lapse in ethics can be a conscious choice…

When we think of right and wrong, we often think of conscious choices. There are a ton of ways to purposefully commit fraud with data. Fabricating data, manipulating experiments, and removing outliers are just a few. A 2005 study of questionable behavior in the scientific community shows that 15% of scientists report changing their experimental methodology in response to funding pressures. Another, more recent, survey of medical scientists reports that six of the twelve scientists surveyed had witnessed some sort of scientific fraud.

Let’s look at specifics. Take the case of Diederik Stapel, a prominent Dutch social psychologist. His studies were newsworthy, with digestible, Malcolm-Gladwell-ish results. The types of results that come up on first dates and at dinner parties. For example, he’s the guy behind the result that meat eaters are more selfish than vegetarians.

The only problem is that these results are bogus. In Stapel’s case the fraud was so bad and long-standing that the authorities are pursuing criminal charges. However, most fraud cases don’t get this far, and the damage to the scientific community and society at large can be staggering.

… but it can also be negligence

Of course, not all ethics violations are malicious. Major retailers and lending institutions have a lot of our data. In aggregate, it helps them better understand our behavior and target products to us. In the best case, this data can be used to fight fraud and provide useful products. In the worst case, it can lead to serious privacy leaks.

It’s easy to see how this could happen. Say your boss comes in and asks for a pregnancy classifier. This will allow your store to better target expectant mothers. You say, “Sure, boss, that doesn’t seem that hard.” You mine the database, create a working model, and send out flyers. Profits are up. Everyone is happy.

However, you forgot one little thing. Some pregnant women may want to hide their pregnancy, especially from people they live with. Sending this flyer could expose them. This is exactly what happened to Target, which inadvertently revealed a teen’s pregnancy to her father.

I’m not just picking on Target. Data leaks are not uncommon, even at high-tech institutions that should know better. Google leaked personal contact information when it launched its Buzz social networking system. This leak ultimately cost the company 8.5 million dollars. Facebook has mistakenly leaked its ‘shadow profiles’ for 6.5 million users. These profiles contain information that Facebook inferred about its users, information that the users themselves did not provide.

Creating standards of conduct

These lapses put ethical considerations front and center as data science rises in popularity. Some groups of data scientists have created codes of conduct. For example, the Data Science Association has a code of conduct that covers a number of things, from definitions of common terms to the ethical responsibilities of data scientists. In an interview with InformationWeek, one member of the association expresses his frustration with the current state of data science.

“Things were really getting out of control in terms of the definition of ‘data science,’” said Walker in a phone interview with InformationWeek. “A lot of people who really weren’t data scientists started calling themselves data scientists. And I saw a lot of data science malpractice in the companies, or clients, that we work with.”

Of course, this isn’t the only group that is trying to hammer out a code of conduct for data scientists. There are literally entire books on this topic and numerous articles discussing potential ethics concerns and the values that data scientists should have. For example, Rachel, when creating this class last year, created a list of qualities that next-gen data scientists should have. These codes of conduct directly address the frustrations felt by those in the community who have deep expertise. Pretty much everyone agrees that data scientists should not oversell their qualifications, and that they should avoid letting external motivations influence their analyses.

Codes of conduct are important for maintaining quality and integrity. Of course, they are still evolving, and they often lag behind state-of-the-art technology. And enforcement can often be difficult. However, it is important to think about the implications of the information you are gathering and the models that you are creating. They have a real effect on the world.

Doing Data Science

Introduction to Data Science is a class at Columbia University in the Department of Statistics. The course was designed and taught by Dr. Rachel Schutt in the Fall of 2012. The course was team-taught in the Fall of 2013 by Dr. Schutt and Dr. Kayur Patel.