I have had many discussions with my peers about how we perceive 'data science' and how it compares to what we do in biostatistics. I would like to share my perspective on the issue: there are many data sciences, and (what I think we call) data science and biostatistics are two examples.

In biostatistics, we develop and apply methods that help us learn what happens when people are exposed to something (epidemiology), when we put people in an MRI machine and ask them to tap their fingers or think unhappy thoughts (neuroimaging), when we apply a treatment (clinical trials, biomedical experiments, any kind of experiment…), among many other situations. This list comes from the types of open problems that I have seen at medical and public health schools.

But really, these methods that crunch data into knowledge can be (and are) applied in any field of knowledge. Think econometrics, chemometrics, biometrics*, psychometrics; these are all fields that measure something and apply statistics.

Now, statisticians develop methods that can be used in any quantitative field, but so can economists who have trained more deeply in statistics (some people call them closet statisticians; I bet they prefer 'econometricians'). Moreover, statistics quantifies the uncertainty in our answers, which is why it is used in so many fields. It also measures how good a method is (think about properties of an estimator such as its consistency, its variance, and its sampling distribution). What sets biostatistics apart is that it develops methods for biomedical research.
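To make the estimator properties mentioned above a bit more concrete, here is a minimal simulation sketch. It is only an illustration under assumptions of my choosing: the estimator is the sample mean, the data are draws from an exponential distribution with true mean 1, and the function name `sample_means` and the sample sizes are hypothetical.

```python
import random
import statistics


def sample_means(n, reps=500, seed=0):
    """Draw `reps` samples of size `n` and return the sample mean of each.

    The data-generating process (exponential with rate 1, so true mean 1)
    is an assumption made purely for illustration.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]


# As n grows, the sample means should concentrate around the true mean
# (consistency) and their spread should shrink (smaller variance).
for n in (10, 100, 1000):
    means = sample_means(n)
    print(
        f"n={n:5d}  average of estimates={statistics.fmean(means):.3f}  "
        f"variance of estimates={statistics.variance(means):.5f}"
    )
```

The list returned by `sample_means` is an approximation of the estimator's sampling distribution, so plotting a histogram of it for each `n` is one way to look at all three properties at once.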

On the other hand, data science comes forward with new methods (or old ones) to apply to data that has only recently become available, like internet traffic data or data derived from small online businesses. Nevertheless, I think the heart of what it does is very close to applied statistics: both want to use data to answer questions about the world. The difference is just like the difference between econometrics and chemometrics. Each field has different goals. I don't know much about economics, but I don't think economists focus on determining precisely the components (and their quantities) that make up a solution, which is one of the goals of chemometrics. Each field has its own methods, but fields can also share plenty of them. The instrumental variable framework developed in economics can really help in neuroimaging.

Each field has its own way of doing the actual experiment, which involves collecting data and processing it. So, just as one titrates samples to measure how much chlorine there is in a solution, and just as one has to interview and measure people for a cohort study, there is a way to gather information from web services. Knowing how to find out how many people see this post and other posts by my friends, or how many Groupons for grooming are available in Baltimore, or how to actually code an A/B test, are golden skills to have in the new field of 'data science' (check out the Coursera Data Science Specialization offered by Hopkins, and Jeff Leek's comment on it).
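For readers wondering what "coding an A/B test" might look like, here is a minimal sketch, not a production recipe. It assumes the simplest setup: each visitor is randomized to variant A or B with equal probability, the outcome is a yes/no conversion, and the comparison is a two-sided two-proportion z-test. The function names, conversion rates, and visitor counts are all hypothetical.

```python
import math
import random


def simulate_experiment(n_visitors, rate_a, rate_b, seed=0):
    """Randomize simulated visitors to A or B and record conversions.

    Returns {"A": [conversions, visitors], "B": [conversions, visitors]}.
    The conversion rates are made-up inputs for illustration only.
    """
    rng = random.Random(seed)
    rates = {"A": rate_a, "B": rate_b}
    counts = {"A": [0, 0], "B": [0, 0]}
    for _ in range(n_visitors):
        variant = "A" if rng.random() < 0.5 else "B"  # the randomization step
        counts[variant][1] += 1
        if rng.random() < rates[variant]:
            counts[variant][0] += 1
    return counts


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


counts = simulate_experiment(20000, rate_a=0.10, rate_b=0.12)
diff, z, p_value = two_proportion_z_test(
    counts["A"][0], counts["A"][1], counts["B"][0], counts["B"][1]
)
print(f"difference={diff:.4f}  z={z:.2f}  p-value={p_value:.4g}")
```

In a real web service the randomization would happen when the page is served and the counts would come from logs rather than a simulation, but the statistical step at the end is the same kind of comparison one would make in a randomized trial.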

My point is: they are all data sciences. Among the *metrics, we could rename data science as 'webmetrics™'. Scratch that, because somebody else already registered it. Let's call it e-metrics, which could include all the data that is being collected by smartphones. Wait, 'e-metric™' was a trademark abandoned in 2002; can we use it? And then there is 'businessmetrics' (I am just making up words). With my zero knowledge of finance, I am tempted to just call it finance.

PS. I loved this post by Rachel Schutt on A/B testing and causal inference; it goes through data science, A/B tests, randomization, and a bit of causal inference.