Aren’t We Data Science?

1 July 2013

Davidian

Last month, I shared this column with President-elect Nat Schenker and Past President Bob Rodriguez to announce an ASA strategic initiative to promote engagement of statisticians in Big Data. I’m following that announcement with an account of some of my recent experiences regarding data science, which inspire my enthusiasm for this effort. One in particular serves as a metaphor for the disconnect between statistics and data science we noted last month.

Around the time we were finalizing that column, Michelle Dunn, chair of the ASA Committee on Funded Research, forwarded an email to me. Michelle thought I would be interested in learning from the press release in the email that Eric Green would be speaking in Chapel Hill, North Carolina, 25 minutes from my office in Raleigh, on April 23. In January, the director of the National Institutes of Health (NIH), Francis Collins, announced the creation of a new NIH-wide position, the Associate Director for Data Science (ADDS), to “capitalize on the exponential growth of biomedical research data”. Collins named Green, current director of the National Human Genome Research Institute, as acting ADDS. Green is also co-chair of the search committee charged with nominating the permanent ADDS.

Indeed, I was very interested! But what was even more interesting was the organization that had invited Green to speak. The press release announced “a new collaboration called the National Consortium for Data Science (NCDS) (aiming) to make North Carolina a national hub for data-intensive business and data science research.” It went on to note that the NCDS had been launched at the Renaissance Computing Institute at The University of North Carolina at Chapel Hill (UNC-CH) and included among its founding members businesses, government organizations, and major research universities.

Rachel Schutt (photo taken by Nina Krstic)

I highlight that last group because, upon locating the NCDS website, I was astonished to review the list of founding members and see that not only is my university (North Carolina State) a founding member, but so are Duke University and UNC-CH, along with SAS Institute, Research Triangle Institute International, NIH’s National Institute of Environmental Health Sciences, IBM, and several other institutions, businesses, and government agencies that employ numerous statisticians. The member representatives listed on the website from NC State, Duke, and UNC-CH are computer scientists/engineers, and among all 17 representatives, there is not one statistician.

Until I saw that email, I had no idea that the NCDS even existed. A quick check with my department head, others in my department, and statistician friends at the other institutions listed (including Bob at SAS) revealed that none of them did, either. I later learned that, of the 80 or so individuals participating in the invitation-only NCDS Leadership Summit on “Data to Discovery: Genomes to Health” for which Green was the keynote public speaker, only two were affiliated with an entity with the word “statistics” in its name (and are known to me to be trained as statisticians).

I tell this story not to take issue with the formation of the NCDS, but because it is reminiscent of stories and comments I have heard from many of you.

As we discussed in June, the field of data science has commanded considerable attention in the media and among business and science leaders. It is described as a blend of computer science, mathematics, data visualization, machine learning, distributed data management—and statistics. A New York Times article in April reported that centers and institutes devoted to data science and Big Data are being created and curricula and certificate and degree programs are being developed at a number of universities.

Rachel Schutt’s Introduction to Data Science class in the Columbia University Statistics Department, where she is adjunct assistant professor. She is also senior research scientist at Johnson Research Labs.

Many of you have expressed concern that these and other data-oriented initiatives have been or are being conceived on your campuses without involvement of or input from the department of statistics or similar unit. I’ve been told of university administrators who have stated their perceptions that statistics is relevant only to “small data” and “traditional” “tools” for their analysis, while data science is focused on Big Data, Big Questions, and innovative new methods. I’ve also heard about presentations on data science efforts by campus and agency leaders in which the word “statistics” was not mentioned. On the flip side, I have heard from statistics faculty frustrated at the failure of their departments to engage proactively in such efforts.

In fact, some of you have asked directly the question that comprises the title of this column.

I decided to contact a statistician who is at the forefront of data science to get her thoughts about the challenges (and opportunities) these developments pose for our discipline and how we might confront them. Rachel Schutt, who is featured in the Times article cited above, earned her PhD from the department of statistics at Columbia University, where she is an adjunct faculty member. Upon graduation, Rachel took a position at Google, where she became acquainted with the scope, practice, and jargon of data science before moving to her current position at Johnson Research Labs. In fall 2012, she taught “Introduction to Data Science” for the Columbia statistics department and is co-author of a book, Doing Data Science, summarizing the course. I encourage you to visit the course website and read Rachel’s blog about the evolving course activities.

Rachel generously spent well over an hour sharing her perspectives with me; I summarize our discussion of only a few key topics here.

Data science is here to stay, Rachel says. There may be a lot of “hype,” but that might not be bad if it attracts talented people to work on data-driven problems. And to statistics. Statistics has enormous potential to contribute to data science. There are open research problems requiring that classical statistical methods in sampling, design, and causal inference be “scaled up” to be feasible with massive data sets. Few of the computer scientists and others who dominate the data science landscape are well-versed in these concepts, and many take an “algorithmic” view of data analysis. Data science needs statistical thinking and new foundational frameworks—for example, what is the “population” when one confronts the Big Data generated by Google?

In fact, many businesses are beginning to collect data prospectively for internal testing and validation, and there is little appreciation for the power of design principles. Statisticians could propel major advances through development of “experimental design for the 21st century”!

What skills does a statistician need to engage in data science activities, and how should we be preparing statistics students? In addition to a strong foundation in statistical theory, methods, and software, statistics students should develop deep proficiency in programming, Rachel says. Coding skills—in R and in Python including the use of Python as a scripting language—should be part of any modern statistics curriculum. And statisticians must appreciate issues and tools associated with parallel computing, combining data from disparate sources, and handling textual and streaming data.

Familiarity with data visualization techniques and popular tools like D3.js would be ideal and could enliven curricula and projects. Exploratory data analysis, which is generally not taught formally in many statistics programs, should be emphasized. Training in machine learning methods also is key. Not to mention communication skills.

Rachel stressed the importance of exposure to “real world” problems—the disconnect between curriculum and the “messiness” of the real world is greater than it has ever been. She advocates engaging local businesses and research organizations to present case studies to students, as she did in her course. Not only will this acquaint students with what they might confront, but also such interactions can forge connections that can inspire needed statistical research.

What can we do as individuals, a profession, and an association to address the concerns noted above? Rachel’s thinking? Sponsor and attend events that bridge disciplinary boundaries and afford opportunities to interact with scientists with massive data problems, such as the University of California at Davis 2013 Statistical Sciences Symposium: Analysis of Complex and Massive Data. The ASA could make a big impact by sponsoring or collaborating in a conference on statistics and data science featuring top data scientists and statisticians as speakers.

Participate in data science Meetup groups. There are scores of these in San Francisco; Washington, DC; New York; Boston; and elsewhere. If there is not one near you, consider forming one. We statisticians should seek these out and attend and offer to speak, and we should encourage our students to do likewise. In fact, Rachel and several colleagues have started The NYC Data Skeptics Meetup, which focuses on all aspects of data from a “skeptical perspective” on the hype surrounding Big Data and data science.

Statisticians in academia interested in engaging in data science should seek sabbatical opportunities in industry, and departments should reach out to industry data scientists and invite them to present seminars, contribute to the curriculum, and serve as adjunct faculty. Departments can propose partnerships with computer science, operations research, and other disciplinary units on campus to develop and team-teach courses and to sponsor joint seminars and working groups. Such interactions will reveal areas in which statistical research is needed.

Rachel noted in closing that she fears academic departments of statistics could be viewed as obsolete and be phased out over the next decade if we do not evolve to embrace this challenge—data science is not going away. She suggests we ask ourselves, “How would you feel if there were no departments of statistics 50 years from now?” It is essential that we confront this head-on; otherwise, the many philosophical issues data science presents demanding deep statistical thinking will not be addressed.

I am grateful to Rachel for sharing her candid views with me. She has convinced me that the ASA Big Data initiative is an essential step toward addressing some of these challenges at the association level, laying the groundwork for curriculum enhancements, significant engagement with stakeholders, and professional development. We aren’t data science, but we have a critical role to play. I encourage you to consider steps you can take locally to raise awareness of the importance of statistics in data science.

I don’t think statisticians are alone in the challenges of understanding how their roles as individuals and as a profession fit into the “Data Scientist” realm. The one point in the article that quoted Rachel Schutt about statisticians needing to learn ‘R’ is one of the disconnects. As Bob Howe, from the University of Washington, points out, statisticians need to learn to deal with data that does not fit into memory. I don’t want to miss the main focus by pointing out that tools such as R deal with in-memory data sets. In the Data Science space we are dealing with data that doesn’t fit in memory; it doesn’t fit on one machine, and it may not even fit on 100 machines. This is where statisticians need to make the leap, because all of their other skills are so important to Data Science. I myself am on the software side, and I need to learn more about statistics in order to compete and stay relevant in the Big Data (Data Science) space. Many traditional roles that deal with data in some aspect have something to add to their skill set in order to make data science work for them and to advance their profession. It will be interesting to see, with the shortage of skilled “Data Scientists,” how companies react both long and short term. Do they hire multiple people (data scientists, software engineers, analysts, and others) to cover the shortfall around this “new” science, or do they continue to hold out for people who make the leap by merging their profession and skill sets with what have traditionally been other areas of expertise? Good article. I understand the challenge and, to some extent, a little frustration, but it’s all common with shifts in paradigm of this size.

# 27 July 2013 at 1:09 pm

Shelly said:

Thanks for a wonderful article.

Data Science is a term much used but not understood well. Can somebody explain the data with which a Data Scientist starts off? Is it a direct output from MapReduce? Is it then put on a different platform for analysis, and what kind of platform?

Would greatly appreciate it if somebody could throw light on it.
Thanks

# 30 July 2013 at 6:52 am

Randy Bartlett said:

RE: We aren’t data science, but we have a critical role to play.
RESP: Most of your article was fine; this did not digest. Out here in the field (read ‘wild west’), we are data scientists. By our definition, all data analysis involves underlying statistical assumptions. We complete our applied education; many of us already know R and are masters of software. We have to be; if we are not data scientists/business quants, then we have no future.

Computer science and IT are trying to annex machine learning, predictive modeling, and data mining. The last thing we in the field want is for ASA to hand them over. Meanwhile, ASA has granted us two essential items we requested 30 years ago, certification and CSP. Let’s see if there is still time to save the profession.

The TOUGH question for those of you who publish papers with no data and are so removed from what is going on out here is: ‘Are you still statisticians?’

# 31 July 2013 at 1:37 pm

Abhijit said:

As an active participant, organizer and member of data science meetups in the DC area, I can say that the participation of the academic community in our meetups is very limited, unless we have an academic speaker. This is possibly not true of other similar meetups around the country. We have tried to engage our academic brethren, but to limited success.

I myself am a trained statistician, but many of our members are not. They are coming from all domains, all backgrounds. They don’t necessarily have the theoretical foundation someone similar to me might have, but they have experience, insight and curiosity, which today counts for a bit more, IMO.

I agree with Randy that core areas like machine learning and predictive modeling are being annexed, but how many stats and biostats programs even teach them? We have limited ourselves as stat departments to the classical, not the modern. These areas need theoretical and foundational development, and they raise questions about how the Big Data world relates to our ideas about statistical inference and modeling.

The biggest needs that statistics graduates will have to succeed in this environment are familiarity with large data and cloud computing, programming skills, a solid toolbox of statistical methods, an openness to develop and understand new methods (sounds like research 🙂 ), and most importantly, a desire and interest to translate the fruits of the analysis into information that is usable and actionable. Our work cannot stop at the analyst’s desk, it has to continue to the consultant’s chair and make the meaning of our analyses intelligible.

There is plenty of talent out there with the computing skills to do a lot of the work that is in demand, and more talent is needed, no doubt. The single biggest issue out there, IMO, is that lots of people have data but have no idea what to do with that data. They don’t even have the questions. The exploration of big data to understand the possibilities of information and actionable intelligence it contains is the big story for the next 5 years. Can statistics, as a field, find a way to exploit this need and make it ours? If we don’t, we know others are chomping at the bit to take over.

@Joshua: Data science is not just about analyzing data; it’s also about implementing algorithms that process data automatically, to provide automated predictions and actions such as
– automated bidding systems
– estimating in real time the value of all houses in the US
– high frequency trading
– matching an Ad with a user and a web page to maximize odds of conversions
– book and friend recommendations (Amazon, Facebook)
– analyzing NASA pictures to detect new planets or asteroids
– weather forecasts
– computation chemistry to simulate new molecules for cancer treatment
– tax fraud detection, terrorism detection

All this involves both statistical science and terabytes of data.

# 27 September 2013 at 10:16 am

Welcome!

Amstat News is the monthly membership magazine of the American Statistical Association, bringing you news and notices of the ASA, its chapters, its sections, and its members. Other departments in the magazine include announcements and news of upcoming meetings, continuing education courses, and statistics awards.
