ASA – American Statistical Association and Big Data

The leaders of ASA - American Statistical Association discuss their view on Big Data, three reasons why the statistical community seems to be disconnected from the Big Data movement, and how they plan to fix it.

I was among several data scientists invited to attend a meeting with the president of ASA - American Statistical Association, as part of the ASA's outreach to the data science community. The gulf between the statistical and data science/data mining communities is much larger than it should be, since both communities work with data. For a good description of this gulf see, for example, Statistical Modeling: The Two Cultures, by famed statistician Leo Breiman, who was one of the few to work in both communities.

As a past chair and co-founder of SIGKDD, the leading professional association for Data Mining and Knowledge Discovery, I want to help improve understanding between our communities.

As a small step in this direction, I am reprinting this address from ASA in KDnuggets. Current SIGKDD Chair Bing Liu is in touch with ASA leadership and I will brief KDnuggets readers on any interesting developments.

Here is a position paper from the leaders of ASA - American Statistical Association (current President Marie Davidian, President-elect Nat Schenker, and Past President Bob Rodriguez) announcing a strategic initiative for the ASA.

The ASA and Big Data - Amstat News, June 2013.

By Nathaniel Schenker, Marie Davidian, and Robert Rodriguez.

As Bob discussed in his June 2012 column (bit.ly/16xpHNA), Big Data is a Big Topic. It is almost impossible to avoid the daily barrage of media accounts (on.wsj.com/ZwxBBD), conference announcements (www.asesite.org/conferences/bigdata/2013), and events such as the recent Big Data Week (bigdataweek.com) focused on Big Data. Last year, President Obama announced a major Big Data research and development initiative (www.whitehouse.gov/blog/2012/03/29/big-data-big-deal) and, last month, the White House hosted a Big Data workshop. The National Institutes of Health created the position of associate director for data science (www.nih.gov/news/health/jan2013/od-10a.htm), and a new book "Big Data: A Revolution That Will Transform How We Live, Work, and Think" (n.pr/14wRacs), which explores the explosion of digital information, has received extensive press coverage.

Big Data are data on a massive scale in terms of volume, intensity, and complexity, and their promise for transforming business, health care, scientific discovery, public policy, and a host of other areas has been proclaimed widely. But, despite the enormous potential for contributions by statisticians, our profession and the ASA have not been very involved in Big Data activities. We are often missing from Big Data discussions in the media.

There are three reasons for this disconnect. First, the media and public lack a general understanding of what statisticians contribute to society (the issue that motivated the International Year of Statistics, www.statistics2013.org). Second, few statisticians are engaged in Big Data projects or have the special skills necessary to handle Big Data challenges.

Third, the statistical community is disconnected from the new (and vaguely defined) community of data scientists, who are completely identified with Big Data in the eyes of the media and policymakers. Data science (nyti.ms/ZjRFob) is frequently described as an amalgam of computer science, mathematics, data visualization, machine learning, distributed data management, and statistics. Data scientists must be innovative modelers and programmers; they also must be exceptional communicators who have a deep understanding of the problem domain and can formulate key questions, uncover novel insights, and use this information to guide high-impact decisionmaking. Other disciplines have been quick to identify themselves with data science and are routinely featured in media accounts. Although statistics is mentioned in passing, statisticians are nearly invisible.

Ideally, statistics and statisticians should be the leaders of the Big Data and data science movement. Realistically, we must take a different view. While our discipline is certainly central to any data analysis context, the scope of Big Data and data science goes far beyond our traditional activities. As Bob noted in his column, the sheer scale and velocity of the data being generated from multiple sources require new data management and computational paradigms. New techniques for analysis and visualization must be developed. And communication and leadership skills are critical.

We believe we should focus on what we need to do as a profession and as individuals to become valued contributors whose unique skills and expertise make us essential members of the Big Data team. The ASA is already providing opportunities for statisticians to hone their communication and leadership skills. Through Bob's career success factors initiative, discussed in his October 2012 column (bit.ly/12xNuaO), a high-quality presentation skills course is now available. And Nat has proposed development of a leadership skills course in 2014. We likewise must take steps to enhance our profession's role in Big Data practice. We know statistical thinking (our understanding of modeling, bias, confounding, false discovery, uncertainty, sampling, and design) brings much to the table. We also must be prepared to understand other ways of thinking that are critical in the Age of Big Data and to integrate these with our own expertise and knowledge.

We have had many discussions, among ourselves and with ASA members who are familiar with Big Data, about strategies for achieving this preparation and integration. These discussions have led to our joint ASA presidential initiative to establish the statistical profession as a valued partner in Big Data activities and to position the ASA in a proactive and facilitating role. The goal is to prepare members of our profession to collaborate on Big Data problems. Ultimately, this preparation will bridge the disconnect between statistics and data science.

We recognize we cannot tackle the breadth of this challenge all at once. Accordingly, we have launched three projects that focus on the knowledge base, beyond fundamental statistical training, that statisticians need to succeed in Big Data efforts.

Curriculum Development

A workgroup will be formed to identify issues, approaches, and models for curriculum development in statistics programs that equip students with the knowledge and experience needed to work in Big Data applications. A panel session will be developed for JSM 2014 that will discuss the findings and present recommendations. The workgroup and panel will include academic representatives involved in introducing Big Data into their curricula, together with government and business leaders who are hiring the Big Data work force. The workgroup will develop a report summarizing these discussions and disseminate it to the profession. The report will serve as a roadmap for integrating Big Data skills and knowledge into statistical training.

Engagement with External Stakeholders

The ASA will sponsor a series of one-day meetings, each involving leaders at the forefront of some aspect of Big Data in which statisticians and the ASA are not engaged, along with ASA representatives interested in pursuing Big Data initiatives. For example, a meeting could be held in Silicon Valley with Big Data leaders from the business and technology sectors; another could take place in Washington, DC, with Big Data stakeholders in government. A major goal is to develop networks that will both help the ASA better understand the Big Data knowledge that interested statisticians must gain and promote statistical thinking among Big Data leaders. The ASA participants will recommend next steps toward bridging the "disconnect."

Continuing Professional Development

The ASA will offer short courses in text analytics for interested statisticians at the Conference on Statistical Practice and JSM in 2014. As Bob discussed in his June 2012 column, an understanding of how to acquire and analyze unstructured text data is critical for Big Data work because so much data arise from sources such as electronic health records and social network interactions. To develop these courses, it will be necessary to identify the specific training that would most benefit statisticians and to collaborate with outside experts in natural language processing and text analytics.

Work on these activities has already begun. This initiative will form the foundation for a continuing strategy focused on Big Data beyond 2014 that will highlight the value statistics can bring to Big Data and engage statisticians in successful collaborations.