Earlier this year, several of us from the DC2 community (me, Harlan Harris, along with Marck Vaisman and Sean Murphy) conducted a web-based survey of Data Scientists, with the goal of better understanding the varieties of people, skills, and experiences that fall under this rather broad buzzword. We have analyzed the results from over 250 respondents, and are excited to share some initial findings here!
The first task in the survey was to rank a set of 21 skill categories. We used non-negative matrix factorization (NMF) to find five underlying dimensions of variation among the rankings. We found that Data Scientists have skills that tend to be associated with one another, and by grouping those skills, we can provide people with a useful shorthand. Here are the skill groups, with category names that we think clarify what we as Data Scientists bring to the table:

Clearly, not everyone who is strong in some aspects of these categories will be an expert in every area. But, as a general rule, the skills within each group tend to co-occur. Equally important, a Data Scientist with skills in Machine Learning and Big Data may have little expertise in Surveys or Front-End Programming.
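
For readers curious about the mechanics, here is a minimal sketch of the factorization idea. It assumes the rankings have been converted to non-negative scores (higher means stronger) and uses the classic multiplicative-update rule for NMF; the toy data, the function names, and the choice of two factors are all made up for illustration and are not our actual data or analysis code.

```python
import random

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def frobenius_error(v, w, h):
    """Sum of squared differences between v and its approximation w * h."""
    approx = matmul(w, h)
    return sum((v[i][j] - approx[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (n x m) into w (n x k) and h (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Start from small positive random values so every update stays non-negative.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h

# Made-up toy data: 4 respondents x 4 skills, higher score = stronger skill.
scores = [
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
]

w, h = nmf(scores, k=2)
for row in w:
    # Each row holds one respondent's loading on each factor.
    print([round(x, 2) for x in row])
```

Each row of `w` gives a respondent's loadings on the factors, and each row of `h` shows which skills a factor weights most heavily. In the survey itself the matrix was much larger (over 250 respondents, 21 skill categories) and we looked for five factors rather than two.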

We performed a similar NMF analysis on a series of self-evaluation questions near the end of the survey. Respondents gave "Completely Agree" to "Completely Disagree" responses to statements that started with "I think of myself as a(n)..." We view the Self-Identification groups that fell out of the NMF analysis as being critical to clarifying the diverse backgrounds and interests of Data Scientists. Here is how the responses to these questions grouped, along with category names that we feel are useful:

Data Businessperson: Business person, Leader, Entrepreneur

Data Creative: Artist, Jack-of-All-Trades, Hacker

Data Researcher: Scientist, Researcher, Statistician

Data Engineer: Engineer, Developer

Many respondents answered several of these self-ID questions positively, but the analysis shows underlying dimensions of variation that can inform people's career paths and interests. Even more fascinating, the two groupings we identified, skills and self-ID, correlate in ways that we think are highly valuable to Data Scientists and to organizations that need our skills. The graph below shows how survey participants, labeled by their primary (by strongest factor loading) skill group and their primary self-ID group, arrange themselves in a cross-tabulation table.
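
As a rough sketch of how such a cross-tabulation can be built: pick each respondent's primary group as the factor with the largest loading, then count the pairs. The loadings below are made-up numbers, the skill group labels are placeholders rather than our actual category names, and the helper functions are hypothetical.

```python
from collections import Counter

def primary(loadings, labels):
    """Return the label of the factor with the strongest loading."""
    best = max(range(len(loadings)), key=lambda i: loadings[i])
    return labels[best]

def cross_tab(skill_loadings, id_loadings, skill_labels, id_labels):
    """Count respondents by (primary skill group, primary self-ID group)."""
    counts = Counter()
    for skill_row, id_row in zip(skill_loadings, id_loadings):
        pair = (primary(skill_row, skill_labels), primary(id_row, id_labels))
        counts[pair] += 1
    return counts

# Placeholder skill group names (not the real ones) and the four self-ID groups.
skill_labels = ["Skill group A", "Skill group B"]
id_labels = ["Data Businessperson", "Data Creative",
             "Data Researcher", "Data Engineer"]

# Made-up factor loadings, one row per respondent.
skill_loadings = [[0.9, 0.1], [0.2, 0.7], [0.8, 0.3], [0.1, 0.6]]
id_loadings = [
    [0.1, 0.2, 0.9, 0.0],
    [0.0, 0.1, 0.2, 0.8],
    [0.7, 0.1, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.9],
]

table = cross_tab(skill_loadings, id_loadings, skill_labels, id_labels)
for skill in skill_labels:
    print(skill, [table[(skill, ident)] for ident in id_labels])
```

Cells with unusually high counts are where a skill group and a self-ID group tend to go together.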

As we dive further into these results, we will be stressing the point that our data shows substantial variation in skills and interests among Data Scientists. The field is quite diverse. A Data Creative who can build an amazing JavaScript tool to visualize data from a set of disparate sources may be very different from a Data Businessperson who starts a data-related business, a Data Researcher who uses advanced mathematical tools to bring insight to organizations, or a Data Engineer who integrates enterprise databases with predictive or optimization systems.

We'd love to share more results with you! If you are in the Washington, DC area on August 27th, please come see us talk about the survey results at the Data Science DC Meetup! And if you'll be attending DataGotham in New York City on September 14th, we'll be presenting highlights there too! Otherwise, stay tuned for future presentations and publications. If you have any specific questions that we might be able to answer as we further explore the data, please email us!

P.S. If you are one of those Data Creatives or Engineers with JavaScript skills to burn and a bit of free time, we'd love your help putting together a web-based tool related to this project. Please drop us a line.