In this short post I analyze one of the main headaches of HR departments: the search for talent. It is intended as a reflection on the Data Scientist, currently one of the most highly valued profiles on the market. This professional must be able to combine technical skills with interpretive capacity and critical thinking, but the balance we find is not always the optimal or desired one.

In search of critical thinking. The new Data Scientist between correlation and causality.

At Conento, where we focus on Analytics and Big Data projects, we are always running selection processes for new profiles. Our focus is on finding talented Data Scientists. While it is true that there are more and more Data Scientists on the market every day, it is also true that finding profiles that fit our needs is becoming harder. Of every 100 Data Scientists who enter our selection process after a first CV filter, only 2 are hired. That is 2%, and the rate has been decreasing over the last few years.

What’s going on? On the one hand, a more competitive market compels us to use more rigid selection criteria. On the other hand, there is the feeling that universities and institutions, with their different master’s programs in Big Data, Data Science and Machine Learning, are “generating” many Data Scientists who suffer from what I call “the correlation syndrome”.

This means that the new prototype of Data Scientist seems to prioritize correlation over causality. It seems that it is no longer interesting to analyze data by asking the why of things or whether our results make sense. What matters is to get a result as quickly as possible with the most sophisticated Machine Learning algorithm: “hit the button” and see what comes out, without looking back. To our increasing amazement and disbelief, this is becoming commonplace in the practical tests we give candidates during the selection process. The lover of correlation has a blind faith in algorithmic logic (which, after all, is a nihilistic and totalitarian vision), relinquishing the “narration” of data and numbers.
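To make the “correlation syndrome” concrete, here is a minimal sketch in Python, using simulated data only. The variable names (season, ad_spend, sales) and all the numbers are hypothetical and are not taken from any real test or project. A hidden common cause drives two metrics, so they correlate strongly even though neither affects the other; once the common cause is accounted for, the association largely disappears.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, xs):
    """What is left of y after removing a least-squares line fitted on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - my - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical setup: "season" is a common cause of both metrics.
# By construction, ad_spend has NO direct effect on sales.
season = [rng.gauss(0, 1) for _ in range(n)]
ad_spend = [s + rng.gauss(0, 0.5) for s in season]
sales = [s + rng.gauss(0, 0.5) for s in season]

# Naive view: the two metrics look strongly related.
raw = pearson(ad_spend, sales)

# Controlling for the common cause: correlate what is left of each
# metric after removing the part explained by season.
partial = pearson(residuals(ad_spend, season), residuals(sales, season))

print(f"raw correlation:          {raw:.2f}")
print(f"controlling for season:   {partial:.2f}")
```

A candidate who only “hits the button” would report the first number. Asking why the two series move together is what leads to the second.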

It is curious to observe how there is ever more talk about an artificial intelligence with increasingly efficient and precise algorithms, but one that needs to be coupled with human intelligence, still the only one able to analyze in depth the why of things, that is, cause-effect relationships. But in the case we are analyzing, something different happens: it seems that the Data Scientist wants to walk hand in hand with artificial intelligence and become a clone of it, focusing on mechanical and repetitive tasks and giving up the one thing that adds real value: critical thinking.

This deficiency actually reflects a new dynamic of modern society, which mixes new living and consumption habits, technology and educational models. In a world of speed and continuous acceleration that leaves no time to look back, reflect and contemplate, it is obviously difficult to see things in a way that is not superficial. Technology reduces distance and time, and this would theoretically allow us to free up time to think; but instead of doing so, we prefer to fill this new space with “empty” activities, endlessly replicating, like machines, mechanical processes with no real value: adding strangers to our social networks, reading and discussing content that contributes nothing, checking our email compulsively…

Acceleration makes us lose the ability to follow a standard data analysis process (as traditional statistics has always done). Before launching the modelling stage, we need a thorough evaluation of the quality of the available data, a sound construction of metrics and a descriptive analysis to capture the first associations between variables. After the modelling, we need a careful evaluation of the results and calibrations in order to strike a balance between mathematics and logic. This is a progressive path of model construction, with different stages that increase our knowledge and understanding of the problem we are analyzing. It seems we are losing all this, and turning back is very difficult.
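As an illustration of the pre-modelling stage described above, here is a minimal sketch of a data quality check and a descriptive summary. The records, column names and helper functions (missing_share, describe) are hypothetical, made up for this example; a real project would use its own data and probably a dedicated library.

```python
from statistics import mean, median

# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 34, "income": 42_000, "purchases": 3},
    {"age": None, "income": 38_000, "purchases": 2},
    {"age": 51, "income": None, "purchases": 7},
    {"age": 29, "income": 1_000_000, "purchases": 1},  # suspicious value
    {"age": 45, "income": 52_000, "purchases": 5},
]

def missing_share(rows, column):
    """Fraction of rows where the column is missing."""
    return sum(r[column] is None for r in rows) / len(rows)

def describe(rows, column):
    """Basic descriptive statistics over the non-missing values."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "n": len(values),
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
    }

for col in ("age", "income", "purchases"):
    share = missing_share(records, col)
    print(f"{col}: missing={share:.0%} {describe(records, col)}")
```

In this toy data the mean income is far above the median, which is the kind of signal that should make us stop and ask whether one record is an error before any model is fitted.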

We are concerned because we believe that, beyond the technological revolution and social changes, something in the educational and training processes is failing. It is not easy to identify potential solutions (and this would be another debate), but the love of causality should once again be the guide in our journey: the Data Scientist who combines this aspect with technical knowledge will succeed in the labor market of the future.

It’s fair to say that the American election polls in 2018 were quite successful. Newspaper headlines and stories after the midterm elections for the most part praised them. However, their tone was sometimes one of surprise and shock, reflecting the long-term impact of the criticism that followed the 2016 presidential election.

That year, the U.S. pre-election polls averaged a three-point lead for Hillary Clinton in the national popular vote, and she did win the popular vote by just over two percentage points. While that would elsewhere have been seen as a success, the American electoral system selects a winner through the Electoral College, where votes are allocated based on the number of Senators and Representatives each state has in Congress. In most states, the popular vote winner takes all that state’s electoral votes. By winning Michigan, Wisconsin and Pennsylvania by a combined total of 79,000 votes, Donald Trump won a majority in the Electoral College. This state by state counting received less attention in 2016 than it should have (after all, as recently as 2000 the candidate who won the national popular vote also lost in the Electoral College).

The national popular vote for the House of Representatives means little, as seats are allocated district by district. But “generic ballot” national polls are common, and this year they indicated that Democrats would have a clear lead in votes cast nationally. They did, winning the national House vote count by eight percentage points. But as in Presidential elections, capturing a majority of the vote overall doesn’t mean that you will win enough seats (just as it may or may not give you a victory in the Electoral College). Democrats won 1.4 million more House votes nationally than Republicans did in 2012, but still wound up with 33 fewer seats.
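A toy calculation shows how this can happen. The five districts and their vote counts below are entirely hypothetical, chosen only to illustrate the mechanism: one party piles up votes in a single lopsided district and narrowly loses the rest.

```python
# Hypothetical vote counts per district: (party_a, party_b)
districts = [
    (90_000, 10_000),  # party A wins one district by a landslide
    (45_000, 55_000),  # and loses the other four narrowly
    (46_000, 54_000),
    (47_000, 53_000),
    (48_000, 52_000),
]

# National vote totals across all districts
votes_a = sum(a for a, _ in districts)
votes_b = sum(b for _, b in districts)

# Seats are decided district by district, winner takes the seat
seats_a = sum(a > b for a, b in districts)
seats_b = sum(b > a for a, b in districts)

print(f"Party A: {votes_a:,} votes, {seats_a} seat(s)")
print(f"Party B: {votes_b:,} votes, {seats_b} seat(s)")
```

In this made-up example party A takes about 55% of the total vote and only one of the five seats, which is why a national vote lead alone says little about the seat count.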

Midterm elections must be seen as a collection of many races – the 35 Senate elections and the 435 House races. So national polls are not enough. This year, however, there was an even greater focus on individual House races, particularly those that were likely to be close, or were viewed as having the potential to change sides. There were several creative attempts to deal with the large number of such races, using a combination of new and older methodologies. The old ways of doing things are definitely under challenge.

The New York Times’ Upshot paired with Siena College to conduct polls in dozens of competitive House districts. It used voter lists provided by a vendor, instead of making telephone calls using random digit dialing. It sampled within Congressional districts, making adjustments based on the availability of telephone numbers for subgroups and relying on outside information for data not available on voter lists, like education. It then created turnout estimates. This is difficult in the U.S., as voting is not compulsory and turnout can be very low in non-presidential years. This year, the usual low turnout was expected to rise from the 36.7% of the voting-eligible population that voted in the 2014 midterm election. The turnout rate jumped twelve percentage points, as nearly half the eligible population turned out in 2018, the highest in more than 50 years. The methodology is reported here and here.

Since The Times and Siena College polled in what were expected to be competitive districts, the polls basically showed elections that were very close, within the sampling error, in nearly all of them. The Times decided to show results in real time as each interview was completed. This is an example. While the poll was being conducted, red and blue dots appeared in the location about an hour after an interview was completed. As is the case with most telephone polls, the vast majority of calls do not result in an interview. So watching the polling “live” could be a slow and lengthy process, not necessarily an exciting one. This was viewed as a way of making election polling more transparent to the public.

CBS News partnered with YouGov for its Battleground Tracker, using YouGov’s online panel, with oversamples in contested districts. By also using information about voters throughout the country to improve the estimates in the contested districts, CBS News and YouGov were able to make better estimates of the final House outcomes. The questionnaires were somewhat longer than those used by the Upshot, and the Battleground Tracker conducted its polling online, not through telephone calls. Both of these approaches required a very large number of interviews.

The final Battleground estimate was 225 seats for Democrats and 210 for Republicans. With a large margin of error (plus or minus 13 seats on each number), the final outcome of what appears to be 235 Democratic seats fell within the range of the estimate (some races are still not officially settled).
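The arithmetic behind that statement can be checked in a few lines. The estimates, the margin and the 235-seat figure come from the paragraph above; the Republican outcome of 200 is my own assumption, derived by giving the two parties all 435 House seats, and the helper name within_margin is made up for this sketch.

```python
def within_margin(estimate, margin, outcome):
    """True if the outcome lies in [estimate - margin, estimate + margin]."""
    return estimate - margin <= outcome <= estimate + margin

TOTAL_SEATS = 435
MARGIN = 13  # plus or minus 13 seats on each number

dem_estimate, rep_estimate = 225, 210
dem_outcome = 235  # "what appears to be" the final Democratic count
# Assumption: every remaining seat goes to Republicans
rep_outcome = TOTAL_SEATS - dem_outcome

for party, est, out in [("Democrats", dem_estimate, dem_outcome),
                        ("Republicans", rep_estimate, rep_outcome)]:
    low, high = est - MARGIN, est + MARGIN
    print(f"{party}: estimate {est} (range {low}-{high}), "
          f"outcome {out}, within margin: {within_margin(est, MARGIN, out)}")
```

Under these assumptions both outcomes sit inside their ranges, although the Democratic result is much closer to the upper edge than to the point estimate.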

This year, even the traditional exit poll had a challenger. The exit poll, invented in the 1970s, has changed as Americans have changed how they cast ballots. With a growing share of the vote cast before election day (through absentee and early voting), Edison Media Research, which has conducted the media exit poll for more than a decade, now supplements traditional exit polling at precincts with pre-election telephone polls and polls at physical early voting locations. This year, the Associated Press partnered with the National Opinion Research Center (NORC) and Fox News to expand the reach of election day polling, creating AP VoteCast: 40,000 pre-election interviews using registration-based sampling (which The New York Times Upshot also used), 6,000 interviews with the NORC AmeriSpeak probability-based online panel, and more than 90,000 interviews with non-probability online panelists. With approximately 60 different questions, AP VoteCast would not only tell who had won but also provide issue and demographic information.

Overall, the election polls of 2018 generally did well, but some pollsters appear to have decided to use the concerns of 2016 as a starting point to develop new methods of understanding election behavior.
