Anaconda: Data Science Exiting Hadoop for the Cloud

Data scientists are embracing cloud-native frameworks as they move on from on-premises data infrastructure previously dominated by Hadoop, concludes a survey on the state of data science.

The shift is driven in part by the enterprise transition from merely managing big data to using machine learning and other connected data tools to glean insights in real time, according to the data science survey released this week by Python platform specialist Anaconda Inc. Cloud-native technologies such as application containers and the Kubernetes cluster orchestrator are growing at the expense of traditional big data technologies such as Hadoop and Apache Spark, the survey of more than 4,200 data scientists found.

The vendor survey, which mostly addresses the growing popularity of the Python data science platform (Anaconda reports 2.5 million downloads per month), also acknowledges that Hadoop-style big data approaches are less appealing to data scientists. One reason may be a youth movement: 26 percent of respondents were identified as “students” accustomed to cloud services.

Anaconda executives noted that cloud-native technologies, which also include API-based applications, are helping drive the enterprise transition to cloud analytics.

“More software developers [are] using the Anaconda platform as machine learning is becoming pervasive and will be integrated with every application,” said Mathew Lodge, Anaconda’s senior vice president for products and marketing.

As containers move into production, the survey notes that data volumes at the time of Hadoop’s emergence in 2005 “now fit easily into a single server’s memory and there is a plethora of alternatives to building a data lake.” As a result, the survey found a growing preference among data scientists for Docker containers (19 percent) over Hadoop and Spark (15 percent).

Kubernetes, the de facto standard for container orchestration, was cited by nearly 6 percent of those surveyed.

As with most vendor surveys, Anaconda used the results to toot its own horn as the “data science community’s de facto platform for data processing, visualization and machine learning [and] AI.” A key reason is that the open-source distribution of the Python and R programming languages is free.

Even so, the survey found that open-source licensing ranked relatively low in importance to data scientists, who are mainly interested in easy-to-use platforms. (Anaconda, formerly Continuum Analytics, offers an enterprise version of the Python data science platform. It claims more than 6 million users.)

Along with the shift to cloud data science, the survey found that NoSQL databases rank just behind cloud services, “demonstrating their value for storing and processing semi-structured data,” Anaconda said.