Big Data Languages, Tools, and Frameworks

The data scientists we spoke with most frequently mentioned Python, Spark, and Kafka as their go-to data science toolkit.

To understand the current and future state of big data, we spoke to 31 IT executives from 28 organizations. We asked them, “What are the most prevalent languages, tools, and frameworks you see being used in data ingestion, analysis, and reporting?” Here’s what they told us:

Python, Spark, Kafka

With big data and the push into AI/ML, Scala and Python are leading, with Apache Spark gaining popularity. Teams are moving from OLAP cubes and data warehouses to less organized structures and applying ML with Python. Developers are writing Python ML models because of the library support that's out there.
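A minimal sketch of the kind of Python ML model that library support makes quick to write, here using scikit-learn on synthetic data (the dataset and model choice are illustrative, not from the source):

```python
# Illustrative only: train and evaluate a small classifier with
# scikit-learn, the sort of library support the quote refers to.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real business data.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```

In practice the same few lines scale from a laptop experiment to a model that gets deployed against much larger, less structured data.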

Kafka for streaming ingest. R and Python for programming. Java is prevalent. SQL hasn't gone away; it isn't big data's best friend, but it opens the data to a broader range of people. Gartner has SQL on Hadoop coming out of the trough of disillusionment.

We see a lot of Hadoop, Spark, and Kafka. The emerging tech is in data warehousing, where there is a lot of interest in Redshift, Snowflake, and BigQuery. ML is out there; we've added capabilities for TensorFlow, and there's early interest. The third is Kubernetes, with a lot of interest in leveraging it to scale out consumption.

Other open source tools are widely used, such as Spark, R, and Python. This is why platforms offer integrations with these open source tools. In our workflow, it is possible to introduce a new node in which to script Python, R, or Spark code. At execution, the node will run the code and become part of the node pipeline in the workflow.

For a while, R was predominant, especially for operationalizing models in data science. Now the real innovation is around Python, with a lot of tools, libraries, and support. People are starting to explore Spark and Kafka: Spark processes huge volumes at speed, and Kafka is a messaging system for getting data into Spark. R is great for analyzing historical data. Take the model, ingest real-time data, and marshal that data so the model can be applied in real time.

Some of the common tools and frameworks include in-memory relational databases like VoltDB, Spark, Storm, Flink, Kafka, and NoSQL databases.

We provide a LINQ-type API for all CRUD data operations which can be called from a variety of languages such as C#, Go, Java, JavaScript, Python, Ruby, Scala, and Swift. Designed as a high-performance (predictably low-latency) database, our primary data access was created to be programmatic rather than declarative, and as such, we do not currently support SQL. As our customers add analysis to the workloads they are currently performing, we will be adding SQL support. We support exporting data to backend data warehouses and data lakes for analysis. For ingestion, tools such as Kafka and Kinesis are gaining traction as the default data communications pipes among our customers.

We see SQL as the primary protocol used by companies of all sizes for data residing in our platform. For deployment management, we have seen a rapidly growing use of Docker and Kubernetes. For data ingestion, Apache Kafka is used by many of our customers and we recently announced the certification of our Kafka Connector within the Confluent partner program. For analysis, we frequently see Apache Spark used along with Apache Ignite as an in-memory data store.

Apache Kafka has become, essentially, a standard for streaming high volumes of data (particularly sensor data) into data analytical platforms in near real-time at ingest. For the highest analytical performance, in-database machine learning and advanced analytics are becoming an increasingly important way for organizations to deliver predictive analytics at scale. For reporting, there are a variety of data visualization tools on the market today – from Tableau to Looker to Microsoft Power BI to IBM Cognos to MicroStrategy and many others. Business analysts have never had more options to report on and visualize data. However, they should insist that their underlying data analytical platform has the scale and performance to enable them to get insight from the largest volumes of data with complete accuracy in seconds or minutes, not after the business opportunity has passed.
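As an illustration of the sensor-ingest pattern described above, here is a hedged sketch of serializing a sensor reading for a Kafka topic. The topic and field names are invented, and the serializer mirrors what kafka-python's `KafkaProducer` would apply via its `value_serializer` option:

```python
# Illustrative only: prepare one sensor reading as JSON bytes for a
# Kafka message value.
import json

def to_kafka_value(sensor_id, reading, ts):
    """Serialize a sensor reading as UTF-8 JSON bytes."""
    payload = {"sensor": sensor_id, "value": reading, "ts": ts}
    return json.dumps(payload, sort_keys=True).encode("utf-8")

# With a real broker, the send would look like (hypothetical names):
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("sensor-readings", to_kafka_value("s1", 21.5, 1700000000))
print(to_kafka_value("s1", 21.5, 1700000000))
```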

We leverage several data ingestion and orchestration tools, with the Apache Kafka and NiFi projects being the most prevalent. We use Hadoop YARN with HBase/HDFS for our persistence layer, and we take advantage of data processing, predictive modeling, analytics, and deep learning projects such as Apache Zeppelin, Spark/Spark Streaming, Storm, scikit-learn, and Elasticsearch. In addition to these open source projects, we leverage Talend, Pentaho, Tableau, and other best-in-class commercially licensed tools.

TensorFlow, Tableau, PowerBI

1) We use Amazon Athena (Apache Presto) for log analysis. 2) We use Mode Analytics for data visualizations and Reporting. 3) We use TensorFlow to analyze traffic patterns.
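A hypothetical sketch of kicking off a log-analysis query on Amazon Athena (which runs Presto under the hood) via boto3. The database, table, and S3 bucket names are invented, and working AWS credentials are assumed:

```python
# Illustrative only: submit an aggregate query over web access logs.
QUERY = """
SELECT status, COUNT(*) AS hits
FROM access_logs            -- hypothetical log table
GROUP BY status
ORDER BY hits DESC
"""

def run_log_query(query=QUERY):
    import boto3  # imported lazily so the sketch loads without AWS deps
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "weblogs"},  # hypothetical
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    return resp["QueryExecutionId"]
```

Athena executions are asynchronous, so a real caller would poll `get_query_execution` with the returned ID before fetching results.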

Data science from an ML perspective: the availability of DL frameworks such as TensorFlow, PyTorch, Keras, and Caffe has made a huge difference in applying ML and creating models for large-scale data.

Working through the platforms as a way to deliver insights at scale. BI use cases are trying to scale analysts. Tableau, PowerBI, MicroStrategy, TIBCO, and Qlik try to expand the number of people dashboards reach.

We see a lot of Spark as organizations move away from MapReduce. Java and Python are popular. Kafka is being used for ingestion, and Arcadia Data, Tableau, Qlik, and PowerBI for visualization.

Many projects use multiple languages and multiple analytics tools. We see a lot of SQL use, of course, and data science-oriented languages such as Python and R, but also significant use of classic programming languages such as Java and C#. For data science, the top package we're seeing as an adjunct to our products is TensorFlow, followed closely by self-service BI tools such as Tableau, PowerBI, and QlikView.

Other

Open source. More organizations are moving to streaming data, driven by a need and desire for real-time answers.

It depends on the project. We see multiple mechanisms being used for ingestion, enrichment, and document classification. SciByte and Thomson Reuters offer ontologies and intelligent tagging tools to drill down into the data, along with personality insights and sentiment-analysis enrichment of the data.

Customers drive what they use from the browser and are looking to build off tools they already have. SQL is still the language for big data; it works on top of Hadoop and other databases.

OData isn’t that new, but people are using it from both the server side and the client side. Others use GraphQL to dynamically query and get data. There is a lot of new technology on the server side: MongoDB does certain things well, Redis is good for caching, and S3 is useful for data storage, with Elasticsearch and S3 as the backend. Vendors are getting more specific about what they offer, with more clearly defined technologies and design patterns.

People who use R and Python stick with what they use. There are a number of APIs in the system with more support. From an ingestion point of view, you want to offer as many ways to get data into and out of the system as possible and support as many tools as possible. There is no critical mass, so cater to the talent: developer tools and APIs should support a wide range of both.

Larger companies would like everyone using the same tool for BI and data science, but they have a mix of tools, and it's hard to standardize thousands of people on one tool. The way to integrate with different backends and accelerate production varies from tool to tool. We provide integration, acceleration, and a catalog of what the data is and its semantic meaning. The catalog is centrally located in the platform. Pull security, integration, and acceleration into a central open source layer that works with all tools and data sources.

The big data world is quickly evolving in so many ways across all environments—on-premises, in the cloud, etc. We see lots of variations of languages, execution engines, and data formats. Our core value is allowing customers to bypass having to deal with all those different tools and standards. With the drag-and-drop, no-code environment that we deliver, customers don't have to code anything by hand. This allows them to develop data pipelines once as part of a repeatable framework and then deploy them en masse regardless of the technology, platform, or language. For example, we have customers that have used Infoworks to implement on-premises on Cloudera once and then run those same pipelines without re-coding on Google Cloud using Dataproc.