Big Data II – That’s Not A Tool

Which programming languages and software packages do I really need for data science? There are so many competing to do the same thing; is one of each enough or should I learn a variety?

Following on from our last post investigating data scientist skills and tools on indeed.com, below are links to a few excellent articles and blog posts discussing the leading tools in each category. Active weblinks are provided inside each in-text numerical reference, e.g. [5], and are collated at the bottom of this post.

1. Major Programming Languages For Data Science

R vs Python

There’s a great 2015 in-depth comparison infographic at datacamp.com [1], with a summary at the popular data science blog KDnuggets [2]. It concludes that R is better for data visualisation but Python is the better multipurpose language.

Dan Kopf wrote a great ‘R vs Python’ blog post in 2017 [3] concluding, “In a nutshell… Python is better for data manipulation and repeated tasks, while R is good for ad hoc analysis and exploring datasets.”

His blog post also contains many useful links to other ‘R vs Python’ comparison articles, as does the Miami University Center for Computer Science ‘R vs Python’ webpage [4].

Another excellent ‘R vs Python’ article, with useful links for further research, can be found on the newgenapps.com website [5].

But what about analytics software and languages other than R and Python? Here we’ll focus on just two leading commercial options, SAS and SPSS, as rivals to the two big open source ones, R and Python.

R vs Python vs SAS

Check out the June 2017 article by Burtch Works Executive Recruiting [6]. They perform an annual survey of data science tools preferred by professionals currently in the industry. Here are a few of their results:

R vs Python vs SAS vs SPSS

Here’s the summary table from an excellent article written by Jeroen Kromme in March 2017 [7].

2. Other Programming Languages for Data Science

SQL is the primary programming language for relational databases and, despite the rise of NoSQL databases (discussed below), most organisations still use relational databases as their primary data storage system. In short, SQL programming is a must for data science, especially for business analytics, regardless of which statistical software (SAS, R, Python, etc.) you use.
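To make that concrete, here is a minimal sketch of the kind of SQL a data scientist runs every day: a grouped aggregate, issued from Python through the standard library’s sqlite3 module. The orders table, its columns, the customer names and the top_customers helper are all made up for illustration; a real project would connect to the organisation’s own database instead of an in-memory one.

```python
import sqlite3


def top_customers(conn, limit=3):
    """Return (customer, total_spend) rows, highest spend first."""
    query = """
        SELECT customer, SUM(amount) AS total_spend
        FROM orders
        GROUP BY customer
        ORDER BY total_spend DESC
        LIMIT ?
    """
    # The ? placeholder lets sqlite3 bind the value safely.
    return conn.execute(query, (limit,)).fetchall()


# Build a tiny in-memory database with hypothetical data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("bob", 75.5), ("alice", 30.0), ("carol", 200.0)],
)
conn.commit()

print(top_customers(conn))
# [('carol', 200.0), ('alice', 150.0), ('bob', 75.5)]
```

The same SELECT ... GROUP BY ... ORDER BY pattern carries over to most relational databases, which is why SQL stays useful whichever analytics language sits on top.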

But what about other major multipurpose programming languages to rival Python in data science?

In particular, looking back at our bar chart of 10 US, UK and AUS job ads from indeed.com on data science programming languages, after the big four (Python, R, SAS and SQL) came Scala, Java and C. Do we need them as well, or can we scrape by with just Python or R?

Java

The Hadoop ecosystem, arguably the most important Big Data software, is written in Java, although it can be accessed by other languages like Python as well. Even so, Aaron Lazar wrote a compelling article in 2017, ‘10 reasons why data scientists need to learn Java’ [8].

Scala

Apache Spark is written in Scala, although it can be accessed with Python and Java as well. Using Spark is currently the main reason for learning Scala, and many discussions about Scala revolve around Spark, such as [11][12].

Chris McKinlay wrote a great article in June 2016, ‘Scala is the new golden child’, about the growing importance of Scala, particularly for data science employment [13].

He says, “Spark… has recently surpassed Hadoop to claim the title of most active open-source data processing project. Spark is essentially distributed Scala; it uses Scala-esque ideas (closures, immutability, lazy evaluation, etc.) throughout. The Java and Python APIs are semantically quite far removed from the core design innovations of the system.”
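The three ideas named in that quote are easier to see in code. The sketch below is plain Python, not Spark and not Scala, and every name in it (make_scaler, traced_square, calls) is invented for illustration. It only shows what a closure, an immutable collection and lazy evaluation look like on a single machine; Spark applies the same ideas across a cluster.

```python
def make_scaler(factor):
    # Closure: the inner function captures `factor` from the enclosing scope.
    def scale(x):
        return x * factor
    return scale


calls = []  # records when work actually happens


def traced_square(x):
    calls.append(x)
    return x * x


numbers = (1, 2, 3, 4)  # immutable: a tuple cannot be modified in place
double = make_scaler(2)

# Lazy evaluation: map() only describes the pipeline, nothing runs yet.
pipeline = map(double, map(traced_square, numbers))
calls_before = list(calls)  # still empty at this point

# Consuming the pipeline forces the computation, one element at a time.
result = tuple(pipeline)

print(calls_before)  # []
print(result)        # (2, 8, 18, 32)
print(numbers)       # (1, 2, 3, 4), the input is untouched
```

Describing the whole pipeline before running any of it is what gives a system room to plan the work, which is roughly the role lazy evaluation plays in the design McKinlay describes.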

So it turns out that it’s not just the most commonly requested languages, like R and Python, that matter in data science. Less common ones like Scala and Java are also very important, particularly for big data storage and analytics.

For another comparison between these languages, and other less common ones, in the context of data science, see ‘Which Languages Should You Learn For Data Science?’ by Peter Gleeson [14]. He covers R, Python, SQL, Java, Scala, Julia, MATLAB, C++, JavaScript, Perl and Ruby.

3. Apache Big Data Software

The single best introductory article I’ve found for explaining and comparing the main Apache big data frameworks (Hadoop, Spark, Flink, Storm and Samza) is a 2016 article by Justin Ellingwood at digitalocean.com [15]. There’s another article comparing Hadoop, Spark and Flink in great detail at theserverside.com [16].

Popular data science blog KDnuggets secured an exclusive interview with the creator of Spark, Matei Zaharia, in 2015 [17], discussing the differences between Spark and Flink among other things. KDnuggets also has a more detailed comparison between Spark and Flink in a 2015 article, ‘Fast Big Data: Apache Flink vs Apache Spark for Streaming Data’ [18].

4. NoSQL Databases

NoSQL databases are the main big data storage alternative to the Hadoop Ecosystem.

Digitalocean.com produced a good article, ‘A Comparison Of NoSQL Database Management Systems And Models’ [19], explaining the different types of NoSQL databases (key/value, column, document and graph), when to use them and what options are available (e.g. Redis, Cassandra, HBase).
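As a rough picture of how those four models differ, the sketch below mimics the shape of the data in each one using ordinary Python dictionaries. It does not talk to any real database, and all the keys, names and the friends_of_friends helper are hypothetical. The point is only that each model favours a different kind of lookup.

```python
# Key/value: one opaque value per key, fetched by exact key.
kv_store = {"user:42:name": "Ada"}

# Document: a self-contained nested record, queried by its fields.
document = {"_id": 42, "name": "Ada", "languages": ["R", "Python"]}

# Column family: row key -> column family -> column -> value.
wide_row = {
    "user42": {
        "profile": {"name": "Ada"},
        "skills": {"R": "expert", "Python": "expert"},
    }
}

# Graph: nodes plus the relationships between them (here, who knows whom).
graph = {"ada": ["grace"], "grace": ["ada", "linus"], "linus": ["grace"]}


def friends_of_friends(graph, person):
    """People two hops away, excluding the person and their direct friends."""
    direct = set(graph.get(person, []))
    found = set()
    for friend in direct:
        for candidate in graph.get(friend, []):
            if candidate != person and candidate not in direct:
                found.add(candidate)
    return sorted(found)


print(kv_store["user:42:name"])                  # Ada
print(document["languages"])                     # ['R', 'Python']
print(wide_row["user42"]["skills"]["Python"])    # expert
print(friends_of_friends(graph, "ada"))          # ['linus']
```

The relationship query at the end is the sort of question graph databases are built for, while the first lookup is all a key/value store needs to do well.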

Kristof Kovacs has written a great overview and evaluation of the main NoSQL databases, with potential applications for each [20]. For an in-depth comparison, consider the 2015 Journal of Big Data article ‘Choosing the right NoSQL database for the job: a quality attribute evaluation’ [21].

5. Big Data Cloud Platforms

The three major cloud platforms are AWS (Amazon Web Services), (Microsoft) Azure and GCP (Google Cloud Platform). For a good comparison between them see [22] and [23].

6. Data Visualization

To compare Tableau and Power BI, see [24] and [25]. For a good comparison between Tableau and QlikView see [26]. To compare QlikView, Tableau, Spotfire and MS BI Stack see [27]. To compare 16 visualisation tools, from Carto to Vizia, see [28].

Or you could go straight to the experts at KDnuggets and read their 2018 article, ‘A Comparative Analysis of Top 6 BI and Data Visualization Tools in 2018’ [29]. It compares QlikView, Klipfolio, Tableau, Geckoboard, Power BI and Google Data Studio.

Conclusion

Start with R and Python for data analytics. Get your SQL skills sorted. Learn Java and/or Scala for Apache big data platforms like Hadoop and Spark. Investigate NoSQL and cloud platform data storage options. Play around with a few data visualization tools for business intelligence like Tableau.

In short, be familiar with many but focus on mastering a few. Bonne chance!