Machine learning. Artificial Intelligence

Big Analytics Roundup (August 29, 2016)

TechCrunch reports the results of a new study that says you really don’t need a co-founder after all. Next, they’ll be telling us we don’t need to floss.

Python and R

Matt Asay argues that Python is a gateway language that leads data scientists to R (h/t Oliver Vagner). That claim is oversimplified and mostly incorrect. The breadth of R’s analytics functionality tends to draw statisticians and scientists, while Python tends to be an entry language for software developers. While R supports more analytics than Python, Python has better tooling for Big Data; PySpark, for example, does much more than SparkR.

In KDnuggets’ 2016 poll, Python use increased markedly from 2015; this suggests that R users are adding Python to their battery of tools. More people in the poll use both Python and R than use either one alone.

While R is an excellent tool for personal use, its GPL license discourages adoption by companies that develop products or deliver services built on analytics. Thus, it is very unlikely that R will overtake Python as a development platform for machine learning applications.

Aster on Hadoop

Teradata announces the availability of Aster on Hadoop and AWS. Aster on Hadoop strikes me as a bladeless knife without a handle.

Aster was kind of interesting in 2012; SQL/MapReduce offered analysts a way to run queries in Hadoop when Hive was clunky and slow. Today, Aster is rendered obsolete by the likes of Impala, Spark, Presto, Drill, and HAWQ. According to DB-Engines, Aster ranks 77th in popularity, well below competitors Vertica, Netezza, and Greenplum.

Teradata’s leadership says that Aster is a great foundation for custom applications. Assuming, for the sake of argument, that this is true, the logical move is to donate Aster to open source, as Pivotal did with Greenplum.

In 2012, Amgen researchers disclosed that they were unable to reproduce findings in 47 out of 53 published cancer discoveries. In Nautilus, Ahmed Alkhateeb argues that we should not accept scientific results unless the findings are reproducible.

In a thesis submitted to Sweden’s KTH Royal Institute of Technology, Ahsan Javed Awan reports the results of benchmarking Apache Spark on a single scale-up server. He ran into some scaling issues on machines with more than twelve cores, which he records in some detail.

Explainers

— Felix Gessert explains the ins and outs of different NoSQL databases and offers a rubric for choosing one.

— On the Google Research Blog, Peter Liu explains text summarization with TensorFlow.

— Julie Bort explains why you shouldn’t depend on one cloud service provider.

Perspectives

— On the Confluent blog, Jay Kreps argues that multi-tenancy is the key capability of distributed systems.

— Cynthia Harvey compares AWS and Azure; she misses the big picture. AWS is a software-agnostic IaaS provider; MSFT is a software company with complementary PaaS and SaaS services. There are advantages and disadvantages to each model, but first one must recognize the difference.

— Microsoft announces the availability of Microsoft R Open (MRO) 3.3.1, with a streamlined installation process, additional packages, and bug fixes. MRO is a free and open source enhanced distribution of R.

Commercial Announcements

— Big-Data-as-a-Service provider BlueData announces a $20 million “C” round led by Intel Capital. The company also announces a partnership with Intel to deliver its software on Xeon processors.

— Google offers several webinars in September for those who want to learn more about BigQuery, Cloud Dataflow, and the Google Cloud Platform.

— Syncsort announces that it has completed the acquisition of Cogito, a maker of mainframe stuff that complements Syncsort’s other mainframe stuff.