This article is organized by use case (or scenario) and task, with recommendations given where applicable.

These use cases consider situations where data science tasks (analytics, machine learning, artificial intelligence) are carried out either on a local physical desktop, laptop, or server machine by a data scientist, for instance, or where these tasks are performed on remote physical or virtual machines (single server or distributed system) via a cloud infrastructure such as Amazon Web Services (AWS).

We’ll further differentiate between performing these tasks for development or production purposes, where production implementations and deployments have a unique set of considerations, challenges, and requirements. For more on that, check out the previous article in this series; Dataiku has also written multiple white papers on the topic that are worth checking out.

Note that these lists aren’t exhaustive, and there is no particular preference implied by the order of the items below. Also, everything listed leans more towards open-source software, as opposed to corporate or enterprise solutions (with some exceptions).

Given that a full discussion or comparison of many of the items below could warrant an article of its own, this one is meant to be a stepping stone for the reader to do further research based on their particular interests and needs.

Single Machine (Local or Remote) for Statistical, Data, and Exploratory Analysis (EDA)

In this section, we are focused mainly on statistical analysis, exploratory data analysis (EDA), and data analysis in general.

Part of performing these tasks, and those in the sections throughout this article, is using the languages and packages below to carry out common data operations such as loading, querying, parsing, munging/wrangling, filtering, sorting, aggregating, and visualizing.

Both Python and R are well suited to this scenario, and here are some recommended packages grouped by language.

Python

Reproducible research and notebooks

Jupyter

Data reading, parsing, munging, wrangling, and numerical

pandas

numpy

scipy

Data visualization

matplotlib

seaborn

bokeh

Altair

Advanced statistical analysis

statsmodels

Data scraping

Scrapy

Beautiful Soup
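To make the tasks above concrete, here is a minimal pandas sketch of a typical EDA workflow: loading data into a DataFrame, filtering it, and aggregating by group. The column names and values are purely illustrative, not from any real dataset.

```python
import pandas as pd

# Toy data standing in for a real dataset (names are illustrative).
df = pd.DataFrame({
    "city": ["NYC", "NYC", "SF", "SF", "SF"],
    "sales": [100, 150, 200, 50, 75],
})

# Filter: keep rows above a sales threshold.
high_sales = df[df["sales"] > 80]

# Aggregate: summary statistics per city.
by_city = df.groupby("city")["sales"].agg(["mean", "sum"])

print(by_city)
```

The same pattern (filter, group, aggregate) extends naturally to data read from CSV files or databases via `pd.read_csv` or `pd.read_sql`.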

R

Reproducible research and notebooks

R Markdown

knitr

Data reading, parsing, munging, and wrangling

dplyr

plyr

stringr

tidyr

lubridate

reshape2

party

Data visualization

ggplot2

ggvis

htmlwidgets

lattice

googleVis

rCharts

Advanced statistical analysis

zoo

Interactive visualization and analysis

Shiny

Single Machine (Local or Remote) For Artificial Intelligence, Machine Learning, and Advanced Analytics Development

This use case is where a person such as a data scientist performs artificial intelligence, machine learning, and advanced analytics tasks in a development environment.

In many cases, this scenario is intended to extract information, answer questions, identify patterns and trends, or derive actionable insights, but without necessarily creating a deliverable to be deployed to a production environment.

That said, everything listed in this section can certainly apply to tasks aimed at developing and deploying a high-performing model to a production environment as well. In addition to the languages (Python and R) and packages listed above (e.g. pandas, numpy, …), the following are well suited to this scenario.

Python

General machine learning and advanced statistics

scikit-learn

statsmodels

PyMC3

Neural networks, deep learning, and general AI

Keras

TensorFlow

Theano

Caffe and Caffe2

MXNet

Torch

PyBrain

Nervana neon

Natural language processing (NLP)

NLTK

spaCy

TextBlob

Gensim
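As a minimal sketch of the development workflow these packages support, here is a scikit-learn example that fits a model on a built-in toy dataset and evaluates it on held-out data. The choice of random forest and the split parameters are illustrative, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a built-in toy dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a model and evaluate accuracy on the held-out data.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

The fit/score pattern is the same across most scikit-learn estimators, which is part of what makes the library well suited to rapid experimentation in development.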

R

General machine learning and advanced statistics

caret

randomForest

e1071

glmnet

rpart

gbm

kernlab

tree

Neural networks, deep learning, and general AI

nnet

neuralnet

Natural language processing (NLP)

wordcloud

tm

Single or Multiple Production Servers (Non-distributed) For Artificial Intelligence, Machine Learning, and Advanced Analytics

This scenario covers the situation where artificial intelligence, machine learning, and advanced analytics solutions are deployed to one or more production servers. In the case of multiple servers, this scenario does not cover the distributed case, which is discussed in the next section. All languages and packages covered thus far are applicable to this scenario.

Multiple non-distributed production servers can be employed for scalability in order to handle a large number of concurrent requests that require processing data through a production model (e.g. to make a prediction). Architecturally, this can be done by deploying the same model to many different machines and distributing requests through routing (e.g. load balancing), or by spinning up ephemeral, serverless workers using a technology such as AWS Lambda.

We briefly mentioned in the previous article that deploying data science code and predictive models to production is a matter of either plugging functionality into an existing framework (e.g. as a service or microservice), or translating code into something more compatible with what is already in use. Keep in mind, however, that the latter option can be very costly and time consuming.

There are certainly benefits (e.g. consistency and simplicity) in deploying solutions using the same languages and tools as when developed. One way to do this is to modularize, or package functionality as a service or microservice, which can be leveraged by a wide variety of applications such as a web application, SaaS platform, and so on. In this case, deploying new or updated production code can be a relatively simple matter of deploying and loading an encoded file containing new model coefficients for an existing solution in production.
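As a minimal, stdlib-only sketch of the "deploy an encoded file of model coefficients" idea, here is a hypothetical linear model serialized with pickle. In practice the encoded bytes would be written to a file and shipped to the production service, and tools such as joblib are often used instead of raw pickle; the coefficient values here are made up for illustration.

```python
import pickle

# "Training" side: serialize fitted model coefficients. In practice the
# bytes would be written to a file and deployed to production.
coefficients = {"intercept": 1.5, "weights": [0.25, -0.4]}
encoded = pickle.dumps(coefficients)

# "Production" side: load the encoded artifact and serve predictions.
model = pickle.loads(encoded)

def predict(features):
    # A simple linear model: intercept + dot(weights, features).
    return model["intercept"] + sum(
        w * x for w, x in zip(model["weights"], features)
    )

print(predict([2.0, 1.0]))
```

Updating the production model then amounts to replacing the encoded file and reloading it, with no change to the serving code itself.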

Regardless of the language ecosystem chosen, environment and package management is critical when deploying Python and R solutions, since both will likely rely on a significant number of dependent and versioned packages. Making sure code and packages are properly and regularly updated is a definite consideration. There are many ways to handle this, including using packaging systems to wrap and deploy code and all dependent libraries.

Package managers like Anaconda and Miniconda can help with this, and also allow the user to export and import so-called virtual environments, which define versions of languages and packages to install in a given environment. Python also has a virtualenv tool that is able to create isolated virtual environments if the user is not using a tool like Anaconda.

Other options include pushing trusted packages to production from a curated and maintained master repository for additional environment control, or creating builds or binaries that are then deployed to production. There are many binary-building tools available for this.

In all the cases described, it’s worth noting that there is a relatively significant amount of devops-related work associated with deploying and maintaining solutions in production.

As previously discussed in this series and other articles of mine, scalability can be achieved in many different ways (e.g. distributed computing, load balancing).

In addition to using inherently distributed software frameworks like Hadoop or Spark, there are multiple other options to achieve scalability. One is to create and implement a custom solution, including the required devops and system admin. This can be very time and cost intensive, as well as require specialized skillsets.

Another option is to leverage a platform as a service (PaaS) or infrastructure as a service (IaaS) provider like Amazon Web Services (AWS) or Google Cloud Platform (GCP). These platforms help abstract away many of the complexities associated with devops, site reliability engineering (SRE), system and network administration, and so on.

A particularly interesting and useful technology to consider is on-demand, scalable, serverless computing such as AWS Lambda. It allows for highly scalable and dynamic computing power without the creation, maintenance, and general overhead of running a complete server application (e.g. one written in Node.js).

Lastly, there is a growing number of scalable, specialized APIs being made available as a service that provide various artificial intelligence, machine learning, and advanced analytics-related functionality.

Here are some common packages, frameworks, platforms, and APIs used for scalable, production machine learning and artificial intelligence applications.

Packages, Frameworks, and Platforms

Apache Spark MLlib

Databricks

H2O

Apache Mahout

Apache SystemML

DataRobot

BigML

OpenNN

VELES

AWS Deep Learning AMIs

Amazon Machine Learning

Amazon EMR

Microsoft Azure Machine Learning

APIs

AWS

Lex

Rekognition

Polly

Google Cloud Platform (GCP)

Machine Learning Engine

Jobs

Video Intelligence

Vision

Speech

Natural Language

Translation

Microsoft Cognitive Services

Vision

Speech

Language

Knowledge

Search

Miscellaneous

Houndify

API.AI

Wit.ai

Clarifai

AlchemyAPI

Diffbot

IBM Watson

PredictionIO

SDKs

ai-one

Big Data and Real-Time Analytics

This section discusses scenarios that fall under the admittedly overhyped term big data, as well as real-time analytics solutions.

Big data is a relative term that applies to huge data sets that require an entity (a person, company, etc.) to leverage specialized hardware, software, processing techniques, visualization, and database technologies in order to solve the problems characterized by the three Vs (volume, variety, and velocity) and similar models that add further Vs.

Here is a list of popular and powerful software packages, frameworks, and databases (aka database management systems) categorized by different components of typical big data and real-time analytics solutions.

Data scientists tend to perform much of their data analysis and other tasks by writing custom software or code, while analysts, data miners, and similar roles tend to use pre-packaged software instead. Note that data scientists may also use some of the software listed below for quick-and-dirty analysis and visualization.

Data mining is a relatively broad term that includes components of machine learning, statistical and general data analysis, data visualization, and so on. Since data is required, data mining also involves databases, data management and processing, and other data handling techniques as already discussed.

Wikipedia highlights the goal of data mining as the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. Many of the techniques involved are of the unsupervised machine learning variety, and include clustering, anomaly detection, and so on.
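As a minimal sketch of the unsupervised pattern-extraction just described, here is a clustering example using scikit-learn's k-means: no labels are supplied, yet the algorithm recovers the two groups in the data. The points are made-up, well-separated toy data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of points (illustrative data).
X = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # group A
    [8.0, 8.0], [8.2, 7.9], [7.8, 8.1],   # group B
])

# Unsupervised learning: no labels are given; k-means discovers the groups.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
print(labels)
```

Anomaly detection works similarly in spirit: the model characterizes the structure of the data itself, and points that fit that structure poorly are flagged.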

Business intelligence is similar to data mining in that specialized software is used to extract patterns and knowledge from data, but where the data is specific to a business (e.g. operations, CRM, marketing, ERP, sales). BI’s ultimate goals are to generate actionable insights, drive data-driven decision making, make key predictions, and achieve top-level business goals and KPIs in general (e.g. increasing revenue or customer acquisition).

Because of the volume of data generated, and the many different data sources available to most businesses, data is often collected, integrated, and stored in a database system known as a data warehouse or data lake. The result is a comprehensive, single “source of truth” database that can be leveraged for analytics, decision support, and so on.

In many cases, subsets of the data from a data warehouse or data lake can be exposed independently to different functional groups, or departments of an organization. These independent subsets of data are usually called data marts, and can include data specific to sales, marketing, operations, and so on.

Data Mining

RapidMiner

KNIME

Weka

Orange

Business Intelligence

Plot.ly

Tableau

Looker

Periscope

Mode Analytics

Amazon QuickSight

Qlik

SlamData (for MongoDB)

Google Data Studio

Data Instrumentation and Logging

This final section lists technologies available for instrumentation (i.e., data measurement) and data logging. Both can be implemented on the front end (UI) and/or back end (server) of any software application, with the goal of storing the data to be made available for various forms of analytics. Instrumentation and measurement also apply to data generated by sensors in the case of the Internet of Things (IoT).

For web and mobile applications, a common technique is to place a tracking tag within the pages of the application’s front end, and to track user actions and events within the app (e.g. click, swipe, pageview) using code written in JavaScript.

Server and other application execution and operational data are often logged as well. Such logging can be very useful for application/server monitoring and health status, application performance metrics (e.g. response time), API usage and processing, transaction and CRUD auditing, user action and engagement auditing, error handling, application troubleshooting, and more.
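As a minimal sketch of server-side logging, here is a Python stdlib example that emits one JSON object per request so that downstream tools (such as the ELK stack listed below) can parse fields without regexes. The field names and the in-memory stream are illustrative; in production the records would go to a file or a log shipper.

```python
import json
import logging
from io import StringIO

# Route log records to an in-memory stream here; in production this would
# be a file or a log shipper feeding a system like the ELK stack.
stream = StringIO()
logger = logging.getLogger("api")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(stream))

def log_request(endpoint, status, response_ms):
    # One JSON object per line keeps records machine-parseable.
    logger.info(json.dumps({
        "endpoint": endpoint,
        "status": status,
        "response_ms": response_ms,
    }))

log_request("/predict", 200, 42)
record = json.loads(stream.getvalue())
print(record)
```

Structured records like these make it straightforward to compute the performance metrics and auditing views described above.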

Here are some relevant technologies, and note that some of these also offer functionality like data integration, processing, and analytics.

Google Analytics

Segment

Snowplow

Mixpanel

Heap

Kissmetrics

Keen IO

Splunk

Sumo Logic

New Relic

Loggly

ELK stack

Elasticsearch

Logstash

Kibana

Summary

This article series has covered a lot of ground.

We’ve now had a solid overview of the many aspects of developing and deploying production-ready machine learning and artificial intelligence solutions, including the relevant programming languages, packages, libraries, techniques, and considerations.

For a much more thorough resource on topics discussed in this article series, and for a more in-depth listing of software and packages by category, please feel free to check out my related GitHub resources repository.