Introduction

Welcome to the second article in a three-part series. In the first article, we discussed the characteristics and paradigms of programming languages in general, and for common data science languages such as Python and R. We also discussed different integrated development environments for each language, and gave an initial overview of which language to use, when, and why.

In this article, we’re going to discuss which languages and packages to use in the context of considerations such as:

Local or remote development

Production vs development environments

Single server vs distributed computing

Batch vs mini-batch vs real-time vs near real-time processing

Offline vs online vs automated learning

We’ll follow this with a discussion of improving solution performance and availability in production environments, including the importance and options around computing resources.

With that, let’s get started!

Local vs Remote Development

During the development phase of a programming project, i.e., writing code and testing it, data scientists and software engineers often code on their local machine (computer). The machine is usually a laptop or desktop running a non-server operating system, aka OS (e.g., macOS, Windows, Ubuntu).

People can also develop on a remote machine, i.e., a physical computer or virtual machine (VM). Remote access, control, and computing are possible with technologies such as remote desktop, virtual network computing (VNC), SSH access via a terminal, client/server request/response patterns, and cron-like job schedulers for recurring tasks.

The benefit of remote development and computing is that it offloads memory (RAM) and processing (CPU) load to a more powerful machine. Virtual machines hosted on cloud-based IaaS platforms such as Amazon Web Services (AWS) and Google Cloud Platform (GCP) can also be easily configured to handle a wide array of memory, processing, and data storage requirements.

Another benefit of remote development is that remote machines are usually servers, which means that they are running highly optimized operating systems that are meant for server-like usage, as opposed to home or desktop operating systems.

In some cases, remote development is necessary because the dataset being used is larger than the local machine’s available system memory (RAM), or the processing requirements for a given task are greater than the local machine’s capability. When limited by either RAM or CPU, computation can become extremely slow, or impossible altogether. When this happens, data scientists often use a smaller subset of the data for local development and prototyping purposes.
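As a minimal sketch, here is one way to take a reproducible random subset of an in-memory dataset for local prototyping. The records and sizes are illustrative; a real dataset might instead be sampled at the database or file-reading stage.

```python
import random

# Hypothetical large dataset represented as a list of records; a real
# one might live in a database or a large CSV file.
full_data = [{"id": i, "value": i * 2} for i in range(100_000)]

# Take a reproducible 1% random sample for local development,
# seeding the generator so colleagues see the same subset.
random.seed(42)
sample = random.sample(full_data, k=len(full_data) // 100)

print(len(sample))  # 1000
```

The seed makes prototyping runs repeatable, which matters when debugging code against a subset before scaling it to the full dataset remotely.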

All of this said, local development is usually preferred, simpler, and faster for tasks that do not require additional memory or computing power.

Production vs Development Environments

Software code is usually developed, tested, and ultimately made available to end users in different computing environments. In this context, an environment can be thought of as a specific machine (physical or virtual), running a specific operating system and version, that is configured in a very specific way, along with having a specific, versioned set of software, programming languages (e.g., Python), and packages installed.

These environments are commonly named development, staging, QA, and production, each set up for a different phase of the software development process.

We focus here on development and production environments, along with their differences. Development environments are local or remote environments meant for a programmer to write, test, and optimize code before it’s deployed into a real-world, live situation. Code in development should absolutely be tracked and managed using a version control system such as Git or SVN.

Once code is written and successfully meets all requirements (both functional and non-functional), and passes all tests where applicable, it is deployed through varying DevOps-like processes to a production environment. The production environment is where the software runs constantly and is widely available to end users and/or other software. This process is synonymous with the phrase software release.

Take Amazon and Netflix’s recommendations, which come by way of sophisticated recommender system engines that they’ve built. Both Amazon and Netflix make recommendations to their users via their web or mobile app user interfaces (UI). Whenever either company wants to improve its recommendation system, it likely writes and tests all code/algorithm/model changes in a development environment first, perhaps moves them to a QA or staging environment for testing, and afterward deploys to production in order to update the recommendations that all users see.

Production machines run server-type operating systems that are highly optimized to meet expected load and demand (scalability), and are also expected to be continuously running and working properly (availability and reliability).

Because these requirements and environments are different from those needed for development and prototyping, deploying code and AI/machine learning models can be challenging due to the mismatch in environments. This is further complicated by the fact that production environments are often running larger applications (e.g., software as a service (SaaS), CRM, …), where the data science-based components are a subset of the overall functionality and user experience.

Amazon and Netflix are perfect examples. Amazon is running a production eCommerce application (amongst many other things), while Netflix runs a production SaaS video application. Applications like this are often written using languages and frameworks such as Java, C#/.NET, and so on.

Given this, deploying data science code and predictive models to production becomes a matter of either plugging functionality into the existing framework (e.g., as a service or microservice), or translating code into something more compatible with that already in use.
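To make the "plug in as a service" option concrete, here is a minimal sketch of wrapping a model behind a request/response handler. The model, its weights, and the field names are all hypothetical stand-ins; in practice the handler would be mounted in a web framework or serverless function, and the model would be a real trained artifact.

```python
import json

# Hypothetical "trained model": a simple linear scorer. In a real
# deployment this would be a serialized model loaded at startup.
WEIGHTS = {"age": 0.3, "income": 0.0001}
BIAS = -1.0

def predict_handler(request_body: str) -> str:
    """Parse a JSON request, score it with the model, return JSON.

    This is the microservice boundary: the larger application only
    needs to send a request and read a response, regardless of what
    language or framework the rest of the system uses.
    """
    features = json.loads(request_body)
    score = BIAS + sum(WEIGHTS[k] * features[k] for k in WEIGHTS)
    return json.dumps({"score": round(score, 4)})

print(predict_handler('{"age": 30, "income": 50000}'))  # {"score": 13.0}
```

Isolating the model behind a request/response contract like this is what lets a Python-based data science component live inside a Java or C#/.NET production application.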

Single Server vs Distributed Computing

A single server is just as it sounds. It is a single machine that handles all requests and processing by itself, and is running a server operating system.

Distributed computing and distributed systems are terms used to describe systems where computer nodes (i.e., machines) coordinate and communicate with each other to achieve common goals, usually by leveraging computer clusters. Distributed computing also typically involves languages and packages associated with so-called big data.

In order to handle increasing memory, processing, concurrent load/request, and large data volume/velocity requirements for scalability purposes, servers are either scaled up (aka scaled vertically) or scaled out (aka scaled horizontally). Scaling up means increasing the memory capacity and processing power of a single machine (and sometimes disk storage as well), whereas scaling out means adding additional low-cost, commodity computing resources to distribute the workload.

Distribution amongst nodes, including coordination, execution, monitoring, and failover, is carried out by specialized software, e.g., Spark, Hadoop, load balancers, and so on.

Batch vs Mini-batch vs Real-time vs Near Real-time Processing

The term processing includes things like data cleaning and transformation, data migration and integration, calculating new metrics and features from raw data, or can be part of an extract transform load (ETL/ELT) process, for example.
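As an illustration of the transform step in an ETL-style pipeline, here is a minimal sketch that drops records with missing values, normalizes types, and derives a new field from the raw ones. All field names and values are hypothetical.

```python
# Hypothetical raw records, as they might arrive from an extract step.
raw_records = [
    {"price": "19.99", "qty": "3"},
    {"price": None, "qty": "1"},   # missing value: will be dropped
    {"price": "5.00", "qty": "2"},
]

def transform(records):
    """Clean records and compute a derived 'revenue' feature."""
    cleaned = []
    for r in records:
        if r["price"] is None or r["qty"] is None:
            continue  # data cleaning: skip incomplete records
        price, qty = float(r["price"]), int(r["qty"])  # type normalization
        cleaned.append({"price": price, "qty": qty, "revenue": price * qty})
    return cleaned

out = transform(raw_records)
print(out[1]["revenue"])  # 10.0
```

In a real pipeline the same shape of function would sit between the extract and load stages, with the records streaming from and to persistent storage.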

In the context of machine learning and artificial intelligence, processing can also include passing data through a predictive model (numeric, classification, or grouping), recommender system, anomaly detection system, clustering engine, recognition framework, scoring or ranking system, and so on.

Batch (offline) processing refers to the situation where a batch of data, i.e., an entire dataset or a smaller mini-batch (subset) of it, is processed. This can happen either on-demand or on a schedule, and through a single or recurring job in the scheduled case. This is not to be confused with batch learning, which we’ll discuss in the next section.
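The mini-batch idea can be sketched in a few lines: records are accumulated into fixed-size batches, and each batch is processed as a unit. The processing function here is a trivial stand-in for real per-batch work.

```python
def process_in_batches(records, batch_size, process_fn):
    """Process an iterable of records in fixed-size mini-batches."""
    batch, results = [], []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            results.extend(process_fn(batch))  # process one full mini-batch
            batch = []
    if batch:  # final partial batch, if any
        results.extend(process_fn(batch))
    return results

# Example: a stand-in "processing" step that squares each value.
out = process_in_batches(range(10), batch_size=4,
                         process_fn=lambda b: [x * x for x in b])
print(out)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The same structure applies whether the batches come from a file, a database cursor, or a queue, and whether the job runs once on-demand or on a recurring schedule.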

Real-time (online) processing deals with processing data as it moves through the data pipeline in real-time, and often the data moves directly from the processing stage into presentation of some sort (dashboard, notification, report, …), persistent storage (database), and/or as a response to a request (CC approval, ATM transaction, …). Real-time processing is characterized by very fast processing where minimal time delays are critical.

Near real-time (NRT, online) processing on the other hand, differs in that the delay introduced by data processing and movement means that the term real-time is not quite accurate. Some sources differentiate near real-time as being characterized by a delay of several seconds to several minutes, and where the delay is acceptable for the given application.

Offline (aka Batch) vs Online vs Automated Learning

In machine learning and artificial intelligence tasks, high-performing deliverables (e.g., a predictive model, recommender system, scoring engine) are trained, as discussed, in a local or remote development environment. Note that predictive models are sometimes referred to as predictors or estimators, and individual features of a model can also be referred to as predictors.

Once developed and tested, the deliverable is then deployed to a production environment where all new and unseen data passes through it in order to generate a high-performing result (generalization). The processing and result can be delivered in a real-time or near real-time fashion, as previously discussed.

Because data, and the underlying information on which the deliverables are based, can change due to trends, behaviors, and other factors, the deliverables should be recreated on a recurring interval (cadence). This allows them to capture these changes and maintain a targeted level of performance. Without doing this, performance degradation or drift is common.

Offline learning (aka batch learning) is when deliverables are trained outside of production on an entire dataset, or on a subset of the data (mini-batch). Part of this involves storage and access of data from data stores such as an RDBMS, a NoSQL database, a data warehouse, or Hadoop, for example.

Deliverable validation (e.g., cross-validation) and optimization iterations (e.g., grid search, hyperparameter optimization, …) are typically carried out before deploying the optimized deliverable to production. This process is repeated as needed to maintain model performance in production.
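The optimization iteration can be sketched as an exhaustive search over a small hyperparameter grid, keeping the best-scoring combination. The grid, the scoring function, and its optimum are illustrative stand-ins for cross-validated model performance; in practice one would typically reach for a library routine such as scikit-learn's GridSearchCV.

```python
from itertools import product

# Hypothetical hyperparameter grid.
param_grid = {"alpha": [0.1, 1.0, 10.0], "degree": [1, 2]}

def validation_score(alpha, degree):
    """Stand-in for cross-validated performance; higher is better.

    This toy function peaks at alpha=1.0, degree=2.
    """
    return -((alpha - 1.0) ** 2) - (degree - 2) ** 2

best_params, best_score = None, float("-inf")
for alpha, degree in product(param_grid["alpha"], param_grid["degree"]):
    score = validation_score(alpha, degree)
    if score > best_score:
        best_params, best_score = {"alpha": alpha, "degree": degree}, score

print(best_params)  # {'alpha': 1.0, 'degree': 2}
```

Only the winning deliverable from a search like this would be deployed to production; the search itself stays in the development environment.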

Online learning on the other hand describes the situation when re-training and re-creation of the deliverable, along with performance assessment, occurs online in a production environment with production data. This process is usually carried out on a recurring interval (cadence) that can be minutes, days, or longer.

Online learning is intended to address maintenance and upkeep of target performance based on changing data, as in the batch learning case, but also to incrementally update and improve deployed deliverables without retraining on an entire dataset. Because of its online nature, this also requires performant and highly available data storage and access, which involves network communication, latency, and the availability of network resources.

Online learning is associated with the concept of online algorithms, where algorithms receive their input over time, and not all at once as in the offline case. Online learning is also associated with incremental learning techniques, which is when models continue learning as new data arrives, while also retaining previously learned information.
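As a toy illustration of incremental learning, here is an estimator that updates one observation at a time while retaining everything it has learned so far, rather than recomputing over the full dataset. It is a stand-in for algorithms that expose a partial_fit-style interface.

```python
class OnlineMean:
    """Incrementally estimate a mean, one observation at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def partial_fit(self, x: float) -> None:
        """Update the estimate with one new data point.

        Previously learned information (the running mean) is retained;
        no past data needs to be stored or revisited.
        """
        self.n += 1
        self.mean += (x - self.mean) / self.n

est = OnlineMean()
for x in [2.0, 4.0, 6.0]:  # data arriving over time
    est.partial_fit(x)
print(est.mean)  # 4.0
```

Real online learners (e.g., stochastic gradient methods) follow the same pattern: each arriving observation nudges the model state without a full retrain.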

Netflix also defines a nearline computation technique that includes model training, and in which learning and results generation are carried out as in the online case, but results are stored and then updated in an asynchronous and incremental fashion. Netflix recommends using a hybrid combination of offline, nearline, and online processing techniques, although they obviously have scaling requirements beyond those of many other applications.

Automated learning is where model training, validation, performance assessment, and optimization are automated on a regular cadence, and then new and improved models are deployed to replace existing models in production. The deployment process can be either automated or manual for an additional safety check.

Automated learning obviously requires some form of model performance tracking and comparison framework to determine whether automatically generated and validated models are an improvement relative to a previous, or base, model.
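The comparison step might be as simple as a gated promotion check. The metric, the improvement margin, and the function name here are all illustrative; a real framework would track metrics over time and across model versions.

```python
def should_promote(candidate_score: float, baseline_score: float,
                   min_improvement: float = 0.01) -> bool:
    """Promote a candidate model only if it beats the baseline by a margin.

    Scores are assumed to be 'higher is better' (e.g., accuracy). The
    margin guards against replacing a proven model over metric noise.
    """
    return candidate_score >= baseline_score + min_improvement

print(should_promote(0.87, 0.85))   # True: clear improvement
print(should_promote(0.855, 0.85))  # False: within the noise margin
```

Whether the final deployment is then triggered automatically or left as a manual approval is the safety-check choice described above.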

Solution Performance, Availability, and Computing Resources

The performance of a production solution (e.g., a web application, API, or microservice) is highly dependent on many factors.

We’ve discussed some of these factors already, which include required system memory (RAM), processing power (CPU, cores), physical data storage (disk), scalability, availability, and so on. We’ve also discussed distributed programming at a high level, but not specifically in the context of production analytics, machine learning, artificial intelligence, and big data.

In order to handle very large data storage and querying requirements, distributed database and querying systems have been developed. Common distributed storage technologies include Hadoop (HDFS) and HBase in the context of big data, while many NoSQL, search, and graph databases are also scalable and distributed, including MongoDB, Cassandra, Elasticsearch, and Neo4j. Common data querying and processing technologies include Spark, Storm, Hive, Pig, and Elastic MapReduce (EMR).

In order to increase processing power, particularly for training neural networks and in deep learning applications, graphics processing units (GPUs) are increasingly being used in place of central processing units (CPUs). This is due to the relatively large number of cores that GPUs can have, along with their exceptional capabilities surrounding matrix computations and parallelization.

A technique for scaling out to handle a large number of concurrent requests that require processing data through a production model (e.g., making a prediction) is to deploy the same model to many different machines and distribute the requests through routing (e.g., load balancing), or to spin up ephemeral, serverless workers using a technology such as AWS Lambda.
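The routing idea can be sketched as simple round-robin assignment of requests to identical model replicas. The worker names are illustrative, and a real load balancer would also account for worker health and current load.

```python
from itertools import cycle

# Hypothetical pool of machines, each running an identical copy of
# the production model.
workers = cycle(["model-server-1", "model-server-2", "model-server-3"])

# Route five incoming prediction requests across the pool in turn.
assignments = [next(workers) for _ in range(5)]
print(assignments)
# ['model-server-1', 'model-server-2', 'model-server-3',
#  'model-server-1', 'model-server-2']
```

Because each replica serves the same model, any worker can answer any request, which is what makes this horizontal scaling pattern work.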

Memory-wise, often situations arise where the data required for a processing or learning task is larger than a single computer’s system memory (RAM). Out-of-core or external memory algorithms (e.g., external sorting, external hashing) are often used in these cases, as well as incremental learning techniques. Scikit-learn and Vowpal Wabbit both offer out-of-core and incremental learning algorithms.
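As a toy illustration of the out-of-core idea, here is an aggregate computed over a data stream in fixed-size chunks, so that only one chunk is ever held in memory. io.StringIO stands in for a file too large to load at once.

```python
import io

# Stand-in for a file whose full contents would not fit in RAM:
# one integer per line, 1 through 100.
data_file = io.StringIO("\n".join(str(i) for i in range(1, 101)))

def chunked_sum(fileobj, chunk_size=10):
    """Sum values from a line-oriented stream, one chunk at a time."""
    total, chunk = 0, []
    for line in fileobj:
        chunk.append(int(line))
        if len(chunk) == chunk_size:
            total += sum(chunk)  # fold the chunk in, then discard it
            chunk = []
    return total + sum(chunk)    # handle any final partial chunk

total = chunked_sum(data_file)
print(total)  # 5050
```

External sorting, external hashing, and out-of-core learning algorithms all rely on this same discipline of bounded per-chunk memory.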

A couple of final and very interesting things to note. Netflix points out that “improvements that support scalability and agility have a greater impact than RMSE”, where RMSE here refers to the performance metric they use.

They further note that “What Netflix finds most important is usage, user experience, user satisfaction, and user retention, which all align with their business goals better than RMSE.”

From a product perspective, these statements are extremely interesting, in that optimization for Netflix is much more user-centric, as opposed to performance-focused. This makes sense, particularly given that Netflix can afford some level of non-perfect performance since they’re not using machine learning for cancer diagnostics, for example.

Summary

We’ve now had a solid overview of the many aspects of developing, testing, optimizing, and deploying production-ready machine learning and artificial intelligence-based solutions.

The next part of this series will be a detailed discussion of recommendations for languages, packages, and more for common use cases and scenarios as covered in this article.