Analytics Manager’s Guide to Choosing a Data Science Platform

Elena Lowery · May 14

The term “data science” started circulating in the IT world around 2010.

Less than 10 years later, the data science platform landscape is as broad as that of many mature software categories.

Established software vendors have revamped their offerings, while newcomers and open source vendors continue to gain momentum.

It’s great having choices, but how do you avoid analysis paralysis in choosing the right data science platform for your business?

Some companies use a Request for Proposal (RFP) to get a consistent overview of a platform’s software features.

RFPs are a good starting point, but in most cases they are just a checklist because they don’t provide the details.

Data science is a complex domain that changes at a fast pace.

Understanding the details behind RFP answers is not just for the data scientist.

Analytics managers need to make sure that the platform meets the needs of the entire organization.

In this article I will review the evaluation criteria that analytics managers should consider when choosing a data science platform.

Requirements of the platform users

This may seem like a trivial category — the data science platform is for data scientists, but the answer is not that simple.

Data scientist has become a generic title for anybody who works with data.

Some data scientists are programmers who code with Python machine learning APIs, while others are statisticians who use R.

Some companies have citizen data scientists — domain experts who understand business use cases but don’t have coding skills.

Finally, traditional IT roles, such as data engineers and IT architects, can also be users of the data science platform.

Typical needs of open source data scientists are:

- Access to the latest open source tools and libraries
- Scalability and performance that may not be provided by free open source tools
- Compliance with standards set by IT

The last bullet point, compliance with IT standards, is usually imposed by the IT department rather than requested by data scientists themselves.

Data scientists may not care how data is accessed or where it’s located, but they will not be able to work with data unless the data science platform meets IT requirements.

Growing the number of citizen data scientists is one of the quickest ways to bring data science to the line of business.

The key platform requirements for citizen data scientists are:

- Ability to implement use cases without coding
- Automation and patterns that shorten the learning curve
- Collaboration with open source data scientists

The next potential user of the data science platform is a data engineer.

The data science team is not the only consumer of data; the majority of data transformation will most likely be done outside of the data science platform.

However, the data science team often has specific data requirements, and either a data scientist or a data engineer will work on data preparation.

Data engineer requirements are:

- Easy access to data
- Open source and automation tools
- Compliance with standards set by IT

At a high level, “easy access to data” may seem like a simple requirement — database drivers and APIs for accessing other data sources have been available for many years.

What makes easy access difficult are changes in security requirements and the migration to cloud infrastructure.

Various authentication and authorization requirements, as well as hybrid data architectures, require a sophisticated implementation to make easy access possible.

For data transformation, data engineers need both a coding environment (Python/R) and tools that can optimize the data preparation process.

These tools are usually visual tools that implement the most common tasks in data cleansing and preparation.
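As an illustration, the kind of data access and preparation described above can be sketched in plain Python. This is a minimal, hypothetical example — the table, columns, and cleansing rules are illustrative, not drawn from any particular platform:

```python
import sqlite3

# Hypothetical example: pull raw records from a database and apply two
# common cleansing steps (trim whitespace, drop rows with missing values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("  Alice ", "EMEA"), ("Bob", None), ("Carol", "APAC")],
)

rows = conn.execute("SELECT name, region FROM customers").fetchall()

# Cleansing: strip whitespace and drop incomplete records
cleaned = [(name.strip(), region) for name, region in rows if name and region]

print(cleaned)  # -> [('Alice', 'EMEA'), ('Carol', 'APAC')]
```

In practice these steps are exactly what visual preparation tools automate; the value of a coding environment is that anything the visual tool doesn’t cover can still be expressed directly.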

IT architects who work on the deployment of analytics can also be users of the data science platform.

Architects are responsible for making data science assets available to business users.

The data science team can create a variety of analytics assets — from an ad hoc project that investigates a specific issue to a predictive model that should be deployed in real time.

It’s the architect’s responsibility to make sure that all types of analytics assets can be deployed in the data science platform.

The main requirements of IT architects are:

- Multiple deployment options
- Standards-based integration with line-of-business applications
- Compliance with standards set by IT

In summary, to come up with a comprehensive list of data science platform requirements, identify the platform’s users, current and future, and capture their needs.

Requirements of the IT Team

The IT team has the difficult task of providing continuous service at a low cost.

IT is also responsible for security, performance, and scalability of software that the business has purchased.

Many IT organizations try to achieve these goals by setting standards.

As we reviewed in the user requirements section, the data science platform must comply with IT requirements.

Typical requirements of an IT organization are:

- Operating system and environment support (cloud, on premise, or hybrid cloud)
- Security standards: authentication, authorization, and data access
- Architecture (for example, container-based)
- An easy way to scale the platform
- Data access without engaging IT
- Low cost of ownership
- Flexibility in platform deployment

Most of these requirements have existed for several years, but flexibility in platform deployment is relatively new.

By flexibility we mean the option to deploy software on premise, in the cloud, or on desktop.

Many companies operate in a hybrid environment.

Some applications and data sources have been moved to the cloud, while others are on premise.

Having a data science platform that seamlessly fits into the company’s environment is important for successful adoption and deployment.

Requirements of analytics managers

Analytics managers are not the users of the data science platform, but they are responsible for making sure that the platform meets the organization’s needs.

In addition to the requirements of data scientists and IT, an analytics manager should consider the following selection criteria.

Deploy existing analytics assets in the new data science platform.

If the company already has analytics assets (for example, Python/R scripts, models, and notebooks), it should be possible to deploy them easily in the new data science platform.
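One simple way to keep such assets portable is to store model parameters in a vendor-neutral format. The sketch below is hypothetical — a toy linear model serialized to JSON so it can be reloaded and scored on a different platform; the field names are illustrative:

```python
import json

# Hypothetical portable model: a toy linear model whose parameters live
# in plain JSON rather than a platform-specific binary format.
model = {
    "type": "linear_regression",
    "coefficients": [0.42, -1.3, 2.7],
    "intercept": 0.05,
    "features": ["age", "income", "tenure"],
}

# Serialize for transfer between platforms
payload = json.dumps(model)

# On the target platform, reload and score a record
restored = json.loads(payload)

def predict(m, x):
    """Dot product of coefficients and feature values, plus intercept."""
    return sum(c * v for c, v in zip(m["coefficients"], x)) + m["intercept"]

score = predict(restored, [1.0, 2.0, 3.0])
print(round(score, 2))  # -> 5.97
```

Standards such as PMML and ONNX apply the same idea to more complex models.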

Leverage existing investment in Hadoop.

Most Hadoop platforms not only store data, but also provide a highly scalable runtime environment, Spark.

The data science platform should be able to take advantage of remote Spark execution.

Remote Spark execution often improves performance when data is also stored in Hadoop.
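As a sketch, submitting a job for remote execution on a Hadoop cluster’s Spark runtime might look like the following. The cluster settings, script name, and HDFS path here are hypothetical:

```shell
# Hypothetical spark-submit invocation: run the job under the cluster's
# YARN resource manager so computation happens next to the data in HDFS.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  prepare_features.py hdfs:///data/raw/customers.parquet
```

A data science platform that supports remote Spark execution generates or dispatches this kind of job on the user’s behalf, rather than pulling the data out of Hadoop first.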

Whenever possible, avoid vendor lock-in.

While a data science platform may claim to be built on open source, that claim doesn’t guarantee you won’t need proprietary APIs to take full advantage of the platform.

Almost every data science platform will have a custom API to provide access to specific functionality.

An analytics manager needs to understand the impact of using these APIs on portability of analytic assets.

Improve data science team productivity.

Productivity can be improved by using visual tools that automate various tasks in the data science lifecycle.

Visual tools can be used by both beginner and advanced data scientists.

One of the main goals of visual tools is to automate repetitive tasks.