The Future is Multi-threaded Processing Using SAS® Viya®

When I first began researching how to run SAS 9 code as multi-threaded in SAS Viya, I decided to compile all the relevant information and detail the internal processing rules that guide how the SAS Viya Cloud Analytic Services (CAS) server handles code and data. What I found was that there is a simple set of guidelines which, if followed, help ensure that most of your existing SAS 9 code will run multi-threaded. I have gathered a lot of great information into a single white paper that is available today called “Getting Your SAS 9 Code to Run Multi-threaded in SAS Viya”.

Starting with the basic distinctions between single and parallel processing is key to understanding why and how some of the parallel processing changes have been implemented. You see, SAS Viya technology constructs code so that everything runs in pooled memory using multiple processors. Redefining SAS for this parallel processing paradigm has led to huge reductions in program run times, as well as concomitant increases in accuracy for a variety of machine learning techniques. Using SAS Viya products helps revolutionize how we think about undertaking large-scale work because now we can complete so many more tasks in a fraction of the time it took before.

The new SAS Viya products bring a ton of value compared to other choices you might have in the analytics marketplace. Unfortunately, most open source libraries and packages, especially those developed for use in Python and R, are limited to single-threading. SAS Viya offers a way forward by letting you code in these languages using an alternative set of SAS objects that can run as parallel, multi-threaded, distributed processes. The real difference is in the shared memory architecture, which is not the same as the parallel, distributed processing that you hear claimed by most Hadoop and cloud vendors. Even though parallel, distributed processing is faster than single-threading, it hits a performance wall that is far below what pooled and persisted data provides when using multi-threaded techniques and code.

For these reasons, I believe that SAS Viya is the future of data/decision science, with shared memory running against hundreds if not thousands of processors, and returning results back almost instantaneously. And it’s not for just a handful of statistical techniques. I’m talking about running every task in the analytics lifecycle as a multi-threaded process, meaning end-to-end processing through data staging, model development and deployment, all potentially accomplished through a point-and-click interface if you choose. Give SAS Viya a try and you will see what I am talking about. Oh, and don’t forget to read my technical white paper that provides a checklist of all the things that you may need to consider when transitioning your SAS 9 code to run multi-threaded in SAS Viya!

Questions or Comments?

If you have any questions about the details found in this paper, feel free to leave them in the comments field below.

There are many compelling reasons existing SAS users might want to start integrating SAS Viya into their SAS9 programs and applications. For me, it comes down to ease-of-use, speed, and faster time-to-value. With the ability to traverse the (necessarily iterative) analytics lifecycle faster than before, we are now able to generate output quicker – better supporting vital decision-making in a reduced timeframe. In addition to the positive impacts this can have on productivity, it can also change the way we look at current business challenges and how we design possible solutions.

Earlier this year I wrote about how SAS Viya provides a robust analytics environment to handle all of your big data processing needs. Since then, I’ve been involved in testing the new SAS Viya 3.3 software that will be released near the end of 2017 and found some additional advantages I think warrant attention. In this article, I rank order the main advantages of SAS Viya processing and new capabilities coming to SAS Viya 3.3 products. While the new SAS Viya feature list is too long to list everything individually, I’ve put together the top reasons why you might want to start taking advantage of SAS Viya capabilities of the SAS platform.

1. Multi-threaded everything, including the venerable DATA step

In SAS Viya, everything that can run multi-threaded does. This is the single most important aspect of the SAS Viya architecture for existing SAS customers. As part of this new holistic approach to data processing, SAS has enabled the highly flexible DATA step to run multi-threaded, requiring very little modification of code in order to begin taking advantage of this significant new capability (more on that in a soon-to-be-released blog post). Migrating to SAS Viya is especially important in cases where long-running jobs consist of very long DATA steps that act as processing bottlenecks because of older single-threaded configurations.

2. No sorting necessary!

While that is not 100% true, most sort routines can be removed from your existing SAS programs. Ask yourself the question: “What portion of my runtime is due strictly to sorting?” The answer is likely around 10-25%, maybe more. In general, the concept of sorting goes away with in-memory processing. SAS Viya does its own internal memory shuffling as a replacement. The SAS Viya CAS engine takes care of partitioning and organizing the data so you don’t have to. So, take those sorts out of your existing code!
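To make the idea concrete, here is a minimal Python sketch (not SAS code) of why grouped work does not need a global sort: rows are routed to workers by hashing the grouping key, so every row for a given key lands in the same partition regardless of input order. The function name `partition_by_key`, the worker count, and the sample rows are all hypothetical, and this is only an analogy for the kind of partitioning described above, not a description of CAS internals.

```python
from collections import defaultdict
from zlib import crc32


def partition_by_key(rows, key, n_workers):
    """Route each row to a worker by hashing its key; no sort involved."""
    partitions = [defaultdict(list) for _ in range(n_workers)]
    for row in rows:
        worker = crc32(str(row[key]).encode("utf-8")) % n_workers
        partitions[worker][row[key]].append(row)
    return partitions


# Unsorted input: the two regions are interleaved.
rows = [
    {"region": "West", "sales": 10},
    {"region": "East", "sales": 7},
    {"region": "West", "sales": 5},
    {"region": "East", "sales": 3},
]
partitions = partition_by_key(rows, "region", n_workers=2)

# Each group's total is computed inside whichever partition owns it.
totals = {
    group: sum(r["sales"] for r in members)
    for part in partitions
    for group, members in part.items()
}
```

Because a key always hashes to the same partition, each group can be summarized locally without ever ordering the full table.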

3. VARCHAR informat/format

Not available in SAS 9.4, the VARCHAR informat/format allows you to store byte information without having to allocate room for blank spaces. Because storage for columnar (input) values varies by row, you have the potential to achieve an enormous amount of (blank space) savings, which is especially important if you are using expensive (fast) disk storage space. This represents huge value in terms of potential data storage size reduction.
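A rough way to see the savings is to compare padded and unpadded byte counts. The Python sketch below is an illustration under stated assumptions: the 200-byte column width, the sample values, and the 4-byte per-value overhead are made up for the example and are not SAS figures.

```python
def fixed_width_bytes(values, width):
    """Bytes used when every value is padded to `width` bytes."""
    return len(values) * width


def varying_width_bytes(values, overhead_per_value=4):
    """Bytes used when each value stores only its own UTF-8 bytes.

    overhead_per_value is an assumed length/offset cost, not a SAS figure.
    """
    return sum(len(v.encode("utf-8")) + overhead_per_value for v in values)


comments = ["ok", "", "late delivery", "n/a"]
fixed = fixed_width_bytes(comments, width=200)   # 4 rows * 200 bytes
varying = varying_width_bytes(comments)          # actual bytes + overhead
savings = 1 - varying / fixed                    # fraction of space saved
```

The shorter and more variable the stored values are relative to the declared width, the larger the fraction saved.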

4. Reduced I/O in the form of data reads and writes from Hive/HDFS and Teradata to CAS memory

SAS Viya can leverage Hive/HDFS and Teradata platforms by loading (lifting) data up and writing data back down in parallel using CAS pooled memory. Data I/O, namely reading data from disk and converting it into a SAS binary format needed for processing, is the single most limiting factor of SAS 9.4. Once you speed up your data loading, especially for extremely large data sets, you will be able to generate faster time to results for all analyses and projects.

5. Persisted data can stay in memory to support multiple users or processing steps

Similar to SAS LASR, CAS can be structured to persist large data sets in memory, indefinitely. This allows users to access the same data at the same time and eliminates redundancy and repetitive I/O, potentially saving valuable compute cycles. Essentially, you can load the data once, and any number of people (or processing steps) can then reuse it as many times as needed.

6. Popular machine learning (ML) techniques

All the most popular ML techniques are represented, giving you the flexibility to customize model tournaments to include those techniques most appropriate for your given data and problem set. We also provide assessment capabilities, thus saving you valuable time in getting the types of information you need to make valid model comparisons (like ROC charts, lift charts, etc.) and pick your champion models. We do not have extreme Gradient Boosting, Factorization Machines, or a specific Assessment procedure in SAS 9.4. Also, GPU processing is supported in SAS Viya 3.3 for Deep Neural Networks and Convolutional Neural Networks (this has not been available previously).

7. In-memory TRANSPOSE

The task of transposing data amounts to about 80% of any model building exercise, since predictive analytics requires a specialized data set called a ‘one-row-per-subject’ Analytic Base Table (ABT). SAS Viya allows you to transpose in a fraction of the time that it used to take to develop the critical ABT outputs. It is a phenomenal time-saving procedure that now runs entirely multi-threaded, in-memory.
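For readers unfamiliar with the ‘one-row-per-subject’ shape, the following plain-Python sketch shows what a transpose from long to wide does. It is not the SAS TRANSPOSE procedure; the function name, column names, and sample rows are hypothetical.

```python
from collections import defaultdict


def to_one_row_per_subject(long_rows, id_col, name_col, value_col):
    """Pivot long (id, name, value) rows into one wide row per subject."""
    wide = defaultdict(dict)
    for row in long_rows:
        wide[row[id_col]][row[name_col]] = row[value_col]
    columns = sorted({row[name_col] for row in long_rows})
    # Fill measures a subject never reported with None so rows align.
    return [
        {id_col: subject, **{c: values.get(c) for c in columns}}
        for subject, values in wide.items()
    ]


long_rows = [
    {"cust": 1, "measure": "visits", "value": 4},
    {"cust": 1, "measure": "spend", "value": 120.0},
    {"cust": 2, "measure": "visits", "value": 1},
]
abt = to_one_row_per_subject(long_rows, "cust", "measure", "value")
```

Each customer ends up as a single row with one column per measure, which is the layout most modeling routines expect as input.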

8. APIs!!!

The ability to code from external interfaces gives coders the flexibility they need in today’s fast-moving programming world. SAS Viya supports native language bindings for Lua, Java, Python and R. This means, for example, that you can launch SAS processes from a Jupyter Notebook while staying within a Python coding environment. SAS also provides a REST API for use in data science and IT departments.

9. Improved model build and deployment options

The core SAS Viya machine learning techniques support auto-tuning. SAS has the most effective hyper-parameter search and optimization routines, allowing data scientists to arrive at the correct algorithm settings with higher probability and speed, giving them better answers with less effort. And because ML scoring code output is significantly more complex, SAS Viya Data Mining and Machine Learning allows you to deploy compact binary score files (called Astore files) into databases to help facilitate scoring. These binary files do not require compilation and can be pushed to ESP-supported edge analytics. Additionally, training within event streams is being examined for a future release.

10. Tons of new SAS visual interface advantages

A. Less coding – SAS Viya acts as a code generator, producing batch code for repeatability and score code for easier deployment. Both batch code and score code can be produced in a variety of formats, including SAS, Java, and Python.

B. Improved data integration between SAS Viya visual analytics products – you can now edit your data in-memory and pass it effortlessly through to reporting, modeling, text, and forecasting applications (new tabs in a single application interface).

C. Ability to compare modeling pipelines – now data scientists can compare champion models from any number of pipelines (think of SAS9 EM projects or data flows) they’ve created.

D. Best practices and white box templates – once only available as part of SAS 9 Rapid Predictive Modeler, Model Studio now gives you easy access to basic, intermediate and advanced model templates.

E. Reusable components – Users can save their best work (including pipelines and individual nodes) and share it with others. Collaborating is easier than ever.

11. Data flexibility

You can load big data without having all that data fit into memory. Previously, with the HPA or LASR engines, the memory environment had to be sized exactly to fit all the data. CAS technology removes that requirement, which is a really nice feature.

12. Overall consolidation and consistency

SAS Viya seeks to standardize on common algorithms across every analytic technique so that you don’t get different answers when attempting to do things using alternate procedures or methods. For instance, our deployment of Stochastic Gradient Descent is now the same in every technique that uses that method. Consistency also applies to the interfaces, as SAS Viya attempts to standardize the look-and-feel of various interfaces to reduce your learning curve when using a new capability.

The net result of these Top 12 advantages is that you have access to state-of-the-art technology, jobs finish faster, and you ultimately get faster time-to-value. While this idea has been articulated in some of the above points, it is important to re-emphasize because SAS Viya benefits, when added together, result in higher throughputs of work, a greater flexibility in terms of options, and the ability to keep running when other systems would have failed. You just have a much greater efficiency/productivity level when using SAS Viya as compared to before. So why not use it?

I have been using the SAS Viya environment for just over six months now and I absolutely love it. As a long-time SAS coder and data scientist I’m thrilled with the speed and greater accuracy I’m getting out of a lot of the same statistical techniques I once used in SAS9. So why would a data scientist want to switch over to the new SAS Viya platform? The simple response is “better, faster answers.” There are some features that are endemic to the SAS Viya architecture that provide advantages, and there are also benefits specific to different products as well. So, let me try to distinguish between these.

SAS Viya Platform Advantages

To begin, I want to talk about the SAS Viya platform advantages. For data processing, SAS Viya uses something called the CAS (Cloud Analytic Services) server – which takes the place of the SAS9 workspace server. You can still use your SAS9 installation, as SAS has made it easy to work between SAS9 and SAS Viya using SAS/CONNECT, a feature that will be automated later in 2017.

Parallel Data Loads

One thing I immediately noticed was the speed with which large data sets are loaded into SAS Viya memory. Using Hadoop, we can stage input files in either HDFS or Hive, and CAS will lift that data in parallel into its pooled memory area. The same data conversion is occurring, like what happened in SAS9, but now all available processors can be applied to load the input data simultaneously. And speaking of RAM, not all of the data needs to fit exactly into memory as it did with the LASR and HPA procedures, so much larger data sets can be processed in SAS Viya than you might have been able to handle before.

Multi-threaded DATA step

After initially loading data into SAS Viya, I was pleased to learn that the SAS DATA step is multi-threaded. Most of your SAS9 programs will run ‘as is’; however, the multi-processing really only kicks in when the system finds explicit BY statements or partition statements in the DATA step code. Surprisingly, you no longer need to sort your data before using BY statements in Procs or DATA steps. That’s because there is no Proc Sort anymore: sorting is a thing of the past, which certainly takes some getting used to in SAS Viya. So for all of those times where I had to first sort data before I could use it, and then execute one or more DATA steps, that all transforms into a more simplified code stream. Steven Sober has some excellent code examples of the DATA step running in full-distributed mode in his recent article.
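The role of the BY statement can be pictured as follows: once rows are split into BY groups, each group can be handed to a separate thread because no group depends on another. This Python sketch is only an analogy for that structure, not SAS DATA step code; the function names, the `acct` and `amount` columns, and the worker count are hypothetical, and Python threads are used here to show the shape of the work rather than to demonstrate real speedups.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def running_total(group_rows):
    """Per-group logic that only needs the rows of one BY group."""
    total = 0
    out = []
    for row in group_rows:
        total += row["amount"]
        out.append({**row, "running_total": total})
    return out


def process_by_groups(rows, by, max_workers=4):
    """Split rows into BY groups, then run each group on a worker thread."""
    groups = defaultdict(list)
    for row in rows:  # arrival order within a group is kept; no sort step
        groups[row[by]].append(row)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(running_total, groups.values()))
    return dict(zip(groups.keys(), results))


rows = [
    {"acct": "A", "amount": 5},
    {"acct": "B", "amount": 1},
    {"acct": "A", "amount": 2},
]
result = process_by_groups(rows, by="acct")
```

Without a grouping key there is no obvious way to divide the rows safely, which is consistent with the multi-processing only kicking in when a BY or partition statement is present.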

Open APIs

While all of SAS Viya’s graphical user interfaces are designed with consistency of look and feel in mind, the R&D teams have designed it to allow almost any front-end or REST service to submit commands and receive results from either CAS or its corresponding micro-service architecture. Something new I had to learn was the concept of a CAS action set. CAS action sets consist of a number of separate actions which can be executed singly or with other actions belonging to the same set. The cool thing about CAS actions is that there is one for almost any task you can think about doing (kind of like a blend between functions and Procs in SAS9). In fact, all of the visual interfaces SAS deploys utilize CAS actions behind the scenes, and most GUIs will automatically generate code for you if you do not want to write it.

But the real beauty of CAS actions is that you can submit them through different coding interfaces using the open Application Programming Interfaces (APIs) that SAS has written to support external languages like Python, Java, Lua and R (check out GitHub on this topic). The standardization aspect of using the same CAS action within any type of external interface looks like it will pay huge dividends to anyone investing in this approach.

Write it once, re-use it elsewhere

I think another feature that old and new users alike will adore is the “write-it-once, re-use it” paradigm that CAS actions support. Here’s an example of code that was used in Proc CAS, and then used in a Jupyter notebook using Python, followed by an R/REST example.

See how nearly identical these three are to one another? That is the beauty of SAS Viya. Using a coding approach like this means that I do not need to rely exclusively on finding SAS coding talent anymore. Younger coders who usually know several open source languages take one look at this, understand it, and can easily incorporate it into what they are already doing. In other words, they can stay in coding environments that are familiar to them, whilst learning a few new SAS Viya objects that dramatically extend and improve their work.
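As a rough illustration of the “write-it-once” idea (and not a reproduction of the examples referred to above), the Python sketch below takes one action name and one parameter structure and renders them in three approximate forms. The `simple.summary` action and the table name `cars` are used purely as placeholders; the rendered PROC CAS text, the `conn` object name, and the REST path shape are assumptions about what such calls typically look like and should be checked against the SAS documentation before use.

```python
import json


def _casl(value):
    """Render a Python value in an approximate CASL-style literal."""
    if isinstance(value, dict):
        inner = ", ".join(f"{k}={_casl(v)}" for k, v in value.items())
        return "{" + inner + "}"
    return json.dumps(value)  # strings come out double-quoted


def render_proc_cas(action, params):
    """Approximate PROC CAS text for an action call (illustrative only)."""
    args = ", ".join(f"{k}={_casl(v)}" for k, v in params.items())
    return f"proc cas; {action} / {args}; run;"


def render_python(action, params, conn_name="conn"):
    """Approximate Python client call; `conn` is a hypothetical session."""
    args = ", ".join(f"{k}={v!r}" for k, v in params.items())
    return f"{conn_name}.{action}({args})"


def render_rest(action, params, session="SESSION_ID"):
    """Assumed REST shape: POST the same parameters as a JSON body."""
    path = f"/cas/sessions/{session}/actions/{action}"
    return "POST", path, json.dumps(params)


action = "simple.summary"
params = {"table": {"name": "cars"}}

proc_cas_text = render_proc_cas(action, params)
python_text = render_python(action, params)
method, path, body = render_rest(action, params)
```

The point of the sketch is that the action name and the parameter structure stay the same; only the surrounding call syntax changes from one interface to the next.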

Analytics Procedure Advantages

Auto-tuning

Next, I want to address some of the advantages in the newer analytics procedures. One really great new capability that has been added is the auto-tuning feature for some machine learning modeling techniques, specifically (extreme) gradient boosting, decision tree, random forest, support vector machine, factorization machine and neural network. This capability is something that is hard to find in the open source community, namely the automatic tuning of major option settings required by most iterative machine learning techniques. These settings are called ‘hyperparameters’ among data scientists, and SAS has built-in optimizing routines that try different settings and pick the best ones for you (in parallel!!!). The process takes longer to run initially, but, wow, the increase in accuracy without going through the normal model build trial-and-error process is worth it for this amazing feature!
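The general shape of “try different settings in parallel and keep the best” can be sketched in a few lines of Python. This is a plain exhaustive grid search over a made-up error function, swapped in for illustration; it is not the search or optimization method SAS uses, and the names `validation_error` and `autotune`, the grid values, and the error surface are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def validation_error(settings):
    """Stand-in objective: a made-up error surface, lowest at (0.1, 6)."""
    learning_rate, depth = settings
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 6) ** 2


def autotune(grid, objective, max_workers=4):
    """Evaluate every candidate concurrently and keep the best one."""
    candidates = list(product(*grid.values()))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        errors = list(pool.map(objective, candidates))
    best_error, best = min(zip(errors, candidates))
    return dict(zip(grid.keys(), best)), best_error


grid = {"learning_rate": [0.01, 0.1, 0.5], "depth": [2, 6, 10]}
best_settings, best_error = autotune(grid, validation_error)
```

In a real tuning run each call to the objective would train and validate a model, which is why the process takes longer up front and why evaluating candidates in parallel matters.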

Extreme Gradient Boosting, plus other popular ML techniques

Admittedly, xgboost has been in the open source community for a couple of years already, but SAS Viya has its own extreme[1] gradient boosting CAS action (‘gbtreetrain’) and accompanying procedure (Gradboost). Both are very close to what Chen (2015, 2016) originally developed, yet have some nice enhancements sprinkled throughout. One huge bonus is the auto-tuning feature I mentioned above. Another set of enhancements includes: 1) a more flexible tree-splitting methodology that is not limited to CART (binary tree-splitting), and 2) automatic handling of nominal input variables, versus the ‘one-hot-encoding’ you need to perform in most open source tools. Plus, there are lots of other detailed option settings for fine tuning and control.
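For context on what that manual step involves, here is a minimal plain-Python sketch of one-hot encoding a nominal column. The function name, the `color` column, and the sample rows are hypothetical; real projects would normally use a library routine for this.

```python
def one_hot_encode(rows, column):
    """Replace one nominal column with a 0/1 indicator column per level."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}_{level}"] = int(row[column] == level)
        encoded.append(new_row)
    return encoded


rows = [
    {"age": 34, "color": "red"},
    {"age": 51, "color": "blue"},
    {"age": 29, "color": "red"},
]
encoded = one_hot_encode(rows, "color")
```

A column with many distinct levels expands into just as many indicator columns, which is the bookkeeping that automatic handling of nominal inputs spares you.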

In SAS Viya, all of the popular machine learning techniques are there as well, and SAS makes it easy for you to explore your data, create your own model tournaments, and generate score code that is easy to deploy. Model management is currently done through SAS9 (at least until the next SAS Viya release later this year), but good, solid links are provided between SAS Viya and SAS9 to make transferring tasks and output fairly seamless. Check out the full list of SAS Viya analytics available as of March 2017.

In-memory forecasting

It is hard to beat SAS9 Forecast Server with its 12 unique patents for automatically diagnosing and generating forecasts, but now all of those industry-leading innovations are also available in SAS Viya’s distributed in-memory environment. And by leveraging SAS Viya’s optimized data shuffling routines, time series data does not need to be sorted, yet it is quickly and efficiently distributed across the shared memory array. The new architecture also has given us a set of new object packages to make more efficient use of the data and run faster than anything witnessed before. For example, we have seen a job of 1.5 million weekly time series with three years of history that took 130 hours (running single-machine and single-threaded) reduced to 5 minutes on a 40-core networked array with 10 threads per core. Accurately forecasting 870 Gigabytes of information, in 5 minutes?!? That truly is amazing!
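The figures quoted above can be turned into a few derived numbers with simple arithmetic. The Python snippet below uses only the values stated in the paragraph; the variable names are mine.

```python
# Figures quoted in the paragraph above.
single_threaded_hours = 130
distributed_minutes = 5
n_series = 1_500_000
cores, threads_per_core = 40, 10

# Ratio of the old run time to the new run time, both in minutes.
speedup = single_threaded_hours * 60 / distributed_minutes

# Total threads available on the networked array.
total_threads = cores * threads_per_core

# Time series forecast per second during the 5-minute run.
series_per_second = n_series / (distributed_minutes * 60)
```

That works out to a 1,560-fold reduction in run time, 400 threads in total, and 5,000 time series forecast per second.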

Conclusions

Though I first ventured into SAS Viya with some trepidation, it soon became clear that the new platform would fundamentally change how I build models and analytics. In fact, the jumps in performance and the reductions in time spent to do routine work have been so compelling for me that I am having a hard time thinking about going back to a pure SAS9 environment. For me it’s all about getting “better, faster answers,” and SAS Viya allows me to do just that. Multi-threaded processing is the way of the future and I want to be part of that, not only for my own personal development, but also because it will help me achieve things for my customers they may not have thought possible before. If you have not done so already, I highly recommend that you go out and sign up for a free trial and check out the benefits of SAS Viya for yourself.

[1] The definition of ‘extreme’ refers only to the distributed, multi-threaded aspect of any boosting technique.
