Continued advancements in artificial intelligence applications have brought deep learning to the forefront of a new generation of data analytics development. In particular, we are seeing increasing demand from organizations to apply deep learning technologies (such as computer vision, natural language processing, and generative adversarial networks) to their big data platforms and pipelines. Today this often requires manually “stitching together” many separate components (e.g., Apache Spark, TensorFlow, Caffe, Apache Hadoop Distributed File System (HDFS), Apache Storm/Kafka, and others), in what can be a complex and error-prone process.

At Intel, we have been working extensively with open source community users and several partners and customers, including JD.com, UCSF, Mastercard, and many others, to build deep learning (DL) and AI applications on Apache Spark. To streamline end-to-end development and deployment, we developed Analytics Zoo, a unified analytics + AI platform that seamlessly unites Spark, TensorFlow, Keras, and BigDL programs into an integrated pipeline, which can transparently scale out to large Apache Hadoop/Spark clusters for distributed training or inference.

Early users such as World Bank, Cray, Talroo, Baosight, Midea/KUKA, and others have built analytics + AI applications on top of Analytics Zoo for a wide range of workloads. These include transfer-learning-based image classification, sequence-to-sequence prediction for precipitation nowcasting, neural collaborative filtering for job recommendations, and unsupervised time-series anomaly detection, among other examples.

In this article, we provide several specific tutorials on how to implement distributed TensorFlow pipelines on Apache Spark using Analytics Zoo, as well as an end-to-end text classification pipeline drawn from real use cases.

Distributed TensorFlow on Apache Spark

Using Analytics Zoo, users can easily build an end-to-end deep learning pipeline on a large-scale cluster with Spark and TensorFlow, as follows.

Data wrangling and analysis using PySpark

For instance, to process the training data for an object detection pipeline in a distributed fashion, one can simply read the raw image data into an RDD (Resilient Distributed Dataset — an immutable collection of records partitioned across a cluster) using PySpark, and then apply a few transformations to decode the images and extract the bounding boxes and class labels, as illustrated below.
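The per-record transformation can be sketched in plain Python as follows. This is illustrative only: the record layout and the helper name `parse_record` are hypothetical, and the real pipeline would apply such a function via PySpark's `rdd.map` and produce NumPy ndarrays rather than plain lists.

```python
# Illustrative sketch only: the record format and helper name here are
# hypothetical. In the real pipeline this logic runs per record via
# PySpark's rdd.map() and returns NumPy ndarrays instead of lists.

def parse_record(record):
    """Extract the image data, bounding boxes, class labels, and
    box count from one (hypothetical) decoded annotation record."""
    boxes = [obj["bbox"] for obj in record["objects"]]      # [x_min, y_min, x_max, y_max]
    classes = [obj["class_id"] for obj in record["objects"]]
    num_boxes = len(boxes)
    return record["image"], boxes, classes, num_boxes

# A dummy record standing in for a decoded image and its annotations
record = {
    "image": [[0.0] * 4] * 4,  # placeholder pixel data
    "objects": [
        {"bbox": [0.1, 0.2, 0.5, 0.6], "class_id": 3},
        {"bbox": [0.4, 0.4, 0.9, 0.9], "class_id": 7},
    ],
}

image, boxes, classes, num_boxes = parse_record(record)
# In the real pipeline: train_rdd = raw_rdd.map(parse_record)
```

In the distributed version, each such tuple becomes one record of the training RDD.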

Each record in the resulting RDD (train_rdd) consists of a list of NumPy ndarrays (namely, the image, bounding boxes, classes, and number of detected boxes), which can then be used directly in TensorFlow models for distributed training on Analytics Zoo; this is accomplished by creating a TFDataset from the resulting RDD (as shown below).

Deep learning model development using TensorFlow

In Analytics Zoo, a TFDataset represents a distributed set of elements, in which each element contains one or more TensorFlow Tensor objects. We can then use these Tensors directly (as inputs) to build TensorFlow models; for instance, we can use the TensorFlow Object Detection API to construct an SSDLite+MobileNet V2 model (as illustrated below):

Distributed training/inference on Spark and BigDL

After constructing the model, we can then train it in a distributed fashion directly on top of Spark (leveraging the BigDL framework). For instance, in the code snippet below, we apply transfer learning to train a TensorFlow model that has been pretrained on the MS COCO dataset.

Under the hood, the input data are read from disk and preprocessed to generate an RDD of TensorFlow Tensors using PySpark; the TensorFlow model is then trained in a distributed fashion on top of BigDL and Spark (as described in the BigDL Technical Report). The entire training pipeline can automatically scale out from a single node to a large Xeon-based Hadoop/Spark cluster (without code modifications or manual configuration).

Once the model is trained, we can also perform large-scale, distributed evaluation/inference on Analytics Zoo using PySpark, TensorFlow, and BigDL (similar to the training pipeline above). Alternatively, we may deploy the model for low-latency online serving (in, for instance, web services, Apache Storm, Apache Flink, etc.) using the POJO-style serving API provided by Analytics Zoo, as illustrated below.

Real world AI use cases on Analytics Zoo

As mentioned above, many early users have built real-world applications on top of Analytics Zoo. In this section, we will describe in more detail how Microsoft Azure built an end-to-end text classification pipeline on Analytics Zoo using NLP technologies.

Text Classification Overview

Text Classification is a common type of Natural Language Processing task, whose purpose is to classify input text corpus into one or more categories. For example, spam email detection classifies the content of an email into spam or non-spam categories.

In general, training a text classification model involves the following steps: collecting and preparing the training and validation datasets, cleaning and preprocessing the data, training the model, validating and evaluating the model, and tuning the model (which includes, but is not limited to, adding data, adjusting hyperparameters, and adjusting the model).

There are several pre-defined text classifiers in Analytics Zoo that can be used out of the box (namely, CNN, LSTM, and GRU). We chose CNN as a starting point, and use the Python API in the following sections to illustrate the training process.

In this API, class_num is the number of categories in the problem, embedding_file is the path to the pretrained word embedding file (only GloVe is supported at the moment), sequence_length is the number of words each text record contains, encoder is the type of word encoder (which can be cnn, lstm, or gru), and encoder_output_dim is the output dimension of the encoder. The model accepts a sequence of word indices as input and outputs a label.

Data collection and preprocessing

Each record in the training dataset contains two fields: a dialogue and a label. We collected thousands of such records, labeling them both manually and semi-automatically. We then cleaned the original texts, removing meaningless tags and garbled fragments, and converted them into a text RDD in which each record is a (text, label) pair. Next, we preprocessed the text RDD to produce the input format that our model accepts. Make sure to keep the data cleaning and preprocessing identical for both training and prediction.

(How to get invoice …, 1)

(Can you send invoice to me…,1)

(Remote service connection failure…, 2)

(How to buy…, 3)

…

Illustration of text RDD records after data cleaning (each record is a pair of text and label)

Data read

We can use the TextSet provided by Analytics Zoo to read the text data in a distributed fashion as follows.

We then break the sentences into words, converting each input text into an array of tokens (words), and normalize the tokens (e.g., removing unknown characters and converting to lower case).

text_set = text_set.tokenize() \
.normalize()
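The tokenize/normalize step amounts to logic along the following lines, sketched here in plain Python. This is an illustrative approximation; TextSet performs the actual implementation in a distributed fashion, and its exact tokenization rules may differ.

```python
import re

def tokenize(text):
    # Split on whitespace, as a simple stand-in for TextSet.tokenize()
    return text.split()

def normalize(tokens):
    # Lower-case each token and strip characters outside [a-z0-9],
    # approximating the normalization described above
    cleaned = [re.sub(r"[^a-z0-9]", "", t.lower()) for t in tokens]
    return [t for t in cleaned if t]  # drop tokens that became empty

tokens = normalize(tokenize("How to get Invoice ???"))
```

Here the unknown-character run “???” is stripped entirely and therefore dropped from the token array.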

Sequence Aligning

Different texts may generate token arrays of different sizes, but a text classification model needs a fixed-size input for all records. We therefore have to align the token arrays to the same size (specified by the sequence_length parameter of the text classifier). If a token array is larger than the required size, we strip words from the beginning or the end; otherwise, we pad meaningless words (e.g., “##”) to the end of the array.

text_set = text_set.shape_sequence(sequence_length)
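The truncate-or-pad logic can be sketched in plain Python as follows (a minimal illustration of the alignment described above; TextSet.shape_sequence applies this per record across the cluster, and may strip from either end).

```python
def shape_sequence(tokens, sequence_length, pad_token="##"):
    """Truncate or pad a token list to a fixed length.

    Sketch of the alignment step; this version keeps the beginning
    of over-long arrays and pads short ones at the end.
    """
    if len(tokens) > sequence_length:
        return tokens[:sequence_length]   # strip extra words
    return tokens + [pad_token] * (sequence_length - len(tokens))

short = shape_sequence(["how", "to", "buy"], 5)
long = shape_sequence(["a", "b", "c", "d", "e", "f"], 5)
```

After this step every record has exactly sequence_length tokens.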

Word to Index

After the token arrays are aligned, we need to convert each token (word) into an index, which can later be used to look up its embedding (in the text classifier model). During the word-to-index conversion, we also remove stop words (that is, words that appear frequently in the text but do not help semantic understanding, such as “the”, “of”, etc.) by removing the top N words with the highest frequencies in the text.

text_set = text_set.word2idx(remove_topN=10, max_words_num=max_words_num)
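The word-to-index behavior described above can be sketched in plain Python: build a frequency-ranked vocabulary, drop the top-N most frequent words as stop words, and keep at most max_words_num of the rest. This is an illustration of the described behavior only; Analytics Zoo's exact index assignment may differ.

```python
from collections import Counter

def build_word_index(token_lists, remove_topN=10, max_words_num=5000):
    """Map words to integer indices, dropping the remove_topN most
    frequent words (treated as stop words) and keeping at most
    max_words_num of the remaining words."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    ranked = [w for w, _ in counts.most_common()]       # most frequent first
    kept = ranked[remove_topN:remove_topN + max_words_num]
    return {w: i + 1 for i, w in enumerate(kept)}       # reserve 0 for padding

docs = [["the", "invoice", "is", "missing"],
        ["the", "remote", "service", "is", "down"]]
word_index = build_word_index(docs, remove_topN=2, max_words_num=100)
```

With remove_topN=2, the two most frequent words here (“the” and “is”) are dropped before indexing.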

Conversion to Sample

After all the above steps, each text becomes a tensor with shape (sequence_length, 1). We then construct one BigDL Sample from each record, with the generated tensor as the feature and the label integer as the label.

text_set = text_set.generate_sample()
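The Sample-construction step can be mimicked in plain Python as below. The helper to_sample is hypothetical: it only demonstrates the (sequence_length, 1) shape and the feature/label pairing, whereas TextSet.generate_sample produces actual BigDL Sample objects.

```python
def to_sample(indexed_tokens, label):
    # Shape the word indices as a (sequence_length, 1) tensor (here a
    # nested list) and pair it with the integer label, analogous to a
    # BigDL Sample(feature, label)
    feature = [[idx] for idx in indexed_tokens]
    return (feature, label)

sample = to_sample([12, 7, 0, 0], 1)
```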

Model training, testing, evaluation and tuning

After preparing the training dataset (train_rdd) and the validation dataset (val_rdd) in the same way as above, we instantiate a new TextClassifier model (text_classifier) and create an Optimizer to train the model in a distributed fashion. We use sparse categorical cross-entropy as the loss function.

The tunable parameters for training include the number of epochs, batch size, learning rate, etc. You can specify validation options to output metrics such as accuracy on the validation set during training, in order to detect overfitting or underfitting.

If the results on the validation dataset are not good enough, we have to tune the model. This is generally a repeated process of adjusting the hyperparameters, data, or model, then training and validating again, until the results are good enough. We improved our accuracy score markedly after tuning the learning rate, adding new data, and augmenting the stop-word dictionary.

About the Author

Jason Dai is a senior principal engineer and CTO of Big Data Technologies at Intel, responsible for leading the global engineering teams (in both Silicon Valley and Shanghai) on the development of advanced big data analytics and machine learning. He is a committer and PMC member of Apache Spark, a mentor of Apache MXNet, a co-chair of Strata Data Conference Beijing, and the creator of BigDL (a distributed deep learning framework for Apache Spark).