Overview

This article explains how to use Talend to harness the capabilities of Amazon Web Services (AWS) Machine Learning (ML) services in real-time mode. The goal is to help Talend enthusiasts integrate AWS ML and Talend without worrying about underlying complexities.

It covers:

Data Preparation for an AWS Machine Learning model

Creation of an AWS Machine Learning model

Creation of a real-time prediction endpoint for the AWS Machine Learning model

Configuration of a Talend routine to call an AWS Machine Learning real-time endpoint

Execution of a sample Job with an AWS Machine Learning real-time prediction

Environment

Talend 7.0.1

Data Preparation for an AWS Machine Learning model

The first step in using the AWS ML service is to identify source data that can be used to train the ML model. Amazon S3 and Amazon Redshift are the two services that can serve as data sources for training an AWS ML model.

In this article, you create an ML model based on the Iris Data Set provided by the University of California, Irvine. The data set classifies Iris plants into three groups, Iris Setosa, Iris Versicolour, and Iris Virginica, based on sepal length, sepal width, petal length, and petal width.

Talend can be seamlessly used to load data to AWS S3 and AWS Redshift.

For more information on loading data using Talend with S3 and Redshift, see the documentation for the Talend S3 and Redshift components in Talend Help Center.

In this example, S3 is the source used to train the AWS ML model, and the data is loaded to the Amazon S3 bucket using Talend.
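To make the expected file layout concrete, the following is a minimal sketch of how one Iris record could be written as a CSV row with a header line. The class labels are the ones used in the UCI data file, and the class column name matches the target selected later in this article. The other column names and the IrisCsvSketch helper are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Talend or AWS.
final class IrisCsvSketch {
    // Assumed column names; only "class" is taken from the article.
    static final String HEADER = "sepal_length,sepal_width,petal_length,petal_width,class";

    // Class labels as they appear in the UCI Iris data file.
    static final List<String> CLASSES =
            Arrays.asList("Iris-setosa", "Iris-versicolor", "Iris-virginica");

    // Formats one record as a CSV row; rejects labels outside the three known classes.
    static String toCsvRow(double sepalLength, double sepalWidth,
                           double petalLength, double petalWidth, String irisClass) {
        if (!CLASSES.contains(irisClass)) {
            throw new IllegalArgumentException("Unknown class: " + irisClass);
        }
        // Locale.ROOT keeps the decimal separator a dot on every machine.
        return String.format(Locale.ROOT, "%.1f,%.1f,%.1f,%.1f,%s",
                sepalLength, sepalWidth, petalLength, petalWidth, irisClass);
    }

    public static void main(String[] args) {
        System.out.println(HEADER);
        System.out.println(toCsvRow(5.1, 3.5, 1.4, 0.2, "Iris-setosa"));
    }
}
```

A file built this way starts with the header line, which is why the first line is later declared as containing the column names.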

Creation of an AWS Machine Learning model

Creating an ML model involves two sub-tasks, which the AWS ML setup process combines into a single flow:

Creation of a training data source in an AWS ML service

Creation of an AWS ML model

Once the data source is ready in Amazon S3, select the AWS region of your choice for Machine Learning activities. After selecting the region, go to the AWS ML service page, and select Standard setup, then Launch.

From the Input Data page, select the Amazon S3 button, and enter the bucket and file name where the Iris dataset is located. For Datasource name, type Iris_Dataset.

If the AWS ML service is using the bucket for the first time, you are prompted to grant it read permission on the bucket. Select Yes when prompted.

Select Continue.

The Schema page displays details of the dataset. Answer Yes to the question, Does the first line in your CSV contain the column names? Click Continue.

Select the entry in the Target column whose value has to be predicted. For this example, select the class row, then click Continue.

From the Row ID page, answer No to the question, Does your data contain an identifier? Click Review.

The next page provides the training and evaluation settings of the ML model. Select the Default (Recommended) setting, which sets aside 30% of the data for the evaluation process.
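As a rough illustration of what setting aside 30% means, the sketch below splits a list of rows into a training part and an evaluation part. It assumes a simple sequential split (the first 70% of rows for training, the rest for evaluation). How AWS ML actually orders or shuffles the rows is not covered in this article, and the SplitSketch class is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a 70/30 split; not the AWS ML implementation.
final class SplitSketch {
    // Returns two lists: index 0 is the training part, index 1 the evaluation part.
    static <T> List<List<T>> split(List<T> rows, double trainFraction) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be between 0 and 1");
        }
        // Round to the nearest whole row so the two parts always cover every row.
        int cut = (int) Math.round(rows.size() * trainFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(rows.subList(0, cut)));
        parts.add(new ArrayList<>(rows.subList(cut, rows.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 150; i++) {
            rows.add(i);
        }
        List<List<Integer>> parts = split(rows, 0.7);
        System.out.println("training rows: " + parts.get(0).size());
        System.out.println("evaluation rows: " + parts.get(1).size());
    }
}
```

For the 150 rows of the Iris data set, a 70/30 split leaves 105 rows for training and 45 rows for evaluation.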

From the final Review page, select Create ML model to create the Machine Learning Model in AWS.

The model creation process can take from several minutes to several hours, depending on the size of the input data set and the number of columns. The status is Pending until model processing is complete, then the status changes to Completed.

You have to repeat these steps in AWS whenever the AWS ML model needs to change, for example after a significant change in the pattern of the source data.

Creation of a real-time prediction endpoint for the AWS ML model

Once the model is generated and is in Completed status, go to the Prediction section of the AWS ML model and select Create endpoint. The new endpoint is used to send requests and receive responses in real-time between Talend and AWS ML.

Once the endpoint is ready, the status changes to Ready, and the endpoint URL is displayed.

Configuration of a Talend routine to call an AWS ML real-time endpoint

Connect to Talend Studio and create a new routine called AWS_ML_RT_Predict that connects to the AWS ML endpoint, transmits the incoming JSON record, and processes the data. The routine also collects the prediction response returned by the AWS ML Predict function.
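One step such a routine needs is turning the incoming JSON record into the name/value pairs that are sent with the prediction request. The following is a minimal sketch of that step only, assuming a flat JSON object with no nested objects, arrays, or escaped quotes. The JsonRecordSketch class, the toRecord method, and the field names in the example are hypothetical, and the actual call to the AWS ML endpoint through the AWS SDK is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; it is not the AWS_ML_RT_Predict routine itself.
final class JsonRecordSketch {
    // Matches "key": "text" or "key": bareValue inside a flat JSON object.
    // Assumption: no nesting, no arrays, no escaped quotes in keys or values.
    private static final Pattern FIELD =
            Pattern.compile("\"([^\"]+)\"\\s*:\\s*(?:\"([^\"]*)\"|([^,}\\s]+))");

    // Converts one flat JSON record into an ordered map of string names to string values.
    static Map<String, String> toRecord(String json) {
        Map<String, String> record = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(json);
        while (m.find()) {
            // Group 2 holds a quoted value, group 3 an unquoted one such as a number.
            String value = m.group(2) != null ? m.group(2) : m.group(3);
            record.put(m.group(1), value);
        }
        return record;
    }

    public static void main(String[] args) {
        // Field names are assumptions made for this example.
        String json = "{\"sepal_length\": 5.1, \"sepal_width\": \"3.5\", "
                + "\"petal_length\":1.4,\"petal_width\":0.2}";
        System.out.println(toRecord(json));
    }
}
```

Numeric values are kept as strings on purpose, so that every attribute is passed in the same form regardless of how it was written in the JSON record. For production use, a real JSON parser would be a safer choice than a regular expression.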

If a JAR file required by the routine is not installed, its status changes from the error flag to Install a module, followed by the JAR file name. Click OK to load the JAR file to the routine. Once all the JAR files are installed, click Finish.

Talend sample Job with an AWS Machine Learning real-time prediction

The setup activities are complete and the routine can be used in any Talend Job as a user-defined function. The Talend routine generates real-time predictions based on the AWS ML model. In this example, nine sample JSON records from the Iris dataset are read from the input_data.txt file attached to this article.

The following diagram shows the overall Job flow for the AWS ML real-time prediction:

The configuration details for each Talend component are as follows:

Use a tFileInputFullRow component to read the file and to process each row.

Use a tJavaRow component to call the Talend routine AWS_ML_RT_Predict, which generates the prediction value based on the configuration details. The RT_Predict method of the Talend routine processes the incoming data and returns the prediction value as a String. The parameters required for the method are: