Part II of the Forecasting Time Series blog provides a step-by-step guide for fitting an ARIMA model using Splunk’s Machine Learning Toolkit (MLTK). ARIMA models can be used in a variety of business use cases. Here are a few examples of where we can use them:

Detecting anomalies and their impact on the data

Predicting seasonal patterns in sales/revenue

Streamlining short-term forecasting by determining confidence intervals

From Part I of the blog series, we saw how you can use the Kalman Filter for forecasting. The resulting graphs also demonstrated how useful it is for reducing/filtering noise (which is how it gets the name ‘Filter’). ARIMA, on the other hand, belongs to a different class of models. In comparison to a Kalman Filter, ARIMA models work on data that has moving averages over time or where the value of a data point depends linearly on its previous value(s). In these two scenarios it makes more sense to use ARIMA over the Kalman Filter. However, good judgement, an understanding of the dataset and the objective of the forecast should always be the primary basis for choosing the algorithm.

Objective

Part II of this blog series aims to familiarize Splunk users with the MLTK Assistant for forecasting time series data, particularly with the ARIMA option. This blog is intended as a guide for determining the parameters and steps needed to use ARIMA on your data. In fact, it is a generalized template that can be used with any processed data to forecast with ARIMA in Splunk’s MLTK. An advantage of using Splunk for forecasting is that you can observe the raw data side by side with the predicted data and, once the analysis is complete, create alerts or other actions based on a future prediction. We will talk more about creating alerts based on predicted or forecasted data in a future blog (see what I predicted there ;)?).

If you have read Part I of this series, note that we reuse the same dataset, process_time.csv, in this part. If not, see Part I for a description of the dataset.

Fundamental Concept for ARIMA Forecasting

A fundamental concept to understand before we move ahead with ARIMA is that the model works best with stationary data. Stationary data has statistical properties that do not change over time. In particular, it has no upward or downward trend, so its average value is independent of time.

A simple example of non-stationary data is shown in the two graphs below: the first without a trendline, the second with a yellow trendline that shows an average increase in the value of our data points. Such data needs to be transformed into stationary data to remove the increasing trend.

Using Splunk’s autoregress command we can apply differencing to our data. The results are immediately visible in a line chart! The command below can be used on any time series dataset to demonstrate differencing.

Without adding a trendline to the graph below, we can see that the data now fluctuates around a constant mean value of ‘0’, which tells us that differencing has been applied. Differencing to make the data stationary can increase the accuracy and fit of our ARIMA forecast. To read more about differencing and other rules that apply to ARIMA, navigate to the Duke URL provided in the useful links section.

Differencing simply means subtracting the previous data point from the current one. In our example we only apply differencing of order 1, meaning we subtract from each data point the data point immediately before it. There are different types of non-stationary data, which require in-depth domain knowledge of ARIMA; however, we simplify things in this blog and use differencing to remove the non-constant trend in this example 😊!
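To make the idea concrete outside of Splunk, here is a minimal Python sketch of differencing. The function name `difference` and the sample numbers are made up for illustration; they are not taken from process_time.csv.

```python
def difference(values, order=1):
    """Return the series differenced `order` times.

    Each pass replaces the series with (current point - previous point),
    so the result is one element shorter per pass.
    """
    result = list(values)
    for _ in range(order):
        result = [curr - prev for prev, curr in zip(result, result[1:])]
    return result


# Hypothetical series with a steady upward trend plus small fluctuations.
trending = [10, 12, 15, 16, 19, 21, 24]
diffed = difference(trending)
print(diffed)  # [2, 3, 1, 3, 2, 3]
```

The original values climb from 10 to 24, while the differenced values hover around a roughly constant level, which is the behaviour we want before fitting ARIMA.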

From Part I of this blog series we can see that our data does not have a constant trend, so we apply differencing to our dataset. The step to apply differencing from the MLTK Assistant is detailed in the ‘Determine Starting Points’ section. Differencing in ARIMA lets the user see spikes or drops (outliers) from a different perspective in comparison to the Kalman Filter.

Walkthrough of MLTK Assistant for ARIMA

ARIMA is a popular and robust method of forecasting long-term data. From Part I we can describe the Kalman Filter’s forecasting capabilities as extending the existing pattern/spikes, a sort of copy-paste method which may be advantageous when forecasting short-term data. ARIMA has an advantage in predicting data points when we are uncertain about the future trend of the data points in the long term. Now that we have got you excited about ARIMA, let’s see how we can use it in Splunk’s MLTK!

We use the Machine Learning Toolkit Assistant for forecasting time series data in Splunk. Navigate to the Forecast Time Series Assistant page (under the Classic menu option) and use the Splunk ‘inputlookup’ command to view the process_time.csv file.

| inputlookup process_time.csv

Once we have added the dataset, click on Algorithm and select ‘ARIMA’ (Autoregressive Integrated Moving Average), with ‘value’ as your field to forecast. You will notice that the ARIMA arguments appear.

There are three arguments that make up the ARIMA model:

| Argument | Definition |
| --- | --- |
| AutoRegressive (p) | The autoregressive (AR) component refers to the use of past values in the regression equation. The higher the value, the more past terms you use in the equation. These past terms are also called ‘lags’. Another way of describing this is that the value of your data point depends on its previous value(s), e.g. the process time right now depends on the process time 30 seconds before (from our dataset). |
| Integrated (d) | The d represents the degree of differencing discussed in the previous section. This makes up the integrated component of the ARIMA model and is needed to satisfy the assumption that the data is stationary. |
| Moving Average (q) | Moving Average in ARIMA refers to the use of past errors in the equation. It uses lags (like AR), but for the error terms. |
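To show how the AR and MA parts combine, here is a minimal Python sketch of a one-step-ahead prediction on an already differenced series. The function name `arma_one_step`, the coefficients and the data are all hypothetical; a real model estimates the coefficients from the data, and this sketch does not reproduce how the MLTK does that.

```python
def arma_one_step(history, errors, phi, theta, constant=0.0):
    """One-step-ahead prediction from past values and past forecast errors.

    phi[i]   weights the value i+1 steps back (AR part, p = len(phi)).
    theta[j] weights the forecast error j+1 steps back (MA part, q = len(theta)).
    Both `history` and `errors` are ordered oldest first, most recent last.
    """
    ar_part = sum(coef * history[-(i + 1)] for i, coef in enumerate(phi))
    ma_part = sum(coef * errors[-(j + 1)] for j, coef in enumerate(theta))
    return constant + ar_part + ma_part


# Hypothetical inputs for a model with p=2 and q=1.
history = [1.0, 2.0, 4.0]
errors = [0.5, -1.0]
prediction = arma_one_step(history, errors, phi=[0.5, 0.25], theta=[0.5])
print(prediction)  # 2.0
```

With p=2 the two most recent values contribute 0.5 * 4.0 + 0.25 * 2.0, and with q=1 the most recent error contributes 0.5 * -1.0, giving 2.0.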

Determine Starting Points

Identify the Order of Differencing (d)

As a refresher, we use the same dataset we worked with in Part I of the blog series on the Kalman Filter. After loading my process_time.csv file in the assistant, I set future_timespan to 20 and holdback to 20. I’ve kept the confidence interval at its default value of ‘95’. Once the argument values are populated, click on ‘Forecast’ to see the resulting graphs.

As a note, my ARIMA arguments described above are all 0, i.e. ARIMA(0,0,0), which follows the standard mathematical notation ARIMA(p,d,q), here with p = d = q = 0. We use this notation frequently in this blog for consistency with commonly used mathematical language.

When we click on ‘Forecast’, observe the line chart in the results. This graph confirms that the data is non-stationary, so we will apply differencing to make it stationary. We can accomplish this by increasing the value of our ‘d’ argument from ‘0’ to ‘1’ in the forecasting assistant and clicking on ‘Forecast’ again. This step is essential to meet one of the main criteria for using ARIMA discussed in the ‘Fundamental Concept for ARIMA Forecasting’ section.

Identifying AR(p) and MA(q)

After we apply differencing to our data, our next step is to determine the AR or MA terms that mitigate any autocorrelation in our data. There are two popular methods of estimating these two parameters. We will expand on one of the methods in this blog.

Method 1

The first method for estimating the values of ‘p’ and ‘q’ is to use the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). However, using them is outside the scope of this blog, as we will use a different method given the tools the MLTK puts at hand. For the curious mind, the following blog contains detailed information on using AIC and BIC to determine our ‘p’ and ‘q’ values:

The second method, which we use in this blog, relies on the plots that the MLTK assistant already provides. After we have applied differencing to our time series data, we review the PACF and ACF plots to determine an order for AR(p) or MA(q). We apply ARIMA(0,1,0) in the ARIMA assistant and then click on ‘Forecast’ to view the resulting graphs. The image below shows the values that we entered in the assistant:

Once we click on ‘Forecast’, we view the PACF plot to estimate a value for the AR(p) term. Similarly, we use the ACF plot to estimate a value for MA(q). The graphs are shown in the screenshot below.

We examine the PACF plot for a suggestion for our AR value, by counting the prominent high spikes. From the plot below I’ve circled the prominent spikes in the PACF graph. The value of AR (p) that we pick is 4.

We examine the ACF plot for a suggestion for our MA value by counting the prominent high spikes. From the plot below, I’ve circled the prominent spikes in the ACF graph. The value of MA (q) that we pick is 5.
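Counting ‘prominent spikes’ by eye can also be approximated in code. The Python sketch below computes a sample ACF and flags lags that fall outside an approximate 95% band of ±1.96/√n. The helper names and the sample series are made up, the band is a common rule of thumb rather than necessarily what the assistant draws, and the PACF is not covered here.

```python
import math


def sample_acf(values, max_lag):
    """Sample autocorrelation of `values` for lags 0..max_lag."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    variance = sum(c * c for c in centered)
    acf = []
    for lag in range(max_lag + 1):
        cov = sum(centered[t] * centered[t - lag] for t in range(lag, n))
        acf.append(cov / variance)
    return acf


def prominent_lags(acf, n):
    """Lags (excluding lag 0) whose autocorrelation is outside +/-1.96/sqrt(n)."""
    band = 1.96 / math.sqrt(n)
    return [lag for lag, r in enumerate(acf) if lag > 0 and abs(r) > band]


# Hypothetical series that flips sign at every step, so neighbouring
# points are strongly negatively correlated.
series = [1.0, -1.0] * 10
acf = sample_acf(series, max_lag=3)
print([round(r, 2) for r in acf])
print(prominent_lags(acf, len(series)))
```

The number of flagged lags plays the role of the circled spikes: it is only a starting estimate, which is why the values are tuned further afterwards.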

We can now enter the values for the parameters in the Splunk MLTK: 1 for Integrated (d), and our estimates of 4 for AR (p) and 5 for MA (q). Once they are entered in the assistant, click on ‘Forecast’.

For this particular combination of values, once we click on ‘Forecast’ we get an error regarding the ‘invertibility’ of the dataset, as shown in the screenshot below. Without going too deep into the mathematics, it means that our model does not converge when it forecasts. I’ve added a link in the references and links section at the end for your interest! This error can be resolved by adjusting the values of the model, using the ‘trial and error’ approach explained in the next section.

Optimize Your P and Q Values

Estimating AR and MA with this method is subjective, because it depends on what is considered a ‘prominent spike’, and this can result in values of ‘p’ and ‘q’ that are not an optimal fit for the data. To resolve this, we constructed a table displaying the R-squared and Root Mean Square Error (RMSE) values from the model error statistics in the MLTK assistant for each combination of ‘p’ and ‘q’. An empty cell indicates an invertibility error, while the other cells contain the values of R-squared and RMSE.

A higher R-squared indicates a better fit of the model to the data. R-squared is the proportion of the variability in the process time data points that the model can explain.

On the other hand, the lower the RMSE, the better the fit of the model. RMSE is the square root of the average squared difference between the data points the model predicted and our holdback points from the raw data.
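For readers who want to see the two statistics spelled out, here is a minimal Python sketch using their textbook definitions. The holdback and forecast numbers are invented, and the MLTK may compute its error statistics slightly differently.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between actual and predicted points."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Share of the variance in `actual` that the predictions explain."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical holdback points and the model's predictions for them.
holdback = [10.0, 12.0, 14.0, 16.0]
forecast = [11.0, 11.0, 15.0, 15.0]
print(rmse(holdback, forecast))
print(r_squared(holdback, forecast))
```

Under this definition a model that does worse than simply predicting the mean gets a negative R-squared, which is how a value slightly below zero can show up in the error statistics.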

We pick the values of ‘p’ and ‘q’ that minimize RMSE and maximize R-squared as the best fit to our data. From the table below we can see that p=5 and q=5 optimize the prediction for us.

Integrated (d) = 0. Columns: AutoRegressive (p). Rows: Moving Average (q). Each cell shows R2 Stat / RMSE.

| q \ p | 0 | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | -0.0015 / 19.31 | 0.1976 / 16.35 | 0.1977 / 16.34 | 0.2699 / 15.60 | 0.2696 / 15.60 | 0.3114 / 15.14 |
| 1 | 0.2401 / 15.91 | 0.2486 / 15.82 | 0.2780 / 15.51 | 0.2329 / 15.98 | – | 0.4053 / 14.07 |
| 2 | 0.2452 / 15.85 | – | – | 0.3017 / 15.25 | 0.3214 / 15.03 | – |
| 3 | 0.2872 / 15.41 | 0.4185 / 13.92 | 0.4428 / 13.62 | (empty) | 0.4343 / 13.72 | 0.4456 / 13.58 |
| 4 | 0.2826 / 15.46 | 0.4185 / 13.92 | 0.3241 / 15.00 | – | – | – |
| 5 | 0.2826 / 15.46 | 0.3133 / 15.99 | 0.4385 / 13.67 | – | – | 0.4515 / 13.52 |
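The selection rule can be sketched in Python. The dictionary below holds a handful of cells copied from the table, keyed as (p, q), with `None` standing in for an invertibility error. The function name `best_order` is made up, and the tie-breaking rule (lowest RMSE first, then highest R-squared) is an assumption for this sketch rather than something the MLTK does for you.

```python
# (p, q) -> (R2 Stat, RMSE); a subset of the cells from the table.
results = {
    (0, 0): (-0.0015, 19.31),
    (5, 0): (0.3114, 15.14),
    (5, 1): (0.4053, 14.07),
    (4, 1): None,  # invertibility error, no statistics available
    (2, 3): (0.4428, 13.62),
    (5, 3): (0.4456, 13.58),
    (2, 5): (0.4385, 13.67),
    (5, 5): (0.4515, 13.52),
}


def best_order(results):
    """Return the (p, q) pair with the lowest RMSE, breaking ties on highest R2."""
    valid = {order: stats for order, stats in results.items() if stats is not None}
    return min(valid, key=lambda order: (valid[order][1], -valid[order][0]))


print(best_order(results))  # (5, 5)
```

In this table the lowest RMSE and the highest R-squared happen to fall on the same cell, so the two criteria agree.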

Viewing Your Results

Once we have picked the values of p and q that optimize our model, we can go ahead and plug the numbers into our assistant and click on ‘Forecast’ to display the forecast graph. The values to plug into the assistant are as follows: p = 5, d = 1, q = 5, holdback = 20, forecast = 20. The screenshots below show the values entered in the assistant and the resulting forecast graph.

At this point many would be satisfied with the forecast, as the visual of the data itself is enough to analyse, assess and then make a judgement on the action(s) to take. The next step details how you can view the data and lists some ideas for alerts that can be constructed.

Next Step

We can view the SPL powering the graph by clicking on either ‘Open in Search’ or ‘Show SPL’. I prefer the ‘Open in Search’ option as it automatically opens a new tab, allowing me to further understand how the SPL is constructed in the forecast and to view the data. Once the new browser tab opens, click on the ‘Statistics’ option to view the raw data points, predicted data points and the confidence intervals created by our model. I have added the SPL from the image for your convenience below:

The resulting table lists all the necessary data in a clean tabular format (that we are all familiar with) for creating alerts based on our predicted process time. Here are some ideas on creating alerts based on the data we worked with:

Create an alert when the predicted value of the process time goes above a certain threshold

Create an alert when the average process time over a timespan is predicted to stay above normal limits

Create an alert based on outlier detection, when the data falls outside the predicted lower or upper boundaries

Creating alerts based on our predicted data allows us to be proactive about a potential increase or decrease in our input variable.
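The alert ideas above can be sketched as simple checks over the forecast table. The Python below is only an illustration: the row keys (`actual`, `predicted`, `lower`, `upper`), the function name `find_alerts` and all numbers are hypothetical, and in Splunk you would express the same conditions in SPL on the fields the assistant produces.

```python
def find_alerts(rows, threshold):
    """Return (row index, reason) pairs for rows that should raise an alert."""
    alerts = []
    for i, row in enumerate(rows):
        # Idea 1: the predicted value crosses a fixed threshold.
        if row["predicted"] > threshold:
            alerts.append((i, "predicted above threshold"))
        # Idea 3: an observed value falls outside the confidence interval.
        actual = row.get("actual")
        if actual is not None and not (row["lower"] <= actual <= row["upper"]):
            alerts.append((i, "actual outside confidence interval"))
    return alerts


def mean_predicted_above(rows, limit):
    """Idea 2: is the average predicted value over the rows above a limit?"""
    return sum(row["predicted"] for row in rows) / len(rows) > limit


# Hypothetical forecast rows; the last row is in the future, so it has no actual value.
rows = [
    {"actual": 50.0, "predicted": 52.0, "lower": 40.0, "upper": 64.0},
    {"actual": 90.0, "predicted": 55.0, "lower": 43.0, "upper": 67.0},
    {"actual": None, "predicted": 71.0, "lower": 59.0, "upper": 83.0},
]
print(find_alerts(rows, threshold=70.0))
print(mean_predicted_above(rows, limit=55.0))
```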

Summarizing ARIMA Forecasting in MLTK

Let’s summarize what we have discussed so far in this blog:

1. Understand the mathematical prerequisites of the model

2. Determine the differencing requirement

3. Determine starting values for AR() and MA()

4. Optimize your AR() and MA() values based on error statistics

5. Forecast your data based on the values decided in Step 4

6. View the data and determine any alert conditions

Prior to the above steps, we need to ensure that our data has been pre-processed or transformed in an MLTK-friendly manner. The pre-processing steps include, but are not limited to: ensuring there are no gaps in the time series data, determining the relevance of the data to forecasting, and grouping the data into time intervals (30 seconds, 1 minute, etc.). These steps are important because uniform input data allows Splunk’s MLTK to analyse and forecast your data.
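Two of these pre-processing steps, grouping into fixed time intervals and checking for gaps, can be sketched in Python as follows. The function names and the sample events are made up, and timestamps are assumed to be epoch seconds.

```python
def bucket_means(events, span=30):
    """Group (epoch_seconds, value) events into `span`-second buckets and average each bucket."""
    buckets = {}
    for ts, value in events:
        start = ts - ts % span  # start of the bucket this event falls into
        buckets.setdefault(start, []).append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


def missing_buckets(bucketed, span=30):
    """Return the bucket start times between the first and last bucket that have no data."""
    starts = sorted(bucketed)
    expected = range(starts[0], starts[-1] + span, span)
    return [start for start in expected if start not in bucketed]


# Hypothetical events: nothing arrives between second 60 and second 89.
events = [(0, 10.0), (12, 14.0), (31, 20.0), (95, 30.0)]
bucketed = bucket_means(events)
print(bucketed)                   # {0: 12.0, 30: 20.0, 90: 30.0}
print(missing_buckets(bucketed))  # [60]
```

Any bucket reported as missing needs to be filled or otherwise handled before forecasting, so that the series is evenly spaced.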

Hopefully this blog streamlines the process of forecasting using ARIMA in Splunk’s MLTK. As with any algorithm, there are limitations to forecasting using this method. Because they involve more theoretical knowledge of mathematics, I’ve added two links in the useful links section (the first navigates you to ‘datascienceplus.com’ and the second to ‘emeraldinsight.com’) for further reading.

With the New Year, and cold winter, now upon us here in Toronto, we thought it would be fun to kick it off by revisiting our award-winning Hackathon entry from last year’s Splunk Partner Technical Symposium and adapting it to provide insights for Toronto’s very own Bike Share platform, leveraging their Open Data.

In this blog we will use a classification approach for predicting spam messages. A classification approach categorizes your observations/events into discrete groups, which explain the relationship between the explanatory variables and the dependent variables, which are your field(s) to predict. Some examples of where you can apply classification in business projects are: categorizing claims to identify fraudulent behaviour, predicting the best retail location for new stores, pattern recognition, and predicting spam messages via email or text.

Splunk Enterprise 7.2 is the latest release from Splunk and was made available during Splunk .conf18 in Orlando. Many new features were added which will improve Splunk Enterprise from administration and user experience, to analytics and data onboarding.

Splunk is a great data intelligence platform when used effectively. With a full understanding of Splunk’s functionality and capabilities, it should totally consume you with its awesomeness and you will find yourself preaching its benefits to your entire company! Our customers are always asking for recommendations on how to better grasp the fundamentals of the platform and the following article should provide this guidance.

Discovered Intelligence is proud to announce that co-founder and Partner Josh Diakun was inducted into the 2019 SplunkTrust class at this year’s Splunk .conf18!

SplunkTrust members are the most dedicated members of the Splunk community. They assist other members, participate in events, demonstrate the power of Splunk’s products and services, and help identify future product needs.

As a leader at Discovered Intelligence, Josh demonstrates these values every day and it is an amazing recognition of the contributions he has made.

We are pleased to announce several updates to many of our popular free Splunk apps. These include updates to Config Quest, Meta Woot! and Sendresults. A summary of the updates follows below, along with links to Splunkbase where they can be downloaded.

Looking to master your Operational data? Authored by leading experts from Discovered Intelligence; the Third Edition of the Splunk Operational Intelligence Cookbook has been completely refreshed for Splunk 7.1 and provides hands-on, easy to follow recipes that will have you mastering Splunk and discovering new insights from your operational data in no time. Leveraging our years of expertise, the book is filled with best practices and packed with content, that will get you hands-on with Splunk right from the first chapter.

In our previous blog we walked through steps on installing Splunk’s Machine Learning Toolkit and showcased some of the analytical capabilities of the app. In this blog we will deep dive into an example dataset and use the ‘Predict Numeric Fields’ assistant to help us answer some questions about it.

The sample dataset used is from People’s dataset repository [Houghton]. This multivariate sample dataset contains the following fields:

What Questions do we want to ask?

We would like to understand the relationship between ‘Net Sales’ of Green Franchise and the variables ‘Square Feet of Store’, ‘Inventory’, ‘Amount Spent on Advertising’, ‘Size of Sales District’ & ‘No of Competitors’. For example, would an increase in ‘Inventory’ or ‘Amount Spent on Advertising’ increase or decrease ‘Net Sales’ for Greens?

The next few sections will walk you through uploading the data set and processing it in the Machine Learning Toolkit App.

Uploading the Sample Data Set

The CSV file was uploaded to Splunk from Settings -> Lookups -> Lookup table files (Add new). If you need more information on this step please consult the Splunk Docs here. Save the CSV file as greenfranchise.csv

Once the file has been uploaded and saved as greenfranchise.csv, navigate to the Machine Learning Toolkit App, click on the ‘Legacy’ menu, Assistants and open the ‘Predict Numeric Fields’ Assistant. This screenshot and navigation may differ depending on which version of Splunk and the MLTK is installed. Assistants in version 3.2 can be found under the ‘Legacy’ tab.

Populate Model Fields

In the Create New Model tab, you can view the contents of the CSV file by running the below Splunk Query in the Search bar:

| inputlookup greenfranchise.csv

This will automatically populate the panels with the fields in the csv file. Below the “Preprocessing Steps” we can see a second panel to choose the type of algorithm to apply to this lookup.

Selecting the Algorithm

In the panel for selecting the algorithm, we can see the ‘Fields to predict’ and ‘Fields to use for predicting’ fields are automatically populated from the data. For this test we use the linear regression algorithm to forecast the ‘Net Sales’ of Green Franchises. Select “Net Sales” as the Field to predict, and in the Fields to use for predicting, select all of the remaining fields except for “Size of Sales District”.

If you’re interested in the math behind it, linear regression from the Machine Learning Toolkit will provide us with the Beta (relationship) coefficient between ‘Net Sales’ and each of the fields. The residual of the regression model is the difference between the actual value of ‘Net Sales’ and the value predicted by the regression equation at each data point, which can be used for further analysis of the model.
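As a small illustration of coefficients and residuals, here is a Python sketch of an ordinary least squares fit with a single input variable. The numbers are toy data, not the Greens dataset, and the MLTK model in this blog uses several input fields rather than one.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: returns (intercept, beta)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    return intercept, beta


def residuals(xs, ys, intercept, beta):
    """Actual value minus predicted value at each data point."""
    return [y - (intercept + beta * x) for x, y in zip(xs, ys)]


# Toy data: a hypothetical advertising spend and the resulting net sales.
advertising = [1.0, 2.0, 3.0, 4.0]
net_sales = [3.0, 5.0, 7.0, 10.0]
intercept, beta = fit_line(advertising, net_sales)
res = residuals(advertising, net_sales, intercept, beta)
print(round(intercept, 2), round(beta, 2))
print([round(r, 2) for r in res])
```

The beta value says how much the predicted field changes per unit of the input, and the residuals are what the residual charts in the assistant plot.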

Fitting Model

Once the fields have been picked, you need to determine the ‘Split for Training’ ratio for the model. Select ‘No Split’ for the model to use all the data for creating a model. The split option allows the user to divide the data for training and testing. This means that X% of the data will be used to create our model, and the withheld (100-X)% of the data will be used to test the model.
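The idea of a training split can be sketched in a few lines of Python. The function name `split_for_training` is made up, and shuffling the rows before cutting is an assumption of this sketch; it does not claim to mirror how the assistant samples the data.

```python
import random


def split_for_training(rows, train_percent, seed=0):
    """Shuffle the rows and split them into (training, testing) by percentage."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


# 20 hypothetical rows, 70% used for training and 30% withheld for testing.
rows = list(range(20))
train, test = split_for_training(rows, train_percent=70)
print(len(train), len(test))  # 14 6
```

With ‘No Split’ the equivalent would be a training percentage of 100, leaving nothing withheld for testing.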

Click on ‘Fit Model’ after setting the split for the data. Splunk processes the data and displays visuals that we can use to analyze it. In this example we name the model ‘ex_linearreg_greens_sales’. For your own data, the model name should reflect the field to predict, the type of algorithm and the user it is assigned to, to reduce ambiguity about the model’s ownership and purpose.

Analyzing the Results

The first two panels show a line chart and a scatter chart of “Actual vs Predicted” data. Both panels offer one of the richest ways to analyze the linear regression model. From the scatter and line plots we can observe that the data fits well, and that there is a strong correlation between the model’s predictions and the actual results. Since the model has more than one input variable, examining the residual line chart and histogram next will give us a more practical understanding.

The second set of panels that we can use to analyse the model shows the residuals. From the “Residual Line Chart” and “Residual Histogram” we can see that there is a large deviation from the center and that the residuals appear to be scattered. A random scattering of the data points around the horizontal (x-axis) line signifies a good fit for the linear model. A clear pattern in the data points, on the other hand, would indicate that a non-linear model from the MLTK should be used instead.

The last set of panels shows us the R-squared of the model. The closer the value is to 1, the better the fit of the regression model. The “Fit Model Parameters Summary” panel gives us the ‘Beta’ coefficients discussed in the ‘Selecting the Algorithm’ section. The assistant displays the data in a well-grounded and systematic setting. After analyzing the macro fit of the model, we can use the coefficients of the variables to create our equation for predicting ‘Net Sales’:

In the last panel shown below, we can see our input variables and their values under ‘Fit Model Parameters Summary’. In the next section we will look at using these input variables to predict ‘Net_Sales’.

Answering the Question: How is ‘Net Sales’ impacted by the Variables?

We can view the results of the model by running the following search:

| summary "ex_linearreg_greens_sales"

This query returns the coefficient values of the linear regression algorithm. In our example for Greens, variable ‘X4’ is the number of competitor stores: each additional competitor store reduces ‘Net Sales’ by approximately 12.62. Variable ‘X5’ is the square footage of the store: each unit increase raises ‘Net Sales’ by approximately 23.69.
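Reading coefficients this way amounts to multiplying each coefficient by the change in its variable. The Python sketch below uses only the two coefficients quoted above; the function name `net_sales_change` is made up, and the intercept and the other coefficients of the model are deliberately left out, so it estimates a change in ‘Net Sales’ rather than its absolute value.

```python
# Coefficients quoted in the text for the two variables discussed.
coefficients = {"X4": -12.62, "X5": 23.69}


def net_sales_change(deltas, coefficients):
    """Estimated change in the predicted field for the given changes in the inputs."""
    return sum(coefficients[name] * delta for name, delta in deltas.items())


# Hypothetical scenario: two more competitor stores and one more unit of store size.
change = net_sales_change({"X4": 2, "X5": 1}, coefficients)
print(round(change, 2))  # -1.55
```

In this scenario the loss from two extra competitors (2 * -12.62) slightly outweighs the gain from the extra unit of store size (23.69), holding everything else constant.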

We can use the results from our model to forecast ‘Net Sales’ if the input variables (Sq Ft, Amt on Advertising etc) were different using the below Splunk search:

We used makeresults to supply our own values for the input variables. Once the fields were defined, we used the MLTK apply command to output the predicted value of ‘Net Sales’ given the new values of the input variables. The apply command takes what the model learnt from the CSV dataset and applies it to new information. We used the ‘as’ keyword to rename the predicted field to ‘Predicted_Net_Sales’. From the screenshot below we can observe that 11.5 on Advertising, 700 on Inventory, 20 competing stores nearby and 5.8 square feet of space predict Net Sales of approximately 306. Please note that all monetary variables are in $1,000.

Summary

So to recap, we took the following steps to answer our question about the data:

Uploaded the sample data set

Populated the model fields

Selected an algorithm

Fit the model

Analyzed the results

The Machine Learning Toolkit simplifies the steps for data preparation, reduces the steps needed to create a model, and saves the history of the models we have run and tested. We can review the data before applying the algorithms, allowing the user to standardize and adjust it using MLTK capabilities or Splunk queries. The resulting statistics of the ‘Predict Numeric Fields’ assistant allow us to understand the dataset using machine learning.