Algorithms in the Machine Learning Toolkit

The Splunk Machine Learning Toolkit supports the algorithms listed here. In addition to the examples included in the Splunk Machine Learning Toolkit, you can find more examples of these algorithms on the scikit-learn website.

The following algorithms use the fit and apply commands within the Splunk Machine Learning Toolkit. For information on the steps taken by these commands, please review the Understanding the fit and apply commands document.

ML-SPL Performance App

Extend the algorithms you can use for your models

The algorithms listed here and in the ML-SPL Quick Reference Guide are available natively in the Splunk Machine Learning Toolkit. You can also base your algorithm on over 300 open source Python algorithms from the scikit-learn, pandas, statsmodels, NumPy, and SciPy libraries available through the Python for Scientific Computing add-on in Splunkbase.

For information on how to import an algorithm from the Python for Scientific Computing add-on into the Splunk Machine Learning Toolkit, see the ML-SPL API Guide.

Add algorithms through GitHub

On-prem customers looking for solutions that fall outside of the 30 native algorithms can use GitHub to add more algorithms. Solve custom use cases through sharing and reusing algorithms in the Splunk Community for MLTK on GitHub. Here you can also learn about new machine learning algorithms from the Splunk open source community, and help fellow users of the toolkit.

Cloud customers can also use GitHub to add more algorithms via an app. The Splunk GitHub for Machine learning app provides access to custom algorithms and is based on the Machine Learning Toolkit open source repo. Cloud customers need to create a support ticket to have this app installed.

Anomaly Detection

DensityFunction

The DensityFunction algorithm provides a consistent and streamlined workflow to create and store density functions and utilize them for anomaly detection. DensityFunction allows for grouping of the data using the by clause, where for each group a separate density function is fitted and stored.

Using the DensityFunction algorithm requires running version 1.4 of the Python for Scientific Computing add-on.

The accuracy of the anomaly detection for DensityFunction depends on the quality and the size of the training dataset, how accurately the fitted distribution models the underlying process that generates the data, and the value chosen for the threshold parameter.

Follow these guidelines to make your models perform more accurately:

Aim for fitted distributions to have a cardinality (training dataset size) of at least 50. If you cannot collect more training data, create fewer groups of data using the by clause, giving you more data points per group.

The threshold parameter has a default value, but ideally the value for threshold, lower_threshold, or upper_threshold is chosen through experimentation guided by domain knowledge.

Continue tuning the threshold parameter until you are satisfied with the results.

Inspect the model using the summary command.

If the distribution of the data changes through time, re-train your models frequently.
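The cardinality guideline above can be checked before fitting. The following Python sketch is illustrative only and is not part of the toolkit: the function name small_groups and the sample rows are hypothetical. It counts training rows per by-clause group and reports any group that falls below the suggested minimum of 50.

```python
from collections import Counter

MIN_CARDINALITY = 50  # guideline value from the text above

def small_groups(rows, by_fields, minimum=MIN_CARDINALITY):
    """Return {group_key: count} for groups with fewer than `minimum` rows."""
    counts = Counter(tuple(row[f] for f in by_fields) for row in rows)
    return {key: n for key, n in counts.items() if n < minimum}

# Hypothetical training rows: 60 events on Monday, 10 on Tuesday.
rows = [{"DayOfWeek": "Mon", "value": i} for i in range(60)]
rows += [{"DayOfWeek": "Tue", "value": i} for i in range(10)]

print(small_groups(rows, ["DayOfWeek"]))  # only the Tuesday group is reported
```

If a group is reported, either collect more data for it or use fewer by-clause fields so that each group receives more data points.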

The dist parameter default is auto. When set to auto, norm (normal distribution), expon (exponential distribution), and gaussian_kde (Gaussian KDE distribution) all run, with the best results returned.

The metric parameter calculates the distance between the sampled dataset from the density function and the training dataset.

Valid metrics for the metric parameter include: kolmogorov_smirnov and wasserstein.

The metric parameter default is wasserstein.
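To make the two metrics concrete, the following Python sketch computes both distances between a training sample and a sample drawn from a fitted density. It uses the textbook definitions for one-dimensional data and is an assumption about what the metrics measure, not the toolkit's own implementation. The function names and data are hypothetical.

```python
def wasserstein_1d(a, b):
    """First Wasserstein distance between two equal-sized 1-D samples."""
    if len(a) != len(b):
        raise ValueError("this simplified form needs equal-sized samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def kolmogorov_smirnov(a, b):
    """Largest gap between the two empirical cumulative distributions."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))

training = [1.0, 2.0, 3.0, 4.0]
sampled = [1.5, 2.5, 3.5, 4.5]
print(wasserstein_1d(training, sampled))      # 0.5
print(kolmogorov_smirnov(training, sampled))  # 0.25
```

In both cases a smaller distance means the sampled data resembles the training data more closely.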

The sample parameter can be used during fit or apply stages.

The sample parameter default is False.

If the sample parameter is set to True during the fit stage, the size of the samples will be equal to the training dataset.

If the sample parameter is set to True during the apply stage, the size of the samples will be equal to the testing dataset.

If the sample parameter is set to True:

Samples are taken from the fitted density function.

Results output in a new column called SampledValue.

Sampled values only come from the inlier region of the distribution.

The full_sample parameter can be used during fit or apply stages.

The full_sample parameter default is False.

If the full_sample parameter is set to True during the fit stage, the size of the samples will be equal to the training dataset.

If the full_sample parameter is set to True during the apply stage, the size of the samples will be equal to the testing dataset.

Version 4.4.0 of the MLTK and above support min and max values in summary.

The min value is the minimum value of the dataset on which the density function is fitted.

The max value is the maximum value of the dataset on which the density function is fitted.

The cardinality value generated by the summary command represents the number of data points used when fitting the selected density function.

The distance value generated by the summary command represents the metric type used when calculating the distance as well as the distance between the sampled data points from the density function and the training dataset.

The mean value generated by the summary command is the mean of the density function.

The value for std generated by the summary command represents the standard deviation of the density function.

A value under other represents any parameters other than mean and std as applicable. In the case of Gaussian KDE, other could show parameter size or bandwidth.

The type field generated by the summary command shows both the chosen density function as well as if the dist parameter is set to auto.

The show_density parameter default is False. If the parameter is set to True, the density of each data point will be provided as output in a new field called ProbabilityDensity.

The output for ProbabilityDensity is the probability density of the data point according to the fitted probability density. This output is provided when the show_density parameter is set to True.

The fit command fits a probability density function over the data, optionally stores the resulting distribution's parameters in a model file, and outputs the outlier labels in a new field called IsOutlier.

The output for IsOutlier is a list of labels. Number 1 represents outliers, and 0 represents inliers, assigned to each data point. Outliers are detected based on the values set for the threshold parameter. Inspect the IsOutlier results column to see how well the outlier detection is performing.

The parameters threshold, lower_threshold, and upper_threshold control the outlier detection process.

The threshold parameter is the center of the outlier detection process. It represents the percentage of the area under the density function and has a value between 0.000000001 (refers to ~0%) and 1 (refers to 100%). The threshold parameter guides the DensityFunction algorithm to mark outlier areas on the fitted distribution. For example, if threshold=0.01, then 1% of the fitted density function will be set as the outlier area.

The threshold parameter default value is 0.01.
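The relationship between threshold and the outlier area can be sketched for a normal density. The following Python example is a minimal illustration that assumes the outlier area is split equally between the left and right tails. The function name is hypothetical and the code does not reproduce the toolkit's internals.

```python
from statistics import NormalDist

def normal_outlier_boundaries(mean, std, threshold=0.01):
    """Split `threshold` equally between the two tails of a normal density."""
    dist = NormalDist(mean, std)
    left_close = dist.inv_cdf(threshold / 2)      # end of the left outlier area
    right_open = dist.inv_cdf(1 - threshold / 2)  # start of the right outlier area
    return left_close, right_open

left, right = normal_outlier_boundaries(mean=0.0, std=1.0, threshold=0.01)
print(round(left, 3), round(right, 3))  # -2.576 2.576
```

With threshold=0.01, the two tails together cover 1% of the area under the density, so points beyond roughly 2.58 standard deviations from the mean fall in the outlier area.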

The threshold, lower_threshold, and upper_threshold parameters can take multiple values.

Multiple values must be in quotation marks and separated by commas.

In cases of multiple values for threshold, the default maximum is 5. Users with access permissions can change this default maximum under the Settings tab.

In cases of multiple values, you are limited to one type of threshold (threshold, lower_threshold, or upper_threshold).

The output for BoundaryRanges is the boundary ranges of outliers on the density function which are set according to the values of the threshold parameter.

In cases of a single boundary region, the value for the percentage of boundary region is equal to the threshold parameter value.

In some distributions (for example, Gaussian KDE), the sum of the outlier areas might not add up to the exact value of the threshold parameter, but will be a close approximation.

BoundaryRanges is calculated as an approximation and will be empty in the following two cases:

Where the density function has a sharp peak from low standard deviation.

When there is a low number of data points.

Data points that are exactly at the boundary opening or closing point are assigned as inliers. An opening or closing point is determined by the density function in use.

Normal density function has left and right boundary regions. Data points on the left of the left boundary closing point, and data points on the right of the right boundary opening point are assigned as outliers.

Exponential density function has one boundary region. Data points on the right of the right boundary opening point are assigned as outliers.

Gaussian KDE density function can have one or more boundary regions, depending on the number of peaks and dips within the density function. Data points in these boundary regions are assigned as outliers. In cases of boundary regions to the left or right, guidelines from Normal density function apply. As the shape for Gaussian KDE density function can differ from dataset to dataset, you do not consistently observe left and right boundary regions.
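The boundary rules for the Normal and Exponential density functions can be expressed as simple comparisons. The following Python sketch is illustrative. The function names are hypothetical, and the exponential density is assumed to start at zero with a given rate. Points exactly at a boundary point are treated as inliers, as described above.

```python
import math

def is_outlier_normal(x, left_close, right_open):
    """1 if x lies strictly outside the two boundary points, else 0."""
    return 1 if (x < left_close or x > right_open) else 0

def exponential_right_open(rate, threshold):
    """Point where the right tail of an exponential density has area `threshold`."""
    return -math.log(threshold) / rate

def is_outlier_exponential(x, right_open):
    """1 if x lies strictly to the right of the boundary opening point, else 0."""
    return 1 if x > right_open else 0

boundary = exponential_right_open(rate=1.0, threshold=0.01)  # about 4.605
print([is_outlier_exponential(x, boundary) for x in (1.0, boundary, 10.0)])    # [0, 0, 1]
print([is_outlier_normal(x, -2.0, 2.0) for x in (-3.0, -2.0, 0.0, 2.0, 2.5)])  # [1, 0, 0, 0, 1]
```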

You can apply the saved model to new data with the apply command, with the option to update the parameters for threshold, lower_threshold, upper_threshold, and show_density. Parameters for dist and metric cannot be applied at this stage, and any new values provided will be ignored.

You can inspect the model learned by DensityFunction with the summary command. Version 4.4.0 of the MLTK or above supports min and max values in the summary command.

| summary <model name>

Syntax constraints

Fields within the by clause must be given in quotation marks.

The maximum number of fields within the by clause is 5.

The total number of groups calculated with the by clause cannot exceed 1024. In an example clause of by "DayOfWeek,HourOfDay" there are two fields: one for DayOfWeek and one for HourOfDay. As there are seven days in a week, there are seven groups for DayOfWeek. As there are twenty-four hours in a day, there are twenty-four groups for HourOfDay. The total number of groups calculated with the by clause is therefore 7*24 = 168.

The limited number of groups prevents model files from growing too large. You can increase the limit by changing the value of max_groups in the DensityFunction settings. Larger limits mean larger model files and longer load times when running apply.

Decrease max_kde_parameter_size to allow for the increase of max_groups. This change keeps model sizes small while allowing for increased groups.
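The group arithmetic above generalizes to any number of by-clause fields. The following Python sketch is a simple illustration with a hypothetical helper name. It multiplies the number of distinct values per field and compares the result with the default limit of 1024.

```python
from math import prod

MAX_GROUPS = 1024  # default limit described above

def total_groups(distinct_values_per_field):
    """Upper bound on by-clause groups: product of distinct values per field."""
    return prod(distinct_values_per_field.values())

fields = {"DayOfWeek": 7, "HourOfDay": 24}
groups = total_groups(fields)
print(groups, groups <= MAX_GROUPS)  # 168 True
```

Adding a third field with more than six distinct values would push this example past the default limit, because 168 * 7 = 1176.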

The parameters threshold, lower_threshold, and upper_threshold must be within the range of 0.00000001 to 1.

If the parameters of lower_threshold and upper_threshold are both provided, the summation of these parameters must be less than 1 (100%).

The threshold and lower_threshold / upper_threshold parameters cannot be specified together.

The threshold, lower_threshold, and upper_threshold parameters can take multiple values, but in these cases you are limited to one type of threshold (threshold, lower_threshold, or upper_threshold).

Exponential density function and Gaussian KDE density function support only the threshold parameter. They do not support lower_threshold or upper_threshold.
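The threshold constraints in this section can be summarized as a set of checks. The following Python sketch is a hypothetical validator, not toolkit code. It assumes the lower bound of 0.00000001 quoted above, and it applies the default maximum of 5 values to every threshold type, which is also an assumption.

```python
MIN_VALUE = 0.00000001  # lower bound quoted in the constraint above
MAX_VALUES = 5          # assumed to apply to each threshold type

def parse_values(text):
    """Parse a quoted, comma-separated value list such as "0.1,0.05"."""
    return [float(part) for part in text.strip('"').split(",")]

def validate(dist="norm", threshold=None, lower_threshold=None, upper_threshold=None):
    """Return a list of constraint violations (empty when the input is valid)."""
    errors = []
    given = {"threshold": threshold, "lower_threshold": lower_threshold,
             "upper_threshold": upper_threshold}
    values = {name: parse_values(text) for name, text in given.items() if text is not None}
    for name, nums in values.items():
        if any(not MIN_VALUE <= v <= 1 for v in nums):
            errors.append(f"{name} out of range")
        if len(nums) > MAX_VALUES:
            errors.append(f"{name} has too many values")
    sided = [n for n in ("lower_threshold", "upper_threshold") if n in values]
    if "threshold" in values and sided:
        errors.append("threshold cannot be combined with lower/upper thresholds")
    if len(values) > 1 and any(len(nums) > 1 for nums in values.values()):
        errors.append("multiple values allowed for one threshold type only")
    if len(sided) == 2 and values["lower_threshold"][0] + values["upper_threshold"][0] >= 1:
        errors.append("lower_threshold + upper_threshold must be less than 1")
    if dist in ("expon", "gaussian_kde") and sided:
        errors.append(f"{dist} supports only threshold")
    return errors

print(validate(threshold='"0.1,0.05,0.01"'))                   # []
print(validate(lower_threshold="0.6", upper_threshold="0.5"))  # sum is not below 1
print(validate(dist="expon", upper_threshold="0.01"))          # unsupported for expon
```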

The following example shows DensityFunction on a dataset with the summary command. This example includes min and max values, which are supported in version 4.4.0 of the MLTK.

| summary mymodel

The following example shows BoundaryRanges on a test set. In this example the threshold is set to 30% (0.3). The first row has a left boundary range which starts at -Infinity and goes up to the number 44.6912. The area of the left boundary range is 15% of the total area under the density function. It also has a right boundary range which starts at the number 518.3088 and goes up to Infinity. Again, the area of the right boundary range is the same as the left boundary range, with 15% of the total area under the density function. The areas of the right and left boundary ranges add up to the threshold value of 30%. The third row has only one boundary range, which starts at the number 300.0943 and goes up to Infinity. The area of the boundary range is 30% of the area under the density function.
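The numbers quoted for the first row can be cross-checked with a short calculation. The following Python sketch assumes that the first row was fitted with a normal density, which the text does not state explicitly. The mean and standard deviation are derived from the two boundary points for illustration and are not values taken from the model.

```python
from statistics import NormalDist

left_close, right_open = 44.6912, 518.3088  # boundary points quoted above
threshold = 0.3

# For a symmetric density, the mean sits midway between the two boundary points.
mean = (left_close + right_open) / 2
# Each tail holds threshold / 2 of the area, which fixes the standard deviation.
std = (right_open - mean) / NormalDist().inv_cdf(1 - threshold / 2)

fitted = NormalDist(mean, std)
left_area = fitted.cdf(left_close)
right_area = 1 - fitted.cdf(right_open)
print(round(mean, 1), round(left_area, 2), round(right_area, 2))  # 281.5 0.15 0.15
```

Under that assumption, each tail covers 15% of the area and the two tails together match the threshold of 30%.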

LocalOutlierFactor

The LocalOutlierFactor algorithm uses the scikit-learn Local Outlier Factor (LOF) to measure the local deviation of density of a given sample with respect to its neighbors. LocalOutlierFactor is an unsupervised outlier detection method. The anomaly score depends on how isolated the object is with respect to its neighbors.

OneClassSVM

The OneClassSVM algorithm uses the scikit-learn OneClassSVM to fit a model from a set of features or fields for detecting anomalies and outliers, where features are expected to contain numerical values. OneClassSVM is an unsupervised outlier detection method.

Classifiers

Classifier algorithms predict the value of a categorical field.

The kfold_cv parameter for K-fold cross-validation can be used with all Classifier algorithms. For details, see the K-fold cross-validation section of this document.

BernoulliNB

The BernoulliNB algorithm uses the scikit-learn BernoulliNB estimator to fit a model to predict the value of categorical fields where explanatory variables are assumed to be binary-valued. BernoulliNB is an implementation of the Naive Bayes classification algorithm. This algorithm supports incremental fit.

You can save BernoulliNB models using the into keyword and apply the saved model later to new data using the apply command.

... | apply TESTMODEL_BernoulliNB

You can inspect the model learned by BernoulliNB with the summary command, and view the class and log probability information calculated from the dataset.

... | summary My_Incremental_Model

Syntax constraints

The partial_fit parameter controls whether an existing model should be incrementally updated or not. The default value is False, meaning it will not be incrementally updated. Choosing partial_fit=True allows you to update an existing model using only new data without having to retrain it on the full training data set.

Using partial_fit=True on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead. If partial_fit=False or partial_fit is not specified (default is False), the model specified is created and replaces the pre-trained model if one exists.

If My_Incremental_Model does not exist, the command saves the model data under the model name My_Incremental_Model. If My_Incremental_Model exists and was trained using BernoulliNB, the command updates the existing model with the new input. If My_Incremental_Model exists but was not trained by BernoulliNB, an error message displays.
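The partial_fit rules above amount to a small decision procedure. The following Python sketch is a hypothetical summary of those rules, not the toolkit's implementation. The function name and the returned action labels are invented for illustration.

```python
def resolve_fit_action(existing_algo, requested_algo, partial_fit=False):
    """Decide what a fit call does to a named model.

    existing_algo is None when no model with that name has been saved yet.
    """
    if not partial_fit or existing_algo is None:
        return "create"  # new model, or replace any pre-trained model
    if existing_algo == requested_algo:
        return "update"  # incremental update; newly supplied parameters are ignored
    raise ValueError(f"model was trained with {existing_algo}, not {requested_algo}")

print(resolve_fit_action(None, "BernoulliNB", partial_fit=True))           # create
print(resolve_fit_action("BernoulliNB", "BernoulliNB", partial_fit=True))  # update
print(resolve_fit_action("BernoulliNB", "BernoulliNB"))                    # create
```

Calling the function with partial_fit=True and a model trained by a different algorithm raises an error, which mirrors the error message described above.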

GaussianNB

The GaussianNB algorithm uses the scikit-learn GaussianNB estimator to fit a model to predict the value of categorical fields, where the likelihood of explanatory variables is assumed to be Gaussian. GaussianNB is an implementation of Gaussian Naive Bayes classification algorithm. This algorithm supports incremental fit.

Parameters

The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set.

You can save GaussianNB models using the into keyword and apply the saved model later to new data using the apply command.

... | apply TESTMODEL_GaussianNB

You can inspect models learned by GaussianNB with the summary command.

... | summary My_Incremental_Model

Syntax constraints

If My_Incremental_Model does not exist, the command saves the model data under the model name My_Incremental_Model. If My_Incremental_Model exists and was trained using GaussianNB, the command updates the existing model with the new input. If My_Incremental_Model exists but was not trained by GaussianNB, an error message is thrown.

If partial_fit=False or partial_fit is not specified, the model specified is created and replaces the pre-trained model if one exists.

MLPClassifier

Using the MLPClassifier algorithm requires running version 1.3 or above of the Python for Scientific Computing add-on.

Parameters

The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set.

The partial_fit parameter default is False.

The hidden_layer_sizes parameter format (int) varies based on the number of hidden layers in the data.

SGDClassifier

The SGDClassifier algorithm uses the scikit-learn SGDClassifier estimator to fit a model to predict the value of categorical fields. This algorithm supports incremental fit.

Parameters

The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set.

The partial_fit parameter default is False.

n_iter=<int> is the number of passes over the training data also known as epochs. The default is 5. The number of iterations is set to 1 if using partial_fit.

The loss=<hinge|log|modified_huber|squared_hinge|perceptron> parameter is the loss function to be used.

Defaults to hinge, which gives a linear SVM.

The log loss gives logistic regression, a probabilistic classifier.

modified_huber is another smooth loss that brings tolerance to outliers as well as probability estimates.

squared_hinge is like hinge but is quadratically penalized.

perceptron is the linear loss used by the perceptron algorithm.
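The loss options can be compared by writing each one as a function of the margin. The following Python sketch uses the standard textbook formulas for these losses and is an illustration only. The LOSSES dictionary is a hypothetical name and the code is not taken from scikit-learn or the toolkit.

```python
import math

# Each loss takes the margin z = y * f(x), where y is +1 or -1 and
# f(x) is the raw linear score for the sample.
LOSSES = {
    "hinge": lambda z: max(0.0, 1.0 - z),
    "log": lambda z: math.log1p(math.exp(-z)),
    "modified_huber": lambda z: max(0.0, 1.0 - z) ** 2 if z >= -1.0 else -4.0 * z,
    "squared_hinge": lambda z: max(0.0, 1.0 - z) ** 2,
    "perceptron": lambda z: max(0.0, -z),
}

for name, loss in LOSSES.items():
    # Loss for a confidently wrong, a borderline, and a confidently right sample.
    print(name, [round(loss(z), 3) for z in (-2.0, 0.0, 2.0)])
```

For the badly misclassified sample (z = -2), modified_huber gives 8 while squared_hinge gives 9. The gap widens as the margin becomes more negative, which is why modified_huber is more tolerant of outliers.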

The fit_intercept=<true|false> parameter specifies whether the intercept should be estimated or not. The default is True.

penalty=<l2|l1|elasticnet> is the penalty, also known as regularization term, to be used. The default is l2.

SVM

Kernel-based methods such as the scikit-learn SVC tend to work best when the data is scaled, for example, using our StandardScaler algorithm:
| fit StandardScaler <fields> into scaling_model | fit SVM <response> from <fields> into svm_model
For details, see "A Practical Guide to Support Vector Classification" at https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.

Parameters

The gamma parameter controls the width of the rbf kernel. The default value is 1 / number of fields.

The C parameter controls the degree of regularization when fitting the model. The default value is 1.0.
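The effect of gamma on the rbf kernel can be seen in a few lines. The following Python sketch implements the standard RBF kernel formula with the default of 1 / number of fields described above. The function name is hypothetical and the code is only an illustration.

```python
import math

def rbf_kernel(x, y, gamma=None):
    """RBF kernel exp(-gamma * ||x - y||^2); gamma defaults to 1 / number of fields."""
    if gamma is None:
        gamma = 1.0 / len(x)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

a, b = [0.0, 0.0], [1.0, 1.0]
print(rbf_kernel(a, a))                       # 1.0, identical points
print(round(rbf_kernel(a, b), 4))             # 0.3679 with the default gamma of 1/2
print(round(rbf_kernel(a, b, gamma=5.0), 6))  # larger gamma, narrower kernel
```

A larger gamma makes the similarity fall off faster with distance, so each training point influences a smaller neighborhood.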

You can save SVM models using the into keyword and apply the saved model to new data later using the apply command.

... | apply sla_model

Syntax constraints

You cannot inspect the model learned by SVM with the summary command.

Example

The following example uses SVM on a test set.

... | fit SVM SLA_violation from * into sla_model | ...

Clustering Algorithms

Clustering is the grouping of data points. Results will vary depending upon the clustering algorithm used. Clustering algorithms differ in how they determine if data points are similar and should be grouped. For example, the K-means algorithm clusters based on points in space, whereas the DBSCAN algorithm clusters based on local density.

Birch

The Birch algorithm uses the scikit-learn Birch clustering algorithm to divide data points into a set of distinct clusters. The cluster for each event is set in a new field named cluster. This algorithm supports incremental fit.

Parameters

The k parameter specifies the number of clusters to divide the data into after the final clustering step, which treats the sub-clusters from the leaves of the CF tree as new samples.

By default, the cluster label field name is cluster. Change that behavior by using the as keyword to specify a different field name.

The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set.

DBSCAN

The DBSCAN algorithm uses the scikit-learn DBSCAN clustering algorithm to divide a result set into distinct clusters. The cluster for each event is set in a new field named cluster. DBSCAN is distinct from K-Means in that it clusters results based on local density, and uncovers a variable number of clusters, whereas K-Means finds a precise number of clusters. For example, k=5 finds 5 clusters.

Parameters

The eps parameter specifies the maximum distance between two samples for them to be considered in the same cluster.

By default, the cluster label field name is cluster. Change that behavior by using the as keyword to specify a different field name.

The min_samples parameter defines the number of samples, or the total weight, in a neighborhood for a point to be considered as a core point - including the point itself. You can choose the min_samples parameter's best value based on preference for cluster density or noise in your dataset.

The min_samples parameter is optional.

The min_samples default value is 5.

The minimum value for the min_samples parameter is 3.

If min_samples=8 you need at least 8 data points to form a dense cluster.

If you choose the min_samples parameter's best value based on noise in your dataset, it's recommended to have a larger data set to pull from.
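The interaction between eps and min_samples can be illustrated by counting neighbors. The following Python sketch identifies core points only and does not perform the full clustering. The function name and data are hypothetical, and Euclidean distance is assumed.

```python
import math

def core_points(points, eps, min_samples=5):
    """Return the points with at least min_samples neighbors within eps.

    The count includes the point itself, as described above.
    """
    cores = []
    for p in points:
        neighbors = sum(math.dist(p, q) <= eps for q in points)
        if neighbors >= min_samples:
            cores.append(p)
    return cores

# Five tightly packed points plus one isolated point (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (5.0, 5.0)]
print(len(core_points(points, eps=0.5, min_samples=5)))  # 5
print(len(core_points(points, eps=0.5, min_samples=6)))  # 0
```

Raising min_samples from 5 to 6 removes every core point in this small dataset, which shows why larger values call for more data.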

Syntax

| fit DBSCAN <fields> [eps=<number>] [min_samples=<integer>]

Syntax constraints

You cannot save DBSCAN models using the into keyword. To predict cluster assignments for future data, combine the DBSCAN algorithm with any classifier algorithm. For example, first cluster the data using DBSCAN, then fit RandomForestClassifier to predict the cluster.

K-means

K-means clustering is a type of unsupervised learning. It is a clustering algorithm that groups similar data points, with the number of groups represented by the variable k. The K-means algorithm uses the scikit-learn K-means implementation. The cluster for each event is set in a new field named cluster. Use the K-means algorithm when you have unlabeled data and have at least approximate knowledge of the total number of groups into which the data can be divided.

The k parameter specifies the number of clusters to divide the data into. By default, the cluster label field name is cluster. Change that behavior by using the as keyword to specify a different field name.

You can save K-means models using the into keyword when using the fit command.

You can apply the model to new data using the apply command.

... | apply cluster_model

You can inspect the model using the summary command.

... | summary cluster_model

Example

The following example uses K-means on a test set.

... | fit KMeans * k=3 | stats count by cluster

SpectralClustering

The SpectralClustering algorithm uses the scikit-learn SpectralClustering clustering algorithm to divide a result set into a set of distinct clusters. SpectralClustering first transforms the input data using the Radial Basis Function (rbf) kernel, and then performs K-Means clustering on the result. Consequently, SpectralClustering can learn clusters with a non-convex shape. The cluster for each event is set in a new field named cluster.

Parameters

The k parameter specifies the number of clusters to divide the data into after the kernel step. By default, the cluster label field name is cluster. Change that behavior by using the as keyword to specify a different field name.

You cannot save SpectralClustering models using the into keyword. If you want to be able to predict cluster assignments for future data, you can combine the SpectralClustering algorithm with any classifier algorithm. For example, first cluster the data using SpectralClustering, then fit a classifier such as RandomForestClassifier to predict the cluster.

Example

The following example uses SpectralClustering on a test set.

... | fit SpectralClustering * k=3 | stats count by cluster

X-means

Use the X-means algorithm when you have unlabeled data and no prior knowledge of the total number of labels into which that data could be divided. The X-means clustering algorithm is an extended K-means that automatically determines the number of clusters based on Bayesian Information Criterion (BIC) scores. Starting with a single cluster, the X-means algorithm goes into action after each run of K-means, making local decisions about which subset of the current centroids should split themselves in order to fit the data better.

The cluster for each event is set in a new field named cluster, and the total number of clusters is set in a new field named n_clusters.

By default, the cluster label field name is cluster. Change that behavior by using the as keyword to specify a different field name.

Syntax

fit XMeans <fields> [into <model name>]

You can apply new data to the saved X-means model using the apply command.

... | apply cluster_model

You can save X-means models using the into keyword. You can inspect the model learned by X-means with the summary command.

... | summary cluster_model

Example

The following example uses X-means on a test set.

... | fit XMeans * | stats count by cluster

Cross-validation

Cross-validation assesses how well a statistical model generalizes on an independent dataset. Cross-validation tells you how well your machine learning model is expected to perform on data that it has not been trained on. There are many types of cross-validation, but K-fold cross-validation (kfold_cv) is one of the most common.

Cross-validation is typically used for the following machine learning scenarios:

Comparing two or more algorithms against each other for selecting the best choice on a particular dataset.

Comparing different choices of hyper-parameters on the same algorithm for choosing the best hyper-parameters for a particular dataset.

An improved method over a train/test split for quantifying model generalization.

Cross-validation is not well suited for time-series data:

In situations where the data is ordered such as time-series, cross-validation is not well suited because the training data is shuffled. In these situations, other methods such as Forward Chaining are more suitable.

K-fold cross-validation

In the kfold_cv parameter, the training set is randomly partitioned into k equal-sized subsamples. Then, each sub-sample takes a turn at becoming the validation (test) set, predicted by the other k-1 training sets. Each sample is used exactly once in the validation set, and the variance of the resulting estimate is reduced as k is increased. The disadvantage of the kfold_cv parameter is that k different models have to be trained, leading to long execution times for large datasets and complex models.

The scores obtained from K-fold cross-validation are generally a less biased and less optimistic estimate of the model performance than a standard training and testing split.

You can obtain k performance metrics, one for each training and testing split. These k performance metrics can then be averaged to obtain a single estimate of how well the model generalizes on unseen data.
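The partitioning and averaging steps can be sketched as follows. This Python example is a simplified illustration of K-fold splitting and is not the toolkit's implementation. The function names, the seed, and the fold scores are hypothetical.

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Yield (train, validation) index lists for k randomly partitioned folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def average(scores):
    """Combine the k per-fold metrics into a single estimate."""
    return sum(scores) / len(scores)

splits = list(kfold_indices(n_samples=10, k=5))
seen = sorted(idx for _, validation in splits for idx in validation)
print(seen == list(range(10)))               # True: each sample validates exactly once
print(average([0.8, 0.9, 0.85, 0.95, 0.9]))  # mean of hypothetical fold scores
```

Each of the five splits trains on eight samples and validates on the remaining two, so five separate models are trained in total.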

Syntax

The kfold_cv parameter is applicable to all classification and regression algorithms, and you can append the parameter to the end of an SPL search.

Here kfold_cv=<int> specifies that k=<int> folds is used. When you specify a classification algorithm, stratified k-fold is used instead of k-fold. In stratified k-fold, each fold contains approximately the same percentage of samples for each class.
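Stratification can be illustrated by dealing the samples of each class across the folds in turn. The following Python sketch is a simplified, unshuffled illustration with a hypothetical function name. It does not reproduce the scikit-learn or toolkit implementation.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Hypothetical imbalanced labels: 8 samples of class "a", 4 of class "b".
labels = ["a"] * 8 + ["b"] * 4
for fold in stratified_folds(labels, k=4):
    print(Counter(labels[i] for i in fold))  # every fold holds 2 "a" and 1 "b"
```

Each fold preserves the 2:1 class ratio of the full dataset, which a purely random split of 12 samples would not guarantee.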

Output

The kfold_cv parameter returns performance metrics on each fold using the same model specified in the SPL, including algorithm and hyperparameters. Its only function is to give you insight into how well your model generalizes. It does not perform any model selection or hyperparameter tuning.

Examples

The first example shows the kfold_cv parameter used in classification. The output is a set of metrics for each fold, including accuracy, f1_weighted, precision_weighted, and recall_weighted.

The second example shows the kfold_cv parameter used in regression. The output is a set of metrics for each fold, including neg_mean_squared_error and r^2.

HashingVectorizer

The HashingVectorizer algorithm converts text documents to a matrix of token occurrences. It uses a feature hashing strategy to allow for hash collisions when measuring the occurrence of tokens. It is a stateless transformer, meaning that it does not require building a vocabulary of the seen tokens. This reduces the memory footprint and allows for larger feature spaces.

HashingVectorizer is comparable with the TFIDF algorithm, as they share many of the same parameters. However, HashingVectorizer is a better option for building models with large text fields, provided you do not need to know term frequencies and only want outcomes.
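The feature hashing strategy can be sketched in a few lines. The following Python example is a minimal illustration that uses MD5 from the standard library as a stand-in hash function. The function name and the n_features value are hypothetical, and the code does not reproduce the scikit-learn implementation.

```python
import hashlib

def hashing_vectorize(text, n_features=16):
    """Map a document to a fixed-length vector of token counts via hashing."""
    vector = [0] * n_features
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        column = int.from_bytes(digest[:4], "big") % n_features
        vector[column] += 1  # distinct tokens may collide in the same column
    return vector

doc = "error disk full error"
vector = hashing_vectorize(doc)
print(len(vector), sum(vector))  # 16 4
```

No vocabulary is stored: the column for a token is computed from the token itself, which is why the transformer is stateless and why the original tokens cannot be recovered from the output.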

ICA

ICA (Independent component analysis) separates a multivariate signal into additive sub-components that are maximally independent. Typically, ICA is not used for reducing dimensionality, but for separating superimposed signals. Because the ICA model does not include a noise term, whitening must be applied for the model to be correct. Whitening can be done internally using the whiten argument, or manually using one of the PCA variants.

Parameters

The n_components parameter determines the number of components ICA uses.

The n_components parameter is optional.

The n_components parameter default is None. If None is selected, all components are used.

Use the algorithm parameter to apply parallel or deflation algorithm for FastICA.

The algorithm parameter default is algorithm='parallel'.

Use the whiten parameter to set a noise term.

The whiten parameter is optional.

If the whiten parameter is False no whitening is performed.

The whiten parameter default is True.

The max_iter parameter determines the maximum number of iterations during the running of the fit command.

The max_iter parameter is optional.

The max_iter parameter default is 200.

The fun parameter determines the functional form of the G function used in the approximation to neg-entropy.

The fun parameter is optional.

The fun parameter default is logcosh. Other options for this parameter are exp or cube.

The tol parameter sets the tolerance on update at each iteration.

The tol parameter is optional.

The tol parameter default is 0.0001.

The random_state parameter sets the seed value used by the random number generator.

You can save ICA models using the into keyword and apply the saved model to new data later using the apply command.

Syntax constraints

You cannot inspect the model learned by ICA with the summary command.

Example

The following example shows how ICA is able to find the two original sources of data from two measurements that have mixes of both. As a comparison, PCA is used to show the difference between the two – PCA is not able to identify the original sources.

KernelPCA

The KernelPCA algorithm uses the scikit-learn KernelPCA to reduce the number of fields by extracting uncorrelated new features out of data. The difference between KernelPCA and PCA is the use of kernels in the former, which helps with finding nonlinear dependencies among the fields. Currently, KernelPCA only supports the Radial Basis Function (rbf) kernel.

Kernel-based methods such as KernelPCA tend to work best when the data is scaled, for example, using our StandardScaler algorithm: | fit StandardScaler <fields> into scaling_model | fit KernelPCA <fields> into kpca_model. For details, see "A Practical Guide to Support Vector Classification" at https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.

Parameters

The k parameter specifies the number of features to be extracted from the data. The other parameters are for fine tuning of the kernel.

You can save KernelPCA models using the into keyword and apply the saved model to new data later using the apply command.

... | apply user_feedback_model

Syntax constraints

You cannot inspect the model learned by KernelPCA with the summary command.

Example

The following example uses KernelPCA on a test set.

... | fit KernelPCA * k=3 gamma=0.001 | ...
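The gamma value in the example above controls the width of the rbf kernel. As a minimal sketch, the standard Radial Basis Function kernel between two points can be computed as follows. The function name rbf_kernel is hypothetical and the code is not taken from the toolkit.

```python
import math


def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


if __name__ == "__main__":
    # With a small gamma such as 0.001, distant points still look similar.
    print(rbf_kernel([3.0, 4.0], [0.0, 0.0], gamma=0.001))
    # A larger gamma makes the similarity fall off much faster.
    print(rbf_kernel([3.0, 4.0], [0.0, 0.0], gamma=1.0))
```

Because the kernel depends on distances between field values, fields with large numeric ranges dominate unless the data is scaled first.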

NPR

The Normalized Perlich Ratio (NPR) algorithm converts high cardinality categorical field values into numeric field entries while intelligently handling space optimization. NPR offers low computational costs to perform feature extraction on variables with high cardinalities such as ZIP codes or IP addresses.

Unlike other algorithms that use the fit and apply commands, NPR does not perform one-hot encoding.

Parameters

Use the summary command to inspect the variance information of the saved model.

After running NPR the transformed dataset has calculated ratios for all feature variables (feature_field). Based on the training data NPR calculates a variable of X_unobserved which can be used as a replacement value in the following two scenarios:

In conjunction with the fit command NPR initially replaces missing values in the dataset for feature_field with the keyword unobserved which is then replaced by the calculated NPR value of X_unobserved.

In conjunction with the apply command, X_unobserved replaces any new value for target_field that was not seen during model training but is encountered in the test dataset.

The number of transformed columns created after running NPR is equal to the number of distinct values for feature_field within the search string.

From the saved model, use the variance output field to examine the contribution of a particular feature towards the accuracy of the prediction. Higher variance indicates highly important categorical values whereas low variance indicates the value being of lower importance towards the target prediction. Variance may assist in the process of discarding irrelevant feature variables.

Syntax

fit NPR <target_field> from <feature_field> [into <model name>]

You can couple NPR with existing MLTK algorithms to feed the transformed results to the model as a means to enhance predictions.

You can save NPR models using the into keyword and apply new data later using the apply command.

| inputlookup disk_failures.csv | tail 1000 | apply npr_disk

You can inspect the model learned by NPR with the summary command.

| summary npr_disk

Syntax constraints

The wildcard (*) character is not supported.

The maximum matrix size, calculated as |X| * |Y| where X is the feature_field and Y is the target_field, is 10000000. For example, if there are 1000 distinct categorical feature values and 100 distinct categorical target values, the matrix size is 100000.
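The matrix size check can be sketched in a few lines of Python. The function names are hypothetical, and treating the limit as inclusive is an assumption; the sketch only reproduces the arithmetic described above.

```python
MAX_MATRIX_SIZE = 10_000_000  # limit stated in the syntax constraints


def npr_matrix_size(distinct_feature_values, distinct_target_values):
    """Matrix size is |X| * |Y| for feature_field X and target_field Y."""
    return distinct_feature_values * distinct_target_values


def within_npr_limit(distinct_feature_values, distinct_target_values):
    """Assumes the limit is inclusive (size <= MAX_MATRIX_SIZE)."""
    size = npr_matrix_size(distinct_feature_values, distinct_target_values)
    return size <= MAX_MATRIX_SIZE


if __name__ == "__main__":
    print(npr_matrix_size(1000, 100))      # the example from the text
    print(within_npr_limit(1000, 100))
    print(within_npr_limit(100_000, 1000))
```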

PCA

The Principal Component Analysis (PCA) algorithm uses the scikit-learn PCA algorithm to reduce the number of fields by extracting new, uncorrelated features out of the data.

Parameters

The k parameter specifies the number of features to be extracted from the data.

The variance parameter is short for percentage variance ratio explained. This parameter determines the percentage of variance ratio explained in the principal components of the PCA. It computes the number of principal components dynamically by preserving the specified variance ratio.

The variance parameter defaults to 1 if k is not provided.

The variance parameter can take a value between 0 and 1.

The component name parameter represents the name of the selected components from the value specified in n_components.

The explained_variance parameter measures the proportion to which the principal component accounts for dispersion of a given dataset. A higher value denotes a higher variation.

The explained_variance_ratio parameter is the percentage of variance explained by each of the selected components.

The singular_values parameter represents the singular values corresponding to each of the selected components. Singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.

Syntax

fit PCA <fields> [into <model name>] [k=<int>] [variance=<float>]

You can save PCA models using the into keyword and apply new data later using the apply command.

...into example_hard_drives_PCA_2 | apply example_hard_drives_PCA_2

You can inspect the model learned by PCA with the summary command.

| summary example_hard_drives_PCA_2

Syntax constraints

The variance parameter and k parameter cannot be used together. They are mutually exclusive.

The following example includes the variance parameter. The value variance=0.5 tells the algorithm to choose as many principal components as are needed to explain 50% of the variance in the original dataset.
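As a rough sketch of how a variance threshold translates into a number of components, the following Python function walks through per-component explained variance ratios until the cumulative total reaches the threshold. The function name and the sample ratios are hypothetical, and the exact boundary handling in the underlying library may differ.

```python
def components_for_variance(explained_variance_ratio, variance):
    """Return how many leading components are needed to reach the threshold.

    explained_variance_ratio is assumed to be sorted from largest to smallest.
    """
    cumulative = 0.0
    for count, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative >= variance:
            return count
    # If the threshold is never reached, keep every component.
    return len(explained_variance_ratio)


if __name__ == "__main__":
    ratios = [0.4, 0.3, 0.2, 0.1]  # made-up ratios for illustration
    print(components_for_variance(ratios, 0.5))
    print(components_for_variance(ratios, 0.95))
```

With these made-up ratios, the first component alone explains 40% of the variance, so a threshold of 0.5 requires two components.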

TFIDF

TFIDF uses memory to create a dictionary of all terms, including ngrams and words, and expands the Splunk search events with additional fields per event. If you are concerned about memory limits, consider using the HashingVectorizer algorithm.

Parameters

The default for max_features is 100.

To configure the algorithm to ignore common English words (for example, "the", "it", "at", and "that"), set stop_words to english. For other languages (for example, machine language) you can ignore the common words by setting max_df to a value greater than or equal to 0.7 and less than 1.0.

Preprocessing (Prepare Data)

Preprocessing algorithms are used for preparing data. Other algorithms can also be used for preprocessing that may not be organized under this section. For example, PCA can be used for both Feature Extraction and Preprocessing.

Imputer

The Imputer algorithm is a preprocessing step wherein missing data is replaced with substitute values. The substitute values can be estimated, or based on other statistics or values in the dataset. To use Imputer, the user passes in the names of the fields to impute, along with arguments specifying the imputation strategy, and the values representing missing data. Imputer then adds new imputed versions of those fields to the data, which are copies of the original fields, except that their missing values are replaced by values computed according to the imputation strategy.

Parameters

Available imputation strategies include mean, median, most_frequent, and field. The default strategy is mean.

All strategies except field require numeric data. The field strategy accepts categorical data.
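The numeric strategies can be illustrated with a short Python sketch. The function name impute and the use of None as the missing-value marker are assumptions made for this example; the field strategy is left out because its behavior is not described here.

```python
from statistics import mean, median, mode

# Maps a strategy name to the statistic used as the substitute value.
STRATEGIES = {"mean": mean, "median": median, "most_frequent": mode}


def impute(values, strategy="mean", missing=None):
    """Return a copy of values with missing entries replaced.

    The original list is left unchanged, mirroring the way Imputer adds
    new imputed versions of fields rather than overwriting them.
    """
    observed = [v for v in values if v != missing]
    fill = STRATEGIES[strategy](observed)
    return [fill if v == missing else v for v in values]


if __name__ == "__main__":
    print(impute([1, None, 3]))                       # mean of 1 and 3
    print(impute([1, None, 2, 9], strategy="median"))
    print(impute([4, 4, None, 7], strategy="most_frequent"))
```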

RobustScaler

The RobustScaler algorithm uses the scikit-learn RobustScaler algorithm to standardize data fields by scaling their median and interquartile range to 0 and 1, respectively. It is very similar to the StandardScaler algorithm, in that it helps avoid dominance of one or more fields over others in subsequent machine learning algorithms, and is practically required for some algorithms, such as KernelPCA and SVM. The main difference between StandardScaler and RobustScaler is that RobustScaler is less sensitive to outliers.

Parameters

The with_centering and with_scaling parameters specify if the fields should be standardized with respect to their median and interquartile range.

You can save RobustScaler models using the into keyword and apply new data later using the apply command.

... | apply scaling_model

You can inspect the statistics extracted by RobustScaler with the summary command.

... | summary scaling_model

Syntax constraints

RobustScaler does not support incremental fit.

Example

The following example uses RobustScaler on a test set.

... | fit RobustScaler * | ...
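To show why scaling by median and interquartile range is less sensitive to outliers, here is a minimal Python sketch for a single field. The function name robust_scale is hypothetical, the quartiles use linear interpolation, and the sketch assumes the interquartile range is not zero.

```python
from statistics import median, quantiles


def robust_scale(values, with_centering=True, with_scaling=True):
    """Center on the median and divide by the interquartile range."""
    center = median(values) if with_centering else 0.0
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    scale = (q3 - q1) if with_scaling else 1.0
    return [(v - center) / scale for v in values]


if __name__ == "__main__":
    # The outlier 100 does not affect the median or the quartiles,
    # so the other values keep a sensible scale.
    print(robust_scale([1, 2, 3, 4, 100]))
```

The median of the sample is 3 and the quartiles are 2 and 4, so the four ordinary values land between -1 and 0.5 while only the outlier maps to a large value.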

StandardScaler

The StandardScaler algorithm uses the scikit-learn StandardScaler algorithm to standardize data fields by scaling their mean and standard deviation to 0 and 1, respectively. This preprocessing step helps to avoid dominance of one or more fields over others in subsequent machine learning algorithms. This step is practically required for some algorithms, such as KernelPCA and SVM. This algorithm supports incremental fit.

Parameters

The with_mean and with_std parameters specify if the fields should be standardized with respect to their mean and standard deviation.

The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set. The default is False.

You can save StandardScaler models using the into keyword and apply new data later using the apply command.

... | apply scaling_model

You can inspect the statistics extracted by StandardScaler with the summary command.

... | summary scaling_model

Syntax constraints

Using partial_fit=true on an existing model ignores the newly supplied parameters. The parameters supplied at model creation are used instead. If partial_fit=false or partial_fit is not specified (default is false), the model specified is created and replaces the pre-trained model if one exists.

If My_Incremental_Model does not exist, the command saves the model data under the model name My_Incremental_Model.

If My_Incremental_Model exists and was trained using StandardScaler, the command updates the existing model with the new input.

If My_Incremental_Model exists but was not trained by StandardScaler, an error message is thrown.
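The idea behind incremental fit can be sketched in Python with a running mean and variance that are updated one batch at a time. The class name RunningScaler is hypothetical, and the use of the population standard deviation is an assumption; this is not the toolkit's implementation.

```python
class RunningScaler:
    """Keeps a running mean and variance so new batches can update the model."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def partial_fit(self, batch):
        """Update the statistics with new data only (Welford's method)."""
        for x in batch:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)
        return self

    @property
    def std(self):
        """Population standard deviation of all data seen so far."""
        return (self.m2 / self.count) ** 0.5

    def transform(self, values):
        """Scale values to mean 0 and standard deviation 1."""
        return [(v - self.mean) / self.std for v in values]


if __name__ == "__main__":
    scaler = RunningScaler().partial_fit([1, 2]).partial_fit([3, 4])
    print(scaler.mean, scaler.std)
    print(scaler.transform([1, 2, 3, 4]))
```

Fitting on two batches gives the same mean and standard deviation as fitting on all four values at once, which is what makes updating with only new data possible.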

ElasticNet

The ElasticNet algorithm uses the scikit-learn ElasticNet estimator to fit a model to predict the value of numeric fields. ElasticNet is a linear regression model that includes both L1 and L2 regularization and is a generalization of Lasso and Ridge.

Lasso

The Lasso algorithm uses the scikit-learn Lasso estimator to fit a model to predict the value of numeric fields. Lasso is like LinearRegression, but it uses L1 regularization to learn a linear model with fewer and smaller coefficients. Lasso models are consequently more robust to noise and resilient against overfitting.

Ridge

The Ridge algorithm uses the scikit-learn Ridge estimator to fit a model to predict the value of numeric fields. Ridge is like LinearRegression, but it uses L2 regularization to learn a linear model with smaller coefficients, making the algorithm more robust to collinearity. For descriptions of the fit_intercept, normalize, and alpha parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html.

Parameters

The alpha parameter specifies the degree of regularization. The default value is 1.0.
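The effect of alpha is easiest to see in a one-feature case without an intercept. Minimizing the sum of squared errors plus alpha times the squared coefficient gives a closed-form slope, sketched below in Python. The function name ridge_slope and the sample data are hypothetical.

```python
def ridge_slope(xs, ys, alpha=1.0):
    """Slope w minimizing sum((y - w*x)**2) + alpha * w**2 (no intercept)."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + alpha)


if __name__ == "__main__":
    xs, ys = [1, 2, 3], [2, 4, 6]  # data generated by y = 2x
    for alpha in (0.0, 1.0, 14.0):
        # Larger alpha shrinks the coefficient toward zero.
        print(alpha, ridge_slope(xs, ys, alpha))
```

With alpha set to 0 the sketch recovers the ordinary least squares slope of 2, and larger values of alpha shrink the coefficient.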

SGDRegressor

The SGDRegressor algorithm uses the scikit-learn SGDRegressor estimator to fit a model to predict the value of numeric fields. This algorithm supports incremental fit.

Parameters

The partial_fit parameter controls whether an existing model should be incrementally updated or not. This allows you to update an existing model using only new data without having to retrain it on the full training data set. The default is False.

The fit_intercept=<true|false> parameter determines whether the intercept should be estimated or not.

The fit_intercept=<true|false> parameter default is True.

The n_iter=<int> parameter is the number of passes over the training data, also known as epochs. The default is 5.

The number of iterations is set to 1 if using partial_fit.

The penalty=<l2|l1|elasticnet> parameter sets the penalty or regularization term to be used. The default is l2.

The learning_rate=<constant|optimal|invscaling> parameter sets the learning rate schedule.
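The roles of n_iter, fit_intercept, and the penalty term can be illustrated with a small stochastic gradient descent loop in Python. This is a sketch under simplifying assumptions: one feature, a constant learning rate, an l2 penalty, and no shuffling. The function name sgd_fit and the eta0 and alpha arguments are illustrative and do not describe the toolkit's options.

```python
def sgd_fit(xs, ys, n_iter=5, eta0=0.01, alpha=0.0001, fit_intercept=True):
    """Fit y ~ w*x + b with per-sample gradient steps.

    n_iter is the number of passes (epochs) over the training data.
    alpha scales an l2 penalty on w; eta0 is a constant learning rate.
    """
    w, b = 0.0, 0.0
    for _ in range(n_iter):
        for x, y in zip(xs, ys):
            error = (w * x + b) - y
            # Gradient of the squared error plus the l2 penalty on w.
            w -= eta0 * (error * x + alpha * w)
            if fit_intercept:
                b -= eta0 * error
    return w, b


if __name__ == "__main__":
    xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]  # data generated by y = 2x + 1
    print(sgd_fit(xs, ys, n_iter=5, eta0=0.05))     # a few epochs: rough fit
    print(sgd_fit(xs, ys, n_iter=2000, eta0=0.05))  # many epochs: close fit
```

An incremental update corresponds to calling the inner loop once over only the new data, starting from the stored w and b instead of zero.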

Time Series Analysis

Forecasting algorithms, also known as time series analysis, provide methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data, and forecast its future values.

ARIMA

The Autoregressive Integrated Moving Average (ARIMA) algorithm uses the StatsModels ARIMA algorithm to fit a model on a time series for better understanding and/or forecasting its future values. An ARIMA model can consist of autoregressive terms, moving average terms, and differencing operations. The autoregressive terms express the dependency of the current value of time series to its previous ones.

The moving average terms, also called random shocks or white noise, model the effect of previous forecast errors on the current value. If the time series is non-stationary, differencing operations are used to make it stationary. A stationary process is a stochastic process whose probability distribution does not change over time.
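A differencing operation replaces each value with its change from the previous value. The following Python sketch shows the idea; the function name difference and the sample series are hypothetical.

```python
def difference(series, d=1):
    """Apply d rounds of first-order differencing to a list of numbers."""
    for _ in range(d):
        series = [later - earlier for earlier, later in zip(series, series[1:])]
    return series


if __name__ == "__main__":
    trend = [1, 3, 6, 10]
    print(difference(trend))       # changes between consecutive values
    print(difference(trend, d=2))  # changes of the changes
```

Each round of differencing shortens the series by one value and removes one order of trend.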

It is highly recommended to send the time series through timechart before sending it into ARIMA to avoid non-uniform sampling time. If _time is not to be specified, using timechart is not necessary.

Parameters

The time series should not have any gaps or missing data; otherwise ARIMA returns an error. If there are missing samples in the data, use a larger span in timechart or use streamstats to fill in the gaps with average values.

When chaining ARIMA output to another algorithm (for example, ARIMA itself), keep in mind that the length of the data is the length of the original data plus forecast_k. If you want to maintain the holdback position, add the number in forecast_k to your holdback value.

ARIMA requires the order parameter to be specified at fitting time. The order parameter needs three values:

Number of autoregressive (AR) parameters

Number of differencing operations (D)

Number of moving average (MA) parameters

The forecast_k=<int> parameter tells ARIMA how many points into the future should be forecasted. If _time is specified during fitting along with the field_to_forecast, ARIMA will also generate the timestamps for forecasted values. By default, forecast_k is zero.

The conf_interval=<1..99> parameter is the confidence interval in percentage around forecasted values. By default it is set to 95%.

The holdback=<int> parameter is the number of data points held back from the ARIMA model. This is useful for comparing the forecast against known data points. By default, holdback is zero.

ARIMA models cannot be saved and used at a later time in the current version.

Example

The following example uses ARIMA on a test set.

... | fit ARIMA Voltage order=4-0-1 holdback=10 forecast_k=10
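The order value and the holdback arithmetic described in the parameters can be sketched in Python. The function names are hypothetical; the code only mirrors the rules stated above and does not call ARIMA.

```python
def parse_order(order):
    """Split an order string such as '4-0-1' into (AR, D, MA) counts."""
    ar_terms, differencing, ma_terms = (int(part) for part in order.split("-"))
    return ar_terms, differencing, ma_terms


def chained_settings(original_length, holdback, forecast_k):
    """Return the output length and the holdback to use in a chained step.

    The output is the original data plus forecast_k forecasted points, so
    keeping the same holdback position means adding forecast_k to holdback.
    """
    output_length = original_length + forecast_k
    next_holdback = holdback + forecast_k
    return output_length, next_holdback


if __name__ == "__main__":
    print(parse_order("4-0-1"))
    # 100 input events with holdback=10 and forecast_k=10, as in the example.
    print(chained_settings(100, holdback=10, forecast_k=10))
```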

StateSpaceForecast

StateSpaceForecast is a forecasting algorithm for time series data in the MLTK. It is based on Kalman filters. The algorithm supports incremental fit.

Advantages of StateSpaceForecast over ARIMA include:

Persists models created using the fit command that can then be used with apply.

A specialdays field allows you to account for the effects of a specified list of special days.

It is automatic in that you do not need to choose parameters or mode.

Supports multivariate forecasting.
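For readers unfamiliar with Kalman filters, the following Python sketch shows the simplest case, a one-dimensional local level model that alternates a predict step and an update step. It is a generic textbook illustration, not the StateSpaceForecast implementation, and the function name and default variances are assumptions.

```python
def local_level_filter(observations, process_var=1.0, obs_var=1.0,
                       init_mean=0.0, init_var=1e6):
    """Run a 1-D Kalman filter where the hidden level follows a random walk.

    Returns the filtered level after each observation plus the final
    mean and variance of the level estimate.
    """
    mean, var = init_mean, init_var
    filtered = []
    for y in observations:
        # Predict: the level is expected to stay put, but uncertainty grows.
        var = var + process_var
        # Update: blend the prediction with the new observation.
        gain = var / (var + obs_var)
        mean = mean + gain * (y - mean)
        var = (1.0 - gain) * var
        filtered.append(mean)
    return filtered, mean, var


if __name__ == "__main__":
    levels, mean, var = local_level_filter([5.0, 5.2, 4.9, 5.1])
    print(levels)
    # A k-step forecast in this model keeps the last level; its variance
    # grows by process_var for every step ahead.
    print(mean, var + 3 * process_var if False else var)
```

Because the filter stores only a mean and a variance, it can be updated with new observations without revisiting old data, which is the property that incremental fit relies on.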

Parameters

By default the historical data results from running the fit command are not shown. To modify this behavior set output_fit=True.

Use the target field to specify fields from which to forecast using historical data and other values.

The target field is a comma-separated list of fields that can be univariate or multivariate. These fields must be specified during the fit process.

Optionally use the target field to fit multiple fields during the fit process but apply only a selection of those target fields during the apply process.

If the target field is not specified, then all fields will be forecast together using historical data.

The specialdays field specifies the field that indicates effects due to special days such as holidays.

The specialdays field values must be numeric and are typically 0 and 1, with 1 indicating the existence of a special day effect. Null values are treated as 0.

The majority of use cases have no specialdays. Events that occur regularly and frequently such as weekends should not be treated as specialdays. Use specialdays to capture events such as holiday sales.

Use specialdays in the apply step if it has been specified during fit. The same field(s) must be assigned.

Use the period parameter to specify if your data has a known periodicity.

If the period parameter is not specified it is computed automatically.

Set period=1 to treat the time series as non-periodic.

As with other MLTK algorithms, the partial_fit parameter controls whether a model should be incrementally updated or not. This allows you to update a model using only new data without having to retrain the model on the full dataset.

The default for partial_fit is False.

Use update_last to modify the behavior of partial_fit.

The default for update_last is False.

If partial_fit=True StateSpaceForecast first updates the model parameters and then predicts.

If partial_fit=True and update_last=True StateSpaceForecast first predicts and then updates the model parameters. This allows you to review the forecast before running new data through.

The conf_interval=<1..99> parameter is the confidence interval in percentage around forecasted values. Input an integer between 1 and 99 where a larger number means a greater tolerance for forecast uncertainty. The default integer is 95.

Use the as field to give aliases to forecasted fields.

In univariate cases the as field field-list is a single field name.

In multivariate cases, the as field adheres to the following conventions:

The list must be in double quotes, separated by either spaces or commas.

The aliases correspond to the original fields in the given order.

The number of aliases can be smaller than the number of original fields.

The summary command lists the names of the fields used in the fit command step, the name of the specialdays field, and the period.

The holdback parameter is the number of data points held back from training. This is useful for comparing the forecast against known data points. Default holdback value is 0.

If you want to maintain the holdback position, add the position number in forecast_k to your holdback value.

The forecast_k parameter tells StateSpaceForecast how many points into the future should be forecasted. If _time is specified during fitting along with the field_to_forecast, StateSpaceForecast also generates the timestamps for forecasted values. The default forecast_k value is 0.

The holdback and forecast_k values can be of two types: an integer or a time range.

An integer specifies a number of events. An example of forecast_k=10 forecasts 10 events into the future. An example of holdback=10 withholds the last 10 events from training.

A time range takes the form XY where X is a non-negative integer and Y is either empty or adheres to the format in the time range table. If Y is empty, the value is interpreted as an integer number of events. For example, holdback=3day forecast_k=1week withholds 3 days of events and forecasts 1 week's worth of events.

The actual number of events withheld and forecasted using the time range option depends on the time interval between consecutive events.
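The relationship between a time range and the resulting number of events can be sketched in Python. The function names are hypothetical, and the unit table below is an assumption that covers only a few units for illustration; it is not the toolkit's time range table.

```python
import re

# Assumed unit lengths in seconds, for illustration only.
SECONDS_PER_UNIT = {"hour": 3600, "day": 86400, "week": 604800}


def parse_range(value):
    """Split a value of the form XY into (X, Y), with Y=None when empty."""
    match = re.fullmatch(r"(\d+)([a-z]*)", value)
    if match is None:
        raise ValueError(f"not a valid holdback or forecast_k value: {value!r}")
    return int(match.group(1)), match.group(2) or None


def events_in_range(value, seconds_between_events):
    """Convert a holdback or forecast_k value to a number of events."""
    count, unit = parse_range(value)
    if unit is None:
        return count  # a plain integer already is a number of events
    return count * SECONDS_PER_UNIT[unit] // seconds_between_events


if __name__ == "__main__":
    print(events_in_range("10", 3600))      # integer: 10 events
    print(events_in_range("3day", 3600))    # hourly data: 3 days of events
    print(events_in_range("1week", 86400))  # daily data: 1 week of events
```

The same time range maps to a different number of events depending on the sampling interval, which is why hourly data and daily data give different counts for the same value.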

You can inspect the model learned by StateSpaceForecast with the summary command.

| summary <model name>

Syntax constraints

For univariate analysis the fields parameter is a single field, but for multivariate analysis it is a list of fields.

For multivariate analysis, only one specialdays field can be specified and it applies to all the fields.

The specialdays field values must be numeric.

Null values in the specialdays field are treated as 0.

Double quotes are required around field lists.

Examples

The following is a univariate example of StateSpaceForecast. The example is considered univariate as there is only a single field following | fit StateSpaceForecast. The example dataset is derived from the milk.csv dataset that ships with the toolkit. The milk2.csv has a new column named holiday. This column has two values 0 and 1. The 0 value represents no holiday and 1 value represents a holiday for the associated date. The 1 values were set randomly.

The following is a multivariate example of StateSpaceForecast on a test set. The syntax is the same as that in the univariate example, except that this case has a list of fields (CRM, ERP, and Expenses) following | fit StateSpaceForecast, making it multivariate.

The following example is also multivariate and includes the target field. In this example the fields of CRM and ERP are forecast using historical data and the Expenses field. The apply command is used against the model created in the fit command step, resulting in the app_usage_model model.
