
Script to select the n best features for modeling a given dataset,
using a greedy algorithm:

Initialize the set S of selected features to the empty set

Split your dataset into training and test sets

For i in 1 ... n:

    For each feature f not in S, model and evaluate with feature set S + f

    Greedily select the feature f' with the best performance and add it to S

The script takes as inputs the dataset to use and the number of features (that is, dataset fields) to return and yields as output a list of the n selected features, as field identifiers.

To select the best-performing feature set, the script uses the average_phi metric in the evaluations it performs; this metric is only available for classification problems, so the script is only valid for categorical objective fields. A minimal sketch of the selection loop follows.
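For illustration only, here is a rough sketch of that loop using the BigML Python bindings (the actual script is written in WhizzML and runs server-side); the helper names, the candidate field list, and the evaluation JSON path for average_phi are assumptions made for this example:

```python
# Hypothetical sketch of the greedy forward-selection loop.
from bigml.api import BigML

api = BigML()  # credentials taken from BIGML_USERNAME / BIGML_API_KEY

def average_phi(evaluation_id):
    # assumed JSON path for the average_phi metric of an evaluation
    ev = api.get_evaluation(evaluation_id)
    return ev["object"]["result"]["model"]["average_phi"]

def select_features(train_id, test_id, candidate_fields, n):
    """Greedily pick n field ids from candidate_fields."""
    selected = []
    for _ in range(n):
        best_field, best_score = None, float("-inf")
        for field in candidate_fields:
            if field in selected:
                continue
            # model on the training split with the tentative feature set
            model = api.create_model(
                train_id, {"input_fields": selected + [field]})
            api.ok(model)
            ev = api.create_evaluation(model, test_id)
            api.ok(ev)
            score = average_phi(ev["resource"])
            if score > best_score:
                best_field, best_score = field, score
        selected.append(best_field)
    return selected
```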

This script takes as inputs a cluster identifier, an instance (i.e., a map with values for all the fields used by the cluster), and a positive count n. It then:

Finds the centroid in the cluster closest to the given instance p

Selects within that centroid's dataset the n instances that are closest to p

If there are fewer than n rows in the centroid's dataset, the missing instances are taken from the next closest centroid's dataset.

This workflow uses Flatline to compute the distance between p and the rows of each centroid's dataset (via the row-distance-squared Flatline function), adding the result as an extra column, and then creates a sample of the result ordered by the computed distance.

The input instance can be specified using either field identifiers or field names.
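As a rough in-memory analogue of what the workflow does server-side with Flatline (numeric fields only; the function and parameter names are made up for this sketch):

```python
import numpy as np

def nearest_instances(centroids, datasets, p, n):
    """centroids: k x d array of centroid coordinates.
    datasets: list of k row arrays, one per centroid.
    Returns the n rows nearest to p, topping up from the next
    closest centroid when one centroid's dataset runs short."""
    p = np.asarray(p, dtype=float)
    # order centroids by squared distance to p
    order = np.argsort(((np.asarray(centroids) - p) ** 2).sum(axis=1))
    picked = []
    for idx in order:
        rows = np.asarray(datasets[idx], dtype=float)
        d2 = ((rows - p) ** 2).sum(axis=1)   # row-distance-squared
        for i in np.argsort(d2):
            picked.append(rows[i])
            if len(picked) == n:
                return np.array(picked)
    return np.array(picked)
```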

A very simple script that decides whether it is better to use a model or an ensemble for making predictions: given an input source, it creates both, evaluates the results, and chooses the one with the best f-measure in its evaluation if the objective field is categorical, or the best r-squared for regression problems.
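A sketch of that comparison with the BigML Python bindings (the 80/20 split, the fixed seed, and the metric paths in the evaluation JSON are assumptions made for this example):

```python
# Compare a single model against an ensemble on a held-out split and
# return the id of the better performer.
from bigml.api import BigML

api = BigML()

def model_or_ensemble(dataset_id):
    args = {"sample_rate": 0.8, "seed": "example"}
    train = api.create_dataset(dataset_id, args)
    test = api.create_dataset(dataset_id, dict(args, out_of_bag=True))
    api.ok(train)
    api.ok(test)
    model = api.create_model(train)
    ensemble = api.create_ensemble(train)
    api.ok(model)
    api.ok(ensemble)

    def score(resource):
        ev = api.create_evaluation(resource, test)
        api.ok(ev)
        result = api.get_evaluation(
            ev["resource"])["object"]["result"]["model"]
        # f-measure for classification, r_squared for regression
        return result.get("average_f_measure", result.get("r_squared"))

    return max([model, ensemble], key=score)["resource"]
```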

Given an input dataset, we use SMACdown to find the best parameters for creating an ensemble from that dataset.

The script takes as inputs, besides the dataset identifier, the evaluation metric to maximize (defaulting to average_phi), the objective field, and a string used as a prefix when naming the intermediate resources created by the workflow. You can select the metric to optimize from the lists below.

Classification metrics:

average_recall

average_phi

accuracy

average_precision

average_f_measure

Regression metrics:

r_squared

mean_absolute_error

mean_squared_error

This workflow generates a large number of auxiliary resources when executed. To instruct the script to delete all of them before finishing, set the delete-resources execution input parameter to true.
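SMACdown proper is a model-based optimizer, so the following shows only the evaluate-and-keep-best inner loop over candidate ensemble configurations, with a made-up candidate grid; it assumes a metric where larger is better, such as average_phi:

```python
from bigml.api import BigML

api = BigML()

def best_ensemble(train_id, test_id, metric="average_phi"):
    # hypothetical candidate configurations; SMACdown would propose
    # these adaptively instead of enumerating a fixed grid
    candidates = [{"number_of_models": m, "randomize": r}
                  for m in (16, 32, 64) for r in (False, True)]
    best, best_score = None, float("-inf")
    for config in candidates:
        ensemble = api.create_ensemble(train_id, config)
        api.ok(ensemble)
        ev = api.create_evaluation(ensemble, test_id)
        api.ok(ev)
        score = api.get_evaluation(
            ev["resource"])["object"]["result"]["model"][metric]
        if score > best_score:
            best, best_score = ensemble["resource"], score
    return best, best_score
```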

This is a simple script that, given an input dataset, creates an anomaly detector, uses it to identify the dataset's top anomalous rows, and then creates a new dataset without them by means of a Flatline filter.
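A related sketch with the BigML Python bindings; note that instead of filtering out the detector's top anomalies directly, it takes the simpler route of a batch anomaly score plus a score threshold, and the threshold value and score field name are assumptions:

```python
from bigml.api import BigML

api = BigML()

def dataset_without_anomalies(dataset_id, threshold=0.6):
    anomaly = api.create_anomaly(dataset_id)
    api.ok(anomaly)
    # score every row and materialize the result as a dataset
    batch = api.create_batch_anomaly_score(
        anomaly, dataset_id, {"output_dataset": True, "all_fields": True})
    api.ok(batch)
    scored_id = api.get_batch_anomaly_score(
        batch["resource"])["object"]["output_dataset_resource"]
    # keep only the rows scoring below the threshold (Flatline filter)
    clean = api.create_dataset(
        scored_id, {"lisp_filter": '(< (f "score") %s)' % threshold})
    api.ok(clean)
    return clean["resource"]
```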

This script implements feature selection using
a version of the Boruta algorithm
to detect important and unimportant fields in your dataset. The algorithm:

Retrieves the dataset information.

Creates a new extended dataset. In the new dataset, each field has a corresponding shadow field of the same type that contains a random sample of the values in the original field.

Creates a random forest from the extended dataset.

Extracts the maximum of the importances for the shadow fields.

Uses this maximum plus (or minus) a minimum gain as the upper (lower) threshold. Original fields scoring less than the lower threshold are considered unimportant, and fields scoring more than the upper threshold are considered important.

Fields marked as unimportant are removed from the list of fields to be used as input fields for new datasets.

The procedure is repeated, creating a new extended dataset with the remaining fields each time. The process stops when it reaches the user-given number of runs or when all the original fields in the dataset have been marked as important or unimportant.

When iteration stops, a new dataset is created where unimportant fields have been removed.
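A compact in-memory analogue of that loop with scikit-learn (the actual script works with BigML datasets and random forests server-side; the max_runs and min_gain defaults are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def boruta_like(X, y, max_runs=10, min_gain=0.0, seed=0):
    """X: numeric feature matrix, y: labels.
    Returns (important, unimportant) column indices."""
    rng = np.random.default_rng(seed)
    status = np.zeros(X.shape[1])  # 0 undecided, 1 important, -1 unimportant
    for _ in range(max_runs):
        live = np.where(status >= 0)[0]      # unimportant fields dropped
        shadows = X[:, live].copy()
        for j in range(shadows.shape[1]):    # shadow = shuffled copy
            rng.shuffle(shadows[:, j])
        extended = np.hstack([X[:, live], shadows])
        forest = RandomForestClassifier(
            n_estimators=100, random_state=seed).fit(extended, y)
        imp = forest.feature_importances_
        shadow_max = imp[len(live):].max()
        for k, col in enumerate(live):
            if status[col] != 0:
                continue                     # already decided
            if imp[k] > shadow_max + min_gain:
                status[col] = 1
            elif imp[k] < shadow_max - min_gain:
                status[col] = -1
        if (status != 0).all():              # everything decided
            break
    return np.where(status == 1)[0], np.where(status == -1)[0]
```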

The objective of this script is to perform a 5-fold cross-validation of the model built from a dataset, using the default values for all the available configuration parameters. Thus, the only input needed for the script to run is the name of the dataset used to both train and test the models in the cross-validation. The algorithm:

Divides the dataset into 5 parts.

Holds out the data in one of the parts and builds a model with the rest of the data.

Evaluates the model with the held-out data.

The second and third steps are repeated with each of the 5 parts, so that 5 evaluations are generated.

Finally, the evaluation metrics are averaged to provide the cross-validation metrics.

The output of the script is an evaluation ID. This evaluation is a cross-validation, meaning that its metrics are the averages of the 5 evaluations created during the process. A local sketch of the procedure follows.
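For reference, a local analogue of the whole procedure with scikit-learn (the script performs the same steps with BigML datasets, models, and evaluations; accuracy stands in here for the averaged evaluation metrics):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def cross_validate(X, y, k=5, seed=0):
    """X, y: numpy arrays. Returns the average score over k folds."""
    scores = []
    for train_idx, test_idx in KFold(
            n_splits=k, shuffle=True, random_state=seed).split(X):
        # hold out one part, train on the rest
        model = DecisionTreeClassifier(random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(
            y[test_idx], model.predict(X[test_idx])))
    # the cross-validation metric is the average of the k evaluations
    return float(np.mean(scores))
```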