If you do not specify the path to an output file, BigMLer will auto-generate
one for you under a
new directory named after the current date and time
(e.g., MonNov1212_174715/predictions.csv). With the --prediction-info
flag set to brief, only the prediction result will be stored (the default is
normal, which includes confidence information). You can also set it to
full if you prefer the result to be presented as a row with your test
input data followed by the corresponding prediction. To include a headers row
in the prediction file you can set --prediction-header. For both the
--prediction-info full and --prediction-info brief options, if you
want to include a subset of the fields in your test file you can select them by
setting --prediction-fields to a comma-separated list of them.
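
For instance, a sketch combining both options (field names as in the iris
data):

bigmler --train data/iris.csv --test data/test_iris.csv --prediction-info full --prediction-fields 'sepal length,petal length'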

If you do not explicitly specify an objective field, BigML will default to the
last column in your dataset. You can also select the objective field by its
column number instead of its name (when --no-train-header is used, for
instance).

Also, if your test file uses a particular field separator for its data,
you can tell BigMLer using --test-separator.
For example, if your test file uses the tab character as field separator the
call should be like
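
A sketch (the .tsv file name is illustrative):

bigmler --train data/iris.csv --test data/test_iris.tsv --test-separator '\t'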

The model’s predictions in BigMLer are based on the mean of the distribution
of training values in the predicted node. In case you would like to use the
median instead, you could just add the --median flag to your command
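
For example (file names illustrative; the median applies to regression trees):

bigmler --train data/grades.csv --test data/test_grades.csv --median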

If you don’t provide a file name for your training source, BigMLer will try to
read it from the standard input

cat data/iris.csv | bigmler --train

or you can also read the test info from there

cat data/test_iris.csv | bigmler --train data/iris.csv --test

BigMLer will try to use the locale of the model both to create a new source
(if the --train flag is used) and to interpret test data. In case
it fails, it will try en_US.UTF-8
or English_UnitedStates.1252 and a warning message will be printed.
If you want to change this behaviour you can specify your preferred locale
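
For instance, assuming BigMLer's --locale option (locale string illustrative):

bigmler --train data/iris.csv --locale "es_ES"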

If you check your working directory you will see that BigMLer creates a file
with the
model ids that have been generated (e.g., FriNov0912_223645/models).
This file is handy if then you want to use those model ids to generate local
predictions. BigMLer also creates a file with the dataset id that has been
generated (e.g., TueNov1312_003451/dataset) and another one summarizing
the steps taken in the session progress: bigmler_sessions. You can also
store a copy of every created or retrieved resource in your output directory
(e.g., TueNov1312_003451/model_50c23e5e035d07305a00004f) by setting the flag
--store.

All the predictions we saw in the previous section are computed locally in
your computer. BigMLer allows you to ask for a remote computation by adding
the --remote flag. Remote computations are treated as batch computations.
This means that your test data will be loaded in BigML as a regular source and
the corresponding dataset will be created and fed as input data to your
model to generate a remote batchprediction object. BigMLer will download
the predictions file created as a result of this batchprediction and
save it to local storage just as it did for local predictions
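
A remote prediction call could look like this:

bigmler --train data/iris.csv --test data/test_iris.csv --remote --output my_dir/remote_predictions.csv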

This command will create a source, dataset and model for your training data,
a source and dataset for your test data and a batch prediction using the model
and the test dataset. The results will be stored in the
my_dir/remote_predictions.csv file. If you prefer the result not to be
downloaded but stored as a new remote dataset, add --no-csv and
--to-dataset to the command line. This can be especially helpful when
dealing with a high number of scores or when adding the original dataset
fields to the final result with --prediction-info full, which may produce
a large output CSV.

In case you prefer BigMLer to issue
one-by-one remote prediction calls, you can use the --no-batch flag
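
For instance:

bigmler --train data/iris.csv --test data/test_iris.csv --remote --no-batch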

There are some more advanced options that can help you build local predictions
with your ensembles.
When the number of local models becomes quite large, holding all the models in
memory may exhaust your resources. To avoid this problem you can use the
--max-batch-models flag, which controls how many local models are held
in memory at the same time
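
For instance:

bigmler --train data/iris.csv --test data/test_iris.csv --number-of-models 20 --max-batch-models 5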

The predictions generated when using this option will be stored in a file per
model, named after the model id (e.g.,
model_50c23e5e035d07305a00004f__predictions.csv). Each line
contains the prediction, its confidence, the node’s distribution and the node’s
total number of instances. The default value for --max-batch-models is 10.

When using ensembles, the models’ predictions are combined to issue a final
prediction. There are several different methods to build the combination.
You can choose plurality, "confidence weighted", "probability weighted"
or threshold using the --method flag
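
For instance:

bigmler --train data/iris.csv --test data/test_iris.csv --number-of-models 5 --method "confidence weighted"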

For classification ensembles, the combination is made by majority vote:
plurality weights each model’s prediction as one vote,
"confidence weighted" uses each prediction’s confidence as its weight,
"probability weighted" uses the probability of the class in the distribution
of classes in the node as weight, and threshold uses an integer number
as threshold and a class name to issue the prediction: if the votes for
the chosen class reach the threshold value, then the class is predicted;
otherwise, plurality is used for the rest of predictions.

For regression ensembles, the predicted values are averaged: plurality
again weights each predicted value as one,
"confidence weighted" weights each prediction according to the associated
error and "probability weighted" gives the same results as plurality.

As in the model’s case, you can base your prediction on the median of the
predicted node’s distribution by adding --median to your BigMLer command.

It is also possible to enlarge the number of models that build your prediction
gradually. You can build more than one ensemble for the same test data and
combine the votes of all of them by using the --combine-votes flag
followed by a comma-separated list of directories where predictions are
stored. For instance
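
The directory names here are illustrative placeholders:

bigmler --combine-votes MonNov1212_174715,MonNov1212_174716 --test data/test_iris.csv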

Each resource created in BigML can be associated to a project. Projects are
intended for organizational purposes, and BigMLer can create projects
each time a source is created using a --project
option. For instance

bigmler --train data/iris.csv --project "my new project"

will first check for the existence of a project by that name. If it exists,
BigMLer will associate the source, dataset and model resources to this project.
If it doesn’t, a new project is created and then associated.

You can also associate resources to any project in your account
by specifying the option --project-id followed by its id
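
For instance (the project id is illustrative):

bigmler --train data/iris.csv --project-id project/50a20eb4035d0706db000ab5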

Note: Once a source has been associated to a project, all the resources
derived from this source will be automatically associated to the same
project.

You can also create projects or update their properties by using the bigmler
project subcommand. In particular, when projects need
to be created in an organization, the --organization option has to
be added to inform about the ID of the organization where the project should
be created:
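For instance (ids illustrative):

bigmler project --name "my_project" --organization organization/5e13e46e23984e06a3000001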

Only allowed users can create projects in organizations. If you are not the
owner or an administrator, please check your permissions with them first.
You can learn more about organizations at the
API documentation.

You can also create resources in an organization’s project if your user
has the right privileges. In order to do that, you should add the
--org-project option followed by the organization’s project ID.
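
For instance (the project id is illustrative):

bigmler --train data/iris.csv --org-project project/50a20eb4035d0706db000ab5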

You don’t need to create a model from scratch every time that you use BigMLer.
You can generate predictions for a test set using a previously generated
model, cluster, etc. The example shows how you would do that for a tree model:
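
For a tree model, that might look like (model id illustrative):

bigmler --model model/50a1f43deabcb404d3000079 --test data/test_iris.csv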

You can also use a number of models providing a file with a model/id per line

bigmler --models TueDec0412_174148/models --test data/test_iris.csv

Or all the models that were tagged with a specific tag

bigmler --model-tag my_tag --test data/test_iris.csv

The same can be extended to any other subcommand, like bigmler cluster,
using the corresponding options (--cluster cluster/50a1f43deabcb404d3000da2,
--clusters TueDec0412_174148/clusters and --cluster-tag my_tag).
Please check each subcommand’s available options for details.

You can also use a previously generated dataset to create a new model

bigmler --dataset dataset/50a1f441035d0706d9000371

You can also input the dataset from a file

bigmler --datasets iris_dataset

A previously generated source can also be used to generate a new
dataset and model

bigmler --source source/50a1e520eabcb404cd0000d1

And test sources and datasets can also be referenced by id in new
BigMLer requests for remote predictions

BigMLer can also help you to measure the performance of your supervised
models (decision trees, ensembles and logistic regressions). The
simplest way to build a model and evaluate it all at once is

bigmler --train data/iris.csv --evaluate

which will build the source, dataset and model objects for you using 80% of
the data in your training file chosen at random. After that, the remaining 20%
of the data will be run through the model to obtain
the corresponding evaluation.

The same procedure is available for ensembles:

bigmler --train data/iris.csv --number-of-models 10 --evaluate

and for logistic regressions:

bigmler logistic-regression --train data/iris.csv --evaluate

You can use the same procedure with a previously
existing source or dataset
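
For instance, with a previously created dataset (id illustrative):

bigmler --dataset dataset/50a1f441035d0706d9000371 --evaluate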

The results of an evaluation are stored both in txt and json files. Their
contents follow the description given in the
Developers guide, evaluation section,
and vary depending on whether the model solves a classification or regression
problem.

Finally, you can also evaluate a preexisting model using a separate set of
data stored in a file or a previous dataset
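
For instance (model id illustrative):

bigmler --model model/50a1f43deabcb404d3000079 --test data/test_iris.csv --evaluate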

If you need cross-validation techniques to ponder which parameters (like
the ones related to different kinds of pruning) can improve the quality of your
models, you can use the --cross-validation-rate flag to settle the
part of your training data that will be separated for cross validation. BigMLer
will use a Monte-Carlo cross-validation variant, building 2*n different
models, each of which is constructed by a subset of the training data,
holding out randomly n% of the instances. The held-out data will then be
used to evaluate the corresponding model.
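
For instance, an illustrative call is

bigmler --train data/iris.csv --cross-validation-rate 0.02

which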

will hold out 2% of the training data to evaluate a model built upon the
remaining 98%. The evaluations will be averaged and the result saved
in json and human-readable formats in cross-validation.json and
cross-validation.txt respectively. Of course, in this kind of
cross-validation you can choose the number of evaluations yourself by
setting the --number-of-evaluations flag. You should just keep in mind
that it must be high enough to ensure low variance, for instance
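
For instance:

bigmler --train data/iris.csv --cross-validation-rate 0.1 --number-of-evaluations 20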

What if your raw data isn’t necessarily in the format that BigML expects? The
good news is that you can use a number of options to configure your sources,
datasets, and models.

Most resources in BigML contain information about the fields used in the
resource construction. Sources contain information about the name, label,
description and type of the fields detected in the data you upload.
In addition to that, datasets contain the information of the values that
each field contains, whether they have missing values or errors and even
if they are preferred fields or non-preferred (fields that are not expected
to convey real information to the model, like user IDs or constant fields).
This information is available in the “fields” attribute of each resource,
but BigMLer can extract it and build a CSV file with a summary of it.
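
For instance (source id illustrative):

bigmler --source source/50a1e520eabcb404cd0000d1 --export-fields fields_summary.csv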

By using this command, BigMLer will create a fields_summary.csv file
in a summary output directory. The file will contain a headers row and
the fields information available in the source, namely the field column,
field ID, field name, field label and field description of each field. If you
execute the same command on a dataset
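
(dataset id illustrative)

bigmler --dataset dataset/50a1f441035d0706d9000371 --export-fields fields_summary.csv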

you will also see the number of missing values and errors found in each field
and an excerpt of the values and errors.

But then, imagine that you want to alter BigML’s default field names
or the ones provided by the training set header and capitalize them, or even
add a label or a description to each field. You can use several methods.
One is writing a text file with one change per line, as
follows

bigmler --train data/iris.csv --field-attributes fields.csv

where fields.csv would be

0,'SEPAL LENGTH','label for SEPAL LENGTH','description for SEPAL LENGTH'
1,'SEPAL WIDTH','label for SEPAL WIDTH','description for SEPAL WIDTH'
2,'PETAL LENGTH','label for PETAL LENGTH','description for PETAL LENGTH'
3,'PETAL WIDTH','label for PETAL WIDTH','description for PETAL WIDTH'
4,'SPECIES','label for SPECIES','description for SPECIES'

The number on the left in each line is the column number of the field in your
source and is followed by the new field’s name, label and description.

Similarly, you can also alter BigML’s auto-detected type behavior by assigning
specific types to specific fields

bigmler --train data/iris.csv --types types.txt

where types.txt would be

0, 'numeric'
1, 'numeric'
2, 'numeric'
3, 'numeric'
4, 'categorical'

Finally, the same summary file that could be built with the --export-fields
option can be used to modify the updatable information in sources
and datasets. Just edit the CSV file with your favourite editor setting
the new values for the fields and use:
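For instance, assuming the option for this is --import-fields (source id
illustrative):

bigmler --source source/50a1e520eabcb404cd0000d1 --import-fields summary/fields_summary.csv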

to update the names, labels, descriptions or types of the fields with the ones
in the summary/fields_summary.csv file.

You could
also use this option to change the preferred attribute of each
of the fields. This transformation is made at the dataset level,
so in the previous command it will be applied once a dataset is created from
the referred source. You might as well act
on an existing dataset:
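
For instance (dataset id illustrative, assuming the --import-fields option):

bigmler --dataset dataset/50a1f441035d0706d9000371 --import-fields summary/fields_summary.csv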

You can also reference the fields by their column number in these JSON
structures. If the field to be modified is in the second column (column
indices start at 0), then the contents of the attributes.json file could
equally be
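
One possible shape, marking that field as non-preferred (illustrative; the
attributes depend on what you are updating):

{"fields": {"1": {"preferred": false}}}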

You can also specify the chosen fields by adding or removing the ones you
choose to the list of preferred fields of the previous resource. Just prefix
their names with + or - respectively. For example,
you could create a model from an existing dataset using all its fields but
the sepal length by saying
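
(dataset id illustrative)

bigmler --dataset dataset/50a1f441035d0706d9000371 --model-fields="-sepal length"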

When evaluating, you can map the fields of the evaluated model to those of
the test dataset by writing in a file the field column of the model and
the field column of the dataset separated by a comma and using the
--fields-map flag to specify the name of the file
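
For instance (ids and file name illustrative, assuming the --test-dataset
option):

bigmler --model model/50a1f43deabcb404d3000079 --test-dataset dataset/50a1f441035d0706d9000371 --fields-map fields_map.csv --evaluate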

When following the usual proceedings to evaluate your models you’ll need to
separate the available data in two sets: the training set and the test set. With
BigMLer you won’t need to create two separate physical files. Instead, you
can set a --test-split flag that will set the percentage of data used to
build the test set and leave the rest for training. For instance

bigmler --train data/iris.csv --test-split 0.2 --name iris --evaluate

will build a source with your entire file contents, create the corresponding
dataset and split it in two: a test dataset with 20% of instances and a
training dataset with the remaining 80%. Then, a model will be created based on
the training set data and evaluated using the test set. By default, the split
is deterministic, so every time you issue the same command you will get the
same split datasets. If you want to generate
different splits from a unique dataset you can set the --seed option to a
different string in every call
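
For instance (seed string illustrative):

bigmler --train data/iris.csv --test-split 0.2 --name iris --seed my_random_string --evaluate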

As you can find in BigML’s API documentation on
datasets, besides the basic name,
label and description that we discussed in previous sections, there are many
more configurable options in a dataset resource.
As an example, to publish a dataset in the
gallery and set its price you could use

{"private": false, "price": 120.4}
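
Saving that JSON to a file (say attributes.json), it could be applied with the
--dataset-attributes option (dataset id illustrative):

bigmler --dataset dataset/50a1f441035d0706d9000371 --dataset-attributes attributes.json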

Similarly, you might want to add fields to your existing dataset by combining
some of its fields or simply tagging their rows. Using BigMLer, you can set the
--new-fields option to a file path that contains a JSON structure that
describes the fields you want to select or exclude from the original dataset,
or the ones you want to combine and
the Flatline expression to
combine them. This structure
must follow the rules of a specific language described in the Transformations
item of the developers
section

To see a simple example, should you want to include all the fields but the
one with id 000001 and add a new one with a label depending on whether
the value of the field sepal length is smaller than 1,
you would write in generators.json
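
A sketch of what generators.json might contain (field references and Flatline
syntax here are illustrative; check BigML’s Flatline reference for the exact
form):

{"all_but": ["000001"],
 "new_fields": [{"name": "size_label",
                 "field": "(if (< (f \"sepal length\") 1) \"small\" \"big\")"}]}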

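A dataset’s contents can also be exported to a local CSV file. Assuming the
--to-csv option (dataset id illustrative), a command like

bigmler --dataset dataset/50a1f441035d0706d9000371 --to-csv my_dataset.csv --no-model
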
will create a CSV file named my_dataset.csv in the default directory
created by BigMLer to place the command output files. If no file name is given,
the file will be named after the dataset id.

A dataset can also be generated as the union of several datasets using the
flag --multi-dataset. The datasets will be read from a file specified
in the --datasets option and the file must contain one dataset id per line.

bigmler --datasets my_datasets --multi-dataset --no-model

This syntax is used when all the datasets in the my_datasets file share
a common field structure, so the correspondence of the fields of all the
datasets is straightforward. In the general case, the multi-dataset will
inherit the field structure of the first component dataset.
If you want to build a multi-dataset from
datasets whose fields do not share the same column disposition, you can specify
which fields correspond to the ones of the first dataset
by mapping the fields of the rest of the datasets to them.
The option --multi-dataset-attributes can point to a JSON
file that contains such a map. The command line syntax would then be
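
(map file name illustrative)

bigmler --datasets my_datasets --multi-dataset --multi-dataset-attributes multi_dataset.json --no-model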

To deal with imbalanced datasets, BigMLer offers three options: --balance,
--weight-field and --objective-weights.

For classification models, the --balance flag will cause all the classes
in the dataset to
contribute evenly. A weight will be assigned automatically to each
instance. This weight is
inversely proportional to the number of instances in the class it belongs to,
in order to ensure even distribution for the classes.

You can also use a field in the dataset that contains the weight you would like
to use for each instance. Using the --weight-field option followed by
the field name or column number will cause BigMLer to use its data as instance
weight. This is valid for both regression and classification models.

The --objective-weights option is used in classification models to
transmit to BigMLer the weight assigned to each class. The option accepts
a path to a CSV file that should contain one class,weight pair
per row
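
For instance, a weights.csv file could read

Iris-setosa,5
Iris-versicolor,3

and be passed (command shape illustrative) as

bigmler --train data/iris.csv --objective-weights weights.csv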

so that BigMLer would associate a weight of 5 to the Iris-setosa
class and 3 to the Iris-versicolor class. For additional classes
in the model, like Iris-virginica in the previous example,
weight 1 is used as default. All specified weights must be non-negative
numbers (with either integer or real values) and at least one of them must
be non-zero.

Sometimes the available data lacks some of the features our models use to
predict. In these occasions, BigML offers two different ways of handling
input data with missing values, that is to say, the missing strategy. When the
path to the prediction reaches a split point that checks
the value of a field which is missing in your input data, using the
last prediction strategy the final prediction will be the prediction for
the last node in the path before that point, and using the proportional
strategy it will be a weighted average of all the predictions for the final
nodes reached considering that both branches of the split are possible.

BigMLer adds the --missing-strategy option, which can be set either to
last or proportional to choose the behavior in such cases. The last
prediction strategy is the default.
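
For instance (model id illustrative):

bigmler --model model/50a1f43deabcb404d3000079 --test data/test_iris.csv --missing-strategy proportional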

Another configuration argument that can change models when
the training data has instances with missing values in some of its features
is --missing-splits. By setting this flag, the model building algorithm
will be able to include the instances
that have missing values for the field used to split the data in each node
in one of the stemming branches. This will, obviously, also affect the
predictions given by the model for input data with missing values. Here’s an
example to build
a model using missing-splits and predict with it.
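
A sketch of such a command:

bigmler --train data/iris.csv --missing-splits --test data/test_iris.csv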

Imagine that you have created a new source and that you want to create a
specific dataset filtering the rows of the source that only meet certain
criteria. You can do that using a JSON expression as follows
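
A sketch, assuming the --json-filter option and a filter.json file holding a
BigML JSON filter expression (source id and expression illustrative):

bigmler --source source/50a1e520eabcb404cd0000d1 --json-filter filter.json

where filter.json could contain something like [">", ["field", "000001"], 3].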

Sometimes the information you want to predict is not a single category but a
set of complementary categories. In this case, training data is usually
presented as a row of features and an objective field that contains the
associated set of categories joined by some kind of delimiter. BigMLer can
also handle this scenario.

Imagine a training file with information about a group of people, where we
want to predict the class another person will fall into. As you can see,
each record has more
than one class per person (for example, the first person is labeled as
being both a Student and a Teenager) and they are all stored in the
class field by concatenating all the applicable labels using , as
separator. Each of these labels is, ‘per se’, an objective to be predicted, and
that’s what we can rely on BigMLer to do.

The simplest multi-label command in BigMLer is

bigmler --multi-label --train data/tiny_multilabel.csv

First, it will analyze the training file to extract all the labels stored
in the objective field. Then, a new extended file will be generated
from it by adding a new field per label. Each generated field will contain
a boolean set to
True if the associated label is in the objective field and False
otherwise.

This new file will be fed to BigML to build a source, a dataset and
a set of models using four input fields: the first three fields as
input features and one of the label fields as objective. Thus, each
of the classes that label the training set can be predicted independently using
one of the models.

But, naturally, when predicting a multi-labeled field you expect to obtain
all the labels that qualify the input features at once, as you provide them in
the training data records. That’s also what BigMLer does. The syntax to
predict using
multi-labeled training data sets is similar to the single labeled case
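
For instance (test file name illustrative):

bigmler --multi-label --train data/tiny_multilabel.csv --test data/tiny_test_multilabel.csv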

the main difference being that the output file predictions.csv will have
the following structure

"Adult,Student","0.34237,0.20654"
"Adult,Teenager","0.34237,0.34237"

where the first column contains the class prediction and the second one the
confidences for each label prediction. If the models predict True for
more than one label, the prediction is presented as a sequence of labels
(and their corresponding confidences) delimited by ,.

As you may have noted, BigMLer uses , both as default training data fields
separator and as label separator. You can change this behaviour by using the
--training-separator, --label-separator and --test-separator flags
to use different one-character separators
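
For instance (file names illustrative):

bigmler --multi-label --train data/multilabel.tsv --test data/test_multilabel.tsv --training-separator '\t' --test-separator '\t' --label-separator ':'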

This command would use the tab character as train and test data field
delimiter and : as label delimiter (the examples in the test set use
, as field delimiter and : as label separator).

You can also choose to restrict the prediction to a subset of labels using
the --labels flag. The flag should be set to a comma-separated list of
labels. Setting this flag can also reduce the processing time for the
training file, because BigMLer will rely on them to produce the extended
version of the training file. Be careful, though, to avoid typos in the labels
in this case, or no objective fields will be created. Following the previous
example
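
(test file name illustrative)

bigmler --multi-label --train data/tiny_multilabel.csv --test data/tiny_test_multilabel.csv --labels Adult,Student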

will limit the predictions to the Adult and Student classes, leaving
out the Teenager classification.

Multi-labeled predictions can also be computed using ensembles, one for each
label. To create an ensemble prediction, use the --number-of-models option
that will set the number of models in each ensemble
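
For instance (test file name illustrative):

bigmler --multi-label --train data/tiny_multilabel.csv --number-of-models 10 --test data/tiny_test_multilabel.csv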

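Previously created models can also be reused by pointing --models at the
stored ids file. A command like the following (paths illustrative)

bigmler --multi-label --models MonNov0413_201326/models --test data/tiny_test_multilabel.csv
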
will retrieve the set of models created in the last example and use them in new
predictions. In addition, for these three cases you can restrict the labels
to predict to a subset of the complete list available in the original objective
field. The --labels option can be set to a comma-separated list of the
selected labels in order to do so.

The --model-tag can be used as well to retrieve multi-labeled
models and predict with them
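
For instance (test file name illustrative):

bigmler --multi-label --model-tag my_tag --test data/tiny_test_multilabel.csv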

Finally, BigMLer is also able to handle training files with more than one
multi-labeled field. Using the --multi-label-fields option you can
specify the fields that will be expanded as containing multiple labels
in the generated source and dataset.
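
For instance (training file name illustrative):

bigmler --multi-label --multi-label-fields class,type --train data/multilabel_multi.csv --objective class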

This command creates a source (and its corresponding dataset)
where both the class and type fields have been analysed
to create a new field per label. Then the --objective option sets class
to be the objective field and only the models needed to predict this field
are created. You could also create a new multi-label prediction for another
multi-label field, type in this case, by issuing a new BigMLer command
that uses the previously generated dataset as starting point
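
For instance (dataset id illustrative):

bigmler --multi-label --dataset dataset/50c23e5e035d07305a000056 --objective type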

This would generate the models needed to predict type. It’s important to
remark that the models used to predict class in the first example will
use the rest of fields (including type as well as the ones generated
by expanding it) to build the prediction tree. If you don’t want these
fields to be used in the model construction, you can set the --model-fields
option to exclude them. For instance, if type has two labels, label1
and label2, then excluding them from the models that predict
class could be achieved using

You can also generate new fields by applying aggregation functions such as
count, first or last on the labels of the multi-label fields. The
option --label-aggregates can be set to a comma-separated list of these
functions and a new column per multi-label field and aggregation function
will be added to your source
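
For instance (training file name illustrative):

bigmler --multi-label --train data/multilabel.csv --label-aggregates count,last --objective class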

Multi-label predictions are computed using a set of binary models
(or ensembles), one for
each label to predict. Each model can be evaluated to check its
performance. In order to do so, you can mimic the commands explained in the
evaluations section for the single-label models and ensembles. Starting
from a local CSV file
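
(training file name illustrative)

bigmler --multi-label --train data/multilabel.csv --objective class --evaluate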

will build the source, dataset and model objects for you using a
random 80% portion of data in your training file. After that, the remaining 20%
of the data will be run through each of the models to obtain an evaluation of
the corresponding model. BigMLer retrieves all evaluations and saves
them locally in json and txt format. They are named using the objective field
name and the value of the label that they refer to. Finally, it averages the
results obtained in all the evaluations to generate a mean evaluation stored
in the evaluation.txt and evaluation.json files. As an example,
if your objective field name is class and the labels it contains are
Adult and Student, the generated files will be

Generated files:

MonNov0413_201326
    bigmler_sessions
    dataset
    evaluation.json
    evaluation.txt
    evaluation_class_adult.json
    evaluation_class_adult.txt
    evaluation_class_student.json
    evaluation_class_student.txt
    evaluations
    extended_multilabel.csv
    models
    source

You can use the same procedure with a previously
existing multi-label source or dataset
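
For instance, starting from an existing dataset (id illustrative):

bigmler --multi-label --dataset dataset/50c23e5e035d07305a000056 --evaluate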

In BigML there’s a limit in the number of categories of a categorical
objective field. This limit is set to ensure the quality of the resulting
models. This may become a restriction when dealing with
categorical objective fields with a high number of categories. To cope with
these cases, BigMLer offers the --max-categories option. Setting it to a
number lower than the mentioned limit will organize the existing categories
in subsets of that size. Then the original dataset will be copied as many
times as subsets, and the objective field of each copy will only keep the
categories belonging to its subset plus a generic "other" category that
summarizes the rest of the categories. Then a model will be created from each
dataset and the test data will be run through them to generate partial
predictions. The final prediction will be extracted by choosing the class
with highest confidence from the distributions obtained for each model’s
prediction, ignoring the generic "other" category.
For instance, to use the same iris.csv example, you could do
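
(command shape illustrative; species is the categorical field in iris.csv)

bigmler --train data/iris.csv --max-categories 1 --test data/test_iris.csv --objective species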

This command would generate a source and dataset object, as usual, but then,
as the total number of categories is three and --max-categories is set to 1,
three more datasets will be created, one per each category. After generating
the corresponding models, the test data will be run through them and their
predictions combined to obtain the final predictions file. The same procedure
would be applied if starting from a preexisting source or dataset using the
--source or --dataset options. Please note that the --objective
flag is mandatory in this case to ensure that the right categorical field
is selected as objective field.

The --method option accepts an additional combine value to use this kind of
combination. You can use it if you need to create a new group of predictions
based on the same models produced in the first example. Filling the path to the
model ids file
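
For instance (paths illustrative), with a command like

bigmler --models my_dir/models --method combine --test data/new_test.csv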

the new predictions will be created. Also, you could use the set of datasets
created in the first case as starting point. Their ids are stored in a
dataset_parts file that can be found in the output location

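A k-fold cross-validation can be requested through the bigmler analyze
subcommand (dataset id illustrative); a command like

bigmler analyze --dataset dataset/50a1f441035d0706d9000371 --cross-validation --k-folds 5
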
will create a k-fold cross-validation by dividing the data in your dataset in
the number of parts given in --k-folds. Then evaluations are created by
selecting one of the parts to be the test set and using the rest of data
to build the model for testing. The generated
evaluations are placed in your output directory and their average is stored in
evaluation.txt and evaluation.json.

Similarly, you’ll be able to create an evaluation for ensembles. Using the
same command above and adding the options to define the ensembles’ properties,
such as --number-of-models, --sample-rate, --randomize or
--replacement

More insights can be drawn from the bigmler analyze --features command. In
this case, the aim of the command is to analyze the complete set of features
in your dataset to single out the ones that produce models with better
evaluation scores. Here, the focus is on accuracy for categorical
objective fields and r-squared for regressions.
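
For instance (dataset id illustrative):

bigmler analyze --dataset dataset/50a1f441035d0706d9000371 --features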

This command uses an algorithm for smart feature selection as described in this
blog post
that evaluates models built by using subsets of features. It starts by
building one model per feature, chooses the subset of features used in the
model that scores best and, from there on, repeats the procedure
by adding another of the available features in the dataset to the chosen
subset. The iteration stops when no improvement in score is found for a number
of repetitions that can be controlled using the --staleness option
(default is 5). There’s
also a --penalty option (default is 0.1%) that sets the amount that
is subtracted from the score per feature added to the
subset. This penalty is intended
to mitigate overfitting, but it also favors models which are quicker to build
and evaluate. The evaluations for the scores are k-fold cross-validations.
The --k-folds value is set to 5 by default, but you can change it
to whatever suits your needs using the --k-folds option.

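For instance (dataset id illustrative, with the penalty expressed as a
fraction), a command like

bigmler analyze --dataset dataset/50a1f441035d0706d9000371 --features --k-folds 10 --penalty 0.002 --staleness 3
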
Would select the best subset of features using 10-fold cross-validation
and a 0.2% penalty per feature, stopping after 3 non-improving iterations.

Depending on the machine learning problem you intend to tackle, you might
want to optimize another evaluation metric, such as precision or
recall. The --optimize option will allow you to set the evaluation
metric you’d like to optimize.
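For example, a hedged sketch of a call optimizing recall (the dataset ID is a
placeholder) would be:

```shell
bigmler analyze --features --dataset dataset/53b1f71437203f5ac30004ed \
    --optimize recall
```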

For categorical models, the evaluation values are obtained by counting
the positive and negative matches for all the instances in
the test set, but sometimes it can be more useful to optimize the
performance of the model for a single category. This can be especially
important in highly non-balanced datasets or when the cost function is
mainly associated with one of the existing classes in the objective field.
Using --optimize-category you can set the category whose evaluation
metrics you’d like to optimize
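As an illustration, assuming an iris-like dataset (the dataset ID and class
name are placeholders):

```shell
bigmler analyze --features --dataset dataset/53b1f71437203f5ac30004ed \
    --optimize recall --optimize-category Iris-setosa
```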

You should be aware that the smart feature selection command still generates
a high number of BigML resources. Using k as the k-folds number and
n as the number of explored feature sets, it will generate k
datasets (each containing 1/k of the instances) and k*n models and
evaluations. Setting --max-parallel-models and
--max-parallel-evaluations to higher values (up to k) can help you
partially speed up the creation process, because resources will be created
in parallel. Keep in mind, though, that this parallelization is
limited by the task limit associated with your subscription or account type.

As another optimization method, the bigmler analyze --nodes subcommand
will find for you the best performing model by changing the number of nodes
in its tree. You provide the --min-nodes and --max-nodes that define
the range, and --nodes-step controls the increment in each step. The command
runs a k-fold evaluation (see the --k-folds option) on a model built with each
node threshold in your range and tries to optimize the evaluation metric you
chose (again, the default is accuracy). If improvement stops (see
the --staleness option) or the node threshold reaches the --max-nodes
limit, the process ends and shows the node threshold that
led to the best score.
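A sketch of such a search (the dataset ID and range values are illustrative):

```shell
bigmler analyze --nodes --dataset dataset/53b1f71437203f5ac30004ed \
    --min-nodes 3 --max-nodes 14 --nodes-step 2
```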

When working with random forests, you can also change the number of
random_candidates, or number of fields chosen at random when the models
in the forest are built. Using bigmler analyze --random-fields, the number
of random_candidates will range from 1 to the number of fields in the
origin dataset, and BigMLer will cross-validate the random forests to determine
which random_candidates number gives the best performance.
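A minimal sketch (the dataset ID is a placeholder):

```shell
bigmler analyze --random-fields --dataset dataset/53b1f71437203f5ac30004ed \
    --k-folds 5
```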

Please note that, in general, the exact choice of fields selected as random
candidates might be more
important than their actual number. However, in some marginal cases (e.g.
datasets with a high number of noise features) the number of random candidates
can impact tree performance significantly.

For any of these options (--features, --nodes and --random-fields)
you can add the --predictions-csv flag to the bigmler analyze
command. The results will then include a CSV file that stores the predictions
obtained in the evaluations that gave the best score. The file content includes
the data in your original dataset tagged by k-fold and the prediction and
confidence obtained. This file will be placed in an internal folder of your
chosen output directory.

The output directory for this command is my_features and it will
contain all the information about the resources generated when testing
the different feature combinations
organized in subfolders. The k-fold datasets’
IDs will be stored in an inner test directory. The IDs of the resources
created when testing each combination of features will be stored in
kfold1, kfold2, etc. folders inside the test directory.
If the best-scoring prediction
models are the ones in the kfold4 folder, then the predictions CSV file
will be stored in a new folder named kfold4_pred.

The results of a bigmler analyze --features or bigmler analyze --nodes
command are a series of k-fold cross-validations made on the training data
that lead to the configuration value that will create the best performing
model. However, the algorithm maximizes only one evaluation metric. To see the
global picture for the rest of the metrics at each validation configuration,
you can build a graphical report of the results using the report subcommand.
Let’s say you previously ran

and you want to have a look at the results for each node_threshold
configuration. Just say:

bigmler report --from-dir best_recall --port 8080

and the command will traverse the directories in best_recall and summarize
the results found there in a metrics comparison graphic and an ROC curve if
your
model is categorical. Then a simple HTTP server will be started locally and
bound to a port of your choice, 8080 in the example (8085 will be the
default value), and a new web browser
window will be started to show the results.
You can see an example
built on the well known diabetes dataset.

The HTTP server will create an auxiliary bigmler/reports directory in the
user’s home directory, where symbolic links to the reports in each output
directory will be stored and served from.

Just as the simple bigmler command can generate all the
resources leading to finding models and predictions for a supervised learning
problem, the bigmler cluster subcommand will follow the steps to generate
clusters and predict the centroids associated with your test data. To mimic what
we saw in the bigmler command section, the simplest call is

bigmler cluster --train data/diabetes.csv

This command will upload the data in the data/diabetes.csv file and generate
the corresponding source, dataset and cluster objects in BigML. You
can use any of the generated objects to produce new clusters. For instance, you
could set a subgroup of the fields of the generated dataset to produce a
different cluster by using

that would exclude the field bloodpressure from the cluster creation input
fields.

Similarly to the models and datasets, the generated clusters can be shared
using the --shared option, e.g.

bigmler cluster --source source/53b1f71437203f5ac30004e0 \
--shared

will generate a secret link for both the created dataset and cluster that
can be used to share the resource selectively.

As models were used to generate predictions (class names in classification
problems and an estimated number for regressions), clusters can be used to
predict the subgroup of data that our input data is most similar to.
Each subgroup is represented by its centroid, and the centroid is labelled
by a centroid name. Thus, a cluster would classify our
test data by assigning to each input an associated centroid name. The command

would produce a file centroids.csv with the centroid name associated to
each input. When the command is executed, the cluster information is downloaded
to your local computer and the centroid predictions are computed locally, with
no more latencies involved. Just in case you prefer to use BigML to compute
the centroid predictions remotely, you can do so too

would create a remote source and dataset from the test file data,
generate a batchcentroid also remotely and finally download the result
to your computer. If you prefer the result not to be
downloaded but stored as a new dataset remotely, add --no-csv and
--to-dataset to the command line. This can be especially helpful when
dealing with a high number of scores or when adding the original dataset
fields to the final result with --prediction-info full, which may produce
a large output CSV.

The k-means algorithm used in clustering can only use training data that has
no missing values in its numeric fields. Any data that does not comply is
discarded in cluster construction, so you should ensure that enough
rows in your training data file have non-missing values in their
numeric fields for the cluster to be built and relevant. Similarly, the cluster
cannot issue a centroid prediction for input data that has missing values in
its numeric fields, so centroid predictions will give a “-” string as output
in this case.

You can change the number of centroids used to group the data in the
clustering procedure

bigmler cluster --dataset dataset/53b1f71437203f5ac30004ed \
--k 3

And also generate the datasets associated to each centroid of a cluster.
Using the --cluster-datasets option

bigmler cluster --cluster cluster/53b1f71437203f5ac30004f0 \
--cluster-datasets "Cluster 1,Cluster 2"

you can generate the datasets associated to a comma-separated list of
centroid names. If no centroid name is provided, all datasets are generated.

Similarly, you can generate the models to predict if one instance is associated
to each centroid of a cluster.
Using the --cluster-models option

bigmler cluster --cluster cluster/53b1f71437203f5ac30004f0 \
--cluster-models "Cluster 1,Cluster 2"

you can generate the models associated to a comma-separated list of
centroid names. If no centroid name is provided, all models are generated.
Models can be useful to see which features are important to determine whether
a certain instance belongs to a concrete cluster.

The bigmler anomaly subcommand generates all the resources needed to build
an anomaly detection model and/or predict the anomaly scores associated with
your test data. As usual, the simplest call

bigmler anomaly --train data/tiny_kdd.csv

uploads the data in the data/tiny_kdd.csv file and generates
the corresponding source, dataset and anomaly objects in BigML. You
can use any of the generated objects to produce new anomaly detectors.
For instance, you could set a subgroup of the fields of the generated dataset
to produce a different anomaly detector by using

that would exclude the field urgent from the anomaly detector
creation input fields. You can also change the number of top anomalies
enclosed in the anomaly detector list and the number of trees that the anomaly
detector iforest uses. The default values are 10 top anomalies and 128 trees
per iforest:
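A hedged sketch of such a call (the dataset ID and test file name are
placeholders):

```shell
bigmler anomaly --dataset dataset/53b1f71437203f5ac30004ed \
    --top-n 15 --forest-size 50 --test data/test_kdd.csv
```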

would produce a file anomaly_scores.csv with the anomaly score associated
to each input. When the command is executed, the anomaly detector
information is downloaded
to your local computer and the anomaly score predictions are computed locally,
with no more latencies involved. Just in case you prefer to use BigML
to compute the anomaly score predictions remotely, you can do so too

would create a remote source and dataset from the test file data,
generate a batchanomalyscore also remotely and finally
download the result to your computer. If you prefer the result not to be
downloaded but stored as a new dataset remotely, add --no-csv and
--to-dataset to the command line. This can be especially helpful when
dealing with a high number of scores or when adding the original dataset
fields to the final result with --prediction-info full, which may produce
a large output CSV.

Similarly, you can split your data in train/test datasets to build the
anomaly detector and create batch anomaly scores with the test portion of
data

bigmler anomaly --train data/tiny_kdd.csv --test-split 0.2 --remote

or if you want to apply the anomaly detector on the same training data set
to create a batch anomaly score, use:

bigmler anomaly --train data/tiny_kdd.csv --score --remote

To extract the top anomalies as a new dataset, or to exclude from the training
dataset the top anomalies in the anomaly detector, set the

You can extract samples from your datasets in BigML using the
bigmler sample subcommand. When a new sample is requested, a copy
of the dataset is stored in a special format in an in-memory cache.
This sample can then be used, before its expiration time, to
extract data from the related dataset by setting some options like the
number of rows or the fields to be retrieved. You can either begin from
scratch uploading your data to BigML, creating the corresponding source and
dataset and extracting your sample from it

bigmler sample --train data/iris.csv --rows 10 --row-offset 20

This command will create a source, a dataset and a sample object, whose id
will be stored in the samples file in the output directory,
and extract 10 rows of data
starting from the 21st that will be stored in the sample.csv file.
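The sample options described below might be combined as follows (the sample ID
and the --row-fields option are assumptions used for illustration):

```shell
bigmler sample --sample sample/53b1f71437203f5ac30004f2 \
    --sample-header --row-fields "petallength,petalwidth" \
    --mode linear --row-order-by="-petallength"
```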

will create a new sample.csv file with a headers row where only the
petallength and petalwidth fields are retrieved. The --mode linear
option will cause the first available rows to be returned, and the
--row-order-by="-petallength" option returns these rows sorted in
descending order according to the contents of petallength.

You can also add to the sample rows some statistical information by using the
--stat-field or --stat-fields options. Adding them to the command
will generate a stat-info.json file where the Pearson’s and Spearman’s
correlations, and linear regression terms will be stored in a JSON format.

You can also apply a filter to select the sample rows by the values in
their fields using the --fields-filter option. This must be set to
a string containing the conditions that must be met using field ids
and values.
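An illustrative call matching the description below (the sample ID is a
placeholder):

```shell
bigmler sample --sample sample/53b1f71437203f5ac30004f2 \
    --fields-filter "000001=&!000004=Iris-setosa"
```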

With this command, only rows where field id 000001 is missing and
field id 000004 is not Iris-setosa will be retrieved. You can check
the available operators and syntax in the
samples’ developers doc.
More available
options can be found in the Samples subcommand Options
section.

This subcommand extracts the information in the existing resources to determine
the arguments that were used when they were created,
and generates scripts that could be used to reproduce them. Currently, the
language used in the scripts will be Python. The usual starting
point for BigML resources is a source created from inline, local or remote
data. Thus, the script keeps analyzing the chain of calls that led to a
certain resource until the root source is found.

According to this output, the source was created from a file named iris.csv
and was assigned a name. This script will be stored in the command output
directory and named reify.py (you can specify a different name and location
using the --output
option).

When creating sources from data, field types are inferred from the contents
of the first lines in the uploaded file. Sometimes, these field types must be
adapted and the source fields attributes are updated. You can also
change other fields attributes, like their name, label or description.
In order to make sure
that the right fields information is reproduced, add the --add-fields flag:

Other resources will have more complex workflows and more user-given
attributes. Let’s see for instance the
script to generate an evaluation from a train/test split of a source that
was created using the
bigmler --train data/iris.csv --evaluate command:

As you can see, BigMLer has added a default category, name,
description and tags, has built the model on 80% of the data,
and has used the out_of_bag attribute so that the
evaluation uses the remaining part of the dataset as test data.

This subcommand creates and executes scripts in WhizzML (BigML’s automation
language). With WhizzML you can program any specific workflow that involves
Machine Learning resources like datasets, models, etc. You just write a
script using the directives in the
reference manual
and upload it to BigML, where it will be available as one more resource in
your dashboard. Scripts can also be shared and published in the gallery,
so you can reuse other users’ scripts and execute them. These operations
can also be done using the bigmler execute subcommand.

The simplest example is executing some basic code, like adding two numbers:

bigmler execute --code "(+ 1 2)" --output-dir simple_exe

With this command, bigmler will generate a script in BigML whose source code
is the one given as a string in the --code option. The script ID will
be stored in a file called scripts in the simple_exe
directory. After that, the
script will be executed, so a new resource called execution will be
created in BigML, and the corresponding ID will be stored in the
execution file of the output directory.
Similarly, the result of the execution will be stored
in whizzml_results.txt and whizzml_results.json
(in human-readable format and JSON respectively) in the
directory set in the --output-dir option. You can also use the code
stored in a file with the --code-file option.

Adding the --no-execute flag to the command will cause the process to
stop right after the script creation. You can also compile your code as a
library to be used in many scripts by setting the --to-library flag.

bigmler execute --code-file my_library.whizzml --to-library

Existing scripts can be referenced for execution with the --script option

bigmler execute --script script/50a2bb64035d0706db000643

or the script ID can be read from a file:

bigmler execute --scripts simple_exe/scripts

The script we used as an example is very simple and needs no additional
parameter. But, in general, scripts
will have input parameters and output variables. The inputs define the script
signature and must be declared in order to create the script. The outputs
are optional and any variable in the script can be declared to be an output.
Both inputs and outputs can be declared using the --declare-inputs and
--declare-outputs options. These options must contain the path
to the JSON file where the information about the
inputs and outputs (respectively) is stored.

[{"name": "a", "default": 0, "type": "number"},
 {"name": "b", "default": 0, "type": "number",
  "description": "second number to add"}]

and my_outputs_dec.json

[{"name":"addition","type":"number"}]

so that the value of the addition variable would be returned as
output in the execution results.
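Assuming the inputs declaration above is stored in a file named
my_inputs_dec.json (the file names here are assumptions for illustration),
both declarations could be attached when creating the script:

```shell
bigmler execute --code-file my_script.whizzml --no-execute \
    --declare-inputs my_inputs_dec.json \
    --declare-outputs my_outputs_dec.json
```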

Additionally, a script can import libraries. The list of libraries to be
used as imports can be added to the command with the option --imports
followed by a comma-separated list of library IDs.

Once the script has been created and its inputs and outputs declared, to
execute it you’ll need to provide a value for each input. This can be
done using --inputs, that will also point to a JSON file where
each input should have its corresponding value.
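For the addition script above, the JSON file set in --inputs could contain a
list of [name, value] pairs (values are illustrative):

```json
[["a", 1],
 ["b", 2]]
```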

You can also provide default configuration attributes
for the resources generated in an execution. Add the
--creation-defaults option followed by the path
to a JSON file that contains a dictionary whose keys are the resource types
to which the configuration defaults apply and whose values are the
configuration attributes set by default.
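For instance, with a defaults file like the following (the project ID is a
placeholder):

```json
{"source": {"project": "project/54f977e4450dda4d55000001"},
 "ensemble": {"number_of_models": 100, "sample_rate": 0.9}}
```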

the source created by the script will be associated with the given project
and the ensemble will have 100 models and a 0.9 sample rate unless the source
code in your script explicitly specifies a different value, in which case
it takes precedence over these defaults.

This subcommand creates packages of scripts and libraries in WhizzML
(BigML’s automation
language) based on the information provided by a metadata.json
file. These operations
can also be performed individually using the bigmler execute subcommand, but
bigmler whizzml reads the components of the package, and for each
component analyzes the corresponding metadata.json file to identify
the kind of code (script or library) that it contains and creates the
corresponding
resource in BigML. The metadata.json is expected to contain the
name, kind, description, inputs and outputs needed to create the script.
As an example,

{"name": "Example of whizzml script",
 "description": "Test example of a whizzml script that adds two numbers",
 "kind": "script",
 "source_code": "code.whizzml",
 "inputs": [{"name": "a", "type": "number", "description": "First number"},
            {"name": "b", "type": "number", "description": "Second number"}],
 "outputs": [{"name": "addition", "type": "number",
              "description": "Sum of the numbers"}]}

describes a script whose code is to be found in the code.whizzml file.
The script will have two inputs a and b and one output: addition.

In order to create this script, you can type the following command:

bigmler whizzml --package-dir my_package --output-dir creation_log

and bigmler will:

look for the metadata.json file located in the my_package
directory.

parse the JSON, identify that it defines a script and look for its code in
the code.whizzml file

create the corresponding BigML script resource, adding as arguments the ones
provided in inputs, outputs, name and description.

Packages can contain more than one script. In this case, a nested directory
structure is expected. The metadata.json file for a package with many
components should include the name of the directories where these components
can be found:
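A sketch of such a metadata.json (the names are illustrative):

```json
{"name": "my_package",
 "description": "Example of a composite whizzml package",
 "kind": "package",
 "components": ["my_script_1",
                "my_script_2"]}
```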

In this example, each string in the components attributes list corresponds
to one directory where a new script or library (with its corresponding
metadata.json descriptor) is stored. Then, using bigmler whizzml
for this composite package will create each of the component scripts or
libraries. It will also handle dependencies, using the IDs of the created
libraries as imports for the scripts when needed.

This subcommand can be used to retrain an existing modeling resource (model,
ensemble, deepnet, etc.) by adding new data to it. In BigML, resources are
immutable to ensure traceability, but at the same time they are reproducible.
Therefore, any model can be rebuilt using the data stored in a new consolidated
dataset or even from a list of existing datasets. That’s retraining the model,
and the bigmler retrain
subcommand provides a simple way to do it.

In the basic use case, different parameters and model types are tried and
evaluated until the best performing model is found. Then you can call:

so that the data in your local data/iris.csv file is uploaded to the
platform and all the steps that led to your existing model are reproduced to
create a new merged dataset that will be used to retrain your model. The
command output will contain the URL that you need to call to ensure you
always use the latest version of your model. The URL will look like:

in this case, the resource to retrain is an ensemble that has been
previously tagged as my_ensemble. The bigmler retrain command will
look for the newest ensemble that contains that tag and after uploading and
consolidating your data with the one previously used in the ensemble, it will
rebuild it. The reference used in the URL that will contain the latest version
of the ensemble will use this tag also as reference:

In a different scenario, you might want to retrain your model from a list
of datasets, for instance training an anomaly detector using the data of the
last 6 months. This means that you don’t want your data to be merged. Rather
you would like to use a window over the list of available datasets.

In this case, adding the --window-size option to your command will cause
the dataset created by uploading your new data to be added to the list of
datasets as a separate resource. Then the model will be rebuilt using the
number of datasets set as --window-size.

The operations run by bigmler retrain are mainly run on BigML’s servers
using WhizzML scripts. These scripts are created in the user’s
account the first time you run the command, but they can also be recreated
by using the --upgrade flag in any bigmler retrain command call.

You have seen that BigMLer is an agile tool that empowers you to create a
great number of resources easily. This is a tremendous help, but it can also
lead to a garbage-prone environment. To keep track of each newly created
remote resource, use the --resources-log flag followed by the name of the
log file you choose.

bigmler --train data/iris.csv --resources-log my_log.log

Each new resource created by that command will cause its id to be appended as
a new line of the log file.

BigMLer can help you as well in deleting these resources. Using the delete
subcommand there are many options available. For instance, deleting a
comma-separated list of ids
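For example (the resource ids are placeholders):

```shell
bigmler delete \
    --ids source/50a2bb64035d0706db0001ed,dataset/50a2bb64035d0706db0001ee
```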

As we’ve previously seen, each BigMLer command execution generates a
bunch of remote resources whose ids are stored in files located in a directory
that can be set using the --output-dir option. The
bigmler delete subcommand can retrieve the ids stored in such files by
using the --from-dir option.

You can also delete resources by date. The options --newer-than and
--older-than let you specify a reference date. Resources created after or
before that date, respectively, will be deleted. Both options can be combined to
set a range of dates. The allowed values are:

dates in a YYYY-MM-DD format

integers, that will be interpreted as number of days before now

resource id, the creation datetime of the resource will be used

Thus,

bigmler delete --newer-than 2

will delete all resources created less than two days ago (now being
2014-03-23 14:00:00.00000, its creation time will be greater
than 2014-03-21 14:00:00.00000).

bigmler delete --older-than 2014-03-20 --newer-than 2014-03-19

will delete all resources created on March the 19th, 2014 (creation time
between 2014-03-19 00:00:00 and 2014-03-20 00:00:00) and

bigmler delete --newer-than source/532db2b637203f3f1a000104

will delete all resources created after the source/532db2b637203f3f1a000104
was created.

You can also combine both types of options, to delete sources tagged as
my_tag starting from a certain date on

bigmler delete --newer-than 2 --source-tag my_tag

And finally, you can filter the type of resource to be deleted using the
--resource-types option to specify a comma-separated list of resource
types to be deleted
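A sketch combining these filters with --dry-run (the tag and source ID are
placeholders):

```shell
bigmler delete --newer-than source/532db2b637203f3f1a000104 \
    --source-tag my_source --resource-types source --dry-run
```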

The output for the command will be a list of resources that would be deleted
if the --dry-run flag was removed. In this case, they will be sources
that contain the tag my_source and were created after the one given as
--newer-than value. The first 15 resources will be logged
to console, and the complete list can be found in the bigmler_sessions
file.

By default, only finished resources are selected to be deleted. If you want
to delete other resources, you can select them by choosing their status:

The bigmler export subcommand is intended to help you generate the code
needed to integrate your BigML models in other applications.
To produce a prediction using a BigML model you just need a function that
receives as argument the new test
case data and returns this prediction (and a confidence). The bigmler export
subcommand will retrieve the JSON information of your existing
decision tree model in BigML and will generate from it this function code and
store it in a file that can be imported or copied directly in your application.

Obviously, the function syntax will depend on the model and the language
used in your application, so these will be the options we need to provide:
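A hedged sketch of such a call (the model ID is a placeholder):

```shell
bigmler export --model model/532db2b637203f3f1a001304 \
    --language javascript --output-dir my_exports
```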

This command will create a javascript version of the function that
produces the predictions and store it in a file named
model_532db2b637203f3f1a001304.js (after the model
ID) in the my_exports directory.

Models can currently be exported in Python, Javascript and R. For models
whose fields are numeric or categorical, the command
also supports creating MySQL functions and Tableau separate expressions
for both the prediction and the confidence.

You can also generate the code for all the models in an ensemble in a
single bigmler export command using the --ensemble option followed
by the corresponding ensemble ID. The code for
each model will be stored in a separate file, named after the model ID and
transforming the slash into an underscore.

Projects are organizational resources and they are usually created at
source-creation time in order to keep together in a separate repo all
the resources derived from a source. However, you can also create a project
or update its properties independently using the bigmler project
subcommand.

bigmler project --name my_project

will create a new project and name it. You can also add other attributes
such as --tag, --description or --category in the project
creation call. You can also add or update any other attribute to
the project using a JSON file with the --project-attributes option.

Association Discovery is a popular method to find out relations among values
in high-dimensional datasets.

A common case where association discovery is often used is
market basket analysis. This analysis seeks customer shopping
patterns across large transactional
datasets. For instance, do customers who buy hamburgers and ketchup also
consume bread?

Businesses use those insights to make decisions on promotions and product
placements.
Association Discovery can also be used for other purposes such as early
incident detection, web usage analysis, or software intrusion detection.

In BigML, the Association resource object can be built from any dataset, and
its results are a list of association rules between the items in the dataset.
In the example case, the corresponding
association rule would have hamburgers and ketchup as the items at the
left hand side of the association rule and bread would be the item at the
right hand side. Both sides in this association rule are related,
in the sense that observing
the items in the left hand side implies observing the items in the right hand
side. There are several metrics to assess the quality of these association rules:

Support: the proportion of instances which contain an itemset.

For an association rule, it means the number of instances in the dataset which
contain the rule’s antecedent and rule’s consequent together
over the total number of instances (N) in the dataset.

It gives a measure of the importance of the rule. Association rules have
to satisfy a minimum support constraint (i.e., min_support).

Coverage: the support of the antecedent of an association rule.

It measures how often a rule can be applied.

Confidence (or strength): the probability of seeing the rule’s consequent

under the condition that the instances also contain the rule’s antecedent.
Confidence is computed using the support of the association rule over the
coverage. That is, the percentage of instances which contain the consequent
and antecedent together over the number of instances which only contain
the antecedent.

Confidence is directed and gives different values for the association
rules Antecedent → Consequent and Consequent → Antecedent. Association
rules also need to satisfy a minimum confidence constraint
(i.e., min_confidence).

Leverage: the difference of the support of the association

rule (i.e., the antecedent and consequent appearing together) and what would
be expected if antecedent and consequent were statistically independent.
This is a value between -1 and 1. A positive value suggests a positive
relationship and a negative value suggests a negative relationship.
0 indicates independence.

Lift: how many times more often antecedent and consequent occur together
than expected if they were statistically independent.
A value of 1 suggests that there is no relationship between the antecedent
and the consequent. Higher values suggest stronger positive relationships.
Lower values suggest stronger negative relationships (the presence of the
antecedent reduces the likelihood of the consequent)

As to the items used in association rules, each type of field is parsed to
extract items for the rules as follows:

Categorical: each different value (class) will be considered a separate item.

Text: each unique term will be considered a separate item.

Items: each different item in the items summary will be considered.

Numeric: Values will be converted into categorical by making a

segmentation of the values.
For example, a numeric field with values ranging from 0 to 600 split
into 3 segments:
segment 1 → [0, 200), segment 2 → [200, 400), segment 3 → [400, 600].
You can refine the behavior of the transformation using
discretization
and field_discretizations.

The bigmler association subcommand will discover the association
rules present in your
datasets. Starting from the raw data in your files:

bigmler association --train my_file.csv

will generate the source, dataset and association objects
required to present the association rules hidden in your data. You can also
limit the number of rules extracted using the --max-k option
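For instance (the dataset ID is a placeholder):

```shell
bigmler association --dataset dataset/53b1f71437203f5ac30004ed --max-k 10
```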

The bigmler logistic-regression subcommand generates all the
resources needed to build
a logistic regression model and use it to predict.
The logistic regression model is a supervised
learning method for solving classification problems. It predicts the
objective field class as a logistic function whose argument is a linear
combination of the rest of the features. The simplest call to build a logistic
regression is

bigmler logistic-regression --train data/iris.csv

This command uploads the data in the data/iris.csv file and generates
the corresponding source, dataset and logisticregression
objects in BigML. You
can use any of the generated objects to produce new logistic regressions.
For instance, you could set a subgroup of the fields of the generated dataset
to produce a different logistic regression model by using
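For instance, a sketch with a hypothetical dataset id:

bigmler logistic-regression --dataset dataset/53b1f71437203f5ac30004ed --logistic-fields="-sepal length"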

that would exclude the field sepal length from the logistic regression
model creation input fields. You can also change some parameters of the
logistic regression model, like the bias (scale of the intercept term),
c (the strength of the regularization) or eps (stopping criterion
for the solver).
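A prediction call, sketched here with a hypothetical logistic regression id,
such as

bigmler logistic-regression --logistic-regression logisticregression/53b1f71437203f5ac30005cd --test data/test_iris.csv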

would produce a file predictions.csv with the predictions associated
with each input. When the command is executed, the logistic regression
information is downloaded
to your local computer and the logistic regression predictions are
computed locally,
with no further latencies involved. If you prefer to use BigML
to compute the predictions remotely, you can do so too
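by adding the --remote flag, as in this sketch (again with hypothetical ids)

bigmler logistic-regression --logistic-regression logisticregression/53b1f71437203f5ac30005cd --test data/test_iris.csv --remote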

would create a remote source and dataset from the test file data,
generate a batchprediction also remotely and finally
download the result to your computer. If you prefer the result not to be
downloaded but to be stored as a new dataset remotely, add --no-csv and
--to-dataset to the command line. This can be especially helpful when
dealing with a high number of scores or when adding the original dataset
fields to the final result with --prediction-info full, which may result
in a large CSV being created as output.

Using this subcommand you can generate all the
resources leading to finding a topicmodel and its topicdistributions.
These are unsupervised learning models which find out the topics in a
collection of documents and will then be useful to classify new documents
according to the topics. The bigmler topic-model subcommand
will follow the steps to generate
topicmodels and predict the topicdistribution, or distribution of
probabilities for the new document to be associated with a certain topic. As
shown in the bigmler command section, the simplest call is

bigmler topic-model --train data/spam.csv

This command will upload the data in the data/spam.csv file and
generate
the corresponding source, dataset and topicmodel objects in BigML.
You
can use any of the intermediate generated objects to produce new
topic models. For instance, you
could set a subgroup of the fields of the generated dataset to produce a
different topic model by using
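For example, a sketch with a hypothetical dataset id, where the --shared
flag is what triggers the creation of secret links:

bigmler topic-model --dataset dataset/58437a277e0a8d38ec028a5e --shared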

will generate a secret link for both the created dataset and topic model that
can be used to share the resource selectively.

As models were used to generate predictions (class names in classification
problems and an estimated number in regressions), topic models can be used
to classify a new document within the discovered list of topics. The
classification is run by computing the probability of the document belonging
to each topic group. The command
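sketched here with a hypothetical topic model id

bigmler topic-model --topic-model topicmodel/58437a277e0a8d38ec028a5f --test data/test_spam.csv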

would produce a file topic_distributions.csv where each row will contain
the probabilities
associated with each topic for the corresponding test input.
When the command is executed, the topic model information is downloaded
to your local computer and the distributions are computed locally, with
no further latencies involved. If you prefer to use BigML to compute
the topic distributions remotely, you can do so too

would create a remote source and dataset from the test file data,
generate a batchtopicdistribution also remotely and finally
download the result
to your computer. If you prefer the result not to be
downloaded but to be stored as a new dataset remotely, add --no-csv and
--to-dataset to the command line. This can be especially helpful when
dealing with a high number of scores or when adding the original dataset
fields to the final result with --prediction-info full, which may result
in a large CSV being created as output.

Using this subcommand you can generate all the
resources leading to a timeseries and its forecasts.
The timeseries is a supervised learning model that works on
an ordered sequence of data to extract the patterns needed to make
forecasts. The bigmler time-series subcommand
will follow the steps to generate
timeseries and predict the forecasts for every numeric field in
the original dataset that has been set as objective field. As
shown in the bigmler command section, the simplest call is

bigmler time-series --train data/grades.csv

This command will upload the data in the data/grades.csv file and
generate
the corresponding source, dataset and timeseries objects in BigML.
You
can use any of the intermediate generated objects to produce new
time series. For instance, you
could set a subgroup of the numeric fields in the dataset to be used
as objective fields using the --objectives option.
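For example, a sketch with a hypothetical dataset id, adding the --shared
flag as well:

bigmler time-series --dataset dataset/58437a277e0a8d38ec028a60 --objectives "Assignment,Final" --shared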

will generate a secret link for both the created dataset and time series that
can be used to share the resource selectively.

As models were used to generate predictions (class names in classification
problems and an estimated number in regressions), time series can be used
to generate forecasts, that is, to predict the value of each objective
field up to the user-given horizon. The command
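sketched here with a hypothetical time series id and a horizon of 10 points

bigmler time-series --time-series timeseries/58437a277e0a8d38ec028a61 --horizon 10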

would produce a file forecast_000001.csv with ten rows, one per point, and
as many columns as ETS models the time series contains.

When the command is executed, the time series information is downloaded
to your local computer and the forecasts are computed locally, with
no further latencies involved. If you prefer to use BigML to compute
the forecasts remotely, you can do so too

would create a remote forecast with the specified horizon. You can also
specify more complex inputs for the forecast. For instance, you can set a
different horizon to each objective field and you can give some criteria
to select the models used in the forecast. All of this can be done using
the --test option pointing to a JSON file that should contain the
input to be used in the forecast as described in the
API documentation. As an example,
let’s set a horizon of 5 points for the Final field and select the
first model in the time series array of ETS models, and also forecast 7
points for the Assignment field using the model with the lowest aic (the one
used by default). The command call should then be:
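As an illustration only (check the API documentation for the exact input
schema; the ids below are hypothetical), the forecast input could be stored
in a forecast.json file like

{"Final": {"horizon": 5, "ets_models": {"indices": [0]}},
 "Assignment": {"horizon": 7, "ets_models": {"criterion": "aic", "limit": 1}}}

and used in the call

bigmler time-series --time-series timeseries/58437a277e0a8d38ec028a61 --test forecast.json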

The bigmler deepnet subcommand generates all the
resources needed to build
a deepnet model and use it to predict.
The deepnet model is a supervised
learning method for solving both regression and classification problems. It
uses deep neural networks, a composition of layers of different functions
that when applied to the
input data generate the prediction.

The simplest call to build a deepnet is:

bigmler deepnet --train data/iris.csv

This command uploads the data in the data/iris.csv file and generates
the corresponding source, dataset and deepnet
objects in BigML. You
can use any of the generated objects to produce new deepnets.
For instance, you could set a subgroup of the fields of the generated dataset
to produce a different deepnet model by using
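For instance, a sketch with hypothetical resource ids:

bigmler deepnet --dataset dataset/53b1f71437203f5ac30004ed --deepnet-fields="-sepal length"

would exclude the field sepal length from the input fields, and predicting
with the generated deepnet for a test file

bigmler deepnet --deepnet deepnet/53b1f71437203f5ac30005ce --test data/test_iris.csv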

would produce a file predictions.csv with the predictions associated
with each input. When the command is executed, the deepnet
information is downloaded
to your local computer and the deepnet predictions are
computed locally,
with no further latencies involved. If you prefer to use BigML
to compute the predictions remotely, you can do so too

would create a remote source and dataset from the test file data,
generate a batchprediction also remotely and finally
download the result to your computer. If you prefer the result not to be
downloaded but to be stored as a new dataset remotely, add --no-csv and
--to-dataset to the command line. This can be especially helpful when
dealing with a high number of scores or when adding the original dataset
fields to the final result with --prediction-info full, which may result
in a large CSV being created as output.

Most of the previously described commands need the remote resources to
be downloaded to work. For instance, when you want to create a new
model from an existing dataset, BigMLer is going to download the dataset
JSON structure to extract the fields and objective field information,
and only then ask for the model creation. As mentioned,
the --store flag forces BigMLer to store the downloaded JSON
structures in local files inside your output directory. If you use that flag
when building a model with BigMLer, then the model is stored in your computer.
This model file contains all the information you need in order to make
new predictions, so you can use the
--model-file option to set the path to this file and predict
the value of your objective field for new input data with no reference at all
to your remote resources. You could even delete the original remote model and
work exclusively with the locally downloaded file
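For instance, a sketch of such a local prediction (the stored model file
path is hypothetical)

bigmler --model-file my_dir/model_53b1f71437203f5ac30005cd --test data/test_iris.csv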

Network connections failures or other external causes can break the BigMLer
command process. To resume a command ended by an unexpected event you
can issue

bigmler --resume

BigMLer keeps track of each command you issue in a .bigmler file and of
the output directory in .bigmler_dir_stack of your working directory.
Then --resume will recover the last issued command and try to continue
working from the point where it was stopped. There’s also a --stack-level flag

bigmler --resume --stack-level 1

to allow resuming a previous command in the stack. In the example, the one
before the last.

The resources generated in the execution of a BigMLer command are listed in
the standard output by default,
but they can be summarized as well in a Gazibit format.
Gazibit is a platform where you can create interactive
presentations in a
flexible and dynamic way. Using BigMLer’s --reports gazibit option you’ll
be able to generate a Gazibit summary report of your newly created
resources. If you also use the --shared flag, a second template will be generated
where the links for the shared resources will be used. Both reports will be
stored in the reports subdirectory of your output directory, where all of
the files generated by the BigMLer command are. Thus,
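a call like the following sketch

bigmler --train data/iris.csv --output-dir my_dir --reports gazibit --shared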

will generate two files: gazibit.json and gazibit_shared.json in a
reports subdirectory of your my_dir directory. In case you provide
your Gazibit token in the GAZIBIT_TOKEN environment variable, they will
also be uploaded to your account in Gazibit. Upload can be avoided by
using the --no-upload flag.

BigMLer will look for a bigmler.ini file in the working directory, where
users can personalize the default values they like for the most relevant flags.
The options should be written in config style, e.g.

[BigMLer]
dev = true
resources_log = ./my_log.log

as you can see, under a [BigMLer] section the file should contain one line
per option. Dashes in flags are transformed to underscores in options.
The example would keep development mode on and would log all created
resources to my_log.log for any new bigmler command issued under the
same working directory if none of the related flags are set.

Naturally, the default value options given in this file will be overridden by
the corresponding flag value in the present command. To follow the previous
example, if you use

bigmler --train data/iris.csv --resources-log ./another_log.log

in the same working directory, the flag value will take precedence and
resources will be logged in another_log.log. For boolean-valued flags,
such as --replacement itself, you’ll need to use the associated negative
flags to
override the default behaviour. That is, following the former example, if you
want to avoid storing the downloaded resource JSON information,
you should use the --no-store flag.

BigMLer requires bigml 4.18.1 or
higher. Using the proportional missing strategy additionally requires
the numpy and
scipy libraries. They are not
automatically installed as dependencies, as they are quite heavy and
only required in that case. Therefore, they have been left for
the user to install if needed. The same happens with the
pystemmer
library, used only for topic modeling. Check the bindings documentation
for more info.

You can also install the development version of bigmler directly
from the Git repository

$ pip install -e git+https://github.com/bigmlcom/bigmler.git#egg=bigmler

Pystemmer
is needed in order to use the bigmler topic-model subcommand.
This library is not installed automatically because it needs compilation and
some developer tools to be present in your operating system.
If the pip install pystemmer command ends in error,
please check the error message for the link to these tools (Windows) or
install the Xcode developer tools (OSX).

For a detailed description of install instructions on Windows see the
:ref:bigmler-windows section.

All the requests to BigML.io must be authenticated using your username
and API key and are always
transmitted over HTTPS.

The BigML module will look for your username and API key in the environment
variables BIGML_USERNAME and BIGML_API_KEY respectively. You can
add the following lines to your .bashrc or .bash_profile to set
those variables automatically when you log in
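e.g. (replace the values with your own credentials)

export BIGML_USERNAME=myusername
export BIGML_API_KEY=ae579e7e53fb9abd646a6ff8aa99d4afe83ac291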

In addition to that, you’ll need the pip tool to install BigMLer. To
install pip, first you need to open your command line window (write cmd in
the input field that appears when you click on Start and hit enter),
download this python file
and execute it

c:\Python27\python.exe ez_setup.py

After that, you’ll be able to install pip by typing the following command

c:\Python27\Scripts\easy_install.exe pip

And finally, to install BigMLer, just type

c:\Python27\Scripts\pip.exe install bigmler

and BigMLer should be installed in your computer. Then
issuing

bigmler --version

should show BigMLer version information.

Finally, to start using BigMLer to handle your BigML resources, you need to
set your credentials in BigML for authentication. If you want them to be
permanently stored in your system, use
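a sketch such as the following, using the Windows setx command to persist
the environment variables (replace the values with your own credentials)

setx BIGML_USERNAME myusername
setx BIGML_API_KEY ae579e7e53fb9abd646a6ff8aa99d4afe83ac291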

The Sandbox environment that could be reached by using the flag --dev
has been deprecated. Right now, there’s only one mode to work with BigML:
the previous Production Mode, so the flag is no longer available.

Path to a file containing
sample/ids.
One sample per line
(e.g., sample/4f824203ce80051)

--no-sample

No sample will be generated

--sample-fields FIELD_NAMES

Comma-separated list of fields
that
will be used in the sample
detector construction

--sample-attributes PATH

Path to a JSON file containing
attributes (any of the updatable
attributes described in the
developers section )
to
be used in the sample creation
call

--fields-filter QUERY

Query string that will be used as
filter before selecting the
sample
rows. The query string can be
built
using the field ids, their
values and
the usual operators. You can see
some
examples in the
developers section

--sample-header

Adds a headers row to the
sample.csv
output

--row-index

Prepends a column to the sample
rows
with the absolute row number

--occurrence

Prepends a column to the sample
rows
with the number of occurrences of
each
row. When used with --row-index,
the occurrence column will be
placed
after the index column

--precision

Decimal numbers precision

--rows SIZE

Number of rows returned

--row-offset OFFSET

Skip the given number of rows

--row-order-by FIELD_NAME

Field name whose values will be
used
to sort the returned rows

--row-fields FIELD_NAMES

Comma-separated list of fields
that
will be returned in the sample

--stat-fields FIELD_NAME,FIELD_NAME

Two comma-separated numeric field
names
that will be used to compute
their
Pearson’s and Spearman’s
correlations
and linear regression terms

--stat-field FIELD_NAME

Numeric field that will be used
to compute
Pearson’s and Spearman’s
correlations
and linear regression terms
against
the rest of numeric fields in the
sample

Path to a file containing
deepnet/ids.
One deepnet per line
(e.g., deepnet/4f824203ce80051)

--no-deepnet

No deepnet will be
generated

--deepnet-fields DEEPNET_FIELDS

Comma-separated list of fields
that
will be used in the deepnet
construction

--batch-normalization

Specifies whether to normalize
the outputs of a network before
being passed to the activation
function or not.

--default-numeric-value DFT

It accepts any of the following
strings to substitute
missing numeric values across
all the numeric fields in the
dataset (options: mean, median,
minimum, maximum, zero).

--dropout-rate RATE

A number between 0 and 1
specifying the rate at which to
drop weights during training to
control overfitting.

--hidden-layers LAYERS

A JSON file that contains a list
of maps describing the number
and type of layers in the network
(other than the output layer,
which is determined by the type
of learning problem).

--learn-residuals

Specifies whether alternate
layers should learn a
representation of the residuals
for a given layer rather than
the layer itself or not.

--learning-rate RATE

A number between 0 and 1
specifying the learning rate.

--max-iterations ITERATIONS

A number between 100 and 100000
for the maximum number of
gradient steps to take during the
optimization.

--max-training-time TIME

The maximum wall-clock training
time, in seconds, for which to
train the network.

--number-of-hidden-layers #LAYERS

The number of hidden layers to
use in the network. If the number
is greater than the length of the
list of hidden_layers, the list
is cycled until the desired
number is reached. If the number
is smaller than the length of the
list of hidden_layers, the
list is shortened.

--number-of-model-candidates #CAND

An integer specifying the number
of models to try during the model
search.

--search

During the deepnet creation,
BigML trains and evaluates over
all possible network
configurations, returning the
best networks found for the
problem. The final deepnet
returned by the search is a
compromise between the top n
networks found in the search.
Since this option builds several
networks, it may be significantly
slower than the suggest_structure
technique.

--missing-numerics

Whether to create an additional
binary predictor for each numeric
field which denotes a missing
value. If false, these predictors
are not created, and rows
containing missing numeric values
are dropped.

--tree-embedding

Specify whether to learn a
tree-based representation
of the data as engineered
features along with the
raw features, essentially by
learning trees over slices of the
input space and a small amount of
the training data. The theory is
that these engineered features
will linearize obvious non-linear
dependencies before training
begins, and so make learning
proceed more quickly.

--no-missing-numerics

Avoids the default behaviour,
which creates a new
coefficient for missings in
numeric fields. Missing rows are
discarded.

--deepnet-attributes PATH

Path to a JSON file containing
attributes (any of the updatable
attributes described in the
developers section )
to
be used in the deepnet creation
call

BigMLer will accept flags written with underscore as word separator like
--clear_logs for compatibility with prior versions. Also --field-names
is accepted, although the more complete --field-attributes flag is
preferred. --stat_pruning and --no_stat_pruning are discontinued
and their effects can be achieved by setting the actual --pruning flag
to statistical or no-pruning values respectively.