BigML.io—The BigML API Documentation


Overview

Last Updated: Thursday, 2015-02-05 23:59

This page provides an introduction to BigML.io—The
BigML API. A quick start guide for the impatient is here.

BigML.io is a Machine Learning REST API to
easily build, run, and bring predictive models to your project. You can
use BigML.io for basic supervised and unsupervised machine
learning tasks and also to create sophisticated machine learning
pipelines.

BigML.io is a REST-style API for creating and managing
BigML resources
programmatically. That is to say, using BigML.io you can
create, retrieve, update and delete BigML resources using standard HTTP methods.

The four original BigML resources are:
source, dataset,
model, and prediction.

As shown in the picture below, the most basic flow consists of using some local (or remote) training data to create a
source,
then using the source to create a
dataset, later using the dataset to
create a model,
and, finally, using the model and new input data to create a
prediction.

The training data is usually in tabular format. Each row in the data represents
an instance (or example) and each column a field (or attribute). These
fields are also known as predictors or
covariates.

When the machine learning task is supervised, one of the columns
(usually the last column) represents a special attribute known as the
objective field (or target) that assigns a label (or class) to each
instance. Training data in this format is called labeled, and the
machine learning task that learns from it is called supervised learning.

Once a source is created, it can be used to create multiple datasets.
Likewise, a dataset can be used to create multiple models and a model
can be used to create multiple predictions.

A model can be either a classification or a
regression model depending
on whether the objective field is respectively
categorical or numeric.

Often an ensemble (or collection of models) can
perform better than just a single model. Thus, a
dataset can also be used to
create an ensemble instead of a single
model.

A dataset can also be used to create a
cluster or an anomaly detector.
Clusters and Anomaly Detectors are both built using
unsupervised learning and therefore an objective field is
not needed. In these cases, the training data is named
unlabeled.

A centroid is to a cluster what a
prediction is to a
model. Likewise, an anomaly score is
to an anomaly detector what a
prediction is to a model.

There are scenarios where generating predictions for a relatively big
collection of input data is very convenient. For these
scenarios, BigML.io offers batch resources such as: batchprediction,
batchcentroid, and batchanomalyscore. These resources take a dataset and respectively a model (or ensemble), a cluster, or
an anomaly detector to create a new dataset that contains a new column
with the corresponding prediction, centroid or anomaly score computed
for each instance in the dataset.

When dealing with multiple projects, it's better to keep the resources
that belong to each project separated. Thus, BigML also has a resource
named project
that helps you group together all the other resources. As you will see,
you just need to assign a source to a pre-existing
project and all the subsequent resources will be
created in that project.

Note: In the snippets below you should substitute Alfred's username and
API key for your own username and API Key.

REST API

You can create, read, update,
and delete resources using the respective standard HTTP methods: POST,
GET, PUT and DELETE.

All communication with BigML.io is JSON formatted except for source creation.
Source creation is handled with an HTTP POST using the
"multipart/form-data" content-type.

HTTPS

All access to BigML.io must be performed over HTTPS.
In this way communication between your application and BigML.io is encrypted and the integrity of traffic between both is verified.

Base URL

All BigML.io HTTP commands use the following base URL:

https://bigml.io

Base URL

Development mode

BigML comes with a development mode
that allows you to explore the BigML API free of charge, as much as you want. To use BigML.io in development mode you just need to append
/dev/ to your URLs right after the domain name and
before the version name.

https://bigml.io/dev/

Development mode

You can visualize the resources created in development
mode using the switch
to change between production and development
mode in your account.
In development mode you do not consume any credits.
However, it has a few limitations:

Although you can create sources of any size, you cannot create
datasets that are bigger than 16 MB each.

The number of models in an
ensemble cannot be bigger than 10.

The term_limit, that is, the number of words
considered in the analysis of text fields, is limited to
100 per dataset.

Version

The BigML.io API is versioned using code names instead
of version numbers. The current version name is
"andromeda"
so URLs for this version can be written to require this version as follows:

https://bigml.io/andromeda/

Version

Specifying the version name is optional. If you omit the version name in your API requests, you will always get access
to the latest API version. While we will do our best to make future API versions backward compatible it is possible
that a future API release could cause your application to fail.

Specifying the API version in your HTTP calls will ensure that your application continues to function for the life cycle of the API release.

Authentication

All access to BigML.io needs to be authenticated. Authentication is performed by appending your username and
BigML API Key to the query string of every request.

To use BigML.io from the command line, we
recommend setting your username and
API key as environment variables. Using environment
variables is also an easy way to keep your credentials out of your source code.
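For example, in a bash-like shell (the API key below is a made-up placeholder):

export BIGML_USERNAME=alfred
export BIGML_API_KEY=79138a622755a2383660347f895444b1eb927730
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY"

$ Setting BIGML_AUTH as an environment variable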

Alternative Keys

To create an alternative key you need to use BigML's
web interface.
There you can define what resources an alternative key can access and what operations
(i.e., create, list, retrieve, update or delete) are allowed with it.
This is useful in scenarios where you want to grant different roles and
privileges to different applications. For example, an application for
the IT folks that collects data and creates sources in BigML, another
that is accessed by data scientists to create and evaluate models, and
a third that is used by the marketing folks to create predictions.

Summary of HTTP methods

CREATE

POST

Creates a new resource. Only certain fields are "postable". This method
is not idempotent. Each valid POST request results in a new, directly accessible resource.

RETRIEVE

GET

Retrieves either a specific resource or a list of resources.
This method is idempotent. The content type of the
resources is always "application/json; charset=utf-8".

UPDATE

PUT

Updates partial content of a resource. Only certain
fields are "putable". This method is idempotent.

DELETE

DELETE

Deletes a resource. This method is idempotent.

Resource Ids

All BigML resources are identified by a name composed of two parts separated by a slash "/". The first part is the type of the resource and the second part is a 24-char unique identifier. See the examples below:
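source/4f665b8103ce8920bb000006
dataset/4f66a80803ce8940c5000006
model/4f67c0ee03ce89c74a000006
prediction/4f6a014b03ce89584500000f

Example resource Ids (illustrative)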

Resource Ids are immediately assigned when a resource is
created and you can use them to retrieve, update or delete
the corresponding resource. The resource Ids are also used as the
input parameter for the creation of dependent resources.
You can also directly append a resource Id to the URL
https://bigml.com/dashboard to visualize it in the BigML web interface.

Requests

POST and PUT methods accept a body that contains or refers to the data
that you want to use to create or update the resource, respectively. The only
allowed content type so far is "application/json;charset=utf-8". The
only exception is source creation which requires the
"multipart/form-data" content type.

The following is an example of what a request header would look like for a dataset
creation request:
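POST /andromeda/dataset?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730 HTTP/1.1
Host: bigml.io
Content-Type: application/json;charset=utf-8

> A sketch of a dataset creation request header (credentials illustrative)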

A number of required and optional parameters exist for each
type of resource. You can see a detailed list for each resource in their
respective sections: sources, datasets, models, and
predictions, etc.

Responses

BigML.io uses conventional HTTP response codes to
indicate success or failure of every API request. For example, the
response below shows the headers you should expect after creating a new
resource.

All response content from BigML.io, including errors,
is JSON formatted.
For convenience's sake, each JSON response has a key named "code" that matches the HTTP response code.
For example, after successfully creating a new source BigML.io
will send back a JSON response like the one below, with the HTTP "201 Created" code:
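{
    "code": 201,
    "resource": "source/4f665b8103ce8920bb000006",
    "status": {
        "code": 1,
        "message": "The request has been queued and will be processed soon"
    }
}

> A sketch of a creation response (values illustrative; the actual response carries many more fields)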

In the body of the error response, the JSON formatted messages include a key named
"code" that matches the response code in the HTTP header. Additionally, the JSON includes a
"status" field. The status gives you more information
about the type of error. It includes a second, more specific error
code and a message that gives a human-readable explanation of what
caused the error. You can find the full list of error
codes in the Status Codes Section. For example, if you try to access
a resource that does not exist, you will get a response like the following one.
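{
    "code": 404,
    "status": {
        "code": -1201,
        "extra": ["Cannot find a resource matching the given id"],
        "message": "Id does not exist"
    }
}

> A sketch of an error response (the specific status code and message are illustrative)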

Resource Status

The creation of sources, datasets and models involves a computational task that can last
a few seconds or a few days depending on the size of the data. Consequently, the HTTP POST
request to create a resource will launch an asynchronous task and return immediately. In order
to know the completion status of this task, each resource has a status field that reports the
current state of the request. The possible states for a task are:

Code

Status

Semantics

0

Waiting

The resource is waiting for another resource to be
finished before BigML.io can start
processing it.

1

Queued

The task that is going to create the resource has been
accepted but has been queued because there are other
tasks using the system.

2

Started

The task to create the resource has been started and you should expect
partial results soon.

3

In Progress

The task is partially completed but still needs to do more computations.

4

Summarized

This status is specific to datasets. Although the dataset computation is complete, the
dataset needs to be serialized before it can be used to create a model.

5

Finished

The task is complete and the resource is final.

-1

Faulty

The task has failed. We either could not process the
task as you requested it or there is an internal issue.

-2

Unknown

The task has reached a state that we cannot verify at
this time. You should never see this status unless
BigML.io has suffered a major outage.

-3

Runnable

The task has reached a faulty state because of a network
or computer error, or because a dependent resource was not ready yet. If you repeat
the request it might work this time.

Libraries

A number of
libraries for many other languages have been developed by the growing BigML community:
C#,
Ruby,
PHP, and
iOS. If you are interested
in library support for a particular language let us know. Or if you are motivated to develop a library, we will give you all the support that we can.

Limits

BigML.io is currently limited to 1,000,000 (a million) requests per API
key per hour. Please email us if you have a specific use case that requires a higher rate limit.

Quick Start

Last Updated: Thursday, 2015-02-05 23:59

This page helps you quickly create your first
source, dataset,
model, and prediction.

Authentication

The following snippet will help you set up an environment variable
(i.e., BIGML_AUTH) to store your
username and API key and avoid typing
them again in the rest of examples. See this section for more details.
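export BIGML_USERNAME=alfred
export BIGML_API_KEY=79138a622755a2383660347f895444b1eb927730
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY"

$ Setting BIGML_AUTH (the API key above is a made-up placeholder)

Creating a Source

To create your first source, POST a local CSV file (for example, the classic iris.csv) to the source base URL:

curl https://bigml.io/source?$BIGML_AUTH -F file=@iris.csv

> Creating your first source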

BigML.io will respond with a JSON object
containing preliminary information about your new
source. As with all BigML.io
resources, the new source will have a resource
key with a unique resource/id. You can use
the source/id to retrieve the source or to create
new datasets.
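Creating a Dataset

To create a dataset, POST the source/id from the previous step (the id below is illustrative) to the dataset base URL:

curl https://bigml.io/dataset?$BIGML_AUTH \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"source": "source/4f665b8103ce8920bb000006"}'

> Creating your first dataset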

BigML.io will return
a dataset resource if the request
succeeds. BigML.io detects the type of each
field and will begin computing the histograms and summary
statistics. In the Datasets Section you can learn how to customize
the parsing rules and other options when converting a
datasource to a dataset.
Each field in your source is automatically assigned an id that you
can later use as a parameter in models and predictions.

Creating a Model

To create a model, POST
the dataset/id from the previous step to the
model base URL. By default BigML.io will include
all fields as predictors and will treat the last non-text field as
the objective. In the Models Section you will learn how to customize
the input fields or the objective field.
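A minimal sketch of such a request (the dataset id is illustrative):

curl https://bigml.io/model?$BIGML_AUTH \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"dataset": "dataset/4f66a80803ce8940c5000006"}'

> Creating your first model

Creating a Prediction

To create a prediction, POST the model/id together with some input_data (a map of field ids, or names, to values; all values below are illustrative) to the prediction base URL:

curl https://bigml.io/prediction?$BIGML_AUTH \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"model": "model/4f67c0ee03ce89c74a000006",
         "input_data": {"000000": 5.1, "000001": 3.5}}'

> Creating your first prediction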

If the request succeeds, BigML.io will return a new
prediction resource with its
own prediction/id. You can use this id to
retrieve the prediction later on. The predicted value is found in
the prediction object, keyed by the
corresponding objective field id.

Projects

A project is an abstract resource that helps you group
related BigML resources together.

A project must have a name and optionally a category, description, and multiple
tags to help you organize and retrieve your projects.

When you create a new source you can assign it to a pre-existing
project. All the subsequent resources created using
that source
will belong to the same project.

All the resources created within a
project will inherit the name, description, and tags
of the project unless you
change them when you create the resources or update them later.

When you select a project on your BigML dashboard,
you will only see the BigML resources related to
that project. Using your BigML dashboard you can also
create, update and delete projects (and all their associated
resources).

Retrieving a Project

Each project has a unique identifier in the form
project/id where id is a string of 24 alpha-numeric
characters that you can use to retrieve the project or as a
parameter when you create sources.

Retrieving a project with curl is
extremely easy.

curl https://bigml.io/project/54d9807bf0a5eafcd9000000?$BIGML_AUTH

$ Retrieving a project from the command line

You can also use your browser to visualize the project using the full
BigML.io URL.

Properties

Once a project has been successfully created it will
have the following properties.

Project Properties

Property

Type

Description

category (filterable, sortable, updatable)

Integer

One of the categories in the table above that help classify this
project according to the domain of application.

code

Integer

HTTP status code. This can be 201 upon the
project creation and
200 after it. Check the code that comes
with the status attribute to make sure that the
project creation
has been completed without errors.

updated

String

This is the date and time in which
the project was last updated with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

In addition to exact match, there are more filters that you can
use. To add one of these filters to your request you just need to
append one of the suffixes in the following table to the name of the attribute that you want to use as a filter.

Project Filters

Filter

Description

__gt (optional)

Greater than

__gte (optional)

Greater than or equal to

__contains (optional)

Case-sensitive word match

__icontains (optional)

Case-insensitive word match

__in (optional)

Case-sensitive list word match

__lt (optional)

Less than

__lte (optional)

Less than or equal to

Ordering Projects

Projects can also be ordered by any of the fields that we
labeled as sortable in the table describing a
project's properties above.

Ordering Parameters

Parameter

Type

Description

order_by (optional)

Name of a sortable field,
default is "-created"

Specifies the order of the projects to retrieve.
Must be one of the sortable fields. If you prefix the
field name with "-", results are returned in descending order.

For example, you can list your projects ordered by descending
name directly in your
browser, using your own username and API key, with the following link.

You can do the same thing from the command line using curl
as follows:

curl "https://bigml.io/project?$BIGML_AUTH;order_by=-name"

$ Listing projects ordered by name from the command line

Paginating Projects

There are two parameters that can help you retrieve just a portion of
your projects and paginate them.

Pagination Parameters

Parameter

Type

Description

limit (optional)

Integer, default is 20

Specifies the number of projects to retrieve. Must be less than
or equal to 100.

offset (optional)

Integer, default is 0

The order number from which the project
listing will start.

If a limit is given, no more than that many
projects will be returned, but possibly fewer, if the
request itself yields fewer projects.

For example, if you want to retrieve only the third and fourth latest
projects:

curl "https://bigml.io/project?$BIGML_AUTH;limit=2;offset=2"

$ Paginating projects from the command line

To paginate results, you need to start off with an
offset of zero, then increment it by whatever value
you use for the limit each time. So if you wanted to
return projects 1-10, then 11-20, then 21-30, etc., you would use
"limit=10;offset=0",
"limit=10;offset=10", and
"limit=10;offset=20", respectively.

Updating a Project

To update a project, you need to PUT an object containing the fields that you want to
update to the project's URL.
The content-type must always be: "application/json".

If the request succeeds, BigML.io will respond
with a 202 accepted code and with the new updated
project in the body of the message.

For example, to update a project with a new name, a
new category, a new description, and
new tags you can use curl like this:
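curl "https://bigml.io/project/54d9807bf0a5eafcd9000000?$BIGML_AUTH" \
    -X PUT \
    -H 'content-type: application/json' \
    -d '{"name": "fraud detection",
         "category": 3,
         "description": "Detecting fraud in bank transactions",
         "tags": ["fraud", "2015"]}'

$ Updating a project (all values illustrative)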

Sources

Last Updated: Thursday, 2015-02-05 23:59

A data source or source is the raw data that you want to use to create
a predictive model. A source is usually a (big) file in comma-separated
values (CSV) format. See the example below. Each row
represents an instance (or example). Each column in
the file represents a feature or field. The last column usually represents the class or objective field.
The file may have a first row, known as the header,
with a name for each field.

Creating a Source

Local Sources: Using a local file. You need
to post the file content in "multipart/form-data". The maximum size allowed is 64 GB per
file.

Remote Sources: Using a URL that points to
your data. The maximum size allowed is 64 GB or 4 TB if you use a
file stored in Amazon S3.

Inline Sources: Using some inline data. The
content type must be "application/json". The maximum size in this
case is limited to 8 MB per post.

Creating a source using a local file

To create a new source, you need to POST the file containing your data to the source
base URL. The file must be attached in the post as a file upload. The
Content-Type in your HTTP request must be "multipart/form-data"
according to RFC 2388.
This allows you to upload binary files in compressed format (.Z, .gz,
etc.), which will upload faster.

You can easily do this using curl. The option -F
(--form) lets curl emulate a filled-in form in which a user has pressed the submit button.
You need to prefix the file path name with "@".

curl https://bigml.io/andromeda/source?$BIGML_AUTH -F file=@iris.csv

> Creating a source

Creating a source using a URL

To create a new remote source you need a URL that points to the
data file that you want BigML to download for you.

You can easily do this using curl. The option -H
lets curl set the content type header, while the option -X sets the HTTP
method. You can send the URL within a JSON object as follows:
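curl https://bigml.io/andromeda/source?$BIGML_AUTH \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"remote": "https://static.bigml.com/csv/iris.csv"}'

> Creating a remote source (the URL is illustrative)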

If you do not specify a name, BigML.io will assign to
the source the same name as the file that you
uploaded. If you do not specify a source_parser,
BigML.io will do its best to automatically select the parsing
parameters for you. However, if you do specify it, BigML.io will not try to
second-guess you.

Term Analysis

A term_analysis object is composed of any combination of the
following properties.

Term Analysis Object

Property

Type

Description

enabled (optional)

Boolean, default is true

Whether text processing should be enabled or not. Example: true

use_stopwords (optional)

Boolean, default is true

Whether to use stop words or not. Example: true

stem_words (optional)

Boolean, default is true

Whether to stem words or not. Example: true

case_sensitive (optional)

Boolean, default is false

Whether text analysis should be case sensitive or not. Example: true

language (optional)

String, default is "en"

The default language of text fields. Example: "es"

token_mode (optional)

String, default is "all"

The tokenization strategy: tokens_only,
full_terms_only, or all. Example:
"tokens_only"

Text Processing

While the handling of numeric and categorical fields within a decision
tree framework is fairly straightforward, the handling of text fields
can be done in a number of different ways. BigML.io
takes a basic and reasonably robust approach, leveraging some basic NLP techniques along with a simple bag-of-words style method of feature generation.

At the data source level, BigML.io attempts to do
basic language detection. Initially the language can be English
("en"), Spanish
("es"), Dutch ("nl"), or
"none" if no language is detected. In the near
future, BigML.io will support many more languages.

For text fields, BigML.io adds
potentially five keys to the detected fields, all of which are placed
in a map under term_analysis.

The first is language, which is mapped to the detected language.

There are also three boolean keys, case_sensitive,
use_stopwords, and stem_words. The
case_sensitive key is false by default.
use_stopwords should be true if we should include
stopwords in the vocabulary for the detected field during text
summarization. stem_words should be true if
BigML.io should perform word stemming on this field,
which maps forms of the same term to the same key when summarizing or
generating models. By default, use_stopwords is false
and stem_words is true for languages other than "none" and they are not present otherwise.

Finally, token_mode determines the tokenization
strategy. It may be set to tokens_only,
full_terms_only, or all. When set to
tokens_only, individual words are used as terms.
For example, "ML for all" becomes ["ML", "for", "all"]. However, when
full_terms_only is selected, then the entire field is
treated as a single term as long as it is shorter than 256 characters.
In this case "ML for all" stays ["ML for all"]. If all
is selected, then both full terms and tokenized terms are used. In this
case ["ML for all"] becomes ["ML", "for", "all", "ML for all"]. The
default for token_mode is all.

There are a few details to note:

If full_terms_only is selected, then no stemming will occur even
if stem_words is true.

Also, when either all or tokens_only is selected, a term must
appear at least twice to be selected for the tag cloud. However,
full_terms_only lowers this limit to a single occurrence.

Finally, if the language is "none", or if a
language does not have an algorithm available for stopword removal or
stemming, the use_stopwords and
stem_words keys will have no effect.

Retrieving a Source

Each source has a unique identifier in the form
source/id where id is a string of 24 alpha-numeric
characters that you can use to retrieve the source or as a
parameter to create datasets.
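Retrieving a source with curl is just as easy (the id below is illustrative):

curl https://bigml.io/andromeda/source/4f665b8103ce8920bb000006?$BIGML_AUTH

$ Retrieving a source from the command line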

You can also use your browser to visualize the source using the full
BigML.io URL or pasting the source/id
into the BigML.com dashboard.

Properties

Once a source has been successfully created it will
have the following properties.

Source Properties

Property

Type

Description

category (filterable, sortable, updatable)

Integer

One of the categories in the table above that help classify this
resource according to the domain of application.

code

Integer

HTTP status code. This can be 201 upon the source creation and
200 after it. Check the code that comes
with the status attribute to make sure that the source creation
has been completed without errors.

content_type (filterable, sortable)

String

This is the MIME
content-type as provided by your HTTP client. The
content-type can help BigML.io to better parse your
file. For example, if you use curl, you can
alter it using the type option "-F file=@iris.csv;type=text/csv".

created

String

This is the date and time in which
the source was created with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

credits (filterable, sortable)

Float

The number of credits it cost you to create this
source.

description (updatable)

String

A text describing the source. It can contain restricted markdown
to decorate the text.

dev

Boolean

True when the
source has been created in development mode.

fields (updatable)

Object

A dictionary with an entry per field (column) in your
data. Each entry includes the column number, the name
of the field, the type of the field, a specific
locale if it differs from the source's one, and
specific missing tokens if they differ from the source's
ones. This property is very handy to update sources
according to your own parsing preferences. Example:

{"000001": {"name": "a new name"}}

updated

String

This is the date and time in which
the source was last updated with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

Source Fields

The property fields is a dictionary keyed by an
auto-generated id per each field in the
source. Each field has as a value an object with the
following properties:

Field Object Properties

Property

Type

Description

column_number

Integer

Specifies the column number in the original file.

description

String

An even longer description for the field.

locale (optional)

String, default is the source's locale

The specific locale for
this field. Example: "en-US".

missing_tokens (optional)

Array, default is the source's missing tokens

The specific missing
tokens for this field. Example: ["NA", "N/A"].

label

String

A longer and more descriptive name of the field.
Example: "Sepal length in cm".

name

String

Name of the column if provided in the header of the source or a name generated automatically otherwise.
Example: "Sepal length".

optype

String

Specifies
the type of the field. It can be
numeric, categorical, or
text.
Example: "text".

For fields classified with optype
"text", the default values specified
in the term_analysis at the top-level of the source are used.

Flags not provided in term_analysis take their default
value, i.e., false for booleans and none for language.

Besides these global default values, which apply to all text fields
(and potential text fields, such as categorical ones that might
overflow to text during dataset creation), it's possible to specify
term_analysis flags on a per-field basis.

Source Status

Before a source is successfully created,
BigML.io makes sure that it has been uploaded in an understandable
format, that the data that it contains is parseable, and that the types for each column
in the data can be inferred successfully. The source goes through a number of
states until all these analyses are completed. Through the status field in the
source you can determine when the source has been
fully processed and is ready to be used to create a dataset. These are the
fields that a source's status has:

Source Status Object Properties

Property

Type

Description

code

Integer

A status code that reflects the status of the source creation. It can be any of the status codes explained here.

Filtering and Paginating Fields from a Source

A source might be composed of hundreds or even thousands of
fields. Thus when retrieving a source, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):

Parameters to filter fields from a source

Parameter

Type

Description

fields (optional)

Comma-separated list

A comma-separated list of field IDs to retrieve.

iprefix (optional)

String

A case-insensitive string to retrieve fields whose names
start with the given prefix. It is possible to specify
more than one iprefix by repeating the parameter, in which case the union of the results is returned.

full (optional)

Boolean

If false, no information about fields is returned.

limit (optional)

Integer

Maximum number of fields that you will get in
the fields field.

offset (optional)

Integer

How far off from the first field in your
source the first
field in the fields field is.

prefix (optional)

String

A case-sensitive string to retrieve fields whose names start with the given prefix. It is possible to specify more than one prefix by repeating the parameter, in which case the union of the results is returned.

order_by (optional)

String

Sorting criteria; possible values are "type" and
"name", and their negated values ("-type", "-name") to specify a descending order.

Since the fields field is a map and therefore not
ordered, the returned fields contain an additional key, "order," whose
integer (increasing) value gives you their ordering. In all other
respects, the source is the same as the one you would get without any
filtering parameter above.
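For instance, to retrieve just five fields, ordered by name and starting at the tenth one (the source id is illustrative):

curl "https://bigml.io/andromeda/source/4f665b8103ce8920bb000006?$BIGML_AUTH;limit=5;offset=10;order_by=name"

$ Filtering and paginating the fields of a source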

The fields_meta field can help you paginate fields. Its
structure is as follows:

Fields Meta Object Properties

Property

Type

Description

count

Integer

Specifies
the current number of fields in the resource.

limit

Integer

The
maximum number of fields that will be returned in the
resource.

offset

Integer

The
current offset in the pagination of fields.

total

Integer

The
total number of fields in the resource.

Note that paginating fields might only be worthwhile if you are going to
deal with really wide sources (i.e., more than 200 fields).

Listing Sources

To list all the sources you can use the
source base URL. By default, only the 20 most recent
sources will be returned. You can see below how to change this number
using the limit parameter.

In addition to exact match, there are four more filters that you can
use. To add one of these filters to your request you just need to
append one of the suffixes in the following table to the name of the attribute that you want to use as a filter.

You can do the same thing from the command line using curl
as follows:

curl "https://bigml.io/andromeda/source?$BIGML_AUTH;order_by=-size"

$ Listing sources ordered by size from the command line

Paginating Sources

There are two parameters that can help you retrieve just a portion of
your sources and paginate them.

Pagination Parameters

Parameter

Type

Description

limit (optional)

Integer, default is 20

Specifies the number of sources to retrieve. Must be less than or equal to 200.

offset (optional)

Integer, default is 0

The order number from which the source
listing will start.

If a limit is given, no more than that many
sources will be returned, but possibly fewer, if the
request itself yields fewer sources.

For example, if you want to retrieve only the third and fourth latest
sources:

curl "https://bigml.io/andromeda/source?$BIGML_AUTH;limit=2;offset=2"

$ Paginating sources from the command line

To paginate results, you need to start off with an
offset of zero, then increment it by whatever value
you use for the limit each time. So if you wanted to
return sources 1-10, then 11-20, then 21-30, etc., you would use
"limit=10;offset=0",
"limit=10;offset=10", and
"limit=10;offset=20", respectively.

Updating a Source

To update a source, you need to PUT an object containing the fields that you want to
update to the source's URL.
The content-type must always be: "application/json".
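For example, to rename a source (the id below is illustrative):

curl "https://bigml.io/andromeda/source/4f665b8103ce8920bb000006?$BIGML_AUTH" \
    -X PUT \
    -H 'content-type: application/json' \
    -d '{"name": "a new name"}'

$ Updating a source

Deleting a Source

To delete a source, you need to issue an HTTP DELETE request to the source/id to be deleted:

curl -X DELETE "https://bigml.io/andromeda/source/4f665b8103ce8920bb000006?$BIGML_AUTH"

$ Deleting a source from the command line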

If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return HTTP 204 responses with no body.

HTTP/1.1 204 NO CONTENT
Content-Length: 0

< Successful response

Once you delete a source, it is permanently deleted. That is, a delete request cannot be undone.
However, if you try to delete a source that is being used to create a
dataset, then BigML.io will not accept the request and
will respond with the following error.

Datasets

A dataset is a structured version of a
source where each field has been processed and
serialized according to its type. A field can be numeric,
categorical, or text.

When you create a new dataset, BigML.io will
automatically compute a histogram for each numeric or categorical
field. For each numeric field, the minimum,
the maximum, the median, the sum, and the sum of squares are also computed.
BigML.io does not process text fields
yet. For each field, you can also get the number of errors
that were encountered processing it. Errors are mostly missing values or values that do not match the
type assigned to the column.

Creating a Dataset

To create a new dataset, you need to POST to the
dataset base URL an object containing at least
the source/id that you want to use to create the
dataset. The content-type must always be
"application/json".

Arguments

By default, the dataset will include all fields in the corresponding
source; but this behaviour can be fine-tuned via the
input_fields and
excluded_fields lists of identifiers. The former
specifies the list of fields to be included in the dataset, and
defaults to all fields in the source when empty. To specify excluded
fields, you can use excluded_fields: identifiers in
that list are removed from the list constructed using
input_fields.

See below the full list of arguments that you can POST to create a
dataset.

Dataset Creation Arguments

Argument

Type

Description

category (optional)

Integer, default is the category of the source

The category that
best describes the dataset. See the category table
for the complete list of categories. Example: 1

description (optional)

String

A description of the dataset of up to 8192 characters.
Example: "This is a description of my new dataset"

excluded_fields (optional)

Array, default is [], an empty list. None of the fields in the
source is excluded.

Specifies the fields that won't
be included in the dataset. Example:

["000000", "000002"]

fields (optional)

Object, default is {}, an empty dictionary. That is, no
names, labels or descriptions are changed.

Updates the names, labels, and descriptions of the fields in the
dataset with respect to the original names in the
source. Include an entry keyed with the field id generated in the source
for each field whose name you want updated. Example:

{"000001": {"name": "a new name"}}

json_filter (optional)

Array

A JSON list representing a filter over the rows in the
datasource. The first element is an operator and the rest
of the elements its arguments. See the section below for
more details. Example: [">", 3.14, ["field", "000002"]]

name (optional)

String, default is the source's name

The name you want to give
to the new dataset. Example: "my new dataset"

objective_field (optional)

Object, default is the last non-autogenerated field in the dataset.

Specifies the default objective field. Example:

{"id": "000003"}

size (optional)

Integer, default is the source's size

The number of bytes from
the source that you want to use. Example: 1073741824

source (required)

String

A valid source/id. Example: source/4f665b8103ce8920bb000006

tags (optional)

Array of Strings

A list of strings that help classify and
index your dataset. Example: ["best customers", "2012"]

term_limit (optional)

Integer

The maximum total number of terms to be used in text analysis. Example: 500

Fields

Each entry in the fields argument is composed of any
combination of the following properties.

Field Object Properties

Property

Type

Description

name (optional)

String, default is the source's field name

The new name for this
field. Example: "my new name".

You can also use curl to customize a new
dataset with a name, a different size, and only a
few fields from the original source. For example,
to create a new dataset named "my dataset", with only 500 bytes, and
with only two fields:
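curl https://bigml.io/andromeda/dataset?$BIGML_AUTH \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"source": "source/4f665b8103ce8920bb000006",
         "name": "my dataset",
         "size": 500,
         "input_fields": ["000001", "000003"]}'

> Creating a customized dataset (ids illustrative)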

If you do not specify a name, BigML.io will assign
the new dataset the source's name. If
you do not specify a size, BigML.io
will use the source's full size. If you do not specify any fields,
BigML.io will include all the fields in the
source with their corresponding names.

Filtering Rows

The dataset creation request can include an argument,
json_filter, specifying a predicate that the input
rows from the source have to satisfy in order to be included in the dataset. This predicate is specified as a (possibly nested) JSON list whose first element is an operator and the rest of the elements its arguments. Here's an example of a filter specification to choose only those rows whose field "000002" is less than 3.14:

[">", 3.14, ["field", "000002"]]

Filter Example

As you see, the list starts with the operator we want to use, ">",
followed by its operands: the number 3.14, and the value of the field
with identifier "000002", which is denoted by the operator "field". As
another example, this filter:

["=", ["field", "000002"], ["field", "000003"], ["field", "000004"]]

Filter Example

selects rows for which the three fields with identifiers "000002",
"000003" and "000004" have identical values. Note how you're not
limited to two arguments. It's also worth noting that for a filter like
that one to be accepted, all three fields must have the same optype
(e.g. numeric), otherwise they cannot be compared.

The field operator also accepts as arguments the field's name (as a string)
or the row column (as an integer). For instance, if field "000002" had column
number 12, and field "000003" was named "Stock prize", our previous query could
have been written:

["=", ["field", 12], ["field", "Stock prize"], ["field", "000004"]]

Filter Example

If the name is not unique, the first matching field found is picked,
consistently over the whole filter expression. If you have duplicated field
names, the best thing to do is to use either column numbers or field
identifiers in your filters, to avoid ambiguities.

Besides a field's value, one can also ask whether it's missing or not. For
instance, to include only those rows for which field "000002" contains a missing
token, you would use:

["missing", "000002"]

Filter Example

and to select only those for which neither "000002" nor "000003" are missing:
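["and", ["not", ["missing", "000002"]], ["not", ["missing", "000003"]]]

Filter Example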

Here, as with field, missing's argument can also be a column number or a name. This example introduces the logical operators "and" and "not", which, together with "or", can take any number of arguments.

As with logical and relational operators, arithmetic operators accept more than two arguments.

These are all the accepted operators:

=, !=, >,
>=, <, <=,
and, or, not,
field, missing, +,
-, *, /.

To be accepted by the API, the filter must evaluate to a boolean value and contain at least one operator. So, for instance, a constant or an expression evaluating to a number will be rejected.

Since writing and reading the above expressions in pure JSON might be a bit
involved, you can also send your query to the server as a string representing a
Lisp s-expression, using the argument lisp_filter, e.g.:
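"lisp_filter": "(> 3.14 (field 2))"

Lisp filter example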

You can also use your browser to visualize the dataset using the full
BigML.io URL or pasting the dataset/id
into the BigML.com dashboard.

Properties

Once a dataset has been successfully created it will
have all the following properties.

Dataset Properties

Property

Type

Description

category (filterable, sortable, updatable)

Integer

One
of the categories in the table of
categories that help classify this
resource according to the domain of application.

code

Integer

HTTP status code. This can be 201 upon the
dataset creation and
200 after it. Check the code that comes
with the status attribute to make sure that the
dataset creation
has been completed without errors.

created

String

This is the date and time in which
the dataset was created with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

credits (filterable, sortable)

Float

The
number of credits it cost you to create this
dataset.

description (updatable)

String

A
text describing the dataset. It can contain restricted markdown
to decorate the text.

dev

Boolean

True when the
dataset has been created in development mode.

excluded_fields

Array

The list of field ids that were
excluded when building the dataset.

field_types

Object

A
dictionary that informs about the number of fields of
each type. It has an entry per each field type
(categorical,
datetime, numeric,
and text), an entry for
preferred fields and an entry for the
total number of fields.

fields

Object

A
dictionary with an entry per field (column) in your
data. Each entry includes the column number, the name
of the field, the type of the field, and the
summary.

fields_meta

Object

A
dictionary with meta information about the fields
dictionary. It specifies the
total number of
fields, the current offset, and
limit, and the number
of fields (count) returned.

input_fields

Array

The list of input fields' ids used to
create the dataset.

locale

String

The
source's locale.

name (filterable, sortable, updatable)

String

The name of the dataset as you provided it,
or based on the name of the source by default.

number_of_batchpredictions (filterable, sortable)

Integer

The
current number of batch predictions that use this
dataset.

number_of_ensembles (filterable, sortable)

Integer

The
current number of ensembles that use this
dataset.

number_of_evaluations (filterable, sortable)

Integer

The
current number of evaluations that use this
dataset.

number_of_models (filterable, sortable)

Integer

The
current number of models that use this
dataset.

number_of_predictions (filterable, sortable)

Integer

The
current number of predictions that use this
dataset.

objective_field (updatable)

Object

The default objective field.

out_of_bag (filterable, sortable)

Boolean

Whether the out-of-bag instances were used to clone the
dataset instead of the sampled instances.

price (filterable, sortable, updatable)

Float

The price other users must pay to clone your dataset.

private (filterable, sortable, updatable)

Boolean

Whether the dataset is public or not. In a future version, you will be able to share
datasets with other coworkers or, if desired, make them publicly available.

project (filterable, sortable)

String

The project/id the resource belongs to.

range

Array

The
range of instances used to clone the
dataset.

replacement (filterable, sortable)

Boolean

Whether the instances sampled to clone the
dataset were selected using replacement or not.

resource

String

The
dataset/id.

rows (filterable, sortable)

Integer

The
total number of rows in the dataset.

sample_rate (filterable, sortable)

Float

The sample rate used to select instances
from the dataset.

seed (filterable, sortable)

String

The string that was used to generate the sample.

shared (filterable, sortable)

Boolean

Whether the dataset is shared using a private link or not.

shared_hash

String

The hash that gives access to this
dataset if it has been shared using a private link.

sharing_key

String

The alternative key that gives read access to
this dataset.

size (filterable, sortable)

Integer

The
number of bytes of the source that
were used to create this dataset.

source (filterable, sortable)

String

The
source/id that was used to build the
dataset.

source_status (filterable, sortable)

Boolean

Whether the
source is still available or has
been deleted.

status

Object

A
description of the status of the
dataset. It includes a code, a message,
and some extra information. See the table below.

updated

String

This is the date and time in which
the dataset was last updated with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

Dataset Fields

The property fields is a dictionary keyed by
each field's id in the source. Each
field's id has as a value an object with the
following properties:

Field Object Properties

Property

Type

Description

column_number

Integer

Specifies the column number in the original file.

datatype

String

Specifies
the storage type of the field.

description

String

An even longer text description for the field.

locale (optional)

String

Specifies
the locale for this field if it is different
from the dataset's locale.

label

String

A longer and more descriptive name of the field.

name

String

Name of the field. It will be the same as in the
source if it has not been specified here.

preferred

Boolean

Whether BigML thinks that this field will be useful when creating a
model or not.

optype

String

Specifies
the operational type of the field. It can be
numeric, categorical, or
text.

summary

Object

Numeric, categorical, or text summary of the field.

Numeric Summary

Numeric summaries come with all the fields described below. If
the number of unique values in the data is greater than 32,
then 'bins' will be used for the summary. If not, 'counts'
will be available.

Numeric Summary Object Properties

Property

Type

Description

counts

Array

An array of pairs where the first element of each pair is one of
the unique values found in the field and the second element is the count. Only
available when the number of distinct values is less than or
equal to 32.

maximum

Number

The
maximum value found in this field.

median

Number

The
approximate median of all the values in this field.

minimum

Number

The
minimum value found in this field.

missing_count

Integer

Number
of instances missing this field.

population

Integer

The
number of instances containing data for this field.

bins

Array

An array that represents an approximate histogram
of the distribution. It consists of value pairs, where
the first value is the mean of a histogram bin and the
second value is the bin population. 'bins' is only
available when the number of distinct values is greater
than 32. For more information, see
our blog
post or
read this
paper.

sum

String

Sum of all
values for this field (for mean calculation).

sum_squares

String

Sum of squared values (for variance calculation).

Categorical Summary

Categorical summaries give you a count per each category and missing
count in case any of the instances contain missing values.

Categorical Summary Object Properties

Property

Type

Description

counts

Array

An array of pairs where the first
element of each pair is one of the unique categories found in the field and the second
element is the count for that category.

missing_count

Integer

Number
of instances missing this field.

Text Summary

Text summaries give statistics about the vocabulary of a text field,
and the number of instances containing missing values.

Text Summary Object Properties

Property

Type

Description

tag_cloud

Array

An array of two-element arrays, where the first element of each is a
term in the field's vocabulary, and the second is the number of
instances in which that term appears.

term_forms

Object

A map
keyed to the field's vocabulary terms. The value of each entry is an
array of alternate forms of the given term, determined by word
stemming.

missing_count

Integer

Number
of instances missing this field.

Dataset Status

Before a dataset is successfully created,
BigML.io makes sure that it has been uploaded in an understandable
format, that the data that it contains is parseable, and that the types for each column
in the data can be inferred successfully. The dataset goes through a number of
states until all these analyses are completed. Through the status field in the
dataset you can determine when the dataset has been
fully processed and is ready to be used to create a
model. These are the fields that a dataset's status has:

Dataset Status Properties

Property

Type

Description

bytes

Integer

Number
of bytes processed so far.

code

Integer

A status
code that reflects the status of the dataset creation. It can be any of the explained here.

elapsed

Integer

Number
of milliseconds that BigML.io took
to process the dataset.

field_errors

Object

Information about ill-formatted fields, including the total number of format
errors for the field and a sample of the ill-formatted tokens.
Example (structure illustrative):

{"000000": {"total": 1, "sample": ["one"]}}

Filtering and Paginating Fields from a Dataset

A dataset might be composed of hundreds or even thousands of
fields. Thus when retrieving a dataset, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):

Parameters to filter fields from a dataset

Parameter

Type

Description

fields (optional)

Comma-separated list

A comma-separated list of field IDs to retrieve.

iprefix (optional)

String

A case-insensitive string to retrieve fields whose names
start with the given prefix. It is possible to specify
more than one iprefix by repeating the parameter, in which case the union of the results is returned.

full (optional)

Boolean

If false, no information about fields is returned.

limit (optional)

Integer

Maximum number of fields that you will get in
the fields field.

offset (optional)

Integer

How far off from the first field in your
dataset the first
field in the fields field is.

prefix (optional)

String

A case-sensitive string to retrieve fields whose names start with the given prefix. It is possible to specify more than one prefix by repeating the parameter, in which case the union of the results is returned.

Since the fields field is a map and therefore not
ordered, the returned fields contain an additional key, "order," whose
integer (increasing) value gives you their ordering. In all other
respects, the dataset is the same as the one you would get without any
filtering parameter above.

When using prefixes, BigML first filters by them and then applies the rest of the parameters to the result. All the other parameters can be used together with prefixes, except for "fields".

The fields_meta field can help you paginate fields. Its
structure is as follows:

Fields Meta Object Properties

Property

Type

Description

count

Integer

Specifies
the current number of fields in the resource.

limit

Integer

The
maximum number of fields that will be returned in the
resource.

offset

Integer

The
current offset in the pagination of fields.

total

Integer

The
total number of fields in the resource.

Note that paginating fields might only be worthwhile if you are going to
deal with really wide datasets (i.e., more than 200 fields).

Listing Datasets

To list all the datasets you can use the
dataset base URL. By default, only the 20 most recent
datasets will be returned. You can see below how to change this number
using the limit parameter.

In addition to exact match, there are four more filters that you can
use. To add one of these filters to your request you just need to
append one of the suffixes in the following table to the name of the attribute that you want to use as a filter.

To paginate results, you need to start off with an
offset of zero, then increment it by whatever value
you use for the limit each time. So if you wanted to
return datasets 1-10, then 11-20, then 21-30, etc., you would use
"limit=10;offset=0",
"limit=10;offset=10", and
"limit=10;offset=20", respectively.

Updating a Dataset

To update a dataset, you need to PUT an object containing the fields that you want to
update to the dataset's URL.
The content-type must always be: "application/json".
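For example, to rename a dataset (the id below is illustrative):

curl "https://bigml.io/andromeda/dataset/4f66a80803ce8940c5000006?$BIGML_AUTH" \
    -X PUT \
    -H 'content-type: application/json' \
    -d '{"name": "a new name"}'

$ Updating a dataset

Deleting a Dataset

To delete a dataset, you need to issue an HTTP DELETE request to the dataset/id to be deleted:

curl -X DELETE "https://bigml.io/andromeda/dataset/4f66a80803ce8940c5000006?$BIGML_AUTH"

$ Deleting a dataset from the command line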

If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return HTTP 204 responses with no body.

HTTP/1.1 204 NO CONTENT
Content-Length: 0

< Successful response

Once you delete a dataset, it is permanently deleted. That is, a delete request cannot be undone.
However, if you try to delete a dataset that is being used to create a
model, then BigML.io will not accept the request and
will respond with the following error.

Transformations

Last Updated: Thursday, 2014-09-18 23:59

Once you have created a dataset, BigML.io allows you to derive new datasets from it by
sampling, filtering, adding new fields, or concatenating it to other
datasets. We apply the term dataset transformations (or just
transformations for short) to this set of operations that create new,
modified versions of your original dataset.

We use the term:

Cloning for the general operation of generating a new
dataset.

Sampling when the original dataset is sampled.

Filtering when the original dataset is filtered.

Extending when new fields are generated.

Merging when a multi-dataset is created.

Keep in mind that you can sample, filter and extend a dataset all at
once in only one API request. In this page, we'll describe the first
four operations above and will cover multi-datasets in the following page.

So let's start with the most basic transformation: cloning a dataset.

Cloning a Dataset

To clone a dataset you just need to use the origin_dataset argument
to send the dataset/id of the dataset that you want to
clone. For example:
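curl https://bigml.io/andromeda/dataset?$BIGML_AUTH \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"origin_dataset": "dataset/4f66a80803ce8940c5000006"}'

> Cloning a dataset (the id is illustrative)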

You can also give the new dataset a different
category, name,
description, and
tags from the original one. Also, when cloning a dataset,
you can modify the names, labels, descriptions, and preferred flags of
its fields using a fields argument with entries for
those fields you want to change. See a description of all the arguments below.

Dataset Cloning Arguments

Argument

Type

Description

category (optional)

Integer

The category that
best describes the dataset. See the category table
for the complete list of categories. Example:

"category": 1

description (optional)

String

A description of the dataset of up to 8192 characters.
Example:

"description": "This is a description of my new dataset"

fields (optional)

Object

Updates the names, labels, and descriptions of the fields in
the new dataset. Include an entry keyed with the field id of
the original dataset for each field that will be updated. Example:

"fields": {"000001": {"name": "a new name"}}

As illustrated in the following example, it's possible to provide a
list of input fields, selecting the fields from the filtered input
dataset that will be created. Filtering happens before field picking and, therefore, the row filter can use fields that won't end up in the cloned dataset.

The following table shows the complete list of arguments that you can
use to filter a dataset.

Dataset Filtering Arguments

Argument

Type

Description

excluded_fields (optional)

Array

Specifies the fields that won't
be included in the new dataset. Example:

"excluded_fields": ["000000", "000002"]

input_fields (optional)

Array

Specifies the fields to be included in the dataset. Example:

"input_fields": ["000001", "000003"]

json_filter (optional)

Array

A
JSON list representing a filter over the rows in the
origin dataset. The first element is an operator and the rest
of the elements its arguments. See the Section on filtering sources for more details. Example:

"json_filter": [">", 3.14, ["field", "000002"]]

lisp_filter (optional)

String

A
string representing a Lisp s-expression to filter rows from
the origin dataset. Example:

"lisp_filter": "(> 3.14 (field 2))"

range (optional)

Array

The range of
successive instances to create the new dataset. Example:

"range": [100, 200]

Extending a Dataset with New Fields

You can clone a dataset and extend it with brand new fields using the
new_fields argument. Each new field is created using
a Flatline expression and optionally a
name, label, and
description.

A Flatline expression is a lisp-like expression that
allows you to reference and process columns and rows of the
origin dataset. See the full Flatline reference here.
Let's see a first example that clones a dataset and adds a new field named "Celsius"
to it, using an expression that converts the values from the "Fahrenheit"
field to Celsius.
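A sketch of such a request (the dataset id is illustrative):

curl https://bigml.io/andromeda/dataset?$BIGML_AUTH \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"origin_dataset": "dataset/4f66a80803ce8940c5000006",
         "new_fields": [{"field": "(/ (* 5 (- (f \"Fahrenheit\") 32)) 9)",
                         "name": "Celsius"}]}'

> Extending a dataset with a new field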

When you clone a dataset adding new fields, by default the rest of the fields in the origin
dataset are added to the new dataset. If you only want to keep the new
fields, you can set the all_fields argument to
false. You can also use the argument
all_but to exclude fields that you do not want in
the new dataset.
A new field can actually generate multiple fields. In that case,
their names can be specified using the names argument.

Note also that field references can be built using either field or
f together with the field id, the field
name, or its column number.
In addition to horizontally selecting different fields in the same row, you
can keep the field fixed and select vertical windows of its values, via the
window and related operators. For example, the following request will
generate a new field using a sliding window of 7 values for the field named
"Fahrenheit" and will also generate two additional fields
named "Yesterday" and "Tomorrow" with the
previous and next values of the current row for field 0.
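A sketch of what that request might look like, assuming Flatline's window and shifted-field operators (the dataset/id and generated names are illustrative):

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/52bc7fc83c1920e4a3000012",
       "new_fields": [{"field": "(window \"Fahrenheit\" -3 3)",
                       "names": ["F-3", "F-2", "F-1", "F", "F+1", "F+2", "F+3"]},
                      {"field": "(f 0 -1)", "name": "Yesterday"},
                      {"field": "(f 0 1)", "name": "Tomorrow"}]}'

$ Generating windowed fields from the command line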

Filtering the New Fields Output

The generation of new fields works by traversing the input dataset row by row and applying the Flatline expression of each new field to each row in turn. The list of values generated from each input row that way constitutes an output row of the generated dataset.

It is possible to limit the number of input rows that the generator sees by
means of filters and/or sample specifications, for example:
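For instance, a request like this sketch (illustrative values) would generate the new field only over a filtered 50% sample of the input:

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/52bc7fc83c1920e4a3000012",
       "sample_rate": 0.5,
       "seed": "myseed",
       "lisp_filter": "(> (f \"Fahrenheit\") 32)",
       "new_fields": [{"field": "(/ (* 5 (- (f \"Fahrenheit\") 32)) 9)",
                       "name": "Celsius"}]}'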

And, as an additional convenience, it is also possible to specify either an
output_lisp_filter or an output_json_filter, that is, a Flatline row filter that will act on the generated rows instead of on the input data:
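A sketch reusing the "Celsius" generator above; here the filter acts on the generated field rather than on the input:

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/52bc7fc83c1920e4a3000012",
       "new_fields": [{"field": "(/ (* 5 (- (f \"Fahrenheit\") 32)) 9)",
                       "name": "Celsius"}],
       "output_lisp_filter": "(< (f \"Celsius\") 0)"}'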

You can also skip any number of rows in the input, starting the generation at
an offset given by row_offset, and traverse the input rows with
any step specified by row_step. For instance, the following request will generate a dataset whose rows are created by putting together every three consecutive values of the input field "Price":
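A sketch of such a request, assuming Flatline's list generator and shifted field references (the dataset/id is illustrative):

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/52bc7fc83c1920e4a3000012",
       "all_fields": false,
       "row_offset": 2,
       "row_step": 3,
       "new_fields": [{"field": "(list (f \"Price\" -2) (f \"Price\" -1) (f \"Price\"))",
                       "names": ["Price-2", "Price-1", "Price"]}]}'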

With the specification above, the new field will start with the third row in the input dataset, generate an output row (which uses values from the current input row as well as from the two previous ones), skip to the 6th input row, generate a new output, and so on and so forth.

Next, we'll list all the arguments that can be used to generate new fields.

New Fields Arguments

Argument

Type

Description

all_fieldsoptional

Boolean

Whether
all fields should be included in the new dataset or not. Example:

"all_fields": false

all_butoptional

Array

Specifies the fields that won't be included in the new dataset. Example:

"all_but": ["000001", "000003"]

new_fieldsoptional

Array

Specifies
the new fields to be included in the dataset. See the
table below for more details.Example:

"new_fields": [{"field": "(log10 (field "000001"))", "name": "log"}]

output_json_filteroptional

Array

A
JSON list representing a filter over the rows of the
dataset once the new fields have been generated. The first element is an operator and the rest
of the elements are its arguments. See the Section on filtering rows for
more details. Example:

"output_json_filter": [">", 3.14, ["field", "000002"]]

output_lisp_filteroptional

String

A
string representing a Lisp s-expression to filter rows
after the new fields have been generated. Example:

"output_lisp_filter": "(> 3.14 (field 2))"

row_offsetoptional

Integer

The initial number of rows to skip from the input dataset before starting to process
rows. Example:

"row_offset": 10

row_stepoptional

Integer

The step or increment used when traversing the input rows. Example:

"row_step": 3

Lisp and Json syntaxes

Flatline also has a JSON-like flavor with exactly the same semantics as
the Lisp-like version. Basically, a Flatline expression can easily be
translated to its JSON-like variant and vice versa by just changing parentheses to brackets,
turning symbols into quoted strings, and adding commas to separate each sub-expression.
For example, the following two expressions are the same for
BigML.io.

"(/ (* 5 (- (f Fahrenheit) 32)) 9)"

Lisp-like expression

["/", ["*", 5, ["-", ["f", "Fahrenheit"], 32]], 9]

Json-like expression

Final Remarks

A few important details that you should keep in mind:

Cloning a dataset also implies creating a copy of its serialized form,
so you get an asynchronous resource with a status that evolves from the
Summarized (4) to the Finished (5) state.

If you specify both sampling and filtering arguments, the former are applied first.

As with filters applied to datasources, dataset filters can use the full Flatline language to specify the boolean expression to use when sifting the input.

Flatline performs type inference, and will in general figure out the proper optype for the generated fields, which are subsequently summarized by the dataset creation process, reaching then their final datatype (just as with a regular dataset created from a datasource). In case you need to fine-tune Flatline's inferences, you can provide an optype (or optypes) key and value in the corresponding output field entry (together with generator and names), but in general this shouldn't be needed.

Please check the Flatline reference manual for a full description of
the language for field generation and the many pre-built functions it provides.

Multi-datasets

Last Updated: Thursday, 2014-09-18 23:59

BigML.io now allows you to create a new dataset by
merging multiple datasets. This functionality can be very useful when
you use multiple sources of data, and in online scenarios as well.
Imagine, for example, that you collect data on an hourly basis and want to create a dataset aggregating
the data collected over the whole day. You only need to send the newly
generated data to BigML each hour, create a source and a dataset for
each batch, and then merge all the individual datasets into one at the end of the day.

We usually call datasets created in this way
multi-datasets. BigML.io allows you
to aggregate up to 32 datasets in the same API request. You can merge
multi-datasets too, so you can basically grow a dataset as much as you
want.

To create a multi-dataset, you can
specify a list of dataset identifiers as input using the
origin_datasets argument. The example below will construct a
new dataset that is the concatenation of three other datasets.
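A sketch of such a request (the dataset/ids are illustrative):

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_datasets": ["dataset/52bc7fc83c1920e4a3000012",
                           "dataset/52bc7fd03c1920e4a3000016",
                           "dataset/52bc7fe43c1920e4a300001a"]}'

$ Creating a multi-dataset from the command line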

By convention, the first dataset defines the final dataset fields.
However, there can be cases where each dataset might come
from a different source and therefore have different field ids. In
these cases, you might need to use a fields_maps argument to match each field in a dataset
to the fields of the first dataset.
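A sketch of such a request, reconstructed from the description that follows (all but the first dataset/id are illustrative):

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_datasets": ["dataset/52bc7fc83c1920e4a3000012",
                           "dataset/52bc7fd03c1920e4a3000016",
                           "dataset/52bc7fe43c1920e4a300001a",
                           "dataset/52bc7ff13c1920e4a300001e"],
       "fields_maps": {
         "dataset/52bc7fd03c1920e4a3000016": {"000001": "000023",
                                              "000002": "000024",
                                              "000003": "00003a"},
         "dataset/52bc7fe43c1920e4a300001a": {"000001": "000023",
                                              "000002": "000004",
                                              "000003": "00000f"}}}'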

For instance, in the request above, we use four datasets as input. The
first one would define the final dataset fields. For instance, let's
say that the dataset "dataset/52bc7fc83c1920e4a3000012" in this example
has three fields with identifiers "000001",
"000002" and "000003". Those will be
the default resulting fields, together with their datatypes and so on.
Then we need to specify, for each of the remaining datasets in the
list, a mapping from the "standard" fields to those in the
corresponding dataset. In our example, we're saying that the fields of
the second dataset to be used during the concatenation are
"000023", "000024" and
"00003a", which correspond to the final fields having
them as keys. In the case of the third dataset, the fields used will be
"000023", "000004" and
"00000f". For the last one, since there's no entry in
fields_maps, we'll try to use the same identifiers as those of the first dataset.

The optypes of the paired fields should match, and for
the case of categorical fields, be a proper subset. If a final field
has optype text, however, all values are converted to strings.

BigML.io also allows you to sample each dataset
individually before merging it. You can specify the sample options for
each dataset using the arguments sample_rates,
replacements, seeds, and
out_of_bags. All are dictionaries that must be keyed
using the dataset/id of the dataset you want to specify parameters
for. The next request will create a multi-dataset sampling
the two input datasets differently.
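A sketch of such a request (the dataset/ids are illustrative):

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_datasets": ["dataset/52bc7fd03c1920e4a3000016",
                           "dataset/52bc7fe43c1920e4a300001a"],
       "sample_rates": {"dataset/52bc7fd03c1920e4a3000016": 0.5,
                        "dataset/52bc7fe43c1920e4a300001a": 0.8},
       "seeds": {"dataset/52bc7fd03c1920e4a3000016": "myseed"}}'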

The out_of_bags argument, for instance, is a dictionary keyed by dataset/id with
boolean values. Setting it to true for a dataset will use the
sequence of out-of-bag instances of that dataset instead of its sampled
instances. See the Section on
Sampling for more details. Example:

"out_of_bags": {"dataset/52bc7fc83c1920e4a3000012": true}

In addition to the arguments above you can use all the regular
arguments to clone, sample,
filter, and extend a dataset that
were explained in the Section on Transformations. Basically in those cases the flow that BigML.io follows to build a new dataset with multiple datasets is:

Sample each individual dataset according to the specifications
provided in the arguments sample_rates, replacements,
seeds, and out_of_bags.

Merge all the datasets together using the
fields_maps argument to match fields in case they
come from different sources (i.e., have different field ids).

Sample the merged dataset as in the case of regular
dataset sampling, using the arguments sample_rate, replacement,
seed, and out_of_bag.

Filter the sampled dataset using
input_fields, excluded_fields,
and either a json_filter or
lisp_filter.

Extend the dataset with new fields according to the specifications provided in
the new_fields argument.

Filter the output of the new fields using
either an output_json_filter or
output_lisp_filter.

You can also create a model using multiple datasets as input at once.
That is, without merging all the datasets together into a new dataset
first. The same applies to ensembles, clusters, anomaly detectors, and evaluations. All the multi-dataset arguments above can be used. You just need to use the
datasets argument instead of the regular
dataset.

See examples below to create a multi-dataset model, a
multi-dataset
ensemble, and a multi-dataset evaluation.
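As a sketch, a multi-dataset model request just replaces dataset with datasets (the dataset/ids are illustrative):

curl "https://bigml.io/model?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"datasets": ["dataset/52bc7fd03c1920e4a3000016",
                    "dataset/52bc7fe43c1920e4a300001a"]}'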

Samples

Last Updated: Thursday, 2015-02-05 23:59

A sample provides fast access to the raw data of a
dataset on an on-demand basis.

When a new sample is requested, a copy of the dataset
is stored in a special format in an in-memory cache. Multiple different samples
of the data can then be extracted using HTTPS parameterized requests,
specifying sample sizes and simple query string filters.

Samples are ephemeral. That is to say, a
sample will be available as long as GETs are requested
within periods smaller than a pre-established TTL (Time to Live). The
expiration timer of a sample is reset every time a new GET is received.

If requested, a sample can also perform linear regression and compute Pearson's and Spearman's correlations for either one numeric field against all other numeric fields or between two specific numeric fields.

Sample base URL

You can use the following base URL to create, retrieve, update, and
delete samples.

https://bigml.io/sample

Sample base URL

Authentication

All requests to manage your samples must use HTTPS
and be authenticated using your username and API key to convey
your identity. See this section for more details.

Creating a Sample

To create a new sample, you need to POST to the
sample base URL an object containing at least
the dataset/id that you want to use to create the
sample. The content-type must always be
"application/json".

You can easily create a new sample using
curl as follows. All you need is a valid
dataset/id and your authentication variable set up as
indicated above.
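A sketch of such a request (the dataset/id is illustrative):

curl "https://bigml.io/sample?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/54d9553bf0a5ea5fc0000016"}'

$ Creating a sample from the command line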

If you do not specify a name, BigML.io will assign to
the new sample the dataset's name.

Retrieving a Sample

Each sample has a unique identifier in the form
sample/id where id is a string of 24 alpha-numeric
characters that you can use to retrieve the sample.

Retrieving a sample with curl is
extremely easy.

curl https://bigml.io/sample/54da3889f0a5ea707b000000?$BIGML_AUTH

$ Retrieving a sample from the command line

You can also use your browser to visualize the sample using the full
BigML.io URL or pasting the sample/id
into the BigML.com dashboard.

Properties

Once a sample has been successfully created it will
have all the following properties.

Sample Properties

Property

Type

Description

categoryfilterable,
sortable, updatable

Integer

One
of the categories in the table of
categories that help classify this
resource according to the domain of application.

code

Integer

HTTP
status code. This can be 201 upon the
sample creation and
200 after it. Make sure that you check the code that comes
with the status attribute to make sure that the
sample creation
has been completed without errors and that it is still
available in the in-memory cache.

updatedfilterable, sortable

ISO-8601 Datetime

This is the date and time in which
the sample was last updated with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

A Sample Object has the following properties:

Sample Object Properties

Property

Type

Description

fieldsupdatable

Array

A
list with an element per field in the dataset used to
build the sample. Fields are paginated according to the
fields_meta attribute described above.
Each entry includes the column number in the original
dataset, the name of the field, the type of the field, and the
summary. See this Section for more details.

rows

Array of Arrays

A list of lists representing the rows of the
sample. Values in each list are ordered according to the fields list.

Sample Status

Through the status field in the
sample you can determine when the sample has been
fully processed and ready to be used. These are the fields that a
sample's status has:

Sample Status Properties

Property

Type

Description

code

Integer

A status
code that reflects the status of the sample creation. It can be any of the ones explained here.

Filtering and Paginating Fields from a Sample

A sample might be composed of hundreds or even thousands of
fields. Thus when retrieving a sample, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):

Parameters to filter fields from a sample

Parameter

Type

Description

fieldsoptional

Comma-separated
list

A comma-separated list of field IDs to retrieve.
Examples:

"fields=000000,000002"

iprefixoptional

String

A case-insensitive string to retrieve fields whose name
starts with the given prefix. It is possible to specify
more than one iprefix by repeating the parameter, in which case the union of the results is returned.
Examples:

"iprefix=INCOME"

fulloptional

Boolean

If
false, no information about the sample is returned.
Examples:

"full=false"

limitoptional

Integer

Maximum
number of fields that you will get in
the fields field.
Examples:

"limit=100"

offsetoptional

Integer

How
far off from the first field in your
sample is the first
field in the fields field.
Examples:

"offset=100"

prefixoptional

String

A case-sensitive string to retrieve fields whose name starts with the given prefix. It is possible to specify more than one prefix by repeating the parameter, in which case the union of the results is returned.
Examples:

"prefix=petal"

Since the fields field is a map and therefore not
ordered, the returned fields contain an additional key, "order," whose
integer (increasing) value gives you their ordering. In all other
respects, the sample is the same as the one you would get without any
filtering parameter above.

When using prefixes, BigML first filters by them and then applies the rest of the parameters to the result. All the other parameters can be used together with prefixes, except for "fields".

The fields_meta field can help you paginate fields. Its
structure is as follows:

Fields Meta Object Properties

Property

Type

Description

count

Integer

Specifies
the current number of fields in the resource.

limit

Integer

The
maximum number of fields that will be returned in the
resource.

offset

Integer

The
current offset in the pagination of fields.

total

Integer

The
total number of fields in the resource.

Note that paginating fields might only be worth it if you are going to
deal with really wide samples (i.e., more than 200 fields).

Filtering Rows from a Sample

A sample might be composed of thousands or even
millions of rows. Thus when retrieving a sample, it's
possible to specify that only a subset of rows be retrieved, by using
any combination of the following parameters in the query string
(unrecognized parameters are ignored). BigML will never return more
than 1000 rows in the same response. However, you can send additional
requests to get different random samples.

Parameters to filter rows from a sample

Parameter

Type

Description

field=valueoptional

List

With field the identifier of a numeric
field, returns rows for which the field equals that
value:
Examples:

"000000=2": field 0000000 equals 2

!field=valueoptional

List

With field the identifier of a numeric
field, returns rows for which the field doesn't equal that
value:
Examples:

"!000000=2": field 0000000 doesn't equal 2

field=valueoptional

String

With field the identifier of a
categorical field, select only those rows with the
value of that field one of the provided categories
(when the parameter is repeated).
Examples:

"000002=iris-setosa&000002=iris-versicolor"

!field=valueoptional

String

With field the identifier of a
categorical field, select only those rows with the
value of that field not one of the provided categories
(when the parameter is repeated).
Examples:

"!000002=iris-setosa&!000002=iris-versicolor"

field=optional

With field the identifier of a
field, select only those rows where
field is missing.
Examples:

"000002=&000002=iris-setosa": includes rows with either "iris-setosa" or missing.

!field=optional

With field the identifier of a
field, select only those rows where
field is not missing (i.e., it has a
definite value).
Examples:

"!000002="

field=from,tooptional

List

With field the identifier of a numeric
field and from,to optional numbers, specifies a filter for the numeric values of that field in the range [from, to].
One of the limits can be omitted.
Examples:

"000000=-10,10": field 000000 between -10 and 10, included

"000001=,3": field 000001 less or equal to 3

"00000a=-20,": field 00000a greater than or equal to -20

It is possible to specify whether the interval should include its boundaries with the usual [] or () brackets, as in:

"000000=[-10,10)": -10 <= field 000000 < 10

"000000=(-10,10]": -10 < field 000000 <= 10

"000000=(-10,10)": -10 < field 000000 < 10

"000000=[-10,10]": -10 <= field 000000 <= 10

!field=from,tooptional

List

With field the identifier of a numeric field, returns the values not in the specified interval. As with inclusion, it's possible to include or exclude the boundaries of the specified interval using square or round brackets:
Examples:

"!000000=[-10,10)": field 000000 < -10 or >= 10

"!000000=(-10,10]": field 000000 <= 10 or > 10

"!000000=(-10,10)": field 000000 <= 10 or >= 10

"!000000=[-10,10]": field 000000 < 10 or > 10

indexoptional

Boolean

When set to true, every returned row will have a
first extra value which is the absolute row number, i.e., a
unique row identifier. This can be useful, for instance, when you're
performing various GET requests and want to compute the union of the returned
regions.
Example: index=true

modeoptional

String

One amongst deterministic,
random, or linear. The way we
sample the resulting rows, if needed;
random means a
random sample, deterministic is also random but using
a fixed seed so that it's repeatable, and
linear means that BigML just returns
the first size rows after filtering; defaults to "deterministic".
Example: mode=random

occurrenceoptional

Boolean

When set to true, each row will have a value prepended that denotes
the number of times the row was present in the sample. You'll want this only
when unique is set to true; otherwise
all those extra values will be equal to 1. When
index is also set to true (see above), the multiplicity column is
added after the row index.
Example: occurrence=true

precisionoptional

Integer

The number of significant decimal numbers to keep in the returned values, for fields of type float or double. For instance, if you set precision=0, all returned numeric values will be truncated to their integral part.
Example: precision=2

rowsoptional

Integer

The total number of rows to be returned. If this is less than
the number resulting from the rest of the filter parameters,
the result will be sampled according to
mode.
Example: rows=300

row_fieldsoptional

List

You can provide a list of field identifiers to be present in
the sample's rows, specifying which ones you actually
want to see and in which order.
Example: row_fields=000000,000002

row_offsetoptional

Integer

Skip the given number of rows. Useful when paginating
over the sample in linear
mode.
Example: row_offset=300

row_order_byoptional

String

A field that causes the returned rows to be sorted by
the value of the given field, in ascending order or, when the - prefix is used,
in descending order. Example: row_order_by=-000000

seedoptional

String

When mode is random, you can specify your own seed in this parameter;
otherwise, we choose it at random, and return the value we've used in the body of
the response: that way you can make a random sampling deterministic if you
happen to like a particular result.
Example: seed=mysample

stat_fieldoptional

String

A field_id that corresponds to the identifier of a numeric
field will cause the answer to include the Pearson's and Spearman's correlations,
and linear regression terms of this field with all other numeric fields in the
sample. Those values will be returned in maps keyed by "other" field id and
named spearman_correlations,
pearson_correlations,
slopes, and
intercepts.
Example: stat_field=000000

stat_fieldsoptional

String

Two field_ids that correspond to the
identifiers of numeric fields will cause the answer to include the Pearson's and Spearman's correlations
and linear regression terms between the two fields. Those values will be returned in fields named spearman_correlation,
pearson_correlation,
slope, and
intercept.
Example: stat_fields=000000,000003

uniqueoptional

Boolean

When set to true, repeated rows will
be removed from the sample.
Example: unique=true
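Putting a few of these parameters together, a request like the following sketch (reusing the sample/id from the retrieval example above) would return 10 random, indexed rows with field 000000 restricted to [-10, 10):

curl "https://bigml.io/sample/54da3889f0a5ea707b000000?$BIGML_AUTH;rows=10;mode=random;index=true;000000=[-10,10)"

$ Filtering rows from a sample from the command line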

Listing Samples

To list all the samples you can use the
sample base URL. By default, only the 20 most recent
samples will be returned. You can see below how to change this number
using the limit parameter.

You can list your samples directly in your browser using your own username and API key with the following link.
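The link follows this pattern (Alfred's placeholder credentials shown; substitute your own username and API key):

https://bigml.io/sample?username=alfred;api_key=your-api-key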

In addition to exact match, there are more filters that you can
use. To add one of these filters to your request, you just need to
append one of the suffixes in the following table to the name of the attribute that you want to use as a filter.

You can do the same thing from the command line using curl
as follows:

curl https://bigml.io/sample?$BIGML_AUTH;order_by=-size

$ Listing samples ordered by size from the command line

Paginating Samples

There are two parameters that can help you retrieve just a portion of
your samples and paginate them.

Pagination Parameters

Parameter

Type

Description

limitoptional

Integer,
default is 20

Specifies
the number of samples to retrieve. Must be less than or equal to 200.

offsetoptional

Integer,
default is 0

The order number from which the sample listing will start.

If a limit is given, no more than that many
samples will be returned, but possibly fewer, if the
request itself yields fewer samples.

For example, if you want to retrieve only the third and fourth latest
samples:

curl "https://bigml.io/sample?$BIGML_AUTH;limit=2;offset=2"

$ Paginating samples from the command line

To paginate results, you need to start off with an
offset of zero, then increment it by whatever value
you use for the limit each time. So if you wanted to
return samples 1-10, then 11-20, then 21-30, etc., you would use
"limit=10;offset=0",
"limit=10;offset=10", and
"limit=10;offset=20", respectively.

Updating a Sample

To update a sample, you need to PUT an object
containing the fields that you want to update to the sample's URL.
The content-type must always be: "application/json".

If the request succeeds, BigML.io will respond
with a 202 accepted code and with the new updated
sample in the body of the message.

For example, to update a sample with a new name you can use curl like this:
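A sketch of such a request (reusing the sample/id from the retrieval example above):

curl "https://bigml.io/sample/54da3889f0a5ea707b000000?$BIGML_AUTH" \
  -X PUT \
  -H 'content-type: application/json' \
  -d '{"name": "my new sample"}'

$ Updating a sample's name from the command line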

Weight Field

A weight_field may be declared for either regression or classification models.
Any numeric field with no negative or missing values is valid as a weight field.
Each instance will be weighted individually according to the weight field's value.
See the toy dataset for credit card transactions below.
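A purely illustrative sketch of what such a toy dataset might look like (field names and values are hypothetical):

amount,day,time,merchant,fraudulent,weight
726,Monday,09:12,A,false,1
1250,Saturday,23:40,B,true,10
285,Sunday,12:03,A,false,1
942,Friday,17:25,C,true,10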

The last column represents the weight for each
transaction. We can use it as an input to create a model that will
weight each instance accordingly. In this case, fraudulent
transactions will weigh 10 times more than valid transactions in the
model building computations.

With Flatline, you can define arbitrarily complex functions to produce weight fields, making this the most flexible and powerful way to produce weighted models.

For instance, the request below would create a new dataset from the
example above, adding a new weight field that takes the previous weight and
multiplies it by two when the amount of the transaction is higher than
500.
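A sketch of such a request, assuming the field names from the toy example above (the dataset/id is illustrative):

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/52bc7fc83c1920e4a3000012",
       "new_fields": [{"field": "(if (> (f \"amount\") 500) (* 2 (f \"weight\")) (f \"weight\"))",
                       "name": "new weight"}]}'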

This method also works well when you query very large databases that can
produce the same row hundreds or thousands of times. You can just use one of
the rows and add the corresponding count as a weight field. This will reduce
the size of your data sources enormously.

Objective Weights

The second method for adding weights only applies to classification
models. A set of objective_weights may be defined, one per objective class. Each instance will be weighted according to its class weight.
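As a sketch, assuming a list of [class, weight] pairs and the classes of the toy example above:

"objective_weights": [["false", 1], ["true", 10]]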

Weights of zero are valid as long as there are some positive valued weights. If every weight does end up zero (this is possible with sampled datasets), then the resulting model will have a single node with a nil output.

Automatic Balancing

Finally, we provide a convenience shortcut for specifying weights for a
classification objective which are proportional to their category
counts, by means of the balance_objective flag.

For example, to declare the last column of the toy dataset above as the weight field:

"weight_field": "000005"

The nodes for a weighted tree will include a weight and
weighted_objective_distribution, which are the weighted
analogs of count and objective_distribution. Confidence, importance, and pruning calculations also take weights into account.

Models

Last Updated: Thursday, 2015-02-05 23:59

A model is a tree-like representation of your
dataset with predictive power. You can create a
model selecting which fields from your
dataset you want to use as input
fields (or predictors) and
which field you want to predict, the objective field.

Each node in the model corresponds to one of the
input fields. Each node has an incoming branch, except
the top node, also known as the root, which has none. Each node has a number of outgoing
branches, except those at the bottom (the "leaves"), which have none.

Each branch represents a possible value for the input field where it
originates. A leaf represents the value of the
objective field given
all the values for each input field in the chain of branches that goes from the root to
that leaf.

When you create a new model, BigML.io will
automatically compute a classification model or regression model
depending on whether the objective field that you want to predict is categorical
or numeric, respectively.

Arguments

In addition to the dataset, you can also POST the
following arguments.

Model Creation Arguments

Argument

Type

Description

categoryoptional

Integer,
default is the category of the dataset

The category that
best describes the model. See the category table
for the complete list of categories.Example: 1

datasetrequired

String

A
valid dataset/id.Example: dataset/4f66a80803ce8940c5000006

descriptionoptional

String

A description of the model of up to 8192 characters.
Example: "This is a description of my new model"

excluded_fieldsoptional

Array,
default is [], an empty list. None of the fields in the
dataset is excluded.

Specifies the fields that won't
be included in the model. Example:

["000000", "000002"]

fieldsoptional

Object,
default is {}, an empty dictionary. That is, no
names or preferred statuses are changed.

This can be
used to change the names of the fields in the
model with respect to the original names in the
dataset or to tell BigML that certain fields should
be preferred. An entry keyed with the field id generated in the source
for each field that you want the name updated. Example:

{
"000001": {"name": "length_1"},
"000003": {"name": "length_2"}
}

input_fieldsoptional

Array,
default is []. All the fields in the
dataset

Specifies the fields to be included
as predictors in the model.Example:

["000001", "000003"]

missing_splitsoptional

Boolean,
default is false

Defines whether to explicitly
include missing field values when choosing a split. When this option is enabled,
the model generates predicates whose operators include an asterisk, such as
>*, <=*, =*, or !=*. The presence of an asterisk means "or
missing". So a split with the operator >* and the value
8 can be read as "x > 8 or x is missing".
When using missing_splits there may also be predicates with operators = or
!=, but with a null value. This means "x is missing" and "x is
not missing", respectively. Example: true.

nameoptional

String,
default is dataset's name

The name you want to give
to the new model. Example: "my new
model".

node_thresholdoptional

Integer,
default is 512

When the number of nodes in the tree exceeds this value, the tree stops growing. Example: 1000

objective_fieldoptional

String,
default is the id of the last field in the
dataset

Specifies the id of the field that
you want to predict.Example:
"000003".

objective_fieldsoptional

Array,
default is an array with the id of the last field in the
dataset

Specifies the ids of the fields that
you want to predict. Even if this is an array,
BigML.io only accepts one
objective field in the current
version. If both objective_field and
objective_fields
are specified, then objective_field takes preference.
Example:
["000003"].

orderingoptional

Integer,
default is 0 (deterministic).

Specifies
the type of ordering followed to build the model. There are
three different types that you can specify: 0 (deterministic), 1 (linear), and 2 (random). See the Section on Shuffling below for more details.

If you do not specify a name, BigML.io will assign to
the new model the dataset's name. If
you do not specify a range of instances, BigML.io
will use all the instances in the dataset. If you do
not specify any input fields, BigML.io will include all the input fields in the
dataset, and if you do not specify an
objective field, BigML.io will
use the last field in your dataset.
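A sketch of a creation request (the dataset/id matches the one in the arguments table above; the name and field ids are illustrative):

curl "https://bigml.io/model?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006",
       "name": "my model",
       "input_fields": ["000000", "000001"],
       "objective_field": "000003"}'

$ Creating a model from the command line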

Shuffling the Rows of your Dataset

By default, rows from the input dataset are deterministically
shuffled before being processed, to avoid inaccurate models caused by
ordered fields in the input rows. Since the shuffling is deterministic,
i.e., always the same for a given dataset, retraining a model for the
same dataset will always yield the same result.

However, you can modify this default behaviour by including the
ordering argument in the model creation request, where "ordering"
here is a shortcut for "ordering for the traversal of input rows". When
this property is absent or set to 0,
deterministic shuffling takes place;
otherwise, you can set it to:

Linear: If you know that your input is already in random order.
Setting "ordering" to 1 in your model request tells
BigML to traverse the dataset in a linear fashion, without performing any
shuffling (and therefore operating faster).

Random: If you'd like to perform a really random shuffling, most
probably different from any other one attempted before. Setting
"ordering" to 2 will shuffle the input rows
non-deterministically.

Sampling your Dataset

You can limit the dataset rows that are used to create a model in
two ways (which can be combined), namely, by specifying a row range and
by asking for a sample of the (already clipped) input rows.

To specify a sample, which is taken over the row range or over the whole
dataset if a range is not provided, you can add the following arguments
to the creation request:

sample_rate: A positive number that specifies the sampling rate, i.e., how often we
pick a row from the range. In other words, the final number of rows
will be the size of the range multiplied by the sample_rate, unless
"out_of_bag" is true (see below).

replacement: A boolean indicating whether sampling
should be performed with or without replacement, i.e., the same instance may be selected multiple times for inclusion in the result set. Defaults to false.

out_of_bag: If an instance isn't selected as part of a
sampling, it's called out of bag. Setting this parameter to true will
return a sequence of the out-of-bag instances instead of the sampled
instances.
This can be useful when paired with "seed". When
replacement is false,
the final number of rows returned is the size of the range multiplied by
one minus the sample_rate. Out-of-bag sampling with replacement gives rise to
variable-size samples. Defaults to false.

seed: Rows are sampled probabilistically using a
random number generator. By default, BigML seeds this generator with a
random integer, which means that, in general, two identical samples of
the same row range of the same dataset will be different. If you
provide a seed (as an arbitrary string), its hash value will be used as
the seed, and it'll be possible for you to generate deterministic
samples.

Finally, note that the "ordering" of the dataset described in the previous subsection is used on the result of the sampling.

Here's an example of a model request with range and sampling specifications:
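A sketch (the dataset/id and values are illustrative):

curl "https://bigml.io/model?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006",
       "range": [1, 1000],
       "sample_rate": 0.75,
       "replacement": false,
       "seed": "mysample"}'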

Random Decision Forests

A model can be randomized by setting the randomize parameter to true. The default is false.

When randomized, the model considers only a subset of the possible
fields when choosing a split. The size of the subset will be the square
root of the total number of input fields. So if there are 100 input
fields, each split will only consider 10 fields randomly chosen from
the 100. Every split will choose a new subset of fields.

Although randomize could be used for other purposes, it's intended for
growing random decision forests. To grow tree models for a random forest, set
randomize to true and select a sample from the dataset. Traditionally
this is a 1.0 sample rate with replacement, but we suggest a 0.63
sample rate without replacement.
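A sketch of a request for one such tree, following the suggestion above (the dataset/id is illustrative):

curl "https://bigml.io/model?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006",
       "randomize": true,
       "sample_rate": 0.63,
       "replacement": false}'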

Retrieving a Model

Each model has a unique identifier in the form
model/id where id is a string of 24 alpha-numeric
characters that you can use to retrieve the model or as a
parameter to create predictions.
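Retrieving a model with curl is extremely easy (the model/id is illustrative):

curl https://bigml.io/model/54d9c6b1f0a5ea5fc0000000?$BIGML_AUTH

$ Retrieving a model from the command line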

You can also use your browser to visualize the model using the full
BigML.io URL or pasting the model/id
into the BigML.com dashboard.

Properties

Once a model has been successfully created it will
have the following properties.

Model Properties

Property

Type

Description

categoryfilterable,
sortable, updatable

Integer

One
of the categories in the table of
categories that help classify this
resource according to the domain of application.

code

Integer

HTTP
status code. This can be 201 upon the
model creation and
200 after it. Make sure that you check the code that comes
with the status attribute to make sure that the
model creation
has been completed without errors.

createdfilterable, sortable

ISO-8601 Datetime

This is the date and time in which
the model was created with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

creditsfilterable,
sortable

Float

The
number of credits it cost you to create this
model.

credits_per_predictionfilterable,
sortable, updatable

Float

This
is the number of credits that other users will
consume to make a prediction with your
model if you made it public.

datasetfilterable,
sortable

String

The
dataset/id that was used to build the
model.

dataset_field_types

Object

A dictionary that informs about the number of fields of
each type in the dataset used to create the model. It has an entry per each field type
(categorical,
datetime, numeric,
and text), an entry for
preferred fields, and an entry for the
total number of fields.

dataset_statusfilterable,
sortable

Boolean

Whether the
dataset is still available or has
been deleted.

descriptionupdatable

String

A
text describing the model. It can contain restricted markdown
to decorate the text.

dev

Boolean

True when the
model has been built in development mode.

ensemblefilterable,
sortable

Boolean

Whether
the model was built as part of an ensemble or not.

ensemble_idfilterable,
sortable

String

The ensemble
id.

ensemble_indexfilterable,
sortable

Integer

The order number
of the model in the ensemble.

excluded_fields

Array

The list of ids of the fields that were
excluded when building the model.

fields_meta

Object

A
dictionary with meta information about the fields
dictionary. It specifies the
total number of
fields, the current offset and
limit, and the number
of fields (count) returned.

input_fields

Array

The list of input fields' ids used to build the model.

locale

String

The
dataset's locale.

max_columnsfilterable,
sortable

Integer

The total number of
fields in the dataset used to
build the model.

max_rowsfilterable,
sortable

Integer

The maximum number of
instances in the dataset that can
be used to build the model.

missing_splitsfilterable,
sortable

Boolean

Whether to explicitly include missing field values when choosing a
split while growing a model.

model

Object

All the
information that you need to recreate or use the model on
your own. It includes a very intuitive description of the tree-like structure that makes
up the model and the fields dictionary describing the fields
and their summaries.

namefilterable,
sortable, updatable

String

The
name of the model as you provided it
or, by default, based on the name of the dataset.

node_thresholdfilterable,
sortable

Integer

The maximum number of nodes that the model will grow to.

number_of_batchpredictionsfilterable,
sortable

Integer

The
current number of batch predictions that use this
model.

number_of_evaluationsfilterable,
sortable

Integer

The
current number of evaluations that use this
model.

number_of_predictionsfilterable,
sortable

Integer

The
current number of predictions that use this
model.

number_of_public_predictionsfilterable,
sortable

Integer

The
current number of public predictions that use this
model.

objective_field

String

The id of the field that
the model predicts.

objective_fields

Array

Specifies
the list of ids of the fields that
the model predicts. Even if this is an array,
BigML.io only accepts one
objective field in the current
version.

orderingfilterable,
sortable

Integer

The ordering
followed to choose instances from the dataset to
build the model. There are three different types:

0 Deterministic

1 Linear

2 Random

out_of_bagfilterable,
sortable

Boolean

Whether the out-of-bag instances were used to create the
model instead of the sampled instances.

pricefilterable,
sortable, updatable

Float

The price other users must pay to clone your model.

privatefilterable,
sortable, updatable

Boolean

Whether
the model is public or not. In a future version, you will be able to share
models with other coworkers or, if desired, make them publicly available.

projectfilterable,
sortable

String

The
project/id the resource belongs to.

randomizefilterable,
sortable

Boolean

Whether
the model's splits considered only a random subset of the fields or all
the available fields.

random_candidatesfilterable,
sortable

Integer

The number of random fields considered when
randomize is true.

range

Array

The
range of instances used to build the
model.

replacementfilterable,
sortable

Boolean

Whether the instances sampled to build the
model were selected using replacement or not.

resource

String

The
model/id.

rowsfilterable,
sortable

Integer

The
total number of instances used to build the model.

sample_ratefilterable, sortable

Float

The sample rate used to select instances
from the dataset to build the
model.

seedfilterable,
sortable

String

The string that was used to generate the sample.

selective_pruningfilterable,
sortable

Boolean

If true, selective pruning throttled the strength of the statistical pruning depending on the size of the dataset.

sharedfilterable,
sortable, updatable

Boolean

Whether the model is shared using a private link or not.

shared_hash

String

The hash that gives access to this
model if it has been shared using a private link.

sharing_key

String

The alternative key that gives read access to
this model.

sizefilterable,
sortable

Integer

The
number of bytes of the dataset that
were used to create this model.

sourcefilterable,
sortable

String

The
source/id that was used to build the
dataset.

source_statusfilterable,
sortable

Boolean

Whether the
source is still available or has
been deleted.

stat_pruningfilterable,
sortable

Boolean

Whether statistical pruning was used when building the model.

status

Object

A
description of the status of the
model. It includes a code, a message,
and some extra information. See the table below.

updatedfilterable, sortable

ISO-8601 Datetime

This is the date and time in which
the model was last updated with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

white_boxfilterable,
sortable

Boolean

Whether the model is publicly shared as a white-box
model.

A Model Object has the following properties:

Model Object Properties

Property

Type

Description

depth_threshold

Integer

The depth, or generation, limit for a tree.

distribution

Object

This dictionary gives information about how the training data
is distributed across the tree leaves. More concretely, it contains the training data distribution with key "training", and the distribution for the actual prediction values of the tree with key "predictions". The former is just the
"objective_summary" of the tree root (see
below), copied for easier individual retrieval, and both have
the format of the objective summary in the tree nodes.

fieldsupdatable

Object

A
dictionary with an entry per field in the dataset used to
build the model. Fields are paginated according to the
fields_meta attribute described above.
Each entry includes the column number in the original
source, the name of the field, the type of the field, and the
summary. See this Section for more
details.

importance

Array of Arrays

A list of pairs [field_id,
importance].
Importance is the amount by which each field in the model reduces prediction error,
normalized to be between zero and one. Note that fields with an
importance of zero may still be correlated with the objective; they
were just not used in the model.

kind

String

The type
of model. Currently, only "stree".

missing_strategy

String

Default strategy followed by the model when it finds a missing value. Currently, "last_prediction". At
prediction time you can opt for using "proportional". See this
Section for more details.

model_fields

Object

A dictionary with an entry per field used by the model (not
all the fields that were available in the dataset).
They follow the same structure as the
fields attribute above except that
the summary is not present.

root

Object

A
Node Object,
a tree-like recursive structure representing the model.

split_criterion

Integer

Method of choosing best attribute and split point for a given node.

support_threshold

Number

A number between 0 and 1. For a split to be valid, each child's support
(instances / total instances) must be greater than this
threshold.

Node Objects have the following properties:

Node Object Properties

Property

Type

Description

children

Array

Array
of Node Objects.

confidence

Float

For
classification models, a
number between 0 and 1 that expresses how certain the model is
of the prediction. For
regression models, a number mapped to the top end of a 95% confidence
interval around the expected error at that node
(measured using the variance of the output at the node). See
the Section on Confidence for
more details. Note
that for models you might have created using the first versions of
BigML this value might be null.

count

Integer

Number
of instances classified by this node.

objective_summary

Object

An
Objective Summary Object summarizes
the objective field's distribution at this node.

output

Number or
String

Prediction at this node.

predicate

Boolean or
Object

Predicate structure to make a decision at this
node.

Objective Summary Objects have the following properties:

Objective Summary Properties

Property

Type

Description

categories

Array

If the objective field is categorical, an array of pairs where the first element of each pair
is one of the unique categories and the second element
is the count for that category.

counts

Array

If the objective field is numeric and the number of
distinct values is less than or equal to 32, an array of pairs where the first element of each pair
is one of the unique values found in the field and the
second element is the count.

bins

Array

If the objective field is numeric and the
number of distinct values is greater than 32, an array that represents an approximate histogram
of the distribution. It consists of value pairs, where
the first value is the mean of a histogram bin and the
second value is the bin population. For
more information, see
our blog
post or
read this
paper.

minimum

Number

The minimum of the objective field's
values. Available when 'bins' is present.

maximum

Number

The maximum of the objective field's
values. Available when 'bins' is present.

Predicate Objects have the following properties:

Predicate Object Properties

Property

Type

Description

field

String

Field's
id used for this decision.

operator

String

Type
of test used for this field.

value

Number or
String

Value of the field to make this node decision.

Model Status

Creating a model is a process that can take just a few
seconds or a few days depending on the size of the
dataset used as input and on the workload of BigML's
systems. The model goes through a number of
states until it's fully completed. Through the status field in the
model you can determine when the model has been
fully processed and is ready to be used to create predictions. These are the
properties that a model's status has:

Model Status Properties

Field

Type

Description

code

Integer

A status
code that reflects the status of the model creation. It can be any of the ones explained here.

Filtering a Model

It is possible to filter the tree returned by a GET to the model location by means of two optional query string parameters, namely support and value.

Filter by support

Support is a number from 0 to 1 that specifies the
minimum fraction of the total number of instances that a given
branch must cover to be retained in the resulting tree. Thus,
asking for a (minimum) support of 0 is just asking for the whole
tree, while something like:
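For example, asking for a support of 1 (a sketch; the model/id is illustrative):

curl "https://bigml.io/model/54d9c6b1f0a5ea5fc0000000?$BIGML_AUTH;support=1"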

will return just the root node, that being the only one that covers all instances. If you repeat the "support" parameter in the query string, the last one is used. Non-parseable support values are ignored.

Filter by values and value intervals

Value is a concrete value or interval of values (for regression
trees) that a leaf must predict to be kept in the returned tree. For
instance:
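Two sketches (the model/id is illustrative): the first repeats a categorical value, the second asks for a numeric interval:

curl "https://bigml.io/model/54d9c6b1f0a5ea5fc0000000?$BIGML_AUTH;value=Iris-setosa;value=Iris-versicolor"

curl "https://bigml.io/model/54d9c6b1f0a5ea5fc0000000?$BIGML_AUTH;value=(-2,10]"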

in which case the union of the different predicates is used (i.e., the first query will return a tree with all leaves predicting "Iris-setosa" and all leaves predicting "Iris-versicolor").

Intervals can be closed or open in either end. For example, "(-2,10]", "[1,2)" or "(-1.234,0)", and the values of the left or right limits can be omitted, in which case they're taken as negative and positive infinity, respectively; thus "(,3]" denotes all values less or equal to three, as does "[,3]" (infinity not being a valid value for a numeric prediction), while "(0,)" accepts any positive value.

Filter by confidence

Confidence is a concrete value or interval of values that a leaf must have to be kept in the returning tree.
The specification of intervals follows the same conventions as those of value. Since confidences are a continuous value,
the most common case will be asking for a range, but the service will accept also individual values.
It's also possible to specify both a value and a confidence. For instance:
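A sketch combining both filters (the model/id is illustrative):

curl "https://bigml.io/model/54d9c6b1f0a5ea5fc0000000?$BIGML_AUTH;value=Iris-versicolor;confidence=[0.9,1]"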

Filtering and Paginating Fields from a Model

A model might be composed of hundreds or even thousands of
fields. Thus when retrieving a model, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):

Parameters to filter fields from a model

Parameter

Type

Description

fieldsoptional

Comma-separated
list

A comma-separated list of field IDs to retrieve.

iprefixoptional

String

A case-insensitive string to retrieve fields whose name
starts with the given prefix. It is possible to specify
more than one iprefix by repeating the parameter, in which case the union of the results is returned.

fulloptional

Boolean

If
false, no information about fields is returned.

limitoptional

Integer

Maximum
number of fields that you will get in
the fields field.

offsetoptional

Integer

How
far off from the first field in your
dataset is the first
field in the fields field.

prefixoptional

String

A case-sensitive string to retrieve fields whose name starts with the given prefix. It is possible to specify more than one prefix by repeating the parameter, in which case the union of the results is returned.

Since the fields field is a map and therefore not
ordered, the returned fields contain an additional key, "order," whose
integer (increasing) value gives you their ordering. In all other
respects, the model is the same as the one you would get without any
filtering parameter above.

The fields_meta field can help you paginate fields. Its
structure is as follows:

Fields Meta Object Properties

Property

Type

Description

count

Integer

Specifies
the current number of fields in the resource.

limit

Integer

The
maximum number of fields that will be returned in the
resource.

offset

Integer

The
current offset in the pagination of fields.

total

Integer

The
total number of fields in the resource.

Note that paginating fields might only be worth it if you are going to
deal with really wide models (i.e., more than 200 fields).

Listing Models

To list all your models you can use the
model base URL. By default, only the 20 most recent
models will be returned. You can see below how to change this number
using the limit parameter.

In addition to exact match, there are four more filters that you can
use. To add one of these filters to your request, you just need to
append one of the suffixes in the following table to the name of the attribute that you want to use as a filter.

You can do the same thing from the command line using curl
as follows:

curl https://bigml.io/andromeda/model?$BIGML_AUTH;order_by=-size

$ Listing models ordered by size from the command line

Paginating Models

There are two parameters that can help you retrieve just a portion of
your models and paginate them.

Pagination Parameters

Parameter

Type

Description

limitoptional

Integer,
default is 20

Specifies
the number of models to retrieve. Must be less than or equal to 200.

offsetoptional

Integer,
default is 0

The order
number from which the model listing will start.

If a limit is given, no more than that many
models will be returned, but possibly fewer, if the
request itself yields fewer models.

For example, if you want to retrieve only the third and fourth latest
models:

curl "https://bigml.io/andromeda/model?$BIGML_AUTH;limit=2;offset=2"

$ Paginating models from the command line

To paginate results, you need to start off with an
offset of zero, then increment it by whatever value
you use for the limit each time. So if you wanted to
return models 1-10, then 11-20, then 21-30, etc., you would use
"limit=10;offset=0",
"limit=10;offset=10", and
"limit=10;offset=20", respectively.

Updating a Model

To update a model, you need to PUT an object containing the fields that you want to
update to the model's URL. The content-type must always be: "application/json".
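For example, to update a model with a new name (a sketch; the model/id is illustrative):

curl "https://bigml.io/model/54d9c6b1f0a5ea5fc0000000?$BIGML_AUTH" \
  -X PUT \
  -H 'content-type: application/json' \
  -d '{"name": "my new model"}'

$ Updating a model's name from the command line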

If the request succeeds, BigML.io will respond
with a 202 accepted code and with the new updated
model in the body of the message.

Deleting a Model

To delete a model, you need to issue an HTTP DELETE request to the model's URL.

If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return HTTP 204 responses with no body.

HTTP/1.1 204 NO CONTENT
Content-Length: 0

< Successful response

Once you delete a model, it is permanently deleted. That is, a delete request cannot be undone.
If you try to delete a model a second time, or to delete a model that does not
exist, you will receive an error like this:

Ensembles

What are ensembles?

An ensemble is a number of models grouped together to
create a stronger model with better predictive performance.

Depending on the nature of your data and the specific parameters of the
ensemble, you can significantly boost predictive performance over
single models, using exactly the same data.

You can create an ensemble just as you would create a
model, with the
addition of two optional parameters: the number of models (number_of_models) to be built and
the task-level parallelism (tlp, a number between
1 and 5) that you want BigML to use to create the ensemble. The higher
the level, the more models will be built in parallel and the faster the
ensemble will be created; however, the higher the level, the more
credits it will cost you. By default, the number of models is set to 10 and the task-level parallelism to 1.

Currently, you can build ensembles following two
basic machine learning techniques: bagging and
random decision forests.

Bagging, also known as
bootstrap aggregating,
is one of the simplest ensemble-based strategies, but it often
outperforms strategies that are more complex. The basic idea is to use a different random subset
of the original dataset for each model in the ensemble. Specifically, by default BigML uses a
sampling rate of 100% with replacement for each model. You can read
more about bagging here.

Random decision forests is the second ensemble-based
strategy that BigML provides. It consists, essentially, of selecting
a new random set of the input fields at each split while an individual model is
being built, instead of considering all the input fields. To create a random decision forest you just need to set the
randomize parameter to true. You can
read more about random decision forests here.

Creating an Ensemble

To create a new ensemble, you need to POST to the
ensemble base URL an object containing at least
the dataset/id that you want to use to create the
ensemble. The content-type must always be
"application/json".

Arguments

In addition to the dataset, you can also POST the
following arguments.

Ensemble Creation Arguments

Argument

Type

Description

categoryoptional

Integer, default is the category of the dataset

The category that best describes the ensemble. See the
category table
for the complete list of categories.Example: 1

datasetrequired

String

A valid dataset/id.Example: dataset/50e8d4f03c19202d91000004

descriptionoptional

String

A description of the ensemble of up to 8192 characters.Example: "This is a description of my new ensemble"

excluded_fieldsoptional

Array,
default is [], an empty list. None of the fields in the dataset is excluded.

Specifies the fields that won't be included in the models of the
ensemble. Example:

["000000", "000002"]

fieldsoptional

Object,
default is {}, an empty dictionary. That is, no names or preferred statuses are changed.

This argument can be
used to change the names of the fields in the
models of the ensemble with respect to the original names in the
dataset. It can also be used to tell BigML that certain fields should
be preferred. An entry keyed with the field
id generated in the source
for each field that you want the name updated.Example:

{
"000001": {"name": "length_1"},
"000003": {"name": "length_2"}
}

input_fieldsoptional

Array,
default is []. All the fields in the dataset

Specifies the fields to be included as predictors in the models of the
ensemble.Example:

["000001", "000003"]

missing_splitsoptional

Boolean,
default is true

Defines whether to explicitly
include missing field values when choosing a split while growing the models of
an ensemble. When this option is enabled, each model generates predicates
whose operators include an asterisk, such as
>*, <=*, =*, or !=*. The presence of an asterisk means "or
missing". So a split with the operator >* and the value
8 can be read as "x > 8 or x is missing".
When using missing_splits there may also be predicates with operators = or
!=, but with a null value. This means "x is missing" and "x is
not missing", respectively. Example: false.

nameoptional

String,
default is dataset's name

The name you want to give to the new ensemble.Example: "my new model".

node_thresholdoptional

Integer,
default is 512

When the number of nodes in the tree exceeds this value, the tree stops growing. Example: 1000

number_of_modelsoptional

Integer,
default is 10

The number of models used to build the ensemble.Example: 100

objective_fieldoptional

String,
default is the id of the last field in the dataset

Specifies the id of the field that the ensemble
will predict.Example: "000003".

orderingoptional

Integer,
default is 0 (deterministic).

Specifies the type of ordering followed to build the models of
the ensemble. There are three different types that you can specify:

0 Deterministic

1 Linear

2 Random

Example: 1

out_of_bagoptional

Boolean,
default is false.

If this parameter is set to true, the models of
the ensemble will be created with a sequence of the out-of-bag
instances instead of the sampled instances. See the
Section on Sampling
for more information.Example: true

randomizeoptional

Boolean,
default is false.

Setting this parameter to true will consider
only a subset of the possible fields when choosing a
split. See the
Section on Random Decision Forests
for further details.Example: true

random_candidatesoptional

Integer,
default is the square root of the total
number of input fields.

Sets the
number of random fields considered when
randomize is true.Example: 10

rangeoptional

Array,
default is [1, max rows in the dataset].

The range of successive instances to build the models of the
ensemble.Example: [1, 150]

replacementoptional

Boolean,
default is false

Whether sampling should be performed with or without
replacement. See the
Section on Sampling
for more details.Example: true

sample_rateoptional

Float,
default is 1.0.

A real number between 0 and 1 specifying the sample rate. See the
Section on Sampling.Example: 0.5

seedoptional

String.

A string to be hashed to generate deterministic samples.
See the
Section on Sampling
for more information.Example: "MySample"

tagsoptional

Array of Strings

A list of strings that help classify and retrieve the
ensemble.Example: ["best customers", "2012"].

tlpoptional

Integer,
default is 1

The number of models to build in parallel. This is a
number between 1 and 5. It increases the speed at which
your ensemble will be built and the
number of credits that the ensemble
will cost you.Example: 5DEPRECATED

You can use curl to customize a new
ensemble from the command line. For example, to create
a new ensemble named "my ensemble", with only certain rows, and
with only three fields:
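The following is a sketch of such a request; the dataset/id and field ids are placeholders:

curl "https://bigml.io/ensemble?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/50e8d4f03c19202d91000004",
       "name": "my ensemble",
       "range": [1, 150],
       "input_fields": ["000001", "000002", "000003"]}'

$ Creating a customized ensemble from the command line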

If you do not specify a name, the dataset's name will
be assigned to the new ensemble. If
you do not specify a range of instances,
the complete set of instances in the dataset will be
used. If you do not specify any input fields, all the "preferred" input fields in the
dataset will be included, and if you do not specify an
objective field, the last field in your
dataset will be considered the objective field.

Retrieving an Ensemble

Each ensemble has a unique identifier in the form
ensemble/id where id is a string of 24 alpha-numeric
characters that you can use to retrieve the ensemble or as a
parameter to create predictions or to evaluate the
ensemble.
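Retrieving an ensemble with curl is a simple GET request; the ensemble/id below is a placeholder:

curl "https://bigml.io/ensemble/517020d53c1920a514000056?$BIGML_AUTH"

$ Retrieving an ensemble from the command line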

You can also use your browser to visualize the ensemble using the full
BigML.io URL or pasting the ensemble/id
into the BigML.com dashboard.

Properties

Once an ensemble has been successfully created it will
have the following properties.

Ensemble Properties

Property

Type

Description

categoryfilterable, sortable, updatable

Integer

One
of the categories in the
table of categories
that help classify this
resource according to the domain of application.

code

Integer

HTTP
status code. This can be 201 upon the
ensemble creation and
200 after it. Make sure that you check the code that comes
with the status attribute to make sure that the
ensemble creation
has been completed without errors.

createdfilterable, sortable

String

This is the date and time in which
the ensemble was created with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in
Coordinated Universal Time (UTC).

creditsfilterable, sortable

Float

The number of credits it cost you to create this
ensemble.

credits_per_predictionfilterable, sortable, updatable

Float

This
is the number of credits that other users will
consume to make a prediction with your
ensemble in case you decide to make it
public.

datasetfilterable, sortable

String

The dataset/id that was used to build the
ensemble.

dataset_statusfilterable, sortable

Boolean

Whether the
dataset is still available or has
been deleted.

descriptionupdatable

String

A
text describing the ensemble. It can contain restricted markdown
to decorate the text.

dev

Boolean

True when the
ensemble has been created in development mode.

distributions

Array

Unordered list of
distributions for each model in the ensemble. Each distribution
is an Object with an entry for the distribution of instances in
the training set and the distribution of predictions in the
model. See a model distribution field for more details.
Note that distributions must be accessed by the model_order below.

error_modelsfilterable, sortable

Integer

The number of models in the
ensemble that have failed.

excluded_fields

Array

The list of field ids that were
excluded when building the models of the ensemble.

finished_modelsfilterable, sortable

Integer

The number of models in the
ensemble that have finished correctly.

input_fields

Array

The list of input fields' ids used to
build the models of the ensemble.

locale

String

The dataset's locale.

max_columnsfilterable, sortable

Integer

The total number of fields in the dataset used to
build the models of the ensemble.

max_rowsfilterable, sortable

Integer

The maximum number of
instances in the dataset that was
used to build the ensemble.

missing_splitsfilterable,
sortable

Boolean

Whether to explicitly include missing field values when choosing a
split while growing the models of an ensemble.

model_order

Array

Order in which each model in the list of "models" was
finished. The distributions above must be accessed
following this index.

models

Array

Unordered list of model/ids that compose the
ensemble. Models are ordered by the
model_order above.

namefilterable, sortable, updatable

String

The name of the ensemble as you provided
or based on the name of the dataset by default.

node_thresholdfilterable,
sortable

Integer

The maximum number of nodes that each model in the ensemble will grow.

number_of_batchpredictionsfilterable,
sortable

Integer

The
current number of batch predictions that use this
ensemble.

number_of_evaluationsfilterable,
sortable

Integer

The
current number of evaluations that use this
ensemble.

number_of_modelsfilterable, sortable

Integer

The number of models in the
ensemble.

number_of_predictionsfilterable, sortable

Integer

The current number of predictions that use this
ensemble.

number_of_public_predictionsfilterable,
sortable

Integer

The
current number of public predictions that use this
ensemble.

objective_fieldoptional

String

Specifies the id of the field that
the ensemble predicts.Example: "000003".

orderingfilterable, sortable

Integer

The order used to choose instances from the dataset to
build the models of the ensemble. There are three different types:

0 Deterministic

1 Linear

2 Random

out_of_bagfilterable, sortable

Boolean

Whether the out-of-bag instances were used to create the
models of the ensemble instead of the sampled instances.

pricefilterable, sortable, updatable

Float

The price other users must pay to clone your ensemble in
case you decide to make it public.

privatefilterable, sortable, updatable

Boolean

Whether the ensemble is public or not. In a future version, you will be able
to share ensembles with other coworkers or, if desired, make
them publicly available.

projectfilterable,
sortable

String

The
project/id the resource belongs to.NEW

randomizefilterable,
sortable

Boolean

Whether
the splits of each model in the ensemble considered only a random subset of the fields or all
the fields available.

random_candidatesfilterable,
sortable

Integer

The
number of random fields considered when
randomize is true.

range

Array

The range of instances used to build the
models of the ensemble.

replacementfilterable, sortable

Boolean

Whether the instances sampled to build the
model were selected using replacement or not.

resource

String

The ensemble/id.

rowsfilterable, sortable

Integer

The total number of instances used to build the models
of the ensemble.

sample_ratefilterable, sortable

Float

The sample rate used to select instances from the dataset to build
the models of the ensemble.

seedfilterable,
sortable

String

The string that was used to generate the sample.

sharedfilterable,
sortable, updatable

Boolean

Whether the ensemble is shared using a private link or not.

shared_hash

String

The hash that gives access to this
ensemble if it has been shared using a private link.

sharing_key

String

The alternative key that gives read access to
this ensemble.

sizefilterable, sortable

Integer

The number of bytes of the dataset that
were used to create these models.

sourcefilterable, sortable

String

The source/id that was used to build the
dataset.

source_statusfilterable, sortable

Boolean

Whether the source is still available or has
been deleted.

status

Object

A description of the status of the
ensemble. It includes a code, a message,
and some extra information. See the table below.

subscriptionfilterable,
sortable

Boolean

Whether
the ensemble was created using a subscription plan or not.

tagsupdatable

Array of Strings

A list of user tags that can help classify and
index this resource.

tlpfilterable, sortable

Integer

Task-level parallelization that was used to build the ensemble.
DEPRECATED

updatedfilterable, sortable

String

This is the date and time in which
the ensemble was last updated with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS.
All times are provided in
Coordinated Universal Time (UTC).

Ensemble Status

Creating an ensemble is a process that can take just a few
seconds or a few days depending on the size of the
dataset used as input, the number of models, and
on the workload of BigML's
systems. The ensemble goes through a number of
states until it's fully completed. Through the status field in the
ensemble you can determine when the ensemble has been
fully processed and ready to be used to create predictions. These are the
properties that an ensemble's status has:

Ensemble Status Properties

Field

Type

Description

code

Integer

A status code that reflects the status of the ensemble creation.
It can be any of the explained
here.

Filtering Ensembles

The listings of ensembles can be filtered by any of the fields that we
labeled as filterable in the table describing
attributes above.

In addition to exact match, there are four filters that you can
use. To add one of these filters to your request you just need to
append one of the suffixes in the following table to the name of the attribute that you want to use as a filter.

Ensemble Filters

Filter

Description

__ltoptional

Less than.Example: number_of_models__lt=10

__lteoptional

Less than or equal to.Example: number_of_models__lte=10

__gtoptional

Greater than.Example: number_of_models__gt=10

__gteoptional

Greater than or equal to.Example: number_of_models__gte=10

For example, to get your ensembles that are composed
of more than 10 models:
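A sketch of that request from the command line, using the __gt filter described above:

curl "https://bigml.io/ensemble?$BIGML_AUTH;number_of_models__gt=10"

$ Filtering ensembles from the command line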

To paginate results, you need to start off with an
offset of zero, then increment it by whatever value
you use for the limit each time. So if you wanted to
return ensembles 1-10, then 11-20, then 21-30, etc., you would use
"limit=10;offset=0",
"limit=10;offset=10", and
"limit=10;offset=20", respectively.

Updating an Ensemble

To update an ensemble, you need to PUT an object containing the fields that you want to
update to the ensemble's URL. The content-type must always be: "application/json".
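For example, to update an ensemble with a new name you can use curl like this (the ensemble/id is a placeholder):

curl "https://bigml.io/ensemble/517020d53c1920a514000056?$BIGML_AUTH" \
  -X PUT \
  -H 'content-type: application/json' \
  -d '{"name": "a new name"}'

$ Updating an ensemble's name from the command line

If the request succeeds, BigML.io will respond with a 202 accepted code and with the new updated ensemble in the body of the message.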

Deleting an Ensemble

To delete an ensemble, you need to issue an HTTP DELETE request to the
ensemble/id to be deleted.

If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return HTTP 204 responses with no body.
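A sketch of the DELETE request, with a placeholder ensemble/id:

curl -X DELETE "https://bigml.io/ensemble/517020d53c1920a514000056?$BIGML_AUTH"

$ Deleting an ensemble from the command line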

HTTP/1.1 204 NO CONTENT
Content-Length: 0

< Successful response

Once you delete an ensemble, it is permanently deleted. That is, a delete request cannot be undone.
If you try to delete an ensemble a second time, or
to delete an ensemble that does not
exist you will receive an error like this:

Clusters

Last Updated: Thursday, 2015-02-05 23:59

A cluster is a set of groups (i.e., clusters) of instances of a
dataset that have been automatically classified together according to a distance measure
computed using the fields of the dataset. Each group is represented by
a centroid or center that is computed using the mean
for each numeric field and the mode for each categorical field.

To create a cluster, you can select an arbitrary number of clusters (i.e.,
k) and also select an arbitrary subset of fields from your dataset as
input_fields. You can use scales to select how each field
influences the distance measure used to group instances together.

BigML.io allows you to create new clusters,
retrieve individual
clusters, list
your clusters, and delete or
update your clusters.
BigML allows you to use sampling or a different ordering to build your
clusters. You can also filter
your clusters or get them automatically translated into PMML.

Cluster base URL

You can use the following base URL to create, retrieve, update, and
delete clusters.

https://bigml.io/cluster

Cluster base URL

Authentication

All requests to manage your clusters must use HTTPS
and be authenticated using your username and API key to convey
your identity.

The following snippet can help you set up an environment variable to store your username and
API key to avoid typing them again in the rest of examples.
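The following sketch uses placeholder credentials; substitute your own username and API key:

export BIGML_USERNAME=alfred
export BIGML_API_KEY=your-api-key-here
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY"

$ Setting the BIGML_AUTH environment variable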

Creating a Cluster

To create a new cluster, you need to POST to the
cluster base URL an object containing at least
the dataset/id that you want to use to create the
cluster. The content-type must always be
"application/json".

Arguments

In addition to the dataset, you can also POST the
following arguments.

Cluster Creation Arguments

Argument

Type

Description

balance_fieldsoptional

Boolean,
default is true.

When this parameter is enabled, all the numeric fields will be scaled so
that their standard deviations are 1. This makes each field have roughly
equivalent influence.
Example: true

categoryoptional

Integer,
default is the category of the dataset

The category that
best describes the cluster. See the category table
for the complete list of categories.Example: 1

cluster_seedoptional

String

A string to generate deterministic clusters.
Example: "My Seed"

datasetrequired

String

A valid dataset/id.Example: dataset/4f66a80803ce8940c5000006

descriptionoptional

String

A description of the cluster of up to 8192 characters.
Example: "This is a description of my
new cluster"

default_numeric_valuesoptional

String

It accepts any of the following strings to substitute
missing numeric values across all the numeric fields in the
dataset: "mean", "median",
"minimum", "maximum",
"zero"Example: "median"

excluded_fieldsoptional

Array,
default is [], an empty list. None of the fields in the
dataset is excluded.

Specifies the fields that won't
be included in the cluster.Example:

["000000", "000002"]

fieldsoptional

Object,
default is {}, an empty dictionary. That is, no
names or preferred statuses are changed.

This can be
used to change the names of the fields in the
cluster with respect to the original names in the
dataset or to tell BigML that certain fields should
be preferred. Add an entry keyed with the field id generated in the source
for each field whose name you want updated. Example:

{
"000001": {"name": "length_1"},
"000003": {"name": "length_2"}
}

field_scalesoptional

Object,
default is {}, an empty dictionary. That is, no
special scaling is used.

With this argument you can pick your own
scaling for each field. If a field isn't included in
field_scales, BigML will treat
the scale as 1 (no scale change). If both balance_fields
and field_scales
are present, then balance_fields will be applied first. This makes it easy
for you to do things like balancing age and salary, but then requesting that age be
twice as important. Example:

{
"000001": 4,
"000003": 2
}

input_fieldsoptional

Array,
default is []. All the fields in the
dataset

Specifies the fields to be
considered to create the clusters.Example:

["000001", "000003"]

koptional

Integer,
default is 8

The number of clusters. Must be a
number greater than or equal to 2 and less than or equal to 300.Example: 3

critical_valueoptional

Integer, default is 5

The clustering algorithm G-means is parameter-free except for one parameter, critical_value.
G-means iteratively takes existing clusters and tests whether the cluster's neighborhood
appears Gaussian. If it doesn't, the cluster is split into two. The critical_value
sets how strict the test is when deciding whether data looks Gaussian. The default is 5, which
seems to work well in most cases. A range of 1 - 10 is acceptable. A critical_value
of 1 means data must look very Gaussian to pass the test, which can lead to more clusters being detected.
A higher critical_value will tend to find fewer clusters.
Example: 3NEW

nameoptional

String,
default is dataset's name

The name you want to give
to the new cluster. Example: "my new cluster".

model_clustersoptional

Boolean,
default is
false

Whether a model for
every cluster will be generated or not. Each model
predicts whether or not an instance is part of its
respective cluster.Example: true

out_of_bagoptional

Boolean,
default is false.

Setting
this parameter to true will return a
sequence of the out-of-bag instances instead of the sampled
instances. See the Section on
Sampling below.Example: true

rangeoptional

Array,
default is [1, max rows in the
dataset].

The range of
successive instances to build the cluster. Example: [1, 150]

replacementoptional

Boolean,
default is false

Whether
sampling should be performed with or without
replacement. See the Section on
Sampling below. Example: true

sample_rateoptional

Float,
default is 1.0.

A real number between 0 and
1 specifying the sample rate. See the Section on
Sampling below. Example: 0.5

seedoptional

String

A
string to be hashed to generate deterministic samples.
See the Section on Sampling below.Example: "MySample"

summary_fieldsoptional

Array,
default is [].

Specifies the ids for fields
which will be included when generating the per cluster
summaries/datasets, but will not be used for clustering.
The summary_fields must be a strict subset
of the input_fields, where the latter is adjusted before
passing it to the model creation algorithm by setting it to
all non-preferred fields if not provided explicitly,
adding to it explicit summary_fields, and
subtracting explicit excluded_fields.
You can use either field identifiers or field names.
Example: ["000004"]

tagsoptional

Array of Strings

A list of strings that help classify and index your cluster.Example: ["best customers", "2012"].

weight_fieldoptional

String

Any numeric field with no negative or missing values is valid as a weight field.
Each instance will be weighted individually according to the weight field's value.Example: "000004"

You can also use curl to customize a new
cluster. For example, to create a new cluster named "my cluster", with only certain rows, and
with only three fields:
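The following is a sketch of such a request; the dataset/id and field ids are placeholders:

curl "https://bigml.io/cluster?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006",
       "name": "my cluster",
       "range": [1, 150],
       "input_fields": ["000001", "000002", "000003"]}'

$ Creating a customized cluster from the command line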

If you do not specify a name, BigML.io will assign to
the new cluster the dataset's name. If
you do not specify a range of instances, BigML.io
will use all the instances in the dataset. If you do
not specify any input fields, BigML.io will include all the input fields in the
dataset.

Sampling your Dataset

You can limit the dataset rows that are used to create a cluster in
two ways (which can be combined), namely, by specifying a row range and
by asking for a sample of the (already clipped) input rows.

To specify a sample, which is taken over the row range or over the whole
dataset if a range is not provided, you can add the following arguments
to the creation request:

sample_rate: A positive number that specifies the sampling rate, i.e., how often we
pick a row from the range. In other words, the final number of rows
will be the size of the range multiplied by the sample_rate, unless
"out_of_bag" is true (see below).

replacement: A boolean indicating whether sampling
should be performed with or without replacement. Defaults to
false.

out_of_bag: If an instance isn't selected as part of a
sampling, it's called out of bag. Setting this parameter to true will
return a sequence of the out-of-bag instances instead of the sampled
instances.
This can be useful when paired with "seed". When
replacement is false,
the final number of rows returned is the size of the range multiplied by
one minus the sample_rate. Out-of-bag sampling with replacement gives rise to
variable-size samples. Defaults to false.

seed: Rows are sampled probabilistically using a
random number generator. By default, BigML seeds this generator with a
random integer, which means that, in general, two identical samples of
the same row range of the same dataset will be different. If you
provide a seed (as an arbitrary string), its hash value will be used as
the seed, and it'll be possible for you to generate deterministic
samples.

Here's an example of a cluster request with range and sampling specifications:
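This sketch combines the range and sampling arguments described above (the dataset/id is a placeholder):

curl "https://bigml.io/cluster?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006",
       "range": [1, 150],
       "sample_rate": 0.5,
       "replacement": true,
       "seed": "MySample"}'

$ Creating a cluster with range and sampling from the command line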

Retrieving a Cluster

Each cluster has a unique identifier in the form
cluster/id where id is a string of 24 alpha-numeric
characters that you can use to retrieve the cluster.

You can also use your browser to visualize the cluster using the full
BigML.io URL or pasting the cluster/id
into the BigML.com dashboard.

Properties

Once a cluster has been successfully created it will
have the following properties.

Cluster Properties

Property

Type

Description

balance_fieldsfilterable,
sortable

Boolean

Whether all the numeric fields have been scaled so that their standard deviations are 1.

categoryfilterable,
sortable, updatable

Integer

One
of the categories in the table of
categories that help classify this
resource according to the domain of application.

cluster_datasets

Object

A dictionary that maps cluster ids to dataset resources offering per field distribution summaries for each cluster. Each dataset resource can be serialized on-demand using the neighborhood of the cluster.

cluster_seed

String

With no seed, the cluster locations can vary from run to
run. With a seed, the clusters are deterministic.

clusters

Object

All the
information that you need to recreate or use the cluster on
your own. It includes:

clusters: a list of centroids with a cluster
object for each centroid; and

fields: a dictionary with an entry per field in the dataset used to
build the cluster. Fields are paginated according to the
fields_meta attribute described above.
Each entry includes the column number in original
source, the name of the field, the type of the field, and the
summary. See this Section for more
details.

code

Integer

HTTP
status code. This can be 201 upon the
cluster creation and
200 after it. Make sure that you check the code that comes
with the status attribute to make sure that the
cluster creation
has been completed without errors.

createdfilterable, sortable

String

This is the date and time in which
the cluster was created with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

creditsfilterable,
sortable

Float

The
number of credits it cost you to create this
cluster.

credits_per_predictionfilterable,
sortable, updatable

Float

This
is the number of credits that other users will
consume to make a prediction with your
cluster if you made it public.

datasetfilterable,
sortable

String

The
dataset/id that was used to build the
cluster.

dataset_field_types

Object

A
dictionary that informs about the number of fields of
each type in the dataset used to create the cluster. It has an entry per each field type
(categorical,
datetime, numeric,
and text), an entry for
preferred fields, and an entry for the
total number of fields.

dataset_statusfilterable,
sortable

Boolean

Whether the
dataset is still available or has
been deleted.

descriptionupdatable

String

A
text describing the cluster. It can contain restricted markdown
to decorate the text.

dev

Boolean

True when the
cluster has been built in development mode.

excluded_fields

Array

The list of field ids that were
excluded when building the cluster.

fields_meta

Object

A
dictionary with meta information about the fields
dictionary. It specifies the
total number of
fields, the current offset and
limit, and the number
of fields (count) returned.

field_scales

Object

The
specific scales used for each field, if any.

input_fields

Array

The list of input fields' ids used to build the cluster.

kfilterable, sortable

Integer

The number of clusters.

locale

String

The
dataset's locale.

max_columnsfilterable,
sortable

Integer

The total number of
fields in the dataset used to
build the cluster.

max_rowsfilterable,
sortable

Integer

The maximum number of
instances in the dataset that can
be used to build the cluster.

namefilterable,
sortable, updatable

String

The
name of the cluster as you provided
or based on the name of the dataset by default.

model_clustersfilterable,
sortable

Boolean

Whether a model for each cluster was created or not.

number_of_batchcentroidsfilterable,
sortable

Integer

The
current number of batch centroids that use this
cluster.

number_of_centroidsfilterable,
sortable

Integer

The
current number of centroids that use this
cluster.

number_of_public_centroidsfilterable,
sortable

Integer

The
current number of public centroids that use this
cluster.

out_of_bagfilterable,
sortable

Boolean

Whether the out-of-bag instances were used to create the
cluster instead of the sampled instances.

pricefilterable,
sortable, updatable

Float

The price other users must pay to clone your cluster.

privatefilterable,
sortable, updatable

Boolean

Whether
the cluster is public or not. In a future version, you will be able to share
clusters with other coworkers or, if desired, make them publicly available.

projectfilterable,
sortable

String

The
project/id the resource belongs to.NEW

range

Array

The
range of instances used to build the
cluster.

replacementfilterable,
sortable

Boolean

Whether the instances sampled to build the
cluster were selected using replacement or not.

resource

String

The
cluster/id.

rowsfilterable,
sortable

Integer

The
total number of instances used to build the cluster.

sample_ratefilterable, sortable

Float

The sample rate used to select instances
from the dataset to build the
cluster.

scales

Object

A dictionary that represents the
combination of user requested field_scales
and balance_fields.

seedfilterable,
sortable

String

The string that was used to generate the sample.

sharedfilterable,
sortable, updatable

Boolean

Whether the cluster is shared using a private link or not.

shared_hash

String

The hash that gives access to this
cluster if it has been shared using a private link.

sharing_key

String

The alternative key that gives read access to
this cluster.

sizefilterable,
sortable

Integer

The
number of bytes of the dataset that
were used to create this cluster.

sourcefilterable,
sortable

String

The
source/id that was used to build the
dataset.

source_statusfilterable,
sortable

Boolean

Whether the
source is still available or has
been deleted.

status

Object

A
description of the status of the
cluster. It includes a code, a message,
and some extra information. See the table below.

subscriptionfilterable,
sortable

Boolean

Whether
the cluster was created using a subscription plan or not.

summary_fields

Array

The list of field ids that are included when
generating the cluster's summaries but were not
used for clustering.

updatedfilterable, sortable

String

This is the date and time in which
the cluster was last updated with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

weight_fieldfilterable,
sortable

String

The field used to weight each instance of the dataset
differently.

white_boxfilterable,
sortable

Boolean

Whether the cluster is publicly shared as a white-box
cluster.

A Cluster Object has the following properties:

Cluster Object Properties

Property

Type

Description

center

Object

A dictionary with the mean or mode for each numeric or categorical field.

count

Integer

The count gives the size of that neighborhood.

distance

Object

A dictionary that gives a numeric summary capturing the distribution of distances from the cluster's center to each of the points that fall into its neighborhood.

id

String

The id of the cluster.

name

String

The name
of the cluster.

Cluster Status

Creating a cluster is a process that can take just a few
seconds or a few days depending on the size of the
dataset used as input and on the workload of BigML's
systems. The cluster goes through a number of
states until it's fully completed. Through the status field in the
cluster you can determine when the cluster has been
fully processed and is ready to be used to create centroids. These are the
properties that a cluster's status has:

Cluster Status Properties

Field

Type

Description

code

Integer

A status
code that reflects the status of the cluster creation. It can be any of the explained here.

Creating a Dataset using a Cluster and a Centroid

Each centroid has an associated pre-computed dataset that has been
created using all the instances in the neighborhood. You can create a
new dataset using the corresponding cluster/id and
centroid id as follows:
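A sketch of the request, assuming a placeholder cluster/id and the first centroid id; the cluster and centroid arguments are POSTed to the dataset base URL:

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"cluster": "cluster/4f66a80803ce8940c5000007", "centroid": "000000"}'

$ Creating a dataset from a cluster and a centroid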

Creating a Model using a Cluster and a Centroid

If you created a Cluster setting the model_clusters
option to true, then each
centroid has an associated pre-computed model that has been
created using all the instances of the dataset. Each model separates
those instances that belong to the centroid's neighborhood from those that belong to
other neighborhoods. You can create a
new model using the corresponding cluster/id and
centroid id as follows:
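A sketch of the request, with the same placeholder cluster/id and centroid id, POSTed to the model base URL:

curl "https://bigml.io/model?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"cluster": "cluster/4f66a80803ce8940c5000007", "centroid": "000000"}'

$ Creating a model from a cluster and a centroid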

Filtering and Paginating Fields from a Cluster

A cluster might be composed of hundreds or even thousands of
fields. Thus when retrieving a cluster, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):

Parameters to filter fields from a cluster

Parameter

Type

Description

fieldsoptional

Comma-separated
list

A comma-separated list of field IDs to retrieve.

iprefixoptional

String

A case-insensitive string to retrieve fields whose names
start with the given prefix. It is possible to specify
more than one iprefix by repeating the parameter, in which case the union of the results is returned.

fulloptional

Boolean

If
false, no information about fields is returned.

limitoptional

Integer

Maximum
number of fields that you will get in
the fields field.

offsetoptional

Integer

How
far off from the first field in your
dataset is the first
field in the fields field.

prefixoptional

String

A case-sensitive string to retrieve fields whose names start with the given prefix. It is possible to specify more than one prefix by repeating the parameter, in which case the union of the results is returned.

Since the fields field is a map and therefore not
ordered, the returned fields contain an additional key, "order," whose
integer (increasing) value gives you their ordering. In all other
respects, the cluster is the same as the one you would get without any
filtering parameter above.

The fields_meta field can help you paginate fields. Its
structure is as follows:

Fields Meta Object Properties

Property

Type

Description

count

Integer

Specifies
the current number of fields in the resource.

limit

Integer

The
maximum number of fields that will be returned in the
resource.

offset

Integer

The
current offset in the pagination of fields.

total

Integer

The
total number of fields in the resource.

Note that paginating fields might only be worthwhile if you are going to
deal with really wide clusters (i.e., more than 200 fields).

Listing Clusters

To list all your clusters you can use the
cluster base URL. By default, only the 20 most recent
clusters will be returned. You can see below how to change this number
using the limit parameter.

Filtering Clusters

The listings of clusters can be filtered by any of the fields that we
labeled as filterable in the table describing
attributes above.

In addition to exact match, there are four more filters that you can
use. To add one of these filters to your request you just need to
append one of the suffixes in the following table to the name of the attribute that you want to use as a filter.

You can do the same thing from the command line using curl
as follows:

curl "https://bigml.io/cluster?$BIGML_AUTH;order_by=-size"

$ Listing clusters ordered by size from the command line

Paginating Clusters

There are two parameters that can help you retrieve just a portion of
your clusters and paginate them.

Pagination Parameters

Parameter

Type

Description

limitoptional

Integer,
default is 20

Specifies
the number of clusters to retrieve. Must be less than or equal to 200.

offsetoptional

Integer,
default is 0

The order number from which the cluster listing will start.

If a limit is given, no more than that many
clusters will be returned, but possibly fewer, if the
request itself yields fewer clusters.

For example, if you want to retrieve only the third and fourth latest
clusters:

curl "https://bigml.io/cluster?$BIGML_AUTH;limit=2;offset=2"

$ Paginating clusters from the command line

To paginate results, you need to start off with an
offset of zero, then increment it by whatever value
you use for the limit each time. So if you wanted to
return clusters 1-10, then 11-20, then 21-30, etc., you would use
"limit=10;offset=0",
"limit=10;offset=10", and
"limit=10;offset=20", respectively.

Updating a Cluster

To update a cluster, you need to PUT an object containing the fields that you want to
update to the cluster's URL. The content-type must always be: "application/json".
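For example, to update a cluster with a new name you can use curl like this (the cluster/id is a placeholder):

curl "https://bigml.io/cluster/4f66a80803ce8940c5000007?$BIGML_AUTH" \
  -X PUT \
  -H 'content-type: application/json' \
  -d '{"name": "a new name"}'

$ Updating a cluster's name from the command line

If the request succeeds, BigML.io will respond with a 202 accepted code and with the new updated cluster in the body of the message.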

Deleting a Cluster

To delete a cluster, you need to issue an HTTP DELETE request to the
cluster/id to be deleted.

If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return HTTP 204 responses with no body.
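A sketch of the DELETE request, with a placeholder cluster/id:

curl -X DELETE "https://bigml.io/cluster/4f66a80803ce8940c5000007?$BIGML_AUTH"

$ Deleting a cluster from the command line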

HTTP/1.1 204 NO CONTENT
Content-Length: 0

< Successful response

Once you delete a cluster, it is permanently deleted. That is, a delete request cannot be undone.
If you try to delete a cluster a second time, or to delete a cluster that does not
exist you will receive an error like this:

Anomaly detectors

Last Updated: Thursday, 2015-02-05 23:59

An anomaly detector is a predictive model that can
help identify the instances within a dataset that do not conform to a regular pattern.
It can be useful for tasks like data cleansing, identifying unusual instances,
or, given a new data point, deciding whether a model is competent to make a prediction or not.

Anomaly detectors can be applied to a variety of domains like fraud
detection, security, quality control, medicine, etc.

BigML anomaly detectors are built using an unsupervised anomaly
detection technique. Therefore, you do not need to explicitly
label each instance in your dataset as "normal" or "abnormal".

When you create a new anomaly detector, it automatically returns an
anomaly score for the top n most anomalous instances. The newly created
anomaly detector can also be used to later create anomaly scores for
new data points or batch anomaly scores for all the instances of a
dataset.

Creating an Anomaly detector

To create a new anomaly detector, you need to POST to the
anomaly detector base URL an object containing at least
the dataset/id that you want to use to create the
anomaly detector. The content-type must always be
"application/json".

You can easily create a new anomaly detector using
curl as follows. All you need is a valid
dataset/id and your authentication variable set up as
indicated above.
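A minimal sketch of the request, reusing the placeholder dataset/id from the table below:

curl "https://bigml.io/anomaly?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006"}'

$ Creating an anomaly detector from the command line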

Arguments

In addition to the dataset, you can also POST the
following arguments.

Anomaly detector Creation Arguments

Argument

Type

Description

categoryoptional

Integer,
default is the category of the dataset

The category that
best describes the anomaly detector. See the category table
for the complete list of categories.Example: 1

constraintsoptional

Boolean,
default is false.

An experimental option which adds more predicates to each node in the
tree. These predicates help capture expectations about the data, making
the tree more sensitive to anomalies. This option tends to inflate the
anomaly scores and requires more CPU time to build and evaluate. However,
it also seems to make the trees more effective at flagging anomalous
data that was not in the training set. It also seems to improve the
forest's effectiveness on categorical data.
Example: false

datasetrequired

String

A
valid dataset/id.Example: dataset/4f66a80803ce8940c5000006

descriptionoptional

String

A description of the anomaly detector of up to 8192 characters.
Example: "This is a description of my new anomaly detector"

excluded_fieldsoptional

Array,
default is [], an empty list. None of the fields in the
dataset is excluded.

Specifies the fields that won't
be included in the anomaly detector.Example:

["000000", "000002"]

fieldsoptional

Object,
default is {}, an empty dictionary. That is, no
names or preferred statuses are changed.

This can be
used to change the names of the fields in the
anomaly detector with respect to the original names in the
dataset or to tell BigML that certain fields should
be preferred. Add an entry keyed with the field id generated in the source
for each field whose name you want updated. Example:

{
"000001": {"name": "length_1"},
"000003": {"name": "length_2"}
}

id_fieldsoptional

Array,
default is [].

Specifies the ids for fields which will be included when computing the
top anomalies, but will not be considered to create the
anomaly detector. The id_fields must be a strict subset of
the input_fields, where the latter is adjusted before
passing it to the model creation algorithm by setting it to
all non-preferred fields if not provided explicitly, adding
to it explicit id_fields, and subtracting
explicit excluded_fields.
You can use either field identifiers or field names.
Example:

["000001", "000003"]

input_fieldsoptional

Array,
default is []. All the fields in the
dataset

Specifies the fields to be
considered to create the anomaly detector.Example:

["000001", "000003"]

forest_sizeoptional

Integer,
default is 128 (16 in development mode)

The
number of trees used by the anomaly detector. Must be a
number greater than or equal to 2 and less than or equal to 1000 (or
16 in development mode).Example: 256

nameoptional

String,
default is dataset's name

The name you want to give
to the new anomaly detector. Example: "my new anomaly detector".

out_of_bagoptional

Boolean,
default is false.

Setting
this parameter to true will return a
sequence of the out-of-bag instances instead of the sampled
instances. See the Section on
Sampling for more details.Example: true

rangeoptional

Array,
default is [1, max rows in the
dataset].

The range of
successive instances to build the anomaly detector. Example: [1, 150]

replacementoptional

Boolean,
default is false

Whether
sampling should be performed with or without
replacement. See the Section on
Sampling below. Example: true

sample_rateoptional

Float,
default is 1.0.

A real number between 0 and
1 specifying the sample rate. See the Section on
Sampling below. Example: 0.5

seedoptional

String

A
string to be hashed to generate deterministic samples.
See the Section on Sampling below.Example: "MySample"

tagsoptional

Array
of Strings

A list of strings that help classify and
index your anomaly detector.Example: ["web analytics", "2014"].

top_noptional

Integer,
default is 10

The
number of instances that will be returned together with the
anomaly detector that were scored as most anomalous.
If set to 0, no scoring will be produced during training. The
maximum number is 1024.
Example: 256

You can also use curl to customize a new
anomaly detector. For example, to create a new anomaly detector named "my anomaly detector", with only certain rows, and
with only three fields:
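The following is a sketch of such a request; the dataset/id and field ids are placeholders:

curl "https://bigml.io/anomaly?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006",
       "name": "my anomaly detector",
       "range": [1, 150],
       "input_fields": ["000001", "000002", "000003"]}'

$ Creating a customized anomaly detector from the command line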

If you do not specify a name, BigML.io will assign to
the new anomaly detector the dataset's name. If
you do not specify a range of instances, BigML.io
will use all the instances in the dataset. If you do
not specify any input fields, BigML.io will include all the input fields in the
dataset.

Read the Section on Sampling
your Dataset to learn how to sample your dataset. Here's an example of an anomaly detector request with range and sampling specifications:
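A sketch of such a request (the dataset/id is a placeholder):

curl "https://bigml.io/anomaly?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006",
       "range": [1, 150],
       "sample_rate": 0.5,
       "seed": "MySample"}'

$ Creating an anomaly detector with range and sampling from the command line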

Retrieving an Anomaly detector

Each anomaly detector has a unique identifier in the form
anomaly/id where id is a string of 24 alpha-numeric
characters that you can use to retrieve the anomaly detector or as a
parameter to create anomaly scores.

Retrieving an anomaly detector with curl is
extremely easy.

curl "https://bigml.io/anomaly/54230c1af0a5ea0c24000000?$BIGML_AUTH"

$ Retrieving an anomaly detector from the command line

You can also use your browser to visualize the anomaly detector using the full
BigML.io URL or pasting the anomaly/id
into the BigML.com dashboard.

Properties

Once an anomaly detector has been successfully created it will
have the following properties.

Anomaly detector Properties

Property

Type

Description

anomaly_seed

String

With no seed, the anomaly detector can vary from run to
run. With a seed, the anomaly detector is deterministic.

categoryfilterable,
sortable, updatable

Integer

One
of the categories in the table of
categories that help classify this
resource according to the domain of application.

code

Integer

HTTP
status code. This can be 201 upon the
anomaly detector creation and
200 after it. Make sure that you check the code that comes
with the status attribute to make sure that the
anomaly detector creation
has been completed without errors.

createdfilterable, sortable

String

This is the date and time in which
the anomaly detector was created with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

creditsfilterable,
sortable

Float

The
number of credits it cost you to create this
anomaly detector.

credits_per_predictionfilterable,
sortable, updatable

Float

This
is the number of credits that other users will
consume to make a prediction with your
anomaly detector if you made it public.

datasetfilterable,
sortable

String

The
dataset/id that was used to build the
anomaly detector.

dataset_field_types

Object

A
dictionary that informs about the number of fields of
each type in the dataset used to create the anomaly detector.
It has an entry per each field type
(categorical,
datetime, numeric,
and text), an entry for
preferred fields, and an entry for the
total number of fields.

dataset_statusfilterable,
sortable

Boolean

Whether the
dataset is still available or has
been deleted.

descriptionupdatable

String

A
text describing the anomaly detector. It can contain restricted markdown
to decorate the text.

dev

Boolean

True when the
anomaly detector has been built in development mode.

excluded_fields

Array

The list of field ids that were
excluded when building the anomaly detector.

fields_meta

Object

A
dictionary with meta information about the fields
dictionary. It specifies the
total number of
fields, the current offset and
limit, and the number
of fields (count) returned.

forest_size

Integer

The
number of individual trees the anomaly detector will contain.

id_fields

Array

The list of ids of the id_fields used to build the anomaly detector.

input_fields

Array

The list of input fields' ids used to build the anomaly detector.

locale

String

The
dataset's locale.

max_columnsfilterable,
sortable

Integer

The total number of
fields in the dataset used to
build the anomaly detector.

max_rowsfilterable,
sortable

Integer

The maximum number of
instances in the dataset that can
be used to build the anomaly detector.

model

Object

All the
information that you need to recreate or use the anomaly
detector on your own. It includes the fields dictionary describing the fields
and their summaries, the tree structures that make up the
model, and the top anomalies found in the dataset. See
the Model
Object definition below.

namefilterable,
sortable, updatable

String

The
name of the anomaly detector as you provided
or based on the name of the dataset by default.

number_of_anomalyscoresfilterable,
sortable

Integer

The
current number of anomaly scores that use this
anomaly detector.

number_of_batchanomalyscoresfilterable,
sortable

Integer

The
current number of batch anomaly scores that use this
anomaly detector.

number_of_public_anomalyscoresfilterable,
sortable

Integer

The
current number of public anomaly scores that use this
anomaly detector.

out_of_bagfilterable,
sortable

Boolean

Whether the out-of-bag instances were used to create the
anomaly detector instead of the sampled instances.

pricefilterable,
sortable, updatable

Float

The price other users must pay to clone your anomaly detector.

privatefilterable,
sortable, updatable

Boolean

Whether
the anomaly detector is public or not. In a future version, you will be able to share
anomaly detectors with other coworkers or, if desired, make them publicly available.

projectfilterable,
sortable

String

The
project/id the resource belongs to.NEW

range

Array

The
range of instances used to build the
anomaly detector.

replacementfilterable,
sortable

Boolean

Whether the instances sampled to build the
anomaly detector were selected using replacement or not.

resource

String

The
anomaly/id.

rowsfilterable,
sortable

Integer

The
total number of instances used to build the anomaly detector.

sample_ratefilterable, sortable

Float

The sample rate used to select instances
from the dataset to build the
anomaly detector.

seedfilterable,
sortable

String

The string that was used to generate the sample.

sharedfilterable,
sortable, updatable

Boolean

Whether the anomaly detector is shared using a private link or not.

shared_hash

String

The hash that gives access to this
anomaly detector if it has been shared using a private link.

sharing_key

String

The alternative key that gives read access to
this anomaly detector.

sizefilterable,
sortable

Integer

The
number of bytes of the dataset that
were used to create this anomaly detector.

sourcefilterable,
sortable

String

The
source/id that was used to build the
dataset.

source_statusfilterable,
sortable

Boolean

Whether the
source is still available or has
been deleted.

status

Object

A
description of the status of the
anomaly detector. It includes a code, a message,
and some extra information. See the table below.

subscriptionfilterable,
sortable

Boolean

Whether
the anomaly detector was created using a subscription plan or not.

tagsupdatable

Array of
Strings

A list of user tags that can help classify and
index this resource.

top_nfilterable,
sortable

Integer

The number of
top anomalies returned after scoring each row in the
training dataset.

updatedfilterable, sortable

String

This is the date and time in which
the anomaly detector was last updated with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

white_boxfilterable,
sortable

Boolean

Whether the anomaly detector is publicly shared as a white-box
anomaly detector.

The Model Object of an anomaly detector has the following properties:

Anomaly detector Model Object Properties

Property

Type

Description

fieldsupdatable

Object

A
dictionary with an entry per field in the dataset used to
build the anomaly detector. Fields are paginated according to the
fields_meta attribute described above.
Each entry includes the column number in original
source, the name of the field, the type of the field, and the
summary. See this Section for more
details.

trees

Array

A
list with the trees representing the anomaly detector. Each
tree conforms to the Root Object definition.

A Top Anomalies Object has the following properties:

Top Anomalies Object Properties

Property

Type

Description

importance

Array

A list of
floats with the relative importance for each field. The
importances tell us which values contributed most to the
anomaly score.

row

Array

The list of values of the row included in the top
anomalies.

score

Float, between 0 and 1

The
anomaly score. A number between 0 and 1. The closer to 1,
the more anomalous the row.

Anomaly Detector Status

Creating an anomaly detector is a process that can take just a few
seconds or a few days depending on the size of the
dataset used as input and on the workload of BigML's
systems. The anomaly detector goes through a number of
states until it's fully completed. Through the status field in the
anomaly detector you can determine when the anomaly detector has been
fully processed and is ready to be used to create anomaly scores. These are the
properties that an anomaly detector's status has:

Anomaly Detector Status Properties

Field

Type

Description

code

Integer

A status
code that reflects the status of the anomaly detector creation. It can be any of the explained here.

elapsed

Integer

Number
of milliseconds that BigML.io took
to process the anomaly detector.

message

String

A human
readable message explaining the status.

progress

Float, between 0 and
1

How far BigML.io has progressed
building the anomaly detector.

Once an anomaly detector has been successfully created, it will look like:

Filtering and Paginating Fields from an Anomaly detector

An anomaly detector might be composed of hundreds or even thousands of
fields. Thus when retrieving an anomaly detector, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):

Parameters to filter fields from an anomaly detector

Parameter

Type

Description

fieldsoptional

Comma-separated
list

A comma-separated list of field IDs to retrieve.

iprefixoptional

String

A case-insensitive string to retrieve fields whose names
start with the given prefix. It is possible to specify
more than one iprefix by repeating the parameter, in which case the union of the results is returned.

fulloptional

Boolean

If
false, no information about fields is returned.

limitoptional

Integer

Maximum
number of fields that you will get in
the fields field.

offsetoptional

Integer

How
far off from the first field in your
dataset is the first
field in the fields field.

prefixoptional

String

A case-sensitive string to retrieve fields whose names start with the given prefix. It is possible to specify more than one prefix by repeating the parameter, in which case the union of the results is returned.

Since the fields field is a map and therefore not
ordered, the returned fields contain an additional key, "order," whose
integer (increasing) value gives you their ordering. In all other
respects, the anomaly detector is the same as the one you would get without any
filtering parameter above.

The fields_meta field can help you paginate fields. Its
structure is as follows:

Fields Meta Object Properties

Property

Type

Description

count

Integer

Specifies
the current number of fields in the resource.

limit

Integer

The
maximum number of fields that will be returned in the
resource.

offset

Integer

The
current offset in the pagination of fields.

total

Integer

The
total number of fields in the resource.

Note that paginating fields might only be worthwhile if you are going to
deal with really wide anomaly detectors (i.e., more than 200 fields).

Listing Anomaly detectors

To list all your anomaly detectors you can use the
anomaly detector base URL. By default, only the 20 most recent
anomaly detectors will be returned. You can see below how to change this number
using the limit parameter.

You can list your anomaly detectors directly in your browser using your own username and API key
with the following link.

Filtering Anomaly detectors

The listings of anomaly detectors can be filtered by any of the fields that we
labeled as filterable in the table describing
attributes above. For example, to retrieve all the
anomaly detectors named "iris":
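A sketch of that request from the command line:

curl "https://bigml.io/anomaly?$BIGML_AUTH;name=iris"

$ Filtering anomaly detectors by name from the command line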

In addition to exact match, there are four more filters that you can
use. To add one of these filters to your request you just need to
append one of the suffixes in the following table to the name of the attribute that you want to use as a filter.

Anomaly detector Filters

Filter

Description

__ltoptional

Less
than. Example: size__lt=1024

__lteoptional

Less
than or equal to. Example:
size__lte=1024

__gtoptional

Greater
than. Example: size__gt=1024

__gteoptional

Greater
than or equal to. Example:
size__gte=1024

For example, to get your anomaly detectors that are bigger than
one megabyte:
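A sketch of that request, using the __gt filter (one megabyte is 1048576 bytes):

curl "https://bigml.io/anomaly?$BIGML_AUTH;size__gt=1048576"

$ Filtering anomaly detectors by size from the command line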

You can do the same thing from the command line using curl
as follows:

curl "https://bigml.io/anomaly?$BIGML_AUTH;order_by=-size"

$ Listing anomaly detectors ordered by size from the command line

Paginating Anomaly detectors

There are two parameters that can help you retrieve just a portion of
your anomaly detectors and paginate them.

Pagination Parameters

Parameter

Type

Description

limitoptional

Integer,
default is 20

Specifies
the number of anomaly detectors to retrieve. Must be less than or equal to 200.

offsetoptional

Integer,
default is 0

The order number from which the anomaly detector listing will start.

If a limit is given, no more than that many
anomaly detectors will be returned, but possibly fewer, if the
request itself yields fewer anomaly detectors.

For example, if you want to retrieve only the third and fourth latest
anomaly detectors:

curl "https://bigml.io/anomaly?$BIGML_AUTH;limit=2;offset=2"

$ Paginating anomaly detectors from the command line

To paginate results, you need to start off with an
offset of zero, then increment it by whatever value
you use for the limit each time. So if you wanted to
return anomaly detectors 1-10, then 11-20, then 21-30, etc., you would use
"limit=10;offset=0",
"limit=10;offset=10", and
"limit=10;offset=20", respectively.

Updating an Anomaly detector

To update an anomaly detector, you need to PUT an object containing the fields that you want to
update to the anomaly detector's URL. The content-type must always be: "application/json".

If the request succeeds, BigML.io will respond
with a 202 accepted code and with the new updated
anomaly detector in the body of the message.

For example, to update an anomaly detector with a new name you can use curl like this:
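A sketch of that request, reusing the placeholder anomaly/id from the retrieval example above:

curl "https://bigml.io/anomaly/54230c1af0a5ea0c24000000?$BIGML_AUTH" \
  -X PUT \
  -H 'content-type: application/json' \
  -d '{"name": "a new name"}'

$ Updating an anomaly detector's name from the command line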

Deleting an Anomaly detector

To delete an anomaly detector, you need to issue an HTTP DELETE request to the
anomaly/id to be deleted.

If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return HTTP 204 responses with no body.
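A sketch of the DELETE request, with a placeholder anomaly/id:

curl -X DELETE "https://bigml.io/anomaly/54230c1af0a5ea0c24000000?$BIGML_AUTH"

$ Deleting an anomaly detector from the command line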

HTTP/1.1 204 NO CONTENT
Content-Length: 0

< Successful response

Once you delete an anomaly detector, it is permanently deleted. That is, a delete request cannot be undone.
If you try to delete an anomaly detector a second time, or to delete an anomaly detector that does not exist you will receive an error like this:

Predictions

Last Updated: Thursday, 2015-02-05 23:59

A prediction is created using either a
model/id or an ensemble/id and the new instance (input_data) for which you wish to create a prediction.

When you create a new prediction, BigML.io will
automatically navigate the corresponding model to find
the leaf node that best classifies the new instance. If you create a
new prediction using an ensemble, the same process is repeated for each
model in the ensemble. Then all the predictions from the individual
models in the ensemble are combined to return a
final prediction using one of the strategies described below.

Creating a Prediction

To create a new prediction, you need to POST to the
prediction base URL an object containing at least
the model/id that you want to use to create the
prediction. The content-type must always be
"application/json".

You can easily create a new prediction using
curl as follows. All you need is a valid
model/id or valid ensemble/id, some input data, and your authentication variable set up as
indicated above. For example:
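A minimal sketch of such a request, reusing the placeholder model/id and input_data from the table below:

curl "https://bigml.io/prediction?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"model": "model/4f67c0ee03ce89c74a000006",
       "input_data": {"000000": 5, "000001": 3}}'

$ Creating a prediction from the command line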

Arguments

In addition to the model or ensemble, you can also POST the
following arguments.

Prediction Creation Arguments

Argument

Type

Description

combineroptional

Integer,
default is 0

Specifies the method used to combine the predictions
from the individual models when the prediction is created
using an ensemble. The options are:

0: plurality gives one vote to the prediction of
each model in the ensemble and chooses the most
voted prediction.

1: confidence weighted weights the vote of each
model in the ensemble using the confidence of its
prediction.

2: probability weighted uses the
probability of the class in the distribution of classes in
the leaf node as weight.

3: threshold-based uses a given
threshold k (by default 1) to
predict a given class
(by default the minority class). You can set up
both using the threshold argument.
If there are fewer than k models
voting for class, the most frequent
of the remaining categories is chosen, as in a
plurality combination after removing the models
that were voting for class.
The confidence of the prediction is computed as
that of a plurality vote, excluding votes for the
majority class when it's not selected.

For regression ensembles, the predicted values are averaged.

descriptionoptional

String

A description of the prediction of up to 8192 characters.Example: "This is a description of my new prediction"

ensembleoptional

String

A valid ensemble/id.Example: ensemble/517020d53c1920a514000056

input_datarequired

Object

An object with field's id/value
pairs representing the instance you want to create a
prediction for.Example: {"000000": 5, "000001": 3}

missing_strategyoptional

Integer,
default is 0

Specifies the method that should be
used when a missing split is found. That is, when a missing value is found in the input data for a
decision node. The options are:

0: last prediction predicts based on the subset of the data which reached the parent of the missing split.

1: proportional evaluates all the subtrees of a missing split and recombines their predictions based on the proportion of data in each subtree

Example: 1

modeloptional

String

A
valid model/id.Example: model/4f67c0ee03ce89c74a000006

nameoptional

String, default is Prediction for model's name

The name you want to give to the new prediction.Example: "my new prediction".

privateoptional

Boolean, default is true

Whether you want your prediction to be private or not.Example: false

tagsoptional

Array of Strings

A list of strings that help classify and
index your prediction.Example: ["best customers", "2012"].

thresholdoptional

Object

A dictionary with two optional keys:

k: indicates the minimum number of instances
of the category denoted by class that must be present in
the predictions tally (with one vote per model)
to produce as a result of the combination
that category. It's 1 by default.

class: if absent, the category
that appears least frequently in the training data
is chosen as the "class" value (with ties broken
using lexicographical ordering). This is the usual
default in some systems trying to detect anomalies
(e.g. IDS and the like), and other uses of this
combiner should probably not rely on our default
value.

Example: {"k": 2, "class": "attack"}.

You can use curl to customize new
predictions. For example, to create a new prediction
named "my prediction":

If you do not specify a name, BigML.io will assign to
the new prediction a name based on the model's
objective field.

Retrieving a Prediction

Each prediction has a unique identifier in the form
prediction/id where id is a string of 24 alpha-numeric
characters that you can use to retrieve the prediction or as a
parameter to create predictions.

You can also use your browser to visualize the
prediction using the full
BigML.io URL or pasting the prediction/id
into the BigML.com dashboard.

Properties

Once a prediction has been successfully created it will
have the following properties.

Prediction Properties

Property

Type

Description

categoryfilterable,
sortable, updatable

Integer

One
of the categories in the table of
categories that help classify this
resource according to the domain of application.

code

Integer

HTTP
status code. This can be 201 upon the
prediction creation and
200 after it. Make sure that you check the code that comes
with the status attribute to verify that the
prediction creation
has been completed without errors.

combiner

Integer

The
method used to combine predictions from the ensemble. See
the available combiners above.

confidence

Float

For
classification models, a
number between 0 and 1 that expresses how certain the model is
of the prediction. For
regression models, a number mapped to the top end of a 95% confidence
interval around the expected error at that node
(measured using the variance of the output at the node). Note
that for models you might have created using the first versions of
BigML this value might be null.

createdfilterable,
sortable

ISO-8601 string

This is the date and time in which
the prediction was created with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

creditsfilterable,
sortable

Float

The
number of credits it cost you to create this
prediction.

datasetfilterable,
sortable

String

The
dataset/id that was used to build the
model (or ensemble).

dataset_statusfilterable,
sortable

Boolean

Whether the
dataset is still available or has
been deleted.

descriptionupdatable

String

A
text describing the prediction. It can contain restricted markdown
to decorate the text.

dev

Boolean

True when the
prediction has been created in development mode.

ensemble

filterable,
sortable

String

The
ensemble/id that was used to create the
prediction.

error_predictions

Integer

The
number of predictions in the ensemble that failed.

fields

Object

A
dictionary with an entry per field in the
input_data or
prediction_path.
Each entry includes the column number in original
source, the name of the field, the type of the field, and the
specific datatype.

finished_predictions

Integer

The
number of predictions in the ensemble that succeeded.

input_data

Object

The dictionary of input fields' ids and
values used as input for the prediction.

locale

String

The
dataset's locale.

missing_strategyoptional

Integer,
default is 0 (last prediction).

Specifies
the type of strategy that a model will follow when a
missing value needed to continue with inference in the
model is found. The possible values are:

0 Last prediction

1 Proportional

modelfilterable,
sortable

String

The
model/id that was used to create the
prediction.

models

Array

A
list of the model/id that compose the
ensemble.

model_type

Integer

Either
0 or 1 to specify
respectively whether the prediction is from a single
model or
an ensemble.

model_statusfilterable,
sortable

Boolean

Whether the
model is still available or has
been deleted.

namefilterable,
sortable, updatable

String

The
name of the prediction as you provided it,
or, by default, a name based on the
objective field.

number_of_models

Integer

The
number of models in the ensemble.

objective_field

String

The
id of the field that the model predicts.

objective_fieldsfilterable, sortable

Array

Specifies the id of the field that
the model predicts. Even if this is an
array the current version of BigML.io only accepts one
objective field.Example:
["000004"].

objective_field_name

String

The
name of the objective field.

output

String / Number

The actual prediction. A string if the
task is classification and a number if the
task is regression.

predictionfilterable,
sortable

Object

A
dictionary keyed with the objective field
to get the prediction output.Example:
{"000004": "Iris-virginica"}.

predictions

Array

An
array with a prediction object for each model in the
ensemble. The prediction object includes:

confidence: the
confidence or expected error for the
prediction. See confidence above.

count: the total number
of instances at the leaf node.

distribution: the actual
distribution of predictions at the leaf node.

prediction: the actual
prediction of the corresponding model.

prediction_path

Object

A
Prediction Path Object specifying the
decision path followed
to make the prediction, the next predicates, and lists of
unknown fields and bad fields submitted.

privatefilterable,
sortable, updatable

Boolean

Whether
the prediction is public or not. In a future version, you will be able to share
predictions with other coworkers or, if desired, make them publicly available.

resource

String

The
prediction/id.

query_string

String

The
query string that was used to filter the model.

sourcefilterable,
sortable

String

The
source/id that was used to build the
dataset.

source_statusfilterable,
sortable

Boolean

Whether the
source is still available or has
been deleted.

status

Object

A
description of the status of the
prediction. It includes a code, a message,
and some extra information. See the table below.

subscriptionfilterable,
sortable

Boolean

Whether
the prediction was created using a subscription plan or not.

tagsupdatable

Array of Strings

A list of user tags that can help classify and
index this resource.

task

String

Either
classification or
regression depending on whether the
objective field is categorical or numeric.

threshold

Object

The parameters (k and
class) given when a threshold-based
combiner is used.

updatedfilterable,
sortable

ISO-8601 string

This is the date and time in which
the prediction was last updated with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

votes

Array

A list
of the prediction candidates according to the different
combiners explained above.

vote_confidence

Array

A
list of the confidence (or expected error in regression
ensembles) for each prediction candidate.

A Prediction Path Object has the following properties:

Prediction Path Object Properties

Property

Type

Description

bad_fields

Array

An
array of field ids with wrong values
submitted. Bad fields are ignored. That is, if you submit a
value that is wrong, a prediction is created anyway ignoring
the input field with the wrong value.

confidence

Float

For
classification models, a
number between 0 and 1 that expresses how certain the model is
of the prediction. For
regression models, a number mapped to the top end of a 95% confidence
interval around the expected error at that node
(measured using the variance of the output at the node). Note
that for models you might have created using the first versions of
BigML this value might be null.

next_predicates

Array

An
array of Predicate Objects with the children of the deepest
node that was reached with the input_data.

objective_summary

Object

A
dictionary keyed by categories (classification
models) or counts (regression models) that
represents a histogram of possible target values for the
instance.

path

Array

An
ordered array of Predicate Objects in the decision path from
the root to the current node or to a final decision if the
next predicate array is empty.

unknown_fields

Array

An
array of field ids that were submitted in
the input_data and were not recognized.
Unknown fields are ignored. That is, if you submit a
field that is wrong, a prediction is created anyway ignoring
the wrong input field.

Confidence

The confidence field provides a measure of how certain the model is of the prediction.

For classification models, confidence is the lower end of a binomial-style
confidence interval, where 1 indicates absolute certainty and 0
indicates no better than a random guess. Technically it is the lower
bound of Wilson score confidence interval for a
Bernoulli parameter. Read how it works in layman's terms here.
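In its standard form (a sketch; the exact parameterization BigML uses is not spelled out here), the Wilson lower bound for a node with n instances, an observed proportion \hat{p} of the predicted class, and normal quantile z (z ≈ 1.96 at the 95% level) is:

\[
\textrm{confidence} = \frac{\hat{p} + \frac{z^2}{2n} - z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}
\]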

For
regression models,
confidence is the upper end of a confidence interval around the
expected error for that prediction.

For more detailed information about the distribution of the target for the
given instance, the objective_summary field provides a histogram of possible target values for the instance. The default prediction is the mean or mode of this distribution (for regression and classification, respectively), but one could use this distribution to make more sophisticated choices such as classification according to a specific loss function.

Predicate Objects have the following properties:

Predicate Object Properties

Property

Type

Description

field

String

Field's
id used for this decision.

operator

String

Type
of test used for this field.

value

Number or
String

The value of the field used to make the decision at this node.

Prediction Status

Creating a prediction is a near real-time process that takes just a few
seconds depending on whether the corresponding
model has been used recently and the workload of BigML's
systems. The prediction goes through a number of
states until it's fully completed. Through the status field in the
prediction you can determine when the prediction has been
fully processed and is ready to be used. Most of the time, predictions are
fully processed and the output is returned in the first call. These are
the properties that a prediction's status has:

Prediction Status Properties

Field

Type

Description

code

Integer

A status
code that reflects the status of the prediction creation. It can be any of the ones explained here.

message

String

A human
readable message explaining the status.

Listing Predictions

To list all your predictions you can use the
prediction base URL. By default, only the 20 most recent
predictions will be returned. You can see below how to change this number
using the limit parameter.

Filtering Predictions

The listings of predictions can be filtered by any of the fields that we
labeled as filterable in the table describing
properties above.

In addition to exact match, there are four more filters that you can
use. To add one of these filters to your request you just need to
append one of the suffixes in the following table to the name of the attribute that you want to use as a filter.

Prediction Filters

Filter

Description

__ltoptional

Less
than. Example:
created__lt=2012-01-12

__lteoptional

Less
than or equal to. Example:
created__lte=2012-01-12

__gtoptional

Greater
than. Example: created__gt=2012-01-12

__gteoptional

Greater
than or equal to. Example:
created__gte=2012-01-12

For example, to get your predictions that were created
after January 12, 2012:
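A sketch of the corresponding request from the command line:

curl "https://bigml.io/prediction?$BIGML_AUTH;created__gt=2012-01-12"

$ Filtering predictions by creation date from the command line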

Paginating Predictions

To paginate results, you need to start off with an
offset of zero, then increment it by whatever value
you use for the limit each time. So if you wanted to
return predictions 1-10, then 11-20, then 21-30, etc., you would use
"limit=10;offset=0",
"limit=10;offset=10", and
"limit=10;offset=20", respectively.

Updating a Prediction

To update a prediction, you need to PUT an object containing the fields that you want to
update to the prediction's URL. If the request succeeds, BigML.io will respond
with a 202 accepted code and with the new updated
prediction in the body of the message. For example, to update a prediction with a new name you can use curl like this:
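A minimal sketch (the prediction/id shown is a placeholder in the documented 24-character format):

curl "https://bigml.io/prediction/4f6a014b03ce89584500000f?$BIGML_AUTH" \
    -X PUT \
    -H 'content-type: application/json' \
    -d '{"name": "my new prediction"}'

$ Updating a prediction's name from the command line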

Deleting a Prediction

To delete a prediction, you need to issue an HTTP DELETE request to the prediction/id to be deleted.

If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return HTTP 204 responses with no body.

HTTP/1.1 204 NO CONTENT
Content-Length: 0

< Successful response

Once you delete a prediction, it is permanently deleted. That is, a delete request cannot be undone.
If you try to delete a prediction a second time, or a prediction that does not
exist, you will receive an error like this:

Creating a Centroid

To create a new centroid, you need to POST to the
centroid base URL an object containing at least
the cluster/id that you want to use to find the
centroid and an instance without any missing values.
For example, you can easily create a new centroid using
curl as follows:
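A minimal sketch (the cluster/id is the example id from the arguments table below):

curl "https://bigml.io/centroid?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"cluster": "cluster/4f67c0ee03ce89c74a000006",
         "input_data": {"000000": 5, "000001": 3}}'

$ Creating a centroid from the command line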

Arguments

In addition to the cluster and the
input_data, you can also POST the following arguments.

Centroid Creation Arguments

Argument

Type

Description

categoryoptional

Integer,
default is the category of the cluster

The category that
best describes the centroid. See the category table
for the complete list of categories.Example: 1

cluster

String

A
valid cluster/id.Example: cluster/4f67c0ee03ce89c74a000006

descriptionoptional

String

A description of the centroid of up to 8192 characters.Example: "This is a description of my new centroid"

input_datarequired

Object

An object with field id/value
pairs representing the instance you want to find the
closest centroid for. You can use either field ids or field
names as keys in your input_data.Example: {"000000": 5, "000001": 3}

nameoptional

String, default is Centroid for cluster's name

The name you want to give to the new centroid.Example: "my new centroid".

privateoptional

Boolean, default is true

Whether you want your centroid to be private or not.Example: false

tagsoptional

Array of Strings

A list of strings that help classify and
index your centroid.Example: ["best customers", "2012"].

You can use curl to customize new
centroids. For example, to create a new centroid
named "my centroid":

If you do not specify a name, BigML.io will assign to
the new centroid a default name.

Retrieving a Centroid

Each centroid has a unique identifier in the form
centroid/id where id is a string of 24 alpha-numeric
characters that you can use to retrieve the centroid or as a
parameter to create centroids.

Retrieving a centroid with curl is
extremely easy.

curl "https://bigml.io/centroid/53798eda3c1920ee08000026?$BIGML_AUTH"

$ Retrieving a centroid from the command line

You can also use your browser to visualize the
centroid using the full
BigML.io URL or pasting the centroid/id
into the BigML.com dashboard.

Properties

Once a centroid has been successfully created it will
have the following properties.

Centroid Properties

Property

Type

Description

categoryfilterable,
sortable, updatable

Integer

One
of the categories in the table of
categories that help classify this
resource according to the domain of application.

centroid

Object

A
dictionary describing the centroid. See the Centroid Object definition below.

centroid_id

String

Id assigned
to identify the centroid in the cluster.

centroid_name

String

Name
associated to the centroid in the cluster.

clusterfilterable,
sortable

String

The
cluster/id that was used to create the
centroid.

cluster_type

Integer

Reserved
for further use.

cluster_statusfilterable,
sortable

Boolean

Whether the
cluster is still available or has
been deleted.

code

Integer

HTTP
status code. This can be 201 upon the
centroid creation and
200 after it. Make sure that you check the code that comes
with the status attribute to verify that the
centroid creation
has been completed without errors.

updatedfilterable,
sortable

ISO-8601 string

This is the date and time in which
the centroid was last updated with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

A Centroid Object has the following properties:

Centroid Object Properties

Property

Type

Description

center

Object

A dictionary keyed by field id that
represents the point that is at the center of the centroid.

count

Integer

The
number of instances in the centroid.

distance

Float

The distance between the input_data and
the center of the centroid.

id

String

The
id of the centroid.

name

String

The name
of the centroid.

Centroid Status

Creating a centroid is a near real-time process that takes just a few
seconds depending on whether the corresponding
cluster has been used recently and the workload of BigML's
systems. The centroid goes through a number of
states until it's fully completed. Through the status field in the
centroid you can determine when the centroid has been
fully processed and is ready to be used. Most of the time, centroids are
fully processed and the output is returned in the first call. These are
the properties that a centroid's status has:

Centroid Status Properties

Field

Type

Description

code

Integer

A status
code that reflects the status of the centroid creation. It can be any of the ones explained here.

elapsed

Float

Number
of milliseconds that BigML.io took
to find the centroid.

message

String

A human
readable message explaining the status.

progress

Float

How
far BigML.io has progressed computing the centroid.

unknown_fields

Array

An
array of field ids that were submitted in
the input_data and were not recognized.
Unknown fields are ignored. That is, if you submit a
field that is wrong, a centroid is created anyway ignoring
the wrong input field.

Listing Centroids

To list all your centroids you can use the
centroid base URL. By default, only the 20 most recent
centroids will be returned. You can see below how to change this number
using the limit parameter.

You can list your centroids directly in your browser using your own username and API key
with the following link.

Filtering Centroids

The listings of centroids can be filtered by any of the fields that we
labeled as filterable in the table describing
properties above.

In addition to exact match, there are four more filters that you can
use. To add one of these filters to your request you just need to
append one of the suffixes in the following table to the name of the attribute that you want to use as a filter.

Centroid Filters

Filter

Description

__ltoptional

Less
than. Example:
created__lt=2012-01-12

__lteoptional

Less
than or equal to. Example:
created__lte=2012-01-12

__gtoptional

Greater
than. Example: created__gt=2012-01-12

__gteoptional

Greater
than or equal to. Example:
created__gte=2012-01-12

For example, to get your centroids that were created
after May 1, 2014:
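A sketch of the corresponding request:

curl "https://bigml.io/centroid?$BIGML_AUTH;created__gt=2014-05-01"

$ Filtering centroids by creation date from the command line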

You can also order the results from the command line using curl
as follows:

curl "https://bigml.io/centroid?$BIGML_AUTH;order_by=-name"

$ Listing centroids ordered by name from the command line

Paginating Centroids

There are two parameters that can help you retrieve just a portion of
your centroids and paginate them.

Pagination Parameters

Parameter

Type

Description

limitoptional

Integer,
default is 20

Specifies
the number of centroids to retrieve. Must be less than or equal to 200.

offsetoptional

Integer,
default is 0

The order number from which the centroid listing will start.

If a limit is given, no more than that many
centroids will be returned, but possibly fewer if the
request itself yields fewer centroids.

For example, if you want to retrieve only the third and fourth latest
centroids:

curl "https://bigml.io/centroid?$BIGML_AUTH;limit=2;offset=2"

$ Paginating centroids from the command line

To paginate results, you need to start off with an
offset of zero, then increment it by whatever value
you use for the limit each time. So if you wanted to
return centroids 1-10, then 11-20, then 21-30, etc., you would use
"limit=10;offset=0",
"limit=10;offset=10", and
"limit=10;offset=20", respectively.

Updating a Centroid

To update a centroid, you need to PUT an object containing the fields that you want to
update to the centroid's URL. If the request succeeds, BigML.io will respond
with a 202 accepted code and with the new updated
centroid in the body of the message.
For example, to update a centroid with a new name you can use curl like this:
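A minimal sketch (the centroid/id reuses the example id from the retrieval snippet above):

curl "https://bigml.io/centroid/53798eda3c1920ee08000026?$BIGML_AUTH" \
    -X PUT \
    -H 'content-type: application/json' \
    -d '{"name": "my new centroid"}'

$ Updating a centroid's name from the command line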

Deleting a Centroid

To delete a centroid, you need to issue an HTTP DELETE request to the centroid/id to be deleted.

If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return HTTP 204 responses with no body.

HTTP/1.1 204 NO CONTENT
Content-Length: 0

< Successful response

Once you delete a centroid, it is permanently deleted. That is, a delete request cannot be undone.
If you try to delete a centroid a second time, or a centroid that does not
exist, you will receive an error like this:

Anomaly Scores

Last Updated: Thursday, 2015-02-05 23:59

An anomaly score is created using an
anomaly/id and the new instance
(input_data) for which you wish to create an anomaly score.

When you create a new anomaly score, BigML.io will
automatically compute a score between 0 and 1. The closer the score is
to 1, the more anomalous the instance being scored is.
BigML.io will also compute the relative importance for
each field. That is, how much each value in the input data contributed to the score.

Creating an Anomaly Score

To create a new anomaly score, you need to POST to the
anomaly score base URL an object containing at least
the anomaly/id that you want to use to compute the
anomaly score and an instance.
For example, you can easily create a new anomaly score using
curl as follows:
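A minimal sketch (the anomaly/id is the example id from the arguments table below):

curl "https://bigml.io/anomalyscore?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"anomaly": "anomaly/5423625af0a5ea3eea000028",
         "input_data": {"000000": 5, "000001": 3}}'

$ Creating an anomaly score from the command line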

Arguments

In addition to the anomaly and the
input_data, you can also POST the following arguments.

Anomaly Score Creation Arguments

Argument

Type

Description

categoryoptional

Integer,
default is the category of the anomaly detector

The category that
best describes the anomaly score. See the category table
for the complete list of categories.Example: 1

anomaly

String

A
valid anomaly/id.Example: anomaly/5423625af0a5ea3eea000028

descriptionoptional

String

A description of the anomaly score of up to 8192 characters.Example: "This is a description of my new anomaly score"

input_datarequired

Object

An object with field id/value
pairs representing the instance you want to create an
anomaly score for. You can use either field ids or field
names as keys in your input_data.Example: {"000000": 5, "000001": 3}

nameoptional

String, default is Score for anomaly
detector's name

The name you want to give to the new anomaly score.Example: "my new anomaly score".

tagsoptional

Array of Strings

A list of strings that help classify and
index your anomaly score.Example: ["best customers", "2012"].

You can use curl to customize new
anomaly scores. For example, to create a new anomaly score
named "my anomaly score":

If you do not specify a name, BigML.io will assign to
the new anomaly score a default name.

Retrieving an Anomaly Score

Each anomaly score has a unique identifier in the form
anomalyscore/id where id is a string of 24 alpha-numeric
characters that you can use to retrieve the anomaly score or as a
parameter to create anomaly scores.

You can also use your browser to visualize the
anomaly score using the full
BigML.io URL or pasting the anomalyscore/id
into the BigML.com dashboard.

Properties

Once an anomaly score has been successfully created it will
have the following properties.

Anomaly Score Properties

Property

Type

Description

anomalyfilterable,
sortable

String

The
anomaly/id of the anomaly detector that was used to create the
anomaly score.

anomaly_statusfilterable,
sortable

Boolean

Whether the
anomaly detector is still available or has
been deleted.

anomaly_type

Integer

Reserved
for further use.

categoryfilterable,
sortable, updatable

Integer

One
of the categories in the table of
categories that help classify this
resource according to the domain of application.

code

Integer

HTTP
status code. This can be 201 upon the
anomaly score creation and
200 after it. Make sure that you check the code that comes
with the status attribute to verify that the
anomaly score creation
has been completed without errors.

createdfilterable,
sortable

ISO-8601 string

This is the date and time in which
the anomaly score was created with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

creditsfilterable,
sortable

Float

The
number of credits it cost you to create this
anomaly score.

datasetfilterable,
sortable

String

The
dataset/id that was used to build the
anomaly detector.

dataset_statusfilterable,
sortable

Boolean

Whether the
dataset is still available or has
been deleted.

descriptionupdatable

String

A
text describing the anomaly score. It can contain restricted markdown
to decorate the text.

dev

Boolean

True when the
anomaly score has been created in development mode.

importance

Object

A dictionary keyed by field id that reports the relative
contribution of each field to the anomaly score.

input_data

Object

The dictionary of input fields' ids or
fields' names and
values used as input for the anomaly score.

locale

String

The
dataset's locale.

namefilterable,
sortable, updatable

String

The
name of the anomaly score as you provided it,
or a default name otherwise.

privatefilterable,
sortable, updatable

Boolean

Whether
the anomaly score is public or not. In a future version, you will be able to share
anomaly scores with other coworkers or, if desired, make them publicly available.

projectfilterable,
sortable

String

The
project/id the resource belongs to.

resource

String

The
anomalyscore/id.

score

Float

The
anomaly score. The closer to 1, the
more anomalous the input data is.

query_string

String

The
query string that was used to filter the anomaly detector.

sourcefilterable,
sortable

String

The
source/id that was used to build the
dataset.

source_statusfilterable,
sortable

Boolean

Whether the
source is still available or has
been deleted.

status

Object

A
description of the status of the
anomaly score. It includes a code, a message,
and some extra information. See the table below.

subscriptionfilterable,
sortable

Boolean

Whether
the anomaly score was created using a subscription plan or not.

updatedfilterable,
sortable

ISO-8601 string

This is the date and time in which
the anomaly score was last updated with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

Anomaly Score Status

Creating an anomaly score is a near real-time process that takes just a few
seconds depending on whether the corresponding
anomaly detector has been used recently and the workload of BigML's
systems. The anomaly score goes through a number of
states until it's fully completed. Through the status field in the
anomaly score you can determine when the
anomaly score has been
fully processed and is ready to be used. Most of the time, anomaly scores are
fully processed and the output is returned in the first call. These are
the properties that an anomaly score's status has:

Anomaly Score Status Properties

Field

Type

Description

bad_fields

Array

An
array of field ids with wrong values
submitted. Bad fields are ignored. That is, if you submit a
value that is wrong, an anomaly score is created anyway ignoring
the input field with the wrong value.

code

Integer

A status
code that reflects the status of the anomaly score creation. It can be any of the ones explained here.

elapsed

Float

Number
of milliseconds that BigML.io took
to find the anomaly score.

message

String

A human
readable message explaining the status.

progress

Float

How far BigML.io has progressed computing the score.

unknown_fields

Array

An
array of field ids that were submitted in
the input_data and were not recognized.
Unknown fields are ignored. That is, if you submit a
field that is wrong, an anomaly score is created anyway ignoring
the wrong input field.

Listing Anomaly Scores

To list all your anomaly scores you can use the
anomaly score base URL. By default, only the 20 most recent
anomaly scores will be returned. You can see below how to change this number
using the limit parameter.

You can list your anomaly scores directly in your browser using your own username and API key
with the following link.

Filtering Anomaly Scores

The listings of anomaly scores can be filtered by any of the fields that we
labeled as filterable in the table describing
properties above. For example, to retrieve all the
anomaly scores named "iris":

In addition to exact match, there are four more filters that you can
use. To add one of these filters to your request you just need to
append one of the suffixes in the following table to the name of the attribute that you want to use as a filter.

Anomaly Score Filters

Filter

Description

__ltoptional

Less
than. Example:
created__lt=2012-01-12

__lteoptional

Less
than or equal to. Example:
created__lte=2012-01-12

__gtoptional

Greater
than. Example: created__gt=2012-01-12

__gteoptional

Greater
than or equal to. Example:
created__gte=2012-01-12

For example, to get your anomaly scores that were created
after May 1, 2014:
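A sketch of the corresponding request:

curl "https://bigml.io/anomalyscore?$BIGML_AUTH;created__gt=2014-05-01"

$ Filtering anomaly scores by creation date from the command line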

You can also order the results from the command line using curl
as follows:

curl "https://bigml.io/anomalyscore?$BIGML_AUTH;order_by=-name"

$ Listing anomaly scores ordered by name from the command line

Paginating Anomaly Scores

There are two parameters that can help you retrieve just a portion of
your anomaly scores and paginate them.

Pagination Parameters

Parameter

Type

Description

limitoptional

Integer,
default is 20

Specifies
the number of anomaly scores to retrieve. Must be less than or equal to 200.

offsetoptional

Integer,
default is 0

The order number from which the anomaly score listing will start.

If a limit is given, no more than that many
anomaly scores will be returned, but possibly fewer if the
request itself yields fewer anomaly scores.

For example, if you want to retrieve only the third and fourth latest
anomaly scores:

curl "https://bigml.io/anomalyscore?$BIGML_AUTH;limit=2;offset=2"

$ Paginating anomaly scores from the command line

To paginate results, you need to start off with an
offset of zero, then increment it by whatever value
you use for the limit each time. So if you wanted to
return anomaly scores 1-10, then 11-20, then 21-30, etc., you would use
"limit=10;offset=0",
"limit=10;offset=10", and
"limit=10;offset=20", respectively.

Updating an Anomaly Score

To update an anomaly score, you need to PUT an object containing the fields that you want to
update to the anomaly score's URL. If the request succeeds, BigML.io will respond
with a 202 accepted code and with the new updated
anomaly score in the body of the message.
For example, to update an anomaly score with a new name you can use curl like this:
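A minimal sketch (the anomalyscore/id shown is a placeholder):

curl "https://bigml.io/anomalyscore/5423625af0a5ea3eea000030?$BIGML_AUTH" \
    -X PUT \
    -H 'content-type: application/json' \
    -d '{"name": "my new anomaly score"}'

$ Updating an anomaly score's name from the command line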

Deleting an Anomaly Score

To delete an anomaly score, you need to issue an HTTP DELETE request to the anomalyscore/id to be deleted.

If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return HTTP 204 responses with no body.

HTTP/1.1 204 NO CONTENT
Content-Length: 0

< Successful response

Once you delete an anomaly score, it is permanently deleted. That is, a delete request cannot be undone.
If you try to delete an anomaly score a second time, or an anomaly score that does not
exist, you will receive an error like this:

Batch Predictions

Last Updated: Thursday, 2015-02-05 23:59

A batch prediction provides an easy way to compute
a prediction for each instance in a dataset in only one request. To create a new
batch prediction you need a model/id or an
ensemble/id and a
dataset/id.

Batch predictions are created asynchronously. You can
retrieve the associated resource to check the progress and status in a
similar fashion to the rest of BigML.io resources.
Additionally, once a batch prediction is finished you can
also download a csv file that
contains all the predictions by just appending
"/download" to the
batch prediction URL. You can also set output_dataset
to true to
automatically generate a new dataset with the results.

BigML.io gives you a number of options to tailor
the format of the csv file containing the predictions. For example, you
can set up the "separator" (e.g., ";"),
whether the file should have a "header" or not,
or whether the "confidence" for each prediction should
also appear together with each prediction. You can read about all the available options below.

Batch Prediction base URL

You can use the following base URL to create, retrieve, download, update, and
delete batch predictions.

https://bigml.io/andromeda/batchprediction

Batch Prediction base URL

Authentication

All requests to manage your batch predictions must use HTTPS
and be authenticated using your username and API key to convey
your identity. Check this Section out for more details.

Creating a Batch Prediction

To create a new batch prediction, you need to POST to the
batch prediction base URL an object containing at least
the model/id or the ensemble/id that
you want to use to make predictions and the dataset/id of the dataset that contains the input data that
will be used to create predictions. BigML.io will
create a prediction for each instance in that dataset.
The content-type must always be "application/json".
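A minimal sketch (the ids are the example ids from the arguments table below):

curl "https://bigml.io/andromeda/batchprediction?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"model": "model/4f67c0ee03ce89c74a000006",
         "dataset": "dataset/4f66a80803ce8940c5000006"}'

$ Creating a batch prediction from the command line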

Arguments

In addition to the model and the dataset, you can also POST the
following arguments.

Batch Prediction Creation Arguments

Argument

Type

Description

all_fieldsoptional

Boolean,
default is false

Whether all
the fields from the dataset should be part of the generated csv
file together with the predictions.Example: true

categoryoptional

Integer,
default is the category of the model

The category that
best describes the batch prediction. See the category table
for the complete list of categories.Example: 1

combineroptional

Integer,
default is 0

Specifies the method that should be
used to combine predictions when an ensemble is used to create
the batch prediction. For classification ensembles,
the combination is made by majority vote. The options are:

0: plurality weighs each model's prediction as one vote.

1: confidence weighted weighs each model's vote by the confidence of its prediction.

2: probability weighted uses the
probability of the class in the distribution of classes in
the leaf node as weight.

3: threshold-based uses a given
threshold k (by default 1) to
predict a given class
(by default the minority class). You can set up
both using the threshold argument.
If fewer than k models
vote for class, the most frequent
of the remaining categories is chosen, as in a
plurality combination after removing the models
that were voting for class.
The confidence of the prediction is computed as
that of a plurality vote, excluding votes for the
majority class when it's not selected.

For regression ensembles, the predicted values are averaged.

confidenceoptional

Boolean,
default is false

Whether the
confidence for each prediction should be added to the generated csv
file.Example: true

confidence_nameoptional

String

The name of the column in the header of the generated file
containing the confidence. It will only have effect if
header is true.
Example: "Confidence"

datasetrequired

String

A
valid dataset/id.Example: dataset/4f66a80803ce8940c5000006

descriptionoptional

String

A description of the batch prediction of up to 8192 characters.
Example: "This is a description of my new batch prediction"

ensembleoptional

String

A
valid ensemble/id.Example: ensemble/517020d53c1920a514000056

fields_mapoptional

Object

A dictionary of identifiers of the fields to use from the
model under test mapped to their corresponding identifiers in
the input dataset.
Example:
{"000000":"00000a", "000001":"000002", "000002":"000001", "000003":"000020", "000004":"000004"}

headeroptional

Boolean,
default is true

Whether the
csv file should have a header with the name of each field.Example: true

missing_strategyoptional

Integer,
default is 0

Specifies the method that should be
used when a missing split is found. That is, when a missing value is found in the input data for a
decision node. The options are:

0: last prediction predicts based on the subset of the data which reached the parent of the missing split.

1: proportional evaluates all the subtrees of a missing split and recombines their predictions based on the proportion of data in each subtree

Example: 1

modeloptional

String

A
valid model/id.Example: model/4f67c0ee03ce89c74a000006

nameoptional

String,
default is dataset's name

The name you want to give
to the new batch prediction. Example: "my new batch prediction".

output_datasetoptional

Boolean,
default is false

Whether a
dataset with the results should be automatically created or not.Example: true

output_fieldsoptional

Array,
default is [] (none of the fields in the
dataset)

Specifies the fields to be included
in the csv file.Example:

["000001", "000003"]

prediction_nameoptional

String

The name of the column in the header of the generated file
for the prediction. It will only have effect if
header is true.
Example: "Prediction"

separatoroptional

Char,
default is ","

The
separator that you want to get between fields in the
generated csv file. Example: ";".

tagsoptional

Array
of Strings

A list of strings that help classify and
index your batch prediction.Example: ["best customers", "2013"].

thresholdoptional

Object

A dictionary with two optional keys:

k: indicates the minimum number of instances
of the category denoted by class that must be present in
the predictions tally (with one vote per model)
to produce as a result of the combination
that category. It's 1 by default.

class: if absent, the category
that appears least frequently in the training data
is chosen as the "class" value (with ties broken
using lexicographical ordering). This is the usual
default in some systems trying to detect anomalies
(e.g. IDS and the like), and other uses of this
combiner should probably not rely on our default
value.

Example: {"k": 2, "class": "attack"}.

You can also use curl to customize a new
batch prediction. For example, to create a new batch prediction
named "my batch prediction" that will not include a header and will only
output the field "000001" together with the confidence for each
prediction:
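A sketch of such a request (same example ids as above):

curl "https://bigml.io/andromeda/batchprediction?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"model": "model/4f67c0ee03ce89c74a000006",
         "dataset": "dataset/4f66a80803ce8940c5000006",
         "name": "my batch prediction",
         "header": false,
         "output_fields": ["000001"],
         "confidence": true}'

$ Creating a customized batch prediction from the command line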

If you do not specify a name, BigML.io will assign to
the new batch prediction a combination of the
dataset's name and the model's name. If you do
not specify any fields_map, BigML.io
will use a direct map of all the fields in the dataset.

Retrieving a Batch Prediction

Each batch prediction has a unique identifier in the form
batchprediction/id where id is a string of 24 alpha-numeric
characters that you can use to retrieve the batch prediction resource.
Notice that to download the associated csv file generated you will
need to append "/download" to resource id.

You can also use your browser to visualize the batch prediction
resource using the full BigML.io URL or pasting the batchprediction/id
into the BigML.com dashboard.

Properties

Once a batch prediction has been successfully created it will
have the following properties.

Batch Prediction Properties

Property

Type

Description

all_fieldsfilterable,
sortable

Boolean

Whether
the batch prediction contains all the fields in the
dataset used as an input.

categoryfilterable,
sortable, updatable

Integer

One
of the categories in the table of
categories that help classify this
resource according to the domain of application.

code

Integer

HTTP
status code. This can be 201 upon the
batch prediction creation and
200 after it. Make sure that you check the code that comes
with the status attribute to verify that
the batch prediction creation has been completed without errors.

combiner

Integer

The
method used to combine predictions from the ensemble. See
the available combiners above.

confidencefilterable,
sortable

Boolean

Whether
the confidence for each prediction was added to the
output file.

confidence_name

String

The
name of the column containing the confidence for each
prediction when it has been passed as an argument.

createdfilterable,
sortable

ISO-8601 string

This is the date and time in which
the batch prediction was created with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

creditsfilterable,
sortable

Float

The
number of credits it cost you to perform this
batch prediction

datasetfilterable,
sortable

String

The
dataset/id that was used as input to
create the batch predictions.

dataset_statusfilterable,
sortable

Boolean

Whether the
dataset used as input is still available or has
been deleted.

descriptionupdatable

String

A
text describing the batch prediction. It can contain restricted markdown
to decorate the text.

dev

Boolean

True when the
batch prediction has been created in development mode.

ensemblefilterable,
sortable

String

The
ensemble/id of the ensemble used to
create the batch prediction.

fields_map

Array

The map of dataset fields to model or ensemble fields used.

headerfilterable,
sortable

Boolean

Whether
the batch prediction file contains a header with the name
of each field or not.

locale

String

The
batch prediction's locale.

missing_strategyoptional

Integer,
default is 0 (last prediction).

Specifies
the type of strategy that a model will follow when a
missing value needed to continue with inference in the
model is found. The possible values are:

0 Last prediction

1 Proportional

modelfilterable,
sortable

String

The
model/id of the model used to create
the batch prediction.

model_statusfilterable,
sortable

Boolean

Whether the
model or ensemble
used is still available or has
been deleted.

model_type

Integer

Either
0 or 1 to specify
whether the batch prediction is from a single
model or
an ensemble respectively.

namefilterable,
sortable, updatable

String

The
name of the batch prediction. By default,
it's based on the name of model or ensemble and the
dataset used.

number_of_models

Integer

The
number of models in the ensemble.

objective_field

Object

The objective field of the model or ensemble. It includes
all the properties of the corresponding field (i.e.,
column_number, datatype,
id,name,
optype, etc).

output_datasetoptional

Boolean,
default is false

Whether a
dataset with the results should be automatically created or not.Example: true

output_dataset_resourcefilterable,
sortable

String

The
dataset/id of the newly created
dataset when output_dataset has been
set to true.

output_dataset_statusfilterable,
sortable

Boolean

Whether the
dataset generated as an output is still available or has
been deleted.

output_fields

Array

The list of output fields' ids used to
format the output csv file.

prediction_name

String

The
name of the column containing the predictions when it has been
passed as an argument.

privatefilterable,
sortable

Boolean

Whether
the batch prediction is public or not. In a future version,
you might be able to share batch predictions with other coworkers or, if desired, make them publicly available.

projectfilterable,
sortable

String

The
project/id the resource belongs to.

resource

String

The
batchprediction/id.

rowsfilterable,
sortable

Integer

The
total number of instances in the dataset used as an
input.

separator

Char

The
separator used in the csv file that contains the
predictions.

sharedfilterable,
sortable

Boolean

Whether
the batch prediction has been shared via a private link. In a future version,
you might be able to share batch predictions with other coworkers or, if desired, make them publicly available.

sizefilterable,
sortable

Integer

The
number of bytes of the dataset that
was used to create the batch prediction.

status

Object

A
description of the status of the
batch prediction. It includes a code, a message,
and some extra information. See the table below.

subscriptionfilterable,
sortable

Boolean

Whether
the batch prediction was created using a subscription plan or not.

tagsfilterable, updatable

Array of Strings

A list of user tags that can help classify and
index this resource.

threshold

Object

The parameters (k and
class) given when a threshold-based
combiner is used.

updatedfilterable,
sortable

ISO-8601 string

This is the date and time in which
the batch prediction was last updated with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

Batch Prediction Status

Creating a batch prediction is a process that can take just a few
seconds or a few hours depending on the size of the
dataset used as input and on the workload of BigML's
systems. The batch prediction goes through a number of
states until it's finished. Through the status field in the
batch prediction you can determine when it has been
fully processed. These are the
properties that a batch prediction's status has:

Batch Prediction Status Properties

Field

Type

Description

code

Integer

A status
code that reflects the status of the batch prediction. It can be any of the ones explained here.

elapsed

Integer

Number
of milliseconds that BigML.io took
to process the batch prediction.

message

String

A human
readable message explaining the status.

progress

Float, between 0 and
1

How far BigML.io has progressed
processing the batch prediction.

Once a batch prediction has successfully finished, it will look like this:

Listing Batch Predictions

To list your batch predictions you need to use the
batch prediction base URL. By default, only the 20 most recent
batch predictions are returned. You can see how to increase or
decrease this number using the limit parameter below.

Filtering Batch Predictions

The listings of batch predictions can be filtered by any of the fields that we
labeled as filterable in the table describing
a batch prediction's attributes above. For example, to retrieve all the
batch predictions named "New Customers Churn":
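A sketch of the corresponding request (note that spaces in the name must be URL-encoded):

curl "https://bigml.io/andromeda/batchprediction?$BIGML_AUTH;name=New%20Customers%20Churn"

$ Filtering batch predictions by name from the command line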

In addition to exact match, there are five more filters that you can
use. To add one of these filters to your request you just need to
prepend the negative prefix and/or append one of the suffixes in the following table to the name of the attribute that you want to use as a filter.

Batch Prediction Filters

Filter

Description

!optional prefix

Is not.
Example: !size=1024

__ltoptional suffix

Less
than. Example: size__lt=1024

__lteoptional suffix

Less
than or equal to. Example:
size__lte=1024

__gtoptional suffix

Greater
than. Example: size__gt=1024

__gteoptional suffix

Greater
than or equal to. Example:
size__gte=1024

For example, to get your batch predictions that are bigger than
one megabyte:

Paginating Batch Predictions

To paginate results, you need to start off with an
offset of zero, then increment it by whatever value
you use for the limit each time. So if you wanted to
return batch predictions 1-10, then 11-20, then 21-30, etc., you would use
"limit=10;offset=0",
"limit=10;offset=10", and
"limit=10;offset=20", respectively.

Updating a Batch Prediction

To update a batch prediction, you need to PUT an object containing the fields that you want to
update to the batchprediction/id's URL. Only
category, description,
name and tags are updatable
fields. The content-type must always be: "application/json".
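A minimal sketch of such a request (the batchprediction/id shown is a placeholder):

curl "https://bigml.io/andromeda/batchprediction/4f6a014b03ce895845000010?$BIGML_AUTH" \
    -X PUT \
    -H 'content-type: application/json' \
    -d '{"name": "my new batch prediction"}'

$ Updating a batch prediction's name from the command line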

Deleting a Batch Prediction

To delete a batch prediction, you need to issue an HTTP DELETE request to the batchprediction/id to be deleted.

If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return HTTP 204 responses with no body.

HTTP/1.1 204 NO CONTENT
Content-Length: 0

< Successful response

Once you delete a batch prediction, it is permanently
deleted. That is, a delete request cannot be undone.
If you try to delete a batch prediction a second
time, or a batch prediction that does not
exist you will receive an error like this:

Batch Centroids

A batch centroid provides an easy way to compute
a centroid for each instance in a dataset in only one request. To create a new
batch centroid you need a cluster/id and a
dataset/id.

Batch centroids are created asynchronously. You can
retrieve the associated resource to check the progress and status in a
similar fashion to the rest of BigML.io resources.
Additionally, once a batch centroid is finished you can
also download a csv file that
contains all the centroids by just appending
"/download" to the
batch centroid URL. You can also set output_dataset to true to
automatically generate a new dataset with the results.

BigML.io gives you a number of options to tailor
the format of the csv file containing the centroids. For example, you
can set up the "separator" (e.g., ";"),
whether the file should have a "header" or not,
or whether the "distance" for each centroid should
also appear together with each centroid. You can read about all the available options below.

Batch Centroid base URL

You can use the following base URL to create, retrieve, download, update, and
delete batch centroids.

https://bigml.io/batchcentroid

Batch Centroid base URL

Authentication

All requests to manage your batch centroids must use HTTPS
and be authenticated using your username and API key to convey
your identity. Check this section out for more details.

Creating a Batch Centroid

To create a new batch centroid, you need to POST to the
batch centroid base URL an object containing at least
the cluster/id that
you want to use to compute centroids and the dataset/id of the dataset that contains the input data that
will be used to compute centroids. BigML.io will
compute a centroid for each instance in that dataset.

You can easily create a new batch centroid using
curl as follows. Your authentication variable should
be set up first as indicated above.
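A minimal sketch (the ids are the example ids from the arguments table below):

curl "https://bigml.io/batchcentroid?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"cluster": "cluster/4f67c0ee03ce89c74a000006",
         "dataset": "dataset/4f66a80803ce8940c5000006"}'

$ Creating a batch centroid from the command line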

Arguments

In addition to the cluster and the dataset, you can also POST the
following arguments.

Batch Centroid Creation Arguments

Argument

Type

Description

all_fieldsoptional

Boolean,
default is false

Whether all
the fields from the dataset should be part of the generated csv
file together with the centroids.Example: true

categoryoptional

Integer,
default is the category of the cluster

The category that
best describes the batch centroid. See the category table
for the complete list of categories.Example: 1

clusterrequired

String

A
valid cluster/id.Example: cluster/4f67c0ee03ce89c74a000006

distanceoptional

Boolean,
default is false

Whether the
distance for each centroid should be added to the csv
file.Example: true

distance_nameoptional

String

The name of the column in the header of the generated file
containing the distance. It will only have effect if
header is true.
Example: "Distance"

datasetrequired

String

A
valid dataset/id.Example: dataset/4f66a80803ce8940c5000006

descriptionoptional

String

A description of the batch centroid of up to 8192 characters.
Example: "This is a description of my new batch centroid"

fields_mapoptional

Object

A dictionary of identifiers of the fields to use from the
cluster under test mapped to their corresponding identifiers in
the input dataset.
Example:
{"000000":"00000a", "000001":"000002", "000002":"000001", "000003":"000020", "000004":"000004"}

headeroptional

Boolean,
default is true

Whether the
csv file should have a header with the name of each field.Example: true

nameoptional

String,
default is dataset's name

The name you want to give
to the new batch centroid. Example: "my new batch centroid".

output_datasetoptional

Boolean, default is false

Whether
a dataset with the results should be automatically created or not.
Example: true

output_fieldsoptional

Array,
default is [] (none of the fields in the
dataset)

Specifies the fields to be included
in the csv file.Example:

["000001", "000003"]

centroid_nameoptional

String

The name of the column in the header of the generated file
for the centroid. It will only have effect if
header is true.
Example: "Centroid"

separatoroptional

Char,
default is ","

The
separator that you want to get between fields in the
generated csv file. Example: ";".

tagsoptional

Array
of Strings

A list of strings that help classify and
index your batch centroid.Example: ["best customers", "2013"].

You can also use curl to customize a new
batch centroid. For example, to create a new batch centroid
named "my batch centroid" that will not include a header and will only
output the field "000001" together with the distance for each
centroid:
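A sketch of such a request (same example ids as above):

curl "https://bigml.io/batchcentroid?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"cluster": "cluster/4f67c0ee03ce89c74a000006",
         "dataset": "dataset/4f66a80803ce8940c5000006",
         "name": "my batch centroid",
         "header": false,
         "output_fields": ["000001"],
         "distance": true}'

$ Creating a customized batch centroid from the command line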

If you do not specify a name, BigML.io will assign to
the new batch centroid a combination of the
dataset's name and the cluster's name. If you do
not specify any fields_map, BigML.io
will use a direct map of all the fields in the dataset.

Retrieving a Batch Centroid

Each batch centroid has a unique identifier in the form
batchcentroid/id where id is a string of 24 alpha-numeric
characters that you can use to retrieve the batch centroid resource.
Notice that to download the associated csv file generated you will
need to append "/download" to resource id.

You can also use your browser to visualize the batch centroid
resource using the full BigML.io URL or pasting the batchcentroid/id
into the BigML.com dashboard.

Properties

Once a batch centroid has been successfully created it will
have the following properties.

Batch Centroid Properties

Property

Type

Description

all_fieldsfilterable,
sortable

Boolean

Whether
the batch centroid contains all the fields in the
dataset used as an input.

categoryfilterable,
sortable, updatable

Integer

One
of the categories in the table of
categories that help classify this
resource according to the domain of application.

centroid_name

String

The
name of the column containing the centroids when it has been
passed as an argument.

code

Integer

HTTP
status code. This can be 201 upon the
batch centroid creation and
200 after it. Make sure that you check the code that comes
with the status attribute to verify that
the batch centroid creation has been completed without errors.

createdfilterable,
sortable

ISO-8601 string

This is the date and time in which
the batch centroid was created with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

creditsfilterable,
sortable

Float

The
number of credits it cost you to perform this
batch centroid.

datasetfilterable,
sortable

String

The
dataset/id that was used as input to
create the batch centroids.

dataset_statusfilterable,
sortable

Boolean

Whether the
dataset used as input is still available or has
been deleted.

descriptionupdatable

String

A
text describing the batch centroid. It can contain restricted markdown
to decorate the text.

dev

Boolean

True when the
batch centroid has been created in development mode.

distancefilterable,
sortable

Boolean

Whether
the distance for each centroid was added to the
output file.

distance_name

String

The
name of the column containing the distance for each
centroid when it has been passed as an argument.

fields_map

Array

The map of dataset fields to cluster fields used.

headerfilterable,
sortable

Boolean

Whether
the batch centroid file contains a header with the name
of each field or not.

locale

String

The
batch centroid's locale.

namefilterable,
sortable, updatable

String

The
name of the batch centroid. By default,
it's based on the name of the cluster and the
dataset used.

output_datasetfilterable,
sortable

Boolean

Whether
a new dataset with the results was created or not.

output_dataset_resourcefilterable,
sortable

String

The
dataset/id of the newly created
dataset when output_dataset has been
set to true.

output_dataset_statusfilterable,
sortable

Boolean

Whether the
dataset generated as an output is still available or has
been deleted.

output_fields

Array

The list of output fields' ids used to
format the output csv file.

privatefilterable,
sortable

Boolean

Whether
the batch centroid is public or not. In a future version,
you might be able to share batch centroids with other coworkers or, if desired, make them publicly available.

projectfilterable,
sortable

String

The
project/id the resource belongs to.

resource

String

The
batchcentroid/id.

rowsfilterable,
sortable

Integer

The
total number of instances in the dataset used as an
input.

separator

Char

The
separator used in the csv file that contains the
centroids.

sharedfilterable,
sortable

Boolean

Whether
the batch centroid has been shared via a private link.

sizefilterable,
sortable

Integer

The
number of bytes of the dataset that
was used to create the batch centroid.

status

Object

A
description of the status of the
batch centroid. It includes a code, a message,
and some extra information. See the table below.

subscriptionfilterable,
sortable

Boolean

Whether
the batch centroid was created using a subscription plan or not.

updatedfilterable,
sortable

ISO-8601 Datetime

This is the date and time in which
the batch centroid was last updated with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

Batch Centroid Status

Creating a batch centroid is a process that can take just a few
seconds or a few hours depending on the size of the
dataset used as input and on the workload of BigML's
systems. The batch centroid goes through a number of
states until it is finished. Through the status field in the
batch centroid you can determine when it has been
fully processed. These are the
properties that a batch centroid's status has:

Batch Centroid Status Properties

Field

Type

Description

code

Integer

A status
code that reflects the status of the batch centroid. It can be any of the ones explained here.

elapsed

Integer

Number
of milliseconds that BigML.io took
to process the batch centroid.

message

String

A human
readable message explaining the status.

progress

Float, between 0 and
1

How far BigML.io has progressed
processing the batch centroid.

Once a batch centroid has finished successfully, its status field will look something like this:
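A minimal illustrative sketch (status code 5 means finished; the message, elapsed, and progress values shown here are placeholders and will vary):

"status": {
    "code": 5,
    "elapsed": 782,
    "message": "The batch centroid has been created",
    "progress": 1.0
}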

Filtering Batch Centroids

The listings of batch centroids can be filtered by any of the fields that we
labeled as filterable in the table describing
a batch centroid's attributes above. For example, to retrieve all the
batch centroids named "Customers":
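A minimal sketch with curl (if the name contains spaces, remember to URL-encode it):

curl "https://bigml.io/batchcentroid?$BIGML_AUTH;name=Customers"

$ Filtering batch centroids by name from the command line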

In addition to exact match, there are five more filters that you can
use. To add one of these filters to your request you just need to
prepend the negative prefix and/or append one of the suffixes in the following table to the name of the attribute that you want to use as a filter.

Batch Centroid Filters

Filter

Description

!optional prefix

Is not.
Example: !size=1024

__ltoptional suffix

Less
than. Example: size__lt=1024

__lteoptional suffix

Less
than or equal to. Example:
size__lte=1024

__gtoptional suffix

Greater
than. Example: size__gt=1024

__gteoptional suffix

Greater
than or equal to. Example:
size__gte=1024

For example, to get your batch centroids that are bigger than
one megabyte:
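A minimal sketch with curl, taking one megabyte as 1048576 bytes:

curl "https://bigml.io/batchcentroid?$BIGML_AUTH;size__gt=1048576"

$ Filtering batch centroids by size from the command line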

You can also order the listing. For example, to list your batch centroids
ordered by the number of rows, use curl from the command line as follows:

curl "https://bigml.io/batchcentroid?$BIGML_AUTH;order_by=-rows"

$ Listing batch centroids ordered by rows from the command line

Paginating Batch Centroids

There are two parameters that can help you retrieve just a portion of
your batch centroids and paginate them.

Pagination Parameters

Parameter

Type

Description

limitoptional

Integer,
default is 20

Specifies
the number of batch centroids to retrieve. Must be less than or equal to 200.

offsetoptional

Integer,
default is 0

The order number from which the batch centroid listing will start.

If a limit is given, no more than that many
batch centroids will be returned, but possibly fewer, if the
request itself yields fewer batch centroids.

For example, if you want to retrieve only the third and fourth latest
batch centroids:

curl "https://bigml.io/batchcentroids?$BIGML_AUTH;limit=2;offset=2"

$ Paginating batch centroids from the command line

To paginate results, you need to start off with an
offset of zero, then increment it by whatever value
you use for the limit each time. So if you wanted to
return batch centroids 1-10, then 11-20, then 21-30, etc., you would use
"limit=10;offset=0",
"limit=10;offset=10", and
"limit=10;offset=20", respectively.

Updating a Batch Centroid

To update a batch centroid, you need to PUT an object containing the fields that you want to
update to the batchcentroid/id's URL. Only
category, description,
name, shared, tags,
and user_metadata are updatable
fields. The content-type must always be: "application/json".

For example, to update a batch centroid with a new name you can use curl like this:
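A minimal sketch, assuming a placeholder batchcentroid/id; only the updatable fields listed above may appear in the body:

curl "https://bigml.io/batchcentroid/4f66a80803ce8940c5000006?$BIGML_AUTH" \
    -X PUT \
    -H 'content-type: application/json' \
    -d '{"name": "my new batch centroid"}'

$ Updating a batch centroid's name from the command line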

Deleting a Batch Centroid

To delete a batch centroid, you need to issue an HTTP DELETE request
to the batchcentroid/id to be deleted. If the request succeeds you
will not see anything on the command line unless you executed the
command in verbose mode. Successful
DELETEs will return HTTP 204 responses with no body.
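A minimal sketch with curl, assuming a placeholder batchcentroid/id:

curl -X DELETE "https://bigml.io/batchcentroid/4f66a80803ce8940c5000006?$BIGML_AUTH"

$ Deleting a batch centroid from the command line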

HTTP/1.1 204 NO CONTENT
Content-Length: 0

< Successful response

Once you delete a batch centroid, it is permanently
deleted. That is, a delete request cannot be undone.
If you try to delete a batch centroid a second
time, or a batch centroid that does not
exist, you will receive an error like this:

Batch Anomaly Scores

A batch anomaly score provides an easy way to compute
an anomaly score for each instance in a dataset in only one request. To create a new
batch anomaly score you need an anomaly/id and a
dataset/id.

Batch anomaly scores are created asynchronously. You can
retrieve the associated resource to check the progress and status in a
similar fashion to the rest of BigML.io resources.
Additionally, once a batch anomaly score is finished you can
also download a csv file that
contains all the anomaly scores just by appending
"/download" to the
batch anomaly score URL. You can also set output_dataset to
true to automatically generate a new dataset with the
results.

BigML.io gives you a number of options to tailor
the format of the csv file containing the anomaly scores. For example, you
can set up the "separator" (e.g., ";"),
whether the file should have a "header" or not,
or the name of the column that will contain each anomaly
score ("score_name"). You can read about all the available options below.

Batch Anomaly Score base URL

You can use the following base URL to create, retrieve, download, update, and
delete batch anomaly scores.

https://bigml.io/batchanomalyscore

Batch Anomaly Score base URL

Authentication

All requests to manage your batch anomaly scores must use HTTPS
and be authenticated using your username and API key to convey
your identity. Check this section out for more details.

Creating a Batch Anomaly Score

To create a new batch anomaly score, you need to POST to the
batch anomaly score base URL an object containing at least
the anomaly/id that
you want to use to compute anomaly scores and the dataset/id of the dataset that contains the input data that
will be used to compute anomaly scores. BigML.io will
compute an anomaly score for each instance in that dataset.

You can easily create a new batch anomaly score using
curl as follows. Your authentication variable should
be set up first as indicated above.
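A minimal sketch, using placeholder ids like those in the arguments table below:

curl "https://bigml.io/batchanomalyscore?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"anomaly": "anomaly/4f67c0ee03ce89c74a000006",
         "dataset": "dataset/4f66a80803ce8940c5000006"}'

$ Creating a batch anomaly score from the command line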

Arguments

In addition to the anomaly detector and the dataset, you can also POST the
following arguments.

Batch Anomaly Score Creation Arguments

Argument

Type

Description

all_fieldsoptional

Boolean,
default is false

Whether all
the fields from the dataset should be part of the generated csv
file together with the anomaly scores.Example: true

categoryoptional

Integer,
default is the category of the anomaly detector

The category that
best describes the batch anomaly score. See the category table
for the complete list of categories.Example: 1

anomalyrequired

String

A
valid anomaly/id.Example: anomaly/4f67c0ee03ce89c74a000006

datasetrequired

String

A
valid dataset/id.Example: dataset/4f66a80803ce8940c5000006

descriptionoptional

String

A description of the batch anomaly score of up to 8192 characters.
Example: "This is a description of my new batch anomaly score"

fields_mapoptional

Object

A dictionary of identifiers of the fields to use from the
anomaly detector mapped to their corresponding identifiers in
the input dataset.
Example:
{"000000":"00000a", "000001":"000002", "000002":"000001", "000003":"000020", "000004":"000004"}

headeroptional

Boolean,
default is true

Whether the
csv file should have a header with the name of each field.Example: true

nameoptional

String,
default is dataset's name

The name you want to give
to the new batch anomaly score. Example: "my new batch anomaly score".

output_datasetoptional

Boolean,
default is false

Whether a
dataset with the results should be automatically created or not.Example: true

output_fieldsoptional

Array,
default is [] (none of the fields in the
dataset)

Specifies the fields to be included
in the csv file.Example:

["000001", "000003"]

score_nameoptional

String

The name of the column in the header of the generated file
for the anomaly score. It will only have effect if
header is true.
Example: "Anomaly Score"

separatoroptional

Char,
default is ","

The
separator that you want to get between fields in the
generated csv file. Example: ";".

tagsoptional

Array
of Strings

A list of strings that help classify and
index your batch anomaly score.Example: ["best customers", "2013"].

You can also use curl to customize a new
batch anomaly score. For example, to create a new batch anomaly score
named "my batch anomaly score" that will not include a header and will only
output the field "000001" together with the anomaly score for each
instance:
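A minimal sketch with curl, under the same placeholder ids as above:

curl "https://bigml.io/batchanomalyscore?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"anomaly": "anomaly/4f67c0ee03ce89c74a000006",
         "dataset": "dataset/4f66a80803ce8940c5000006",
         "name": "my batch anomaly score",
         "header": false,
         "output_fields": ["000001"]}'

$ Creating a customized batch anomaly score from the command line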

If you do not specify a name, BigML.io will assign to
the new batch anomaly score a combination of the
dataset's name and the anomaly's name. If you do
not specify any fields_map, BigML.io
will use a direct map of all the fields in the dataset.

Retrieving a Batch Anomaly Score

Each batch anomaly score has a unique identifier in the form
batchanomalyscore/id where id is a string of 24 alpha-numeric
characters that you can use to retrieve the batch anomaly score resource.
Notice that to download the associated csv file generated you will
need to append "/download" to the resource id.
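For example, with curl (the id below is a placeholder):

curl "https://bigml.io/batchanomalyscore/4f67c0ee03ce89c74a000006?$BIGML_AUTH"

$ Retrieving a batch anomaly score from the command line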

You can also use your browser to visualize the batch anomaly score
resource using the full BigML.io URL or pasting the batchanomalyscore/id
into the BigML.com dashboard.

Properties

Once a batch anomaly score has been successfully created it will
have the following properties.

Batch Anomaly Score Properties

Property

Type

Description

all_fieldsfilterable,
sortable

Boolean

Whether
the batch anomaly score contains all the fields in the
dataset used as an input.

anomalyfilterable,
sortable

String

The
anomaly/id of the anomaly used to create
the batch anomaly score.

anomaly_statusfilterable,
sortable

Boolean

Whether the
anomaly used is still available or has
been deleted.

anomaly_type

Integer

Reserved
for future use.

categoryfilterable,
sortable, updatable

Integer

One
of the categories in the table of
categories that help classify this
resource according to the domain of application.

code

Integer

HTTP
status code. This can be 201 upon the
batch anomaly score creation and
200 after it. Check the code that comes
with the status attribute to verify that
the batch anomaly score creation has been completed without errors.

createdfilterable,
sortable

ISO-8601 Datetime

This is the date and time in which
the batch anomaly score was created with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

creditsfilterable,
sortable

Float

The
number of credits it cost you to perform this
batch anomaly score.

datasetfilterable,
sortable

String

The
dataset/id that was used as input to
create the batch anomaly scores.

dataset_statusfilterable,
sortable

Boolean

Whether the
dataset used as input is still available or has
been deleted.

descriptionupdatable

String

A
text describing the batch anomaly score. It can contain restricted markdown
to decorate the text.

dev

Boolean

True when the
batch anomaly score has been created in development mode.

fields_map

Array

The map of dataset fields to anomaly fields used.

headerfilterable,
sortable

Boolean

Whether
the batch anomaly score file contains a header with the name
of each field or not.

locale

String

The
batch anomaly score's locale.

namefilterable,
sortable, updatable

String

The
name of the batch anomaly score. By default,
it's based on the names of the anomaly detector and the
dataset used.

output_datasetfilterable,
sortable

Boolean

Whether
a new dataset with the results was created or not.

output_dataset_resourcefilterable,
sortable

String

The
dataset/id of the newly created
dataset when output_dataset has been
set to true.

output_dataset_statusfilterable,
sortable

Boolean

Whether the
dataset generated as an output is still available or has
been deleted.

output_fields

Array

The list of output fields' ids used to
format the output csv file.

privatefilterable,
sortable

Boolean

Whether
the batch anomaly score is public or not. In a future version,
you might be able to share batch anomaly scores with other coworkers or, if desired, make them publicly available.

projectfilterable,
sortable

String

The
project/id the resource belongs to.

resource

String

The
batchanomalyscore/id.

rowsfilterable,
sortable

Integer

The
total number of instances in the dataset used as an
input.

score_name

String

The
name of the column containing the anomaly scores when it has been
passed as an argument.

separator

Char

The
separator used in the csv file that contains the
anomaly scores.

sharedfilterable,
sortable

Boolean

Whether
the batch anomaly score has been shared via a private link.

sizefilterable,
sortable

Integer

The
number of bytes of the dataset that
was used to create the batch anomaly score.

status

Object

A
description of the status of the
batch anomaly score. It includes a code, a message,
and some extra information. See the table below.

subscriptionfilterable,
sortable

Boolean

Whether
the batch anomaly score was created using a subscription plan or not.

updatedfilterable,
sortable

ISO-8601 Datetime

This is the date and time in which
the batch anomaly score was last updated with
microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are
provided in Coordinated
Universal Time (UTC).

Batch Anomaly Score Status

Creating a batch anomaly score is a process that can take just a few
seconds or a few hours depending on the size of the
dataset used as input and on the workload of BigML's
systems. The batch anomaly score goes through a number of
states until it is finished. Through the status field in the
batch anomaly score you can determine when it has been
fully processed. These are the
properties that a batch anomaly score's status has:

Batch Anomaly Score Status Properties

Field

Type

Description

code

Integer

A status
code that reflects the status of the batch anomaly score. It can be any of the ones explained here.

elapsed

Integer

Number
of milliseconds that BigML.io took
to process the batch anomaly score.

message

String

A human
readable message explaining the status.

progress

Float, between 0 and
1

How far BigML.io has progressed
processing the batch anomaly score.

Once a batch anomaly score has finished successfully, its status field will look something like this:
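A minimal illustrative sketch (status code 5 means finished; the message, elapsed, and progress values shown here are placeholders and will vary):

"status": {
    "code": 5,
    "elapsed": 1245,
    "message": "The batch anomaly score has been created",
    "progress": 1.0
}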

Filtering Batch Anomaly Scores

The listings of batch anomaly scores can be filtered by any of the fields that we
labeled as filterable in the table describing
a batch anomaly score's attributes above. For example, to retrieve all the
batch anomaly scores named "Customers":
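A minimal sketch with curl (if the name contains spaces, remember to URL-encode it):

curl "https://bigml.io/batchanomalyscore?$BIGML_AUTH;name=Customers"

$ Filtering batch anomaly scores by name from the command line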

In addition to exact match, there are five more filters that you can
use. To add one of these filters to your request you just need to
prepend the negative prefix and/or append one of the suffixes in the following table to the name of the attribute that you want to use as a filter.

Batch Anomaly Score Filters

Filter

Description

!optional prefix

Is not.
Example: !size=1024

__ltoptional suffix

Less
than. Example: size__lt=1024

__lteoptional suffix

Less
than or equal to. Example:
size__lte=1024

__gtoptional suffix

Greater
than. Example: size__gt=1024

__gteoptional suffix

Greater
than or equal to. Example:
size__gte=1024

For example, to get your batch anomaly scores that are bigger than
one megabyte:
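A minimal sketch with curl, taking one megabyte as 1048576 bytes:

curl "https://bigml.io/batchanomalyscore?$BIGML_AUTH;size__gt=1048576"

$ Filtering batch anomaly scores by size from the command line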

Paginating Batch Anomaly Scores

The limit and offset parameters work just as they do for batch centroids
above. To paginate results, you need to start off with an
offset of zero, then increment it by whatever value
you use for the limit each time. So if you wanted to
return batch anomaly scores 1-10, then 11-20, then 21-30, etc., you would use
"limit=10;offset=0",
"limit=10;offset=10", and
"limit=10;offset=20", respectively.

Updating a Batch Anomaly Score

To update a batch anomaly score, you need to PUT an object containing the fields that you want to
update to the batchanomalyscore/id's URL. Only
category, description,
name, shared, tags,
and user_metadata are updatable
fields. The content-type must always be: "application/json".

For example, to update a batch anomaly score with a new name you can use curl like this:
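A minimal sketch, assuming a placeholder batchanomalyscore/id; only the updatable fields listed above may appear in the body:

curl "https://bigml.io/batchanomalyscore/4f66a80803ce8940c5000006?$BIGML_AUTH" \
    -X PUT \
    -H 'content-type: application/json' \
    -d '{"name": "my new batch anomaly score"}'

$ Updating a batch anomaly score's name from the command line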