Authentication

The following snippet will help you set up an environment variable (i.e., BIGML_AUTH) to store your username and API key so you don't have to type them again in the rest of the examples. See the Authentication section below for more details.
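A sketch for a bash-like shell; substitute your own credentials (the API key below is a made-up placeholder):

export BIGML_USERNAME=alfred
export BIGML_API_KEY=79138a622755a2383660347f895444b1eb927730
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY"

$ Setting the BIGML_AUTH environment variable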

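To create a source, POST your training data file to the source base URL. For example, using curl with the iris.csv file used throughout these docs:

curl "https://bigml.io/source?$BIGML_AUTH" -F file=@iris.csv

$ Creating a source from the command line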
BigML.io will respond with a JSON object containing preliminary information about your new source. As with all BigML.io resources, the new source will have a resource key with a unique resource/id. You can use the source/id to retrieve the source or to create new datasets.

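To create a dataset, POST the source/id from the previous step to the dataset base URL. A sketch with curl (substitute your own source/id):

curl "https://bigml.io/dataset?$BIGML_AUTH" -X POST -H 'content-type: application/json' -d '{"source": "source/4f603fe203ce89bb2d000000"}'

$ Creating a dataset from the command line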
BigML.io will return a dataset resource if the request succeeds. BigML.io detects the type of each field and will begin computing the histograms and summary statistics. In the Datasets Section you can learn how to customize the parsing rules and other options when converting a source to a dataset.
Each field in your source is automatically assigned an id that you can later use as a parameter in models and predictions.

Creating a Model

To create a model, POST the dataset/id from the previous step to the model base URL. By default BigML.io will include all fields as predictors and will treat the last non-text field as the objective. In the Models Section you will learn how to customize the input fields or the objective field.
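For example, a sketch with curl (substitute your own dataset/id):

curl "https://bigml.io/model?$BIGML_AUTH" -X POST -H 'content-type: application/json' -d '{"dataset": "dataset/54d86680f0a5ea5fc0000011"}'

$ Creating a model from the command line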

To create a prediction, POST the model/id from the previous step, together with some input data, to the prediction base URL. If the request succeeds, BigML.io will return a new prediction resource with its own prediction/id. You can use this id to retrieve the prediction later on. The predicted value is found in the prediction object, keyed by the corresponding objective field id.
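A sketch with curl; the model/id and the input field id are illustrative placeholders:

curl "https://bigml.io/prediction?$BIGML_AUTH" -X POST -H 'content-type: application/json' -d '{"model": "model/54d8768bf0a5ea5fc0000023", "input_data": {"000002": 2.5}}'

$ Creating a prediction from the command line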

Overview

Last Updated: Monday, 2018-01-29 08:31

This page provides an introduction to BigML.io—The BigML API. A quick start guide for the impatient is here.

BigML.io is a Machine Learning REST API to easily build, run, and bring predictive models to your project. You can use BigML.io for basic supervised and unsupervised machine learning tasks and also to create sophisticated machine learning pipelines.

BigML.io is a REST-style API for creating and managing BigML resources programmatically. That is to say, using BigML.io you can create, retrieve, update and delete BigML resources using standard HTTP methods.

The four original BigML resources are: source, dataset, model, and prediction.

As shown in the picture below, the most basic flow consists of using some local (or remote) training data to create a source, then using the source to create a dataset, later using the dataset to create a model, and, finally, using the model and new input data to create a prediction.

The training data is usually in tabular format. Each row in the data represents an instance (or example) and each column a field (or attribute). These fields are also known as predictors or covariates.

When the machine learning task is supervised, one of the columns (usually the last one) represents a special attribute known as the objective field (or target) that assigns a label (or class) to each instance. Training data in this format is called labeled, and the task of learning from it is called supervised learning.

Once a source is created, it can be used to create multiple datasets. Likewise, a dataset can be used to create multiple models and a model can be used to create multiple predictions.

A model can be either a classification or a regression model depending on whether the objective field is respectively categorical or numeric.

Often an ensemble (or collection of models) can perform better than just a single model. Thus, a dataset can also be used to create an ensemble instead of a single model.

A dataset can also be used to create a cluster or an anomaly detector. Clusters and Anomaly Detectors are both built using unsupervised learning and therefore an objective field is not needed. In these cases, the training data is named unlabeled.

A centroid is to a cluster what a prediction is to a model. Likewise, an anomaly score is to an anomaly detector what a prediction is to a model.

There are scenarios where generating predictions for a relatively big collection of input data is very convenient. For these scenarios, BigML.io offers batch resources: batchprediction, batchcentroid, and batchanomalyscore. These resources take a dataset and, respectively, a model (or ensemble), a cluster, or an anomaly detector, and create a new dataset that contains a new column with the corresponding prediction, centroid, or anomaly score computed for each instance in the dataset.

When dealing with multiple projects, it's better to keep the resources that belong to each project separated. Thus, BigML also has a resource named project that helps you group together all the other resources. As you will see, you just need to assign a source to a pre-existing project and all the subsequent resources will be created in that project.

Note: In the snippets below you should substitute your own username and API key for Alfred's.

REST API

You can create, read, update, and delete resources using the respective standard HTTP methods: POST, GET, PUT and DELETE.

All communication with BigML.io is JSON formatted except for source creation. Source creation is handled with an HTTP POST using the "multipart/form-data" content-type.

HTTPS

All access to BigML.io must be performed over HTTPS. In this way communication between your application and BigML.io is encrypted and the integrity of traffic between both is verified.

Base URL

All BigML.io HTTP commands use the following base URL:

https://bigml.io

Base URL

Version

The BigML.io API is versioned using code names instead of version numbers. The current version name is "andromeda", so URLs can be written to require this version as follows:

https://bigml.io/andromeda/

Version

Specifying the version name is optional. If you omit the version name in your API requests, you will always get access to the latest API version. While we will do our best to make future API versions backward compatible it is possible that a future API release could cause your application to fail.

Specifying the API version in your HTTP calls will ensure that your application continues to function for the life cycle of the API release.

Summary of HTTP Methods

CREATE (POST)
Creates a new resource. Only certain fields are "postable". This method is not idempotent. Each valid POST request results in a new directly accessible resource.

RETRIEVE (GET)
Retrieves either a specific resource or a list of resources. This method is idempotent. The content type of the resources is always "application/json; charset=utf-8".

UPDATE (PUT)
Updates partial content of a resource. Only certain fields are "putable". This method is idempotent.

DELETE (DELETE)
Deletes a resource. This method is idempotent.

Resource ID

All BigML resources are identified by a name composed of two parts separated by a slash "/". The first part is the type of the resource and the second part is a 24-char unique identifier. See the examples below:
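For example, the following resource ids appear in the snippets throughout this document:

source/4f603fe203ce89bb2d000000
dataset/54d86680f0a5ea5fc0000011
project/54d9553bf0a5ea5fc0000016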

A resource id is immediately assigned when a resource is created, and you can use it to retrieve, update, or delete the corresponding resource. The resource id is also used as the input parameter for the creation of dependent resources.
You can also directly append a resource id to the URL https://bigml.com/dashboard to visualize it in the BigML web interface.

Libraries

A number of libraries for many other languages have been developed by the growing BigML community: C#, Ruby, PHP, and iOS.
If you are interested in library support for a particular language let us know.
Or if you are motivated to develop a library, we will give you all the support that we can.

Limits

BigML.io is currently limited to 1,000,000 (one million) requests per API key per hour. Please email us if you have a specific use case that requires a higher rate limit.

Authentication

Last Updated: Thursday, 2018-02-01 21:20

All access to BigML.io needs to be authenticated. Authentication is performed by appending your username and BigML API Key to the query string of every request.
See the example below.
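For example, a sketch with inline credentials (Alfred's API key here is a made-up placeholder):

curl "https://bigml.io/source?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730"

$ Listing sources with credentials in the query string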

To use BigML.io from the command line, we recommend setting your username and API key as environment variables. Using environment variables is also an easy way to keep your credentials out of your source code.

Alternative Keys

To create an alternative key you need to use BigML's
web interface.
There you can define what resources an alternative key can access and what operations
(i.e., create, list, retrieve, update or delete) are allowed with it.
This is useful in scenarios where you want to grant different roles and
privileges to different applications. For example, an application for
the IT folks that collects data and creates sources in BigML, another
that is accessed by data scientists to create and evaluate models, and
a third that is used by the marketing folks to create predictions.

Organizations

Last Updated: Monday, 2018-02-19 10:04

An organization is a permission-based grouping of resources
that helps you centralize your organization's resources.
The permissions can be managed in a company-specific dashboard,
and a user can be a member of multiple organizations at the same time.
All resources are created under a specific project in the organization.
A project can be configured as private or public, and you can control
who has access to your projects and the resources under them.

Organization Member Types

There are four types of membership in an organization.

A restricted member can create, retrieve, update, and delete resources
in the organization's projects, and view the public or private projects that the user has access to.

A member has the restricted member privileges and can also
create public or private projects in the organization.
A public project can be accessed by any user of the organization, and
a private project can be accessed only by those who have permission to the project.

When a project is created or updated,
certain organization users can be assigned the admin, write, or read permission.
A user with the admin permission or an organization administrator can update and delete the project.
A user with the write permission can create, retrieve, update, and delete resources in the project,
and a user with the read permission can only read existing resources in the project.
The user who creates the project will automatically have the admin permission
until the user is specifically removed from the project or the organization.
For example, let's say John, a user with the member role, is in the sales department.
John has created a private project, Sales Reports, and added the users
Amy and Mike to the write permission list.
Now John has been transferred to the marketing department
and he shouldn't have access to the Sales Reports project anymore.
John can delegate the admin permission to Amy or another organization user,
allowing that user to update or delete the project in the future, and then remove himself from the list.
If John is already removed or unavailable,
this can also be done by any administrator.
Any user with the write permission on the project can create, update, and delete resources and
move their personal resources to the project. However, once a personal resource is moved
under an organization project, it cannot be moved back to the personal account.
Lastly, users with the read permission can view all resources
in the project. However, they cannot update or delete them, or create new resources.

An administrator has full access to all projects and resources in the organization,
and can manage the users and their membership of the organization.

The owner has all privileges that an administrator has plus billing,
and is the only one who can update and delete the organization.

Each user can have only one role. If a user is assigned multiple roles,
only the role with the highest privilege is considered.
For example, if a user is assigned both the member and restricted member roles,
the user's final role in the organization will be member.

All resources created under the organization have the username and
user_id properties filled with the owner's username and id, plus a separate property,
creator, which is the username of the user who actually created the resource.

Authentication

In addition to your username and api_key,
all access to BigML organization resources requires an
additional parameter in the query string to authenticate.
As explained above, an organization resource must be created under a project.
In order to create, retrieve, update, and delete an organization resource,
you must pass project in the query string.
Thus, even if project is defined in the POST request,
it will simply be ignored in favor of the project in the query string.
For organization project resources, however,
you need to pass organization instead.
See the examples below.
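For example, two sketches with curl (the project id is the one used elsewhere in these docs; the organization id is an illustrative placeholder):

curl "https://bigml.io/source?$BIGML_AUTH;project=project/54d9553bf0a5ea5fc0000016"

$ Listing the sources of an organization project from the command line

curl "https://bigml.io/project?$BIGML_AUTH;organization=organization/5af06df94e17272d420001c6"

$ Listing an organization's projects from the command line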

Requests

Last Updated: Monday, 2018-01-29 08:31

BigML.io uses the standard POST, GET, PUT, and DELETE HTTP methods to
create,
retrieve,
update,
and delete
individual resources, respectively.
You can also list all your resources for each resource type.

Creating a Resource

To create a new resource, you need to POST an object to the resource's base URL. The content-type must always be "application/json". The only exception is source creation, which requires the "multipart/form-data" content type.

For example, to create a model with a dataset, you can use curl as in the sketch below (substitute your own dataset/id):
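curl "https://bigml.io/model?$BIGML_AUTH" -X POST -H 'content-type: application/json' -d '{"dataset": "dataset/54d86680f0a5ea5fc0000011"}'

$ Creating a model from the command line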

Retrieving a Resource

To retrieve a resource, you need to issue an HTTP GET request to the resource/id to be retrieved. Each resource has a unique identifier of the form resource/id, where resource is the type of the resource, such as dataset or model, and id is a string of 24 alphanumeric characters that you can use to retrieve the resource or as a parameter to create other resources from it.

For example, using curl you can do something like this to retrieve a dataset:

curl "https://bigml.io/dataset/54d86680f0a5ea5fc0000011?$BIGML_AUTH"

$ Retrieving a dataset from the command line

The following is an example of what a request header would look like for a dataset GET request:
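A rough sketch (the credentials are abbreviated and the exact header set varies by client):

GET /dataset/54d86680f0a5ea5fc0000011?username=alfred;api_key=... HTTP/1.1
Host: bigml.io
Accept: application/json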

Paginating Resources

There are two parameters that can help you retrieve just a portion of your resources and paginate them.

Pagination Parameters

limit (optional; Integer, default is 20): Specifies the number of resources to retrieve. Must be less than or equal to 200.

offset (optional; Integer, default is 0): The order number from which the resource listing will start.

If a limit is given, no more than that many resources will be returned, but possibly fewer, if the request itself yields fewer resources.

For example, if you want to retrieve only the third and fourth latest projects:

curl "https://bigml.io/project?$BIGML_AUTH;limit=2;offset=2"

$ Paginating projects from the command line

To paginate results, you need to start off with an offset of zero, then increment it by whatever value you use for the limit each time. So if you wanted to return resources 1-10, then 11-20, then 21-30, etc., you would use "limit=10;offset=0", "limit=10;offset=10", and "limit=10;offset=20", respectively.

Filtering Resources

The listings of resources can be filtered by any of the fields that we labeled as filterable in the table describing the properties of a resource type. For example, to retrieve all the projects tagged with "fraud":

https://bigml.io/project?$BIGML_AUTH;tags__in=fraud

> Filtering projects by tag from a browser

Using curl:

curl "https://bigml.io/project?$BIGML_AUTH;tags__in=fraud"

$ Filtering projects by tag from the command line

In addition to exact match, there are more filters that you can use. To add one of these filters to your request you just need to append one of the suffixes in the following table to the name of the property that you want to use as a filter.

All response content from BigML.io, including errors, is JSON formatted. For convenience's sake, each JSON response has a key named "code" that matches the HTTP response code. For example, after successfully creating a new source, BigML.io will send back a JSON response like the one below, with the HTTP "201 Created" code.
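A trimmed sketch of the shape of such a response; code, resource, and status are the keys documented here, while the status message wording is illustrative:

{
  "code": 201,
  "resource": "source/4f603fe203ce89bb2d000000",
  "status": {
    "code": 1,
    "message": "The request has been queued and will be processed soon"
  }
}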

In the body of an error response, the JSON-formatted message includes a key named code that matches the response code in the HTTP header. Additionally, the JSON includes a status field. The status gives you more information about the type of error. It includes a second, more specific error code and a message that gives a human-readable explanation of what caused the error. You can find the full list of error codes in the Status Codes section. For example, if you try to access a resource that does not exist, you will get a response like the following one.
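A sketch of the shape of such an error body; -1201 is the "Invalid Id" code from the tables below, and the message wording is illustrative:

{
  "code": 404,
  "status": {
    "code": -1201,
    "message": "Id does not exist"
  }
}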

Status Codes

Last Updated: Monday, 2017-10-30 10:31

This section lists the different status codes BigML.io sends in responses. First, we list the HTTP status codes, then the codes that define a resource creation status, and finally detailed error codes for every resource.

HTTP Status Code Summary

BigML.io returns meaningful HTTP status codes for every request. The same status code is returned in both the HTTP header of the response and in the JSON body.

Code  Status                    Semantics
200   OK                        Your request was successful and the JSON response should include the resource that you requested.
201   Created                   A new resource was created. You can get the new resource's complete location through the HTTP headers or the resource/id through the resource key of the JSON response.
202   Accepted                  Received after sending a request to update a resource if it was processed successfully.
204   No Content                Received after sending a request to delete a resource if it was processed successfully.
400   Bad Request               Your request is malformed, missed a required parameter, or used an invalid value as a parameter.
401   Unauthorized              Your request used the wrong username or API key.
402   Payment Required          Your subscription plan does not allow you to perform this action because you have exceeded your subscription limit. Please wait until your running tasks complete or upgrade your plan.
403   Forbidden                 Your request is trying to access a resource that you do not own.
404   Not Found                 The resource that you requested or used as a parameter in a request does not exist anymore.
405   Not Allowed               Your request is trying to use an HTTP method that is not supported or to change fields of a resource that cannot be modified.
411   Length Required           Your request is trying to PUT or POST without sending any content or specifying its length.
413   Request Entity Too Large  The size of the content in your request is greater than what BigML.io supports for PUT or POST.
415   Unsupported Media Type    Your request is trying to POST 'multipart/form-data' content but is actually sending the wrong content-type.
429   Too Many Requests         You have sent too many requests in a given amount of time.
500   Internal Server Error     Your request could not be processed because something went wrong on BigML's end.
503   Service Unavailable       BigML.io is undergoing maintenance.

Resource Status Code Summary

The creation of resources involves a computational task that can last a few seconds or a few days depending on the size of the data. Consequently, some HTTP POST requests to create a resource may launch an asynchronous task and return immediately. In order to know the completion status of this task, each resource has a status field that reports the current state of the request. This status is useful for monitoring progress during creation. The possible states for a task are:

Code  Status       Semantics
0     Waiting      The resource is waiting for another resource to be finished before BigML.io can start processing it.
1     Queued       The task that is going to create the resource has been accepted but has been queued because there are other tasks using the system.
2     Started      The task to create the resource has started and you should expect partial results soon.
3     In Progress  The task has computed the first partial resource but still needs to do more computations.
4     Summarized   This status is specific to datasets. It happens when the dataset has been computed but its data has not been serialized yet. The dataset is final, but you cannot use it yet to create a model; if you do, the model will wait until the dataset is finished.
5     Finished     The task is completed and the resource is final.
-1    Faulty       The task has failed. We either could not process the task as you requested it or have an internal issue.
-2    Unknown      The task has reached a state that we cannot verify at this time. This is a status you should never see unless BigML.io suffers a major outage.

Error Code Summary

This is the list of possible general error codes you can receive from BigML.io when managing any type of resource.

Error Code  Semantics
-1100       Unauthorized use
-1101       Not enough credits
-1102       Wrong resource
-1104       Cloned resource cannot be public
-1105       Price cannot be changed
-1107       Too many projects
-1108       Too many tasks
-1109       Subscription required
-1200       Missing parameter
-1201       Invalid Id
-1203       Field Error
-1204       Bad Request
-1205       Value Error
-1206       Validation Error
-1207       Unsupported Format
-1208       Invalid Sort Error

Source Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing sources.

Error Code  Semantics
-2000       This source cannot be read properly
-2001       Bad request to create a source
-2002       The source could not be created
-2003       The source cannot be retrieved
-2004       The source cannot be deleted now
-2005       Faulty source

Dataset Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing datasets.

Error Code  Semantics
-3000       The source is not ready yet
-3001       Bad request to create a dataset
-3002       The dataset cannot be created
-30021      The dataset cannot be created now
-3003       The dataset cannot be retrieved
-3004       The dataset cannot be deleted now
-3005       Faulty dataset
-3006       The dataset could not be created properly. This happens when a 1-click model has been requested and the corresponding dataset could not be created
-3008       The dataset could not be cloned properly. This happens when there is an internal error when you try to buy or clone another user's dataset
-3010       The clone of the origin dataset is not finished yet
-3020       The source does not contain readable data
-3030       The source cannot be parsed
-3040       The filter expression is not valid

Download Dataset Unsuccessful Requests

This is the list of possible specific error codes you can receive from BigML.io managing downloads.

Error Code  Semantics
-9000       The dataset export is not ready yet
-9001       Bad request to perform a dataset export
-9002       The dataset export cannot be performed
-90021      The dataset export cannot be performed now
-9003       The dataset export cannot be retrieved now
-9004       The dataset export cannot be deleted now
-9005       The dataset export could not be performed
-9006       Dataset exports aren't available for cloned datasets

Sample Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing samples.

Error Code  Semantics
-16000      The sample is not ready yet
-16001      Bad request to create a sample
-16002      Your sample cannot be created
-16021      Your sample cannot be created now
-16003      The sample cannot be retrieved now
-16004      Cannot delete sample now
-16005      The sample could not be created

Correlation Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing correlations.

Error Code  Semantics
-18000      The correlation is not ready yet
-18001      Bad request to create a correlation
-18002      Your correlation cannot be created
-18021      Your correlation cannot be created now
-18003      The correlation cannot be retrieved now
-18004      Cannot delete correlation now
-18005      The correlation could not be created

Statistical Test Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing statistical tests.

Error Code  Semantics
-17000      The statistical test is not ready yet
-17001      Bad request to create a statistical test
-17002      Your statistical test cannot be created
-17021      Your statistical test cannot be created now
-17003      The statistical test cannot be retrieved now
-17004      Cannot delete statistical test now
-17005      The statistical test could not be created

Model Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing models.

Error Code  Semantics
-4000       The dataset is not ready. A one-click model has been requested but the corresponding dataset is not ready yet
-4001       Bad request to create a model
-4002       The model cannot be created
-40021      The model cannot be created now
-4003       The model cannot be retrieved
-4004       The model cannot be deleted now
-4005       Faulty model
-4006       The dataset is empty
-4007       The input fields are empty
-4008       The model could not be cloned properly. This happens when there is an internal error when you try to buy or clone another user's model
-4008       Wrong objective field
-6060       The (sampled) input dataset is empty

Ensemble Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing ensembles.

Error Code  Semantics
-8001       Bad request to create an ensemble
-8002       The ensemble cannot be created
-80021      The ensemble cannot be created now
-8003       The ensemble cannot be retrieved now
-8004       The ensemble cannot be deleted now
-8005       The ensemble could not be created
-8008       The ensemble could not be cloned properly

Logistic Regression Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing logistic regressions.

Error Code  Semantics
-22000      The logistic regression is not ready yet
-22001      Bad request to create a logistic regression
-22002      Your logistic regression cannot be created
-22021      Your logistic regression cannot be created now
-22003      The logistic regression cannot be retrieved now
-22004      Cannot delete logistic regression now
-22005      The logistic regression could not be created

Cluster Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing clusters.

Error Code  Semantics
-10000      The cluster is not ready yet
-10001      Bad request to create a cluster
-10002      The cluster cannot be created
-10003      The cluster cannot be created now
-10004      The cluster cannot be retrieved now
-10005      The cluster cannot be deleted now
-10008      The cluster could not be cloned properly

Anomaly Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing anomaly detectors.

Error Code  Semantics
-13000      The anomaly detector is not ready yet
-13001      Bad request to create an anomaly detector
-13002      The anomaly detector cannot be created
-13021      The anomaly detector cannot be created now
-13003      The anomaly detector cannot be retrieved now
-13004      The anomaly detector cannot be deleted now
-13005      The anomaly detector could not be created
-13008      The anomaly detector could not be cloned properly

Association Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing associations.

Error Code  Semantics
-23000      The association is not ready yet
-23001      Bad request to create an association
-23002      Your association cannot be created
-23021      Your association cannot be created now
-23003      The association cannot be retrieved now
-23004      Cannot delete association now
-23005      The association could not be created

Topic Model Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing topic models.

Error Code  Semantics
-26000      The topic model is not ready yet
-26001      Bad request to create a topic model
-26002      Your topic model cannot be created
-26021      Your topic model cannot be created now
-26003      The topic model cannot be retrieved now
-26004      Cannot delete topic model now
-26005      The topic model could not be created

Time Series Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing time series.

Error Code  Semantics
-30000      The time series is not ready yet
-30001      Bad request to create a time series
-30002      Your time series cannot be created
-30021      Your time series cannot be created now
-30003      The time series cannot be retrieved now
-30004      Cannot delete time series now
-30005      The time series could not be created

Deepnet Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing deepnets.

Error Code  Semantics
-33001      Bad request to create a deepnet
-33002      The deepnet cannot be created
-330021     The deepnet cannot be created now
-33003      The deepnet cannot be retrieved now
-33004      The deepnet cannot be deleted now
-33005      The deepnet could not be created

Prediction Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing predictions.

Error Code  Semantics
-5000       This model is not ready yet
-5001       Bad request to create a prediction
-5002       The prediction cannot be created
-5003       The prediction cannot be retrieved
-5004       The prediction cannot be deleted now
-5005       The prediction could not be created

Centroid Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing centroids.

Error Code  Semantics
-11001      Bad request to create a centroid
-11002      Your centroid cannot be created now
-11003      The centroid cannot be retrieved now
-11004      Cannot delete centroid now
-11005      The centroid could not be created

Anomaly Score Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing anomaly scores.

Error Code  Semantics
-14001      Bad request to create an anomaly score
-14002      Your anomaly score cannot be created now
-14003      The anomaly score cannot be retrieved now
-14004      Cannot delete anomaly score now
-14005      The anomaly score could not be created

Association Set Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing association sets.

Error Code  Semantics
-24001      Bad request to create an association set
-24002      Your association set cannot be created now
-24003      The association set cannot be retrieved now
-24004      Cannot delete association set now
-24005      The association set could not be created

Topic Distribution Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing topic distributions.

Error Code  Semantics
-27001      Bad request to create a topic distribution
-27002      Your topic distribution cannot be created now
-27003      The topic distribution cannot be retrieved now
-27004      Cannot delete topic distribution now
-27005      The topic distribution could not be created

Forecast Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing forecasts.

Error Code  Semantics
-31001      Bad request to create a forecast
-31002      Your forecast cannot be created now
-31003      The forecast cannot be retrieved now
-31004      Cannot delete forecast now
-31005      The forecast could not be created

Batch Prediction Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing batch predictions.

Error Code  Semantics
-6001       Bad request to perform a batch prediction
-6002       The batch prediction cannot be performed
-60021      The batch prediction cannot be performed now
-6003       The batch prediction cannot be retrieved now
-6004       The batch prediction cannot be deleted now
-6005       The batch prediction could not be performed

Batch Centroid Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing batch centroids.

Error Code  Semantics
-12001      Bad request to perform a batch centroid
-12002      The batch centroid cannot be performed
-12021      The batch centroid cannot be performed now
-12003      The batch centroid cannot be retrieved now
-12004      The batch centroid cannot be deleted now
-12005      The batch centroid could not be performed

Batch Anomaly Score Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing batch anomaly scores.

Error Code  Semantics
-15001      Bad request to perform a batch anomaly score
-15002      The batch anomaly score cannot be performed
-15021      The batch anomaly score cannot be performed now
-15003      The batch anomaly score cannot be retrieved now
-15004      The batch anomaly score cannot be deleted now
-15005      The batch anomaly score could not be performed

Batch Topic Distribution Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing batch topic distributions.

Error Code  Semantics
-28001      Bad request to perform a batch topic distribution
-28002      The batch topic distribution cannot be performed
-28021      The batch topic distribution cannot be performed now
-28003      The batch topic distribution cannot be retrieved now
-28004      The batch topic distribution cannot be deleted now
-28005      The batch topic distribution could not be performed

Evaluation Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing evaluations.

Error Code  Semantics
-7001       Bad request to perform an evaluation
-7002       The evaluation cannot be performed
-70021      The evaluation cannot be performed now
-7003       The evaluation cannot be retrieved now
-7004       The evaluation cannot be deleted now
-7005       The evaluation could not be performed

Whizzml Library Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing libraries.

Error Code  Semantics
-19000      The library is not ready yet
-19001      Bad request to create a library
-19002      Your library cannot be created
-19021      Your library cannot be created now
-19003      The library cannot be retrieved now
-19004      Cannot delete library now
-19005      The library could not be created

Whizzml Script Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing scripts.

Error Code  Semantics
-20000      The script is not ready yet
-20001      Bad request to create a script
-20002      Your script cannot be created
-20021      Your script cannot be created now
-20003      The script cannot be retrieved now
-20004      Cannot delete script now
-20005      The script could not be created

Whizzml Execution Error Code Summary

This is the list of possible specific error codes you can receive from BigML.io managing executions.

Error Code  Semantics
-21000      The execution is not ready yet
-21001      Bad request to create an execution
-21002      Your execution cannot be created
-21021      Your execution cannot be created now
-21003      The execution cannot be retrieved now
-21004      Cannot delete execution now
-21005      The execution could not be created

Category Codes

Last Updated: Monday, 2017-10-30 10:31

A category is useful to group your resources under the same domain of
application if you plan to use BigML in more than one domain.
There are two types of categories: one for common resources and the
other for the WhizzML resources, i.e.,
library,
script, and
execution.
We use the following codes for categories.

Category Codes

Category  Description
-1        Uncategorized
0         Miscellaneous
1         Automotive, Engineering & Manufacturing
2         Energy, Oil & Gas
3         Banking & Finance
4         Fraud & Crime
5         Healthcare
6         Physical, Earth & Life Sciences
7         Consumer & Retail
8         Sports & Games
9         Demographics & Surveys
10        Aerospace & Defense
11        Chemical & Pharmaceutical
12        Higher Education & Scientific Research
13        Human Resources & Psychology
14        Insurance
15        Law & Order
16        Media, Marketing & Advertising
17        Public Sector & Nonprofit
18        Professional Services
19        Technology & Communications
20        Transportation & Logistics
21        Travel & Leisure
22        Utilities

WhizzML Category Codes

Category  Description
-1        Uncategorized
0         Miscellaneous
1         Advanced Workflow
2         Anomaly Detection
3         Association Discovery
4         Basic Workflow
5         Boosting
6         Classification
7         Classification/Regression
8         Correlations
9         Cluster Analysis
10        Data Transformation
11        Evaluation
12        Feature Engineering
13        Feature Extraction
14        Feature Selection
15        Hyperparameter Optimization
16        Model Selection
17        Prediction and Scoring
18        Regression
19        Stacking
20        Statistical Test

Projects

Last Updated: Monday, 2017-10-30 10:31

A project is an abstract resource that helps you group
related BigML resources together.

A project must have a name and optionally a category, description, and multiple
tags to help you organize and retrieve your projects.

When you create a new source you can assign it to a pre-existing
project. All the subsequent resources created using
that source
will belong to the same project.

All the resources created within a
project will inherit the name, description, and tags
of the project unless you
change them when you create the resources or update them later.

When you select a project on your BigML dashboard,
you will only see the BigML resources related to
that project. Using your BigML dashboard you can also
create, update, and delete projects (and all their associated
resources).

Retrieving a Project

Each project has a unique identifier in the form
"project/id" where id is a string of
24 alpha-numeric characters that you can use to retrieve the
project.

To retrieve a project with curl:

curl "https://bigml.io/project/54d9553bf0a5ea5fc0000016?$BIGML_AUTH"

$ Retrieving a project from the command line

You can also use your browser to visualize the project
using the full BigML.io URL or pasting the
project/id into the BigML.com dashboard.

Project Properties

Once a project has been
successfully created it will have the following properties.

Project Properties

category (filterable, sortable, updatable; Integer): One of the categories in the table of categories that help classify this resource according to the domain of application.

code (Integer): HTTP status code. This will be 201 upon successful creation of the project and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the project creation has been completed without errors.

updated (Date): The date and time at which the project was last updated, with microsecond precision. It follows the pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC).

Updating a Project

To update a project, you need to PUT an object containing the fields that you want to update to the project's base URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated project.

For example, to update a project with a new name, a
new category, a new description, and
new tags you can use curl like this:
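The following is a sketch with illustrative values (category 4 is Fraud & Crime in the categories table):

curl "https://bigml.io/project/54d9553bf0a5ea5fc0000016?$BIGML_AUTH" \
  -X PUT -H 'content-type: application/json' \
  -d '{"name": "fraud", "category": 4, "description": "Fraud detection resources", "tags": ["fraud"]}'

$ Updating a project from the command line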

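Deleting a Project

To delete a project, issue an HTTP DELETE request to the project's URL. A sketch with curl, reusing the project id from above:

curl -X DELETE "https://bigml.io/project/54d9553bf0a5ea5fc0000016?$BIGML_AUTH"

$ Deleting a project from the command line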
If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return "204 no content" responses with no body.

Once you delete a project,
it is permanently deleted. That is, a delete request cannot be undone.
If you try to delete a project
a second time, or a project that
does not exist, you will receive a "404 not found" response.

However, if you try to delete a project
that is being used at the moment, then BigML.io will not accept the request and
will respond with a "400 bad request" response.

Sources

Last Updated: Tuesday, 2018-03-13 12:20

A source is the raw data that you want to use to create
a predictive model. A source is usually a (big) file in
comma-separated values (CSV) format. See the example below. Each row represents an instance (or example).
Each column in the file represents a feature or field.
The last column usually represents the class or objective field.
The file may have a header as its first row, giving a name for each field.
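For example, the first rows of the classic iris.csv used in the snippets below look like this:

sepal length,sepal width,petal length,petal width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor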

Creating a Source Using a Local File

To create a new source, you need to POST the file containing your data to the source
base URL. The file must be attached in the post as a file upload. The
Content-Type in your HTTP request must be "multipart/form-data"
according to RFC 2388. This allows you to upload binary files in a compressed
format (.Z, .gz, etc.) that will upload faster.

You can easily do this using curl. The option -F
(--form) lets curl emulate a filled-in form in which a user has pressed the submit button.
You need to prefix the file path name with "@".

curl "https://bigml.io/source?$BIGML_AUTH" -F file=@iris.csv

> Creating a source

Creating a Source Using a URL

To create a new remote source you need a URL that points to the
data file that you want BigML to download for you.

You can easily do this using curl. The option -H
lets curl set the content type header, while the option -X sets the HTTP
method. You can send the URL within a JSON object as follows:
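A sketch along these lines, assuming the URL goes in a remote property and points at a publicly reachable CSV file:

curl "https://bigml.io/source?$BIGML_AUTH" -X POST -H 'content-type: application/json' -d '{"remote": "https://static.bigml.com/csv/iris.csv"}'

$ Creating a remote source from the command line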

Creating a remote source from Google Drive and Google Storage

You have two options to create a remote source from Google Drive and Google Storage via the API:

Using BigML:

Allow BigML to access your Google Drive or Google Storage from the
Cloud Storages section of your
Account or from your Dashboard sources list. You will get the access token and the refresh token.

Google Drive example:

Select the option to create source from Google Drive:

Allow BigML access to your Google Drive:

Get the access token and refresh token:

After completing these steps you need to POST to the source endpoint URL an object containing at least the file ID (for Google Drive)
or the bucket and the file name (for Google Storage) and the access token. Including the refresh token as well is
optional before your access token expires; including it saves you from worrying about expiration times.
The content-type must always be "application/json".

You can easily create the remote source using curl as in the examples below:
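A hypothetical sketch for Google Drive; the property names (file_id, access_token, refresh_token) are assumptions rather than confirmed API parameters, and the values are placeholders:

curl "https://bigml.io/source?$BIGML_AUTH" \
  -X POST -H 'content-type: application/json' \
  -d '{"file_id": "<drive-file-id>", "access_token": "<access-token>", "refresh_token": "<refresh-token>"}'

$ Creating a source from Google Drive (sketch)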

You can also create a remote source from your own App. You first need to authorize BigML access from your own Google Apps application. BigML only needs authorization for read-only authentication scope (https://www.googleapis.com/auth/devstorage.read_only, https://www.googleapis.com/auth/drive.readonly), but you can have any of the other available scopes (find authentication scopes available for Google Drive and Google Storage). After the authorization process you will get your access token and refresh token from the Google Authorization Server.

Then the process is the same as creating a remote source using the BigML application described above. You need
to POST to the source endpoint an object containing at least the file ID (for Google Drive) or the bucket and the file name (for Google Storage) and the access token, but in this case you will also need to include
the app secret and app client from your App. Again, including the refresh token is optional.

If you do not specify a name, BigML.io will assign to
the source the same name as the file that you
uploaded. If you do not specify a source_parser,
BigML.io will do its best to automatically select the parsing
parameters for you. However, if you do specify it, BigML.io will not try to
second-guess you.

An item_analysis object is composed of any combination of the following properties.

Item Analysis Object Properties

limit (optional; Integer, default is 10000): The maximum number of items that will be used for summarization and modeling. Example: 1000

pruning_strategy (optional; String, default is "nearest_to_frequency"): Describes how pruning is performed. When the number of different items is over the configured limit, some of them must be discarded. Available values are nearest_to_frequency, to keep those items whose frequency is close to a fixed occurrence rate given by the parameter target_frequency, or most_frequent, to take the most frequent items up to a total of limit. Example: "most_frequent"

separator (optional; Char, default is null): A character used as the item separator. Defaults to null for auto-detect. Example: ";"

separator_regexp (optional; String, default is null): A regular expression used to identify item separators. If provided, it overrides separator. Defaults to null to use the value of separator. Example: ";;"

target_frequency (optional; Float, default is 1/3): A number between 0 and 1 specifying the occurrence rate used when the pruning strategy is nearest_to_frequency. Example: 0.2

A term_analysis object is composed of any combination of the following properties.

Term Analysis Object Properties

case_sensitive (optional; Boolean, default is false): Whether text analysis should be case sensitive or not. Example: true

enabled (optional; Boolean, default is true): Whether text processing should be enabled or not. Example: true

language (optional; String, default is "en"): The default language of text fields in a two-letter language code, which will change the resulting stemming and tokenization. Available options are: "ca", "de", "en", "es", "fr", "nl", "pt", "none", or null for auto-detect. Example: "es"

stem_words (optional; Boolean, default is true): Whether to stem words or not. Example: true

use_stopwords (optional; Boolean, default is true): Whether to use stop words or not. Example: true

A synthetic object is composed of the following properties.

Synthetic Data Generation Object Properties

cat_padding (optional; Integer, default is 0): The number of padding characters to prepend to category names for testing long strings. Example: 1

fields (Integer): The number of fields to include in the synthetic source. The generated synthetic source will have as many fields as set by the argument fields plus a "class" field. Example: 10

frac_cat (optional; Number, default is 0.5): The fraction (between 0 and 1) of attributes that are categorical. Example: 0.3

frac_time (optional; Number, default is 0): The fraction (between 0 and 1) of attributes that represent times. Example: 0.1

missing (optional; Number, default is 0): The fraction (between 0 and 1) of missing values (null) that occur in the data. Example: 0.1

noise (optional; Number, default is 0): The fraction (between 0 and 1) of attributes not correlated with the class attribute. Example: 0.1

num_cats (optional; Integer, default is 3): The number of categories for categorical attributes. Example: 5

num_classes (optional; Integer, default is 3): The number of classes for the final class attribute. Example: 5

rows (Integer): The number of rows to include in the synthetic source. Example: 10

sparsity (optional; Number, default is 0): The fraction (between 0 and 1) of rows that will have a value of zero for each numeric field. Example: 0.1

Text Processing

While the handling of numeric, categorical, or items fields
within a decision tree framework is fairly straightforward, the handling of text fields
can be done in a number of different ways. BigML.io takes a basic and reasonably robust approach,
leveraging some basic NLP techniques along with a simple bag-of-words style method of feature generation.

At the source level, BigML.io attempts to do
basic language detection. Initially the language can be English
("en"), Spanish
("es"), Catalan/Valencian ("ca"),
Dutch ("nl"),
French ("fr"),
German ("de"),
Portuguese ("pt"), or
"none" if no language is detected. In the near
future, BigML.io will support many more languages.

For text fields, BigML.io adds
potentially five keys to the detected fields, all of which are placed
in a map under term_analysis.

The first is language, which is mapped to the detected language.

There are also three boolean keys, case_sensitive,
use_stopwords, and stem_words. The
case_sensitive key is false by default.
use_stopwords should be true if we should include
stopwords in the vocabulary for the detected field during text
summarization. stem_words should be true if
BigML.io should perform word stemming on this field,
which maps forms of the same term to the same key when summarizing or
generating models. By default, use_stopwords is false
and stem_words is true for languages other than "none"
and they are not present otherwise.

Finally, token_mode determines the tokenization
strategy. It may be set to tokens_only,
full_terms_only, or all. When set to
tokens_only, individual words are used as terms.
For example, "ML for all" becomes ["ML", "for", "all"]. However, when
full_terms_only is selected, the entire field is
treated as a single term as long as it is shorter than 256 characters.
In this case "ML for all" stays ["ML for all"]. If all
is selected, then both full terms and tokenized terms are used. In this
case ["ML for all"] becomes ["ML", "for", "all", "ML for all"]. The
default for token_mode is all.

There are a few details to note:

If full_terms_only is selected, then no stemming will occur even
if stem_words is true.

Also, when either all or tokens_only is selected, a term must
appear at least twice to be selected for the tag cloud. However,
full_terms_only lowers this limit to a single occurrence.

Finally, if the language is "none", or if a
language does not have an algorithm available for stopword removal or
stemming, the use_stopwords and
stem_words keys will have no effect.
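Putting this together, a sketch of setting per-field term_analysis flags by updating a source; the source id and field id are illustrative, and nesting the flags under the field's entry in fields is an assumption based on the per-field behavior described under Source Properties:

curl "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
  -X PUT -H 'content-type: application/json' \
  -d '{"fields": {"000001": {"term_analysis": {"case_sensitive": true, "token_mode": "tokens_only"}}}}'

$ Configuring term_analysis for one field (sketch)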

Items Detection

BigML automatically assigns the items type to fields that have many different categorical values
per instance, separated by non-alphanumeric characters, so that they can't be considered either categorical or text fields.

These kinds of fields can be found in transactional datasets where each instance is associated with a different set of
products contained within one field. For example, datasets containing all the products bought by users, or prescription datasets
where each patient is associated with different treatments. These datasets are commonly used for Association Discovery
to find relationships between different items.

Find the two CSV examples below that could be considered items fields:
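Two illustrative sketches (the rows are made up; the field names match those discussed below):

Patient,Prescription
1562,"Amoxicillin; Ibuprofen"
1563,"Insulin; Metformin; Atorvastatin"

Transaction,Products
10001,"milk; bread; eggs"
10002,"bread; butter"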

In the examples above, the fields Prescription and Products
will be considered as items and each different value will be a unique item.

Once a field has been detected as items, BigML tries to
automatically detect which is the best separator for your items. For example, for the following
itemset {hot dog; milk, skimmed; chocolate}, the best separator is the semicolon which yields
three different items: 'hot dog', 'milk, skimmed' and 'chocolate'.

For items fields,
there are five different parameters you can configure under the property group
item_analysis,
which includes separator that allows you to specify which separator you want to set
for your items.
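For example, a sketch of setting the separator at the source level (the source id is illustrative):

curl "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
  -X PUT -H 'content-type: application/json' \
  -d '{"item_analysis": {"separator": ";"}}'

$ Configuring the items separator (sketch)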

Note that items fields aren't eligible as objective fields for models, logistic regressions,
and ensembles, but they can be used as predictors. For anomaly detection, they can't be included as an input field
to calculate the anomaly score, although they can be selected as summary fields.

Datetime Detection

During the source pre-scan BigML tries to determine the data type of each field in your file. This process automatically
detects datetime fields and, unless disable_datetime is explicitly set to "true", BigML will generate additional
fields with their components.

For instance, if a field named "date" has been identified as a datetime with format "YYYY-MM-dd", four new fields will be
automatically added to the source, namely "date.year", "date.month", "date.day-of-month" and "date.day-of-week". For each row, these new fields
will be filled in automatically by parsing the value of their parent field, "date". For example, if the latter contains the value "1969-07-14",
the autogenerated columns in that row will have the values 1969, 7, 14 and 1 (because that day was a Monday). As noted before, autogeneration can be
disabled by setting the disable_datetime option to "true", either in the create source request or later in an update source operation.
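For example, a sketch of disabling the generation in an update source request (the source id is illustrative):

curl "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
  -X PUT -H 'content-type: application/json' \
  -d '{"disable_datetime": true}'

$ Disabling datetime expansion (sketch)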

When a field is detected as datetime, BigML tries to determine its format for parsing the values and generate
the fields with their components. By default, BigML accepts ISO 8601 time formats
(YYYY-MM-DD) as well as a number of other common European and US formats, as seen in the table below:

Datetime predefined formats

time_format Name                  Example
basic-date-time                   19690714T173639.592Z
basic-date-time-no-ms             19690714T173639Z
basic-ordinal-date-time           1969195T173639.592Z
basic-ordinal-date-time-no-ms     1969195T173639Z
basic-t-time                      T173639.592Z
basic-t-time-no-ms                T173639Z
basic-time                        173639.592Z
basic-time-no-ms                  173639Z
basic-week-date                   1969W297
basic-week-date-time              1969W297T173639.592Z
basic-week-date-time-no-ms        1969W297T173639Z
clock-minute                      5:36 PM
clock-minute-nospace              5:36PM
clock-second                      5:36:39 PM
clock-second-nospace              5:36:39PM
date                              1969-07-14
date-hour                         1969-07-14T17
date-hour-minute                  1969-07-14T17:36
date-hour-minute-second           1969-07-14T17:36:39
date-hour-minute-second-fraction  1969-07-14T17:36:39.592
date-hour-minute-second-ms        1969-07-14T17:36:39.592
date-time                         1969-07-14T17:36:39.592Z
date-time-no-ms                   1969-07-14T17:36:39Z
eu-date                           14/7/1969
eu-date-clock-minute              14/7/1969 5:36 PM
eu-date-clock-minute-nospace      14/7/1969 5:36PM
eu-date-clock-second              14/7/1969 5:36:39 PM
eu-date-clock-second-nospace      14/7/1969 5:36:39PM
eu-date-millisecond               14/7/1969 17:36:39.592
eu-date-minute                    14/7/1969 17:36
eu-date-second                    14/7/1969 17:36:39
eu-sdate                          14-7-1969
eu-sdate-clock-minute             14-7-1969 5:36 PM
eu-sdate-clock-minute-nospace     14-7-1969 5:36PM
eu-sdate-clock-second             14-7-1969 5:36:39 PM
eu-sdate-clock-second-nospace     14-7-1969 5:36:39PM
eu-sdate-millisecond              14-7-1969 17:36:39.592
eu-sdate-minute                   14-7-1969 17:36
eu-sdate-second                   14-7-1969 17:36:39
hour-minute                       17:36
hour-minute-second                17:36:39
hour-minute-second-fraction       17:36:39.592
hour-minute-second-ms             17:36:39.592
mysql                             1969-07-14 17:36:39
no-t-date-hour-minute             1969-7-14 17:36
odata-format                      /Datetime(-14752170831)/
ordinal-date-time                 1969-195T17:36:39.592Z
ordinal-date-time-no-ms           1969-195T17:36:39Z
rfc822                            Mon, 14 Jul 1969 17:36:39 +0000
t-time                            T17:36:39.592Z
t-time-no-ms                      T17:36:39Z
time                              17:36:39.592Z
time-no-ms                        17:36:39Z
timestamp                         -14718201
timestamp-msecs                   -14718201000
twitter-time                      Mon Jul 14 17:36:39 +0000 1969
twitter-time-alt                  1969-7-14 17:36:39 +0000
twitter-time-alt-2                1969-7-14 17:36 +0000
twitter-time-alt-3                Mon Jul 14 17:36 +0000 1969
us-date                           7/14/1969
us-date-clock-minute              7/14/1969 5:36 PM
us-date-clock-minute-nospace      7/14/1969 5:36PM
us-date-clock-second              7/14/1969 5:36:39 PM
us-date-clock-second-nospace      7/14/1969 5:36:39PM
us-date-millisecond               7/14/1969 17:36:39.592
us-date-minute                    7/14/1969 17:36
us-date-second                    7/14/1969 17:36:39
us-sdate                          7-14-1969
us-sdate-clock-minute             7-14-1969 5:36 PM
us-sdate-clock-minute-nospace     7-14-1969 5:36PM
us-sdate-clock-second             7-14-1969 5:36:39 PM
us-sdate-clock-second-nospace     7-14-1969 5:36:39PM
us-sdate-millisecond              7-14-1969 17:36:39.592
us-sdate-minute                   7-14-1969 17:36
us-sdate-second                   7-14-1969 17:36:39
week-date                         1969-W29-7
week-date-time                    1969-W29-7T17:36:39.592Z
week-date-time-no-ms              1969-W29-7T17:36:39Z
weekyear-week                     1969-W29
weekyear-week-day                 1969-W29-7
year-month                        1969-07
year-month-day                    1969-07-14

It might happen that BigML is not able to determine the right format of your datetime field. In that case, the field will be considered either a
text or a categorical field. You can override that assignment by setting the optype of the field
to datetime and passing the appropriate format in time_formats.
For instance:
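A sketch with curl; the source id and field id are illustrative, and the exact nesting of time_formats under the field's entry is an assumption:

curl "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
  -X PUT -H 'content-type: application/json' \
  -d '{"fields": {"000000": {"optype": "datetime", "time_formats": ["eu-date"]}}}'

$ Overriding a field's datetime format (sketch)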

Retrieving a Source

Each source has a unique identifier in the form
"source/id" where id is a string of
24 alpha-numeric characters that you can use to retrieve the
source.

To retrieve a source with curl:

curl "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH"

$ Retrieving a source from the command line

You can also use your browser to visualize the source
using the full BigML.io URL or pasting the
source/id into the BigML.com dashboard.

Source Properties

Once a source has been
successfully created it will have the following properties.

Source Properties

Property

Type

Description

category
filterable,
sortable,
updatable

Integer

One of the categories in the table of categories that help classify this resource according to the domain of application.

code

Integer

HTTP status code. This will be 201 upon successful creation of the source and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the source creation has been completed without errors.

content_type
filterable,
sortable

String

This is the MIME content-type as provided by your HTTP client. The content-type can help BigML.io to better parse your file. For example, if you use curl, you can alter it using the type option "-F file=@iris.csv;type=text/csv".

fields
updatable

Object

A dictionary with an entry per field (column) in your data. Each entry includes the column number, the name of the field, the type of the field, a specific locale if it differs from the source's, and specific missing tokens if they differ from the source's. This property is very handy to update sources according to your own parsing preferences.
Example:
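(An illustrative sketch for an iris-like source; ids and names are hypothetical.)

{
    "000000": {"column_number": 0, "name": "sepal length", "optype": "numeric"},
    "000004": {"column_number": 4, "name": "species", "optype": "categorical"}
}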

For fields classified with optype
"text", the default values specified
in the term_analysis at the top-level of the source are used.

Flags not provided in term_analysis take their default
value, i.e., false for booleans and none for language.

Besides these global default values, which apply to all text fields
(and potential text fields, such as categorical ones that might
overflow to text during dataset creation), it's possible to specify
term_analysis flags on a per-field basis.

For fields classified with optype "items",
the default values specified in the item_analysis
at the top-level of the source are used.

As with term_analysis, flags not provided in item_analysis
take their default values, and it's possible to specify item_analysis
flags on a per-field basis as well as at the global level.

Source Status

Before a source is successfully created,
BigML.io makes sure that it has been uploaded in an understandable
format, that the data that it contains is parseable, and that the types for each column
in the data can be inferred successfully. The source goes through a number of
states until all these analyses are completed. Through the status field in the
source you can determine when the source has been
fully processed and is ready to be used to create a dataset. These are the
fields that a source's status has:

Source Status Object Properties

Property

Type

Description

code

Integer

A status code that reflects the status of the source creation. It can be any of those that are explained here.

Filtering and Paginating Fields from a Source

A source might be composed of hundreds or even thousands of
fields. Thus when retrieving a source,
it's possible to specify that only a subset of fields be retrieved, by using any combination of the following
parameters in the query string (unrecognized parameters are ignored):

Fields Filter
Parameters

Parameter

Type

Description

fieldsoptional

Comma-separated list

A comma-separated list of field IDs to retrieve.
Example: fields=000000,000002"

fulloptional

Boolean

If false, no information about fields is returned.
Example: "full=false"

iprefixoptional

String

A case-insensitive string to retrieve fields whose names start with the given prefix. It is possible to specify more than one iprefix by repeating the parameter, in which case the union of the results is returned.
Example: "iprefix=INCOME"

limitoptional

Integer

Maximum number of fields that you will get in the fields field.
Example: "limit=100"

offsetoptional

Integer

How far off from the first field in your dataset is the first field in the fields field.
Example: "offset=100"

prefixoptional

String

A case-sensitive string to retrieve fields whose names start with the given prefix. It is possible to specify more than one prefix by repeating the parameter, in which case the union of the results is returned.
Example: "prefix=income"

Since fields is a map and therefore not
ordered, the returned fields contain an additional key, order,
whose integer (increasing) value gives you their ordering. In all other
respects, the source is the same as the one you would get without any
filtering parameter above.

The fields_meta field can help you paginate fields. Its
structure is as follows:

Fields Meta Object
Objects Properties

Property

Type

Description

countoptional

Integer

Specifies the current number of fields in the resource.

limitoptional

Integer

The maximum number of fields that will be returned in the resource.

offsetoptional

Integer

The current offset in the pagination of fields.

totaloptional

Integer

The total number of fields in the resource.

Note that paginating fields might only be worthwhile if you are going to
deal with really wide sources (i.e., more than 200 fields).

Updating a Source

To update a source,
you need to PUT an object containing the fields that you want to update to the
source's base URL.
The content-type must always be: "application/json".
If the request succeeds, BigML.io will return
an HTTP 202 response
with the updated source.

For example, to update a source with a new name and a
new locale you can use curl like this:
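(The source/id below is illustrative, and the locale is assumed to live in the source_parser object.)

curl "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
  -X PUT \
  -H 'content-type: application/json' \
  -d '{"name": "my new source", "source_parser": {"locale": "es-ES"}}'

$ Updating a source's name and locale from the command line

Deleting a Source

To delete a source, you need to issue an HTTP DELETE request to the source/id to be deleted. Using curl you can do something like this:

curl -X DELETE "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH"

$ Deleting a source from the command line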

If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return "204 no content" responses with no body.

Once you delete a source,
it is permanently deleted. That is, a delete request cannot be undone.
If you try to delete a source
a second time, or a source that
does not exist, you will receive a "404 not found" response.

However, if you try to delete a source
that is being used at the moment, then BigML.io will not accept the request and
will respond with a "400 bad request" response.

Datasets

Last Updated: Monday, 2017-10-30 10:31

A dataset is a structured version of a
source where each field has been processed and
serialized according to its type. The possible field types are
numeric, categorical, text,
date-time, or items.
For each field, you can also get the number of errors that were encountered processing it.
Errors are mostly missing values or values that do not match with the type assigned to the column.

When you create a new dataset, histograms of the field values are
created for the categorical and numeric fields.
In addition, for the numeric fields, a collection of statistics
about the field distribution such as minimum, maximum, sum, and sum of squares are
also computed.

For date-time fields, BigML attempts to parse the format
and automatically generate the related subfields (year, month, day, and so on) present in the format.

For items fields which have many different categorical values per instance
separated by non-alphanumeric characters, BigML tries to automatically detect
which is the best separator for your items.

Finally, for text fields, BigML handles plain text
fields with some light-weight natural language processing; BigML separates the field
into words using punctuation and whitespace, attempts to detect the language, groups word forms
together using word stemming, and eliminates words that are too common or too rare to be useful.
We are then left with somewhere between a few dozen and a few hundred interesting
words per text field, the occurrences of which can be features in a model.

Dataset Base URL

You can use the following base URL to create, retrieve, update, and
delete datasets.

https://bigml.io/dataset

Dataset base URL

All requests to manage your datasets must use HTTPS
and be authenticated using your username and API key to verify
your identity. See this section for more details.

Creating a Dataset

To create a new dataset, you need to POST to the
dataset base URL an object containing at least
the source/id that you want to use to create the
dataset. The content-type must always be
"application/json".

You can easily create a new dataset using
curl as follows. All you need is a valid
source/id and your authentication variable set up as
shown above.
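(The source/id below is illustrative.)

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"source": "source/4f603fe203ce89bb2d000000"}'

$ Creating a dataset from the command line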

Dataset Arguments

By default, the dataset will include all fields in the corresponding
source; but this behaviour can be fine-tuned via the
input_fields and
excluded_fields lists of identifiers. The former
specifies the list of fields to be included in the dataset, and
defaults to all fields in the source when empty. To specify excluded
fields, you can use excluded_fields: identifiers in
that list are removed from the list constructed using
input_fields".

See below the full list of arguments that you can POST to create a dataset.

Dataset Creation
Arguments

Argument

Type

Description

categoryoptional

Integer,default is the category of the source

The category that best describes the dataset. See the category codes for the complete list of categories.
Example: 1

descriptionoptional

String

A description of the dataset up to 8192 characters long.
Example: "This is a description of my new dataset"

excluded_fieldsoptional

Array,default is [], an empty list. None of the fields in the source is excluded.

Specifies the fields that won't be included in the dataset.
Example:

["000000", "000002"]

fieldsoptional

Object,default is {}, an empty dictionary. That is, no names, labels or descriptions are changed.

Updates the names, labels, and descriptions of the fields in the dataset with respect to the original names in the source. Include an entry keyed by the field id generated in the source for each field whose name you want updated.
Example:
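(field ids and names are illustrative)

{
    "000001": {"name": "width_1"},
    "000003": {"name": "length_2"}
}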

json_filteroptional

Array

A JSON list representing a filter over the rows in the datasource. The first element is an operator and the rest of the elements its arguments. See the section below for more details.
Example: [">", 3.14, ["field", "000002"]]

sizeoptional

Integer

The number of bytes from the source that you want to use.
Example: 1073741824

source

String

A valid source/id.
Example: source/4f665b8103ce8920bb000006

tagsoptional

Array of Strings

A list of strings that help classify and index your dataset.
Example: ["best customers", "2018"]

term_limitoptional

Integer

The maximum total number of terms to be used in text analysis.
Example: 500

You can also use curl to customize a new
dataset with a name, a different size, and only a
few fields from the original source. For example,
to create a new dataset named "my dataset", with only 500 bytes, and
with only two fields:
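(The source/id and field ids below are illustrative.)

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"source": "source/4f603fe203ce89bb2d000000",
       "name": "my dataset",
       "size": 500,
       "input_fields": ["000001", "000003"]}'

$ Creating a customized dataset from the command line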

If you do not specify a name, BigML.io will assign
the source's name to the new dataset. If
you do not specify a size, BigML.io
will use the source's full size. If you do not specify any fields,
BigML.io will include all the fields in the
source with their corresponding names.

Filtering Rows

The dataset creation request can include an argument,
json_filter, specifying a predicate that the input
rows from the source have to satisfy in order to be included in the dataset. This predicate is specified as a (possibly nested) JSON list whose first element is an operator and the rest of the elements its arguments. Here's an example of a filter specification to choose only those rows whose field "000002" is less than 3.14:

[">", 3.14, ["field", "000002"]]

Filter Example

As you see, the list starts with the operator we want to use, ">",
followed by its operands: the number 3.14, and the value of the field
with identifier "000002", which is denoted by the operator "field". As
another example, this filter:

["=", ["field", "000002"], ["field", "000003"], ["field", "000004"]]

Filter Example

selects rows for which the three fields with identifiers "000002",
"000003" and "000004" have identical values. Note how you're not
limited to two arguments. It's also worth noting that for a filter like
that one to be accepted, all three fields must have the same optype
(e.g. numeric), otherwise they cannot be compared.

The field operator also accepts as arguments the field's name (as a string)
or the row column (as an integer). For instance, if field "000002" had column
number 12, and field "000003" was named "Stock prize", our previous query could
have been written:

["=", ["field", 12], ["field", "Stock prize"], ["field", "000004"]]

Filter Example

If the name is not unique, the first matching field found is picked,
consistently over the whole filter expression. If you have duplicated field
names, the best thing to do is to use either column numbers or field
identifiers in your filters, to avoid ambiguities.

Besides a field's value, one can also ask whether it's missing or not. For
instance, to include only those rows for which field "000002" contains a missing
token, you would use:

["missing", "000002"]

Filter Example

and to select only those for which neither "000002" nor "000003" are missing:
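["and", ["not", ["missing", "000002"]], ["not", ["missing", "000003"]]]

Filter Example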

where, as with field, missing's argument can also be a column number or a name, and where we have introduced the logical operators "and" and "not", which, together with "or", can take any number of arguments:
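["and", [">", ["field", "000000"], 0], ["<", ["field", "000000"], 5], ["not", ["missing", "000001"]]]

Filter Example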

As with logical and relational operators, arithmetic operators accept more than two arguments.
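For instance, to keep rows where the sum of three fields is less than 100:

["<", ["+", ["field", "000000"], ["field", "000001"], ["field", "000002"]], 100]

Filter Example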

These are all the accepted operators:

=, !=, >,
>=, <, <=,
and, or, not,
field, missing, +,
-, *, /.

To be accepted by the API, the filter must evaluate to a boolean value and contain at least one operator. So, for instance, a constant or an expression evaluating to a number will be rejected.

Since writing and reading the above expressions in pure JSON might be a bit
involved, you can also send your query to the server as a string representing a
Lisp s-expression using the argument lisp_filter, e.g.
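"lisp_filter": "(> 3.14 (field 2))"

Filter Example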

Retrieving a Dataset

Each dataset has a unique identifier in the form
"dataset/id" where id is a string of
24 alpha-numeric characters that you can use to retrieve the
dataset.
Notice that to download the dataset file in CSV format, you will need to append "/download"
to the resource id, and for the Tableau tde format, append "/download?format=tde".
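For example, to retrieve a dataset and download it as a CSV with curl (the dataset/id is illustrative):

curl "https://bigml.io/dataset/4f66a80803ce8940c5000006?$BIGML_AUTH"

curl "https://bigml.io/dataset/4f66a80803ce8940c5000006/download?$BIGML_AUTH"

$ Retrieving and downloading a dataset from the command line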

You can also use your browser to visualize the dataset
using the full BigML.io URL or pasting the
dataset/id into the BigML.com dashboard.

Dataset Properties

Once a dataset has been
successfully created it will have the following properties.

Dataset Properties

Property

Type

Description

category
filterable,
sortable,
updatable

Integer

One of the categories in the table of categories that help classify this resource according to the domain of application.

code

Integer

HTTP status code. This will be 201 upon successful creation of the dataset and 200 afterwards. Be sure to check the code that comes with the status attribute to verify that the dataset creation has completed without errors.

columns
filterable,
sortable

Integer

The number of fields in the dataset.

correlations

Object

A dictionary where each entry represents a field (column) in your data with the last calculated correlation/id for it.
Example:

created
filterable,
sortable

String

This is the date and time in which the dataset was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC).

credits
filterable,
sortable

Float

The number of credits it cost you to create this dataset.

description
updatable

String

A text describing the dataset. It can contain restricted markdown to decorate the text.

excluded_fields

Array

The list of field ids that were excluded when building the dataset.

field_types

Object

A dictionary with the number of fields of each type. It has an entry for each field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields. In new datasets, it uses the key effective_fields to report the effective number of fields, that is, the total number of fields including those created under the hood to support text fields.

fields

Object

A dictionary with an entry per field (column) in your data. Each entry includes the column number, the name of the field, the type of the field, and the summary.

fields_meta

Object

A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned.

input_fields

Array

The list of input fields' ids used to create the dataset.

locale

String

The source's locale.

name
filterable,
sortable,
updatable

String

The name of the dataset as you provided it or, by default, based on the name of the source.

number_of_anomalies
filterable,
sortable

Integer

The current number of anomalies that use this dataset.

number_of_anomalyscores
filterable,
sortable

Integer

The current number of anomaly scores that use this dataset.

number_of_associations
filterable,
sortable

Integer

The current number of associations that use this dataset.

number_of_associationsets
filterable,
sortable

Integer

The current number of association sets that use this dataset.

number_of_batchanomalyscores
filterable,
sortable

Integer

The current number of batch anomaly scores that use this dataset.

number_of_batchcentroids
filterable,
sortable

Integer

The current number of batch centroids that use this dataset.

number_of_batchpredictions
filterable,
sortable

Integer

The current number of batch predictions that use this dataset.

number_of_batchtopicdistributions
filterable,
sortable

Integer

The current number of batch topic distributions that use this dataset.

number_of_centroids
filterable,
sortable

Integer

The current number of centroids that use this dataset.

number_of_clusters
filterable,
sortable

Integer

The current number of clusters that use this dataset.

number_of_correlations
filterable,
sortable

Integer

The current number of correlations that use this dataset.

number_of_ensembles
filterable,
sortable

Integer

The current number of ensembles that use this dataset.

number_of_evaluations
filterable,
sortable

Integer

The current number of evaluations that use this dataset.

number_of_forecasts
filterable,
sortable

Integer

The current number of forecasts that use this dataset.

number_of_logisticregressions
filterable,
sortable

Integer

The current number of logistic regressions that use this dataset.

number_of_models
filterable,
sortable

Integer

The current number of models that use this dataset.

number_of_predictions
filterable,
sortable

Integer

The current number of predictions that use this dataset.

number_of_statisticaltests
filterable,
sortable

Integer

The current number of statistical tests that use this dataset.

number_of_timeseries
filterable,
sortable

Integer

The current number of time series that use this dataset.

number_of_topicdistributions
filterable,
sortable

Integer

The current number of topic distributions that use this dataset.

number_of_topicmodels
filterable,
sortable

Integer

The current number of topic models that use this dataset.

objective_field
updatable

Object

The default objective field.

out_of_bag
filterable,
sortable

Boolean

Whether the out-of-bag instances were used to clone the dataset instead of the sampled instances.

price
filterable,
sortable,
updatable

Float

The price other users must pay to clone your dataset.

private
filterable,
sortable,
updatable

Boolean

Whether the dataset is public or not.

project
filterable,
sortable,
updatable

String

The project/id the resource belongs to.

range

Array

The range of instances used to clone the dataset.

refresh_field_types
filterable,
sortable

Boolean

Whether the field types of the dataset have been recomputed or not.

refresh_objective
filterable,
sortable

Boolean

Whether the default objective field of the dataset has been recomputed or not.

refresh_preferred
filterable,
sortable

Boolean

Whether the preferred flags of the dataset fields have been recomputed or not.

replacement
filterable,
sortable

Boolean

Whether the instances sampled to clone the dataset were selected using replacement or not.

resource

String

The dataset/id.

rows
filterable,
sortable

Integer

The total number of rows in the dataset.

sample_rate
filterable,
sortable

Float

The sample rate used to select instances from the dataset.

seed
filterable,
sortable

String

The string that was used to generate the sample.

shared
filterable,
sortable,
updatable

Boolean

Whether the dataset is shared using a private link or not.

shared_clonable
filterable,
sortable,
updatable

Boolean

Whether the shared dataset can be cloned or not.

shared_hash

String

The hash that gives access to this dataset if it has been shared using a private link.

sharing_key

String

The alternative key that gives read access to this dataset.

size
filterable,
sortable

Integer

The number of bytes of the source that were used to create this dataset.

source
filterable,
sortable

String

The source/id that was used to build the dataset.

source_status
filterable,
sortable

Boolean

Whether the source is still available or has been deleted.

statisticaltest
filterable,
sortable

String

The last statisticaltest/id that was generated for this dataset.

status

Object

A description of the status of the dataset. It includes a code, a message, and some extra information. See the table below.

updated
filterable,
sortable

String

This is the date and time in which the dataset was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC).

Dataset Fields

The property fields is a dictionary keyed by
each field's id in the source. Each
field's id has as a value an object with the
following properties:

Fields Object Properties

Property

Type

Description

column_number

Integer

Specifies the column number in the original file.

datatype

String

Specifies the storage type of the field.

description

String

An even longer text description for the field.

label

String

A longer and more descriptive name of the field.

locale

String

Specifies the locale for this field if it is different from the dataset's locale.

name

String

Name of the field. It will be the same as in the source if it has not been specified here.

optype

String

Specifies the operational type of the field. It can be numeric, categorical, or text.

preferred

Boolean

Whether BigML thinks that this field will be useful when creating a model or not.

summary

Object

Numeric or categorical summary of the field.

Numeric Summary

Numeric summaries come with all the fields described below. If
the number of unique values in the data is greater than 32,
then 'bins' will be used for the summary. If not, 'counts'
will be available.

Numeric Summary Object Properties

Property

Type

Description

bins

Array

An array that represents an approximate histogram of the distribution. It consists of value pairs, where the first value is the mean of a histogram bin and the second value is the bin population. bins is only available when the number of distinct values is greater than 32. For more information, see our blog post or read this paper.

counts

Array

An array of pairs where the first element of each pair is one of the unique values found in the field and the second element is the count. Only available when the number of distinct values is less than or equal to 32.

kurtosis

Number

The sample kurtosis. A measure of 'peakiness' or heavy tails in the field's distribution.

maximum

Number

The maximum value found in this field.

mean

Number

The arithmetic mean of non-missing field values.

median

Number

The approximate median of the non-missing values in this field.

minimum

Number

The minimum value found in this field.

missing_count

Integer

Number of instances missing this field.

population

Integer

The number of instances containing data for this field.

skewness

Number

The sample skewness. A measure of asymmetry in the field's distribution.

standard_deviation

Number

The unbiased sample standard deviation.

sum

String

Sum of all values for this field (for mean calculation).

sum_squares

String

Sum of squared values (for variance calculation).

variance

Number

The unbiased sample variance.

Categorical Summary

Categorical summaries give you a count per each category and missing
count in case any of the instances contain missing values.

Categorical Summary Object Properties

Property

Type

Description

counts

Array

An array of pairs where the first element of each pair is one of the unique categories found in the field and the second element is the count for that category.

missing_count

Integer

Number of instances missing this field.

Text Summary

Text summaries give statistics about the vocabulary of a text field,
and the number of instances containing missing values.


Dataset Status

Before a dataset is successfully created,
BigML.io makes sure that it has been uploaded in an understandable
format, that the data that it contains is parseable, and that the types for each column
in the data can be inferred successfully. The dataset goes through a number of
states until all these analyses are completed. Through the status field in the
dataset you can determine when the dataset has been
fully processed and is ready to be used to create a
model. These are the fields that a dataset's status has:

Dataset Status Object Properties

Property

Type

Description

bytes

Integer

Number of bytes processed so far.

code

Integer

A status code that reflects the status of the dataset creation. It can be any of those that are explained here.

elapsed

Integer

Number of milliseconds that BigML.io took to process the dataset.

field_errors

Object

Information about ill-formatted fields that includes the total format errors for the field and a sample of the ill-formatted tokens.
Example:

Filtering and Paginating Fields from a Dataset

A dataset might be composed of hundreds or even thousands of
fields. Thus when retrieving a dataset,
it's possible to specify that only a subset of fields be retrieved, by using any combination of the following
parameters in the query string (unrecognized parameters are ignored):

Fields Filter
Parameters

Parameter

Type

Description

fieldsoptional

Comma-separated list

A comma-separated list of field IDs to retrieve.
Example: fields=000000,000002"

fulloptional

Boolean

If false, no information about fields is returned.
Example: "full=false"

iprefixoptional

String

A case-insensitive string to retrieve fields whose names start with the given prefix. It is possible to specify more than one iprefix by repeating the parameter, in which case the union of the results is returned.
Example: "iprefix=INCOME"

limitoptional

Integer

Maximum number of fields that you will get in the fields field.
Example: "limit=100"

offsetoptional

Integer

How far off from the first field in your dataset is the first field in the fields field.
Example: "offset=100"

prefixoptional

String

A case-sensitive string to retrieve fields whose names start with the given prefix. It is possible to specify more than one prefix by repeating the parameter, in which case the union of the results is returned.
Example: "prefix=income"

Since fields is a map and therefore not
ordered, the returned fields contain an additional key, order,
whose integer (increasing) value gives you their ordering. In all other
respects, the dataset is the same as the one you would get without any
filtering parameter above.

The fields_meta field can help you paginate fields. Its
structure is as follows:

Fields Meta Object
Objects Properties

Property

Type

Description

countoptional

Integer

Specifies the current number of fields in the resource.

limitoptional

Integer

The maximum number of fields that will be returned in the resource.

offsetoptional

Integer

The current offset in the pagination of fields.

totaloptional

Integer

The total number of fields in the resource.

Note that paginating fields might only be worthwhile if you are going to
deal with really wide datasets (i.e., more than 200 fields).

Updating a Dataset

To update a dataset,
you need to PUT an object containing the fields that you want to update to the
dataset's base URL.
The content-type must always be: "application/json".
If the request succeeds, BigML.io will return
an HTTP 202 response
with the updated dataset.

For example, to update a dataset with a new name you can use curl like this:
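(The dataset/id below is illustrative.)

curl "https://bigml.io/dataset/4f66a80803ce8940c5000006?$BIGML_AUTH" \
  -X PUT \
  -H 'content-type: application/json' \
  -d '{"name": "my new dataset"}'

$ Updating a dataset's name from the command line

Deleting a Dataset

To delete a dataset, you need to issue an HTTP DELETE request to the dataset/id to be deleted. Using curl you can do something like this:

curl -X DELETE "https://bigml.io/dataset/4f66a80803ce8940c5000006?$BIGML_AUTH"

$ Deleting a dataset from the command line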

If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return "204 no content" responses with no body.

Once you delete a dataset,
it is permanently deleted. That is, a delete request cannot be undone.
If you try to delete a dataset
a second time, or a dataset that
does not exist, you will receive a "404 not found" response.

However, if you try to delete a dataset
that is being used at the moment, then BigML.io will not accept the request and
will respond with a "400 bad request" response.

Multi-Datasets

BigML.io also allows you to create a new dataset by
merging multiple datasets. This functionality can be very useful when
you use multiple sources of data, and in online scenarios as well.
Imagine, for example, that you collect data on an hourly basis and want to create a dataset aggregating
the data collected over the whole day. You only need to send the newly
generated data each hour to BigML, create a source and a dataset for
each batch, and then merge all the individual datasets into one at the end of the day.

We usually call datasets created in this way
multi-datasets. BigML.io allows you
to aggregate up to 32 datasets in the same API request. You can merge
multi-datasets too, so you can basically grow a dataset as much as you
want.

To create a multi dataset, you can
specify a list of dataset identifiers as input using the argument
origin_datasets. The example below will construct a
new dataset that is the concatenation of three other datasets.
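(The dataset/ids below are illustrative.)

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_datasets": [
        "dataset/52bc7fc83c1920e4a3000012",
        "dataset/52bc7fc83c1920e4a3000016",
        "dataset/52bc7fc83c1920e4a300001a"]}'

$ Creating a multi-dataset from the command line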

By convention, the first dataset defines the final dataset fields.
However, there can be cases where each dataset might come
from a different source and therefore have different field ids. In
these cases, you might need to use a fields_maps argument to match each field in a dataset
to the fields of the first dataset.
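For instance, a request along these lines matches the description that follows (the second, third, and fourth dataset/ids are illustrative):

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_datasets": [
        "dataset/52bc7fc83c1920e4a3000012",
        "dataset/52bc7fc83c1920e4a3000016",
        "dataset/52bc7fc83c1920e4a300001a",
        "dataset/52bc7fc83c1920e4a300001e"],
       "fields_maps": {
         "dataset/52bc7fc83c1920e4a3000016":
           {"000001": "000023", "000002": "000024", "000003": "00003a"},
         "dataset/52bc7fc83c1920e4a300001a":
           {"000001": "000023", "000002": "000004", "000003": "00000f"}}}'

$ Creating a multi-dataset with fields maps from the command line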

For instance, in the request above, we use four datasets as input. The
first one would define the final dataset fields. For instance, let's
say that the dataset "dataset/52bc7fc83c1920e4a3000012" in this example
has three fields with identifiers "000001",
"000002" and "000003". Those will be
the default resulting fields, together with their datatypes and so on.
Then we need to specify, for each of the remaining datasets in the
list, a mapping from the "standard" fields to those in the
corresponding dataset. In our example, we're saying that the fields of
the second dataset to be used during the concatenation are
"000023", "000024" and
"00003a", which correspond to the final fields having
them as keys. In the case of the third dataset, the fields used will be
"000023", "000004" and
"00000f". For the last one, since there's no entry in
fields_maps, we'll try to use the same identifiers as those of the first dataset.

The optypes of the paired fields should match, and for
the case of categorical fields, be a proper subset. If a final field
has optype text, however, all values are converted to strings.

BigML.io also allows you to sample each dataset
individually before merging it. You can specify the sample options for
each dataset using the arguments sample_rates,
replacements, seeds, and
out_of_bags. All are dictionaries that must be keyed
using the dataset/id of the dataset you want to specify parameters
for. The next request will create a multi-dataset sampling
the two input datasets differently.
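(The dataset/ids and rates below are illustrative.)

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_datasets": [
        "dataset/52bc7fc83c1920e4a3000012",
        "dataset/52bc7fc83c1920e4a3000016"],
       "sample_rates": {
         "dataset/52bc7fc83c1920e4a3000012": 0.5,
         "dataset/52bc7fc83c1920e4a3000016": 0.8}}'

$ Sampling the datasets of a multi-dataset from the command line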

out_of_bagsoptional

Object

A dictionary keyed by dataset/id with boolean values. Setting this parameter to true for a dataset will return a dataset containing a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example:

In addition to the arguments above, you can use all the regular
arguments to clone, sample,
filter, and extend a dataset that
were explained in the Transformations section.
In those cases, the flow that BigML.io follows to build a new dataset from multiple datasets is:

Sample each individual dataset according to the specifications
provided in the arguments sample_rates, replacements,
seeds, and out_of_bags.

Merge all the datasets together using the
fields_maps argument to match fields in case they
come from different sources (i.e., have different field ids).

Sample the merged dataset as in regular
dataset sampling, using the arguments sample_rate, replacement,
seed, and out_of_bag.

Filter the sampled dataset using
input_fields, excluded_fields,
and either a json_filter or
lisp_filter.

Extend the dataset with new fields according to the specifications provided in
the new_fields argument.

Filter the output of the new fields using
either an output_json_filter or
output_lisp_filter.

Resources Accepting Multi-Datasets Input

You can also create a model using multiple datasets as input at once.
That is, without merging all the datasets together into a new dataset
first. The same applies to correlations, statistical tests, ensembles, clusters,
anomaly detectors, and evaluations.
All the multi-dataset arguments above can be used. You just need to use the
datasets argument instead of the regular
dataset.

See examples below to create a multi-dataset model, a
multi-dataset ensemble, and a multi-dataset evaluation.
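For instance, a multi-dataset model request might look like this (the dataset/ids are illustrative); ensembles and evaluations follow the same pattern against their own base URLs:

curl "https://bigml.io/model?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"datasets": [
        "dataset/52bc7fc83c1920e4a3000012",
        "dataset/52bc7fc83c1920e4a3000016"]}'

$ Creating a multi-dataset model from the command line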

Transformations

Once you have created a dataset, BigML.io allows you to derive new datasets from it by
sampling, filtering, adding new fields, or concatenating it to other
datasets. We apply the term dataset transformations (or just
transformations for short) to the set of operations that create new, modified versions of your original dataset.

We use the term:

Cloning for the general operation of generating a new dataset.

Sampling when the original dataset is sampled.

Filtering when the original dataset is filtered.

Extending when new fields are generated.

Merging when a multi-dataset is created.

Keep in mind that you can sample, filter and extend a dataset all at
once in only one API request.

So let's start with the most basic transformation: cloning a dataset.

Cloning a Dataset

To clone a dataset you just need to use the origin_dataset argument
to send the dataset/id of the dataset that you want to
clone. For example:
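(The dataset/id below is illustrative.)

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/52bc7fc83c1920e4a3000012"}'

$ Cloning a dataset from the command line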

You can also give the new dataset a
category, name,
description, and
tags different from those of the original one. Also, when cloning a dataset,
you can modify the names, labels, descriptions, and preferred flags of
its fields using a fields argument with entries for
those fields you want to change. See a description of all the arguments below.

Dataset Cloning
Arguments

Argument

Type

Description

categoryoptional

Integer

The category that best describes the dataset. See the category codes for the complete list of categories.
Example:

"category": 1

fieldsoptional

Object

Updates the names, labels, and descriptions of the fields in the new dataset. An entry keyed with the field id of the original dataset for each field that will be updated.
Example:

out_of_bagoptional

Boolean, default is false

Setting this parameter to true will return a dataset containing a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example:

"out_of_bag": true

replacementoptional

Boolean, default is false

Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example:

"replacement": true

sample_rateoptional

Float, default is 1.0

A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example:

"sample_rate": 0.5

seedoptional

String

A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample"

Filtering a Dataset

A dataset can be filtered in different ways:

Excluding a few fields using the excluded_fields argument.

Selecting only a few fields using the input_fields argument.

Filtering rows using a json_filter or lisp_filter
similarly to the way you can filter a source.

Specifying a range of rows.

As illustrated in the following example, it's possible to provide a
list of input fields, selecting the fields from the filtered input
dataset that will be created. Filtering happens before field picking and, therefore, the row filter can use fields that won't end up in the cloned dataset.
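For instance (ids and values are illustrative; note that the filter uses field "000002", which is not among the input fields):

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/52bc7fc83c1920e4a3000012",
       "input_fields": ["000001", "000003"],
       "json_filter": [">", 3.14, ["field", "000002"]]}'

$ Filtering a dataset from the command line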

The following table shows the complete list of arguments that you can
use to filter a dataset.

Dataset Filtering
Arguments

Argument

Type

Description

excluded_fieldsoptional

Array

Specifies the fields that won't be included in the new dataset.
Example:

["000000", "000002"]

input_fieldsoptional

Array

Specifies the fields to be included in the dataset.
Example:

["000001", "000003"]

json_filteroptional

Array

A JSON list representing a filter over the rows in the origin dataset. The first element is an operator and the rest of the elements its arguments. See the Section on filtering sources for more details.
Example:

"json_filter": [">", 3.14, ["field", "000002"]]

lisp_filteroptional

String

A string representing a Lisp s-expression to filter rows from the origin dataset.
Example:

"lisp_filter": "(> 3.14 (field 2))"

rangeoptional

Array

The range of successive instances to create the new dataset.
Example:

"range": [100, 200]

Extending a Dataset

You can clone a dataset and extend it with brand new fields using the
new_fields argument. Each new field is created using
a Flatline expression and optionally a
name, label, and
description.

A Flatline expression is a lisp-like expression that
allows you to reference and process columns and rows of the
origin dataset. See the full Flatline reference here.
Let's see a first example that clones a dataset and adds a new field named "Celsius"
to it using an expression that converts the values from the "Fahrenheit"
field to Celsius.
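(A sketch; the dataset/id is illustrative.)

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/52bc7fc83c1920e4a3000012",
       "new_fields": [{
         "field": "(/ (* 5 (- (f \"Fahrenheit\") 32)) 9)",
         "name": "Celsius"}]}'

$ Adding a new field to a dataset from the command line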

When you clone a dataset adding new fields, by default the rest of the fields in the origin
dataset are added to the new dataset. If you only want to keep the new
fields, you can set the all_fields argument to
false. You can also use the argument
all_but to exclude fields that you do not want in
the new dataset.
A new field can actually generate multiple fields. In that case,
their names can be specified using the names argument.

Note also that field references can be built using either field or
f and use the field id, the field
name, or its column number.
In addition to horizontally selecting different fields in the same row, you
can keep the field fixed and select vertical windows of its value, via the
window and related operators. For example, the following request will
generate a new field using a sliding window of 7 values for the field named
"Fahrenheit" and will also generate two additional fields
named "Yesterday" and "Tomorrow" with the
previous and next value of the current row for the field 0.
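A sketch of such a request (the dataset/id and generated names are illustrative, and the window and shift syntax is assumed from the Flatline reference):

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/52bc7fc83c1920e4a3000012",
       "new_fields": [
         {"field": "(window \"Fahrenheit\" -3 3)",
          "names": ["F-3", "F-2", "F-1", "F", "F+1", "F+2", "F+3"]},
         {"field": "(f 0 -1)", "name": "Yesterday"},
         {"field": "(f 0 1)", "name": "Tomorrow"}]}'

$ Extending a dataset with sliding windows from the command line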

Filtering the New Fields Output

The generation of new fields works by traversing the input dataset row by row and applying
the Flatline expression of each new field to each row in turn. The list of values generated
from each input row that way constitutes an output row of the generated dataset.

It is possible to limit the number of input rows that the generator sees by
means of filters and/or sample specifications, for example:
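(A sketch; the dataset/id, rate, and filters are illustrative.)

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/52bc7fc83c1920e4a3000012",
       "sample_rate": 0.8,
       "lisp_filter": "(> (f \"Price\") 0)",
       "new_fields": [{"field": "(log10 (f \"Price\"))", "name": "Log price"}]}'

$ Sampling and filtering the input rows from the command line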

And, as an additional convenience, it is also possible to specify either an
output_lisp_filter or an output_json_filter,
that is, a Flatline row filter that will act on the generated rows instead of on the input data:
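(A sketch; the dataset/id and filter are illustrative.)

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/52bc7fc83c1920e4a3000012",
       "new_fields": [{"field": "(f 0 1)", "name": "Tomorrow"}],
       "output_lisp_filter": "(> (f \"Tomorrow\") 0)"}'

$ Filtering the generated rows from the command line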

You can also skip any number of rows in the input, starting the generation at
an offset given by row_offset, and traverse the input rows by
any step specified by row_step. For instance, the following request will generate
a dataset whose rows are created by putting together every three consecutive values of the input field "Price":
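(A sketch; the dataset/id is illustrative.)

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/52bc7fc83c1920e4a3000012",
       "all_fields": false,
       "row_offset": 2,
       "row_step": 3,
       "new_fields": [{
         "field": "(window \"Price\" -2 0)",
         "names": ["Price-2", "Price-1", "Price"]}]}'

$ Generating rows from vertical windows of a field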

With the specification above, the new field will start with the third row in the input dataset,
generate an output row (which uses values from the current input row as well as from the two previous ones),
skip to the 6th input row, generate a new output, and so on and so forth.

Next, we'll list all the arguments that can be used to extend a dataset.

Dataset Extending
Arguments

Argument

Type

Description

all_butoptional

Array

Specifies the fields to be excluded from the new dataset.
Example:

"all_but": ["000001", "000003"]

all_fieldsoptional

Boolean

Whether all fields should be included in the new dataset or not.
Example:

"all_fields": false

new_fieldsoptional

Array

Specifies the new fields to be included in the dataset. See the table below for more details.
Example:

"new_fields": [{"field": "(log10 (field "000001"))", "name": "log"}]

output_json_filteroptional

Array

A JSON list representing a filter over the rows of the dataset once the new fields have been generated. The first element is an operator and the rest of the elements its arguments. See the Section on filtering rows for more details.
Example:

"output_json_filter": [">", 3.14, ["field", "000002"]]

output_lisp_filteroptional

String

A string representing a Lisp s-expression to filter rows after the new fields have been generated.
Example:

"output_lisp_filter": "(> 3.14 (field 2))"

row_offsetoptional

Integer

The initial number of rows to skip from the input dataset before starting to process rows.
Example:

"row_offset": 2

Lisp and JSON Syntaxes

Flatline also has a JSON-like flavor with exactly the same semantics as
the lisp-like version. Basically, a Flatline expression can easily be
translated to its JSON-like variant and vice versa by just changing parentheses to brackets,
symbols to quoted strings, and adding commas to separate sub-expressions.
For example, the following two expressions are the same for
BigML.io.

"(/ (* 5 (- (f Fahrenheit) 32)) 9)"

Lisp-like expression

["/", ["*", 5, ["-", ["f", "Fahrenheit"], 32]], 9]

JSON-like expression

Final Remarks

A few important details that you should keep in mind:

Cloning a dataset implies also creating a copy of its serialized form,
so you get an asynchronous resource with a status that evolves from the
Summarized (4) to the Finished (5)
state.

If you specify both sampling and filtering arguments, the former are applied first.

As with filters applied to datasources, dataset filters can use the full Flatline language to specify the boolean expression to use when sifting the input.

Flatline performs type inference, and will in general figure out the proper optype for the generated fields, which are subsequently summarized by the dataset creation process, reaching then their final datatype (just as with a regular dataset created from a datasource). In case you need to fine-tune Flatline's inferences, you can provide an optype (or optypes) key and value in the corresponding output field entry (together with generator and names), but in general this shouldn't be needed.

Please check the Flatline reference manual for a full description of the language for field generation and the many pre-built functions it provides.

Samples

Last Updated: Thursday, 2018-02-22 12:54

A sample provides fast access to the raw data of a
dataset on an on-demand basis.

When a new sample is requested, a copy of the dataset
is stored in a special format in an in-memory cache. Multiple, different samples
of the data can then be extracted using parameterized HTTPS requests
that specify sample sizes and simple query string filters.

Samples are ephemeral. That is to say, a
sample will be available as long as GETs are requested
within periods smaller than a pre-established TTL (Time to Live). The
expiration timer of a sample is reset every time a new GET is received.

If requested, a sample can also perform
linear regression and
compute Pearson's
and Spearman's correlations
for either one numeric field against all other numeric fields or between two specific numeric fields.

Sample Base URL

You can use the following base URL to create, retrieve, update, and
delete samples.

https://bigml.io/sample

Sample base URL

All requests to manage your samples must use HTTPS
and be authenticated using your username and API key to verify
your identity. See this section for more details.

Creating a Sample

To create a new sample, you need to POST to the
sample base URL an object containing at least
the dataset/id that you want to use to create the
sample.
The content-type must always be "application/json".

You can easily create a new sample using
curl as follows. All you need is a valid
dataset/id and your authentication variable set up as
shown above.
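(The dataset/id below is illustrative.)

curl "https://bigml.io/sample?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/52bc7fc83c1920e4a3000012"}'

$ Creating a sample from the command line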

If you do not specify a name, BigML.io will assign to
the new sample the dataset's name.

Retrieving a Sample

Each sample has a unique identifier in the form
"sample/id" where id is a string of
24 alpha-numeric characters that you can use to retrieve the
sample.

To retrieve a sample with curl:

curl "https://bigml.io/sample/54d9c6f4f0a5ea0b1600003a?$BIGML_AUTH"

$ Retrieving a sample from the command line

You can also use your browser to visualize the sample
using the full BigML.io URL or pasting the
sample/id into the BigML.com dashboard.

Sample Properties

Once a sample has been
successfully created it will have the following properties.

Sample Properties

Property

Type

Description

category
filterable,
sortable,
updatable

Integer

One of the categories in the table of categories that help classify this resource according to the domain of application.

code

Integer

HTTP status code. This will be 201 upon successful creation of the sample and 200 afterwards. Be sure to check the code that comes with the status attribute to verify that the sample creation has completed without errors.

updated
filterable,
sortable

String

This is the date and time in which the sample was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC).

A Sample Object has the following properties:

Sample Object Properties

Property

Type

Description

fields
updatable

Array

A list with an element per field in the dataset used to build the sample. Fields are paginated according to the field_meta attribute. Each entry includes the column number in the original dataset, the name of the field, the type of the field, and the summary. See this Section for more details.

rows

Array of Arrays

A list of lists representing the rows of the sample. Values in each list are ordered according to the fields list.

Sample Status

Through the status field in the
sample you can determine when the sample has been
fully processed and is ready to be used. These are the fields that a
sample's status has:

Sample Status Object Properties

Property

Type

Description

code

Integer

A status code that reflects the status of the sample creation. It can be any of those that are explained here.

Filtering and Paginating Fields from a Sample

A sample might be composed of hundreds or even thousands of
fields. Thus when retrieving a sample,
it's possible to specify that only a subset of fields be retrieved, by using any combination of the following
parameters in the query string (unrecognized parameters are ignored):

Fields Filter
Parameters

Parameter

Type

Description

fieldsoptional

Comma-separated list

A comma-separated list of field IDs to retrieve.
Example: fields=000000,000002"

fulloptional

Boolean

If false, no information about fields is returned.
Example: "full=false"

iprefixoptional

String

A case-insensitive string to retrieve fields whose names start with the given prefix. It is possible to specify more than one iprefix by repeating the parameter, in which case the union of the results is returned.
Example: "iprefix=INCOME"

limitoptional

Integer

Maximum number of fields that you will get in the fields field.
Example: "limit=100"

offsetoptional

Integer

How far off from the first field in your dataset is the first field in the fields field.
Example: "offset=100"

prefixoptional

String

A case-sensitive string to retrieve fields whose names start with the given prefix. It is possible to specify more than one prefix by repeating the parameter, in which case the union of the results is returned.
Example: "prefix=income"

Since fields is a map and therefore not
ordered, the returned fields contain an additional key, order,
whose integer (increasing) value gives you their ordering. In all other
respects, the sample is the same as the one you would get without any
filtering parameter above.

The fields_meta field can help you paginate fields. Its
structure is as follows:

Fields Meta Object
Objects Properties

Property

Type

Description

countoptional

Integer

Specifies the current number of fields in the resource.

limitoptional

Integer

The maximum number of fields that will be returned in the resource.

offsetoptional

Integer

The current offset in the pagination of fields.

totaloptional

Integer

The total number of fields in the resource.

Note that paginating fields might only be worthwhile if you are going to
deal with really wide samples (i.e., more than 200 fields).

Filtering Rows from a Sample

A sample might be composed of thousands or even
millions of rows. Thus when retrieving a sample, it's
possible to specify that only a subset of rows be retrieved, by using
any combination of the following parameters in the query string
(unrecognized parameters are ignored). BigML will never return more
than 1000 rows in the same response. However, you can send additional
requests to get different random samples.

Filtering Rows from a Sample
Parameters

Parameter

Type

Description

!field=optional

Blank

With field the identifier of a field, select only those rows where field is not missing (i.e., it has a definite value).
Example:

"!000002="

!field=from,tooptional

List

With field the identifier of a numeric field, returns the values not in the specified interval. As with inclusion, it's possible to include or exclude the boundaries of the specified interval using square or round brackets.
Example:

"!000000=[-10,10)": field 000000 < -10 or >= 10

"!000000=(-10,10]": field 000000 <= -10 or > 10

"!000000=(-10,10)": field 000000 <= -10 or >= 10

"!000000=[-10,10]": field 000000 < -10 or > 10

!field=valueoptional

List

With field the identifier of a numeric field, returns rows for which the field doesn't equal that value.
Example:

"!000000=2": field 0000000 doesn't equal 2

!field=value1&!field=value2&...optional

String

With field the identifier of a categorical field, select only those rows with the value of that field not one of the provided categories (when the parameter is repeated).
Example:

"!000002=iris-setosa&!000002=iris-versicolor"

field=optional

Blank

With field the identifier of a field, select only those rows where field is missing.
Example:

"000002=&000002=iris-setosa": includes rows with either "iris-setosa" or missing.

field=from,tooptional

List

With field the identifier of a numeric field and from, to optional numbers, specifies a filter for the numeric values of that field in the range [from, to]. One of the limits can be omitted.
Example:

"000000=-10,10": field 000000 between -10 and 10, included

"000001=,3": field 000001 less or equal to 3

"00000a=-20,": field 00000a greater than or equal to -20

It is possible to specify whether the interval should include its boundaries with the usual [] or () brackets, as in:

"000000=[-10,10)": -10 <= field 000000 < 10

"000000=(-10,10]": -10 < field 000000 <= 10

"000000=(-10,10)": -10 < field 000000 < 10

"000000=[-10,10]": -10 <= field 000000 <= 10

field=valueoptional

List

With field the identifier of a numeric field, returns rows for which the field equals that value.
Example:

"000000=2": field 0000000 equals 2

field=value1&field=value2&...optional

String

With field the identifier of a categorical field, select only those rows with the value of that field one of the provided categories (when the parameter is repeated).
Example:

"000002=iris-setosa&000002=iris-versicolor"

indexoptional

Boolean

When set to true, every returned row will have a first extra value which is the absolute row number, i.e., a unique row identifier. This can be useful, for instance, when you're performing various GET requests and want to compute the union of the returned regions.
Example: index=true

modeoptional

String

One amongst deterministic, random, or linear. The way we sample the resulting rows, if needed; random means a random sample, deterministic is also random but using a fixed seed so that it's repeatable, and linear means that BigML just returns the first size rows after filtering; defaults to "deterministic".
Example: mode=random

occurrenceoptional

Boolean

When set to true, each row is prepended with a value that denotes the number of times the row was present in the sample. You'll want this only when unique is set to true; otherwise all those extra values will be equal to 1. When index is also set to true (see above), the multiplicity column is added after the row index.
Example: occurrence=true

precisionoptional

Integer

The number of significant decimal digits to keep in the returned values, for fields of type float or double. For instance, if you set precision=0, all returned numeric values will be truncated to their integral part.
Example: precision=2

row_fieldsoptional

List

You can provide a list of field identifiers to be present in the sample's rows, specifying which ones you actually want to see and in which order.
Example: row_fields=000000,000002

row_offsetoptional

Integer

Skip the given number of rows. Useful when paginating over the sample in linear mode.
Example: row_offset=300

row_order_byoptional

String

A field that causes the returned rows to be sorted by the value of the given field, in ascending order or, when the - prefix is used, in descending order.
Example: row_order_by=-000000

rowsoptional

Integer

The total number of rows to be returned; if less than the number resulting from the rest of the filter parameters, the latter will be sampled according to mode.
Example: rows=300

seedoptional

String

When mode is random, you can specify your own seed in this parameter; otherwise, we choose it at random, and return the value we've used in the body of the response: that way you can make a random sampling deterministic if you happen to like a particular result.
Example: seed=mysample

stat_fieldoptional

String

A field_id that corresponds to the identifier of a numeric field will cause the answer to include the Pearson's and Spearman's correlations, and linear regression terms of this field with all other numeric fields in the sample. Those values will be returned in maps keyed by "other" field id and named spearman_correlations, pearson_correlations, slopes, and intercepts.
Example: stat_field=000000

stat_fieldsoptional

String

Two field_ids that correspond to the identifiers of numeric fields will cause the answer to include the Pearson's and Spearman's correlations, and linear regression terms between the two fields. Those values will be returned in maps named spearman_correlation, pearson_correlation, slope, and intercept.
Example: stat_fields=000000,000003

uniqueoptional

Boolean

When set to true, repeated rows will be removed from the sample.
Example: unique=true

Updating a Sample

To update a sample,
you need to PUT an object containing the fields that you want to update to the
sample's base URL.
The content-type must always be: "application/json".
If the request succeeds, BigML.io will return
an HTTP 202 response
with the updated sample.

For example, to update a sample with a new name you can use curl like this:
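(The sample/id below matches the retrieval example above.)

curl "https://bigml.io/sample/54d9c6f4f0a5ea0b1600003a?$BIGML_AUTH" \
  -X PUT \
  -H 'content-type: application/json' \
  -d '{"name": "my new sample"}'

$ Updating a sample's name from the command line

Deleting a Sample

To delete a sample, you need to issue an HTTP DELETE request to the sample/id to be deleted. Using curl you can do something like this:

curl -X DELETE "https://bigml.io/sample/54d9c6f4f0a5ea0b1600003a?$BIGML_AUTH"

$ Deleting a sample from the command line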

If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return "204 no content" responses with no body.

Once you delete a sample,
it is permanently deleted. That is, a delete request cannot be undone.
If you try to delete a sample
a second time, or a sample that
does not exist, you will receive a "404 not found" response.

However, if you try to delete a sample
that is being used at the moment, then BigML.io will not accept the request and
will respond with a "400 bad request" response.

Correlations

Last Updated: Thursday, 2018-02-22 12:54

A correlation resource allows you to compute advanced statistics
for the fields in your dataset by applying various exploratory data analysis techniques
to compare the distributions of the fields in your dataset against an objective_field.

Correlation Base URL

You can use the following base URL to create, retrieve, update, and
delete correlations.

https://bigml.io/correlation

Correlation base URL

All requests to manage your correlations must use HTTPS
and be authenticated using your username and API key to verify
your identity. See this section for more details.

Creating a Correlation

To create a new correlation, you need to POST to the
correlation base URL an object containing at least
the dataset/id that you want to use to create the
correlation. The content-type must always be
"application/json".

You can easily create a new correlation using
curl as follows. All you need is a valid
dataset/id and your authentication variable set up as
shown above.
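
A minimal sketch of such a request (the dataset/id is illustrative, reused from the examples in this document):

curl "https://bigml.io/correlation?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006"}'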

Correlation Arguments

In addition to the dataset, you can also POST the
following arguments.

Correlation Creation
Arguments

Argument

Type

Description

categoriesoptional

Object,default is {}, an empty dictionary. That is, no categories are specified.

A dictionary between input field id and an array of categories to limit the analysis to. Each array must contain 2 or more unique and valid categories in string format. If omitted, each categorical field is limited to its 100 most frequent categorical values. This field has no impact if the input fields are non-categorical.
Example:

fieldsoptional

Object,default is {}, an empty dictionary. That is, no names or preferred statuses are changed.

This can be used to change the names of the fields in the correlation with respect to the original names in the dataset or to tell BigML that certain fields should be preferred. Use an entry keyed by the field id generated in the source for each field whose name you want updated.
Example:

{
"000001": {"name": "length_1"},
"000003": {"name": "length_2"}
}

input_fieldsoptional

Array,default is []. All the fields in the dataset

Specifies the fields to be considered to create the correlation.
Example:

["000001", "000003"]

nameoptional

String,default is dataset's name

The name you want to give to the new correlation.
Example: "my new correlation"

objective_fieldoptional

String,default is dataset's pre-defined objective field

The id of the field to be used as the objective for correlation tests.
Example: "000001"

out_of_bagoptional

Boolean,default is false

Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true

projectoptional

String

The project/id you want the correlation to belong to.
Example: "project/54d98718f0a5ea0b16000000"

rangeoptional

Array,default is [1, max rows in the dataset]

The range of successive instances to build the correlation.
Example: [1, 150]

replacementoptional

Boolean,default is false

Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true

sample_rateoptional

Float,default is 1.0

A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5

seedoptional

String

A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample"

significance_levelsoptional

Array,default is [0.01, 0.05, 0.1]

An array of significance levels between 0 and 1 to test against p_values.
Example: [0.01, 0.025, 0.05, 0.075, 0.1]

tagsoptional

Array of Strings

A list of strings that help classify and index your correlation.
Example: ["best customers", "2018"]

Discretization is used to transform numeric input fields to categoricals before further processing. It is applied globally to all input fields.
A Discretization object is composed of any combination of the following properties.

Discretization
Parameters

Parameter

Type

Description

prettyoptional

Boolean,default is true

When enabled, the bin locations and widths will be chosen to be easy-to-read numbers; specifically, they will be 1, 2, 2.5, or 5 times a power of 10.
Example: false

sizeoptional

Integer,default is 32

The number of equal width bins. If pretty is enabled then this value acts as a maximum size, but the actual number of bins may be lower.
Example: 12

trimoptional

Float,default is 0

A real number between 0 and 0.1 specifying the portion of the overall population that may be removed from either tail of the distribution. For example, 0.01 indicates that 1 percent of the data may be removed from either tail. Default is 0, however, 0.01 often gives nice looking results.
Example: 0.01

typeoptional

String,default is "population"

Whether the field is discretized using an equal width or equal population strategy.
Example: "width"

For example, let's say type is set to "width", size is 7, trim is 0.05,
and pretty is false. This requests that numeric input fields be discretized into 7 bins of equal width, trimming the outer 5% of counts, and not rounding bin boundaries.
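
Expressed as a discretization object, that example would look like this:

{
"type": "width",
"size": 7,
"trim": 0.05,
"pretty": false
}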

Field Discretizations is also used to transform numeric input fields to categoricals before further processing.
However, it allows the user to specify parameters on a per field basis, taking precedence over the global discretization.
It is a map whose keys are field ids and whose values are maps with the same format as discretization.
It also accepts edges, which is a numeric array manually specifying edge boundary locations.
If this parameter is present, the corresponding field will be discretized according to those defined bins,
and the remaining discretization parameters will be ignored.
The maximum value of the field's distribution is automatically set as the last value in the edges array.
A value object of a Field Discretizations object is composed of any combination of the following properties.

Field Discretizations
Parameters

Parameter

Type

Description

edgesoptional

Array

A numeric array manually specifying edge boundary locations. If this parameter is present the corresponding field will be discretized according to those defined bins, and the remaining discretization parameters will be ignored. The maximum value of the field's distribution is automatically set as the last value in the edges array.
Example: [1.0,3.3,9.4]

prettyoptional

Boolean,default is true

When enabled, the bin locations and widths will be chosen to be easy-to-read numbers; specifically, they will be 1, 2, 2.5, or 5 times a power of 10.
Example: false

sizeoptional

Integer,default is 32

The number of equal width bins. If pretty is enabled then this value acts as a maximum size, but the actual number of bins may be lower.
Example: 12

trimoptional

Float,default is 0

A real number between 0 and 0.1 specifying the portion of the overall population that may be removed from either tail of the distribution. For example, 0.01 indicates that 1 percent of the data may be removed from either tail. Default is 0, however, 0.01 often gives nice looking results.
Example: 0.01

typeoptional

String,default is "population"

Whether the field is discretized using an equal width or equal population strategy.
Example: "width"

You can also use curl to customize a new correlation.
For example, to create a new correlation named "my correlation", with only certain rows,
and with only three fields:
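
A sketch of such a request (the dataset/id, range, and field ids are illustrative):

curl "https://bigml.io/correlation?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006",
       "name": "my correlation",
       "range": [1, 150],
       "input_fields": ["000001", "000002", "000003"]}'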

If you do not specify a name, BigML.io will assign to
the new correlation the dataset's name.
If you do not specify a range of instances, BigML.io
will use all the instances in the dataset.
If you do not specify any input fields,
BigML.io will include all the input fields in the dataset.

Read the Section on Sampling Your Dataset
to learn how to sample your dataset. Here's an example of a correlation request with range and sampling specifications:
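
A sketch (values are illustrative):

curl "https://bigml.io/correlation?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006",
       "range": [1, 150],
       "sample_rate": 0.5,
       "replacement": true}'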

You can also use your browser to visualize the correlation
using the full BigML.io URL or pasting the
correlation/id into the BigML.com dashboard.

Correlation Properties

Once a correlation has been
successfully created it will have the following properties.

Correlation Properties

Property

Type

Description

category
filterable,
sortable,
updatable

Integer

One of the categories in the table of categories that help classify this resource according to the domain of application.

code

Integer

HTTP status code. This will be 201 upon successful creation of the correlation and 200 afterwards. Check the code that comes with the status attribute to make sure that the correlation creation has been completed without errors.

columns
filterable,
sortable

Integer

The number of fields in the correlation.

correlations

Object

All the information that you need to recreate the correlation. It includes the field's dictionary describing the fields and their summaries, and the correlations. See the Correlations Object definition below.

created
filterable,
sortable

ISO-8601 Datetime

This is the date and time in which the correlation was created with microsecond precision. It follows this pattern: yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC).

credits
filterable,
sortable

Float

The number of credits it cost you to create this correlation.

dataset
filterable,
sortable

String

The dataset/id that was used to build the correlation.

dataset_status
filterable,
sortable

Boolean

Whether the dataset is still available or has been deleted.

description
updatable

String

A text describing the correlation. It can contain restricted markdown to decorate the text.

excluded_fields

Array

The list of field ids that were excluded when building the correlation.

fields_meta

Object

A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset and limit, and the number of fields (count) returned.

input_fields

Array

The list of input fields' ids used to build the correlation.

locale

String

The dataset's locale.

max_columns
filterable,
sortable

Integer

The total number of fields in the dataset used to build the correlation.

max_rows
filterable,
sortable

Integer

The maximum number of instances in the dataset that can be used to build the correlation.

name
filterable,
sortable,
updatable

String

The name of the correlation as you provided it or, by default, based on the name of the dataset.

objective_field

String,default is dataset's pre-defined objective field

The id of the field to be used as the objective for a correlations test.
Example: "000001"

updated
filterable,
sortable

ISO-8601 Datetime

This is the date and time in which the correlation was updated with microsecond precision. It follows this pattern: yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC).

white_box
filterable,
sortable

Boolean

Whether the correlation is publicly shared as a white-box.

The Correlations Object of a correlation has the following properties.
Some correlation results will contain a p-value and
a significant boolean array, indicating whether the p_value is less than
the provided significance_levels (by default, [0.01, 0.05, 0.1] is used if not provided).
If the p-value is greater than the
accepted significance level,
then the test fails to reject the null hypothesis,
meaning there is no statistically significant difference between the treatment groups.
For example, if the significance levels are [0.01, 0.025, 0.05, 0.075, 0.1] and
the p-value is 0.05, then significant is [false, false, false, true, true].

Correlations Object Properties

Property

Type

Description

categories

Object

A dictionary between input field id and arrays of category names selected for correlations.

fields
updatable

Object

A dictionary with an entry per field in the dataset used to build the test. Fields are paginated according to the field_meta attribute. Each entry includes the column number in the original source, the name of the field, the type of the field, and the summary. See this Section for more details.

The Coefficients Result Object contains the correlation measures
between objective_field and each of the input_fields
when the two fields are numeric-numeric pairs. It has the following properties:

Coefficients Result Object Properties

Property

Type

Description

pearson

Float

A measure of the linear correlation between two variables, giving a value between +1 and -1, where 1 is total positive correlation, 0 is no correlation, and -1 is total negative correlation. See Pearson's correlation coefficients for more information.

pearson_p_value

Float

A function used in the context of null hypothesis testing for pearson correlations in order to quantify the idea of statistical significance of evidence.
Example: 0.015

spearman

Float

A nonparametric measure of statistical dependence between two variables (nonparametric meaning that parameters are determined by the training data, not the model, so the number of parameters grows with the amount of training data). It assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect correlation of +1 or -1 occurs when each of the variables is a perfect monotone function of the other. See Spearman's correlation coefficients for more information.

spearman_p_value

Float

A function used in the context of null hypothesis testing for spearman correlations in order to quantify the idea of statistical significance of evidence.
Example: 0.015

The Contingency Tables Result Object contains the correlation measures
between objective_field and each of the input_fields
when the two fields are both categorical. It has the following properties:

Contingency Tables Result Object Properties

Property

Type

Description

cramer

Float

A measure of association between two nominal variables. Its value ranges between 0 (no association between the variables) and 1 (complete association), and can reach 1 only when the two variables are equal to each other. It is based on Pearson's chi-squared statistic. See Cramer's V for more information.

tschuprow

Float

A measure of association between two nominal variables. Its value ranges between 0 (no association between the variables) and 1 (complete association). It is closely related to Cramer's V, coinciding with it for square contingency tables. See Tschuprow's T for more information.

two_way_table

Array

Contingency Table as a nested row-major array with the frequency distribution of the variables. In other words, the table summarizes the distribution of values in the sample.
Example: [[2514, 362, 78, 38, 23], [889, 53, 39, 2, 1]]

The Chi-Square Object contains the chi-square statistic used to
investigate whether distributions of categorical variables differ from one another.
This test is used to compare a collection of categorical data with
some theoretical expected distribution. The object has the following properties.

Chi-Square Object Properties

Property

Type

Description

chi_square_value

Float

The value of the chi-square statistic.
Example: 1201.60468

p_value

Float

A function used in the context of null hypothesis testing in order to quantify the idea of statistical significance of evidence.
Example: 0.015

significant

Array

A boolean array indicating whether the test produced a significant result at each of the significance_levels. If p_value is less than the significance_level, then it indicates it is significant. The default significance_levels are [0.01, 0.05, 0.1].
Example: [false, true, true]

The One-way ANOVA Result Object contains correlation measures
between objective_field and each of the input_fields
when the two fields are categorical-numerical pairs.
ANOVA is used to compare the means of numerical data samples. The ANOVA tests
the null hypothesis that samples
in two or more groups are drawn from populations with the same mean values.
See One-way Analysis of Variance for more information. The object has the following properties:

One-Way Anova Result Object Properties

Property

Type

Description

eta_square

Float

A measure of effect size, a measure of the strength of the relationship between two variables, for use in ANOVA. Its value ranges between 0 and 1. A rule of thumb is: 0.02 ~ small, 0.13 ~ medium, and 0.26 ~ large. See eta-squared for more information.

f_ratio

Float

The value of the F statistic, which is used to assess whether the expected values of a quantitative variable within several pre-defined groups differ from each other. It is the ratio of the variance calculated among the means to the variance within the samples.

p_value

Float

A function used in the context of null hypothesis testing in order to quantify the idea of statistical significance of evidence.
Example: 0.015

significant

Array

A boolean array indicating whether the test produced a significant result at each of the significance_levels. If p_value is less than the significance_level, then it indicates it is significant. The default significance_levels are [0.01, 0.05, 0.1].
Example: [false, true, true]

An Objective Field Details Object has the following properties.

Objective Field Details Object Properties

Property

Type

Description

column_number

Integer

The column number of the objective field.

datatype

String

The data type of the objective field.

name

String

The name of the objective field.

optype

String

The operation type of the objective field.

order

String

The order of the objective field.

Correlation Status

Creating a correlation is a process that can take just a few
seconds or a few days depending on the size of the
dataset used as input and on the workload of BigML's
systems. The correlation goes through a number of
states until it is fully completed. Through the status field in the
correlation you can determine when the correlation has been
fully processed and is ready to be used. These are the
properties of a correlation's status:

Correlation Status Object Properties

Property

Type

Description

code

Integer

A status code that reflects the status of the correlation creation. It can be any of those that are explained here.

Filtering and Paginating Fields from a Correlation

A correlation might be composed of hundreds or even thousands of
fields. Thus when retrieving a correlation,
it's possible to specify that only a subset of fields be retrieved, by using any combination of the following
parameters in the query string (unrecognized parameters are ignored):

Fields Filter
Parameters

Parameter

Type

Description

fieldsoptional

Comma-separated list

A comma-separated list of field IDs to retrieve.
Example: "fields=000000,000002"

fulloptional

Boolean

If false, no information about fields is returned.
Example: "full=false"

iprefixoptional

String

A case-insensitive string to retrieve fields whose names start with the given prefix; it is possible to specify more than one iprefix by repeating the parameter, in which case the union of the results is returned.
Example: "iprefix=INCOME"

limitoptional

Integer

Maximum number of fields that you will get in the fields field.
Example: "limit=100"

offsetoptional

Integer

The offset from the first field in your dataset to the first field returned in the fields field.
Example: "offset=100"

prefixoptional

String

A case-sensitive string to retrieve fields whose names start with the given prefix; it is possible to specify more than one prefix by repeating the parameter, in which case the union of the results is returned.
Example: "prefix=income"

Since fields is a map and therefore not
ordered, the returned fields contain an additional key, order,
whose integer (increasing) value gives you their ordering. In all other
respects, the resource is the same as the one you would get without any of the
filtering parameters above.

The fields_meta field can help you paginate fields. Its
structure is as follows:

Fields Meta Object
Objects Properties

Property

Type

Description

countoptional

Integer

Specifies the current number of fields in the resource.

limitoptional

Integer

The maximum number of fields that will be returned in the resource.

offsetoptional

Integer

The current offset in the pagination of fields.

totaloptional

Integer

The total number of fields in the resource.

Note that paginating fields might only be worthwhile if you are going to
deal with really wide resources (i.e., more than 200 fields).

Updating a Correlation

To update a correlation, you need to PUT an object containing the fields that you want to update to the correlation's base URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated correlation.

For example, to update a correlation with a new name you can use curl like this:
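
Below is a sketch of such a request (the correlation/id is hypothetical):

# hypothetical correlation/id; substitute your own
curl "https://bigml.io/correlation/55b7c2e6545e5f09f2000000?$BIGML_AUTH" \
  -X PUT \
  -H 'content-type: application/json' \
  -d '{"name": "my new correlation name"}'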

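Deleting a Correlation

To delete a correlation, you need to issue an HTTP DELETE request to the correlation/id to be deleted. A sketch with curl (the correlation/id is hypothetical):

# hypothetical correlation/id; substitute your own
curl -X DELETE "https://bigml.io/correlation/55b7c2e6545e5f09f2000000?$BIGML_AUTH"
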
If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return "204 no content" responses with no body.

Once you delete a correlation,
it is permanently deleted. That is, a delete request cannot be undone.
If you try to delete a correlation
a second time, or a correlation that
does not exist, you will receive a "404 not found" response.

However, if you try to delete a correlation
that is being used at the moment, then BigML.io will not accept the request and
will respond with a "400 bad request" response.

Listing Correlations

To list all the correlations,
you can use the correlation base URL.
By default, only the 20 most recent correlations
will be returned. You can see below how to change this number using
the limit parameter.
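
For example, to list your correlations with curl (the limit value shown is illustrative):

curl "https://bigml.io/correlation?$BIGML_AUTH&limit=5"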

You can get your list of correlations directly in your browser
using your own username and API key with the following links.

Statistical Tests

Last Updated: Tuesday, 2018-03-13 12:20

A statistical test resource automatically runs some advanced statistical tests on the numeric fields of a dataset.
The goal of these tests is to check whether the values of individual fields conform to or differ from some distribution patterns.
Statistical tests are useful in tasks such as fraud, normality, or outlier detection.

The tests are grouped in the following three categories:

Fraud Detection Tests:

Benford: This statistical test performs a comparison of the distribution of first significant digits (FSDs) of each value of the field to the Benford's law distribution. Benford's law applies to numerical distributions spanning several orders of magnitude, such as the values found on financial balance sheets. It states that the frequency distribution of leading, or first significant digits (FSDs), in such distributions is not uniform. On the contrary, lower digits like 1 and 2 occur disproportionately often as leading significant digits. The test compares the distribution in the field to Benford's distribution using a Chi-square goodness-of-fit test and a Cho-Gaines d test. If a field has a dissimilar distribution, it may contain anomalous or fraudulent values.

Normality tests: These tests can be used to confirm the assumption that the data in each field of a dataset is distributed
according to a normal distribution. The results are relevant because many statistical and machine learning techniques rely on this assumption.

Anderson-Darling: The Anderson-Darling test computes a test statistic based on the difference between the observed cumulative distribution function (CDF) to that of a normal distribution. A significant result indicates that the assumption of normality is rejected.

Jarque-Bera: The Jarque-Bera test computes a test statistic based on the third and fourth central moments (skewness and kurtosis) of the data. Again, a significant result indicates that the normality assumption is rejected.

Z-score: For a given sample size, the maximum deviation from the mean that would be expected in a sampling of
a normal distribution can be computed based on the 68-95-99.7 rule. This test simply reports this expected deviation
and the actual deviation observed in the data, as a sort of sanity check.

Outlier tests:

Grubbs: When the values of a field are normally distributed, a few values may still deviate from the mean distribution. The outlier test reports whether at least one value in each numeric field differs significantly from the mean using Grubbs' test for outliers. If an outlier is found, then its value will be returned.

Note that both the number of tests within each category and the categories may increase in the near future.

Statistical Test Base URL

You can use the following base URL to create, retrieve, update, and
delete statistical tests.

https://bigml.io/statisticaltest

Statistical Test base URL

All requests to manage your statistical tests must use HTTPS
and be authenticated using your username and API key to verify
your identity. See this section for more details.

Creating a Statistical Test

To create a new statistical test, you need to POST to the
statistical test base URL an object containing at least
the dataset/id that you want to use to create the
statistical test.
The content-type must always be "application/json".

You can easily create a new statistical test using
curl as follows. All you need is a valid
dataset/id and your authentication variable set up as
shown above.
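
A minimal sketch of such a request (the dataset/id is illustrative, reused from the examples in this document):

curl "https://bigml.io/statisticaltest?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006"}'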

Statistical Test Arguments

In addition to the dataset, you can also POST the following arguments.

Statistical Test Creation
Arguments

Argument

Type

Description

ad_sample_sizeoptional

Integer,default is 1024

The Anderson-Darling normality test is computed from a sample from the values of each field. This parameter specifies the number of samples to be used during the normality test. If not given, defaults to 1024.
Example: 128

ad_seedoptional

String

A string to be hashed to generate deterministic samples for the Anderson-Darling normality test.
Example: "MyADSeed"

categoryoptional

Integer,default is the category of the dataset

The category that best describes the statistical test. See the category codes for the complete list of categories.
Example: 1

dataset

String

A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006

default_numeric_valueoptional

String

It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero".
Example: "median"

descriptionoptional

String

A description of the statistical test up to 8192 characters long.
Example: "This is a description of my new statistical test"

excluded_fieldsoptional

Array,default is [], an empty list. None of the fields in the dataset is excluded.

Specifies the fields that won't be included in the statistical test.
Example:

["000000", "000002"]

fieldsoptional

Object,default is {}, an empty dictionary. That is, no names or preferred statuses are changed.

This can be used to change the names of the fields in the statistical test with respect to the original names in the dataset or to tell BigML that certain fields should be preferred. Use an entry keyed by the field id generated in the source for each field whose name you want updated.
Example:

{
"000001": {"name": "length_1"},
"000003": {"name": "length_2"}
}

input_fieldsoptional

Array,default is []. All the fields in the dataset

Specifies the fields to be considered to create the statistical test.
Example:

["000001", "000003"]

nameoptional

String,default is dataset's name

The name you want to give to the new statistical test.
Example: "my new statistical test"

out_of_bagoptional

Boolean,default is false

Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true

projectoptional

String

The project/id you want the statistical test to belong to.
Example: "project/54d98718f0a5ea0b16000000"

rangeoptional

Array,default is [1, max rows in the dataset]

The range of successive instances to build the statistical test.
Example: [1, 150]

replacementoptional

Boolean,default is false

Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true

sample_rateoptional

Float,default is 1.0

A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5

seedoptional

String

A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample"

significance_levelsoptional

Array,default is [0.01, 0.05, 0.1]

An array of significance levels between 0 and 1 to test against p_values.
Example: [0.01, 0.025, 0.05, 0.075, 0.1]

tagsoptional

Array of Strings

A list of strings that help classify and index your statistical test.
Example: ["best customers", "2018"]

You can also use curl to customize a new statistical test.
For example, to create a new statistical test named "my statistical test", with only certain rows, and
with only three fields:
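
A sketch of such a request (the dataset/id, range, and field ids are illustrative):

curl "https://bigml.io/statisticaltest?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006",
       "name": "my statistical test",
       "range": [1, 150],
       "input_fields": ["000001", "000002", "000003"]}'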

If you do not specify a name, BigML.io will assign to
the new statistical test the dataset's name.
If you do not specify a range of instances, BigML.io
will use all the instances in the dataset.
If you do not specify any input fields,
BigML.io will include all the input fields in the dataset.

Read the Section on Sampling Your Dataset
to learn how to sample your dataset. Here's an example of a statistical test
request with range and sampling specifications:
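
A sketch (values are illustrative):

curl "https://bigml.io/statisticaltest?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006",
       "range": [1, 150],
       "sample_rate": 0.5,
       "replacement": true}'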

You can also use your browser to visualize the statistical test
using the full BigML.io URL or pasting the
statisticaltest/id into the BigML.com dashboard.

Statistical Test Properties

Once a statistical test has been
successfully created it will have the following properties.

Statistical Test Properties

Property

Type

Description

category
filterable,
sortable,
updatable

Integer

One of the categories in the table of categories that help classify this resource according to the domain of application.

code

Integer

HTTP status code. This will be 201 upon successful creation of the statistical test and 200 afterwards. Check the code that comes with the status attribute to make sure that the statistical test creation has been completed without errors.

created
filterable,
sortable

ISO-8601 Datetime

This is the date and time in which the statistical test was created with microsecond precision. It follows this pattern: yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC).

credits
filterable,
sortable

Float

The number of credits it cost you to create this statistical test.

dataset
filterable,
sortable

String

The dataset/id that was used to build the statistical test.

dataset_field_types

Object

A dictionary that informs about the number of fields of each type in the dataset used to create the statistical test. It has an entry per each field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields.

dataset_status
filterable,
sortable

Boolean

Whether the dataset is still available or has been deleted.

description
updatable

String

A text describing the statistical test. It can contain restricted markdown to decorate the text.

excluded_fields

Array

The list of field ids that were excluded when building the statistical test.

fields_meta

Object

A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset and limit, and the number of fields (count) returned.

input_fields

Array

The list of input fields' ids used to build the statistical test.

locale

String

The dataset's locale.

max_columns
filterable,
sortable

Integer

The total number of fields in the dataset used to build the statistical test.

max_rows
filterable,
sortable

Integer

The maximum number of instances in the dataset that can be used to build the statistical test.

name
filterable,
sortable,
updatable

String

The name of the statistical test as you provided it or, by default, based on the name of the dataset.

out_of_bag
filterable,
sortable

Boolean

Whether the out-of-bag instances were used to create the statistical test instead of the sampled instances.

price
filterable,
sortable,
updatable

Float

The price other users must pay to clone your statistical test.

private
filterable,
sortable,
updatable

Boolean

Whether the statistical test is public or not.

project
filterable,
sortable,
updatable

String

The project/id the resource belongs to.

range

Array

The range of instances used to build the statistical test.

replacement
filterable,
sortable

Boolean

Whether the instances sampled to build the statistical test were selected using replacement or not.

resource

String

The statisticaltest/id.

rows
filterable,
sortable

Integer

The total number of instances used to build the statistical test.

sample_rate
filterable,
sortable

Float

The sample rate used to select instances from the dataset to build the statistical test.

seed
filterable,
sortable

String

The string that was used to generate the sample.

shared
filterable,
sortable,
updatable

Boolean

Whether the statistical test is shared using a private link or not.

shared_hash

String

The hash that gives access to this statistical test if it has been shared using a private link.

sharing_key

String

The alternative key that gives read access to this statistical test.

size
filterable,
sortable

Integer

The number of bytes of the dataset that were used to create this statistical test.

source
filterable,
sortable

String

The source/id that was used to build the dataset.

source_status
filterable,
sortable

Boolean

Whether the source is still available or has been deleted.

statistical_tests

Object

All the information that you need to recreate the statistical test. It includes the field's dictionary describing the fields and their summaries, and the statistical tests. See the Statistical Tests Object definition below.

status

Object

A description of the status of the statistical test. It includes a code, a message, and some extra information. See the table below.

subscription
filterable,
sortable

Boolean

Whether the statistical test was created using a subscription plan or not.

updated
filterable,
sortable

ISO-8601 Datetime

This is the date and time in which the statistical test was updated with microsecond precision. It follows this pattern: yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC).

white_box
filterable,
sortable

Boolean

Whether the statistical test is publicly shared as a white-box.

The Statistical Tests Object of a statistical test has the following properties.
Many statistical tests will contain a p-value and
a significant boolean array, indicating whether the p_value is less than
the provided significance_levels (by default, [0.01, 0.05, 0.1] is used if not provided).
If the p-value is greater than the
accepted significance level,
then the test fails to reject the null hypothesis,
meaning there is no statistically significant difference between the treatment groups.
For example, if the significance levels are [0.01, 0.025, 0.05, 0.075, 0.1] and
the p-value is 0.05, then significant is [false, false, false, true, true].

ad_seed

String

A seed used to generate deterministic samples for the Anderson-Darling normality test.

fields
updatable

Object

A dictionary with an entry per field in the dataset used to build the test. Fields are paginated according to the field_meta attribute. Each entry includes the column number in original source, the name of the field, the type of the field, and the summary. See this Section for more details.

fraud

Array

An array of anomalous fields detection test results for each numeric field. See Fraud Object.

normality

Array

An array of data normality test results for each numeric field. See Normality Object.

outliers

Array

An array of outlier detection test results for each numeric field. See Outliers Object.

distribution

Array

The distribution of first significant digits (FSDs) in the field, for comparison to the Benford's law distribution. For example, the FSD for 2015 is 2, and for 0.00609 it is 6. The array represents the number of occurrences of each digit from 1 to 9.
Example: [0, 0, 0, 22, 61, 54, 0, 0, 0]

negatives

Integer

The number of negative values.

zeros

Integer

The number of values exactly equal to 0.

The Chi-Square Object contains the chi-square statistic used to
investigate whether distributions of categorical variables differ from one another.
This test is used to compare a collection of categorical data with
some theoretical expected distribution. The object has the following properties.

Chi-Square Object Properties

Property

Type

Description

chi_square_value

Float

The value of the chi-square statistic.
Example: 1201.60468

p_value

Float

A function used in the context of null hypothesis testing in order to quantify the idea of statistical significance of evidence.
Example: 0.015

significant

Array

A boolean array indicating whether the test produced a significant result at each of the significance_levels. If p_value is less than the significance_level, then it indicates it is significant. The default significance_levels are [0.01, 0.05, 0.1].
Example: [false, true, true]

The Cho-Gaines Object has the following properties.

Cho-Gaines Object Properties

Property

Type

Description

d_statistic

Float

A value based on the Euclidean distance from Benford's distribution in the 9-dimensional space occupied by any first-digit vector, used in the Cho-Gaines d test.

significant

Array

A boolean array indicating whether the test produced a significant result at each of the significance_levels. If p_value is less than the significance_level, then it indicates it is significant. It does not respect the values passed in significance_levels, but always uses [0.01, 0.05, 0.1].
Example: [false, true, true]

The Normality Object has the following properties.

Normality Object Properties

Property

Type

Description

name

String

Name of the normality test. Available values are anderson_darling, jarque_bera, and z_score.

The Anderson-Darling Result Object has the following properties. See Anderson-Darling Test for more information.

Anderson-Darling Result Object Properties

Property

Type

Description

p_value

Float

A function used in the context of null hypothesis testing in order to quantify the idea of statistical significance of evidence.
Example: 0.015

significant

Array

A boolean array indicating whether the test produced a significant result at each of the significance_levels. If p_value is less than the significance_level, then it indicates it is significant. The default significance_levels are [0.01, 0.05, 0.1].
Example: [false, true, true]

The Jarque-Bera Result Object has the following properties. See Jarque-Bera Test for more information.

Jarque-Bera Result Object Properties

Property

Type

Description

p_value

Float

A function used in the context of null hypothesis testing in order to quantify the idea of statistical significance of evidence.
Example: 0.015

significant

Array

A boolean array indicating whether the test produced a significant result at each of the significance_levels. If p_value is less than the significance_level, then it indicates it is significant. The default significance_levels are [0.01, 0.05, 0.1].
Example: [false, true, true]

The Z-Score Object has the following properties. A positive standard score
indicates a datum above the mean, while a negative standard score indicates a datum below the mean.
See z-score for more information.

The Outliers Object has the following properties.

Outliers Object Properties

Property

Type

Description

name

String

Name of the outlier detection test. Currently the only available value is grubbs.

result

Object

A test result which is a dictionary between field ids and test results. The type of the result object varies based on the name of the test. When name is grubbs, it returns a Grubbs Result Object.

The Grubbs' Test for Outliers Result Object has the following properties. It computes a t-test based on the maximum deviation from the mean. A significant result indicates that at least one outlier is present in the data. If an outlier is found, the test also returns the value of the outlier. Note that this test assumes that the data are normally distributed. See Grubbs' test for outliers for more information.

Grubbs Result Object Properties

Property

Type

Description

outlier

Number

An outlier present in the data. It is available only when at least one of the boolean values in significant is true.

p_value

Float

A function used in the context of null hypothesis testing in order to quantify the idea of statistical significance of evidence.
Example: 0.015

significant

Array

A boolean array indicating whether the test produced a significant result at each of the significance_levels. If p_value is less than the significance_level, then it indicates it is significant. The default significance_levels are [0.01, 0.05, 0.1].
Example: [false, true, true]

Statistical Test Status

Creating a statistical test is a process that can take just a few
seconds or a few days depending on the size of the
dataset used as input and on the workload of BigML's
systems. The statistical test goes through a number of
states until it is fully completed. Through the status field in the
statistical test you can determine when the test has been
fully processed and is ready to be used. These are the
properties of a statistical test's status:

Statistical Test Status Object Properties

Property

Type

Description

code

Integer

A status code that reflects the status of the statistical test creation. It can be any of those that are explained here.

elapsed

Integer

Number of milliseconds that BigML.io took to process the statistical test.

message

String

A human readable message explaining the status.

progress

Float, between 0 and 1

How far BigML.io has progressed building the statistical test.

Once a statistical test has been successfully created, it will look like:

Filtering and Paginating Fields from a Statistical Test

A statistical test might be composed of hundreds or even thousands of
fields. Thus when retrieving a statistical test,
it's possible to specify that only a subset of fields be retrieved, by using any combination of the following
parameters in the query string (unrecognized parameters are ignored):

Fields Filter
Parameters

Parameter

Type

Description

fieldsoptional

Comma-separated list

A comma-separated list of field IDs to retrieve.
Example: "fields=000000,000002"

fulloptional

Boolean

If false, no information about fields is returned.
Example: "full=false"

iprefixoptional

String

A case-insensitive string to retrieve fields whose names start with the given prefix; it is possible to specify more than one iprefix by repeating the parameter, in which case the union of the results is returned.
Example: "iprefix=INCOME"

limitoptional

Integer

Maximum number of fields that you will get in the fields field.
Example: "limit=100"

offsetoptional

Integer

The offset from the first field in your dataset to the first field returned in the fields field.
Example: "offset=100"

prefixoptional

String

A case-sensitive string to retrieve fields whose names start with the given prefix; it is possible to specify more than one prefix by repeating the parameter, in which case the union of the results is returned.
Example: "prefix=income"

Since fields is a map and therefore not
ordered, the returned fields contain an additional key, order,
whose integer (increasing) value gives you their ordering. In all other
respects, the resource is the same as the one you would get without any of the
filtering parameters above.

The fields_meta field can help you paginate fields. Its
structure is as follows:

Fields Meta Object
Objects Properties

Property

Type

Description

countoptional

Integer

Specifies the current number of fields in the resource.

limitoptional

Integer

The maximum number of fields that will be returned in the resource.

offsetoptional

Integer

The current offset in the pagination of fields.

totaloptional

Integer

The total number of fields in the resource.

Note that paginating fields might only be worthwhile if you are going to
deal with really wide resources (i.e., more than 200 fields).

Updating a Statistical Test

To update a statistical test, you need to PUT an object containing the fields that you want to update to the statistical test's base URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated statistical test.

For example, to update a statistical test with a new name you can use curl like this:
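
Below is a sketch of such a request (the statisticaltest/id is hypothetical):

# hypothetical statisticaltest/id; substitute your own
curl "https://bigml.io/statisticaltest/55b7c2e6545e5f09f2000001?$BIGML_AUTH" \
  -X PUT \
  -H 'content-type: application/json' \
  -d '{"name": "my new statistical test name"}'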

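Deleting a Statistical Test

To delete a statistical test, you need to issue an HTTP DELETE request to the statisticaltest/id to be deleted. A sketch with curl (the statisticaltest/id is hypothetical):

# hypothetical statisticaltest/id; substitute your own
curl -X DELETE "https://bigml.io/statisticaltest/55b7c2e6545e5f09f2000001?$BIGML_AUTH"
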
If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return "204 no content" responses with no body.

Once you delete a statistical test,
it is permanently deleted. That is, a delete request cannot be undone.
If you try to delete a statistical test
a second time, or a statistical test that
does not exist, you will receive a "404 not found" response.

However, if you try to delete a statistical test
that is being used at the moment, then BigML.io will not accept the request and
will respond with a "400 bad request" response.

Listing Statistical Tests

To list all the statistical tests,
you can use the statisticaltest base URL.
By default, only the 20 most recent statistical tests
will be returned. You can see below how to change this number using
the limit parameter.
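
For example, to list your statistical tests with curl (the limit value shown is illustrative):

curl "https://bigml.io/statisticaltest?$BIGML_AUTH&limit=5"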

You can get your list of statistical tests directly in your browser
using your own username and API key with the following links.

Models

Last Updated: Monday, 2017-10-30 10:31

A model is a tree-like representation of your
dataset with predictive power. You can create a
model selecting which fields from your
dataset you want to use as input
fields (or predictors) and
which field you want to predict, the objective field.

Each node in the model corresponds to one of the
input fields. Each node has an incoming branch, except
the top node, also known as the root, which has none. Each node has a number of outgoing
branches, except those at the bottom (the "leaves"), which have none.

Each branch represents a possible value for the input field where it
originates. A leaf represents the value of the
objective field given all the values for each input field in the chain of
branches that goes from the root to that leaf.

When you create a new model, BigML.io will
automatically compute a classification model or regression model
depending on whether the objective field that you want to predict is categorical
or numeric, respectively.

Model Arguments

In addition to the dataset, you can also POST the
following arguments.

Model Creation
Arguments

Argument

Type

Description

categoryoptional

Integer,default is the category of the dataset

The category that best describes the model. See the category codes for the complete list of categories.
Example: 1

dataset

String

A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006

depth_thresholdoptional

Integer,default is 512

When the depth in the tree exceeds this value, the tree stops growing. It has no effect if it's bigger than the node_threshold.
Example: 128

descriptionoptional

String

A description of the model up to 8192 characters long.
Example: "This is a description of my new model"

excluded_fieldsoptional

Array,default is [], an empty list. None of the fields in the dataset is excluded.

Specifies the fields that won't be included in the model.
Example:

["000000", "000002"]

fieldsoptional

Object,default is {}, an empty dictionary. That is, no names or preferred statuses are changed.

This can be used to change the names of the fields in the model with respect to the original names in the dataset or to tell BigML that certain fields should be preferred. Use an entry keyed by the field id generated in the source for each field whose name you want updated.
Example:

{
"000001": {"name": "length_1"},
"000003": {"name": "length_2"}
}

input_fieldsoptional

Array,default is []. All the fields in the dataset

Specifies the fields to be included as predictors in the model.
Example:

["000001", "000003"]

missing_splitsoptional

Boolean,default is false

Defines whether to explicitly include missing field values when choosing a split. When this option is enabled, generates predicates whose operators include an asterisk, such as >*, <=*, =*, or !=*. The presence of an asterisk means "or missing". So a split with the operator >* and the value 8 can be read as "x > 8 or x is missing". When using missing_splits there may also be predicates with operators = or !=, but with a null value. This means "x is missing" and "x is not missing" respectively.
Example: true

nameoptional

String,default is dataset's name

The name you want to give to the new model.
Example: "my new model"

node_thresholdoptional

Integer,default is 512

When the number of nodes in the tree exceeds this value, the tree stops growing.
Example: 1000

objective_fieldoptional

String,default is the id of the last field in the dataset

Specifies the id of the field that you want to predict.
Example: "000003"

objective_fieldsoptional

Array,default is an array with the id of the last field in the dataset

Specifies the id of the field that you want to predict. Even if this is an array, BigML.io only accepts one objective field in the current version. If both objective_field and objective_fields are specified, then objective_field takes precedence.
Example: ["000003"]

orderingoptional

Integer,default is 0 (deterministic).

Specifies the type of ordering followed to build the model. There are three different types that you can specify:

0 Deterministic

1 Linear

2 Random

out_of_bagoptional

Boolean,default is false

Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true

projectoptional

String

The project/id you want the model to belong to.
Example: "project/54d98718f0a5ea0b16000000"

random_candidate_ratiooptional

Float

A real number between 0 and 1. When randomize is true and random_candidate_ratio is given, BigML randomizes the tree and uses random_candidate_ratio * total fields (counting the number of terms in text fields as fields). To get the final number of candidate fields we round down to the nearest integer, but if the result is 0 we'll use 1 instead. If both random_candidates and random_candidate_ratio are given, BigML ignores random_candidate_ratio.
Example: 0.2

random_candidatesoptional

Integer,default is the square root of the total number of input fields.

Sets the number of random fields considered when randomize is true.
Example: 10

This parameter controls the minimum amount of support each child node must contain to be valid as a possible split. So, if it is 3, then both children of a new split must have at least 3 instances supporting them. Since instances may have non-integer weights, non-integer values are valid.
Example: 16

tagsoptional

Array of Strings

A list of strings that help classify and index your model.
Example: ["best customers", "2018"]

You can also use curl to customize a new
model. For example, to create a new model named "my model", with only certain rows, and
with only three fields:
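
A sketch of such a request (the dataset/id, range, and field ids are illustrative):

curl "https://bigml.io/model?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006",
       "name": "my model",
       "range": [1, 150],
       "input_fields": ["000001", "000002", "000003"]}'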

If you do not specify a name, BigML.io will assign to
the new model the dataset's name. If
you do not specify a range of instances, BigML.io
will use all the instances in the dataset. If you do
not specify any input fields, BigML.io will include all the input fields in the
dataset, and if you do not specify an
objective field, BigML.io will
use the last field in your dataset.

Shuffling the Rows of Your Dataset

By default, rows from the input dataset are deterministically
shuffled before being processed, to avoid inaccurate models caused by
ordered fields in the input rows. Since the shuffling is deterministic,
i.e., always the same for a given dataset, retraining a model for the
same dataset will always yield the same result.

However, you can modify this default behaviour by including the
ordering argument in the model creation request, where "ordering"
here is a shortcut for "ordering for the traversal of input rows". When
this property is absent or set to 0,
deterministic shuffling takes place;
otherwise, you can set it to:

Linear: If you know that your input is already in random order.
Setting "ordering" to 1 in your model request tells
BigML to traverse the dataset in a linear fashion, without performing any
shuffling (and therefore operating faster).

Random: If you'd like to perform a really random shuffling, most
probably different from any other one attempted before. Setting
"ordering" to 2 will shuffle the input rows
non-deterministically.

Sampling Your Dataset

You can limit the dataset rows that are used to create a model in
two ways (which can be combined), namely, by specifying a row range and
by asking for a sample of the (already clipped) input rows.

To specify a sample, which is taken over the row range or over the whole
dataset if a range is not provided, you can add the following arguments
to the creation request:

sample_rate
: A positive number that specifies the sampling rate, i.e., how often we
pick a row from the range. In other words, the final number of rows
will be the size of the range multiplied by the sample_rate, unless
"out_of_bag" is true (see below).

replacement
: A boolean indicating whether sampling should be performed with or without
replacement, i.e., the same instance may be selected multiple times
for inclusion in the result set. Defaults to false.

out_of_bag
: If an instance isn't selected as part of a
sampling, it's called out of bag. Setting this parameter to true will
return a sequence of the out-of-bag instances instead of the sampled
instances.
This can be useful when paired with "seed". When
replacement is false,
the final number of rows returned is the size of the range multiplied by
one minus the sample_rate. Out-of-bag sampling with replacement gives rise to
variable-size samples. Defaults to false.

seed
: Rows are sampled probabilistically using a
random string, which means that, in general, two identical samples of
the same row range of the same dataset will be different. If you
provide a seed (as an arbitrary string), its hash value will be used as
the seed, and it'll be possible for you to generate deterministic
samples.

Finally, note that the "ordering" of the dataset described in the previous subsection is used on the result of the sampling.

Here's an example of a model request with range and sampling specifications:
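
A sketch (values are illustrative):

curl "https://bigml.io/model?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006",
       "range": [1, 150],
       "sample_rate": 0.5,
       "replacement": true,
       "seed": "MySample"}'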

Random Decision Forests

A model can be randomized by setting the randomize parameter to true. The default is false.

When randomized, the model considers only a subset of the possible
fields when choosing a split. The size of the subset will be the square
root of the total number of input fields. So if there are 100 input
fields, each split will only consider 10 fields randomly chosen from
the 100. Every split will choose a new subset of fields.

Although randomize could be used for other purposes, it's intended for
growing random decision forests. To grow tree models for a random forest, set
randomize to true and select a sample from the dataset. Traditionally
this is a 1.0 sample rate with replacement, but we suggest a 0.63
sample rate without replacement.
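
For instance, here is a sketch of a request that grows one randomized tree for a forest with the suggested sampling (the dataset/id is illustrative):

curl "https://bigml.io/model?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006",
       "randomize": true,
       "sample_rate": 0.63,
       "replacement": false}'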

Retrieving a Model

Each model has a unique identifier in the form
"model/id" where id is a string of
24 alpha-numeric characters that you can use to retrieve the
model.

To retrieve a model with curl:

curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH"

Retrieving a model from the command line

You can also use your browser to visualize the model
using the full BigML.io URL or pasting the
model/id into the BigML.com dashboard.

Model Properties

Once a model has been
successfully created it will have the following properties.

Model Properties

Property

Type

Description

boosted_ensemble
filterable,
sortable

Boolean

Whether the model was built as part of an ensemble with boosted trees.

boosting

Object

Boosting attribute for the boosted tree. See the Gradient Boosting section for more information.
Example:

category
filterable,
sortable,
updatable

Integer

One of the categories in the table of categories that help classify this resource according to the domain of application.

code

Integer

HTTP status code. This will be 201 upon successful creation of the model and 200 afterwards. Check the code that comes with the status attribute to make sure that the model creation has been completed without errors.

created
filterable,
sortable

ISO-8601 Datetime

This is the date and time in which the model was created with microsecond precision. It follows this pattern: yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC).

credits
filterable,
sortable

Float

The number of credits it cost you to create this model.

credits_per_prediction
filterable,
sortable,
updatable

Float

This is the number of credits that other users will consume to make a prediction with your model if you made it public.

dataset
filterable,
sortable

String

The dataset/id that was used to build the model.

dataset_field_types

Object

A dictionary that informs about the number of fields of each type in the dataset used to create the model. It has an entry per each field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields.

dataset_status
filterable,
sortable

Boolean

Whether the dataset is still available or has been deleted.

description
updatable

String

A text describing the model. It can contain restricted markdown to decorate the text.

ensemble
filterable,
sortable

Boolean

Whether the model was built as part of an ensemble or not.

ensemble_id
filterable,
sortable

String

The ensemble id.

ensemble_index
filterable,
sortable

Integer

The order number of the model within the ensemble.

excluded_fields

Array

The list of ids of the fields that were excluded when building the model.

fields_meta

Object

A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned.

input_fields

Array

The list of input fields' ids used to build the model.

locale

String

The dataset's locale.

max_columns
filterable,
sortable

Integer

The total number of fields in the dataset used to build the model.

max_rows
filterable,
sortable

Integer

The maximum number of instances in the dataset that can be used to build the model.

missing_splits
filterable,
sortable

Boolean

Whether to explicitly include missing field values when choosing a split while growing a model.

model

Object

All the information that you need to recreate or use the model on your own. It includes a very intuitive description of the tree-like structure that makes up the model and the fields dictionary describing the fields and their summaries.

name
filterable,
sortable,
updatable

String

The name of the model as you provided it or, by default, based on the name of the dataset.

node_threshold
filterable,
sortable

Integer

The maximum number of nodes that the model will grow.

number_of_batchpredictions
filterable,
sortable

Integer

The current number of batch predictions that use this model.

number_of_evaluations
filterable,
sortable

Integer

The current number of evaluations that use this model.

number_of_predictions
filterable,
sortable

Integer

The current number of predictions that use this model.

number_of_public_predictions
filterable,
sortable

Integer

The current number of public predictions that use this model.

objective_field

String

The id of the field that the model predicts.

objective_fields

Array

Specifies the list of ids of the field that the model predicts. Even if this is an array, BigML.io only accepts one objective field in the current version.

ordering
filterable,
sortable

Integer

The order used to choose instances from the dataset to build the model. There are three different types:

0 Deterministic

1 Linear

2 Random

out_of_bag
filterable,
sortable

Boolean

Whether the out-of-bag instances were used to create the model instead of the sampled instances.

price
filterable,
sortable,
updatable

Float

The price other users must pay to clone your model.

private
filterable,
sortable,
updatable

Boolean

Whether the model is public or not.

project
filterable,
sortable,
updatable

String

The project/id the resource belongs to.

random_candidate_ratio
filterable,
sortable

Float

The random candidate ratio considered when randomize is true.

random_candidates
filterable,
sortable

Integer

The number of random fields considered when randomize is true.

randomize
filterable,
sortable

Boolean

Whether the model splits considered only a random subset of the fields or all the fields available.

range

Array

The range of instances used to build the model.

replacement
filterable,
sortable

Boolean

Whether the instances sampled to build the model were selected using replacement or not.

resource

String

The model/id.

rows
filterable,
sortable

Integer

The total number of instances used to build the model.

sample_rate
filterable,
sortable

Float

The sample rate used to select instances from the dataset to build the model.

seed
filterable,
sortable

String

The string that was used to generate the sample.

selective_pruning
filterable,
sortable

Boolean

If true, selective pruning throttled the strength of the statistical pruning depending on the size of the dataset.

shared
filterable,
sortable,
updatable

Boolean

Whether the model is shared using a private link or not.

shared_clonable
filterable,
sortable,
updatable

Boolean

Whether the shared model can be cloned or not.

shared_hash

String

The hash that gives access to this model if it has been shared using a private link.

sharing_key

String

The alternative key that gives read access to this model.

size
filterable,
sortable

Integer

The number of bytes of the dataset that were used to create this model.

source
filterable,
sortable

String

The source/id that was used to build the dataset.

source_status
filterable,
sortable

Boolean

Whether the source is still available or has been deleted.

split_candidates
filterable,
sortable

Integer

The number of split points that are considered whenever the tree evaluates a numeric field. Minimum 1 and maximum 1024.

stat_pruning
filterable,
sortable

Boolean

Whether statistical pruning was used when building the model.

status

Object

A description of the status of the model. It includes a code, a message, and some extra information. See the table below.

subscription
filterable,
sortable

Boolean

Whether the model was created using a subscription plan or not.

support_threshold
filterable,
sortable

Float

This parameter controls the minimum amount of support each child node must contain to be valid as a possible split.

updated
filterable,
sortable

String

This is the date and time in which the model was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC).

white_box
filterable,
sortable

Boolean

Whether the model is publicly shared as a white-box.

A Model Object has the following properties:

Model Object Properties

Property

Type

Description

depth_threshold

Integer

The depth, or generation, limit for a tree.

distribution

Object

This dictionary gives information about how the training data is distributed across the tree leaves. More concretely, it contains the training data distribution with key training, and the distribution for the actual prediction values of the tree with key predictions. The former is just the objective_summary of the tree root (see below), copied for easier individual retrieval, and both have the format of the objective summary in the tree nodes.

fields
updatable

Object

A dictionary with an entry per field in the dataset used to build the model. Fields are paginated according to the fields_meta attribute. Each entry includes the column number in the original source, the name of the field, the type of the field, and the summary. See this Section for more details.

importance

Array of Arrays

A list of pairs [field_id, importance]. Importance is the amount by which each field in the model reduces prediction error, normalized to be between zero and one. Note that fields with an importance of zero may still be correlated with the objective; they were just not used in the model.

kind

String

The type of model. Currently, only stree is supported.

missing_strategy

String

Default strategy followed by the model when it finds a missing value. Currently, last_prediction. At prediction time you can opt for using proportional. See this Section for more details.

model_fields

Object

A dictionary with an entry per field used by the model (not all the fields that were available in the dataset). They follow the same structure as the fields attribute above except that the summary is not present.

root

Object

A Node Object, a tree-like recursive structure representing the model.

split_criterion

Integer

Method of choosing best attribute and split point for a given node.
DEPRECATED

support_threshold

Float

A number between 0 and 1. For a split to be valid, each child's support (instances / total instances) must be greater than this threshold.

Node Objects have the following properties:

Node Object Properties

Property

Type

Description

children

Array

Array of Node Objects.

confidence

Float

For classification models, a number between 0 and 1 that expresses how certain the model is of the prediction. For regression models, a number mapped to the top end of a 95% confidence interval around the expected error at that node (measured using the variance of the output at the node). See the Section on Confidence for more details. Note that for models you might have created using the first versions of BigML this value might be null.

count

Integer

Number of instances classified by this node.

objective_summary

Object

An Objective Summary Object summarizes the objective field's distribution at this node.

output

Number or String

Prediction at this node.

predicate

Boolean or Object

Predicate structure to make a decision at this node.

Objective Summary Objects have the following properties:

Objective Summary Object Properties

Property

Type

Description

bins

Array

If the objective field is numeric and the number of distinct values is greater than 32, an array that represents an approximate histogram of the distribution. It consists of value pairs, where the first value is the mean of a histogram bin and the second value is the bin population. For more information, see our blog post or read this paper.

categories

Array

If the objective field is categorical, an array of pairs where the first element of each pair is one of the unique categories and the second element is the count for that category.

counts

Array

If the objective field is numeric and the number of distinct values is less than or equal to 32, an array of pairs where the first element of each pair is one of the unique values found in the field and the second element is the count.

maximum

Number

The maximum of the objective field's values. Available when 'bins' is present.

minimum

Number

The minimum of the objective field's values. Available when 'bins' is present.

Predicate Objects have the following properties:

Predicate Object Properties

Property

Type

Description

field

String

Field's id used for this decision.

operator

String

Type of test used for this field.

value

Number or String

Value of the field to make this node decision.

Model Status

Creating a model is a process that can take just a few
seconds or a few days depending on the size of the
dataset used as input and on the workload of BigML's
systems. The model goes through a number of
states until it's fully completed. Through the status field in the
model you can determine when the model has been
fully processed and is ready to be used to create predictions. These are the
properties that a model's status has:

Model Status Object Properties

Property

Type

Description

code

Integer

A status code that reflects the status of the model creation. It can be any of those that are explained here.

Filtering a Model

It is possible to filter the tree returned by a GET to the model location by means of two optional query string parameters,
namely support and value.

Filter by Support

Support is a number from 0 to 1 that specifies the
minimum fraction of the total number of instances that a given
branch must cover to be retained in the resulting tree. Thus,
asking for a (minimum) support of 0 is just asking for the whole
tree, while something like the following (reusing the illustrative model/id from above):
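
curl "https://bigml.io/model/50a454503c1920186d000049?support=1;$BIGML_AUTH"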

will return just the root node, that being the only one that covers all instances.
If you repeat the support parameter in the query string, the last one is used.
Non-parseable support values are ignored.

Filter by Values and Value Intervals

Value is a concrete value or interval of values (for regression
trees) that a leaf must predict to be kept in the returning tree. The parameter can be repeated. For instance:
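
curl "https://bigml.io/model/50a454503c1920186d000049?value=Iris-setosa;value=Iris-versicolor;$BIGML_AUTH"

curl "https://bigml.io/model/50a454503c1920186d000049?value=(-2,10];$BIGML_AUTH"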

in which case the union of the different predicates is used (i.e., the first query will return a tree with all leaves predicting "Iris-setosa" and all leaves predicting "Iris-versicolor").

Intervals can be closed or open at either end. For example, "(-2,10]", "[1,2)" or "(-1.234,0)", and the values of the left or right limits can be omitted, in which case they're taken as negative and positive infinity, respectively; thus "(,3]" denotes all values less than or equal to three, as does "[,3]" (infinity not being a valid value for a numeric prediction), while "(0,)" accepts any positive value.

Filter by Confidence

Confidence is a concrete value or interval of values that a leaf must have to be kept in the returning tree.
The specification of intervals follows the same conventions as those of value. Since confidence is a continuous value,
the most common case will be asking for a range, but the service will also accept individual values.
It's also possible to specify both a value and a confidence. For instance (again with an illustrative model/id):
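
curl "https://bigml.io/model/50a454503c1920186d000049?value=Iris-setosa;confidence=[0.5,1];$BIGML_AUTH"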

Filtering and Paginating Fields from a Model

A model might be composed of hundreds or even thousands of
fields. Thus when retrieving a model,
it's possible to specify that only a subset of fields be retrieved, by using any combination of the following
parameters in the query string (unrecognized parameters are ignored):

Fields Filter
Parameters

Parameter

Type

Description

fieldsoptional

Comma-separated list

A comma-separated list of field IDs to retrieve.
Example: "fields=000000,000002"

fulloptional

Boolean

If false, no information about fields is returned.
Example: "full=false"

iprefixoptional

String

A case-insensitive string to retrieve fields whose names start with the given prefix. It is possible to specify more than one iprefix by repeating the parameter, in which case the union of the results is returned.
Example: "iprefix=INCOME"
Example: "iprefix=INCOME"

limitoptional

Integer

Maximum number of fields that you will get in the fields field.
Example: "limit=100"

offsetoptional

Integer

How far off from the first field in your dataset is the first field in the fields field.
Example: "offset=100"

prefixoptional

String

A case-sensitive string to retrieve fields whose names start with the given prefix. It is possible to specify more than one prefix by repeating the parameter, in which case the union of the results is returned.
Example: "prefix=income"

Since fields is a map and therefore not
ordered, the returned fields contain an additional key, order,
whose integer (increasing) value gives you their ordering. In all other
respects, the model is the same as the one you would get without any of the
filtering parameters above.

The fields_meta field can help you paginate fields. Its
structure is as follows:

Fields Meta Object
Objects Properties

Property

Type

Description

countoptional

Integer

Specifies the current number of fields in the resource.

limitoptional

Integer

The maximum number of fields that will be returned in the resource.

offsetoptional

Integer

The current offset in the pagination of fields.

totaloptional

Integer

The total number of fields in the resource.

Note that paginating fields might only be worthwhile if you are going to
deal with really wide models (i.e., more than 200 fields).

Updating a Model

To update a model,
you need to PUT an object containing the fields that you want to update to the
model's base URL.
The content-type must always be "application/json".
If the request succeeds, BigML.io will return
an HTTP 202 response
with the updated model.

For example, to update a model with a new name you can use curl like this:
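
curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH" \
  -X PUT \
  -H 'content-type: application/json' \
  -d '{"name": "a new name"}'

$ Updating a model's name (the model/id is illustrative)

Deleting a Model

To delete a model, you need to issue an HTTP DELETE request to the model/id to be deleted. For instance, with curl:

curl -X DELETE "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH"

$ Deleting a model from the command line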

If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return "204 no content" responses with no body.

Once you delete a model,
it is permanently deleted. That is, a delete request cannot be undone.
If you try to delete a model
a second time, or a model that
does not exist, you will receive a "404 not found" response.

However, if you try to delete a model
that is being used at the moment, then BigML.io will not accept the request and
will respond with a "400 bad request" response.

Weight Field

A weight_field may be declared for either regression or classification models.
Any numeric field with no negative or missing values is valid as a weight field.
Each instance will be weighted individually according to the weight field's value.
See the toy dataset for credit card transactions below.
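
For instance, a toy sample might look like this (field names and values are illustrative):

amount,online,fraud,weight
1459.27,yes,no,1
213.42,no,no,1
89.99,no,no,1
542.11,yes,yes,10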

The last column represents the weight for each
transaction. We can use it as an input to create a model that will
weight each instance accordingly. In this case, fraudulent
transactions will weigh 10 times more than valid transactions in the
model building computations.

With Flatline,
you can define arbitrarily complex functions to produce weight fields,
making this the most flexible and powerful way to produce weighted models.

For instance, the request below would create a new dataset using the
example above that will add a new weight field using the previous and
multiplying by two when the amount of the transaction is higher than 500.
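
A sketch of such a request, assuming Flatline's s-expression syntax and the illustrative field names and dataset/id used above:

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/4f66a80803ce8940c5000006",
       "new_fields": [{
         "name": "new weight",
         "field": "(if (> (f \"amount\") 500) (* 2 (f \"weight\")) (f \"weight\"))"}]}'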

This method also works well when you query very large databases that can
produce the same row hundreds or thousands of times. You can just use one of
the rows and add the corresponding count as a weight field. This will reduce
the size of your sources enormously.

Objective Weights

The second method for adding weights only applies to classification
models. A set of objective_weights may be defined, one per objective class. Each instance will be weighted according to its class weight.

Weights of zero are valid as long as there are some positive valued weights.
If every weight ends up being zero (this is possible with sampled datasets), then the resulting model will have a single node with a nil output.

Automatic Balancing

Finally, we provide a convenience shortcut for specifying weights for a
classification objective which are inversely proportional to their category
counts, by means of the balance_objective flag.

weight_fieldoptional

String

Any numeric field with no negative or missing values is valid as a weight field. Each instance will be weighted individually according to the weight field's value.
Example: "000005"

The nodes for a weighted tree will include a weight and
weighted_objective_distribution, which are the weighted
analogs of count and objective_distribution. Confidence, importance, and pruning calculations also take weights into account.

Ensembles

An ensemble is a number of models grouped together to
create a stronger model with better predictive performance.

Depending on the nature of your data and the specific parameters of the
ensemble, you can significantly boost predictive performance
compared to single models, using exactly the same data.

You can create an ensemble just as you would create a
model with the following three basic machine learning
techniques: bagging, random decision forests,
and gradient tree boosting.

Bagging, also known as bootstrap aggregating, is one of the simplest ensemble-based strategies but often outperforms strategies that are more complex. The basic idea is to use a different random subset of the original dataset for each model in the ensemble. Specifically BigML uses by default a sampling rate of 100% with replacement for each model. You can read more about bagging here.

Random decision forests is the second ensemble-based strategy that BigML provides.
It consists, essentially, of selecting a new random set of the input fields at each split
while an individual model is being built, instead of considering all the input fields.
To create a random decision forest you just need to set the randomize argument to true.
You can read more about random decision forests here.

Gradient tree boosting is the third strategy; its predictions are additive,
with each tree modifying the predictions of the previously grown trees.
You must specify the boosting argument in order to apply this technique.

Ensemble Base URL

You can use the following base URL to create, retrieve, update, and
delete ensembles.

https://bigml.io/ensemble

Ensemble base URL

All requests to manage your ensembles must use HTTPS
and be authenticated using your username and API key to verify
your identity. See this section for more details.

Creating an Ensemble

To create a new ensemble, you need to POST to the
ensemble base URL an object containing at least
the dataset/id that you want to use to create the
ensemble. The content-type must always be
"application/json".

Ensemble Arguments

In addition to the dataset, you can also POST the
following arguments, and like models,
you can use weights to deal with imbalanced datasets.
Click here
to find more information about weights.

Ensemble Creation
Arguments

Argument

Type

Description

boostingoptional

Object

Gradient boosting options for the ensemble. Required to create an ensemble with boosted trees. See the Gradient Boosting section for more information.
Example:

{
"iterations": 5,
"learning_rate": 0.75
}

categoryoptional

Integer,default is the category of the dataset

The category that best describes the ensemble. See the category codes for the complete list of categories.
Example: 1

dataset

String

A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006

depth_thresholdoptional

Integer,default is 512

When the depth in the tree exceeds this value, the tree stops growing. It has no effect if it's bigger than the node_threshold.
Example: 128

descriptionoptional

String

A description of the ensemble up to 8192 characters long.
Example: "This is a description of my new ensemble"

ensemble_sampleoptional

Object

The sampling to be used for each tree in the ensemble. It can contain rate (default 1), replacement (default true), and seed parameters. Note that this is different from the sample_rate, replacement, and seed used in other models, predictions, or datasets, where sampling is applied once to the input dataset; here it's applied multiple times to the input, in order to create a separate sampling for each tree composing the final ensemble. So there is no out_of_bag parameter here, and the seed is used to create a different seed for each of the generated trees.
Example:

{
"rate": 0.8,
"replacement": true,
"seed": "my ensemble seed"
}

excluded_fieldsoptional

Array,default is [], an empty list. None of the fields in the dataset is excluded.

Specifies the fields that won't be included in the models of the ensemble.
Example:

["000000", "000002"]

fieldsoptional

Object,default is {}, an empty dictionary. That is, no names or preferred statuses are changed.

This can be used to change the names of the fields in the ensemble with respect to the original names in the dataset or to tell BigML that certain fields should be preferred. Add an entry keyed by the field id generated in the source for each field whose name you want updated.
Example:

{
"000001": {"name": "length_1"},
"000003": {"name": "length_2"}
}

input_fieldsoptional

Array,default is []. All the fields in the dataset

Specifies the fields to be included as predictors in the models of the ensemble.
Example:

["000001", "000003"]

missing_splitsoptional

Boolean,default is true

Defines whether to explicitly include missing field values when choosing a split while growing the models of an ensemble. When this option is enabled, each model generates predicates whose operators include an asterisk, such as >*, <=*, =*, or !=*. The presence of an asterisk means "or missing". So a split with the operator >* and the value 8 can be read as "x > 8 or x is missing". When using missing_splits there may also be predicates with operators = or !=, but with a null value. This means "x is missing" and "x is not missing" respectively.
Example: false

nameoptional

String,default is dataset's name

The name you want to give to the new ensemble.
Example: "my new ensemble"

node_thresholdoptional

Integer,default is 512

When the number of nodes in the tree exceeds this value, the tree stops growing.
Example: 1000

number_of_modelsoptional

Integer,default is 10

The number of models to build the ensemble. This parameter is ignored for boosted trees. See the Gradient Boosting section for more information.
Example: 100

objective_fieldoptional

String,default is the id of the last field in the dataset

Specifies the id of the field that the ensemble will predict.
Example: "000003"

orderingoptional

Integer,default is 0 (deterministic).

Specifies the type of ordering followed to build the models of the ensemble. There are three different types that you can specify:

0 Deterministic

1 Linear

2 Random

out_of_bagoptional

Boolean,default is false

Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true

projectoptional

String

The project/id you want the ensemble to belong to.
Example: "project/54d98718f0a5ea0b16000000"

random_candidate_ratiooptional

Float

A real number between 0 and 1. When randomize is true and random_candidate_ratio is given, BigML randomizes the trees and uses random_candidate_ratio * total fields (counting the number of terms in text fields as fields). To get the final number of candidate fields we round down to the nearest integer, but if the result is 0 we'll use 1 instead. If both random_candidates and random_candidate_ratio are given, BigML ignores random_candidate_ratio.
Example: 0.2

random_candidatesoptional

Integer,default is the square root of the total number of input fields.

Sets the number of random fields considered when randomize is true.
Example: 10

randomizeoptional

Boolean,default is false

Setting this parameter to true will consider only a subset of the possible fields when choosing a split. See the Section on Random Decision Forests for further details.
Example: true

rangeoptional

Array,default is [1, max rows in the dataset]

The range of successive instances to build the ensemble.
Example: [1, 150]

replacementoptional

Boolean,default is false

Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true

sample_rateoptional

Float,default is 1.0

A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5

seedoptional

String

A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample"

split_candidatesoptional

Integer,default is 32

The number of split points that are considered whenever the tree evaluates a numeric field. Minimum 1 and maximum 1024.
Example: 128

support_thresholdoptional

Float

This parameter controls the minimum amount of support each child node must contain to be valid as a possible split. So, if it is 3, then both children of a new split must have at least 3 instances supporting them. Since instances may have non-integer weights, non-integer values are valid.
Example: 16

tagsoptional

Array of Strings

A list of strings that help classify and index your ensemble.
Example: ["best customers", "2018"]

You can use curl to customize a new
ensemble from the command line. For example, to create
a new ensemble named "my ensemble", with only certain rows, and
with only three fields:
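
curl "https://bigml.io/ensemble?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006",
       "name": "my ensemble",
       "range": [1, 150],
       "input_fields": ["000000", "000001", "000003"]}'

$ Creating a customized ensemble (the dataset/id is illustrative)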

If you do not specify a name, the dataset's name will
be assigned to the new ensemble. If
you do not specify a range of instances,
the complete set of instances in the dataset will be
used. If you do not specify any input fields,
all the preferred input fields in the
dataset will be included, and if you do not specify an
objective field, the last field in your
dataset will be considered the objective field.

Gradient Boosting

When doing boosting, the number_of_models parameter described above is no longer valid as an input.
Instead, the iterations argument described below indicates the maximum number of boosting iterations.
Note that when the gradient boosting option is applied to classification models,
the actual number of models created will be the product of the number of classes (categories) and the number of iterations.
For example, if you set boosting iterations to 12 and the number of classes is 3,
then the number of models created will be 36 or fewer, depending on whether an early stopping strategy is used.

In addition, our implementation of boosted trees supports the following parameters,
which are all part of the boosting object:

Ensemble Boosting
Arguments

Argument

Type

Description

early_holdoutoptional

Float,default is 0

The portion of the dataset that will be held out for testing at the end of every iteration. If no significant improvement is made on the holdout, boosting will stop early. early_out_of_bag and early_holdout are mutually exclusive. If early_out_of_bag is enabled, it will take precedence and early_holdout will be automatically set to 0 (disabled). This value should be between 0 (inclusive) and 1 (exclusive).
Example: 0.5

early_out_of_bagoptional

Boolean,default is true

If enabled, the out_of_bag samples are tested after every iteration and may result in an early stop if no significant improvement is made. To use this option, an ensemble_sample must also be requested.
Example: false

iterationsoptional

Integer,default is 10

The maximum number of boosting iterations to be performed. For regression problems, one boosted tree will be generated for every iteration. For classification problems, however, N trees will be generated for every iteration, where N is the number of classes.
Example: 32

learning_rateoptional

Float,default is 0.1

It controls how aggressively the boosting algorithm will fit the data. This value should be between 0 (exclusive) and 1 (exclusive).
Example: 0.75

If the boosted trees are using one of the early stopping tests
(early_out_of_bag or early_holdout),
then the ensemble will also have a list of scores indicating the quality of the boosted trees
after each iteration.

Individual trees in the boosted trees differ from trees in bagged or random forest ensembles.
Primarily the difference is that boosted trees do not try to predict the objective field directly.
Instead, they try to fit a gradient (correcting for mistakes made in previous iterations),
and this will be stored under a new field, named gradient.

This means the predictions from boosted trees cannot be combined using the regular ensemble combiners. Instead, boosted trees use their own combiner that relies on a few new parameters included with individual boosted trees.
These new parameters will be contained in the boosting attribute in each boosted tree,
which may contain the following properties.

objective_class: indicates the class that each tree helps predict
if boosting is used for a classification problem
(there will be one tree for each class for every boosting iteration).

objective_field: contains the field id of the original objective field,
as boosted trees will always be regression trees whose new objective is a new generated field
(the previously mentioned gradient).

weight: captures the influence each tree has when computing predictions.

lambda: helps regulate the strength of a tree's output.
It's included for generating predictions when encountering missing data and
using the proportional strategy.

Nodes in boosted trees will also contain two new boosting related parameters,
g_sum and h_sum.
These are sums of the first and second order gradients,
and are needed for generating predictions when encountering missing data and
using the proportional strategy.

For regression problems, a prediction is generated by finding the prediction from each individual tree
and doing a weighted sum using each tree's weight.
Predictions for classification problems are similar, but separate weighted sums are
found for each objective_class.
That vector of weighted sums is then transformed into class probabilities using
the softmax function.

Retrieving an Ensemble

Each ensemble has a unique identifier in the form
"ensemble/id" where id is a string of
24 alpha-numeric characters that you can use to retrieve the
ensemble.

To retrieve an ensemble with curl:

curl "https://bigml.io/ensemble/50ef57043c19208c50000022?$BIGML_AUTH"

$ Retrieving an ensemble from the command line

You can also use your browser to visualize the ensemble
using the full BigML.io URL or pasting the
ensemble/id into the BigML.com dashboard.

Ensemble Properties

Once an ensemble has been
successfully created it will have the following properties.

Ensemble Properties

Property

Type

Description

boosting

Object

Gradient boosting options for the ensemble, including scores, which indicates the quality of the boosted trees after each iteration, and final_iterations. See the Gradient Boosting section for more information.

category
filterable,
sortable,
updatable

Integer

One of the categories in the table of categories that help classify this resource according to the domain of application.

code

Integer

HTTP status code. This will be 201 upon successful creation of the ensemble and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the ensemble creation has been completed without errors.

created
filterable,
sortable

String

This is the date and time in which the ensemble was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC).

credits
filterable,
sortable

Float

The number of credits it cost you to create this ensemble.

credits_per_prediction
filterable,
sortable,
updatable

Float

This is the number of credits that other users will consume to make a prediction with your ensemble in case you decide to make it public.

dataset
filterable,
sortable

String

The dataset/id that was used to build the ensemble.

dataset_status
filterable,
sortable

Boolean

Whether the dataset is still available or has been deleted.

description
updatable

String

A text describing the ensemble. It can contain restricted markdown to decorate the text.

distributions

Array

Unordered list of distributions for each model in the ensemble. Each distribution is an Object with an entry for the distribution of instances in the training set and the distribution of predictions in the model. See a model distribution field for more details. Note that distributions must be accessed by the model_order below.

ensemble_sample

Object

The sampling to be used for each tree in the ensemble.

error_models
filterable,
sortable

Integer

The number of models in the ensemble that have failed.

excluded_fields

Array

The list of ids of the fields that were excluded when building the ensemble.

finished_models
filterable,
sortable

Integer

The number of models in the ensemble that have finished correctly.

importance

Object

Average importance per field.

input_fields

Array

The list of input fields' ids used to build the models of the ensemble.

locale

String

The dataset's locale.

max_columns
filterable,
sortable

Integer

The total number of fields in the dataset used to build the ensemble.

max_rows
filterable,
sortable

Integer

The maximum number of instances in the dataset that can be used to build the ensemble.

missing_splits
filterable,
sortable

Boolean

Whether to explicitly include missing field values when choosing a split while growing the models of an ensemble.

model_order

Array

Order in which each model in the list of models was finished. The distributions above must be accessed following this index.

models

Array

Unordered list of model/ids that compose the ensemble. Models are ordered by the model_order above.

name
filterable,
sortable,
updatable

String

The name of the ensemble as you provided it or, by default, based on the name of the dataset.

node_threshold
filterable,
sortable

Integer

The maximum number of nodes that each model in the ensemble will grow.

number_of_batchpredictions
filterable,
sortable

Integer

The current number of batch predictions that use this ensemble.

number_of_evaluations
filterable,
sortable

Integer

The current number of evaluations that use this ensemble.

number_of_models
filterable,
sortable

Integer

The number of models in the ensemble.

number_of_predictions
filterable,
sortable

Integer

The current number of predictions that use this ensemble.

number_of_public_predictions
filterable,
sortable

Integer

The current number of public predictions that use this ensemble.

objective_field

String

Specifies the id of the field that the ensemble predicts.
Example: "000003"

ordering
filterable,
sortable

Integer

The order used to choose instances from the dataset to build the models of the ensemble. There are three different types:

0 Deterministic

1 Linear

2 Random

out_of_bag
filterable,
sortable

Boolean

Whether the out-of-bag instances were used to create the ensemble instead of the sampled instances.

price
filterable,
sortable,
updatable

Float

The price other users must pay to clone your ensemble.

private
filterable,
sortable,
updatable

Boolean

Whether the ensemble is public or not.

project
filterable,
sortable,
updatable

String

The project/id the resource belongs to.

random_candidate_ratio
filterable,
sortable

Float

The random candidate ratio considered when randomize is true.

random_candidates
filterable,
sortable

Integer

The number of random fields considered when randomize is true.

randomize
filterable,
sortable

Boolean

Whether the splits of each model in the ensemble considered only a random subset of the fields or all the fields available.

range

Array

The range of instances used to build the models of the ensemble.

replacement
filterable,
sortable

Boolean

Whether the instances sampled to build the ensemble were selected using replacement or not.

resource

String

The ensemble/id.

rows
filterable,
sortable

Integer

The total number of instances used to build the models of the ensemble.

sample_rate
filterable,
sortable

Float

The sample rate used to select instances from the dataset to build the models of the ensemble.

seed
filterable,
sortable

String

The string that was used to generate the sample.

shared
filterable,
sortable,
updatable

Boolean

Whether the ensemble is shared using a private link or not.

shared_hash

String

The hash that gives access to this ensemble if it has been shared using a private link.

sharing_key

String

The alternative key that gives read access to this ensemble.

size
filterable,
sortable

Integer

The number of bytes of the dataset that were used to create this ensemble.

source
filterable,
sortable

String

The source/id that was used to build the dataset.

source_status
filterable,
sortable

Boolean

Whether the source is still available or has been deleted.

split_candidates
filterable,
sortable

Integer

The number of split points that are considered whenever the tree evaluates a numeric field. Minimum 1 and maximum 1024.

status

Object

A description of the status of the ensemble. It includes a code, a message, and some extra information. See the table below.

subscription
filterable,
sortable

Boolean

Whether the ensemble was created using a subscription plan or not.

support_threshold
filterable,
sortable

Float

This parameter controls the minimum amount of support each child node must contain to be valid as a possible split.

updated
filterable,
sortable

String

This is the date and time in which the ensemble was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC).

white_box
filterable,
sortable

Boolean

Whether the ensemble is publicly shared as a white-box.

Ensemble Status

Creating an ensemble is a process that can take just a few
seconds or a few days depending on the size of the
dataset used as input, the number of models, and
on the workload of BigML's
systems. The ensemble goes through a number of
states until it's fully completed. Through the status field in the
ensemble you can determine when the ensemble has been
fully processed and is ready to be used to create predictions. These are the
properties that an ensemble's status has:

Ensemble Status Object Properties

Property

Type

Description

code

Integer

A status code that reflects the status of the ensemble creation. It can be any of those that are explained here.

Updating an Ensemble

To update an ensemble,
you need to PUT an object containing the fields that you want to update to the
ensemble's base URL.
The content-type must always be "application/json".
If the request succeeds, BigML.io will return
an HTTP 202 response
with the updated ensemble.

For example, to update an ensemble with a new name you can use curl like this:
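
curl "https://bigml.io/ensemble/50ef57043c19208c50000022?$BIGML_AUTH" \
  -X PUT \
  -H 'content-type: application/json' \
  -d '{"name": "a new name"}'

$ Updating an ensemble's name (the ensemble/id is illustrative)

Deleting an Ensemble

To delete an ensemble, you need to issue an HTTP DELETE request to the ensemble/id to be deleted. For instance, with curl:

curl -X DELETE "https://bigml.io/ensemble/50ef57043c19208c50000022?$BIGML_AUTH"

$ Deleting an ensemble from the command line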

If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return "204 no content" responses with no body.

Once you delete an ensemble,
it is permanently deleted. That is, a delete request cannot be undone.
If you try to delete an ensemble
a second time, or an ensemble that
does not exist, you will receive a "404 not found" response.

However, if you try to delete an ensemble
that is being used at the moment, then BigML.io will not accept the request and
will respond with a "400 bad request" response.

Logistic Regressions

Last Updated: Tuesday, 2018-03-13 12:20

A logistic regression is a supervised machine learning method
for solving classification problems. The probability of the objective being a
particular class is modeled as the value of a logistic function, whose argument
is a linear combination of feature values. You can create a logistic regression
by selecting which fields from your dataset you want to use as
input fields (or predictors) and which categorical
field you want to predict, the objective field.
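
In its standard form, the model estimates the probability of a class as

p(class | X1, X2, ... Xk) = 1 / (1 + exp(-(b0 + b1*X1 + b2*X2 + ... + bk*Xk)))

where the coefficients b0, b1, ... bk are learned from the training data.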

For this formulation to be valid, the features X1, X2, ... Xk must be numeric values.
To adapt this model to all the datatypes that BigML supports, we apply the following transformations to the inputs:

Categorical fields are 'one-hot' encoded by default. That is, a separate 0-1 numeric field is created for each category,
and exactly one of those fields has a value of 1, corresponding to the categorical value for the individual instance.
To specify different coding behavior, see the
Coding Categorical Fields section for more details.

Each term present in a text field is mapped to a corresponding numeric field, whose value is the number of occurrences of that term in the instance.
Text fields without term analysis enabled are excluded from the model.

Each item present in an items field is mapped to a corresponding numeric field, whose value is the number of occurrences of that item in the instance.

Missing values in numeric fields can be explicitly included as another valid value by using the argument missing_numerics or
they can be replaced specifying a default_numeric_value.
If none of those arguments are enabled, instances containing missing numeric values will be ignored for training the model.

Logistic Regression Base URL

You can use the following base URL to create, retrieve, update, and
delete logistic regressions.

https://bigml.io/logisticregression

Logistic Regression base URL

All requests to manage your logistic regressions must use HTTPS
and be authenticated using your username and API key to verify
your identity. See this section for more details.

Creating a Logistic Regression

To create a new logistic regression, you need to POST to the
logistic regression base URL an object containing at least
the dataset/id that you want to use to create the
logistic regression. The content-type must always be
"application/json".

Logistic Regression Arguments

In addition to the dataset, you can also POST the
following arguments, and like models,
you can use weights to deal with imbalanced datasets.
Click here
to find more information about weights.

Logistic Regression Creation
Arguments

Argument

Type

Description

balance_fieldsoptional

Boolean,default is false

Whether to scale each numeric field such that its values are zero mean with a standard deviation of 1, based on the field summary statistics at training time.
Example: true

biasoptional

Boolean,default is true

Whether to include the bias term in the solution.
Example: false

coptional

Float,default is 1

The inverse of the regularization strength. Must be greater than 0.
Example: 2

categoryoptional

Integer,default is the category of the dataset

The category that best describes the logistic regression. See the category codes for the complete list of categories.
Example: 1

compute_statsoptional

Boolean,default is false

Whether to compute statistics and significance tests.
Example: true

dataset

String

A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006

default_numeric_valueoptional

String

It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero".
Example: "median"

descriptionoptional

String

A description of the logistic regression up to 8192 characters long.
Example: "This is a description of my new logistic regression"

epsoptional

Float,default is 0.0001

Stopping criteria for solver. If the difference between the results from the current and last iterations is less than eps, then the solver is finished.
Example: 0.1

excluded_fieldsoptional

Array,default is [], an empty list. None of the fields in the dataset is excluded.

Specifies the fields that won't be included in the logistic regression.
Example:

["000000", "000002"]

field_codingsoptional

List

Coding schemes for categorical fields: dummy, contrast, or other. Value is a map between field identifiers and a coding scheme for that field. See the Coding Categorical Fields section for more details. If not specified, one numeric variable is created per categorical value, plus one for missing values.
Example:

[{"field": "species", "coding": "dummy", "dummy_class": "virginica"}]

fieldsoptional

Object,default is {}, an empty dictionary. That is, no names or preferred statuses are changed.

This can be used to change the names of the fields in the logistic regression with respect to the original names in the dataset or to tell BigML that certain fields should be preferred. Add an entry keyed by the field id generated in the source for each field whose name you want updated.
Example:

{
"000001": {"name": "length_1"},
"000003": {"name": "length_2"}
}

input_fieldsoptional

Array,default is []. All the fields in the dataset

Specifies the fields to be included as predictors in the logistic regression.
Example:

["000001", "000003"]

missing_numericsoptional

Boolean,default is true

Whether to create an additional binary predictor for each numeric field, denoting a missing value. If false, these predictors are not created, and rows containing missing numeric values are dropped.
Example: false

nameoptional

String,default is dataset's name

The name you want to give to the new logistic regression.
Example: "my new logistic regression"

objective_fieldoptional

String,default is the id of the last field in the dataset

Specifies the id of the field that you want to predict. The type of the field must be categorical.
Example: "000003"

objective_fieldsoptional

Array,default is an array with the id of the last field in the dataset

Specifies the id of the field that you want to predict. Even if this is an array, BigML.io only accepts one objective field in the current version. If both objective_field and objective_fields are specified, then objective_field takes preference. The type of the fields must be categorical.
Example: ["000003"]

out_of_bagoptional

Boolean,default is false

Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true

projectoptional

String

The project/id you want the logistic regression to belong to.
Example: "project/54d98718f0a5ea0b16000000"

rangeoptional

Array,default is [1, max rows in the dataset]

The range of successive instances to build the logistic regression.
Example: [1, 150]

regularizationoptional

String,default is "l2"

Either l1 or l2, which selects the norm to minimize when regularizing the solution. Regularizing with respect to the l1 norm causes more coefficients to be zero, while the l2 norm forces the magnitudes of all coefficients towards zero.
Example: "l1"

replacementoptional

Boolean,default is false

Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true

sample_rateoptional

Float,default is 1.0

A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5

seedoptional

String

A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample"

stats_sample_seedoptional

String

Random seed value used for stats sampling.
Example: "My stats seed"

stats_sample_sizeoptional

Integer,default is -1

The number of rows to sample for calculating statistics. If -1 is given, then the number of rows will be calculated such that (rows x coefficients) <= 1E+8. The minimum between that number and the total number of input rows will be used.
Example: 1000

tagsoptional

Array of Strings

A list of strings that help classify and index your logistic regression.
Example: ["best customers", "2018"]

Coding Categorical Fields

Categorical fields must be converted to numerical values in order to be used in training a logistic regression model.
By default, they are "one-hot" coded. That is, one numeric variable is created per categorical value, plus one for missing values.
For a given instance, the variable corresponding to the instance's categorical value has its value set to 1,
while the other variables are set to 0.

Using the iris dataset as an example, we can express this coding scheme as the following table:
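
Each row shows how one field value is coded across the four numeric variables created for the species field:

Field value        Iris-setosa  Iris-versicolor  Iris-virginica  missing
Iris-setosa        1            0                0               0
Iris-versicolor    0            1                0               0
Iris-virginica     0            0                1               0
(missing)          0            0                0               1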

To specify different coding behavior, use the field_codings parameter.

The parameter value is an array where each element is a map describing the coding scheme to apply to a particular field, and containing the following keys:

field: The name or identifier of the field to code.

coding: The type of coding to use, either dummy, contrast, or other.

dummy_class: The class value to treat as the control value in dummy coding.

coefficients: The coefficients, which is a nested array of floating point values, to be used with contrast, or other coding.

The value for coding determines which of the following methods is used to code the field:

dummy: Use dummy coding.
The value is a string specifying the value to use as the control.
For example, the value {"field": "species", "coding": "dummy", "dummy_class": "virginica"} defines the following coding:

contrast: Use contrast coding.
The value is an array of vectors, each specifying the coding of an individual variable.
The vectors are checked for length.
If the lengths are less than the expected length by 1, then a 0 is implicitly appended to the end of each array,
so that missing values are ignored for the model.
In addition, each vector is checked that its elements sum to 0, and the entire collection of vectors is checked for orthogonality.
For example, the value {"field": "species", "coding": "contrast", "coefficients": [[0.5,-0.25,-0.25,0],[-1,2,0,-1]]}
defines the following coding:
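
A sketch of the resulting coding, assuming the categories appear in the field summary in the order Iris-setosa, Iris-versicolor, Iris-virginica, followed by the missing value (each coefficient vector defines one variable):

Field value        variable 1  variable 2
Iris-setosa        0.5         -1
Iris-versicolor    -0.25       2
Iris-virginica     -0.25       0
(missing)          0           -1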

other: A user-specified coding scheme.
Uses an array of vectors like in contrast, but only length is checked.
For coding vectors, the coefficients should be listed in the same order
in which the corresponding values appear in the field summary, like [[1, 2, 3, 4, 5, 6, 7, 8], [-2, 0, -2, 0, 2, 0, 2, 0]].

If multiple coding schemes are listed for a single field, then the coding closest to the end of the list is used.
Codings given for non-categorical variables are ignored.

If compute_stats is set to true, then all categorical fields without specified codings will be
assigned dummy coding. The dummy class will be the first by alphabetical order. This is because the default one-hot encoding
produces collinearity effects which result in an ill-formed covariance matrix.

You can also use curl to customize a new
logistic regression. For example, to create a new logistic regression named "my logistic regression",
with only certain rows, and with only three fields:
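
curl "https://bigml.io/logisticregression?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/4f66a80803ce8940c5000006",
       "name": "my logistic regression",
       "range": [1, 150],
       "input_fields": ["000000", "000001", "000002"],
       "objective_field": "000003"}'

$ Creating a customized logistic regression (the dataset/id is illustrative)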

If you do not specify a name, BigML.io will assign to
the new logistic regression the dataset's name. If
you do not specify a range of instances, BigML.io
will use all the instances in the dataset. If you do
not specify any input fields, BigML.io will include all
the input fields in the dataset, and if you do not specify an
objective field, BigML.io will
use the last field in your dataset.

Retrieving a Logistic Regression

Each logistic regression has a unique identifier in the form
"logisticregression/id" where id is a string of
24 alpha-numeric characters that you can use to retrieve the
logistic regression.
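
To retrieve a logistic regression with curl (the id below is illustrative):

curl "https://bigml.io/logisticregression/5af06df84e17277d85000002?$BIGML_AUTH"

$ Retrieving a logistic regression from the command line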

You can also use your browser to visualize the logistic regression
using the full BigML.io URL or pasting the
logisticregression/id into the BigML.com dashboard.

Logistic Regression Properties

Once a logistic regression has been
successfully created it will have the following properties.

Logistic Regression Properties

Property

Type

Description

category
filterable,
sortable,
updatable

Integer

One of the categories in the table of categories that help classify this resource according to the domain of application.

code

Integer

HTTP status code. This will be 201 upon successful creation of the logistic regression and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the logistic regression creation has been completed without errors.

created
filterable,
sortable

String

This is the date and time in which the logistic regression was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC).

credits
filterable,
sortable

Float

The number of credits it cost you to create this logistic regression.

credits_per_prediction
filterable,
sortable,
updatable

Float

This is the number of credits that other users will consume to make a prediction with your logistic regression if you made it public.

dataset
filterable,
sortable

String

The dataset/id that was used to build the logistic regression.

dataset_field_types

Object

A dictionary that informs about the number of fields of each type in the dataset used to create the logistic regression. It has an entry per each field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields.

dataset_status
filterable,
sortable

Boolean

Whether the dataset is still available or has been deleted.

description
updatable

String

A text describing the logistic regression. It can contain restricted markdown to decorate the text.

excluded_fields

Array

The list of ids of the fields that were excluded when building the logistic regression.

fields_meta

Object

A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned.

input_fields

Array

The list of input fields' ids used to build the logistic regression.

locale

String

The dataset's locale.

logistic_regression

Object

All the information that you need to recreate or use the logistic regression on your own. It includes a list of coefficients and the field's dictionary describing the fields and their summaries. See here for more details.

max_columns
filterable,
sortable

Integer

The total number of fields in the dataset used to build the logistic regression.

max_rows
filterable,
sortable

Integer

The maximum number of instances in the dataset that can be used to build the logistic regression.

name
filterable,
sortable,
updatable

String

The name of the logistic regression as you provided it or, by default, based on the name of the dataset.

number_of_batchpredictions
filterable,
sortable

Integer

The current number of batch predictions that use this logistic regression.

number_of_evaluations
filterable,
sortable

Integer

The current number of evaluations that use this logistic regression.

number_of_predictions
filterable,
sortable

Integer

The current number of predictions that use this logistic regression.

objective_field

String

The id of the field that the logistic regression predicts.

objective_fields

Array

Specifies the list of ids of the field that the logistic regression predicts. Even if this is an array, BigML.io only accepts one objective field in the current version.

out_of_bag
filterable,
sortable

Boolean

Whether the out-of-bag instances were used to create the logistic regression instead of the sampled instances.

price
filterable,
sortable,
updatable

Float

The price other users must pay to clone your logistic regression.

private
filterable,
sortable,
updatable

Boolean

Whether the logistic regression is public or not.

project
filterable,
sortable,
updatable

String

The project/id the resource belongs to.

range

Array

The range of instances used to build the logistic regression.

replacement
filterable,
sortable

Boolean

Whether the instances sampled to build the logistic regression were selected using replacement or not.

resource

String

The logisticregression/id.

rows
filterable,
sortable

Integer

The total number of instances used to build the logistic regression.

sample_rate
filterable,
sortable

Float

The sample rate used to select instances from the dataset to build the logistic regression.

seed
filterable,
sortable

String

The string that was used to generate the sample.

shared
filterable,
sortable,
updatable

Boolean

Whether the logistic regression is shared using a private link or not.

shared_hash

String

The hash that gives access to this logistic regression if it has been shared using a private link.

sharing_key

String

The alternative key that gives read access to this logistic regression.

size
filterable,
sortable

Integer

The number of bytes of the dataset that were used to create this logistic regression.

source
filterable,
sortable

String

The source/id that was used to build the dataset.

source_status
filterable,
sortable

Boolean

Whether the source is still available or has been deleted.

status

Object

A description of the status of the logistic regression. It includes a code, a message, and some extra information. See the table below.

subscription
filterable,
sortable

Boolean

Whether the logistic regression was created using a subscription plan or not.

updated
filterable,
sortable

String

This is the date and time in which the logistic regression was updated with microsecond precision. It follows this pattern: yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC).

white_box
filterable,
sortable

Boolean

Whether the logistic regression is publicly shared as a white-box.

A Logistic Regression Object has the following properties:

Logistic Regression Object Properties

Property

Type

Description

balance_fields

Boolean

Whether to scale each numeric field such that its values have zero mean and a standard deviation of 1, based on the field summary statistics at training time.

bias

Boolean

Whether to include the bias term in the solution.

c

Float

The inverse of the regularization strength.

coefficients

Array of Arrays

Coefficients of the logistic regression for each category in the objective field.

compute_stats

Boolean

Whether to compute statistics and significance tests.

eps

Float

The stopping criterion for the solver. If the difference between the results from the current and last iterations is less than eps, then the solver is finished.

fields

Object

A dictionary with an entry per field in the dataset used to build the logistic regression. Fields are paginated according to the fields_meta attribute. Each entry includes the column number in the original source, the name of the field, the type of the field, and the summary. See this Section for more details.

missing_class_in_coefficients

Boolean

Whether there is a missing class in the coefficients of the logistic regression.

missing_numerics

Boolean

Whether to create an additional binary predictor for each numeric field, denoting a missing value. If false, these predictors are not created, and rows containing missing numeric values are dropped.

normalize

Boolean

Whether to normalize feature vectors in training and predicting.

regularization

String

Either l1 or l2 at the moment. It selects the norm to minimize when regularizing the solution. Regularizing with respect to the l1 norm causes more coefficients to be zero, and using the l2 norm forces the magnitudes of all coefficients towards zero.

stats

Object

Statistical tests to assess the quality of the model's fit to the data. See this Section for more details.

stats_sample_seed

String

Random seed value used for stats sampling.

stats_sample_size

Integer

The number of rows sampled for calculating statistical tests.

Coefficients Structure

The coefficients output field is an array of pairs, one pair per class.
The first element in the pair is a class value, and the second element is a nested array of coefficients for the logistic model
that gives the probability of that class.
Each inner array within the nested array contains the group of coefficients that pertain to a single input field.
The groups are listed in the same order as in input_fields,
with a final singleton array corresponding to the bias term.
The class-coefficient pairs are listed in the same order as the class values in the objective field summary.
If the model was trained with missing values in the objective field,
then a vector of coefficients will also be created for the missing class value, labeled with "", and listed last.

Numeric fields correspond to two coefficients.
The first predictor is the numeric value, and the second predictor is a binary value corresponding to missing values.
For example, a numeric field value of 5 maps to a value of 5 in the first predictor, and 0 in the second,
while a missing value maps to 0 in the first predictor, and 1 in the second.
If the missing_numerics parameter is false, then only a single predictor will be generated for numeric fields.

Categorical fields correspond to n+1 coefficients, where the first n coefficients correspond to class values,
and the final coefficient corresponds to a binary missing value predictor.

Text and items fields correspond to m+1 coefficients,
where the first m coefficients correspond to each term in the field's tag cloud, listed in the same order as in the field summary.
The final term corresponds to an empty string or itemset, or in the case of text fields,
a string which does not contain any terms in the text analysis vocabulary.

The final coefficient in the list corresponds to the bias term.
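
As a purely illustrative sketch of this structure (field ids, classes, and coefficient values are all invented), consider a model whose input_fields are a numeric field followed by a categorical field with two classes, and whose objective field has the classes "false" and "true". Its coefficients array would be shaped like this:

[
    ["false", [[0.173, -0.064], [1.208, -0.471, 0.003], [-2.104]]],
    ["true", [[-0.173, 0.064], [-1.208, 0.471, -0.003], [2.104]]]
]

For each class, the first inner array holds the numeric field's value and missing-value coefficients, the second holds the categorical field's two class coefficients plus its missing-value coefficient, and the final singleton array holds the bias term.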

Significance Tests

If the compute_stats parameter is true,
then the logistic regression output contains a number of statistical tests to assess the quality of the model's fit to the data.
These are found under a field named stats. For each set of coefficients, the following statistics are computed:

likelihood_ratio: the difference in log likelihood between the fitted model and an intercept-only model.
Given as a pair [p-value, ratio]. This statistic tests whether the coefficients as
a whole have any predictive power over an intercept-only model.

confidence_intervals:
the size of the 95% confidence interval for each coefficient estimate. That is, for a coefficient estimate x, and an interval value n, the value of the coefficient is x ± n with a confidence of 95%.

standard_errors, z_scores,
p_values, confidence_intervals:
These statistics test the significance of individual coefficient estimates,
and are grouped in the same nested array fashion as the coefficients themselves.

To avoid lengthy computation times, statistics from large input datasets will be computed
from a sub-sample of the dataset such that the number of coefficients times the number of rows is less than or equal to 1E+8.
For example, a model with 1,000 coefficients will compute its statistics on a sample of at most 100,000 rows.

It is possible for null to appear among the values contained in stats.
Wald test
statistics cannot be computed for zero-value coefficients, and so their corresponding entries are null.
Moreover, if the coefficients' information matrix is ill-conditioned,
e.g. if there are fewer instances of the positive class than the number of coefficients, then it is impossible
to perform the Wald test
on the entire set of coefficients. In this case standard_errors, z_scores,
p_values, and confidence_intervals will have a value of null.

Logistic Regression Status

Creating a logistic regression is a process that can take just a few
seconds or a few days depending on the size of the
dataset used as input and on the workload of BigML's
systems. The logistic regression goes through a number of
states until it is fully completed. Through the status field in the
logistic regression you can determine when the logistic regression has been
fully processed and is ready to be used to create predictions. These are the
properties that a logistic regression's status has:

Logistic Regression Status Object Properties

Property

Type

Description

code

Integer

A status code that reflects the status of the logistic regression creation. It can be any of those that are explained here.

elapsed

Integer

Number of milliseconds that BigML.io took to process the logistic regression.

message

String

A human readable message explaining the status.

progress

Float, between 0 and 1

How far BigML.io has progressed building the logistic regression.

Once a logistic regression has been successfully created, it will look like:
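
The full JSON is not reproduced here; as a heavily abbreviated and hypothetical sketch, using only properties documented above with placeholder ids and values (a status code of 5 denotes a finished resource), the response has this general shape:

{
    "code": 200,
    "dataset": "dataset/4f66a80803ce8940c5000006",
    "name": "my logistic regression",
    "objective_field": "000004",
    "resource": "logisticregression/<logisticregression_id>",
    "rows": 150,
    "status": {
        "code": 5,
        "elapsed": 1432,
        "message": "The logistic regression has been created",
        "progress": 1.0
    }
}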

Filtering and Paginating Fields from a Logistic Regression

A logistic regression might be composed of hundreds or even thousands of
fields. Thus when retrieving a logisticregression,
it's possible to specify that only a subset of fields be retrieved, by using any combination of the following
parameters in the query string (unrecognized parameters are ignored):

Fields Filter Parameters

Parameter

Type

Description

fields
optional

Comma-separated list

A comma-separated list of field IDs to retrieve.
Example: "fields=000000,000002"

full
optional

Boolean

If false, no information about fields is returned.
Example: "full=false"

iprefix
optional

String

A case-insensitive string to retrieve fields whose names start with the given prefix. It is possible to specify more than one iprefix by repeating the parameter, in which case the union of the results is returned.
Example: "iprefix=INCOME"

limit
optional

Integer

Maximum number of fields that you will get in the fields field.
Example: "limit=100"

offset
optional

Integer

How far off from the first field in your dataset is the first field in the fields field.
Example: "offset=100"

prefix
optional

String

A case-sensitive string to retrieve fields whose names start with the given prefix. It is possible to specify more than one prefix by repeating the parameter, in which case the union of the results is returned.
Example: "prefix=income"

Since fields is a map and therefore not
ordered, the returned fields contain an additional key, order,
whose integer (increasing) value gives you their ordering. In all other
respects, the resource is the same as the one you would get without any
filtering parameter above.
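
For instance, a minimal sketch with curl, assuming a placeholder logisticregression/id and the BIGML_AUTH variable from the Authentication section (the extra parameters are appended with semicolons, following the same convention as BIGML_AUTH's username=...;api_key=... string):

curl "https://bigml.io/logisticregression/<logisticregression_id>?$BIGML_AUTH;limit=10;offset=20"

This would return the logistic regression with only the third page of 10 fields populated in fields.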

The fields_meta field can help you paginate fields. Its
structure is as follows:

Fields Meta Object Properties

Property

Type

Description

count
optional

Integer

Specifies the current number of fields in the resource.

limit
optional

Integer

The maximum number of fields that will be returned in the resource.

offset
optional

Integer

The current offset in the pagination of fields.

total
optional

Integer

The total number of fields in the resource.

Note that paginating fields might only be worthwhile if you are going to
deal with really wide resources (i.e., more than 200 fields).

Updating a Logistic Regression

To update a logistic regression,
you need to PUT an object containing the fields that you want to update to the
logistic regression's base URL.
The content-type must always be: "application/json".
If the request succeeds, BigML.io will return
an HTTP 202 response
with the updated logistic regression.

For example, to update a logistic regression with a new name you can use curl like this:
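
A minimal sketch, assuming a placeholder logisticregression/id and the BIGML_AUTH variable from the Authentication section:

curl "https://bigml.io/logisticregression/<logisticregression_id>?$BIGML_AUTH" \
    -X PUT \
    -H 'content-type: application/json' \
    -d '{"name": "a new name for my logistic regression"}'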

Deleting a Logistic Regression

To delete a logistic regression, you need to issue an HTTP DELETE request
to the logisticregression/id to be deleted, as in the sketch below.

If the request succeeds you will not see anything on the command line
unless you executed the command in verbose mode. Successful
DELETEs will return "204 no content" responses with no body.
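
A minimal sketch with curl, again assuming a placeholder logisticregression/id:

curl -X DELETE "https://bigml.io/logisticregression/<logisticregression_id>?$BIGML_AUTH"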

Once you delete a logistic regression,
it is permanently deleted. That is, a delete request cannot be undone.
If you try to delete a logistic regression
a second time, or a logistic regression that
does not exist, you will receive a "404 not found" response.

However, if you try to delete a logistic regression
that is being used at the moment, then BigML.io will not accept the request and
will respond with a "400 bad request" response.

Listing Logistic Regressions

To list all the logistic regressions,
you can use the logisticregression base URL.
By default, only the 20 most recent logistic regressions
will be returned. You can see below how to change this number using
the limit parameter.

You can get your list of logistic regressions directly in your browser
using your own username and API key with the following links.
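
As a sketch, the same listing with curl, with the limit parameter shown as an illustration (BIGML_AUTH as set up in the Authentication section):

curl "https://bigml.io/logisticregression?$BIGML_AUTH;limit=5"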

Clusters

Last Updated: Tuesday, 2018-03-13 12:20

A cluster is a set of groups (i.e., clusters) of instances of a
dataset that have been automatically classified together according to a distance measure
computed using the fields of the dataset. Clusters can handle numeric, categorical,
text and items fields as inputs:

Categorical fields: a common way to handle categorical data is to take each category as a new field and assign 0 or 1 depending on the category. So a field with 20 categories will become 20 separate binary fields. BigML uses a technique called k-prototypes which modifies the distance function to operate as though the categories were transformed to binary values.

Text and item fields: each instance is assigned a vector of terms and then cosine similarity is
computed to determine closeness between instances.

Each cluster group is represented by
a centroid or center that is computed using the mean
for each numeric field and the mode for each categorical field. For text and items fields each
centroid contains the terms or items which minimize the average cosine distance between the
centroid and the points in its neighborhood.

To create a cluster, you can select an arbitrary number of clusters (i.e.,
k) and also select an arbitrary subset of fields from your dataset as
input_fields. You can use scales to select how each field
influences the distance measure used to group instances together.

Cluster Base URL

You can use the following base URL to create, retrieve, update, and
delete clusters.

https://bigml.io/cluster

Cluster base URL

All requests to manage your clusters must use HTTPS
and be authenticated using your username and API key to verify
your identity. See this section for more details.

Creating a Cluster

To create a new cluster, you need to POST to the
cluster base URL an object containing at least
the dataset/id that you want to use to create the
cluster.
The content-type must always be "application/json".
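
A minimal sketch with curl, reusing the example dataset/id from the arguments table below (any dataset/id of yours would work the same way):

curl "https://bigml.io/cluster?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"dataset": "dataset/4f66a80803ce8940c5000006"}'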

Cluster Arguments

In addition to the dataset, you can also POST the
following arguments.

Cluster Creation Arguments

Argument

Type

Description

balance_fields
optional

Boolean, default is true.

When this parameter is enabled, all the numeric fields will be scaled so that their standard deviations are 1. This makes each field have roughly equivalent influence.
Example: true

category
optional

Integer, default is the category of the dataset

The category that best describes the cluster. See the category codes for the complete list of categories.
Example: 1

cluster_seed
optional

String

A string to generate deterministic clusters.
Example: "My Seed"

critical_value
optional

Integer, default is 5

The clustering algorithm G-means is parameter-free except for one parameter, critical_value. G-means iteratively takes existing clusters and tests whether the cluster's neighborhood appears Gaussian. If it doesn't, the cluster is split into two. The critical_value sets how strict the test is when deciding whether data looks Gaussian. The default is 5, which seems to work well in most cases. A range of 1 - 10 is acceptable. A critical_value of 1 means data must look very Gaussian to pass the test, and can lead to more clusters being detected. Higher critical_value settings will tend to find fewer clusters.
Example: 3

dataset

String

A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006

default_numeric_value
optional

String

It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero".
Example: "median"

description
optional

String

A description of the cluster up to 8192 characters long.
Example: "This is a description of my new cluster"

excluded_fields
optional

Array, default is [], an empty list. None of the fields in the dataset is excluded.

Specifies the fields that won't be included in the cluster.
Example:

["000000", "000002"]

field_scales
optional

Object, default is {}, an empty dictionary. That is, no special scaling is used.

With this argument you can pick your own scaling for each field. If a field isn't included in field_scales, BigML will treat the scale as 1 (no scale change). If both balance_fields and field_scales are present, then balance_fields will be applied first. This makes it easy for you to do things like balancing age and salary, but then requesting that age be twice as important.
Example:

{
    "000001": 4,
    "000003": 2
}

fields
optional

Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed.

This can be used to change the names of the fields in the cluster with respect to the original names in the dataset or to tell BigML that certain fields should be preferred. Include an entry keyed by the field id generated in the source for each field whose name you want updated.
Example:

{
    "000001": {"name": "length_1"},
    "000003": {"name": "length_2"}
}

input_fields
optional

Array, default is [], meaning all the fields in the dataset are used

Specifies the fields to be considered to create the clusters.
Example:

["000001", "000003"]

k
optional

Integer, default is null, which uses G-means clustering

The number of clusters. Must be null or a number greater than or equal to 1 and less than or equal to 300.
Example: 3

model_clusters
optional

Boolean, default is false

Whether a model for every cluster will be generated or not. Each model predicts whether or not an instance is part of its respective cluster.
Example: true

name
optional

String, default is the dataset's name

The name you want to give to the new cluster.
Example: "my new cluster"

out_of_bag
optional

Boolean, default is false

Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true