Data, Learning and Modeling

There are key concepts in machine learning that lay the foundation for understanding the field.

In this post, you will learn the nomenclature (standard terms) that is used when describing data and datasets.

You will also learn the concepts and terms used to describe learning and modeling from data that will provide a valuable intuition for your journey through the field of machine learning.

Data

Machine learning methods learn from examples. It is important to have a good grasp of input data and the terminology used to describe it. In this section, you will learn the terminology used in machine learning when referring to data.

When I think of data, I think of rows and columns, like a database table or an Excel spreadsheet. This is a traditional structure for data and is what is common in the field of machine learning. Other data, such as images, videos, and text (so-called unstructured data), is not considered at this time.

Table of Data Showing an Instance, Feature, and Train-Test Datasets

Instance: A single row of data is called an instance. It is an observation from the domain.

Feature: A single column of data is called a feature. It is a component of an observation and is also called an attribute of a data instance. Some features may be inputs to a model (the predictors) and others may be outputs or the features to be predicted.

Data Type: Features have a data type. They may be real or integer-valued or may have a categorical or ordinal value. You can have strings, dates, times, and more complex types, but typically they are reduced to real or categorical values when working with traditional machine learning methods.
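To make the terms concrete, here is a minimal sketch of a table of data in plain Python. The dataset, the feature names, and the helpers column and encode_categorical are all made up for illustration. Each dictionary is one instance, each key is one feature, and the last helper shows one simple way a categorical feature might be reduced to numbers.

```python
# Hypothetical toy dataset: each dict is one instance (row),
# each key is one feature (column).
dataset = [
    {"height_cm": 170.0, "age": 34, "smoker": "no", "risk": "low"},
    {"height_cm": 182.5, "age": 51, "smoker": "yes", "risk": "high"},
    {"height_cm": 165.2, "age": 29, "smoker": "no", "risk": "low"},
]

input_features = ["height_cm", "age", "smoker"]  # the predictors
output_feature = "risk"                          # the feature to be predicted


def column(rows, name):
    """Return every value of one feature (one column of the table)."""
    return [row[name] for row in rows]


def encode_categorical(values):
    """Map each distinct category to an integer code, in sorted order."""
    codes = {category: index for index, category in enumerate(sorted(set(values)))}
    return [codes[value] for value in values]


first_instance = dataset[0]                                   # a single row
ages = column(dataset, "age")                                 # an integer-valued feature
smoker_codes = encode_categorical(column(dataset, "smoker"))  # a categorical feature as codes

print(first_instance)
print(ages)
print(smoker_codes)
```

Integer codes are only one option. They suggest an ordering between categories, so for categories with no natural order a one-column-per-category encoding is often preferred.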

Datasets: A collection of instances is a dataset and when working with machine learning methods we typically need a few datasets for different purposes.

Training Dataset: A dataset that we feed into our machine learning algorithm to train our model.

Testing Dataset: A dataset that we use to validate the accuracy of our model but is not used to train the model. It may be called the validation dataset.

We may have to collect instances to form our datasets or we may be given a finite dataset that we must split into sub-datasets.
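Splitting a finite dataset into sub-datasets can be sketched in a few lines. This is an illustrative helper written for this post, not the scikit-learn function of the same name; the 25% test fraction and the seed are arbitrary assumptions.

```python
import random


def train_test_split(instances, test_fraction=0.25, seed=0):
    """Shuffle a copy of the instances and split it into train and test lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


instances = list(range(20))  # stand-ins for 20 rows of data
train, test = train_test_split(instances, test_fraction=0.25, seed=0)

print(f"{len(train)} training instances, {len(test)} testing instances")
```

Shuffling before splitting matters when the rows are stored in some meaningful order, otherwise the testing dataset may not look like the training dataset.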

Learning

Machine learning is indeed about automated learning with algorithms.

In this section, we will consider a few high-level concepts about learning.

Generalization: Generalization is required because the model that is prepared by a machine learning algorithm needs to make predictions or decisions based on specific data instances that were not seen during training.

Over-Learning: When a model learns the training data too closely and does not generalize, this is called over-learning. The result is poor performance on data other than the training dataset. This is also called over-fitting.

Under-Learning: When a model has not learned enough structure from the data, for example because the learning process was terminated early, this is called under-learning. The result is good generalization but poor performance on all data, including the training dataset. This is also called under-fitting.
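The difference between over-learning and under-learning shows up when you compare error on the training dataset with error on the testing dataset. The sketch below is a toy illustration under stated assumptions: the domain is y = 2x plus Gaussian noise, and the three "models" (a constant mean, a least-squares line, and a memorizer that copies the nearest training instance) are stand-ins chosen to be easy to read, not recommended methods.

```python
import random

rng = random.Random(0)
xs = [i * 0.5 for i in range(20)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]  # assumed domain: y = 2x + noise

# Alternate rows between the training and testing datasets.
train = list(zip(xs[0::2], ys[0::2]))
test = list(zip(xs[1::2], ys[1::2]))


def fit_mean(data):
    """Under-learning: ignore x and always predict the mean training output."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y


def fit_line(data):
    """Least-squares straight line, a reasonable fit for this domain."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(data):
    """Over-learning: return the output of the closest training instance."""
    return lambda x: min(data, key=lambda row: abs(row[0] - x))[1]


def mse(model, data):
    """Mean squared error of a model on a list of (x, y) rows."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


results = {}
for name, fit in [("mean", fit_mean), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train)
    results[name] = (mse(model, train), mse(model, test))
    print(f"{name:10s} train MSE={results[name][0]:.2f}  test MSE={results[name][1]:.2f}")
```

The memorizer reproduces the training data exactly, so its training error is zero, but that tells you nothing about unseen data. The mean predictor is poor on both datasets, which is the signature of under-learning.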

Online Learning: Online learning is when a method is updated with data instances from the domain as they become available. Online learning requires methods that are robust to noisy data but can produce models that are more in tune with the current state of the domain.

Offline Learning: Offline learning is when a method is created on pre-prepared data and is then used operationally on unobserved data. The training process can be controlled and can be tuned carefully because the scope of the training data is known. The model is not updated after it has been prepared, and performance may decrease if the domain changes.
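A very small sketch of the online and offline contrast, using the simplest possible "model": an estimate of the average value in the domain. The OnlineMean class and its step_size setting are hypothetical names invented for this example. The offline estimate is computed once and frozen, while the online estimate is nudged toward each new instance as it arrives.

```python
def offline_mean(values):
    """Offline: estimate once from pre-prepared data, then never update."""
    return sum(values) / len(values)


class OnlineMean:
    """Online: a running estimate updated one instance at a time."""

    def __init__(self, step_size=0.5):
        # Hypothetical setting: larger values forget old data faster.
        self.step_size = step_size
        self.estimate = None

    def update(self, value):
        if self.estimate is None:
            self.estimate = value
        else:
            self.estimate += self.step_size * (value - self.estimate)
        return self.estimate


history = [10.0, 10.0, 10.0, 10.0]  # pre-prepared data
stream = [20.0, 20.0, 20.0, 20.0]   # later instances, after the domain changes

frozen = offline_mean(history)

online = OnlineMean(step_size=0.5)
for value in history + stream:
    online.update(value)

print(f"offline estimate: {frozen}")
print(f"online estimate:  {online.estimate}")
```

The frozen estimate stays at the old level after the domain changes, while the online estimate tracks the new level. The same update rule would also chase a single noisy instance, which is why online methods need to be robust to noise.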

Supervised Learning: This is a learning process for generalizing on problems where a prediction is required. A “teaching process” compares predictions by the model to known answers and makes corrections in the model.

Unsupervised Learning: This is a learning process for generalizing the structure in the data where no prediction is required. Natural structures are identified and exploited for relating instances to each other.
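As a rough illustration of the two learning processes, the sketch below uses a perceptron-style update for the supervised case and a two-centre grouping of one-dimensional values for the unsupervised case. Both functions, the data, and the settings (epochs, step, iterations) are assumptions made for this example rather than methods taken from this post.

```python
def train_perceptron(rows, epochs=20, step=0.1):
    """Supervised: compare each prediction to the known answer and correct the model."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for x, label in rows:
            prediction = 1 if weight * x + bias > 0 else 0
            correction = label - prediction  # 0 when the prediction is right
            weight += step * correction * x
            bias += step * correction
    return weight, bias


def two_means(values, iterations=10):
    """Unsupervised: group 1-D values around two centres, with no labels involved.

    Assumes the values contain at least two distinct numbers.
    """
    low, high = min(values), max(values)
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return low, high


labelled = [(1.0, 0), (1.5, 0), (2.0, 0), (6.0, 1), (6.5, 1), (7.0, 1)]
weight, bias = train_perceptron(labelled)
predictions = [1 if weight * x + bias > 0 else 0 for x, _ in labelled]

unlabelled = [x for x, _ in labelled]  # the same inputs with the answers removed
centres = two_means(unlabelled)

print(predictions)
print(centres)
```

The supervised function needs the known answers to compute its corrections. The unsupervised function only ever sees the input values and finds the two natural groups on its own.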

We have covered supervised and unsupervised learning before in the post on machine learning algorithms. These terms can be useful for classifying algorithms by their behavior.

Modeling

The artefact created by a machine learning process could be considered a program in its own right.

Model Selection: We can think of the process of configuring and training the model as a model selection process. At each iteration we have a new model that we could choose to use or to modify. Even the choice of machine learning algorithm is part of that model selection process. Of all the possible models that exist for a problem, a given algorithm and algorithm configuration on the chosen training dataset will provide a finally selected model.

Inductive Bias: Bias is the set of limits imposed on the selected model. All models are biased, which introduces error in the model, and by definition all models have error (they are generalizations from observations). Biases are introduced by the generalizations made in the model, including the configuration of the model and the selection of the algorithm used to generate the model. A machine learning method can create a model with a low or a high bias, and tactics can be used to reduce the bias of a highly biased model.

Model Variance: Variance is how sensitive the model is to the data on which it was trained. A machine learning method can have a high or a low variance when creating a model on a dataset. A tactic to reduce the effect of variance on the estimated performance is to train the model multiple times on a dataset with different initial conditions and take the average accuracy as the model's performance.

Bias-Variance Tradeoff: Model selection can be thought of as a trade-off between bias and variance. A low bias model will typically have a high variance and will need to be trained for a long time or many times to get a usable model. A high bias model will typically have a low variance and will train quickly, but suffer poor and limited performance.
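The tactic mentioned under Model Variance, running a method several times and averaging the accuracy, can be sketched as follows. Everything here is an assumption made for illustration: the synthetic one-dimensional problem, the 15% label noise, the nearest-neighbor method (picked because it is sensitive to its training data), and the use of a different shuffle seed as the "different initial conditions".

```python
import random
import statistics


def make_dataset(rng, size=60):
    """Hypothetical 1-D problem: the label is 1 when x > 0.5, with 15% label noise."""
    rows = []
    for _ in range(size):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.15:
            label = 1 - label  # flip the label to simulate noise
        rows.append((x, label))
    return rows


def nearest_neighbor_predict(train, x):
    """Predict the label of the closest training instance (a high-variance method)."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def run_once(rows, seed):
    """Shuffle with the given seed, train on 2/3 of the rows, score on the rest."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    train, test = shuffled[:cut], shuffled[cut:]
    correct = sum(nearest_neighbor_predict(train, x) == label for x, label in test)
    return correct / len(test)


rows = make_dataset(random.Random(42))
scores = [run_once(rows, seed) for seed in range(10)]
mean_accuracy = statistics.mean(scores)
spread = statistics.stdev(scores)

print(f"accuracy per run: {[round(s, 2) for s in scores]}")
print(f"mean accuracy={mean_accuracy:.2f}, standard deviation={spread:.2f}")
```

The standard deviation across runs gives a rough sense of how sensitive the method is to the data it was trained on. The mean is a steadier summary than any single run, although averaging scores does not change the variance of any individual model.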

I would make a distinction between validation and testing datasets. You train your model on your training set, you use a validation set to tune the model parameters, and you use a test set to assess the accuracy of your model, being careful that your test set does not influence the modelling process in any way.

Dirk, the “tuning” on the validation set is for the model’s hyper-parameters, not for the actual “parameters” of the model! Depending on the type of model there will be different hyper-parameters to tune (the regularization parameter in a cost function for example).

Machine learning algorithms learn coefficients from data, like coefficients in linear regression to describe a line. These are the model parameters.

To learn the coefficients, we often use a learning algorithm, like stochastic gradient descent. This algorithm can have parameters to control learning, like a learning rate. The learning rate is an example of a hyperparameter.
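Here is a minimal sketch of that distinction, assuming a simple linear regression with one input trained by stochastic gradient descent. The function name train_sgd, the noise-free data, and the specific values for learning_rate and epochs are choices made for this example. The coefficients b0 and b1 are the model parameters learned from the data; learning_rate and epochs are hyperparameters that we set before training.

```python
import random


def train_sgd(data, learning_rate, epochs, seed=0):
    """Fit y = b0 + b1 * x with stochastic gradient descent.

    learning_rate and epochs are hyperparameters: chosen by us, not learned.
    b0 and b1 are the model parameters: learned from the data.
    """
    rng = random.Random(seed)
    b0, b1 = 0.0, 0.0
    rows = list(data)
    for _ in range(epochs):
        rng.shuffle(rows)  # visit the instances in a new order each epoch
        for x, y in rows:
            error = (b0 + b1 * x) - y
            b0 -= learning_rate * error      # gradient step for the intercept
            b1 -= learning_rate * error * x  # gradient step for the slope
    return b0, b1


# Hypothetical noise-free data drawn from the line y = 1 + 2x.
data = [(x / 10.0, 1.0 + 2.0 * x / 10.0) for x in range(11)]
b0, b1 = train_sgd(data, learning_rate=0.1, epochs=2000)

print(f"learned parameters: b0={b0:.3f}, b1={b1:.3f}")
```

Trying several values of learning_rate and comparing the resulting models on a validation set is the kind of hyperparameter tuning described in the comment above.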

I’ve got a question for you if you don’t mind: How do you think the volume of data affects some machine learning algorithms? For instance, do you think having more data is always good, or is there a point where the gains become small because the performance of the model plateaus?

Sir,
I read in one article that they divide their datasets into online and offline. Can you please explain the difference between online and offline datasets? How can I divide my datasets into online and offline datasets for research purposes? Thank you…

In regards to the bias-variance tradeoff, if you were to skew one way or the other, would it make sense to be high on the variance and low on the bias? Sure, this will result in over-fitting, but at least that way you get what you need with some noise thrown in. With under-fitting you don’t get anything useful at all.