General principles

Dataset Format

Datasets are represented as matrices, with rows corresponding to samples and columns corresponding to variables. Three types of tasks can be distinguished, each with its own format:

Task One is regression: predicting the value of a dependent variable (or several variables) from known values of the independent variables (for example, predicting a currency rate from its prior values). The predicted value is real. In this case, the independent variables are stored in the first NVars columns of the array, followed by the columns holding the dependent variables.

Task Two is classification: assigning an observation to one of the classes. The predicted variable is nominal. As in the previous case, the first NVars columns contain the independent variables. The next column contains the class number (from 0 to NClasses-1). Fractional values are rounded to the nearest integer.

The third type covers tasks that are neither regression nor classification, such as clustering. In such cases, a training set will, as a rule, contain only independent variables (the first NVars columns).
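The layouts described above can be illustrated with a minimal Python sketch. The toy data and the helper names (split_row, class_of) are hypothetical and are not part of the ALGLIB API; the handling of exact ties such as 0.5 when rounding is an assumption, since the text only says "nearest integer".

```python
# Hypothetical toy data with two independent variables (NVars = 2).
NVARS = 2

# Regression: the first NVARS columns are independent variables,
# the remaining column(s) hold the dependent variable(s).
regression_set = [
    [1.0, 2.0, 3.5],
    [2.0, 0.5, 1.25],
    [0.0, 1.0, 0.75],
]

# Classification: the first NVARS columns are independent variables,
# the next column holds the class number in 0..NClasses-1.
classification_set = [
    [1.0, 2.0, 0.0],
    [2.0, 0.5, 1.0],
    [0.0, 1.0, 2.0],
]

# Clustering and similar tasks: only the independent variables are present.
clustering_set = [row[:NVARS] for row in regression_set]


def split_row(row, nvars):
    """Split one dataset row into (independent, dependent) parts."""
    return row[:nvars], row[nvars:]


def class_of(row, nvars):
    """Read the class number, rounding a fractional value to an integer.

    Python's round() sends exact ties (x.5) to the even neighbour; how a
    given library treats such ties is not stated in the text above.
    """
    return int(round(row[nvars]))
```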

Note #1
The encoding of classes in classification tasks differs from the one used by many other packages. For example, a number of neural-network libraries customarily encode the class of a sample as an NClasses-dimensional vector, setting the relevant coordinate to 1 and the rest to zero. This difference should be taken into account when ALGLIB is used together with other packages.
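When data has to move between the two conventions, a small conversion step is enough. The following Python sketch uses hypothetical helper names (class_to_one_hot, one_hot_to_class) that belong to neither ALGLIB nor any other library.

```python
def class_to_one_hot(class_index, nclasses):
    """Convert a class number (0..nclasses-1) to an NClasses-dimensional 0/1 vector."""
    if not 0 <= class_index < nclasses:
        raise ValueError("class index out of range")
    return [1.0 if k == class_index else 0.0 for k in range(nclasses)]


def one_hot_to_class(vector):
    """Convert a 0/1 vector back to a class number (index of the largest entry)."""
    return max(range(len(vector)), key=lambda k: vector[k])
```

A round trip through both functions should return the original class number.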

Nominal Variable Encoding

Nominal variables can be encoded in several ways: as integers, or using either "1-of-N" or "1-of-N-1" encoding. Most ALGLIB algorithms accept any encoding and do not need to be told which variables are real and which are nominal, or which encoding is used. The algorithm simply takes a real matrix and operates on it without examining its inner structure. This makes the algorithms flexible and easy to use. However, different encodings affect the speed and quality of an algorithm differently. The following conventions are recommended:

Real variables are kept as they are.

Nominal variables with two possible values are encoded by either "0" or "1" (that is, using the "1-of-N-1" encoding).

Nominal variables with three or more possible values are encoded using "1-of-N" encoding (e.g., "red", "yellow" and "green" can be encoded as "1 0 0", "0 1 0", "0 0 1").

Regardless of the number of possible values, a nominal variable can be encoded as an integer (0, 1, 2, ...). However, such encoding is recommended only when the values of the variable can be ordered, and only for nonlinear models (neural networks, decision trees). For instance, the values "cold", "room temperature", "heat" can be arranged in order of increasing temperature and encoded as "0", "1", "2". By contrast, values such as "sour", "bitter", "sweet", "salty" have no natural ordering.

The ALGLIB package can be used even if the data are encoded without regard to these recommendations. However, some models may take advantage of the recommended encoding to run faster and produce better results.
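The recommended conventions can be sketched as three small Python helpers. The function names and category lists are hypothetical illustrations, not ALGLIB functions; each helper returns the real values that would be written into the dataset matrix.

```python
def encode_binary(value, categories):
    """Two-valued nominal variable -> a single 0/1 column ("1-of-N-1")."""
    if len(categories) != 2:
        raise ValueError("expected exactly two categories")
    return [float(categories.index(value))]


def encode_one_of_n(value, categories):
    """Nominal variable with three or more values -> "1-of-N" columns."""
    return [1.0 if c == value else 0.0 for c in categories]


def encode_ordinal(value, ordered_categories):
    """Ordered nominal variable -> a single integer-valued column."""
    return [float(ordered_categories.index(value))]
```

For example, encode_one_of_n("yellow", ["red", "yellow", "green"]) yields the columns 0 1 0, matching the "red", "yellow", "green" example above.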

Missing Values Encoding

At the time this article was written, none of the algorithms can operate on datasets with missing values. However, this restriction can be worked around by adding one more value, denoting the omission, to the set of values of the variable. For example, if the possible values of the variable are "0", "1", "2", then the non-missing values can be encoded as "1 0 0 0", "0 1 0 0", "0 0 1 0", and the missing value as "0 0 0 1". A real variable can be encoded analogously: a non-missing value x is encoded as "x 0", and a missing value as "0 1".

Another option is to replace the missing value with the average (or most probable) value of the variable.
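Both workarounds can be sketched in a few lines of Python. The helper names are hypothetical, and representing a missing value as None is an assumption made for this illustration only.

```python
def encode_nominal_with_missing(value, nvalues):
    """Encode a nominal value in 0..nvalues-1, or None, as nvalues+1 columns.

    The extra last column flags the missing value.
    """
    columns = [0.0] * (nvalues + 1)
    columns[nvalues if value is None else value] = 1.0
    return columns


def encode_real_with_missing(x):
    """Encode a real value as "x 0" and a missing value as "0 1"."""
    return [0.0, 1.0] if x is None else [x, 0.0]


def impute_mean(column):
    """Replace missing entries with the average of the present ones."""
    present = [v for v in column if v is not None]
    if not present:
        raise ValueError("column has no known values")
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]
```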

Standard Error Codes

Many data analysis subroutines have an Info output parameter. This variable contains the return code of the subroutine. A positive value means normal completion, and a negative value indicates an error. The subroutines of this section solve similar problems, so their error codes form a uniform system:

-1 - incorrect parameters (e.g., negative size of a training set).

-2 - errors in the training set (e.g., class number 3 is found in the training set of a binary classification problem).

-3 - the task is degenerate. The exact meaning of "degenerate" depends on the situation. For example, a clustering task is considered degenerate if five elements have to be assigned to six clusters.

-4 - non-convergence of an iterative algorithm.
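Calling code might translate the codes listed above into messages along the following lines. This is a Python sketch with a hypothetical helper, describe_info, not an ALGLIB function. The text does not say what a value of zero means, so the sketch reports it as unspecified.

```python
# Messages for the standard negative return codes listed above.
INFO_MESSAGES = {
    -1: "incorrect parameters",
    -2: "errors in the training set",
    -3: "task is degenerate",
    -4: "non-convergence of an iterative algorithm",
}


def describe_info(info):
    """Map an Info return code to a short human-readable description."""
    if info > 0:
        return "normal completion"
    if info == 0:
        # Not covered by the description above; treated as unspecified here.
        return "unspecified return code"
    return INFO_MESSAGES.get(info, "unknown error code")
```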

ALGLIB Classifiers and Posterior Probabilities

There are two basic views commonly held in statistics on what the solution of a classification problem should look like. The first is that every object must be assigned to one and only one class. For example, in email classification the classes "spam" and "non-spam" can be distinguished. There can be some uncertainty in the classification (an email can be somewhat similar to spam), but only the final decision, spam or non-spam, is returned.

The second approach is to obtain a vector of posterior probabilities, that is, a vector whose components are the probabilities that the object belongs to each class. The algorithm makes no decision on the classification of an email. It just reports the probability that a particular email is spam and the probability that it is not. The decision based on this information is left to the user.

The second approach is more flexible than the first, and more reasonable, because the classification algorithm cannot know the user's priorities. In some cases, it is necessary to minimize the error made in one of the classes, e.g., the misclassification of an email as spam. Then the email will be classified as spam only if there is a very small probability (e.g., less than 0.05%) that it is not spam. In other cases, all classes are equally important, and the class with the maximum conditional probability can simply be chosen. Therefore, the output of every classification algorithm in the ALGLIB package is a posterior probability vector rather than the class to which an object is assigned.
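The two decision rules mentioned above can be sketched in Python as follows. The function names, the class order [non-spam, spam] and the default threshold are assumptions made for this illustration; they are not part of ALGLIB.

```python
def most_probable_class(posterior):
    """Symmetric case: choose the class with the largest posterior probability."""
    return max(range(len(posterior)), key=lambda k: posterior[k])


def is_spam(posterior, max_non_spam_probability=0.0005):
    """Asymmetric case: flag spam only when non-spam is very unlikely.

    posterior is assumed to be [P(non-spam), P(spam)];
    0.0005 corresponds to the 0.05% example in the text.
    """
    return posterior[0] < max_non_spam_probability
```

With these rules, an email with posterior [0.01, 0.99] is the "spam" class under the symmetric rule, yet is not flagged by is_spam, because a 1% chance of non-spam exceeds the 0.05% limit.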

Model Error

After the model is built, its error on a test (or training) set needs to be estimated. For regression, three error measures can be used: the root-mean-square error, the average error, and the average relative error (the latter calculated over the records with a nonzero value of the dependent variable). These three measures are commonly known and need no further discussion.
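For reference, the three regression measures can be written as a short Python sketch. The function names are hypothetical, and the formulas are the textbook definitions for a single dependent variable, with the average error taken as the mean absolute error; ALGLIB's own routines may differ in detail.

```python
import math


def rms_error(predicted, actual):
    """Root-mean-square error over all records."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def avg_error(predicted, actual):
    """Average absolute error over all records."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def avg_relative_error(predicted, actual):
    """Average of |error| / |actual| over records with a nonzero actual value."""
    terms = [abs(p - a) / abs(a) for p, a in zip(predicted, actual) if a != 0]
    return sum(terms) / len(terms) if terms else 0.0
```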

For classification, five error measures can be used. The first and best known is the classification error (the number or percentage of incorrectly classified cases). The second, equally well known, is cross-entropy. The ALGLIB package uses the average cross-entropy per record, measured in bits (base 2 logarithm). Using the average cross-entropy (instead of the total) makes estimates comparable across different test sets.

The remaining three error measures are again the root-mean-square error, the average error, and the average relative error. However, unlike in regression, here they characterize the error of the posterior probability vector. The error shows how much the probability vector calculated by the classification algorithm differs from the vector obtained from the test set (whose components are equal to 0 or 1, depending on the class the object belongs to). The meaning of the root-mean-square error and the average error is straightforward: it is the error in approximating the conditional probability, averaged over all probabilities. The average relative error is the average error in approximating the probability that an object is correctly classified (it equals the average error for binary tasks).
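The five classification measures can be sketched in Python as follows. The function names are hypothetical, and the formulas are one reading of the descriptions above, not ALGLIB's actual implementation. In particular, the average relative error is taken here as the mean of |p[true class] - 1|, which does coincide with the average error in the binary case.

```python
import math


def classification_error(posteriors, labels):
    """Fraction of records whose most probable class differs from the label."""
    wrong = sum(
        1
        for p, y in zip(posteriors, labels)
        if max(range(len(p)), key=lambda k: p[k]) != y
    )
    return wrong / len(labels)


def avg_cross_entropy_bits(posteriors, labels):
    """Average cross-entropy per record, in bits (base 2 logarithm)."""
    total = 0.0
    for p, y in zip(posteriors, labels):
        if p[y] <= 0.0:
            return math.inf  # zero probability assigned to the true class
        total -= math.log2(p[y])
    return total / len(labels)


def probability_errors(posteriors, labels):
    """Return (rms, avg, avg_rel) errors of the posterior probability vectors."""
    squared = absolute = relative = 0.0
    count = 0
    for p, y in zip(posteriors, labels):
        for k, pk in enumerate(p):
            target = 1.0 if k == y else 0.0  # 0/1 vector from the test set
            squared += (pk - target) ** 2
            absolute += abs(pk - target)
            count += 1
        relative += abs(p[y] - 1.0)  # error in P(true class)
    return math.sqrt(squared / count), absolute / count, relative / len(labels)
```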