Microsoft Neural Network Algorithm Technical Reference

The Microsoft Neural Network uses a Multilayer Perceptron network, also called a Back-Propagated Delta Rule network, composed of up to three layers of neurons, or perceptrons. These layers are an input layer, an optional hidden layer, and an output layer.

A detailed discussion of Multilayer Perceptron neural networks is outside the scope of this documentation. This topic explains the basic implementation of the algorithm, including the method used to normalize input and output values, and feature selection methods used to reduce attribute cardinality. This topic describes the parameters and other settings that can be used to customize the behavior of the algorithm, and provides links to additional information about querying the model.

In a Multilayer Perceptron neural network, each neuron receives one or more inputs and produces one or more identical outputs. Each output is a simple nonlinear function of the sum of the inputs to the neuron. Inputs pass forward from nodes in the input layer to nodes in the hidden layer, and then pass from the hidden layer to the output layer; there are no connections between neurons within a layer. If no hidden layer is included, as in a logistic regression model, inputs pass forward directly from nodes in the input layer to nodes in the output layer.

There are three types of neurons in a neural network that is created with the Microsoft Neural Network algorithm:

Input neurons

Input neurons provide input attribute values for the data mining model. For discrete input attributes, an input neuron typically represents a single state from the input attribute. This includes missing values, if the training data contains nulls for that attribute. A discrete input attribute that has more than two states generates one input neuron for each state, and one input neuron for a missing state, if there are any nulls in the training data. A continuous input attribute generates two input neurons: one neuron for a missing state, and one neuron for the value of the continuous attribute itself. Input neurons provide inputs to one or more hidden neurons.

Hidden neurons

Hidden neurons receive inputs from input neurons and provide outputs to output neurons.

Output neurons

Output neurons represent predictable attribute values for the data mining model. For discrete predictable attributes, an output neuron typically represents a single predicted state for a predictable attribute, including missing values. For example, a binary predictable attribute produces one output node that describes a missing or existing state, to indicate whether a value exists for that attribute. A Boolean column that is used as a predictable attribute generates three output neurons: one neuron for a true value, one neuron for a false value, and one neuron for a missing or existing state. A discrete predictable attribute that has more than two states generates one output neuron for each state, and one output neuron for a missing or existing state. Continuous predictable columns generate two output neurons: one neuron for a missing or existing state, and one neuron for the value of the continuous column itself. If more than 500 output neurons are generated by reviewing the set of predictable columns, Analysis Services generates a new network in the mining model to represent the additional output neurons.
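The counting rules above can be illustrated with a short Python sketch. The function names and the "discrete"/"continuous" labels are hypothetical, chosen for illustration only. The sketch covers the multi-state and continuous cases as described, and does not model how additional networks are created once the 500-neuron limit is exceeded.

```python
def input_neuron_count(kind, n_states=0, has_nulls=False):
    """Input neurons generated by one input attribute (illustrative only)."""
    if kind == "continuous":
        # One neuron for the value itself and one for the missing state.
        return 2
    if kind == "discrete":
        # One neuron per state, plus one for the missing state
        # only when the training data contains nulls.
        return n_states + (1 if has_nulls else 0)
    raise ValueError("kind must be 'discrete' or 'continuous'")


def output_neuron_count(kind, n_states=0):
    """Output neurons generated by one predictable attribute (illustrative only)."""
    if kind == "continuous":
        # One neuron for the value and one for the missing-or-existing state.
        return 2
    if kind == "discrete":
        # One neuron per state, plus one for the missing-or-existing state.
        return n_states + 1
    raise ValueError("kind must be 'discrete' or 'continuous'")
```

For example, a Boolean predictable column has two states, so `output_neuron_count("discrete", 2)` gives the three output neurons described above.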

A neuron receives input from other neurons, or from other data, depending on which layer of the network it is in. An input neuron receives inputs from the original data. Hidden neurons and output neurons receive inputs from the output of other neurons in the neural network. Inputs establish relationships between neurons, and the relationships serve as a path of analysis for a specific set of cases.

Each input has a value assigned to it, called the weight, which describes the relevance or importance of that particular input to the hidden neuron or the output neuron. The greater the weight that is assigned to an input, the more relevant or important the value of that input. Weights can be negative, which implies that the input can inhibit, rather than activate, a specific neuron. The value of each input is multiplied by the weight to emphasize the importance of an input for a specific neuron. For negative weights, the effect of multiplying the value by the weight is to deemphasize the importance.

Each neuron has a simple nonlinear function assigned to it, called the activation function, which describes the relevance or importance of a particular neuron to that layer of a neural network. Hidden neurons use a hyperbolic tangent function (tanh) for their activation function, whereas output neurons use a sigmoid function for activation. Both functions are nonlinear, continuous functions that allow the neural network to model nonlinear relationships between input and output neurons.
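As a minimal sketch of how weighted inputs and activation functions combine, the following Python code runs one forward pass through a network with a single hidden layer. The function names, the weight values, and the absence of bias terms are assumptions made for illustration; they are not taken from the product implementation.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, used here for the output neurons."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, output_weights):
    """One forward pass: inputs -> tanh hidden layer -> sigmoid output layer.

    hidden_weights holds one list of weights per hidden neuron (one weight
    per input); output_weights holds one list per output neuron (one weight
    per hidden neuron).
    """
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)))
        for weights in hidden_weights
    ]
    return [
        sigmoid(sum(w * h for w, h in zip(weights, hidden)))
        for weights in output_weights
    ]


# Two inputs, two hidden neurons, one output neuron. The negative weights
# inhibit, rather than activate, the neuron they feed.
outputs = forward([1.0, -0.5], [[0.8, 0.2], [-0.4, 0.9]], [[1.5, -1.0]])
```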

Several steps are involved in training a data mining model that uses the Microsoft Neural Network algorithm. These steps are heavily influenced by the values that you specify for the algorithm parameters.

The algorithm first evaluates and extracts training data from the data source. A percentage of the training data, called the holdout data, is reserved for use in assessing the accuracy of the network. Throughout the training process, the network is evaluated immediately after each iteration through the training data. When the accuracy no longer increases, the training process is stopped.

The values of the SAMPLE_SIZE and HOLDOUT_PERCENTAGE parameters are used to determine the number of cases to sample from the training data and the number of cases to be put aside for the holdout data. The value of the HOLDOUT_SEED parameter is used to randomly determine the individual cases to be put aside for the holdout data.

Note

These algorithm parameters are different from the HOLDOUT_SIZE and HOLDOUT_SEED properties, which are applied to a mining structure to define a testing data set.
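The sampling and holdout steps described above might be sketched in Python as follows. This is an illustration under stated assumptions, not the product's implementation: the function name `split_cases` is hypothetical, the CRC-32 of the model name stands in for whatever name-based seed the algorithm actually derives when HOLDOUT_SEED is 0, and the order in which sampling and holdout selection occur is assumed.

```python
import random
import zlib


def split_cases(cases, sample_size, holdout_percentage=30,
                holdout_seed=0, model_name="MyModel"):
    """Return (training, holdout) lists drawn from cases (illustrative only)."""
    # Assumption: a seed of 0 means "derive a stable seed from the model name".
    if holdout_seed != 0:
        seed = holdout_seed
    else:
        seed = zlib.crc32(model_name.encode("utf-8"))
    rng = random.Random(seed)

    shuffled = list(cases)
    rng.shuffle(shuffled)

    # Reserve the holdout percentage for assessing accuracy.
    n_holdout = len(shuffled) * holdout_percentage // 100
    holdout = shuffled[:n_holdout]
    remaining = shuffled[n_holdout:]

    # Train on SAMPLE_SIZE cases or on all remaining cases, whichever is smaller.
    training = remaining[:min(sample_size, len(remaining))]
    return training, holdout
```

Because the seed is fixed for a given model name, repeated calls with the same arguments select the same holdout cases.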

The algorithm next determines the number and complexity of the networks that the mining model supports. If the mining model contains one or more attributes that are used only for prediction, the algorithm creates a single network that represents all such attributes. If the mining model contains one or more attributes that are used for both input and prediction, the algorithm provider constructs a network for each attribute.

For input and predictable attributes that have discrete values, each input or output neuron respectively represents a single state. For input and predictable attributes that have continuous values, each input or output neuron respectively represents the range and distribution of values for the attribute. The maximum number of states that is supported in either case depends on the value of the MAXIMUM_STATES algorithm parameter. If the number of states for a specific attribute exceeds the value of the MAXIMUM_STATES algorithm parameter, the most popular or relevant states for that attribute are chosen, up to the maximum number of states allowed, and the remaining states are grouped as missing values for the purposes of analysis.
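The effect of MAXIMUM_STATES on a discrete attribute can be sketched as follows. The sketch assumes that "most popular" means most frequent in the training data and uses `None` to represent the missing state; how the algorithm judges a state to be "relevant" is not described here, so that criterion is not modeled. The function name is hypothetical.

```python
from collections import Counter


def cap_states(values, maximum_states=100):
    """Keep the most frequent states; map all other states to missing (None)."""
    counts = Counter(v for v in values if v is not None)
    if len(counts) <= maximum_states:
        # Nothing to do: the attribute is within the limit.
        return list(values)
    keep = {state for state, _ in counts.most_common(maximum_states)}
    return [v if v in keep else None for v in values]


colors = ["red"] * 5 + ["blue"] * 3 + ["green"] + ["teal"]
capped = cap_states(colors, maximum_states=2)
# "green" and "teal" are now treated as missing.
```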

The algorithm then uses the value of the HIDDEN_NODE_RATIO parameter when determining the initial number of neurons to create for the hidden layer. You can set HIDDEN_NODE_RATIO to 0 to prevent the creation of a hidden layer in the networks that the algorithm generates for the mining model, to treat the neural network as a logistic regression.

The algorithm provider iteratively evaluates the weight for all inputs across the network at the same time, by taking the set of training data that was reserved earlier and comparing the actual known value for each case in the holdout data with the network's prediction, in a process known as batch learning. After the algorithm has evaluated the entire set of training data, the algorithm reviews the predicted and actual value for each neuron. The algorithm calculates the degree of error, if any, and adjusts the weights that are associated with the inputs for that neuron, working backward from output neurons to input neurons in a process known as backpropagation. The algorithm then repeats the process over the entire set of training data. Because the algorithm can support many weights and output neurons, the conjugate gradient algorithm is used to guide the training process for assigning and evaluating weights for inputs. A discussion of the conjugate gradient algorithm is outside the scope of this documentation.
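The following Python sketch illustrates batch learning with backpropagation for a network with one tanh hidden layer and one sigmoid output neuron. It makes several simplifying assumptions that are not part of the product: it uses plain batch gradient descent instead of the conjugate gradient algorithm, a cross-entropy error measure, a constant input of 1.0 in place of bias terms, and no holdout-based stopping rule. All names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(x, w_hidden, w_out):
    """Forward pass; returns the hidden activations and the output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    return hidden, sigmoid(sum(w * h for w, h in zip(w_out, hidden)))


def mean_loss(cases, w_hidden, w_out):
    """Mean cross-entropy error over all cases."""
    total = 0.0
    for x, target in cases:
        _, out = predict(x, w_hidden, w_out)
        total -= target * math.log(out) + (1 - target) * math.log(1 - out)
    return total / len(cases)


def train_batch(cases, n_hidden=3, epochs=200, rate=0.1, seed=1):
    rng = random.Random(seed)
    n_in = len(cases[0][0])
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    for _ in range(epochs):
        grad_hidden = [[0.0] * n_in for _ in range(n_hidden)]
        grad_out = [0.0] * n_hidden
        # Accumulate error gradients over the entire training set first.
        for x, target in cases:
            hidden, out = predict(x, w_hidden, w_out)
            delta_out = out - target  # error at the output neuron
            for j in range(n_hidden):
                grad_out[j] += delta_out * hidden[j]
                # Propagate the error backward through the tanh neuron.
                delta_hidden = delta_out * w_out[j] * (1.0 - hidden[j] ** 2)
                for i in range(n_in):
                    grad_hidden[j][i] += delta_hidden * x[i]
        # Batch learning: adjust all weights once per pass over the data.
        for j in range(n_hidden):
            w_out[j] -= rate * grad_out[j] / len(cases)
            for i in range(n_in):
                w_hidden[j][i] -= rate * grad_hidden[j][i] / len(cases)
    return w_hidden, w_out


# Logical OR; the trailing 1.0 in each case acts as a bias input.
or_cases = [([0.0, 0.0, 1.0], 0), ([0.0, 1.0, 1.0], 1),
            ([1.0, 0.0, 1.0], 1), ([1.0, 1.0, 1.0], 1)]
```

Calling `train_batch(or_cases, epochs=0)` returns the initial random weights, which makes it possible to compare the error before and after training with `mean_loss`.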

If the number of input attributes is greater than the value of the MAXIMUM_INPUT_ATTRIBUTES parameter, or if the number of predictable attributes is greater than the value of the MAXIMUM_OUTPUT_ATTRIBUTES parameter, a feature selection algorithm is used to reduce the complexity of the networks that are included in the mining model. Feature selection reduces the number of input or predictable attributes to those that are most statistically relevant to the model.
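The threshold behavior described above can be sketched as follows. The relevance scores are supplied by the caller and are purely hypothetical; the sketch shows only when feature selection is triggered and that the highest-scoring attributes are kept, not how the scores themselves are computed.

```python
def select_attributes(scores, maximum_attributes=255):
    """Return the attribute names to keep (illustrative only).

    scores maps each attribute name to a relevance score, where a higher
    score means the attribute is more relevant. A limit of 0 disables
    feature selection.
    """
    if maximum_attributes == 0 or len(scores) <= maximum_attributes:
        return list(scores)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:maximum_attributes]


scores = {"Age": 0.9, "Region": 0.1, "Income": 0.5}
kept = select_attributes(scores, maximum_attributes=2)
```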

Feature selection is used automatically by all Analysis Services data mining algorithms to improve analysis and reduce processing load. The method used for feature selection in neural network models depends on the data type of the attribute. For reference, the following table shows the feature selection methods used for neural network models, and also shows the feature selection methods used for the Logistic Regression algorithm, which is based on the Neural Network algorithm.

| Algorithm | Method of analysis | Comments |
| --- | --- | --- |
| Neural Network | Interestingness score; Shannon's Entropy; Bayesian with K2 Prior; Bayesian Dirichlet with uniform prior (default) | The Neural Network algorithm can use both entropy-based and Bayesian scoring methods, as long as the data contains continuous columns. |
| Logistic Regression | Interestingness score; Shannon's Entropy; Bayesian with K2 Prior; Bayesian Dirichlet with uniform prior (default) | Because you cannot pass a parameter to this algorithm to control feature selection behavior, the defaults are used. Therefore, if all attributes are discrete or discretized, the default is BDEU. |

The algorithm parameters that control feature selection for a neural network model are MAXIMUM_INPUT_ATTRIBUTES, MAXIMUM_OUTPUT_ATTRIBUTES, and MAXIMUM_STATES. You can also control the size of the hidden layer, or omit the hidden layer entirely, by setting the HIDDEN_NODE_RATIO parameter.

Scoring is a kind of normalization, which in the context of training a neural network model means the process of converting a value, such as a discrete text label, into a value that can be compared with other types of inputs and weighted in the network. For example, if one input attribute is Gender and the possible values are Male and Female, and another input attribute is Income, with a variable range of values, the values for each attribute are not directly comparable, and therefore must be encoded to a common scale so that the weights can be computed. Scoring is the process of normalizing such inputs to numeric values: specifically, to a probability range. The functions used for normalization also help to distribute input values more evenly on a uniform scale so that extreme values do not distort the results of analysis.

Outputs of the neural network are also encoded. When there is a single target for output (that is, prediction), or multiple targets that are used for prediction only and not for input, the model creates a single network and it might not seem necessary to normalize the values. However, if multiple attributes are used for input and prediction, the model must create multiple networks; therefore, all values must be normalized, and the outputs too must be encoded as they exit the network.

Encoding for inputs is based on summing each discrete value in the training cases, and multiplying that value by its weight. This is called a weighted sum, which is passed to the activation function in the hidden layer. A z-score is used for encoding, as follows:

Discrete values

μ = p – the prior probability of a state

StdDev = sqrt(p(1-p))

Continuous values

Value present = (1 - μ)/σ

No existing value = -μ/σ

After the values have been encoded, the inputs go through weighted summing, with network edges as weights.
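As a worked illustration of the z-score encoding, the following Python sketch encodes the presence or absence of a single state whose prior probability is p. It takes the value-present case to be (1 - μ)/σ, that is, the standard z-score form (x - μ)/σ with x = 1, and the no-value case to be the same form with x = 0. That reading is an interpretive assumption, and the function name is hypothetical.

```python
import math


def encode_state(prior, present):
    """Z-score of a 0/1 state indicator with the given prior probability.

    Assumes 0 < prior < 1 so that the standard deviation is not zero.
    """
    mu = prior                                  # mean of the indicator
    sigma = math.sqrt(prior * (1.0 - prior))    # its standard deviation
    x = 1.0 if present else 0.0
    return (x - mu) / sigma


# A state with prior probability 0.2 has sigma = 0.4, so a case in which
# the state is present encodes to roughly 2.0 and a case in which it is
# absent encodes to roughly -0.5.
present_score = encode_state(0.2, present=True)
absent_score = encode_state(0.2, present=False)
```

Rare states therefore produce large positive values when present, and common states produce large negative values when absent.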

Encoding for outputs uses the sigmoid function, which has properties that make it very useful for prediction. One such property is that, regardless of how the original values are scaled, and regardless of whether values are negative or positive, the output of this function is always a value between 0 and 1, which is suited for estimating probabilities. Another useful property is that the sigmoid function has a smoothing effect, so that as values move farther away from the point of inflection, the probability for the value moves towards 0 or 1, but slowly.
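The properties of the sigmoid function described above can be checked with a short Python sketch. The two-branch form below is a common way to avoid floating-point overflow for inputs of large magnitude; it is an implementation choice made for this example, not a detail taken from the product.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# Inputs on very different scales, both negative and positive, all map
# into the range from 0 to 1. The point of inflection is at x = 0.
samples = [-1000.0, -5.0, -1.0, 0.0, 1.0, 5.0, 1000.0]
encoded = [sigmoid(x) for x in samples]
```

In exact arithmetic the result lies strictly between 0 and 1; in floating point, inputs of very large magnitude round to exactly 0.0 or 1.0.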

The Microsoft Neural Network algorithm supports several parameters that affect the behavior, performance, and accuracy of the resulting mining model. You can also modify the way that the model processes data by setting modeling flags on columns, or by setting distribution flags to specify how values within the column are handled.

The following table describes the parameters that can be used with the Microsoft Neural Network algorithm.

HIDDEN_NODE_RATIO

Specifies the ratio of hidden neurons to input and output neurons. The following formula determines the initial number of neurons in the hidden layer:

HIDDEN_NODE_RATIO * SQRT(Total input neurons * Total output neurons)

The default value is 4.0.
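The formula can be evaluated directly, as in the following Python sketch. The function name is hypothetical, and rounding the result to a whole number of neurons is an assumption, because the rounding rule is not stated here.

```python
import math


def initial_hidden_neurons(total_input_neurons, total_output_neurons,
                           hidden_node_ratio=4.0):
    """Initial size of the hidden layer according to the formula above."""
    raw = hidden_node_ratio * math.sqrt(
        total_input_neurons * total_output_neurons)
    return round(raw)  # assumption: round to the nearest whole neuron


# 16 input neurons and 4 output neurons: 4.0 * sqrt(64) = 32 hidden neurons.
hidden = initial_hidden_neurons(16, 4)

# A ratio of 0 yields no hidden layer, which corresponds to treating the
# network as a logistic regression.
no_hidden = initial_hidden_neurons(16, 4, hidden_node_ratio=0)
```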

HOLDOUT_PERCENTAGE

Specifies the percentage of cases within the training data used to calculate the holdout error, which is used as part of the stopping criteria while training the mining model.

The default value is 30.

HOLDOUT_SEED

Specifies a number that is used to seed the pseudo-random generator when the algorithm randomly determines the holdout data. If this parameter is set to 0, the algorithm generates the seed based on the name of the mining model, to guarantee that the model content remains the same during reprocessing.

The default value is 0.

MAXIMUM_INPUT_ATTRIBUTES

Determines the maximum number of input attributes that can be supplied to the algorithm before feature selection is employed. Setting this value to 0 disables feature selection for input attributes.

The default value is 255.

MAXIMUM_OUTPUT_ATTRIBUTES

Determines the maximum number of output attributes that can be supplied to the algorithm before feature selection is employed. Setting this value to 0 disables feature selection for output attributes.

The default value is 255.

MAXIMUM_STATES

Specifies the maximum number of discrete states per attribute that is supported by the algorithm. If the number of states for a specific attribute is greater than the number that is specified for this parameter, the algorithm uses the most popular states for that attribute and treats the remaining states as missing.

The default value is 100.

SAMPLE_SIZE

Specifies the number of cases to be used to train the model. The algorithm uses either this number or the number of cases that are not included in the holdout data, as determined by the HOLDOUT_PERCENTAGE parameter, whichever value is smaller.

In other words, if HOLDOUT_PERCENTAGE is set to 30, the algorithm will use either the value of this parameter, or a value equal to 70 percent of the total number of cases, whichever is smaller.

The default value is 10000.

The following distribution flags are supported for use with the Microsoft Neural Network algorithm. The flags are used as hints to the model only; if the algorithm detects a different distribution, it uses the detected distribution rather than the one provided in the hint.

Normal

Indicates that values within the column should be treated as though they represent the normal, or Gaussian, distribution.

Uniform

Indicates that values within the column should be treated as though they are distributed uniformly; that is, the probability of any value is roughly equal, and is a function of the total number of values.

Log Normal

Indicates that values within the column should be treated as though distributed according to the log normal curve, which means that the logarithm of the values is distributed normally.