computing, resistance to hardware failure, robustness in handling different types of data, graceful

degradation (which is the property of being able to process noisy or incomplete information),

learning and adaptation (Rumelhart, McClelland and the PDP Research Group 1986; Lippmann

1987; Hinton 1989).

One of the most popular neural net paradigms is the feed-forward neural network (FNN)

and the associated back-propagation (BP) training algorithm. In a feed-forward neural network,

the neurons (i.e., processing units) are usually arranged in layers. A feed-forward neural net

is denoted as I x H x O, where I, H and O represent the number of input units, the number

of hidden units, and the number of output units, respectively. Figure 1 gives a typical fully

connected 2-layer feed-forward network (by convention, the input layer does not count) with a

3 x 4 x 3 structure.

The input units simply pass on the input vector x. The units in the hidden layer and output

layer are processing units. Each processing unit has an activation function which is commonly

chosen to be the sigmoid function.
f(x) = \frac{1}{1 + e^{-\gamma x}}    (2)

where \gamma is a constant controlling the slope of the function. The net input to a processing unit j

is given by

net_j = \sum_i w_{ij} x_i + \theta_j    (3)

where the x_i are the outputs from the previous layer, w_{ij} is the weight (connection strength) of the
link connecting unit i to unit j, and \theta_j is the bias, which determines the location of the sigmoid
function on the x-axis.


Figure 1: A 3 x 4 x 3 feed-forward neural network (input, hidden, and output layers).

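For concreteness, the following Python sketch computes the output of a single processing unit
from equations 2 and 3; the input, weight, and bias values are arbitrary illustrations, not values
from our experiments.

    import math

    def sigmoid(net, gamma=1.0):
        # Equation 2: f(x) = 1 / (1 + exp(-gamma * x))
        return 1.0 / (1.0 + math.exp(-gamma * net))

    def unit_output(x, w, theta, gamma=1.0):
        # Equation 3: net_j = sum_i w_ij * x_i + theta_j
        net = sum(wi * xi for wi, xi in zip(w, x)) + theta
        return sigmoid(net, gamma)

    # A hidden unit with three inputs (illustrative values only).
    print(unit_output(x=[0.5, -1.0, 0.25], w=[0.1, 0.4, -0.3], theta=0.05))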
A feed-forward neural net works by training the network with known examples. A random
sample (x_p, y_p) is drawn from the training set {(x_p, y_p) | p = 1, 2, ..., P}, and x_p is fed into the
network through the input layer. The network computes an output vector o_p based on the hidden
layer output. o_p is compared against the training target y_p. A performance criterion function is
defined based on the difference between o_p and y_p. A commonly used criterion function is the
sum of squared error (SSE) function
F = \sum_p F_p = \frac{1}{2} \sum_p \sum_k (y_{pk} - o_{pk})^2    (4)
where p is the index for the pattern (example) and k the index for output units.
The error computed from the output layer is back-propagated through the network, and weights
(w_{ij}) are modified according to their contribution to the error function:

\Delta w_{ij} = -\eta \frac{\partial F}{\partial w_{ij}}    (5)

where \eta is called the learning rate, which determines the step size of the weight updating. The BP
algorithm repeats this sampling and weight-update cycle over the training examples.
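As an illustration of one such training step, the following Python sketch applies equations 2
through 5 to a single sigmoid output unit, for which the gradient of the SSE criterion has a simple
closed form; the learning rate and weights shown are arbitrary, and a full implementation would
also propagate the error to the hidden-layer weights.

    import math

    def bp_step(x, y, w, theta, eta=0.5, gamma=1.0):
        # Forward pass (equations 2 and 3).
        net = sum(wi * xi for wi, xi in zip(w, x)) + theta
        o = 1.0 / (1.0 + math.exp(-gamma * net))
        # For F_p = (1/2)(y - o)^2, dF_p/dnet = -(y - o) * gamma * o * (1 - o).
        delta = -(y - o) * gamma * o * (1.0 - o)
        # Equation 5: w_ij <- w_ij - eta * dF_p/dw_ij, with dF_p/dw_ij = delta * x_i.
        w = [wi - eta * delta * xi for wi, xi in zip(w, x)]
        theta -= eta * delta
        return w, theta, 0.5 * (y - o) ** 2  # updated weights, bias, and error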

Our test using direct connections from input units to output units had mixed results. For the

airline data and ser562, neural nets with direct connections performed better. For the other

two series, the direct connection method was not as successful, except for the one-hidden-unit

neural net. However, we feel a general conclusion cannot be drawn as we used only fixed training

parameters. Other combinations of training parameters may result in better forecasts.

Note that with a direct connection, the 12 x 1 x 1 neural net produced a MAPS less than

that of the Box-Jenkins model. There are several cases in table 1 where the neural net models

have a larger forecast error than the Box-Jenkins model. We were able to test different net

structures and training settings and find neural net models that outperformed the Box-Jenkins

method for virtually all series. For example, a 1 x 2 x 1 net with direct connection resulted in a

MAPS of 2.55 for ser301, and an 8 x 4 x 1 net with direct connection resulted in a MAPS of 3.24

for ser310 (cf. table 1).


3 Discussion

The experiments in the last section have shown that neural nets can, indeed, provide better

forecasts than the Box-Jenkins method in many cases. However, the performance of neural nets

is affected by many factors, including the network structure, the training parameters and the

nature of the data series. To further understand how and why neural nets may be used as

models for time series forecasting, we examine the representation ability, the model building

process and the applicability of the neural net approach in comparison with the Box-Jenkins

method.

3.1 Representation

As discussed in the review section, Box-Jenkins models are a family of linear models of autore-

gressive and moving average processes. For the airline passenger data, the Box-Jenkins model

that we identified (the same as identified by other researchers, e.g., Newton (1988)) takes the

following form:

(1 - B^{12})(1 - B) x_t = (1 - \theta_1 B)(1 - \theta_{1,12} B^{12}) \varepsilon_t    (6)

Rewriting the model, we have the following:

(1 - B - B^{12} + B^{13}) x_t = (1 - \theta_1 B - \theta_{1,12} B^{12} + \theta_1 \theta_{1,12} B^{13}) \varepsilon_t    (7)

or

x_t = x_{t-12} + (x_{t-1} - x_{t-13}) + (\varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_{1,12} \varepsilon_{t-12} + \theta_1 \theta_{1,12} \varepsilon_{t-13})    (8)

Equation 8 says that the forecast for the time period t is the sum of 1) the value of the time

series in the same month of the previous year; 2) a trend component determined by the difference

between the previous month's value and the previous month's value of the prior year; and 3) the

effects of the random errors (residuals) of periods t, t-1, t-12 and t-13 on the forecast.
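To make the roles of the three components explicit, the sketch below evaluates equation 8 for
one forecast period; the parameters theta1 and theta112 are hypothetical placeholders for the
fitted estimates of \theta_1 and \theta_{1,12}, and the unknown future shock \varepsilon_t is set
to its mean of zero.

    def airline_forecast(x, eps, t, theta1, theta112):
        # Equation 8: x and eps are dicts of past observations and estimated
        # residuals keyed by time period; the future shock eps[t] is taken as 0.
        seasonal = x[t - 12]                  # same month, previous year
        trend = x[t - 1] - x[t - 13]          # year-over-year change in last month
        errors = (-theta1 * eps[t - 1]
                  - theta112 * eps[t - 12]
                  + theta1 * theta112 * eps[t - 13])
        return seasonal + trend + errors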

If a time series is determined by a linear model as described above, the Box-Jenkins method

can do well as long as the pattern does not change.2 If the series is determined by a nonlinear
2For series with pattern shift, more sophisticated procedures are needed to identify the shift and modify the
model (Lee and Chen 1990).

TR91-008 Computer and Information Sciences, University of Florida

process, for instance the logistic series generated by x(t + 1) = a x(t)(1 - x(t)), the Box-Jenkins

method is likely to fail since no higher order terms exist in the models. On the other hand, a

neural net with a single hidden layer can capture the nonlinearity of the logistic series. Lapedes

and Farber (1987) reported that their neural net models produced far better forecasts than

conventional methods for some chaotic time series, including the logistic series.
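The logistic series referred to above is easy to generate; a minimal sketch follows, where the
choices a = 4.0 and x(0) = 0.2, which place the map in its chaotic regime, are illustrative.

    def logistic_series(x0=0.2, a=4.0, n=100):
        # x(t+1) = a * x(t) * (1 - x(t)); chaotic for a = 4.0.
        xs = [x0]
        for _ in range(n - 1):
            xs.append(a * xs[-1] * (1.0 - xs[-1]))
        return xs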

A feedforward neural network can be regarded as a general, non-linear model. In effect, it is

a complex function consisting of a convoluted set of activation functions f \in C, where C is a set

of continuously differentiable functions, and a parameter set W, called the weights. In particular,

the activation functions are sigmoid functions as used in our neural net models. The output of

a feedforward neural net can be written as:

o_k = f\left( \sum_j w_{jk} \, f\left( \sum_i w_{ij} x_i \right) \right)    (9)

where x_i is the ith element of the input vector x.
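Equation 9 can be computed directly as two matrix-vector products passed through the
activation function; the sketch below, with random weights standing in for trained ones, mirrors
the 3 x 4 x 3 network of figure 1 (biases are omitted, as in equation 9).

    import numpy as np

    def fnn_output(x, W_ih, W_ho, gamma=1.0):
        # Equation 9: o_k = f(sum_j w_jk * f(sum_i w_ij * x_i)).
        f = lambda z: 1.0 / (1.0 + np.exp(-gamma * z))
        hidden = f(W_ih @ x)      # hidden-layer outputs
        return f(W_ho @ hidden)   # output-layer outputs

    # A 3 x 4 x 3 net as in figure 1, with random (untrained) weights.
    rng = np.random.default_rng(0)
    o = fnn_output(rng.standard_normal(3),
                   rng.standard_normal((4, 3)),
                   rng.standard_normal((3, 4)))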

It has been proven that, with an unlimited number of processing units in the hidden layer,

a feed-forward neural net with a single hidden layer can serve as a universal approximator to

any continuous function.
The robustness of neural net models is also reflected by the fact that they are essentially

assumption-free models, although assumptions about the training set can be made and statistical

inferences can be carried out (White 1989). This assumption-free property makes them applicable

to a wide range of pattern recognition problems. The Box-Jenkins model, like other statistics-based

models, is subject to the satisfiability of assumptions about the data series (e.g., that the random

errors follow a normal distribution).

3The causal model with the Box-Jenkins method is known as the Box-Jenkins transfer function model. The transfer
function model requires repeated applications of the univariate Box-Jenkins method. The comparative performance
of the neural net causal model and the Box-Jenkins transfer function model is a subject of future study.


3.3 Generalization

Unless the neural net solution oscillates due to large training parameters, the training error (fit-

ting error) always decreases as the training length increases. This is not true for forecasting. The

results in table 1 show that in most cases increasing training length led to improved forecasts. In

some cases in table 1 and most cases in table 2, longer training of the network led to deteriorated

forecasting. This is the result of overfitting. That is, the model fits too closely to the peculiar

features of the training set and loses its generalization ability.
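One simple way to observe (and limit) this effect, though not a procedure used in our
experiments, is to hold out part of the series and retain the weights from the point of lowest
forecast error; a sketch, where train_step and forecast_error are hypothetical routines supplied
by the modeler:

    def best_epoch(train_step, forecast_error, epochs=500):
        # Fitting error keeps falling with longer training, but we keep the
        # weights from the epoch with the lowest held-out forecast error.
        best_err, best_weights = float("inf"), None
        for _ in range(epochs):
            weights = train_step()            # one pass of BP training
            err = forecast_error(weights)     # error on held-out data
            if err < best_err:
                best_err, best_weights = err, weights
        return best_weights, best_err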

The Box-Jenkins method circumvents the overfitting problem through the diagnostic model

validation process. The number of parameters in the Box-Jenkins models is controlled in that

only those parameters that contribute to the data fitting with certain statistical significance are

retained. There are no established procedures for preventing overfitting in neural net models,

although a number of techniques have been proposed in the literature (Weigend, Huberman and

Rumelhart 1990).
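One such technique is weight-elimination, which augments the criterion F with a complexity
penalty so that training itself discourages unneeded weights; below is a sketch of a penalty term
in the spirit of Weigend, Huberman and Rumelhart (1990), with lam and w0 as modeler-chosen
hyperparameters.

    def weight_elimination_penalty(weights, lam=1e-4, w0=1.0):
        # Penalty added to F: small weights cost about lam * (w/w0)^2, large
        # weights saturate toward lam, so superfluous weights are pushed to zero.
        return lam * sum((w / w0) ** 2 / (1.0 + (w / w0) ** 2)
                         for w in weights)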

The significance of weights can also be examined. Those weights that do not significantly

affect the training error can be set to zero. When a weight associated with an input unit is set

to zero, the corresponding input ceases to contribute to the output. Hence, a reduced model is

obtained.
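A sketch of such a pruning pass follows; training_error is a hypothetical routine that evaluates
the criterion F for a given weight vector, and tol is an arbitrary significance threshold.

    def prune_by_error(weights, training_error, tol=1e-3):
        # Zero each weight in turn; keep it at zero if the training error
        # rises by less than tol, i.e., the weight is not significant.
        base = training_error(weights)
        for i, w in enumerate(weights):
            weights[i] = 0.0
            if training_error(weights) - base > tol:
                weights[i] = w   # restore: this weight matters
        return weights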

To test the efficacy of this technique, a 13 x 1 neural net model for the airline data was

used. Table 7 shows the weight changes of the neural net during the training process. The

weights of the neural network correspond to the coefficients of the input variables (since there is

no hidden layer), and the bias in the output unit corresponds to a constant term. Although the

neural network model is not a linear model, because of the sigmoidal activation function of the

output unit, it nevertheless identified the most important inputs (inputs with the largest values

of coefficients), as the Box-Jenkins model did. Note that those inputs are identified only after a