Abstract

In this paper a neural network based methodology for the design of optimal controllers for nonlinear systems is presented. The overall architecture consists of two neural networks. The first neural network is a cost-to-go function approximator (CGA), which is trained to predict the cost-to-go from the present state of the system. The second neural network converges to an optimal controller as it is trained to minimize the output of the first network. The CGA can be trained using available simulation or experimental data; hence an explicit analytical model of the system is not required. The key to the success of the approach is in giving the CGA a special decentralized structure that makes its training relatively straightforward and its prediction quality carefully controlled. This specific structure eliminates much of the uncertainty often involved in using artificial neural networks for this type of application. The validity of the approach is illustrated on a nonlinear aircraft model in an approach configuration.

1. Introduction

Artificial neural networks have been investigated extensively in the optimal control of nonlinear systems. Recently, a control architecture known as Adaptive Critic Design (ACD)1-3 has been proposed for such optimal control problems. ACD is based on the forward dynamic programming approach to optimization. The basic architecture consists of a critic that models the cost-to-go function and a controller. Both these structures are parameterized using neural networks. The two functions are trained simultaneously, and each of them depends on the fidelity of the other to be trained properly. This can make the training of the

overall system particularly challenging. An interesting solution approach has been presented recently, where the critic and the controller are pre-trained using linear models over the region of operation of the system and an algebraic training procedure is employed in the initialization of the neural networks.4 In our previous work, a modified parametric optimization approach was developed to generate optimal controllers in both state feedback form and dynamic output feedback form for linear systems.5 In the present work, we generalize the approach to nonlinear systems. Parametric optimization imposes the form of the controller in advance, and the controller parameters are found to optimize the performance measure. If the controller parameterization is done via neural networks, then, given their universal function approximating capability,6-7 the true optimal controller can be captured in theory. One objective of our research is to remove much of the uncertainty associated with training a neural network architecture that results in an optimal controller. The key to the success of our method is to give the neural network a very special structure that permits tight control over its prediction quality during training. The special structure also makes it possible to train each subsystem of the overall network independently of the other subsystems. The issue of interdependency among various portions of the overall network during training, as encountered in the basic adaptive critic designs, is therefore removed. In addition to being motivated by ACD, our work is also an application of the concepts put forth in adaptive and predictive control.8-12 Section 2 outlines the overall control architecture. Sections 3 and 4 present the details of the training of the cost-to-go function neural network and the controller

Graduate Research Assistant, Dept. of Mechanical and Aerospace Engineering
Associate Professor, Thayer School of Engineering

Copyright © 2002 by the American Institute of Aeronautics and Astronautics, Inc. All rights reserved.

neural network, respectively. Section 5 presents an application of the method to a nonlinear aircraft system in an approach configuration. Concluding remarks are presented in Section 6.

2. Overall Neural Network Architecture

As a starting point, we briefly review the modified parametric optimization approach developed in our previous work.5 For a given system

x(k+1) = f[x(k), u(k)]    (2.2)

the structure of the control function u is assumed, and the parameters G are found to minimize a chosen cost-to-go function

V(k) = (1/2) Σ_{i=1}^{r} [ x(k+i)ᵀ Q x(k+i) + u(k+i−1)ᵀ R u(k+i−1) ]    (2.3)

For the infinite-horizon optimal control problem, the upper limit of the summation in the cost-to-go function should be infinity. Here it is replaced by a finite value r, which can be thought of as the order of approximation of the infinite-horizon cost-to-go function. By the principle of optimality, optimizing the cost-to-go function is equivalent to optimizing the cumulative cost. This is the mechanism by which receding-horizon model predictive control handles the optimal control problem. It is therefore expected that as r tends to infinity, the resulting control converges to the truly optimal control solution. The parameters in G are found by imposing the necessary condition

∂V/∂G = 0    (2.4)

It was shown that solving the problem in this manner leads to a highly nonlinear algebraic optimization problem with many local minima, even for a linear system. From the point of view of neural networks for optimal control, which must be trained iteratively, this solution approach is highly undesirable. The modified parametric optimization approach introduces additional unknowns in order to simplify the solution for the unknown gains. Instead of solving for one G, we now solve for the expanded set of controller gains G1, G2, ..., Gr. In theory, these unknown parameters are found by imposing the conditions

∂V/∂G1 = 0;  ∂V/∂G2 = 0;  ...;  ∂V/∂Gr = 0    (2.6)

and

G2 = G2(f, G1, Q, R)
G3 = G3(f, G2, Q, R)
  ⋮
Gr = Gr(f, Gr−1, Q, R)    (2.7)

The additional conditions in (2.7) are due to the fact that the extra gains are no longer independent, but they are related through the system equations. It was however found that because the state of the system x(k) enters (2.6), by simply varying x(k) we will have a complete set of conditions from which to find the expanded set of gains without having to invoke (2.7).5 Not having to include (2.7) in the solution technique is a major beneficial feature because (2.7) is nonlinear even for a linear system. Any solution that involves (2.7) would have to overcome serious nonlinearities that the conditions in (2.7) inherently possess. Furthermore, these conditions also involve the system model. Therefore using (2.7) would also prevent the development of a data-based approach that does not require an explicit system model. It was established in our previous work that for a linear system, the optimal control gain can be found by solving two sets of linear equations, one to identify the cost-to-go function, and the other to obtain the optimal controller gain. This solution approach is also data-based in that it can be carried out from available input-state or input-output data without needing an explicit model of the system in standard form.
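The two-step data-based solution for the linear case — identify the cost-to-go from data, then solve for the gain — can be sketched numerically. The scalar plant, feedback gain, and weights below are illustrative assumptions, not values from the paper; the sketch shows only the identification step, fitting the quadratic cost-to-go model V(x0) = p·x0² by linear least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 0.9, 0.5          # illustrative scalar plant x(k+1) = a*x(k) + b*u(k)
Q, R, r = 1.0, 0.1, 10   # cost weights and order of approximation (assumed)

def closed_loop_cost(x0, g):
    """Finite-horizon cost-to-go (2.3) under the feedback u(k) = g*x(k)."""
    x, V = x0, 0.0
    for _ in range(r):
        u = g * x
        x = a * x + b * u
        V += 0.5 * (Q * x**2 + R * u**2)
    return V

# Collect (state, cost) pairs from simulated trajectories and fit the
# quadratic model V(x0) = p * x0**2 by linear least squares.
g = -0.8
X = rng.uniform(-1.0, 1.0, 200)
V = np.array([closed_loop_cost(x0, g) for x0 in X])
phi = X**2                              # quadratic feature
p_hat = float(phi @ V / (phi @ phi))    # least-squares estimate of p
```

Because the true cost is exactly quadratic in the state for a linear plant with linear feedback, the regression recovers p essentially exactly; no explicit model of the system in standard form is needed, only simulated or recorded data.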

The modified parametric optimization approach is now used to formulate a general form of the neural network control architecture. The overall architecture consists of two neural networks. One neural network is used as the cost-to-go function approximator (CGA), and the other is used as the controller. As seen from the previous discussion, the general nonlinear optimization problem is greatly simplified by introducing r controller structures that take x(k) as their input and give u(k), ..., u(k+r−1) as the r outputs. Thus the overall controller neural network can be built with r structures internally, so that the network takes x(k) as its input and gives u(k) through u(k+r−1) as its output. These r outputs from the controller network, along with the present state x(k), are then fed to the CGA network, which gives V(k) as its output. For training the overall control architecture, the CGA network is first trained independently of the controller network. The values of the cost-to-go V(k) computed from (2.3) are used as the target values during the CGA training. Given simulation or actual data of the system, the values of the states x(k+1) through x(k+r) and the inputs u(k) through u(k+r−1) can be collected for different starting values of the index k. From these values the true value of V(k) can be computed using equation (2.3). The states x(k+1) through x(k+r) are functions of the state x(k) and the inputs u(k) through u(k+r−1). The cost-to-go function V(k) can therefore be modeled as a function of the state x(k) and the inputs u(k) through u(k+r−1). The CGA is thus formulated to take these as its inputs and provide the cost-to-go estimate V(k) as its output. Figure 1 illustrates the CGA training.
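The data-collection step just described can be sketched directly: roll the system forward r steps from random starting points and record each (x(k), u(k), ..., u(k+r−1)) → V(k) pair. The scalar dynamics f below are a purely illustrative stand-in for the simulation or experimental data:

```python
import numpy as np

rng = np.random.default_rng(1)
Q, R, r = 1.0, 0.1, 5    # cost weights and order of approximation (assumed)

def f(x, u):
    # illustrative nonlinear dynamics standing in for the real system
    return 0.9 * np.sin(x) + 0.5 * u

def make_sample():
    """One CGA training pair: inputs (x(k), u(k)..u(k+r-1)) -> target V(k) of (2.3)."""
    x0 = rng.uniform(-1.0, 1.0)
    u_seq = rng.uniform(-1.0, 1.0, r)
    x, V = x0, 0.0
    for u in u_seq:          # simulate r steps and accumulate the cost (2.3)
        x = f(x, u)
        V += 0.5 * (Q * x**2 + R * u**2)
    return np.concatenate(([x0], u_seq)), V

dataset = [make_sample() for _ in range(1000)]
```

Each sample pairs an (r+1)-dimensional input vector with a scalar cost target, exactly the input/output interface the CGA is trained on.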

Figure 1 legend: Verr – error in the cost-to-go function (difference between V and Vnn); x(k) – state of the system at the present time; u(k), u(k+1), ..., u(k+r−1) – inputs at times k, k+1, ..., k+r−1.

After training the CGA network, we can train the controller network. To do so, the gradients of the cost-to-go function with respect to the inputs u(k) through u(k+r−1) can be calculated using back-propagation through the CGA. These are

∂V(k)/∂u(k),  ∂V(k)/∂u(k+1),  ...,  ∂V(k)/∂u(k+r−1).

These gradients can be further back-propagated through the controller network to obtain

∂V(k)/∂G1,  ∂V(k)/∂G2,  ...,  ∂V(k)/∂Gr.

The gains Gi here correspond to the weights of the neural network controller. These gradients can then be used to minimize the cost-to-go function by updating the weights of the neural network controller so that

∂V(k)/∂Gi → 0,  i = 1, ..., r.
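In the paper these gradients come from back-propagation through the trained CGA; the same quantities can be sanity-checked numerically by central finite differences through any stand-in cost-to-go map. The linear dynamics below are an illustrative stand-in, not the paper's model:

```python
import numpy as np

a, b, Q, R = 0.9, 0.5, 1.0, 0.1   # illustrative stand-in plant and weights
r = 3

def V(x0, u):
    """Stand-in CGA: finite-horizon cost (2.3) for the toy dynamics."""
    x, total = x0, 0.0
    for ui in u:
        x = a * x + b * ui
        total += 0.5 * (Q * x**2 + R * ui**2)
    return total

def grad_V_u(x0, u, eps=1e-6):
    """Approximate dV(k)/du(k+i), i = 0..r-1, by central differences."""
    g = np.zeros(len(u))
    for i in range(len(u)):
        up = np.array(u, dtype=float)
        um = np.array(u, dtype=float)
        up[i] += eps
        um[i] -= eps
        g[i] = (V(x0, up) - V(x0, um)) / (2 * eps)
    return g
```

For the toy plant the gradient at u = 0 has the closed form ∂V/∂u(k) = Q·b·(a + a³ + a⁵)·x0, which the finite-difference routine reproduces; in practice this kind of check validates the back-propagated derivatives.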

Figure 2 illustrates the training of the neural network controller from a trained CGA. From the general outline of the control architecture, it can be seen that the training of the CGA and the training of the controller are decoupled from each other. The CGA network can be trained independently of the controller network using the available system data, and it is then used for training the controller. This form of decoupled training avoids the issue of having to train both networks simultaneously, which is a much more difficult problem.
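The decoupled scheme can be sketched end to end. Here a frozen analytic cost-to-go stands in for the trained CGA, and a single scalar gain stands in for the controller network; only the gain is updated, by gradient descent on V over randomly sampled states. All numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, Q, R, r = 0.9, 0.5, 1.0, 0.1, 10   # illustrative plant, weights, order

def V(x0, g):
    """Frozen 'CGA': cost-to-go (2.3) under the candidate feedback u = g*x."""
    x, total = x0, 0.0
    for _ in range(r):
        u = g * x
        x = a * x + b * u
        total += 0.5 * (Q * x**2 + R * u**2)
    return total

g, lr, eps = 0.0, 0.05, 1e-6
for _ in range(500):                      # train only the controller 'weight' g
    x0 = rng.uniform(-1.0, 1.0)
    dV = (V(x0, g + eps) - V(x0, g - eps)) / (2 * eps)   # dV/dg at sampled state
    g -= lr * dV
```

The cost map is never updated inside the loop, mirroring the fixed weights and biases of the trained CGA; the gain settles at a stabilizing negative value that minimizes the finite-horizon cost.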

3. Cost-to-Go Approximator Network

The CGA network as proposed in the previous section has a large input space, since it takes the state x(k) and the input values u(k) through u(k+r−1) as its inputs. In practice, presenting the network with a broad range of physically realizable test inputs still likely leaves the network untrained over a significant portion of the input space. The enormity of the input space is a typical training problem. An incompletely trained CGA network will surely cause the controller network training step to fail. This is a critical issue to overcome: a well-trained CGA network is the most important factor in obtaining a successful nonlinear optimal controller. This problem leads us to give the CGA a specific structure that facilitates its training.

Figure 3 presents the general outline of the proposed structure of the CGA network. We introduce subnets 1 through r that correspond to the single- and multi-step predictors. Each of these subnets is a two-layer (sigmoid-linear) neural network. Subnet i takes the state of the system x(k) and the input values u(k) through u(k+i−1) as its inputs and provides the output x(k+i). The outputs of these subnets, x(k+1) through x(k+r), along with the input values u(k) through u(k+r−1), are fed to a layer of squared neurons to compute the quadratic cost-to-go value V(k). The weighting matrices Q and R are embedded in this layer. Figure 4 presents a simple example of producing a weighted quadratic product using such a layer of neurons. The presence of the quadratic layer helps ensure the positive definiteness of the cost-to-go function V(k), which is important in the next step of training the controller network. As shown in Figure 4, any quadratic function can be programmed using a layer of squared neurons with the appropriate values of the layer weights.

Figure 4: A weighted quadratic product. (Inputs x1 and x2, weighted by q1 and q2, feed a layer of square neurons producing (q1x1)², (q1x1 + q2x2)², and (q2x2)².)

Subnets 1 through r can be trained separately using the available system data. This ability to individually train the r subnets leads to a CGA network that is well trained overall. We note that with the proposed network structure we once again run into the problem of a large input space for the higher-order subnets. For example, subnet r receives all the inputs presented to the CGA network and will therefore pose a significant challenge for training. With the subnet structure, there is a way around this difficulty. In practice, the subnets can be trained up to a certain order for which large input dimensionality is not an issue; these lower-order subnets can then be stacked together to produce the higher-order subnets. For example, two order-5 subnets can be stacked together to produce an order-10 subnet.

Figure 5: Implementation of the CGA of order r = 10, using trained subnets of order 1 through 5.

The training of the CGA network therefore reduces to the training of 5 independent lower-order subnets. Each of these subnets can be well trained and tested for its prediction quality based on the available data. Figure 5 gives an example of an order-10 CGA network built with subnets of orders 1 through 5. This strategy allows for systematic construction of a high-quality CGA network.

4. Neural Network Controller

The neural network controller consists of r internal structures that take the state of the system x(k) and give the r control outputs u(k) through u(k+r−1), as given by equation (2.5). The structure of this network is therefore modeled as r independent two-layer controller subnets. All r controller subnets are built with a first layer of sigmoid neurons and a second layer of linear neurons. For larger values of i in u(k+i), more neurons are included in the hidden layer of the controller subnets to model the complexity of the functional relationship between x(k) and these controls. Figure 6 gives the details of the structure of the controller network. In this presentation we have kept these networks decoupled, as suggested by equation (2.5). However, it is also possible to replace this collection of controller subnets with a single fully connected network if desired.

Figure 6: Neural network controller structure.

In order to train the controller network, a combined network is formed that consists of the controller network and the CGA network. The controller part of the network has the internal structure shown in Figure 6, and the CGA part of the network has the structure shown in Figure 5. The combined network takes the state of the system as its input and gives the cost-to-go estimate V(k) as its output. The input space of this network is therefore comparatively small. The portion of the combined network that corresponds to the CGA network is assigned the weights and biases of the previously trained CGA network. While training the combined network, these weights and biases are held fixed, and only the controller part of the network is updated. To train this combined network, it is provided a randomized set of states x(k) as its inputs. For all these inputs, the training is configured so that the desired output of the combined network is zero. Due to the presence of the quadratic layer in the CGA part of this network, when the desired outputs for all the training inputs are set to zero, the training algorithm tries to bring V(k) as close to zero as possible for all the given inputs. The training thus ends up finding a set of weights and biases that minimizes the cost-to-go function V(k) for all the given states x(k). This is indeed the desired result: the training minimizes V(k) by training the controller part of the combined network to produce an optimal controller. The combined controller-CGA network is thus equivalent to the critic in the standard ACD.1-4 However, the combined network has the controller as a part of the network.

5. Implementation Results

System Description:

The control architecture outlined in the previous sections is implemented for the optimal control of a transport aircraft on the approach slope. Figure 7 depicts the system under consideration. The system consists of an aircraft that is trimmed on a specific approach slope. The goal of the control algorithm is to minimize the perturbations of this aircraft from the given approach slope.

Figure 7: Aircraft on a glide slope.

The aircraft equations of motion in the wind-axes system are given by13

and the control vector consists of two control inputs, the aircraft angle of attack and the throttle input

u = [α  δT]ᵀ    (5.11)

The controller for this system is designed to minimize the cost-to-go function that penalizes perturbations from this nominal approach configuration. The equations of motion are therefore written in terms of perturbed values about the nominal values corresponding to the trimmed approach. Thus we have

x = xnom + Δx
u = unom + Δu    (5.12)

Defined in this manner (5.18-5.19), if we are only concerned about staying on the nominal approach slope, the control algorithm will penalize the perturbation variable perpendicular to the glide slope, while it can neglect the perturbation variable along the glide slope, since that variable does not affect the dynamics of the rest of the state elements. The number of aircraft equations can therefore be reduced by one by neglecting the equation corresponding to the perturbation variable along the nominal glide slope. The equation for the dynamics of the perturbation of the aircraft trajectory perpendicular to the glide slope is given by

Δḣglide = (Vnom + ΔV) sin(Δγ)    (5.20)

The other three equations of motion remain the same as (5.15-5.17). Thus we have a nonlinear system with the state consisting of four elements and the control consisting of two elements:

Δx = [Δhglide  ΔV  Δγ  ΔT]ᵀ
Δu = [Δα  ΔδT]ᵀ    (5.21-5.22)

For implementing the control algorithm, the nonlinear equations are discretized using a time step of 0.5 seconds. Tables 1 and 2 provide the details of the aircraft system and the nominal trim conditions of the aircraft. The aircraft model used for the present work is presented in13.

Table 1: Aircraft Parameters
M (slugs)   Tmax (lb)   Sref (ft²)   CLo   CLα   CDo
4660        42000       1560
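The discretization step can be illustrated on the perpendicular-perturbation equation (5.20) with the stated 0.5 s time step. The nominal approach speed below is an assumed placeholder, not the paper's trim value:

```python
import math

DT = 0.5            # discretization time step (s), as used in the paper
V_NOM = 220.0       # assumed nominal approach speed (ft/s), illustrative only

def step_dh_glide(dh, dV, dgamma):
    """One Euler step of (5.20): d(dh_glide)/dt = (V_nom + dV) * sin(dgamma)."""
    return dh + DT * (V_NOM + dV) * math.sin(dgamma)
```

For small Δγ the update is approximately dh + DT·(V_NOM + dV)·Δγ, recovering the linearized glide-slope kinematics; the full sin term retains the nonlinearity exploited by the nonlinear controller.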

The CGA training corresponds to the training of the individual subnets and then assembling them with the fixed quadratic layer. Higher-order subnets call for a larger input space, and therefore more data points are needed for training. In order to minimize the computational cost, subnets up to order 5 are trained. Thus subnet 1 corresponds to the model of the system, and subnet 5 corresponds to the 5-step-ahead predictor. The data used for training the subnets are created with randomized inputs over a certain range of the variable values around the nominal approach trajectory. The input variables to the subnets are scaled so that they are in the range −1 to 1. The subnets are trained with the Levenberg-Marquardt training algorithm using the MATLAB Neural Network Toolbox. Subnets with higher orders are created by cascading these 5 subnets of order 1 through 5. Having trained the subnets, the overall CGA is assembled along with the quadratic layer as a custom network using the custom network building functions of the MATLAB Neural Network Toolbox. The overall CGA is tested with random input data. Figure 8 shows the testing results for a CGA network with order r of 15. The CGA network is tested with random values of the state x(k) and controls u(k) through u(k+r−1) as the inputs. The figure shows the outputs given by the CGA network and the error between the output of the CGA network and the true value obtained by simulating the system. It can be seen that the error lies close to zero for the entire testing data set, and the CGA network performs well in predicting the cost-to-go estimate V(k). Figure 8 shows the comparison for a few selected data points.
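Cascading trained low-order predictors into a higher-order one is plain function composition; a sketch with a stand-in one-step model playing the role of subnet 1 (the dynamics are illustrative, not the aircraft model):

```python
import numpy as np

def f(x, u):
    # illustrative one-step dynamics standing in for the trained subnet 1
    return 0.9 * x + 0.5 * u

def predict(x, u_seq):
    """Stand-in order-n subnet: x(k), u(k)..u(k+n-1) -> x(k+n)."""
    for u in u_seq:
        x = f(x, u)
    return x

def predict_order_10(x, u10):
    """Order-10 predictor from two cascaded order-5 predictors (cf. Figure 5)."""
    x5 = predict(x, u10[:5])        # x(k) -> x(k+5) using u(k)..u(k+4)
    return predict(x5, u10[5:])     # x(k+5) -> x(k+10) using u(k+5)..u(k+9)
```

Because each stage only needs the intermediate state and its own block of inputs, the cascade exactly reproduces a direct order-10 prediction while each stage stays individually trainable.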

The controller network is built with r internal two-layer networks, as explained in Section 4. Since the order of approximation r is chosen to be 15, the network is built with 15 internal two-layer subnetworks. The structure of the controller network is shown in Figure 6. The choice of the number of hidden neurons for each of the 15 controller subnets is an open but not critical question. With some experimentation, the number of hidden neurons in each of the controller subnets is taken to be

h(i) = 4 + (i − 1)    (5.24)

where h(i) is the number of neurons in the hidden layer associated with the order-i controller subnet. For example, the controller subnet of order 1 has 4 neurons in its hidden layer, while the controller subnet of order 15 has 18 neurons in its hidden layer. Having chosen the structure of the controller network, a combination network is built by attaching the CGA network in front of the controller network. The CGA is assigned the weights and biases corresponding to the trained CGA network. Since the weights and biases of the CGA portion are kept fixed, the problem of controller training is posed equivalently as the problem of training the combination network. To train the combination network, random values of x(k) are given as inputs, and the training is configured to seek the ideal value of V(k), zero, for all these inputs. As explained before, owing to the quadratic layer, the training ends up minimizing the output V(k) of the combination network for all the given randomized inputs x(k) by updating the weights and biases of the controller part of the network. The combination network is trained using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method available in the MATLAB Neural Network Toolbox. Other training algorithms were also used, with similar results. Figure 9 shows the training results for the combination network.
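The sizing rule (5.24) is direct to encode and check against the two examples given in the text:

```python
def hidden_neurons(i):
    """Eq. (5.24): number of hidden neurons in the order-i controller subnet."""
    return 4 + (i - 1)
```

The rule grows the hidden layer linearly with the subnet order, reflecting the increasing complexity of the map from x(k) to the later controls.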

Figure 9: Minimization of V(k) during controller training.

It can be observed that the training starts with a large value of V(k), corresponding to the controller subnetworks being non-optimal. As the training proceeds, V(k) decreases and ultimately reaches a minimum value. Since the optimization procedure is iterative, there is no guarantee of finding the global minimum for the problem. This issue is further investigated by starting the optimization routine each time with a different initial set of weights for the controller subnetworks. The optimization routine produces sets of weights that are dissimilar. However, the control trajectories from the controllers corresponding to these different local minimal solutions are found to be very close to each other for the same state perturbations. After training the controller part of the combined network, the controller subnetwork with order r = 1 is used as the closed-loop controller.

Simulation Results:

The performance of the trained nonlinear controller is compared against a linear quadratic regulator (LQR) designed using the linearized dynamics around the nominal approach slope. In order to truly investigate the nonlinear nature of the controller, the aircraft system is given relatively large initial perturbations, and the trajectories of the two closed-loop control strategies are compared. The value of the perturbation is given by

Δx = [200  −30  0.15  −7000]ᵀ

Figure 10: Aircraft responses after an initial perturbation: open-loop dynamics; closed loop with the nonlinear neural network controller; closed loop ('O') with the LQR.

As explained at the beginning of this section, the state variable consists of the position perturbation perpendicular to the nominal slope (ft), velocity (ft/s), approach angle (rad), and engine thrust (lb). Figure 10 compares the responses of the system under both control designs, along with the open-loop response to this initial perturbation. It can be seen that the closed-loop system becomes unstable with the LQR design; the aircraft cannot get back to the nominal glide slope. The optimized nonlinear controller, however, is able to return the aircraft to the nominal glide slope. The results clearly show that when the aircraft experiences perturbations that take it far from its nominal operating regime, an LQR design based on the linearized dynamics is unable to restore equilibrium, whereas the optimal nonlinear controller, trained over a large input space of state perturbations, can handle these perturbations. Observe that the open-loop dynamics shows the typical oscillatory response.

Effect of the Order of Approximation r

The results shown in the previous discussion are obtained for a cost-to-go approximation V(k) of order r = 15. In the linear case, it was noted that as the order of approximation increases, the results approach the optimal LQR design.5 In the same vein, it is interesting to see the effect of r on the nonlinear design, and in particular to find its practical lower and upper limits: how small a value of r can still provide a workable controller in this example, and at what value of r the returns begin to diminish relative to the additional computational workload. Figure 12 shows the system response to the same initial perturbation using the trained optimal nonlinear controllers obtained with different values of r. The differences between the responses of the aircraft system can be seen distinctly. For r = 5, the aircraft is unable to get back to the nominal glide slope. This can be seen from the graphs of Δh versus time and ΔV versus time.

Figure 11: Nonlinear control trajectories with the neural network controller.

Figure 11 shows the control trajectories for the nonlinear controller. The control response from the LQR is not shown, since it is unstable. The issue of the LQR design not being able to handle the perturbation is further investigated. The instability of the LQR design is due to the fact that the value of the control weighting matrix R is low, which leads to large controller outputs. These take the system away from its linear regime, and the closed-loop system with the LQR control design correspondingly becomes unstable.

Figure 12: Aircraft responses with controllers of order r = 5, r = 15, and r = 25.

Figure 13 shows the control time histories from these controllers. The controller produces higher overshoot and less damping for the lower value of r of 5. At the same time, note that there are no appreciable differences in the state and control trajectories for r equal to 15 and 25. We can therefore save computational effort by choosing r of 15. On further investigation, it is found that an acceptable value of r is correlated with the slowest frequency, or longest time constant, of the linearized dynamics of the system. For the aircraft system the longest time constant is 5.14

seconds, corresponding to the phugoid mode. For the discrete time step of 0.5 seconds, r = 5 corresponds to a time window of 2.5 seconds, while r = 15 corresponds to 7.5 seconds. For this reason r = 15 is sufficient for controlling the system. Similarly, r = 25, corresponding to a time window of 12.5 seconds, captures more information than is absolutely necessary. Additional insight is obtained by looking at the cost function for each of these trajectories. The value of the cumulative cost J is computed for the entire trajectory over the simulation time frame. The values of J for different values of r are given in Table 5. It is seen that the gain in control performance starts diminishing after r = 15.

Table 5: Cumulative cost as a function of r
r      J
5      410.8852
10     168.2999
15     93.6265
20     92.4814
25     88.5529

6. Conclusions

In this paper we have presented a general design method for the optimal control of nonlinear systems. Either simulation or actual data can be used to design the controller, without requiring an explicit model of the system in standard form. The controller is shown to successfully stabilize the aircraft on the approach slope even with large initial state perturbations that could not be rejected by the optimal design based on the aircraft dynamics linearized about the nominal approach trajectory. The training of the cost-to-go function is instrumental in the proposed controller design process. A specific neural network architecture has been developed to approximate the cost-to-go function. The architecture allows systematic component-by-component construction of the cost-to-go network. Each building block of the cost-to-go network can be trained individually, without regard to the others. This decentralized strategy eliminates much of the uncertainty associated with using artificial neural networks in this type of application. Once the cost-to-go network construction is completed, the controller can be trained to be optimal without having to retrain the cost-to-go network. This decoupling of the training of the two networks is an attractive feature of our approach. In our approach the cost-to-go network is a function not only of the present system state, but also of the present and future control actions. When coupled with an optimally trained controller, this controller and cost-to-go combination network outputs the optimal cost-to-go value. Only then does the optimal cost-to-go become a function of the present system state alone, as expected from optimal control theory. Thus, from the perspective of adaptive critic design, our entire network is a single critic. The control strategy thus designed, implementation details aside, is similar to that of nonlinear predictive control. The novel feature here lies in the use of multiple predictor subnets in building the cost-to-go network. Nonlinear predictive control architectures typically use a one-step system model for designing the controller, instead of the blocks of single- and multiple-step-ahead predictors used here. Using multiple-step-ahead predictors helps parallelize the CGA network architecture, which reduces the error in the back-propagation derivatives needed to train the controller subnetworks. Another useful feature is that once the individual subnets are trained, any form of the cost-to-go function can be implemented, since this corresponds only to replacing the fixed quadratic layer of neurons. We can therefore experiment with different functional forms of the cost function, as well as different parameters in these functional forms, without having to retrain the cost-to-go network each time.

Acknowledgements