This is an analysis of how to perform multistep predictions on the solar dataset (doc nndatasets). The main improvement compared to previous iterations of this code is the added filtering of the signal. However, the results indicate that the predictability can still be improved. Unfortunately, it is not even possible to generate a decent prediction using a generated repetitive dataset, which here is based on the solar dataset.

METHOD

I use a main program so that I can test the 9 different training algorithms. Here is the code for main:

Then I use a function to evaluate the different training algorithms. One thing that should be noticed is that I used mapminmax to normalize the signal. I read somewhere that trainbr requires this. Also, it can't do any harm if all methods use this normalized signal, at least as I see it.
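The normalization step might look like this minimal sketch (variable names besides solar_dataset and mapminmax are illustrative; the settings struct is kept so predictions can be mapped back to the original scale):

```matlab
% Normalize the solar series to [-1, 1] with mapminmax before training,
% keeping the settings so the scaling can be reversed on predictions.
T = solar_dataset;                 % cell array of 1x1 targets
t = cell2mat(T);                   % 1 x N double
[zt, zsettings] = mapminmax(t);    % zt is scaled to [-1, 1]
ZT = num2cell(zt);                 % back to cell form for narnet/preparets
% ... train on ZT ...
% yp = mapminmax('reverse', zp, zsettings);  % undo scaling on a prediction zp
```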

% STJ10 = [ ' ALWAYS GOOD TO GET A FEEL FOR THE DATA, ANY OTHER PURPOSES? PERHAPS PARAMETERS TO STUDY WHEN THE SIZE OF i IS INCREASED/DECREASED? ' ]

minthresh95 = min(thresh95); % 0.024638

medthresh95 = median(thresh95); % 0.027242

meanthresh95 = mean(thresh95); %0.0273

stdthresh95 = std(thresh95); %0.0011573

maxthresh95 = max(thresh95); %0.030326

% [minthresh95 medthresh95 meanthresh95 stdthresh95 maxthresh95]

% GEH11 = [ 'MIGHT WANT TO KNOW WHAT IF max(i) > 100' ]

% STJ11 = [ ' I HAVE WORKED WITH DIFFERENT SIZES OF i, INCREASING i SHOULD GIVE BETTER STATISTICAL REPRESENTATION OF THE ABOVE PARAMETERS (minthresh95 etc.), HOWEVER, AN INCREASED i CAUSES ONLY MINOR CHANGES OF ABOVE PARAMETERS' ]
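The loop that produces the thresh95 values summarized above is not shown; a minimal sketch of one common way to estimate such a significance threshold (correlate random shufflings of the normalized series zt with themselves and take the 95th-percentile spurious correlation; all names besides nncorr and randperm are illustrative):

```matlab
% Estimate a 95% significance threshold for the autocorrelation by
% destroying the temporal structure of zt and measuring how large the
% spurious correlations of the shuffled series get.
N = length(zt);
Ntrials95 = 100;                       % i in the discussion above
thresh95 = zeros(1, Ntrials95);
for i = 1:Ntrials95
    zr = zt(randperm(N));              % random shuffle, no structure left
    c  = nncorr(zr, zr, N-1, 'biased');
    c  = c(N+1:2*N-1);                 % positive lags only
    cs = sort(abs(c));
    thresh95(i) = cs(ceil(0.95*length(cs)));  % 95th-percentile magnitude
end
sigthresh95 = median(thresh95);        % threshold used on the real series
```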

autocorrt = nncorr(zt,zt,N-1,'biased');

siglag95 = -1+ find(abs(autocorrt(N:2*N-1))>=sigthresh95);

% These three lines below decrease the lengths of siglag95 and FD

l_sig=length(siglag95);

r_l_sig=ceil(l_sig/15); % 15 used when three times the length is not used

siglag95=siglag95(1:r_l_sig);

FD=siglag95(2:end);

FD_length=length(FD);

d=max(FD);

% siglag95=0:1:156;

% siglag95 = -1+ find(abs(autocorrt(N:2*N-1))>=0.6);

% ### Memory is maxing out if I use all lag values from the above line

% probably this is one of the most important improvement areas ###

% GEH12 = [ '1. ONLY NEED A SUFFICIENT SUBSET ' ...

% ' 2. I THINK YOUR INITIAL CALCULATION IS FLAWED ' ]

% STJ12 = [ ' THE LENGTH OF siglag95 IS DECREASED, HOWEVER, THE MOST SIGNIFICANT WAVELENGTHS ARE STILL TAKEN INTO ACCOUNT, FOR MORE INFO, SEE: ...

% ' but NOT random datadivision for timeseries .. get nonuniform time spacing' ]

%

% STJ17 = [ ' NARNET IS NOW DEFINED IN OUTER LOOP. WITH THIS COMMENT "I did not try randomizing these states here yet, possible future improvement..." I MEANT THAT I WOULD TRY DIFFERENT LEVELS OF TRN/VAL/TST. ']

% ['DOES NOT net.divideFcn = ''divideblock'' GUARANTEE THAT THE TIME ...
%  SPACING IS UNIFORM?' ...
%  'DEFINING NARNET IN THE OUTER LOOP (AS IT LOOKS NOW) DOES ...
%  NOT GIVE RANDOM WEIGHTS, THUS A REPETITION OF Ntrials ...
%  WOULD NOT GIVE ANY ADDITIONAL INFORMATION']

% GEH18 = ['MORE LATER']

% STJ18 = [ ' OK ']

best_Y=[];

NMSEin=Inf;

i=0;

for n=min_nodes:max_nodes

Nw = (NFD*O+1)*n+(n+1)*O;

% Nw when n = 1 : 154

% Nw when n = 2 : 307

% (HOWEVER, THE SIZE OF Nw SEEMS TO BE METHOD DEPENDENT)

i=i+1; % Counter for indices inside the second loop (do not tie up anything to n).

I have used the above code and investigated the following trn/val ratios (first number is trn, second is val): 50-50, 61-39, 74-26, 89-11, 93.5-6.5. The training method number refers to the following training methods (you will also find this in the code for "main"):

% 1: g_train_method='trainlm';

% 2: g_train_method='trainbfg';

% 3: g_train_method='trainrp';

% 4: g_train_method='trainscg';

% 5: g_train_method='traincgb';

% 6: g_train_method='traincgf';

% 7: g_train_method='traincgp';

% 8: g_train_method='trainoss';

% 9: g_train_method='traingdx';
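A minimal sketch of how "main" might loop over these nine training functions (FD, node count n and the ratio values are assumed from the surrounding code; everything else is standard narnet workflow):

```matlab
% Loop over the nine candidate training functions with one trn/val split.
methods = {'trainlm','trainbfg','trainrp','trainscg','traincgb', ...
           'traincgf','traincgp','trainoss','traingdx'};
for m = 1:numel(methods)
    g_train_method = methods{m};
    net = narnet(FD, n);                 % FD and node count n from above
    net.trainFcn  = g_train_method;
    net.divideFcn = 'divideblock';       % uniform time spacing
    net.divideParam.trainRatio = 0.74;   % e.g. the 74-26 split
    net.divideParam.valRatio   = 0.26;
    net.divideParam.testRatio  = 0;
    % ... preparets / train / evaluate NMSEo as in the function above ...
end
```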

Here is the "winning" NMSEo from the while loop. This NMSEo is selected on the basis that it is smaller than (or equal to) the 10 NMSEo values (Ntrials) generated in the above for loop.

Then I close the loop and obtain NMSEc, the error from closing the loop with no training of the closed loop performed.

Here I have limited the y axis to 0 - 2. As you can see, no values are below 1.

Then I close the loop and also perform training of the closed loop. With this I obtain the following NMSEc2.
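Both closed-loop evaluations might look like this minimal sketch (ZT is the normalized target series in cell form; variable names are illustrative, the functions are the standard narnet workflow):

```matlab
% Close the loop, evaluate without retraining (NMSEc), then retrain the
% closed net and evaluate again (NMSEc2).
netc = closeloop(net);
[Xc, Xci, Aci, Tc] = preparets(netc, {}, {}, ZT);

% NMSEc: closed loop, no additional training
Yc    = netc(Xc, Xci, Aci);
ec    = cell2mat(gsubtract(Tc, Yc));
NMSEc = mean(ec.^2) / var(cell2mat(Tc), 1);   % normalize by target variance

% NMSEc2: closed loop after retraining the closed net
netc2  = train(netc, Xc, Tc, Xci, Aci);
Yc2    = netc2(Xc, Xci, Aci);
ec2    = cell2mat(gsubtract(Tc, Yc2));
NMSEc2 = mean(ec2.^2) / var(cell2mat(Tc), 1);
```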

Here I have limited the y axis to 0 - 2. NMSEc2 is decreased compared to NMSEc; however, regardless of method I am still far away from any desired value.

Please let me know if you would like me to submit tables of the data instead of figures (however, I hope the message that I have not obtained a desired low value of NMSEc or NMSEc2 is still clear). In this results section I could show numerous figures that portray the high values of NMSEc or NMSEc2, but I see little meaning in uploading figures with the following appearance:

Start and end of sampling: 477 and 2400 are selected on the basis that the signal should be as continuous as possible. I used filtering and analysis of the diff to determine these values. The content of FD is similar for "part 1" and this "part 2". To not increase computational time too much I decreased Ntrials from 10 to 5. Here is a figure showing the "3*repeated solar dataset"; I've added some black rectangles to make it easier to see the recurring features.
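Building the repeated dataset might look like this minimal sketch (indices 477:2400 as stated above; variable names are illustrative):

```matlab
% Construct the "3*repeated solar dataset" from the most continuous
% segment of the original series.
T  = solar_dataset;
t  = cell2mat(T);
t1 = t(477:2400);          % segment chosen for continuity
t3 = [t1 t1 t1];           % repeat three times
T3 = num2cell(t3);         % cell form for narnet/preparets
```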

For this analysis I have used a trn/val ratio of 66.67 - 33.33; I thought the model could generate a small NMSEo if the val set were identical to two recurring parts of the trn dataset.

RESULTS - PART 2

Here is the NMSEo for the 9 different methods using the "3*repeated solar dataset":

Here are the NMSEc (no training) and NMSEc2 (training is performed) for the 9 different methods using the "3*repeated solar dataset":

Method      NMSEc     NMSEc2
trainlm     2.2568    2.2568
trainbfg    243.24    13.607
trainrp     6.013     1.0029
trainscg    28.174    27.614
traincgb    10.056    10.056
traincgf    9.0195    8.9929
traincgp    10.83     10.83
trainoss    10.871    1.0002
traingdx    5.2844    1.0013

DISCUSSION

I can think of a few things to further develop the approach, but at this stage I would be very thankful for a second opinion. I would be very happy if I've made one or a few simple mistakes whose correction could give me a better NMSEc; I really thought that using a repeated signal as I did in part 2 would give me an improved prediction with a small residual error. If there are no simple fixes here, my next step is to investigate using all of the significant FD obtained from nncorr.


Newsgroup discussions are OK but can be difficult to follow because it's not possible to include images and formatted code. However, would you like me to submit this content in the newsgroup thread instead?

I've made some further improvements, and the current working hypothesis is that it is not sufficient to include just enough data points in FD to account for the most significant wavelength(s); instead, all of the significant FD from nncorr need to be included. Why? Not sure yet. I hope to submit something during the weekend.


I am impressed! I have had too little time to work with the code the past few days; now I had a few minutes to spare, which gave me the opportunity to load your code into a script and try it out. I can't see how you are able to obtain such a low NMSEo using only 35 delay time steps (I needed a few hundred to obtain reasonable NMSEo values), but I will investigate this as soon as I have the possibility to do so. Adding the steps you mention above to the code should not increase NMSEo or require more delay time steps as I see it (this does not diminish the importance of e.g. sticking to divideblock for these types of challenges). Once I am on top of the code I will try filtering and a few other techniques to deal with the outliers.

I worked many hours on the following; the steps below describe how I generated a prediction model that could produce a decent prediction. I've chosen to describe the most important features I've used in a list, starting with the most significant item. It should also be mentioned that my findings are built on working with the solar dataset; for other datasets, different methods might be more applicable. Each number in the list below describes a feature in the code; the text after the line of minus signs (--------------) describes which parameters I've used for that specific feature.

It is important to select a feedback delay (FD) that can generate convergence in closed-loop training. Convergence here means that the amplitude and the frequency of the prediction vs. the historical data should at least be similar. I've discovered that it is NOT sufficient to use an FD that covers the most significant wavelengths (https://se.mathworks.com/matlabcentral/answers/284253-it-is-not-possible-to-predict-wavelengths-longer-than-the-maximum-index-value-of-the-fd-in-narxnet-a ). I've done quite a few investigations using an FD with a smaller number of elements but have not been successful with a converging prediction in such cases. Basically, it does not matter if a large number of neurons is used; if the FD is too short, non-convergence will occur. Perhaps ("perhaps" means that I'm not done evaluating this) a good indicator that enough elements have been included in the FD is that the MSE continues to decrease in the closed-loop training. ----------------------------------------------------------------------------------------------------- In generating the results below, an FD with a length of 76 % of siglags has been used. This means that FD will have a length of a bit more than 1000 elements.

The number of nodes is a difficult parameter, perhaps the most difficult parameter to define and understand. I have found that the statistical spread of predictability across re-runs using the same number of nodes is larger than the spread of the mean predictability (15 predictions per node count were used) across different node counts. I think I see that a very small number of nodes does not allow for the complexity required to generate a prediction signal that consists of several wavelengths, while a too large number of nodes might allow a prediction signal that is "too" complex. In my early work on this I was expecting a nice second-degree polynomial showing a minimum error at the number of nodes that would generate the smallest error; however, this is one example of what I got (X is the number of nodes, Y is an amplitude error): -----------------------------------------So, there is no simple answer to the optimal number of nodes, but with a significant number of repetitions I have concluded that 6 nodes (again, with an FD of 76 % of the siglags) at least does not give the worst result.

I would like to think that 'divideint' should work better in most cases compared to 'divideblock'. -----------------I've used divideint with 50 % training data + 50 % validation data. With this, all one has to be certain of is that the sampling frequency is high enough to cover the wavelengths that are of interest to predict. If I went with 'divideblock', I don't think I could say for sure that I had enclosed the interesting repeating pattern, regardless of the sizes of trn and val in 'divideblock'.

For evaluating the predictability of different methods I don't use tst data at all (most common is 70 % trn, 15 % val and 15 % tst). I find it more pedagogic to discard the part of the original data I want to perform prediction on at an early stage and then use this data to compare with the generated prediction at a later stage. ---------------------------------------------------------------------------------------------------------------------------------------------- In the calculations I have performed I use the last 10 %, meaning that if the total dataset is 3000 data points, I use 2700 data points as trn and val data, use the generated model to predict 300 points into the future, and compare this with the real 300 data points. Perhaps this is in a sense the same as comparing R2 for different tst data; however, when comparing with unseen data the way I do it, I can more easily understand the difference between the prediction and the real tst data.
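The 90/10 holdout described above might be sketched as follows (variable names are illustrative):

```matlab
% Keep the last 10 % of the series aside as true "future" data; fit on
% the first 90 % and later compare the multistep prediction against it.
t      = cell2mat(solar_dataset);
N      = length(t);
Nhold  = round(0.10 * N);            % e.g. 300 of 3000 points
t_fit  = t(1:N-Nhold);               % trn + val material
t_true = t(N-Nhold+1:end);           % unseen, used only for comparison
% ... train on num2cell(t_fit), predict Nhold steps ahead in closed
% loop, then compare the prediction with t_true ...
```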

I've evaluated all available training methods, for both open and closed training. I can't use trainlm or trainbr due to the memory required for the large FD I use (also, I've found a bug associated with this matter that exists in R2016 but not in R2012). Different training methods will still generate the same number of weights, so a training method that is fast and generates a small MSE should be OK, I guess. Well, I did not look at MSE when selecting the training method but at prediction capabilities. -------------------------I've found that trainscg performs well. Also, this method is quite fast. Other methods have a difficult time with the long FD.

In all cases I've trained the closed loop after the open loop. This does not give a significant improvement to the prediction, but there is an improvement.

Quantification of the accuracy of predictions ------------------------------------------------------------------------------------------I've not only used the distance between the predicted signal and the real tst data as a quantifier; I've also looked at how often the sign of the derivative matches between the predicted signal and the real tst data. The reason for this is that in some cases it might be OK to have over- or undershoot if the frequency content of the two signals is similar.
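The derivative-sign agreement metric could be computed like this (y_pred and t_true are assumed to be equal-length row vectors on the same time grid):

```matlab
% Fraction of time steps where the predicted and true signals move in
% the same direction (sign of the first difference agrees).
dsign_match = sign(diff(y_pred)) == sign(diff(t_true));
frac_match  = mean(dsign_match);   % 1.0 means the slopes always agree in sign
```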

If you have questions on the above, please include the item number from the list in your question.

So, to the results. As said, there is a large statistical spread in the predictions, even if the number of nodes is constant. Consider the following scenario: let's say I produce 10 predictions (like those seen in the figures below). I might be able to say that some predictions are less representative, even without knowledge of the real tst data. If, e.g., one prediction fluctuates between -2 and +2 while the trn and val data all lie between -1 and +1, then it might be reasonable to discard that prediction. The same goes for frequency: I might be able to discard predictions whose frequency content does not match the original (referring to trn and val data) signal. I've made 15 predictions, using 6 nodes and 76 % of siglags.
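The amplitude-based screening idea could be sketched as follows (the 2x margin is an illustrative choice, not a tuned value; predictions is assumed to be a cell array of predicted row vectors and t_fit the trn/val data):

```matlab
% Discard any prediction whose range clearly exceeds the range of the
% trn/val data ("flying through the roof").
margin = 2;                                  % illustrative tolerance factor
lo = margin * min(t_fit);
hi = margin * max(t_fit);
keep = cellfun(@(y) min(y) >= lo && max(y) <= hi, predictions);
predictions = predictions(keep);             % screened set of predictions
```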

The figure below illustrates a prediction whose error (calculated from predicted vs. real tst data) is somewhat smaller than the average error of these 15 predictions (based on the supporting claims in the section above that I might be able to discard some predictions that are "flying through the roof"):

Figure below illustrates the best result of the 15 predictions:

Figure below illustrates the worst result of the 15 predictions:

Please notice the large difference between the best and the worst prediction. The numerical error for the worst prediction is roughly three times larger compared to the best prediction.

In future development I am planning to look further into data division; it would be quite interesting to design other types of data division setups that are not part of the standard (= existing) selection. If the frequency content of a dataset is towards the lower range of what would be desired, don't you think it could be possible to oversample the signal by a factor of 2 using interp (linear) and then use the added data, every second point, as validation data (= still using 50 % trn and 50 % val)? By doing this, little or no information should be lost; this could perhaps be regarded as "getting the validation data for free".
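The oversample-then-interleave idea could be sketched like this (linear interpolation via interp1 and manual index assignment via divideind; variable names are illustrative):

```matlab
% Double the sampling rate with linear interpolation, then assign the
% original samples to training and the interpolated ones to validation.
t  = cell2mat(solar_dataset);
N  = length(t);
t2 = interp1(1:N, t, 1:0.5:N, 'linear');   % length 2N-1; odd indices = original
net.divideFcn = 'divideind';
net.divideParam.trainInd = 1:2:length(t2); % original samples -> trn
net.divideParam.valInd   = 2:2:length(t2); % interpolated samples -> val
net.divideParam.testInd  = [];
```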


Dividetrain --> I've put some hours into this; this method does not increase the accuracy of the prediction. I don't think it is impossible to see improvement using dividetrain; however, I see three detriments: (1) increased likelihood of overfitting, (2) increased computational time and (3) only marginal room for improvement compared to divideint.

Divideint 34/33/33 --> At this early stage I still like to produce a prediction that I can visualise. I think that parameters such as NMSE or R2 for tst values are applicable, but such parameters could be more usable in an optimization (at a later stage, so to say). If I can visualise a prediction in a graph, I can analyse both the accuracy of the wavelengths and the amplitudes in one go.

DIVIDE into stationary subsections --> Are you here thinking about e.g. using divideblock 44/26/30 and making sure that the data is (more) stationary in each of these divisions? Regarding stationary data, do you have a suggestion for what I should read on this subject? I understand that it is not easy to perform predictions on data that contains larger trends, and I've brushed up on the Dickey–Fuller test (from wiki, mostly), but I think there is a better reference, more applicable to neural networks, than the ones I've come across.

Difference the series to reduce non-stationarity --> Are you here suggesting that I should perform the NN calculation on the diff of the dataset? ...simply:


% Dividetrain --> I’ve put some hours on this; This method does not increase the accuracy of the prediction. I don’t think it is impossible to see improvement using dividetrain, however, I see three detriments: (1) increased likelihood of over fitting, (2) increased computational time and (3) only marginal room for improvement compared to divideint.

(1) I have been operating under the edict: For unbiased timeseries prediction

max(indtrn,indval) < min(indtst)

So far, no success. Frustrated, I have cheated and looked at DIVIDETRAIN mainly to find out why everything is failing. Surprisingly, even this has failed.

Now I'm pretty sure that the basic problems are

a. Nonstationarity

b. Insufficient use of significant delays.

c. I'm on vacation and my wife wants to denerd me for a week.

(2) Overfitting is never a problem for me because I always

a. Compare the number of training equations Ntrneq with the number of unknown weights Nw (search NEWSREADER and ANSWERS using Ntrneq Nw).

b. Try to minimize the number of hidden nodes subject to a maximum allowed error.
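The check in (2)a could be written out as follows (O = number of outputs, NFD = number of feedback delays, n = number of hidden nodes, Ntrn = number of training targets; the Nw formula is the one used earlier in the thread):

```matlab
% Compare training equations against unknown weights to gauge
% overfitting risk (Ntrneq should comfortably exceed Nw).
Ntrneq = Ntrn * O;                    % number of training equations
Nw     = (NFD*O + 1)*n + (n + 1)*O;  % number of unknown weights
if Ntrneq <= Nw
    warning('Overfitting risk: fewer training equations than weights.');
end
```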

(3) So far, training time has been no problem at all.

(4) With default delays, nothing works for me. I think skipping the estimation of significant delays is my biggest mistake.

% Divideint 34/33/33 --> At this early stage I still like produce a prediction that I can visualise. I think that parameters such as NMSE or R2 for tst values are applicable but I think such parameters could be more usable in an optimization (at a later stage, so to say). If I can visualise a prediction in a graph I analyse both the accuracy of wavelengths and amplitudes in one go.

% DIVIDE into stationary subsections --> You are here thinking about e.g. using divideblock 44/26/30 and to be sure that the data is (more) stationary at each of these divisions? Regarding stationary data, do you have a suggestion that you think I should read on this subject? I get the concept of that it’s not that easy to perform predictions on data that contains larger trends and I’ve brushed up on Dickey–Fuller test (from wiki, mostly) but I think there is a better reference more applicable to neural networks than the ones I’ve come across.

My experience with trend removal has been limited to polynomials and the equivalent of replacing variables with differences.

% Difference the series to reduce non-stationarity --> Are you here suggesting that I should perform the NN calculation on the diff of the dataset? ...simply:
%
% T = solar_dataset;
% t = cell2mat(T);
% t_diff = diff(t);
% T_diff = num2cell(t_diff);
% ... and so forth using t_diff and T_diff instead of t and T

Yes. This is standard procedure. In fact, even higher-order differences are used (differences remove linear trends, 2nd differences remove quadratic trends, etc.).
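The full differencing workflow, including undoing the difference when reconstructing level predictions, might look like this (variable names are illustrative):

```matlab
% Train on the first difference of the series, then integrate the
% predicted differences back to the original scale with cumsum.
t      = cell2mat(solar_dataset);
t_diff = diff(t);                   % first difference removes a linear trend
T_diff = num2cell(t_diff);
% ... train/predict on T_diff, giving a predicted difference y_diff ...
% y_level = t(end) + cumsum(y_diff);   % undo the differencing on the forecast
```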

I am working on almost the same problem, predicting wind speed using narnet. It has been a long time; I have made lots of changes to the network parameters, but all give almost the same accuracy or worse. So I read more deeply into your answers here and learned that we need to do autocorrelation to get the FD. Thus I followed the method above for my dataset, but I am getting a very large d value. Can you help me find a suitable delay and number of hidden-layer neurons? I am getting an open-loop MSE in the range 0.34xx to 0.38xx for all tries, which I did with different delays, hidden neurons, training functions and algorithms. Finally I came to this autocorrelation method, but here again I am stuck: d is 8445, which is a very huge number. Please advise.