Week 1

May 14, 2012

Today code for a simple ANOVA and a comparison of two strains was tested to see if the results produced by our adviser could be reproduced.
The output without modifying the code for the comparison of two strains using wildtype and dGLN3 for the comparison could be reproduced.
A few difficulties have been faced when trying to modify the simple ANOVA code. It was attempted to match the results produced for dGLN3 using the two strain comparison code with the results of the simple ANOVA code with the input specified as the log fold concentrations for the dGLN3 data. After being initially unsuccessful, the code was revisited and looked at more closely. It was found that two matrices in the matrix division had an unequal number of rows (matrix X and matrix Y). The indices (which specify columns of the input file) were modified after which matrix X had the same number of rows as matrix Y. The following code was used:

May 15, 2012

A separate set of indices were made to designate the rows of the X matrices for the reduced and the full model. This separate set of indices (one for each timpeoint) removed rows of zeros that were previously added when substituting the select 0's for 1's to create the two X matrices. As a result, it was possible to call upon the original indices to extract the log fold changes corresponding to a specific deletion to create the Y matrix. After these corrections, the two strain comparisons were performed between wildtype and each one of the deletion strains.

Further modifications were made to the two strain comparison code to be able to make a comparison of all of the strains simulatenously. To do so, indices were added to designate the columns in the input corresponding to each of the deletion strains. A separate set of indices (one index per timepoint) for each strain as in the case of the two strain comparison. The parameters p were increased to 25 (five timepoints for each deletion strain) and the constraints q were increased to 20 to yield an X and Xh matrices with the appropriate numbers of columns. Changes were made to the out_data(ii,[number]) lines to account for the increase in the number of strains being compared. In the output, the data for each of the timepoints for each of the deletion strains matched the corresponding data in the two strain comparisons.

The next task is figure out how to handle missing data. Since both GCAT and Ontario chips were used, log fold change concentrations are missing for some genes for some timepoints for the wildtype (these cells show up as NaN). One possibility is to modify the X matrices so that they take into account the NaN's for each gene. On this note, a for loop was begun to try to exclude timepoints for a gene for which there is an NaN.

May 16, 2012

NaN's were removed from the Y matrix for each gene by using the command YY = Y(~isnan(Y(:,1)),1). This truncated the Y matrix from 43 rows to 34 rows. However, in doing so, it was not possible to solve for beta (the solution to the linear system XB=Y). Another method to deal with the NaN's was to replace them with a very large number using the command Y(isnan(Y(:,1)),1)=1e6 in order to be able to recognize the timepoints and flasks for which there were NaN's in the output. However, this messed up the B&H corrections.

The next step will build on trucating the Y matrix by getting rid of the rows with the NaN's altogether. To keep this step, the X and Xh matrix pairs must be formulated so that they are unique to each gene. A possibilty is to use a for loop to define each of the indices for each gene by the columns that do not contain an NaN for that gene. The Y matrix calculations would have to be within this loop.

May 17, 2012

A for loop was experimented with to generate different X and Xh matrix pairs for each gene to take into account those genes that have an NaN for a a flask for a timepoint in the wildtype. This loop contained a majority of the same coding that was used to set up the indices in the original two strain comparison code. The different lines of code are

These new lines of code retain the original column numbers of matrix a (the matrix containing the log fold change data) in the indices. The NaN's were then also removed from the corresponding Y vectors.
At first, the loop to compute the Y matrix was put into the loop to compute the X and Xh pairs. However, the code ran for too long at that point.
Then the Y loop was separated from the loop to compute the X and Xh matrices. An output was generated. However, for any of the genes that had at least one NaN, the entire row for that gene was blank.
In dissecting this new code several problems were found. First, the X matrices for genes that had NaN's were not being computed correctly. Since the indices still had the original column numbers of matrix a, the ones were not put into the X matrix in the correct places.
Briefly, cell arrays were looked at as a solution instead of numerical arrays because there can be blank cells in a cell array but numerical arrays do not contain blanks. However, this potential solution proved to complicate the calculations that need to be made.

May 18, 2012

(This entry is being written on May 20th regarding the work that was done on May 18th.)

The first issue that was tackled was the improper formation of the X and Xh matrices. For any of the genes that had at least one NaN, the 1's were put in the wrong positions in the X and Xh matrices. This is because the indices that were used to replace the 0's in the X and Xh matrices with 1 did not start at 1. This was necessary to later exlude the cells that had NaN's in the Y matrix. Rather than getting rid of the method to select the cells that do not have NaN's, the indices that were used to replace the 0's with 1's in the X and Xh matrices were renumbered if indx (the matrix of the indices for each time point) was less than 23 (the maximum number of columns correponding to the wildtype if none of the cells have an NaN for that gene). The following added bit of code was put into the for loop for the X and Xh matrices after the indx, indX, i1, and i2 lines of code:

However, this may have fixed the X and Xh matrices. However, in the final output, the rows corresponding to the genes with NaN's were still completely blank.

At this point, the wrong direction was taken to find a solution to this. Some experimenting was done to produce X and Xh matrices all of which had 43 rows but some of which had something other than a 0 or 1 in the cell to designate the NaN in the original a matrix. Along this line, the NaN's were kept in the Y matrix. The goal was to be able to somehow ignore the NaN's then when calculating beta and betah.

It was considered to keep the NaN's because it was thought that the problems with the output had to do with some X, Xh, and Y matrices having less than 43 rows while others had 43 rows. Then looking back at how linear systems of equations are solved from what I learned in my linear algebra class, I realized that the number of rows could not be the reason why the some rows of genes in the output were completely blank. Even though the number of rows changed, the number of columns corresponding to the different type points never did. All the X matrices had 10 columns and all the Xh matrices had 5 columns. Therefore, there should be 5 betas being calculated (one for each timepoint) for the wildtype and for dGLN3 for all of the genes regardless of whether or not a gene had NaN's.

The next step in troubleshooting the code will be testing it with just a select number of genes that have NaN's for some timpeoints and flasks.

Week 2

May 21, 2012

The two strain comparison code was tested with an output that contained only seven genes (six of which had an NaN). At this point, there were no blank rows. To fix the blank rows that appear for genes with NaN's, the computation for the y loop was put in the same loop as the computations for the X and Xh matrices. This ensured that when the betas and betahswere being caluclated that the X and Xh matrices changed accordingly (depending on the gene for which the computation was being done). Previously, the code was set up so that he betas and betahs were calculated with the last X and Xh matrix that was computed in the loop.

Several changes were made to this code to compare all strains (wildtype, dCIN5, dGLN3, dHMO1, and dZAP1).
First, indices were created for dCIN5, dGLN3, dHMO1, and dZAP1 (the two strain comparison code only had matrices for the widlytpe and dGLN3). The indices for the deletion strains were constructed for each timepoint similar to how they were constructed for dGLN3 making sure to specifiy the appropriate columns in the original data for each strain.

So that the X and Xh matrices were of the correct size and had 1's and 0's positions in the correct columns and rows, the indices for each timepoint for each strain were adjusted in the for loop.

In addition, the the calculations for t, t2, and N were expanded to include all of the indices for all of the strains. The X and Xh matrix for genes with no NaN's had dimensions 103x25 and 103x5, respectively. The X and Xh matrices for the genes with NaN's had dimensions 94x25 and 94x5, respectively.

The coputation for the Y matrix was also altered to include all of the strains:

The lines of code corresponding to the dimensions of the out_data were also altered to include all of the betas and betahs.

In addition, a few plots from the all strain comparison of the time versus expression genes with both low and high p values were saved. To do so, the if statement for the plots was removed. The save statement used to save each plot generated separately was

saveas(gcf,['Fig' num2str(ii) '.jpg'])

It would be beneficial to change this line of code so that the name of the figure corresponding to the gene for which the expression was being plotted.

Also, an avi file was made of the plots of time versus expression for genes with a p value of less than 0.05 and then those with less than 0.04 (this p value was observed to be the cut off for the B&H corrected p values in the final output for the all strian comparison). The plots for the genes with a p value less than 0.04 have to be redone because a banner that popped up in the figure window was also recorded (it blocks the gene name).

Before the for loop, the following line of code was added

aviobj = avifile('allstrain.avi','fps',15,'compression','Indeo3');

Some fiddling may still need to be done with the number of frames per second.

The following lines of code were added at the end of the for loop (after the if statement for to generate the plots):

%To get the handles from the figure open. Saves the entire figure window that is open.
frame = getframe(gcf);
%Adds the currently open plot to the avi file.
aviobj = addframe(aviobj,frame);

Right after the for loop, a command line was added to close the avi file:

aviobj = close(aviobj);

The avi file had to be opened in QuickTime player. It would not open in windows media player when either the compression format was Indeo3 or Indeo5.

May 22, 2012

Some changes were made to the code to write an avi file. However, despite all of the revisions, the avi file was not being made properly. For the first few seconds the plots that showed up for some of the genes looked nothing like they should. To be able to even play the file, the compression had to be set to 'None'. This is what probably caused the file to be 4 GB. They appearance of the strange graphs at the beginning of the movie may be attributed to the large size of the file.

Initially, the p values showing up in the plots in the avi file did not match the p values for the corresponding genes in the output. However, this error was fixed by adjusting the if statement in the loop that creates the plots for genes with a p value of less than 0.05.

The rest of today was spent dissecting the code we currently have for the deterministic model. The goal is to split the code for the general parameter estimation function into multiple parts, namely, reading the files into matlab, the least squares error function, plotting the graphs, and writing the output.

The frame = getframe(gcf) and aviobj = addframe(aviobj,frame) inside the if statement to generate the plots was removed. The aviobj=close(aviobj) line of code immediately after the for loop was replaced with

This time the movie generated with Cinepak compression code be played on Windows Media Player on an XP operating system. The size of this movie was significantly smaller than the size of the previous movie made without compression that could play with Windows Media Player.

There are three different scripts for the statistical analysis in MATLAB:

all_strains_compare_movie.m to do a comparison for all strains

wildtype_deletion_strain_compare.m to do a comparison between the wildtype and another strain

two_deletion_strain_compare.m to do a comparison between two deletion strains

The two strain comparison code was divided into two (the scripts in the last two bullets above) because the for loop designed to remove NaN's is not necessary for the deletion strain data. The principle differences between these two scripts is the calculations to renumber the indices. The following is the part of the script for the wildtype to deletion strain comparisons that contains differences with the other two strain comparison script:

The calculations for the Y matrices, beta's, betah's, fstats, pvals, and plots are within the for loop. These are the same calculations in the deletion strain to deletion strain comparison script. These caluclations are the only ones in the for loop for the deletion strain to deletion strain comparison script. The following is a part of the deletion strain to deletion strain comparison script that contains differences in comparison to the previous script.

The input of the mastersheet in this script is different from that in the previous general parameter estimation function. In the actual input sheet, the log2 concentrations were divided into several sheets (one for each deletion strain) and inputted sheet by sheet into MATLAB. In addition, the log2 concentrations for each flask for each timepoint were inputted rather than the average log2 concentrations for each timepoint. Indicators for the timepoints, strain name, and strain index were added to the optimization parameters sheet. This required modifying the for loop for inputting the optimization parameters.

The next steps with the deterministic model include making some changes to the general least squares error script where the error = ((log2(x1) - d1).^2) line of code needs to be modified to take into account the replicates in the log fold change data for each timepoint for each strain.

May 24, 2012

In order to modify the error calculation in the general least squares error function (in MATLAB) to take into account the change from using average log fold changes for each timpepoint to using all of the data, several modifications were made to other scripts. (To make any of these changes, I first dissected the various functions to understand the various parts more.)

The log fold change data had to be imported separately for each strain (this modification was made earlier in the week):

Nevertheless, averages for each timepoint were later calculated for the purposes of later plotting standard deviations on the time versus log fold change plots.

The driver to detect active genes will be taken out. All of the genes in the network are considered active genes. The assignment of a number to each active gene was done with the detect active genes script. Now the active changes are "detected" in the matrices with the log fold change data:

active = find(log2FC(1).d(2:end,1));

This array is called for in the general network dynamic function. The ode45 function in matlab approximates a solution to the differential equation defined in the general network dynamic function.

Previously, the error was calculated using the model and the average log fold changes for each timepoint from the data. However, this error calculation needs to be changed so that the unaveraged log fold changes can be used. A preliminary equation that was experiment with was

The problem here is that the equation does not take into account the last replicate for the t30 timepoint (because t30 has five replicates while the other timpeoints have four replicates).

Another concern is the actual data d and d1 that is being fed into the general least squares error function and other functions. For now, both d and d1 correspond to just the wildtype data. However, we will want to use the deletion strain data for the model as well.

% % Standard deviations for each timepoint for each gene. Necessary for the graphs
% % that will be produced.
sigmas = zeros(length(log2FC(1).d(2:end,1)),n_times);
for iT = 1:n_times
sigmas(:,iT) = std(log2FC(1).d(2:end,tindex(iT).indx(1):tindex(iT).indx(end)),0,2);
end

It is necessary to still find the standard deviations and the means to be able to produce graphs for each gene of the log fold change versus time with the upper and lower error bounds for each timepoint.

The error calculation in the general least squares error function was modified with the following lines of code:

errormat = 0;
for iT = 1:length(tspan)
for iF = 1:length(tindex(iT).indx)
errormat = errormat+(log2(x1(iT,:))-d(tindex(iT).indx(iF),:)).^2;
end
end
% This is the sum of all the errors
L = sum(errormat(:));

In the general least squares error function, a modification was also made to the first figure produced (with the counter and lse_out). At first, all of the data (not the average log fold changes) was being used to specify the second subplot in the figure which fits the model to the data. When viewing the figure, the model was being fit to only three of the thirteen sets of data points in the second subplot. There were thriteen sets of data points (in columns) since there are thirteen columns in the data due to the replicates in each timepoint. Instead of using all of the data, the code to produce the subplots was modified to instead use the average log fold change for each timepoint:

The script specifying the plots for each gene of log fold change versus time was modified. The if else statement having to do with cssigflg was removed. This designation had to do with the concentration sigmas (standard deviations) inputted into MATLAB when they were still in the input excel spreadsheet. However, since the standard deviations are now being calculated in MATLAB, there is no need to determine the kind of the plot to be produced depending on the value of cssigflg. The following is the new code for the plots:

Further work will need ot be done on the plots. In the plots for MAL33 and ZAP1, the upper error bounds are not shown. The y axis may not to be adjusted to be larger than [-3,3]. Also, the model for HAP5, the model goes beyond the error bounds for the t60 timepoint.

In the output excel spreadsheet, there are no log2 optimized concentrations nor network optimized B's for FHL1, SKO1, or SWI6.

In addition to dealing with the aforementioned concerns, we need to explore how to include the deletion strain data in the model.

Week 3

May 29, 2012

Copies were made of all of the functions and scripts in order to keep the original scripts that worked last week when making modifications to these scripts today. The ending _strains was added to each m file.

A line of code was added to the estimation driver that indicates the sheet with the data of the deletion strain for which the script is being run at a given time. Since the model is to be found for each one of the strains, it makes it easier in the long run just to import the data for the deletion strain that the code will be run for.

This change was made in the process of spending the day changing the general network dynamics function to be able to run the model for each of the deletion strains. It was necessary to delete the information in the weights matrix and the degradation rates array that corresponded to the gene that was deleted in the deletion strain of interest. One of the lines of code added, finds the index of the gene deleted in a given strain by matching it to the sheet number in the input corresponding to the log fold change data for the deletion strain. To be able to incorporate this line, the variables num and indices were added to the global command line.

When running the entire code, the first error come up was a mismatch of dimesnions in the line WXB = -W*zz+B which computes the exponent in part of the differential equation that is a sigmoid function. This error was due to the fact that B (which corresponds to b) had 18 rows while the other two arrays had 21. This occurred because b was being recomputed in the general least squares error function. As a result, the recomputed b was renamed to bb.

However, another error occurred at line 43 in the LSE_strains script. The matrix dimensions do not agree. It's uncertain why these errors are occuring because after comparing each line of code Nick and I seem to be using the same scripts.

More troubleshooting will be done tomorrow to address the newest error.

May 30. 2012

The errors that were encountered yesterday were fixed with some changes to the general network dynamic function and the calls to various functions. The reason why there was a mismatch in dimensions in the line WXB = -W*zz+B in the network dynamic function is that B was the same array as b. Intially, b was a 21x1 vector of 0's. However, in the general least squares error function b changed to a 18x1 vector of 0's because there 18 of the 21 total number of genes in the network regulate the expression of other genes. However, B needs to be a 21x1 vector of 0's to be able to compute WXB. For this to be possible, a line of code was added to the general network dynamics function instead of renaming the recalculated b in the general least squares error function:

B = zeros(n_genes,1);

As a result, n_genes was added to the global line.

The other error that occurred was solved by changing the function calls. There are currently two sets of scripts (one set is the original set that was composed last week and the other is the set for which the sheet number of the strain being worked on is specified in order to import the correct log fold change data). However, within the second aforementioned set of scripts, the LSE script and a few other functions called for functions from the first set of scripts. All function calls were changed to those corresponding to the scripts in the second set of script previously discussed.

However, with these changes, another problem arose. After running the fmincon function in the LSE script, the numbers composing the w1 array did not change as expected. Some of the numbers differed only in the ninth or so decimal place. This resulted in many of the genes having the same network optimized weights. This was not the case when the scripts from last week were executed.

To explore this problem, I started troubleshooting from the set of scripts from last week making changes that lead to the newest set of scripts one by one. After doing so, it seems that the problem that was observed arose when the general dynamic function is changed from what it was originally to the one constructed yesterday. This problem was evident in the first subplot of figure 1 plotting all of the theta values. Most of the 71 theta's stayed in practially the same line centered at 0. Hoever, what is odd is that the new general dynamic function produces the same dz vector as the old network dynamic function. In addition, the same x is found after applying the ode45 function to the new general network dynamic function as when applying the ode45 function to the original general network dynamic function. Despite this, the w1 vector (which is being recalculated by fmincon from the original vector corresponding to w0) is not being recalculated correctly by the fmincon function. This problem will be troubleshooted further tomorrow.

Towards the end of the day, the MATLAB scripts to do the ANOVA and p value corrections for the all strain and two strain comparisons were revisted. The ability to pause at a plot and to save a plot as a .jpg was added to the lines of code to produce the plots of the data and model for each gene having a p value of less than 0.05:

pause on
if pval<0.05
figure(1),plot(t,Y,'bo'),hold on,plot(t,Ym,'r+','LineWidth',3),plot(t,Yh,'k^','LineWidth',3),
title([b{ii+1,1} ' W = ',num2str(W),' pval = ',num2str(pval)]),...
xlabel('time, minutes'),ylabel('log fold change'),legend('data','model'),drawnow
hold off
nsig = nsig + 1;
[ii,nsig]
drawnow
%Allows you to pause at a graph for as long as you do not press a key on the keyboard.
pause
%Save as a .jpg
saveas(figure(1),[b{ii+1,1} '.jpg'])
%%Save as a .fig
%saveas(figure(1),[b{ii+1,1} '.fig'])
%%Captures the figure being currently displayed to put into the movie.
%movieArray(nsig) = getframe(gcf);
end

May 31 2012

The errors arising with w1 vector were were troubleshooted. First, the code including the new general network dynamics function was run through the fmincon function for 5 different networks: an identity matrix, a matrix with two diagonals of
1, a matrix with a diagonal of 1 and alternating 1's and 0's above the diagonal, a matrix with a diagonal of 1's and the first row of 1's, and a matrix with a diagonal of 1's and the first column of 1's. The results were as follow.

-The output (w1 vector) produced with the network matrix with a
diagonal of 1's and the first row of 1's had the most number of
entries that were not identical (the first 21).

-With the exception of the network matrix just mentioned, the number
of unique entries within the w1 vector for the most part increased as
the number of edges in the network increased:

identity matrix: 21 edges, 1 unique entry in w1 vector
matrix with two diagonals of 1: 35 edges, 2 unique entries in w1 vector
matrix with a diagonal of 1's and the first column of 1's: 41 edges, 2 unique
entries in w1 vector
matrix with a diagonal of 1 and alternating 1's and 0's above the diagonal:
121 edges, 11 unique entries

-Both the network matrix with a diagonal of 1's and the first row of
1's and the network matrix with a diagonal of 1's and the first column
of 1's had 41 edges. However, there were 21 unique entries for the
former network and only 11 unique entries for the later network in the
w1 vector.

Then, I ran each of the matrices through the code but setting the thresholds to 1.00E-10 because one of the messages that popped up in the command window was that the optimizaiton was terminated because "the magnitude of directional derivative in search direction less than 2*options." I obtained pretty much the same results. Between some corresponding entries in the corresponding w1 vectors, there may have been a difference in the fifth decimal place.

I then ran each of these five networks using the original general network dynamics function. All of the entires in the w1 vectors for each of the networks were unique. At which point, the general network dynamics function was then revisted because that is most likely whyere the problem lay.

It was observed that the parms_used variable changes within the for loop in the original network dyanmics code but it stays at 0 in the new general network dynamics code. Therefore, the line parms_used = parms_used + nAii; was added inside the for loop to calculate W in the new general network dynamics function. In the output, the network optimized weights for each gene were different (they were the same for a majority of the genes when the w1 vector was not being properly computed as described before).

In addition, the plots generated in the two strain comparison and all strain comparison code will need to be changed to fit a polynomial through the data for each strain being compared and to include the standard name as well as the systematic name for each gene for which a plot is being produced.