@#!^%$ microarray data!

You may recall (if you have had nothing better to do than read previous posts) that a few posts ago I raised a concern about how microarray data from independent slides should be combined (as ratio of means or as mean of ratios). The issue arose because the undergraduate student who hybridized the arrays and did the original analysis didn't report over-expression of the genes we have found over-expressed in our reanalysis of the data. I wanted to go back and compare her calculations to ours to see where the discrepancy arose.
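For anyone who missed the earlier post, here is a minimal sketch of why the choice between ratio of means and mean of ratios matters. The intensities are invented purely for illustration (they are not from our arrays), and the function names are hypothetical. The geometric mean (averaging the log ratios) is included only as a commonly used third option, not as what either analysis actually did.

```python
from math import log2
from statistics import mean

# Hypothetical background-corrected intensities for ONE gene on three slides.
# These numbers are made up for illustration only.
treated = [1200.0, 300.0, 900.0]
control = [400.0, 300.0, 150.0]


def ratio_of_means(num, den):
    """Average each channel across slides first, then take one ratio."""
    return mean(num) / mean(den)


def mean_of_ratios(num, den):
    """Take a ratio on each slide first, then average the ratios."""
    return mean(n / d for n, d in zip(num, den))


def geometric_mean_ratio(num, den):
    """Average the log2 ratios across slides, then convert back to a ratio."""
    return 2 ** mean(log2(n / d) for n, d in zip(num, den))


if __name__ == "__main__":
    print("ratio of means:      ", round(ratio_of_means(treated, control), 3))
    print("mean of ratios:      ", round(mean_of_ratios(treated, control), 3))
    print("geometric mean ratio:", round(geometric_mean_ratio(treated, control), 3))
```

The three summaries give different fold-changes from the same three slides, so a gene sitting near a fold-change cutoff can make the list under one method and miss it under another.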

So today I hoped to do that. I only had a rough draft of the student's undergraduate thesis. This didn't explain how she did her calculations, so I asked her supervisor for the final version. He provided a copy, but it turns out to be identical to the version I have, except that it carries his comments and suggestions rather than mine. So we have no information at all about how she did her calculations. (More annoyingly, I now suspect that she completely ignored all my carefully thought-out suggestions for improving her thesis.)

But I also got a CD containing her data files from the other lab's computer. So I spent much of today going through her lists of genes that were up-regulated or down-regulated at least 2-fold, and comparing them to the lists from our reanalysis of the array data. The lists disagree completely. For the antibiotic we're not very interested in (ery), the genes scored as 'down' in this file are reported as 'up' in her thesis. (Well, the thesis only considered the subset of genes she thought interesting, the 'virulence genes'.) These same genes are 'up' in our analysis. So maybe she just switched 'up' and 'down' in the file, and discovered and corrected this error while writing her thesis.

But it's worse for the antibiotic we are very interested in (rif). For this, genes listed as 'up' in her file are also 'up' in her thesis. But these are genes that our analysis says are 'down'. And vice versa: the genes we find to be 'up', she lists as 'down'.

So now there are two discrepancies, and thus two different places where an error could be. First, maybe she switched up and down in both the ery and rif files, and corrected only the ery error in her thesis. Second, maybe the dye assignments we used for both our analyses (ery and rif) were reversed but the up and down assignments in her lists are correct. In this case she must have mistakenly switched the up and down assignments for the ery analysis in her thesis.
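To make the second possibility concrete, here is a small sketch of how reversing the dye assignment flips every call. It assumes a two-colour array where each gene has a Cy5 and a Cy3 intensity and a gene is called at 2-fold (a log2 ratio of 1 or more in either direction). The gene names, intensities, and function name are all hypothetical, not taken from our data or from the student's files.

```python
import math

THRESHOLD = 1.0  # log2 units, i.e. a 2-fold change (assumed cutoff)

# Hypothetical (cy5, cy3) intensities for three made-up genes.
intensities = {
    "geneA": (2000.0, 400.0),
    "geneB": (300.0, 1500.0),
    "geneC": (500.0, 450.0),
}


def call_genes(data, treated_channel):
    """Label each gene 'up', 'down' or 'unchanged' in treated vs. control.

    treated_channel says which dye ('cy5' or 'cy3') labelled the treated sample.
    """
    calls = {}
    for gene, (cy5, cy3) in data.items():
        if treated_channel == "cy5":
            log_ratio = math.log2(cy5 / cy3)
        else:
            log_ratio = math.log2(cy3 / cy5)
        if log_ratio >= THRESHOLD:
            calls[gene] = "up"
        elif log_ratio <= -THRESHOLD:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls


calls_if_treated_is_cy5 = call_genes(intensities, "cy5")
calls_if_treated_is_cy3 = call_genes(intensities, "cy3")

if __name__ == "__main__":
    for gene in intensities:
        print(gene, calls_if_treated_is_cy5[gene], calls_if_treated_is_cy3[gene])
```

Reversing the assignment only changes the sign of each log ratio, so the 'up' and 'down' lists trade places exactly and the unchanged genes stay put. That is why a single dye mix-up would produce lists that disagree completely rather than partially.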

I'm still digging into this accursed data set because we found a surprising pattern of gene induction in the rif-treated cells. But if we've switched the dye assignments then these genes are down, not up. Is this less surprising? I have no idea.

I'd like to throw this whole project out the window. But I've promised to grow some cells and do some RNA preps so the apparent gene induction effect can be tested by my collaborator's technician and student, using real-time PCR on cDNAs generated from independent RNA samples.

How long will this take? Not too long, I think. I'll need to start a cell culture the night before. The next morning I'll dilute the cells into medium with and without rif, let them grow for at least 5 doublings (probably about 3 hours), collect cells in microfuge tubes, and collect more after one more doubling. I won't need large amounts of culture because the real-time PCR analysis needs very little RNA, so one 2.0 ml tube of each culture at each time should be enough. Well, maybe two tubes of the first samples, because I'll need enough RNA to see in a gel. I'll do RNA preps of these cells using the Qiagen RNeasy kit; we have the kit, and the preps take less than an hour as I recall. I said I'd grow cells and do preps twice, on different days, to get independent replicate RNAs. I'll need to run samples of each prep in a gel to check that the ribosomal RNAs are largely intact. Once I know the RNA concentration, I'll treat 5 micrograms with 'DNA-free' to get rid of chromosomal DNA that would confound the PCR analysis.

I expect that the new RNA analysis will not confirm the surprising result we see in our present analysis, partly because the result is unexpected and thus more likely to be due to an error than to a previously unknown biological process, and partly because we now know that the data is indeed full of errors. But at least this will get me back at the bench, if only for a couple of days.