For pair wise comparisons of two RNA seq samples (sample A and B), I found a formula in 3 papers, all from the same group. They calculated an empirical Z-score assuming the distribution of RPKMs for each sample followed a Poisson model, but the denominator is different in their paper:

zscore = (RPKM_A - RPKM_B) / sqrt(RPKM_A - r * RPKM_B)

and in 2 papers:

zscore = (RPKM_A - RPKM_B) / sqrt(RPKM_A + r * RPKM_B)

RPKM_A and RPKM_B are RPKMs in the region of interest of A and B samples, respectively

r = NA/NB, where NX is the total number of aligned reads used for normalization

TWO QUESTIONS:

1) if the first one is correct, sometime "RPKM_A - r * RPKM_B" will be negative and the sqrt can't be taken

2) According to the original z-score formula, how do they came to this formula?

I don't know, but in few similar situations, I have sent emails to authors only to get a response about typos and writing errors. As you have said, the SQRT from a negative number would be one with an imaginary number, will that work?

In these studies, the goal is prenatal diagnosis of an autosomal recessive disorder. In short, a certain amount of fetal DNA is commonly found in the circulatory system of the mother (the amount changes over the duration of the pregnancy). Since the mother, in this case, is a carrier for the disorder, we need to somehow estimate the probability that the a possible over-abundance of a mutant allele observed in her blood is significant, since the overabundance would come from a homozygous mutant child.

So, firstly, these are not RPKMs, but raw counts (at least in the ones I've seen, namely here). r, then, is the ratio of fetal to maternal DNA that we get from the blood draw. We measure some total number of reads informatively covering a site, Nt, comprised of mutant (Nm) and wild-type (Nw) reads. If the fetus is homozygous mutant, then we'll have an Nm-Nw=r*Nt, since the entire fetal contribution will be in Nm. If, on the other hand, the fetus is heterozygous, then reads originating from it will be (more or less) evenly split between the two counts, and Nm-Nw=0. We can measure Nm-Nw, but in order to get a statistic, we need to make that a Z-score, by dividing by its standard deviation.

So how, then, might we arrive at a standard deviation. Recall that we have raw counts (not RPKMs), which means that the technical variance is the same as the mean (we have Poisson variance). The standard deviation is the square root of the variance, so:

sd(Nm) = sqrt(E(Nm)) = sqrt(0.5*Nt+0.5*r) in the homozygous case and sqrt(0.5*Nt) (E(Nm) is the expected value and I'll leave demonstrating that (Nt+r)/2 is the expected value as an exercise for you).

sd(Nw) = sqrt(E(Nw)) = sqrt(0.5*Nt-0.5*r) in the homozygous case and also sqrt(0.5*Nt) in the heterozygous case.

We then want the pooled deviation of the difference, which we can get by simply taking the square root of the pooled variance. The pooled variance ends up just being Nt, since you just take the sum of the variances (i.e., the sum of the squared standard deviations). Thus, the denominator is sqrt(Nt).

Update: As mentioned at the top, it seems that you're talking about different papers (thanks for the links) where they actually do use RPKMs. The logic is generally the same as that presented above and it should be noted that the denominator should be sqrt(RPKM_A + r*RPKM_B), since that's vaguely similar to sqrt(Nt). This is, however, a terrible way of doing things and should never be done. They were doing pair-wise comparisons, but one should really take a couple different dishes of ES cells with difference passages and then use them as replicates to actually gauge what the variance is.

Update2: By "Poisson distribution", they mean that the RPKM measurements themselves have Poisson variance. This is simply because the technical variance of the original counts are Poisson and an RPKM is just a transformation of the original count.

I updated with a bit more information at the end that's relevant. Basically, Nt is vaguely similar to RPKM_A+r*RPKM_B, since r is just a fudge factor that acts as a weight. As I added at the end, this isn't a very good way to do things. Please do not do this in your own work, the method should die a quick death.

They're published papers, so you should probably list specifically what they are - so we have broader context, and so others confused by the same issue can find them. There are several issues involved that are hard to work out based on just the information above. However, given the general definition of a z-score as a ratio of difference from a mean to the standard deviation of the mean, the denominator of your equations above is meant to represent the standard deviation. The standard deviation consists of the square root of the sum of the squared differences from the mean. Thus your concern about not taking the square root of a negative number is right on. The differences between a value and the mean (as listed in your first example) would normally first be squared (thus giving a positive result), prior to taking the square root. Nonetheless. without knowing what they are actually trying to achieve it seems like guess work. (i.e. why only normalize the denominator?).