I have a set of known size but with unknown elements, $X = \{x_1, \dots, x_N\}$, where the elements of $X$ are exponentially distributed random variables with unknown rate parameters $R = \{\lambda_1, \dots, \lambda_N\}$. I also have a "black box" function $f$ that selects an element of $X$ uniformly at random, and then returns a value sampled from the chosen element's exponential distribution (corresponding, perhaps, to the time until the first instance of an event governed by the chosen variable).

I'm looking to use $f$ to discern whether or not an exponentially distributed random variable, $x_q$, with known rate parameter, $\lambda_q$, exists in the set $X$. I also know that $\lambda_q$ is smaller than all other rate parameters in the set $X$ by at least a multiplicative factor $w$. Said another way, $\lambda_q \leq w \cdot \min(R \setminus \{\lambda_q\})$, where $w < 1$.

Provided $w$, how many times must I use $f$ to sample from $X$ to decide whether $x_q \in X$ with some threshold confidence?
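For concreteness, here is a minimal sketch of how I picture the black box $f$ behaving (the rate values in the example are purely hypothetical):

```python
import random

# A sketch of the black box f: pick one of the N variables uniformly,
# then draw a waiting time from its exponential distribution.
# The rate list below is hypothetical, for illustration only.
def make_black_box(rates):
    def f():
        lam = random.choice(rates)       # uniform choice over the N elements of X
        return random.expovariate(lam)   # sample from Exponential(lam)
    return f

f = make_black_box([0.5, 1.0, 2.0, 4.0])
print(f())  # one draw from the mixture
```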

Note - If this problem is too open ended as things stand, please feel free to suggest additional restrictions or clarifications!

Note 2 - We can specify that $N \leq 100$, where $N$ is a positive integer, and that $w \leq \frac{1}{2}$, though we cannot say that $w << 1$.

Surely you have to know something about $N$ also in order for this to have any hope? Maybe you want a bound in terms of $N$?
– Anthony Quas, Nov 17 '12 at 21:16

@Anthony Quas Fair point. I am looking for a bound in terms of $N$, and I have changed the question to specify that we know $N$.
– user28187, Nov 18 '12 at 7:20

What are typical values of $N$ and $w$? And what is $R$ in $R-\lambda_q$?
– fedja, Nov 19 '12 at 1:51

@fedja I have added some specifications for $N$ and $w$ in Note 2. I can tighten them as needed. $R - \lambda_q$ is meant to be the set $R$ without the element $\lambda_q$ (perhaps this notation is incorrect?)
– user28187, Nov 19 '12 at 1:56

@fedja Ah, $R$ is defined earlier as the set of rate parameters associated with the exponentially distributed random variables in $X$.
– user28187, Nov 19 '12 at 2:00

1 Answer

OK, here is what I have. I'll skip some derivations (I'll provide them later if you are interested) and just describe the conclusions. The final tables apply if you have noiseless data. Any noticeable amount of noise will cost you quite a bit here.

The problem of how to distinguish between two fixed densities $p(x)$ and $q(x)$ is classical. Suppose that we want to bound the combined probability of error by some small $\theta>0$. This means that if we are allowed to take $n$ samples, we have to find some set $E\subset\mathbb R^n$ such that $\int_E P+\int_{E^c}Q\le\theta$, where $P(x_1,\dots,x_n)=p(x_1)\cdots p(x_n)$ and similarly for $Q$. Here $E$ is the set on which we declare $q$ to be the actual density. Note that this sum can never beat $\int\min(P,Q)$, and we can achieve that value by the standard maximum likelihood decision: we declare the density to be $Q$ if $P(X_1,\dots,X_n)<Q(X_1,\dots,X_n)$ and $P$ otherwise. We can also get a fairly clear idea of the necessary sampling size; in fact, we can pin it down almost up to a factor of $2$. Note that $\min(P,Q)\le\sqrt{PQ}$, so
$$
\int\min(P,Q)\le \left(\int \sqrt{pq}\right)^n.
$$
On the other hand,
$$
\left(\int \sqrt{pq}\right)^{2n}=\left(\int \sqrt{PQ}\right)^{2}\le
\left(\int\min(P,Q)\right)\left(\int\max(P,Q)\right)\le 2\int\min(P,Q)
$$
Thus, if $\int\sqrt{pq}=e^{-H}$, then to reach the level $\theta$ of combined error we need at least $\frac 12 H^{-1}\log\frac 1{2\theta}$ samples, and $H^{-1}\log\frac 1\theta$ samples will suffice.
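As a concrete check of these bounds: for two exponential densities the affinity $\int\sqrt{pq}$ has the closed form $2\sqrt{\lambda_1\lambda_2}/(\lambda_1+\lambda_2)$, so the two sample-size estimates are easy to tabulate. A sketch (the specific rates are just illustrative):

```python
import math

# Hellinger affinity of Exp(a) and Exp(b):
# int sqrt(a e^{-ax} * b e^{-bx}) dx = sqrt(ab) * 2/(a + b)
def affinity(a, b):
    return 2 * math.sqrt(a * b) / (a + b)

def sample_size_bounds(a, b, theta):
    H = -math.log(affinity(a, b))
    lower = 0.5 / H * math.log(1 / (2 * theta))  # below this, no test can reach theta
    upper = 1.0 / H * math.log(1 / theta)        # maximum likelihood suffices here
    return lower, upper

lo, hi = sample_size_bounds(0.5, 1.0, 0.05)      # e.g. Exp(1/2) vs Exp(1)
print(lo, hi)
```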

The problem with your case is that we are testing not two densities but two families of densities against each other. However, if my computations are correct, we are lucky: the likelihood test that distinguishes the worst pair is actually universal enough to achieve the level of confidence given by the above $\sqrt{pq}$ estimate. So, assuming that $\lambda_q=w$ (so every other $\lambda$ is $\ge 1$), if we define
$p_L(x)=\frac{N-1}N Le^{-Lx}+\frac 1Nwe^{-wx}$ and $q(x)=e^{-x}$, where $L=L(N,w)$ is determined from the maximization problem $\int\sqrt{p_Lq}\to\max$ (which in practice is better posed as $H=\frac 12\int(\sqrt{p_L}-\sqrt q)^2\to\min$), then the corresponding maximum likelihood test works fine and gives a guaranteed bound $\theta$ for each one-sided error whenever the $\sqrt{pq}$ estimate yields a combined error of $\theta$.
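If it helps, $L(N,w)$ and the resulting $H$ can be found numerically. Here is a crude pure-Python sketch (Simpson's rule plus a grid search, with illustrative $N$ and $w$; this is not my original program):

```python
import math

N, w, theta = 10, 0.5, 0.05  # illustrative values

def H(L):
    # H(L) = (1/2) * integral of (sqrt(p_L(x)) - sqrt(q(x)))^2 over [0, inf),
    # approximated by Simpson's rule on [0, 50] (both densities are negligible beyond)
    def g(x):
        p = (N - 1) / N * L * math.exp(-L * x) + w / N * math.exp(-w * x)
        q = math.exp(-x)
        return (math.sqrt(p) - math.sqrt(q)) ** 2
    a, b, m = 0.0, 50.0, 2000
    h = (b - a) / m
    s = g(a) + g(b)
    for i in range(1, m):
        s += (4 if i % 2 else 2) * g(a + i * h)
    return 0.5 * s * h / 3

# crude grid search for L(N, w) over [1, 3) with step 0.005
L_opt = min((k / 1000 for k in range(1000, 3000, 5)), key=H)
n_sufficient = math.ceil(math.log(1 / theta) / H(L_opt))
print(L_opt, H(L_opt), n_sufficient)
```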

I ran a small program to see what sampling sizes it gives for reasonable $w$ and $N$. The table for the sacramental $\theta=0.05$ gave, on each line, $N$, $L$, and $n$. [Table omitted.]
As you can see, with your $10^5$ samples you are just on the edge of "theoretically feasible" for $w=0.5$, $N=100$, but if you can lower either number, everything gets fairly nice (if no noise is present, of course).

I suggest you run a few simulations and see whether it works for you (the "general theory" should be OK, but I could have made some stupid mistakes somewhere). Normally, you get something like
$$
n=8N^{\frac 1{1-w}}\log\frac {1}{\theta}
$$
as a rule of thumb for choosing the sample size. This is all a "best performance in the worst case" approach. If you actually have more information than you put in the post, it may help push the numbers down a bit :).
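Such a simulation might look like the following sketch (illustrative $N$ and $w$, and with $L$ simply pinned to $1$ rather than taken from the minimization of $H$):

```python
import math
import random

random.seed(0)
N, w, theta = 10, 0.5, 0.05   # illustrative values
L = 1.0                        # placeholder; properly L = L(N, w) from minimizing H

n = math.ceil(8 * N ** (1 / (1 - w)) * math.log(1 / theta))  # rule-of-thumb sample size

def log_ratio(x):
    # log p_L(x) - log q(x) for the worst-case pair of densities
    p = (N - 1) / N * L * math.exp(-L * x) + w / N * math.exp(-w * x)
    q = math.exp(-x)
    return math.log(p) - math.log(q)

def trial(x_q_present):
    # simulate n black-box draws and apply the maximum likelihood test
    if x_q_present:
        xs = [random.expovariate(w if random.randrange(N) == 0 else L)
              for _ in range(n)]
    else:
        xs = [random.expovariate(1.0) for _ in range(n)]
    return sum(log_ratio(x) for x in xs) > 0   # True = declare "x_q is in X"

T = 100
miss = sum(not trial(True) for _ in range(T)) / T     # x_q present but not detected
false_alarm = sum(trial(False) for _ in range(T)) / T  # x_q declared but absent
print(n, miss, false_alarm)
```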

Feel free to ask questions but do not expect a quick answer: life is crazy at this end...