Thursday, March 24, 2011

In a previous post, I talked about how to produce exponentially weighted averages of time-embedded values. This is fine for cases where you really want averages, but it doesn't work for rates. Happily, the technique I presented for simple averages can be easily extended to work for rates as well.

To recap a bit, an on-line algorithm for computing a simple average works by accumulating the number of observations and the sum of the observations. There is no mention of time. For each observation $(x, t)$ we update our state $(s, w)$ according to
\begin{eqnarray*}
s_{n} &=& s_{n-1} + x \\
w_{n} &=& w_{n-1} + 1
\end{eqnarray*}
The exponentially weighted average approach that I mentioned before uses the time of each observation to discount the previous state based on the time of the previous observation. Thus, the state $(s,w,t)$ is updated according to
\begin{eqnarray*}
\pi &=& e^{-(t-t_{n-1})/\alpha} \\
s_{n} &=& \pi s_{n-1} + x \\
w_{n} &=& \pi w_{n-1} + 1 \\
t_{n} &=& t
\end{eqnarray*}
If we were to compute simple rates without discounting, then instead of interpreting $w$ as a count, we would interpret it as a time interval. Thus we would update state $(s,w,t)$ according to with an observation $(x, t)$ according to
\begin{eqnarray*}
s_{n} &=& s_{n-1} + x \\
w_{n} &=& w_{n-1} + (t - t_n) \\
t_n &=& t
\end{eqnarray*}
By analogy with the discounted version of the simple average, we can produce an exponentially weighted rate average by updating the state according to this
\begin{eqnarray*}
\pi &=& e^{-(t-t_{n-1})/\alpha} \\
s_{n} &=& \pi s_{n-1} + x \\
w_{n} &=& \pi w_{n-1} + (t-t_n) \\
t_{n} &=& t
\end{eqnarray*}
This approach has a defect in that each data pont should really be considered to be a report of events spread out in an uncertain way over the time since the last report. As such, the interval and the events should probably both be discounted as if they had been reported uniformly over the period in question. In practice, this correction does not matter much since the averaging time constant $\alpha$ is typically large compared to the average reporting interval. Another consideration comes up when multiple sources are reporting concurrently. If we want to do proper discounting of each observed number, then we have to keep track of the time since last report for each of the sources. Since this probably won't matter and since it considerably complicates the implementation, I would rather not do it.

See https://issues.apache.org/jira/browse/MAHOUT-634 for code. This will be in Mahout before long.

Also, if you are measuring rates, then it is common for rates to be reported from multiple sources independently. Such an average can be computed pretty easily using this same framework if the sources report often relative to the averaging time constant. This simple implementation just attributes each reported count as if they occurred in the interval since the most recent report from any reporter. If the time constant is relatively long, this can work out reasonably well as long as we are careful.

If reporting intervals are longer, then the averaging is a bit trickier because we really would like to attribute the reported counts over the entire interval from the last report from the same source. This means that we have to discount some of the counts because they are effectively kind of old.

Tuesday, March 22, 2011

Ted Yu on the hbase list wants to compute an exponential weighted average for the number of queries per second or other statistics that characterize the performance or state of an hbase system.

The trick is that the measurements are only available at irregular intervals. If they were sampled regularly, then the standard mixing trick would work:
\[
m_{n+1} = \mu m_n + (1-\mu) x_{n+1}
\]
where $m$ is our current estimate of the mean, $x_n$ is the $n$-th sample and $\mu$ determines how much history to use.

With unequal sample times, things become a bit more complicated. If we get lots of measurements all at once, we want to give them nearly equal weight but if we have a long gap, we want to weight the very old samples much less.

In fact, we want to weight old samples according to how old they are with exponentially decreasing weight. If we sample values $\left \lbrace x_1 \ldots x_n \right \rbrace$ at times $t_1 \ldots t_n$ then we want the weighted mean defined by
\[
m_n = {\sum_{i=1}^n x_i e^{-(t_n - t_i)/\alpha} \over \sum_{i=1}^n e^{-(t_n - t_i)/\alpha} }
\]
Here $\alpha$ plays the same role as $\mu$ did before, but on a different scale. If the evenly sampled data comes at time intervals $\Delta t$ then $\mu = e^{\Delta t / \alpha}$.

Happily, there is a very simple recurrence relationship that allows us to keep only two intermediate values while computing the value of $m_1 \ldots m_n$ in an entirely on-line fashion as the $x_i$ values arrive.