The goal is to expedite detection using a sliding-window approach. In other words, an object classifier is given, and I need to find the possible locations of this object in an image. This is a general problem in object detection.

We are given an intensity map (positive values; these could be detection scores) on a rectangular grid (e.g. an $M \times N$ intensity image). The goal is to find the bounding box (i.e. a rectangle of size $m \times n$, where $m$ and $n$ are known and greater than 1) whose average intensity is maximum among all such boxes. The brute-force algorithm would be to evaluate this value for every box (i.e. linear filtering) and take the maximum. Are there more efficient ways to do this? How about approximate algorithms?

This question was for one choice of $m$ and $n$. But now there is a finite set of $m$'s and $n$'s for which I need to find optimal locations in the image. Do I rerun the previous algorithm for each choice of $m$ and $n$ independently, or can I do something more efficient?

I might not understand the question. Why isn't a 1x1 box around the maximum value always the answer?
–
Greg Muller Oct 14 '10 at 20:57


This question seems more appropriate for stackoverflow. Even if resubmitted there, you should probably give more details about the problem. For example: are m and n given? If not, you could just take the spot in the grid with the largest intensity and call that a 1x1 box with the largest average intensity. If they are given, what else have you tried? Dynamic programming somehow seems natural.
–
Noah Stein Oct 14 '10 at 21:09

-1: vote to close; this seems like homework to me that the student needs to think about and solve on their own. Of course, the simple way to do it, if $m$ and $n$ are given, is to perform a $2$-dimensional convolution of the given $M \times N$ matrix with the $m \times n$ matrix consisting of all ones. The resulting convolution, call it $X$, has a maximal entry or entries identifying the best positioning of the $m \times n$ box. This looks like homework for an image-processing class. I'd vote to close it if I had enough reputation points to cast closing votes.
–
sleepless in beantown Oct 14 '10 at 21:15

What algorithms have you tried? What's your motivation behind this problem? Is it homework? Please take a look at the FAQ's and consider that a different forum might be more appropriate for this question, and that even on a different forum you might need to clarify and explain the problem more explicitly.
–
sleepless in beantown Oct 14 '10 at 21:23

I updated the question, so any extra comments would be appreciated. Thanks!
–
Bernard Oct 15 '10 at 3:35

If $m = M-1$ and $n = N-1$, then you only need to look at the outer boundary (the cells not forced to be in the window). That would locate the optimal window, although computing its average value would require looking inside.

In general, you can first compute the differences $a(i+m,j)-a(i,j)$ and also $a(i,j+n)-a(i,j)$; then, with $n$ (respectively $m$) additions of these, you can find the effect of moving the window down (respectively right) by one step.
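For concreteness, here is a minimal NumPy sketch of the same idea in its prefix-sum (summed-area table) form, which computes every window sum in $O(MN)$ total; the function name `best_window` is illustrative:

```python
import numpy as np

def best_window(a, m, n):
    """Find the top-left corner of the m x n window with the largest sum
    (equivalently, the largest average, since all m x n windows have
    equal area). Uses a summed-area table: O(M*N) total work."""
    M, N = a.shape
    # S[i, j] = sum of a[:i, :j]; padded with a zero row/column.
    S = np.zeros((M + 1, N + 1))
    S[1:, 1:] = a.cumsum(axis=0).cumsum(axis=1)
    # Sum of the window with top-left corner (i, j):
    #   S[i+m, j+n] - S[i, j+n] - S[i+m, j] + S[i, j]
    sums = S[m:, n:] - S[:-m, n:] - S[m:, :-n] + S[:-m, :-n]
    i, j = np.unravel_index(np.argmax(sums), sums.shape)
    return i, j, sums[i, j] / (m * n)
```

This replaces the per-window $O(mn)$ summation of the brute-force approach with four lookups per window after one pass of preprocessing.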

I haven't thought about trying to find the optimal window of each size. (As commented, the optimal among all sizes would be $1 \times 1$.)
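On the multi-size variant of the question: a single summed-area table (prefix sums of the intensity map) can be shared across all the $(m, n)$ pairs, so each additional size costs only $O(MN)$ with no per-size preprocessing. A hedged NumPy sketch (the function name is illustrative):

```python
import numpy as np

def best_windows_all_sizes(a, sizes):
    """One shared summed-area table answers every (m, n) query:
    building S costs O(M*N) once; each size is then O(M*N) extra."""
    S = np.zeros((a.shape[0] + 1, a.shape[1] + 1))
    S[1:, 1:] = a.cumsum(axis=0).cumsum(axis=1)
    out = {}
    for m, n in sizes:
        sums = S[m:, n:] - S[:-m, n:] - S[m:, :-n] + S[:-m, :-n]
        i, j = np.unravel_index(np.argmax(sums), sums.shape)
        out[(m, n)] = (i, j, sums[i, j] / (m * n))
    return out
```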

While your question as stated seems ill-formed (as commenters point out, the optimal solution is a $1 \times 1$ box around the maximum value), let's assume there is some other constraint (for example, $M \times N$ is too large, or there is a lower bound on the size of the query box) that makes this approach infeasible.

In that case, at least for approximations, and if $M \times N$ is very large, an approach based on $\epsilon$-approximations might work for you. Roughly speaking, you're trying to do range querying over a set of ranges that are "well structured" (formally, have low VC-dimension), and you'd like to extract a sample of the input so that range queries on the sample approximate range queries on the real input. It turns out that a random sample of size roughly $O((1/\epsilon^2) \log (1/\epsilon))$ suffices to estimate all ranges within error $\epsilon$. Algorithmically, you then merely run your expensive procedure on this small sample, and presumably that would be more efficient.
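As a rough illustration of the sampling idea (this is a simplified Monte Carlo version, not a proper $\epsilon$-approximation construction; the sampling scheme and function name are my own assumptions): sample grid cells with probability proportional to intensity, then estimate each window's mass from the fraction of the sample it contains.

```python
import numpy as np

def approx_window_sums(a, m, n, eps, rng=None):
    """Estimate every m x n window's total intensity from a sample of
    O(1/eps^2) cells drawn proportionally to intensity. The additive
    error is controlled as a fraction of the total mass."""
    rng = np.random.default_rng(0) if rng is None else rng
    M, N = a.shape
    total = a.sum()
    k = int(np.ceil(1.0 / eps**2))
    flat = rng.choice(M * N, size=k, p=(a / total).ravel())
    rows, cols = np.divmod(flat, N)
    # Count samples per cell, then sum counts per window via prefix sums.
    counts = np.zeros((M, N))
    np.add.at(counts, (rows, cols), 1.0)
    S = np.zeros((M + 1, N + 1))
    S[1:, 1:] = counts.cumsum(axis=0).cumsum(axis=1)
    # est[i, j] estimates the mass of the window with corner (i, j).
    return (S[m:, n:] - S[:-m, n:] - S[m:, :-n] + S[:-m, :-n]) * (total / k)
```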

The question is really asking about cross-correlation using matched filters. The known classifiers for the objects in the set $S = \{s_1, s_2, \ldots, s_x\}$ are matrices (subimages) of size $m_i \times n_i$, one for each object $s_i$. So now, $m$ and $n$ are no longer necessarily the same for the different objects being classified.

The matched filters can be convolved in two dimensions, one at a time, with the image $M$, and peaks identified in order to find candidate locations at which the target objects $s_i$ occur. Once you have the cross-correlation, you look for local maxima in order to select the candidate locations.
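A short sketch of this pipeline, assuming SciPy is available (`scipy.signal.correlate2d` performs the 2-D cross-correlation; the function name `candidate_locations` and the simple top-$k$ peak picking are illustrative stand-ins for proper local-maximum detection):

```python
import numpy as np
from scipy.signal import correlate2d

def candidate_locations(image, templates, top_k=3):
    """Cross-correlate each matched filter (template) with the image and
    return the top_k highest-scoring positions per template. 'valid'
    mode keeps only fully overlapping placements; template sizes
    m_i x n_i may differ from one another."""
    results = []
    for t in templates:
        score = correlate2d(image, t, mode="valid")
        # Indices of the top_k scores, as (row, col) top-left corners.
        flat = np.argsort(score.ravel())[::-1][:top_k]
        results.append([np.unravel_index(f, score.shape) for f in flat])
    return results
```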

If there is any redundancy between the targets to be classified (e.g. $s_a$ and $s_b$ are very similar looking), then you might be able to avoid doing a cross-correlation of each target with the entire image. But if there are no similarities between the different targets to be detected, it is very likely that there is no better solution than doing full convolutions and cross-correlations for each object classifier $s_i$ over the image $M$.