Here is the problem I'm considering. You have some space $X$ from which you can draw points $x$ and $y$, a distance metric $d(x,y)$, and a sigma-algebra/probability measure on $X$. For example, $X$ could be ${\bf R}^n$ with a pdf $p(x)$; that is probably general enough for my purposes.

Now the problem is that you want to make an encoding function $f: X \to \{0,1\}^N$ and a decoding function $g: \{0,1\}^N \to X$ such that the expected value of $d(x, g(f(x)))$ is minimized (or possibly some non-decreasing function thereof, like $d(x, g(f(x)))^2$). The basic idea is that $f$ maps a point in your space to a fixed-length binary code, and $g$ maps a binary code vector back to a point in your space. You want to find a code for which the loss from compression is smallest.

It's a lot like PCA, but the code elements are binary, and the encoder/decoder functions are unrestricted.

One thing I've thought of so far is that this reduces to the problem of picking $2^N$ points $P$ in $X$ such that the expected distance from $x$ to the nearest point in $P$ is minimized.
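
To spell out why this reduction should hold (a sketch, assuming the minima are attained; the codebook notation $P = \{g(c) : c \in \{0,1\}^N\}$ is introduced here only for illustration): for any fixed decoder $g$, the best encoder sends each $x$ to the code of a nearest codebook point, so

$$\min_{f,\,g}\ \mathbb{E}\big[d(x, g(f(x)))\big] \;=\; \min_{P \subset X,\ |P| \le 2^N}\ \mathbb{E}\Big[\min_{p \in P} d(x, p)\Big].$$

The same argument goes through with $d$ replaced by any non-decreasing function of $d$, such as $d^2$.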

If anyone knows of any work on this kind of problem, I'd be very interested to read about it. I don't necessarily need a procedure for coming up with the optimal code or anything like that. I imagine someone must have derived some properties that the optimal code should have though.

1 Answer

The problem you're trying to solve is a generalization of what is called the 'continuous $k$-median problem'. In that problem, the center locations are arbitrary, whereas in your case, they are on a grid. However, your problem starts with an arbitrary metric space. While the cited problem is NP-hard, and is likely to remain so for your problem, there are useful heuristics that you can try for the case when $X = {\mathbb R}^n$, including $k$-means-style methods.

For example, you could fix a collection of centers on the grid, compute the Voronoi diagram, integrate your density within each cell to find the "average" location (the centroid of the cell under $p$), "snap" this location to the nearest grid point, and repeat until the centers stop moving.
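
The loop described above can be sketched roughly as follows for $X = {\mathbb R}^n$ with squared Euclidean distance. This is only an illustration under several assumptions: the integrals over Voronoi cells are replaced by averages over samples drawn from $p$, the grid is taken to be the lattice of multiples of a hypothetical spacing `grid_step`, and the function names `snapped_lloyd` and `expected_distortion` are made up for this example. With $k = 2^N$ centers, the binary code of a point would simply be the index of its nearest center written in $N$ bits.

```python
import numpy as np

def snapped_lloyd(samples, k, grid_step, n_iter=20, seed=0):
    """Lloyd-style iteration with centers snapped to a regular grid.

    samples   : (m, n) array of draws from the density p (stands in for integration)
    k         : number of centers, e.g. 2**N for N-bit codes
    grid_step : spacing of the assumed grid (hypothetical parameter)
    """
    rng = np.random.default_rng(seed)
    # Start from k distinct samples, snapped to the grid.
    centers = samples[rng.choice(len(samples), size=k, replace=False)]
    centers = np.round(centers / grid_step) * grid_step
    for _ in range(n_iter):
        # Voronoi step: assign every sample to its nearest center.
        d2 = ((samples[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = samples[labels == j]
            if len(members) > 0:
                # Sample mean approximates the "average" location of the cell;
                # then snap it to the nearest grid point.
                centers[j] = np.round(members.mean(axis=0) / grid_step) * grid_step
    return centers

def expected_distortion(samples, centers):
    """Monte Carlo estimate of E[min_p d(x, p)^2] for the given centers."""
    d2 = ((samples[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.standard_normal((2000, 2))  # stand-in for draws from p(x)
    c = snapped_lloyd(x, k=4, grid_step=0.25)
    print(c)
    print(expected_distortion(x, c))
```

The mean is the right "average" only for squared Euclidean distance; for plain distance $d$ one would want a geometric median instead. Because of the snapping step, the distortion is not guaranteed to decrease at every iteration, so this should be treated as a heuristic rather than a procedure with a convergence guarantee.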