Noisy Time Series III: Theoretical Foundations of Compressed Sensing

Introduction:

In a previous post we introduced the problem of detecting Gravity Waves using Machine Learning and suggested using techniques like Minimum Path Basis Pursuit. Here, we drill down into the theoretical justifications of the general approach–called Compressed Sensing–and try to at least understand what the current theory actually tells us.

Basis Pursuit (BP) is a principle for decomposing a signal into an optimal superposition of dictionary elements (an over-complete basis), where optimal means having the smallest $latex \ell_{1} $-norm of coefficients among all such decompositions.

Unfortunately, neither of these off-the-shelf programs will work to find our Gravity Waves; for that, we need to add path constraints.

Our signal is very weak (SNR << 1) and we are trying to reconstruct a continuous, differentiable function of a known form. We will be pushing the computational and theoretical limits of these methods. To get there (and because it is fun), we highlight a little theoretical work on Compressed Sensing.

In the past decade, work by Dave Donoho and Emmanuel Candes at Stanford and Terence Tao at UCLA has formalized and developed the theoretical and practical ideas of Basis Pursuit under the general name compressed sensing.

Beyond the Sampling Theorem

Tao, a former child prodigy who won the Fields Medal in 2006, took some time off from pure math to show us that we are, in fact, limited by the signal structure, not the bandwidth.

We call a signal S-sparse when at most $latex S $ of its coefficients are non-zero. For example, many images are S-sparse in a wavelet basis; this is the basis of the newer JPEG2000 algorithm.

This allows us to reconstruct a signal with as few data as possible–if we can guess the right basis.

Ideal Sparse Reconstruction

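In an ideal world, we would like to find the most sparse solution, which means we solve the $latex \ell_{0} $ regularization problem

$latex \min_{x}\|x\|_{0} $

subject to $latex Ax=b $

where $latex A $ is the measurement matrix, $latex b $ is the vector of measurements, and the $latex \ell_{0} $ norm $latex \|x\|_{0} $ is just the number of non-zero components of $latex x $.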

This is very difficult to achieve numerically. It turns out that we can approximate this problem with the $latex \ell_{1} $ regularization problem, which can be solved as a convex optimization problem. And we even have some ideas why. But why ask why?
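To make the convex relaxation concrete, here is a minimal sketch of the $latex \ell_{1} $ problem written as a linear program. The names `basis_pursuit` and `demo`, the Gaussian measurement matrix, and the sizes n, m, S are all assumptions chosen for illustration; this is not the path-constrained method we are building toward, just the vanilla noiseless problem.

```python
import numpy as np
from scipy.optimize import linprog


def basis_pursuit(A, b):
    """Solve min ||x||_1 subject to Ax = b as a linear program.

    Split x = u - v with u, v >= 0. At the optimum ||x||_1 = sum(u) + sum(v),
    and the equality constraint becomes [A, -A] @ [u; v] = b.
    """
    n = A.shape[1]
    cost = np.ones(2 * n)
    A_eq = np.hstack([A, -A])
    result = linprog(cost, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return result.x[:n] - result.x[n:]


def demo(n=128, m=48, S=5, seed=0):
    """Build a random S-sparse signal, measure it, and run basis pursuit."""
    rng = np.random.default_rng(seed)
    x_true = np.zeros(n)
    support = rng.choice(n, size=S, replace=False)
    x_true[support] = rng.standard_normal(S)
    # Assumed measurement model: i.i.d. Gaussian rows, no noise.
    A = rng.standard_normal((m, n)) / np.sqrt(m)
    b = A @ x_true
    x_hat = basis_pursuit(A, b)
    return x_true, x_hat, A, b


if __name__ == "__main__":
    x_true, x_hat, A, b = demo()
    print("max reconstruction error:", np.max(np.abs(x_hat - x_true)))
```

The split into positive and negative parts is the standard trick for turning an absolute value objective into a linear one; any LP solver would do in place of `linprog`.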

IMHO, applied machine learning work requires applying off-the-shelf tools, even in seemingly impossible situations, and yet avoiding wasting effort on hopeless methods. The best applied scientists understand the limits of both the methods they use and the current underlying theory.

Theoretical Justifications for Compressed Sensing

An early idea showed that we can, in fact, do better than the Sampling Theorem just using random projections:

Theorem (Candes-Romberg 2004):

Suppose we choose m points randomly out of n. Then, with high probability, every S-sparse signal $latex x $ can be reconstructed from $latex Ax $, so long as $latex m\geq C\,S\log n $ for some absolute constant C.

This result lends hope that this can be formalized. But what kind of conditions do we need on $latex A $ to at least have a hope of reconstructing $latex x $? A very simple result is

Proposition:

Suppose that any $latex 2S $ columns of the $latex m\times n $ matrix $latex A $ are linearly independent. (This is a reasonable assumption once $latex m\geq 2S $.) Then, any S-sparse signal $latex x $ can be reconstructed uniquely from $latex Ax $.

Proof:

Suppose not; then there are two S-sparse signals $latex x,x' $ with $latex Ax=Ax' $. This implies $latex A(x-x')=0 $. But $latex x-x' $ is 2S-sparse, so there is a linear dependence between 2S columns of $latex A $. A contradiction.
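The hypothesis of the proposition can be checked by brute force on a toy matrix. This is only a sketch: the helper name `all_column_subsets_independent` is hypothetical, and the exhaustive loop over column subsets is combinatorial, so it is only feasible for very small n.

```python
import itertools
import numpy as np


def all_column_subsets_independent(A, k):
    """Return True if every set of k columns of A is linearly independent."""
    n = A.shape[1]
    for cols in itertools.combinations(range(n), k):
        # A set of k columns is independent exactly when its rank is k.
        if np.linalg.matrix_rank(A[:, list(cols)]) < k:
            return False
    return True


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = 3
    # With m >= 2S rows, a Gaussian matrix passes the test.
    print(all_column_subsets_independent(rng.standard_normal((6, 10)), 2 * S))
    # With m < 2S rows, no set of 2S columns can be independent.
    print(all_column_subsets_independent(rng.standard_normal((4, 10)), 2 * S))
```

The second case shows why the assumption only becomes reasonable once m is at least 2S: a matrix with fewer rows cannot have 2S independent columns.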

So we might expect, on a good day, that every 2S columns of A are linearly independent. In fact, we can say a little more.

There are now several theoretical results ensuring that Basis Pursuit works whenever the measurement matrix is sufficiently “incoherent”, which roughly means that its matrix entries are uniform in magnitude.

In practice, numerical experiments suggest that most S-sparse signals can be recovered exactly when $latex m\gtrsim 4S $ or so.

Spectrum of a Gaussian Random Matrix

This means we can strengthen the proposition above by requiring, for example, that every 4S columns of $latex A $ are approximately orthonormal. This is the idea behind the Restricted Isometry Property (RIP). It turns out that many well known Random Matrices satisfy this property, such as Gaussian Random Matrices.

Technically, the RIP is stated as:

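Suppose that there exists a constant $latex \delta_{s} $ such that, for every m × s submatrix of $latex A $ (equivalently, for every vector $latex x $ supported on at most s columns) we have

$latex (1-\delta_{s})\|x\|_{2}^{2}\leq\|Ax\|_{2}^{2}\leq(1+\delta_{s})\|x\|_{2}^{2} $

whenever $latex \|x\|_{0}\leq s $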
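The RIP constant cannot be computed exactly for matrices of realistic size, but we can get a feel for it by sampling. The sketch below is an assumption-laden illustration: the function name `estimate_rip_constant` is hypothetical, and because it only samples random supports it yields a lower bound on $latex \delta_{s} $, never a certificate that the RIP holds.

```python
import numpy as np


def estimate_rip_constant(A, s, trials=2000, seed=0):
    """Monte Carlo lower bound on the RIP constant delta_s of A.

    For a fixed set of s columns, the extreme squared singular values of the
    m x s submatrix bound ||A_s x||^2 / ||x||^2. The true delta_s is a maximum
    over all supports, so random sampling can only underestimate it.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    delta = 0.0
    for _ in range(trials):
        cols = rng.choice(n, size=s, replace=False)
        sv = np.linalg.svd(A[:, cols], compute_uv=False)  # sorted descending
        delta = max(delta, abs(sv[0] ** 2 - 1.0), abs(sv[-1] ** 2 - 1.0))
    return delta


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    m, n = 64, 256
    # Columns have unit norm in expectation with this scaling.
    A = rng.standard_normal((m, n)) / np.sqrt(m)
    for s in (2, 4, 8):
        print(s, estimate_rip_constant(A, s, trials=500))
```

Looking at the singular values of random column subsets is the same information a spectrum plot of the Gaussian submatrices would show: the closer they cluster around 1, the closer the columns are to orthonormal.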

If the matrix $latex A $ satisfies some form of RIP, we can get a theoretical upper bound on how well BP will perform. A basic result shows that the true $latex \ell_{2} $ (i.e. RMS) error in the reconstructed signal is bounded by the $latex \ell_{1} $ (i.e. absolute) sample error:

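Theorem (Candes-Romberg-Tao):

$latex \|\hat{x}-x\|_{2}\leq C_{0}\,\epsilon+C_{1}\,\frac{\|x-x_{s}\|_{1}}{\sqrt{s}} $

where $latex \hat{x} $ is the BP solution, $latex x_{s} $ keeps only the $latex s $ largest entries of $latex x $, and $latex \epsilon $ bounds the measurement noise.

The tightest bound is due to Foucart, who shows that this bound holds when $latex \delta_{2s}<0.4652 $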

The astute reader will recognize that the RIP asks that the signal be sparse in an orthonormal basis and that the data matrix be uniformly incoherent.

Is this good enough? Have you ever seen a uniformly incoherent distribution in a real world data set?

D-RIP

Many signals may, in fact, only be sparse in a non-orthogonal, overcomplete basis. Recently, it has been shown how to define a D-RIP property [10,11] that applies to this more general case, even when the matrix is not as incoherent as we might expect or desire.

In some earlier posts, we introduced the Regularization (or Projection) Operator as a way to get at What a Kernel is (really). Recall that to successfully apply a Kernel, the associated, infinite order expansion should converge rapidly.

Likewise, here we define a similar Operator, $latex D $, that projects our sample data into the Hilbert Space, or Signal Domain.

It is in this space that the signal is expanded in the over-complete basis

$latex f=Dx $

We might also refer to $latex D $ as a frame, although I am skipping this technical detail for now. Essentially, this means that any infinite order expansion in $latex D $ converges rapidly enough that we can use it as a basis in our infinite dimensional space.
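For readers who want something tangible, here is a small finite-dimensional sketch of a tight frame. The spikes-and-cosines dictionary and the function name `spikes_and_cosines_frame` are assumptions picked for illustration only; they are not the chirplet frame we will use later.

```python
import numpy as np
from scipy.fft import dct


def spikes_and_cosines_frame(n):
    """Return an n x 2n Parseval tight frame D = [I, C] / sqrt(2).

    I is the identity (spikes) and the columns of C are the orthonormal
    DCT-II basis vectors, so D @ D.T equals the identity.
    """
    identity = np.eye(n)
    # Column j of dct(I, axis=0) is the DCT of e_j; transposing puts the
    # cosine basis vectors in the columns. Either orientation is orthogonal.
    cosines = dct(identity, norm="ortho", axis=0).T
    return np.hstack([identity, cosines]) / np.sqrt(2.0)


if __name__ == "__main__":
    n = 16
    D = spikes_and_cosines_frame(n)
    f = np.cos(2 * np.pi * 3 * np.arange(n) / n)
    coeffs = D.T @ f  # analysis coefficients D* f
    print("tight frame:", np.allclose(D @ D.T, np.eye(n)))
    print("reconstruction exact:", np.allclose(D @ coeffs, f))
```

The dictionary has twice as many columns as the signal has samples, so the expansion is not unique, yet analysis followed by synthesis still returns the signal and preserves its energy.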

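We can now adjust the RIP, extending the notion of considering every m × s submatrix to considering $latex \Sigma_{s} $, the union of all subspaces spanned by all subsets of size s of the columns of $latex D $.

Technically, the D-RIP is stated as:

Suppose that there exists a constant $latex \delta_{s} $ such that

$latex (1-\delta_{s})\|v\|_{2}^{2}\leq\|Av\|_{2}^{2}\leq(1+\delta_{s})\|v\|_{2}^{2} $

whenever $latex v\in\Sigma_{s} $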

Similarly to the RIP, Gaussian, subgaussian, and Bernoulli matrices satisfy the D-RIP with m ≈ s log(d/s). Matrices with a fast multiply (DFT with random signs) also satisfy the D-RIP with m approximately of this order.

There is also a known bound.

D-RIP Bound for Basis Pursuit (Needell, Stanford 2010):

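Let D be an arbitrary tight frame and let A be a measurement matrix satisfying D-RIP with $latex \delta_{2s}<0.08 $. Then the solution $latex \hat{f} $ to $latex \ell_{1} $-analysis satisfies

$latex \|\hat{f}-f\|_{2}\leq C_{0}\,\epsilon+C_{1}\,\frac{\|D^{*}f-(D^{*}f)_{s}\|_{1}}{\sqrt{s}} $

where $latex \epsilon $ bounds the measurement noise and $latex (D^{*}f)_{s} $ keeps only the $latex s $ largest coefficients of $latex D^{*}f $.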

The result says that $latex \ell_{1} $-regularization/analysis is very accurate when $latex D^{*}f $ converges quickly (has rapidly decaying coefficients).

This is the case in applications using Wavelets, Curvelets, and Chirplets.

Warning: Don’t just use Random Features

This does not say that we can use any random or infinite size basis. In particular, if we combine 2 over-complete basis sets, we might overtrain or, worse, fit a spurious, misleading signal. Been there, seen that. This is way too common in applied work, and is also the danger we face in trying to detect a weak signal in a sea of noise. And this is why we will add additional constraints.

Weak Signal Detection: A Preview

Recall that we wish to detect a very weak signal of the general form

in a sea of noise, where is a slowly varying function, and is some oscillatory function of unknown frequency, phase, and envelope. We need to detect the presence of –and the unknown critical time . We might be tempted to first think we can just create an overcomplete basis of functions for a variety of critical times and solve the Basis Pursuit problem.

This will most likely fail for 2 reasons:

each critical time defines its own frame, and we probably should not combine all basis sets / frames $latex g_{\lambda}(t,\tau) $ in the same optimization

even for a single pass of BP, there is just too much incoherent noise and/or perhaps even some other, transiently stable, detectable, but spurious signal.

Given these conditions, one might expect a clever, adaptive or matching pursuit-like approach to work better than vanilla constrained optimization (I’ll explain the difference in a bit). Recent research [12] indicates, however, that while we might believe

Folk Theorem: The estimation error one can get by using a clever adaptive sensing scheme is far better than what is achievable by a nonadaptive scheme.

that, in fact,

Surprise: The folk theorem is wrong in general. No matter how clever the adaptive sensing mechanism, no matter how intractable the estimation procedure, in general [one can not do better than] a random projection followed by $latex \ell_{1} $ minimization.

with the

Caveat: “This ‘negative’ result should not conceal the fact that adaptivity may help tremendously if the SNR [Signal-to-Noise Ratio] is sufficiently large.”

In our next post we will look at some real world examples of Chirplet Minimum Path Basis Pursuit using various path (and maybe even structural) constraints…stay tuned!

That is probably a fair statement. Most of the work goes into finding a good, over-complete basis–no surprise there. Notice these bounding theorems are used to justify using the L1 norm; the ideal solution is to find the L0 solution. Practically, this could be done with large scale Monte Carlo simulations, but the convex L1-norm problem is orders of magnitude easier to solve.

Also notice that we can do better than the Sampling Theorem with just random projections. If you think about that for a minute, it is astounding!

