Although seemingly unrelated at first glance, the layers of a deep neural network (DNN) can be viewed as iterations of some algorithm that have been unfolded into a network structure (Gregor and LeCun, 2010; Hershey et al., 2014). In particular, iterative thresholding approaches such as those mentioned above typically involve an update rule comprised of a fixed, linear filter followed by a non-linear activation function that promotes sparsity. Consequently, algorithm execution can be interpreted as passing an input through an extremely deep network with constant layer weights at every layer.
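The update just described can be made concrete with a short sketch of ISTA, where each iteration is exactly one "layer": a fixed linear filter followed by a sparsity-promoting soft threshold (variable names and parameter values here are illustrative, not taken from the cited papers):

```python
import numpy as np

def soft_threshold(v, lam):
    """Sparsity-promoting nonlinearity (the 'activation function')."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ista(A, y, lam=0.1, n_iter=100):
    """Solve min_x 0.5*||y - A x||^2 + lam*||x||_1 by iterative
    soft thresholding. Every iteration applies the same fixed linear
    filter and threshold -- i.e., one layer with constant weights."""
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x + A.T @ (y - A @ x) / L, lam / L)
    return x
```

Unrolling the loop body `n_iter` times, with the filter `A.T @ (...)/L` as the layer weights, gives precisely the deep-network view described above.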

The iterations of many sparse estimation algorithms are comprised of a fixed
linear filter cascaded with a thresholding nonlinearity, which collectively
resemble a typical neural network layer. Consequently, a lengthy sequence of
algorithm iterations can be viewed as a deep network with shared, hand-crafted
layer weights. It is therefore quite natural to examine the degree to which a
learned network model might act as a viable surrogate for traditional sparse
estimation in domains where ample training data is available. While the
possibility of a reduced computational budget is readily apparent when a
ceiling is imposed on the number of layers, our work primarily focuses on
estimation accuracy. In particular, it is well-known that when a signal
dictionary has coherent columns, as quantified by a large RIP constant, then
most tractable iterative algorithms are unable to find maximally sparse
representations. In contrast, we demonstrate both theoretically and empirically
the potential for a trained deep network to recover minimal $\ell_0$-norm
representations in regimes where existing methods fail. The resulting system is
deployed on a practical photometric stereo estimation problem, where the goal
is to remove sparse outliers that can disrupt the estimation of surface normals
from a 3D scene.

While ISTA and related algorithms are interesting, AMP solvers with a similar structure ought to be investigated as well, since AMP is expected to require far fewer iterations than FISTA and related traditional methods.

The following two earlier papers focused on sparse coding and NMF (they are listed below). It seems to me that considerably more insight could be gained from studies of FISTA and related algorithms for compressive sensing, though I may well be biased on that point.

Model-based methods and deep neural networks have both been tremendously
successful paradigms in machine learning. In model-based methods, problem
domain knowledge can be built into the constraints of the model, typically at
the expense of difficulties during inference. In contrast, deterministic deep
neural networks are constructed in such a way that inference is
straightforward, but their architectures are generic and it is unclear how to
incorporate knowledge. This work aims to obtain the advantages of both
approaches. To do so, we start with a model-based approach and an associated
inference algorithm, and \emph{unfold} the inference iterations as layers in a
deep network. Rather than optimizing the original model, we \emph{untie} the
model parameters across layers, in order to create a more powerful network. The
resulting architecture can be trained discriminatively to perform accurate
inference within a fixed network size. We show how this framework allows us to
interpret conventional networks as mean-field inference in Markov random
fields, and to obtain new architectures by instead using belief propagation as
the inference algorithm. We then show its application to a non-negative matrix
factorization model that incorporates the problem-domain knowledge that sound
sources are additive. Deep unfolding of this model yields a new kind of
non-negative deep neural network that can be trained using a multiplicative
backpropagation-style update algorithm. We present speech enhancement
experiments showing that our approach is competitive with conventional neural
networks despite using far fewer parameters.
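The multiplicative-update flavor referenced above echoes the classical Lee–Seung iterations for NMF, which keep the factors non-negative by construction. A minimal sketch of those base updates (this is the underlying inference algorithm, not the unfolded network itself; all names and the small epsilon guard are illustrative):

```python
import numpy as np

def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Approximate V ~= W @ H with W, H >= 0 via Lee-Seung
    multiplicative updates for the Frobenius objective. Because each
    update multiplies by a ratio of non-negative quantities, the
    factors never leave the non-negative orthant."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Unfolding would untie these updates across layers and train the resulting per-layer parameters discriminatively, as the abstract describes.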

In Sparse Coding (SC), input vectors are reconstructed using a sparse linear combination of basis vectors. SC has become a popular method for extracting features from data. For a given input, SC minimizes a quadratic reconstruction error with an L1 penalty term on the code. The process is often too slow for applications such as real-time pattern recognition. We propose two versions of a very fast algorithm that produces approximate estimates of the sparse code that can be used to compute good visual features, or to initialize exact iterative algorithms. The main idea is to train a non-linear, feed-forward predictor with a specific architecture and a fixed depth to produce the best possible approximation of the sparse code. A version of the method, which can be seen as a trainable version of Li and Osher's coordinate descent method, is shown to produce approximate solutions with 10 times less computation than Li and Osher's for the same approximation error. Unlike previous proposals for sparse code predictors, the system allows a kind of approximate "explaining away" to take place during inference. The resulting predictor is differentiable and can be included into globally trained recognition systems.
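The fixed-depth feed-forward predictor described here follows the LISTA recurrence z_{t+1} = h_theta(We x + S z_t). A minimal forward-pass sketch, with the ISTA-style parameter initialization noted in comments as an assumption (in the trained system We, S, and theta would be learned by backpropagation rather than derived from a dictionary):

```python
import numpy as np

def soft_threshold(v, theta):
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def lista_forward(x, We, S, theta, depth=3):
    """Fixed-depth approximation of the sparse code:
    z_0 = h_theta(We @ x), z_{t+1} = h_theta(We @ x + S @ z_t).
    The S @ z term lets active units suppress correlated ones,
    giving the approximate 'explaining away' mentioned above."""
    b = We @ x
    z = soft_threshold(b, theta)
    for _ in range(depth - 1):
        z = soft_threshold(b + S @ z, theta)
    return z

# Illustrative ISTA-derived initialization for a dictionary A with
# Lipschitz constant L: We = A.T / L, S = np.eye(n) - A.T @ A / L,
# theta = lam / L. Training would then refine all three.
```

Because the depth is fixed, the cost per input is constant, which is what makes the predictor fast enough for real-time feature extraction.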