Abstract

We address the problem of sparse selection in linear models. A number of nonconvex penalties have been proposed in the literature for this purpose, along with a variety of convex-relaxation algorithms for finding good solutions. In this article we pursue a coordinate-descent approach for optimization, and study its convergence properties. We characterize the properties of penalties suitable for this approach, study their corresponding threshold functions, and describe a df-standardizing reparametrization that assists our pathwise algorithm. The MC+ penalty is ideally suited to this task, and we use it to demonstrate the performance of our algorithm. Certain technical derivations and experiments related to this article are included in the Supplementary Materials section.
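For reference, the MC+ family mentioned above can be written in its standard form (this is the usual parametrization, with nonconvexity parameter $\gamma > 1$; as $\gamma \to \infty$ it approaches the $\ell_1$ (LASSO) penalty, and as $\gamma \to 1+$ it approaches the best-subset penalty):

$$
P(t;\lambda,\gamma) \;=\; \lambda \int_0^{|t|} \Bigl(1 - \frac{x}{\gamma\lambda}\Bigr)_+ \, dx
\;=\;
\begin{cases}
\lambda\bigl(|t| - \dfrac{t^2}{2\lambda\gamma}\bigr), & |t| < \lambda\gamma,\\[4pt]
\dfrac{\gamma\lambda^2}{2}, & |t| \ge \lambda\gamma.
\end{cases}
$$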

Regularization paths for the LASSO (left) and for MC+ (nonconvex) penalized least squares with γ = 6.6 (center) and γ = 1+ (right), the latter corresponding to the hard-thresholding operator (best subset). The true coefficients are shown as horizontal dotted lines. Here the LASSO is suboptimal for model selection, since it can never recover the true model. The other two penalized criteria select the correct model (vertical lines), with the middle one having smoother coefficient profiles than best subset on the right.

The penalized least-squares criterion with the log-penalty for (γ, λ) = (500, 0.5), for different values of β̃. The “*” denotes the global minimum of each function. The “transition” of the minimizers creates a discontinuity in the induced threshold operators.

Left: the df (solid line) for the (uncalibrated) MC+ threshold operators as a function of γ, for fixed λ = 1; the dotted line shows the df after calibration. Middle: family of MC+ threshold functions for different values of γ, before calibration; all have shrinkage threshold λS = 1. Right: calibrated versions of the same. The shrinkage threshold of the soft-thresholding operator is λ = 1, but as γ decreases, λS increases, forming a continuum between soft and hard thresholding.
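A minimal sketch of the (uncalibrated) MC+ threshold operator that generates the family described above, assuming a standardized univariate least-squares problem; the function name and signature are ours, not from the paper:

```python
import math

def mcplus_threshold(b, lam, gam):
    """Uncalibrated MC+ threshold operator applied to a univariate
    least-squares coefficient b, with shrinkage threshold lam > 0
    and nonconvexity parameter gam > 1."""
    ab = abs(b)
    if ab <= lam:
        # Inside the shrinkage threshold: coefficient is set to zero.
        return 0.0
    if ab <= gam * lam:
        # Intermediate zone: shrink, but less aggressively than soft
        # thresholding (the divisor 1 - 1/gam inflates the estimate).
        return math.copysign((ab - lam) / (1.0 - 1.0 / gam), b)
    # Beyond gam * lam: leave the coefficient untouched (no bias),
    # matching hard thresholding in the tails.
    return b

# As gam -> infinity this recovers soft thresholding (LASSO);
# as gam -> 1+ the intermediate zone vanishes and it recovers
# hard thresholding (best subset).
```

The divisor `1 - 1/gam` is what makes the operator continuous at `|b| = gam * lam`, which is the continuity property the figure's continuum between soft and hard thresholding relies on.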

Recalibrated values of λ via λS(γ, λ) for the MC+ penalty. The values of λ at the top of the plot correspond to the LASSO. As γ decreases, the calibration increases the shrinkage threshold to achieve a constant univariate df.

Top row: objective function for SparseNet compared to other coordinate-wise variants—Type (a), (b), and MLLA Type (c)—for a typical example. Plots are shown for 50 values of γ (some are labeled in the legend) and at each value of γ, 20 values of λ. Bottom row: relative increase I(γ, λ) in the objective compared to SparseNet, with the average Ī reported at the top of each plot.