Stochastic proximal-gradient algorithms for penalized mixed models

Abstract

Motivated by penalized likelihood maximization in complex models, we study optimization problems where neither the function to optimize nor its gradient has an explicit expression, but the gradient can be approximated by a Monte Carlo technique. We propose a new algorithm based on a stochastic approximation of the proximal-gradient (PG) algorithm. This new algorithm, named stochastic approximation PG (SAPG), combines a stochastic gradient descent step which, roughly speaking, computes a smoothed approximation of the gradient along the iterations, with a proximal step. The choice of the step size and of the Monte Carlo batch size for the stochastic gradient descent step in SAPG is discussed. Our convergence results cover both biased and unbiased Monte Carlo approximations. While the convergence analysis of some classical Monte Carlo approximations of the gradient has already been addressed in the literature (see Atchadé et al. in J Mach Learn Res 18(10):1–33, 2017), the convergence analysis of SAPG is new. Practical implementation is discussed, and guidelines to tune the algorithm are given. The two algorithms are compared on a linear mixed effects model as a toy example. A more challenging application is proposed on nonlinear mixed effects models in high dimension, with a pharmacokinetic data set including genomic covariates. To the best of our knowledge, our work provides the first convergence result for a numerical method designed to solve penalized maximum likelihood in a nonlinear mixed effects model.
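The SAPG scheme described above can be illustrated with a minimal numerical sketch. This is not the paper's implementation: the \(\ell _1\) penalty, the toy quadratic objective, the helper names (`soft_threshold`, `grad_mc`, `sapg`) and all step-size and batch-size choices below are assumptions made for illustration. The key ingredient is that the gradient of the smooth part is only available through a Monte Carlo estimate, which is smoothed across iterations by a stochastic approximation step before the proximal update.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1 (componentwise soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sapg(grad_mc, theta0, n_iter=500, lam=0.1, eta=0.5, seed=0):
    """Illustrative SAPG loop (hypothetical tuning choices):
    a Monte Carlo gradient estimate is smoothed by stochastic
    approximation, then a proximal step handles the penalty."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    s = np.zeros_like(theta)              # running smoothed gradient estimate
    for n in range(1, n_iter + 1):
        gamma = n ** -0.6                 # SA step size: sum = inf, sum of squares < inf
        m = 1 + n // 10                   # slowly increasing Monte Carlo batch size
        g_hat = grad_mc(theta, m, rng)    # noisy Monte Carlo gradient of the smooth part
        s = s + gamma * (g_hat - s)       # stochastic approximation smoothing
        theta = soft_threshold(theta - eta * s, eta * lam)  # proximal step for lam*||.||_1
    return theta
```

As a sanity check, for the smooth part \(f(\theta ) = \tfrac{1}{2}\Vert \theta - b\Vert ^2\) observed through a noisy gradient, the minimizer of \(f + \lambda \Vert \cdot \Vert _1\) is the soft thresholding of \(b\) at level \(\lambda \), which the iterates approach as the smoothed gradient estimate stabilizes.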

Note that the RHS and the LHS are equal when \(\theta = \theta _n\) so that for any point \(\tau \) which maximizes the RHS, it holds \({\mathcal {Q}}(\tau \vert \theta _n) - g(\tau ) \ge {\mathcal {Q}}(\theta _n \vert \theta _n) - g(\theta _n) \). This concludes the proof upon noting that such a point \(\tau \) is unique and equal to \(\theta _{n+1}\) given by Eq. (10).

Proof

By (H4a), there exists a constant \(C < \infty \) such that for any \(n \ge 1\) and \(1 \le j \le m_n, \Vert S(Z_{j,n-1})\Vert ^2 \le C \, W(Z_{j,n-1})\). In addition, by the drift assumption on the kernels \(P_\theta \), we have

Throughout the proof, we will write \(S_{n+1}\) instead of \(S_{n+1}^\mathrm {sa}\).

Proof of Theorem 4

We prove the almost-sure convergence of the three random sums given in Theorem 1. The third one is finite almost-surely since its expectation is finite (see Proposition 3). The first two are of the form \(\sum _n \mathsf {A}_{n+1} \left( S_{n+1} - {\bar{S}}(\theta _n) \right) \) where \(\mathsf {A}_{n+1}\) is, respectively,

Note that \(\mathsf {A}_{n+1} \in {\mathcal {F}}_n\) (the filtration is defined by Eq. (17)). By Lemma 7 and (H3b–c), in both cases, there exists a constant C such that almost-surely, for any \(n \ge 0\),

Proposition 4

Let \(\{\theta _n, n\ge 0\}\) be given by Algorithm 2. Assume H1, H3, (H4a–b) and (H5a). In the biased case, assume also (H4c) and (H5b). Let \(\{\mathsf {A}_n, n \ge 0 \}\) be a sequence of \(d' \times q\) random matrices such that for any \(n \ge 0, \mathsf {A}_{n+1} \in {\mathcal {F}}_n\), and there exists a constant \(C_\star \) such that almost-surely

We give the proof of the convergence of the last term in the biased case: \({\mathbb {E}}\left[ S(Z_{k,n}) \vert {\mathcal {F}}_n \right] \ne {\bar{S}}(\theta _n)\). The proof in the unbiased case follows the same lines, with \(R_{j,1} = R_{j,2} =0\) and \({\hat{S}}_\theta = S\). Set \({\overline{\mathsf {D}}}_j :=\delta _j (1+\mathsf {D}_{j+1})\). By Lemma 8, there exists \({\hat{S}}_\theta \) such that

Upon noting that \({\mathbb {E}}\left[ \mathsf {A}_j \partial M_j \vert {\mathcal {F}}_{j-1} \right] = 0\), the almost-sure convergence of the series \(\sum _j {\overline{\mathsf {D}}}_j \mathsf {A}_j \partial M_j\) is proved by checking criteria for the almost-sure convergence of a martingale. By (35), there exists a constant C such that
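For the reader's convenience, a standard sufficient condition of this type is the almost-sure convergence theorem for square-integrable martingales (stated here for reference; the proof may check a variant of it). Since \({\overline{\mathsf {D}}}_j\) and \(\mathsf {A}_j\) are \({\mathcal {F}}_{j-1}\)-measurable and \({\mathbb {E}}\left[ \mathsf {A}_j \partial M_j \vert {\mathcal {F}}_{j-1} \right] = 0\), the partial sums form a martingale, and it suffices that the conditional variances of the increments be summable:

```latex
% Almost-sure convergence criterion for a square-integrable martingale:
% summable conditional second moments of the increments imply a.s. convergence.
\[
  \sum_{j \ge 1} \mathbb{E}\big[ \Vert {\overline{\mathsf{D}}}_j \,
    \mathsf{A}_j \, \partial M_j \Vert^2 \,\big\vert\, \mathcal{F}_{j-1} \big]
    < \infty \ \text{ a.s.}
  \quad \Longrightarrow \quad
  \sum_{j \ge 1} {\overline{\mathsf{D}}}_j \, \mathsf{A}_j \, \partial M_j
  \ \text{ converges a.s.}
\]
```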