First steps with Non-Linear Regression in R

Drawing a line through a cloud of point (ie doing a linear regression) is the most basic analysis one may do. It is sometime fitting well to the data, but in some (many) situations, the relationships between variables are not linear. In this case one may follow three different ways: (i) try to linearize the relationship by transforming the data, (ii) fit polynomial or complex spline models to the data or (iii) fit non-linear functions to the data. As you may have guessed from the title, this post will be dedicated to the third option.

What is non-linear regression?

In non-linear regression the analyst specify a function with a set of parameters to fit to the data. The most basic way to estimate such parameters is to use a non-linear least squares approach (function nls in R) which basically approximate the non-linear function using a linear one and iteratively try to find the best parameter values (wiki). A nice feature of non-linear regression in an applied context is that the estimated parameters have a clear interpretation (Vmax in a Michaelis-Menten model is the maximum rate) which would be harder to get using linear models on transformed data for example.

Fit non-linear least squares

The issue of finding the right starting values

Finding good starting values is very important in non-linear regression to allow the model algorithm to converge. If you set starting parameters values completely outside of the range of potential parameter values the algorithm will either fail or it will return non-sensical parameter like for example returning a growth rate of 1000 when the actual value is 1.04.

The best way to find correct starting value is to “eyeball” the data, plotting them and based on the understanding that you have from the equation find approximate starting values for the parameters.

#simulate some data, this without a priori knowledge of the parameter value
y
[1] 0.9811831
#plot the fit
lines(x,predict(m),col="red",lty=2,lwd=3)

Here it is the plot:

Using SelfStarting function

It is very common for different scientific fields to use different parametrization (i.e. different equations) for the same model, one example is the logistic population growth model, in ecology we use the following form:
$$ N_{t} = frac{K*N_{0}*e^{r*t}}{K + N_{0} * (e^{r*t} – 1)} $$
With (N_{t}) being the number of individuals at time (t), (r) being the population growth rate and (K) the carrying capacity. We can re-write this as a differential equation:
$$ dN/dt = R*N*(1-N/K) $$

library(deSolve)
#simulating some population growth from the logistic equation and estimating the parameters using nls
log_growth

This part was just to simulate some data with random error, now come the tricky part to estimate the starting values.
Now R has a built-in function to estimate starting values for the parameter of a logistic equation (SSlogis) but it uses the following equation:
$$ N_{t} = frac{alpha}{1+e^{frac{xmid-t}{scale}}} $$

#find the parameters for the equation
SS

We use the function getInitial which gives some initial guesses about the parameter values based on the data. We pass to this function a selfStarting model (SSlogis) which takes as argument an input vector (the t values where the function will be evaluated), and the un-quoted name of the three parameter for the logistic equation.

However as the SSlogis use a different parametrization we need to use a bit of algebra to go from the estimated self-starting values returned from SSlogis to the one that are in the equation we want to use.

That was a bit of a hassle to get from the SSlogis parametrization to our own, but it was worth it!

In a next post we will see how to go beyond non-linear least square to embrace maximum likelihood estimation methods which are way more powerful and reliable. They allow you to build any model that you can imagine.