Post navigation

R for Ecologists: Putting Together a Piecewise Regression

Piecewise regression comes about when you have ‘breakpoints’, where there are clearly two different linear relationships in the data with a sudden, sharp change in directionality. This crops up occasionally in ecology when dealing with, for example, species richness of understory plants and forest age. There is initially a rapid drop as succession progresses (rapid colonizing plants die as light becomes more limiting), then a sudden breakpoint and a more moderately paced rise in species richness as more shade tolerant plants and seedlings of older trees begin to colonize the mature forest.

If you Google ‘R piecewise regression’, you may get a variety of methods and advice on how to run a piecewise regression. Essentially, you can do it manually or use a canned package to run the regression. If you’re like me, you might be wondering: Are these methods are the same? If not, which is best? How do I actually use them? Well, hopefully I’ll answer some of these questions.

There are essentially two methods I’ll walk through: brute force iterative approaches (as in ‘The R Book’ by Crawley) and the ‘segmented’ package. Using these approaches allows you to statistically estimate the breakpoint, which is better than just eyeballing it and fitting two models around what you think is the breakpoint. Let statistics do the work for you objectively.

You should get a V-shaped plot, where the right side is a little shallower sloped.

Like this

METHOD 1: ITERATIVE SEARCHING

First, we’ll model the data using the iterative search procedure described by Crawley. We’re going to use a model that looks like this:

Note that the multiplication symbol is used in the R model definition sense, not the mathematical sense, meaning main effects and interactions for both variables

In this case, c is the breakpoint. I(x < c) and I(x > c) are essentially dummy variables. I(x < c) is 1 if x is less than the breakpoint and 0 if it is above. I(x > c) is a dummy variable that is 1 if x is above the breakpoint and 0 if it is below. It should be obvious that there are two sets of parameters being modeled, depending on the value of x.

The complicated bit is choosing the breakpoint. We can eyeball the data and say that the breakpoint is somewhere between 9 and 17. Choose a wider range than you might think, just to be safe. Create a variable called breaks to hold these breakpoints:

breaks <- x[which(x >= 9 & x <= 17)]

Now we’re going to iteratively search these breakpoints for the model that has the lowest residual MSE, using that as our criteria for the best model. Create an empty container for MSE values from each model, and use a for( ) loop to run a linear regression for each possible breakpoint. Formulate the linear model exactly like the above formula.

If we plot MSE by breakpoints, we can visually estimate the breakpoint as the lowest point on the curve:

It could easily be either 13 or 15, so I just pick the breakpoint with the lowest error:

breaks[which(mse==min(mse))]

This tells me that the breakpoint in my simulation was 15. I run my final model with my newly estimated breakpoint:

piecewise2 <- lm(y ~ x*(x < 15) + x*(x > 15))
summary(piecewise2)

The summary looks like this:

This is a pretty nasty output. Read it like this: (Intercept) is more or less useless on its own. The intercept for the line when x < 15 is (Intercept) + x < 15TRUE, or 3.3133 + 16.6352 = 19.9485. The slope of the line when x < 15 is x + x:x<15TRUE, 0.5843 – 1.3025 = -0.7182. So, when x is less than 15, the formula is 19.9485 – 0.7182x.

When x is greater than 15, the intercept is 3.3133 – 0.9116 = 2.4017 and the slope is just the variable x, 0.5843. So when x is greater than 15, the line has the equation 2.4017 + 0.5843x. Notice this table excludes the coefficients with NA estimates. Those arise from singularities and can be ignored.

Notice that the segments were not constrained to be touching or continuous. This is inherent in the algorithm that we used.

METHOD 2: USE ‘SEGMENTED’ PACKAGE

To use the ‘segmented’ package you will, of course, need to install and load it. The procedure of ‘segmented’ uses maximum likelihood to fit a somewhat different parameterization of the model:

More or less. This isn’t exactly it but it’s close enough

I(x > c) is a dummy variable as above, so when x < c, the model is essentially:

The γ term is simply a measure of the distance between the end of the first segment and the beginning of the next. The model converges when γ is minimized, thus this method constrains the segments to be (nearly) continuous. This is a major difference from the iterative approach in Method 1 above.

To use this method, you first fit a generic linear model. You then use the segmented( ) function to fit the piecewise regression. The segmented( ) function takes for its arguments the generic linear model, seg.Z which is a one sided formula describing the predictor with a segment (we only have one predictor, x, which has the segment), and psi, which is a starting value of the breakpoint (as in nls( ), you need to supply a best-guess estimate). More complicated models are a bit more complicated in terms of arguments, but this is a good starting example.

In our case, x is the predictor with a segment (it’s the only predictor) and based on my very first scatterplot (the first graph on the page), I’m going to guess the breakpoint is 14.

Using the summary(segmented.mod) command, you will get the estimated breakpoint +/- some error (12.07 for me), the intercept and slope for the first segment, and U1.x, which is the slope of the second segment.

UPDATE:

U1.x is not the slope of the second segment. It is the difference in slopes between the first and second segment. So if your coefficients are x = -0.9699 and U1.x = 1.4163, then the slope of the second segment is -0.9699+1.4163 = 0.4464. Using the slope(segmented.mod) command will give you the slopes of each segment, which should match the above calculation.

Plotting it is simple:

plot(x,y, pch=16, ylim=c(5,20))
plot(segmented.mod, add=T)

There you go: two methods for piecewise regression. Each is relatively simple to implement, but they do very different things. If you are dead set on piecewise regression, make sure you choose the one that makes the most sense for your question: do the segments need to be continuous or can they be discontinuous? Make sure you have a rationale for either.

Also note that these models are not nested, so you can’t used likelihood ratio tests for model selection. You can try to use AIC for model selection, but I advise you to think logically about the continuity of the segments instead.

56 thoughts on “R for Ecologists: Putting Together a Piecewise Regression”

This piecewise regression is also called sometimes “threshold regression”. In the time series context, threshold auto-regressive models (where y_{t-1} is the regressor) are available in the package tsDyn also.

Hi Nathan, I am also currently fiddeling around with PW-regressions. Did you ever a lattice plot with multiple segmented regressions in a lattice layout? i.e. 2 rows 3 columns and each one with its one seg-regression? nice post!

I haven’t before, but I just played around with it based on my example. segmented.mod$psi contains the breakpoints, so you can make use of that. Here’s a fairly simple code to do what you’re requesting based on the above example, which is simplified with one predictor and one break (you can view this code as a continuation of the post). This also assumes that you’re using the model from the ‘segmented’ package as the preferred model.

Hi, that´s very nice! I was thinking of another output though. I have 20 stations for which I want to have a PW-regression in each panel. Hence it would be something like xyplot(hour ~ ppm | station, data=data.frame,panel=function(x,y,…) { segmented regression here?}

But many thanks for your example! this brought me again a bit closer. I think I will give it a try on stackoverflow.com
cheers

Hi Nathan,
Some time ago I wrote the following code for a grid search, am not 100% sure but I think this is what you refer to as “iteratively”. Code can be found here: http://eranraviv.com/blog/piecewise-regression/
Thanks for the post!

Note that for optimizing piecewise regressions without constraints for adjacent segments, the “strucchange” package can be used. It assumes that the observations are already ordered with respect to the variable along which you want to split (which may be different from the variable/s you regress on). In this case, the observations are already ordered with respect to x, hence I can directly do the following:

This considers all possible solutions with a minimum segement size of 4 observations, i.e., up to 3 breakpoints. The plot then compares the best residuals sum of squares and associated BIC that can be obtained for 0, 1, 2, 3 breaks. There’s a clear drop from 0 to 1 where BIC is minimized. The associated breakpoint is

bp
x[bp$breakpoints]

Note that returns the last value of x in the first segment (not the first value of x in the second segment as your solution I did). The associated models are of course equivalent. Finally, you can extract intercept and slope in both segments or visualize the models. A lattice xyplot that one of the previous commenters suggested can also be easily created:

For more details on finding breakpoints with strucchange, see the associated manual page and also the corresponding 2003 paper published in Computational Statistics and Data Analysis – the second reference in citation(“strucchange”).

How about fitting a model like y = \beta_0 + \beta_1 * abs(x – \beta_x) + \beta_2 * x
say using nls? I’ve used this with some success in circumstances where I needed a
two-segment fit. The slopes of the two segments are \beta_2 + \beta_1 and \beta_2 – \beta_1 and the intersection is \beta_x. What’s nice is that you can parameterize it to constrain one or both slopes to have fixed values, as well.

Hi, I have been using the segmented package to fit regressions to describe a stepwise change in heart rate in single individuals and it has worked brilliantly for this purpose. I want to compare levels of variation around each of the segments (two in our instance), however in the output I get both a coefficient and S.E for each line through the summary(model) and if I use the slope(model) I get a slope and S.E too. The output from slope(model) makes more sense in light of my data i.e. a negative value for a slope corresponds to a negative slope, whereas the same is not true for the summary(model).

However, I just wanted to clarify whether this approach is correct and that slope(model) produces the values I should be using to describe the gradient of slopes and the error about each slope?

I updated the post to address your issue because there was a mistake in it that you caught. Using the segmented package, the summary(model) gives you x and U1.x. x is the slope of the first segment, and U1.x is the difference between the original slope and the new slope. slope(model) gives the correct slopes, but I’m betting that if you add x and U1.x together, you’ll get the same values reported by slope(model). At least, that’s how it worked for me.

Hi Nathan, thank you for the excellent article. I have two questions: (1) How would you change your R code to estimate a piecewise regression on time series data in an xts or ts object? and (2) Is it possible to estimate a piecewise multiple regression where only some (not all) of the independent variables have breaks?

1) I’m not too familiar with this. My default starting point would be to compare linear models with and without autoregressive error functions using AIC. If the autoregressive error function doesn’t add much to the model, then I’d treat the time series as a regular linear model and use the standard technique. I’m not totally sure it’s appropriate, but that’s where I’d start. If you must use a time series analysis because of autocorrelation, I’d look into the ‘tsDyn’ package suggested by Matthieu (he mentions threshold autoregression, which may be what you’re looking for).

2) I believe the ‘segmented’ package can do this. You specify your regular linear model (lin.mod <- lm(y ~ x1*x2)) and then in the segmented command, you only specify whichever effects you are interested in (segmented.Z=~x1). I'd double check on that, but that's the impression I get from the help file. Check the Rnews article for more info: http://dssm.unipa.it/vmuggeo/segmentedRnews.pdf

Hi! I am a new R user and I am trying to use Method 1 above to find a breakpoint in my data. Is there a way to get more refinement in the breakpoint? That is, I would like more sig figs in the output. Is it always going to yield a whole number?

It will in my example because I was only searching whole numbers. If you set up a sequence of numbers with more significant digits, you can get more refinement. For example, I was searching only integers between 1 and 20 (or so). In R, this is 1:20. If I want to search between 1 and 20, but include three significant digits, I set up a range of numbers seq(1, 20, by=0.001). Be careful, because this is an incredibly huge amount of numbers, so you need to balance the range you’re searching (i.e. 1 and 20) with the precision (i.e. 0.001) to optimize the amount of time it takes. So in my example, I was searching integers 5:20, which I set up using breaks <- 5:20.

If I wanted to search between 10 and 16, by every 0.01, I reset breaks as breaks <- seq(10, 16, 0.01).

Much of this discussion illustrates a misuse of statistical significance

How do you think this statistical question could be better put. Could you fit a linear model to the entire data set and compare with a piecewise regression with a breakpoint 16 years ago ?. A two regression line model obviously has more parameters than a single line – would AIC/BIC adequately test this ?

I hadn’t heard of this argument. Thanks for pointing it out. Clearly, warming trends are difficult to analyze and depend entirely on the dataset used, the location in question (some places are warming faster than others), the length of the time series used, etc. I’m not familiar enough with this argument or the data in question to weigh in on the argument. So my comment is purely statistical (which I think is what you’re getting at).

First, fitting a linear regression to time series data is inappropriate (which it looks like is what the blog author did). This is a time series, and you can clearly see autocorrelation present in the graph. So you need an autoregressive model, and you can use piecewise regressions with autoregressive functions.

Second, I think AIC would be a good way to test this. In fact, let’s say you use the absolute value function to fit the piecewise regression, where y = b0 + b1*x + b2*|x-c|. You can fit a second, OLS model: y = b0 + b1*x. The linear model is a subset, or nested within, the absolute value function (set b2 = 0). I believe (not 100% sure, but pretty sure) in this case, not only can you use AIC, but you can use a log-likelihood ratio test since one model is nested within the other. If you use the other methods described, you may have to use AIC, which can distinguish among non-nested models.

Note that in the case you have, the “transition” variable is not a simple variable, but it is time. Although this seems to be a small difference, it has some implications on the inference for the parameters, as well as for the testing procedure, so I do not think you can use directly these methods.

However, R has a very nice strucchange package, that will allow you to do everything you want to estimate and test the “structural break” you mention :-)

Thanks! I never deal with time series data, but I do know that when time is the predictor, things get tricky. I keep referring people to your comment when dealing with autoregressive piecewise regressions because strucchange and tsDyn seem to handle this issue well.

Great post, I am interested in using segmented regression to model the thermoneutral zone and lower critical temperature of metabolism in endotherms. The thermoneutral zone theoretically has zero slope (metabolism doesn’t change with temperature in this zone). Is there a way to force the slope of one segment of the regression to be zero???

Thanks a lot for this very clear paper.
I am also playing around with piecewise regressions and was wondering if with your ‘manual’ method there is a way to impose a zero-slope on one of the piecewise regressions.
Thank you,

Yes, there is. You can use dummy variables in a clever way to get this to happen. Imagine you’ve run the LS algorithm to find the breakpoint (the ‘manual’ method). The following code will force the line greater (or less than) the break point to be flat.

Note that this still allows discontinuous jumps. I tried playing with the segmented package to get a continuous line, but I haven’t come up with a way just yet. I’m sure there’s a way to incorporate this setup into the absolute value function as well, but I haven’t played with it much. This should get you started though.

Sure with the absolute value function, you can constrain the slopes of either segment. To refresh from above, the function to fit is
y = \beta_0 + \beta_1 * abs(x – \beta_x) + \beta_2 * x
You can calculate the values y at 0, \beta_x and 2 * \beta_x from which you can derive expressions for the slopes of the two segments. Then, fix one of the slopes to an arbitrary value, say 0, and solve for the values of the beta’s that you need to fix. Again, see

Dear Nathan,
thank you for the codes, it is very helpful. Would there be another reference for the iterative method than ‘The R Book’ by Crawley ? I looked in that book and it doesn’t give any references either…

Could you precise the page in Crawley you found this? It is somehow srange to coin this iterative search, as here evaluations are independent of each other.

This type of search comes by multiple names, grid search regarding the algorithm, and regarding the estimator, profile likelihood/SSR, or concentrated least square (as you search only the hyper-parameter, concentrating out/profiling the other slope parameters).

Dera Nathan,
I used tow segmented linear function. where the the slope of one segment is fixed to Zero while the intercept is varying. the breakpoint and the other Segment has no constraints. This was done for different genotypes and what i want to do is to compare the model parameteres between genotypes somthing like (Tukey test ) and to have the significant lettter. is that possible?
thank you in advance

Hi Nathan,
I am trying to fit some segmented models to a large and messy data set and your post has been very helpful. I have played around a lot with the segmented package but i keep getting wildly different results for my segmented models on the same variables. This has made me reconsider doing it manually. I have followed your procedure and I get results but they are 1) fairly different from the segmented() model and 2) different from what I would guess. Maybe there is just a difference that I need to make in the coding of the for() loop. This is what I have.

First, I can’t tell what the first line does, and it may just be a wordpress error. Remember that you’re manually setting an area to search, so it should be something like:

breaks = -2 & RE <= 2)]

Second, I define the model differently. I'd use:
piecewise1 <- lm(WTM ~ RE*I(RE = breaks[i]))
That’s the proper model to give two sets of coefficients for each segment. Your model is mis-specified (and not really doing anything, in the RE=breaks[i] part. I’m surprised it doesn’t throw out an error here, because that’s not a logic statement).

Thanks this is great. One other question. When reading directly from the output on what the p values mean. From top to bottom.

A) Test of whether the intercept is different from 0 (there is a linear relationship)
B) Test for whether the slope of the line to > BP is different from 0
C) Test for whether the intercept of the line BP is different from A
E) Test for whether the slope of the line < BP is different from B

And the f stat and pvalue for the whole output is testing whether the segmented relationship is "real" overall.

I’ve found your R code extremely helpful in implementing my own two-piece regression model. However, it seems like my data may be better suited to a three-piece model. I’m wondering how you would adapt your code (method 1) to find two break points. I’ve been trying to figure it out, but I’m not amazing with R.

Thats a great question that I don’t know the answer to off hand. I know the segmented package can do multiple break points, but beyond that would probably take me a while to figure out. Try starting here: https://rpubs.com/MarkusLoew/12164

Thank you for providing an easy to follow explanation of piecewise regression. I have been using the first method to find the break point in my data but I’m also interested in the equations for the two lines fitted. I had not had any problems until one set of data resulted in the output below:

Hi Nathan,
Thank you very much for this great article about piecewise regression. I have been using the second method to find the break point in my data, and now I would like to find the confidence intervals for the breakpoint itself. Do you know of any method to do this?

If you’re referring to the lm() approach, there isn’t CI for the breakpoint. This is because its a part of the model you’re specifying, not an estimated parameter. If you want to get CIs on a break point, either use absolute value regression or the ‘segmented’ package.

Dear Nathan,
Thank you very much for your very helpful blog about the piecewise regression. I just started working with it but I’m a bit wondering how robust the package “segmented” is. Maybe I did something wrong but if I let the code run several times without changing anything I get different values for the breakpoint. In this case it does not matter if I use psi=17 or psi=median(x) I anyhow get differing values for the breakpoint. Do you know how to overcome this problem? Any advise? Thank you very much already in advance!
Kind regards,
Nadine

Dear Nathan,
Thanks for your helpful blog. I’m working with piecewise regression and i was trying the first method (iterative searching). I was trying to understand what is doing exactly this R-script and to do that I tried to decompose the piecewise function into its two parts. For instance, taking your generated segmented data, I took just one break x-point (x=9) to see what’s going on in each specific point, so I run the model:
piecewise1 <- lm(y ~ x*(x =9))
and I did the same whit only the “first term” and the second one separately:
piecewise1.1 <- lm(y ~ x*(x < 9))
piecewise1.2 =9))
These three models had exactly the same RSE, F-value, p-value, adjustedR2 for the whole model, but they had some differences in the coefficients (intercept, x) values output between the “two-pieces model” (piecewise1) and the partial model 1.2 (piecewise1.2 =9)). But, partial models 1.1 and 1.2 had exactly the same coefficients values for x=9, and x:x=9, but with the opposite sign (see the R-outputs below).
My queries are:

1. I don’t understand exactly what is mathematically happening behind this R-code: if the code splits up the total data in two (or more) segments, how is possible to obtain unique RSE, F-value, p-value, adjustedR2 values for the whole model? How are these values calculated?

2. Do you think it is correct to work only with one part of the piecewise model (piecewise1.1 <- lm(y ~ x*(x < 9)) ) instead the complete model (piecewise1 <- lm(y ~ x*(x =9))) when you are looking for the breakpoint? (Because the mse -or RSE value- It’s exactly the same in the three cases and, if I understood it right, this is the only value interesting to look for the breakpoint).

Dear Nathan,
First of all, thanks a lot for your so quick reply!! I’m sorry but something weird is happening with my first post! I revised it and the models are not those that I posted (?). I write them all again (with the outputs that I got):
(May I have your e-mail adress to send to you the codes in another format to avoid these kind of problems?)

Nathan,
It happened again. I revised my second post and my models have been altered! I don’t understand what’s going on, but I think it could be a problem of format. May I have your e-mail adress to send to you a word document with my codes?
Thanks a lot!
S.

Nathan,
It happened again. I revised my second post and my models have been altered! I don’t understand what’s going on, but I think it could be a problem of format. May I have your e-mail adress to send to you a word document with my codes?
Thanks a lot!
S.

Ah I see. No I don’t, I’m not sure you can get an R2 for each particular segment, although the output should show you R2 for the whole model.

You can try a ‘pseudo-R2′, which would just be the square of the Pearson correlation coefficient between fitted and observed values in each segment. Or just use the Pearson correlation coefficient without squaring.