Analyzing Customer Churn – Time-Dependent Covariates

My previous series of guides on survival analysis and customer churn has become by far the most popular content on this blog, so I'm coming back around to introduce some more advanced techniques...

When you're using Cox regression to model customer churn, you're often interested in the effects of variables that change throughout a customer's lifetime. For instance, you might want to know how many times a customer has contacted support, how many times they've logged in during the last 30 days, or what web browser(s) they use. If you have, say, 3 years of historical customer data and you fit a Cox regression on that data using the covariate values that apply to customers right now, you'll essentially be regressing customers' churn hazards from months or years ago on their current characteristics. Your model will be allowing the future to predict the past. Not terribly defensible.

In the classic double-slit experiment, past events are seemingly affected by current conditions. But unless you're a quantum physicist or Marty McFly, you're probably not going to see causality working this way.

In this post, we'll walk through how to set up a Cox regression using "time-dependent covariates," which will allow us to model historical hazard rates on the covariate values that applied at the time.

Setting up the data

Much like any statistical project, the hardest part of Cox regression with time-dependent covariates is setting up the data. In traditional survival analysis, you usually have one record per subject (in our case, a customer), which simply includes the customer's age (either at present, or on the day she churned), and a dummy variable indicating whether the customer churned or got censored. If any covariates (say, gender) are going to be added to the survival model, they're simply added to the single record for each subject. Easy.

Time-varying covariates make this a little bit more complicated. To use a time-varying covariate, you must divide a customer's lifetime into "chunks" where the various values of the covariates apply. For example, check out this snippet of data below that includes survival data, plus an indicator showing whether a customer has contacted support:
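The original table isn't reproduced here, but as an R data frame it would look something like this (the column names are assumptions; the values match the two customers described next):

```r
# Counting-process ("start, stop") layout: one row per chunk of a
# customer's life, with the covariate value that applied during it.
churn_data <- data.frame(
  customer_id       = c(1000, 1001, 1001),
  start_time        = c(0,    0,    649),
  end_time          = c(1000, 649,  655),
  contacted_support = c(FALSE, FALSE, TRUE),
  churned           = c(FALSE, FALSE, TRUE)
)
churn_data
```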

Instead of simply an end time and a churn indicator, we now have an additional start time variable. Using the start time and end time, we can now break a customer's lifetime into pieces. For example, in the data above, customer 1000 has been around for 1000 days, has never contacted support, and hasn't churned yet. Customer 1001 first contacted support on day 649 (and therefore hadn't contacted support on days 0-648), then churned on day 655.

Now, getting your data structured this way may not seem too difficult and, for one variable, it's not that bad. But there are several complicating factors, which I discuss below.

For now, on to the modeling! If you'd like to work with the full set of dummy data used for this post, you can grab it here.

Doing some analysis!

Once you have your data set up, running the actual Cox regression looks pretty much like running any other Cox regression.
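In R with the survival package, the call looks something like this. (The data frame below is a small simulated stand-in for the post's dummy data, and the column names are assumptions; substitute your own.)

```r
library(survival)

# Simulate a few hundred customers in counting-process format: each
# contributes one row, or two rows if they contacted support mid-life.
set.seed(42)
make_customer <- function(id) {
  lifetime <- round(runif(1, 100, 1000))
  churned  <- runif(1) < 0.5
  contact  <- if (runif(1) < 0.4) round(runif(1, 1, lifetime - 1)) else NA
  if (is.na(contact)) {
    data.frame(customer_id = id, start_time = 0, end_time = lifetime,
               contacted_support = FALSE, churned = churned)
  } else {
    data.frame(customer_id = id,
               start_time        = c(0, contact),
               end_time          = c(contact, lifetime),
               contacted_support = c(FALSE, TRUE),
               churned           = c(FALSE, churned))
  }
}
churn_data <- do.call(rbind, lapply(1:200, make_customer))

# Surv(start, stop, event) is what handles the time chunks; otherwise
# this is an ordinary coxph call.
cox_model <- coxph(
  Surv(start_time, end_time, churned) ~ contacted_support,
  data = churn_data
)
summary(cox_model)

# Test of the proportional hazards assumption
cox.zph(cox_model)
```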

If you run this code on my dummy data, you'll get something that looks like this...

These results indicate that customers who have contacted support churn at 1.89 times the rate of those who haven't - see the exp(coef) for contacted_support, which is the estimated hazard ratio. That's a highly statistically significant result. You'll also notice that the proportional hazards test rejects. That's a red flag that the assumptions of the Cox regression are being violated. If this were a real-world project, we'd probably want to go back and tweak some things. But, this is just a blog post, so we'll move on for demonstration purposes!

Plotting results

In basic survival analysis, we set up lots of cool little plots showing the survival curves for folks in different cohorts. But that's a little bit of a problem here... While a male customer will remain in the male cohort for the entirety of his customer life (barring extremely rare events), a customer could go from the "hasn't contacted support" to "has contacted support" camp at any time. So, we can't just plot the differences between the cohorts for their entire customer lives.

However, what we can do is plot survival curves for somebody who contacted support on an arbitrarily selected day. To do this, we'll actually be passing the results of our Cox regression, along with some fake data on theoretical customers, into R's survfit function.

Let's start by setting up some fake data, with one imaginary customer who never contacts support, and another who contacts support on day 500 of their customer life. We'll do it the long way to make sure it's all clear:
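Something like the following (the column names are assumptions, matching the layout of the dummy data):

```r
# Two imaginary customers in the same (start, stop] format as the
# training data. Customer 1 never contacts support; customer 2's life
# splits into two chunks at day 500, when they contact support.
fake_customers <- data.frame(
  customer_id       = c(1, 2, 2),
  start_time        = c(0, 0, 500),
  end_time          = c(1000, 500, 1000),
  contacted_support = c(FALSE, FALSE, TRUE),
  churned           = c(FALSE, FALSE, FALSE)  # ignored for prediction
)
fake_customers
```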

Now, to plot that data, we're simply going to pass it into survfit along with our cox results, and plot as we usually would!
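A sketch of that step, assuming a fitted model and the two fake customers from above (both rebuilt in miniature here so the snippet stands on its own; the training data is simulated and the column names are assumptions):

```r
library(survival)
set.seed(42)

# Minimal stand-ins for the objects built earlier in the post.
train <- data.frame(
  start_time        = rep(0, 100),
  end_time          = round(runif(100, 50, 1000)),
  contacted_support = runif(100) < 0.5,
  churned           = runif(100) < 0.6
)
cox_model <- coxph(Surv(start_time, end_time, churned) ~ contacted_support,
                   data = train)

fake_customers <- data.frame(
  customer_id       = c(1, 2, 2),
  start_time        = c(0, 0, 500),
  end_time          = c(1000, 500, 1000),
  contacted_support = c(FALSE, FALSE, TRUE),
  churned           = c(FALSE, FALSE, FALSE)
)

# The `id` argument tells survfit which rows belong to the same
# imaginary customer, so customer 2's covariate switches at day 500.
surv_curves <- survfit(cox_model, newdata = fake_customers, id = customer_id)

plot(surv_curves, col = c("blue", "red"),
     xlab = "Days", ylab = "Proportion of customers remaining")
legend("bottomleft",
       legend = c("Never Contacted Support", "Contacted Support at Day 500"),
       col = c("blue", "red"), lty = 1)
```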

If you run this, you'll get a pretty nice visual representation of the differences in survival chances for a customer that never contacted support, and one that contacted support at day 500. Awesome!

Unsurprisingly, these two lines are exactly the same until the moment one of the customers contacts support...

Congratulations! That's pretty much all there is to it. You're now doing Cox regression with time-varying covariates!

Really setting up the data

Earlier in the post, I mentioned that setting up data for this type of model can be a hassle, and I'd like to circle back to that for a bit. Consider these complicating factors:

If you're going to use historical data in this type of model, you actually have to preserve historical data. You need to structure your data warehouse so that you know not only what information applies to your customers right now, but also what information applied to them at every point in their life. Even for a relatively small customer base, storing this type of data will cause your data warehouse to grow very quickly.

You may have to divide a customer's life into a lot of "chunks." In the earlier example, our only covariate was a dummy for whether the customer had contacted support, which meant we had a maximum of 2 chunks per customer. But what if we had 10 covariates... that could take on several values... and could change multiple times in a customer's life? The number of chunks each life would have to be divided into would balloon, and the code to build this would quickly get out of hand.

There are many ways to deal with these types of issues, but here are a few techniques I've used in my work at Republic Wireless that you may want to consider:

Track critical customer data every day. For example, if you have different service plans, track what service plan somebody is on each and every day of their customer life. It will come in handy when you want to do survival analysis.

But don't necessarily feel like you need to store a row for every day. If you've got a million customers and you're storing one record per day, you'll be storing over a billion values before you know it... Instead, store data in "chunks" as well, and update the chunks each day. For example, you might know that "Bob" has been on your "Platinum Plan" from 2014-10-01 to today. Instead of adding a record tomorrow, you can simply have your daily ETL update the "end date" of that Platinum Plan record to show the new date. If Bob ever changes plans, then you can add a new record.

When you do your analysis, consider using 1-day chunk sizes. (Yes, I know this seems to contradict what I just said.) It may be easier to simply create one record per subject per day than it is to go through all the crazy combinatorics required to appropriately size the chunks for each individual when multiple variables get involved.

Sample individuals, not chunks. If you've got a million customers, you probably don't need to use them all to do your survival analysis. Especially as each customer takes on several rows to cover different time periods, the data can start bogging R down very quickly. A randomly selected sample of, say, 10,000 customers could be just what the doctor ordered.

Finally, look into the "tmerge" function in R's survival package. It can take two separate historical data sets on individuals and combine them together, creating the necessary time chunks automatically. I prefer doing most of my data setup in SQL in our data warehouse (since it's much higher-performing than my workstation), but if you like doing things in R, this is a good way to go.
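A sketch of how tmerge works, using hypothetical inputs (one row per customer with their tenure and churn flag, plus a log of support-contact dates; all names are made up for illustration):

```r
library(survival)

customers <- data.frame(
  customer_id = c(1000, 1001),
  tenure      = c(1000, 655),
  churned     = c(0, 1)
)
support_contacts <- data.frame(
  customer_id = 1001,
  contact_day = 649
)

# First call establishes each customer's (0, tenure] window and the
# churn event; the second splits lifetimes at each support contact and
# adds a time-dependent covariate via tdc().
td <- tmerge(customers, customers, id = customer_id,
             churn = event(tenure, churned))
td <- tmerge(td, support_contacts, id = customer_id,
             contacted_support = tdc(contact_day))
td[, c("customer_id", "tstart", "tstop", "contacted_support", "churn")]
```

Note that tmerge creates the tstart/tstop columns for you, which is exactly the combinatorial bookkeeping described above.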

Conclusion

Feel free to follow up with any questions or comments you may have. I'd especially be interested in others' suggestions for working with the tmerge function or otherwise preparing the data! Perhaps I'll do a full post on tmerge/SQL data preparation at a later date...

“One common question with this data setup is whether we need to worry about correlated data, since a given subject has multiple observations. The answer is no, we do not. The reason is that this representation is simply a programming trick. The likelihood equations at any time point use only one copy of any subject, the program picks out the correct row of data at each time.”

Essentially, the Cox regression is seeing whether the variables applicable to a person at any given point in time increase or decrease their chances of churning at that point in time. That’s true whether a customer has 100 “chunks” in their lifetime or 1 “chunk” – or even if they’re left- and right-censored and we only have data on a random 20 days in the middle of their lifetime. As long as we have one record per customer per time period, things are fine.

On the second question… I don’t think so? The setup of the model should be the same, and it shouldn’t be too hard to interpret.

Your posts are really useful and interesting, thank you for writing them.

I have a question about modelling a specific type of attributes. In the example data that you use above, an attribute called ‘contacted_support’ is included. Suppose we want to test the hypothesis that this is a powerful predictor for churn. That is, in a certain period of time after contacting support, we expect the churn probability to be higher.

The package vignette states on page 2:
“We read this as stating that over the interval from 0 to 90 the creatinine for subject “5” was 0.9 (last known level), and that this interval did not end in a death.”

Contacting customer support happens at a specific point in time. It is not something that is present over a longer time interval such as the creatinine level. Following this logic, the time interval for the row in which this variable is TRUE should have length 1.

If the data is set up like this, will the Cox model still be able to estimate a higher probability of churn over a longer period of time after the moment of contacting support? Or will it be forced to estimate this probability only for the single period in which the variable is TRUE?

Hope this question is clear, I would love to hear your thoughts on this.

Good question… In this post, the ‘contacted_support’ variable does not reflect whether or not somebody contacted support on a particular day. Instead, it reflects whether or not they’ve ever contacted support. That way, the effect of contacting support follows a customer through the rest of their lifetime.

Of course, you could set this up in different ways if you thought the effect was different. Maybe you hypothesize that a support interaction is dangerous in terms of churn, but only for about a month… you could have ‘contacted_support’ be 1 for 30 days after the event, then go back to 0. It’s really up to you.
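A minimal sketch of that 30-day setup in the counting-process format (the customer, dates, and column names here are all hypothetical):

```r
# One customer who contacts support on day 200 and is censored on day
# 400: contacted_support switches on at the contact, then back off 30
# days later, splitting their life into three chunks.
customer <- data.frame(
  start_time        = c(0,   200, 230),
  end_time          = c(200, 230, 400),
  contacted_support = c(FALSE, TRUE, FALSE),
  churned           = c(FALSE, FALSE, FALSE)
)
customer
```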

I have a very basic question related to plotting and legend that you have used.

In previous examples, i.e. gender, you mentioned male first and then female next (as 0 for male and 1 for female) in the legend. However, here you have mentioned ‘Contacted Support at Day 500’ first, which is 1, and then ‘Never Contacted Support’, which is 0. So, what is the logic here? How do we decide which line is contacted and which one isn’t while providing the legend? Sorry, it’s a basic question but I am confused.

Good question and sorry for the slow response. The result of a survfit call is a list. If you look at the strata element in that list, it will list the strata in the order that they will be plotted and you can react accordingly. Something like:
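A toy illustration, using the survival package's built-in lung data rather than the churn data from the post:

```r
library(survival)

# The strata element of a survfit result lists the groups in the order
# they will be plotted -- match your legend to this order.
sf <- survfit(Surv(time, status) ~ sex, data = lung)
names(sf$strata)
```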

It seems you’re wrestling through a lot of issues (many of which were addressed on CrossValidated), but I’ll address your specific question here… Technically, most of these methods are designed for continuous time survival analysis, but I’d be willing to hazard (pun very much intended) a guess that most researchers are using them with discrete data most of the time… I don’t think you should worry about it too much, unless you have a very small number of time periods (e.g., if you had quarterly data over 2 years or something).

Thanks for this article, really interesting! I have a question: what if I have a “numerical” variable (e.g., visits to the website) that I want to monitor day by day (or week over week) and use as a regressor? I’ve seen that you were suggesting “using 1-day chunk sizes” but how can I feed this into the regression model?

Not sure I’m totally following, but it sounds like you could likely use some sort of sliding window for this purpose. On each 1-day chunk, you could have a variable for “number of visits to the website in the last week” or something similar. You would change the 7 days used for the calculation each day, so that it was always representing the last week of a customer’s lifetime.