Recurrent Neural Networks for Churn Prediction

I just posted a simple implementation of WTTE-RNNs in Keras on GitHub: Keras Weibull Time-to-event Recurrent Neural Networks. I'll let you read up on the details in the linked information, but suffice it to say that this is a specific type of neural net that handles time-to-event prediction in a super intuitive way. If you're thinking of building a model to predict (rather than understand) churn, I'd definitely consider giving this a shot. And with Keras, implementing the model is pretty darn easy.

As proof of the model's effectiveness, here's my demo model (with absolutely no optimization) predicting the remaining useful life of jet engines. It's not perfect by any means, but it's definitely giving a pass to engines that are in the clear, and flagging ones that are more likely to fail (plus a few false positives). I have no idea how much better it could do with some tweaking:

Also, if anybody's curious as to why I've been in a bit of a post desert lately, my wife and I recently had a baby and I haven't been giving as much thought to the blog. However, I have some ideas brewing!

In the demo code I posted, test_y is a (samples, 2) tensor containing time-to-event (y) and a 0/1 event indicator (u), specifically for the testing data (as opposed to the training data). The event indicator for test_y is always 1, because this is the true remaining lifetime for all these engines.

test_x is simply a history of sensor readings for each engine, used to make predictions about its remaining useful life.

If the model is working well, when it’s fed test_x and asked to make predictions, it should generate predictions roughly equivalent to the lifetimes in test_y.

That’s actually exactly what the demo code is for… the alpha and beta parameters in the output describe a Weibull distribution showing the propensity for each engine to fail over future time.

If you need a point estimate of life remaining (like I’ve used for the graph above), you can simply set the Weibull survivor function equal to .5 using the output alpha and beta and solve for time. The survivor function is exp(-(t/a)^b). Set it equal to .5 and solve for t and you get a(-ln(.5))^(1/b).
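That solve-for-t step can be sketched in a few lines of Python (the alpha and beta values below are just illustrative, in the ballpark of the demo's outputs):

```python
import math

def weibull_median(alpha, beta):
    """Solve the survivor function exp(-(t/alpha)**beta) = 0.5 for t:
    the time by which there is a 50% chance the engine has failed."""
    return alpha * (-math.log(0.5)) ** (1.0 / beta)

# Illustrative parameters: alpha around 219 cycles and beta around 4
# give a median remaining life of roughly 200 cycles.
median_life = weibull_median(218.93, 4.13)
```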

Hope that helps.

Jimmy · March 30, 2017 / 2:53 am

I calculated the remaining useful life and found the values are quite far from the actual ones. Take engine 1 as an example: the predicted result was [ 112. 1. 218.93415833 4.13228035]. The t I solved for is about
219*(-ln(.5))^(1/4) ≈ 196,
but the actual value is 31. Why is it so different?
Thanks!

If you look at the code, you’ll see that the output prints the test_y values next to the predictions, so that 112 in your output is the actual remaining life of that engine. Not sure where you got 31 from.

A few other things to remember:

1) This prediction is saying the engine has a 50% chance of surviving to day 196. We’d expect some engines to fail before then and some to fail after then.

2) This is machine learning, not black magic. No predictions are going to be perfect.

3) This is a demo and I made no attempts to optimize this model whatsoever. It’s entirely possible that making some tweaks to the model (additional layers of neurons, for example) could dramatically improve performance. Feel free to mess around.

Sorry about my stupid questions. XD
I have been learning LSTM recently and have not yet completed a single RNN (LSTM) model. Your example is wonderful and I am sure I will learn a lot from it. The value 31 came from the data in test_x. According to the data, engine 1 ran 31 cycles and failed, so I think the TTE value should be 31. (Am I right?) However, the value is 112 in test_y.
Thanks!

If you’re looking to talk about this model, I’d be happy to have a conversation with you, but I haven’t written a dissertation on the subject (or any subject, for that matter)… Perhaps you’re referring to Egil Martinsson’s thesis?

I’ll send you an email offline, though I’m also mildly concerned this is trolling, what with it being April 1 and all…

Thanks for the simple write-up and code! I am still learning the workings and have a basic question:
The RUL = a(-ln(.5))^(1/b) that you mentioned in your reply to Jimmy: is it absolute time or a percentage of the engine’s maximum time?

Excellent work, I have been digging into your code to get some light on recurrent models.
I have two questions:
1- Do you think the result would be very different without masking? You have already left-padded the sequences with zeros.

2- You are using a sliding window of 100 timesteps, but another option is a stateful char-rnn, so the network can remember the whole history and reset its state for each engine. Have you tried something like this? I think the advantage of your approach is that you can shuffle the training samples…

1) I’m not an expert on NNs, so I’m not 100% sure, but I expect it wouldn’t make a super-big difference. I’d imagine the network would learn that having everything set to 0s is basically useless information and begin to ignore it. Not ideal to have some of your model’s knowledge dedicated to that, though, if it’s not necessary.

2) Yes, you could use a stateful char-rnn, but there’d be two downsides. First, you could only back-propagate through time for as many time steps as you had in each batch. If you kept, say, 10 time-steps per batch, this might not matter… but I don’t know. I haven’t played with this dummy model much. Second, there are some Keras challenges involved in managing batches when you have time series of differing lengths. It’s not anything that can’t be accomplished, but it would be some extra coding.
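For readers following along, the left-padding being discussed can be sketched in plain NumPy (the window length and feature count here are illustrative, not the demo’s actual values): shorter histories get zero rows on the left, and a boolean mask records which timesteps are real.

```python
import numpy as np

def left_pad(history, window=100):
    """Left-pad a (timesteps, features) history with zeros up to
    `window` rows; also return a boolean mask of the real timesteps."""
    t, n_features = history.shape
    keep = min(t, window)            # truncate histories longer than the window
    padded = np.zeros((window, n_features))
    padded[window - keep:] = history[-keep:]
    mask = np.zeros(window, dtype=bool)
    mask[window - keep:] = True
    return padded, mask

x, m = left_pad(np.ones((3, 2)), window=5)
# the first two rows of x are zero padding; m marks the last three as real
```

In Keras, a `Masking(mask_value=0.)` layer placed before the recurrent layer is what lets the LSTM skip those all-zero timesteps entirely rather than learning to ignore them.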

I trained the dummy model on my laptop (core i5). I don’t remember exactly how long it took, but definitely no more than an hour or so.

I’ve scaled this up to a different data set with 52 observations of 67 features for ~100k individuals. As with the jet engine data, this expanded to a full historical look-back for each period in which the individual was observed. It was a pretty big model. Trains on a p2.8xlarge in EC2 (8 Tesla K80s) in a couple of hours…

Hi Dayne, thanks so much for taking the time to put this together. I’m trying to walk through the work you’ve done with the goal of spinning up my own churn prediction model. Downloading the data from the NASA site and running the code from your github page, I’ve plotted the predicted vs actual time to survival, similar to the graph you have near the bottom of your post.

I see similar patterns and clusters between our two plots, except the vertical bar in my plot is centered at around 90 days whereas yours is at 60. Without asking you to debug my code, and being aware of the randomness inherent in the model, is it possible that your plot was generated from a different model than what you posted?

Like I said, I pasted your code in its entirety so I want to verify the input to the chart before I begin diving into other possible discrepancies. Again, thanks a ton for the all the work you’ve done, it’s a great simplified version of Egil’s work and has been really helpful for me to wrap my head around this.

Great post. I had a question about the normalization. If I train a model on a single feature that has raw values between 0-0.5 and the test data has values between 0-1.0 then wouldn’t the separate normalizations of the training and test sets be inconsistent? It would be even worse if the new incoming unlabeled data had values between 0-2.0 and so the normalization would be inconsistent across all three data sets. Is there a good way to address this?
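(One standard remedy, not specific to this post: fit the scaling statistics on the training set only and reuse them for every other split, so the mapping stays consistent even when new data falls outside the training range. A minimal sketch with made-up values:)

```python
import numpy as np

def fit_minmax(train):
    """Compute per-feature min and max on the TRAINING data only."""
    return train.min(axis=0), train.max(axis=0)

def apply_minmax(data, lo, hi):
    """Scale any split with the training statistics; out-of-range
    test values simply map outside [0, 1] instead of being re-squashed."""
    return (data - lo) / (hi - lo)

train = np.array([[0.0], [0.25], [0.5]])
test = np.array([[0.0], [1.0]])
lo, hi = fit_minmax(train)
scaled_test = apply_minmax(test, lo, hi)  # the 1.0 maps to 2.0, not 1.0
```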

Hi Dayne – How do you think this will extrapolate to a retail customer churn model which is not subscription based. Instead customers simply stop coming. Would the analogue to RUL be the number of days to the last transaction?

Do you know how to deal with irregular observations? For example, during a 12-month period there could be a varying number of observations for each subject at various dates (pretty much random dates and numbers of events per subject). The number, frequency, and parameters taken during each observation could be related to the probability of the event; for example, if a device is brought in for repair multiple times during the last 12 months, it will obviously affect its chance of a catastrophic failure in the next 30 days.

Hmmm… that’s an interesting question and, unfortunately, I don’t know that I have a great answer. It may be worth looking into the application of RNNs in situations with varying amounts of time between observations? I’d imagine somebody else has faced this issue before, though maybe in a different context.

Thanks for the kind words… The jet engine failure data is #6 on the linked page… “Turbofan Engine Degradation Simulation Data Set.” There’s an included readme and other resources.

I believe (though I didn’t go explicitly check, so you may want to verify) that all of the engines in the data set were followed until the end of their useful life, so there are no censored observations.

It’s wonderful work with a beautiful dataset for understanding time to a failure event. I want to know more about some of the parameters in the test_results it generates, as per my understanding.

There isn’t necessarily a “predicted time to failure” as a point value. Instead, alpha and beta give us information about the probability of failure over time. If you want to get a point estimate (as I did to make the graph in the GitHub ReadMe), you could look for the point in time where the survival function for the Weibull distribution crosses 50%. This is the point where the chance of having failed by that point goes over 50%, so it’s a defensible estimate for remaining useful life.

Please see my previous reply. There is no point value for predicted time to failure… instead there’s a distribution across future time.

For the purposes of that graph on GitHub, I did something quick and dirty – I looked for the point where the Weibull survival function crossed 50%. This is not exactly correct, but, as I said in the last post, it’s reasonable, and it gave me what I wanted – a quick and dirty way to plot my results against actual values and prove that the model learned something. If you’re looking for me to lay the math out for you, the survivor function is exp(-(t/a)^b). Set it equal to .5 and solve for t and you get a(-ln(.5))^(1/b). Plug in a and b and you get your estimate.

As it turns out (see pages 3-4 of the document linked below), the correct way to estimate expected future life is to integrate the survival function from 0 to infinity. Please, please, though… understand that neither of these is a prediction that the subject will experience a failure on that particular day; it’s just our best estimate for how long it will survive.
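For the Weibull distribution, that integral has a closed form, alpha * Gamma(1 + 1/beta). A quick sketch comparing it against a direct numerical integration of the survivor function (the parameter values are illustrative):

```python
import math
import numpy as np

def weibull_mean(alpha, beta):
    """Expected lifetime: the integral of exp(-(t/alpha)**beta) from 0
    to infinity, which equals alpha * Gamma(1 + 1/beta)."""
    return alpha * math.gamma(1.0 + 1.0 / beta)

alpha, beta = 218.93, 4.13               # illustrative parameters
t = np.linspace(0.0, 2000.0, 200001)     # fine grid over a long horizon
s = np.exp(-((t / alpha) ** beta))       # survivor function values
dt = t[1] - t[0]
numeric = np.sum((s[:-1] + s[1:]) * dt / 2.0)  # trapezoid rule
closed = weibull_mean(alpha, beta)       # agrees closely with `numeric`
```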

Thanks for the work.
I tried to re-implement your architecture using the same data and following your code.
I think there is a weird effect if you train the network for more epochs or if you switch to Adam, which converges faster than RMSprop. What happens is that the weights of the input connections of the output layer tend toward zero while the biases increase; in other words, the output becomes constant, so you get one value of alpha and beta for any test sequence.
I am still investigating the problem. It could be something related to the L2 normalization (the default in scikit-learn’s normalize function). I am also experimenting with batch normalization, but I need to get rid of the zero-variance features (namely op_setting_3, sensor_measurement_18, and sensor_measurement_19).

Have you tried to run exactly the same code but increasing epochs to something like 500 or with a different optimizer?
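The zero-variance check mentioned in that comment is simple to sketch in NumPy (the toy array below is made up; the real columns would be the turbofan features named above):

```python
import numpy as np

def drop_zero_variance(X, eps=1e-12):
    """Keep only columns with variance above eps; constant features
    carry no signal and can cause divide-by-zero trouble in
    normalization steps like batch norm."""
    keep = X.var(axis=0) > eps
    return X[:, keep], keep

X = np.array([[1.0, 5.0],
              [2.0, 5.0],
              [3.0, 5.0]])
X_reduced, keep = drop_zero_variance(X)  # the constant second column is dropped
```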

Thank you Dayne for putting this together! I have a question about interpreting the results. Below is an example of the output from the code: I see 112 vs. 194, 98 vs. 155, 69 vs. 86, and 82 vs. 86. I’m unsure how to use these numbers to understand whether the model is effective or not. Would you mind taking a couple of examples from the output and explaining how they should work? Should the engines be ranked by predicted TTE so you can then prioritize scheduled maintenance for them? Sorry to ask you to tease out what these findings really mean.

As discussed above, the third and fourth values in the output for each observation are the alpha and beta parameters of a Weibull distribution describing the likelihood of engine failure over future time. You can use these parameters to calculate a probability density function or cumulative distribution function of likely failures for each engine. You’d then have to choose some statistic to help you decide how to prioritize which engines to focus on. Maybe it’s time until the CDF goes past 50%, for example… (in other words, the time at which the engine is predicted to have a >50% chance of having failed). Ultimately, this is going to be up to the discretion of the researcher, and you’ll have to choose thresholds that give you the kind of precision/recall performance you want.
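One way to sketch that prioritization, using the Weibull CDF at a chosen maintenance horizon (the engine IDs, horizon, and parameter values below are all made up for illustration):

```python
import math

def failure_prob_by(t, alpha, beta):
    """Weibull CDF: probability the engine has failed by time t."""
    return 1.0 - math.exp(-((t / alpha) ** beta))

# Hypothetical (alpha, beta) predictions for three engines:
preds = {"engine_1": (218.9, 4.13),
         "engine_2": (90.0, 2.0),
         "engine_3": (150.0, 3.0)}
horizon = 50  # cycles until the next maintenance window
ranked = sorted(preds, key=lambda e: failure_prob_by(horizon, *preds[e]),
                reverse=True)
# engines most likely to fail within the horizon come first
```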

Really interesting work here. One question, though: why is every observation marked as uncensored? I agree that the last day for each engine should be uncensored, but shouldn’t the days up until failure be censored? I mean, you assume you don’t know the future, right?

Hopefully you understand my question.

One other thing: what about using Cox regression instead? Do you have pros and cons for that approach?

Hi, if I am using this WTTE-LSTM for medical data where some subjects have an event occurring within the recorded data and some don’t, what would my y values be? In your case you put all 1s because you’re assuming the engine will fail sometime in the future. What would you recommend in my case, since it is not necessarily true that everyone will get, for example, cancer? Thanks!

Unfortunately, most of the survival analysis field is predicated on the assumption that everybody experiences the event of interest eventually. In medical contexts, the event of interest is often death, and everybody dies. If you’re looking at something that not everybody will experience (e.g., getting cancer), the assumptions fail. People will argue about how much of a problem that is in practice (and there are plenty of papers published that, for example, use vanilla survival analysis to look at things like divorce rates), but the assumptions fail nonetheless.

You can choose to ignore the problem and use the model and just treat people that never got cancer as censored, but do so at your own risk.

There are also models specifically designed for handling data where not everybody experiences the event of interest. Google “cure models” if you’re interested. Incorporating the math behind those models into this project would be a pretty big undertaking, though.