Training a neural network typically involves many epochs, each of which exposes the network to the full training data set, continuing until the accuracy no longer improves appreciably. For a lengthy overall training, it’s useful to save the training progress so that if there is an interruption for some reason, training can be resumed rather than restarted.

After any epoch of training, the state of training can be saved using essentially the same technique we used for saving a fully trained model and restoring it in a production environment for inferencing. To simulate a training interruption in my Bankloan sample, I broke the 3000-epoch training cell into two cells. The first training cell had the same training code, except that it stopped at 1500 epochs and saved the progress every 500 epochs using the following:
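
What follows is a minimal sketch of that pattern in TensorFlow 1.x style, not the notebook’s exact cell; sess, train_step, x, y, X_train, and Y_train stand in for the Bankloan notebook’s own names:

import tensorflow as tf

saver = tf.train.Saver()

for epoch in range(1500):
    sess.run(train_step, feed_dict={x: X_train, y: Y_train})
    # Save a checkpoint every 500 epochs; the epoch number is appended to
    # the checkpoint file name (e.g. bankloan-500, bankloan-1000, ...).
    if (epoch + 1) % 500 == 0:
        saver.save(sess, './bankloan', global_step=epoch + 1)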

Relative to restoring a model for the purpose of inference, there are only a few small differences. First, the name of the file from which we read includes the epoch number. Second, we have to get values for a few more Python variables, like the y layer tensor to which we feed the correct output values during training, the training operation itself, and the accuracy testing tensor. Third, getting the training operation requires a slightly different call because it is an operation rather than a tensor. Fourth and last, getting the accuracy tensor requires that we give it a name during construction of the TensorFlow graph, because it cannot be extracted during restoration unless it has a name that was saved. This can be accomplished by simply adding an identity node to the front of the previous accuracy tensor, like this:
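
Here is a sketch of that naming trick, together with the restore-side calls it enables. The names output (the network’s output layer) and a y placeholder created with name='y' are assumptions standing in for the notebook’s own code:

with tf.name_scope('test'):
    correct = tf.equal(tf.argmax(output, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    # Wrapping the tensor in an identity op gives it a saved name,
    # 'test/accuracy', that can be looked up after restoring the graph.
    accuracy = tf.identity(accuracy, name='accuracy')

# Later, in the resuming cell (even after a kernel restart):
sess = tf.Session()
saver = tf.train.import_meta_graph('bankloan-1500.meta')
saver.restore(sess, './bankloan-1500')
graph = tf.get_default_graph()
y = graph.get_tensor_by_name('y:0')
accuracy = graph.get_tensor_by_name('test/accuracy:0')
# The training op is an operation rather than a tensor, hence this call:
train_step = graph.get_operation_by_name('train/GradientDescent')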

This gave the accuracy tensor the name ‘accuracy’ within the ‘test’ namescope. Reviewing the code, one may note that the training operation is not explicitly given a name. That’s because the TensorFlow library itself assigns it a default name of ‘GradientDescent’ during creation, which occurred within the ‘train’ namescope.

Speaking of the code, you can go here to download a copy of the notebook. Instead of one cell for all the training, the training is split into two cells, with the latter cell [10] reloading and resuming where the former cell [9] left off. Finally, note that it is possible to fully simulate interrupted training by stopping the Python kernel after the first half of training. Once you restart the kernel, simply rerun cells [1] to [4] to reload the training data, and then run the second half of training starting at cell [10]. The only difference will be a negligibly different accuracy result, relative to training all epochs with the same kernel, due only to the random number seed being regenerated when the Python kernel is restarted.

In this article, we’ll cover how to measure the quality of the TensorFlow regression model covered in a prior post. As usual, the code for the quality measurements can be obtained from my TensorFlow Samples repository, and you can use this code in IBM Data Science Experience / Watson Studio. The code is also written generically so that you can apply it to models built with other libraries, too.

A regression model solves a kind of problem that can’t be solved with a classification algorithm. A data scientist trains and uses a regression model when the variable being predicted is a continuous quantity or an ordinal quantity with a large value space. For example, if the input is an image of one of ten numeric digits, a classification model would predict which digit it is. Even though numbers are comparable, there’s nothing about an image of a two that makes the image less than an image of a three, any more than an image of a cat would be less than an image of a dog (as if!). On the other hand, a regression model would be used to predict a property value (essentially continuous) or the number of hours after a medical procedure that a patient will need to stay in an intensive care unit (ordinal with a high value space).

The model in the prior post was a linear regression model that used matrix operations to determine a line of ‘best fit’ for the housing data. There were 9 variables, including the median house value that the model learned how to predict. So, the ‘best fit’ line is calculated to flow through 9-dimensional space in a way that is closest, overall, to all the 9-dimensional data points in the housing data.

But how good is the fit of that ‘best fit’ line? Sometimes the ‘best fit’ line is not a good fit because the variables are not linearly related to the dependent variable. At other times, there might be a linear relationship at a statistically significant level, but the model is still not a great fit because the relationship, and hence the data, is noisy. So, how do we measure whether we have a good regression model, an excellent one, or a poor one?

The R squared metric is a ratio that indicates how much of the data’s variance from the mean is accounted for by the regression model’s predictions. Before we unpack the meaning of that statement, let’s first have a look at the library method you’d normally use to get the measurement. The variable ‘predicted_values’ contains a one-dimensional array of predicted median house values generated using the trained linear regression model. To prepare for the R squared calculation, we flatten the actual median house prices into the same one-dimensional format, and then we use the scikit-learn method that calculates R squared for us:
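
A minimal version of that calculation; test_labels, as the source of the actual values, is an assumed name for this sketch:

from sklearn.metrics import r2_score

# Flatten the actual median house values into the same 1-D shape as the
# predictions, then let scikit-learn compute R squared.
actual_values = test_labels.flatten()
R_squared = r2_score(actual_values, predicted_values)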

The result in this case is a touch more than 0.637. One may have a rough sense that this is good because, well, more than half of the variance from the mean is explained or accounted for by the regression model. In other words, if you were given each house’s predictor variable values and you always answered with the average price, then your answers would reflect a balance between sometimes being high and sometimes being low. The total variance of the actual house prices from your constantly mean answers (yes, I meant that) is called the total sum of squares, and you can calculate it yourself very easily like this:
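
For instance, with numpy and the actual_values array from the sketch above:

import numpy as np

# Total sum of squares: the squared distances of the actual values
# from their mean.
SStot = np.sum((actual_values - np.mean(actual_values)) ** 2)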

On the other hand, residual variance is the variance that is unexplained, or not accounted for, by the regression model. In other words, it is the variance that’s left over if you use the regression model’s predicted values instead of the mean. It is easily computed like this:
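
Continuing the numpy sketch:

# Residual sum of squares: the squared distances of the actual values
# from the regression model's predicted values.
SSres = np.sum((actual_values - predicted_values) ** 2)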

The ratio of the residual variance to the total variance is the unexplained portion, and subtracting that from 1 gives the portion of variance explained by the regression model, which is R squared and is calculated easily as follows:

R_squared = 1.0 - SSres / SStot

The following illustration graphically depicts the difference that a linear regression model makes in accounting for variance. On the left, you see the results of constantly using the mean as the predicted value. Each data point is some distance from the mean line, and the square of that distance is the variance for that data point. The sum of the large reddish squares’ areas gives the total variance of the actual data values from the mean. On the right, you can see the smaller blue squares of residual variance: the distances of the actual data points from the values predicted by the linear regression model.

Now you have a rough intuitive sense that the housing price model was a good model based on an R squared of 0.637. But the precision of that intuition is like a rare steak: it tastes good, but we all know that data scientists are people, and people shouldn’t eat undercooked meat.

So, what is good, fair, poor, or excellent for R squared? A number of sources out there will say that an R squared of 0.25 is a large effect size. However, that is large for detecting the effect of a treatment (e.g. a psychological technique, educational module, or medication). But a good R squared for a treatment’s effect size is different from (and less than) the R squared that would correspond to a good predictive model.

In a 2015 study, a group of medical researchers created a new regression model for predicting the required length of stay in intensive care after heart surgery. The benchmark model in use at the time had an R squared of 0.356. This is consistent with answers I received while interviewing a few data scientists, who indicated that R squared values in the 0.3’s and 0.4’s would correspond to serviceable predictive models. Since they also said they’d want to keep experimenting to get better results, it would be fair to say that 0.3’s and 0.4’s are ‘fair’ values for R squared for a predictive model.

The purpose of the 2015 study, though, was to present the researchers’ new regression model, which had a much-improved R squared of 0.535. The “delighted tone” (Lewis, 2016, p. 79) with which the researchers described the new model was due to the magnitude of the improvement in R squared, and in that case it’s reasonable to conclude that the new R squared deserves a qualitatively higher qualifier. As such, it is a ‘good’ R squared value. More generally, 0.5’s and 0.6’s would be considered ‘good’ to ‘quite good’ according to the data scientists I interviewed.

When asked ‘what is a good R squared,’ the data scientists I interviewed did, of course, start with admittedly reasonable disclaimers like “It depends on what you’re doing” and “it depends on the current benchmark.” But the characterizations here assume we don’t have answers for those dependencies. R squared values in the 0.7’s were generally regarded as excellent, and the 0.8’s as outstanding. That leaves the 0.9’s in the realm of the practically unachievable. Put another way, in real-world scenarios, an R squared in the 0.9’s will be practically as rare as the occasions on which one should eat undercooked meat.

The first thing you should ask yourself when you see a trainable AI API like Watson Natural Language Classifier (NLC) is "Can I measure the quality of the training I do?" Fortunately, you can use IBM Data Science Experience (DSX) to do just that, and you can use much the same techniques as I used for a TensorFlow neural network in the last article. The reason is that the measurement techniques are the same for any binary classifier algorithm or implementation, so the main difference we'll actually be covering in this article is how to get classification results out of Watson NLC so they can be fed into DSX tools like the ROC curve generator and AUC method that we covered previously, as well as a few new ones at the end.

Once you get an IBM Cloud account, go to the catalog, then search for (or navigate to) the Watson Natural Language Classifier tile and click it. Even with the free trial account, you can create a free trial Watson NLC instance (and a free trial Data Science Experience instance, if you haven't done that already).

There is a 5-minute tutorial video, and it really takes somewhere under 15 minutes to get your NLC instance up and running, even if you're like me and prefer to smell all the roses along the way. The tutorial shows you how to get your API credentials for the instance, which are called "username" and "password" in the code below. The tutorial also shows how to use curl right from your local computer to send training data to your NLC instance. Part of the response payload to that curl training command is the trained classifier ID, which is called "service_id" in the code below.

To start, in a DSX Jupyter Python notebook, we install Watson Developer Cloud, which allows access to all Watson APIs, and then we import Watson NLC in particular:
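
With the Watson Developer Cloud Python SDK of that era, that looked roughly like this:

!pip install watson-developer-cloud
from watson_developer_cloud import NaturalLanguageClassifierV1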

# These are from your Service credentials
username = "c8fecfe7-f753-4f2d-8365-5d6e67eed69c"
password = "password"
# This is the classifier ID you get in the training response
service_id = "0015a0x264-nlc-37123"
# This is the proxy object that talks to your NLC instance in IBM Cloud
NLC = NaturalLanguageClassifierV1(username=username, password=password)

The Watson Natural Language Classifier is capable of distinguishing any number of classes based on training data you give it (up to machine learning limits based on a current training set size limit of 15,000). The use cases for Watson NLC span from classifying high volumes of call center requests in a conversational agent to classifying every sentence of a document being ingested according to some important metric, like the amount of deviation from contract boilerplate language. To narrow the scope, we're going to write some simplifying helper methods that assume Watson NLC is trained as a binary classifier (such as risky versus non-risky contract clauses).

This first simplifier method just takes the result payload from an NLC call (the call is further below) and cherry picks the predicted class name and confidence, assuming binary classification:
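
A sketch of what such a helper can look like, based on the documented NLC response fields top_class and classes (the latter sorted by descending confidence):

def top_class_and_confidence(payload):
    # For a binary classifier, the top class and its confidence tell us
    # everything: the other class's confidence is just one minus it.
    return payload['top_class'], payload['classes'][0]['confidence']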

In my own sample NLC use case, I am using some sample data containing forecasted revenue questions and actual revenue questions. In the actual system where the data is from, we found that people had numerous starting prompts ('show me', 'give me', 'what is', ...), they had many ways of referring to actual versus forecasted revenue, and they tended to want the information broken out by industry (e.g. energy sector, tech sector, ...) and geography (e.g. countries). What I actually wanted to know is whether Watson NLC was susceptible to training data bias that I've seen in lesser classifiers. I deliberately skewed country mentions to 25 countries being 75% prevalent in actual revenue questions and another 25 countries being 75% prevalent in forecasted revenue questions. This modelled the phenomenon of more frequently asking for projections in emerging markets and more frequently asking for actuals in established markets.

Watson NLC was not even remotely fooled by this subterfuge because it "combines complex convolutional neural networks with a sophisticated language model." I only trained it on 756 of my 6000 samples, and not only was the accuracy perfect on a validation set of size 324, but the confidence values were always above 96%, and more typically above 99%. Moreover, I had lots of data left over, so I started plugging in terms that I hadn't even shown it, like "What is the pipeline for healthcare in Canada?" Even though the training data did not contain 'pipeline,' Watson NLC's language model knows the proximities between the words in the training data and many synonymous words that may come up in practice, without your having to include all those variations in your training data. Suffice it to say that you need a MUCH harder problem before you have to worry all that much about Watson NLC training quality, but... it does come up, and it's best to be prepared for when it does!

So, here's what I did to measure Watson NLC training quality with DSX. I held aside a validation set of 324 questions that were randomly selected from the same pool of 1080 questions as the remaining 756 training set questions. I assume that internally Watson NLC divides the data I send into a training set and a test set, so this is a second validation set that is not provided to the Watson NLC training API.

I used drag-and-drop to add this validation set as a data asset called 'QuestionsTest.csv' in my DSX project, at which point I could just use plain old Python to pull it into memory, like this:
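
Here's a sketch of that, plus the classification loop the next paragraph refers to. The two-column, headerless csv layout is an assumption, and note that older versions of the SDK returned a plain dict from classify() while newer ones wrap it in a response object:

import csv

questions, actual_classes = [], []
with open('QuestionsTest.csv') as f:
    for text, actual in csv.reader(f):
        questions.append(text)
        actual_classes.append(actual)

# One NLC call per question: simple, but slow for large validation sets.
predicted_classes, confidences = [], []
for text in questions:
    payload = NLC.classify(service_id, text)
    predicted, confidence = top_class_and_confidence(payload)
    predicted_classes.append(predicted)
    confidences.append(confidence)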

The only downside to the loop-based approach above is that it is not terribly performant. It takes a good 5 minutes to run that loop! Maybe that is tolerable for a small validation set, but not really for a larger one, and it's certainly not acceptable for the use case of processing the sentences of a document. Fortunately, team Watson NLC appears to be coming to the rescue again. There is a currently undocumented "beta" endpoint for batch classification. Those with mad programmer skills can see the endpoint though, and it's not hard to fish around in the GitHub repo for the Python SDK, figure out who the programmer is, look at that programmer's fork, and end up with jankable code that can be smoothed into this nice little batch call method:

NOTE: This is not a GA feature at the time of this writing, and so it should not be used in any kind of production way. For example, the Watson NLC team may decide to place a lower limit on the maximum number of items in a single batch. Still, I for one am really looking forward to this capability making it feasible to use Watson NLC during document ingestion.

OK, so whether you choose the supported way or the naughty-but-fast way to get the validation results, the logical flow is still the same from here on in.

To do the basic accuracy calculation we need both the actual and predicted classes expressed as true or false (1 or 0).
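
For example, with the class names from my sample data (treating 'actual_revenue' as the positive class is an assumption for this sketch):

import numpy as np

# Map class names to 1 (true) / 0 (false).
y_true = np.array([1 if c == 'actual_revenue' else 0 for c in actual_classes])
y_pred = np.array([1 if c == 'actual_revenue' else 0 for c in predicted_classes])
print(np.mean(y_true == y_pred))   # basic accuracy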

The NLC training I described above resulted in a perfectly trained classifier, and it's FUN to see what the ROC curve looks like when that happens:
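
One NLC-specific wrinkle when reusing the earlier ROC code: the returned confidence applies to the predicted class, so when the prediction is the false class, the score for the true class is one minus that confidence. A sketch of the conversion:

from sklearn.metrics import roc_curve, auc

# Convert each top-class confidence into a score for the true (1) class.
scores = [c if p == 1 else 1.0 - c for p, c in zip(y_pred, confidences)]
fpr, tpr, _ = roc_curve(y_true, scores)
print(auc(fpr, tpr))   # 1.0 when the classifier separates the classes perfectly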

Finally, you can also analyze classifiers using a confusion matrix that helps you to see which classes your model may be confusing with which other classes. Here's some code I used to generate one for my forecasted versus actual revenue NLC classifier:
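
Here's a sketch of the kind of code involved, using scikit-learn and matplotlib; the label names are the assumed ones from my sample data:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

labels = ['actual_revenue', 'forecasted_revenue']   # assumed class names
cm = confusion_matrix(actual_classes, predicted_classes, labels=labels)

# Render the matrix as a heat map: rows are actual classes, columns are
# predicted classes, and hotter cells are more frequent.
fig, ax = plt.subplots()
im = ax.imshow(cm, cmap='hot')
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
ax.set_xlabel('Predicted class')
ax.set_ylabel('Actual class')
fig.colorbar(im)
plt.show()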

Again, for my particular training data, Watson NLC had no confusion, so here's what the output looked like:

In the diagonal from top left to bottom right, you see the frequencies of the correct classifications, and you hope to see that as the hottest (reddest) part of the map, as it is above. All other squares show incorrect classifications. For a binary classifier like the one I trained, you can see whether one kind of mistake occurs more often than the other. For example, in the bank loan model we previously covered, a confusion matrix reveals that more of its errors misclassify loan defaulters as non-defaulters.

A confusion matrix becomes even more useful when you exploit the ability that neural net technologies, like Watson NLC, have to support N-way classification, rather than just binary classification. Although you can do N ROC curves to look at each class versus all other classes, it's very convenient to see an NxN confusion matrix that shows how often each class is mistaken for each other class. For example, given a handwritten numeric digit classifier, you might learn from a confusion matrix that it mistakes 3's and 8's more frequently than it makes other number pair mistakes. Finding and solving for the ambiguities in natural language classification is no less interesting, and hopefully now you've got some additional tools with which to do it.

In this article, we’ll cover how to measure the quality of the TensorFlow neural net model covered in this prior article. The code for this article can be obtained from the Jupyter notebook in my TensorFlow Samples repository. Although the machine learning model is written with TensorFlow, the code for this article is written generically so that you can apply it to machine learning models built with other libraries, too.

The neural net model in the previous article was a binary classifier that learned to distinguish those who were likely to default on a bank loan from those who weren't, based on a set of predictor variables. The most basic measure of accuracy for a binary classifier is the rate of correct classifications. This is what the neural net trains to optimize, and the model training in the code gets an accuracy result of just above 80%. For this data, is this a good result? An excellent result? Or could a much better model be produced using this data?

The basic accuracy result for our binary classifier came from selecting, as the predicted value, whichever class (defaulter or non-defaulter) got the higher confidence value. Put another way, there was a confidence threshold of 50% for determining whether (true) or not (false) a loan applicant would default on a bank loan.
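
In code terms, that thresholding looks something like this; y_score (the per-sample confidence for the true class) and y_true (the actual 0/1 classes) are assumed names for numpy arrays in this sketch:

import numpy as np

# A 0.5 threshold: predict 'defaulter' (1) whenever the confidence for
# the true class is at least 50%.
y_pred = (y_score >= 0.5).astype(int)
accuracy = np.mean(y_pred == y_true)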

The accuracy on the positive samples is called the true positive rate (TPR), and the accuracy on the negative samples is the true negative rate (TNR). The negative samples that the classifier gets wrong are the false positives, and so the false positive rate (FPR) is the number of false positives divided by the total number of negative samples. In the terms of the bank loan example, the true positives are those that were predicted to default and did default, and the false positives are those that were predicted to default but did not. If needed, further details are here.
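
Those definitions translate directly into code; continuing the sketch from above:

# True/false positives and negatives, counted from the 0/1 arrays above.
tp = np.sum((y_true == 1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))

tpr = tp / (tp + fn)   # true positive rate
fpr = fp / (fp + tn)   # false positive rate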

A receiver operating characteristic (ROC) curve is a plot of the TPR versus the FPR as we vary some variable related to the model being tested. In this case, we will vary the confidence threshold because it gives a fine-grained view of the model’s ability to distinguish, based on confidence values, the true (defaulter) case from the false (non-defaulter) case. In fact, the Area Under the Curve (AUC) corresponds to the probability that the model will produce a higher confidence value for a randomly selected true case than for a randomly selected false case. The diagram below shows the ROC curve and AUC value for the bank loan TensorFlow neural net:

Thanks to the sklearn and matplotlib packages, it is easy to write the code that calculates the data for the ROC curve and the AUC:
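
A sketch with the assumed names from above:

from sklearn.metrics import roc_curve, auc

# Sweep the threshold over the confidence values to get the ROC points,
# then integrate to get the area under the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)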

The first parameter to roc_curve() gives the actual class values for each sample, and the second parameter is the set of confidence values for the true (1) class for each sample. The method produces the FPR and TPR data that is used by auc() to determine the AUC and that is also used by the plotting code below:
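
A minimal version of that plotting code might look like:

import matplotlib.pyplot as plt

plt.plot(fpr, tpr, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], linestyle='--', label='random guess')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc='lower right')
plt.show()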

Finally, with an AUC of 0.85, the bank loan TensorFlow neural net is arguably a good model, especially given the variables and data at hand. The conclusion, then, is that it would be difficult to obtain higher accuracy by simply tuning the model. Other variables and/or data would be needed to train a more accurate machine learning model for this problem.

The intent in my last blog was to use a straightforward example neural net use case so we could focus on the nuts and bolts of TensorFlow, Python, a Jupyter notebook, and IBM Data Science Experience. However, for use cases involving neural nets and other statistical learning algorithms, there are good alternative technical choices within IBM Data Science Experience that don’t involve writing so much Python and TensorFlow library code. One such choice is to use SPSS Modeler Flows instead.

In my Sandbox project, I clicked the data button (the binary ‘1001’ icon at top right), then I clicked Load, which offers a UI where I can drag and drop my csv data file. This is the same bankloan data file used in my last blog. Go and get it so you too can have this IBM Data Science Experience.

In your project, go to the Assets view and then press ‘New Flow’ on the right side of the ‘SPSS Modeler flows’ section. Here’s how I filled in the ‘Create’ page:

Once you hit ‘Create’, you get a blank SPSS canvas on which to start building your flow. Drag and drop the bank loan data file onto the left side of the canvas (this reads a file from Object Storage, but you could alternatively do a database select in this first step).

Now we’re going to select the fields we want to use and filter out unwanted columns. Drag and drop the ‘Type’ node from the ‘Field Operations’ palette, and then click and drag the output connector from the bankloan data node to the input connector of the Type node. Now right-click the Type node, choose Open, press the Add Columns button, and make sure all the fields’ checkboxes are checked so that all the fields are added. Then hit the left arrow to go back to the Type node configuration.

Now, we only want the fields for age, education level, years of employment, years at current address, income, debt-to-income ratio, credit debt, other debt, and the bank loan default field. So, for each of the preddef fields, change the Role from Input to None. Then go to the ‘default’ field and change its Role from Input to Target, because this is the dependent variable that we want to machine learn how to predict. Hit OK to finish setting up the Type node.

Now we’re going to filter out unwanted records in the data. In this case, some records have no value for the default field, so we will filter them out because we can’t use them for machine learning how to predict the default field. Drag and drop the Select node from the Record Operations palette and connect it to the Type node. Right-click and Open the node. You can use an “Include” mode with a condition of default=”0” or default=”1”, or you could use an “Exclude” mode with a condition of @NULL(default).

You can right-click and choose Preview on the Select node to see the results of the flow up to now. That’s how I found that I needed quotes around the 0 and 1 in the condition.

Now that we’ve done some basic data preparation, it’s time to configure the machine learning.

Drag and drop a Partition node from the Field Operations palette, and connect the Select node output to the Partition node input. Right-click to open it so we can set the training and testing sizes. To match the TensorFlow sample in my previous blog, use 70 percent for training and 30 percent for testing.

Once you hit OK on that, it’s time to drag and drop a Neural Net node from the Modeling palette and connect its input to the output of the Partition node. The UI knows to use the ‘default’ field as the predictive target and the other fields as inputs, except the ones we marked with a Role of None. Now just hit OK.

Now we’re ready to do some machine learning. Right-click the ‘default’ Neural Net node and select “Run” from the menu. This generates a training results node. Here’s what the final flow looks like:

And now you can right-click the yellow ‘default’ training node to see the results of training the neural net. Select the View Model menu item. You can see the accuracy values in the Model Evaluation tab. The Predictor Importance tab gives you a graph of which predictor variables were most important in determining the prediction. And last but not least, the Network Diagram shows you something like this:

In conclusion, both the TensorFlow library and this SPSS model achieved the same accuracy, just north of 80%, on this data. The difference was that with SPSS I had a drag-and-drop canvas where I could simply configure the data preparation, the model, and the training and testing, so it took a lot less time to go from data to trained machine learning model.