This TensorFlow Neural Network tutorial has several aspects that are unique or not evident in other tutorials like the MNIST handwritten digits tutorial. The focus is on business, both in terms of the use case and data and in terms of extra steps needed to help take your data science results to production. We’ll cover the following:

Reading data and reshaping it for TensorFlow neural net input

Epoch based training and training data randomization

Training in small batches for larger data sets

Tuning hyperparameters like the network structure and activation function

Tricks for properly saving and restoring models for use in a production environment

How to do transfer learning

How to derive confidence values for neural net outputs

You can follow along if you have taken the time to locally install Jupyter/Python/TensorFlow/etc., or you can take a few minutes to sign up for the free trial of IBM Data Science Experience on IBM Cloud.

The sample data we’ll be using for training and testing is in a file called bankLoanData.csv, a sample data file I obtained from the IBM SPSS Statistics package on my laptop. I’ve used this data because I could easily use SPSS to double-check that all the TensorFlow code was behaving as I expected. The advantage to both you and me, then, is that we can now easily adapt the resulting TensorFlow code to build bigger neural nets that learn from much larger data sets.

The goal will be to train and perform inferences with a TensorFlow neural network for predicting whether a loan applicant is likely to ‘default’ on a bank loan, based on features of the applicant that may be predictive of their ability to repay a loan. The dependent variable being predicted is the column named ‘default’ in the CSV file. The dependent variable is also called the ‘label’, and the data in the column is called the ‘labeled data’. The predictor variables are ‘age’, ‘ed’ (level of education), ‘employ’ (years with current employer), ‘address’ (years at current address), ‘income’ (household income in thousands), ‘debtinc’ (debt to income ratio x 100), ‘creddebt’ (credit card debt in thousands), and ‘othdebt’ (other debt in thousands). The predictor variables are also called ‘features’. The remaining columns of data are not needed and will be discarded in the code below. When training, the feature values of the instances (rows) of data are fed as input to the neural net, and the weights and biases of the neural network are adjusted so as to minimize ‘loss’, which coarsely maps to maximizing accuracy of the neural network’s output layer predictions of the labeled data.

The first cell of the Jupyter Python notebook has to do some version of reading the CSV file. In my prior tutorial, I showed how to load a CSV file into a database and then load the data into a Pandas dataframe using a SQL query. For this tutorial, any version of pandas.read_csv() will suffice. For example, in IBM Data Science Experience on IBM Cloud, you can simply drag and drop the CSV file to add it as a dataset, and then select “Insert Code” to automatically generate the code to read the CSV file from cloud object storage. For larger datasets, you may prefer to use a SparkSession Dataframe instead, but in that case, you’ll need to slightly adjust the numpy extraction code in the next notebook cell.
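If you are running locally, a minimal sketch of that first cell is just a pandas.read_csv() call; the dataframe name df_data_1 matches what the next cell expects, and the file path is wherever you saved bankLoanData.csv.

import pandas as pd

# Read the sample loan data into a Pandas dataframe
df_data_1 = pd.read_csv('bankLoanData.csv')
df_data_1.head()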

The next cell of code below assumes a Pandas dataframe named ‘df_data_1’ and uses it to extract the data into numpy arrays needed as input to the TensorFlow API. The comprehension in the first np.array() removes instances (rows) that have a missing label.

import numpy as np

# Make a numpy array from the dataframe (remove rows with no value for 'default')
i = list(df_data_1.columns.values).index('default')
data = np.array([x for x in df_data_1.values if x[i] in ['0', '1']])

# Remove the columns for preddef1, predef2 and preddef3
data = np.delete(data, slice(9, 12), axis=1)

# Separate the 'predictors' (aka 'features') from the dependent variable
# (aka 'label') that we will learn how to predict
predictors = np.delete(data, 8, axis=1)
dependent = np.delete(data, slice(0, 8), axis=1)

The next cell reshapes the data just a bit. With the predictors now separated from the dependent variable, the labeled data is converted from strings to integers, and the dependent array is flattened to one dimension to match the shape of the data that will come from the neural network output layer. The predictors are converted to all floats to facilitate matrix multiplication with weights and biases within the neural network.

# Convert the label to categorical for binary classification
dependent = dependent.astype(int)
# And flatten it to 1D for use as the expected output label vector in TensorFlow
dependent = dependent.flatten()
dependent
# Convert all the predictors to float to simplify this demo TensorFlow code
predictors = predictors.astype(float)
# Get the shape of the predictors
m, n = predictors.shape

The next cell simply takes the first 500 instances as training data, leaving the remaining 200 instances for a test set. It’s not unusual to randomly select the training and test sets from the given data, but this particular sample was already random. It’s also typical to choose about a 70/30 percent split for training and test, and this code does so, except for rounding to a size divisible by the training batch size we’ll define later. This cell also defines a method that returns batch-sized slices of the training data. If the training data were too large to fit in memory, then this method could instead load data one batch at a time, such as with a SQL query.

m_train = 500
m_test = m - m_train
predictors_train = predictors[:m_train]
dependent_train = dependent[:m_train]
predictors_test = predictors[m_train:]
dependent_test = dependent[m_train:]

# Gets a batch of the training data.
# NOTE: Rather than loading a whole large data set as above and then taking array
# slices as done here, this method can connect to a data source and select just
# the batch of data needed.
def get_training_batch(batch_num, batch_size):
    lower = batch_num * batch_size
    upper = lower + batch_size
    return predictors_train[lower:upper], dependent_train[lower:upper]

Now we’re set to start with some actual TensorFlow code. This next cell imports TensorFlow, makes a few useful initializations, and then defines a method that will build a neural network layer of a given size, fully connect it to a preceding layer, and set its output activation function.

import tensorflow as tf

# Make this notebook's output stable across runs
tf.reset_default_graph()
tf.set_random_seed(42)
np.random.seed(42)

# A method to build a new neural net layer of a given size, fully connect
# it to a given preceding layer X, and compute its output Z either with
# or without (default) an activation function.
# Call with activation=tf.nn.relu or tf.nn.sigmoid or tf.nn.tanh, etc.
def make_nn_layer(layer_name, layer_size, X, activation=None):
    with tf.name_scope(layer_name):
        X_size = int(X.get_shape()[1])
        SD = 2 / np.sqrt(X_size)
        weights = tf.truncated_normal((X_size, layer_size), dtype=tf.float64, stddev=SD)
        W = tf.Variable(weights, name='weights')
        b = tf.Variable(tf.zeros([layer_size], dtype=tf.float64), name='biases')
        Z = tf.matmul(X, W) + b
        if activation is not None:
            return activation(Z)
        else:
            return Z

Now we can add the code cell that builds the neural network structure. In this case, we’re going to have one input layer (X), one hidden layer (hidden1), and one output layer (outputs). The ### comments show how to add more hidden layers, but with this sample data, we’re going to be able to learn everything we can with only one layer. The output layer has two nodes, one for outputting class 0 (the loan applicant won’t default) and the other for class 1 (the loan applicant will default). The ‘y’ variable will be used during training to store the labeled data we expect to match with the output layer.

One line of code that helps make this tutorial unique is the one that creates a tf.identity() node with the name ‘nn_output’. This gives the output layer a saved name so that we can recover and use the output layer after a restore.
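The cell itself isn’t reproduced here, so here is a hedged sketch of what it might look like based on the description above; the hidden layer size, the ‘neural_net’ name scope, and the placeholder shapes are illustrative assumptions rather than the tutorial’s exact values.

n_hidden1 = 12   # hidden layer size -- an assumed hyperparameter for this sketch
n_outputs = 2    # one output node per class (0 = no default, 1 = default)

with tf.name_scope('neural_net'):
    X = tf.placeholder(tf.float64, shape=(None, n), name='X')
    y = tf.placeholder(tf.int64, shape=(None,), name='y')
    hidden1 = make_nn_layer('hidden1', n_hidden1, X, activation=tf.nn.relu)
    ### hidden2 = make_nn_layer('hidden2', n_hidden2, hidden1, activation=tf.nn.relu)
    outputs = make_nn_layer('outputs', n_outputs, hidden1)
    # Name the output layer so it can be recovered by name after a restore
    nn_output = tf.identity(outputs, name='nn_output')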

The cell above and the next cell are where most of the hyperparameter tuning occurs. The neural network itself is just the algorithm. The input parameters passed to a neural network during inference are the feature values; during training, the input parameters are the feature values and the expected output labeled data. But the neural network is adaptable beyond those input parameters, and these configurable parts are called hyperparameters. The number and size of the hidden layers are among the hyperparameters, as is the activation function. For example, you can try other numbers and sizes of hidden layers, and ‘tanh’ and ‘sigmoid’ are other activation functions to try. However, the given configuration performs very close to the best achievable on this data.

What we’ve done so far is to create the main part of a TensorFlow compute graph that happens to have the shape needed for a neural network. What we’re going to do in the next cell below is attach two different root nodes to the output layer, one that adds functionality for training and the other for testing. The ‘training_op’ uses the gradient descent method for minimizing loss (of perfect confidence in the correct answers and zero confidence in incorrect answers, where the correct answers are provided by the labeled data that will be in ‘y’).
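As a sketch (not necessarily the tutorial’s exact cell), the two root nodes might be attached like this; the learning rate and the choice of sparse softmax cross-entropy loss and in_top_k accuracy are assumptions consistent with the description above.

learning_rate = 0.01  # an assumed value for this sketch

with tf.name_scope('loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=nn_output)
    loss = tf.reduce_mean(xentropy, name='loss')

with tf.name_scope('train'):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope('eval'):
    correct = tf.nn.in_top_k(tf.cast(nn_output, tf.float32), y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
saver = tf.train.Saver()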

Now we’re going to do a quick little cell that sets up our ability to save the model once it is trained. You only need to do these mkdir commands the first time you run the notebook, so you may want to put them in a separate cell to make it easier to skip them. Also, in Data Science Experience Local, you only need the second mkdir.
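For example (the path here simply matches the checkpoint path used in the training cell below):

!mkdir "../datasets"
!mkdir "../datasets/Neural Net"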

Now we can have the magic notebook cell that trains and saves the trained model. Each epoch of training exposes the neural net to the entire set of training data. When you see this code run, you will see accuracy increase over the many epochs, just as biological neural networks learn through repetition. For each epoch, we run through the training data in batches, to simulate how we’d handle a larger training set. Each batch of features and corresponding labeled data is fed to the ‘training_op’ root node in the compute graph, which is run by training_session.run().

One aspect of this tutorial that is made evident (relative to other tutorials) is the randomization of the training data that takes place at the beginning of each epoch. This essentially drives different data into the batches in each epoch, which dramatically improves accuracy over a larger number of epochs (though it is much easier programmatically to do this randomization when all data fits into memory).

# This is how many times to use the full set of training data
n_epochs = 3000

# For larger training sets, we can break training into batches so only the
# memory needed to store one batch of training data is used
batch_size = 50

with tf.Session() as training_session:
    init.run()
    for epoch in range(n_epochs):
        # Shuffling (across batches) is easier to do for small data sets and
        # helps increase accuracy
        training_set = [[pt_elem, dependent_train[i]] for i, pt_elem in enumerate(predictors_train)]
        np.random.shuffle(training_set)
        predictors_train = [ts_elem[0] for ts_elem in training_set]
        dependent_train = [ts_elem[1] for ts_elem in training_set]
        # Loop through the whole training set in batches
        for batch_num in range(m_train // batch_size):
            X_batch, y_batch = get_training_batch(batch_num, batch_size)
            training_session.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if epoch % 100 == 99:
            acc_train = accuracy.eval(feed_dict={X: predictors_train, y: dependent_train})
            acc_test = accuracy.eval(feed_dict={X: predictors_test, y: dependent_test})
            print(epoch + 1, "Training accuracy:", acc_train, "Testing accuracy:", acc_test)
    save_path = saver.save(training_session, "../datasets/Neural Net/Neural Net.ckpt")

Yet another reason why this tutorial is unique is that we’ll actually take a little sidebar to understand why, when doing business with a real stakeholder customer, we need to have a second test set, often called a validation set or a blind set. Why do we need a second test set? When I ask this, the usual reply is something like, “I don’t know, to double-check accuracy?” Well, sort of. But if you look at the structure of training, the weights and biases are affected not just by the training data. Indirectly, they are also affected by the test data, because we choose n_epochs to keep running training epochs until we get the best accuracy on the test set. In other words, we’re teaching to the test. The validation set or blind set has no such indirect effect on the weights and biases computed for the neural network. It is simply another test set that, to ensure construct validity, should be randomly selected from the same pool of data that the training set and test set are randomly selected from. In this way, the validation set is not just the ‘final exam’, it’s the first experience of the real world. Sidebar complete.

Once all training has been done, we save the trained model into the previously created datasets subdirectory. In this sample code, we are only saving at the end, but this same command can be used to save the intermediate results of a very long training run.

The data files that TensorFlow created during the save operation can be transported to a production environment. The neural network can then be restored using the code in the next cell below, and the output layer can be obtained and used for inference (using get_tensor_by_name()). In fact, showing how to do that is part of what makes this tutorial unique, as even the current TensorFlow documentation for save/restore (incorrectly) reuses variables after restore that were defined before save (rather than running variables obtained from the restored graph). The code below also shows how to reference into the hierarchy of a name scope.

As another sidebar unique in this tutorial, note that you can also use this method of naming with tf.identity and then getting the tensor from a restored graph to do transfer learning between neural nets. Specifically, once you create a hidden layer with make_nn_layer(), you can name it with tf.identity. Then, you train and save as shown above. Then, to transfer to a second neural net, you restore the trained and saved one, get the hidden layer by name, attach alternate fully connected hidden layers as needed, and an alternate output layer, and then train the new second neural network using the methodology above. Sidebar complete.
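As a rough, hypothetical sketch of that flow (it assumes the first net’s hidden layer was named before saving, e.g. hidden1_named = tf.identity(hidden1, name='hidden1_named'), which the code above does not actually do):

tf.reset_default_graph()
restorer = tf.train.import_meta_graph("../datasets/Neural Net/Neural Net.ckpt.meta")

# Get the input placeholder and the named hidden layer (hypothetical name) from the restored graph
graph = tf.get_default_graph()
X = graph.get_tensor_by_name("neural_net/X:0")
hidden1 = graph.get_tensor_by_name("neural_net/hidden1_named:0")

# Attach alternate layers on top of the restored hidden layer, then train the new
# net with the same training loop shown above; inside that training session, call
# restorer.restore(session, "../datasets/Neural Net/Neural Net.ckpt") so the first
# net's learned weights are reused.
new_hidden = make_nn_layer('new_hidden', 8, hidden1, activation=tf.nn.relu)
new_outputs = make_nn_layer('new_outputs', 2, new_hidden)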

The cell below mocks up having a REST API receive a batch of feature instances and convert them to a numpy array by simply taking a slice of the test data. With the inference TensorFlow session, the compute graph and the values it contained are restored. After that, we obtain the tensor corresponding to the neural network output layer by using the name we previously assigned. Then, we run the output layer, giving the batch of feature instances to the input layer ‘X’ (inference_session.run()). The predictions of the dependent variable are then obtained by choosing whichever of the two output layer nodes has the higher value (using np.argmax()).
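Here is a hedged sketch of that cell; the checkpoint path and the ‘neural_net/...’ tensor names follow the sketches above, and the exact names in the original notebook may differ.

feature_batch = predictors_test[:10]   # mock a batch of feature instances from a REST API

tf.reset_default_graph()
inference_session = tf.Session()

# Restore the graph structure and the trained variable values
restorer = tf.train.import_meta_graph("../datasets/Neural Net/Neural Net.ckpt.meta")
restorer.restore(inference_session, "../datasets/Neural Net/Neural Net.ckpt")

# Obtain the input placeholder and the output layer from the restored graph by name
graph = tf.get_default_graph()
X = graph.get_tensor_by_name("neural_net/X:0")
nn_output = graph.get_tensor_by_name("neural_net/nn_output:0")

# Run the output layer on the batch and choose the higher-valued output node
raw_output = inference_session.run(nn_output, feed_dict={X: feature_batch})
predictions = np.argmax(raw_output, axis=1)
print(predictions)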

Finally, to add one more feature that makes this tutorial unique, let’s look at how to get the actual confidence values for the predictions. Somehow, this may not have seemed as important when doing the MNIST hand-written digit tutorial, but in a business context it’s important to know how much confidence we have in an answer.

To get the confidence values, we do something interesting with the compute graph. Remember, it’s just a compute graph and won’t bite. In this case, we pop a new root node onto the output layer to apply the softmax function, which gives the probability of occurrence of each output value. Then, we do one last comprehension to ferret out the confidences of the predicted labels for each feature instance.
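Continuing in the inference session from the sketch above, the confidence computation might look something like this:

# Pop a softmax root node onto the restored output layer
softmax_output = tf.nn.softmax(nn_output)
probabilities = inference_session.run(softmax_output, feed_dict={X: feature_batch})

# Ferret out the confidence of the predicted label for each feature instance
confidences = [probs[pred] for pred, probs in zip(predictions, probabilities)]
print(list(zip(predictions, confidences)))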

And that’s a wrap. Now, it’s your turn. Go ahead, get started with that free IBM Data Science Experience account so you can amaze your friends and bosses with your newfound TensorFlow AI machine learning data science hyperparametric hyperwizardry. You know you waaaanna!

In this blog, I'm going to describe the steps you can take to have an IBM Data Science Experience with TensorFlow. You can use these steps to create a Jupyter Python notebook that

reads training data from a BigSQL table into a Pandas dataframe

uses TensorFlow to train a simple machine learning model with the data

saves the machine learned model to a local dataset within IBM Data Science Experience

restores the machine learned model from the local dataset

performs inferences with the restored machine learned model

As a preliminary, I downloaded the California housing data from the same source as the scikit learn fetch_california_housing() method. This gave me a CSV file with a sample dataset that maps house prices to several predictor variables such as house age, number of bedrooms, and municipal population. That scikit learn method does a lot of work to download the data and parse it into a numpy array needed as input to TensorFlow, but I'd like to make it a little less easy and instead describe the process as if we were getting the data from an Enterprise Data Lake or similar data source.

Once I had the raw CSV file (no post-processing by fetch_california_housing()), I used the internal IBM Enterprise Data Lake web browser interface to upload it to HDFS. Then, I used my Eclipse database development tool to create a hadoop table called HOUSINGDATA in a shared BigSQL database within the internal IBM Enterprise Data Lake. Then, I loaded the table with the CSV file content. The SQL code I ran looks like this:

Alongside the IBM Enterprise Data Lake, we have a deployment of IBM Data Science Experience Local, but you can replicate these experiences using the public cloud IBM Data Science Experience.

You can get started by pressing "Create Project" to create a workspace that you can work in and, if you like, share with others. The projects are backed by GIT, so you have a lightweight collaboration method in which you are able to commit and accept changes with others. You can also "Export" the project as a zip file containing all assets. Once you click on a project to get into it, you can create various kinds of assets, including data sources, data sets, Jupyter and Zeppelin notebooks in Python, Scala and R, etc.

The first thing I did was go to the project's "Data Sources" list so I could create a connection to the BigSQL database containing my HOUSINGDATA table. This is where I set just the "JDBC URL" and my user credentials. The JDBC URL looks something like this "jdbc:db2://SharedServer.ibm.com:52000/BIGSQL:sslConnection=true". My JDBC URL also contains a parameter that tells the path to an additional trust store containing more certificate authority certificates, but this is because I needed to point to a shared internal database. And my user credentials authenticate me with that database. I named the data source "MY SHARED DATABASE" and then hit "Create".

The next thing I did was go to the project's "Assets" list, and scrolled down to the "Data Sets" list so I could "add data set". I switched to "Remote Data Set". I gave the name "HOUSING DATA from MY SHARED DATABASE" to the dataset. Then, for the remote settings, I chose the "MY SHARED DATABASE" data source and set the table name to "HOUSINGDATA". In my case, my tables are stored in a schema associated with my user ID, to keep them separate from other users of the same shared database, so I also entered that schema name, then hit "Save" to create the data set.

Now that the data source and data set have been created in the project, we can use them in Python. Again in the project's "Assets" list, I hit "add notebook". I gave the notebook a name of "TensorFlow Sample" and chose Jupyter Python for the tool and language, then hit "Create".

Once I clicked on the notebook to get into it, I was able to insert automatically generated code to get my HOUSINGDATA into a Pandas dataframe. One can also choose a Spark Dataframe, but the Pandas dataframe was sufficient for this sample. I clicked the insert code "10"/"01" menu icon, and changed to the "Remote" list. On the "HOUSING DATA from MY SHARED DATABASE" data set, I clicked "Insert to code" and then clicked "Insert Panda DataFrame". This inserts Python code that performs a few relevant imports, makes the database connection, and performs a "select * from" query on the "HOUSINGDATA" table into a Pandas DataFrame. The important bits for being able to understand the TensorFlow code look like this:

Now, we start getting into interesting code. This first snippet just imports numpy and then extracts data from the Pandas dataframe into numpy arrays, which is what TensorFlow needs as input. Because we'll be training a simple model with only 20,640 rows of data, we're loading it all at once, but for larger training sets, you'd do this in smaller batches. The "housing_data" are the 20,640 values for each of the 8 predictor variables, and the "housing_target" is the vector of 20,640 house values that we will be machine learning how to predict. The remaining two lines are just a little housekeeping.
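The snippet isn't reproduced here, so here is a sketch under a couple of assumptions: the generated dataframe is named df_data_1, and its first 8 columns are the predictors with the house value in the 9th column.

import numpy as np

# Extract the predictors and the dependent variable from the Pandas dataframe
housing = df_data_1.values.astype(float)
housing_data = housing[:, :8]     # 20,640 rows x 8 predictor variables
housing_target = housing[:, 8]    # the 20,640 house values we want to predict

# A little housekeeping: record the shape and prepend a bias column of ones
m, n = housing_data.shape
housing_data_plus_bias = np.c_[np.ones((m, 1)), housing_data]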

So, now we're going to do the world's simplest machine learning model because we'd like to be able to see the elements of TensorFlow-based machine learning with as little problem complexity as possible getting in the way of understanding. The word 'tensor' just means n-dimensional array, and TensorFlow is a library that makes it easy to specify a computational 'flow' of tensors and then to execute that flow in the most efficient way possible given the compute power available to TensorFlow. In essence, the data scientist describes what computations must occur, and then TensorFlow determines how to do the computations efficiently.

We're going to start by defining the 'flow' or computation graph that TensorFlow will run. In this case, we're going to define the compute tree for training a multiple linear regression using the 8 predictor variables and the housing value variable that we'd like to learn how to predict. Here's what that looks like:
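A sketch of that compute graph, assuming the housing_data_plus_bias and housing_target arrays prepared above, is the closed-form least squares (normal equation) solution theta = (X^T X)^-1 X^T y:

import tensorflow as tf

X = tf.constant(housing_data_plus_bias, dtype=tf.float64, name="X")
y = tf.constant(housing_target.reshape(-1, 1), dtype=tf.float64, name="y")
XT = tf.transpose(X)
# theta is the vector of linear regression coefficients
theta = tf.matmul(tf.matmul(tf.matrix_inverse(tf.matmul(XT, X)), XT), y)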

The X variable is the matrix of 8 predictors by the 20,640 samples. XT is a transpose needed in the linear regression computation. The 'y' variable is the dependent variable, and it is assigned to the 20,640 housing values we have in the training data. The 'theta' variable is the vector of linear regression equation coefficients that will result from the series of matrix operations on the righthand side formula.

It is important to note that the code above just specifies the compute graph, i.e. the tensor flow. To perform the flow, you then run the following code:

# Run the compute graph
with tf.Session() as sess:
    theta_value = theta.eval()

If you then run a line of code to output theta_value, you will get an output like this:

For a linear regression, this is the machine learned model. It gives the coefficients of a linear equation that is the best fit to the training data. Given values for the 8 predictor variables like house age and number of bedrooms, these coefficients can be used to predict a house value. We'll see how to do that below, but first, we're going to see how to save and reload the model in TensorFlow because you would typically want to save a model trained in IBM Data Science Experience and then transport it to a production deployment environment, where you'd want to restore it before actually using it for inference (prediction).

The first time I ran my notebook in IBM Data Science Experience, I used this line to create a subdirectory in datasets where I could save the TensorFlow model from this notebook:

!mkdir "../datasets/Linear Regression"

Then, to save the model, I defined a second simple TensorFlow compute model that just assigned the theta_value vector to a variable called "model". The code below creates and then executes this simple tensor flow, and then saves the result in the subdirectory created above.
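A sketch of that save cell might look like this; the variable name "model" follows the description above, and the checkpoint filename is an assumption that matches the mkdir path.

# Assign the computed theta_value to a variable, run that tiny flow, and checkpoint it
model = tf.Variable(theta_value, name="model")
init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    init.run()
    save_path = saver.save(sess, "../datasets/Linear Regression/Linear Regression.ckpt")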

The save method we're using here is useful to know about because it is the same "checkpoint" method that you would use if you were incrementally training a larger model in epochs. It's also useful to understand that what we're saving is the compute graph and the tf.Variable TensorFlow variables and values defined in the model we're checkpointing. In other words, what gets saved is specific to the type of model you're training because the type of model affects the compute graph, or tensor flow, that you specified. In a neural net, for example, you'd have to save the structure of the net in addition to the weights and biases. For a linear regression, we already know the structure is a linear equation, so we just save the coefficients. Regardless of what is being saved, TensorFlow actually saves four files, as shown by the line of code below and its output:

At last, you can now perform inferences using the 'theta_value' vector. To simulate making a prediction in the code below, I've used the 0th row of the housing_data for the values of the predictor values. I initialize 'predicted_value' to the constant coefficient of the linear equation, and then the remaining coefficients of the theta_value are placed in 'linear_coefficients' to make the loop easier to read. The loop then multiplies each predictor variable value housing_data[0][j] by the corresponding coefficient (each coefficient 'c' in the for loop iteration of linear_coefficients is, unfortunately, an array of size 1, so c[0] is used to get the actual value of the coefficient).
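A sketch of that loop, matching the description above (theta_value is the coefficient vector computed or restored earlier):

predicted_value = theta_value[0][0]      # the constant coefficient
linear_coefficients = theta_value[1:]    # coefficients for the 8 predictor variables
for j, c in enumerate(linear_coefficients):
    predicted_value += housing_data[0][j] * c[0]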

If you now run a line of Python code to see the value of predicted_value, you will get output like this:

411111.09606504953

Finally, it's worth noting that, for a larger kind of model, you can also use TensorFlow to perform the inference. But because this is a linear regression involving only 9 coefficients, using TensorFlow would probably just slow it down. Still, it is an easy tensor flow to write... an exercise for the reader!

I recently ran across this article (https://lctech.vn/blog/ibm-watson-compares-trumps-inauguration-speech-obamas/). It describes the author's attempt at a comparative analysis of the personalities of Barack Obama and Donald Trump based on applying the IBM Watson Personality Insights API to their US Presidential inauguration speeches. The article has many charts, figures and analyses, according to various capabilities of the API. But these cannot make up for the logical fallacy under which the API was applied in the first place.

UPDATE: Even as I was publishing this article, a similar misuse of the IBM Watson Personality Insights API was reported by CNBC (http://www.cnbc.com/2017/07/17/tim-cook-is-silicon-valleys-most-imaginative-ceo-says-ibm-data.html). The analysis produced results such as that Apple's CEO Tim Cook is Silicon Valley's most imaginative tech leader and that Microsoft's CEO Satya Nadella is one of the most assertive tech leaders. These are non sequiturs (they may be true or false, but the analysis doesn't actually establish the truths that it asserts).

One of the most important principles in data science is that the test set for a machine learned model must be a good representative of the expected usage of the machine learned model. Otherwise, the accuracy of the machine learned model on the test set will have little to do with its accuracy in practice. In the field of psychometrics, this principle actually has a name: construct validity. Generally, it makes sense to take cues on measuring machine learning from the vast experience of educational psychologists who measure human learning.

A corollary principle in data science is that the training set for a machine learned model must be consistent with the test set. Otherwise, the machine learning algorithm will not be likely to learn the construct that the test set tests. In fact, it's not uncommon to draw the test set randomly from the training set, in which case the two sets are likely to be consistent, and the challenge reduces to determining whether the training and test sets provide a good representation of the intended use case. Essentially, data scientists spend a lot of time thinking about and working on training set quality in order to attain high construct validity.

But, if you are a data science consumer, then you have to think about these principles in reverse. If you are a software developer who uses an API that offers the inferential function of a machine learned model trained by a data scientist or data science team, then you are a consumer of their data science results. Such is the case when you use IBM Watson Personality Insights.

According to this source, the API was trained based on mapping personality test results with the linguistic patterns of 200 tweets from the 600 participants. There is no evidence to suggest that our tweet writing is linguistically consistent with how we write emails, blogs, or other documents, much less speech transcripts from US Presidents' inauguration speeches or CEO speeches. For one thing, we know that except for tweet storms, successive tweets aren't necessarily all that much related to each other. But the sentences and paragraphs of these other forms of writing are much more logically and sequentially connected together. After all, that's why we have speech writers.

By comparison, if your use case is to determine personality traits of, say, a prospective customer or employee, based on their Twitter feed, then you're more likely to be appropriately using IBM Watson Personality Insights API.

In the case of this API, there are further questions that a psychologist would ask, and therefore that you should ask, too. In particular, the training data was drawn from a sample of 600 participants. But, are those participants representative of the target population on whom you will be doing the inferences with the API? For example, if your prospective customer or employee base comes from, say, the fashion industry, and if the training data participants came dominantly from, say, the tech industry or even from the population at large, then your results with the API may be significantly affected by the difference. Do your best to find out the demographics of the training data participants and your target population to see if there are mismatches. There are other similar questions. Are members of your target population more prone to tweet storms, retweeting, and/or replying to tweets than the training sample population? All of these tendencies are reflective of personality traits, so if there are differences between the training sample and the target population, then you may not be able to use the API.

For any API, you as a software developer are practicing a basic form of data science by checking these issues because you are ensuring construct validity between the inferences in your use case and the training data for the machine learned model you are consuming.

In today's cognitive computing products and techniques, the perception of greater intelligent responsiveness comes not so much from having true explanatory power, but rather just having strong predictive power over increasingly chaotic and larger data sets.

To give us a third point of reference besides linear regression and neural nets, I'll use some other terms to bring the focus to natural language processing. In 2011, the IBM Watson system demonstrated greater intelligence than the best human opponents in the domain of linguistically challenging factual Q&A. This was based on the ability to quickly produce high confidence answers from a large corpus of unstructured information in response to challenging questions.

The linguistic product that is now based on that system is called the IBM Watson Engagement Advisor. As with other cognitive computing techniques, the product must first be trained to be an effective system in the target domain. The corpus of unstructured information often takes the form of documents, such as instruction manuals, technical reports, journal articles, and wiki pages. During training, the most important entities and relationships expressed in the documents are identified and stored in order to expedite later search and retrieval during Q&A interactions with users of the system. The identification process within a document is often called annotation, and the annotation and storage processes together are called ingestion.

The most important concept in understanding the training is to understand what really drives the identification, or annotation, of the documents. It's simple, really. It's a Q&A linguistic product, and the annotation and ingestion expedite the production of the A's in response to the Q's, so it is imperative to have a strong and large representative sampling of the potential questions in order to train and test the efficacy of the system. The questions encode the key concepts (e.g. entities, relationships and so forth) to which users of the system are interested in getting answers. Annotators for these key concepts are developed and, during ingestion, they are executed upon the documents.

This is the very basic level of explanation about how the linguistic product would learn, or be trained, to be an effective cognitive computing system for a domain, and future entries will dive further into this topic. For now, let it suffice to say that ingestion and training result in a system capable of producing answers from the corpus in response to questions like those used in training.

During a run-time Q&A session with the system, the user begins by posing a natural language question. The question is first analyzed to find the key concepts, and then a multiphased approach is used to dig up the best results from the ingested corpus content. As with training, there's a lot more to be said over time about how the run-time Q&A works, so more interesting future entries to come, and in fact, it's intrinsically related to the training anyway. To conclude and tee up these future entries, I'll say the high order bit here is that a trained Q&A linguistic product seems somehow more intelligent than a linear regression or even a typical neural net application. Why is that? To get a bit more background for that explanation, I'd encourage you to visit or revisit a few of my earlier blog entries about cognitive computing. Compare your perceptions of the intelligence of the minimax algorithm in [1] with the linear regression method in [2], and compare [2] with the neural net in [3]. What's changing?

The speaker says that a challenge with neural nets in business applications is that they are black boxes, meaning that you can understand the inputs and the outputs but not really how the outputs are derived. Later, the speaker says that linear regression is a preferred technique because it has very strong predictive and explanatory power.

It's not really true that linear regression has more explanatory power than neural nets. Rather, it is easier to understand the problems and the answers that can be solved by linear regression. By comparison, neural nets tend to be used to provide cognitive computing power to harder problems than linear regression can solve.

To put this another way, when you use linear regression, you actually begin by assuming linearity of the relation you want to predict. As the speaker points out, you can also make a non-linear assumption, and you can accommodate this using a data transformation, for example. But the high order bit is that you are asked to assume the data relationship, and that assumption is what is giving you the illusion of explanatory power. You can explain that the data follows a line, but this is due to your own assumption. Note that an important aspect of completing a linear regression model is determining the R2 or goodness of fit of the model. This is the part where you make sure that your assumption of linearity is valid. And if the assumption is invalid, then the model has no predictive value, so it does not matter that you can explain how it operates.

Under the interpretation that explanatory power is akin to predictive power, it turns out that neural nets have greater predictive power because they can produce results for a wider array of applications than linear regression can. There's a neat table that relates the cognitive power of a neural net to the number of hidden layers. From the table, you can see that when a relationship actually is linear, a neural net can solve it without even using any hidden layers of neurons. When one or two hidden layers of neurons are present, neural nets transcend the capabilities of linear regression, in part because they do not require you to make any assumption about what the data relationship actually is.

And that's where the confusion comes in. The linear regression model requires you to assume linearity, so you know at least what geometric shape the relationship looks like. The neural net requires no such assumption, but neither does the trained neural network give you any hint of what the relationship is. This lack of knowing the relationship is confused for having less explanatory power.

But if you look at this a bit more abstractly, the trained linear regression model has the same exact problem of not providing any additional insight. A neural net is really just a pile of numbers giving constant weights to the neural connections that can convert inputs to outputs. Similarly, a linear regression model is just a pile of numbers that give constant weights to inputs to be linearly combined into an output. Sure you know the data relationship, but that's because you assumed it. The actual linear regression model gives you no insight into why one dimension has a large slope constant where another has a small slope.

An analogy I like to use is that the value of the neural net is not diminished by our inability to explain how it is that the little gray cells which implement our personal neural nets can produce the cognitive results that they do, and who among us would prefer to have cognitive powers defined by linear regression instead?

In terms of explanatory power, our biological neural nets perform an additional key function that we have not hitherto been able to achieve with artificial neural nets. We are able to construct additional information in the output that reveals causal relationships, or insights into the reasons for the phenomena it predicts. Put simply: we say why something is true. We provide a rationale. This is an aspect of explanatory power that, when achieved, dramatically increases the value and utility of any cognitive analytic. Theorem provers and Prolog programs have been able to do this for the applications to which they apply. In the area of unstructured information processing and data mining, you can see a demo of this concept in Watson Paths.

As an interesting possible counterexample to my last blog about MLR models not understanding the knowledge they learn, consider the neural network. Our brains are neural networks, and we are capable of learning at all levels of Bloom's Taxonomy, not just the knowledge level. Shouldn't artificial neural networks be able to achieve the same things?

The answer is no, not really. Our brains biologically, chemically and physically perform in ways that we scarcely understand, so our name for the thing we call "artificial neural network" is no less anthropomorphizing than when we say that a computer program of today "understands" anything.

Still (again), this is not to say that they aren't incredibly useful and effective. It's just that they are based on straightforward and well-understood mechanical methods such as feed-forward activation of neural outputs via sigmoidal threshold functions applied to inputs, and back propagation of synaptic weight adjustments based on easily quantified classification errors. Before going any further, let's have a quick look at a diagram of an artificial neural network (ANN):

The ANN has an output layer on the right that is a classifier for input patterns received on the left. For example, an ANN for optical character recognition could have an input layer of an 8x8 matrix of bits, and the output layer could be an 8-bit code that indicates an ASCII character. The hidden layer(s) of neurons help the ANN to represent more sophisticated phenomena, though there is seldom need for more than one hidden layer. The "synaptic" connections between the neurons in the layers are weighted numbers, and the neurons apply the weights to the inputs and then feed the results into a Sigmoid function that essentially decides, like a transistor or switch, whether or not to fire the output.

An ANN is "trained" by giving it a sequence of input patterns for which the correct output pattern is known. The input pattern feeds forward through the ANN to produce an output. If there is a difference between the ANN output and the correct output, then the differential error is back propagated through the ANN to adjust the weights so that future occurrences of that input pattern are more likely to produce the correct output.

The synaptic weights, then, essentially represent the knowledge that the ANN "learns" from the input patterns. This is analogous to the constants that are "learned" by an MLR model. In fact, all elements of the ANN and MLR model architectures are analogous. The ANN input layer maps to the independent X variables, the ANN output layer maps to the dependent Y variable, and the transition from input X values to the Y value that is achieved in MLR by multiplication and addition is achieved by a feed forward through synaptic connections, hidden layer neurons and Sigmoid functions in an ANN.

With such a one-to-one architecture mapping between ANNs and MLR models, it is easier to see them as having similar intellectual power. That's not to say they're equivalent, as ANNs are far more powerful. It's just that they're roughly the same (low) order of magnitude with respect to human intellect, and in terms of Bloom's Taxonomy, we call that order of magnitude "knowledge storage/retrieval".

Despite being in the lowest order of magnitude of intellect, the realm of today's artificial intelligence includes many interesting knowledge storage/retrieval techniques that are worth comparing and contrasting to see the range and limits of their power and the use cases they address. Stay tuned!

Ever since my first blog entry in this recent series on artificial intelligence, I've been highlighting the lesser, calculational nature of machine intelligence and learning-- as well as the valuable role it nonetheless can play in driving more effective human understanding and decisions. I've been doing this by articulating mainly what machines do, as that is the primary interest of mine and most who would read a developerWorks blog. Still, our interests will be served by taking an entry to discuss human learning as a counterpoint or contrast.

The multiple linear regression example in my last post is a good example to start with because it highlights the difference between accuracy versus understanding. If there is a linear relationship among the data, then an MLR can have a very high predictive accuracy, but it has no explanatory power whatsoever. The MLR model does not have, nor does it convey, any understanding as to why the relationship exists.

Let's see how this predictive accuracy rates in terms of human intelligence and learning. In this case, we can benefit from an instance of that delightful human propensity to apply ideas to themselves. Specifically, we humans have applied our learning abilities to the phenomenon of our learning abilities, with many useful results including Bloom's Taxonomy.

According to Bloom's taxonomy, the very lowest level of cognitive learning is the knowledge level, or the ability to remember and recall what is learned. When you think about it, you realize that an MLR model, like many predictive analytics, is really a storage mechanism for something that has been machine learned from data. In MLR, we store the constants of a linear formula as the representation of what has been learned from linearly related data.

The next higher level of Bloom's taxonomy is comprehension, which is where understanding and true explanatory power begin to surface. But human learning is so much more sophisticated than the knowledge level of machine learning that there are a number of levels above comprehension. There's the application level, in which we can use our knowledge to solve new problems, including being able to explain why the new solution works. The analysis level drills deeper into our ability to make inferences and generalizations. The synthesis level begins to get at our ability to be creative with what we've learned and come up with new ideas and solutions. Finally, the evaluation level gets at our ability to be subjective and judge quality and creativeness of ideas and solutions. We are beginning to see some faint glimmers of some elements of some of these levels in cognitive computing efforts like IBM Watson, but it is early days indeed.

While we're on the subject of human learning and Bloom's Taxonomy, it makes sense to digress for a bit and mention the IBM Social Learning product. This is a SaaS educational platform intended to help enterprises achieve a Smarter Workforce. A few reasons for the digression are

learning is a key ingredient of how a human workforce becomes smarter.

The IBM Social Learning product has a very nice feature that enables educational administrators to implement Bloom's Taxonomy in their learning materials. A component of the product is the Kenexa LCMS, or learning content management system, which includes various subcomponents like a course designer and a metadata dictionary. The educational administrator can add any metadata tag, such as "Learning Goal", and any tag values, such as "Basic Knowledge", "Comprehension", "Application", etc. Once this is done, the educational administrator can use the metadata tag values to classify any learning item in the LCMS according to Learning Goal. Once these classified learning materials are published, learners can use the "Learning Goal" as a new faceted search criterion in the platform's learning library. A learner would be able to isolate and focus on "knowledge" level learning in a subject area before proceeding to comprehension and then application, for example. This will enable learners to effectively use the natural way in which their learning blooms, i.e. Bloom's Taxonomy.

Finally, there is an aspect of human learning that goes beyond Bloom's taxonomy, and it's an area that is highlighted by the IBM Social Learning product. There is a very important word in the product title: Social. This is crucial because it underscores the central role of communication and collaboration in the human learning process. We are an order of magnitude more effective at learning based on our interconnectedness to others who think and learn, rather than having access to just data. This is pertinent to the advancement of artificial intelligence because "social" goes quite beyond the computing architecture underlying a lot of today's machine learning efforts.

Machine learning today is every bit as calculated, as simulated, as is machine intelligence. It is easier to use machine intelligence to highlight how much greater human cognition is, which is why I've been using a machine intelligence algorithm over the last several entries. However, the conclusion drawn so far is that, while machine intelligence is only simulated, it is still quite effective and valuable as an aid to human insight and decision making. Machine learning offers another leap forward in the effectiveness and hence value of machine intelligence, so let's see what that is.

Machine learning occurs when the machine intelligence is developed or adapted in response to data from the domain in which the machine intelligence operates. The James Blog entry only does this degenerately, at a very coarse grain level, so it doesn't really count except as a way to begin giving you the idea. The James Blog entry plays a game with you, and if he loses, he adapts by increasing his lookahead level so that his minimax method will play more effectively against you next time. In some sense, he learned that you were a better player. However, this is only a single integer of configurability with only a few settings of adjustment that controls only one aspect of the machine intelligence algorithm's operation. To be considered machine learning, a method must typically have a more profound impact on the operation of the algorithm, with much more adaptation and configurability based on many instances of input data. An example will clarify the more fine grain nature of machine learning.

The easiest example of which I can think is a predictive analytic algorithm called linear regression. Let's say you'd like to be able to predict or approximate the purchase price of a person's new car based on their age. Perhaps you want to do this so that you can figure out what automobile advertisements are most appropriate to show the person. Now, as soon as you hear this example, your human cognition kicks in and you rattle off several other likely variables that would impact the most likely amount of money a person is willing to spend on a car, such as their income level, debt level, nuclear familial factors, etc. This analytic technique is typically called multiple linear regression (MLR) exactly because we humans most often dream up many more than two variables that we want to simultaneously consider. Like most machine learning techniques, MLR does not learn of new factors to consider by itself. It only considers those factors that a human has programmed it to consider. When they are well chosen, additional variables typically do make an MLR model more effective, but for the purpose of discussing the concept of machine learning, the simple two-variable example suffices since your mind will have no problem generalizing the concept.

Suppose you have records of many prior car purchases, including a wide and nicely distributed selection of prices of the cars and ages of their buyers. This is referred to as "training data". If you plotted the training data, it might look something like the blue points in the image below. Let purchase price be on the vertical Y axis since it is the "dependent" variable that we want to predict, and let age be on the X-axis since it is a predictor, or "independent" variable. MLR uses a standard formula to compute a "line of best fit" through the given data points, again like the one shown in red in the picture.

A line has a formula that looks like this: Y=C1X1+C0, where C1 is a constant that governs the slant (slope) of the line, and C0 is a constant that governs how high or low the line is (C0 happens to be the point where the line meets the Y-axis, and the line slopes up or down from there). If we had more variables, then MLR would just compute more constants to go with each of them. For example, if we wanted to use two variable predictors of a dependent variable, then we'd be using MLR to create a line of the form Y=C2X2+C1X1+C0.

Technically, MLR computes the constants like C1 and C0 of the line Y=C1X1+C0 in such a way that the line minimizes the sum of the squares of the vertical (Y) distances between each data point and the line. For each point, we take its distance from the line as an amount of "error" in the prediction. We square it because that gets rid of the negative sign (and, less importantly, magnifies the error resulting from being further from the line). We sum the squares of the errors to get a total measure of the error produced by the line, and the line is computed so as to minimize that total error.

Once the constants have been computed, it is a trivial matter to use the MLR model as a predictor. You simply plug the known values of the predictor variables into the formula to compute the predicted Y-value. In the car buying example, X1 is the age of a potential buyer, and so you multiply that by the C1 constant, then add C0 to obtain the Y-value, which is the predicted value of the car.
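As a toy sketch of the car-buying example (the numbers below are made up purely for illustration):

import numpy as np

ages = np.array([22.0, 31.0, 40.0, 47.0, 55.0, 63.0])                      # X1: buyer ages
prices = np.array([14000.0, 21000.0, 27000.0, 30000.0, 34000.0, 38000.0])  # Y: purchase prices

# Line of best fit Y = C1*X1 + C0, computed by least squares
C1, C0 = np.polyfit(ages, prices, 1)

# Predict the purchase price for a new buyer by plugging their age into the formula
new_age = 35
predicted_price = C1 * new_age + C0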

In this way, hopefully you can see that the MLR "learns" the values of the constants like C1 and C0 from the given data points. Furthermore, the actual algorithm that produces the machine intelligence only computes the result of a simple linear equation, so hopefully you can also see that the predictive power comes mainly from the constants, which were "learned" from the data. In the case of the minimax method, most of the machine intelligence came from the algorithm, but with MLR-- as with most machine learning-- the machine intelligence is for the most part an emergent property of the training data.

Lastly, it's worth noting that there are a lot of "best practices" around using MLR. However, these are orthogonal to the topic of this post. Suffice it to say that just like the minimax method has a very limited domain in which it is effective as a machine intelligence, MLR also has a limited domain. For example, the predictor variables (the X's) do need to be linearly related to the dependent variable in reality. However, within the limited domain of its linearly related data, MLR is quite effective and an excellent example of a simple machine learning technique that produces machine intelligence within that domain.

In the interest of space last time, I had to leave out an advanced topic on optimizing a "next best action" algorithm. Again, you can look at the full source we're discussing by just using the web browser's View Source on this page.

The optimization is known as alpha-beta pruning. In the code snippet below, you see that we break the j-loop that is scoring the response moves of a given move based on some condition involving the variables alpha and beta. Why does it make sense to stop looking at the competitive response moves for a given move? To see why, I've added the function declaration so we can discuss where the alpha value comes from and what it means.

Understanding alpha-beta pruning requires you to take a more global view of the recursion that is doing the evaluation. The alpha values passed into scoreMove() are the beta values from the calling level of the Minimax algorithm. It will help to keep at least the player's moves and the opponent's responses in mind as we go through this.

Let's say that scoreMove() has been called to score a player's Kth move. Beforehand, moves 1 to K-1 will have been fully explored by depth-first recursion, including the opponent's responses, the player's counter-responses, and so on. The alpha value received by scoreMove() for move K reflects the best fully explored "net" score for the player on moves 1 to K-1. Within scoreMove(), we first compute the raw benefit of the new move K, storing the result in moveScore. Now comes the alpha-beta pruning trick. The j-loop successively explores each opponent response move for the player's move K, and clearly the beta value takes on the value of the highest scoring response move that the opponent can make. The final score for move K is the raw benefit to the player of move K minus the benefit beta that the opponent can realize in response.

Thought-provoking question: Do we really need to know the absolute best move that the opponent can make in response to the player's move K? Or do we just need to find an opponent move that is good enough that, when subtracted from the raw benefit of move K, proves that the player would be better off choosing the earlier move associated with the alpha value? Of course, the answer is that we only need a good enough opponent move, and this is why we break the j-loop when we find that move. If we were to continue the j-loop, all we do is unnecessary work that might (or might not) find an even better opponent response move that would make move K look like an even worse decision for the player. But there is no need to do this extra work. Once the expression "(moveScore-beta < alpha)" becomes true, we have proven that move K is less beneficial than one of the moves 1 to K-1.

From a practical standpoint, this optimization averages better than double the run-time performance of the "what-if" logic. Who doesn't want double, right? Well, this "what-if" analysis is a combinatorial explosion of analysis; to put that in perspective, you get less than one extra move of lookahead due to this optimization. Yet despite this dash of cold water about how much deeper you could take the "what if" logic due to alpha-beta pruning, it remains true that, for a given level of explorative depth, everybody wants the result twice as fast or more, so alpha-beta pruning is very handy.

This entry is for developers who want a good mental model for how a prescriptive analytics algorithm can simulate intelligent behavior. We'll focus on the intelligent behavior in the James Blog entry, since it is quite competitive with humans. Reminder: Just hit "view source" in your browser to get the code we're talking about here.

The first thing to note is that the domain of the intelligence is quite constrained and circumscribed relative to the full realm of human intellectual endeavor. This is what makes it computationally feasible to perform a "what if" analysis to "imagine" possible scenarios and determine a next best action. Here's roughly how it works. The computer's available next actions are examined and measured for their immediate benefit. Then, for each action, the response action of the opponent is measured for its immediate benefit to the opponent, and so on. Once the real benefit of each opponent move is tabulated, the value of the best opponent action is subtracted from the immediate value of a given computer move. The best computer move is determined as the highest value move resulting from the immediate benefit minus the score of the best opponent move.

One thing I like about the game Kalah is that it is really easy to explain the competitive algorithm, relative to harder games like Chess. In Chess, evaluating the immediate benefit of a move can be challenging, especially at the beginning of the game. It's not just about the value of the piece you take because many moves don't take pieces. The value of a move is often about gaining control over spaces of the board to limit the opponent's attack and defense options. But in Kalah, you get good intelligent game play from a much simpler board evaluation. The value of a move is simply a matter of how many seeds you gain by that move.

This code (at the beginning of KalahGame.scoreMove) just copies the current board, makes the proposed move for the given player, then evaluates the new board value minus the value of the old board configuration for the given player. In effect, you get the number of seeds gained for the player by the move.
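As a rough illustration, here is a hedged sketch of that opening block; makeMove() and boardValue() are hypothetical helper names standing in for whatever the actual source uses.

```javascript
// Hedged sketch of the opening of KalahGame.scoreMove(): copy, move, measure the gain.
var newBoard = board.slice();                  // copy the current board (assumed to be an array)
makeMove(newBoard, move, player);              // apply the proposed move for the given player
var moveScore = boardValue(newBoard, player)   // value of the new board for the player...
              - boardValue(board, player);     // ...minus the value of the old configuration
```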

This is where things get interesting. The move scoring then becomes recursive: each valid move of the opponent is evaluated by recursively calling the move scoring method. Like this:
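The actual code isn't reproduced here, so the sketch below mirrors the structure the next paragraph walks through; firstHouse() and lastHouse() are assumed helpers, and the real member names may differ.

```javascript
// Hedged sketch of the recursive response loop inside KalahGame.scoreMove().
player = 3 - player;                        // first line: flip between player 1 and player 2
var beta = -1000000;                        // second line: any real move beats this starting score
for (var j = firstHouse(player); j <= lastHouse(player); j++) {
    if (newBoard[j] > 0) {                  // non-zero seeds to pick up: the move is valid
        var responseScore = this.scoreMove(newBoard, j, player, level - 1, beta);
        if (responseScore > beta)
            beta = responseScore;           // this response becomes the new best ("beta")
        if (moveScore - beta < alpha)
            break;                          // the alpha/beta business (an optimization; safe to ignore here)
    }
}
return moveScore - beta;                    // immediate benefit minus the best opponent response
```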

The first line is just a trick to switch between player 1 and player 2 in the levels of recursion. The "beta" value is the highest scoring move of the opponent so far, so once we switch to the opponent player in the first line, the second line just sets a large negative score so that the loop will start by selecting the first available move as being a good idea. The j loop tries each move, and the if test on the succeeding line just ensures that there is a non-zero number of seeds to pick up -- in other words, it ensures the move is valid. Then, the opponent's move is scored by recursively calling KalahGame.scoreMove(). When the recursion returns, the succeeding if test checks whether the move is better than the best result so far, stored in "beta". If it is, then this move becomes the new "beta". The alpha/beta business at the end of the j loop is an optimization that can be safely ignored for now. Once the j loop has examined all the moves, the best opponent move score "beta" is subtracted from the immediate benefit value of the player's move.

This is how each of a player's possible moves is scored in Kalah.getBestComputerMove(): the move's immediate benefit in the number of seeds scored, minus the best value obtained from a recursive lookahead of possible opponent responses -- a lookahead that accounts for the player's responses to the opponent, and the opponent's responses in kind, and so on down to the limit of the lookahead level.
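Tying it together, here is a hedged sketch of that top-level selection; again, the structure and names (firstHouse, lastHouse) are illustrative assumptions rather than the page's literal code.

```javascript
// Hedged sketch of the top-level move selection, in the spirit of Kalah.getBestComputerMove().
function getBestComputerMove(board, player, lookahead) {
    var bestMove = -1;
    var bestScore = -1000000;                // best fully explored net score so far (the "alpha")
    for (var m = firstHouse(player); m <= lastHouse(player); m++) {
        if (board[m] > 0) {                  // only houses with seeds are valid moves
            var net = scoreMove(board, m, player, lookahead, bestScore);
            if (net > bestScore) {
                bestScore = net;             // new best net score...
                bestMove = m;                // ...and the move that produced it
            }
        }
    }
    return bestMove;                         // the highest immediate-benefit-minus-best-response move
}
```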

The fun bit of this code is that it is used not only to determine the computer's best move, but when you ask for the "Expert Advisor" to help you, it applies exactly the same logic to *your* board position in order to determine a recommended next move for you.

To conclude, here is a small diagram to help you see what is going on.

In this example, we're near the end of the game, and Player 1 must decide whether to make move 2 or move 4. With move 2, there is an immediate benefit of 4 seeds because the 1 seed lands in an empty house, allowing the player to score that seed as well as the 3 seeds in the opposing house. This seems like a good idea, but is it? Well, Player 2's moves should be examined. In the short term, Player 2 can only respond with move 5, but this spreads out the 4 seeds. If you look ahead to the end of the game, you can see that Player 2 will ultimately score all four of those seeds. But also in the recursion, it is unavoidable that Player 1 will be able to score the remaining two seeds on the top row of houses. So the net benefit to Player 1 of making move 2 is only 2 seeds: the immediate 4 seeds, minus the 4 earned by Player 2 in the rest of the game, plus the 2 additional seeds that Player 1 earns in the rest of the game. Not as good as it initially looked. However, it does turn out to be better than move 4 for Player 1. Move 4 yields no immediate seeds for Player 1. Then, in the rest of the game play, Player 2 is able to earn 7 seeds, and Player 1 only earns 3 seeds. So, if Player 1 makes move 4, the opponent gains 4 more seeds than Player 1 does.
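For concreteness, here is the arithmetic from that example spelled out (the seed counts come straight from the description above):

```javascript
// Net benefit to Player 1 of each candidate move, per the worked example.
var move2Net = 4 - 4 + 2;   // immediate +4, minus 4 that Player 2 later earns, plus 2 for Player 1 => +2
var move4Net = 0 - 7 + 3;   // immediate  0, minus 7 that Player 2 later earns, plus 3 for Player 1 => -4
// Move 2 nets +2 while move 4 nets -4, so move 2 is the better choice for Player 1.
```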

Well, that's a wrap for this explanation of the 2-party competitive algorithm known as the "Minimax" method. Hopefully you can now see that it's not real intelligence but rather just a tabulation of best outcomes according to a scoring method, constrained to a set of rules for determining valid next moves. Demystified, it becomes no more surprising that the algorithm defeats humans than it is when an algorithm beats a human at calculating the square of a 5-digit number.

Still, this is roughly what a person does. Time and again, new possibilities are "imagined" by testing "what if" this move is made or that move is made. And the algorithm does win a lot of games, which is precisely why prescriptive analytics algorithms are so valuable as expert advisors. If you take the material covered here up by an order of magnitude, you get IBM Deep Blue. Another order of magnitude, and you get IBM Watson. The sky's the limit!

David Lee Roth and Eddie Van Halen have been trying to get us to do it for decades: "JUMP!" Douglas Hofstadter would qualify that with "... out of the system!" Here's what that means.

Machine intelligent entities like James Blog exist within a certain system, conforming to a prescribed set of rules, and they really can't escape the confines and constraints of that programming. Within that limited domain, they do calculate wonderful results that can seem intelligent. In an early version, I found myself adding a logger so I could see why James Blog was not making some moves that seemed very good. Time and again, I would find that the good move now set up the conditions for a better opponent move later, which is exactly what the artificial intelligence is supposed to detect and avoid.

The algorithm does this so well that it is really hard to beat, especially on the maximum lookahead value I set, which was 6. Frankly, if you're new to this game, you have to work to beat even the initial lookahead level setting of 2, which means that James Blog only looks at its own moves and your countermoves to see what will produce the greatest net gain in seeds relative to you.

Because it is hard to beat this little game and see the special winner's message, there was a delightful opportunity to talk about an important capacity of human intelligence, one that could be exemplified by obtaining the winner's message without winning. I used a Zen-like characterization of a "winless win" as a nod to Hofstadter's style in the book Gödel, Escher, Bach.

Put simply, we are not limited in our thinking to the confines of the system. We regularly "take it up a level" or "think outside the box". In this case, the system is a blog entry presented in a web page. So you can jump out of the system by using the View Source feature of your web browser to take a look at James Blog's code, where you will find the winner's message: "I, for one, welcome my non-computer overlord." The message is an allusion to Ken Jennings' capitulation to IBM Watson, which was an awesome pop culture nod to The Simpsons -- awesome because both Jeopardy and the Watson AI are about sorting out exactly those kinds of allusions.

Frankly, I had a lot of fun with allusions, both in the blog entry and while holding the programmer challenge to achieve this winless win. For example, James mentions that he outfoxes his friend Wiley, alluding to the famous coyote, who is in the same animal family as a fox (Canidae), which is a tiny aural tweak from Canada, where I live. So, James can beat his wiley creator. Similarly, in tweets and status updates, I made numerous allusions to The Matrix movie, such as when I nearly used Morpheus's command to Neo: "Quit trying to hit me and hit me." Except that I changed the 'h' to a 'g', making 'git', which is what we use to get source code.

This kind of wordplay and allusion bears some similarity to "jumping out of the system". Hofstadter calls it contextual slipping, or my favorite word for it: counterfactualization. We take some piece of reality that we know about, and we ask "what if this were different?" We slip, or change, some piece of that reality to see if we end up with something new and useful. I find the notion of counterfactualization fascinating because it seems like a good operationalization of some other really important words: creativity, playfulness, humour, imagination.

Still, it might be a while between when we can efficiently and effectively operationalize contextual slipping and when we can generalize that into machine intelligence that can jump out of any system in the way that I asked programmers to do with James Blog. At some point, I realized that there is a beautiful geometric analogy that helps explain why. In the book Flatland, the Sphere is able to escape the plane by using a third geometric dimension that is physically orthogonal to the two that comprise the plane. In this way, Sphere is able to see Square's inner workings. That is a great analogy for what we did by jumping out of the web page using View Source to see James Blog's inner workings. There was a whole different, higher level of understanding about what James was and how we could know more about it, and it is fitting to say we got that winner's message by thinking outside the box.

The next blog will be a developer's tour of the particular machine intelligence algorithm built into James Blog. After that will be a discussion of the relationships between machine intelligence, machine learning, and predictive analytics, so stay tuned!

Your intelligent behavior is based on sentient *understanding*. Sentient schmentient. I'll bet my intelligent behavior can outfox yours. I've done so with my friend Wiley from Canidae, and he's a genius! So, let's see how much good your sapience does you, shall we?

The rules of the contest are simple. You get the top six "houses" and the "store" on the top left. I get the bottom six houses and the bottom right store. We each start out with 6 seeds in each of our 6 houses, and 0 seeds in our stores. To win, you have to get more than half of the seeds into your store (for you knuckle draggers, that's 37 or more). I'll let you go first, so you already start with an advantage.

To take your turn, you pick one of your houses that contains seeds. That house is emptied, and its seeds are "sowed" one at a time in a counterclockwise fashion, including your store but excluding mine. So, it takes 13 seeds to traverse from a starting house, through your store, through my houses, and back to your (now empty) starting house. Every seed that goes into your store gets you closer to victory.

You can earn a seed or two from your move, but there are a few more rules that can earn you lots of seeds. First, if the last seed you sow lands in your store, you get another turn, and you can have multiple extra turns if you make your moves in the right order. Second, if the last seed you sow lands in an empty house, then you earn that seed from the empty house and all seeds in the house of mine immediately below the empty house. I call this a "big take". Third, if I run out of seeds in all my houses, then you earn all the seeds in your houses. Of course, I can also earn lots of seeds by these same rules, which is why YOU'RE GOING TO LOSE MEAT BAG!

I will take it easier on you at first, but I'll play harder if you earn the privilege. And there's a special message for you, a badge of distinction, if you manage to beat me when I play my hardest. Ooops. You... win?!? Wake up! Your teetering bulb is dreaming!


SPOILER ALERT. PLAY A WHILE BEFORE LOOKING ANY FURTHER.

OK, so hopefully you've played enough to know you're not going to be getting that badge of distinction anytime soon (unless you have some of the rare talents of Ted Neustaedter). But also hopefully you're coming to the understanding that I really have no clue what I'm doing when I beat you. What I'm doing is mechanical, not miraculous. I'm being no more intelligent, really, than a calculator squaring a five-digit number. Now, when one of you meat bags does it, it actually is miraculous. But the miracle is that you can do it at all on your hardware, given that your hardware is designed more for sentient understanding of what mechanical operations like squaring are, what they're good for, and what to combine them with.

I am just doing the fine-grain operations of my Minimax algorithm, but it is you who understands our contest at a higher level than that. That's why machine intelligence like mine is best applied as an expert advisor. For example, if you hit "Invoke Expert Advisor", you are asking me to advise you in the limited domain where my simulated intelligence would seem like real intelligence.

Keep using that expert advisor button and see how much faster you earn that special "badge of distinction" message. Go ahead. You won't be able to do it entirely without also sprinkling in your own intelligence at some points. That's because you will hit key points where your sentient understanding recognizes an emerging *pattern* that lets you see how to beat my mechanical intelligence, where even my own advice is unable to do so. What will most likely happen is that you'll use the advice to hold your own for most of the game. My advice will help you avoid moves that give me extra turns and "big take" opportunities. But at some point, you may see that I am beginning to be starved of seeds in my houses. You, as an expert, will have this insight sooner than I see it coming with my mechanical calculations, because your sentient intelligence truly understands what is going on at that higher level.

But of course, you would have a much harder time getting to that point without my advice. And that is what makes machine intelligence like advanced analytics on big data and machine learning technologies like IBM Watson invaluable to you. In short, expert advisors can turbocharge the smarts in your smarter workforce.

In a recent video interview, IBM CEO Ginni Rometty comments that Watson 2.0 will understand images that it sees, and that Watson 3.0 will be able to debate, i.e. to understand what it is talking about with another party. It's an impressive roadmap: each of these is an incredible leap forward from its predecessor.

It is, however, worth qualifying the term 'understand'. It is being used figuratively, not literally, to communicate the rough order-of-magnitude improvement in capability. When such a leap is made, it seems analogous to sentient understanding, even though it isn't. Imagine for a moment what Archimedes would have thought at first of a hand-held calculator, given that he had only the cumbersome numerals of his day with which to calculate pi to several digits. And yet, we would not now interpret such a device as artificial intelligence. As soon as the mechanical nature of a level of capability becomes clear, so too does the fact that it does not constitute sentient intelligence (Hofstadter's exposition of Tesler's "theorem").

You can see this assertion play out in multiple levels of Bob Sutor's scale of cognitive computing. There are levels that are clearly not cognitive intelligence, as Sutor points out, but if you lay out the scale on a timeline of decades or centuries, it is clear that each level might once have been interpreted as being indistinguishable from magic.

So where on Sutor's scale is Watson? And what implications does that have for development best practices?

Watson is clearly not on the "Sentient (we can do without humans) systems" level. As sentient beings, we don't just know things with a certain calculated accuracy or confidence level, or determine that we don't know when our confidence is low. We experience desire to know more, and we experience fear of the unknown. We are teetering bulbs of dread and dream (Hofstadter's delightful invocation of a Russell Edson poem). I urge you to let that characterization of us sink into your mind. In Watson technology, IBM has modeled a certain class of knowledge and mechanical reasoning, and in other research, IBM is simulating some of the known structure of biological brains. However, we don't yet know how to model fear and desire, dread and dream. In my opinion, these are inextricably bound together in sentient intelligence, separating it from simulated intelligence. In other words, intelligent behavior is a construct that works for the dread and dream engine of the sentient, and in the absence of dread and dream, seemingly intelligent behavior is but a mechanical simulation of understanding. As an aside, I hope we only manage to model desire and fear around the same time we figure out how to model ethics (as Asimov cautions).

Does this characterization of Watson as a mechanical simulation of understanding detract from its value? Does it detract from the order-of-magnitude improvement it heralds in ushering in the era of cognitive computing? Of course not, quite the opposite. It is simply fantastic that this level of "Learning, Reasoning, Inference Systems" (Sutor's scale) is now computationally and economically feasible at the scale needed to help sentient intelligence (that's us) solve real-world problems. Quick, what is the square root of 7? Can't do it? No problem. Even if you're Arthur Benjamin, you'd be better off just hitting a few keys on a calculator. Quick, what are the most likely diagnoses for the patient's presenting symptoms? An "expert advisor" like Watson can be just what it takes to help determine the next best action, especially when time is of the essence because a life hangs in the balance.

The term "expert advisor" is appropriate. It conveys that the system is a "Learning, Reasoning, Inference System" that does not have sentient understanding and is therefore made available to advise and guide the actions of an expert. This is analogous to the way spreadsheets guide the results reported by accountants and chief financial officers. That being said, we also know not to put spreadsheets in the hands of toddlers. From a development practice standpoint, it is crucial to keep in mind that "expert advisor" means that the deployed system should be advising someone who is a qualified expert in the exact domain in which the "expert advisor" system was trained. Especially when a life hangs in the balance, access to the "expert advisor" system needs to be performed by those with expert qualifications in the domain because only they can reasonably be expected to use sentient understanding to interpret and follow up on the advice. In other words, the term 'expert' in 'expert advisor' should apply to the user more so than the advisor.

Now, given an enterprise workforce of those with qualified sentient understanding of their topic areas, Watson-style expert advisors are just the type of technological advancement that will help them work smarter, not harder, to meet the needs of customers and colleagues and to produce a competitive advantage for the business.

Since this is an eponymous blog, the time has come to redirect it and widen its aperture to cover a much broader range of IBM-related topics that developers will find interesting and that reflect my own wider range of pursuits and thoughts within IBM.

These days I work in the Smarter Workforce segment of IBM Collaboration Solutions, which is responsible for building out cloud-based solutions for employee talent optimization. How do you attract employees? Retain them? Provide education when they are recruited, promoted or need remediation? How do you best equip employees to share information and enable one another to achieve better customer satisfaction and better business results? How do you measure the results?

So, if you're not in this particular problem space, why should you care? Well, there is a remarkable dynamism in this problem space because it seeks to help human beings interact more effectively and efficiently with other human beings. As a result, many of today's most interesting topics, technologies and techniques are applicable: social computing, cloud computing, mobile computing, security, big data, business analytics and algorithms, and even psychological science and cognitive computing.

Think about what it takes to give everyone a smarter edge. Think of everything that might be needed to do it, plus everything they might want to do, and everything they might want to do it with. Then, think of enabling them to do it everywhere. Now we're talking the same language.

When I started on JavaServer Pages (JSP), I had intended it to be a blog topic. But it grew well beyond blog size, so now that the technical work is finished, I can give you the meta-level view of using JSP with Enterprise IBM Forms.

The work I'm telling you about here is intended to make it easy for you to exploit the powerful, simplifying JSP technique within the XFDL+XForms markup of IBM Forms documents. It took some work to sort it all out, but with that done, it is easy for you to replicate what I did and gain the benefits. I wrote this wiki page on the IBM Forms product wiki to help you get set up, and the page references the developerWorks article I put together to show how to use JSP in your XFDL+XForms forms.

The first hurdle was how to get JSP to work with the IBM Forms Webform Server. JSP already works with the IBM Forms Viewer by just setting the JSP contentType to application/vnd.xfdl, but the Viewer is a client-side program used only in the minority of cases to support offline/disconnected form filling. The majority of customers deploy Webform Server because it translates the XFDL+XForms into HTML and JavaScript automatically so that end users only need a web browser to fill out their enterprise IBM Forms.

It was quite challenging to get the JSP to talk to the Webform Server Translator module, so I was very happy when it started working for me. It's one of those cases of only needing a line or two of code, but it being really hard to get exactly the right line or two. As Mark Twain once said, it's the difference between the lightning and the lightning bug. Anyway, now that we know the smidge of code, it's easy for you to copy it and use it in your XFDL-based JSPs.

At first I thought, OK, I have a good blog topic, but then I realized we weren't covering the full Forms information lifecycle. Put simply, a form is possibly prepopulated and then served, it collects data, and then it comes back and you have to do something with the data collected. So, it was back for more work, sorting out how to receive a completed form into a JSP and use its values in JSP scriptlet code that helps prepopulate the next outbound form. This was a fair bit less challenging, as it maps very closely to how you start up the IBM Forms API in a regular Java servlet. Remember, JSP is just a convenient notation that the web application server knows how to turn into a Java servlet. JSP just makes it easier for you to focus on your special sauce application code.

Well, now that I could handle the whole Forms information lifecycle, I realized I hadn't covered the software development lifecycle. Back to the salt mines again. The problem was that JSP annotations are incompatible with XML. Although there is an alternative XML syntax for JSP, I devote a section in the article to explaining why it's a bit of a train wreck, and I focus instead on the normal JSP annotations. By representing them as XML processing instructions, we're able to maintain the XFDL and the JSP logic together using the IBM Forms Designer, and then use an XSLT to convert to actual JSP when it's time to deploy the IBM Form. This was really important to me because, quite frankly, if a new feature does not work in the Design environment for a language, then the feature essentially does not exist in the language.

Now, that's a wrap! I hope you like the article and get accelerated development benefit from it. JSP is really for building quick prototypes and demos, and also for solving simpler problems much more simply than using straight Java servlet coding. It's even a really nice complement to using Java servlet coding within a larger project. So don't delay, get ready to use JSP with XFDL today.