I recently ran across this article (https://lctech.vn/blog/ibm-watson-compares-trumps-inauguration-speech-obamas/). It describes the author's attempt at a comparative analysis of the personalities of Barack Obama and Donald Trump based on applying the IBM Watson Personality Insights API to their US Presidential inauguration speeches. The article has many charts, figures and analyses, according to various capabilities of the API. But these cannot make up for the logical fallacy under which the API was applied in the first place.

UPDATE: Even as I was publishing this article, a similar misuse of the IBM Watson Personality Insights API was reported by CNBC (http://www.cnbc.com/2017/07/17/tim-cook-is-silicon-valleys-most-imaginative-ceo-says-ibm-data.html). The analysis produced results such as that Apple's CEO Tim Cook is Silicon Valley's most imaginative tech leader and that Microsoft's CEO Satya Nadella is one of the most assertive tech leaders. These are non sequiturs (they may be true or false, but the analysis doesn't actually establish the truths that it asserts).

One of the most important principles in data science is that the test set for a machine learned model must be a good representative of the expected usage of the machine learned model. Otherwise, the accuracy of the machine learned model on the test set will have little to do with its accuracy in practice. In the field of psychometrics, this principle actually has a name: construct validity. Generally, it makes sense to take cues on measuring machine learning from the vast experience of educational psychologists who measure human learning.

A corollary principle in data science is that the training set for a machine learned model must be consistent with the test set. Otherwise, the machine learning algorithm will not be likely to learn the construct that the test set tests. In fact, it's not uncommon to draw the test set randomly from the training set, in which case the two sets are likely to be consistent, and the challenge reduces to determining whether the training and test sets provide a good representation of the intended use case. Essentially, data scientists spend a lot of time thinking about and working on training set quality in order to attain high construct validity.
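To make the "draw the test set randomly" idea concrete, here is a minimal Python sketch. The function name split_train_test, the 20% hold-out fraction and the placeholder records are all assumptions for illustration, not anything taken from a particular product.

```python
import random

def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and hold out a fraction as the test set."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(records)       # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    # First n_test shuffled records become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical records: (source, id) pairs standing in for labeled examples.
records = [("tweet", i) for i in range(100)]
train, test = split_train_test(records)
print(len(train), len(test))  # 80 20
```

A split like this makes the two sets consistent with each other, but it says nothing about whether either one resembles the text you intend to analyze in practice. That second question is the one that matters for construct validity.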

But if you are a data science consumer, then you have to think about these principles in reverse. If you are a software developer who uses an API that offers the inferential function of a machine learned model trained by a data scientist or data science team, then you are a consumer of their data science results. Such is the case when you use IBM Watson Personality Insights.

According to this source, the API was trained based on mapping personality test results with the linguistic patterns of 200 tweets from the 600 participants. There is no evidence to suggest that our tweet writing is linguistically consistent with how we write emails, blogs, or other documents, much less speech transcripts from US Presidents' inauguration speeches or CEO speeches. For one thing, we know that except for tweet storms, successive tweets aren't necessarily all that much related to each other. But the sentences and paragraphs of these other forms of writing are much more logically and sequentially connected together. After all, that's why we have speech writers.

By comparison, if your use case is to determine personality traits of, say, a prospective customer or employee, based on their Twitter feed, then you're more likely to be appropriately using IBM Watson Personality Insights API.

In the case of this API, there are further questions that a psychologist would ask, and therefore that you should ask, too. In particular, the training data was drawn from a sample of 600 participants. But, are those participants representative of the target population on whom you will be doing the inferences with the API? For example, if your prospective customer or employee base comes from, say, the fashion industry, and if the training data participants came dominantly from, say, the tech industry or even from the population at large, then your results with the API may be significantly affected by the difference. Do your best to find out the demographics of the training data participants and your target population to see if there are mismatches. There are other similar questions. Are members of your target population more prone to tweet storms, retweeting, and/or replying to tweets than the training sample population? All of these tendencies are reflective of personality traits, so if there are differences between the training sample and the target population, then you may not be able to use the API.

For any API, you as a software developer are practicing a basic form of data science by checking these issues because you are ensuring construct validity between the inferences in your use case and the training data for the machine learned model you are consuming.

In today's cognitive computing products and techniques, the perception of greater intelligent responsiveness comes not so much from having true explanatory power, but rather just having strong predictive power over increasingly chaotic and larger data sets.

To give us a third point of reference besides linear regression and neural nets, I'll use some other terms to bring the focus to natural language processing. In 2011, the IBM Watson system demonstrated greater intelligence than the best human opponents in the domain of linguistically challenging factual Q&A. This was based on the ability to quickly produce high confidence answers from a large corpus of unstructured information in response to challenging questions.

The linguistic product that is now based on that system is called the IBM Watson Engagement Advisor. As with other cognitive computing techniques, the product must first be trained to be an effective system in the target domain. The corpus of unstructured information often takes the form of documents, such as instruction manuals, technical reports, journal articles, and wiki pages. During training, the most important entities and relationships expressed in the documents are identified and stored in order to expedite later search and retrieval during Q&A interactions with users of the system. The identification process within a document is often called annotation, and the annotation and storage processes together are called ingestion.

The most important concept in understanding the training is to understand what really drives the identification, or annotation, of the documents. It's simple, really. It's a Q&A linguistic product, and the annotation and ingestion expedite the production of the A's in response to the Q's, so it is imperative to have a strong and large representative sampling of the potential questions in order to train and test the efficacy of the system. The questions encode the key concepts (e.g. entities, relationships and so forth) to which users of the system are interested in getting answers. Annotators for these key concepts are developed and, during ingestion, they are executed upon the documents.

This is the very basic level of explanation about how the linguistic product would learn, or be trained, to be an effective cognitive computing system for a domain, and future entries will dive further into this topic. For now, let it suffice to say that ingestion and training results in a system capable of producing answers from the corpus in response to questions like those used in training.

During a run-time Q&A session with the system, the user begins by posing a natural language question. The question is first analyzed to find the key concepts, and then a multiphased approach is used to dig up the best results from the ingested corpus content. As with training, there's a lot more to be said about how the run-time Q&A works, and it's intrinsically related to the training anyway, so more future entries are to come. To conclude and tee up these future entries, I'll say the high order bit here is that a trained Q&A linguistic product seems somehow more intelligent than a linear regression or even a typical neural net application. Why is that? To get a bit more background for that explanation, I'd encourage you to visit or revisit a few of my earlier blog entries about cognitive computing. Compare your perceptions of the intelligence of the minimax algorithm in [1] with the linear regression method in [2], and compare [2] with the neural net in [3]. What's changing?

The speaker says that a challenge with neural nets in business applications is that they are a black box, meaning that you can understand the inputs and the outputs but not really how the net derives the outputs. Later, the speaker says that linear regression is a preferred technique because it has very strong predictive and explanatory power.

It's not really true that linear regression has more explanatory power than neural nets. Rather, it is easier to understand the problems and the answers that can be solved by linear regression. By comparison, neural nets tend to be used to provide cognitive computing power to harder problems than linear regression can solve.

To put this another way, when you use linear regression, you actually begin by assuming linearity of the relation you want to predict. As the speaker points out, you can also make a non-linear assumption, and you can accommodate this using a data transformation, for example. But the high order bit is that you are asked to assume the data relationship, and that assumption is what is giving you the illusion of explanatory power. You can explain that the data follows a line, but this is due to your own assumption. Note that an important aspect of completing a linear regression model is determining the R² (R-squared), or goodness of fit, of the model. This is the part where you make sure that your assumption of linearity is valid. And if the assumption is invalid, then the model has no predictive value, so it does not matter that you can explain how it operates.
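For readers who want to see what the goodness-of-fit check amounts to, here is a small sketch of the usual R-squared calculation in Python. The function name r_squared and the sample numbers are made up for illustration. The formula is the standard one: one minus the ratio of the residual sum of squares to the total sum of squares.

```python
def r_squared(xs, ys, slope, intercept):
    """Fraction of the variation in ys that the line slope*x + intercept accounts for."""
    mean_y = sum(ys) / len(ys)
    # Residual sum of squares: how far the points sit from the line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Total sum of squares: how far the points sit from their own mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                   # made-up data that lies exactly on y = 2x + 1
print(r_squared(xs, ys, 2.0, 1.0))  # 1.0: the line accounts for all of the variation
print(r_squared(xs, ys, 0.0, 6.0))  # 0.0: a flat line at the mean accounts for none
```

A value near 1 suggests the linear assumption is at least not contradicted by the data, while a low value is a warning that the tidy "explanation" of a straight line does not describe the relationship.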

Under the interpretation that explanatory power is akin to predictive power, it turns out that neural nets have greater predictive power because they can produce results for a wider array of applications than linear regression can. There is a neat table that relates the cognitive power of a neural net to the number of hidden layers. From the table, you can see that when a relationship actually is linear, a neural net can solve it without even using any hidden layers of neurons. When one or two hidden layers of neurons are present, neural nets transcend the capabilities of linear regression, in part because they do not require you to make any assumption about what the data relationship actually is.

And that's where the confusion comes in. The linear regression model requires you to assume linearity and so you know at least what geometric shape the relationship looks like. The neural net requires no such assumption, but nor does the trained neural network give you any hint at what the relationship is. The lack of knowing the relationship is confused for having less explanatory power.

But if you look at this a bit more abstractly, the trained linear regression model has the same exact problem of not providing any additional insight. A neural net is really just a pile of numbers giving constant weights to the neural connections that can convert inputs to outputs. Similarly, a linear regression model is just a pile of numbers that give constant weights to inputs to be linearly combined into an output. Sure you know the data relationship, but that's because you assumed it. The actual linear regression model gives you no insight into why one dimension has a large slope constant where another has a small slope.

An analogy I like to use is that the value of the neural net is not diminished by our inability to explain how it is that the little gray cells which implement our personal neural nets can produce the cognitive results that they do, and who among us would prefer to have cognitive powers defined by linear regression instead?

In terms of explanatory power, our biological neural nets perform an additional key function that we have not hitherto been able to achieve with artificial neural nets. We are able to construct additional information in the output that reveals causal relationships, or insights into the reasons for the phenomena it predicts. Put simply: we say why something is true. We provide a rationale. This is an aspect of explanatory power that, when achieved, dramatically increases the value and utility of any cognitive analytic. Theorem provers and Prolog programs have been able to do this for the applications to which they apply. In the area of unstructured information processing and data mining, you can see a demo of this concept in Watson Paths.

As an interesting possible counterexample to my last blog about MLR models not understanding the knowledge they learn, consider the neural network. Our brains are neural networks, and we are capable of learning at all levels of Bloom's Taxonomy, not just the knowledge level. Shouldn't artificial neural networks be able to achieve the same things?

The answer is no, not really. Our brains biologically, chemically and physically perform in ways that we scarcely understand, so our name for the thing we call "artificial neural network" is no less anthropomorphizing than when we say that a computer program of today "understands" anything.

Still (again), this is not to say that they aren't incredibly useful and effective. It's just that they are based on straightforward and well-understood mechanical methods such as feed forward activation of neural outputs via sigmoidal threshold functions applied to inputs and back propagation of synaptic weight adjustments based on easily quantified classification errors. Before going any further, let's have a quick look at a diagram of an artificial neural network (ANN):

The ANN has an output layer on the right that is a classifier for input patterns received on the left. For example, an ANN for optical character recognition could have an input layer of an 8x8 matrix of bits, and the output layer could be an 8-bit code that indicates an ASCII character. The hidden layer(s) of neurons help the ANN to represent more sophisticated phenomena, though there is seldom need for more than one hidden layer. The "synaptic" connections between the neurons in the layers are weighted numbers, and the neurons apply the weights to the inputs and then feed the results into a Sigmoid function that essentially decides, like a transistor or switch, whether or not to fire the output.
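The feed forward step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the tiny 2-2-1 layer sizes and the weight values are invented, and a per-neuron bias term is included even though the description above does not mention one.

```python
import math

def sigmoid(z):
    """Squash any number into the range (0, 1); large z means 'fire'."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """weights[j][i] is the 'synaptic' weight from input i to neuron j."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def feed_forward(inputs, layers):
    """Push the input pattern through each (weights, biases) layer in turn."""
    activation = inputs
    for weights, biases in layers:
        activation = layer_forward(activation, weights, biases)
    return activation

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden = ([[0.5, -0.4], [0.3, 0.8]], [0.0, -0.1])
output = ([[1.2, -0.7]], [0.05])
print(feed_forward([1.0, 0.0], [hidden, output]))
```

Notice that nothing here is more exotic than multiplication, addition and one smooth threshold function, which is the point about these being well-understood mechanical methods.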

An ANN is "trained" by giving it a sequence of input patterns for which the correct output pattern is known. The input pattern feeds forward through the ANN to produce an output. If there is a difference between the ANN output and the correct output, then the differential error is back propagated through the ANN to adjust the weights so that future occurrences of that input pattern are more likely to produce the correct output.
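As a rough illustration of the training loop, here is a sketch that trains the simplest possible case: a single sigmoid neuron with no hidden layer, learning the logical OR pattern. With only one neuron, the "back propagation" reduces to one weight update per sample, so treat this as the core idea rather than a full multi-layer implementation. The function names, learning rate and epoch count are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    """Feed the input pattern forward through the single neuron."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_neuron(samples, epochs=2000, rate=0.5):
    """Adjust the weights a little after each sample, in proportion to the output error."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = predict(weights, bias, inputs)
            # Error scaled by the slope of the sigmoid at this output.
            delta = (target - output) * output * (1.0 - output)
            weights = [w + rate * delta * x for w, x in zip(weights, inputs)]
            bias += rate * delta
    return weights, bias

# Input patterns for which the correct output (logical OR) is known.
samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_neuron(samples)
for inputs, target in samples:
    print(inputs, target, round(predict(weights, bias, inputs), 3))
```

After training, everything the neuron "knows" about OR is contained in two weights and a bias.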

The synaptic weights, then, essentially represent the knowledge that the ANN "learns" from the input patterns. This is analogous to the constants that are "learned" by an MLR model. In fact, all elements of the ANN and MLR model architectures are analogous. The ANN input layer maps to the independent X variables, the ANN output layer maps to the dependent Y variable, and the transition from the input X values to the Y value that is achieved in MLR by multiplication and addition is achieved by a feed forward through synaptic connections, hidden layer neurons and Sigmoid functions in an ANN.

With such a one-to-one architecture mapping between ANNs and MLR models, it is easier to see them as having similar intellectual power. That's not to say they're equivalent, as ANNs are far more powerful. It's just that they're roughly the same (low) order of magnitude with respect to human intellect, and in terms of Bloom's Taxonomy, we call that order of magnitude "knowledge storage/retrieval".

Despite being in the lowest order of magnitude of intellect, the realm of today's artificial intelligence includes many interesting knowledge storage/retrieval techniques that are worth comparing and contrasting to see the range and limits of their power and the use cases they address. Stay tuned!

Ever since my first blog entry in this recent series on artificial intelligence, I've been highlighting the lesser, calculational nature of machine intelligence and learning-- as well as the valuable role it nonetheless can play in driving more effective human understanding and decisions. I've been doing this by articulating mainly what machines do, as that is the primary interest of mine and most who would read a developerWorks blog. Still, our interests will be served by taking an entry to discuss human learning as a counterpoint or contrast.

The multiple linear regression example in my last post is a good example to start with because it highlights the difference between accuracy versus understanding. If there is a linear relationship among the data, then an MLR can have a very high predictive accuracy, but it has no explanatory power whatsoever. The MLR model does not have, nor does it convey, any understanding as to why the relationship exists.

Let's see how this predictive accuracy rates in terms of human intelligence and learning. In this case, we can benefit from an instance of that delightful human propensity to apply ideas to themselves. Specifically, we humans have applied our learning abilities to the phenomenon of our learning abilities, with many useful results including Bloom's Taxonomy.

According to Bloom's taxonomy, the very lowest level of cognitive learning is the knowledge level, or the ability to remember and recall what is learned. When you think about it, you realize that an MLR model, like many predictive analytics, is really a storage mechanism for something that has been machine learned from data. In MLR, we store the constants of a linear formula as the representation of what has been learned from linearly related data.

The next higher level of Bloom's taxonomy is comprehension, which is where understanding and true explanatory power begin to surface. But human learning is so much more sophisticated than the knowledge level of machine learning that there are a number of levels above comprehension. There's the application level, in which we can use our knowledge to solve new problems, including being able to explain why the new solution works. The analysis level drills deeper into our ability to make inferences and generalizations. The synthesis level begins to get at our ability to be creative with what we've learned and come up with new ideas and solutions. Finally, the evaluation level gets at our ability to be subjective and judge quality and creativeness of ideas and solutions. We are beginning to see some faint glimmers of some elements of some of these levels in cognitive computing efforts like IBM Watson, but it is early days indeed.

While we're on the subject of human learning and Bloom's Taxonomy, it makes sense to digress for a bit and mention the IBM Social Learning product. This is a SaaS educational platform intended to help enterprises achieve a Smarter Workforce. A few reasons for the digression are

learning is a key ingredient of how a human workforce becomes smarter.

The IBM Social Learning product has a very nice feature that enables educational administrators to implement Bloom's Taxonomy in their learning materials. A component of the product is the Kenexa LCMS, or learning content management system, which includes various subcomponents like a course designer and a metadata dictionary. The educational administrator can add any metadata tag, such as "Learning Goal", and any tag values, such as "Basic Knowledge", "Comprehension", "Application", etc. Once this is done, the educational administrator can use the metadata tag values to classify any learning item in the LCMS according to Learning Goal. Once these classified learning materials are published, learners can use the "Learning Goal" as a new faceted search criterion in the platform's learning library. A learner would be able to isolate and focus on "knowledge" level learning in a subject area before proceeding to comprehension and then application, for example. This will enable learners to effectively use the natural way in which their learning blooms, i.e. Bloom's Taxonomy.

Finally, there is an aspect of human learning that goes beyond Bloom's taxonomy, and it's an area that is highlighted by the IBM Social Learning product. There is a very important word in the product title: Social. This is crucial because it underscores the central role of communication and collaboration in the human learning process. We are an order of magnitude more effective at learning based on our interconnectedness to others who think and learn, rather than having access to just data. This is pertinent to the advancement of artificial intelligence because "social" goes quite beyond the computing architecture underlying a lot of today's machine learning efforts.

Machine learning today is every bit as calculated, as simulated, as is machine intelligence. It is easier to use machine intelligence to highlight how much greater human cognition is, which is why I've been using a machine intelligence algorithm over the last several entries. However, the conclusion drawn so far is that, while machine intelligence is only simulated, it is still quite effective and valuable as an aid to human insight and decision making. Machine learning offers another leap forward in the effectiveness and hence value of machine intelligence, so let's see what that is.

Machine learning occurs when the machine intelligence is developed or adapted in response to data from the domain in which the machine intelligence operates. The James Blog entry only does this degenerately, at a very coarse grain level, so it doesn't really count except as a way to begin giving you the idea. The James Blog entry plays a game with you, and if he loses, he adapts by increasing his lookahead level so that his minimax method will play more effectively against you next time. In some sense, he learned that you were a better player. However, this is only a single integer of configurability with only a few settings of adjustment that controls only one aspect of the machine intelligence algorithm's operation. To be considered machine learning, a method must typically have a more profound impact on the operation of the algorithm, with much more adaptation and configurability based on many instances of input data. An example will clarify the more fine grain nature of machine learning.

The easiest example of which I can think is a predictive analytic algorithm called linear regression. Let's say you'd like to be able to predict or approximate the purchase price of a person's new car based on their age. Perhaps you want to do this so that you can figure out what automobile advertisements are most appropriate to show the person. Now, as soon as you hear this example, your human cognition kicks in and you rattle off several other likely variables that would impact the most likely amount of money a person is willing to spend on a car, such as their income level, debt level, nuclear familial factors, etc. This analytic technique is typically called multiple linear regression (MLR) exactly because we humans most often dream up many more than two variables that we want to simultaneously consider. Like most machine learning techniques, MLR does not learn of new factors to consider by itself. It only considers those factors that a human has programmed it to consider. When they are well chosen, additional variables typically do make an MLR model more effective, but for the purpose of discussing the concept of machine learning, the simple two-variable example suffices since your mind will have no problem generalizing the concept.

Suppose you have records of many prior car purchases, including a wide and nicely distributed selection of prices of the cars and ages of their buyers. This is referred to as "training data". If you plotted the training data, it might look something like the blue points in the image below. Let purchase price be on the vertical Y axis since it is the "dependent" variable that we want to predict, and let age be on the X-axis since it is a predictor, or "independent" variable. MLR uses a standard formula to compute a "line of best fit" through the given data points, again like the one shown in red in the picture.

A line has a formula that looks like this: Y=C1X1+C0, where C1 is a constant that governs the slant (slope) of the line, and C0 is a constant that governs how high or low the line is (C0 happens to be the point where the line meets the Y-axis, and the line slopes up or down from there). If we had more variables, then MLR would just compute more constants to go with each of them. For example, if we wanted to use two variable predictors of a dependent variable, then we'd be using MLR to create a formula of the form Y=C2X2+C1X1+C0 (geometrically, a plane rather than a line).

Technically, MLR computes the constants like C1 and C0 of the line Y=C1X1+C0 in such a way that the line minimizes the sum of the squares of the vertical (Y) distances between each data point and the line. For each point, we take its distance from the line as an amount of "error" in the prediction. We square it because that gets rid of the negative sign (and, less importantly, magnifies the error resulting from being further from the line). We sum the squares of the errors to get a total measure of the error produced by the line, and the line is computed so as to minimize that total error.

Once the constants have been computed, it is a trivial matter to use the MLR model as a predictor. You simply plug the known values of the predictor variables into the formula to compute the predicted Y-value. In the car buying example, X1 is the age of a potential buyer, and so you multiply that by the C1 constant, then add C0 to obtain the Y-value, which is the predicted value of the car.
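To make the last few paragraphs concrete, here is a minimal Python sketch of the one-predictor case. It uses the standard closed-form least-squares formulas for C1 and C0. The function names and the ages and prices are invented for illustration, and a real MLR tool would handle many predictors at once.

```python
def fit_line(xs, ys):
    """Return (c1, c0) for the line Y = C1*X1 + C0 that minimizes the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares solution for a single predictor.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c1, c0

def predict(c1, c0, x1):
    """Plug the predictor value into the learned formula."""
    return c1 * x1 + c0

# Made-up training data: buyer ages and car purchase prices.
ages = [20, 30, 40, 50]
prices = [15000, 20000, 25000, 30000]

c1, c0 = fit_line(ages, prices)
print(c1, c0)               # 500.0 5000.0
print(predict(c1, c0, 45))  # 27500.0
```

The fitting function is where the "learning" happens, and its entire output is two numbers. The predictor is a single multiplication and addition.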

In this way, hopefully you can see that the MLR "learns" the values of the constants like C1 and C0 from the given data points. Furthermore, the actual algorithm that produces the machine intelligence only computes the result of a simple linear equation, so hopefully you can also see that the predictive power comes mainly from the constants, which were "learned" from the data. In the case of the minimax method, most of the machine intelligence came from the algorithm, but with MLR-- as with most machine learning-- the machine intelligence is for the most part an emergent property of the training data.

Lastly, it's worth noting that there are a lot of "best practices" around using MLR. However, these are orthogonal to the topic of this post. Suffice it to say that just like the minimax method has a very limited domain in which it is effective as a machine intelligence, MLR also has a limited domain. For example, the predictor variables (the X's) do need to be linearly related to the dependent variable in reality. However, within the limited domain of its linearly related data, MLR is quite effective and an excellent example of a simple machine learning technique that produces machine intelligence within that domain.

In the interest of space last time, I had to leave out an advanced topic on optimizing a "next best action" algorithm. Again, you can look at the full source we're discussing by just using the web browser's View Source on this page.

The optimization is known as alpha-beta pruning. In the code snippet below, you see that we break the j-loop that is scoring the response moves of a given move based on some condition involving the variables alpha and beta. Why does it make sense to stop looking at the competitive response moves for a given move? To see why, I've added the function declaration so we can discuss where the alpha value comes from and what it means.

Understanding alpha-beta pruning requires you to take a more global view of the recursion that is doing the evaluation. The alpha values passed into scoreMove() are the beta values from the calling level of the Minimax algorithm. It will help to keep at least the player's moves and the opponent's responses in mind as we go through this.

Let's say that scoreMove() has been called to score a player's Kth move. Beforehand, moves 1 to K-1 will have been fully explored by depth-first recursion, including the opponent's responses, the player's counter-responses, and so on. The alpha value received by scoreMove() for move K reflects the best fully explored "net" score for the player on moves 1 to K-1. Within scoreMove(), we first compute the raw benefit of the new move K, storing the result in moveScore. Now comes the alpha-beta pruning trick. The j-loop successively explores each opponent response move for the player's move K, and clearly the beta value takes on the value of the highest scoring response move that the opponent can make. The final score for move K is the raw benefit to the player of move K minus the benefit beta that the opponent can realize in response.

Thought-provoking question: Do we really need to know the absolute best move that the opponent can make in response to the player's move K? Or do we just need to find an opponent move that is good enough that, when subtracted from the raw benefit of move K, proves that the player would be better off choosing the earlier move associated with the alpha value? Of course, the answer is that we only need a good enough opponent move, and this is why we break the j-loop when we find that move. If we were to continue the j-loop, all we do is unnecessary work that might (or might not) find an even better opponent response move that would make move K look like an even worse decision for the player. But there is no need to do this extra work. Once the expression "(moveScore-beta < alpha)" becomes true, we have proven that move K is less beneficial than one of the moves 1 to K-1.
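Here is a small Python sketch of the pruning logic just described. It is not the page's actual source. The game is abstracted into a hypothetical tree where each move is a pair (raw benefit, list of opponent replies), and the names score_move, best_move and calls are invented for the illustration. The break condition mirrors the expression "(moveScore-beta < alpha)" discussed above.

```python
import math

calls = {"count": 0}  # counts how many moves get scored, to show the pruning

def score_move(move, depth, alpha):
    """Net score of a move: raw benefit minus the best reply found for the opponent."""
    calls["count"] += 1
    move_score, replies = move
    if depth == 0 or not replies:
        return move_score
    beta = -math.inf                      # best opponent reply found so far
    for reply in replies:                 # the "j-loop"
        # This level's beta becomes the alpha of the level below.
        reply_score = score_move(reply, depth - 1, beta)
        if reply_score > beta:
            beta = reply_score
        if move_score - beta < alpha:     # already worse than an earlier move
            break                         # so stop exploring replies
    return move_score - beta

def best_move(moves, depth):
    best_index, best_score = None, -math.inf
    for index, move in enumerate(moves):
        score = score_move(move, depth, best_score)
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score

# Hypothetical position: two moves, each with two opponent replies.
moves = [
    (2, [(1, []), (0, [])]),  # net score 2 - 1 = 1
    (3, [(5, []), (1, [])]),  # first reply gives 3 - 5 = -2 < 1, so the second reply is skipped
]
print(best_move(moves, depth=1))  # (0, 1)
print(calls["count"])             # 5 moves scored, rather than all 6
```

The score returned for a pruned move is only an upper bound, but since that bound is already below alpha, the caller never selects the move, so the final choice is the same as without pruning.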

From a practical standpoint, this optimization averages better than double the run-time performance of the "what-if" logic. Who doesn't want double, right? Well, this "what-if" analysis is a combinatorial explosion of analysis; to put that in perspective, you get less than one extra move of lookahead due to this optimization. Yet despite this dash of cold water about how much deeper you could take the "what if" logic due to alpha-beta pruning, it remains true that, for a given level of explorative depth, everybody wants the result twice as fast or more, so alpha-beta pruning is very handy.

This entry is for developers who want a good mental model for how a prescriptive analytics algorithm can simulate intelligent behavior. We'll focus on the intelligent behavior in the James Blog entry, since it is quite competitive with humans. Reminder: Just hit "view source" in your browser to get the code we're talking about here.

The first thing to note is that the domain of the intelligence is quite constrained and circumscribed relative to the full realm of human intellectual endeavor. This is what makes it computationally feasible to perform a "what if" analysis to "imagine" possible scenarios and determine a next best action. Here's roughly how it works. The computer's available next actions are examined and measured for their immediate benefit. Then, for each action, the response action of the opponent is measured for its immediate benefit to the opponent, and so on. Once the real benefit of each opponent move is tabulated, the value of the best opponent action is subtracted from the immediate value of a given computer move. The best computer move is determined as the highest value move resulting from the immediate benefit minus the score of the best opponent move.
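The "what if" analysis above can be sketched in Python as follows. This is an abstraction rather than the page's source: each move is a hypothetical pair of (immediate benefit, list of opponent replies), the names score_move and best_move are made up, and the depth argument stands in for the lookahead level.

```python
def score_move(move, depth):
    """Immediate benefit of a move minus the score of the opponent's best reply."""
    gain, replies = move
    if depth == 0 or not replies:
        return gain
    # The opponent's replies are scored the same way, from the opponent's point of view.
    best_reply = max(score_move(reply, depth - 1) for reply in replies)
    return gain - best_reply

def best_move(moves, depth):
    """Index of the move with the highest net score."""
    return max(range(len(moves)), key=lambda i: score_move(moves[i], depth))

# Hypothetical position: the greedy move (gain 3) allows a strong reply (gain 5).
moves = [
    (3, [(5, []), (1, [])]),  # net 3 - 5 = -2
    (2, [(1, []), (0, [])]),  # net 2 - 1 = 1
]
print(best_move(moves, depth=0))  # 0: with no lookahead, the bigger immediate gain wins
print(best_move(moves, depth=1))  # 1: one level of "what if" reveals the better choice
```

The difference between the two printed answers is the entire source of the apparent intelligence: the deeper the lookahead, the more of the opponent's possibilities have been "imagined" before committing to a move.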

One thing I like about the game Kalah is that it is really easy to explain the competitive algorithm, relative to harder games like Chess. In Chess, evaluating the immediate benefit of a move can be challenging, especially at the beginning of the game. It's not just about the value of the piece you take because many moves don't take pieces. The value of a move is often about gaining control over spaces of the board to limit the opponent's attack and defense options. But in Kalah, you get good intelligent game play from a much simpler board evaluation. The value of a move is simply a matter of how many seeds you gain by that move.

This code (at the beginning of KalahGame.scoreMove) just copies the current board, makes the proposed move for the given player, then evaluates the new board value minus the value of the old board configuration for the given player. In effect, you get the number of seeds gained for the player by the move.

That's when things get interesting. The move scoring then becomes iteratively recursive. Each valid move of the opponent is then evaluated by recursively calling the move scoring method. Like this:

The first line is just a trick to switch between player 1 and player 2 in the levels of recursion. The "beta" value is the highest scoring move of the opponent so far, so once we switch to the opponent player in the first line, the second line just sets a large negative score so that the loop will start by selecting the first available move as being a good idea. The j loop tries each move, and the if test on the succeeding line just ensures that there is a non-zero number of seeds to pick up-- in other words, it ensures the move is valid. Then, the opponent's move is scored by recursively calling KalahGame.scoreMove(). When the recursion returns, the succeeding if test checks whether the move is better than the best result so far, stored in "beta". If it is, then this move becomes the new "beta". The alpha/beta business at the end of the j loop is an optimization that can be safely ignored. Once the j loop has examined all the moves, the best opponent move score "beta" is subtracted from the immediate benefit value of the player's move.
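For readers who prefer to see the shape of that loop in one place, here is a minimal Python sketch of the same idea. It is a reconstruction under stated assumptions, not the page's source code: the 14-slot board layout, the function names, and the depth parameter are all hypothetical, and the move rule is simplified to plain sowing (no extra turns, no "big take", no end-of-game sweep). The alpha/beta optimization is left out.

```python
# Assumed board layout: indices 0-5 are player 0's houses, 6 is player 0's
# store, 7-12 are player 1's houses, 13 is player 1's store.

def houses(player):
    return range(0, 6) if player == 0 else range(7, 13)

def store(board, player):
    return board[6 if player == 0 else 13]

def sow(board, player, house):
    """Copy the board, empty `house`, and sow its seeds one slot at a time."""
    board = list(board)                  # the caller's board is left untouched
    skip = 13 if player == 0 else 6      # never sow into the opponent's store
    seeds, board[house] = board[house], 0
    pos = house
    while seeds:
        pos = (pos + 1) % 14
        if pos != skip:
            board[pos] += 1
            seeds -= 1
    return board

def score_move(board, player, house, depth):
    """Immediate seed gain minus the best score the opponent can reply with."""
    after = sow(board, player, house)
    move_score = store(after, player) - store(board, player)
    if depth <= 1:
        return move_score                # lookahead limit reached
    opponent = 1 - player                # switch sides for the next level
    beta = float("-inf")                 # best opponent reply so far
    for j in houses(opponent):
        if after[j] > 0:                 # only non-empty houses are valid moves
            reply = score_move(after, opponent, j, depth - 1)
            if reply > beta:
                beta = reply
    if beta == float("-inf"):
        beta = 0                         # assumption: no valid reply costs nothing
    return move_score - beta

def best_move(board, player, depth):
    valid = [h for h in houses(player) if board[h] > 0]
    return max(valid, key=lambda h: score_move(board, player, h, depth))


board = [0, 0, 0, 0, 0, 2,  0,  0, 0, 0, 0, 0, 1,  0]
print(score_move(board, 0, 5, depth=1))  # 1
print(score_move(board, 0, 5, depth=2))  # 0
```

With a lookahead of 1 the move looks like a one-seed gain. With a lookahead of 2 the opponent's one-seed reply cancels it out.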

This is how each of a player's possible moves is scored in Kalah.getBestComputerMove(): The move's immediate benefit in the number of seeds scored minus the best value obtained from a recursive lookahead of possible opponent responses that accounts for the player's responses to the opponent, and the opponent's responses in kind, and so on down to the limit of the look ahead level.

The fun bit of this code is that it is used not only to determine the computer's best move, but when you ask for the "Expert Advisor" to help you, it applies exactly the same logic to *your* board position in order to determine a recommended next move for you.

To conclude, here is a small diagram to help you see what is going on.

In this example, we're near the end of the game, and Player 1 must decide whether to make move 2 or 4. With move 2, there is an immediate benefit of 4 seeds because the 1 seed lands in an empty house, allowing the player to score that seed as well as the 3 seeds in the opposing house. This seems like a good idea, but is it? Well, Player 2's moves should be examined. In the short term, Player 2 can only respond with move 5, but this spreads out the 4 seeds. If you look ahead to the end of the game, you can see that Player 2 will ultimately score all four of those seeds. But also in the recursion, it is unavoidable that Player 1 will be able to score the remaining two seeds on the top row of houses. So the net benefit to Player 1 of making move 2 is only 2 seeds: the immediate 4 seeds, minus the 4 earned by Player 2 in the rest of the game, plus the 2 additional seeds that Player 1 earns in the rest of the game. Not as good as it initially looked. However, it does turn out to be better than move 4 for Player 1. With move 4, the immediate move yields no seeds for Player 1. Then, in the rest of the game play, Player 2 is able to earn 7 seeds, and Player 1 only earns 3 seeds. So, if Player 1 makes move 4, then the opponent gains 4 more seeds than Player 1 does.
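The tallies in this example can be checked with a few lines of Python. This is only an illustrative sketch: the function name is made up, and each "rest of the game" total quoted above is collapsed into a single node rather than replayed move by move.

```python
def net_benefit(immediate, replies=()):
    """Immediate gain minus the best net benefit available to the other side."""
    if not replies:
        return immediate
    return immediate - max(net_benefit(*reply) for reply in replies)


# Each node is (seeds gained, [replies by the other player]).
move_2 = (4, [(4, [(2, [])])])  # Player 1 gains 4, Player 2 then 4, Player 1 then 2
move_4 = (0, [(7, [(3, [])])])  # Player 1 gains 0, Player 2 then 7, Player 1 then 3

print(net_benefit(*move_2))  # 2
print(net_benefit(*move_4))  # -4
```

Move 2 nets Player 1 two seeds (4 - (4 - 2)), while move 4 leaves Player 1 four seeds behind (0 - (7 - 3)), so move 2 is the better choice.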

Well, that's a wrap for this explanation of the 2-party competitive algorithm known as the "Minimax" method. Hopefully you can now see that it's not real intelligence but rather just tabulation of best outcomes according to a scoring method and constrained to a set of rules for determining valid next moves. Demystified, it becomes no more surprising that the algorithm defeats humans than it is when an algorithm can beat a human at calculating the square of a five-digit number.

Still, this is roughly what a person does. Time and again, new possibilities are "imagined" by testing "what if" this move is made or that move is made. And the algorithm does win a lot of games, which is precisely why prescriptive analytics algorithms are so valuable as expert advisors. If you take the material covered here up by an order of magnitude, you get IBM Deep Blue. Another order of magnitude, and you get IBM Watson. The sky's the limit!

David Lee Roth and Eddie Van Halen have been trying to get us to do it for decades: "JUMP!" Douglas Hofstadter would qualify that with "... out of the system!" Here's what that means.

Machine intelligent entities like James Blog exist within a certain system, conforming to a prescribed set of rules, and they really can't escape the confines and constraints of that programming. Within that limited domain, they do calculate wonderful results that can seem intelligent. In an early version, I found myself adding a logger so I could see why James Blog was not making some moves that seemed very good. Time and again, I would find that the good move now set up the conditions for a better opponent move later, which is exactly what the artificial intelligence is supposed to detect and avoid.

The algorithm does this so well that it is really hard to beat, especially on the maximum lookahead value I set, which was 6. Frankly, if you're new to this game, you have to work to beat even the initial lookahead level setting of 2, which means that James Blog only looks at its own moves and your countermoves to see what will produce the greatest net gain in seeds relative to you.

Because it is hard to beat this little game and see the special winner's message, this opened up a delightful opportunity to talk about an important capacity of human intelligence that could be exemplified by determining the winner's message without winning. I used a Zen-like characterization of a "winless win" as a nod to Hofstadter's style in the book Gödel, Escher, Bach.

Put simply, we are not limited in our thinking to the confines of the system. We regularly "take it up a level" or "think outside the box". In this case, the system is a blog entry presented in a web page. So you can jump out of the system by using the View Source feature of your web browser to take a look at James Blog's code, where you will find the winner's message: "I, for one, welcome my non-computer overlord." The message is an allusion to Ken Jennings' capitulation to IBM Watson, which was an awesome pop culture nod to The Simpsons-- awesome because both Jeopardy and the Watson AI are about sorting out exactly those kinds of allusions.

Frankly, I had a lot of fun with allusions, both in the blog entry and while holding the programmer challenge to achieve this winless win. For example, James mentions that he outfoxes his friend Wiley, alluding to the famous coyote, who is in the same animal family as a fox (Canidae), which is a tiny aural tweak from Canada, where I live. So, James can beat his wiley creator. Similarly, in tweets and status updates, I made numerous allusions to The Matrix movie, such as when I nearly used Morpheus's command to Neo: "Quit trying to hit me and hit me." The exception is that I changed the 'h' to a 'g', making 'git', which is what we use to get source code.

This kind of wordplay and allusion bears some similarity to "jumping out of the system". Hofstadter calls it contextual slipping, or my favorite word for it: counterfactualization. We take some piece of reality that we know about, and we ask "what if this were different?" We slip, or change, some piece of that reality to see if we end up with something new and useful. I find the notion of counterfactualization fascinating because it seems like a good operationalization of some other really important words: creativity, playfulness, humour, imagination.

Still, it might be a while between when we can efficiently and effectively operationalize contextual slipping and when we can generalize that to achieve machine intelligence that can jump out of any system in the way that I asked programmers to do with James Blog. At some point, I realized that there is a beautiful geometric analogy that helps explain why. In the book Flatland, the Sphere is able to escape the plane via the use of a third geometric dimension that is physically orthogonal to the two that comprise the plane. In this way, Sphere is able to see Square's inner workings. That is a great analogy with what we did by jumping out of the web page using View Source to see James Blog's inner workings. There was a whole different, higher level of understanding about what James was and how we could know more about it, and it is fitting to say we got that winner's message by thinking outside the box.

The next blog entry will be a developer's tour of the particular machine intelligence algorithm built into James Blog. After that will come a discussion of the relationships between machine intelligence, machine learning, and predictive analytics, so stay tuned!

Your intelligent behavior is based on sentient *understanding*. Sentient schmentient. I'll bet my intelligent behavior can outfox yours. I've done so with my friend Wiley from Canidae, and he's a genius! So, let's see how much good your sapience does you, shall we?

The rules of the contest are simple. You get the top six "houses" and the "store" on the top left. I get the bottom six houses and the bottom right store. We each start out with 6 seeds in each of our 6 houses, and 0 seeds in our stores. To win, you have to get more than half of the seeds into your store (for you knuckle draggers, that's 37 or more). I'll let you go first, so you already start with an advantage.

To take your turn, you pick one of your houses that contains seeds. That house is emptied, and its seeds are "sowed" one at a time in a counterclockwise fashion, including your store but excluding mine. So, it takes 13 seeds to traverse from a starting house, through your store, through my houses, and back to your (now empty) starting house. Every seed that goes into your store gets you closer to victory.
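For the developers in the audience, the sowing rule can be sketched in a few lines of Python. This is an illustration rather than the game's actual code: the 14-slot list layout and the function name are assumptions, with the two sides numbered 0 and 1.

```python
# Assumed layout: indices 0-5 are player 0's houses, 6 is player 0's store,
# 7-12 are player 1's houses, 13 is player 1's store. Sowing moves to the
# next index and wraps around, which plays the role of "counterclockwise".

def sow(board, player, house):
    """Empty `house` and sow its seeds; return (new board, slot of last seed)."""
    board = list(board)
    skip = 13 if player == 0 else 6      # the opponent's store is never sown
    seeds, board[house] = board[house], 0
    pos = house
    while seeds:
        pos = (pos + 1) % 14
        if pos != skip:
            board[pos] += 1
            seeds -= 1
    return board, pos


# Thirteen seeds make exactly one full lap back to the starting house.
board = [13, 0, 0, 0, 0, 0,  0,  0, 0, 0, 0, 0, 0,  0]
after, last = sow(board, 0, 0)
print(after)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
print(last)   # 0
```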

You can earn a seed or two from your move, but there are a few more rules that can earn you lots of seeds. First, if the last seed you sow lands in your store, you get another turn, and you can have multiple extra turns if you make your moves in the right order. Second, if the last seed you sow lands in an empty house, then you earn that seed from the empty house and all seeds in the house of mine immediately below the empty house. I call this a "big take". Third, if I run out of seeds in all my houses, then you earn all the seeds in your houses. Of course, I can also earn lots of seeds by these same rules, which is why YOU'RE GOING TO LOSE MEAT BAG!
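The three bonus rules can be sketched the same way. Again, this is a hedged illustration, not the code behind the board: the list layout and names are assumptions, and it assumes (as in standard Kalah) that a "big take" only happens when the last seed lands in an empty house on the mover's own side.

```python
# Assumed layout: indices 0-5 are player 0's houses, 6 is player 0's store,
# 7-12 are player 1's houses, 13 is player 1's store.

def houses(player):
    return range(0, 6) if player == 0 else range(7, 13)

def store(player):
    return 6 if player == 0 else 13

def play(board, player, house):
    """Apply one full move; return (new board, whether the mover goes again)."""
    if house not in houses(player) or board[house] == 0:
        raise ValueError("pick one of your own houses that contains seeds")
    board = list(board)
    seeds, board[house] = board[house], 0
    pos = house
    while seeds:
        pos = (pos + 1) % 14
        if pos != store(1 - player):     # skip the opponent's store
            board[pos] += 1
            seeds -= 1

    # Rule 1: last seed in the mover's own store earns another turn.
    extra_turn = pos == store(player)

    # Rule 2 ("big take"): last seed in an empty house of the mover's own
    # captures that seed plus everything in the house directly opposite.
    if pos in houses(player) and board[pos] == 1:
        opposite = 12 - pos
        board[store(player)] += board[pos] + board[opposite]
        board[pos] = board[opposite] = 0

    # Rule 3: when one side's houses are all empty, the other side earns
    # whatever is left in its own houses.
    for side in (0, 1):
        if all(board[i] == 0 for i in houses(side)):
            winner = 1 - side
            for i in houses(winner):
                board[store(winner)] += board[i]
                board[i] = 0
    return board, extra_turn


opening = [6] * 6 + [0] + [6] * 6 + [0]
print(play(opening, 0, 0))
# ([0, 7, 7, 7, 7, 7, 1, 6, 6, 6, 6, 6, 6, 0], True)

big_take = [0, 1, 0, 0, 0, 2,  0,  0, 0, 0, 3, 0, 1,  0]
print(play(big_take, 0, 1))
# ([0, 0, 0, 0, 0, 2, 4, 0, 0, 0, 0, 0, 1, 0], False)
```

The first call shows an extra turn from the opening position. The second shows a big take worth 4 seeds: the sown seed plus the 3 in the opposite house.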

I will take it easier on you at first, but I'll play harder if you earn the privilege. And there's a special message for you, a badge of distinction, if you manage to beat me when I play my hardest. Ooops. You... win?!? Wake up! Your teetering bulb is dreaming!

SPOILER ALERT. PLAY A WHILE, BEFORE LOOKING ANY FURTHER.

OK, so hopefully you've played enough to know you're not going to be getting that badge of distinction anytime soon (unless you have some of the rare talents of Ted Neustaedter). But also hopefully you're coming to the understanding that I really have no clue what I'm doing when I beat you. What I'm doing is mechanical, not miraculous. I'm being no more intelligent, really, than a calculator squaring a five-digit number. Now, when one of you meat bags does it, it actually is miraculous. But the miracle is that you can do it at all on your hardware given that it is designed more for sentient understanding of what mechanical operations like squaring are, what they're good for, and what to combine them with.

I am just doing the fine-grain operations of my Minimax algorithm, but it is you who understands our contest at a higher level than that. That's why machine intelligence like mine is best applied as an expert advisor. For example, if you hit "Invoke Expert Advisor", you are asking me to advise you in the limited domain where my simulated intelligence would seem like real intelligence.

Keep using that expert advisor button and see how much faster you earn that special "badge of distinction" message. Go ahead. You won't be able to do it entirely without also sprinkling in your own intelligence at some points. This will be because you will hit some key points where your sentient understanding recognizes a *pattern* that emerges that will allow you to see how to beat my mechanical intelligence, where even my own advice is unable to do so. What will most likely happen is that you'll use the advice to hold your own for most of the game. My advice will help you avoid moves that give me extra turns and "big take" opportunities. But at some point, you may see that I am beginning to be starved of seeds in my houses. You, as an expert, will have this insight sooner than I see it coming using my mechanical calculations because your sentient intelligence truly understands what is going on at that higher level.

But of course, you would have a much harder time getting to that point without my advice. And that is what makes machine intelligence like advanced analytics on big data and machine learning technologies like IBM Watson invaluable to you. In short, expert advisors can turbocharge the smarts in your smarter workforce.