Thursday, June 16, 2005

A function simply defines an input-to-output mapping between two sets of elements. The input set is referred to as the domain, the output set is the co-domain or range. There is only one basic requirement of a function: each and every element of the input set (domain) must correspond with (map to) a single element of the output set (co-domain), although multiple input elements may map to the same output element.

Below, for example, is a complete list of mappings which completely define a very simple logical function called Exclusive-OR (often abbreviated XOR):

(0,0)->(0)

(0,1)->(1)

(1,0)->(1)

(1,1)->(0)

This particular function takes two binary-valued inputs and generates a single binary-valued output.

Defining a function using an explicit lookup-table of input-output mappings works well for XOR, as it is discrete and there are only four possible two-bit combinations of input (22). Lookup-tables become utterly impractical, however, when one wishes to describe functions of higher dimensionality or possibly real-valued input/output spaces. To define a more complex function, more powerful mechanisms are necessary.

A very useful (and more powerful) way to think about functions is to imagine them as surfaces embedded in high-dimensional spaces; surfaces that must bend and conform in particular ways in order to pass through particular points, these points of course being samples of whatever it is that is to be modeled (see image below). The idea is that once a surface (or hypersurface) has been correctly shaped to fit the known data points, it can then also be used to make predictions about unknown points! This predictive ability is referred to as Generalization (note that lookup-tables, like the one described above, are incapable of any form of generalization). Another very important concept is that a "simpler" surface (i.e. smoother) is likely to generalize more effectively than a very "complex" one (i.e. distorted or wrinkled) and is thus preferable. [see Occam's Razor, VC-Dimension]

There exist many techniques for creating, poking, prodding, sculpting, and analyzing these high-dimensional surfaces (i.e. functions). A few popular examples of these include:

An algorithm is a procedure performed on data that can involve branching conditional logic and requires a finite (but possibly very large) number of steps in order to complete. It differs from a function mapping in that it unfolds through time, rather than being a simple input to output relationship.

Many attempts have been made to modify the techniques developed for the training of static function mappings, and to apply them to functions that recursively refer to themselves, which moves into the realm of algorithms. An example of one such self-referential architecture is known as a Recurrent Neural Network (RNN), which can be trained in various ways, including two popular methods known as Real-Time Recursive Learning (RTRL) and BackPropagation Through Time (BPTT).

The problem is that none of this actually works very well in practice! The various training methods cannot handle long time delays or large amounts of non-linearity. They manage to learn only the simplest of algorithms. The reason for this limitation is that the error-signal (i.e. the information which is used for training) is either amplified to the point of distortion, or else dwindles and disappears, and this phenomena increases the further back through time the error-signal must travel. Each step of algorithmic execution further weakens the signal. It has been shown that this problem is essentially unavoidable and is commonly referred to as the Problem of Vanishing Gradients.

Unfortunately, there exists no general way to examine the specifications of a potential algorithm and figure out, a priori, if it will ever halt after being fed with a given piece of input data. This dilemma has been sensibly dubbed the Halting Problem and has ties to many other areas of theoretical interest, such as Gödel's incompleteness theorem.

A SIMPLE ALGORITHM:(1) Set X to a randomly chosen positive integer(2) While X does not equal 1 do the following: (2.1) if X is even, do X:=X/2 (2.2) otherwise (X is odd), do X:=(X*3)+1(3) Halt

For example, even the "simple" algorithm outlined above has never been proven to halt for all possible initial values of X. It has been extensively tested over the years, and it has always been empirically observed to halt, but this does not constitute a proof.

In addition to the Halting Problem, there are other, even more disturbing gaps in what can be known about algorithms. Here is a quote from a wonderful book entitled Algorithmics: The Spirit Of Computing by David Harel:

The fact that verification and halting are undecidable is only one small part of a far more general phenomenon, which is actually much deeper and a lot more devastating. There is a remarkable result, called Rice's theorem, which shows that not only can we not verify programs or determine their halting status, but we can't really figure out anything about them. No algorithm can decide any nontrivial property of computations. More precisely, let us say we are interested in deciding some property of programs, which is (1) true of some programs but not of others, and (2) insensitive to the syntax of the program and to its method of operation or algorithm; that is, it is a property of what the program does, of the problem it solves, and not of the particular form that solution takes. For example, we might want to know whether a program ever outputs a "yes," whether it always produces numbers, whether it is equivalent to some other program, etc. etc. Rice's theorem tells us that no such property of programs can be decided. Not even one. They are all undecidable. We can really forget about being able to reason automatically about programs, or to deduce things about what our programs do. This is true whether our programs are small or large, simple or complex, or whether what we want to find out is a general property or something petty and idiosyncratic. Virtually nothing about computation is computable!

DIRE IMPLICATIONS

Clearly even the simplest of algorithms can be monstrosities to deal with, at least when it comes to creating them or making definitive statements about their behavior.

None of the gradient based methods that have been developed to deal with continuous function-mappings (as described in the previous section above) are able to work effectively in the space of general-purpose algorithms, because none of them can completely avoid the Problem of Vanishing Gradients.

It also appears that trying to make definitive statements about the behavior of any particular arbitrary algorithm is, from a proof-theoretic standpoint, impossible.

These sobering insights severely limit the set of mathematical tools that can be applied to the creation of useful algorithms. On the one hand, we are unable to use continuous mechanisms (derivative-based error-minimization) to create useful algorithms. On the other hand, we aren't able to use symbolic mechanisms (logical proofs) to talk about useful algorithms either!

Seemingly the only mechanisms left available to us for dealing with algorithms are ones that, much like Mother Nature, primarily rely on trial-and-error. We essentially are forced to change our strategy from that of creating solutions to that of searching for solutions.

In fact, some of the discoveries made using these evolutionary methods have already greatly surpassed the abilities of human designers. Here is one:

This very striking (and small) example is an antenna designed for NASA using a Genetic Algorithm. This solution is superior in several respects to anything that human engineers have thusfar managed to come up with. And exactly why it works remains somewhat of a mystery...

In the end it should come as no surprise that our most powerful methods for discovery are directly inspired by Nature. She's been doing this a lot longer than we have.

Once a given local optima has been reached, it is often very difficult to escape, particularly if adaptation is taking place on a fairly non-linear or "rough" fitness landscape. In his extensive work dealing with adaptation on NK-model landscapes, biologist Stuart Kaufman has made the following observations in his book The Origins Of Order:

Walks always stop at local optima well below the global optimum. Trapping is rife in network space. This has a critical consequence: adaptive walks typically cannot achieve arbitrary patterns of activities on attractors! Adaptation via fitter variants in network space becomes grossly hindered by the rugged structure of the landscape. Walks become frozen into small regions of the space. Any intuition we may have harbored that mutation and selection alone could tune attractors to arbitrary patterns of behavior appears to be wrong.

I suggest that once a given solution has reached a fairly fit local optima and has stopped improving at a reasonable rate, further computational resources are best spent on experiments involving other potential solutions of lesser fitness that still have a reasonable chance for improvement (i.e. have not yet reached their own local optima), hopefully allowing at least one of them to surpass the current best optima before inevitably grinding to a halt.

Assuming that the above idea is correct, it would be highly desirable to have some way of detecting when a given solution has reached an optima (either local or global) and thus be able to automatically trigger reallocation of search resources.

The Theory of Records may offer a way to do just that...

FELLER'S THEORY OF RECORDS

To again quote Kaufman:

In keeping with the sensible intellectual strategy of gaining insight into complex behavior by proposing simple initial caricatures, it is of interest to find that a simple, near-universal law describes the behavior of an adapting population in the limit of search by long jumps across sequence space. The "law" is equivalent to Feller's theory of records, which asks the waiting time for successive new athletic records to be set. The answer is that the waiting time doubles after each record is set.

Kaufman proceeds to explain his findings relating to optimization on a variety of problems, including the Traveling Salesman Problem, in which the rate of improvement very closely matches the Theory of Records schedule described above.

The Theory of Records can be characterized by the following simple equation:

X = Log2T

where X is the total number of improvements found during the lifetime of the agent, and T is the total number of generations of experimentation that the agent has undergone.

LOCAL OPTIMA DETECTION VIA THE THEORY OF RECORDS

I suggest that it may be possible to use the Theory Of Records to statistically predict whether or not a given agent has become "stuck" on a local optima by examining the density of it's recent improvements X w.r.t. its total lifetime T.

We would anticipate that once an agent had reached a "dead-end" it would stop improving, thus leading to the following inequality:

X < Log2T

The above inequality would be a signal to us that further improvement of a particular agent is becoming increasingly unlikely. This would suggest that we allocate fewer computational resources to that particular agent, and instead redirect those freed resources to other, more promising (albeit lower fitness) agents within the population.

Friday, June 03, 2005

First, the existence (at some unspecified point in the future) of some really good Machine Learning algorithms; perhaps not yet surpassing human intellect or even passing the Turing Test, but considerably more powerful than today's methods.

Second, imagine some relatively inexpensive and portable real-time brain monitoring technologies, perhaps not sensitive at the level of the individual neuron, but maybe sensitive enough to differentiate between cortical columns; in other words, far better than any current EEG or fMRI technologies.

Now let's imagine an interesting application that might be developed as a result of these hypothetical technological breakthroughs: a program that could generate pleasing novel music (or any other media form) in real-time based on the physiological/neurological state of the listener!

PHASE I: Data-Mining

Imagine that a listener, as the subject of this thought-experiment, is exposed to a set of songs. They listen to each song in turn, and then give some form of feedback concerning their reactions to that song; perhaps a simple thumbs-up or thumbs-down; perhaps a rating from 1 to 10. The particular form of feedback is inconsequential, however the more information that is given the better. During this listening session the listener has been attached to some sort of brain-monitoring device (as eluded to above) which has relayed a vast amount of information (possibly Terabytes, certainly Gigabytes) to the CPU. The CPU concurrently receives 3 forms of data: 1) the raw waveform information of the given audio file, 2) the real-time cortical activity patterns of the listener brain in response to this auditory stimuli, and 3) the consciously elicited response of the listener, recorded in some form of "rating".

The CPU now has the task of making two types of highly complex associative mappings. First, it needs to be able to correlate the auditory stimuli with the effect produced in the cortical dynamics of the listener brain. The mapping must be good enough to be at least somewhat predictive; i.e. the CPU should be able to make a decent guess about the effect that a new piece of audio will have on the neurological state of the listener. Second, the CPU needs to make an additional complex associative mapping between the cortical state of the listener and the conscious response elicited (e.g. thumbs-up or thumbs-down).

Now the CPU can begin to experimentally generate new music, and can receive meaningful feedback in real-time about the success of it's efforts! [Note that there may be major differences between different individual brains! This means that the Phase I data-mining process would likely have to be repeated for each user.]

We will assume that the CPU's initial efforts would be utterly disastrous, but could presumably improve with time via evolutionary selective pressures combined with massive compute power (this is the future after all...)

The end result of this process would (hypothetically) be personalized music/sound-scapes that fluctuate and re-adapt in real-time depending on the minute moment-by-moment details of the listener's mood and state of attention, and would be custom-tailored to the brain-dynamics of that particular individual.

An offshoot of this idea might be the cultivation of particular states of consciousness such as 'alertness', 'relaxation', 'creativity', etc...

Also, the sensory stimuli need not (as mentioned above) be limited to audio...

I have spent the last 5 years learning about and actively experimenting with Artificial Intelligence. At first this exploration was focused primarily on Neural Networks, but it has since blossomed out into a wider range of areas. Here are some of the topics that currently fascinate me: