An early classic in computational neuroscience was a 1993 paper by Elman called “The Importance of Starting Small.” The paper describes how initial limitations in a network’s memory capacity could actually be beneficial to its learning of complex sentences, relative to networks that were “adult-like” from the start. This still seems like a beautiful idea – the cognitive limitations of children may somehow be adaptive for the learning they have yet to do.

And Elman is not alone: a number of other researchers have proposed that a lack of cognitive control or working memory capacity could actually be beneficial. Unfortunately, there is very little behavioral data that supports this idea.

In 1999, Rohde & Plaut appeared to deal the death blow to this idea. They showed that Elman’s result is true only for a very particular type of sequential input stimulus: those where long-range dependencies contain intervening information that is uncorrelated with the items showing long-range dependence. It’s worth pausing to consider how large the gulf is between theory and data on this point.

Newport, Braver, Thompson-Schill, Dayan, and undoubtedly others have all suggested the same general idea (with varying rationales): somehow, cognitive limitations must be advantageous. Otherwise, the cost of these limitations would surely eliminate them, perhaps evolutionarily (for example, among children and teenagers who do something stupid and accidentally kill themselves).

In fact, when Rohde & Plaut used input stimuli in which the intervening information was correlated with the items showing long-range dependence, they actually observed an advantage (or at least no disadvantage) for starting “big.” The message was clear: something peculiar about the training data or parameters used by Elman must have driven the results.

Rohde & Plaut argue that connectionist networks intrinsically extract more basic covariations in training data before extracting more complex ones. Subsequent work by Conway et al. has demonstrated that staged input can improve language-like learning in some cases, but the case for a benefit from initial limitations in memory capacity remains where it stood as of Rohde & Plaut’s paper.

A related issue concerns the cascade-correlation algorithm for changing the topology of neural networks. Briefly, the concept is that the network can spontaneously generate new units for processing once its learning appears to stagnate. Some claim these networks can learn up to 1000-5000% faster than those using the more standard backpropagation algorithm with a pre-specified architecture, but I can’t find a citation to back this claim, and I can’t check it in Emergent (it doesn’t include a cascade-correlation algorithm). Nonetheless, cascade-correlation is the only implemented algorithmic “self-shaping” mechanism I know of (please see comments section for important corrections – apparently there are many forms of this, including one described in this followup post).
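To make the "grow when learning stagnates" idea concrete, here is a minimal Python sketch of just the trigger logic. It is not the cascade-correlation algorithm itself, which also trains candidate units against the residual error before installing them. The names `has_stagnated`, `train_with_growth`, `patience`, and `min_improvement` are made up for illustration, and `train_epoch` and `add_unit` are placeholders the caller would supply.

```python
def has_stagnated(loss_history, patience=3, min_improvement=1e-3):
    """Return True if the best loss in the last `patience` epochs is not
    at least `min_improvement` better than the best loss before them."""
    if len(loss_history) <= patience:
        return False  # not enough history to judge yet
    best_before = min(loss_history[:-patience])
    best_recent = min(loss_history[-patience:])
    return best_before - best_recent < min_improvement


def train_with_growth(train_epoch, add_unit, max_epochs=50,
                      patience=3, min_improvement=1e-3):
    """Run training epochs and grow the network whenever learning stalls.

    train_epoch() -> float loss and add_unit() -> None are hypothetical
    callbacks; this function only decides *when* to grow.
    Returns the number of units added.
    """
    history = []
    units_added = 0
    for _ in range(max_epochs):
        history.append(train_epoch())
        if has_stagnated(history, patience, min_improvement):
            add_unit()
            units_added += 1
            history.clear()  # restart the stagnation window after growing
    return units_added
```

With a loss that never moves, for instance, `train_with_growth(lambda: 1.0, lambda: None, max_epochs=8)` would request a new unit every fourth epoch.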

That was going to be the end of this post. In a funny case of synchronicity, I discovered after writing it that Krueger & Dayan have a new paper – available online only as of New Year’s – demonstrating a new case in which the Elman result holds. I’ll discuss that in my next post.

There are a number of algorithms that evolve the topology along with the weights of neural networks, which is very much the “self-shaping” mechanism you’re talking about. The one I’m most familiar with is NEAT (NeuroEvolution of Augmenting Topologies), developed by Kenneth Stanley:

One of its defining aspects is “starting small”, reducing the dimensionality of the search for an optimal topology by starting with minimal architectures and incrementally adding nodes and connections. As in Elman’s work, starting small yields better performance than starting large.
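As a rough illustration of what "starting small" means here, the Python sketch below builds a minimal genome (inputs wired straight to outputs, no hidden nodes) and grows it with an add-node mutation that splits an existing connection. This is a simplified toy, not the NEAT implementation: it omits innovation numbers, speciation, and crossover, and the dictionary layout and function names are assumptions made for the example.

```python
import random


def minimal_genome(n_inputs, n_outputs):
    """Start small: every input connects directly to every output."""
    nodes = list(range(n_inputs + n_outputs))
    connections = [
        {"src": i, "dst": n_inputs + o, "weight": 0.0, "enabled": True}
        for i in range(n_inputs)
        for o in range(n_outputs)
    ]
    return {"nodes": nodes, "connections": connections}


def add_node(genome, rng):
    """Split a randomly chosen enabled connection with a new hidden node.

    The old connection is disabled; the incoming link gets weight 1.0 and
    the outgoing link inherits the old weight, so behavior changes little.
    Returns the id of the new node.
    """
    enabled = [c for c in genome["connections"] if c["enabled"]]
    old = rng.choice(enabled)
    old["enabled"] = False
    new_id = len(genome["nodes"])
    genome["nodes"].append(new_id)
    genome["connections"].append(
        {"src": old["src"], "dst": new_id, "weight": 1.0, "enabled": True})
    genome["connections"].append(
        {"src": new_id, "dst": old["dst"], "weight": old["weight"], "enabled": True})
    return new_id


genome = minimal_genome(n_inputs=2, n_outputs=1)
add_node(genome, random.Random(0))  # topology grows one node at a time
```

The search begins in the lowest-dimensional space that can express any solution at all, and complexity is added only through mutations like this one.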

Derek – it’s always a pleasure – this is great! I am unfamiliar with this work. The Krueger & Dayan paper I’ll discuss tomorrow changes its own topology as well, but with unclear effects on generalization. Maybe I’ll post on NEAT soon as well.

CHCH
I haven’t implemented an RBF network yet (although I’m getting ready to), but my understanding is that additional hidden nodes are added to the network if the existing kernel centers are too far away from the current instance. By adding additional hidden nodes, an RBF network accumulates additional knowledge.
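A minimal Python sketch of that growth rule, as I understand it: a new hidden unit is added only when the current input is farther than some threshold from every existing kernel center. The class name, the `distance_threshold` parameter, and the shared Gaussian `width` are all assumptions made for illustration, and the output weights and their training are left out entirely.

```python
import math


class GrowingRBF:
    """Toy RBF hidden layer that adds a unit when no center is near the input."""

    def __init__(self, distance_threshold, width=1.0):
        self.distance_threshold = distance_threshold  # hypothetical novelty criterion
        self.width = width                            # shared Gaussian width (assumed)
        self.centers = []                             # one center per hidden unit

    def activations(self, x):
        """Gaussian response of each hidden unit to input x."""
        return [
            math.exp(-math.dist(x, c) ** 2 / (2 * self.width ** 2))
            for c in self.centers
        ]

    def observe(self, x):
        """Add a unit centered on x if x is far from every existing center.

        Returns True if a unit was added (always the case for the first input).
        """
        if all(math.dist(x, c) > self.distance_threshold for c in self.centers):
            self.centers.append(tuple(x))
            return True
        return False


net = GrowingRBF(distance_threshold=1.0)
for point in [(0.0, 0.0), (0.1, 0.1), (3.0, 3.0), (3.2, 2.9)]:
    net.observe(point)
# Only the two mutually distant points become centers.
```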

With standard backprop, it is tricky to adjust weights on the fly, because the network was probably trained over multiple epochs. So the question becomes how much weight to give each new instance, so that you balance new information against the already established behaviors.
