Building an Artificial General Intelligence

Tuesday, 22 December 2015

Figure 1: Our headline image is from Cognitive Consilience, an atlas of key pathways cross-referenced to the supporting literature. The complexity and variety of routing within the brain can be appreciated from this beautiful illustration. Note in particular the specialisation of cortical cells and the way this affects their interactions with other cells in the cortex and elsewhere in the brain. Explore this fantastic resource yourself.

By David Rawlinson and Gideon Kowadlo

This is part 3 of our series “how to build an artificial general intelligence” (AGI). Part 1 was a theoretical look at General Intelligence (follow the link if you don’t know what General Intelligence is).

We believe that the Thalamo-Cortical system is the origin of General Intelligence in people. In Part 2 we presented very broadly how the Thalamo-Cortical system is structured and organised. We applied some core concepts, such as hierarchy, to help us describe the system.

We also looked at the cellular structure of the Cortex and in particular introduced Pyramidal cells.

This article is again about what we can learn from reverse-engineering the Thalamo-Cortical system, but this time from its connectivity, which we present in terms of circuits and pathways.

Pathways and Circuits

A pathway is a gross pattern of sequential connectivity between brain regions - for example, if part A is highly connected to part B, and activity in A is followed by activity in B, we say there exists a pathway between A and B. Cells in the Thalamo-Cortical system are connected to each other in quite restricted and specific ways, so these pathways are quite informative.

Circuits are more specific and precise details of both connectivity and functional interaction between neurons. In computational neuroscience there exists a concept called the Canonical Cortical Micro-Circuit. The specifics of this circuit are not widely agreed, because (a) the Cortex is complex and (b) many of the evidence-gathering exercises are statistical observations (e.g. “X% of outputs from A and Y% of outputs from B project to region C”) which may obscure fundamental functional or topological features. For example, outputs from A and B may project to cells with exclusive roles, but physically co-located in C. Statistical, regional approaches will not capture such distinctions.

In the neuroscience literature there’s a frustrating habit of selectively reporting supporting details while ignoring others. Perhaps this is simply because it's impossible to describe any part exhaustively. In particular, there is a lot of contradictory information about Cortical circuits. But the research can still shed some light on what is happening. Just don’t expect all sources to be consistent or complete!

The hierarchy defines a structure made of Columns, and determines which Columns interact. Pathways describe the patterns of interaction between cells within a Column, and between Columns. We assume that all Columns are functionally identical prior to training.

Cells within Columns are usually identified by both the physical location of cell bodies within particular Layers in the Column, and by the morphology (shape) of the cell. Data flow to and from Cortical cells is largely restricted to a handful of core pathways that begin and terminate in particular cell types in specific cortical layers.

Hawkins’ Hierarchical Temporal Memory (HTM) introduces 3 of the 4 pathways in a single, coherent scheme and relates them to a general intelligence algorithm. We will borrow this terminology and describe them in detail below. Their names describe the direction of data flow and the routing used:

Feed-Forward Direct (cortex-to-cortex, ascending the hierarchy)

Feed-Forward Indirect (routed via the Thalamus)

Feed-Back Direct (cortex-to-cortex, terminating at C1)

Feed-Back Direct (cortex-to-cortex, terminating at C6; not part of Hawkins’ scheme)

Note that cells in all cortical layers (except, perhaps, C4) receive input via their dendrites in C1. In other words, feedback from C6 to C1 is then used as input to many layers. Feedback from C6 to C6 is generally not input for other layers.

In neuroscience, Feed-Forward usually means the flow of data away from external sources such as sensors (towards greater abstraction, if you believe in a cortical hierarchy). Feed-Back means the opposite - data flow towards regions that have direct interaction with external sensors and motors.

Direct pathways are so-called because data is routed directly from one cortical column or region to another, without a stop along the way. Indirect pathways are routed via other structures. The “Feed-Forward-Indirect” pathway described by Hawkins is routed via the Thalamus.

Hawkins assigns specific roles to these pathways, but we will be re-interpreting them in the next article.

Figure 2: Routing of 3 core pathways, based on a diagram from the HTM/CLA White Paper. Note the involvement of specific cortical layers with each pathway, and the central role of the Thalamus. The names of the pathways indicate direct (cortex-to-cortex) and indirect (cortex-thalamus-cortex) variants, with direction being either forward (away from external sensors and motors, towards increasing abstraction) or backward (towards more concrete regions dealing with specific sensor/motor input).

The role of the Thalamus

Let’s recap: The Cortex is composed of Columns, organised into a hierarchy. Cells pass messages directly to other Columns that are higher or lower in the hierarchy. Messages may also be transmitted indirectly between Columns, via the Thalamus.

This section will describe indirect pathways involving the Thalamus. Figure 3 is a reproduction of a figure from Sherman and Guillery (2006) that has two new features of interest. These authors use the terminology “first order” to denote cortical regions receiving direct sensor input and “higher order” to denote cortical regions receiving input from “first order” cortical regions. This corresponds with the notion of hierarchy levels 1 and 2.

The Thalamus is a significant part of the “Feed-Forward Indirect” pathway. This pathway originates at Cortex layer 5 and propagates to a nucleus in the Thalamus. There, the nucleus may react by transmitting a (presumably corresponding) signal to one or more other Cortical Columns, in a different region. In some theories of cortical function, the target Column is conceptually “higher” in the hierarchy. The Thalamic input enters the Cortex via Thalamic axons terminating in Cortex layer 4 and is then propagated to Cortex Layer 5 where the pathway begins again.
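The relay-and-restart structure of this pathway can be sketched in a few lines of (highly simplified) code. The function names and the binary open/closed gate below are our own illustrative assumptions, not biological detail:

```python
# A minimal sketch of the Feed-Forward Indirect pathway described above.
# The binary gate is an assumption; real thalamic gating is far subtler.

def thalamic_relay(c5_output, gate_open):
    """The Thalamus either relays or blocks the C5 signal from a lower region."""
    return c5_output if gate_open else 0.0

def feed_forward_indirect(c5_lower, gate_open):
    """C5 (lower region) -> Thalamus -> C4 (higher region) -> C5 (higher region)."""
    relayed = thalamic_relay(c5_lower, gate_open)
    c4_higher = relayed    # thalamic axons terminate in Cortex layer 4
    c5_higher = c4_higher  # C4 propagates to C5, where the pathway begins again
    return c5_higher

print(feed_forward_indirect(1.0, gate_open=True))   # signal ascends the hierarchy
print(feed_forward_indirect(1.0, gate_open=False))  # signal is filtered out
```

The point of the sketch is simply that every hop up the hierarchy via this route passes through a thalamic gate.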

Figure 3 also shows that cells in Cortex layer 6 form reciprocal, modulatory connections to the Thalamic nuclei that provide input to the Column via C4 and C5. A Column within the Cortex therefore has influence over the data that it receives from the Thalamus: the Cortex is not a passive recipient, but works with the Thalamus to control its own input. The figure also depicts C6 cells projecting to C6 in lower regions (our second feedback pathway).

Figure 3: Pathways between cortical columns in different regions, showing layer involvement in each pathway and the role of the Thalamus. Sherman and Guillery use the terminology “first order” to denote cortical regions receiving direct sensor input and “higher order” to denote cortical regions receiving input from lower (e.g. “first order”) cortical regions. This corresponds with the notion of hierarchy levels 1 and 2. Note that in addition to the 3 pathways shown in the previous figure, we see additional direct feedback pathways and reciprocal feedback from Cortex layer 6 to the Thalamic nuclei that stimulate the cortical region. Image source.

Motor output

At this point it is interesting to look at how the Cortex can influence or control behaviour, particularly the generation of motor output. There are two pathways by which it can do so:

Directly, via motor output from Cortex layer 5 (C5), whose activity is modulated by the Thalamus

Indirectly, by providing input to the Basal Ganglia, which have control outputs of their own

Note that in both cases, the origin of action selection is the Basal Ganglia. In the first case, the Basal Ganglia control signals emitted by the Thalamus, with these signals in turn affecting activity within Cortex layer 5 (C5). C5, particularly in motor areas, has been studied in detail. 10-15% of the cells in these areas are very large pyramidal neurons known as Betz cells, which can be observed to drive muscles very directly, with few synapses in between. These cells are more prevalent in primates and are especially important for control of the hands. This makes sense given that manual tasks are typically more complex and require greater dexterity than movements by other parts of the body. The human Cortex is believed to be crucial for innovative and sophisticated manual tasks such as tool-making.

Within the Cortical layers, C5 seems to be uniquely involved in motor output. Figure 4 shows some of the ways Pyramidal cells in C5 project output to areas of the brain associated with motor output and control. In contrast, pyramidal cells in C2/3 predominantly project to other areas of the cortex and are not directly involved in control.

Figure 4: Pyramidal cells in C5 project output to areas of the brain associated with motor output and control. In contrast, pyramidal cells in C2/3 predominantly project to other areas of the cortex and are not directly involved in control. Image source.

The second way that the Cortex can influence motor output is via the Basal Ganglia. In this case, we propose that the Cortex might provide contextual information to assist the Basal Ganglia in its direct control outputs, but we found no evidence that the Cortex is able to exert control over the Basal Ganglia.

We suggest Cortical influence over the Basal Ganglia is less interesting from a General Intelligence perspective, because the hierarchical representations formed within the Cortex are not exploited, and execution is performed by more ancient brain systems not associated with General Intelligence qualities.

For the rest of this article series, we will ignore control pathways that do not involve the Cortex, and will focus on direct control output from Cortex layer 5.

Action Selection

It is widely believed that action selection occurs within the flow of information from Cortex through the Basal Ganglia, a group of deep, centralised brain structures adjacent to the Thalamus. There are a number of theories about how this occurs, but it is generally believed to involve a form of Reinforcement Learning used to select from the options presented by the Cortex, with competitive mechanisms for clean switching and conflict resolution.

A major output of the Basal Ganglia is to the Thalamus; one prevailing theory of this relationship is that the Basal Ganglia control the gating or filtering function performed by the Thalamus, and as a consequence effectively manipulate the state of the Cortex. The full loop then becomes Cortex → Basal Ganglia → Thalamus → Cortex (see Wikipedia for a good illustration, or figure 5).
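The loop just described can be caricatured in code. The option names, their values and the winner-take-all selection rule below are all illustrative assumptions; real action selection in the Basal Ganglia is far more elaborate:

```python
# A toy sketch of the Cortex -> Basal Ganglia -> Thalamus -> Cortex loop.

def basal_ganglia_select(options):
    """Pick the highest-valued option presented by the Cortex. (In the brain
    these values would be learned by something like Reinforcement Learning.)"""
    return max(options, key=options.get)

def thalamus_gate(cortical_state, selected):
    """Relay only the selected option back to the Cortex; suppress the rest."""
    return {k: (v if k == selected else 0.0) for k, v in cortical_state.items()}

options = {"reach": 0.3, "grasp": 0.9, "withdraw": 0.1}  # candidate cortical activity
winner = basal_ganglia_select(options)
print(thalamus_gate(options, winner))  # only "grasp" survives the loop
```

The competitive selection gives the clean switching and conflict resolution mentioned above: exactly one option returns to the Cortex.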

As discussed above, this article will focus on motor output generated directly by the Cortex.

Figure 5: Pathways forming a circuit from Cortex to Basal Ganglia to Thalamus and back to Cortex. Image Source.

Canonical Cortical Circuit

We now have all the background information needed to define a “Canonical Cortical micro-Circuit” at a cellular level. All the information presented so far has been relatively uncontroversial, but this circuit is definitely our interpretation, not an established fact. However, we will present some evidence to (inconclusively) support our interpretation.

Figure 6: Our interpretation of the canonical cortical micro-circuit. Only a single cortical region or Column is shown. Arrow endings indicate the type of connection - driver, modulator or inhibitor. The numbers 2/3, 4, 5, and 6 refer to specific cortical layers. Each shape represents a set of cells of a particular type, not an individual cell. Self-connections and connections within each set are not shown, but often exist. Shapes T and B refer to Thalamus and Basal Ganglia, not broken down into specific cell layers or types. Data enters the diagram at 4 points, labelled A-D, but does not exit; in general the system forms a circuit not a linear path. Note that shape T occurs twice, because the circuit receives data from only one part of the Thalamus but projects to two areas in forward and backward directions.

Diagram Explanation

We will use variants of the diagram shown in figure 6 to explain our interpretation of cortical function. In this diagram, only a single Cortical region or Column (used interchangeably here) is shown. In later diagrams, we will show 3 hierarchy levels together so the flow of information between hierarchy levels is apparent.

In these diagrams, shapes represent a class of Neurons within a specific Cortical Layer. The numbers 2/3, 4, 5 and 6 refer to the Cortical layers in which these cell classes occur. The shapes labelled T and B refer to the Thalamus and Basal Ganglia (internal cell types and layers are not shown). Arrows on the diagram show the effect of each connection: driving (providing input that causes another cell to become active), modulating (stimulating or suppressing the activity of a target cell) or inhibiting (exclusively suppressing the activity of a target cell).

If you want more detail on the thalamic end of the thalamocortical circuitry, an excellent source is this paper by Sherman.

There are many interneurons (described in the previous article) that are not shown in this diagram. We chose to omit these because we believe they are integral to the function of a layer of pyramidal cells within a Column, rather than an independent system. Specifically, we suggest that inhibitory interneurons implement local self-organising and local competitive functions (e.g. winner-take-all), ensuring sparse activation of the cell types represented by shapes in our diagram (C2/3, C4, C5, and C6). The self-organising behaviour also ensures that cells within each column optimise coverage of observed input patterns given a finite cell population. Inclusion of the interneurons would clutter the diagram without adding much explanatory value.
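The winner-take-all function we attribute to these interneurons is easy to state concretely. Here is a minimal sketch of a k-winners-take-all rule over one layer of pyramidal cells; the value of k and the activation values are arbitrary assumptions:

```python
import numpy as np

# Sketch of the competitive function we attribute to inhibitory interneurons:
# only the k most strongly driven pyramidal cells in a layer remain active,
# yielding the sparse activation described in the text.

def k_winners_take_all(activations, k):
    """Zero all but the k largest activations, giving a sparse code."""
    out = np.zeros_like(activations)
    winners = np.argsort(activations)[-k:]  # indices of the k largest values
    out[winners] = activations[winners]
    return out

layer = np.array([0.2, 0.9, 0.1, 0.7, 0.4])
print(k_winners_take_all(layer, k=2))  # only the two strongest cells stay active
```

Note that the competition is purely local, which is consistent with interneurons operating within a Column rather than across the hierarchy.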

We also omit self-connections within a class of cells represented by a shape. These self-connections likely provide context and contribute to learning and exclusive activity within the class, but don’t make it easier to understand circuits in terms of cortical layers and hierarchy levels.

Excitatory Circuit

Figure 7 shows a multilevel version of the cortical circuit, similar to the multi-level figure from Sherman and Guillery (figure 3). We can now understand where the inputs to the circuit come from, in terms of other layers and external Sensors (S) and Motors (M). Note that Motors are driven directly from C5.

Figure 7: The cortical micro-circuit across several levels of Cortex with involvement of Thalamus and Basal Ganglia. The red highlight shows a single excitatory ‘circuit’. See text for details.

The red path in figure 7 shows our excitatory “canonical circuit”: Data flows from the Thalamus to spiny stellate (star-shape in figures) cells in C4 (see source and source), from where it propagates to pyramidal cells in C2/3, and then to pyramidal cells in C5. C6 is known as the multiform layer, but also contains many pyramidal cells of unusual proportions and orientations. C6 cells are driven by C5, and in turn modulate the Thalamus. Note that C6 cells within a region modulate the same Thalamic nuclei that provide input to that region of Cortex.
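The excitatory circuit just traced is simple enough to write down as a directed graph. The node labels mirror our figure; the single-successor `trace` helper is purely illustrative:

```python
# The excitatory 'canonical circuit' described above, as a directed graph.
EXCITATORY_CIRCUIT = {
    "Thalamus": ["C4"],   # driving input to spiny stellate cells in C4
    "C4": ["C2/3"],       # C4 propagates to pyramidal cells in C2/3
    "C2/3": ["C5"],       # ... and on to pyramidal cells in C5
    "C5": ["C6"],         # C6 cells are driven by C5
    "C6": ["Thalamus"],   # C6 modulates the same nuclei that provide input
}

def trace(circuit, start, steps):
    """Follow the first edge from each node, starting at `start`."""
    path, node = [start], start
    for _ in range(steps):
        node = circuit[node][0]
        path.append(node)
    return path

print(" -> ".join(trace(EXCITATORY_CIRCUIT, "Thalamus", 5)))
# Thalamus -> C4 -> C2/3 -> C5 -> C6 -> Thalamus
```

Tracing five steps returns to the Thalamus, making the point from the figure caption that the system forms a circuit, not a linear path.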

Inhibitory Circuit

A second, inhibitory circuit exists alongside our excitatory circuit. In addition to providing input to the Cortex via C4, axons from the Thalamus also drive inhibitory Parvalbumin-expressing (PV) neurons in C4 (shown as circles in the diagram). These inhibitory neurons make up a large fraction of all the cells in C4, and inhibit pyramidal cells in C5 (see source or source).

This means that the input from the Thalamus can be both informative and executive. It is executive in that it actually manipulates the activity of layer 5 within the Cortex, and informative by providing a copy of the signal driving the manipulation to C4. Figure 8 shows our inhibitory circuit. We believe this circuit is of critical importance because it provides a mechanism for the Thalamus to centrally manipulate the state of the Cortex, specifically layer 5 and 6 pyramidal cells. This hypothesis will be expanded in the next article.
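The dual informative/executive role can be made concrete with a toy function. The gain and the simple subtractive inhibition below are our assumptions, chosen only to illustrate that one signal has two effects:

```python
# Sketch of the dual role of thalamic input: the same signal that informs C4
# also drives PV interneurons in C4, which in turn suppress pyramidal cells
# in C5. Gains and the subtractive rule are illustrative assumptions.

def thalamic_input_effects(t_signal, c5_activity, inhibition_gain=1.0):
    """Return (copy delivered to C4, resulting C5 activity)."""
    c4_copy = t_signal                           # informative: C4 sees the signal
    pv_drive = t_signal * inhibition_gain        # executive: PV cells are driven...
    c5_after = max(0.0, c5_activity - pv_drive)  # ...and suppress C5 pyramidal cells
    return c4_copy, c5_after

print(thalamic_input_effects(t_signal=0.5, c5_activity=1.0))
# -> (0.5, 0.5): C4 is informed while C5 is suppressed by the same signal
```

When the thalamic signal is zero, C5 activity is untouched; as the signal grows, C5 is progressively suppressed while C4 always receives a faithful copy.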

Figure 8: The inhibitory micro-circuit. The red highlight shows how the Thalamus controls activity in C5 within a Column by activating inhibitory cells in C4. The circuit is completed by C5 pyramidal cells driving C6 cells, which in turn modulate the activity of the same Thalamic nuclei that selectively activates C5. Each shape denotes a population of cells of a specific type within a single Column, excluding ‘T’ and ‘B’ that refer to the Thalamus and Basal Ganglia respectively.

Figure 9: Inhibitory interneurons in the Cortex. Of particular interest are the “PV” cells that are driven by axons from the Thalamus terminating in layer 4 and in turn inhibit pyramidal cells in layer 5. Image source.

Pathways and the Canonical Circuits

Now let’s look at how pathways emerge from our cortical micro-circuit. Figures 10, 11 and 12 show the Feed-Forward Direct, Feed-Forward Indirect and first Feed-Back pathways respectively. We also include a second direct Feed-Back pathway, terminating at C6 (figure 13). Feed-Back Direct pathways terminating at C1, where many fibres are intermingled, are harder to interpret than feedback terminating directly at C6, because Pyramidal neurons from many layers have dendrites in C1.

Figure 10 highlights the Feed-Forward direct pathway. Signals propagate from C4 to C2/3 and then to C4 in a higher Column. This pattern is repeated up the hierarchy. This pathway is not filtered by the Thalamus or any other central structure. Although activity from C2/3 propagates to C5, it does not ascend the hierarchy via this route: C5 in one Column does not directly connect to C5 in a higher Column, only via an indirect pathway (see below).

Figure 11: Feed-Forward Indirect pathway.

Figure 11 highlights the Feed-Forward Indirect pathway. The Thalamus is involved in this pathway, and may have a gating or filtering effect. Data flows from the Thalamus to C4, to C2/3, to C5 and then to a different Thalamic nucleus that serves as the input gateway to another cortical Column in a different region of the Cortex.

Figure 13 highlights the second of two Feed-Back Direct pathways. This pathway might be involved in cascading control activity down the hierarchy towards sensors and motors - the next article will expand on this idea. Activity propagates from C6 to C6 directly. C6 modulates the activity of local C5 cells and relevant Thalamic nuclei that drive local C5 cells. Note that connections from a Column to the Thalamus are reciprocal; feedback from C6 to the Thalamus targets the same nuclei that project axons to C4.

Figure 13: The second of two Feed-Back Direct pathways.

Summary

We’ve presented some additional, detailed perspectives on the organisation and function of circuits and pathways within the Thalamo-Cortical system and presented our interpretation of the canonical cortical micro-circuit.

So what’s the point of all this information? What do these circuits and pathways do, and why are they connected this way? How do they work?

It might seem that we’ve stopped short of really trying to interpret all this information and that’s because we are, indeed, holding back. After having spent so much time presenting background information, the next article finally attempts to understand why the thalamocortical system is connected in the ways described here, and how this system might give rise to general intelligence.

Sunday, 29 November 2015

Figure 1: The physical architecture of general intelligence in the brain, namely the Thalamo-Cortical system. The system comprises a central hub (the Thalamus) surrounded by a surface (the Cortex, shown in blue). The Cortex is made up of a number of functionally-equivalent units called Columns. Each Column is a stack of cell layers. Columns share data, often via the central hub. The hub filters the data relayed between Columns.

Authors: Rawlinson and Kowadlo

This is part 2 of our series on how to build an artificial general intelligence (AGI). This article is about what we can learn from reverse-engineering mammalian brains. Part 1 is here.

The next few articles will try to interpret some well-established neuroscience in the context of general intelligence. We’ll ignore anything we believe is unrelated to general intelligence, and we’ll simplify things in ways that will hopefully help us to think about how general intelligence happens in the brain.

It doesn’t matter if we are missing some details, if the overall picture helps us understand the nature of general intelligence. In fact, excluding irrelevant detail will help, as long as we keep all the important bits!

These articles are not peer reviewed. Do assume everything here is speculation, even when linked to a source reference (our interpretation may be skewed). There isn’t space to repeatedly add this caveat throughout these articles.

1. Physical Architecture

First we’ll review and interpret the gross architecture of the brain, focusing on the Thalamo-Cortical system, which we believe is primarily responsible for general intelligence.

The Thalamo-Cortical system comprises a central hub (the Thalamus and Basal Ganglia), surrounded by a thin outer surface (the Cortex). The surface consists of a large number of functionally-equivalent units, called Columns. The Cortex is wrinkled so that it’s possible to pack a large surface area into a small space.

Why are the units called Columns? Because of the physical structure of their connectivity patterns. Cells within each Column are highly interconnected, but connections to cells in other Columns are fewer and less varied. Columns occupy the full thickness of the surface, approximately 6 distinct layers of cells. Since these layers are stacked on top of each other and are loosely connected between stacks, we have the appearance of a surface made of Columns.

In the previous article we described the ideal general intelligence as a structure made of many identical units that have each learned to play a small part in a distributed system. These theoretical units are analogous to Columns in real brains.

Columns can be imagined to be independent units that interact by exchanging data. However, data travelling between Columns often takes an indirect path, via the central hub.

The hub filters messages passed between Columns. In this way, the filter acts as a central executive that manages the distributed system made up of many Columns.

We believe this is a fundamental aspect of the architecture of general intelligence.

In mammalian brains, the filter function is primarily the role of the Thalamus, although its actions are supported and modulated by other parts, particularly the Basal Ganglia.

Other brain components, such as the Cerebellum, are essential for effective motor control but perhaps not essential for general intelligence. They are not within the scope of this article.

2. Logical Architecture

The Cortex has both a physical structure (a layered surface, partitioned into columns) and a logical structure. The logical structure is a hierarchy - a tree-like structure that describes which columns are connected to each other (see figure 2).

Connections between columns are reciprocal: “Higher” columns receive input from “Lower” columns, and return data to the same columns. This scheme is advantageous: Higher columns have (indirect) input from a wider range of sources; lower columns use the same resources to model more specific input in greater detail. This occurs naturally because each column tries to simplify the data it outputs to higher columns, allowing columns of fixed complexity to manage greater scope in higher levels, as data is incrementally transformed and abstracted.

Only Columns in the lowest levels receive external input and control external outputs (albeit, often indirectly via subcortical structures).

Figure 2: The logical architecture of the cortex, a hierarchy of Columns. The hierarchy gives us the notion of “levels”, with the lowest level having external input and output, and higher levels being increasingly abstract. The logical architecture is superimposed on the physical architecture. Note that inter-Column connections may be gated by the central hub (not shown).

Note that there are not necessarily fewer columns in each hierarchy level; there may be, but this is not essential. However, abstraction increases and scope broadens as we move to higher hierarchy levels.
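One way to see how fixed-complexity columns can cover broadening scope is a toy hierarchy in which each column "simplifies" its inputs before passing them up. The averaging rule below is an arbitrary stand-in for whatever compression a real column learns, and (unlike real cortex, as noted above) this toy does shrink the column count at each level:

```python
# A sketch of the logical hierarchy: each column summarises a small group of
# lower-level outputs, so higher levels see broader scope at fixed complexity.

def summarise(inputs):
    """Stand-in for a column's learned simplification of its inputs."""
    return sum(inputs) / len(inputs)

def build_hierarchy(sensor_values, branching=2):
    """Group lower-level outputs and summarise each group, level by level."""
    levels = [sensor_values]
    while len(levels[-1]) > 1:
        below = levels[-1]
        above = [summarise(below[i:i + branching])
                 for i in range(0, len(below), branching)]
        levels.append(above)
    return levels

for depth, level in enumerate(build_hierarchy([1.0, 3.0, 5.0, 7.0])):
    print(f"level {depth}: {level}")
# level 0 has 4 columns; each higher level covers wider scope in less detail
```

Each value at level 2 depends (indirectly) on every sensor value, while each value at level 0 depends on exactly one, which is the abstraction/scope trade-off described above.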

We can jump between the physical and logical architectures of the Cortex. Moving over the surface implies moving within the hierarchy; it also implies that, as we move between areas, we will observe responses to different subsets of input data. Moving to higher hierarchy levels implies an increase in abstraction. We can observe this effect in human brains, for example by following the flow of information from the processing of raw sensor data to more abstract brain areas that deal with language and understanding (see figure 3).

Figure 3: Flow of information across the physical human Cortex also represents movement higher or lower in the logical hierarchy (increasing or decreasing abstraction). In fact, we can observe this phenomenon in human studies. Different parts of the hierarchy are specialised to conceptual roles such as understanding what, why and where things are happening. Image source.

One final point about the logical architecture. The hierarchical structure of the Cortex is mirrored in the central hub, particularly in the Thalamus and Basal Ganglia, where we see the topology of the cortical Columns preserved through indirect pathways via central hub structures (figure 4).

Figure 4: Data flows between different Columns within the Cortex either directly, or via our conceptual “central hub”. Our hub includes Basal Ganglia such as the Striatum, and the Thalamus. Throughout this journey the topology of the Cortex is preserved. Image source.

3. Layers and Cells

Each Column has approximately 6 distinct “layers”. As with every biological rule, there are exceptions, but this approximation suffices for the level of detail we require here. The layers are visual artefacts resulting from variations between the layers in cell type, morphology, connectivity patterns and, therefore, function (figure 5).

Figure 5: Various stainings showing variation in cell type and morphology between the layers of the Cortex. Image source.

The Cortex has only 5 functional layers. Structurally, it has 6 gross layers, but one layer is just wiring, with no computation. In addition, the functional distinction between layers 2 and 3 is uncertain, so we will group them together. This gives us just 4 unique functional layers to explain.

We will use the notation C1 ... C6 to refer to the gross layers:

C1 - just wiring, no computation; not functional

C2/3 (indistinct)

C4

C5

C6 (known as the “multiform” layer due to the variety of cell types)

The cortex is made of a veritable menagerie of oddly shaped cells (i.e. Neurons) that are often confined to specific layers (see figure 6). Neurons have a body (soma), dendrites and axons. Dendrites provide input to the cell, and reach out to find that data. Axons transmit the output of the cell to places where it can be intercepted by other cells. Both dendrites and axons have branches.

Figure 6: Some of the cell types found in different cortical layers. Image source.

An important feature of the Cortex is the presence of specialised Neurons with pyramidal Somas (bodies) (figure 7). Pyramidal cells are predominantly found in C2/3, C5 and C6, where they are both very large and very common.

Pyramidal cells have two distinct sets of dendrites: Apical dendrites, hypothesised to recognise patterns of simultaneous input, and Basal dendrites. Hawkins suggests that the Basal dendrites provide a sequential or temporal context in which the Pyramidal cell can become active. Output from the cell along its axon branches only occurs if the cell observes particular instantaneous input patterns in a particular historical context of previous Pyramidal cell activity.

Within one layer of a Column, Pyramidal cells exhibit a self-organising property that results in sparse activation. Only a few Pyramidal cells respond to each input stimulus. The Pyramidal cells are powerful pattern and sequence classifiers that also perform a dimensionality-reduction function; when active, the activity of a single Pyramidal cell represents a pattern of input over a period of time.

The training mechanism for sparsity and self-organisation is local inhibition. In addition to Pyramidal cells, most of the other Neurons in the Cortex are so-called “Interneurons” that we believe play a key role in training the Pyramidal cells by implementing a competitive learning process. For example, Interneurons could inhibit Pyramidal cells around an active Pyramidal cell ensuring that the local population of Pyramidal cells responds uniquely to different input.
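The competitive learning process described here can be sketched in a few lines: the winning cell (the rest being silenced by inhibition) moves its weights towards the current input, so different cells come to respond exclusively to different patterns. The initial weights, learning rate and input patterns below are arbitrary assumptions, chosen to be deterministic for reproducibility:

```python
import numpy as np

# Toy competitive learning: Interneurons silence all but the most active
# Pyramidal cell, and only the winner adapts towards the input.

weights = np.array([[0.9, 0.1, 0.1],
                    [0.1, 0.9, 0.1],
                    [0.1, 0.1, 0.9],
                    [0.5, 0.5, 0.5]])  # 4 pyramidal cells, 3 inputs each

def train_step(weights, pattern, lr=0.1):
    winner = int(np.argmax(weights @ pattern))  # inhibition leaves one winner
    weights[winner] += lr * (pattern - weights[winner])
    return winner

patterns = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])]
for _ in range(50):
    for p in patterns:
        train_step(weights, p)

winners = [train_step(weights, p, lr=0.0) for p in patterns]
print(winners)  # [0, 2]: distinct cells now respond exclusively to each pattern
```

The result is the property claimed above: the local population of Pyramidal cells comes to respond uniquely to different inputs, with a sparse (here, single-cell) response to each.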

Unlike Pyramidal cells, which receive input from outside the Column and transmit output outside the Column, Interneurons generally only work within a Column. Since we consider that Interneurons play a supporting role to Pyramidal cells, we won’t have much more to say about them.

Figure 7: A Pyramidal cell as found in the Cortex. Note the Apical and Basal dendrites, hypothesised to recognise simultaneous and historical inputs patterns respectively. The complete Pyramidal cell is then a powerful classifier that when active represents a particular set of input in a specific historical context. Image source.

Summary

That's all we feel is necessary to say about the gross physical structure of the Thalamo-Cortical system and the microscopic structure of its cells and layers. The next article will look at the circuits and pathways by which these cells are connected, and the computational properties that result.

Thursday, 12 November 2015

The artificial neuron model used by Jeff Hawkins and Subutai Ahmad in their new paper (image reproduced from their paper, and cropped). Their neuron model is inspired by the pyramidal cells found in neocortex layers 2/3 and 5.

They propose that each dendrite is individually a pattern-matching system similar to a traditional artificial neuron: The dendrite has a set of inputs to which it responds, and a transfer function that decides whether enough inputs are observed to "fire" the output (although nonlinear continuous transfer functions are more widely used than binary output).

In the paper, they suggest that a single pyramidal cell has dendrites for recognising feed-forward input (i.e. external data) and other dendrites for feedback input from other cells. The feedback provides contextual input that allows the neuron to "fire" only in specific sequential contexts (i.e. given a particular history of external input).

To produce an output along its axon, the complete neuron needs both an active feed-forward dendrite and an active contextual dendrite; when the neuron fires, it implies that a particular pattern has been observed in a specific historical context.
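This two-condition firing rule is easy to sketch. The thresholds and the set-overlap matching rule below are our simplifications for illustration, not Numenta's exact algorithm:

```python
# Sketch of the neuron model described above: the cell fires only when a
# feed-forward dendrite matches the current input AND a contextual dendrite
# matches the recent activity of other cells.

def dendrite_active(synapses, active_inputs, threshold):
    """A dendrite 'fires' if enough of its synapses see active inputs."""
    return len(synapses & active_inputs) >= threshold

def neuron_fires(ff_dendrites, context_dendrites, ff_input, context, threshold=2):
    ff_ok = any(dendrite_active(d, ff_input, threshold) for d in ff_dendrites)
    ctx_ok = any(dendrite_active(d, context, threshold) for d in context_dendrites)
    return ff_ok and ctx_ok  # pattern observed in a specific historical context

ff = [{"a", "b", "c"}]   # recognises an external pattern
ctx = [{"n1", "n2"}]     # recognises prior activity of cells n1 and n2
print(neuron_fires(ff, ctx, {"a", "b"}, {"n1", "n2"}))  # True
print(neuron_fires(ff, ctx, {"a", "b"}, {"n3"}))        # False: wrong context
```

The second call shows the key property: the same external input fails to fire the cell when the sequential context is absent.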

In the original CLA whitepaper, multiple sequential contexts were embodied by a "column" of cells that shared a proximal dendrite, although they acknowledged that this differed from their understanding of the biology.

The new paper suggests that basket cells provide the inhibitory function that ensures sparse output from a column of pyramidal cells having similar receptive fields. Note that this definition of column differs from the one in the CLA whitepaper!

The other interesting feature of the paper is its explanation of the sparse, distributed sequence memory that arises from a layer of the artificial pyramidal cells with complex, specialised dendrites. This is also a feature of the older CLA whitepaper, but there are some differences.

Hawkins and Ahmad's paper matches the morphology and function of pyramidal cells more closely than traditional artificial neural networks do, and their conceptualisation of a neuron is far more powerful. However, this doesn't mean that it's better to model it this way in silico. What we really need to understand is the computational benefit of modelling these extra details. The new paper claims several advantages for their method over traditional ANNs.

We follow Numenta's work because we believe they have a number of good insights into the AGI problem. It's great to see this new theoretical work and to have a solid foundation for future publications.

Friday, 30 October 2015

This is the first of three articles detailing our latest thinking on general intelligence: A one-size-fits-all algorithm that, like people, is able to learn how to function effectively in almost any environment. This differs from most Artificial Intelligence (AI), which is designed by people for a specific purpose. This article will set out assumptions, principles, insights and design guidelines based on what we think we already know about general intelligence. It turns out that we can describe general intelligence in some detail, although not enough detail to actually build it...yet.

The second article will look at how these ideas fit existing computational neuroscience, which helps to refine and filter the design; and the third article will describe a (high-level) algorithm that is, at least, not contradictory to the design goals and biology already established.

As usual, our plans have got ahead of our implementation, so code will follow a few weeks (or months...) after the end of the series.

FIGURE 1: A hierarchy of units. Although units start out identically, they become differentiated as they learn from their unique input. The input to a unit depends on its position within the hierarchy and the state of the units connected to it. The hierarchy is conceptualized as having levels; the lowest levels are connected to sensors and motors. Higher levels are separated from sensors and motors by many intermediate units. The hierarchy may have a tree-like structure without cycles, but the number of units per level does not necessarily decrease as you move higher.

Architecture of General Intelligence

Let’s start with some fundamental assumptions and outline the structure of a system that has general intelligence characteristics.

It Exists

We assume there exists a “general intelligence algorithm” that is not irreducibly complex. That is, we don’t need to understand it in excruciating detail all at once; instead, we can break it down into simpler models that we can easily understand in isolation. This is not necessarily a reasonable assumption, but there is evidence for it.

Hierarchy

Our reading and experimentation suggest that hierarchical representation is critical for the types of information processing involved in general intelligence. Hierarchies are built from many units connected together in layers. Typically, only the lowest level of the hierarchy receives external input; other levels receive input from lower levels of the hierarchy instead. For more background on hierarchies, see our earlier posts. Hierarchy allows units in higher layers to model more complex and abstract features of the input, despite the fixed complexity of each unit. Hierarchy also allows units to cover all available input data, and allows combinations of features to be jointly represented within a reasonable memory limit. It’s a crucial concept.

Synchronization

Do we need synchronization between units? Synchronization can simplify sequence modelling in a hierarchy by restricting the number of possible permutations of events. However, synchronization between units may significantly hinder fast execution on parallel computing hardware, so this question is important. A point of confusion may be the difference between synchronization and timing / clock signals: we can have synchronization without clocks, and in any case there is biological evidence of timing signals within the brain (pathological conditions can arise without a sense of time). In conclusion, we assume that units should be functionally asynchronous, but might make use of clock signals.

Robustness

Your brain doesn’t completely stop working if you damage it. Robustness is a characteristic of a distributed system and one we should hope to emulate. Robustness applies not just to internal damage but external changes (i.e. it doesn't matter if your brain is wrong or the world has changed; either way you have to learn to cope).

Scalability

Adding more units should improve capability and performance. The algorithm must scale effectively without changes other than having more of the same units appended to the hierarchy. Note the specific criteria for how scalability is to be achieved (i.e. enlarge the hierarchy rather than enlarge the units). It is important to test for this feature to demonstrate the generality of the solution.

Generality

The same unit should work reasonably well for all types of input data, without preprocessing. Of course, tailored preprocessing could make it better, but it shouldn’t be essential.

Local interpretation

The unit must locally interpret all input. In real brains it isn’t plausible that neuron X evolved to target neuron Y precisely. Neurons develop dendrites and synapses with sources and targets that are carefully guided, but not to the extent of identifying specific cells amongst thousands of peers. Any algorithm that requires exact targeting or mapping of long-range connections is biologically implausible. Rather, units should locally select and interpret incoming signals using characteristics of the input. Since many AI methods require exact mapping between algorithm stages, this principle is actually quite discriminating.
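To make this principle concrete, here is a hypothetical sketch of local interpretation: rather than relying on exact long-range wiring, a unit ranks whatever signals reach it by a locally computable statistic and keeps the most informative ones. The channel names and the use of variance as the "informativeness" measure are our invention for illustration:

```python
# Hypothetical sketch: a unit "locally interprets" its input by selecting the
# k most informative incoming channels from whatever bundle of signals reaches
# it, rather than relying on exact long-range wiring.

def select_inputs(history, k):
    """history: dict of channel -> list of recent activity values.
    Keep the k channels with the highest activity variance (a crude
    local stand-in for 'informativeness')."""
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    ranked = sorted(history, key=lambda c: variance(history[c]), reverse=True)
    return set(ranked[:k])

history = {
    "a": [0, 1, 0, 1],   # varying: informative
    "b": [1, 1, 1, 1],   # constant: uninformative
    "c": [0, 0, 1, 1],   # varying: informative
}
print(select_inputs(history, 2))  # {'a', 'c'}
```

The point is that nothing in this selection depends on knowing *which* cell each channel came from - only on the statistics of the signal itself.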

Cellular plausibility

Similarly, we can validate designs by questioning whether they could develop by biologically plausible processes, such as cell migration or preferential affinity for specific signal coding or molecular markers. However, be aware that brain neurons rarely match the traditional integrate-and-fire model.

Key Insights

It’s surprising that in careers cumulatively spanning more than 25 years we (the authors) had very little idea how the methods we used every day could lead to general intelligence. It is only in the last 5 years that we have begun to research the particular sub-disciplines of AI that may lead us in that direction.

Today, those who have studied this area can talk in some detail about the nature of general intelligence without getting into specifics. Although we don’t yet have all the answers, the problem has become more approachable. For example, we’re really looking to understand a much simpler unit, not an entire brain holistically. Many complex systems can be easily understood when broken down in the right way, because we can selectively ignore detail that is irrelevant to the question at hand.

From our experience, we've developed some insights we want to share. Many of these insights were already known, and we just needed to find the right terminology. By sharing this terminology we can help others to find the right research to read.

We’re looking for a stackable building block, not the perfect monolith

We must find a unit that can be assembled into an arbitrarily large - yet still functional - structure. In fact, a similar feature was instrumental in the success of “deep” learning: Networks could suddenly be built up to arbitrary depths. Building a stackable block is surprisingly hard and astonishingly important.

We’re not looking to beat any specific benchmark

... but if we could do reasonably well at a wide range of benchmarks, that would be exciting. This is why the DeepMind Atari demos are so exciting; the same algorithm could succeed in very different problems.

Abstraction by accumulation of invariances

This insight comes from Hawkins’ work on Hierarchical Temporal Memory. He proposes that abstraction towards symbolic representation comes about incrementally, rather than as a single mapping process. Concepts accumulate invariances - such as appearance from different angles - until labels can correctly be associated with them. This neatly avoids the fearful “symbol grounding problem” from the early days of AI.

Biased Prediction and Selective Attention are both action selection

We believe that selective bias of predictions and expectations is responsible for both narrowing of the range of anticipated futures (selective ignorance of potential outcomes) and the mechanism by which motor actions are generated. A selective prediction of oneself performing an action is a great way to generate or “select” that action. Similarly, selective attention to external events affects the way data is perceived and in turn the way the agent will respond. Filtering data flow between hierarchy units implements both selective attention and action selection, if data flowing towards motors represents candidate futures including both self-actions and external consequences.

The importance of spatial structure in data

As you will see in later parts of this article series, the spatial structure of input data is actually quite important when training our latest algorithms. This is not true of many algorithms, especially in Machine Learning where each input scalar is often treated as an independent dimension. Note that we now believe spatial structure is important both in raw input and in data communicated between units. We’re not simply saying that external data structure is important to the algorithm - we’re claiming that simulated spatial structure is actually an essential part of algorithms for dynamically dividing a pool of resources between hierarchy units.

Sparse, Distributed Representations

We will be using Sparse, Distributed Representations (SDRs) to represent agent and world state. SDRs are binary data (i.e. all values are 1 or 0). SDRs are sparse, meaning that at any moment only a small fraction of the bits are 1 (active). The most complex feature to grasp is that SDRs are distributed: no individual bit uniquely represents anything. Instead, data features are jointly represented by sets of bits. SDRs are overcomplete representations - not all bits in a feature-set are required to “detect” a feature, which means that degrees of similarity can be expressed as if the data were continuous. These characteristics also make SDRs robust to noise - a few missing bits are unlikely to affect interpretation.
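These properties are easy to demonstrate. In the toy example below (the bit indices are invented for illustration), similarity between SDRs is simply the overlap of their active bits, so a noisy pattern still matches strongly and related concepts can share sub-features:

```python
# Minimal illustration of the SDR properties described above: sparse binary
# vectors whose similarity is measured by bit overlap, so partial matches
# still "detect" a feature and similarity is graded, as if continuous.

def overlap(a, b):
    return len(a & b)

cat   = {3, 17, 42, 80, 95}   # active-bit indices (sparse: 5 bits of, say, 2048)
noisy = {3, 17, 42, 80, 96}   # the same pattern with one bit corrupted
dog   = {3, 17, 55, 61, 77}   # a different pattern sharing some sub-features

print(overlap(cat, noisy))  # 4 -> still a strong match despite noise
print(overlap(cat, dog))    # 2 -> partially similar
```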

Predictive Coding

SDRs are a specific form of Sparse (Population) Coding where state is jointly represented by a set of active bits. Transforming data into a sparse representation is necessarily lossy and balances representational capacity against bit-density. The most promising sparse coding scheme we have identified is Predictive Coding, in which internal state is represented by prediction errors. PC has the benefit that errors are propagated rather than hidden in local states, and data dimensionality automatically reduces in proportion to its predictability. Perfect prediction implies that data is fully understood, and produces no output. A specific description of PC is given by Friston et al but a more general framework has been discussed in several papers by Rao, Ballard et al since about 1999. The latter is quite similar to the inter-region coding via temporal pooling described in the HTM Cortical Learning Algorithm.
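The core of the idea can be shown in a toy example. This is a sketch of the principle only - not Friston's model or Rao and Ballard's - using a trivial "last value persists" predictor that we chose for illustration:

```python
# A toy version of the predictive-coding idea: each stage transmits only its
# prediction error, so perfectly predicted input produces no output and data
# dimensionality shrinks in proportion to predictability.

def predictive_code(inputs, predict):
    """Return the error between each input and the prediction made
    from the previous input."""
    errors = []
    prev = 0
    for x in inputs:
        errors.append(x - predict(prev))
        prev = x
    return errors

# With a 'last value persists' predictor, a constant signal is fully
# predicted after the first step and the output falls to zero:
signal = [5, 5, 5, 5, 9, 9]
print(predictive_code(signal, predict=lambda prev: prev))
# [5, 0, 0, 0, 4, 0] -- only surprises are transmitted
```

Note how the output stream carries information only at the two moments of surprise; everywhere else, prediction has "explained away" the input.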

Generative Models

Training an SDR typically produces a Generative Model of its input. This means that the system encodes observed data in such a way that it can generate novel instances of observed data. In other words, the system can generate predictions of all inputs (with varying uncertainty) from an arbitrary internal state. This is a key prerequisite for a general intelligence that must simulate outcomes for planned novel action combinations.

Dimensionality Reduction

In constructing models, we will be looking to extract stable features and in doing so reduce the complexity of input data. This is known as dimensionality reduction, for which we can use algorithms such as auto-encoders. To cope with the vast number of possible permutations and combinations of input, an incredibly efficient incremental process of compression is required. So how can we detect stable features within data?
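One classic answer, sketched below under simplifying assumptions, is an incremental (online) learning rule. We use Oja's rule here purely as an illustration of stable-feature extraction on streaming data - our actual units are more complex than this:

```python
import numpy as np

# Sketch of incremental dimensionality reduction using Oja's rule, a classic
# online rule that extracts the principal component of its input stream.
# The data and learning rate are invented for illustration.

rng = np.random.default_rng(0)
# 2-D inputs whose variation is (mostly) along the direction (1, 1):
X = np.outer(rng.uniform(-1, 1, 2000), [1.0, 1.0]) + rng.normal(0, 0.05, (2000, 2))

w = np.array([1.0, 0.0])  # initial guess at the feature direction
for x in X:
    h = w @ x                    # 1-D code for a 2-D input
    w += 0.05 * h * (x - h * w)  # Oja's rule: Hebbian term with decay

w /= np.linalg.norm(w)
print(np.round(np.abs(w), 2))  # ~ [0.71, 0.71]: the stable feature direction
```

After one pass over the stream, the unit has compressed two input dimensions into one code, having discovered the stable feature (the direction of correlated variation) without any tutor signal.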

By the definition of general intelligence, we can’t possibly hope to provide a tutor-algorithm that provides the optimum model update for every input presented. It’s also worth noting that internal representations of the world and agent should be formed without consideration of the utility of the representations - in other words, internal models should be formed for completeness, generality and accuracy rather than task-fulfilment. This allows less abstract representations to become part of more abstract, long-term plans, despite lacking immediate value. It requires that we use unsupervised learning to build internal representations.

Hierarchical Planning & Execution

We don’t want to have to model the world twice: Once for understanding what’s happening, and again for planning & control. The same model should be used for both. This means we have to do planning & action selection within the single hierarchical model used for perception. It also makes sense, given that the agent’s own actions will help to explain sensor input (for example, turning your head will alter the images received in a predictable way). As explained earlier, we can generate plans by simply biasing “predictions” of our own behaviour towards actions with rewarding outcomes.

In the context of an intelligent agent, it is generally impossible to discover the “correct” set of actions or output for any given situation. There are many alternatives of varying quality; we don’t even insist on the best action but expect the agent to usually pick rewarding actions. In these scenarios, we will require a Reinforcement Learning system to model the quality of the actions considered by the agent. Since there is value in exploration, we may also expect the agent to occasionally pick suboptimal strategies, to learn new information.
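A minimal sketch of this role for reinforcement learning follows: the agent usually picks the most rewarding known action, but occasionally explores a suboptimal one to gather new information (epsilon-greedy selection). The actions and reward values are invented for illustration; this is not our planned architecture:

```python
import random

# Epsilon-greedy action selection with running reward estimates: usually
# exploit the best known action, occasionally explore a suboptimal one.

def select_action(q_values, epsilon, rng):
    """q_values: dict of action -> estimated reward."""
    if rng.random() < epsilon:
        return rng.choice(sorted(q_values))    # explore
    return max(q_values, key=q_values.get)     # exploit

def update(q_values, action, reward, lr=0.1):
    """Nudge the action's value estimate towards the observed reward."""
    q_values[action] += lr * (reward - q_values[action])

q = {"left": 0.0, "right": 0.0}
rng = random.Random(42)
for _ in range(200):
    a = select_action(q, epsilon=0.1, rng=rng)
    reward = 1.0 if a == "right" else 0.2   # 'right' happens to pay off more
    update(q, a, reward)

print(max(q, key=q.get))  # right
```

Exploitation alone would have locked onto "left" (the first action tried); the occasional exploratory choice is what lets the agent discover the better option.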

There is still a role for supervised learning within general intelligence. Specifically, during the execution of hierarchical control tasks we can describe both the ideal outcome and some metric of similarity between the actual and desired outcomes. Supervised learning is ideal for discovering actions with the agency to bring about desired results. Supervised Learning can tell us how best to execute a plan that was constructed in an Unsupervised Learning model and later selected by Reinforcement Learning.

Challenges Anticipated

The features and constraints already identified mean that we can expect some specific difficulties when creating our general intelligence.

Allocation of limited resources

This is an inherent problem when allocating a fixed pool of computational resources (such as memory) to a hierarchy of units. Often, resources per unit are fixed, ensuring that there are sufficient resources for the desired hierarchy structure. However, this is far less efficient than dynamically allocating resources to units to globally maximize performance. It also presupposes the ideal hierarchy structure is known, and not a function of the data. If the hierarchy structure is also dynamic, this becomes particularly difficult to manage because resources are being allocated at two scales simultaneously (resources → units and units → hierarchy structure), with constraints at both scales.

In our research we will initially adopt a fixed resource quota per hierarchy unit and a fixed branching factor for the hierarchy, rather than allowing the structure of the hierarchy and resources per unit to be determined by the data. This arrangement is the one most likely to work given a universal unit with constant parameters, as the number of inputs to each unit is constrained (due to the branching factor). It is interesting that the human cortex is a continuous sheet, and exhibits dynamic resource allocation as neuroplasticity - resources can be dynamically reassigned to working areas and sensors when others fail.

Signal Dilution

As data is transformed from raw input into a hierarchical model, information will be lost (not represented anywhere). This problem is certain to occur in all realistic tasks because input data will be modelled locally in each unit without global oversight over which data is useful. Given local resource constraints, this will be a lossy process. Moreover, we have also identified the need for units to identify patterns in the data and output a simplified signal for higher-order modelling by other units in the hierarchy (dimensionality reduction). Therefore, each unit will deliberately and necessarily lose data during these transformations. We will use techniques such as Predictive Coding to allow data that is not understood (i.e. not predictable) to flow through the system until it can be modelled accurately (predicted). However, it will still be important to characterise the failure modes in which important data is eliminated before it can be combined with other data that provides explanatory power.

Detached circuits within the hierarchy

Consider figure 2. Here we have a tree of hierarchy units. If the interactions between units are reciprocal (i.e. X outputs to Y and receives data from Y) there is a strong danger of small self-reinforcing circuits forming in the hierarchy. These feedback circuits exchange mutually complementary data between a pair or more units, causing them to ignore data from the rest of the hierarchy. In effect, the circuit becomes “detached” from the rest of the hierarchy. Since sensor data enters via leaf-units at the bottom of the hierarchy, everything above the detached circuit is also detached from the outside world and the system will cease to function satisfactorily.

In any hierarchy with reciprocal connections, this problem is very likely to occur, and disastrous when it does. In Belief Propagation, an inference algorithm for graphical models, this problem manifests as “double counting” and is avoided by nodes carefully excluding their own evidence when it is returned to them.
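The double-counting fix can be sketched in a few lines. This toy log-evidence accumulation is hypothetical (real Belief Propagation messages are distributions, not scalars), but it shows the essential discipline - a unit never sends back what it heard from the recipient:

```python
# Sketch of the 'double counting' fix: when unit A sends evidence to unit B,
# it excludes whatever it previously received *from* B, so B never sees its
# own evidence reflected back and no self-reinforcing circuit can form.

def message(sender, recipient, local_evidence, inbox):
    """Evidence sent from sender to recipient = sender's own evidence plus
    everything the sender heard from neighbours OTHER than the recipient."""
    return local_evidence[sender] + sum(
        v for src, v in inbox[sender].items() if src != recipient
    )

local_evidence = {"X": 1.0, "Y": 2.0}
inbox = {"X": {"Y": 2.0}, "Y": {"X": 1.0}}  # what each unit has heard so far

# Y's message to X omits what Y heard from X, so no self-reinforcement:
print(message("Y", "X", local_evidence, inbox))  # 2.0, not 3.0
```

Without the exclusion, X's evidence would echo back to X amplified on every exchange - exactly the detached-circuit failure described above.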

FIGURE 2: Detached circuits within the hierarchy. Units X and Y have formed a mutually reinforcing circuit that ignores all data from other parts of the hierarchy. By doing so, they have ceased to model the external world and have divided the hierarchy into separate components.

Dilution of executive influence

A generally-intelligent agent needs to have the ability to execute abstract, high-level plans as easily as primitive, immediate actions. As people we often conceive plans that may take minutes, hours, days or even longer to complete. How is execution of lengthy plans achieved in a hierarchical system?

If abstract concepts exist only in higher levels of the hierarchy, they need to control large subtrees of the hierarchy over long periods of time to be successfully executed. However, if each hierarchy unit is independent, how is this control to be achieved? If higher units do not effectively subsume lower ones, executive influence will dilute as plans are incrementally re-interpreted from abstract to concrete (see figure 3). Ideally, abstract units will have quite specific control over concrete units. However, it is impractical for abstract units to have the complexity to "micro-manage" an entire tree of concrete units.

FIGURE 3: Dilution of executive influence. A high-level unit within the hierarchy wishes to execute a plan; the plan must be translated towards the most concrete units to be performed. However, each translation and re-interpretation risks losing details of the original intent which cannot be fully represented in the lower levels. Somehow, executive influence must be maintained down through an arbitrarily deep hierarchy.

Let’s define “agency” as the ability to influence or control outcomes. Lacking the ability to cause a particular outcome is a lack of agency over the desired and actual outcomes. By making each hierarchy unit responsible for the execution of goals defined in the hierarchy level immediately above, we indirectly maximise the agency of more abstract units. Without this arrangement, more abstract units would have little or no agency at all.

Figure 4 shows what happens when an abstract plan gets “lost in translation” to concrete form. I walked up to my car and pulled my keys from my pocket. The car key is on a ring with many others, but it’s much bigger and can’t be mistaken by touch. It can only be mistaken if you don’t care about the differences.

In this case, when I got to the car door I tried to unlock it with the house key! I only stopped when the key wouldn't fit in the keyhole. Strangely, all low-level mechanical actions were performed skillfully, but high level knowledge (which key) was lost. Although the plan was put in motion, it was not successful in achieving the goal.

Obviously this is just a hypothesis about why this type of error happens. What’s surprising is that it isn't more common. Can you think of any examples?

FIGURE 4: Abstract plan translation failure: Picking the wrong key but skilfully trying it in the lock. This may be an example of abstract plans being carried out, but losing relevant details while being transformed into concrete motor actions by a hierarchy of units.

In our model, planning and action selection occur as biased prediction. There is an inherent conflict between accurate prediction and bias. Attempting to bias predictions of events beyond your control leads to unexpected failure, which is even worse than expected failure.

The alternative is to predict accurately, but often the better outcome is the less likely one. There must be a mechanism to increase the probability of low-frequency events where the agent has agency over the real-world outcome.

Where possible, lower units must separate learning to predict and trying to use that learning to satisfy higher units’ objectives. Units should seek to maximise the probability of goal outcomes, given an accurate estimate of the state of the local unit as prior knowledge. But units should not become blind to objective reality in the process.

Conflict resolution

General intelligence must be able to function effectively in novel situations. Modelling and prediction must work in the first instance, without time for re-learning. This means that existing knowledge must be combined effectively to extrapolate to a novel situation.

We also want the general intelligence to spontaneously create novel combinations of behaviour as a way to innovate and discover new ways to do things. Since we assume that behaviour is generated by filtering predictions, we are really saying we need to be able to predict (simulate) accurately when extrapolating combinations of existing models to new situations. So we also need conflict resolution for non-physical or non-action predictions. The agent needs a clear and decisive vision of the future, even when simulating outcomes it has never experienced.

The downside of all this creativity is that there’s really no way to tell whether these combinations are valid. Often they will be, but not always. For example, you can’t touch two objects that are far apart at the same time. When incompatible, we need a way to resolve the conflict.

Evaluating alternative plans is most easily accomplished as a centralised task - you have to bring all the potential alternatives together where they can be compared. This is because we can only assign relative rewards to each alternative; it is impossible to calculate meaningful absolute rewards for the experiences of an intelligent agent. It is also important to place all plans on a level playing-field regardless of the level of abstraction; therefore abstract plans should be competing against more concrete ones and vice-versa.

Therefore, unlike most of the pieces we've described, action selection should be a centralised activity rather than a distributed one.
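A minimal sketch of centralised selection follows. The plan names, abstraction levels and reward values are invented for illustration; the point is only that candidates from every level compete in one place on relative reward:

```python
# Centralised action selection as described above: candidate plans from all
# levels of abstraction are brought to one place and compared by *relative*
# reward, with abstract and concrete plans on a level playing-field.

candidates = [
    {"plan": "grasp cup",     "level": "concrete", "reward": 0.4},
    {"plan": "make coffee",   "level": "abstract", "reward": 0.9},
    {"plan": "twitch finger", "level": "concrete", "reward": 0.1},
]

def select_plan(candidates):
    # All candidates compete on the same footing regardless of level:
    return max(candidates, key=lambda c: c["reward"])

print(select_plan(candidates)["plan"])  # make coffee
```

Note that only the ordering of the rewards matters here, which is consistent with the claim that absolute reward values are meaningless for an intelligent agent.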

Parameter Selection

In a hierarchical system the input to “higher” units will be determined by modelling in “lower” units and interactions with the world. The agent-world system will develop in markedly different ways each time. It will take an unknown amount of time for stable modelling to emerge, first in the lower units and then moving higher in the hierarchy.

As a result of all these factors it will be very difficult to pick suitable values for time-constants and other parameters that control the learning processes in each unit, due to compounded uncertainty about lower units’ input. Instead, we must allow recent input to each unit to determine suitable values for parameters. This is online learning. Some parameters cannot be automatically adjusted in response to data. For these, to have any hope of debugging a general intelligence, a fixed parameter configuration must work for all units in all circumstances. This constraint will limit the use of some existing algorithms.
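As a small illustration of online learning of a parameter from data (the class and decay constant are our invention, not a component of our design): instead of hand-picking a fixed input scale for every unit, each unit can track an exponential moving average of its own input statistics.

```python
# Sketch of online parameter learning: a unit re-centres its input using a
# mean learned from recent data, rather than a hand-tuned constant.

class OnlineScaler:
    def __init__(self, decay=0.99):
        self.decay = decay
        self.mean = 0.0

    def update(self, x):
        # Exponential moving average of the input:
        self.mean = self.decay * self.mean + (1 - self.decay) * x
        return x - self.mean  # input re-centred by a learned, not fixed, mean

scaler = OnlineScaler()
for x in [10.0] * 1000:   # a unit whose input happens to settle around 10
    out = scaler.update(x)

print(round(scaler.mean, 1))  # the parameter adapted to the data
```

The same unit, dropped into a different position in the hierarchy, would converge to different parameter values without any per-unit tuning.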

Summary

That wraps up our theoretical overview of what we think a general intelligence algorithm must look like. The next article in this series will explain what we've learnt from biology’s implementation of general intelligence - ourselves! The final article will describe how we hope to build an algorithm that satisfies all these requirements.

Thursday, 15 October 2015

We have found a fantastic resource, part of EPFL's Blue Brain Project, that clearly and interactively maps out interactions between neocortical neurons. The data comes from their attempts to simulate a piece of cortex down to the level of biologically-realistic neurons.

Erik Laukien is back with a demo of Sparse, Distributed Representation with Reinforcement Learning.

This topic is of intense interest to us, although the problem is quite a simple one. SDRs are a natural fit with Reinforcement Learning because bits jointly represent a state. If you associate each bit-pattern with a reward value, it is easy to determine the optimum action.

However, since this is an enormous state-space, it is not practical to do so. Instead, one might associate only the observed bit patterns with reward, or cluster them somehow to reduce the number of reward values that must be stored. Anyway, these are thoughts for another day.
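That "observed patterns only" idea can be sketched quickly. This is a hypothetical toy (Laukien's demo differs in detail): a running-average reward is stored per observed SDR bit pattern, so memory grows with experience rather than with the size of the state-space:

```python
# Sketch: associate only *observed* SDR bit patterns with a running average
# reward, instead of enumerating the whole state-space.

def remember(reward_memory, sdr, reward, lr=0.2):
    key = frozenset(sdr)                     # an observed bit pattern
    old = reward_memory.get(key, 0.0)
    reward_memory[key] = old + lr * (reward - old)

reward_memory = {}
for _ in range(20):
    remember(reward_memory, {3, 17, 42}, reward=1.0)   # a rewarding state
    remember(reward_memory, {5, 99, 101}, reward=0.0)  # a neutral state

best = max(reward_memory, key=reward_memory.get)
print(sorted(best))  # [3, 17, 42]
```

Clustering similar patterns (e.g. by bit overlap) would compress this memory further, which is the second option mentioned above.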

Friday, 9 October 2015

An interesting article by Gerard Rinkus comparing the qualities of sparse distributed representation and quantum computing. In effect, he argues that because distributed representations can simultaneously represent multiple states, you get the same effect as a quantum superposition.

The article was originally titled "sparse distributed coding via quantum computing" but I think that gets the key conclusions backwards (maybe I'm wrong?).

"I believe that SDR constitutes a classical instantiation of quantum superposition and that switching from localist representations to SDR, which entails no new, esoteric technology, is the key to achieving quantum computation in a single-processor, classical (Von Neumann) computer."

I think that goes a bit too far. Yes, it would seem to have some of the same advantages as quantum computing, with the additional benefit of fitting classical computing technology that is mass manufactured at low cost.

However, this may be moot, now that true quantum computing looks set to become practical: