Introduction

In this article (which can be found here) the authors try to reproduce in a spiking recurrent neural network key features of the spontaneous activity and neural variability observed physiologically in humans and animals. The authors focus on four features found in the experimental literature:

Trial-to-trial variability […] decreases following the onset of a sensory stimulus… [1]

Spontaneous activity outlines evoked sensory activity [2,3] — meaning that once a (metric) space of neural activity has been defined (for instance [0Hz,50Hz]^N for the firing rates of a population of N neurons), the spontaneous activity falls into similar regions as evoked activity.

Similarity between spontaneous and evoked activity increases during development [4,5]

Spontaneous activity (prior to stimulus onset) can be used to predict perceptual decisions (e.g. a perceptual decision can mean, in this context, classifying an ambiguous Face-Vase image into one of two categories: Face or Vase) [6,7].

SORN networks

The artificial neural network used in the study is called S.O.R.N, which stands for Self-Organizing Recurrent Neural network. Earlier papers from the group described this type of network and its capabilities in detail [8,9]. In particular these networks are more efficient than reservoir computing networks in sequential learning tasks [8], and in learning artificial grammars [9]. Open source (Python) code for simulating this network is available here.

Network units

In the present study, the network is composed of 200 excitatory units and 40 inhibitory units. Each unit computes a weighted sum of its inputs, compares it to a dynamic threshold T, and applies a Heaviside step function Θ to produce a binary output. Below are the update equations for the excitatory population x and the inhibitory population y; the subscripts of the weights W stand for: E=Excitatory, I=Inhibitory, U=External inputs.
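The unit update described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `sorn_step`, the argument names, and the choice to drive the inhibitory units from the previous excitatory state are all assumptions made for the example.

```python
import numpy as np

def sorn_step(x, y, u, W_EE, W_EI, W_IE, W_EU, T_E, T_I):
    """One discrete-time update of the binary excitatory (x) and inhibitory (y) units.

    All names are hypothetical; W_XY denotes weights onto population X from Y.
    """
    # Excitatory drive: recurrent excitation minus inhibition plus external input.
    drive_E = W_EE @ x - W_EI @ y + W_EU @ u
    # Heaviside step: a unit fires (1) when its drive exceeds its threshold.
    x_new = (drive_E - T_E > 0).astype(int)
    # Assumption: inhibitory units see the previous excitatory state x.
    y_new = (W_IE @ x - T_I > 0).astype(int)
    return x_new, y_new
```

Because the update contains no random term, the same state and input always produce the same next state.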

Plasticity rules

The excitatory units of the network obey three plasticity rules:

(discrete-time) STDP: the weight from neuron i to j is increased if neuron j spikes right after neuron i and decreased if neuron j spikes right before neuron i. The authors claim that this rule is the main learning mechanism in the network.

Synaptic Normalization (SN): all incoming connections are scaled at each time step, for each neuron, so that they sum to 1. The same holds for all outgoing connections from a neuron. This rule controls the range of the weights and seems to imply that the weight matrix is doubly stochastic.

Intrinsic Plasticity (IP): the threshold of each neuron varies with time in order for the average firing rate of the neuron to track a target firing rate. Those target firing rates are uniformly sampled from a small neighborhood around a fixed value of 0.1 (arbitrary units). This rule ensures stable average activity of the network. Both IP and SN are labeled homeostatic mechanisms by the authors.
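The three rules above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the learning rates `eta`, and the restriction of STDP to existing synapses are hypothetical, and the normalization shown rescales incoming weights only (alternating it with a column normalization would be one way to approach the doubly stochastic matrix mentioned above).

```python
import numpy as np

def stdp(W, x_prev, x_now, eta=0.001):
    """Discrete-time STDP. W[j, i] is the weight from unit i to unit j."""
    # Potentiate i->j when i fired at t-1 and j fires at t; depress the reverse order.
    dW = eta * (np.outer(x_now, x_prev) - np.outer(x_prev, x_now))
    W = W + dW * (W > 0)          # assumption: only existing synapses change
    return np.clip(W, 0.0, None)  # keep excitatory weights non-negative

def synaptic_normalization(W):
    """Rescale each unit's incoming weights (rows of W) so that they sum to 1."""
    row_sums = W.sum(axis=1, keepdims=True)
    return np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)

def intrinsic_plasticity(T, x_now, target_rate, eta=0.001):
    """Raise the threshold of units that just fired, lower it for silent units."""
    return T + eta * (x_now - target_rate)
```

With a target rate of 0.1, a unit's threshold is stationary on average only when it fires on about 10% of the time steps.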

Network stimulation

A stimulus for the network is defined as an activation of 10 excitatory units. In the article, stimuli are labeled by letters. Thus, the network receives the stimulus A at time n if the 10 excitatory units corresponding to A receive a 1 as an input with input weight 0.5 at time n. Stimulus B would correspond to another subset of 10 excitatory units. In general the sub-populations of A and B may overlap, but in the inference task presented below, they are chosen to be disjoint as stimulus ambiguity is an independent variable.

Input weights are always 0.5.

Sequence learning task

A first set of experiments involving the SORN network aimed at reproducing facts 2-3 from the Introduction above.

Task timeline

Results

(corresponding to fact 2 from Intro) When both evoked and spontaneous activity are projected onto the first three principal components of the evoked activity, the authors notice two things: a) evoked activity forms 4 distinguishable clusters which ‘represent’ the letter position in the sequence. That is, the letters A and E fall in one cluster, B and F in another one, etc. b) the spontaneous activity ‘hops’ between clusters in an order consistent with the positioning 1->2->3->4->1-> etc. The authors also compared the evoked activity, the spontaneous activity and the shuffled spontaneous activity in a 2D projection of the high-dimensional space of population activity, obtained by multidimensional scaling (MDS). Here, the authors observe a high overlap of evoked and spontaneous activity, and a high separation from shuffled spontaneous activity.

(corresponding to fact 3 from Intro) Here, the authors modified the timeline presented above in order to observe the effect of learning on the KL-divergence between the distributions of evoked and spontaneous activity. Firstly, different networks were trained with the same stimuli as above but for different training times; and with p=0.5. Secondly, the evoked activity was observed with two types of stimuli: a) the same stimuli as during training (natural condition) b) sequences EDCBA and FGH which were new to the network (control condition). Spontaneous activity was subsequently observed for each network. The results are shown in Fig. 1:

Fig.1: “Spontaneous activity becomes more similar to evoked activity during learning”, and does more so for familiar stimuli than for new stimuli.

Inference task

This task was designed by the authors in order to reproduce facts 1 and 4 from the Introduction above.

Task timeline

Results

Neural variability (measured by the Fano Factor) decreases at stimulus onset, and it decreases more for stimuli that were presented frequently during training, compared to those presented rarely.

The network’s decision statistics can be modeled by sampling-based inference. See below.
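For reference, the Fano factor used in the first result is the variance of the spike count across trials divided by its mean. A minimal sketch, with a hypothetical function name and made-up counts used purely for illustration:

```python
from statistics import mean, variance

def fano_factor(spike_counts):
    """Variance-to-mean ratio of spike counts across repeated trials."""
    m = mean(spike_counts)
    if m == 0:
        raise ValueError("Fano factor is undefined for a zero mean count")
    # Sample variance (n - 1 in the denominator) across trials.
    return variance(spike_counts) / m

# Illustrative counts only: identical trials give 0, spread-out trials give more.
print(fano_factor([3, 3, 3, 3]))
print(fano_factor([2, 4, 4, 6]))
```

A drop in this ratio at stimulus onset means the responses become more reproducible from trial to trial.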

Sampling-based inference

The network is assumed to perform Bayesian inference in the following way:

Only two stimuli can be presented, A and B (short version of AXXX___, BXXX___ above). Their relative frequency of presentation during the training phase of the network is the prior probability distribution: P(A), P(B)=1-P(A).

The neurons from the excitatory population that are stimulated at the presentation of A are called the A-population and similarly for the B-population, which is disjoint from the A-population. These populations play the role of sensory neurons (conditionally independent, given the stimulus) that collect evidence. A neuron from the A-population is meant to (if the encoding were perfect) spike at the presentation of A and not spike at the presentation of B. Note that the authors explain why the encoding is imperfect: the neuron’s threshold and inhibitory inputs depend on the history of the network.

Each sensory neuron can:

correctly spike on presentation of the stimulus that it is meant to code

correctly remain silent on the absence of the stimulus that it is meant to code

incorrectly spike on the absence of the stimulus that it is meant to code

incorrectly remain silent on presentation of the stimulus that it is meant to code

The four probabilities corresponding to the four possible events described just above fully characterize the likelihood functions of the network. Note that only two of those probabilities — θ_1 corresponding to the first bullet point above and θ_0 corresponding to the second bullet point — are sufficient to recover the other two, as complementary events sum to 1. These two probabilities together with the stimulus prior P(A) are the only three free parameters of the model. The parameter P(A) is manipulated in the experiments whereas θ_0 and θ_1 are fitted to the network’s realizations (see Fig. 2 and 3)

In the experiment, ambiguity of the stimulus is controlled by the fraction f_A of neurons from the A-population that are actually stimulated at presentation of A, the missing A-population being replaced by a portion of the B-population of size 1-f_A.

Given the assumptions above, the number n_a of A-neurons active at the time of stimulus (or non-stimulus) presentation is distributed as the sum of two binomial random variables, and similarly for the number n_b of active B-neurons. These numbers represent the evidence collected by each population.

The authors can compute from there the expected posterior distribution over the stimulus, <P(A|n_a,n_b)> and <P(B|n_a,n_b)>=<1-P(A|n_a,n_b)>, where the average is taken over the possible values of n_a and n_b. After the fitting of θ_0 and θ_1, these average posterior distributions were explicitly computed and represented by the gray dotted lines in Fig. 2 and 3 for different values of f_A.

The authors assume that each time the network answers A or B, it is sampling from the posterior distribution. That is, the network will answer A with probability P(A|n_a,n_b), where n_a, n_b now result from a single realization. This is why the authors, and others before them, call this decision strategy sampling-based inference.
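The steps above can be sketched as a small simulation. This is a hedged illustration of the inference scheme as summarized here, not the authors' code: the function names, the population size `size`, and the assumption that the observer's likelihoods are those of unambiguous stimuli are all hypothetical choices made for the example.

```python
from math import comb
import random

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def posterior_A(n_a, n_b, size, prior_A, theta_1, theta_0):
    """P(A | n_a, n_b) for two disjoint populations of `size` neurons each."""
    false_alarm = 1.0 - theta_0  # P(spike | own stimulus absent)
    like_A = binom_pmf(n_a, size, theta_1) * binom_pmf(n_b, size, false_alarm)
    like_B = binom_pmf(n_a, size, false_alarm) * binom_pmf(n_b, size, theta_1)
    evidence = prior_A * like_A + (1 - prior_A) * like_B
    return prior_A * like_A / evidence

def sample_counts(f_A, size, theta_1, theta_0, rng):
    """Draw (n_a, n_b) when a fraction f_A of A and 1 - f_A of B is stimulated."""
    def draw(n, p):
        return sum(rng.random() < p for _ in range(n))
    k_a = round(f_A * size)  # stimulated A-neurons
    k_b = size - k_a         # stimulated B-neurons
    # Each count is a sum of two binomials: stimulated and unstimulated neurons.
    n_a = draw(k_a, theta_1) + draw(size - k_a, 1 - theta_0)
    n_b = draw(k_b, theta_1) + draw(size - k_b, 1 - theta_0)
    return n_a, n_b

def decide(n_a, n_b, size, prior_A, theta_1, theta_0, rng):
    """Sampling-based decision: answer A with probability P(A | n_a, n_b)."""
    p = posterior_A(n_a, n_b, size, prior_A, theta_1, theta_0)
    return "A" if rng.random() < p else "B"
```

Averaging `decide` over many draws of `sample_counts` for a fixed `f_A` estimates the expected posterior that the model curves in the figures represent.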

Fig. 2: Here p=P(A)=1/3 was fixed. The green and blue curves always add up to 1 and represent an average over 20 experiments of the network’s responses. The dashed gray lines represent the averaged theoretical posterior distributions over the stimulus, after fitting the parameters θ_0 and θ_1 to minimize the mean squared error to the colored curves.

Fig. 3: this figure shows the agreement of the network behavior with the sampling-based inference model across several prior distributions (x-axis). The values on the y-axis correspond to the height of the point of intersection of the two curves in Fig. 2.

The authors argue that their network might be performing Bayesian inference since the dashed line above (optimal decision) is close to the network’s performance. One must not forget that the authors effectively fitted the likelihoods of their model (θ_0 and θ_1) to the data.

On noise…

The authors insist on the fact that noise may not play the important role that we tend to attribute to it in the brain. They argue that several qualitative and quantitative features of neural variability and spontaneous activity were reproduced by the SORN network, which is completely deterministic.

…we propose that the common practice to make heavy use of noise in neural simulations should not be taken as the gold standard.

If the SORN network is presented with input A at some time step, it might produce some output. But if it is stimulated again with A at a later time, the output might be different. This is because the internal state of the network might have changed.

At the end of their discussion section, the authors attempt to formalize a theory. It goes as follows:

Define W to be the state of the world (to be understood as the environment from which stimuli are drawn), S the brain state of the animal (or human) and R the neural response.

Efficient sensory encoding should maximize the mutual information between the neural response and the world state, conditioned on the animal’s current brain state: MI(W;R|S).

This quantity can be rewritten: H(R|S)-H(R|W,S).

Maximizing H(R|S) has the meaning of neural responses keeping a high variability, given that S is known. The authors do not insist too much on this point but say:

Maximizing the first term also implies maximizing a lower bound of H(R), which means that the brain should make use of a maximally rich and diverse set of responses that, to an observer who does not have access to the true state of the world and the full brain state, should look like noise.

Finally, minimizing H(R|W,S) amounts to making R a deterministic function of W and S. In other words, making the neural response a deterministic function of the current brain and world states. This is exactly the case for the SORN network.
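The decomposition used in the argument above can be written out explicitly. The block below is a short worked version for discrete responses R; the function symbol g is introduced here only for illustration.

```latex
\begin{align*}
I(W;R \mid S) &= H(R \mid S) - H(R \mid W,S), \\
0 \le H(R \mid W,S), &\qquad H(R \mid S) \le H(R), \\
R = g(W,S) \;\Rightarrow\; H(R \mid W,S) = 0 &\;\Rightarrow\; I(W;R \mid S) = H(R \mid S).
\end{align*}
```

The second line is why maximizing H(R|S) raises a lower bound on H(R), and the third line is the deterministic case that the SORN network realizes.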

As a final personal comment, let us note that the authors themselves mention the efficacy of stochastic modeling in theoretical Neuroscience. Furthermore, it is well known in Mathematics that deterministic chaotic systems and stochastic systems can be statistically indistinguishable. Hence, successes (and failures) of stochastic modeling can continue to guide its use.

The general question the authors try to address is whether the firing rates of a population before and after training can be used to infer the learning rule that led to the changes in the patterns of activity. This is a difficult inverse problem and requires a number of assumptions, as outlined below. The paper is motivated (and supported) by several experimental observations, including that training can lead to higher activity in a small group of cells and reduced activity in most of the remaining cells in a population.

The authors use as motivation and analyze data from monkeys performing two different tasks: a passive viewing task (monkeys view various visual stimuli and make no response), and a dimming-detection task (similar to the passive viewing task except that monkeys were required to detect and indicate a subtle decrease in luminance of the stimulus by releasing a manual lever). Neuronal responses (firing rates) to novel and familiar stimuli in inferior temporal cortex (ITC) were recorded during the tasks. Repeated presentations of an initially novel stimulus lead to a gradual decrease of responses to the stimulus in a substantial fraction of recorded neurons. The response to familiar stimuli is typically more selective, with lower average firing rates, but higher maximum responses in putative excitatory neurons. This indicates an overall decrease in synaptic weights, except for the maximally responsive cells with increased input synaptic weights.

What type of learning rules could explain these data? The authors assume that only recurrent weights are changed by training, and consider a rate-based plasticity rule. The firing rate of neuron i is defined by its inputs via a transfer function (f-I curve),

Training changes the recurrent synaptic weights, so that we can write

The changes in synaptic strengths lead to changes in synaptic inputs to neurons, and consequently to changes in their firing rates,

The changes in inputs can be approximated by

They make the assumption that the learning rule is separable, i.e. the weight change is the product of two functions, one depending solely on the presynaptic rate and the other solely on the postsynaptic rate. Under these assumptions the dependence of the learning rule on postsynaptic firing rates is

Thus the dependence of the learning rule on the postsynaptic firing rate can be obtained from the input changes by subtracting a constant offset, and rescaling its magnitude. Note that this requires several further assumptions – importantly that everything but the change in input current is independent of i.
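The chain of reasoning in the preceding paragraphs can be sketched as follows. This is a reconstruction under stated assumptions, not the paper's own equations: the symbols Φ (transfer function), h_i (total input), f and g (post- and presynaptic factors), and the constants A and B are names chosen here for illustration.

```latex
\begin{align*}
r_i &= \Phi(h_i), \qquad h_i = \sum_j W_{ij}\, r_j + I_i^{\mathrm{ext}}, \\
\Delta W_{ij} &= f(r_i)\, g(r_j), \\
\Delta h_i &\approx \sum_j \Delta W_{ij}\, r_j + \sum_j W_{ij}\, \Delta r_j
  = f(r_i) \underbrace{\sum_j g(r_j)\, r_j}_{A}
  + \underbrace{\sum_j W_{ij}\, \Delta r_j}_{B}, \\
f(r_i) &\approx \frac{\Delta h_i - B}{A}.
\end{align*}
```

The last step is valid only if A and B do not depend on i, which is the independence assumption noted above: the postsynaptic dependence is then the input change minus a constant offset, rescaled by a constant.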

The assumptions that are made to deduce the learning rule are summarized in this figure.

Using the method illustrated above, they investigated the effect of visual experience in inferior temporal cortex (ITC) using neurophysiological data obtained from two different laboratories in monkeys performing two different tasks. The distributions of firing rates for novel and familiar stimuli indicated an overall negative change in input currents after learning. The authors applied the analysis outlined above separately to putative excitatory and inhibitory cells. Excitatory neurons showed negative changes when postsynaptic firing rate was low, but positive changes when it was sufficiently high. Inhibitory neurons showed negative input changes at all firing rates.

In the experimental data obtained during the passive viewing task, they further analyzed the learning effects on input currents in individual neurons. In excitatory neurons, they found diverse patterns of input changes that can be classified into three categories: neurons showing only negative changes, neurons showing negative changes for low firing rates and positive changes for high firing rates, and neurons showing only positive changes. Averaging the input change curves of excitatory neurons showing both negative and positive changes led to depression for low firing rates and potentiation for high firing rates. For neurons showing both negative and positive changes, they defined a threshold θ as the postsynaptic firing rate where input changes become positive. Denote by θ′ the normalized threshold, obtained by subtracting the mean rate from θ and dividing by the s.d. of the rate. The threshold θ was strongly correlated with both the mean and the s.d. of postsynaptic firing rates, but this correlation disappeared for the normalized threshold θ′. This implies that the threshold depends on neuronal activity, reminiscent of the BCM learning rule. The threshold observed in ITC neurons is around 1.5 s.d. above the mean firing rate, so that a majority of stimuli lead to depression while few (the ones with the strongest responses) lead to potentiation.

They next addressed whether a network model with the type of learning rule inferred from data can maintain stable learning dynamics as it is subjected to multiple novel stimuli and whether the changes of activity patterns with learning observed in the experiment can be reproduced.

When neurons were divided into subgroups by percentile ranking, the firing rates of most groups decreased with learning in both excitatory and inhibitory neurons. Only the firing rates of the excitatory neurons in the highest percentile group increased. These changes led to increased selectivity and increased sparseness for learned stimuli, in accordance with experimental data.

This paper provides an approach for deducing the learning rule from experimental data, although it requires several assumptions. Some of these assumptions may be too strong, and difficult to verify, such as the separable function of the learning rule, the rank preservation of stimuli with learning, and changes only in the recurrent weights. However, the inferred learning rule agrees with those that have been reported in other experiments. It also resembles the widely used BCM rule (2), offering further support that this approach captures at least the qualitative features of the learning rule correctly.

Comment by Krešimir Josić: There are a number of assumptions behind this approach, which are generally explained well. However, the Gaussianity of the distribution of input currents is unclear to me. In Fig. 2b,c, this assumption is used to back out the f-I curve from responses to familiar stimuli. However, in Fig. 2f the distribution of input currents computed for familiar stimuli looks non-Gaussian. How can the two be reconciled?

Animals make decisions based on both local sensory information and social information from their neighbors (Couzin 2009). One common goal of animals’ decisions is to choose the environmental locations that are best for foraging. Decision making by individuals that collect exclusively non-social information (e.g., availability of food, threats by predators, or shelter) has been modeled extensively using both heuristic and Bayesian inference frameworks (Bogacz et al 2006). In some instances, optimal decision strategies can be identified by applying Bayesian inference methods to relate accumulated evidence to the underlying truth of the environment. However, principled models of decision making using both social and non-social information have yet to be fully developed. Most collective decision making models tend to be heuristic equations that are then fit to data, ignoring essential components of probabilistic inference.

This paper aims to develop a probabilistic model of decision making in individuals using both local information and knowledge of their neighbors’ behaviors. For the majority of the paper, the authors focus on decision making between two options. This is meant to model recent experiments on stickleback foraging between two feeding sites (Ward et al 2008). The framework can be extended to a variety of contexts, including more than two options as well as the history-dependence of group decisions, which the authors consider. They start with the assumption that each animal computes the probability that option Y is the “best” one (e.g., safest or highest yielding) given non-social information C and social information B.

Bayes’ theorem can then be used to compute

A major insight of the paper is then that by dividing by the numerator, the effects of non-social information can be separated from social information

where is the likelihood ratio associated with non-social information and contains all the social information.
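For two exhaustive options, the Bayes step and the separation described above can be written out as follows. This is a sketch under stated assumptions: the label X for the alternative option and the symbols a and S for the non-social and social ratios are names chosen here, since the original equations did not survive in this text.

```latex
\begin{align*}
P(Y \mid C,B)
  &= \frac{P(B \mid Y,C)\, P(Y \mid C)}
          {P(B \mid X,C)\, P(X \mid C) + P(B \mid Y,C)\, P(Y \mid C)} \\
  &= \frac{1}{1 + a\,S},
  \qquad a = \frac{P(X \mid C)}{P(Y \mid C)},
  \qquad S = \frac{P(B \mid X,C)}{P(B \mid Y,C)}.
\end{align*}
```

Dividing numerator and denominator by the numerator is what isolates the two ratios: a depends only on non-social information, while the behaviors B enter only through S.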

Now, one issue with the social information term is that it is comprised of behavioral information from all the other animals, and these behaviors are likely to be correlated. However, the authors assume the focal individual ignores these correlations for simplicity. It would be interesting to examine what is missed by making this independence assumption. In general, independence assumptions allow joint densities to be split into products , so assuming then

For the majority of the paper, the authors focus on three specific behaviors: , choosing site ; , choosing site ; and , remaining undecided. This means that the main parameters of the model are the likelihood ratios

indicating how informative each behavior is about the quality of a particular site. Since the model has such a low number of parameters, it is straightforward to fit it to existing data.

The authors specifically fit it to laboratory data on sticklebacks performing a binary choice task (Ward et al, 2008), where each option is equally good. In this case, the probability of a fish choosing site simplifies considerably:

,

so there is only one free parameter , which controls the strength of social interaction. For large values of , the population will very quickly align itself with one of the two options, since animals make choices probabilistically based on . Asymmetries are introduced in the experimental data by placing replica fish at one or both of the possible sites, and this initial condition influences the probability of the remaining fish’s selection. Remarkably, the single parameter model fits data quite well, as shown in the above figure.

From here, the paper goes on to explore more nuances in the model, such as the case where one site is noticeably better than another or when some replica fish are more attractive than others. All these effects can be captured, and the model fits the data set from Ward et al (2008) fairly well. In general, social interactions in the model set up a bistable system that tends to one of two steady states where almost all animals choose one of the two sites. This should not be surprising, since the function has a very familiar sigmoidal form often taken as an interaction function in neural network models (Wilson and Cowan, 1972) and ferromagnetic systems. Again, these models tend to admit multistable solutions.

One issue that the authors explore near the end of the paper is the effect of dependencies on the ultimate distribution of choice probabilities. Here, animals making subsequent decisions take into account the history of the preceding choices. In this case, animals may actually pay more attention to dissenting individuals in the minority than to the majority of individuals aligned with the prevailing opinion. The general idea is that dissent could indicate some insight that the dissenting animal has and the others lack. The authors’ exploration of this phenomenon is cursory, but it seems like there is room to explore such extensions in more detail. For instance, animals could weight the opinions of their neighbors based on how recently those decisions were made. An analysis of the influence of the order of decisions on the ultimate group decision would also be a way to generate a more specific link between model and data.

Note: The model the authors develop is closely linked to Polya’s urn, a toy model of how inequalities in distributions are magnified over time. Essentially, the urn contains a collection of balls of different colors (say black and white). A ball is then drawn randomly from the urn and replaced with two balls of that color. This step is then repeated. Thus, an asymmetry in the number of balls of each color makes the more prevalent color more likely to be selected, which in turn increases that color’s dominance. The probability matching in the Perez-Escudero and Polavieja (2011) model plays the role of the drawing-and-replacing process. The distribution of balls is effectively the probability distribution of selecting one of two choices.
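The urn process described in the note is easy to simulate. The following is a minimal sketch; the function name and the starting counts are arbitrary choices for illustration.

```python
import random

def polya_urn(black, white, steps, rng):
    """Draw a ball at random and replace it with two balls of the same color."""
    for _ in range(steps):
        if rng.random() < black / (black + white):
            black += 1  # a black ball was drawn: the urn gains one black ball
        else:
            white += 1  # a white ball was drawn: the urn gains one white ball
    return black, white

# Different seeds typically settle on quite different final proportions.
for seed in range(3):
    b, w = polya_urn(1, 1, 1000, random.Random(seed))
    print(seed, b / (b + w))
```

Each run converges to a stable fraction of black balls, but that fraction varies from run to run, which is the amplification of early asymmetries that the note describes.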

Information arriving from the sensory periphery is represented in the activity of neural populations in cortex. The responses of the cells within these populations are correlated. Such correlations can impact the amount of information that can be recovered about a stimulus from the population responses (Zohary, et al. 1994). However, the question of what type of correlations limit the information in a neural population, and where they are likely to originate, has not been fully answered. In particular, correlations can (and do) arise from shared feedforward input, recurrent connectivity, and common, population-wide modulations. Is any one of these sources primarily responsible for limiting information?

The present paper builds on earlier work which argues that information limiting correlations are primarily a reflection of peripheral sensory noise (Moreno-Bote, et al. 2014), and suboptimal computation (Beck et al. 2012). The following figure captures the idea of the first paper: The population activity changes as a function of the stimulus as f(s). This traces out a curve in the space of neural responses (axes correspond to the average activity of each neuron). The f’f’^T noise in the figure is due to correlations that prevent the averaging out of noise along the direction of f’(s). These are the correlations that prevent discrimination between the responses to two nearby stimuli, f(s1) and f(s2), since they induce noise that cannot be averaged out.
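The effect of an f’f’^T component on linear Fisher information can be made explicit with the Sherman-Morrison formula. The block below is a worked sketch; the symbols Σ_0 (the covariance without the information-limiting part), ε (the strength of that part), and I_0 are names introduced here for illustration.

```latex
\begin{align*}
\Sigma &= \Sigma_0 + \varepsilon\, f' f'^{T},
\qquad I_0 = f'^{T} \Sigma_0^{-1} f', \\
\Sigma^{-1} &= \Sigma_0^{-1}
  - \frac{\varepsilon\, \Sigma_0^{-1} f' f'^{T} \Sigma_0^{-1}}{1 + \varepsilon I_0}, \\
I &= f'^{T} \Sigma^{-1} f'
  = I_0 - \frac{\varepsilon I_0^{2}}{1 + \varepsilon I_0}
  = \frac{I_0}{1 + \varepsilon I_0} \le \frac{1}{\varepsilon}.
\end{align*}
```

However large I_0 becomes as neurons are added, the information saturates at 1/ε, which is the sense in which this noise cannot be averaged out.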

The question is: where do these information limiting correlations originate? To answer this question the authors construct a simple feedforward network of orientation tuned neurons responding to Gabor patches (see figure on right). The simplicity of the setup makes it analytically tractable. The covariances can be approximated, allowing for further analytical insights. In particular, the law of total covariance shows immediately that correlations decay with difference in orientation preference, as observed in experiments.

The data processing inequality states that you cannot get more information about the visual input from the response of neurons in V1 than from neurons in LGN. Here, and in many other references, this is made precise using Fisher information, although see the note below. It therefore stands to reason that information limiting correlations are due to the limited information available in the sensory periphery.

Importantly, the origin of information limiting correlations is easy to track in this setup. As the response properties, characterized by the spatial filters of the different neurons, are changed, the tuning curves and correlations change in tandem. In a number of previous studies (including some of our work, Josić, et al. 2009), these characteristics of the neural response were changed independently. Here and in previous work the authors correctly argue that this is not realistic, as it can lead to violations of the data processing inequality.

Interestingly, the Fisher information of the population response of the V1 layer in the model is FI_V1 = FI_LGN cos^2 (α), where α is the angle between I’(s) and the vector space spanned by the filters of the individual cells in V1. Thus, if the subspace spanned by the filters contains I’(s), no information is lost in the response of V1. This approach can be used to show that sharpening of tuning curves does not always increase Fisher information.
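The factor cos^2(α) can be computed by projecting the vector onto the span of the filters. The sketch below shows only this geometric step, with hypothetical names and toy vectors; it is not the authors' model of V1 or LGN.

```python
import numpy as np

def cos2_angle_to_subspace(v, filters):
    """cos^2 of the angle between v and the subspace spanned by the rows of `filters`."""
    v = np.asarray(v, dtype=float)
    F = np.asarray(filters, dtype=float)
    # Least-squares coefficients give the orthogonal projection of v onto span(F).
    coeffs, *_ = np.linalg.lstsq(F.T, v, rcond=None)
    projection = F.T @ coeffs
    return float(projection @ projection / (v @ v))

# Toy example: two filters spanning the first two coordinate axes.
filters = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(cos2_angle_to_subspace([1.0, 1.0, 0.0], filters))  # vector inside the span
print(cos2_angle_to_subspace([1.0, 0.0, 1.0], filters))  # vector partly outside it
```

A value of 1 corresponds to the lossless case in the text, and smaller values give the fraction of the upstream Fisher information that survives.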

Global fluctuations shared between all neurons can also affect the information carried by the population response. Interestingly, the authors show that these correlations do not by themselves limit the information in the response, but do typically decrease it. Thus they rule out common top down activity modulation as the main source of information limiting correlations.

One of the most interesting results is the splitting of the covariance matrix of the neural response into an information-limiting part, and one that does not affect information. This allows them to examine the size of information limiting correlations. Perhaps surprisingly, in realistic settings information limiting correlations are pretty small, perhaps only 1% of the total correlations. This likely makes them difficult to measure, despite the fact that these are the correlations that have the highest impact on an optimal linear readout.

The feedforward setup and the focus on linear Fisher information are what make the analysis in the paper possible. However, they also mean that the results mainly apply to fine discrimination of stimuli that can be parametrized by a scalar. The larger issue is that in most situations outside of the laboratory fine sensory discrimination may not be all that important. It is possible that the brain keeps as much information as possible about the world. I would argue that the processing of sensory information in most situations is a process of discarding irrelevant information, and preserving only what matters. In many of those cases, maximizing Fisher information may not be all that important.

However, the authors do make a good point that many sensory areas do operate in the saturated regime: The neurometric and psychometric thresholds can be comparable. This would not be expected in the unsaturated regime, where a single neuron would not contribute much to the population.

Note: The question of how Fisher information is related to other ways of quantifying encoding quality is not completely resolved; see for instance this recent article. This touches on ethological and evolutionary questions, as the sensory systems have evolved to extract information that is important for an animal’s survival.

The role of synchrony in coding has long been debated. In particular, it is not clear if information can be conveyed through tightly coordinated spiking of groups of cells. I just caught up with this paper by Wang, et al on how adaptation can modulate thalamic synchrony to increase the discriminability of signals. They stimulated the whiskers of anesthetized rats and recorded responses both in the thalamus and the part of the cortex to which these neurons project. They noticed that these cells adapt strongly to stimulation. After adaptation it became more difficult to detect a stimulus, but it also became easier to discriminate between different stimuli. In other words, the range of responses (as measured by the total activity, i.e. the number of spikes in the cortical region) became more discernible after adaptation. Surprisingly, the activity in the thalamus did not change in the same way after adaptation. However, the level of synchrony in the response of the thalamic cells displayed a higher diversity after adaptation. This translated into larger discriminability downstream.

Randy Bruno has a nice review of the role of synchrony, which gives an overview of the results of this paper.

Reaching consensus, and understanding the speed of convergence to it, is one of the best-studied problems in social network models and agent-based systems. The 2011 SIAM paper by Olshevsky and Tsitsiklis describes many of the computational algorithms for agreement and averaging on a communication network. The paper focuses on analyzing existing algorithms in this area and on designing new, efficient ones in which agreement algorithms lead to averaging algorithms with polynomial bounds on their convergence rates.

Consider a group of agents on a communication network, each starting with a real-valued initial opinion. Each agent influences the opinions in its neighborhood, and hence, indirectly, every other agent in the network. These time-evolving opinions are expected to converge to the same point (the average of the initial opinions in the case of the averaging problem), provided each agent assigns appropriate weights to the information from its neighborhood (i.e. the weights are entries of a stochastic matrix A) and the dynamically evolving network is strongly connected. For a time-invariant communication network, the convergence rate of such a process is determined solely by the powers of the matrix A. In that case, it is enough for A to define an aperiodic and irreducible Markov chain to guarantee convergence of such consensus algorithms. To solve the averaging problem with the agreement algorithm on an equal-neighbor, bidirectional graph, they proposed running the agreement algorithm twice in parallel with different initial values for every agent: one run starts from the agent’s initial opinion scaled by the cardinality of its local neighborhood, and the other from a value that depends only on that cardinality. The worst-case convergence time of this algorithm was shown to be O(n^3).

However, for a dynamically evolving topology, the convergence time of the agreement algorithm is not polynomially bounded, as shown by Cao, Morse and Anderson. In 2006, Olshevsky and Tsitsiklis proposed a remedy, a “load-balancing algorithm” in which agents share their initial load (or opinion) with their neighbors and try to equalize their respective loads (or opinions). This algorithm has a polynomial bound on its convergence rate, leading to favorable performance.
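The two-run construction described above can be sketched for a small fixed graph. This is an illustration under stated assumptions, not the paper's pseudocode: the function names are hypothetical, each neighborhood is taken to include the agent itself, and the scaling is implemented as division by the neighborhood size, which is the choice that makes the ratio of the two runs converge to the average.

```python
def agreement_step(values, neighbors):
    """Equal-neighbor update: each agent averages over its neighborhood (itself included)."""
    return [sum(values[j] for j in neighbors[i]) / len(neighbors[i])
            for i in range(len(values))]

def average_via_agreement(opinions, neighbors, steps=500):
    """Run two agreement processes in parallel and return their ratio per agent."""
    degrees = [len(nb) for nb in neighbors]
    y = [x / d for x, d in zip(opinions, degrees)]  # opinion scaled by neighborhood size
    z = [1.0 / d for d in degrees]                  # depends on neighborhood size only
    for _ in range(steps):
        y = agreement_step(y, neighbors)
        z = agreement_step(z, neighbors)
    return [yi / zi for yi, zi in zip(y, z)]

# Path graph 0 - 1 - 2 with self-loops; the true average of the opinions is 4.
neighbors = [[0, 1], [0, 1, 2], [1, 2]]
opinions = [0.0, 3.0, 9.0]
estimates = average_via_agreement(opinions, neighbors)
print(estimates)
```

A single agreement run converges to a degree-weighted mean rather than the plain average; the second run estimates exactly that weighting, so the ratio cancels it.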