
This year, we had the first tutorial session for COSYNE. From the beginning of COSYNE, there was a demand & plan for a tutorial session (according to Alex Pouget). 293 people registered for the tutorial, and for [155, 208] (95% CI) of them it was their first COSYNE. Everybody who gave me feedback was very happy with Jonathan Pillow's 3.5 hour lecture (slides & code) on statistical modeling techniques for neural signals, so we are planning to run another tutorial next year at #COSYNE19 Lisbon, Portugal.

Basic stats: 857 registrations, 709 abstracts submitted, 396 accepted (55.8%), up from 330 in Salt Lake City thanks to a bigger venue in Denver (curiously, Denver is where NIPS used to be held up till 2000).

Iain D. Couzin, Collective sensing and decision-making in animal groups: From fish schools to primate societies (Gatsby lecture)
Iain showed how he went from theoretical models of swarm behavior to virtual reality for studying interacting fish (I thought this is how you spend money in science!), how the group can solve spatial optimization problems without any individual having access to the gradient (PNAS 2009), how collective consensus can go from following a strongly biased minority to “democratized” decision making as the swarm size increases (Science 2017), and also how the collective transitions from averaging to winner-take-all (fig from Trends CogSci 2009).

I-80. Michael Okun, Kenneth Harris. Frequency domain structure of intrinsic infraslow dynamics of cortical microcircuits
On the time scale of tens of seconds (infraslow), they showed that surrogate spike trains matching the inter-spike interval distribution alone do not explain much of the slow variations, but surrogates that also match the power spectral density do. (Spikes with matching spectra and ISI were generated using a variation of the amplitude adjusted Fourier transform.)

I-15. Rudina Morina, Benjamin Cowley, Akash Umakantha, Adam Snyder, Matthew Smith, Byron Yu. The relationship between pairwise correlations and dimensionality reduction
From paired recordings, the spike count correlation distribution is often reported as evidence of low-dimensional activity. From population recordings, factor analysis is often used as a measure of neural dimensionality. How do these two relate? Both quantities depend only on the covariance matrix, hence they investigate how the mean and standard deviation of spike count correlation (rSC) relate to dimensionality as a function of shared variance using the generative model of factor analysis. They found an interesting trend: low-dimensional activity can show up either as large mean rSC with small std rSC, or as small mean and large std (which can be shown by rotating the loadings matrix).
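The two regimes can be illustrated with a one-factor toy model. This is only a sketch under my own assumptions (50 neurons, a loading magnitude of 0.8, unit total variance per neuron), not the analysis from the poster; with a single factor, the "rotation" of the loadings reduces to flipping signs.

```python
import numpy as np

def rsc_stats(loadings, private_var):
    """Mean and std of the pairwise correlations implied by a factor analysis model."""
    cov = loadings @ loadings.T + np.diag(private_var)
    sd = np.sqrt(np.diag(cov))
    corr = cov / np.outer(sd, sd)
    off_diag = corr[np.triu_indices_from(corr, k=1)]
    return off_diag.mean(), off_diag.std()

n_neurons = 50                                 # hypothetical population size
a = 0.8                                        # hypothetical loading magnitude
private_var = np.full(n_neurons, 1 - a ** 2)   # unit total variance per neuron

# One latent dimension, all loadings with the same sign.
same_sign = np.full((n_neurons, 1), a)
# Same magnitudes (hence same shared variance), alternating signs.
signs = np.where(np.arange(n_neurons) % 2 == 0, 1.0, -1.0)
mixed_sign = same_sign * signs[:, None]

mean_same, std_same = rsc_stats(same_sign, private_var)
mean_mixed, std_mixed = rsc_stats(mixed_sign, private_var)
print(f"same sign : mean rSC = {mean_same:.3f}, std rSC = {std_same:.3f}")
print(f"mixed sign: mean rSC = {mean_mixed:.3f}, std rSC = {std_mixed:.3f}")
```

Both populations are one-dimensional with identical shared variance, yet the first has a large mean and near-zero spread of rSC while the second has a near-zero mean and a large spread.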

Timothy Behrens. Building models of the world for behavioural control (invited talk)
I usually ignore fMRI talks, but the 6-fold symmetry of conceptual space was very cool. In the “canonical” bird neck & leg length space, OFC and other areas showed grid-network like signal modulations (Science 2016).

Marlene Cohen. Understanding the relationship between neural variability and behavior (invited)
If correlated variability is important, it should (1) be related to performance, (2) be related to individual perceptual decisions, and (3) be selectively communicated between brain areas. She showed that the first principal component of noise correlation, but not the signal encoding direction, is most correlated with the choice. This was just recently published in Science 2018.

T-24. Caroline Haimerl, Eero Simoncelli. Shared stochastic modulation can facilitate biologically plausible decoding
Noise correlation tends to be in the direction of the strongest decoding signal for unknown reasons (Lin et al. 2015; Rabinowitz et al. 2016). They used the neural response weighted by the shared gain modulation for decoding, which was near optimal.

Byron Yu. Brain-computer interfaces for basic science (invited)
They used BCI to study how the monkey can change its output on a short time scale within the “neural manifold”. Surprisingly, the neural repertoire (distribution of possible population firing patterns) neither shifts nor changes in shape, but mostly reassigns meaning! (Nature Neuroscience 2018) The animal can learn out-of-manifold perturbations as well, but that takes days (as detailed in Emily Oby's talk (T-25) that followed right after).

T-26. Evan Remington, Devika Narain, Eghbal Hosseini, Mehrdad Jazayeri. Control of sensorimotor dynamics through adjustment of inputs and initial condition
In a ready-set-go time interval production task with variable gain (animal has to reproduce 1.5 times the duration sometimes), the mean neural activity of the population forms a #neuralManifold. On the interval subspace, the two temporal gains produced identical mean trajectories, while on the gain subspace, they were separated. (bioRxiv 261214)

Máté Lengyel. Sampling: coding, dynamics, and computation in the cortex (invited)
If the population neural activity represents samples from the posterior, what neural dynamics would produce them (Rubin et al. 2015)? He showed that a stochastic stabilized supralinear network (SSN) with a ring architecture (not ring attractor) can sample and also reproduce neurophysiological temporal dynamics such as on/off-set response and quenching of variability (bioRxiv 2016). Also he trained an RNN to amortize inference of a simple Gaussian scale mixture model of vision, and the solution found by the RNN turns out to be non-detailed balance solution to sampling (as demonstrated by the anti-symmetric part of cross-correlation over time).

Vivek Jayaraman. Navigational attractor dynamics in the Drosophila brain: Going from models to mechanism (invited)
Beautiful work on the ellipsoid body–protocerebral bridge circuit and their computation involving bump attractor dynamics and path integration.

Joni Wallis. Dynamics of prefrontal computations during decision-making (invited)
Theta-oscillation phase of OFC locks to the trials when the reward criteria were linearly changing. A closed-loop microstimulation of OFC at the peak of theta disrupts learning, possibly due to disrupted theta-locked communication with hippocampus.

III-108. Rainer Engelken, Fred Wolf. A spatiotemporally-resolved view of cellular contributions to network chaos
They implemented an event-based recurrent spiking neural network that is so efficient that they can simulate a very large number (15 million) of neurons and study their dynamics. They quantified Lyapunov exponents efficiently and computed cross-correlation against the participation index.

III-75. KiJung Yoon, Xaq Pitkow. Learning nonlinearities for identifying regular structure in the brain’s inference algorithm
Loopy belief propagation often produces poor inference on non-tree graphical models. Can we do better by training a recurrent neural network to do amortized inference on graphs? The answer is yes, and it can generalize to larger networks and unseen graph structures.

III-121. Dongsung Huh, Terrence Sejnowski. Gradient descent for spiking neural networks
They derived a differentiable synapse which responds gradually to membrane voltage near threshold. The presynaptic neuron still spikes, but the differentiable synapse allows gradient descent training of the recurrent spiking neural network. During training they can slowly make the synapse tighter and tighter to finally reach a non-differentiable synapse (arXiv 2017).

Towards a theory of high dimensional, single trial neural data analysis: On the role of random projections and phase transitions
Surya Ganguli

Surya talked about conditions for recovering the embedding dimension of discrete neural responses from noisy single trial observations (very similar to his talk at NIPS 2014 workshop organized by me). He models neural response as where S is sparse sampling matrix, U is a random orthogonal embedding matrix, X is the latent manifold driven by P stimulus conditions. Assuming Gaussian noise, and using free probability theory [Nica & Speicher], he shows the recovery condition .

What should hidden layers do in a deep neur(on)al network? He talked about some happy coincidences: What is the objective function for STDP in this setting [Bengio et al. 2015]? Deep autoencoders and symmetric weight learning [Arora et al. 2015]. Energy based models approximate back-propagation [Bengio 2015].

The Human Visual Hierarchy is Isomorphic to the Hierarchy learned by a Deep Convolutional Neural Network Trained for Object Recognition
Pulkit Agrawal

Which layers of various CNNs trained on an image discrimination task explain the fMRI voxels the best? [Agrawal et al. 2014] shows that the hierarchy of the CNN matches the visual hierarchy, and it’s not because of the receptive field sizes.

CNNs often lose the ‘where’ information in the pooling process. What-where-convnet keeps the ‘where’ information in the pooling stage and uses it to reconstruct the image [Zhao et al. 2015].

Mechanistic fallacy and modelling how we think
Neil Lawrence

He came out as a cognitive scientist. He talked about System 1 (fast subconscious data-driven inference which handles uncertainty well) and System 2 (slow conscious symbolic inference that thinks it is driving the body), and how they could talk to one another. He presented an interesting solution to the variations of the trolley problem and how System 1 kicks in and gives the ‘irrational’ answer.

Approximation methods for inferring time-varying interactions of a large neural population (poster)
Christian Donner and Hideaki Shimazaki

Inference on an Ising model with latent diffusion dynamics on the parameters (both first and second order). Due to the large number of parameters, it needs multiple trials with an identical latent process to make good inference.

Discussion on the interface between neuroscience and machine learning. Are we only focusing on ‘vision’ problems too much? What problems should neuroscience focus on to help advance machine learning? How can datasets and problems change machine learning? Should we train DNN to perform more diverse tasks?

To implement optimal control in the latent state space, they used iterative Linear-Quadratic-Gaussian control applied directly to video. A Gaussian latent state space was decoded from images through a deep variational latent variable model. One-step prediction of latent dynamics was modeled to be locally linear, where the dynamics matrices were parameterized by a neural network that depends on the current state. Training used a variant of a variational cost that penalizes the instantaneous reconstruction error and also the KL divergence between the predicted latent and the reconstructed latent. A deconvolution network was used, and as can be seen in the [video online], the generated images are a bit blurry, but iLQG control works well.

Simple vector space embedding of natural words in [Mikolov et al. NIPS 2013] showed “Madrid” – “Spain” + “France” is closest to “Paris”. Authors show that making such analogy in computer generated images is possible through a deep architecture (top figure on the right). To make an analogy of the form a : b = c : ?, first three images are encoded via f, then f(b) – f(a) + f(c), or more generally T((f(b)-f(a)), f(c)), is decoded via g to generate the output image. They trained convolutional neural network f such that T(f(b)-f(a)) is close to f(d)-f(c). Decoder with same architecture but with up-sampling instead of pooling is used for g. The performance on simple object transformation and video game character animation are quite impressive! [recorded talk]

Authors propose a CNN autoencoder and training method that aims to infer ‘graphics parameters’ such as lighting and viewing angle from images. Usually the deep latent variables are hard to interpret, but here they force interpretability by training subsets of latent variables only (holding others constant) and using input with the corresponding invariance. The resulting ‘disentangled’ representation learns a meaningful approximation of a 3D graphics engine. Trained via SGVB [Kingma & Welling ICLR 2014].

In model based reinforcement learning, predicting the next state given the current state and action accurately is a key operation. Authors show very impressive video prediction given a couple of previous frames of Atari games and a chosen action. Hidden state is estimated using CNN, temporal correlation is learned using LSTM, and action interacts multiplicatively with the state. They used curriculum learning to make increasingly long prediction sequences with SGD + BPTT. They replaced the model-free DQN [Mnih et al. NIPS 2013 workshop] with their model. See impressive results at [online videos and supplement] for yourself!

In many discrete probability models the (computationally intractable) normalizer for the distribution often hinders efficient estimation for high dimensional data (e.g., Ising model). Instead of using KL-divergence (equivalent to MLE) between the model and empirical distribution, if we use a homogeneous divergence, which is invariant under scaling of the underlying measure, we might be able to circumvent the difficulty. Authors use the pseudo-spherical (PS) divergence [Good, I. J. (1971)] and a trick to weight the model by the empirical distribution to make a convex optimization procedure for obtaining a near-MLE solution.

(1) In many early sensory systems, there’s an expansion of representation to a larger number of downstream neurons with sparser representation. This expansion ratio is around 10-100 times, with sparseness of 0.1-0.01 (fraction of neurons active). In [Babadi & Sompolinsky Neuron 2014], they derived how random connection is worse than Hebbian learning for certain scaling and sparseness constraints for representing cluster identities in the input space. (2) How about stacking such layers? Hebbian synaptic learning squashes noise as the network gets deeper. (3) Learning context-dependent influence as mixed (distributed) representation. The solution in [Mante et al. Nature 2013] is not learned in a biologically feasible way. Interleave sensory and context signals into a deep structure with Hebbian learning to solve it. (4) Extend perceptron theory for learning point clouds to manifold clouds (i.e., line segments, and L-p balls).

If output is very high-dimensional, but sparse, as in the classification with large number of categories, the gradient computation bottleneck is the last layer. Authors propose a clever computational trick to compute gradient efficiently.

Hippocampal network can produce a sequence of activation (at rest) that represents goal-directed future plans. By taking the eigendecomposition of the Markov transition matrix of the maze, they obtain the ‘successor representation’ [Dayan NECO 1993] and implement it with a biologically plausible neural network.
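For reference, the successor representation of a policy with transition matrix T and discount γ is M = (I − γT)^(-1), and it can be built from the eigendecomposition of T. The sketch below uses a random walk on a small ring as a stand-in "maze"; the ring, its size, and the discount are my own assumptions, and this is not the paper's neural implementation.

```python
import numpy as np

n_states, gamma = 8, 0.9   # hypothetical maze size and discount factor

# Random walk on a ring "maze": step left or right with probability 1/2.
T = np.zeros((n_states, n_states))
for s in range(n_states):
    T[s, (s - 1) % n_states] = 0.5
    T[s, (s + 1) % n_states] = 0.5

# Successor representation: expected discounted future state occupancy.
M_direct = np.linalg.inv(np.eye(n_states) - gamma * T)

# Same matrix from the eigendecomposition of T (symmetric here, so eigh applies).
eigvals, eigvecs = np.linalg.eigh(T)
M_eig = eigvecs @ np.diag(1.0 / (1.0 - gamma * eigvals)) @ eigvecs.T

print("max abs difference:", np.abs(M_direct - M_eig).max())
```

The eigenvectors of T are shared with M, and only the eigenvalues are rescaled by 1/(1 − γλ), which is why a spectral code for the maze is enough to read out long-horizon predictions.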

By taking the limit of a large number of neurons with tuning curve centers drawn from a Gaussian, they derive a near optimal point process decoding framework. By optimizing on a grid, they derive the theoretically optimal Gaussian that minimizes MSE.

Variational inference often results in factorized forms of approximate posterior that are tighter than the exact Bayesian posterior. Authors derive a method to recover the lost covariance among parameters by perturbing the posterior. For an exponential family variational distribution, the correction is a simple closed-form transformation involving the Hessian of the expected log posterior. [julia code on github]

Authors derive closed-form estimators with theoretical guarantee for GLM with sparse parameter. In the high-dim regime of , sample covariance matrix is rank-deficient, but by thresholding and making it sparse, it becomes full rank (original sample cov should be well approx by this sparse cov). They invert the inverse link function by remapping the observations by small amount: e.g., 0 is mapped to for Poisson so that logarithm doesn’t blow up.

By replacing the sum over the samples in the Hessian for GLM regression with expectation, and applying Stein’s lemma assuming Gaussian stimuli, he derived a computationally cheap 2nd order method (O(np) per iteration). This trick relies on large enough sample size , and the Gaussian stimuli distribution assumption can be relaxed in practice if by central limit theorem. Unfortunately, the condition for the theory doesn’t hold for Poisson-GLM!

In EP, each likelihood contribution to the posterior is stored as an independent factor which is not scalable for large datasets. Authors propose to further approximate by using n copies of the same factor thus making the memory requirement of EP independent of n. This is similar to assumed density filtering (ADF) but with much better performance close to EP.

They propose a pair of probabilistic time series models for variational inference (one generative and one recognition model) and use a variance-controlled log-derivative trick to do stochastic optimization. Using a binary vector, they can model an exponentially large state space, and further introduce hierarchy (deep structure) that can produce longer time scale nonlinear dependences. Each node is extremely simple: linear-logistic-Bernoulli. Zhe told me that they applied it to the 3-bouncing-balls video dataset represented as a 900 dimensional vector, but the generated samples were not perfect and balls would often get stuck. [code on github]

They apply the SGVB reparameterization trick to parameters instead of latent variables. Most importantly, they chose a reparametrization such that the noise is independent for each observation. Upper figure (from dpkingma.com) illustrates the parameterization with slow learning due to the noise being correlated with all samples in the mini-batch, while the lower figure shows the independent form. This relates to variational interpretation of dropout, but now the dropout rate can be learned in a more principled manner.

The two main tricks for estimating the stochastic gradient of an expectation of f(x) are the log-derivative trick and the reparameterization trick (used by the first two papers I introduced above). Reparameterization has much smaller variance, hence leads to faster convergence; however, it can only be used for continuous x and differentiable f. Here authors propose an extension of the log-derivative trick with small variance by (numerically) integrating in 1D over the latent variable that is directly controlled by the corresponding parameter, while holding the Markov blanket constant.
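To make the variance gap concrete, here is a toy comparison of the two standard estimators (not the authors' extension). The setup is my own assumption: x ~ N(μ, 1) and f(x) = x², so the true gradient of E[f(x)] with respect to μ is 2μ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n_samples = 1.0, 200_000   # hypothetical parameter value and sample count

def f(x):
    return x ** 2

def df(x):
    return 2 * x

eps = rng.standard_normal(n_samples)
x = mu + eps                   # x ~ N(mu, 1)

# Log-derivative trick: f(x) * d/dmu log N(x; mu, 1) = f(x) * (x - mu)
score_grads = f(x) * (x - mu)
# Reparameterization trick: d/dmu f(mu + eps) = f'(mu + eps)
reparam_grads = df(x)

print("true gradient     :", 2 * mu)
print("log-derivative    : mean %.3f, var %.2f" % (score_grads.mean(), score_grads.var()))
print("reparameterization: mean %.3f, var %.2f" % (reparam_grads.mean(), reparam_grads.var()))
```

Both estimators are unbiased, but the per-sample variance of the log-derivative estimator is several times larger in this example, which is the gap that variance-reduction methods try to close.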

In a conventional recurrent neural network (RNN), noise is limited to the input/output space, and the internal states are deterministic. Authors add a stochastic latent variable node to an RNN, and incorporate variational autoencoder (VAE) concepts. Latents are only time dependent through the deterministic recurrent states (with hidden LSTM units), and have a much lower dimension. They train on raw speech waveforms, and were able to generate mumbling sounds that resemble speech (I sampled their cool audio), and similarly for 2D handwriting. Their model seemed to work equally well with observation models of different complexity, unlike plain RNNs which require complex observation models to generate reasonable output. [code on github]

To generate texture images, they started with a deep convolutional neural network, and trained another network’s input with fixed weights until the covariance in certain layers matched. If they started with a white noise image, they could sample textures (via gradient descent optimization). [code on github]

Victoria (Vika) Gitman talked about non-standard models of Peano arithmetic. She listed the first-order form of the Peano axioms, which are supposed to describe addition, multiplication, and ordering of the natural numbers. However, it turns out there are other countable models that are not the natural numbers and yet satisfy the Peano axioms. She used the compactness theorem, a corollary of the completeness theorem (Gödel 1930), which (loosely) states that if every finite subset of the axioms of a first-order system has a model, then the system has a model. She showed that if we add a constant symbol ‘c’ (in addition to 0 and 1) to the language of arithmetic, and an infinite set of axioms which is consistent with the Peano axioms: {c > 0, c > 1, c > 2, … }, then using the compactness theorem, there exists a model. This model is somewhat like integers sprinkled on the rational numbers, in the sense that (…, c-2, c-1, c, c+1, c+2, …) are all larger than the regular natural numbers, but then 2c is larger than all of that. Then there are also fractions of c such as c/2, and so on. This is still countable, since it is a countable collection of countably infinite sets, but this totally blew our minds. In this non-standard model of arithmetic, those ‘numbers’ outside the natural numbers can be represented as a pair of a rational number and an integer, but actual computation with those numbers turns out to be non-trivial (and often non-computable).

Ashish Myles talked about the incompleteness theorem, and other disturbing ideas. Starting from the analogy of the liar’s paradox, Ashish stated that arithmetic (with multiplication) can be used to encode logical statements into natural numbers, and also to write a (recursive) function that encapsulates the notion of ‘provable from axioms’. The Gödel statement G roughly says that “the natural number that encodes G is not provable”. Such a statement is true (in our meta language) since if it is false, there’s a contradiction. However, either adding G or not G as an axiom to the original system is consistent. Even after including G (or “not G”) as an axiom to Peano arithmetic, there’ll be statements that are true but not provable! Vika gave an example statement that is true of the natural numbers but is not provable from the Peano axioms: every Goodstein sequence terminates at 0.

At this point, we were all feeling very cold inside, and needed some warm sunshine. So, we continued our discussion outside:

Kyle Mandli talked about the Axiom of Choice (AC), which is an axiom that is somewhat counterintuitive, and independent of the Zermelo-Fraenkel (ZF) set theory: Both ZF with AC and ZF with not AC are consistent (Gödel 1964). We discussed many counterintuitive “paradoxes” as well as the usefulness of AC in mathematics.

Diana Hall talked about a counterintuitive bet: suppose we have a fair coin, and we are tossing it to create a sequence. Would you bet on seeing HTH first or HHT first? At first one might think they are equally likely. However, there’s a sequence effect that makes them unequal!
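A quick simulation makes the asymmetry visible. This is my own sketch, not something shown at the meeting; the function name, seed, and number of games are arbitrary. The intuition is that once HH has appeared, HHT is guaranteed to show up before HTH, whereas after HT a tails sends HTH back to square one. Writing p for the probability that HHT comes first after the first H, p = 1/2 + p/4, so p = 2/3.

```python
import random

def first_pattern(rng, patterns=("HTH", "HHT")):
    """Toss a fair coin until one of the patterns appears; return that pattern."""
    history = ""
    while True:
        history += rng.choice("HT")
        for p in patterns:
            if history.endswith(p):
                return p

rng = random.Random(0)   # fixed seed for reproducibility
n_games = 50_000
wins = sum(first_pattern(rng) == "HHT" for _ in range(n_games))
frac_hht = wins / n_games
print(f"HHT appears first in {frac_hht:.3f} of games")
```

The simulated fraction should land close to 2/3, so betting on HHT is the better choice.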

Unfortunately, due to time constraints we couldn’t get to the topic Uygar had planned: “approximate solutions to combinatorial optimization problems implies P=NP”. Hopefully we’ll hear about it at BBD11!

NIPS is growing fast with 2400+ participants! I felt there were proportionally fewer “neuro” papers compared to last year, maybe because of a huge presence of deep network papers. My NIPS keywords of the year: Deep learning, Bethe approximation/partition function, variational inference, climate science, game theory, and Hamming ball. Here are my notes on the interesting papers/talks from my biased sampling as a neuroscientist, as I did for the previous meetings. Other bloggers have written about the conference: Paul Mineiro, John Platt, Yisong Yue and Yun Hyokun (in Korean).

The NIPS Experiment

The program chairs, Corinna Cortes and Neil Lawrence, ran an experiment on the reviewing process and estimated the inconsistency. 10% of the papers were chosen to be reviewed independently by two pools of reviewers and area chairs, hence those authors got 6-8 reviews, and had to submit 2 author responses. The disagreement was around 25%, meaning around half of the accepted papers could have been rejected (the baseline assuming independent random acceptance was around 38%). This tells you that the variability in the NIPS reviewing process is high, so keep that in mind whether your papers got in or not! They accepted all papers that had disagreement between the two pools, so the overall acceptance rate was a bit higher this year. For details, see Eric Price’s blog post and Bert Huang’s post.
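A back-of-the-envelope check shows how 25% disagreement translates into "around half". The per-committee acceptance rate of 25% below is my own assumption, chosen because it reproduces the quoted random baseline of about 38%; it is not a number from the experiment.

```python
accept_rate = 0.25    # ASSUMPTION: fraction of papers each committee accepts
disagreement = 0.25   # fraction of duplicated papers with conflicting decisions (quoted above)

# Two independent random committees disagree when exactly one of them accepts.
random_baseline = 2 * accept_rate * (1 - accept_rate)

# Disagreements split evenly (on average) between "A accepts, B rejects" and the
# reverse, so the chance that a paper accepted by one committee is rejected by
# the other is half of the disagreement divided by the acceptance rate.
p_reject_given_accept = (disagreement / 2) / accept_rate

print(f"random baseline disagreement          : {random_baseline:.3f}")
print(f"P(rejected by other | accepted by one): {p_reject_given_accept:.2f}")
```

Under this assumption the random baseline is 37.5% and a paper accepted by one committee has an even chance of being rejected by the other.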

Latent variable modeling of neural population activity

How can we quantify how two populations of neurons interact? A full interaction model would require O(N^2) parameters, which quickly makes the inference intractable. Therefore, a low-dimensional interaction model could be useful, and this paper does exactly this by extending the ideas of canonical correlation analysis to vector autoregressive processes.

How can you put more structure to a PLDS (Poisson linear dynamical system) model? They assumed disjoint groups of neurons would have loadings from a restricted set of factors only. For application, they actually restricted the loading weights to be non-negative, in order to separate out the two underlying components of oscillation in spinal cord. They have a clever subspace clustering based initialization, and a variational inference procedure.

How do you capture discrete states in the brain, such as UP/DOWN states? They propose using a probabilistic hierarchical hidden Markov model for population of spiking neurons. The hierarchical structure reduces the effective number of parameters of the state transition matrix. The full model captures the population variability better than coupled GLMs, though the number of states and their structure is not learned. Estimation is via variational inference.

General Machine Learning

From results in statistical physics, they hypothesize that there are more saddles in high dimensions, and that these are the main cause of slow convergence of stochastic gradient descent. In addition, the exact Newton method converges to saddles, and (stochastic) gradient descent is slow to get out of saddles, causing lengthy plateaus in training neural networks. They provide a theoretical justification for a known heuristic optimization method, which is to take the absolute value of the eigenvalues of the Hessian when taking the Newton step. This avoids saddles, and dramatically improves convergence speed.
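The absolute-eigenvalue idea is easy to see on the textbook saddle f(x, y) = x² − y². The sketch below is a toy illustration under my own assumptions (exact Hessian, no damping, a hand-picked starting point) and omits whatever approximations are needed to scale the method to real networks.

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])      # gradient of f(x, y) = x^2 - y^2

def hessian(p):
    return np.array([[2.0, 0.0], [0.0, -2.0]])

def newton_step(p):
    return p - np.linalg.solve(hessian(p), grad(p))

def saddle_free_newton_step(p):
    # Replace each Hessian eigenvalue by its absolute value before inverting.
    eigvals, eigvecs = np.linalg.eigh(hessian(p))
    abs_hessian_inv = eigvecs @ np.diag(1.0 / np.abs(eigvals)) @ eigvecs.T
    return p - abs_hessian_inv @ grad(p)

p0 = np.array([1.0, 0.5])                 # hypothetical starting point
p_newton = newton_step(p0)
p_sfn = saddle_free_newton_step(p0)
print("Newton step           :", p_newton)
print("|H| (saddle-free) step:", p_sfn)
```

The plain Newton step jumps straight onto the saddle at the origin, while the |H| step still contracts along the positive-curvature direction but moves away from the saddle along the negative-curvature direction.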

Extends the Gumbel-Max Trick to an exact sampling algorithm for general (low-dimensional) continuous distributions with intractable normalizers. The trick involves perturbing a discrete-domain function by adding independent samples from a Gumbel distribution. They construct a Gumbel process which gives bounds on the intractable log partition function, and use it to sample.
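For reference, the basic discrete Gumbel-Max trick that the paper builds on can be sketched in a few lines: add independent Gumbel noise to unnormalized log-probabilities and take the argmax. The weights, seed, and sample count below are hypothetical, and this sketch does not implement the continuous extension from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
log_weights = np.log(np.array([1.0, 2.0, 3.0, 4.0]))   # unnormalized, hypothetical
probs = np.exp(log_weights) / np.exp(log_weights).sum()

n_samples = 100_000
# One independent Gumbel(0, 1) perturbation per category and per sample.
gumbel_noise = rng.gumbel(size=(n_samples, log_weights.size))
samples = np.argmax(log_weights + gumbel_noise, axis=1)

empirical = np.bincount(samples, minlength=log_weights.size) / n_samples
print("target   :", probs)
print("empirical:", np.round(empirical, 3))
```

Note that the normalizer is never needed for sampling itself: the argmax is unchanged by adding a constant to all log-weights.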