Abstract

It is generally believed that during economic decisions, striatal neurons represent the values associated with different actions. This hypothesis is based on studies, in which the activity of striatal neurons was measured while the subject was learning to prefer the more rewarding action. Here we show that these publications are subject to at least one of two critical confounds. First, we show that even weak temporal correlations in the neuronal data may result in an erroneous identification of action-value representations. Second, we show that experiments and analyses designed to dissociate action-value representation from the representation of other decision variables cannot do so. We suggest solutions to identifying action-value representation that are not subject to these confounds. Applying one solution to previously identified action-value neurons in the basal ganglia we fail to detect action-value representations. We conclude that the claim that striatal neurons encode action-values must await new experiments and analyses.

To identify neurons that represent the values the subject associates with the different actions, researchers have searched for neurons whose firing rate is significantly correlated with the average reward associated with exactly one of the actions. There are several ways of defining the average reward associated with an action. For example, the average reward can be defined by the reward schedule, such as the probability of a reward associated with the action. Alternatively, one can adopt the subject’s perspective, and use the subject-specific history of rewards and actions in order to estimate the average reward. In particular, the Rescorla–Wagner model (equivalent to the standard one-state Q-learning model) has been used to estimate action-values (Kim et al., 2009; Samejima et al., 2005). In this model, the value associated with an action i in trial t, termed Qi(t), is an exponentially-weighted average of the rewards associated with this action in past trials:

$$Q_i(t+1) = Q_i(t) + \alpha\left(R(t) - Q_i(t)\right) \quad \text{if } a(t) = i \tag{1}$$

$$Q_i(t+1) = Q_i(t) \quad \text{if } a(t) \neq i$$

where a(t) and R(t) denote the choice and reward in trial t, respectively, and α is the learning rate.

The model also posits that in a two-alternative task, the probability of choosing an action is a sigmoidal function, typically softmax, of the difference of the action-values (see also [Shteingart and Loewenstein, 2014]):

$$\Pr\left(a(t) = 1\right) = \frac{1}{1 + e^{-\beta\left(Q_1(t) - Q_2(t)\right)}} \tag{2}$$

where β is a parameter that determines the bias towards the action associated with the higher action-value. The parameters of the model, α and β, can be estimated from the behavior, allowing the researchers to compute Q1 and Q2 on a trial-by-trial basis.
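Equations 1 and 2 can be sketched in a few lines of code. This is a minimal illustration rather than the simulation code used in the paper: the function names, the 0/1 indexing of the two actions and the random seed are assumptions made for the example, and the default values of α and β are the ones used in the simulations described later in the text.

```python
import math
import random

def q_learning_step(q, action, reward, alpha):
    """Equation 1: only the value of the chosen action is updated."""
    q = list(q)
    q[action] += alpha * (reward - q[action])
    return q

def prob_action_1(q, beta):
    """Equation 2: probability of choosing action 1 (stored at index 0)."""
    return 1.0 / (1.0 + math.exp(-beta * (q[0] - q[1])))

def simulate(reward_probs, n_trials, alpha=0.1, beta=2.5, q0=0.5, seed=0):
    """Simulate choices and binary rewards for fixed reward probabilities.

    Returns a list of (action, reward, values_before_update) tuples,
    where action is 0 for action 1 and 1 for action 2.
    """
    rng = random.Random(seed)
    q = [q0, q0]
    history = []
    for _ in range(n_trials):
        action = 0 if rng.random() < prob_action_1(q, beta) else 1
        reward = 1 if rng.random() < reward_probs[action] else 0
        history.append((action, reward, tuple(q)))
        q = q_learning_step(q, action, reward, alpha)
    return history
```

Calling `simulate((0.9, 0.5), 200)` returns the trial-by-trial choices, rewards and action-values for a single block in which action 1 is the more rewarding one.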

In this paper we conduct a systematic literature search and conclude that the literature has, by and large, ignored two major confounds in this and in similar analyses. First, it is well-known that spurious correlations can emerge in correlation analysis if both variables have temporal correlations (Granger and Newbold, 1974; Phillips, 1986). Here we show that neurons can be erroneously classified as representing action-values when their firing rates are weakly temporally correlated. Second, it is also well-known that lack of a statistically significant result in the analysis does not imply lack of correlation. Because in standard analyses neurons are classified as representing action-values if they have a significant regression coefficient on exactly one action-value and because decision variables such as policy are correlated with both action-values, neurons representing other decision variables may be misclassified as representing action-values. We propose different approaches to address these issues. Applying one of them to recordings from the basal ganglia, we fail to identify any action-value representation there. Thus, we conclude that the hypothesis that striatal neurons represent action-values still remains to be tested by experimental designs and analyses that are not subject to these confounds. In the Discussion we address additional conceptual issues with identifying such a representation.

This paper discusses methodological problems that may also be of relevance in other fields of biology in general and neuroscience in particular. Nevertheless, the focus of this paper is a single scientific claim, namely, that action-value representation in the striatum is an established fact. Our criticism is restricted to the representation of action-values, and we do not make any claims regarding the possible representations of other decision variables, such as policy, chosen-value or reward-prediction-error. This we leave for future studies. Moreover, we do not make any claims about the possible representations of action-values elsewhere in the brain, although our results suggest caution when looking for such representations.

The paper is organized as follows. We commence by describing a standard method for identifying action-value neurons. Next, we show that this method erroneously classifies simulated neurons, whose activity is temporally correlated, as representing action-values. We show that this confound brings into question the conclusion of many existing publications. Then, we propose different methods for identifying action-value neurons that overcome this confound. Applying such a method to basal ganglia recordings, in which action-value neurons were previously identified, we fail to conclusively detect any action-value representations. We continue by discussing the second confound: neurons that encode the policy (the probability of choice) may be erroneously classified as representing action-value, even when the policy is the result of learning algorithms that are devoid of action-value calculation. Then we discuss a possible solution to this confound.

Results

Identifying action-value neurons

We commence by examining the standard methods for identifying action-value neurons using a simulation of an operant learning experiment. We simulated a task, in which the subject repeatedly chooses between two alternative actions, which yield a binary reward with a probability that depends on the action. Specifically, each session in the simulation was composed of four blocks such that the probabilities of rewards were fixed within a block and varied between the blocks. The probabilities of reward in the blocks were (0.1,0.5), (0.9,0.5), (0.5,0.9) and (0.5,0.1) for actions 1 and 2, respectively (Figure 1A). The order of blocks was random and a block terminated when the more rewarding action was chosen more than 14 times within 20 consecutive trials (Ito and Doya, 2015a; Samejima et al., 2005).
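The block-termination rule described above can be sketched as follows. The function name and the coding of actions as 0 (action 1) and 1 (action 2) are assumptions made for this illustration; only the reward probabilities and the 14-out-of-20 criterion come from the text.

```python
# Reward probabilities (action 1, action 2) of the four blocks in the task.
BLOCKS = [(0.1, 0.5), (0.9, 0.5), (0.5, 0.9), (0.5, 0.1)]

def block_finished(choices, reward_probs, window=20, criterion=14):
    """Return True if the more rewarding action was chosen more than
    `criterion` times within the last `window` trials of the block.

    choices: list of 0/1 action indices made so far in the current block.
    """
    better = 0 if reward_probs[0] > reward_probs[1] else 1
    recent = choices[-window:]
    return sum(1 for c in recent if c == better) > criterion
```

In a simulation loop, this check would be evaluated after every trial, and the next block (drawn in random order from `BLOCKS`) would start as soon as it returns True.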

Model of action-value neurons.

(A) Behavior of the model in an example session, composed of four blocks (separated by dashed vertical lines). The probabilities of reward for choosing actions 1 and 2 are denoted by the pair of numbers above the block. Black line denotes the probability of choosing action 1; vertical lines denote choices in individual trials, where red and blue denote actions 1 and 2, respectively, and long and short lines denote rewarded and unrewarded trials, respectively. (B) Neural activity. Firing rate (line) and spike-count (dots) of two example simulated action-value neurons in the session depicted in (A). The red and blue-labeled neurons represent Q1 and Q2, respectively. Black horizontal lines denote the mean spike count in the last 20 trials of the block. Error bars denote the standard error of the mean. The two asterisks denote p<0.01 (rank sum test). (C) Values. Thick red and blue lines denote Q1 and Q2, respectively. Note that the firing rates of the two neurons in (B) are a linear function of these values. Thin red and blue lines denote the estimates of Q1 and Q2, respectively, based on the choices and rewards in (A). The similarity between the thick and thin lines indicates that the parameters of the model can be accurately estimated from the behavior (see also Materials and methods). (D) and (E) Population analysis. (D) Example of 500 simulated action-value neurons from randomly chosen sessions. Each dot corresponds to a single neuron and the coordinates correspond to the t-values of the regression of the spike counts on the estimated values of the two actions. Dashed lines at t=2 denote the significance boundaries. 
Color of dots denotes significance: dark red and blue denote a significant regression coefficient on exactly one estimated action-value, action 1 or action 2, respectively; light blue – significant regression coefficients on both estimated action-values with similar signs (ΣQ); orange – significant regression coefficients on both estimated action-values with opposite signs (ΔQ); black – no significant regression coefficients. The two simulated neurons in (B) are denoted by squares. (E) Fraction of neurons in each category, estimated from 20,000 simulated neurons in 1,000 sessions. Error bars denote the standard error of the mean. Dashed lines denote the naïve expected false positive rate from the significance threshold (see Materials and methods).

To simulate learning behavior, we used the Q-learning framework (Equations 1 and 2 with α=0.1 and β=2.5 (taken from distributions reported in [Kim et al., 2009]) and initial conditions Qi(1)=0.5). As demonstrated in Figure 1A, the model learned: the probability of choosing the more rewarding alternative increased over trials (black line). To model the action-value neurons, we simulated neurons whose firing rate is a linear function of one of the two Q-values and whose spike count in a 1 sec trial is randomly drawn from a corresponding Poisson distribution (see Materials and methods). The firing rates and spike counts of two such neurons, representing action-values 1 and 2, are depicted in Figure 1B in red and blue, respectively.
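A simulated action-value neuron of this kind can be sketched as below. The baseline rate and gain of the linear mapping are hypothetical values chosen for illustration (the actual parameters are given in Materials and methods, which is not reproduced here), and the function name is likewise an assumption.

```python
import numpy as np

def action_value_neuron(q_trace, baseline=2.5, gain=2.0, seed=0):
    """Poisson neuron whose firing rate is a linear function of an action-value.

    q_trace: action-value Q_i(t) in each trial.
    baseline, gain: hypothetical parameters of the linear mapping (in Hz).
    Returns (firing rate per trial, spike count per 1 s trial).
    """
    rng = np.random.default_rng(seed)
    rate = baseline + gain * (np.asarray(q_trace, dtype=float) - 0.5)
    rate = np.clip(rate, 0.0, None)  # firing rates cannot be negative
    counts = rng.poisson(rate)       # 1 s trials: expected count equals the rate
    return rate, counts
```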

One standard method for identifying action-value neurons is to compare neurons' spike counts after learning, at the end of the blocks (horizontal bars in Figure 1B). Considering the red-labeled Poisson neuron, the spike count in the last 20 trials of the second block, in which the probability of reward associated with action 1 was 0.9, was significantly higher than the count in the first block, in which the probability of reward associated with action 1 was 0.1 (p<0.01; rank sum test). By contrast, there was no significant difference in the spike counts between the third and fourth blocks, in which the probability of reward associated with action 1 was equal (p=0.91; rank sum test). This is consistent with the fact that the red-labeled neuron was an action 1-value neuron: its firing rate was a linear function of the value of action 1 (Figure 1B, red). Similarly, for the blue-labeled neuron, the spike counts in the last 20 trials of the first two blocks were not significantly different (p=0.92; rank sum test), but there was a significant difference in the counts between the third and fourth blocks (p<0.001; rank sum test). These results are consistent with the probabilities of reward associated with action 2 and the fact that in our simulations, this neuron’s firing rate was modulated by the value of action 2 (Figure 1B, blue).

This approach for identifying action-value neurons is limited, however, for several reasons. First, it considers only a fraction of the data, the last 20 trials in a block. Second, action-value neurons are not expected to represent the block average probabilities of reward. Rather, they will represent a subjective estimate, which is based on the subject-specific history of actions and rewards. Therefore, it is more common to identify action-value neurons by regressing the spike count on subjective action-values, estimated from the subject’s history of choices and rewards (Funamizu et al., 2015; Ito and Doya, 2015a; Ito and Doya, 2015b; Kim et al., 2009; Lau and Glimcher, 2008; Samejima et al., 2005). Note that when studying behavior in experiments, we have no direct access to these estimated action-values, in particular because the values of the parameters α and β are unknown. Therefore, following common practice, we estimated the values of α and β from the model’s sequence of choices and rewards using maximum likelihood, and used the estimated learning rate (α) and the choices and rewards to estimate the action-values (thin lines in Figure 1C, see Materials and methods). These estimates were similar to the true action-value, which underlay the model’s choice behavior (thick lines in Figure 1C).
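The estimation step can be illustrated with the following sketch. It maximizes the likelihood of the observed choices under Equations 1 and 2 by a coarse grid search, which is an assumption made to keep the example dependency-free; the text does not specify the optimizer that was used. All function names are hypothetical.

```python
import numpy as np

def simulate_choices(reward_probs, n_trials, alpha, beta, seed=0, q0=0.5):
    """Generate choices (0/1) and binary rewards from Equations 1 and 2."""
    rng = np.random.default_rng(seed)
    q = np.array([q0, q0])
    actions, rewards = [], []
    for _ in range(n_trials):
        p1 = 1.0 / (1.0 + np.exp(-beta * (q[0] - q[1])))
        a = 0 if rng.random() < p1 else 1
        r = int(rng.random() < reward_probs[a])
        q[a] += alpha * (r - q[a])
        actions.append(a)
        rewards.append(r)
    return actions, rewards

def neg_log_likelihood(alpha, beta, actions, rewards, q0=0.5):
    """Negative log-likelihood of the observed choices given alpha and beta."""
    q = np.array([q0, q0])
    nll = 0.0
    for a, r in zip(actions, rewards):
        p1 = 1.0 / (1.0 + np.exp(-beta * (q[0] - q[1])))
        p = p1 if a == 0 else 1.0 - p1
        nll -= np.log(max(p, 1e-12))  # guard against log(0)
        q[a] += alpha * (r - q[a])
    return nll

def fit_by_grid(actions, rewards, alphas, betas):
    """Return the (alpha, beta) pair on the grid with the highest likelihood."""
    best = min((neg_log_likelihood(a, b, actions, rewards), a, b)
               for a in alphas for b in betas)
    return best[1], best[2]

def estimate_action_values(alpha, actions, rewards, q0=0.5):
    """Trial-by-trial estimated Q1 and Q2 (values held before each choice)."""
    q = np.array([q0, q0])
    trace = np.empty((len(actions), 2))
    for t, (a, r) in enumerate(zip(actions, rewards)):
        trace[t] = q
        q[a] += alpha * (r - q[a])
    return trace
```

The columns of the array returned by `estimate_action_values` play the role of the thin lines in Figure 1C and serve as predictors in the regression analysis.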

Next, we regressed the spike count of each simulated neuron on the two estimated action-values from its corresponding session. As expected, the t-value of the regression coefficient of the red-labeled action 1-value neuron was significant for the estimated Q1 ($t_{182}^{Q_1}=4.05$) but not for the estimated Q2 ($t_{182}^{Q_2}=-0.27$). Similarly, the t-value of the regression coefficient of the blue-labeled action 2-value neuron was significant for the estimated Q2 ($t_{182}^{Q_2}=3.05$) but not for the estimated Q1 ($t_{182}^{Q_1}=0.78$).

A population analysis of the t-values of the two regression coefficients is depicted in Figure 1D,E. As expected, a substantial fraction (42%) of the simulated neurons were identified as action-value neurons. Only 2% of the simulated neurons had significant regression coefficients with both action-values. Such neurons are typically classified as state (ΣQ) or policy (also known as preference) (ΔQ) neurons, if the two regression coefficients have the same or different signs, respectively (Ito and Doya, 2015a). Note that despite the fact that by construction, all neurons were action-value neurons, not all of them were detected as such by this method. This failure occurred for two reasons. First, the estimated action-values are not identical to the true action-values, which determine the firing rates. This is because of the finite number of trials and the stochasticity of choice (note the difference, albeit small, between the thin and thick lines in Figure 1C). Second and more importantly, the spike count in a trial is only a noisy estimate of the firing rate because of the Poisson generation of spikes.
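The regression and the resulting classification can be sketched as follows. This is a minimal ordinary least-squares implementation written for illustration; the function names and category labels are assumptions, and the threshold of 2 corresponds to the significance boundary used in Figure 1D.

```python
import numpy as np

def regression_t_values(spike_counts, q1, q2):
    """t-values of the coefficients of Q1 and Q2 in the regression
    spike_count ~ intercept + Q1 + Q2 (ordinary least squares)."""
    y = np.asarray(spike_counts, dtype=float)
    X = np.column_stack([np.ones(len(y)), q1, q2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t = coef / np.sqrt(np.diag(cov))
    return t[1], t[2]

def classify(t1, t2, threshold=2.0):
    """Assign a neuron to a category from its two t-values."""
    sig1, sig2 = abs(t1) > threshold, abs(t2) > threshold
    if sig1 and sig2:
        # Same signs: state (sum of values); opposite signs: policy (difference).
        return "state" if t1 * t2 > 0 else "policy"
    if sig1:
        return "action-value 1"
    if sig2:
        return "action-value 2"
    return "none"
```

For the two example neurons in the text, `classify(4.05, -0.27)` gives "action-value 1" and `classify(0.78, 3.05)` gives "action-value 2". Note that this classification assumes that trials are independent, which is the assumption examined in the next section.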

Confound 1 – temporal correlations

The red and blue-labeled neurons in Figure 1D were classified as action-value neurons because their t-values were improbable under the null hypothesis that the firing rate of the neuron is not modulated by action-values. The significance threshold (t = 2) was computed assuming that trials are independent in time. To see why this assumption is essential, we consider a case in which it is violated. Figure 2A depicts the firing rates and spike counts of two simulated Poisson neurons, whose firing rates follow a bounded Gaussian random-walk process:

(A) Two example random-walk neurons that appear as if they represent action-values. The red (top) and blue (bottom) lines denote the estimated action-values 1 and 2, respectively, that were depicted in Figure 1C. Gray lines and gray dots denote the firing rates and the spike counts of two example random-walk neurons that were randomly assigned to this simulated session. Black horizontal lines denote the mean spike count in the last 20 trials of each block. Error bars denote the standard error of the mean. The two asterisks denote p<0.01 (rank sum test). (B) and (C) Population analysis. Each random-walk neuron was regressed on the two estimated action-values, as in Figure 1D and E. Numbers and legend are the same as in Figure 1D and E. The two random-walk neurons in (A) are denoted by squares in (B). Dashed lines in (B) at t=2 denote the significance boundaries. Dashed lines in (C) denote the naïve expected false positive rate from the significance threshold (see Materials and methods). (D) Fraction of random-walk neurons classified as action-value neurons (red), and classified as state neurons (ΣQ) or policy neurons (ΔQ) (green) as a function of the magnitude of the diffusion parameter of random-walk (σ). Light red and light green are standard error of the mean. Dashed lines denote the results for σ=0.1, which is the value of the diffusion parameter used in (A)-(C). Initial firing rate for all neurons in the simulations is f(1)=2.5 Hz.

$$f(t+1) = \left[f(t) + z(t)\right]_+ \tag{3}$$

where f(t) is the firing rate in trial t (we consider epochs of 1 second as ‘trials’), z(t) is a diffusion variable, randomly and independently drawn from a normal distribution with mean 0 and variance σ²=0.01, and [x]+ denotes a linear-threshold function, [x]+=x if x≥0 and 0 otherwise.
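A random-walk neuron of this kind can be simulated as in the following sketch. The diffusion parameter σ=0.1 and the initial rate of 2.5 Hz are taken from the text and the legend of Figure 2; the function name and the seed are assumptions made for the example.

```python
import numpy as np

def random_walk_neuron(n_trials, sigma=0.1, f1=2.5, seed=0):
    """Poisson neuron whose firing rate follows a Gaussian random walk
    with a linear threshold at zero, independent of any task variable.

    Returns (firing rate per trial, spike count per 1 s trial).
    """
    rng = np.random.default_rng(seed)
    rate = np.empty(n_trials)
    rate[0] = f1
    steps = rng.normal(0.0, sigma, size=n_trials - 1)
    for t in range(n_trials - 1):
        # Linear-threshold function: the rate cannot become negative.
        rate[t + 1] = max(rate[t] + steps[t], 0.0)
    counts = rng.poisson(rate)
    return rate, counts
```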

These random-walk neurons are clearly not action-value neurons. Nevertheless, we tested them using the analyses depicted in Figure 1. To that end, we randomly matched the trials in the simulation of the random-walk neurons (completely unrelated to the task) to the trials in the simulation depicted in Figure 1A. Then, we considered the spike counts of the random-walk neurons in the last 20 trials of each of the four blocks in Figure 1A (blocks being defined by the simulation of learning and unrelated to the activity of the random-walk neurons). Surprisingly, when considering the top neuron in Figure 2A and utilizing the same analysis as in Figure 1B, we found that its spike count differed significantly between the first two blocks (p<0.01, rank sum test) but not between the last two blocks (p=0.28, rank sum test), similarly to the simulated action 1-value neuron of Figure 1B (red). Similarly, the spike count of the bottom random-walk neuron matched that of a simulated action 2-value neuron (compare with the blue-labeled neuron in Figure 1B; Figure 2A).

Moreover, we regressed each vector of spike counts for 20,000 random-walk neurons on randomly matched estimated action-values from Figure 1E and computed the t-values (Figure 2B). This analysis erroneously classified 42% of these random-walk neurons as action-value neurons (see Figure 2C). In particular, the top and bottom random-walk neurons of Figure 2A were identified as action-value neurons for actions 1 and 2, respectively (squares in Figure 2B).

To further quantify this result, we computed the fraction of random-walk neurons erroneously classified as action-value neurons as a function of the diffusion parameter σ (Figure 2D). When σ=0, the spike counts of the neurons in the different trials are independent and the number of random-walk neurons classified as action-value neurons is slightly less than 10%, the fraction expected by chance from a significance criterion of 5% and two statistical tests, corresponding to the two action-values. The larger the value of σ, the higher the probability that a random-walk neuron will pass the selection criterion for at least one action-value and thus be erroneously classified as an action-value, state or policy neuron.

The excess action-value neurons in Figure 2 emerged because the significance boundary in the statistical analysis was based on the assumption that the different trials are independent from each other. In the case of a regression of a random-walk process on an action-value related variable, this assumption is violated. The reason is that in this case, both predictor (action-value) and the dependent variable (spike count) slowly change over trials, the former because of the learning and the latter because of the random drift. As a result, the statistic, which relates these two signals, is correlated between trials, violating the independence-of-trials assumption of the test. Because of these dependencies, the expected variance of the statistic (be it average spike count in 20 trials or the regression coefficient), which is calculated under the independence-of-trials assumption, is an underestimate of the actual variance. Therefore, the fraction of random-walk neurons classified as action-value neurons increases with the magnitude of the diffusion, which is directly related to the magnitude of correlations between spike counts in proximate trials (Figure 2D). The phenomenon of spurious significant correlations in time-series with temporal correlations has been described previously in the field of econometrics and a formal discussion of this issue can be found in (Granger and Newbold, 1974; Phillips, 1986).
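The effect of temporal correlations on the false positive rate can be reproduced in a simplified setting. In the sketch below, the spike counts of random-walk neurons are regressed on a single unrelated, slowly varying predictor. Using an independent random walk as a stand-in for an estimated action-value, and a single predictor rather than two, are simplifying assumptions of this illustration, so the resulting fractions are not expected to match those in Figure 2D.

```python
import numpy as np

def slope_t_value(y, x):
    """t-value of the slope in an ordinary least-squares regression of y on x."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    slope = xc @ yc / (xc @ xc)
    resid = yc - slope * xc
    se = np.sqrt(resid @ resid / (len(y) - 2) / (xc @ xc))
    return slope / se

def drifting_counts(rng, n_trials, sigma, f1=2.5):
    """Spike counts of a random-walk Poisson neuron (sigma=0: constant rate)."""
    rate = np.empty(n_trials)
    rate[0] = f1
    for t in range(1, n_trials):
        rate[t] = max(rate[t - 1] + rng.normal(0.0, sigma), 0.0)
    return rng.poisson(rate)

def false_positive_rate(sigma, n_neurons=300, n_trials=400, seed=0):
    """Fraction of task-unrelated neurons with |t| > 2 on an unrelated predictor."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_neurons):
        counts = drifting_counts(rng, n_trials, sigma)
        # Unrelated, temporally correlated predictor (illustrative stand-in
        # for an estimated action-value).
        predictor = np.cumsum(rng.normal(0.0, 1.0, n_trials))
        hits += int(abs(slope_t_value(counts, predictor)) > 2.0)
    return hits / n_neurons
```

Comparing `false_positive_rate(0.0)` with `false_positive_rate(0.1)` contrasts independent trials with temporally correlated ones. Only in the former case is the fraction of significant regressions expected to be close to the nominal level of the test.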

Is this confound relevant to the question of action-value representation in the striatum?

Is a random-walk process a good description of striatal neurons’ activity?

The Gaussian random-walk process is just an example of a temporally correlated firing rate and we do not argue that the firing rates of striatal neurons follow such a process. However, any other type of temporal correlations, for example, oscillations or trends, will violate the independence-of-trials assumption, and may lead to the erroneous classification of neurons as representing action-values. Such temporal correlations can also emerge from stochastic learning. For example, in Figure 2—figure supplement 1 we consider a model of operant learning that is based on covariance-based synaptic plasticity (Loewenstein, 2008; Loewenstein, 2010; Loewenstein and Seung, 2006; Neiman and Loewenstein, 2013) and competition (Bogacz et al., 2006). Because such plasticity results in slow changes in the firing rates of the neurons, applying the analysis of Figure 1E to our simulations results in the erroneous classification of 43% of the simulated neurons as representing action-values. This is despite the fact that action-values are not computed as part of this learning, neither explicitly nor implicitly.

Are temporal correlations in neural recordings sufficiently strong to affect the analysis?

To test the relevance of this confound to experimentally-recorded neural activity, we repeated the analysis of Figure 2B,C on neurons recorded in two unrelated experiments: 89 neurons from extracellular recordings in the motor cortex of an awake monkey (Figure 2—figure supplement 2A–B) and 39 auditory cortex neurons recorded intracellularly in anaesthetized rats (Figure 2—figure supplement 2C–D; [Hershenhoren et al., 2014]). We regressed the spike counts on randomly matched estimated action-values from Figure 1E. In both cases we erroneously classified neurons as representing action-value in a fraction comparable to that reported in the striatum (36 and 23%, respectively).

Strong temporal correlations in the striatum

To test the relevance of this confound to striatal neurons, we considered previous recordings from neurons in the nucleus accumbens (NAc) and ventral pallidum (VP) of rats in an operant learning experiment (Ito and Doya, 2009) and regressed their spike counts on simulated, unrelated action-values (using more blocks and trials than in Figure 1E, see Figure legend). Note that although the recordings were obtained during an operant learning task, the action-values that we used in the regression were obtained from simulated experiments and were completely unrelated to the true experimental settings. Again, we erroneously classified a substantial fraction of neurons (43%) as representing action-values, a fraction comparable to that reported in the striatum (Figure 2—figure supplement 3).

We conducted an extensive literature search to see whether previous studies have identified this confound and addressed it (see Materials and methods). Two studies noted that processes such as slow drift in firing rate may violate the independence-of-trials assumption of the statistical tests and suggested unique methods to address this problem (Kim et al., 2013; Kim et al., 2009): one method (Kim et al., 2009) relied on permutation of the spike counts within a block (Figure 2—figure supplement 4, see Materials and methods) and another (Kim et al., 2013) used spikes in previous trials as predictors (Figure 2—figure supplement 5). However, both approaches still erroneously classify unrelated recorded and random-walk neurons as action-value neurons (Figure 2—figure supplements 4 and 5). The failure of both these approaches stems from the fact that a complete model of the learning-independent temporal correlations is lacking. As a result, these methods are unable to remove all the temporal correlations from the vector of spike-counts.

Our literature search yielded four additional methods that have been used to identify action-value neurons. However, as depicted in Figure 2—figure supplement 6 (corresponding to the analyses in [Ito and Doya, 2009; Samejima et al., 2005]), Figure 2—figure supplement 7 (corresponding to the analysis in [Ito and Doya, 2015a]), Figure 2—figure supplement 8 (corresponding to the analysis in [Wang et al., 2013]) and Figure 2—figure supplement 9 (corresponding to a trial design experiment in [FitzGerald et al., 2012]), all these additional methods erroneously classify neurons from unrelated recordings and random-walk neurons as action-value neurons in numbers comparable to those reported in the striatum (Figure 2—figure supplement 6–9). The fMRI analysis in (FitzGerald et al., 2012) focused on the difference between action-values rather than on the action-values themselves (see confound 2), and therefore we did not attempt to replicate it (and cannot attest to whether it is subject to the temporal correlations confound). We did, however, conduct the standard analysis on their unique experimental design - a trial-design experiment in which trials with different reward probabilities are randomly intermingled. Surprisingly, we erroneously detect action-value representation even when using this trial design (Figure 2—figure supplement 9). This erroneous detection occurs because in this analysis, the regression’s predictors are estimated action-values, which are temporally correlated. From this example it follows that even trial-design experiments may still be subject to the temporal correlations confound.

Some previous publications used more blocks. Shouldn’t adding blocks solve the problem?

In Figures 1 and 2 we considered a learning task composed of four blocks with a mean length of 174 trials (standard deviation 43 trials). It is tempting to believe that experiments with more blocks and trials (e.g., [Ito and Doya, 2009; Wang et al., 2013]) will be immune to this confound. The intuition is that the larger the number of trials, the less likely it is that a neuron that is not modulated by action-value (e.g., a random-walk neuron) will have a large regression coefficient on one of the action-values. Surprisingly, however, this intuition is wrong. In Figure 2—figure supplement 10 we show that doubling the number of blocks, so that the original blocks are repeated twice, each time in a random order, does not decrease the fraction of neurons erroneously classified as representing action-values. For the case of random-walk neurons, it can be shown that, contrary to this intuition, the fraction of erroneously identified action-value neurons is expected to increase with the number of trials (Phillips, 1986). This is because the expected variance of the regression coefficients under the null hypothesis is inversely proportional to the degrees of freedom, which increase with the number of trials. As a result, the threshold for classifying a regression coefficient as significant decreases with the number of trials.
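The dependence on the number of trials can be illustrated in the idealized case of two independent Gaussian random walks regressed on each other, which is the setting analyzed by Phillips (1986). This sketch omits the Poisson spiking and the learning-derived predictors of the simulations above, so it is an assumption-laden illustration of the trend rather than a reproduction of Figure 2—figure supplement 10.

```python
import numpy as np

def spurious_fraction(n_trials, n_pairs=500, seed=0):
    """Fraction of independent random-walk pairs whose regression slope
    has |t| > 2 when one walk is regressed on the other."""
    rng = np.random.default_rng(seed)
    x = np.cumsum(rng.standard_normal((n_pairs, n_trials)), axis=1)
    y = np.cumsum(rng.standard_normal((n_pairs, n_trials)), axis=1)
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    sxx = (xc * xc).sum(axis=1)
    slope = (xc * yc).sum(axis=1) / sxx
    resid = yc - slope[:, None] * xc
    # Standard error computed under the (violated) independence assumption.
    se = np.sqrt((resid ** 2).sum(axis=1) / (n_trials - 2) / sxx)
    return float(np.mean(np.abs(slope / se) > 2.0))
```

Evaluating `spurious_fraction` for a short and for a long series (for example, 25 and 1600 trials) shows that lengthening the series does not bring the fraction of significant regressions down to the nominal level.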

Possible solutions to the temporal correlations confound

The temporal correlations confound has been acknowledged in the fMRI literature, and several methods have been suggested to address it, such as ‘prewhitening’ (Woolrich et al., 2001). However, these methods require prior knowledge, or an estimate of the predictor-independent temporal correlations. Both are impractical for the slow time-scale of learning and therefore are not applicable in the experiments we discussed.

Another suggestion is to assess the level of autocorrelations between trials in the data and to use it to predict the expected fraction of erroneous classification of action-value neurons. However, using such a measure is problematic in the context of action-value representation because the autocorrelations relevant for the temporal correlations confound are those associated with the time-scale relevant for learning, that is, tens of trials. Computing such autocorrelations in experiments of a few hundred trials introduces substantial biases (Kohn, 2006; Newbold and Agiakloglou, 1993). Moreover, even when these autocorrelations are computed, it is not clear exactly how they can be used to estimate the expected false positive rate for action-value classification.

Finally, it has been suggested that the temporal correlation confound can be addressed by using repeating blocks and removing neurons whose activity is significantly different in identical blocks (Asaad et al., 2000; Mansouri et al., 2006). We tested this method using a design in which the four blocks of Figure 1 are repeated twice. However, even when this method was applied, a significant number of neurons were erroneously classified as representing action-values (Materials and methods).

We therefore propose two alternative approaches.

Permutation analysis

Trivially, an action-value neuron (or any task-related neuron) should be more strongly correlated with the action-value of the experimental session in which the neuron was recorded than with action-values of other sessions (recorded on different days). We propose to use this requirement in a permutation test, as depicted in Figure 3. We first consider the two simulated action-value neurons of Figure 1B. For each of the two neurons, we computed the t-values of the regression coefficients of the spike counts on each of the estimated action-values in all possible sessions (see Materials and methods). Figure 3A depicts the two resulting distributions of t-values. As a result of the temporal correlations, the 5% significance boundaries (vertical dashed lines), which are defined to be exceeded by exactly 5% of t-values in each distribution, are substantially larger (in absolute value) than 2, the standard significance boundaries. In this analysis, a neuron is significantly correlated with an action-value if the t-value of the regression on the action-value from its corresponding session exceeds the significance boundaries derived from the regression of its spike count on all possible action-values.
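The permutation test can be sketched as follows. Several details here are assumptions made for the illustration and are not specified in this section: the null distribution is built from other sessions only, the 5% boundaries are taken as the 2.5th and 97.5th percentiles of that distribution, and sessions shorter than the neuron's own session are skipped while longer ones are truncated. Function and variable names are hypothetical.

```python
import numpy as np

def t_values(counts, q1, q2):
    """t-values of Q1 and Q2 in the regression counts ~ intercept + Q1 + Q2."""
    y = np.asarray(counts, dtype=float)
    X = np.column_stack([np.ones(len(y)), q1, q2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return (coef / se)[1:]

def permutation_test(counts, own_q, other_qs, level=5.0):
    """Compare the t-values obtained with the neuron's own session against
    the distribution of t-values obtained with action-values of other sessions.

    own_q: (Q1, Q2) traces estimated from the neuron's own session.
    other_qs: list of (Q1, Q2) traces estimated from other sessions.
    """
    n = len(counts)
    t_own = t_values(counts, own_q[0][:n], own_q[1][:n])
    null = np.array([t_values(counts, q1[:n], q2[:n])
                     for q1, q2 in other_qs if len(q1) >= n])
    lower = np.percentile(null, level / 2, axis=0)
    upper = np.percentile(null, 100 - level / 2, axis=0)
    significant = (t_own < lower) | (t_own > upper)
    return {"t_own": t_own, "bounds": (lower, upper), "significant": significant}
```

A neuron would then be classified as an action-value neuron if exactly one of the two entries of `significant` is True. Because the boundaries are derived from the neuron's own spike counts, they widen automatically when the activity is temporally correlated.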

Permutation analysis.

(A) Red and blue correspond to the red- and blue-labeled neurons in Figure 1B. Arrow-heads denote the t-values from the regressions on the estimated action-values from the session in which the neuron was simulated (depicted in Figure 1A). The red and blue histograms denote the t-values of the regressions of the spike-count on estimated action-values from different sessions in Figure 1E (Materials and methods). Dashed black lines denote the 5% significance boundary. In this analysis, the regression coefficient of neural activity on an action-value is significant if it exceeds these significance boundaries. Note that because of the temporal correlations, these significance boundaries are larger than ±2 (the significance boundaries in Figures 1 and 2). According to this permutation test the red-labeled but not the blue-labeled neuron is classified as an action-value neuron. (B) Fraction of neurons classified in each category using the permutation analysis for the action-value neurons (green, Figure 1) and random-walk neurons (yellow, Figure 2). Dashed lines denote the naïve expected false positive rate from the significance threshold (Materials and methods). Error bars denote the standard error of the mean. The test correctly identifies 29% of actual action-value neurons as such, while classification of random-walk neurons was at chance level. Analysis was done on 10,080 action-value neurons and 10,077 random-walk neurons from 504 simulated sessions. (C) Light orange, fraction of basal ganglia neurons from (Ito and Doya, 2009) classified in each category when regressing the spike count of 214 basal ganglia neurons in three different experimental phases on the estimated action-values associated with their experimental session. This analysis classified 32% of neurons as representing action-values. Dark orange, fraction of basal ganglia neurons classified in each category when applying the permutation analysis. This test classified 3.6% of neurons as representing action-value. 
Dashed line denotes significance level of p<0.01.

Indeed, when considering the Top (red) simulated action 1-value neuron, we find that its spike count has a significant regression coefficient on the estimated Q1 from its session (red arrow) but not on the estimated Q2 (blue arrow). Importantly, because the significance boundary exceeds 2, this approach is less sensitive than the original one (Figure 1) and indeed, the regression coefficients of the Bottom simulated neuron (blue) do not exceed the significance level (red and blue arrows) and thus this analysis fails to identify it as an action-value neuron. Considering the population of simulated action-value neurons of Figure 1, this analysis identified 29% of the action-value neurons of Figure 1 as such (Figure 3B, green), demonstrating that this analysis can identify action-value neurons. When considering the random-walk neurons (Figure 2), this method classifies only approximately 10% of the random-walk neurons as action-value neurons, as predicted by chance (Figure 3B, yellow). Similar results were obtained for the motor cortex and auditory cortex neurons (not shown).
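The logic of the permutation test can be illustrated with a minimal sketch. It assumes ordinary least-squares regression, sessions of equal length, and hypothetical helper names (`t_values`, `permutation_test`); the slowly varying predictors in the demonstration are stand-ins for estimated action-values and are not the simulations of Figure 1. The significance boundaries are taken from the distribution of t-values obtained when the same spike counts are regressed on predictors from other sessions.

```python
import numpy as np

def t_values(y, X):
    """OLS t-values for the columns of X (an intercept is added, not returned)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sigma2 = resid @ resid / (len(y) - X1.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X1.T @ X1)))
    return (beta / se)[1:]

def permutation_test(spikes, own_q, other_q, alpha=0.05):
    """Compare t-values on own-session predictors with those from other sessions.

    own_q: (n_trials, 2) array; other_q: list of (n_trials, 2) arrays.
    Assumption: all sessions have the same number of trials.
    """
    t_own = t_values(spikes, own_q)
    t_null = np.array([t_values(spikes, q) for q in other_q])
    lo, hi = np.percentile(t_null, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)
    significant = (t_own < lo) | (t_own > hi)
    return t_own, lo, hi, significant

# Demonstration with a task-independent neuron whose rate drifts slowly.
rng = np.random.default_rng(0)
n_trials, n_sessions = 200, 200
make_q = lambda: np.cumsum(rng.normal(size=(n_trials, 2)), axis=0)  # stand-in predictors
rate = np.clip(5 + np.cumsum(rng.normal(scale=0.3, size=n_trials)), 0.1, None)
spikes = rng.poisson(rate)
t_own, lo, hi, sig = permutation_test(spikes, make_q(), [make_q() for _ in range(n_sessions)])
print("t (own session):", t_own)
print("boundaries:", lo, hi, "significant:", sig)
```

Because the boundaries are estimated from predictors with the same temporal structure as the own-session predictors, they widen automatically when the spike counts are temporally correlated.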

Permutation analysis of basal ganglia neurons

Importantly, this permutation method can also be used to reanalyze the activity of previously recorded neurons. To that end, we considered the recordings reported in (Ito and Doya, 2009). The results of their model-free method (Figure 2—figure supplement 6) imply that approximately 23% of the recorded neurons represent action-values at different phases of the experiment. As a first step, we estimated the action-values and regressed the spike counts in the different phases of the experiment on the estimated action-values, as in Figure 1 (activity in each phase is analyzed as if it is a different neuron; see Materials and methods). The results of this analysis implied that 32% of the neurons represent action-values (p<0.01) (Figure 3—figure supplement 1). Next, we applied the permutation analysis. Remarkably, this analysis found that only 3.6% of the neurons have a significantly higher regression coefficient on an action-value from their session than on other action-values (Figure 3C). Similar results were obtained when performing a similar model-free permutation analysis (regression of spike counts in the last 20 trials of the block on reward probabilities, not shown). These results raise the possibility that all or much of the apparent action-value representation in (Ito and Doya, 2009) is the result of the temporal correlations confound.

Trial-design experiments

Another way of overcoming the temporal correlations confound is to use a trial design experiment. The idea is to randomly mix the reward probabilities, rather than use blocks as in Figure 1. For example, we propose the experimental design depicted in Figure 4A. Each trial is presented in one of four clearly marked contexts (color coded). The reward probabilities associated with the two actions are fixed within a context but differ between the contexts. Within each context the participant learns to prefer the action associated with a higher probability of reward. Naively, we can regress the spike counts on the action-values estimated from behavior, as in Figure 1. However, because the estimated action-values are temporally correlated, this regression is still subject to the temporal correlations confound (Figure 2—figure supplement 9). Alternatively, we can regress the spike counts on the reward probabilities. If the contexts are randomly mixed, then by construction, the reward probabilities are temporally independent. These reward probabilities are the objective action-values. After learning, the subjective action-values are expected to converge to these reward probabilities. Therefore, the reward probabilities can be used as proxies for the subjective action-values after a sufficiently large number of trials. It is thus possible to conduct a regression analysis on the spike counts at the end of the experiment, with reward probabilities as predictors that do not violate the independence assumption.

A possible solution for the temporal correlations confound that is based on a trial design.

(A) A Q-learning model was simulated in 1,000 sessions of 400 trials, where the original reward probabilities (same as in Figure 1A) were associated with different cues and appeared randomly. Learning was done separately for each cue. Top panel: The first 20 trials in an example session. Background colors denote the reward probabilities in each trial. Black circles denote the learned value of action-value 1 in each trial. Top and bottom black lines denote choices of action 1 and 2, respectively. Long and short lines denote rewarded and unrewarded trials, respectively. Bottom panels: Two examples of the grouping of trials with the same reward probabilities to show the continuity in learning. Note that the action-value changes only when action 1 is chosen because it is the action-value associated with action 1. (B) and (C) population analysis for action-value neurons. 20,000 action-value neurons were simulated from the model in (A), similarly to the action-value neurons in Figure 1. For each neuron, the spike-counts in the last 200 trials of the session were regressed on the reward probabilities (see Materials and methods). Legend is the same as in Figure 1D–E. Dashed lines in (B) at t=2 denote the significance boundaries. Dashed lines in (C) denote the naïve expected false positive rate from the significance threshold (see Materials and methods). This analysis correctly identifies 59% of action-value neurons as such. (D) and (E) population analysis for random-walk neurons. 20,000 Random-walk neurons were simulated as in Figure 2. Same regression analysis as in (B) and (C). Only 8.5% of the random-walk neurons were erroneously classified as representing action-values (9.5% chance level).

To demonstrate this method, we simulated learning in a session composed of 400 trials, randomly divided into 4 different contexts (Figure 4). Learning followed the Q-learning equations (Equations 1 and 2), independently for each context. Next, we simulated action-value neurons, whose firing rate is a linear function of the action-value in each trial (dots in Figure 4A, upper panel). We regressed the spike counts of the neurons in the last 200 trials (approximately 50 trials in each context) on the corresponding reward probabilities (Figure 4B). Indeed, 59% of the neurons were classified this way as action-value neurons (Figure 4C, 9.5% is chance level). By contrast, considering random-walk neurons, only 8.5% were erroneously classified as action-value neurons, a fraction expected by chance.
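The procedure can be sketched for a single session and a single neuron. This is a minimal sketch under stated assumptions: the four pairs of reward probabilities, the learning rate, the inverse temperature, the initial action-values and the firing-rate parameters are illustrative values chosen here, and `simulate_session` and `t_values` are hypothetical helper names. Choice follows a logistic function of the difference between the two action-values, and only the chosen action-value is updated.

```python
import numpy as np

# Assumed values, not taken from the text above: the (p1, p2) reward
# probabilities of the four contexts, the learning rate, the inverse
# temperature, the initial action-values and the firing-rate parameters.
CONTEXT_PROBS = np.array([[0.1, 0.5], [0.9, 0.5], [0.5, 0.9], [0.5, 0.1]])
ALPHA, BETA = 0.1, 2.5

def t_values(y, X):
    """OLS t-values for the columns of X (an intercept is added, not returned)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sigma2 = resid @ resid / (len(y) - X1.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X1.T @ X1)))
    return (beta / se)[1:]

def simulate_session(rng, n_trials=400):
    """Q-learning with a separate pair of action-values for each context."""
    q = np.full((len(CONTEXT_PROBS), 2), 0.5)
    ctx = rng.integers(len(CONTEXT_PROBS), size=n_trials)  # contexts randomly mixed
    q1 = np.empty(n_trials)
    for t, c in enumerate(ctx):
        q1[t] = q[c, 0]  # action-value 1 in the current context, before the update
        p_choose_1 = 1.0 / (1.0 + np.exp(-BETA * (q[c, 0] - q[c, 1])))
        a = 0 if rng.random() < p_choose_1 else 1  # index 0 stands for action 1
        r = float(rng.random() < CONTEXT_PROBS[c, a])
        q[c, a] += ALPHA * (r - q[c, a])  # update only the chosen action-value
    return ctx, q1

rng = np.random.default_rng(0)
ctx, q1 = simulate_session(rng)
spikes = rng.poisson(2.0 + 4.0 * q1)  # action-value-1 neuron, rate linear in Q1

# Regress the last 200 trials on the reward probabilities, which are
# temporally independent because the contexts are randomly mixed.
last = slice(-200, None)
t_p1, t_p2 = t_values(spikes[last], CONTEXT_PROBS[ctx[last]])
is_action_value = (abs(t_p1) > 2) != (abs(t_p2) > 2)  # exactly one significant
print(t_p1, t_p2, is_action_value)
```

The predictors in the final regression are the objective reward probabilities of each trial's context, not the learned values, which is what removes the temporal structure from the predictor.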

Confound 2 – correlated decision variables

In the previous sections we demonstrated that irrelevant temporal correlations may lead to the erroneous classification of neurons as representing action-values, even if their activity is task-independent. Here we address an unrelated confound. We show that neurons that encode different decision variables, in particular policy, may be erroneously classified as representing action-values. For clarity, we will commence by discussing this caveat independently of the temporal correlations confound. Specifically, we show that neurons whose firing rate encodes the policy (probability of choice) may be erroneously classified as representing action-values, even when this policy emerged in the absence of any implicit or explicit action-value representation. We will conclude by discussing a possible solution that addresses this and the temporal correlations confounds.

For concreteness, we consider a particular reinforcement learning algorithm, in which the probability of choice $\Pr(a(t)=1)$ is determined by a single variable $W$ that is learned in accordance with the REINFORCE learning algorithm (Williams, 1992): $\Pr(a(t)=1) = \frac{1}{1+e^{-W(t)}}$, where $\Delta W(t) = \alpha \cdot (2 \cdot R(t) - 1) \cdot (a(t) - \Pr(a(t)=1))$, $\alpha$ is the learning rate, $R(t)$ is the binary reward in trial $t$ and $a(t)$ is a binary variable indicating whether action 1 was chosen in trial $t$. In our simulations $W(t=1)=0$ and $\alpha=0.17$. For biological implementation of this algorithm see (Loewenstein, 2010; Seung, 2003).
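The update rule can be written out as a short simulation. This is a minimal sketch: the function name `simulate_direct_policy` is hypothetical, and the reward probabilities are held fixed for the whole run, which is a simplification relative to a block design. The initial value of W and the learning rate follow the values stated above.

```python
import numpy as np

def simulate_direct_policy(reward_probs, n_trials, alpha=0.17, seed=0):
    """Direct-policy (REINFORCE) learner with a single variable W.

    reward_probs: (p_reward_action_1, p_reward_action_2), fixed here (assumption).
    Returns the probability of choosing action 1 in every trial.
    """
    rng = np.random.default_rng(seed)
    w = 0.0  # W(t=1) = 0
    p_history = np.empty(n_trials)
    for t in range(n_trials):
        p1 = 1.0 / (1.0 + np.exp(-w))          # Pr(a(t) = 1)
        a = 1 if rng.random() < p1 else 0      # a(t) = 1 if action 1 was chosen
        p_reward = reward_probs[0] if a == 1 else reward_probs[1]
        r = 1 if rng.random() < p_reward else 0
        w += alpha * (2 * r - 1) * (a - p1)    # no action-value is ever computed
        p_history[t] = p1
    return p_history

p = simulate_direct_policy((0.9, 0.1), n_trials=500)
print("Pr(a=1), first trial:", p[0], "mean over last 100 trials:", p[-100:].mean())
```

Note that the learner stores only W; nothing in the loop corresponds to an estimate of the reward associated with either action.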

We tested this model in the experimental design of Figure 1 (Figure 5A). As expected, the model learned to prefer the action associated with a higher probability of reward, completing the four blocks within 228 trials on average (standard deviation 62 trials).

Erroneous detection of action-value representation in policy neurons.

(A) Behavior of the model in an example session, same as in Figure 1A for the direct-policy model. (B) Red and blue lines denote ‘action-values’ 1 and 2, respectively, calculated from the choices and rewards in (A). Note that the model learned without any explicit or implicit calculation of action-values. The extraction of action-values in (B) is based on the fitting of Equation 1 to the learning behavior. (C) Strong correlation between policy from the direct-policy algorithm and action-values extracted by fitting Equation 1 to behavior. The three panels depict probability of choice as a function of the difference between the calculated action-values (left), ‘Q1’ (center) and ‘Q2’ (right). This correlation can cause policy neurons to be erroneously classified as representing action-values. (D) and (E) Population analysis, same as in Figure 1D and E for the policy neurons. Legend and number of neurons are also as in Figure 1D and E. Dashed lines in (D) at t=2 denote the significance boundaries. Dashed lines in (E) denote the naïve expected false positive rate from the significance threshold (see Materials and methods).

Despite the fact that the learning was value-independent, we can still fit a Q-learning model to the behavior, extract best-fit model parameters and compute action-values (see also Figure 2—figure supplement 1). The computed action-values are presented in Figure 5B. Note that according to Equation 2, the probability of choice is a monotonic function of the difference between Q1 and Q2. Therefore, we expect that the probability of choice will be correlated with the computed Q1 and Q2, with opposite signs (Figure 5C).

We simulated policy neurons as Poisson neurons whose firing rate is a linear function of the policy Pr(a(t)=1) (Materials and methods). Next, we regressed the spike counts of these neurons on the two action-values that were computed from behavior (same as in Figures 1D,E and 2B,C, Figure 2—figure supplement 1C,D, Figure 2—figure supplement 2B,D and Figure 2—figure supplement 3). Indeed, as expected, 14% of the neurons were significantly correlated with both action-values with opposite signs (chance level for each action-value is 5%, naïve chance level for both with opposite signs is 0.125%, see Materials and methods), as depicted in Figure 5D,E. These results demonstrate that neurons representing value-independent policy can be erroneously classified as representing ΔQ.

Neurons representing policy may be erroneously classified as action-value neurons

Surprisingly, 38% of policy neurons were significantly correlated with exactly one estimated action-value, and therefore would have been classified as action-value neurons in the standard method of analysis (9.5% chance level).

To understand why this erroneous classification emerged, we note that a neuron is classified as representing an action-value if its spike count is significantly correlated with one of the action-values, but not with the other. The confound that led to the classification of policy neurons as representing action-values is that a lack of statistically significant correlation is erroneously taken to imply lack of correlation. All policy neurons are modulated by the probability of choice, a variable that is correlated with the difference in the two action-values. Therefore, this probability of choice is expected to be correlated with both action-values, with opposite signs. However, because the neurons are Poisson, the spike count of the neurons is a noisy estimate of the probability of choice. As a result, in most cases (86%), the two regression coefficients do not both cross the significance threshold. More often than both crossing it (38% versus 14%), exactly one of them crosses the significance threshold, resulting in an erroneous classification of the neurons as representing action-values.
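The effect of limited statistical power can be illustrated with a toy simulation. This is a minimal sketch and not a reproduction of Figure 5: the stand-in action-values are drawn independently in every trial (so there is no temporal correlations confound), and the inverse temperature, firing-rate parameters and the helper names `t_values` and `classify_policy_neurons` are assumptions made for illustration. Every simulated neuron is modulated only by the policy, yet a sizeable fraction has exactly one significant regression coefficient.

```python
import numpy as np

def t_values(y, X):
    """OLS t-values for the columns of X (an intercept is added, not returned)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sigma2 = resid @ resid / (len(y) - X1.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X1.T @ X1)))
    return (beta / se)[1:]

def classify_policy_neurons(n_neurons=500, n_trials=200, beta=2.5, seed=0):
    """Fraction of policy neurons with 0, 1 or 2 significant coefficients."""
    rng = np.random.default_rng(seed)
    labels = ("none", "exactly_one", "both")
    counts = dict.fromkeys(labels, 0)
    for _ in range(n_neurons):
        q = rng.random((n_trials, 2))  # stand-in action-values (assumption)
        policy = 1.0 / (1.0 + np.exp(-beta * (q[:, 0] - q[:, 1])))
        spikes = rng.poisson(2.0 + 1.5 * policy)  # rate depends on policy only
        n_significant = int(np.sum(np.abs(t_values(spikes, q)) > 2))
        counts[labels[n_significant]] += 1
    return {k: v / n_neurons for k, v in counts.items()}

fractions = classify_policy_neurons()
print(fractions)
```

Neurons in the "exactly_one" category would be labeled action-value neurons by the standard analysis, although by construction none of them is.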

Is this confound relevant to the question of action-value representation in the striatum?

If choice is included as a predictor, is policy representation still a relevant confound?

An alternative approach has been to consider only those neurons whose spike count is not significantly correlated with choice (Stalnaker et al., 2010; Wunderlich et al., 2009). Repeating this analysis for the Figure 5 policy neurons, we still find that 24% of the neurons are erroneously classified as action-value neurons (8% are classified as policy neurons).

Is this confound the result of an analysis that is biased against policy representation?

The analysis depicted in Figures 1D,E, 2B,C, 4B–E and 5D,E is biased towards classifying neurons as action-value neurons, at the expense of state or policy neurons, as noted by (Wang et al., 2013). This is because action-value classification is based on a single significant regression coefficient whereas policy or state classification requires two significant regression coefficients. Therefore, (Wang et al., 2013) have proposed an alternative approach. First, compute the statistical significance of the whole regression model for each neuron (using f-value). Then, classify those significant neurons according to the t-values corresponding to the two action-values (Figure 5—figure supplement 1B). Applying this analysis to the policy neurons of Figure 5 with a detection threshold of 5% we find that indeed, this method is useful in detecting which decision variables are more frequently represented (its major use in [Wang et al., 2013]): 25% of the neurons are classified as representing policy (1.25% expected by chance). Nevertheless, 12% of the neurons are still erroneously classified as action-value neurons (2.5% expected by chance; Figure 5—figure supplement 1B).

Additional issues

In many cases, the term action-value was used, while the reported results were equally consistent with other decision variables. In some cases, significant correlation with both action-values (with opposite signs) or significant correlation with the difference between the action-values was used as evidence for ‘action-value representations’ (FitzGerald et al., 2012; Guitart-Masip et al., 2012; Kim et al., 2012; 2007; Stalnaker et al., 2010). Similarly, other papers did not distinguish between neurons whose activity is significantly correlated with one action-value and those whose activity is correlated with both action-values (Funamizu et al., 2015; Her et al., 2016; Kim et al., 2013; Kim et al., 2009). Finally, one study used a concurrent variable-interval schedule, in which the magnitudes of rewards associated with each action were anti-correlated (Lau and Glimcher, 2008). In such a design, the two probabilities of reward depend on past choices and therefore, the objective values associated with the actions change on a trial-by-trial basis and are, in general, correlated.

A possible solution to the policy confound

The policy confound emerged because policy and action-values are correlated. To distinguish between the two possible representations, we should seek a variable that is correlated with the action-value but uncorrelated with the policy. Consider the sum of the two action-values. It is easy to see that $\mathrm{Corr}(Q_1+Q_2,\, Q_1-Q_2) \propto \mathrm{Var}(Q_1) - \mathrm{Var}(Q_2)$. Therefore, if the variances of the two action-values are equal, their sum is uncorrelated with their difference. An action-value neuron is expected to be correlated with the sum of action-values. By contrast, a policy neuron, modulated by the difference in action-values, is expected to be uncorrelated with this sum.
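For completeness, the proportionality follows from the bilinearity of the covariance; the two cross terms cancel, whatever the covariance between the two action-values:

```latex
\begin{align*}
\mathrm{Cov}(Q_1+Q_2,\; Q_1-Q_2)
  &= \mathrm{Var}(Q_1) - \mathrm{Cov}(Q_1,Q_2) + \mathrm{Cov}(Q_2,Q_1) - \mathrm{Var}(Q_2) \\
  &= \mathrm{Var}(Q_1) - \mathrm{Var}(Q_2), \\
\mathrm{Corr}(Q_1+Q_2,\; Q_1-Q_2)
  &= \frac{\mathrm{Var}(Q_1) - \mathrm{Var}(Q_2)}
          {\sqrt{\mathrm{Var}(Q_1+Q_2)\,\mathrm{Var}(Q_1-Q_2)}}.
\end{align*}
```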

We repeated the simulations of Figure 4 (which addresses the temporal correlations confound), considering three types of neurons: action-value neurons (of Figure 1), random-walk neurons (of Figure 2), and policy neurons (of Figure 5). As in Figure 4, we considered the spike counts of the three types of neurons in the last 200 trials of the session, but now we regressed them on the sum of reward probabilities (state; in this experimental design the reward probabilities are also the objective action-values, which the subject learns). We found that only 4.5 and 6% of the random-walk and policy neurons, respectively, were significantly correlated with the sum of reward probabilities (5% chance level). By contrast, 47% of the action-value neurons were significantly correlated with this sum.

This method is able to distinguish between policy and action-value representations. However, it will fail in the case of state representation because both state and action-values are correlated with the sum of probabilities of reward. To dissociate between state and action-value representations, we can consider the difference in reward probabilities because this difference is correlated with the action-values but is uncorrelated with the state. Regressing the spike count on both the sum and difference of the probabilities of reward, a random-walk neuron is expected to be correlated with none, a policy neuron is expected to be correlated only with the difference, whereas an action-value neuron is expected to be correlated with both (this analysis is inspired by Fig. S8b in Wang et al., 2013, in which the predictors in the regression model were policy and state). We now classify a neuron that passes both significance tests as an action-value neuron. Indeed, for a significance threshold of p<0.05 (for each test), only 0.2% of the random-walk neurons and 5% of the policy neurons were classified as action-value neurons. By contrast, 32% of the action-value neurons were classified as such (Figure 6). Note that in this analysis, only when more than 5% of the neurons are classified as action-value neurons do we have support for the hypothesis that there is action-value representation rather than policy or state representation.
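The classification rule can be sketched as follows. This is a minimal sketch under simplifying assumptions: the neurons are driven directly by the objective reward probabilities (as if learning had fully converged), the four pairs of reward probabilities and the firing-rate functions are illustrative choices, and `classify` and `fraction_action_value` are hypothetical helper names. The fractions it prints are therefore not comparable to those reported above; the sketch only shows the structure of the test.

```python
import numpy as np

# Assumed (p1, p2) reward probabilities of the four contexts; the two
# actions have equal variance of reward probability across contexts.
CONTEXT_PROBS = np.array([[0.1, 0.5], [0.9, 0.5], [0.5, 0.9], [0.5, 0.1]])

def t_values(y, X):
    """OLS t-values for the columns of X (an intercept is added, not returned)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sigma2 = resid @ resid / (len(y) - X1.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X1.T @ X1)))
    return (beta / se)[1:]

def classify(spikes, p, thresh=2.0):
    """Action-value neuron: significant on BOTH the sum and the difference."""
    X = np.column_stack([p[:, 0] + p[:, 1], p[:, 0] - p[:, 1]])
    t_sum, t_diff = t_values(spikes, X)
    return bool(abs(t_sum) > thresh and abs(t_diff) > thresh)

def fraction_action_value(rate_fn, n_neurons=200, n_trials=200, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_neurons):
        p = CONTEXT_PROBS[rng.integers(len(CONTEXT_PROBS), size=n_trials)]
        hits += classify(rng.poisson(rate_fn(p)), p)
    return hits / n_neurons

# Hypothetical firing-rate functions (expected spikes per trial).
action_value = lambda p: 2.0 + 4.0 * p[:, 0]
policy = lambda p: 2.0 + 4.0 / (1.0 + np.exp(-2.5 * (p[:, 0] - p[:, 1])))
unmodulated = lambda p: np.full(len(p), 4.0)

for name, fn in [("action-value", action_value), ("policy", policy), ("unmodulated", unmodulated)]:
    print(name, fraction_action_value(fn))
```

With these reward probabilities the sum and the difference are uncorrelated across contexts, so a neuron modulated only by the difference passes the test on the sum only at the false-positive rate of that single test.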

A possible solution for the policy and state confounds.

(A) The Q-learning model (Equations 1 and 2) was simulated in 1,000 sessions of 400 trials each, where the reward probabilities were associated with different cues and were randomly chosen in each trial, as in Figure 4. Learning occurred separately for each cue. In each session 20 action-value neurons, whose firing rate is proportional to the action-values (as in Figure 1), were simulated. For each neuron, the spike counts in the last 200 trials of each session were regressed on the sum of the reward probabilities (ΣQ; state) and the difference of the reward probabilities (ΔQ; policy, see Materials and methods). Each dot denotes the t-values of the two regression coefficients of each of 500 example neurons. Dashed lines at t=2 denote the significance boundaries. Neurons that had significant regression coefficients on both policy and state were identified as action-value neurons. Colors as in Figure 1D. (B) Population analysis revealed that 32% of the action-value neurons were identified as such. Error bars are the standard error of the mean. Dashed black line denotes the expected false positive rate from randomly modulated neurons. Dashed gray line denotes the expected false positive rate from policy or state neurons (see Materials and methods). (C) Same as in (A) with random-walk neurons, numbers are as in Figure 2. (D) Population analysis revealed that less than 1% of the random-walk neurons were erroneously classified as representing action-values. (E-F) To test the policy neurons, we simulated a direct-policy learning algorithm (as in Figure 5) in the same sessions as in (A-D). Learning occurred separately for each cue. In each session 20 policy neurons, whose firing rate is proportional to the probability of choice (as in Figure 5), were simulated. As in (A-D), the spike counts in the last 200 trials of each session were regressed on the sum and difference of the reward probabilities. (E) Each dot denotes the t-values of the two regression coefficients of each of 500 example neurons. (F) Population analysis. As expected, only 5% of the policy neurons were erroneously classified as representing action-values.

A word of caution is that the analysis should be performed only after the learning converges. This is because stochastic fluctuations in the learning process may be reflected in the activities of neurons representing decision-related variables. As a result, policy or state-representing neurons may appear correlated with the orthogonal variables. For the same reason, any block-related heterogeneity in neural activity could also result in this confound (O'Doherty, 2014).

To conclude, it is worthwhile repeating the key features of the analysis proposed in this section:

Trial design is necessary because otherwise temporal correlations in spike count may inflate the fraction of neurons that pass the significance tests.

Regression should be performed on reward probabilities (i.e., the objective action-values) and not on estimated action-values. The reason is that because the estimated action-values evolve over time, this trial design does not eliminate all temporal correlations between them (Figure 2—figure supplement 9).

Reward probabilities associated with the two actions should be chosen such that their variances are equal. Otherwise, policy or state neurons may be erroneously classified as action-value neurons.

Discussion

In this paper, we performed a systematic literature search to discern the methods that have been previously used to infer the representation of action-values in the striatum. We showed that none of these methods overcome two critical confounds: (1) neurons with temporal correlations in their firing rates may be erroneously classified as representing action-values and (2) neurons whose activity co-varies with other decision variables, such as policy, may also be erroneously classified as representing action-values. Finally, we discuss possible experiments and analyses that can address the question of whether neurons encode action-values.

Temporal correlations and action-value representations

It is well known in statistics that the regression coefficient between two independent slowly-changing variables is on average larger (in absolute value) than this coefficient when the series are devoid of a temporal structure. If these temporal correlations are overlooked, the probability of a false-positive is underestimated (Granger and Newbold, 1974). When searching for action-value representation in a block design, then by construction, there are positive correlations in the predictor (action-values). Positive temporal correlations in the dependent variable (neural activity) will result in an inflation of the false-positive observations, compared with the naïve expectation.
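This inflation is easy to reproduce numerically. The following minimal sketch (helper names `slope_t_value` and `false_positive_rate` are hypothetical, and Gaussian random walks are used as a generic example of slowly changing series) regresses pairs of independent series on each other and counts how often the slope appears significant at the nominal threshold of |t| > 2.

```python
import numpy as np

def slope_t_value(x, y):
    """t-value of the slope in a simple linear regression of y on x."""
    x = x - x.mean()
    y = y - y.mean()
    beta = (x @ y) / (x @ x)
    resid = y - beta * x
    se = np.sqrt(resid @ resid / (len(x) - 2) / (x @ x))
    return beta / se

def false_positive_rate(make_series, n_pairs=500, n_trials=200, seed=0):
    """Fraction of independent pairs whose regression slope has |t| > 2."""
    rng = np.random.default_rng(seed)
    t = np.array([
        slope_t_value(make_series(rng, n_trials), make_series(rng, n_trials))
        for _ in range(n_pairs)
    ])
    return float(np.mean(np.abs(t) > 2))

def white_noise(rng, n):
    return rng.normal(size=n)             # no temporal structure

def random_walk(rng, n):
    return np.cumsum(rng.normal(size=n))  # slowly changing series

fp_white = false_positive_rate(white_noise)
fp_walk = false_positive_rate(random_walk)
print("white noise:", fp_white, "random walks:", fp_walk)
```

In both cases the two series are independent by construction, so every significant slope is a false positive; only the temporal structure of the series differs between the two conditions.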

This confound occurs only when there are temporal correlations in both the predictor and the dependent variable. In a trial design, in which the predictor is chosen independently in each trial and thus has no temporal structure, we do not expect this confound. However, when studying incremental learning, it is difficult to randomize the predictor in each trial, making the task of identifying neural correlates of learning, and specifically action-values, challenging. With respect to the dependent variable (neural activity), temporal correlations in BOLD signal and their consequences have been discussed (Arbabshirani et al., 2014; Woolrich et al., 2001). Considering electrophysiological recordings, there have been attempts to remove these correlations, for example, using previous spike counts as predictors (Kim et al., 2013). However, these are not sufficient because they are unable to remove all task-independent temporal correlations (see also Figure 2—figure supplements 4–10). When repeating these analyses, we erroneously classified a fraction of neurons as representing action-value that is comparable to that reported in the striatum. The probability of a false-positive identification of a neuron as representing action-value depends on the magnitude and type of temporal correlations in the neural activity. Therefore, we cannot predict the fraction of erroneously classified neurons expected in various experimental settings and brain areas.

One may argue that the fact that action-value representations are reported mostly in a specific brain area, namely the striatum, is an indication that their identification there is not a result of the temporal correlations confound. However, because different brain regions are characterized by different spiking statistics, we expect different levels of erroneous identification of action-value neurons in different parts of the brain and in different experimental settings. Indeed, the fraction of erroneously identified action-value neurons differed between the auditory and motor cortices (compare B and D within Figure 2—figure supplement 2). Furthermore, many studies reported action-value representation outside of the striatum, in brain areas including the supplementary motor area and presupplementary eye fields (Wunderlich et al., 2009), the substantia nigra/ventral tegmental area (Guitart-Masip et al., 2012) and ventromedial prefrontal cortex, insula and thalamus (FitzGerald et al., 2012).

Considering the ventral striatum, our analysis on recordings from (Ito and Doya, 2009) indicates that the identification of action-value representations there may have been erroneous, resulting from temporally correlated firing rates (Figure 3 and Figure 2—figure supplement 3). It should be noted that the fraction of action-value neurons reported in (Ito and Doya, 2009) is low relative to other publications, a difference that has been attributed to the location of the recording in the striatum (ventral as opposed to dorsal). It would be interesting to apply this method to other striatal recordings (Ito and Doya, 2015a; Samejima et al., 2005; Wang et al., 2013). We were unable to directly analyze these recordings from the dorsal striatum because relevant raw data is not publicly available. However, previous studies have reported that the firing rates of dorsal-striatal neurons change slowly over time (Gouvêa et al., 2015; Mello et al., 2015). As a result, identification of apparent action-value representation in dorsal-striatal neurons may also be the result of this confound.

Temporal correlations naturally emerge in experiments composed of multiple trials. Participants become satiated, bored, tired, etc., which may affect neuronal activity. In particular, learning in operant tasks is associated, by construction, with variables that are temporally correlated. If neural activity is correlated with performance (e.g., accumulated rewards in the last several trials) then it is expected to have temporal correlations, which may lead to an erroneous classification of the neurons as representing action-values.

Temporal correlations – beyond action-value representation

Action-values are not the only example of slowly-changing variables. Any variable associated with incremental learning, motivation or satiation is expected to be temporally correlated. Even 'benign' behavioral variables, such as the location of the animal or the activation of different muscles may change at relatively long time-scales. When recording neural activity related to these variables, any temporal correlations in the neural recording, be it in fMRI, electrophysiology or calcium imaging may result in an erroneous identification of correlates of these behavioral variables because of the temporal correlation confound.

In general, the temporal correlation confound can be addressed by using the permutation analysis of Figure 3, which can provide strong support to the claim that the activity of a particular neuron or voxel co-varies with the behavioral variable. Therefore, the permutation test is a general solution for scientists studying slow processes such as learning. More challenging, however, is precisely identifying what the activity of the neuron represents (for example an action-value or policy). There are no easy solutions to this problem and therefore caution should be applied when interpreting the data.

Differentiating action-value from other decision variables

Another difficulty in identifying action-value neurons is that they are correlated with other decision variables such as policy, state or chosen-value. Therefore, finding a neuron that is significantly correlated with an action-value could be the byproduct of its being modulated by other decision variables, in particular policy. The problem is exacerbated by the fact that standard analyses (e.g., Figure 1D–E) are biased towards classifying neurons as representing action-values at the expense of policy or state.

As shown in Figure 6, policy representation can be ruled out by finding a representation that is orthogonal to policy, namely state representation. This solution leads us, however, to a serious conceptual issue. All analyses discussed so far are based on significance tests: we divide the space of hypothesis into the ‘scientific claim’ (e.g., neurons represent action-values) and the null hypothesis (e.g., neural activity is independent of the task). An observation that is not consistent with the null hypothesis is taken to support the alternative hypothesis.

The problem we faced with correlated variables is that the null hypothesis and the ‘scientific claim’ were not complementary. A neuron that represents policy is expected to be inconsistent with the null hypothesis that neural activity is independent of the task but it is not an action-value neuron. The solution proposed was to devise a statistical test that seeks to identify a representation that is correlated with action-value and is orthogonal to the policy hypothesis, in order to also rule out a policy representation.

However, this does not rule out other decision-related representations. A ‘pure’ action-value neuron is modulated only by Q1 or by Q2. A ‘pure’ policy neuron is modulated exactly by Q1-Q2. More generally, we may want to consider the hypothesis that the neuron is modulated by a different combination of the action-values, a∙Q1+b∙Q2, where a and b are parameters. For every such set of parameters a and b we can devise a statistical test to reject this hypothesis by considering the direction that is orthogonal to the vector (a, b). In principle, this procedure should be repeated for every pair of parameters a and b that is not consistent with the action-value hypothesis.

Put differently, in order to find neurons that represent action-values, we first need to define the set of parameters a and b such that a neuron whose activity is modulated by a∙Q1+b∙Q2 will be considered as representing an action-value. Only after this (arbitrary) definition is given, can we construct a set of statistical tests that will rule out the competing hypotheses, namely will rule out all values of a and b that are not in this set. The analysis of Figure 6 implicitly defined the set of a and b such that a≠b and a≠-b as the set of parameters that defines action-value representations. In practice, it is already very challenging to identify action-values using the procedure of Figure 6 and going beyond it seems impractical. Therefore, studying the distribution of t-values across the population of neurons may be more useful when studying representations of decision variables than asking questions about the significance of individual neurons.

Importantly, the regression models described in this paper allow us to investigate only some types of representations, namely, linear combinations of the two action-values. However, value representations in learning models may fall outside of this regime. It has been suggested that in decision making, subjects calculate the ratio of action-values (Worthy et al., 2008), or that subjects compute, for each action, the probability that it is associated with the highest value (Morris et al., 2014). Our proposed solution cannot support or refute these alternative hypotheses. If these are taken as additional alternative hypotheses, a neuron should be classified as representing an action-value if its activity is also significantly modulated in the directions that are correlated with action-value and are orthogonal to these hypotheses. Clearly, it is never possible to construct an analysis that can rule out all possible alternatives.

We believe that the confounds that we described have been overlooked because the null hypothesis in the significance tests was not made explicit. As a result, the complementary hypothesis was not explicitly described and the conclusions drawn from rejecting the null hypothesis were too specific. That is, alternative plausible interpretations were ignored. It is important, therefore, to keep the alternative hypotheses explicit when analyzing the data, be it using significance tests or other methods, such as model comparison (Ito and Doya, 2015b).

Are action-value representations a necessary part of decision making?

One may argue that the question of whether neurons represent action-value, policy, state or some other correlated variable is not an interesting question. This is because all these correlated decision variables implicitly encode action-values. Even direct-policy models can be taken to implicitly encode action-values because policy is correlated with the difference between the action-values. However, we believe that the difference between action-value representation and representation of other variables is an important one, because it centers on the question of the computational model that underlies decision making in these tasks. Specifically, the implication of a finding that a population of neurons represents action-values is not that these neurons are involved somehow in decision making. Rather, we interpret this finding as supporting the hypothesis that action-values are explicitly computed in the brain, and that these action-values play a specific role in the decision making process. However, if the results are also consistent with various alternative computational models then this is not the case. Some consider action-value computation to be a necessary part of decision making. By contrast, however, we presented here two models of learning and decision making that do not entail this computation (Figure 2—figure supplement 1, Figure 5). Other examples are discussed in (Mongillo et al., 2014; Shteingart and Loewenstein, 2014) and references therein.

The involvement of the basal ganglia in general and the striatum in particular in operant learning, planning and decision-making is well documented (Ding and Gold, 2010; McDonald and White, 1993; O'Doherty et al., 2004; Palminteri et al., 2012; Schultz, 2015; Tai et al., 2012; Thorn et al., 2010; Yarom and Cohen, 2011). However, there are alternatives to the possibility that the firing rate of striatal neurons represents action-values. First, as discussed above, learning and decision making do not entail action-value representation. Second, it is possible that action-value is represented elsewhere in the brain. Finally, it is also possible that the striatum plays an essential role in learning, but that the representation of decision variables there is distributed and neural activity of single neurons could reflect a complex combination of value-related features, rather than ‘pure’ decision variables. Such complex representations are typically found in artificial neural networks (Yamins and DiCarlo, 2016).

Finally, we would like to emphasize that we do not claim that there is no representation of action-value in the striatum. Rather, our results show that special caution should be applied when relating neural activity to reinforcement-learning related variables. Therefore, the prevailing belief that neurons in the striatum represent action-values must await further tests that address the confounds discussed in this paper.

Materials and methods

Literature search

In order to thoroughly examine the finding of action-value neurons in the striatum, we conducted a literature search to find all the different approaches used to identify action-value representation in the striatum and see whether they are subject to at least one of the two confounds we described here.

The key words ‘action-value’ and ‘striatum’ were searched for in Web-of-Knowledge, Pubmed and Google Scholar, returning 43, 21 and 980 results, respectively. In the first screening stage, we excluded all publications that did not report new experimental results (e.g., reviews and theoretical papers), focused on other brain regions, or did not address value-representation or learning. In the remaining publications, the abstract of the publication was read and the body of the article was searched for ‘action-value’ and ‘striatum’. After this step, articles in which it was possible to find a description of action-value representation in the striatum were read thoroughly. The search included PhD theses, but none were found to report new relevant data that did not also appear in papers. We identified 22 papers that directly related neural activity in the striatum to action-values. These papers included reports of single-unit recordings, fMRI experiments and manipulations of striatal activity.

Of these, two papers have used the term action-value to refer to the value of the chosen action (also known as chosen-value) (Day et al., 2011; Seo et al., 2012) and therefore we do not discuss them.

An additional study (Pasquereau et al., 2007) used the expected reward and the chosen action as predictors of the neuronal activity and found neurons that were modulated by the expected reward, the chosen action and their interaction. The authors did not claim that these neurons represent action-values, but it is possible that these neurons were modulated by the values of specific actions. However, the representation of the value of the action when the action is not chosen is a crucial part of action-value representation which differentiates it from the representation of expected reward, and the values of the actions when they were not chosen were not analyzed in this study. Therefore, the results of this study cannot be taken as an indication for action-value representation, rather than other decision variables.

In two additional papers, it was shown that the activation of striatal neurons changes animals’ behavior, and the results were interpreted in the action-value framework (Lee et al., 2015; Tai et al., 2012). However, a change in policy does not entail an action-value representation (see, for example, Figure 5 and Figure 2—figure supplement 1). Therefore, these papers were not taken as strong support for the striatal action-value representation hypothesis.

Taken together, we concluded that previous reports on action-value representation in the striatum could reflect the representation of other decision variables or temporal correlations in the spike count that are not related to action-value learning.

To model neurons whose firing rate is modulated by an action-value, we considered neurons whose firing rate changes according to:

$$f(t) = B + K \cdot r \cdot \left(Q_i(t) - 0.5\right) \tag{4}$$

where $f(t)$ is the firing rate in trial $t$, $B = 2.5$ Hz is the baseline firing rate, $Q_i(t)$ is the action-value associated with one of the actions $i \in \{1, 2\}$, $K = 2.35$ Hz is the maximal modulation and $r$ denotes the neuron-specific level of modulation, drawn from a uniform distribution, $r \sim U(-1, 1)$. The spike count in a trial was drawn from a Poisson distribution, assuming a 1 sec-long trial.
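The simulation of an action-value neuron can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the paper: the function names are hypothetical, and the Poisson sampler uses Knuth's multiplication method only so that the example needs nothing beyond the Python standard library.

```python
import math
import random

BASELINE_HZ = 2.5         # B in Equation 4
MAX_MODULATION_HZ = 2.35  # K in Equation 4


def poisson_sample(rate, rng):
    """Draw one Poisson variate with mean `rate` (Knuth's multiplication method)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_action_value_neuron(q_values, rng):
    """Return (rates, spike_counts) for one neuron, given Q_i(t) for every trial."""
    r = rng.uniform(-1.0, 1.0)  # neuron-specific modulation, r ~ U(-1, 1)
    rates = [BASELINE_HZ + MAX_MODULATION_HZ * r * (q - 0.5) for q in q_values]
    # Trials are 1 s long, so the expected spike count equals the rate in Hz.
    counts = [poisson_sample(rate, rng) for rate in rates]
    return rates, counts


if __name__ == "__main__":
    rng = random.Random(0)
    rates, counts = simulate_action_value_neuron([0.5, 0.55, 0.6, 0.64], rng)
    print(rates, counts)
```

Because action-values lie between 0 and 1, the firing rate in this sketch stays between 1.325 Hz and 3.675 Hz, so the Poisson rate is always positive.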

To model neurons whose firing rate is modulated by the policy, we considered neurons whose firing rate changes according to:

$$f(t) = B + K \cdot r \cdot \left(\Pr(a(t) = 1) - 0.5\right) \tag{5}$$

where $f(t)$ is the firing rate in trial $t$, $B = 2.5$ Hz is the baseline firing rate, and $\Pr(a(t) = 1)$ is the probability of choosing action 1 in trial $t$, which changes in accordance with REINFORCE (Williams, 1992) (see also Figure 5 and corresponding text). $K = 3$ Hz is the maximal modulation and $r$ denotes the neuron-specific level of modulation, drawn from a uniform distribution, $r \sim U(-1, 1)$. The spike count in a trial was drawn from a Poisson distribution, assuming a 1 sec-long trial.

In the covariance-based plasticity model, the decision-making network is composed of two populations of Poisson neurons: each neuron is characterized by its firing rate, and the spike count of a neuron in a trial (1 sec) is randomly drawn from a Poisson distribution. The chosen action corresponds to the population that fires more spikes in a trial (Loewenstein, 2010; Loewenstein and Seung, 2006). At the end of the trial, the firing rate of each of the neurons (in the two populations) is updated according to $f(t+1) = f(t) + \eta \cdot R(t) \cdot \left(s(t) - f(t)\right)$, where $f(t)$ is the firing rate in trial $t$, $\eta = 0.07$ is the learning rate, $R(t)$ is the reward delivered in trial $t$ ($R(t) \in \{0, 1\}$ in our simulations) and $s(t)$ is the measured (realized) firing rate in that trial, that is, the spike count in the trial. The initial firing rate of all simulated neurons is 2.5 Hz. The network model was tested in the operant learning task of Figure 1. A session was terminated (without further analysis) if the model was not able to choose the better option more than 14 out of 20 consecutive times for at least 200 trials in the same block. This occurred in 20% of the sessions. We simulated two populations of 1,000 neurons in 500 successful sessions. Note that because, on average, the empirical firing rate is equal to the true firing rate, $E[s(t)] = f(t)$, changes in the firing rate are driven, on average, by the covariance of reward and the empirical firing rate: $E[\Delta f(t)] \equiv E[f(t+1) - f(t)] = \eta \cdot \mathrm{Cov}\left(R(t), s(t)\right)$ (Loewenstein and Seung, 2006). The estimated action-values in Figure 2—figure supplement 1 were computed from the actions and rewards of the covariance model by assuming the Q-learning model (Equations 1 and 2).
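A single trial of the covariance-based plasticity model could be sketched as below. This is an illustrative sketch, not the simulation code behind the figures: the function names are hypothetical, and breaking ties between the two populations at random is an assumption, since the text does not specify what happens when both populations fire the same number of spikes.

```python
import math
import random

LEARNING_RATE = 0.07  # eta


def poisson_sample(rate, rng):
    """Draw one Poisson variate with mean `rate` (Knuth's multiplication method)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def update_rate(rate, reward, spike_count, eta=LEARNING_RATE):
    """Plasticity rule: f(t+1) = f(t) + eta * R(t) * (s(t) - f(t))."""
    return rate + eta * reward * (spike_count - rate)


def run_trial(rates_1, rates_2, reward_probs, rng):
    """Sample spikes, choose the population with more spikes, then update all rates."""
    spikes_1 = [poisson_sample(f, rng) for f in rates_1]
    spikes_2 = [poisson_sample(f, rng) for f in rates_2]
    total_1, total_2 = sum(spikes_1), sum(spikes_2)
    if total_1 == total_2:
        action = rng.choice((1, 2))  # assumption: ties are broken at random
    else:
        action = 1 if total_1 > total_2 else 2
    reward = 1 if rng.random() < reward_probs[action - 1] else 0
    new_1 = [update_rate(f, reward, s) for f, s in zip(rates_1, spikes_1)]
    new_2 = [update_rate(f, reward, s) for f, s in zip(rates_2, spikes_2)]
    return new_1, new_2, action, reward


if __name__ == "__main__":
    rng = random.Random(1)
    pop_1, pop_2 = [2.5] * 1000, [2.5] * 1000  # initial firing rate of 2.5 Hz
    for _ in range(100):
        pop_1, pop_2, action, reward = run_trial(pop_1, pop_2, (0.9, 0.1), rng)
    print(sum(pop_1) / len(pop_1), sum(pop_2) / len(pop_2))
```

Note that the rates change only on rewarded trials, and that no action-value is computed or stored anywhere in this update.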

The data in Figure 2—figure supplement 2A–B was recorded by Oren Peles in Eilon Vaadia's lab. It was recorded from one female monkey (Macaca fascicularis) at 3 years of age, using a 10 × 10 microelectrode array (Blackrock Microsystems) with 0.4 mm inter-electrode distance. The array was implanted in the arm area of M1, under anesthesia and aseptic conditions.

Behavioral Task: The monkey sat in a behavioral setup, awake and performing a Brain Machine Interface (BMI) and sensorimotor combined task. Spikes and Local Field Potentials were extracted from the raw signals of 96 electrodes. The BMI was provided through real time communication between the data acquisition system and custom-made software, which obtained the neural data, analyzed it and provided the monkey with the desired visual and auditory feedback, as well as the food reward. Each trial began with a visual cue, instructing the monkey to make a small hand movement to express alertness. The monkey was conditioned to enhance the power of beta band frequencies (20-30Hz) extracted from the LFP signal of 2 electrodes, receiving visual feedback from the BMI algorithm. When a required threshold was reached, the monkey received one of 2 visual cues and, following a delay period, had to report which of the cues it saw by pressing one of two buttons. Food reward and auditory feedback were delivered based on correctness of report. The duration of a trial was on average 14.2s. The inter-trial-interval was 3s following a correct trial and 5s after error trials. The data used in this paper consist of the spiking activity of 89 neurons recorded during the last second of inter-trial-intervals, taken from 600 consecutive trials in one recording session. Pairwise correlations were comparable to those previously reported (Cohen and Kohn, 2011), rSC=0.047±0.17 (SD), (rSC=0.037±0.21 for pairs of neurons recorded from the same electrode).

Animal care and surgical procedures complied with the National Institutes of Health Guide for the Care and Use of Laboratory Animals and with guidelines defined by the Institutional Committee for Animal Care and Use at the Hebrew University.

The auditory cortex recordings appearing in Figure 2—figure supplement 2C–D are described in detail in (Hershenhoren et al., 2014). In short, membrane potential was recorded intracellularly in the auditory cortex of halothane-anesthetized rats. The data consist of 125 experimental sessions recorded from 39 neurons. Each session consisted of 370 pure tone bursts. Tone duration was 50 ms with 5 ms linear rise/fall ramps. In the data presented here, trials began 50 ms prior to the onset of the tone burst. For each session, all trials were either 300 msec or 500 msec long. Trial length remained identical throughout a session and depended on the smallest interval between two tones in that session. Spike events were identified following high pass filtering with a corner frequency of 30Hz. Local maxima that were larger than 60 times the median of the absolute deviation from the median (MAD) were classified as spikes. The data presented here consist only of the spike counts in each trial, rather than the full membrane potential trace.

The basal ganglia recordings that are analyzed in Figure 3 and Figure 2—figure supplement 3 are described in detail in (Ito and Doya, 2009). In short, rats performed a combination of a tone discrimination task and a reward-based free-choice task. Extracellular voltage was recorded in the behaving rats from the NAc and VP using an electrode bundle. Spike sorting was done using principal component analysis. In total, 148 NAc and 66 VP neurons across 52 sessions were used for analyses (In 18 of the 70 behavioral sessions there were no neural recordings).

Estimation of action-values from choices and rewards

To imitate experimental procedures, we regressed the spike counts on estimates of the action-values, rather than the subjective action-values that underlay model behavior (to which the experimentalist has no direct access). To this end, for each session, we assumed that $Q_i(1) = 0.5$ and found the set of parameters $\hat{\alpha}$ and $\hat{\beta}$ that yielded the estimated action-values that best fit the sequence of actions in each experiment by maximizing the likelihood of the sequence. Action-values were estimated from Equation 1, using these estimated parameters and the sequence of actions and rewards. Overall, the estimated values of the parameters $\alpha$ and $\beta$ were comparable to the actual values used: on average, $\hat{\alpha} = 0.12 \pm 0.09$ (standard deviation) and $\hat{\beta} = 2.6 \pm 0.7$ (compare with $\alpha = 0.1$ and $\beta = 2.5$).
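The estimation procedure can be sketched as follows. The sketch assumes that Equation 1 is the standard update of the chosen action's value, Q(t+1) = Q(t) + α(R(t) − Q(t)), and that Equation 2 is a softmax (logistic) choice rule with inverse temperature β; both forms, the grid search used in place of a numerical optimizer, and all function names are assumptions made for illustration.

```python
import math


def estimate_action_values(actions, rewards, alpha, q_init=0.5):
    """Return [(Q1(t), Q2(t))] before each choice; only the chosen value is updated."""
    q = [q_init, q_init]  # Q_i(1) = 0.5
    history = []
    for action, reward in zip(actions, rewards):
        history.append((q[0], q[1]))
        i = action - 1  # actions are coded 1 or 2
        q[i] += alpha * (reward - q[i])
    return history


def log_likelihood(actions, rewards, alpha, beta):
    """Log-likelihood of the action sequence under the assumed softmax rule."""
    total = 0.0
    values = estimate_action_values(actions, rewards, alpha)
    for (q1, q2), action in zip(values, actions):
        p1 = 1.0 / (1.0 + math.exp(-beta * (q1 - q2)))
        total += math.log(p1 if action == 1 else 1.0 - p1)
    return total


def fit_parameters(actions, rewards, alphas=None, betas=None):
    """Grid search for the (alpha, beta) pair that maximizes the likelihood."""
    alphas = alphas or [k / 100 for k in range(1, 101)]
    betas = betas or [k / 10 for k in range(1, 101)]
    return max(
        ((a, b) for a in alphas for b in betas),
        key=lambda ab: log_likelihood(actions, rewards, ab[0], ab[1]),
    )


if __name__ == "__main__":
    actions = [1, 1, 2, 1, 1, 2, 1, 1]
    rewards = [1, 0, 0, 1, 1, 0, 1, 1]
    alpha_hat, beta_hat = fit_parameters(actions, rewards)
    print(alpha_hat, beta_hat)
    print(estimate_action_values(actions, rewards, alpha_hat)[:3])
```

A gradient-based optimizer would normally replace the grid search; the grid is used here only to keep the example dependency-free.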

Exclusion of neurons

Following standard procedures (Samejima et al., 2005), a sequence of spike-counts, either simulated or experimentally measured was excluded due to low firing rate if the mean spike count in all blocks was smaller than 1. This procedure excluded 0.02% (4/20,000) of the random-walk neurons and 0.03% (285/1,000,000) of the covariance-based plasticity neurons. Considering the auditory cortex recordings, we assigned each of the 125 spike counts to 40 randomly-selected sessions. 23% of the neural recordings (29/125) were excluded in all 40 sessions. Because blocks are defined differently in different sessions, some neural recordings were excluded only when assigned to some sessions but not others. Of the remaining 96 recordings, 14% of the recordings × sessions were also excluded. Similarly, considering the basal ganglia neurons, we assigned each of the 642 recordings (214 × 3 phases) to 40 randomly-selected sessions. 11% (74/(214 × 3)) of the recordings were excluded in all 40 sessions. Of the remaining 568 recordings, 9% of the recordings × sessions were also excluded. None of the simulated action-value neurons (0/20,000) or the motor cortex neurons (0/89) were excluded.

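The t-values of the regression of the spike counts on the estimated action-values were computed using the following regression model:

$$s(t) = \beta_0 + \beta_1 Q_1(t) + \beta_2 Q_2(t) + \epsilon(t) \tag{6}$$

where $s(t)$ is the spike count in trial $t$, $Q_1(t)$ and $Q_2(t)$ are the estimated action-values in trial $t$, $\epsilon(t)$ is the residual error in trial $t$ and $\beta_0$, $\beta_1$ and $\beta_2$ are the regression parameters.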

The computation of the t-values of the regression of the spike counts on the reward probabilities in the trial design experiment (as in Figure 4) was done using the following regression model:

$$s(t) = \beta_0 + \beta_1 RP_1(t) + \beta_2 RP_2(t) + \epsilon(t) \tag{7}$$

where $t$ denotes the trial, $s(t)$ is the mean spike count, $RP_1(t)$ and $RP_2(t)$ are the reward probabilities corresponding to action 1 and action 2, respectively (in this experimental design $RP$ could be 0.1, 0.5 or 0.9), $\epsilon(t)$ is the residual error and $\beta_0$, $\beta_1$ and $\beta_2$ are the regression parameters. Only the last 200 trials of the session were analyzed.
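For concreteness, the t-values of a regression with an intercept and two predictors, as in Equation 7, can be computed with ordinary least squares as sketched below. This is a dependency-free illustration with hypothetical function names, not the code used in the paper; a statistics package would normally be used instead.

```python
import math


def invert_3x3(m):
    """Invert a 3x3 matrix (list of rows) using the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("predictors are collinear")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def regression_t_values(spikes, rp1, rp2):
    """Return the t-values of beta_1 and beta_2 in s = b0 + b1*rp1 + b2*rp2 + error."""
    n = len(spikes)
    rows = [(1.0, u, v) for u, v in zip(rp1, rp2)]  # design matrix with intercept
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(3)] for j in range(3)]
    xty = [sum(r[j] * s for r, s in zip(rows, spikes)) for j in range(3)]
    inv = invert_3x3(xtx)
    beta = [sum(inv[j][k] * xty[k] for k in range(3)) for j in range(3)]
    residuals = [s - sum(b * x for b, x in zip(beta, r)) for r, s in zip(rows, spikes)]
    sigma2 = sum(e * e for e in residuals) / (n - 3)  # residual variance, n - 3 d.o.f.
    se = [math.sqrt(sigma2 * inv[j][j]) for j in range(3)]
    return beta[1] / se[1], beta[2] / se[2]


if __name__ == "__main__":
    rp1 = [0.1, 0.9, 0.5, 0.1, 0.9, 0.5, 0.1, 0.9]
    rp2 = [0.5, 0.1, 0.9, 0.9, 0.5, 0.1, 0.1, 0.9]
    spikes = [2, 4, 3, 2, 5, 3, 1, 4]
    print(regression_t_values(spikes, rp1, rp2))
```

The function assumes that the residual variance is non-zero, that is, that the spike counts are not fit perfectly by the two predictors.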

The computation of the t-values of the regression of the spike counts on state and policy in a trial design experiment (as in Figure 6) was done using the following regression model:

Significance of t-values slightly depends on session length. For the session lengths we considered, 0.05 significance bounds varied between 1.962 and 1.991. For consistency, we chose a single conservative bound of 2. Similarly, 0.025 and 0.01 significance bounds were chosen to be 2.3 and 2.64, respectively.

For all significance boundaries the false positive thresholds were computed naively, that is, assuming the analysis is not confounded in any way and that the two predictors are not correlated with each other. For example, assuming the false positive rate from a single t-test for a significant regression coefficient is $P$, for the standard analysis, the false positive rate for each action-value classification was defined as $P \cdot (1-P)$, and the false positive rate was equal for state and policy classification and was defined as $P^2/2$. In Figure 6 the false positive rate computed for random-walk neurons was $P^2/2$ for each action-value classification, and the false positive rate computed for state or policy neurons was $P/2$ for each action-value classification.
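The arithmetic of the naive false positive rates in the standard analysis can be checked with a few lines of code. This is a sketch with a hypothetical function name; it only restates the formulas given above for a single-test false positive rate P.

```python
def naive_false_positive_rates(p):
    """Naive rates for the standard analysis, assuming two independent t-tests."""
    return {
        "action_value_1": p * (1 - p),  # only the first coefficient is significant
        "action_value_2": p * (1 - p),  # only the second coefficient is significant
        "state": p ** 2 / 2,            # both significant, half of the sign combinations
        "policy": p ** 2 / 2,           # both significant, the other half
    }


if __name__ == "__main__":
    for name, rate in naive_false_positive_rates(0.05).items():
        print(f"{name}: {rate:.5f}")
```

With P = 0.05 this gives 0.0475 for each action-value classification and 0.00125 for each of the state and policy classifications. The four rates sum to 1 − (1 − P)², the probability that at least one of two independent tests is significant by chance.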

For each action-value and random-walk neuron, we computed the t-values of the regressions of its spike-count on estimated action-values from the sessions of Figure 1E. Because the number of trials can affect the distribution of t-values, we only considered in our analysis the first 170 trials of the 504 sessions that were at least 170 trials long. This number, which is approximately the median of the distribution of number of trials per session, was chosen as a compromise between the number of trials per session and the number of sessions. When performing the permutation test on the basal ganglia data we included all recordings and only the first 332 trials in each session, which is the smallest number of trials used in a session in this dataset.

Two points are noteworthy. First, the distribution of the t-values of the regression of the spike count of a neuron on all action-values depends on the neuron (see difference between distributions in Figure 3A). Similarly, the distribution of the t-values of the regression of the spike counts of all neurons on an action-value depends on the action-value (not shown). Therefore, the analysis could be biased in favor (or against) finding action-value neurons if the number of neurons analyzed from each session (and therefore are associated with the same action-values) differs between sessions. Second, this analysis does not address the correlated decision variables confound.

Finally, we would like to point out that there is an alternative way of performing the permutation test, which is applicable when the number of sessions is small, while the number of neurons recorded in a session is large. Instead of comparing the t-values from the regression of a neuron on different action-values, one can compare the t-values from different neurons on the same action-value. However, this method is only applicable under the assumption that the temporal correlations that are not related to action-value in the neuronal activity are similar between sessions.

In Figure 2—figure supplement 4 we considered the experiment and analysis described in (Kim et al., 2009). That experiment consisted of four blocks, each associated with a different pair of reward probabilities, (0.72, 0.12), (0.12, 0.72), (0.21, 0.63) and (0.63, 0.21), appearing in a random order, with the better option changing location with each block change. The number of trials in a block was preset, ranging between 35 and 45 with a mean of 40 (this is unlike the experiment described in Figure 1, in which termination of a block depended on performance).

First, we used Equations 1 and 2 to model learning behavior in this protocol. Then, we estimated the action-values according to choice and reward sequences, as in Figure 1. These estimated action-values were used for regression of the spike counts of the random-walk, motor cortex, auditory cortex, and basal ganglia neurons in the following way: each spike count sequence was randomly assigned to a particular pair of estimated action-values from one session. The spike count sequence was regressed on these estimated action-values. The resultant t-values were compared with the t-values of 1000 regressions of the spike-count, permuted within each block, on the same action-values. The p-value of this analysis was computed as the percentage of t-values from the permuted spike-counts that were higher in absolute value than the t-value from the regression of the original spike count. The significance boundary was set at p<0.025 (Kim et al., 2009). Neurons with at least one significant regression coefficient (rather than exactly one significant regression coefficient) were classified as action-value modulated neurons (Kim et al., 2009).
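The within-block permutation procedure described above can be sketched as follows. The function names are hypothetical, and `statistic` stands for any user-supplied function that maps a spike-count sequence to a t-value (for example, the t-value of one regression coefficient); the sketch returns a fraction rather than a percentage.

```python
import random


def permute_within_blocks(values, block_ids, rng):
    """Shuffle `values` only among trials that share the same block id."""
    by_block = {}
    for idx, block in enumerate(block_ids):
        by_block.setdefault(block, []).append(idx)
    permuted = list(values)
    for indices in by_block.values():
        shuffled = [values[i] for i in indices]
        rng.shuffle(shuffled)
        for i, v in zip(indices, shuffled):
            permuted[i] = v
    return permuted


def permutation_p_value(statistic, spikes, block_ids, n_permutations=1000, seed=0):
    """Fraction of within-block permutations with |statistic| above the observed one."""
    rng = random.Random(seed)
    observed = abs(statistic(spikes))
    exceed = sum(
        abs(statistic(permute_within_blocks(spikes, block_ids, rng))) > observed
        for _ in range(n_permutations)
    )
    return exceed / n_permutations


if __name__ == "__main__":
    spikes = [3, 2, 4, 3, 1, 0, 2, 1]
    blocks = [0, 0, 0, 0, 1, 1, 1, 1]
    predictor = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

    def covariance_like(s):
        # Toy statistic for demonstration; a regression t-value would be used in practice.
        return sum(x * y for x, y in zip(s, predictor))

    print(permutation_p_value(covariance_like, spikes, blocks, n_permutations=200))
```

Permuting within blocks preserves the block-wise mean spike counts, so the test is sensitive only to the trial-by-trial structure inside each block.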

Following (Asaad et al., 2000) we conducted an additional analysis with repeating blocks. We simulated learning behavior in the same experiment as in Figure 2—figure supplement 10. This experiment is composed of 8 blocks - the 4 blocks of Figure 1, repeated twice, in random permutation. We restricted our analysis to the 438 sessions with 332 trials or fewer (332 trials is the shortest session in the basal ganglia recording). Each spike count was analyzed 40 times, using 40 randomly-assigned sessions. For each block, we restricted the analysis to the neuronal activity in the last 20 trials of the block.

First, we conducted four one-way ANOVAs (using MATLAB’s anova1) to compare the neuronal activities in blocks associated with the same action-values (e.g., the neuronal activity in the two blocks, in which reward probabilities were (0.1,0.5)). Neurons were excluded from further analysis if we found a significant difference in their firing rates in at least one of these comparisons (df(columns)=1, df(error)=38, p<0.1). This procedure excludes from further analysis ‘drifting’ neurons, whose spike count significantly varied in the session.

Next, for each action-value we conducted a one-way ANOVA (using MATLAB’s anova1), which compared the neuronal activity between the two blocks in which the action-value was 0.1 and the two blocks in which the action-value was 0.9 (df(columns)=1, df(error)=78, p<0.01). We classified neurons as representing action-values if there was a significant difference between their firing rates for one action-value but not for the other.

Despite the removal of ‘drifting’ neurons, this analysis yielded an erroneous classification of action-value neurons in all datasets: random-walk neurons, 18%; motor cortex neurons, 12%; auditory cortex neurons, 5%; basal ganglia neurons, 9%. This is despite the fact that the expected false positive rate is only 2%. These results indicate that the exclusion of ‘drifting’ neurons as in (Asaad et al., 2000) does not solve the temporal correlations confound.

Data from the motor cortex, auditory cortex, and basal ganglia was the same as in Figure 2—figure supplements 2–3. Data for random-walk included 1000 newly simulated neurons, using the same parameters as in Figure 2 (this was done to create enough trials in each simulated spike count).

Decision letter

Timothy E Behrens

Reviewing Editor; University of Oxford, United Kingdom

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Striatal action-value neurons reconsidered" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Timothy Behrens as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

Elber-Dorozko and Loewenstein examine issues using trial-by-trial spike data to determine whether neural activity is associated with action-values. First, they note that standard regression inference can lead to false detection of action-value correlations when samples are not independent. They illustrate the prevalence of false detections using simulation and data with no plausible relation to action-values. Second, they note that different reinforcement learning models without action-value representations can yield significant action-value coding when analyzed using prior approaches. The authors present well-thought-out analyses and data that highlight analytic, experimental and conceptual difficulties in identifying action-value coding. Although both issues the authors examine have been acknowledged in the literature, and different approaches have been made to deal with or minimize their potential effects, the authors' paper represents an important synthesis and analysis that will likely lead to clearer future experiments.

However, there were several key concerns.

Firstly, the reviewers thought that there should be a more careful attention to the wider literature. In the reviewer discussion, this was thought to be of particular importance as you are making a technical point in a journal with a broad readership. It is particularly important that you are clear about which parts of the literature your results speak to.

For example:

1) The authors conclude that current methods of studying action-values are confounded, and propose that experiments should use actions whose reward values are not learned over time during the experiment, and instead are indicated by sensory cues with their values picked randomly on each trial.

However, the authors leave unmentioned the large literature of studies taking an alternate approach: recording striatal activity during planning and execution of instructed actions for cued reward outcomes, in which each trial randomizes the instructed action, the reward, or both. The authors should discuss what implications this literature has for whether the striatum encodes action-values vs. policies vs. other variables. Examples include the work from the labs of Schultz (e.g. Hassani et al., 2001; Cromwell and Schultz, 2003; Cromwell et al., 2005), Hikosaka (e.g. Kawagoe et al., 1998; Lauwereyns et al., Neuron 2004), and Kimura (e.g. Hori et al., 2009).

Most of these studies did not specifically claim that the activity they reported represents "action-values" in the reinforcement learning sense (and hence the present authors shouldn't feel obligated to try to debunk them), but they do seem highly relevant to the larger question the present authors raise. These studies did attempt to test whether neurons represented actions, values, and notably, their interaction (e.g. a cell whose activity scales more with action A's value than action B's value), which resembles the concept of "action-value".

Also, these studies may be somewhat resistant to the authors' criticisms about confounds from temporal correlations (since rewards were either explicitly cued, or kept deterministic and stable in well-learned blocks of trials, rather than slowly fluctuating during extended learning) and confounds with action probability (since the actions were instructed and hence a priori equally probable on each trial).

Of particular interest, a paper by Pasquereau et al. (2007) seems to fulfill all the requirements the present authors set for a test of striatal action-value coding; if so, this seems worthy of mention. That study manipulated the reward value of four actions (up, down, left, right), randomly assigning their reward probabilities on each trial and indicating them with visual cues. Unfortunately, as the present authors note, the study did not explicitly analyze their results in the action-value vs. chosen-value framework. However, the paper did report that some neurons had significant action x value interactions – for example, a cell that is more active when planning rightward movements (action), with stronger activity when the planned movement was more valuable (value), and with this value-modulation greater for rightward movements than other movements (action x value). This is not a pure chosen-value signal as the present authors seem to claim that paper reported. One could argue that it contains a key feature of an action-value signal as the value modulation is strongest for one specific action.

2) The authors correctly pointed out that some earlier studies of action value used a suboptimal task design and their conclusions need to be more rigorously validated. However, in the broader field, the potential risk of "drift" in neural recording has been well recognized. For example, "Neurophysiological experiments that compare activity across different blocks of trials must make efforts to be confident that any neural effects are not the result of artifacts of that design, such as slow-wave changes in neural activity over time." (Asaad et al., 2000). In the same Asaad et al. paper, a better design with repeated, alternated block types was used, similar in concept to randomized block design that the authors proposed here. Such designs have also been used in many neural studies of cognition – to name just two examples: value manipulations (Lauwereyns et al., 2002), rule manipulations (Mansouri et al., 2006). The problem thus seems relatively limited to one type of analysis that introduces temporal correlation across trials in an effort to estimate Q values. By the authors' account, this amounts to 5 papers from 3 different labs.

3) What about previous results arguing for prominence of a specific type of value representation? The authors touch on this, but it would be helpful to discuss specific results. In particular, the cited study of Wang et al., Nat Neuro 2013 reported that their unbiased angular measure of DMS value coding was distributed significantly non-uniformly, with net value (ΣQ) coding more prevalent than other types (their Figure 7). By contrast, the null hypothesis simulations in this paper predict very different results, either a uniform distribution (Figure 2—figure supplement 8) or a dearth of ΣQ neurons (Figure 5—figure supplement 1). The authors should discuss whether this previous result can therefore still be interpreted as evidence of value coding (at least, net-value coding), rather than strictly policy coding, in the striatum.

(Also, it is odd that the authors cite Wang et al. as a study that "claimed to provide direct evidence that neuronal activity in the striatum is specifically modulated by action-value", since the main result was specifically finding prevalence of net-value, not action-value coding).

4) I found the authors' choice of basal ganglia data misleading (Ito and Doya, 2009). First, because these data are recordings from the nucleus accumbens and ventral pallidum, which are not the first basal ganglia structures one thinks of as encoding action values. Second, because the original authors of the study from which the data was obtained noted that action value coding was low in these structures, leading them to suggest that action value coding was not the primary function of the nucleus accumbens and ventral pallidum. This is mentioned in the Results subsection “Permutation analysis of basal ganglia neurons”, but should be noted in the Discussion (the current text in the Results could probably just be moved).

5) Previous methods dealing with trial correlations have different success at reducing false positive rate of detecting action-values. In particular, the method of Wang et al. (2013) comes very close to attaining the correct size of the test for action-values. Indeed, it appears to be the only existing method from which one would reasonably conclude that the ventral striatal data set analyzed probably does not exhibit much action-value coding (2-3% above the expected 5%). I think it would be useful to have a figure in the main text comparing the different methods to the authors' permutation test (using for example, just the basal ganglia data set). In addition, Wang et al.’s method is also pretty good at identifying policy neurons, which is important because it could be applied retrospectively to existing data sets.

6) The authors' biggest suggestion for rigorously detecting a neural action value representation is "Don't use a task with learning, use a trial-based design where subjects associate contexts with well-learned sets of action values". That is perfectly fine for scientists whose goal is specifically to test whether a brain area encodes action values. However, what about the many scientists whose explicit goal is to study neural representation of time-varying values, and hence need to use learning tasks? Many scientists are studying (1) the neural basis of value learning, (2) brain areas specifically involved in early learning (not well-trained performance), (3) motivational variables specifically present for time-varying action values not well-trained ones (e.g. volatility, certain forms of uncertainty, etc.). If the authors can give an approach that will let these scientists make accurate estimates of neural time-varying value coding during learning tasks, that would certainly be valuable to the field.

I feel like their methods could potentially be used to achieve this in a straightforward way (by combining their novel permutation test from Part 1 of their paper with their method of testing for correlation with both sum and difference of values from Part 2 of their paper). But they don't lay this out explicitly in their paper at present, since they are more focused on the narrower implication ("Do striatal neurons encode action values?") rather than the broader implication of their results ("In general, how can one properly measure time-varying action values?").

Secondly, the reviewers had some particular concerns about the action value vs. policy representation issue. For example:

1) Regarding the second confound of policy vs. action value:

- The authors seem to be arguing against a straw-man version of how to relate neural activity to behavior. Typically we infer the underlying computations by testing how well different hypothesized models can fit the behavior and then search for correlates of the most likely computation in the brain. The authors seem to test only how well the neural activity correlates with different hypothesized models.

- The proposed solution for distinguishing policy from action value also has a very high rate of false negative (78%).

2) I feel that the point the authors make about action-value vs. policy representations may actually be underselling the true extent of the confound, and so their proposed solution may not be sufficient. However, this all depends on how the authors want to define a 'policy neuron' vs. a 'value neuron', as I explain below. I think they should clarify this.

2.1) Their arguments seem to assume that neural policy representations are in the form of action probabilities, which can then be identified by the key signature that they relate to action-values in a 'relative' manner (e.g. an 'action 1' neuron that is correlated positively with Q1 must be correlated negatively with Q2), and hence will be best fit as encoding ΔQ (Q1 – Q2). However, depending on how they define 'policy', this may not be the case.

Notably, even for reinforcement learning agents that do not explicitly represent action-values, few of them directly learn a policy in its most raw form of a set of action probabilities. Instead, they represent the policy in a more abstract parameter space. The simplest parameterization is a vector of action strengths, one for each possible action. Then during a choice, the probability of each action is determined by applying a choice function (e.g. softmax) to the action strengths of the set of actions that are currently available. The choice's outcome is then used to do learning on the action strengths. This method is used by some traditional actor-critic agents (which represent state values and action strengths, but not action-values). My impression is that the authors' covariance-based model is similar, in that the variables that it updates when it learns are the input weights W1 and W2 to each pool (with one input weight for each action, thus being analogous to action strengths).

Note that in these models, the action strengths are not explicitly represented in a 'relative' manner; only the resulting action probabilities are (since the probabilities must sum to 1). It's not clear to me whether a neuron encoding an action strength would be classified as a 'policy neuron' or an 'action-value neuron' by the authors' current framework, nor is it clear to me which outcome the authors would prefer. I believe the dynamics of actor-critic learning would cause the action strengths to be somewhat 'relative' (e.g. the best action is nudged toward positive strength while all others are nudged to negative strength), but I'm not sure how big this effect would be, or whether this would also occur for the authors' covariance-based model, or whether this would occur if > 2 actions are available. It is possible that these types of learning tasks can't discriminate between action strengths (e.g. from an actor-critic) versus action-values (e.g. from a Q-learner). So, the authors should clarify whether they believe this is an important distinction for the present study.
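To make the distinction between action strengths and action probabilities concrete, here is a minimal sketch of a softmax choice function applied to a vector of strengths. The strength values and the function name are illustrative assumptions, not taken from the authors' model. The sketch shows that adding the same constant to every strength leaves the probabilities unchanged, which is why only the probabilities are 'relative'.

```python
import math

def softmax(strengths):
    """Map action strengths to choice probabilities (hypothetical helper)."""
    m = max(strengths)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in strengths]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative strengths for three available actions (assumed values).
strengths = [1.0, 0.2, -0.5]
# The same strengths shifted by a common constant.
shifted = [s + 3.0 for s in strengths]

probs = softmax(strengths)
probs_shifted = softmax(shifted)

print(probs)
print(probs_shifted)
```

The two printed vectors agree up to floating-point rounding, so a neuron tracking a single strength can change its activity without any change in the policy.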

2.2) Suppose we agree that neurons only count as coding the policy if they encode action probability (and not strength). Their proposed solution still seems model-dependent because it assumes that the policy is such that the probability of choosing an action is a function of the difference in action-values (Q1 – Q2) and hence policy neurons can be identified as encoding ΔQ and not ΣQ. However, there is data suggesting that humans and animals are also influenced by ratios of reward rates rather than just differences (e.g. "Ratio vs. difference comparators in choice", Gibbon and Fairhurst, J Exp Anal Behav 1994; "Ratio and Difference Comparisons of Expected Reward in Decision Making Tasks", Worthy, Maddox, and Markman, 2008). If so, then policy neuron activity could be related to a ratio (e.g. Q1 / Q2), which is correlated with both ΔQ and ΣQ.

Here is my proposed solution. It seems to me that if 'policy neurons' are equated to action probabilities, then the proper method of distinguishing policy from value coding would be to design a task that explicitly dissociates between the probability of choosing an action (encoded by policy neurons) and the action's value (encoded by action-value neurons). For instance, suppose an animal is found to choose based on the ratio of the reward rates, such that it chooses A 80% of the time when V(A) = 4*V(B). Then we can set up the following three trial types:

V(A), V(B), p(choose A)

8, 2, 80%

4, 1, 80%

4, 4, 50%

A neuron encoding V(A) should be twice as active on the first trial type as the other two trial types (since V(A) is twice as high), while a neuron encoding the policy p(choose A) should be equally active on the first two trial types (since p(choose A) = 80%). Of course, more trial types might be desired to further dissociate this from encoding of ΣQ and ΔQ. Also, note that this approach is model-dependent, because it requires a model of behavior to estimate the true p(action) on each trial (or else careful psychophysics to find two pairs of action-values that make the subject have the same action probabilities).

In general, to use this approach in a regression-based manner, one would (1) fit a model to behavior, (2) use the model to derive p(action,t) and V(action,t) for each action and each trial t, (3) fit neural activity as a function of those variables (and possibly others, such as the actually performed action, ΣQ, etc.), (4) test whether the neuron is significantly modulated by p(action), V(action), or both, controlling for temporal correlation using the authors' proposed method that uses task trajectories from other sessions as a control. Of course, if the model says that choice is indeed based on the value difference ΔQ as the authors currently assume, then this approach would simplify to the same one the authors currently propose.
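Steps (1) and (2) of this procedure can be sketched as follows, assuming a two-action Rescorla–Wagner (Q-learning) model with a softmax choice rule. The function name, the parameters `alpha`, `beta` and `q0`, and the toy data are hypothetical; in practice the parameters would come from fitting the model to behavior, and steps (3) and (4) would follow on the returned trajectories.

```python
import math

def q_learning_trajectories(choices, rewards, alpha, beta, q0=0.5):
    """Per-trial action-values and p(choose action 0) for a two-action task.

    choices: sequence of 0/1 (index of the chosen action on each trial)
    rewards: sequence of reward magnitudes received on each trial
    alpha:   learning rate (assumed already fitted to behavior)
    beta:    softmax inverse temperature (assumed already fitted)
    """
    q = [q0, q0]
    q_hist, p_hist = [], []
    for action, reward in zip(choices, rewards):
        # Record values and policy as they stand before this trial's outcome.
        q_hist.append((q[0], q[1]))
        p_hist.append(1.0 / (1.0 + math.exp(-beta * (q[0] - q[1]))))
        # Update only the chosen action's value toward the obtained reward.
        q[action] += alpha * (reward - q[action])
    return q_hist, p_hist

# Toy session (assumed data): three trials.
q_hist, p_hist = q_learning_trajectories(
    choices=[0, 0, 1], rewards=[1, 0, 1], alpha=0.5, beta=2.0
)
for t, (qs, p) in enumerate(zip(q_hist, p_hist)):
    print(t, qs, round(p, 3))
```

The resulting `q_hist` and `p_hist` would then serve as the regressors V(action,t) and p(action,t) in step (3).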

Thirdly, the reviewers raised some questions about the corrections proposed and whether there in fact remained evidence for action value coding in the Basal Ganglia.

1) A critical assumption is that there exists temporal correlation strong enough to contaminate the analysis. It would be helpful to report the degree of this temporal correlation in the basal ganglia data set vs. the motor/auditory cortex data and the random walk model.

A figure, in the format of Figure 1D, showing the distribution of t-values for the actual basal ganglia data set analyzed with trial-matched Q estimates should be presented. This information is critical for effective comparisons to other data sets.

2) The authors proposed two possible solutions for this type of study. The first is to use a more stringent (and appropriate) criterion for significance, given the often wrongly assumed variance due to correlation. The permutation test is definitely in the right direction, particularly for reducing false positives. However, I am concerned by the really high rate of false negatives (~70% misses). "Considering the population of simulated action-value neurons of Figure 1, this analysis identified 29% of the action-value neurons of Figure 1 as such". Considering other unaccountable variables in typical experiments, particularly that basal ganglia neurons may have mixed selectivity both at the population and single-neuron level, such a high false negative rate seems to carry high risks of missing a true representation.

3) The authors suggested randomized blocks as the second solution. In addition to my earlier point, by their own account, such a design is not new and has been implemented in three separate studies >5 years ago. The authors pointed out some issues with those studies, which will need to be addressed in the future, but did not suggest any solutions.

4) The authors stated that the detrending analysis does not resolve the confound. However, judging from Figure 2—figure supplement 7, the detrending analysis resulted in ~29% significant Q modulation in the basal ganglia, in contrast to ~14% for random walk, ~12% for motor cortex and 10% for auditory cortex. Compared to other figures, which showed similar percentage for all four datasets, it seems that the basal ganglia data set is most robust to this analysis. Doesn't this support the idea of an action value representation in the basal ganglia?

5) The authors focus on statistical significance. Does examining the magnitude of the effects distinguish erroneous from "real" action value coding? It seems incomplete to only plot the t-values, which are important for understanding parameter precision, without presenting the parameter effect sizes. Can real action value coding be distinguished by effect sizes that are meaningfully large (i.e., substantive versus statistical significance)?

6) Along related lines, it seems like examining the pattern of effects is also useful. When comparing Figure 1D and Figure 2B, one can see that the erroneous detections included positive and negative ΔQ and ΣQ neurons, whereas for real detections (Figure 1D), there are far fewer of these neurons (by definition). All the erroneous detections generate spherical t-value plots, indicating that combinations of one or the other action value are independent. This seems not to be the case for real detections (in the authors' simulations), nor in real data (Samejima et al., 2005). This suggests that any non-uniformity in detecting combinations of action value coding would be evidence that it is not erroneous (even if the type I error is not properly controlled).

7) The simulations in Figure 2 are useful, but it would be useful to translate the diffusion parameter (σ) of the random walk into an (auto) correlation. This would make it easier for a reader to interpret how this relates to real data.

8) Is the M1 data a proper control? It is hard to tell from the task description here. I wouldn't be able to replicate the task that was used given the description here. If that M1 data is published, a citation would be helpful. My concerns are whether it might have had unusually large temporal correlations and thus exaggerated the degree to which such correlations might confound action-value studies, due to either (1) having blocks of trials (as opposed to randomly interleaved trial types), (2) being a BMI task in which animals were trained to induce the recorded ensemble to emit specific long-duration activity patterns.

Author response

However, there were several key concerns.

Firstly, the reviewers thought that more careful attention should be paid to the wider literature. In the reviewer discussion, this was thought to be of particular importance as you are making a technical point in a journal with a broad readership. It is particularly important that you are clear about which parts of the literature your results speak to.

For example:

1) The authors conclude that current methods of studying action-values are confounded, and propose that experiments should use actions whose reward values are not learned over time during the experiment, and instead are indicated by sensory cues with their values picked randomly on each trial.

However, the authors leave unmentioned the large literature of studies taking an alternate approach: recording striatal activity during planning and execution of instructed actions for cued reward outcomes, in which each trial randomizes the instructed action, the reward, or both. The authors should discuss what implications this literature has for whether the striatum encodes action-values vs. policies vs. other variables. Examples include the work from the labs of Schultz (e.g. Hassani et al., 2001; Cromwell and Schultz, 2003; Cromwell et al., 2005), Hikosaka (e.g. Kawagoe et al., 1998; Lauwereyns et al., Neuron 2004), and Kimura (e.g. Hori et al., 2009).

Most of these studies did not specifically claim that the activity they reported represents "action-values" in the reinforcement learning sense (and hence the present authors shouldn't feel obligated to try to debunk them), but they do seem highly relevant to the larger question the present authors raise. These studies did attempt to test whether neurons represented actions, values, and notably, their interaction (e.g. a cell whose activity scales more with action A's value than action B's value), which resembles the concept of "action-value".

Also, these studies may be somewhat resistant to the authors' criticisms about confounds from temporal correlations (since rewards were either explicitly cued, or kept deterministic and stable in well-learned blocks of trials, rather than slowly fluctuating during extended learning) and confounds with action probability (since the actions were instructed and hence a priori equally probable on each trial).

Of particular interest, a paper by Pasquereau et al. (2007) seems to fulfill all the requirements the present authors set for a test of striatal action-value coding; if so, this seems worthy of mention. That study manipulated the reward value of four actions (up, down, left, right), randomly assigning their reward probabilities on each trial and indicating them with visual cues. Unfortunately, as the present authors note, the study did not explicitly analyze their results in the action-value vs. chosen-value framework. However, the paper did report that some neurons had significant action x value interactions – for example, a cell that is more active when planning rightward movements (action), with stronger activity when the planned movement was more valuable (value), and with this value-modulation greater for rightward movements than other movements (action x value). This is not a pure chosen-value signal as the present authors seem to claim that paper reported. One could argue that it contains a key feature of an action-value signal as the value modulation is strongest for one specific action.

We agree with the reviewers that such trial designs, when trials are temporally independent, are not subject to the temporal correlation confound. We have added a paragraph about these papers and explained there why their findings cannot be used as support for the striatal action-value representation hypothesis. In short, we do not doubt that the striatum plays an important role in decision making and learning. However, this finding, as well as the evidence in support of representation of other decision variables in the basal ganglia, does not entail action-value representation in the striatum, as there are alternatives that are consistent with these findings. These points are clarified in the Discussion (Section “Other indications for action-value representation”).

Specifically regarding Pasquereau et al. (2007), we agree that the results are not consistent with pure chosen-value representation and have changed the text accordingly. The finding that neurons are co-modulated by action and expected reward is indeed very interesting. However, it cannot be taken as evidence for action-value representation for several reasons. First, a policy neuron is also expected to be co-modulated by these two variables. Second, the example neurons in Figure 6 in that paper are clearly modulated by the value of other actions, which is inconsistent with the action-value hypothesis (no such quantitative analysis was performed at the population level). Finally, an essential test of action-value representation is that the value of the action is represented even when this action is not chosen. This was not tested in that paper (although, in principle, it can be tested using existing data: the prediction of action-value representation is that the activity of that neuron is modulated by the value of the left target even when this target is not chosen). This is clarified, in short, in the “Literature search” section in the Materials and methods.

2) The authors correctly pointed out that some earlier studies of action value used a suboptimal task design and their conclusions need to be more rigorously validated. However, in the broader field, the potential risk of "drift" in neural recording has been well recognized. For example, "Neurophysiological experiments that compare activity across different blocks of trials must make efforts to be confident that any neural effects are not the result of artifacts of that design, such as slow-wave changes in neural activity over time." (Asaad et al., 2000). In the same Asaad et al. paper, a better design with repeated, alternated block types was used, similar in concept to randomized block design that the authors proposed here. Such designs have also been used in many neural studies of cognition – to name just two examples: value manipulations (Lauwereyns et al., 2002), rule manipulations (Mansouri et al., 2006). The problem thus seems relatively limited to one type of analysis that introduces temporal correlation across trials in an effort to estimate Q values. By the authors' account, this amounts to 5 papers from 3 different labs.

In response to this comment, we examined the papers proposed by the reviewer. We found that this method does not resolve the temporal correlations confound, as described in the Results section about possible solutions to the first confound (section “Possible solutions to the temporal correlations confound”) and in the Materials and methods section (the section “ANOVA tests for comparisons between blocks, excluding ‘drifting’ neurons”).

3) What about previous results arguing for prominence of a specific type of value representation? The authors touch on this, but it would be helpful to discuss specific results. In particular, the cited study of Wang et al., Nat Neurosci 2013 reported that their unbiased angular measure of DMS value coding was distributed significantly non-uniformly, with net value (ΣQ) coding more prevalent than other types (their Figure 7). By contrast, the null hypothesis simulations in this paper predict very different results: either a uniform distribution (Figure 2—figure supplement 8) or a dearth of ΣQ neurons (Figure 5—figure supplement 1). The authors should discuss whether this previous result can therefore still be interpreted as evidence of value coding (at least, net-value coding), rather than strictly policy coding, in the striatum.

(Also, it is odd that the authors cite Wang et al. as a study that "claimed to provide direct evidence that neuronal activity in the striatum is specifically modulated by action-value", since the main result was specifically finding prevalence of net-value, not action-value coding).

We do not discuss the issue of non-uniform results in the paper but we agree that non-uniform results may be an indication of a true modulation by some variable. For example, if only neurons that are positively correlated with action-values are found (rather than negatively correlated with them) – this would be a strong indication for a modulation that is not caused by random fluctuations.

However, it is important to point out that small changes in the analysis may bias it in unexpected ways. In Author response image 1 we repeated the analysis of Wang et al. (2013) for the random-walk neurons. This analysis is slightly different from the one presented in Figure 2—figure supplement 8. There, following Samejima et al. (2005), we analyzed only the last 20 trials in each block (we have now added a clarification in the figure legend). Wang et al. (2013) analyzed all the trials in a block except the first 10 and utilized 5-9 blocks. Analyzing all the trials in a block except the first 10 and utilizing 8 blocks (order of blocks as in Figure 2—figure supplement 10), we surprisingly find a small but significant bias towards representation of ΣQ (p=2.9%), as in Wang et al. (2013).

Importantly, we have not fully followed the experimental setting in Wang et al. (2013). Specifically, we were not sure what their rule for terminating a block was, and we used the Samejima et al. (2005) rule. Therefore, we are unsure about the consequences of the bias we now found for their conclusions. However, this analysis shows that a biased result is not always an indication of true modulation.

With respect to the second point, we agree that Wang et al.’s (2013) main point is that the dorsomedial striatum represents net-value (i.e., ΣQ). However, they do report that "in the DMS, all categories of neuron types were represented above chance" (p. 645). Nevertheless, we added this point in the legend of Figure 2—figure supplement 8, where the Wang et al. (2013) analysis is repeated.

(Computation of p-value: The p-value for the probability of receiving this fraction of state neurons was computed under the assumption that the significant neurons were distributed uniformly between classifications. If classification is uniform, the expected fraction of neurons in each category is 10.11%. Here we classified 11.93% of the neurons as representing state. We used 20,000 neurons in 1000 different sessions. Taking 1000 sessions as the sample size, we calculated the probability that a binomial distribution with success probability 10.11% yields more than 119 classifications in 1000 sessions.)
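The binomial tail described in this computation can be reproduced with a short sketch. It uses only the numbers stated above (sample size 1000, success probability 10.11%, threshold 119); the function name is hypothetical, and the sum is carried out in log space simply to avoid floating-point underflow.

```python
import math

def binomial_upper_tail(n, p, k):
    """P(X > k) for X ~ Binomial(n, p), summed term by term in log space."""
    total = 0.0
    for j in range(k + 1, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
            + j * math.log(p) + (n - j) * math.log(1.0 - p)
        )
        total += math.exp(log_pmf)
    return total

# Probability of more than 119 classifications out of 1000
# when each has probability 10.11% under the uniform null.
p_value = binomial_upper_tail(n=1000, p=0.1011, k=119)
print(f"P(X > 119) = {p_value:.4f}")
```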

4) I found the authors' choice of basal ganglia data misleading (Ito and Doya, 2009). First, because these data are recordings from the nucleus accumbens and ventral pallidum, which are not the first basal ganglia structures one thinks of as encoding action values. Second, because the original authors of the study from which the data was obtained noted that action value coding was low in these structures, leading them to suggest that action value coding was not the primary function of the nucleus accumbens and ventral pallidum. This is mentioned in the Results subsection “Permutation analysis of basal ganglia neurons”, but should be noted in the Discussion (the current text in the Results could probably just be moved).

We moved the text to the Discussion (section “Temporal correlations and action-value representations”, fourth paragraph).

5) Previous methods dealing with trial correlations have different success at reducing false positive rate of detecting action-values. In particular, the method of Wang et al. (2013) comes very close to attaining the correct size of the test for action-values. Indeed, it appears to be the only existing method from which one would reasonably conclude that the ventral striatal data set analyzed probably does not exhibit much action-value coding (2-3% above the expected 5%). I think it would be useful to have a figure in the main text comparing the different methods to the authors' permutation test (using for example, just the basal ganglia data set). In addition, Wang et al.’s method is also pretty good at identifying policy neurons, which is important because it could be applied retrospectively to existing data sets.

In an attempt to make our analyses as similar as possible to the original analyses, we used different significance thresholds for different methods. Specifically, in the Wang et al. analysis we find that 7% – 8% of the basal ganglia neurons represent an action value, whereas only 0.25% are expected by chance. To make this difference clear, we added the significance threshold to the different figures.

Regarding the analysis in Wang et al. (2013) on policy neurons, we address this question in the section “Is this confound the result of an analysis that is biased against policy representation?”. This analysis indeed yields more policy than action-value neurons, but the fraction of policy neurons classified as action-value neurons is still much larger than expected by chance.

With regard to the suggestion of adding the figure, we are unsure about the added value of such a figure. In the supplementary figures we demonstrate that all these methods erroneously classify neurons in the basal ganglia recordings as representing unrelated action-values. In view of these findings, we fear that using them to identify true action-values in those recordings may mislead the readers.

6) The authors' biggest suggestion for rigorously detecting a neural action value representation is "Don't use a task with learning, use a trial-based design where subjects associate contexts with well-learned sets of action values". That is perfectly fine for scientists whose goal is specifically to test whether a brain area encodes action values. However, what about the many scientists whose explicit goal is to study neural representation of time-varying values, and hence need to use learning tasks? Many scientists are studying (1) the neural basis of value learning, (2) brain areas specifically involved in early learning (not well-trained performance), (3) motivational variables specifically present for time-varying action values not well-trained ones (e.g. volatility, certain forms of uncertainty, etc.). If the authors can give an approach that will let these scientists make accurate estimates of neural time-varying value coding during learning tasks, that would certainly be valuable to the field.

I feel like their methods could potentially be used to achieve this in a straightforward way (by combining their novel permutation test from Part 1 of their paper with their method of testing for correlation with both sum and difference of values from Part 2 of their paper). But they don't lay this out explicitly in their paper at present, since they are more focused on the narrower implication ("Do striatal neurons encode action values?") rather than the broader implication of their results ("In general, how can one properly measure time-varying action values?").

The paper addresses two confounds that are somewhat orthogonal. The temporal correlation confound can be addressed using the permutation analysis of Figure 3, which can provide strong support for the claim that the activity of a particular neuron co-varies with learning. This is a general solution for scientists studying slow processes such as learning.

Precisely defining or interpreting what the activity of the neuron represents (for example an action-value or policy) is more challenging and in general, there are no easy solutions and caution should be applied when interpreting the data. We now discuss these points in the 'Temporal correlations – beyond action-value representation' section of the Discussion.

With respect to the proposed solution: to rule out policy representation, the analysis in Figure 6 includes a regression on an orthogonal variable, state. For the two variables to be orthogonal, it is mathematically required that the two action-values have the same variance (section “A possible solution to the policy confound”). This can be achieved in a controlled experiment where reward probabilities are used, but we cannot control for the variance of the action-values when we estimate them from behavior. Therefore, we could not find a way to combine the solution from Figure 3 with the regression analysis from Figure 6. However, in other cases, this may not be an issue, depending on the specific variable and question.
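The equal-variance requirement follows from a one-line identity. As a sketch, and assuming that "state" corresponds to ΣQ = Q1 + Q2 and policy to ΔQ = Q1 − Q2 (the notation used in the reviewers' comments), the covariance between the two regressors is:

```latex
\begin{aligned}
\operatorname{Cov}(\Sigma Q,\ \Delta Q)
  &= \operatorname{Cov}(Q_1 + Q_2,\ Q_1 - Q_2) \\
  &= \operatorname{Var}(Q_1) - \operatorname{Cov}(Q_1, Q_2)
     + \operatorname{Cov}(Q_2, Q_1) - \operatorname{Var}(Q_2) \\
  &= \operatorname{Var}(Q_1) - \operatorname{Var}(Q_2).
\end{aligned}
```

Under that assumption, the two regressors are uncorrelated exactly when Var(Q1) = Var(Q2), whatever the covariance between Q1 and Q2.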

Secondly, the reviewers had some particular concerns about the action value vs. policy representation issue. For example:

1) Regarding the second confound of policy vs. action value:

- The authors seem to be arguing against a straw-man version of how to relate neural activity to behavior. Typically we infer the underlying computations by testing how well different hypothesized models can fit the behavior and then search for correlates of the most likely computation in the brain. The authors seem to test only how well the neural activity correlates with different hypothesized models.

We respectfully disagree with the reviewer for two reasons:

First, the reviewer hints that because action-value based models best describe behavior, we should search for action-value representations. We would like to note that while the view that action-value based models best describe behavior is widespread, there is strong evidence that favors other models (e.g., Erev et al., Economic Theory, 2007, see also Shteingart and Loewenstein, 2014 for review). Therefore, it is still an open question whether action-value representation exists in the brain.

Second, policy representation (representation of the probability of choice) is likely to exist even if the brain computes action-values. If neurons represent policy, then they may be misclassified as representing action-values.

- The proposed solution for distinguishing policy from action value also has a very high rate of false negative (78%).

We agree with this point and have revised the analysis to decrease its false negative rate. For true action-value neurons, the rate of correct detection vs. false negatives depends on the strength of their modulation by action-value, together with the power of the analysis.

We used neurons whose correct detection rate in the original analyses was comparable to the literature (~40%). The analysis in the previous version of the manuscript decreased this rate to 22%. It indeed suffered from limited power, in part because it employed only 80 trials. To increase the power of the analysis, we repeated it using 400 trials in total (rather than the original 280 trials) and conducted the analysis on the last 200 trials. We now correctly classify 32% of action-value neurons as such (see Figure 6). Considering that the original analysis in Figure 1 was biased towards classifying neurons as representing action-value, rather than policy or state, and that our new analysis requires passing two significance tests, we take this correct detection rate to be reasonable.

We changed Figures 4 and 6, together with their figure legends and descriptions of the analysis accordingly.

2) I feel that the point the authors make about action-value vs. policy representations may actually be underselling the true extent of the confound, and so their proposed solution may not be sufficient. However, this all depends on how the authors want to define a 'policy neuron' vs. a 'value neuron', as I explain below. I think they should clarify this.

2.1) Their arguments seem to assume that neural policy representations are in the form of action probabilities, which can then be identified by the key signature that they relate to action-values in a 'relative' manner (e.g. an 'action 1' neuron that is correlated positively with Q1 must be correlated negatively with Q2), and hence will be best fit as encoding ΔQ (Q1 – Q2). However, depending on how they define 'policy', this may not be the case.

Notably, even for reinforcement learning agents that do not explicitly represent action-values, few of them directly learn a policy in its most raw form of a set of action probabilities. Instead, they represent the policy in a more abstract parameter space. The simplest parameterization is a vector of action strengths, one for each possible action. Then during a choice, the probability of each action is determined by applying a choice function (e.g. softmax) to the action strengths of the set of actions that are currently available. The choice's outcome is then used to do learning on the action strengths. This method is used by some traditional actor-critic agents (which represent state values and action strengths, but not action-values). My impression is that the authors' covariance-based model is similar, in that the variables that it updates when it learns are the input weights W1 and W2 to each pool (with one input weight for each action, thus being analogous to action strengths).

Note that in these models, the action strengths are not explicitly represented in a 'relative' manner; only the resulting action probabilities are (since the probabilities must sum to 1). It's not clear to me whether a neuron encoding an action strength would be classified as a 'policy neuron' or an 'action-value neuron' by the authors' current framework, nor is it clear to me which outcome the authors would prefer. I believe the dynamics of actor-critic learning would cause the action strengths to be somewhat 'relative' (e.g. the best action is nudged toward positive strength while all others are nudged to negative strength), but I'm not sure how big this effect would be, or whether this would also occur for the authors' covariance-based model, or whether this would occur if > 2 actions are available. It is possible that these types of learning tasks can't discriminate between action strengths (e.g. from an actor-critic) versus action-values (e.g. from a Q-learner). So, the authors should clarify whether they believe this is an important distinction for the present study.
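
To illustrate the parameterization described in this comment, the following is a minimal Python sketch (not code from the manuscript) in which a vector of action strengths is passed through a softmax choice function. The function name `softmax`, the `temperature` parameter and the numerical strengths are illustrative assumptions. The sketch shows that adding a constant to every strength leaves the action probabilities unchanged, so the probabilities are 'relative' even though the strengths themselves need not be.

```python
import math


def softmax(strengths, temperature=1.0):
    """Map a list of action strengths to choice probabilities."""
    # Subtracting the maximum improves numerical stability and does not
    # change the resulting probabilities.
    top = max(strengths)
    exps = [math.exp((s - top) / temperature) for s in strengths]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical action strengths for two actions.
strengths = [1.0, 0.2]
# The same strengths with a constant offset added to both.
shifted = [s + 5.0 for s in strengths]

p = softmax(strengths)
p_shifted = softmax(shifted)

print("probabilities:        ", p)
print("after constant offset:", p_shifted)
```

With two actions and a temperature of 1, the probability of the first action reduces to a logistic function of the difference between the two strengths, which is why a neuron tracking the probability would covary with the difference rather than with either strength alone.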

The reviewer is making an interesting and important point. An initial requirement for a neuron to be considered an action-value neuron, a policy neuron or any decision variable-neuron, is that it is significantly more correlated with these decision variables than with decision variables that are unrelated to the current task. The permutation analysis of Figure 3 can be used to find such neurons.

The question of which decision variable the neuron represents (assuming that it passed the permutation test) is a more difficult one. The reason is that the different decision variables are correlated. Moreover, because these variables are all some function of past actions and rewards, and relate to future choice, many existing and future decision-making models are expected to have modules whose activity correlates with these variables. One may argue that the question of whether neurons represent action-value, policy, state or some other correlated variable is not an interesting question. This is because all these correlated decision variables implicitly encode action-value. Even direct-policy models can be taken to implicitly encode action-value, because policy is correlated with the difference between the action-values. However, we believe that the difference between action-value representation and representation of other variables is an important one, because it centers on the question of the computational model that underlies decision-making in these tasks.

Often, reports of action-value representation are taken to support the hypothesis that action-values are explicitly computed in the brain, and that these action-values play a specific role in the decision making process. While other models may include no such calculation they can still include neuronal activity that correlates with action-value, as in the covariance-based plasticity model (at the level of the population). One proper way of ruling out competing hypotheses about the variables the neuronal activity correlates with is to test for significant correlations in directions that are correlated with action-value but are orthogonal to each of the competing hypotheses.

Clearly, one cannot attempt to rule out all possible hypotheses. However, even in the restricted framework of value-based Q-learning, a necessary condition for a neuron to be considered as representing an action-value is that it is not representing other decision variables of that model such as policy. Regarding alternative models for learning, clearly the more restrictive the characterization of the response properties of a neuron in the task, the more informative it is about the underlying neural computation.

We added a section in the Discussion titled “Are action-value representations a necessary part of decision making?” that addresses these issues.

2.2) Suppose we agree that neurons only count as coding the policy if they encode action probability (and not strength). Their proposed solution still seems model-dependent because it assumes that the policy is such that the probability of choosing an action is a function of the difference in action-values (Q1 – Q2) and hence policy neurons can be identified as encoding ΔQ and not ΣQ. However, there is data suggesting that humans and animals are also influenced by ratios of reward rates rather than just differences (e.g. "Ratio vs. difference comparators in choice", Gibbon and Fairhurst, J Exp Anal Behav 1994; "Ratio and Difference Comparisons of Expected Reward in Decision Making Tasks", Worthy, Maddox, and Markman, 2008). If so, then policy neuron activity could be related to a ratio (e.g. Q1 / Q2), which is correlated with both ΔQ and ΣQ.

We agree, but any analysis can only consider and compare the hypotheses that are explicitly acknowledged. We added a paragraph in the Discussion addressing this point (section “Differentiating action-value from other decision variables”, fifth paragraph).

Here is my proposed solution. It seems to me that if 'policy neurons' are equated to action probabilities, then the proper method of distinguishing policy from value coding would be to design a task that explicitly dissociates between the probability of choosing an action (encoded by policy neurons) and the action's value (encoded by action-value neurons). For instance, suppose an animal is found to choose based on the ratio of the reward rates, such that it chooses A 80% of the time when V(A) = 4*V(B). Then we can set up the following three trial types:

V(A), V(B), p(choose A)

8, 2, 80%

4, 1, 80%

4, 4, 50%

A neuron encoding V(A) should be twice as active on the first trial type as the other two trial types (since V(A) is twice as high), while a neuron encoding the policy p(choose A) should be equally active on the first two trial types (since p(choose A) = 80%). Of course, more trial types might be desired to further dissociate this from encoding of ΣQ and ΔQ. Also, note that this approach is model-dependent, because it requires a model of behavior to estimate the true p(action) on each trial (or else careful psychophysics to find two pairs of action-values that make the subject have the same action probabilities).

In general, to use this approach in a regression-based manner, one would (1) fit a model to behavior, (2) use the model to derive p(action,t) and V(action,t) for each action and each trial t, (3) fit neural activity as a function of those variables (and possibly others, such as the actually performed action, ΣQ, etc.), (4) test whether the neuron is significantly modulated by p(action), V(action), or both, controlling for temporal correlation using the authors' proposed method that uses task trajectories from other sessions as a control. Of course, if the model says that choice is indeed based on the value difference ΔQ as the authors currently assume, then this approach would simplify to the same one the authors currently propose.
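
The three trial types in the example above can be tabulated with a short Python sketch. It assumes, purely for illustration, a ratio rule p(choose A) = V(A) / (V(A) + V(B)), which reproduces the 80% and 50% figures quoted in the example; the rule and all variable names are assumptions, not a model fitted to data.

```python
# (V(A), V(B)) for the three trial types in the example.
trial_types = [(8.0, 2.0), (4.0, 1.0), (4.0, 4.0)]


def p_choose_a(v_a, v_b):
    # Hypothetical ratio rule: gives 0.8 whenever V(A) = 4 * V(B).
    return v_a / (v_a + v_b)


# Predicted (relative) activity of idealized neurons on each trial type.
value_a = [v_a for v_a, _ in trial_types]                        # encodes V(A)
policy_a = [p_choose_a(v_a, v_b) for v_a, v_b in trial_types]    # encodes p(choose A)
delta_q = [v_a - v_b for v_a, v_b in trial_types]                # encodes V(A) - V(B)
sum_q = [v_a + v_b for v_a, v_b in trial_types]                  # encodes V(A) + V(B)

for row in zip(trial_types, value_a, policy_a, delta_q, sum_q):
    print(row)
```

Under this assumed rule, the first two trial types share the same policy but differ in V(A), in the difference and in the sum, so they separate policy from the other three variables but not those three from each other. This is consistent with the remark that more trial types would be needed to dissociate V(A) from ΣQ and ΔQ.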

This is an elegant experimental design and not unlike the one we consider in Figure 6. However, with respect to the proposed analysis, there are two important differences. One is the question of whether behavior is modulated by the ratio of reward rates, the difference of reward rates or a different function. In the paper we posited that it is the difference in the reward rates that modulates behavior when analyzing the data in the value-based framework. We agree that it is possible that the ratio is a better predictor of behavior. Our choice followed that of the previous publications and is based on the assumption of the Q-learning model that the probability of choice is a monotonic function of the difference between action-values.

Second, in point 4, the reviewers propose to test the type of representation by looking for significant modulation or the lack of it. However, a non-significant result for one variable is not an indication that it was not the modulator. As described in Figure 5, this can lead to confounds. Furthermore, policy and action-value will have shared variance, and so some of the modulation of the neuronal activity cannot be conclusively attributed to either of them. Therefore, it is better to use model comparison (likelihood) when considering the results of this analysis. In our manuscript we focus on significance tests that can rule out specific possibilities under the null hypothesis.

Note that the design suggested by the reviewers can also be used to reject the hypothesis that neurons are policy neurons. For neurons whose activity differs significantly between the first two cases (p(choose A)=80%) the null hypothesis that they represent policy can be rejected. In the experimental design we simulate in the paper (Figure 6) this is like comparing the activity of neurons at the end of two blocks where the policy is similar (this is an assumption which can be tested empirically). We can compare the neural activity in (0.1, 0.5) with (0.5, 0.9), and the activity in (0.5, 0.1) with (0.9, 0.5). To rule out the possibility of state representation we should compare the activity at the end of the following blocks: compare (0.1, 0.5) with (0.5, 0.1), and (0.5, 0.9) with (0.9, 0.5). As the reviewers note, this is in fact exactly what we do in the analysis in Figure 6. We regress neuronal activity on state – sum(0.1, 0.5)=0.6, sum(0.5, 0.9)=1.4, sum(0.5, 0.1)=0.6, sum(0.9, 0.5)=1.4. This effectively compares activity in cases with the same policies in a regression model.
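
The block comparisons listed above can be checked with a small Python sketch. It assumes, as in the value-based framework discussed here, that policy depends on the difference between the two reward probabilities and that state corresponds to their sum; the helper names `delta` and `total` are illustrative.

```python
from itertools import combinations

# Reward probabilities (action 1, action 2) of the four blocks.
blocks = [(0.1, 0.5), (0.5, 0.9), (0.5, 0.1), (0.9, 0.5)]


def delta(block):
    # Difference of reward probabilities (assumed to determine policy).
    return round(block[0] - block[1], 6)


def total(block):
    # Sum of reward probabilities (the 'state' regressor).
    return round(block[0] + block[1], 6)


# Pairs with equal difference: same assumed policy, different state.
same_policy_pairs = [(a, b) for a, b in combinations(blocks, 2)
                     if delta(a) == delta(b)]
# Pairs with equal sum: same state, different assumed policy.
same_state_pairs = [(a, b) for a, b in combinations(blocks, 2)
                    if total(a) == total(b)]

print("same policy, different state:", same_policy_pairs)
print("same state, different policy:", same_state_pairs)
```

A difference in activity within a same-policy pair argues against a pure policy representation, and a difference within a same-state pair argues against a pure state representation.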

Thirdly, the reviewers raised some questions about the corrections proposed and whether there in fact remained evidence for action value coding in the Basal Ganglia.

1) A critical assumption is that there exists temporal correlation strong enough to contaminate the analysis. It would be helpful to report the degree of this temporal correlation in the basal ganglia data set vs. the motor/auditory cortex data and the random walk model.

Author response image 2 shows a plot of the autocorrelation of the spike counts in each trial for the different data sets (averaged over the spike counts in each group; light-colors denote SEM; computed using MATLAB’s ‘autocorr’ function).

We believe that it is better to refrain from including this figure in the paper for two reasons: (1) The autocorrelations relevant for the temporal correlations confound are those associated with the time-scale relevant for learning, tens of trials. Computing such autocorrelations in experiments of a few hundreds of trials introduces substantial biases (Newbold and Agiakloglou, 1993; Kohn, 2006). This is also demonstrated in the negative autocorrelation of the random-walk spike counts, computed using sessions of 151-379 trials. Alternative measures for autocorrelation are also problematic when applied to small samples, see (Kohn, 2006). (2) We are not aware of a theoretical mapping from the autocorrelation function to the temporal correlations confound. For example, considering the autocorrelations below, it is not clear how to compare the basal ganglia and the motor cortex datasets with respect to the temporal correlations confound when considering their autocorrelation functions. For these reasons, computing autocorrelation functions to quantify the temporal correlations confound may be misleading rather than useful.

We added a paragraph to the manuscript, describing the potential problems with the autocorrelation measure (section “Possible solutions to the temporal correlations confound”, second paragraph).

A figure, in the format of Figure 1D, showing the distribution of t-values for the actual basal ganglia data set analyzed with trial-matched Q estimates should be presented. This information is critical for effective comparisons to other data sets.

We added Figure 3—figure supplement 1, which reports this information.

2) The authors proposed two possible solutions for this type of study. The first is to use a more stringent (and appropriate) criterion for significance, given the often wrongly assumed variance due to correlation. The permutation test is definitely in the right direction, particularly for reducing false positives. However, I am concerned by the really high rate of false negatives (~70% misses). "Considering the population of simulated action-value neurons of Figure 1, this analysis identified 29% of the action-value neurons of Figure 1 as such". Considering other unaccountable variables in typical experiments, particularly that basal ganglia neurons may have mixed selectivity both at the population and single-neuron level, such a high false negative rate seems to carry high risks of missing a true representation.

The rate of misses of action-value neurons in our analysis depends on the parameters that we used to model these neurons. We used parameters such that the "standard" methods miss approximately 60% of the action value neurons. With the permutation test we miss approximately 70%. Other parameters would yield different rates of misses. If selectivity is weak then indeed, it will be more difficult to identify such neurons. However, a necessary condition for a neuron to be classified as a task-related neuron is that it is more correlated with decision variables in its corresponding session than with these decision variables in other sessions. We do not see a way around it even if this requirement is associated with a substantial rate of false identifications.
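
The necessary condition stated above can be sketched in Python as a permutation-style test across sessions. This is a simplified illustration under stated assumptions, not the analysis code of the paper: it uses the absolute Pearson correlation as the test statistic, assumes all sessions have the same number of trials, and stands in for estimated action-values with hypothetical random-walk trajectories. The function names are illustrative.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def permutation_p_value(spike_counts, own_q, other_sessions_q):
    """Fraction of sessions whose action-values correlate with the neuron
    at least as strongly as those of the neuron's own session."""
    own = abs(pearson(spike_counts, own_q))
    null = [abs(pearson(spike_counts, q)) for q in other_sessions_q]
    # The +1 terms count the neuron's own session as one of the sessions.
    return (1 + sum(r >= own for r in null)) / (1 + len(null))


def random_walk(n_trials, rng):
    level, walk = 0.0, []
    for _ in range(n_trials):
        level += rng.gauss(0.0, 1.0)
        walk.append(level)
    return walk


rng = random.Random(1)
n_trials = 200
own_q = random_walk(n_trials, rng)                          # stand-in for Q in own session
other_q = [random_walk(n_trials, rng) for _ in range(99)]   # Q from other sessions
neuron = [q + rng.gauss(0.0, 0.5) for q in own_q]           # noisy 'value' neuron

p_value = permutation_p_value(neuron, own_q, other_q)
print("permutation p-value:", p_value)
```

Because slowly drifting activity can correlate with action-values from any session, the null distribution built from other sessions is broad. This is what protects against false positives, and it is also why weakly modulated neurons are frequently missed.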

One approach to increase the power of any analysis will be to use as many trials as possible, as can be seen from the increase in the correct detection rate in Figure 6, caused by the addition of trials (we could not add trials in this analysis because we analyzed the original neurons of Figure 1). Another alternative is to consider population coding rather than to focus on individual neurons. This analysis is, however, beyond the scope of this paper.

3) The authors suggested randomized blocks as the second solution. In addition to my earlier point, by their own account, such a design is not new and has been implemented in three separate studies >5 years ago. The authors pointed out some issues with those studies, which will need to be addressed in the future, but did not suggest any solutions.

We are not sure that we understand this comment. In our second solution we proposed randomized trials and not randomized blocks. If the reviewer relates to the similarity of our second solution to (Fitzgerald, Friston and Dolan, 2012) then crucially, we used reward probabilities in the analysis and not estimated action-values. This removes temporal correlations which are present when estimated action-values are used (see Figure 2—figure supplement 9). In addition, our analysis in Figure 6 rules out policy and state representations, which was not present in (Fitzgerald, Friston and Dolan, 2012). This last point is also relevant to (Cai et al., 2011 and Kim et al., 2012).

4) The authors stated that the detrending analysis does not resolve the confound. However, judging from Figure 2—figure supplement 7, the detrending analysis resulted in ~29% significant Q modulation in the basal ganglia, in contrast to ~14% for random walk, ~12% for motor cortex and 10% for auditory cortex. Compared to other figures, which showed similar percentage for all four datasets, it seems that the basal ganglia data set is most robust to this analysis. Doesn't this support the idea of an action value representation in the basal ganglia?

Originally, we were not clear enough on this issue. We’ve added clarifying sentences in the figure legends. The analysis of the basal ganglia data in Figure 2—figure supplement 7 erroneously identified unrelated action-values from simulations. In fact, this analysis indicates that detrending is even less useful there than in other datasets.

5) The authors focus on statistical significance. Does examining the magnitude of the effects distinguish erroneous from "real" action value coding? It seems incomplete to only plot the t-values, which are important for understanding parameter precision, without presenting the parameters effect sizes. Can real action value coding be distinguished by effect sizes that were meaningfully large (i.e., substantive versus statistical significance)?

To address this comment, we compared the explained variance of the action-value and random-walk neurons used in our paper. Surprisingly, the explained variance of the random-walk neurons is higher than that of the true action-value neurons.

One may argue that very high explained variance (say R2 > 0.25) can be used as conclusive evidence of action-value representation. However, we find that if the diffusion coefficient of the random-walk neurons is sufficiently large, then a substantial fraction of the neurons will exhibit high values of R2. For example, with a diffusion coefficient of 0.5, 31% of the random-walk neurons exhibit R2 > 0.25.

6) Along related lines, it seems like examining the pattern of effects is also useful. When comparing Figure 1D and Figure 2B, one can see that the erroneous detections included positive and negative ΔQ and ΣQ neurons, whereas for real detections (Figure 1D), there are far fewer of these neurons (by definition). All the erroneous detections generate spherical t-value plots, indicating that combinations of one or the other action value are independent. This seems not to be the case for real detections (in the authors' simulations), nor in real data (Samejima et al., 2005). This suggests that any non-uniformity in detecting combinations of action value coding would be evidence that it is not erroneous (even if the type I error is not properly controlled).

We partially answered this question above (question 3 in the first set of comments). Some non-uniformities may indeed indicate that the results are not due to random modulations. However, even when dealing with random modulations we may see certain biases that are caused by the design of the analysis. Another example is Figure 2. There we find that in the random-walk dataset, the fraction of state neurons is larger than that of policy neurons. We briefly address the fact that the results may be biased towards a specific classification in some experimental designs in Figure 2—figure supplements 4, 5, and Figure 3—figure supplement 1.

7) The simulations in Figure 2 are useful, but it would be useful to translate the diffusion parameter (σ) of the random walk into an (auto) correlation. This would make it easier for a reader to interpret how this relates to real data.

As discussed above, we fear that presenting autocorrelations may be misleading. In particular, the autocorrelation of the random-walk function for a finite (and small) number of trials, which is the relevant case for experiments, is very different from the function obtained when the number of trials is large. This is depicted in Author response image 4, where we compare the autocorrelation of the random-walk sessions of the paper with the autocorrelation function of the same process, computed using 5,000 trials.
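
The dependence of the estimated autocorrelation on session length can be explored with a minimal Python sketch. It is a simplification and not the code used to produce the figure: it simulates a plain Gaussian random walk (without converting it to spike counts), uses a standard sample autocorrelation estimator, and the step size, lag, session lengths and number of repetitions are arbitrary illustrative choices.

```python
import random


def random_walk(n_trials, sigma, rng):
    """Gaussian random walk with step standard deviation sigma."""
    level, walk = 0.0, []
    for _ in range(n_trials):
        level += rng.gauss(0.0, sigma)
        walk.append(level)
    return walk


def sample_autocorr(x, lag):
    """Sample autocorrelation at the given lag, using the session mean."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    variance_sum = sum(c * c for c in centered)
    covariance_sum = sum(centered[t] * centered[t + lag] for t in range(n - lag))
    return covariance_sum / variance_sum


rng = random.Random(0)
lag = 50  # a lag of tens of trials, the time-scale relevant for learning

# Many short sessions versus a few long ones generated by the same process.
short_estimates = [sample_autocorr(random_walk(200, 0.1, rng), lag)
                   for _ in range(200)]
long_estimates = [sample_autocorr(random_walk(5000, 0.1, rng), lag)
                  for _ in range(20)]

mean_short = sum(short_estimates) / len(short_estimates)
mean_long = sum(long_estimates) / len(long_estimates)
print("mean estimate, 200-trial sessions: ", mean_short)
print("mean estimate, 5000-trial sessions:", mean_long)
```

The exact numbers depend on the seed and on the chosen parameters, but the short-session estimates should come out systematically lower than the long-session ones, because subtracting the session mean from a short, drifting sequence biases the estimate downward.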

8) Is the M1 data a proper control? It is hard to tell from the task description here. I wouldn't be able to replicate the task that was used given the description here. If that M1 data is published, a citation would be helpful. My concerns are whether it might have had unusually large temporal correlations and thus exaggerated the degree to which such correlations might confound action-value studies, due to either (1) having blocks of trials (as opposed to randomly interleaved trial types), (2) being a BMI task in which animals were trained to induce the recorded ensemble to emit specific long-duration activity patterns.

The motor cortex data was recorded in Eilon Vaadia’s lab and has not been published yet. We agree that the specific task the subject is performing may influence the overall firing rate or the temporal correlations in the neural activity and hence the false positive rates in the detection of action-value representation. However, we think it is unlikely that the recordings in this data set are an outlier in terms of autocorrelations. First, the monkey was extensively trained and all trials were identical, so there is nothing in the design of the task that suggests long-term correlations between trials. Second, the monkey was conditioned to enhance the power of beta band frequencies (20-30Hz). This frequency band is two orders of magnitude different than the time scale separating different trials (on average 14.2 seconds). Finally, we considered spike count prior to the beginning of the trials, while the monkey was still waiting for a GO signal.

Funding

Israel Science Foundation (757/16)

Deutsche Forschungsgemeinschaft (CRC1080)

Gatsby Charitable Foundation

Yonatan Loewenstein

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We are extremely grateful to Oren Peles, Eilon Vaadia and Uri Werner-Reiss for providing us with their motor cortex recordings, Bshara Awwad, Itai Hershenhoren, Israel Nelken for providing us with their auditory cortex recordings, Kenji Doya and Makoto Ito for providing us with their basal ganglia recordings, Mati Joshua, Gianluigi Mongillo, Jonathan Roiser and Roey Schurr for careful reading of the manuscript and helpful comments and Inbal Goshen, Hanan Shteingart and Wolfram Schultz for discussions.
