Abstract

Individual variability in reward-based learning has been ascribed to quantitative variation in baseline levels of striatal dopamine. However, direct evidence for this pervasive hypothesis has hitherto been unavailable. We demonstrate that individual differences in reward-based reversal learning reflect variation in baseline striatal dopamine synthesis capacity, as measured with neurochemical positron emission tomography. Subjects with high baseline dopamine synthesis in the striatum showed relatively better reversal learning from unexpected rewards than from unexpected punishments, whereas subjects with low baseline dopamine synthesis in the striatum showed the reverse pattern. In addition, baseline dopamine synthesis predicted the direction of dopaminergic drug effects. The D2 receptor agonist bromocriptine improved reward- relative to punishment-based reversal learning in subjects with low baseline dopamine synthesis capacity, while impairing it in subjects with high baseline dopamine synthesis capacity in the striatum. Finally, this pattern of drug effects was outcome-specific, and driven primarily by drug effects on punishment-, but not reward-based reversal learning. These data demonstrate that the effects of D2 receptor stimulation on reversal learning in humans depend on task demands and baseline striatal dopamine synthesis capacity.

Keywords: dopamine, reward, punishment, striatum, PET, learning

Adaptation to our environment requires the anticipation of biologically relevant events by learning signals of their occurrence, i.e. reward-based learning. Models of reward-based learning use a prediction error signal, representing the difference between expected and obtained events, to update their predictions based on the environment (Sutton and Barto, 1998). A putative mechanism of the prediction error signal for reward is the phasic firing of dopamine neurons in the midbrain (Montague et al., 1996; Schultz et al., 1997). These neurons innervate large parts of the brain, including the striatum, the major input structure of the basal ganglia. In keeping with this anatomical arrangement, the striatum has often been implicated in reward-based learning and its modulation by dopamine (Cools and Robbins, 2004; Frank, 2005; Pessiglione et al., 2006; Schonberg et al., 2007), and reward-based learning is modulated by agonists of D2/D3 receptors that are abundant in the striatum (Frank and O'Reilly, 2006; Pizzagalli et al., 2007). Schonberg et al. (2007) have recently proposed that individual differences in reward-based learning may reflect differences in striatal dopamine function. However, there is no direct evidence for this hypothesis. Here we demonstrate a significant positive relationship between reward-based learning and baseline striatal dopamine synthesis capacity, as measured with uptake of the positron emission tomography (PET) tracer [18F]fluorometatyrosine (FMT).

We further establish the link between dopamine in the striatum and reward-based learning by showing that effects of dopamine D2 receptor stimulation also depend on baseline striatal dopamine synthesis capacity. Evidence from studies with experimental animals (Williams and Goldman-Rakic, 1995; Zahrt et al., 1997; Arnsten, 1998) has revealed an ‘inverted U’-shaped relationship between D1 receptor stimulation in the prefrontal cortex and cognitive performance. This relationship has been related to baseline-dependency of drug effects, so that low baseline dopamine levels are remedied, while high baseline dopamine levels are detrimentally over-dosed by the same dopamine D1 receptor agonist (Phillips et al., 2004). Although recent studies with humans, which have made use of genetic variation in the D2 receptor gene, have suggested that a similar mechanism might underlie contrasting effects of D2 receptor stimulation in the striatum on reward-based learning (Frank and O'Reilly, 2006; Cohen et al., 2007), direct evidence for baseline-dependency of dopaminergic drug effects on reward-learning in the striatum is lacking. We combined neurochemical PET imaging with behavioural psychopharmacology to test this hypothesis. We studied the effects of a single oral dose (1.25 mg) of the dopamine D2 receptor agonist bromocriptine on reversal learning in young healthy volunteers, who also, on a separate occasion, underwent a PET scan with the tracer FMT. Subjects with low synthesis capacity were predicted to benefit from D2 receptor stimulation with bromocriptine, while subjects with high synthesis capacity were predicted to be detrimentally overdosed by the same drug.

Methods

General procedure

The UC Berkeley Committee for the Protection of Human Subjects approved the procedures, which were in accord with the Helsinki Declaration of 1975.

Eleven subjects (all female; mean [standard deviation] age = 22.2 [2.0]) underwent a single PET scan with the tracer 6-[18F]fluoro-L-m-tyrosine (FMT). The PET data from these subjects were previously reported in relation to their working memory capacity as measured with the listening span test (Cools et al., 2008b). Detailed neuropsychological characteristics of the subjects are presented in that previous paper. All subjects were screened for psychiatric and neurological disorders; exclusion criteria were any history of cardiac, hepatic, renal, pulmonary, neurological, psychiatric or gastrointestinal disorders, an episode of loss of consciousness, use of psychotropic drugs or sleeping pills, or heavy marijuana use (>10 times in a lifetime).

Subjects were selected from a sample of subjects that had also participated in a psychopharmacological study on the effects of bromocriptine (Cools et al., 2007). Initial selection of subjects for this study was based on their high or low scores on the Barratt Impulsiveness Inventory (BIS-11; Patton et al., 1995). However, there was no relationship between dopamine synthesis capacity and trait impulsivity, as described in our previous report on the PET data from these subjects (Cools et al., 2008b) (all Pearson correlations < 0.1). As part of this study, subjects completed an established observational reversal learning task (Figure 1) (Cools et al., 2006; Cools et al., 2008a) on two occasions, once after intake of placebo and once after intake of bromocriptine, in a double-blind, placebo-controlled cross-over design. The dose of bromocriptine (1.25 mg) was selected based on previously observed changes in cognitive performance (Kimberg et al., 1997; Gibbs and D'Esposito, 2005) and minimal side effects (Luciana et al., 1992; Luciana and Collins, 1997). The order of bromocriptine and placebo testing was approximately counterbalanced (six subjects received bromocriptine on the first session). One bromocriptine dataset from the punishment condition was missing. The reversal learning task was administered approximately 3.5 hours after capsule intake, coinciding with the time-window of maximal drug effects (Drewe et al., 1988; Lynch, 1997). Subjects were instructed to have a light meal approximately 1 hour before arrival and they were asked to refrain from caffeine and cigarettes on the days of testing.

PET imaging and analysis

FMT is a substrate of L-aromatic amino acid decarboxylase and an index of presynaptic dopamine synthesis capacity, i.e. processes that occur in striatal terminals of midbrain dopamine neurons. The tracer was synthesized with a modification of the procedure as previously reported (Namavari et al., 1993). Scanning procedures, data analysis and region of interest procedures were also as previously reported (Cools et al., 2008b).

All subjects were scanned approximately 60 min after administration of an oral dose of 2.5 mg/kg of the peripheral decarboxylase inhibitor carbidopa in order to increase brain uptake of the tracer. Participants were positioned on the scanner bed with a pillow and an elastic band to comfortably restrict head motion. Images (voxel size 3.6 mm × 3.6 mm in-plane with 4 mm slice thickness) were obtained on a Siemens ECAT EXACT HR scanner in 3D acquisition mode. A 10 min transmission scan was obtained for attenuation correction; approximately 2.5 mCi of FMT was then injected as a bolus into an antecubital vein, and a dynamic acquisition sequence in 3D mode was obtained: 4 × 1 min, 3 × 2 min, 3 × 3 min, 14 × 5 min for a total of 89 min of scan time.
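The frame sequence above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function and variable names are illustrative and are not taken from the original acquisition software.

```python
# Frame schedule as stated in the text: (number of frames, frame length in minutes).
FRAME_SCHEDULE = [(4, 1), (3, 2), (3, 3), (14, 5)]


def frame_times(schedule):
    """Return (start, end) times in minutes for every frame in the sequence."""
    times = []
    t = 0
    for n_frames, duration in schedule:
        for _ in range(n_frames):
            times.append((t, t + duration))
            t += duration
    return times


frames = frame_times(FRAME_SCHEDULE)
n_frames = len(frames)
total_minutes = frames[-1][1]
print(n_frames, total_minutes)
```

Under this schedule the sequence contains 24 frames spanning 89 min, and the last four frames cover 69 to 89 min.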

Data were reconstructed using an ordered subset expectation maximization (OSEM) algorithm with weighted attenuation, an image size of 256 × 256, and 6 iterations with 16 subsets. A Gaussian filter with 6 mm FWHM was applied, with a scatter correction. Images were evaluated for subject motion and realigned as necessary using algorithms implemented in SPM2.

Structural scans (high-resolution MP-FLASH; 0.875 × 0.875 × 1.54 mm) were obtained during the prior MRI study (structural scans were not available for two out of eleven subjects). This permitted the use of the high-resolution MRI image for anatomical verification of the localization of functional PET ROIs. Bilateral cerebellar ROIs (10 mm spheres) were used as a reference ROI in conjunction with ROIs in striatum and a simplified reference tissue model with a graphical analysis approach (Patlak and Blasberg, 1985; Lammertsma and Hume, 1996). This leads to an influx constant Ki, which reflects regional FMT uptake scaled to the volume of distribution in the reference region.

In order to test hypotheses about differences in subregions of the striatum, we defined ROIs in the right and left caudate and putamen. An axial image representing the sum of the last four emission scans of the PET scanning session (4 × 5 min frames) was coregistered to the high-resolution MR scan using a 12-parameter affine algorithm implemented in SPM2. ROIs were drawn on these images (Wang et al., 1996; Volkow et al., 1998), using the atlas of Talairach and Tournoux (Talairach and Tournoux, 1988) to delineate the caudate and putamen. Regions were drawn on data in native space in order to preserve differences in tracer uptake due to anatomical variability between subjects. We have previously demonstrated the ability to draw ROIs with high inter-rater reliability (Klein et al., 1997). The Patlak model was fitted with dynamic data from each ROI from 24 to 89 minutes, when the regression is highly linear (r > 0.99).
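For readers unfamiliar with graphical analysis, the idea of a reference-region Patlak fit can be sketched as follows. This is an illustration on synthetic time-activity curves, not the authors' pipeline: the use of frame mid-times, trapezoidal integration, the synthetic reference curve and all function names are assumptions made for the example. Ki is estimated as the slope of the line relating the ROI-to-reference ratio to the normalized integral of the reference curve over the 24 to 89 min window.

```python
import numpy as np


def cumulative_integral(t, c):
    """Trapezoidal integral of c from time 0 to each t, assuming c(0) = 0."""
    t0 = np.concatenate(([0.0], t))
    c0 = np.concatenate(([0.0], c))
    return np.cumsum(np.diff(t0) * (c0[1:] + c0[:-1]) / 2.0)


def patlak_reference_ki(t_mid, c_roi, c_ref, t_start=24.0, t_end=89.0):
    """Slope (Ki) and linearity (r) of the reference-region Patlak plot."""
    t_mid, c_roi, c_ref = (np.asarray(v, dtype=float) for v in (t_mid, c_roi, c_ref))
    x = cumulative_integral(t_mid, c_ref) / c_ref  # "stretched" time
    y = c_roi / c_ref                              # ROI-to-reference ratio
    fit = (t_mid >= t_start) & (t_mid <= t_end)    # frames used in the fit
    slope, _intercept = np.polyfit(x[fit], y[fit], 1)
    r = np.corrcoef(x[fit], y[fit])[0, 1]
    return slope, r


# Synthetic demonstration (hypothetical curves, arbitrary units).
durations = np.array([1.0] * 4 + [2.0] * 3 + [3.0] * 3 + [5.0] * 14)
t_mid = np.cumsum(durations) - durations / 2.0
c_ref = 10.0 * t_mid * np.exp(-t_mid / 15.0)
true_ki = 0.02
c_roi = true_ki * cumulative_integral(t_mid, c_ref) + 1.2 * c_ref
ki_hat, r = patlak_reference_ki(t_mid, c_roi, c_ref)
print(ki_hat, r)
```

Because the synthetic ROI curve is constructed to follow the Patlak relation exactly, the fit recovers the chosen slope; real data would show scatter around the line.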

It might be noted that there was a delay between the acquisition of the PET data and that of the behavioural data (mean 16.7 months; standard deviation 2.4 months). Uptake of the tracer [18F]fluorometatyrosine is thought to reflect a relatively stable process (synthesis capacity) that is not particularly sensitive to small state-related changes, much like uptake of the tracer F-DOPA. Indeed, a study by Vingerhoets et al. (1994) demonstrated that striatal Ki is a reliable measurement, with a 95% chance of lying within 18% of its value within an individual normal subject. We argue that the delay does not confound our results, but rather renders them more striking. Any instability in the PET measurement across the delay would have reduced rather than enhanced the likelihood of obtaining the results, which were statistically controlled for noise by an alpha level of 0.05. Data analyses supported this hypothesis, as the effects were stronger when effects of interest were corrected for the delay between the two sessions than when they were not. Here we report only those analyses in which we corrected for the delay, by entering it as a covariate, although all effects were also significant when they were not corrected for the delay. Furthermore, in the Supplemental Results C, we report an additional analysis, explicitly addressing in a quantitative manner the possibility that the effect reflects noise.

Experimental paradigm

The task required the learning and reversal of predictions of reward and punishment. On each trial, two vertically adjacent stimuli were presented, one face and one scene (location randomized; viewing distance about 19 in.; subtending about 3° horizontally and 3.5° vertically). One of the stimuli was associated with reward, while the other was associated with punishment. On each trial, one of the two stimuli was highlighted with a black border surrounding the stimulus and subjects had to predict, based on trial and error learning, whether the highlighted stimulus would lead to reward or punishment. Subjects indicated their predictions by pressing, with the index or middle finger, one of two coloured buttons (corresponding to keys ‘b’ and ‘n’ depending on the response-outcome mapping) on a laptop keyboard. They pressed the green button for reward and the red button for punishment. The outcome-response mappings were counterbalanced between subjects. The (self-paced) response was followed by an interval of 1000 ms, after which the outcome was presented for 500 ms. Note that this outcome did not provide direct performance feedback. Reward consisted of a green smiley face, a “+$100” sign and a high-frequency jingle tone. Punishment consisted of a red sad face, a “−$100” sign and a single low-frequency tone. After the outcome, the screen was cleared for 500 ms, after which the next two stimuli were presented. The stimulus-outcome contingencies reversed multiple times provided learning criteria were met.

Each subject performed one practice block and four experimental blocks. The practice block consisted of one acquisition stage and one reversal stage (learning criterion was 20 [not necessarily consecutive] correct trials). Each experimental block consisted of one acquisition stage and a variable number of reversal stages. The task proceeded from one stage to the next following a number of consecutive correct trials, as determined by a pre-set learning criterion. Learning criteria (i.e. the number of consecutive correct trials following which the contingencies changed) varied between stages (mean = 6.9; S.D. = 1.8; range from 5 to 9), in order to prevent predictability of reversals. The maximum number of reversal stages per experimental block was sixteen, although the block terminated automatically after completion of 120 trials (~6.6 min), so that each subject performed 480 trials (4 blocks) per experimental session. The mean number of stages completed is reported in Supplementary Table 1.
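The stage-progression rule for an experimental block can be illustrated with a short sketch. It assumes that a stage ends as soon as the run of consecutive correct trials reaches that stage's criterion, and that the block stops after 120 trials or after the sixteenth reversal stage has been completed; the function name and the example criteria are hypothetical and do not reproduce the original task code.

```python
def count_completed_stages(correct, criteria, max_trials=120, max_reversals=16):
    """Count the stages completed in one experimental block.

    correct:  iterable of booleans, one per trial (True = correct prediction).
    criteria: consecutive-correct criterion per stage, acquisition stage first;
              must hold at least max_reversals + 1 entries.
    """
    stage = 0      # 0 = acquisition, 1..max_reversals = reversal stages
    streak = 0     # current run of consecutive correct trials
    completed = 0
    for trial, is_correct in enumerate(correct):
        if trial >= max_trials or stage > max_reversals:
            break
        streak = streak + 1 if is_correct else 0
        if streak >= criteria[stage]:
            completed += 1
            stage += 1
            streak = 0
    return completed


# Hypothetical example: a subject who never errs, with a fixed criterion of 9.
print(count_completed_stages([True] * 120, [9] * 17))
```

In the task itself the criteria varied between 5 and 9 from stage to stage; a fixed criterion is used here only to keep the example easy to follow.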

The task consisted of two conditions (two blocks per condition). In the unexpected reward condition, reversals were signalled by unexpected reward occurring after the previously punished stimulus was highlighted. Conversely, in the unexpected punishment condition, reversals were signalled by unexpected punishment following the previously rewarded stimulus. Accuracy on the trial following the unexpected outcome (“switch trials”) reflected the degree to which subjects updated their predictions based on unexpected outcomes. The stimulus that was highlighted on the first trial of each reversal stage (on which the unexpected outcome was presented) was always highlighted again on the second trial of that stage (i.e. the switch trial on which the subject had to implement the reversed contingencies and switch their predictions).

Based on prior data (Frank et al., 2004; Cools et al., 2006; Frank and O'Reilly, 2006), we predicted that bromocriptine would have contrasting effects on reward- and punishment-based reversal learning. Following this prior work, we were specifically interested in the drug effect on the balance between (reversal) learning from reward and from punishment. Therefore, relative reversal learning scores were calculated, by which accuracy scores on punishment-based switch trials were subtracted from accuracy scores on reward-based switch trials. The additional advantage of this method is that it controls for within-subject variability due to other factors such as arousal, attention and motivation, which would have affected each measure in the same direction. Thus, general effects of the drug that were not specific to the ability to learn from reward or punishment were subtracted out. Further, we report drug effects (differences between the placebo and the bromocriptine session), because it is these drug effects that are of primary interest in the present study. Finally, we also report reward- and punishment-based reversal learning under placebo and under bromocriptine separately.
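The two derived measures described above reduce to simple differences. The sketch below uses hypothetical accuracy values for a single subject. The sign convention for the drug effect (bromocriptine minus placebo, so that positive values indicate a drug-related shift towards reward-based reversal learning) is an assumption made for illustration, since the text does not state the direction of subtraction.

```python
def relative_reversal_score(reward_accuracy, punishment_accuracy):
    """Reward-based minus punishment-based switch-trial accuracy."""
    return reward_accuracy - punishment_accuracy


def drug_effect(score_bromocriptine, score_placebo):
    """Drug minus placebo (assumed sign convention)."""
    return score_bromocriptine - score_placebo


# Hypothetical switch-trial accuracies (proportion correct) for one subject.
placebo = relative_reversal_score(reward_accuracy=0.75, punishment_accuracy=0.875)
bromocriptine = relative_reversal_score(reward_accuracy=0.875, punishment_accuracy=0.75)
effect = drug_effect(bromocriptine, placebo)
print(placebo, bromocriptine, effect)
```

Any factor that shifts reward and punishment accuracy by the same amount in a session cancels in the relative score, which is the rationale given in the text for using it.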

Statistical analysis

Mean proportions of correct responses on the learning task were calculated as reported previously (Cools et al., 2006; Cools et al., 2008a). Repeated measures ANOVAs were conducted using SPSS 15.0 with drug and valence as within-subject factors and dopamine synthesis capacity as a covariate. The delay between acquisition of the drug and PET data was also entered as a covariate. Pearson product-moment correlation coefficients were calculated between Ki-values extracted from regions of interest and behavioural data. All correlations represent partial correlations, correcting for the delay between PET and drug data acquisition. All correlations were also significant without this correction. The distribution of none of the parameters assessed here deviated from normality as indicated by Kolmogorov-Smirnov tests (all Ps = 0.2).
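As an illustration of what a partial correlation that corrects for the acquisition delay computes, the sketch below regresses the covariate out of both variables and correlates the residuals, then checks the result against the standard first-order formula. It uses NumPy rather than SPSS, and the data are randomly generated placeholders, not study data.

```python
import numpy as np


def partial_corr(x, y, z):
    """Pearson correlation between x and y after regressing z out of both."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    design = np.column_stack([np.ones_like(z), z])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]


def partial_corr_formula(x, y, z):
    """Same quantity from the three pairwise Pearson correlations."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Placeholder data for 11 hypothetical subjects.
rng = np.random.default_rng(0)
delay = rng.normal(size=11)          # stands in for the PET-to-behaviour delay
ki = delay + rng.normal(size=11)     # stands in for striatal Ki
score = delay + rng.normal(size=11)  # stands in for a behavioural score
r_partial = partial_corr(ki, score, delay)
print(r_partial, partial_corr_formula(ki, score, delay))
```

With one covariate, the partial correlation has n − 3 degrees of freedom, i.e. 8 for a complete sample of 11 subjects.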

All reported P-values are two-tailed.

Results

In our sample of young healthy volunteers, influx constant Ki values varied between 0.018 and 0.027, falling well within the range of ‘normal’ values observed previously (Eberling et al., 2007). Subjects performed well on the reversal learning task, with an average accuracy rate on trials after the unexpected outcomes greater than 90% (Supplemental Table 2).

In supplementary analyses, we aimed to disentangle two alternative hypotheses regarding dopaminergic modulation. Specifically, to establish whether the effects described here reflect a modulation of learning or switching, we applied computational reinforcement learning algorithms to fit individual subjects' trial-by-trial sequence of choices (Sutton and Barto, 1998; Frank et al., 2007b). These algorithms allowed us to generate learning-rate parameters (separately for reward and punishment) that were not directly observable in the behavioural data. Detailed methods and results are presented in the Supplementary Materials. Critically, a significant relationship was obtained between dopamine synthesis and the drug effect on reward learning rate (r8 = −0.71, P = 0.02), as well as between dopamine synthesis and the drug effect on punishment learning rate (r10 = 0.78, P = 0.01) (Supplementary Figure and Table 3).
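The actual model is specified in the Supplementary Materials and is not reproduced here. Purely to illustrate what separate learning rates for reward and punishment mean, the following sketch implements a generic delta-rule learner with a logistic choice rule. The outcome coding (+1 for reward, −1 for punishment), the logistic choice rule, the inverse-temperature parameter `beta` and all function names are assumptions made for the example.

```python
import math


def update_value(value, outcome, alpha_reward, alpha_punish):
    """Delta-rule update with an outcome-specific learning rate.

    outcome is coded +1 for reward and -1 for punishment (assumed coding).
    """
    delta = outcome - value  # prediction error
    alpha = alpha_reward if outcome > 0 else alpha_punish
    return value + alpha * delta


def p_predict_reward(value, beta):
    """Logistic probability of predicting reward for the highlighted stimulus."""
    return 1.0 / (1.0 + math.exp(-beta * value))


def negative_log_likelihood(params, trials):
    """Negative log-likelihood of a sequence of predictions.

    params: (alpha_reward, alpha_punish, beta)
    trials: iterable of (stimulus, predicted_reward, outcome) tuples.
    """
    alpha_reward, alpha_punish, beta = params
    values = {}
    nll = 0.0
    for stimulus, predicted_reward, outcome in trials:
        v = values.get(stimulus, 0.0)
        p = p_predict_reward(v, beta)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        nll -= math.log(p if predicted_reward else 1.0 - p)
        values[stimulus] = update_value(v, outcome, alpha_reward, alpha_punish)
    return nll


# Hypothetical two-trial sequence for one stimulus.
example_trials = [("face", True, 1.0), ("face", True, -1.0)]
print(negative_log_likelihood((0.5, 0.2, 3.0), example_trials))
```

Minimizing this quantity over the parameters for each subject and session would yield one reward and one punishment learning rate per session, whose drug-related difference could then be related to Ki.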

In summary, higher dopamine synthesis capacity in the striatum was associated with better reward-based reversal learning under placebo. Furthermore, bromocriptine improved reward-based reversal learning in subjects with low synthesis capacity, while impairing it in subjects with high synthesis capacity. Conversely, the same drug dose impaired punishment-based reversal learning in subjects with low synthesis capacity, while improving it in subjects with high synthesis capacity.

Discussion

Baseline dopamine measures predicted reversal learning due to reward prediction errors relative to punishment prediction errors. The result provides the first empirical evidence for the pervasive, but hitherto untested hypothesis that individual variation in reward-based learning reflects quantitative variation in baseline levels of striatal dopamine function, as indexed by uptake of a PET dopamine synthesis tracer. Critically, the effect was outcome-specific, so that high dopamine synthesis was associated with a bias towards reward- relative to punishment-based reversal learning. This observation concurs with recent theoretical models and empirical work (Frank et al., 2004; Frank, 2005). For example, patients with Parkinson's disease, which is characterized by severe dopamine depletion in the striatum, exhibit difficulty with learning from reward relative to punishment, as measured with the present paradigm (Cools et al., 2006) as well as a probabilistic selection task (Frank et al., 2004). Furthermore, the common dopamine-enhancing antiparkinson medication reversed this bias, leading to difficulty with learning from punishment relative to reward (Cools et al., 2006; Frank et al., 2004). The present data demonstrate that individual differences in baseline levels of dopamine in the healthy population also predict reward- relative to punishment-learning biases.

Perhaps most critically, this work provides the first direct evidence for the existence of an ‘inverted U’-shaped relationship between striatal function and dopamine D2 receptor stimulation. Based on evidence from experimental animals (Skirboll et al., 1979; Torstenson et al., 1998), we argue that this curvilinear dose-response curve might reflect differential stimulation of pre- versus postsynaptic D2 receptors in high- versus low-baseline subjects respectively. Specifically, we hypothesize that the established self-regulatory mechanism of presynaptic action of bromocriptine, by which dopamine cell firing, release and/or synthesis are inhibited, is more pronounced in subjects with already high baseline levels of synaptic dopamine than in subjects with low baseline levels of dopamine. Furthermore, disproportionate efficacy of self-regulatory (presynaptic) mechanisms in high-baseline subjects might be accompanied by desensitization of postsynaptic D2 receptors, thereby further reducing the postsynaptic efficacy of bromocriptine. Thus bromocriptine might have paradoxically reduced synaptic dopamine levels, thereby impairing reward-based learning, via a presynaptic mechanism of action in high-baseline subjects. Conversely, we hypothesize that the same drug enhanced reward-based learning by increasing dopamine transmission via a postsynaptic mechanism of action in low-baseline subjects.

The effects likely reflect general associative learning from unexpected outcomes rather than switching per se, as demonstrated by the supplementary model-based analyses. Although the rapid updating required in the current task, on which subjects expressed high learning rates, is different from the slower types of incremental probabilistic learning (Cools et al., 2001; Frank et al., 2004), which require the integration of outcomes across more distant histories, we hypothesize that similar associations will be observed between striatal dopamine synthesis and updating during slow learning. Consistent with this hypothesis is the observation that effects of dopaminergic manipulations on incremental positive reinforcement learning correlate with effects on rapid working memory updating (Frank and O'Reilly, 2006; Frank et al., 2007a), suggestive of similar dopaminergic influences on parallel neurobiological circuits.

One potential caveat of the present study is the considerable delay between the acquisition of the PET data and that of the behavioural data. We argue that this delay does not confound our results for the following reasons. First, there is evidence that measures of dopamine synthesis capacity are relatively stable in healthy volunteers, even across many years. Specifically, Vingerhoets et al. (1994b) have shown that the fluorodopa PET index decreased non-significantly over 7 years by 0.3% per year. In addition, the reliability of change coefficient was 96%, confirming their previous study showing that striatal activity measured with PET is a highly reproducible measurement (Vingerhoets et al., 1994a), although we acknowledge that our age group was younger than the one studied by Vingerhoets et al. Second, the effects were statistically controlled for noise by an alpha level of 0.05 and remained highly significant after statistical correction for the acquisition delay. Finally, any instability in the PET measurement across the delay would have reduced rather than enhanced the likelihood of obtaining the result. There is a possibility that, if there had been no delay, the correlation between synthesis capacity and behavioural data might have been even stronger. Therefore, the reported correlations might represent noisy versions of the true correlations.

Our data not only elucidate a fundamental mechanism underlying the behavioural role of striatal dopamine, but also identify an important neurobiological factor, i.e. baseline striatal dopamine synthesis, that contributes to the large variability in dopaminergic drug efficacy. This finding should have far-reaching implications for individualized drug development in neuropsychiatry, where variable drug efficacy poses a major problem for the treatment of patients with heterogeneous spectrum disorders like schizophrenia, attention deficit/hyperactivity disorder and drug addiction.

Acknowledgments

We thank Lee Altamirano, Elizabeth Kelley, George Elliott Wimmer and Emily Jacobs for assistance with data collection and Cindee Madison for assistance with data analysis. The work was supported by NIH grants MH63901, NS40813, DA02060 and AG027984.

Footnotes

Competing financial interests. The authors declare that they have no competing financial interests.

Author contributions. R.C. and M.D. conceived of the study. A.M. and S.E.G. collected and analyzed the PET data, while R.C. analyzed and interpreted the psychopharmacological data in relation to the PET data. W.J. developed methods for acquisition and analysis of PET data. M.F. performed the model-based analyses and helped R.C. interpret the data and write the paper. All authors discussed the results and commented on the paper.