Abstract

Electrophysiological data disclose rich dynamics in patterns of neural activity evoked by sensory objects. Retrieving objects from memory reinstates components of this activity. In humans, the temporal structure of this retrieved activity remains largely unexplored, and here we address this gap using the spatiotemporal precision of magnetoencephalography (MEG). In a sensory preconditioning paradigm, 'indirect' objects were paired with 'direct' objects to form associative links, and the latter were then paired with rewards. Using multivariate analysis methods we examined the short-time evolution of neural representations of indirect objects retrieved during reward-learning about direct objects. We found two components of the evoked representation of the indirect stimulus, 200 ms apart. The strength of retrieval of one, but not the other, representational component correlated with generalization of reward learning from direct to indirect stimuli. We suggest the temporal structure within retrieved neural representations may be key to their function.

eLife digest

Seeing an object triggers a complex and carefully orchestrated dance of brain activity. The spatial pattern of the brain activity encoding the object can change multiple times even within the first second of seeing the object. These rapid changes appear to be a core feature of how the brain understands and processes objects.

Yet little is known about how these patterns unfold through time when we remember an object. Remembering, or retrieving information about objects, is how we use our knowledge of the world to make good decisions. It is not clear whether, during remembering, there are rapid changes in the patterns similar to those that happen when directly seeing an object. Mapping brain activity during remembering could help us understand how stored information can guide decisions.

Using recently developed methods in brain imaging and statistics, Kurth-Nelson et al. found that two distinct patterns of brain activity appeared when viewing particular objects. One occurred around 200 milliseconds after viewing an object, and the other appeared a bit later, by about 400 milliseconds. Later, when remembering the object, these patterns reappeared in the brain, but at different points in time. Furthermore, these two patterns had distinct roles in learning associated with the objects to guide later decisions.

This work shows that rapid changes in the pattern of neuronal activity are central to how stored information is retrieved and used to make decisions.

At retrieval, recent studies have explored the fast evolution of neural representation. EEG studies provide evidence that some information is retrieved as early as 300 ms following a cue (e.g., Johnson et al., 2008; Yick and Wilding, 2008; Wimber et al., 2012). Manning et al. (2011, 2012), using electrocorticography, and Jafarpour et al. (2014), using MEG, showed that oscillatory patterns are also reinstated during retrieval; in two of these studies the predominance of low oscillatory frequencies in reinstatement suggests a potential spectral signature.

However, the dynamics of representation during direct experience of an object have never been tied to the dynamics of retrieval. It is not known which of the patterns evoked in sequence by direct experience are reinstantiated during retrieval, what the temporal relationship is in their retrieval, or what functional significance this has.

Recent advances in multivariate methods for MEG have greatly improved our ability to discern fast-changing distributed representations in humans (Carlson et al., 2013; Cichy et al., 2014; Jafarpour et al., 2013, 2014; van de Nieuwenhuijzen et al., 2013; Sandberg et al., 2013). Here, we apply these methods to a simple sensory preconditioning task adapted from Wimmer and Shohamy (2012). Sensory preconditioning is a well-established paradigm in which subjects first form an association between two stimuli (‘direct’ or Sd and ‘indirect’ or Si) and then form an association between the direct stimulus and a reward (Brogden, 1939). Generalization of value to the indirect stimulus is evidence of retrieving the learned association (Gewirtz and Davis, 2000). Using fMRI, Wimmer and Shohamy (2012) showed that neural representations of the associated indirect stimulus are reinstated when direct stimuli are presented during the Reward-learning phase, and this retrieval is linked to the generalization of value from direct to indirect stimuli. This suggests that reinstatement through the learned associative link may be part of the mechanism for value updating. Our aim here is to explore the temporal structure of this reinstatement, which may help to shed light on the mechanisms of value updating as well as providing general insight into the dynamics of representations during retrieval.

We therefore examined retrieval in the same paradigm, using MEG to gain temporal precision. We show that the neural representation of the indirect stimulus can be decomposed into at least two temporal components with distinct properties, and these are retrieved at different times during the Reward-learning phase. The retrieval of only one of these components is correlated with a behavioral measure of the generalization of value across the learned associations.

Results

Behavior

We used a slightly modified version of the behavioral task employed by Wimmer and Shohamy (2012). This involved three phases (Figure 1A). In the Association phase, subjects watched visual stimuli appearing sequentially at the center of the screen. The stimuli alternated between photographs (‘Si’) and circular fractals (‘Sd’), with a short blank fixation interval between each stimulus. Each Si came from one of three categories (face/body/scene), and each unique Si was deterministically followed by a unique Sd, thus establishing a pairing between Si and Sd images. As in Wimmer and Shohamy (2012), debriefing revealed that subjects were not aware of the Si–Sd pairings. There were two unique Si in each category, making for a total of six unique Si and six unique Sd stimuli used for later phases (along with six additional unique Si and six additional unique Sd that functioned as dummies for the Association phase cover task and were included in the imaging analysis).

Task design and behavior.

Subjects participated in a sensory preconditioning task comprising three phases: Association, Reward and Decision. (A) In the Association phase, subjects were exposed to pairs of stimuli (presented sequentially). One member (called Si) of each pair was taken from one of three classes (faces, bodies, and scenes); the other member (Sd) was a fractal. In the Reward phase, some of the fractals (labelled Sd+) were paired with reward; the others (labelled Sd−) were not. Through the pairing, this implicitly established a separation between Si+ and Si−. In the Decision phase, subjects chose between Si+ and Si− within the same category, or between Sd+ and Sd−. All photos shown are from pixabay.com and are in the public domain. (B) In the Decision phase, subjects displayed a strong preference for Sd+ over Sd− (p = 6.9 × 10⁻⁴, one-sample t-test). There was no preference at the group level for Si+ over Si−, but we exploited the variability between subjects for value-related analyses. The change in relative liking from before to after the experiment was more positive for Sd+ than Sd− (p = 0.04, one-sample t-test), but there was no significant difference between the changes for Si+ and Si−. Bar heights show group means and dots show individual subjects. Error bars show standard error of the mean.

In the Reward phase, of the two Sd images associated with a category of Si images, one, which we therefore call Sd+, was followed by a reward on 14 out of 18 presentations (and otherwise by a neutral outcome, a blue square); the other, which we call Sd−, was always followed by a neutral outcome. By virtue of the prior pairing, this established an Si+ and Si− for each category.

In the Decision phase, subjects were faced with pairwise choices between an Si+ and an Si−, or an Sd+ and an Sd−. The two items always had the same category (face/body/scene) for Si, or associated category for Sd. Subjects exhibited a strong preference for Sd+ over Sd− (p = 6.9 × 10⁻⁴), but as a group showed no evidence of preferring Si+ over Si− (p = 0.9) (Figure 1B).

Distinct temporal components of neural object representation

Neural activity was recorded by magnetoencephalography (MEG) during all three phases. We first explored where in space and time the MEG signal carried information about the Si stimuli being presented in the Association phase. Using one-way ANOVA, we found that the raw amplitude, in single time bins, of the event-related field (ERF) at many individual sensors was significantly related to the Si category (Figure 2). (The significance threshold was set to the 95th percentile of the peak-level statistic, taken over space and time, from 100 random category label shuffles, to correct conservatively for multiple comparisons.)
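The mass-univariate test with a peak-level permutation threshold can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the function names, array shapes and effect sizes are assumptions. It thresholds the peak F statistic rather than log10(p); with fixed degrees of freedom the two orderings are equivalent.

```python
import numpy as np

def f_oneway_mass(data, labels):
    """One-way ANOVA F statistic at every (sensor, time bin).

    data:   array of shape (n_trials, n_sensors, n_times)
    labels: array of shape (n_trials,) holding category ids
    """
    groups = np.unique(labels)
    grand_mean = data.mean(axis=0)
    ss_between = np.zeros(data.shape[1:])
    ss_within = np.zeros(data.shape[1:])
    for g in groups:
        x = data[labels == g]
        group_mean = x.mean(axis=0)
        ss_between += len(x) * (group_mean - grand_mean) ** 2
        ss_within += ((x - group_mean) ** 2).sum(axis=0)
    df_between = len(groups) - 1
    df_within = len(labels) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def peak_level_threshold(data, labels, n_shuffles=100, percentile=95, seed=0):
    """Percentile of the null distribution of the peak F over sensors and time."""
    rng = np.random.default_rng(seed)
    peaks = [f_oneway_mass(data, rng.permutation(labels)).max()
             for _ in range(n_shuffles)]
    return np.percentile(peaks, percentile)

# Synthetic demo: all sizes and effect strengths are made up for illustration.
rng = np.random.default_rng(1)
n_trials, n_sensors, n_times = 90, 8, 20
labels = np.repeat([0, 1, 2], n_trials // 3)
data = rng.normal(size=(n_trials, n_sensors, n_times))
data[labels == 1, 3, 10:] += 2.0  # sensor 3 separates category 1 from bin 10 on

f_map = f_oneway_mass(data, labels)
threshold = peak_level_threshold(data, labels)
significant = f_map > threshold  # boolean map of (sensor, time bin)
```

Because the threshold comes from the maximum over all sensors and time bins in each shuffle, a single comparison against it controls the family-wise error rate across the whole map.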

Event-related field (ERF) discriminates between categories (face/body/scene) at time of Si presentation.

Sensors became category-discriminative in two waves. (A) The first time, relative to stimulus onset, when the relationship between ERF amplitude and category membership became significant by ANOVA (significance threshold set at 95% of peak-level (across all sensors and all time) log10(p) of 100 shuffles) at each of 275 sensors. Many occipital and temporal sensors first became predictive of Si category between 90 and 230 ms post stimulus onset, followed by some parietal and frontal sensors ranging from 330–550 ms post stimulus onset. Open circles indicate the sensors that never reached 95% peak-level. (B) Histogram of how many sensors first became significantly discriminative at each time following stimulus presentation.

Next, we built a multivariate linear SVM classifier, which combined the reports of multiple sensors (Figure 3A). As in many previous studies (cf. Norman et al., 2006; Cichy et al., 2014), the extra sensitivity achieved by combining multiple features supported the use of multivariate analysis to track neural representations (Figure 3—figure supplement 1). We constructed null distributions at each time bin by repeating this procedure 100 times with randomly shuffled category labels. At 200 ms post-stimulus, the 95th percentile of the null distribution was 35.0% accuracy, and the median was 33.7% (deviating from 1/3rd only due to the finite number of shuffles).
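A time-resolved decoding analysis with a shuffled-label null can be sketched as below. Note the substitution: to keep the example dependency-free, a nearest-mean classifier stands in for the linear SVM, and the fold count, data sizes and category patterns are hypothetical.

```python
import numpy as np

def nearest_mean_predict(train_x, train_y, test_x):
    """Assign each test pattern to the class with the closest mean training pattern."""
    classes = np.unique(train_y)
    means = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    dists = ((test_x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

def cv_accuracy(x, y, n_folds=5, seed=0):
    """k-fold cross-validated accuracy at one time bin; x is (n_trials, n_sensors)."""
    order = np.random.default_rng(seed).permutation(len(y))
    n_correct = 0
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        pred = nearest_mean_predict(x[train_idx], y[train_idx], x[test_idx])
        n_correct += (pred == y[test_idx]).sum()
    return n_correct / len(y)

def decode_over_time(data, labels):
    """Accuracy at each time bin; data is (n_trials, n_sensors, n_times)."""
    return np.array([cv_accuracy(data[:, :, t], labels)
                     for t in range(data.shape[2])])

# Synthetic demo: three categories, each driving its own block of sensors
# during time bins 5-7 only (an assumption made purely for illustration).
rng = np.random.default_rng(0)
n_trials, n_sensors, n_times = 90, 15, 12
labels = np.repeat([0, 1, 2], n_trials // 3)
patterns = np.zeros((3, n_sensors))
for c in range(3):
    patterns[c, c * 5:(c + 1) * 5] = 1.5
data = rng.normal(size=(n_trials, n_sensors, n_times))
data[:, :, 5:8] += patterns[labels][:, :, None]

accuracy = decode_over_time(data, labels)
# Null distribution: repeat the whole procedure with shuffled category labels.
null = np.array([decode_over_time(data, rng.permutation(labels))
                 for _ in range(100)])
null_95 = np.percentile(null, 95, axis=0)
```

With three balanced categories the null accuracies scatter around 1/3, so `null_95` plays the role of the per-time-bin 95th percentile quoted in the text.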

(A) Multivariate decoding performed well to predict the category of photograph (Si) in the Association phase. Cross-validated linear SVM prediction accuracy using all 275 sensors at each time bin is shown. A pattern of two distinct peaks in classifier accuracy around 200 ms and 400 ms after Si onset is evident. (B) At 200 ms after Si onset, there was no difference in representational similarity between same-category and different-category Si objects (left panel, p = 0.2 by t-test between subjects). At 400 ms, representational similarity was higher for same-category than different-category objects (right panel, p = 5 × 10⁻⁷). F1–F4, B1–B4 and S1–S4 refer to the unique faces, bodies and scenes presented during the Association phase. (C) When discriminating fractal identity (i.e., a 6-way classification problem of stimuli with no natural categories), performance was sharply peaked before 200 ms after fractal onset. Shaded area shows standard error of the mean.

We observed two distinct peaks in multivariate classification performance, one centered approximately around 200 ms and the other around 400 ms post-stimulus onset. Although these peaks had measurable width, for simplicity, we will henceforth refer to them as ‘200 ms’ and ‘400 ms’. To test more formally for two distinct peaks in classification, we asked whether there was significant concavity in the evolving classification accuracy in the interval from 200 to 400 ms, by regressing the classification accuracy against linear and quadratic functions of time. At the group level, the quadratic term was significantly different from zero (p = 0.02). We also performed this regression on the accuracy curves from individual subjects; many subjects trended toward a positive quadratic term, but none reached significance at a Bonferroni-corrected threshold (Figure 3—figure supplement 2). Finally, to rule out any peculiarities in the SVM algorithm being responsible for two distinct peaks in classification accuracy, we also repeated the same analysis at the group level with a variety of nearest-mean classifiers and found the same pattern (Figure 3—figure supplement 3).
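The regression on linear and quadratic functions of time can be sketched as below. This is a sketch under assumptions: the 10 ms sampling, the number of subjects and the synthetic accuracy curves are invented, and the group-level summary shown (a one-sample t statistic on per-subject quadratic coefficients) is only one plausible way to test the quadratic term.

```python
import numpy as np

def quadratic_fit(times_ms, accuracy):
    """Least-squares fit of accuracy = b0 + b1*t + b2*t**2.

    Time is centred on the window midpoint and expressed in units of 100 ms.
    A positive b2 means accuracy dips between the two ends of the window.
    """
    t = (times_ms - times_ms.mean()) / 100.0
    design = np.column_stack([np.ones_like(t), t, t ** 2])
    coef, *_ = np.linalg.lstsq(design, accuracy, rcond=None)
    return coef

# Synthetic demo: hypothetical accuracy curves sampled every 10 ms from 200 to 400 ms.
times_ms = np.arange(200, 401, 10, dtype=float)
t = (times_ms - 300.0) / 100.0
rng = np.random.default_rng(0)
n_subjects = 20
curves = 0.45 + 0.05 * t ** 2 + rng.normal(scale=0.02, size=(n_subjects, t.size))

# One quadratic coefficient per subject, then a one-sample t statistic against zero.
b2 = np.array([quadratic_fit(times_ms, curve)[2] for curve in curves])
t_stat = b2.mean() / (b2.std(ddof=1) / np.sqrt(n_subjects))
```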

Given past observations and ideas about separate post-stimulus phases encoding qualitatively different kinds of stimulus information (Schmolesky et al., 1998; Lamme and Roelfsema, 2000; Riesenhuber and Poggio, 2000; Engel et al., 2001; Bar, 2003; Cichy et al., 2014), we asked if these two peaks had different representational similarity structure. We calculated representational similarity matrices (Kriegeskorte et al., 2008), which reflect the similarity in activation patterns between each pair of unique stimuli. We found that at 200 ms, the activity patterns evoked by stimuli within a category were no more similar than those evoked by stimuli in different categories (Figure 3B, left panel; p = 0.2, paired t-test between subjects); whereas at 400 ms, patterns within a category were substantially more similar than between categories (Figure 3B, right panel; p = 5 × 10⁻⁷). This is consistent with the idea that the dominant coding of stimulus information changes between 200 and 400 ms.
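The within-category versus between-category comparison can be illustrated with a small sketch. It assumes Pearson correlation as the similarity measure and uses synthetic patterns for a single subject; the variable names and the way category structure is injected are hypothetical.

```python
import numpy as np

def similarity_matrix(patterns):
    """Pearson correlation between every pair of stimulus patterns (rows)."""
    return np.corrcoef(patterns)

def within_minus_between(sim, categories):
    """Mean same-category similarity minus mean different-category similarity."""
    same = categories[:, None] == categories[None, :]
    off_diagonal = ~np.eye(len(categories), dtype=bool)
    return sim[same & off_diagonal].mean() - sim[~same].mean()

# Synthetic demo: 12 unique stimuli (4 faces, 4 bodies, 4 scenes) over 275 sensors.
rng = np.random.default_rng(0)
n_sensors = 275
categories = np.repeat([0, 1, 2], 4)
stimulus_part = rng.normal(size=(12, n_sensors))             # unique to each stimulus
category_part = rng.normal(size=(3, n_sensors))[categories]  # shared within a category

identity_coded = stimulus_part                  # no category structure
category_coded = stimulus_part + category_part  # category structure added

d_identity = within_minus_between(similarity_matrix(identity_coded), categories)
d_category = within_minus_between(similarity_matrix(category_coded), categories)
```

In this toy setting `d_identity` stays near zero while `d_category` is clearly positive, which mirrors the qualitative contrast reported between the 200 ms and 400 ms patterns. Computing one such difference per subject would give the values entering a between-subjects t-test.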

Further supporting the idea that the later component of the ERF had a relatively more dominant coding of categorical information, we found that the cross-validated performance of a linear SVM in a 6-way discrimination of fractal identity was sharply peaked at 160 ms post-stimulus onset, and lacked a substantial second peak (Figure 3C). We note that a shift in the timing of the early peak from ∼200 ms to ∼160 ms could be consistent with previous observations (Bobak et al., 1987; Cichy et al., 2014) that the precise timing of each wave of representation is sensitive to the particular stimuli concerned.

Sd elicits retrieval of associated Si representation

During the Reward phase, Sd (fractals) and outcomes (coin/blue square) were presented. We confirmed it was possible to predict the identity of fractals (cf. Figure 3C) and outcomes (Figure 3—figure supplement 4) reliably based on the MEG signal. However, the main intention of our study was to examine whether the activity evoked by these stimuli contained information about the Si stimulus with which the Sd had been associated. To this end, we trained classifiers on neural responses to Si in the Association phase (exactly as above, but using all trials because cross-validation was not necessary), and tested these classifiers on neural responses elicited in the Reward phase when Sd was presented. The classifier was considered to be correct if it reported the category label of the Si that had previously been paired with this Sd. We performed this train-on-Si, test-on-Sd procedure for every pair of times relative to the onsets of Si (in the Association phase) and Sd (in the Reward phase), leading to a 2-D grid of classification accuracies (Figure 4A). These 2-D grids were then smoothed with a 2-D Gaussian kernel (σ = 30 ms).
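The train-on-Si, test-on-Sd grid and its smoothing can be sketched as follows. Several things here are assumptions: a nearest-mean classifier replaces the linear SVM, time bins are taken to be 10 ms wide (so σ = 30 ms becomes 3 bins), edges are padded by replication, and the data are synthetic.

```python
import numpy as np

def cross_classify_grid(train, train_y, test, test_y):
    """Accuracy for every (training time, testing time) pair.

    train, test: arrays of shape (n_trials, n_sensors, n_times)
    test_y: category of the Si previously paired with the Sd shown on each trial
    """
    classes = np.unique(train_y)
    grid = np.zeros((train.shape[2], test.shape[2]))
    for i in range(train.shape[2]):
        x = train[:, :, i]
        means = np.stack([x[train_y == c].mean(axis=0) for c in classes])
        for j in range(test.shape[2]):
            dists = ((test[:, None, :, j] - means[None]) ** 2).sum(axis=2)
            grid[i, j] = (classes[dists.argmin(axis=1)] == test_y).mean()
    return grid

def gaussian_smooth_2d(grid, sigma_bins):
    """Separable Gaussian smoothing with edge replication; output keeps the input shape."""
    radius = int(np.ceil(3 * sigma_bins))
    kernel = np.exp(-0.5 * (np.arange(-radius, radius + 1) / sigma_bins) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(grid, radius, mode="edge")
    smooth = lambda v: np.convolve(v, kernel, mode="valid")
    rows = np.apply_along_axis(smooth, 1, padded)
    return np.apply_along_axis(smooth, 0, rows)

# Synthetic demo: an evoked pattern in Si bins 4-6 reappears, at half strength,
# in Sd bins 8-11. All numbers are invented for illustration.
rng = np.random.default_rng(0)
n_trials, n_sensors, n_times = 90, 15, 20
labels = np.repeat([0, 1, 2], n_trials // 3)
patterns = np.zeros((3, n_sensors))
for c in range(3):
    patterns[c, c * 5:(c + 1) * 5] = 2.0
si = rng.normal(size=(n_trials, n_sensors, n_times))
si[:, :, 4:7] += patterns[labels][:, :, None]
sd = rng.normal(size=(n_trials, n_sensors, n_times))
sd[:, :, 8:12] += 0.5 * patterns[labels][:, :, None]

grid = cross_classify_grid(si, labels, sd, labels)
smoothed = gaussian_smooth_2d(grid, sigma_bins=3.0)
```

Rows of `grid` index the time at which the classifier was trained, and columns index the time at which it was tested, so a hotspot away from the diagonal indicates that a pattern from one latency reappears at a different latency.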

Early and late components of associated object representation retrieved at time of cue and outcome, respectively.

During the Reward phase, the 200 ms component of the Si representation was retrieved for an extended period from shortly after Sd was presented, while the 400 ms component of Si representation was retrieved around the time the outcome was presented. (A) Classifiers trained around 200 ms after Si presentation in Association phase and tested around 400 ms after Sd presentation in Reward phase decode the object category previously associated with the Sd. Photo is from pixabay.com and is in the public domain. (B) Classifiers trained around 400 ms after Si presentation and tested 70 ms after outcome presentation decode the object category previously associated with the Sd. In A and B, black outlines show p = 0.05 peak-level significance thresholds (empirical null distribution generated by 1000 random permutations of training category labels, see Methods for more details). (C) Peak classification accuracy in the 200 ms and 400 ms rows of A and B. By 2-way ANOVA, there was no main effect of 200 ms vs 400 ms or of Sd vs outcome, but there was a significant interaction (p = 0.04). Error bars show standard error of the mean.

We observed that the classifiers trained around 200 ms post-Si presentation achieved above-chance accuracy in predicting which Si category had previously been associated with the presented Sd (the 95th percentile of the peak-level achieved in 200 random shuffle tests is shown as a solid black line in each panel). This effect was above chance from 270–530 ms following presentation of Sd. In other words, the spatial pattern of brain activity present 200 ms after presentation of Si in the Association phase was partially reinstantiated 270–530 ms after presentation of Sd in the Reward phase. Note that the randomization of Si–Sd pairings across subjects makes exceedingly unlikely the possibility that some visual features of Si happen to be shared with the associated Sd and might therefore carry a shared neural signature.

We also applied the same set of classifiers to the activity evoked by presentation of outcome (coin or neutral blue square) that followed each Sd in the Reward phase. The classifiers trained around 400 ms after Si achieved above-chance accuracy in predicting the Si category previously associated with the Sd presented on this trial (Figure 4B). This effect was strongest at 70 ms following presentation of the outcome, meaning that the spatial pattern of activity present 400 ms after presentation of Si in the Association phase was at least partially reinstantiated 70 ms after presentation of the outcome in the Reward phase. Since the outcome always appeared 3500 ms after Sd in each trial, 70 ms after presentation of outcome was equivalently 3570 ms after presentation of Sd. Since all the information necessary to retrieve Si was carried by Sd, some of the retrieval process might occur before onset of the outcome.

Two-way ANOVA revealed no significant main effects of 200 ms vs 400 ms or Sd vs outcome but a significant interaction (p = 0.04; Figure 4C). That is, the peak accuracy following Sd was higher for the 200 ms than the 400 ms classifier, while the peak accuracy following outcome was higher for the 400 ms than the 200 ms classifier, implying a double dissociation in the component that was more strongly retrieved at Sd vs outcome. Both forms of cross-classification were very much less accurate than (linear) classification of the identity of the Sd (fractals) or outcome (coin/blue square) from the activity directly evoked by these stimuli (cf. Figure 3C and Figure 3—figure supplement 4).

To investigate which MEG sensors carried retrieved information, we again trained classifiers on Si-evoked data and tested on Sd- or outcome-evoked data (i.e., cross-classification). However, rather than using all 275 sensors, we repeated the procedure for 2000 iterations using a different random subset of 50 sensors each time. To investigate the retrieval identified in Figure 4A,B, we restricted analysis to 60 × 60 ms temporal ROIs centered on the peaks of cross-classification in Figure 4A,B, and averaged over these temporal ROIs. For each sensor, each iteration of this procedure thus yielded a single classification accuracy. We could then calculate how accurate the cross-classification was on average when a given sensor participated in classification. The average of these data across subjects is shown in Figure 5, separately for Sd- and outcome-evoked data. To test whether these spatial patterns were significantly different, we again used a linear SVM with cross-validation to predict whether each pattern originated from Sd- or outcome-evoked data. Each pattern was mean-subtracted to avoid any trivial classification based on overall higher cross-classification performance for Sd- than outcome-evoked data. Prediction accuracy reached 71.2%, which was greater than chance by one-tailed binomial test (p = 0.002).
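The random-subset procedure for estimating each sensor's contribution can be sketched as below. The defaults mirror the numbers in the text (subsets of 50 sensors, 2000 iterations), but the demo uses smaller hypothetical values. As another simplification, a nearest-mean classifier replaces the linear SVM, and the temporal ROI is collapsed to a single pattern per trial.

```python
import numpy as np

def accuracy(train_x, train_y, test_x, test_y):
    """Nearest-mean cross-classification accuracy; inputs are (n_trials, n_sensors)."""
    classes = np.unique(train_y)
    means = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    dists = ((test_x[:, None, :] - means[None]) ** 2).sum(axis=2)
    return (classes[dists.argmin(axis=1)] == test_y).mean()

def sensor_contributions(train_x, train_y, test_x, test_y,
                         subset_size=50, n_iter=2000, seed=0):
    """Mean accuracy over all random sensor subsets that included each sensor."""
    rng = np.random.default_rng(seed)
    n_sensors = train_x.shape[1]
    total = np.zeros(n_sensors)
    count = np.zeros(n_sensors)
    for _ in range(n_iter):
        subset = rng.choice(n_sensors, size=subset_size, replace=False)
        acc = accuracy(train_x[:, subset], train_y, test_x[:, subset], test_y)
        total[subset] += acc
        count[subset] += 1
    return total / np.maximum(count, 1)  # guard against a sensor never being drawn

# Synthetic demo: only sensors 0-2 carry category information (an assumption).
rng = np.random.default_rng(0)
n_trials, n_sensors = 90, 40
labels = np.repeat([0, 1, 2], n_trials // 3)
patterns = np.zeros((3, n_sensors))
patterns[[0, 1, 2], [0, 1, 2]] = 2.0
train_x = patterns[labels] + rng.normal(size=(n_trials, n_sensors))
test_x = patterns[labels] + rng.normal(size=(n_trials, n_sensors))

contrib = sensor_contributions(train_x, labels, test_x, labels,
                               subset_size=10, n_iter=400)
```

Scoring a sensor by the average accuracy of the ensembles it took part in avoids interpreting classifier weights directly, at the cost of blurring each sensor's score with those of the sensors it happened to be drawn alongside.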

Contributions of sensors to retrieval.

To explore which brain areas carried the information about Si that was retrieved at the time of Sd and outcome, we copied the procedure of training linear category classifiers on presentation of Si, and predicting the category at the time of Sd or outcome—but instead of using all 275 sensors, we repeated the analysis 2000 times using subsets of 50 sensors randomly selected on each iteration. The contribution of sensor s was taken to be the mean of all prediction accuracies (within 60 × 60 ms temporal ROIs containing the peak time bins) achieved using an ensemble of 50 sensors that included s. Intriguingly, the information about the category of Si retrieved at the time Sd was presented emerged primarily from occipital sensors (A), while the information about the category of Si retrieved at the time the outcome was shown appeared more strongly in parietal and temporal sensors (B). In the difference between the two conditions, no individual sensor survived correction for multiple comparisons. However, a linear SVM was reliably able to classify whether a spatial pattern belonged to Sd or outcome (71.2% accuracy, p = 0.002 by one-sided binomial test against chance classification).

Preference for Si+ is correlated with retrieval of stimulus-specific representation at outcome time

Finally, we were intrigued by the apparent retrieval of only the late (400 ms) and not the early (200 ms) component of the Si representation during outcome presentation. The representational similarity analysis in Figure 3B suggested that this 400 ms component might preferentially encode stimulus category. Thus, we speculated that the value of the associated Si category, rather than the value of a particular Si stimulus, might be updated when the outcome appears. This could provide a potential explanation for the lack of group-level behavioral preference for Si+ over Si− during the subsequent Decision phase, since each Si category contained both an Si+ and an Si−, with equal presentations. This hypothesis predicts that, although at the group level there might be no significant retrieval of the 200 ms component of Si representation during outcome presentation, the subjects who did retrieve the 200 ms component of Si might have a positive preference for Si+ over Si−. (Meanwhile, a preference for Si− over Si+ should be unrelated to retrieval.) We therefore plotted the correlation between behavioral preference and the accuracy of the Si-trained classifier in predicting the associated category of the Sd stimulus presented on this trial. This analysis was split according to whether subjects preferred Si− over Si+ (Figure 6A) or Si+ over Si− (Figure 6B). Remarkably, in subjects preferring Si+ over Si−, reinstatement of the 200 ms component of Si was strongly correlated with behavioral preference. Shuffling subject identities yielded a null distribution of peak log10 p-values for the correlation of classifier accuracy with behavioral preference. The 400 ms classifier showed no substantial positive correlation with behavioral preference (Figure 6C), while the 200 ms classifier showed a corrected-significant peak in correlation strength ∼400 ms after the onset of the outcome (Figure 6D). The raw data driving these correlations are also shown in Figure 6E,F.
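The across-subject correlation with a peak-level null from shuffled subject identities can be sketched as below. This sketch thresholds the peak Pearson r rather than the peak one-tailed log10 p-value; for a fixed number of subjects the two are monotonically related. The subject count, time axis and effect size are hypothetical.

```python
import numpy as np

def pearson_r_over_time(acc, pref):
    """Across-subject Pearson r between preference and accuracy at each time bin.

    acc:  (n_subjects, n_times) classifier accuracy for one training-time row
    pref: (n_subjects,) behavioral preference, e.g. P(Si+)
    """
    a = acc - acc.mean(axis=0)
    p = pref - pref.mean()
    return (a * p[:, None]).sum(axis=0) / np.sqrt((a ** 2).sum(axis=0) * (p ** 2).sum())

def peak_r_threshold(acc, pref, n_shuffles=1000, percentile=95, seed=0):
    """Percentile of the peak r obtained after shuffling subject identities."""
    rng = np.random.default_rng(seed)
    peaks = [pearson_r_over_time(acc, rng.permutation(pref)).max()
             for _ in range(n_shuffles)]
    return np.percentile(peaks, percentile)

# Synthetic demo: 14 hypothetical subjects who preferred Si+ (P(Si+) > 0.5).
rng = np.random.default_rng(0)
n_subjects, n_times = 14, 50
pref = rng.uniform(0.5, 1.0, size=n_subjects)
acc = 1 / 3 + rng.normal(scale=0.03, size=(n_subjects, n_times))
# Accuracy tracks preference at time bin 40 only.
acc[:, 40] = 1 / 3 + 0.3 * (pref - 0.5) + rng.normal(scale=0.01, size=n_subjects)

r = pearson_r_over_time(acc, pref)
threshold = peak_r_threshold(acc, pref)
significant = r > threshold
```

Taking the maximum over time bins within each shuffle, and only looking for positive correlations, matches the one-tailed, peak-level logic described in the text.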

Retrieval of early component of Si representation predicts value updating across subjects.

At the group level, only the 400 ms component was significantly retrieved at the time of outcome (cf. Figure 4B). However, at the single-subject level, the degree of retrieval of the 200 ms component correlated with value updating. As in Figure 4B, the accuracy of classifiers trained at each time bin around Si (in the Association phase) was tested at each time bin around the time of outcome (in the Reward phase) to predict the category of the Si associated with the Sd preceding the outcome. In each time*time bin, this accuracy was regressed, across subjects, against the behavioral preference for Si+ over Si− from the Decision phase (i.e., P(Si+)). As we only explored positive correlations, one-tailed log10 p-values of the regression are reported. (A) In subjects who preferred Si− over Si+, there were no correlations between the degree of preference and the degree of reinstatement of Si at outcome. (B) In subjects who preferred Si+ over Si−, there was a strong correlation between the degree of preference and the degree of reinstatement. This correlation peaked at around 400 ms after outcome onset. (C, D) Red and blue traces show single rows of panels A and B at 200 and 400 ms. Significance was tested by randomly shuffling subject identities to obtain a null distribution of peak-level log10 p-values. Thresholds are shown at 95% of the null distribution of the peak-level of 200 and 400 ms rows, and at 95% of the null distribution of peak-level of all rows. (E, F) Raw classification accuracies underlying the correlations in A–D, when training at 200 ms after Si onset and testing at 400 ms after outcome onset. Each point is a subject.

Discussion

We used a sensory preconditioning paradigm to explore the temporal structure of the retrieval of representations through associative links. We found that presenting photographs (Si, in three categories) elicited an evolving representation with two temporally distinct components: one around 200 ms and the other around 400 ms after stimulus onset. The earlier component was reinstated when a fractal (Sd) previously paired with the Si was presented. The later component was reinstated when a rewarding or neutral outcome was presented following Sd. Although at the group level there was no significant reinstatement of the earlier component at the time of outcome, between subjects the degree of reinstatement of this earlier component correlated with the degree of subsequent value generalization.

Our results fit comfortably with the large body of literature showing that retrieval (which is notably unconscious here and in Wimmer and Shohamy, 2012, as contrasted with conscious retrieval that is more commonly studied) induces reinstantiation of at least some aspects of the pattern of neural activity evoked by the original presentation. For instance, in the fMRI study whose design we copied (Wimmer and Shohamy, 2012), univariate methods were used to show the equivalent of Si category retrieval during the Reward phase. Equally, ERP studies have found neural signals as early as 300 ms following a retrieval cue that are different depending on which information is retrieved or whether the information is retrieved (Johnson et al., 2008; Yick and Wilding, 2008). Further, using MEG, Jafarpour et al. (2014) identified reinstatement of a pattern of oscillatory activity appearing approximately 180 ms following presentation of the retrieved item. This pattern was reinstated approximately 500 ms following the retrieval cue, slightly later than the 400 ms we observed.

Multivariate pattern analysis provides a much more powerful microscope than traditional univariate analysis for detecting distributed patterns encoding neural representations (Norman et al., 2006). Combining MVPA with MEG enables tracking the fast time-evolution of these representations (Schmolesky et al., 1998; Jafarpour et al., 2013; Cichy et al., 2014). Using these methods we have extended previous findings on retrieval to now establish a mapping between the dynamics of object representation and the dynamics of retrieval in this behavioral paradigm.

We identified two temporal components of object representation that were retrieved at different times. The earlier component of Si representation, which appeared roughly 200 ms following Si presentation, was first detectable 270 ms following presentation of Sd. This is consistent with past ERP studies showing similar timing, which have been taken as suggesting that reactivation is mediated by hippocampus (Bosch et al., 2014). The prolongation of this representation from 270–530 ms may represent averaging (over trials or subjects) of temporally abrupt retrievals, or a sustained information retrieval.

By contrast, the late component of Si representation re-appeared 70 ms following outcome presentation. The outcome did not provide any additional information about Si category, so the representation of Si must have been sustained in some form through the (fixed) delay between Sd and outcome. This raises questions such as where the information about Si was held during the delay, and what are the implications of this timing. For the former, we were only able to detect a representation of Si when it took the form of a spatial pattern of activity mirroring the pattern at presentation of Si. Thus information might have been online in the activity of, for instance, prefrontal neurons (Fuster, 2001; Wang et al., 2006), but in a different form from that inspired by Si itself (Sakai and Miyashita, 1991; Rainer et al., 1999). Alternatively, it might have been stored in short-term synaptic weight changes (Hempel et al., 2000; Seung, 2003; Florian, 2007; Mongillo et al., 2008).

Supporting the idea of these ∼200 ms and ∼400 ms components as distinct representational periods, we note the following. First, there was a decrease in classification accuracy between these periods. Second, classifiers trained on one epoch had low accuracy in the other epoch (Figure 3—figure supplement 5), suggesting information about the stimuli was coded differently between epochs. Third, the epochs had different similarity structure with respect to the stimulus categories (Figure 3B). Fourth, the patterns from the two epochs were doubly dissociated in terms of their retrieval at Sd vs outcome (Figure 4), while the time period between the two peaks (i.e., around 300 ms post-stimulus) was not strongly retrieved either at Sd or outcome (Figure 4).

In terms of timing, the relatively precise epoch of retrieval of Si following the presentation of the outcome may reflect the point of strongest overlap between a variety of timings in individual subjects. Alternatively, it may be that a representation that is latent became detectable as soon as more power arose in the visual-evoked ERF due to onset of the outcome. Yet another possibility is that expectations of the next stimulus partly drive representations in the first tens of milliseconds after a visual onset, before the present stimulus is processed.

The low accuracy in classifying retrieved representations (∼35%) compared to evoked responses (∼60%) might imply that retrieved representations (perhaps especially those that subjects are not consciously aware of) were weak compared to evoked representations. It is also possible that Si representations were only retrieved on a subset of trials, weakening the average signal. Finally, it is possible that retrieved representations had a distributed spatial pattern that was only partly overlapping with the evoked representation, making it more difficult to detect with pattern classifiers trained on evoked activity.

We exploited the distinct temporal components of retrieval to help elucidate the neural underpinnings of value generalization through associations. In both our study and in the similar design of Wimmer and Shohamy (2012), behavioral evidence of sensory preconditioning rests wholly on stimulus-specific retrieval (since the rewards associated with each category are balanced). If the 400 ms component of Si representation preferentially encodes information about category rather than specific stimuli, as suggested by our representational similarity analysis, retrieval of solely this component at outcome time might cause value learning to be assigned to categories rather than individual stimuli. This hypothesis would explain our finding that the subjects who retrieve the 200 ms component at outcome show behavioral evidence of sensory preconditioning. Under this interpretation, the correlation that Wimmer and Shohamy (2012) found in BOLD between retrieved stimulus representations and behavior between subjects may also have been driven by the 200 ms component of the stimulus representation; these temporally precise signals could not be distinguished using fMRI. Although the particular representations online at the time of reward were probably driven by quirks of this task design (since other sensory preconditioning experiments have found robust group-level preference for Si paired with rewarded Sd; e.g., Seidel, 1959), the finding is of general importance because it suggests that the exact timing of reward relative to fast-evolving neural representational structures is crucial to value updating and credit assignment.

Like Wimmer and Shohamy (2012), we have compared a behavioral value generalization measure against the output of a neural classifier trained on the category of Si, rather than the identity of an individual Si. The latter would give a more direct test of the idea that subjects who retrieve a representation of the specific Si paired with the particular Sd viewed on this trial drive larger value updates. Although it is in principle possible to train a classifier to distinguish between individual exemplars of an Si category, this did not reach a sufficiently high level of performance in our hands, perhaps limited by the relatively small number of training samples per unique stimulus. Future experiments could also employ Si+ and Si− stimuli that are more neurally distinguishable.

We noted in the ‘Introduction’ a large number of proposals for the use of associative information both at the time of decision (online) and when a decision is not imminent (offline). Offline and online processes may share similar mechanisms (Doll et al., 2014), and in some cases the division between offline and online mechanisms is blurred. For example, retrieving elements of past experiences may serve as part of the process of planning in advance for the next time related situations are encountered (Dragoi and Tonegawa, 2011, 2013), similar to the psychological notion of implementation intentions (Gollwitzer, 1999).

Some theoretical methods (e.g., the successor representation [Dayan, 1993] and beta-models [Sutton, 1995]) shift a portion of the burden of online calculation onto offline updates to carefully structured representations. In sensory preconditioning, it is an open question whether generalized values are updated offline (either during the Reward phase or in between the Reward and Decision phases), retrieved through associative links at the time of decision, or a mix of both. In animals, the vulnerability of sensory preconditioning to extinction (Gewirtz and Davis, 2000) hints at an online mechanism, but it is equally possible that extinction drives offline value updates through the same generalization mechanism as acquisition. Indeed, although our description of the reinstatement of Si suggests that it arises through a distinct process of retrieval, we cannot distinguish this from the subtly different possibility hinted at by these ideas, namely that the representation of Sd changed through the associative learning so that it more closely resembles that of Si.

Finally, we note that timing of event-related signals depends strongly on stimulus properties (e.g., Bobak et al., 1987). Multivariate classification also yields different timings in the peaks of classification depending on the specific kinds of categories evaluated (Cichy et al., 2014). Thus the particular temporal structure of evoked responses is most likely specific to the stimuli used. Mapping this structure for a given task and stimuli can be leveraged to probe the dynamics of retrieval.

In summary, neural retrieval of representations through associative links is central for memory and decision-making. Here we provide evidence that the dynamical structure within retrieval is functionally relevant for value-guided decision making. Analyzing the fine temporal structure of representations also increases the potential for studying temporally rich retrieval processes such as replay and planning in humans, which were previously confined to animal recordings.

Materials and methods

Subjects

Twenty-nine adults participated in the experiment, recruited from the UCL Institute of Cognitive Neuroscience subject pool. Three were excluded before the start of analysis for large movement or myographic artifacts. Of the 26 remaining, age quartiles were 18.7, 19.5, 21.3, 26.7, 41.4 years; 14 were female, and 1 was left-handed. All participants had normal or corrected-to-normal vision and had no history of psychiatric or neurological disorders. All participants provided written informed consent and consent to publish prior to start of the experiment, which was approved by the Research Ethics Committee at University College London (UK), under ethics number 1825/005.

Task

Participants performed three phases of a simple behavioral task (copied almost exactly from Wimmer and Shohamy, 2012; but with timings set to be faster for MEG) designed to induce and measure sensory preconditioning. The task was coded in Cogent (Wellcome Trust Centre for Neuroimaging, United Kingdom), running in MATLAB 7.14 (Mathworks, Natick, Massachusetts).

Before the experiment, participants rated 78 images, one at a time, using a visual analog scale to indicate how much they subjectively liked each image, ranging from ‘Strongly Dislike’ to ‘Strongly Like’. These images consisted of 60 photos (20 faces, 20 body parts, 20 scenes), and 18 fractals. Luminance and contrast varied between images (Figure 3—figure supplement 6). Four of each photo category and 12 fractals were then selected to be used in the experiment. For each subject we chose the stimuli whose liking ratings were closest to neutral; different subjects therefore saw different images in the experiment.

In the first (‘Association’) phase of the experiment, each of the 12 selected photos (‘Si’, indirect stimuli) was deterministically paired with a different fractal pattern (‘Sd’, direct stimuli). Two of each Si category were ‘dummies’ for the cover task, and two were ‘real’ stimuli. Subjects viewed Si and Sd images sequentially while performing a cover task of pressing one button in response to rightside-up images and a different button for upside-down images, with the button response mapping randomized across subjects. Dummies had a 50% chance of being upside-down, and real stimuli were never upside-down. Dummies were not presented in subsequent phases. In each trial, subjects saw an Si for 1750 ms, followed by an interstimulus-interval (ISI) of 1000 ms, followed by the paired Sd for 1750 ms, followed by an intertrial-interval (ITI) of 2500 ms. Every nine trials, each of the six real Si stimuli was presented once, and one of the dummy Si stimuli in each category was presented once (both reals and dummies were always followed by the paired Sd). The order was randomly permuted over every 9 trials, and this was repeated 12 times, for a total of 108 trials. In debriefing at the end of the experiment, no subject reported being aware of any pairing between Si and Sd, indicating the effectiveness of the cover task; the Si–Sd association was implicit. No subject reported being aware that the dummies did not appear in later phases.

In the second (‘Reward’) phase, subjects were taught that some of the fractals (Sd+) were worth money, while others (Sd−) were not. In each conditioning trial, subjects saw an Sd for 2000 ms, followed by an ISI of 1500 ms, and then either a reward (image of a one pound sterling coin) or no-reward (blue square) for 2000 ms, followed by an ITI of 3000 ms. Each Sd appeared 18 times, for a total of 108 trials. Sd− were never rewarded, while Sd+ were rewarded 14 out of 18 times that they appeared. The cover task was to press one button for any Sd or for no-reward, and a different button for reward (meaning that in an unrewarded trial, the same button was to be pressed twice; while in a rewarded trial two different buttons should be pressed). Pressing the correct button to ‘pick up’ the coin led to actually receiving this money at the end of the experiment (divided by a constant factor of ten); subjects were informed of this. Through the unique pairing between Si and Sd, conditioning implicitly established Si+ (previously paired with Sd+) and Si− (previously paired with Sd−). The pairing was such that each Si category contained one Si+ and one Si−.

In the third (‘Decision’) phase, in each trial subjects made a pairwise choice between either two Sd images or two Si images. The two Si images were always of the same category (face/body/scene): one Si+ and one Si−; likewise, the two Sd images, an Sd+ and an Sd−, had always been previously paired with the same Si category. Subjects were instructed that they would receive monetary reward for choosing the correct stimulus, but, as in Wimmer and Shohamy (2012), were given no instructions about how to identify the correct stimulus (except to choose the one they thought was more lucky). They actually received these rewards at the end of the experiment, again divided by ten. In addition to the money earned within the task, subjects received a flat compensation of £10. Each pairwise choice was repeated 4 times for a total of 24 trials. Any preference for Si+ over Si− would provide evidence of sensory preconditioning.

After the experiment, subjects again provided subjective liking ratings on a visual analog scale, this time for each Si and Sd actually used in the experiment (excluding dummies).

Behavioral analysis

Decision-phase preferences for Sd+, Sd−, Si+, and Si− were measured by averaging the four binary responses for each pair, and performing a one-sample t-test between subjects on the mean response against 50%. Similar results could be obtained by treating the first choice of each subject for each pair as an independent draw from a Bernoulli distribution and comparing the results to p = 0.5. Changes in subjective liking ratings from Pre-Liking to Post-Liking phases were differences on an arbitrary scale (pixels in the visual analog scale) and were linearly de-trended as subjects showed a robust tendency to increase all ratings at the end of the experiment compared to the beginning (many subjects reported in debriefing that they liked most of the stimuli more because they were more familiar at the end of the experiment).
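The group-level test described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the analysis code used in the study; the function name one_sample_t and the example proportions are hypothetical. It returns the t statistic and degrees of freedom only, since the standard library provides no Student's t distribution for the p-value.

```python
import math
import statistics

def one_sample_t(values, mu0):
    """Return (t statistic, degrees of freedom) for H0: mean(values) == mu0."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)            # sample SD (n - 1 denominator)
    t = (mean - mu0) / (sd / math.sqrt(n))
    return t, n - 1

# Hypothetical per-subject proportions of Si+ choices (invented values).
# Each would be the average of that subject's binary Decision-phase responses.
p_si_plus = [0.50, 0.75, 0.25, 0.58, 0.42, 0.67]

# Test the between-subject mean against chance (50%).
t_stat, dof = one_sample_t(p_si_plus, 0.5)
```

The resulting statistic would then be compared against a t distribution with dof degrees of freedom.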

MEG acquisition

MEG was recorded continuously at 600 samples/second using a whole-head 275-channel axial gradiometer system (CTF Omega, VSM MedTech, Canada), while participants sat upright inside the scanner. Continuous head localization was recorded with three fiducial coils at the nasion, left pre-auricular, and right pre-auricular points. The task script sent synchronizing triggers (outportb in Cogent) which were written to the MEG data file. A projector displayed the task on a screen ∼80 cm in front of the participant. Participants made responses on a button box using either thumbs or index fingers as they found most comfortable.

MEG analysis

All analysis was performed in MATLAB. Some analyses used SPM12b (Wellcome Trust Centre for Neuroimaging, United Kingdom). Data were first converted to SPM12 format using spm_eeg_convert. Each event was then epoched, using spm_eeg_epochs, to 1000 ms segments from −400 ms to +600 ms relative to the event, based on the triggers recorded from the task script. All timings were corrected for one frame (1/60 s) of lag between triggers and refreshing of the projected image, measured using a photodiode outside the task. The 600 samples in each epoch were then reduced to 50 time bins by averaging together each consecutive 12 samples. Thus, the time bins were spaced every 20 ms and represented the average raw signal of the 12 samples within that 20 ms. Pre-stimulus bins were treated as baseline.
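The temporal binning step can be illustrated with a short sketch. It is written in Python rather than the MATLAB/SPM pipeline used in the study, and the names bin_epoch, SAMPLE_RATE_HZ and EPOCH_START_MS are hypothetical. The epoch here stands for the 600 samples of a single sensor; the placeholder signal is invented.

```python
def bin_epoch(samples, bin_size=12):
    """Average each run of `bin_size` consecutive samples into one time bin."""
    if len(samples) % bin_size != 0:
        raise ValueError("epoch length must be a multiple of bin_size")
    return [
        sum(samples[i:i + bin_size]) / bin_size
        for i in range(0, len(samples), bin_size)
    ]

SAMPLE_RATE_HZ = 600      # acquisition rate stated in the text
EPOCH_START_MS = -400     # epoch runs from -400 ms to +600 ms

epoch = [float(i) for i in range(600)]        # placeholder signal, one sensor
bins = bin_epoch(epoch)                       # 600 samples -> 50 bins

# Each bin spans 12 samples at 600 samples/second, i.e. 20 ms.
bin_width_ms = 1000 * 12 / SAMPLE_RATE_HZ
bin_starts_ms = [EPOCH_START_MS + k * bin_width_ms for k in range(len(bins))]
```

Under these assumptions the first bin starts at −400 ms and the last at +580 ms; how bin times were labelled in the original analysis (start, centre or end of the bin) is not stated in the text.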

We built three-way classifiers for the category of the Si stimuli. Classifiers were trained based on the activity evoked by the presentation of the Si stimuli in the Association phase, and used to classify the activity associated with the presentation of the Sd and outcome stimuli in the Reward phase. Classifiers were built for each time bin following Si presentation, and tested on each time bin following Sd and outcome presentation during the Reward phase, giving rise to (Association) time*(Reward) time maps of classification performance.
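The construction of a (training time) × (testing time) map can be sketched as below. To keep the example dependency-free, a nearest-centroid classifier is substituted for the SVM used in the study, and the toy sensor patterns are invented; all function names are hypothetical. In the actual analysis the training trials would come from Si presentations in the Association phase and the test trials from the Reward phase.

```python
def centroids(patterns, labels):
    """Mean sensor pattern per class label."""
    sums, counts = {}, {}
    for x, y in zip(patterns, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(cents, x):
    """Label of the centroid nearest to x (squared Euclidean distance)."""
    return min(cents, key=lambda y: sum((a - b) ** 2 for a, b in zip(cents[y], x)))

def generalization_map(train, train_labels, test, test_labels):
    """accuracy[i][j]: classifier trained at train bin i, tested at test bin j.

    train[i] and test[j] are lists of trials; each trial is a list of
    sensor values for that time bin.
    """
    accuracy = []
    for train_bin in train:
        cents = centroids(train_bin, train_labels)
        row = []
        for test_bin in test:
            hits = sum(predict(cents, x) == y for x, y in zip(test_bin, test_labels))
            row.append(hits / len(test_labels))
        accuracy.append(row)
    return accuracy

# Toy data: two sensors, three categories, and a category code that changes
# between the first and second time bin (values are invented).
code_a = {"face": [1.0, 0.0], "body": [0.0, 1.0], "scene": [-1.0, -1.0]}
code_b = {"face": [0.0, 1.0], "body": [-1.0, -1.0], "scene": [1.0, 0.0]}
labels = ["face", "body", "scene"]
train = [[code_a[y] for y in labels], [code_b[y] for y in labels]]
test = [[code_a[y] for y in labels], [code_b[y] for y in labels]]
acc_map = generalization_map(train, labels, test, labels)
```

In this toy case each classifier decodes only the test bin that carries the same code it was trained on, which is the kind of structure such a map is meant to expose.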

Support vector machine (SVM) classification analyses were performed with the svmtrain/svmpredict routines from libsvm (National Taiwan University, Taiwan; http://www.csie.ntu.edu.tw/~cjlin/libsvm). Each feature used for classification (i.e., a sensor at a time bin) was independently z-transformed before classification. Results are reported with linear kernels. The regularization parameter C was tuned to optimize cross-validation performance on Association-phase data (C = 10^5) but was then fixed for all further analyses. Cross-validation was tested using leave-one-out, k-fold (5, 10, or 20), or repeated random subsampling (50 or 100 independent subsamples with 10% of samples left out), without any difference in results between methods.
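The per-feature z-transform can be sketched as follows. This is an illustration only: the function name zscore_features is hypothetical, the use of the population standard deviation is an assumption, and the text does not state whether the scaling parameters were estimated on training data alone.

```python
import statistics

def zscore_features(trials):
    """z-transform each feature (column) independently across trials (rows)."""
    columns = list(zip(*trials))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.pstdev(c) for c in columns]   # population SD (assumption)
    return [
        # A constant feature carries no information; map it to 0 rather than divide by 0.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds)]
        for row in trials
    ]

# Two trials, two features on very different scales (invented values).
scaled = zscore_features([[1.0, 10.0], [3.0, 30.0]])
```

After scaling, both features have zero mean and unit variance across trials, so no single sensor dominates a linear classifier merely because of its amplitude.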

In Figure 4, we show 2-dimensional maps where the dimensions are times relative to two different events. To generate statistical significance thresholds for these maps, we recalculated these maps many times with independently shuffled category labels for the stimuli. Each shuffle yielded a map that contained no true information about the stimuli, but preserved overall smoothness and other statistical properties. The peak levels of each of these maps were extracted, and the distribution of these peak levels formed a nonparametric empirical null distribution. The 95th percentile of this distribution is reported as the significance threshold.
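The peak-level shuffling procedure can be sketched as below. This is a minimal illustration, not the code used in the study: the function peak_threshold, the number of shuffles, and the toy statistic toy_map are all hypothetical, and compute_map stands in for whatever procedure produces the two-dimensional map from a set of category labels.

```python
import random

def peak_threshold(compute_map, labels, n_shuffles=1000, percentile=95, seed=0):
    """Peak-level null threshold from maps recomputed with shuffled labels.

    compute_map(labels) must return a 2-D list (rows x columns) of a statistic.
    """
    rng = random.Random(seed)
    peaks = []
    for _ in range(n_shuffles):
        shuffled = labels[:]
        rng.shuffle(shuffled)                  # destroys label information
        null_map = compute_map(shuffled)
        peaks.append(max(max(row) for row in null_map))
    peaks.sort()
    # Smallest null peak with at least `percentile` percent of peaks at or below it.
    rank = max(1, -(-percentile * len(peaks) // 100))   # integer ceiling
    return peaks[rank - 1]

# Toy statistic (hypothetical): fraction of trials whose label matches a fixed
# guess, evaluated at each cell of a 2 x 2 map.
guesses = [[list("aabbaabb"), list("abababab")],
           [list("bbaabbaa"), list("aaaabbbb")]]

def toy_map(labels):
    return [[sum(g == y for g, y in zip(cell, labels)) / len(labels) for cell in row]
            for row in guesses]

threshold = peak_threshold(toy_map, list("aaaabbbb"), n_shuffles=200)
```

Because the null distribution is built from the maximum over the whole map, a value exceeding the threshold at any single point is controlled for the number of points examined.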

Representational similarity between two different trials was measured by correlation between the patterns of activation over sensors, at the same time bin relative to stimulus onset.
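As an illustration, the similarity between two trials at a given time bin could be computed as below. The text specifies only "correlation"; Pearson correlation is assumed here, and the function name pattern_similarity and the three-sensor patterns are hypothetical.

```python
import math

def pattern_similarity(a, b):
    """Pearson correlation between two sensor patterns of equal length."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return num / den

# Invented sensor patterns for three trials at the same time bin.
trials = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.5], [3.0, 1.0, 0.0]]

# Trial-by-trial similarity matrix for this time bin.
similarity = [[pattern_similarity(a, b) for b in trials] for a in trials]
```

Repeating this at each time bin gives one similarity matrix per bin, which can then be compared across bins.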

Classifiers trained on Association-phase data were used directly to predict Reward-phase data without any tuning to optimize cross-classification performance. All (Association) time*(Reward) time maps of classification performance were smoothed by a 2-D Gaussian kernel (σ = 30 ms) for display and for calculating peak-level shuffling statistics.
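The smoothing step can be sketched as follows. With 20 ms time bins, σ = 30 ms corresponds to 1.5 bins. The kernel truncation at three standard deviations and the renormalization at the map edges are assumptions made for this illustration, since the text does not specify how boundaries were handled; the function names are hypothetical.

```python
import math

def smooth_1d(values, sigma_bins):
    """Gaussian-smooth a sequence, renormalizing the kernel at the edges."""
    radius = math.ceil(3 * sigma_bins)         # truncate kernel at 3 sigma (assumption)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma_bins) ** 2) for k in offsets]
    out = []
    for i in range(len(values)):
        num = den = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(values):           # ignore samples outside the map
                num += w * values[j]
                den += w
        out.append(num / den)
    return out

def smooth_2d(grid, sigma_bins):
    """Separable 2-D Gaussian smoothing: rows first, then columns."""
    rows = [smooth_1d(row, sigma_bins) for row in grid]
    cols = [smooth_1d(list(col), sigma_bins) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

SIGMA_BINS = 30 / 20                           # sigma = 30 ms, bins of 20 ms

# Invented 5 x 5 accuracy map with a single sharp peak.
raw = [[0.0] * 5 for _ in range(5)]
raw[2][2] = 1.0
smoothed = smooth_2d(raw, SIGMA_BINS)
```

Applying the same smoothing to the shuffled maps as to the real map keeps the null peak distribution comparable to the observed peaks.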

Analyses that didn't work

In the interest of reporting our work as completely as possible, we discuss a set of analyses that were based on relevant hypotheses, but did not lead to significant results.

1) An important issue in the analysis of retrieved representations is to make sure that what are apparently retrieved representations are not in fact coincidences in the representation of the retrieved object and the retrieval cue. In the analyses in the main paper, this is controlled by randomizing Si–Sd pairings between subjects. We attempted another way of controlling for this, by training a classifier on the Sd-evoked data of all subjects except one (using the category labels of the associated Sis), and testing on the left-out subject. If this procedure, repeated across left-out subjects, produced an above-chance prediction of the Si category associated with the displayed Sd, this would imply that the Sd-evoked data contain a real representation of Si. Unfortunately, when we attempted this, the group-level prediction of Si category did not reach significance. We speculate this is because the category representation differs substantially between subjects (supported by Sandberg et al., 2013), an issue that the analysis in the main paper is immune to because classifiers are trained separately for each subject.

2) Wimmer and Shohamy (2012) regressed their neural signal against within-category differences in behavioral preference. For example, if one subject in the Decision phase preferred the face paired with the rewarded fractal, but did not prefer the scene paired with the rewarded fractal, then he or she was more likely to have a large fusiform face area activation during presentation of the face-paired fractal in the Reward phase than to have a large parahippocampal place area activation during presentation of the scene-paired fractal. We attempted the same analysis but no correlation with neural decoding reached significance. In our hands collapsing within categories to look at between-subject variance in total value updating appeared more statistically powerful. Along similar lines, we also trained classifiers to distinguish individual stimuli in the Association phase (e.g., a particular face, rather than the category of faces—so the classifier learned about 12 distinct categories), and applied these classifiers to activity at the time of outcome in the Reward phase. The classifier was treated as ‘correct’ if it predicted the identity of the photograph that had been previously associated with the fractal presented on this trial of the Reward phase. We then correlated the resulting correctness ratings against the behavioral preference for Si+ over Si− in the Decision phase (just as in Figure 6 of the main paper, but classifying individual stimuli rather than categories). However, these correlations did not reach shuffle-corrected significance. This may be a result of the difficulty of classifying many individual stimuli with relatively few trials.

3) We wondered if, when photos (Si) were presented during the Decision phase, it would be possible to identify neural signals containing information about the paired fractal (Sd). It is possible that this could represent an online retrieval of value information about Sd to guide the choice about Si. However, we could not detect above-chance classification of either associated Sd when pairs of Si were presented during the Decision phase. We suspect the patterns of representation may be more difficult to disentangle when two stimuli are shown on-screen at the same time.

Decision letter

Howard Eichenbaum

Reviewing Editor; Boston University, United States

eLife posts the editorial decision letter and author response on a selection of the published articles (subject to the approval of the authors). An edited version of the letter sent to the authors after peer review is shown, indicating the substantive concerns or comments; minor concerns are not usually shown. Reviewers have the opportunity to discuss the decision before the letter is sent (see review process). Similarly, the author response typically shows only responses to the major concerns raised by the reviewers.

Thank you for sending your work entitled “Temporal structure in the retrieval of associations” for consideration at eLife. Your article has been favorably evaluated by a Senior editor, a Reviewing editor, and 3 reviewers.

The following individuals responsible for the peer review of your submission have agreed to reveal their identity: Howard Eichenbaum (Reviewing editor), Nikolaus Kriegeskorte (peer reviewer), and Kenneth Norman (peer reviewer). A further reviewer remains anonymous.

The Reviewing editor and the reviewers discussed their comments before we reached this decision, and the Reviewing editor has assembled the following comments to help you prepare a revised submission.

The reviewers were unanimous in their enthusiasm for the question under study and for the sophisticated design and analyses, for the novel findings on temporal dynamics of memory reactivation, and the inclusion of negative findings. There were, however, major concerns about the strength of some of the findings, and there were other specific concerns by individual reviewers about some of the statistical tests, the exposition, and the interpretation of specific findings. These concerns are provided below.

Reviewer 1:

Reinstatement is only significant at the level of category-average representations, not individual exemplars of each face, body, and scene. This is a pity because each category was equally often associated with positive and neutral reward. This makes it difficult to assess whether the specific reward-predictive stimuli were reinstated at the point of reward and whether the strength of such reinstatement predicted the propagation of value updates. The authors speculate that category-level reinstatement might explain the absence of significant value updating suggested by their behavioural post-test. They do show that stronger reinstatement of the early component of the category representation just before reward onset predicts subjects' value-based choices in the behavioural post-test. However, although reasonably corrected for multiple testing across latencies, this effect is barely significant and is one of several analyses attempting to establish a relationship between reinstatement of reward-predictive stimuli and value updating. [Reviewing editor: This comment clarifies the reviewers' concern about this issue, but it may not be possible to improve on the comments provided by the authors in the discussion of their findings.]

Reviewer 2:

1) The Introduction does a nice job of identifying a gap in the literature, though I believe it overstates this gap in some ways. I would suggest additional discussion of the findings of Manning et al. (2011; 2013, Journal of Neuroscience), which identify the spatial and spectral signatures of reinstatement. Perhaps the authors could emphasize the temporal aspects of associative retrieval more thoroughly in this section.

2) The authors should more clearly identify the cognitive constructs they are assessing. Throughout the manuscript, the authors seem to alternate between focusing on a paradigm designed for associative retrieval, sensory preconditioning, and value updating. Given that the task design allows for investigating each of these, the authors should state this upfront and additionally discuss the prior literature linking these constructs in the Introduction.

3) The rationale for using univariate vs. multivariate feature sets should be stated earlier in the text. Is the rationale only to verify that more information is captured using multivariate techniques compared to univariate techniques? If so, I think simply citing the prior literature demonstrating this should be sufficient and it is unnecessary and a bit tangential to focus on this point as a primary finding. As the text stands now, it takes a while to get to the point of the primary research findings.

4) Pattern classification performance was calculated within subject and then averaged across subjects, revealing peaks around 200ms and 400ms. To what extent are these peaks evident in individual subjects? The authors partially raise this issue themselves in the Discussion section, but may wish to provide quantitative results regarding this issue.

5) Although the data generally supports the claim that they are primarily measuring “evoked” activity, the authors should be cautious to not imply that it is only evoked activity.

6) In the Discussion section: Several EEG and fMRI studies of temporal order memory (some using an RSA approach) are excluded but should be cited and/or discussed (Hsieh, 2011, Journal of Neuroscience; Hsieh et al., 2014, Neuron; Ezzyat and Davachi, 2014, Neuron).

7) Overall, the data support the claim of two temporally distinct representational periods. However, such an interpretation is complicated by a highly significant (double the chance rate) classification performance during the two windows. The authors should discuss this more thoroughly. Additional analyses described above may also clarify this interpretational discrepancy.

Reviewer 3:

1) What kinds of multiple comparisons corrections are the authors using for the analyses shown in Figure 4A and 4B? Elsewhere in the paper, the investigators appear to be using an approach of controlling for familywise error at p < .05, but it is not clear what is going on here (the paper only says p = 0.05 “peak-level significance thresholds”). Whatever correction the authors use will need to correct for the use of classifiers trained on different time points and also their exploration of multiple time points during the reward learning phase. For what it's worth, I thought that the approach taken in Figure 6 (controlling for all time points, but only two classifiers: the 200ms-trained classifier and the 400ms-trained classifier) was acceptable; a similar approach could be applied here, if it isn't being used already.

2) For the RSA analysis, it was not clear if the authors obtained a meaningful measure of “self-similarity” values (i.e., how similar is the pattern for a particular item to the pattern evoked by other instances of that same item). If they did not measure this, it would be useful to re-do the analysis in a way that obtains self-similarity measurements. Having this information would allow the authors to get separate readouts of item-specific information (e.g., by contrasting the diagonal cells to the off-diagonal cells from the same category) and category information. In particular, it would be useful to get these measures (along with some statistical assessment of their reliability, e.g., through bootstrapping) for both the 200ms and 400ms time points. While it is clear that category structure is not strongly represented at the 200ms time point, the RSA analysis in its current form does not speak to how strongly item-specific information is represented at the two time points (the text, as it is currently written, seems to be implying that there is less information about individual items at 400ms than 200ms, but we don't know that).

3) As shown in Figure 1, about half of the participants showed P(Si+) that was below .5, and some of these participants showed P(Si+) values that were well below .5. While I can understand how variance in the level of reactivation (during the reward learning phase) could lead to variance in participants' preference, ranging from no preference to strong preference in favor of the associated item, it is not clear how variance in reactivation could lead to the opposite preference. The only other interpretation of the below-chance performance is that it is noise, but in that case, how can it be explained using classifier evidence? It would be useful if the authors commented on this.

4) A suggestion: to the extent that classification performance is suffering due to a lack of training data, we have sometimes found in my lab that we can improve classification by using a “leave one subject out” approach (i.e., train on all but one subject, test on the left-out subject). This approach assumes that brain patterns are relatively consistent across participants. If that assumption is generally true, then the 25-fold increase in the number of training patterns can improve classification accuracy by a substantial margin (conversely: if there is extensive between-subject variability in the patterns, then moving to a leave-one-subject out approach can hurt classification accuracy). Anything that improves classification accuracy has the potential to greatly boost the interpretability of these results.

Author response

We thank all the reviewers for their detailed and insightful comments and questions, which greatly improved the clarity of the paper, as well as uncovering a new analysis that we believe strengthens its final result.

Reviewer 1:

Reinstatement is only significant at the level of category-average representations, not individual exemplars of each face, body, and scene. This is a pity because each category was equally often associated with positive and neutral reward. This makes it difficult to assess whether the specific reward-predictive stimuli were reinstated at the point of reward and whether the strength of such reinstatement predicted the propagation of value updates.

Thanks very much for the comments. We agree. It would be great to see convincing reinstatement at the level of individual exemplars. We suspect getting the power to resolve this would require more presentations of each individual stimulus.

The authors speculate that category-level reinstatement might explain the absence of significant value updating suggested by their behavioural post-test. They do show that stronger reinstatement of the early component of the category representation just before reward onset predicts subjects' value-based choices in the behavioural post-test. However, although reasonably corrected for multiple testing across latencies, this effect is barely significant and is one of several analyses attempting to establish a relationship between reinstatement of reward-predictive stimuli and value updating. [Reviewing editor: This comment clarifies the reviewers' concern about this issue, but it may not be possible to improve on the comments provided by the authors in the discussion of their findings.]

Yes, we completely agree. We added the “failed analyses” section and the more conservatively corrected threshold in Figure 6 to try to be upfront about this. Originally we were reasonably happy that the effect did survive modest correction, even with subjects being unconscious of the associations.

But, in response to Reviewer 3, we split the correlation between decoding and behavior according to positive (preferring Si+ over Si−) or negative (preferring Si− over Si+) preference. We now find that looking only at people with a positive behavioral effect greatly improves the strength of the correlation (new Figure 6).

This finding is consistent with Reviewer 3's observation that above-chance decoding could be linked to a positive behavioral effect, but below-chance decoding should not be linked to a negative behavioral effect. We thank the reviewer for this insight and this now strengthens the significance of the relationship between reinstatement and value updating.

Reviewer 2:

1) The Introduction does a nice job of identifying a gap in the literature, though I believe it overstates this gap in some ways. I would suggest additional discussion of the findings of Manning et al. (2011, 2013, Journal of Neuroscience), which identify the spatial and spectral signatures of reinstatement. Perhaps the authors could emphasize the temporal aspects of associative retrieval more thoroughly in this section.

Thank you for the suggestion. We have added the Manning citations (2013, JoN, should be 2012, JoN, right?), mentioning explicitly how they investigated oscillatory patterns. We have also tried to frame more clearly the specific gap and the question we're looking at (i.e., decomposing the trajectory of spatial patterns that appear at study and seeing which of these elements reappear at retrieval).

“At retrieval, recent studies have explored the fast evolution of neural representation. EEG studies provide evidence that some information is retrieved as early as 300ms following a cue (e.g., Johnson et al., 2008; Wimber et al., 2012; Yick and Wilding, 2008). Manning et al. (2011, 2012), using electrocorticography, and Jafarpour et al. (2014), using MEG, showed that oscillatory patterns are also reinstated during retrieval; in two of these studies the predominance of low oscillatory frequencies in reinstatement suggests a potential spectral signature.

However, the dynamics of representation during direct experience of an object have never been tied to the dynamics of retrieval. It is not known which of the patterns evoked in sequence by direct experience are reinstantiated during retrieval, what the temporal relationship is in their retrieval, or what functional significance this has.”

2) The authors should more clearly identify the cognitive constructs they are assessing. Throughout the manuscript, the authors seem to alternate between focusing on a paradigm designed for associative retrieval, sensory preconditioning, and value updating. Given that the task design allows for investigating each of these, the authors should state this upfront and additionally discuss the prior literature linking these constructs in the Introduction.

Thank you for this suggestion. We are now more explicit in the Introduction, by first introducing the sensory preconditioning paradigm, and then describing how associative retrieval is thought to mediate value updating in this paradigm. We add citations here from the original Brogden paper and from a more recent review that nicely links these constructs together:

“Sensory preconditioning is a well-established paradigm in which subjects first form an association between two stimuli ('direct' or Sd and 'indirect' or Si) and then form an association between the direct stimulus and a reward (Brogden, 1939). Generalization of value to the indirect stimulus is evidence of retrieving the learned association (Gewirtz and Davis, 2000). Using fMRI, Wimmer & Shohamy (2012) showed that neural representations of the associated indirect stimulus are reinstated when direct stimuli are presented during the Reward-learning phase, and this retrieval is linked to the generalization of value from direct to indirect stimuli. This suggests that reinstatement through the learned associative link may be part of the mechanism for value updating. Our aim here is to explore the temporal structure of this reinstatement, which may help to shed light on the mechanisms of value updating as well as providing general insight into the dynamics of representations during retrieval.”

3) The rationale for using univariate vs. multivariate feature sets should be stated earlier in the text. Is the rationale only to verify that more information is captured using multivariate techniques compared to univariate techniques? If so, I think simply citing the prior literature demonstrating this should be sufficient and it is unnecessary and a bit tangential to focus on this point as a primary finding. As the text stands now, it takes a while to get to the point of the primary research findings.

Thank you for this excellent suggestion. We have moved the univariate classification analysis to the supplement, leaving just the following in the main text:

4) Pattern classification performance was calculated within subject and then averaged across subjects, revealing peaks around 200ms and 400ms. To what extent are these peaks evident in individual subjects? The authors partially raise this issue themselves in the Discussion section, but may wish to provide quantitative results regarding this issue.

We have added a supplementary figure (new Figure 3–figure supplement 2) showing the raw decoding accuracy curves for individual subjects. Although the single-subject curves rest on very little data compared to the group curve, and so should be interpreted cautiously, some individual subjects may show a qualitative pattern of multiple peaks in decoding accuracy.

We also show betas from the quadratic term (a measure of positive curvature) of a regression of decoding accuracy on time between 200ms and 400ms, fit separately for each subject. No individual subject reached significant curvature (i.e. a significant beta on the quadratic term in the regression) after Bonferroni correction. This could be consistent with the idea that the group-level effect is being driven by a trend present in many subjects, rather than by a few extreme subjects.
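
The per-subject curvature measure described above can be illustrated with a short sketch. This is not the analysis code from the study: the function names `quadratic_beta` and `bonferroni_significant` are hypothetical, the fit is a plain least-squares quadratic solved with the normal equations, and the p-values passed to the Bonferroni helper are assumed to come from a separate test on the quadratic term.

```python
def quadratic_beta(times, values):
    """Least-squares fit values ~ b0 + b1*t + b2*t**2; return b2 (curvature).

    Time is centered before fitting, which improves conditioning and does
    not change the quadratic coefficient.
    """
    t0 = sum(times) / len(times)
    rows = [(1.0, t - t0, (t - t0) ** 2) for t in times]
    # Normal equations A b = c, stored as an augmented 3x4 matrix.
    m = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * v for r, v in zip(rows, values))]
        for i in range(3)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    # Back substitution.
    beta = [0.0, 0.0, 0.0]
    for i in reversed(range(3)):
        rest = sum(m[i][j] * beta[j] for j in range(i + 1, 3))
        beta[i] = (m[i][3] - rest) / m[i][i]
    return beta[2]


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that survives Bonferroni correction across subjects."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```

A positive `quadratic_beta` for a subject's accuracy curve between 200ms and 400ms corresponds to a dip between the two peaks. With one test per subject, the Bonferroni threshold is the nominal alpha divided by the number of subjects.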

5) Although the data generally support the claim that they are primarily measuring “evoked” activity, the authors should be cautious to not imply that it is only evoked activity.

Given several of the reviewer comments, we realized it would take more figures and discussion to treat fully the time-frequency transformed analyses and the issue of evoked versus induced activity. As this issue is tangential to the conclusions of the current manuscript, we have decided to leave it for a future contribution. We therefore removed the supplementary figure about time-frequency analysis and any mention of evoked versus induced signals.

6) In the Discussion section: Several EEG and fMRI studies of temporal order memory (some using an RSA approach) are excluded but should be cited and/or discussed (Hsieh, 2011, Journal of Neuroscience; Hsieh et al., 2014, Neuron; Ezzyat and Davachi, 2014, Neuron).

Thank you for pointing these out. We have added the citations with the following text:

“In humans, frontal theta power (Hsieh et al., 2011) and patterns of activity in hippocampus (Ezzyat and Davachi, 2014; Hsieh et al., 2014) are implicated in coding temporal order within sequences of stimuli. Applying methods from the present work could be useful to establish a finer grained map of the representations used in complex memory and decision processes.”

7) Overall, the data support the claim of two temporally distinct representational periods. However, such an interpretation is complicated by a highly significant (double the chance rate) classification performance during the two windows. The authors should discuss this more thoroughly. Additional analyses described above may also clarify this interpretational discrepancy.

High classification performance within each window seems to suggest that there is some consistency in the pattern of activity evoked at this latency following stimulus onset. We have also added the following paragraph to the Discussion:

“Supporting the idea of these ∼200ms and ∼400ms epochs as distinct representational periods, we note the following. First, there was a decrease in classification accuracy between these periods. Second, classifiers trained on one epoch had low accuracy in the other epoch (Figure 3–figure supplement 5), suggesting information about the stimuli was coded differently between epochs. Third, the epochs had different similarity structure with respect to the stimulus categories (Figure 3B). Fourth, the patterns from the two epochs were doubly dissociated in terms of their retrieval at Sd versus outcome (Figure 4), while the time period between the two peaks (i.e., around 300ms post-stimulus) was not strongly retrieved either at Sd or outcome (Figure 4).”

Reviewer 3:

1) What kinds of multiple comparisons corrections are the authors using for the analyses shown in Figure 4A and 4B? Elsewhere in the paper, the investigators appear to be using an approach of controlling for familywise error at p < .05, but it is not clear what is going on here (the paper only says p = 0.05 “peak-level significance thresholds”). Whatever correction the authors use will need to correct for the use of classifiers trained on different time points and also their exploration of multiple time points during the reward learning phase. For what it's worth, I thought that the approach taken in Figure 6 (controlling for all time points, but only two classifiers: the 200ms-trained classifier and the 400ms-trained classifier) was acceptable; a similar approach could be applied here, if it isn't being used already.

We actually used a more conservative correction for Figure 4. At each time point of training (i.e., the y-axis), we randomly permuted the category labels used to train the classifiers 1000 times. We then tested each of these null classifiers at each time point of testing (the x-axis). Thus, each of the 1000 permutations generated a time-by-time map of classification accuracies like those in Figures 4A and 4B. We found the peak of each of these null accuracy maps, yielding 1000 peaks. The 95th percentile of these 1000 peaks is represented by the black contours in 4A and 4B. This statistical correction is especially powerful since it doesn't rest on any prior assumptions about there being interesting signals at 200ms and 400ms.

To make this clearer, we have added to the figure caption, and we have added the following text to the Methods:

“In Figure 4, we show 2-dimensional maps where the dimensions are times relative to two different events. To generate statistical significance thresholds for these maps, we recalculated these maps many times with independently shuffled category labels for the stimuli. Each shuffle yielded a map that contained no true information about the stimuli, but preserved overall smoothness and other statistical properties. The peak levels of each of these maps were extracted, and the distribution of these peak levels formed a nonparametric empirical null distribution. The 95th percentile of this distribution is reported as the significance threshold.”
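
The peak-level permutation threshold described above can be sketched as follows. This is a minimal illustration and not the analysis code used for Figure 4: the one-feature nearest-centroid classifier, the function names, and the data layout (one list of trial values per time point) are all assumptions made for the example.

```python
import math
import random


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Train a one-feature nearest-centroid classifier; return test accuracy."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: sums[y] / counts[y] for y in sums}
    hits = sum(
        min(centroids, key=lambda c: abs(x - centroids[c])) == y
        for x, y in zip(test_x, test_y)
    )
    return hits / len(test_y)


def accuracy_map(train_data, train_y, test_data, test_y):
    """Time-by-time map: rows index training time, columns index testing time."""
    return [
        [nearest_centroid_accuracy(tr, train_y, te, test_y) for te in test_data]
        for tr in train_data
    ]


def peak_level_threshold(train_data, train_y, test_data, test_y,
                         n_perm=1000, alpha=0.05, seed=0):
    """(1 - alpha) quantile of the peaks of null maps built from shuffled labels."""
    rng = random.Random(seed)
    peaks = []
    for _ in range(n_perm):
        shuffled = list(train_y)
        rng.shuffle(shuffled)  # destroys the label information used in training
        null_map = accuracy_map(train_data, shuffled, test_data, test_y)
        peaks.append(max(max(row) for row in null_map))
    peaks.sort()
    k = math.ceil((1 - alpha) * n_perm) - 1
    k = min(max(k, 0), n_perm - 1)  # guard against rounding at the edges
    return peaks[k]
```

Because only the single largest value of each null map enters the null distribution, any cell of the true map that exceeds the returned threshold is significant after correcting for every training and testing time point at once.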

2) For the RSA analysis, it was not clear if the authors obtained a meaningful measure of “self-similarity” values (i.e., how similar is the pattern for a particular item to the pattern evoked by other instances of that same item). If they did not measure this, it would be useful to re-do the analysis in a way that obtains self-similarity measurements. Having this information would allow the authors to get separate readouts of item-specific information (e.g., by contrasting the diagonal cells to the off-diagonal cells from the same category) and category information. In particular, it would be useful to get these measures (along with some statistical assessment of their reliability, e.g., through bootstrapping) for both the 200ms and 400ms time points. While it is clear that category structure is not strongly represented at the 200ms time point, the RSA analysis in its current form does not speak to how strongly item-specific information is represented at the two time points (the text, as it is currently written, seems to be implying that there is less information about individual items at 400ms than 200ms, but we don't know that).

We agree. From the analysis we reported, we couldn't tell anything about the amount of information about individual items at 400ms vs. 200ms. The main claim we can make is that the neural representation of items within a category looks more similar at 400ms than 200ms, compared to between-category items. We have amended our descriptions of this throughout the paper, and excised all references to having less item-specific information at 400ms.

However, we pursued this suggestion directly by checking the similarity (measured by correlation) of neural responses evoked to different presentations of the same stimuli. These similarities were not different between 200ms and 400ms (p=0.3 by t-test). Thus, at least with this approach, we can't reach any conclusions about the absolute amount of item-specific information at 200ms versus 400ms. In the future, we think it will be interesting to investigate exactly what is different about the coding of the representations in different time epochs.
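
As a rough illustration of the self-similarity measure mentioned above, the sketch below averages the Pearson correlation between every pair of presentations of the same item. The data layout (a dictionary mapping each item to a list of sensor patterns at one latency) and the function names are assumptions made for the example, not the study's code.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation between two equal-length sensor patterns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def self_similarity(patterns_by_item):
    """Mean correlation across all pairs of presentations of the same item."""
    correlations = []
    for presentations in patterns_by_item.values():
        for first, second in combinations(presentations, 2):
            correlations.append(pearson(first, second))
    return sum(correlations) / len(correlations)
```

Computing this value once per subject at 200ms and once at 400ms gives the paired quantities that a t-test across subjects would then compare.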

3) As shown in Figure 1, about half of the participants showed P(Si+) that was below .5, and some of these participants showed P(Si+) values that were well below .5. While I can understand how variance in the level of reactivation (during the reward learning phase) could lead to variance in participants' preference, ranging from no preference to strong preference in favor of the associated item, it is not clear how variance in reactivation could lead to the opposite preference. The only other interpretation of the below-chance performance is that it is noise, but in that case, how can it be explained using classifier evidence? It would be useful if the authors commented on this.

Very good point. We had been laboring under the misapprehension that having the wrong representation online could lead to negative behavioral preference. But in fact, as the reviewer noted, this doesn't make sense because you'd need to have the wrong 'category' online to get below-chance decoding, and the wrong 'exemplar' of a category online to get P(Si+)<0.5, and these two things are orthogonal. Thanks very much for bringing this to our attention.

In response to this, we split Figure 6 into two separate analyses: one using only subjects with P(Si+)>=0.5, and the other using only subjects with P(Si+)<=0.5. The new Figure 6 replaces Figure 6 from the original manuscript. It turns out that there is a much stronger correlation between behavior and decoding when we only consider P(Si+)>=0.5! Furthermore, the strongest peak of this correlation is at a time (∼400ms post-outcome) that might be easier to interpret than the peak in the original analysis (which was ∼100ms pre-outcome).

We would interpret this as validation of the reviewer's point: a positive behavioral preference can be driven by correct retrieval (i.e., above-chance decoding), but not vice versa. It's also nice that by cleaning up the analysis in this way, the effect is revealed to be highly significant (surviving nonparametric correction for peak level of the entire time-by-time map) rather than marginally significant.

We have also updated the Results and Discussion sections to reflect this new analysis.

4) A suggestion: to the extent that classification performance is suffering due to a lack of training data, we have sometimes found in my lab that we can improve classification by using a “leave one subject out” approach (i.e., train on all but one subject, test on the left-out subject). This approach assumes that brain patterns are relatively consistent across participants. If that assumption is generally true, then the 25-fold increase in the number of training patterns can improve classification accuracy by a substantial margin (conversely: if there is extensive between-subject variability in the patterns, then moving to a leave-one-subject out approach can hurt classification accuracy). Anything that improves classification accuracy has the potential to greatly boost the interpretability of these results.

Thanks for this suggestion. Early on we did experiment with training classifiers on multiple subjects' data together. In this data set it usually hurt performance more than it helped; and when we compared the patterns of individual subjects we found that they were often quite different from one another. We are very interested in exploring ways of pooling subjects' data together to improve classification, for example by learning some transformation to map subjects into a common space.
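
For illustration only, the leave-one-subject-out scheme suggested by the reviewer could be sketched as below. This is not the code from the experiments mentioned above: the function names are hypothetical, and `fit` and `predict` stand in for whatever classifier is used.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx) once for each unique subject."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx


def loso_accuracy(features, labels, subject_ids, fit, predict):
    """Mean held-out accuracy across subjects.

    fit(train_features, train_labels) returns a model;
    predict(model, test_features) returns one predicted label per trial.
    Both are caller-supplied, hypothetical callables.
    """
    scores = []
    for _, train_idx, test_idx in leave_one_subject_out(subject_ids):
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        predictions = predict(model, [features[i] for i in test_idx])
        hits = sum(p == labels[i] for p, i in zip(predictions, test_idx))
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)
```

Each fold trains on every trial from all other subjects, so accuracy improves only to the extent that the patterns are consistent across subjects, which matches the trade-off described in the exchange above.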

Gatsby Charitable Foundation

Medical Research Council (MR/K005464/1)

Wellcome Trust (098362/Z/12/Z)

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Anna Jafarpour, Will Penny, Aidan Horner, Martin Hebart and Tim Behrens for helpful discussions and Laurence Hunt and 'Ōiwi Parker Jones for comments on an earlier version of this manuscript.

Ethics

Human subjects: All participants provided written informed consent and consent to publish prior to start of the experiment, which was approved by the Research Ethics Committee at University College London (UK), under ethics number 1825/005.
