Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language

Bibliography

Our concern today is to carefully analyze this paper both
- for content, and
- to classify its material for entry into a database,
thus setting the stage for your own contribution to our initial database of brain imaging studies related to action, action recognition, imitation, and language.

An important step is to enter all references you find of interest into the database. Where possible, insert abstracts and links to full text. However: Do not restrict your follow-up reading only to Web-accessible papers.

Page

evidence from lesion studies that retrieval of words for actions can be related to structures in the left premotor/prefrontal region and in the left posterior temporal regions (Caramazza and Hillis, 1991; Damasio and Tranel, 1993; Daniele et al., 1994; Hillis and Caramazza, 1995; Miceli et al., 1988; Miozzo et al., 1994; Thompson-Schill et al., 1998; Zingeser and Berndt, 1990).

This suggests a possible annotation to add to these 8 papers. Later reading of such papers could then expand the information for individual papers.

Some convergent evidence can also be found in several functional imaging and electrophysiological studies (Fiez et al., 1996; Grabowski et al., 1996; Hinke et al., 1993; Koenig et al., 1999; Martin et al., 1995; Martin et al., 2000; Perani et al., 1999; Petersen et al., 1988; Pulvermuller et al., 1999; Raichle et al., 1994; Warburton et al., 1996; Wise et al., 1991).

Note that to annotate these 12 papers we have to rephrase the comment preceding them: ... evidence from functional imaging or electrophysiological studies that retrieval of words for actions can be related to structures in the left premotor/prefrontal region and in the left posterior temporal regions.

Pages

In agrammatic aphasics there is often an impaired use of prepositions, the closed class words, some of which denote spatial relations (e.g., Friederici, 1982, 1985; Friederici et al., 1982; Schwartz et al., 1980; Tesak and Hummer, 1994; Zurif and Caramazza, 1976).

As above. Issue: Simply entering the annotation for each paper versus crediting the source of the annotation.

The next “clump” is to be entered for this paper as “Guiding Hypothesis”: The salient aspects of the neural activations caused by naming actions and naming spatial relations occur in left frontal operculum and left parietal cortices but not in left infero-temporal cortices (IT) or right parietal cortices. The actions focused on are those denoted by action verbs, while the spatial relations are those denoted by locative prepositions (e.g., in, on, above, below).

But note that this has to be linked to the outcome of the study: Was the hypothesis confirmed or not? As the richness of the database increases, we will search for data in other papers pro and con each hypothesis and we will want to post summaries of the “best” hypotheses. [What is best?]
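
The issue raised above (entering an annotation for each paper versus crediting where the annotation came from) can be made concrete with a small sketch. This is only one possible layout, not the design of the class database: the names `Annotation`, `annotate`, and `credited_to`, and the placeholder string "paper under analysis", are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One comment attached to a paper (hypothetical record layout)."""
    text: str
    credited_to: str  # where the comment came from, not who wrote the cited paper


def annotate(db, citations, annotation):
    """Attach the same annotation to every cited paper, keeping its source."""
    for citation in citations:
        db.setdefault(citation, []).append(annotation)
    return db


# The six citations grouped under one comment on the slide above.
preposition_papers = [
    "Friederici, 1982",
    "Friederici, 1985",
    "Friederici et al., 1982",
    "Schwartz et al., 1980",
    "Tesak and Hummer, 1994",
    "Zurif and Caramazza, 1976",
]

note = Annotation(
    text="In agrammatic aphasics there is often an impaired use of prepositions",
    credited_to="paper under analysis",  # placeholder for the source paper
)

db = annotate({}, preposition_papers, note)
```

Because the source is stored on the annotation itself, a later reader who opens one of the cited papers can tell which comments were inherited from a citing paper and which came from reading the paper directly.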

Support for the Hypothesis

The following must be linked to the hypothesis it supports. The judgement call is whether “Support for hypothesis” is a label for the entry or a label for the link between it and the hypothesis.

My point: The database you will work with is an initial design. Thus your assignment will be both to enter data and to suggest better ways to structure the database for searches, report generation, etc., etc.

The parietal and frontal regions are hypothesized to pertain to both actions and spatial relations because of their known involvement in the processing of space and of motion in space. This involvement was first suggested in classical human neuropsychological studies (Newcombe, 1969), and in nonhuman primate studies (Ungerleider and Mishkin, 1982).

The selection of the left as opposed to right aspect of those structures derives from the assumption that the linguistic denotation of actions and spatial relations will be preferentially handled by the language-dominant hemisphere. [NSR: No supporting reference.]

The prediction that left IT would not be activated came both from the fact that left IT is active when words denoting concrete entities are retrieved and the fact that there is no compelling reason to expect this region to be involved when spatial manipulations are being performed on concrete entities that are not being specifically identified or named. [NSR]

There is some evidence suggesting that left IT is not necessary for the retrieval of words for actions (Damasio and Tranel, 1993).

Clumps Need Not Have Contiguous “Atoms”

The next paragraph on p.1054 breaks into 2 parts:

Additions to the Hypothesis: We predicted that word retrieval for actions would activate the lateral temporo-occipital cortices related to motion processing, specifically those known as area MT, but we did not predict activation in lateral temporo-occipital cortices during word retrieval for spatial relations.

Support for these portions of the hypothesis: Area MT has been identified by neurophysiological and neuroimaging studies to be involved in the perception of real motion, or motion suggested by consecutive presentation of static images (Goebel et al., 1998; Kaneoke et al., 1997; Stevens et al., 2000; Tootell et al., 1995; Watson et al., 1993; Zeki et al., 1991, 1993). However, there is no compelling evidence to suggest that area MT might be involved in the processing of spatial relations.

Note then the issue of (a) aggregating the hypothesis from separate parts of the paper but (b) then decomposing it into the separate pieces for which different supporting data may be marshaled. Again, as the database progresses, some parts of a hypothesis may be further supported; others may be modified; while yet others will come to be rejected outright. What database tools will support this kind of dynamic knowledge management?
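
As a minimal sketch of points (a) and (b), a hypothesis can be stored as a list of parts, each carrying its own evidence links and its own status. The class and field names (`Hypothesis`, `HypothesisPart`, `Evidence`, `status`) are hypothetical and are not taken from the class database; only the hypothesis text, page number, and citations come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    citation: str
    relation: str  # "supports" or "contradicts"


@dataclass
class HypothesisPart:
    text: str
    source_page: str
    evidence: list = field(default_factory=list)
    status: str = "open"  # may later become "supported", "modified", "rejected"

    def link(self, citation, relation):
        """Attach one datum to this part of the hypothesis."""
        if relation not in ("supports", "contradicts"):
            raise ValueError("relation must be 'supports' or 'contradicts'")
        self.evidence.append(Evidence(citation, relation))


@dataclass
class Hypothesis:
    label: str
    parts: list = field(default_factory=list)

    def citations(self, relation):
        """All citations with the given relation, across every part."""
        return [e.citation for p in self.parts for e in p.evidence
                if e.relation == relation]


mt_actions = HypothesisPart(
    "Word retrieval for actions activates area MT", "p.1054")
mt_spatial = HypothesisPart(
    "Word retrieval for spatial relations does not activate lateral "
    "temporo-occipital cortices", "p.1054")

# Two of the supporting citations listed on the slide.
for cite in ("Watson et al., 1993", "Zeki et al., 1991"):
    mt_actions.link(cite, "supports")

additions = Hypothesis("Additions to the Hypothesis", [mt_actions, mt_spatial])
```

Keeping the status on each part rather than on the whole hypothesis lets one part be supported while another is modified or rejected as new data arrive.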

The Structure of the Lexicon

What is the structure of the lexicon? Conceptual issue: How does a conceptual structure propagate to, e.g., different lexical forms? Relate to general issues of metaphor and analogy? What is stored for later retrieval, what is computed on the fly?

Later lectures will explore the idea of Verb-Argument structure, as in
Hit(Harry, hammer, nail)
Bite(Yingshu, kaki)
and suggest that the representation of a verb may obligatorily involve representation of the general characteristics of nouns which fill its slots.

We may thus wish to post hypotheses as we enter material into the database, and then gradually link the hypothesis to supporting or contrary data as we build the database and search it for the contributions of others. But then we need an inference engine to (a) assess reliability of each possibly relevant datum and then (b) compute a confidence value for the hypothesis.
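
To make steps (a) and (b) concrete, here is a deliberately naive sketch. The scoring rule (the reliability-weighted fraction of supporting data) and the reliability numbers are assumptions chosen for illustration; the slide does not specify how reliability or confidence should be computed, and a real inference engine would need something better justified.

```python
def hypothesis_confidence(data):
    """Return a confidence value in [0, 1] for one hypothesis.

    `data` is a list of (reliability, supports) pairs, where reliability is a
    number in [0, 1] assigned in step (a) and supports is True or False.
    Assumed rule: reliability-weighted share of supporting data.
    With no usable data, return 0.5 (no evidence either way).
    """
    for reliability, _ in data:
        if not 0.0 <= reliability <= 1.0:
            raise ValueError("reliability must lie in [0, 1]")
    total = sum(reliability for reliability, _ in data)
    if total == 0:
        return 0.5
    support = sum(reliability for reliability, supports in data if supports)
    return support / total


# Hypothetical reliabilities: two supporting data and one contrary datum.
example = [(0.9, True), (0.6, True), (0.5, False)]
confidence = hypothesis_confidence(example)
```

Under this rule a single highly reliable contrary datum can outweigh several weak supporting ones, which is one reason step (a) has to come before step (b).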

Semi-Models and Experiment Design

Need to better tease apart
- the recognition processes engaged in naming an object or an action
- going from a representation of the recognized entity to a representation of its name
- going from the representation of a name to a representation of the recognized entity
- the use of words in sentence production and perception

p.1054: In addition to regions involved in implementing the actual vocalization of the response, other regions must be involved in processing the conceptual knowledge behind a given action or spatial relation, and in retrieving the specific morphemes used in the correct response. The experiments conducted in this study aimed at identifying regions involved in conceptual processing and intermediary word retrieval and at excluding activation of regions involved in implementing responses, which were shared by the target and control tasks, and were to be canceled out in the subtraction of the former from the latter. The experiments do not address the issue of the degree to which conceptual processing and word retrieval can be functionally separated.

To what extent should this analysis of key subprocesses be entered as part of the hypothesis? Is the appeal to subtraction “theory-free”?

Theoretical framework

p.1054: Damasio, 1989; Damasio et al., 1990, 1996; Damasio and Damasio, 1994; Tranel et al., 1997a, 1997b: Word-form production is dependent on three kinds of neural structures:
(1) those which support conceptual knowledge and are located in early and high-order sensory cortices of both hemispheres;
(2) those which support the vocal implementation of word forms and are located in classical left perisylvian language areas; and
(3) language-related mediational or intermediary structures, located in inferotemporal and parieto-frontal regions, which are engaged by the structures described in (1) to guide the implementation described in (2).

Theoretical framework

pp : The same system of ensembles and pathways is not recruited equally, in the same subject, on all occasions. Moreover, there is more than one system of ensembles and pathways to support a particular function, i.e., in all likelihood there are several systems assisting with the retrieval of the action verbs and locative prepositions required by our tasks, and those different systems can be engaged depending on the task demands the subject has to do, among other factors. We presume that certain systems probably support the most effective and complete version of a certain performance, and are thus a "preferred" system, but there are other systems that can support the same performance, albeit not necessarily as efficiently.

The sensorimotor patterns that embody the explicit representations of word-forms, e.g., the word forms for actions or spatial relations, do not occur at the intermediary sites. But those patterns are triggered by language-related intermediary ensembles and circuits, which have in turn been triggered by the concept-related intermediary ensembles and circuits. Such intermediary roles are presumed to be played by the frontal and parietal sites hypothesized to be activated by the tasks.

Methods = Protocol

We can pretty much tease apart three components in this kind of experiment:
- Subjects
- Imaging Procedure
  - includes scanning method, warping method, and statistical analysis
- Tasks

The main data gathered will be of the kind
- When subjects execute task A as compared to task B, then regions R1, R2, ... are more significantly activated
  - Activation may depend on an indirect estimate of rCBF (regional cerebral blood flow, in PET: Positron Emission Tomography) or the BOLD signal (based on the measurement of local changes in the electromagnetic field, due to changes in the concentration of oxygenated blood [diamagnetic] and deoxygenated blood [paramagnetic], in fMRI: functional Magnetic Resonance Imaging)
  - The regions R1, R2, ... clearly depend on the confidence level set for the test of significance.
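
The dependence of R1, R2, ... on the significance level can be shown with a short sketch. The region names and t values are those the paper reports for subtractions (5) and (6), quoted later in these notes; the function name `significant_regions` and the three threshold values are assumptions chosen only to show the effect, not thresholds used in the study.

```python
def significant_regions(peaks, t_threshold):
    """Return the regions whose peak |t| reaches the chosen threshold.

    `peaks` is a list of (region, t) pairs for one "task A minus task B"
    subtraction; a negative t means more activity in task B.
    """
    return [region for region, t in peaks if abs(t) >= t_threshold]


# Peaks quoted in the results slides for subtractions (5) and (6).
peaks = [
    ("left middle temporal gyrus (posterior temporo-occipital)", 6.33),
    ("right middle temporal gyrus (posterior temporo-occipital)", 5.45),
    ("left supramarginal gyrus", 4.29),
]

# Hypothetical thresholds: the list of "activated" regions shrinks as the
# criterion becomes stricter.
lenient = significant_regions(peaks, 4.0)
moderate = significant_regions(peaks, 5.0)
strict = significant_regions(peaks, 6.0)
```

This is why a database entry of the form "region R was activated" is incomplete unless the statistical criterion is stored alongside it.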

Caveats

- A region that is more significantly active in task A than in task B may nonetheless play a critical role in task B.
- What is the neural activity which rCBF or BOLD in a region correlates with?
  - My favorite: Integrated synaptic activity
  - Needed: The “vampire model” of the neuron (and glia).

Methods

Typically, theorists’ eyes tend to glaze over when getting to the Methods section, which is often skipped. However, once one accumulates data from multiple experiments, one may find apparent discrepancies, and it may only be by examining details of the Methods that we can resolve these discrepancies or form judgements of the weight to give to various data in forming a summary one will use as a basis for later work, whether designing new experiments, creating models, or …

Thus it is crucial that methods be included in the database.

Having said this, I will leave you to read the paper for Subject and PET Imaging methodology, and proceed to the tasks.

Two Cohorts

The subjects were divided into two cohorts of 10 subjects with 5 men and 5 women each. One of the cohorts was studied for the retrieval of words denoting actions and the other for the retrieval of words denoting spatial relations. Each subject received 8 injections of 50 mCi of [15O] water. For each cohort there were four tasks. Each of the tasks was performed twice. The session was divided, and in each half-session the sequence of tasks was randomized.

For the cohort involved in the retrieval of words denoting actions the tasks were as follows:
(1) retrieval of words denoting actions performed with an implement, mostly manipulable tools and utensils (ISI 1.8 s);
(2) retrieval of words denoting an action performed without an implement, movements of the body or body parts of the agent (ISI 1.8 s);
(3) retrieval of words denoting tools and utensils (ISI 1.8 s); and
(4) an orientation judgment performed on the faces of unknown persons requiring the response up if the face was in the canonical position (up) and down if the face was inverted (ISI 1.0 s).

For the cohort involved in the retrieval of words denoting spatial relations the tasks were as follows:
(1) retrieval of words denoting the spatial relation between two (or among three) concrete entities (mostly tools and utensils) depicted as realistic line drawings, in which the target object was shaded in red (ISI 1.5 s);
(2) the same task as (1) but using abstract rather than realistic drawings (ISI 1.5 s);
(3) retrieval of the words denoting the names of the red shaded tools/utensils in the stimuli used in (1) (ISI 1.5 s); and
(4) an orientation judgment performed on the faces of unknown persons requiring the response yes if the face was in the canonic position (up) and no if the face was inverted (ISI 1.0 s).

Reasons for Experimental Design

For word retrieval for actions they considered that the processing of actions performed without an implement, mostly whole body actions, might be partially segregated from that of words denoting actions performed with an implement.

For the retrieval of words denoting spatial relations they predicted that using abstract stimuli might activate regions specifically related to recognition of a given spatial relation and its subsequent verbal labeling, "uncontaminated" by associations with the concrete objects used in the stimulus (e.g., ring on the finger versus ring around the finger: the latter is correct relative to the spatial relation, the former is the one that reflects typical usage).

They have used the control task of judging the orientation of unknown faces before, but did not want subjects in the second cohort to say up or down in the control task because they reflect a word denoting a spatial relation between the position of the face and the surround. Thus they preferred to use yes for the canonic presentation of the face and no when it was inverted. [But then why not do this for both cohorts?]

They used a second control task for the target tasks involving concrete objects so as to compare directly the retrieval of words denoting actions produced with an implement or denoting relations between concrete entities, to the retrieval of words denoting those entities.

Displaying Regions

- Plotting confidence levels on a 3D structure based on an MRI (i.e., non-functional) of the subject’s brain
- Plotting confidence levels on a 3D structure based on a standard structure obtained by warping the subject’s brain MRI to a “standard brain”, e.g., that of a French woman given in Talairach, J., Tournoux, P. (1988) Co-Planar Stereotaxic Atlas of the Human Brain. Georg Thieme Verlag, Stuttgart, Germany.
- 2D displays: showing contour maps of confidence level for active “blobs” on various views of the brain.
- Giving the names of the neuroanatomical structures in which (much of) the mass of each blob is located.
- Giving the T88 Talairach coordinates of the (perhaps loosely determined) “center” of each blob.

But not every lab uses the Talairach atlas, human brains differ greatly, and so do warping methods.

But: What do you notice about Table 1A as compared to 1B? And what about 2A compared to 2B?
- See last paragraph of p.1058 and last page of penultimate paragraph of p.1060 for the answer.

What does this suggest for database construction?

A wish list for the database: Explicit links between
- Talairach coordinates in a table and
- a view of the brain showing the blob containing those coordinates.

I do not plan to look at this or the other 3 tables in detail in class: Get the idea now, review the paper later.
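
One simple way to picture the wished-for link is a lookup from a table coordinate to the blob that contains it. In this sketch each blob is approximated by an axis-aligned bounding box, which is an assumption (real blobs are irregular voxel sets); the box extents and the figure labels are hypothetical. Only the two peak coordinates come from the results quoted later in these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Blob:
    """A blob approximated by a bounding box in Talairach space (assumption)."""
    name: str
    view: str   # hypothetical label of the brain view showing the blob
    lo: tuple   # (x, y, z) minimum corner, in mm
    hi: tuple   # (x, y, z) maximum corner, in mm

    def contains(self, coord):
        return all(l <= c <= h for l, c, h in zip(self.lo, coord, self.hi))


def blob_for(coord, blobs):
    """Return the first blob containing the coordinate, or None."""
    for blob in blobs:
        if blob.contains(coord):
            return blob
    return None


# Box extents below are invented for illustration only.
blobs = [
    Blob("left IT", "view A", (-45, -55, -16), (-30, -40, -4)),
    Blob("right supramarginal gyrus", "view B", (40, -50, 35), (52, -38, 47)),
]

left_it_peak = (-37, -47, -10)   # reported peak for subtraction (4)
right_smg_peak = (46, -44, 41)   # reported minimum for subtraction (4)
```

A table row could then store the coordinate plus a reference to the returned blob, so that clicking a row brings up the corresponding view.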

The results in the search volume:

(1) Naming actions (performed with or without an implement) minus the standard control task [orientation of unknown faces]* showed significant maxima in left frontal operculum, left posterior middle frontal gyrus, left IT, left and right inferior parietal lobule (angular and supramarginal gyri). The left and right supramarginal gyri also showed significant minima as did the right angular gyrus (see Table 1A).
*For entry in the database, the term “control task” is inadequate.

(2) Naming spatial relations (using tools/utensils or abstract shapes), minus the standard control task showed significant maxima in left frontal operculum, left posterior middle frontal gyrus, and left IT when using tools/utensils, but no maxima in the inferior parietal lobule, either on the left or on the right (see Table 2A).

(3) Naming actions performed with an implement minus naming actions performed without an implement showed a cluster of significantly active voxels in the depth of the posterior sector of the middle temporal gyrus (-45, -52, -1, t = +4.76), and two additional smaller clusters in the left supramarginal gyrus (-55, -27, +29 and -42, -42, +46 with t = and +4.12, respectively); i.e., there was more activation in this region during the task in which tools/utensils were used (Fig. 2A).

Are there any implications of such differences for syntax or only for the underlying perception of actions?

The results in the search volume

(4) Naming spatial relations using tools/utensils minus naming spatial relations using abstract shapes showed significant activation in left IT (-37, -47, -10, t = +5.05); i.e., there was more activation in this region during the retrieval of words for spatial relations using tools/utensils than when using abstract drawings. This subtraction also revealed a significant negativity in the right supramarginal gyrus (+46, -44, +41, t = -4.39), suggesting that there was more activity during the retrieval of words for spatial relations using abstract shapes rather than when using tools/utensils (Fig. 2B).

(5) Naming of actions performed with an implement minus naming of implements showed two maxima in the posterior temporo-occipital region, at the level of the middle temporal gyrus, one within the search volume, in the left hemisphere (-43, -72, +9, t of +6.33), and the other outside the search volume, in the right hemisphere (+49, -69, +3, t of +5.45). See Fig. 3A.

(6) Naming spatial relations minus naming of implements shows significant activation in the left supramarginal gyrus (-62, -41, +27, t of 4.29). See Fig. 3B.

(7) A matter of taste as to whether to include or omit?
- Note the different judgements of an author versus a collator transferring a paper into a database.

Discussion

The subtraction of naming actions performed with a tool or utensil minus naming tools or utensils reveals activation in both MT areas, as predicted. The subtraction of naming spatial relations (using stimuli with real objects) minus naming tools or utensils reveals activation in left supramarginal gyrus, but not in the MT region, also as predicted.

The bilateral activation of MT, a region which both neurophysiological studies (Zeki, 1993, for review) and functional imaging studies (Corbetta et al., 1990; Watson et al., 1993; Zeki et al., 1991) have implicated in the perception of movement, is intriguing, considering that the stimuli used in this study are static. We presume that in order to perform the naming of an action from a static stimulus the subject will generate a mental simulation of the movement. Kourtzi and Kanwisher (2000) observed activation [of MT?] with viewing of static pictures representing motion, a finding entirely consonant with ours.

The activation site seen here is posterior to the sites noted in several previous studies engaged in the generation of a verb from the viewing of a concrete entity as opposed to the depiction of an action (Fiez et al., 1996; Martin et al., 1995; Warburton et al., 1996; Wise et al., 1991; summarized in Martin et al., 2000).

Note the need to tease apart summary of data from elsewhere, summary of data from this study, and hypotheses (presumptions) and interpretations supported by the data.

Discussion

The finding of left frontal opercular activation during the naming of actions is concordant with previous findings of activation at this site when "verb generation" tasks were used, as in Petersen et al.'s original study (1988) and in the several subsequent studies using the same or similar paradigms (e.g., Fiez et al., 1996; Grabowski et al., 1996; Hinke et al., 1993; Koenig et al., 1999; Martin et al., 1995; Perani et al., 1999; Raichle et al., 1994; Warburton et al., 1996; Wise et al., 1991).

The fact that naming spatial relations did not show activation sites in the parietal lobe when the subtraction involved the standard control task is probably explained by the nature of this task. Although we asked subjects to say yes and no, instead of up or down, in order to avoid the retrieval of a word denoting a spatial relation, the task still relies on the perception and recognition of a spatial relation, namely the spatial orientation of a face. It is possible that this process engages parietal structures that are also engaged in the production of the word that denotes the spatial relation between the two target objects. The fact that we did detect activation in left parietal cortices when the retrieval of words denoting concrete entities is used as control task (subtraction 6) favors this explanation.

Discussion

Our prediction that inferotemporal cortices would not be involved in naming actions or spatial relations was not supported.

Note: link needed from hypothesis to data which disconfirm it.

Both naming actions as well as naming spatial relations (minus the standard control task) show that the posterior sector of IT also becomes active. However, this activation is significantly stronger when concrete entities are part of the stimuli, as seen in the results of subtractions 3 and 4, and depicted in Fig. 2. We do not believe that IT is involved in retrieving words for actions or spatial relations, but rather that the words denoting the objects represented in the stimuli are also retrieved, consciously or not, along with the words that denote the actions and spatial relations.

MAA: Note this for my later discussion of verb-argument structure.

The possibility that naming actions performed with an implement versus actions performed without an implement might produce different results was not confirmed. The contrast between the two conditions revealed only a small region of activation in posterior IT during naming actions performed with an implement. Again, this activation may point to the conscious or nonconscious concurrent retrieval of the names of objects (tools or utensils) used in the stimuli.

Discussion

The contrast between naming spatial relations from concrete objects and from abstract shapes seems to support our prediction that using abstract stimuli might activate regions specifically related to the recognition of a given spatial relation. The stronger activation of the right supramarginal gyrus during naming spatial relations from abstract shapes speaks to this point.

The last sentence seems to lack force.

The left supramarginal activation identified in association with the process of naming the spatial relation between two entities (using the subtraction of naming concrete entities from naming spatial relations) suggests a major involvement of systems involved in object manipulation in both personal and extrapersonal space. The retrieval of concepts related to, say, "in-ness" or "on-ness" or "between-ness," requires spatial analyses that engage components of the so-called "where" system (Ungerleider and Mishkin, 1982; Kosslyn, 1994). [MAA: The “how” system. More specificity required.]

The supramarginal activation lateralized to the left might also be related to the retrieval of the actual word (e.g., on, in, between). To clarify this issue further it will be necessary to gather additional evidence from lesion studies and functional imaging studies in which recognition and naming can be separated.

Discussion

It is interesting to note that the direct contrast between the two conditions involving spatial relationships showed a strong activation in left posterior IT when the stimuli were concrete entities, the same site that becomes active when the words denoting those concrete entities were being retrieved. This finding also seems to indicate that, regardless of the task, the presence of concrete entities as stimuli engages the system used to process words denoting those entities.

This is in keeping with the fact that the spatial task using "abstract-shapes" significantly engaged the right supramarginal gyrus, a region not significantly activated when the stimuli are concrete objects. The use of abstract shapes probably promotes a nonverbal strategy in which the subject is forced to analyze "coordinate spatial relations" (in the sense of Kosslyn, 1994) in order to produce the verbal response, without relying on an automatically selected verbal response usually associated with a particular object in a particular language. We interpret the right hemisphere engagement as signaling part of the conceptual process relative to spatial relations.

Do you find this argument convincing? I feel the need for more explicit computational models which can be tuned and updated in the face of accumulating evidence.

Discussion

Because both the words for spatial relationship and the words for actions used in this study constitute a subtype of prepositions and verbs, respectively, the brain sites identified here should not be seen as representative of the networks necessary for retrieval of prepositions or verbs in general. We see our results simply as identifying part of the brain circuitry that we believe is necessary to retrieve words that designate spatial relationships or actions.

We are not suggesting that the frontal and parietal regions engaged in the retrieval of words for actions or for spatial relations are specific to the retrieval of this type of words. We know, even from the results of this study, that these regions are also engaged in the retrieval of words for concrete entities (the subtractions of naming concrete entities from naming actions, and from spatial relations, do not show any significant difference in activity in these regions).

We are simply saying, in the perspective of the theoretical framework summarized in the background section, that word retrieval for actions and for spatial relations from visually presented stimuli, when performed in its most efficient way, does engage a system of which these areas are a component.