WELCOME CILC 2011 ‐ VALENCIA

The organizing committee of the III International Congress of the Spanish Association of Corpus Linguistics (AELINCO) wishes you a warm welcome to Valencia and would like to thank all the speakers and attendees taking part in this annual meeting of the Association. Both the Universidad Politécnica de Valencia and the Department of Applied Linguistics were enthusiastic about hosting the congress, and we are grateful for the support and encouragement given to bring this about. We are delighted with the number of proposals submitted from within Spain and from all over the world, which shows how relevant these annual events are in promoting discussion and reflection on specific aspects of linguistic research.

In line with the aims of AELINCO and previous conferences, the third edition of the International Conference on Corpus Linguistics focuses on the dissemination of research conducted within the framework of Corpus Linguistics, including different aspects of natural language processing and corpus analysis using the tools developed in the field of Information and Communication Technologies (ICTs) for the study of specific languages and genres.

We sincerely hope that the present edition of the AELINCO congress will be a success, and that participants enjoy the opportunity to exchange ideas and inform each other about their research projects in the nine thematic panels and the plenary sessions. Lastly, we would like to thank all those who have participated in the organization of the congress, as well as our sponsors, without whose help and funding the event would not have been possible.

Benvinguts i Benvingudes a València! Welcome, and enjoy your stay in Valencia!

The Organizing Committee, CILC 2011

‐ The Congress folders can be picked up at the Department of Applied Linguistics (3rd floor) on the 7th and 8th of April. Any late arrivals on the 9th of April should get in touch with the congress organisers for their documentation.

‐ Speakers will have 15 minutes for their presentations, and at the end of each panel session each speaker will have 5 minutes for questions.

‐ There will be a Congress notice board on the 2nd floor of the building indicating any last-minute changes or other alterations to the programme.

‐ The posters will be on show in the corridor on the 2nd floor of the Applied Linguistics Department from 16.00 to 18.00 on Friday 8th April.

‐ Attendees will have Wi‐Fi access throughout the building; the password can be found in the Congress documentation folder. Internet access is also available in Aula Multimedia 2 (2nd floor) when the room is free.

‐ All Congress events are open to all attendees, except the Gala dinner, which must be booked in advance (Congress website: www.cilc2011.upv.es).

‐ Certificates will only be given to presenters who have paid the conference fee and presented their paper.

‐ Speakers who wish to publish their papers in the conference proceedings must follow the style guidelines for publication available on the conference website and send their articles to [email protected] by May 8th, 2011. A selection of the articles received will be published in a special volume by an international publisher.

SPECIFIC FREQUENCY AND ITS ROLE IN FOREIGN LANGUAGE VOCABULARY ACQUISITION

Su‐han Cheng and Jeng‐yih Hsu
A CORPUS‐BASED STUDY OF THE VOCABULARY USE IN AN ENGLISH NEWSPAPER

Jueves/Thursday, 7 de abril de 2011, 17.30‐19.30, Aula multimedia 1, 2º piso/floor

Panel 6: Linguistic variation and corpus

Barry Pennock‐Speck
VOICE‐OVERS IN BRITISH TELEVISION ADS: A CORPUS ANALYSIS OF A WRITTEN‐TO‐BE‐SPOKEN GENRE

Javier Ruano‐Garcia
THE WORLD HAS GOT SOME HINT OF HER COUNTRY SPEECH: ON THE ENREGISTERMENT OF THE ‘NORTHERN DIALECT’

Chris Culy, Verena Lyding and Henrik Dittmann
STRUCTURED PARALLEL COORDINATES: A VISUALIZATION FOR ANALYZING STRUCTURED LANGUAGE DATA

Gerold Schneider and Fabio Rinaldi
A DATA‐DRIVEN APPROACH TO ALTERNATIONS BASED ON PROTEIN‐PROTEIN INTERACTIONS

Fatima Faya Cerqueiro
REQUEST MARKERS IN DRAMA: DATA FROM THE CORPUS OF IRISH ENGLISH

Biblioteca, 2º piso/floor

Richa and Shahid Mushtaq Bhat
CASE SYNCRETISM IN URDU‐HINDI: A CHALLENGE FOR NLP

Imen Ktari
POSTMODIFIERS ACTING AS COMPLEMENTS AND ADJUNCTS IN POPULAR AND ACADEMIC MEDICAL ARTICLES: A GENERATIVE CORPUS‐BASED APPROACH

focusing both on recognised patterns/constructions such as N that or V n as n, but also on less frequently considered patterns such as ADJ about n and other adjective patterns. The applications of this approach, such as the automatic recognition of evaluative meaning, will be considered, as will its limitations.

VIERNES/FRIDAY 8 DE ABRIL, 18.30‐19.30
SALÓN DE GRADOS DEL DEPARTAMENTO DE LINGÜÍSTICA APLICADA, 3ª PLANTA

Mike O’Donnell (Universidad Autónoma de Madrid)
Using learner corpora to redesign university‐level ESL education

This talk will discuss various ways in which a learner corpus collected from ESL students can be used to reshape the educational experience of those students, or of those who follow them. Firstly, a learner corpus can provide strong input to the English‐teaching curriculum. We can extract 'grammatical profiles' from learner corpora, showing, for each proficiency level, the grammatical structures that are most critical for developing students at that level. For this, we can use error annotation to track what students are doing wrong at each level, and automatic grammatical analysis to see what they are getting right. Secondly, an error‐annotated learner corpus provides a good basis for the teacher's preparation of materials. When teaching a particular structure, teachers can see what kinds of errors students make, and how frequently. This tells them how much of their teaching material to dedicate to each problem area, and provides examples to use in those materials. The error corpus can also be used to produce exercises for students, for instance asking them to identify errors, or to correct them.
Thirdly, we will discuss how the learner profiles mentioned above can be used by an intelligent online exercise system, which offers questions targeted directly at the needs of the student at their current point of language development, and that adapts its conception of the student's proficiency on the basis of the student's responses.

RESÚMENES DE PONENCIAS (ORDEN ALFABÉTICO DEL APELLIDO DEL PRIMER AUTOR)
ABSTRACTS (IN ALPHABETICAL ORDER OF FIRST AUTHOR'S SURNAME)

Alcantud Díaz, María
Panel: 2. Discurso, análisis literario y corpus
VIOLENCE IN CHILDREN’S TALES: A SYSTEMIC CORPUS AND CRITICAL DISCOURSE ANALYSIS OF CINDERELLA

The main aim of this article is to discuss the results obtained after investigating the presence of violence in the Brothers Grimm’s Cinderella (Tatar 1987, 1992, 2004) through a corpus‐based analysis (Biber 1998), with the intention of finding out what kinds of verbal processes predominate in this tale and whether they can be related to violent actions. The tool used for the analysis was WordSmith Tools 5 (Scott 2010). The study first involved an analysis of the frequencies of the lexical units in Cinderella, followed by a comparison of the frequency results against two reference corpora: the British National Corpus and the Cobuild Concordancer. The analysis was completed with a study of the concordances of selected words, examining in detail the contexts in which they appear. Once the quantitative and qualitative surveys were completed, I proceeded to analyse the types of verbal processes (Halliday 1994: 106‐175) extracted from the frequency list. These were classified according to the framework proposed by Downing (2002: 111), which distinguishes six categories: material, mental, verbal, behavioural, existential and relational. After this classification, the same processes were analysed according to four parameters: who (agent), what (type of action), to whom (affected) and under what circumstances. The results obtained in the frequency and concordance tests seemed to indicate that violence is certainly present in Cinderella. The method proved to be a good tool for checking whether each character’s identity and social position (power) were somehow related to the infliction of violence, that is, whether some characters took advantage of their predominant position and thus inflicted violence upon other characters. As a general conclusion of the analysis, a tentative proposal can be formulated: a corpus‐based analysis, in conjunction with both a transitivity analysis and a critical discourse analysis, can empirically detect the presence of controversial and polemic topics such as violence in different types of texts. The results could be used as evidence to support a social intervention by means of a linguistic intervention (Graddol and Swann 1989) aimed at decreasing the amount of violent language and violent situations reproduced in children’s tales.
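The two‐step procedure used in studies like this one, a frequency list followed by concordances (KWIC lines) of selected words, can be sketched in a few lines of Python. This is a hypothetical illustration of the general technique, not the WordSmith Tools implementation; the sample text is invented:

```python
import re
from collections import Counter

def frequency_list(text):
    """Lowercased word-frequency list of a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

def concordance(text, node, width=20):
    """KWIC lines: each occurrence of `node` with left/right context."""
    lines = []
    for m in re.finditer(r"\b%s\b" % re.escape(node), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append("%s [%s] %s" % (left.rjust(width), m.group(), right.ljust(width)))
    return lines

tale = ("The stepmother struck the girl, and the girl wept. "
        "The stepmother scolded her again.")
freq = frequency_list(tale)
print(freq.most_common(3))
for line in concordance(tale, "stepmother"):
    print(line)
```

The frequency list identifies candidate words (e.g. process verbs); the concordance then shows each word in context so its use can be classified qualitatively.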

Alcaraz‐Mármol, Gema and Lourdes Cerezo‐García
Panel: 8. Los córpora y la adquisición y enseñanza del lenguaje
SPECIFIC FREQUENCY AND ITS ROLE IN FOREIGN LANGUAGE VOCABULARY ACQUISITION

Several studies (Saragi et al. 1978; Hulstijn et al. 1996; Reyes 1999; Waring and Takaki 2003; Pigada and Schmitt 2006; Webb 2007) have highlighted the role of specific frequency, i.e. the number of times a word occurs in a text, in second language vocabulary acquisition. In fact, especially in non‐naturalistic learning contexts, “individual texts within each corpus can vary from one to another and from the overall frequency list which a corpus produces” (Milton 2009: 25). Thus, the specific frequency of a word may differ from its general frequency. Knowing the number of times a word needs to be encountered for acquisition would help designers create reading materials adjusted to learners’ needs. Unfortunately, to date there is no agreement on the number of occurrences necessary for acquisition. What is more, we do not even know whether all words need to be encountered the same number of times. A number of studies have focused on this issue (Horst et al. 1998; Laufer 1998; Nation and Wang 1999; Rott 1999), trying to determine as accurately as possible the number of times a word needs to occur to enable acquisition. The outcomes vary, ranging from 5 to 20 occurrences. Yet most of these works were carried out under artificial or laboratory conditions, which may be far from mirroring the authentic

learning context. The current study aims to approach the real situation of the classroom. It seeks to define the relationship between specific frequency and vocabulary acquisition within the context of EFL formal instruction. We seek to answer two research questions: 1) Is there a significant relationship between specific frequency and immediate vocabulary acquisition, regarding receptive and productive knowledge? 2) Is there a significant relationship between specific frequency and mid‐term vocabulary retention, regarding receptive and productive knowledge? To achieve our aim, a group of nine‐year‐old students of EFL in their fourth year of Elementary Education was tested on vocabulary contained in their coursebook. The input for the experiment was taken from Unit 3, which introduced a total of 21 target words (17 nouns and 4 adjectives). These words were classified into three groups according to their specific frequency, taking both written and oral occurrences into consideration. Three weeks before starting Unit 3, the target words were pre‐tested. Once students had worked through the unit, a receptive and a productive test were administered, both immediately after finishing the unit and three months later. Results show that the effect of specific frequency on vocabulary learning differs depending on when learning is assessed, that is, whether it is tested immediately after dealing with the vocabulary or some months later.

Alcina Sousa and Alda Correia
Panel: 2. Discurso, análisis literario y corpus
FROM MODERNITY TO POST‐MODERNITY: CONFLICTING VOICES IN LITERARY DISCOURSE ‐ A CORPUS ANALYSIS OF YOU AND ONE

From modernity to postmodern discourse, places, landscapes and people are aesthetically perceived and reshaped within the perspective of alterity/otherness, upon which one constructs the image of “one’s own” and the “other” in a dialogical game of mirrors. This paper discusses the possibilities of a corpus analysis applied to literary interpretation. Our goal is thus to present the preliminary findings of a work in progress intended to disambiguate some pronominal references, i.e. one/you, as they occur in prose fiction, namely in two novels by Virginia Woolf and Hugo Hamilton. These involve readers in a dialogic interpretation of the text’s “polyglossia”, conveying either generic pronoun reference or the protagonist’s inner voice. In Hamilton’s The Speckled People (2003), the shifting pronominal reference I/you points to a multitude of pulls either inwards or outwards, be it in the sphere of the individual and the community to which he belongs, or in physical space. Very often in the novel, the focaliser/protagonist presents an alternative view to mainstream ideology, reinforced by the generic pronoun reference you. By contrast, one occurs more frequently in Virginia Woolf’s texts. This evidences a linguistic/stylistic choice conforming to patterns of use from modernity to post‐modernity, which draws attention to her way of conceiving her feminist project and a postmodern aesthetics. This analysis will benefit from a multi‐layered interpretive framework drawing on discourse analysis and corpus‐based approaches, particularly in that it unpacks the ways in which writers make use of linguistic structures. The analysis of collocational meaning (Partington 1998: 9‐10) “can provide powerful support for a reader’s intuition”.
Consequently, the reader is challenged “to explore new kinds of identity and forms of relationship” or, according to Martin Montgomery et al. (1995: 121), “to see the world from unfamiliar and revealing angles… by subverting the commonsense bonds between utterances and their situations of use”.

The study of authorial style in literary and non‐literary works has always been a staple in the humanities. It is generally assumed by researchers in the field that people have a characteristic pattern of language use that can be detected in their way of speaking and in their writings, and the first applications of this theory aimed at authorship attribution. As Juola puts it, “[d]isputes about the ownership of words have been around for as long as words themselves could be owned” (2008: 237). In the era of personal computers and corpus linguistics, the study of style in language has seen its greatest development, giving rise to the discipline known as “computer stylometry”. Within this field, simple statistics have been combined successfully, the most notable example being the Delta method (Burrows 2002). This method is considered to produce very positive results (Cantos et al. 2010); hence authors such as Argamon (2008) and Hoover (2004, 2004a) have proposed interesting modifications of it. The method has commonly been evaluated on literary texts, such as English poems and novels, by different authors. More recently, it has also been used to discover patterns of similarity and difference in works by the same author, in order to detect stylistic variation throughout their work and to examine how patterns in dialogue are used to individualize characters, that is to say, to construct their idiolect. Even though this kind of computational testing provides a sound basis for an emerging discipline, so far only two studies have explored characters’ idiolects, and neither of them includes the Delta procedure in its research methodology. First, Rybicki (2006) studied character idiolects in Henryk Sienkiewicz’s trilogy and its two English translations. Subsequently, Rybicki (2008) examined the idiolects of the characters of Shakespeare’s Hamlet, comparing nine randomly selected translations into various languages by means of Multidimensional Scaling graphs of characters’ speech, based on the relative frequencies of the most common words. In view of the preceding discussion, this work is intended as a contribution to the available empirical knowledge on the computational stylometric analysis of literature through the application of the Delta method. Specifically, we delve into characterisation in Oscar Wilde’s oeuvre since, to the best of our knowledge, this celebrated writer has not yet been the object of any computational stylistic analysis. For the discrimination of characters within the same play, we have performed Delta and Delta Prime analyses of the idiolects in the English originals. Specifically, the spreadsheets list the 100 most common words in descending order of their frequency in the corresponding subset, show their mean frequencies as percentages of that set, present the corresponding standard deviations, and give z‐scores representing their divergences from the means of the other subsets. In addition, a Wilcoxon signed‐rank test has been performed. The results suggest idiolectal divergences among several characters, as well as certain linguistic patterns shared by characters of the same social group.
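In outline, Burrows' Delta works on exactly the quantities named above: relative frequencies of the most common words, their means and standard deviations across the candidate texts, and z‐scores. A minimal sketch under simplifying assumptions (whitespace tokenisation, a tiny word list, toy texts with invented labels; not the authors' actual pipeline):

```python
from collections import Counter
from statistics import mean, stdev

def rel_freqs(text, vocab):
    """Relative frequency of each vocabulary word in `text`."""
    toks = text.lower().split()
    counts = Counter(toks)
    return [counts[w] / len(toks) for w in vocab]

def burrows_delta(corpus, test_text, n_words=3):
    """Delta of `test_text` against each candidate in `corpus` (name -> text):
    mean absolute difference of z-scores over the n_words most frequent
    words of the pooled corpus (Burrows 2002)."""
    pooled = " ".join(corpus.values()).lower().split()
    vocab = [w for w, _ in Counter(pooled).most_common(n_words)]
    profiles = {name: rel_freqs(t, vocab) for name, t in corpus.items()}
    mus = [mean(p[i] for p in profiles.values()) for i in range(len(vocab))]
    # guard against zero spread so the z-score never divides by zero
    sds = [stdev(p[i] for p in profiles.values()) or 1e-9 for i in range(len(vocab))]
    z = lambda p: [(p[i] - mus[i]) / sds[i] for i in range(len(vocab))]
    zt = z(rel_freqs(test_text, vocab))
    return {name: mean(abs(a - b) for a, b in zip(zt, z(p)))
            for name, p in profiles.items()}

corpus = {
    "Wilde": "the cat sat on the mat",
    "Shaw": "a dog ran in a park",
    "Anon": "the dog and the cat",
}
scores = burrows_delta(corpus, "the cat sat on the mat the cat")
print(scores)
```

The candidate with the lowest Delta score is the stylistically closest; applied to character subsets rather than author corpora, the same computation discriminates idiolects.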

Almela, Moisés
Panel: 4. Lexicología y lexicografía basadas en córpora
FROM COLLOCATION TO INTER‐COLLOCATION: DEVELOPING A DYNAMIC APPROACH TO COMBINATORIAL LEXICOGRAPHY

The lexicographical treatment of collocation has focused on descriptions of dependencies between words, typically the combination of a node and its collocates. This perspective of analysis can be described as “intra‐collocational”, because it is centered on the analysis of internal relationships within a bigram. There are, however, strong reasons to argue that the intra‐collocational perspective in combinatorial lexicography is incomplete and sometimes even misleading. Recent studies in corpus‐based lexicology have suggested that the collocational profile of a node is in part shaped by interdependencies among its collocates (Cantos & Sánchez 2001; Sánchez et al. 2007; Almela et al. 2011). Therefore, in order to increase the accuracy of collocational descriptions, the intra‐collocational perspective should be complemented with an “inter‐collocational” analysis, that is, an analysis of the way in which different collocations of a word influence each other. An interaction between two or more collocations is observed wherever the association strength of a node‐collocate pair is reinforced or weakened by the effect of other neighboring elements. Thus, given a node word W and three of its collocates (C1, C2, C3), the probability of finding C1 in the context of W can be increased or decreased by the presence of C2 or C3. To put it more formally, the intra‐collocational perspective is concerned with dependencies of the form W|C1, W|C2, C1|W, C2|W, etc., while the inter‐collocational perspective is concerned with dependencies of a more complex form, namely (W,C1)|C2, (W,C1)|C3, (W,C2)|C3, etc.
For example, the likelihood that the noun policy functions as a direct object of the verb review is higher when it is modified by adjectives such as existing or current in comparison with cases in which policy is modified by local; and conversely, the probability of finding other verbal collocates, such as implement and develop, in the context of policy is higher when the adjective is local in comparison with situations in which the adjective is existing or current. Thus, we can say that existing and current are “co‐collocates” of the pair review + policy, but not of the pair implement + policy. This paper submits a proposal for introducing inter‐collocational information into electronic collocation dictionaries. There are, of course, serious objections to the incorporation of this type of contextual data in printed dictionaries, due to obvious limitations of space. However, in electronic lexicography these practical difficulties can be resolved with the help of expanded menus and user interfaces. The central idea of this paper is that by creating a more dynamic design of lexical entries in electronic combinatorial dictionaries it is possible to include more detailed contextual information, especially inter‐collocational relations. The advantages over more conventional approaches to combinatorial lexicography will be illustrated with reference to lexical entries for the nouns policy and control.
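The dependencies of the form (W,C1)|C2 described above can be estimated directly from co‐occurrence counts. A toy sketch of the idea, using the paper's review/implement + policy example with invented context sets and invented counts (not drawn from any actual corpus):

```python
# Toy data: each set is the bag of lemmas co-occurring in one context window.
contexts = [
    {"review", "policy", "existing"},
    {"review", "policy", "current"},
    {"review", "policy", "existing"},
    {"implement", "policy", "local"},
    {"implement", "policy", "local"},
    {"develop", "policy", "local"},
    {"review", "policy", "local"},
]

def p(collocate, *given):
    """Estimate P(collocate | given): the share of contexts containing
    all of `given` that also contain `collocate`."""
    cond = [c for c in contexts if all(g in c for g in given)]
    if not cond:
        return 0.0
    return sum(collocate in c for c in cond) / len(cond)

# Intra-collocational view: P(review | policy)
print(p("review", "policy"))              # 4/7
# Inter-collocational view: the adjective shifts the verb's probability
print(p("review", "policy", "existing"))  # 1.0
print(p("review", "policy", "local"))     # 0.25
```

In this invented sample, existing raises the probability of review in the context of policy while local lowers it, which is precisely the kind of co‐collocate effect the abstract argues dictionaries should record.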

Alonso‐Almeida, Francisco and Ivalla Ortega‐Barrera
Panel: 5. Corpus, estudios contrastivos y traducción
EVIDENTIALITY AND EPISTEMIC MODALITY IN ENGLISH AND SPANISH LEGAL SCIENTIFIC DISCOURSE: A CORPUS‐BASED STUDY

This paper explores the concepts of evidentiality and epistemic modality in a corpus of English and Spanish legal scientific discourse. The data for analysis are taken from Evycorpe, a database of English scientific papers in the fields of computing, medicine and law published between 1998 and 2008. For the present work, we focus only on the legal part of the corpus, but the analysis will later be extended to the other two register subdomains. The Spanish legal corpus has been gathered for this contrastive study following the same compilation criteria as Evycorpe. The notions of epistemic modality and evidentiality are treated differently in the literature (Dendale and Tasmowski 2001). Whereas for some scholars evidentiality represents a subdomain of epistemic modality (Chafe 1986; Palmer 2001), others consider evidentiality an independent category (Cornillie 2009). Epistemic modality is strongly connected to the idea of “truth” and the authors’ responsibility concerning their statements (Traugott 1989; Sweetser 1990; Stukker, Sanders and Verhagen 2009). Evidentiality is seen as the coding of the authors’ “source of knowledge”, which may imply differing degrees of certainty concerning the proposition manifested (Carretero 2004). In this paper, we follow an intersective approach: although the two categories are kept theoretically distinct, they overlap functionally. The use of these strategies may be indexical of the authors’ position and intention in discourse (Marín Arrese 2009). This said, our main objectives are (1) to identify and classify epistemic and evidential markers in the corpus, and (2) to describe their frequency of occurrence in each language subcorpus and their functions, mainly as stance markers. The paper concludes that epistemic markers appear with higher frequency in the English texts, whereas the Spanish texts tend to show more examples of evidential strategies, although in both cases these marker types serve as manifestations of face‐saving expressions (Brown & Levinson 1978), among other pragmatic effects.

Álvarez Mosquera, Pedro
Panel: 9. Usos específicos de la Lingüística de Corpus
TESTING THE EXCEPTION: AN ANALYSIS OF EMINEM’S LANGUAGE USES FROM A CORPUS‐BASED APPROACH

Eminem’s presence in the hip‐hop scene has been controversial ever since he burst into the music world in the late 1990s (Bozza 2003: 93). His exceptional success as a Caucasian in a predominantly African American genre is reflected in the number of records he has sold and the significant support he has garnered from influential figures in the hip‐hop world. While Eminem was attacked by those who accused him of being a product of the music industry designed to sell millions of records to the white market, others defended him for his genuine talent as a rapper. Analyzing rap’s linguistic component, which plays a central role in the genre, is a way to evaluate Eminem’s authenticity as a rapper in a potentially objective manner. Adopting a sociolinguistic approach, we used WordSmith Tools to process Eminem’s language choices in his album The Marshall Mathers LP, released in 2000, and compared them with those in contemporary African American rapper Jay Dilla’s album Welcome 2 Detroit, released in 2001. Analyzing similarly sized corpora from two rappers who share the same relative age, city of origin and gender allows us to place ethnicity and language at the center of this study. Our results highlight significant similarities in how both rappers use rap as a communicative device, following specific linguistic patterns ascribed to the role and function of the African griot in the African American tradition. However, important differences were also noted. The limited references to the central concept of community and the complete absence of the term nigger in Eminem’s corpus (among other features) set him apart from the African American group and align him more closely with the corpus associated with other Caucasian rappers (Álvarez‐Mosquera 2010).
Finally, our data also illustrates that authenticity is a highly disputed quality in rap music. Rap is intrinsically interwoven with ethno‐cultural patterns as a result of the Black Experience (Rose 1994: 123), which has made African American rappers’

Bartholamei Junior, Lautenai Antonio
Panel: 1. Diseño, compilación y tipos de córpora
PEPCO: DESIGNING A PARALLEL AND COMPARABLE TRANSLATIONAL CORPUS IN BRAZIL

Brazilian translation studies have been growing in recent years, as has the use of corpus tools to help researchers. Tools provided by corpus linguistics are often used to support translators in their research or training. PEPCo (pepco.ufsc.br) was designed as a tool to help scholars and researchers create and explore texts in the corpus. The design process of PEPCo was carried out in two steps: (i) corpus design, i.e. text selection and representativeness; and (ii) development of tools, i.e. the use of a MySQL database and the PHP scripting language, and the design of an interface for querying and retrieving data from the corpus using HTML, CSS and JavaScript. The most used tools provided by PEPCo are parallel concordances, monolingual concordances, word lists, n‐grams and the PEPCo Builder. The PEPCo Builder is a tool that makes corpus compilation easier for the user. Users do not need technical knowledge of corpus tools and scripting; they only need a pre‐aligned parallel text in a text processor, in which all sentences/paragraphs match across source and target texts. Both source and target texts are then uploaded using a web form, and the user receives a unique corpus ID at the e‐mail address provided in the form, with which he/she can access his/her own corpus through a web page. The result (in progress) is a parallel corpus of about 3 million words and a comparable corpus of about 5 million words, which could be useful for many researchers in translation studies in Brazil. Most research using PEPCo is related to translation studies and to translational phenomena emerging from a compiled corpus. Popular genres in PEPCo are fantasy, science fiction, medical and academic texts. The corpus tools provide filters allowing the user to search for specific texts, genres, periods, authors, translators and publishers. Users can also restrict a query to the source text, the target text or both; when querying both, the user can define one node for the source text and another for the target text. PEPCo is used by students and teachers for research and translator training in Southern Brazil, and its developers are continually integrating new resources to support new research.
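The parallel concordance lookup at the heart of such a tool can be sketched very simply once the corpus is stored as sentence‐aligned source/target pairs. A hypothetical illustration with invented example sentences (this is not PEPCo's actual implementation or API):

```python
import re

def parallel_concordance(pairs, node, side="source"):
    """Return the aligned (source, target) pairs whose `side` sentence
    contains `node` as a whole word (case-insensitive)."""
    pattern = re.compile(r"\b%s\b" % re.escape(node), re.IGNORECASE)
    idx = 0 if side == "source" else 1
    return [pair for pair in pairs if pattern.search(pair[idx])]

# Toy sentence-aligned corpus (English source, Portuguese target).
pairs = [
    ("The wizard raised his wand.", "O mago ergueu sua varinha."),
    ("The doctor examined the patient.", "O médico examinou o paciente."),
    ("The wizard smiled.", "O mago sorriu."),
]
for src, tgt in parallel_concordance(pairs, "wizard"):
    print(src, "|", tgt)
```

Because each hit returns the aligned pair, the same function covers queries on the source text, the target text, or (by intersecting two calls) a node on each side.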

Bengoechea, Mercedes and José Simón
Panel: 6. Corpus y variación lingüística
FEMINIST LANGUAGE REFORM IN SPANISH ADVERTISING: CORPUS‐BASED RESEARCH

Within the framework of a broader research project, we have examined the evolution of the gender adscape over recent years. Our aim was to investigate to what extent non‐sexist language has been used in the advertisements published in the most influential newspaper in Spain, El País, which is also the one with the widest readership. We collected two samples three years apart: the first corresponds to October 2007 and the second to October 2010. In addition, all advertising received in a middle‐class home in Madrid during the same periods was likewise collected and analysed. A key element in our survey was the corpus we created from our samples. To streamline the study, a database was created in which, once scanned, some 700 ads were registered in a double format, JPG and PDF images, together with the text of each advert. Along with common data (date, section, page, etc.), we also registered the type of product or service advertised. Then, in the same database, we annotated the ads according to the gender treatment in their verbal usage. In this paper we present the results of the first phase of our study, corresponding to the advertising in El País during October 2007, with particular emphasis on the corpus methodology we have followed.

Bouda, Peter
Panel: 3. Estudios gramaticales basados en córpora
LANGUAGE DOCUMENTATION CORPORA IN DESCRIPTIVE LINGUISTICS

The role of corpora in the creation of descriptive grammars has gained a lot of attention in recent decades. Still, only a few grammars directly refer to corpus analysis as a main means of extracting the linguistic information they present. In recent years, the use of software tools in language documentation projects has generated a new source of linguistic data that will be used to compile descriptive grammars for lesser‐used and endangered languages in the future. The goal of this paper is to present a software solution for searching and analyzing annotated corpora created in language documentation projects. The software is especially designed for use with DOBES corpora, but may later be extended to other kinds of corpora. In the first part, I will outline some of the questions a descriptive linguist poses to a corpus while writing a grammar. These questions resulted in a typology of the searches the linguist needs to apply to a corpus in order to extract information about grammatical types and relations at all linguistic levels. This typology was the basis for a list of requirements for a software tool that is currently used in two language documentation projects. Real‐world examples from those projects will be presented to show how grammatical descriptions can be derived from corpora through search and analysis within the software tool. In the second part, I will present the technical solution in detail: a preliminary version of a database/concordancing application specifically designed to fulfil the functions and principles outlined in the first part. It supports the Elan and Toolbox file formats, two of the main software packages used in DOBES documentation projects. These data files typically contain transcriptions, morpho‐syntactic annotations and translations, which are accessible through a search interface within the software. Search results are displayed with full interlinear data, so that context and annotation data are visible to the user. The software implements the search strategies derived from the requirements outlined in the first part, for example successive searches on previous search results, or searches for classes of words, morphemes, glosses, etc. extracted from fieldwork sketches. Parts of the corpus or search results may be published directly in hypertext documents, i.e. in digital grammars, by a simple copy‐and‐paste procedure. Later versions of the software will allow whole corpora to be published in a standardized XML format based on the Corpus Encoding Standard, with fixed URLs that allow access and links to the data on a simple web server. Depending on access restrictions, the underlying data files may also be accessed directly from the DOBES archive at the Max Planck Institute in Nijmegen or other archives.

THE FOCUSING USES OF VERY, PURE, SHEER, MERE. A CORPUS‐BASED INVESTIGATION OF THEIR FUNCTIONAL‐STRUCTURAL STATUS AND THEIR DIACHRONIC DEVELOPMENT. The starting point of this paper is formed by the problems posed by a little‐described element of the English NP, viz. the prenominal focusing adjective. It occurs in postdeterminer position and its semantics are similar to those of focusing adverbs, such as inclusive ‘even’ (1, 2) and exclusive ‘only’ (3), manifesting wide (1, 2) and narrow scope (3). (1) Many commentators feel that the deadly cocktail of drugs, guns and Aids sweeping inner city America is threatening the very existence of Afro‐Americans. (2) Anyone who freezes with fright at the mere sight of the dentist’s chair will be pleased to know that you can now tune into something more relaxing than a screeching drill. (3) We had been hoping for it to coincide with Keats’s birthday, but you can imagine how hard it proved to cram 12 whole quatrains into a mere four hours. The central question is whether they are best treated as secondary determiners (Bolinger 1968, Adamson 2000) because of their structural position and general ‘reference‐modifying’ function, or as a type of emphasizer (Quirk et al. 1985, Vandewinkel & Davidse 2008) because of their inherent or latent scalarity. We will approach this issue from a diachronic angle, studying the focusing uses from their earliest appearances on (in which they may still be entwined with secondary determiner and/or degree modifier uses) and analysing the diachronic changes they underwent to clarify their status in contemporary English. This investigation will be based on systematic qualitative and quantitative analysis of historical and contemporary corpus data with the adjectives very, pure, sheer and mere.
Extractions were made from the Helsinki corpus (750‐1150), the Penn‐Helsinki Parsed Corpora of Middle English (1150‐1500) and Early Modern English (1500‐1710), the Corpus of Late Modern English Texts (1710‐1920), and the COBUILD corpus (1993‐). The first diachronic question that we want to settle is whether the focusing uses of these adjectives emerged as a subtype of the degree modifier use or of the secondary determiner use. We will answer this question by charting the relative proportions of these three uses throughout the main periods of English and by investigating the bridging contexts (Wilkins & Evans 2000) in which one of the available readings is a focusing reading. Our second diachronic question pertains to the pragmatic‐semantic development of the various focusing uses: exclusive, inclusive, particularizing; wide vs. narrow scope; scalar vs. non‐scalar (König 1989, Nevalainen 1991, 1994, Eckardt forthc.). Despite the original association of pure, sheer and mere with exclusive meaning and of very with inclusive meaning, they all developed focusing uses unpredicted by their lexical meaning. Based on close analysis of all the relevant contextualized examples, we will trace paths of change, drawing both on the more general meaning shifts established in pragmatic theory and on the gradual extension of collocates of the adjectives in their focusing use. Our data‐based reconstruction of these collocational histories will allow us to assess the importance in “emergent grammar” of collocational persistence and extension, with the language community’s awareness of “prior text” as an important source of grammaticalization (Hopper 1998). This extensive qualitative and quantitative study of corpus data will allow us to develop an historically‐informed description of the neglected prenominal focuser function of adjectives. We will situate the focuser function in relation to subjective and intersubjective meaning and scalarity in the whole English NP.

Brett, David and Antonio Pinna Panel: 9. Usos específicos de la Lingüística de Corpus LEXICAL BUNDLES IN US PRESIDENTIAL SPEECHES: A CORPUS‐DRIVEN STUDY OF B. CLINTON'S, G.W. BUSH'S AND B. OBAMA'S ADDRESSES In this paper we investigate patterns of variability in lexical bundles in a corpus of US presidential addresses and compare our findings with those reported in the literature concerning other fields of discourse. In our study we adopted Biber’s (2009) methodological approach which he used to
investigate variability within multi‐word units using two corpora: a 4.5‐million‐word corpus of American English conversation; and a 5.3‐million‐word corpus of academic prose. Initially, the corpora were searched for 4‐grams, discarding sequences with a frequency of less than 10 occurrences per million words. Each corpus was then searched for a series of sequences composed of three of the components of each 4‐gram, allowing variability in the fourth slot, e.g. *234, 1*34 etc. If the token in a given slot in each 4‐gram composed less than 50% of the results for that slot, the slot was deemed to be variable, as opposed to fixed, and marked with an asterisk. This procedure permitted the identification of typical patterns of variability in the formulaic sequences across the two corpora. For example, internal variability in one slot (1*34/12*4) was seen to be relatively common in Academic Prose, whereas initial and final variability (*23*) was more frequent in the conversation data. The corpus which we have used for this study is composed of US presidential addresses and remarks delivered by B. Clinton (1993‐2000), G.W. Bush (2001‐2008) and B. Obama (2009‐2010). As a macro‐genre Presidential speeches are monologic texts characterized by being usually prepared to be recited in public. They could therefore be expected to contain features of both written and oral language, possibly tending towards the oral end of the cline. This led us to speculate that our data would fit this picture by showing patterns of variability which positioned Presidential speeches as more or less evenly straddling the oral‐written divide as defined by Biber’s (2009) findings. 
Broadly speaking, the presidential data patterns display greater similarity to those of conversation, rather than academic writing: internal variation (12*4/1*34 and 1*3*/*2*4), which is characteristic of academic writing, is infrequent in both; conversely, variation in the external slots (123*/*234) is common in both (particularly so in the former), while being considerably less frequent in academic prose. However, a marked difference may be noted in the proportions of wholly invariable patterns (1234). In Biber's conversation and academic prose data, these represent merely 7% and 8.5% of the total patterns, respectively. On the other hand, this pattern constitutes no less than c. 21% of the total in our presidential data. Further analysis reveals considerable variation among presidents: Bush's use of such patterns is remarkably high in comparison to his immediate predecessor and successor. On the whole, we may conclude by observing that the presidential address data displays far higher levels of formulaicity than the reference genres, as almost 55% of the patterns are of three types: 1234, 123* and *234.
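The slot-variability test described above can be sketched in a few lines of Python. This is a hypothetical re-implementation of the procedure as summarised in this abstract (hold three slots of a 4-gram fixed, free the fourth, and mark the slot variable if no single token fills it in at least 50% of the matches); the 50% threshold and pattern notation follow Biber (2009), while the toy 4-gram data are invented.

```python
from collections import Counter

def variability_pattern(target, fourgrams):
    """Return e.g. '123*' for a 4-gram whose final slot is variable."""
    pattern = ""
    for i in range(4):
        frame = [w for j, w in enumerate(target) if j != i]
        # tokens observed in slot i when the other three slots match
        fillers = Counter(g[i] for g in fourgrams
                          if [w for j, w in enumerate(g) if j != i] == frame)
        total = sum(fillers.values())
        fixed = fillers.most_common(1)[0][1] / total >= 0.5
        pattern += str(i + 1) if fixed else "*"
    return pattern

grams = [("i", "don't", "know", "what"), ("i", "don't", "know", "how"),
         ("i", "don't", "know", "if"), ("i", "don't", "know", "why")]
print(variability_pattern(("i", "don't", "know", "what"), grams))  # -> 123*
```

In a real study the 4-gram inventory would first be filtered by the frequency cut-off (10 occurrences per million words) before patterns are computed.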

Brown, David and Laura Aull Panel: 2. Discurso, análisis literario y corpus “TOUGH GUYS” AND “CATFIGHT CRAZY”: A CORPUS‐BASED ANALYSIS OF GENDER REPRESENTATIONS IN SPORTS REPORTAGE This study uses a corpus‐based approach to investigate the discursive representations of athletes and their connection to ideologies of gender. To carry out this investigation, we have compiled two specialized corpora: one containing press accounts covering a fight that took place between the Detroit Shock and the Los Angeles Sparks of the Women’s National Basketball Association (WNBA) and the other containing press accounts covering a fight between the Detroit Pistons and the Indiana Pacers of the National Basketball Association (NBA). In our analysis, we find that the narratives in the NBA corpus are constructed around the allocation of blame, often focusing on the role of a particular player, Ron Artest, and the behavior of fans. In contrast, the narratives in the WNBA corpus are often constructed around the fight’s effect on the league, in particular whether the fight will bring positive or negative attention. In addition, the WNBA corpus contains a large number of gender‐marked tokens (e.g., female, men, girls, boys, daughters, femininity), indicating that the reportage often generalizes the specifics of the WNBA fight to construct broader representations of gender and gender norms. The results of the study derive from the analysis of keywords, token frequencies, and collocations, as well as from comparisons of the linguistic features of our corpora with those of sports reportage more generally, as evidenced in the Corpus of Contemporary American English. The purpose of our investigation is twofold. First, we want to interrogate the intersections of gender, sport, and language, in order to illustrate not only how sport can be a productive site for exploring issues related to language and ideology, but also that it is importantly
implicated in social constructions of gender. Second, we want to contribute to the growing body of research using corpora both large (e.g., Rayson, Leech, and Hodges 1997; Schmid and Fauth 2003) and specialized (e.g., Motschenbacher 2009) to show, in Baker’s (2008: 74) words, “the untapped potential” of corpus linguistics in the study of language and gender.

Camiña, Gonzalo Panel: 3. Estudios gramaticales basados en córpora NEW NOUNS IN THE SCIENTIFIC REGISTER OF LATE MODERN ENGLISH: A CORPUS‐BASED APPROACH. This paper examines word‐formation processes in the scientific register of English in the eighteenth century. Using a corpus‐based methodology, the Coruña Corpus Tool and other data‐processing software, it aims to provide relative frequency patterns to illustrate the most productive processes to
coin new nouns in the fields of astronomy and philosophy in the Late Modern English period. To achieve this, we have analysed over 400,000 lexical items corresponding to two sub‐corpora contained in the Coruña Corpus of English Scientific Writing, i.e. the Corpus of English Texts on Astronomy (CETA) and the Corpus of English Philosophical Texts (CEPhiT). By means of quantifiable data, we intend to measure the productivity of the different units and processes involved in the coining of nouns. In addition, we will offer two different approaches to the linguistic material in the corpus: on the one hand, diachronic evaluations of the entire corpus that may define the features of the scientific register in general; on the other hand, a synchronic comparison of the two disciplines that may identify unique morphological characteristics inherent to each of them.

Cantos, Pascual, Aquilino Sánchez, Raquel Criado and Moisés Almela Panel: 2. Discurso, análisis literario y corpus COMPUTING READING DIFFICULTY IN ENGLISH LITERATURE (19TH AND 20TH CENTURIES): A CORPUS‐BASED STUDY Readability indices (Coleman & Liau, 1975) have been widely used in order to measure textual difficulty. They have proven to be consistent and reliable (Smith & Kincaid, 1970) and can be truly useful for the automatic classification of texts, especially within the language teaching discipline. Among other applications, they allow the difficulty level of a text to be determined in advance, without even reading it through. The Automated Readability Index (ARI, hereafter) was originally used to produce an approximate representation of the US grade level needed to comprehend a specific text. Its calculation is based on two ratios: word length (in characters) and sentence length (in words). In this research we shall enlarge its domain and apply the ARI, one of the most widely used readability indices, to English prose. The aim of this investigation is threefold: first, examining and determining the degree of reading difficulty (ARI) of the 19th‐ and 20th‐century novels specified below; second, by means of the data obtained, trying to classify and arrange them according to their degrees of reading difficulty, both
individually and chronologically; and third, correlating the data with the English language proficiency level of Spanish university students of Grado de Estudios Ingleses (compliant with the European Higher Education Area, active from the academic year 2009‐2010) and the Licenciatura de Filología Inglesa (the old Curricula Plan, to be phased out in 2012‐2013). Methodologically, we shall calculate the ARI indices of the text corpus consisting of 17 novels by renowned British writers in the 19th and 20th centuries. The authors and novels selected are: (a) from the 19th century, Charles Dickens (Oliver Twist, David Copperfield, A Tale of Two Cities, Great Expectations, Our Mutual Friend); Emily Brontë (Wuthering Heights); Charlotte Brontë (Jane Eyre); George Eliot (Middlemarch); William Makepeace Thackeray (Vanity Fair), and Thomas Hardy (Far from the Madding Crowd); (b) from the 20th century, Joseph Conrad (Heart of Darkness); David Herbert Richards Lawrence (Sons and Lovers); Virginia Woolf (To the Lighthouse); Aldous Huxley (Brave New World); Graham Greene (The Heart of the Matter); George Orwell (1984) and William Golding (Lord of the Flies). Next, we shall arrange the resulting data in a hierarchical way, by means of a cluster analysis, in order to establish the similarities/divergences encountered among the authors/novels/centuries. Finally, we shall correlate the data with the proficiency level of English of our Spanish university students of Grado de Estudios Ingleses and Licenciatura de Filología Inglesa. We are confident that the ARI indices, the clustering of the authors/novels and the resulting correlation might highlight in some way whether the proficiency level of English of our students is up to the degree of difficulty of the English novels recommended in the curricula at our universities.
The practical results can be taken as a reference for deciding on the ordering and grading of the literary texts studied throughout the Grado de Estudios Ingleses degree.

Carmo, Felix Panel: 9. Usos específicos de la Lingüística de Corpus WHAT DO COMPRESSION ALGORITHMS TELL US ABOUT LANGUAGE? In recent years, there have been many studies in the domain of machine learning regarding the application of compression algorithms to detecting patterns in text and languages. These studies have shown that unsupervised experiments with different models of data compression can identify regularities which often elude a linguistic analysis. We will present some of these studies, such as the one by Cilibrasi and Vitanyi (2004), in which this method was used in conjunction with clustering techniques to discriminate and group languages by language family, literary works by author, and literary translations by translator. However, these studies raise many questions about what enables a technology with no linguistic knowledge, such as data compression, to identify distinguishing features in complex computer objects like natural language texts. Mahoney (2010) claims that text compression is a hard Artificial Intelligence problem, due to the difficulty in reaching an adequate language model, and then coding it efficiently. Some of the questions we pose relate to the capacity of these algorithms to distinguish between a string of characters and a meaningfully organised phrase of words. We also question which mathematical parameters improve an algorithm’s efficiency in detecting text regularities. Ultimately, these questions seek to understand what these algorithms show us about language. We will include some of our own research with a parallel corpus, which shows that, even in small‐scale research, compression algorithms are efficient tools for finding textual relations that we would not expect from a mathematical analysis tool. In our experiment, compression algorithms highlight fundamental differences between English and Portuguese translations.
There is, however, a lot of work to be done to identify which text features lead the algorithm to detect these differences. This is an ongoing project, and a few new stages of work may be added to the presentation.
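The core measure behind the Cilibrasi and Vitanyi (2004) clustering experiments is the Normalised Compression Distance, NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C is any real-world compressor. A minimal sketch using zlib as the compressor (the two toy "parallel" strings are invented for illustration, not taken from the authors' corpus):

```python
import zlib

def C(data: bytes) -> int:
    """Compressed size in bytes, here using zlib at maximum level."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalised Compression Distance of Cilibrasi & Vitanyi (2004)."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

en = b"the cat sat on the mat and the dog lay by the door " * 20
pt = b"o gato sentou no tapete e o cao deitou junto a porta " * 20

# Texts that share regularities compress well together, so their NCD is lower:
print(ncd(en, en) < ncd(en, pt))  # -> True
```

Pairwise NCD values over a set of texts yield a distance matrix that standard clustering techniques can then group by language, author, or translator, as in the studies cited above.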

THE USE OF CORPUS ANALYSIS TO MANAGE FOREIGN LANGUAGE ACQUISITION IN A BILINGUAL COMMUNITY Worldwide communication is possible nowadays using English as an international language or lingua franca. English is used in countries with different cultural backgrounds, a fact which affects the use of pragmatic strategies. On occasions, authors who communicate in a foreign language cannot avoid the use of structures that are more common in their mother tongue (L1). In a monolingual community, language errors could be caused by L1 interference; nevertheless, the methodology applied in error analysis and in corpus compilation could vary in a bilingual community. The linguistic status of three languages in contact may not be equal; consequently, ideological, linguistic and social factors could influence language acquisition. The main objective of this paper was to find out if the general methodology used for corpora classification is adequate for a corpus of learners with different linguistic backgrounds. Furthermore, we analysed whether the increasing importance of English as a lingua franca leads students to consider local or national languages less important when developing professional skills. In this article, we used corpus analysis methodology to determine whether learners whose mother tongues were Spanish and Catalan produced different errors when learning English. Foreign language acquisition is a universal concept, although we consider that the proficiency of some skills could depend on the mother tongue of the learner. In order to analyse the corpora, which included the errors of English texts written by students whose mother tongue was Catalan or Spanish, we conducted an experimental study that included the categories of communicative, grammatical and lexical errors. The results showed that students with different cultural backgrounds produced a dissimilar number of communicative and lexical errors, while both groups produced a similar number of grammatical errors.
As a consequence of this research, we concluded that the methodology used to detect errors should vary depending on the linguistic background of learners.

Casas Pedrosa, Antonio Vicente Panel: 3. Estudios gramaticales basados en córpora MAIN FEATURES OF ENGLISH PREDICATIVE PREPOSITIONAL PHRASES IN ICE‐GB This paper aims to identify the main characteristics of those English prepositional phrases which perform the function of subject complement in the British component of ICE. Such is the case of “She first fell in love with Will when she was eighteen, and she adores him still” (ICE‐GB:W2F‐019 #47:1). After introducing the notions of prepositional phrase and subject complement, these structures will be described from the morphological, syntactic, semantic, lexical, and socio‐pragmatic points of view and examples will be provided. Although in terms of frequency this is not the syntactic function prepositional phrases most often perform, they are taken into account because of their complexity and due to the lack of detailed analyses. In most cases they are described as isolated examples and this phenomenon is not considered to be a very productive one. Morphologically speaking, prepositional phrases can be defined as those phrases headed by a preposition which requires another unit following it and acting as its complement. Even though there is a wide range of units that can perform the function of complement of a preposition, attention will only be paid to noun phrases. They can be very simple (consisting of a single noun, as in “on fire”) or more complex (for instance, “in the pink of health”). From the syntactic point of view, prepositional phrases usually perform the functions of adverbial, postmodifier of noun phrases and complement of adjective and prepositional phrases. Nevertheless, they can also behave as subject and object complements: “That is of no importance” (Quirk et alii, 1985: 732) and “I don’t consider myself at risk” (op.cit.: 733).
As far as semantics is concerned, when acting as subject and object complements, prepositional phrases convey meanings which are similar to those of adjectives, since they express qualities or characteristics. Thus “on cloud nine” and “in the doldrums” can be replaced by “very happy” and “depressed”, respectively. Lexically speaking, some of the examples under analysis are idiomatic, their meaning being metaphorical. Such is the case of “(be) on tenterhooks”, which is defined in OALD6 (1340) as “(to be) very anxious or excited while you are waiting to find out sth or see what will happen”. More information is provided as regards its origin: “From tenterhook, a hook which in the past was used to keep material stretched on a drying frame
during manufacture”. As far as socio‐pragmatics is concerned, sometimes these structures are selected because they allow speakers to express the same meaning in fewer words. This is the case of “in hand”, defined as “receiving attention and being dealt with” (OALD5: 537). Moreover, many of these structures are labelled as “colloquial”, “informal”, “old‐fashioned”, or “slang” in dictionaries. In some cases they can even convey two different meanings, one being neutral and the other informal; the phrase “on the job”, for instance, is defined in OALD6 (697) as “while doing a particular job” and “(BrE, slang) having sex”.

Cheng, Su‐han and Jeng‐yih Hsu Panel: 8. Los córpora y la adquisición y enseñanza del lenguaje A CORPUS‐BASED STUDY OF THE VOCABULARY USE IN AN ENGLISH NEWSPAPER In an attempt to create a journalistic English word list (JEWL), this study examines the most frequently occurring words in a 20 million‐word journalistic English corpus (JEC) collected from an English newspaper published in Taiwan between 2002 and 2009. Adopting a commercial concordance software package, ConcGram 1.0, this study is able to report its findings on the statistically frequent words, collocations, and four‐word lexical bundles. Altogether, 411 word families, which account for 4.66% of the total running words in the entire journalistic English corpus, the 100 most frequent collocations of 7 types (i.e., verb‐noun, adjective‐noun, noun‐verb, noun1 of noun2, adverb‐adjective, verb‐adverb, and noun‐noun), and the 100 most frequent four‐word lexical bundles are recorded in this study. This journalistic English word list (JEWL), containing perhaps the most important single‐word items, the top 100 collocations, and the most commonly seen four‐word bundles, may serve as a guide not only for instructors in designing textbooks and courses for journalistic English but also for learners in setting
their goals for vocabulary learning and improving their comprehension of media English.

Among the challenges facing researchers who use English as an Academic Language (EAL) is finding out how to publish in the high‐impact journals edited by the predominantly English‐language publishing industry. For many EAL researchers in Spain, the problem is compounded by the discourse community’s standards, especially since different fields and different journals seem to have different standards regarding ‘linguistically‐acceptable’ manuscripts. Recently, the terms of acceptability have become much more demanding, as editors expect not only grammatical or semantic correctness, but also the elimination of any ‘non‐native‐like’ stylistic patterns which hinder comprehension. For instance, native Spanish and Catalan speakers tend to construct overly complex sentences in English; hence, their manuscripts are often criticized and even rejected because of their excessively wordy phrasing or exceedingly awkward expressions. If EAL researchers were provided with specific strategies to minimize wordiness and avoid awkwardness, they might be able to enhance the readability of their manuscripts and increase the probability of success in the publication process. Given our interest in analyzing these complex areas of EAL production, we compiled a unique corpus of scientific manuscripts, written directly in English by UPV researchers and faculty, thoroughly revised by one of the present authors, and eventually published as peer‐reviewed articles in English‐language journals in their fields of study (e.g. thermodynamics, civil engineering, agricultural machinery, economics, biotechnology, crop production and food sciences). The initial corpus was created with 20 original manuscripts that included all the modifications written in by the linguistic consultant (author 3) together with the 20 published articles, which had been modified at the discretion of the researchers‐authors.
Each set of papers (manuscript draft(s) + published article) contained in the corpus was manually scrutinized by the linguistic analysts (authors 1 and 2), who assessed the differences between the original manuscripts and those accepted for publication. The initial analyses revealed a high frequency of reduction‐type modifications, that is, many of the native consultant’s suggestions targeted unnecessary, redundant and overly‐complex phrases. Therefore, it seemed of interest to systematically identify the instances in the corpus and to classify what we call ‘nip & tuck’ procedures. These procedures aimed to effectively reduce (nip) the wordiness and rephrase (tuck) the awkwardness in the EAL production of these researchers‐authors. In this paper, we shall first examine the unique features of this specific corpus and highlight the findings of the research conducted so far. Then, we will describe the corpus‐based qualitative typology, developed from instances of wordy and awkward EAL writing patterns. Finally, we will conclude with suggestions as to how this typology may help Spanish researchers to improve their writing and broaden our understanding of the more complex processes involved in EAL production of scientific discourse.

Cruz‐García, Laura and Heather Adams Panel: 5. Corpus, estudios contrastivos y traducción ADDRESSING THE POTENTIAL CUSTOMER IN FINANCIAL ADVERTS: A CONTRASTIVE ANALYSIS IN ENGLISH AND SPANISH The aim of this study is twofold: (1) to identify and describe the linguistic resources that copywriters use in ads for financial products in order to establish the relationship between the addresser and the addressees in two different cultures (British and Spanish), and (2) to contrast the findings in each language and culture to find out to what extent this relationship differs from one language to another. To this end, we have analysed a corpus of 60 ads for financial products, made up of two sub‐corpora (30 from the British and 30 from the Spanish mainstream press published in the first half of 2004) from both linguistic and pragmatic perspectives. The linguistic analyses carried out cover the most representative lexical, semantic, syntactic, graphic and phonic elements used to convey the advertising message, while the pragmatic analysis pays particular attention to the legal constraints pertaining to this product sector, as well as the role of consumer expectations, thus setting our linguistic analysis firmly within the social and cultural framework that gave rise to the production of these texts. Our analyses are carried out from the perspective of the translator’s need to have a thorough knowledge of both the linguistic features and extra‐linguistic factors that govern the production of a given type of text in a given cultural and communicative situation. Our intention is to explore and describe the differences that emerge from a detailed analysis of a representative sub‐corpus in English and another in Spanish, each firmly embedded in its source culture. In order to determine the relationship existing between addresser
and addressee, we have looked at the register used in the texts, paying special attention to lexical and semantic elements such as the use of informal language, puns and figurative language, on the one hand; and, on the other, to morphosyntactic elements such as the personal pronouns and verb forms used by the addressers to refer to themselves and to the addressees. Our conclusions will be of interest not only to translators working in advertising but also to trainee translators (and their trainers), as pragmatic factors shape the forms of address used.

Cuenca, Maria Josep and Josep Ribera Panel: 5. Corpus, estudios contrastivos y traducción DEICTIC NEUTRALIZATION AND OVERMARKING IN TRANSLATING FICTION (ENGLISH‐CATALAN) Demonstratives, as space deictic elements, are analyzed in situational terms, that is, as linguistic items that point to elements of the situational ground of utterance with regard to the deictic origin. However, corpus analysis shows several puzzling facts from a traditional point of view: (i) non‐situational uses outnumber the cases in which demonstratives indicate proximity or distance with respect to the addressor, (ii) non‐situational demonstratives are frequently neutralized in translation (i.e., they are translated by a non‐deictic unit or deleted), and (iii) new demonstratives show up in the target text (that is what we call deictic overmarking). This research is based on a corpus of fiction in English and the translation of the texts into Catalan. The English demonstratives this/these and that/those and their Catalan counterparts have been analyzed and the general strategies activated in translation have been identified, namely: a) maintenance, b) shift, c) neutralization, and d) overmarking. In this presentation, neutralization and overmarking will be dealt with in detail. Our analysis shows that non‐situational demonstratives are much more frequent in our corpus (400 cases, 83.5%) than situational ones and that neutralization is the most frequent strategy when translating them (177 cases, 44.3%). Non‐situational deictics are frequently neutralized because they alternate with other phoric processes, such as ellipsis or 3rd person pronouns. In fact, Catalan shows a tendency to avoid deictic marking in syntactic contexts where the demonstrative could be interpreted as too focal or somehow emphatic.
The strategy, which is mainly syntactically conditioned—neutralization is favoured when the demonstrative is in subject position or can be pronominalized by a clitic in the target language—, implies a loss of deictic force and sometimes also the empathetic nuance that the deictic adds, affecting the implication of the character or the narrator in the narration. On the other hand, overmarking is also very frequent, since many non‐deictic English units are translated into Catalan by means of demonstratives (232 cases out of 519 demonstratives in Catalan, 44.7%). This translation strategy introduces in the target text subjective and intersubjective values not expressed in the source text. In conclusion, neutralization and overmarking are very frequent in translating fiction and have an effect on the target text by underspecifying or introducing, respectively, subjective and intersubjective values in the narration. The changes in the deictic perspective of the source text introduced by these strategies are not due to the systemic differences of the languages involved in the process of translation, but to syntactic and pragmatic factors leading to the underspecification or the introduction of the addressor’s subjectivity in the target text.

Culy, Chris, Verena Lyding and Henrik Dittmann Panel: 6. Corpus y variación lingüística STRUCTURED PARALLEL COORDINATES: A VISUALIZATION FOR ANALYZING STRUCTURED LANGUAGE DATA We present a visualization tool called Structured Parallel Coordinates (SPC), a specialization of Parallel Coordinates (cf., e.g., Inselberg, 2009), customized for the presentation and analysis of different types of structured language data, as found in corpora. We introduce three applications of the tool. They show SPC alone and as part of a broader process of data exploration, connected in particular with corpus queries. We provide detailed descriptions of the SPC visualizations and their interactive functionalities,
demonstrate how they can be employed in different linguistic analysis tasks, and explain the motivation behind design decisions taken to respond to characteristics of linguistic data. Parallel Coordinates are a way of representing multidimensional data using a two-dimensional display. Each dimension is represented along a vertical axis, and the values for a piece of data are connected by a line (see Figure 1). Interactive versions of Parallel Coordinates are flexible tools for data analysis, since selecting points and lines in the Parallel Coordinates display is the same as filtering the data (Inselberg, 2009). Parallel Coordinates are typically used with data dimensions that are conceptually independent, such as car size, year of manufacture, and mileage (cf. Frank and Asuncion 2010 for a standard test data set). However, language datasets often have dimensions which are interrelated or which have internal structure. One fundamental type of structure is the sequential order of linguistic units like words, phrases, or paragraphs. Another type of structure comes from meta-information associated with corpus texts, e.g. dates, where the data for each point in time can be treated as a dimension, and these dimensions are ordered (chronologically) with respect to each other. Rank orderings of (co-)occurrences of linguistic units provide an example of dimensions that have an internal structure: the ranks. SPC is designed specifically to deal with the special nature of structured language data such as these (cf. Collins et al. 2009 for another take on Parallel Coordinates for textual data). We present three applications of Structured Parallel Coordinates: (1) KWIC results as SPC, (2) n-grams and frequencies, and (3) ranking comparisons. Figure 1 shows an SPC display of the rank ordering by frequency of the top 20 (German) words starting with [Ss]elbst "self-", counted by lemma, in five years of newspaper text between 1991 and 2006.
The words which do not appear in all years are grayed out, and the word Selbstbestimmung “self‐determination” has been selected and highlighted with a thick line. The relative frequencies within years are indicated by green bars. SPC is a JavaScript tool that can easily be used with new kinds of data. For example, colleagues are using SPC to analyze learner texts. SPC and the applications are freely available under an Open Source license. SPC is an innovative tool for corpus analysis, which illustrates opportunities that are created when visualization techniques are adapted to the special needs of language information.
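The rank-ordering data underlying SPC's third application can be sketched in a few lines of Python. The function name and the toy data below are illustrative assumptions, not part of the SPC tool itself:

```python
from collections import Counter

def rank_by_year(tokens_by_year, prefix="selbst", top=5):
    """For each year, rank the lemmas starting with `prefix` by frequency.

    Returns {year: [(rank, lemma, freq), ...]} in descending frequency --
    one ordered dimension per year, the shape an SPC-style display plots.
    """
    rankings = {}
    for year, tokens in sorted(tokens_by_year.items()):
        counts = Counter(t for t in tokens if t.lower().startswith(prefix))
        rankings[year] = [(rank, lemma, freq)
                          for rank, (lemma, freq)
                          in enumerate(counts.most_common(top), 1)]
    return rankings

# Hypothetical toy data: lemmatized tokens per year
data = {
    1991: ["selbstbestimmung", "selbstmord", "selbstbestimmung", "andere"],
    2006: ["selbstmord", "selbstmord", "selbstbestimmung"],
}
print(rank_by_year(data))
```

A word absent from one year's ranking simply has no entry for that dimension, which is what the grayed-out lines in the display reflect.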

De Vos, Lien Panel: 3. Estudios gramaticales basados en córpora THE USE OF GENDER-MARKED PRONOUNS IN DUTCH: GRAMMATICAL VERSUS CONCEPTUAL GENDER. The Dutch pronominal gender system provides a unique source for the investigation of variation and change, since it appears that the system is changing at a different pace in the two varieties of Dutch. The northern Dutch variety, as spoken in the Netherlands, nowadays has a so-called semantic gender system: the choice of a particular pronoun depends on the conceptual properties of the referent, and no longer on the grammatical gender of the antecedent it refers to. The most crucial parameter in the process seems to be 'individuation': highly individuated nouns, such as count nouns referring to concrete entities, trigger the use of traditionally masculine pronouns, whereas lowly individuated nouns, such as abstract mass nouns, trigger the use of the traditionally neuter pronoun (Audring 2009). However, the Dutch gender system originally was grammatical, and gender-marked pronouns were strictly related to the grammatical gender of the antecedent noun. The southern Dutch variety, as spoken in Belgium, was believed to have retained this system, in which pronouns agree in gender with their antecedent noun, which can be masculine, feminine or neuter. Recent studies have rendered this belief invalid, by illustrating that even adolescents do not yet reach an adult-like proficiency in the grammatical gender system, and that the influence of grammatical gender on pronominal reference gradually decreases from generation to generation (De Vos 2009, De Vogelaer & De Sutter to appear). Clear semantic patterns are observed, which may indicate the erosion of the original, grammatical system and the emergence of a new, conceptually-based gender system. All of these previous investigations of southern Dutch have gathered their data in a similar way: by means of questionnaires consisting of completion tasks.
However, this excludes the possible influence of discourse factors on pronominal reference and it narrows down the view on semantic factors, since only a small number of words is under investigation. In this paper, these previous studies will be compared to a corpus-based investigation of the development of gender-marked pronouns in southern Dutch. The data are gathered from the Corpus Gesproken Nederlands 'Spoken Dutch Corpus', a nine-million-word corpus of contemporary spoken Dutch. The results of this paper will not only confirm the presence of semantic factors influencing the use of gender-marked pronouns, but will also supplement the existing data with a broader view of pronoun usage in spoken language. From these results it will follow that the choice between grammatical and conceptual (semantic) gender depends on much more than semantic factors alone, such as the discourse setting and linguistic context. The aim is to adjust and complement the current theories on this development of gender-marked pronouns in Dutch and to establish a framework that can be used for further research, which includes addressing some methodological issues.

Díez Bedmar, María Belén Panel: 8. Los córpora y la adquisición y enseñanza del lenguaje SPANISH STUDENTS' MAIN PROBLEMS WHEN WRITING THE ENGLISH EXAM IN THE UNIVERSITY ENTRANCE EXAMINATION: A LEARNER CORPUS-BASED ANALYSIS The research conducted on the English Exam in the University Entrance Examination in Spain has been divided into three main blocks (García Laborda, 2006): i) the validity of its design; ii) its construct validity, inter- and intra-rater reliability, the raters' scorings, etc.; and iii) the need to improve the exam. However, there have also been studies which have analysed the students' written production when taking this exam, as reflected in various (computer) learner corpora. In an edited book (Iglesias Rábade, 1999a), five papers presented the students' spelling errors (Doval Suárez, 1999), their morpho-syntactic errors (Crespo García, 1999), lexical errors (González Álvarez, 1999), problems in closed word classes (Woodward Smith, 1999), and in their textual organization (Iglesias Rábade, 1999b). Similarly, two PhD dissertations also focused on the students' errors when writing this exam in the foreign language by means of an Interlanguage Analysis (IA) or a Computer-aided Error Analysis (CEA). Thus, Wood Wood (2002) concentrated on the students' article use, and Rodríguez Aguado (2004) scrutinized their morphological and syntactic errors, as well as those problems related to orthography and vocabulary use. Despite the importance of these studies for identifying the main problems which pre-university students show when writing in the foreign language, two main limitations can be found in these seven studies. First, each of them focused on a limited number of aspects of the foreign language, which results in an incomplete description of the students' written performance at this stage. Second, different methodologies were employed, e.g. various error taxonomies, which prevents the direct comparison of results.
In order to overcome these two limitations, Díez-Bedmar (2010) analysed a representative sample of the compositions written on the same topic for the English Exam in the University Entrance Examination in Jaén in June 2008 by means of a CEA with the UCL Error Editor (Hutchinson, 1996), and the widely-used Error Tagging Manual, version 1.1 (Dagneaux, Denness, Granger and Meunier, 1996). This paper is divided into two main parts. The first one presents the findings obtained in Díez-Bedmar (2010), which allow an updated description of the students' profile at this stage of their foreign language acquisition process. The use of a widely-used error taxonomy also enables the comparison of results with those provided in the extensive research which has also employed the Error Tagging Manual in the Spanish and international contexts. In the second part of the paper, a comparison is made between the findings in Díez-Bedmar (2010) and those presented in the above-mentioned publications, so that it is possible to point to interesting tendencies regarding the common errors made by secondary-school leavers. The information offered in this paper may prove a useful starting point for catering for the students' empirically attested needs at this stage, by means of teaching materials at the end of secondary school education, or the design of appropriate courses when entering the European Higher Education Area (EHEA) in Spain.
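The kind of cross-corpus comparison a shared error taxonomy affords can be illustrated with a short sketch. The tag format and the tag codes below are hypothetical simplifications for illustration only, not the actual mark-up of the Error Tagging Manual:

```python
import re
from collections import Counter

# Hypothetical mark-up: each error is annotated as (TAG)word, e.g.
# "(GA)a information" for an article error. This is an invented
# simplification, not the Louvain Error Tagging Manual's format.
def error_profile(tagged_text, corpus_words):
    """Count error tags and normalize per 1,000 words, so that
    profiles from corpora of different sizes are comparable."""
    tags = Counter(re.findall(r"\(([A-Z]+)\)", tagged_text))
    return {t: round(1000 * n / corpus_words, 2) for t, n in tags.items()}

sample = "I have (GA)a information and (LS)very problems ."
print(error_profile(sample, corpus_words=200))
```

Normalized frequencies of this kind are what make it possible to set one learner corpus's error profile beside another's, as the paper does with earlier studies.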

Ekaterina Tarpomanova, Svetlozara Leseva, Svetla Koeva, Borislav Rizov, Hristina Kukova, Tsvetana Dimitrova and Maria Todorova Panel: 1. Diseño, compilación y tipos de córpora DESIGN AND DEVELOPMENT OF THE BULGARIAN SENSE-ANNOTATED CORPUS The paper describes the methodology, compilation, annotation and applications of the Bulgarian Sense-Annotated Corpus (BulSemCor), a manually annotated corpus of over 100,000 words in which each language unit (LU) is assigned a sense according to the Bulgarian wordnet (BulNet). The input corpus is an excerpt from a general structured corpus of contemporary Bulgarian designed according to the Brown Corpus methodology. The input corpus consists of over 800 text units of 100+ words each, selected according to the density of highest-frequency open-class lemmas. The corpus is represented in
a flat XML format. The text is encoded as a list of XML 'word' tags whose attributes store relevant information such as form, lemma, selected sense and annotator. Another attribute encodes a parent ID that links the tokens identified as parts of a compound. The corpus annotation tool provides a number of functionalities, such as (i) input data editing, including insertion and deletion of tokens and identification of MWEs with contiguous or non-contiguous constituents; (ii) flexible text navigation strategies: forward and backward navigation according to a given criterion, such as all words, non-annotated words, all instances of the current sense or word, etc.; and (iii) a flexible search strategy allowing both exact-match search by wordform or lemma and regular-expression search. The tool interface features fully-fledged visualisation of the wordnet synsets for the available candidate senses for a selected LU through coupling with the system for wordnet development and exploration. The annotation tool is OS-independent, adaptable to annotation schemes for different language levels, and affords multiple-user concurrent access and dynamic real-time update of changes in the knowledge base. The annotation of BulSemCor involves the following steps. In the preprocessing stage automatic lemmatization is performed. Next, the LUs are mapped to the corresponding BulNet senses through their lemma. The semantic annotation proper consists in the selection of the correct sense from the available candidates. The annotated LU inherits all the information contained in the selected synset, thus receiving morpho-syntactic annotation (through the POS) besides the semantic one. One of BulSemCor's main features is the exhaustive annotation approach, which requires that each LU be annotated.
This has resulted in: (i) the enlargement of the Bulgarian wordnet with closed-class words and language-specific concepts; (ii) the reconsideration of a number of theoretical assumptions; and (iii) practical decisions regarding interlingual asymmetry. The main application of BulSemCor is to serve as a training corpus for WSD tasks. It has already been employed in two implementations. In the first, based on Hidden Markov Models, BulSemCor has been used for training and evaluation. A second, knowledge-based implementation, currently under development, uses it mainly for the purposes of evaluation. BulSemCor has a variety of applications in linguistic research, from lexicology and lexicography to semantics, grammar, stylistics, etc. An online demo of the corpus has been implemented and made publicly available. It affords search for words by wordform or lemma. The available senses are sorted according to frequency of occurrence and are supplied with a gloss and an example.
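The flat XML encoding described above might look roughly like the following sketch. The attribute names and the sample fragment are illustrative assumptions in the spirit of the description, not BulSemCor's actual schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one 'word' element per token, with form,
# lemma, selected sense and annotator stored as attributes; a 'parent'
# ID links tokens identified as parts of one compound (MWE).
SAMPLE = """
<text>
  <word id="1" form="otivam" lemma="otivam" sense="go:1" annotator="A1"/>
  <word id="2" form="na" lemma="na" sense="visit:1" annotator="A1" parent="m1"/>
  <word id="3" form="gosti" lemma="gost" sense="visit:1" annotator="A1" parent="m1"/>
</text>
"""

root = ET.fromstring(SAMPLE)
# Group the tokens of multiword expressions by their shared parent ID
mwes = {}
for w in root.iter("word"):
    pid = w.get("parent")
    if pid is not None:
        mwes.setdefault(pid, []).append(w.get("form"))
print(mwes)
```

Because every attribute sits on the token itself, a flat format like this can be filtered or searched with nothing more than a single pass over the 'word' elements.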

Faya Cerqueiro, Fatima Panel: 6. Corpus y variación lingüística REQUEST MARKERS IN DRAMA: DATA FROM THE CORPUS OF IRISH ENGLISH In the Late Modern English period we observe a change in the use of the main request markers: whereas pray was the most common courtesy marker in requests at the beginning of this period, it was eventually replaced by please, and the former marker disappeared entirely in the twentieth century. A preliminary study in ARCHER (A Representative Corpus of Historical English Registers) showed that these markers were found mainly in three types of texts, namely letters, fiction and drama. The analysis of those items in novels and letters has already yielded interesting results about the evolution of these markers and especially about the replacement of pray by please (cf. Faya Cerqueiro 2008 and 2009). Nevertheless, request markers have not yet been studied in drama texts. Therefore, an analysis of plays will help to complete the whole picture of the main request markers in the Late Modern English period and will allow text-type comparisons. For this purpose I will make use of the Corpus of Irish English, which collects Irish documents written in English from the early fourteenth century up to the twentieth century, allowing diachronic analyses. The different genres represented in this corpus comprise poetry, glossaries, sketches and full-length plays, although drama is the best represented genre in the corpus. The material compiled from the sixteenth to the eighteenth centuries includes not only "genuine representations of Irish English by native Irish writers" but also "texts by non-Irish writers where the non-native perception of the Irish English is found" (Hickey 2003: 242). As regards size, the drama selection of this corpus contains approximately 500,000 words, although the twentieth century provides almost half of them.
Drama is probably the most profitable fictional genre for the study of pragmatic issues, especially those regarded as typical of the spoken language. Even though it should be admitted that this genre contains an imitation of actual speech, it represents the spoken medium as closely as possible, and, if "used with the necessary caution, plays may also yield insights into what counted as polite or impolite behaviour and how, for instance, greetings, insults or compliments were realised at that time" (Jucker 1994: 535). Culpeper and Kytö (1999) classify drama as constructed dialogue with a minimum of narratorial intervention, since apart from stage directions, plays contain dialogue almost exclusively. There are important contributions to historical pragmatics using only drama, proving the relevance of this text-type in pragmatic analysis (cf. Brown and Gilman 1989).

Fragaki, Georgia Panel: 2. Discurso, análisis literario y corpus EVALUATIVE ADJECTIVES IN A CORPUS OF GREEK OPINION ARTICLES Existing attempts to describe evaluation in text treat adjectives as mere devices of evaluation. However, the reverse question has not been raised: which adjectives can function evaluatively in texts? The answer commonly given is descriptive adjectives (cf. Hewings 2004: 253), or adjectives having positive or negative meaning, relative or superlative degree, or gradability, that is, having the typical features of descriptive adjectives (cf. Hunston & Francis 2000: 188-189, Hunston & Sinclair 2000: 91). A systematic corpus-based study of adjectives can reveal a different picture: Fragaki (2010) claims for Greek that several adjective categories can assume an evaluative function, among them a special category of evaluative adjectives, whose exclusive function is evaluation. The aim of this paper is to contribute to the description of the category of evaluative adjectives, drawing on a corpus of opinion articles from the Corpus of Greek Texts (CGT), a reference corpus of Greek. The corpus of the study includes texts of 450,576 words from three Greek newspapers of different political orientation. It is suggested that, while descriptive adjectives are commonly used for the attribution of a good or a bad property to an object of evaluation, the category of evaluative adjectives is used for evaluation relating to modality, comment, intensification and importance. With respect to these functions, four groups of evaluative adjectives are distinguished: a) modal adjectives, b) comment adjectives, c) intensifying adjectives and d) adjectives of importance. The criteria used for this classification are both functional and semantic and are based on extensive corpus analysis of the data.
It is notable that two of these groups (modal adjectives and adjectives of importance) concur with Hunston's (1994) and Thompson & Hunston's (2000) parameters of evaluation. In addition, modal adjectives, as carriers of deontic or epistemic modality, as well as intensifying adjectives and adjectives of importance, as means of denoting the degree to which something happens or the importance with which something is viewed, contribute indirectly to the positive or negative evaluative frame of the text (cf. attitudinal frame, Bublitz 2002). Finally, comment adjectives are employed for making a (usually) negative comment on an object of evaluation and in this way offer direct evidence for the evaluative frame of the text.

Froehlich, Heather Panel: 1. Diseño, compilación y tipos de córpora ARE YOU A MAN?: ON SEEING GENDER IN SHAKESPEARE Through a literary-linguistic, discourse-oriented computational approach I will present a new way to find patterns of gender in Early Modern drama. Building on previous corpus stylistic studies (Culpeper 2001 and 2002, Hunston and Francis 2000, and Fischer-Starke 2010), I suggest that the use of gender-specific terms is not in proportion to the character population of a play. Using AlphaX and Excel, I assemble examples of both grammatical gender and natural gender within the context of a line of Shakespeare's plays. This study presents a comprehensive overview of grammatical (subject/object) and thematic roles through a comparative study of third-person personal pronouns and gender-specific nouns in Macbeth and The Merry Wives of Windsor through the building of a pilot database of each word within the context of a sentence. The relationship between grammatical and semantic roles is encoded and thus manifests itself in a literary representation of gender: the textual representation of gender is encoded by the language used. Macbeth is a play that is very concerned with masculinity, whereas The Merry Wives of Windsor focuses primarily on women. Gender identification in both plays in proportion to the gender representation of characters is less overt and more often encoded in the text itself: through the building of this database, I comment on the predictability of gender representation in relationship to the gender proportions of a cast. The implications of the proportional representation of a cast have been largely ignored in (feminist) stylistic studies of Shakespeare's texts, a field which chooses instead to focus on the overt patriarchal structures presented in Early Modern drama; my study begins to fill this void through a critique of Shakespeare's plays as (proto)feminist texts.

Fuster Márquez, Miguel and Begoña Clavel Arroitia Panel: 8. Los córpora y la adquisición y enseñanza del lenguaje ENGLISH LANGUAGE TEACHING AND LEARNING IN TERTIARY EDUCATION: CORPUS CHOICE AND IMPLEMENTATION The aim of this contribution is to propose a model to integrate corpus linguistics (CL) in the teaching of the English Language at university level. This research is still in progress, since we need to assess the results at the end of this academic year. The subjects of this study are our own students in the second year of the compulsory module (English Language IV) of the newly implemented degree of English Studies at the Universitat de València. It is precisely in the new university paradigm, in which students are required to learn to learn, develop skills and solve problems autonomously, that the deployment of corpus methodologies contributes to the enhancement of students' potential in such a direction. As Sinclair (2004) points out, students should be given the opportunity of consulting authentic language, and corpus-based methodologies may come to cater for that need. It remains true that after decades of CL, even those textbooks targeting advanced learners contain written and spoken language samples which are not authentic. Exclusive exposure to textbooks cannot be sufficient if we wish our students to grasp more fully how real language actually works. Our study focuses specifically on the development of writing, which fits in with the long tradition of corpus research devoted to productive written skills. It is our contention that if teachers are willing to embark on this type of experience there is no need to resort to large reference corpora, such as the BNC, the COCA or the Bank of English, although these are truly invaluable sources. A more modest proposal consists in compiling smaller corpora which can immediately be applied offline in the classroom through freeware tools such as AntConc.
Our proposal is structured around three corpora. The first corpus we have designed contains
updated articles of leading newspapers from the UK and the USA, which have been gathered by means of Lexis Nexis. This corpus can be used when what we have in mind is "general English". A second corpus contains recent academic articles published in leading journals, but exclusively in the field of the humanities. This corpus meets the demands of the curriculum in our degree, since our students' learning goals include the attainment of competence in academic English in the field of the Humanities. And the third one is a much more modest ad hoc learner corpus which contains our own students' production, with the hope of obtaining a much more accurate picture of their learning stage. The aim of this whole project is none other than to offer a coherent procedure to promote corpus exploitation, either indirectly by teachers through the design of corpus-based activities, or through hands-on corpus exploration by students. We believe that an inductive approach through corpus-driven awareness-raising activities is in conformity with the main guidelines being implemented in higher education pedagogy.
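The offline concordancing that such small corpora afford (which AntConc provides, along with word lists, clusters and much more) can be approximated in a few lines. The function below is an illustrative stand-in for classroom experimentation, not part of any of the tools mentioned:

```python
import re

def kwic(text, keyword, width=4):
    """Return keyword-in-context lines: `width` tokens on either side."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# Hypothetical mini-corpus; in practice this would be the compiled
# newspaper, academic or learner corpus read from disk.
corpus = "The committee said the proposal was sound. The proposal passed."
for line in kwic(corpus, "proposal"):
    print(line)
```

Even a sketch like this lets students sort and compare the immediate contexts of a word across the three corpora described above.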

García Varela, Ana Patricia Panel: 5. Corpus, estudios contrastivos y traducción 'WHEN POLICE ARRIVED AT THE SCENE' OR 'HAN VENIDO DOS POLICÍAS': ON THEME AND THEMATIC PROGRESSION IN NEWS REPORTS* In this paper I shall explore the interaction between Theme-Rheme choices across English and Spanish journalistic discourse in order to see how this interaction is instantiated in the two languages (Halliday & Hasan 1976; Halliday 1985; Francis 1989, 1990; Fries 1994; Gómez-González 1994, 2001; Taboada 1995; Halliday & Matthiessen 2004; Arús Hita 2010). In particular, two research questions will be addressed: 1) Which Theme-Rheme patterns characterize journalistic discourse in English and Spanish? 2) Which patterns of Thematic Progression are more recurrent in this genre across the two languages? The data will consist of news reports dealing with cases of domestic violence extracted from the online versions of four newspapers: The Guardian and The Times (English), on the one hand, and El País and El Mundo (Spanish), on the other. The results show that, despite the typological differences between English and Spanish, the thematic organization of news reports is, in general terms, rather similar in the two languages, although differences in the length of news reports as well as in the thematised elements are salient.

Garcia-Pastor, Maria Dolores Panel: 8. Los córpora y la adquisición y enseñanza del lenguaje LEARNERS' DISAGREEMENTS IN EFL: L2 PRAGMATICS AND THE USE OF A LEARNER CORPUS IN THE LANGUAGE CLASSROOM The instruction and learning of pragmatic issues in a second or foreign language (L2 pragmatics henceforth) has been granted increasing attention recently, as reflected in current European trends that search for innovation and development in second/foreign language (L2/FL) teaching and learning (García-Pastor, 2009, in press). Likewise, the use of corpora in English language teaching (ELT) has been encouraged in the past few years in an attempt to foster new advances in the field (cf. Bellés-Fortuño et al., 2010). This study aims to emphasize the importance of considering L2 pragmatics and the adoption of a corpus-based approach in the English as a Foreign Language (EFL) classroom by exploring the interlanguage features of learners' disagreements in EFL, and their perceptions of these communicative acts therein. Pragmatics here refers to the linguistic resources for conveying communicative acts and
relational or interpersonal meanings in a language, and the social perceptions underlying interlocutors' interpretation and performance of communicative action (Kasper & Rose, 2001). More specifically, this research attempts to gain insight into learners' L2 pragmatics in order to provide information that can help teachers raise their EFL students' awareness of pragmatic issues in the target language. To this end, disagreements in a corpus of 28 EFL face-to-face conversations of 30 minutes to 1 hour in duration each were analysed, and then used in the EFL classroom to examine learners' perceptions of these communicative acts in the target language and generate discussion. In general, a different use of mitigation devices in EFL disagreements was observed in contrast with English native speakers' production of these communicative acts (García, 1989; Kreutel, 2006). Learners therefore showed a lack of awareness of the linguistic resources commonly employed for voicing disagreement in the target language. As for their perceptions and discussion of EFL disagreements in the classroom, learners viewed these communicative acts in the target language as adequate and polite at a social level on the whole, which arguably reflects their L1 pragmatic assumptions about disagreement performance (cf. Cordella, 1996). However, they mostly perceived EFL disagreements as inadequate and impolite at an individual level, thereby evincing pragmatic assumptions typically associated with these instances of communicative action in L1 English (cf. Locher, 2004; Pearson, 1986; Pomerantz, 1984). These findings suggest that a closer look at learners' productions and perceptions of target language behaviour using learner corpora in the classroom can be useful to achieve a better understanding of our students' L2 pragmatics, and help them in their development of target language proficiency.

Goethals, Patrick Panel: 5. Corpus, estudios contrastivos y traducción DEMONSTRATIVE MODIFIERS AND DEFINITE ARTICLES IN TRANSLATION: A CONTRASTIVE PERSPECTIVE. In this paper I will develop a contrastive linguistic analysis of the alternation between the demonstrative modifier (este/ese/aquel problema) and the definite article (el problema) in Spanish and Dutch. The methodology is based on a bidirectional corpus of translated texts (Spanish-Dutch and Dutch-Spanish). Several studies have focused on the semantics of the demonstrative paradigm, in order to distinguish it from the definite article. These studies usually adopted a monolingual or a generic point of view. By contrast, very little is known about specific contrastive differences: do both categories relate to each other in a similar way in different languages? The data that come from the bidirectional corpus of translated texts suggest that Dutch and Spanish indeed differ significantly. Concretely, Dutch demonstratives appear to be more broadly used than their Spanish counterparts, and therefore quite often correspond to a definite article in the Spanish source or target text. In the corpus this becomes clear when translational shifts are considered. From a quantitative point of view, the following observations can be made: 1) in the Spanish-Dutch subcorpus, Spanish demonstrative modifiers are rarely translated by a Dutch definite article (23 examples, or 5.7% of the Spanish demonstrative modifiers). Far more frequently, a Dutch demonstrative was newly introduced to translate a Spanish definite article (110 examples, or 21.5% of the Dutch demonstratives).
2) in the Dutch-Spanish subcorpus, the same tendency is found: there are relatively few examples of Dutch definite articles being translated by a Spanish demonstrative modifier (16 examples, or 4% of the Spanish demonstratives), and a relatively high number of cases where a Dutch demonstrative modifier is translated by a Spanish definite article (81 examples, or 17% of the Dutch demonstratives). The fact that the same tendency is found in both subcorpora is important, since it suggests that these translational shifts are not to be seen as a translation universal, but instead as a consequence of a contrastive difference between the two languages. Although the main part of the paper will be dedicated to the methodological implications of the use of bidirectional translational corpora, and to the presentation of the quantitative results of the corpus study, I will also present a qualitative, semantic analysis of some recurrent shifts. In general, there seems to be some evidence that, compared to Spanish, the Dutch demonstrative can be more easily used with an identifying function, instead of the typical reclassifying function of demonstratives. In Spanish, this identifying function would rather be the exclusive domain of the definite article. This semantic analysis might account for shifts such as (1) or (2): (1) ES [entonces] no habría sanciones y los gringos pendejos no joderían con la soberanía (Fiesta de Dumas) NL [dan] zouden er geen sancties zijn en zouden die klotegringo's niet zitten te zeiken over soevereiniteit [= esos gringos pendejos] (2) ES - ¿Cómo será? Espero que no sea como las otras. (Medeplichtige) NL 'Hoe zou ze zijn? Ik hoop niet zoals die anderen.' [= esas otras]

Gómez, Angeles Panel: 5. Corpus, estudios contrastivos y traducción CORPUS STUDY BETWEEN THE ENGLISH GERUND AND ITS SPANISH COUNTERPARTS Previous contrastive studies of the English gerund and its Spanish counterparts present limitations in two specific areas. First, they do not include all the translation possibilities or counterparts (Alonso García 2003; Izquierdo 2006 and 2008; Losada Durán 1980; Piñeiro and García 2001). In fact, according to our corpus data, and in comparison with previous studies, it can be ascertained that the English gerund displays a greater variety of counterparts of a varied nature. Second, most of the previous studies do not include a cognitive characterization of the English gerund and its counterparts; our work includes a conceptual description of both. We argue that it is important to include a cognitive description because it enables us to establish a hierarchy between the English gerund and its counterparts based on their coincidences and differences from a cognitive point of view. In this sense, the use of a parallel corpus enables us to examine in greater depth the cognitive relationship between the English gerund and its Spanish counterparts. This confirms that a parallel corpus is a suitable tool for carrying out a contrastive analysis. Thanks to the corpus, we have carried out two different studies (of the English gerund and of its Spanish counterparts) which, in turn, complement each other, confirm part of our hypothesis and also provide interesting results in the field of translation. We have defined the English gerund according to its nominal profile, as an abstract entity, based on cognitive grammar and psycho-mechanical observations (Langacker 2008, and Duffley 2003 and 2006 respectively). The analysis of the counterparts highlights the validity of analysing the English gerund from a nominal profile.
In fact, from a conceptual point of view, the most frequent counterparts, the infinitive and the noun, share with the English gerund the interpretation as an abstract region. According to our corpus data, it can be ascertained that the majority of the most frequent counterparts can be predicted within the Spanish system and show a syntactic and semantic independence with respect to the English gerund. As the analysis progresses, we observe the frequency of less predictable translations, which bring the role of orthonymy into play. The concept of orthonymy designates the most habitual, natural and authentic way of expressing oneself in a language. In these cases, in general, it is corroborated that the Spanish translation "distances itself" from the linguistic system of the source language in favour of a more authentic rendering in the target language. First, we will provide the cognitive characterization of the English gerund. Next, we will present the counterparts in terms of their cognitive coincidences and differences in relation to the English gerund. Finally, we will provide a translation approach by which a particular Spanish counterpart can be explained.

Specifically, compelling evidence is adduced for the existence of a number of non-trivial analogies regarding (i) the core meaning of the constructions, (ii) the semantico-pragmatic profile of the Element in Focus (henceforth EIF), (iii) their newness orientation, (iv) their (positive/negative) interpersonal flavour and (v) their thematic and cohesive flexibility, inter alia, which enable us to treat the constructions in (1)-(4) as forming a family (or constellation) of constructions. First, it is argued that the core constructional meaning of focus constructions is to provide the identification by the speaker/writer of an entity (person or thing) (i.e. the EIF, identified at a particular stage in discourse as the Focus of Attention) that is connected with an open proposition, which may be equational (as in clefts) or characterizing (as in the other constructions). Furthermore, the constructions under scrutiny here appear to move along a cline (or, alternatively, a path) of referential > non-referential functions (see further Dasher 1995). Second, the EIF is more likely than not referential and specific, which means that subject expressions in idiom chunks or non-specific expressions are ruled out in the slot in these constructions, as shown in (5). Third, following Zimmerman (2007: 158), it is argued that 'newness' "must take into account discourse-pragmatic notions like hearer expectation or discourse expectability of the focused content in a given discourse situation. The less expected a given content is judged to be for the hearer, relative to the Common Ground, the more likely a speaker is to mark this content by means of special grammatical devices, giving rise to emphasis." However, the examination of the constructions at hand shows that this is best regarded as a tendency rather than as a cut-and-dried generalization.
Fourth, regarding their interpersonal nature, contrastive focus constructions in general and clefts in particular convey a positive or negative stance by the subject/speaker towards the content of the proposition. Therefore, these constructions encode a higher degree of subjectivity (i.e. emotional intensity) and convey a slightly more accusatory or a slightly more laudatory tone than their non-cleft/non-focus counterparts (Perzanowski & Gurney 1997: 221-222). Thus, in (6), the pressure exerted by Carter is presented as the major driving force in freeing Nicaragua from censorship. This semantico-pragmatic facet of the constructions under scrutiny can be grounded in the notion of subjectivity, viz. "the way in which natural languages, in their structure and normal manner of operation, provide for the locutionary agent's expression of himself and his own attitudes and beliefs" (cf. Lyons 1982: 102; Scheibman 2002). Finally, regarding thematic and cohesive flexibility, given that contrastive focus constructions can be used to explicitly signal a contrast, alternative, or correction with respect to a previous stretch of discourse, they qualify as cohesion-building devices. A case in point is example (7), where the writer has recourse to a restatement based on a play on words to convey his/her negative stance on the supporters of the Spanish Socialist Party.

Goutsos, Dionysis, Constantin Potagas, Dimitris Kasselimis, Maria Varkanitsa and Ioannis Evdokimidis Panel: 1. Diseño, compilación y tipos de córpora THE CORPUS OF GREEK APHASIC SPEECH: DESIGN AND COMPILATION The study of aphasia in Greek lacks large-scale empirical findings, mainly because of the theoretical orientation of the field. Computer language corpora can usefully fill this gap and give a new perspective to the study of the speech of Greek aphasic patients. The paper's goals are to present the design and compilation of the Corpus of Greek Aphasic Speech (CGAS), a new resource for the study of aphasia in Greek, and to discuss its possible applications. The aims and design of the corpus and the methods followed for its compilation are presented. A pilot corpus was first created, including data from 20 patients treated between 2006 and 2008. Two types of text from each patient's spoken output have been included in the corpus, namely spontaneous speech and picture description (12,663 words in total, of which 10,332 belong to the patients' talk). On the basis of the pilot corpus, a classification of paraphasias, or speech errors, has been attempted, and the frequency and type of each category has been studied. The Corpus of Greek Aphasic Speech is envisaged to include data from 114 patients, that is, 228 texts of 50,000 words in total (of which 41,000 are spoken by patients). In conclusion, it is argued that the exploitation of specialized computer corpora can have important advantages for the study of aphasia and can usefully complement current research on aphasia in Greek, both quantitatively and qualitatively. Among the most important consequences of using corpora in aphasia research is the view of speech errors as the product of situated language use by specific speakers rather than as isolated examples of lack of competence.

Gregori-Signes, Carmen Panel: 2. Discurso, análisis literario y corpus COMMUNITY DIGITAL STORIES: A CORPUS ANALYSIS Digital storytelling is a genre which is rapidly expanding in many different fields, including education, socio-cultural studies, tourism and marketing, to mention but a few. However, despite this variety, "little has been written on digital storytelling, outside the occasional 'how-to' guides by practitioners" (Hartley and McWilliam 2009: 5). This article seeks to make a contribution to the analysis of community stories by checking whether they can be classified as examples of socio-political digital storytelling. Socio-political digital storytelling is here defined as a type of digital story which may potentially become a powerful tool to bring up and out issues that concern and affect democracy (Couldry 2008) and social welfare. For the analysis of these community stories I draw upon the principles of critical discourse analysis (understood as an approach rather than a method) combined with corpus linguistic methodology, and on the principles of sociopragmatics (Leech 1983: 10), since I believe not only in the importance of studying communication within its sociocultural context, but also in the need to find out the different sociopragmatic rules that may apply when denouncing a situation which affects, or affected, the author's life (cf. Gregori 2010). The stories analysed in this article have been obtained from the website of the Australian Centre for the Moving Image (ACMI) and have been transcribed and analysed drawing upon two different corpora: a) the content of a total of 10 websites that acknowledge using digital stories for social purposes; b) a detailed analysis of the topics, or semantic macrostructures, and of the local meanings (van Dijk 2001: 101) of each of the 25 stories.
Due to space restrictions, the analysis here focuses on the study of community stories by looking mainly at the textual structure of the stories and of the web pages, thus paying attention to: a) the topics of the texts; b) the lexical choice or vocabulary in the stories. The hypotheses operating in the analysis can be stated as follows: a) whilst it is probable that each story displays its own idiosyncrasies, the results of the analysis should at least shed some light on the factors that may be of interest to the members of a community; b) that although the participants may not all fit the same pattern regarding age, time, motivation to write the story, and physical, intellectual, linguistic, social, cultural and emotional development, among others, a corpus analysis of their content should show a relation between their topics and vocabulary and the social representations (van Dijk 2001: 113), that is, the knowledge, attitudes, ideologies, norms and values of the social order by which they abide. If that were the case, not only would the hypotheses be confirmed, but it would also show that corpus analysis can be considered a valid tool for finding out more about the nature of different types of digital stories.

Grochocka, Marta Panel: 4. Lexicología y lexicografía basadas en córpora NONCE FORMATIONS AS INDICATORS OF PRODUCTIVE WORD-FORMATION PROCESSES IN ENGLISH Coinage, borrowing and word formation are the three major methods of extending the lexicon, with the last being the most productive. In other words, the highest proportion of neologisms comes into existence as a result of word-formation processes in which already existing elements of a language are manipulated in some creative way. Every neologism begins its lifecycle as a nonce formation, created to satisfy a particular communicative need arising on a particular occasion. To begin with, it is crucial to make a clear distinction between nonce formations and neologisms, as there is considerable terminological confusion in the literature. Another problem is that nonce formations themselves may be perceived in two opposing ways, i.e. as ad hoc, context-dependent and non-lexicalizable deviations from word-formation rules (Hohenhaus 1998), or, quite the contrary, as formations which are regular, structurally transparent, productively coined and hence predictable (Štekauer 2002). The latter viewpoint is adopted in the present study. Moreover, being indicative of productive word-formation rules, nonce formations are believed to be worthy of study, although they are often transient creations with little chance of becoming institutionalised. Additionally, various types of nonce formations are discussed, with context-dependent naming units and neologistic wordplay as the prime focus of interest. A web-based application called NeoDet has been developed for the purpose of compiling a study corpus of journalistic texts and extracting neologism candidates from the corpus, among which a host of nonce formations and wordplay units can be found. The three-million-word corpus consists of articles and blogs from the most widely read British newspapers and tabloids (i.e. 
The Daily Telegraph, The Times, The Guardian, The Sun, and The Daily Mail) published between 1st January and 31st December 2009. The neologism candidate detection procedure is based on the exclusion principle, with the exclusion sources including a few online dictionaries (i.e. OALD7, MW11, MEDAL2, CH11, CALD3, LDOCE5, Google Dictionary and dictionary.com), four slang dictionaries, the British National Corpus, as well as a wordlist of proper names and geographical names. A lexical item is regarded as a neologism candidate only when it is absent from all the exclusion sources. Once a nonce formation coined by means of affixation has been discovered, the NeoDet search engine is used in order to establish the degree of productivity exhibited by a given prefix or suffix. In this way, studying nonce formations makes it possible to uncover English productive affixes and draw conclusions concerning their meanings. Furthermore, the study sheds light on certain strategies adopted by journalists with the aim of attracting public attention. All in all, new naming units are coined not only to compensate for the denotational deficiency of a language, but also with the purpose of being eye‐ and ear‐catching, witty, amusing and memorable.
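The exclusion principle described in this abstract can be illustrated in a few lines of code. The sketch below is not part of NeoDet; the exclusion sources (in reality dictionaries such as OALD7, the BNC wordlist and proper-name lists) are simplified to in-memory sets, and all data are invented for illustration.

```python
# Minimal sketch of exclusion-principle neologism candidate detection.
# All data structures are illustrative stand-ins: real exclusion sources
# would be loaded from dictionary files, the BNC wordlist, slang
# dictionaries and name lists, not hard-coded.

def neologism_candidates(tokens, exclusion_sources):
    """Return tokens absent from every exclusion source."""
    candidates = []
    for token in tokens:
        word = token.lower()
        # A token is a candidate only if NO source contains it.
        if not any(word in source for source in exclusion_sources):
            candidates.append(token)
    return candidates

# Toy exclusion sources standing in for dictionaries and corpus wordlists.
dictionary = {"the", "minister", "was", "accused", "of", "spin"}
corpus_wordlist = {"government", "press", "briefing"}
proper_names = {"london"}

tokens = ["the", "minister", "was", "accused", "of", "spinfluencing"]
print(neologism_candidates(tokens, [dictionary, corpus_wordlist, proper_names]))
# → ['spinfluencing']
```

A real pipeline would, as the abstract notes, additionally query the productivity of any affix found in a candidate; the principle of "absent from all exclusion sources" is the part sketched here.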

Gutiérrez, Camino Panel: 1. Diseño, compilación y tipos de córpora FROM CATALOGUE TO CORPUS IN DTS: TRANSLATED AND CENSORED CINEMA UNDER FRANCO (TRACECI 1951-1962) One of the main proposals of Descriptive Translation Studies (DTS) is that, in order to obtain relevant results, we need to carry out a systematic study of original and translated texts that, far from being chosen at random, have been carefully selected following certain well-defined criteria. Textual selection should, therefore, be considered one of the key stages of the research. This presentation aims at highlighting the role of TRACE* Catalogues as an essential tool in textual selection, by describing the transition from Catalogue to Corpus in the study of translated and censored cinema under Franco during the 1950s and 1960s, which is part of the research carried out by the TRACE (translation and censorship) project for more than ten years. In the current TRACE Catalogues of translated and censored narrative, theatre, poetry and audiovisual (cinema and TV) texts, "each individual target text is accounted for in a single record, that contains both contextual and pre-textual information related to that target text. This is what makes TRACE database a potential matrix for the selection of corpora (Merino 2001), and why each catalogue can be defined as zero-corpus" (Merino 2005). The Catalogues have been compiled by systematically feeding them with the information gathered from both censorship archives and other sources. The TRACEci 1951-1962 Catalogue currently holds around 3,500 entries, with useful pre-/contextual information about the films that were translated (mainly dubbed) from English into Spanish, censored, and shown on Spanish screens from 1951 to 1962. From the analysis of the information recorded in the Catalogue, certain sets/chains of source and target texts can be identified as prototypical examples depending on the purpose of the analysis, that is, depending on the different translation and censorship phenomena worth studying: for example, the effect of official and/or religious censorship, the translation and censorship of different genres, different types of films (the so-called "commercial films" or "films of special interest"), etc. Our presentation will show how the TRACEci 1951-1962 Catalogue has been compiled and how it has been analysed in order to identify certain texts which will form part of the TRACE parallel corpus and will, therefore, become the objects of close study.

Hedeland, Hanna Panel: 1. Diseño, compilación y tipos de córpora INTERACTION OF TECHNOLOGY AND METHODOLOGY IN BUILDING AND SHARING AN ANNOTATED LEARNER CORPUS OF SPOKEN GERMAN This paper discusses the technological and methodological challenges in creating and sharing HAMATAC, the Hamburg Map Task Corpus. In the first part of the paper, I will introduce the HAMATAC corpus, which consists of 24 recordings of advanced German learners solving a map task (Brinckmann et al. 2008) in pairs. It also includes metadata on all speakers' language biographies. The first corpus version, consisting of the original recordings, orthographic transcriptions and metadata, is publicly available. Future versions will include annotations describing various linguistic levels and phenomena; the more subjective these are in nature, the more interesting they are from a methodological perspective. We are currently annotating disfluencies, one example of such subjective phenomena, using an annotation scheme with necessarily interpretative categories. The corpus presentation will also include an overview of EXMARaLDA, which was used to create the HAMATAC corpus. The EXMARaLDA system consists of data models, formats and tools for transcribing, annotating, managing and analysing spoken language corpora with the help of three software components: the Partitur-Editor, a tool for transcription and multi-level annotation of digital audio or video recordings; the Corpus Manager, a tool for compiling recordings and transcriptions into a corpus and managing corpus metadata; and EXAKT, a tool for carrying out queries and analyses. I will demonstrate how these components are used for corpus building and for analysing corpus data. I will also describe how the entire set of digital data can be transformed into formats independent of these tools and shared with others via a website.
In the second part of the paper I will use HAMATAC to discuss different solutions to some recurrent methodological issues in corpus building and sharing, and show how technological and methodological aspects can be said to interact:
- One of the most fundamental questions arises from the non-trivial problems inherent in transcribing spoken language in general and learner language in particular: how do we represent the non-standard characteristics of the data?
- Do the possibilities resulting from technological advances (extensive querying of linguistic data, or audio or video integrated in a transcript) affect choices regarding the visual representation?
- How can we ensure comparability with other digital corpora without the restriction of shared transcription conventions?
- How do we implement and apply annotation schemes with various layers and different types of annotations, possibly overlapping each other across and within layers?
- How can we assess transcription and annotation quality when our annotation categories, as in the case of disfluencies, are inherently interpretative?
- How do we establish guidelines clear enough to allow for intersubjectivity, and thus for each manual annotation task to be replicable?
- And how do we ensure that our corpus project results in a sustainable language resource?
In this sense, I will argue that the interaction with technological aspects plays an important role in further developing the methodology of linguistic corpus building and sharing.

Iria Romay Panel: 6. Corpus y variación lingüística A PRELIMINARY STUDY OF NEUTRAL MOTION VERBS IN LOB AND FLOB The semantic domain of motion and space has been exhaustively studied in recent decades, being considered a cognitive universal, together with colour terms or terms referring to family members, among others. Research in the particular field of motion is mainly based on Talmy's (1991, 2000, 2007) typological classification of languages into satellite-framed (S-languages) and verb-framed (V-languages). The difference lies in the lexicalization of the path of motion. If a language codifies or 'frames' the path within the verb (e.g. Spanish María cruzó el parque), it is a verb-framed language, whereas if it codifies path through satellites (e.g. English Mary walked across (the park)), it is referred to as satellite-framed. Thus, motion events in V-languages are typically expressed by the combination of a path verb and a subordinate adverbial of manner, in contrast with S-languages, which express them by means of a manner-of-motion verb and a path satellite. In keeping with these typological differences, V-language users tend to encode fewer path segments than S-language users in both spoken and written language. Moreover, in S-languages path information is expressed in a more compact way than in V-languages. Therefore, there seems to be general agreement on the greater resources of English (an S-language) compared to Spanish (a V-language) in the expression of motion events, since English makes use of more fine-grained distinctions, especially in motion verbs which also imply manner meanings. These verbs are used much more widely than their Spanish counterparts and can occur in a wider range of contexts. Thus, apparently, and due to lexicalization patterns, there are remarkable differences between the two languages as regards the variety of verbs expressing manner of motion. The pilot research presented in this paper is part of a larger project whose aim is to provide a contrastive analysis of the development of verbs of manner of motion in English and Spanish as represented in different corpora. There are indications (see, for instance, Martínez Vázquez 2001) that usage in the field of motion may be undergoing change, particularly in Spanish, as a result of contact with or borrowing from English, but also in English itself. In this preliminary study, however, the focus will be only on the English field of motion along the diachronic dimension. For this purpose, three neutral English run verbs (walk, run, and jump) that express manner of motion have been considered by comparing two sub-periods of present-day British English (the 1960s and the 1990s) as represented in the LOB and FLOB corpora respectively. These three verbs have been selected on the basis of their frequency and also because they are generally used in sentences which provide movement information through the verb itself or through other parts of the sentence (the information provided refers not only to the subject entity but also to manner, path and ground). Therefore, run verbs can be considered core elements in spatial semantics when expressing change of location.

Ivanova, Anna Panel: 2. Discurso, análisis literario y corpus PRESIDENTIAL SPEECH IN 140 SYMBOLS: A CROSS-CULTURAL ANALYSIS OF TWITTER USE BY BARACK OBAMA AND DMITRY MEDVEDEV The present study is a continuation of a pilot project on the use of Twitter by Barack Obama. As proposed elsewhere (Ivanova 2011: in press), a cross-cultural comparative analysis is necessary to reach a complete understanding of political talk online as a phenomenon of the 21st century. For this purpose we collected a corpus of Twitter messages (English version) posted by Russian President Dmitry Medvedev, who opened his Twitter account during an official visit to the USA in June 2010. Thus, the updated corpus comprises 831 tweets posted by the Russian and American Presidents between June 2010 and January 2011. The analysis shows: 1. Twitter use does not coincide with the presidents' work weeks; 2. a slight decrease in Twitter use by the Russian leader, while his American colleague keeps a steady rhythm (mean tweets per month: Obama 64, Medvedev 40, i.e. Obama posted 1.6 times as many tweets); 3. 68% of Obama's messages contain external links, while Medvedev's Twitter has only 27% (61% are the president's photos); 4. low lexical density in both corpora: 0.19 (Obama), 0.31 (Medvedev); 5. mean tweet length in characters: a. Barack Obama: 120 (range: 41-140); mode = 139; StDev = 21.63; b. Dmitry Medvedev: 116 (range: 16-140); mode = 140; StDev = 24.86; 6. Gunning-Fog Index: 14.8 (Obama), 16.8 (Medvedev); 7. high usage of "we" (N=128), "watch" (N=97) and "live" (N=95) in the American corpus, and of "we" (N=63), "Russia" (N=30) and "today" (N=29) in the Russian one; 8. the most frequent collocates of the node WE within the span 4:4 are: a. in Obama's Twitter: WE (128); b. in Medvedev's Twitter: WE (63). Thus, we conclude that: 1. Twitter use by both presidents represents a monodirectional interaction channel in which the Twitter platform is used as an advertisement tool to give additional promotion to the presidents and their cabinets' actions; 2. the nearly maximal use of the available symbols shows extensive use of Twitter by both presidents; 3. according to the readability index, both Twitter corpora are classified as technical documents, i.e. their target audience is expected to have a university degree; 4. the lexical component of both Twitter corpora is restricted to the professional side of the presidents' political actions and excludes any other type of information, i.e. there are no chunks containing other types of vocabulary, so we consider the corpora lexically evenly distributed. This continuation of a previous study shows Twitter to be a useful online social platform which serves as an additional promotion tool in the domain of political communication. Its language component does not go beyond political vocabulary and is thus seen as lexically limited. Thus, we see that new technologies are used to tell basically the same "old" story, but in a modern and fashionable frame.
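Two of the descriptive statistics reported in this abstract can be sketched briefly. The abstract does not state its exact formula for lexical density, so the sketch below operationalises it as the type/token ratio; the toy tweets are invented and stand in for the collected corpus.

```python
# Minimal sketch of two corpus statistics of the kind reported above:
# lexical density (operationalised here as the type/token ratio, an
# assumption -- the abstract does not give its formula) and mean tweet
# length in characters.

def type_token_ratio(tweets):
    """Distinct word forms divided by total word tokens, case-folded."""
    tokens = [w.lower() for tweet in tweets for w in tweet.split()]
    return len(set(tokens)) / len(tokens)

def mean_length(tweets):
    """Mean number of characters per tweet."""
    return sum(len(t) for t in tweets) / len(tweets)

# Toy corpus standing in for the collected Twitter messages.
tweets = [
    "Watch the town hall live today",
    "We are live today watch now",
]
print(round(type_token_ratio(tweets), 2))
print(mean_length(tweets))
```

Repeated words ("live", "today", "watch") lower the ratio, which is why heavily formulaic feeds score low, as both presidential corpora do.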

Ji, Meng Panel: 6. Corpus y variación lingüística A CORPUS-BASED STUDY OF DIACHRONIC REGISTER VARIATION IN MODERN CHINESE This paper sets out to investigate diachronic register variation in modern Chinese through a corpus-based comparative study of two large-scale monolingual corpora of modern Chinese, i.e. the Lancaster Corpus of Modern Chinese (LCMC) (1990s) and the UCLA Corpus of Modern Chinese (early 2000s). The study of register variation came to prominence in the 1990s with the advent of language corpora and the technical advancement of natural language processing tools. Earlier attempts were made at uncovering the patterns underlying register variation, and the patterns thus identified might help establish a multidimensional framework for cross-cultural and cross-linguistic analysis (Biber, 1995). The validity and wider applicability of the model was tested with four orthographically different linguistic systems, namely English, Nukulaelae Tuvaluan, Korean, and Somali. It is however argued in this paper that the representativeness of the model thus built requires further verification with language data collected from orthographically similar but socio-culturally different linguistic systems such as Korean and Chinese. That is because the development of modern written registers in these two languages, despite their many shared textual and discourse conventions, may well have followed distinctive patterns of evolution as a result of the different cross-cultural contacts with the West that they were exposed to. Therefore, in this paper, we aim to explore the particular patterns of register variation in modern Chinese within the multidimensional framework of linguistic analysis proposed in Biber (1998). 
The innovative use of relevant corpus data and methods proved essential in the discovery of novel textual and linguistic events bearing on the changing nature of written genres in modern Chinese, as documented in the two large-scale comparable corpora under investigation.

Judith Laso, Natalia, Elisabet Comelles, Isabel Verdaguer Panel: 8. Los córpora y la adquisición y enseñanza del lenguaje USING A CORPUS-BASED CLAUSE PATTERN DATABASE IN THE ENGLISH GRAMMAR CLASSROOM The use of corpus-based tools has proven useful for the teaching and learning of a foreign language (Aston 2001, Granger 2003, Sinclair 2004, Conrad 2005, Granger & Meunier 2008, Aijmer 2009, Bennet 2010), as it allows both the linguist and the learner not only to become aware of the complexity of language but also to consider utterances in a real context. Likewise, corpus linguistics has stressed the systematic interconnections between lexical items and their linguistic environment. It has empirically shown that native speakers tend to make use of recurrent strings of words, and has greatly contributed to the identification of units of meaning which would hardly have been detected without the assistance of corpus-based methods. Most corpus-based studies conducted up to now deal with an empirical description of language, yet there are very few studies exploring the benefits of this approach for language teaching (Conrad 2005, Laso & Giménez 2007 and 2008, Aijmer 2009, Bennet 2010). Although these benefits would seem consistent with language learning theory, little research on the effectiveness of using corpus-based materials in the EFL classroom has been carried out so far. As part of a teaching innovation project devoted to the creation of teaching materials, the GReLiC group at the University of Barcelona has recently developed the Clause Pattern Database (CPDB), which accounts for the valency patterns of a selection of 45 prototypical verbs. This corpus-based tool is also supplemented with tree diagrams, created with the assistance of the Charniak parser (Charniak and Johnson 2005) and phpSyntaxTree, illustrating each example in the database. This paper aims at illustrating the various applications of the CPDB for the teaching and learning of verb subcategorisation requirements. To this end, a continuous assessment task, especially designed for the undergraduate course Descriptive Grammar of English, will be presented. The task was conducted in three groups of third-year students of approximately 50 students each. In the task, students were asked to: a) complete the CPDB with real examples of language (excerpted from texts of their choice) by providing their valency and clause pattern; b) provide a tree analysis of each sentence. Once the task was completed, they were also asked to answer an online questionnaire so as to assess their satisfaction with the newly designed database and corpus-based activity, and to explore how corpus linguistics can contribute to language acquisition in formal tuition contexts.

Juncal, Lourdes Panel: 5. Corpus, estudios contrastivos y traducción A CONTRASTIVE STUDY OF ADVERBS OF CERTAINTY AS DISCOURSE MARKERS IN SPOKEN ENGLISH AND SPANISH The present paper will focus on the adverbs certainly, definitely, obviously, and absolutely in British English, and on their equivalents in Castilian Spanish, which I will divide into two groups: 1) their literal equivalents (ciertamente, definitivamente, obviamente and absolutamente) and 2) their equivalents in use (por supuesto, naturalmente, sin duda, claro, desde luego, cierto, etc.). All these adverbs of certainty (Martín Zorraquino & Portolés, 1999; Vandenbergen & Aijmer, 2007) will be analyzed in this presentation as discourse markers which are indexically linked to epistemic modality. The function of these adverbs as discourse markers, standing alone as a whole sentence in conversation, has not been extensively analyzed. The aim of this study is to analyze the speaker's reactive intervention (Martín Zorraquino & Portolés, 1999) when these markers occur as a whole sentence in turns of talk, in order to determine conversational strategies (agreement, indirectness, fluency, interruption, empathetic use, power, solidarity, etc.). In addition, I will show their differences and similarities in use and frequency in English and Spanish. By means of the WordSmith Tools programme I will compile wordlists, frequencies, and concordances in order to analyze grammatical features such as the position of the marker with respect to the discourse member in which it occurs. Furthermore, I will examine contextual features to show which markers are used in formal and non-formal registers, as well as gender and age differences in usage. This study will use samples taken from two corpora: the Integrated Reference Corpora for Spoken Romance Languages (C-ORAL-ROM) for Spanish, and the London-Lund Corpus of Spoken English (LLC) for English. 
Bearing in mind that these two corpora differ in size, I will apply Biber's (1988) procedure of calculating the frequency of occurrences per million words in order to guarantee a comparable analysis.
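The per-million-word normalisation mentioned here is simple arithmetic, sketched below. The raw counts and corpus sizes are invented figures for illustration, not data taken from the LLC or C-ORAL-ROM.

```python
# Minimal sketch of per-million-word frequency normalisation, the standard
# way to compare raw counts from corpora of unequal size (after Biber 1988).
# All figures are hypothetical.

def per_million(raw_count, corpus_size):
    """Normalised frequency: occurrences per 1,000,000 words."""
    return raw_count / corpus_size * 1_000_000

# Hypothetical raw frequencies of one marker in two corpora of different sizes.
english = per_million(raw_count=230, corpus_size=500_000)
spanish = per_million(raw_count=120, corpus_size=300_000)
print(english, spanish)
```

On these invented figures the marker is rarer in the larger corpus in raw terms yet the normalised rates (460 vs. 400 per million) remain directly comparable, which is the point of the procedure.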

Karakoc, Taner Panel: 5. Corpus, estudios contrastivos y traducción CORPUS OF TURKISH CULTURE-SPECIFIC ITEMS AS REPRESENTATIVES THROUGH TRANSLATION IN ISTANBUL 2010 EUROPEAN CAPITAL OF CULTURE ACTIVITIES The paper aims to investigate the function of a corpus of Turkish culture-specific items as representatives of Turkish culture through the translations produced during the activities organized within the scope of the Istanbul 2010 European Capital of Culture project. The monthly bilingual (Turkish-English) events bulletins, published online and in booklet format, serve as the source of information for the corpus on the cultural activities held within the project, highlighting conferences, concerts, documentary screenings, exhibitions, workshops, drama, ceremonies of Sema, performances, etc. The cultural items, or "culturemes", that make up the corpus convey invaluable information about Turkish culture through translation for foreign audiences. Among these culture-specific terms are items related to music, food, local arts and crafts, traditions, dance, drama, religion, religious ceremonies, etc. The study describes the methods of translation (modulation, adaptation, transposition, explicitation, omission, amplification, compensation, etc.) implemented in the texts that appeared in these bulletins, which make up the corpus of the analysis. The study also provides a multifaceted analysis with references to paradigms in Translation Studies such as equivalence, descriptions, purposes, uncertainty and, above all, cultural translation (Anthony Pym, Exploring Translation Theories, 2010).

Keshabyan, Irina Panel: 5. Corpus, estudios contrastivos y traducción A CONTRASTIVE STRUCTURAL ANALYSIS OF SHAKESPEARE’S HAMLET VERSUS SUMAROKOV’S GAMLET: A CORPUS‐ BASED APPROACH The main aim of this paper is to look at the structural (dis)similarities of two specific texts in the genre of drama ‐The Fourth Folio Edition of The Tragedy of Hamlet Prince of Denmark (1685) by Shakespeare and the English translation of Gamlet (1787) [1748] by the Russian playwright Sumarokov, translated from Russian by Richard Fortune in 1970. The main area of research of this investigation is the study of text by means of corpus‐based techniques ‐in other words, by means of a computational and quantitative analysis. For ease of reference, The Fourth Folio Edition of Shakespeare’s Hamlet (1685) will be referred to as Hamlet or SH. The Russian text will be referred to as SG‐R, whilst the English translation will be referred to as Gamlet or SG. The investigation is based on the electronic collection of these texts, that is, on the computerised texts. The method I use to analyse Hamlet and Gamlet does not dwell on the standpoints of various forms of historical, philosophical, language‐based, etc. approaches which are available at present. So, what I do is focus on the formal aspects of the plays that could be easily located, extracted, computerized, quantified and, at the same time, could contribute towards identifying Shakespeare and Sumarokov’s intentions, particularly with regard to the structural organisation of both plays. To investigate the patterns of structural variation, I shall select and quantify the total frequency of interaction variables for the analysis. Such an analysis is extremely useful as it can provide the basis for a reliable structural comparison of these texts. The quantification of interaction variables will be carried out by examining the two text files directly. 
Afterwards, the extracted data will be computerised, tabulated (intra-play), cross-tabulated (inter-plays) and presented in tables, graphs and schemes. The readings of Hamlet and Gamlet suggest that the distribution patterns of the interactions of each main character with all characters, both main and secondary, and vice versa, as well as the relationships that are established among them, are not necessarily parallel per act: intra-play and inter-plays. Moreover, it seems that the interactions are not only distributed differently but their impact is also completely dissimilar per act and per full text: intra-play and inter-plays. My hypothesis is that Shakespeare and Sumarokov probably had dissimilar views about the complexity of the relationships - revealed through the interaction patterns - among all characters, both main and secondary, and that these perspectives led Sumarokov to somehow alter the structure of Shakespeare's original play Hamlet. In general, the key findings will show considerable distinctions between the structures of the plays per act, associated with their organisation of the social network of the characters that have connections with each other.
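The tabulation and cross-tabulation of interaction variables described above might be sketched as a simple tally of who addresses whom, per act and over the full text. The records below are invented placeholders, not counts extracted from either play.

```python
from collections import Counter

# Each record: (act, speaker, addressee). In practice these would be
# extracted from the two text files; the tuples here are invented.
interactions = [
    (1, "Hamlet", "Horatio"), (1, "Horatio", "Hamlet"),
    (1, "Hamlet", "Gertrude"), (2, "Hamlet", "Horatio"),
    (2, "Claudius", "Gertrude"),
]

# Intra-play tabulation: frequency of each speaker-addressee pair per act.
per_act = Counter((act, spk, adr) for act, spk, adr in interactions)

# Full-text cross-tabulation: totals per ordered speaker-addressee pair.
per_pair = Counter((spk, adr) for _, spk, adr in interactions)

print(per_pair[("Hamlet", "Horatio")])  # 2
```

From tables like `per_act` and `per_pair` for each play, intra-play and inter-play comparisons reduce to comparing the two frequency distributions act by act.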

Khudyakova, Mariya Panel: 3. Estudios gramaticales basados en córpora POSSESSOR NPS AND REFERENTIAL CHOICE IN ENGLISH BUSINESS PROSE (A CORPUS RESEARCH) The choice of an appropriate referential expression depends on multiple factors. This paper focuses on the influence of the possessor position of a referential expression and its antecedent on referential choice. The study is based on a subcorpus of the specially designed RefRhet corpus.

O'Halloran, Kieran Panel: 2. Discurso, análisis literario y corpus ELECTRONIC DECONSTRUCTION OF AN ARGUMENT THROUGH ITS 'SUPPLEMENT': DERRIDA AND CORPUS LINGUISTIC METHOD A by-product of new social media is an abundant textual record of engagements - billions of words across the world-wide-web in, for example, discussion forums, blogs and wiki discussion tabs. Many such engagements consist of commentary on a particular text and can thus be regarded as electronic supplements to these texts. The purpose of this presentation is to flag the utility value of this electronic supplementarity for corpus-based, critical reading by highlighting the following: how an electronic supplement can reveal particular meanings that the text being responded to can reasonably be said to marginalize and/or repress. In turn, this can show where the text's rhetorical structure can be said to be unstable, in a state of deconstruction. Given the often large size of these supplements, knowing how to mine them with corpus linguistic software is essential. I refer to this new type of corpus-based analysis as Electronic Deconstruction. Electronic Deconstruction takes its theoretical orientations from the philosopher Jacques Derrida and, in particular, his idea of the supplement. We normally understand a supplement as something which is an add-on and thus outside that which is being supplemented. In contrast, for Derrida (1976), any supplement has an undecidable 'inside-outside' relation, e.g., vitamin supplements are both outside the diet in providing additional vitamins and inside the diet in replacing a lack of vitamins. I report on recent, Derrida-inspired research (O'Halloran, 2010) where I examine how a discussion forum appended to an argument in an on-line newspaper is simultaneously outside and 'inside' the argument; that is, it is a Derridean supplement.
By employing statistical keyword analysis of this discussion forum supplement via Wmatrix software (Rayson, 2008), using the BNC Sampler written corpus as a reference corpus, I reveal that the discussion forum carries meanings which occur as traces inside the argument, permitting a judgement that the argument seeks to marginalize/repress these meanings. Once these traces are revealed, the argument's rhetorical structure is shown to deconstruct itself. Electronic Deconstruction can be seen, on the one hand, as an intervention into the text, that is, on the basis of the discussion forum supplement as outside the argument. On the other hand, it is an 'intravention', a bringing out of meanings that already exist as traces within the argument, that is, on the basis of the discussion forum supplement as 'inside' the argument. In being simultaneously intervention and 'intravention', the analytical procedure mirrors the undecidability of Derrida's notion of the supplement. Lastly, because the procedure for locating salient concepts in the forum is statistically informed, it reduces arbitrariness in making judgements of repressions and marginalisations, as well as in selecting points of entry into the argument before going on to reveal its deconstruction.
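The statistical keyword analysis referred to above can be illustrated with the log-likelihood keyness statistic (Dunning 1993), which is the measure Wmatrix reports; the counts in the example are invented.

```python
import math

def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Keyness of a word: a and b are its frequencies in the study
    and reference corpus, c and d the sizes of the two corpora.
    Compares observed counts against expected counts under the
    null hypothesis that the word is equally frequent in both."""
    e1 = c * (a + b) / (c + d)   # expected count in study corpus
    e2 = d * (a + b) / (c + d)   # expected count in reference corpus
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

# Invented counts: 100 hits in a 10,000-word forum vs. 50 hits in a
# 100,000-word reference corpus -> the word is strongly key in the forum.
print(round(log_likelihood(100, 50, 10_000, 100_000), 1))  # 298.2
```

Words are then ranked by this score; a high value flags a candidate keyword whose concordances are inspected by hand, which is what keeps the "statistically informed" selection of entry points non-arbitrary.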

Knörr, Garikoitz and Keith Stuart Panel: 4. Lexicología y lexicografía basadas en córpora THE SENSE AND SYNTAX OF 'SPEAK' AND 'TALK' This paper presents a corpus analysis of 'speak' and 'talk'. Based on data provided by two large corpora (BNC and COCA), the aim is to point out some relevant differences in the use of these two often seemingly overlapping lemmas: the way and frequency with which they combine with adverbs (e.g. 'speak quietly' vs. 'talk quietly'), the use of prepositions ('speak with/to' vs. 'talk with/to'), and their degree of productivity both in the formation of compounds and collocations and as stems (e.g. 'speakable', 'talkative'). The kind of information that can be gleaned from a large corpus or several large corpora is not always to be found in dictionaries or grammar books. In particular, when using a corpus, you can see how a word behaves in its immediate context and in the larger context of the text. Therefore, the paper also includes a brief overview of the definitions and usage notes offered in the most well-known reference works and how they differ from the data provided by the corpora. Finally, we will attempt to show that the choice of a particular verb tense seems to motivate the choice of the verb. In other words, we will try to demonstrate that there is a correlation between sense and syntax (Sinclair, 1991).

Kompara, Mojca Panel: 4. Lexicología y lexicografía basadas en córpora IS AUTOMATIC PRODUCTION OF DICTIONARY ENTRIES IN THE FIRST SLOVENE ONLINE DICTIONARY OF ABBREVIATIONS SLOVARČEK KRAJŠAV POSSIBLE? The possibility of automatic production of dictionary entries in the first Slovene online dictionary of abbreviations, Slovarček krajšav, in the Termania software is discussed in this paper. The paper presents the newly built Slovene software for dictionary production (Termania) and the possibility of automatic production of abbreviation dictionary entries. As a first step, a demonstration algorithm was used which focuses on the automatic recognition of abbreviations and abbreviation expansions (Taghva 1999) in Slovene, with a restricted number of characters for each abbreviation (Kompara 2010). Further development expands the number of characters for each abbreviation to ten and takes into consideration all four types of abbreviation-expansion patterns. In the next stage, the algorithm is provided online in a demonstration version. At this stage, a random selection of Slovene text is used to verify the performance of the algorithm and to improve recognition. The upgraded algorithm is then fully capable of handling large text databases and is used on a Slovene corpus of over 60 million words. In 30 minutes, the software filters the whole corpus and provides 5,000 abbreviation-expansion pairs. The acquired data are then manually cleaned; good pairs are verified and used for the production of the first Slovene dictionary of abbreviations, Slovarček krajšav. For entry production the Termania software is used. Dictionary entries are divided into simple and complex. Simple entries are produced entirely automatically; complex entries, owing to their complex structures, encyclopaedic data and translations, are produced "semi"-automatically. Simple entries are mainly Slovene, covering just the abbreviation, a language qualifier and the expansion.
The abbreviation and expansion are recognised automatically by the recognition algorithm, and language qualifiers are added automatically. In simple entries we focus on the automatic production of nominative Slovene structures of abbreviation expansions out of non-nominative structures, as seen in example (1): (1) AB Alzheimerjevo boleznijo (non-nominative structure) → Alzheimerjeva bolezen (nominative structure). This approach is also used in complex entries. The main problems in complex entries are encyclopaedic data and translations, which are for now included manually but will in the future be added automatically. The algorithm for automatic recognition of abbreviations and abbreviation expansions is the link between the electronic text and the "semi"-automatically produced dictionary of abbreviations. Such a dictionary represents the future of electronic lexicography (Kompara 2009).
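The recognition step can be illustrated with a first-letter matching heuristic in the spirit of Taghva (1999): find a parenthesised candidate abbreviation and check it against the initial letters of the preceding words. This is a deliberately simplified sketch, not the algorithm actually used for Slovarček krajšav.

```python
import re

def find_pairs(text: str, max_len: int = 10):
    """Find 'expansion (ABBR)' pairs where the abbreviation's letters
    match the initial letters of the immediately preceding words."""
    pairs = []
    for m in re.finditer(r"\(([A-Z]{2,%d})\)" % max_len, text):
        abbr = m.group(1)
        # Take as many preceding words as the abbreviation has letters.
        words = text[:m.start()].split()[-len(abbr):]
        if len(words) == len(abbr) and all(
                w[0].upper() == ch for w, ch in zip(words, abbr)):
            pairs.append((abbr, " ".join(words)))
    return pairs

print(find_pairs("pogosta je Alzheimerjeva bolezen (AB) pri starejših"))
# [('AB', 'Alzheimerjeva bolezen')]
```

A real system additionally has to handle the other expansion patterns mentioned in the abstract (e.g. letters taken from word-internal positions, or expansions following rather than preceding the abbreviation), which this one-pattern sketch ignores.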

Krasnikova, Anna Panel: 8. Los córpora y la adquisición y enseñanza del lenguaje CORPORA AND TEACHING OF EDITING When discussing the use of corpora for teaching, we rarely mention editing. Yet corpora can serve as one of the major tools for editing courses. It is possible to distinguish two main goals that are set by a teacher: 1) to teach students "to work mechanically", that is, to impart certain skills to them and let them develop these skills to the point of automatism;

2) to teach students to work creatively with a text, to practise a critical approach and to read thoughtfully. These two goals are achieved through different types of exercises and, accordingly, different types of corpus use. It seems to us that the following distribution is effective: a teacher creates exercises for practising editing skills, while students check their estimations and assumptions and learn to formulate and prove them. 1) Work of a teacher: creation of exercises. Editing skills depend on practice. If you want to teach students to edit, it is necessary to have them do hundreds of exercises on different types of errors. Textbooks do not help much: while their content is enough to get acquainted with different kinds of errors, it is not enough to get hold of practical application. By means of searches in language corpora it is possible to collect material for exercises on the analysis and estimation of different text aspects: language and style, logical connections, and facts. 2) Independent student work: raising of language awareness. Students often feel that there is "something wrong" with a phrase, but cannot tell what exactly is wrong and cannot explain why. They have to raise their language awareness and to prove their text estimations, and that is also where the use of corpora proves to be effective.

Ktari, Imen Panel: 7. Lingüística computacional basada en corpus POSTMODIFIERS ACTING AS COMPLEMENTS AND ADJUNCTS IN POPULAR AND ACADEMIC MEDICAL ARTICLES: A GENERATIVE CORPUS-BASED APPROACH Carnie (2001), following Chomsky's theory, studies postmodification, a linguistic structure that comes after the head noun to modify it, in terms of the three levels of projection of X-bar theory: a minimal projection (X), an intermediate projection (X' or X bar) and a maximal projection (X'', X double bar or XP). In this paper, the focus will be on one of the major contributions of this theory, which consists in the distinction between complements and adjuncts within the noun phrase as far as postmodifiers are concerned. Sister to the head and daughter of the single bar level, the complement is "adjacent to the head", i.e. "closer to the head than an adjunct" (Carnie, 2001: 117). Hence the complement rule X' → X (WP). The adjunct, on the other hand, is a sister to and a daughter of a single bar level (Carnie, 2001: 117) and "may be freely added to any number of NPs" (Kroeger, 2005: 87). The adjunct follows this rule: X' → X' (ZP). Following a qualitative and a quantitative analysis (UAM Corpus Tool), this paper seeks to investigate the relationship between the syntactic and the semantic aspects, along with the frequency distribution, of postmodifiers acting as complements and adjuncts in both academic and popular medical articles, adhering to a comparative corpus-based approach. The aim of this paper is to show that postmodifiers acting as complements, which are thus more "lexically specified" (Kroeger, 2005: 88), are found mainly in academic medical articles, since the latter display a high level of scholarliness, whereas those acting as adjuncts are more recurrent in popular articles, which are considered more narrative and closer to the casual register.

Lacalle, Miguel Panel: 9. Usos específicos de la Lingüística de Corpus THE LIMITS BETWEEN AFFIXATION AND COMPOUNDING IN OLD ENGLISH: THE SUFFIX -BORA This paper raises the question of the limits between compounding and affixation in Old English by focusing on the suffix -bora. This form is analyzed against the wider setting of the nominal derivatives to which the suffixes -a, -e, -en, -end, -ere/-re, -icge, -estre/-istre/-ystre, -o and -u have been attached. These suffixes form deverbal derivatives, as in (ge)spreca 'spokesman' ~ (ge)sprecan 'to speak, say, utter', but the case with -bora is different, thus wi:gbora 'fighter' ~ wi:g 'strife, contest, war, battle'. The suffix -bora is a verbal element, morphologically related to the verb beran 'bear'. In this sense, Quirk and Wrenn (1994) consider -bora a suffix, whereas Kastovsky (1992) does not. The conclusion is reached that -bora represents a bound form and, as such, a suffix for two reasons. Firstly, although -bora derivatives are considerably transparent, we also come across some instances of lexicalization such as candelbora 'acolyte' and wro:htbora 'the devil'. And, secondly, -bora as a free form is extremely infrequent. According to The Dictionary of Old English, there is a single occurrence of bora 'bearer' in the corpus.

López Arroyo, Belén Panel: 5. Corpus, estudios contrastivos y traducción WRITING COMPUTERIZED ABSTRACTS: APPLICATIONS FROM A CORPUS-BASED STUDY Abstracts, which constitute a secondary genre based on the Research Paper (RP), have often been the object of interlingual contrastive analysis for translation and language teaching purposes, among others. However, these empirically-based, cross-linguistic studies should have a central role to play in offering solutions to applied problems (Rabadán, 2008: 309). This is one of the aims of the ACTRES research group. In the present paper we intend to describe the methodology and the tools devised by the ACTRES group to bridge the transition between linguistic description and procedural information. The first step of this process was to design a small special corpus of scientific abstracts, the BioAbstracts_C-ACTRES. The macro- and microlinguistic characteristics of this corpus were analyzed in order to find the most prototypical rhetorical, grammatical and lexical features of this genre. Then, we identified the "anchors" (Rabadán: in press) relevant for native speakers of Spanish. Finally, a prototype of a writing application, the Scientific_Abstract_Generator, has been designed, aiming at helping native Spanish users who are non-linguist field experts to write scientific abstracts in English.

López Arroyo, Belén and Martín Fernández Antolín Panel: 4. Lexicología y lexicografía basadas en córpora CORPUS-BASED APPLICATIONS: DEFINING A BILINGUAL LEXICOGRAPHICAL AND PHRASEOLOGICAL WORK ON WINE TASTING NOTES The present paper aims at describing a bilingual (Spanish/English) terminological and phraseological dictionary of wine tasting notes. The dictionary was conceived as a corpus-based lexicographical work and designed as a communicative task according to Yong and Peng (2007); hence, the main criterion when designing and making the dictionary was the final user or the group of potential users it was addressed to. In this sense, considering the great variety of users, the dictionary has several distinctive features and further applications in different fields such as ESP teaching, Translation and Interpreting, Contrastive Analysis, Marketing, International Commerce, etc. Among the distinctive features, we could point out that it is a bilingual dictionary that includes definitions and examples of use; however, the most distinctive feature is that the dictionary is writing-oriented (Hannay 2003); in other words, it aims at helping potential users write wine tasting notes in the L2. We considered that, for some users, understanding how a term is used in context is as important as, or more important than, understanding its meaning. In this sense, we collected and described the phraseological information of some of the main nouns in wine tasting notes; the user will find the linguistic structure of the main nouns used in wine tasting notes, to be used as a tool for writing them. This information is given in a separate glossary, as it was not possible to include it in the dictionary entries.

Lozano, Cristóbal and Amaya Mendikoetxea Panel: 8. Los córpora y la adquisición y enseñanza del lenguaje CEDEL2 (CORPUS ESCRITO DEL ESPAÑOL COMO L2): A LARGE-SCALE CORPUS FOR L2 SPANISH ACQUISITION RESEARCH While second language acquisition (SLA) research has traditionally relied on experimental data, a new area of inquiry known as 'learner corpus research' has recently come into being, resulting from the confluence of two fields: corpus linguistics and Second Language Acquisition (Granger 2002, 2004). But the contribution of learner corpus research so far has been much more substantial in description than in interpretation (Granger 2004), with very little reference to current SLA debates and hypotheses (Myles 2005, 2007). We analyse the reasons why many SLA researchers are still reticent about using corpora and how good corpus design and adequate tools to annotate and search corpora could help overcome some of the problems observed. We do so by describing the design principles of a learner corpus of L2 Spanish we are compiling (CEDEL2) (Lozano 2009a) and its contribution to SLA research. CEDEL2 is a written learner corpus (L1 English – L2 Spanish) containing around 750,000 words (expected target: 1 million words) from all proficiency levels, plus a comparable native Spanish subcorpus. Data are being collected online, mainly from universities and schools in the USA, the UK and Spain. It has been designed according to the 10 corpus design principles proposed by Sinclair (2005), which distinguish it from other large learner corpora. Some advantages are: (i) CEDEL2 is a deductive learner corpus designed to potentially answer any L2 research question concerning any linguistic structure. (ii) CEDEL2 allows for a wide range of contrasts: it can be compared against a similarly designed native Spanish subcorpus acting as a 'control group' and against three interlanguage developmental stages (beginner, intermediate and advanced).
It also allows for Contrastive Interlanguage Analysis (CIA) (Granger 1996) since CEDEL2 (L1 English – L2 Spanish) is similarly designed to WriCLE (L1 Spanish – L2 English) (Rollinson & Mendikoetxea 2010), so we can address key questions in SLA research, e.g., the source of L2 knowledge: L1 transfer, language-specific vs. universal influence. (iii) CEDEL2 includes a reliable and standardised measure of learners' proficiency, as recommended by Tono (2003), which is essential to study L2 development. (iv) For each learner, CEDEL2 contains precise and detailed background information in order to conduct research into critical period effects, language use patterns, likely cross-linguistic effects, etc. A preliminary version of CEDEL2 has already been used in published studies of L2 Spanish (Alonso et al. 2010a, 2010b, Lozano 2009b, Prieto et al. 2009). The next research steps for CEDEL2 are (i) to approach the intended target of 1 million words; (ii) to launch an online taster version; (iii) to continue the tagging of the corpus with particular reference to interlanguage phenomena (though future researchers will be able to tag any linguistic phenomena they wish); (iv) to make the final version of the corpus freely available via a dedicated webpage.

Luzón, María José Panel: 6. Corpus y variación lingüística DISCIPLINARY DIFFERENCES IN THE USE OF SUB‐TECHNICAL NOUNS: A CORPUS‐BASED STUDY Recent research on academic vocabulary has suggested that these words have specific behaviours related not only to the genre but also to the discipline (e.g. Hyland and Tse, 2007; Martínez et al., 2009). In this research I use a corpus‐based methodology to analyse how a type of sub‐technical vocabulary highly frequent in academic texts (which I will refer to as “research nouns” and “discourse nouns”) is used in two different disciplines (Applied Linguistics and Environmental Engineering). The purpose is to determine whether there are differences in the use of these nouns in both disciplines in terms of frequency, the lexico‐grammatical patterns in which they occur, and the discourse functions associated with these patterns. The results provide corpus evidence for disciplinary variation in the frequency and collocational behaviour of sub‐technical nouns. They also reveal that some of these nouns contribute to multi‐word units that are part of the specific phraseology of the research paper in these disciplines. These findings suggest the need to develop discipline specific academic wordlists, which should include not only the lexical items that are relevant in a discipline, but also information on their collocational behaviour and on the rhetorical functions with which they are associated.

Macdonald, Penny, Susana Murcia, Maria Boquera, Ana Botella, Laura Cardona, Rebeca García, Esther Mediero, Michael O'Donnell, Ainhoa Robles and Keith Stuart Panel: 8. Los córpora y la adquisición y enseñanza del lenguaje ERROR CODING IN THE TREACLE PROJECT This paper presents the approach to error analysis within the TREACLE project, the aim of which is to profile learner proficiency to help inform teaching curriculum design. We will introduce the error annotation methodology used on a corpus of written texts by Spanish learners of English at university level. First, we will discuss the underlying principles of the error coding scheme and then provide more details about the coding scheme itself. To ensure that coders annotate the texts in the same way, two steps were followed. Firstly, we developed a comprehensive description of the coding criteria, giving full details as to how to code particular instances. Secondly, we performed two intercoder reliability studies to help us identify areas where coders differed so that we could address these areas. We will present the preliminary results of the error analysis and discuss their repercussions for grammar teaching at university level.

Maiz, Gema Panel: 4. Lexicología y lexicografía basadas en córpora THE OLD ENGLISH VERBAL SUFFIX -LÆCAN: DICTIONARY FREQUENCY VS. CORPUS PRODUCTIVITY The aim of this paper is to compare the corpus and dictionary productivity of the Old English weak verbs suffixed in -læcan. The main sources for this research are the lexical database of Old English Nerthus and the online Dictionary of Old English Corpus. The assessment of productivity is based on the distinction between type frequency (dictionary-based) and token frequency (corpus-based). The conclusion is reached that the type frequency and token frequency of -læcan are very low, whereas its productivity is relatively high (except in poetry), taking into account the number of hapax legomena. Additionally, -læcan verbs are much more frequent in prose and glosses than in poetry.
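The relation between low token frequency and high productivity invoked above can be made concrete with Baayen's hapax-based productivity measure P = n1/N (hapax legomena of the affix divided by its token count) alongside type and token frequency. The abstract does not state which measure is used; the -læcan token list below is invented for illustration.

```python
from collections import Counter

def productivity(tokens):
    """Return (type frequency, token frequency, P), where
    P = n1 / N: hapax legomena of the affix over its token count."""
    counts = Counter(tokens)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return n_types, n_tokens, hapaxes / n_tokens

# Invented attestations of -læcan verbs in a toy corpus:
tokens = ["geanlæcan", "geanlæcan", "nealæcan", "efenlæcan",
          "sumorlæcan", "winterlæcan"]
print(productivity(tokens))  # 5 types, 6 tokens, P = 4/6
```

With only six tokens the suffix is rare in absolute terms, yet four of its five types are hapaxes, so P is high: exactly the configuration of low frequency and high productivity the abstract reports.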

Marcelino, Isabel, Gaël Dias, João Casteleiro and José Martinez-De-Oliveira Panel: 9. Usos específicos de la Lingüística de Corpus SEMI-AUTOMATIC CONSTRUCTION OF THE UNIFIED MEDICAL LEXICON FOR PORTUGUESE The integration of standard terminology systems into a unified knowledge representation system for biomedicine has formed a key area of research in recent years. The Unified Medical Language System (UMLS) (Humphreys et al., 1998) is the most well-known medical knowledge database, which combines the Metathesaurus, the SPECIALIST lexicon (Browne, McCray & Srinivasan, 2000) and the Semantic Network. However, the UMLS is mostly dedicated to the English language. Indeed, only a few languages are included in its core, and their coverage is very limited. For instance, Zweigenbaum et al. (2003) show that only 2% of medical French terminology is included in the UMLS. As a consequence, many different projects have appeared, such as the UMLF (Zweigenbaum et al., 2003) for French and the efforts of the German Institute of Medical Documentation and Information to produce data for the German language for the original UMLS. However, most of the methodologies used so far to build a UMLS are based on using the original or a translated version of the MeSH (Medical Subject Headings) thesaurus, which is the most important resource of the Metathesaurus. In our view, in order to build a dynamic medical knowledge database, the medical language needs to be sampled by analyzing large and diversified corpora, representing diverse medical areas and genres, and by compiling existing controlled medical vocabularies in the form of terminologies, meta-thesauri or glossaries. Indeed, although the MeSH is a valuable resource, it needs constant manual updating to follow the dynamism of the language. As a consequence, maintaining the MeSH and the UMLS is costly, time-consuming and may not reflect the reality of the medical language in due time. Moreover, it is defined on the basis of manual indexing, which may not reflect the reality of relations between concepts, as evidenced by Fellbaum (1998) for WordNet with the famous Tennis Problem. To avoid such limitations, we propose to semi-automatically build a unified medical Metathesaurus for the Portuguese language called the UMLP (Unified Medical Lexicon for Portuguese). Our idea is first to build a unified lexicon based on electronic dictionaries, online glossaries and taxonomies (Tardelli, 2007), Wikipedia and Wiktionary. Then, based on the thesauri automatically created from online resources, we aim at constructing the Portuguese Metathesaurus. In this paper, we will specifically focus on the construction of the unified lexicon and the automatic construction of thesauri, and show how corpus evidence can improve the unification process. Our work resulted in the construction of the largest unified medical lexicon for the Portuguese language, with approximately 85,000 entries together with their respective taxonomic paths from different resources.

Marin Perez, María José and Camino Rea Rizzo Panel: 1. Diseño, compilación y tipos de córpora DESIGN AND COMPILATION OF A LEGAL ENGLISH CORPUS BASED ON UK LAW REPORTS: THE PROCESS OF MAKING DECISIONS The implementation of the Bologna Reform has brought about a substantial change in the status of English as a subject in Higher Education programmes, barring degrees in English Studies and Translation. The new European Higher Education system aims to qualify graduates for professional competences, among which the mastery of a second language, particularly English, is a must. The presence of English in current university programmes has resulted from the choice between two possible ways of integration: the adoption of English as the language of instruction in a considerable part of some compulsory subjects, or the offer of English for specific purposes courses as a separate subject, independent of content courses. The latter is the case of Legal English, incorporated into the degree in Law at the Law Faculty of the University of Murcia, which the authors have been and will be in charge of teaching. It was a hard task to decide on teaching materials when first facing the subject. Legal English is a particularly obscure variety of ESP; as Jonathan Swift stated in Gulliver's Travels as early as 1726, it is "(…) a peculiar Cant and Jargon of their own, that no other Mortal can understand" (in Mellinkoff, 1963: 5). In addition to this, the amount of available materials, especially textbooks, was considerably scarce, as usually happens in other branches of ESP (Rea, 2010a). Resorting to specific corpora could have been an option, as McEnery and Wilson affirm (1996: 121): "such corpora can be used to provide many kinds of domain-specific material for language learning, including quantitative accounts of vocabulary and usage which address the specific needs of students in a particular domain more directly than those taken from more general language corpora."
Nevertheless, to our knowledge, the amount of written legal corpora is also small, and access to them, except in a few cases, is not complete. As a consequence of the scarcity of such corpora and the methodological void derived from it, we engaged in ESP corpus design and decided to create the British Law Report Corpus (BLRC): a legal English corpus that could act as a reliable source for the development of new teaching material and further language analysis. The aim of this paper is to present the process of design and compilation of the BLRC, according to Corpus Linguistics standards as stated in Wynne (2005) for general corpora and their adaptation to specific corpora (Rea, 2010b). First, the legal corpora found are introduced; next, we give a detailed account of the design process and justify the reasons that led to the selection of this legal genre, the mode of the texts, the organization of the corpus into different categories, the distribution of texts per category, etc.; we finish with some final remarks on further corpus applications and future research.

Marqués Aguado, Teresa and Laura Esteban Segura Panel: 1. Diseño, compilación y tipos de córpora (Póster) TEXSEN APPLIED TO A CORPUS OF MEDICAL TEXTS IN MIDDLE ENGLISH Historical corpora may be used as powerful tools to investigate the development of any language, whether synchronically or diachronically, all the more so if they are annotated. Due to phenomena such as spelling variation or the existence of declensions, for instance, annotation may indeed be an asset. In spite of the existence of computer programmes that allow the user to extract various types of information from a corpus (such as Wordsmith or Wordcrunch), the peculiarities of a Middle English annotated corpus such as The Corpus of Late Middle English Scientific Prose (currently being compiled at the University of Málaga, in collaboration with the Universities of Glasgow, Oviedo, Murcia and Jaén) are far better catered for by software tools such as Texts Search Engine (TexSEn). In our poster, we will show the process followed for the compilation of our corpus, which involves two stages: first, transcription; and second, lemmatization and tagging. Once the texts are tagged, the resulting files (in Excel spreadsheets) can be used as suitable input for TexSEn. We will also present a sample of the utilities that this tool offers, such as the retrieval of word‐ and lemma‐lists, as well as of concordances, together with the possibility of making complex searches and of building glossaries according to any user's requirements (hence showing different formats).

Marszałek‐Kowalewska, Katarzyna Panel: 9. Usos específicos de la Lingüística de Corpus CORPUS AND LANGUAGE POLICY: IRANIAN LANGUAGE POLICY TOWARDS ENGLISH LOANWORDS This paper exploits the potential of corpus linguistics for investigating language policy. It focuses on assessing Iranian language policy (which is characterized by heavy linguistic purism) towards English lexical borrowings in Farsi. Two years ago the author studied English loanwords in Farsi and carried out comparative research on technical English loanwords and their Farsi counterparts coined and approved by the Academy of Persian Language and Literature. The tool used in that study was the Persian Linguistic Database (PLDB), a corpus of the Persian language. The results showed that in the majority of cases loanwords held an advantage over their Farsi counterparts. However, most of the corpus evidence was from 2002–2005, whereas the first Collection of Terms Approved prepared by the Academy was published in 2003. Thus, it was decided to compare the results from the PLDB with the results from a newly compiled corpus of Farsi. This paper presents a comparative corpus‐driven study of certain English borrowings and their Farsi counterparts proposed by Iranian linguistic purists. These lexical borrowings belong to one semantic group: technology. The study attempts to verify the differences in usage between certain English loanwords and their Farsi counterparts, in terms of collocations, register and frequency. By means of the compiled corpus, the question of the success of the Iranian language policy towards this particular semantic group will be addressed. To this end, information about the corpus data will be presented. The aim of the study is to compare the results from the Persian Linguistic Database with those from the corpus compiled by the author. In order to assess Iranian language policy by means of a corpus‐driven study, the following questions are going to be answered: 1. 
What are the English borrowings in Farsi? How can they be classified? 2. What is the Iranian language policy towards English borrowings? 3. What kind of data does the corpus contain? 4. What are the problems that can make the results vague? 5. Is the Iranian language policy towards English borrowings successful?

Mat Awal, Norsimah, Imran Ho‐Abdullah and Intan Zainudin Panel: 5. Corpus, estudios contrastivos y traducción A CORPUS‐BASED STUDY ON THE LEXICO‐GRAMMATICAL DIVERGENCE IN MALAY TRANSLATED TEXT: AN ANALYSIS OF THE RELATIVE CLAUSE MARKER YANG Laviosa (1998) suggests that the corpus‐based approach is the 'new paradigm in translation studies'. Since then, various translation studies utilizing corpus‐based approaches have been conducted. This study uses a comparable corpus to investigate the lexico‐grammatical differences of the Malay relative clause marker yang, one of the salient lexical items found in the corpus. The comparable corpus is made up of texts translated into Malay and texts originally written in Malay. A comparable corpus presents an opportunity to discover features that occur more frequently in translated texts, or 'translation universals'. Findings on these translation universals would be a valuable tool in the teaching and training of translators.

Mateo Mendaza, Raquel Panel: 4. Lexicología y lexicografía basadas en córpora THE OLD ENGLISH ADJECTIVAL AFFIXES FUL‐ AND ‐FUL: A TEXT‐BASED ACCOUNT OF PRODUCTIVITY The aim of this paper is to measure the indexes of productivity of the Old English affix ful both as a prefix and as a suffix. This analysis is based on Baayen's (1992, 1993) framework, which comprises different measures of productivity. The major source consulted for this analysis is The Dictionary of Old English Corpus, compiled at the University of Toronto, although some lexicographical sources are also consulted in order to obtain more accurate results. This study of productivity allows for a diachronic perspective on the evolution of these affixes from the Old English period to the present. The main conclusion drawn from this analysis is that the suffix ‐ful is more productive than its prefixal counterpart, which implies that more productive patterns are still maintained in Present‐day English, in contradistinction to the disappearance of less productive ones. These conclusions are compatible with Kastovsky's (1992) statement regarding the tendency of the Old English lexicon towards lexicalization when a given morphological pattern loses its productivity.
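For reference, the best‐known of the cited measures, Baayen's (1992) productivity in the narrow sense, can be stated as follows (this is the general formula from the cited framework, not the paper's own computation):

```latex
% Productivity in the narrow sense (Baayen 1992):
%   n_1 = number of hapax legomena containing the affix in the corpus
%   N   = total number of tokens containing the affix
P = \frac{n_1}{N}
```

Intuitively, an affix that keeps producing one‐off coinages (hapaxes) relative to its overall token count is still productive; an affix whose forms are all familiar, lexicalized items scores near zero.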

This paper presents corpus and experimental evidence regarding the acquisition of subjects by L1 Spanish‐L2 English learners. As is well known, Spanish and English differ in their setting for the Null Subject Parameter, which has been widely studied in SLA research (e.g. White (1985), Liceras (1989), Ruiz de Zarobe (1998), Phinney (1987), Al‐Kasey & Pérez‐Leroux (1998), Liceras & Díaz (1999), Lozano (2002), and Montrul & Rodríguez‐Louro (2006), among many others). It has recently been observed that learners do not treat all subjects alike. In particular, while L1 Spanish‐L2 English learners have no difficulties in acquiring referential subjects, non‐referential subjects (expletives it and there) remain problematic even at advanced stages. According to Ruiz de Zarobe (1998), once a Spanish learner of English has acquired the use of expletives, s/he is able to reset the initial parameter and adopt the target language parameter setting. Most L2 studies on the acquisition of the Null Subject Parameter are experimental. It is only very recently that researchers have started using corpora to test SLA hypotheses.

The findings reported in Oshita (2004) and Lozano & Mendikoetxea (2010) regarding the acquisition of different aspects of the Null Subject Parameter show that large, well‐constructed corpora and databases are powerful tools, crucial for understanding the processes that constrain L2 production. In this study we used two L1 Spanish‐L2 English learner corpora (WriCLE and WriCLEInf), compiled at the Universidad Autónoma de Madrid (see Rollinson & Mendikoetxea 2010). A random selection of texts (from different proficiency levels) was annotated according to the properties of referential and non‐referential subjects. A preliminary analysis of the facts confirms the hypothesis that learners show difficulties in acquiring non‐referential subjects even at advanced stages. In particular, even advanced learners omit subjects in certain contexts (they use 0‐subjects) and overuse it as the generic expletive, while the use of there with verbs other than be is highly limited (see also Lozano & Mendikoetxea 2010). These results are then compared with those obtained through an acceptability judgement task, in which subjects were asked to rate the acceptability of clauses containing the following subjects: it, there, 0, and a Prepositional Phrase. The results of the experimental tasks mostly match those obtained in the corpus study, so that we can talk about converging evidence, but they also show some interesting deviations, probably due to task differences.

Moerth, Karlheinz, Niku Dorostkar and Alexander Preisinger Panel: 1. Diseño, compilación y tipos de córpora GLEANING MICRO‐CORPORA FROM THE INTERNET: INTEGRATING HETEROGENEOUS DATA INTO EXISTING CORPUS INFRASTRUCTURES Over the past decade, the issue of the Web as corpus has been discussed and studied extensively. Meanwhile, a number of very successful projects and the ever‐growing number of new corpora created from sources on the internet bear out advocates of this new brand of NLP resource. The number of tools that serve this purpose has steadily grown, and some of them also provide web‐based interfaces. The now well‐established methodology of creating corpora from the Web has produced tools that allow the wholesale creation of large corpora. The software usually proceeds from so‐called seeds, then crawls the Web, collecting links and downloading relevant data for future reference. The most obvious area of application that comes to mind is lexicography; most software developments presented so far are geared towards the needs of researchers looking for words, less towards scholars of the reading and interpreting kind. While creating ever larger corpora has become a comparatively easy task for computational linguists, other groups of researchers who might also be interested in archiving and exploiting such data still come up against a number of difficulties that often impede smooth access to data. Our paper describes a newly developed piece of software and touches on use cases from projects where researchers need more than mere KWIC lines. It focuses on issues of interface design and key functionalities implemented in the new tool, which comprise, among others, the selective incorporation of particular documents from the internet into a corpus and their preservation (including styles and images), allowing subsequent reading and interpretation of the text. 
Among the design objectives of the development project was to enable non‐technical users to archive data from the internet, to organise this data into reusable micro‐corpora, to enhance the data with more fine‐grained metadata and to integrate them into an existing corpus infrastructure. The usability of the new tool has been put to the test in several small projects, the most important of which brings together scholars and high school students working collaboratively on racist language in online discussion forums, applying methods of critical discourse analysis. The software discussed in the paper has been developed as part of a more general corpus toolbox comprising editing (corpusEditor) and access (corpusBrowser) tools. Development activities have been carried out with a strong emphasis on standards (XML, Unicode, LAF, ISOCat) and de facto standards (TEI, XCES). All the components discussed in the paper will be freely available and published as open source.

COMBINED APPROACH TO MODERN LEXICOGRAPHIC TOOLS: THE CASE OF THE FIRST SLOVENE DICTIONARY OF TOURISM TERMINOLOGY This paper presents the first Slovene Dictionary of Tourism Terminology. In Slovene there is still no contemporary explanatory dictionary of tourism available; the only reliable explanatory sources remain foreign dictionaries of tourism. However, these dictionaries do not cover specifically Slovene tourism‐related terminology, which is why the production of a contemporary dictionary of tourism is essential. The paper presents the newly built Slovene Dictionary of Tourism Terminology, compiled on the basis of the Multilingual Corpus of Tourist Texts (Mikolič et al. 2008). The Corpus was compiled with the aim of drawing up a Slovene‐Italian‐English corpus of tourist texts and of conducting an analysis of these texts based on the theoretical starting points of intercultural pragmatics, translation theory, critical discourse analysis and terminology, thus setting up a platform for the compilation of a terminological dictionary of tourism. The Corpus includes 27 million words, mostly in Slovene but also in English and Italian, making it a comparatively large multilingual LSP corpus for the Slovene language (Mikolič et al. 2008). As research shows (Gorjanc 2002: 75), terminological electronic corpora represent an indispensable basis for compiling LSP dictionaries. The Dictionary of Tourism Terminology is being compiled using newly designed software, Termania (Amebis, 2010), which provides a flexible and user‐friendly interface for editing dictionary entries. The dictionary currently consists of approximately 2,000 terms. In the compilation of the dictionary, the automatic and the manual approach were combined. The automatic approach was used to process corpus data and enter the processed data into the Termania editing software. 
The most frequent tourist terms (unigrams, bigrams and trigrams) were automatically extracted from the Multilingual Corpus of Tourist Texts and placed in the Termania software as dictionary entries. Also inserted automatically for each entry were the language qualifier, grammatical and field qualifiers, examples of use and the translation into English. A manual approach was then used in consecutive editing phases for correcting, complementing or adding new data for individual entries. For field qualifiers, for example, the automatic approach was combined with the manual one, since new fields could be added manually to the existing ones. In a similar manner, good examples and translations were checked for suitability and edited if necessary. An entirely manual approach was used for writing definitions, where editors drew upon different sources, both printed and electronic, in order to compile the definition, stating all the sources at the end of the entry. The results show that the automatic approach to compiling LSP dictionaries is useful and helpful for the lexicographer but cannot replace him or her. A combined approach, building on the advantages of both the automatic and the manual approach, therefore seems the most appropriate. As shown in Humar (2004: 20‐21), a good terminological dictionary is usually the result of group work which draws together the knowledge and experience of specialists from different fields. Nevertheless, the Dictionary of Tourism Terminology represents a good example of a corpus‐based LSP dictionary in electronic format, an important trend in the future development of electronic lexicography.

Monaco, Leida Maria Panel: 2. Discurso, análisis literario y corpus MODALIZING MODERN ENGLISH SCIENTIFIC DISCOURSE: A CORPUS‐BASED APPROACH TO MODAL AUXILIARIES IN 18TH‐CENTURY LIFE SCIENCES TEXTS (CORUÑA CORPUS) Scientific discourse, though often considered strictly objective and hence impersonal (Hyland 1995: 33), has nevertheless been shown to present a significant number of epistemic modality markers, through which authors presumably convey their (un)willingness to commit themselves to the truth of their propositions (Hyland 1998: 3). Semantic‐pragmatic studies of diverse types dealing with scientific literature, both contemporary (Salager‐Meyer 1994; Vihla 1999) and historical (Banks 1991, 2008; Salager‐Meyer 2001; Taavitsainen 2001; Taavitsainen & Pähta 2004), appear to show that scientists normally tend to modalize their discourse when presenting their research achievements before the epistemic community, so that their statements are not perceived as categorical assertions. One such modalizing strategy is the use of modal auxiliaries conveying epistemic meanings, such as doubt, possibility, necessity, or inference (Gotti et al. 2002), all of which appear to be recurrent in scientific writing (Hyland 1998; Vihla 1999). The present study focuses on modal auxiliaries presenting more or less evident epistemic meanings in a corpus of twenty scientific texts belonging to the subfield of the Life Sciences (which in turn contains diverse disciplines, such as Biology, Zoology and Botany), written in English throughout the 18th century and distributed across the period at a rate of two samples per decade. The texts belong to the Corpus of English Life Sciences Texts (CELiST), a part of the Coruña Corpus of Scientific Writing, an electronic collection of late Modern English scientific literature of diverse genres and disciplines written between 1700 and 1900. 
The samples analyzed in the selected sub‐corpus might be regarded as relevant for spotting the semantic and/or pragmatic scope of the given modal auxiliaries during a period in which English was already evolving as a language of science but, apparently, there was not yet a standard pattern for a 'scientific English'.

Nešpore, Gunta, Lauma Pretkalniņa, Baiba Saulīte and Kristīne Levāne‐Petrova Panel: 1. Diseño, compilación y tipos de córpora TOWARDS A LATVIAN TREEBANK Treebanks are among the crucial resources for the development of NLP tools. For Latvian no such resource currently exists; to address this deficiency, the development of a Latvian Treebank is ongoing. As the grammatical framework for the Latvian Treebank, the SemTi‐Kamols model [Nešpore et al., 2010, Bārzdiņš et al., 2007] is used. It is a hybrid of dependency and phrase‐structure grammar that covers both synthetic and analytical forms of Latvian — a highly synthetic language with relatively free word order. In essence, the SemTi‐Kamols grammar is close to Tesnière's dependency grammar [Tesnière, 1959]. The model is based on dependency links and the notion of x‐words, which roughly correspond to Tesnière's nuclei. X‐words were introduced as inseparable syntactic units describing analytical forms and relations other than subordination. From the phrase‐structure perspective, x‐words can be viewed as non‐terminal symbols, and as such substitute all entities forming the respective constituents. From the dependency perspective, x‐words are treated as regular words — they can act as head or dependent nodes in dependency relations. Manual annotation of a treebank is very laborious; therefore tool support is crucial. As the SemTi‐Kamols model is based on dependency grammar, we have chosen to adapt the annotation tool TrEd [Hajič et al., 2001] that is used in developing the Prague Dependency Treebank (PDT) [Hajič et al., 2000]. We have developed a Prague Markup Language (PML) profile for the SemTi‐Kamols model. PML is an XML‐based language for linguistic annotations developed together with TrEd, and acts as the default input/output format for TrEd. In developing the SemTi‐Kamols PML profile, the initial SemTi‐Kamols grammar model was modified by dividing the types of syntactic relations further. The scope of the x‐word was narrowed down to pure analytical forms (e.g., perfect tenses, complex predicates) and multi‐word units (e.g., multi‐word numerals). Coordination was distinguished as a separate relation: it represents both coordinated parts of a sentence and coordinated clauses. 
This brings the SemTi‐Kamols model even closer to Tesnière's approach, where coordination (jonction) is formed by two or more homogeneous nodes that have the same function in relation to the sentence. In Latvian, punctuation reflects the grammatical structure of the sentence; we therefore distinguished one more type of relation — punctuation mark constructs — the relation between a punctuation mark and the unit that evokes its use. Thus we arrive at four relation types: dependency, x‐word, coordination and punctuation mark construct. As a result, we have obtained a working environment for creating the Latvian Treebank manually, using the extended SemTi‐Kamols model and exploiting TrEd. As a proof of concept, we have annotated the first 100 sentences of J. Gaarder’s “Sophie’s World”, in line with the Parallel Treebank of North European Languages project [Sophie]. Our future plans involve integrating TrEd with the SemTi‐Kamols syntax analyzer [Bārzdiņš et al., 2007] to obtain an environment for a semi‐automated annotation process.
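To give a rough idea of what such annotations look like, the sketch below builds only on the fact that PML is XML‐based: it stores one dependency tree as a flat list of nodes and reads the head links back out. The element and attribute names are purely illustrative, as the actual SemTi‐Kamols PML profile defines its own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, much-simplified PML-style record for the Latvian
# sentence "Grāmata ir interesanta" ("The book is interesting").
# Element and attribute names are illustrative, not the real schema.
SAMPLE = """\
<tree>
  <node id="n1" form="grāmata" deprel="subj" head="n2"/>
  <node id="n2" form="ir" deprel="root" head=""/>
  <node id="n3" form="interesanta" deprel="pred" head="n2"/>
</tree>
"""

def dependency_pairs(xml_text):
    """Return (dependent id, head id) pairs from a flat node list."""
    root = ET.fromstring(xml_text)
    return [(n.get("id"), n.get("head"))
            for n in root.findall("node")
            if n.get("head")]  # the root node has no head
```

Storing nodes flat with explicit head references, rather than nesting them, is what lets one format serve both the dependency view and the x‐word (constituent‐like) view described above.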

Nijsen, Kasper Panel: 5. Corpus, estudios contrastivos y traducción “THIS PAPER ARGUES = DIT ARTIKEL BEWEERT?”: IS‐AV CONSTRUCTIONS IN ACADEMIC PROSE TRANSLATION Reporting on two corpus studies involving English and Dutch academic prose, this paper examines several issues from contrastive linguistics and translation studies. It focuses on constructions that previous studies have identified as ‘IS‐AV constructions’: Inanimate Subject – Active Verb (Master, 1991; 2006; Šeškauskienė, 2008; 2009). Typical examples are 'this paper argues' and 'this theory claims'. Such constructions appear to play a crucial role in English academic writing, but little is known about their use across languages (Low 1999). It is therefore worth investigating to what extent they are a distinguishing feature of English scientific language only, or may also be spilling over into the academic prose of other languages. Such contrastive knowledge is a prerequisite for an examination of the choices made by academic translators dealing with IS‐AV constructions, which may reflect (or affect) cross‐linguistic or cross‐generic differences, and also raises broader theoretical questions with respect to translation universals. To investigate these issues, taking the English‐Dutch situation as a case in point, this paper addresses two main questions: (1) how does the use of IS‐AV constructions in English academic prose compare with their use in the same genre in Dutch; and (2) what translation strategies are commonly used by English‐Dutch translators dealing with IS‐AV constructions in this genre? In order to frame the corpus studies, relevant literature from the field of contrastive linguistics is discussed, as well as previous studies focusing on the use, rhetorical function and conceptualization of IS‐AV constructions in an academic context. 
Additionally, I briefly sketch the cultural position of English in the Dutch academic world, drawing on recent reports as well as Even‐Zohar's (1990) polysystem theory. Finally, Toury's (1995) theory of two major translation universals or laws, interference and normalization/standardization, is adopted to analyse the translation strategies found, including their relation to the cultural position of English in Dutch academia. To address the first question, a comparative corpus of English and Dutch academic prose was compiled; for the second, a parallel corpus of English source texts and Dutch translations was used. Corpus analyses reveal that IS‐AV constructions are used in both languages, but their frequency in English is considerably higher. Their use in Dutch, it is argued, may be due to the influence of English as the lingua franca of the academic world, and similar developments may apply to academic writing in other non‐English languages. With respect to the translation question, the findings show that a number of strategies are possible. Despite the possibilities, however, most translators chose to retain the IS‐AV constructions in their Dutch target texts. This suggests that in this case the process of interference takes precedence over normalization, a finding that may be related to the cultural prestige of the English source language in this domain. To conclude, I discuss the broader implications of the results and suggest several promising avenues for future research.

Novo Urraca, Carmen Panel: 4. Lexicología y lexicografía basadas en córpora A TYPOLOGY OF MORPHOLOGICALLY UNRELATED ADJECTIVES IN OLD ENGLISH The aim of this presentation is to identify the basic and derived‐basic adjectives in Old English*. The former represent morphologically unrelated adjectives which do not constitute bases of derivation for other words; the latter, derived‐basic adjectives, are derived adjectives that do not have derivatives of their own. Since the formation of the adjective in Old English has drawn little attention in previous research, this study reports the results of an analysis of all the adjectives contained in the lexical database of Old English Nerthus (www.nerthusproject.com), which comprises around 30,000 lexical entries along with semantic and morphological information. This analysis requires a prior study of the derivational paradigms, through which all words holding morphological relationships of a derivational nature have been isolated. Out of the 5,790 adjectives included in Nerthus, 62 basic adjectives have been identified, as well as 43 derived‐basic adjectives. The conclusions of this study are twofold. On the quantitative dimension, basic and derived‐basic adjectives represent a negligible part of the Old English lexicon: around 1.8% of the adjectives and 0.35% of the whole lexicon. On the qualitative dimension, these adjectives often reflect a lack of linguistic evidence, given that nearly one half of them are morphologically complex. The situation, therefore, is one in which reconstruction is needed in order to account for the bases of derivation of these adjectives. This analysis thus contributes to an overall explanation of the Old English lexicon in two directions: firstly, by offering a picture of an area of the derivation of the adjective to which no previous studies have been devoted; and, secondly, by reinforcing the derivational and paradigmatic nature of the Old English lexicon.

Oncins‐Martínez, José Luis Panel: 2. Discurso, análisis literario y corpus A CORPUS‐BASED VIEW OF REPORTING FORMULAE IN DICKENS’ NOVELS As has often been pointed out, one of the distinguishing features of Dickens’ style is his masterly use of the techniques of characterization (see, e.g., Page 1973, Quirk 1959; 1961; 1979; Golding 1985). Much of this success –of paramount importance in character ‘individualisation’ (Quirk 1961: 20)– stems from his skilful use and exploitation of the wide variety of strategies for presenting the speech of the hundreds of characters that populate his fiction. Indeed, Dickens’ novels show not only one of the richest catalogues of reporting verbs in English fiction but also what is perhaps the most varied grammatical realization of the main reporting verb in fiction, said. Drawing on the classification of reporting verbs proposed by Caldas‐Coulthard (1994), and with the help of the ConcGram 1.0 and Wordsmith Tools 4 software, this paper presents the preliminary results of a survey of the structures that characterize Dickens’ use of reporting verbs. The data come from a corpus of Dickens’ novels (circa 4.5 million words). The survey is at this initial stage limited to verbs reporting direct speech and, for this presentation – and for reasons of time – concentrates on said, discussing the most typical grammatical realizations of this reporting verb, namely said + a manner adverb (‐ly), said + a prepositional phrase, and said + an ‐ing participle clause. In order to assess the idiosyncrasies of Dickens’ style, the results are finally compared with those found in a reference corpus of nineteenth‐century fiction (7 authors; c. 12.5 million words).
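As a toy illustration of the kind of pattern retrieval involved (the actual study uses ConcGram and WordSmith Tools, not this script), a regular expression for the said + manner adverb (‐ly) realization might look like this:

```python
import re

# Matches "said" immediately followed by an -ly word, e.g. "said slowly".
# A crude approximation: it also catches non-adverbs ending in -ly
# (e.g. "said Polly"), which a real study would filter out by hand
# or with a part-of-speech tagger.
SAID_LY = re.compile(r"\bsaid\s+(\w+ly)\b", re.IGNORECASE)

def said_plus_adverb(text):
    """Return the -ly words that directly follow 'said' in the text."""
    return SAID_LY.findall(text)
```

Running such a pattern over the 4.5‐million‐word Dickens corpus and the 12.5‐million‐word reference corpus, and comparing normalized frequencies, is the quantitative core of the comparison the abstract describes.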

Palmerini, Monica and Serenella Zanotti Panel: 5. Corpus, estudios contrastivos y traducción A CORPUS‐BASED STUDY ON THE USE OF NARRATIVE IN ENGLISH AND SPANISH YOUTH CONVERSATIONS Recent studies have pointed to the crucial importance of narrative in the evolution of human language (Simone 2009; Lazard 2006; Victorri 2002). Narrating, i.e. telling stories about the past or imagining ones still to come or that never existed, is a primordial and irrepressible need in human experience, which has presumably shaped grammar at a very deep level and which appears to be an exclusive and ubiquitous property of verbal languages. As a consequence of this primeval relation, languages display a wide array of tools aimed at implementing the narrative function. The study of narrative applies to many fields, ranging from literary theory, history and linguistics to anthropology, psychology, sociology, art, drama, film, theology, philosophy, education and even evolutionary biology. Linguists' attention to narrative has often focused on the analysis of the complex products of long‐standing literary or oral traditions. In particular, research on oral narrative has been carried out mainly on bodies of elicited personal/autobiographical narratives (cf. Labov & Waletzky 1967; Labov 1982, 1997; Gee 1991). In this study we argue, instead, for the interest of the simplest and most fundamental context in which narration surfaces, namely spontaneous informal conversation. We further characterize our object of analysis by combining two different perspectives: a sociolinguistic one, which concentrates on youth language, and a contrastive one, which compares the use of narrative in English and Spanish youthtalk. The overall approach is ultimately corpus‐based, in that the analysis is carried out on and through two comparable corpora of youth language, both constructed at the University of Bergen: the Corpus of London Teenage Language (COLT) and the Madrid subcorpus of the Corpus Oral de Lenguaje Adolescente (COLAm). Studies carried out over the last decade (cf. Bucholtz 2011, Stenström & Jørgensen 2009, Androutsopoulos & Georgakopoulou 2003) have demonstrated the interest of youth language as a site of innovation and paved the way for further research from a wide range of perspectives. 
Contrastive corpus‐based studies carried out on the Bergen corpora have investigated different aspects of youth language, with special reference to discourse markers (Stenström and Jørgensen 2009). And yet a model for the investigation of the forms and functions of narrative in youthspeak has still to be developed. In this contribution we intend to take a first step in this direction, presenting a corpus‐based investigation of how speakers from the same age group in two of the world's most widely spoken and influential languages use and construct narrative in conversation. After outlining the basic functional and structural properties of narrative in this language modality, we will move on to illustrate the contrastive analysis conducted on specific aspects of the body of data considered. We will examine, for instance, the dynamics between narration and non‐narration, the “narrated world” and the “commented world” (Weinrich 1964), from both a pragmatic and a grammatical point of view; the quotation strategies and other devices used by young speakers to mark the frontier between their own and others' voices; aspects of modalization, etc.

Papp, Kornélia Panel: 4. Lexicología y lexicografía basadas en córpora A CORPUS‐BASED STUDY OF THE PROPERTY CONCEPTS KIS/KICSI ‘SMALL’ IN HUNGARIAN The near synonymy of the two Hungarian adjectives kis and kicsi is examined using corpus techniques. Cognitive linguistics has witnessed a large growth in corpus‐driven approaches to language structure along with a long overdue interest in lexical semantics. Two trends have emerged in the cognitive literature on the subject. Firstly, the collostructural approach (Gries & Stefanowitsch 2003, Stefanowitsch & Gries 2005, Hilpert 2006) looks at lexical constructional associations in order to identify patterns of usage and thus the meaning of the construction and second, a multivariate technique (Gries 1999, Heylen 2005). This study considers both approaches and seeks to explain the difference in the usage of the adjectival alternation in Hungarian. The two property words in question, kis (e.g. kis ház ‘small house’) and kicsi (e.g. kicsi ház ‘small house’) are analysed within the noun phrase. The adjective kis is typically associated with attributive use, while kicsi has traditionally been identified as its predicative counterpart. There has been no corpus‐based investigation into the alternation of the above mentioned adjectives in attributive position. The presentation deals with the different adjectival senses of these primarily size‐related adjectives in combination with the corresponding noun senses. The study is based on the Hungarian National Corpus, where some 500 examples of each forms are annotated for semantic usage features. The semantic features consist of lexical semantic features of both the modifier and the noun. Collocational and correspondence analyses are then used to look for multivariate patterns in the usage, relative to semantic features. The results clarify the lexical constructional

interaction as well as outline a multidimensional map of the usage. This allows us to understand the lexico‐grammatical meaning that produces the apparent variation.
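The collexeme strength computed in collostructional analysis reduces to an exact test on a 2×2 lexeme‐by‐construction contingency table. The sketch below is a generic illustration with invented counts, not the actual figures from the Hungarian National Corpus annotation:

```python
from math import lgamma, exp

def log_choose(n, k):
    # log of the binomial coefficient C(n, k), via the log-gamma function
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def collexeme_attraction(a, b, c, d):
    """One-tailed Fisher-Yates p-value for the 2x2 table [[a, b], [c, d]],
    where a = tokens of the lexeme in the construction, b = the lexeme
    elsewhere, c = other lexemes in the construction, d = everything else.
    Smaller p means stronger attraction of the lexeme to the construction."""
    row1, col1, n = a + b, a + c, a + b + c + d
    p = 0.0
    # sum the hypergeometric probabilities of all tables at least as extreme
    for x in range(a, min(row1, col1) + 1):
        p += exp(log_choose(col1, x)
                 + log_choose(n - col1, row1 - x)
                 - log_choose(n, row1))
    return p
```

The one‐tailed direction (summing upwards from the observed count) measures attraction; summing downwards would measure repulsion.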

Pennock‐Speck, Barry Panel: 6. Corpus y variación lingüística VOICE‐OVERS IN BRITISH TELEVISION ADS: A CORPUS ANALYSIS OF A WRITTEN‐TO‐BE‐SPOKEN GENRE The analysis of a corpus of voice‐overs I will be presenting today is an integral part of a larger corpus of television ads compiled by the MATVA (Multimodal Analysis of TV Ads) group, which is made up of 636 day‐time television ads aired on ITV1 on the 24th and 25th of June 2009 from 8.00 a.m. to 6 p.m. The corpus as a whole consists of a detailed description of the ads; a transcription of all the voice‐overs, on‐screen text, testimonials and dialogues; and an in‐depth description of the para‐ (Poyatos 1993) and extra‐linguistic elements of each commercial. I chose ITV1 as it is the most popular of the British commercial TV channels, but the two days were chosen at random. Any day‐long corpus of TV ads contains many repeats of the same ad – up to 30 in the case of Sky TV – and taking them all into account is important in some types of analysis. For corpus analysis, however, one token of each ad was deemed appropriate, so repeats were eliminated, leaving 277 ads, 200 of which featured voice‐overs. Although the layperson’s term voice‐over is well known, my definition is more restrictive, as it only includes totally disembodied voices (Pennock‐Speck & del Saz‐Rubio 2009), thus excluding voices belonging to actors who appear at some point in the ad – these are included as testimonials and dialogues to be analyzed elsewhere. Unlike other qualitative analyses of voice‐overs I have carried out in the past, here I will eschew the para‐ and extralinguistic characteristics of the ads and concentrate on the actual verbal messages the voice‐overs are a vehicle for. One of the reasons for this is to discover, employing quantitative methods, the common lexical elements of British TV ad voice‐overs. Using WordSmith, I have discovered that there are grammatical and lexical elements that predominate in my corpus.
With regard to the grammatical elements, once items such as articles and prepositions have been excluded, ‘you’, ‘your’, ‘we’, ‘our’, ‘can’ and ‘just’ are the commonest. Subsequent qualitative analysis has shown that the frequency of the pronouns points to the presence of positive politeness strategies of inclusiveness. The discourse analysis of the word ‘just’ shows that its most frequent use is as a hedger, that is, a negative politeness strategy. The most frequent lexical items are ‘now’, ‘free’ and ‘new’. The import of this research, beyond the findings themselves, is heightened by the dearth of corpora featuring TV ad voice‐overs (Leech 1996; Costa et al. 2005). The corpus analysis I will describe in this paper is only the first part of an analysis which aims to compare our TV ad voice‐over corpus with both spoken and written discourse, as this written‐to‐be‐spoken genre partakes of both.
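The frequency procedure described above, ranking items once articles and prepositions are set aside, can be sketched as follows; the short stop list and the sample text are illustrative, not drawn from the MATVA corpus:

```python
from collections import Counter
import re

# Function words to exclude before ranking, as the abstract describes;
# this short stop list is purely illustrative, not the one actually used.
STOP = {"the", "a", "an", "of", "to", "in", "for", "on",
        "at", "with", "and", "is", "it"}

def top_items(text, n=5):
    """Tokenize, drop stop-list words, and return the n most frequent
    remaining items with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP)
    return counts.most_common(n)
```

Run over a transcription, this yields exactly the kind of ranked list (‘you’, ‘just’, ‘now’, ‘free’, …) discussed above.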

Gutiérrez, Camino, and Julia Alonso Panel: 7. Lingüística computacional basada en corpus THE TRACE CORPUS ALIGNER: DEVELOPING A NEW ELECTRONIC TOOL FOR LANGUAGE RESEARCHERS This presentation introduces a tool that builds a bridge between new technologies and the study of source texts and their translations. Many aligner applications are currently on the market, but they rarely satisfy researchers’ needs. Against this backdrop, our goal is to develop an application that is both useful and usable for researchers. The software brings functions such as tagging, alignment and the display of results together in a single interface. The application offers several options, which are based on the needs of the TRACE project (University of León). This project is devoted to the study of the translation and censorship of different text types (narrative, theatre, audiovisual, poetry) during Franco’s regime. The software already available offers alignment by paragraphs or sentences, which is not useful in the study of, for instance, theatre or audiovisual works, since these texts are structured into speeches and annotations. Our goal is to develop standardized software that solves these problems, thereby making this type of research possible. Another shortcoming in the linguistic field is the uncommon use of computing standards. This problem is quite relevant, so part of our presentation is devoted to explaining concepts such as XML, TEI and TMX, which are important standards used in our application. Thanks to these standards, the intermediate and final files generated by the application can be exported, making them portable and accessible to other tools we may need.
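As an illustration of why an exchange standard such as TMX matters for portability: an aligned segment pair can be serialized into a minimal TMX skeleton that other compliant tools can re‐import. The sketch below uses only Python’s standard library and is a bare‐bones illustration, not the TRACE implementation itself:

```python
import xml.etree.ElementTree as ET

def to_tmx(pairs, src_lang="en", tgt_lang="es"):
    """Serialize aligned segment pairs into a minimal TMX 1.4 skeleton.
    Element names (tmx/header/body/tu/tuv/seg) follow the TMX spec;
    the header attributes are trimmed to a bare minimum for illustration."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", srclang=src_lang,
                  datatype="plaintext", segtype="sentence")
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv")
            tuv.set("xml:lang", lang)
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

A theatre aligner would populate the pairs speech by speech rather than sentence by sentence, but the output format stays the same.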

Pakuła, Łukasz Piotr Panel: 2. Discurso, análisis literario y corpus ‘CIVIL PARTNERSHIP’ AND ‘GAY MARRIAGE’ IN CONTEXT The question of identity has enjoyed wide interest in various fields of contemporary social sciences (e.g. du Gay et al. 2000). Recently, a global shift can be noticed from scrutinising linguistic differences between members of diverse social groups (e.g. Labov 1966, Trudgill 1974, Lakoff 1975, Spender 1980) to examining more abstract socio‐linguistic means of expressing and describing any of the identities an individual assumes – i.e. discourses (e.g. Baker 2005, van Dijk 2005, Litosseliti 2006). A more recent strand of research in this field takes advantage of the blend of CDA (Critical Discourse Analysis) and corpus linguistics, as the latter “[…] can help reduce researcher bias” (Mautner 2009: 123; see also Baker 2006). However, little attention has been devoted to the discursive representation of the relationships that members of a socially stigmatised group enter. One in‐depth study in this area is Bachmann (2011), who examined discourses surrounding the concept of ‘civil partnership’ as represented in the

British parliamentary debates at the time when it was undergoing legislation, i.e. 2004. Yet, because public opinion is informed mainly by the media, investigating newspapers, one of the most profoundly opinion‐shaping media, seemed of particular relevance. This study aims to partially fill this gap by examining different ways of talking about the process of legislation of civil partnerships, how civil partnerships work in practice in the UK, and the struggle for the legal recognition of the institution of gay marriage, as represented in the most popular British newspapers published between 2000 and 2010. To this end, a corpus of c. 6 million words has been compiled; the British National Corpus served as the reference corpus for deriving keywords in the newspaper corpus. In contrast to the methodology employed in Baker (2010), no classificatory attempt has been made with respect to the traditional broadsheet/tabloid division; categories of newspapers employing similar discourses on the subject matter emerged as a result of the analysis. The quantitative analysis was performed using WordSmith 5, followed by a qualitative analysis aimed at a better understanding of the keywords and their collocations. Phenomena including nominalisation, metaphor and metonymy were taken into account as well. Moreover, a contrastive analysis of the contextualised key phrases – civil partnership and gay marriage – is presented.
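Keyword derivation of the kind WordSmith performs compares each word’s relative frequency in the study corpus against the reference corpus, typically with Dunning’s log‐likelihood statistic. A minimal sketch with invented counts (not figures from the newspaper corpus or the BNC):

```python
from math import log

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Dunning's log-likelihood (G2) keyness score for one word:
    compares its frequency in the study corpus with the reference
    corpus, given the two corpus sizes. Higher score = stronger keyness."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    g2 = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:
            g2 += observed * log(observed / expected)
    return 2 * g2
```

A word used proportionally alike in both corpora scores near zero; one over‐represented in the study corpus scores high and surfaces as a keyword.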

Potemkin, Serge Panel: 4. Lexicología y lexicografía basadas en córpora SENTIMENT EXTRACTION FROM THE BILINGUAL CORPUS In recent years, sentiment analysis has attracted considerable attention. It is the task of mining positive and negative opinions from natural language, and it can be applied to many natural language processing tasks, such as document summarization and question answering. Sentiment analysis at both document and sentence level relies heavily on the word level. The hypothesis is that, given the semantic orientation (SO) of the relevant words in a text, we can determine the SO of the entire text. This paper explores methods for generating subjectivity analysis resources in a new language by leveraging the tools and resources available for English. We focus our experiments on Russian, selected as a representative of the large number of languages for which only limited text processing resources have been developed to date. Note that, although we work with Russian, the methods described are applicable to any other language, as in these experiments we (purposely) do not use any language‐specific knowledge of the target language. Certain semantic orientation lexicons have been compiled manually for English, the most notable being the General Inquirer (GI) (Stone et al. 1966). However, the GI lexicon has orientation labels for only about 3,600 entries. The Pittsburgh subjectivity lexicon (PSL) (Wilson et al. 2005), which draws from the General Inquirer and other sources, also has semantic orientation labels, but only for about 8,000 words. The latter lexicon was used as the seed sentiment lexicon for further processing. The translation of sentiment information has been the topic of multiple publications. Some methods simply use bilingual dictionaries to translate an English sentiment lexicon; other methods are based on parallel corpora.
The source language side of the corpus is annotated with sentiment information, and the information is then projected onto the target language, or vice versa. Problems arise due to mistranslations. Machine translation has also been used for multilingual sentiment analysis: given a corpus annotated with sentiment information in one language, machine translation produces an annotated corpus in the target language, preserving the annotations. The original annotations can be produced either manually or automatically. We use a collection of Internet blogs about new books in Russian. Each opinion in the blogs is manually annotated (Zagibalov 2010). This collection was translated into English using the Google MT engine. The bilingual‐space technique was then applied to derive a mapping from the Russian source sentence (SS) to the English target sentence (TS). The most probable mapping defines the true matching of word pairs and multi‐word fragments (Potemkin 2010). The Russian words that correspond to the seed semantically oriented English words are

included in the Russian seed sentiment lexicon. Afterwards, this lexicon was compared to a hand‐crafted list of Russian semantically oriented words. The advantage of this approach over the direct dictionary translation of the English seed lexicon into Russian lies in the disambiguation of multiple translation equivalents.
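The projection step, carrying the English seed lexicon over to Russian through word alignments, can be sketched as follows; the alignment pairs and transliterated Russian words below are invented for illustration, not taken from the blog collection:

```python
def project_lexicon(seed, aligned_pairs):
    """Project a source-language seed sentiment lexicon onto the target
    language via word-aligned pairs: every target word aligned to a seed
    word inherits its polarity. Target words that receive conflicting
    polarities from different alignments are dropped as ambiguous."""
    projected = {}
    conflicts = set()
    for en_word, ru_word in aligned_pairs:
        if en_word in seed:
            polarity = seed[en_word]
            if ru_word in projected and projected[ru_word] != polarity:
                conflicts.add(ru_word)  # conflicting evidence: discard later
            projected[ru_word] = polarity
    return {w: p for w, p in projected.items() if w not in conflicts}
```

Discarding conflicting targets is one simple way to realise the disambiguation advantage mentioned above; frequency‐weighted voting would be a natural refinement.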

enumeration and ordering, exemplification and restatement, concession and contrast, cause and result, summation, stance, and topic shift. Findings revealed that both groups of students shared similar characteristics with regard to the types of DCs employed in their essays, though with different degrees of occurrence. Despite the wide range of DCs available, the Thai learners, like the native speakers, employed a rather small cluster of DCs in their argumentative writing. And, but, because, for example and also were the items most frequently found in the compositions of the two groups. In terms of syntactic distribution, the Thai learners tended to employ the top five DCs inter‐clausally as coordinators, followed respectively by conjunctive adverbials and subordinators, while the native speakers used them mostly as conjunctive adverbials in sentence‐initial, medial and final positions, followed by coordinators and subordinators. Although both groups used these DCs in similar functions, preliminary findings suggest that the learners are more familiar with the inter‐clausal than with the intra‐clausal use of DCs, associating them with clause‐linking rather than intra‐clausal devices, and that the learners apparently had difficulties with such DCs as but, part of which can be attributed to the influence of their native language.

Quintana Toledo, Elena and Margarita Esther Sánchez Cuervo Panel: 4. Lexicología y lexicografía basadas en córpora AN APPROACH TO TYPES OF MODALITY IN THE INTRODUCTION AND THE CONCLUSION SECTIONS OF COMPUTING RESEARCH ARTICLES The scientific research article comprises several parts, each with a different purpose. As an independent genre, it is currently assessed from several perspectives that take in lexical, grammatical and rhetorical features, among others. In this study, we seek to identify the most frequent modal auxiliary verbs encountered in the introduction and the conclusion of the scientific research article. In the introduction, the context of research and the subject matter are described; it can present a summary or overview of the author’s position. In the conclusion, the research contribution to the field of study is usually revealed; this shows the logical outcome as devised in the introduction. Modality can be defined as the expression of the interpersonal function of language. It concerns the way in which authors project their attitude into their texts (Hyland 2000). It also refers to how we orientate, shape and measure utterances in discourse. Furthermore, modality is related to that part of language that allows us to connect our expressions of belief, attitude and obligation with what we say and write. It includes markers of the varying degrees of certainty that we have about the propositions we transmit, and of the types of commitment or obligation that can be attached to our utterances (Simpson 2004: 123). For our study of the prevalence of several types of modality, we will consider those related to the speaker/writer’s expression of volition with “will”, and his/her ability to carry out the event designated with “can”. We will also consider the speaker/writer’s assessment of the communicated proposition with instances of epistemic modality, which encodes diverse degrees of certainty as regards its validity.
For example, we encounter high certainty or necessity (“must”, “cannot”), medium certainty or probability (“will”, “would”, “should”), and low certainty or possibility (“may”, “could”) (Arrese, forthcoming). Some preliminary conclusions indicate a dissimilar use of modal auxiliary verbs in the initial and final segments of the scientific research article. In the introductory section, authors manifest their impending decisions by utilising expressions of volition and intention with the modal auxiliary “will”. They are also concerned with their ability to perform the intended investigation with instances of “can”. In the concluding section, however, the epistemic predominance of modal verbs suggests medium and low certainty. For example, in the utterance “(…) Slices above will be unaffected, and slices below in objective will be unaffected if dominating in the other objectives”, the writers predict a positive result based on their preceding research about algorithms. This paper is part of the research project “Evidentiality in a multidisciplinary corpus of research papers in English” at the University of Las Palmas de Gran Canaria. The corpus for this study includes up to twenty computing articles covering a time span that goes from 2004 to 2008. The criteria for the selection include the impact index, year of publication, and sociological aspects. The methodology is both quantitative and qualitative.

Ramírez Polo, Laura Panel: 1. Diseño, compilación y tipos de córpora MATVA: A DATABASE OF ENGLISH TELEVISION COMMERCIALS FOR THE STUDY OF PRAGMATIC‐COGNITIVE EFFECTS OF PARALINGUISTIC AND EXTRALINGUISTIC ELEMENTS ON THE AUDIENCE OF ENGLISH TV ADS The structure of television commercials, the soundtrack, voice‐overs, actors' accents, etc. are not the result of random decisions. Rather, they are chosen with a purpose in mind: that of keeping the product in the public eye or persuading the audience to buy it. Attesting to the complexity of this type of text, the MATVA research group (Multimedia Analysis of TV Ads) undertook the creation of a database of commercials from the UK, aiming to construct a valuable resource for the study and analysis of the paralinguistic and extralinguistic elements of TV ads, as well as their pragmatic‐cognitive effects on the audience. This paper addresses the difficulties encountered and the decisions made in the design and construction of the database. In the first place, we address some of the theoretical questions we have faced in the conceptual design. Our objective was to create a database as a special speech corpus with an analysable textual component made up of the transcriptions of the ads. To begin with, we tackle the criteria established by the EAGLES Spoken Language Working Group for the acquisition of data. Further, we discuss the criteria defined by Sinclair (1996) within the EAGLES initiative for creating corpora: quantity, that is, the size of the corpus; quality, or the authenticity of the corpus; simplicity, or the format in which text is stored; and documentation, or the metadata that must accompany the text corpus. We also consider the main aspects of corpus design dealt with by Torruella & Llisterri (1999): its goal(s), limits and the type of corpus. Finally, we address the notion of representativeness with respect to our collection of ads.
We then introduce some practical issues regarding the metadata that accompanies each advertisement: the classification schema used to organize the commercials as well as the different variables that constitute the database: product types, ad duration, song lyrics etc. We also explain the markup system developed in order to annotate the transcriptions of the commercials, which was conceived with the goal of subsequent para‐ and extralinguistic analysis. Finally, we mention some technical factors such as the platform used to store the data as well as the structure of the data. We end with some conclusions about the extendibility of the corpus and its practical applications.

Ramon, Noelia Panel: 5. Corpus, estudios contrastivos y traducción ‘WELL’ IN SPANISH TRANSLATIONS: EVIDENCE FROM THE P‐ACTRES PARALLEL CORPUS A particle such as the English form well is multifunctional. This English adverb can carry meanings related to manner, but also to degree or intensification. In addition, well is often grammaticalized into a discourse particle, especially in dialogue, and this requires particularly careful treatment in translation, as discourse particles do not carry easily definable meanings. Previous studies on the English particle well (Aijmer & Simon‐Vandenbergen 2003, Johansson 2006) have shown that the translation of this item into other languages is far from straightforward, as there are many different correspondences and a high degree of omission. The translations of the English form well have been studied for Norwegian, Swedish, Dutch, German and Italian, and this paper aims to expand the analysis by considering translations into Spanish. The study will focus on the translations of well as it appears in an English‐Spanish parallel corpus, which provides the empirical material for the analysis. The ACTRES project (Análisis Contrastivo y Traducción English‐Spanish) is a long‐running research endeavour currently in progress at the University of León, Spain, studying English and Spanish from a contrastive perspective and with translation‐oriented applications in mind. Within this larger framework the P‐ACTRES (Parallel‐ACTRES) corpus was built. This corpus contains about 2.5 million words of contemporary English texts and their corresponding translations into European Spanish. Various registers are represented (fiction, non‐fiction, press, miscellanea) and all English texts have been published in the year 2000 or later, thus representing the current state of the language.
The translations have all been published in Spain by a wide variety of different translators, thus also representing current trends in translational norms in this particular target language. The corpus‐based methodology

employed will consist of a careful analysis of the cases of well in the English section of the corpus, followed by a detailed study of the various translational options identified for each function or meaning. The aim of the study is to provide an inventory of translation solutions available in Spanish for the various functions of well in English original texts, in particular with regard to the use of well as a discourse marker. The trends observed in the options taken most frequently will provide useful information in the field of translator training as well as in translation practice.

Ricart‐Vayá, Alicia and María Alcantud‐Díaz Panel: 9. Usos específicos de la Lingüística de Corpus USING COMPUTER‐BASED CORPORA TO CREATE LEARNING MATERIALS FOR TOURISM (ESP) The present article adopts a computerized frequency‐driven approach to the analysis of frequently used prepositions. Our purpose is to identify the errors made by first‐year students of Tourism when writing essays. The students were required to watch the film “The Terminal” by Spielberg (2004) as a compulsory task which supplemented two of the units in their English language student’s book (Walker and Harding 2006): “Airport departures” and “The airline industry”. They were then asked to write a film review of about 200 words. We decided to investigate the errors made when using the prepositions at, in and on. Our corpus was composed of 50 students’ essays, which we analyzed using WordSmith Tools 5 (Scott 2010) both quantitatively and qualitatively. That is, we retrieved the prepositions, analyzing their frequencies and concordances in order to look for non‐native combinations. Our final aim was to create a series of exercises using the occurrences of the prepositions as the main basis of our gap‐filling exercises. In this way, our students were provided with exercises based on their own errors in using prepositions in writing. These exercises, created with the eXeLearning programme for the design of learning objects, will be uploaded in the form of online activities for future students. As a general conclusion, we believe that corpus analysis could be an effective tool to

create tailor‐made activities and teaching materials. This research has been carried out within the framework of the Tur‐i‐Tic research team (Anglotic) at the University of Valencia.

Richa and Shahid Mushtaq Bhat Panel: 7. Lingüística computacional basada en corpus CASE SYNCRETISM IN URDU‐HINDI: A CHALLENGE FOR NLP This paper brings into focus the key issue of case syncretism, one of the challenges for the annotation of corpora in Indian languages, both manual and automatic, in terms of the cognitive load on the annotator and of computational complexity, respectively. Based on the annotation of an Urdu‐Hindi corpus of 20K+ words, the paper explores case syncretism from the perspective of corpus annotation, illustrating bottlenecks in the annotation process. It provides an optimal solution for manual tagging by offering linguistic rules specific to Urdu‐Hindi, and it presents various disambiguating mini‐algorithms for automatic tagging. Finally, the analysis shows that the residual issues can be handled at the level of argument structure as well as semantics. The paper thus supports the view that it is essential to annotate argument structure and semantic information for effective encoding of linguistic information and efficient POS tagging.

Roca‐Varela, Mª Luisa Panel: 8. Los córpora y la adquisición y enseñanza del lenguaje CORPORA AS TOOLS AND RESOURCES FOR THE TEACHING OF ENGLISH VOCABULARY Corpora are language databases which contain samples of real language use. These computerized databases are being increasingly used in both theoretical and applied linguistics with satisfactory results. This is true for both native and learner corpora (Granger 1994; Palacios 2005). In spite of this, the explicit application of corpora is relatively new within the field of applied linguistics, and the exploitation of corpora in the area of language teaching is not widespread. However, it has been shown that corpus‐based language learning has positive effects and promotes learners’ autonomy through data‐driven learning (Johns 1991; Leech 1997). In this paper, I will first analyse how teachers can take advantage of these large language databases in EFL settings (Oghigian & Chujo 2010) and how corpora can be useful resources for three basic areas of foreign language teaching: syllabus design, classroom materials and activities (Krieger 2003). The second part of this paper will focus on the use of corpora for vocabulary teaching and learning. I will draw on the information provided by native corpora (such as the BNC or COCA) regarding the meaning and use of a particular lemma (collocations, colligations, semantic prosody). I will then show how useful it may be to compare and contrast native and learner corpus data in order to draw conclusions on what learners “know” about an L2 item and what they really “need to know”. The ultimate goal of this study is to demonstrate the pedagogical usefulness of corpora for the teaching of vocabulary.

Rodríguez Arrizabalaga, Beatriz Panel: 3. Estudios gramaticales basados en córpora ON THE PRODUCTIVITY OF ENGLISH COGNATE OBJECTS. A CORPUS‐BASED ANALYSIS English cognate objects of the type a gruesome death in He died a gruesome death and an enigmatic smile in She smiled an enigmatic smile, for instance, have always been a matter of debate in linguistics

due to their controversial syntactico‐semantic status (cf. Sweet 1891; Quirk et al. 1985; Rice 1987; Jones 1988; Massam 1990; Mittwoch 1997; Macfarland 1999; Pereltsvaig 1999; Felser and Wanner 2001; Kuno and Takami 2004 and Höche 2005, among others). As a consequence, the research carried out on this particular clause constituent has mainly focused on its syntactico‐semantic behaviour, seeking answers to the following problematic issues: (a) the very definition of the term ‘cognate object’; (b) the syntactic function of cognate objects as either verbal arguments or adjuncts; (c) the syntactic verbal classes that are compatible with them; (d) the obligatory/optional patterns of modification they take; (e) the restrictions, if any, on the determiners that introduce them into discourse; and (f) the comparison, due to their semantic closeness, between cognate object structures and intransitive patterns with adverbial modification like He smiled in an enigmatic way, on the one hand, and light verb constructions of the type He had a gruesome death, on the other. In this debate, however, the pragmatic dimension underlying English cognate objects has gone almost unnoticed and, as a consequence, questions concerning their frequency, productivity, textual distribution and use still remain unanswered. For this reason, and with the intention of shedding some light on the real productivity of English cognate objects, I have carried out a thorough and exhaustive analysis in the British National Corpus of the four verbal classes that, according to Levin (1993), seem to be potentially compatible with cognate objects: namely, verbs of nonverbal expression, verbs of manner of speaking, waltz verbs and a fourth heterogeneous class that includes the verbs dream, fight, live, sing, sleep and think.
The main objective of this talk is to present the results of the aforementioned corpus‐based study in order to show, in agreement with Mittwoch (1997), that English cognate objects are “heavily restricted”, as well as to account for the main reasons underlying their scarce productivity in the English language.

Rodríguez Arrizabalaga, Beatriz Panel: 5. Corpus, estudios contrastivos y traducción THE BIRTH OF A NEW RESULTATIVE CONSTRUCTION IN SPANISH Whereas the English resultative construction of the type Peter hammered the metal flat has always been a common subject matter in English linguistics due to its high level of occurrence in the English language (cf. Simpson 1983; Yamada 1987; Carrier and Randall 1992; Levin and Rappaport 1995; Goldberg 1995; Wechsler 1997 and Boas 2003, among other scholars), its Spanish counterpart, illustrated, for instance, in examples such as Coció un huevo duro and Cernió la arena fina, has not received enough linguistic attention because, contrary to what happens in English, its productivity has proved to be very restricted, appearing only in very specific contexts (cf. Bosque 1990; Demonte 1991; Mallén 1991; Demonte and Masullo 1999 and Rodríguez Arrizabalaga 2002). For some scholars, such a construction is even said to be completely non‐existent in Spanish (cf. McNulty 1988; Aske 1989 and Sanz 2000). While this productivity imbalance is real, in this talk I will present the results of an exhaustive analysis of the prepositional phrase hasta la muerte in the Corpus de Referencia del Español Actual (CREA), which reveals that, besides its intensifying and emphatic function, illustrated, for instance, in Hay que animar al equipo hasta la muerte (CREA: 28), in the last few years this phrase has developed a resultative attributive function, as can be seen, for example, in Las mujeres fueron torturadas hasta la muerte (CREA: 1) and Lo apedrearon hasta la muerte (CREA: 97), which has to be considered completely equivalent to that of the English resultative attributes dead and to death in sentences of the type He shot the president dead or He shot the president to death.
The extremely frequent appearance of the resultative attribute hasta la muerte in the media nowadays, a natural consequence of the enormous mass‐media impact of the negative social circumstances and conditions surrounding human beings (i.e., terrorism, wars, gender violence, etc.), together with its proven presence in the CREA (Corpus de Referencia del Español Actual), provides two clear grounds for stating that this specific resultative construction, considered ungrammatical in Spanish some time ago for being a literal calque of the English resultative construction with dead or to death as attribute (i.e., He shot the president dead or He shot the president to death), is, if it has not entered the language already, making its way directly into Spanish.

Rodriguez‐Puente, Paula Panel: 1. Diseño, compilación y tipos de córpora INTRODUCING THE CORPUS OF HISTORICAL ENGLISH LAW REPORTS: STRUCTURE AND COMPILATION TECHNIQUES Since May 2009 the research group Variation, Linguistic Change and Grammaticalization at the University of Santiago de Compostela has been working on the compilation of British English legal texts as a contribution to version 3.2 of the larger multi‐genre corpus ARCHER (A Representative Corpus of Historical English Registers). Taking as a point of departure the techniques employed for the selection and edition of texts for ARCHER, we have started the compilation of our own corpus of legal texts: the Corpus of Historical English Law Reports (CHELAR). This paper presents the main structure and characteristics of the corpus, as well as the methodology used for its compilation. The new corpus will contain approximately half a million words and cover the years from about 1500 to 2000. The texts included are British English law reports: records of judicial decisions that are “cited by lawyers and judges for their use as precedent in subsequent cases” (EBO s.v. law report). The currently available corpora of legal English are mostly concerned with contemporary legal language (cf., e.g., the Cambridge Corpus of Legal English). Corpora of historical legal English include texts from Parliamentary acts, Royal orders and Privy Council orders (cf. Anu Lehto’s web page at http://www.helsinki.fi/varieng/people/varieng_lehto.html) and trial proceedings (cf. the Proceedings of the Old Bailey). Alternatively, the linguist interested in legal English from a diachronic perspective can resort to the legal texts included as part of larger diachronic corpora, such as the Helsinki Corpus (850‐1710), the Lampeter Corpus (1640‐1740) or the ARCHER Corpus (1650‐1999). However, to the best of our knowledge, a computerized corpus of historical law reports has not yet been compiled.
The Corpus of Historical English Law Reports will, therefore, constitute a new, useful resource for linguists with an interest in legal language, from both a synchronic and a diachronic perspective.

The CLAN‐Project intends to describe the cognitive representation of landscapes in speakers who live in different intercultural contexts, and their subsequent emotional responses via spoken language. The hypothesis of the study is that students of English as a second or foreign language or speakers of English as a Lingua Franca may not react in the same way to the perception of natural landscapes, as their responses might depend upon the emotional implications that a particular natural scenario may trigger based on their cultural backgrounds, as well as upon their command of the second or foreign language. For this purpose, the team has selected a series of descriptive variables to study the language produced by learners of English as a second or foreign language, and speakers of English as a Lingua Franca. The objective of the project is to sketch out an atlas of linguistic features that represents the different emotions manifested by landscape preferences on the basis of cultural self‐identifications. As in previous studies carried out by our team, the corpus data will be used for linguistic analysis at different levels (phonological, lexical, syntactic, pragmatic, etc.) (cf. Romero‐Trillo, 2008). Following an evidence‐based and pragmatic approach, our project aims at the description of the cultural norms, values and social practices affecting the linguistic representation of landscape perception in a metalanguage that has equivalents in different languages (the Natural Semantic Metalanguage, Goddard and Wierzbicka, 2002) and can also be inscribed in the tradition of intercultural pragmatics and ethnopragmatics (Goddard, 2006). These approaches to the description of intercultural linguistic phenomena try to understand speech practices in their context, with special attention to culturally loaded words. It is important to mention that our approach is evidence‐based and relies on a pragmatic treatment of (learner) corpora.

Rossini Favretti, Rema, Fabio Tamburini and Andrea Zaninello Panel: 9. Usos específicos de la Lingüística de Corpus EXPLOITING CORPUS EVIDENCE FOR AUTOMATIC SENSE INDUCTION In this paper we intend to explore how statistical analysis and corpus evidence can contribute to sense disambiguation in non‐annotated text. We focus on collocations as a source of surface evidence automatically extracted from corpora through positional and association‐based procedures following probabilistic criteria. Our basic assumption is that the most characteristic collocates of a (potentially polysemic) word are a good indicator of its meanings and that co‐occurrence frequencies can be used to discriminate between different senses, in line with the Firthian tradition and the classical Harrissian distributional hypothesis. Our paper is organized as follows: firstly, we present a brief description of CORIS, the 120‐million‐word reference corpus of written Italian used in our study, composed of common, authentic texts chosen by virtue of their representativeness of modern Italian (cf. Rossini Favretti, Tamburini & De Santis 2002). Secondly, we describe the analysis tools exploited in our research. Thirdly, we present some case studies focusing on highly polysemic words in Italian. Collocation sets for the node are created through an automatic, iterated process of collocation analysis based on association measures and recursively applied to the collocates. The results are represented as co‐occurrence graphs (cf. Heyer et al. 2001). This representation, formerly exploited to modulate register variation (cf. Rossini & Tamburini 2009), allows one to single out clusters of collocates connected at different strengths, and thus to define different meaning areas, providing a visualisation of polysemy through a representation of the collocates’ distribution in a vectorial semantic space.
By way of example, we analyse the collocates of the node “calcio” (meanings football, calcium, kick…), which are organised around two main axes, corresponding to the two main senses of the word:

1) Chemistry (meaning: ‘calcium’)
Pattern 1.a: NOUN+PRE+NOUN* (e.g. carbonato di calcio) – asymmetric relation: node modifies collocate
Pattern 1.b: NOUN+COORD+NOUN* (e.g. calcio, potassio e magnesio…) – symmetric relation: node and collocate are co‐hyponyms

2) Sport (meaning: ‘football’)
Pattern 2.a: Cranberry (e.g. ‘Quelli che il calcio’) – arbitrary node‐collocate relation
Pattern 2.b: NOUN+PRE+NOUN* (e.g. squadra di calcio, campo da calcio) – asymmetric relation: node modifies collocate
Pattern 2.c: NOUN+COORD+NOUN* (e.g. calcio e basket) – symmetric relation: node and collocate are co‐hyponyms
Pattern 2.d: NOUN+ADJ (e.g. calcio italiano) – asymmetric relation: collocate modifies node

We conclude that the main senses of the node can be identified fairly accurately by the clustering procedure. However, the kinds of relationship between the collocate and the node (co‐hyponymy, kind‐of relation, etc.) are consistent with, and can only be identified by, an analysis of the linguistic structures they feature in, making an integration of the two procedures desirable. As a suggestion for future work, we believe this procedure may be applied to multiword units to measure their level of opaqueness by comparing the collocates of the head with the collocates of the MWU taken as a whole. Moreover, in order to study the evolution of a word’s senses across time, this procedure may be extended into a historical dimension by applying it to diachronic corpora such as DiaCORIS, a representative and balanced collection of Italian written language ranging from the National Unification (1861) to the end of the Second World War (1945) (cf. Onelli et al. 2006).
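To make the association‐measure step concrete, the sketch below computes pointwise mutual information for the collocates of a node word within a fixed window. PMI is one common association measure; the abstract does not specify which measures the CORIS tools use, so this is a toy illustration of the general technique, not the authors' pipeline.

```python
import math
from collections import Counter

def pmi(tokens, node, window=4):
    """PMI of each collocate with `node`, counted in a symmetric
    window around every occurrence of the node in a token list."""
    total = len(tokens)
    freq = Counter(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(total, i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    co[tokens[j]] += 1
    # PMI = log2( p(node, w) / (p(node) * p(w)) )
    return {w: math.log2((f / total) / ((freq[node] / total) * (freq[w] / total)))
            for w, f in co.items()}
```

Ranking the resulting scores and clustering the top collocates is then the input to the co‐occurrence‐graph representation the abstract describes.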

Ruano‐Garcia, Javier Panel: 6. Corpus y variación lingüística THE WORLD HAS GOT SOME HINT OF HER COUNTRY SPEECH: ON THE ENREGISTERMENT OF THE ‘NORTHERN DIALECT’ Recent research in sociolinguistics and dialectology has introduced the concept of enregisterment to refer to the process whereby certain linguistic features become associated with a particular place and specific sociocultural values (see Agha 2005, 2007; Beal 2009a; Johnstone, Andrus and Danielson 2006; Remlinger 2009; among others). Agha (2003: 231) defines it as “the processes through which a linguistic repertoire becomes differentiable within a language as a socially recognized register of forms”. Some studies exemplifying it have shown that enregisterment occurs through a series of discursive practices. For example, Beal (2010: 94‐95) asserts that “speakers/writers may take part in the process of enregisterment via such practices as dialect writing, the compilation of dialect dictionaries and, more recently, websites dealing with issues of dialect and local identity” (see further Beal 2009b). This paper places literary renditions of northern English into the context of enregisterment. It investigates the repertoire of forms which have commonly been identified as northern and have, thus, contributed to the enregisterment of the ‘northern dialect’. For this purpose, I shall undertake a corpus‐based analysis of literary texts included in the Salamanca Corpus, laying emphasis on early modern material. My aim is threefold. Firstly, to identify the most recurrent traits of these representations, and the sociocultural values they index. Secondly, to ascertain if the set of forms depicted in the early modern literary discourse was maintained across time by surveying selected corpus material from the late modern period. Thirdly, to show that dialect writing, though much neglected in linguistic research, gives insights into language variation and attitudes.
In fact, these texts are inextricable from the historico‐linguistic context in which they were produced, and from the attitude(s) towards the ‘other’ English which they reproduce (see Dawson and Larrivée 2010, for example).

Sánchez, Aquilino, Pascual Cantos and Raquel Criado‐Sánchez Panel: 8. Los córpora y la adquisición y enseñanza del lenguaje CORPORA‐BASED FREQUENCY LISTS, READABILITY INDEX AND ELT TEXTBOOKS Vocabulary frequency lists for the elaboration of FLT/L (Foreign Language Teaching/Learning) materials were already used in the first half of the 20th century (Thorndike, 1924, 1944; García Hoz, 1953; West, 1953). No doubt, vocabulary lists based on modern corpora are more reliable, valid and, above all, ‘real’ (Kucera & Francis, 1967; Sinclair, 1987; Leech et al., 2001). Corpus‐based FLT/L materials have become a must (Johns, 1994; Sinclair, 1996). The underlying rationale is that the most frequently used words in a language are also likely to be the most useful for communicative purposes. Hence, the learning of those very common words becomes one of the most important priorities in language teaching and learning, since effectiveness in communication, vocabulary frequency and communicative potential are intrinsically interwoven. The emphasis on vocabulary control and grammar has declined significantly in the Communicative Approach, and the focus is placed instead on the communicative functions of language. In spite of this bias towards content and meaning, the popularity of corpora in linguistic research has maintained the interest in frequency lists and their importance in language teaching. In addition, many studies refer explicitly to the first 1,000, 2,000, 3,000, etc., words and the role they play in establishing effective communication (Nation, 1990; Diller, 1978; Gildea, 1987; Laufer and Nation, 1995; Waring, 1997; Zechmeister et al., 1995). The popular appeal of ambiguous and biased slogans such as ‘Learn the first 1,000 words of English’ also contributes to increasing the importance of learning ‘the most frequent words of the language’. This paper addresses the issue of whether the
teaching materials have adapted, and to what extent, to the underlying beliefs and convictions regarding vocabulary teaching in connection with frequency lists. We shall investigate here whether claims and expectations about frequency lists and their role in teaching are really reflected in textbooks. Our research is based on the vocabulary analysis of a widely used textbook in the context of Spanish Secondary Education: English in Mind – Student’s Book 2 (Puchta & Stranks 2005). For this investigation, we compiled an ad hoc corpus with the whole textbook content. Our aim is (i) to quantify and typify new vocabulary and the new vocabulary rate per unit in the textbook, (ii) to correlate the textbook vocabulary and frequency list with the BNC‐based frequency list and ranges (Nation, 2001), and finally (iii) to determine the readability index (text reading difficulty) of the texts as found in this course book. To achieve these goals, we first extracted the tokens and types present in the textbook, then we systematized the data obtained and identified the new words per unit; we later contrasted all the vocabulary items in the textbook against the BNC frequency list, in order to discover whether both lists matched and to what extent. Finally, we calculated the readability index (applying the Automated Readability Index, ARI, based on word and sentence length) and related it to the size of the new vocabulary in the textbook. The results of our analysis provide (i) a reliable picture of how a contemporary and widely used textbook adapts to the claims regarding the relevance and function of frequency lists in FLT/L materials, and (ii) a reliable readability index (ARI) to determine the difficulty of the materials built with the vocabulary analyzed.
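The Automated Readability Index mentioned above is computed from character, word and sentence counts alone. The sketch below shows the standard ARI formula; the naive tokenization is an illustration only and is not the tool actually used in the study.

```python
import re

def automated_readability_index(text):
    """ARI = 4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43,
    with a deliberately simple sentence and word tokenizer."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    chars = sum(len(w) for w in words)  # letters/digits only, no spaces
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / len(sentences)) - 21.43
```

Because the formula rises with average word length and average sentence length, short simple sentences score low (easy) and long‐worded prose scores high, which is what relating ARI to new‐vocabulary load per unit exploits.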

Sánchez‐García, Pilar Panel: 6. Corpus y variación lingüística THE WESTMORELAND DIALECT IN THREE DIALOGUES (1790): THE CONTRIBUTION OF ANN WHEELER’S DIALOGUES TO JOSEPH WRIGHT’S THE ENGLISH DIALECT DICTIONARY. Joseph Wright’s monumental English Dialect Dictionary (1898‐1905), the most comprehensive dialect work hitherto compiled, is much indebted to thousands of works, both literary and non‐literary, as he himself acknowledges in the preface: ‘upwards of three thousand dialect glossaries and works containing dialect words have been read and excerpted for the purposes of the Dictionary’ (vi). As is well known, the volume of works corresponding to the counties of Lancashire and Yorkshire exceeds the number of works for the other four northern dialects, for evident reasons. There are emblematic pieces corresponding to Lancashire and Yorkshire that have been thoroughly analysed and studied, while other important pieces of the many other dialects remain almost unnoticed. If there is an emblematic work representing the dialect of Westmoreland, it is Ann Wheeler’s The Westmoreland Dialect in Three Familiar Dialogues (1790). Commentaries on this work, such as the one that appeared in Russell Smith’s list of “interesting books” included in his Bibliographical Lists (1839) – “The philologist will find numerous examples of words and phrases which are obsolete in the general language of England, or which are peculiar to Westmoreland and Cumberland from time immemorial” (7) – have made us consider the importance of undertaking an in‐depth analysis of these dialogues. This paper tries to evaluate the contribution made to Wright’s English Dialect Dictionary by Wheeler’s dialogues, considering not only the first edition (1790) but also later editions of this work (1802 and 1840), to which there are important additions.
This undertaking has been made far more feasible by the digitised version of Wright’s English Dialect Dictionary being prepared by the research team at the University of Innsbruck (Markus 2007, 2009; Markus & Heuberger 2007). Our aim is twofold. Firstly, to ascertain which entries from Wheeler’s dialogues were included in Wright’s masterpiece and to analyse the treatment given to this information. Secondly, to contribute to a better knowledge of one of the northern dialects which has traditionally received little attention, the Westmoreland dialect.

Santaemilia Ruiz, José and Sergio Maruenda‐Bataller Panel: 2. Discurso, análisis literario y corpus BUILDING A COMPARABLE CORPUS (ENGLISH‐SPANISH) OF NEWSPAPER ARTICLES ON GENDER AND SEXUAL (IN)EQUALITY (GENTEXT): PRESENT AND FUTURE APPLICATIONS IN THE ANALYSIS OF SOCIO‐IDEOLOGICAL DISCOURSES Over the last few years a number of legal measures have been adopted both in Spain and in the UK – e.g. the Civil Partnership Act 2004 or the Domestic Violence, Crime and Victims Act 2004 in the UK, or the new Spanish legislation on abortion, gender‐based violence or homosexual marriages. These measures, along with the growing recognition of social and sexual rights in Western Europe, have sparked a heated debate within both Spanish and British societies. These debates are reproduced, generated, amplified, diminished, perverted or exploited by mass media, political parties or religious institutions, with a view to demanding or imposing either respect or neglect for the very minorities to which these legal measures are addressed. As part of the work of the research group GENTEXT, we have built a 4.5‐million‐word, comparable (Spanish‐English), highly specialised corpus (GENTEXT‐N) which serves to analyse, document and offer insights into the complex socio‐ideological debates behind people’s attitudes and values, into the discursive attempts to exercise power and to impose political and religious positions, and so on. It is, in short, an invaluable source of material to document the steps our societies are taking towards sexual equality. We believe that a combination of qualitative and quantitative analyses (as advocated, among others, by Baker & McEnery 2005, Baker et al 2008, Caldas‐Coulthard 2010) is essential if we wish to grasp both the linguistic and the ideological underpinnings of the heterogeneous texts we are investigating.
Thus, our analyses integrate, on the one hand, critical discourse analysis and lexical semantics/pragmatics and, on the other, Corpus Linguistics techniques, to fully exploit the potential of both approaches while trying to avoid the oversimplification of ideological bias. We will start with a statistical keyword analysis, using WordSmith Tools, in order to obtain a reliable list of recurrent words in the field. As argued by Baker (2006), keywords are not merely neutral, statistical lists of words, but rather privileged rhetorical devices used to implant common sense in our ways of thinking. This analysis will provide information on the ideological implications of keywords, as well as on the ideas these keywords cluster around (naming strategies, ‘sensitive’ relationships between minority group members, social and sexual implications, identities and self‐presentation…). Attention to context will be paramount. Apart from the initial focus on keywords in these gender‐sensitive texts, we also examine the potential of collocations and semantic/discourse prosodies for our research (Louw 1993, Stubbs 2001). As for the former, it will be revealing to document the collocations certain keywords (e.g. homosexual, abortion or violence) give rise to. As for the latter, semantic/discourse prosodies help transcend the collocational or even sentential scope to reveal discursive patterns and, consequently, to trace evaluative relations (with participants in discourse) in terms of ideological standpoint (see Martin & White 2005). These constitute a network or constellation of semantic concepts which contribute to shaping and (de)legitimising citizens’ discourses and rhetorical frameworks within communities of practice.
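Keyword extraction of the kind WordSmith Tools performs typically relies on a keyness statistic such as Dunning's log‐likelihood, which compares a word's frequency in the study corpus with its frequency in a reference corpus. The sketch below shows that calculation for a single word; it illustrates the underlying statistic, not the GENTEXT workflow itself.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Dunning's log-likelihood keyness: 2 * sum(O * ln(O/E)) over the
    study and reference corpora, with expected counts E proportional
    to corpus size."""
    joint = freq_study + freq_ref
    e_study = size_study * joint / (size_study + size_ref)
    e_ref = size_ref * joint / (size_study + size_ref)
    ll = 0.0
    if freq_study:
        ll += freq_study * math.log(freq_study / e_study)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / e_ref)
    return 2 * ll
```

A word distributed identically in both corpora scores near zero; the more a word is over‐ or under‐represented in the study corpus, the higher its score, which is what pushes items like homosexual or violence to the top of a keyword list for this kind of material.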

Santos Moreira, Adonay Custódia Panel: 1. Diseño, compilación y tipos de córpora TURIGAL: COMPILATION OF A PARALLEL CORPUS FOR BILINGUAL TERMINOLOGY EXTRACTION These last few years have witnessed an increase in research involving the compilation of large quantities of texts and their respective translations, as well as the development of techniques for processing such bilingual corpora (Bowker & Pearson, 2002; Biber et al., 2004; McEnery & Wilson, 2004). The present study is an example of such research, as it uses a Portuguese‐English unidirectional parallel corpus as a starting point for the retrieval of terminology. The main goal of this research is to exploit one of the possibilities offered by parallel corpora: the compilation of bilingual term banks. Turigal, a parallel corpus of tourist advertising material, has been devised to support the creation of a bilingual term bank on tourism. The corpus consists of texts – printed brochures and websites – in Portuguese and their translations into English, all of which were sourced from Portuguese Tourism Regions, Regional Tourism Boards and Regional Tourism Promotion Agencies, and stored as plain text. For the moment, it contains 1,285,764 words (632,193 words in Portuguese and 653,571 in English); it is included in the Linguistic Corpus of the University of Vigo (Gómez Guinovart, 2003) and available for free consultation at http://sli.uvigo.es/CLUVI. Turigal is considered to be sufficiently representative of all bilingual (Portuguese‐English) promotional materials published and distributed by the official entities responsible for the internal and external tourism promotion of Portugal in 2007, the year the texts were collected. First, we describe the methodology used in the compilation of Turigal.
Then, we discuss Pearson’s (1998) set of criteria for corpus design and text selection – namely size, text origin, author, factuality, technicality, audience, intended outcome, setting and topic – which has been considered when compiling our corpus. Finally, we present the alignment and tagging of Turigal. The programme TRANS Suite 2000 Align (Cypresoft, 2000) has been used to align the texts. All the aligned parallel texts are stored in TMX format and three translation strategies – omission, addition and reordering – have been encoded.
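Aligned translation units stored in TMX are plain XML, with one <tu> element per segment pair and one <tuv> per language. The sketch below builds a minimal TMX document from Portuguese–English pairs; it is a toy illustration of the storage format only, not the TRANS Suite 2000 Align output, and it does not reproduce the corpus's actual header fields or its encoding of the omission, addition and reordering strategies.

```python
import xml.etree.ElementTree as ET

def make_tmx(pairs):
    """Serialize (Portuguese, English) segment pairs as a minimal
    TMX 1.4 document, one <tu> per aligned pair."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", srclang="pt", segtype="sentence",
                  datatype="plaintext", adminlang="en",
                  creationtool="toy", creationtoolversion="0.1",
                  **{"o-tmf": "none"})
    body = ET.SubElement(tmx, "body")
    for pt, en in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in (("pt", pt), ("en", en)):
            tuv = ET.SubElement(tu, "tuv")
            tuv.set("xml:lang", lang)
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```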

Del Saz Rubio, M. Milagros Panel: 2. Discurso, análisis literario y corpus AN APPROACH TO NATIVE AND NON‐NATIVE WRITERS’ USE OF INTERACTIONAL METADISCOURSAL FEATURES IN SCIENTIFIC ABSTRACTS IN ENGLISH WITHIN THE FIELD OF AGRICULTURAL SCIENCES The relevance of academic writing is nowadays more than justified, as demonstrated by the large body of research in this area. Authors such as Berkenkotter, Huckin & Ackerman (1991) have drawn attention to the importance of mastering a specialized literacy, especially for students or researchers entering the academic disciplines. This literacy can be defined as the ability to make use of the discipline‐specific rhetorical and linguistic conventions in order to fulfill their purpose as writers. Mastering academic writing thus involves an awareness of the existence and structure of specific genres, as a key element for acculturation and success. Therefore, to engage in the writing of a genre such as the research article (RA) inevitably calls for awareness of its specific conventions, as well as of the role of the writer and the purpose of the writing task. This situation can certainly be more complex for researchers who need to write and publish their research in an L2, since mastering the grammar, lexicon or syntax is not enough to guarantee them communicative competence. Taking all this into consideration, the main aim of this paper is to assess whether there is intercultural variation in the rhetorical preferences of native English and Spanish‐speaking researchers when writing research article abstracts in English within the field of Agricultural sciences. To do so, a total of 30 articles, 15 written by native English speakers (NES) and 15 by non‐native English speakers (NNES), are analysed, and a quantitative and qualitative analysis of the interactional metadiscoursal features they employ, as developed by Hyland (2005), is carried out using WordSmith Tools 4.0.
By focusing on the use that these writers make of interactional devices in the different sections of the abstract, we will look into the use of hedges and boosters, engagement and attitude markers, and self‐mentions, as these are devices traditionally employed by writers to involve the reader in the text and thus explicitly build a relationship with the scientific audience. Results will also help determine whether it is possible to talk of the existence of a conventional international culture in the genre of research articles within the field of Agricultural
sciences, or if, on the contrary, the two groups of writers tend to impose the writing conventions of their L1(s) on their writing of abstracts for scientific articles in English and, thus, on their use of metadiscoursal features. Finally, the results obtained here can have implications for the teaching of academic writing to non‐native speakers of English. As such, they will be taken as a starting point for the design and elaboration of meaningful writing activities aimed at raising awareness of the conventions and expectations which operate in the genre of the research article in English in the field of Agricultural Sciences.

Schneider, Gerold and Fabio Rinaldi Panel: 6. Corpus y variación lingüística A DATA‐DRIVEN APPROACH TO ALTERNATIONS BASED ON PROTEIN‐PROTEIN INTERACTIONS Syntactic alternations, for example the dative shift, are well researched. There are investigations using large amounts of quantitative data and statistical techniques (e.g. Bresnan and Nikitina 2009). Recently, it has been suggested that traditional concepts of alternations are a heritage from generative syntax, that most decisions which speakers take are more complex than binary choices (e.g. Arppe et al. 2011), and that there are complex interdependencies and combination options (e.g. Fillmore 2003). Multifactorial approaches and, since speakers choose among a wide range of grammatical forms, a large inventory of syntactic patterns need to be considered to supplement current approaches. We use the term semantic alternation broadly, referring to the many different ways in which a relation between entities, conveying broadly the same truth‐functional value, can be expressed. We use a clearly defined and well‐resourced domain, biomedical research texts, for a corpus‐driven approach. As entities we use proteins, and as relations we use interactions between them, using data from large applied text‐mining challenges (e.g. Leitner 2010). The following sentences all convey the same core relation:

We confirm binding of MEA to FIE
In our experiments MEA amino acids were able to bind to FIE
FIE‐binding by MEA amino acids has been observed
The amino acids of MEA are sufficient to bind with FIE

We discuss first an approach using a finite inventory of manually designed syntactic patterns, second a corpus‐based semi‐automatic approach, and third a machine‐learning language model. The machine‐learning approach learns from an annotated corpus the probability that a certain syntactic configuration expresses a relevant interaction of given event types.
For each event, the inventory and probabilities of configurations define the envelope of application and its multitude of forms. A configuration consists of dependency relations and lexical chains, which use semantic information to overcome sparse‐data problems. Since it has been pointed out that predictive models are particularly accurate for expressing complex, multifactorial phenomena (e.g. Tse 2003), we present and evaluate a predictive probability model for semantic alternations in the domain and also discuss its relevance for other domains.
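In its simplest form, learning the probability that a configuration expresses an interaction is relative‐frequency estimation over annotated instances. The sketch below is a deliberately reduced illustration of that step: the configurations here are plain labels, whereas the model described above builds them from dependency relations and lexical chains.

```python
from collections import Counter

def config_probabilities(annotated_instances):
    """Estimate, for each syntactic configuration, the relative frequency
    with which it expresses a true interaction.
    `annotated_instances` is a list of (configuration, is_interaction) pairs."""
    totals, hits = Counter(), Counter()
    for config, is_interaction in annotated_instances:
        totals[config] += 1
        if is_interaction:
            hits[config] += 1
    return {c: hits[c] / totals[c] for c in totals}
```

At application time, such probabilities let the model rank candidate protein pairs by how reliably their observed configuration signals an interaction.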

Skorczynska, Hanna Panel: 2. Discurso, análisis literario y corpus METAPHOR IDENTIFICATION IN CORPORA: THE CASE OF ‘AS’ IN A BUSINESS PERIODICAL CORPUS Metaphor signals, also called metaphorical markers (Goatly, 1997), tuning devices (Cameron & Deignan, 2003) and flagging expressions (Steen, 2007), are words and phrases that anticipate metaphors in discourse and are meant to cue the reader/listener into the metaphorical rather than the literal interpretation of an expression. Metaphor signals also provide direct access to metaphorical material in large corpora if concordancing techniques are used. Goatly (1997), as well as Wallington et al. (2003), have proposed listings of possible metaphor signals. Their use in electronic queries of corpora may be an alternative to troublesome manual searches for metaphors and to other corpus techniques, such as Charteris‐Black’s (2004) use of key metaphors. Studies of metaphor signals in different types of discourse and corpora (Skorczynska & Piqué, 2005; Wallington et al., 2003) have shown that only a small percentage of all metaphors used are signaled. In spite of that, metaphor signals can still be used as a complementary metaphor identification procedure, given that no reliable metaphor identification computer tool has been designed to date. The use of metaphor signals in metaphor identification methods still needs to be refined through the analysis of language data extracted from corpora. Potential metaphor signals should be evaluated with regard to the probability with which they fulfill this function. They also need to be examined in their co‐text to identify larger phraseological chunks that might anticipate metaphorical expressions in discourse. Both the probability and the phraseology might vary across different discourse types and corpora. In response to these needs, this study looked into the use of ‘as’ as a metaphor signal in a corpus of business periodicals.
A previous study (Skorczynska & Piqué, 2005) had revealed that ‘as’ was one of the most frequent words signaling a metaphor in this corpus. In the present study, a corpus of around 600,000 words was electronically queried for ‘as’. Of the 4,772 occurrences, which were manually analyzed, only 260 (5.5%) were used to signal a metaphor. The co‐text of these occurrences was further analyzed in order to determine possible metaphor‐signaling phraseological patterns. One of the patterns identified was the combination of a verb with ‘as’. The following combinations were, therefore, further examined: ‘view as’, ‘refer to as’, ‘describe as’, ‘look as’, ‘act as’, ‘perceive as’, ‘think of as’, ‘see as’, ‘know as’, ‘use as’, ‘call as’. The comparison of the metaphor‐signaling uses of these word combinations with their non‐signaling uses showed that some of them are more probable metaphor signals than others. For instance, ‘view as’ was found to be the most reliable metaphor signal, with 75% of its uses signaling metaphor. The least probable metaphor signal was ‘call as’, with only 18% metaphor‐signaling occurrences. The results obtained suggest that the use of ‘as’ combined with other lexical items as the search word in corpus electronic queries might be a more efficient metaphor identification technique than ‘as’ on its own.
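The reliability comparison reported above amounts to ranking each candidate pattern by the proportion of its occurrences that were manually judged metaphorical. A minimal sketch of that bookkeeping step (the annotations below are invented toy data, not the study's):

```python
def rank_signals(annotated):
    """Rank candidate metaphor signals by the share of their occurrences
    judged metaphorical. `annotated` maps pattern -> list of bool
    (True = the occurrence signals a metaphor)."""
    shares = {p: 100 * sum(judgements) / len(judgements)
              for p, judgements in annotated.items()}
    return sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
```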

LITERATURE REVIEWS IN ENGLISH AND SPANISH PHD THESES: A CROSS‐LANGUAGE STUDY Research on the Literature Review (LR) chapter of doctoral theses has been carried out on theses produced by native English‐speaking students (Ridley, 2000; Kwan, 2006; Thompson, 2005, 2009). However, to our knowledge there have been no contrastive studies based on LR chapters in theses written in English and in Spanish. Reviews in general entail critical evaluations which may involve face‐threatening acts (Brown & Levinson, 1987). The LR of a doctoral thesis both evaluates others’ research and is evaluated by the examiners, a distinguishing feature of the thesis social context which differentiates it from other ‘more public’ review genres, such as book reviews or back‐cover blurbs (Hyland & Diani, 2009). This makes it necessary to maintain appropriate relations between the writer and the academic community through politeness strategies that aim at saving three faces: the writer’s, the examiners’ and the reviewed authors’. Doctoral candidates must submit their research for assessment and need to present their claims and show their knowledge in conformity with the norms of the academic environment. Citation practice provides justification for arguments and allows a writer to indicate a rhetorical gap for her/his research and adopt a tone of authority. Claims must be supported with evidence, and writers must demonstrate an understanding of approaches and knowledge in their fields of specialisation, in order to persuade the examiners that the thesis is worthy of the award of a doctorate (Thompson, 2005). Candidates also need to maintain an adequate interpersonal relationship with the immediate audience (the examiners). They also need to evaluate the previous research in an area of study and to be respectful of previous claims from authorities in the disciplines.
In this context of social interaction, politeness strategies should be taken into consideration so as to mitigate the strength of their arguments. This paper investigates contrastively how interactional resources and, in particular, reporting verbs are deployed in the LR chapters of PhD theses. It analyses a comparable corpus of 20 LRs (10 in English and 10 in Spanish) written by native speakers, within a single applied discipline: computing. It focuses on uses of reporting structures realised through integral and non‐integral citations of other texts (Hyland, 1999). The research design is based on previous taxonomies of reporting verbs proposed by Thompson & Ye (1991) and Hyland (1999), classified according to the type of activity referred to, under two categories: denotative, e.g. ‘find’, ‘state’ (in English LRs) and ‘demostrar’, ‘analizar’ (in Spanish LRs), and evaluative, e.g. ‘suggest’, ‘recommend’ (in English LRs) and ‘proponer’, ‘asumir’ (in Spanish LRs). Using a combination of quantitative and qualitative data, we will determine whether there is variation in the way English and Spanish doctoral candidates adopt a stance towards their reviewed authors. The pedagogical implications of this study will contribute to an understanding of interpersonal relations in two different cultural and linguistic backgrounds, and will help novice academic writers interact with their intended readers successfully.

Keith Stuart Panel: 2. Discurso, análisis literario y corpus A CORPUS ANALYSIS OF RHETORICAL STRATEGIES IN THE DISCOURSE OF CHOMSKY This paper explores the rhetorical strategies used by Chomsky in two of his most important books (Syntactic Structures, 1957 & Aspects of the Theory of Syntax, 1965). It continues and widens the research carried out by Hoey (2001), who analysed Chomsky’s rhetorical strategies but limited his study to just two passages of Chomsky’s writings. One of the claims that we shall be making is that Chomsky is an expert in wrapping propositions in the form of interpersonal metaphors so as to appear objective and factual. In interpersonal metaphors of modality, the grammatical variation which occurs is based on the logico‐semantic relationship of projection (Halliday, 1994: 354). In other words, Chomsky construes propositions as projections and encodes the “objectivity” in a projecting clause. I have found 264 clause complexes of this type in Aspects of the Theory of Syntax and 248 in Syntactic Structures. Some examples are given of the way Chomsky conceals the fact that he is expressing an opinion through the use of the logico‐semantic relationship of projection: “it seems quite clear that no theory of linguistic structure…” (Who is it clear to?); “it is unquestionable that opposition to mixing levels…” (Who thinks it is unquestionable?); “It is quite true that the higher levels of linguistic description...” (Who is it true to?)

The paper will not limit itself to these structures but will also analyze a range of interpersonal meanings and their lexico‐grammatical realizations. It will also make reference to a recent paper by Pullum (forthcoming, 2011) on the mathematical foundations of Syntactic Structures and suggest some reasons why Chomsky dresses up his texts in a very persuasive form of language. These reasons seem to be principally issues to do with the academic and historical context in which these two important texts were produced.

Toledo Báez, María Cristina Panel: 5. Corpus, estudios contrastivos y traducción TRANSLATING RESEARCH ARTICLES FROM SPANISH INTO ENGLISH: A CORPUS‐BASED COMPARATIVE ANALYSIS OF THE GENRE Translating research articles from any language into English is of paramount importance in the scientific community. However, before translating, it is necessary to ascertain the macrostructure of both source text and target text according to the genre conventions in each language. This article aims to determine whether research articles in the domain of Information and Technology Law published in Spanish share the Introduction‐Method‐Results‐Discussion (IMRD) structure used in most articles written in English. More specifically, we focus on the introduction section in order to study whether most articles follow either the Create a Research Space (CARS) model (Swales, 1990) or the Open a Research Option (OARO) model (Swales, 2004). In previous studies with small corpora (Toledo Báez, 2009 and 2010), the results showed that CARS introductions are much more frequent in English than in Spanish and that the OARO structure is the most common in the Romance language. However, we need to test this hypothesis with an intratextual comparative analysis of our bilingual, specialized, virtual, and representative (Corpas Pastor and Seghiri Domínguez, 2010/in press) comparable corpus consisting of a collection of 280 research articles on electronic commerce, 140 in Spanish and 140 in English. This article also pays attention to the possible macrostructural consequences when translating research articles from Spanish into English. This difference may have an impact on the translation of these texts because the translator may have to decide whether to keep the original features of the research article in Spanish or, on the contrary, to adapt the Romance text to the Anglo‐Saxon conventions of research articles.

Torre Alonso, Roberto

Panel: 4. Lexicología y lexicografía basadas en córpora THE PREFIX UN‐ IN THE FORMATION OF OLD ENGLISH NOUNS: COMBINATORIAL PROPERTIES AND CONSTRAINTS This paper aims at shedding light upon the morphological properties of the prefix un‐ in the formation of complex nouns in Old English. More concretely, it provides an exhaustive description of the combinatorial properties of the prefix as regards both the bases to which it may be attached and the affixes with which it can interact in recursive derivative processes. Thus, this research is twofold. On the one hand, it explores the nature of the bases that admit derivation with un‐ as regards the lexical class to which they are ascribed. On the other hand, the morphological character of the bases is also discussed, whether simple or complex, thus allowing for the establishment of a set of affixes which can act in interaction with the prefix un‐. In this respect, this research is supported by the works of Siegel (1979), Fabb (1988), Aronoff and Fuhrhop (2002), Hay and Plag (2004), Lieber (2004) and Martín Arista (2010), which focus on the subject of affix combinations. Regarding the target language, this stage of English is characterised by a rich inflectional system (Kastovsky 1992), and complex words clearly outnumber simple ones. Moreover, complex words are also used as bases for further derivational steps, thus allowing for the existence of recursively derived words, which are a suitable field of study for the analysis of affix combinations. The data analysed have been retrieved from the lexical database Nerthus (www.nerthusproject.com), which contains over 30,000 predicates; the corpus of analysis consists of a total of 162 predicates. Of these, 153 present a nominal base, whereas 9 present a non‐nominal base, including 3 verbs and 6 adjectives. The reason for the existence of non‐nominal bases is to be found in the analysis methodology and in the fragmentation of the surviving lexical stock of the period.
Regarding the morphological complexity of the bases, only 35 are underived. Thus, nouns prefixed with un‐ present some degree of recursivity in 78.395% of the cases. Besides, 71 of the predicates (43.827% of the total and 55.905% of the recursively derived nouns) present two affixes in the final word‐creation steps. According to the data, the prefix un‐ can combine with four different prefixes, namely ā‐, for‐, ful‐ and ge‐, yielding a total of 26 predicates, and with seven suffixes, those being ‐dōm, ‐en, ‐ere, ‐ing/‐ung, ‐nes, ‐scipe, and ‐t, for a total of 45 nouns. The final part of the analysis sets the data against the Monosuffix Constraint, proposed by Aronoff and Fuhrhop (2002). These authors identify closing suffixes that do not allow for further derivation once they have been attached to a base. The data show the existence of several suffixes that can occur once un‐ has occupied its place in the derivation. The morphemes allowing for this combinatorial order are ‐dōm, ‐end, ‐hād, ‐ing/‐ung, ‐nes, and ‐t, in some 93 predicates. These data do not allow us to claim closing status for un‐, but they show that the prefix is process‐final, as it admits no further prefixation.
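The proportions reported in this abstract follow directly from the stated counts (162 predicates, 35 of them underived, 71 with two affixes in the final steps). A minimal check of the arithmetic, using only the figures given in the text:

```python
# Recompute the proportions reported for the Nerthus data
# (all figures taken from the abstract itself).
total = 162        # nouns prefixed with un-
underived = 35     # morphologically simple bases
recursive = total - underived          # recursively derived nouns: 127

pct_recursive = 100 * recursive / total        # ~78.395% of all cases
two_affix = 71                                 # nouns with two final affixes
pct_of_total = 100 * two_affix / total         # ~43.827% of the total
pct_of_recursive = 100 * two_affix / recursive # ~55.905% of recursive nouns
```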

Varela Pérez, José Ramón Panel: 6. Corpus y variación lingüística NOT‐NEGATION AND NO‐NEGATION IN CONTEMPORARY SPOKEN BRITISH ENGLISH: A CORPUS‐BASED STUDY This contribution explores the interface between corpus linguistics, diachronic typology and usage‐based approaches to the study of grammatical variation. I will address the alternation between two types of negation in contemporary spoken English involving non‐specific indefinites under the scope of negation: NOT‐negation (He did not see anything) and NO‐negation (He saw nothing) (Tottie 1991a, 1991b, 1994). Historically, the possibility of variation between the older construction with NO‐negation and the newer one with NOT‐negation was only effective after the disappearance of ne and the rise of not as a marker of verbal negation at the end of the ME period (Jespersen 1917; Mazzon 2004). The demise of multiple negation of the type ne + verb + no/nothing/none, etc., brought about NO‐negation: I ne saw nothing > I saw nothing. In addition, the new marker of verbal negation (not) could increasingly combine with negative polarity items such as any and the indefinite article (not…a/any/anything, etc.) (Shanklin 1988). There have not been many corpus‐based studies of this topic. Most of them focus on contexts where variation is not possible and/or offer quantitative findings without further qualitative analysis of the data (e.g. Biber et al. 1999; Westin 2002; Peters 2008). Only Tottie (1991b) has offered a comprehensive study of this area, although she relies on corpora dating back to the 1960s and the early 1970s. In this paper, I will use a sample of contemporary spoken British English taken from the British component of the International Corpus of English (ICE‐GB), including conversations recorded in the early 1990s.
In this regard, a comparison of my data with Tottie’s (1991b) findings might reveal some evidence of on‐going change in this area given the way changes from below in English grammar seem to have spread historically. I will also address the impact of several internal factors on the choice between the two constructions, including some that have not yet been considered in the literature. Ultimately, the variation between NOT‐negation and NO‐negation must be placed against the backdrop of diachronic typology: the history of sentence negation in English (the so‐called Jespersen’s Cycle) and two competing typological tendencies that bear opposite results in the expression of negation: (a) the ‘Neg First’ principle, i.e. the universal psycholinguistic tendency for negative markers to be placed before the verb (Jespersen 1917; Horn 1989); and (b) the End‐weight principle, i.e. the tendency to concentrate communicatively significant elements towards the second part of the sentence (Mazzon 2004).

Vea, Raquel

Panel: 4. Lexicología y lexicografía basadas en córpora THE CORPUS PRODUCTIVITY OF OLD ENGLISH ADJECTIVAL COMPOUNDS WITH VERBAL BASE This presentation aims at analyzing the productivity of Old English deverbal adjectival compounds. In this research, the productivity of a word‐formation process in a historical language is based on an assessment of formal transparency and textual frequency, as put forward by Kastovsky (1992) and Lass (1994). The corpus of analysis has been retrieved from the lexical database of Old English Nerthus (www.nerthusproject.com), which yields a total of 241 compounds, if spelling variants are disregarded. Focusing on the adjunct of the compound, three categories are involved: nominal adjunct (bordhæbbende 'shield‐bearing': bord 'board'), adjectival adjunct (micelsprecende 'boasting': micel 1 'great, intense'), and adverbial adjunct (eftboren 'born again': eft 'again'). By type, the most frequent compounds are the following: æðelboren 'of noble birth' (29), hefigty:me 'heavy, grievous' (29), u:tancumen 1 'foreign, strange' (31), a:ncenned 'only‐begotten' (74), frumcenned 'first‐begotten' (77). Once the type analysis has been carried out by means of the lexical database, the token analysis resorts to the Dictionary of Old English Web Corpus. The conclusions of the analysis go along the following lines. Regarding the sources, some difficulties arise in establishing the correspondences between lemmatized and unlemmatized forms and, as far as the question of token frequency is concerned, this type of compound is far more frequent in prose than in poetry.

Viberg, Åke Panel: 5. Corpus, estudios contrastivos y traducción IMPERSONAL CONSTRUCTIONS IN SWEDISH. A CORPUS‐BASED CONTRASTIVE STUDY Impersonal constructions have recently attracted a great deal of attention from typologically oriented researchers (Siewierska 2008, Malchukov & Siewierska forthc.). This paper, which represents an extension of Viberg (2010), presents a corpus‐based contrastive study of impersonals in Swedish based on the multilingual parallel corpus (MPC). At present, MPC consists of extracts from 22 novels in Swedish with translations of all texts into English, German, French, and Finnish. For some texts, translations into Spanish, Italian, Dutch, Icelandic, Danish, and Norwegian are also included. There is a total of around 600,000 words in the Swedish originals. In addition to this material, there are also some original texts in French and Finnish with translations into Swedish. Only part of the corpus has been analysed so far (Viberg 2010 is based on five of the original texts in Swedish and their translations). As a first step in the analysis, impersonals were identified with simple formal criteria. All occurrences of the Swedish generalized pronoun man and of non‐referential det ‘it’ were extracted for further analysis. Swedish impersonals include a number of constructions with impersonal (dummy) det ‘it’ as subject: clefting, presentation, extraposition of finite and non‐finite clauses, and the impersonal passive. An analysis was also made of the distribution of impersonal verbs (and other predicates) across semantic fields. As a second step, this material was analysed from a functional point of view. It turned out that det appears as a formal subject (or placeholder) in agentless sentences or sentences with low agentivity, whereas man appears as an impersonal subject with general (‘all of mankind’) or vague reference. The individual constructions were also studied.
From a contrastive perspective, it turns out that Finnish in many respects represents a different type from the other languages included in the study; but even if German, English, and French in many cases have rather direct structural equivalents to the Swedish impersonal constructions, the usage patterns differ in a striking way even between these languages. For example, in the material analyzed so far, there were 181 it‐clefts in Swedish of the type It was Peter who came. It turned out that it‐clefts and other clefts (pseudoclefts) were equally frequent as translations in English, but together these structures accounted for no more than 30% of the translations. For German, the proportion was even lower (20%). The highest correspondence was found in French, with 43%, which is still rather low. Finnish does not have any direct structural correspondence to it‐clefts and used other translations (including neutral sentences lacking any functional equivalent).

Voutilainen, Atro, Krister Lindén and Tanja Purtonen Panel: 1. Diseño, compilación y tipos de córpora DESIGNING A DEPENDENCY REPRESENTATION AND GRAMMAR DEFINITION CORPUS FOR FINNISH We outline the design and creation of a syntactically and morphologically annotated corpus of Finnish for use by the research community. We motivate a definitional, systematic “grammar definition corpus” as a first step in a three‐year annotation effort to help create higher‐quality, better‐documented extensive parsebanks at a later stage. The syntactic representation, consisting of a dependency structure and a basic set of dependency functions, is outlined with examples. Reference is made to double‐blind annotation experiments to measure the applicability of the new grammar definition corpus methodology.

Given the current global obesity epidemic and the media’s coverage of this phenomenon over the past decade, researchers have begun to examine news reports on obesity through qualitative, quantitative, thematic, content and discourse analyses. Following the work of Lawrence (2004), Kim & Willis (2007) and Boero (2007) in the US, obesity news studies have been conducted in Australia (Udell & Mehta, 2008), Canada (Roy et al., 2007), Germany (Hilbert & Ried, 2009), Norway (Malterud & Ulriksen, 2009), Sweden (Sandberg, 2007) and the UK (Gough, 2007). Yet to my knowledge, there is no comparable study available for Spain. Therefore, the purpose of my ongoing research is to examine Spanish written press coverage of obesity, especially in regard to children (Author, in press; Author, 2010). This research is based on a specific corpus of 231 news items published between 01/01/2008 and 31/12/2008. This year was selected for study because in April of 2008 the national press ran headlines which blamed obesity for a child’s death in Murcia, Spain. Using various combinations and synonyms of the key search term, obesidad infantil, all pertinent news items were extracted from the online archives of ABC, El Mundo and El País, the three leading national newspapers in Spain. After analyzing each item manually, only those results containing at least one direct reference to childhood/adolescent overweight/obesity were included in the final 231‐item corpus: ABC (n=88; 38.1%), El Mundo (n=78; 33.8%), and El País (n=65; 28.1%) (total word count, approx. 135,932; 588 words/text). The present study will focus on the 65 items published in El País, the top‐circulation daily in Spain. The El País articles tended to be longer (673 words/item) than those published in ABC and El Mundo (475 and 645 words/item, respectively). 
With an average of 5.4 items/month, the El País sample contains 9 (13.8%) opinion articles, 23 (35.4%) interpretative pieces, and 33 (50.8%) informative texts; the names of staff reporters or journalists appear on 51 (78.5%) of the items. The content analysis confirmed two types of thematic coverage: social and scientific. The social perspective frames news on public and private schemes to control or prevent obesity (16; 24.6%) as well as the implications of obesity in the lives of celebrities (8; 12.3%). The scientific frame is clear in news about obesity prevalence (6; 9.2%) or the causes (12; 18.5%) and health risks associated with childhood obesity (13; 20%). In brief, some 42 million children under five are overweight today and, according to the World Health Organization (2010), the majority will be overweight as adults; many will be diagnosed with diabetes or cardiovascular disease and, some, like the Spanish child in 2008, will die prematurely. The results of this research highlight the newsworthiness of the current childhood obesity epidemic in the leading Spanish daily and will provide relevant data for future studies of news framing and contemporary obesogenic discourse.
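The distribution percentages reported for this corpus follow directly from the raw counts given in the abstract; a minimal check of that arithmetic (figures from the text, nothing added):

```python
# Recompute the newspaper shares of the 231-item obesity corpus
# and the genre breakdown within the 65 El País items.
counts = {"ABC": 88, "El Mundo": 78, "El País": 65}
total = sum(counts.values())
shares = {paper: round(100 * n / total, 1) for paper, n in counts.items()}

genres = {"opinion": 9, "interpretative": 23, "informative": 33}
genre_shares = {g: round(100 * n / 65, 1) for g, n in genres.items()}
```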

Wissik, Tanja Panel: 1. Diseño, compilación y tipos de córpora COMPILING SPECIALIZED CORPORA ACROSS LANGUAGE VARIETIES AND WORKING WITH THEM The building and analysis of specialized multilingual corpora on the one hand, and the building and analysis of corpora of national varieties on the other, are well‐established methods in translation studies and linguistics. But the analysis of specialized comparable corpora for national varieties is still in its infancy, since most studies analyzing national varieties, especially for German, focus on general language and not on language for special purposes. In this paper the design and development of the so‐called UNI‐Corpus will be described. The relevant corpora are compiled with special regard to the institutional language used in the university systems in Austria, Germany and Switzerland. This paper will present the experience of developing these three comparable corpora and will discuss issues which arose when setting up the corpus, such as the selection of texts, the size of the sub‐corpora, regional distribution, etc. Furthermore, the paper will discuss a case study to illustrate the application and use of the UNI‐Corpus, which can serve comparative and contrastive studies; the results of the analysis can also be used in translation studies and in the actual translation process.

You, Zixi Panel: 3. Estudios gramaticales basados en córpora A CORPUS‐BASED EXAMINATION OF PERFECTIVE AUXILIARY SELECTION IN OLD JAPANESE Auxiliary selection has received a great deal of attention among linguists who work on European languages, for example Italian, French and Old Spanish, whereas very few studies have taken auxiliary selection in Asian languages into consideration. Washio (2002; 2004) argued that the perfect auxiliaries in Old Japanese (OJ) displayed a close distributional correspondence to the European auxiliaries H (HAVE) and B (BE); however, the full picture of the distribution and the underlying criteria for auxiliary selection in OJ itself remained unclear and debatable. This paper, illustrating how a large corpus with a large amount of textual and grammatical data benefits descriptive and analytical linguistic research, examines auxiliary selection in OJ by means of a newly completed OJ Corpus, a part of the VSARPJ Corpus that features a large amount of grammatical information encoded in pre‐modern Japanese texts. The perfective auxiliary in OJ has two variants, ‘‐(i)n‐’ and ‘‐(i)te‐’. As has been pointed out by Frellesvig (2010: 67), they belong closely together for the reasons that they are mutually exclusive, occupy the same position in the verb system, do not co‐occur with the stative or the negative, and exhibit mostly the same inflected forms. The OJ Corpus consists of nearly all attested OJ texts, romanized and XML‐tagged with a wide range of linguistic information, e.g. orthography, part‐of‐speech, morphology, syntactic constituency, etc., following TEI conventions. (More recently, information about semantic role is also being added.) As part of the construction of the Corpus, I marked up, both automatically and manually, all the occurrences of perfective auxiliaries in OJ, and assigned ID numbers to all verbs preceding the perfectives (in the same word) according to the Lexicon of the Corpus.
Xaira was used to extract the data from the Corpus. A comprehensive and exhaustive investigation was carried out on all verbs that precede perfective auxiliaries in both single and compound forms. In total, I found 199 verbs that only co‐occurred with ‘‐(i)n‐’, 112 verbs that only co‐occurred with ‘‐(i)te‐’, and, more interestingly, 18 verbs that could co‐occur with both. Based on these lists, I looked at each verb in other contexts in the Corpus to investigate their syntactic behaviors, and also analyzed the interaction between semantic factors, namely agentivity, volitionality, affectedness, and telicity. Results showed that agentivity and telicity played the most important roles in auxiliary selection in OJ; furthermore, a strong tendency for transitive and unergative verbs to pattern with ‘‐(i)te‐’ and for unaccusative verbs to pattern with ‘‐(i)n‐’ was also observed. After a closer examination of the verbs that selected both ‘‐(i)n‐’ and ‘‐(i)te‐’, compositional factors turned out to be the key to the syntactic variation in auxiliary selection or, from another perspective, the extension of the domain of the selection. Based on the largest and newest corpus of the OJ language, this research contributes a detailed description of verbs and auxiliary selection in OJ, benefiting future comparative studies of Eastern and Western languages, while also having implications for linguistic theory in general.
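The three‐way partition of verbs described above (those selecting only ‘‐(i)n‐’, only ‘‐(i)te‐’, or both) can be derived mechanically from per‐token pairs of verb and auxiliary. The toy token list below uses a few illustrative verb stems and is a placeholder, not data from the VSARPJ Corpus.

```python
from collections import defaultdict

# Toy (verb, perfective auxiliary) token pairs; illustrative only.
tokens = [
    ("sin-", "-(i)n-"),   # 'die' (unaccusative)
    ("sin-", "-(i)n-"),
    ("kik-", "-(i)te-"),  # 'hear' (transitive)
    ("yuk-", "-(i)n-"),   # 'go'
    ("yuk-", "-(i)te-"),  # same verb with the other auxiliary
]

# Collect the set of auxiliaries attested for each verb.
aux_by_verb = defaultdict(set)
for verb, aux in tokens:
    aux_by_verb[verb].add(aux)

# Partition the verbs into the three classes reported in the study.
only_n  = {v for v, s in aux_by_verb.items() if s == {"-(i)n-"}}
only_te = {v for v, s in aux_by_verb.items() if s == {"-(i)te-"}}
both    = {v for v, s in aux_by_verb.items() if len(s) == 2}
```

Applied to the full set of tagged perfective tokens, this partition yields the 199 / 112 / 18 counts reported in the abstract; the verbs in `both` are then the candidates for the compositional analysis.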