Constructing resources for the identification of native language features online: the process, possibilities and challenges

This presentation outlines the construction of five parallel corpora of English written online by native speakers of five target languages. The corpora have been constructed as part of the Native Language Influence Detection (NLID) project, and their purpose is to aid the automatic identification of native language influences on individuals' online communications in English. The presentation will describe the collection strategies developed to obtain this data, as well as the results of a quality review of its content. Particular sources and queries are found to be more productive than others in gathering data of this nature. Based on initial content investigations, the collection appears to be largely authentic and free from interference. The difficulty of matching such data to a sampling frame when metadata is missing is also discussed. The project team are currently investigating potential solutions to the problem of missing metadata and ways of further improving data accuracy; some of these will be outlined in the presentation. With this in mind, additional suggestions from audience members are particularly welcome.