The VALID Data Archive is an open multimedia data archive that brings together data from children and adults with language and/or communication problems. In a pilot project funded by CLARIN-NL, five existing data sets were curated. This pilot enabled us to build up experience in conserving different kinds of pathological language data in a searchable and persistent manner. These data sets reflect current research in language pathology rather well, both in the range of designs and in the variety of pathological problems covered, such as Specific Language Impairment, deafness, dyslexia, and ADHD. In this paper, we present the VALID initiative, explain the curation process, and discuss the materials of the data sets.

Word prediction, or predictive editing, has a long history as a tool for augmentative and assistive communication. Improvements in the state of the art can still be achieved, for instance by training personalized statistical language models. We developed the word prediction system Soothsayer. Its main innovation is that it uses as training data not only idiolects, the language of one individual person, but also sociolects, the language of the social circle around that person. We use Twitter for data collection and experimentation. The idiolect models are based on individual Twitter feeds; the sociolect models are based on the tweets of a particular person together with the tweets of the people that person often communicates with. The sociolect approach achieved the best results. For a number of users, more than 50% of the keystrokes could have been saved if they had used Soothsayer.
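
As a rough illustration of how keystroke savings can be measured, the sketch below simulates typing a text with a toy predictor. It is not Soothsayer's algorithm: the bigram model with unigram back-off, the function names (`train`, `predict`, `keystroke_savings`), and the assumption that accepting a suggestion costs a single keystroke and also inserts the following space are all hypothetical choices made for this example.

```python
from collections import Counter, defaultdict


def train(texts):
    """Count unigrams and bigrams in a list of training strings."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for text in texts:
        words = text.lower().split()
        unigrams.update(words)
        for prev, word in zip(words, words[1:]):
            bigrams[prev][word] += 1
    return unigrams, bigrams


def predict(model, prev, prefix):
    """Return the most likely word starting with `prefix`, or None."""
    unigrams, bigrams = model
    # Prefer words seen after `prev`; back off to overall word frequencies.
    for counts in (bigrams.get(prev, {}), unigrams):
        candidates = [(c, w) for w, c in counts.items() if w.startswith(prefix)]
        if candidates:
            # Highest count wins; ties are broken alphabetically.
            return min(candidates, key=lambda cw: (-cw[0], cw[1]))[1]
    return None


def keystroke_savings(model, text):
    """Fraction of keystrokes saved when typing `text` with the predictor.

    Assumption (illustrative only): every character costs one keystroke,
    and accepting a correct suggestion costs one keystroke that also
    inserts the space after the word.
    """
    words = text.lower().split()
    baseline = sum(len(w) + 1 for w in words)  # letters plus one space each
    typed = 0
    prev = None
    for word in words:
        for i in range(len(word)):
            # After typing i letters, is the suggestion the intended word?
            if predict(model, prev, word[:i]) == word:
                typed += i + 1  # i letters plus one key to accept
                break
        else:
            typed += len(word) + 1  # never suggested: type word and space
        prev = word
    return 1 - typed / baseline
```

In this sketch, an idiolect-style model would pass only one user's texts to `train`, whereas a sociolect-style model would also pass the texts of that user's frequent contacts; `keystroke_savings` can then be compared across the two on held-out text from the same user.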

Studies in bilingualism have shown that words activate form-similar neighbors in both the first (L1) and second (L2) languages. Accordingly, we hypothesized that the degree of form similarity between L1–L2 word pairs causes a proportional amount of prosodic transfer in L2 speech production. Thus, L1–L2 cognate pairs that bear lexical stress in the same syllable position should be facilitated in L2 production, while cognates whose stress positions mismatch between L1 and L2 should be inhibited. The results of a speeded word naming task with English L2 speakers showed facilitation in the production of cognate words overall. Concerning L1–L2 word stress, opposite effects were found for 2- and 3-syllable cognate words, while no effect was found for non-cognates. The effects found for cognate words correlate with form similarity and L2 lexical frequency values, corroborating the hypotheses that lexical activation in L2 is non-selective and that the bilingual lexicon is built through associations between L1 and L2 at multiple levels of linguistic representation.

The question whether natural fast speech is more intelligible than artificially time-compressed speech has not yet been clearly answered. For Dutch, for instance, a phoneme detection task has shown that time-compressed speech is more intelligible than natural fast speech, while for Danish listeners a dictation task showed no difference in intelligibility between natural fast speech and time-compressed speech. This article further investigates these conflicting results by reporting on a dictation task with Dutch listeners. The results suggest that the reported differences are more likely to be language-related than task-related.