CEPLEXicon is a lexicon based on two different corpora of child speech – Santos corpus (Santos, 2006, Santos et al., 2014, see http://www.clul.ul.pt/resources/546?lang=en) and Freitas corpus (Freitas, 1997, Freitas et al. 2012). This lexicon results from the automatic tagging of the two corpora, using a tagger and the POS tag set produced in the research unit ANAGRAMA (Centro de Linguística da Universidade de Lisboa - CLUL) (Généreux, Hendrickx & Mendes, 2012). The automatic tagging was followed by a partial manual revision (as described in the manual).

This lexicon covers all the speech produced by seven monolingual Portuguese children aged 1;02.00 to 3;11.12, in a total of 114 files, each corresponding to 40-50 minutes of child-adult interaction in a naturalistic setting. The lexicon is presented in .xls format and includes 2201 lemmas, the number of occurrences of each lemma in three different age periods (<2 years of age; ≥ 2 and < 3 years of age; ≥ 3 years of age), frequency of the lemma in each period and age of first occurrence for each child.

CEPLEXicon was developed at ANAGRAMA (CLUL, Faculdade de Letras da Universidade de Lisboa), under the project Complement Clauses in the Acquisition of Portuguese (PTDC/CLE-LIN/120897/2010), funded by Fundação para a Ciência e Tecnologia.