Jan 26, 2019

In Vander Beken, Woumans, & Brysbaert (2018) we published a new Dutch vocabulary test with 75 multiple choice items for advanced users (typically university students). The test has a reliability of .84 (Cronbach’s alpha) and correlates .6 with English L2 proficiency (indicating that bilinguals who are good at one language are typically good at the other as well). The test can be used for free for research purposes.

Jan 10, 2019

As the number of megastudies is growing, it becomes difficult to keep track of everything that is out there. For a paper I decided to make a table and then realized that it would be great to have the information on a website with links to the articles and the datasets.

You find the outcome here. The list contains all the megastudies and eye movement corpora that I am aware of. Originally I wanted to work with a cut-off criterion of at least 1,000 words (as the lower limit of the definition of mega), but it rapidly became clear that this excluded several interesting datasets. So, for the sake of completeness I dropped the criterion, although it still feels odd to me that you can have a megastudy with fewer than 1,000 stimuli.

Oct 17, 2018

Age of acquisition (AoA) is an important variable in word recognition research. Up to now, nearly all psychology researchers examining the AoA effect have used ratings obtained from adult participants. An alternative basis for determining AoA is directly testing children’s knowledge of word meanings at various ages. In educational research, scholars and teachers have tried to establish the grade at which particular words should be taught by examining the ages at which children know various word meanings. Such a list is available from Dale and O’Rourke’s (1981) Living Word Vocabulary for nearly 44 thousand meanings coming from over 31 thousand unique word forms and multiword expressions. In Brysbaert & Biemiller (2018) we relate these test-based AoA estimates to lexical decision times as well as to AoA adult ratings, and report strong correlations between all of the measures. Therefore, test-based estimates of AoA can be used as an alternative measure.

Mar 19, 2018

At long last we found time to make the English word prevalence measures available.

Word prevalence indicates how many people know a word. Because percentage known has an uninteresting distribution, word prevalence is calculated on the basis of a probit transformation. The following are interesting landmarks:

negative prevalence values: words known by less than 50% of the people; only of interest for word learning studies
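As a rough illustration of why negative prevalence values correspond to words known by less than half of the people, here is a minimal sketch of a probit transformation. It assumes that prevalence is simply the inverse standard-normal CDF of the proportion of people who know a word; the published values may involve further steps, and the function name `prevalence` is made up for this example.

```python
from statistics import NormalDist


def prevalence(proportion_known: float) -> float:
    """Probit (inverse standard-normal CDF) of a proportion.

    Assumption for illustration: prevalence is the plain probit of the
    proportion of people who know the word, without further corrections.
    """
    if not 0.0 < proportion_known < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(proportion_known)


# Proportions below .50 map onto negative values, .50 maps onto zero,
# and proportions above .50 map onto positive values.
for p in (0.25, 0.50, 0.975):
    print(f"proportion known {p:.3f} -> prevalence {prevalence(p):+.2f}")
```

The transformation stretches the ends of the scale, so that differences between words known by nearly everyone are not squeezed into a narrow band of percentages.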

You can do online searches for semantic similarities, semantic neighbors and analogies of Italian words here (for a large co-occurrence window) or here (for a small co-occurrence window; see the article to know which one to use for which question).

Jan 16, 2018

We’ve published the outcome of 4 years of study and computer simulations on the power of designs that include more than one observation per condition per participant. Indeed, a problem with current studies on the replication crisis is that power is always calculated on the assumption that each participant provides only one observation per condition. This is not what happens in experimental psychology, where participants respond to multiple stimuli per condition and where the data are averaged per condition or (preferably) analyzed with mixed effects models.

Main findings

In a nutshell, these are our findings:

In experimental psychology we can do replicable research with 20 participants or fewer if we have multiple observations per participant per condition, because we can turn rather small differences between conditions into effect sizes of d > .8 by averaging across observations (as indeed known to psychophysicists for almost a century). This is the positive outcome of the analyses.

The more sobering finding is that the required number of observations is higher than the numbers currently used (which is why we run underpowered studies). The ballpark figure we propose for RT experiments with repeated measures is 1600 observations per condition (e.g., 40 participants and 40 stimuli per condition).

The 1600 observations we propose is when you start a new line of research and don’t know what to expect. The article gives you the tools to optimize your design once you’ve run the first study.

Standardized effect sizes in analyses over participants (e.g., Cohen’s d) depend on the number of stimuli that were presented. Hence, you must include the same number of observations per condition if you want to replicate the results. The fact that the effect size depends on the number of stimuli also has implications for meta-analyses.

Because we got many questions on power after writing the ms (and people rarely appreciated the answers we gave), we decided to write a prequel dealing with power requirements for simple designs. You find the text here (Brysbaert, 2019).
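The link between the number of observations per condition and the standardized effect size in an analysis over participants can be sketched with a simple variance-components model. This is an illustration under stated assumptions, not the simulation code from the article: each participant has a true effect that varies around the mean effect with `sd_participant_effect`, every trial adds independent noise with `sd_trial_noise`, and all numbers below are invented.

```python
import math


def standardized_effect(mean_effect_ms: float,
                        sd_participant_effect: float,
                        sd_trial_noise: float,
                        n_trials_per_condition: int) -> float:
    """Cohen's d for participant-level difference scores.

    Each participant's difference score is the difference between two
    condition means of n trials, so trial noise contributes
    2 * sd_trial_noise**2 / n to its variance (assumed model).
    """
    variance = (sd_participant_effect ** 2
                + 2 * sd_trial_noise ** 2 / n_trials_per_condition)
    return mean_effect_ms / math.sqrt(variance)


# Hypothetical values: a 20 ms effect, 15 ms variation in the effect
# between participants, 150 ms trial-to-trial noise.
for n in (1, 10, 40, 100):
    d = standardized_effect(20, 15, 150, n)
    print(f"{n:>3} trials per condition -> d = {d:.2f}")
```

Under this model d keeps growing with the number of trials until the variation between participants dominates, which is why a replication with a different number of stimuli per condition should not be expected to yield the same d.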

Missed studies in the article

After the publication of the article, it became clear that other researchers had already noticed the relationship between number of stimuli and standardized effect size. Usually this was framed in a negative way (i.e., the effect sizes are overestimated when based on the average of multiple observations), without paying attention to the more positive side for power. Here are some pointers:

Brand et al. (2010) already noticed the relationship between number of stimuli per condition and standardized effect sizes. They additionally point to the importance of the correlation between the observations: The higher the correlation, the less multiple observations will increase the standardized effect size (and arguably the less they will help to make the study more powerful).

Richard Morey (2016) also noticed that the standardized effect sizes in F1 analyses depend on the number of observations per condition. Maybe the effect size proposed by Westfall et al. is the preferred measure for future use? Alternatively, in reaction time experiments nothing may be more informative than the raw effect in milliseconds.

There was an interesting observation by Jeff Rouder pointing to the increased power of experiments with multiple observations. His rule of thumb (if you run within-subject designs in cognition and perception, you can often get high powered experiments with 20 to 30 people so long as they run about 100 trials per condition) agrees quite well with the norm we put forward (a properly powered reaction time experiment with repeated measures has at least 1,600 word observations per condition). With 2,000-3,000 observations per condition you have a high powered experiment, with 1,600 you have a properly powered experiment. Within limits (say a lower limit of 20), in most experiments the numbers of trials and participants can be exchanged, depending on how difficult it is to create items or to find participants.
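The role of the correlation between observations, as raised in the pointer to Brand et al. (2010), can be made concrete with the textbook formula for the variance of a mean of correlated observations. The sketch below assumes equal variances and the same correlation `r` between every pair of observations; that is a simplification chosen for this example and not necessarily the formulation used in that paper.

```python
import math


def effect_size_gain(n_observations: int, r: float) -> float:
    """Factor by which averaging n observations inflates d.

    The variance of the mean of n observations with common variance s**2
    and common pairwise correlation r is s**2 * (1 + (n - 1) * r) / n,
    so the standard deviation shrinks by sqrt(n / (1 + (n - 1) * r)).
    """
    if n_observations < 1:
        raise ValueError("need at least one observation")
    if not 0.0 <= r <= 1.0:
        raise ValueError("r is assumed to lie between 0 and 1")
    return math.sqrt(n_observations / (1 + (n_observations - 1) * r))


for r in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"r = {r:.1f}: averaging 40 observations multiplies d by "
          f"{effect_size_gain(40, r):.2f}")
```

With uncorrelated observations the gain is the square root of the number of observations; with perfectly correlated observations there is no gain at all.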

More recent publications of interest

Kolossa & Kopp (2018) report that for model testing in cognitive neuroscience it is more important to obtain extra data per participant than to test more participants.

Rouder & Haaf (2018) published an article that nicely complements ours. They make a theoretical analysis of when extra trials improve power. The basic message is that extra participants are always better than extra trials. However, the degree to which this is the case depends on the phenomenon you are investigating. If there is great interindividual variation in the effect and if the variation is theoretically expected, you need many participants rather than many trials (of course). This is true for many experiments in social psychology. In contrast, when the effect is expected to be present in each participant and when trial variability is larger than the variability across participants, you can trade people for trials. These conditions were met for the priming studies we discussed. No participant was expected to show a negative orthographic priming effect (faster lexical decision times after unrelated primes than after related primes), and the variability in the priming effect across participants (and stimuli) was much smaller than the residual error. These conditions are true for many robust effects investigated in cognitive psychology, in particular for those investigated with reaction times. Indeed, many studies in cognitive psychology address the borderline conditions of well-established effects (to make a distinction between alternative explanations).

Another article warning against being too cheap on the number of trials per condition was published by Boudewyn et al. (2018). If you look at their small effect sizes (remember these are the ones we are after most of the time!), the recommendation of 40 participants and 40 trials seems to hold for EEG research as well.

Nee (2019) nicely describes how extra runs improve the replicability of fMRI data, even with rather small sample sizes (n = 16). This is the good old psychophysics approach.

Inconsistencies in underpowered fMRI studies are nicely described by Munson & Hernandez (2019), who started from a large sample (like we did) and looked at what would have been found in smaller samples. Well worth a read! Another article worth reading is Ramus et al. (2018), who document the many inconsistencies in fMRI research on dyslexia and convincingly relate this to the problem of underpowered studies.

Our article does not deal with interactions. A nice blog by Roger Giner-Sorolla (based on work by Uri Simonsohn) indicates that for an extra variable with 2 levels, it is advised to multiply the number of observations by at least 4 if you want to draw meaningful conclusions about the interaction (see also Brysbaert, 2019). So, beware of including multiple variables in your study. Is the interaction really needed to test your hypothesis?

Goulet & Cousineau (2019) discuss how you can use the reliability of your dependent variable to determine the best ratio of number of trials vs. number of participants (a message also in Brysbaert, 2019).

Feb 5, 2017

We have collaborated to validate a new set of 750 colored pictures for picture naming research, compiled by Jon Andoni Dunabeitia at the Basque Center on Cognition, Brain and Language. In particular, we have collected name agreement data for Belgian Dutch. Other languages that have been added are Spanish, British English, French, German, Italian, and Netherlands’ Dutch.

You find all information (including files about name agreement and raw data files) at the BCBL website (see the link above).

Jul 29, 2016

How large is the size of our vocabulary? Based on an analysis of the literature and a large scale crowdsourcing experiment, we estimate that an average 20-year-old native speaker of American English knows 42,000 lemmas and 4,200 non-transparent multiword expressions, derived from 11,100 word families. The numbers range from 27,000 lemmas for the lowest 5% to 52,000 for the highest 5%. Between the ages of 20 and 60, the average person learns 6,000 extra lemmas or about one new lemma every 2 days. The knowledge of the words can be as shallow as knowing that the word exists. In addition, people learn tens of thousands of inflected forms and proper nouns (names), which account for the substantially high numbers of ‘words known’ mentioned in other publications.
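The "one new lemma every 2 days" figure follows directly from the numbers above. A quick check of the arithmetic (365.25 days per year is an assumption of this sketch):

```python
LEMMAS_LEARNED = 6_000       # extra lemmas between the ages of 20 and 60
YEARS = 60 - 20              # length of the interval in years
DAYS_PER_YEAR = 365.25       # assumption: average year length incl. leap years

days_per_new_lemma = YEARS * DAYS_PER_YEAR / LEMMAS_LEARNED
lemmas_per_family = 42_000 / 11_100   # lemmas per word family at age 20

print(f"one new lemma every {days_per_new_lemma:.1f} days")
print(f"about {lemmas_per_family:.1f} lemmas per word family")
```

So the rate is a bit slower than one lemma every 2 days, closer to one every two and a half days.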

You find the full details of our calculation of the vocabulary size here.

Here you find the file with all the lemmas and word families (as it turned out, for some reason a few words were lost in the file I uploaded to frontiers, among which again, against and ahead).

Apr 24, 2016

Algorithms that derive word meanings from word co-occurrences in texts are becoming increasingly powerful. Paweł Mandera has compared the various algorithms to select the best one so far for use in psycholinguistic research. This turns out to be the Continuous Bag of Words (CBOW) model (Mikolov, Chen, Corrado, & Dean, 2013) based on a combined corpus of texts and subtitles. The findings have now been accepted for publication in the Journal of Memory and Language. This is the pdf. Please refer to it as:

More interestingly, Paweł also makes the semantic vectors available online and has created an easy-to-use shell program and a web interface for those who do not feel confident enough to program. So, now everyone can calculate the semantic distance (or semantic similarity) based on CBOW between any two words online in English and Dutch. More information can be found here.
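For readers who want to see what a distance between two semantic vectors amounts to, here is a minimal sketch. It assumes that similarity is the cosine of the angle between the two vectors and that distance is one minus that value; the three-dimensional vectors are invented toy values (real CBOW vectors have hundreds of dimensions), and none of this is the interface of the shell program or website mentioned above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Assumed distance measure: one minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


# Hypothetical toy vectors, for illustration only.
vectors = {
    "cat": [0.9, 0.1, 0.3],
    "dog": [0.8, 0.2, 0.4],
    "tax": [0.1, 0.9, 0.2],
}
print("cat-dog distance:", round(cosine_distance(vectors["cat"], vectors["dog"]), 3))
print("cat-tax distance:", round(cosine_distance(vectors["cat"], vectors["tax"]), 3))
```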

Sep 23, 2015

In the Dutch Lexicon Project, we collected lexical decision times for 14K monosyllabic and disyllabic Dutch words. The Dutch Lexicon Project 2 (DLP2) contains lexical decision times for 30K Dutch lemmas. These include almost all words regularly used in Dutch, independent of length.

The word prevalence values for 54,319 Dutch words in Belgium and the Netherlands used in this paper can be found on this page.

In this paper, we have analyzed part of the data from our online vocabulary test (http://woordentest.ugent.be) in which hundreds of thousands of people from Belgium and the Netherlands participated.

Important results from this paper:

Word prevalence, the proportion of people who know a word, appears to be the most important variable in predicting visual word recognition times in the lexical decision task. We conjecture that this is because word prevalence estimates the true occurrence of words better than word frequency in the low range.

A person’s vocabulary accumulates throughout life in a predictable way: the number of words known increases logarithmically with age.

This result mirrors the growth of the number of unique words encountered with the length of a text (known as Herdan’s law in quantitative linguistics). It is first demonstrated here for human language acquisition.

Knowing more foreign languages increases rather than decreases vocabulary in your first language. This is probably a result of the shared vocabulary between languages and the faster growth in new types when acquiring a new language.
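The text-based counterpart of vocabulary growth mentioned above, the number of unique words encountered as a text gets longer, is easy to compute. The sketch below only does the counting; it does not fit any growth function, and the example sentence is made up.

```python
def type_growth(tokens):
    """Number of distinct words (types) seen after each successive token."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token.lower())
        growth.append(len(seen))
    return growth


text = "the cat saw the dog and the dog saw the cat again"
for position, n_types in enumerate(type_growth(text.split()), start=1):
    print(f"after {position:>2} tokens: {n_types} types")
```

New types arrive quickly at first and ever more slowly afterwards, because more and more of the incoming tokens are words that have been seen before.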

Jun 25, 2014

Our vocabulary test keeps on doing well (over 600K tests completed now). Below is a list of 20 words known in the UK but not in the US, and a list of 20 words known in the US but not in the UK. By known we mean selected by at least 85% of the participants from that country with English as their native language. As you can see, for each word there is a difference of more than 50 percentage points between both countries.

Better known in the UK (between brackets, percent known in the US and percent known in the UK)

tippex (7, 91)

biro (17, 99)

tombola (17, 97)

chipolata (16, 93)

dodgem (17, 94)

korma (20, 97)

yob (22, 97)

judder (19, 94)

naff (19, 94)

kerbside (23, 98)

plaice (16, 91)

escalope (17, 91)

chiropody (20, 93)

perspex (22, 94)

brolly (24, 96)

abseil (15, 87)

bodge (18, 89)

invigilator (22, 92)

gunge (19, 89)

gormless (26, 96)

Better known in the US (between brackets, percent known in the US and percent known in the UK)

garbanzo (91, 16)

manicotti (90, 15)

kabob (98, 29)

kwanza (91, 24)

crawdad (86, 20)

sandlot (97, 32)

hibachi (89, 27)

provolone (97, 36)

staph (86, 25)

boondocks (96, 37)

goober (96, 37)

cilantro (99, 40)

arugula (88, 29)

charbroil (97, 39)

tamale (92, 35)

coonskin (88, 31)

flub (89, 31)

sassafras (92, 35)

acetaminophen (92, 36)

rutabaga (85, 30)
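The selection rule behind the two lists can be written down in a few lines. This is a sketch and not the script we used: the function name and the exact handling of the thresholds are assumptions (the percentages in the lists are rounded, so 85 is treated as meeting the criterion), and only a handful of rows from the lists above are included.

```python
# (percent known in the US, percent known in the UK), copied from the lists above
KNOWN = {
    "tippex": (7, 91),
    "biro": (17, 99),
    "abseil": (15, 87),
    "garbanzo": (91, 16),
    "cilantro": (99, 40),
    "rutabaga": (85, 30),
}


def better_known_in(us, uk, threshold=85, min_gap=50):
    """Return "UK", "US" or None for one word.

    A word counts as typical for a country when at least `threshold`
    percent know it there and the other country lags by more than
    `min_gap` percentage points (assumed reading of the criterion).
    """
    if uk >= threshold and uk - us > min_gap:
        return "UK"
    if us >= threshold and us - uk > min_gap:
        return "US"
    return None


for word, (us, uk) in KNOWN.items():
    print(f"{word}: better known in the {better_known_in(us, uk)}")
```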

You can still help us to get more refined data by taking part in our vocabulary test. For instance, we do not yet have enough data to say anything about differences with Canada, Australia, or any other country with English as an official language.

Jun 13, 2014

Some words are better known to men than to women and the other way around. But which are they? With the first 500K completed vocabulary tests, we can start to answer this question. These are the 12 words with the largest difference in favor of men (between brackets: % of men who know the word, % of women who know the word):

codec (88, 48)

solenoid (87, 54)

golem (89, 56)

mach (93, 63)

humvee (88, 58)

claymore (87, 58)

scimitar (86, 58)

kevlar (93, 65)

paladin (93, 66)

bolshevism (85, 60)

biped (86, 61)

dreadnought (90, 66)

These are the 12 words with the largest difference in favor of women:

taffeta (48, 87)

tresses (61, 93)

bottlebrush (58, 89)

flouncy (55, 86)

mascarpone (60, 90)

decoupage (56, 86)

progesterone (63, 92)

wisteria (61, 89)

taupe (66, 93)

flouncing (67, 94)

peony (70, 96)

bodice (71, 96)

These 24 words should suffice to find out whether a person you are interacting with in digital space is male or female.

Take part in our vocabulary test to make the results even more fine grained!

May 19, 2014

Now that over 480,000 vocabulary tests have been completed, we can have a look at some of the findings. For instance, which words are not known at all in English? The following are the words that fewer than 3% of the participants in our test accepted as English words. For comparison, the fake words were endorsed by 8.3% of the participants on average. So, these are words not only unknown to everyone but also unlikely to be ‘mistaken’ for a true English word. The funny thing is that they often have interesting meanings, including a weapon, a precious stone, animals, several descriptions of people, and so on.

Here they are, the 20 least known words of English, also the least liked words, cast aside by everyone!

The AoA norms have been aggregated over the various studies that collected them (Ghyselinck et al., 2000, 2003; Moors et al., 2013; Brysbaert et al., 2014). If you cannot download the Excel files, most probably you are working with Internet Explorer. Ironically, this browser cannot read Microsoft Excel files.

Jan 29, 2014

After the success of our Dutch vocabulary test, we’ve developed an English version (wordORnot). The task is the same: you get 100 letter sequences and you have to indicate which are existing English words and which not. Guessing is discouraged, because you are penalized if you say “yes” to a nonword.
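To make the penalty for guessing concrete, here is a minimal sketch of a scoring rule of this kind. The exact formula is an assumption for illustration (percentage of words accepted minus percentage of nonwords accepted), as are the function name and the 70/30 split between words and nonwords in the example.

```python
def vocabulary_score(responses):
    """Score a test from (is_word, said_yes) pairs.

    Assumed rule: percentage of real words answered "yes" minus the
    percentage of nonwords answered "yes", so guessing lowers the score.
    """
    responses = list(responses)
    word_answers = [said_yes for is_word, said_yes in responses if is_word]
    nonword_answers = [said_yes for is_word, said_yes in responses if not is_word]
    if not word_answers or not nonword_answers:
        raise ValueError("need at least one word and one nonword")
    pct_words_yes = 100 * sum(word_answers) / len(word_answers)
    pct_nonwords_yes = 100 * sum(nonword_answers) / len(nonword_answers)
    return pct_words_yes - pct_nonwords_yes


# Hypothetical test: 70 words (49 accepted) and 30 nonwords (3 accepted).
example = ([(True, True)] * 49 + [(True, False)] * 21
           + [(False, True)] * 3 + [(False, False)] * 27)
print(f"score: {vocabulary_score(example):.0f}%")
```

Under such a rule, saying "yes" to everything gives 100% minus 100%, so a participant who always says "yes" ends up with a score of zero.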

Our experiences with the Dutch vocabulary test show us that in the beginning there are some questionable (not to say bad) nonwords and words (for which we apologize). These are next to unavoidable given that we are using so many stimuli. However, on the basis of the responses and the feedback we get (an example of crowdsourcing), the lists are regularly updated, so that after a few days / weeks (depending on the popularity of the test) these problematic cases should be gone. In general, problematic words or nonwords should not change the score by more than 5%.

Read the first forum discussions after the launch of the test here (UK), here and here (USA).

Updates

Jan 31, 2014: After two days the test has been done 50K times already with lots of feedback

Feb 1, 2014: 100K tests completed

Feb 16, 2014: 200K

May 20, 2014: 480K. First cleaning of the lists. Words out: 300 problematic words (the letters, abbreviations, and some long compound words that are usually written as two words) plus 2,300 very low frequency derived words ending in -ness or -ly (we had too many of them). Words in: 1,300 words from a new frequency list (many science related words). Nonwords out: 8,000 with false acceptance rates of more than 33% (such as ammicably, peachness, ...). Nonwords in: 22,000 nonwords that look like science words or monosyllabic nonwords from the ARC nonword database (because many of the nonwords that had to be dropped were monosyllabic).

Oct 7, 2013

Attentive readers may have noticed that we have underused the data from the British Lexicon Project in our publications thus far, focusing more on the (American) English Lexicon Project. This was because we felt uneasy about using word frequencies from American English to predict word processing times in British English.

At long last, together with Walter van Heuven from Nottingham University, we now have analysed word frequency norms for British English based on subtitles: SUBTLEX-UK.

As expected, these norms explain 3% more variance in the lexical decision times of the British Lexicon Project than the SUBTLEX-US word frequencies. They also explain 4% more variance than the word frequencies based on the British National Corpus, further confirming the superiority of subtitle-based word frequencies over written-text-based word frequencies for psycholinguistic research. In contrast, the SUBTLEX-UK norms explain 2% less variance in the English Lexicon Project than the SUBTLEX-US norms.

The SUBTLEX-UK word frequencies are based on a corpus of 201.3 million words from 45,099 BBC broadcasts. There are separate measures for pre-school children (the Cbeebies channel) and primary school children (the CBBC channel). For the first time we also present the word frequencies as Zipf-values, which are very easy to understand (values 1-3 = low frequency words; 4-7 = high frequency words) and which we hope will become the new standard.
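For those who want to convert raw counts themselves, the sketch below shows the basic idea of the Zipf scale: the base-10 logarithm of the frequency per billion words, which equals the log10 of the frequency per million words plus 3. The corpus size is the one given above; the function name is made up, and the values in the published files may differ slightly because the sketch leaves out any smoothing for words with very low counts.

```python
import math

CORPUS_SIZE = 201_300_000  # words in the SUBTLEX-UK corpus, as given above


def zipf_value(count: int, corpus_size: int = CORPUS_SIZE) -> float:
    """Zipf value: log10 of the frequency per billion words.

    Simplified sketch without smoothing, so the count must be positive.
    """
    if count <= 0:
        raise ValueError("count must be positive in this simplified sketch")
    frequency_per_billion = count / corpus_size * 1e9
    return math.log10(frequency_per_billion)


# Illustrative counts: 1, 10 and 1,000 occurrences per million words.
for count in (201, 2_013, 201_300):
    print(f"count {count:>7,} -> Zipf {zipf_value(count):.2f}")
```

A word that occurs once per million words gets a Zipf value of 3, a word that occurs 10 times per million gets 4, and so on, which is what makes the scale easy to read.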