Updated: January 25, 2015.
If you have questions or experience issues with the calculator or search pages, please contact:
Kenny Vaden.

Summary

The Irvine Phonotactic Online Dictionary (IPhOD) is a large collection of English words and
pseudowords that was developed at the University of California, Irvine for research on speech perception and production.
The collection allows researchers to select words and/or pseudowords for experiments, based on measures related
to phonemes (i.e., speech sounds). This database can be used to answer questions such as whether a word contains
relatively frequent or infrequent sound sequences, or which other English words sound similar to it.
All of the IPhOD tools on this website are freely available for academic and personal use.
There is also a blog that provides a forum for feedback,
questions, and suggestions; alternatively, email Kenny Vaden.

A growing body of evidence demonstrates that we segment (Saffran et al., 1996), respond to (Vitevitch et al., 1999),
produce (Vitevitch et al., 2004), and remember (Majerus et al., 2004) speech in ways that are affected by different kinds of
phonological frequency information. However, this speech research is restricted by the limited number of pronunciation collections
or utilities, which often derive their estimates narrowly (for instance, using small word collections) or present limited measurement
choices. While some collections, notably CELEX, address some of these concerns, they use British stress and pronunciation, which is
suboptimal for speech research with speakers of American English. The Phonotactic
Probability Calculator (Vitevitch & Luce, 2004) uses American English pronunciations to compute position-specific phonotactic
probabilities, but provides no density estimates or frequency weighting options. Despite growing interest in phonotactic information,
it remains unavailable or difficult to derive for novel hypotheses with contemporary tools.

The current version (2.0) of the Irvine Phonotactic Online Dictionary (IPhOD) is a collection of phonotactic estimates
calculated across a broad sample to enable precise verbal stimuli selection for speech research and application in cognitive science,
computational linguistics, and natural language processing. IPhOD contains phonotactic and density estimates, American English
transcriptions of 1-28 phonemes, and word frequencies for 54,030 word and 814,840 pseudoword entries. Pseudowords are defined here as
word-like transcriptions consisting entirely of phoneme-pairs from real English words. Pseudowords like these are used in computational
psycholinguistics to study non-semantic language processes, since they have little meaning or association but are consistent enough with
a language to sound like typical words. The collection is freely available to download or search online.
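
The pseudoword definition above can be illustrated with a minimal Python sketch. It assumes that a transcription is a list of phoneme symbols and that a "phoneme-pair" means two adjacent phonemes; the function names and the three-word lexicon are hypothetical, and IPhOD may apply further constraints that are not shown here.

```python
def biphones(phonemes):
    """Return the adjacent phoneme pairs in a transcription."""
    return list(zip(phonemes, phonemes[1:]))


def is_pseudoword(candidate, lexicon):
    """True if the candidate is not a real word in the lexicon but
    every adjacent phoneme pair occurs in at least one real word."""
    attested = set()
    for word in lexicon:
        attested.update(biphones(word))
    real_words = {tuple(word) for word in lexicon}
    if tuple(candidate) in real_words:
        return False
    return all(pair in attested for pair in biphones(candidate))


# Toy lexicon (hypothetical; a real run would use the full word set).
lexicon = [
    ["K", "AE", "T"],  # cat
    ["T", "AE", "P"],  # tap
    ["P", "AE", "T"],  # pat
]

print(is_pseudoword(["K", "AE", "P"], lexicon))  # pairs K-AE and AE-P are attested
print(is_pseudoword(["K", "AE", "T"], lexicon))  # already a real word
print(is_pseudoword(["T", "K", "AE"], lexicon))  # pair T-K is not attested
```

With this toy lexicon the three calls print True, False, and False.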

Each IPhOD entry contains an American English phonetic transcription from the Carnegie-Mellon Pronouncing Dictionary (Weide, 1994),
and written word frequencies from the SUBTLEXus database (Brysbaert & New, 2009). Neighborhood density and word-averaged phoneme-sequence
probabilities were computed from those data using the same formulas for words and pseudowords, so that entries of either type can
be chosen using identical criteria. Phonotactic probability and neighborhood density are calculated broadly, over the entire word set,
after the approach of Vitevitch and Luce (1999).
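
As a rough illustration of neighborhood density, the Python sketch below counts the words in a lexicon that differ from a target by exactly one phoneme. The one-phoneme substitution, addition, or deletion rule is an assumption borrowed from common practice in this literature, not a definition stated in this section; the function names and the toy lexicon are hypothetical, and no frequency weighting is applied.

```python
def is_neighbor(a, b):
    """True if transcriptions a and b differ by exactly one phoneme
    substitution, addition, or deletion (assumed neighbor rule)."""
    if a == b:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ.
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    shorter, longer = (a, b) if len(a) < len(b) else (b, a)
    # Deleting one phoneme from the longer form must yield the shorter one.
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))


def neighbors(target, lexicon):
    """List the lexicon entries that are neighbors of the target."""
    return [word for word in lexicon if is_neighbor(target, word)]


# Toy lexicon (hypothetical); transcriptions are tuples of phoneme symbols.
lexicon = [
    ("K", "AE", "T"),       # cat
    ("B", "AE", "T"),       # bat
    ("K", "AE", "P"),       # cap
    ("AE", "T"),            # at
    ("K", "AE", "T", "S"),  # cats
    ("D", "AO", "G"),       # dog
]

found = neighbors(("K", "AE", "T"), lexicon)
print(len(found), found)
```

Because the same function accepts any transcription, it applies to words and pseudowords alike, which mirrors the identical-criteria point made above. In this toy lexicon the target has four neighbors (bat, cap, at, cats).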

Phonotactic probabilities refer to the relative frequency of the sound sequences that are present in a given word. The
phonotactic measures in IPhOD extend definitions from Vitevitch and Luce (1999), elaborated upon below. The database was calculated
with two versions of each measure - one that treats vowel sounds identically regardless of syllable stress and a second version that
differentiates among vowels with different syllable stress. The stress-sensitive counts were produced to investigate the extent to which
sensitivity to phoneme order includes syllable stress. Distinguishing vowels by syllable stress divides the small number of English vowel
sounds (compared to consonants) into a larger number of categories, which changes the probability space of positions and sequences,
and also changes density counts.
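
The effect of the two versions on the symbol inventory can be sketched in Python. The sketch assumes the stress notation of the Carnegie-Mellon Pronouncing Dictionary, where vowels carry a trailing digit (0, 1, or 2); the function name and the three-word sample are illustrative only, and this is not a description of how IPhOD itself was computed.

```python
def strip_stress(phonemes):
    """Remove trailing stress digits so vowels are treated identically
    regardless of syllable stress."""
    return [p.rstrip("012") for p in phonemes]


# Small illustrative sample with stress-marked vowels.
stressed = {
    "about": ["AH0", "B", "AW1", "T"],
    "cut": ["K", "AH1", "T"],
    "butter": ["B", "AH1", "T", "ER0"],
}

# Inventory when vowels with different stress count as different symbols.
stressed_inventory = {p for word in stressed.values() for p in word}

# Inventory when stress is ignored.
unstressed_inventory = {
    p for word in stressed.values() for p in strip_stress(word)
}

print(sorted(stressed_inventory))
print(sorted(unstressed_inventory))
```

In this sample, AH0 and AH1 are separate symbols in the stress-sensitive version (7 symbols in total) but collapse into a single AH when stress is ignored (6 symbols), so any probability or neighbor count built on these symbols changes accordingly.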

Accessing IPhOD: downloads, searches, calculator

The IPhOD can be used in one of several ways. First, it may be useful to find words or pseudowords with values
in a specific range. For example, what pseudowords have between 20 and 25 phonological neighbors? What are those neighbor words?
A second approach is to determine these values for specified words or pseudowords, for example: what are the word frequencies
for cat, dog, tree, car? There are several ways to access this information in IPhOD, depending on your goal.

The IPhOD database can be downloaded in its entirety (text files) from the
download page. These files can be
opened using most available spreadsheet programs or custom Perl scripts. A second option is to
search the database online, by
entering value ranges or word lists to obtain results. Finally, there is an
online calculator that produces phonotactic and
density values for lists of phonemic transcriptions that are entered by the user. An advantage of the latter two approaches is
that you can specify which output fields to include in results, and leave out columns that are not of interest. The online
calculator is helpful for generating values for words or pseudowords that are not included in the IPhOD database, and also
can list the phonological neighbors of each input transcription.
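
For readers working with the downloaded text files in a script, a range query like the earlier example (pseudowords with between 20 and 25 phonological neighbors) might look like the Python sketch below. The tab delimiter, the column names (Transcription, IsWord, Neighbors), and all sample values are hypothetical placeholders; check the header row of the actual download before adapting it.

```python
import csv
import io

# Invented sample rows; column names and values are hypothetical.
SAMPLE = (
    "Transcription\tIsWord\tNeighbors\n"
    "K.AE.T\t1\t30\n"
    "B.L.IH.K\t0\t22\n"
    "S.T.R.IY.M\t1\t3\n"
    "V.AE.P\t0\t25\n"
    "Z.AA.SH\t0\t7\n"
)


def select_entries(handle, low, high, is_word):
    """Return transcriptions whose neighbor count lies in [low, high]
    and whose word/pseudoword status matches is_word."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["Transcription"]
        for row in reader
        if (row["IsWord"] == "1") == is_word
        and low <= int(row["Neighbors"]) <= high
    ]


matches = select_entries(io.StringIO(SAMPLE), 20, 25, is_word=False)
print(matches)
```

To run the same query on a downloaded file, pass a handle from open(path, newline="") in place of io.StringIO(SAMPLE) and rename the columns to match the file.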

Acknowledgements

The IPhOD was developed by Kenny Vaden, advised by Greg Hickok, in the Department of Cognitive Sciences, UC Irvine.
We gratefully acknowledge the contributions of Harry Halpin (Informatics, University of Edinburgh) for the original XML markup and
online search functions, as well as the various contributions of Jean-Claude Falmagne, Kai Okada, Yasmine Omidvar and Corica Rodgers
(UC Irvine).

Brysbaert, M. and New, B. 2009. Moving beyond Kucera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977-990.