Most if not all natural languages, and many popular constructed
languages, have a large number of minimal pairs, or
near-homophones: pairs of words that differ by a single phoneme, or
even by a single distinctive feature in one phoneme. Examples in
English include "fight" /fAjt/ and "bite" /bAjt/,
differing by one phoneme but at least two distinctive
features; or "seat" /si:t/ and "seed" /si:d/,
differing by a single distinctive feature. Probably such near-homophones
are more likely to persist in a natural language (or a constructed
language that comes into use as a spoken language) if the words of the
pair are unlikely to occur in the same context.

I've been experimenting with some methods to generate the initial root
vocabulary for an engineered
language in such a way that there will be no minimal
pairs. I've tried several different approaches:

1. no two words differ by fewer than two phonemes

2. no two words differ by fewer than two distinctive features

3. no two words differ by fewer than three distinctive features

For instance, if I take the first criterion and have some roots of the
form CVC (consonant-vowel-consonant), then /kan/ could coexist with
/kim/, /kel/, /koŋ/ and /kur/, but would block /kam/, /tan/,
/ken/, and so forth from being used. I've managed to quantify exactly
how this criterion reduces the number of possible words available for
a given phonology and phonotactics. It's necessary first to consider a
syllable or word in terms of a series of phoneme slots, each of which
can have any of several phonemes (possibly including null). Then
represent a word of a given length by an N-dimensional matrix where
each coordinate's value represents a particular phoneme in a
particular position. For instance, a syllabary table may represent
the set of all possible CV syllables in a given phonology: a
2-dimensional matrix. If the language allows an optional final /n/, we
can add a second layer, giving a 3-dimensional matrix with slight
thickness. We can extend the matrix into any number of dimensions, to
allow for words of more than one syllable (e.g. CVCVCV...) or words
with more complex structure (CCVCC, etc.).

For instance, a simple CV phonology with three consonants and three vowels
might be represented by:

ka ta pa
ki ti pi
ku tu pu

It's clear that there are several ways we could pick three different
monosyllables that each differ from the others by two phonemes (e.g.,
ka, ti, and pu; or ta, pi, and ku; or pa, ki, and tu); but there's no
way to get more than three out of it. We can represent this by
blocking off all the spaces in the same row and column as a word we've
picked.

Step 1:

ka ## ##
## __ __
## __ __

Step 2:

ka ## ##
## ti ##
## ## __

Step 3:

ka ## ##
## ti ##
## ## pu

If we extend this into three dimensions, by allowing a final
consonant, the second layer we add allows us to get three more
redundant words out of the system (e.g., /kin/, /tun/, /pan/); as does
the third (/kum/, /tam/, /pim/) — but a fourth layer (four final consonants, or three
consonants plus null) yields no additional benefit in terms of the
maximum number of words available. It's easy to see why; for each
cell of the matrix representing a word we pick, we must block off not
only the cells on the same row or column of the same layer, but the
cells on the same row and column of every other layer. After we've
picked nine words from the first three layers of our 3x3x4 matrix, all
the cells on the fourth layer are blocked off.
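The blocking procedure can be sketched in a few lines of Python (an illustrative reimplementation, not one of the Perl scripts described below): each word is a tuple of slot indices, and picking a word blocks every cell that shares all but one coordinate with it. The fill order here is plain nested-loop order, which is not always the optimal order.

```python
from itertools import product

def greedy_redundant(dims):
    """Greedily pick word-cells so that every pair of picked words
    differs in at least two phoneme slots, by blocking every cell
    that shares all but one coordinate with each picked cell."""
    chosen = []
    blocked = set()
    for cell in product(*(range(width) for width in dims)):
        if cell in blocked:
            continue
        chosen.append(cell)
        # Block every cell differing from this one in exactly one slot.
        for slot, width in enumerate(dims):
            for v in range(width):
                if v != cell[slot]:
                    blocked.add(cell[:slot] + (v,) + cell[slot + 1:])
    return chosen

# Three initial consonants, three vowels, four finals (3x3x4):
picked = greedy_redundant([3, 3, 4])
```

On the 3x3 CV grid this order picks the diagonal (e.g. ka, ti, pu), and on the 3x3x4 matrix it reaches the nine-word maximum.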

If the phonotactics of the language don't allow for certain phonemes
to occur next to each other, we can pre-block certain cells of the
matrix (representing words in which those forbidden combinations
occur) before we start searching. This may reduce the total number of
words available. For instance, if we have initial consonants /k/,
/j/, /m/; medial vowels /i/, /u/, /a/; and the same consonants in
final position, it's obvious we can get up to nine redundant CVC
words. But if we forbid the sequence /ji/, we must pre-block three
cells (one in each layer), and can get no more than eight redundant
words (two of the cells blocked because they have /ji/ would have been
blocked by our choice of other words anyway). If we forbid /ij/ as
well, then by the simpler algorithm described so far we must pre-block
five cells (one cell represents /jij/). It's still possible to get
eight redundant words under these constraints, as Alex Fink pointed
out in private email; he suggests starting with a set of nine
redundant words that contains /jij/ and then dropping it, getting e.g.
/jam juk mik maj mum kim kak kuj/. I haven't yet gotten around to
figuring out how to generalize this insight for any arbitrary set of
constraints in any number of dimensions, or implement it in my
scripts, but I plan to work on it the next time I am getting ready to
generate vocabulary for a particular conlang (e.g., when I eventually
build a large enough corpus for säb
zjed'a to make corpus frequency analysis meaningful, and am ready
to relex it with a new (hopefully more euphonious!) set of redundant
morphemes). I suppose one way to do it might be to first figure out
how many constraints each cell violates, then find a cell that
violates the most constraints, and let that be the starting
point for the script's geometrical iteration over the set of all
cells; then remove the offending word we started with when done...?
Or iterate over the cells in order of most to least constraints they
violate, instead of in the straightforward geometrical order
described below, and then drop all words that violate any constraints
at the end? This needs a lot more work. [updated 2008/8/7]
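Fink's set can be checked mechanically; here is a brute-force Python sketch (purely illustrative, not one of the scripts described below):

```python
from itertools import combinations

# Alex Fink's eight-word set for the /k j m/ x /i u a/ x /k j m/
# phonology with the sequences /ji/ and /ij/ forbidden:
words = "jam juk mik maj mum kim kak kuj".split()

def distance(a, b):
    """Number of phoneme slots in which two words differ."""
    return sum(x != y for x, y in zip(a, b))

# Every pair differs in at least two of the three slots...
assert all(distance(a, b) >= 2 for a, b in combinations(words, 2))
# ...and no word contains a forbidden sequence.
assert not any("ji" in w or "ij" in w for w in words)
```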

The maximum number of redundant words to be extracted from a given
phonology, assuming there are no forbidden sequences, is equal to the
product of the width of the matrix in all dimensions except the
widest. (If the width is equal in all dimensions, that is, if the same
number of phonemes can occur in every slot, we can treat any one
dimension as the widest, discard it, and multiply the rest: for a
3x3x3 matrix representing CVC syllables with three consonants and
three vowels, the maximum is 3 * 3 = 9.) So for
instance, with ten initial consonants, an optional semivowel (one of
two; so three possibilities for the second slot), one of five vowels,
and an optional final nasal (one of three; so four possibilities for
the fourth slot), the dimensions are 10x3x5x4, and the maximum number
of redundant words is 3 * 5 * 4 = 60. It wouldn't matter if there
were only 5 initial consonants allowed, or as many as 100; the maximum
number of words would still be 60. The redundant words available in
an N-slot phonology are bottlenecked by the (N-1) smallest dimensions
in the matrix representing it.
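This rule is simple enough to state as a one-line computation; a Python sketch of the bound just described:

```python
def max_redundant_words(dims):
    """Maximum number of words pairwise differing in at least two
    slots: the product of every slot width except one widest."""
    widths = sorted(dims)
    total = 1
    for w in widths[:-1]:  # drop a single widest dimension
        total *= w
    return total
```

For the 10x3x5x4 example, max_redundant_words([10, 3, 5, 4]) gives 3 * 5 * 4 = 60, and the answer stays 60 with only 5 initial consonants (sorted widths 3x4x5x5, dropping one 5) or with 100.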

We can extend this method to finding words that differ by at least
two distinctive features. Each dimension of the matrix will represent a
distinctive feature of a phoneme in a given slot. For instance, CV
syllables with voiced and unvoiced fricatives and plosives in three
points of articulation, and front or back vowels with three heights,
could be represented by a six-dimensional matrix; one dimension (width
3) represents the point of articulation of the consonant, another
dimension (width 2) represents its manner of articulation (plosive or
fricative), a third dimension (thickness two) represents its voicing,
and the other two dimensions represent the vowel's height and
front/backness. If we follow the same method as before to pick out
cells representing words and block the cells in the same row, column,
stack... as each word-cell picked so far, we will come up
with a set of CV words where every word differs from the others by at least two
distinctive features — perhaps both in the same phoneme (/pa/
vs. /va/ vs. /ga/), perhaps in different phonemes (/pa/ vs. /bu/ vs. /fo/).
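The six-dimensional example can be made concrete by encoding each phoneme as a tuple of feature values. The table below is a hypothetical illustration (the phoneme symbols and feature labels are my assumptions, not the author's notation), but it reproduces the distances in the examples above:

```python
# Hypothetical feature tables: twelve consonants (three places of
# articulation x stop/fricative x voiced/voiceless) and six vowels
# (three heights x front/back). Symbols are illustrative choices.
CONSONANTS = {
    "p": ("labial", "stop", "voiceless"),      "b": ("labial", "stop", "voiced"),
    "f": ("labial", "fricative", "voiceless"), "v": ("labial", "fricative", "voiced"),
    "t": ("alveolar", "stop", "voiceless"),    "d": ("alveolar", "stop", "voiced"),
    "s": ("alveolar", "fricative", "voiceless"), "z": ("alveolar", "fricative", "voiced"),
    "k": ("velar", "stop", "voiceless"),       "g": ("velar", "stop", "voiced"),
    "x": ("velar", "fricative", "voiceless"),  "G": ("velar", "fricative", "voiced"),
}
VOWELS = {
    "i": ("high", "front"), "e": ("mid", "front"), "E": ("low", "front"),
    "u": ("high", "back"),  "o": ("mid", "back"),  "a": ("low", "back"),
}

def feature_distance(w1, w2):
    """Total number of distinctive features in which two CV words differ."""
    f1 = CONSONANTS[w1[0]] + VOWELS[w1[1]]
    f2 = CONSONANTS[w2[0]] + VOWELS[w2[1]]
    return sum(a != b for a, b in zip(f1, f2))
```

With this table, /pa/ vs. /va/ is 2 (manner and voicing, both in the consonant), while /pa/ vs. /bu/ is 2 (consonant voicing plus vowel height) and /pa/ vs. /fo/ is 2 (consonant manner plus vowel height).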

It's interesting to note that the order in which we pick words (and
thus block off other potential words) may determine how many words we
can get. For instance, with a 3x3x3 matrix we can get as few as seven
words if we pick cells in an unwise sequence. Generally the best
method I've found is to start in a corner, proceed diagonally whenever
we've just filled a cell successfully, stay in one plane until it is
completely filled in, and use the same row/column as our starting
point when proceeding to the next plane. This always works when the
matrix is equally thick in all dimensions, and usually when there are
different thicknesses in different dimensions. There are some cases
where this doesn't fill in the matrix as efficiently as possible; with
some configurations of particular thickness in four or more
dimensions, the direction you proceed in matters a lot. I haven't
figured out all the details yet.
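The effect of pick order on a 3x3x3 matrix can be demonstrated deterministically (an illustrative Python sketch, not the author's scripts; the "diagonal-first" ordering here is one concrete Latin-square-style order, chosen as an assumption to show the contrast):

```python
from itertools import product

def greedy_fill(order):
    """Keep a cell only if it differs from every kept cell in at
    least two of its three coordinates."""
    chosen = []
    for cell in order:
        if all(sum(a != b for a, b in zip(cell, kept)) >= 2 for kept in chosen):
            chosen.append(cell)
    return chosen

cells = list(product(range(3), repeat=3))

# Plain nested-loop order turns out to be an unwise sequence here.
naive = greedy_fill(cells)

# Diagonal-first order: visit cells satisfying k = (i + j) mod 3
# before all others (a Latin-square pattern).
diagonal_first = sorted(cells, key=lambda c: (c[0] + c[1]) % 3 != c[2])
best = greedy_fill(diagonal_first)
```

The naive order yields only seven words, while the diagonal-first order reaches the nine-word maximum.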

I have figured out, empirically, that filling in the cells in a
random order instead of in the above systematic way has more adverse
consequences the higher the number of phoneme slots. With three
dimensions each of thickness 3, you can do no worse than find seven
redundant words. In general, as long as you have only three dimensions
random order is not much worse than systematic order. But in four
or more dimensions, the average performance of a random fill-in
gets worse and worse.

Each of the above criteria for a minimum degree of redundancy is in
tension with other desirable goals for a usable constructed language:
conciseness and euphony. With any given phonology, the first
redundancy criterion drastically reduces the number of words available
at a given length; so given the same phonology, one might produce
hundreds of monosyllables and thousands of disyllables, never needing
any trisyllables, without this criterion; but if this degree of
redundancy is required, one gets only a few tens of monosyllables and
a few hundred disyllables, and needs trisyllables and perhaps even
tetrasyllables for thousands of less common words. This applies a
fortiori if the phonology is designed for a high degree of
euphony (which typically means a limited phoneme inventory and tight
restrictions on permitted consonant clusters and diphthongs; therefore
narrowing the matrix's thickness and pre-blocking many word-cells).
So probably a strict use of this criterion would produce a fairly
verbose language: either a large core vocabulary with many common
words being polysyllabic roots, or a small core vocabulary with many
common words being long compounds of short roots.

An alternate, more conservative way of using these methods would be to
apply them not to the vocabulary of a language as a whole, but to sets
of words within a given semantic domain, or a given distributional
category. Thus one would ensure that no two words likely to occur in
the same context would be minimal pairs. For instance, with a CVC
phonology with 10 consonants and 5 vowels, one might generate several
different internally redundant sets of 50 words, and use one set for
physical verbs, one for mental verbs, one for concrete adjectives, one
for abstract adjectives, one for animate nouns, one for inanimates,
etc. Words in a given category might have near-homophones in another
category, but not in the same category. I would be interested in
hearing from anyone who decides to use this approach in one of their
conlangs, especially if you find my scripts helpful.

In April 2006 I started working on säb zjed'a, an engineered language
that used this methodology to generate the set of wordforms from which
its initial vocabulary was taken. The self-segregating morphology
scheme I've considered has fricatives always and only at the beginning
of a morpheme, optionally followed by any of several stops, nasals,
liquids and semivowels before the first vowel, and the same set of
non-fricative consonants allowed in final position; this results in a
few hundred redundant monosyllables, most of them easily pronounceable
though not always euphonious. So far säb zjed'a is a lexically
minimalist language, with more morphemes than Toki Pona or Ygyde but
fewer than Lojban or
gjâ-zym-byn. I expect I will
use this same methodology (with the improvements suggested by Alex Fink)
when I eventually relex the language based on a corpus frequency analysis,
after its corpus grows large enough for such a frequency analysis
to be meaningful — though with a different set of phonology
input files; I'm finding it not euphonious enough to be fun to
work with, which is why building the corpus to analyzable size
has been so slow.

I wrote several Perl scripts to generate redundant vocabulary for
a given phonology. Here they are:

gen-redundant-morphemes.pl - Reads a format file specifying a
phonology, and generates a redundant set of morphemes. You'll need
one format file for each possible number of syllables.

gen-redundant-morphemes-3dim.pl - the prototype, which works only
for exactly three dimensions. There's no point in using this
extensively, but you might want to study the code here to get a better
idea for how the newer and more powerful version works, in case you
want to modify it. The newer version has some tricky bits where it
builds up a block of Perl code in a string variable and then
evals it; this prototype has the simpler fixed code that the
runtime-generated code of the newer version is based on.

gen-all-possible-morphemes.pl - Reads a format file (same
format as with gen-redundant-morphemes.pl) and
generates all possible morphemes for this phonology (with no attempt at
redundancy).

filter_too_similar_strings.pl - Reads strings, one per
line, from standard input, and writes a redundant subset to standard
output. I usually used it in a pipeline after
gen-all-possible-morphemes.pl before I figured out how to
write the newer version of gen-redundant-morphemes.pl that
can handle an arbitrary number of dimensions. I still use it to comb
over the combined output of several runs of
gen-redundant-morphemes.pl that produced words of different
numbers of syllables for the same phonology; it has an option to
filter out strings of which one is a substring of another. Also, it
has an option -m that sets the minimum number of characters of
distinctness (default 2); if you set -m 3 you
get behavior that gen-redundant-morphemes.pl isn't yet
capable of, finding words that differ by at least three
phonemes or distinctive features. I am far from satisfied that
this script does that in the most efficient possible way, however.

unsort.sh - Reads an entire file from standard input, de-sorts
(randomizes) the order of the lines in the file, and writes to standard
output. Requires gawk. (Would be easy to rewrite in Perl,
but if it ain't broke, don't fix it...) Used in a pipeline between
gen-all-possible-morphemes.pl and
filter_too_similar_strings.pl to vary the output; I wrote
some shell scripts to repeat this process with the same phonology
format files and keep the largest sets of generated words.

replace_by_map.pl - Reads a replacement map file, with search and
replace text separated by tabs on each line, and applies those
replacements to every line of standard input, and writes to standard
output. One use of this is to fix strings that represent a phoneme by
its distinctive features — e.g., "kK0" = /k/ (velar stop,
unvoiced), "kK1" = /g/ (velar stop, voiced), "tK0"
= /t/ (alveolar stop, unvoiced), etc. — and turn them into more
standard orthography. I use this as the last stage in a pipeline when
generating words with at least two or three distinctive features.
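For illustration, here is a Python sketch of the kind of transformation this script performs (the map is inlined here rather than read from a tab-separated file, and the "tK1" entry is my own extrapolation from the examples above, not part of the original):

```python
# Feature-notation-to-orthography map, as described above.
feature_map = {
    "kK0": "k",  # velar stop, unvoiced
    "kK1": "g",  # velar stop, voiced
    "tK0": "t",  # alveolar stop, unvoiced
    "tK1": "d",  # alveolar stop, voiced (assumed entry)
}

def apply_map(line, mapping):
    """Apply each search/replace pair to the line, longest keys first
    so that longer feature codes are never shadowed by shorter ones."""
    for key in sorted(mapping, key=len, reverse=True):
        line = line.replace(key, mapping[key])
    return line
```

For example, apply_map("kK1a tK0i", feature_map) returns "ga ti".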

gen-fmt.pl - I used this to generate some of the phonology format
files used in the regression test. It produces alternate consonant
and vowel slots with a specified number of phonemes in each (using a series
of numbers on the command line).

Note 1 - "fight"
/fAjt/ and "bite" /bAjt/ differ as fricative vs. stop, and
unvoiced vs. voiced; arguably the points of articulation (labiodental
and bilabial) are fairly distinct too, though English has no
labiodental stops or bilabial fricatives, so this isn't contrastive
by itself without the fricative/stop distinction.