Whilst it cannot be doubted that this is an interesting and laudable
idea, there are problems inherent in a corpus approach to
phonetics/phonology (the distinction is unclear in the original
post). Something like this is needed - books like Maddieson's Patterns
of Sounds (CUP 1984) form the basis of much interesting work in
phonological universals, and many interesting phonetic sketches of
languages have been produced which occasionally make it into journals
such as JIPA.
However, ideally phonetic analysis takes repetitions of a single
variable from several speakers under strictly controlled conditions,
and a long connected text read aloud may provide neither enough data
for analysis nor sufficiently controlled conditions. For example, the vowel /a/
may occur twenty times, but in different segmental, prosodic,
intonational and positional contexts, all of which can affect factors
such as duration and formant frequencies. And if two speakers of the
same language read the same long text, there may be variations in
rhythm and pausing which are not apparent in shorter sentences, such
as the ones normally used in phonetic analysis. So the phonetic
utility of such texts is doubtful.
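To make the point about contexts concrete, here is a rough sketch in
Python. The token list is invented purely for illustration (it is not
taken from any real recording), and the context labels and the
threshold of three repetitions are my own assumptions. It simply
counts how many tokens of /a/ share exactly the same context.

```python
from collections import Counter

# Hypothetical tokens of /a/ from a read text, each tagged with
# (segmental frame, stress, position in phrase). Invented data.
tokens = [
    ("p_t", "stressed", "phrase-medial"),
    ("p_t", "unstressed", "phrase-final"),
    ("m_n", "stressed", "phrase-medial"),
    ("k_#", "stressed", "phrase-final"),
    ("p_t", "stressed", "phrase-medial"),
    ("s_d", "unstressed", "phrase-initial"),
    ("m_n", "unstressed", "phrase-medial"),
    ("b_g", "stressed", "phrase-final"),
]


def repetitions_per_context(tokens):
    """Count how many tokens fall into each distinct context."""
    return Counter(tokens)


def usable_contexts(counts, min_reps=3):
    """Keep only contexts with at least min_reps tokens (assumed threshold)."""
    return {ctx: n for ctx, n in counts.items() if n >= min_reps}


counts = repetitions_per_context(tokens)
usable = usable_contexts(counts)
print(len(tokens), "tokens in", len(counts), "distinct contexts")
print("contexts with enough repetitions:", usable)
```

With this toy data the tokens are spread so thinly over contexts that
none of them reaches the threshold, which is the worry with long
connected texts in a nutshell.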
In phonological terms, not only does a text not provide the necessary
data for deciding which oppositions are contrastive, it may not give
examples of all phonemes for a language. So for Amharic, the ejective
/p'/ may not occur, and in English /T/ [theta] may not crop up. The
'marginal' nature of such phonemes is not uninteresting, but larger
patterns of rarity may simply reflect historical accidents. In English, for example,
very few words with a long vowel + final /b, d, g/ occur,
e.g. league, barb (for non-rhotic speakers like me), although final /d/ is
fairly common. Recent coinings like Beeb for BBC show that there is no
phonological constraint on such words occurring, but they don't crop
up as regularly as their counterparts with voiceless codas (e.g. beat,
meat, seep, sheep, soup, seek, park) for whatever historical
reasons. A random text may contain no such words with a voiced
plosive, and so lead one to conclude, wrongly, that English, like German, does not
allow phonologically voiced plosives in coda position.
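The same worry can be sketched in a few lines of Python. Everything
here is an assumption made for illustration: the inventory is
deliberately partial, the transcriptions are rough, "T" stands for
theta as above, and the sample "text" is just a handful of the words
mentioned in this post.

```python
# Assumed, deliberately partial inventory ("T" = theta).
INVENTORY = {"p", "b", "t", "d", "k", "g", "T", "s", "i:", "u:", "a:"}
VOICED_PLOSIVES = {"b", "d", "g"}

# Hypothetical sample text: each word is a list of phoneme symbols.
sample = [
    ["b", "i:", "t"],  # beat
    ["s", "i:", "k"],  # seek
    ["p", "a:", "k"],  # park
    ["s", "u:", "p"],  # soup
]


def unattested(inventory, words):
    """Phonemes of the inventory that never occur in the sample."""
    seen = {phoneme for word in words for phoneme in word}
    return inventory - seen


def has_voiced_plosive_coda(words):
    """True if any word in the sample ends in /b/, /d/ or /g/."""
    return any(word[-1] in VOICED_PLOSIVES for word in words)


missing = sorted(unattested(INVENTORY, sample))
print("not attested in sample:", missing)
print("voiced plosive in coda attested:", has_voiced_plosive_coda(sample))

# Adding a single word such as Beeb changes the picture entirely.
with_beeb = sample + [["b", "i:", "b"]]
print("after adding Beeb:", has_voiced_plosive_coda(with_beeb))
```

A sample like this would leave theta unattested and show no voiced
plosive in coda position at all, yet one extra word is enough to
overturn the second "finding". The absence tells us about the sample,
not about English.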
So I think a corpus should not be built from connected texts, but
with more traditional phonetic and phonological methods. A purely
text-based approach also has the drawback that unwritten languages
cannot be represented.
I look forward to reading what other List users think.
Mark Jones