This may be an odd question for this site, but tonight I've been enjoying myself by creating a small script that produces (is supposed to produce) sample sentences that resemble English, while being total gibberish.

The idea came from reading a question on StackOverflow.com which involved word wrapping of a text. Some people would use the Lorem Ipsum quote to generate a sample text for demonstration purposes, and I thought this would be a nice use for a random text generator.

The very intriguing Wug test was also at the back of my mind, along with the fact that it is relatively easy to read a sentence with scrambled words, as long as the beginning and end letters remain the same.

Obviously, words should contain at least one vowel. It might in fact be a good idea to make vowel insertion a distinct part of the process. Some consonants should not follow each other (e.g. tvnvcf), and there should not be too many of them in a row.

I was looking for a distribution of the last letters in English words, but that may not be applicable, since word endings can be fairly similar (ing, ane, tion, able, etc), and that might add some familiarity to the sentences.

I'm looking for ideas. Links to resources. Rules of thumb. What can I do to make my script spout more legible gibberish?

In short, what are the general rules for building an English-looking word?

For one, there is not much rationale in 'inserting vowels'. You need to be dealing with 'syllables'. Pick all possible consonant permutations and combinations along with collocating vowels -- that is an algorithmic way to generate artificial syllables. Join the syllables to form words.
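That syllable-building idea might look something like this in Python (a rough sketch; the onset/vowel/coda inventories below are made up for illustration, not derived from English statistics):

```python
import random

# Hypothetical inventories -- a real generator would weight these by
# English frequency rather than choosing uniformly.
ONSETS = ["b", "d", "f", "g", "h", "k", "l", "m", "n", "p",
          "r", "s", "t", "w", "st", "tr", "th", "ch", "sh"]
VOWELS = ["a", "e", "i", "o", "u", "ai", "ea", "ou"]
CODAS  = ["", "", "n", "t", "r", "s", "l", "ng", "st"]

def syllable():
    """One onset-vowel-coda syllable (the coda is often empty)."""
    return random.choice(ONSETS) + random.choice(VOWELS) + random.choice(CODAS)

def word(max_syllables=3):
    """Join one to max_syllables syllables into a word."""
    return "".join(syllable() for _ in range(random.randint(1, max_syllables)))

print(" ".join(word() for _ in range(8)))
```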
–
Kris Jan 20 '12 at 9:52

Is this really on-topic? It sounds more like a linguistics question than an English one.
–
Karl Knechtel Jan 20 '12 at 11:08

3 Answers

If you pick up an intro to linguistics text, it'll have something for you (like "Relevant Linguistics" by Paul Justice). I know that RL deals with this problem specifically. The key addition from linguistics would be that the way we produce sounds physically affects what kinds of words can be "believable" or even "conceivable."

For example, in your random text, there's a "word" called "Ynssdto." Let's make the Y sound like a short I (like "in") and call the double S's a single S sound (like "guess"). That brings us to an odd combination of what we call "alveolar-dental plosives" (if my terminology isn't too rusty). ADP's are "explosive"-type sounds (they make a puff of air) produced by placing our tongue where our teeth meet the roof of our mouths. This combination of sounds is not possible in English, and I would wager in any language. You'd need a vowel BETWEEN those two sounds. Like in "tada!"

I know nothing of programming, but here's what I think could solve the problem. First, classify letters by their manner of production, then assign rules governing their distribution in the words. One rule might be that "dental plosives cannot follow one another in the same word."

Or "no interdental fricatives can follow one another in the same word" (IDF's are the "th" sounds in "this" and "thin". Fricatives produce sound by buzzing...think "friction"..."sh" "z" and "s" are all fricatives). I bet that no two fricatives of any sort can follow each other (like "th" + "sh" + "z").

Or "two stops cannot follow each other," as in "gckort." [g] and [k] are both velar stops, made by stopping the flow of air with the back of the tongue for a moment. Similarly, a velar stop could not combine with an "alveolar stop" like [t] without a vowel in between. Gt? No. Git? OK.
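A crude sketch of this rule-based filtering in Python (the sound classes below are a loose approximation, with written letters standing in for phonemes; "c" is treated as the /k/ sound):

```python
# A toy version of the idea above: classify letters by rough manner
# of articulation, then reject candidate words that break adjacency
# rules. These classes are illustrative, not real phonology.
PLOSIVES   = set("pbtdkgc")
FRICATIVES = set("fvsz")

def plausible(word):
    """Reject words where two plosives or two fricatives are adjacent."""
    for a, b in zip(word, word[1:]):
        if a in PLOSIVES and b in PLOSIVES:
            return False
        if a in FRICATIVES and b in FRICATIVES:
            return False
    return True

print(plausible("gckort"))  # False: g and c are both plosives
print(plausible("git"))     # True
```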

Some good news: linguists have already classified manner of production for all phonemes (sounds) in all languages. RL actually gives a short set of rules for combining phonemes, and some nonsense words to demonstrate how these rules work! This will be a BIG step in the right direction.

BTW RL is a user-friendly text that should be very accessible, but for that exhaustive list of phoneme production location, you might need to grab a more detailed text.

This is a good idea. I did some quick research and wrote up a script; unfortunately, the output looks more like some African language. For example: olovow iheigh akieyures ivoula ocheege adie ohor tafe wamun hailure sour. It needs more statistics, or some such. I'll continue tweaking it.
–
TLP Jan 20 '12 at 11:09

That's interesting, and it brings up a good question: how should English "sound"? "Ivoula" does sound African (it reminds me of the soccer player Douala). There's a lot of Scandinavian and German influence on English, so maybe you could look for some patterns there. IDK if this is true in actuality, but I reckon English might have more consonant sounds in the beginning and end positions of words (<- that sentence follows this "rule" for both beginning and end sounds in 17 of 30 words, if you count "w" as a consonant sound). Maybe you could program that sort of thing with a probability rule.
–
thad Jan 20 '12 at 21:29

Also, I see you mentioned the wug test. You might classify words as parts of speech and then have "verbs" take on an S from time to time, for example. You are creating random words, so that might make it difficult. The wug test had the luxury of deciding that a word was a verb or noun, and adding familiar morphemes like -s, -ed, or -ing to them. I suppose you'd need your program to remember the words it's created and filter them into groups (noun, verb, and "other"). Nouns can take on plural -s, verbs -s, -ed, -ing. You might also try to distribute verbs with some regularity.
–
thad Jan 20 '12 at 21:36

I once wrote a program (long lost now, sorry) which processed a corpus of text, recording the frequency of each character given the previous sequence of n characters. For example, for the word "hello" and n = 3, it would do:

frequency[null,null,null,h]++
frequency[null,null,h,e]++
frequency[null,h,e,l]++
frequency[h,e,l,l]++
frequency[e,l,l,o]++
frequency[l,l,o,end]++

Then it would generate words by starting with a random letter, then picking the next letter using the weightings obtained from the corpus, until it picks an 'end'.
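A minimal reimplementation of that scheme might look like this in Python (not the lost original; the tiny corpus is just for illustration, and n = 3 as in the example above):

```python
import random
from collections import defaultdict

def build_model(words, n=3):
    """Count, for each n-character context, how often each next
    character follows it, using None as padding and "end" as the
    word terminator."""
    model = defaultdict(lambda: defaultdict(int))
    for word in words:
        context = (None,) * n
        for ch in list(word) + ["end"]:
            model[context][ch] += 1
            context = context[1:] + (ch,)
    return model

def generate(model, n=3):
    """Walk the model from the empty context, picking each next
    character with probability proportional to its observed count."""
    context, out = (None,) * n, []
    while True:
        choices = model[context]
        ch = random.choices(list(choices), weights=list(choices.values()))[0]
        if ch == "end":
            return "".join(out)
        out.append(ch)
        context = context[1:] + (ch,)

corpus = ["hello", "help", "yellow", "mellow", "hollow"]
model = build_model(corpus)
print(generate(model))
```

Because the context slides one character at a time, words longer than n + 1 letters come out naturally: each new letter only depends on the previous n, not on the whole word so far.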

For too low values of n, you get unpronounceable words.

For too high values of n, you tend to get mostly real words.

If you tune n just right, you get a good selection of novel, pronounceable, English-looking words (or whatever language corpus you fed in). It's quite fun seeing what difference the corpus makes. The works of Shakespeare generate qualitatively different words from the Bible or a plain Scrabble word list, for example.

I think this could be extended to sentences. One simple thing to try would be to treat the space character as just another letter. You could go one better and adjust the window size n depending on whether you're on a word boundary etc.

You could also try to classify your nonsense words into parts of speech based on some statistical heuristics (Not trivial I guess. Train a neural net to do it!)

I understand your basic idea; however, I'm not quite sure how to implement it. With n = 3, how would you construct words longer than 4 letters? I think it is a promising idea, as it may do everything in one fell swoop, removing the need for externally applied logic.
–
TLP Jan 20 '12 at 11:13

Okay, I do not have access to weights in my language, but it is basically just a more economical way of randomizing an array. So, with a few texts by Shakespeare and George Bernard Shaw, I think I am getting somewhere. There is a risk of running into dead ends, causing wildly varying string lengths and endless loops, but it certainly looks like English. Too much so, in some cases. =) purges whoughts whated s illay paing an of i s ge blififin you hemba t honea man ds thenes alls he to he nate wayeshavion the i guit youtled a dred thing a mor he nesn her eand of woe in for intand the
–
TLP Jan 20 '12 at 11:50

You may have to write your own weighted selection algorithm. Something's wrong if you get the word 's' as output: there is no "blank,s,end" in your source text, so the probability of randomly picking 'end' after "blank,s" should be zero.
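One common way to write such a weighted selection is cumulative sums plus a binary search; here is a sketch in Python (hypothetical, since TLP's language isn't specified):

```python
import bisect
import itertools
import random

def weighted_choice(weights):
    """Return an index with probability proportional to weights[index].

    Zero-weight entries can never be picked, which is exactly the
    guarantee described above: an unseen transition has count 0 and
    is therefore never selected.
    """
    cumulative = list(itertools.accumulate(weights))
    r = random.random() * cumulative[-1]   # r is in [0, total)
    return bisect.bisect_right(cumulative, r)

# Toy example: 'b' has weight 0 and is never returned.
counts = {"a": 5, "b": 0, "c": 1}
letters = list(counts)
print(letters[weighted_choice(list(counts.values()))])
```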
–
slim Jan 20 '12 at 11:58

I actually optimized it a little, reading through the entire text without the end character. Instead, a space gets inserted where words end. This automatically forms words of varying length while preserving some realistic word endings. But as you say, there are drawbacks to this solution.
–
TLP Jan 20 '12 at 12:07

Not quite what you're asking for, but one of the more entertaining ways of creating gibberish is Dissociated Press: find repetitions of a random short string in the source text -- take X letters, maybe 3-4 -- split the text and continue from a random occurrence of the same string, keep going for a while, then repeat at another random place.

Based on your question, here it is with 2-letter strings:

This may be be a night I've been enjoyivolved words, as lolow.com why this wought, why the Loremain the sambled wor.
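The procedure might be sketched like this in Python (a rough interpretation of Dissociated Press; the 2-letter overlap matches the example above, while the 4-10 character run length before each jump is an arbitrary choice):

```python
import random

def dissociate(text, k=2, length=120):
    """Dissociated press: copy a short run from the source, then jump
    to another occurrence of the last k characters and continue."""
    pos = random.randrange(len(text) - k)
    out = text[pos:pos + k]
    while len(out) < length:
        seed = out[-k:]
        # Every place (except the very end) where the current
        # k-letter tail occurs in the source.
        occurrences = [i for i in range(len(text) - k)
                       if text[i:i + k] == seed]
        if not occurrences:
            break
        pos = random.choice(occurrences) + k
        out += text[pos:pos + random.randint(4, 10)]
    return out[:length]

source = ("This may be an odd question for this site, but tonight "
          "I've been enjoying myself by creating a small script that "
          "produces sample sentences that resemble English.")
print(dissociate(source))
```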