
Spinning and Text Processing

I have a dirty secret to share, and I hope you won't think less of me
once you learn it. I used to be in the internet marketing world and pitched
my coaching programs and DVD sets from stages around the United States.
Yes, for $999, I'd teach you how to make money online, and if
you were one of the first three to sign up, I'd even throw in my
friend's dynamite ebook absolutely free!

Truth is, I didn't last long in that space because I'm much more of
a do-er than a salesperson, and it would bug me to no end when people
would buy my coaching package (at 20% off, but only if you sign up right
now!) and then never actually open it and use it to at least try
their hand at creating an online business.

That's all in the past, fortunately, but I've retained an interest
in those business opportunity pitches and what they're actually
selling. Just like the cliché envelope-stuffing job (you know: "Send me
$200 in an envelope, and I'll show you how to ask people to send you
money!"), it turns out that a lot of online businesses still are predicated
on gaming search engines to gain traffic to pages selling daft and usually
worthless things.

And, one way that these entrepreneurs game Google and other search engines
is by "spinning" to produce lots and lots of content from a single
article that they've paid someone a few bucks to write in the first
place.

It's all rather uninspiring, except the spinning idea itself is rather
interesting, and I've been toying with writing a shell script to allow
easy article spinning for quite a long time. There are more prosaic,
less questionable uses for this technology too, such as programs or even
games whose text messages benefit from a bit of variety.

The {idea|concept|inspiration} is that each time you'd use a
{word|phrase} you instead list a set of {similar words|synonyms|alternative
words} and the software automatically picks one {randomly|at random}.

So the previous sentence would come out of the spinner as "The idea is
that each time you'd use a phrase you instead list a set of alternative
words and the software automatically picks one at random." Got it? Easy
enough.

A more advanced spinner might actually tap a thesaurus and, each time it
sees a word, push out a set of synonyms automatically, which the spinner
script then randomly narrows down to one each time it's invoked.

In fact, go read spam blog comments or spam email, and you'll see the
output of this sort of contextless sentence manipulation. It can
be...weird, like this:

she's got arriving in can easily dresses, still Beth may be 36 yr
old men's city servant, outdoors of waking time 'en femme'.
she's single, symmetrical in addition thinks to achieve marital,
"Eventually..."

But hey, just because there are bad uses doesn't mean it's not an
interesting project to try to code, right? I trust you to exercise good
judgment of your own when you explore this script, okay?

Spinning Out the Spinner

The basic tasks of the script are straightforward: parse the input, isolate
each word-choice block, pick one at random, then reassemble everything and
display it.

To make things a bit easier, I'm going to start by using fmt to make
each paragraph one really long line. That way, I then can break the input
into lines that don't have a word-choice block and those that do:

fmt -w$bigwidth "$1" | tr '{' '\n' | tr '}' '\n'

For example, an input line like this:

An input line like {this|demo} would then transform.

comes out as three separate lines:

An input line like
this|demo
would then transform.

See how that works? I'm going to use fmt again at the end of the
process to clean up the output.
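To try the splitting step on its own, here is a minimal sketch. The function name split_blocks and the bigwidth value of 2000 are assumptions for illustration; the width only needs to exceed the longest paragraph, and some fmt implementations cap how large it can be.

```shell
#!/bin/bash
# Assumed value: wide enough that fmt never wraps a paragraph.
bigwidth=2000

# split_blocks (hypothetical name) reads text on stdin, joins each
# paragraph into one long line, then turns both braces into newlines so
# every {a|b|c} block ends up on a line of its own.
split_blocks() {
  fmt -w"$bigwidth" | tr '{' '\n' | tr '}' '\n'
}

# Prints three lines: the text before the block, the choices, and the
# text after the block.
printf '%s\n' 'An input line like {this|demo} would then transform.' | split_blocks
```

The spaces that surrounded the braces are still attached to the neighboring lines at this stage; reading the lines with read in the next step trims them.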

One facet of shell script programming that most people don't realize is
that a loop can sit in a pipeline just like any other command (where it
runs in its own subshell), so rather than waste space and time with a
temporary file, I'll pipe the output of the fmt|tr sequence directly into
a while loop:
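As a minimal sketch of that structure: the SPIN THIS: marker matches the sample run in this column, but the function name mark_blocks, the loop body and the bigwidth value are assumptions, not necessarily the exact listing.

```shell
#!/bin/bash
# Assumed value: wide enough that fmt never wraps a paragraph.
bigwidth=2000

# mark_blocks (hypothetical name) takes a filename and flags every line
# that contains a set of choices.
mark_blocks() {
  fmt -w"$bigwidth" "$1" | tr '{' '\n' | tr '}' '\n' | \
  while read -r line
  do
    case "$line" in
      *'|'*) printf '%s\n' "SPIN THIS: $line" ;;  # line holds choices
      *)     printf '%s\n' "$line" ;;             # plain text, pass through
    esac
  done
}

# Only run when a filename was given on the command line.
if [ $# -gt 0 ]; then
  mark_blocks "$1"
fi
```

Because the loop runs in a subshell, any variables set inside it are gone once the loop finishes, which is fine here since the loop only prints.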

See how the fmt line ends with | \, and that feeds directly into the while
loop? Very handy structure!

Now I'm going to run this code snippet with the sample input file to see what
happens:

$ sh spinner.sh spinme.txt
The
SPIN THIS: idea|concept|inspiration
is that each time you'd use a
SPIN THIS: word|phrase
you instead list a set of
SPIN THIS: similar words|synonyms|alternative words
and the software automatically picks one
SPIN THIS: randomly|at random
.

That pesky period on its own line is a glitch that'll need to be fixed
later, but the basic structure of the script is sound: you can parse and
break down the input file data and identify which new lines are selector
lines.

The Spinning Function

Instead of just prepending SPIN THIS: to a line that has choices, the
script can call a function there: a separate block of code that does the
actual work.

One of the most interesting parts of the function is how it figures out how
many options there are in the given string. It's a specific instance of
the general question "how many occurrences of X are in string Y?", and it
exploits the little-known -o flag to grep:

grep -o '|' <<< "$*" | wc -l

Take a deep breath; I can talk you through this one! The <<< notation is a
variation on the here document (<<) you've hopefully already seen in
scripts; bash calls it a here string. The difference is that the text that
follows is fed to the command as a single string on stdin.

The "$*" produces the entire argument as given to the function in
the main block of the script, and the | is the character being
counted. The -o flag makes grep print each match on a line of its own
instead of printing the whole matching line once, so wc -l ends up
producing the number of matches (in this case, the number of delimiters in
the line).
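The same trick answers the general "how many X in Y" question. Here is a small sketch; count_char is a hypothetical helper name, and -F is added so the character is always taken literally rather than as a regular expression. Note that <<< needs bash (or ksh or zsh); a plain POSIX sh may not accept it.

```shell
#!/bin/bash
# count_char (hypothetical helper): $1 is the character to count,
# $2 is the string to search.
count_char() {
  # -o prints every match on its own line, -F treats $1 literally,
  # and wc -l counts those lines.
  grep -o -F -- "$1" <<< "$2" | wc -l
}

count_char '|' 'idea|concept|inspiration'   # two delimiters
count_char '|' 'word|phrase'                # one delimiter
```

Some versions of wc pad the number with leading spaces, which is harmless if the result is used in shell arithmetic.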

All that, and it's not quite what I want, because a line like
word|phrase has one delimiter, but two choices. Here's how I solve that
in this first, skeletal version of the function:
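As a minimal sketch of such a skeletal function: the name spinline and the exact body are assumptions, while the message format follows the sample run in this column. The fix is simply to add one to the delimiter count.

```shell
#!/bin/bash
# spinline (hypothetical name): report how many choices a block offers.
spinline() {
  local text="$*"
  local choices
  # There is always one more choice than there are | delimiters.
  choices=$(( $(grep -o '|' <<< "$text" | wc -l) + 1 ))
  echo "$choices options, spinning --- $text"
}

spinline 'word|phrase'   # prints: 2 options, spinning --- word|phrase
```

Called from the main loop in place of the SPIN THIS: line, a function like this would produce a run along these lines: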

$ sh spinner.sh spinme.txt
The
3 options, spinning --- idea|concept|inspiration
is that each time you'd use a
2 options, spinning --- word|phrase
you instead list a set of
3 options, spinning --- similar words|synonyms|alternative words
and the software automatically picks one
2 options, spinning --- randomly|at random
.

That's it for this month. Next month, I'll finish up the function,
including implementing a way to pick one entry randomly from a set of n
choices, then output the cleaned-up copy, ready to use in whatever program
or utility you'd like.

Dave Taylor has been hacking shell scripts on UNIX and Linux systems for a
really long time. He's the author of Learning Unix for Mac OS
X and Wicked Cool Shell Scripts. You can find him on Twitter
as @DaveTaylor, and you can reach him through his tech Q&A site: Ask Dave Taylor.