Analyze Song Lyrics with a Shell Script, Part II

In my last
article, I began exploring song lyrics. Not so you could have an epic
Karaoke night, but more in the sense of analyzing song lyrics and word usage
therein. The specific question that sparked my curiosity was an article
that claimed prolific song-writing duo Paul McCartney and John Lennon
mentioned the word "love" in Beatles songs 160 times.

How do you test that assertion? You do it by pulling the lyrics from a Web site that
specializes in song lyrics—in this case MLDb—and analyzing them
with a shell script.

The first part, covered in my last article, was a script that extracted links for every
published song lyric attributed to The Beatles, stepping through the
site's pagination, which lists 30 songs per page. In total, the site lists 240
songs by the band. Out of 240 songs, they mentioned "love" only 160 times?
I'm skeptical.

In this article, I expand on the idea by downloading the
lyrics to each and every one of those songs, then using some basic command-line tools to analyze word usage and frequency.

Tell Me What You See

The output of the script from my last article is
a set of files that have the following contents:

Instead of just writing it to the output file, however, what if I built a
proper URL and handed it to a subroutine that could use that to extract
lyrics? Sounds easy, but keep in mind that the above produces a list of 30
songs, not a single song match.

In fact, the easiest solution is to change the code to stick with the
output file, but make it a temp file, as it's just for internal use.
Then I can step through the file line by line as desired.
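As a minimal sketch of that line-by-line approach (not the script itself), here is a loop over a temp file. The line format, the sample entries and the songnum_of helper are all assumptions made for illustration; only the variable names lineofdata and songnum match what the script uses later.

```shell
#!/bin/sh
# Sketch only: the line format below is an assumption, not the site's real markup.
# Each line is assumed to look like: song-NUMBER-some-title.html">Song Title

tmpfile=$(mktemp) || exit 1

# Stand-in for the links the extraction step would have written.
cat > "$tmpfile" << 'EOF'
song-32476-i-am-the-walrus.html">I Am The Walrus
song-32554-across-the-universe.html">Across The Universe
EOF

# Hypothetical helper: pull the numeric song ID out of one line.
songnum_of() {
  echo "$1" | cut -d- -f2
}

# Step through the temp file one line at a time.
while IFS= read -r lineofdata
do
  songnum=$(songnum_of "$lineofdata")
  echo "would save lyrics.$songnum.txt"
done < "$tmpfile"

rm -f "$tmpfile"
```

Reading with IFS= and -r keeps each line intact, so any spaces or backslashes in a song title survive the trip through the loop.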

Why am I saving the song number separately? Because it makes for an easy
file output name, as I want to save the lyrics to each and every one of
the matching songs. Yes, I could put them in one massive file, but somehow
that doesn't seem right.

The work is all done by the savelyrics function, and
here's how
I've written it, having spent some time fine-tuning the filtering and
transformation:

The curl statement gets the web page with the full
song lyrics, which are
roughly delineated by a CSS class ID of songtext and are
contained in a crude HTML table, so the last line of the lyric appears
prior to the table closing: </table>.

As I've mentioned before, sed is your friend when you want to extract
well-delineated passages of text. Use sed -n to stop its usual
behavior of echoing everything seen and
/start/,/end/p to print just the
lines from the first one that matches start through the next one that
matches end, inclusive.
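To see the range idea in isolation, here is a small sketch run against a made-up page fragment. Only the songtext marker and the closing </table> come from the description above; the rest of the markup and the function names are invented for the example.

```shell
#!/bin/sh
# Made-up page fragment; the surrounding markup is an assumption.
sample_page() {
  cat << 'EOF'
<html><body>
<div class="menu">navigation</div>
<table><tr><td class="songtext">
Try to see it my way<br />
While you see it your way</p>
</td></tr></table>
<div class="footer">footer text</div>
</body></html>
EOF
}

# -n suppresses sed's default echo; the p command prints only the lines
# from the first match of "songtext" through the next "</table>".
extract_block() {
  sed -n '/songtext/,/<\/table>/p'
}

sample_page | extract_block
```

The menu and footer lines fall outside the range, so they never reach the next stage of the pipe.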

The problem is that even when you convert every closing angle bracket into a
newline (to break the source file into a ton of separate lines for
further processing), it's still a bit messy. Almost all lyric lines end
with the sequence <br />, but the very last line
of the lyrics has a </p>
instead.

To catch both those lines and screen out everything else,
grep has the
handy -E flag, which switches on extended regular expressions. Regular
expressions are a world unto themselves (which I've delved into in
prior columns), but suffice it to say that with -E a pattern of the form
(A|B) produces
lines that match either pattern A or pattern B, exactly as you'd hope.
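Here is a rough sketch of those two steps together: break the markup apart at each closing angle bracket, then keep only the fragments that end a lyric line. The exact pattern is my guess at a workable one, and the keep_lyric_lines name and sample input are made up for the example.

```shell
#!/bin/sh
# Split on ">" so every tag ends its own line, then keep only fragments
# that end in "<br /" or "</p" (the ">" itself is gone after the tr).
# The pattern is an assumption, not necessarily the one the script uses.
keep_lyric_lines() {
  tr '>' '\n' | grep -E '(<br /|</p)$'
}

printf '%s\n' \
  '<td class="songtext">Try to see it my way<br />' \
  'While you see it your way</p>' \
  '</td>' | keep_lyric_lines
```

Anchoring the alternation with $ means a fragment only survives if the tag remnant sits at the very end of the line, which is where the tr step leaves it.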

That's really all the work. The third sed in the pipe simply removes
the fragmentary remnant HTML code:

sed 's/<br \///g;s/<\/p//g'

(Remember, the format is s/old/new/g for a global
substitution. This just
looks more complex because "/" is part of the source pattern. The
";" lets you put two sed command sequences on the same line for
convenience.)
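If all those escaped slashes make your eyes cross, sed accepts other delimiters for the s command. The sketch below does the same cleanup using "|" instead of "/", so nothing needs escaping; the strip_tags name and the sample lines are made up for illustration.

```shell
#!/bin/sh
# Same two substitutions, written with "|" as the delimiter so the
# slashes in "<br /" and "</p" need no backslashes.
strip_tags() {
  sed 's|<br /||g;s|</p||g'
}

printf '%s\n' 'Try to see it my way<br /' 'While you see it your way</p' | strip_tags
```

Leaving the "<" unescaped is deliberate: some versions of sed treat a backslash followed by "<" as a word-boundary match rather than a literal angle bracket.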

Do a quick uniq to minimize blank lines, and you're done, ready to save. A
sample song lyric output:

$ head lyrics.32586.txt
Try to see it my way
Do I have to keep on talking till I can't go on
While you see it your way
Run the risk of knowing that our love may soon be gone
We can work it out, we can work it out
Think of what you're saying
You can get it wrong and still you think that it's alright
Think of what I'm saying

Know the song? Hear it in your head now? I can definitely keep going with
the rest of the lyrics if we're switching to Karaoke at this point.
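One caveat on the uniq step: it collapses any run of identical adjacent lines, not just blank ones, so a lyric line sung twice in a row would be saved only once. If that matters for your counts, a possible alternative is to squeeze only the blank lines. The squeeze_blank function below is a hypothetical helper for illustration, not part of the script.

```shell
#!/bin/sh
# Collapse runs of blank lines to a single blank line, but leave
# repeated non-blank lines alone.
squeeze_blank() {
  awk 'NF { print; blank = 0; next } !blank { print; blank = 1 }'
}

sample() {
  printf '%s\n' 'We can work it out' 'We can work it out' '' '' 'Think of what you are saying'
}

echo "with uniq:"
sample | uniq
echo "with squeeze_blank:"
sample | squeeze_blank
```

On this made-up input, uniq keeps the repeated line once, while squeeze_blank keeps both copies and still reduces the two blank lines to one.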

Try to See It My Way

I made one more tweak to the script so that the status output as it runs
would be interesting. This now appears just before the call to
savelyrics:

echo "$lineofdata ($songnum)" | cut -d\> -f2

And so, when run, the script has this sort of output:

$ sh getsongs.sh
I Am The Walrus (32476)
Across The Universe (32554)
Come Together (32520)
Yellow Submarine (32461)
Day Tripper (32585)
. . .
Maggie Mae (61310)
Back In The USSR (61300)
When I'm Sixty-Four (61299)
Good Morning Good Morning (61286)
Got To Get You Into My Life (61285)

Looks good. Here's a quick double-check:

$ ls lyrics.* | wc -l
240

Got all 240 songs, so let's do some analysis. First off, how many songs
have the word "love" in their title? With the new improved script
output, that's easy:

$ sh getsongs.sh | grep -i love | wc -l
13

Looking across all the songs, how many lyric lines have the word
"love"?

$ cat lyrics.* | grep -i love | wc -l
445

That's a whole lot more than 160! But, what about lines that have the
word love more than once? They'd be counted only once. In fact, a more
traditional word analysis could be fun and interesting. Let's start
with just a single song, however, the cheerily titled "I'm A
Loser":

Notice that the first tr translates all spaces to newlines, the
second ensures everything's in lower case (using POSIX character-class
notation for portability), then I simply sort all the words, use
uniq -c to generate
counts, then reverse sort by numeric count and examine the top ten matches.
"I" is the most common word in this song lyric, followed by
"a". Not surprising. Notice that "loser" shows up
only seven times in the song (all in the reprise, actually).
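The pipeline just described can be sketched as follows, run here on two lines from the sample lyric shown earlier rather than on a real lyrics file. The word_counts function name is made up for the example.

```shell
#!/bin/sh
# One word per line, lowercased, sorted, counted, most frequent first.
word_counts() {
  tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn
}

printf '%s\n' \
  'We can work it out, we can work it out' \
  'Try to see it my way' | word_counts | head -5
```

Because the split is on spaces only, punctuation stays glued to the word: "out," and "out" are counted as two different words here. Adding a step such as tr -d '[:punct:]' would merge them, at the cost of also stripping apostrophes.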

And what if I examine every single song lyric en masse? Here's a
surprisingly similar command-line invocation:

These are all what are generally considered "noise words" in
semantic analysis, so let's expand the head to include more matches and
I'll hand-edit this final result for your reading pleasure:

1728 you
781 me
399 love
366 know
250 she
205 her

There are lots more, but now there's an answer, ladies and gentlemen! I now
can say definitively that the word love occurs exactly 399 times in The Beatles
songs and 13 times in the group's song titles too (as revealed
earlier).
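The difference between counting lines and counting occurrences, raised earlier, is easy to see on a small made-up sample. This sketch relies on grep -o, which GNU and BSD grep both provide but POSIX does not require, and the function names are invented for illustration.

```shell
#!/bin/sh
# Count lines containing the pattern (case-insensitive).
count_lines() {
  grep -i -c "$1"
}

# Count every occurrence, one per output line thanks to -o.
count_hits() {
  grep -i -o "$1" | wc -l | tr -d ' '
}

sample() {
  printf '%s\n' 'my love, my only love' 'all you need is lovely weather' 'nothing here'
}

echo "lines: $(sample | count_lines love)"
echo "occurrences: $(sample | count_hits love)"
```

Note that a plain substring match also catches "lovely", while the word-splitting pipeline only counts tokens that are exactly "love", so the three approaches can legitimately report three different numbers.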

Hello Goodbye

It took a while to get to the solution, but this analysis is a splendid
example of what in computer science is called divide and
conquer. Take a big
problem and keep breaking it down into smaller and smaller parts until you
can start to understand how to solve the little pieces. Then build it all
back up so you can solve the big challenge.

Now, what about The Monkees? How often did they actually reference monkeys
in their song lyrics? Hmm....

Dave Taylor has been hacking shell scripts on UNIX and Linux systems for a
really long time. He's the author of Learning Unix for Mac OS
X and Wicked Cool Shell Scripts. You can find him on Twitter
as @DaveTaylor, and you can reach him through his tech Q&A site: Ask Dave Taylor.