Search

Analyzing Song Lyrics

I was reading about the history of The Beatles a few days ago and bumped into an
interesting fact. According to the author, The Beatles used the word
"love" in their songs more than 160 times. At first I thought,
"cool", but the more I thought about it, the more I became skeptical about the figure. In
fact, I suspect that the word "love" shows up considerably more than 160
times.

And, this leads to the question: how do you actually figure out something like that?
The answer, of course, is with a shell script! So let's jump in, shall we?

Download Lyrics by Artist

The first challenge, and really most of the work, is figuring out where to
download the lyrics for every song by an artist, performer or band. There are lots
of online archives, but are they complete?

One source is MLDb, the Music Lyrics Database (modeled after the Internet Movie
Database, one presumes). An easy test is this: how many songs does the site list for The
Beatles?

Working backward from an interactive session in a web browser, an artist search
for "the beatles" produces eight pages of matches, 30 matches per page.
That's 240 songs. Wikipedia says that there are 237 original compositions for
the band, and BeatlesBible.com shows 302 original songs. Confusing!

Of course, some of the songs recorded by The Beatles didn't have lyrics. For
example, on the Magical Mystery Tour album, there's a track called
"Flying". Given that Paul McCartney and John Lennon were such brilliant
lyricists, however, the vast, vast majority of songs recorded have at least some
lyrics—even "The End".

So let's go with MLDb and trust that its 240 songs are close enough for this
task. Now the challenge is to get a list of all the songs, and then to download
the lyrics for each and every song that matches.

You can experimentally verify that this produces the second page of results, but
hey, let's just run with it! Since the second page has a
"from=30", you
can conclude that there are 30 entries per page (as mentioned earlier) and that
from=60 gets page three, from=90 page four, and so on.

Each page can be downloaded in HTML form using GET or
curl, with my preference being to
use the latter—it's more sophisticated and has oodles of
options. A quick
glance shows that "Yellow Submarine" shows up on the first page, so
here's a quick test, with url set to the value shown above:

Notice the sed pattern above. I'm replacing every < with a carriage
return followed by the < so that the net effect is that I unwrap the
HTML source neatly and then can use grep to isolate the matching lines and exclude
everything else.

Nice. Now how about turning each into a curl page
query? Well, hold on! Let's
first figure out how to get the full list of every song—that is, how to go from page to
page. To do that, the URL already shown has the clue: from=XX for each subsequent
page.

Another quick test shows what happens if you specify a URL that is beyond the last
song listed: no matches are returned. That's easy to deal with because
wc -l will return a zero in that instance.

Put the pieces together, and here's a loop that will get as many matches as
possible until there's a zero result:

Perfect. I expected eight pages of songs, and that's what the script
produced.
Each has the same format as the output listed earlier, so it's now a
matter of converting the href= format into an invocation to
curl to get that
particular page of lyrics. Since I'm already running out of space, however,
I'm going to defer that part of the script until my next article.

Meanwhile, notice how start is incremented by 30 with
the $(( )) notation for
calculations (you could use expr, but it's faster to
stay in the shell and not
spawn a subshell for the math). Also, the test to identify an empty output file
should be easy for you to understand:

if [ $(wc -l < $output$start) -eq 0 ]

There is a nuance to catch here, however: the $( )
notation gets you a sub-shell akin
to using backticks, while the $(( )) notation allows you to do rudimentary
calculations within the Bash shell itself.

I'll expand on all of this in my next article. See ya then!

Dave Taylor has been hacking shell scripts on UNIX and Linux systems for a
really long time. He's the author of Learning Unix for Mac OS
X and Wicked Cool Shell Scripts. You can find him on Twitter
as @DaveTaylor, and you can reach him through his tech Q&A site: Ask Dave Taylor.