This data would be really interesting to nail down, but (ever the data quibbler) i have some questions about how the counts were gathered. While i don’t know the details of Matt’s methodology, i expect some typical keyword-search problems take their toll here too:

There are multiple ways to specify a reference (John 3, Jn 3): this has the potential to reduce recall.

Some matches for a given reference may not actually count: for example, the 12th match i found for “John 3” is actually a roster for a Civil War infantry regiment containing the phrase “Adams, John: 3/9/1864”. It’s only one case, but i wonder how many more are lurking.

A similar problem occurs with some Scripture references themselves: “John 3” also matches “1 John 3”! i wonder if that helps to account for the popularity of John 1, 2, and 3, all of which made the top 15. In these cases, you could subtract the count for “1 John 3” from the count for “John 3”.
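Both problems are mechanical enough to filter for, provided you can retrieve the matching text rather than just hit counts. A minimal sketch, with the pattern and sample strings invented purely for illustration:

```python
import re

# Match "John 3" only when it is NOT preceded by an epistle number
# ("1 John 3", "2 John 3", "3 John 3"), and only when the chapter
# number ends at a word boundary, so "John 30" is not counted.
GOSPEL_JOHN = re.compile(r"(?<![123]\s)\bJohn 3\b")

samples = [
    "a sermon on John 3 and the new birth",   # should count
    "notes on 1 John 3 and assurance",        # epistle, not the Gospel
    "a study of John 30",                     # chapter boundary excludes it
    "Adams, John: 3/9/1864",                  # roster noise: the literal
                                              # space in the pattern rejects
                                              # the colon a phrase search
                                              # would have normalized away
]

hits = [s for s in samples if GOSPEL_JOHN.search(s)]
```

A real run would need a pattern per book and abbreviation, but the same lookbehind trick handles every "1/2/3 Book" collision.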

i wondered why only John and Matthew’s Gospels made the top 15, but queries for Mark (or Mk) produce results that are full of non-Biblical acronyms and other misses, and i’ll bet Luke does too.

None of this is to put down Matt’s efforts: even noisy data is more instructive than silence. But this kind of counting can be very tricky business. The ESV blog discussed this topic a while back, based on Bible searches on their own site.
Given the full text of all these posts, i’ll bet the vocabulary is distinct enough that a statistical text classifier could be trained to determine with high reliability which ones actually referred to Biblical discussions, and which ones didn’t.
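As a sketch of that idea: a tiny naive Bayes classifier over word counts. The training snippets below are invented purely for illustration — a real classifier would be trained on the posts themselves:

```python
import math
from collections import Counter

# Invented training snippets: the point is only that the surrounding
# vocabulary separates Biblical discussion from other uses of the
# same reference-like phrases.
train = [
    ("john 3 tells of nicodemus and being born again", "bible"),
    ("the gospel of john chapter 3 on eternal life", "bible"),
    ("genesis 1 and the days of creation", "bible"),
    ("private adams john enlisted in the infantry regiment", "other"),
    ("the sega genesis 1 console shipped with one controller", "other"),
    ("elton john 3 nights only tour dates", "other"),
]

word_counts = {"bible": Counter(), "other": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = set()
for counts in word_counts.values():
    vocab.update(counts)

def classify(text):
    # Log-space naive Bayes with add-one (Laplace) smoothing.
    best_label, best_score = None, -math.inf
    for label in word_counts:
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for word in text.split():
            score += math.log(
                (word_counts[label][word] + 1) / (total + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

With realistic training data, even this simple model tends to do well, because words like "Nicodemus" or "regiment" almost never appear in the other class.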

This entry is filed under SemanticBible.

Thank you for the link. A couple of things. I don’t believe there is any way to reduce the noise 100% within the data. My original intention was to search all combinations (e.g. Gen, Genesis, etc.), but it soon became apparent that this was not possible without writing a script: the biggest roadblock is that the blog search tools all block you after so many similar searches, because you begin to resemble a worm or program mining data. Because of that, I restricted the search to the full book name and chapter number in quotes (e.g. “Genesis 1”). Where it gets a little sticky is in Numbers and with personal names like Matthew, Mark, etc. A search for “Genesis 1” does not include results for “Genesis 11” or “Genesis 12”. While that does not capture everything, it at least looks evenly at what is there.
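In regex terms, that restriction corresponds to a word boundary after the chapter number — a small sketch, with the sample texts invented:

```python
import re

# A quoted phrase search for "Genesis 1" treats the chapter number as a
# whole token, so "Genesis 11" is a different phrase. A word boundary
# after the digit models the same behavior.
GEN_1 = re.compile(r"\bGenesis 1\b")

texts = [
    "a reflection on Genesis 1 and creation",  # counted
    "in the beginning, Genesis 1:1",           # verse reference, counted
    "Abram's call in Genesis 12",              # different chapter, skipped
]
counted = [t for t in texts if GEN_1.search(t)]
```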

While not perfect, it has raised some awareness, and it does need refining. But like I said, after about 15 searches you get blocked by many search engines, and that really limits how much you can do at a time, since you need to use the same search engine for all searches for the counts to be comparable. Anyway, to improve the data, we could devise a way to get more people on board to establish a better baseline for future reference. This could be updated every year or so to see how things have changed. Again, thank you.

Despite my questions, i still think this is interesting data, and as you say, a better baseline would be very valuable to a lot of people.

As it happens, i’m at a technical conference, and heard a presentation yesterday on microformats. Scripture references are actually a great use case, and a microformat for references would be pretty easy to define (agreeing on how to name books is probably the only issue, and even there some standards already exist). But, as with all microformats, the real issue is getting people to adopt them.