Screen-Scraping Movie Showtimes Off Google.com With ColdFusion

Yesterday, I experimented with scraping movie showtimes off of the iPhone version of Fandango.com. Today, I wanted to try to do the same thing with the Google.com movie showtimes service. This makes for an interesting comparison because it requires two very different approaches to the same problem. With Fandango.com, we get XHTML that is so compliant that we can actually parse it into XML and use XPath to query the document. Google, on the other hand, is so conscious of bandwidth usage that it makes its HTML as dirty and as incomplete as possible, so long as it still renders properly. As such, when we deal with Google's markup, we have to fall back on string parsing and pattern matching rather than DOM querying.
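To see why the DOM approach breaks down here, consider a Google-style fragment with unquoted attributes and unclosed tags (the fragment below is made up for illustration): xmlParse() would throw on it, but a regular expression matches the raw string just fine:

```cfml
<!--- A Google-style fragment: unquoted attribute, unclosed outer DIV. --->
<cfset dirtyHTML = "<div class=movie><div class=name>Up</div>" />

<!---
	xmlParse( dirtyHTML ) would throw an error here - this is not
	well-formed XML. Pattern matching, however, doesn't care.
--->
<cfset names = reMatch( "<div class=name>.+?</div>", dirtyHTML ) />

<!--- names[ 1 ] now contains "<div class=name>Up</div>". --->
```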

Because I was solving the same problem, I actually wanted to build the same API. So, for this demo, you'll see that the ColdFusion code is almost exactly the same as the code used in the Fandango.com demo:

<!--- Create an instance of the Google movie component. --->
<cfset google = createObject( "component", "Google" ).init() />

<!---
	Get the theater information for the showtimes at the
	Regal Union Square Stadium 14 theater TODAY.

	ID: 10dd19bd6f57c7c8 - Regal Union Square Stadium 14
	ID: 14c321fe7754e274 - AMC Empire 25

	NOTE: I had to get the theater ID off the website itself.
--->
<cfset theaterInfo = google.getTheaterInfo( "10dd19bd6f57c7c8" ) />

<!--- Output theater information. --->
<cfoutput>

	<p>
		<strong>#theaterInfo.title#</strong><br />
	</p>

	<!---
		Loop over the movies to output the movie titles and
		the times they are showing.
	--->
	<cfloop
		index="movie"
		array="#theaterInfo.movies#">

		<p>
			<strong>#movie.title#</strong><br />
			#arrayToList( movie.showtimes, ", " )#
		</p>

	</cfloop>

</cfoutput>

Pretty much the only difference here is that I am instantiating a ColdFusion component called "Google" rather than one called "Fandango." Both of these CFCs have the same public API: the getTheaterInfo() method, which returns the same structure in both cases. This is the nicest thing about creating an API - you can change the underlying engine without changing the code that relies on it.
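To make the engine-swapping concrete, here is a hypothetical sketch in which the component name is the only thing that varies - the consuming code never changes because both CFCs expose getTheaterInfo():

```cfml
<!--- Pick the scraping engine at runtime (hypothetical example). --->
<cfset engineName = "Google" />
<!--- Could just as easily be "Fandango" - same public API. --->

<cfset scraper = createObject( "component", engineName ).init() />

<!--- This line is identical regardless of which engine was chosen. --->
<cfset theaterInfo = scraper.getTheaterInfo( "10dd19bd6f57c7c8" ) />
```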

When we run the above code, we get the following page output:

NOTE: Movie data removed at the request of data owner.

The ColdFusion component that powers this is somewhat less complex than the Fandango one because all the movies are listed on one page. In the Fandango version, I had to make several CFHTTP page requests to gather all of the showtime information; but on Google, it's all right there. Of course, this time, I have to rely on Regular Expression pattern matching rather than XPath; but it's not too much more complex.

Google.cfc

<cfcomponent
	output="false"
	hint="I help screen scrape the Google Movie showtimes.">

	<cffunction
		name="init"
		access="public"
		returntype="any"
		output="false"
		hint="I return an initialized component.">

		<!--- Define arguments. --->
		<cfargument
			name="baseURL"
			type="string"
			required="false"
			default="http://google.com/movies"
			hint="I am the base URL for the HTTP requests."
			/>

		<!--- Store properties. --->
		<cfset this.baseURL = arguments.baseURL />

		<!--- Return this object reference. --->
		<cfreturn this />
	</cffunction>


	<cffunction
		name="getTheaterInfo"
		access="public"
		returntype="struct"
		output="false"
		hint="I parse the showtimes for the given theater ID.">

		<!--- Define arguments. --->
		<cfargument
			name="theaterID"
			type="string"
			required="true"
			hint="I am the theater ID used by Google."
			/>

		<!--- Define the local scope. --->
		<cfset var local = {} />

		<!--- Define the theater structure. --->
		<cfset local.theaterInfo = {
			id = arguments.theaterID,
			title = "",
			movies = []
			} />

		<!---
			Grab the HTML off of the Google web page. With Google,
			you typically have to send some sort of User Agent
			because it will block a lot of user agents that it
			considers "bots."
		--->
		<cfhttp
			result="local.googleGet"
			method="get"
			url="#this.baseURL#?tid=#arguments.theaterID#"
			useragent="Mozilla/BenNadel.com"
			/>

		<!---
			While the HTML of the Google page is horrendously
			incomplete, it is thankfully well Classed enough to
			make string parsing somewhat straightforward.
		--->

		<!--- Grab the theater title div. --->
		<cfset local.theaterDiv = reMatch(
			"<div class=theater>[\w\W]+?</span>",
			local.googleGet.fileContent
			) />

		<!---
			Get the theater title by stripping out all tags from
			the theater DIV. There is an H2 in there somewhere that
			has our theater name.
		--->
		<cfset local.theaterInfo.title = trim(
			reReplace(
				local.theaterDiv[ 1 ],
				"(&nbsp;|</?\w+[^>]*>)",
				" ",
				"all"
				)
			) />

		<!---
			Each movie is wrapped in a "movie" DIV that we can
			extract with some regular expression matching.
		--->
		<cfset local.movieDivs = reMatch(
			"<div class=movie>(?:\s|<(\w+)[^>]*>.+?</\1>)+",
			local.googleGet.fileContent
			) />

		<!---
			At this point, we have chunks of strings that contain
			the movie data. Now, we have to loop over each one and
			parse the details.
		--->
		<cfloop
			index="local.movieDiv"
			array="#local.movieDivs#">

			<!--- Parse out the movie name DIV. --->
			<cfset local.nameDiv = reMatch(
				"<div class=name>.+?</div>",
				local.movieDiv
				) />

			<!--- Parse out the showtimes DIV. --->
			<cfset local.showtimesDiv = reMatch(
				"<div class=times>.+?</div>",
				local.movieDiv
				) />

			<!---
				Create a movie struct from the parsed DIVs. For
				this, we are basically going to take the parsed
				DIVs and strip out all tags, leaving just the
				textual data.
			--->
			<cfset local.movie = {
				title = trim(
					reReplace(
						local.nameDiv[ 1 ],
						"</?\w+[^>]*>",
						" ",
						"all"
						)
					),
				showtimes = listToArray(
					reReplace(
						local.showtimesDiv[ 1 ],
						"(&nbsp;|</?\w+[^>]*>)",
						" ",
						"all"
						),
					" "
					)
				} />

			<!--- Append the movie to the ongoing collection. --->
			<cfset arrayAppend(
				local.theaterInfo.movies,
				local.movie
				) />

		</cfloop>

		<!--- Return the result. --->
		<cfreturn local.theaterInfo />
	</cffunction>

</cfcomponent>

As you can see, this version of the showtimes screen scraper relies entirely on reMatch() rather than xmlSearch(). But just because this version approaches the problem in a different way, that doesn't mean it is any less susceptible to problems. In either case, we are still depending on the predictable structure of a third-party page that we do not control. If that structure changes without notice, whether we use XML parsing or string pattern matching, our code might very well break.

In the long term, Google's markup, while significantly incomplete, seems to be easier to work with simply because it's all on one page and has better CSS class hooks (for pattern matching). If I am gonna play around more with screen scraping movie showtimes, I'll probably be using this service to do so.

Reader Comments

Ben, I have been trying to figure out an approach to screen-scraping for the website we built to aggregate event information around the state. Trying to get everyone to update their information is always an issue.

I'm not too swift when it comes to the whole issue of screen-scraping, then moving this data to a database. I've had limited success and it is different for each site.

What I would like to do is pull down event information from numerous sites, dump the information into a database, then upload it to our site. Is there an approach to this that is feasible? Am I just thinking too much and not working hard enough?

Screen-scraping is never the "right" solution; however, sometimes it is the "only" solution available. When I was working on Skin-Spider waaaay back in the day (it screen-scraped adult content), what I did was create a uniform CFC interface for the concept of screen scraping. Then, I created a separate CFC for each target website that upheld the "scraping interface," but internally was set up specifically for that site (based on its HTML and what not).

It's not an easy approach, for sure; and, it is likely to break if / whenever they change the markup. But, if it's all you've got, abstracting it out into individual CFCs is really beneficial.
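The pattern described above might look something like this - a hypothetical per-site CFC that upholds a shared method signature (the component and method names here are illustrative, not from any real project):

```cfml
<!--- SiteAScraper.cfc - one CFC per target site, same public method. --->
<cfcomponent output="false" hint="I scrape events from Site A.">

	<cffunction
		name="getEvents"
		access="public"
		returntype="array"
		output="false"
		hint="I return an array of event structs scraped from Site A.">

		<cfset var events = [] />

		<!---
			The site-specific CFHTTP request and HTML parsing would
			go here; every scraper CFC hides its own mess behind
			the same getEvents() signature.
		--->

		<cfreturn events />
	</cffunction>

</cfcomponent>
```

The aggregator can then loop over an array of scraper instances, calling getEvents() on each and inserting the results into the database, without ever knowing which site a given scraper targets.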

Also, if you are really serious about this, it can be a godsend to run the HTML through an "XHTML cleaner" first such that you can actually use xmlSearch(). That's what I was using TagSoup for a while back:

This is a more complex solution since it used Groovy to load the JAR, which was then used to clean the HTML, which was then handed to ColdFusion's xmlSearch(). But once you do that, you can treat the target HTML like it is XML, which makes scraping MUCH easier.
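The clean-then-query idea can be sketched like this - assuming some cleanHTML() routine (e.g. one backed by the TagSoup JAR) that returns well-formed XHTML; that helper name is hypothetical here:

```cfml
<!---
	Assume cleanHTML() converts dirty HTML into well-formed XHTML
	(e.g. via TagSoup). Once cleaned, the page parses as XML.
--->
<cfset xhtml = xmlParse( cleanHTML( googleGet.fileContent ) ) />

<!--- Now XPath works, just like in the Fandango version. --->
<cfset movieNodes = xmlSearch(
	xhtml,
	"//div[ @class = 'movie' ]"
	) />
```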

I am the co-founder and lead engineer at InVision App, Inc — the world's leading prototyping,
collaboration & workflow platform. I also rock out in JavaScript and ColdFusion 24x7 and I dream about
promise resolving asynchronously.