Saturday, July 10, 2010

In data mining, much time is spent on collecting and preprocessing raw data. Quite a few data mining research tasks require downloading data from the Internet, e.g. articles from Wikipedia, photos from Flickr, Google search results, reviews from Amazon, etc.

General-purpose crawlers such as wget are usually not powerful or specialized enough for this kind of download, so writing a crawler of our own is often required.

I have recently been working on an image-related research project for which I need to download a lot of tagged images from Flickr. I am aware that there is a Flickr downloadr, which uses the Flickr API to download images. However, 1) it only downloads licensed photos and 2) it cannot download the tags of a photo. Thus I decided to write one myself.

The input of the crawler is a tag query (e.g. “dog”), the number of photos to download, and the disk folder in which to store the downloaded images.

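Because the number of photos is quite large, downloading them in parallel is critical. In .NET, there are several ways to do parallel computing. For IO-intensive tasks such as this one, F# Async workflows are a natural fit, because a pending request does not need to block a thread while it waits.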
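As a minimal sketch of the idea (not code from the crawler itself), the snippet below runs ten simulated downloads with Async.Parallel. The function simulatedDownload is a hypothetical stand-in that only sleeps, so the example runs without any network access.

```fsharp
open System.Diagnostics

// Hypothetical stand-in for a download: waits 200 ms without blocking a thread.
let simulatedDownload (id: int) : Async<int> =
    async {
        do! Async.Sleep 200
        return id * id
    }

let timer = Stopwatch.StartNew()

let results =
    [ 1 .. 10 ]
    |> List.map simulatedDownload
    |> Async.Parallel            // Async<int[]>; results keep the input order
    |> Async.RunSynchronously

timer.Stop()
printfn "%A in %d ms" results timer.ElapsedMilliseconds
```

Because the waits overlap, the elapsed time should be much closer to one download than to the sum of all ten.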

Tutorials for Async workflow

As Async workflows are one of the key features of F#, there are quite a few tutorials online for F# Async programming. Providing one more in this blog would be repetitious.

The Flickr crawler

To write a crawler for a web site like Flickr, we need to 1) design the downloading strategy and 2) analyze the structures of Flickr.
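Analyzing the structure mostly means finding out how links to individual photo pages appear in the HTML of a search result page. The sketch below is only an illustration under assumptions: it supposes that photo pages are linked as href="/photos/<user>/<id>/", which may not match the markup Flickr actually served, and it is not the getImageUrls used by the crawler in this post.

```fsharp
open System.Text.RegularExpressions

// Assumed link shape: href="/photos/<user>/<numeric id>/" (hypothetical markup).
let photoLink = Regex(@"href=""(/photos/[^/""]+/\d+/?)""", RegexOptions.Compiled)

// Extract the distinct photo page urls from the HTML of a result page.
let getImageUrls (html: string) : string list =
    photoLink.Matches(html)
    |> Seq.cast<Match>
    |> Seq.map (fun m -> "http://www.flickr.com" + m.Groups.[1].Value)
    |> Seq.distinct
    |> Seq.toList

// Made-up sample input, not real Flickr output.
let sample =
    """<a href="/photos/alice/123456/">a</a> <a href="/photos/alice/123456/">again</a> <a href="/photos/bob/789/">b</a> <a href="/explore/">c</a>"""

getImageUrls sample |> List.iter (printfn "%s")
```

A regular expression is enough for a quick crawler like this one; an HTML parser would be more robust if the page layout changes.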

My strategy is to use a search query to find images with specific tags; from the result page (as shown below), the url of each image page is extracted, from which the image and its tags are then crawled.

fetchUrl pretends to be a Mozilla browser and can follow redirections if the url is slightly invalid. The current exception handling is very simple: it just returns the empty string for the web page. Notice that the return type of the function is Async<string>, so it cannot be used to download images, which are binary rather than text; a separate function, getImage, is used for those.
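The listing of fetchUrl is not reproduced here, so the following is only a sketch of what the description above implies, not the original code. It assumes System.Net.Http.HttpClient (which follows redirects by default) and a simplified User-Agent string; the name getImage is taken from the crawler code further down, and its body is likewise an assumption.

```fsharp
open System
open System.Net.Http

// One shared client; the User-Agent value is a simplified, assumed browser string.
let client = new HttpClient()
client.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0")

// Download a page as text; any failure yields the empty string.
let fetchUrl (url: string) : Async<string> =
    async {
        try
            let! html = client.GetStringAsync(url) |> Async.AwaitTask
            return html
        with _ ->
            return ""
    }

// Download binary content such as an image; any failure yields an empty array.
let getImage (url: string) : Async<byte[]> =
    async {
        try
            let! bytes = client.GetByteArrayAsync(url) |> Async.AwaitTask
            return bytes
        with _ ->
            return Array.empty
    }
```

Returning an empty value keeps the calling code simple, at the cost of hiding why a download failed.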

Finally, write a function to walk through every search result page, parse it, and download the images on that page:

let getImagesWithTag (tag:string) (pages:int) =
    let rooturl = @"http://www.flickr.com/search/?q=" + tag + "&m=tags&s=int"
    seq {
        for i = 1 to pages do
            let url = rooturl + "&page=" + i.ToString()
            printfn "url = %s" url
            let page = fetchUrl url |> Async.RunSynchronously
            let imageUrls = getImageUrls page
            // the file name is the last segment of the image url
            let getName (iurl:string) =
                let s = iurl.Split '/'
                s.[s.Length-1]
            (* images in every search page *)
            let images =
                imageUrls
                |> Seq.map (fun url -> fetchUrl url)
                |> Async.Parallel
                |> Async.RunSynchronously
                |> Seq.map (fun page ->
                    async {
                        let iurl, tags = getImageUrlAndTags page
                        // let! awaits the download without blocking a thread inside the workflow
                        let! icontent = getImage iurl
                        let iname = getName iurl
                        return iname, icontent, tags
                    })
                |> Async.Parallel
                |> Async.RunSynchronously
            yield! images
    }

A driver function then writes all the images to the hard disk:

let downloadImagesWithTag (tag:string) (pages:int) (folder:string) =
    let images = getImagesWithTag tag pages
    images
    |> Seq.iter (fun (name, content, tags) ->
        // Path.Combine (System.IO) works with or without a trailing separator on folder
        let fname = Path.Combine(folder, name)
        File.WriteAllBytes(fname, content)
        File.WriteAllLines(fname + ".tag", tags)
    )

We’re done! A Flickr image crawler in only about 120 lines of code. Let’s download some images!

downloadImagesWithTag "sheep" 5 @"D:\WORK\ImageData\flickr\sheep\"

It takes less than 5 minutes to download 300 sheep pictures from Flickr.

Discussions

1. One of the strengths of F# Async programming is how easy it makes exception handling. In this example, the exception is handled immediately where it occurs; we could also propagate it to an upper-level function and handle it there. However, that would require more thinking….

2. I only download the images within one search page in parallel; the search result pages themselves are processed sequentially. The program could be modified to process multiple search result pages in parallel as well. Done this way, we get a hierarchical parallel program: 1) at the first level, multiple search result pages are processed in parallel, and 2) at the second level, the images in a search result page are downloaded in parallel.
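Both discussion points can be sketched together. The code below is an illustration, not a modification of the crawler above: fetchPage and fetchImage are hypothetical stand-ins that only sleep, and one download is made to fail on purpose. Async.Catch carries each failure up to the top-level function, where all of them are handled in one place.

```fsharp
// Hypothetical stand-in for fetching and parsing one search result page.
let fetchPage (page: int) : Async<string list> =
    async {
        do! Async.Sleep 50
        return [ for k in 1 .. 3 -> sprintf "page%d-img%d" page k ]
    }

// Hypothetical stand-in for one image download; one name fails on purpose.
let fetchImage (name: string) : Async<string * int> =
    async {
        do! Async.Sleep 50
        if name = "page2-img2" then failwithf "cannot download %s" name
        return name, name.Length
    }

// Second level: all images of one result page are downloaded in parallel.
// Async.Catch turns an exception into a Choice value instead of raising it here.
let crawlPage (page: int) : Async<Choice<string * int, exn>[]> =
    async {
        let! names = fetchPage page
        return! names |> List.map (fetchImage >> Async.Catch) |> Async.Parallel
    }

// First level: all result pages are processed in parallel.
let crawlAll (pages: int) =
    let results =
        [ 1 .. pages ]
        |> List.map crawlPage
        |> Async.Parallel
        |> Async.RunSynchronously
        |> Array.concat
    let ok = results |> Array.choose (function Choice1Of2 img -> Some img | Choice2Of2 _ -> None)
    let failed = results |> Array.choose (function Choice2Of2 e -> Some e.Message | Choice1Of2 _ -> None)
    ok, failed

let ok, failed = crawlAll 3
printfn "%d images downloaded, %d failed" ok.Length failed.Length
```

With a real site, launching every page and every image at once may be too aggressive, so some limit on the number of concurrent requests would probably be needed.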