The worker does all of the heavy lifting with the internet, and the dispatcher keeps everyone in line. You can have as many workers as your system will allow MongoDB connections. MongoDB is used as the central cache to limit the amount of bandwidth needed to scrape target URLs.

## How to use it

### Requirements

`iddt` uses MongoDB as a central cache while it is working. You'll need to install MongoDB to use `iddt`.

- Ubuntu:

```shell
$ sudo apt-get install mongodb
```

### Worker

You will probably want to run the worker (or many workers) as a daemon; this functionality is built into `iddt`, and it allows you to start, stop, and restart a worker daemon at the command prompt. If you would rather NOT run the worker as a daemon, you can get the same functionality (note: this function is fully blocking) by calling the `run()` method:

```python
from iddt import Worker

def new_doc(document):
    # do something with the document
    pass

worker = Worker()
worker.register_callback(new_doc)
worker.run()
```

You're on your own to gracefully exit the `run()` function. If you set `worker._running` to `False`, it *should* gracefully exit after a short while.
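One way to do that is to run the blocking loop in a background thread and flip the flag from the main thread. The sketch below uses a stand-in class (not the real `iddt` Worker, whose internals aren't shown here) purely to illustrate the `_running` pattern:

```python
import threading
import time

class StubWorker:
    """Stand-in for iddt's Worker, only to illustrate the _running flag."""
    def __init__(self):
        self._running = True

    def run(self):
        # Fully blocking loop, like Worker.run(); exits when _running is cleared.
        while self._running:
            time.sleep(0.01)  # a real worker would be processing jobs here

worker = StubWorker()
thread = threading.Thread(target=worker.run)
thread.start()

time.sleep(0.05)          # let it "work" for a moment
worker._running = False   # request a graceful exit
thread.join(timeout=1)    # the loop should wind down shortly
```

The same shape should work with a real worker: start `worker.run()` in a thread, then set `worker._running = False` when you want it to stop.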

### Dispatcher

The dispatcher tells the workers what to work on. You use it something like this:

```python
# this is how you query the results based on mime type
some_docs = dispatcher.get_documents(['application/pdf'])

# this is how you get ALL of the documents
all_docs = dispatcher.get_documents()
```

Note that the `dispatcher.dispatch()` function requires a dict with the following fields:

- `target_url` - This is the URL that the Workers (scrapers) should be working on.
- `link_level` - This is the number of links to follow. Be careful with numbers above 3.
- `allowed_domains` - The `iddt` Worker won't follow links away from the TLD of the `target_url`. If you would like it to, you can supply the list of allowed domains here.
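Putting those fields together, a dispatch payload might look like the dict below. The URL and domain list are placeholders, and the `dispatcher.dispatch(job)` call at the end assumes a `Dispatcher` instance is already constructed as described above:

```python
# A job dict with the fields dispatcher.dispatch() requires;
# the values here are illustrative placeholders.
job = {
    'target_url': 'http://example.com/',   # URL the workers should scrape
    'link_level': 1,                       # follow links one level deep
    'allowed_domains': ['example.com'],    # optional: domains workers may leave the TLD for
}

# dispatcher.dispatch(job)
```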