ScraperWiki is useful both for programmers who want to write screen scrapers with less fuss, and for journalists, activists and the general public who want to discover and re-use interesting, useful data.

By default scrapers only run when you run them from the editor.
You can set your scraper to run automatically (e.g. once a
day) in the "schedule" section of its main page.

Currently you can't schedule runs more frequently than once a day.
We can allow this on a case-by-case basis;
please get in touch if you have an
application that needs it.

Who can edit a scraper?

By default anyone can edit anyone else's scraper; this
means other people can help extend or fix your code. You
will be emailed so you'll know when your scraper is edited
and by whom.

You can also protect a scraper, so only people you choose
can edit it, or alter the data associated with it. Go to the
Contributors section on the scraper overview page. Where it says
"This scraper is public" choose "edit" and change it to
"Protected".

Eventually you will have the option to keep scrapers
completely private. If you are interested in testing an early
version of this, please get in touch.

How can I get data out of ScraperWiki?

The simplest way is to download a CSV file from the link on the scraper page, or you can use the API.

Your code is automatically committed to built-in source control, based on Mercurial. See the history tab.

I made a column or a table I don't need, how do I remove it?

The datastore save function automatically makes a schema for you. This means
that while you're developing a scraper you sometimes end up with columns or tables that you
don't need later.

The easiest fix during development is to clear the datastore,
and let your script make it again with exactly the right
columns/tables. There is a button on the scraper overview page
called "Clear datastore" that does that.

Alternatively, call the "execute" function and use
"ALTER TABLE" SQL commands to modify the schema as you like.
See the Datastore copy & paste guide.
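
For example, a minimal Python sketch, assuming a leftover table called "scratch" and the default "swdata" table. Note that the version of SQLite used can rename tables and add columns with ALTER TABLE, but cannot drop a column, which is why clearing the datastore is often simpler for unwanted columns.

    import scraperwiki

    # Drop a whole table created by mistake during development.
    scraperwiki.sqlite.execute("DROP TABLE IF EXISTS scratch")
    # ALTER TABLE can rename a table or add a column, but not drop one.
    scraperwiki.sqlite.execute("ALTER TABLE swdata ADD COLUMN notes TEXT")
    scraperwiki.sqlite.commit()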

How do I log progress of my scraper?

Just by printing! Use print or puts according to the language you're using.
You can write to stdout or stderr.

The output is displayed in the console as the scraper runs in the editor, and
a selectively cropped version is stored in the history for scheduled runs.
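
For example, in Python (a minimal sketch):

    import sys

    for page in range(1, 6):
        print("Scraping page %d of 5" % page)   # stdout appears in the console
    sys.stderr.write("Finished\n")               # stderr is shown too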

ScraperWiki scrapers are not guaranteed to run at a specific time. We use a queuing system to smooth demand on our servers. You can read more on our Google Group.

How do I get query parameters in my view?

The query string is available in the same
way in each language, via the QUERY_STRING
environment variable, following the CGI standard; this
means that the typical ways of accessing CGI
parameters will work (the details vary according to
your chosen language).
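
For example, a minimal Python sketch using the standard cgi module (the "name" parameter is just an illustration):

    import cgi

    # FieldStorage reads GET parameters from the QUERY_STRING environment variable.
    form = cgi.FieldStorage()
    name = form.getfirst("name", "world")
    print("Hello, " + name)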

Each script run has a limit of roughly 160 seconds of processing time.
After that, in Python and Ruby you will get an exception
(scraperwiki.CPUTimeExceededError in Python, and
ScraperWiki::CPUTimeExceededError in Ruby).
In PHP the script is terminated with the fatal error
"Maximum execution time […] exceeded". We would love to convert this to an
exception, but sadly we can’t due to a limitation in PHP.

This exception can be caught, and it can be useful to
do that if you have some state to save or other cleaning
up to do before exiting. You will have a small amount
(2 seconds) of additional CPU time before the process
is killed without warning.

In many cases this happens when you are first scraping a site, catching
up with the backlog of existing data. The best way to handle it is to make
your script do a chunk at a time using the save_var and get_var functions
in the ScraperWiki library to remember your
place.
This technique also lets you recover more easily from other parsing errors.
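
For example, a minimal Python sketch (scrape_page and the page numbers are placeholders for your own scraping code):

    import scraperwiki

    def scrape_page(n):
        # Placeholder for fetching and parsing page n, and saving its rows.
        pass

    # Resume from wherever the previous run got to.
    page = scraperwiki.sqlite.get_var("last_page", 1)
    try:
        while page <= 5000:
            scrape_page(page)
            page += 1
    except scraperwiki.CPUTimeExceededError:
        # Roughly 2 seconds of CPU time remain: record progress before the process is killed.
        scraperwiki.sqlite.save_var("last_page", page)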

Note that CPU time is not the same as
wall-clock time. Most scripts use only a little
CPU time when they are scraping; they are mostly
waiting for web pages to be downloaded, which
takes essentially no CPU time. A typical script
can scrape thousands of pages before hitting
this CPU limit.

What is the limit on memory use?

The sandbox in which the scrapers run gives them at most
1GB of memory. When this is exceeded, your process will most likely be killed with SIGKILL.

Can I use my own text editor, such as vim or emacs?

At the moment, no. The editor and pair programming functions are integrated into the internal
workings of the CodeMirror editor that the site uses.

You can, however, disable the editor and use a plain old HTML textarea editor
by adding either "?textarea=plain" or "#plain" to the end of the URL while editing the
scraper.

This should be enough for you to use it with browser plugins that
spawn vim or emacs for a textarea. It will also work in browsers
for which CodeMirror doesn't work.

Please give us feedback about this feature so we know if we should make it more available
(for example by adding a setting to your user account so that the editor is always a textarea for you).

How do I revert to an earlier version of my code?

On the history page for the scraper, view the commit that you
want to go back to. There's then a link called "rollback".

When do views show a "powered by ScraperWiki" banner?

You will see a banner at the top right of your view saying "powered by ScraperWiki".
This appears on all HTML views, and is a link back to the view. It gives us and you credit,
and ensures sources are cited.

If your view is generating something other than HTML, such as JSON, CSV or an image file, you
should set the httpresponseheader to the appropriate MIME type (see the ScraperWiki library),
and the banner will no longer appear (e.g. in Python: scraperwiki.utils.httpresponseheader("Content-Type", "text/json")).
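
For example, a minimal Python sketch of a JSON view (application/json is the standard JSON MIME type):

    import json
    import scraperwiki

    # Set the MIME type before writing any output, then print the response body.
    scraperwiki.utils.httpresponseheader("Content-Type", "application/json")
    print(json.dumps({"status": "ok", "rows": 42}))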

If you are finding the banners annoying in a particular set of cases, please get in
touch. At some point we'll add ways to vary the banner for different circumstances.

Can I save files, and if so where?

Yes, but all files are temporary. You can save them either in
/tmp, or in the user's home directory (/home/scriptrunner). The
current directory starts out as the home directory.

This only works for temporary downloads, as the scripts run
in a clean environment each time so data can't leak between
scripts. You must save any permanent data in the datastore, or
elsewhere on the web.
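
For example, a minimal Python 2 sketch (the URL is just an illustration):

    import urllib

    # Download to /tmp; the file only lasts for this run, so anything you want
    # to keep must go into the datastore.
    urllib.urlretrieve("http://example.com/data.csv", "/tmp/data.csv")
    with open("/tmp/data.csv") as f:
        print(f.readline())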

Wow, I can run arbitrary commands!

Feel free to spawn external commands and download arbitrary extra
binaries or code. It'll be slow, so if there is
something you use a lot, ask us to install it permanently.

It's possible to attach to several other scrapers' datastores, and use SQL to select
from all of them as if they were one.
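
For example, a minimal Python sketch using the library's attach function, assuming another scraper named "other-scraper" whose data is in the default swdata table:

    import scraperwiki

    # Attach the other scraper's datastore under the alias "other",
    # then query it alongside this scraper's own tables.
    scraperwiki.sqlite.attach("other-scraper", "other")
    rows = scraperwiki.sqlite.select("* from other.swdata limit 10")
    print(rows)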

The datastore is slow and/or timing out, what should I do?

Queries to the datastore can take at most 30 seconds. Here are some things you
can do if this is a problem:

If this is happening with a large database from the web site,
try reloading the page again. It should then be cached in
the server's memory and respond the second time.

From code, it is faster to save lots of rows in groups. Instead
of passing a dictionary/hash to the save command, pass a list of dictionaries/hashes.
See "for greater speed" in the Datastore copy & paste guide.

When saving, specify "verbose = 0" as a parameter to the
save_sqlite function. This turns off logging to the Data tab
in the editor, and in some cases can make it ten times faster
while developing. See ScraperWiki library.
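
For example, a minimal Python sketch using scraperwiki.sqlite.save:

    import scraperwiki

    # verbose=0 suppresses the logging that normally feeds the Data tab in the editor.
    scraperwiki.sqlite.save(["id"], {"id": 1, "value": "example"}, verbose=0)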

Make appropriate indices to speed up your queries.
For example, in Python, this creates an index for a
road accidents scraper:

    scraperwiki.sqlite.execute('''
        CREATE INDEX IF NOT EXISTS casualty_type_manual_index
        ON casualties (Casualty_Type)''')

The datastore normally times out after 30 seconds, but
CREATE INDEX commands are allowed up to 3 minutes.

Can I import code from another scraper?

In Ruby you can run:

require 'scrapers/some-other-scraper'

In PHP you can pass a URL to
require:

require("http://some-url-to/some.php");

In Python we have a feature that can,
at best, be described as experimental.
Instead of import amodule you can use:
amodule = scraperwiki.utils.swimport("some-other-scraper")

If you would like this to be more convenient (or even
better, if you have a patch to make import foo
work), then please get in
touch.

It depends on where the data originally came from and how it was derived.

What's your policy on what's legal to scrape?

In short, play nice, and don't do to someone else's website
what you wouldn't like done to your own.

General

It is our view that, where a web server responds to an
unauthenticated HTTP request, there is an implied licence to use the
HTML that is returned for reading and automatically extracting that
information. This, in our view, is how the web is designed to operate.
If the proprietor of a web host wishes (for example) to charge for use
of their site, HTTP provides mechanisms to require payment or
authentication for use. They may also make use of the robots exclusion
protocol to prevent scraping and spidering of any kind.

Of course we may be wrong about this. The question has not been tested
in any UK court and, we understand, there is not much more clarity
world-wide. If you are in doubt about whether what you are doing is
lawful, you should seek your own legal advice, rather than relying on
our best guess.

Platform

Users doing scraping themselves are using us as a hosting service
(whether public or private, it makes no difference).

We will obey the law, for example legal takedown notices. Other than
that, it is none of our business what they do.

We won't add features to the platform whose only purpose is to avoid
technical measures that prevent scraping. It is, however, not our
business whether users do so themselves using standard tools.

Data services

Where we are ourselves handling data requests or doing other
consultancy, our policy is twofold.

1) If it is a Government site (any Government), and we aren't just
performing technical measures to get data which the Government
otherwise sells, then we consider it in the public interest for it to
be scraped. We will do so.

2) For non-Government sites, we check the robots.txt file. If the site
permits robots in general to scrape their site (NOT just GoogleBot!),
then we will do so. We will make no effort to look for other terms and
conditions as well.

Developer introductions

Where we introduce someone to a developer who does scraping for them,
the same situation as described in "platform" above applies.

It is up to you to ensure that your scraping activity does not break
the law. While some standard libraries may check robots.txt, others may
not. Even if you are permitted to use a site, you should ensure that
what you do is not disruptive and does not break the law in some other way.