Are there any reasonable guidelines for how much and how often I can screen-scrape SO? I've looked through related topics and have seen Jeff's enthusiastic response to the user information grabber written in Python, as well as a post by someone implementing that program to scrape the profiles of all SO users. But I also seem to remember reading a blog(?) post by Jeff in which he seemed quite upset with people who generated heavy traffic on the site (someone using a .NET application that made several requests per second, and Yahoo's spider).

I'm interested in creating an application that allows offline/online browsing, with most of the functionality of SO plus some of my own ideas. Of course, with no API available, I'm forced to screen-scrape for the content, and at more regular intervals than the various rep-trackers out there, since I also require up-to-date post/comment information.

I've considered scraping all the content with my server and serving it out from there, but that would make it look as if all the requests came from, and returned to, a single source, which isn't the case, and I wouldn't want to be blacklisted for that reason. Alternatively, I can make the requests come from the users themselves, but this would actually cause more traffic, although it could not be traced back to me.

Are there any guidelines to follow, or posts by the SO team I can read on how to do this in an acceptable manner, or should this idea be abandoned? I don't know how much traffic I'd actually generate, but I would hate to have my work reduced to nothing by getting blocked if people actually use my app.

Sounds like you're just trying to steal SO users/traffic rather than produce anything useful and screen-scraping is a bad idea for an offline app.
– Jonathan Parker, Jun 29 '09 at 4:59

22

@Jonathan, that's a totally ignorant and unnecessary comment. In this case I was considering making an AIR app whose sole purpose is to extend the functionality of SO; it's for the same reason that there are some 20+ Twitter apps. Are they stealing the Twitter community? Simply put, there are things I'd like to do that I assume the devs don't have time for. 'Stealing the community' is definitely the opposite of what I'm trying to achieve. Secondly, screen-scraping is the only option in this case, since there's no API. Luckily AIR apps can be easily updated to accommodate changes.
– Ian Elliott, Jun 29 '09 at 5:16

3 Answers
3

We have scripts that check for unusual / abusive access patterns, and a daily "top n" traffic summary report of inordinate and anomalous usage.

We regularly block (IP range ban) unknown scrapers that do not identify themselves and/or have poor behavior patterns. These bans are permanent until someone emails us to make a case that they should be removed.

If you don't want to get blocked, here's how:

1. Use GZIP requests. This is important! For example, one scraper used 120 megabytes of bandwidth in only 3,310 hits, which is substantial. With basic gzip support (baked into HTTP since the 90s, and universally supported) it would have been 20 megabytes or less.

2. Identify yourself. Add something useful to the user-agent (ideally, a link to a URL, or something informational) so we can see your bot as something other than "generic unknown anonymous scraper."

3. Use the right formats. Don't scrape HTML when there is a JSON or RSS feed you could use instead. Heck, why scrape at all when you can download our cc-wiki data dump??

4. Be considerate. Pulling data more than every 15 minutes is questionable. If you need something more timely than that ... why not ask permission first, and make your case as to why this is a benefit to the SO community and should be allowed? Our email is linked at the bottom of every single page on every SO family site. We don't bite... hard.
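
To make the guidelines above concrete, here is a minimal sketch in Python (standard library only) of a client that asks for gzip, identifies itself, and refuses to re-pull the same URL more than once per 15 minutes. The `USER_AGENT` string, the `PoliteFetcher` class, and the helper names are all hypothetical, chosen for illustration; this is one way to follow the list, not a description of how any existing scraper works.

```python
import gzip
import time
import urllib.request

# Hypothetical identity string: replace with your own app name and a page
# that explains what the bot does and how to contact you.
USER_AGENT = "ExampleSOReader/0.1 (+http://example.com/about-this-bot)"
MIN_INTERVAL_SECONDS = 15 * 60  # the 15-minute guideline from the list


def build_request(url):
    """Build a request that asks for gzip and identifies the client."""
    return urllib.request.Request(
        url,
        headers={"Accept-Encoding": "gzip", "User-Agent": USER_AGENT},
    )


def decode_body(raw, content_encoding):
    """Decompress the body if the server answered with gzip."""
    if content_encoding and "gzip" in content_encoding.lower():
        return gzip.decompress(raw)
    return raw


class PoliteFetcher:
    """Serves a cached copy instead of re-pulling a URL too soon."""

    def __init__(self, min_interval=MIN_INTERVAL_SECONDS, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock      # injectable so the throttle can be tested
        self.last_fetch = {}    # url -> clock value at the last real fetch
        self.cache = {}         # url -> body returned by the last real fetch

    def may_fetch(self, url):
        last = self.last_fetch.get(url)
        return last is None or self.clock() - last >= self.min_interval

    def fetch(self, url):
        if not self.may_fetch(url):
            return self.cache[url]  # too soon: reuse what we already have
        with urllib.request.urlopen(build_request(url)) as response:
            body = decode_body(
                response.read(), response.headers.get("Content-Encoding")
            )
        self.last_fetch[url] = self.clock()
        self.cache[url] = body
        return body
```

For scale, the figures in the first point work out to roughly 36 KB per hit uncompressed (120 MB over 3,310 hits) against about 6 KB per hit at 20 MB, so the `Accept-Encoding` header is the cheapest change in the sketch.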

Yes, you want an API. Now there is one! See http://stackapps.com for all the info you could possibly want, and more.
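
As a rough illustration of what using the API instead of scraping might look like, the sketch below only builds a query URL for recent questions. The API root and the parameter names (`order`, `sort`, `site`, `pagesize`) are assumptions based on the public Stack Exchange API as documented today, not something stated in this post, and the `questions_url` helper is hypothetical; check the documentation linked from stackapps.com before relying on any of it.

```python
from urllib.parse import urlencode

# Assumed API root; verify against the current documentation.
API_ROOT = "https://api.stackexchange.com/2.3"


def questions_url(site="stackoverflow", page_size=30):
    """Return a URL listing recently active questions on one site."""
    query = urlencode(
        {"order": "desc", "sort": "activity", "site": site, "pagesize": page_size}
    )
    return f"{API_ROOT}/questions?{query}"
```

The response is JSON, so there is no HTML to parse, and the same considerate-client habits (gzip, a descriptive user-agent, modest polling) still apply.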

The issue with this is that the scrapes are not interval-specific. Say, for instance, a user clicks on a question from a list of the current questions. I would then need to scrape the current question and its responses. The requests would be driven by what a user wants and when they want it. Based on your answer, it seems this is not exactly something you would prefer? Even though, if the user weren't using the app, they would be making the same requests directly from the site.
– Ian Elliott, Jun 29 '09 at 3:53

With the speed at which SO is growing, you have highlighted the potential for creating an API for this purpose. We recently played around with the idea of building a WPF application to do the same thing, and the screen scrape was just a bit too much of a dependency for us. An API solves the problem.
– Diago, Jun 29 '09 at 6:54

For the last point, what kind of things can we request, exactly?
– Ian Elliott, Aug 20 '09 at 4:08

#4 - You have to pull the front page every 5 minutes or more often during busy times to avoid missing changes. The other parts of the trilogy are fine for now, but Stack Overflow at least needs an API just to keep up with it.
– Adam Davis, Nov 11 '09 at 20:24

The dump contains older information, which would make the app almost useless. More and more, this idea sounds like something the current state of SO can't accommodate.
– Ian Elliott, Jun 29 '09 at 6:19