What are the most effective ways to scrape content from websites?

If you want to aggregate content from websites that do not offer API integration, what are the most effective methods? Let's say you want to pull consumer product review comments from Best Buy or Amazon. Can this be done? I recall from years past just how messy screen scraping was. Is this still the only option? Has the technology to do this improved in recent years? What if the content you want to aggregate requires a search in order to be found? For instance, let's say you want reviews pertaining to a specific product. If you were on Amazon, you'd need to search for the product first. How does screen scraping get around this? Can you run into legal issues and pushback if you attempt to collect content without consent?

In situations where a web site harbors product review comments and does not offer an API, what's the best approach to get them to share their data? I'm assuming that if they could monetize this content somehow they'd be receptive to the idea.

Anton Yakovlev
Founder of four successful businesses on two continents who can help you do the same

October 31st, 2015

Richard, to my knowledge, NLP (natural language processing) tools have improved greatly. We built a tool that scraped app descriptions from an Android app store and inferred which permissions each app needed to run on an Android smartphone. It worked with 80-90% efficacy.

Following the link Andrew provided, I quickly ran import.io over the e-commerce site my company runs in Russia. It correctly parsed 4 of 20 items on a page, which is 20% efficacy. Therefore, if you need a good scraper and the web site has no API, I believe you should build a good contemporary parser based on NLP to get your data.

As for the legal issues: for sure it's legal to download content from any public site on the web and analyse it. At least, Google does this all the time. What is not legal in many cases is publishing the downloaded content without the owners' consent. Therefore, the purpose of the scraper should be well defined. And, yes, Amazon has an API that can give you all the information you need from them.

The technology has greatly improved since the time it was commonly referred to as "screen scraping". Now more commonly called web scraping or data scraping, it is mostly done using automated headless browsers, which can return cookies, execute JavaScript, etc., making it much easier from a technical point of view. If the site you are scraping doesn't defend itself against scraping (but simply hasn't bothered to expose the data in API form), you should be able to use a cloud scraping service like those offered by AWS and Google, for example.

The problem usually lies in the fact that you are probably doing something very much against the website owner's interests, and probably violating their terms of use, exposing yourself to lawsuits, etc.

In the examples you provide (Best Buy, Amazon), the content is very much central to their value proposition to customers, and gives them a significant competitive advantage (people prefer to shop on Amazon because they trust the reviews there, for some reason). They would have to be convinced that they will gain directly from your use of the content, and that you will safeguard this content against their competitors as diligently as they do themselves (which would be tough for a startup to convince them of).

If you do not have their permission, you will probably find that they spend vast resources to foil scraping attempts, e.g. by blocking or serving fake responses to requests they are able to identify as scraping attempts.

Anonymous

January 6th, 2017

Check out the website at http://www.3idatascraping.com – the company provides best quality scraping website content services at affordable prices. Get data and images from any website within your budget.

Richard, I've worked a bit with some of what it sounds like you're trying to do and I can tell you from an engineering perspective, it is not trivial. Unless a company's website intends you to scrape or otherwise get their content, they will do many things to protect it including hiding it behind mostly dynamic HTML.

What @NearPrivman talks about is quite true.

If I were you, I would readjust the value proposition of what you're trying to do. Possibly reach out and establish business/partner relations to make what you want a reality.

Anonymous

November 1st, 2015

Technically, scraping data is not at all difficult and you can do it in any standard programming language. Sites are typically very interested in getting Google to index the data, so it must be machine readable after all.
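To illustrate how little code the parsing step itself can take, here is a minimal sketch using only Python's standard library. The HTML snippet, the `review-text` class name, and the `extract_reviews` helper are all made up for the example; a real site will use different markup, and its terms of service may not permit this at all.

```python
from html.parser import HTMLParser

# Hypothetical markup: a real site will use different tags and class names.
SAMPLE_HTML = """
<div class="review"><p class="review-text">Great battery life.</p></div>
<div class="ad"><p>Buy now!</p></div>
<div class="review"><p class="review-text">Stopped working after a week.</p></div>
"""


class ReviewExtractor(HTMLParser):
    """Collects the text of every <p class="review-text"> element."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._in_review = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "review-text" in classes:
            self._in_review = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_review:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_review:
            self.reviews.append("".join(self._buffer).strip())
            self._in_review = False


def extract_reviews(html):
    """Return the review texts found in an HTML string."""
    parser = ReviewExtractor()
    parser.feed(html)
    parser.close()
    return parser.reviews


print(extract_reviews(SAMPLE_HTML))
```

This only covers static HTML that is already in hand. Pages that build their content with JavaScript would need a headless browser or similar, which is outside this sketch.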

The problem is a legal one. Scraping a site against the wishes of the site owner (and it's typically excluded in the terms of service) is illegal in many jurisdictions. The rights to the product reviews are probably transferred to the site owner even when submitted by third parties. Alternatively (like on Facebook), the writer retains the rights to the texts and confers usage rights to the site. In neither case are you allowed to republish them.

This is a good text covering the legal ramifications of scraping: http://www.bna.com/legal-issues-raised-by-the-use-of-web-crawling-and-scraping-tools-for-analytics-purposes

Your objection is likely to be "but I'm doing the same thing Google is doing", but the problem is that most sites have specifically authorized Google to scrape their site in their robots.txt while excluding others from doing the same.
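A crawler can check this for itself before fetching anything. The sketch below uses Python's standard `urllib.robotparser`. The robots.txt content, the `MyReviewBot` user agent, the example.com URL, and the `is_allowed` helper are all made up for illustration; a real file would be fetched from the site's /robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that admits Googlebot and excludes everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/product/123/reviews"
print(is_allowed(ROBOTS_TXT, "Googlebot", url))    # True
print(is_allowed(ROBOTS_TXT, "MyReviewBot", url))  # False
```

Note that robots.txt only states the site's wishes for crawlers. Passing this check says nothing about the terms of service or about republishing rights.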

There are also many, many companies working in a gray zone, scraping data and hoping to get away with it, but that is unlikely to work at scale or if you are competing with the companies you are scraping the data from.

Anton, that's precisely the idea: to use an NLP engine to analyze, categorize and measure sentiment of product review data. I'm very familiar with NLP and the vendor landscape, but getting the unstructured comment data is the big problem. There's tons of it out there, but it's scattered and may not be easily accessible. No one has figured this out yet, for the obvious reasons. Once you have the data, it's pretty straightforward to analyze it through NLP.
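As a toy stand-in for that analysis step, here is a sketch of a word-list sentiment score in Python. It is not what a real NLP engine does: the word lists and the `sentiment_score` and `label` helpers are assumptions made up for the example, and a production system would use a trained model or a far larger lexicon.

```python
import re

# Tiny illustrative word lists (assumptions, not a real sentiment lexicon).
POSITIVE = {"great", "good", "love", "excellent", "reliable"}
NEGATIVE = {"bad", "broken", "poor", "terrible", "stopped"}


def sentiment_score(review):
    """Return (positive word count) minus (negative word count)."""
    words = re.findall(r"[a-z']+", review.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for review in [
    "Great battery life, love it.",
    "Stopped working after a week, terrible.",
    "It arrived on Tuesday.",
]:
    print(label(sentiment_score(review)), "-", review)
```

Counting words like this misses negation ("not good"), sarcasm and context, which is part of why the analysis is usually handed to a proper NLP tool.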

I can't agree with 5 Star Film Co. Have you read Amazon's and Best Buy's Terms of Service? Has your lawyer? Even if scraping and using someone else's content for your profit might in some cases technically be free, there are plenty of cases (gone to court and settled) around this, with no clear answer; you may find yourself in a long, uh, discussion with Amazon's lawyers. Can you afford that? Alternatively, you could license their data.

What you want to sell to others might be copyrighted material. First, check that.

So web scraping has improved, but it's still messy and challenging from an engineering point of view. The idea I have revolves around aggregating and analyzing product reviews from various sites that contain such info: social media, blogs, etc. Some may have APIs, others won't. The intended audience for the collected and analyzed data would be product manufacturers.