Here's the situation: a person hired me to build a scraper related to economics/investment. It uses other sites for their data (the problem), and it's easier to just write this thing, set up a server for the client and then boom, the legal problem is theirs. Granted, they're not selling the data, they're just automating access to the data and packaging it for themselves for insight as they see fit.

The data is "public facing", i.e. no logins, but it's their data too. The other sources are labor/Nasdaq sites that publish reports about how the economy is doing.

It's just somewhat annoying because I have to tell this person "I need you to rent a server, then buy a domain, then give me access, then I can set everything up and get the mail client set up as well so you can get the emails reliably." But it's the easiest thing for me to do to avoid dealing with a potential legal problem; I was paid to write code, not to sell data. I don't understand this argument either, I mean I am willfully writing code that is bad. You could argue that cars/guns could be used to harm people, but that is a byproduct: a car/bullet rams into a person and they die. The car/gun still did its purpose of moving something.

But if I was an enterprising sapling, I'd turn this into a service that people paid for, but the data isn't mine. So I could be like "I went out there and asked these companies for their data and worked something out..." Like aggregators and sites that use contents from other publishers/content creators.

I don't know if it's bad that I worry before something even happens and thus limit myself. Or maybe this is also why I'm not in jail. I've pretty much written all of the code already; it's just a matter of going through the verbose crap of "buy a server and a domain, let me handle everything, then I'll turn it over to you and you can pay for it/keep it going" versus "pay me and I'll keep it running."

Just because a piece of data is public does not mean that it is not copyrighted.

Meltwater got sued by the AP for aggregating AP news stories for its clients.

Power got sued by Facebook for their service that allowed social media aggregation.

You are allowed to scrape for public, factual data; any work that could be considered copyrighted should be avoided. You can scrape a stock ticker, you can't scrape written commentary on what the stock is doing or why it's moving. You can't scrape for personal or sensitive data. You can't scrape where forbidden by the ToS.

As a developer you're held to the standard of "reasonable care", which would almost certainly be violated by writing this software if you're scraping any of the sorts of data discussed above; you're being asked to write software that performs actions that constitute illegally accessing or copying data.

To continue the car/gun analogy, as a gun store owner I'm not going to be prosecuted when some schmuck shoots his wife with a gun he bought from my store. However, if that schmuck came into my store and said "Hey, I'm looking for a gun, I need to shoot my wife", and I sold him a gun purpose-built for wife-shooting, then I'd be criminally liable.

To be clear, if you're scraping copyrighted data, having them run the server isn't going to absolve you of responsibility for this. You need to ensure that you take "reasonable care" to notify the client as to what they are or are not allowed to scrape. If this thing has a UI you need to alert the user that they are prohibited from using the tool to scrape copyright-protected data that they do not have permission to copy. If this is headless you need to get it in writing, in your contract, and in the requirement docs that this tool will not be used to scrape copyrighted data that they do not have rights to.

Assume everyone's an asshole. Assume they'll take this tool, use it to scrape copyrighted data, then when they're sued by the rights-owner, they'll sue you. Cover your ass to protect yourself from the worst-case scenario, then be stoked when it goes better... or at least you get to be right when the shit hits the fan.

EDIT: Thanks to /u/wengemurphy for their great response, and pointing out an error in mine. The claims I'm making are applicable to the US only, other countries have wildly different laws regarding unauthorized computer access, EULAs, copyright law etc. that would make this advice inapplicable.

This is probably what I'm looking for, where would I look for "true source" of that? The actual companies?

For the other places like Nasdaq/Labor department, I'm just checking whether a part of their site changes (documents released), but the person still has to go there and look/open the link.

Damn... hmm, that sucks, already sunk 6 hours into this... kind of don't want to walk away from it. It's just personal use right? Not being sold?

Yeah this doesn't look good:

Thomson Reuters content is the intellectual property of Thomson Reuters. Any copying, republication or redistribution of Thomson Reuters content, including by caching, framing or similar means, is expressly prohibited without the prior written consent of Thomson Reuters.

I mean they have a download button and you can download all the data for year(s) so...

Thanks for your time/input

Edit: well... guess the simplest solution is to just not charge for it, ahh well, learning experience I guess.

Edit: I think I'm actually safe, but because I'm scared I'll just not charge for it. It's actually for a person that I know, and the data is free apparently (regarding Yahoo Finance). As for the labor thing and Nasdaq, my argument is that there are services out there that check a site for a keyword and somehow "this is okay", so I don't know. It's not much money anyway so whatever, it was a good exercise. I hunted down this token and it still wouldn't work, hilarious: I parsed their site to find the crumb generated per visit and it still wasn't valid somehow.

I'm not a lawyer, but I can't see why you'd be held liable for this product, you're just presenting public data in a different way. If I were you I'd run the server myself and bill monthly for this service.

Mere data - collections of facts - isn't copyrightable in the United States. Other places like the UK do have "database rights" protections.

The classic case is Feist v. Rural, a lawsuit over a phone book. One company copied listings from another's phone book and published them in its own. The original publisher sued, and lost. The data itself isn't protected.

We're talking about factual information here like numbers and addresses. You couldn't copy creative elements from the Yellow Pages like graphical ads and such, because those are copyrighted by various entities who submitted them.

However, this doesn't mean you can scrape anything, for any reason, at any time. Creatively written prose is protected by copyright, so if the "data" is a bunch of creative content, you can't copy it. The AP vs Meltwater case /u/jlobes mentions is different than copying "data", because news stories are creative. For another example, search engines like Google get to copy billions of webpages because they have a strong fair use argument, not because they have firm legal right to do so.

As far as Facebook v Power, Power was found liable for CFAA violations, but the CAN-SPAM violations were reversed on appeal. SCOTUS has declined to hear a further appeal.

From the 9th Circuit's decision on the CFAA:

In sum, as it admitted, Power deliberately disregarded the cease and desist letter and accessed Facebook's computers without authorization to do so. It circumvented IP barriers that further demonstrated that Facebook had rescinded permission for Power to access Facebook's computers. We therefore hold that, after receiving written notification from Facebook on December 1, 2008, Power accessed Facebook's computers “without authorization” within the meaning of the CFAA and is liable under that statute.

and on CAN-SPAM:

Because neither e-mails nor internal messages sent through Power's promotional campaign were materially misleading, Power did not violate the CAN-SPAM Act. We reverse the district court on this claim and remand for entry of judgment in favor of Defendants.

Power was not found liable for copyright infringement.

Personally, I think using the CFAA in this manner is problematic, and so does the Electronic Frontier Foundation. The CFAA is supposed to be about hacking into computers - that's why CFAA violation is a crime - but it's worded broadly enough to cover web scraping.

The lesson here is that a company could sue you and win through other avenues besides copyright infringement. They might also try "tortious interference" as Blizzard did against WoW bots.

However, the raw data itself cannot be "owned" (unless, again, it's creative). If you're sued it will be over the service you're providing and the manner in which it's interacting with their servers.

Personally I wouldn't mind writing a scraper for a client, but I would license/sell it to them and have them run it themselves.

Yeah amazing how your drive/interest drops when you're no longer getting paid like "ugh... look at all this work"

That's the thing these are just "stock ticker" facts right? Or I mean just data points.

The other thing is grabbing changes to a site but it's not like "published work" I guess that's arguable. Like one is upcoming IPO releases the person would receive a notification of the changes to that particular table on the page but it's just date/worth/name sort of thing.
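For the "notify when that table changes" part, here's a rough sketch of one way it could work, assuming the page has already been fetched and the table is plain HTML. Everything in it is made up for illustration (`TableRows`, `table_rows`, `fingerprint`, `new_rows`, and the sample rows); it isn't tied to any particular site's markup. It only pulls out the cell text (date/worth/name style facts), hashes it so two visits can be compared cheaply, and lists the rows that weren't there last time.

```python
import hashlib
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace so layout-only edits don't count as changes.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None
            self._cell = None


def table_rows(html):
    """Return the table rows found in `html` as a list of tuples of cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def fingerprint(rows):
    """Hash the rows so the last visit can be stored and compared as one short string."""
    payload = "\n".join("\t".join(row) for row in rows)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def new_rows(old_rows, current_rows):
    """Rows present now that were not present on the previous visit."""
    seen = set(old_rows)
    return [row for row in current_rows if row not in seen]


# Made-up sample pages standing in for two visits to the same table.
OLD_PAGE = (
    "<table><tr><th>Date</th><th>Name</th></tr>"
    "<tr><td>2017-05-01</td><td>Acme</td></tr></table>"
)
NEW_PAGE = (
    "<table><tr><th>Date</th><th>Name</th></tr>"
    "<tr><td>2017-05-01</td><td>Acme</td></tr>"
    "<tr><td>2017-06-01</td><td>Globex</td></tr></table>"
)

if fingerprint(table_rows(OLD_PAGE)) != fingerprint(table_rows(NEW_PAGE)):
    print(new_rows(table_rows(OLD_PAGE), table_rows(NEW_PAGE)))
```

The email would then only need to carry the new rows plus a link back to the page, so the person still goes to the source to read anything beyond the bare data points.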

Thanks, yeah the one-off is the approach I was looking at (they pay to host the server that runs it, I would configure it). Have to buy a domain too and I'm not really trying to buy a throw-away domain for the sake of emailing (reliably).

I did see that about the "polling their servers" argument. The ToS said something about robots for one particular page and that they could terminate your access, though this isn't multiple times a day, it's twice a week.
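On the robots point, a small sketch of checking a URL against a site's robots.txt before polling it, using Python's standard `urllib.robotparser`. The rules, the `my-scraper` user agent and the example.com URLs are all made up for illustration, and `allowed_by_robots` is just a helper name I picked. Passing this check only covers robots.txt; it says nothing about what the ToS itself allows.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so the rules can come from a string
    # that was downloaded (or saved) earlier.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules standing in for a real site's robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

for url in (
    "https://example.com/ipo-calendar",
    "https://example.com/private/report.html",
):
    verdict = "ok" if allowed_by_robots(SAMPLE_ROBOTS, "my-scraper", url) else "blocked"
    print(url, verdict)
```

Running the check right before each of the twice-a-week polls would also catch the case where the site changes its rules later.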