Webmaster General Forum

I hope you can help me. I was thinking it'd be a really good idea to store a copy of my competitor's website (or a partial copy), so that if they have any big SEO increases I can look back at the changes they've made.

So I'm looking to spider their site to get this info. Also...

They also have members that I don't have on my site, and I'd like to scrape their members list and compare it to my own so I can see what my market penetration is. For example, if I have 70 members and he has 70, but 30 of his are different from mine, then I know I have 70 out of 100 members and my market penetration is 70%.
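The comparison described above is just set arithmetic: take the union of both member lists, then divide your own count by the size of the union. A minimal sketch in Python, using made-up member names that reproduce the 70/70-with-30-different example:

```python
# Hypothetical member lists: m1..m70 are mine, m31..m100 are the
# competitor's (40 shared with me, 30 I don't have).
my_members = {f"m{i}" for i in range(1, 71)}
competitor = {f"m{i}" for i in range(31, 101)}

total_market = my_members | competitor              # union of both bases
penetration = len(my_members) / len(total_market)   # 70 / 100

print(f"total known market: {len(total_market)}")   # 100
print(f"penetration: {penetration:.0%}")            # 70%
```

The same union trick extends to any number of competitors, as long as the member identifiers (usernames, say) are comparable across sites.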

Is spidering for both/either of these reasons legal?

I know they can ban the IP address; however, I can perform the spidering from a non-static IP address if need be.

Thanks, I thought that would be the case. I'm definitely not going to be publishing the data, just using it internally.

Sure seems like a lot of work/time you could be spending on other things though. Just my opinion.

I can see why you'd think this; however, I think benchmarking competitors is an essential business activity and should be standard for any analytics/optimisation team. If my penetration level is 80%, that looks really good, but I'll never know how good or bad it actually is until I benchmark against my competition.

Not a problem. By penetration I mean your share of a customer base. For example, say you sell forum software...

100 websites use forum software: 55 of them use your software, 30 use your main competitor's, and the rest use other forum software.

This would mean you have 55% of the market, so your market penetration is 55%.

Coming back to my original post: if you don't know how many websites use forum software in the first place, then you have no way of knowing your market penetration. I know how many members my site has, but I don't know how many my competitor has, or how many are members of both our sites. Spidering would let me find out.

You may think that your 55 websites using your software is really good, but what would you think if you found out that 10,000 websites use forum software, and 5,000 of them use your competitor's? All of a sudden you haven't "penetrated" the market at all.

Ok, I see where you're going with this. But I suppose this only really helps if you have a very specialized product? In the case of forum software, it's almost impossible to see who has it installed. Take phpBB / SMF / other OSS versions, for example: how will the developers ever know exactly how many instances are installed? With vBulletin / Invision board / other commercial scripts this is obviously easy, since they presumably track sales. But why would they publish that on their website?

In our case, web hosting, this is pretty much an impossible "battle". None of my competitors will have an exact client count advertised on their website :)

Webmasters put up with spiders eating bandwidth because of the benefits of being listed by search engines. Legal or not, I think that finding a competitor consuming their bandwidth with a bot would get a lot of people phoning their lawyers for advice.

Which is why I'm asking here whether it's legal. I've had companies contact us asking if we wanted to spend thousands on a list of a competitor's members that they'd obtain by spidering; I'm wondering if we can do it ourselves...

Most of us are not lawyers and are just sharing our opinions. I would consult an attorney for real legal advice.

My opinion: you are researching information that is public. Whether you read it manually or via a bot that obeys the site's rules, that to me is not illegal (again, I'm not a lawyer).

Then there is the point of saving the information locally. That in itself is not illegal either. How many web browsers cache pages locally on your hard drive? If it were illegal, browsers would not be able to do that, nor would they have a "Save as" option for every webpage you visit.

Some folks like to save pages offline to view later while they are on trips, etc. All of this seems to fall under fair use of a site's public information.

When you spider your competitor's website with software like wget, make sure you wait a second or two between fetches. Otherwise you could overload your competitor's server (effectively a DoS attempt), and that could get you in trouble.
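With wget the built-in `--wait=2` option does this throttling for you. If you're writing your own spider instead, the same idea is a few lines; here's a minimal sketch in Python where the fetch and sleep functions are injectable (hypothetical helper, not any particular library's API):

```python
import time
import urllib.request

def polite_fetch(urls, delay=2.0, fetch=None, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests.

    `fetch` and `sleep` can be swapped out so the loop is testable
    without hitting a live server; by default it does a plain GET.
    """
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url).read()
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # throttle so we don't hammer the server
        pages[url] = fetch(url)
    return pages
```

Two seconds between requests keeps you well clear of anything that looks like a denial-of-service attempt, and on a big site it also makes your bot less conspicuous in the logs.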

I wouldn't worry so much about the terms of service of the site you are crawling. I WOULD respect the robots.txt directives: if your user agent or the URL you intend to request is excluded, don't spider that URL.
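Checking robots.txt before each request is straightforward; Python's standard library ships a parser for it. A small sketch, using a hypothetical robots.txt that blocks a members area (in practice you'd fetch the real file with `set_url()` and `read()`):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
robots_txt = """\
User-agent: *
Disallow: /members/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyBot", "http://example.com/members/list"))  # False
print(rp.can_fetch("MyBot", "http://example.com/about"))         # True
```

Call `can_fetch()` with your bot's user-agent string before every request, and skip any URL it rejects.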