How to Create a Web Spy with a PHP Crawler


A crawler, spider, bot, or whatever you want to call it, is a program that automatically fetches and processes data from websites, and it has many uses.

Google, for example, indexes and ranks pages automatically via powerful spiders, crawlers, and bots. There are also link checkers, HTML validators, automated optimizers, and web spies. Yeah, web spies. That is what we will be building now.

Actually, I don’t know if this is a common term, or if it’s ever been used before, but I think it perfectly describes this kind of application. The main goal here is to create software that monitors your competitors’ prices so you are always up to date with market changes.

You might think, “Well, this is useless to me. You know, I’m a freelancer, I don’t have to deal with this ‘price comparison’ thing.” Don’t worry, you may be right. But you may have customers with a lot of competitors they want to watch closely. So you can always offer this as a “plus” service (feel free to charge for it, I’ll be glad to know that) and learn a little about the process along the way.

So, let’s rock!

1 – Requirements

PHP server on Linux – We need to use crontab here, so it is best to have a good online server

MySQL – We will store data with it, so you will need a database

2 – Basic crawling

We will start with a basic crawling function: getting some data. Let’s say that I sell shoes, and Zappos is my competitor (just dreaming). The first product I want to monitor is a beautiful pair of Nike Free Run+. We will use fopen to open the page, fgets to read it line by line, and feof to check when we are done reading. For this, you need to have fopen URL wrappers enabled on your server (you can check it via phpinfo). Our first piece of code will be:
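A minimal sketch of that fetch step, wrapped in a function so it can be reused later (the Zappos product URL below is illustrative, not the real one from the demo):

```php
<?php
// Fetch a page with fopen/fgets/feof, as described above.
// Requires allow_url_fopen for http:// URLs.
function crawl_page($url) {
    $handle = fopen($url, 'r');
    if ($handle === false) {
        return false; // could not open the page
    }
    $content = '';
    while (!feof($handle)) {   // keep reading until the end of the stream
        $content .= fgets($handle);
    }
    fclose($handle);
    return $content;
}

// $content = crawl_page('http://www.zappos.com/nike-free-run');
```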

At this point, if you echo $content you will notice that it contains the whole page without any CSS or JS, because on the Zappos site those assets are all referenced with relative paths.

Now that we have the content, we need to extract the product price.

How do we tell the price apart from the other, ordinary data on the page? Well, it is easy to notice that every price has a “$” before it, so what we will do is take all the data and run a regular expression to find every place on the page where a dollar sign is followed by a number.

But our regular expression will match every price on the page. Luckily, Zappos is a good friend of spies: the “official” price is always the first match. The others are only used in JavaScript, so we can ignore them.
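A sketch of that extraction, assuming the simple “dollar sign followed by a number” pattern described above; since preg_match stops at the first hit, it gives us exactly the first (official) price:

```php
<?php
// Grab the first "$xx.xx"-style price from the fetched HTML.
function first_price($html) {
    if (preg_match('/\$([0-9]+(?:\.[0-9]{2})?)/', $html, $matches)) {
        return $matches[1];   // the price, without the dollar sign
    }
    return false;             // no price found on the page
}

// first_price('<span>$89.99</span> ... <del>$125.00</del>') gives "89.99"
```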

3 – Saving data with MySQL

Once you’ve created your table, we can start adding some data. So we will need to connect to MySQL from our PHP and prepare our prices to be saved.

Since our scraped data isn’t made of perfect floats, we need to clean it so we are left with just numbers and a dot.
To connect to our database we will use mysql_connect, then mysql_select_db to select “spy”, and then we can run mysql_query to save or fetch our data.
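A sketch of both steps. The cleaning function is the important part; the database calls below it follow the mysql_* API named above (PHP 5 era), and the table/column names (“zappos”, “price”, “date”) and credentials are assumptions, not a fixed schema:

```php
<?php
// Normalize a scraped price to "digits and a dot" so it can be stored.
function clean_price($raw) {
    // "$1,089.99 " -> "1089.99"
    return preg_replace('/[^0-9.]/', '', $raw);
}

/*
// Saving it, using the functions mentioned above (assumed schema):
$db = mysql_connect('localhost', 'user', 'password');
mysql_select_db('spy', $db);
mysql_query(sprintf(
    "INSERT INTO zappos (price, date) VALUES ('%s', NOW())",
    mysql_real_escape_string(clean_price($scraped))
));
*/
```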

4 – Smarter spy with Crontab

Well, with crontab we can schedule tasks on our (Linux) system so they run automatically. It is useful for backup routines, site optimization routines, and many other things you just don’t want to do manually.

Since our crawler needs fresh data, we will create a cron job that runs every day at 2am. Net.tuts+ has a really good tutorial on how to schedule tasks with cron, so if you aren’t too familiar with it, feel free to check it out.

In short, here are two command lines we could use for it (the second is my favorite):

#here we load PHP and pass the physical address of the file
#0 2 * * * says it should run at minute zero, hour two, any day of month, any month, and any day of week
0 2 * * * /usr/bin/php /www/virtual/username/cron.php > /dev/null 2>&1
#my favorite: with wget the page is processed as if it were loaded in a common browser
0 2 * * * wget http://whereismycronjob/cron.php

5 – Let’s do some pretty charts

If you are planning to use this data, raw database records won’t be very useful. So after all this work we need to present it in a sexier way.

Almost all the work here will be done by the gvChart jQuery plugin. It takes the data from HTML tables and makes some cool charts out of it. What we actually have to do is print our results as a table, so gvChart can use it. Our code this time will be (download our demo for more info!):
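A sketch of the table-printing step. $rows stands in for the result of your SELECT query, and the id “chart” and the column names are assumptions; point gvChart at whatever table id you actually use:

```php
<?php
// Print the stored prices as a plain HTML table for gvChart to read.
function price_table(array $rows) {
    $html  = '<table id="chart"><caption>Nike Free Run+</caption>';
    $html .= '<thead><tr><th>Date</th><th>Price</th></tr></thead><tbody>';
    foreach ($rows as $row) {
        $html .= sprintf(
            '<tr><td>%s</td><td>%s</td></tr>',
            htmlspecialchars($row['date']),
            htmlspecialchars($row['price'])
        );
    }
    return $html . '</tbody></table>';
}

// echo price_table(array(array('date' => '2011-05-01', 'price' => '89.99')));
```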

Are you hungry yet?

I think there’s still a lot to improve. You could, for example, build a “waiting list” of URLs so you could crawl many URLs with a single call (of course each URL could have its own regex and “official price” rule, if they come from different sites).
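That waiting list could be as simple as an array where each target carries its own URL and regex, so one cron hit crawls every competitor. All URLs and patterns here are illustrative:

```php
<?php
// A "waiting list" of targets: each entry has its own URL and regex.
function crawl_targets(array $targets) {
    $prices = array();
    foreach ($targets as $t) {
        $html = @file_get_contents($t['url']);  // same idea as fopen/fgets
        if ($html !== false && preg_match($t['regex'], $html, $m)) {
            $prices[$t['url']] = $m[1];         // save/insert per URL here
        }
    }
    return $prices;
}

// $targets = array(
//     array('url'   => 'http://www.zappos.com/nike-free-run',
//           'regex' => '/\$([0-9]+(?:\.[0-9]{2})?)/'),
//     array('url'   => 'http://competitor.example/other-shoe',
//           'regex' => '/Price:\s*\$([0-9.]+)/'),
// );
// print_r(crawl_targets($targets));
```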

And what do you think we could improve?


Comments

Thank you for a great article.
I have created the db as ‘spy’, imported the SQL string, and it has created the table ‘zappos’ in the db.
Any idea why I get this error: Notice: Undefined variable: date in C:\wamp\www\WebSpy\index.php on line 154?

This website shows forex rates of different countries, and I want to crawl all of the stored data, which can be shown by selecting different dates. Please help me write a cURL or fopen crawler.

So, actually this is not done with PHP; it is the crontab job (part 4).
——You’ll change this
#0 2 * * * says that it should run in minute zero, hour two, any day of month, any month and any day of week
0 2 * * * wget http://whereismycronjob/cron.php
——to
#minute zero, any hour, any day of month, any month and any day of week
0 * * * * wget http://whereismycronjob/cron.php
——-

Hi there!
At which point do you get this error? I don’t remember seeing anything similar off the top of my head, but it could be:
– Server config
– A wrong variable reference (maybe with all this copy & paste I’ve done something wrong)
[]’s

Well, Google is much smarter than us; if you try to build a search engine monitor, for example, it could block your access after X downloads of the page. I could try it, and if it works I could write another article about a search engine position monitor :D

Now I’m learning PHP, and for me this information is very important. It’s amazing, but just now I am involved in collecting information from other sites.
“… easy to notice that all prices must have a ‘$’ before them …” – that is not always the case. I would have replaced that piece of code with something else.
In any case, this is a very useful article. Thanks for the link about setting up cron

Some time ago I was working on something similar. My problem was that I couldn’t do it automatically, because I pay for a limited web server and the administrator told me I don’t have access to the physical server; I only have access to phpMyAdmin and the Ferozo panel.

The question is: is there any solution to do something like crontab without entering the physical server?

Hey Ádan, how are you?
What I can think of now is what WordPress uses, called wp_cron. It keeps a kind of task list, and every time a user loads the application it compares the current time with the time the tasks should have run.
It is not as accurate as crontab, but it can be a good alternative for you. You could use it in a WordPress site, or do something similar in your own system.

Actually there’s also another pretty easy way to do this without cron: create a scheduled task on your own computer to access the PHP script automatically once in a while (maybe in the background) and get the necessary information. It won’t work as well unless your computer is on whenever this should be executed, but you can get around any limitations you might have on the server.
Also, if you have access to another server where you CAN edit the crontab (so not necessarily the same server where the web page is stored), you could add an entry there to do the exact same thing – access the web page automatically. :)
In both cases, since the PHP script would be publicly accessible, you could think about protecting it with a hash key sent via GET, to prevent unauthorized access even if someone else knows the URL.
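That suggestion could look something like this sketch, where the script only runs when the right key is passed as e.g. cron.php?key=…; the secret value and the parameter name are of course assumptions:

```php
<?php
// Guard a publicly reachable cron script with a shared secret via GET.
define('CRON_SECRET', 'change-me-to-something-long');

function cron_allowed($key) {
    // a plain string comparison, in keeping with the era of this article
    return $key === CRON_SECRET;
}

/*
if (!cron_allowed(isset($_GET['key']) ? $_GET['key'] : '')) {
    header('HTTP/1.0 403 Forbidden');
    exit('Forbidden');
}
// ...run the crawler...
*/
```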

This is quite interesting. So basically I can keep ahead of the market by monitoring my competitors’ prices and adjusting mine accordingly, similar to what Tesco does with their prices, but online. This could be a very useful tool in the right company, thanks.

I use something really similar to this to update all the prices in my store daily, based on the USD to BRL currency rate, so I can set prices much more accurately than my competitors (sometimes they lose money, sometimes they lose sales because of higher prices).