Website data scripting (cURL)

Not all websites have APIs that we can use (or the API has limits that we want to get around), so sometimes we have to go around the nicely built API, grab the data from the website itself, and go through it ourselves.

cURL the site. This gives us the HTML of the website in a variable. We can search through that HTML for anything specific, such as images, headlines, or whatever else the page has that we want. This is basically the fastest way to download the internet. I’m pretty sure Google’s search spiders use a similar method: fetch a page, collect all the links on it, and then fetch all of those links, creating a web of data that connects to each other.
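As a rough sketch of the "HTML in a variable" step, here is one way it could look in Python. This uses the standard library's urllib rather than cURL itself, and the function name fetch_html, the User-Agent string, and the timeout value are all made up for illustration.

```python
import urllib.request


def fetch_html(url, timeout=10.0):
    """Download a page and return its source code as a string."""
    # The User-Agent value is an arbitrary placeholder, not a required name.
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-scraper/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling fetch_html("https://example.com") should hand back the page source as one big string, ready to be searched.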

Basically, what it gives us is the website, but in a variable we can manipulate. Think of the source code of a web page. It has to be transmitted to us so our web browser can take it apart and rebuild it in a graphical form that we understand. We request a page, the server sends us the source code, and now we can use that data to do what we want.

Want the images out of it? Well, all image tags start the same way and end in the same manner. Search the string for the position of '<img', and if it finds one, search for the first closing '>' after that start position. This gives us the entire image tag, the same code we would put into our own website. Almost what we want. Create a new string from the start position to the end position of the found image tag. Search that new string for the src value of the image: find the position of 'src="' and read up to the next '"' after it. That gives us the URL of the image.

That only gets us the first image, though. We have to do this in a loop that searches through the rest of the source code string, each time starting at the end position of the last image (or whatever we are looking for) that we found. We do this because we don’t want duplicate images, and we don’t want to run this code for every single character in the source code.
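The search loop described above could be sketched like this in Python. The function name extract_image_urls is hypothetical, and this is deliberately the plain string-search approach rather than a real HTML parser, so it assumes double-quoted src attributes and would also match attributes such as data-src.

```python
def extract_image_urls(html):
    """Collect the src value of every <img ...> tag using plain string search."""
    urls = []
    pos = 0  # where the next search starts
    while True:
        start = html.find("<img", pos)
        if start == -1:
            break  # no more image tags
        end = html.find(">", start)
        if end == -1:
            break  # unterminated tag, stop searching
        tag = html[start:end + 1]
        # Continue after this tag so the same image is never found twice.
        pos = end + 1
        marker = 'src="'
        src_start = tag.find(marker)
        if src_start == -1:
            continue  # image tag without a double-quoted src
        src_start += len(marker)
        src_end = tag.find('"', src_start)
        if src_end == -1:
            continue  # src value never closed
        urls.append(tag[src_start:src_end])
    return urls
```

Each pass jumps straight to the next '<img' instead of walking the string one character at a time, which is the point of restarting from the end of the last match.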

I made a simple program that cURLs Reddit instead of using their API and downloads all the images of a given subreddit. It searches each post on the first 5 pages for images, and if a link sends me to Imgur, I scrape that site too and take all the images from there. I then upload the images to my database, in a folder that matches the subreddit name, named with the Reddit ID they are given plus a counter if there are additional images.
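The naming scheme might look something like the sketch below. This is a guess at the idea, not the actual program: the function name build_image_paths, the underscore before the counter, and the ".jpg" fallback for URLs without an extension are all assumptions.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def build_image_paths(subreddit, post_id, image_urls):
    """Return one target path per image: <subreddit>/<post_id>[_<counter>]<ext>."""
    paths = []
    for index, url in enumerate(image_urls):
        # Take the extension from the URL path; ".jpg" is an assumed fallback.
        ext = PurePosixPath(urlparse(url).path).suffix or ".jpg"
        # The first image keeps the bare post ID, later ones get a counter.
        name = post_id if index == 0 else f"{post_id}_{index}"
        paths.append(str(PurePosixPath(subreddit) / f"{name}{ext}"))
    return paths
```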

I haven’t done this yet, but I am fairly sure that I can also send data using cURL. This would be helpful if you wanted to create a program that could go to a specific web page and create as many accounts as you want. Basically, you would have to figure out how the sign-up process works: what the field names are and what restrictions apply. You would then cURL the site, retrieve the source code, cURL it again with the data you want to send (filling in the form and submitting it), and then cURL the returned page for any further steps you have to take. If, for example, you need to log into your email and verify the account, you’ll probably have to make a program that cURLs an email provider (to create an email account for each of the other accounts you want to create), logs itself in (and out), and then cURLs your inbox, looking for the specific validation email. cURL into the email, find the verification link, and cURL that link to verify the account. That sounds like a fun experiment for a future bored night.
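Sending form data generally means making a POST request with the field names and values URL-encoded in the request body. Here is a minimal sketch in Python using urllib instead of cURL. The function name build_form_request, the URL, and the field names are placeholders, since the real names depend on the form you are looking at.

```python
import urllib.parse
import urllib.request


def build_form_request(url, fields):
    """Build a POST request that submits the given form fields."""
    # Encode the fields the way a browser encodes a simple HTML form.
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,  # supplying a body makes urllib send a POST
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical field names; the real ones come from the form's source code.
request = build_form_request(
    "https://example.com/signup",
    {"username": "alice", "email": "alice@example.com"},
)
# Actually sending it would be: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to check exactly what would be submitted before hitting a real site.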