Techniques for Mastering cURL

cURL is a tool for transferring files and data with URL syntax, supporting many protocols including HTTP, FTP, TELNET and more. Initially, cURL was designed to be a command line tool. Lucky for us, the cURL library is also supported by PHP. In this article, we will look at some of the advanced features of cURL, and how we can use them in our PHP scripts.

Why cURL?

It's true that there are other ways of fetching the contents of a web page. Many times, mostly due to laziness, I have just used simple PHP functions instead of cURL:

However they have virtually no flexibility and lack sufficient error handling. Also, there are certain tasks that you simply can not do, like dealing with cookies, authentication, form posts, file uploads etc.

cURL is a powerful library that supports many different protocols, options, and provides detailed information about the URL requests.

Basic Structure

Before we move on to more complicated examples, let's review the basic structure of a cURL request in PHP. There are four main steps:

Step #2 (i.e. curl_setopt() calls) is going to be a big part of this article, because that is where all the magic happens. There is a long list of cURL options that can be set, which can configure the URL request in detail. It might be difficult to go through the whole list and digest it all at once. So today, we are just going to use some of the more common and useful options in various code examples.

Detect Redirection Based on Browser

In this first example, we will write a script that can detect URL redirections based on different browser settings. For example, some websites redirect cellphone browsers, or even surfers from different countries.

We are going to be using the CURLOPT_HTTPHEADER option to set our outgoing HTTP Headers including the user agent string and the accepted languages. Finally we will check to see if these websites are trying to redirect us to different URLs.

First we have a set of URLs to test, followed by a set of browser settings to test each of these URLs against. Then we loop through these test cases and make a cURL request for each.

Because of the way setup the cURL options, the returned output will only contain the HTTP headers (saved in $output). With a simple regex, we can see if there was a "Location:" header included.

When you run this script, you should get an output like this:

POSTing to a URL

On a GET request, data can be sent to a URL via the "query string". For example, when you do a search on Google, the search term is located in the query string part of the URL:

http://www.google.com/search?q=nettuts

You may not need cURL to simulate this in a web script. You can just be lazy and hit that url with "file_get_contents()" to receive the results.

But some HTML forms are set to the POST method. When these forms are submitted through the browser, the data is sent via the HTTP Request body, rather than the query string. For example, if you do a search on the CodeIgniter forums, you will be POSTing your search query to:

http://codeigniter.com/forums/do_search/

We can write a PHP script to simulate this kind of URL request. First let's create a simple file for accepting and displaying the POST data. Let's call it post_output.php:

When you want to upload a file, all you have to do is pass its file path just like a post variable, and put the @ symbol in front of it. Now when you run this script you should get an output like this:

Multi cURL

One of the more advanced features of cURL is the ability to create a "multi" cURL handle. This allows you to open connections to multiple URLs simultaneously and asynchronously.

On a regular cURL request, the script execution stops and waits for the URL request to finish before it can continue. If you intend to hit multiple URLs, this can take a long time, as you can only request one URL at a time. We can overcome this limitation by using the multi handle.

The idea is that you can open multiple cURL handles and assign them to a single multi handle. Then you can wait for them to finish executing while in a loop.

There are two main loops in this example. The first do-while loop repeatedly calls curl_multi_exec(). This function is non-blocking. It executes as little as possible and returns a status value. As long as the returned value is the constant 'CURLM_CALL_MULTI_PERFORM', it means that there is still more immediate work to do (for example, sending http headers to the URLs.) That's why we keep calling it until the return value is something else.

In the following while loop, we continue as long as the $active variable is 'true'. This was passed as the second argument to the curl_multi_exec() call. It is set to 'true' as long as there are active connections withing the multi handle. Next thing we do is to call curl_multi_select(). This function is 'blocking' until there is any connection activity, such as receiving a response. When that happens, we go into yet another do-while loop to continue executing.

Let's see if we can create a working example ourselves, that has a practical purpose.

Wordpress Link Checker

Imagine a blog with many posts containing links to external websites. Some of these links might end up dead after a while for various reasons. Maybe the page is longer there, or the entire website is gone.

We are going to be building a script that analyzes all the links and finds non-loading websites and 404 pages and returns a report to us.

Note that this is not going to be an actual Wordpress plug-in. It is only a standalone utility script, and it is just for demonstration purposes.

So let's get started. First we need to fetch the links from the database:

First we have some database configuration, followed by an array of domain names we will ignore ($excluded_domains). Also we set a number for maximum simultaneous connections we will be using later ($max_connections). Then we connect to the database, fetch posts that contain links, and collect them into an array ($url_list).

Following code might be a little complex, so I will try to explain it in small steps.

And here is the explanation for the code above. Numbers in the list correspond to the numbers in the code comments.

Created a multi handle.

We will be creating the add_url_to_multi_handle() function later on. Every time it is called, it will add a url to the multi handle. Initially, we add 10 (based on $max_connections) URLs to the multi handle.

We must run curl_multi_exec() for the initial work. As long as it returns CURLM_CALL_MULTI_PERFORM, there is work to do. This is mainly for creating the connections. It does not wait for the full URL response.

This main loop runs as long as there is some activity in the multi handle.

curl_multi_select() waits the script until an activity to happens with any of the URL quests.

Again we must let cURL do some work, mainly for fetching response data.

We check for info. There is an array returned if a URL request was finished.

There is a cURL handle in the returned array. We use that to fetch info on the individual cURL request.

If the link was dead or timed out, there will be no http code.

If the link was a 404 page, the http code will be set to 404.

Otherwise we assume it was a working link. (You may add additional checks for 500 error codes etc...)

We remove the cURL handle from the multi handle since it is no longer needed, and close it.

We can now add another url to the multi handle, and again do the initial work before moving on.

Everything is finished. We can close the multi handle and print a report.

This is the function that adds a new url to the multi handle. The static variable $index is incremented every time this function is called, so we can keep track of where we left off.

I ran the script on my blog (with some broken links added on purpose, for testing), and here is what it looked like:

It took only less than 2 seconds to go through about 40 URLs. The performance gains are significant when dealing with even larger sets of URLs. If you open ten connections at the same time, it can run up to ten times faster. Also you can just utilize the non-blocking nature of the multi curl handle to do URL requests without stalling your web script.

Callback Functions

It is possible to have cURL call given callback functions during the URL request, before it is finished. For example, as the contents of the response is being downloaded, you can start using the data, without waiting for the whole download to complete.

The callback function MUST return the length of the string, which is a requirement for this to work properly.

As the URL response is being fetched, every time a data packet is received, the callback function is called.

Conclusion

We have explored the power and the flexibility of the cURL library today. I hope you enjoyed and learned from the this article. Next time you need to make a URL request in your web application, consider using cURL.

Thank you and have a great day!

Write a Plus Tutorial

Did you know that you can earn up to $600 for writing a PLUS tutorial and/or screencast for us? We're looking for in depth and well-written tutorials on HTML, CSS, PHP, and JavaScript. If you're of the ability, please contact Jeffrey at nettuts@tutsplus.com.

Please note that actual compensation will be dependent upon the quality of the final tutorial and screencast.

Burak Guzel is a full time PHP Web Developer living in Arizona, originally from Istanbul, Turkey. He has a bachelors degree in Computer Science and Engineering from The Ohio State University. He has over 8 years of experience with PHP and MySQL. You can read more of his articles on his website at PHPandStuff.com and follow him on Twitter here.