How to use curl_multi() without blocking

curl_multi is a great way to process multiple HTTP requests in parallel in PHP. curl_multi is particularly handy when working with large data sets (like fetching thousands of RSS feeds at one time). Unfortunately there is very little documentation on the best way to implement curl_multi. As a result, most of the examples around the web are either inefficient or fail entirely when asked to handle more than a few hundred requests.

The problem is that most implementations of curl_multi wait for each set of requests to complete before processing them. If there are too many requests to process at once, they usually get broken into groups that are then processed one at a time. The problem with this is that each group has to wait for the slowest request to download. In a group of 100 requests, all it takes is one slow one to delay the processing of 99 others. The larger the number of requests you are dealing with, the more noticeable this latency becomes.

The solution is to process each request as soon as it completes. This eliminates the wasted CPU cycles from busy waiting. I also created a queue of cURL requests to allow for maximum throughput. Each time a request is completed, I add a new one from the queue. By dynamically adding and removing links, we keep a constant number of links downloading at all times. This gives us a way to throttle the amount of simultaneous requests we are sending. The result is a faster and more efficient way of processing large quantities of cURL requests in parallel.

function rolling_curl($urls, $callback, $custom_options = null) {

    // make sure the rolling window isn't greater than the # of urls
    $rolling_window = 5;
    $rolling_window = (sizeof($urls) < $rolling_window) ? sizeof($urls) : $rolling_window;

    $master = curl_multi_init();

    // default options; merge in any custom curl options that were passed in
    $std_options = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS => 5
    );
    $options = ($custom_options) ? ($std_options + $custom_options) : $std_options;

    // start the first batch of requests
    for ($i = 0; $i < $rolling_window; $i++) {
        $ch = curl_init();
        $options[CURLOPT_URL] = $urls[$i];
        curl_setopt_array($ch, $options);
        curl_multi_add_handle($master, $ch);
    }

    do {
        while (($execrun = curl_multi_exec($master, $running)) == CURLM_CALL_MULTI_PERFORM);
        if ($execrun != CURLM_OK)
            break;
        // a request was just completed -- find out which one
        while ($done = curl_multi_info_read($master)) {
            $info = curl_getinfo($done['handle']);
            if ($info['http_code'] == 200) {
                $output = curl_multi_getcontent($done['handle']);

                // request successful. process output using the callback function.
                $callback($output, $info);

                // start a new request (it's important to do this before removing the old one)
                $ch = curl_init();
                $options[CURLOPT_URL] = $urls[$i++]; // increment i
                curl_setopt_array($ch, $options);
                curl_multi_add_handle($master, $ch);

                // remove the curl handle that just completed
                curl_multi_remove_handle($master, $done['handle']);
            } else {
                // request failed. add error handling.
            }
        }
    } while ($running);

    curl_multi_close($master);
    return true;
}

Note: I set my max number of parallel requests ($rolling_window) to 5 (it was originally 100). Be sure to adjust this value according to the bandwidth available on your server and on the servers you are curling. Be nice and read this first.

Updated 4/2/09: Made some changes to increase reusability. rolling_curl now expects a $callback parameter for a function that will process each response. It also accepts an array of custom cURL options that lets you add things like authentication, custom headers, etc.
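
For anyone new to the function, here is a minimal usage sketch (the callback name, urls, and option values are only illustrative, not part of the original post):

function handle_response($output, $info) {
    echo strlen($output) . " bytes from " . $info['url'] . "\n";
}

$urls = array('http://example.com/feed1.xml', 'http://example.com/feed2.xml');
rolling_curl($urls, 'handle_response', array(CURLOPT_TIMEOUT => 10));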

Updated 4/8/09: Fixed a new bug that was introduced with the last update. Thanks to Damian Clement for alerting me to the problem.

I need to check external links and get all status codes, error messages, request times, and the Location header if there is a 301. I currently have no idea how to get the Location header… Also, with your example I have no idea what the original link was if a 301 was returned on the first request. With CURLOPT_MAXREDIRS => 5 you follow 5 redirects, but lose all information about the original requested link. This way you can use your example to download 1000 files to disk, but if you need to handle the status codes specific to the results it is very difficult.

Will it change anything about blocking if the variables are named $ch1, $ch2? Sorry, but I don't understand how it currently works and I have spent quite some time trying to debug it… I just want to be sure that nothing goes wrong in my code.

Always keep in mind that some firewalls will block you if you open more than 6 requests to one hostname. This is not allowed per the RFC definition, and you can, no, you WILL overload the remote server. You will bring the server down if you open 100 simultaneous requests per second! Do not overload other servers… this is like a DDOS attack, and IDS (intrusion detection systems) will block you completely.

I've tried asking the people I'm scraping but haven't had any replies from any enquiries I've made to the webmaster email addresses so I guess I'll try upping my limit by 1 at a time and see how I go on??

Please excuse the beginner's question, but what variable type is $output returned as? I can use strlen to determine the length of $output, but substr doesn't seem to work.

In my original linear code I used the expression $output = file_get_contents($urls) and was able to parse $output for various HTML fragments; however, this doesn't work on the $output returned by your rolling_curl function. Does it need modifying to suit my needs?

A nit that embarrassed me a touch (though I blame it on being up all night and then some): at the point where you "// start a new request (it's important to do this before removing the old one)" … it's best to check to make sure $i < count($urls).

Kaolin, I'm almost done with a new version that will bring a lot of improvements to this code. Look for it some time later this week. I've fixed a few small problems like the one you mentioned and made it object oriented for increased reusability.

What if I need an additional variable (say a URL id from the database) going through the callback function? I tried passing $info as well, but it loses the original URL when redirected, so I cannot use the callback function to update the URL status in my database.

I too was looking for information on the same thing and came across this simple solution, which may help all of us. After the following code is executed,
$info = curl_getinfo($done['handle']);
we can simply retrieve the URL by adding this code:
$url = $info['url'];
Once you have the URL, you can surely work out the ID used. And with the code of the above function, you can send the response back to the callback by adding this as an additional parameter.
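
Put together, that might look something like this inside the callback (lookup_url_id() is a hypothetical helper standing in for however you map urls back to ids):

function request_callback($output, $info) {
    $url = $info['url'];
    // hypothetical: map the url back to the database id that was queued for it
    $id = lookup_url_id($url);
    // ... update the status for $id in your database ...
}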

I'm sorry to inform you, I have tried your code with 900 urls on a dedicated server with a 1000Mbit connection and a window size of only 10, and it does not crawl all 900 urls… only between 30 and 140 urls, randomly.
Any ideas why?

Are you hitting 1 site or 900 different sites? If you're hitting 1 site you might want to make sure you're not getting blocked.

Otherwise, my guess is there is some setting or limitation on your server that is limiting the number of connections. How much memory do you have? I've run into the problem of dropped urls before, but only with a window size of several hundred. One of the problems with multi_curl is that it tends to fail silently. Please let me know if you figure out what is going on. I'd love to find a solution besides "use a smaller window size".

None of the 900 urls is on the same server, and I have 8GB of RAM with 16 cores, so I doubt I have any server limitation.
If I use a different function that gathers all the data in blocks of 500, for example, and then parses everything, it seems to work. The part of your function that parses the data as soon as it comes in might be making things a bit shaky.
Try getting 900 urls with a callback that counts how many urls were called back (declare the count variable global in the function to enable shared access from all the callbacks) and you'll see very few get called back, even with a window size of 50. Whereas if I use another function that doesn't process results as soon as they arrive, all pages get fetched and I process them later, but it's slower. It would be awesome if you could find the reason in your code, as I'm sure it would be faster!

Why don't you echo something simpler in the callback function (say "done\n") and count the number of times it is displayed? Do the same with errors (i.e. replace "// request failed. add error handling." with an echo of "error\n"). I had similar issues and it was because of faulty success and error callback functions.
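
A minimal sketch of that kind of debugging setup (the names are only illustrative):

// count successful callbacks (global so every invocation shares the same counter)
$done_count = 0;

function counting_callback($output, $info) {
    global $done_count;
    $done_count++;
    echo "done\n";
}

// and inside rolling_curl(), replace the "// request failed. add error handling." comment with:
// echo "error\n";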


I also thought that if (count($urls) < $i+1) should be used, but I tried it and it sent me into an infinite loop. I can't understand why it does that, but apparently it doesn't work when if (count($urls) < $i+1) is added.

You don't want to do that because it would start every request running at the same time, which would create issues if you had a lot of URLs to fetch. I intentionally used a rolling window to limit the number of simultaneous requests. It sounds like you might just want to increase the size of the rolling window. The default is 5, but you can safely bump that up to 100 or so as long as you are hitting distributed resources.

Thanks a lot… this works great. But I am surprised… I see that no matter what, $urls[$i++] is added to the queue. How come there is no error when $urls[$i] is the last one in the queue, i.e. there is no $urls[$i++]?

$i++ is a post-increment. This means that $urls[$i] is added and THEN $i is incremented. We would probably have problems if we used ++$i. Also note that $i is just a counter, it doesn't control when the while loop stops, the variable $done handles that. Make sense?
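
A quick illustration of the post-increment behaviour (values are just for show):

$i = 0;
$urls = array('http://a.example/', 'http://b.example/');
echo $urls[$i++]; // uses $urls[0] first, then increments $i to 1
echo $urls[$i++]; // uses $urls[1], then increments $i to 2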

@Prashanth
This question is not naive at all… I had the same concerns, and they were confirmed by the following notice produced by PHP:
Notice: Undefined offset: 1692 in /home/me/bin/foobar.php on line 80

Such notices are thrown a bunch of times; however, you can fix this problem easily by adding the following if statement:
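
The statement itself did not survive here, but judging from the fixes suggested elsewhere in this thread it is presumably a bounds check along these lines:

// only queue a new request if there are unfetched urls left
if ($i < sizeof($urls)) {
    $ch = curl_init();
    $options[CURLOPT_URL] = $urls[$i++]; // increment i
    curl_setopt_array($ch, $options);
    curl_multi_add_handle($master, $ch);
}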

I too have had problems with too big a window / too many urls. If I have a window of 100+ with 2000 urls, it'll only call back a random number of successfully fetched urls, like 100-300. It's very irritating and I can't find any reason why. And it's not to do with memory; the box has 8 cores and 32GB of RAM, and the script process takes very little resources really.

Would love to find out the cause, since I have to check roughly 2 million urls every day and it gets slow with a window of only 50; in fact, regular curl_multi with 500 threads is faster right now. Let me know if you have any thoughts. I could even pay you if you find out the cause.

I don't think this code is done the right way; see below:
// start a new request (it's important to do this before removing the old one)
$ch = curl_init();
$options[CURLOPT_URL] = $urls[$i++]; // increment i
curl_setopt_array($ch,$options);
curl_multi_add_handle($master, $ch);

Felix, I too crawl more than a million pages a day and have the same issues on a huge box…
Have you found the solution?
Best regards,
Ronny

FYI, my post above:
"I'm sorry to inform you, I have tried your code with 900 urls on a dedicated server with a 1000Mbit connection and a window size of only 10, and it does not crawl all 900 urls… only between 30 and 140 urls, randomly."

If $rolling_window is bigger than the number of available $urls, the script opens new requests with an empty url, which slows it down at the end.

I noticed that if I use different $rolling_window sizes for different ping times, the download time gets smaller (useful if you download a lot of stuff from the same server). It would be helpful to write a script that finds the best combination of $rolling_window sizes for different ping times and for your own machine's power.

Is there a reason that you don't start a new request if the previous request failed? It seems to me that you should start a new request every time a previous request is done…

I also wonder if it is possible to send the original url to the callback function. That way it would be easier to identify which content belongs to which url. Since requests can be redirected, the url in $info can be different from the original url.

I hope you find a solution. If I don't know which content belongs to a certain URL, it is hard to use the code. But maybe I'm missing something? I haven't found any example code showing how to use your library.


To get the URL, set a second parameter in the callback function. The second parameter receives the curl info for the request (the curl_getinfo() array). So, for example, if your callback function is:
<pre>
function request_callback($result, $info) {
    echo md5($result) . "\n";
    echo $info["url"];
}
</pre>

$info["url"] will return the URL of the request.

Thanks a lot for taking the time to put this together, Josh. It has really helped me on a few projects.

Thanks for the code Josh, I've only just started with it, but the PHP doc pages were basically useless, so I hope I can accomplish what I want with your class.
What is the correct way to add curl options? I just want the header returned, so I'm trying this:
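
The snippet was cut off; presumably it is something along these lines, using the $custom_options parameter described above (the exact options are an assumption):

// CURLOPT_HEADER includes the response headers in the output; CURLOPT_NOBODY skips the body
$custom_options = array(
    CURLOPT_HEADER => true,
    CURLOPT_NOBODY => true
);
rolling_curl($urls, 'request_callback', $custom_options);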

There is an off-by-one bug in the rolling window logic when the window_size is less than the total number of urls (e.g. 6 urls and a window size of 3).
In this example, the 4th url gets skipped. If I change the window size to 4, the 5th url gets skipped. I've tracked it down to the logic in this if statement:

How to access Rolling Curl's Callback Function within a Parent Class?
I am calling RollingCurl from within another class.
How can I get RollingCurl to target the callback function within the class that called it? I am a little unfamiliar with callback functions and not quite sure how to implement this within a class. Many thanks in advance.

Interesting question. The first thing that comes to mind is to use create_function (http://ca.php.net/create_function) to create an anonymous function for the callback. It's got a bit of an ugly syntax, but it works great. Using the example I have on Google Code, it would look something like this:
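
The example itself was not preserved; a rough sketch of the idea (the class and method names are hypothetical, and it assumes the Google Code class accepts the callback in its constructor):

// build an anonymous callback that forwards each response to a static method on your class
$callback = create_function(
    '$response, $info',
    'MyScraper::process_response($response, $info);'
);

$rc = new RollingCurl($callback);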

As an alternative, I tried the below but could not get the $vArgs array to show up in the attributeHTMLScraper function.
$vArgs = array($response,$info,'/html/body//a');
call_user_func_array($this->parentClass->attributeHTMLScraper,$vArgs );

If anyone knows a more elegant way to do this, I would be much appreciative. Still deep in that learning phase.

I get the same problem with the disappearing URLs as previously mentioned. I have kept the rolling window at 5 and experimented with retrieving XML feeds from Amazon. Once I get up to 50 URLs I am not getting the expected number of results. I have tried adding error handling but there are no errors; the URLs just disappear.

About the only thing I can think of at the moment is flagging each URL in the array and recursively processing until they are either flagged as completed or error. I will keep you posted.

Rolling Curl simply just rocks! Thanks for all your time & effort on this. I'm amazed at what this can accomplish on so few CPU cycles.

Who cares whether it's technically forking, threading, or otherwise….it works as advertised.

One small issue on the blog presentation. While I know it should be obvious that the current code lives on Google Code, I think it would be advantageous (from a visual quick-scan standpoint) to replace the old code in the black area (at the top of the blog post) with the current meat-and-potatoes, end-result functionality from the example on Google:

I need to access Rolling Curl results based on specific URL sequencing.

For smaller requests, I'm using the code below and it's working fine.

For larger requests, I could write the results to disk with sequential file naming for later sequential compilation, rather than storing all the results in memory.

I'm thinking the best way to handle it dynamically would be to create a function that writes sequentially named files to disk once a defined size/memory threshold is reached, and otherwise handle it via an in-memory associative array as below.

I would like to add two attributes (public $response; public $output;) for when there is no callback function, to make external processing easier. Thank you.
Sorry, my English is very bad; I used Google Translate.

There is just one minor error I noticed. Before creating a new handle you should check whether there are any urls left in the array to add. If you don't check that, you will create up to 5 empty requests (that will return error code 0) after you have processed all the URLs in the $urls array. So it should look like this:

// start a new request (it's important to do this before removing the old one)
if (sizeof($urls) > $i + 1) { …. start new request …. }

If you process the errors in the else {} part, you will see that up to 5 requests return errors with code value 0 ($info['http_code'] == 0). You must check whether you have already sent all the urls (in $urls) as requests before making another one (if (sizeof($urls) > $i + 1) { create new request }).

This is the callback that is called. Is the callback synchronous, i.e. when one of the handles is done it calls this function? The output is a zip, so I write it to a zip file locally. If successful, I unzip it.
This writing to file and unzipping takes time, and only after it completes does the download of the next items resume.

In my case I am downloading 3 files in a batch. If the 1st file is downloaded successfully it calls the callback, and after the writing to zip and unzipping is finished I have got this error

Firstly, apologies if this is a bit of a noob question but can anyone give any tips on how to get proper responses from the following site using RollingCurl? Is it to do with cookies? I am not a programmer by any stretch of the imagination so could do with some pointers.

The URL is "http://logis.korail.go.kr/getcarinfo.do?car_no=&quot; with a number appended to it, ranging from 8201 through 8286. The first time you enter the URL into your browser you get a login screen back. If you refresh, or enter a URL with a different number appended, you get one of two different responses. The first response has two input boxes, one of which is populated with the number you appended to the URL, the second being empty. The second response is the same as the first with the addition of two tables, with various data fields, underneath the two input boxes.

When you use the code below, the html received back is the login page in every instance. How do I implement RollingCurl so I get one of the other two responses back?

I'm not a PHP expert and I'm looking for a solution to download multiple images at one time.
This class looks like it will work for me, but I don't know how to make it save the files to a defined directory:

// start a new request (it's important to do this before removing the old one)
$ch = curl_init();
$options[CURLOPT_URL] = $urls[$i++]; // increment i
curl_setopt_array($ch,$options);
curl_multi_add_handle($master, $ch);
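
One way a save-to-disk callback might look with the rolling_curl() function from the post, assuming the callback receives the body and the curl info array (the directory and naming scheme are hypothetical):

function save_image_callback($output, $info) {
    // name the local file after the last path segment of the requested url
    $filename = '/path/to/images/' . basename(parse_url($info['url'], PHP_URL_PATH));
    file_put_contents($filename, $output);
}

rolling_curl($image_urls, 'save_image_callback');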

I'm using your nice piece of code for a lot of data (3.5 million requests).
My rolling window is 10.
But after a while my working machine is out of memory.

I'm trying to find the memory leak and I noticed you don't close the single curl handles.
I think, after
curl_multi_remove_handle($master, $done['handle']);
you have to call
curl_close($done['handle']);
to totally close the handle, because
curl_multi_close($master);
closes the master but not the single handles.

And can you explain why you have to "// start a new request" if you just started all of the requests in the "for" loop?

I haven't tried reusing the connections, or at least I don't remember experimenting with that. Would be interesting to see if you can make that work. If you do, please post your results here so others can gain from them.

Ok. I got it implemented using a hash tag to pass the monitor id and then getting this value from the $request ($info['url'] doesn't retain the hash tag on the URL for whatever reason). This way, by using a hash tag, I figure there is no possibility that it'll ever change the URL that is checked. Still it'd be cool, if RollingCurl had a way to pass a value without affecting the URL. But this is working for now. Thanks for sharing RC!
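
Roughly, the idea looks like this, assuming the callback also receives the original request object (called $request here, with a url property, following the commenter's description; the exact class API may differ):

// queue the url with the database id tucked into the fragment (fragments are never sent to the server)
$rc->request('http://example.com/check-me#monitor_id=42');

function request_callback($response, $info, $request) {
    // $info['url'] drops the fragment, so read it back from the original request url
    parse_str(parse_url($request->url, PHP_URL_FRAGMENT), $vars);
    $monitor_id = $vars['monitor_id']; // "42"
}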

Hi, nice job.
Just some changes from me:
// for your problem that the number of $urls must be more than the rolling window
$rolling_window = min(array(5, count($urls)));

// add this line just after the "for" loop (start the first batch of requests),
// because if you have 5 windows you leave that for loop with $i == 5 (from the last $i++);
// then, when you get the next url in the do/while, you do another $i++ which does not take the 5th url!
$i--;

That's all from me! Thank you again, this saved me some hours!

Not sure if I did it correctly, but my problem with the code is with the callback function:

for example:
call_user_func($callback,$urls[$z],$output);

When I call the callback function, the $output does not match the url; I want to display each link together with its output. What I am getting is that the $output either comes before or after the next url…

I tried to fix it with sleep and curl_multi_select (which is supposed to wait for activity on the connection), but I can't fix the problem…

Hey, I am using curl_multi_exec for processing thousands of URLs. Currently it is breaking down at around 15 to 20k. Please help me with that.
