High Performance Libcurl Tips

At SEOmoz, we’ve been able to make use of many great open source packages. One that’s particularly important to our crawling infrastructure is libcurl, which abstracts the logic behind HTTP (and many other protocols).

Using libcurl, our crawling infrastructure is currently peaking at about 45MB/second (per machine). Libcurl saved us countless hours on our way toward reaching this goal, but it was also a challenging library to understand and use in a high-performance way. To that end, here are some tips from our recent development efforts:

Timeouts are broken on the multi interface. This is supposedly fixed in more recent versions of libcurl, but we still saw some handles getting stuck. We recommend keeping track of all active easy handles and timing them out manually if they don't complete in time.

You must call curl_multi_socket_action() with CURL_SOCKET_TIMEOUT once to 'kickstart' the event loop. We do this after adding the very first set of easy handles to the multi handle, but before checking for activity on any of libcurl's sockets. This was a little counter-intuitive for us: if you skip it, things seem to work most of the time, but occasionally the CURLMOPT_SOCKETFUNCTION callback never gets called for the first set of transfers.

Use CURLMOPT_TIMERFUNCTION, not curl_multi_timeout (the docs recommend this as well).

Use epoll to detect activity on the sockets managed by libcurl, but be aware that the kernel will automatically remove closed file descriptors from your epoll sets. Moreover, it's possible that libcurl has closed a file descriptor and subsequently opened a new one with the same number since the last time it called your CURLMOPT_SOCKETFUNCTION. This means that the epoll_ctl() calls in your socket callback might fail; we work around this by retrying EPOLL_CTL_ADD actions once as EPOLL_CTL_MOD and vice versa (and by ignoring EPOLL_CTL_DEL failures).
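
That retry logic fits in a small wrapper around epoll_ctl() (the epoll_ctl_flex name is ours):

```c
#include <sys/epoll.h>
#include <errno.h>

/* epoll_ctl with the ADD<->MOD fallback described above; DEL failures
   (fd already closed and auto-removed by the kernel) are ignored. */
int epoll_ctl_flex(int epfd, int op, int fd, struct epoll_event *ev)
{
    if (epoll_ctl(epfd, op, fd, ev) == 0)
        return 0;
    if (op == EPOLL_CTL_ADD && errno == EEXIST)
        return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, ev);
    if (op == EPOLL_CTL_MOD && errno == ENOENT)
        return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, ev);
    if (op == EPOLL_CTL_DEL)
        return 0;   /* fd was already gone; nothing to do */
    return -1;
}
```

Calling this from the CURLMOPT_SOCKETFUNCTION callback makes the ADD/MOD/DEL bookkeeping insensitive to libcurl recycling descriptor numbers behind your back.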

Use a separate thread for all your calls to curl_multi_socket_action(), assuming the processing you do on results is CPU-bound (otherwise this might be needless complexity). Even occasional spikes of high CPU usage in your result processing can needlessly time out requests by delaying your calls to curl_multi_socket_action().

Compile libcurl with c-ares, an asynchronous DNS resolution library (the resolver built into libcurl is blocking). We get mysterious SIGPIPE signals from c-ares (even with CURLOPT_NOSIGNAL turned on), and ended up ignoring them with a call to signal().
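
Assuming you are comfortable ignoring SIGPIPE process-wide, the workaround is a one-liner:

```c
#include <signal.h>

/* Ignore SIGPIPE process-wide, so a peer closing a socket mid-write
   results in an EPIPE error from write()/send() instead of killing
   the process. */
void ignore_sigpipe(void)
{
    signal(SIGPIPE, SIG_IGN);
}
```

Call this once at startup, before creating any handles.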

Unless you have access to a high performance DNS server, consider running your own recursive DNS resolver and cache (PowerDNS, for example).

If you do run your own DNS cache locally, consider disabling the cache built into curl’s easy handle abstraction (see CURLOPT_DNS_CACHE_TIMEOUT). We saw a large performance boost when doing so (this was with parallelism levels of ~1500, perhaps the cache has a global lock somewhere?).

If you’re not carefully lining up your easy handles to be reused for connections to the same server, consider disabling the connection cache built into the easy handle abstraction (see CURLOPT_FRESH_CONNECT and CURLOPT_FORBID_REUSE).

Re-use easy handles instead of destroying and re-creating them — we found curl_easy_reset() handy for this.

If you allow automatic following of redirects, be aware that libcurl supports many interesting protocols (and that most of them are enabled by default). We explicitly restrict our requests to HTTP using CURLOPT_PROTOCOLS and CURLOPT_REDIR_PROTOCOLS.

Keep in mind that this advice comes only from our particular environment (Ubuntu 9.10, libcurl 7.21.3, c-ares 1.7.4), and that it may become less relevant over time. As always, you should experiment and stress test to find what works best for your particular use case.

Great tips, chas. They made a substantial improvement in our crawling, and it's exciting that it is so much faster.

spm

chas, thanks for sharing these ideas. Have these optimizations been upstreamed to the libcurl open source project? I can see that some of them are about using the existing configuration options better, so they are outside the libcurl code, but the first few concern the libcurl source itself. Would it be possible for you to share the patches?

Leonidas Tsampros

Very nice overview! Thanks!

Max DeLiso

This is gold, really helped me to get a grasp on how to use the curl multi interface in a performant way, though it is beginning to show its age. Thank you.
