For one, curl doesn’t parse and render JavaScript, and that’s what much of the modern web is made of. Perhaps even worse, many companies now deploy technologies that detect and block curl outright, because it’s so often used for scraping.

Either way, if you use curl to pull a lot of sites en masse, you’re likely to have a massive failure rate in getting the HTML you’re looking for.

What we’ve needed for quite some time is something like curl, i.e., command-line and relatively simple, but that renders sites fully.

I’ve been using Chromium (the open-source browser that Chrome is built on) to solve this problem for years, and I wanted to pass along the syntax for others.

I am usually doing things from Ubuntu, but you can get this to work on most UNIXy systems.

The timeout of 25 seconds gives each site a reasonable window to respond and render, while killing any run that hangs so it doesn’t stall the rest of the batch.
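For reference, that timeout behavior comes from timeout(1) in GNU coreutils: the wrapped command is killed once the limit passes, and timeout itself exits with status 124. A quick illustration:

```shell
# Completes within the limit, so the command's own exit status is kept:
timeout 2 sleep 1 && echo "finished within the limit"

# Exceeds the limit, so the command is killed and timeout exits 124:
status=0
timeout 1 sleep 5 || status=$?
echo "exit status after being killed: $status"   # -> 124
```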

--headless means don’t display a GUI, and --no-sandbox disables Chromium’s security sandbox, which Chromium insists on if you’re running as root. That’s a genuine security trade-off, so be careful with it.

--dump-dom prints the serialized DOM to stdout, i.e., the page’s HTML after the render, JavaScript included.

The {} bits are xargs placeholders: each one is replaced with the current input line, i.e., the domain xargs is processing on that cycle.

The 2> /dev/null discards stderr, because Chromium can be noisy.

The > {}.html redirection names each output file after the domain coming from domains.txt, so example.com’s render lands in example.com.html.
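One detail not mentioned above but worth knowing: GNU xargs can run multiple invocations concurrently with -P, which is what lets a large domains.txt finish quickly. A sketch, with echo standing in for the chromium call and a hypothetical concurrency of 8:

```shell
# GNU xargs -P runs up to N invocations at once; -I still substitutes the
# current line into each command. echo stands in for the real browser call.
printf 'one\ntwo\nthree\nfour\n' > domains.txt
cat domains.txt | xargs -P 8 -I {} sh -c 'echo "<html>{}</html>" > "{}.html"'
ls one.html two.html three.html four.html
```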

What you basically end up with—assuming you have a decent machine to run this on—is hundreds of nicely rendered HTML files being created very quickly. Chromium is Chrome, so you’re getting the full rendering of the JavaScript and all the goodness that comes with that.

Anyway, I hope this helps someone who’s smashing their face on the desk because of curl.