An HTTP 504 error means that a request to Splash took longer than
timeout seconds to complete (30s by default); Splash aborts script
execution once the timeout expires. To override the timeout value,
pass the timeout argument to the Splash endpoint you’re using.
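
As a minimal sketch of overriding the timeout, the snippet below builds a render.html URL with a timeout argument. The Splash address (localhost:8050), the target page and the value 45 are assumptions for illustration, and render_url is a hypothetical helper, not part of Splash.

```python
from urllib.parse import urlencode


def render_url(splash_base, page_url, timeout):
    """Build a render.html URL that asks Splash for a custom timeout (seconds)."""
    # urlencode percent-escapes the target URL so it survives as a query value.
    query = urlencode({"url": page_url, "timeout": timeout})
    return f"{splash_base}/render.html?{query}"


# Assumed values: a local Splash instance and a 45-second timeout.
url = render_url("http://localhost:8050", "http://example.com", 45)
```

The resulting URL can be opened with any HTTP client; the value must stay below the server's maximum timeout.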

Note that the maximum allowed timeout value is limited by the maximum
timeout setting, which is 60 seconds by default. In other words,
by default you can’t pass ?timeout=300 to run a long script; an
error will be returned.

The maximum allowed timeout can be increased by passing the --max-timeout
option to the Splash server on startup.
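
The startup option can be sketched as an argument list, assuming Splash is run through Docker from the scrapinghub/splash image. The port mapping and the value 3600 are illustrative assumptions, and splash_command is a hypothetical helper.

```python
import shlex


def splash_command(max_timeout, image="scrapinghub/splash"):
    """Return a docker argv that starts Splash with a raised maximum timeout."""
    return [
        "docker", "run", "-it", "-p", "8050:8050",  # assumed port mapping
        image,
        "--max-timeout", str(max_timeout),  # Splash option named in the text
    ]


cmd = splash_command(3600)
# shlex.join renders the list as a single shell command line for copy-pasting.
command_line = shlex.join(cmd)
```

With the maximum raised to 3600 seconds, a request carrying ?timeout=300 would be accepted.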

A website can be really slow, or it can try to get some remote
resources which are really slow.

There is no way around increasing timeouts and reducing the request rate
if the website itself is slow. However, the problem often lies in unreliable
remote resources like third-party trackers or advertisements. By default
Splash waits for all remote resources to load, but in most cases it is
better not to wait for them forever.
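
One way to avoid waiting forever is a per-resource timeout. The sketch below assumes your Splash version supports the resource_timeout argument on render endpoints (check the HTTP API documentation for your release). The address and values are illustrative, and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def render_url_with_limits(splash_base, page_url, timeout, resource_timeout):
    """Build a render.html URL with an overall and a per-resource timeout."""
    query = urlencode({
        "url": page_url,
        "timeout": timeout,                    # whole-request limit, seconds
        "resource_timeout": resource_timeout,  # assumed argument: per-resource limit
    })
    return f"{splash_base}/render.html?{query}"


# Assumed values: give up on any single resource after 10 seconds.
url = render_url_with_limits("http://localhost:8050", "http://example.com", 60, 10)
```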

When a script fetches many pages or uses long delays, timeouts
are inevitable. Sometimes you have to run such scripts; in this case increase
the --max-timeout Splash option and use larger timeout
values.

But before increasing the timeouts consider splitting your script
into smaller steps and sending them to Splash individually.
For example, if you need to fetch 100 websites, don’t write a Splash Lua
script which takes a list of 100 URLs and fetches them - write a Splash Lua
script that takes 1 URL and fetches it, and send 100 requests to Splash.
This approach has a number of benefits: it makes scripts simpler and
more robust, and it enables parallel processing.
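
The one-URL-per-request approach could be driven from the client roughly as follows. This is a sketch under assumptions: Splash listens on localhost:8050, render.html stands in for whatever endpoint or script you use, and fetch_rendered and fetch_all are hypothetical helpers.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH = "http://localhost:8050"  # assumed Splash address


def fetch_rendered(page_url):
    """Send one request to Splash for one page and return the response body."""
    query = urlencode({"url": page_url})
    with urlopen(f"{SPLASH}/render.html?{query}") as response:
        return response.read()


def fetch_all(page_urls, fetch=fetch_rendered, workers=4):
    """Fetch many pages as independent requests, several at a time.

    Results come back in the same order as page_urls.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, page_urls))
```

Because each page is its own request, one slow page cannot cause the other 99 to time out. Note that pool.map re-raises the first exception it encounters, so a real client would add per-URL error handling and retries.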

Splash renders requests in parallel, but it doesn’t render them all
at the same time: concurrency is limited to a value set at startup
using the --slots option. When all slots are in use, a request is put into
a queue. The catch is that the timeout starts to tick once Splash receives
a request, not when Splash starts to render it. If a request stays in the
internal queue for a long time it can time out even if the website is fast
and Splash is capable of rendering it.
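
The effect of queueing on timeouts can be illustrated with a deliberately simplified model. It assumes that all requests arrive at the same moment, that every render takes the same time, and that a queued request starts as soon as a slot frees up. The function is hypothetical and does not reflect Splash internals.

```python
def timed_out_requests(n_requests, slots, render_time, timeout=30.0):
    """Return indices of requests that exceed the timeout in the simple model.

    The clock for every request starts at arrival, so time spent waiting
    in the queue counts against the timeout.
    """
    late = []
    for i in range(n_requests):
        # Requests are served in batches of `slots`; earlier batches must finish first.
        queue_wait = (i // slots) * render_time
        if queue_wait + render_time > timeout:
            late.append(i)
    return late


# 10 requests, 2 slots, 8 seconds per render, 30-second timeout.
late = timed_out_requests(10, 2, 8)  # -> [6, 7, 8, 9]
```

In this model the last four requests time out even though each page renders in 8 seconds, well under the limit.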

To increase rendering speed and fix the queueing issue, it is recommended
to start several Splash instances and use a load balancer capable of
maintaining its own request queue. HAProxy has all the necessary features;
check an example config here.
A shared request queue in a load balancer also helps with reliability:
you won’t lose requests if a Splash instance needs to be restarted.

Note

Nginx (which is another popular load balancer) provides an
internal queue only in its commercial version, Nginx Plus.

Of course, it is also good to set up monitoring, configuration management,
etc. - all the usual stuff.

To daemonize Splash, start it on boot, and restart it on failures,
you can use Docker: since Docker 1.2 there are --restart
and -d options which can be used together. Another way to do that is
to use standard tools like upstart, systemd
or supervisor.

Note

Docker --restart option won’t work without -d.

Splash uses an unbounded in-memory cache, so it will eventually consume
all available RAM. A workaround is to restart the process when it uses too
much memory; the Splash --maxrss option does that. You can also add the
Docker --memory option to the mix.

In production it is a good idea to pin the Splash version: instead of
scrapinghub/splash it is usually better to use something like
scrapinghub/splash:2.0.

A long-running Splash server which uses up to 4GB RAM, runs as a daemon
and restarts itself can be started with a single command that combines
the Docker -d, --restart and --memory options with the Splash --maxrss
option.
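
Such a command could be assembled as follows. This is a sketch, not a recommended configuration: the port mapping, the always restart policy, the 4500m Docker limit and the assumption that --maxrss is given in megabytes should all be checked against your Docker and Splash versions. production_command is a hypothetical helper.

```python
import shlex


def production_command(image="scrapinghub/splash:2.0",
                       docker_memory="4500m", maxrss_mb=4000):
    """Return a docker argv for a daemonized, self-restarting Splash server."""
    return [
        "docker", "run",
        "-d",                         # daemonize (needed for --restart, per the note above)
        "-p", "8050:8050",            # assumed port mapping
        f"--memory={docker_memory}",  # hard limit, set a bit above --maxrss
        "--restart=always",           # assumed restart policy
        image,                        # pinned Splash version
        "--maxrss", str(maxrss_mb),   # Splash exits at this RSS (assumed MB)
    ]


command_line = shlex.join(production_command())
```

Keeping the Docker limit slightly above --maxrss is a design choice: it lets Splash exit on its own before Docker has to kill the container.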

Note that if you disable private mode then browsing data such as cookies or
items kept in localStorage may persist between requests. If you’re using
Splash in a shared environment, this could mean that your cookies or local
storage items can be accessed by other clients, or that you can occasionally
access other clients’ cookies.

You may still want to turn Private mode off because in WebKit localStorage
doesn’t work when Private mode is enabled, and it is not possible
to provide a JavaScript shim for localStorage. So for some websites you may
have to turn Private mode off.

When you check http://<splash-server>:8050/render.html?url=<url>
in a browser it is likely stylesheets & other resources won’t
load properly. It happens when resource URLs are relative - the browser
will resolve them as relative to
http://<splash-server>:8050/render.html?url=<url>, not to url.
This is not a Splash bug, it is a standard browser behaviour.
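
The resolution rule can be reproduced with Python's urljoin, which follows the same relative-URL algorithm that browsers use. The host names and the stylesheet path below are made up for illustration.

```python
from urllib.parse import urljoin

# URL typed into the browser's address bar (assumed local Splash instance).
splash_result_url = (
    "http://localhost:8050/render.html?url=http://example.com/blog/post.html"
)
original_url = "http://example.com/blog/post.html"
relative_href = "css/site.css"  # hypothetical relative link found in the page

# What the browser requests when it displays the render.html result:
resolved_by_browser = urljoin(splash_result_url, relative_href)
# -> "http://localhost:8050/css/site.css"

# What the page author intended:
intended = urljoin(original_url, relative_href)
# -> "http://example.com/blog/css/site.css"
```

The browser ends up asking the Splash server for the stylesheet, which is why it fails to load.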

If you just want to check what the page looks like after rendering,
use the render.png or render.jpeg endpoints.
If a screenshot is not an option and you want to display the HTML with
images, etc. in a browser, you can post-process the HTML and add
an appropriate <base> HTML tag to the page.
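
A possible post-processing step is sketched below. It uses a regular expression rather than a real HTML parser, so treat it as an illustration for well-formed pages only; add_base_tag is a hypothetical helper.

```python
import html
import re


def add_base_tag(page_html, base_url):
    """Insert <base href=...> right after the opening <head> tag."""
    base = f'<base href="{html.escape(base_url, quote=True)}">'
    # A function replacement avoids backslash-escape surprises in base_url.
    new_html, count = re.subn(
        r"(<head\b[^>]*>)",
        lambda match: match.group(1) + base,
        page_html,
        count=1,
        flags=re.IGNORECASE,
    )
    if count == 0:
        # No <head> found: prepend the tag. This is a crude fallback
        # that is only reasonable for HTML fragments.
        new_html = base + page_html
    return new_html


fixed = add_base_tag(
    "<html><head><title>t</title></head><body></body></html>",
    "http://example.com/blog/post.html",
)
```

When the resulting HTML is opened in a browser, relative links resolve against the original page URL instead of the Splash server.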

The baseurl Splash argument can’t help here. It allows you
to render a page located at one URL as if it were located at another
URL. For example, you can host a copy of the page HTML on your server,
but use the baseurl of the original page. This way Splash will resolve
relative URLs as relative to the original page URL, so that you can get
e.g. a proper screenshot or execute proper JavaScript code.
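
For the hosted-copy scenario, a request could be built as follows. The copy's address (my-server.example) and the Splash address are assumptions, and render_copy_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def render_copy_url(splash_base, copy_url, original_url):
    """Build a render.png URL that renders a hosted copy of a page
    while resolving relative URLs against the original location."""
    query = urlencode({
        "url": copy_url,          # where Splash downloads the HTML from
        "baseurl": original_url,  # what relative URLs are resolved against
    })
    return f"{splash_base}/render.png?{query}"


url = render_copy_url(
    "http://localhost:8050",
    "http://my-server.example/copy.html",
    "http://example.com/blog/post.html",
)
```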

But by passing baseurl you’re instructing Splash to use it,
not your browser. It doesn’t change relative links to absolute ones in the
DOM; it makes Splash treat them as relative to baseurl when rendering.

Changing links to absolute ones in the DOM tree is not what browsers do when
a base URL is applied: e.g. if you check an href attribute using JS code,
it will still contain the relative value even if a <base> tag is used.
render.html returns a DOM snapshot, so the links are not changed.

When you load a render.html result in a browser, it is your browser
that resolves relative links, not Splash, so they are resolved incorrectly.