Rationale

In 2016 we gave our main server some more RAM, as a temporary solution
to cope with our workload, and as a way to learn about how to scale
it. See hardware for automated tests take2 for our
reasoning, lots of benchmark results, and conclusions.

It's working relatively well so far, but we may need to upgrade again
soonish, and improvements are always welcome in the contributor UX
area:

When Jenkins has built an ISO from one of our main branches or from
a branch that is Ready for QA, since October 2017 we rebuild it in
a slightly different build environment to ensure it can be rebuilt
reproducibly.
This has substantially increased the number of ISO images we build,
which sometimes creates congestion in our CI pipeline (see below
for details).

On our current setup, large numbers of automated test cases are
brittle and had to be disabled on Jenkins (@fragile), which has quite
a few problematic consequences, such as: it decreases the value our
CI system brings to our development process, it is demotivating for
test suite developers, it decreases the confidence developers have
in our test suite, and it forces developers to run the full test
suite elsewhere when they really want to validate a branch.
Interestingly, we don't see this much brittleness anywhere else,
even on a replica of our Jenkins setup that uses nested
virtualization too.

As we add more automated tests, and re-enable tests previously
flagged as fragile, a full test run takes longer and longer.
We're now up to 160 minutes per run. We can't make it faster anymore
by adding RAM or CPUs to the ISO testers, but faster CPU cores would
fix that: the same test suite only takes 105
minutes on a replica of our Jenkins setup, also using nested
virtualization, with a poor Internet connection but a faster CPU.

Building our website takes a long while (8 minutes on our ISO
builders, i.e. 17% of the entire ISO build time), which makes ISO
builds take longer than they could. This will get worse as new
languages are added to our website. This is a single-threaded task,
so adding more CPU cores or RAM would not help: only faster CPU
cores would fix that. For example, the ISO build only takes 30
minutes (including 4.5 minutes for building the website) on
a replica of our Jenkins setup, also using nested virtualization,
with a poor Internet connection but faster CPU cores.

Waiting time in queue for ISO build and test jobs is acceptable
most of the time, but too high during peak load periods: between
2017-06-17 and 2017-12-17, 4% of the test jobs had to wait for
more than 1 hour, and 1% for more than 2 hours; similarly, 2% of the
ISO build jobs had to wait more than 1 hour. That's not many jobs
of course, but this congestion happens precisely when we need
results from our CI infra ASAP, be it because there's intense
ongoing development or because we're reviewing and merging lots of
branches close to a code freeze, so these delays hurt our
development and release process.

Our current server was purchased at the end of 2014. The hardware
can last quite a few more years, but we should plan (at least
budget-wise) for replacing it when it is 5 years old, at the end of
2019 at the latest.

New candidate services are being considered, such as Piwik
(#14601) and self-hosting our website
(#14588). At least Piwik will require lots of
resources and put quite some load on the system where we run it.
Both need to be hosted on hardware we control, where we can do our
best to guarantee our users' privacy. Once the Weblate service
reaches production status in 2018 and is advertised, it will consume
many more resources; incidentally, part of its job will be to rebuild
the website often, which is affected by the single-threaded
performance limitations described above. OTOH there's no
security/privacy reason why our CI system has to be hosted on
hardware we control.

Options

Bare metal server dedicated to CI

Hard to tell whether this would fix our test suite fragility problems,
hard to specify what hardware we need. If we get it wrong, we'll likely
have to wait another 5 years before we try again ⇒ we need to rent
essentially the exact hardware we're looking at so we can benchmark it
before buying.

Pros:

No initial development nor skills to learn: we can run our test
suite in exactly the same way as we currently do.

Can provide hardware redundancy in case lizard suddenly dies.

We control the hardware and have a good relationship with
a friendly colocation provider.

We can sell the current hardware while it's still current, and get
some of our bucks back.

Cons:

High initial money investment.

Hard to tell whether this would fix our test suite fragility
problems, and we'll only know after we've spent lots of money.

Hard to specify what hardware we need. If we get it wrong, we'll
likely have to wait another 5 years before we try again.

The hacker option

For example, we could stuff 4-6 × Intel NUC NUC6i7KYK or similar
together in a custom case, with whatever cooling, PoE and network boot
system this high-density cluster would need. Each of these nodes
should be able to run 2 Jenkins workers.

Pros:

Potentially scalable: if there's room left we can add more nodes in
the future.

Probably as fast as server-grade hardware.

Cons:

Lots of initial research and development: casing, cooling, hosting,
power over Ethernet, network boot, remote administration.

High initial money investment (given the research and development
costs, we can't really try this option: either we go for it or we
don't).

Hosting this is a hard sell for colocation providers.

We need to buy a node in order to measure how it would perform
(as opposed to server-grade hardware that can be rented).
OTOH we already have data about the Intel NUC NUC6i7KYK so if we
pick a similar enough CPU we can reuse that.

On-going cost for hosting this cluster.

Run builds and/or tests in the cloud

EC2 does not yet support running KVM inside its VMs. Both Azure and
Google Cloud support it. OpenStack supports it too as long as the
cloud is run on KVM (e.g.
OVH Public Cloud,
and maybe later
OSU Open Source Lab).
ProfitBricks would likely work too, as their cloud is based on KVM.
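
Whichever provider we benchmark, it's worth sanity-checking from inside
a rented VM that nested virtualization is actually usable before running
a full test suite on it. A minimal sketch (Python, assuming a Linux
guest; it only looks at the CPU virtualization flags and /dev/kvm):

    # Sketch: check from inside a cloud VM whether nested KVM is usable.
    # Assumes a Linux guest.
    import os

    def nested_kvm_status():
        flags = set()
        with open("/proc/cpuinfo") as cpuinfo:
            for line in cpuinfo:
                if line.startswith("flags"):
                    flags.update(line.split(":", 1)[1].split())
        has_virt_flag = bool({"vmx", "svm"} & flags)  # Intel VT-x / AMD-V
        has_kvm_device = os.path.exists("/dev/kvm")
        return has_virt_flag, has_kvm_device

    if __name__ == "__main__":
        virt, kvm = nested_kvm_status()
        print("virtualization CPU flags exposed:", virt)
        print("/dev/kvm present:", kvm)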

There's a Jenkins plugin for every major cloud provider that allows
starting instances on demand when needed, up to pre-defined limits,
and shutting them down after a configurable idle time.
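
What these plugins automate boils down to booting a worker when jobs
pile up and deleting it once it has been idle for a while. A rough
sketch of that provisioning step with the openstacksdk Python library
(the cloud name, image name and network UUID are placeholders; the
C2-15 flavor is just the OVH example discussed below; key pair and
security group setup are omitted):

    # Sketch: the provisioning step a Jenkins cloud plugin automates.
    # "ovh", the image name and the network UUID are placeholders;
    # credentials are expected in clouds.yaml.
    import openstack

    conn = openstack.connect(cloud="ovh")

    image = conn.compute.find_image("Debian 9")
    flavor = conn.compute.find_flavor("C2-15")
    server = conn.compute.create_server(
        name="jenkins-worker-on-demand",
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": "REPLACE-WITH-NETWORK-UUID"}],
    )
    server = conn.compute.wait_for_server(server)
    print("worker ready at", server.access_ipv4)

    # ... Jenkins would now run build/test jobs on it over SSH ...

    conn.compute.delete_server(server)  # after the configured idle time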

After building an ISO, we copy artifacts from the ISO builder to the
Jenkins master (and thus to nightly.t.b.o), and then from the Jenkins
master to another ISO builder (if the branch is Ready for QA or one
of our main branches) and one ISO tester (in any case) that run
downstream jobs. These copies are blocking operations in our feedback
loop. So:

If the network connection between pieces of our CI system were
too slow, the performance benefits of building and testing
faster might vanish.

Assuming a 1.2 GB ISO, 3.5 minutes should be enough for a copy
(based on benchmarking a download of a Debian ISO image from lizard)
⇒ 2 or 3 × 3.5 = 7 or 10.5 minutes for an ISO build; compared to 1.5
minutes to the Jenkins master + 20 seconds to the 2nd ISO builder +
15 seconds to the ISO tester =~ 2 minutes on lizard currently
(see the sketch after this list for redoing this arithmetic with
other assumptions).
In the worst case (Jenkins master on our infrastructure, Jenkins
workers in the cloud) it adds 5 or 8.5 minutes to the feedback loop,
which is certainly not negligible but is not a deal breaker either.

If data transfers between pieces of our CI system cost money, we
would need to estimate how much these copies would cost. On OVH Public
Cloud, data transfers to/from the Internet are included in the price
of the instance so let's ignore this.

One way to avoid this problem entirely is to run our Jenkins master
and nightly.t.b.o in the cloud as well.
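
To make it easy to redo the copy-time estimate above with other
assumptions (copy duration, number of copies), here is a small
sketch; the default values are the ones used in this document:

    # Sketch: how much time artifact copies add to the feedback loop.
    # Defaults match the assumptions above: ~3.5 minutes per copy of
    # a 1.2 GB ISO, versus ~2 minutes of copies on lizard today.
    def copy_overhead(minutes_per_copy=3.5, copies=(2, 3),
                      current_minutes=2.0):
        for n in copies:
            total = n * minutes_per_copy
            extra = total - current_minutes
            print(f"{n} copies: {total:.1f} min total, "
                  f"+{extra:.1f} min vs. lizard today")

    copy_overhead()
    # prints the 7 / 10.5 minute totals and the 5 / 8.5 minute
    # overhead discussed above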

Pros:

Scalable as much as we can (afford), both to react to varying
workloads in the short term (some days we build and test tons of
ISO images, some days a lot fewer), and to adjust to changing needs
in the long term.

No initial money investment.

No hardware failures we have to deal with.

We can try various instance types until we find the right one, as
opposed to bare metal that requires careful planning and
somewhat-informed guesses (mistakes in this area can only be fixed
years later: for example, choosing low-voltage CPUs, which are
suboptimal for our workload).

Frees lots of resources on our current virtualization host, which
can be reused for other purposes. And if we don't need these
resources, then our next bare metal server can be much cheaper,
both in terms of initial investment and on-going costs (it will
suck less power).

Cons:

We need to learn how to manage systems in the cloud, how to deal
with billing, and how to control these systems from Jenkins.

These instances will likely be faster than my local Jenkins,
thanks to a higher CPU clock rate, which should lower the actual
costs; but only actual testing will give us more
precise numbers.

Running a well-chosen number of static instances would probably
lower these costs thanks to the discount when paying per month.
Also, booting a dynamic instance and configuring it takes some
additional time, which costs money and decreases performance.

We need to evaluate how many static instances (kept running at
all times and paid per month) we run and how many dynamic
instances (spawned on demand and paid per hour) we allow. E.g.
on OVH Public Cloud, a dynamic C2-15 instance costs more than
a static one once it runs more than 50% of the time. Thanks to
the Cluster Statistics Jenkins plugin, once we run this in the
cloud we'll have the data we need to optimize this; it should be
easy to script it so we can update these settings from time
to time (see the cost sketch at the end of this list).

We need to add the cost of hosting our Jenkins master and
nightly.t.b.o in the same cloud, or the cost of transferring
build artifacts between that cloud and lizard.

Our Jenkins master is currently allocated 2.6 GB of RAM and
2 vCPUs. An OVH S1-8 (13€/month) or B2-7 (22€/month) static
instance should be enough. We estimated that 300 GB of storage
would be enough at least until the end of 2018. Our metrics say
that the storage volume that hosts Jenkins artifacts often makes
good use of 1000-2000 IOPS, so an OVH "High Speed Volume"
(0.08€/month/GB) would be better suited even though in practice
only a small part of these 300 GB needs to be that fast, and
possibly a slower "Classic Volume" (0.04€/month/GB) might perform well enough.
So:

worst case: 22 + 300×0.08 = 46€/month

best case: 13 + 300×0.04 = 25€/month

In theory we could keep running some of our builds on our own
infra instead of in the cloud: one option is that the cloud
would only be used during peak load times for builds (but always
used for tests in order to fix our test suite brittleness
problems, hopefully). But if we do that, we don't improve ISO
build performance in most cases, and the build artifacts copy
problems surface, which costs performance and some development
time to optimize things a bit:

If we run the Jenkins master on our own infra: only artifacts
of ISO builds run in the cloud during peak load times need to
be downloaded to lizard; we could force the 2nd ISO build to
run locally so we avoid having to upload these artifacts to
the cloud, or we could optimize the 2nd ISO build job to
retrieve the 1st ISO only when the 2 ISOs differ and we need
to run diffoscope on them (in which case we also need to
download the 2nd ISO to archive it on the Jenkins master).
But all ISO build artifacts must be uploaded to the nodes
that run the test suite in the cloud.

If we run the Jenkins master in the cloud: most of the time
we need to upload the ISOs built on lizard there; then we
need to download them again for the 2nd build on lizard as
well (unless we do something clever to keep them around for
the 2nd build, or force the 2nd build to run in the cloud
too); but at this point they're already available out there
in the cloud for the test suite downstream job. During peak
times, the difference is that ISOs built in the cloud are
already there for everything else that follows (assuming we
force the 2nd build to run in the cloud too).

We need to trust a third party somewhat.

To make the whole thing more flexible and easier to manage, it
would be good to have the same nodes able to run both builds and
tests. Not sure what it would take and what the consequences
would be.
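
The cost reasoning above (static vs. dynamic break-even point, and the
monthly cost of hosting the Jenkins master and its storage) is only a
few multiplications; here is a small sketch using the OVH figures
quoted in this document, so it can be redone when prices or needs
change:

    # Sketch: redo the cost estimates above with adjustable inputs.
    # Prices are the OVH figures quoted in this document.

    def breakeven_utilization(hourly_price, monthly_price,
                              hours_per_month=730):
        """Fraction of the month above which a per-hour (dynamic)
        instance costs more than the same flavor paid per month
        (static); the document quotes roughly 0.5 for an OVH C2-15."""
        return monthly_price / (hourly_price * hours_per_month)

    def master_hosting_cost(instance_eur, storage_gb, eur_per_gb_month):
        return instance_eur + storage_gb * eur_per_gb_month

    # Worst case: B2-7 instance + 300 GB "High Speed Volume"
    print(master_hosting_cost(22, 300, 0.08))  # 46.0 €/month
    # Best case: S1-8 instance + 300 GB "Classic Volume"
    print(master_hosting_cost(13, 300, 0.04))  # 25.0 €/month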


The plan

Check how the sysadmins team feels about the cloud option: are
there any blockers, for example wrt. ethics, security, privacy,
anything else? → we're no big fans of using other people's
computers but if that's the best option we can do it

Keep gathering data about our needs while going through the next steps:

upcoming services [intrigeri]

what if we allowed more people to use our CI infrastructure?

Describe our needs for each option:

2nd bare metal server-grade machine:

hardware specs [intrigeri]

hosting needs (high power consumption) [intrigeri]

cloud (nested KVM, management tools in Debian) [intrigeri]

the hacker option: hosting

Ask potential hardware/VM donors (e.g. HPE, ProfitBricks) if they
would happily satisfy our needs for free or with a big discount.
If they can do that, then let's do it. Otherwise, keep reading.
[intrigeri]

Benchmark how our workload would be handled by the options we have
no data about yet: rent a bare metal server and cloud VMs for
a short time, run CI jobs in a realistic way, measure.
[intrigeri]