DevOps/SRE

Remote, Singapore, San Francisco

Ahrefs is looking for an SRE to help take care of its distributed backend systems, powered by 3000 servers, and ensure all systems are up and running 24/7. We require a deep understanding of operating system and network fundamentals, practical knowledge of Linux, and a healthy desire to automate everything while being able to quickly resolve urgent issues manually. We strive to keep humans away from repetitive work that can be done by computers, and to focus instead on foreseeing problems and defining programmatic structures to handle them.

Who we are

Ahrefs runs an internet-scale bot that crawls the whole Web, storing huge volumes of information to be indexed and structured in a timely fashion. The backend system is powered by a custom petabyte-scale distributed key-value storage to accommodate all that data coming in at high speed. With that data, Ahrefs is building analytics services for end users and a web-scale search platform.

We are a small team and strongly believe in better technology leading to better solutions for real-world problems. We worship functional languages and static typing, extensively employ code generation and meta-programming, value code clarity and predictability, and are constantly seeking to automate repetitive tasks and eliminate boilerplate, guided by DRY and following KISS. If there is any new technology that will make our lives easier - no doubt, we'll give it a try. We rely heavily on open-source code (as the only viable way to build maintainable systems) and contribute back, see e.g. https://github.com/ahrefs. Occasionally we track down CPU bugs.

Our motto is "first do it, then do it right, then do it better".

Responsibilities:

develop internal automation - monitoring, setup, statistics

set up automatic systems to control infrastructure

monitor the health of live production systems

provide first-aid response to infrastructure failures

deal with hardware problems and interact with the datacenter

help developers with deployment and integration

participate in on-call rotation

You will be dealing on a daily basis with:

20PB storage cluster

3000 Linux servers

experimental large-scale deployments

all kinds of software bugs and hardware deviations

Our system is in large part custom OCaml code and also employs the following third-party technologies:

LAMP

ELK

Puppet

The ideal candidate is expected to:

Independently deal with and investigate infrastructure issues on live production systems

Foresee problems and prevent them from happening

Make well-reasoned technical choices and take responsibility for them

Understand the whole technology stack at all levels: from network and userspace code to OS internals and hardware

Approach problems with a practical mindset and suppress perfectionism when time is a priority