5 steps to self-managing server infrastructure

Managing servers is tedious work we all have to do to some extent. But it doesn't have to fill our whole day. What if I told you that, with some discipline and effort, you can build a self-managing system? I went through implementing a self-managing database infrastructure of thousands of MySQL servers, and I'll walk you through the milestones so you can build yours too.

I'm going to use a MySQL upgrade as the running example.

1. Script it

You're probably tired of running the same repetitive commands over and over again. Why not script it? Whatever you can paste into a console you can paste into a bash file or a Perl/Python script, and run that instead of a series of individual commands. You still need to make sure your server is out of production traffic and that you won't impact your customers when you do it, but even so, running a single command that starts something like the one below saves you a lot of time that you can spend on taking this to the next level.

Automation 101 bash example

Shell

#!/bin/bash
/etc/init.d/mysql stop
[your-choice-of-package-manager] update mysql
mysql_upgrade
/etc/init.d/mysql start
# some warmup script

Pro tips

Keep it simple

You're four steps away from total automation, so do not overengineer. Make it work first, then progress in small iterations: make it a little better every time you use it or touch it.

Choice of language

I prefer Python but it's a very personal decision. I do recommend choosing a language you can build upon later. The automation will only become more and more complex, so choose a language which can support you on this journey.
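To illustrate, the bash script above could be wrapped in a small Python skeleton you can grow later. This is just a sketch: the `yum` command and the exact step list are assumptions, so substitute your own package manager and steps.

```python
import subprocess

# Ordered upgrade steps; swap in your package manager of choice.
UPGRADE_STEPS = [
    ["/etc/init.d/mysql", "stop"],
    ["yum", "-y", "update", "mysql"],   # assumption: yum-based system
    ["mysql_upgrade"],
    ["/etc/init.d/mysql", "start"],
]

def run_upgrade(steps=UPGRADE_STEPS, dry_run=False):
    """Run each step in order, stopping at the first failure.

    With dry_run=True, nothing is executed; the planned commands
    are just collected and returned so you can inspect them.
    """
    executed = []
    for cmd in steps:
        executed.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)  # raise on non-zero exit
    return executed
```

Starting with a structure like this (instead of a flat script) makes the later steps easier: the step list can grow, be reordered, or be driven by a remote executor.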

2. Run it parallel and remotely

SSHing into every single box works for a while, but it can certainly be improved. Being able to execute the script you built remotely, and even better, to run it on multiple hosts in parallel, will take you to the next milestone.

Ansible example

Shell

ansible-playbook -i hosts -l dbservers update_mysql.yml

Where hosts is your inventory file (you can limit the execution to certain hosts with the -l option) and update_mysql.yml is the playbook you're going to run. The playbook only needs to contain a few simple tasks:

Ansible task example

YAML

---
- name: mysql service state
  service: name=mysql state=stopped enabled=no

- name: Install latest packages
  yum: name={{ item }} state=latest
  with_items:
    - mysql
    - mysql-server
  register: mysql

- name: Run mysql_upgrade
  command: mysql_upgrade chdir=/var/lib/mysql
  when: mysql.changed

- name: mysql service state
  service: name=mysql state=started enabled=yes

Pro tips

Don’t reinvent the wheel

There are many existing solutions you can build your automation on: Ansible, SaltStack, Chef, just to mention a few, and there are a gazillion more options. Find what works best for you, preferably something that matches your scripting language (notice I put Ansible and Salt first, as they are both Python based 🙂).

3. Unattended execution

After you have gained some trust in your automation you're ready to run it unattended. Whether you do it with scheduled tasks, cron jobs, SaltStack, Chef, or Puppet, you've just taken a major step towards freeing yourself from daily operations so you can focus on innovation instead.
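For example, unattended execution can be as simple as a cron entry that runs the playbook from the previous step on a schedule. The schedule, inventory path, and log file below are only placeholders:

```shell
# crontab fragment: run the upgrade playbook every Sunday at 03:30
# and keep the output for auditing (all paths are examples)
30 3 * * 0  ansible-playbook -i hosts -l dbservers update_mysql.yml >> /var/log/update_mysql.log 2>&1
```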

This is also the time to start thinking about how you manage connections to your servers and whether you can start disabling them programmatically. Load balancer options like HAProxy, or an inventory coordinator like ZooKeeper, can come in very handy in the next phase of automation.
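As a sketch of the HAProxy route: its runtime API accepts "disable server" / "enable server" commands over a UNIX admin socket, which lets a script pull a database host out of production before touching it. This assumes a "stats socket ... level admin" line in haproxy.cfg; the backend and server names below are made up.

```python
import socket

def admin_command(backend, server, enable):
    """Build the HAProxy runtime-API command string for one server."""
    verb = "enable" if enable else "disable"
    return f"{verb} server {backend}/{server}\n"

def send_command(sock_path, command):
    """Send one command over HAProxy's UNIX admin socket and
    return its reply (requires admin-level stats socket)."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall(command.encode())
        return s.recv(4096).decode()

# Usage sketch (socket path is an assumption):
# send_command("/var/run/haproxy.sock",
#              admin_command("mysql_pool", "db1", enable=False))
```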

Pro tips

Murphy’s law

Whatever can fail will fail. Don't worry about failure, but make sure you minimize its impact and can learn from it. Detailed, centralized logging is very useful in those situations. My personal favourite is fluentd for pushing logs to a central repository, but you can also use the Elastic stack (Logstash, Elasticsearch, Kibana).

4. Automatic batch job execution

Once your tasks are stable you can group them into batches and execute them when appropriate. I call those batches jobs. For example, you can have a job to upgrade every slave in a certain replication chain:

1. The executor picks up the next task in the queue
2. Run the task:
   - Disable your server (see the previous point)
   - Stop mysql
   - Do the upgrade
   - Start mysql
   - Warm up if necessary, or do any other post work
   - Enable the server if everything looks good
3. If the task:
   - succeeded, go to the next task
   - failed, report the failure and stop execution (since you disabled your server, nothing should be impacted)
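The loop above can be sketched in a few lines of Python. The disable, enable, and steps callables are placeholders for whatever your infrastructure provides (load balancer calls, the upgrade steps from step 1, and so on):

```python
def run_job(queue, disable, enable, steps):
    """Work through a job: one task per host, stop at the first failure.

    disable/enable pull a host out of / back into production;
    steps is the ordered list of operations (stop mysql, upgrade,
    start mysql, warmup, ...), each a callable taking the host.
    """
    while queue:
        host = queue.pop(0)
        disable(host)                      # out of production first
        try:
            for step in steps:
                step(host)
        except Exception as exc:
            # Host stays disabled, so nothing is impacted; stop the job
            # here and report the failure.
            return ("failed", host, exc)
        enable(host)                       # everything looked good
    return ("done", None, None)
```

Because the host is disabled before anything runs and only re-enabled on success, a failed task leaves the rest of the queue untouched and the broken host safely out of production.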

Pro tips

Plan for maintenance

Once you've reached this point you probably have a large enough infrastructure to afford running under capacity, leaving room for maintenance: bringing down servers, or keeping unused servers around as clone sources for MySQL, for example.

Trashable servers

In general you shouldn't have too many single points of failure in your system, and it becomes more and more important not to care about individual servers: consider your machines a pool of workers providing a certain service. As long as your pool has enough members, your service should be intact.

5. Self-managing servers

Now that you have tasks performing certain operations, it's time to move beyond the job queue (which I know you just built) and take it a step further by plugging it into your monitoring/trending/event system, so that noticing a certain condition results in a certain action. It doesn't matter whether you have a cron job scanning your Graphite or Nagios servers for datapoints, or you implement a check-action system using something like Monit.
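A minimal check-action sketch, assuming you already have a way to fetch datapoints from your monitoring system and to enqueue jobs (both are placeholders here):

```python
def check_and_act(datapoints, threshold, enqueue):
    """Scan monitoring datapoints and enqueue a job for each host
    whose metric crosses the threshold (run this from cron, for
    example, against Graphite/Nagios data you fetched yourself)."""
    acted = []
    for host, value in datapoints.items():
        if value > threshold:
            enqueue(host)          # e.g. schedule an upgrade or failover job
            acted.append(host)
    return acted
```

The whole self-managing loop is just this pattern repeated: a condition check feeding the job queue from step 4, which already knows how to disable, operate on, and re-enable hosts safely.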

Word of advice

You already have almost everything set up, so you don't need my tips, but a word of advice if you'll let me: operations like this can really take down your entire infrastructure, so make sure you have done everything to minimize the impact if that happens, to recognize it as fast as possible, and to be able to react (roll back or terminate). This last one might sound obvious, but many times it is the trickiest bit. Our solution is to have a single central mutex which can prevent every automated task from running. Tasks only proceed after making sure the mutex is not in place.
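The central mutex can be as simple as a well-known file (or a ZooKeeper node) that every task checks before it starts. Here is a file-based sketch; the path is hypothetical:

```python
import os

# Hypothetical well-known path; creating this file halts all automation.
MUTEX_PATH = "/var/run/automation.lock"

def automation_allowed(path=MUTEX_PATH):
    """Every automated task calls this before doing anything.

    Touching the mutex file stops the whole fleet's automation at
    once; removing it lets tasks proceed again.
    """
    return not os.path.exists(path)
```

The key property is that there is exactly one switch: an operator (or an automated circuit breaker) can stop everything with a single touch, without hunting down individual jobs.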

Don't forget: the lazy engineer is the best engineer. Happy automation!

About charlesnagy

I'm, among many things, an automation expert, database specialist, system engineer and software architect with a passion for data: searching it, analyzing it, learning from it. I learn by experimenting, and this blog is the result of those experiments and some other random thoughts I have from time to time.
