Thursday, March 25, 2010

If sending email is a critical part of your online presence, then it pays to look at ways to enhance the probability that messages you send will find their way into your recipients' inboxes, as opposed to their spam folders. This is fairly hard to achieve and there are no silver bullet guarantees, but there are some things you can do to try to enhance the reputation of your mail servers.

One thing you can do is use DKIM, which stands for DomainKeys Identified Mail. DKIM is the result of merging Yahoo's DomainKeys with Cisco's Identified Internet Mail. There's a great Wikipedia article on DKIM which I strongly recommend you read.

DKIM is a method for email authentication -- in a nutshell, mail senders use DKIM to digitally sign messages they send with a private key. They also publish the corresponding public key as a DNS record. On the receiving side, mail servers use the public key to verify the digital signature. So by using DKIM, you as a mail sender prove to your mail recipients that you are who you say you are.

Note that DKIM doesn't prevent spam. Spammers can also use DKIM, and sign their messages. However, DKIM achieves an important goal, which is to prevent spammers from spoofing the source of their emails and impersonating users in other mail domains by forging the 'From:' header in their spam emails. If spammers want to use DKIM, they are thus forced to use their real domain name in the 'From:' header, and this makes it easier for the receiving mail servers to reject that email. See also 'Three myths about DKIM' by John R. Levine.

If you use DKIM as an email sender, you increase the chances that your mail servers will be whitelisted and that your mail domain will be considered 'reputable'.

Enough theory, let's see how you can set up DKIM with postfix. I'm going to use OpenDKIM and postfix on an Ubuntu 9.04 server.

You will also need to add /usr/local/lib to the paths inspected by ldconfig, otherwise opendkim will not find its required shared libraries when starting up. I created a file called /etc/ld.so.conf.d/opendkim.conf containing just one line:

/usr/local/lib

Then I ran:

# ldconfig

3) Add a 'dkim' user and group

# useradd dkim

4) Use the opendkim-genkey.sh script to generate a private key (in PEM format) and a corresponding public key inside a TXT record that you will publish in your DNS zone. The opendkim-genkey.sh script takes as arguments your DNS domain name (-d) and a so-called selector, which identifies this particular private key/DNS record combination. You can choose any selector name you want. In my example I chose MAILOUT.
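To make the selector a bit more concrete: receiving mail servers look up the public key in a TXT record at `<selector>._domainkey.<domain>`, and the record itself is a list of `tag=value` pairs separated by semicolons. Here is a minimal Python sketch of both ideas. The function names, the domain `example.com` and the placeholder key are assumptions for illustration, not values from this setup.

```python
def dkim_dns_name(selector, domain):
    """Return the DNS name where the DKIM public key TXT record is looked up."""
    return "%s._domainkey.%s" % (selector, domain)


def parse_dkim_txt(txt):
    """Split a DKIM TXT record of the form 'tag=value; tag=value' into a dict."""
    tags = {}
    for part in txt.split(";"):
        part = part.strip()
        if not part:
            continue
        # Split on the first '=' only: base64 key data may end in '=' padding.
        name, _, value = part.partition("=")
        tags[name.strip()] = value.strip()
    return tags


if __name__ == "__main__":
    # example.com and the key below are placeholders, not real values.
    print(dkim_dns_name("MAILOUT", "example.com"))
    print(parse_dkim_txt("v=DKIM1; k=rsa; p=PLACEHOLDER_BASE64_PUBLIC_KEY"))
```

Once the record is published, querying that DNS name for its TXT record is a quick way to confirm the record is visible to the outside world.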

The opendkim-genkey.sh script generates a private key called MAILOUT.private, which I'm copying to /var/db/dkim/MAILOUT.key.pem. It also generates a file called MAILOUT.txt which contains the TXT record that you need to add to your DNS zone:

There is a sample file called opendkim.conf.sample located in the root directory of the opendkim source distribution. I copied it as /etc/opendkim.conf, then I set the following variables (this article by Eland Systems was very helpful):

Note the InternalHosts setting. It points to a file listing IP addresses that belong to servers which use your mail server as a mail relay. They could be for example your application servers that send email via your mail server. You need to list their IP addresses in that file, otherwise mail sent by them will NOT be signed by your mail server running opendkim.
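The idea behind InternalHosts can be sketched in a few lines of Python. This is only an illustration of the concept (sign mail only when the connecting client is on the list), not opendkim's actual matching code. The file format assumed here (one IP address or CIDR block per line, with '#' comments) and the function names are assumptions for the sketch.

```python
import ipaddress


def load_internal_hosts(lines):
    """Parse the lines of an InternalHosts-style file into a list of networks.

    Blank lines and '#' comments are skipped. A bare IP address becomes a
    single-address network.
    """
    networks = []
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def should_sign(client_ip, networks):
    """Return True if mail relayed from client_ip would count as internal."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)
```

With a list containing `10.0.0.5`, mail relayed from that application server would be treated as internal and signed, while mail arriving from any unlisted address would not.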

7) Start up opendkim

# /usr/local/sbin/opendkim -x /etc/opendkim.conf

At this point, opendkim is running and listening on the port you specified in the config file -- in my case 9999.
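A quick way to confirm that something is really listening on that port is to try a TCP connection to it. The following is a small sketch using only the Python standard library; the function name `port_is_open` is made up for this example, and the host and port are whatever you configured (localhost and 9999 in my case).

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, etc.
        return False


if __name__ == "__main__":
    print(port_is_open("localhost", 9999))
```

Note that this only shows that some process accepts connections on the port; it does not prove that the process is opendkim or that it is configured correctly.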

8) Configure postfix

You need to configure postfix to use opendkim as an SMTP milter.

Note that the postfix syntax in the opendkim documentation is wrong. I googled around and found this blog post which solved the issue.

Edit /etc/postfix/main.cf and add the following line:

smtpd_milters = inet:localhost:9999

(the port needs to be the same as the one opendkim is listening on)

Reload the postfix configuration:

# service postfix reload

9) Troubleshooting

At first, I didn't know about the InternalHosts trick, so I was baffled when email sent from my application servers wasn't being signed by opendkim. To troubleshoot, make sure you set

LogWhy yes

in opendkim.conf and then inspect /var/log/mail.log and google for any warnings you find (remember to always RTFL!!!).

When you're done troubleshooting, set LogWhy back to 'no' so that you don't log excessively.

10) Verifying your DKIM setup

At this point, try to send email to some of your email accounts such as gmail, yahoo, etc. When you get the email there, show all headers and make sure you see the DKIM-Signature and X-DKIM headers. Here's an example from an email I received in my GMail account (you need to click the 'Reply' drop-down and choose 'Show original' to see all the headers):
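The header check can also be scripted instead of done by eye. The following is a minimal sketch using Python's standard `email` module. The function name `dkim_headers_present` and the sample message are made up for illustration; the sample is not a real email and its signature is not valid.

```python
from email import message_from_string


def dkim_headers_present(raw_message, wanted=("DKIM-Signature",)):
    """Return a dict mapping each wanted header name to the values found."""
    msg = message_from_string(raw_message)
    # Header lookups in the email module are case-insensitive.
    return {name: msg.get_all(name, []) for name in wanted}


if __name__ == "__main__":
    # Made-up sample message; the domain is a placeholder.
    sample = (
        "DKIM-Signature: v=1; a=rsa-sha256; d=example.com; s=MAILOUT\n"
        "From: sender@example.com\n"
        "Subject: test\n"
        "\n"
        "hello\n"
    )
    for name, values in dkim_headers_present(sample).items():
        print(name, "present" if values else "MISSING")
```

This only tells you that the header exists. Whether the signature actually verifies is decided by the receiving mail server.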

This is a rule I'm trying to stick to. All deployments need to be automated. No ad-hoc, one-off deployments allowed. If you do allow them, they can quickly snowball into an unmaintainable, unreproducible mess.

Update (via njl): Only deploy files that are checked out of source control

So true. I should have included this when I wrote the post. Make sure your fabfiles, as well as all files you deploy remotely, are indeed what you expect them to be. It's best to keep them in a source control repository, and to be disciplined about checking them in every time you update them.

Along the same lines: when deploying our Tornado-based Web services here at Evite, we make sure our continuous integration system (we use Hudson) builds the eggs and saves them in a central location, and we install those eggs with Fabric from the central location.

Common Fabric operations

I find myself using only a small number of Fabric functions. I mainly use these three operations:

1) copy files to a remote server (with the 'put' function)

2) run a command as a regular user on a remote server (with the 'run' function)

3) run a command as root on a remote server (with the 'sudo' function)

I also sometimes use the 'sed' function, which allows me for example to comment or uncomment lines in an Nginx configuration file, so I can take servers in and out of the load balancer (for commenting lines out, you can also use the 'comment' function).
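To show what that commenting and uncommenting amounts to, here is the underlying text transformation as a plain Python sketch. It does not use Fabric at all, so it operates on a configuration string rather than on a remote file. The function names, the upstream block and the server addresses are assumptions made up for the example.

```python
import re


def comment_server(config_text, server):
    """Comment out 'server <server> ...' lines that are not already commented."""
    pattern = re.compile(
        r"^([ \t]*)(server\s+%s\b.*)$" % re.escape(server), re.MULTILINE
    )
    return pattern.sub(r"\1#\2", config_text)


def uncomment_server(config_text, server):
    """Re-enable a 'server <server> ...' line that was commented out."""
    pattern = re.compile(
        r"^([ \t]*)#[ \t]*(server\s+%s\b.*)$" % re.escape(server), re.MULTILINE
    )
    return pattern.sub(r"\1\2", config_text)


if __name__ == "__main__":
    # Hypothetical upstream block with two application servers.
    conf = (
        "upstream app {\n"
        "    server 10.0.0.1:8000;\n"
        "    server 10.0.0.2:8000;\n"
        "}\n"
    )
    print(comment_server(conf, "10.0.0.2:8000"))
```

Both functions are safe to run more than once: commenting an already commented line, or uncommenting an active one, leaves the text unchanged.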

The main function in this fabfile is 'install'. Inside it, I call 3 other functions which can be also called on their own: 'install_prereqs' installs the munin-node package via apt-get, 'copy_munin_files' copies various configuration files to the remote server under ~/munin, then sudo copies them to /etc/munin, and finally 'start' starts up the munin-node service. I find that this is a fairly common pattern for my deployments: install base packages, copy over configuration files, start or restart service.

Note that I'm importing a module called 'environments'. That's where I define and name lists of hosts that I want to target during the deployment. For example, your environments.py file could contain the following:

If you wanted to install munin-node on a server which is not part of any env.hosts lists defined in environments.py, you would just call:

fab -f fab_munin install

...and the fab utility will ask you for a host name or IP on which to run the 'install' function. Simple and powerful.

Dealing with configurations for different environment types

A common issue I've seen is that different environment types (test, staging, production) need different configurations, both for services such as nginx or apache, and for my own applications. One solution I found is to prefix my environment names with their types (tst for testing, stg for staging and prd for production), and to add a suffix to the configuration files, for example nginx.conf.tst for the testing environment, nginx.conf.stg for staging and nginx.conf.prd for production.

Then, in my fabfiles, I automatically send the correct configuration file over, based on the environment I specify on the command line. For example, I have this function in my fab_nginx.py fabfile:

...this will deploy the configuration on the 2 hosts defined inside the tstapp list -- tstapp01 and tstapp02. The deploy_config function will capture 'tst' as the first 3 characters of the current host name, and it will then operate on the nginx.conf.tst file, sending it to the remote server as /usr/local/nginx/conf/nginx.conf.
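The file selection logic described above is easy to isolate from the actual file transfer. Here is a minimal sketch of just that part in plain Python, with no Fabric calls. The function name `config_for_host` and the `ENV_TYPES` tuple are assumptions for this illustration, not the code from my fabfile.

```python
ENV_TYPES = ("tst", "stg", "prd")


def config_for_host(host, base="nginx.conf"):
    """Pick the local config file based on the first 3 characters of the host name."""
    env_type = host[:3]
    if env_type not in ENV_TYPES:
        # Refuse to guess: sending the wrong file is worse than failing.
        raise ValueError("unknown environment type in host name %r" % host)
    return "%s.%s" % (base, env_type)


if __name__ == "__main__":
    for host in ("tstapp01", "stgapp01", "prdapp01"):
        print(host, "->", config_for_host(host))
```

Failing loudly on an unrecognized prefix is a deliberate choice in this sketch, so that a host with an unexpected name never receives a configuration meant for another environment.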

I find that it is a good practice to not have local files called nginx.conf, because the risk of overwriting the wrong file in the wrong environment is increased. Instead, I keep 3 files around -- nginx.conf.tst, nginx.conf.stg and nginx.conf.prd -- and I copy them to remote servers accordingly.

Defining short functions inside a fabfile

My fabfiles are composed of short functions that can be called inside a longer function, or on their own. This gives me the flexibility to only deploy configuration files for example, or only install base packages, or only restart a service.

Automated deployments ensure repeatability

I may repeat myself here, but it is worth rehashing this point: in my view, the best feature of an automated deployment system is that it transforms your deployments from an ad-hoc, off-the-cuff procedure into a repeatable and maintainable process that both development and operations teams can use (#devops anybody?). An added benefit is that you get good documentation for installing your application and its requirements for free. Just copy and paste your fabfiles (or Puppet manifests) into a wiki page and there you have it.

Wednesday, March 03, 2010

I've been immersed in the world of automated deployment systems for quite a while. Because I like Python, I've been using Fabric, but I also dabbled in Puppet. When people are asked about alternatives to Puppet in the Python world, many mention Fabric, but in fact these two systems are very different. Their main difference is the topic of this blog post.

Fabric is what I consider a 'push' automated deployment system: you install Fabric on a server, and from there you push deployments by running remote commands via ssh on a set of servers. In the Ruby world, an example of a push system is Capistrano.

The main advantages of a 'push' system are:

control: everything is synchronous, and under your control. You can see right away if something went wrong, and you can correct it immediately.

simplicity: in the case of Fabric, a 'fabfile' is just a collection of Python functions that copy files over to a remote server and execute commands over ssh on that server; it's all very easy to set up and run

The main disadvantages of a 'push' system are:

lack of full automation: it's not usually possible to boot a server and have it configure itself without some sort of client/server protocol which push systems don't generally support (see 'pull' systems below for that)

lack of scalability: when you're dealing with hundreds of servers, a push system starts showing its limits, unless it makes heavy use of threading or multi-processing
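As a rough illustration of the threading point, here is a sketch of pushing the same task to many hosts in parallel with a thread pool from the Python standard library. The `push_to_hosts` helper and the `fake_deploy` task are hypothetical stand-ins; in a real push system the task would run remote commands over ssh.

```python
from concurrent.futures import ThreadPoolExecutor


def push_to_hosts(task, hosts, max_workers=10):
    """Run task(host) for every host in parallel.

    Returns a dict mapping each host to the task's return value, or to the
    exception it raised, so one bad host does not hide the other results.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {host: pool.submit(task, host) for host in hosts}
        for host, future in futures.items():
            try:
                results[host] = future.result()
            except Exception as exc:
                results[host] = exc
    return results


if __name__ == "__main__":
    def fake_deploy(host):
        # Stand-in for the real remote work.
        return "deployed to %s" % host

    print(push_to_hosts(fake_deploy, ["tstapp01", "tstapp02"]))
```

The trade-off is that you lose some of the synchronous, one-host-at-a-time control listed above as an advantage, which is why the per-host results are collected rather than printed as they happen.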

Puppet is what I consider a 'pull' automated deployment system (actually to be more precise, it is a configuration management system). In such a system, you have a server which acts as a master, and clients which contact the master to find out what they need to do, thus pulling their configuration information from the master. In Puppet, configuration files are called manifests. They are written in a specific language and they are declarative, i.e. they tell each client what to do, not how to do it. The Puppet client software running on each server knows how to interpret the manifest files and how to translate them into actions specific to the operating system of that server. For example, you specify in your manifest file that you want a user created and you don't need to say 'run the adduser command on server X'. Other examples of 'pull' deployment/configuration management systems are bcfg2 (Python), Chef (Ruby) and slack (Perl). A newcomer in the Python world is a port of Chef called kokki (it looks like it's very much in its infancy still, but I hope the author will continue to actively develop it).

The main advantages of a 'pull' system are:

full automation capabilities: it is possible, and indeed advisable, to fully automate the configuration of a newly booted server using a 'pull' deployment system (for details on how I've done it with Puppet, see this post)

increased scalability: in a 'pull' system, clients contact the server independently of each other, so the system as a whole is more scalable than a 'push' system

The main disadvantages of a 'pull' system are:

proprietary configuration management language: with the notable exception of Chef, which uses pure Ruby for its configuration 'recipes', most other pull systems use their own proprietary way of specifying the configuration to be deployed (Puppet's language looks like a cross between Perl and Ruby, while bcfg2 uses...gasp...XML); this turns out to be a pretty big drawback, because if you're not using the system on a daily basis, you're guaranteed to forget it (as happened to me with Puppet)

scalability is still an issue: unless you deploy several master servers and keep them in sync, that one master will start getting swamped as you add more and more clients and thus will become your bottleneck

My particular preference is to use a 'pull' system for the initial configuration of a server, including all the packages necessary to deploy my application (for example tornado). For the actual application code deployment, I prefer to use a 'push' system, because it gives me more control over how exactly I do the deployment. I can take a server out of the load balancer, deploy, test, then put it back, rinse and repeat.
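The 'take out, deploy, test, put back' loop can be sketched as a small driver function. Everything here is hypothetical: the four callables stand in for whatever your push tool does on each host (for example editing the load balancer configuration and running remote commands), and `rolling_deploy` is a name made up for the example.

```python
def rolling_deploy(hosts, remove_from_lb, deploy, smoke_test, add_to_lb):
    """Deploy to one host at a time, keeping it out of rotation while it changes.

    Stops at the first host whose smoke test fails, leaving that host out of
    the load balancer so the remaining hosts keep serving the old version.
    """
    done = []
    for host in hosts:
        remove_from_lb(host)
        deploy(host)
        if not smoke_test(host):
            raise RuntimeError(
                "smoke test failed on %s; host left out of the load balancer" % host
            )
        add_to_lb(host)
        done.append(host)
    return done
```

Stopping at the first failure is the conservative choice for this sketch: a bad build reaches at most one server before a human gets to look at it.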

In discussions with Holger Krekel at PyCon, I realized that execnet might be a good replacement for Fabric for my needs. It already provides remote command execution via ssh, and an rsync-like file transfer protocol. All it needs is a small library of functions on top to do common system administration tasks such as running commands as sudo, etc. I also want to look into kokki as a replacement for Puppet in my deployment architecture.

A parting thought: my colleague Dan Mesh suggested using a queuing mechanism for the client-server protocol in a 'pull' system. In fact, I am becoming more and more convinced that as far as scalability is concerned, when in doubt, use a queuing mechanism. In this deployment architecture, the master would post tasks to be done by a specific client to a central queue. The client would check the queue periodically for a task assigned to it, execute it, then send a report back to the server when done. Of course, you need to worry about authentication in this scenario, but it seems that it would solve a lot of the scalability issues that both push and pull systems exhibit. Who knows, we may build it at Evite and open source it...so stay tuned ;-)
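To make the queue idea slightly more tangible, here is a toy, in-process model of it in Python. It is only a sketch of the flow (master posts, client polls, client reports); the `TaskBoard` class and `client_tick` function are invented for this example, and a real system would need a network-accessible queue plus the authentication mentioned above.

```python
import queue
from collections import defaultdict


class TaskBoard:
    """In-memory stand-in for a central queue shared by a master and its clients."""

    def __init__(self):
        self._tasks = defaultdict(queue.Queue)  # one FIFO queue per client name
        self.reports = []                       # (client, task, status) tuples

    def post(self, client, task):
        """Master side: queue a task for one specific client."""
        self._tasks[client].put(task)

    def poll(self, client):
        """Client side: return the next task for this client, or None."""
        try:
            return self._tasks[client].get_nowait()
        except queue.Empty:
            return None

    def report(self, client, task, status):
        """Client side: tell the master how the task went."""
        self.reports.append((client, task, status))


def client_tick(board, client, execute):
    """One polling cycle: fetch a task if there is one, run it, report back."""
    task = board.poll(client)
    if task is None:
        return False
    try:
        execute(task)
        status = "ok"
    except Exception as exc:
        status = "failed: %s" % exc
    board.report(client, task, status)
    return True
```

Each client would call something like `client_tick` on a timer. Because clients only ever talk to the queue, the master never has to hold a connection open to hundreds of servers at once.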

Tuesday, March 02, 2010

This is a note for myself, but maybe it will be useful to other people too.

I've been using Fabric version 1.0a lately, and it's been working very well, with an important exception: when launching remote processes that get daemonized, the 'run' Fabric command which launches those processes hangs, and needs to be forcefully killed on the server where I run the 'fab' commands.

I remembered vaguely reading on the Fabric mailing list something about the ssh channel not being closed properly, so I hacked the source code (operations.py) to close the channel before waiting for the stdout/stderr capture threads to exit.