Tuesday, December 21, 2010

More and more organizations are moving to ‘the cloud’ these days. In most cases, using ‘the cloud’ means buying compute and storage capacity from a public cloud vendor such as Amazon, Rackspace, GoGrid, Linode, etc. I believe that the next step in cloud usage will be deploying instances across multiple cloud providers, mainly for high availability, but also for performance reasons (for example if a specific provider has a presence in a geographical region closer to your user base).

All cloud vendors offer APIs for accessing their services -- if they don’t, they’re not a genuine cloud vendor in my book at least. The onus is on you as a system administrator to learn how to use these APIs, which can vary wildly from one provider to another. Enter libcloud, a Python-based package that offers a unified interface to various cloud provider APIs. The list of supported vendors is impressive, and more are added all the time. Libcloud was started by Cloudkick but has since migrated to the Apache Foundation as an Incubator Project.

One thing to note is that libcloud goes for breadth at the expense of depth, in that it only supports a subset of the available provider APIs -- things such as creating, rebooting, destroying an instance, and listing all instances. If you need to go in-depth with a given provider’s API, you need to use other libraries that cover all or at least a large portion of the functionality exposed by the API. Examples of such libraries are boto for Amazon EC2 and python-cloudservers for Rackspace.

Introducing libcloud

The current stable version of libcloud is 0.4.0. You can install it from PyPI via

# easy_install apache-libcloud

The main concepts of libcloud are providers, drivers, images, sizes and locations.

A provider is a cloud vendor such as Amazon EC2 and Rackspace. Note that currently each EC2 region (US East, US West, EU West, Asia-Pacific Southeast) is exposed as a different provider, although they may be unified in the future.

The common operations supported by libcloud are exposed for each provider through a driver. If you want to add another provider, you need to create a new driver and implement the interface common to all providers (in the Python code, this is done by subclassing a base NodeDriver class and overriding/adding methods appropriately, according to the specific needs of the provider).

Images are provider-dependent, and generally represent the OS flavors available for deployment for a given provider. In EC2-speak, they are equivalent to an AMI.

Sizes are provider-dependent, and represent the amount of compute, storage and network capacity that a given instance will use when deployed. The more capacity, the more you pay and the happier the provider.

Locations correspond to geographical data center locations available for a given provider; however, they are not very well represented in libcloud. For example, in the case of Amazon EC2, they currently map to EC2 regions rather than EC2 availability zones. However, this will change in the near future (as I will describe below, proper EC2 availability zone management is being implemented). As another example, Rackspace is represented in libcloud as a single location, listed currently as DFW1; however, your instances will get deployed at a data center determined at your Rackspace account creation time (thanks to Paul Querna for clarifying this aspect).

Managing instances with libcloud

Getting a connection to a provider via a driver

All the interactions with a given cloud provider happen in libcloud across a connection obtained via the driver for that provider. Here is the canonical code snippet for that, taking EC2 as an example:

Once you get a connection, you can call a variety of informational methods on that connection, for example list_images, which returns a list of NodeImage objects. Be prepared for this call to take a while, especially in Amazon EC2, which in the US East region returns no less than 6,932 images currently. Here is a code snippet that prints the number of available images, and the first 5 images returned in the list:
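
A minimal sketch of both steps (getting a connection, then listing images) could look like the code below. It assumes the libcloud 0.4.x module layout (libcloud.types and libcloud.providers; later releases moved these under libcloud.compute). The helper names get_ec2_connection and summarize_images are made up for illustration, and the credentials are placeholders.

```python
def get_ec2_connection(access_id, secret_key):
    """Return a driver instance (the "connection") for EC2 US East."""
    # Imported lazily so summarize_images() can be used without libcloud.
    # Module paths are those of libcloud 0.4.x (an assumption for other versions).
    from libcloud.types import Provider
    from libcloud.providers import get_driver

    driver_cls = get_driver(Provider.EC2)
    return driver_cls(access_id, secret_key)


def summarize_images(conn, limit=5):
    """Return (total image count, first `limit` images) for a connection."""
    images = conn.list_images()  # can be slow on EC2: thousands of AMIs
    return len(images), images[:limit]


# Usage (not executed here; needs libcloud and real credentials):
#   conn = get_ec2_connection("your access ID", "your secret key")
#   total, first = summarize_images(conn)
#   print(total)
#   for image in first:
#       print(image)
```

Keeping the listing logic separate from the connection setup means the same helper works unchanged against a Rackspace connection.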

Note that a NodeImage object for a given provider may have provider-specific information stored in most cases in a variable called ‘extra’. It pays to inspect the NodeImage objects by printing their __dict__ member variable. Here is an example for EC2:

As I mentioned before, locations are somewhat ambiguous currently in libcloud.

For example, when you call list_locations on a connection to the EC2 provider (which represents the EC2 US East region), you get information about the region and not about the availability zones (AZs) included in that region:

However, there is a patch sent by Tomaž Muraus to the libcloud mailing list which adds support for EC2 availability zones. For example, the US East region has 4 AZs: us-east-1a, us-east-1b, us-east-1c, us-east-1d. These AZs should be represented by libcloud locations, and indeed the code with the patch applied shows just that:

(Update 02/24/11: The patch did make it into the latest libcloud release, which is 0.4.2 at this time.)

If you run list_locations on a Rackspace connection, you get back DFW1, even though your instances may actually get deployed at a different data center. Hopefully this too will be fixed soon in libcloud:

The API call for launching an instance with libcloud is create_node. It has 3 required parameters: a name for your new instance, a NodeImage and a NodeSize. You can also specify a NodeLocation (if you don’t, the default location for that provider will be used).

EC2 node creation example

A given provider driver may accept other parameters to the create_node call. For example, EC2 accepts an ex_keyname argument for specifying the EC2 key you want to use when creating the instance.

Note that to create a node, you have to know what image and what size you want to use for that node. The code snippets I showed above for retrieving the images and sizes available for a given provider come in handy here. You can either retrieve the full list and iterate through it until you find your desired image and size (either by name or by id), or you can construct NodeImage and NodeSize objects from scratch, based on the desired id.

Example of a NodeImage object for EC2 corresponding to a specific AMI:

image = NodeImage(id="ami-014da868", name="", driver="")

Example of a NodeSize object for EC2 corresponding to an m1.small instance size:

Note that in both examples the only parameter that needs to be set is the id, but all the other parameters need to be present in the call, even if they are set to None or the empty string.
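
Both approaches can be sketched as follows. The find_by_id and build_ec2_image_and_size helpers are hypothetical names, and the NodeSize argument list (id, name, ram, disk, bandwidth, price, driver) and the libcloud.base import path are what I believe libcloud 0.4.x uses, so treat them as assumptions if you are on another version.

```python
def find_by_id(items, wanted_id):
    """Return the first object whose `id` attribute equals wanted_id, else None."""
    for item in items:
        if item.id == wanted_id:
            return item
    return None


def build_ec2_image_and_size(ami_id="ami-014da868", size_id="m1.small"):
    """Construct NodeImage/NodeSize from scratch; only the id carries meaning."""
    # libcloud 0.4.x location of these classes (assumption for other versions).
    from libcloud.base import NodeImage, NodeSize

    image = NodeImage(id=ami_id, name="", driver="")
    # Every positional parameter must be present, even if it is None or "".
    size = NodeSize(id=size_id, name="", ram=None, disk=None,
                    bandwidth=None, price=None, driver="")
    return image, size


# Usage (not executed here):
#   size = find_by_id(conn.list_sizes(), "m1.small")
#   image, size = build_ec2_image_and_size()
```

Looking objects up in the provider's list costs an API call but gives you fully populated objects; building them by hand is faster when you already know the ids.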

In the case of EC2, for the instance to be actually usable via ssh, you also need to pass the ex_keyname parameter and set it to a keypair name that exists in your EC2 account for that region. Libcloud provides a way to create or import a keypair programmatically. Here is a code snippet that creates a keypair via the ex_create_keypair call (specific to the libcloud EC2 driver), then saves the private key in a file in /root/.ssh on the machine running the code:
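
As a rough sketch of that step: the helper below calls ex_create_keypair and writes the private key to a file. It assumes the call returns a dict holding the private key under the 'keyMaterial' key (which is how I understand the EC2 driver to behave); the save_keypair name and the .pem suffix are my own choices.

```python
import os


def save_keypair(conn, key_name, directory):
    """Create an EC2 keypair and write its private key to <directory>/<key_name>.pem."""
    # ex_create_keypair is specific to the libcloud EC2 driver; this sketch
    # assumes the private key comes back under the 'keyMaterial' dict key.
    response = conn.ex_create_keypair(name=key_name)
    path = os.path.join(directory, key_name + ".pem")
    # Create the file readable by its owner only; ssh rejects keys that are
    # readable by other users.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(response["keyMaterial"])
    return path


# Usage (not executed here):
#   save_keypair(conn, "k1", "/root/.ssh")
```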

You can also pass the name of an EC2 security group to create_node via the ex_securitygroup parameter. Libcloud also allows you to create security groups programmatically by means of the ex_create_security_group method specific to the libcloud EC2 driver.

Now, armed with the NodeImage and NodeSize objects constructed above, as well as the keypair name, we can launch an instance in EC2:

Note that we didn’t specify any location, so we have no control over the availability zone where the instance will be created. With Tomaž’s patch we can actually get a location corresponding to our desired availability zone, then launch the instance in that zone. Here is an example for us-east-1b:
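
A hedged sketch of the zone-aware launch: the code assumes that, with the patch applied, each EC2 location object carries an availability_zone attribute whose name is the zone string (for example us-east-1b). That attribute name is my assumption, so check it by printing location.__dict__ on your version. The helper names are hypothetical.

```python
def pick_location(conn, zone_name):
    """Return the location whose availability zone is named zone_name, else None."""
    for location in conn.list_locations():
        # Assumption: patched EC2 locations expose `availability_zone.name`.
        zone = getattr(location, "availability_zone", None)
        if zone is not None and zone.name == zone_name:
            return location
    return None


def launch_in_zone(conn, name, image, size, key_name, zone_name):
    """Launch a node in a specific availability zone via create_node."""
    location = pick_location(conn, zone_name)
    if location is None:
        raise ValueError("no location found for zone %r" % zone_name)
    return conn.create_node(name=name, image=image, size=size,
                            location=location, ex_keyname=key_name)


# Usage (not executed here):
#   node = launch_in_zone(conn, "web01", image, size, "k1", "us-east-1b")
```

Dropping the location argument from create_node gives the earlier behavior, where the provider picks the zone.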

Once the node is created, you can call the list_nodes method on the connection object and inspect the current status of the node, along with other information about that node. In EC2, a new instance is initially shown with a status of ‘pending’. Once the status changes to ‘running’, you can ssh into that instance using the private key created above.
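
Waiting for that status change can be sketched as a simple polling loop. The wait_for_state helper is hypothetical; in real use you would pass NodeState.RUNNING (from libcloud.types in 0.4.x) as wanted_state, and the timeout and interval values are arbitrary.

```python
import time


def wait_for_state(conn, node_id, wanted_state, timeout=300, interval=10,
                   sleep=time.sleep):
    """Poll list_nodes() until the node with node_id reaches wanted_state."""
    waited = 0
    while True:
        for node in conn.list_nodes():
            if node.id == node_id and node.state == wanted_state:
                return node
        if waited >= timeout:
            raise RuntimeError("node %s did not reach the wanted state" % node_id)
        sleep(interval)
        waited += interval


# Usage (not executed here):
#   from libcloud.types import NodeState
#   node = wait_for_state(conn, node.id, NodeState.RUNNING)
```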

Printing node.__dict__ for a newly created instance shows it with ‘pending’ status:

Note also that the ‘extra’ member variable of the node object shows a wealth of information specific to EC2 -- things such as security group, AMI id, kernel id, availability zone, private and public DNS names, etc. Another interesting thing to note is that the name member variable of the node object is now set to the EC2 instance id, thus guaranteeing uniqueness of names across EC2 node objects.

At this point (assuming the machine where you run the libcloud code is allowed ssh access into the default EC2 security group) you should be able to ssh into the newly created instance using the private key corresponding to the keypair you used to create the instance. In my case, I used the k1.pem private file created via ex_create_keypair and I ssh-ed into the private IP address of the new instance, because I was already on an EC2 instance in the same availability zone:

# ssh -i ~/.ssh/k1.pem domU-12-31-39-04-65-11.compute-1.internal

Rackspace node creation example

Here is another example of calling create_node, this time using Rackspace as the provider. Before I ran this code, I already called list_images and list_sizes on the Rackspace connection object, so I know that I want the NodeImage with id 71 (which happens to be Fedora 14) and the NodeSize with id 1 (the smallest one). The code snippet below will create the node using the image and the size I specify, with a name that I also specify (this name needs to be different for each call of create_node):

Note that the name variable of the node object was set to the name we specified in the create_node call. You don’t log in with a key (at least initially) to a Rackspace node, but instead you’re given a password you can use to log in as root to the public IP that is also returned in the node information:

Once you have a list of nodes in a given provider, it’s easy to iterate through the list and choose a given node based on its unique name -- which as we’ve seen is the instance id for EC2 and the hostname for Rackspace. Once you identify a node, you can call destroy_node or reboot_node on the connection object to terminate or reboot that node.

Here is a code snippet that performs a destroy_node operation for an EC2 instance with a specific instance id:
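
One way to write such an operation, as a sketch rather than the original listing: walk the node list, match on the unique name, and hand the Node object to destroy_node. The destroy_by_name helper and the instance id in the usage comment are made up.

```python
def destroy_by_name(conn, name):
    """Destroy the node whose name matches; return True if one was destroyed."""
    for node in conn.list_nodes():
        if node.name == name:
            # destroy_node takes the Node object itself, not its id.
            return bool(conn.destroy_node(node))
    return False


# Usage (not executed here), with a hypothetical EC2 instance id:
#   destroy_by_name(conn, "i-f692ae9b")
```

The same pattern works for reboot_node, and for Rackspace if you pass the hostname as the name.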

I would be remiss if I didn’t mention a new but very promising project started by Miquel Torres: Overmind. The goal of Overmind is to be a complete server provisioning and configuration management system. For the server provisioning portion, Overmind uses libcloud, while also offering a Django-based Web interface for managing providers and nodes. EC2 and Rackspace are supported currently, but it should be easy to add new providers. If you are interested in trying out Overmind and contributing code or tests, please send a message to the overmind-dev mailing list. Next versions of Overmind aim to add configuration management capabilities using Opscode Chef.

Friday, December 10, 2010

Here's a short Fabric script which might be useful to people who need to stripe EBS volumes in Amazon EC2. Striping is recommended if you want to improve the I/O of your EBS-based volumes. However, striping won't help if one of the member EBS volumes goes AWOL or suffers performance issues. In any case, here's the Fabric script:

I assume that you already created and attached 4 EBS volumes to your instance with device names /dev/sdd through /dev/sdg; if your device names or volume count are different, modify the DEVICES list appropriately

The size of your target RAID0 volume is set in the VOL_SIZE variable

the helper functions are pretty self-explanatory:

we use mdadm to create a RAID0 device called /dev/md0; we also set the block size to 64 KB via the blockdev call

we create a physical LVM volume on /dev/md0

we create a volume group called vgm0 on /dev/md0

we create a logical LVM volume called lvm0 of size VOL_SIZE, inside the vgm0 group

we format the logical volume as XFS, then we mount it and also modify /etc/fstab
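
The steps above can be sketched as a function that builds the shell commands in order; each string could then be handed to Fabric's sudo(). This is not the original script. The mount point, the VOL_SIZE value and the exact flags are assumptions, and in particular I read "64 KB via blockdev" as a read-ahead of 128 sectors of 512 bytes, which may not be what the original call did.

```python
def build_stripe_commands(devices, vol_size, mount_point,
                          md_dev="/dev/md0", vg="vgm0", lv="lvm0"):
    """Return the shell commands for the RAID0 + LVM + XFS steps, in order."""
    lv_path = "/dev/%s/%s" % (vg, lv)
    return [
        # RAID0 across all member EBS volumes.
        "mdadm --create %s --level=0 --raid-devices=%d %s"
        % (md_dev, len(devices), " ".join(devices)),
        # Assumption: 64 KB expressed as read-ahead (128 sectors x 512 bytes).
        "blockdev --setra 128 %s" % md_dev,
        "pvcreate %s" % md_dev,
        "vgcreate %s %s" % (vg, md_dev),
        "lvcreate --name %s --size %s %s" % (lv, vol_size, vg),
        "mkfs.xfs %s" % lv_path,
        "mkdir -p %s" % mount_point,
        "mount %s %s" % (lv_path, mount_point),
        "echo '%s %s xfs defaults 0 0' >> /etc/fstab" % (lv_path, mount_point),
    ]


DEVICES = ["/dev/sdd", "/dev/sde", "/dev/sdf", "/dev/sdg"]
VOL_SIZE = "400G"  # hypothetical size

# Usage inside a fabfile (not executed here):
#   for command in build_stripe_commands(DEVICES, VOL_SIZE, "/var/lib/mysql"):
#       sudo(command)
```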

Tuesday, November 30, 2010

It took me a while to really get how to use Chef attributes. It's fairly easy to understand what they are and where they are referenced in recipes, but it's not so clear from the documentation how and where to override them. Here's a quick note to clarify this.

A Chef attribute can be seen as a variable that:

1) gets initialized to a default value in cookbooks/mycookbook/attributes/default.rb

2) gets used in cookbook recipes such as cookbooks/mycookbook/recipes/default.rb or any other myrecipefile.rb in the recipes directory; the syntax for using the attribute's value is of the form #{node[:mycookbook][:attribute_name]}

Example of using the haproxy_version attribute in a recipe called haproxy.rb:

Here I override the default swapfilesize value (an attribute from the mycookbook cookbook) of 10 GB and set it to 4 GB. I also override the default memcached memory value (an attribute from the Opscode memcached cookbook) of 128 MB and set it to 4 GB.

If you want to override some attributes at the node level instead, you can do this in the chef.json file on the node. For example, to set the swap size to 2 GB:

{

"run_list": [ "role[base]", "role[appserver]" ],

"mycookbook": { "swapfilesize": "2097152" }

}

Another question is when you should use Chef attributes. My own observation is that wherever I use hardcoded values in my recipes, it's almost always better to use an attribute instead, and set the default value of the attribute to that hardcoded value, which can then be overridden as needed.

Tuesday, November 16, 2010

Step 0. If you're fortunate enough to participate in the design of your infrastructure (as opposed to being thrown in at the deep end and having to maintain some 'legacy' one), then try to aim for horizontal scalability. It's easier to scale out than to scale up, and failures in this mode will hopefully impact a smaller percentage of your users.

Step 1. Configure a good monitoring and alerting system

This is the single most important thing you need to do for your infrastructure. It's also a great way to learn a new infrastructure that you need to maintain.

I talked about different types of monitoring in another blog post. My preferred approach is to have 2 monitoring systems in place:

an internal monitoring system which I use to check the health of individual servers/devices

an external monitoring system used to check the behavior of the application/web site as a regular user would.

My preferred internal monitoring/alerting system is Nagios, but tools like Zabbix, OpenNMS, Zenoss, Monit etc. would definitely also do the job. I like Nagios because there is a wide variety of plugins already available (such as the extremely useful check_mysql_health plugin) and also because it's very easy to write custom plugins for your specific application needs. It's also relatively easy to generate Nagios configuration files automatically.

For external monitoring I use a combination of Pingdom and Akamai alerts. Pingdom runs checks against certain URLs within our application, whereas Akamai alerts us whenever the percentage of HTTP error codes returned by our application is greater than a certain threshold.

I'll talk more about correlating internal and external alerts below.

Step 2. Configure a good resource graphing system

This is the second most important thing you need to do for your infrastructure. It gives you visibility into how your system resources are used. If you don't have this visibility, it's very hard to do proper capacity planning. It's also hard to correlate monitoring alerts with resource limits you might have reached.

I use both Munin and Ganglia for resource graphing. I like the graphs that Munin produces and also some of the plugins that are available (such as the munin-mysql plugin), and I also like Ganglia's nice graph aggregation feature, which allows me to watch the same system resource across a cluster of nodes. Munin has this feature too, but Ganglia was designed from the get-go to work on clusters of machines.

Step 3. Dashboards, dashboards, dashboards

Everybody knows that whoever has the most dashboards wins. I am talking here about application-specific metrics that you want to track over time. I gave an example before of a dashboard I built for visualizing the outgoing email count through our system.

It's very easy to build such a dashboard with the Google Visualization API, so there's really no excuse for not having charts for critical metrics of your infrastructure. We use queuing a lot internally at Evite, so we have dashboards for tracking various queue sizes. We also track application errors from nginx logs and chart them in various ways: by server, by error code, by URL, aggregated, etc.

Dashboards offered by external monitoring tools such as Pingdom/Keynote/Gomez/Akamai are also very useful. They typically chart uptime and response time for various pages, and edge/origin HTTP traffic in the case of Akamai.

Step 4. Correlate errors with resource state and capacity

The combination of internal and external monitoring, resource charting and application dashboards is very powerful. As a rule, whenever you have an external alert firing off, you should have one or more internal ones firing off too. If you don't, then you don't have sufficient internal alerts, so you need to work on that aspect of your monitoring.

Once you do have external and internal alerts firing off in unison, you will be able to correlate external issues (such as increased percentages of HTTP error codes, or timeouts in certain application URLs) with server capacity issues/bottlenecks within your infrastructure. Of course, the fact that you are charting resources over time, and that you have a baseline to go from, will help you quickly identify outliers such as spikes in CPU usage or drops in Akamai traffic.

A typical work day for me starts with me opening a few tabs in my browser: the Nagios overview page, the Munin overview, the Ganglia overview, the Akamai HTTP content delivery dashboard, and various application-specific dashboards.

Let's say I get an alert from Akamai that the percentage of HTTP 500 error codes is over 1%. I start by checking the resource graphs for our database servers. I look at in-depth MySQL metrics in Munin, and at CPU metrics (especially CPU I/O wait time) in Ganglia. If nothing is out of the ordinary, I look at our various application services (our application consists of a multitude of RESTful Web services). The nginx log dashboard may show increased HTTP 500 errors from a particular server, or it may show an increase in such errors across the board. This may point to insufficient capacity at our services layer. Time to deploy more services on servers with enough CPU/RAM capacity. I know which servers those are, because I keep tabs on them with Munin and Ganglia.

As another example, I know that if the CPU I/O wait on my database servers approaches 30%, the servers will start huffing and puffing, and I'll see an increased number of slow queries. In this case, it's time to either identify queries to be optimized, or reduce the number of queries to the database -- or if everything else fails, time to add more database servers. (BTW, if you haven't yet read @allspaw's book "The Art of Capacity Planning", add reading it as a task for Step 0)

My point is that all these alerts and metric graphs are interconnected, and without looking at all of them, you're flying blind.

Step 5. Expect failures and recover quickly and gracefully

It's not a question of whether failures will happen, it's WHEN they will happen. When they do happen, you need to be prepared. Hopefully you designed your infrastructure in a way that allows you to bounce back quickly from failures. If a database server goes down, hopefully you have a slave that you can quickly promote to a master, or even better you have another passive master ready to become the active server. Even better, maybe you have a fancy self-healing distributed database -- kudos to you then ;-)

One thing that you can do here is to have various knobs that turn on and off certain features or pieces of functionality within your application (again, John Allspaw has some blog posts and presentations on that from his days at Flickr and his current role at Etsy). These knobs allow you to survive an application server outage, or even (God forbid) a database outage, while still being able to present *something* to your end-users.

To quickly bounce back from a server failure, I recommend you use automated deployment and configuration management tools such as Chef, Puppet, Fabric, etc. (see some posts of mine on this topic). I personally use a combination of Chef (to bootstrap a new machine and do things such as file system layout, prerequisite installation, etc.) and Fabric (to actually deploy the application code).

Update #1:

Comment on Twitter:

@ericholscher:

"@griggheo Good stuff. Any thoughts on figuring out symptom's from causes? eg. load balancer is having issues which causes db load to drop?"

Good question, and something similar actually has happened to us. To me, it's a matter of knowing your baseline graphs. In our case, whenever I see an Akamai traffic drop, it's usually correlated to an increase in the percentage of HTTP 500 errors returned by our Web services. If I also see DB traffic dropping, then I know the bottleneck is at the application services layer. If the DB traffic is increasing, then the bottleneck is most likely the DB. Depending on the bottleneck, we need to add capacity at that layer, or to optimize code or DB queries at the respective layer.

Main accomplishment of this post

I'm most proud of the fact that I haven't used the following words in the post above: 'devops', 'cloud', 'noSQL' and 'agile'.

Thursday, October 28, 2010

In an earlier blog post I was advising people to use HAProxy 1.4 and above if they need MySQL load balancing with health checks. It turns out that I didn't have much luck with that solution either. HAProxy shines when it load balances HTTP traffic, and its health checks are really meant to be run over HTTP and not plain TCP. So the solution I found was to have a small HTTP Web service (which I wrote using tornado) listening on a configurable port on each MySQL node.

For the health check, the Web service connects via MySQLdb to the MySQL instance running on a given port and issues a 'show databases' command. For more in-depth checking you can obviously run fancier SQL statements.

The code for my small tornado server is here. The default port it listens on is 31337.
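
For readers who cannot follow the link, here is a minimal sketch of such a handler. It is not the code referenced above: the helper names, the 503 status, and the credentials in the usage comment are my own assumptions. The check itself is kept in a plain function so it can be exercised without tornado or MySQLdb installed.

```python
def mysql_is_healthy(connect, port):
    """Return True if 'show databases' succeeds against MySQL on `port`."""
    try:
        conn = connect(port)
        try:
            cursor = conn.cursor()
            cursor.execute("show databases")
            cursor.fetchall()
        finally:
            conn.close()
        return True
    except Exception:
        return False


def make_app(user, passwd, host="127.0.0.1"):
    """Build a tornado application exposing /mysqlchk/?port=<mysql port>."""
    # Imported lazily: tornado and MySQLdb are third-party packages.
    import MySQLdb
    import tornado.web

    def connect(port):
        return MySQLdb.connect(host=host, port=port, user=user, passwd=passwd)

    class MySQLCheckHandler(tornado.web.RequestHandler):
        def get(self):
            port = int(self.get_argument("port", "3306"))
            if mysql_is_healthy(connect, port):
                self.write("OK")
            else:
                # An HTTP error status makes HAProxy count the check as failed.
                raise tornado.web.HTTPError(503)

    return tornado.web.Application([(r"/mysqlchk/", MySQLCheckHandler)])


# Usage (not executed here), with placeholder credentials:
#   import tornado.ioloop
#   make_app("monitor", "secret").listen(31337)
#   tornado.ioloop.IOLoop.current().start()
```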

Now on the HAProxy side I have a "listen" section for each collection of MySQL nodes that I want to load balance. Example:

In this case, HAProxy listens on port 33306 and load balances MySQL traffic between db101 and db201, with db101 being the primary node and db201 being the backup node (which means that traffic only goes to db101 unless it's considered down by the health check, in which case traffic is directed to db201). This scenario is especially useful when db101 and db201 are in a master-master replication setup, and you want traffic to hit only 1 of them at any given time. Note also that I could have had HAProxy listen on port 3306, but I preferred to have it listen and be contacted by the application on port 33306, in case I also wanted to run a MySQL server in port 3306 on the same server as HAProxy.

I specify how to call the HTTP check handler via "option httpchk GET /mysqlchk/?port=3306". I specify the port the handler listens on via the "port" option in the "server" line. In my case the port is 31337. So HAProxy will do a GET against http://10.10.10.1:31337/mysqlchk/?port=3306. If the result is an HTTP error code, the health check will be considered failed.
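
Putting the pieces described so far together, a "listen" section along these lines would match the description. This is a reconstruction, not the original config: the section name, the address 10.10.10.2 for db201, and the mode/balance lines are assumptions, while the ports, the check URL and 10.10.10.1 come from the text.

```
listen mysql-db 0.0.0.0:33306
    mode tcp
    balance roundrobin
    option httpchk GET /mysqlchk/?port=3306
    server db101 10.10.10.1:3306 check port 31337 inter 5000 rise 3 fall 3
    server db201 10.10.10.2:3306 check port 31337 inter 5000 rise 3 fall 3 backup
```

The "backup" keyword on db201 is what keeps traffic on db101 until db101 fails its health checks.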

The other options "inter 5000 rise 3 fall 3" mean that the health check is issued by HAProxy every 5,000 ms, and that the health check needs to succeed 3 times ("rise 3") in order for the node to be considered up, and it needs to fail 3 times ("fall 3") in order for the node to be considered down.
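
The rise/fall logic is easy to picture as a small state machine. The class below is a simplified model of the behavior just described, not HAProxy's actual implementation, and the HealthState name is made up.

```python
class HealthState(object):
    """Track up/down state from consecutive check results (rise/fall counting)."""

    def __init__(self, rise=3, fall=3, up=True):
        self.rise = rise
        self.fall = fall
        self.up = up
        self._streak = 0  # consecutive results contradicting the current state

    def record(self, success):
        """Feed one check result; return the resulting up/down state."""
        if success == self.up:
            self._streak = 0  # result agrees with the current state
        else:
            self._streak += 1
            threshold = self.fall if self.up else self.rise
            if self._streak >= threshold:
                self.up = not self.up
                self._streak = 0
        return self.up
```

With checks every 5,000 ms and "fall 3", a node that dies right after a successful check is taken out of rotation after roughly 15 seconds.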

I hasten to add that the master-master load balancing has its disadvantages. It did save my butt one Sunday morning when db101 went down hard (after all, it was an EC2 instance), and traffic was directed by HAProxy to db201 in a totally transparent fashion to the application.

But....I have also seen the situation where db201, as a slave to db101, lagged in its replication, and so when db101 was considered down and traffic was sent to db201, the state of the data was stale from an application point of view. I consider this disadvantage to weigh more than the automatic failover advantage, so I actually ended up taking db201 out of HAProxy. If db101 ever goes down hard again, I'll just manually point HAProxy to db201, after making sure the state of the data on db201 is what I expect.

So all this being said, I recommend the automated failover scenario only when you load balance against a read-only farm of MySQL servers, which are all probably slaves of some master. In this case, although reads can also get out of sync, at least you won't attempt to do creates/updates/deletes against stale data.

The sad truth is that there is no good way of doing automated load balancing AND failover with MySQL without resorting to things such as DRBD which are not cloud-friendly. I am aware of Yves Trudeau's blog posts on "High availability for MySQL on Amazon EC2" but the setup he describes strikes me as experimental and I wouldn't trust it in a large-scale production setup.

In any case, I hope somebody will find the tornado handler I wrote useful for their own MySQL health checks, or actually any TCP-based health check they need to do within HAProxy.

Thursday, October 14, 2010

Overmind is the brainchild of Miquel Torres. In its current version, released today, Overmind is what is sometimes called a 'controller fabric' for managing cloud instances, based on libcloud. However, Miquel's Roadmap for the project is very ambitious, and includes things like automated configuration management and monitoring for the instances launched and managed via Overmind.

A little bit of history: Miquel contacted me via email in late July because he read my blog post on "Automated deployment systems: push vs. pull" and he was interested in collaborating on a queue-based deployment/config management system. The first step in such a system is to actually deploy the instances you need configured. Hence the need for something like Overmind.

I'm sure you're asking yourself -- why did these guys want to roll their own system? Why not use something like OpenStack? Note that in late July OpenStack had only just been announced, and to this day (mid-October 2010) they have yet to release their controller fabric code. In the meantime, we have a pretty functional version of a deployment tool in Overmind, supporting Amazon EC2 and Rackspace, with a Django Web interface, and also with a REST API interface.

I am aware there are many other choices out there in terms of managing and deploying cloud instances -- Cloudkick, RightScale, Scalarium ...and the list goes on. The problem is that none of these is Open Source. They do have great ideas though that we can steal ;-)

I am also aware of Ruby-based tools such as Marionette Collective and its close integration with Puppet (which is now even closer since it has been acquired by Puppet Labs). The problem is that it's Ruby and not Python ;-)

In short, what Overmind brings to the table today is a Python-based, Django-based, libcloud-based tool for deploying (and destroying, but be careful out there) cloud instances. For the next release, Miquel and I are planning to add some configuration management capabilities. We're looking at kokki as a very interesting Python-based alternative to chef, although we're planning on supporting chef-solo too.

If you're interested in contributing to the project, please do! Miquel is an amazingly talented, focused and relentless developer, but he can definitely use more help (my contributions have been minimal in terms of actual code; I mostly tested Miquel's code and did some design and documentation work, especially in the REST API area).

Monday, September 27, 2010

Ever since Vladimir Vuksan pointed me to his Ganglia script for getting detailed disk stats, I've been looking for something similar for Munin. The iostat and iostat_ios Munin plugins, which are enabled by default when you install Munin, do show disk stats across all devices detected on the system. I wanted more in-depth stats per device though. In my case, the devices I'm interested in are actually Amazon EBS volumes mounted on my database servers.

I finally figured out how to achieve this, using the diskstat_ Munin plugin which gets installed by default when you install munin-node.

If you run

/usr/share/munin/plugins/diskstat_ suggest

you will see the various symlinks you can create for the devices available on your server.

In my case, I have 2 EBS volumes on each of my database servers, mounted as /dev/sdm and /dev/sdn. I created the following symlinks for /dev/sdm (and similar for /dev/sdn):
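
The original symlink commands are not reproduced here. As a sketch, assuming the plugin follows the usual wildcard naming of diskstat_<mode>_<device> with modes iops, latency and throughput (the output of "suggest" on your server is authoritative), the links could be created like this. The helper names are hypothetical.

```python
import os

PLUGIN = "/usr/share/munin/plugins/diskstat_"


def diskstat_links(device, modes=("iops", "latency", "throughput"),
                   plugin_dir="/etc/munin/plugins"):
    """Return (link_path, target) pairs for one device, e.g. 'sdm'."""
    return [(os.path.join(plugin_dir, "diskstat_%s_%s" % (mode, device)), PLUGIN)
            for mode in modes]


def create_links(pairs):
    """Create the symlinks, skipping any that already exist."""
    for link_path, target in pairs:
        if not os.path.lexists(link_path):
            os.symlink(target, link_path)


# Usage (not executed here; needs root, then restart munin-node):
#   for device in ("sdm", "sdn"):
#       create_links(diskstat_links(device))
```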

My next step is to follow the advice of Mark Seger (the author of collectl) and graph the output of collectl in real time, so that the stats are displayed in fine-grained intervals of 5-10 seconds instead of the 5-minute averages that RRD-based tools offer.

Tuesday, September 21, 2010

I decided to give Ganglia a try to see if I like its metric visualizations and its plugins better than Munin's. I am still in the very early stages of evaluating it. However, I already banged my head against the wall trying to understand how to configure it properly. Here are some quick notes:

1) You can split your servers into clusters for ease of metric aggregation.

2) Each node in a cluster needs to run gmond. In Ubuntu, you can do 'apt-get install ganglia-monitor' to install it. The config file is in /etc/ganglia/gmond.conf. More on the config file in a minute.

3) Each node in a cluster can send its metrics to a designated node via UDP.

4) One server in your infrastructure can be configured as both the overall metric collection server, and as the web front-end. This server needs to run gmetad, which in Ubuntu can be installed via 'apt-get install gmetad'. Its config file is /etc/gmetad.conf.

Note that you can have a tree of gmetad nodes, with the root of the tree configured to actually display the metric graphs. I wanted to keep it simple, so I am running both gmetad and the Web interface on the same node.

5) The gmetad server periodically polls one or more nodes in each cluster and retrieves the metrics for that cluster. It displays them via a PHP web interface which can be found in the source distribution.

That's about it in a nutshell in terms of the architecture of Ganglia. The nice thing is that it's scalable. You split nodes into clusters, you designate one or more nodes in a cluster to gather metrics from all the other nodes, and you have one or more gmetad node(s) collecting the metrics from the designated nodes.

Now for the actual configuration. I have a cluster of DB servers, each running gmond. I also have another server called bak01 that I keep around for backup purposes. I configured each DB server to be part of a cluster called 'db'. I also configured each DB server to send the metrics collected by gmond to bak01 (via UDP on the non-default port of 8650). To do this, I have these entries in /etc/ganglia/gmond.conf on each DB server:
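
The original entries are not shown here. A fragment consistent with the description, assuming only the cluster name and a unicast send channel need to change from the stock gmond.conf, would be:

```
cluster {
  name = "db"
}

udp_send_channel {
  host = bak01
  port = 8650
}
```

Using "host" instead of the default multicast settings makes each DB server send its metrics directly to bak01.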

On host bak01, I also defined a udp_recv_channel and a tcp_accept_channel:

udp_recv_channel {
  port = 8650
}

/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
}

The udp_recv_channel is necessary so bak01 can receive the metrics from the gmond nodes. The tcp_accept_channel is necessary so that bak01 can be contacted by the gmetad node.

That's it in terms of configuring gmond.

On the gmetad node, I made one modification to the default /etc/gmetad.conf file by specifying the cluster I want to collect metrics for, and the node where I want to collect the metrics from:

data_source "eosdb" 60 bak01

I then restarted gmetad via '/etc/init.d/gmetad restart'.

Ideally, these instructions would get you to a state where you would be able to see the graphs for all the nodes in the cluster.

I automated the process of installing and configuring gmond on all the nodes via fabric. Maybe it all happened too fast for the collecting node (bak01), because it wasn't collecting metrics correctly for some of the nodes. I noticed that if I did 'telnet localhost 8649' on bak01, some of the nodes had no metrics associated with them. My solution was to stop and start gmond on those nodes, and that kicked things off. Strange though...
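The 'telnet localhost 8649' check can also be scripted. Here is a minimal sketch, assuming the XML dump from gmond uses HOST elements with METRIC children and a NAME attribute on each HOST. The function names are mine, not part of Ganglia:

```python
import socket
import xml.etree.ElementTree as ET


def read_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read the XML dump that gmond writes on its tcp_accept_channel."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection once the dump is done
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_without_metrics(xml_bytes):
    """Return the names of HOST elements that have no METRIC children."""
    root = ET.fromstring(xml_bytes)
    return sorted(
        host.get("NAME")
        for host in root.iter("HOST")
        if host.find("METRIC") is None
    )
```

Running hosts_without_metrics(read_gmond_xml()) on the collecting node would list the nodes whose gmond probably needs a stop and start.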

In any case, my next step is to install all kinds of Ganglia plugins, especially related to MySQL, but also for more in-depth disk I/O metrics.

Wednesday, September 15, 2010

I've started to use Rackspace CloudFiles as an alternate storage for database backups. I have the backups now on various EBS volumes in Amazon EC2, AND in CloudFiles, so that should be good enough for Disaster Recovery purposes, one would hope ;-)

I found the documentation for the python-cloudfiles package a bit lacking, so here's a quick post that walks through the common scenarios you encounter when managing CloudFiles containers and objects. I am not interested in the CDN aspect of CloudFiles for my purposes, so for that you'll need to dig on your own.

A CloudFiles container is similar to an Amazon S3 bucket, with one important difference: a container name cannot contain slashes, so you won't be able to mimic a file system hierarchy in CloudFiles the way you can do it in S3. A CloudFiles container, similar to an S3 bucket, contains objects -- which for CloudFiles have a max. size of 5 GB. So the CloudFiles storage landscape consists of 2 levels: a first level of containers (you can have an unlimited number of them), and a second level of objects embedded in containers. More details in the CloudFiles API Developer Guide (PDF).

Here's how you can use the python-cloudfiles package to perform CRUD operations on containers and objects.

Getting a connection to CloudFiles

First you need to obtain a connection to your CloudFiles account. You need a user name and an API key (the key can be generated via the Web interface at https://manage.rackspacecloud.com).

I tried using the latest stable XtraBackup .deb package from the Percona downloads site but it didn't work for me. I started a hot backup with /usr/bin/innobackupex-1.5.1 and it ran for a while before dying with "InnoDB: Operating system error number 9 in a file operation." See this bug report for more details.

After unsuccessfully trying to compile XtraBackup from source, I tried XtraBackup-1.3-beta for Lucid from the Percona downloads. This worked fine.

Here's the scenario I tested against a MySQL Percona XtraDB instance running with DATADIR=/var/lib/mysql/m10 and a customized configuration file /etc/mysql10/my.cnf. I created and attached an EBS volume which I mounted as /xtrabackup on the instance running MySQL.

This will take a while and will create a timestamped directory under /xtrabackup, where it will store the database files from DATADIR. Note that the InnoDB log files are not created unless you apply step 2 below.

As the documentation says, make sure the output of innobackupex-1.5.1 ends with:

100901 05:33:12 innobackupex-1.5.1: completed OK!

2) Apply the transaction logs to the datafiles just created, so that the InnoDB logfiles are recreated in the target directory:

At this point, I tested a disaster recovery scenario by stopping MySQL and moving all files in DATADIR to a different location.

To bring the databases back to normal from the XtraBackup hot backup, I did the following:

1) Brought back up a functioning MySQL instance to be used by the XtraBackup restore operation:

i) Copied the contents of the default /var/lib/mysql/mysql database under /var/lib/mysql/m10/ (or you can recreate the mysql DB from scratch)

ii) Started mysqld_safe manually:

mysqld_safe --defaults-file=/etc/mysql10/my.cnf

This will create the data files and logs under DATADIR (/var/lib/mysql/m10) with the sizes specified in the configuration file. I had to wait until the messages in /var/log/syslog told me that the MySQL instance is ready and listening for connections.

2) Copied back the files from the hot backup directory into DATADIR

Note that the copy-back operation below initially errored out because it tried to copy the mysql directory too, and it found the directory already there under DATADIR. So the 2nd time I ran it, I moved /var/lib/mysql/m10/mysql to mysql.bak. The copy-back command is:

You can also copy the files from /xtrabackup/2010-09-01_05-21-36/ into DATADIR using vanilla cp.

NOTE: verify the permissions on the restored files. In my case, some files in DATADIR were owned by root, so MySQL didn't start up properly because of that. Do a 'chown -R mysql:mysql DATADIR' to be sure.

3) If everything went well in step 2, restart the MySQL instance to make sure everything is OK.

At this point, your MySQL instance should have its databases restored to the point where you took the hot backup.

IMPORTANT: if the newly restored instance needs to be set up as a slave to an existing master server, you need to set the correct master_log_file and master_log_pos parameters via a 'CHANGE MASTER TO' command. These parameters are saved by innobackupex-1.5.1 in a file called xtrabackup_binlog_info in the target backup directory.
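As an illustration, here is a small sketch that turns the contents of xtrabackup_binlog_info into the 'CHANGE MASTER TO' statement. It assumes the file holds the binary log file name followed by the position, separated by whitespace; check your own backup directory before relying on that layout. The function name is hypothetical:

```python
def change_master_statement(binlog_info_text):
    """Build a CHANGE MASTER TO statement from xtrabackup_binlog_info text.

    Assumes the layout '<binlog file> <position>' (whitespace separated).
    """
    fields = binlog_info_text.split()
    if len(fields) < 2:
        raise ValueError("expected '<binlog file> <position>'")
    log_file, log_pos = fields[0], int(fields[1])
    return (
        "CHANGE MASTER TO master_log_file='%s', master_log_pos=%d;"
        % (log_file, log_pos)
    )
```

You would still run 'stop slave' before the generated statement and 'start slave' after it.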

instance m1 on db101 and instance m1 on db201 are set up in master-master replication (and similar for instance m2)

the DATADIR for m1 is /var/lib/mysql/m1 on each server; that file system is mounted from an EBS volume (and similar for m2)

the configuration files for m1 are in /etc/mysql1 on each server -- that directory was initially a copy of the Ubuntu /etc/mysql configuration directory, which I then customized (and similar for m2)

the init.d script for m1 is in /etc/init.d/mysql1 (similar for m2)

What I tested:

I took a snapshot of each of the 2 EBS volumes associated with each of the DB servers (4 snapshots in all)

I terminated the 2 m1.large instances

I launched 2 m1.xlarge instances and installed the same Percona distribution (this was done via a Chef recipe at instance launch time); I'll call the 2 new instances xdb101 and xdb201

I pushed the configuration files for m1 and m2, as well as the init.d scripts (this was done via fabric)

I created new volumes from the EBS snapshots (note that these volumes can be created in any EC2 availability zone)

On xdb101, I attached the 2 volumes created from the EBS snapshots on db101; I specified /dev/sdm and /dev/sdn as the device names (similar on xdb201)

On xdb101, I created /var/lib/mysql/m1 and mounted /dev/sdm there; I also created /var/lib/mysql/m2 and mounted /dev/sdn there (similar on xdb201)

At this point, the DATADIR directories for both m1 and m2 are populated with 'live files' from the moment when I took the EBS snapshot

I made sure syslog-ng accepts UDP traffic from localhost (by default it doesn't); this is because by default in Ubuntu mysql log messages are sent to syslog --> to do this, I ensured that "udp(ip(127.0.0.1) port(514));" appears in the "source s_all" entry in /etc/syslog-ng/syslog-ng.conf

At this point, I started up the first MySQL instance on xdb101 via "/etc/init.d/mysql1 start". This script most likely will show [fail] on the console, because MySQL will not start up normally. If you look in /var/log/syslog, you'll see entries similar to:

At this point, you can do "/etc/init.d/mysql1 restart" just to make sure that both stopping and starting that instance work as expected. Repeat for instance m2, and also repeat on server xdb201.

So....IF you are lucky and the InnoDB crash recovery process did its job, you should have 2 functional MySQL instances on each of xdb101 and xdb201. I tested this with several pairs of servers and it worked for me every time, but I hasten to say that YMMV, so DO NOT bet on this as your disaster recovery strategy!

At this point I still had to re-establish the master-master replication between m1 on xdb101 and m1 on xdb201 (and similar for m2).

When I initially set up this replication between the original m1.large servers, I used something like this on both db101 and db201:

The trick for me is that master1 points to db201 in db101's /etc/hosts, and vice-versa.

On the newly created xdb101 and xdb201, there are no entries for master1 in /etc/hosts, so replication is broken. Which is a good thing initially, because you want to have the MySQL instances on each server be brought back up without throwing replication into the mix.

Once I added an entry for master1 in xdb101's /etc/hosts pointing to xdb201, and did the same on xdb201, I did a 'stop slave; start slave; show slave status\G' on the m1 instance on each server. In all cases I tested, one of the slaves was showing everything OK, while the other one was complaining about not being able to read from the master's log file. This was fairly simple to fix. Let's assume xdb101 is the one complaining. I did the following:

on xdb201, I ran 'show master status\G' and noted the file name (for example "mysql-bin.000017") and the file position (for example 106)

on xdb101, I ran the following command: "stop slave; change master to master_log_file='mysql-bin.000017', master_log_pos=106; start slave;"

now a 'show slave status\G' on xdb101 should show everything back to normal

Some lessons:

take periodic snapshots of your EBS volumes (at least 1/day)

for a true disaster recovery strategy, use at least mysqldump to dump your DB to disk periodically, or something more advanced such as Percona XtraBackup; I recommend dumping the DB to an EBS volume and taking periodic snapshots of that volume

the procedure I detailed above is handy when you want to grow your instance 'vertically' -- for example I went from m1.large to m1.xlarge

Friday, August 20, 2010

Munin is a great tool for resource visualization. Sometimes though installing a 3rd party Munin plugin is not as straightforward as you would like. I have been struggling a bit with one such plugin, munin-mysql, so I thought I'd spell it out for my future reference. My particular scenario is running multiple MySQL instances on various port numbers (3306 and up) on the same machine. I wanted to graph in particular the various InnoDB metrics that munin-mysql supports. I installed the plugin on various Ubuntu flavors such as Jaunty and Lucid.

6) Run "/usr/share/munin/plugins/mysql_ suggest" to see what metrics are supported by the plugin. Then proceed to create symlinks in /etc/munin/plugins, adding the port number and the metric name as the suffix.

For example, to track InnoDB I/O metrics for the MySQL instance running on port 3306, you would create this symlink:

7) Restart munin-node and wait 10-15 minutes for the munin master to receive the information about the new metrics.

Important! If you need to troubleshoot this plugin (and any Munin plugin), do not make the mistake of simply running the plugin script directly in the shell. If you do this, it will not read the configuration file(s) correctly, and it will most probably fail. Instead, what you need to do is to follow the "Debugging Munin plugins" documentation, and run the plugin through the munin-run utility. For example:

One more thing: you should probably automate all of the above steps. I have most of it automated via a fabric script. The only thing I do by hand is to create the appropriate symlinks for the specific port numbers I have on each server.
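The symlink step can be automated too. The sketch below assumes the link name is the plugin name followed by the port and the metric, joined with underscores (for example mysql_3306_innodb_io). Verify that naming against the output of 'mysql_ suggest' on your system. The helper names are hypothetical:

```python
import os


def link_name(port, metric):
    """Symlink name: plugin prefix, then port number, then metric name."""
    return "mysql_%d_%s" % (port, metric)


def create_links(ports, metrics,
                 plugin="/usr/share/munin/plugins/mysql_",
                 plugin_dir="/etc/munin/plugins"):
    """Create one symlink per (port, metric) pair, skipping existing ones."""
    created = []
    for port in ports:
        for metric in metrics:
            dest = os.path.join(plugin_dir, link_name(port, metric))
            if not os.path.lexists(dest):
                os.symlink(plugin, dest)
                created.append(dest)
    return created
```

A call such as create_links([3306, 3307], ["innodb_io"]) needs root privileges with the default directories, and munin-node still has to be restarted afterwards.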

Monday, August 16, 2010

This is just a quick post that I hope will save some people some headache when they try to customize their MySQL setup on Ubuntu. I've spent some quality time with this problem over the weekend. I tried in vain for hours to have MySQL read its configuration files from a non-default location on an Ubuntu 9.04 server, only to figure out that it was all AppArmor's fault.

My ultimate goal was to run multiple instances of MySQL on the same host. In the past I achieved this with MySQL Sandbox, but this time I wanted to use MySQL installed from Debian packages and not from a tarball of the binary distribution, and MySQL Sandbox has some issues with that.

Here's what I did: I copied /etc/mysql to /etc/mysql0, then I edited /etc/mysql0/my.cnf and modified the location of the socket file, the pid file and the datadir to non-default locations. Then I tried to run:

/usr/bin/mysqld_safe --defaults-file=/etc/mysql0/my.cnf

At this point, /var/log/daemon.log showed this error:

mysqld[25133]: Could not open required defaults file: /etc/mysql0/my.cnf
mysqld[25133]: Fatal error in defaults handling. Program aborted

As I said, it took me a few hours of trying all kinds of crazy things before I noticed lines like these in /var/log/syslog:

This made me realize it's AppArmor preventing mysqld from opening non-default files. I don't need AppArmor on my servers, so I just stopped it with 'service apparmor stop' and chkconfig-ed it off....at which point every customization I had started to work perfectly.

Tuesday, August 03, 2010

I posted this question yesterday as a quick tweet. I got a bunch of answers already that I'll include here, but feel free to add your answers as comments to this post too. Or reply to @griggheo on Twitter.

I started by saying I have 2 favorite tools: Fabric for pushing app state (pure Python) and Chef for pulling/bootstrapping OS/package state (pure Ruby). For more discussions on push vs. pull deployment tools, see this post of mine.

Here are the replies I got on Twitter so far:

@keyist : Fabric and Chef for me as well. use Fabric to automate uploading cookbooks+json and run chef-solo on server

Wednesday, July 21, 2010

This is the third installment of my Chef post series (read the first and the second). This time I'll show how to use the Ubuntu EC2 instance bootstrap mechanism in conjunction with Chef and have the instance configure itself at launch time. I had a similar post last year, in which I was accomplishing a similar thing with puppet.

Why Chef this time, you ask? Although I am a Python guy, I prefer learning a smattering of Ruby rather than a proprietary DSL for configuration management. Also, when I upgraded my EC2 instances to the latest Ubuntu Lucid AMIs, puppet stopped working, so I was almost forced to look into Chef -- and I've liked what I've seen so far. I don't want to bad-mouth puppet though, I recommend you look into both if you need a good configuration management/deployment tool.

Here is a high-level view of the bootstrapping procedure I'm using:

1) You create Chef roles and tie them to cookbooks and recipes that you want executed on machines which will be associated with these roles.

2) You launch an EC2 Ubuntu AMI using any method you want (the EC2 Java-based command-line API, or scripts based on boto, etc.). The main thing here is that you pass a custom shell script to the instance via a user-data file.

3) When the EC2 instance boots up, it runs your custom user-data shell script. The script installs chef-client and its prerequisites, downloads the files necessary for running chef-client, runs chef-client once to register with the chef master and to run the recipes associated with its role, and finally runs chef-client in the background so that it wakes up and executes every N minutes.

The role 'base' looks something like this, in a file called roles/base.rb:

name "base"
description "Base role (installs common packages)"
run_list("recipe[base]")

The role 'myapp' looks something like this, in a file called roles/myapp.rb:

name "myapp"
description "Installs required packages and applications for an app server"
run_list "recipe[memcached]", "recipe[myapp::tornado]"

Note that the role myapp specifies 2 recipes to be run: one is the default recipe of the 'memcached' cookbook (which is part of the Opscode cookbooks), and one is a recipe called tornado which is part of the myapp cookbook (the file for that recipe is cookbooks/myapp/recipes/tornado.rb). Basically, to denote a recipe, you either specify its cookbook (if the recipe is the default recipe of that cookbook), or you specify cookbook::recipe_name (if the recipe is non-default).
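To make the naming rule concrete, here is a small Python sketch that maps a run_list entry to the recipe file it refers to. It is not how Chef resolves recipes internally, just an illustration. It assumes the usual Chef layout, in which the default recipe lives in recipes/default.rb, and the function name is hypothetical:

```python
import re

# recipe[cookbook] or recipe[cookbook::recipe_name]
_RECIPE_RE = re.compile(r"^recipe\[([^\]:]+)(?:::([^\]]+))?\]$")


def recipe_file(entry):
    """Map a run_list entry to the recipe file it names under cookbooks/."""
    match = _RECIPE_RE.match(entry)
    if match is None:
        raise ValueError("not a recipe entry: %r" % entry)
    cookbook = match.group(1)
    recipe = match.group(2) or "default"  # no '::name' means the default recipe
    return "cookbooks/%s/recipes/%s.rb" % (cookbook, recipe)
```

For the myapp role, recipe_file("recipe[myapp::tornado]") gives cookbooks/myapp/recipes/tornado.rb, matching the path mentioned above.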

So far, we haven't associated any clients with these roles. We're going to do that on the client EC2 instance. This way the Chef server doesn't have to do any configuration operations during the bootstrap of the EC2 instance.

2) Launching an Ubuntu EC2 AMI with custom user-data

I wrote a Python wrapper around the EC2 command-line API tools. To launch an EC2 instance, I use the ec2-run-instances command-line tool. My Python script also takes a command line option called chef_role, which specifies the Chef role I want to associate with the instance I am launching. The main ingredient in the launching of the instance is the user-data file (passed to ec2-run-instances via the -f flag).

I use this template for the user-data file. My Python wrapper replaces HOSTNAME with an actual host name that I pass via a cmdline option. The Python wrapper also replaces CHEF_ROLE with the value of the chef_role cmdline option (which defaults to 'base').
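The substitution itself is simple. Here is a minimal sketch of what such a wrapper might do. Only the HOSTNAME and CHEF_ROLE placeholders and the 'base' default come from the description above; the template body and the function name are made up for illustration:

```python
USER_DATA_TEMPLATE = """#!/bin/bash
# Hypothetical template body; only the two placeholders are from the post.
echo "configuring HOSTNAME with Chef role CHEF_ROLE"
"""


def render_user_data(template, hostname, chef_role="base"):
    """Substitute the HOSTNAME and CHEF_ROLE placeholders in the template."""
    return template.replace("HOSTNAME", hostname).replace("CHEF_ROLE", chef_role)
```

The rendered text would then be written to a file and handed to ec2-run-instances via the -f flag.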

The shell script which makes up the user-data file does the following:

a) Overwrites /etc/hosts with a version that has hardcoded values for chef.mycloud.com and mysite.com. The chef.mycloud.com box is where I run Chef server, and mysite.com is a machine serving as a download repository for utility scripts.

b) Downloads Eric Hammond's runurl script, which it uses to run other utility scripts.

c) Executes via runurl the script mysite.com/customize/hostname and passes it the real hostname of the machine being launched. The hostname script simply sets the hostname on the machine:

What this does is it adds the internal IP of the machine being launched to /etc/hosts and associates it with both the FQDN and the short hostname. The FQDN bit is important for chef configuration purposes. It needs to come before the short form in /etc/hosts. I could have obviously also used DNS, but at bootstrap time I prefer to deal with hardcoded host names for now.

Update 07/22/10

Patrick Lightbody sent me a note saying that it's easier to get the local IP address of the machine by using one of the handy EC2 internal HTTP queries.

If you run "curl -s http://169.254.169.254/latest/meta-data" on any EC2 instance, you'll see a list of variables that you can inspect that way. For the local IP, I modified my script above to use:

IPADDR=`curl -s http://169.254.169.254/latest/meta-data/local-ipv4`

e) Finally, and most importantly for this discussion, executes via runurl the script mysite.com/install/chef-client and passes it the actual value of the cmdline argument chef_role. The chef-client script does the heavy lifting in terms of installing and configuring chef-client on the instance being launched. As such, I will describe it in the next step.
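For the /etc/hosts step and the metadata query mentioned in the update above, here is a Python sketch of the same idea. It is not the shell script from the post. It assumes the short hostname is the first label of the FQDN, and the function names are hypothetical:

```python
from urllib.request import urlopen

METADATA_URL = "http://169.254.169.254/latest/meta-data/local-ipv4"


def local_ipv4(url=METADATA_URL, timeout=2.0):
    """Ask the EC2 metadata service for the instance's internal IP.

    Only works when called from an EC2 instance.
    """
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()


def hosts_entry(ipaddr, fqdn):
    """Format an /etc/hosts line with the FQDN ahead of the short name."""
    short = fqdn.split(".", 1)[0]
    return "%s %s %s" % (ipaddr, fqdn, short)
```

On an instance, hosts_entry(local_ipv4(), "app01.mycloud.com") would produce the line to append to /etc/hosts, with the FQDN listed first as Chef needs it.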

3) Installing and configuring chef-client on the newly launched instance

Here is the chef-client script I'm using. The comments are fairly self-explanatory. Because I am passing CHEF_ROLE as its first argument, the script knows which role to associate with the client. It does it by downloading the appropriate chef.${CHEF_ROLE}.json. To follow the example, I have 2 files corresponding to the 2 roles I created on the Chef server.

Note that the client knows the IP address of chef.mycloud.com because we hardcoded it in /etc/hosts.

The chef-client script also downloads validation.pem, which is an RSA key file used by the Chef server to validate the client upon the initial connection from the client.

The last file downloaded is the init script for launching chef-client automatically upon reboots. I took the liberty of butchering this sample init script and I made it much simpler (see the gist here but beware that it contains paths specific to my environment).

At this point, the client is ready to run this chef-client command which will contact the Chef server (via client.rb), validate itself (via validation.pem), download the recipes associated with the roles specified in chef.json, and run these recipes:

chef-client -j /etc/chef/chef.json -L /var/log/chef.log -l debug

I run the command in debug mode and I specify a log file location (the default output is stdout) so I can tell what's going on if something goes wrong.

That's about it. At this point, the newly launched instance is busy configuring itself via the Chef recipes. Time to sit back and enjoy your automated bootstrap process!

The last lines in chef-client remove the validation.pem file, which is only needed during the client registration, and run chef-client again, this time in the background, via the init script. The process running in the background looks something like this in my case:

The -i 600 option means chef-client will contact the Chef server every 600 seconds (plus a random interval given by -s 30) and it will inquire about additions or modifications to the roles it belongs to. If there are new recipes associated with any of the roles, the client will download and run them.

If you want to associate the client to new roles, you can just edit the local file /etc/chef/chef.json and add the new roles to the run_list.
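If you would rather script that edit, something like the following would do. This is a sketch that assumes chef.json is a JSON object with a run_list array of entries such as "role[base]"; the function name is hypothetical:

```python
import json


def add_role(chef_json_text, role):
    """Return chef.json text with role[<role>] appended to run_list if absent."""
    config = json.loads(chef_json_text)
    run_list = config.setdefault("run_list", [])
    entry = "role[%s]" % role
    if entry not in run_list:
        run_list.append(entry)
    return json.dumps(config, indent=2, sort_keys=True)
```

Calling add_role on the contents of /etc/chef/chef.json and writing the result back leaves the file unchanged if the role is already there, so it is safe to run repeatedly.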

Thursday, July 15, 2010

To me, nothing beats a nice dashboard for keeping track of how your infrastructure and your application are doing. At Evite, sending mail is a core part of our business. One thing we need to ensure is that our mail servers are busily humming away, sending mail out to our users. To this end, I built a quick outgoing email tracking tool using MongoDB and pymongo, and I also put together a dashboard visualization of that data using the Google Visualization API via the gviz_api Python module.

Tracking outgoing email from the mail logs with pymongo

Mail logs are sent to a centralized syslog. I have a simple Python script that tails the common mail log file every 5 minutes, counts the lines that conform to a specific regular expression (looking for a specific msgid pattern), then inserts that count into a MongoDB database. Here's the snippet of code that does that:

I use the pymongo module to open a connection to the host running the mongod daemon, then I declare a database called logs and a collection called maillogs within that database. Note that both the database and the collection are created on the fly in case they don't exist.

I then instantiate a Python dictionary with two keys, insert_time and msg_count. Finally, I use the save method on the maillogs collection to insert the dictionary into the MongoDB logs database. Can't get any easier than this.
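Here is a rough sketch of the counting step and of the dictionary with the two keys described above. The regular expression is a stand-in, since the real msgid pattern is specific to our logs, and the function names are hypothetical. The pymongo part is left out so the example runs without a MongoDB server:

```python
import re
from datetime import datetime

# Hypothetical pattern; the actual msgid regex is site-specific.
MSGID_RE = re.compile(r"msgid=<[^>]+@example\.com>")


def count_matches(lines, pattern=MSGID_RE):
    """Count the log lines that contain a match for the pattern."""
    return sum(1 for line in lines if pattern.search(line))


def build_document(lines, now):
    """Build the dictionary that gets saved to the maillogs collection."""
    return {"insert_time": now, "msg_count": count_matches(lines)}
```

The resulting dictionary is what gets passed to the save method on the maillogs collection.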

Visualizing the outgoing email count with gviz_api

I have another simple Python script which queries the MongoDB logs database for all documents that have been inserted in the last hour. Here's how I do it:

As an aside, when querying MongoDB databases that contain documents with timestamp fields, the datetime module will become your intimate friend.

Just remember that you need to pass datetime objects when you put together a pymongo query. In the case above, I use the now() method to get the current timestamp, then I use timedelta with minutes=-60 to get the datetime object corresponding to 'now minus 1 hour'.
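A minimal sketch of how such a query filter can be put together, using the insert_time field name from earlier and MongoDB's $gt operator. The function name is hypothetical, and the filter is built separately from the pymongo call so it can be checked on its own:

```python
from datetime import datetime, timedelta


def last_hour_query(now):
    """Build a pymongo-style filter for documents inserted in the last hour."""
    one_hour_ago = now + timedelta(minutes=-60)
    return {"insert_time": {"$gt": one_hour_ago}}
```

The dictionary is then passed to find() on the maillogs collection, for example maillogs.find(last_hour_query(datetime.now())).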

The gviz_api module has decent documentation, but it still took me a while to figure out how to use it properly (thanks to my colleague Dan Mesh for being the trailblazer and providing me with some good examples).

I want to graph the timestamps and message counts from the last hour. Using the pymongo query above, I get the documents inserted in MongoDB during the last hour. From that set, I need to generate the data that I am going to pass to gviz_api:

The important parts in this function are the description and the data variables. According to the docs, they both need to be of the same type, either dictionary or list. In my case, they're both lists. The description denotes the schema for the data I want to chart. I declare two variables I want to chart, insert_time of type string, and msg_count of type number. For msg_count, I also specify a user-friendly label called 'Message count', which will be displayed in the chart legend.

After constructing the data list based on chart_data, I declare a gviz_api DataTable, I load the data into it, I call the ToJSon method on it to get a JSON string, and finally I fill in a template string, passing it a title for the chart and the JSON data.
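Roughly, the description and data lists can be built like this. This is a sketch, assuming the documents carry the insert_time and msg_count keys and that timestamps are shown as hour:minute strings; the chart_rows name and the time format are my own choices. The gviz_api calls are left out so the snippet runs without the module installed:

```python
from datetime import datetime

# Schema: column id, column type, and an optional user-friendly label.
DESCRIPTION = [
    ("insert_time", "string"),
    ("msg_count", "number", "Message count"),
]


def chart_rows(documents):
    """Turn MongoDB documents into rows that match DESCRIPTION."""
    rows = []
    for doc in sorted(documents, key=lambda d: d["insert_time"]):
        rows.append([doc["insert_time"].strftime("%H:%M"), doc["msg_count"]])
    return rows
```

With gviz_api available, DESCRIPTION goes to the DataTable constructor, the rows go to LoadData, and ToJSon produces the string for the template.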

The template string is an HTML + Javascript snippet that actually talks to the Google Visualization backend and tells it to create an Area Chart. Click on this gist to view it.

That's it. I run the gviz_api script every 5 minutes via crontab and I generate an HTML file that serves as my dashboard.

I can easily also write a Nagios plugin based on the pymongo query, which would alert me for example if the number of outgoing email messages is too low or too high. It's very easy to write a Nagios plugin by just having a script that exits with 0 for success, 1 for warnings and 2 for critical errors. Here's a quick example, where wlimit is the warning threshold and climit is the critical threshold:
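A minimal sketch of the threshold logic, assuming the plugin only alerts when the count is too low (a check for counts that are too high would mirror it). The function name is hypothetical:

```python
OK, WARNING, CRITICAL = 0, 1, 2  # Nagios plugin exit codes


def nagios_status(msg_count, wlimit, climit):
    """Map a message count to a Nagios exit code (low counts are bad)."""
    if msg_count < climit:
        return CRITICAL
    if msg_count < wlimit:
        return WARNING
    return OK
```

The plugin script would finish with sys.exit(nagios_status(count, wlimit, climit)), where count comes from the last-hour query.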

Update #1: See Mike Dirolf's comment on how to properly insert and query timestamp-related fields. Basically, use datetime.datetime.utcnow() instead of now() everywhere, and convert to local time zone when displaying.

Update #2: Due to popular demand, here's a screenshot of the chart I generate. Note that the small number of messages is a very, very small percentage of our outgoing mail traffic. I chose to chart it because it's related to some new functionality, and I want to see if we're getting too few or too many messages in that area of the application.