Browse Author: kale

We use the ELK stack extensively at my job, thanks to my evangelizing and endless hard work. With all our servers logging to Logstash and the results being pushed into Elasticsearch, logging into servers via ssh just to check logs is a thing of the past. To help push that ideology along, I’ve hacked up a simple bash script that queries Elasticsearch and returns results in a manner that mimics running `tail` on a server’s logs. It quite literally just runs a query against Elasticsearch’s HTTP API, but I added some niceties so folks can make queries against ES without having to read a novel first.
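This isn’t the script from the post, but a sketch of the core idea; the host, the `logstash-*` index pattern, and the `@timestamp` field name are assumptions on my part:

```shell
#!/usr/bin/env bash
# Sketch: query Elasticsearch's HTTP API and pull back the newest N events
# matching a Lucene query string, tail-style. Host and field names assumed.
ES_HOST="${ES_HOST:-localhost:9200}"

# Build the search body: N hits, newest first.
build_query() {
  local query="$1" size="${2:-20}"
  printf '{"size":%d,"sort":[{"@timestamp":{"order":"desc"}}],"query":{"query_string":{"query":"%s"}}}' \
    "$size" "$query"
}

# Run the search; the real script would pretty-print hits so the output
# reads like `tail` rather than dumping raw JSON.
es_tail() {
  curl -s "http://${ES_HOST}/logstash-*/_search" -d "$(build_query "$@")"
}
```

Usage would be something like `es_tail 'host:web01 AND program:nginx' 50`, hiding the query DSL behind a familiar-feeling command.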

I stumbled across this fantastic blog post that offers a clever bash script to notify you when long-running commands in your shell complete. I made a couple of tweaks to make it work on OS X, and gave it a little blacklist, since interactive commands like `less` or `vim` routinely run for more than 10 seconds and shouldn’t trigger a notification.
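A minimal sketch of the approach for bash; the threshold, the blacklist contents, and the use of `osascript` as the OS X notifier are my placeholders, not the post’s:

```shell
# Sketch: record when each command starts via a DEBUG trap, then on the next
# prompt, notify if it ran longer than the threshold and isn't blacklisted.
NOTIFY_THRESHOLD=10
NOTIFY_BLACKLIST="less vim ssh man"

# Succeed if the command's first word is on the blacklist.
is_blacklisted() {
  local cmd_word="${1%% *}" entry
  for entry in $NOTIFY_BLACKLIST; do
    [[ "$cmd_word" == "$entry" ]] && return 0
  done
  return 1
}

# Stamp the start time of each command as it begins executing.
trap 'CMD_START=${CMD_START:-$SECONDS}; LAST_CMD=$BASH_COMMAND' DEBUG

# At the next prompt, compute elapsed time and notify if warranted.
PROMPT_COMMAND='
  if [[ -n "${CMD_START:-}" ]]; then
    elapsed=$(( SECONDS - CMD_START ))
    if (( elapsed >= NOTIFY_THRESHOLD )) && ! is_blacklisted "$LAST_CMD"; then
      osascript -e "display notification \"done in ${elapsed}s\" with title \"$LAST_CMD\""
    fi
  fi
  unset CMD_START
'
```

The blacklist check is what keeps an hour-long `vim` session from dinging you when you finally quit.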

I have a database table that was created about two years ago and has been filling up quickly ever since. These days, it’s massive: our database dumps are 68GB uncompressed, and 60GB of that is this one table. It’s used quite regularly, as it contains all of the error reports we receive, but to call it “unwieldy” is an understatement.

I was content to just let sleeping dogs lie, but alas — one of my devs needs a couple extra fields added to the table for more data and sorting and whatnot. If this wasn’t a 60gb table in our production database, I’d happily run an ALTER TABLE and call it a day. (In fact, I attempted to do this — and then the site went down because the whole db was locked. oops)

Instead, I discovered a better way to add fields while retaining both uptime and data (!). MySQL’s CREATE TABLE command actually has a lot of interesting functionality that allows me to do this:
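A sketch of the shape such a statement takes; the column types, the key name, and the AUTO_INCREMENT value here are placeholders, and the real ones should come from `SHOW CREATE TABLE errors`:

```sql
-- Sketch: create the new table with the added columns, the keys copied
-- verbatim from SHOW CREATE TABLE errors, and the current counter value,
-- then populate it (and pick up the remaining columns) via SELECT.
CREATE TABLE errors_new (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  keywords VARCHAR(255) DEFAULT NULL,
  errorid VARCHAR(64) DEFAULT NULL,
  stacktrace TEXT,
  is_silent TINYINT(1) NOT NULL DEFAULT 0,
  PRIMARY KEY (id),
  KEY idx_errorid (errorid)
) ENGINE=InnoDB AUTO_INCREMENT=123456789
SELECT * FROM errors;
```

The trick is the `SELECT` clause at the end: any source columns not named in the explicit column list get merged into the new table’s schema, and the rows are copied as part of the same statement.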

What this CREATE TABLE statement does is create a new table with 5 explicitly-specified fields (keywords, errorid, stacktrace, is_silent, and id). Four of these are what I wanted to add; `id` exists in the original table, but I specify it here because I need to make it AUTO_INCREMENT (as this is a column attribute, not a bit of data or schema that gets copied). The additional keys are specified verbatim from `SHOW CREATE TABLE errors` (the original table), as is the AUTO_INCREMENT value.

After specifying my table creation variables, I perform a SELECT on the original table. MySQL is smart enough to know that if I’m SELECTing during a CREATE TABLE, I probably want any applicable table schema copied as well, so it does exactly that — copies over any columns missing from the schema I specified in my CREATE statement. Even better, because the various keys were specified, the indexes get copied over as well.

The result? An exact copy of the original table — with four additional fields added. All that’s left is to clean up:
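The cleanup might look something like the following; `errors_new` and `errors_old` are placeholder names:

```sql
-- Swap the new table into place atomically, then drop the old one once
-- you're confident in the copy.
RENAME TABLE errors TO errors_old, errors_new TO errors;
DROP TABLE errors_old;
```

One caveat worth flagging: rows written to the original table while the `SELECT` copy was running won’t be in the new table, so on a busy table you’d want to copy any stragglers over before the swap.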

[UPDATED 2017-03-09]
I still get comments/questions regarding this process I hacked together many moons ago. I must request that anybody who’s looking for a way to back up Elasticsearch indices STOP and not follow the process described — it was for ES 0.00000000001, written back in 2011. You should not do what I suggest here! I’m saving this purely for historical purposes.

What you should do instead is save your events as flat text: in Logstash, output both to your ES index (for searching via Kibana or whatnot) and to a flat file, rotated periodically (per day or per month or whatever). Back up and archive those text files, since they compress quite well. When you want to restore data from a period, just re-process it through Logstash — CPU is cheap nowadays with cloud instances! The data is the important part: processed or not, if you have the data in an easily stored format, you can always re-process it.
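A minimal sketch of such a dual-output Logstash configuration; the host, index pattern, and archive path are placeholders, and the exact option names have varied across Logstash versions:

```conf
output {
  # Searchable copy: the usual daily index in Elasticsearch.
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logstash-%{+YYYY.MM.dd}"
  }
  # Durable copy: one flat file per day; compress and archive these off-box.
  file {
    path => "/var/log/archive/events-%{+YYYY-MM-dd}.log"
  }
}
```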

[Original post as follows]

I’ve been spending a lot of time with Elasticsearch recently, as I’ve been implementing logstash for our environment. Logstash, by the way, is a billion times awesome and I can’t recommend it enough for large-scale log management/search. Elasticsearch is pretty awesome too, but considering the sheer amount of data I was putting into it, I don’t feel satisfied with its replication-based redundancy — I need backups that I can save and restore at will. Since logstash creates a new Elasticsearch index for each day worth of logs, I want the ability to backup and restore arbitrary indices.

Elasticsearch has the concept of a gateway: you can configure one that maintains metadata, and snapshots are taken regularly. “Regularly” as in every 10 seconds, by default. The docs recommend using S3 as a gateway, meaning every 10s it’ll ship data up to S3 for backup purposes, and if a node ever needs to recover data, it can just look to S3, get the metadata, and fill in data from that source. However, this model does not support the “rotation”-style backup and restore I’m looking for, and it can’t keep up with the rate of data I’m sending it (my daily indices are about 15GB apiece, which works out to about 400k log entries an hour).

So I’ve come up with a pair of scripts that allow me to manage logstash/Elasticsearch index data, allowing for arbitrary restore of an index, as well as rotation so as to keep the amount of data that Elasticsearch keeps track of manageable. As always, I wrote my scripts for my environment, so I take no responsibility if they do not work in yours and instead destroy all your data (a distinct possibility). I include these scripts here because I spent a while trying to figure this out and couldn’t find any information elsewhere on the net.

The following script backs up today’s logstash index. I’m hopeless with timezones, and managed to somehow ship my logs to logstash in GMT, so my “day” ends at 5pm, when logstash closes its index and opens a new one for the new day. Shortly after logstash closes an index (stops writing to it, not “close” in the Elasticsearch sense), I run the following script in cron, which backs up the index, backs up the metadata, creates a restore script, and sticks it all in S3:
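The original script is long gone; what follows is a sketch of the flow it describes. The data directory layout, the `s3cmd` tooling, the date format, and the endpoints are all assumptions (and, per the update above, this whole approach only ever applied to ancient ES versions):

```shell
#!/usr/bin/env bash
# Sketch of the backup flow: flush the index, tar its data directory, save
# its metadata, and ship the lot to S3. Paths and tools are assumptions.
set -eu
ES_DATA="/var/lib/elasticsearch/elasticsearch/nodes/0/indices"
BACKUP_DIR="/tmp/es-backup"
S3_BUCKET="s3://my-es-backups"

# Logstash names daily indices after the date; the separator/order here is
# an assumption -- match whatever your logstash actually produces.
index_name() { echo "logstash-$(date +%Y.%m.%d)"; }

backup_index() {
  local index; index="$(index_name)"
  mkdir -p "$BACKUP_DIR/$index"

  # 1. Flush so everything in the translog is persisted to disk.
  curl -s -XPOST "http://localhost:9200/$index/_flush"

  # 2. Tar up the index's on-disk data.
  tar czf "$BACKUP_DIR/$index/index.tar.gz" -C "$ES_DATA" "$index"

  # 3. Save the metadata needed to recreate the index on restore.
  curl -s "http://localhost:9200/$index/_settings" > "$BACKUP_DIR/$index/settings.json"
  curl -s "http://localhost:9200/$index/_mapping"  > "$BACKUP_DIR/$index/mapping.json"

  # 4. A generated restore.sh would recreate the index from these files;
  #    ship everything to S3 together.
  s3cmd put --recursive "$BACKUP_DIR/$index" "$S3_BUCKET/"
}
```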

Restoring from this data is just as you would expect — download the backed up index.tar.gz and the associated restore.sh to the same directory, chmod +x the restore.sh, then run it. It will automagically create the index and put the data in place. This has the benefit of making backed up indices portable — you can “export” them from one ES cluster and import them to another.

As mentioned, because of logstash, I have daily indices that I back up; I also rotate them to prevent ES from having to search through billions of gigs of data over time. I keep 8 days worth of logs in ES (due to timezone issues) by doing the following:

Sometimes, due to the way my log entries get to logstash, the timestamp is mangled, and logstash, bless its heart, tries so hard to index it anyway. Since logstash is keyed on timestamps, though, this means every once in a while I get an index dated 1970 with one or two entries in it. There’s no harm beyond the overhead of an extra index, but it makes it impossible to back those up or to make any assumptions about the index names. So I nuke the 1970s indices from orbit, and then, if there are more than 8 indices in logstash, drop the oldest. I run this script at midnight daily, after index backup. Hugest caveat in the world about the rotation: running `curl -s -XDELETE http://localhost:9200/logstash-10.14.2011/` will delete index logstash-10.14.2011, as you’d expect. However, if that variable $OLDESTLOG is mangled somehow and the command that runs is `curl -s -XDELETE http://localhost:9200//`, you will delete all of your indices. Just a friendly warning!
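The rotation just described could be sketched roughly as follows. The index-listing endpoint is an assumption (old ES versions differed), the sketch assumes `logstash-YYYY.MM.DD` naming so a lexical sort is chronological (the post’s `MM.DD.YYYY` naming would need a date-aware sort), and it guards against exactly the empty-variable disaster described above:

```shell
#!/usr/bin/env bash
# Rotation sketch: drop bogus 1970 indices, then drop the oldest real index
# if more than KEEP remain. Endpoints and parsing are assumptions.
set -eu
KEEP=8

list_indices() {
  # One index name per line, sorted oldest-first (assumes sortable naming).
  curl -s 'http://localhost:9200/_status' \
    | grep -o '"logstash-[^"]*"' | tr -d '"' | sort -u
}

# Pick the oldest index from a sorted, newline-separated list on stdin.
oldest_index() { head -n 1; }

delete_index() {
  local index="$1"
  # The caveat from the post: an empty name turns this into DELETE on /,
  # which removes EVERY index. Refuse to run without a name.
  [[ -n "$index" ]] || { echo "refusing to delete empty index name" >&2; return 1; }
  curl -s -XDELETE "http://localhost:9200/$index/"
}

rotate() {
  local indices count
  # Nuke the 1970 indices from orbit first.
  for i in $(list_indices | grep '^logstash-1970' || true); do delete_index "$i"; done
  indices="$(list_indices | grep -v '^logstash-1970')"
  count="$(echo "$indices" | wc -l)"
  if (( count > KEEP )); then
    delete_index "$(echo "$indices" | oldest_index)"
  fi
}
```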

Ok, theoretically my last post about mod_rpaf was supposed to lead to mod_qos working. It did, in the most technical sense… it just made it instantly obvious that mod_qos was not the solution I was looking for! mod_qos applies QoS rules per-URI, but to all connecting clients, not just offenders. It’s best used for resource limiting… not for API throttling to put a stop to abuse, which is my intent.

I grudgingly turned to mod_security. I’ve known all along that mod_security would be the best tool to help me reach my goal; however, mod_security is the least user-friendly piece of software that I’ve ever used, with a highly esoteric language and odd processing rules. Forced to sit down and make it work, however, I’ve come up with a few rules that may help others who wish to perform request-based throttling.

First, I initialize a collection called “IP”, keyed on the X-Forwarded-For header. Because I’m using mod_rpaf, I could technically use the remote address, but “just in case” I opted for X-Forwarded-For, since that’s the value that actually identifies the client. It also prevents the load balancer itself from ever getting blocked.

The second line is where I do the IP increment — and decrement. For every hit from that IP, I increment the IP.hitcount variable by 1; the ‘deprecatevar:IP.hitcount=1/1’ tells mod_security to decrement that variable by one per second. If a user makes one hit per second, they will never hit the limit. If they make 2 hits per second, the net gain is 1 after the first second, 2 after the next, 3 after the next, and so on.

The last line, of course, is where we do our test. If the hitcount is greater than 3, I’m allowing the request to go through, but adding a 3000ms pause — 3 seconds.
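Rules along those lines might look like the following sketch (mod_security 2.x syntax). The rule IDs and the phase are my additions, since the original rules aren’t shown; `pause` takes milliseconds:

```apache
# 1. Initialize the IP collection, keyed on X-Forwarded-For.
SecAction "phase:2,initcol:ip=%{REQUEST_HEADERS.X-Forwarded-For},nolog,pass,id:1000001"

# 2. Increment the hit counter; deprecatevar decays it by 1 per second.
SecAction "phase:2,setvar:ip.hitcount=+1,deprecatevar:ip.hitcount=1/1,nolog,pass,id:1000002"

# 3. Over the limit? Let the request through, but add a 3000ms pause.
SecRule IP:HITCOUNT "@gt 3" "phase:2,pause:3000,nolog,pass,id:1000003"
```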

I configured these rules within my VirtualHost definition, and used Location tags to specify the URIs that require throttling. It works like a champ. In each of the rules, I’ve specified ‘nolog’, as it’s pretty spammy, though you’ll want to change that to ‘log’ for testing. Because I’m disabling mod_security’s spammy logging, I’m timing requests with a custom log format:
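A log format along those lines; the format nickname and log path are my placeholders:

```apache
# Standard combined-style format, with X-Forwarded-For as the client IP and
# %D (request duration in microseconds) appended at the end.
LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b %D" timed
CustomLog /var/log/apache2/access_timed.log timed
```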

The %D at the end of the LogFormat spits out the total time taken by Apache to fulfill the request in microseconds, which will include the artificial delay. With this CustomLog definition, you can now easily visualize throttled requests:
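For instance, with %D as the last field, a hypothetical one-liner along these lines would pull out just the throttled requests (those that took 3 seconds or more):

```shell
# Print only log lines whose final field (%D, in microseconds) is >= 3s,
# i.e. requests that ate the artificial mod_security pause.
throttled() { awk '$NF >= 3000000' "$@"; }
```

Run it as `throttled /var/log/apache2/access_timed.log` (a placeholder path), or pipe a live `tail -f` through it.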

Amazon’s ELB service is nice — magical load balancers that just work, sitting in front of your servers, that you can update and modify on a whim. Of course, because it’s a load balancer (a distributed load balancer infrastructure, to be more precise), Apache and other applications sitting behind it see all the incoming traffic as coming from the load balancer — i.e., $REMOTE_ADDR is 10.251.74.17 instead of the end client’s public IP.

This is normal behavior when sitting behind a load balancer, and it’s also normal behavior for the load balancer to encapsulate the original client IP in an X-Forwarded-For header. Using Apache, we can, for example, modify LogFormat definitions to account for this, logging %{X-Forwarded-For}i to log the end user’s IP.
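For example, a format along these lines (the nickname is my placeholder) logs the end user’s IP instead of the load balancer’s:

```apache
# Combined-style format, but with the X-Forwarded-For header in place of %h.
LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b" combined-xff
```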

Where this falls short, however, is when you want to *do* things with the originating IP beyond logging. The real-world scenario I ran into was using mod_qos to do rate-limiting based on URIs within Apache — mod_qos tests against the remote IP, not the X-Forwarded-For, so using the module as is, I’m unable to apply any QoS rules against anything beyond the load balancer… which of course defeats the purpose.

Luckily, I’m not the only person to have ever run into this issue. The Apache module mod_rpaf is explicitly designed to address this type of situation by translating the X-Forwarded-For header into the remote address as Apache expects, so that other modules can properly run against the originating IP — not the load balancer.
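For reference, a stock mod_rpaf setup looks roughly like this; the proxy IP below is a placeholder, and the need to enumerate RPAFproxy_ips is exactly what breaks down with ELB:

```apache
# Translate X-Forwarded-For into the remote address, but only for requests
# arriving from the listed trusted proxies.
RPAFenable On
RPAFsethostname On
RPAFproxy_ips 10.251.74.17
RPAFheader X-Forwarded-For
```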

ELB makes implementation of mod_rpaf much more difficult than it should be, however. ELB is architected as a large network of load balancers, such that incoming outside requests bounce around a bit within the ELB infrastructure before being passed to your instance. Each “bounce” adds an additional IP to X-Forwarded-For, essentially chaining proxies. Additionally, there are hundreds of internal IPs within ELB that would need to be accounted for to use mod_rpaf as-is, since you must explicitly specify the proxy IPs to strip.

So I patched up mod_rpaf to work with ELB. I’ve been running it for a day or so in dev and it appears to be working as expected, passing the original client value to mod_qos (and mod_qos testing and working against that), but of course if you run into issues, please let me know (because your issues will probably show up in my environment as well).

I’m using Amazon’s SES service for my servers’ emails. To implement, instead of re-writing all of our code to hook into the SES API, I simply configured Postfix to use the example script ses-send-mail.pl provided by Amazon. It works fine and dandy, with mails happily going out to their intended recipients via SES.

However, that’s not good enough for me. You see, if you send a mail through SES and it bounces, you’ll receive the bounce message at the original From: address, as expected. But a lot of ISPs/ESPs strip the original To: header in their bounce templates to prevent backscatter, and SES replaces the message ID set on the email by Postfix with its own — so it’s very possible to get bounce messages that carry no information at all about the intended recipient. How do you do bounce management when you have nothing that links the bounce to the original email you sent?

While Amazon strips the message ID assigned by Postfix, it adds its own message ID — AWSMessageID. This value is returned by the SES API when you submit an email to the service; the provided example scripts, however, don’t do anything with this ID.

To address this issue in my environment, I wrote the following script, which I set as my Postfix transport (rather than ses-send-email.pl).
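The script itself isn’t reproduced here, but a sketch of a wrapper along those lines might look like the following. The path to `ses-send-email.pl`, its flags, and the log format are all assumptions; Postfix’s pipe(8) transport passes the recipients via argv and the message itself on stdin:

```shell
#!/usr/bin/env bash
# Sketch: wrap ses-send-email.pl as the Postfix transport, but capture the
# AWSMessageID that SES returns and log it alongside the recipients, so
# bounces can later be correlated with the original mail.

# Format the line we keep for bounce correlation.
log_line() { echo "AWSMessageID=$1 recipients=$2"; }

main() {
  local id
  # The sample script prints the MessageId returned by the SES API for the
  # submitted message; capture it instead of throwing it away.
  id="$(/opt/amazon/ses-send-email.pl -k /etc/aws/credentials "$@")"
  log_line "$id" "$*" | logger -t ses-transport
}

# When installed as the Postfix transport, finish the script with: main "$@"
```

With those log lines in syslog, a bounce that names only an AWSMessageID can be mapped straight back to the recipients of the original send.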