Flood Detection, Rails and Memcached

Note: The article is about the code behind Forumwarz for those who are interested in such things. If people like it I might write more like it!

Flooding is what happens when a user submits data repeatedly to your server. A good example would be a user who repeatedly posts comments on your message board.

Sometimes it’s done innocently: a user hits the submit button 5 times impatiently during a bit of server lag, only to find their post went through 5 times.

Other times it’s nefarious and deliberate: a user creates a wget script to just post garbage over and over.

Not only can flooding be taxing on your server, but it can fill up your site with so much garbage that it will put other users off. Obviously it’s something you want to get rid of!

A simple and effective strategy: the cool down period

You can assign a “cool down” period, where a user is barred from submitting data until the period ends. The length of the cool down is largely up to you, based on what you consider a normal submission pattern. I’ve found 30 seconds works in most cases.

An obvious way to do this would be to add a timestamp column to the row you insert (which is always a good idea anyway!). Then, when posting, query the table you’re about to insert into for anything from that user within that cooldown period. If any rows are found, don’t allow the insert.

For most sites, implementing something like this will work great. However, and I cannot stress this enough, make sure you have a good database index on your timestamp column. If you do not, every insert will result in a table scan and your site performance will be terrible.
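The database version of that check might look like the sketch below. The model and column names are illustrative, since the post doesn’t show this code; the core decision is pure Ruby so the example runs on its own.

```ruby
# Sketch of the timestamp-column approach (illustrative names, not from
# the post). In ActiveRecord, the lookup would be a query such as:
#   Comment.where(user_id: user_id).where("created_at > ?", 30.seconds.ago).exists?
# backed by an index on (user_id, created_at) so it never table-scans.
COOLDOWN_SECONDS = 30

# Allow the insert only if none of the user's previous submission
# timestamps fall inside the cooldown window.
def allowed_to_post?(previous_post_times, now: Time.now)
  previous_post_times.none? { |t| now - t < COOLDOWN_SECONDS }
end
```

If any recent row is found, the validation fails and the insert is rejected; otherwise the row goes in with its own `created_at` timestamp.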

A faster strategy: Memcached

Memcached allows you to set an expiry on any key. So, instead of using the timestamp column in the database, you can simply set a key in memcached with an expiry equal to your cool down period. Then, when you are about to insert, check whether the key exists in memcached. If it does, don’t allow the insert. If it doesn’t, insert your row and then add the key.
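That check-then-set flow can be sketched as follows. The in-memory Hash here is just a stand-in for memcached so the example runs on its own; a real client would set the key with the expiry and let memcached evict it.

```ruby
# Sketch of the memcached check-then-set flow. The Hash stands in for
# memcached: we track each key's expiry ourselves, whereas memcached
# would simply evict the key when its TTL passes.
class FloodGuard
  def initialize(cooldown_seconds)
    @cooldown = cooldown_seconds
    @store = {} # key => expiry Time (memcached handles this for you)
  end

  # Returns true and arms the cooldown if the user may post; returns
  # false if their key has not yet expired.
  def attempt(user_id, now: Time.now)
    key = "flood:#{user_id}"
    return false if @store[key] && @store[key] > now # key still "cached"
    @store[key] = now + @cooldown                    # set key with expiry
    true
  end
end
```

The first submission arms the key; any further submission from the same user before it expires is rejected without touching the database at all.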

It is incredibly fast, and it doesn’t matter if your tables are indexed on the date column or not. In fact, since it’s not tied to your database at all, you can do flood prevention on anything you want (sending emails, real time chat, etc)!

Adding it to ActiveRecord

I have created a custom validation method in ActiveRecord for flood protection. It can be attached to any model using the following simple syntax:

prevent_flood 30.seconds, :user_id

The first parameter is the length of the cooldown period. The second parameter is the column in the model that uniquely identifies the user. In this case, it’s a user_id column.

CACHE_ME is an abstraction I wrote to use Memcached from Ruby. It is initialized to connect to memcached when Rails starts up, and can fairly easily be replaced with however you personally connect to Memcached. The get method returns nil if the key isn’t there, and the put method sets a key to the value “F” with an expiry of cooldown seconds. It doesn’t really matter what value you put in the cache; I just chose F for flood, and because it’s one character long.
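The post doesn’t include the source of prevent_flood itself, so the sketch below is a hypothetical reconstruction of how it could work, with a tiny in-memory stand-in for the CACHE_ME wrapper. Real code would talk to memcached and register the check as a Rails validation rather than exposing a plain method.

```ruby
# Hypothetical reconstruction -- the original implementation isn't shown.
# MiniCache stands in for CACHE_ME: get returns nil on a miss, put stores
# a value with an expiry in seconds (memcached would evict it for us).
class MiniCache
  def initialize
    @store = {}
  end

  def get(key)
    expiry, value = @store[key]
    expiry && expiry > Time.now ? value : nil
  end

  def put(key, value, ttl_seconds)
    @store[key] = [Time.now + ttl_seconds, value]
  end
end

CACHE_ME = MiniCache.new

module FloodProtection
  # Class macro mirroring the post's interface: prevent_flood 30, :user_id.
  # In Rails the check would be registered via `validate`; here it is
  # exposed as an instance method, #flooding?, for illustration.
  def prevent_flood(cooldown, user_column)
    define_method(:flooding?) do
      key = "flood:#{self.class.name}:#{send(user_column)}"
      return true if CACHE_ME.get(key) # still cooling down: block the insert
      CACHE_ME.put(key, "F", cooldown) # arm the cooldown window
      false
    end
  end
end
```

A model would then `extend FloodProtection` and declare `prevent_flood 30, :user_id`: the first submission arms the key, and any further submission from the same user before it expires is rejected.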

Personally, flood protection is something I never really implemented until it became a problem, partly because it was such a pain in the butt to code for every model. With this interface, however, I am now using it in all new code from the beginning. The overhead is minimal, and it can really save your butt down the line!

8 Responses

I have been looking for simple-stupid ways of controlling floods of unwanted traffic for ages; without memcached, a cluster of servers gets worse at detecting the problem the bigger the cluster gets. Indexing the created_at column is a big waste. Your solution is brilliant.

Howevah…

What about the bots you want? Sure, they’re not POSTing comment spam with links to V1@gr@ sites or anything, but not every rogue bot is a POSTer. Whitelists and blacklists, well, they all suck, and heuristic User-agent bot detection sucks only a bot, er, a bit less.

Trying to separate the sheep from the wolves is a challenge indeed.

Wouldn’t it be cool to make a little non-human detector based on patterns of behavior? Sure, the wget script would be simple to catch: same IP, same User-agent, same rate. But it’s much harder to detect the smarter ones (like ones I have written in past, dark and evil days) that snarf up content irregularly (using rand()) and with innocent user-agents, but which still happen with uncanny regularity.

Some patterns are intentional (Googlebot has a well-known list of IPs it comes from) and therefore helpful. Anything violating robots.txt rules is dead meat. Any IP that doesn’t also get the images on your page is either a) the last remaining Lynx user, or b) a bot. But it’s tying all this together, especially in a large, clustered environment, that makes the problem hard.

I think memcached can be used to aggregate this information in the same way you wrote about, and that should make the problem a) much simpler and immediate, and b) much lighter-weight.

And when it’s all done, all we need to do is figure out how to end spam once and for all by turning the bots upon each other in some n-squared kind of way that makes the rest of us blokes just trying to focus on doing good things laugh with glee as the spambots self-destruct. Moo ha ha!

And if there’s not a good algorithm here, there’s gotta be a good B-movie plot.