I have 4 nginx-powered image servers on their own subdomains which users would access at random. I decided to put them all behind a HAProxy load balancer to improve the reliability and to see the traffic statistics from a single location. It seemed like a no-brainer.

Unfortunately, the move was a complete failure: with all requests now going through it, the load balancer's 100 Mbit port was completely saturated.

I was wondering what to do about this - I could get a port upgrade ($$) or return to 4 separate image servers that are randomly accessed. I thought about putting HAProxy on each image server which would in turn route to another image server if that server's nginx service was having trouble.

What would you do? I would like to not have to spend too much additional money.

Not at all cost feasible at high volumes. We save a ton of money using our own dedicated servers.
– Tom, Jun 7 '12 at 23:59

The purpose of a CDN IS dealing with high-volume traffic, and it won't hurt if he talks to a few CDN providers and asks them what rates they can offer for his use case.
– rackandboneman, Jun 8 '12 at 1:08

Amazon S3, based on our traffic stats = $8000/month. Our LAMP cluster = $1000/month. A CDN is not for high traffic. Or at least, not the "high" you're referring to, unless the business model can afford to throw money at IT.
– Tom, Jun 8 '12 at 1:32

Plenty of massive sites use a CDN. Amazon is far from the cheapest option (their CDN is CloudFront, not S3), especially when you negotiate. Don't forget to count the time you spend running your own cobbled-together CDN in the cost comparison.
– ceejayoz, Jun 8 '12 at 4:11

2 Answers

Whatever breaks your nginx (overload, hardware defect) would probably also break your HAProxy on the same machine. It would probably work best to get an additional IP for each server (configured as an alias on the interface) and use that as the address, directly or via a DNS name, that you publish in your image URLs. Build a script that relocates the secondary IP to another server in case of serious problems. The devil is in the details, chiefly in making sure that the IP is safely taken away from the failed server. If the script can no longer log into the failed server and deallocate the IP alias, the best thing to do is to shut that server down hard via IPMI, if it is available.
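To make the ordering of that relocation concrete, here is a minimal Python sketch of the decision logic only. The three callables are hypothetical placeholders for whatever your environment uses (an SSH command that removes the alias, an IPMI power-off, adding the alias on the standby box). The "abort if neither works" branch is my own assumption about what "safely" should mean, not something you have to adopt.

```python
from typing import Callable, List


def relocate_alias(
    release_on_failed: Callable[[], bool],
    power_off_failed: Callable[[], bool],
    claim_on_standby: Callable[[], None],
) -> List[str]:
    """Move an IP alias only once the failed server can no longer answer for it.

    All three callables are hypothetical hooks supplied by the caller.
    Returns the list of steps that were actually taken.
    """
    steps: List[str] = []
    if release_on_failed():
        # Clean case: we could still log in and remove the alias.
        steps.append("released")
    elif power_off_failed():
        # Could not log in, so fence the machine (e.g. via IPMI).
        steps.append("powered_off")
    else:
        # Assumption: claiming the IP now could leave two hosts
        # answering for the same address, so stop and alert a human.
        steps.append("aborted")
        return steps
    claim_on_standby()
    steps.append("claimed")
    return steps
```

The point of the structure is that the standby never claims the address until one of the two "take it away" paths has reported success.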

As an alternative, you could install something CGI-ish on the fourth server that just redirects to a random choice among the available servers. Control the list of servers it can redirect to with a periodic monitoring script (you could misuse the Nagios check_http plugin for that, for example). As an extension, that script could also accept an exclusion list from another file, which is really handy if you need to quiesce one of the servers for maintenance.
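A rough Python sketch of that redirector, under some assumptions: the monitoring script has already produced the list of available hosts, the exclusion list has been read from its file, and the hostnames below are made up. It only builds the CGI response text; wiring it into an actual CGI or web server is left out.

```python
import random
from typing import Iterable, Optional


def pick_target(
    available: Iterable[str],
    excluded: Iterable[str] = (),
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick a random server that is available and not excluded."""
    chooser = rng or random
    skip = set(excluded)
    candidates = [host for host in available if host not in skip]
    if not candidates:
        return None
    return chooser.choice(candidates)


def redirect_response(path: str, available: Iterable[str],
                      excluded: Iterable[str] = ()) -> str:
    """Build CGI-style headers that redirect the request to a chosen server."""
    host = pick_target(available, excluded)
    if host is None:
        # Nothing left to redirect to.
        return "Status: 503 Service Unavailable\r\n\r\n"
    return f"Status: 302 Found\r\nLocation: http://{host}{path}\r\n\r\n"


# Hypothetical usage: img2 is being quiesced for maintenance.
hosts = ["img1.example.com", "img2.example.com", "img3.example.com"]
print(redirect_response("/cats/1.jpg", hosts, excluded=["img2.example.com"]))
```

Keeping the exclusion list separate from the health list means a maintenance drain never depends on the monitor noticing anything.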

Also, the suggestion about using a CDN is not that misguided: if you have static file traffic that saturates a 100 Mbit line, you are talking about traffic in the terabytes to tens of terabytes per month, depending on usage patterns.
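For a back-of-the-envelope check of that range, here is a small Python calculation. It assumes decimal units (1 Mbit = 1,000,000 bits, 1 TB = 10^12 bytes) and a 30-day month; the utilisation figures are illustrative, not measured.

```python
def monthly_terabytes(link_mbit_per_s: float,
                      utilisation: float = 1.0,
                      days: int = 30) -> float:
    """Data moved per month over a link at a given average utilisation."""
    bytes_per_second = link_mbit_per_s * 1_000_000 / 8 * utilisation
    return bytes_per_second * 86_400 * days / 1e12


# A 100 Mbit port that is full around the clock, and one averaging 10%.
print(round(monthly_terabytes(100), 1))       # 32.4
print(round(monthly_terabytes(100, 0.1), 2))  # 3.24
```

So a port that only saturates at peak hours lands somewhere between a few and roughly 32 TB per month.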

Solution 1. Active DNS monitoring/advertising

Your name servers then need to actively poll the HTTP status of each IP (as you would monitor it with the load balancer) and stop advertising an IP when its state is 'down'. Make it a thorough test, but avoid any services common to all the boxes (e.g. a single DB backend). When a node fails the monitor, it stops being advertised in DNS.

The catches:
More DNS requests due to the low TTL.
Failover takes "DNS TTL" seconds (and some people like to disobey TTLs as well).
Your name servers need to be relatively close to the services, or have sensible defaults configured for cases such as a network outage between the NS and your image servers.
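As a sketch of the advertising decision, including the "sensible default" from the last catch: the Python function below takes the latest health results and returns the addresses to publish. The fallback rule (advertise a fixed list when every check fails, on the theory that the monitor itself is probably cut off) is my assumption of what a sensible default could look like, and the addresses are documentation-range placeholders.

```python
from typing import Dict, List


def records_to_advertise(health: Dict[str, bool],
                         fallback: List[str]) -> List[str]:
    """Return the A records to publish given the latest HTTP check results.

    health maps an IP to True (check passed) or False (check failed).
    """
    up = [ip for ip, ok in health.items() if ok]
    # Assumption: if nothing looks healthy, the problem is more likely
    # between the name server and the image servers, so keep advertising
    # the fallback set rather than publishing an empty answer.
    return up if up else list(fallback)


all_ips = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
checks = {"192.0.2.1": True, "192.0.2.2": False,
          "192.0.2.3": True, "192.0.2.4": True}
print(records_to_advertise(checks, fallback=all_ips))
```

How you push the result into your DNS server depends entirely on which one you run, so that part is not shown.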

You could also use 4 separate domain names, each of which fails back to another IP via the same method.

Solution 2. IP failover

rackandboneman's IP takeover can be implemented without much hassle on Linux with keepalived/LVS using VRRP. (This assumes your boxes are running Linux and are close to each other on the network; other OSes such as BSD and Solaris have their own VRRP/CARP implementations.)

With 4 boxes you can create a circle topology for the virtual IP (VIP) failover, which means you can lose 2 boxes next to each other but only lose 1 VIP. The leftmost box inside each [] has the highest priority for that VIP.

Or use 3 nodes per VIP, in priority order: a more complex setup, but better availability.

nodes [1 - 2 - 3] [2 - 3 - 4] [3 - 4 - 1] [4 - 1 - 2]
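If you want to check what a given ring survives, here is a small Python sketch that generates the priority lists and shows which node would hold each VIP after some failures. It is just a model of the priority ordering, not keepalived itself. The two-nodes-per-VIP ring ([1 - 2] [2 - 3] [3 - 4] [4 - 1]) is my reading of the circle topology described above.

```python
from typing import List, Optional, Set


def priority_ring(nodes: int, per_vip: int) -> List[List[int]]:
    """One priority list per VIP; the first node has the highest priority."""
    return [
        [(start + offset) % nodes + 1 for offset in range(per_vip)]
        for start in range(nodes)
    ]


def vip_holders(ring: List[List[int]],
                failed: Set[int]) -> List[Optional[int]]:
    """Node currently holding each VIP, or None if the VIP is lost."""
    return [
        next((node for node in group if node not in failed), None)
        for group in ring
    ]


print(priority_ring(4, 3))
# [[1, 2, 3], [2, 3, 4], [3, 4, 1], [4, 1, 2]]

# Lose adjacent boxes 1 and 2:
print(vip_holders(priority_ring(4, 2), {1, 2}))  # [None, 3, 3, 4]
print(vip_holders(priority_ring(4, 3), {1, 2}))  # [3, 3, 3, 4]
```

With two nodes per VIP, losing boxes 1 and 2 drops one VIP; with three per VIP the same failure drops none, which is the availability gain paid for with the more complex setup.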

With keepalived, I would set up the monitor script to hit the same local HTTP service your load balancer would hit, to judge the health of the server. Also make sure the VRRP traffic uses the same interface as the real traffic if you have multiple NICs.
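A minimal sketch of such a monitor script in Python, assuming keepalived runs it as a tracked check script and treats exit status 0 as healthy and anything else as failed. The URL is a hypothetical test object served by the local nginx; pick something that exercises the same path as real image requests.

```python
import urllib.error
import urllib.request

# Hypothetical local test object; replace with a real file on your server.
CHECK_URL = "http://127.0.0.1/health.gif"


def is_healthy(url: str, timeout: float = 2.0,
               opener=urllib.request.urlopen) -> bool:
    """True only if the local HTTP service answers 200 within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Refused connections, timeouts and HTTP error codes all count as down.
        return False


def exit_code(url: str = CHECK_URL, opener=urllib.request.urlopen) -> int:
    """Exit status for the check script: 0 when healthy, 1 otherwise."""
    return 0 if is_healthy(url, opener=opener) else 1
```

When saved as a standalone script, it would end with `sys.exit(exit_code())` so that the exit status reaches keepalived. Keep the timeout shorter than the check interval so a hung nginx is reported as a failure rather than stalling the check.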