SwiftServe is a Singapore-based, privately held technology company focused on developing innovative products for telecommunication service providers. Service providers worldwide are facing increasing business pressures as the telco network evolves from a voice- and data-centric network to a network driven by video content and applications. This transition to a video-centric “content and app” network requires heavy infrastructure investment by service providers, while presenting few opportunities for traffic monetization.

The Swift range of content caching, content distribution and subscriber management products allows our service provider customers to efficiently manage the business costs of their “content and app” network. At the same time, our service provider customers now have the ability to monetize their network assets by creating differentiated service offerings through content and subscriber awareness.

a moment in which one realizes that one has made an error, typically by pressing the wrong button.

It’s more commonly referred to in Operations Management parlance as:

OHGODFUARGHFLKJAFWHATIDIDIDOARGHNO

It is, unfortunately, something that will happen to everyone at some point during their systems administration career, and the variations are almost endless. Some notable occurrences include:

Copying SSL libraries over from a Debian host to a RHEL host

Setting a new root password and immediately losing it

Copying an out-of-date CMS backup over a production system

Running one of the many variations of ‘rm’ at the wrong level

Unfortunately in a recent scenario, a poor hypothetical sysadmin managed to issue:

rm /*

instead of:

rm ./*

This removed every non-directory at the / level. The impact of this varies between operating systems, and even between Linux distributions. We’re lucky that in this scenario there was no ‘-rf’ specified, or it would have been ‘recover from backup’ time; however, this situation did (hypothetically) pose an interesting conundrum.

The Problem

On Debian x86_64 systems, /lib64 is a symlink to /lib, and you’ll find most applications (for instance, ‘ldd’) are linked against libraries in /lib64.
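
For illustration (the exact output will differ), this is roughly what the dynamic linkage of a binary looks like; note the interpreter on the last line, living under /lib64:

$ ldd /bin/ls
        linux-vdso.so.1 =>  (0x...)
        libselinux.so.1 => /lib/libselinux.so.1 (0x...)
        libc.so.6 => /lib/libc.so.6 (0x...)
        /lib64/ld-linux-x86-64.so.2 (0x...)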

If /lib64 is not accessible, most applications will fail to run, because they’ll be missing a myriad of dependencies and won’t be able to find them. A bit of brief investigation followed, along with some furious attempts to revive the box, all with frankly disappointing results, including:

Using a statically compiled symlinker such as sln (available by default on CentOS and RHEL, but not on the affected Debian box)

Copying over sln via netcat and writing it out (proved surprisingly difficult)

Trying to copy over a symlink via rsync (couldn’t rsync/scp/sftp as they need to exec another process – which they can’t because of missing libraries)

Using BusyBox (it needed dynamic linking too)

Writing a linker in C, compiling it, getting it over there via a mixture of cat, echo \x{..}\x{..}, and other incantations (I lost the will to live around this point)

The Epiphany

I eventually remembered a slideshow – chmod -x chmod – which was surprisingly relevant. You see, the more eagle-eyed may have noticed that we would end up missing one important dependency: ld-linux-x86-64.so.2.
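
For illustration only: the loader’s path is baked into every dynamically linked ELF binary as an absolute path, which is why losing the /lib64 symlink is so crippling.

# Illustrative: the interpreter path is hard-coded into the binary, so with
# /lib64 gone, the kernel can no longer start dynamically linked executables.
$ readelf -l /bin/ls | grep interpreter
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]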

ld-linux and ld-linux-x86-64 find and load shared libraries used by other applications – preparing programs to run, and then actually executing them too. Most Linux binaries require dynamic linking, meaning at runtime the libraries that the application depends on are loaded in from a shared source rather than compiled into the executable, unless the -static option was used during compilation. As this is quite unlikely (with most modern distributions), this means if you cannot access ld-linux.so, you’re in trouble. Luckily, you can still use ld-linux.so to execute arbitrary commands, and it’ll resolve the dependencies relative to your LD_LIBRARY_PATH at that point. A simple:

/lib/ld-2.11.1.so /bin/ln -s /lib/ /lib64/

Restored the symlink and allowed normal execution of binaries again, leaving our hypothetical sysadmin off the hook, except for having to write a mildly humiliating email to the rest of the operations team who, understandably, responded a bit like this.

I recently migrated away from Tumblr, as I found that Tumblr was heading more towards micro-blogging, reducing the size of the posting editor (seriously? LOOK AT THE PROPORTIONS), which made embedding code snippets or writing lengthier posts pretty arduous. As an unapologetic geek, WordPress seemed like the natural choice.

The Tumblr to WordPress Import Process did a reasonably good job of importing everything, but I wanted to make sure not to lose the (already indexed/linked) URLs. Unfortunately this wasn’t quite as easy, as defining custom permalinks on a per-post basis in WordPress still appears to be manual (via an .htaccess). To resolve this generically (without having to make a new alias for the mammoth number of posts I had, ahem), I simply set the permalink format to the name of the post (which follows the same format as Tumblr) and defined a RedirectMatch rule in my .htaccess.
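
Something along these lines does the job (this assumes Tumblr’s usual /post/<id>/<slug> URL format and the post-name permalink structure in WordPress; the exact rule may differ):

# Redirect Tumblr-style /post/<id>/<slug> URLs to the WordPress /<slug>/ permalink.
RedirectMatch 301 ^/post/\d+/([^/]+)/?$ /$1/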

Update: Before going much further, there is now a much more comprehensive CORS walkthrough for nginx at enable-cors.org, so check that out before following the steps below.

If you’ve deployed even a mildly complex web application in the last few years, you’ve probably had to care about CORS headers. They allow webpages to make requests to another domain, or the same domain on another scheme. Without them, you’ll find that trying to request other assets will be forbidden by your browser, and things won’t load.

They’re relatively simple to implement. You just add a header:

Access-Control-Allow-Origin: https://www.alexjs.im

to the HTTP responses of assets you’d like to call in your webapp. Thanks to Michiel Kalkman’s gist you can easily achieve this in Nginx – with something relatively standards compliant, too.
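
In Nginx, the single-origin case is just an add_header directive on the relevant location (the path here is illustrative):

# Simplest case: one fixed, trusted origin.
location /assets/ {
    add_header Access-Control-Allow-Origin "https://www.alexjs.im";
}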

The problem, it seems, is that despite the W3C spec and RFC 6454 prescribing the use of a list of origins, not all browsers (e.g. Firefox) support multiple domains in an Access-Control-Allow-Origin header:

However, that can have security implications. The best compromise I’ve found to get around this was to implement a simple whitelist in the Nginx config and match the request’s origin against it. I’ve put this in a public gist – and I’m testing it for deployment now.
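
The general shape of the approach, with illustrative domain names (this sketch uses an nginx map; the gist may differ in the details):

# Whitelist sketch; the domain names are illustrative.
# Map the incoming Origin header to itself when it is whitelisted, and to an
# empty string otherwise (nginx skips add_header when the value is empty).
map $http_origin $cors_origin {
    default                            "";
    "~^https?://(www\.)?alexjs\.im$"   $http_origin;
    "~^https?://cdn\.example\.com$"    $http_origin;
}

server {
    listen 80;
    location / {
        add_header Access-Control-Allow-Origin $cors_origin;
    }
}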

I’ve not yet done any performance testing, so I’m not sure how efficient the Nginx regex engine is and what the overall effect on throughput/capacity is. I’ll probably forget to update this post with a bit of information once that’s complete.

Update:

This has been in production for a couple of months now, and we haven’t had any performance issues. It seems that for the throughput we require (<10 req/s) we’re able to handle the load on a single m1.small comfortably, so I think the nginx regex engine’s pretty efficient.

Celery is a distributed task queue for Python. It’s pretty useful, and a lot of apps I’m involved in deploying seem to be using it lately.

Something it seems to struggle with is stability; in the event of a database disappearing, being unable to resolve a database’s hostname, or a single connection to a database failing, it just shuts down.

I needed this not to happen. When running things in “the cloud” (sorry), you’re very much at the mercy of other people controlling your networking/tin/everything, so you need to write applications that can tolerate a little bit of failure (even if the application was originally written this way to avoid split brain or similar). To get around this, we implemented monit. I am definitely not a fan of apps automatically restarting, but it was the only trivial resolution in this situation. Just append a stanza like the one below to your monit config and you should be sorted. My understanding is that there isn’t a better solution yet, but I’d be interested to know if anyone has seen one.
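
The exact stanza depends on how Celery is started; assuming an init script and a pidfile (paths and service name hypothetical), it’s roughly:

# Hypothetical paths/init script; adjust to match your deployment.
# monit restarts the worker whenever it finds the process missing, i.e. after
# Celery has shut itself down following a broker/database failure.
check process celeryd with pidfile /var/run/celeryd.pid
    start program = "/etc/init.d/celeryd start"
    stop program  = "/etc/init.d/celeryd stop"
    if 5 restarts within 5 cycles then timeout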

When using nginx as a caching proxy, I found myself needing to ignore particular parameters for both the cache key and the request being passed to the backend. In this particular situation the parameter I wanted to ignore was ‘uid’. An example URI being:
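
Something like /content/video.ts?uid=abc123&bitrate=500 (a hypothetical URI, but representative: the uid varies per subscriber while the response does not). One way to strip it out of both the cache key and the request passed upstream is roughly this; the upstream name and cache zone are placeholders:

# 'backend' and the 'my_cache' zone (defined elsewhere via proxy_cache_path)
# are hypothetical; the uid-stripping is the point.
location / {
    # Remove uid=... from the query string (a stray leading '&' can remain in
    # edge cases, which is harmless for caching purposes).
    if ($args ~ "^(.*?)(^|&)uid=[^&]*(.*)$") {
        set $args "$1$3";
    }

    # Use the cleaned query string both in the cache key and upstream.
    proxy_cache        my_cache;
    proxy_cache_key    "$scheme$proxy_host$uri$is_args$args";
    proxy_pass         http://backend$uri$is_args$args;
}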

It took a remarkably large amount of searching to find out what ‘[7.2]’ meant in this error message, and why we kept getting a mailserver’s IP blacklisted, but if this happens to you, hopefully this will help resolve it.

When MessageLabs returns a [7.2], this seems to mean that they’ve checked the IP address of the host which is connecting to their MX against the CBL. Connections will be dropped immediately, rather than mail being rejected, as such:

The easiest way to get around this is to fix your mail server, then request delisting from the CBL.

On a completely unrelated note (ahem), it seems that you may be added to the CBL if you send an email from a domain where the sending mail server is explicitly disallowed by SPF records (such as -all with no matching include) to a Gmail address; Google will automatically (?) submit the IP address to the CBL and your problems will begin (again).
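
For reference, a hard-fail SPF record of the kind described looks like this (the domain and include are illustrative):

; Illustrative record: only hosts covered by the include may send for this
; domain; anything else hard-fails ("-all").
example.com.  IN  TXT  "v=spf1 include:_spf.google.com -all"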

I highly recommend robtex as a lazy way to check your hosts against blacklists.

Hypothetical situation. You installed VMware ESX, possibly upgraded from 3.5 to 4, went with the embedded SQL Server, and Many Years Later the VirtualCenter server no longer starts. You look through the event logs and the best you can find is:

So you decide to look at general application eventlog events rather than just for VMware:

Could not allocate space for object ‘dbo.VPX_EVENT’.’PK_VPX_EVENT’ in database ‘VIM_VCDB’ because the ‘PRIMARY’ filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

“Great”, you think. I can just pass this over to a DBA to get them to increase the filegroup size. Then you dig a bit deeper and look at the event log for SQL Server:

Only one directory was (a while ago) exported by NFS, and it isn’t one of those affected; the box has never mounted anything over NFS. It seems the error can be caused when a file is open and the disk falls out from underneath it, and an ambiguous error code is sent back which gets interpreted as a stale file handle. Either way, the superblock on this particular FS is corrupted, so the next step would be to attempt to recover using one of the backup superblocks. I’ll attempt this later and let you know how it goes. I’m sure you’ll be on the edge of your seats.
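
The plan, roughly (the device name is a stand-in for the real one):

# List the backup superblock locations for the filesystem, then point e2fsck
# at one of them instead of the corrupted primary. /dev/sdb1 is hypothetical.
dumpe2fs /dev/sdb1 | grep -i superblock
e2fsck -b 32768 /dev/sdb1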