The hidden performance bottleneck: Network

As mentioned several times here, hardware cannot be treated as a black box. Every MySQL professional charged with performance tuning has to understand where the often-overlooked bottlenecks can occur. They can occur anywhere in the system: disk, CPU, memory, networking. Everyone who reads my blog knows that I have beaten the disk horse until it was a bloody corpse, although I still believe too many people ignore disk performance… Everyone looks at CPU; in fact, every monitoring tool known to man seems to include CPU stats. But what about network performance? The network is taken for granted even more than disk is. Most people don’t give a second thought to what’s happening between servers. After all, isn’t that the “network team’s” job? Unfortunately, I run into network problems more often than I would like. What form do these take? A couple of examples:

Simple network configuration problems. I have run into a handful of folks who, for whatever reason, had their 1GbE NICs set to 100Mb or even 10Mb… others were set to half duplex… it pays to check.

Trying to use NFS mounts and/or some IP-based disk access over the same NIC as your application traffic (sharing the network between the app and database server with something else).

Returning tons of data across the network, only to throw away 90% of the data you retrieved (an application issue that causes network issues).

Oddball switches.

Backing up data across the same NIC/network as your application.
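Checking the first item on that list is cheap. Here is a minimal sketch, assuming a Linux host where the kernel exposes the negotiated link settings under /sys/class/net/<interface>/speed and /sys/class/net/<interface>/duplex. The function names, the `eth0` interface name, and the 1000 Mb/s expectation are my own placeholders, not anything standard:

```python
from pathlib import Path


def read_link_settings(iface, sysfs_root="/sys/class/net"):
    """Return (speed_mbps, duplex) for iface; a value is None if it cannot be read."""
    base = Path(sysfs_root) / iface

    def read(name):
        try:
            return (base / name).read_text().strip()
        except OSError:
            # Missing interface, or the driver refuses to report (e.g. link down).
            return None

    raw_speed = read("speed")
    try:
        speed = int(raw_speed) if raw_speed is not None else None
    except ValueError:
        speed = None
    if speed is not None and speed < 0:
        speed = None  # the kernel reports -1 when the speed is unknown
    return speed, read("duplex")


def link_problems(speed, duplex, expected_mbps=1000):
    """List anything that looks wrong compared to the link we expected to get."""
    problems = []
    if speed is None:
        problems.append("link speed unknown")
    elif speed < expected_mbps:
        problems.append(f"negotiated {speed} Mb/s, expected {expected_mbps} Mb/s")
    if duplex != "full":
        problems.append(f"duplex is {duplex}, expected full")
    return problems


if __name__ == "__main__":
    # "eth0" is a placeholder; substitute the interface your database traffic uses.
    speed, duplex = read_link_settings("eth0")
    for problem in link_problems(speed, duplex):
        print("eth0:", problem)
```

An empty list from `link_problems` only tells you the NIC negotiated what you expected; it says nothing about the switch on the other end or about what else is sharing the link.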

Consider this: I was recently working with a client who ran select * against all their tables, even when they only needed an ID. As if this was not bad enough, lots of the columns held 6-12K text fields… this means that even when they really only needed < 10 bytes of data, they were transferring 50K+ for each record. This adds up.
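To put a rough number on "adds up", here is a back-of-the-envelope sketch using the per-record figures above (about 10 bytes needed, about 50K transferred). The row count is purely an assumption for illustration, and the helper names are mine:

```python
def wasted_fraction(bytes_needed, bytes_transferred):
    """Fraction of the transferred bytes that the application throws away."""
    return 1 - bytes_needed / bytes_transferred


def total_transfer_mb(rows, bytes_per_row):
    """Total payload in megabytes for a given number of rows."""
    return rows * bytes_per_row / (1024 * 1024)


BYTES_NEEDED = 10          # the ID the application actually wanted
BYTES_SENT = 50 * 1024     # roughly what select * returned per record
ROWS = 100_000             # hypothetical number of records fetched

if __name__ == "__main__":
    print(f"wasted per record: {wasted_fraction(BYTES_NEEDED, BYTES_SENT):.2%}")
    print(f"sent with select *: {total_transfer_mb(ROWS, BYTES_SENT):,.0f} MB")
    print(f"sent with ID only:  {total_transfer_mb(ROWS, BYTES_NEEDED):,.2f} MB")
```

With those assumed numbers, well over 99% of every record is discarded on arrival, and the gap between the two queries is gigabytes versus about a megabyte on the wire.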

Recently I was testing out a new memcached server for Waffle Grid. As you know, Waffle Grid is very sensitive to network latency. I was testing with a 1GbE switch and ended up with an average memcached get time of 1.9ms. This was much higher than in earlier tests with a 1GbE crossover, so I replaced the switch… the result was a drop to around 1.2ms per get. While 0.7ms doesn’t seem like a lot, it is when it’s spread over 1.75 million gets (a 30 minute test run). In fact, that means the gets took about 1.2 million milliseconds, or about 20 minutes, less network time with a better switch. Maybe this is affecting your application as well. Do you really know?
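The arithmetic behind that claim is simple enough to redo for your own numbers. A small sketch using the figures from my test (the function name is just for illustration):

```python
def network_time_saved_ms(gets, before_ms, after_ms):
    """Total milliseconds saved when the average per-get latency drops."""
    return gets * (before_ms - after_ms)


GETS = 1_750_000   # gets during the 30 minute test run
BEFORE_MS = 1.9    # average get time through the original switch
AFTER_MS = 1.2     # average get time after replacing the switch

saved_ms = network_time_saved_ms(GETS, BEFORE_MS, AFTER_MS)
saved_minutes = saved_ms / 1000 / 60

if __name__ == "__main__":
    # Comes out to roughly 1.2 million ms, or about 20 minutes of cumulative get time.
    print(f"saved {saved_ms:,.0f} ms, about {saved_minutes:.1f} minutes")
```

Note that this is cumulative time summed across all gets, so it is a measure of total waiting rather than wall-clock time.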

Networking performance is not going to solve all your issues; in fact, you generally get more bang for your buck tuning elsewhere… but don’t forget about it, otherwise you may be sorry.