I'm stress testing a site that we're making, and we're finding a very surprising result compared to my expectations:
Our site starts to load very slowly with a few hundred simultaneous users, even though CPU and memory are fine. Looking at the Networking tab in Task Manager, I see that my 100 Mbps network card is maxed out at 98%.

For some reason, this sounds extremely ridiculous to me...
Every time I read something on scalability it's CPU, Memory, Caching, etc, etc, and here I'm getting the bottleneck on the network card itself.

We serve all our content gzipped, and our home page is kind of heavy, but not THAT much. I would've never expected the network card to be the bottleneck.

Is this normal?
Is everyone having public facing websites using a 1Gbps network card?
I thought 100 Mbps would be the standard.

Am I looking at something wrong? Am I interpreting the graph in the Networking tab incorrectly?

NOTE: I can think of a number of ways to fix this, starting with getting a 1 Gbps card, and moving static files to their own server(s). My question is mostly around whether everyone is simply using 1 Gbps connections, which would surprise me enormously.
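As context for the note above, here's a back-of-envelope calculation of what a 100 Mbps link actually buys; the 500 KB page weight is a made-up figure for illustration, not a measurement from this site:

```python
# How many page loads per second can a 100 Mbps link sustain?
LINK_MBPS = 100
page_kb = 500  # hypothetical gzipped page weight, including all assets

link_bytes_per_sec = LINK_MBPS * 1_000_000 / 8          # 12.5 MB/s
pages_per_sec = link_bytes_per_sec / (page_kb * 1024)

print(round(link_bytes_per_sec))       # 12500000
print(round(pages_per_sec, 1))         # 24.4
```

With a few hundred simultaneous requests against a ceiling of roughly 25 full page loads per second, the link saturating first is plausible.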

This question came from our site for professional and enthusiast programmers.

Is the site making any outbound connections, for example, to a database on a separate server?
– Rowland Shaw, Sep 16 '09 at 17:53

Not at all, it's all still in one server for now.
– Daniel Magliola, Sep 16 '09 at 18:05

Is it the load test that is making the network card go to 100% usage, or is it something else? Can you run the load test from the machine itself?
– user20202, Sep 16 '09 at 18:07

I don't really know, but I can't think of anything else. This is a staging server we have that is an exact clone of our live server, so no traffic is going to it. The test is being run by someone outside our company who's helping us, so we can't really run it ourselves on the server. Thanks!
– Daniel Magliola, Sep 16 '09 at 18:10

Are you sure that 100 Mbps card is actually set to 100 Mbps? It's not uncommon for a crappy card talking to a crappy switch to end up autonegotiating down to 10 Mbps.
– nos, Sep 16 '09 at 18:51

4 Answers

It sounds exactly like you are saturating your available bandwidth. You either need to cut down on the bandwidth you use or switch to a 1 Gbps card, which is what I'd normally expect to find in a public-facing web server (it's certainly been the case with every server-class machine I've touched in the last 10 years; where did you find a server with a cheapo 100 Mbps card anyway? Is it really a repurposed desktop?).

Some things to check or consider:

You don't mention caching. If your site is set up in some way that does not return good caching headers for static files (such as images), you will take a large hit. Use Firefox with the YSlow add-on (from Yahoo) to see the pie chart comparing cached and uncached page size.
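As a minimal sketch of what "good caching headers" means here, this builds the two headers browsers look at; the one-year lifetime is just a common convention, not something mandated by your setup:

```python
# Far-future caching headers for static files (illustrative values only).
from email.utils import formatdate
import time

ONE_YEAR = 365 * 24 * 3600  # 31536000 seconds

headers = {
    # Cache-Control is the HTTP/1.1 mechanism
    "Cache-Control": f"public, max-age={ONE_YEAR}",
    # Expires is the HTTP/1.0 equivalent; usegmt=True gives an RFC 1123 date
    "Expires": formatdate(time.time() + ONE_YEAR, usegmt=True),
}

for name, value in headers.items():
    print(f"{name}: {value}")
```

With these set, repeat visitors fetch images, CSS, and JS from their local cache instead of your 100 Mbps pipe.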

What is your testing methodology? Are your 100 "users" just hitting the site as fast as possible? What about caching: if your users are just bots that keep grabbing some page, they may be ignoring your caching hints (see the previous point).

You are using gzip compression, but how much of your content is text (where gzip helps a lot) and how much is images and other binary files (where gzip usually does nothing)?
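A quick way to see that point in action; the repeated JS-like string and the random bytes are stand-ins for real text assets and already-compressed images:

```python
# gzip shrinks text dramatically but does almost nothing for binary data
# that is already compressed (JPEGs, PNGs, etc.).
import gzip
import os

text = b"function renderThumbnail(img) { return img.width / 4; }\n" * 200
binary = os.urandom(len(text))  # stands in for an already-compressed image

print(len(gzip.compress(text)) / len(text) < 0.1)      # True: big win on text
print(len(gzip.compress(binary)) / len(binary) > 0.9)  # True: almost no win
```

So if most of the bytes on the wire are images, turning gzip up further won't move the needle.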

Are you using your network bandwidth for any other functions, such as a separate database server?

You are not really specific about how big your pages are (use YSlow to find out). Are you perhaps using large images in place of thumbnails? I've seen my fair share of sites with several megabytes worth of images on one page because the designer (or the designer's tools) just used the width and height HTML attributes to shrink full-size image files into thumbnails.

The server came from SoftLayer, and 100 Mbps is kind of the default. I'm guessing the only reason for this is that most people don't need more, and they can charge you a premium price to upgrade to 1 Gbps (which we will now, for sure).
– Daniel Magliola, Sep 16 '09 at 19:30

As for your other points: I have an Expires header set to somewhere in 2020. Testing methodology: I don't know, an external team is helping us with the stress testing. They are probably ignoring caching hints, which would be right for this case, since they're only loading ONE page (the home page), and honoring the hints would mean simulating separate users. They are probably hitting the site as fast as they can. I'm not really worried about "the exact number of users our site supports" right now; for that I would ask for more data about the test. I'm just worried now about the network being the bottleneck, since that's what they told me.
– Daniel Magliola, Sep 16 '09 at 19:33

gzip: A LOT of our content is JS, and there's not much more I can do about that than turning compression on at a decent level, I guess. Other functions: nope, everything is in one box.
– Daniel Magliola, Sep 16 '09 at 19:35

As for how big the page is... At this point I'm just completely taken by surprise that a 100 Mbps network card can be the bottleneck AT ALL. Evidently, I thought 100 Mbps was much more than it really is. In the future, I will of course look into making the page lighter, given this. For now, I just wanted some reassurance that what I'm seeing is possible, and that I'm not looking in the wrong place. You have successfully convinced me of that with your first paragraph. Thanks for your comprehensive answer!
– Daniel Magliola, Sep 16 '09 at 19:36

Bandwidth as the first bottleneck is not something I'm too surprised about. CPU, RAM, HD and all the other components have all come on in leaps and bounds over the years, whereas 100 Mbps has been there for over a decade now. So you're in a situation where you've got a good box that's more than capable of handling a typical load, but it's connected using decade+ old technology.

Even so, are you quite certain that your 100-simultaneous-user simulation is an accurate reflection of what real-world traffic would look like? With 100 absolutely simultaneous hits, serving just 1 megabit (128K) to each of them is enough to hit peak traffic. That's a very low ceiling, and my feeling is that, unless you're certain you're going to be getting that kind of usage, you might need to revise your load testing.

Mh, thank you for your answer; you've replied to exactly what I was looking for. Honestly, I don't know much about how the test is done, and I'm going to find out as soon as they are done and report some actual figures back to me. For now, I just wanted to know whether the network COULD effectively be the bottleneck. I'll have to upgrade the card so that something else starts to struggle, and then we'll see some real data. Thanks!
– Daniel Magliola, Sep 16 '09 at 19:38

For years the simpler, less capable web servers have been touting their speed, and for years Apache aficionados have been noting that Apache is fast enough to easily saturate the network interface. Sounds like you have an efficient site. Are you really pumping 100 megabits, or is the network stack just taking up a lot of CPU?

Honestly, I don't really know. CPU is around 10%. The "Networking" tab of Task Manager shows my "PublicNetwork" card at 98%. I'm not sure what that graph is measuring exactly, but the only thing I can think of is bytes transferred / max card capacity. I may be wrong, though. As for an efficient site... Maybe. Maybe I just have too many, too-heavy images :-)
– Daniel Magliola, Sep 16 '09 at 18:12

PEra, any suggestions as to how I can do all of this? :-D (consider me a n00b sysadmin). If it helps, though, this is a dedicated server at SoftLayer.com, so I don't really have access to things like swapping cables and NICs unless I have a solid diagnostic that would force THEM to swap them. I'm going to be upgrading to gigabit, for sure. Thanks!
– Daniel Magliola, Sep 16 '09 at 19:40