I'm not talking about the CPU governor but about DVFS in particular. There's bound to be some small amount of latency involved with the process. Its point isn't best performance but energy efficiency, which is why I made the comment in the first place.Reply

There's the potential for DVFS to optimize for better performance on a few cores while putting some of the other cores into a lower P-state, but I think that would be more for stuff like Turbo Boost/Turbo Core. It's also possible Johan is referring to the potential for the optimizations to simply improve performance in general.Reply

Can you tell me where I got you confused? Because I write "This allowed us to make use of Dynamic Voltage and Frequency Scaling (DVFS, P-states) using the CPUfreq tool. First let's see if all these power saving tweaks have reduced the total throughput."

So it should have been clear that we are looking for a better performance/watt ratio. The interesting thing to note is that ARM benefits from P-states, and that Intel's excellent implementation of C-states makes P-states almost useless. Reply
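
For anyone who wants to check what DVFS is doing on their own box, here is a minimal sketch that reads the CPUfreq policy files Linux exposes under sysfs. It assumes a Linux system with the cpufreq subsystem enabled; the function name `read_cpufreq` and the `root` parameter are made up for this example and are not from the article or any tool it used.

```python
from pathlib import Path


def read_cpufreq(cpu=0, root="/sys/devices/system/cpu"):
    """Return the governor and frequency limits (as strings, in kHz) for one CPU.

    Returns None when the cpufreq directory for that CPU does not exist,
    which is the case on kernels or VMs without frequency scaling.
    """
    base = Path(root) / f"cpu{cpu}" / "cpufreq"
    if not base.is_dir():
        return None

    def read(name):
        # Each attribute is a small text file; a missing one is reported as None.
        f = base / name
        return f.read_text().strip() if f.exists() else None

    return {
        "governor": read("scaling_governor"),
        "min_khz": read("scaling_min_freq"),
        "max_khz": read("scaling_max_freq"),
        "cur_khz": read("scaling_cur_freq"),
    }
```

Calling `read_cpufreq(0)` on a machine with frequency scaling returns a small dict; the `root` argument is only there so the function can be pointed at a copy of the tree.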

For information: about a year ago, a post on the LinkedIn ARM Based Group linked to an M.Sc. thesis publishing performance/watt figures for Cortex-A8 and Cortex-A9 based boards: www.linkedin.com/groups/Single-CortexA8-CortexA9-in-comparison-85447.S.84348310 Reply

Damn, Johan. As always, an incredible writeup. It's an interesting thought experiment: an upper bound on the damage to INTC's server share might be found by simply looking at how much of the market is running applications like your web server here (where single-threaded performance isn't as important).

Intel powering phones and ARM chips in servers...the end is nigh.Reply

I wouldn't call that a spectacular performance per watt ratio. It's a bit faster than the Xeon under a cherry picked benchmark (much slower under others), and is only marginally lower power. Best case it's an 80% improvement over Sandy Bridge with regards to performance per watt, and Atom wasn't represented. Considering all the hype, I was expecting something a little more... exciting. Ignoring Ivy Bridge improvements, Haswell isn't far off.Reply

Yeah... I agree. It also only seems to really come into its own in high concurrency. The Xeons idle quite similarly in terms of power - what happens if you compare it to more Xeon cores? It seems like on a per core basis, Intel still has the advantage on both fronts?Reply

I would also point out that the A15 has already been compared against Sandy and Ivy cores and come up short in performance per watt; so I'm very interested to see what the next step for these ARM node servers is.Reply

I warned against the hype in the first sentences. :-) ARM CPUs are still rather weak and not a good match for most applications. However, the fact that we could actually find a case where they do a lot better than the current Xeon systems was surprising to me. Reply

The 24 servers ran inside 24 VMs on the Xeon server, while for the ARM server you used the 24 physical server nodes... Hmm... That does not seem like an apples-to-apples comparison to me. Why not compare, for example, 16 physical nodes on both the Xeon and ARM servers?Reply

And how do you slice the Xeon server into 16 physical nodes? It does not support any kind of HW partitioning that I am aware of. On the other hand, the Calxeda machine is a cluster by design. If you try 16 Xeon nodes you'll go through the roof with power.Reply

We tested with 16 as I briefly mentioned in the conclusion. The 2650L did 170 responses/s per VM, or about 40% better. Total throughput was 2.7k/s with 16 VMs, versus 2.9k/s with 24. The flexibility that the Xeon has to reduce the number of VMs if higher throughput is necessary is definitely an advantage, but the performance numbers are not that different with different VM configs. Reply
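
The numbers in the reply above hang together, which a few lines of arithmetic show. This is only a check using the figures quoted in the comment; the 2.9k/s total is rounded, so the derived per-VM rate for 24 VMs is approximate.

```python
def per_vm_rate(total_per_s, vms):
    """Average responses/s handled by each VM."""
    return total_per_s / vms


rate_16 = 170.0                      # responses/s per VM with 16 VMs (quoted)
total_16 = rate_16 * 16              # 2720, i.e. about 2.7k/s
rate_24 = per_vm_rate(2900.0, 24)    # about 121 responses/s per VM
gain = rate_16 / rate_24 - 1         # about 0.41, the "about 40% better"

print(f"16 VMs: {total_16:.0f}/s total, {rate_16:.0f}/s per VM")
print(f"24 VMs: 2900/s total, {rate_24:.0f}/s per VM")
print(f"per-VM gain with 16 VMs: {gain:.0%}")
```

So going from 24 to 16 VMs costs roughly 6% of total throughput while each VM gets about 40% faster.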

Yeah, you should have had two teams, each with the goal of optimizing for its own platform. The Xeon team would not (lol) load up 24 VMs to serve the same web app. It's silly. Go bare metal in that use case.

There will be different needs for different cases. The "let's load up a bunch of VMs" approach is useful to cloud providers and in other cases, but not for "I want to feed this app to as many users as possible".Reply

Outdated in what sense? No one else has really made a serious attempt to review the Calxeda stuff, and while there are better Atom options out there, as Johan notes we were unable to get any in-house in time for testing. Or do you mean Calxeda's use of Cortex-A9 is outdated? If so, that's more a case of laying the groundwork, I think. Assuming they have their A15 option be backwards compatible with the current system (e.g. just get a new set of cards with the updated SoCs), that would be very cool.Reply

This was a fabulous and most informative write up. You answered so many of my questions with this article. Excellent job covering an area that no one else is, and also kudos for running such great benchmarks.

This really is tech journalism at its best. Thank you Johan, and thank you Anand for employing such high-quality writers.

We all know how memory constrained the ARM A9 is. Even something like Krait would solve a lot of A9's traditional weak areas. And yet, it looks like the Calxeda makes sense in enough niches to sustain their R&D and development efforts: low-to-medium traffic web hosting, media streaming and storage. Each one of those areas is a sizeable market, and the Calxeda solution offers enough to be seriously considered in these markets.

And when one thinks about how many years of x86 optimisation have gone into the toolchain in things like gcc, one realises the potential that lies ahead for ARM in this market. ARM's future roadmap is well known: next is Cortex A15 and then Cortex A57. Meanwhile there will be more software optimisation, and the management/deployment side will also improve. With all these in mind, I think it's more than conceivable that ARM will grab up to 20% marketshare in the server market by 2015. Reply

Thanks! Good summary... and indeed 20% marketshare is not impossible. The real question is whether Intel will give the Atom its long-overdue architecture update, or whether Haswell will put some pressure on from above. Exciting times. Reply

Isn't it much easier to administer 24 virtual servers than 24 physical ones (cost of personnel)? When all servers have the same workload it looks good for ARM, but the virtualized Intel environment easily wins if some servers get a lot more requests than others, meaning too much for one ARM SoC to handle. The tested scenario is basically the best one could ever hope for the ARM server and pretty unrealistic (same load for all servers). That's fine, but then also post worst-case scenarios... The Intel server is a lot more flexible.Reply

I completely agree with the other readers that this writing is just absolutely superb. Fantastic, novel job, Johan. However, I also agree with the above commenter: a big part of the appeal of virtualizing a "fat" core system is being able to properly utilize the resources of the machine across VMs. By equally loading "tiny tiles", the obvious advantage of the inherent load balancing of a virtualized infrastructure completely disappears. Under the current "fat" VM infrastructure you can accommodate individual VMs with heterogeneous loading levels, with extra provisioning in the resource pool. That is simply not the case for these tests, which pit an army of individual machines against many VMs virtualized on a few "fat" CPUs. I don't mean to be overcritical, but this is a proper apples vs oranges comparison.Reply
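
The load-balancing point made in the two comments above can be sketched with a toy model. Everything here is hypothetical: the per-site demands, the 120 req/s node capacity and the function names are invented for illustration, and the model ignores hypervisor overhead, scheduling and memory limits.

```python
def served_fixed(demands, node_capacity):
    """One site per physical node: each site is capped at its own node's capacity."""
    return sum(min(d, node_capacity) for d in demands)


def served_pooled(demands, pool_capacity):
    """All sites share one pool: only the total demand is capped."""
    return min(sum(demands), pool_capacity)


# Two hypothetical demand patterns with the same total of 2400 req/s.
even = [100] * 24
skewed = [50] * 22 + [650] * 2

node_capacity = 120                  # hypothetical req/s per small node
pool_capacity = 24 * node_capacity   # same aggregate capacity, but shared

for name, demands in (("even", even), ("skewed", skewed)):
    print(name, served_fixed(demands, node_capacity),
          served_pooled(demands, pool_capacity))
```

With equal load both setups serve everything; with two hot sites the fixed nodes serve 1340 of the 2400 req/s while the shared pool still serves all of it. That gap is what an equal-load benchmark cannot show.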

A lot of shared hosting ISPs use lightweight virtualization with Linux or BSD "Containers". I would like to see you re-benchmark with those on both servers instead of using VMs. You should see higher performance vs full virtualization. I'm not sure how it would affect the ARM performance, but it shouldn't hurt much, and there is more potential for better load sharing if some sites are busier than others.Reply

Hmm, if these didn't cost $20,000 they would make a nice front end for larger websites and forums, using less rack space and power. What setup using these would you use for AnandTech? Would you guys keep the Intel DB server? Reply

I'm not sure I agree with the absolutism that seems implicit in your comment that Xeons are better for relational databases...I think there are cases where that won't be true.

Database scale-out doesn't always require sharding...using any of a number of different off-the-shelf capabilities built right into most SQL engines, you can create multiple active replicas of your database. This is generally better-suited to workloads that aren't write-intensive, but both clustering and replication allow for writes. While this may seem like a quick-and-dirty solution that is architecturally "less good" than sharding, hardware is a lot cheaper than paying people to design a sharding solution and the dollars very often drive the conversation. As long as the database size isn't terribly large this can be a very cost-effective way to scale out a database.

I would wager that the AnandTech website database (not the forum database) would probably be well-suited to this type of scale-out. You do waste some money on redundant storage, but you more than make up for that cost by not having to pay a development team to implement sharding. If the comments section of the AnandTech website gets stored in the same underlying database, the size constraints and the write activity may appear to be incompatible with this approach, but I would in fact argue that comments don't require the relational capabilities of SQL and would be more rightly stored as blobs in Hadoop or Azure Storage Tables. Then the AnandTech database is strictly articles and is both much more compact and almost entirely read-only (except for a few new articles per day).Reply

To the best of my understanding, replication does well for scaling reads but doesn't do much for writes. I'd still imagine that this would work decently well with AnandTech, where I can't see the volume of writes being that large relative to the volume of reads.Reply
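
A rough sketch of the read-scaling idea discussed above: writes go to one primary, reads are spread round-robin over replicas. The class name, the string "servers" and the SELECT-only rule are all assumptions for illustration; a real setup would also have to deal with replication lag, transactions and statements such as SELECT ... FOR UPDATE.

```python
import itertools


class ReplicatedRouter:
    """Pick a server for each SQL statement: replicas for reads, primary for writes."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary for reads when no replicas are configured.
        self._replicas = itertools.cycle(replicas or [primary])

    def route(self, sql):
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb == "SELECT":
            return next(self._replicas)   # reads rotate over the replicas
        return self.primary               # everything else is treated as a write


router = ReplicatedRouter("primary", ["replica-1", "replica-2"])
print(router.route("SELECT title FROM articles"))
print(router.route("INSERT INTO articles VALUES (1, 'x')"))
```

Adding replicas raises read capacity roughly in proportion, but every write still lands on the single primary, which is why this suits a mostly read-only article database better than a busy forum.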

They would make a horrible front end for such websites. Just buy a single Xeon server and don't artificially limit it by using 24 VMs. Just run the app straight on the metal and it will perform massively better.Reply

Very interesting, Johan, as your tests often are! Interesting that the memory bandwidth is so much lower than anything from Intel. In fact the iPhone 5 looks much better... why? Only Intel has about the same results in compress and decompress. Reply

Do you know what would be an interesting concept for a future version of these cluster-in-a-box systems? A solution like ScaleMP. ScaleMP is basically a reverse VM. A hypervisor on each server clusters together to run a single OS with an aggregation of all resources (cores, RAM, network, and disk). ScaleMP running on 4x Dual-socket 8-core Xeon systems w/ 32GB RAM results in a usable system with 64-cores and 128GB RAM as if it was running natively on the hardware. This would be an interesting concept to transfer to the ARM space (if a form of hardware virtualization ever is designed). In a box like this, there would be 192 cores and 192GB of RAM available to a single Fedora instance. Cluster 2 of these together and suddenly there's a system with 384 cores and 384GB of RAM in 4U. Just some food for thought.Reply

Reading through this article about Calxeda (great job BTW), I couldn't help but think about the old SGI hardware that seemed pretty similar, with MIPS (and later Itanium) processors connected through a switch with NUMALink. I haven't played with NUMALink directly in almost a decade, but back then cheaper Altix slabs were ring topology while higher-end hardware was switched. In the end though, you could put a bunch of 1U racks together and have a single system image. Like you mentioned though, cache coherency was exceptionally important. Since we have a UV here, I can point you to the documentation for that box.

Isn't it remarkable how PR people manage to fill so many pages with "extreme" and "the future" without telling us anything? The frustration became even greater when I clicked the "get the facts" page. That is more like "You are not getting any facts at all". Reply

I'd be interested in seeing what happens when you start pushing single chips to and slightly beyond their limits. Calxeda's hardware has proved competitive on a very friendly workload (which I didn't really expect would happen until their A15 product), but in the real world a set of small websites are unlikely to all have equal load levels. Virtual servers on larger CPUs should give more headroom for load spikes, so knowing what the limits on Calxeda's hardware are strikes me as fairly important.Reply

I make performance oriented web apps for a living and I was looking forward to this performance test very much. However, I was quite disappointed at how you have done the "real world" test.

If you're serving a single site you would never put a Xeon through the performance penalties of virtualisation, so I deem your real world results flawed/unusable.

Basically, if I was to consider buying a Calxeda server tomorrow, I want to know if I can serve a site faster/better by using the "cluster in a box" solution which ARM's partners are going for or if a single Xeon server with standardised dedicated hardware will serve me and my businesses better.

The other thing that I would have also tested is SSL request performance, because Intel has AES-NI built in and I believe ARM has something similar? I would say the majority of requests today for a serious web app/site will be traffic using the SSL protocol, so that would also be one of the deciding factors I would look at.

If I was a cloud host provider your comparison may contain some truth, as their business model would presumably be to let each ARM node out as a VPS alternative, but that isn't what you were testing, were you?Reply

1. The single site: it is not meant to be an environment of one single site. The reason we use the same site over and over again is that it makes the results easier to interpret and more repeatable. Consider a hosting provider who hosts many similar, but not identical, LAMP sites. The repeatable part is what most people don't understand very well: we don't just hit the same URL over and over again. We perform real user interactions and randomize them in real-world patterns (like logging in first and then performing several real actions), and getting a repeatable benchmark that way is very complex.
2. The SSL comment is definitely good feedback. We are currently writing the connection code for such SSL websites but also need to find one or more good examples. If your site is a good example, maybe we can use yours (even under NDA if necessary)?
3. Lastly, the virtualization overhead of ESXi 5 is very small. Reply

It won't be LAMP sites any more though - take a trawl through something like the Linode forums to get an idea of what people are building. You are talking higher concurrency and more likely nginx.

Someone made a valid comment about database sharding - for web apps this is much more likely as people try to make sure they have failover.

Whilst initially very disappointed, if you imagine the refresh on the ARM cores over the next 2 years (and considering the rate of change due to the phone market) you might actually be looking at a beast of a machine in two or three iterations. Imagine if you could buy these off the shelf for under $10k: that feels to me like mission-critical failover systems in a box. I can see this taking off in a couple of years. Reply

It seems this has a very narrow application in VM hosting, but I am not sure it's applicable when you have the choice of just scaling up memory or process usage of the single instance Xeon server. For example, I could load 24 instances of my production middle tier on the ARM server - or I could run one instance on a Xeon server and give it all the memory and make sure it spawns enough threads to keep all the internal cores busy. Perhaps my middle tier software has issues with handling all that RAM, so maybe I run 4 instances of it as a process, not a biggy.

I am going to bet that the Xeon server will win as it won't have the VM overhead.Reply

I'd prefer a fat machine with virtualized servers to get automatic load balancing, but it's not like one couldn't shuffle tasks around in the ARM farm. And there's room for improvement: be it the next Atom or the memory controller in the current ECX-1000 CPUs. And take a look at how badly they scale from 2 to 4 threads - surely, there's lots of room left!Reply

You said: "The next generation ARM servers are already on the way and will probably hit the market in the third quarter of this year. The "Midway" SoC is based on a 28nm (TSMC) Cortex-A15 chip. A 28nm A15 offers 50% higher single-threaded integer performance at slightly higher power levels and can address up to 16GB of RAM." As far as I know, the A15 cores have 50% more performance but consume 3x more power; that's not "slightly"...Reply

Where on earth do you get that 3x from? So far no 28nm Cortex-A15 chips have been released. The A15 in the Exynos Octo uses about 1.25W per core at 1.8GHz according to Samsung. That's slightly more power than a Calxeda A9 uses per core, but the A15 gives twice the performance per core.Reply

Assuming the 800mW figure is accurate and the uncore power stays the same, then a node would go from 6W to 7.8W, i.e. 30% more power for 100% more performance. Or they could voltage scale down to 1.5GHz and get 65% more performance for 5% more power. While a 28nm A15 uses more power in both scenarios, it is also much faster, so perf/Watt is significantly better.Reply
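
The node-power estimate above can be reproduced in a few lines. This sketch assumes four cores per node (as on the quad-core ECX-1000) and reads the 800mW figure as per-core A9 power, which is the reading that yields the quoted 7.8W; the 1.25W per A15 core is the Samsung number cited earlier in the thread. The variable names are illustrative only.

```python
CORES = 4            # assumption: quad-core node
A9_CORE_W = 0.8      # per-core A9 power (the "800mW figure")
A15_CORE_W = 1.25    # per-core A15 power at 1.8GHz, as quoted from Samsung
A9_NODE_W = 6.0      # whole-node power quoted in the comment

uncore_w = A9_NODE_W - CORES * A9_CORE_W        # about 2.8W, held constant
a15_node_w = uncore_w + CORES * A15_CORE_W      # about 7.8W
power_ratio = a15_node_w / A9_NODE_W            # about 1.30

perf_per_watt_full = 2.0 / power_ratio          # 100% more perf, 30% more power
perf_per_watt_scaled = 1.65 / 1.05              # the 1.5GHz scenario as quoted

print(f"A15 node power: {a15_node_w:.1f}W ({power_ratio - 1:.0%} more)")
print(f"perf/W gain at 1.8GHz: {perf_per_watt_full:.2f}x")
print(f"perf/W gain at 1.5GHz: {perf_per_watt_scaled:.2f}x")
```

Under these assumptions both scenarios land at a bit over 1.5x the A9 node's performance per watt.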

1. I guess we have to wait to see if it's really 2x perf from A9 to A15 in real tests. I personally wouldn't bet on that just yet.
2. Most likely the uncore power will increase too. I don't think the larger memory bandwidth will come free.Reply

1. We already know A15 is 50-60% faster than A9 per clock (and often more, particularly floating point), so that gives ~2x gain from 1.4GHz to 1.8GHz.
2. The uncore power will be scaling down with process while the higher bandwidth demand from A15 will increase DRAM power. Without detailed figures it's reasonable to assume these balance each other out.Reply
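
The ~2x in point 1 is just the per-clock gain multiplied by the clock ratio. A quick check, assuming performance scales linearly with frequency (which memory-bound workloads will not do); the helper name is made up for the example.

```python
def speedup(per_clock_gain, old_ghz, new_ghz):
    """Combined speedup from a per-clock gain and a clock increase."""
    return per_clock_gain * new_ghz / old_ghz


low = speedup(1.5, 1.4, 1.8)    # 50% faster per clock
high = speedup(1.6, 1.4, 1.8)   # 60% faster per clock

print(f"{low:.2f}x to {high:.2f}x")
```

That gives a range of about 1.93x to 2.06x, so "~2x" is a fair summary of the quoted figures, subject to the scaling assumption.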

It really doesn't sound like the price/performance is there. Also, lack of Windows support makes it useless for those of us who run ASP.NET websites (like the company I work for).

It's still nice to see companies trying something different from the standard strategy. Maybe this will be better in a few generations and take the web server market by storm. If we see a Windows Server for ARM, I could see considering it as an option.Reply

I agree your testing suite's method is good and ok, so you were testing in consideration with hosting providers, fair enough.

However, on the question of whether a standard Xeon or an ARM-based server would be better if you were serving a single site: that is the case that matters to FB/Twitter/Google/Baidu etc., which are, as the media have led me to believe this past year, the companies that ARM partners are trying to sell this kit to. This test unfortunately cannot tell us.

A quick search on Google on the performance impact of VMs yielded a thread in the VMware community forum by a vExpert/Moderator that mentioned an expectation of 90% performance, and frankly, no matter how small you think the performance impact of a VM may be, it is still using up CPU cycles to emulate hardware. That point will remain true no matter how efficient the hypervisor gets.

Secondly, the overhead of running 24 copies of the OS + Apache + DB on a box that would otherwise be running a single copy of the OS + Apache + DB is total overkill (on that topic).

It would be great if you could also test the Xeon's req/sec when running a single instance so we can see it from a different perspective. As of now, as I said, your test is skewed towards hosting providers who might invest in Calxeda to provide VPS alternatives. But to them (and their client base), the benefit of a VPS is its portability, which 24 physical ARM nodes aren't going to provide, so I don't see them considering it as an alternative solution anyway.Reply

I would like to see the results with the website running on bare metal. I would like to, but I don't believe you when you say the virtualization overhead is minimal. Also, did you include the power used by the switch? As we scale the Xeon cluster we will add a lot of cost and power in the network, whereas the Calxeda fabric should scale for free.Reply

I think a lot of you are missing the main point or future potential of this server technology. And that is that Intel likes to make an absolute minimum of $50 per CPU it makes; in server CPUs it's more like $300.

These ARM CPUs are being sold at around $10 a CPU. Sure, Calxeda have gone the hard yards making such a server and want a lot of money for it. BUT once these ARM servers are priced in line with their actual CPU costs, it's going to be the biggest bomb drop on Intel's server profits in history.Reply

Assuming you are right and ARM is becoming so important that it can't be ignored, what's to prevent Intel from producing and selling ARM itself? In fact, what's to prevent Intel from producing the best ARM SoCs, as it has arguably the best fabs? There are rumors that Apple is asking Intel to produce procs for them; this would certainly be very interesting if it proves to be true.Reply

The problem is that ARM cores are pretty much a commodity, so ARM SoC pricing is inevitably going to end up as a race to the bottom. This could make it difficult for Intel to sustain the kind of margins it needs to keep its superior process R&D efforts going. Or at least, it would need to use its high-margin parts to subsidize R&D for the commodity stuff, which could get tricky given the overall slowing of the market for the higher end processors. I think this is what's happening with the supposed Apple deal. There have been reports that they have excess capacity at 22nm right now so it makes sense to use it. And, since Apple only sells its processors as part of its phones and tablets, it doesn't directly compete with x86 on the open market.

Of course, all the other fabs are operating under the same cost constraints, so there would be an overall slower pace of process improvements (which is happening anyway as we get closer to the absolute limits at <10nm).Reply

Yup. This is actually Intel's biggest threat by far. It's not the technical competition (even though Intel's Atom servers don't seem nearly as competitive as these upcoming ARM servers), but the biggest problem by far for them will be that they will have to compete with the dozen or so ARM server companies on price, while having more or less the same performance.

THAT is what will kill Intel in the long term. Intel is not a company built to last on Atom-like profits (which will get even lower once the ARM servers flood the market). And they can forget about their juicy Core profits in a couple of years.Reply

So your argument is that because the ARM solution is more expensive than the Intel solution now, it must be cheaper than the Intel solution in the future? Mobile ARM chips are cheap, and so are Intel's mobile chips.Reply

The savings are more than just electricity cost, you also save on cooling costs and can pack your server room more densely. If you do a TCO calculation over several years it might well turn out to be cheaper overall.

This is the first ARM server solution, so it's partly to get the software working and test the market. However I was surprised how competitive it is already, especially when you realize they use a relatively slow 40nm Cortex-A9. The 2nd generation using 28nm A15 will be out in about 6 months, if they manage to double performance per core at similar cost and power then it will look even better.Reply

Ja, IF they have high volume. But even if there is high volume, it's shared between the different ARM suppliers and, needless to say, the Atom. How much can that be for one company?

But the question is where ARM gets the volume. Less performance, comparable power consumption, a worse performance/watt ratio (outside this kind of extremely biased case), less flexibility, less software support (stability), vendor-specific hardware (you can build a normal server, but can you build a massively parallel cluster?) and, don't forget, more (much more) expensive. Which company will sacrifice itself to beef up the market volume of the ARM server? Reply

Hi Johan, nice job benchmarking and analyzing the results. Our group at EPFL has recently done some work aimed at understanding the demands that scale-out workloads, such as web serving, place on processor architectures. Our findings very much agree with your benchmark conclusions for the Xeon/Calxeda pair. However, a key result of our work was that many-core processors (with dozens of simple cores per chip) are the sweet spot with regard to performance per TCO dollar. I encourage you to take a look at our work -- http://parsa.epfl.ch/~grot/pubs/SOP-TCO_IEEEMicro.... Please consider benchmarking a Tilera system to round out your evaluation. Best regards!Reply

LWN.net has a very interesting write-up on a talk given by Facebook's Director of Capacity Engineering & Analysis on the future of ARM servers and how they see ARM servers fit in with their operation. I think it gives valuable insight on this topic.

It is a test using the wrong software stack. Yes, I am not afraid to say that! Apache will never be used on such ARM servers. They are an exact match for Memcached, Nginx or other set/get-type services, like static data serving. Using Apache or a LAMP stack is far too favorable to the Xeon. What I would like to see is: a non-virtualized Xeon server with max RAM running 4-8 instances (similar to the core count) of Memcached/Nginx/lighttpd vs a cluster of ARM cores doing the same light task. Measure performance and power usage.Reply