Posted by samzenpus on Wednesday September 23, 2015 @06:33PM
from the behind-the-curtain dept.

1sockchuck writes: As Sunday's outage demonstrates, the Amazon Web Services cloud is critical to many of its more than 1 million customers. Data Center Frontier looks at Amazon's cloud infrastructure, and how it builds its data centers. The company's global network includes at least 30 data centers, each typically housing 50,000 to 80,000 servers. "We really like to keep the size to less than 100,000 servers per data center," said Amazon CTO Werner Vogels. Like Google and Facebook, Amazon also builds its own custom server, storage and networking hardware, working with Intel to produce processors that can run at higher clockrates than off-the-shelf gear.

working with Intel to produce processors that can run at higher clockrates than off-the-shelf gear.

What does this mean? They have custom chips? Custom mods at the chip fab level? Or are they taking advantage of designed-in features that are locked out for normal chip users? Are they simply over-clocking? Or are there features that can be unlocked with money?

Probably means they buy in bulk, so they get to pick the more overclockable chips.

Say the Core i7 xxxx runs at 3.0 GHz and the i7 yyyy runs at 3.4 GHz. Intel makes a batch of i7s and tests them at 3.4 GHz. Some barely pass QC and are sold as retail i7 yyyy. Some fail at 3.4 GHz, so they're marked as the 3.0 GHz i7 xxxx. Some pass at 3.4 GHz with flying colors; those are the ones overclockers want most. Retail buyers like us don't get to pick which ones we get when we buy the i7 yyyy, but Amazon might.
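The binning process described above can be sketched as a toy simulation. Every number here (frequencies, spread, headroom threshold) is invented for illustration, not an actual Intel yield figure:

```python
import random

random.seed(42)

# Toy model of CPU speed binning: each die off the line has some
# maximum stable frequency; QC sorts the dice into retail SKUs.
TARGET_HIGH = 3.4  # GHz, the "i7 yyyy" bin (illustrative)
TARGET_LOW = 3.0   # GHz, the "i7 xxxx" bin (illustrative)

def bin_die(max_stable_ghz: float) -> str:
    """Assign a die to a bin based on its maximum stable frequency."""
    if max_stable_ghz >= TARGET_HIGH + 0.3:
        return "high-bin, lots of headroom"  # what overclockers (and bulk buyers) want
    if max_stable_ghz >= TARGET_HIGH:
        return "high-bin, barely passes"
    if max_stable_ghz >= TARGET_LOW:
        return "low-bin"
    return "scrap"

# Simulate one batch and count how each die bins out.
batch = [random.gauss(3.4, 0.3) for _ in range(10_000)]
counts: dict[str, int] = {}
for die in batch:
    label = bin_die(die)
    counts[label] = counts.get(label, 0) + 1
for label, n in sorted(counts.items()):
    print(f"{label}: {n}")
```

A bulk buyer who can cherry-pick from the batch is, in effect, taking the "lots of headroom" dice out of the pool before retail buyers see them.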

I remember the last time I overclocked a computer: I had a 300A running stable at 924, using two Peltier pads and heat sinks encased to pipe in water. It was a really fun build. Before that I had a 233 MMX stable at 405.

But of course these are all Xeon processors. Those normally have a lower clock rate the more cores the chip has, to limit heat density. The 10-core parts run at a bit more than half the speed of the 2-core parts (IIRC, but I could be way off). You don't need to overclock these the way you do enthusiast parts when they're underclocked to begin with; you do need prodigious cooling.

“Every day, Amazon adds enough new server capacity to support all of Amazon’s global infrastructure when it was a $7 billion annual revenue enterprise,” said James Hamilton, Distinguished Engineer at Amazon, who described the AWS infrastructure at the Re:Invent conference last fall. “There’s a lot of scale. That volume allows us to reinvest deeply into the platform and keep innovating.”

Did they use AWS for translation on this paragraph? How do you have "a lot of scale"? One can scale up or down, but is this like a computer hokey pokey? Scale is a verb!

Really, I skimmed this one pretty lightly. It looks like a marketing article, not a technical article. Buzzwords aplenty, so I'm guessing your question is answered by "marketing".

As I weigh this fish scale on my scale, before cleaning the scale off my kettle, while listening to my neighbor play scales, I wonder about the scale of your intoxication: on a scale of one to potato, how high are you right now? Oh well, I'm off to work: I was hoping for better, but it pays scale.

I agree with the change based on context; I should have been more specific. In the paragraph I quoted, "scale" is being used as a nouerb, or perhaps a veroun? I also realize I was being quite pedantic. Read that article and try to guess what language it was translated from, because it was not originally English. I quoted the worst translation error, but not the only one.

They are expensive and you have to buy a lot, but they'll do custom. Oracle also buys custom Intel chips. There are limits to what they'll customize: obviously a whole new ISA wouldn't be possible (at least not without a shit ton of resources), but they can customize things like cache sizes and configurations.

In terms of clock rate, I imagine what Amazon is doing is more or less having Intel raise the TDP for the chips and run them harder. All the Xeons cap out at about the same TDP at the high end, re

Or they want a slightly special version. Say the CPU supports 30 different features across the entire line, but for cloud services Amazon only really cares about 15 of them. They could ask Intel to permanently disable the other 15, which saves power, and spend the power saved on running the chips a bit faster without burning them up. I'm sure that if you buy enough at the right price, Intel would do it. It's just a question of price and volume.
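As a back-of-the-envelope sketch of that trade: every figure below (TDP, per-feature power cost, watts per 100 MHz) is made up purely for illustration, not an Intel number:

```python
# Toy power-budget model: disabling unused on-die features frees
# power that can be spent on higher clocks within the same TDP.
# All figures are invented for illustration.
TDP_W = 145.0            # total package power budget (hypothetical)
FEATURE_POWER_W = 0.8    # assumed cost per enabled feature block
WATTS_PER_100MHZ = 4.0   # assumed marginal power cost of extra frequency

def extra_clock_mhz(features_disabled: int) -> float:
    """MHz of headroom bought by fusing off unused feature blocks."""
    freed_watts = features_disabled * FEATURE_POWER_W
    return 100.0 * freed_watts / WATTS_PER_100MHZ

print(f"Disable 15 features -> ~{extra_clock_mhz(15):.0f} MHz of headroom")
```

The point isn't the exact numbers, just that at Amazon's volumes even a small per-chip power saving is worth negotiating for.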

They are building custom hardware and a lot of it so they get a bit of special treatment from Intel.

You engineer the thermal paths and better control how you get rid of heat. You tweak the board layout for the best performance of the chipset and CPU, and run closer tolerances on voltages and clock frequencies while keeping it small. Buying in bulk also lets you customize the chipset and CPU packaging to get better performance per watt and higher density by eliminating all the "fluff" you really don't want on a cloud machine. Who needs all those USB controllers, PCIe buses, and sound cards you find in the average server chassis in a high-density server farm? They just take up space and suck power. Just give me a couple of NICs, a SATA connection, a serial console, and a way to reset an individual system, and I have what I need to stand up an OS and grant somebody external access to it.

What does this mean? They have custom chips? Custom mods at the chip fab level? Or are they taking advantage of designed-in features that are locked out for normal chip users? Are they simply over-clocking? Or are there features that can be unlocked with money?

Basically, if you commit to buying a lot of chips, Intel will fab you modified versions of their existing product lines.

Remember back in the early days of Intel Macs, when Apple managed to get Intel chips that supported hardware virtualization, even th

The packets are larger (more bits), so they take longer to transmit and require more memory to store. Also, many ASICs were built for IPv4 and don't handle IPv6, so much IPv6 traffic is processed in the CPU rather than in ASICs, which is less power-efficient.
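To put a rough number on the "larger packets" point: the fixed IPv6 header is 40 bytes versus IPv4's 20-byte minimum, so per-packet header overhead roughly doubles (the payload sizes below are arbitrary examples):

```python
# Rough comparison of per-packet header overhead for IPv4 vs IPv6.
# Header sizes are the fixed minimums from RFC 791 and RFC 8200.
IPV4_HEADER = 20  # bytes, minimum IPv4 header
IPV6_HEADER = 40  # bytes, fixed IPv6 header

def overhead_pct(header: int, payload: int) -> float:
    """Header bytes as a percentage of the whole packet."""
    return 100.0 * header / (header + payload)

for payload in (64, 512, 1460):
    v4 = overhead_pct(IPV4_HEADER, payload)
    v6 = overhead_pct(IPV6_HEADER, payload)
    print(f"payload {payload:4d} B: IPv4 {v4:5.1f}%  IPv6 {v6:5.1f}%")
```

The gap matters most for small packets; at a full 1460-byte payload the difference is only a couple of percent.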

I doubt the power difference is terribly high, but at an Amazon level, it would likely be noticeable.

Amazon doesn't exactly "not turn a profit"; they dump all the profit they earn into growth and research so that they have no taxable profit. It's an optimization technique, not really an "OMG, we aren't making a profit" type of issue.

As someone who participated in their beta test: while Amazon might be ready for IPv6, the apps most of their customers run are not. For example, we couldn't get Tomcat to accept IPv4 connections on Linux when IPv6 was enabled. It binds to the IPv6 port by default, but not to the IPv4 port, and I don't think there's a way to get it to bind to both. We have a support contract with Kippdata, and they said they didn't think it was possible.
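The underlying mechanism here is worth spelling out. On Linux, whether an IPv6 listening socket also accepts IPv4 connections is controlled by the IPV6_V6ONLY socket option (default set system-wide via the net.ipv6.bindv6only sysctl). A minimal sketch, not Tomcat, just the raw socket behavior:

```python
import socket

def make_dual_stack_listener(port: int) -> socket.socket:
    """Bind one IPv6 socket that also accepts IPv4 connections."""
    s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # 0 = dual-stack: IPv4 clients show up as ::ffff:a.b.c.d mapped
    # addresses on the same socket. 1 = IPv6-only.
    s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
    s.bind(("::", port))  # :: is the IPv6 wildcard address
    s.listen(5)
    return s

listener = make_dual_stack_listener(0)  # port 0 = any free port
print("listening on", listener.getsockname()[:2])
listener.close()
```

Whether a given app ends up dual-stack therefore depends on what its runtime does with this option; a server that binds an IPv6 socket with V6ONLY set will indeed ignore IPv4 clients unless it opens a second IPv4 socket.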

It's the fact that they only focus on infrastructure. IaaS is their bread and butter; it's what keeps them going with companies that don't know anything beyond servers and storage, migrating their workloads (the peaks-and-valleys kind) into the cloud to save money and be agile.

The next generation is a step beyond that, and it's what Microsoft, SalesForce and Google are building for -- PaaS. The idea that you manage fleets of servers is an archaic one, and the next generation will be wri

You know what happens to the ones too far, far ahead of the others? In the future, people raise statues honoring them, but they usually die poor and/or too young.

It's quite funny you mention Microsoft since, back in the day, it was Novell that was far, far ahead of Microsoft on PC-based client/server deployments. And you know what? Microsoft not only didn't give a damn, they mocked Novell as too complex. And they were right: most people weren't ready for Novell forests and inherited/nested permissions, and Windows for Workgroups was all they could cope with. Then they grew up to "classic" domains, still a tad simpler than Novell while still being "good enough" for their customer base (in fact, not just "good enough" but "top notch", since for most of them it was all they knew; in practical terms it was Microsoft itself that was "educating" them).

Eventually, Novell died and, who could have imagined it!?, the very next day Microsoft came up with their new and shiny Active Directory, which was basically what Novell had been doing for the previous ten years: now, somehow, that wasn't "too complex" anymore but the only true way.

I'd say Amazon is on exactly the same track today. On one hand, most people, as you say, are not ready yet for higher abstraction levels like PaaS; IaaS is good enough and growing strongly. On the other hand, the PaaS market is far from mature: writing code against any public API today guarantees you'll rewrite it even before the provider gets to declare it non-beta.

And there's more: it's said that in the gold rush, the only ones consistently making money were the shovel shops, not the miners. Nowadays, the "hardware store" is Amazon, and it's the people building on top of AWS who are taking the real risks of doing business. And Amazon is not just watching time go by: a few years back they offered pretty simple virtual machines; now they offer quite a complex landscape with databases, routing, DNS, load balancing, tiered persistent storage... They are the Microsoft of today, mocking the ones too far, far ahead while, at the same time, cultivating their own customer base to make them ready for their future products and services.

He is completely wrong. AWS is far, far ahead of Microsoft with PaaS; it's not even comparable because AWS is so far ahead. Anyone who has used both knows this. He read a page or two about EC2 on AWS and thinks that's the only thing they have; it's about 5% of what they offer.

He is obviously an MS-paid shill hoping no one calls him out. Azure works fine, at least from what I hear it does now, but back when I was using it they were having serious problems. Went to Rackspace, similar to Azure (his comments wo

Is that a joke? Are you an Amazon shill, or do you just not understand the difference between IaaS and PaaS? Amazon dominates in IaaS, but Amazon is non-existent in the PaaS space and falling further and further behind every day. Most analysts don't even mention them when talking about PaaS.

I think that AWS' IaaS picture is more complete than Microsoft's, no doubt. As for deprecating APIs, well, I'll have to put on my tinfoil hat on that one, because since .NET 1.0 they have somehow managed to maintain most of their APIs with little deprecation. I don't imagine it would bode well for their business if they deprecated APIs for Azure, but you're free to believe that. On the point of PaaS being far from mature enough, I'd likely agree; except if you look at modern startups, most of those are written in

"However for new development efforts where we look to write in a microservice architecture, then AWS is simply not an option and I'm looking at Apache Mesos, Heroku, Service Fabric and AppEngine. Now you may disagree with"

Not at all; either I didn't explain myself well enough or you misunderstood. These are all well and good, but I bet you'll either fail with your next application (and therefore it doesn't matter) or you'll have to rewrite it in the not-so-distant future because one of your Mesos, Heroku, F

It's a matter of risk vs. reward. Yes, I might be locked into a platform, but at the level I develop at, MS and other enterprise cloud vendors can't just arbitrarily raise the price. There are enterprise agreements with liabilities, timelines, penalties and a lot more to ensure there aren't runaway costs. I know, because I've negotiated them with both AWS and Microsoft. The funny thing is, AWS does not agree to terms for large organizations that are any different from a startup's, and that's great fo

The thing I see with the API sales landscape today is that it is being sold as a 'business solution' that only a developer can understand.

The nightmare comes when you try to figure out which of your API vendors has brought your application down, and you are left carrying the can because a problem with the billing system left you with only 100,000 API calls instead of the 1,000,000 you expected. Still, it could be fun having the accounts department on call to respond to outages.

Not to be too pedantic, but crikey! That was very close to the worst-edited article I've ever read, even on the web, which is saying a hell of a lot! C'mon, guys, you're supposed to be some kind of publication, for Christ's sake!

From that I gather you haven't seen that many in-house IT departments. At my previous job I ran things for 7 years (including one server-room relocation to the other side of the city) with only one 30-minute downtime period. Yes, I intend to keep my bragging rights for that streak.

I tend to think that it's not a question of their unreliability but the inherent complexity of providing high availability and scale that works 100% of the time.

As a consultant, I love AWS/Azure/O365 outages. They bring most customers back to reality with regard to the infallibility of "the cloud" and to the exponential increase in complexity required when chasing the "never goes down" dream.

If those guys, with unlimited money and unlimited talent, can't make their systems not have outages, then some ran

You didn't build out a Multi AZ solution for your critical app? You relied on AWS services for critical load balancing and fail-over? You shoved everything into US-EAST-1 where it can sometimes take 5 minutes for a reboot? You're doing it wrong.