On the other hand, I've worked at places where the worst thing you could do is leave things that the company can't live without *in* the control of the company. Sometimes certain areas of expertise require specializations that the company just doesn't have and isn't interested in acquiring. Of course handing the responsibility of those things off to *Microsoft* is not necessarily any better.

Yeah, but who is? AWS has more outages than I care to remember, Rackspace has had its share of outages, Google goes down like once a month, even Apple can't keep a service up - and that's pretty much all the big players counted out.

Declining population of ducks in the local pond? Chips no longer served in old newspaper? Lack of respect for elders? Banning of blackboards in schools? Rampant rape and violence all foreigners bring to your little Daily Mail-reading village?

If you want to defend H-1B workers and dirt-cheap Indian code monkeys, perhaps you should make a logical argument.

I don't think the guy you're responding to had the most well thought out argument, but your response did nothing to refute it. You accuse him of xenophobia when it's obvious that he wasn't talking about foreigners in general; he was talking about specific foreign workers hired by American firms looking to cut costs. That doesn't mean that all foreigners are incompetent -- the assumption is that the most competent foreigners don't have to accept lower-than-deserved wages to undercut American workers. There's a reason the foreigners who undercut American jobs are willing to accept less money -- they're not worth as much.

Having worked in hiring of H1B workers, I can assure you it's illegal to offer them significantly less than an American. Maybe they get a couple thousand less because they don't have the experience, but it's the same as if you hired a US worker with the same experience. They don't work for half and they are not indentured servants (they can find another H1B sponsor and stay in the country).

I won't try to argue against your anecdote, though I'd like to point out that your experience is a single instance and I've heard other anecdotes that are contradictory. Different situations are different.

My major qualm was the AC's accusation of xenophobia where xenophobia wasn't present. It sickens me when people try to use political correctness as a trump card in a debate when it's really just a red herring.

There is simply no substitute for having all your I's dotted and T's crossed with large integrated systems like this. This is a culture problem, not an individual-screw-up problem. If you just fire the guy, there will be lots of awareness, but the takeaway most of your remaining people will get is "don't forget to check the certificate expiry dates, that'll get you canned." Many of them, traumatized by the experience, will dutifully check certificate dates for the rest of their careers, but this will do nothing to prevent your next major outage, because that will almost certainly be the result of something else.

Everyone is pushing this virtualization + "dev ops" + management/monitoring that is going to let us have one admin do what was once the work of ten. The fact is it just does not work like that. Management/monitoring like Microsoft MOM, for example, requires you to have all the failure modes identified and the scripts written to check conditions like expiry dates and trigger the alerts. Unless everyone is really good about keeping all the routine maintenance tasks in there, it won't help with something like this. That takes time your ONE admin has not got, and discipline that breaks down when someone is overworked.

The "dev ops" and virtualization stuff is all great in terms of how much can be automated. Someone has to develop that automation though. Your ONE guy does not have time to build and test his generic deployment script when you promised your customers you'd have their infrastructure stood up last week.

It comes down to the business recognizing it's important to have good people, enough people, and a willingness to invest in making sure the job is done correctly and completely every time, and that documentation is maintained in a way everyone knows how to use. Checklists need to be kept and followed, etc. IT got away from plant-engineering-style discipline when hardware got cheap: you no longer had to worry about that one computer you had failing. As we move back to more consolidated and integrated solutions, management is going to have to get used to the idea again that there is some people-time investment that must be made. It's great that you can save on power, cooling equipment, and headcount, but you can't cut headcount too far, because the more consolidated you get, the less you can afford for anything to go wrong, so it all must be checked, double-checked, and checked again just to be sure. That holds whether you do it yourself or pay your cloud provider to do it. Either way, cloud services so far have mostly been a race to the bottom, and that is going to cause some to learn some very painful lessons if the industry remains on its current trajectory.

There is still a person who wrote the manual for that procedure. And the HR or manager who approved such by-the-rote practice in the first place. Ultimately, there is a responsible party - a person, or several people.

Also, "it was not in my work manual" is not a valid excuse for the failure to apply common sense. Did all the people who deal with those certs fail to notice the expiry date? Did it never cross their mind to consider what would happen when the cert expires? I'm not talking about showing "fire

You'd think that, but there's contract stuff. The thing is, you basically need a department in charge of renewing shit like this when you have enterprise level services. We've got a site with millions of hits daily and still manage to let it expire every couple of years. You try the credit card thing, but credit cards expire. You try recurring billing and then you get into a contractual nightmare with the registrar. The registrar isn't going to do you any favors, you might get millions of hits daily, but they still only get $5/year even from google.com so fuck you, figure out the billing yourself.

The only real way to do it effectively is build yourself a database of all the crap you need to renew regularly, then hire someone to renew that stuff. But who are you going to hire? It usually ends up being some assistant that doesn't know a damned thing about tech... and it's still going to cost you $60k a year in pay and benefits to retain them. That's an expensive way of keeping track of such things... ah, the website admins can remember, right?
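The "database of all the crap you need to renew" idea can be surprisingly small. A minimal sketch in Python -- the register entries, names, and owners here are entirely made up for illustration; the point is that a scheduled job can nag the owner instead of relying on anyone's memory:

```python
from datetime import date, timedelta

# Hypothetical renewal register: (item, expiry date, owner to nag).
RENEWALS = [
    ("example.com TLS cert", date(2013, 4, 1), "webops"),
    ("example.com domain",   date(2014, 1, 15), "webops"),
    ("code-signing cert",    date(2013, 3, 10), "build team"),
]

def due_soon(renewals, today, horizon_days=30):
    """Return items expiring within the horizon, soonest first."""
    cutoff = today + timedelta(days=horizon_days)
    hits = [(item, expiry, owner) for item, expiry, owner in renewals
            if expiry <= cutoff]
    return sorted(hits, key=lambda r: r[1])
```

Run it daily from cron and mail the result to the listed owner; the hard part, as the parent says, is making someone responsible for acting on it.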

Except that companies like Microsoft and Google register domains through "Enterprise" registrars like MarkMonitor, who charge upwards of a few hundred (possibly even thousand) dollars per year for their service - which supposedly includes "not letting the fucking things expire" and "making sure other people don't register our damn marks".

Microsoft actually has even less excuse in this instance, believe it or not - Microsoft's certificate vendor is itself. All MS certificates are chained up to a Microsoft s

It is almost a year to the day since Azure was down for a day because no one accounted for leap year when validating certificates, lol. AWS seems to have issues too, but they don't seem to revolve around blatant stupidity and result in an entire day of downtime.

M$ has a history of lack of customer focus, hence it will fail at any industry that demands the highest levels of customer focus. For cloud services to be down for a day is inexcusable, and seriously, any IT management staff that fails to acknowledge these failures and uses or recommends Azure should be fired. Any down time should be measured in minutes, not days; this should be considered catastrophic failure. M$ is far too used to its EULAs - a warranty without a warranty - and has become woefully complacent about actually guaranteeing a supply of service. "Meh, it mostly works" is their motto, and "we'll fix it next time round, for sure this time."

For me the confusing thing is that there was a single point of failure. I thought that much of what the cloud was about was resilience; I would expect that someone designing cloud infrastructure would have done an analysis of failure points, and implemented failover mechanisms (or at least monitoring and recovery procedures). Ok, maybe not a cloud-startup-du-jour, but certainly a big enterprise-style entity like Microsoft.

The reality is, if you outsource your hosting to a single company, there will always be single points of failure.

There will be architectural ones, like a root of trust expiring and the security framework taking everything down.

There will be bugs that can bite all of their instances in the same way at the same time.

There will be business realities like failing to pay electric bills, or collapsing, or simply closing down their hosting business for the sake of other business interests.

Ideally:

- You must keep ownership of all data required to set up anywhere, at all times. Even if you host nothing publicly yourself, you must ensure all your data exists on storage that you own.

- You either do not outsource your hosting (in which case your single point of failure, business-wise, would take you out anyway) or else you outsource to financially independent companies. "Everything to EC2" is a huge mistake, just as much as "everything to Azure" is a huge mistake.

- Never trust a provider's security promises beyond what they explicitly accept liability for. If you consider the potential risk to be "priceless", then you cannot host it. If you do know what your exposure is (e.g. you could be sued for 20 million), then only host it if the provider will assume liability to the tune of 20 million.

They spent several hours doing "test deployments"... while it's great to make sure you aren't going to make something worse, updating an SSL cert isn't exactly rocket science. I'd hate to see how long it took to recover from a more serious service issue triggered by a software bug.

Pretty sure they tried rebooting first to solve the problem, which caused Windows system repair to start on boot-up. System repair ran for the whole time (since there's a grayed-out cancel button you can't click), after which it reported that system repair was unable to repair the system.

It's not that amazing when you consider the service level of their hosted email. A week to correct an internal DNS entry, and meanwhile a customer with sixteen thousand email users just had to wait in queue to get it fixed. The large print pretends to give, but the fine print says you just have to wait for as long as it takes, and SLAs be damned.

I wonder how long it will be before there's a major failure loop in the cloud, something like the certificate for cloud X is stored in service Y, which actually uses cloud X as its backend. So when certificate for X stops, the whole thing grinds to a halt with no way to restart it (unless backdoors)...

... this is what you get. Sure, it's possible the same thing can happen for any company. But at least then you can fire your incompetent staff.

Once you deploy to a vendor, you are stuck. From what I've seen, you can't easily move data and code from one vendor to another. One of our clients is in the UK Azure cloud and we have to BCP about 6M rows from their server to our system every week. Takes over 90 minutes, and constantly fails because of losing the connection. We've looked at deploying systems to various clouds, and the costs were not worth it.
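A weekly bulk copy that dies on a dropped connection is usually fixable by transferring in resumable chunks rather than one 90-minute stream. A generic sketch, not any vendor's API -- `fetch_chunk` and `write_chunk` are hypothetical stand-ins for the real source and destination calls:

```python
import time

def copy_in_chunks(fetch_chunk, write_chunk, total_rows,
                   chunk_size=50_000, max_retries=5, retry_delay=1.0):
    """Copy rows in offset-keyed chunks so a dropped connection
    resumes from the last committed offset instead of restarting
    the whole transfer."""
    offset = 0
    while offset < total_rows:
        for attempt in range(max_retries):
            try:
                rows = fetch_chunk(offset, chunk_size)
                if not rows:
                    return offset  # source exhausted early
                write_chunk(rows)
                offset += len(rows)
                break
            except ConnectionError:
                # back off, then retry just this chunk
                time.sleep(retry_delay * (2 ** attempt))
        else:
            raise RuntimeError(f"gave up after {max_retries} tries at offset {offset}")
    return offset
```

This assumes the source table has a stable ordering to page over; without that, a chunked copy can miss or duplicate rows.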

I will NEVER put any critical business system in someone else's cloud. At worst, I might put it in someone's data center on *MY* servers. The cloud seems to be fine for small business startups and non-important data for personal use. Businesses who no one would even notice if their site was down for a day.

BTW.. 'Cloud' computing is just remote virtual servers over the Internet. It's really not something new and original. People act like it's some amazing new 'thing'. Well.. it's not. It's just another way of letting companies with limited or no tech skills put up a web site or store data. It's expensive, proprietary, and I doubt very cost effective in the long run.

Actually, there's a bit more to being "cloudy" than just virtual servers over the internet (indeed, they need not even be over the internet - you can have your own local cloud, and many companies have internal clouds). Virtual servers over the internet is merely client/server. For a service to be "cloudy", generally it'll have attributes like statelessness (RESTful interfaces, with each request treated no differently from the first - just as with HTTP, the service doesn't hold state from request to request) and distributability. The main benefit of "cloudiness" is that because of this you can easily scale services up when demand is high, and scale them back when demand is low. It makes it easier to build a resilient service than the traditional client/server model where the server side has to keep state. Infrastructures like Amazon's EC2 allow you to scale things up and down easily and economically because you can turn the "virtual server over the internet" part of it on and off very rapidly, and you only pay for the instances you've instantiated. But just using Amazon's EC2 doesn't automatically make your service "cloudy" if it does not have all the other necessary attributes.
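The statelessness point is easiest to see side by side. A toy sketch (no framework, hypothetical request shape): the stateless handler can run on any instance behind a load balancer, while the stateful one pins a client to one instance and blocks scaling:

```python
def stateless_handler(request):
    # "Cloudy": everything needed is in the request itself, so any
    # copy of this handler, on any instance, gives the same answer.
    return {"total": sum(request["items"])}

class StatefulHandler:
    # Not cloudy: holds a running total between requests, so a
    # client's requests must keep hitting THIS instance.
    def __init__(self):
        self.total = 0

    def handle(self, request):
        self.total += sum(request["items"])
        return {"total": self.total}
```

Real services push that per-client state out to a shared store (a database, a cache, or the request token itself) so the handlers stay interchangeable.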

Yes and it was done by buying a shit ton of hardware and all the complexities and expenses that come with it. The problem is that 90% of the time that hardware was sitting around idle. Or that you would have to purchase a bunch of hardware for a one time project and then hope and pray that someone would buy that hardware from you when you were done. It doesn't take a tech website genius to realize how incredibly inefficient that is.

And you think the cloud works differently? It's just that someone else is buying all that hardware to have sitting around idle until you need it. You hope. But, being a business, I'll bet one of their policies is to not buy more hardware than their projected needs, to avoid having any more sitting around idle than they absolutely have to to cover their own short-term needs. Anything else increases their costs without providing any revenue, so as a business they're going to avoid it just like you are.

It's just that someone else is buying all that hardware to have sitting around idle until you need it.

That's no longer my problem. It's now an operating expense for me instead of a massive up front capital expense.

What makes it work is that they have so many customers that when one needs more capacity they can take a bit away from everybody else and each customer's share will be so small they won't notice.
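That pooling argument is just statistical multiplexing, and a toy simulation makes it concrete. All the numbers here are made up for illustration: each customer idles at 1 unit and occasionally spikes to 10, but spikes rarely line up, so the pool's observed peak stays far below the sum of everyone's individual peaks:

```python
import random

def pooled_peak(customers=200, hours=1000, spike_p=0.05,
                spike=10, baseline=1, seed=1):
    """Worst hourly demand seen across the whole pool: each customer
    draws `baseline` units, or `spike` units with probability
    `spike_p`, every hour."""
    rng = random.Random(seed)

    def hour_demand():
        return sum(spike if rng.random() < spike_p else baseline
                   for _ in range(customers))

    return max(hour_demand() for _ in range(hours))
```

With these numbers the pooled peak comes out around a fifth of the 2000 units (200 customers x 10) that everyone provisioning for their own spike would buy, which is the provider's whole margin -- and also why a correlated spike (everyone's Black Friday at once) is the scenario that breaks the model.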

Nooo... when you reserve a VM that VM is yours whether you use it or not. You are paying for it after all. I have a very tough time buying that any of the major cloud platforms are oversubscribed. You will have to back up that claim.

It doesn't matter anyways. If you have grown to such a monstrous scale that you start to outgrow the capabilities of these cloud platforms, the capital cost of rol

That's no longer my problem. It's now an operating expense for me instead of a massive up front capital expense.

Exactly. Now, answer me this: you've decided that you can't afford that large up-front capital expense and having that capacity sitting around unused to deal with the occasional large spike in demand. So why is your cloud provider not following exactly the same business logic that you find sound? Why are they not trying to avoid exactly the same large capital expenditure that you're trying to avoid?

The only people who bought a bunch of hardware and had it sitting around idle were people that didn't know how to manage data centers. You still have to project loads for the cloud, and you still have to pay for the ability to scale up. In fact, in our cost estimating, the cost of moving data into and out of someone else's cloud, and the cost of having those large data sets on their servers, was the reason it was more pricey than having our own servers locally even if we had to buy extra servers.

Once you deploy to a vendor, you are stuck. From what I've seen, you can't easily move data and code from one vendor to another.

RHEL is CentOS is RHEL is Amazon Linux wherever you are. A basic of the cloud is that, as you migrate to it you migrate almost everything to Linux.

One of our clients is in the UK Azure cloud and we have to BCP about 6M rows from their server to our system every week. Takes over 90 minutes, and constantly fails because of losing the connection. We've looked at deploying systems to various clouds, and the costs were not worth it.

There have been outages in Amazon; almost nothing has ever crossed from one Availability Zone to another. Outages spanning multiple countries have never happened. At the same time there have been many total outages in Azure. Whilst Microsoft regularly loses data, every time a Google system fails totally, it turns out they have a tape backup. These are not "minor issues betw

Back in the bad old days, IBM had a solution for down time in mission critical systems - such as for United Airlines. It was called redundancy - a complete dual system. Or as we described it: when one of the two parallel systems detected an error, it automatically sent a signal to the second system so that it could go down too.

I think this design was also used in the first Ariane 5 flight! You know the one where 800 Million Euros in solar-research satellites went up in smoke, because some manager was too stupid to understand that you cannot just plug-in an Ariane 4 guidance module and expect it to work.

There are always single points of failure. Always. In this case, it was that X.509 is poorly designed, but there are others.

The point of "the cloud" was never to have no single points of failure. It is to avoid any single points of failure it can, and hire smart people to avoid and fix the SPoFs it cannot, all at a far lower price than you could afford. And it works well (unless you choose to use an incompetent cloud provider). Most companies screw up certificate expirations at some point, then spend da

Well the cloud works on open web standards and while certificate servers can have redundancy built in, the underlying certificate would still essentially be a single point of failure in the design. Any TLS that relies on certs will have to take this into account. The good news is that while somebody goofed at MSFT, the underlying principles of Certs prevailed and people were denied access to resources because their clients wouldn't trust the MSFT resources protected by those certs. Now, I would be more c

I find it hard to believe anyone who maintains such a large fleet of services wouldn't have set up some sort of trivial monitoring (I know they own a product or two) that would include SSL certificate expiration warnings. 30+ days out, a ticket (or some sort of actionable tracking mechanism) should have been generated, alerting those responsible to start taking action. Said ticket should have become progressively higher severity as the expiration date loomed (meaning nothing had been updated), which in any sane company would have implied higher and higher visibility.

That way, if an extensive test plan for such a simple operation was required, they had plenty of time to execute upon it and still not miss the boat.
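That kind of check really is trivial to sketch. A minimal version using Python's `ssl` module -- the escalation thresholds below are illustrative, not anyone's actual policy, and `cert_days_left` makes a live network call so it's a sketch only:

```python
import ssl
import socket
from datetime import datetime, timezone

def cert_days_left(host, port=443):
    """Fetch a server's certificate and return days until notAfter."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - datetime.now(timezone.utc)).days

def severity(days_left):
    """Map days-to-expiry to a ticket severity that escalates as the
    date looms (illustrative thresholds)."""
    if days_left <= 3:
        return "sev1-page-someone"
    if days_left <= 14:
        return "sev2"
    if days_left <= 30:
        return "sev3-open-ticket"
    return "ok"
```

Loop that over every endpoint you own from cron and file tickets on anything not "ok"; the monitoring is the easy part, as this outage shows -- acting on the ticket is the hard part.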

Having worked with MS in other ways, and combining that with both the lack of foresight and the inability to act quickly here, this sort of customer-forward thinking just doesn't exist inside the MS mind.

Believe it. When I worked at IBM, there was a certain automation team who let the critical SSL certificate for an ID provisioning tool expire not just once, but two years in a row causing a major outage to a large client.

Even if a warning wasn't trickled down a month ago, and we've no reason to assume it wasn't, the person whose job it is to act on it, provided they weren't on vacation, won't have simply thrown five dollars at a registrar. They'll have had to put in a request to the finance department, probably via a cost-management chain of command, with a full description of what needed to be paid to whom and why, with payee reference, cost-center code, expense code and departmental authorization, and hoped it would arrive in time to be allocated to the next monthly rubber-stamp meeting. Assuming the application contained no errors, was suitably endorsed and was made against an allocated budget that hadn't been over-spent and wasn't under review, then, perhaps, in the fullness of time, it might have received approval and have been sent back down the chain for subsequent escalation to the bought-ledger department, who'd have looked at the due date, added ninety days and put it on the bottom of the pile. After those ninety days, when the finance folk began to take a view to assessing its urgency, unless they found a proper purchase order from the supplier, and a full set of signed terms and conditions of purchase, non-disclosure agreements, sustainability declarations and ethical supply-chain statements, as now required by any self-respecting outfit, it'll have been put aside and, eventually, sent back round to be done properly. Or, if it all checked out first time, it'll have been put on the system for calendaring into the next round of payment processing.

I'm sure it might be possible to streamline aspects of such mechanisms, but to suggest there's anything trivial about them is a touch hasty. But you never know. Perhaps they're already thinking of planning a meeting to discuss it, and are working on a framework for identifying the stakeholders as I write.

After the infamous Feb 29th incident MS should have set up an Azure cluster identical to production stuff but with all the clocks set to 1 week or more ahead. Have it continuously running regression tests. Certs even getting close to 3 days before expiring is stupid.
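The clocks-ahead idea doesn't even need a whole shadow cluster to catch expiring certs; the core check is just "validate as of a future date". A minimal sketch of that regression test, with cert metadata passed in as plain tuples (a real version would parse the actual certificates):

```python
from datetime import datetime, timedelta

def failing_under_future_clock(certs, days_ahead=7, now=None):
    """Validate certs with the clock pushed `days_ahead` into the
    future, so anything expiring inside that window fails loudly in
    test rather than in production. `certs` is a list of
    (name, not_before, not_after) tuples."""
    now = now or datetime.utcnow()
    future = now + timedelta(days=days_ahead)
    return [name for name, not_before, not_after in certs
            if not (not_before <= future <= not_after)]
```

Wire that into a nightly regression run (with `days_ahead` set to your renewal lead time, not 7) and an expiry becomes a failing test weeks before it becomes an outage.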

Microsoft has billions of dollars, so if this 12 hour downtime is the best MS can do when they're "All In" (Ballmer's words not mine), it's not a good sign.

Actually, Microsoft has a wide variety of SLAs with financial penalties covering the Azure cloud. I expect customers will be able to claim at least a 10% service credit on this, as it's definitely an issue within Microsoft's control and definitely would cause a miss of the monthly availability number.

99.9% allows only about 43 minutes of downtime a month (which is the unit they measure in); even a 99% SLA would only allow seven-odd hours. If it took them 12 hours to get new certificates up, then they are not keeping their promise, they are failing. Of course, if that downtime coincides with your working hours, that's an entire working day down. It's a shitty level of service. Nobody hosting their own services, and having skilled staff managing their systems, would find that acceptable. I will admit that 99.999% uptime/connectivity is hard (we've had it one y
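The availability arithmetic is easy to get wrong, so here it is as a one-liner (assuming a 30-day month; contracts differ on the exact period):

```python
def allowed_downtime_minutes(sla_percent, period_minutes=30 * 24 * 60):
    """Minutes of downtime a given availability SLA permits over a
    period (default: a 30-day month)."""
    return period_minutes * (1 - sla_percent / 100)
```

So 99.9% is roughly 43 minutes a month, 99.95% roughly 22, and five nines well under a minute -- a 12-hour outage blows through any of them.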

The Azure SQL Database reporting facility just completed a 5-day outage this month [theregister.co.uk] so they may be a couple years over their downtime quota this month. Or as somebody else put it recently: "Five nines: 9.9999".

My problem with those SLAs is that they're for a credit for a fraction of the cost of the service for that month. Which is fine if your business doesn't depend on the service and you suffer no disruption when the service is down. But if you're hosting a Web site on the service, or using it for anything business-critical? The cost of the service is going to be the smallest part of the cost to you of the disruption (that's why you went with the service after all, because it was so much cheaper than doing it i

Because they are forbidden to issue a certificate with a greater validity than 39 months in accordance with the CA/Browser Forum Baseline Requirements for Publicly Trusted Certificates [cabforum.org] (warning: PDF). If they were to violate this, they'd have GTE's root certificate de-listed by Apple, Google, Mozilla, KDE, Opera, Blackberry and... um... Microsoft. Which would invalidate their subordinate certificate.

From a business perspective, it makes perfect sense: If Azure were reliable, secure and fast, customers could start to wonder why the other products by MS are not. This could heighten customer expectations, and that would be bad as MS really does not have the engineering capabilities to build, say, a good OS or a good office productivity suite and then customers may leave for the alternatives. So I applaud them for their foresight in making Azure just as bad as their other things are. This may actually be quite beneficial for their bottom-line.

Imagine if someone's signature on your PGP identity expired. It might be a bit of a blow, but people would still have other trust pathways toward you. Then you get a new signature from 'em, or someone else.

Certs can fail in so many ways, both false positives (compromised CAs) or false negatives (such as this expiration), and a myriad of subjective failures since different people have different reasons to trust (or not trust) different CAs. The risks aren't even theoretical. Failure really happens, to the extent that it's almost routine and we see a story about it here on Slashdot every month.

And Phil Zimmermann totally solved the problem(!) in, what, 1991? Why are we still using obsolete-the-day-it-came-out single-signer systems? So brittle. So unrealistic.

The only reason I can think of, is that it would work too well. MitM attacks would become nearly impossible for even the most powerful governments. Certs would become so competitive and cheap that the CA business would collapse.

You're acting like it's an SSL issue that MS decided to consider expired certs invalid in their systems rather than accepting them.

No. I'm saying it's an SSL issue that when The One and Only cert that can possibly exist, expires, there is no backup trust path. When the expiration happens, the number of valid certifications falls from 1 to 0. With a real world trust model, when an expiration happens, the number of valid certifications could fall from, say, 4 to 3.
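That graceful-degradation property is just counting unexpired endorsements. A minimal sketch (toy data, not real PGP key handling):

```python
from datetime import datetime

def valid_signature_count(signatures, now):
    """Count unexpired endorsements on an identity. With several
    independent signers, one expiry degrades trust gracefully
    (4 -> 3) instead of catastrophically (1 -> 0), as in the
    single-cert case."""
    return sum(1 for _signer, expiry in signatures if expiry > now)
```

A verifier would then apply a policy like "trust if at least N valid paths remain", rather than the X.509 all-or-nothing check.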

My perception of Ballmer and Dell is that both practically grew up with their companies, and neither has wide-ranging training in business management or the psychology of managing. Ballmer is famous for his chair-throwing and vicious firings with a loud voice, sometimes for trivial reasons, and for banning Apple products in most places inside the company. Dell has been reported to become physically withdrawn when competitor Apple is mentioned.