Lessons on Cloud Management at Scale: An Interview With Randy Skopecek

Cloud computing is more than just fast self-service of virtual infrastructure. Developers and admins are looking for ways to provision and manage at scale. This InfoQ article is part of a series focused on automation tools and ideas for maintaining dynamic pools of compute resources. You can subscribe to notifications about new articles in the series here.

How do real organizations handle the management of a cloud portfolio? InfoQ reached out to Randy Skopecek, the lead applications architect for a midsize insurance firm, to find out the considerations and challenges when operating at cloud scale.

Randy: At some point, all processes need to adapt or be replaced to meet corporate, cultural, and service-level (including scalability) needs. The opportunity for adaptation or replacement should come up frequently, say monthly and ad hoc. At a small ratio of VMs/services to change agents, hands-on direct manipulation is reasonable and sometimes a better choice than a large set of systematic overhead. It comes at a price, though: mistakes will be made, and often they are untraceable.

When scaling the number of VMs or instances of an app, the class of challenges changes. The development of any process tends to shift from mostly manual to mostly automated. You don’t build a custom application to take the place of a few spreadsheets one person has. But when a core function of the business depends on tons of people having that same spreadsheet, things change. Should it be a template? Could the ecosystem around the outcomes of the spreadsheet benefit from being more tightly integrated? The same happens with managing VMs/instances. At a certain level you template your VMs, run something like Chef or Puppet, manage automatic patching, re-focus and use PaaS/SaaS, and on and on.

InfoQ: What should cloud providers offer to let you manage resources at scale?

Randy:

- Scriptable and user-interface access
- Grouping and tagging of resource items (VMs, websites, DBs, etc.)
- Identification of low use, high use, erratic use, other less-desirable patterns, and areas of opportunity
- Set-and-forget automations and change orchestration
- Automated live/hot resource update control
- Actions that can be executed against an instance, group(s), or tag(s)
- Audit trails, logging, analytics, and reports
- Cost center and expense code tagging and tracking
- PaaS for fundamental resources (e.g., file storage, caching, DB)
- DR/BC options, including point-in-time recovery of resources
- Support for a VM templating technology
- Most importantly, a tight feedback loop with customers
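The tag-and-group capability in the list above is what makes bulk actions possible. Here is a minimal, hypothetical sketch of the pattern: the `Resource` class, tag values, and `stop()` action are illustrative stand-ins for a real provider SDK, not any vendor's actual API.

```python
# Illustrative sketch: acting on cloud resources by tag rather than one
# at a time. All names here are made up for the example.

class Resource:
    def __init__(self, name, kind, tags):
        self.name = name
        self.kind = kind          # "vm", "db", "website", ...
        self.tags = set(tags)
        self.state = "running"

    def stop(self):
        self.state = "stopped"

inventory = [
    Resource("web-01", "vm", {"env:dev", "cost:marketing"}),
    Resource("web-02", "vm", {"env:prod", "cost:marketing"}),
    Resource("db-01", "db", {"env:dev", "cost:finance"}),
]

def act_on_tag(resources, tag, action):
    """Apply one action to every resource carrying the tag."""
    matched = [r for r in resources if tag in r.tags]
    for r in matched:
        action(r)
    return matched

# Stop everything tagged as a dev environment in one call.
stopped = act_on_tag(inventory, "env:dev", Resource.stop)
```

The same selection-by-tag call works whether `inventory` holds three resources or three thousand, which is the point of tagging at scale.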

InfoQ: What advice would you offer to organizations that anticipate growing from a handful of cloud servers to hundreds or thousands?

Randy: Like anything you do frequently, automate and standardize. Set up change pipelines that include all of the facets: track the change, implement the change, verify the change, monitor the change, handle issues and deviations, and report on changes over time. Think 1) open item, 2) change item, 3) save item, 4) promote item for others…all with minimal manual steps. Select vendors and solutions that are both open to your input and fully capable of innovating on their own. Each dollar spent should provide the service, cover some facet of support, AND cover the innovative progress of the service.
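The pipeline facets described above (track, implement, verify, report) can be sketched as a chain of stages that each leave an audit trail. This is an illustrative toy, not a real change-management tool; the stage names and the `CHG-1042` identifier are invented for the example.

```python
# Toy change pipeline: every stage records what it did, so the change
# is tracked, implemented, verified, and reportable over time.

audit_log = []

def track(change):
    audit_log.append(("opened", change["id"]))
    return change

def implement(change):
    change["applied"] = True
    audit_log.append(("applied", change["id"]))
    return change

def verify(change):
    if not change.get("applied"):
        raise RuntimeError(f"change {change['id']} failed to apply")
    audit_log.append(("verified", change["id"]))
    return change

def report():
    # How far changes made it through the pipeline, from the audit trail.
    return {stage: sum(1 for s, _ in audit_log if s == stage)
            for stage in ("opened", "applied", "verified")}

change = {"id": "CHG-1042"}           # hypothetical change ticket
for stage in (track, implement, verify):
    change = stage(change)

summary = report()
```

The value is not the code itself but the shape: every change flows through the same stages with minimal manual steps, and the audit log makes mistakes traceable rather than invisible.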

When you talk about growth of 100x to 1,000x, it is almost always about a particular application service. Depending on your risk preference, you may choose technology stacks you can invest in heavily and make foundational. The more massive something is, though, the harder it is to change. Every technology, stack, and service should be considered transitional. Your journey with one of them might be long-lived, but technology evolves and matures. A person building a solution from their kitchen table might produce something more desirable for your situation than the best solutions available today.

At the 1,000x scope, though, you are frequently better off selecting a service (PaaS) that is designed to scale. Scaling is a tough subject. By using PaaS, you are investing in so much more: you can consider the people building it an unofficial extension of your team, burning dedicated time on the core issues and complexities of scaling. That doesn’t take the burden off your application or solution, however. It merely means that instead of growing your risk and required expertise exponentially, you are outsourcing a big part of it to people who (hopefully) already know. Just because you need a faster means of transportation than walking doesn’t mean you build a car; you buy or lease one. If at some point you land in the 1% that need more than that, you might get into custom builds.

One litmus test I recommend keeping in mind: from the moment you look at a service or system and say “that is just crap,” how long does it take to do something about it? Whether you encapsulate the issue, request improvement from the creators (and maybe pitch in yourself), or replace the whole service, stack, or vendor, it should all be a short timeline. Considering you can literally pack up your entire data center and move it to another vendor in less than a month is both amazing and empowering.

InfoQ: When do you start thinking about how to run (cloud) infrastructure at scale? Right away? After initial projects? Not until you feel pain?

Randy: I’ve definitely heard both schools of thought, and people struggle with this issue constantly. The “fear of rapid success” causes many people to contemplate solutions or custom design structures that bring heavy complexity and anxiety. The real answer can be hashed out with your team in 15 minutes, the team being the people building it plus someone from the top. What does the team want to happen, what do they know will happen, and what are they comfortable with? The team has to decide whether adding complexity up front is better or worse than adding it later, when they might feel the pain. It all comes down to solution strategy and team comfort.

A good design to strive for is one that buys you time while, once again, checking whether some of the pain points can be offloaded with minimal hassle. All designs have limits. If you can select a strategy that buys your team enough time to respond comfortably, that is preferred: you haven’t over-invested in upfront infrastructure, yet you still have time to act if the event happens. It can feel demoralizing to the builders to over-architect and over-provision for capacity that never actually gets used.

One thing that shouldn’t be overlooked is the application itself. Bad code or design is the more common cause of scalability issues. Sometimes more virtual hardware or more instances get thrown at the problem; while that may improve scalability, it also inflates the underlying issues. HealthCare.gov had a very public scalability fallout, and page 5 of the referenced report shows the outcomes of their strictly technological scalability improvements. Without talking money, look at the database (an RDBMS): twelve large dedicated database servers with new storage were added, which resulted in only a 3x improvement. That is a heck of a lot of infrastructure and complexity for 3x. What you won’t see is caching or high-IOPS storage. One heavy DB server with high storage IOPS (like Hyperscale) very well could have taken the place of all 12 servers, and altering the application to benefit from caching could have added plenty of performance. The solution also shows that cloud infrastructure wasn’t used at all, so they took on the whole scalability burden from bits to nuts.

The healthcare site is also an example of the team (devs and execs) not fully agreeing on and understanding the scalability needs. People building many systems don’t actually know how many people will use them; there is a hope for big success and lots of happy customers, but it is still a guessing game. Internal-only apps benefit from more precise load awareness. The healthcare site sits somewhere in the middle: a government mandate, but a somewhat unknown number of users. Sure, after it has been live for a while the traffic becomes a little more predictable. However, there was a go-live expectation of high load, and only after the scalability improvements could the site support 50k concurrent people. That was a guaranteed scenario that would crash just from known load.

I would recommend placing your solutions where you can make rapid changes, usually at a cloud vendor. Also, if you have a solution that can scale out, consider starting with two instances: the jump from one server to two is larger than from two up to 100+. At larger scale, if you have ~200 instances and still have to consider scaling out, think about placing half in different subnets. You should also consider the fundamental performance opportunities of the vendor. All of these will force you into design strategies that keep you from being boxed into scale-up-only abilities. If you have access to IaaS/PaaS solutions that can take on some of the potential scaling burden, seriously consider them, especially if you feel they are easily interchangeable with the other choices on the table.
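The "start with two" advice above can be illustrated with a toy dispatcher: once requests can land on either server, the design can no longer assume state lives on one box, which is exactly the discipline that later carries you to 100+ instances. The `Instance` class and round-robin scheme here are illustrative only, not any load balancer's real behavior.

```python
import itertools

# Toy illustration of scale-out: two interchangeable instances behind a
# round-robin dispatcher. Requests alternate between servers, so nothing
# request-specific can be assumed to live on just one of them.

class Instance:
    def __init__(self, name):
        self.name = name
        self.handled = 0

    def handle(self, request):
        self.handled += 1
        return f"{self.name} served {request}"

instances = [Instance("web-01"), Instance("web-02")]
dispatcher = itertools.cycle(instances)   # round-robin over the pool

responses = [next(dispatcher).handle(f"req-{i}") for i in range(6)]
```

Growing the pool is then just appending to `instances`; the hard design work was done when going from one to two.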

InfoQ: What's an underrated "hard thing" to do with a large number of running resources?

Randy: Structural changes while keeping everything live and providing service. With smaller services or time-locked audiences you can easily get away with some downtime; in enterprises, you would even tell your customers ahead of time that the service would be unavailable over the weekend if it were a sizable change.

It is more of a complementary issue. With change pipelines in place, you can make changes to systems in an automated and traceable way. However, what do you do if you roll out an update that needs to replace one column with two? In an RDBMS, that will lock the table, and if you have millions or even billions of rows, it will lock it for a long time, resulting in some form of downtime or functional outage. If you use automated scripts and try to run DB scripts out of sync with the service codebase, you remove a level of integrity from the app, and might cause downtime anyway if the two are tightly integrated (as with code-first ORMs). Maybe a NoSQL backend should be used? That doesn’t fix the issue; it just changes it. Now your data store could be completely different from the codebase’s data models. You either have to live-upgrade/patch the data or keep extra code around for longer periods of time. You might even have to change your codebase to be completely introspective of all data it works with.
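One widely used pattern for the one-column-into-two problem is an expand/contract (parallel change) migration: add the new columns first, backfill in small batches, and drop the old column only once all code paths have moved over. The sketch below uses SQLite as a stand-in for the production RDBMS, with an invented `users` table; a real system would batch the backfill and coordinate it with deploys.

```python
import sqlite3

# Expand/contract sketch: split a single "name" column into two without a
# long blocking rewrite. SQLite stands in for the real RDBMS here.

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO users (name) VALUES (?)",
               [("Ada Lovelace",), ("Grace Hopper",)])

# 1. Expand: add new columns alongside the old one (fast, no table rewrite).
db.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
db.execute("ALTER TABLE users ADD COLUMN last_name TEXT")

# 2. Backfill; in production this runs in small batches so no single
#    transaction holds locks for long.
for row_id, name in db.execute("SELECT id, name FROM users").fetchall():
    first, _, last = name.partition(" ")
    db.execute("UPDATE users SET first_name = ?, last_name = ? WHERE id = ?",
               (first, last, row_id))
db.commit()

# 3. Contract (a later release): stop reading "name", then drop it once
#    every deployed version of the code uses the new columns.
rows = db.execute("SELECT first_name, last_name FROM users ORDER BY id").fetchall()
```

During the overlap window the application writes both old and new columns, which is exactly the "keep extra code for longer periods of time" cost mentioned above.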

If it isn’t custom-built software maintained internally, you now have to consider live, templated full-infrastructure switches, where you spin up and prep a whole new copy of the service and switch it live at the push of a button.

InfoQ: How should organizations do application monitoring when functioning at scale?

Randy: Depend on the services and analytics the provider supplies, and consider adding in something like New Relic, AppDynamics, or Keen IO, to name a few. Don’t build something yourself; these issues have been heavily solved by existing solutions. Make any logging you do build very specific. It does no one any good to see “The following error occurred: one or more errors occurred.” If a vendor solution doesn’t output to something you can easily get to the bottom of, push them to improve it or replace them. You don’t need a failing third-party system where everyone shrugs when something goes wrong.
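The difference between the useless message quoted above and an actionable one is just context. A small sketch of the contrast; the order ID and step name are made-up illustrations, not a real API.

```python
# Vague vs. specific error messages. The specific version names the entity,
# the failing step, and the actual exception, so someone can act on it.

def vague_message():
    return "The following error occurred: one or more errors occurred."

def specific_message(exc, order_id, step):
    return (f"order {order_id} failed at step '{step}': "
            f"{type(exc).__name__}: {exc}")

try:
    rates = {}
    rates["tax-lookup"]          # simulate a missing dependency
except KeyError as exc:
    bad = vague_message()
    good = specific_message(exc, order_id="ORD-7731", step="tax-lookup")
```

The specific message can be searched, alerted on, and traced to a single order; the vague one only tells you that something, somewhere, went wrong.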

InfoQ: Should cloud consumers stick with a single provider to minimize the complexity of running workloads across clouds, or can customers successfully run a big solution that spans clouds?

Randy: Yes to both. Unless there are other service-specific or strategic plans, there is no reason to make your life, and the application’s life, harder than it has to be. That includes a single DC versus multiple DCs on the same vendor. Network latencies are obviously better within a single DC than geo-distributed. However, for DR you may want two live DCs, or one live with failover capability. Several of the big web properties run out of one region because they have found that when something goes down, which isn’t often, the outage is significantly less costly than keeping geo-distributed systems running live and automatically handling failover and final recovery.

You should strive for system isolation boundaries. Each system is then free to choose where it will prosper for its customers. Maybe some pieces make complete sense to keep at an enterprise IaaS like CenturyLink, some web properties at Azure, and pipeline automation at AWS. I wouldn’t think of cloud vendors as an all-or-nothing venture. They each have their strengths, will change over time, and it’s more of a best-of-breed and comfort scenario.

One solution that runs dual cloud loads is Auth0. They strategically didn’t want an outage just because one cloud vendor had an issue, so they run on both Azure and AWS and deploy three to four times a day. Theirs is a great case where it both can be done and probably should be. For most, however, that adds more complexity than needed.

InfoQ: Is "scale" in the cloud a real problem, or do you think users treat cloud as something that each decentralized group works with, thus no one group works with a massive amount of resources?

Randy: Scaling is still a real problem and probably always will be. Every app design has its limits. The amount of resources isn’t specific to the centrality of the group, but more to the application’s use and the quality of its design and code. If an application is highly complex, has a lot of people working on it, and consumes a massive amount of resources, the app design needs to be improved. Pieces of that application need to be extracted and isolated elsewhere, so each component or service can have its own set of people and resources. That helps with load targeting: a large, complex app doesn’t have heavy load over every single facet of itself. You may still have a single team with a single service that manages thousands of resources, but at that point it is so highly contextual that it can be optimized to a higher quality standard, with resources specced close to actual needs.

The balance of scalability is still shared between customer and vendor. Services like file stores or key/value NoSQL stores from even the biggest cloud vendors have their limits; even when solutions are built around the benefits of those services, at some point you hit complications. Services that compartmentalize and communicate those limitations have the best chance. For example, AWS has a newer service called Kinesis. In the past, AWS frequently boasted a mentality of unlimited scalability for fundamental resources like S3 and DynamoDB; then you hear outcries from customers in the field who hit limitations that were never posted. With Kinesis, the resources and scalability are broken out as unlimited scale, one shard at a time. That pushes part of the scalability back onto the customer, who has to access each shard individually. So sure, you can scale 1,000-fold, but your app now has to make at least 1,000 accesses. The balance of scalability challenges is still being defined between provider and customer.
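The per-shard access burden described above can be made concrete with a toy model. The dict below stands in for a sharded, Kinesis-style stream; it is not the AWS API, just an illustration of how the number of client calls grows with the number of shards even though each individual shard stays simple.

```python
# Toy model of a sharded stream: capacity scales by adding shards, but the
# client must read each shard individually and merge the results itself.

stream = {
    "shard-0": ["evt-a", "evt-b"],
    "shard-1": ["evt-c"],
    "shard-2": ["evt-d", "evt-e", "evt-f"],
}

def read_shard(stream, shard_id):
    """One call per shard: this is the part that multiplies with scale."""
    return list(stream[shard_id])

calls = 0
events = []
for shard_id in sorted(stream):      # 1,000 shards would mean 1,000 reads
    events.extend(read_shard(stream, shard_id))
    calls += 1
```

Scaling the stream to 1,000 shards scales the service's capacity, but it also scales the client's read loop by the same factor; that is the part of the scalability burden handed back to the customer.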

About the Interviewee

Randy Skopecek is the lead applications architect at a niche midsize insurance company. He is in charge of all technology development and operations, covering vision, implementation, management, and replacement. You can find him on Twitter at @rskopecek.
