Tag: aws

If you want to have your finger on the pulse of startup land, there aren’t many better places to start than Hunter Walk’s 99% humble writings. Google finds his top posts on topics like AngelList, Advisors, and reinventing the movie theatre. Good writing, insiders view.

Bruce Schneier is one of the original bad boys of computer security. He writes about broad topics, that affect us all everyday from common sense about airport security, to the impacts of cryptography for you and me. Very worth looking at regularly, just to see what he’s paying attention to.

Eric Hammond has been writing about Amazon Web Services, EC2 & Ubuntu for years now. He maintains and releases some excellent AMIs, those are the machine images for spinning up new servers in Amazon’s cloud.

Even if you’re not big on the command line, you can get a lot of critical insight about the Amazon cloud by keeping up with his blog. Jeff Barr’s AWS blog is also good, but not nearly as critical and boots on the ground as Eric’s.

The internet stack is a complex infrastructure of interlocking components. An scalability engineer must be adept at Linux, plus webservers, caching servers, search servers, automation services, and relational databases on the backend. We think a generalist with a broad base of experience is most suited to the job of scalability engineer.

The Foundation of Speed

All servers use disk to store files. Operating system libraries, webserver & application code, and most importantly databases all use disk constantly.

So disk speed is crucial to server speed.

[mytweetlinks]

[quote]
Disk speed is crucial for MySQL databases. It has been a real challenge in multi-tenant environments like Amazon’s EBS. The provisioned IOPS feature addresses this head on, allowing customers to lock in great MySQL database performance!
[/quote]

Since Amazon is a multi-tenant environment, other customers are using that same network, and hitting those same disks. So if your neighbors are seeing a lot of traffic to disk, your web application can slow down. Not good!

What is Provisioned IOPS

We’ll agree that it’s one of the worst branded features ever, but you should know about it and use it, especially for your MySQL databases.

Provisioned means that you’ll lock in performance in advance, and IOPS stands for I/O operations. Think of it as google juice for your cloud database servers!

Although I couldn’t be in Vegas to attend re:Invent, there is so much online it’s almost better than being at the conference. From an ongoing live stream of keynotes and sessions, to an archived collection on Youtube.

The big wins

You may have heard of all the great things that Amazon or cloud computing can do, but I thought Andy Jassy summarized these nicely in these six points.

Redshift

By far the biggest announcement at the show is Amazon’s new Redshift product. It is a fully managed datawarehouse solution that scales to petabytes in it’s cloud. Currently there are two business intelligence tools that are supported namely Jaspersoft and Microstrategy.

Reduced prices by 25% for S3

As a lot of folks know, Amazon has always been about cheaper prices. That model has been disruptive in the book selling industry, and in a huge way in the infrastructure and datacenter industry. As more customers signup, economies of scale mean they can offer the same hardware & services for lower prices.

With that they’re announcing lower prices for S3 by a whopping 25%. To me this speaks to their continuing push to dominate the market by driving prices downward.

Amazon’s Channel on Youtube

If you weren’t able to attend the conference, or want to recap some highlights you might have missed, they have put up a great AWS Channel on Youtube.

If you’re a systems administrator or ops guy, you may want to prepare for an interview for such a position. Meanwhile, if you’re a director of it or operations, a recruiter or manager in HR, you’ll want to have some idea how to find the right candidate.

1. Solid unix systems administrator

At the top of the list, a cloud operations expert needs to understand Unix and more importantly Linux. Here are some sample questions to get the conversation moving:

o What is web operations and what have you done day-to-day?

Prepare some stories.

o What’s your favorite feature of the linux kernel?

This is an open ended question, but a systems administrator should have some knowledge here. The kernel is the most basic piece of software that runs when a computer boots up, whether it is a desktop or a server. This piece of software coordinates everything, manages resources, and directs traffic.

o Name some distributions of linux. What is a distro?

Linux is built by a collaborative team of thousands on the internet. That’s what makes it open source. The distributions, include the operating system, along with a collection of software to go along with it. All the supporting utilities, libraries and servers must be compiled and held in a repository. That’s what makes up a distribution. Debian, Redhat and Ubuntu are a few popular ones.

[quote]
A cloud operations expert needs to have a wide ranging skillset, from unix administration, architecture, scalability, database & webserver administration, troubleshooting & performance, load & stress testing. You’ll also want someone who has learned hard lessons from some failures, has some war stories to tell and has a hard nose for stability.
[/quote]

o What’s the difference between apache and nginx?

These two pieces of software are both webservers, that is they respond to the HTTP protocol, and can serve HTML pages. They also have a myriad of plugins to support different languages and features. The difference? Nginx (pronounced engine-X) is a newer incarnation. It’s been rearchitected from the ground up, building on all the things learned from Apache over the years. Its tighter, more efficient code, and easier to configure.

There are lots of examples of these types of databases. They are a very simple memory cache that can interface with most applications. Memcache is a popular example of a key value store. Redis, CouchDB and Voldemort can also do this.

o What is a page cache? Reverse proxy cache? examples?

These are all the same thing. They are basically a very minimal webserver without all the plugins or bells and whistles. You put one of these in front of your webserver to handle all the easy stuff, and speed up overall throughput. Varnish is a popular example.

o What filesystem do you prefer?

This is a bit arcane, but one should have some opinions here. xfs is a popular filesystem, though ext3 and ext4 are also common. Emphasize the journaling aspect here. Journaling means that if you pull the cord or your server crashes, the filesystem can recover upon reboot. It does this by journaling changes, much how a database keeps a redolog cache of recent changes to database tables.

o Command line tools

There are lots of commands in the day-to-day toolbox of a web ops expert. Here are some examples:
rsync (pronounced our-sync) – sync files between servers & do checksums to allow easy restarts
scp (pronounced s-c-p) – secure copy, similar to rsync but no checksums, so less reliable
curl (pronounced kurl) – diagnose & test urls and HTTP from the command line
cron (pronounced cron) – run commands at scheduled times
ssh (pronounced s-s-h) – secure shell, the most basic tool to reach a cloud server
ifconfig (pronounced if-config) – check the network interfaces on the server
vi/emacs (pronounced v-i and e-macks) – terminal editors, to modify config files
uptime (pronounced up-time) – display the current load average of the server
top (pronounced top) – interactive display of system metrics like memory, load, swap & processes
ps (pronounced p-s) – shows running processes on the server
/var/log/messages – essential system logfile

o What are application servers? How are they different from webservers?

Tomcat & Glassfish are two examples of application servers. These handle heavier weight languages & applications like Java. Application server on some level is just a more heavyduty webserver and these days Apache can be thought of as an application server also.

2. Cloud concepts

o What is virtualization? What is a hypervisor?

Virtualization allows you to run one or more computers within a computer. You can do virtualization on a desktop, sharing network, memory, cpu and disk resources among a number of virtual servers. But more importantly in cloud computing or IaaS offerings you can do virtualization at the datacenter level. The hypervisor layer is a datacenter virtualization technology that provisions server resources, and balances shared network and disk resources.

o What is an image?

In Amazon the world, the AMI or amazon machine image is a snapshot of a server state at one moment in time. This image is take at the block level, and includes the master block record, the first block on disk that a server boots from. All that is the state of a server, when it is shutdown, is what is stored on disk or in this image. All config files, logfiles, and anything else writing to disk.

o What is multi-tenant?

This means that there are multiple servers sharing resources. The tenants are the customers who each want to get the server, cpu, memory, network and disk that they paid for.

o What is the downside to shared resources?

Contention for resources is always the challenge. If your fellow tenants are not very thirsty, this can work to your advantage. But if they’re also heavy users, the hypervisor layer has manage the balancing act. You may get a spike of disk I/O at one point, but later get a dearth. This can cause a relational database like MySQL or Oracle to suddenly look stalled.

o What is instance-store? What is ebs?

Instance store servers were Amazon’s original offering, where servers had their own local (and slow) storage. This storage was ephemeral, so all machine state was lost on reboot. These servers also boot slowly. EBS also known as elastic block storage is a virtualized storage option, similar to NAS or NFS. You can create arbitrary chunks of storage, and attach them to servers, all from command line APIs. Cool!

o What is virtual private cloud?

With the VPC offering, Amazon drops a router into your existing datacenter. You can then provision virtual servers to your hearts content, and they all appear to be servers in your existing datacenter. Elastically scale, within the network and security model you’re already using.

o What is a hybrid approach to cloud adoption?

Keeping your investments in hardware and datacenter is obviously an appealing option for firms that have large existing environment. A hybrid approach with a VPC allows you to get your feet wet, but still keep essential applications on physical servers.

Amazon’s infrastructure offering isn’t just a single datacenter with servers. The beauty of what they’ve built is that they offer a number of datacenters (called availability zones) in each of many regions such as Northern Virginia, Oregon and Singapore.

A CDN is a content delivery network. Remember all those files that make up a webpage? Images, video, css files? Turns out serving these components from servers *closer* to your customer, make their webpages load much faster. CDNs are networks of servers that hold the content of your pages, and serve them faster.

It works by replacing content paths with a special one from your provider. A simple change in your code will allow content to dynamically load from across the web. Cool!

CloudFront is Amazon’s offering coupled with S3 for file storage. Akamai is another big provider.

o Why does the API battle between Amazon & Eucalyptus (FOSS) matter?
o Do you use command line tools? why?
o What can go wrong with backups? how do we test them?
o Should we encrypt filesystems in the cloud? what are the risks?
o Should we use offsite backups?
o What is DRBD?
o Why is auditing important? access control?
o What is load balancing? why is it difficult with databases? o How do you perform a benchmark? perform load testing? o Why use a package manager? can we install from source?

Today part of Amazon Web Services failed, taking down with it a slew of startups that all run on Amazon’s Cloud infrastructure. AirBNB was one of the biggest, but also Heroku, Reddit, Minecraft, Flipboard & Coursera down with it. Its not the first time. What the heck happened, and why should we care?

1. Root Cause

The AWS service allows companies like AirBNB to build web applications, and host them on servers owned and managed by Amazon. The so-called raw iron of this army of compute power sits in datacenters. Each datacenter is a zone, and there are many in each of their service regions including US East (Northern Virginia), US West (Oregon), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), Asia Pacific (Tokyo), South America (Sao Paulo), and AWS GovCloud.

Today one of those datacenters in the Northern Virginia region had a failure. What does this mean? Essentially firms like AirBNB that hosted their applications ONLY in Northern Virginia experienced outages.

2. Use Redundancy

Although there are lots of pieces and components to a web infrastructure, two big ones are webservers and database servers. Turns out AirBNB could make both of these tiers redundant. How do we do it?

On the database side, you can use Amazon’s multi-az or alternately read-replicas. Each have different service characteristics so you’ll have to evaluate your application to figure out what will work for you.

Then there is the option to host mysql or Percona directly on Amazon servers yourself and use replication.

Now that you’re using multiple zones and regions for your database the hard work is completed. Webservers can be hosted in different regions easily, and don’t require complicated replication to do it.

3. Have a browsing only mode

Another step AirBNB can take to be resilient is to build a browsing only mode into their application. Often we hear about this option for performing maintenance without downtime. But it’s even more valuable during a situation like this. In a real outage you don’t have control over how long it lasts or WHEN it happens. So a browsing only mode can provide real insurance.

For a site like AirBNB this would mean the entire website was up and operating. Customers could browse and view listings, only when they went to book a room would the encounter an error. This would be a very small segment of their customers, and a much less painful PR problem.

Facebook has experience intermittent outages of it’s service. People hardly notice because they’ll often only see a message when they are trying to comment on someone’s wall post, send a message or upload a photo. The site is still operating, but not allowing changes. That’s what a browsing only mode affords you.

[quote]A browsing only mode can make a big difference, keeping most of the site up even when transactions or publish are blocked.
[/quote]

Drupal, an open source CMS system that powers sites like Adweek.com, TheHollywoodReporter.com, and Economist.com uses this technology. It supports a browsing only mode out of the box. An amazon outage like this one would only stop editors from publishing new stories temporarily. A huge win to sites that get 50 to 100 million with-an-m pageviews per month.

4. Web Applications need Feature Flags

Feature flags give you an on/off switch. Build them into heavy duty parts of your site, and you can disable those in an emergency. Host components multiple availability zones for extra peace of mind.

5. Consider Netflix’s Simian Army

Netflix takes a very progressive approach to availability. They bake redundancy and automation right into all of their infrastructure. Then they run an app called the Chaos Monkey which essentially causes outages, randomly. If resilience from constantly falling and getting back up can’t make you stronger, I don’t know what can!

Take a look at the Netflix blog for details on intentional load & stress testing.

6. Use multiple cloud providers

If all of the above isn’t enough for you, taking it further you’d do as George Reese of enstratus recommends and use multiple cloud providers. Not being beholden to one company could help in more situations than just these type of service disruptions too.

Basic EC2 Best Practices mean building redundancy into your infrastructure. Multiple cloud providers simply take that one step further.

A couple years back I was contacted to look at a very strange problem.

The firm ran flash sales. An email goes out at noon, the website traffic explodes for a couple of hours, then settles back down to a trickle.

Of course you might imagine where this is going. During that peak, the MySQL database was brought to its knees. I was asked to do analysis during this peak load, and identify and fix problems. Make it go faster, please!

First day on the job I’m working with a team of outsourced DBAs. I was also working with a sort of swat team chatting on SKYPE, while monitoring the systems closely.

Then up popped one comment from a gentlemen I hadn’t worked with. He insisted there was contention for a little known MySQL resource called the AUTO_INC lock. Since I wanted to know more, I asked who the guy was and to my surprise he turned out to be the CTO.

[quote]The CTO was tuning and troubleshooting the database![/quote]

Wow, that’s a first. I thought I’d seen it all. A CTO is normally overseeing technology & the team rather than crawling around in the trenches on the front line.

This all raised some important points

1. The app was having major growing pains
2. Current architecture was not scaling
3. Amazon elasticity was not helping at the database layer
4. People & process were also failing, hence the CTOs hands on approach

It was shocking to see a problem deteriorate to this point, but when you consider the context its understandable. A company like this is struggling with hypergrowth to such a degree, that each day seems like a hurricane storm. With emergency meetings, followed by hardware & application emergencies, trouble seems constant. It can be very difficult to step back and see the larger picture.

The takeaway from this experience…

o Amazon EC2 can’t do it all – consider physical servers for disk intensive apps
o MySQL still has some real scalability limitations
o use technology for its intended purpose – MySQL isn’t great for queueing
o A CTO tuning the database means problems have deteriorated too far

You’ve heard all the buzz and spiel about the cloud, and there’re good reasons to want to get there. On-demand compute power makes new levels of scalability possible. Low up front costs means moving capital expenditure to operating expenditure and saving a bundle in the process. We won’t give you anymore of the rah rah marketing hoopla. You’ve heard enough of that. We’ll gently play devil’s advocate for a moment, and give you a few things to think about when deploying applications with a cloud provider. Our focus is mainly on Amazon EC2.

One of the biggest hurdles we see clients struggle with on Amazon EC2 is performance. This is rooted in the nature of shared resources. Computer servers, just like desktops rely on CPUs, Memory, Network and Disk. In the virtual datacenter, you can be given more than your fair share without you even knowing it. More bandwidth, more CPU, more disk? Who would complain? Well if your application behaves erratically, while you suddenly compete for disk resources you’ll quickly feel the flip side of that coin. Stocks go up, and they can just as easily come right back down.

Variability around disk I/O seems to be the one that hits applications the hardest, especially the database tier of many web applications. If your application requires extremely high database transaction throughput, you would do well to consider physical servers and a real RAID array to host your database server. Read more about IOPs

Uncertain Reliability – A Loaded Gun

Everybody has heard the saying, don’t hand someone a loaded gun. In the case of Amazon servers, you really do load your applications onto fickle and neurotic servers.

Imagine you open a car rental business. You could have two brand new fully reliable cars to rent out to customers. Your customers would be very happy, but you’d have a very small business. Alternatively you could have twenty used Pintos. You’d have some breaking down a lot, but as long as you keep ten of them rented at a time, your business is booming.

Managed hosting providers vary drastically in terms of the support you can expect. Companies like Rackspace, Servint or Datapipe have Support built into their DNA. They’ve grown up around having a support tech that your team can reach when they’re having trouble.

Amazon takes the opposite approach. They give you all the tools to do everything yourself. But in a crunch it can be great to have that service available to help troubleshoot and diagnose a problem. Although they’re now offering support contracts, it’s not how they started out.

If you have a crack operations team at your disposal, or you hire a third party provider like Heavyweight Internet Group Amazon Web Services gives you the flexibility and power to build phenomenal and scalable architectures. But if you’re a very small team without tons of technical know-how, you may well do better with a service-oriented provider like Rackspace et al.

A few more considerations…

Will your cloud provider go out of business?

Could a Subpoena against your provider draw you into the net?

Since you don’t know where your sensitive data is, should you consider encryption?

In search of a good book on Chef itself, I picked up this new title on O’Reilly. It’s one of their new format books, small in size, only 75 pages.

There was some very good material in this book. Mr. Nelson-Smith’s writing style is good, readable, and informative. The discussion of risks of infrastructure as code was instructive. With the advent of APIs to build out virtual data centers, the idea of automating every aspect of systems administration, and building infrastructure itself as code is a new one. So an honest discussion of the risks of such an approach is bold and much needed. I also liked the introduction to Chef itself, and the discussion of installation.