What Does Amazon Outage Mean for Cloud Adoption?

On Friday, April 22, Pillsbury hosted a meeting of the Washington, DC, chapter of the Cloud Security Alliance (CSA-DC). Dr. Ramaswamy Chandramouli, Group Chair of the NIST Cloud Computing Security Working Group, addressed CSA-DC members representing local businesses, government agencies and various consulting and law firms about the work NIST is doing to develop a security architecture for cloud services.

Dr. Chandramouli’s presentation focused, among other things, on how the software development life cycle (SDLC) needs to be adapted for the move to cloud-based services, including ways to maximize the ability to move applications from one cloud provider to another. According to Dr. Chandramouli, a move to the cloud requires re-evaluating a number of aspects of the SDLC, from access controls and the use of technologies like OpenID to the use of digital libraries and APIs provided by third parties. As Dr. Chandramouli and a number of other participants at the meeting noted, the move to the cloud also requires an examination of your disaster recovery/business continuity planning.

Naturally, the discussion turned to last week’s Amazon EC2 outage, its likely causes and its effects.

This incident comes as the Federal government is pursuing a “cloud first” technology strategy under which, for example, the Department of Agriculture and the General Services Administration plan to migrate a combined 137,000 email users to the cloud in the next five years. According to Vivek Kundra, CIO of the Office of Management and Budget, moving this email to the cloud would reduce IT costs by $42 million over that same period. As part of that movement to the cloud, Kundra said in testimony on Capitol Hill on April 19 that Federal agencies expect to close 100 data centers this year, with about 700 more to be shuttered over the next five years.

So what impact should the Amazon outage have on the decision to move to the cloud? Among those at the CSA-DC meeting, the consensus was that the people who signed up for the services made the choice to assume that risk. The thinking was that, when they signed up for the EC2 service, customers did one of two things. Either they chose the less expensive option of being hosted in a single Amazon availability zone, which means they accepted the potential for a single data center outage without failover capability, or they chose to pay for the ability to implement failover services but perhaps didn’t engineer their applications to take proper advantage of the failover capacity they were paying for.

In fact, Amazon’s “AWS Web Hosting Best Practices” (PDF) states:

Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. As can be seen in the AWS web hosting architecture, it is recommended to spread EC2 hosts across multiple Availability Zones since this provides for an easy solution to making your web application fault tolerant. Care should be taken to make sure that there are provisions for migrating single points of access across Availability Zones in the case of failure.

Certain Amazon customers were unable to recover from the outage because nothing automatically re-routes capacity between Amazon’s availability zones; the customer’s infrastructure must be specifically designed and implemented to switch resources between them. Customers with systems in one of the availability zones in Amazon’s “US East” region could not reliably access the data in their Elastic Block Store (EBS) volumes. If those customers had performed regular backups of those volumes, their outage would have been confined to a few hours rather than stretching past forty. In last week’s outage, all but one of Amazon’s US East region availability zones were functioning normally within about four hours. Many participants at the meeting made the same observation that was reported in the New York Times: the companies apparently hit hardest by the Amazon interruption were start-ups, which are less likely to pay for extensive backup and recovery services.
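
To make the design point concrete, here is a minimal sketch of what "designed to switch resources between availability zones" might look like at the level of placement logic. It is an illustration only, not Amazon's mechanism or any customer's actual implementation: the function names `place_instances` and `fail_over` and the zone labels are hypothetical, and no AWS API is called.

```python
from itertools import cycle

# Hypothetical zone labels, used for illustration only.
ZONES = ["zone-a", "zone-b", "zone-c"]


def place_instances(count, zones):
    """Spread `count` instances across `zones` in round-robin order."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(count)]


def fail_over(placement, failed_zone, zones):
    """Move instances out of `failed_zone` into the least-loaded healthy zone.

    Raises RuntimeError when no healthy zone remains, which is the
    situation a single-zone deployment is in during a zone outage.
    """
    healthy = [z for z in zones if z != failed_zone]
    if not healthy:
        raise RuntimeError("no healthy zone left to fail over to")

    # Count the instances already running in each healthy zone.
    load = {z: 0 for z in healthy}
    for zone in placement:
        if zone in load:
            load[zone] += 1

    new_placement = []
    for zone in placement:
        if zone == failed_zone:
            target = min(healthy, key=lambda z: load[z])
            load[target] += 1
            new_placement.append(target)
        else:
            new_placement.append(zone)
    return new_placement


if __name__ == "__main__":
    placement = place_instances(4, ZONES)
    print(fail_over(placement, "zone-a", ZONES))
```

A deployment placed in only one zone has nowhere to go when that zone fails, so `fail_over` raises an error; a deployment spread across several zones can shift its load to the survivors. In practice the hard part is not this bookkeeping but making data (such as backed-up EBS volumes) and single points of access available in the surviving zones.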

The CSA-DC members felt that Amazon was able to handle a critical situation, discuss it publicly in real-time, and advise customers what could be done to minimize recovery time. While the meeting attendees praised Amazon’s transparency regarding the outage, they recognized that the challenge for the cloud industry is clarifying what customers need to do in the event of incidents. Providers and customers need to do a better job of documenting the “what” and the “how” related to incident management so that DR/BC procedures work properly.

The lesson customers should take from the EC2 outage is not that the cloud is unreliable. Rather, just as deploying a mission-critical application in your own data center isn’t as simple as setting up a few servers, deploying a cloud-based application properly is not as easy as buying a few server instances. The cloud offers the potential for significant cost savings and access to skills and services beyond the reach of many organizations (particularly small- to medium-size businesses). But, as with any service, customers need to work with suppliers to understand the limitations of the service and the implications of those limitations for their operations, incident management, business continuity and disaster recovery.
