Amazon Outage: Is it a Story of a Conspiracy? – Chapter 2

In April 2011, when Amazon’s cloud US East region failed, I posted the first chapter of the Amazon Cloud Outage Conspiracy. It was already very clear then that the cloud would fail again, and here it is… Chapter 2.

Let’s first try to understand Amazon’s explanation for this outage.

“At approximately 8:44PM PDT, there was a cable fault in the high voltage Utility power distribution system. Two Utility substations that feed the impacted Availability Zone went offline, causing the entire Availability Zone to fail over to generator power. All EC2 instances and EBS volumes successfully transferred to back-up generator power.”

Ok. So the AZ power failed over to generator power.

“At 8:53PM PDT, one of the generators overheated and powered off because of a defective cooling fan. At this point, the EC2 instances and EBS volumes supported by this generator failed over to their secondary back-up power (which is provided by a completely separate power distribution circuit complete with additional generator capacity).”

Ok. So the generator failed over to a separate power circuit.

“Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit. After this circuit breaker opened at 8:57PM PDT, the affected instances and volumes were left without primary, back-up, or secondary back-up power.”

Ok. So the power circuit was not configured right and the computing resources didn’t get enough power (or something like that).
> > > Did you get that?
Sounds like it might have been something as simple as someone tripping over a wire that led to all of that. Anyway, Quora, Heroku, Dropbox, and other sites failed again due to the cloud outage and were down for hours. The power outage resulted in downtime and inconsistent behavior across EC2 services, including instances, EBS volumes, RDS, and an unresponsive API.
After about 5 hours, Amazon announced that it had managed to recover most of the EBS (Elastic Block Store) volumes:

“Almost all affected EBS volumes have been brought back online. Customers should check the status of their volumes in the console. We are still seeing increased latencies and errors in registering instances with ELBs.”
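As a side note, instead of refreshing the console, that status check can be scripted. Here is a minimal sketch, assuming the Python boto (v2) library and AWS credentials already configured in the environment; the region is illustrative:

```python
# A minimal sketch, assuming boto (v2) and AWS credentials in the environment.
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

# List volumes that are not yet back to a healthy state.
for volume in conn.get_all_volumes():
    # A healthy volume is typically 'available' or 'in-use'.
    if volume.status not in ('available', 'in-use'):
        print('Volume %s is still %s (AZ: %s)' % (volume.id, volume.status, volume.zone))
```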

Once Quora was back online, I opened the thread – What are the lessons learned from Amazon’s June 2012 us-east-1 outage? Among the great answers submitted, I want to point to one particularly interesting piece of feedback regarding the fragility of EBS volumes, which suggested working with instance store rather than EBS-backed instances. The differences between the two include cost, availability, and performance considerations. It is important to understand these differences and make an informed decision about which to base your cloud environment on.
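To make that decision a bit more concrete, here is a rough sketch of how to tell the two apart programmatically, again assuming boto (v2); the AMI ID is a placeholder:

```python
# A rough sketch with boto (v2); 'ami-xxxxxxxx' is a placeholder AMI ID.
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')
image = conn.get_image('ami-xxxxxxxx')

# 'ebs' means an EBS-backed root volume (data survives a stop/start but
# depends on the EBS service); 'instance-store' means ephemeral local disks
# that disappear when the instance terminates.
print('Root device type: %s' % image.root_device_type)

if image.root_device_type == 'instance-store':
    print('Instance-store backed: treat local data as disposable and replicate it yourself.')
else:
    print('EBS backed: snapshot regularly, and remember EBS itself can be the failing component.')
```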
> > > Education
Anyway, back to our conspiracy. In contrast to the last outage, right after this one new Amazon AWS experts were born who spouted the cloud giant’s mantra about its building blocks: Amazon provides the tools and resources to create a robust environment. They proudly tweeted that their AWS-based services didn’t fail. This proves that the April outage served Amazon well in terms of customer education, though some mega websites still failed again.
So, does Amazon check whether its customers improved their deployments following last year’s outage? Does the cloud giant continue to teach its customers using outage drills? Is that a conspiracy?
> > > Additional Revenues
The outage again raised the discussion about the distinct Availability Zone (AZ). Once more, it seems that the impacted resources in a specific AZ affected the whole AWS US East region, generating API latency and inconsistencies (API errors varied from 500s to 503s to RequestLimitExceeded). High availability best practice includes backup, mirroring, and distributing traffic across at least two Availability Zones. The apparent region-wide impact, and hence the dependency between AZs, strengthens the need to maintain a cross-region or even cross-cloud disaster recovery (DR) practice.
These DR practices require more computing resources and data transfer (between AZs and regions), meaning significant additional costs which apparently support the cloud giant’s revenue growth. Is that a conspiracy?
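To put the multi-AZ best practice (and its cost) in concrete terms, here is a minimal sketch of spreading instances and an Elastic Load Balancer across two Availability Zones, again assuming boto (v2); the AMI ID, instance type, load balancer name, and zone choices are illustrative:

```python
# A minimal sketch with boto (v2); AMI ID, instance type, LB name, and zones are illustrative.
import boto.ec2
import boto.ec2.elb

ec2 = boto.ec2.connect_to_region('us-east-1')
elb = boto.ec2.elb.connect_to_region('us-east-1')

zones = ['us-east-1a', 'us-east-1b']

# Launch one instance in each Availability Zone.
instances = []
for zone in zones:
    reservation = ec2.run_instances('ami-xxxxxxxx',
                                    instance_type='m1.small',
                                    placement=zone)
    instances.extend(reservation.instances)

# Create a load balancer that spans both zones and register the instances.
listeners = [(80, 80, 'HTTP')]
lb = elb.create_load_balancer('my-web-lb', zones, listeners)
lb.register_instances([i.id for i in instances])
```

Every extra zone or region added to that loop means more instances, more storage, and more data transfer on the bill – which is exactly the cost point above.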
> > > Final words
The cloud giant is a leader and a guide for other IaaS providers as well as for new PaaS players. Without a doubt – Amazon is the Cloud (for now, anyway).
To clarify: I don’t think there is any conspiracy. This is part of the market’s learning curve, for customers and vendors alike, and specifically for Amazon. Lots of online discussions and articles have been published in the last few days explaining what happened and what AWS customers should learn from it.
There is no doubt that the cloud will fail again. I believe that although customers are ultimately responsible for the high availability of their services, the AWS cloud guys should also take a step back to learn and improve – every additional outage diminishes the cloud’s reliability as a platform for everyone.
(Cross-posted on CloudAve)
