Traditionally, delivering high availability meant replicating everything, and even with the option of going to the cloud, providing two of everything is costly. High availability should be planned and achieved at several different levels, including the software, the data center and geographic redundancy. According to a recent study, the cost of a data center outage ranges from a minimum of $38,969 to a maximum of $1,017,746 per organization, with an overall average cost of $505,502 per incident. The study's key findings:
1 – Total cost of partial and complete outages can be a significant expense for organizations.
2 – Total cost of outages is systematically related to the duration of the outage.
3 – Total cost of outages is systematically related to the size of the data center.
4 – Certain causes of outage are more expensive than others. Specifically, IT equipment failure is the most expensive root cause, while accidental/human error is the least expensive.
With regard to environment availability, one possible conclusion based on these findings is that moving the data center infrastructure responsibility to the IaaS vendor provides several significant benefits to the IT organization. The cloud essentially removes the hassle and the huge hidden costs associated with downtime. The major cost, however, results from business disruption and lost revenues, and these costs remain relevant for cloud consumers when the cloud itself crashes.
So, why do outages happen? Three main issues are associated with the cloud infrastructure availability –
1 – Physical hardware – The IT problems typical of traditional hosting are caused mainly by network components. Given the complexity and size of the new cloud platforms, we should expect more downtime caused by server and storage failures.
2 – The virtualization layer – The virtualization layer is, in essence, software, and its use can decrease the stability and availability of the platform. Virtualization enables several servers to run on the same physical resource; when that consolidated resource suffers an outage, multiple applications may be at risk of discontinuity. Because the most common causes of an application outage are software related, the more complex the virtualization stack, the higher the risk of failure.
3 – Power failures – Outages may be caused by the lack of a robust infrastructure and appropriate backup means. Electric power is one of the major costs of an IaaS vendor (up to 30% of the cloud's overall operating expenses). The cloud giants have dedicated professionals who plan the establishment of the infrastructure, in particular its power needs. Moving to the cloud eliminates the need to take care of this complicated and expensive operation, as well as the associated capital expense.
Deciding to move to the cloud raises many questions regarding high availability. The traditional IT manager will need to change his mindset and transfer some of his team's responsibilities to the cloud service providers, including the IaaS vendor. At the same time, the cloud (IT) operations team needs to understand that the overall liability remains at its end. The team needs to carefully plan the roles in each deployment layer when designing the service's HA (high availability).
Today the cloud providers announce changes and improvements to their infrastructures on a monthly basis. The IaaS market is moving strongly towards providing the IT environment's basic building blocks (essentially virtual servers and storage) with a robust API, and is less focused on delivering a UI front end. The cloud operations team (the new IT team) should have the skills to leverage the IaaS API's virtual features and build a system that meets the defined SLA. Some application developers (SaaS vendors) may find it better to outsource the cloud management, including its HA deployment and maintenance.
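The "leverage the IaaS API" idea above can be sketched as a reconciliation loop: codify the SLA as a desired number of healthy instances and let automation call the provider's API to converge on it. This is a minimal illustration only; `FakeIaaSClient` is a hypothetical stand-in, not any real vendor's SDK, though real clients expose similar launch/terminate/list calls.

```python
class FakeIaaSClient:
    """Hypothetical stand-in for a vendor SDK (illustration only)."""
    def __init__(self):
        self._next_id = 0
        self.instances = {}  # instance id -> "healthy" | "failed"

    def launch_instance(self):
        self._next_id += 1
        iid = f"vm-{self._next_id}"
        self.instances[iid] = "healthy"
        return iid

    def terminate_instance(self, iid):
        self.instances.pop(iid, None)

    def list_instances(self):
        return dict(self.instances)

def reconcile(client, desired_count):
    """Replace failed instances and top the fleet up to the desired capacity."""
    for iid, state in client.list_instances().items():
        if state == "failed":
            client.terminate_instance(iid)
    shortfall = desired_count - len(client.list_instances())
    return [client.launch_instance() for _ in range(shortfall)]

iaas = FakeIaaSClient()
reconcile(iaas, desired_count=3)      # build the initial fleet of three
iaas.instances["vm-1"] = "failed"     # simulate a server failure
reconcile(iaas, desired_count=3)      # the failed VM is replaced automatically
print(len(iaas.list_instances()))     # prints 3
```

Run on a schedule (or driven by monitoring events), a loop like this is what lets a small cloud operations team hold a fleet to its SLA without manual intervention.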
What to ask about your IaaS provider’s availability –
1 – Architecture – What is the system architecture? Does it have the robustness that you are looking for?
2 – Guarantee – What level of guarantee (SLA) do they provide?
3 – Continuity – How do they ensure your service continuity? What procedures do they perform to maintain backup and recovery? What about failover capabilities and procedures?
4 – Archiving – Do they have the archiving means and procedures to maintain your secured(!) data over a long period of time?
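The failover question in the checklist above comes down to a simple policy: given an ordered list of endpoints (primary first), route traffic to the first one that passes a health check. A minimal sketch, with illustrative hostnames and a simulated probe standing in for a real health check:

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None

# Priority-ordered endpoints; hostnames are hypothetical examples.
endpoints = ["primary.dc1.example.com", "standby.dc2.example.com"]

# Simulate the primary data center going dark:
down = {"primary.dc1.example.com"}
active = pick_endpoint(endpoints, lambda ep: ep not in down)
print(active)  # prints standby.dc2.example.com
```

When evaluating a provider, ask where this decision lives (DNS, load balancer, client library), how often the health check runs, and how traffic fails back once the primary recovers.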
This year several online giants, such as Twitter, Intuit with its online QuickBooks, and Cloud Foundry, suffered from severe outages. In April this year, the cloud market leader Amazon AWS suffered a major outage in its US East facility. This was the worst crash in the history of cloud computing and of Amazon. The failure affected major sites such as Heroku, Reddit, Foursquare, Quora and many more well-known internet services hosted on EC2. News magazines and cloud bloggers reported the extraordinary news: "The cloud computing crashed".
Amazon AWS describes their environment in their security white paper:
“The data center electrical power systems are designed to be fully redundant and maintainable without impact to operations, 24 hours a day, and seven days a week. Uninterruptible Power Supply (UPS) units provide back-up power in the event of an electrical failure for critical and essential loads in the facility. Data centers use generators to provide back-up power for the entire facility.”
The second aspect of availability resides inside the application itself. The application's robustness plays a significant role in the traditional world as well; there is not much difference in the application design, including the database system, the business logic and the user interface. These layers should be built wisely to keep the application running smoothly, without bottlenecks due to the script and DB logic and distribution. In the cloud there is no change in these "traditional" matters, but there is an added layer of complexity. The software (web) developer must know the cloud delivery features, starting with understanding that the customer's IT team is not responsible for availability, all the way to being able to leverage the scalability and elasticity of the cloud to support the application's specific logic and needs.
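One concrete habit of such cloud-aware application code is treating transient infrastructure failures as expected and retrying with exponential backoff instead of failing the request outright. A minimal sketch, where `flaky_call` is a hypothetical stand-in for any remote dependency (database, queue, another service):

```python
import time

def with_retries(fn, attempts=4, base_delay=0.01):
    """Call fn, retrying transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * (2 ** attempt))

# Simulated dependency that fails twice before recovering.
failures = {"left": 2}
def flaky_call():
    if failures["left"] > 0:
        failures["left"] -= 1
        raise ConnectionError("transient cloud hiccup")
    return "ok"

print(with_retries(flaky_call))  # prints ok (after two retried failures)
```

In production this pattern is usually paired with jitter and a retry budget so that many clients recovering at once do not hammer the dependency they are waiting on.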
James Urquhart, the famous cloud blogger, writes:
The adaptive part comes about when humans attempt to correct the negative behaviors of the system (like the cascading EBS “remirroring” in the AWS outage) and encourage positive behaviors (by reusing and scaling “best practices”). Now, expand that to an ecosystem of cloud providers with customers that add automation across the ecosystem (at the application level), and you have an increasingly complex environment with adaptive systems behavior.
We will conclude by saying that the cloud is here to stay. Availability is a subject that must be planned, tested and improved over time. Urquhart notes that the ongoing immunity of the cloud will be achieved "by changing its behavior in the face of negative events". Furthermore, you must learn and improve while building your system and serving your customers, whether in terms of your service availability, security or your overall service offering.