Two recent events reminded us this spring that cloud computing infrastructures are vulnerable to the same genetic IT flaw that plagues traditional data center operations: everything fails, sooner or later.
The failure modes of cloud vs. traditional data center architectures may differ in nature and frequency, but the threat is the same – outages, downtime, lost revenues and damaged customer trust. Ironically, these same recent events also highlight how cloud infrastructures, when managed correctly, actually provide unprecedented capabilities to deliver high availability, resiliency, and business continuity in IT operations.
In March, a magnitude 9.0 earthquake and subsequent tsunami caused widespread disruptions to power supplies and network connectivity to data centers across Japan, causing Japanese companies to rethink their traditional disaster recovery strategies. Several weeks later, the Elastic Block Store (EBS) system in one of Amazon’s data centers in the Eastern U.S. failed due to a faulty router upgrade and a cascade of resulting failures, sending hundreds of customers – including many Web 2.0 companies such as Foursquare and Reddit – scrambling to resume services.
These events prove that, no matter how you slice it, neither public clouds nor private data centers constitute a magic bullet for all the needs of today’s dynamic businesses. In the case of the Amazon cloud, it may in fact have been its remarkable record of operational excellence that led some customers – despite Amazon’s constant reminders to “design for failure” – to assume that the inherently scalable, redundant and global nature of the cloud would protect them from having their systems go down. The lesson learned: it’s not just the presence of alternative cloud resource pools that matters – the ability to fail over to them quickly and seamlessly is what’s critical to maintaining continuous operations.
Rethinking Disaster Recovery
Many companies in Japan have long emphasized and invested in maintaining business continuity as a key IT principle. They have also believed that the safest place for data center operations is inside the four walls of the corporation. However, the recent tragedy exposed the weakness of this strategy. A disaster that affects an entire region can take out a corporate data center just as easily as a nuclear power plant.
And the impact of the disaster has not ended. Power-plant disruptions continue to cause rolling blackouts that cut electricity to data centers throughout the country in three-hour increments. Backup generators fill the power gap, but the Japan Data Center Council recently warned that reliance on generators is causing a diesel fuel shortage. With more than 50 data centers in the Tokyo area alone, JDCC members are burning through 5,000 to 6,250 gallons of fuel every hour. Japanese corporate leaders have begun to realize they don’t want their businesses to be dependent on diesel fuel supplies for backup generators.
As a result, Japanese companies are rethinking business continuity, eschewing traditional disaster recovery architectures and looking to the cloud to provide a new level of redundancy, failure isolation and geographical diversity for their IT resources in a more cost-effective manner. Already, ZDNet Japan has reported on several new cloud deployments by private companies and government agencies as a direct result of the earthquake.
Just when cloud computing might have seemed like a welcome solution to disaster recovery planning in Japan, Amazon suffered a major outage that affected hundreds of customers. Like their counterparts in Japan, many of these companies had not prepared a ‘Plan B’ – instead relying on a single availability zone in a single region in the Amazon Web Services cloud. It appears that they either didn’t anticipate any outages at all, or that they expected that if one of Amazon’s zones went down, they’d somehow be able to easily move their systems to one of AWS’s other regions. For most, both strategies failed. Why? Because it’s not enough to have options during failure scenarios – you also need to have reliable failover procedures that have been tested and are quick to implement.
We all know the old business continuity clichés: be prepared, never put all your eggs in one basket. But we’re also human, which is why it’s understandable that so many companies were critically affected by the AWS outage. Many organizations weren’t preparing for failure as we all know we should. And it reminds us that organizations themselves must take ownership of their own business continuity strategies and cannot rely on any single infrastructure – whether public cloud or internal data center – to always be available.
Designing for Failure in the Cloud
The best way to protect your organization from unplanned downtime due to a natural disaster or human error has always been to implement redundancy and diversity in your disaster recovery and business continuity systems. This involves enabling your team to run business services on a number of different infrastructures – whether they be public clouds such as Amazon or Rackspace, or private clouds using traditional on-premises hardware – and fail over between them quickly and efficiently as necessary.
Despite the Amazon outage, the fact is that public clouds now provide organizations with an impressively wide array of options to implement business continuity at a level of affordability that simply did not exist a few years ago. Consider this: right now from my laptop I can launch servers in a dozen disparate locations worldwide – including the U.S., Europe, and Asia – for pennies per hour. As a result, I can design a system for my business that can quite reasonably withstand localized outages from just about any human error or natural disaster, and at a lower cost than previously possible.
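To make the point concrete, the sketch below models how little logic it takes to choose a geographically diverse footprint from a menu of regions. The region names, continents and hourly prices here are hypothetical placeholders, not real provider data, and a real deployment would call a cloud API rather than this stub.

```python
# Illustrative sketch: choosing geographically diverse launch targets.
# Region names and prices are hypothetical stand-ins; a real deployment
# would query and call a provider API instead of this static table.

REGIONS = {
    "us-east":  {"continent": "NA", "price_per_hour": 0.085},
    "us-west":  {"continent": "NA", "price_per_hour": 0.095},
    "eu-west":  {"continent": "EU", "price_per_hour": 0.095},
    "ap-tokyo": {"continent": "AS", "price_per_hour": 0.100},
    "ap-sing":  {"continent": "AS", "price_per_hour": 0.100},
}

def pick_diverse_regions(regions, count=3):
    """Pick the cheapest region on each continent, up to `count`, so a
    localized disaster cannot take out every replica at once."""
    by_continent = {}
    # Walk regions cheapest-first; keep the first (cheapest) per continent.
    for name, info in sorted(regions.items(),
                             key=lambda kv: kv[1]["price_per_hour"]):
        by_continent.setdefault(info["continent"], name)
    return sorted(by_continent.values())[:count]

print(pick_diverse_regions(REGIONS))
```

The design choice worth noting is that diversity, not just redundancy, drives the selection: three replicas in one metro area share the same earthquake risk, while one per continent does not.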
The key is to design for failure. Amazon’s CTO Werner Vogels has been preaching this religion for many years now, suggesting that the only way to test the true robustness of a system is to ‘pull the plug’. Netflix – itself a major cloud infrastructure user – has created a process they call the Chaos Monkey that randomly kills running server instances and services just to make sure that the overall system continues to operate well without them. And, not surprisingly, Netflix’s overall operation saw little impact from the AWS U.S. East outage when it occurred.
Implementing failure resilient systems is not easy. How can you quickly move your operations from one infrastructure to the next when the pressure is on and the alarm bells are ringing? How do you design a system that not only allows new compute resources to begin to operate as part of your service, but also folds in an up-to-date copy of the data your users and customers depend on?
There is, of course, no one-size-fits-all solution. But there is a general approach that does work – combining redundancy in design with automation in the cloud management layer. The first step requires architecting a solution that uses components that can withstand failures of individual nodes – whether those are servers, storage volumes or entire data centers. Each component (e.g. at the Web layer, application layer, data layer) needs to be considered independently, and designed with the realities of data center infrastructure and Internet bandwidth, cost and performance in mind. Solutions for resilient design are almost as many and varied as the software components they employ.
But the secret sauce really comes in how your architecture is operated. What parts of the system can respond automatically to failure, what parts can respond nearly automatically, and which parts not at all? To be more specific, if a given cloud resource goes down – be it a disk drive, a server, a network switch, a SAN, or an entire geographical region – how seamlessly can you launch or fail over to another and keep operations running? The more of that response you can automate, the stronger your operational excellence.
Achieving that level of automation requires that your system design and configuration be easily replicable. Servers, for example, need to be quickly re-deployable in a predictable fashion across different cloud infrastructures. It’s this automation that gives organizations the life-saving flexibility they need when crisis strikes.
The right cloud management solution should simplify the process of launching entire deployments through customizable best practices. It should also provide complete visibility into all infrastructures through a central management dashboard – a ‘single pane of glass’ – through which administrators can monitor performance and make capacity changes based on real-time needs. The same automation and control that gives organizations the ability to scale up or down using multiple servers when demand increases also allows them to migrate entire server deployments to a new infrastructure when disaster strikes.
The fallout from the Japanese earthquake and Amazon outage is being felt throughout the business community and is causing organizations to rethink how they ensure business continuity. Cloud architecture provides the distributed structures necessary to counteract regional disasters, but companies also need the cloud management capabilities necessary to fail over their operations to multiple infrastructures in a way that keeps things up and running.
Some may have thought that the cloud was a magic bullet. It’s not, and that’s actually good news. By recognizing one of the original founding principles of cloud architectures – that everything fails at some point – businesses are now in a position to design and build services that are more resilient than in the past, at a fraction of the cost. With the right architecture and management layer, cloud-based services can provide unparalleled disaster protection and business continuity.