The Cloud Advocate: Cloud Automation and Isaac Asimov
May 9, 2011, 11:57 am EDT
Isaac Asimov taught us that robots don't kill people; incorrect logic kills people. Such is my takeaway from the latest spate of coverage on the implications of cloud outages.
I have to admit to being a little nonplussed by the recent coverage, because I don't find these outages particularly surprising. Not because I have a low opinion of cloud providers (in fact, I'm quite impressed by Amazon and how they've learned from their own incident), but because what we're doing in the cloud is highly complicated and highly automated. These outages are part of the learning curve. When we build huge systems with multiple control planes, each running automated logic, we have to learn how to get it right, and when it goes wrong, it goes wrong in a huge way. We're going to make a few "Terminators" before we make "R2D2s". So I actually feel our cloud is stronger today than it was a month ago. Let me cover two things here: a brief overview of what happened in the Amazon event in particular, and what that means for those of us who have to deploy to the cloud.
Amazon has a great 5,700-word essay on their event… a brilliant read if you have the time. While getting a black eye in the press, they are the leaders in this space and will inherently get coverage (BTW, just how many data centers had outages every day during this same period and received no coverage?), and I think they're teaching us all how to do the cloud right. Let me try to summarize for my colleagues who've been asking for the short version (I'm using analogies for the non-techies, so if you are a techie purist, just read the original Amazon write-up):
- Network Storm: Amazon EBS nodes plug into two networks: a high-speed data-transfer network for regular business and a lower-speed network for internal maintenance. A network upgrade went wrong, killed the high-speed network, and caused all EBS traffic to dump onto and swamp the low-speed network. The result: each EBS volume found itself alone and began to invoke its "I need to make new friends and back up my data" logic.
- Mirror Storm: When the network was restored, large numbers of EBS volumes simultaneously started making new "friends," swamped those new "friends" in the process, and so drove still more volumes to a stop (which then invoked their own logic to make new "friends"). At this point you have a full load of EBS nodes all making friends and no one actually doing work (a little like Facebook at work on a Friday…). See the sketch after this list for a toy model of this feedback loop.
- Management Storm: A control plane sits on top of the EBS system, brokering requests from customers to the EBS nodes, a little like the phone company. As customers started having problems with the original EBS volumes, the "phone" lines started filling up. At some point customers started getting "busy signals" and could no longer interact with EBS volumes. This is where the problem started to spill over into other availability zones: once the "phone company" got busy, customers started having problems calling other zones too, and calls to other zones are exactly what you would expect to pick up as customers respond to a single-zone outage by activating more nodes elsewhere.
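To make the mirror-storm feedback loop concrete, here is a toy Python simulation. It is purely illustrative, not Amazon's actual re-mirroring logic, and the capacity and overload numbers are invented; the point is just to show how orphaned volumes hunting for replica space can exhaust the very capacity their neighbors need.

```python
# Toy model of a re-mirroring storm (illustrative only, invented numbers).
# Orphaned volumes hunt for free replica slots; the stampede overloads
# healthy nodes, orphaning new volumes, which join the hunt next round.

TOTAL_CAPACITY = 100        # free replica slots in the cluster
orphaned = 60               # volumes that lost their mirror in the outage

free_slots = TOTAL_CAPACITY
for round_num in range(1, 11):
    remirrored = min(orphaned, free_slots)   # grab whatever space is left
    free_slots -= remirrored
    orphaned -= remirrored
    newly_orphaned = remirrored // 2         # stampede knocks out neighbors
    orphaned += newly_orphaned
    print(f"round {round_num}: re-mirrored {remirrored}, "
          f"newly stuck {newly_orphaned}, free slots left {free_slots}")
    if remirrored == 0 and orphaned > 0:
        print(f"gridlock: {orphaned} volumes waiting, zero free capacity")
        break
```

Run it and the fleet gridlocks in a few rounds: everyone is "making friends" and no one is doing work, which is roughly the state Amazon describes.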
What we learn from it
First, don't panic. OK, Amazon had an outage. They learned from it; let's move on. So, some thoughts:
- Rethink Redundancy: Customers are going to rethink their redundancy strategy. Those who were in a single zone will think about using the multiple zones Amazon offers. Those with multiple zones will think about redundancy across regional centers. And those across multiple centers may think about multiple providers: one foot in Amazon, one foot in a different cloud (OK, this is a bit of a dream at this point given the flux of the technology and the interoperability issues, but I think we may get there sooner rather than later). See the first sketch after this list for the multi-zone idea in code.
- Examine Control: Customers need to examine how their own cloud management control plane reacted. A good control plane would have seamlessly migrated resources away from the bad zone and kept up with the problem. Netflix did this nicely. If your system didn't, why not? Each customer is going to have to sit down, look at their implementation, whether home-grown systems using cloud APIs or vendors like RightScale or ServiceMesh, and figure out what changes to make so it responds better.
- Support Automation: Your systems have to support automation so the overall system can self-heal from an outage like the one we saw at Amazon. For example, this is part of why we at SafeNet are designing interaction with cloud management systems into our solutions. When your automation tools react to issues like the Amazon outage, our solutions stick with you as you re-provision. The EBS outage wouldn't have affected our solutions directly, but we're making sure you can fail over elegantly when something like it happens. Demand the same of all your vendors, security or not. The second sketch after this list shows the shape of such a self-healing loop.
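To put the multi-zone idea in concrete terms, here is a minimal sketch using the classic boto (v2) EC2 API, which is what most Python shops were scripting Amazon with at the time. The AMI ID is a placeholder and your instance sizing will differ; the point is simply to pin one replica of your fleet into each healthy zone so no single-zone failure takes everything out.

```python
# Sketch: spread a small fleet across every healthy zone in one region.
# Assumes classic boto (v2); the AMI ID below is a placeholder.
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

# Only target zones EC2 currently reports as available.
zones = [z for z in conn.get_all_zones() if z.state == 'available']

for zone in zones:
    reservation = conn.run_instances(
        'ami-12345678',            # placeholder image
        instance_type='m1.small',
        placement=zone.name,       # pin this replica to one zone
    )
    print('launched', reservation.instances[0].id, 'in', zone.name)
```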
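And for the control and automation points, here is the rough shape of a self-healing loop. check_zone_health() and reprovision() are hypothetical stand-ins for whatever your tooling actually provides (home-grown API scripts, RightScale, ServiceMesh); the structure, not the helpers, is the point.

```python
# Sketch of a self-healing control loop (provider-agnostic).
# check_zone_health() and reprovision() are hypothetical stubs standing in
# for your real cloud API calls or management tooling.
import time

ZONES = ['zone-a', 'zone-b', 'zone-c']

def check_zone_health(zone):
    # Hypothetical: probe your instances/volumes in `zone`.
    return True  # replace with a real health check

def reprovision(bad_zone, good_zone):
    # Hypothetical: relaunch capacity in `good_zone`, reattach storage,
    # and rebind keys/credentials so security services follow the workload.
    print('moving workload from', bad_zone, 'to', good_zone)

while True:
    healthy = [z for z in ZONES if check_zone_health(z)]
    for zone in ZONES:
        if zone not in healthy and healthy:
            # Don't wait for a human: act while the management plane
            # in the bad zone may still be answering at all.
            reprovision(zone, healthy[0])
    time.sleep(60)   # poll interval is a judgment call
```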
So a couple of years down the road, we as an industry will figure out the three laws of cloud, but until then we're going to have a few missteps. And that's not a bad thing. -Dean