Sunday, July 1, 2012

Storm Cloud

This weekend a major storm hit the Washington, DC area and knocked out power for over a million people.  It also knocked out power to Amazon's Ashburn, Virginia data center, also known as the us-east-1d availability zone.  Since Flex runs on the Amazon cloud, I thought it would be a good idea to explain how the Amazon outage affected Flex.

Luckily, Flex fared much better in the outage than better-known companies like Netflix and Instagram, both of which sustained major service interruptions.  Only one of our customers experienced any downtime, and that lasted less than ten minutes.  This customer was on a dedicated server and is one of only two customers whose systems are located on the east coast.  We have no support tickets on file for the outage, most likely because it occurred on Friday night, after normal working hours.

All other customers are hosted in Amazon's Dublin, Ireland and Oregon data centers: Dublin for all European and African customers and Oregon for everyone else.  As a result, none of those customers were impacted by the weather event in Virginia.



So it looks like we got off easy this time.  When you see future reports about cloud outages, there's unlikely to be any impact on Flex or your systems unless the outage is in Oregon or Dublin.  We'll also be moving Pacific Rim customers to Amazon's new data center in Sydney, Australia once construction of that facility is complete.

Common Sense

A lot of media reports are speculating about whether this outage and other recent ones like it might spell the end of the cloud, as if they exposed some fatal flaw in the technology.  As with anything in technology, there is no such thing as a magic bullet.  Without a fault-tolerant software architecture that properly leverages cloud technology, there's no practical difference between cloud hosting and traditional server co-location.  The cloud is just a room full of servers, like any other data center, and it is subject to the same risks.

Having a truly fault-tolerant architecture means building in some measure of geographic redundancy, such that a total failure in one physical location will not bring the whole system down.  At Flex, we're working on a new high-availability architecture intended to make Flex more fault-tolerant, even with respect to catastrophic events like the one that happened this weekend.  For the time being, we are still subject to single-point-of-failure issues, especially if a whole Amazon availability zone goes down.  We could use load balancers today to split traffic across multiple availability zones, which would make our system resistant to the loss of a single zone, but not without adding significant cost.
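For readers curious what splitting load across availability zones looks like in principle, here is a minimal Python sketch.  It is only an illustration of the idea, not how Flex or Amazon's load balancers are actually implemented.  The class names, server names, and zone labels are all hypothetical, and a real load balancer would detect failures through health checks rather than being told a zone is down.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """One application server. The name and zone are illustrative labels only."""
    name: str
    zone: str
    healthy: bool = True


class ZoneAwareBalancer:
    """Round-robin over backends, skipping any that are marked unhealthy."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self.backends = list(backends)
        self._next = 0  # index of the backend to try first on the next pick

    def mark_zone_down(self, zone):
        """Simulate losing an entire availability zone."""
        for backend in self.backends:
            if backend.zone == zone:
                backend.healthy = False

    def pick(self):
        """Return the next healthy backend, trying each backend at most once."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if backend.healthy:
                return backend
        raise RuntimeError("no healthy backends in any zone")


# Hypothetical setup: one server in each of two zones in the same region.
balancer = ZoneAwareBalancer([
    Backend("app-1", "zone-a"),
    Backend("app-2", "zone-b"),
])
balancer.mark_zone_down("zone-a")
served_by = [balancer.pick().name for _ in range(3)]  # all traffic now goes to app-2
```

The sketch also shows where the extra cost comes from: surviving the loss of one zone requires at least one additional server running in a second zone at all times.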

We do have tools and procedures that can facilitate rapid recovery, but there would still be downtime for some customers if we lost a whole availability zone in Oregon, for example.  Over the coming months, you may hear some talk about Flex Version 5.0, or Flex Alto as we've started calling it.  This project is aimed at changing the way Flex uses computing resources like CPU and memory in order to make Flex faster, more reliable, and more fault-tolerant without increasing our prices to offset extra hardware costs.

So, in short, this particular outage didn't really impact us, but the next one could, and the challenge for us is to make our technology resilient enough to withstand a major outage without raising prices.  In the meantime, if anyone wants to move to a fault-tolerant architecture before Flex Alto is released in 2013, contact Chris for pricing on a dedicated load-balanced cluster.

