Reality Check: About Amazon's Outage & The Current State of Bime

Starting yesterday, Amazon EC2 web services encountered serious issues, resulting in what I believe is the most significant outage in the short history of cloud computing. Amazon being our hosting provider, we had our share of trouble, but we managed to keep our heads above water and minimize the downtime. I don't want to sound proud here. We had downtime, and that is unacceptable. I just want to share a bit of the story and offer an explanation (and our sincere apologies) to the customers who were impacted.

The Story

Yesterday, Amazon encountered a massive issue with the infrastructure of its data center in Virginia. Bime's infrastructure held strong because of some technical choices we made in the past. It's still early, but all the information we have so far points to EBS (Elastic Block Store, more or less the network storage attached to the servers) as the faulty piece of Amazon's infrastructure. Since we chose not to use it for the main pieces of Bime's infrastructure, we were safe. The only service that broke yesterday was this website, and as the hours flew by (more on this later) we decided to take action and move it to another region within Amazon's infrastructure.
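
For readers less familiar with the distinction, here is a minimal sketch of how you can see which parts of a stack depend on EBS, by listing your own AMIs grouped by root device type. It uses the modern boto3 SDK purely for illustration (not the tooling we actually used), and the region is an assumption.

```python
# Minimal sketch: list your own AMIs by root device type, to see which
# parts of a stack depend on EBS. Assumes boto3 and AWS credentials are
# configured; the region is illustrative only.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

for root_type in ("ebs", "instance-store"):
    images = ec2.describe_images(
        Owners=["self"],
        Filters=[{"Name": "root-device-type", "Values": [root_type]}],
    )["Images"]
    print(f"{root_type}: {len(images)} AMI(s)")
    for image in images:
        print(f"  {image['ImageId']}  {image.get('Name', '<unnamed>')}")
```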

This morning, the proxy that handles requests to services like Google Analytics, spreadsheets and databases went down. Here again, we moved the instance to another region and that solved the issue.

The Lessons

The worst can happen. The proxy, for example, was built for failure, with a load balancer in front of it, auto-scaling rules and so on. But when your instances are stuck and you can't launch new ones, none of that is much help.

Move fast. We shouldn't have waited that long to move the website. It damages your image as a SaaS provider to have a site down for several hours in a row, but we honestly thought the Amazon issue wouldn't last very long. So don't be afraid to move your instances to a new region as fast as you can. It's kind of cool to have that ability (for the cloud skeptics: you can't do that with your own infrastructure). So do it! Quickly!
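
To make "move fast" concrete, here is a minimal sketch of launching a replacement instance in a different region. It assumes you already have an image of the machine available in the target region; the AMI ID, region and instance type are placeholders, and boto3 is just an illustration rather than what we used at the time.

```python
# Minimal sketch: launch a replacement instance in another region.
# The AMI must already exist in that region; IDs and sizes are placeholders.
import boto3

TARGET_REGION = "us-west-1"       # a region unaffected by the outage
REPLACEMENT_AMI = "ami-00000000"  # pre-built image of the web server (placeholder)

ec2 = boto3.client("ec2", region_name=TARGET_REGION)

response = ec2.run_instances(
    ImageId=REPLACEMENT_AMI,
    InstanceType="m1.small",      # placeholder size
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]
print(f"Launched {instance_id} in {TARGET_REGION}; now repoint DNS at it.")
```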

Be ready. It's not fault tolerant unless you have something ready to launch in other geographic regions, or even at another provider. I guess we are not the only ones who will take a look at Rackspace in the coming days.
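
"Be ready" can be as simple as regularly checking that a standby image really exists in each backup region before you need it. The sketch below is one hypothetical way to do that; the region list and AMI IDs are placeholders, and again boto3 is only used for illustration.

```python
# Minimal sketch: verify that a standby AMI is present and available in each
# backup region, so a failover can start immediately. All IDs are placeholders.
import boto3
from botocore.exceptions import ClientError

# Hypothetical mapping of backup regions to the AMI we expect to find there.
STANDBY_AMIS = {
    "us-west-1": "ami-11111111",
    "eu-west-1": "ami-22222222",
}

for region, ami_id in STANDBY_AMIS.items():
    ec2 = boto3.client("ec2", region_name=region)
    try:
        image = ec2.describe_images(ImageIds=[ami_id])["Images"][0]
        status = "ready" if image["State"] == "available" else image["State"]
    except (ClientError, IndexError):
        status = "MISSING"
    print(f"{region}: {ami_id} -> {status}")
```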

One thing is sure: the cloud computing world is still shaking after yesterday. But I must say that even in a worst-case scenario (and this one is pretty close in my book), being on AWS still gives you ways to recover fairly quickly from a disaster. So, kudos to you, Amazon. You've had some pretty tough days lately, and even though we also shared your pain, I just want to say: keep up the good work.