April 23rd, 2011
Understanding the recent Amazon Cloud outage.
The PawPrint.net web site is finally back online after a 3-day outage due to the recent, massive failure of the Amazon Cloud data center in Virginia. Many of our customers and colleagues were also affected by what some are calling the "Great Internet Fail of 2011". This article is geared toward those who are still trying to understand what happened and how to prevent it.
The first and most important thing to remember is that any computer system can fail - there is simply no way to prevent that. Some failures are small, where only one computer or a few suffer downtime. Others, like this event, are much larger. The failure at Amazon's data center on April 21st, 2011 took down many high-profile services such as Reddit, Quora, FourSquare, Hootsuite, parts of the New York Times, and ProPublica; corporations with experienced IT staff, multiple redundant servers, and massive infrastructure and technology budgets. It also took down many thousands of smaller web sites and services. The specific details of what caused the outage have not yet been conclusively outlined by Amazon.
To give a rough sense of scale: if a single computer failing is like blowing a circuit breaker in your home, then this was like an entire hydro sub-station going down. Similarly, when a single breaker blows you can do something about it yourself, but when the whole sub-station fails you simply have to wait for hydro to solve the problem; even bringing in an electrician isn't going to help.
During this 'waiting game' PawPrint took it upon ourselves to try to ensure our customers were kept well informed about what was going on. We did this via some of the few remaining means available - our phone system, Facebook, and Twitter. PawPrint is not a web hosting company, so there never was anything we could do to fix the problem during this outage - but we wanted to do our best to help. Dark French Host, our web hosting partner, was similarly hampered by having to wait until Amazon fixed the data center issues before they could even gain access to their servers.
Some articles online suggest that using a non-cloud-based server is better and this outage is an example of why. That is, to be blunt, poppycock. When an entire data center goes down (something that is equally likely in cloud computing or traditional data centers) the majority of systems in that location become unavailable and restoration times are long. In our 17 years in this industry we have seen plenty of major failures such as this - some that lasted much longer (14 days in one instance where a power distribution system blew and several servers were completely destroyed with all data lost). Overall the cloud has been very reliable and PawPrint intends to continue using it. The recovery in this case took a long time, but no data was lost and that alone is something to be very happy about, especially since many of our smaller customers do not spend the extra money for backup services.
The only way to safeguard yourself from such "virtual disasters" is through redundancy and backup. This is how large sites cope with the inevitable failures. They spread their data across multiple servers in different locations, and if one goes down they have teams of IT personnel ready to deal with the failure. This works well - but it is very expensive and not something most small businesses or even medium-sized organizations can afford. Many web sites pay for only a 'shared hosting' service (typically around $10/month), whereas for something providing true fail-over and redundancy the costs would be over 10 times this amount at a minimum.
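For readers who like to see the idea spelled out, here is a minimal sketch of the fail-over principle described above. It is only an illustration and not how PawPrint, Dark French Host, or Amazon actually implement it. The server names and the pick_server function are hypothetical, and the health check is simulated rather than a real network test.

```python
from typing import Callable, Optional, Sequence


def pick_server(
    servers: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first server, in priority order, that passes its health check."""
    for server in servers:
        if is_healthy(server):
            return server
    return None  # every location is down; nothing left to fail over to


# Hypothetical servers in two different locations, listed in priority order.
servers = ["primary.virginia.example", "backup.oregon.example"]

# Simulate the primary data center being unavailable.
down = {"primary.virginia.example"}

active = pick_server(servers, lambda s: s not in down)
print(active)
```

The key point is that the backup must live in a different location from the primary. A second server in the same data center would have gone down along with the first.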
That said, PawPrint has already begun to work closely with Dark French Host to look into a way to offer a fail-over option for XDe customers. This would allow web sites and emails to remain in limited operation during an extended outage such as this one, and would build on our existing offsite backup technology for a fraction of the cost of a true dedicated fail-over system.
The past 3 days have been frustrating for everyone and exhausting for those attempting to ensure that the server was back up and running the instant it was possible for it to come back. We will provide more information regarding what happened as it becomes available. We will also try to assist with integrated fail-over solutions. Of course, we will be happy to answer any questions our customers might have about what they can do to better prepare for such events.
For the moment we are just glad that everyone is finally back online, and we look forward to rolling out a new update to the content manager that we had just finished on April 20th and that has been sitting in our release queue ever since.