The Docs team pushed a change that was "designed to improve real time collaboration within the document list." That sounds like a good idea. Unfortunately, it exposed a big old memory-management bug that the team couldn't detect until the change met the full force of Google Docs traffic. Basically, the machines that check for updates weren't clearing their memory properly, so they filled up and crashed, shifting the load onto other machines, which crashed in turn, and away we go. The team caught the problem within half an hour. It's worth reading the blog post to see exactly how.
Isn't that refreshing? Remember Amazon's explanation for its Web Services outage in April? Me neither. It was the epitome of tl;dr, which was terribly disappointing, seeing as I was managing editor of an AWS-hosted site at the time. I can only imagine how the average mourning Reddit reader must have felt.
Downtime is the bugaboo, the monster under our bed at the dawn of the cloud era. No service is 100% reliable, but cloud services are becoming ever more vital to keeping our businesses running and our sites up. How a cloud provider handles an outage is crucial to keeping its customers happy and earning their forgiveness. But since outages usually require detailed technical explanations, those explanations are often left to engineers whose tone might not be as gentle or apologetic as it could be. When Amazon's EBS service went down in April, taking some of the Web's most important sites with it, the explanation was long-winded and dense, and the fallout was handled poorly. Warren's post today couldn't be more different.
How have cloud outages affected you? Tell us your stories in the comments.