Instead, their registrar’s DNS name servers (the servers that translate domain names into IP addresses) were down. Although the registrar had four name servers spread across different geographic regions, all four were down. Without at least one of those servers functioning, the Internet couldn’t find the websites using that registrar’s name servers, which made it look as though the websites themselves were down.
How do four different domain name servers, in four different data centers, in different geographic regions, all go down at the same time?
The answer is that oft-used buzz-term: “The Cloud.”
In everyone’s mad rush to The Cloud in recent years, there’s been some sloppy thinking and sloppy work. Everyone likes to imagine this wonderful, ever-present, never-failing cloud. By sheer size and complexity, it is miraculously always there. At least that’s what we’re asked to think by the folks who market those solutions.
Just as in the physical world, for complex reasons, sometimes clouds are nowhere to be found.
You’ve probably been affected by these types of failures many times already: a major outage at AWS, GoDaddy, Akamai, or another large service provider takes out a large swath of layered services. Even if it was a human who entered the wrong command at a console, it’s still the complex layering of services that allows the failure to spread.
In such an event, the response from cloud infrastructure providers is often: “Your implementation isn’t complex enough.” That’s pretty much what Amazon said this past February when they had a major outage due to a failure in their S3 (storage) layer in the Northern Virginia region (US-EAST-1). Amazon passed the buck right back to its customers, in effect saying: “We already told you to use our [even more expensive and complex] geographic diversity features.” In fairness, Amazon, you did say that. But you also said that S3 was extremely robust and fault tolerant. You also said that about many other AWS services that toppled like dominoes from the S3 outage.
Own it, please.
Meanwhile, had the registrar’s DNS been on independent, stand-alone or virtual servers, and not part of a complex, layered, cloud solution, the outage wouldn’t have taken place. One server might have been affected, but someone would have noticed and fixed it, without the rest of the world being any wiser.
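To make that difference concrete, here is a minimal toy model in Python. It is only a sketch: the server names (ns1 through ns4) and the shared "cloud-storage" layer are hypothetical stand-ins, not a description of the registrar's actual setup. It assumes a name server answers only if it and everything it depends on is up, and that a domain resolves as long as at least one name server answers.

```python
# Toy model of shared-dependency failure. All names are hypothetical.

def reachable(servers, failed):
    """Return the servers that still answer, given a set of failed components.

    `servers` maps a server name to the set of components it depends on.
    """
    return [
        name
        for name, deps in servers.items()
        if name not in failed and not (deps & failed)
    ]


def domain_resolves(servers, failed):
    # A domain resolves as long as at least one name server answers.
    return len(reachable(servers, failed)) > 0


# Four name servers in four regions, all built on one shared cloud layer.
layered = {f"ns{i}": {"cloud-storage"} for i in range(1, 5)}

# Four stand-alone name servers with no dependency in common.
independent = {f"ns{i}": set() for i in range(1, 5)}

# One failure in the shared layer takes out all four servers at once.
print(domain_resolves(layered, {"cloud-storage"}))  # False

# One stand-alone server failing leaves three others still answering.
print(domain_resolves(independent, {"ns1"}))  # True
```

Geographic diversity only helps against failures that are themselves geographic. In the layered case, the four servers share a single point of failure no matter how far apart they sit.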
The Cloud, in whatever form and purpose, is not inherently bad. It’s just vulnerable in different ways than more traditional web infrastructure.
Keep that in mind when putting your eggs in that basket.