
Summary Bullets:
- Increasing automation in the data center can be one of the best ways to reduce errors in a dynamic production environment.
- Automation can also be a source of problems at a much greater scale, because a single error can affect many processes within a large and complex environment.
It’s highly unlikely that American sociologist Robert Merton was thinking about cloud computing when he proposed his “Law of Unintended Consequences” in 1936, but it seems particularly apt in light of Microsoft’s revelations regarding the major Azure cloud storage outage of November 2014. Just this week, Microsoft released its root cause analysis, which pointed to simple human error as the cause of the 11-hour storage outage that also took down associated VMs, some of which took more than a day to come back online. Now, I’m not here to pile on Microsoft; its response in fixing such a massive system crash can’t really be faulted. What interests me is how vulnerable our complex and automated systems can still be after years of automation designed to remove human error from the equation.
I’ve actually been debating this very issue over the last few months with my fellow nerd and evil twin, Steve, and we’ve been positing that – even though automation and orchestration can remove many of the opportunities for human error – they can also increase the potential for large-scale errors across our increasingly integrated systems. It’s a matter of scale in two different ways: automation substantially reduces the number of opportunities for us to make an error, but it also means that the effect of a single error can be magnified a thousand-fold, as the Azure outage demonstrated. I would hazard a guess that every single one of us in the tech industry has had that “Oh, no!” moment – the one where we answered “Yes” to a warning prompt and hit Enter, the single keystroke that cost us a day, a week, or a month of work. You know the one…
So, in spite of our best-laid plans, there’s still room for human error in the system. In its post-mortem, Microsoft showed due diligence in the procedures it has in place to prevent this type of problem from occurring, but the incident also proved that operational protocols are only as strong as the team executing them. Evil Steve and I concur that, regardless of the degree of automation, humans will remain an extremely necessary part of the equation for the foreseeable future: we provide the inspiration, and the machines provide the perspiration. But I also believe that, as the ramifications of our decisions reach further and further beyond our range of vision, our systems will need to evolve to better protect us from our own myopia. Truly self-healing systems are still a ways off, so perhaps for now the best we can do is stay vigilant, prepare for the worst, and maybe add ANOTHER warning prompt that asks, “Are you SURE you want to do this?” And to that engineer from Microsoft: stay strong, my brother.
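For what it’s worth, here’s a minimal sketch of what that extra “Are you SURE?” guard might look like in practice – not Microsoft’s tooling or any real deployment system, just a hypothetical illustration of gating a wide-blast-radius automation step behind a typed confirmation so a reflexive “y” can’t take out production:

```python
# Hypothetical confirmation guard for a destructive automation step.
# The function names and the scope string are illustrative, not from any real tool.
import sys


def confirm_destructive_action(description: str, scope: str) -> bool:
    """Require the operator to retype the affected scope, then confirm again."""
    print(f"WARNING: about to {description} on: {scope}")
    typed = input("Type the scope name exactly to confirm, or anything else to abort: ")
    if typed != scope:
        print("Confirmation did not match. Aborting.")
        return False
    second = input("Are you SURE you want to do this? (yes/no): ")
    return second.strip().lower() == "yes"


if __name__ == "__main__":
    # Example: pushing a config change to every storage frontend at once.
    if not confirm_destructive_action("apply configuration change", "all-storage-frontends"):
        sys.exit(1)
    print("Proceeding with the change...")
```

It won’t stop a determined operator on autopilot, but making the confirmation require deliberate typing rather than a single keystroke at least buys a moment of reflection before the thousand-fold mistake.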