In the last year or so, we’ve custom-built a new approach to redundancy for our entire Web hosting infrastructure. The idea is that we can hit the system with any type of failure or disaster and every one of our sites will keep humming along like nothing happened.
During that time, we completely overhauled our hosting infrastructure to provide greater performance, security, and uptime to our clients. First, our design and configuration were carefully planned and implemented to include automatic redundancy from the get-go. We spent months selecting components and working with hardware and software vendors to find the right combination of parts. Then we tested in our pre-production lab. Once everything was finally in place this past spring, we repeated the tests monthly by gracefully failing the systems.
Everything worked like a charm.
But we wanted to push it even further and simulate real failures and disasters, to make sure the entire system worked as planned even in worst-case scenarios. So our team created a brutal series of tests for the new configuration.
And by brutal, I mean pulling plugs out of critical machines. We literally pulled the plug on our core switches. The secondary switches took over, as planned. We unplugged our primary Internet connections. The secondary connections took over the traffic within a few seconds. We pulled the Ethernet cables from our active database server and within 30 seconds our secondary SQL server had taken control and our sites were still online.
For the next hour we moved down our list of eleven tests, unplugging various cables and flicking power switches off. When we were done, the sites were still up, as designed. But it wasn’t perfect. Each failover came with a few seconds to a few minutes of transition time as the secondary systems took over.
After the high-fives and chest bumps were exchanged, we identified a few areas that worked as designed and failed over as planned, but could still be improved. Those few seconds to a few minutes of transition time are one area of focus. We’d like the failovers, even in a disaster scenario, to happen with no detectable transition time at all.
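For anyone who wants to put a number on "detectable transition time" during a test like this, here is a minimal sketch of one way to do it. It is an illustration only, not the tooling used in the tests above: the function names (`probe`, `collect`, `longest_outage`) and the one-second polling interval are assumptions made for the example. The idea is to poll a site at a fixed interval while the plug is pulled, then report the longest stretch during which the probes failed.

```python
import time
import urllib.request
from typing import Callable, List, Tuple

# One probe result: (seconds since the run started, did the probe succeed?)
Sample = Tuple[float, bool]


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers successfully within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except Exception:
        # Any error (refused connection, timeout, HTTP error) counts as "down".
        return False


def collect(check: Callable[[], bool], duration: float,
            interval: float = 1.0) -> List[Sample]:
    """Call check() roughly every `interval` seconds for `duration` seconds."""
    start = time.monotonic()
    samples: List[Sample] = []
    while True:
        elapsed = time.monotonic() - start
        if elapsed >= duration:
            break
        samples.append((elapsed, check()))
        time.sleep(interval)
    return samples


def longest_outage(samples: List[Sample]) -> float:
    """Longest span from a first failed probe to the next successful one.

    If the run ends while still down, the outage is measured up to the
    last sample, so the result is a lower bound in that case.
    """
    longest = 0.0
    down_since = None
    for t, ok in samples:
        if not ok and down_since is None:
            down_since = t  # outage begins
        elif ok and down_since is not None:
            longest = max(longest, t - down_since)  # outage ends
            down_since = None
    if down_since is not None and samples:
        longest = max(longest, samples[-1][0] - down_since)
    return longest
```

A run might look like `longest_outage(collect(lambda: probe("https://example.com/"), duration=120))`, started just before a cable is pulled. The result is only as precise as the polling interval, so a shorter interval is needed to catch failovers that complete in under a second.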
So next quarter, when we run our tests again, we have an even more brutal set of tests planned.