From The Department of Redundancy Department

At Geonetric, we build some amazing stuff. But sometimes, we put our engineering talents to use to break things.

In the last year or so, we’ve custom-built a new approach to redundancy for our entire Web hosting infrastructure. The idea is that we can hit the system with any type of failure or disaster and every one of our sites will keep humming along like nothing happened.

During that time, we completely overhauled our hosting infrastructure to provide greater performance, security, and uptime to our clients. First, our design and configuration were carefully planned and implemented to include automatic redundancy from the get-go. We spent months selecting components and working with hardware and software vendors to find the right combination of parts. Then we tested in our pre-production lab. Once everything was finally in place this past spring, we repeated the tests monthly by failing the systems gracefully.

Everything worked like a charm.

But we wanted to push it even further: to simulate possible failures and disasters and make sure the entire system worked as planned even in worst-case scenarios. Our team created a brutal series of tests on the new configuration.

And by brutal, I mean pulling plugs out of critical machines. We literally pulled the plug on our core switches. The secondary switches took over, as planned. We unplugged our primary Internet connections. The secondary connections took over the traffic within a few seconds. We pulled the Ethernet cables from our active database server, and within 30 seconds our secondary SQL server had taken control and our sites were still online.

For the next hour, we moved down our list of eleven tests – unplugging various cables and flicking power switches off. When we were done, the sites were still up, as designed. But it wasn’t perfect. Each failover came with a few seconds to a few minutes of transition time as the secondary systems took over.

After the high-fives and chest bumps were exchanged, we identified a few areas that, while they worked as designed and failed over as planned, could be improved. Those few seconds to a few minutes are one area of focus. We’d like the failovers, even in a disaster scenario, to happen with no detectable transition time at all.
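For fellow techies who want to put a number on that transition time, here is a minimal sketch in Python. It is not the tooling we used for these tests. It assumes you already have a log of timestamped availability checks against a site, and the function name `longest_outage` and the sample data are made up for illustration.

```python
from typing import Iterable, Optional, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest gap, in seconds, between the first failed
    check of an outage and the next successful check.

    `samples` is a time-ordered sequence of (timestamp_seconds, site_was_up)
    pairs. An outage that is still open when the samples end is not counted.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, site_was_up in samples:
        if not site_was_up and outage_start is None:
            # First failed check: the transition window opens here.
            outage_start = timestamp
        elif site_was_up and outage_start is not None:
            # First successful check after a failure: the window closes.
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    return longest


# Hypothetical one-check-per-second log around a single failover.
probes = [
    (0.0, True), (1.0, True),
    (2.0, False), (3.0, False), (4.0, False),
    (5.0, True), (6.0, True),
]
print(longest_outage(probes))  # 3.0
```

The result is only as precise as the check interval, so a one-second interval cannot show whether a failover took 0.2 seconds or 0.9 seconds.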

So next quarter, when we run our tests again, we have an even more brutal set planned.

This entry was posted in Geonetric Culture, Hosting, Transparency, VitalSite by Joe Olerich.
About Joe Olerich

Joe speaks in languages only fellow techies can understand. Virtualization. LAN. Firewalls. We might not understand it all, but we know it means our network is in capable hands. As a network engineer, he is responsible for ensuring our network is optimized while keeping a watchful eye on our hosting facilities’ production systems. Before joining Geonetric, Joe spent three years with Network Integration Services, where he did everything from implementing multi-site EIGRP to troubleshooting layer 2 and 3 networks. Sounds impressive! Joe holds a bachelor of science in management information systems from Kansas State University. When he’s not providing technical support for his friends and coworkers (“Joe, my Netflix won’t work!”), he enjoys watching K-State football and Kansas City Royals baseball, and reminiscing about the best six months of his life – working at a ski resort just after college graduation.
