About a month ago I was asked to participate in a discussion with my good friend and co-worker, Josh Neland (@joshneland), about different methods for building redundant or high-uptime computing applications. My understanding was that Josh would discuss redundancy in general and I would cover enterprise-class hardware redundancy, which makes sense given our fields of expertise. What I didn’t expect was how different my answer would be from Josh’s.
To Josh, who has a storied background in software development, the way to make redundant applications is to write better code that anticipates and recovers from failures of any kind by being connectionless, or at least as stateless as possible. If multiple instances of a host service run together, a client can simply move from host to host when the original host fails to respond. He went into a deep discussion of the mechanics of such an application, to a level I had never really considered. Ultimately, I was left with the feeling that hardware failures would never be an issue if the software were written to handle failures properly.
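To make Josh’s point concrete, here is a minimal sketch of that client-side failover idea. Everything here is illustrative and invented for this post: the host names, the `HostDown` exception, and the `send_request` stand-in are placeholders, not any real service’s API.

```python
# Hypothetical host list; because the service is stateless, any host can
# serve any request, so the client just walks the list until one answers.
HOSTS = ["app1.example.com", "app2.example.com", "app3.example.com"]

class HostDown(Exception):
    """Raised when a host fails to respond."""

def send_request(host, payload, fail_hosts=()):
    # Stand-in for a real network call; hosts listed in fail_hosts
    # simulate an outage so the failover path can be exercised.
    if host in fail_hosts:
        raise HostDown(host)
    return f"{host} handled {payload}"

def request_with_failover(payload, hosts=HOSTS, fail_hosts=()):
    """Try each host in turn, moving on whenever one fails to respond."""
    last_error = None
    for host in hosts:
        try:
            return send_request(host, payload, fail_hosts)
        except HostDown as err:
            last_error = err  # that host is down; try the next one
    raise RuntimeError(f"all hosts failed; last error: {last_error}")
```

With this shape, a single host failure is invisible to the caller: `request_with_failover("GET /status", fail_hosts={"app1.example.com"})` simply comes back from `app2.example.com` instead.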
Then it was my turn to speak about hardware redundancy, where anticipated hardware failures are managed through component redundancies such as RAID arrays for hard drives and parallel redundant power supplies. As I covered the approaches enterprise hardware engineers use to design gear that handles expected and unexpected hardware failures, I was struck by how much effort and expense goes into designing hardware to run software that cannot handle failures gracefully. Basically, highly redundant hardware is designed to keep software processes running at all costs in order to maintain availability of the service or data.
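The RAID example is worth unpacking, because the core trick is tiny: a RAID 5 stripe stores an XOR parity block alongside the data blocks, so the contents of any single failed drive can be recomputed from the survivors. A minimal sketch of that idea (illustrative only, not tied to any real controller or driver):

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity(blocks):
    """Compute the XOR parity block for a stripe of data blocks."""
    p = bytes(len(blocks[0]))  # start with all zeros
    for block in blocks:
        p = xor_blocks(p, block)
    return p

def rebuild(surviving_blocks, parity_block):
    """Reconstruct the single lost data block from survivors + parity.

    XOR is its own inverse, so XOR-ing everything that remains
    yields exactly the missing block.
    """
    return parity(list(surviving_blocks) + [parity_block])
```

If a stripe holds blocks `d0, d1, d2` plus `p = d0 ^ d1 ^ d2` and the drive holding `d1` dies, then `rebuild([d0, d2], p)` returns `d1`: the hardware keeps the data available while the software above it never notices the failure.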
By the time I had covered hardware redundancy to the point of running parallel processes on parallel hardware simultaneously for 100% fault tolerance and availability, I realized that if hardware and software engineers developed a solution jointly, combining both approaches, the result would be a staggeringly reliable system, with reliability inherent in the entire design. I imagine the effort made up front in development could reduce the work spent forcing average code to be redundant via costly hardware designs, and vice versa.
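That parallel-processes-on-parallel-hardware pattern can also be sketched in software terms: submit the same workload to every replica at once and accept whichever result comes back first, so one replica failing costs nothing. This is a toy sketch under stated assumptions; the node names are made up, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def compute(task, node):
    # Stand-in for running the same workload on a separate machine;
    # a real deployment would dispatch over the network to `node`.
    return sum(task)

def run_replicated(task, nodes=("node-a", "node-b")):
    """Run the same workload on every node at once and return the first
    result that arrives, ignoring replicas that fail."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(compute, task, node) for node in nodes]
        for fut in as_completed(futures):
            try:
                return fut.result()
            except Exception:
                continue  # that replica failed; wait for another
    raise RuntimeError("all replicas failed")
```

The joint hardware/software design question is exactly where to draw this line: duplicate at the component level, at the process level, or both.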
In my experience with hundreds of companies developing computing solutions, the general trend is that the software team works almost entirely independently of the hardware team. In fact, I have been in meetings where the hardware team didn’t even know whom to speak with to answer my simple software-related questions. If solution owners merged the efforts of their software and hardware engineers, the results could be extremely effective.
What do you think about creating more synergies between hardware and software teams? How do you ensure uptime in your solutions?