Saved by an Old System (Disaster Recovery Story and Case Study)

I had an email spam filter machine (which also did other things). When we got new computers, I decided to replace it. We basically copied most of the configuration from the old system to the new one. The result wasn't identical, but it was very close.

Once the new system was up, I just left the old one sitting next to it. A few months into using the new system, it failed to reboot. At first it looked like an intermittent disk problem, but Memtest86 later revealed that the memory or memory controller was dead.

Since I had to run tests, I decided to fire up the old system. Fortunately, I hadn't recycled it. It was just left there, and, again, due to luck, it had the old configuration. It was an unintentional clone. The clone came up in a few minutes, and the email traffic was directed toward it. The mail flowed again!

A test with SpinRite (which took a few tries to boot) revealed after a couple of hours that the disk was fine. So then I ran Memtest86. Red error reports appeared immediately. When this happens, I immediately get a Sharpie out and write the error on the case. To be sure, I also wrote the error on the RAM sticks. This saves time later, when you have to clean up.

To fix the system, I reached for another computer (another Dell, actually) and put the old hard drive into it. The system came up, but required some tweaking to get the network going. No big deal.

It worked. It had also been showing a message that I should upgrade to the latest Ubuntu, so I did that instead of putting it straight back into production. I figured that running through all the config scripts would be a "good thing", as most of the hardware had changed, and it was likely that some different device drivers were now being loaded.

The lesson learned: create a work-alike of any important system. You can make the work-alike on something like an old Pentium III. It doesn't have to be a bit-for-bit clone. It's not that hard to do this with Linux, because it's all done via simple config files that can be copied.
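
As a rough illustration of the "copy the config files" idea, here is a minimal Python sketch. It is not what I actually ran; the function name stage_configs and the example paths are made up for illustration. It copies a chosen list of files and directories from one root to another, keeping the same relative layout and skipping anything that isn't there.

```python
import shutil
from pathlib import Path


def stage_configs(source_root, dest_root, relative_paths):
    """Copy selected config files/directories from source_root to dest_root.

    The relative layout is preserved. Entries that do not exist under
    source_root are skipped. Returns the list of entries actually copied.
    """
    source_root = Path(source_root)
    dest_root = Path(dest_root)
    copied = []
    for rel in relative_paths:
        src = source_root / rel
        dst = dest_root / rel
        if src.is_dir():
            # Copy a whole config directory; merge into dst if it already exists.
            shutil.copytree(src, dst, dirs_exist_ok=True)
        elif src.is_file():
            # Make sure the parent directory exists, then copy with metadata.
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
        else:
            # Not present on the source system: nothing to copy.
            continue
        copied.append(rel)
    return copied
```

A call might look like stage_configs("/mnt/old-system", "/srv/workalike-staging", ["etc/postfix", "etc/hosts"]), where both paths are hypothetical. Copying into a staging directory first, rather than straight over the new machine's /etc, leaves room to review the files, since the work-alike is meant to be close but not identical.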

It doesn't have to be a "live" failover to be extremely useful. In fact, keeping it turned off helps preserve the hardware. Not all systems need 1-second failover to be effective. In this case, a many-hour failure wasn't that bad because it happened in the evening... but even an hour of downtime during business hours is a disaster.