Warning: Parts of this document are obsolete for larger disks. A short report on a failure.

I recently had a RAID5 array fail, and learned something about backup: it's not just about the data, but also the recovery time. It took several hours to bring the system back to a usable state and several days of work to reach relative safety, and that required bringing someone in to help migrate files. Afterward, we decided on a two-server setup built around incremental disk images, which should make recovery within an hour feasible.

A RAID array with 700 gigabytes of data takes hours, even days, to back up. It takes even longer to restore, because writes are slower than reads.

Exchange recovery proceeds incredibly slowly. A seemingly small 30 gigabyte database took what seemed like half a day to recover.

These two facts can put you in a situation where you have all your data on backups, maybe even multiple backups, but recovering from a failure still takes a very long time, forcing the entire business offline for a day or longer and costing hundreds of dollars per hour (or more) until the system is fixed. That doesn't even include the real value of the work, which (as any leftist would tell you) is greater than the cost of doing the work.

This is an unacceptable situation. Ultimately, it's good value to spend a few thousand dollars on a redundant system on-site. Buy enough capacity for two servers, use them both all the time, and when one fails, move all the work onto the surviving server for a few weeks (until the replacement is delivered and configured).

(If you need to convince your boss to get redundant servers, print this article out.)

RAID nightmares

Large disks are statistically more likely to suffer read errors. Today, all disks ship with defects and simply map them out, so they need to be scanned continually for the disk to find and fix these errors.

A RAID5 array failure can be difficult to fix. When a disk fails, you can replace it, but if you haven't been running the controller's background consistency check for months, the rebuild will probably not succeed: partway through the recovery, another disk hits a latent read error and the entire array goes offline.
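On a hardware controller that consistency check lives in the vendor's management tool, so there's nothing generic to show. On Linux software RAID (md), the equivalent scrub can be scheduled from cron. A minimal sketch, assuming an array named md0 (adjust for your system, and run as root):

    # Minimal sketch: trigger a consistency scrub on a Linux md array.
    # The array name (md0) is an assumption; hardware controllers expose
    # the equivalent "background consistency check" or "patrol read" in
    # their own tools.
    from pathlib import Path

    ARRAY = "md0"  # assumed array name
    action = Path(f"/sys/block/{ARRAY}/md/sync_action")

    # Writing "check" makes the md driver read every sector and verify
    # parity, so latent read errors surface while the array still has
    # the redundancy to repair them.
    action.write_text("check\n")
    print(f"Scrub started on {ARRAY}; watch progress in /proc/mdstat")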

It's better not to replace the failed disk. Instead, force the entire array back online, perform a file-level backup, and restore to a fresh disk. Don't run a consistency check, because that will cause the RAID controller to take the array offline when it hits the error. A file-level backup seems to be more tolerant of errors, or maybe the sectors with errors are just less likely to be read.
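Here is a minimal sketch of that kind of error-tolerant, file-level rescue copy in Python. The paths are assumptions; the point is that a bad sector only costs you the file sitting on it, and the failure gets logged instead of killing the whole job:

    # Minimal sketch: copy whatever can be read off the forced-online
    # array, and log read failures instead of aborting. Paths are
    # assumptions.
    import os
    import shutil

    SOURCE = r"D:\data"       # the zombie array (assumed path)
    DEST = r"E:\rescued"      # the fresh disk (assumed path)
    FAILED = r"E:\rescued-failures.txt"

    with open(FAILED, "w", encoding="utf-8") as log:
        for root, dirs, files in os.walk(SOURCE):
            rel = os.path.relpath(root, SOURCE)
            target_dir = os.path.join(DEST, rel)
            os.makedirs(target_dir, exist_ok=True)
            for name in files:
                src = os.path.join(root, name)
                try:
                    shutil.copy2(src, os.path.join(target_dir, name))
                except OSError as exc:
                    # A bad sector shows up as an I/O error here; note it
                    # and keep going.
                    log.write(f"{src}\t{exc}\n")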

Forcing the array online lets the business continue operating. Just be aware that the array is damaged and all the data needs to be migrated off it. It's a zombie array, undead, and no new data should go on it.

Install a fresh disk, migrate all the active data to it, and move users over to that disk. This won't take much time, because the active data set is typically small; a few hours covers most scenarios. It won't be so easy with older server-dependent software, but with newer software that has a cleaner separation between client and server, it's straightforward. Set up a frequent backup for this data.
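For that frequent backup, a scheduled mirror is usually enough. A minimal sketch using robocopy (standard on Server 2008 and later; the paths and share name here are invented):

    # Minimal sketch: mirror the active data to another machine. The
    # source path and destination share are assumptions. Schedule this
    # to run hourly (e.g. with Task Scheduler); robocopy only copies
    # what has changed, so repeat runs are quick.
    import subprocess

    SOURCE = r"E:\active"
    DEST = r"\\backupserver\active-mirror"   # assumed share

    result = subprocess.run(
        ["robocopy", SOURCE, DEST, "/MIR", "/R:1", "/W:1"],
        check=False,   # robocopy exit codes 0-7 indicate success
    )
    if result.returncode >= 8:
        raise SystemExit(f"robocopy reported failures (code {result.returncode})")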

If you haven't already started a full backup, do so; the older files will be covered by that backup.

If the C: drive is on the array, you will need to image the partition and then move it to the new disk. This is tricky (we called in our consultant to do it). I'm not sure of the exact procedure, but it requires knowledge of the Windows boot sequence, and may require editing boot.ini and the registry so the system boots from the new partition and ignores the old one entirely.
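For what it's worth, boot.ini entries use ARC paths that count disks and partitions on the controller, so pointing the loader at the new disk looks roughly like the excerpt below. The rdisk/partition numbers here are made up; the right ones depend entirely on how the new disk is attached, which is exactly why this step is easy to get wrong.

    [boot loader]
    timeout=5
    default=multi(0)disk(0)rdisk(1)partition(1)\WINDOWS
    [operating systems]
    multi(0)disk(0)rdisk(1)partition(1)\WINDOWS="Windows Server (new disk)" /fastdetect
    multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server (old array)" /fastdetect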

(This isn't any easier than on Linux. The lesson I learned is that being able to manipulate, recover, or rebuild the boot sequence is a must-know skill for sysadmins. It's also hard to learn and practice, because it requires spare hardware and whatnot.)

Once the system is on stable new disks, you have to re-unify the active and old data. I used WinMerge, a great file comparison tool, to do this.
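If you'd rather script the first pass of that comparison, Python's standard filecmp module can report what differs between the rescued old tree and the active tree. A minimal sketch with assumed paths:

    # Minimal sketch: list files that differ, or exist on only one side,
    # between the rescued old data and the active data. Paths are
    # assumptions; WinMerge is still the nicer way to resolve conflicts.
    import filecmp

    OLD = r"E:\rescued\data"
    ACTIVE = r"E:\active"

    def report(cmp, prefix=""):
        for name in cmp.diff_files:
            print(f"DIFFERS   {prefix}{name}")
        for name in cmp.left_only:
            print(f"OLD ONLY  {prefix}{name}")
        for name in cmp.right_only:
            print(f"NEW ONLY  {prefix}{name}")
        for name, sub in cmp.subdirs.items():
            report(sub, prefix + name + "\\")

    report(filecmp.dircmp(OLD, ACTIVE))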

For backups, I used NTBackup - it was an old system. NTBackup has a flaw where it will simply fail to save some files, and it's very "quiet" about this - you have to read the final report to notice. I used the error report to build a file list that NTBackup could use to perform an additional backup pass. Usually, this second try saved all the missing files.
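Building that retry list is easy to script. A minimal sketch: parse the NTBackup report for warning lines, write the paths to a .bks selection file (which NTBackup expects in Unicode), and feed it back to ntbackup. The report location, its exact wording, and the regex below are assumptions; check a real report before trusting them.

    # Minimal sketch: turn NTBackup's error report into a selection list
    # for a second backup pass. The log path, the "Warning" wording, and
    # the quoted-path format are assumptions about your report; the .bks
    # file must be Unicode (UTF-16) for NTBackup to accept it.
    import re

    LOG = r"C:\logs\backup01.log"   # assumed report location
    BKS = r"C:\logs\retry.bks"

    paths = []
    with open(LOG, encoding="utf-16") as fh:   # assumed log encoding
        for line in fh:
            m = re.search(r'"([A-Za-z]:\\[^"]+)"', line)
            if "Warning" in line and m:
                paths.append(m.group(1))

    with open(BKS, "w", encoding="utf-16") as out:
        for path in paths:
            out.write(path + "\n")

    # Then run something like:
    #   ntbackup backup "@C:\logs\retry.bks" /J "Retry pass" /F "D:\retry.bkf"
    # (verify the switches against your NTBackup version)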

Restoring data onto Server 2012 and Server 2008 R2 was weird, because the newer OSes don't include NTBackup. You need to dig around to find tools that can restore from NTBackup .bkf files, but they work fine.

The newer backup tools are all centered around disk images. The built-in tools don't do incremental backups, so you need a third-party solution for those. We're going to use ShadowProtect, which our consultant sells. I don't know the exact price, but the market rate for Windows image backup with incrementals is around $1000.

For an equivalent disk-image-based backup system on Linux, you run software or hardware RAID (I prefer software) with LVM volumes and virtual partitions. LVM "snapshots" freeze the disk state so states can be compared, and the differences are copied (via rsync) to another computer holding a mirror of the partition. The main problem is that system performance is worse while a snapshot is active, so you have to work around that.
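A minimal sketch of one snapshot-and-sync cycle, assuming a volume group vg0, a logical volume named data, and a mirror reachable over ssh at backuphost (all names invented):

    # Minimal sketch: freeze the volume with an LVM snapshot, rsync the
    # differences to the mirror machine, then drop the snapshot quickly,
    # because write performance suffers while it exists. Volume names,
    # snapshot size, and destination are assumptions.
    import os
    import subprocess

    VG, LV = "vg0", "data"
    SNAP = f"{LV}-snap"
    MOUNT = "/mnt/snap"
    DEST = "backuphost:/srv/mirror/data/"   # assumed rsync-over-ssh target

    def run(*cmd):
        subprocess.run(cmd, check=True)

    os.makedirs(MOUNT, exist_ok=True)
    # 1. Freeze the current state; the snapshot only needs room for
    #    blocks that change while it exists (10G here is a guess).
    run("lvcreate", "--snapshot", "--size", "10G", "--name", SNAP, f"/dev/{VG}/{LV}")
    try:
        # 2. Mount it read-only and let rsync copy the differences to
        #    the existing mirror of the partition.
        run("mount", "-o", "ro", f"/dev/{VG}/{SNAP}", MOUNT)
        run("rsync", "-a", "--delete", f"{MOUNT}/", DEST)
    finally:
        # 3. Tear the snapshot down promptly.
        subprocess.run(["umount", MOUNT], check=False)
        subprocess.run(["lvremove", "-f", f"/dev/{VG}/{SNAP}"], check=False)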