The perils of 'home office' RAID 5

I’ve had a little standalone RAID device for several years now. It’s a Netgear ReadyNAS NV+ and it works quite nicely. I’m sure I could have something better, but I haven’t yet upgraded it apart from adding some memory and changing drives every now and then. I have a second ReadyNAS at my dad’s office and my office backs up critical data to that via rsync and a VPN; the one at his end does the same, though most data flows from me to him. I have a third ReadyNAS that I use as a Subversion server. I only have that spare NAS because the power supply failed on my main NAS, and the quickest way to get up and running again was to buy another empty chassis and move the drives across whilst I waited for the power supply to be replaced under warranty. All of this has worked well for several years and the only issues are when bits fail. So far I’ve lost no data and that’s the main point of the whole set-up; it’s far better than occasionally remembering to back up to an external drive and then occasionally remembering to put one of the drives in my car as an off-site backup.

The RAID used on the ReadyNAS NV+ is the proprietary “X-RAID”. This allows you to expand the RAID volume by replacing the disks, one at a time, with larger disks. Once the fourth disk is replaced the volume is expanded to take advantage of the new capacity, and all of this happens without downtime on the NAS. It’s nice and I’ve used the expansion once to good effect. The problem is that underneath it’s RAID-5, and RAID-5 is dangerous.
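To see why the extra space only appears once the last disk has been swapped, here is a minimal sketch of the usual single-parity capacity rule: usable space is (number of disks minus one) times the smallest disk. It assumes X-RAID behaves like plain RAID-5 in this respect; the function name and the 1 TB and 2 TB disk sizes are made up for illustration and are not the sizes in my NAS.

```python
def raid5_usable_tb(disk_sizes_tb):
    """Usable capacity of a single-parity array: (n - 1) x smallest disk."""
    if len(disk_sizes_tb) < 3:
        raise ValueError("RAID-5 needs at least three disks")
    return (len(disk_sizes_tb) - 1) * min(disk_sizes_tb)


# Hypothetical upgrade from four 1 TB disks to four 2 TB disks, one at a time.
stages = [
    [1, 1, 1, 1],
    [2, 1, 1, 1],
    [2, 2, 1, 1],
    [2, 2, 2, 1],
    [2, 2, 2, 2],
]
for disks in stages:
    print(disks, "->", raid5_usable_tb(disks), "TB usable")
```

The smallest disk caps every stage, so the usable figure stays put until the final small disk has gone.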

There are plenty of places on the internet that explain all about how RAID-5 works and why it’s flawed. In summary, as soon as you lose a disk you’re unprotected from future failures, and since your data is now spread over at least three drives, any one of which can fail, your risk of failure is roughly three times higher than if you had all your data on a single disk… What’s more, when you replace the dead disk with a new disk the volume rebuild has to read every sector on the surviving disks, so it’s quite likely to find any bad sectors that are lurking but haven’t yet caused problems. Again see the internet for more detailed descriptions of what can fail and why. The bottom line is, if you lose a second disk during the rebuild then you’ve lost the volume. This may sound unlikely, and it is, but it’s possible; the way the rebuilds work makes it more likely, and increasing disk capacities make it more likely still. Again there are plenty of references available, here’s a good one.
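A rough back-of-the-envelope version of both claims, as a sketch rather than a proper reliability model. It assumes disks fail independently and that unrecoverable read errors are independent per bit. The 1 in 10^14 bits error rate is the sort of figure often quoted on consumer drive spec sheets and is an assumption here, as are the disk sizes and the 1% failure chance; the function names are my own.

```python
import math


def p_any_disk_fails(p_single, n_disks):
    """Chance that at least one of n independent disks fails."""
    return 1.0 - (1.0 - p_single) ** n_disks


def p_read_error_during_rebuild(disk_tb, surviving_disks, ure_per_bit=1e-14):
    """Chance of at least one unrecoverable read error while reading
    every bit of every surviving disk, as a rebuild must."""
    bits_read = disk_tb * 1e12 * 8 * surviving_disks
    # 1 - (1 - p)^bits, computed in a numerically stable way.
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


# Assumed 1% chance of a single disk dying in some period: three disks
# come out at a little under 3%.
print(f"any of 3 disks: {p_any_disk_fails(0.01, 3):.4f}")

# Hypothetical disk sizes, three surviving disks in a four-disk array.
for size_tb in (0.5, 1, 2, 4):
    p = p_read_error_during_rebuild(size_tb, 3)
    print(f"{size_tb} TB disks: {p:.0%} chance of a read error during rebuild")
```

Under these assumptions the rebuild risk climbs quickly with disk size, which is the point about increasing capacities. Real drives don’t fail this neatly, so treat the numbers as an illustration only.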

You could go to RAID-6, if your hardware supports it, but RAID-6 only really defers the problem if you’re using really large disks… Perhaps you could go to a RAID-1 array of RAID-5 (that is, mirroring your RAID-5 array) or you could go for a RAID-5 array of RAID-1 (RAID-5’ing your mirrored arrays), but to do either of those you probably need to build your own NAS, and I’ve no idea what price point we’re looking at for RAID controllers that would support that kind of thing… You could simply go back to doing regular backups, but how regular is regular enough?

Given that I have a “spare” NAS, I’m going for manual mirroring (regular backups!), using rsync, of the RAID-5 NAS to the second RAID-5 NAS. As well as doing the remote backup to the off-site NAS we’ll do a local backup to what will become a “hot swap” NAS. The rsync jobs can run once per day, and as soon as a drive fails you can run the rsync jobs one more time, swap the IP addresses on the two NAS devices and then replace the broken disk and rebuild the array… I have a way to go until I get to this point and it does seem somewhat over the top but…
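The daily mirror job could be as simple as the sketch below, which wraps rsync from Python. It isn’t what actually runs on the ReadyNAS; the share path and the host name `nas-standby` are placeholders, and it assumes rsync is installed and the standby NAS is reachable over SSH or an rsync share.

```python
import subprocess


def build_rsync_command(source, destination, dry_run=False):
    """Build an rsync command that makes destination an exact copy of source."""
    # --archive keeps permissions and timestamps; --delete removes files
    # from the mirror that no longer exist on the source.
    cmd = ["rsync", "--archive", "--delete"]
    if dry_run:
        cmd.append("--dry-run")  # report what would change, copy nothing
    cmd += [source, destination]
    return cmd


def mirror(source, destination, dry_run=False):
    """Run the mirror; raises CalledProcessError if rsync reports a failure."""
    subprocess.run(build_rsync_command(source, destination, dry_run), check=True)


# Placeholder paths: the trailing slash on the source means "the contents of".
print(" ".join(build_rsync_command("/data/share/", "nas-standby:/data/share/")))
```

Because `--delete` makes the copy an exact mirror, a file deleted by mistake on the main NAS vanishes from the standby at the next run, so this complements the off-site backup rather than replacing it. Trying it with `dry_run=True` first seems prudent.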

The reason that I’m looking at this is that I’ve just had a drive fail on my NAS. It had been slowly accumulating reallocated sectors on one drive for several weeks but I hadn’t got around to getting a replacement drive. Then it started to accumulate reallocated sectors on a second drive, and then the first disk died. I’m currently backing up as much as I can to the (smaller) second local NAS and hoping that, because of my paranoid approach to fixing the problem, the rebuild completes without problems once I plug the new drive in… I’d like to get to the end of this with only time lost… Of course the backup itself could cause a second disk to fail, especially as I’m touching stuff that doesn’t get touched that often, so the chance of a read failure due to a bad sector is likely higher than when my normal off-site backups run, though hopefully not as high as when the volume is rebuilt…

Fingers crossed.