Replacing a failing SnapRAID parity drive on my Ubuntu Home Server

About 3 weeks ago, I got an email notification from smartd that the regular SMART short self-test on one of the drives in my home server had failed. The specific failure was an unreadable sector (the current pending sector count went from 0 to 1). It is hard to predict whether this type of error is transient or a precursor to drive failure. I re-ran the check and it passed the second time, so I decided to wait and watch. Two days later, the same error was reported, but with a different sector failing to read. At this point, given that my SnapRAID setup only tolerates a single disk failure, I decided to be prudent and ordered a new drive (this one if you are curious). While waiting for it to arrive, I ran a longer self-test on the failing drive (the failing drive was this one; it had slightly over 3.5 years of use), which passed, but as before, two days later the scheduled short test failed again with yet another read failure on a different sector. The drive in question was the parity drive for my SnapRAID setup.
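For anyone who wants to run the same checks by hand, smartctl (from the same smartmontools package that provides smartd) can trigger the self-tests and show the pending sector count; /dev/sdX below is a placeholder for the suspect drive:

sudo smartctl -t short /dev/sdX   # queue a short self-test (usually a couple of minutes)
sudo smartctl -t long /dev/sdX    # queue the extended self-test (can take hours)
sudo smartctl -a /dev/sdX         # full report: check Current_Pending_Sector and the self-test log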

Once my new drive arrived, the first thing I did was copy the parity file from the failing drive to an external hard drive. Per the SnapRAID FAQ for replacing a parity drive, the process goes a lot faster if you can salvage whatever you can of your old parity data (this is not strictly necessary, just an optimization).
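Assuming the old parity drive is still readable and mounted at /mnt/parity, with the external drive at /mnt/external (both paths are just examples), a single copy does the job; the file name must match the parity line in your snapraid.conf:

# salvage the old parity file; if the drive is failing badly, GNU ddrescue is the safer tool
sudo rsync -a --progress /mnt/parity/snapraid.parity /mnt/external/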

After that, I shut down the machine, swapped the failing drive for the new one, and restarted my Ubuntu server (pro tip: if you're running the OS off a USB stick like I am, remove the stick before moving the machine around or opening it up). At this point, the new drive will not be recognized or mounted since it doesn't have a valid filesystem on it yet. The boot process will also seem a bit perilous, as the system will notice that one of the drives it is supposed to mount (via /etc/fstab) is missing. You can tell Linux that it is OK to skip mounting that drive and continue booting; one way to do that is sketched below.
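On Ubuntu, a missing /etc/fstab drive drops you into the systemd emergency shell. From there (the UUID and mount point below are placeholders):

# comment out the missing drive's line in /etc/fstab, or mark it
# non-critical by adding nofail to its options, e.g.:
#   UUID=<old-drive-uuid>  /mnt/parity  ext4  defaults,nofail  0  2
systemctl default   # then resume the normal boot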

After the boot completed, the first order of business was to partition the drive and create a filesystem. For drives > 2TB, you're best off using parted. In my case, I simply created one primary GPT partition on the drive, then used mkfs.ext4 to create a filesystem on the new partition. With the drive prepared, I used blkid to determine the UUID of the partition and updated /etc/fstab with the new UUID (an example entry follows the commands below). You can then simply run mount -a as root to mount the new drive and trigger any dependent mounts automatically. Assuming /dev/sde is the new drive that needs to be prepared, here are the steps:

# partition the new drive with a single GPT partition spanning the disk
sudo parted /dev/sde
mklabel gpt
mkpart primary ext4 0% 100%
quit
# create the filesystem and look up the new partition's UUID for /etc/fstab
sudo mkfs.ext4 /dev/sde1
sudo blkid /dev/sde1
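
With the UUID from blkid in hand, the new /etc/fstab entry looks something like this (the UUID and mount point are placeholders for your own values):

UUID=<uuid-from-blkid>  /mnt/parity  ext4  defaults  0  2

sudo mount -a   # mounts the new entry without a reboot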

The next step in the process was to copy the parity file from the external drive to the new replacement drive. After this, I simply ran snapraid fix and spot-checked a few files to make sure the data seemed OK. Finally, I ran snapraid sync manually to make sure everything was in sync.
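In command form, that restore boils down to the following (the paths are examples; the parity destination must match the parity line in snapraid.conf):

# copy the salvaged parity file onto the new drive
sudo rsync -a --progress /mnt/external/snapraid.parity /mnt/parity/
# repair anything the stale parity doesn't cover, then bring parity fully up to date
sudo snapraid fix
sudo snapraid sync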

Overall, the process of replacing the drive was extremely simple, and I was up and running within 2 hours, including the time to physically swap the drive (and clean out a year's worth of gunk inside the machine). The only wrinkle was that after snapraid fix, the Plex media server running on that machine seemed to have lost all its configured libraries. I suspect this was due to a database inconsistency introduced by the restoration (most likely the restore doesn't play well with Plex for some reason, or I screwed something up by running Plex while the restore was in progress). In any case, this was a very easy fix too: I simply copied over the last available backup of my library database (Plex makes these backups on a schedule you can specify) and things started working again.
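For anyone hitting the same Plex problem, the restore amounts to stopping Plex and copying the most recent automatic backup over the live database. On a standard Ubuntu install the databases live under the path below; the date-suffixed backup filename will vary, so treat this as a sketch:

sudo systemctl stop plexmediaserver
cd "/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Plug-in Support/Databases"
sudo cp com.plexapp.plugins.library.db com.plexapp.plugins.library.db.broken   # keep the broken copy, just in case
sudo cp com.plexapp.plugins.library.db-<backup-date> com.plexapp.plugins.library.db
sudo chown plex:plex com.plexapp.plugins.library.db
sudo systemctl start plexmediaserver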

While I'm not thrilled that I had to replace a hard drive on the home server, I am very happy with how easy and seamless the process was, and with the fact that I didn't lose any data along the way. Even more impressive was how quick the whole recovery was. SnapRAID remains an absolutely fantastic piece of work for a home server NAS environment, and based on this experience, I have no hesitation in recommending it as the first choice for protection against drive failures in a home NAS.
