On 2/25/2012 3:57 PM, Peter Grandi wrote:
>> There are always failures. But again, this is a backup system.
> Sure, but the last thing you want is for your backup system to
Putting an exclamation point on Peter's wisdom requires nothing more
than browsing the list archive:
Subject: xfs_repair of critical volume
Date: Sun, 31 Oct 2010 00:54:13 -0700
I have a large XFS filesystem (60 TB) that is composed of 5 hardware
RAID 6 volumes. One of those volumes had several drives fail in a very
short time and we lost that volume. However, four of the volumes seem
OK. We are in a worse state because our backup unit failed a week later
when four drives simultaneously went offline. So we are in a very bad state.
This saga unfolded across two XFS list threads. The lessons:
1. Don't use cheap hardware for a backup server.
2. Make sure your backup system is reliable.
3. Test restore operations regularly.
I suggest you get the dual active/active controller configuration and
use two PCIe SAS HBAs, one connected to each controller, and use SCSI
multipath. This prevents a dead HBA from leaving you dead in the water
until a replacement arrives. How long does that take, and at what cost to
operations, if
your single HBA fails during a critical restore?
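As a sketch, assuming the stock dm-multipath tools and hypothetical device
names (your controller vendor may publish its own recommended settings),
the host side needs little more than:

  # /etc/multipath.conf -- minimal active/active sketch
  defaults {
      user_friendly_names yes
      path_grouping_policy multibus   # spread I/O across both HBA paths
  }

  # after starting multipathd, confirm each LUN shows two active paths:
  #   multipath -ll

If an HBA dies, multipathd fails I/O over to the surviving path and you
keep running, degraded, instead of dead in the water.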
Get the battery-backed cache option. Verify the controllers disable the
drive write caches.
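If the controller exposes the drives individually you can double-check
from the host; device names here are hypothetical, and behind most RAID
firmware you'll have to use the vendor CLI instead:

  # SAS/SCSI drives: query, then clear, the write cache enable (WCE) bit
  sdparm --get=WCE /dev/sdb
  sdparm --clear=WCE --save /dev/sdb

  # SATA drives on a plain HBA: hdparm does the same job
  hdparm -W 0 /dev/sdb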
Others have recommended stitching 2 small arrays together with mdadm and
using a single XFS on the volume instead of one big array and one XFS.
I suggest using two XFS filesystems, one on each small array. This ensures you can
still access some of your backups in the event of a problem with one
array or one filesystem.
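Roughly, with hypothetical device names, the two layouts look like this;
only the second gives you two independent failure domains:

  # one big volume: concatenate the arrays, single XFS on top
  mdadm --create /dev/md0 --level=linear --raid-devices=2 /dev/sda /dev/sdb
  mkfs.xfs /dev/md0

  # two filesystems: one XFS per array, mounted side by side
  mkfs.xfs /dev/sda
  mkfs.xfs /dev/sdb
  mount /dev/sda /backup/a
  mount /dev/sdb /backup/b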
As others mentioned, an xfs_[check|repair] can take many hours or even
days on a multi-terabyte filesystem with a huge amount of metadata. If
you need to do a restore during that period you're out of luck. With two
filesystems, and critical images/files duplicated on each, you're still
in business.
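With the split layout above (hypothetical mount points), a repair idles
only half your backups:

  umount /backup/a
  xfs_repair /dev/sda    # may grind for hours on metadata-heavy storage
  # /backup/b stays mounted throughout, still serving restores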