Re: xfs data loss

To: Linux XFS <xfs@xxxxxxxxxxx>
Subject: Re: xfs data loss
From: pg_xf2@xxxxxxxxxxxxxxxxxx (Peter Grandi)
Date: Sun, 6 Sep 2009 21:00:07 +0000
In-reply-to: <B9A7B002C7FAFC469D4229539E909760308D8CAB6D@xxxxxxxxxxxxxxxxxxxxxxxxxxx>
References: <B9A7B002C7FAFC469D4229539E909760308D8CAB6D@xxxxxxxxxxxxxxxxxxxxxxxxxxx>
[ ... ]

>> The original 20 devices or did you put in 2 new blank hard
>> drives? I feel like 2 blank drives went in, but then
>> later I read that all [original] 20 drives could be read for
>> a few MB at the beginning.

> No. No blank drives went in. And I always used the original 20
> devices.

That may be very good news (or not if some are partially
damaged).

[ ... ]

> I therefore suspect that the "broken devices" indication,
> since it is repeatedly found in the last weeks, and always for
> different devices/filesystems, has to do with the RAID
> controller, and not with a specific device failure.

But a broken RAID host adapter can write random stuff to
some/most disks and can continue to do so, unless it only had
a temporary failure. But who knows?

>> * Somehow 'xfs_repair' managed to rebuild the metadata of
>> '/dev/md5' despite a loss of 5-6% of it, so it looks
>> "consistent" as far as XFS is concerned, but up to 5-6% of
>> each file is essentially random, and it is very difficult to
>> know where the random parts are.

> I don't see any element to support this - at present.

Well, the only thing known for sure at this point is that an
event happened that physically damaged some parts of the system.
The damage includes some of the 48 drives, which died, and there
was huge data loss *apparently* without cause: in the arrays
where data loss happened all drives are at least partially
working, but some have been failing afterwards, and in any case
the arrays would not resync.

Given this background, I would not assume *anything* really
works unless it is proven to work with fairly challenging
testing.

Thus the repeated advice to do a thorough read check of all
drives. I would also check the error log of all drives with
'smartctl -l error' but if there was an electric shock the drive
might not have been able to log anything.
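
As a rough illustration of what a read check involves, here is a
minimal Python sketch. It is an assumption about how one might do
it, not a tool anyone in this thread used. The function name
'read_check' and the 1 MiB block size are made up for the example.
It reads a device or file from start to end and records the offset
of every block the kernel refuses to read.

```python
import os

def read_check(path, block_size=1024 * 1024):
    """Read `path` start to end; return (bytes_read, bad_offsets).

    bad_offsets lists the starting offset of every block whose read
    raised an I/O error. Reads go through the page cache, so recently
    accessed blocks may be served from memory rather than the platter.
    """
    bad_offsets = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for regular files and,
        # on Linux, for block devices as well.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError:
                # Unreadable block: note it and skip past it.
                bad_offsets.append(offset)
                offset += want
                continue
            if not data:
                break  # unexpected end of device or file
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets
```

Run as root against each member drive in turn, e.g.
read_check("/dev/sdX") with the real device name substituted, and
compare the bad offsets across drives. A drive that reads cleanly
end to end is not thereby proven good, but one that does not is
certainly suspect.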