On Wed, Sep 22, 2010 at 09:26:53AM +0200, Ralf Gross wrote:
> we have a fileserver with the following setup:
> Debian Lenny AMD64, 2.6.32 bpo Kernel
> Infortrend RAID with BBU -> DRBD -> LVM -> XFS
> This system has been running since the beginning of August and replaced some
> older hardware.
> Last week xfs began to print some warnings to syslog. The day before a DRBD
> verify ended without showing differences between the 2 cluster nodes.
That doesn't mean there is no corruption - it means the corruption
got propagated to both nodes.
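A toy model of why the verify can't help here: online verify compares per-block checksums between the nodes, so anything written through DRBD, including garbage the filesystem wrote, lands identically on both sides and compares equal. This is not DRBD code. The block size, the SHA-1 hash and the helper names (`block_digests`, `online_verify`, `replicated_write`) are assumptions for illustration only; real DRBD uses whatever `verify-alg` is configured.

```python
import hashlib

# Illustrative values only: DRBD's real verify granularity and hash
# (the configured verify-alg) are not modelled here.
BLOCK_SIZE = 4096

def block_digests(dev, block_size=BLOCK_SIZE):
    """One digest per block of a device image (bytes or bytearray)."""
    return [hashlib.sha1(dev[off:off + block_size]).digest()
            for off in range(0, len(dev), block_size)]

def online_verify(primary, secondary, block_size=BLOCK_SIZE):
    """Indices of blocks whose digests differ between the two nodes."""
    pairs = zip(block_digests(primary, block_size),
                block_digests(secondary, block_size))
    return [i for i, (a, b) in enumerate(pairs) if a != b]

def replicated_write(nodes, offset, data):
    """Model a replicated write: every node receives exactly the same bytes."""
    for node in nodes:
        node[offset:offset + len(data)] = data

primary = bytearray(4 * BLOCK_SIZE)
secondary = bytearray(4 * BLOCK_SIZE)
# The filesystem writes garbage (e.g. a bad block pointer) into block 2.
# The replication layer faithfully copies it, so both nodes hold the same bad data.
replicated_write([primary, secondary], 2 * BLOCK_SIZE, b"\xde\xad\xbe\xef")
print(online_verify(primary, secondary))  # -> []
```

Only a divergence between the nodes (e.g. a bit flip on one disk) would show up; a logically bad write never will.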
> This doesn't seem to happen all the time; the server ran for 5 weeks without
> these messages. During that time some full backups ran which read every file
> on the fs.
Which implies that it is recent. Knowing when the directory was last
modified and what was done to it would be useful, but I know you
won't have that information....
> Any hints what to look for or what to do to notice this corruption as soon as
You won't find an error on disk without scrubbing of some kind.
In the case of filesystem metadata, you need to read all the
metadata and validity check it to find random corruptions. The best
you can do is traverse and stat every file regularly...
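A minimal sketch of such a traversal, assuming a plain walk from the mount point is acceptable. `stat_tree` is a hypothetical helper, not an existing tool. Listing each directory reads its directory blocks and `lstat()` reads each inode, so corruption the kernel detects on those paths is likely to come back as an `OSError` here instead of waiting for a user to trip over it. It only catches what XFS's own checks catch when reading the metadata.

```python
import os

def stat_tree(root):
    """Walk root (without following symlinks) and lstat every entry.

    Returns (count, errors): the number of entries stat'ed successfully and
    a list of (path, OSError) for directories or entries that failed.
    """
    errors = []
    count = 0

    def on_walk_error(exc):
        # os.walk calls this when it cannot read a directory (e.g. an I/O
        # or corruption error while reading the directory blocks).
        errors.append((exc.filename, exc))

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                os.lstat(path)  # forces the inode to be read
                count += 1
            except OSError as exc:
                errors.append((path, exc))
    return count, errors
```

Run from cron against the export (for example `stat_tree("/srv/export")`, a hypothetical path) and alert whenever the error list is non-empty.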
> Sep 13 12:30:30 VU0EM003 kernel: [2834063.439771] block drbd0: conn( Connected -> VerifyS )
> Sep 13 12:30:30 VU0EM003 kernel: [2834063.439803] block drbd0: Starting Online Verify from sector 0
> Sep 15 03:06:59 VU0EM003 kernel: [2972785.494729] block drbd0: Online verify done (total 138989 sec; paused 0 sec; 33716 K/sec)
> Sep 15 03:06:59 VU0EM003 kernel: [2972785.494794] block drbd0: conn( VerifyS -> Connected )
> Sep 16 12:18:16 VU0EM003 kernel: [3092032.035881] ffff8803e65c8000: 49 4e 00 00 02 02 00 00 00 00 14 1b 00 00 04 26 IN.............&
> Sep 16 12:18:16 VU0EM003 kernel: [3092032.035936] Filesystem "dm-2": XFS internal error xfs_da_do_buf(2) at line 2112 of file
> Caller 0xffffffffa02b0a52
So it found an inode cluster rather than a directory block. Implies
a bad block pointer. Without the repair output, there's no way of
knowing what might have been incorrect (either the directory
btree block pointers or the block contents), so there's not much
that can be guessed from this...
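To make that inference concrete: the dumped buffer starts with 49 4e, ASCII "IN", which is the XFS inode magic number, while directory blocks carry different magic numbers. This sketch parses the hex dump from the log and checks the header against a small magic-number table. The table is written from memory of the XFS v4 on-disk format and is illustrative, not exhaustive. `parse_dump_line` and `classify` are hypothetical helpers.

```python
import struct

# Assumed magic numbers (XFS v4 on-disk format); illustrative, not exhaustive.
DIR2_MAGIC_AT_0 = {  # 32-bit magic at offset 0
    0x58443242: "directory block (XFS_DIR2_BLOCK_MAGIC, 'XD2B')",
    0x58443244: "directory data block (XFS_DIR2_DATA_MAGIC, 'XD2D')",
    0x58443246: "directory free block (XFS_DIR2_FREE_MAGIC, 'XD2F')",
}
INODE_MAGIC_AT_0 = {  # 16-bit magic at offset 0
    0x494E: "inode (XFS_DINODE_MAGIC, 'IN')",
}
# da-btree blocks start with xfs_da_blkinfo: forw (be32), back (be32), magic (be16)
DA_MAGIC_AT_8 = {
    0xFEBE: "da-btree node (XFS_DA_NODE_MAGIC)",
    0xD2F1: "directory leaf (XFS_DIR2_LEAF1_MAGIC)",
    0xD2FF: "directory leaf (XFS_DIR2_LEAFN_MAGIC)",
}

def parse_dump_line(line):
    """Turn 'addr: 49 4e 00 ... IN....' into bytes, ignoring the ASCII column."""
    _, rest = line.split(":", 1)
    out = []
    for tok in rest.split()[:16]:
        if len(tok) == 2 and all(c in "0123456789abcdefABCDEF" for c in tok):
            out.append(int(tok, 16))
        else:
            break  # reached the ASCII column
    return bytes(out)

def classify(buf):
    """Guess the metadata type of a buffer from its magic number."""
    if len(buf) >= 4:
        (m32,) = struct.unpack(">I", buf[:4])
        if m32 in DIR2_MAGIC_AT_0:
            return DIR2_MAGIC_AT_0[m32]
    if len(buf) >= 2:
        (m16,) = struct.unpack(">H", buf[:2])
        if m16 in INODE_MAGIC_AT_0:
            return INODE_MAGIC_AT_0[m16]
    if len(buf) >= 10:
        (m16,) = struct.unpack(">H", buf[8:10])
        if m16 in DA_MAGIC_AT_8:
            return DA_MAGIC_AT_8[m16]
    return "unknown"

dump = "ffff8803e65c8000: 49 4e 00 00 02 02 00 00 00 00 14 1b 00 00 04 26 IN.............&"
print(classify(parse_dump_line(dump)))  # -> inode (XFS_DINODE_MAGIC, 'IN')
```

It can only say what the block *is*, not which pointer led there. That is why the repair output matters.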