On Wed, Feb 18, 2009 at 10:36:59AM +0100, Carsten Aulbert wrote:
> Dave Chinner schrieb:
> > On Tue, Feb 17, 2009 at 03:49:16PM +0100, Carsten Aulbert wrote:
> >> Do you need more information or can I send these nodes into a re-install?
> > More information. Can you get a machine into a state where you can
> > trigger this condition reproducibly by doing:
> > mount filesystem
> > touch /mnt/filesystem/some_new_file
> > If you can get it to that state, and you can provide an xfs_metadump
> > image of the filesystem when in that state, I can track down the
> > problem and fix it.
> I can try doing that on a few machines, would a metadump help on a
> machine where this corruption occurred some time ago and is still in
> this state?
If you unmount the filesystem, mount it again and then touch a new
file and it reports the error again, then yes, a metadump would be
useful. If the error doesn't show up after an unmount/mount, then I
can't use a metadump image to reproduce the problem.
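
For reference, the check above could be scripted roughly as follows. This
is only a sketch: the function names, the device, the mount point and the
dump path are placeholders, the error patterns are just the two messages
quoted in this thread, and the sync is an assumption about what is needed
to make a flush error surface. It has to run as root.

```shell
#!/bin/sh
# Sketch of the remount-and-touch check; device, mount point and
# dump file are placeholders supplied by the caller.

# Succeeds if the given log text contains one of the XFS error
# messages quoted in this thread.
has_xfs_error() {
    printf '%s\n' "$1" | grep -Eq 'XFS internal error|xfs_iflush: Bad inode'
}

# repro_and_dump DEVICE MOUNTPOINT DUMPFILE  (run as root)
repro_and_dump() {
    dev=$1; mnt=$2; dump=$3

    umount "$mnt" 2>/dev/null      # ignore failure if it is not mounted
    dmesg -c >/dev/null            # clear the kernel log so only new messages remain
    mount "$dev" "$mnt" || return 2

    touch "$mnt/some_new_file"
    sync                           # assumption: push the new inode out to disk

    log=$(dmesg)
    umount "$mnt" || return 2      # take the metadump from an unmounted filesystem

    if has_xfs_error "$log"; then
        xfs_metadump "$dev" "$dump" && echo "reproduced; metadump written to $dump"
    else
        echo "no XFS error after remount; a metadump will not reproduce it"
        return 1
    fi
}
```

Something like "repro_and_dump /dev/sda6 /mnt/filesystem /tmp/sda6.metadump"
would then either leave a metadump image behind or report that the error
did not come back after the remount.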
> >> Feb 16 22:01:28 n0260 kernel: [1129250.851451] Filesystem "sda6":
> >> xfs_iflush: Bad inode 1176564060 magic number 0x36b5, ptr
> >> 0xffff8801a7c06c00
> > However, this implies some kind of memory corruption is occurring.
> > That is reading the inode out of the buffer before flushing the
> > in-memory state to disk. This implies someone has scribbled over
> > page cache pages.
> >> Feb 17 05:57:44 n0463 kernel: [1156816.912129] Filesystem "sda6": XFS
> >> internal error xfs_btree_check_sblock at line 307 of file
> >> fs/xfs/xfs_btree.c. Caller 0xffffffff802dd15b
> > And that is another buffer that has been scribbled over.
> > Something is corrupting the page cache, I think. Whether the
> > original shutdown is caused by the same corruption, I don't
> > know.
> At least on two nodes we ran memtest86+ overnight and so far no error.
I don't think it is hardware related.
> >> plus a few more nodes showing the same characteristics
> > Hmmmm. Did this show up in .10? Or did it start occurring only
> > after you upgraded from .10 to .14?
> As far as I can see this only happened after the upgrade about 14 days
> ago. What strikes me as odd is that we only had this occurring massively on
> Monday and Tuesday this week.
> I don't know if a certain access pattern could trigger this somehow.
I suspect so. We've already had XFS trigger one bug in the new
lockless pagecache code, and the fix for that went in 184.108.40.206 -
between the good version and the version that you've been seeing
these memory corruptions on. I'm wondering if that fix exposed or
introduced another bug that you've hit....