On Mon, 2002-02-04 at 07:30, Ian D. Hardy wrote:
> Does anyone have any ideas on the following Oops (processed with ksymoops
> 2.4.3)? It is from an NFS server (Dual 1GHz Supermicro LE, 1Gbyte RAM, 40Gbyte
> Maxtor IDE
> system disk, Zero-D/GForce RI Fibrechannel to IDE hardware RAID-5 500Gbyte
> disk unit). It is running the Linux 2.4.17-xfs kernel taken as a CVS image
> on 27th January. The main area of disk it is serving is on the HW RAID unit,
> which is the only XFS filesystem on the system. The system had been up
> for just over 3 days when it crashed.
> I reported a very similar failure a few weeks ago, at that time running a
> 2.4.9-based kernel. Steve Lord suggested that we try the latest CVS image,
> as this had fixed some memory allocation problems.
> The machine is essentially an NFS fileserver to a computational cluster.
> Of possible interest is the 'save' process that was running on one of the
> processors; this is the Legato Networker backup client process (which was
> performing a full backup of the XFS filesystem at the time). I don't think
> this is significant, as I was seeing these crashes (at ~4 to 12 day intervals)
> with the 2.4.9 kernel regardless of whether a 'save' session was running.
You have not been forgotten, I am just trying to do too many things at once
around here right now. Both of you ended up with an oops in kfree, so
would it be possible to turn on CONFIG_DEBUG_SLAB?
This will turn on a number of memory checking features and might make
things fall over at a different - and more insightful - point.
In Chip's case I suspect the config flag does not exist, so hand edit
mm/slab.c and turn on the DEBUG options in there.
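For the hand edit, here is a rough sketch of what the top of mm/slab.c
would look like with the checks switched on. The macro names (DEBUG, STATS,
FORCED_DEBUG) are an assumption recalled from 2.4-era trees, so check them
against your own source before changing anything. On kernels that do have
the option, setting CONFIG_DEBUG_SLAB=y in the kernel config should get
you the same thing without touching the file.

```c
/* mm/slab.c, near the top of the file. Sketch only, not a patch:
 * the names below are assumed from 2.4-era sources and may differ
 * in your tree. Each is expected to be 0 in a normal build. */
#define DEBUG        1   /* honour red-zone and poison checks on slab objects */
#define STATS        1   /* collect extra per-cache statistics */
#define FORCED_DEBUG 1   /* enable red-zoning and poisoning on caches where possible */
```

Rebuild and reboot after the change. Expect the allocator to be somewhat
slower with these on, so it is probably something to run only until the
problem has been caught.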
On a side note, today I experienced an oops due to what appeared to be
a failure to allocate a buffer. We had been assuming these were caused
by being out of memory, but in my case I had plenty of available memory.
It turns out to be a bug in the pagebuf code when we reallocate metadata
space. I am thrashing the fix on some test boxes now, but it is possible
that what people were seeing were not really out-of-memory cases, but
were due to this bug.
Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: lord@xxxxxxx