Summary: XFS + NFS under heavy use cause CPU load to go to 300,
lock up filesystem
Component: XFS kernel code
Estimated Hours: 0.0
Hardware is 2x Opteron 270, 16GB RAM, Areca ARC-1160 16-port PCI-X to SATA RAID
controller (16x500GB disks in RAID6, 1 hot spare). This is our cluster head
node and primary storage for about 160 compute nodes of various sizes.
We're running 256 nfsd threads. We are also running a gluster 3.3beta3 client
to a gluster volume over ethernet, but this does not seem to be involved: the
gluster fs remains up and active through these episodes.
The ethernet controllers are not reporting any errors with ifconfig, so the
hardware seems to be OK.
This has happened under CentOS 5.7 and, more recently, under 5.8.
The RAID controller reports everything as OK, and
xfs_check reports everything as OK.
We are running the following (self-reported) versions from CentOS 5.8:
mkfs.xfs version 2.9.4
xfs_db version 2.9.4
xfsdump Version 2.2.46
xfsprogs Version 2.9.4
I do not know whether this is specifically an XFS problem, but XFS calls are
referenced in the stack traces in the dmesg log (see below).
As the primary storage node, this node is almost always under heavy load
(/proc/loadavg varies from 1-16 due to NFS load). Up until about 5 months ago,
everything was running fairly smoothly, but at about that point things started
to go wrong sporadically.
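
To catch the next episode as it starts, the load figures quoted above could be sampled from /proc/loadavg. The following is only a minimal Python sketch: the field layout (three load averages, runnable/total tasks, last PID) is the standard Linux one, but the names `LoadAvg`, `parse_loadavg`, `read_loadavg` and `is_runaway`, and the threshold of 50, are assumptions made for illustration and are not something we currently run.

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    one_min: float
    five_min: float
    fifteen_min: float
    runnable: int   # currently runnable scheduling entities
    total: int      # total scheduling entities on the system
    last_pid: int   # most recently created PID


def parse_loadavg(text: str) -> LoadAvg:
    """Parse one line in /proc/loadavg format, e.g. '1.00 2.00 3.00 4/500 999'."""
    one, five, fifteen, tasks, last_pid = text.split()
    runnable, total = tasks.split("/")
    return LoadAvg(float(one), float(five), float(fifteen),
                   int(runnable), int(total), int(last_pid))


def read_loadavg(path: str = "/proc/loadavg") -> LoadAvg:
    """Read and parse the live load averages (Linux only)."""
    with open(path) as fh:
        return parse_loadavg(fh.read())


def is_runaway(load: LoadAvg, threshold: float = 50.0) -> bool:
    """Hypothetical alarm condition: 1-minute load above an arbitrary threshold."""
    return load.one_min > threshold
```

Calling `is_runaway(read_loadavg())` from a cron job or a polling loop would be one way to trigger extra data collection (tcpdump, task dumps) while the load is still climbing.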
The problem developed as the load rose from a high but acceptable level
(~10) to 250-300 over the course of about 10 min. The XFS
filesystem became unusable both locally and remotely, but the root
filesystem was still available (as was the remote gluster filesystem). Local
utilities (top, atop, htop) reported that the load was high but showed no
processes causing it. Network traffic was very low (via iostat,
nfswatch, ifstat, iftop, etc). I did not do an extensive tcpdump, but I will
try to the next time it happens.
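
One possible explanation for a very high load with no busy processes, not confirmed for this machine, is that the Linux load average also counts tasks in uninterruptible sleep (state D), which use no CPU and so do not stand out in top. A minimal Python sketch for listing such tasks follows; it relies only on the standard `pid (comm) state ...` layout of /proc/<pid>/stat, and the function names `parse_stat` and `blocked_tasks` are made up for this example.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first '(' and the last ')'.
    """
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left].strip())
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry unreadable while scanning
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

If most of the 256 nfsd threads showed up in this list during an episode, that would account for the load figure and point at whatever they are all waiting on.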
The dmesg log (see attachment) over this time included several stack traces
that referenced xfs functions, starting at about line 620.
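
For anyone going through the attachment, the XFS references can be located mechanically. This is a rough Python sketch under the assumption that XFS kernel symbols in the traces contain the substring `xfs_`; the function name `xfs_references` and the file name in the usage note are hypothetical.

```python
import re

# Any symbol containing "xfs_", including ones with a leading underscore.
XFS_SYMBOL = re.compile(r"\w*xfs_\w+")


def xfs_references(log_text):
    """Return [(line_number, [symbols])] for log lines that mention XFS symbols.

    Line numbers are 1-based so they can be compared with an editor's view.
    """
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        symbols = XFS_SYMBOL.findall(line)
        if symbols:
            hits.append((lineno, symbols))
    return hits
```

With the attachment saved as, say, `dmesg.log`, `xfs_references(open("dmesg.log").read())` would give the line numbers of every trace entry that names an XFS function.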
Do these imply an XFS or an NFS error? Can you describe either a fix or
another approach that would remedy this?
The sudden appearance of such a problem could imply a hardware fault, but the
machine reports normal values for every variable that I can find. It could
also be a problem with a badly behaved application: since this is part of an
academic cluster, there are a number of badly written, badly behaved apps
running, but typically these reveal themselves pretty easily.