Subject: Re: XFS corruption!
From: Stephen Lord
To: Libor Vanek
Cc: linux-xfs@oss.sgi.com
Date: 27 Jun 2002 09:18:59 -0500
Message-Id: <1025187574.1622.5.camel@n236>
In-Reply-To: <3D1AAB70.4060400@conet.cz>
References: <3D1AAB70.4060400@conet.cz>

On Thu, 2002-06-27 at 01:06, Libor Vanek wrote:
> Hi,
> we are selling Linux file servers and we wanted to use XFS. Our internal
> tests passed OK, but when we installed the first server at a customer and
> migrated data, an error occurred (usually after copying 60-100 GB). In
> /var/log/messages we saw these messages:
>
> Jun 27 03:09:56 localhost kernel: xfs_btree_check_sblock: Not OK:
> Jun 27 03:09:56 localhost kernel: magic 0x41425443 level 0 numrecs 394 leftsib -1 rightsib -129
> Jun 27 03:09:56 localhost kernel: xfs_btree_check_sblock: Not OK:
> Jun 27 03:09:56 localhost kernel: magic 0x41425443 level 0 numrecs 394 leftsib -1 rightsib -129
> ...MANY MANY SAME...
> Jun 27 03:09:56 localhost kernel: xfs_btree_check_sblock: Not OK:
> Jun 27 03:09:56 localhost kernel: magic 0x41425443 level 0 numrecs 394 leftsib -1 rightsib -129
> Jun 27 03:10:30 localhost kernel: xfs_force_shutdown(md(9,0),0x8) called from line 1039 of file xfs_trans.c. Return address = 0xc01e816a
> Jun 27 03:10:30 localhost kernel: Corruption of in-memory data detected. Shutting down filesystem: md(9,0)
> Jun 27 03:10:30 localhost kernel: Please umount the filesystem, and rectify the problem(s)
>
>
> We tried migrating 160 GB of data using "cp -a" (over NFS), scp and rsync
> from the old server running RH7.0 (ext2); all resulted in this.
> The system is running software RAID5 (10x60GB), 1 GHz Celeron, 128 MB RAM,
> standard RH7.3 installed from the SGI XFS modified installation CD.
> After we rebooted the system everything seemed OK (nothing lost), but after
> copying a few more MB the same error occurs.
> We have built two identical machines from the same system image and both
> behave exactly the same, so I think it's a software failure.
>
> I have stress tested the system by running many
> "dd if=/dev/md0 of=/raid/tmp bs=10MB count=100" commands and creating
> recursive directories (about 50 levels deep), and nothing similar occurred.
> It only happens when copying data over the network from the old system.
>
>
> Thanks,
> Libor
>
>

Can you please run xfs_check on the filesystem after this has happened?
I suspect you may have found a hole in the endian conversion code in
XFS. Doing the copy into the filesystem over NFS is probably generating
more fragmentation, and hence more complex free space structures, than
doing it locally.

Steve
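
The header fields in the quoted xfs_btree_check_sblock lines can be decoded with a few lines of Python. This is only a minimal sketch: it assumes the leftsib and rightsib values are printed as signed 32-bit integers, and the helper names are hypothetical and not part of any XFS tool. It shows that the magic number spells "ABTC" in ASCII (the magic XFS uses for its by-size free space btree blocks), and prints the raw and byte-swapped 32-bit patterns of the sibling pointers so they can be compared when looking at a suspected endian conversion problem.

```python
import struct


def magic_to_ascii(magic: int) -> str:
    """Render a 32-bit magic number as its four big-endian ASCII bytes."""
    return struct.pack(">I", magic).decode("ascii")


def as_u32(value: int) -> int:
    """Reinterpret a (possibly negative) integer as an unsigned 32-bit value."""
    return value & 0xFFFFFFFF


def byteswap32(value: int) -> int:
    """Reverse the byte order of a 32-bit value."""
    return struct.unpack("<I", struct.pack(">I", as_u32(value)))[0]


if __name__ == "__main__":
    # Values copied from the log lines in the report.
    magic, leftsib, rightsib = 0x41425443, -1, -129

    print("magic    :", magic_to_ascii(magic))
    for name, val in (("leftsib", leftsib), ("rightsib", rightsib)):
        # Show the 32-bit pattern as printed and with its bytes reversed.
        print(f"{name:9}: raw=0x{as_u32(val):08x} swapped=0x{byteswap32(val):08x}")
```

Under the signed 32-bit assumption, leftsib -1 is 0xffffffff in either byte order, while rightsib -129 is 0xffffff7f as printed and 0x7fffffff when byte-swapped.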