Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id f8D8kQM32548 for linux-xfs-outgoing; Thu, 13 Sep 2001 01:46:26 -0700
Received: from relay.xlink.net (relay.xlink.net [193.141.40.4]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id f8D8kCe32523 for ; Thu, 13 Sep 2001 01:46:13 -0700
Received: from lizard.webland.de (lizard.webland.de [194.122.76.201]) by relay.xlink.net (8.9.3/8.8.7) with ESMTP id KAA03690; Thu, 13 Sep 2001 10:45:35 +0200 (MET DST)
Received: (from uucp@localhost) by lizard.webland.de (8.8.8/8.8.7) id KAA12219; Thu, 13 Sep 2001 10:45:06 +0200 (MET DST)
>Received: from mobile.sauter-bc.com (unknown [10.1.6.21]) by basel1.sauter-bc.com (Postfix) with ESMTP id D14B257306; Thu, 13 Sep 2001 10:44:28 +0200 (CEST)
Received: from ch.sauter-bc.com (support.cad.sba [10.1.200.117]) by mobile.sauter-bc.com (Postfix) with ESMTP id C88C525835; Thu, 13 Sep 2001 10:44:28 +0200 (CEST)
Message-ID: <3BA071EC.D9BE8E22@ch.sauter-bc.com>
Date: Thu, 13 Sep 2001 10:44:28 +0200
From: Simon Matter
Organization: Sauter AG, Basel
X-Mailer: Mozilla 4.77 [de] (X11; U; Linux 2.2.19-6.2.7 i686)
X-Accept-Language: de-CH, en
MIME-Version: 1.0
To: Adrian Head , linux-xfs@oss.sgi.com
Subject: Re: Problems with many processes copying large directories across an XFS volume.
References: <3B9CCE00.D704DC0B@ch.sauter-bc.com> <3B9DA493.11CD4C16@ch.sauter-bc.com>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii
Sender: owner-linux-xfs@oss.sgi.com
Precedence: bulk

Simon Matter wrote:
>
> Simon Matter wrote:
> >
> > Adrian Head wrote:
> > >
> > > Thanks for your reply Simon
> > >
> > > Yes the softraid was fully synced before I started any test.
> > >
> > > The XFS patch I used to obtain these errors was
> > > patch-2.4.9-xfs-2001-08-19 and the errors were:
> > > Sep 9 05:13:46 ATLAS kernel: 02:86: rw=0, want=156092516, limit=360
> > > Sep 9 05:13:46 ATLAS kernel: attempt to access beyond end of device
> > >
> > > When I used a later version of the XFS patch I had more descriptive
> > > errors written to /var/log/messages:
> > > Sep 10 10:14:57 ATLAS kernel: I/O error in filesystem ("md(9,0)")
> > > meta-data dev 0x900 block 0x9802bdc
> > > Sep 10 10:14:57 ATLAS kernel: (xlog_iodone") error 5 buf count 32768
> > > Sep 10 10:14:57 ATLAS kernel: xfs_force_shutdown(md(9,0),0x2) called
> > > from line 940 of file xfs_log.c. Return address - 0xd8cb66f8
> > > Sep 10 10:14:57 ATLAS kernel: Log I/O Error Detected. Shutting down
> > > filesystem: md(9,0)
> > > Sep 10 10:14:57 ATLAS kernel: Please umount the filesystem, and rectify
> > > the problem(s)
> > > Sep 10 10:14:57 ATLAS kernel: xfs_force_shutdown(md(9,0),0x2) called
> > > from line 714 of file xfs_log.c. Return address = 0xd8cb65d3
> > > Sep 10 10:14:57 ATLAS kernel: attempt to access beyond end of device
> > > Sep 10 10:14:57 ATLAS kernel: 02:82: rw=0, want=1602235696, limit=4
> > >
> > > I did think at the time that it may have been issues with XFS stomping
> > > all over raid code or raid code stomping all over XFS. Although I'm not
> > > sure now as the 2.4.10-pre2-xfs-2001-09-02 patch never wrote any errors
> > > out at all. (please see my 2nd post for more info)
> > >
> > > Thanks for taking the time to test this on your own machine.
> >
> > I tried 20, 40 and 80 simultaneous cp jobs with no crash. Then I changed
> > the file tree and the new tree has ~280M small files with 100b-50kb size.
> > When using 60 cp jobs the machine died. I could ping it but nothing
> > more. No ssh, no console, no shutdown. I'll run some more tests tonight.
> > I'll try the same with ext2 as well to make sure it's XFS and not Softraid.
>
> Update: I tried the 60 cp jobs on an ext2 filesystem and the system is
> still alive but 58 of the 60 cp jobs are hanging. Well maybe the 2.4.3
> kernel is a bit old now... I'll try some more tests.

I have just installed the 2.4.10-pre2 kernel from http://rpms.aicompro.net/
and tried my stress test with 60 instances of cp, and it has crashed as well.
Not good. I'm not sure whether kdb is in those RPMs, but when the machine
hung I couldn't get into kdb (the sysctl switch was on). Usually you can
unblank the console by pressing Shift or Num Lock, but this time it was
really dead. I'm now trying the same test on a non-RAID partition.

Simon

>
> Simon
>
> >
> > -Simon
> >
> > >
> > > Adrian Head
> > > Bytecomm P/L
> > >
> > > > -----Original Message-----
> > > > From: Simon Matter [SMTP:simon.matter@ch.sauter-bc.com]
> > > > Sent: Monday, 10 September 2001 17:45
> > > > To: adrian.head@bytecomm.com.au
> > > > Cc: linux-xfs@oss.sgi.com
> > > > Subject: Re: Problems with many processes copying large
> > > > directories across an XFS volume.
> > > >
> > > > Hi Adrian
> > > >
> > > > I did similar tests two months ago. I was having problems as well but
> > > > unfortunately I don't remember what it was exactly.
> > > > First question: You created Softraid5, was the raid synced when you
> > > > started the tests?
> > > >
> > > > > In the /var/log/messages log around the same time as the copy test
> > > > > I get entries like:
> > > > > Sep 9 05:13:46 ATLAS kernel: 02:86: rw=0, want=156092516, limit=360
> > > > > Sep 9 05:13:46 ATLAS kernel: attempt to access beyond end of device
> > > >
> > > > This looks interesting. I don't know what this means exactly but it
> > > > looks to me like you managed to create a filesystem bigger than the
> > > > raid volume was? I got the very same error when I tried to restore
> > > > data with xfsrestore from DAT (xfsrestore from DLT was fine). The
> > > > issue is still open.
> > > >
> > > > I have a test system here with SoftRAID5 on 4 U160 SCSI disks. I'll try
> > > > to kill it today with cp jobs.
> > > >
> > > > -Simon
> > > >

-- 
Simon Matter              Tel:  +41 61 695 57 35
Fr.Sauter AG / CIT        Fax:  +41 61 695 53 30
Im Surinam 55
CH-4016 Basel             [mailto:simon.matter@ch.sauter-bc.com]
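
For anyone who wants to reproduce the stress test described in this thread (many cp jobs copying the same tree at once), here is a minimal sketch in Python. It is not the script that was actually used; the function name run_parallel_copies, the copyNNN destination names and the timeout handling are assumptions made for illustration. It starts all copies first, then waits for each one, and marks jobs that never finish (like the 58 hanging cp jobs on ext2) with None instead of an exit code.

```python
import subprocess
import time
from pathlib import Path


def run_parallel_copies(src, dst_root, jobs=60, timeout=3600.0):
    """Start `jobs` concurrent `cp -r` processes and report how each ended.

    Returns one entry per job: the exit code of cp, or None if the job
    was still running when the overall timeout expired.
    """
    dst_root = Path(dst_root)
    dst_root.mkdir(parents=True, exist_ok=True)

    # Launch every copy before waiting on any of them, so that all jobs
    # compete for the volume at the same time.
    procs = [
        subprocess.Popen(["cp", "-r", str(src), str(dst_root / f"copy{i:03d}")])
        for i in range(jobs)
    ]

    deadline = time.monotonic() + timeout
    results = []
    for proc in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results.append(proc.wait(timeout=remaining))
        except subprocess.TimeoutExpired:
            # A job stuck in uninterruptible I/O may not die even after
            # kill(), so we do not wait for it again here.
            proc.kill()
            results.append(None)
    return results
```

A call such as run_parallel_copies("/data/tree", "/mnt/test/out", jobs=60) (both paths are hypothetical) returns a list of 60 entries; anything other than 0 points at a failed or hanging copy. If the whole machine locks up, as reported above, the script will of course never get to report anything, so /var/log/messages and the console remain the places to look.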