
Re: Problems with many processes copying large directories across an XFS volume.

To: adrian.head@xxxxxxxxxxxxxxx
Subject: Re: Problems with many processes copying large directories across an XFS volume.
From: Simon Matter <simon.matter@xxxxxxxxxxxxxxxx>
Date: Mon, 10 Sep 2001 09:45:09 +0200
Cc: linux-xfs@xxxxxxxxxxx
Organization: Sauter AG, Basel
References: <01090823430700.01184@HERCULES>
Sender: owner-linux-xfs@xxxxxxxxxxx
Hi Adrian

Adrian Head wrote:
> I am in the process of building a couple of file servers for various purposes
> and over the last week have been running quite a few tests in an attempt to
> determine if I trust the hardware/software combination enough for it to be
> put into production.
> One of the tests I was doing was trying to simulate many users copying large
> directories across an XFS volume.  To do this I was generating many
> background jobs copying a 4G directory to another directory on the XFS volume.
> eg.
> #>cp -r 001 002&
> #>cp -r 001 003&
> #>cp -r 001 004&
> .....
> .....
> #>cp -r 001 019&
> #>cp -r 001 020&
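For anyone who wants to repeat the workload, the twenty background copies can be spawned in one loop. A minimal sketch (directory names and job count are placeholders for the ones above):

```shell
#!/bin/sh
# Spawn n background recursive copies of src into 002, 003, ..., then
# wait for all of them -- same pattern as the manual cp jobs above.
spawn_copies() {
    src=$1
    n=$2
    i=2
    while [ "$i" -le "$n" ]; do
        cp -r "$src" "$(printf '%03d' "$i")" &
        i=$((i + 1))
    done
    wait    # block until every background cp has finished
}
# In the original test this would be: spawn_copies 001 20
# with 001 being the ~4G source directory on the XFS volume.
```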

I did similar tests two months ago. I had problems as well, but
unfortunately I don't remember exactly what they were.
First question: you created a software RAID5 -- was the array fully
synced when you started the tests?
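An array that is still doing its initial resync behaves very differently under heavy I/O. A quick way to check before benchmarking (a sketch; assumes the md driver's /proc/mdstat interface):

```shell
#!/bin/sh
# Report whether any md array is still resyncing or rebuilding.
# /proc/mdstat shows "resync" or "recovery" lines while this is going on.
md_sync_active() {
    [ -r /proc/mdstat ] && grep -q 'resync\|recovery' /proc/mdstat
}

if md_sync_active; then
    echo "RAID still syncing - wait before running the copy test"
else
    echo "RAID idle (or no md arrays present)"
fi
```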

> Everything would start fine but less than a minute into the test various
> hundreds of errors are displayed like:
> cp: cannot stat file `/mnt/raid5/filename`: Input/output error
> Once this has happened the XFS volume disappears.  By this I mean that it is
> still mounted but all files and directories are no longer visible using ls.
> Any other file activity results in an Input/Output error.  Once I unmount &
> mount the volume again the data is again visible up to the point where the
> copy failed.
> In the /var/log/messages log around the same time as the copy test I get
> entries like:
> Sep 9 05:13:46 ATLAS kernel: 02:86: rw=0, want=156092516, limit=360
> Sep 9 05:13:46 ATLAS kernel: attempt to access beyond end of device

This looks interesting. I don't know exactly what it means, but it looks
to me as if you managed to create a filesystem bigger than the RAID
volume. I got the very same error when I tried to restore data with
xfsrestore from DAT (xfsrestore from DLT was fine). The issue is still
unresolved.
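One way to test that theory is to compare the size XFS thinks it has with the size of the underlying device; a filesystem larger than the device would produce exactly this "access beyond end of device" error. A sketch of the comparison (the live commands in the comments are assumptions, not taken from the post):

```shell
#!/bin/sh
# The filesystem must fit inside the device: compare both in bytes.
fs_fits_device() {
    fs_bytes=$1
    dev_bytes=$2
    [ "$fs_bytes" -le "$dev_bytes" ]
}

# On a live system the two numbers would come from something like:
#   dev_bytes=$(( $(blockdev --getsize /dev/md0) * 512 ))   # sectors * 512
#   fs_bytes = dblocks * blocksize as reported by: xfs_info /mnt/raid5
# (device name and mount point are placeholders)
```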

> The problem is reproducible on XFS volumes on a 2 disk (IDE) raid0 (SW raid)
> partition and on a 5 disk (IDE) raid5 (SW raid) partition.  However, there is
> no problem with the copy test using ext2 volumes on the above partitions.
> The copy test also passes when run on a non-raid drive.
> I am using Kernel 2.4.9 and the relevant latest XFS patch from the patches
> directory on the XFS ftp site.
> patch-2.4.9-xfs-2001-08-19
> The thing that really puzzles me is that the above directory copy test runs
> fine when I only have 10 background copy jobs running at a time.  As soon as
> I have 20 background copy jobs running the problem occurs.  The system passes
> both bonnie++ and mongo.pl tests/benchmarks.
> So from the results I have at the moment it would seem that XFS is stomping
> over the raid code or the raid code is stomping over XFS.  Should I cross
> post this to the raid list as well?
> P.S. I have just noticed on the mailing list archive a note about fixing a
> problem that caused mongo.pl to hang.  Although my systems don't hang in
> mongo do people think I'm seeing the same problem just a different symptom?
> Another issue that I think is not related is that when using the 2.4.9-xfs
> kernel, when the kernel identifies the drives during bootup I get IRQ probe
> failed errors.
> hda: IC35L040AVER07-0, ATA DISK drive
> hda: IRQ probe failed (0xfffffff8)
> hdb: IC35L040AVER07-0, ATA DISK drive
> hdb: IRQ probe failed (0xfffffff8)
> ........the rest as normal
> The errors occur when the kernel is run on an ASUS A7V133 motherboard but not
> on a ASUS A7V133C.  The errors don't happen with a native 2.4.9 kernel
> either.  Since the errors occur for the 2 drives on the 1st channel of the
> 1st IDE controller (which is not related to the raid arrays mentioned above)
> and the system still boots - I have not been worried about it.  Should I be
> worried?
> At this stage I'm unsure what other info people would like.  If anyone wants
> logs, config files, more information or more testing please tell me.
> Thanks for your time.
> Adrian Head
> Bytecomm P/L

I have a test system here with SoftRAID5 on 4 U160 SCSI disks. I'll try
to kill it today with cp jobs.

