Subject: Re: data corruption on nfs+xfs
From: Russell Cattelan
To: Masanori TSUDA
Cc: linux-xfs@oss.sgi.com, Kazuyuki Goto
In-Reply-To: <200406100130.AA00198@tnesb9665.tnes.nec.co.jp>
References: <200405271558.EJG73779.VJBLYZVL@sys1.cpg.sony.co.jp>
    <200406100130.AA00198@tnesb9665.tnes.nec.co.jp>
Content-Type: multipart/signed; micalg=pgp-sha1;
    protocol="application/pgp-signature"; boundary="=-lXcpJ2gVNoB+LbitJrwB"
Message-Id: <1086914331.1160.63.camel@rose.americas.sgi.com>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.4.6
Date: Thu, 10 Jun 2004 19:38:53 -0500
X-original-sender: cattelan@xfs.org
X-list: linux-xfs

--=-lXcpJ2gVNoB+LbitJrwB
Content-Type: text/plain; charset=UTF-8

This looks really promising.
I'm currently reading through the code again to see what kind of
implications this might have.
I'm worried that your patch might increase file fragmentation,
but that is just at first glance. I'll look some more and run some
testing with and without your patch.

I'm looking at xfs_inactive_free_eofblocks again; I think there may
be an issue with the xfs_inode di_size and the linux inode i_size.

BTW what tracing did you use to find this?

On Wed, 2004-06-09 at 20:30, Masanori TSUDA wrote:
> Hi,
>
> I have reproduced a similar problem on xfs1.3.1 (based on 2.4.21);
> my environment is as follows.
>
> nfs server :
>     OS : RedHat9 + xfs1.3.1 (based on 2.4.21)
>     CPU : Xeon(2.4GHz) x 2
>     MEM : 1GB
>     NIC : Intel PRO/1000
>     Local Filesystem : XFS, the refcache is disabled.
>
> nfs client :
>     OS : RedHat9 (based on 2.4.20-8)
>     NIC : Intel PRO/1000
>     NFS Ver. : 3
>     NFS Mount Options : udp,hard,intr,wsize=8192
>
> Within 1 hour of running the test, the corruption was detected.
> (To make the corruption easy to detect, umount nfs, umount xfs,
> mount xfs and mount nfs before comparing data, i.e. purge the memory cache.)
> The corruption width was a multiple of 4KB, starting at a 4KB boundary.
> In many cases, it occurred at the start of the physical extent.
>
> I have investigated the issue using the kernel embedded local trace.
> I think that the issue was caused by the delayed allocation mechanism.
> Below is an example of the corruption scenario as I guess it.
> The steps of the scenario are listed in time order.
>
> 1. open and write in nfsd (for write1)
>     The nfs client writes 8KB of data to the file (called write1).
>     The write request is processed in nfsd. The nfsd calls open [linvfs_open],
>     and calls write [linvfs_write]. After the write, the file has several
>     delayed allocation blocks beyond the end of the file, because of
>     allocation in chunks and alignment to writeiosize.
>
>         file image
>         offset=0  eof
>         +----+----+----+----+----+- ... +----+
>         |    |    |    |    |    |      |    |
>         +----+----+----+----+----+- ... +----+
>          4KB  4KB
>         +---------+
>         write data (write1)
>         +------------------------------------+
>         delayed allocation blocks
>
> 2. allocate disk space in kupdated (for write1)
>     Disk space is allocated for the delayed allocation blocks before the
>     data is flushed to disk [linvfs_writepage, page_state_convert].
>
>         file image
>         offset=0  eof
>         +----+----+----+----+----+- ... +----+
>         |    |    |    |    |    |      |    |
>         +----+----+----+----+----+- ... +----+
>          4KB  4KB
>         +---------+
>         write data (write1)
>         +------------------------------------+
>         allocated disk space
>         +---------+
>         called disk space1
>                   +--------------------------+
>                   called disk space2
>
> 3. close in nfsd (for write1)
>     The nfsd calls close [linvfs_release]. At this time, the allocated disk
>     space beyond the end of the file (disk space2) is truncated when the
>     refcache is disabled [xfs_inactive_free_eofblocks].
>
>         file image
>         offset=0  eof
>         +----+----+
>         |    |    |
>         +----+----+
>          4KB  4KB
>         +---------+
>         write data (write1)
>         +---------+
>         disk space1
>
> 4. open and write in nfsd (for write2)
>     Next, the nfs client writes another 8KB of data to the file (called write2).
>     The nfsd calls open [linvfs_open], and calls write [linvfs_write].
>
>         file image
>         offset=0            eof
>         +----+----+----+----+----+- ... +----+
>         |    |    |    |    |    |      |    |
>         +----+----+----+----+----+- ... +----+
>          4KB  4KB  4KB  4KB
>         +---------+
>         write data (write1)
>                   +---------+
>                   write data (write2)
>                   +--------------------------+
>                   delayed allocation blocks
>         +---------+
>         disk space1
>
> 5. flush data to disk in kupdated (for write1)
>     The write data (write1) is flushed to disk space1 [page_state_convert].
>     And the write data (write2) is flushed to disk space2 [cluster_write] !!!,
>     because the buffer status of the write data (write2) is dirty and delay.
>     But disk space2 does not exist at this time.
>     Disk space2 may be in use by another file, or may be free space.
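
The five steps above can be sketched as a toy Python model. This is only
an illustration under stated assumptions: the ToyFile class, its method
names and the disk block numbers are hypothetical and do not correspond
to real XFS structures or functions, and it assumes the flusher keeps
using the block mapping it obtained in step 2 while the close and the
second write happen in between.

```python
class ToyFile:
    """Hypothetical model of one file; not an XFS data structure."""

    def __init__(self, free_disk_blocks):
        self.free = set(free_disk_blocks)  # disk blocks not owned by any file
        self.extent = {}                   # file block -> disk block (current map)
        self.delalloc = set()              # file blocks reserved, not yet allocated
        self.dirty = set()                 # file blocks with unwritten cached data
        self.eof = 0                       # file size in 4KB blocks

    def write(self, first, count, extra):
        """Steps 1 and 4: dirty `count` blocks, reserve `extra` blocks past them."""
        self.dirty.update(range(first, first + count))
        for blk in range(first, first + count + extra):
            if blk not in self.extent:
                self.delalloc.add(blk)
        self.eof = max(self.eof, first + count)

    def allocate(self):
        """Step 2: back every delalloc block with disk space; return a copy of the map."""
        for blk in sorted(self.delalloc):
            disk = min(self.free)
            self.free.remove(disk)
            self.extent[blk] = disk
        self.delalloc.clear()
        return dict(self.extent)

    def close(self):
        """Step 3: give back disk space that lies beyond end of file."""
        for blk in [b for b in self.extent if b >= self.eof]:
            self.free.add(self.extent.pop(blk))

    def flush(self, mapping):
        """Step 5: write dirty blocks through `mapping`, which may be stale.

        Returns (file block, disk block, still_owned) for each write issued.
        """
        issued = [(blk, mapping[blk], self.extent.get(blk) == mapping[blk])
                  for blk in sorted(self.dirty) if blk in mapping]
        self.dirty -= {blk for blk, _, _ in issued}
        return issued


def run_scenario():
    f = ToyFile(free_disk_blocks=range(100, 116))
    f.write(first=0, count=2, extra=6)   # 1. write1: 8KB at offset 0
    stale_map = f.allocate()             # 2. flusher allocates disk space
    f.close()                            # 3. close trims space past EOF
    f.write(first=2, count=2, extra=6)   # 4. write2: the next 8KB
    return f, f.flush(stale_map)         # 5. flusher resumes with the old map


if __name__ == "__main__":
    f, issued = run_scenario()
    for blk, disk, owned in issued:
        print(f"file block {blk} -> disk block {disk}",
              "ok" if owned else "NOT owned by this file any more")
```

In this model the two blocks of write2 are issued to disk blocks that
close has already returned to the free pool, which is consistent with
the 4KB-aligned damage described in the report. Whether the real code
path behaves this way is exactly what needs checking.
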
>
> I think that one solution for the issue is to flush only the buffers within
> the end of the file before allocating disk space for delayed allocation blocks,
> and not to flush buffers beyond that.
> I made a patch for xfs1.3.1. I am running the test on a kernel with the
> patch applied; it has been running for over 16 hours with no corruption.
>
> Could you please comment on the attached patch?
>
> Regards,
> Tsuda
>
> In message "data corruption on nfs+xfs"
> (04/05/27 15:58:48),
> kazuyuki@sys1.cpg.sony.co.jp wrote...
> >We are experiencing the same problem as No.198.
> > http://oss.sgi.com/bugzilla/show_bug.cgi?id=198
> > http://marc.theaimsgroup.com/?t=108343605300001&r=1&w=2
> >
> >We have confirmed that even when the refcache is disabled, by setting
> >fs.xfs.refcache_size to zero through sysctl, the problem does not disappear.
> >Running linux in single-CPU mode makes the problem slightly harder to trigger,
> >but it still occurs.
> >
> >Two types of corruption we've seen:
> >
> > 1) Width is a multiple of 8kB, starting at an 8kB boundary.
> >    *Maybe the same trouble as No.198.
> >
> > 2) Width is 964 bytes, ending at a 4kB boundary.
> >    *I'm not sure the cause is the same as 1) above.
> >
> >We have tested on 2.4.20-20.9.XFS1.3.1, 2.4.20-30.9.sgi1 XFS1.3.3 and other kernels
> >based on 2.4.20-20 on which we made some changes.
> >
> >Does anyone know where the cause is: in the page cache, disk block handling, or other parts?
> >Or does anyone know how to avoid this with some setting or another version?
> >

--=-lXcpJ2gVNoB+LbitJrwB
Content-Type: application/pgp-signature; name=signature.asc
Content-Description: This is a digitally signed message part

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (FreeBSD)

iD8DBQBAyP8bNRmM+OaGhBgRApEZAJ47g6Nia6xgqhHaTCLadnw24rKARACdGg1P
EsFDr/qzeVwpygp/VIA5Y9A=
=nRKF
-----END PGP SIGNATURE-----

--=-lXcpJ2gVNoB+LbitJrwB--