Message-Id: <200406100130.AA00198@tnesb9665.tnes.nec.co.jp>
From: Masanori TSUDA
Date: Thu, 10 Jun 2004 10:30:31 +0900
To: linux-xfs@oss.sgi.com, Kazuyuki Goto
Subject: Re: data corruption on nfs+xfs
In-Reply-To: <200405271558.EJG73779.VJBLYZVL@sys1.cpg.sony.co.jp>
References: <200405271558.EJG73779.VJBLYZVL@sys1.cpg.sony.co.jp>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="--------------------0793017953057401"
X-original-sender: tsuda@tnes.nec.co.jp
X-list: linux-xfs

This is multipart message.

----------------------0793017953057401
Content-Type: text/plain; charset=iso-2022-jp

Hi,

I have reproduced a similar problem on xfs1.3.1 (based on 2.4.21).
My environment is as follows.

nfs server :
    OS  : RedHat9 + xfs1.3.1 (based on 2.4.21)
    CPU : Xeon(2.4GHz) x 2
    MEM : 1GB
    NIC : Intel PRO/1000
    Local Filesystem : XFS, the refcache is disabled.

nfs client :
    OS  : RedHat9 (based on 2.4.20-8)
    NIC : Intel PRO/1000
    NFS Ver. : 3
    NFS Mount Options : udp,hard,intr,wsize=8192

The corruption was detected within 1 hour of running the test.
(To make the corruption easy to detect, I purge the memory cache before
comparing data, i.e. umount nfs, umount xfs, mount xfs and mount nfs.)

The corruption width was a multiple of 4KB, starting at a 4KB boundary.
In many cases, it occurred at the start of a physical extent.

I have investigated the issue using a local trace embedded in the kernel.
I think that the issue is caused by the delayed allocation mechanism.
Below is an example of the corruption scenario as I guess it.
The steps of the scenario are in time order.

1. open and write in nfsd (for write1)

The nfs client writes 8KB of data to the file (called write1).
The write request is processed in nfsd. The nfsd calls open [linvfs_open],
and then calls write [linvfs_write].
After the write, the file has several delayed allocation blocks beyond the
end of the file, because of allocation in chunks and alignment to writeiosize.

    file image
    offset=0  eof
    +----+----+----+----+----+- ... +----+
    |    |    |    |    |    |      |    |
    +----+----+----+----+----+- ... +----+
     4KB  4KB
    +---------+ write data (write1)
    +------------------------------------+ delayed allocation blocks

2. allocate disk space in kupdated (for write1)

Disk space is allocated for the delayed allocation blocks before the data is
flushed to disk [linvfs_writepage, page_state_convert].

    file image
    offset=0  eof
    +----+----+----+----+----+- ... +----+
    |    |    |    |    |    |      |    |
    +----+----+----+----+----+- ... +----+
     4KB  4KB
    +---------+ write data (write1)
    +------------------------------------+ allocated disk space
    +---------+ called disk space1
              +--------------------------+ called disk space2

3. close in nfsd (for write1)

The nfsd calls close [linvfs_release].
At this time, the allocated disk space beyond the end of the file
(disk space2) is truncated when the refcache is disabled
[xfs_inactive_free_eofblocks].

    file image
    offset=0  eof
    +----+----+
    |    |    |
    +----+----+
     4KB  4KB
    +---------+ write data (write1)
    +---------+ disk space1

4. open and write in nfsd (for write2)

Next, the nfs client writes another 8KB of data to the file (called write2).
The nfsd calls open [linvfs_open], and then calls write [linvfs_write].

    file image
    offset=0            eof
    +----+----+----+----+----+- ... +----+
    |    |    |    |    |    |      |    |
    +----+----+----+----+----+- ... +----+
     4KB  4KB  4KB  4KB
    +---------+ write data (write1)
              +---------+ write data (write2)
              +--------------------------+ delayed allocation blocks
    +---------+ disk space1

5. flush data to disk in kupdated (for write1)

The write data (write1) is flushed to disk space1 [page_state_convert].
And the write data (write2) is flushed to disk space2 [cluster_write] !!!,
because the buffer status of the write data (write2) is dirty and delay.
But disk space2 does not exist at this time. Disk space2 may now be used by
another file, or it may be free space.

I think that one solution for the issue is to flush only the buffers within
the end of the file when disk space is allocated for delayed allocation
blocks, and not to flush the buffers beyond it.

I made a patch for xfs1.3.1.
I am running the test on a kernel with the patch applied; it has been running
for over 16 hours with no corruption.

Could you please comment on the attached patch?

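The idea of the proposed fix can be sketched in plain user-space C as below.
This is only an illustration under stated assumptions, not the kernel code:
the helper name last_flush_page, its parameters and the 4KB page size are
all hypothetical and chosen for this example. It limits the last page that a
clustered flush may write to the page holding the end of the file, whenever
the mapping was just converted from delayed allocation.

```c
#include <assert.h>

/* Assumption: 4KB pages, matching the 4KB units used in the scenario. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Hypothetical helper (not a kernel function): return the index of the
 * last page that a clustered flush is allowed to write.
 *
 *   map_offset, map_size : byte range covered by the block mapping
 *   file_size            : current end of file in bytes
 *   delalloc             : nonzero if the mapping was just allocated
 *                          for delayed allocation blocks
 *
 * Both map_size and file_size must be greater than zero.
 */
static unsigned long
last_flush_page(unsigned long long map_offset, unsigned long long map_size,
                unsigned long long file_size, int delalloc)
{
    unsigned long last_in_map, last_in_file;

    assert(map_size > 0 && file_size > 0);

    /* Last page covered by the mapping. */
    last_in_map = (unsigned long)((map_offset + map_size - 1) >> SKETCH_PAGE_SHIFT);
    /* Last page that holds file data. */
    last_in_file = (unsigned long)((file_size - 1) >> SKETCH_PAGE_SHIFT);

    /* For a fresh delayed-allocation mapping, never go past end of file. */
    if (delalloc && last_in_map > last_in_file)
        return last_in_file;
    return last_in_map;
}
```

For example, with an 8KB file and a mapping assumed to cover 64KB from
offset 0, last_flush_page(0, 65536, 8192, 1) returns 1, so only pages 0 and 1
would be flushed. With delalloc set to 0 the same call returns 15, which is
the unclamped case.
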
Regards,
Tsuda

In message "data corruption on nfs+xfs" (04/05/27 15:58:48), kazuyuki@sys1.cpg.sony.co.jp wrote...
>We are experiencing the same problem as No.198.
> http://oss.sgi.com/bugzilla/show_bug.cgi?id=198
> http://marc.theaimsgroup.com/?t=108343605300001&r=1&w=2
>
>We have confirmed that even when the refcache is disabled, setting
>fs.xfs.refcache_size to zero through sysctl, the problem does not disappear.
>To run linux as single CPU mode, it makes the problem slightly hard to occur,
>but it still occurs.
>
>Two types of corruption we've seen:
>
> 1) Width is a multiple of 8kB, starting at 8kB boundary.
> *Maybe the same trouble as No.198.
>
> 2) Width is a 964 bytes, ending up to 4kB boundary.
> *I'm not sure the cause is same as 1) above.
>
>We have tested on 2.4.20-20.9.XFS1.3.1, 2.4.20-30.9.sgi1 XFS1.3.3 and other kernels
>based on 2.4.20-20 on which we made some changes.
>
>Anyone who knows where is the cause. On page cache, disk block handling, or other parts?
>Or who knows how to avoid this with some setting or another version?
> ----------------------0793017953057401 Content-Type: application/octet-stream; name="xfs1.3.1-delalloc.patch" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="xfs1.3.1-delalloc.patch" LS0tIGxpbnV4LTIuNC4yMS14ZnMxLjMuMS9mcy94ZnMvbGludXgveGZzX2Fv cHMuYy5vcmlnaW5hbAkyMDA0LTA2LTA3IDE5OjE3OjA2LjAwMDAwMDAwMCAr MDkwMAorKysgbGludXgtMi40LjIxLXhmczEuMy4xL2ZzL3hmcy9saW51eC94 ZnNfYW9wcy5jCTIwMDQtMDYtMDkgMTU6NTA6MjQuMDAwMDAwMDAwICswOTAw CkBAIC01NzYsMTMgKzU3NiwxMiBAQCBjbHVzdGVyX3dyaXRlKAogCXVuc2ln bmVkIGxvbmcJCXRpbmRleCwKIAlwYWdlX2J1Zl9ibWFwX3QJCSptcCwKIAlp bnQJCQlzdGFydGlvLAotCWludAkJCWFsbF9iaCkKKwlpbnQJCQlhbGxfYmgs CisJdW5zaWduZWQgbG9uZwkJdGxhc3QpCiB7Ci0JdW5zaWduZWQgbG9uZwkJ dGxhc3Q7CiAJc3RydWN0IHBhZ2UJCSpwYWdlOwogCi0JdGxhc3QgPSAobXAt PnBibV9vZmZzZXQgKyBtcC0+cGJtX2JzaXplKSA+PiBQQUdFX0NBQ0hFX1NI SUZUOwotCWZvciAoOyB0aW5kZXggPCB0bGFzdDsgdGluZGV4KyspIHsKKwlm b3IgKDsgdGluZGV4IDw9IHRsYXN0OyB0aW5kZXgrKykgewogCQlwYWdlID0g cHJvYmVfZGVsYWxsb2NfcGFnZShpbm9kZSwgdGluZGV4KTsKIAkJaWYgKCFw YWdlKQogCQkJYnJlYWs7CkBAIC02MTgsMTUgKzYxNywxNyBAQCBwYWdlX3N0 YXRlX2NvbnZlcnQoCiB7CiAJc3RydWN0IGJ1ZmZlcl9oZWFkCSpiaF9hcnJb TUFYX0JVRl9QRVJfUEFHRV0sICpiaCwgKmhlYWQ7CiAJcGFnZV9idWZfYm1h cF90CQkqbXAsIG1hcDsKLQl1bnNpZ25lZCBsb25nCQlwX29mZnNldCA9IDAs IGVuZF9pbmRleDsKKwl1bnNpZ25lZCBsb25nCQlwX29mZnNldCA9IDAsIGVu ZF9pbmRleCwgbGFzdF9pbmRleCwgdGxhc3Q7CiAJbG9mZl90CQkJb2Zmc2V0 LCBlbmRfb2Zmc2V0OwogCWludAkJCWxlbiwgZXJyLCBpLCBjbnQgPSAwOwog CWludAkJCWZsYWdzID0gc3RhcnRpbyA/IDAgOiBCTUFQX1RSWUxPQ0s7CiAJ aW50CQkJcGFnZV9kaXJ0eSA9IDE7CisJaW50CQkJZGVsYWxsb2MgPSAwOwog CiAKIAkvKiBBcmUgd2Ugb2ZmIHRoZSBlbmQgb2YgdGhlIGZpbGUgPyAqLwog CWVuZF9pbmRleCA9IGlub2RlLT5pX3NpemUgPj4gUEFHRV9DQUNIRV9TSElG VDsKKwlsYXN0X2luZGV4ID0gKGlub2RlLT5pX3NpemUgLSAxKSA+PiBQQUdF X0NBQ0hFX1NISUZUOwogCWlmIChwYWdlLT5pbmRleCA+PSBlbmRfaW5kZXgp IHsKIAkJdW5zaWduZWQgcmVtYWluaW5nID0gaW5vZGUtPmlfc2l6ZSAmIChQ QUdFX0NBQ0hFX1NJWkUtMSk7CiAJCWlmICgocGFnZS0+aW5kZXggPj0gZW5k X2luZGV4KzEpIHx8ICFyZW1haW5pbmcpIHsKQEAgLTY5MCw2ICs2OTEsNyBA 
QCBwYWdlX3N0YXRlX2NvbnZlcnQoCiAJCSAqLwogCQl9IGVsc2UgaWYgKGJ1 ZmZlcl9kZWxheShiaCkpIHsKIAkJCWlmICghbXApIHsKKwkJCQlkZWxhbGxv YyA9IDE7CiAJCQkJZXJyID0gbWFwX2Jsb2Nrcyhpbm9kZSwgb2Zmc2V0LCBs ZW4sICZtYXAsCiAJCQkJCUJNQVBfQUxMT0NBVEUgfCBmbGFncyk7CiAJCQkJ aWYgKGVycikgewpAQCAtNzYyLDggKzc2NCwxMyBAQCBuZXh0X2JoOgogCWlm IChzdGFydGlvKQogCQlzdWJtaXRfcGFnZShwYWdlLCBiaF9hcnIsIGNudCk7 CiAKLQlpZiAobXApCi0JCWNsdXN0ZXJfd3JpdGUoaW5vZGUsIHBhZ2UtPmlu ZGV4ICsgMSwgbXAsIHN0YXJ0aW8sIHVubWFwcGVkKTsKKwlpZiAobXApIHsK KwkJdGxhc3QgPSAobXAtPnBibV9vZmZzZXQgKyBtcC0+cGJtX2JzaXplIC0g MSkgPj4gUEFHRV9DQUNIRV9TSElGVDsKKwkJaWYgKGRlbGFsbG9jICYmICh0 bGFzdCA+IGxhc3RfaW5kZXgpKSB7CisJCQl0bGFzdCA9IGxhc3RfaW5kZXg7 CisJCX0KKwkJY2x1c3Rlcl93cml0ZShpbm9kZSwgcGFnZS0+aW5kZXggKyAx LCBtcCwgc3RhcnRpbywgdW5tYXBwZWQsIHRsYXN0KTsKKwl9CiAKIAlyZXR1 cm4gcGFnZV9kaXJ0eTsKIAo= ----------------------0793017953057401--