Message-Id: <200406110826.AA00205@tnesb9665.tnes.nec.co.jp>
From: Masanori TSUDA
Date: Fri, 11 Jun 2004 17:26:46 +0900
To: Russell Cattelan
Cc: linux-xfs@oss.sgi.com
Subject: Re: data corruption on nfs+xfs
In-Reply-To: <1086914331.1160.63.camel@rose.americas.sgi.com>
References: <1086914331.1160.63.camel@rose.americas.sgi.com>
MIME-Version: 1.0
X-Mailer: AL-Mail32 Version 1.13
Content-Type: multipart/mixed; boundary="--------------------1280059699777016"
X-original-sender: tsuda@tnes.nec.co.jp
X-list: linux-xfs

This is a multipart message.

----------------------1280059699777016
Content-Type: text/plain; charset=iso-2022-jp

In message "Re: data corruption on nfs+xfs"
    (04/06/11 09:38:53), cattelan@xfs.org wrote...
>This looks really promising.
>I'm currently reading through the code again to
>see what kind of implications this might have.
>I'm worried that your patch might increase file fragmentation,
>but that is just at first glance. I'll look some more and run
>some testing with and without your patch.

Thank you.

>I'm looking at xfs_inactive_free_eofblocks again, I think
>there may be an issue with the xfs_inode di_size and the linux
>inode i_size.

I think so too. I believe the following problem can occur, although I
have not reproduced it. It is separate from the previous problem and
comes from the inode i_size and the xfs_inode di_size being updated at
different times. The steps below are in chronological order.

1. write 8KB
   The TP writes 8KB of data to the file.
   First, the 1st 4KB of data is processed [do_generic_file_write].

   file image
                         offset=0
                         +----+
                         |    |
                         +----+
     inode i_size      : ----->       (4KB)
     xfs_inode di_size :              (0)

   The inode i_size is 4KB but the xfs_inode di_size is still zero,
   because a_ops->commit_write updates only i_size.

   Next, the 2nd 4KB of data is processed [do_generic_file_write].

   file image
                         offset=0
                         +----+----+
                         |    |    |
                         +----+----+
     inode i_size      : ---------->  (8KB)
     xfs_inode di_size :              (0)

   The inode i_size is 8KB, but the xfs_inode di_size remains zero.

2. revalidate
   A revalidate runs (e.g. triggered by ls) [linvfs_revalidate]. At this
   point the inode i_size is set to the same value as the xfs_inode
   di_size [vn_revalidate]. As a result, i_size becomes zero!

   file image
                         offset=0
                         +----+----+
                         |    |    |
                         +----+----+
     inode i_size      :              (0)
     xfs_inode di_size :              (0)

3. flush data
   A flush is triggered by memory pressure [balance_dirty].
   The flush tries to write out the buffer for the 1st page of the 8KB
   write, but it fails with EIO because the index of the 1st page is
   beyond the inode i_size [xfs_page_state_convert]. On 2.4.26 the
   buffer is then discarded.

4. write 8KB (continued)
   The inode i_size and the xfs_inode di_size are both updated to 8KB at
   the end of write processing [xfs_write], but the written data is lost.

   file image
                         offset=0
                         +----+----+
                         |    |    |
                         +----+----+
     inode i_size      : ---------->  (8KB)
     xfs_inode di_size : ---------->  (8KB)
                         +----+
                         no data

I think this issue occurs only rarely. One possible fix is to update
the inode i_size and the xfs_inode di_size together in
a_ops->commit_write.

>BTW what tracing did you use to find this?

After some trial and error, I ended up investigating with the tracing
in the attached patch.

Regards,
Tsuda

>On Wed, 2004-06-09 at 20:30, Masanori TSUDA wrote:
>> Hi,
>>
>> I have reproduced a similar problem on xfs1.3.1 (based on 2.4.21).
>> My environment is as follows.
>>
>> nfs server :
>>   OS               : RedHat9 + xfs1.3.1 (based on 2.4.21)
>>   CPU              : Xeon(2.4GHz) x 2
>>   MEM              : 1GB
>>   NIC              : Intel PRO/1000
>>   Local Filesystem : XFS, the refcache is disabled.
>>
>> nfs client :
>>   OS                : RedHat9 (based on 2.4.20-8)
>>   NIC               : Intel PRO/1000
>>   NFS Ver.          : 3
>>   NFS Mount Options : udp,hard,intr,wsize=8192
>>
>> Within 1 hour of running the test, the corruption was detected.
>> (To make the corruption easy to detect, I umount nfs, umount xfs,
>> mount xfs and mount nfs before comparing data, i.e. purge the memory cache.)
>> The corruption width was a multiple of 4KB, starting at a 4KB boundary.
>> In many cases, it occurred at the start of a physical extent.
>>
>> I have investigated the issue using a local trace embedded in the kernel.
>> I think that the issue is caused by the delayed allocation mechanism.
>> Below is an example of the corruption scenario as I understand it.
>> The steps of the scenario are in chronological order.
>>
>> 1. open and write in nfsd (for write1)
>>    The nfs client writes 8KB of data to the file (called write1).
>>    The write request is processed in nfsd. The nfsd calls open [linvfs_open],
>>    and calls write [linvfs_write]. After the write, the file has several
>>    delayed allocation blocks beyond the end of the file, because of
>>    allocation in chunks and alignment to writeiosize.
>>
>>    file image
>>    offset=0  eof
>>    +----+----+----+----+----+- ... +----+
>>    |    |    |    |    |    |      |    |
>>    +----+----+----+----+----+- ... +----+
>>     4KB  4KB
>>    +---------+
>>     write data (write1)
>>    +------------------------------------+
>>     delayed allocation blocks
>>
>> 2. allocate disk space in kupdated (for write1)
>>    Disk space is allocated for the delayed allocation blocks before the
>>    data is flushed to disk [linvfs_writepage, page_state_convert].
>>
>>    file image
>>    offset=0  eof
>>    +----+----+----+----+----+- ... +----+
>>    |    |    |    |    |    |      |    |
>>    +----+----+----+----+----+- ... +----+
>>     4KB  4KB
>>    +---------+
>>     write data (write1)
>>    +------------------------------------+
>>     allocated disk space
>>    +---------+
>>     called disk space1
>>              +--------------------------+
>>               called disk space2
>>
>> 3. close in nfsd (for write1)
>>    The nfsd calls close [linvfs_release]. At this time, the allocated disk
>>    space beyond the end of the file (disk space2) is truncated when the
>>    refcache is disabled [xfs_inactive_free_eofblocks].
>>
>>    file image
>>    offset=0  eof
>>    +----+----+
>>    |    |    |
>>    +----+----+
>>     4KB  4KB
>>    +---------+
>>     write data (write1)
>>    +---------+
>>     disk space1
>>
>> 4. open and write in nfsd (for write2)
>>    The nfs client then writes another 8KB of data to the file (called write2).
>>    The nfsd calls open [linvfs_open], and calls write [linvfs_write].
>>
>>    file image
>>    offset=0            eof
>>    +----+----+----+----+----+- ... +----+
>>    |    |    |    |    |    |      |    |
>>    +----+----+----+----+----+- ... +----+
>>     4KB  4KB  4KB  4KB
>>    +---------+
>>     write data (write1)
>>              +---------+
>>               write data (write2)
>>              +--------------------------+
>>               delayed allocation blocks
>>    +---------+
>>     disk space1
>>
>> 5. flush data to disk in kupdated (for write1)
>>    The write data (write1) is flushed to disk space1 [page_state_convert].
>>    And the write data (write2) is flushed to disk space2 [cluster_write] !!!,
>>    because the buffer state of the write data (write2) is dirty and delay.
>>    But disk space2 does not exist at this time.
>>    Disk space2 may be in use by another file, or may be free space.
>>
>> I think one solution for the issue is to flush only the buffers within
>> the end of the file before allocating disk space for delayed allocation
>> blocks, and not to flush buffers beyond it.
>> I made a patch for xfs1.3.1. I am running the test on a kernel with the
>> patch applied; it has been running for over 16 hours with no corruption.
>>
>> Could you please comment on the attached patch?
>>
>> Regards,
>> Tsuda
>>
>> In message "data corruption on nfs+xfs"
>>     (04/05/27 15:58:48), kazuyuki@sys1.cpg.sony.co.jp wrote...
>> >We are experiencing the same problem as No.198.
>> > http://oss.sgi.com/bugzilla/show_bug.cgi?id=198
>> > http://marc.theaimsgroup.com/?t=108343605300001&r=1&w=2
>> >
>> >We have confirmed that even when the refcache is disabled by setting
>> >fs.xfs.refcache_size to zero through sysctl, the problem does not disappear.
>> >Running linux in single-CPU mode makes the problem slightly harder to
>> >trigger, but it still occurs.
>> >
>> >Two types of corruption we've seen:
>> >
>> > 1) Width is a multiple of 8kB, starting at an 8kB boundary.
>> >    *Maybe the same trouble as No.198.
>> >
>> > 2) Width is 964 bytes, ending at a 4kB boundary.
>> >    *I'm not sure the cause is the same as 1) above.
>> >
>> >We have tested on 2.4.20-20.9.XFS1.3.1, 2.4.20-30.9.sgi1 XFS1.3.3 and other kernels
>> >based on 2.4.20-20 on which we made some changes.
>> >
>> >Does anyone know where the cause is?
>> >In the page cache, disk block handling, or somewhere else?
>> >Or does anyone know how to avoid this with some setting or another version?
>> >

----------------------1280059699777016
Content-Type: application/octet-stream; name="trace.patch"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="trace.patch"

ZGlmZiAtcnVOIGxpbnV4LTIuNC4yMS14ZnMxLjMuMS9mcy94ZnMvbGludXgv eGZzX2FvcHMuYyBsaW51eC0yLjQuMjEteGZzMS4zLjEtbW9kL2ZzL3hmcy9s aW51eC94ZnNfYW9wcy5jCi0tLSBsaW51eC0yLjQuMjEteGZzMS4zLjEvZnMv eGZzL2xpbnV4L3hmc19hb3BzLmMJMjAwNC0wNi0wNyAxMDozMjozMS4wMDAw MDAwMDAgKzA5MDAKKysrIGxpbnV4LTIuNC4yMS14ZnMxLjMuMS1tb2QvZnMv eGZzL2xpbnV4L3hmc19hb3BzLmMJMjAwNC0wNi0xMSAxNjo1Mzo1My4wMDAw MDAwMDAgKzA5MDAKQEAgLTUzLDYgKzUzLDE1IEBACiAjaW5jbHVkZSAieGZz X3J3LmgiCiAjaW5jbHVkZSA8bGludXgvaW9idWYuaD4KIAorI2lmIDEKK3Nw aW5sb2NrX3QgdHN1ZGFfd3BfbG9jayA9IFNQSU5fTE9DS19VTkxPQ0tFRDsK K2ludCB0c3VkYV93cF9zdyA9IDA7Cit1bnNpZ25lZCBsb25nIHRzdWRhX3dw X2lubyA9IDA7Cit1bnNpZ25lZCBsb25nIHRzdWRhX3dwX3NfaW5kZXggPSAw OwordW5zaWduZWQgbG9uZyB0c3VkYV93cF9lX2luZGV4ID0gMDsKK3Vuc2ln bmVkIGxvbmcgdHN1ZGFfd3BfdF9pbmRleCA9IDA7CisjZW5kaWYKKwogU1RB VElDIHZvaWQgY29udmVydF9wYWdlKHN0cnVjdCBpbm9kZSAqLCBzdHJ1Y3Qg cGFnZSAqLAogCQkJcGFnZV9idWZfYm1hcF90ICosIHZvaWQgKiwgaW50LCBp bnQpOwogCkBAIC01ODAsMTIgKzU4OSwzMCBAQAogewogCXVuc2lnbmVkIGxv bmcJCXRsYXN0OwogCXN0cnVjdCBwYWdlCQkqcGFnZTsKKyNpZiAxCisgICAg ICAgIGludCB0c3VkYV93b3JrX3N3ID0gMDsKKyNlbmRpZgogCiAJdGxhc3Qg PSAobXAtPnBibV9vZmZzZXQgKyBtcC0+cGJtX2JzaXplKSA+PiBQQUdFX0NB Q0hFX1NISUZUOwogCWZvciAoOyB0aW5kZXggPCB0bGFzdDsgdGluZGV4Kysp IHsKIAkJcGFnZSA9IHByb2JlX2RlbGFsbG9jX3BhZ2UoaW5vZGUsIHRpbmRl eCk7CiAJCWlmICghcGFnZSkKIAkJCWJyZWFrOworI2lmIDEKKwkJc3Bpbl9s b2NrKCZ0c3VkYV93cF9sb2NrKTsKKwkJaWYgKHRzdWRhX3dwX3N3ID09IDIp IHsKKwkJCWlmICgoaW5vZGUtPmlfaW5vID09IHRzdWRhX3dwX2lubykgJiYg KHBhZ2UtPmluZGV4ID49IHRzdWRhX3dwX3RfaW5kZXgpKSB7CisJCQkJaWYg KHRzdWRhX3dvcmtfc3cgPT0gMCkgeworCQkJCQl0c3VkYV93b3JrX3N3Kys7 CisJCQkJCXByaW50aygiIyMjIGN3IDE6IGlubz0weCVseCAgYm49MHglbGx4
IGJzaXplPTB4JXhcbiIsCisJCQkJCQlpbm9kZS0+aV9pbm8sbXAtPnBibV9i bixtcC0+cGJtX2JzaXplPj45KTsKKwkJCQl9CisJCQkJcHJpbnRrKCIjIyMg Y3cgMjogaW5vPTB4JWx4IGlkeD0weCVseCBzLWlkeD0weCVseCBlLWlkeD0w eCVseCB0LWlkeD0weCVseFxuIiwKKwkJCQkJaW5vZGUtPmlfaW5vLHBhZ2Ut PmluZGV4LHRzdWRhX3dwX3NfaW5kZXgsdHN1ZGFfd3BfZV9pbmRleCx0c3Vk YV93cF90X2luZGV4KTsKKwkJCX0KKwkJfQorCQlzcGluX3VubG9jaygmdHN1 ZGFfd3BfbG9jayk7CisjZW5kaWYKIAkJY29udmVydF9wYWdlKGlub2RlLCBw YWdlLCBtcCwgTlVMTCwgc3RhcnRpbywgYWxsX2JoKTsKIAl9CiB9CkBAIC02 MjMsNiArNjUwLDkgQEAKIAlpbnQJCQlsZW4sIGVyciwgaSwgY250ID0gMDsK IAlpbnQJCQlmbGFncyA9IHN0YXJ0aW8gPyAwIDogQk1BUF9UUllMT0NLOwog CWludAkJCXBhZ2VfZGlydHkgPSAxOworI2lmIDEKKyAgICAgICAgaW50IHRz dWRhX3dvcmtfc3cgPSAwOworI2VuZGlmCiAKIAogCS8qIEFyZSB3ZSBvZmYg dGhlIGVuZCBvZiB0aGUgZmlsZSA/ICovCkBAIC03NTksMTIgKzc4OSwzNyBA QAogCQliaCA9IGJoLT5iX3RoaXNfcGFnZTsKIAl9IHdoaWxlIChvZmZzZXQg PCBlbmRfb2Zmc2V0KTsKIAorI2lmIDEKKwlpZiAobXApIHsKKwkJc3Bpbl9s b2NrKCZ0c3VkYV93cF9sb2NrKTsKKwkJaWYgKHRzdWRhX3dwX3N3ID09IDAp IHsKKwkJCXRzdWRhX3dvcmtfc3cgPSAxOworCQkJdHN1ZGFfd3Bfc3cgPSAx OworCQkJdHN1ZGFfd3BfaW5vID0gaW5vZGUtPmlfaW5vOworCQkJdHN1ZGFf d3Bfc19pbmRleCA9IHBhZ2UtPmluZGV4ICsgMTsKKwkJCXRzdWRhX3dwX2Vf aW5kZXggPSAoKG1wLT5wYm1fb2Zmc2V0ICsgbXAtPnBibV9ic2l6ZSkgPj4g UEFHRV9DQUNIRV9TSElGVCkgLSAxOworCQkJaWYgKHRzdWRhX3dwX3NfaW5k ZXggPiB0c3VkYV93cF9lX2luZGV4KSB7CisJCQkJdHN1ZGFfd3Bfc3cgPSAw OworCQkJfQorCQl9CisJCXNwaW5fdW5sb2NrKCZ0c3VkYV93cF9sb2NrKTsK Kwl9CisjZW5kaWYKIAlpZiAoc3RhcnRpbykKIAkJc3VibWl0X3BhZ2UocGFn ZSwgYmhfYXJyLCBjbnQpOwogCiAJaWYgKG1wKQogCQljbHVzdGVyX3dyaXRl KGlub2RlLCBwYWdlLT5pbmRleCArIDEsIG1wLCBzdGFydGlvLCB1bm1hcHBl ZCk7CiAKKyNpZiAxCisJaWYgKG1wKSB7CisJCXNwaW5fbG9jaygmdHN1ZGFf d3BfbG9jayk7CisJCWlmICh0c3VkYV93b3JrX3N3KSB7CisJCQl0c3VkYV93 cF9zdyA9IDA7CisJCX0KKwkJc3Bpbl91bmxvY2soJnRzdWRhX3dwX2xvY2sp OworCX0KKyNlbmRpZgogCXJldHVybiBwYWdlX2RpcnR5OwogCiBlcnJvcjoK ZGlmZiAtcnVOIGxpbnV4LTIuNC4yMS14ZnMxLjMuMS9mcy94ZnMveGZzX3Zu b2Rlb3BzLmMgbGludXgtMi40LjIxLXhmczEuMy4xLW1vZC9mcy94ZnMveGZz 
X3Zub2Rlb3BzLmMKLS0tIGxpbnV4LTIuNC4yMS14ZnMxLjMuMS9mcy94ZnMv eGZzX3Zub2Rlb3BzLmMJMjAwNC0wNi0wNyAxMDozMjozMi4wMDAwMDAwMDAg KzA5MDAKKysrIGxpbnV4LTIuNC4yMS14ZnMxLjMuMS1tb2QvZnMveGZzL3hm c192bm9kZW9wcy5jCTIwMDQtMDYtMDkgMTA6NTg6MjUuMDAwMDAwMDAwICsw OTAwCkBAIC03MCw2ICs3MCwxNCBAQAogI2luY2x1ZGUgInhmc19tYWMuaCIK ICNpbmNsdWRlICJ4ZnNfbG9nX3ByaXYuaCIKIAorI2lmIDEKK2V4dGVybiBz cGlubG9ja190IHRzdWRhX3dwX2xvY2s7CitleHRlcm4gaW50IHRzdWRhX3dw X3N3OworZXh0ZXJuIHVuc2lnbmVkIGxvbmcgdHN1ZGFfd3BfaW5vOworZXh0 ZXJuIHVuc2lnbmVkIGxvbmcgdHN1ZGFfd3Bfc19pbmRleDsKK2V4dGVybiB1 bnNpZ25lZCBsb25nIHRzdWRhX3dwX2VfaW5kZXg7CitleHRlcm4gdW5zaWdu ZWQgbG9uZyB0c3VkYV93cF90X2luZGV4OworI2VuZGlmCiAKIC8qCiAgKiBU aGUgbWF4aW11bSBwYXRobGVuIGlzIDEwMjQgYnl0ZXMuIFNpbmNlIHRoZSBt aW5pbXVtIGZpbGUgc3lzdGVtCkBAIC0xMzIyLDYgKzEzMzAsMjAgQEAKIAkJ ICogZG8gdGhhdCB3aXRoaW4gYSB0cmFuc2FjdGlvbi4KIAkJICovCiAJCXhm c19pbG9jayhpcCwgWEZTX0lPTE9DS19FWENMKTsKKyNpZiAxCisJCXNwaW5f bG9jaygmdHN1ZGFfd3BfbG9jayk7CisJCWlmICh0c3VkYV93cF9zdyA9PSAx KSB7CisJCQlpZiAoaXAtPmlfaW5vID09IHRzdWRhX3dwX2lubykgeworCQkJ CXVuc2lnbmVkIGxvbmcgd29ya19pbmRleCA9ICgoaXAtPmlfZC5kaV9zaXpl IC0gMUxMKSA+PiBQQUdFX0NBQ0hFX1NISUZUKSArIDE7CisJCQkJaWYgKHdv cmtfaW5kZXggPj0gdHN1ZGFfd3Bfc19pbmRleCAmJgorCQkJCSAgICB3b3Jr X2luZGV4IDw9IHRzdWRhX3dwX2VfaW5kZXgpIHsKKwkJCQkJdHN1ZGFfd3Bf c3cgPSAyOworCQkJCQl0c3VkYV93cF90X2luZGV4ID0gd29ya19pbmRleDsK KwkJCQl9CisJCQl9CisJCX0KKwkJc3Bpbl91bmxvY2soJnRzdWRhX3dwX2xv Y2spOworI2VuZGlmCiAJCXhmc19pdHJ1bmNhdGVfc3RhcnQoaXAsIFhGU19J VFJVTkNfREVGSU5JVEUsCiAJCQkJICAgIGlwLT5pX2QuZGlfc2l6ZSk7CiAK ----------------------1280059699777016--