From BATV+e431fca72d399363b476+2442+infradead.org+hch@bombadil.srs.infradead.org Sat May 1 07:56:36 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.8 required=5.0 tests=BAYES_00,J_CHICKENPOX_64 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o41CuY2a252632 for ; Sat, 1 May 2010 07:56:36 -0500 X-ASG-Debug-ID: 1272718720-2e0900540000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 2E41F304F29 for ; Sat, 1 May 2010 05:58:40 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id lNveYqkHpcHorCgE for ; Sat, 01 May 2010 05:58:40 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1O8CHH-0007Bn-Aq; Sat, 01 May 2010 12:58:39 +0000 Date: Sat, 1 May 2010 08:58:39 -0400 From: Christoph Hellwig To: Dave Chinner Cc: Christoph Hellwig , xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH 4/5] [PATCH] xfs: simplify buffer to transaction matching Subject: Re: [PATCH 4/5] [PATCH] xfs: simplify buffer to transaction matching Message-ID: <20100501125839.GA26342@infradead.org> References: <20100418001041.865247520@bombadil.infradead.org> <20100418001058.677429475@bombadil.infradead.org> <20100420064155.GH15130@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100420064155.GH15130@dastard> User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1272718721 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com 
X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean

On Tue, Apr 20, 2010 at 04:41:55PM +1000, Dave Chinner wrote:
> Good start, but I think that it should use xfs_trans_first_item()
> and xfs_trans_next_item() rather than walking the descriptor
> table directly.

I tried implementing it, but it doesn't work.  We can call the buffer
matching routines on transactions that don't have any item linked to
it, which will cause xfs_trans_first_item to panic.  Compare this code
in xfs_trans_buf_item_match:

	for (licp = &tp->t_items; licp != NULL; licp = licp->lic_next) {
		if (xfs_lic_are_all_free(licp)) {
			ASSERT(licp == &tp->t_items);
			ASSERT(licp->lic_next == NULL);
			return NULL;
		}

		...
	}

to this in xfs_trans_first_item:

	licp = &tp->t_items;
	/*
	 * If it's not in the first chunk, skip to the second.
	 */
	if (xfs_lic_are_all_free(licp)) {
		licp = licp->lic_next;
	}

	/*
	 * Return the first non-free descriptor in the chunk.
	 */
	ASSERT(!xfs_lic_are_all_free(licp));

From xfs@tlinx.org Sat May 1 08:37:08 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o41Db8Zx255665 for ; Sat, 1 May 2010 08:37:08 -0500 X-ASG-Debug-ID: 1272721139-30d401ce0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from Ishtar.sc.tlinx.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 0D859304E20 for ; Sat, 1 May 2010 06:39:09 -0700 (PDT) Received: from Ishtar.sc.tlinx.org (173-164-175-65-SFBA.hfc.comcastbusiness.net [173.164.175.65]) by cuda.sgi.com with ESMTP id CWrAAYtC3999NOw6 for ; Sat, 01 May 2010 06:39:09 -0700 (PDT) Received: from [192.168.3.12] (Athenae [192.168.3.12]) by Ishtar.sc.tlinx.org (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o41Dcqi7014095; 
Sat, 1 May 2010 06:38:55 -0700 Message-ID: <4BDC2EEC.9080808@tlinx.org> Date: Sat, 01 May 2010 06:38:52 -0700 From: Linda Walsh User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.8.1.24) Gecko/20100228 Lightning/0.9 Thunderbird/2.0.0.24 Mnenhy/0.7.6.666 MIME-Version: 1.0 To: Peter Shuere CC: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: Building XFSDump but missing uuid development package Subject: Re: Building XFSDump but missing uuid development package References: <53964.41414.qm@web180401.mail.gq1.yahoo.com> In-Reply-To: <53964.41414.qm@web180401.mail.gq1.yahoo.com> X-Stationery: 0.5.1 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Barracuda-Connect: 173-164-175-65-SFBA.hfc.comcastbusiness.net[173.164.175.65] X-Barracuda-Start-Time: 1272721155 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.92 X-Barracuda-Spam-Status: No, SCORE=-1.92 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=RDNS_DYNAMIC X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.28792 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.10 RDNS_DYNAMIC Delivered to trusted network by host with dynamic-looking rDNS X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean

Peter Shuere wrote:
> Hi,
>
> I am trying to build the XFS utility, but during the build-process, I ran into a problem saying that I need the UUID development package to make the build complete. I am using OpenSUSE 11.2, where should I get the copy of the uuid-devel source (or binary, whichever one that works).

(On SuSE 11.2:)
> zypper se uuid
Loading repository data...
Reading installed packages...

S | Name                | Summary                                   | Type
--+---------------------+-------------------------------------------+-----------
i | libuuid-devel       | Development files for libuuid1            | package
  | libuuid-devel-32bit | Development files for libuuid1            | package
i | libuuid1            | Library to generate UUIDs                 | package
i | libuuid1-32bit      | Library to generate UUIDs                 | package
  | perl-Data-UUID      | Perl extension for generating Globally/-> | package
  | perl-Data-UUID      | Perl extension for generating Globally/-> | srcpackage
  | uuidd               | Utilities for the Second Extended File -> | package
---
Does libuuid-devel not work for you?

From tytso@thunk.org Sat May 1 14:45:24 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: * X-Spam-Status: No, score=1.1 required=5.0 tests=BAYES_00,SUBJ_ATTENTION autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o41JjOC2019250 for ; Sat, 1 May 2010 14:45:24 -0500 X-ASG-Debug-ID: 1272743250-4d1703230000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from thunker.thunk.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id E07601DE2210 for ; Sat, 1 May 2010 12:47:30 -0700 (PDT) Received: from thunker.thunk.org (thunk.org [69.25.196.29]) by cuda.sgi.com with ESMTP id vdVaK8oPWRT2YIM7 for ; Sat, 01 May 2010 12:47:30 -0700 (PDT) Received: from root (helo=closure.thunk.org) by thunker.thunk.org with local-esmtp (Exim 4.50 #1 (Debian)) id 1O8Ied-0004PA-Ah; Sat, 01 May 2010 15:47:11 -0400 Received: from tytso by closure.thunk.org with local (Exim 4.69) (envelope-from ) id 1O8Iec-0001BE-JW; Sat, 01 May 2010 15:47:10 -0400 Date: Sat, 1 May 2010 15:47:10 -0400 From: tytso@mit.edu To: Andrew Morton Cc: "Aneesh Kumar K. 
V" , Dave Chinner , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH 3/4] writeback: pay attention to wbc->nr_to_write in write_cache_pages Subject: Re: [PATCH 3/4] writeback: pay attention to wbc->nr_to_write in write_cache_pages Message-ID: <20100501194710.GV14986@thunk.org> Mail-Followup-To: tytso@mit.edu, Andrew Morton , "Aneesh Kumar K. V" , Dave Chinner , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, xfs@oss.sgi.com References: <1271731314-5893-1-git-send-email-david@fromorbit.com> <1271731314-5893-4-git-send-email-david@fromorbit.com> <20100429143931.331c2bab.akpm@linux-foundation.org> <87sk6dwka6.fsf@linux.vnet.ibm.com> <20100430124329.10a4c02b.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100430124329.10a4c02b.akpm@linux-foundation.org> User-Agent: Mutt/1.5.20 (2009-06-14) X-SA-Exim-Connect-IP: X-SA-Exim-Mail-From: tytso@thunk.org X-SA-Exim-Scanned: No (on thunker.thunk.org); SAEximRunCond expanded to false X-Barracuda-Connect: thunk.org[69.25.196.29] X-Barracuda-Start-Time: 1272743250 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=NO_REAL_NAME X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.28814 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.00 NO_REAL_NAME From: does not include a real name X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Fri, Apr 30, 2010 at 12:43:29PM -0700, Andrew Morton wrote: > > Maybe that fs shouldn't be calling write_cache_pages() at all. 
> After all, write_cache_pages() is a wrapper which emits a sequence of calls
> to ->writepage(), and ->writepage() writes a page.

On my todo list is to fix ext4 to not call write_cache_pages() at all.
We are seriously abusing that function ATM, since we're not actually
writing the pages when we call write_cache_pages().  I won't go into
what we're doing, because it's too embarrassing, but suffice it to say
that we end up calling pagevec_lookup() or pagevec_lookup_tag()
*four*, count them *four* times while trying to do writeback.

I have a simple patch that gives ext4 our own copy of
write_cache_pages(), and then simplifies it a lot, and fixes a bunch
of problems, but then I discarded it in favor of fundamentally redoing
how we do writeback at all, but it's going to take a while to get
things completely right.  But I am working to try to fix this.

If it would help, I can resurrect the "fork write_cache_pages() and
simplify" patch, so ext4 isn't dependent on mm/page-writeback.c's
write_cache_pages(), if there is an immediate, short-term need to fix
that function.

- Ted

From SRS0+ChgG+62+fromorbit.com=david@internode.on.net Sun May 2 23:21:40 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.8 required=5.0 tests=BAYES_00,J_CHICKENPOX_64 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o434Le1Y130030 for ; Sun, 2 May 2010 23:21:40 -0500 X-ASG-Debug-ID: 1272860625-6537001c0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 7A2A6306F7E for ; Sun, 2 May 2010 21:23:46 -0700 (PDT) Received: from mail.internode.on.net (bld-mail18.adl2.internode.on.net [150.101.137.103]) by cuda.sgi.com with ESMTP id qKAxlio6K0pGfWwy for ; Sun, 02 May 2010 21:23:46 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 22822888-1927428 for multiple; Mon, 03 May 2010 13:53:44 +0930 (CST) Received: from 
dave by dastard with local (Exim 4.71) (envelope-from ) id 1O8nC3-0000qs-BI; Mon, 03 May 2010 14:23:43 +1000 Date: Mon, 3 May 2010 14:23:43 +1000 From: Dave Chinner To: Christoph Hellwig Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH 4/5] [PATCH] xfs: simplify buffer to transaction matching Subject: Re: [PATCH 4/5] [PATCH] xfs: simplify buffer to transaction matching Message-ID: <20100503042343.GD2591@dastard> References: <20100418001041.865247520@bombadil.infradead.org> <20100418001058.677429475@bombadil.infradead.org> <20100420064155.GH15130@dastard> <20100501125839.GA26342@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100501125839.GA26342@infradead.org> User-Agent: Mutt/1.5.20 (2009-06-14) X-Barracuda-Connect: bld-mail18.adl2.internode.on.net[150.101.137.103] X-Barracuda-Start-Time: 1272860627 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.28934 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean

On Sat, May 01, 2010 at 08:58:39AM -0400, Christoph Hellwig wrote:
> On Tue, Apr 20, 2010 at 04:41:55PM +1000, Dave Chinner wrote:
> > Good start, but I think that it should use xfs_trans_first_item()
> > and xfs_trans_next_item() rather than walking the descriptor
> > table directly.
>
> I tried implementing it, but it doesn't work.  We can call the buffer
> matching routines on transactions that don't have any item linked to
> it, which will cause xfs_trans_first_item to panic.
> Compare this code in xfs_trans_buf_item_match:
>
> 	for (licp = &tp->t_items; licp != NULL; licp = licp->lic_next) {
> 		if (xfs_lic_are_all_free(licp)) {
> 			ASSERT(licp == &tp->t_items);
> 			ASSERT(licp->lic_next == NULL);
> 			return NULL;
> 		}
>
> 		...
> 	}
>
> to this in xfs_trans_first_item:
>
> 	licp = &tp->t_items;
> 	/*
> 	 * If it's not in the first chunk, skip to the second.
> 	 */
> 	if (xfs_lic_are_all_free(licp)) {
> 		licp = licp->lic_next;
> 	}
>
> 	/*
> 	 * Return the first non-free descriptor in the chunk.
> 	 */
> 	ASSERT(!xfs_lic_are_all_free(licp));

Is there any reason why xfs_trans_first_item() should panic if it
doesn't find a log item? I can't think of one that can't be handled
by returning NULL and the caller doing ASSERT(lidp)?

The callers:

	xfs_trans_count_vecs - has assert, triggers shutdown
	xfs_trans_fill_vecs - has assert, handles null return
	xfs_trans_committed - handles null return
	xfs_trans_uncommit - handles null return

So it seems like we could make xfs_trans_first_item() not
assert fail on debug kernels.

The other thing that concerns me is that we have two different
definitions of "empty transactions" in these two functions. It seems
to me that xfs_trans_first_item() is the correct one - if we remove
items from the transaction there is no guarantee that the embedded
chunk in the transaction contains any active descriptors. Hence I
think the code in xfs_trans_buf_item_match[_all]() is actually
buggy...

Cheers,

Dave.
-- Dave Chinner david@fromorbit.com From michael.monnerie@is.it-management.at Mon May 3 01:47:45 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.7 required=5.0 tests=BAYES_00,J_CHICKENPOX_62, J_CHICKENPOX_64 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o436li6G138376 for ; Mon, 3 May 2010 01:47:45 -0500 X-ASG-Debug-ID: 1272869389-464100e40000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mailsrv14.zmi.at (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 2D8CA1B06074 for ; Sun, 2 May 2010 23:49:49 -0700 (PDT) Received: from mailsrv14.zmi.at (mailsrv1.zmi.at [212.69.164.54]) by cuda.sgi.com with ESMTP id M7WuN8CD1DHF54Qy for ; Sun, 02 May 2010 23:49:49 -0700 (PDT) Received: from mailsrv.i.zmi.at (h081217106033.dyn.cm.kabsi.at [81.217.106.33]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client CN "mailsrv2.i.zmi.at", Issuer "power4u.zmi.at" (not verified)) by mailsrv14.zmi.at (Postfix) with ESMTPSA id 57769800187 for ; Mon, 3 May 2010 08:49:48 +0200 (CEST) Received: from saturn.localnet (saturn.i.zmi.at [10.72.27.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mailsrv.i.zmi.at (Postfix) with ESMTPSA id 1A76D83C823 for ; Mon, 3 May 2010 08:49:48 +0200 (CEST) From: Michael Monnerie Organization: it-management http://it-management.at To: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: xfs_fsr question for improvement Subject: Re: xfs_fsr question for improvement Date: Mon, 3 May 2010 08:49:43 +0200 User-Agent: KMail/1.12.4 (Linux/2.6.31.12-0.2-vanilla; KDE/4.3.5; x86_64; ; ) References: <201004161043.11243@zmi.at> <20100417012415.GE2493@dastard> In-Reply-To: <20100417012415.GE2493@dastard> MIME-Version: 1.0 Content-Type: multipart/signed; 
boundary="nextPart15342380.VbRfVO0Erk"; protocol="application/pgp-signature"; micalg=pgp-sha1 Content-Transfer-Encoding: 7bit Message-Id: <201005030849.47591@zmi.at> X-Barracuda-Connect: mailsrv1.zmi.at[212.69.164.54] X-Barracuda-Start-Time: 1272869391 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.28942 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean

--nextPart15342380.VbRfVO0Erk
Content-Type: Text/Plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

On Saturday, 17 April 2010 Dave Chinner wrote:
> They have thousands of extents in them and they are all between
> 8-10GB in size, and IO from my VMs are still capable of saturating
> the disks backing these files. While I'd normally consider these
> files fragmented and candidates for running fsr on them, the number
> of extents is not actually a performance limiting factor and so
> there's no point in defragmenting them. Especially as that requires
> shutting down the VMs...

I personally care less about file fragmentation than about
metadata/inode/directory fragmentation. This server gets accesses from
numerous people,

# time find /mountpoint/ -inum 107901420
/mountpoint/some/dir/ectory/path/x.iso

real    7m50.732s
user    0m0.152s
sys     0m2.376s

It took nearly 8 minutes to search through that mount point, which is
6TB big on a RAID-5 striped over 7 2TB disks, so search speed should be
high.
Especially as there are only 765,000 files on that disk:
Filesystem            Inodes   IUsed      IFree IUse%
/mountpoint       1258291200  765659 1257525541    1%

Wouldn't you say an 8 minute search over just 765,000 files is slow,
even when only using 7x 2TB 7200rpm disks in RAID-5?

> > Would it be possible xfs_fsr defrags the meta data in a way that
> > they are all together so seeks are faster?
>
> It's not related to fsr because fsr does not defragment metadata.
> Some metadata cannot be defragmented (e.g. inodes cannot be moved),
> some metadata cannot be manipulated directly (e.g. free space
> btrees), and some is just difficult to do (e.g. directory
> defragmentation) so hasn't ever been done.

I see. On this particular server I know it would be good for performance
to have the metadata defrag'ed, but that's not the aim of xfs_fsr.
But maybe some developer gets bored one day and finds a way to optimize
the search&find of files on an aged filesystem, i.e. metadata defrag :-)

I tried this two times:
# time find /mountpoint/ -inum 107901420
real    8m17.316s
user    0m0.148s
sys     0m1.964s

# time find /mountpoint/ -inum 107901420
real    0m30.113s
user    0m0.540s
sys     0m9.813s

Caching helps the 2nd time :-)

> > Currently, when I do "find /this_big_fs -inum 1234", it takes
> > *ages* for a run, while there are not so many files on it:
> > # iostat -kx 5 555
> > Device:   r/s   rkB/s avgrq-sz avgqu-sz  await  svctm  %util
> > xvdb    23,20   92,80     8,00     0,42  15,28  18,17  42,16
> > xvdc    20,20   84,00     8,32     0,57  28,40  28,36  57,28
>
> Well, it's not XFS's fault that each read IO is taking 20-30ms. You
> can only do 30-50 IOs a second per drive at that rate, so:
>
> [...]
>
> > So I get 43 reads/second at 100% utilization. Well I can see up to
>
> This is right on the money - it's going as fast as your (slow) RAID-5
> volume will allow it to....
>
> > 150r/s, but still that's no "wow". A single run to find an inode
> > takes a very long time.
>
> Raid 5/6 generally provides the same IOPS performance as a single
> spindle, regardless of the width of the RAID stripe. A 2TB sata
> drive might be able to do 150-200 IOPS, so a RAID5 array made up of
> these drives will tend to max out at roughly the same....

Running xfs_fsr, I can see up to 1200r+1200w=2400 I/Os per second:

Device: rrqm/s wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz  await svctm  %util
xvdc      0,00   0,00    0,00 1191,42     0,00 52320,16    87,83   121,23  96,77  0,71  84,63
xvde      0,00   0,00 1226,35    0,00 52324,15     0,00    85,33     0,77   0,62  0,13  15,33

But on average it's about 600-700 reads plus as many writes per second,
so 1200-1400 IOPS.

Both "disks" are 2TB LVM volumes on the same raidset, I just had to
split it as XEN doesn't allow creating >2TB volumes.

So, the badly slow I/O I see during "find" is not happening during fsr.
How can that be?

I'm just running another "find" on a freshly remounted xfs, and I can see
the reads are happening on 2 of the 3 2TB volumes in parallel:

Device:    r/s   w/s  rkB/s wkB/s avgrq-sz avgqu-sz await svctm  %util
xvdb    103,20  0,00 476,80  0,00     9,24     0,46  4,52  4,50  46,40
xvdc     97,80  0,00 455,20  0,00     9,31     0,52  5,29  5,30  51,84

When I created that XFS, I took two 2TB partitions, did pvcreate,
vgcreate and lvcreate. Could it be that lvcreate automatically thought
it should do a RAID-0? Because all reads are equally split between the
two volumes. After a while, I added the 3rd 2TB volume, and I can't see
that behaviour there. So maybe this is the source of all evil.

BTW: I changed mount options "atime,diratime" to "relatime,reldiratime"
now and "find" runtime went from 8 minutes down to 7m14s.

--
Best regards,
Michael Monnerie, Ing. BSc

it-management Internet Services
http://proteger.at [pronounced: Prot-e-schee]
Tel: 0660 / 415 65 31

// We currently have two houses for sale:
// http://zmi.at/langegg/
// http://zmi.at/haus2009/

--nextPart15342380.VbRfVO0Erk
Content-Type: application/pgp-signature; name=signature.asc
Content-Description: This is a digitally signed message part.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.12 (GNU/Linux)

iEYEABECAAYFAkvecgsACgkQzhSR9xwSCbSnEwCg36nlQTOc1VcH55khyOmjHQqx
SLcAoNuGv6TZjgAx5rxovG5apfwZsdwq
=EGIU
-----END PGP SIGNATURE-----

--nextPart15342380.VbRfVO0Erk--

From michael.monnerie@is.it-management.at Mon May 3 02:39:03 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o437d3RT140627 for ; Mon, 3 May 2010 02:39:03 -0500 X-ASG-Debug-ID: 1272872469-370501ce0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mailsrv14.zmi.at (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 2767C3074EE for ; Mon, 3 May 2010 00:41:09 -0700 (PDT) Received: from mailsrv14.zmi.at (mailsrv1.zmi.at [212.69.164.54]) by cuda.sgi.com with ESMTP id 3PEIhcojGMXEqJsH for ; Mon, 03 May 2010 00:41:09 -0700 (PDT) Received: from mailsrv.i.zmi.at (h081217106033.dyn.cm.kabsi.at [81.217.106.33]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client CN "mailsrv2.i.zmi.at", Issuer "power4u.zmi.at" (not verified)) by mailsrv14.zmi.at (Postfix) with ESMTPSA id 05734800112 for ; Mon, 3 May 2010 09:41:08 +0200 (CEST) Received: from saturn.localnet (saturn.i.zmi.at [10.72.27.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mailsrv.i.zmi.at (Postfix) with ESMTPSA id A6F1883C823 for ; Mon, 3 May 2010 
09:41:07 +0200 (CEST) From: Michael Monnerie Organization: it-management http://it-management.at To: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: xfs_fsr question for improvement Subject: Re: xfs_fsr question for improvement Date: Mon, 3 May 2010 09:41:06 +0200 User-Agent: KMail/1.12.4 (Linux/2.6.31.12-0.2-vanilla; KDE/4.3.5; x86_64; ; ) References: <201004161043.11243@zmi.at> <20100417012415.GE2493@dastard> <201005030849.47591@zmi.at> In-Reply-To: <201005030849.47591@zmi.at> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="nextPart2279591.YjFnkzqt7e"; protocol="application/pgp-signature"; micalg=pgp-sha1 Content-Transfer-Encoding: 7bit Message-Id: <201005030941.07109@zmi.at> X-Barracuda-Connect: mailsrv1.zmi.at[212.69.164.54] X-Barracuda-Start-Time: 1272872470 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.28944 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean

--nextPart2279591.YjFnkzqt7e
Content-Type: Text/Plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

On Monday, 3 May 2010 Michael Monnerie wrote:
> When I created that XFS, I took two 2TB partitions, did pvcreate,
> vgcreate and lvcreate. Could it be that lvcreate automatically
> thought it should do a RAID-0? Because all reads are equally split
> between the two volumes. After a while, I added the 3rd 2TB volume,
> and I can't see that behaviour there. So maybe this is the source of
> all evil.

I found that lvcreate really is too smart:
       -i, --stripes Stripes
              Gives the number of stripes.
              This is equal to the number of physical volumes to
              scatter the logical volume.

So it seems lvcreate did know that the VG was split among 2 "disks", and
therefore used -i2 while I wanted -i1.

> reldiratime

Should be nodiratime, of course.

--
Best regards,
Michael Monnerie, Ing. BSc

it-management Internet Services
http://proteger.at [pronounced: Prot-e-schee]
Tel: 0660 / 415 65 31

// We currently have two houses for sale:
// http://zmi.at/langegg/
// http://zmi.at/haus2009/

--nextPart2279591.YjFnkzqt7e
Content-Type: application/pgp-signature; name=signature.asc
Content-Description: This is a digitally signed message part.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.12 (GNU/Linux)

iEYEABECAAYFAkvefhMACgkQzhSR9xwSCbQKcQCfSDUqQ+GXWWECSu7Y9CYUOTTn
c3cAnRl5Q9y0xbtZ9KaUKQIQc976+mFO
=Wm7W
-----END PGP SIGNATURE-----

--nextPart2279591.YjFnkzqt7e--

From aarora@linux.vnet.ibm.com Mon May 3 03:29:39 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o438TdTk143035 for ; Mon, 3 May 2010 03:29:39 -0500 X-ASG-Debug-ID: 1272875506-532f00340000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from e5.ny.us.ibm.com (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 9FB29307474 for ; Mon, 3 May 2010 01:31:46 -0700 (PDT) Received: from e5.ny.us.ibm.com (e5.ny.us.ibm.com [32.97.182.145]) by cuda.sgi.com with ESMTP id 5JY3DT6atIZPhN1l for ; Mon, 03 May 2010 01:31:46 -0700 (PDT) Received: from d01relay07.pok.ibm.com (d01relay07.pok.ibm.com [9.56.227.147]) by e5.ny.us.ibm.com (8.14.3/8.13.1) with ESMTP id o438G5hu008620 for ; Mon, 3 May 2010 04:16:05 -0400 Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by 
d01relay07.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id o438Vitj2056410 for ; Mon, 3 May 2010 04:31:44 -0400 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.14.3/8.13.1/NCO v10.0 AVout) with ESMTP id o438ViA6023257 for ; Mon, 3 May 2010 05:31:44 -0300 Received: from amitarora.in.ibm.com ([9.77.205.175]) by d01av02.pok.ibm.com (8.14.3/8.13.1/NCO v10.0 AVin) with ESMTP id o438Vgqd023174; Mon, 3 May 2010 05:31:43 -0300 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 806B04854; Mon, 3 May 2010 14:01:41 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.14.2/8.14.2/Submit) id o438VZ5C019874; Mon, 3 May 2010 14:01:35 +0530 Date: Mon, 3 May 2010 14:01:35 +0530 From: "Amit K. Arora" To: Christoph Hellwig Cc: Andrew Morton , xfs@oss.sgi.com, Nikanth Karthikesan , coly.li@suse.de, Nick Piggin , Alexander Viro , linux-fsdevel@vger.kernel.org, "Theodore Ts'o" , Andreas Dilger , linux-ext4@vger.kernel.org, Eelis , Amit Arora X-ASG-Orig-Subj: [PATCH] New testcase to check if fallocate respects RLIMIT_FSIZE or not Subject: [PATCH] New testcase to check if fallocate respects RLIMIT_FSIZE or not Message-ID: <20100503083135.GC13756@amitarora.in.ibm.com> References: <201004281854.49730.knikanth@suse.de> <4BD85F1F.7030100@suse.de> <201004291014.07194.knikanth@suse.de> <20100430143319.d51d6d77.akpm@linux-foundation.org> <20100501070426.GA9562@amitarora.in.ibm.com> <20100501101846.GA3769@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100501101846.GA3769@infradead.org> User-Agent: Mutt/1.5.17 (2007-11-01) X-Barracuda-Connect: e5.ny.us.ibm.com[32.97.182.145] X-Barracuda-Start-Time: 1272875506 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user 
scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.28948 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Sat, May 01, 2010 at 06:18:46AM -0400, Christoph Hellwig wrote: > On Sat, May 01, 2010 at 12:34:26PM +0530, Amit K. Arora wrote: > > Agreed. How about doing this check in the filesystem specific fallocate > > inode routines instead ? For example, in ext4 we could do : > > That looks okay - in fact XFS should already have this check because > it re-uses the setattr implementation to set the size. > > Can you submit an xfstests testcase to verify this behaviour on all > filesystems? Here is the new testcase. I have run this test on a x86_64 box on XFS and ext4 on 2.6.34-rc6. It passes on XFS, but fails on ext4. Below is the snapshot of results followed by the testcase itself. -- Regards, Amit Arora Test results: ------------ # ./check 228 FSTYP -- xfs (non-debug) PLATFORM -- Linux/x86_64 elm9m93 2.6.34-rc6 228 0s ... Ran: 228 Passed all 1 tests # # umount /mnt # mkfs.ext4 /dev/sda4 >/dev/null mke2fs 1.41.10 (10-Feb-2009) # ./check 228 FSTYP -- ext4 PLATFORM -- Linux/x86_64 elm9m93 2.6.34-rc6 228 0s ... - output mismatch (see 228.out.bad) --- 228.out 2010-05-03 02:51:24.000000000 -0400 +++ 228.out.bad 2010-05-03 04:27:33.000000000 -0400 @@ -1,2 +1 @@ QA output created by 228 -File size limit exceeded (core dumped) Ran: 228 Failures: 228 Failed 1 of 1 tests # Here is the test: ---------------- Add a new testcase to the xfstests suite to check if fallocate respects the limit imposed by RLIMIT_FSIZE (can be set by "ulimit -f XXX") or not, on a particular filesystem. 
Signed-off-by: Amit Arora diff -Nuarp xfstests-dev.org/228 xfstests-dev/228 --- xfstests-dev.org/228 1969-12-31 19:00:00.000000000 -0500 +++ xfstests-dev/228 2010-05-03 02:45:10.000000000 -0400 @@ -0,0 +1,79 @@ +#! /bin/bash +# FS QA Test No. 228 +# +# Check if fallocate respects RLIMIT_FSIZE +# +#----------------------------------------------------------------------- +# Copyright (c) 2010 IBM Corporation. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +# +#----------------------------------------------------------------------- +# +# creator +owner=aarora@in.ibm.com + +seq=`basename $0` +echo "QA output created by $seq" + +_cleanup() +{ + cd / + rm -f $tmp.* +} + +here=`pwd` +tmp=$TEST_DIR/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 25 + +# get standard environment, filters and checks +. ./common.rc +. 
./common.filter + +# real QA test starts here +# generic, but xfs_io's fallocate must work +_supported_fs generic +# only Linux supports fallocate +_supported_os Linux + +[ -n "$XFS_IO_PROG" ] || _notrun "xfs_io executable not found" + +rm -f $seq.full + +# Sanity check to see if fallocate works +_require_xfs_io_falloc + +# Check if we have good enough space available +avail=`df -P $TEST_DIR | awk 'END {print $4}'` +[ "$avail" -ge 104000 ] || _notrun "Test device is too small ($avail KiB)" + +# Set the FSIZE ulimit to 100MB and check +ulimit -f 102400 +flim=`ulimit -f` +[ "$flim" != "unlimited" ] || _notrun "Unable to set FSIZE ulimit" +[ "$flim" -eq 102400 ] || _notrun "FSIZE ulimit is not correct (100 MB)" + +# FSIZE limit is now set to 100 MB. +# Lets try to preallocate 101 MB. This should fail. +$XFS_IO_PROG -F -f -c 'falloc 0 101m' $TEST_DIR/ouch +rm -f $TEST_DIR/ouch + +# Lets now try to preallocate 50 MB. This should succeed. +$XFS_IO_PROG -F -f -c 'falloc 0 50m' $TEST_DIR/ouch +rm -f $TEST_DIR/ouch + +# success, all done +status=0 +exit diff -Nuarp xfstests-dev.org/group xfstests-dev/group --- xfstests-dev.org/group 2010-05-03 02:35:09.000000000 -0400 +++ xfstests-dev/group 2010-05-03 02:45:21.000000000 -0400 @@ -341,3 +341,4 @@ deprecated 225 auto quick 226 auto enospc 227 auto fsr +228 rw auto prealloc quick From michael.monnerie@is.it-management.at Mon May 3 03:52:34 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o438qY3l144246 for ; Mon, 3 May 2010 03:52:34 -0500 X-ASG-Debug-ID: 1272876880-544e01190000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mailsrv14.zmi.at (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 0BAE5307659 for ; Mon, 3 May 
2010 01:54:40 -0700 (PDT) Received: from mailsrv14.zmi.at (mailsrv1.zmi.at [212.69.164.54]) by cuda.sgi.com with ESMTP id LHv1InRhONA9ZgBO for ; Mon, 03 May 2010 01:54:40 -0700 (PDT) Received: from mailsrv.i.zmi.at (h081217106033.dyn.cm.kabsi.at [81.217.106.33]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client CN "mailsrv2.i.zmi.at", Issuer "power4u.zmi.at" (not verified)) by mailsrv14.zmi.at (Postfix) with ESMTPSA id B6B0A800183 for ; Mon, 3 May 2010 10:54:39 +0200 (CEST) Received: from saturn.localnet (saturn.i.zmi.at [10.72.27.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mailsrv.i.zmi.at (Postfix) with ESMTPSA id 603CB83C823 for ; Mon, 3 May 2010 10:54:39 +0200 (CEST) From: Michael Monnerie Organization: it-management http://it-management.at To: xfs@oss.sgi.com X-ASG-Orig-Subj: read slower than write on "mv"? Subject: read slower than write on "mv"? Date: Mon, 3 May 2010 10:54:28 +0200 User-Agent: KMail/1.12.4 (Linux/2.6.31.12-0.2-vanilla; KDE/4.3.5; x86_64; ; ) MIME-Version: 1.0 Content-Type: multipart/signed; boundary="nextPart1632245.G5Q5NsuaWY"; protocol="application/pgp-signature"; micalg=pgp-sha1 Content-Transfer-Encoding: 7bit Message-Id: <201005031054.38817@zmi.at> X-Barracuda-Connect: mailsrv1.zmi.at[212.69.164.54] X-Barracuda-Start-Time: 1272876881 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.28950 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean --nextPart1632245.G5Q5NsuaWY Content-Type: text/plain; 
charset="utf-8" Content-Transfer-Encoding: quoted-printable

This is not XFS specific, I see it on every filesystem. I do a
"mv . /newlocation" and see this with iostat:

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
xvdb 608,60 2,60 25967,20 102,40 85,31 5,37 8,77 1,58 96,72
xvdg 0,00 608,60 0,00 26532,10 87,19 7,55 12,41 0,12 7,44

Reading takes 97% I/O time, writing 7%. This is on the same raidset, but
I see the same when copying between two different single disks also. Or
is it just an effect of write caching that writes look faster than
reads?

--
Best regards,
Michael Monnerie, Ing. BSc

it-management Internet Services
http://proteger.at [pronounced: Prot-e-schee]
Tel: 0660 / 415 65 31

// We currently have two houses for sale:
// http://zmi.at/langegg/
// http://zmi.at/haus2009/

--nextPart1632245.G5Q5NsuaWY Content-Type: application/pgp-signature; name=signature.asc Content-Description: This is a digitally signed message part.
-----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.12 (GNU/Linux) iEYEABECAAYFAkvej04ACgkQzhSR9xwSCbQgCQCgoJg7TW1o4nZ30FAtCjjFyBht 064AoPOsBCaeZxeUjB2J+Mlsgi0J7nhp =YbG2 -----END PGP SIGNATURE----- --nextPart1632245.G5Q5NsuaWY-- From SRS0+ChgG+62+fromorbit.com=david@internode.on.net Mon May 3 06:50:05 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o43Bo5h4151871 for ; Mon, 3 May 2010 06:50:05 -0500 X-ASG-Debug-ID: 1272887530-541e01c20000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 69EC81DE3387 for ; Mon, 3 May 2010 04:52:10 -0700 (PDT) Received: from mail.internode.on.net (bld-mail15.adl6.internode.on.net [150.101.137.100]) by cuda.sgi.com with ESMTP id BDbJ2ZjC64XKKSHK for ; Mon, 03 May 2010 04:52:10 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 11280079-1927428 for multiple; Mon, 03 May 2010 21:22:09 +0930 (CST) Received: from dave by dastard with local (Exim 4.71) (envelope-from ) id 1O8uBz-0001DU-FC; Mon, 03 May 2010 21:52:07 +1000 Date: Mon, 3 May 2010 21:52:07 +1000 From: Dave Chinner To: Michael Monnerie Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: read slower than write on "mv"? Subject: Re: read slower than write on "mv"? 
Message-ID: <20100503115207.GE2591@dastard> References: <201005031054.38817@zmi.at> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201005031054.38817@zmi.at> User-Agent: Mutt/1.5.20 (2009-06-14) X-Barracuda-Connect: bld-mail15.adl6.internode.on.net[150.101.137.100] X-Barracuda-Start-Time: 1272887532 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.28960 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Mon, May 03, 2010 at 10:54:28AM +0200, Michael Monnerie wrote: > This is not XFS specific, I see it on every filesys. I do a > "mv . /newlocation" and see this with iostat: > > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq- > sz avgqu-sz await svctm %util > xvdb 608,60 2,60 25967,20 102,40 85,31 5,37 > 8,77 1,58 96,72 > xvdg 0,00 608,60 0,00 26532,10 87,19 7,55 > 12,41 0,12 7,44 > > Reading takes 97% I/O time, writing 7%. This is on the same raidset, but > I see the same when copying between two different single disks also. Or > is it just an effect of write caching that writes look faster than > reads? Yes, just an effect of write caching hiding IO latency. Cheers, Dave. 
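
The effect described in this reply (buffered writes returning before the data reaches the disk) can be observed directly. The following is only a minimal sketch, assuming GNU dd (for the bs=1M and conv=fsync syntax) and a writable temporary directory; the 16 MiB size and the file names are arbitrary choices for illustration and do not come from the thread.

```shell
#!/bin/sh
# Sketch: write the same 16 MiB twice, once buffered and once forced to disk.
# dd prints its own elapsed time and throughput on stderr for each run.
workdir=$(mktemp -d)

# Buffered write: dd returns as soon as the data sits in the page cache,
# so the reported rate mostly reflects memory speed, not the device.
dd if=/dev/zero of="$workdir/buffered" bs=1M count=16

# conv=fsync makes dd call fsync() on the output file before it exits,
# so the reported time includes the latency of the underlying storage.
dd if=/dev/zero of="$workdir/synced" bs=1M count=16 conv=fsync

# Record the resulting sizes, then clean up.
buffered_size=$(wc -c < "$workdir/buffered")
synced_size=$(wc -c < "$workdir/synced")
rm -rf "$workdir"
```

If the temporary directory lives on tmpfs, both runs will look alike because there is no backing device to wait for; pointing workdir at a directory on the filesystem under test gives a more meaningful comparison.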
-- Dave Chinner david@fromorbit.com From peter@palfrader.org Mon May 3 06:52:35 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o43BqYBk151958 for ; Mon, 3 May 2010 06:52:35 -0500 X-ASG-Debug-ID: 1272887679-050c02a70000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from anguilla.debian.or.at (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 9F95D1551D16 for ; Mon, 3 May 2010 04:54:40 -0700 (PDT) Received: from anguilla.debian.or.at (anguilla.debian.or.at [86.59.21.37]) by cuda.sgi.com with ESMTP id K26OKn93kzQZlwfF for ; Mon, 03 May 2010 04:54:40 -0700 (PDT) Received: by anguilla.debian.or.at (Postfix, from userid 1002) id BF21F10EBB8; Mon, 3 May 2010 13:54:38 +0200 (CEST) Date: Mon, 3 May 2010 13:54:38 +0200 From: Peter Palfrader To: linux-kernel@vger.kernel.org Cc: xfs@oss.sgi.com, david@fromorbit.com, stable@kernel.org X-ASG-Orig-Subj: [regression,bisected] 2.6.32.12: find(1) on xfs causes OOM Subject: [regression,bisected] 2.6.32.12: find(1) on xfs causes OOM Message-ID: <20100503115438.GA16623@anguilla.noreply.org> Mail-Followup-To: Peter Palfrader , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, david@fromorbit.com, stable@kernel.org MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline X-PGP: 1024D/94C09C7F 5B00 C96D 5D54 AEE1 206B AF84 DE7A AF6E 94C0 9C7F X-Request-PGP: http://www.palfrader.org/keys/94C09C7F.asc X-Accept-Language: de, en User-Agent: Mutt/1.5.18 (2008-05-17) X-Barracuda-Connect: anguilla.debian.or.at[86.59.21.37] X-Barracuda-Start-Time: 1272887681 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 
X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.28959 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Hi, I have an xfs filesystem in a KVM domain with 512megs of memory and 2 gigs of swap. The filesystem is 750g in size, of which some 500g are in use in about 6 million files. (This XFS filesystem is exported via nfs4. I haven't tested if this makes any difference.) Starting in 2.6.32.12 running something like "find | wc -l" on this filesystem's mountpoint causes the OOM killer to kill off most of the system. (See kern.log[1]) With 2.6.32.11 the system does not behave like this. Bisecting turned up the following commit. Reverting it in 2.6.32.12 also results in a system that works. | 9e1e9675fb29c0e94a7c87146138aa2135feba2f is first bad commit | commit 9e1e9675fb29c0e94a7c87146138aa2135feba2f | Author: Dave Chinner | Date: Fri Mar 12 09:42:10 2010 +1100 | | xfs: reclaim all inodes by background tree walks | | commit 57817c68229984818fea9e614d6f95249c3fb098 upstream | | We cannot do direct inode reclaim without taking the flush lock to | ensure that we do not reclaim an inode under IO. We check the inode | is clean before doing direct reclaim, but this is not good enough | because the inode flush code marks the inode clean once it has | copied the in-core dirty state to the backing buffer. | | It is the flush lock that determines whether the inode is still | under IO, even though it is marked clean, and the inode is still | required at IO completion so we can't reclaim it even though it is | clean in core. 
Hence the requirement that we need to take the flush | lock even on clean inodes because this guarantees that the inode | writeback IO has completed and it is safe to reclaim the inode. | | With delayed write inode flushing, we could end up waiting a long | time on the flush lock even for a clean inode. The background | reclaim already handles this efficiently, so avoid all the problems | by killing the direct reclaim path altogether. | | Signed-off-by: Dave Chinner | Reviewed-by: Christoph Hellwig | Signed-off-by: Alex Elder | Signed-off-by: Greg Kroah-Hartman | | diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c | index a82a93d..ea7a59a 100644 | --- a/fs/xfs/linux-2.6/xfs_super.c | +++ b/fs/xfs/linux-2.6/xfs_super.c | @@ -953,16 +953,14 @@ xfs_fs_destroy_inode( | ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM)); | | /* | - * If we have nothing to flush with this inode then complete the | - * teardown now, otherwise delay the flush operation. | + * We always use background reclaim here because even if the | + * inode is clean, it still may be under IO and hence we have | + * to take the flush lock. The background reclaim path handles | + * this more efficiently than we can here, so simply let background | + * reclaim tear down all inodes. | */ | - if (!xfs_inode_clean(ip)) { | - xfs_inode_set_reclaim_tag(ip); | - return; | - } | - | out_reclaim: | - xfs_ireclaim(ip); | + xfs_inode_set_reclaim_tag(ip); | } | | /* Cheers, Peter 1. http://asteria.noreply.org/~weasel/volatile/2010-05-03-Aju29kSrm0A/kern.log 2. http://asteria.noreply.org/~weasel/volatile/2010-05-03-Aju29kSrm0A/config-2.6.32.12-dsa-amd64 -- | .''`. ** Debian GNU/Linux ** Peter Palfrader | : :' : The universal http://www.palfrader.org/ | `. 
`' Operating System | `- http://www.debian.org/ From SRS0+ChgG+62+fromorbit.com=david@internode.on.net Mon May 3 07:15:13 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.7 required=5.0 tests=BAYES_00,J_CHICKENPOX_62, J_CHICKENPOX_64 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o43CFDOE153173 for ; Mon, 3 May 2010 07:15:13 -0500 X-ASG-Debug-ID: 1272889038-192f034b0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 986F8307F7F for ; Mon, 3 May 2010 05:17:19 -0700 (PDT) Received: from mail.internode.on.net (bld-mail16.adl2.internode.on.net [150.101.137.101]) by cuda.sgi.com with ESMTP id KsokmC62TlIps8fY for ; Mon, 03 May 2010 05:17:19 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 22881792-1927428 for multiple; Mon, 03 May 2010 21:47:18 +0930 (CST) Received: from dave by dastard with local (Exim 4.71) (envelope-from ) id 1O8uaK-0001F2-9U; Mon, 03 May 2010 22:17:16 +1000 Date: Mon, 3 May 2010 22:17:16 +1000 From: Dave Chinner To: Michael Monnerie Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: xfs_fsr question for improvement Subject: Re: xfs_fsr question for improvement Message-ID: <20100503121716.GF2591@dastard> References: <201004161043.11243@zmi.at> <20100417012415.GE2493@dastard> <201005030849.47591@zmi.at> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201005030849.47591@zmi.at> User-Agent: Mutt/1.5.20 (2009-06-14) X-Barracuda-Connect: bld-mail16.adl2.internode.on.net[150.101.137.101] X-Barracuda-Start-Time: 1272889040 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com 
X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.28960 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Mon, May 03, 2010 at 08:49:43AM +0200, Michael Monnerie wrote: > On Samstag, 17. April 2010 Dave Chinner wrote: > > They have thousands of extents in them and they are all between > > 8-10GB in size, and IO from my VMs are stiall capable of saturating > > the disks backing these files. While I'd normally consider these > > files fragmented and candidates for running fsr on tme, the number > > of extents is not actually a performance limiting factor and so > > there's no point in defragmenting them. Especially as that requires > > shutting down the VMs... > > I personally care less about file fragmentation than about > metadata/inode/directory fragmentation. This server gets accesses from > numerous people, > > # time find /mountpoint/ -inum 107901420 > /mountpoint/some/dir/ectory/path/x.iso > > real 7m50.732s > user 0m0.152s > sys 0m2.376s > > It took nearly 8 minutes to search through that mount point, which is > 6TB big on a RAID-5 striped over 7 2TB disks, so search speed should be > high. Not necessarily, as your raid array has shown. > > Especially as there are only 765.000 files on that disk: > Filesystem Inodes IUsed IFree IUse% > /mountpoint 1258291200 765659 1257525541 1% > > Wouldn't you say an 8 minutes search over just 765.000 files is slow, > even when only using 7x 2TB 7200rpm disks in RAID-5? Depends on the directory structure and the number of IOs needed to traverse it. If it's only a handful of files per directory, then you get no internal directory readahead to hide read latency. 
That results in a small random synchronous read workload that might
require a couple of hundred thousand IOs to complete. From your earlier
stats showing a read rate of 50 IO/s from the raid array, the directory
traversal has required about 25k IOs to complete. That takes about 10s
on my laptop's cheap SSD, which does random reads about 50x faster than
your RAID array....

> > > Would it be possible xfs_fsr defrags the meta data in a way that
> > > they are all together so seeks are faster?
> >
> > It's not related to fsr because fsr does not defragment metadata.
> > Some metadata cannot be defragmented (e.g. inodes cannot be moved),
> > some metadata cannot be manipulated directly (e.g. free space
> > btrees), and some is just difficult to do (e.g. directory
> > defragmentation) so hasn't ever been done.
>
> I see. On this particular server I know it would be good for performance
> to have the metadata defrag'ed, but that's not the aim of xfs_fsr.
> But maybe some developer is bored once and finds a way to optimize the
> search&find of files on an aged filesystem, i.e. metadata defrag :-)

Many have. Find and tar have resisted attempts to optimise them over
the years, so stuff like this:

http://oss.oracle.com/~mason/acp/

grows on the interwebs all over the place... ;)

> I tried this two times:
> # time find /mountpoint/ -inum 107901420
> real 8m17.316s
> user 0m0.148s
> sys 0m1.964s
>
> # time find /mountpoint/ -inum 107901420
> real 0m30.113s
> user 0m0.540s
> sys 0m9.813s
>
> Caching helps the 2nd time :-)

That still seems rather slow for traversing 750,000 cached directory
entries. My laptop (1.3GHz CULV core2 CPU) does 465,000 directory
entries in:

$ time sudo find / -mount -inum 123809285

real 0m2.196s
user 0m0.384s
sys 0m1.464s

> > Raid 5/6 generally provides the same IOPS performance as a single
> > spindle, regardless of the width of the RAID stripe.
A 2TB sata > > drive might be able to do 150-200 IOPS, so a RAID5 array made up of > > these drives will tend to max out at roughly the same.... > > Running xfs_fsr, I can see up to 1200r+1200w=2400I/Os per second: > > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq- > sz avgqu-sz await svctm %util > xvdc 0,00 0,00 0,00 1191,42 0,00 52320,16 > 87,83 121,23 96,77 0,71 84,63 > xvde 0,00 0,00 1226,35 0,00 52324,15 0,00 > 85,33 0,77 0,62 0,13 15,33 > > But on average it's about 600-700 read plus writes per second, so > 1200-1400 IOPS. > Both "disks" are 2TB LVM volumes on the same raidset, I just had to > split it as XEN doesn't allow to create >2TB volumes. > > So, the badly slow I/O I see during "find" are not happening during fsr. > How can that be? Because most of the IO xfs_fsr does is large sequential IO which the RAID caches are optimised for. Directory traversals, OTOH, are small, semi-random IO which are latency sensitive.... Cheers, Dave. -- Dave Chinner david@fromorbit.com From SRS0+lq9o+62+fromorbit.com=david@internode.on.net Mon May 3 07:34:08 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o43CY7o9153905 for ; Mon, 3 May 2010 07:34:08 -0500 X-ASG-Debug-ID: 1272890171-542902ae0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 90D0E1DE2960 for ; Mon, 3 May 2010 05:36:12 -0700 (PDT) Received: from mail.internode.on.net (bld-mail12.adl6.internode.on.net [150.101.137.97]) by cuda.sgi.com with ESMTP id GDIflvHc59Gb0NdI for ; Mon, 03 May 2010 05:36:12 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 
22893953-1927428 for multiple; Mon, 03 May 2010 22:06:08 +0930 (CST) Received: from dave by dastard with local (Exim 4.71) (envelope-from ) id 1O8usY-0001Fe-5J; Mon, 03 May 2010 22:36:06 +1000 Date: Mon, 3 May 2010 22:36:06 +1000 From: Dave Chinner To: Peter Palfrader , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, stable@kernel.org X-ASG-Orig-Subj: Re: [regression,bisected] 2.6.32.12: find(1) on xfs causes OOM Subject: Re: [regression,bisected] 2.6.32.12: find(1) on xfs causes OOM Message-ID: <20100503123606.GG2591@dastard> References: <20100503115438.GA16623@anguilla.noreply.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100503115438.GA16623@anguilla.noreply.org> User-Agent: Mutt/1.5.20 (2009-06-14) X-Barracuda-Connect: bld-mail12.adl6.internode.on.net[150.101.137.97] X-Barracuda-Start-Time: 1272890173 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.28961 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean

On Mon, May 03, 2010 at 01:54:38PM +0200, Peter Palfrader wrote:
> Hi,
>
> I have an xfs filesystem in a KVM domain with 512megs of memory and 2 gigs of
> swap.
>
> The filesystem is 750g in size, of which some 500g are in use in about 6
> million files. (This XFS filesystem is exported via nfs4. I haven't tested if
> this makes any difference.)
>
> Starting in 2.6.32.12 running something like "find | wc -l" on this
> filesystem's mountpoint causes the OOM killer to kill off most of the
> system. (See kern.log[1])

Known problem. As a workaround, you can increase the frequency at which
xfssyncd runs, so that the interval between background reclaim runs is
shorter than the default 30s.

> With 2.6.32.11 the system does not behave like this.
>
> Bisecting turned up the following commit. Reverting it in 2.6.32.12
> also results in a system that works.
>
> | 9e1e9675fb29c0e94a7c87146138aa2135feba2f is first bad commit
> | commit 9e1e9675fb29c0e94a7c87146138aa2135feba2f
> | Author: Dave Chinner
> | Date: Fri Mar 12 09:42:10 2010 +1100
> |
> | xfs: reclaim all inodes by background tree walks

Reverting this leaves you running with a subtly altered and completely
untested reclaim path that I'm not sure does the right thing in all
situations. I wouldn't run that revert on my machines, nor recommend
it for anyone else. But it's up to you if you want to run it on your
machines....

The fix for this problem only got to mainline a couple of days ago.

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9bf729c0af67897ea8498ce17c29b0683f7f2028

I've got to backport it to the stable kernel tree so the next stable
kernel should fix this.

Cheers,

Dave.
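
The workaround mentioned in this reply (running xfssyncd more often than the 30s default) could look roughly like the sketch below. It assumes the kernel exposes the fs.xfs.xfssyncd_centisecs sysctl, which takes its value in hundredths of a second; the 5-second interval and the helper name secs_to_centisecs are hypothetical choices for illustration, not values given in the thread.

```shell
#!/bin/sh
# Convert an interval in seconds to the centisecond units the sysctl expects.
# (secs_to_centisecs is a helper made up for this sketch.)
secs_to_centisecs() {
    echo $(( $1 * 100 ))
}

# Hypothetical choice: run xfssyncd every 5s instead of the default 30s.
interval_cs=$(secs_to_centisecs 5)

# Show the current setting if this kernel exposes the knob.
if [ -r /proc/sys/fs/xfs/xfssyncd_centisecs ]; then
    echo "current: $(cat /proc/sys/fs/xfs/xfssyncd_centisecs)"
fi

# Print the command instead of running it; applying it requires root.
echo "sysctl -w fs.xfs.xfssyncd_centisecs=$interval_cs"
```

The script only prints the sysctl command, so the value can be reviewed before it is applied by hand; the setting does not persist across reboots unless it is also added to the system's sysctl configuration.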
-- Dave Chinner david@fromorbit.com From peter@palfrader.org Mon May 3 07:42:29 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o43CgTK3154186 for ; Mon, 3 May 2010 07:42:29 -0500 X-ASG-Debug-ID: 1272890676-4d12034e0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from anguilla.debian.or.at (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 6603E132E2FD for ; Mon, 3 May 2010 05:44:36 -0700 (PDT) Received: from anguilla.debian.or.at (anguilla.debian.or.at [86.59.21.37]) by cuda.sgi.com with ESMTP id tU7EVqUBBY99X7lI for ; Mon, 03 May 2010 05:44:36 -0700 (PDT) Received: by anguilla.debian.or.at (Postfix, from userid 1002) id 3711510EBB6; Mon, 3 May 2010 14:44:36 +0200 (CEST) Date: Mon, 3 May 2010 14:44:36 +0200 From: Peter Palfrader To: Dave Chinner Cc: linux-kernel@vger.kernel.org, xfs@oss.sgi.com, stable@kernel.org X-ASG-Orig-Subj: Re: [regression,bisected] 2.6.32.12: find(1) on xfs causes OOM Subject: Re: [regression,bisected] 2.6.32.12: find(1) on xfs causes OOM Message-ID: <20100503124436.GB16623@anguilla.noreply.org> Mail-Followup-To: Peter Palfrader , Dave Chinner , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, stable@kernel.org References: <20100503115438.GA16623@anguilla.noreply.org> <20100503123606.GG2591@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20100503123606.GG2591@dastard> X-PGP: 1024D/94C09C7F 5B00 C96D 5D54 AEE1 206B AF84 DE7A AF6E 94C0 9C7F X-Request-PGP: http://www.palfrader.org/keys/94C09C7F.asc X-Accept-Language: de, en User-Agent: Mutt/1.5.18 (2008-05-17) X-Barracuda-Connect: anguilla.debian.or.at[86.59.21.37] X-Barracuda-Start-Time: 1272890677 X-Barracuda-Bayes: 
INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.28961 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Mon, 03 May 2010, Dave Chinner wrote: > > Starting in 2.6.32.12 running something like "find | wc -l" on this > > filesystem's mountpoint causes the OOM killer to kill off most of the > > system. (See kern.log[1]) > > Knwon problem. > The fix for this problem only got to mainline a couple of days ago. > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9bf729c0af67897ea8498ce17c29b0683f7f2028 > > I've got to backport it to the stable kernel tree so the next stable > kernel should fix this. Thanks, I'll stay on .11 on that machine for now then. -- | .''`. ** Debian GNU/Linux ** Peter Palfrader | : :' : The universal http://www.palfrader.org/ | `. 
`' Operating System | `- http://www.debian.org/
From BATV+34c124583dd587bcabd0+2444+infradead.org+hch@bombadil.srs.infradead.org Mon May 3 08:32:41 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.8 required=5.0 tests=BAYES_00,J_CHICKENPOX_64 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o43DWe3v156458 for ; Mon, 3 May 2010 08:32:40 -0500 X-ASG-Debug-ID: 1272893687-68cf03d30000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 572EB1554454 for ; Mon, 3 May 2010 06:34:48 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id PBFBbK0gy8iletOz for ; Mon, 03 May 2010 06:34:48 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1O8vnL-0005bJ-Gz for xfs@oss.sgi.com; Mon, 03 May 2010 13:34:47 +0000 Date: Mon, 3 May 2010 09:34:47 -0400 From: Christoph Hellwig To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH] xfs: kill xfs_trans_find_item Subject: [PATCH] xfs: kill xfs_trans_find_item Message-ID: <20100503133447.GA21325@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1272893688 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean

xfs_trans_find_item is just an awkward way to dereference ->li_desc in the log_item structure.
So replace the calls to it with the direct reference, and while we're at it also make these type-safe instead using casts. Signed-off-by: Christoph Hellwig Index: xfs/fs/xfs/quota/xfs_trans_dquot.c =================================================================== --- xfs.orig/fs/xfs/quota/xfs_trans_dquot.c 2010-04-30 18:33:49.928263075 +0200 +++ xfs/fs/xfs/quota/xfs_trans_dquot.c 2010-05-03 15:28:51.713004310 +0200 @@ -93,12 +93,10 @@ xfs_trans_log_dquot( xfs_trans_t *tp, xfs_dquot_t *dqp) { - xfs_log_item_desc_t *lidp; + xfs_log_item_desc_t *lidp = dqp->q_logitem.qli_item.li_desc; ASSERT(dqp->q_transp == tp); ASSERT(XFS_DQ_IS_LOCKED(dqp)); - - lidp = xfs_trans_find_item(tp, (xfs_log_item_t*)(&dqp->q_logitem)); ASSERT(lidp != NULL); tp->t_flags |= XFS_TRANS_DIRTY; @@ -890,9 +888,8 @@ xfs_trans_log_quotaoff_item( xfs_trans_t *tp, xfs_qoff_logitem_t *qlp) { - xfs_log_item_desc_t *lidp; + xfs_log_item_desc_t *lidp = qlp->qql_item.li_desc; - lidp = xfs_trans_find_item(tp, (xfs_log_item_t *)qlp); ASSERT(lidp != NULL); tp->t_flags |= XFS_TRANS_DIRTY; Index: xfs/fs/xfs/xfs_buf_item.c =================================================================== --- xfs.orig/fs/xfs/xfs_buf_item.c 2010-05-01 15:06:36.160254276 +0200 +++ xfs/fs/xfs/xfs_buf_item.c 2010-05-03 15:28:51.714004520 +0200 @@ -441,13 +441,10 @@ xfs_buf_item_unpin_remove( * occurs later in the xfs_trans_uncommit() will try to * reference the buffer which we no longer have a hold on. 
*/ - struct xfs_log_item_desc *lidp; - ASSERT(XFS_BUF_VALUSEMA(bip->bli_buf) <= 0); trace_xfs_buf_item_unpin_stale(bip); - lidp = xfs_trans_find_item(tp, (xfs_log_item_t *)bip); - xfs_trans_free_item(tp, lidp); + xfs_trans_free_item(tp, bip->bli_item.li_desc); /* * Since the transaction no longer refers to the buffer, the Index: xfs/fs/xfs/xfs_extfree_item.c =================================================================== --- xfs.orig/fs/xfs/xfs_extfree_item.c 2010-04-30 18:33:49.911254275 +0200 +++ xfs/fs/xfs/xfs_extfree_item.c 2010-05-03 15:28:51.719004170 +0200 @@ -132,18 +132,18 @@ STATIC void xfs_efi_item_unpin_remove(xfs_efi_log_item_t *efip, xfs_trans_t *tp) { struct xfs_ail *ailp = efip->efi_item.li_ailp; - xfs_log_item_desc_t *lidp; spin_lock(&ailp->xa_lock); if (efip->efi_flags & XFS_EFI_CANCELED) { + xfs_log_item_t *lip = &efip->efi_item; + /* * free the xaction descriptor pointing to this item */ - lidp = xfs_trans_find_item(tp, (xfs_log_item_t *) efip); - xfs_trans_free_item(tp, lidp); + xfs_trans_free_item(tp, lip->li_desc); /* xfs_trans_ail_delete() drops the AIL lock. */ - xfs_trans_ail_delete(ailp, (xfs_log_item_t *)efip); + xfs_trans_ail_delete(ailp, lip); xfs_efi_item_free(efip); } else { efip->efi_flags |= XFS_EFI_COMMITTED; Index: xfs/fs/xfs/xfs_trans_buf.c =================================================================== --- xfs.orig/fs/xfs/xfs_trans_buf.c 2010-05-01 14:59:07.653003960 +0200 +++ xfs/fs/xfs/xfs_trans_buf.c 2010-05-03 15:28:51.723003751 +0200 @@ -518,7 +518,7 @@ xfs_trans_brelse(xfs_trans_t *tp, * Find the item descriptor pointing to this buffer's * log item. It must be there. 
*/ - lidp = xfs_trans_find_item(tp, (xfs_log_item_t*)bip); + lidp = bip->bli_item.li_desc; ASSERT(lidp != NULL); trace_xfs_trans_brelse(bip); @@ -707,7 +707,7 @@ xfs_trans_log_buf(xfs_trans_t *tp, bip->bli_format.blf_flags &= ~XFS_BLI_CANCEL; } - lidp = xfs_trans_find_item(tp, (xfs_log_item_t*)bip); + lidp = bip->bli_item.li_desc; ASSERT(lidp != NULL); tp->t_flags |= XFS_TRANS_DIRTY; @@ -748,7 +748,7 @@ xfs_trans_binval( ASSERT(XFS_BUF_FSPRIVATE(bp, void *) != NULL); bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); - lidp = xfs_trans_find_item(tp, (xfs_log_item_t*)bip); + lidp = bip->bli_item.li_desc; ASSERT(lidp != NULL); ASSERT(atomic_read(&bip->bli_refcount) > 0); Index: xfs/fs/xfs/xfs_trans_extfree.c =================================================================== --- xfs.orig/fs/xfs/xfs_trans_extfree.c 2010-05-01 14:59:07.653003960 +0200 +++ xfs/fs/xfs/xfs_trans_extfree.c 2010-05-03 15:28:51.728005428 +0200 @@ -65,11 +65,10 @@ xfs_trans_log_efi_extent(xfs_trans_t *t xfs_fsblock_t start_block, xfs_extlen_t ext_len) { - xfs_log_item_desc_t *lidp; + xfs_log_item_desc_t *lidp = efip->efi_item.li_desc; uint next_extent; xfs_extent_t *extp; - lidp = xfs_trans_find_item(tp, (xfs_log_item_t*)efip); ASSERT(lidp != NULL); tp->t_flags |= XFS_TRANS_DIRTY; @@ -122,11 +121,10 @@ xfs_trans_log_efd_extent(xfs_trans_t *t xfs_fsblock_t start_block, xfs_extlen_t ext_len) { - xfs_log_item_desc_t *lidp; + xfs_log_item_desc_t *lidp = efdp->efd_item.li_desc; uint next_extent; xfs_extent_t *extp; - lidp = xfs_trans_find_item(tp, (xfs_log_item_t*)efdp); ASSERT(lidp != NULL); tp->t_flags |= XFS_TRANS_DIRTY; Index: xfs/fs/xfs/xfs_trans_inode.c =================================================================== --- xfs.orig/fs/xfs/xfs_trans_inode.c 2010-04-30 18:33:49.886254554 +0200 +++ xfs/fs/xfs/xfs_trans_inode.c 2010-05-03 15:28:51.732005497 +0200 @@ -149,13 +149,11 @@ xfs_trans_log_inode( xfs_inode_t *ip, uint flags) { - xfs_log_item_desc_t *lidp; + xfs_log_item_desc_t *lidp = 
ip->i_itemp->ili_item.li_desc; ASSERT(ip->i_transp == tp); ASSERT(ip->i_itemp != NULL); ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); - - lidp = xfs_trans_find_item(tp, (xfs_log_item_t*)(ip->i_itemp)); ASSERT(lidp != NULL); tp->t_flags |= XFS_TRANS_DIRTY; Index: xfs/fs/xfs/xfs_trans_item.c =================================================================== --- xfs.orig/fs/xfs/xfs_trans_item.c 2010-05-01 14:59:07.653003960 +0200 +++ xfs/fs/xfs/xfs_trans_item.c 2010-05-03 15:28:51.736005218 +0200 @@ -177,25 +177,6 @@ xfs_trans_free_item(xfs_trans_t *tp, xfs } /* - * This is called to find the descriptor corresponding to the given - * log item. It returns a pointer to the descriptor. - * The log item MUST have a corresponding descriptor in the given - * transaction. This routine does not return NULL, it panics. - * - * The descriptor pointer is kept in the log item's li_desc field. - * Just return it. - */ -/*ARGSUSED*/ -xfs_log_item_desc_t * -xfs_trans_find_item(xfs_trans_t *tp, xfs_log_item_t *lip) -{ - ASSERT(lip->li_desc != NULL); - - return lip->li_desc; -} - - -/* * Return a pointer to the first descriptor in the chunk list. * This does not return NULL if there are none, it panics. 
* Index: xfs/fs/xfs/xfs_trans_priv.h =================================================================== --- xfs.orig/fs/xfs/xfs_trans_priv.h 2010-05-01 14:59:07.654003960 +0200 +++ xfs/fs/xfs/xfs_trans_priv.h 2010-05-03 15:28:51.740033714 +0200 @@ -30,8 +30,6 @@ struct xfs_log_item_desc *xfs_trans_add_ struct xfs_log_item *); void xfs_trans_free_item(struct xfs_trans *, struct xfs_log_item_desc *); -struct xfs_log_item_desc *xfs_trans_find_item(struct xfs_trans *, - struct xfs_log_item *); struct xfs_log_item_desc *xfs_trans_first_item(struct xfs_trans *); struct xfs_log_item_desc *xfs_trans_next_item(struct xfs_trans *, struct xfs_log_item_desc *); From BATV+34c124583dd587bcabd0+2444+infradead.org+hch@bombadil.srs.infradead.org Mon May 3 10:43:05 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o43Fh3hN161622 for ; Mon, 3 May 2010 10:43:05 -0500 X-ASG-Debug-ID: 1272901511-4bf800360000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 7EA84308BC1 for ; Mon, 3 May 2010 08:45:11 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id BQmuZCfbq2rTVskX for ; Mon, 03 May 2010 08:45:11 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1O8xpW-0004db-SH for xfs@oss.sgi.com; Mon, 03 May 2010 15:45:10 +0000 Date: Mon, 3 May 2010 11:45:10 -0400 From: Christoph Hellwig To: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH] xfs: kill xfs_trans_find_item Subject: Re: [PATCH] xfs: kill xfs_trans_find_item Message-ID: <20100503154510.GA15411@infradead.org> References: 
<20100503133447.GA21325@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100503133447.GA21325@infradead.org> User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1272901511 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Mon, May 03, 2010 at 09:34:47AM -0400, Christoph Hellwig wrote: > xfs_trans_find_item is just and awkward way to derefence ->li_desc in > the log_item structure. So replace the calls to it with the direct > reference, and while we're at it also make these type-safe instead using > casts. Actually, ignore this patch for now. I have a larger one that will incorporate it. From SRS0+4mlS+63+fromorbit.com=david@internode.on.net Mon May 3 20:48:41 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o441meWC190348 for ; Mon, 3 May 2010 20:48:40 -0500 X-ASG-Debug-ID: 1272937845-3f3b03240000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id A2DE691E1D0 for ; Mon, 3 May 2010 18:50:46 -0700 (PDT) Received: from mail.internode.on.net (bld-mail17.adl2.internode.on.net [150.101.137.102]) by cuda.sgi.com with ESMTP id f41DvEVkAU0ZgwO9 for ; Mon, 03 May 2010 18:50:46 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 22782987-1927428 for ; Tue, 04 May 2010 11:20:44 
+0930 (CST) Received: from dave by dastard with local (Exim 4.71) (envelope-from ) id 1O97HW-000285-Rg for xfs@oss.sgi.com; Tue, 04 May 2010 11:50:42 +1000 Date: Tue, 4 May 2010 11:50:42 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [GIT] Delayed Logging V3 Subject: [GIT] Delayed Logging V3 Message-ID: <20100504015042.GA8120@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit User-Agent: Mutt/1.5.20 (2009-06-14) X-Barracuda-Connect: bld-mail17.adl2.internode.on.net[150.101.137.102] X-Barracuda-Start-Time: 1272937848 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-ASG-Whitelist: BODY (http://marc\.info/\?) X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Hi Folks, This is version 3 of the delayed logging series. I won't repeat everything about what it is, just point you here: http://marc.info/?l=linux-xfs&m=126862777118946&w=2 for the description, and here: git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfs.git delayed-logging for the current code. Note that this is a rebased branch, so you'll need to pull it again into a new branch to review. All previously known, reproducible issues have been fixed in this release, and as such I consider it ready from a functional POV for inclusion into the -dev tree as an experimental feature. The new cleanups added in this version mean it touches more files than the previous versions, but overall it should still be simpler to review because I've collapsed many of the intermediate patches into one "Introduce delayed logging core code" commit. 
Version 3: 28 files changed, 2366 insertions(+), 506 deletions(-) Version 2: 22 files changed, 2188 insertions(+), 377 deletions(-) Version 1: 19 files changed, 2594 insertions(+), 580 deletions(-) Changes for V3: o changed buffer log item reference counted model to be consistent for both logging modes o cleaned up XFS_BLI flags usage (new commit) o separated out log ticket overrun printing cleanup (new commit) o made sure "delaylog" option shows up in /proc/mounts o collapsed many of the intermediate commits together to make it easier to review o fixed inode buffer tagging issue that was causing shutdowns in log recovery in test 087 and 121 Changes for V2: o 22 files changed, 2188 insertions(+), 377 deletions(-) o fixed some memory leaks o fixed ticket allocation for checkpoints to use KM_NOFS o minor code cleanups o performed stress and scalability testing The following changes since commit 6ff75b78182c314112c1173edaab6c164c95d775: Christoph Hellwig (1): xfs: mark xfs_iomap_write_ helpers static are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfs.git delayed-logging Dave Chinner (10): xfs: Improve scalability of busy extent tracking xfs: allow log ticket allocation to take allocation flags xfs: modify buffer item reference counting V2 xfs: allow detection of inode allocation buffers in recovery xfs: clean up log ticket overrun debug output xfs: Delayed logging design documentation xfs: Introduce delayed logging core code xfs: forced unmounts need to push the CIL xfs: enable background pushing of the CIL xfs: allow detection of inode allocation buffers in recovery .../filesystems/xfs-delayed-logging-design.txt | 819 ++++++++++++++++++++ fs/xfs/Makefile | 1 + fs/xfs/linux-2.6/xfs_buf.c | 11 +- fs/xfs/linux-2.6/xfs_quotaops.c | 1 + fs/xfs/linux-2.6/xfs_super.c | 12 +- fs/xfs/linux-2.6/xfs_trace.h | 80 ++- fs/xfs/quota/xfs_dquot.c | 6 +- fs/xfs/support/debug.c | 1 + fs/xfs/xfs_ag.h | 21 +- fs/xfs/xfs_alloc.c | 272 ++++--- 
fs/xfs/xfs_alloc.h | 5 +- fs/xfs/xfs_buf_item.c | 166 ++-- fs/xfs/xfs_buf_item.h | 18 +- fs/xfs/xfs_error.c | 2 +- fs/xfs/xfs_filestream.c | 1 + fs/xfs/xfs_log.c | 107 ++- fs/xfs/xfs_log.h | 12 +- fs/xfs/xfs_log_cil.c | 729 +++++++++++++++++ fs/xfs/xfs_log_priv.h | 118 +++- fs/xfs/xfs_log_recover.c | 46 +- fs/xfs/xfs_log_recover.h | 2 +- fs/xfs/xfs_mount.h | 1 + fs/xfs/xfs_trans.c | 216 +++++- fs/xfs/xfs_trans.h | 45 +- fs/xfs/xfs_trans_buf.c | 46 +- fs/xfs/xfs_trans_extfree.c | 1 + fs/xfs/xfs_trans_item.c | 114 +--- fs/xfs/xfs_trans_priv.h | 19 +- 28 files changed, 2366 insertions(+), 506 deletions(-) create mode 100644 Documentation/filesystems/xfs-delayed-logging-design.txt create mode 100644 fs/xfs/xfs_log_cil.c Cheers, Dave. -- Dave Chinner david@fromorbit.com From SRS0+Madt+63+fromorbit.com=david@internode.on.net Mon May 3 21:56:28 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.8 required=5.0 tests=BAYES_00,J_CHICKENPOX_72 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o442uRYN194030 for ; Mon, 3 May 2010 21:56:28 -0500 X-ASG-Debug-ID: 1272941913-46d801070000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 2EB9A30ABCB for ; Mon, 3 May 2010 19:58:34 -0700 (PDT) Received: from mail.internode.on.net (bld-mail15.adl6.internode.on.net [150.101.137.100]) by cuda.sgi.com with ESMTP id gNDedutdGBn5xZkk for ; Mon, 03 May 2010 19:58:34 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 11360481-1927428 for ; Tue, 04 May 2010 12:28:32 +0930 (CST) Received: from dave by dastard with local (Exim 4.71) (envelope-from ) id 1O98Ky-0002Dl-91; Tue, 04 May 2010 12:58:20 +1000 Date: Tue, 4 May 2010 
12:58:20 +1000 From: Dave Chinner To: stable@kernel.org Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH] xfs: stable update for 2.6.32.x and 2.6.33.y Subject: [PATCH] xfs: stable update for 2.6.32.x and 2.6.33.y Message-ID: <20100504025820.GI2591@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit User-Agent: Mutt/1.5.20 (2009-06-14) X-Barracuda-Connect: bld-mail15.adl6.internode.on.net[150.101.137.100] X-Barracuda-Start-Time: 1272941915 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29010 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean G'day, The following patch that adds an inode reclaim shrinker needs to be added to both the 2.6.32.x stable series and the 2.6.33.y stable series. The regression this patch addresses was introduced in the previous round of XFS stable updates that have just been released (2.6.32.12 and 2.6.33.3). However, the fix wasn't upstream until after those stable kernels were released, so please consider this for the next stable releases. The same patch applies to the XFS code in both 2.6.32.12 and 2.6.33.3 trees, and has passed a run of xfsqa on both kernels with limited memory to trigger the OOM conditions that lead to problems. The upstream commit is 9bf729c0af67897ea8498ce17c29b0683f7f2028. Cheers, Dave. 
-- Dave Chinner david@fromorbit.com xfs: add a shrinker to background inode reclaim From: Dave Chinner >From 9bf729c0af67897ea8498ce17c29b0683f7f2028 Thu, 29 Apr 2010 21:22:13 +0000 On low memory boxes or those with highmem, kernel can OOM before the background reclaims inodes via xfssyncd. Add a shrinker to run inode reclaim so that it inode reclaim is expedited when memory is low. This is more complex than it needs to be because the VM folk don't want a context added to the shrinker infrastructure. Hence we need to add a global list of XFS mount structures so the shrinker can traverse them. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig --- fs/xfs/linux-2.6/xfs_super.c | 5 ++ fs/xfs/linux-2.6/xfs_sync.c | 107 +++++++++++++++++++++++++++++++++++++--- fs/xfs/linux-2.6/xfs_sync.h | 7 ++- fs/xfs/quota/xfs_qm_syscalls.c | 3 +- fs/xfs/xfs_ag.h | 1 + fs/xfs/xfs_mount.h | 1 + 6 files changed, 115 insertions(+), 9 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c index 77414db..146d491 100644 --- a/fs/xfs/linux-2.6/xfs_super.c +++ b/fs/xfs/linux-2.6/xfs_super.c @@ -1160,6 +1160,7 @@ xfs_fs_put_super( xfs_unmountfs(mp); xfs_freesb(mp); + xfs_inode_shrinker_unregister(mp); xfs_icsb_destroy_counters(mp); xfs_close_devices(mp); xfs_dmops_put(mp); @@ -1523,6 +1524,8 @@ xfs_fs_fill_super( if (error) goto fail_vnrele; + xfs_inode_shrinker_register(mp); + kfree(mtpt); return 0; @@ -1767,6 +1770,7 @@ init_xfs_fs(void) goto out_cleanup_procfs; vfs_initquota(); + xfs_inode_shrinker_init(); error = register_filesystem(&xfs_fs_type); if (error) @@ -1794,6 +1798,7 @@ exit_xfs_fs(void) { vfs_exitquota(); unregister_filesystem(&xfs_fs_type); + xfs_inode_shrinker_destroy(); xfs_sysctl_unregister(); xfs_cleanup_procfs(); xfs_buf_terminate(); diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c index 6b6b394..57adf2d 100644 --- a/fs/xfs/linux-2.6/xfs_sync.c +++ b/fs/xfs/linux-2.6/xfs_sync.c @@ -95,7 +95,8 @@ 
xfs_inode_ag_walk( struct xfs_perag *pag, int flags), int flags, int tag, - int exclusive) + int exclusive, + int *nr_to_scan) { struct xfs_perag *pag = &mp->m_perag[ag]; uint32_t first_index; @@ -135,7 +136,7 @@ restart: if (error == EFSCORRUPTED) break; - } while (1); + } while ((*nr_to_scan)--); if (skipped) { delay(1); @@ -153,23 +154,30 @@ xfs_inode_ag_iterator( struct xfs_perag *pag, int flags), int flags, int tag, - int exclusive) + int exclusive, + int *nr_to_scan) { int error = 0; int last_error = 0; xfs_agnumber_t ag; + int nr; + nr = nr_to_scan ? *nr_to_scan : INT_MAX; for (ag = 0; ag < mp->m_sb.sb_agcount; ag++) { if (!mp->m_perag[ag].pag_ici_init) continue; error = xfs_inode_ag_walk(mp, ag, execute, flags, tag, - exclusive); + exclusive, &nr); if (error) { last_error = error; if (error == EFSCORRUPTED) break; } + if (nr <= 0) + break; } + if (nr_to_scan) + *nr_to_scan = nr; return XFS_ERROR(last_error); } @@ -289,7 +297,7 @@ xfs_sync_data( ASSERT((flags & ~(SYNC_TRYLOCK|SYNC_WAIT)) == 0); error = xfs_inode_ag_iterator(mp, xfs_sync_inode_data, flags, - XFS_ICI_NO_TAG, 0); + XFS_ICI_NO_TAG, 0, NULL); if (error) return XFS_ERROR(error); @@ -311,7 +319,7 @@ xfs_sync_attr( ASSERT((flags & ~SYNC_WAIT) == 0); return xfs_inode_ag_iterator(mp, xfs_sync_inode_attr, flags, - XFS_ICI_NO_TAG, 0); + XFS_ICI_NO_TAG, 0, NULL); } STATIC int @@ -679,6 +687,7 @@ __xfs_inode_set_reclaim_tag( radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino), XFS_ICI_RECLAIM_TAG); + pag->pag_ici_reclaimable++; } /* @@ -710,6 +719,7 @@ __xfs_inode_clear_reclaim_tag( { radix_tree_tag_clear(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino), XFS_ICI_RECLAIM_TAG); + pag->pag_ici_reclaimable--; } STATIC int @@ -770,5 +780,88 @@ xfs_reclaim_inodes( int mode) { return xfs_inode_ag_iterator(mp, xfs_reclaim_inode, mode, - XFS_ICI_RECLAIM_TAG, 1); + XFS_ICI_RECLAIM_TAG, 1, NULL); +} + +/* + * Shrinker infrastructure. 
+ * + * This is all far more complex than it needs to be. It adds a global list of + * mounts because the shrinkers can only call a global context. We need to make + * the shrinkers pass a context to avoid the need for global state. + */ +static LIST_HEAD(xfs_mount_list); +static struct rw_semaphore xfs_mount_list_lock; + +static int +xfs_reclaim_inode_shrink( + int nr_to_scan, + gfp_t gfp_mask) +{ + struct xfs_mount *mp; + xfs_agnumber_t ag; + int reclaimable = 0; + + if (nr_to_scan) { + if (!(gfp_mask & __GFP_FS)) + return -1; + + down_read(&xfs_mount_list_lock); + list_for_each_entry(mp, &xfs_mount_list, m_mplist) { + xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0, + XFS_ICI_RECLAIM_TAG, 1, &nr_to_scan); + if (nr_to_scan <= 0) + break; + } + up_read(&xfs_mount_list_lock); + } + + down_read(&xfs_mount_list_lock); + list_for_each_entry(mp, &xfs_mount_list, m_mplist) { + for (ag = 0; ag < mp->m_sb.sb_agcount; ag++) { + + if (!mp->m_perag[ag].pag_ici_init) + continue; + reclaimable += mp->m_perag[ag].pag_ici_reclaimable; + } + } + up_read(&xfs_mount_list_lock); + return reclaimable; +} + +static struct shrinker xfs_inode_shrinker = { + .shrink = xfs_reclaim_inode_shrink, + .seeks = DEFAULT_SEEKS, +}; + +void __init +xfs_inode_shrinker_init(void) +{ + init_rwsem(&xfs_mount_list_lock); + register_shrinker(&xfs_inode_shrinker); +} + +void +xfs_inode_shrinker_destroy(void) +{ + ASSERT(list_empty(&xfs_mount_list)); + unregister_shrinker(&xfs_inode_shrinker); +} + +void +xfs_inode_shrinker_register( + struct xfs_mount *mp) +{ + down_write(&xfs_mount_list_lock); + list_add_tail(&mp->m_mplist, &xfs_mount_list); + up_write(&xfs_mount_list_lock); +} + +void +xfs_inode_shrinker_unregister( + struct xfs_mount *mp) +{ + down_write(&xfs_mount_list_lock); + list_del(&mp->m_mplist); + up_write(&xfs_mount_list_lock); } diff --git a/fs/xfs/linux-2.6/xfs_sync.h b/fs/xfs/linux-2.6/xfs_sync.h index ea932b4..0b28c13 100644 --- a/fs/xfs/linux-2.6/xfs_sync.h +++ 
b/fs/xfs/linux-2.6/xfs_sync.h @@ -54,6 +54,11 @@ void __xfs_inode_clear_reclaim_tag(struct xfs_mount *mp, struct xfs_perag *pag, int xfs_sync_inode_valid(struct xfs_inode *ip, struct xfs_perag *pag); int xfs_inode_ag_iterator(struct xfs_mount *mp, int (*execute)(struct xfs_inode *ip, struct xfs_perag *pag, int flags), - int flags, int tag, int write_lock); + int flags, int tag, int write_lock, int *nr_to_scan); + +void xfs_inode_shrinker_init(void); +void xfs_inode_shrinker_destroy(void); +void xfs_inode_shrinker_register(struct xfs_mount *mp); +void xfs_inode_shrinker_unregister(struct xfs_mount *mp); #endif diff --git a/fs/xfs/quota/xfs_qm_syscalls.c b/fs/xfs/quota/xfs_qm_syscalls.c index 873e07e..145f596 100644 --- a/fs/xfs/quota/xfs_qm_syscalls.c +++ b/fs/xfs/quota/xfs_qm_syscalls.c @@ -891,7 +891,8 @@ xfs_qm_dqrele_all_inodes( uint flags) { ASSERT(mp->m_quotainfo); - xfs_inode_ag_iterator(mp, xfs_dqrele_inode, flags, XFS_ICI_NO_TAG, 0); + xfs_inode_ag_iterator(mp, xfs_dqrele_inode, flags, + XFS_ICI_NO_TAG, 0, NULL); } /*------------------------------------------------------------------------*/ diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h index 6702bd8..1182604 100644 --- a/fs/xfs/xfs_ag.h +++ b/fs/xfs/xfs_ag.h @@ -229,6 +229,7 @@ typedef struct xfs_perag int pag_ici_init; /* incore inode cache initialised */ rwlock_t pag_ici_lock; /* incore inode lock */ struct radix_tree_root pag_ici_root; /* incore inode cache root */ + int pag_ici_reclaimable; /* reclaimable inodes */ #endif } xfs_perag_t; diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index 1df7e45..c95f81a 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -257,6 +257,7 @@ typedef struct xfs_mount { wait_queue_head_t m_wait_single_sync_task; __int64_t m_update_flags; /* sb flags we need to update on the next remount,rw */ + struct list_head m_mplist; /* inode shrinker mount list */ } xfs_mount_t; /* From BATV+f20da4fa05706f3678eb+2445+infradead.org+hch@bombadil.srs.infradead.org Tue May 4 
05:15:28 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o44AFRHQ219603 for ; Tue, 4 May 2010 05:15:28 -0500 X-ASG-Debug-ID: 1272968256-183b026c0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id A9C9930B8E7 for ; Tue, 4 May 2010 03:17:36 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id MG0keEm8Q6WlKNdF for ; Tue, 04 May 2010 03:17:36 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1O9FC1-0000bs-Um; Tue, 04 May 2010 10:17:33 +0000 Date: Tue, 4 May 2010 06:17:33 -0400 From: Christoph Hellwig To: Dave Chinner Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [GIT] Delayed Logging V3 Subject: Re: [GIT] Delayed Logging V3 Message-ID: <20100504101733.GA18229@infradead.org> References: <20100504015042.GA8120@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100504015042.GA8120@dastard> User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1272968256 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Tue, May 04, 2010 at 11:50:42AM +1000, Dave Chinner wrote: > The new cleanups added in this version mean it touches 
more files > that the previous versions, but overall it should still be simpler > to review because I've collapsed many of the intermediate patches > into one "Introduce delayed logging core code" commit. Thanks. Btw, a little procedural comment - you include the V2 markers and changelogs in the commit message. Normal procedure is to have them below the -- or what it is marker with the diffstat so that they only get picked up in the mail and not the final commits, and no V2 in the subject line at all. From SRS0+Bo/h+63+fromorbit.com=david@internode.on.net Tue May 4 06:42:21 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o44BgKZl223520 for ; Tue, 4 May 2010 06:42:21 -0500 X-ASG-Debug-ID: 1272973466-7d9800440000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 5217013404E4 for ; Tue, 4 May 2010 04:44:27 -0700 (PDT) Received: from mail.internode.on.net (bld-mail12.adl6.internode.on.net [150.101.137.97]) by cuda.sgi.com with ESMTP id QU9aL5mjKap7ZsE8 for ; Tue, 04 May 2010 04:44:27 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23031331-1927428 for multiple; Tue, 04 May 2010 21:14:25 +0930 (CST) Received: from dave by dastard with local (Exim 4.71) (envelope-from ) id 1O9GY4-0002hO-2p; Tue, 04 May 2010 21:44:24 +1000 Date: Tue, 4 May 2010 21:44:24 +1000 From: Dave Chinner To: Christoph Hellwig Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [GIT] Delayed Logging V3 Subject: Re: [GIT] Delayed Logging V3 Message-ID: <20100504114423.GC8120@dastard> References: <20100504015042.GA8120@dastard> 
<20100504101733.GA18229@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100504101733.GA18229@infradead.org> User-Agent: Mutt/1.5.20 (2009-06-14) X-Barracuda-Connect: bld-mail12.adl6.internode.on.net[150.101.137.97] X-Barracuda-Start-Time: 1272973468 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29040 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Tue, May 04, 2010 at 06:17:33AM -0400, Christoph Hellwig wrote: > On Tue, May 04, 2010 at 11:50:42AM +1000, Dave Chinner wrote: > > The new cleanups added in this version mean it touches more files > > that the previous versions, but overall it should still be simpler > > to review because I've collapsed many of the intermediate patches > > into one "Introduce delayed logging core code" commit. > > Thanks. Btw, a little procedural comment - you include the V2 markers > and changelogs in the commit message. Normal procedure is to have > them below the -- or what it is marker with the diffstat so that they > only get picked up in the mail and not the final commits, and no V2 in > the subject line at all. Yeah, that's typical. The problem is that guilt seems to kill anything I add to the patch headers below a "---" separator. Hence if I don't put it in the commit message it doesn't stick as I rebase my working branches, and hence doesn't get included in the patchbombs I send direct from the repository. 
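The separator convention discussed above can be sketched in a few lines of C. This is only an illustration, assuming the cut rule documented for git-am (the commit message ends before the first line that is exactly "---", or that begins with "diff -" or "Index: "); the function names are hypothetical and are not taken from git or guilt, whose actual handling may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: does this line (without its newline) mark the
 * start of the patch proper?  Modeled on the rule documented for git-am.
 */
static int starts_patch(const char *line, size_t len)
{
	if (len == 3 && memcmp(line, "---", 3) == 0)
		return 1;	/* three dashes and end of line */
	if (len >= 6 && memcmp(line, "diff -", 6) == 0)
		return 1;
	if (len >= 7 && memcmp(line, "Index: ", 7) == 0)
		return 1;
	return 0;
}

/*
 * Hypothetical helper: return how many bytes at the start of `mail`
 * would be kept as the commit message.  Everything from the separator
 * line onwards (version notes, diffstat, diff) is excluded.
 */
size_t commit_message_len(const char *mail)
{
	const char *line = mail;

	assert(mail != NULL);

	while (*line != '\0') {
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);

		if (starts_patch(line, len))
			return (size_t)(line - mail);
		if (eol == NULL)
			break;
		line = eol + 1;
	}
	return strlen(mail);	/* no separator: the whole text is the message */
}
```

Under this rule, version notes placed after the "---" line travel with the mail but fall outside the returned length, so they would not end up in the final commit.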
For stuff that is in a series that I'm tracking in a separate branch I can probably just keep versioning changes in the series header (the patch 0/N message), but for individual patches it's a bit harder. I'll see what I can do to track this more easily in my workflow. Cheers, Dave. -- Dave Chinner david@fromorbit.com From info@sarfmarketi.com Tue May 4 07:32:21 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: * X-Spam-Status: No, score=1.1 required=5.0 tests=BAYES_50,MIME_8BIT_HEADER autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o44CWKOJ226156 for ; Tue, 4 May 2010 07:32:21 -0500 X-ASG-Debug-ID: 1272976465-4d4003980000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.saubulten.com (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 4E54C30BDFC for ; Tue, 4 May 2010 05:34:25 -0700 (PDT) Received: from mail.saubulten.com (mail.turuncusoft.com [85.153.27.210]) by cuda.sgi.com with ESMTP id bEmmdWkzFokH6wvd for ; Tue, 04 May 2010 05:34:25 -0700 (PDT) Received: from pc1 ([88.233.23.54]) by mail.saubulten.com (IceWarp 9.2.1) with ASMTP id MRM54613 for ; Tue, 04 May 2010 15:34:13 +0300 From: "Tonerci" To: "xfs@oss.sgi.com" Reply-To: info@sarfmarketi.com Date: Tue, 04 May 2010 16:11:14 +0430 X-ASG-Orig-Subj: =?iso-8859-9?Q?Renkli=20Toner=20Dolum=20=DCcretleri?= Subject: =?iso-8859-9?Q?Renkli=20Toner=20Dolum=20=DCcretleri?= MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-9 Content-Transfer-Encoding: quoted-printable X-Mailer: aspNetEmail ver 3.6.0.77 Message-ID: X-Barracuda-Connect: mail.turuncusoft.com[85.153.27.210] X-Barracuda-Start-Time: 1272976467 X-Barracuda-Bayes: INNOCENT GLOBAL 0.4857 1.0000 0.0000 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: 0.00 X-Barracuda-Spam-Status: No, SCORE=0.00 using per-user scores of 
TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29045 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean HP CP1215 RENKL=DD TONER DOLUMU 50 TL =0D=0AHP 1600/2600 RENKL=DD= TONER DOLUMU 60 TL=0D=0AHP Q2612 S=DDYAH TONER DOLUMU 14 T= L=0D=0AHP CB435A S=DDYAH TONER DOLUMU 14 TL=0D=0AHP CB46A TONER = DOLUMU 14 TL=0D=0A=0D=0AListemizde olmayan modeller= i=E7in bilgi sorunuz=2E=0D=0AFiyatlara =2518 KDV dahil de=F0ildir=2E=0D=0A= =0D=0A=0D=0ATuruncusoft Bili=FEim ve =DDnternet Hiz=2ELtd=2E=DEti=0D=0ATE= L: 0212-272 70 83=0D=0AMSN: kartus_dolum=40windowslive=2Ecom=0D=0A
From BATV+f20da4fa05706f3678eb+2445+infradead.org+hch@bombadil.srs.infradead.org Tue May 4 08:51:42 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o44DpetU230405 for ; Tue, 4 May 2010 08:51:42 -0500 X-ASG-Debug-ID: 1272981228-7fa303280000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id D412F30C23C for ; Tue, 4 May 2010 06:53:48 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id BdY5xV8IWNQYPstO for ; Tue, 04 May 2010 06:53:48 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1O9IZI-0005Hl-E9 for xfs@oss.sgi.com; Tue, 04 May 2010 13:53:48 +0000 Date: Tue, 4 May 2010 09:53:48 -0400 From: Christoph Hellwig To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH v2] xfs: cleanup log reservation calculactions Subject: [PATCH v2] xfs: cleanup log reservation calculactions Message-ID: <20100504135348.GA18991@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect:
bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1272981228 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Instead of having small helper functions calling big macros, do the calculations for the log reservations directly in the functions. These are mostly 1:1 from the macros, except that the macros kept the quota calculations in their callers. Signed-off-by: Christoph Hellwig Index: xfs/fs/xfs/xfs_trans.c =================================================================== --- xfs.orig/fs/xfs/xfs_trans.c 2010-05-04 15:47:36.143004267 +0200 +++ xfs/fs/xfs/xfs_trans.c 2010-05-04 15:49:00.118253812 +0200 @@ -49,134 +49,489 @@ kmem_zone_t *xfs_trans_zone; + +/* + * Various log reservation values. + * + * These are based on the size of the file system block because that is what + * most transactions manipulate. Each adds in an additional 128 bytes per + * item logged to try to account for the overhead of the transaction mechanism. + * + * Note: Most of the reservations underestimate the number of allocation + * groups into which they could free extents in the xfs_bmap_finish() call. + * This is because the number in the worst case is quite high and quite + * unusual. In order to fix this we need to change xfs_bmap_finish() to free + * extents in only a single AG at a time. This will require changes to the + * EFI code as well, however, so that the EFI for the extents not freed is + * logged again in each transaction. See SGI PV #261917. + * + * Reservation functions here avoid a huge stack in xfs_trans_init due to + * register overflow from temporaries in the calculations. + */ + + /* - * Reservation functions here avoid a huge stack in xfs_trans_init - * due to register overflow from temporaries in the calculations. + * In a write transaction we can allocate a maximum of 2 + * extents.
This gives: + * the inode getting the new extents: inode size + * the inode's bmap btree: max depth * block size + * the agfs of the ags from which the extents are allocated: 2 * sector + * the superblock free block counter: sector size + * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size + * And the bmap_finish transaction can free bmap blocks in a join: + * the agfs of the ags containing the blocks: 2 * sector size + * the agfls of the ags containing the blocks: 2 * sector size + * the super block free block counter: sector size + * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size */ STATIC uint -xfs_calc_write_reservation(xfs_mount_t *mp) +xfs_calc_write_reservation( + struct xfs_mount *mp) { - return XFS_CALC_WRITE_LOG_RES(mp) + XFS_DQUOT_LOGRES(mp); + return XFS_DQUOT_LOGRES(mp) + + MAX((mp->m_sb.sb_inodesize + + XFS_FSB_TO_B(mp, XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK)) + + 2 * mp->m_sb.sb_sectsize + + mp->m_sb.sb_sectsize + + XFS_ALLOCFREE_LOG_RES(mp, 2) + + 128 * (4 + XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + + XFS_ALLOCFREE_LOG_COUNT(mp, 2))), + (2 * mp->m_sb.sb_sectsize + + 2 * mp->m_sb.sb_sectsize + + mp->m_sb.sb_sectsize + + XFS_ALLOCFREE_LOG_RES(mp, 2) + + 128 * (5 + XFS_ALLOCFREE_LOG_COUNT(mp, 2)))); } +/* + * In truncating a file we free up to two extents at once. 
We can modify: + * the inode being truncated: inode size + * the inode's bmap btree: (max depth + 1) * block size + * And the bmap_finish transaction can free the blocks and bmap blocks: + * the agf for each of the ags: 4 * sector size + * the agfl for each of the ags: 4 * sector size + * the super block to reflect the freed blocks: sector size + * worst case split in allocation btrees per extent assuming 4 extents: + * 4 exts * 2 trees * (2 * max depth - 1) * block size + * the inode btree: max depth * blocksize + * the allocation btrees: 2 trees * (max depth - 1) * block size + */ STATIC uint -xfs_calc_itruncate_reservation(xfs_mount_t *mp) +xfs_calc_itruncate_reservation( + struct xfs_mount *mp) { - return XFS_CALC_ITRUNCATE_LOG_RES(mp) + XFS_DQUOT_LOGRES(mp); + return XFS_DQUOT_LOGRES(mp) + + MAX((mp->m_sb.sb_inodesize + + XFS_FSB_TO_B(mp, XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1) + + 128 * (2 + XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK))), + (4 * mp->m_sb.sb_sectsize + + 4 * mp->m_sb.sb_sectsize + + mp->m_sb.sb_sectsize + + XFS_ALLOCFREE_LOG_RES(mp, 4) + + 128 * (9 + XFS_ALLOCFREE_LOG_COUNT(mp, 4)) + + 128 * 5 + + XFS_ALLOCFREE_LOG_RES(mp, 1) + + 128 * (2 + XFS_IALLOC_BLOCKS(mp) + mp->m_in_maxlevels + + XFS_ALLOCFREE_LOG_COUNT(mp, 1)))); } +/* + * In renaming a files we can modify: + * the four inodes involved: 4 * inode size + * the two directory btrees: 2 * (max depth + v2) * dir block size + * the two directory bmap btrees: 2 * max depth * block size + * And the bmap_finish transaction can free dir and bmap blocks (two sets + * of bmap blocks) giving: + * the agf for the ags in which the blocks live: 3 * sector size + * the agfl for the ags in which the blocks live: 3 * sector size + * the superblock for the free block count: sector size + * the allocation btrees: 3 exts * 2 trees * (2 * max depth - 1) * block size + */ STATIC uint -xfs_calc_rename_reservation(xfs_mount_t *mp) +xfs_calc_rename_reservation( + struct xfs_mount *mp) { - return 
XFS_CALC_RENAME_LOG_RES(mp) + XFS_DQUOT_LOGRES(mp); + return XFS_DQUOT_LOGRES(mp) + + MAX((4 * mp->m_sb.sb_inodesize + + 2 * XFS_DIROP_LOG_RES(mp) + + 128 * (4 + 2 * XFS_DIROP_LOG_COUNT(mp))), + (3 * mp->m_sb.sb_sectsize + + 3 * mp->m_sb.sb_sectsize + + mp->m_sb.sb_sectsize + + XFS_ALLOCFREE_LOG_RES(mp, 3) + + 128 * (7 + XFS_ALLOCFREE_LOG_COUNT(mp, 3)))); } +/* + * For creating a link to an inode: + * the parent directory inode: inode size + * the linked inode: inode size + * the directory btree could split: (max depth + v2) * dir block size + * the directory bmap btree could join or split: (max depth + v2) * blocksize + * And the bmap_finish transaction can free some bmap blocks giving: + * the agf for the ag in which the blocks live: sector size + * the agfl for the ag in which the blocks live: sector size + * the superblock for the free block count: sector size + * the allocation btrees: 2 trees * (2 * max depth - 1) * block size + */ STATIC uint -xfs_calc_link_reservation(xfs_mount_t *mp) +xfs_calc_link_reservation( + struct xfs_mount *mp) { - return XFS_CALC_LINK_LOG_RES(mp) + XFS_DQUOT_LOGRES(mp); + return XFS_DQUOT_LOGRES(mp) + + MAX((mp->m_sb.sb_inodesize + + mp->m_sb.sb_inodesize + + XFS_DIROP_LOG_RES(mp) + + 128 * (2 + XFS_DIROP_LOG_COUNT(mp))), + (mp->m_sb.sb_sectsize + + mp->m_sb.sb_sectsize + + mp->m_sb.sb_sectsize + + XFS_ALLOCFREE_LOG_RES(mp, 1) + + 128 * (3 + XFS_ALLOCFREE_LOG_COUNT(mp, 1)))); } +/* + * For removing a directory entry we can modify: + * the parent directory inode: inode size + * the removed inode: inode size + * the directory btree could join: (max depth + v2) * dir block size + * the directory bmap btree could join or split: (max depth + v2) * blocksize + * And the bmap_finish transaction can free the dir and bmap blocks giving: + * the agf for the ag in which the blocks live: 2 * sector size + * the agfl for the ag in which the blocks live: 2 * sector size + * the superblock for the free block count: sector size + * the allocation 
btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size + */ STATIC uint -xfs_calc_remove_reservation(xfs_mount_t *mp) +xfs_calc_remove_reservation( + struct xfs_mount *mp) { - return XFS_CALC_REMOVE_LOG_RES(mp) + XFS_DQUOT_LOGRES(mp); + return XFS_DQUOT_LOGRES(mp) + + MAX((mp->m_sb.sb_inodesize + + mp->m_sb.sb_inodesize + + XFS_DIROP_LOG_RES(mp) + + 128 * (2 + XFS_DIROP_LOG_COUNT(mp))), + (2 * mp->m_sb.sb_sectsize + + 2 * mp->m_sb.sb_sectsize + + mp->m_sb.sb_sectsize + + XFS_ALLOCFREE_LOG_RES(mp, 2) + + 128 * (5 + XFS_ALLOCFREE_LOG_COUNT(mp, 2)))); } +/* + * For symlink we can modify: + * the parent directory inode: inode size + * the new inode: inode size + * the inode btree entry: 1 block + * the directory btree: (max depth + v2) * dir block size + * the directory inode's bmap btree: (max depth + v2) * block size + * the blocks for the symlink: 1 kB + * Or in the first xact we allocate some inodes giving: + * the agi and agf of the ag getting the new inodes: 2 * sectorsize + * the inode blocks allocated: XFS_IALLOC_BLOCKS * blocksize + * the inode btree: max depth * blocksize + * the allocation btrees: 2 trees * (2 * max depth - 1) * block size + */ STATIC uint -xfs_calc_symlink_reservation(xfs_mount_t *mp) +xfs_calc_symlink_reservation( + struct xfs_mount *mp) { - return XFS_CALC_SYMLINK_LOG_RES(mp) + XFS_DQUOT_LOGRES(mp); + return XFS_DQUOT_LOGRES(mp) + + MAX((mp->m_sb.sb_inodesize + + mp->m_sb.sb_inodesize + + XFS_FSB_TO_B(mp, 1) + + XFS_DIROP_LOG_RES(mp) + + 1024 + + 128 * (4 + XFS_DIROP_LOG_COUNT(mp))), + (2 * mp->m_sb.sb_sectsize + + XFS_FSB_TO_B(mp, XFS_IALLOC_BLOCKS(mp)) + + XFS_FSB_TO_B(mp, mp->m_in_maxlevels) + + XFS_ALLOCFREE_LOG_RES(mp, 1) + + 128 * (2 + XFS_IALLOC_BLOCKS(mp) + mp->m_in_maxlevels + + XFS_ALLOCFREE_LOG_COUNT(mp, 1)))); } +/* + * For create we can modify: + * the parent directory inode: inode size + * the new inode: inode size + * the inode btree entry: block size + * the superblock for the nlink flag: sector size + * the directory 
btree: (max depth + v2) * dir block size + * the directory inode's bmap btree: (max depth + v2) * block size + * Or in the first xact we allocate some inodes giving: + * the agi and agf of the ag getting the new inodes: 2 * sectorsize + * the superblock for the nlink flag: sector size + * the inode blocks allocated: XFS_IALLOC_BLOCKS * blocksize + * the inode btree: max depth * blocksize + * the allocation btrees: 2 trees * (max depth - 1) * block size + */ STATIC uint -xfs_calc_create_reservation(xfs_mount_t *mp) +xfs_calc_create_reservation( + struct xfs_mount *mp) { - return XFS_CALC_CREATE_LOG_RES(mp) + XFS_DQUOT_LOGRES(mp); + return XFS_DQUOT_LOGRES(mp) + + MAX((mp->m_sb.sb_inodesize + + mp->m_sb.sb_inodesize + + mp->m_sb.sb_sectsize + + XFS_FSB_TO_B(mp, 1) + + XFS_DIROP_LOG_RES(mp) + + 128 * (3 + XFS_DIROP_LOG_COUNT(mp))), + (3 * mp->m_sb.sb_sectsize + + XFS_FSB_TO_B(mp, XFS_IALLOC_BLOCKS(mp)) + + XFS_FSB_TO_B(mp, mp->m_in_maxlevels) + + XFS_ALLOCFREE_LOG_RES(mp, 1) + + 128 * (2 + XFS_IALLOC_BLOCKS(mp) + mp->m_in_maxlevels + + XFS_ALLOCFREE_LOG_COUNT(mp, 1)))); } +/* + * Making a new directory is the same as creating a new file. 
+ */ STATIC uint -xfs_calc_mkdir_reservation(xfs_mount_t *mp) +xfs_calc_mkdir_reservation( + struct xfs_mount *mp) { - return XFS_CALC_MKDIR_LOG_RES(mp) + XFS_DQUOT_LOGRES(mp); + return xfs_calc_create_reservation(mp); } +/* + * In freeing an inode we can modify: + * the inode being freed: inode size + * the super block free inode counter: sector size + * the agi hash list and counters: sector size + * the inode btree entry: block size + * the on disk inode before ours in the agi hash list: inode cluster size + * the inode btree: max depth * blocksize + * the allocation btrees: 2 trees * (max depth - 1) * block size + */ STATIC uint -xfs_calc_ifree_reservation(xfs_mount_t *mp) +xfs_calc_ifree_reservation( + struct xfs_mount *mp) { - return XFS_CALC_IFREE_LOG_RES(mp) + XFS_DQUOT_LOGRES(mp); + return XFS_DQUOT_LOGRES(mp) + + mp->m_sb.sb_inodesize + + mp->m_sb.sb_sectsize + + mp->m_sb.sb_sectsize + + XFS_FSB_TO_B(mp, 1) + + MAX((__uint16_t)XFS_FSB_TO_B(mp, 1), + XFS_INODE_CLUSTER_SIZE(mp)) + + 128 * 5 + + XFS_ALLOCFREE_LOG_RES(mp, 1) + + 128 * (2 + XFS_IALLOC_BLOCKS(mp) + mp->m_in_maxlevels + + XFS_ALLOCFREE_LOG_COUNT(mp, 1)); } +/* + * When only changing the inode we log the inode and possibly the superblock + * We also add a bit of slop for the transaction stuff. + */ STATIC uint -xfs_calc_ichange_reservation(xfs_mount_t *mp) +xfs_calc_ichange_reservation( + struct xfs_mount *mp) { - return XFS_CALC_ICHANGE_LOG_RES(mp) + XFS_DQUOT_LOGRES(mp); + return XFS_DQUOT_LOGRES(mp) + + mp->m_sb.sb_inodesize + + mp->m_sb.sb_sectsize + + 512; + } +/* + * Growing the data section of the filesystem. + * superblock + * agi and agf + * allocation btrees + */ STATIC uint -xfs_calc_growdata_reservation(xfs_mount_t *mp) +xfs_calc_growdata_reservation( + struct xfs_mount *mp) { - return XFS_CALC_GROWDATA_LOG_RES(mp); + return mp->m_sb.sb_sectsize * 3 + + XFS_ALLOCFREE_LOG_RES(mp, 1) + + 128 * (3 + XFS_ALLOCFREE_LOG_COUNT(mp, 1)); } +/* + * Growing the rt section of the filesystem. 
+ * In the first set of transactions (ALLOC) we allocate space to the + * bitmap or summary files. + * superblock: sector size + * agf of the ag from which the extent is allocated: sector size + * bmap btree for bitmap/summary inode: max depth * blocksize + * bitmap/summary inode: inode size + * allocation btrees for 1 block alloc: 2 * (2 * maxdepth - 1) * blocksize + */ STATIC uint -xfs_calc_growrtalloc_reservation(xfs_mount_t *mp) +xfs_calc_growrtalloc_reservation( + struct xfs_mount *mp) { - return XFS_CALC_GROWRTALLOC_LOG_RES(mp); + return 2 * mp->m_sb.sb_sectsize + + XFS_FSB_TO_B(mp, XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK)) + + mp->m_sb.sb_inodesize + + XFS_ALLOCFREE_LOG_RES(mp, 1) + + 128 * (3 + XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + + XFS_ALLOCFREE_LOG_COUNT(mp, 1)); } +/* + * Growing the rt section of the filesystem. + * In the second set of transactions (ZERO) we zero the new metadata blocks. + * one bitmap/summary block: blocksize + */ STATIC uint -xfs_calc_growrtzero_reservation(xfs_mount_t *mp) +xfs_calc_growrtzero_reservation( + struct xfs_mount *mp) { - return XFS_CALC_GROWRTZERO_LOG_RES(mp); + return mp->m_sb.sb_blocksize + 128; } +/* + * Growing the rt section of the filesystem. + * In the third set of transactions (FREE) we update metadata without + * allocating any new blocks. + * superblock: sector size + * bitmap inode: inode size + * summary inode: inode size + * one bitmap block: blocksize + * summary blocks: new summary size + */ STATIC uint -xfs_calc_growrtfree_reservation(xfs_mount_t *mp) +xfs_calc_growrtfree_reservation( + struct xfs_mount *mp) { - return XFS_CALC_GROWRTFREE_LOG_RES(mp); + return mp->m_sb.sb_sectsize + + 2 * mp->m_sb.sb_inodesize + + mp->m_sb.sb_blocksize + + mp->m_rsumsize + + 128 * 5; } +/* + * Logging the inode modification timestamp on a synchronous write. 
+ * inode + */ STATIC uint -xfs_calc_swrite_reservation(xfs_mount_t *mp) +xfs_calc_swrite_reservation( + struct xfs_mount *mp) { - return XFS_CALC_SWRITE_LOG_RES(mp); + return mp->m_sb.sb_inodesize + 128; } +/* + * Logging the inode mode bits when writing a setuid/setgid file + * inode + */ STATIC uint xfs_calc_writeid_reservation(xfs_mount_t *mp) { - return XFS_CALC_WRITEID_LOG_RES(mp); + return mp->m_sb.sb_inodesize + 128; } +/* + * Converting the inode from non-attributed to attributed. + * the inode being converted: inode size + * agf block and superblock (for block allocation) + * the new block (directory sized) + * bmap blocks for the new directory block + * allocation btrees + */ STATIC uint -xfs_calc_addafork_reservation(xfs_mount_t *mp) +xfs_calc_addafork_reservation( + struct xfs_mount *mp) { - return XFS_CALC_ADDAFORK_LOG_RES(mp) + XFS_DQUOT_LOGRES(mp); + return XFS_DQUOT_LOGRES(mp) + + mp->m_sb.sb_inodesize + + mp->m_sb.sb_sectsize * 2 + + mp->m_dirblksize + + XFS_FSB_TO_B(mp, XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK) + 1) + + XFS_ALLOCFREE_LOG_RES(mp, 1) + + 128 * (4 + XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK) + 1 + + XFS_ALLOCFREE_LOG_COUNT(mp, 1)); } +/* + * Removing the attribute fork of a file + * the inode being truncated: inode size + * the inode's bmap btree: max depth * block size + * And the bmap_finish transaction can free the blocks and bmap blocks: + * the agf for each of the ags: 4 * sector size + * the agfl for each of the ags: 4 * sector size + * the super block to reflect the freed blocks: sector size + * worst case split in allocation btrees per extent assuming 4 extents: + * 4 exts * 2 trees * (2 * max depth - 1) * block size + */ STATIC uint -xfs_calc_attrinval_reservation(xfs_mount_t *mp) +xfs_calc_attrinval_reservation( + struct xfs_mount *mp) { - return XFS_CALC_ATTRINVAL_LOG_RES(mp); + return MAX((mp->m_sb.sb_inodesize + + XFS_FSB_TO_B(mp, XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK)) + + 128 * (1 + XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))), + (4 * 
mp->m_sb.sb_sectsize + + 4 * mp->m_sb.sb_sectsize + + mp->m_sb.sb_sectsize + + XFS_ALLOCFREE_LOG_RES(mp, 4) + + 128 * (9 + XFS_ALLOCFREE_LOG_COUNT(mp, 4)))); } +/* + * Setting an attribute. + * the inode getting the attribute + * the superblock for allocations + * the agfs extents are allocated from + * the attribute btree * max depth + * the inode allocation btree + * Since attribute transaction space is dependent on the size of the attribute, + * the calculation is done partially at mount time and partially at runtime. + */ STATIC uint -xfs_calc_attrset_reservation(xfs_mount_t *mp) +xfs_calc_attrset_reservation( + struct xfs_mount *mp) { - return XFS_CALC_ATTRSET_LOG_RES(mp) + XFS_DQUOT_LOGRES(mp); + return XFS_DQUOT_LOGRES(mp) + + mp->m_sb.sb_inodesize + + mp->m_sb.sb_sectsize + + XFS_FSB_TO_B(mp, XFS_DA_NODE_MAXDEPTH) + + 128 * (2 + XFS_DA_NODE_MAXDEPTH); } +/* + * Removing an attribute. + * the inode: inode size + * the attribute btree could join: max depth * block size + * the inode bmap btree could join or split: max depth * block size + * And the bmap_finish transaction can free the attr blocks freed giving: + * the agf for the ag in which the blocks live: 2 * sector size + * the agfl for the ag in which the blocks live: 2 * sector size + * the superblock for the free block count: sector size + * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size + */ STATIC uint -xfs_calc_attrrm_reservation(xfs_mount_t *mp) +xfs_calc_attrrm_reservation( + struct xfs_mount *mp) { - return XFS_CALC_ATTRRM_LOG_RES(mp) + XFS_DQUOT_LOGRES(mp); + return XFS_DQUOT_LOGRES(mp) + + MAX((mp->m_sb.sb_inodesize + + XFS_FSB_TO_B(mp, XFS_DA_NODE_MAXDEPTH) + + XFS_FSB_TO_B(mp, XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK)) + + 128 * (1 + XFS_DA_NODE_MAXDEPTH + + XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK))), + (2 * mp->m_sb.sb_sectsize + + 2 * mp->m_sb.sb_sectsize + + mp->m_sb.sb_sectsize + + XFS_ALLOCFREE_LOG_RES(mp, 2) + + 128 * (5 + XFS_ALLOCFREE_LOG_COUNT(mp, 2)))); } +/* + * 
Clearing a bad agino number in an agi hash bucket. + */ STATIC uint -xfs_calc_clear_agi_bucket_reservation(xfs_mount_t *mp) +xfs_calc_clear_agi_bucket_reservation( + struct xfs_mount *mp) { - return XFS_CALC_CLEAR_AGI_BUCKET_LOG_RES(mp); + return mp->m_sb.sb_sectsize + 128; } /* @@ -185,11 +540,10 @@ xfs_calc_clear_agi_bucket_reservation(xf */ void xfs_trans_init( - xfs_mount_t *mp) + struct xfs_mount *mp) { - xfs_trans_reservations_t *resp; + struct xfs_trans_reservations *resp = &mp->m_reservations; - resp = &(mp->m_reservations); resp->tr_write = xfs_calc_write_reservation(mp); resp->tr_itruncate = xfs_calc_itruncate_reservation(mp); resp->tr_rename = xfs_calc_rename_reservation(mp); Index: xfs/fs/xfs/xfs_trans.h =================================================================== --- xfs.orig/fs/xfs/xfs_trans.h 2010-05-04 15:47:36.150003848 +0200 +++ xfs/fs/xfs/xfs_trans.h 2010-05-04 15:47:55.115254931 +0200 @@ -300,24 +300,6 @@ xfs_lic_desc_to_chunk(xfs_log_item_desc_ /* - * Various log reservation values. - * These are based on the size of the file system block - * because that is what most transactions manipulate. - * Each adds in an additional 128 bytes per item logged to - * try to account for the overhead of the transaction mechanism. - * - * Note: - * Most of the reservations underestimate the number of allocation - * groups into which they could free extents in the xfs_bmap_finish() - * call. This is because the number in the worst case is quite high - * and quite unusual. In order to fix this we need to change - * xfs_bmap_finish() to free extents in only a single AG at a time. - * This will require changes to the EFI code as well, however, so that - * the EFI for the extents not freed is logged again in each transaction. - * See bug 261917. - */ - -/* * Per-extent log reservation for the allocation btree changes * involved in freeing or allocating an extent. 
* 2 trees * (2 blocks/level * max depth - 1) * block size @@ -341,429 +323,36 @@ xfs_lic_desc_to_chunk(xfs_log_item_desc_ (XFS_DAENTER_BLOCKS(mp, XFS_DATA_FORK) + \ XFS_DAENTER_BMAPS(mp, XFS_DATA_FORK) + 1) -/* - * In a write transaction we can allocate a maximum of 2 - * extents. This gives: - * the inode getting the new extents: inode size - * the inode's bmap btree: max depth * block size - * the agfs of the ags from which the extents are allocated: 2 * sector - * the superblock free block counter: sector size - * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size - * And the bmap_finish transaction can free bmap blocks in a join: - * the agfs of the ags containing the blocks: 2 * sector size - * the agfls of the ags containing the blocks: 2 * sector size - * the super block free block counter: sector size - * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size - */ -#define XFS_CALC_WRITE_LOG_RES(mp) \ - (MAX( \ - ((mp)->m_sb.sb_inodesize + \ - XFS_FSB_TO_B((mp), XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK)) + \ - (2 * (mp)->m_sb.sb_sectsize) + \ - (mp)->m_sb.sb_sectsize + \ - XFS_ALLOCFREE_LOG_RES(mp, 2) + \ - (128 * (4 + XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + XFS_ALLOCFREE_LOG_COUNT(mp, 2)))),\ - ((2 * (mp)->m_sb.sb_sectsize) + \ - (2 * (mp)->m_sb.sb_sectsize) + \ - (mp)->m_sb.sb_sectsize + \ - XFS_ALLOCFREE_LOG_RES(mp, 2) + \ - (128 * (5 + XFS_ALLOCFREE_LOG_COUNT(mp, 2)))))) #define XFS_WRITE_LOG_RES(mp) ((mp)->m_reservations.tr_write) - -/* - * In truncating a file we free up to two extents at once. 
We can modify: - * the inode being truncated: inode size - * the inode's bmap btree: (max depth + 1) * block size - * And the bmap_finish transaction can free the blocks and bmap blocks: - * the agf for each of the ags: 4 * sector size - * the agfl for each of the ags: 4 * sector size - * the super block to reflect the freed blocks: sector size - * worst case split in allocation btrees per extent assuming 4 extents: - * 4 exts * 2 trees * (2 * max depth - 1) * block size - * the inode btree: max depth * blocksize - * the allocation btrees: 2 trees * (max depth - 1) * block size - */ -#define XFS_CALC_ITRUNCATE_LOG_RES(mp) \ - (MAX( \ - ((mp)->m_sb.sb_inodesize + \ - XFS_FSB_TO_B((mp), XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1) + \ - (128 * (2 + XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK)))), \ - ((4 * (mp)->m_sb.sb_sectsize) + \ - (4 * (mp)->m_sb.sb_sectsize) + \ - (mp)->m_sb.sb_sectsize + \ - XFS_ALLOCFREE_LOG_RES(mp, 4) + \ - (128 * (9 + XFS_ALLOCFREE_LOG_COUNT(mp, 4))) + \ - (128 * 5) + \ - XFS_ALLOCFREE_LOG_RES(mp, 1) + \ - (128 * (2 + XFS_IALLOC_BLOCKS(mp) + (mp)->m_in_maxlevels + \ - XFS_ALLOCFREE_LOG_COUNT(mp, 1)))))) - #define XFS_ITRUNCATE_LOG_RES(mp) ((mp)->m_reservations.tr_itruncate) - -/* - * In renaming a files we can modify: - * the four inodes involved: 4 * inode size - * the two directory btrees: 2 * (max depth + v2) * dir block size - * the two directory bmap btrees: 2 * max depth * block size - * And the bmap_finish transaction can free dir and bmap blocks (two sets - * of bmap blocks) giving: - * the agf for the ags in which the blocks live: 3 * sector size - * the agfl for the ags in which the blocks live: 3 * sector size - * the superblock for the free block count: sector size - * the allocation btrees: 3 exts * 2 trees * (2 * max depth - 1) * block size - */ -#define XFS_CALC_RENAME_LOG_RES(mp) \ - (MAX( \ - ((4 * (mp)->m_sb.sb_inodesize) + \ - (2 * XFS_DIROP_LOG_RES(mp)) + \ - (128 * (4 + 2 * XFS_DIROP_LOG_COUNT(mp)))), \ - ((3 * 
(mp)->m_sb.sb_sectsize) + \ - (3 * (mp)->m_sb.sb_sectsize) + \ - (mp)->m_sb.sb_sectsize + \ - XFS_ALLOCFREE_LOG_RES(mp, 3) + \ - (128 * (7 + XFS_ALLOCFREE_LOG_COUNT(mp, 3)))))) - #define XFS_RENAME_LOG_RES(mp) ((mp)->m_reservations.tr_rename) - -/* - * For creating a link to an inode: - * the parent directory inode: inode size - * the linked inode: inode size - * the directory btree could split: (max depth + v2) * dir block size - * the directory bmap btree could join or split: (max depth + v2) * blocksize - * And the bmap_finish transaction can free some bmap blocks giving: - * the agf for the ag in which the blocks live: sector size - * the agfl for the ag in which the blocks live: sector size - * the superblock for the free block count: sector size - * the allocation btrees: 2 trees * (2 * max depth - 1) * block size - */ -#define XFS_CALC_LINK_LOG_RES(mp) \ - (MAX( \ - ((mp)->m_sb.sb_inodesize + \ - (mp)->m_sb.sb_inodesize + \ - XFS_DIROP_LOG_RES(mp) + \ - (128 * (2 + XFS_DIROP_LOG_COUNT(mp)))), \ - ((mp)->m_sb.sb_sectsize + \ - (mp)->m_sb.sb_sectsize + \ - (mp)->m_sb.sb_sectsize + \ - XFS_ALLOCFREE_LOG_RES(mp, 1) + \ - (128 * (3 + XFS_ALLOCFREE_LOG_COUNT(mp, 1)))))) - #define XFS_LINK_LOG_RES(mp) ((mp)->m_reservations.tr_link) - -/* - * For removing a directory entry we can modify: - * the parent directory inode: inode size - * the removed inode: inode size - * the directory btree could join: (max depth + v2) * dir block size - * the directory bmap btree could join or split: (max depth + v2) * blocksize - * And the bmap_finish transaction can free the dir and bmap blocks giving: - * the agf for the ag in which the blocks live: 2 * sector size - * the agfl for the ag in which the blocks live: 2 * sector size - * the superblock for the free block count: sector size - * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size - */ -#define XFS_CALC_REMOVE_LOG_RES(mp) \ - (MAX( \ - ((mp)->m_sb.sb_inodesize + \ - (mp)->m_sb.sb_inodesize + \ - 
XFS_DIROP_LOG_RES(mp) + \ - (128 * (2 + XFS_DIROP_LOG_COUNT(mp)))), \ - ((2 * (mp)->m_sb.sb_sectsize) + \ - (2 * (mp)->m_sb.sb_sectsize) + \ - (mp)->m_sb.sb_sectsize + \ - XFS_ALLOCFREE_LOG_RES(mp, 2) + \ - (128 * (5 + XFS_ALLOCFREE_LOG_COUNT(mp, 2)))))) - #define XFS_REMOVE_LOG_RES(mp) ((mp)->m_reservations.tr_remove) - -/* - * For symlink we can modify: - * the parent directory inode: inode size - * the new inode: inode size - * the inode btree entry: 1 block - * the directory btree: (max depth + v2) * dir block size - * the directory inode's bmap btree: (max depth + v2) * block size - * the blocks for the symlink: 1 kB - * Or in the first xact we allocate some inodes giving: - * the agi and agf of the ag getting the new inodes: 2 * sectorsize - * the inode blocks allocated: XFS_IALLOC_BLOCKS * blocksize - * the inode btree: max depth * blocksize - * the allocation btrees: 2 trees * (2 * max depth - 1) * block size - */ -#define XFS_CALC_SYMLINK_LOG_RES(mp) \ - (MAX( \ - ((mp)->m_sb.sb_inodesize + \ - (mp)->m_sb.sb_inodesize + \ - XFS_FSB_TO_B(mp, 1) + \ - XFS_DIROP_LOG_RES(mp) + \ - 1024 + \ - (128 * (4 + XFS_DIROP_LOG_COUNT(mp)))), \ - (2 * (mp)->m_sb.sb_sectsize + \ - XFS_FSB_TO_B((mp), XFS_IALLOC_BLOCKS((mp))) + \ - XFS_FSB_TO_B((mp), (mp)->m_in_maxlevels) + \ - XFS_ALLOCFREE_LOG_RES(mp, 1) + \ - (128 * (2 + XFS_IALLOC_BLOCKS(mp) + (mp)->m_in_maxlevels + \ - XFS_ALLOCFREE_LOG_COUNT(mp, 1)))))) - #define XFS_SYMLINK_LOG_RES(mp) ((mp)->m_reservations.tr_symlink) - -/* - * For create we can modify: - * the parent directory inode: inode size - * the new inode: inode size - * the inode btree entry: block size - * the superblock for the nlink flag: sector size - * the directory btree: (max depth + v2) * dir block size - * the directory inode's bmap btree: (max depth + v2) * block size - * Or in the first xact we allocate some inodes giving: - * the agi and agf of the ag getting the new inodes: 2 * sectorsize - * the superblock for the nlink flag: sector size - * 
the inode blocks allocated: XFS_IALLOC_BLOCKS * blocksize - * the inode btree: max depth * blocksize - * the allocation btrees: 2 trees * (max depth - 1) * block size - */ -#define XFS_CALC_CREATE_LOG_RES(mp) \ - (MAX( \ - ((mp)->m_sb.sb_inodesize + \ - (mp)->m_sb.sb_inodesize + \ - (mp)->m_sb.sb_sectsize + \ - XFS_FSB_TO_B(mp, 1) + \ - XFS_DIROP_LOG_RES(mp) + \ - (128 * (3 + XFS_DIROP_LOG_COUNT(mp)))), \ - (3 * (mp)->m_sb.sb_sectsize + \ - XFS_FSB_TO_B((mp), XFS_IALLOC_BLOCKS((mp))) + \ - XFS_FSB_TO_B((mp), (mp)->m_in_maxlevels) + \ - XFS_ALLOCFREE_LOG_RES(mp, 1) + \ - (128 * (2 + XFS_IALLOC_BLOCKS(mp) + (mp)->m_in_maxlevels + \ - XFS_ALLOCFREE_LOG_COUNT(mp, 1)))))) - #define XFS_CREATE_LOG_RES(mp) ((mp)->m_reservations.tr_create) - -/* - * Making a new directory is the same as creating a new file. - */ -#define XFS_CALC_MKDIR_LOG_RES(mp) XFS_CALC_CREATE_LOG_RES(mp) - #define XFS_MKDIR_LOG_RES(mp) ((mp)->m_reservations.tr_mkdir) - -/* - * In freeing an inode we can modify: - * the inode being freed: inode size - * the super block free inode counter: sector size - * the agi hash list and counters: sector size - * the inode btree entry: block size - * the on disk inode before ours in the agi hash list: inode cluster size - * the inode btree: max depth * blocksize - * the allocation btrees: 2 trees * (max depth - 1) * block size - */ -#define XFS_CALC_IFREE_LOG_RES(mp) \ - ((mp)->m_sb.sb_inodesize + \ - (mp)->m_sb.sb_sectsize + \ - (mp)->m_sb.sb_sectsize + \ - XFS_FSB_TO_B((mp), 1) + \ - MAX((__uint16_t)XFS_FSB_TO_B((mp), 1), XFS_INODE_CLUSTER_SIZE(mp)) + \ - (128 * 5) + \ - XFS_ALLOCFREE_LOG_RES(mp, 1) + \ - (128 * (2 + XFS_IALLOC_BLOCKS(mp) + (mp)->m_in_maxlevels + \ - XFS_ALLOCFREE_LOG_COUNT(mp, 1)))) - - #define XFS_IFREE_LOG_RES(mp) ((mp)->m_reservations.tr_ifree) - -/* - * When only changing the inode we log the inode and possibly the superblock - * We also add a bit of slop for the transaction stuff. 
- */ -#define XFS_CALC_ICHANGE_LOG_RES(mp) ((mp)->m_sb.sb_inodesize + \ - (mp)->m_sb.sb_sectsize + 512) - #define XFS_ICHANGE_LOG_RES(mp) ((mp)->m_reservations.tr_ichange) - -/* - * Growing the data section of the filesystem. - * superblock - * agi and agf - * allocation btrees - */ -#define XFS_CALC_GROWDATA_LOG_RES(mp) \ - ((mp)->m_sb.sb_sectsize * 3 + \ - XFS_ALLOCFREE_LOG_RES(mp, 1) + \ - (128 * (3 + XFS_ALLOCFREE_LOG_COUNT(mp, 1)))) - #define XFS_GROWDATA_LOG_RES(mp) ((mp)->m_reservations.tr_growdata) - -/* - * Growing the rt section of the filesystem. - * In the first set of transactions (ALLOC) we allocate space to the - * bitmap or summary files. - * superblock: sector size - * agf of the ag from which the extent is allocated: sector size - * bmap btree for bitmap/summary inode: max depth * blocksize - * bitmap/summary inode: inode size - * allocation btrees for 1 block alloc: 2 * (2 * maxdepth - 1) * blocksize - */ -#define XFS_CALC_GROWRTALLOC_LOG_RES(mp) \ - (2 * (mp)->m_sb.sb_sectsize + \ - XFS_FSB_TO_B((mp), XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK)) + \ - (mp)->m_sb.sb_inodesize + \ - XFS_ALLOCFREE_LOG_RES(mp, 1) + \ - (128 * \ - (3 + XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + \ - XFS_ALLOCFREE_LOG_COUNT(mp, 1)))) - #define XFS_GROWRTALLOC_LOG_RES(mp) ((mp)->m_reservations.tr_growrtalloc) - -/* - * Growing the rt section of the filesystem. - * In the second set of transactions (ZERO) we zero the new metadata blocks. - * one bitmap/summary block: blocksize - */ -#define XFS_CALC_GROWRTZERO_LOG_RES(mp) \ - ((mp)->m_sb.sb_blocksize + 128) - #define XFS_GROWRTZERO_LOG_RES(mp) ((mp)->m_reservations.tr_growrtzero) - -/* - * Growing the rt section of the filesystem. - * In the third set of transactions (FREE) we update metadata without - * allocating any new blocks. 
- * superblock: sector size - * bitmap inode: inode size - * summary inode: inode size - * one bitmap block: blocksize - * summary blocks: new summary size - */ -#define XFS_CALC_GROWRTFREE_LOG_RES(mp) \ - ((mp)->m_sb.sb_sectsize + \ - 2 * (mp)->m_sb.sb_inodesize + \ - (mp)->m_sb.sb_blocksize + \ - (mp)->m_rsumsize + \ - (128 * 5)) - #define XFS_GROWRTFREE_LOG_RES(mp) ((mp)->m_reservations.tr_growrtfree) - -/* - * Logging the inode modification timestamp on a synchronous write. - * inode - */ -#define XFS_CALC_SWRITE_LOG_RES(mp) \ - ((mp)->m_sb.sb_inodesize + 128) - #define XFS_SWRITE_LOG_RES(mp) ((mp)->m_reservations.tr_swrite) - /* * Logging the inode timestamps on an fsync -- same as SWRITE * as long as SWRITE logs the entire inode core */ #define XFS_FSYNC_TS_LOG_RES(mp) ((mp)->m_reservations.tr_swrite) - -/* - * Logging the inode mode bits when writing a setuid/setgid file - * inode - */ -#define XFS_CALC_WRITEID_LOG_RES(mp) \ - ((mp)->m_sb.sb_inodesize + 128) - #define XFS_WRITEID_LOG_RES(mp) ((mp)->m_reservations.tr_swrite) - -/* - * Converting the inode from non-attributed to attributed. 
- * the inode being converted: inode size - * agf block and superblock (for block allocation) - * the new block (directory sized) - * bmap blocks for the new directory block - * allocation btrees - */ -#define XFS_CALC_ADDAFORK_LOG_RES(mp) \ - ((mp)->m_sb.sb_inodesize + \ - (mp)->m_sb.sb_sectsize * 2 + \ - (mp)->m_dirblksize + \ - XFS_FSB_TO_B(mp, (XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK) + 1)) + \ - XFS_ALLOCFREE_LOG_RES(mp, 1) + \ - (128 * (4 + (XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK) + 1) + \ - XFS_ALLOCFREE_LOG_COUNT(mp, 1)))) - #define XFS_ADDAFORK_LOG_RES(mp) ((mp)->m_reservations.tr_addafork) - -/* - * Removing the attribute fork of a file - * the inode being truncated: inode size - * the inode's bmap btree: max depth * block size - * And the bmap_finish transaction can free the blocks and bmap blocks: - * the agf for each of the ags: 4 * sector size - * the agfl for each of the ags: 4 * sector size - * the super block to reflect the freed blocks: sector size - * worst case split in allocation btrees per extent assuming 4 extents: - * 4 exts * 2 trees * (2 * max depth - 1) * block size - */ -#define XFS_CALC_ATTRINVAL_LOG_RES(mp) \ - (MAX( \ - ((mp)->m_sb.sb_inodesize + \ - XFS_FSB_TO_B((mp), XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK)) + \ - (128 * (1 + XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK)))), \ - ((4 * (mp)->m_sb.sb_sectsize) + \ - (4 * (mp)->m_sb.sb_sectsize) + \ - (mp)->m_sb.sb_sectsize + \ - XFS_ALLOCFREE_LOG_RES(mp, 4) + \ - (128 * (9 + XFS_ALLOCFREE_LOG_COUNT(mp, 4)))))) - #define XFS_ATTRINVAL_LOG_RES(mp) ((mp)->m_reservations.tr_attrinval) - -/* - * Setting an attribute. - * the inode getting the attribute - * the superblock for allocations - * the agfs extents are allocated from - * the attribute btree * max depth - * the inode allocation btree - * Since attribute transaction space is dependent on the size of the attribute, - * the calculation is done partially at mount time and partially at runtime. 
- */ -#define XFS_CALC_ATTRSET_LOG_RES(mp) \ - ((mp)->m_sb.sb_inodesize + \ - (mp)->m_sb.sb_sectsize + \ - XFS_FSB_TO_B((mp), XFS_DA_NODE_MAXDEPTH) + \ - (128 * (2 + XFS_DA_NODE_MAXDEPTH))) - #define XFS_ATTRSET_LOG_RES(mp, ext) \ ((mp)->m_reservations.tr_attrset + \ (ext * (mp)->m_sb.sb_sectsize) + \ (ext * XFS_FSB_TO_B((mp), XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))) + \ (128 * (ext + (ext * XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))))) - -/* - * Removing an attribute. - * the inode: inode size - * the attribute btree could join: max depth * block size - * the inode bmap btree could join or split: max depth * block size - * And the bmap_finish transaction can free the attr blocks freed giving: - * the agf for the ag in which the blocks live: 2 * sector size - * the agfl for the ag in which the blocks live: 2 * sector size - * the superblock for the free block count: sector size - * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size - */ -#define XFS_CALC_ATTRRM_LOG_RES(mp) \ - (MAX( \ - ((mp)->m_sb.sb_inodesize + \ - XFS_FSB_TO_B((mp), XFS_DA_NODE_MAXDEPTH) + \ - XFS_FSB_TO_B((mp), XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK)) + \ - (128 * (1 + XFS_DA_NODE_MAXDEPTH + XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK)))), \ - ((2 * (mp)->m_sb.sb_sectsize) + \ - (2 * (mp)->m_sb.sb_sectsize) + \ - (mp)->m_sb.sb_sectsize + \ - XFS_ALLOCFREE_LOG_RES(mp, 2) + \ - (128 * (5 + XFS_ALLOCFREE_LOG_COUNT(mp, 2)))))) - #define XFS_ATTRRM_LOG_RES(mp) ((mp)->m_reservations.tr_attrrm) - -/* - * Clearing a bad agino number in an agi hash bucket. 
- */ -#define XFS_CALC_CLEAR_AGI_BUCKET_LOG_RES(mp) \ - ((mp)->m_sb.sb_sectsize + 128) - #define XFS_CLEAR_AGI_BUCKET_LOG_RES(mp) ((mp)->m_reservations.tr_clearagi) From BATV+f20da4fa05706f3678eb+2445+infradead.org+hch@bombadil.srs.infradead.org Tue May 4 08:58:03 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-4.6 required=5.0 tests=BAYES_00,J_CHICKENPOX_61, J_CHICKENPOX_64,J_CHICKENPOX_66,LOCAL_GNU_PATCH autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o44Dw2Oa230669 for ; Tue, 4 May 2010 08:58:03 -0500 X-ASG-Debug-ID: 1272981611-1c1702b40000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 9858530ACED for ; Tue, 4 May 2010 07:00:11 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id oUAFpicT6sQ8TekY for ; Tue, 04 May 2010 07:00:11 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1O9IfT-00078L-5S for xfs@oss.sgi.com; Tue, 04 May 2010 14:00:11 +0000 Date: Tue, 4 May 2010 10:00:11 -0400 From: Christoph Hellwig To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH] xfs: simplify log item descriptor tracking Subject: [PATCH] xfs: simplify log item descriptor tracking Message-ID: <20100504140011.GA20656@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1272981611 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, 
clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Currently we track the log item descriptors belonging to a transaction using a complex open-coded chunk allocator. This code has been there since day one and seems to work around the lack of an efficient slab allocator. This patch replaces it with dynamically allocated log item descriptors from a dedicated slab pool, linked to the transaction by a linked list. This greatly simplifies the log item descriptor tracking, to the point where it's just a couple hundred lines in xfs_trans.c instead of a separate file. The external API has also been simplified while we're at it - the xfs_trans_add_item and xfs_trans_del_item functions to add/delete items from a transaction have been simplified to the bare minimum, and the xfs_trans_find_item function is replaced with a direct dereference of the li_desc field. All debug code walking the list of log items in a transaction is down to a simple list_for_each_entry. Note that we could easily use a singly linked list here instead of the doubly linked list from list.h, as the fastpath only does deletion from sequential traversal. But given that we don't have one available as a library function yet I use the list.h functions for simplicity.
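For readers who want the shape of the new scheme without reading the whole diff, here is a minimal userspace sketch of list-based descriptor tracking. It is an illustration under stated assumptions, not the code from this patch: the field names (t_items, lid_trans, lid_item, lid_flags, li_desc) follow the diff, but the struct layouts, the helper names (trans_add_item, trans_del_item, trans_count_dirty), the LID_DIRTY value and the small list implementation are hypothetical stand-ins, and calloc/free replace the dedicated slab pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Tiny stand-in for the kernel's list.h (circular doubly linked list). */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

static void list_del(struct list_head *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->next = node->prev = NULL;
}

#define LID_DIRTY 0x1	/* assumed flag value, for illustration only */

struct log_item;

/* One descriptor per log item joined to a transaction. */
struct log_item_desc {
	struct log_item		*lid_item;	/* item this descriptor tracks */
	struct list_head	lid_trans;	/* link in the transaction's list */
	unsigned int		lid_flags;
};

struct log_item {
	int			li_type;
	struct log_item_desc	*li_desc;	/* back pointer, NULL if not joined */
};

struct trans {
	struct list_head	t_items;	/* list of log_item_desc */
};

static void trans_init(struct trans *tp)
{
	list_init(&tp->t_items);
}

/* Allocate a descriptor, link it to the transaction, point the item at it. */
static void trans_add_item(struct trans *tp, struct log_item *lip)
{
	struct log_item_desc *lidp = calloc(1, sizeof(*lidp));

	if (!lidp)
		abort();	/* sketch only: no error handling */
	lidp->lid_item = lip;
	list_add_tail(&lidp->lid_trans, &tp->t_items);
	lip->li_desc = lidp;
}

/* No lookup needed: the item already knows its descriptor. */
static void trans_del_item(struct log_item *lip)
{
	struct log_item_desc *lidp = lip->li_desc;

	list_del(&lidp->lid_trans);
	free(lidp);
	lip->li_desc = NULL;
}

/* Walk all descriptors; an empty transaction simply yields zero iterations. */
static int trans_count_dirty(struct trans *tp)
{
	struct list_head *pos;
	int count = 0;

	for (pos = tp->t_items.next; pos != &tp->t_items; pos = pos->next) {
		struct log_item_desc *lidp = (struct log_item_desc *)
			((char *)pos - offsetof(struct log_item_desc, lid_trans));

		if (lidp->lid_flags & LID_DIRTY)
			count++;
	}
	return count;
}
```

Marking an item dirty then becomes a direct dereference such as lip->li_desc->lid_flags |= LID_DIRTY, which is the pattern the converted callers in the diff use in place of xfs_trans_find_item. The walk also needs no special case for a transaction with no items, unlike the chunk-based iteration.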
[the patch applies ontop of Dave's delayed-logging branch] Signed-off-by: Christoph Hellwig --- fs/xfs/xfs_trans_item.c | 441 ------------------------------------- xfs/fs/xfs/Makefile | 1 xfs/fs/xfs/linux-2.6/xfs_super.c | 11 xfs/fs/xfs/quota/xfs_trans_dquot.c | 25 -- xfs/fs/xfs/xfs_bmap.c | 43 --- xfs/fs/xfs/xfs_buf_item.c | 5 xfs/fs/xfs/xfs_extfree_item.c | 8 xfs/fs/xfs/xfs_trans.c | 218 +++++++++++------- xfs/fs/xfs/xfs_trans.h | 105 -------- xfs/fs/xfs/xfs_trans_buf.c | 64 +---- xfs/fs/xfs/xfs_trans_extfree.c | 22 - xfs/fs/xfs/xfs_trans_inode.c | 9 xfs/fs/xfs/xfs_trans_priv.h | 15 - 13 files changed, 200 insertions(+), 767 deletions(-) Index: xfs/fs/xfs/quota/xfs_trans_dquot.c =================================================================== --- xfs.orig/fs/xfs/quota/xfs_trans_dquot.c 2010-05-04 15:47:36.063023404 +0200 +++ xfs/fs/xfs/quota/xfs_trans_dquot.c 2010-05-04 15:51:53.153255000 +0200 @@ -59,16 +59,14 @@ xfs_trans_dqjoin( xfs_trans_t *tp, xfs_dquot_t *dqp) { - xfs_dq_logitem_t *lp = &dqp->q_logitem; - ASSERT(dqp->q_transp != tp); ASSERT(XFS_DQ_IS_LOCKED(dqp)); - ASSERT(lp->qli_dquot == dqp); + ASSERT(dqp->q_logitem.qli_dquot == dqp); /* * Get a log_item_desc to point at the new item. */ - (void) xfs_trans_add_item(tp, (xfs_log_item_t*)(lp)); + xfs_trans_add_item(tp, &dqp->q_logitem.qli_item); /* * Initialize i_transp so we can later determine if this dquot is @@ -93,16 +91,11 @@ xfs_trans_log_dquot( xfs_trans_t *tp, xfs_dquot_t *dqp) { - xfs_log_item_desc_t *lidp; - ASSERT(dqp->q_transp == tp); ASSERT(XFS_DQ_IS_LOCKED(dqp)); - lidp = xfs_trans_find_item(tp, (xfs_log_item_t*)(&dqp->q_logitem)); - ASSERT(lidp != NULL); - tp->t_flags |= XFS_TRANS_DIRTY; - lidp->lid_flags |= XFS_LID_DIRTY; + dqp->q_logitem.qli_item.li_desc->lid_flags |= XFS_LID_DIRTY; } /* @@ -874,9 +867,8 @@ xfs_trans_get_qoff_item( /* * Get a log_item_desc to point at the new item. 
*/ - (void) xfs_trans_add_item(tp, (xfs_log_item_t*)q); - - return (q); + xfs_trans_add_item(tp, &q->qql_item); + return q; } @@ -890,13 +882,8 @@ xfs_trans_log_quotaoff_item( xfs_trans_t *tp, xfs_qoff_logitem_t *qlp) { - xfs_log_item_desc_t *lidp; - - lidp = xfs_trans_find_item(tp, (xfs_log_item_t *)qlp); - ASSERT(lidp != NULL); - tp->t_flags |= XFS_TRANS_DIRTY; - lidp->lid_flags |= XFS_LID_DIRTY; + qlp->qql_item.li_desc->lid_flags |= XFS_LID_DIRTY; } STATIC void Index: xfs/fs/xfs/xfs_buf_item.c =================================================================== --- xfs.orig/fs/xfs/xfs_buf_item.c 2010-05-04 15:47:36.071004407 +0200 +++ xfs/fs/xfs/xfs_buf_item.c 2010-05-04 15:51:53.153255000 +0200 @@ -461,13 +461,10 @@ xfs_buf_item_unpin_remove( * occurs later in the xfs_trans_uncommit() will try to * reference the buffer which we no longer have a hold on. */ - struct xfs_log_item_desc *lidp; - ASSERT(XFS_BUF_VALUSEMA(bip->bli_buf) <= 0); trace_xfs_buf_item_unpin_stale(bip); - lidp = xfs_trans_find_item(tp, (xfs_log_item_t *)bip); - xfs_trans_free_item(tp, lidp); + xfs_trans_del_item(&bip->bli_item); /* * Since the transaction no longer refers to the buffer, the Index: xfs/fs/xfs/xfs_extfree_item.c =================================================================== --- xfs.orig/fs/xfs/xfs_extfree_item.c 2010-05-04 15:47:36.079004407 +0200 +++ xfs/fs/xfs/xfs_extfree_item.c 2010-05-04 15:51:53.160253464 +0200 @@ -132,18 +132,18 @@ STATIC void xfs_efi_item_unpin_remove(xfs_efi_log_item_t *efip, xfs_trans_t *tp) { struct xfs_ail *ailp = efip->efi_item.li_ailp; - xfs_log_item_desc_t *lidp; spin_lock(&ailp->xa_lock); if (efip->efi_flags & XFS_EFI_CANCELED) { + struct xfs_log_item *lip = &efip->efi_item; + /* * free the xaction descriptor pointing to this item */ - lidp = xfs_trans_find_item(tp, (xfs_log_item_t *) efip); - xfs_trans_free_item(tp, lidp); + xfs_trans_del_item(lip); /* xfs_trans_ail_delete() drops the AIL lock. 
*/ - xfs_trans_ail_delete(ailp, (xfs_log_item_t *)efip); + xfs_trans_ail_delete(ailp, lip); xfs_efi_item_free(efip); } else { efip->efi_flags |= XFS_EFI_COMMITTED; Index: xfs/fs/xfs/xfs_trans_buf.c =================================================================== --- xfs.orig/fs/xfs/xfs_trans_buf.c 2010-05-04 15:47:36.088024871 +0200 +++ xfs/fs/xfs/xfs_trans_buf.c 2010-05-04 15:51:53.164254512 +0200 @@ -51,36 +51,17 @@ xfs_trans_buf_item_match( xfs_daddr_t blkno, int len) { - xfs_log_item_chunk_t *licp; - xfs_log_item_desc_t *lidp; - xfs_buf_log_item_t *blip; - int i; + struct xfs_log_item_desc *lidp; + struct xfs_buf_log_item *blip; len = BBTOB(len); - for (licp = &tp->t_items; licp != NULL; licp = licp->lic_next) { - if (xfs_lic_are_all_free(licp)) { - ASSERT(licp == &tp->t_items); - ASSERT(licp->lic_next == NULL); - return NULL; - } - - for (i = 0; i < licp->lic_unused; i++) { - /* - * Skip unoccupied slots. - */ - if (xfs_lic_isfree(licp, i)) - continue; - - lidp = xfs_lic_slot(licp, i); - blip = (xfs_buf_log_item_t *)lidp->lid_item; - if (blip->bli_item.li_type != XFS_LI_BUF) - continue; - - if (XFS_BUF_TARGET(blip->bli_buf) == target && - XFS_BUF_ADDR(blip->bli_buf) == blkno && - XFS_BUF_COUNT(blip->bli_buf) == len) - return blip->bli_buf; - } + list_for_each_entry(lidp, &tp->t_items, lid_trans) { + blip = (struct xfs_buf_log_item *)lidp->lid_item; + if (blip->bli_item.li_type == XFS_LI_BUF && + XFS_BUF_TARGET(blip->bli_buf) == target && + XFS_BUF_ADDR(blip->bli_buf) == blkno && + XFS_BUF_COUNT(blip->bli_buf) == len) + return blip->bli_buf; } return NULL; @@ -127,7 +108,7 @@ _xfs_trans_bjoin( /* * Get a log_item_desc to point at the new item. 
*/ - (void) xfs_trans_add_item(tp, (xfs_log_item_t *)bip); + xfs_trans_add_item(tp, &bip->bli_item); /* * Initialize b_fsprivate2 so we can find it with incore_match() @@ -483,7 +464,6 @@ xfs_trans_brelse(xfs_trans_t *tp, { xfs_buf_log_item_t *bip; xfs_log_item_t *lip; - xfs_log_item_desc_t *lidp; /* * Default to a normal brelse() call if the tp is NULL. @@ -514,13 +494,6 @@ xfs_trans_brelse(xfs_trans_t *tp, ASSERT(!(bip->bli_format.blf_flags & XFS_BLF_CANCEL)); ASSERT(atomic_read(&bip->bli_refcount) > 0); - /* - * Find the item descriptor pointing to this buffer's - * log item. It must be there. - */ - lidp = xfs_trans_find_item(tp, (xfs_log_item_t*)bip); - ASSERT(lidp != NULL); - trace_xfs_trans_brelse(bip); /* @@ -536,7 +509,7 @@ xfs_trans_brelse(xfs_trans_t *tp, * If the buffer is dirty within this transaction, we can't * release it until we commit. */ - if (lidp->lid_flags & XFS_LID_DIRTY) + if (bip->bli_item.li_desc->lid_flags & XFS_LID_DIRTY) return; /* @@ -553,7 +526,7 @@ xfs_trans_brelse(xfs_trans_t *tp, /* * Free up the log item descriptor tracking the released item. */ - xfs_trans_free_item(tp, lidp); + xfs_trans_del_item(&bip->bli_item); /* * Clear the hold flag in the buf log item if it is set. 
@@ -665,7 +638,6 @@ xfs_trans_log_buf(xfs_trans_t *tp, uint last) { xfs_buf_log_item_t *bip; - xfs_log_item_desc_t *lidp; ASSERT(XFS_BUF_ISBUSY(bp)); ASSERT(XFS_BUF_FSPRIVATE2(bp, xfs_trans_t *) == tp); @@ -707,11 +679,8 @@ xfs_trans_log_buf(xfs_trans_t *tp, bip->bli_format.blf_flags &= ~XFS_BLF_CANCEL; } - lidp = xfs_trans_find_item(tp, (xfs_log_item_t*)bip); - ASSERT(lidp != NULL); - tp->t_flags |= XFS_TRANS_DIRTY; - lidp->lid_flags |= XFS_LID_DIRTY; + bip->bli_item.li_desc->lid_flags |= XFS_LID_DIRTY; bip->bli_flags |= XFS_BLI_LOGGED; xfs_buf_item_log(bip, first, last); } @@ -740,7 +709,6 @@ xfs_trans_binval( xfs_trans_t *tp, xfs_buf_t *bp) { - xfs_log_item_desc_t *lidp; xfs_buf_log_item_t *bip; ASSERT(XFS_BUF_ISBUSY(bp)); @@ -748,8 +716,6 @@ xfs_trans_binval( ASSERT(XFS_BUF_FSPRIVATE(bp, void *) != NULL); bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); - lidp = xfs_trans_find_item(tp, (xfs_log_item_t*)bip); - ASSERT(lidp != NULL); ASSERT(atomic_read(&bip->bli_refcount) > 0); trace_xfs_trans_binval(bip); @@ -764,7 +730,7 @@ xfs_trans_binval( ASSERT(!(bip->bli_flags & (XFS_BLI_LOGGED | XFS_BLI_DIRTY))); ASSERT(!(bip->bli_format.blf_flags & XFS_BLF_INODE_BUF)); ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL); - ASSERT(lidp->lid_flags & XFS_LID_DIRTY); + ASSERT(bip->bli_item.li_desc->lid_flags & XFS_LID_DIRTY); ASSERT(tp->t_flags & XFS_TRANS_DIRTY); return; } @@ -797,7 +763,7 @@ xfs_trans_binval( bip->bli_format.blf_flags |= XFS_BLF_CANCEL; memset((char *)(bip->bli_format.blf_data_map), 0, (bip->bli_format.blf_map_size * sizeof(uint))); - lidp->lid_flags |= XFS_LID_DIRTY; + bip->bli_item.li_desc->lid_flags |= XFS_LID_DIRTY; tp->t_flags |= XFS_TRANS_DIRTY; } Index: xfs/fs/xfs/xfs_trans_extfree.c =================================================================== --- xfs.orig/fs/xfs/xfs_trans_extfree.c 2010-05-04 15:47:36.096004058 +0200 +++ xfs/fs/xfs/xfs_trans_extfree.c 2010-05-04 15:51:53.170254721 +0200 @@ -50,9 +50,8 @@ xfs_trans_get_efi(xfs_trans_t *tp, 
/* * Get a log_item_desc to point at the new item. */ - (void) xfs_trans_add_item(tp, (xfs_log_item_t*)efip); - - return (efip); + xfs_trans_add_item(tp, &efip->efi_item); + return efip; } /* @@ -66,15 +65,11 @@ xfs_trans_log_efi_extent(xfs_trans_t *t xfs_fsblock_t start_block, xfs_extlen_t ext_len) { - xfs_log_item_desc_t *lidp; uint next_extent; xfs_extent_t *extp; - lidp = xfs_trans_find_item(tp, (xfs_log_item_t*)efip); - ASSERT(lidp != NULL); - tp->t_flags |= XFS_TRANS_DIRTY; - lidp->lid_flags |= XFS_LID_DIRTY; + efip->efi_item.li_desc->lid_flags |= XFS_LID_DIRTY; next_extent = efip->efi_next_extent; ASSERT(next_extent < efip->efi_format.efi_nextents); @@ -107,9 +102,8 @@ xfs_trans_get_efd(xfs_trans_t *tp, /* * Get a log_item_desc to point at the new item. */ - (void) xfs_trans_add_item(tp, (xfs_log_item_t*)efdp); - - return (efdp); + xfs_trans_add_item(tp, &efdp->efd_item); + return efdp; } /* @@ -123,15 +117,11 @@ xfs_trans_log_efd_extent(xfs_trans_t *t xfs_fsblock_t start_block, xfs_extlen_t ext_len) { - xfs_log_item_desc_t *lidp; uint next_extent; xfs_extent_t *extp; - lidp = xfs_trans_find_item(tp, (xfs_log_item_t*)efdp); - ASSERT(lidp != NULL); - tp->t_flags |= XFS_TRANS_DIRTY; - lidp->lid_flags |= XFS_LID_DIRTY; + efdp->efd_item.li_desc->lid_flags |= XFS_LID_DIRTY; next_extent = efdp->efd_next_extent; ASSERT(next_extent < efdp->efd_format.efd_nextents); Index: xfs/fs/xfs/xfs_trans_inode.c =================================================================== --- xfs.orig/fs/xfs/xfs_trans_inode.c 2010-05-04 15:47:36.105004547 +0200 +++ xfs/fs/xfs/xfs_trans_inode.c 2010-05-04 15:51:53.176254791 +0200 @@ -93,7 +93,7 @@ xfs_trans_ijoin( /* * Get a log_item_desc to point at the new item. 
*/ - (void) xfs_trans_add_item(tp, (xfs_log_item_t*)(iip)); + xfs_trans_add_item(tp, &iip->ili_item); xfs_trans_inode_broot_debug(ip); @@ -149,17 +149,12 @@ xfs_trans_log_inode( xfs_inode_t *ip, uint flags) { - xfs_log_item_desc_t *lidp; - ASSERT(ip->i_transp == tp); ASSERT(ip->i_itemp != NULL); ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); - lidp = xfs_trans_find_item(tp, (xfs_log_item_t*)(ip->i_itemp)); - ASSERT(lidp != NULL); - tp->t_flags |= XFS_TRANS_DIRTY; - lidp->lid_flags |= XFS_LID_DIRTY; + ip->i_itemp->ili_item.li_desc->lid_flags |= XFS_LID_DIRTY; /* * Always OR in the bits from the ili_last_fields field. Index: xfs/fs/xfs/xfs_trans_item.c =================================================================== --- xfs.orig/fs/xfs/xfs_trans_item.c 2010-05-04 15:47:36.113004197 +0200 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,441 +0,0 @@ -/* - * Copyright (c) 2000-2002,2005 Silicon Graphics, Inc. - * All Rights Reserved. - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it would be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. 
- * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write the Free Software Foundation, - * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA - */ -#include "xfs.h" -#include "xfs_fs.h" -#include "xfs_types.h" -#include "xfs_log.h" -#include "xfs_inum.h" -#include "xfs_trans.h" -#include "xfs_trans_priv.h" -/* XXX: from here down needed until struct xfs_trans has its own ailp */ -#include "xfs_bit.h" -#include "xfs_buf_item.h" -#include "xfs_sb.h" -#include "xfs_ag.h" -#include "xfs_dir2.h" -#include "xfs_dmapi.h" -#include "xfs_mount.h" - -STATIC int xfs_trans_unlock_chunk(xfs_log_item_chunk_t *, - int, int, xfs_lsn_t); - -/* - * This is called to add the given log item to the transaction's - * list of log items. It must find a free log item descriptor - * or allocate a new one and add the item to that descriptor. - * The function returns a pointer to item descriptor used to point - * to the new item. The log item will now point to its new descriptor - * with its li_desc field. - */ -xfs_log_item_desc_t * -xfs_trans_add_item(xfs_trans_t *tp, xfs_log_item_t *lip) -{ - xfs_log_item_desc_t *lidp; - xfs_log_item_chunk_t *licp; - int i=0; - - /* - * If there are no free descriptors, allocate a new chunk - * of them and put it at the front of the chunk list. - */ - if (tp->t_items_free == 0) { - licp = (xfs_log_item_chunk_t*) - kmem_alloc(sizeof(xfs_log_item_chunk_t), KM_SLEEP); - ASSERT(licp != NULL); - /* - * Initialize the chunk, and then - * claim the first slot in the newly allocated chunk. - */ - xfs_lic_init(licp); - xfs_lic_claim(licp, 0); - licp->lic_unused = 1; - xfs_lic_init_slot(licp, 0); - lidp = xfs_lic_slot(licp, 0); - - /* - * Link in the new chunk and update the free count. - */ - licp->lic_next = tp->t_items.lic_next; - tp->t_items.lic_next = licp; - tp->t_items_free = XFS_LIC_NUM_SLOTS - 1; - - /* - * Initialize the descriptor and the generic portion - * of the log item. 
- * - * Point the new slot at this item and return it. - * Also point the log item at its currently active - * descriptor and set the item's mount pointer. - */ - lidp->lid_item = lip; - lidp->lid_flags = 0; - lidp->lid_size = 0; - lip->li_desc = lidp; - lip->li_mountp = tp->t_mountp; - lip->li_ailp = tp->t_mountp->m_ail; - return lidp; - } - - /* - * Find the free descriptor. It is somewhere in the chunklist - * of descriptors. - */ - licp = &tp->t_items; - while (licp != NULL) { - if (xfs_lic_vacancy(licp)) { - if (licp->lic_unused <= XFS_LIC_MAX_SLOT) { - i = licp->lic_unused; - ASSERT(xfs_lic_isfree(licp, i)); - break; - } - for (i = 0; i <= XFS_LIC_MAX_SLOT; i++) { - if (xfs_lic_isfree(licp, i)) - break; - } - ASSERT(i <= XFS_LIC_MAX_SLOT); - break; - } - licp = licp->lic_next; - } - ASSERT(licp != NULL); - /* - * If we find a free descriptor, claim it, - * initialize it, and return it. - */ - xfs_lic_claim(licp, i); - if (licp->lic_unused <= i) { - licp->lic_unused = i + 1; - xfs_lic_init_slot(licp, i); - } - lidp = xfs_lic_slot(licp, i); - tp->t_items_free--; - lidp->lid_item = lip; - lidp->lid_flags = 0; - lidp->lid_size = 0; - lip->li_desc = lidp; - lip->li_mountp = tp->t_mountp; - lip->li_ailp = tp->t_mountp->m_ail; - return lidp; -} - -/* - * Free the given descriptor. - * - * This requires setting the bit in the chunk's free mask corresponding - * to the given slot. - */ -void -xfs_trans_free_item(xfs_trans_t *tp, xfs_log_item_desc_t *lidp) -{ - uint slot; - xfs_log_item_chunk_t *licp; - xfs_log_item_chunk_t **licpp; - - slot = xfs_lic_desc_to_slot(lidp); - licp = xfs_lic_desc_to_chunk(lidp); - xfs_lic_relse(licp, slot); - lidp->lid_item->li_desc = NULL; - tp->t_items_free++; - - /* - * If there are no more used items in the chunk and this is not - * the chunk embedded in the transaction structure, then free - * the chunk. First pull it from the chunk list and then - * free it back to the heap. 
We didn't bother with a doubly - * linked list here because the lists should be very short - * and this is not a performance path. It's better to save - * the memory of the extra pointer. - * - * Also decrement the transaction structure's count of free items - * by the number in a chunk since we are freeing an empty chunk. - */ - if (xfs_lic_are_all_free(licp) && (licp != &(tp->t_items))) { - licpp = &(tp->t_items.lic_next); - while (*licpp != licp) { - ASSERT(*licpp != NULL); - licpp = &((*licpp)->lic_next); - } - *licpp = licp->lic_next; - kmem_free(licp); - tp->t_items_free -= XFS_LIC_NUM_SLOTS; - } -} - -/* - * This is called to find the descriptor corresponding to the given - * log item. It returns a pointer to the descriptor. - * The log item MUST have a corresponding descriptor in the given - * transaction. This routine does not return NULL, it panics. - * - * The descriptor pointer is kept in the log item's li_desc field. - * Just return it. - */ -/*ARGSUSED*/ -xfs_log_item_desc_t * -xfs_trans_find_item(xfs_trans_t *tp, xfs_log_item_t *lip) -{ - ASSERT(lip->li_desc != NULL); - - return lip->li_desc; -} - - -/* - * Return a pointer to the first descriptor in the chunk list. - * This does not return NULL if there are none, it panics. - * - * The first descriptor must be in either the first or second chunk. - * This is because the only chunk allowed to be empty is the first. - * All others are freed when they become empty. - * - * At some point this and xfs_trans_next_item() should be optimized - * to quickly look at the mask to determine if there is anything to - * look at. - */ -xfs_log_item_desc_t * -xfs_trans_first_item(xfs_trans_t *tp) -{ - xfs_log_item_chunk_t *licp; - int i; - - licp = &tp->t_items; - /* - * If it's not in the first chunk, skip to the second. - */ - if (xfs_lic_are_all_free(licp)) { - licp = licp->lic_next; - } - - /* - * Return the first non-free descriptor in the chunk. 
- */ - ASSERT(!xfs_lic_are_all_free(licp)); - for (i = 0; i < licp->lic_unused; i++) { - if (xfs_lic_isfree(licp, i)) { - continue; - } - - return xfs_lic_slot(licp, i); - } - cmn_err(CE_WARN, "xfs_trans_first_item() -- no first item"); - return NULL; -} - - -/* - * Given a descriptor, return the next descriptor in the chunk list. - * This returns NULL if there are no more used descriptors in the list. - * - * We do this by first locating the chunk in which the descriptor resides, - * and then scanning forward in the chunk and the list for the next - * used descriptor. - */ -/*ARGSUSED*/ -xfs_log_item_desc_t * -xfs_trans_next_item(xfs_trans_t *tp, xfs_log_item_desc_t *lidp) -{ - xfs_log_item_chunk_t *licp; - int i; - - licp = xfs_lic_desc_to_chunk(lidp); - - /* - * First search the rest of the chunk. The for loop keeps us - * from referencing things beyond the end of the chunk. - */ - for (i = (int)xfs_lic_desc_to_slot(lidp) + 1; i < licp->lic_unused; i++) { - if (xfs_lic_isfree(licp, i)) { - continue; - } - - return xfs_lic_slot(licp, i); - } - - /* - * Now search the next chunk. It must be there, because the - * next chunk would have been freed if it were empty. - * If there is no next chunk, return NULL. - */ - if (licp->lic_next == NULL) { - return NULL; - } - - licp = licp->lic_next; - ASSERT(!xfs_lic_are_all_free(licp)); - for (i = 0; i < licp->lic_unused; i++) { - if (xfs_lic_isfree(licp, i)) { - continue; - } - - return xfs_lic_slot(licp, i); - } - ASSERT(0); - /* NOTREACHED */ - return NULL; /* keep gcc quite */ -} - -/* - * This is called to unlock all of the items of a transaction and to free - * all the descriptors of that transaction. - * - * It walks the list of descriptors and unlocks each item. It frees - * each chunk except that embedded in the transaction as it goes along. 
- */ -void -xfs_trans_free_items( - xfs_trans_t *tp, - xfs_lsn_t commit_lsn, - int flags) -{ - xfs_log_item_chunk_t *licp; - xfs_log_item_chunk_t *next_licp; - int abort; - - abort = flags & XFS_TRANS_ABORT; - licp = &tp->t_items; - /* - * Special case the embedded chunk so we don't free it below. - */ - if (!xfs_lic_are_all_free(licp)) { - (void) xfs_trans_unlock_chunk(licp, 1, abort, commit_lsn); - xfs_lic_all_free(licp); - licp->lic_unused = 0; - } - licp = licp->lic_next; - - /* - * Unlock each item in each chunk and free the chunks. - */ - while (licp != NULL) { - ASSERT(!xfs_lic_are_all_free(licp)); - (void) xfs_trans_unlock_chunk(licp, 1, abort, commit_lsn); - next_licp = licp->lic_next; - kmem_free(licp); - licp = next_licp; - } - - /* - * Reset the transaction structure's free item count. - */ - tp->t_items_free = XFS_LIC_NUM_SLOTS; - tp->t_items.lic_next = NULL; -} - - - -/* - * This is called to unlock the items associated with a transaction. - * Items which were not logged should be freed. - * Those which were logged must still be tracked so they can be unpinned - * when the transaction commits. - */ -void -xfs_trans_unlock_items(xfs_trans_t *tp, xfs_lsn_t commit_lsn) -{ - xfs_log_item_chunk_t *licp; - xfs_log_item_chunk_t *next_licp; - xfs_log_item_chunk_t **licpp; - int freed; - - freed = 0; - licp = &tp->t_items; - - /* - * Special case the embedded chunk so we don't free. - */ - if (!xfs_lic_are_all_free(licp)) { - freed = xfs_trans_unlock_chunk(licp, 0, 0, commit_lsn); - } - licpp = &(tp->t_items.lic_next); - licp = licp->lic_next; - - /* - * Unlock each item in each chunk, free non-dirty descriptors, - * and free empty chunks. 
- */ - while (licp != NULL) { - ASSERT(!xfs_lic_are_all_free(licp)); - freed += xfs_trans_unlock_chunk(licp, 0, 0, commit_lsn); - next_licp = licp->lic_next; - if (xfs_lic_are_all_free(licp)) { - *licpp = next_licp; - kmem_free(licp); - freed -= XFS_LIC_NUM_SLOTS; - } else { - licpp = &(licp->lic_next); - } - ASSERT(*licpp == next_licp); - licp = next_licp; - } - - /* - * Fix the free descriptor count in the transaction. - */ - tp->t_items_free += freed; -} - -/* - * Unlock each item pointed to by a descriptor in the given chunk. - * Stamp the commit lsn into each item if necessary. - * Free descriptors pointing to items which are not dirty if freeing_chunk - * is zero. If freeing_chunk is non-zero, then we need to unlock all - * items in the chunk. - * - * Return the number of descriptors freed. - */ -STATIC int -xfs_trans_unlock_chunk( - xfs_log_item_chunk_t *licp, - int freeing_chunk, - int abort, - xfs_lsn_t commit_lsn) -{ - xfs_log_item_desc_t *lidp; - xfs_log_item_t *lip; - int i; - int freed; - - freed = 0; - lidp = licp->lic_descs; - for (i = 0; i < licp->lic_unused; i++, lidp++) { - if (xfs_lic_isfree(licp, i)) { - continue; - } - lip = lidp->lid_item; - lip->li_desc = NULL; - - if (commit_lsn != NULLCOMMITLSN) - IOP_COMMITTING(lip, commit_lsn); - if (abort) - lip->li_flags |= XFS_LI_ABORTED; - IOP_UNLOCK(lip); - - /* - * Free the descriptor if the item is not dirty - * within this transaction and the caller is not - * going to just free the entire thing regardless. 
- */ - if (!(freeing_chunk) && - (!(lidp->lid_flags & XFS_LID_DIRTY) || abort)) { - xfs_lic_relse(licp, i); - freed++; - } - } - - return freed; -} Index: xfs/fs/xfs/xfs_trans_priv.h =================================================================== --- xfs.orig/fs/xfs/xfs_trans_priv.h 2010-05-04 15:47:36.121026337 +0200 +++ xfs/fs/xfs/xfs_trans_priv.h 2010-05-04 15:51:53.183254651 +0200 @@ -23,20 +23,9 @@ struct xfs_log_item_desc; struct xfs_mount; struct xfs_trans; -/* - * From xfs_trans_item.c - */ -struct xfs_log_item_desc *xfs_trans_add_item(struct xfs_trans *, - struct xfs_log_item *); -void xfs_trans_free_item(struct xfs_trans *, - struct xfs_log_item_desc *); -struct xfs_log_item_desc *xfs_trans_find_item(struct xfs_trans *, - struct xfs_log_item *); -struct xfs_log_item_desc *xfs_trans_first_item(struct xfs_trans *); -struct xfs_log_item_desc *xfs_trans_next_item(struct xfs_trans *, - struct xfs_log_item_desc *); +void xfs_trans_add_item(struct xfs_trans *, struct xfs_log_item *); +void xfs_trans_del_item(struct xfs_log_item *); -void xfs_trans_unlock_items(struct xfs_trans *tp, xfs_lsn_t commit_lsn); void xfs_trans_free_items(struct xfs_trans *tp, xfs_lsn_t commit_lsn, int flags); Index: xfs/fs/xfs/Makefile =================================================================== --- xfs.orig/fs/xfs/Makefile 2010-05-04 15:47:36.156004128 +0200 +++ xfs/fs/xfs/Makefile 2010-05-04 15:51:53.184254931 +0200 @@ -87,7 +87,6 @@ xfs-y += xfs_alloc.o \ xfs_trans_buf.o \ xfs_trans_extfree.o \ xfs_trans_inode.o \ - xfs_trans_item.o \ xfs_utils.o \ xfs_vnodeops.o \ xfs_rw.o \ Index: xfs/fs/xfs/linux-2.6/xfs_super.c =================================================================== --- xfs.orig/fs/xfs/linux-2.6/xfs_super.c 2010-05-04 15:47:36.168004337 +0200 +++ xfs/fs/xfs/linux-2.6/xfs_super.c 2010-05-04 15:51:53.186253883 +0200 @@ -1757,6 +1757,12 @@ xfs_init_zones(void) if (!xfs_trans_zone) goto out_destroy_ifork_zone; + xfs_log_item_desc_zone = + 
kmem_zone_init(sizeof(struct xfs_log_item_desc), + "xfs_log_item_desc"); + if (!xfs_log_item_desc_zone) + goto out_destroy_trans_zone; + /* * The size of the zone allocated buf log item is the maximum * size possible under XFS. This wastes a little bit of memory, @@ -1766,7 +1772,7 @@ xfs_init_zones(void) (((XFS_MAX_BLOCKSIZE / XFS_BLF_CHUNK) / NBWORD) * sizeof(int))), "xfs_buf_item"); if (!xfs_buf_item_zone) - goto out_destroy_trans_zone; + goto out_destroy_log_item_desc_zone; xfs_efd_zone = kmem_zone_init((sizeof(xfs_efd_log_item_t) + ((XFS_EFD_MAX_FAST_EXTENTS - 1) * @@ -1803,6 +1809,8 @@ xfs_init_zones(void) kmem_zone_destroy(xfs_efd_zone); out_destroy_buf_item_zone: kmem_zone_destroy(xfs_buf_item_zone); + out_destroy_log_item_desc_zone: + kmem_zone_destroy(xfs_log_item_desc_zone); out_destroy_trans_zone: kmem_zone_destroy(xfs_trans_zone); out_destroy_ifork_zone: @@ -1833,6 +1841,7 @@ xfs_destroy_zones(void) kmem_zone_destroy(xfs_efi_zone); kmem_zone_destroy(xfs_efd_zone); kmem_zone_destroy(xfs_buf_item_zone); + kmem_zone_destroy(xfs_log_item_desc_zone); kmem_zone_destroy(xfs_trans_zone); kmem_zone_destroy(xfs_ifork_zone); kmem_zone_destroy(xfs_dabuf_zone); Index: xfs/fs/xfs/xfs_bmap.c =================================================================== --- xfs.orig/fs/xfs/xfs_bmap.c 2010-05-04 15:47:36.134004197 +0200 +++ xfs/fs/xfs/xfs_bmap.c 2010-05-04 15:51:53.193254791 +0200 @@ -5882,43 +5882,18 @@ xfs_bmap_get_bp( bp = NULL; if (!bp) { /* Chase down all the log items to see if the bp is there */ - xfs_log_item_chunk_t *licp; - xfs_trans_t *tp; + struct xfs_log_item_desc *lidp; + struct xfs_buf_log_item *bip; - tp = cur->bc_tp; - licp = &tp->t_items; - while (!bp && licp != NULL) { - if (xfs_lic_are_all_free(licp)) { - licp = licp->lic_next; - continue; - } - for (i = 0; i < licp->lic_unused; i++) { - xfs_log_item_desc_t *lidp; - xfs_log_item_t *lip; - xfs_buf_log_item_t *bip; - xfs_buf_t *lbp; - - if (xfs_lic_isfree(licp, i)) { - continue; - } - - lidp = 
xfs_lic_slot(licp, i); - lip = lidp->lid_item; - if (lip->li_type != XFS_LI_BUF) - continue; - - bip = (xfs_buf_log_item_t *)lip; - lbp = bip->bli_buf; - - if (XFS_BUF_ADDR(lbp) == bno) { - bp = lbp; - break; /* Found it */ - } - } - licp = licp->lic_next; + list_for_each_entry(lidp, &cur->bc_tp->t_items, lid_trans) { + bip = (struct xfs_buf_log_item *)lidp->lid_item; + if (bip->bli_item.li_type == XFS_LI_BUF && + XFS_BUF_ADDR(bip->bli_buf) == bno) + return bip->bli_buf; } } - return(bp); + + return bp; } STATIC void Index: xfs/fs/xfs/xfs_trans.c =================================================================== --- xfs.orig/fs/xfs/xfs_trans.c 2010-05-04 15:49:00.118253812 +0200 +++ xfs/fs/xfs/xfs_trans.c 2010-05-04 15:51:53.201254302 +0200 @@ -1,5 +1,6 @@ /* * Copyright (c) 2000-2003,2005 Silicon Graphics, Inc. + * Copyright (C) 2010 Red Hat, Inc. * All Rights Reserved. * * This program is free software; you can redistribute it and/or @@ -48,6 +49,7 @@ #include "xfs_log_priv.h" kmem_zone_t *xfs_trans_zone; +kmem_zone_t *xfs_log_item_desc_zone; /* @@ -598,8 +600,7 @@ _xfs_trans_alloc( tp->t_magic = XFS_TRANS_MAGIC; tp->t_type = type; tp->t_mountp = mp; - tp->t_items_free = XFS_LIC_NUM_SLOTS; - xfs_lic_init(&(tp->t_items)); + INIT_LIST_HEAD(&tp->t_items); INIT_LIST_HEAD(&tp->t_busy); return tp; } @@ -639,8 +640,7 @@ xfs_trans_dup( ntp->t_magic = XFS_TRANS_MAGIC; ntp->t_type = tp->t_type; ntp->t_mountp = tp->t_mountp; - ntp->t_items_free = XFS_LIC_NUM_SLOTS; - xfs_lic_init(&(ntp->t_items)); + INIT_LIST_HEAD(&ntp->t_items); INIT_LIST_HEAD(&ntp->t_busy); ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); @@ -1144,6 +1144,108 @@ xfs_trans_unreserve_and_mod_sb( } /* + * Add the given log item to the transaction's list of log items. + * + * The log item will now point to its new descriptor with its li_desc field. 
+ */ +void +xfs_trans_add_item( + struct xfs_trans *tp, + struct xfs_log_item *lip) +{ + struct xfs_log_item_desc *lidp; + + ASSERT(lip->li_mountp == tp->t_mountp); + ASSERT(lip->li_ailp == tp->t_mountp->m_ail); + + lidp = kmem_zone_zalloc(xfs_log_item_desc_zone, KM_SLEEP); + + lidp->lid_item = lip; + lidp->lid_flags = 0; + lidp->lid_size = 0; + list_add_tail(&lidp->lid_trans, &tp->t_items); + + lip->li_desc = lidp; +} + +STATIC void +xfs_trans_free_item_desc( + struct xfs_log_item_desc *lidp) +{ + list_del_init(&lidp->lid_trans); + kmem_zone_free(xfs_log_item_desc_zone, lidp); +} + +/* + * Unlink and free the given descriptor. + */ +void +xfs_trans_del_item( + struct xfs_log_item *lip) +{ + xfs_trans_free_item_desc(lip->li_desc); + lip->li_desc = NULL; +} + +/* + * Unlock all of the items of a transaction and free all the descriptors + * of that transaction. + */ +void +xfs_trans_free_items( + struct xfs_trans *tp, + xfs_lsn_t commit_lsn, + int flags) +{ + struct xfs_log_item_desc *lidp, *next; + + list_for_each_entry_safe(lidp, next, &tp->t_items, lid_trans) { + struct xfs_log_item *lip = lidp->lid_item; + + lip->li_desc = NULL; + + if (commit_lsn != NULLCOMMITLSN) + IOP_COMMITTING(lip, commit_lsn); + if (flags & XFS_TRANS_ABORT) + lip->li_flags |= XFS_LI_ABORTED; + IOP_UNLOCK(lip); + + xfs_trans_free_item_desc(lidp); + } +} + +/* + * Unlock the items associated with a transaction. + * + * Items which were not logged should be freed. Those which were logged must + * still be tracked so they can be unpinned when the transaction commits.
+ */ +STATIC void +xfs_trans_unlock_items( + struct xfs_trans *tp, + xfs_lsn_t commit_lsn) +{ + struct xfs_log_item_desc *lidp, *next; + + list_for_each_entry_safe(lidp, next, &tp->t_items, lid_trans) { + struct xfs_log_item *lip = lidp->lid_item; + + lip->li_desc = NULL; + + if (commit_lsn != NULLCOMMITLSN) + IOP_COMMITTING(lip, commit_lsn); + IOP_UNLOCK(lip); + + /* + * Free the descriptor if the item is not dirty + * within this transaction. + */ + if (!(lidp->lid_flags & XFS_LID_DIRTY)) + xfs_trans_free_item_desc(lidp); + } +} + +/* * Total up the number of log iovecs needed to commit this * transaction. The transaction itself needs one for the * transaction header. Ask each dirty item in turn how many @@ -1154,30 +1256,27 @@ xfs_trans_count_vecs( struct xfs_trans *tp) { int nvecs; - xfs_log_item_desc_t *lidp; + struct xfs_log_item_desc *lidp; nvecs = 1; - lidp = xfs_trans_first_item(tp); - ASSERT(lidp != NULL); /* In the non-debug case we need to start bailing out if we * didn't find a log_item here, return zero and let trans_commit * deal with it. */ - if (lidp == NULL) + if (list_empty(&tp->t_items)) { + ASSERT(0); return 0; + } - while (lidp != NULL) { + list_for_each_entry(lidp, &tp->t_items, lid_trans) { /* * Skip items which aren't dirty in this transaction. 
*/ - if (!(lidp->lid_flags & XFS_LID_DIRTY)) { - lidp = xfs_trans_next_item(tp, lidp); + if (!(lidp->lid_flags & XFS_LID_DIRTY)) continue; - } lidp->lid_size = IOP_SIZE(lidp->lid_item); nvecs += lidp->lid_size; - lidp = xfs_trans_next_item(tp, lidp); } return nvecs; @@ -1197,7 +1296,7 @@ xfs_trans_fill_vecs( struct xfs_trans *tp, struct xfs_log_iovec *log_vector) { - xfs_log_item_desc_t *lidp; + struct xfs_log_item_desc *lidp; struct xfs_log_iovec *vecp; uint nitems; @@ -1208,14 +1307,11 @@ xfs_trans_fill_vecs( vecp = log_vector + 1; nitems = 0; - lidp = xfs_trans_first_item(tp); - ASSERT(lidp); - while (lidp) { + ASSERT(!list_empty(&tp->t_items)); + list_for_each_entry(lidp, &tp->t_items, lid_trans) { /* Skip items which aren't dirty in this transaction. */ - if (!(lidp->lid_flags & XFS_LID_DIRTY)) { - lidp = xfs_trans_next_item(tp, lidp); + if (!(lidp->lid_flags & XFS_LID_DIRTY)) continue; - } /* * The item may be marked dirty but not log anything. This can @@ -1226,7 +1322,6 @@ xfs_trans_fill_vecs( IOP_FORMAT(lidp->lid_item, vecp); vecp += lidp->lid_size; IOP_PIN(lidp->lid_item); - lidp = xfs_trans_next_item(tp, lidp); } /* @@ -1321,24 +1416,15 @@ xfs_trans_committed( struct xfs_trans *tp, int abortflag) { - xfs_log_item_desc_t *lidp; - xfs_log_item_chunk_t *licp; - xfs_log_item_chunk_t *next_licp; + struct xfs_log_item_desc *lidp, *next; /* Call the transaction's completion callback if there is one. 
*/ if (tp->t_callback != NULL) tp->t_callback(tp, tp->t_callarg); - for (lidp = xfs_trans_first_item(tp); - lidp != NULL; - lidp = xfs_trans_next_item(tp, lidp)) { + list_for_each_entry_safe(lidp, next, &tp->t_items, lid_trans) { xfs_trans_item_committed(lidp->lid_item, tp->t_lsn, abortflag); - } - - /* free the item chunks, ignoring the embedded chunk */ - for (licp = tp->t_items.lic_next; licp != NULL; licp = next_licp) { - next_licp = licp->lic_next; - kmem_free(licp); + xfs_trans_free_item_desc(lidp); } xfs_trans_free_busy(tp->t_mountp, &tp->t_busy); @@ -1354,11 +1440,9 @@ xfs_trans_uncommit( struct xfs_trans *tp, uint flags) { - xfs_log_item_desc_t *lidp; + struct xfs_log_item_desc *lidp; - for (lidp = xfs_trans_first_item(tp); - lidp != NULL; - lidp = xfs_trans_next_item(tp, lidp)) { + list_for_each_entry(lidp, &tp->t_items, lid_trans) { /* * Unpin all but those that aren't dirty. */ @@ -1530,33 +1614,28 @@ STATIC struct xfs_log_vec * xfs_trans_alloc_log_vecs( xfs_trans_t *tp) { - xfs_log_item_desc_t *lidp; + struct xfs_log_item_desc *lidp; struct xfs_log_vec *lv = NULL; struct xfs_log_vec *ret_lv = NULL; - lidp = xfs_trans_first_item(tp); /* Bail out if we didn't find a log item. */ - if (!lidp) { + if (list_empty(&tp->t_items)) { ASSERT(0); return NULL; } - while (lidp != NULL) { + list_for_each_entry(lidp, &tp->t_items, lid_trans) { struct xfs_log_vec *new_lv; /* Skip items which aren't dirty in this transaction. 
*/ - if (!(lidp->lid_flags & XFS_LID_DIRTY)) { - lidp = xfs_trans_next_item(tp, lidp); + if (!(lidp->lid_flags & XFS_LID_DIRTY)) continue; - } /* Skip items that do not have any vectors for writing */ lidp->lid_size = IOP_SIZE(lidp->lid_item); - if (!lidp->lid_size) { - lidp = xfs_trans_next_item(tp, lidp); + if (!lidp->lid_size) continue; - } new_lv = kmem_zalloc(sizeof(*new_lv) + lidp->lid_size * sizeof(struct xfs_log_iovec), @@ -1569,7 +1648,6 @@ xfs_trans_alloc_log_vecs( else lv->lv_next = new_lv; lv = new_lv; - lidp = xfs_trans_next_item(tp, lidp); } return ret_lv; @@ -1592,29 +1670,25 @@ xfs_trans_fill_log_vecs( struct xfs_trans *tp, struct xfs_log_vec *log_vector) { - xfs_log_item_desc_t *lidp; + struct xfs_log_item_desc *lidp; struct xfs_log_vec *lv = log_vector; - lidp = xfs_trans_first_item(tp); - ASSERT(lidp); - while (lidp) { + ASSERT(!list_empty(&tp->t_items)); + list_for_each_entry(lidp, &tp->t_items, lid_trans) { /* * Skip items which aren't dirty in this transaction. */ - if (!(lidp->lid_flags & XFS_LID_DIRTY)) { - lidp = xfs_trans_next_item(tp, lidp); + if (!(lidp->lid_flags & XFS_LID_DIRTY)) continue; - } + /* Skip items that do not have any vectors for writing */ - if (!lidp->lid_size) { - lidp = xfs_trans_next_item(tp, lidp); + if (!lidp->lid_size) continue; - } + IOP_FORMAT(lidp->lid_item, lv->lv_iovecp); lv->lv_niovecs = lidp->lid_size; lv->lv_item = lidp->lid_item; - lidp = xfs_trans_next_item(tp, lidp); lv = lv->lv_next; } } @@ -1779,12 +1853,6 @@ xfs_trans_cancel( int flags) { int log_flags; -#ifdef DEBUG - xfs_log_item_chunk_t *licp; - xfs_log_item_desc_t *lidp; - xfs_log_item_t *lip; - int i; -#endif xfs_mount_t *mp = tp->t_mountp; /* @@ -1803,21 +1871,11 @@ xfs_trans_cancel( xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); } #ifdef DEBUG - if (!(flags & XFS_TRANS_ABORT)) { - licp = &(tp->t_items); - while (licp != NULL) { - lidp = licp->lic_descs; - for (i = 0; i < licp->lic_unused; i++, lidp++) { - if (xfs_lic_isfree(licp, i)) { - 
continue; - } - - lip = lidp->lid_item; - if (!XFS_FORCED_SHUTDOWN(mp)) - ASSERT(!(lip->li_type == XFS_LI_EFD)); - } - licp = licp->lic_next; - } + if (!(flags & XFS_TRANS_ABORT) && !XFS_FORCED_SHUTDOWN(mp)) { + struct xfs_log_item_desc *lidp; + + list_for_each_entry(lidp, &tp->t_items, lid_trans) + ASSERT(!(lidp->lid_item->li_type == XFS_LI_EFD)); } #endif xfs_trans_unreserve_and_mod_sb(tp); Index: xfs/fs/xfs/xfs_trans.h =================================================================== --- xfs.orig/fs/xfs/xfs_trans.h 2010-05-04 15:47:55.115254931 +0200 +++ xfs/fs/xfs/xfs_trans.h 2010-05-04 15:51:53.209254302 +0200 @@ -161,105 +161,14 @@ typedef struct xfs_trans_header { * the amount of space needed to log the item it describes * once we get to commit processing (see xfs_trans_commit()). */ -typedef struct xfs_log_item_desc { +struct xfs_log_item_desc { struct xfs_log_item *lid_item; - ushort lid_size; - unsigned char lid_flags; - unsigned char lid_index; -} xfs_log_item_desc_t; + ushort lid_size; + unsigned char lid_flags; + struct list_head lid_trans; +}; #define XFS_LID_DIRTY 0x1 -#define XFS_LID_PINNED 0x2 - -/* - * This structure is used to maintain a chunk list of log_item_desc - * structures. The free field is a bitmask indicating which descriptors - * in this chunk's array are free. The unused field is the first value - * not used since this chunk was allocated. - */ -#define XFS_LIC_NUM_SLOTS 15 -typedef struct xfs_log_item_chunk { - struct xfs_log_item_chunk *lic_next; - ushort lic_free; - ushort lic_unused; - xfs_log_item_desc_t lic_descs[XFS_LIC_NUM_SLOTS]; -} xfs_log_item_chunk_t; - -#define XFS_LIC_MAX_SLOT (XFS_LIC_NUM_SLOTS - 1) -#define XFS_LIC_FREEMASK ((1 << XFS_LIC_NUM_SLOTS) - 1) - - -/* - * Initialize the given chunk. Set the chunk's free descriptor mask - * to indicate that all descriptors are free. The caller gets to set - * lic_unused to the right value (0 matches all free). 
The - * lic_descs.lid_index values are set up as each desc is allocated. - */ -static inline void xfs_lic_init(xfs_log_item_chunk_t *cp) -{ - cp->lic_free = XFS_LIC_FREEMASK; -} - -static inline void xfs_lic_init_slot(xfs_log_item_chunk_t *cp, int slot) -{ - cp->lic_descs[slot].lid_index = (unsigned char)(slot); -} - -static inline int xfs_lic_vacancy(xfs_log_item_chunk_t *cp) -{ - return cp->lic_free & XFS_LIC_FREEMASK; -} - -static inline void xfs_lic_all_free(xfs_log_item_chunk_t *cp) -{ - cp->lic_free = XFS_LIC_FREEMASK; -} - -static inline int xfs_lic_are_all_free(xfs_log_item_chunk_t *cp) -{ - return ((cp->lic_free & XFS_LIC_FREEMASK) == XFS_LIC_FREEMASK); -} - -static inline int xfs_lic_isfree(xfs_log_item_chunk_t *cp, int slot) -{ - return (cp->lic_free & (1 << slot)); -} - -static inline void xfs_lic_claim(xfs_log_item_chunk_t *cp, int slot) -{ - cp->lic_free &= ~(1 << slot); -} - -static inline void xfs_lic_relse(xfs_log_item_chunk_t *cp, int slot) -{ - cp->lic_free |= 1 << slot; -} - -static inline xfs_log_item_desc_t * -xfs_lic_slot(xfs_log_item_chunk_t *cp, int slot) -{ - return &(cp->lic_descs[slot]); -} - -static inline int xfs_lic_desc_to_slot(xfs_log_item_desc_t *dp) -{ - return (uint)dp->lid_index; -} - -/* - * Calculate the address of a chunk given a descriptor pointer: - * dp - dp->lid_index give the address of the start of the lic_descs array. - * From this we subtract the offset of the lic_descs field in a chunk. - * All of this yields the address of the chunk, which is - * cast to a chunk pointer. 
- */ -static inline xfs_log_item_chunk_t * -xfs_lic_desc_to_chunk(xfs_log_item_desc_t *dp) -{ - return (xfs_log_item_chunk_t*) \ - (((xfs_caddr_t)((dp) - (dp)->lid_index)) - \ - (xfs_caddr_t)(((xfs_log_item_chunk_t*)0)->lic_descs)); -} #define XFS_TRANS_MAGIC 0x5452414E /* 'TRAN' */ /* @@ -516,8 +425,7 @@ typedef struct xfs_trans { int64_t t_rblocks_delta;/* superblock rblocks change */ int64_t t_rextents_delta;/* superblocks rextents chg */ int64_t t_rextslog_delta;/* superblocks rextslog chg */ - unsigned int t_items_free; /* log item descs free */ - xfs_log_item_chunk_t t_items; /* first log item desc chunk */ + struct list_head t_items; /* log item descriptors */ xfs_trans_header_t t_header; /* header for in-log trans */ struct list_head t_busy; /* list of busy extents */ unsigned long t_pflags; /* saved process flags state */ @@ -596,6 +504,7 @@ int xfs_trans_ail_init(struct xfs_mount void xfs_trans_ail_destroy(struct xfs_mount *); extern kmem_zone_t *xfs_trans_zone; +extern kmem_zone_t *xfs_log_item_desc_zone; #endif /* __KERNEL__ */ From sandeen@redhat.com Tue May 4 15:42:33 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.8 required=5.0 tests=BAYES_00,J_CHICKENPOX_66 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o44KgWpX253509 for ; Tue, 4 May 2010 15:42:33 -0500 X-ASG-Debug-ID: 1273005880-6e9a023f0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mx1.redhat.com (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 7EAB11B4FFB9 for ; Tue, 4 May 2010 13:44:40 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by cuda.sgi.com with ESMTP id eiwwuV6P12gHtAuj for ; Tue, 04 May 2010 13:44:40 -0700 (PDT) X-ASG-Whitelist: Barracuda Reputation Received: from int-mx08.intmail.prod.int.phx2.redhat.com 
(int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.21]) by mx1.redhat.com (8.13.8/8.13.8) with ESMTP id o44KiEwY003285 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Tue, 4 May 2010 16:44:14 -0400 Received: from liberator.sandeen.net (ovpn01.gateway.prod.ext.phx2.redhat.com [10.5.9.1]) by int-mx08.intmail.prod.int.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id o44Ki8sq024509 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 4 May 2010 16:44:11 -0400 Message-ID: <4BE08718.5040608@redhat.com> Date: Tue, 04 May 2010 15:44:08 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.24 (Macintosh/20100228) MIME-Version: 1.0 To: "Amit K. Arora" CC: Christoph Hellwig , Andrew Morton , xfs@oss.sgi.com, Nikanth Karthikesan , coly.li@suse.de, Nick Piggin , Alexander Viro , linux-fsdevel@vger.kernel.org, "Theodore Ts'o" , Andreas Dilger , linux-ext4@vger.kernel.org, Eelis , Amit Arora X-ASG-Orig-Subj: Re: [PATCH] New testcase to check if fallocate respects RLIMIT_FSIZE or not Subject: Re: [PATCH] New testcase to check if fallocate respects RLIMIT_FSIZE or not References: <201004281854.49730.knikanth@suse.de> <4BD85F1F.7030100@suse.de> <201004291014.07194.knikanth@suse.de> <20100430143319.d51d6d77.akpm@linux-foundation.org> <20100501070426.GA9562@amitarora.in.ibm.com> <20100501101846.GA3769@infradead.org> <20100503083135.GC13756@amitarora.in.ibm.com> In-Reply-To: <20100503083135.GC13756@amitarora.in.ibm.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.67 on 10.5.11.21 X-Barracuda-Connect: mx1.redhat.com[209.132.183.28] X-Barracuda-Start-Time: 1273005881 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Amit K. Arora wrote: > On Sat, May 01, 2010 at 06:18:46AM -0400, Christoph Hellwig wrote: >> On Sat, May 01, 2010 at 12:34:26PM +0530, Amit K. Arora wrote: >>> Agreed. 
How about doing this check in the filesystem specific fallocate >>> inode routines instead ? For example, in ext4 we could do : >> That looks okay - in fact XFS should already have this check because >> it re-uses the setattr implementation to set the size. >> >> Can you submit an xfstests testcase to verify this behaviour on all >> filesystems? > > Here is the new testcase. Thanks! A few comments... > I have run this test on a x86_64 box on XFS and ext4 on 2.6.34-rc6. It > passes on XFS, but fails on ext4. Below is the snapshot of results > followed by the testcase itself. > > -- > Regards, > Amit Arora > > Test results: > ------------ > # ./check 228 > FSTYP -- xfs (non-debug) > PLATFORM -- Linux/x86_64 elm9m93 2.6.34-rc6 > > 228 0s ... > Ran: 228 > Passed all 1 tests > # > # umount /mnt > # mkfs.ext4 /dev/sda4 >/dev/null > mke2fs 1.41.10 (10-Feb-2009) > # ./check 228 > FSTYP -- ext4 > PLATFORM -- Linux/x86_64 elm9m93 2.6.34-rc6 > > 228 0s ... - output mismatch (see 228.out.bad) > --- 228.out 2010-05-03 02:51:24.000000000 -0400 > +++ 228.out.bad 2010-05-03 04:27:33.000000000 -0400 > @@ -1,2 +1 @@ > QA output created by 228 > -File size limit exceeded (core dumped) > Ran: 228 > Failures: 228 > Failed 1 of 1 tests > # 228.out is missing from the patch Also on my fedora box I don't get a coredump by default; can you either make that explicit, or filter out the core message? > > Here is the test: > ---------------- > Add a new testcase to the xfstests suite to check if fallocate respects > the limit imposed by RLIMIT_FSIZE (can be set by "ulimit -f XXX") or > not, on a particular filesystem. ... > +# get standard environment, filters and checks > +. ./common.rc > +. ./common.filter Nitpick, I don't think you need common.filter, doesn't look like you are using it. > +# FSIZE limit is now set to 100 MB. > +# Lets try to preallocate 101 MB. This should fail. 
> +$XFS_IO_PROG -F -f -c 'falloc 0 101m' $TEST_DIR/ouch > +rm -f $TEST_DIR/ouch > + > +# Lets now try to preallocate 50 MB. This should succeed. > +$XFS_IO_PROG -F -f -c 'falloc 0 50m' $TEST_DIR/ouch > +rm -f $TEST_DIR/ouch Even more nitpicky, but sometimes I think it's nice to have the .out file be a bit more descriptive in and of itself so when you see a failing diff you have a better idea what's gone wrong. Changing the comments to echos, like: +# FSIZE limit is now set to 100 MB. +# echo "Lets try to preallocate 101 MB. This should fail." +$XFS_IO_PROG -F -f -c 'falloc 0 101m' $TEST_DIR/ouch +rm -f $TEST_DIR/ouch etc ... would make a failure look like: --- 228.out 2010-05-04 15:42:31.924278768 -0500 +++ 228.out.bad 2010-05-04 15:42:36.961278392 -0500 @@ -1,3 +1,2 @@ QA output created by 228 Lets try to preallocate 101 MB. This should fail. -File size limit exceeded Lets now try to preallocate 50 MB. This should succeed. ... just a thought. Thanks, -Eric From aarora@linux.vnet.ibm.com Wed May 5 02:54:03 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.8 required=5.0 tests=BAYES_00,J_CHICKENPOX_66 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o457s3aJ022593 for ; Wed, 5 May 2010 02:54:03 -0500 X-ASG-Debug-ID: 1273046171-306000680000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from e32.co.us.ibm.com (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id F0B4F30FBAD for ; Wed, 5 May 2010 00:56:11 -0700 (PDT) Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by cuda.sgi.com with ESMTP id cdV4c9hhaTvR9QVZ for ; Wed, 05 May 2010 00:56:11 -0700 (PDT) Received: from d03relay01.boulder.ibm.com (d03relay01.boulder.ibm.com [9.17.195.226]) by e32.co.us.ibm.com (8.14.3/8.13.1) with ESMTP id o457n8uo007082 for ; Wed, 5 May 2010 
01:49:08 -0600 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay01.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id o457u8XO148940 for ; Wed, 5 May 2010 01:56:08 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.14.3/8.13.1/NCO v10.0 AVout) with ESMTP id o457u6Ep031774 for ; Wed, 5 May 2010 01:56:08 -0600 Received: from amitarora.in.ibm.com ([9.124.212.65]) by d03av02.boulder.ibm.com (8.14.3/8.13.1/NCO v10.0 AVin) with ESMTP id o457u400031601; Wed, 5 May 2010 01:56:05 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 2E3484854; Wed, 5 May 2010 13:26:03 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.14.2/8.14.2/Submit) id o457tuSh017681; Wed, 5 May 2010 13:25:56 +0530 Date: Wed, 5 May 2010 13:25:56 +0530 From: "Amit K. Arora" To: Eric Sandeen Cc: Christoph Hellwig , Andrew Morton , xfs@oss.sgi.com, Nikanth Karthikesan , coly.li@suse.de, Nick Piggin , Alexander Viro , linux-fsdevel@vger.kernel.org, "Theodore Ts'o" , Andreas Dilger , linux-ext4@vger.kernel.org, Eelis , Amit Arora X-ASG-Orig-Subj: [PATCH v2] New testcase to check if fallocate respects RLIMIT_FSIZE or not Subject: [PATCH v2] New testcase to check if fallocate respects RLIMIT_FSIZE or not Message-ID: <20100505075556.GA5142@amitarora.in.ibm.com> References: <201004281854.49730.knikanth@suse.de> <4BD85F1F.7030100@suse.de> <201004291014.07194.knikanth@suse.de> <20100430143319.d51d6d77.akpm@linux-foundation.org> <20100501070426.GA9562@amitarora.in.ibm.com> <20100501101846.GA3769@infradead.org> <20100503083135.GC13756@amitarora.in.ibm.com> <4BE08718.5040608@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BE08718.5040608@redhat.com> User-Agent: Mutt/1.5.17 (2007-11-01) X-Barracuda-Connect: e32.co.us.ibm.com[32.97.110.150] X-Barracuda-Start-Time: 
1273046171 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29111 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Tue, May 04, 2010 at 03:44:08PM -0500, Eric Sandeen wrote: > Amit K. Arora wrote: > > Here is the new testcase. > Thanks! A few comments... Thanks for the review! > 228.out is missing from the patch Ok, added it in the new patch. > Also on my fedora box I don't get a coredump by default; can > you either make that explicit, or filter out the core message? Hmm, for some strange reason I am no longer seeing this message. I tried on the same system as last time and on a couple of others as well. > > > > Here is the test: > > ---------------- > > Add a new testcase to the xfstests suite to check if fallocate respects > > the limit imposed by RLIMIT_FSIZE (can be set by "ulimit -f XXX") or > > not, on a particular filesystem. > > ... > > > +# get standard environment, filters and checks > > +. ./common.rc > > +. ./common.filter > > Nitpick, I don't think you need common.filter, doesn't look like you are > using it. Right. Removed it. > > +# FSIZE limit is now set to 100 MB. > > +# Lets try to preallocate 101 MB. This should fail. > > +$XFS_IO_PROG -F -f -c 'falloc 0 101m' $TEST_DIR/ouch > > +rm -f $TEST_DIR/ouch > > + > > +# Lets now try to preallocate 50 MB. This should succeed.
> > +$XFS_IO_PROG -F -f -c 'falloc 0 50m' $TEST_DIR/ouch > > +rm -f $TEST_DIR/ouch > > Even more nitpicky, but sometimes I think it's nice to have the .out > file be a bit more descriptive in and of itself so when you see a > failing diff you have a better idea what's gone wrong. Agreed. Done. Here is the new patch with the changes: Add a new testcase to the xfstests suite to check if fallocate respects the limit imposed by RLIMIT_FSIZE (can be set by "ulimit -f XXX") or not, on a particular filesystem. Signed-off-by: Amit Arora diff -Nuarp xfstests-dev.org/228 xfstests-dev/228 --- xfstests-dev.org/228 1969-12-31 19:00:00.000000000 -0500 +++ xfstests-dev/228 2010-05-05 02:37:48.000000000 -0400 @@ -0,0 +1,79 @@ +#! /bin/bash +# FS QA Test No. 228 +# +# Check if fallocate respects RLIMIT_FSIZE +# +#----------------------------------------------------------------------- +# Copyright (c) 2010 IBM Corporation. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +# +#----------------------------------------------------------------------- +# +# creator +owner=aarora@in.ibm.com + +seq=`basename $0` +echo "QA output created by $seq" + +_cleanup() +{ + cd / + rm -f $tmp.* +} + +here=`pwd` +tmp=$TEST_DIR/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 25 + +# get standard environment, filters and checks +. 
./common.rc + +# real QA test starts here +# generic, but xfs_io's fallocate must work +_supported_fs generic +# only Linux supports fallocate +_supported_os Linux + +[ -n "$XFS_IO_PROG" ] || _notrun "xfs_io executable not found" + +rm -f $seq.full + +# Sanity check to see if fallocate works +_require_xfs_io_falloc + +# Check if we have good enough space available +avail=`df -P $TEST_DIR | awk 'END {print $4}'` +[ "$avail" -ge 104000 ] || _notrun "Test device is too small ($avail KiB)" + +# Set the FSIZE ulimit to 100MB and check +ulimit -f 102400 +flim=`ulimit -f` +[ "$flim" != "unlimited" ] || _notrun "Unable to set FSIZE ulimit" +[ "$flim" -eq 102400 ] || _notrun "FSIZE ulimit is not correct (100 MB)" + +echo "File size limit is now set to 100 MB." +echo "Let us try to preallocate 101 MB. This should fail." +$XFS_IO_PROG -F -f -c 'falloc 0 101m' $TEST_DIR/ouch +rm -f $TEST_DIR/ouch + +echo "Let us now try to preallocate 50 MB. This should succeed." +$XFS_IO_PROG -F -f -c 'falloc 0 50m' $TEST_DIR/ouch +rm -f $TEST_DIR/ouch + +echo "Test over." +# success, all done +status=0 +exit diff -Nuarp xfstests-dev.org/228.out xfstests-dev/228.out --- xfstests-dev.org/228.out 1969-12-31 19:00:00.000000000 -0500 +++ xfstests-dev/228.out 2010-05-05 02:38:30.000000000 -0400 @@ -0,0 +1,6 @@ +QA output created by 228 +File size limit is now set to 100 MB. +Let us try to preallocate 101 MB. This should fail. +File size limit exceeded +Let us now try to preallocate 50 MB. This should succeed. +Test over. 
diff -Nuarp xfstests-dev.org/group xfstests-dev/group --- xfstests-dev.org/group 2010-05-03 02:35:09.000000000 -0400 +++ xfstests-dev/group 2010-05-05 02:38:00.000000000 -0400 @@ -341,3 +341,4 @@ deprecated 225 auto quick 226 auto enospc 227 auto fsr +228 rw auto prealloc quick From Philippe.DENIEL@CEA.FR Wed May 5 08:52:53 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.8 required=5.0 tests=BAYES_00,J_CHICKENPOX_92 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o45DqrOd037882 for ; Wed, 5 May 2010 08:52:53 -0500 X-ASG-Debug-ID: 1273067700-6c6601df0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from cirse-out.extra.cea.fr (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 714DA310CBF for ; Wed, 5 May 2010 06:55:00 -0700 (PDT) Received: from cirse-out.extra.cea.fr (cirse-out.extra.cea.fr [132.166.172.106]) by cuda.sgi.com with ESMTP id Vx2pz7iQ6IhA7oYe for ; Wed, 05 May 2010 06:55:00 -0700 (PDT) Received: from pisaure.intra.cea.fr (pisaure.intra.cea.fr [132.166.88.21]) by cirse.extra.cea.fr (8.14.2/8.14.2/CEAnet-Internet-out-2.0) with ESMTP id o45Dsxgh022241 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT) for ; Wed, 5 May 2010 15:54:59 +0200 Received: from muguet2.intra.cea.fr (muguet2.intra.cea.fr [132.166.192.7]) by pisaure.intra.cea.fr (8.14.4/8.14.4) with ESMTP id o45DsxjP017128 for ; Wed, 5 May 2010 15:54:59 +0200 (envelope-from Philippe.DENIEL@CEA.FR) Received: from zia.bruyeres.cea.fr (esteban.dam.intra.cea.fr [132.165.76.10]) by muguet2.intra.cea.fr (8.13.8/8.13.8/CEAnet-Intranet-out-1.1) with SMTP id o45DsxZX001414 for ; Wed, 5 May 2010 15:54:59 +0200 Received: (qmail 10082 invoked from network); 5 May 2010 13:54:59 -0000 Message-ID: <4BE178B3.8030501@cea.fr> Date: Wed, 05 May 2010 15:54:59 +0200 From: 
DENIEL Philippe Organization: CEA-DAM User-Agent: Thunderbird 2.0.0.6 (X11/20070728) MIME-Version: 1.0 To: xfs@oss.sgi.com X-ASG-Orig-Subj: Question : Using libhandle from xfsprogs and xfs actions made "by handle" Subject: Question : Using libhandle from xfsprogs and xfs actions made "by handle" Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-OriginalArrivalTime: 05 May 2010 13:54:59.0160 (UTC) FILETIME=[8D142580:01CAEC5A] X-Barracuda-Connect: cirse-out.extra.cea.fr[132.166.172.106] X-Barracuda-Start-Time: 1273067701 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29131 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Hi, I had a look at the stuff within xfsprogs and it really look pretty nice. One thing is of great interest to me : the libhandle.so library. I am currently developing a NFS server running in userspace (see http://nfs-ganesha.sourceforge.net for details). As you know, the NFS protocol has a "handle based" semantics in the way it manages FS objects. All objects are identified by a unique filehandle or by their name and the parent's directory filehandle. The trouble is that libC does not include such "by handle" calls to manage FS, only the old fashioned POSIX API which is a "By path" API. When looking at XFS, I saw there was "open_by_handle" and "path_to_handle" calls. This sounds very very good to me : this sounds like kind of bridge to build a handle-based API to address XFS. 
But so far, I am a bit stuck : for exporting XFS through my NFS server, I would need to do "by handle" everything that can be done through POSIX calls, open/read/write/close files, create files/directories/symlinks, erasing or moving files... and so on. I do not know if this is possible with the calls in libhandle.so. But if I had such handle based tools, I think I could make a nice NFS server on top of XFS (I did this kind of port for LUSTRE (which has a full handle based API) in my NFS server and I had really good performances). Can someone provide me with information about this ? Regards Philippe From sandeen@redhat.com Wed May 5 10:48:51 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o45Fmo8I042650 for ; Wed, 5 May 2010 10:48:51 -0500 X-ASG-Debug-ID: 1273074658-153202ab0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mx1.redhat.com (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 4B3BB94443B for ; Wed, 5 May 2010 08:50:58 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by cuda.sgi.com with ESMTP id 3G3vav7Ko7FvhkXw for ; Wed, 05 May 2010 08:50:58 -0700 (PDT) X-ASG-Whitelist: Barracuda Reputation Received: from int-mx02.intmail.prod.int.phx2.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) by mx1.redhat.com (8.13.8/8.13.8) with ESMTP id o45FoAUN026171 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 5 May 2010 11:50:10 -0400 Received: from neon.msp.redhat.com (neon.msp.redhat.com [10.15.80.10]) by int-mx02.intmail.prod.int.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id o45Fo8tD009208; Wed, 5 May 2010 11:50:08 -0400 Message-ID: <4BE193AF.3070505@redhat.com> Date: Wed, 05 May 2010 
10:50:07 -0500 From: Eric Sandeen User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.8) Gecko/20100301 Fedora/3.0.3-1.fc11 Lightning/1.0b2pre Thunderbird/3.0.3 MIME-Version: 1.0 To: "Amit K. Arora" CC: Christoph Hellwig , Andrew Morton , xfs@oss.sgi.com, Nikanth Karthikesan , coly.li@suse.de, Nick Piggin , Alexander Viro , linux-fsdevel@vger.kernel.org, "Theodore Ts'o" , Andreas Dilger , linux-ext4@vger.kernel.org, Eelis , Amit Arora X-ASG-Orig-Subj: Re: [PATCH v2] New testcase to check if fallocate respects RLIMIT_FSIZE or not Subject: Re: [PATCH v2] New testcase to check if fallocate respects RLIMIT_FSIZE or not References: <201004281854.49730.knikanth@suse.de> <4BD85F1F.7030100@suse.de> <201004291014.07194.knikanth@suse.de> <20100430143319.d51d6d77.akpm@linux-foundation.org> <20100501070426.GA9562@amitarora.in.ibm.com> <20100501101846.GA3769@infradead.org> <20100503083135.GC13756@amitarora.in.ibm.com> <4BE08718.5040608@redhat.com> <20100505075556.GA5142@amitarora.in.ibm.com> In-Reply-To: <20100505075556.GA5142@amitarora.in.ibm.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.67 on 10.5.11.12 X-Barracuda-Connect: mx1.redhat.com[209.132.183.28] X-Barracuda-Start-Time: 1273074660 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On 05/05/2010 02:55 AM, Amit K. Arora wrote: > On Tue, May 04, 2010 at 03:44:08PM -0500, Eric Sandeen wrote: >> Amit K. Arora wrote: >>> Here is the new testcase. >> Thanks! A few comments... > Thanks for the review! Sure thing - looks good, I'll merge it after a retest if it all goes well. 
:) -Eric From lists@nabble.com Wed May 5 15:07:20 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM, T_TO_NO_BRKTS_FREEMAIL autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o45K7KgN052556 for ; Wed, 5 May 2010 15:07:20 -0500 X-ASG-Debug-ID: 1273090168-0ae700c40000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from kuber.nabble.com (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 3EA271BD4CE3 for ; Wed, 5 May 2010 13:09:28 -0700 (PDT) Received: from kuber.nabble.com (kuber.nabble.com [216.139.236.158]) by cuda.sgi.com with ESMTP id EN57YOvciPBrVMRK for ; Wed, 05 May 2010 13:09:28 -0700 (PDT) Received: from isper.nabble.com ([192.168.236.156]) by kuber.nabble.com with esmtp (Exim 4.63) (envelope-from ) id 1O9kuO-0002wM-0U for xfs@oss.sgi.com; Wed, 05 May 2010 13:09:28 -0700 Message-ID: <28465863.post@talk.nabble.com> Date: Wed, 5 May 2010 13:09:27 -0700 (PDT) From: Rafal Blaszczyk To: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: Fwd: xfs - fixing wrong xfs size Subject: Re: Fwd: xfs - fixing wrong xfs size In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Nabble-From: blaszczykr+linux@gmail.com References: <49E353FD.5060207@sandeen.net> X-Barracuda-Connect: kuber.nabble.com[216.139.236.158] X-Barracuda-Start-Time: 1273090169 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29149 Rule breakdown below pts rule name description ---- ---------------------- 
-------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Mon, Apr 13, 2009 at 5:02 PM, Eric Sandeen wrote: > > attempt to access beyond end of device > > md0: rw=0, want=123024384, limit=123023488 > > I/O error in filesystem ("md0") meta-data dev md0 block 0x75533f8 > > ("xfs_read_buf") error 5 buf count 4096 > > > > XFS: size check 2 failed Hi, I've had similar problems with md on top of lvm after converting from a single LV to a mirrored one (md device). I managed to solve it by expanding the underlying devices by just a few megabytes. I suppose it's not your case because you're dealing with bare devices. But you could still try to experiment with xfs_growfs or other xfs_* tools. Judging from my case, this is not md's fault. XFS wants more underlying space than the device has, so it cannot be mounted. -- View this message in context: http://old.nabble.com/xfs---fixing-wrong-xfs-size-tp23023562p28465863.html Sent from the Xfs - General mailing list archive at Nabble.com. 
From sandeen@sandeen.net Wed May 5 17:09:56 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o45M9t5k056550 for ; Wed, 5 May 2010 17:09:56 -0500 X-ASG-Debug-ID: 1273097524-4a7c01ac0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mx1.redhat.com (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 6E21C94AFC4 for ; Wed, 5 May 2010 15:12:05 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by cuda.sgi.com with ESMTP id zqsjKnUs16pol6pf for ; Wed, 05 May 2010 15:12:05 -0700 (PDT) X-ASG-Whitelist: Barracuda Reputation Received: from int-mx08.intmail.prod.int.phx2.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.21]) by mx1.redhat.com (8.13.8/8.13.8) with ESMTP id o45MC3jB005031 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 5 May 2010 18:12:03 -0400 Received: from neon.msp.redhat.com (neon.msp.redhat.com [10.15.80.10]) by int-mx08.intmail.prod.int.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id o45MC2iM028754; Wed, 5 May 2010 18:12:03 -0400 Message-ID: <4BE1ED32.7010108@sandeen.net> Date: Wed, 05 May 2010 17:12:02 -0500 From: Eric Sandeen User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.8) Gecko/20100301 Fedora/3.0.3-1.fc11 Lightning/1.0b2pre Thunderbird/3.0.3 MIME-Version: 1.0 To: Rafal Blaszczyk CC: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: Fwd: xfs - fixing wrong xfs size Subject: Re: Fwd: xfs - fixing wrong xfs size References: <49E353FD.5060207@sandeen.net> <28465863.post@talk.nabble.com> In-Reply-To: <28465863.post@talk.nabble.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.67 on 10.5.11.21 X-Barracuda-Connect: 
mx1.redhat.com[209.132.183.28] X-Barracuda-Start-Time: 1273097525 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On 05/05/2010 03:09 PM, Rafal Blaszczyk wrote: > > > > On Mon, Apr 13, 2009 at 5:02 PM, Eric Sandeen wrote: > > >>> attempt to access beyond end of device >>> md0: rw=0, want=123024384, limit=123023488 >>> I/O error in filesystem ("md0") meta-data dev md0 block 0x75533f8 >> >> ("xfs_read_buf") error 5 buf count 4096 >>> >>> XFS: size check 2 failed > > Hi, > I've had similiar problems with md on top of lvm after converting from > single LV to mirrored (md device). I've managed to solve it by expanding > underlying devices by just a few megabytes. I suppose it's not your case > because you're dealing with bare devices. > But you could still try to experiment with xfs_growfs or other xfs_* tools. > > From my case - this is not md's fault. XFS wants to have more underlying > space than it has and it cannot be mounted. xfs wants only as much space as it has when it was created :) If that changes such that it is smaller, you'll get this warning at mount time. This is almost certainly the result of something that happened outside xfs's control. 
-Eric From sandeen@sandeen.net Wed May 5 17:20:15 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o45MKEUW057004 for ; Wed, 5 May 2010 17:20:15 -0500 X-ASG-Debug-ID: 1273098144-462c02550000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mx1.redhat.com (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 7DC2294B1A2 for ; Wed, 5 May 2010 15:22:24 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by cuda.sgi.com with ESMTP id ua8FdzbZXHpYaYAs for ; Wed, 05 May 2010 15:22:24 -0700 (PDT) X-ASG-Whitelist: Barracuda Reputation Received: from int-mx01.intmail.prod.int.phx2.redhat.com (int-mx01.intmail.prod.int.phx2.redhat.com [10.5.11.11]) by mx1.redhat.com (8.13.8/8.13.8) with ESMTP id o45MMLwS004751 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 5 May 2010 18:22:22 -0400 Received: from neon.msp.redhat.com (neon.msp.redhat.com [10.15.80.10]) by int-mx01.intmail.prod.int.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id o45MMLJG023319; Wed, 5 May 2010 18:22:21 -0400 Message-ID: <4BE1EF9D.2030901@sandeen.net> Date: Wed, 05 May 2010 17:22:21 -0500 From: Eric Sandeen User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.8) Gecko/20100301 Fedora/3.0.3-1.fc11 Lightning/1.0b2pre Thunderbird/3.0.3 MIME-Version: 1.0 To: Nebojsa Trpkovic CC: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: Fwd: xfs - fixing wrong xfs size Subject: Re: Fwd: xfs - fixing wrong xfs size References: <49E353FD.5060207@sandeen.net> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.67 on 10.5.11.11 X-Barracuda-Connect: mx1.redhat.com[209.132.183.28] X-Barracuda-Start-Time: 1273098144 
X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On 04/13/2009 11:33 AM, Nebojsa Trpkovic wrote: > how can I set superblock block-count value (now I have realy nothing to > loose - the only other option is to give up of that data) ? Sorry, I never replied to this :( You can use xfs_db to set it; or you could comment out the kernel check... and mount readonly, and copy off the data you can get to? -Eric From SRS0+o8Pk+65+fromorbit.com=dave@internode.on.net Wed May 5 20:43:50 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-3.8 required=5.0 tests=BAYES_00,FRT_ADOBE2, J_CHICKENPOX_64,J_CHICKENPOX_65,LOCAL_GNU_PATCH autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o461hn2h064952 for ; Wed, 5 May 2010 20:43:50 -0500 X-ASG-Debug-ID: 1273110355-746600af0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 2EB421B0A122 for ; Wed, 5 May 2010 18:45:56 -0700 (PDT) Received: from mail.internode.on.net (bld-mail17.adl2.internode.on.net [150.101.137.102]) by cuda.sgi.com with ESMTP id h5nacnRybBUXEhhA for ; Wed, 05 May 2010 18:45:56 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23089810-1927428 for ; Thu, 06 May 2010 11:15:54 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1O9q9w-0005DS-Ep for xfs@oss.sgi.com; Thu, 06 May 2010 11:45:52 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1O9q9v-0000cR-G4 for xfs@oss.sgi.com; Thu, 06 May 2010 11:45:51 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: 
[PATCH 05/11] xfs: Clean up XFS_BLI_* flag namespace Subject: [PATCH 05/11] xfs: Clean up XFS_BLI_* flag namespace Date: Thu, 6 May 2010 11:45:45 +1000 Message-Id: <1273110351-2333-6-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273110351-2333-1-git-send-email-david@fromorbit.com> References: <1273110351-2333-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: bld-mail17.adl2.internode.on.net[150.101.137.102] X-Barracuda-Start-Time: 1273110358 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29166 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner Clean up the buffer log format (XFS_BLI_*) flags because they have a polluted namespace. The XFS_BLI_ prefix is used for both in-memory and on-disk flag fields, but the two sets have overlapping values for different flags. Rename the buffer log format flags to use the XFS_BLF_* prefix to avoid confusing them with the in-memory XFS_BLI_* prefixed flags. 
Signed-off-by: Dave Chinner --- fs/xfs/linux-2.6/xfs_super.c | 2 +- fs/xfs/quota/xfs_dquot.c | 6 ++-- fs/xfs/xfs_buf_item.c | 42 +++++++++++++++++++------------------- fs/xfs/xfs_buf_item.h | 14 ++++++------ fs/xfs/xfs_log_recover.c | 46 +++++++++++++++++++++--------------------- fs/xfs/xfs_log_recover.h | 2 +- fs/xfs/xfs_trans_buf.c | 28 ++++++++++++------------ 7 files changed, 70 insertions(+), 70 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c index a43d09e..1e88c98 100644 --- a/fs/xfs/linux-2.6/xfs_super.c +++ b/fs/xfs/linux-2.6/xfs_super.c @@ -1753,7 +1753,7 @@ xfs_init_zones(void) * but it is much faster. */ xfs_buf_item_zone = kmem_zone_init((sizeof(xfs_buf_log_item_t) + - (((XFS_MAX_BLOCKSIZE / XFS_BLI_CHUNK) / + (((XFS_MAX_BLOCKSIZE / XFS_BLF_CHUNK) / NBWORD) * sizeof(int))), "xfs_buf_item"); if (!xfs_buf_item_zone) goto out_destroy_trans_zone; diff --git a/fs/xfs/quota/xfs_dquot.c b/fs/xfs/quota/xfs_dquot.c index b89ec5d..585e763 100644 --- a/fs/xfs/quota/xfs_dquot.c +++ b/fs/xfs/quota/xfs_dquot.c @@ -344,9 +344,9 @@ xfs_qm_init_dquot_blk( for (i = 0; i < q->qi_dqperchunk; i++, d++, curid++) xfs_qm_dqinit_core(curid, type, d); xfs_trans_dquot_buf(tp, bp, - (type & XFS_DQ_USER ? XFS_BLI_UDQUOT_BUF : - ((type & XFS_DQ_PROJ) ? XFS_BLI_PDQUOT_BUF : - XFS_BLI_GDQUOT_BUF))); + (type & XFS_DQ_USER ? XFS_BLF_UDQUOT_BUF : + ((type & XFS_DQ_PROJ) ? 
XFS_BLF_PDQUOT_BUF : + XFS_BLF_GDQUOT_BUF))); xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1); } diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index 4cd5f61..bcbb661 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -64,7 +64,7 @@ xfs_buf_item_log_debug( nbytes = last - first + 1; bfset(bip->bli_logged, first, nbytes); for (x = 0; x < nbytes; x++) { - chunk_num = byte >> XFS_BLI_SHIFT; + chunk_num = byte >> XFS_BLF_SHIFT; word_num = chunk_num >> BIT_TO_WORD_SHIFT; bit_num = chunk_num & (NBWORD - 1); wordp = &(bip->bli_format.blf_data_map[word_num]); @@ -166,7 +166,7 @@ xfs_buf_item_size( * cancel flag in it. */ trace_xfs_buf_item_size_stale(bip); - ASSERT(bip->bli_format.blf_flags & XFS_BLI_CANCEL); + ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL); return 1; } @@ -197,9 +197,9 @@ xfs_buf_item_size( } else if (next_bit != last_bit + 1) { last_bit = next_bit; nvecs++; - } else if (xfs_buf_offset(bp, next_bit * XFS_BLI_CHUNK) != - (xfs_buf_offset(bp, last_bit * XFS_BLI_CHUNK) + - XFS_BLI_CHUNK)) { + } else if (xfs_buf_offset(bp, next_bit * XFS_BLF_CHUNK) != + (xfs_buf_offset(bp, last_bit * XFS_BLF_CHUNK) + + XFS_BLF_CHUNK)) { last_bit = next_bit; nvecs++; } else { @@ -261,7 +261,7 @@ xfs_buf_item_format( * cancel flag in it. */ trace_xfs_buf_item_format_stale(bip); - ASSERT(bip->bli_format.blf_flags & XFS_BLI_CANCEL); + ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL); bip->bli_format.blf_size = nvecs; return; } @@ -294,28 +294,28 @@ xfs_buf_item_format( * keep counting and scanning. 
*/ if (next_bit == -1) { - buffer_offset = first_bit * XFS_BLI_CHUNK; + buffer_offset = first_bit * XFS_BLF_CHUNK; vecp->i_addr = xfs_buf_offset(bp, buffer_offset); - vecp->i_len = nbits * XFS_BLI_CHUNK; + vecp->i_len = nbits * XFS_BLF_CHUNK; vecp->i_type = XLOG_REG_TYPE_BCHUNK; nvecs++; break; } else if (next_bit != last_bit + 1) { - buffer_offset = first_bit * XFS_BLI_CHUNK; + buffer_offset = first_bit * XFS_BLF_CHUNK; vecp->i_addr = xfs_buf_offset(bp, buffer_offset); - vecp->i_len = nbits * XFS_BLI_CHUNK; + vecp->i_len = nbits * XFS_BLF_CHUNK; vecp->i_type = XLOG_REG_TYPE_BCHUNK; nvecs++; vecp++; first_bit = next_bit; last_bit = next_bit; nbits = 1; - } else if (xfs_buf_offset(bp, next_bit << XFS_BLI_SHIFT) != - (xfs_buf_offset(bp, last_bit << XFS_BLI_SHIFT) + - XFS_BLI_CHUNK)) { - buffer_offset = first_bit * XFS_BLI_CHUNK; + } else if (xfs_buf_offset(bp, next_bit << XFS_BLF_SHIFT) != + (xfs_buf_offset(bp, last_bit << XFS_BLF_SHIFT) + + XFS_BLF_CHUNK)) { + buffer_offset = first_bit * XFS_BLF_CHUNK; vecp->i_addr = xfs_buf_offset(bp, buffer_offset); - vecp->i_len = nbits * XFS_BLI_CHUNK; + vecp->i_len = nbits * XFS_BLF_CHUNK; vecp->i_type = XLOG_REG_TYPE_BCHUNK; /* You would think we need to bump the nvecs here too, but we do not * this number is used by recovery, and it gets confused by the boundary @@ -399,7 +399,7 @@ xfs_buf_item_unpin( ASSERT(XFS_BUF_VALUSEMA(bp) <= 0); ASSERT(!(XFS_BUF_ISDELAYWRITE(bp))); ASSERT(XFS_BUF_ISSTALE(bp)); - ASSERT(bip->bli_format.blf_flags & XFS_BLI_CANCEL); + ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL); trace_xfs_buf_item_unpin_stale(bip); /* @@ -550,7 +550,7 @@ xfs_buf_item_unlock( */ if (bip->bli_flags & XFS_BLI_STALE) { trace_xfs_buf_item_unlock_stale(bip); - ASSERT(bip->bli_format.blf_flags & XFS_BLI_CANCEL); + ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL); if (!aborted) { atomic_dec(&bip->bli_refcount); return; @@ -707,12 +707,12 @@ xfs_buf_item_init( } /* - * chunks is the number of XFS_BLI_CHUNK size pieces + * 
chunks is the number of XFS_BLF_CHUNK size pieces * the buffer can be divided into. Make sure not to * truncate any pieces. map_size is the size of the * bitmap needed to describe the chunks of the buffer. */ - chunks = (int)((XFS_BUF_COUNT(bp) + (XFS_BLI_CHUNK - 1)) >> XFS_BLI_SHIFT); + chunks = (int)((XFS_BUF_COUNT(bp) + (XFS_BLF_CHUNK - 1)) >> XFS_BLF_SHIFT); map_size = (int)((chunks + NBWORD) >> BIT_TO_WORD_SHIFT); bip = (xfs_buf_log_item_t*)kmem_zone_zalloc(xfs_buf_item_zone, @@ -780,8 +780,8 @@ xfs_buf_item_log( /* * Convert byte offsets to bit numbers. */ - first_bit = first >> XFS_BLI_SHIFT; - last_bit = last >> XFS_BLI_SHIFT; + first_bit = first >> XFS_BLF_SHIFT; + last_bit = last >> XFS_BLF_SHIFT; /* * Calculate the total number of bits to be set. diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h index df44545..8cbb82b 100644 --- a/fs/xfs/xfs_buf_item.h +++ b/fs/xfs/xfs_buf_item.h @@ -41,22 +41,22 @@ typedef struct xfs_buf_log_format { * This flag indicates that the buffer contains on disk inodes * and requires special recovery handling. */ -#define XFS_BLI_INODE_BUF 0x1 +#define XFS_BLF_INODE_BUF 0x1 /* * This flag indicates that the buffer should not be replayed * during recovery because its blocks are being freed. */ -#define XFS_BLI_CANCEL 0x2 +#define XFS_BLF_CANCEL 0x2 /* * This flag indicates that the buffer contains on disk * user or group dquots and may require special recovery handling. 
*/ -#define XFS_BLI_UDQUOT_BUF 0x4 -#define XFS_BLI_PDQUOT_BUF 0x8 -#define XFS_BLI_GDQUOT_BUF 0x10 +#define XFS_BLF_UDQUOT_BUF 0x4 +#define XFS_BLF_PDQUOT_BUF 0x8 +#define XFS_BLF_GDQUOT_BUF 0x10 -#define XFS_BLI_CHUNK 128 -#define XFS_BLI_SHIFT 7 +#define XFS_BLF_CHUNK 128 +#define XFS_BLF_SHIFT 7 #define BIT_TO_WORD_SHIFT 5 #define NBWORD (NBBY * sizeof(unsigned int)) diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 0de08e3..14a69ae 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -1576,7 +1576,7 @@ xlog_recover_reorder_trans( switch (ITEM_TYPE(item)) { case XFS_LI_BUF: - if (!(buf_f->blf_flags & XFS_BLI_CANCEL)) { + if (!(buf_f->blf_flags & XFS_BLF_CANCEL)) { trace_xfs_log_recover_item_reorder_head(log, trans, item, pass); list_move(&item->ri_list, &trans->r_itemq); @@ -1638,7 +1638,7 @@ xlog_recover_do_buffer_pass1( /* * If this isn't a cancel buffer item, then just return. */ - if (!(flags & XFS_BLI_CANCEL)) { + if (!(flags & XFS_BLF_CANCEL)) { trace_xfs_log_recover_buf_not_cancel(log, buf_f); return; } @@ -1696,7 +1696,7 @@ xlog_recover_do_buffer_pass1( * Check to see whether the buffer being recovered has a corresponding * entry in the buffer cancel record table. If it does then return 1 * so that it will be cancelled, otherwise return 0. If the buffer is - * actually a buffer cancel item (XFS_BLI_CANCEL is set), then decrement + * actually a buffer cancel item (XFS_BLF_CANCEL is set), then decrement * the refcount on the entry in the table and remove it from the table * if this is the last reference. * @@ -1721,7 +1721,7 @@ xlog_check_buffer_cancelled( * There is nothing in the table built in pass one, * so this buffer must not be cancelled. */ - ASSERT(!(flags & XFS_BLI_CANCEL)); + ASSERT(!(flags & XFS_BLF_CANCEL)); return 0; } @@ -1733,7 +1733,7 @@ xlog_check_buffer_cancelled( * There is no corresponding entry in the table built * in pass one, so this buffer has not been cancelled. 
*/ - ASSERT(!(flags & XFS_BLI_CANCEL)); + ASSERT(!(flags & XFS_BLF_CANCEL)); return 0; } @@ -1752,7 +1752,7 @@ xlog_check_buffer_cancelled( * one in the table and remove it if this is the * last reference. */ - if (flags & XFS_BLI_CANCEL) { + if (flags & XFS_BLF_CANCEL) { bcp->bc_refcount--; if (bcp->bc_refcount == 0) { if (prevp == NULL) { @@ -1772,7 +1772,7 @@ xlog_check_buffer_cancelled( * We didn't find a corresponding entry in the table, so * return 0 so that the buffer is NOT cancelled. */ - ASSERT(!(flags & XFS_BLI_CANCEL)); + ASSERT(!(flags & XFS_BLF_CANCEL)); return 0; } @@ -1874,8 +1874,8 @@ xlog_recover_do_inode_buffer( nbits = xfs_contig_bits(data_map, map_size, bit); ASSERT(nbits > 0); - reg_buf_offset = bit << XFS_BLI_SHIFT; - reg_buf_bytes = nbits << XFS_BLI_SHIFT; + reg_buf_offset = bit << XFS_BLF_SHIFT; + reg_buf_bytes = nbits << XFS_BLF_SHIFT; item_index++; } @@ -1889,7 +1889,7 @@ xlog_recover_do_inode_buffer( } ASSERT(item->ri_buf[item_index].i_addr != NULL); - ASSERT((item->ri_buf[item_index].i_len % XFS_BLI_CHUNK) == 0); + ASSERT((item->ri_buf[item_index].i_len % XFS_BLF_CHUNK) == 0); ASSERT((reg_buf_offset + reg_buf_bytes) <= XFS_BUF_COUNT(bp)); /* @@ -1955,9 +1955,9 @@ xlog_recover_do_reg_buffer( nbits = xfs_contig_bits(data_map, map_size, bit); ASSERT(nbits > 0); ASSERT(item->ri_buf[i].i_addr != NULL); - ASSERT(item->ri_buf[i].i_len % XFS_BLI_CHUNK == 0); + ASSERT(item->ri_buf[i].i_len % XFS_BLF_CHUNK == 0); ASSERT(XFS_BUF_COUNT(bp) >= - ((uint)bit << XFS_BLI_SHIFT)+(nbits<blf_flags & - (XFS_BLI_UDQUOT_BUF|XFS_BLI_PDQUOT_BUF|XFS_BLI_GDQUOT_BUF)) { + (XFS_BLF_UDQUOT_BUF|XFS_BLF_PDQUOT_BUF|XFS_BLF_GDQUOT_BUF)) { if (item->ri_buf[i].i_addr == NULL) { cmn_err(CE_ALERT, "XFS: NULL dquot in %s.", __func__); @@ -1987,9 +1987,9 @@ xlog_recover_do_reg_buffer( } memcpy(xfs_buf_offset(bp, - (uint)bit << XFS_BLI_SHIFT), /* dest */ + (uint)bit << XFS_BLF_SHIFT), /* dest */ item->ri_buf[i].i_addr, /* source */ - nbits<blf_flags & XFS_BLI_UDQUOT_BUF) + if 
(buf_f->blf_flags & XFS_BLF_UDQUOT_BUF) type |= XFS_DQ_USER; - if (buf_f->blf_flags & XFS_BLI_PDQUOT_BUF) + if (buf_f->blf_flags & XFS_BLF_PDQUOT_BUF) type |= XFS_DQ_PROJ; - if (buf_f->blf_flags & XFS_BLI_GDQUOT_BUF) + if (buf_f->blf_flags & XFS_BLF_GDQUOT_BUF) type |= XFS_DQ_GROUP; /* * This type of quotas was turned off, so ignore this buffer @@ -2173,7 +2173,7 @@ xlog_recover_do_dquot_buffer( * here which overlaps that may be stale. * * When meta-data buffers are freed at run time we log a buffer item - * with the XFS_BLI_CANCEL bit set to indicate that previous copies + * with the XFS_BLF_CANCEL bit set to indicate that previous copies * of the buffer in the log should not be replayed at recovery time. * This is so that if the blocks covered by the buffer are reused for * file data before we crash we don't end up replaying old, freed @@ -2207,7 +2207,7 @@ xlog_recover_do_buffer_trans( if (pass == XLOG_RECOVER_PASS1) { /* * In this pass we're only looking for buf items - * with the XFS_BLI_CANCEL bit set. + * with the XFS_BLF_CANCEL bit set. 
*/ xlog_recover_do_buffer_pass1(log, buf_f); return 0; @@ -2244,7 +2244,7 @@ xlog_recover_do_buffer_trans( mp = log->l_mp; buf_flags = XBF_LOCK; - if (!(flags & XFS_BLI_INODE_BUF)) + if (!(flags & XFS_BLF_INODE_BUF)) buf_flags |= XBF_MAPPED; bp = xfs_buf_read(mp->m_ddev_targp, blkno, len, buf_flags); @@ -2257,10 +2257,10 @@ xlog_recover_do_buffer_trans( } error = 0; - if (flags & XFS_BLI_INODE_BUF) { + if (flags & XFS_BLF_INODE_BUF) { error = xlog_recover_do_inode_buffer(mp, item, bp, buf_f); } else if (flags & - (XFS_BLI_UDQUOT_BUF|XFS_BLI_PDQUOT_BUF|XFS_BLI_GDQUOT_BUF)) { + (XFS_BLF_UDQUOT_BUF|XFS_BLF_PDQUOT_BUF|XFS_BLF_GDQUOT_BUF)) { xlog_recover_do_dquot_buffer(mp, log, item, bp, buf_f); } else { xlog_recover_do_reg_buffer(mp, item, bp, buf_f); diff --git a/fs/xfs/xfs_log_recover.h b/fs/xfs/xfs_log_recover.h index 75d7492..1c55ccb 100644 --- a/fs/xfs/xfs_log_recover.h +++ b/fs/xfs/xfs_log_recover.h @@ -28,7 +28,7 @@ #define XLOG_RHASH(tid) \ ((((__uint32_t)tid)>>XLOG_RHASH_SHIFT) & (XLOG_RHASH_SIZE-1)) -#define XLOG_MAX_REGIONS_IN_ITEM (XFS_MAX_BLOCKSIZE / XFS_BLI_CHUNK / 2 + 1) +#define XLOG_MAX_REGIONS_IN_ITEM (XFS_MAX_BLOCKSIZE / XFS_BLF_CHUNK / 2 + 1) /* diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c index 9cd8090..3390c3e 100644 --- a/fs/xfs/xfs_trans_buf.c +++ b/fs/xfs/xfs_trans_buf.c @@ -114,7 +114,7 @@ _xfs_trans_bjoin( xfs_buf_item_init(bp, tp->t_mountp); bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); ASSERT(!(bip->bli_flags & XFS_BLI_STALE)); - ASSERT(!(bip->bli_format.blf_flags & XFS_BLI_CANCEL)); + ASSERT(!(bip->bli_format.blf_flags & XFS_BLF_CANCEL)); ASSERT(!(bip->bli_flags & XFS_BLI_LOGGED)); if (reset_recur) bip->bli_recur = 0; @@ -511,7 +511,7 @@ xfs_trans_brelse(xfs_trans_t *tp, bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); ASSERT(bip->bli_item.li_type == XFS_LI_BUF); ASSERT(!(bip->bli_flags & XFS_BLI_STALE)); - ASSERT(!(bip->bli_format.blf_flags & XFS_BLI_CANCEL)); + ASSERT(!(bip->bli_format.blf_flags & 
XFS_BLF_CANCEL)); ASSERT(atomic_read(&bip->bli_refcount) > 0); /* @@ -619,7 +619,7 @@ xfs_trans_bhold(xfs_trans_t *tp, bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); ASSERT(!(bip->bli_flags & XFS_BLI_STALE)); - ASSERT(!(bip->bli_format.blf_flags & XFS_BLI_CANCEL)); + ASSERT(!(bip->bli_format.blf_flags & XFS_BLF_CANCEL)); ASSERT(atomic_read(&bip->bli_refcount) > 0); bip->bli_flags |= XFS_BLI_HOLD; trace_xfs_trans_bhold(bip); @@ -641,7 +641,7 @@ xfs_trans_bhold_release(xfs_trans_t *tp, bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); ASSERT(!(bip->bli_flags & XFS_BLI_STALE)); - ASSERT(!(bip->bli_format.blf_flags & XFS_BLI_CANCEL)); + ASSERT(!(bip->bli_format.blf_flags & XFS_BLF_CANCEL)); ASSERT(atomic_read(&bip->bli_refcount) > 0); ASSERT(bip->bli_flags & XFS_BLI_HOLD); bip->bli_flags &= ~XFS_BLI_HOLD; @@ -704,7 +704,7 @@ xfs_trans_log_buf(xfs_trans_t *tp, bip->bli_flags &= ~XFS_BLI_STALE; ASSERT(XFS_BUF_ISSTALE(bp)); XFS_BUF_UNSTALE(bp); - bip->bli_format.blf_flags &= ~XFS_BLI_CANCEL; + bip->bli_format.blf_flags &= ~XFS_BLF_CANCEL; } lidp = xfs_trans_find_item(tp, (xfs_log_item_t*)bip); @@ -762,8 +762,8 @@ xfs_trans_binval( ASSERT(!(XFS_BUF_ISDELAYWRITE(bp))); ASSERT(XFS_BUF_ISSTALE(bp)); ASSERT(!(bip->bli_flags & (XFS_BLI_LOGGED | XFS_BLI_DIRTY))); - ASSERT(!(bip->bli_format.blf_flags & XFS_BLI_INODE_BUF)); - ASSERT(bip->bli_format.blf_flags & XFS_BLI_CANCEL); + ASSERT(!(bip->bli_format.blf_flags & XFS_BLF_INODE_BUF)); + ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL); ASSERT(lidp->lid_flags & XFS_LID_DIRTY); ASSERT(tp->t_flags & XFS_TRANS_DIRTY); return; @@ -774,7 +774,7 @@ xfs_trans_binval( * in the buf log item. The STALE flag will be used in * xfs_buf_item_unpin() to determine if it should clean up * when the last reference to the buf item is given up. - * We set the XFS_BLI_CANCEL flag in the buf log format structure + * We set the XFS_BLF_CANCEL flag in the buf log format structure * and log the buf item. 
This will be used at recovery time * to determine that copies of the buffer in the log before * this should not be replayed. @@ -793,8 +793,8 @@ xfs_trans_binval( XFS_BUF_STALE(bp); bip->bli_flags |= XFS_BLI_STALE; bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_DIRTY); - bip->bli_format.blf_flags &= ~XFS_BLI_INODE_BUF; - bip->bli_format.blf_flags |= XFS_BLI_CANCEL; + bip->bli_format.blf_flags &= ~XFS_BLF_INODE_BUF; + bip->bli_format.blf_flags |= XFS_BLF_CANCEL; memset((char *)(bip->bli_format.blf_data_map), 0, (bip->bli_format.blf_map_size * sizeof(uint))); lidp->lid_flags |= XFS_LID_DIRTY; @@ -826,7 +826,7 @@ xfs_trans_inode_buf( bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); ASSERT(atomic_read(&bip->bli_refcount) > 0); - bip->bli_format.blf_flags |= XFS_BLI_INODE_BUF; + bip->bli_format.blf_flags |= XFS_BLF_INODE_BUF; } /* @@ -908,9 +908,9 @@ xfs_trans_dquot_buf( ASSERT(XFS_BUF_ISBUSY(bp)); ASSERT(XFS_BUF_FSPRIVATE2(bp, xfs_trans_t *) == tp); ASSERT(XFS_BUF_FSPRIVATE(bp, void *) != NULL); - ASSERT(type == XFS_BLI_UDQUOT_BUF || - type == XFS_BLI_PDQUOT_BUF || - type == XFS_BLI_GDQUOT_BUF); + ASSERT(type == XFS_BLF_UDQUOT_BUF || + type == XFS_BLF_PDQUOT_BUF || + type == XFS_BLF_GDQUOT_BUF); bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); ASSERT(atomic_read(&bip->bli_refcount) > 0); -- 1.5.6.5 From SRS0+QjcZ+65+fromorbit.com=dave@internode.on.net Wed May 5 20:43:56 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.8 required=5.0 tests=BAYES_00,J_CHICKENPOX_65 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o461huLt064969 for ; Wed, 5 May 2010 20:43:56 -0500 X-ASG-Debug-ID: 1273110364-707500d60000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 9FD311B0A173 for ; 
Wed, 5 May 2010 18:46:04 -0700 (PDT) Received: from mail.internode.on.net (bld-mail15.adl6.internode.on.net [150.101.137.100]) by cuda.sgi.com with ESMTP id Og2TiwyOnl4F3hsp for ; Wed, 05 May 2010 18:46:04 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 11655209-1927428 for ; Thu, 06 May 2010 11:16:03 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1O9q9w-0005DL-6Z for xfs@oss.sgi.com; Thu, 06 May 2010 11:45:52 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1O9q9v-0000cF-5z for xfs@oss.sgi.com; Thu, 06 May 2010 11:45:51 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 01/11] xfs: Don't reuse the same transaciton ID for duplicated transactions. Subject: [PATCH 01/11] xfs: Don't reuse the same transaciton ID for duplicated transactions. Date: Thu, 6 May 2010 11:45:41 +1000 Message-Id: <1273110351-2333-2-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273110351-2333-1-git-send-email-david@fromorbit.com> References: <1273110351-2333-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: bld-mail15.adl6.internode.on.net[150.101.137.100] X-Barracuda-Start-Time: 1273110365 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29166 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner The transaction ID is written into the log as the unique identifier for transactions during recover. 
When duplicating a transaction, we reuse the log ticket, which means it has the same transaction ID as the previous transaction. Rather than regenerating a random transaction ID for the duplicated transaction, just add one to the current ID so that duplicated transactions can be easily spotted in the log and during recovery when diagnosing problems. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log.c | 9 +++++++++ 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 3038dd5..687b220 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -360,6 +360,15 @@ xfs_log_reserve( ASSERT(flags & XFS_LOG_PERM_RESERV); internal_ticket = *ticket; + /* + * this is a new transaction on the ticket, so we need to + * change the transaction ID so that the next transaction has a + * different TID in the log. Just add one to the existing tid + * so that we can see chains of rolling transactions in the log + * easily. + */ + internal_ticket->t_tid++; + trace_xfs_log_reserve(log, internal_ticket); xlog_grant_push_ail(mp, internal_ticket->t_unit_res); -- 1.5.6.5 From SRS0+QjcZ+65+fromorbit.com=dave@internode.on.net Wed May 5 20:43:57 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: * X-Spam-Status: No, score=1.1 required=5.0 tests=BAYES_00,SUBJ_TICKET autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o461hutx064981 for ; Wed, 5 May 2010 20:43:57 -0500 X-ASG-Debug-ID: 1273110364-137202cb0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 51FD33139B0 for ; Wed, 5 May 2010 18:46:04 -0700 (PDT) Received: from mail.internode.on.net (bld-mail18.adl2.internode.on.net [150.101.137.103]) by cuda.sgi.com with ESMTP id qcUOHz0n7KEjpG5G for ; Wed, 05 May 2010 18:46:04 -0700 (PDT) 
Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23180505-1927428 for ; Thu, 06 May 2010 11:16:03 +0930 (CST) Received: from [192.168.1.9] (helo=disturbed) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1O9qA6-0005DN-70 for xfs@oss.sgi.com; Thu, 06 May 2010 11:46:02 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1O9q9v-0000cK-AJ for xfs@oss.sgi.com; Thu, 06 May 2010 11:45:51 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 03/11] xfs: allow log ticket allocation to take allocation flags Subject: [PATCH 03/11] xfs: allow log ticket allocation to take allocation flags Date: Thu, 6 May 2010 11:45:43 +1000 Message-Id: <1273110351-2333-4-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273110351-2333-1-git-send-email-david@fromorbit.com> References: <1273110351-2333-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: bld-mail18.adl2.internode.on.net[150.101.137.103] X-Barracuda-Start-Time: 1273110366 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29166 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner Delayed logging currently requires ticket allocation to succeed, so we need to be able to sleep on allocation. It also should not allow memory allocation to recurse into the filesystem. hence we need to pass allocation flags directing the type of allocation the caller requires. 
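As a rough illustration of the idea only (this is not the kernel code): the allocation policy moves out of the allocator and into the caller. The sketch below is userspace C. The demo_* functions and DEMO_* flags are hypothetical stand-ins for xlog_ticket_alloc() and the KM_* flags, and calloc() cannot act on the flags, so they are merely checked and recorded.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical flags, loosely modelled on KM_SLEEP, KM_MAYFAIL and KM_NOFS. */
enum demo_alloc_flags {
	DEMO_SLEEP   = 1 << 0,	/* caller can block */
	DEMO_MAYFAIL = 1 << 1,	/* caller handles a NULL return */
	DEMO_NOFS    = 1 << 2,	/* allocation must not recurse into the fs */
};

/* Hypothetical ticket; only the fields the sketch needs. */
struct demo_ticket {
	int	unit_bytes;
	int	count;
	int	alloc_flags;	/* recorded only so the sketch can be inspected */
};

/*
 * Stand-in for a zone allocator. A real one would act on the flags;
 * calloc() cannot, so they are only sanity-checked here.
 */
static void *demo_zone_zalloc(size_t size, int alloc_flags)
{
	assert(alloc_flags != 0);
	return calloc(1, size);
}

/* The caller, not this function, decides how the allocation may behave. */
static struct demo_ticket *demo_ticket_alloc(int unit_bytes, int count,
					     int alloc_flags)
{
	struct demo_ticket *tic;

	tic = demo_zone_zalloc(sizeof(*tic), alloc_flags);
	if (!tic)
		return NULL;

	tic->unit_bytes = unit_bytes;
	tic->count = count;
	tic->alloc_flags = alloc_flags;
	return tic;
}
```

Under these assumptions, a reservation path could pass DEMO_SLEEP | DEMO_MAYFAIL and check for NULL, while a path that must not re-enter the filesystem could pass DEMO_SLEEP | DEMO_NOFS, without the allocator itself changing.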
Signed-off-by: Dave Chinner --- fs/xfs/xfs_log.c | 16 ++++++++-------- 1 files changed, 8 insertions(+), 8 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 687b220..83be6a6 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -88,11 +88,9 @@ STATIC void xlog_ungrant_log_space(xlog_t *log, /* local ticket functions */ -STATIC xlog_ticket_t *xlog_ticket_alloc(xlog_t *log, - int unit_bytes, - int count, - char clientid, - uint flags); +STATIC xlog_ticket_t *xlog_ticket_alloc(xlog_t *log, int unit_bytes, int count, + char clientid, uint flags, + int alloc_flags); #if defined(DEBUG) STATIC void xlog_verify_dest_ptr(xlog_t *log, char *ptr); @@ -376,7 +374,8 @@ xfs_log_reserve( } else { /* may sleep if need to allocate more tickets */ internal_ticket = xlog_ticket_alloc(log, unit_bytes, cnt, - client, flags); + client, flags, + KM_SLEEP|KM_MAYFAIL); if (!internal_ticket) return XFS_ERROR(ENOMEM); internal_ticket->t_trans_type = t_type; @@ -3331,13 +3330,14 @@ xlog_ticket_alloc( int unit_bytes, int cnt, char client, - uint xflags) + uint xflags, + int alloc_flags) { struct xlog_ticket *tic; uint num_headers; int iclog_space; - tic = kmem_zone_zalloc(xfs_log_ticket_zone, KM_SLEEP|KM_MAYFAIL); + tic = kmem_zone_zalloc(xfs_log_ticket_zone, alloc_flags); if (!tic) return NULL; -- 1.5.6.5 From SRS0+8lZF+65+fromorbit.com=dave@internode.on.net Wed May 5 20:43:56 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o461hu5k064977 for ; Wed, 5 May 2010 20:43:56 -0500 X-ASG-Debug-ID: 1273110364-706800fd0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 7BD9C1B0A173 for ; Wed, 5 May 
2010 18:46:05 -0700 (PDT) Received: from mail.internode.on.net (bld-mail19.adl2.internode.on.net [150.101.137.104]) by cuda.sgi.com with ESMTP id riOHIoqezqvMTVcE for ; Wed, 05 May 2010 18:46:05 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23118553-1927428 for ; Thu, 06 May 2010 11:16:03 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1O9q9w-0005DK-6U for xfs@oss.sgi.com; Thu, 06 May 2010 11:45:52 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1O9q9v-0000cD-44 for xfs@oss.sgi.com; Thu, 06 May 2010 11:45:51 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 0/11] xfs: delayed logging Subject: [PATCH 0/11] xfs: delayed logging Date: Thu, 6 May 2010 11:45:40 +1000 Message-Id: <1273110351-2333-1-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 X-Barracuda-Connect: bld-mail19.adl2.internode.on.net[150.101.137.104] X-Barracuda-Start-Time: 1273110366 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-ASG-Whitelist: BODY (http://marc\.info/\?) X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Hi Folks, This is version 4 of the delayed logging series. I won't repeat everything about what it is, just point you here: http://marc.info/?l=linux-xfs&m=126862777118946&w=2 for the description, and here: git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfs.git delayed-logging for the current code. Note that this is a rebased branch, so you'll need to pull it again into a new branch to review. This version includes a number of fixes and cleanups related to the busy extent tracking. This includes fixing a long standing off-by-one that was causing assert failures when inserting busy extents that overlapped with existing busy extents. 
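The fix itself is not quoted in this mail, so the following is only a generic sketch of how an off-by-one can creep into an extent range match, not the actual XFS change. It assumes extents are half-open block ranges [start, start + len); the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical extent covering blocks [start, start + len). */
struct demo_extent {
	uint32_t	start;
	uint32_t	len;
};

/*
 * Half-open overlap test: two extents overlap only if each one starts
 * strictly before the other one ends. Using <= instead of < would report
 * extents that merely touch as overlapping, which is the classic
 * off-by-one in this kind of comparison.
 */
static bool demo_extents_overlap(struct demo_extent a, struct demo_extent b)
{
	/* Compute the end blocks in 64 bits so start + len cannot wrap. */
	uint64_t a_end = (uint64_t)a.start + a.len;
	uint64_t b_end = (uint64_t)b.start + b.len;

	assert(a.len > 0 && b.len > 0);
	return a.start < b_end && b.start < a_end;
}
```

With this definition, blocks 10..14 and 14..16 overlap (they share block 14), while blocks 10..14 and 15..17 are adjacent and do not.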
The patch series follows this mail to make it easier for people to respond to specific pieces of the code during review. I'm still making the entire patch set available through git, though.

The changes are mostly small and isolated, so there isn't much change from the previous version:

Version 4: 26 files changed, 2351 insertions(+), 510 deletions(-)
Version 3: 28 files changed, 2366 insertions(+), 506 deletions(-)
Version 2: 22 files changed, 2188 insertions(+), 377 deletions(-)
Version 1: 19 files changed, 2594 insertions(+), 580 deletions(-)

Changes for V4:
	o fixes duplicate transaction IDs on rolling transactions (new commit)
	o folded in a busy extent freeing cleanup from Christoph Hellwig
	o made API prefix consistent (xfs_alloc_busy_*)
	o combined xfs_alloc_mark_busy and xfs_alloc_busy_insert
	o reverted back to tracking transaction pointers in busy extents
	o removed exporting of transaction ID for busy extents
	o fixed an off-by-one in the extent range match in the busy extent search code that has been triggering assert failures
	o use list_splice_init() when splicing busy extents from the transaction to the checkpoint context to ensure we don't get transactions thinking they have busy extents to free after we've already attached them to the checkpoint.
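For readers unfamiliar with the last V4 item: list_splice_init() moves every entry from one list onto another and then reinitialises the source list so that it reads as empty. The sketch below reimplements that behaviour in userspace C under hypothetical demo_* names. It is not the kernel's list.h, and it is only meant to show why the source list cannot be mistaken for a populated one afterwards.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list, in the style of struct list_head. */
struct demo_list {
	struct demo_list *next, *prev;
};

static void demo_list_init(struct demo_list *head)
{
	head->next = head;
	head->prev = head;
}

static bool demo_list_empty(const struct demo_list *head)
{
	return head->next == head;
}

static void demo_list_add_tail(struct demo_list *item, struct demo_list *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/*
 * Move every entry of src to the front of dst, then reinitialise src.
 * Without the final init, src would still point at entries that now
 * belong to dst, and its owner could believe it still has work to do.
 */
static void demo_list_splice_init(struct demo_list *src, struct demo_list *dst)
{
	struct demo_list *first, *last;

	if (demo_list_empty(src))
		return;

	first = src->next;
	last = src->prev;

	first->prev = dst;
	last->next = dst->next;
	dst->next->prev = last;
	dst->next = first;

	demo_list_init(src);
}
```

In terms of the change described above, src would play the role of the transaction's busy extent list and dst the checkpoint context's list; after the splice the transaction's list is empty, so it has nothing left to free.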
Changes for V3:
	o changed buffer log item reference counted model to be consistent for both logging modes
	o cleaned up XFS_BLI flags usage (new commit)
	o separated out log ticket overrun printing cleanup (new commit)
	o made sure "delaylog" option shows up in /proc/mounts
	o collapsed many of the intermediate commits together to make it easier to review
	o fixed inode buffer tagging issue that was causing shutdowns in log recovery in test 087 and 121

Changes for V2:
	o 22 files changed, 2188 insertions(+), 377 deletions(-)
	o fixed some memory leaks
	o fixed ticket allocation for checkpoints to use KM_NOFS
	o minor code cleanups
	o performed stress and scalability testing

The following changes since commit 6ff75b78182c314112c1173edaab6c164c95d775:
  Christoph Hellwig (1):
        xfs: mark xfs_iomap_write_ helpers static

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfs.git delayed-logging

Dave Chinner (11):
      xfs: Don't reuse the same transaciton ID for duplicated transactions.
xfs: Improve scalability of busy extent tracking xfs: allow log ticket allocation to take allocation flags xfs: modify buffer item reference counting V2 xfs: Clean up XFS_BLI_* flag namespace xfs: clean up log ticket overrun debug output xfs: Delayed logging design documentation xfs: Introduce delayed logging core code xfs: forced unmounts need to push the CIL xfs: enable background pushing of the CIL xfs: Ensure inode allocation buffers are fully replayed .../filesystems/xfs-delayed-logging-design.txt | 819 ++++++++++++++++++++ fs/xfs/Makefile | 1 + fs/xfs/linux-2.6/xfs_buf.c | 11 +- fs/xfs/linux-2.6/xfs_quotaops.c | 1 + fs/xfs/linux-2.6/xfs_super.c | 12 +- fs/xfs/linux-2.6/xfs_trace.h | 80 ++- fs/xfs/quota/xfs_dquot.c | 6 +- fs/xfs/xfs_ag.h | 21 +- fs/xfs/xfs_alloc.c | 276 ++++--- fs/xfs/xfs_alloc.h | 7 +- fs/xfs/xfs_alloc_btree.c | 2 +- fs/xfs/xfs_buf_item.c | 166 ++-- fs/xfs/xfs_buf_item.h | 18 +- fs/xfs/xfs_error.c | 2 +- fs/xfs/xfs_log.c | 116 ++- fs/xfs/xfs_log.h | 10 +- fs/xfs/xfs_log_cil.c | 733 ++++++++++++++++++ fs/xfs/xfs_log_priv.h | 116 +++- fs/xfs/xfs_log_recover.c | 46 +- fs/xfs/xfs_log_recover.h | 2 +- fs/xfs/xfs_mount.h | 1 + fs/xfs/xfs_trans.c | 195 ++++- fs/xfs/xfs_trans.h | 44 +- fs/xfs/xfs_trans_buf.c | 46 +- fs/xfs/xfs_trans_item.c | 114 +--- fs/xfs/xfs_trans_priv.h | 16 +- 26 files changed, 2351 insertions(+), 510 deletions(-) create mode 100644 Documentation/filesystems/xfs-delayed-logging-design.txt create mode 100644 fs/xfs/xfs_log_cil.c From SRS0+8lZF+65+fromorbit.com=dave@internode.on.net Wed May 5 20:44:01 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o461i1ek065016 for ; Wed, 5 May 2010 20:44:01 -0500 X-ASG-Debug-ID: 1273110369-4479034d0000-NocioJ X-Barracuda-URL: 
http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id CFECB1B0A17B for ; Wed, 5 May 2010 18:46:09 -0700 (PDT) Received: from mail.internode.on.net (bld-mail19.adl2.internode.on.net [150.101.137.104]) by cuda.sgi.com with ESMTP id oEIuuqgC2UrMJLrl for ; Wed, 05 May 2010 18:46:09 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23118559-1927428 for ; Thu, 06 May 2010 11:16:08 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1O9qAB-0005Dv-H3 for xfs@oss.sgi.com; Thu, 06 May 2010 11:46:07 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1O9qA5-0000cc-OW for xfs@oss.sgi.com; Thu, 06 May 2010 11:46:01 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 09/11] xfs: forced unmounts need to push the CIL Subject: [PATCH 09/11] xfs: forced unmounts need to push the CIL Date: Thu, 6 May 2010 11:45:49 +1000 Message-Id: <1273110351-2333-10-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273110351-2333-1-git-send-email-david@fromorbit.com> References: <1273110351-2333-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: bld-mail19.adl2.internode.on.net[150.101.137.104] X-Barracuda-Start-Time: 1273110370 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0001 1.0000 -2.0206 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29166 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: 
Dave Chinner If the filesystem is being shut down and there is no log error, the current code forces out the current log buffers. This code now needs to push the CIL before it forces out the log buffers to achieve the same result. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log.c | 15 +++++++++++++++ 1 files changed, 15 insertions(+), 0 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 88cdfac..7aabd79 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -3684,6 +3684,11 @@ xlog_state_ioerror( * c. nothing new gets queued up after (a) and (b) are done. * d. if !logerror, flush the iclogs to disk, then seal them off * for business. + * + * Note: for delayed logging the !logerror case needs to flush the regions + * held in memory out to the iclogs before flushing them to disk. This needs + * to be done before the log is marked as shutdown, otherwise the flush to the + * iclogs will fail. */ int xfs_log_force_umount( @@ -3717,6 +3722,16 @@ xfs_log_force_umount( return 1; } retval = 0; + + /* + * Flush the in memory commit item list before marking the log as + * being shut down. We need to do it in this order to ensure all the + * completed transactions are flushed to disk with the xfs_log_force() + * call below. 
+ */ + if (!logerror && (mp->m_flags & XFS_MOUNT_DELAYLOG)) + xlog_cil_push(log, 1); + /* * We must hold both the GRANT lock and the LOG lock, * before we mark the filesystem SHUTDOWN and wake -- 1.5.6.5 From SRS0+o8Pk+65+fromorbit.com=dave@internode.on.net Wed May 5 20:44:01 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.7 required=5.0 tests=BAYES_00,J_CHICKENPOX_63, J_CHICKENPOX_64 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o461i08w065010 for ; Wed, 5 May 2010 20:44:01 -0500 X-ASG-Debug-ID: 1273110367-742e00c80000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 2609B1B0A17A for ; Wed, 5 May 2010 18:46:07 -0700 (PDT) Received: from mail.internode.on.net (bld-mail17.adl2.internode.on.net [150.101.137.102]) by cuda.sgi.com with ESMTP id hOUitqjtS9RJ9bjv for ; Wed, 05 May 2010 18:46:07 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23089826-1927428 for ; Thu, 06 May 2010 11:16:04 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1O9q9w-0005DM-7d for xfs@oss.sgi.com; Thu, 06 May 2010 11:45:52 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1O9q9v-0000cH-82 for xfs@oss.sgi.com; Thu, 06 May 2010 11:45:51 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 02/11] xfs: Improve scalability of busy extent tracking Subject: [PATCH 02/11] xfs: Improve scalability of busy extent tracking Date: Thu, 6 May 2010 11:45:42 +1000 Message-Id: <1273110351-2333-3-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273110351-2333-1-git-send-email-david@fromorbit.com> 
References: <1273110351-2333-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: bld-mail17.adl2.internode.on.net[150.101.137.102] X-Barracuda-Start-Time: 1273110369 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.52 X-Barracuda-Spam-Status: No, SCORE=-1.52 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=BSF_RULE7568M X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29166 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.50 BSF_RULE7568M Custom Rule 7568M X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean When we free a metadata extent, we record it in the per-AG busy extent array so that it is not re-used before the freeing transaction hits the disk. This array is fixed size, so when it overflows we make further allocation transactions synchronous because we cannot track more freed extents until those transactions hit the disk and are completed. Under heavy mixed allocation and freeing workloads with large log buffers, we can overflow this array quite easily. Further, the array is sparsely populated, which means that inserts need to search for a free slot, and array searches often have to search many more slots than are actually used to check all the busy extents. Quite inefficient, really. To enable this aspect of extent freeing to scale better, we need a structure that can grow dynamically. While in other areas of XFS we have used radix trees, the extents being freed are at random locations on disk so they are better suited to being indexed by an rbtree. So, use a per-AG rbtree indexed by block number to track busy extents. This incurs a memory allocation when marking an extent busy, but should not occur too often in low memory situations. 
This should scale to an arbitrary number of extents, so it should not be a limitation for features such as in-memory aggregation of transactions. However, there are still situations where we can't avoid allocating busy extents (such as allocation from the AGFL). To minimise the overhead of such occurrences, we need to avoid doing a synchronous log force while holding the AGF locked to ensure that the previous transactions are safely on disk before we use the extent. We can do this by marking the transaction doing the allocation as synchronous rather than issuing a log force. Because of the locking involved and the ordering of transactions, the synchronous transaction provides the same guarantees as a synchronous log force because it ensures that all the prior transactions are already on disk when the synchronous transaction hits the disk, i.e. it preserves the free->allocate order of the extent correctly in recovery. By doing this, we avoid holding the AGF locked while log writes are in progress, hence reducing the length of time the lock is held and therefore we increase the rate at which we can allocate and free from the allocation group, thereby increasing overall throughput. The only problem with this approach is that when a metadata buffer is marked stale (e.g. a directory block is removed), the buffer remains pinned and locked until the log goes to disk. The issue here is that if that stale buffer is reallocated in a subsequent transaction, the attempt to lock that buffer in the transaction will hang waiting for the log to go to disk to unlock and unpin the buffer. Hence if someone tries to lock a pinned, stale, locked buffer we need to push on the log to get it unlocked ASAP. Effectively we are trading off a guaranteed log force for a much less common trigger for a log force to occur. Ideally we should not reallocate busy extents. That is a much more complex fix to the problem as it involves direct intervention in the allocation btree searches in many places. 
This is left to a future set of modifications. Finally, now that we track busy extents in allocated memory, we don't need the descriptors in the transaction structure to point to them. We can replace the complex busy chunk infrastructure with a simple linked list of busy extents. This allows us to remove a large chunk of code, making the overall change a net reduction in code size. Signed-off-by: Dave Chinner --- fs/xfs/linux-2.6/xfs_buf.c | 9 ++ fs/xfs/linux-2.6/xfs_quotaops.c | 1 + fs/xfs/linux-2.6/xfs_trace.h | 80 ++++++++---- fs/xfs/xfs_ag.h | 21 ++-- fs/xfs/xfs_alloc.c | 276 ++++++++++++++++++++++++--------------- fs/xfs/xfs_alloc.h | 7 +- fs/xfs/xfs_alloc_btree.c | 2 +- fs/xfs/xfs_trans.c | 41 ++----- fs/xfs/xfs_trans.h | 35 +----- fs/xfs/xfs_trans_item.c | 109 --------------- fs/xfs/xfs_trans_priv.h | 4 - 11 files changed, 265 insertions(+), 320 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c index 6873afc..82678bf 100644 --- a/fs/xfs/linux-2.6/xfs_buf.c +++ b/fs/xfs/linux-2.6/xfs_buf.c @@ -37,6 +37,7 @@ #include "xfs_sb.h" #include "xfs_inum.h" +#include "xfs_log.h" #include "xfs_ag.h" #include "xfs_dmapi.h" #include "xfs_mount.h" @@ -850,6 +851,12 @@ xfs_buf_lock_value( * Note that this in no way locks the underlying pages, so it is only * useful for synchronizing concurrent use of buffer objects, not for * synchronizing independent access to the underlying pages. + * + * If we come across a stale, pinned, locked buffer, we know that we + * are being asked to lock a buffer that has been reallocated. Because + * it is pinned, we know that the log has not been pushed to disk and + * hence it will still be locked. Rather than sleeping until someone + * else pushes the log, push it ourselves before trying to get the lock. 
*/ void xfs_buf_lock( @@ -857,6 +864,8 @@ xfs_buf_lock( { trace_xfs_buf_lock(bp, _RET_IP_); + if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE)) + xfs_log_force(bp->b_mount, 0); if (atomic_read(&bp->b_io_remaining)) blk_run_address_space(bp->b_target->bt_mapping); down(&bp->b_sema); diff --git a/fs/xfs/linux-2.6/xfs_quotaops.c b/fs/xfs/linux-2.6/xfs_quotaops.c index 1947514..2e73688 100644 --- a/fs/xfs/linux-2.6/xfs_quotaops.c +++ b/fs/xfs/linux-2.6/xfs_quotaops.c @@ -19,6 +19,7 @@ #include "xfs_dmapi.h" #include "xfs_sb.h" #include "xfs_inum.h" +#include "xfs_log.h" #include "xfs_ag.h" #include "xfs_mount.h" #include "xfs_quota.h" diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h index 8a319cf..0934a27 100644 --- a/fs/xfs/linux-2.6/xfs_trace.h +++ b/fs/xfs/linux-2.6/xfs_trace.h @@ -1059,83 +1059,109 @@ TRACE_EVENT(xfs_bunmap, ); +#define XFS_BUSY_SYNC \ + { 0, "async" }, \ + { 1, "sync" } + TRACE_EVENT(xfs_alloc_busy, - TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno, - xfs_extlen_t len, int slot), - TP_ARGS(mp, agno, agbno, len, slot), + TP_PROTO(struct xfs_trans *trans, xfs_agnumber_t agno, + xfs_agblock_t agbno, xfs_extlen_t len, int sync), + TP_ARGS(trans, agno, agbno, len, sync), TP_STRUCT__entry( __field(dev_t, dev) + __field(struct xfs_trans *, tp) __field(xfs_agnumber_t, agno) __field(xfs_agblock_t, agbno) __field(xfs_extlen_t, len) - __field(int, slot) + __field(int, sync) ), TP_fast_assign( - __entry->dev = mp->m_super->s_dev; + __entry->dev = trans->t_mountp->m_super->s_dev; + __entry->tp = trans; __entry->agno = agno; __entry->agbno = agbno; __entry->len = len; - __entry->slot = slot; + __entry->sync = sync; ), - TP_printk("dev %d:%d agno %u agbno %u len %u slot %d", + TP_printk("dev %d:%d trans 0x%p agno %u agbno %u len %u %s", MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->tp, __entry->agno, __entry->agbno, __entry->len, - __entry->slot) + __print_symbolic(__entry->sync, 
XFS_BUSY_SYNC)) ); -#define XFS_BUSY_STATES \ - { 0, "found" }, \ - { 1, "missing" } - TRACE_EVENT(xfs_alloc_unbusy, TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, - int slot, int found), - TP_ARGS(mp, agno, slot, found), + xfs_agblock_t agbno, xfs_extlen_t len), + TP_ARGS(mp, agno, agbno, len), TP_STRUCT__entry( __field(dev_t, dev) __field(xfs_agnumber_t, agno) - __field(int, slot) - __field(int, found) + __field(xfs_agblock_t, agbno) + __field(xfs_extlen_t, len) ), TP_fast_assign( __entry->dev = mp->m_super->s_dev; __entry->agno = agno; - __entry->slot = slot; - __entry->found = found; + __entry->agbno = agbno; + __entry->len = len; ), - TP_printk("dev %d:%d agno %u slot %d %s", + TP_printk("dev %d:%d agno %u agbno %u len %u", MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno, - __entry->slot, - __print_symbolic(__entry->found, XFS_BUSY_STATES)) + __entry->agbno, + __entry->len) ); +#define XFS_BUSY_STATES \ + { 0, "missing" }, \ + { 1, "found" } + TRACE_EVENT(xfs_alloc_busysearch, - TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno, - xfs_extlen_t len, xfs_lsn_t lsn), - TP_ARGS(mp, agno, agbno, len, lsn), + TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, + xfs_agblock_t agbno, xfs_extlen_t len, int found), + TP_ARGS(mp, agno, agbno, len, found), TP_STRUCT__entry( __field(dev_t, dev) __field(xfs_agnumber_t, agno) __field(xfs_agblock_t, agbno) __field(xfs_extlen_t, len) - __field(xfs_lsn_t, lsn) + __field(int, found) ), TP_fast_assign( __entry->dev = mp->m_super->s_dev; __entry->agno = agno; __entry->agbno = agbno; __entry->len = len; - __entry->lsn = lsn; + __entry->found = found; ), - TP_printk("dev %d:%d agno %u agbno %u len %u force lsn 0x%llx", + TP_printk("dev %d:%d agno %u agbno %u len %u %s", MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno, __entry->agbno, __entry->len, + __print_symbolic(__entry->found, XFS_BUSY_STATES)) +); + +TRACE_EVENT(xfs_trans_commit_lsn, + TP_PROTO(struct xfs_trans *trans), + 
TP_ARGS(trans), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(struct xfs_trans *, tp) + __field(xfs_lsn_t, lsn) + ), + TP_fast_assign( + __entry->dev = trans->t_mountp->m_super->s_dev; + __entry->tp = trans; + __entry->lsn = trans->t_commit_lsn; + ), + TP_printk("dev %d:%d trans 0x%p commit_lsn 0x%llx", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->tp, __entry->lsn) ); diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h index abb8222..3972018 100644 --- a/fs/xfs/xfs_ag.h +++ b/fs/xfs/xfs_ag.h @@ -175,14 +175,17 @@ typedef struct xfs_agfl { } xfs_agfl_t; /* - * Busy block/extent entry. Used in perag to mark blocks that have been freed - * but whose transactions aren't committed to disk yet. + * Busy block/extent entry. Indexed by a rbtree in perag to mark blocks that + * have been freed but whose transactions aren't committed to disk yet. */ -typedef struct xfs_perag_busy { - xfs_agblock_t busy_start; - xfs_extlen_t busy_length; - struct xfs_trans *busy_tp; /* transaction that did the free */ -} xfs_perag_busy_t; +struct xfs_busy_extent { + struct rb_node rb_node; /* ag by-bno indexed search tree */ + struct list_head list; /* transaction busy extent list */ + xfs_agnumber_t agno; + xfs_agblock_t bno; + xfs_extlen_t length; + struct xfs_trans *tp; +}; /* * Per-ag incore structure, copies of information in agf and agi, @@ -216,7 +219,8 @@ typedef struct xfs_perag { xfs_agino_t pagl_leftrec; xfs_agino_t pagl_rightrec; #ifdef __KERNEL__ - spinlock_t pagb_lock; /* lock for pagb_list */ + spinlock_t pagb_lock; /* lock for pagb_tree */ + struct rb_root pagb_tree; /* ordered tree of busy extents */ atomic_t pagf_fstrms; /* # of filestreams active in this AG */ @@ -226,7 +230,6 @@ typedef struct xfs_perag { int pag_ici_reclaimable; /* reclaimable inodes */ #endif int pagb_count; /* pagb slots in use */ - xfs_perag_busy_t pagb_list[XFS_PAGB_NUM_SLOTS]; /* unstable blocks */ } xfs_perag_t; /* diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c index 
94cddbf..673a526 100644 --- a/fs/xfs/xfs_alloc.c +++ b/fs/xfs/xfs_alloc.c @@ -46,11 +46,9 @@ #define XFSA_FIXUP_BNO_OK 1 #define XFSA_FIXUP_CNT_OK 2 -STATIC void -xfs_alloc_search_busy(xfs_trans_t *tp, - xfs_agnumber_t agno, - xfs_agblock_t bno, - xfs_extlen_t len); +static int +xfs_alloc_busy_search(struct xfs_mount *mp, xfs_agnumber_t agno, + xfs_agblock_t bno, xfs_extlen_t len); /* * Prototypes for per-ag allocation routines @@ -540,9 +538,16 @@ xfs_alloc_ag_vextent( be32_to_cpu(agf->agf_length)); xfs_alloc_log_agf(args->tp, args->agbp, XFS_AGF_FREEBLKS); - /* search the busylist for these blocks */ - xfs_alloc_search_busy(args->tp, args->agno, - args->agbno, args->len); + /* + * Search the busylist for these blocks and mark the + * transaction as synchronous if blocks are found. This + * avoids the need to block in due to a synchronous log + * force to ensure correct ordering as the synchronous + * transaction will guarantee that for us. + */ + if (xfs_alloc_busy_search(args->mp, args->agno, + args->agbno, args->len)) + xfs_trans_set_sync(args->tp); } if (!args->isfl) xfs_trans_mod_sb(args->tp, @@ -1693,7 +1698,7 @@ xfs_free_ag_extent( * when the iclog commits to disk. If a busy block is allocated, * the iclog is pushed up to the LSN that freed the block. */ - xfs_alloc_mark_busy(tp, agno, bno, len); + xfs_alloc_busy_insert(tp, agno, bno, len); return 0; error0: @@ -1993,10 +1998,17 @@ xfs_alloc_get_freelist( * and remain there until the freeing transaction is committed to * disk. Now that we have allocated blocks, this list must be * searched to see if a block is being reused. If one is, then - * the freeing transaction must be pushed to disk NOW by forcing - * to disk all iclogs up that transaction's LSN. - */ - xfs_alloc_search_busy(tp, be32_to_cpu(agf->agf_seqno), bno, 1); + * the freeing transaction must be pushed to disk before this + * transaction. 
+ * + * We do this by setting the current transaction + * to a sync transaction which guarantees that the freeing transaction + * is on disk before this transaction. This is done instead of a + * synchronous log force here so that we don't sit and wait with + * the AGF locked in the transaction during the log force. + */ + if (xfs_alloc_busy_search(mp, be32_to_cpu(agf->agf_seqno), bno, 1)) + xfs_trans_set_sync(tp); return 0; } @@ -2201,7 +2213,7 @@ xfs_alloc_read_agf( be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]); spin_lock_init(&pag->pagb_lock); pag->pagb_count = 0; - memset(pag->pagb_list, 0, sizeof(pag->pagb_list)); + pag->pagb_tree = RB_ROOT; pag->pagf_init = 1; } #ifdef DEBUG @@ -2479,127 +2491,185 @@ error0: * list is reused, the transaction that freed it must be forced to disk * before continuing to use the block. * - * xfs_alloc_mark_busy - add to the per-ag busy list - * xfs_alloc_clear_busy - remove an item from the per-ag busy list + * xfs_alloc_busy_insert - add to the per-ag busy list + * xfs_alloc_busy_clear - remove an item from the per-ag busy list + * xfs_alloc_busy_search - search for a busy extent */ + void -xfs_alloc_mark_busy(xfs_trans_t *tp, - xfs_agnumber_t agno, - xfs_agblock_t bno, - xfs_extlen_t len) +xfs_alloc_busy_insert( + struct xfs_trans *tp, + xfs_agnumber_t agno, + xfs_agblock_t bno, + xfs_extlen_t len) { - xfs_perag_busy_t *bsy; + struct xfs_busy_extent *new; + struct xfs_busy_extent *busyp; struct xfs_perag *pag; - int n; - - pag = xfs_perag_get(tp->t_mountp, agno); - spin_lock(&pag->pagb_lock); + struct rb_node **rbp; + struct rb_node *parent; - /* search pagb_list for an open slot */ - for (bsy = pag->pagb_list, n = 0; - n < XFS_PAGB_NUM_SLOTS; - bsy++, n++) { - if (bsy->busy_tp == NULL) { - break; - } - } - trace_xfs_alloc_busy(tp->t_mountp, agno, bno, len, n); - - if (n < XFS_PAGB_NUM_SLOTS) { - bsy = &pag->pagb_list[n]; - pag->pagb_count++; - bsy->busy_start = bno; - bsy->busy_length = len; - bsy->busy_tp = tp; - 
xfs_trans_add_busy(tp, agno, n); - } else { + new = kmem_zalloc(sizeof(struct xfs_busy_extent), KM_MAYFAIL); + if (!new) { /* - * The busy list is full! Since it is now not possible to - * track the free block, make this a synchronous transaction - * to insure that the block is not reused before this - * transaction commits. + * No Memory! Since it is now not possible to track the free + * block, make this a synchronous transaction to insure that + * the block is not reused before this transaction commits. */ + trace_xfs_alloc_busy(tp, agno, bno, len, 1); xfs_trans_set_sync(tp); + return; } - spin_unlock(&pag->pagb_lock); - xfs_perag_put(pag); -} + new->agno = agno; + new->bno = bno; + new->length = len; + new->tp = tp; -void -xfs_alloc_clear_busy(xfs_trans_t *tp, - xfs_agnumber_t agno, - int idx) -{ - struct xfs_perag *pag; - xfs_perag_busy_t *list; + INIT_LIST_HEAD(&new->list); - ASSERT(idx < XFS_PAGB_NUM_SLOTS); - pag = xfs_perag_get(tp->t_mountp, agno); - spin_lock(&pag->pagb_lock); - list = pag->pagb_list; + /* trace before insert to be able to see failed inserts */ + trace_xfs_alloc_busy(tp, agno, bno, len, 0); - trace_xfs_alloc_unbusy(tp->t_mountp, agno, idx, list[idx].busy_tp == tp); + pag = xfs_perag_get(tp->t_mountp, new->agno); +restart: + spin_lock(&pag->pagb_lock); + rbp = &pag->pagb_tree.rb_node; + parent = NULL; + while (*rbp) { + parent = *rbp; + busyp = rb_entry(parent, struct xfs_busy_extent, rb_node); + + if (new->bno < busyp->bno) + rbp = &(*rbp)->rb_left; + else if (new->bno > busyp->bno) + rbp = &(*rbp)->rb_right; + else { - if (list[idx].busy_tp == tp) { - list[idx].busy_tp = NULL; - pag->pagb_count--; + /* + * We're trying to reuse an already busy extent? + * + * That means the transaction that marked it busy must + * still be committing, but we are freeing it again + * here. 
This could be the same transaction (btree + * manipulations may allocate and free blocks + * multiple times in a transaction), so if it is make + * sure the transaction is marked synchronous already + * and update the length of the busy extent to match + * the new range being freed. + * + * If it is not the same transaction, then we need to + * wait for the transaction that marked this extent + * busy to complete. I don't think we can avoid a log + * force in this case. We can't rely on the contents of + * the transaction pointer in busyp, so we have to force + * everything. + * + * Note that we do not use the transaction structure + * for identifying equal transactions. This is because + * there is the possibility of transaction structures + * being reallocated from the slab after being freed + * and triggering false detections here. Hence use the + * transaction ticket ID to determine if it is the same + * transaction. + */ + if (busyp->tp != tp) { + spin_unlock(&pag->pagb_lock); + xfs_log_force(tp->t_mountp, XFS_LOG_SYNC); + goto restart; + } + busyp->length = max(busyp->length, new->length); + spin_unlock(&pag->pagb_lock); + ASSERT(tp->t_flags & XFS_TRANS_SYNC); + xfs_perag_put(pag); + kmem_free(new); + return; + } } + rb_link_node(&new->rb_node, parent, rbp); + rb_insert_color(&new->rb_node, &pag->pagb_tree); + + list_add(&new->list, &tp->t_busy); spin_unlock(&pag->pagb_lock); xfs_perag_put(pag); } - /* - * If we find the extent in the busy list, force the log out to get the - * extent out of the busy list so the caller can use it straight away. + * Search for a busy extent within the range of the extent we are about to + * allocate. You need to be holding the busy extent tree lock when calling + * xfs_alloc_busy_search(). This function returns 0 for no overlapping busy + * extent, -1 for an overlapping but not exact busy extent, and 1 for an exact + * match.
This is done so that a non-zero return indicates an overlap that + * will require a synchronous transaction, but it can still be + * used to distinguish between a partial or exact match. */ -STATIC void -xfs_alloc_search_busy(xfs_trans_t *tp, - xfs_agnumber_t agno, - xfs_agblock_t bno, - xfs_extlen_t len) +static int +xfs_alloc_busy_search( + struct xfs_mount *mp, + xfs_agnumber_t agno, + xfs_agblock_t bno, + xfs_extlen_t len) { struct xfs_perag *pag; - xfs_perag_busy_t *bsy; + struct rb_node *rbp; xfs_agblock_t uend, bend; - xfs_lsn_t lsn = 0; - int cnt; + struct xfs_busy_extent *busyp; + int match = 0; - pag = xfs_perag_get(tp->t_mountp, agno); + pag = xfs_perag_get(mp, agno); spin_lock(&pag->pagb_lock); - cnt = pag->pagb_count; - /* - * search pagb_list for this slot, skipping open slots. We have to - * search the entire array as there may be multiple overlaps and - * we have to get the most recent LSN for the log force to push out - * all the transactions that span the range. - */ uend = bno + len - 1; - for (cnt = 0; cnt < pag->pagb_count; cnt++) { - bsy = &pag->pagb_list[cnt]; - if (!bsy->busy_tp) - continue; - - bend = bsy->busy_start + bsy->busy_length - 1; - if (bno > bend || uend < bsy->busy_start) - continue; - - /* (start1,length1) within (start2, length2) */ - if (XFS_LSN_CMP(bsy->busy_tp->t_commit_lsn, lsn) > 0) - lsn = bsy->busy_tp->t_commit_lsn; + rbp = pag->pagb_tree.rb_node; + + /* find closest start bno overlap */ + while (rbp) { + busyp = rb_entry(rbp, struct xfs_busy_extent, rb_node); + bend = busyp->bno + busyp->length - 1; + if (bno < busyp->bno) { + /* may overlap, but exact start block is lower */ + if (uend >= busyp->bno) + match = -1; + rbp = rbp->rb_left; + } else if (bno > busyp->bno) { + /* may overlap, but exact start block is higher */ + if (bno <= bend) + match = -1; + rbp = rbp->rb_right; + } else { + /* bno matches busyp, length determines exact match */ + match = (busyp->length == len) ? 
1 : -1; + break; + } } spin_unlock(&pag->pagb_lock); + trace_xfs_alloc_busysearch(mp, agno, bno, len, !!match); xfs_perag_put(pag); - trace_xfs_alloc_busysearch(tp->t_mountp, agno, bno, len, lsn); + return match; +} - /* - * If a block was found, force the log through the LSN of the - * transaction that freed the block - */ - if (lsn) - xfs_log_force_lsn(tp->t_mountp, lsn, XFS_LOG_SYNC); +void +xfs_alloc_busy_clear( + struct xfs_mount *mp, + struct xfs_busy_extent *busyp) +{ + struct xfs_perag *pag; + + trace_xfs_alloc_unbusy(mp, busyp->agno, busyp->bno, + busyp->length); + + ASSERT(xfs_alloc_busy_search(mp, busyp->agno, busyp->bno, + busyp->length) == 1); + + list_del_init(&busyp->list); + + pag = xfs_perag_get(mp, busyp->agno); + spin_lock(&pag->pagb_lock); + rb_erase(&busyp->rb_node, &pag->pagb_tree); + spin_unlock(&pag->pagb_lock); + xfs_perag_put(pag); + + kmem_free(busyp); } diff --git a/fs/xfs/xfs_alloc.h b/fs/xfs/xfs_alloc.h index 599bffa..6d05199 100644 --- a/fs/xfs/xfs_alloc.h +++ b/fs/xfs/xfs_alloc.h @@ -22,6 +22,7 @@ struct xfs_buf; struct xfs_mount; struct xfs_perag; struct xfs_trans; +struct xfs_busy_extent; /* * Freespace allocation types. Argument to xfs_alloc_[v]extent. @@ -119,15 +120,13 @@ xfs_alloc_longest_free_extent(struct xfs_mount *mp, #ifdef __KERNEL__ void -xfs_alloc_mark_busy(xfs_trans_t *tp, +xfs_alloc_busy_insert(xfs_trans_t *tp, xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len); void -xfs_alloc_clear_busy(xfs_trans_t *tp, - xfs_agnumber_t ag, - int idx); +xfs_alloc_busy_clear(struct xfs_mount *mp, struct xfs_busy_extent *busyp); #endif /* __KERNEL__ */ diff --git a/fs/xfs/xfs_alloc_btree.c b/fs/xfs/xfs_alloc_btree.c index b726e10..83f4942 100644 --- a/fs/xfs/xfs_alloc_btree.c +++ b/fs/xfs/xfs_alloc_btree.c @@ -134,7 +134,7 @@ xfs_allocbt_free_block( * disk. If a busy block is allocated, the iclog is pushed up to the * LSN that freed the block. 
*/ - xfs_alloc_mark_busy(cur->bc_tp, be32_to_cpu(agf->agf_seqno), bno, 1); + xfs_alloc_busy_insert(cur->bc_tp, be32_to_cpu(agf->agf_seqno), bno, 1); xfs_trans_agbtree_delta(cur->bc_tp, -1); return 0; } diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index be578ec..40d9595 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -44,6 +44,7 @@ #include "xfs_trans_priv.h" #include "xfs_trans_space.h" #include "xfs_inode_item.h" +#include "xfs_trace.h" kmem_zone_t *xfs_trans_zone; @@ -243,9 +244,8 @@ _xfs_trans_alloc( tp->t_type = type; tp->t_mountp = mp; tp->t_items_free = XFS_LIC_NUM_SLOTS; - tp->t_busy_free = XFS_LBC_NUM_SLOTS; xfs_lic_init(&(tp->t_items)); - XFS_LBC_INIT(&(tp->t_busy)); + INIT_LIST_HEAD(&tp->t_busy); return tp; } @@ -255,8 +255,13 @@ _xfs_trans_alloc( */ STATIC void xfs_trans_free( - xfs_trans_t *tp) + struct xfs_trans *tp) { + struct xfs_busy_extent *busyp, *n; + + list_for_each_entry_safe(busyp, n, &tp->t_busy, list) + xfs_alloc_busy_clear(tp->t_mountp, busyp); + atomic_dec(&tp->t_mountp->m_active_trans); xfs_trans_free_dqinfo(tp); kmem_zone_free(xfs_trans_zone, tp); @@ -285,9 +290,8 @@ xfs_trans_dup( ntp->t_type = tp->t_type; ntp->t_mountp = tp->t_mountp; ntp->t_items_free = XFS_LIC_NUM_SLOTS; - ntp->t_busy_free = XFS_LBC_NUM_SLOTS; xfs_lic_init(&(ntp->t_items)); - XFS_LBC_INIT(&(ntp->t_busy)); + INIT_LIST_HEAD(&ntp->t_busy); ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); ASSERT(tp->t_ticket != NULL); @@ -423,7 +427,6 @@ undo_blocks: return error; } - /* * Record the indicated change to the given field for application * to the file system's superblock when the transaction commits. 
@@ -930,26 +933,6 @@ xfs_trans_item_committed( IOP_UNPIN(lip); } -/* Clear all the per-AG busy list items listed in this transaction */ -static void -xfs_trans_clear_busy_extents( - struct xfs_trans *tp) -{ - xfs_log_busy_chunk_t *lbcp; - xfs_log_busy_slot_t *lbsp; - int i; - - for (lbcp = &tp->t_busy; lbcp != NULL; lbcp = lbcp->lbc_next) { - i = 0; - for (lbsp = lbcp->lbc_busy; i < lbcp->lbc_unused; i++, lbsp++) { - if (XFS_LBC_ISFREE(lbcp, i)) - continue; - xfs_alloc_clear_busy(tp, lbsp->lbc_ag, lbsp->lbc_idx); - } - } - xfs_trans_free_busy(tp); -} - /* * This is typically called by the LM when a transaction has been fully * committed to disk. It needs to unpin the items which have @@ -984,7 +967,6 @@ xfs_trans_committed( kmem_free(licp); } - xfs_trans_clear_busy_extents(tp); xfs_trans_free(tp); } @@ -1013,7 +995,6 @@ xfs_trans_uncommit( xfs_trans_unreserve_and_mod_dquots(tp); xfs_trans_free_items(tp, flags); - xfs_trans_free_busy(tp); xfs_trans_free(tp); } @@ -1075,6 +1056,8 @@ xfs_trans_commit_iclog( *commit_lsn = xfs_log_done(mp, tp->t_ticket, &commit_iclog, log_flags); tp->t_commit_lsn = *commit_lsn; + trace_xfs_trans_commit_lsn(tp); + if (nvec > XFS_TRANS_LOGVEC_COUNT) kmem_free(log_vector); @@ -1260,7 +1243,6 @@ out_unreserve: } current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS); xfs_trans_free_items(tp, error ? 
XFS_TRANS_ABORT : 0); - xfs_trans_free_busy(tp); xfs_trans_free(tp); XFS_STATS_INC(xs_trans_empty); @@ -1339,7 +1321,6 @@ xfs_trans_cancel( current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS); xfs_trans_free_items(tp, flags); - xfs_trans_free_busy(tp); xfs_trans_free(tp); } diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h index c62beee..ff7e9e6 100644 --- a/fs/xfs/xfs_trans.h +++ b/fs/xfs/xfs_trans.h @@ -813,6 +813,7 @@ struct xfs_log_item_desc; struct xfs_mount; struct xfs_trans; struct xfs_dquot_acct; +struct xfs_busy_extent; typedef struct xfs_log_item { struct list_head li_ail; /* AIL pointers */ @@ -872,34 +873,6 @@ typedef struct xfs_item_ops { #define XFS_ITEM_PUSHBUF 3 /* - * This structure is used to maintain a list of block ranges that have been - * freed in the transaction. The ranges are listed in the perag[] busy list - * between when they're freed and the transaction is committed to disk. - */ - -typedef struct xfs_log_busy_slot { - xfs_agnumber_t lbc_ag; - ushort lbc_idx; /* index in perag.busy[] */ -} xfs_log_busy_slot_t; - -#define XFS_LBC_NUM_SLOTS 31 -typedef struct xfs_log_busy_chunk { - struct xfs_log_busy_chunk *lbc_next; - uint lbc_free; /* free slots bitmask */ - ushort lbc_unused; /* first unused */ - xfs_log_busy_slot_t lbc_busy[XFS_LBC_NUM_SLOTS]; -} xfs_log_busy_chunk_t; - -#define XFS_LBC_MAX_SLOT (XFS_LBC_NUM_SLOTS - 1) -#define XFS_LBC_FREEMASK ((1U << XFS_LBC_NUM_SLOTS) - 1) - -#define XFS_LBC_INIT(cp) ((cp)->lbc_free = XFS_LBC_FREEMASK) -#define XFS_LBC_CLAIM(cp, slot) ((cp)->lbc_free &= ~(1 << (slot))) -#define XFS_LBC_SLOT(cp, slot) (&((cp)->lbc_busy[(slot)])) -#define XFS_LBC_VACANCY(cp) (((cp)->lbc_free) & XFS_LBC_FREEMASK) -#define XFS_LBC_ISFREE(cp, slot) ((cp)->lbc_free & (1 << (slot))) - -/* * This is the type of function which can be given to xfs_trans_callback() * to be called upon the transaction's commit to disk. 
*/ @@ -950,8 +923,7 @@ typedef struct xfs_trans { unsigned int t_items_free; /* log item descs free */ xfs_log_item_chunk_t t_items; /* first log item desc chunk */ xfs_trans_header_t t_header; /* header for in-log trans */ - unsigned int t_busy_free; /* busy descs free */ - xfs_log_busy_chunk_t t_busy; /* busy/async free blocks */ + struct list_head t_busy; /* list of busy extents */ unsigned long t_pflags; /* saved process flags state */ } xfs_trans_t; @@ -1025,9 +997,6 @@ int _xfs_trans_commit(xfs_trans_t *, void xfs_trans_cancel(xfs_trans_t *, int); int xfs_trans_ail_init(struct xfs_mount *); void xfs_trans_ail_destroy(struct xfs_mount *); -xfs_log_busy_slot_t *xfs_trans_add_busy(xfs_trans_t *tp, - xfs_agnumber_t ag, - xfs_extlen_t idx); extern kmem_zone_t *xfs_trans_zone; diff --git a/fs/xfs/xfs_trans_item.c b/fs/xfs/xfs_trans_item.c index eb3fc57..2937a1e 100644 --- a/fs/xfs/xfs_trans_item.c +++ b/fs/xfs/xfs_trans_item.c @@ -438,112 +438,3 @@ xfs_trans_unlock_chunk( return freed; } - - -/* - * This is called to add the given busy item to the transaction's - * list of busy items. It must find a free busy item descriptor - * or allocate a new one and add the item to that descriptor. - * The function returns a pointer to busy descriptor used to point - * to the new busy entry. The log busy entry will now point to its new - * descriptor with its ???? field. - */ -xfs_log_busy_slot_t * -xfs_trans_add_busy(xfs_trans_t *tp, xfs_agnumber_t ag, xfs_extlen_t idx) -{ - xfs_log_busy_chunk_t *lbcp; - xfs_log_busy_slot_t *lbsp; - int i=0; - - /* - * If there are no free descriptors, allocate a new chunk - * of them and put it at the front of the chunk list. - */ - if (tp->t_busy_free == 0) { - lbcp = (xfs_log_busy_chunk_t*) - kmem_alloc(sizeof(xfs_log_busy_chunk_t), KM_SLEEP); - ASSERT(lbcp != NULL); - /* - * Initialize the chunk, and then - * claim the first slot in the newly allocated chunk. 
- */ - XFS_LBC_INIT(lbcp); - XFS_LBC_CLAIM(lbcp, 0); - lbcp->lbc_unused = 1; - lbsp = XFS_LBC_SLOT(lbcp, 0); - - /* - * Link in the new chunk and update the free count. - */ - lbcp->lbc_next = tp->t_busy.lbc_next; - tp->t_busy.lbc_next = lbcp; - tp->t_busy_free = XFS_LIC_NUM_SLOTS - 1; - - /* - * Initialize the descriptor and the generic portion - * of the log item. - * - * Point the new slot at this item and return it. - * Also point the log item at its currently active - * descriptor and set the item's mount pointer. - */ - lbsp->lbc_ag = ag; - lbsp->lbc_idx = idx; - return lbsp; - } - - /* - * Find the free descriptor. It is somewhere in the chunklist - * of descriptors. - */ - lbcp = &tp->t_busy; - while (lbcp != NULL) { - if (XFS_LBC_VACANCY(lbcp)) { - if (lbcp->lbc_unused <= XFS_LBC_MAX_SLOT) { - i = lbcp->lbc_unused; - break; - } else { - /* out-of-order vacancy */ - cmn_err(CE_DEBUG, "OOO vacancy lbcp 0x%p\n", lbcp); - ASSERT(0); - } - } - lbcp = lbcp->lbc_next; - } - ASSERT(lbcp != NULL); - /* - * If we find a free descriptor, claim it, - * initialize it, and return it. 
- */ - XFS_LBC_CLAIM(lbcp, i); - if (lbcp->lbc_unused <= i) { - lbcp->lbc_unused = i + 1; - } - lbsp = XFS_LBC_SLOT(lbcp, i); - tp->t_busy_free--; - lbsp->lbc_ag = ag; - lbsp->lbc_idx = idx; - return lbsp; -} - - -/* - * xfs_trans_free_busy - * Free all of the busy lists from a transaction - */ -void -xfs_trans_free_busy(xfs_trans_t *tp) -{ - xfs_log_busy_chunk_t *lbcp; - xfs_log_busy_chunk_t *lbcq; - - lbcp = tp->t_busy.lbc_next; - while (lbcp != NULL) { - lbcq = lbcp->lbc_next; - kmem_free(lbcp); - lbcp = lbcq; - } - - XFS_LBC_INIT(&tp->t_busy); - tp->t_busy.lbc_unused = 0; -} diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h index 73e2ad3..901dc0f 100644 --- a/fs/xfs/xfs_trans_priv.h +++ b/fs/xfs/xfs_trans_priv.h @@ -38,10 +38,6 @@ struct xfs_log_item_desc *xfs_trans_next_item(struct xfs_trans *, void xfs_trans_free_items(struct xfs_trans *, int); void xfs_trans_unlock_items(struct xfs_trans *, xfs_lsn_t); -void xfs_trans_free_busy(xfs_trans_t *tp); -xfs_log_busy_slot_t *xfs_trans_add_busy(xfs_trans_t *tp, - xfs_agnumber_t ag, - xfs_extlen_t idx); /* * AIL traversal cursor. 
-- 1.5.6.5 From SRS0+o8Pk+65+fromorbit.com=dave@internode.on.net Wed May 5 20:44:07 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: * X-Spam-Status: No, score=1.1 required=5.0 tests=BAYES_00,SUBJ_TICKET autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o461i6wn065053 for ; Wed, 5 May 2010 20:44:07 -0500 X-ASG-Debug-ID: 1273110374-707401040000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 1E1B11B0A183 for ; Wed, 5 May 2010 18:46:14 -0700 (PDT) Received: from mail.internode.on.net (bld-mail17.adl2.internode.on.net [150.101.137.102]) by cuda.sgi.com with ESMTP id m7NJICAWh4k3PXd5 for ; Wed, 05 May 2010 18:46:14 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23089836-1927428 for ; Thu, 06 May 2010 11:16:13 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1O9qA6-0005Df-HH for xfs@oss.sgi.com; Thu, 06 May 2010 11:46:02 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1O9qA5-0000cT-H0 for xfs@oss.sgi.com; Thu, 06 May 2010 11:46:01 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 06/11] xfs: clean up log ticket overrun debug output Subject: [PATCH 06/11] xfs: clean up log ticket overrun debug output Date: Thu, 6 May 2010 11:45:46 +1000 Message-Id: <1273110351-2333-7-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273110351-2333-1-git-send-email-david@fromorbit.com> References: <1273110351-2333-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: bld-mail17.adl2.internode.on.net[150.101.137.102] X-Barracuda-Start-Time: 1273110376 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 
1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29166 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner Push the error message output when a ticket overrun is detected into the ticket printing functions. Also remove the debug version of the code as the production version will still panic just as effectively on a debug kernel via the panic mask being set. Signed-off-by: Dave Chinner --- fs/xfs/xfs_error.c | 2 +- fs/xfs/xfs_log.c | 19 +++++-------------- 2 files changed, 6 insertions(+), 15 deletions(-) diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c index ef96175..047b8a8 100644 --- a/fs/xfs/xfs_error.c +++ b/fs/xfs/xfs_error.c @@ -170,7 +170,7 @@ xfs_cmn_err(int panic_tag, int level, xfs_mount_t *mp, char *fmt, ...) va_list ap; #ifdef DEBUG - xfs_panic_mask |= XFS_PTAG_SHUTDOWN_CORRUPT; + xfs_panic_mask |= (XFS_PTAG_SHUTDOWN_CORRUPT | XFS_PTAG_LOGRES); #endif if (xfs_panic_mask && (xfs_panic_mask & panic_tag) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 83be6a6..1efb303 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -1645,6 +1645,10 @@ xlog_print_tic_res(xfs_mount_t *mp, xlog_ticket_t *ticket) "bad-rtype" : res_type_str[r_type-1]), ticket->t_res_arr[i].r_len); } + + xfs_cmn_err(XFS_PTAG_LOGRES, CE_ALERT, mp, + "xfs_log_write: reservation ran out. 
Need to up reservation"); + xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); } /* @@ -1897,21 +1901,8 @@ xlog_write( *start_lsn = 0; len = xlog_write_calc_vec_length(ticket, log_vector); - if (ticket->t_curr_res < len) { + if (ticket->t_curr_res < len) xlog_print_tic_res(log->l_mp, ticket); -#ifdef DEBUG - xlog_panic( - "xfs_log_write: reservation ran out. Need to up reservation"); -#else - /* Customer configurable panic */ - xfs_cmn_err(XFS_PTAG_LOGRES, CE_ALERT, log->l_mp, - "xfs_log_write: reservation ran out. Need to up reservation"); - - /* If we did not panic, shutdown the filesystem */ - xfs_force_shutdown(log->l_mp, SHUTDOWN_CORRUPT_INCORE); -#endif - } - ticket->t_curr_res -= len; index = 0; -- 1.5.6.5 From SRS0+QjcZ+65+fromorbit.com=dave@internode.on.net Wed May 5 20:44:01 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-2.3 required=5.0 tests=BAYES_00,J_CHICKENPOX_63, J_CHICKENPOX_66,LOCAL_GNU_PATCH,TVD_PH_BODY_ACCOUNTS_PRE autolearn=unavailable version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o461i1Sp065022 for ; Wed, 5 May 2010 20:44:01 -0500 X-ASG-Debug-ID: 1273110367-167902c70000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 712193139BA for ; Wed, 5 May 2010 18:46:07 -0700 (PDT) Received: from mail.internode.on.net (bld-mail16.adl2.internode.on.net [150.101.137.101]) by cuda.sgi.com with ESMTP id 0WQAsL6sRw9LigfH for ; Wed, 05 May 2010 18:46:07 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23226481-1927428 for ; Thu, 06 May 2010 11:16:05 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1O9qA6-0005Dj-Ln for xfs@oss.sgi.com; Thu, 06 
May 2010 11:46:02 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1O9qA5-0000cZ-LX for xfs@oss.sgi.com; Thu, 06 May 2010 11:46:01 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 08/11] xfs: Introduce delayed logging core code Subject: [PATCH 08/11] xfs: Introduce delayed logging core code Date: Thu, 6 May 2010 11:45:48 +1000 Message-Id: <1273110351-2333-9-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273110351-2333-1-git-send-email-david@fromorbit.com> References: <1273110351-2333-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: bld-mail16.adl2.internode.on.net[150.101.137.101] X-Barracuda-Start-Time: 1273110369 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.52 X-Barracuda-Spam-Status: No, SCORE=-1.52 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=BSF_RULE7568M X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29166 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.50 BSF_RULE7568M Custom Rule 7568M X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner The delayed logging code only changes in-memory structures and as such can be enabled and disabled with a mount option. Add the mount option and emit a warning that this is an experimental feature that should not be used in production yet. We also need infrastructure to track committed items that have not yet been written to the log. This is what the Committed Item List (CIL) is for. The log item also needs to be extended to track the current log vector, the associated memory buffer and its location in the Committed Item List. Extend the log item and log vector structures to enable this tracking.
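As a rough illustration of the per-item tracking described above, here is a minimal user-space sketch, loosely modelled on what xlog_cil_insert() does further down in this patch. All sketch_* names and the 12-byte op header size are assumptions made for the example; they are not the structures or constants the patch itself uses, and freeing the replaced vector is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the extended structures. */
struct sketch_log_vec {
	size_t	buf_len;	/* bytes of formatted data in the buffer */
	int	niovecs;	/* number of regions in the vector */
};

struct sketch_log_item {
	struct sketch_log_vec	*lv;		/* current vector, NULL if not in the CIL */
	int			pin_count;	/* number of pins taken on the item */
};

/* Assumed size of one per-region op header, for illustration only. */
#define SKETCH_OPHDR_SIZE	12

/*
 * Attach a freshly formatted vector to an item and return the change in
 * log space this commit is responsible for. The first commit of an item
 * pins it and pays for the whole vector; a relog of an item that is
 * already tracked only pays for the difference and is not pinned again.
 */
static long
sketch_item_attach(struct sketch_log_item *item, struct sketch_log_vec *new_lv)
{
	struct sketch_log_vec *old = item->lv;
	long len;
	int diff_iovecs;

	if (old) {
		/* relog: space used is a delta against the old vector */
		len = (long)new_lv->buf_len - (long)old->buf_len;
		diff_iovecs = new_lv->niovecs - old->niovecs;
	} else {
		/* first time in this checkpoint: pin the item */
		len = (long)new_lv->buf_len;
		diff_iovecs = new_lv->niovecs;
		item->pin_count++;
	}

	item->lv = new_lv;
	return len + (long)diff_iovecs * SKETCH_OPHDR_SIZE;
}
```

Committing the same item twice with vectors of 100 and then 160 bytes charges the full 100 bytes plus headers the first time, and only the 60 byte growth plus any extra headers the second time.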
To maintain the current log format for transactions with delayed logging, we need to introduce a checkpoint transaction and a context for tracking each checkpoint from initiation to transaction completion. This includes adding a log ticket for tracking the log space required/used by the context checkpoint. To track all the changes we need an io vector array per log item, rather than a single array for the entire transaction. Using the new log vector structure for this requires two passes - the first to allocate the log vector structures and chain them together, and the second to fill them out. This log vector chain can then be passed to the CIL for formatting, pinning and insertion into the CIL. Formatting of the log vector chain is relatively simple - it's just a loop over the iovecs on each log vector, but it is made slightly more complex because we re-write the iovec after the copy to point back at the memory buffer we just copied into. This code also needs to pin log items. If the log item is not already tracked in this checkpoint context, then it needs to be pinned. Otherwise it is already pinned and we don't need to pin it again. The only other complexity is calculating the amount of new log space the formatting has consumed. This needs to be accounted to the transaction in progress, and the accounting is made more complex because we also need to steal space from it for log metadata in the checkpoint transaction. Calculate all this at insert time and update all the tickets, counters, etc. correctly. Once we've formatted all the log items in the transaction, attach the busy extents to the checkpoint context so the busy extents live until checkpoint completion and can be processed at that point in time. Transactions can then be freed at this point in time. Now we need to issue checkpoints - we are tracking the amount of log space used by the items in the CIL, so we can trigger background checkpoints when the space usage gets to a certain threshold.
Otherwise, checkpoints need to be triggered when a log synchronisation point is reached - a log force event. Because the log write code already handles chained log vectors, writing the transaction is trivial, too. Construct a transaction header, add it to the head of the chain and write it into the log, then issue a commit record write. Then we can release the checkpoint log ticket and attach the context to the log buffer so it can be called during IO completion to complete the checkpoint. We also need to allow for synchronising multiple in-flight checkpoints. This is needed for two things - the first is to ensure that checkpoint commit records appear in the log in the correct sequence order (so they are replayed in the correct order). The second is so that xfs_log_force_lsn() operates correctly and only flushes and/or waits for the specific sequence it was provided with. To do this we need a wait variable and a list tracking the checkpoint commits in progress. We can walk this list and wait for the checkpoints to change state or complete easily, and this provides the necessary synchronisation for correct operation in both cases.
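The ordering rule for commit records can be sketched in user space as below. This is only an illustration of the check made while walking the list of committing checkpoints: the sketch_* names are hypothetical, a plain array stands in for the kernel list, and the real code sleeps on a wait variable and restarts the walk rather than returning a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of a checkpoint context. */
struct sketch_ctx {
	long	sequence;	/* checkpoint sequence number */
	long	commit_lsn;	/* 0 until the commit record has been written */
};

/*
 * Return 1 if the checkpoint with sequence my_seq has to wait before it
 * may write its commit record, 0 otherwise. It has to wait while any
 * lower sequence on the committing list has not yet been given a commit
 * LSN. Higher sequences and the checkpoint itself are skipped, because
 * the higher ones are the ones that wait for us.
 */
static int
sketch_must_wait(const struct sketch_ctx *committing, size_t n, long my_seq)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (committing[i].sequence >= my_seq)
			continue;
		if (!committing[i].commit_lsn)
			return 1;	/* an earlier checkpoint is still being pushed */
	}
	return 0;
}
```

With checkpoints 1, 2 and 3 in flight and only checkpoint 1 committed, checkpoint 2 may go ahead while checkpoint 3 has to wait until checkpoint 2 has its commit LSN.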
Signed-off-by: Dave Chinner --- fs/xfs/Makefile | 1 + fs/xfs/linux-2.6/xfs_super.c | 10 + fs/xfs/xfs_log.c | 67 ++++- fs/xfs/xfs_log.h | 9 +- fs/xfs/xfs_log_cil.c | 666 ++++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_log_priv.h | 71 +++++- fs/xfs/xfs_mount.h | 1 + fs/xfs/xfs_trans.c | 144 +++++++++- fs/xfs/xfs_trans.h | 8 +- fs/xfs/xfs_trans_item.c | 5 +- fs/xfs/xfs_trans_priv.h | 12 +- 11 files changed, 963 insertions(+), 31 deletions(-) create mode 100644 fs/xfs/xfs_log_cil.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index b4769e4..c8fb13f 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -77,6 +77,7 @@ xfs-y += xfs_alloc.o \ xfs_itable.o \ xfs_dfrag.o \ xfs_log.o \ + xfs_log_cil.o \ xfs_log_recover.o \ xfs_mount.o \ xfs_mru_cache.o \ diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c index 1e88c98..6a7c8c9 100644 --- a/fs/xfs/linux-2.6/xfs_super.c +++ b/fs/xfs/linux-2.6/xfs_super.c @@ -118,6 +118,8 @@ mempool_t *xfs_ioend_pool; #define MNTOPT_DMAPI "dmapi" /* DMI enabled (DMAPI / XDSM) */ #define MNTOPT_XDSM "xdsm" /* DMI enabled (DMAPI / XDSM) */ #define MNTOPT_DMI "dmi" /* DMI enabled (DMAPI / XDSM) */ +#define MNTOPT_DELAYLOG "delaylog" /* Delayed logging enabled */ +#define MNTOPT_NODELAYLOG "nodelaylog" /* Delayed logging disabled */ /* * Table driven mount option parser. 
@@ -373,6 +375,13 @@ xfs_parseargs( mp->m_flags |= XFS_MOUNT_DMAPI; } else if (!strcmp(this_char, MNTOPT_DMI)) { mp->m_flags |= XFS_MOUNT_DMAPI; + } else if (!strcmp(this_char, MNTOPT_DELAYLOG)) { + mp->m_flags |= XFS_MOUNT_DELAYLOG; + cmn_err(CE_WARN, + "Enabling EXPERIMENTAL delayed logging feature " + "- use at your own risk.\n"); + } else if (!strcmp(this_char, MNTOPT_NODELAYLOG)) { + mp->m_flags &= ~XFS_MOUNT_DELAYLOG; } else if (!strcmp(this_char, "ihashsize")) { cmn_err(CE_WARN, "XFS: ihashsize no longer used, option is deprecated."); @@ -534,6 +543,7 @@ xfs_showargs( { XFS_MOUNT_FILESTREAMS, "," MNTOPT_FILESTREAM }, { XFS_MOUNT_DMAPI, "," MNTOPT_DMAPI }, { XFS_MOUNT_GRPID, "," MNTOPT_GRPID }, + { XFS_MOUNT_DELAYLOG, "," MNTOPT_DELAYLOG }, { 0, NULL } }; static struct proc_xfs_info xfs_info_unset[] = { diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 1efb303..88cdfac 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -54,9 +54,6 @@ STATIC xlog_t * xlog_alloc_log(xfs_mount_t *mp, STATIC int xlog_space_left(xlog_t *log, int cycle, int bytes); STATIC int xlog_sync(xlog_t *log, xlog_in_core_t *iclog); STATIC void xlog_dealloc_log(xlog_t *log); -STATIC int xlog_write(struct log *log, struct xfs_log_vec *log_vector, - struct xlog_ticket *tic, xfs_lsn_t *start_lsn, - xlog_in_core_t **commit_iclog, uint flags); /* local state machine functions */ STATIC void xlog_state_done_syncing(xlog_in_core_t *iclog, int); @@ -86,12 +83,6 @@ STATIC int xlog_regrant_write_log_space(xlog_t *log, STATIC void xlog_ungrant_log_space(xlog_t *log, xlog_ticket_t *ticket); - -/* local ticket functions */ -STATIC xlog_ticket_t *xlog_ticket_alloc(xlog_t *log, int unit_bytes, int count, - char clientid, uint flags, - int alloc_flags); - #if defined(DEBUG) STATIC void xlog_verify_dest_ptr(xlog_t *log, char *ptr); STATIC void xlog_verify_grant_head(xlog_t *log, int equals); @@ -460,6 +451,16 @@ xfs_log_mount( /* Normal transactions can now occur */ mp->m_log->l_flags &= 
~XLOG_ACTIVE_RECOVERY; + /* + * Now the log has been fully initialised and we know where our + * space grant counters are, we can initialise the permanent ticket + * needed for delayed logging to work. + */ + error = xlog_cil_init_post_recovery(mp->m_log); + if (error) { + ASSERT(0); + goto out_destroy_ail; + } return 0; out_destroy_ail: @@ -666,6 +667,10 @@ xfs_log_item_init( item->li_ailp = mp->m_ail; item->li_type = type; item->li_ops = ops; + item->li_lv = NULL; + + INIT_LIST_HEAD(&item->li_ail); + INIT_LIST_HEAD(&item->li_cil); } /* @@ -1176,6 +1181,9 @@ xlog_alloc_log(xfs_mount_t *mp, *iclogp = log->l_iclog; /* complete ring */ log->l_iclog->ic_prev = prev_iclog; /* re-write 1st prev ptr */ + error = xlog_cil_init(log); + if (error) + goto out_free_iclog; return log; out_free_iclog: @@ -1502,6 +1510,8 @@ xlog_dealloc_log(xlog_t *log) xlog_in_core_t *iclog, *next_iclog; int i; + xlog_cil_destroy(log); + iclog = log->l_iclog; for (i=0; i < log->l_iclog_bufs; i++) { sv_destroy(&iclog->ic_force_wait); @@ -1544,8 +1554,10 @@ xlog_state_finish_copy(xlog_t *log, * print out info relating to regions written which consume * the reservation */ -STATIC void -xlog_print_tic_res(xfs_mount_t *mp, xlog_ticket_t *ticket) +void +xlog_print_tic_res( + struct xfs_mount *mp, + struct xlog_ticket *ticket) { uint i; uint ophdr_spc = ticket->t_res_num_ophdrs * (uint)sizeof(xlog_op_header_t); @@ -1877,7 +1889,7 @@ xlog_write_copy_finish( * we don't update ic_offset until the end when we know exactly how many * bytes have been written out. */ -STATIC int +int xlog_write( struct log *log, struct xfs_log_vec *log_vector, @@ -1901,9 +1913,26 @@ xlog_write( *start_lsn = 0; len = xlog_write_calc_vec_length(ticket, log_vector); - if (ticket->t_curr_res < len) + if (log->l_cilp) { + /* + * Region headers and bytes are already accounted for. + * We only need to take into account start records and + * split regions in this function. 
+ */ + if (ticket->t_flags & XLOG_TIC_INITED) + ticket->t_curr_res -= sizeof(xlog_op_header_t); + + /* + * Commit record headers need to be accounted for. These + * come in as separate writes so are easy to detect. + */ + if (flags & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) + ticket->t_curr_res -= sizeof(xlog_op_header_t); + } else + ticket->t_curr_res -= len; + + if (ticket->t_curr_res < 0) xlog_print_tic_res(log->l_mp, ticket); - ticket->t_curr_res -= len; index = 0; lv = log_vector; @@ -2999,6 +3028,8 @@ _xfs_log_force( XFS_STATS_INC(xs_log_force); + xlog_cil_push(log, 1); + spin_lock(&log->l_icloglock); iclog = log->l_iclog; @@ -3148,6 +3179,12 @@ _xfs_log_force_lsn( XFS_STATS_INC(xs_log_force); + if (log->l_cilp) { + lsn = xlog_cil_push_lsn(log, lsn); + if (lsn == NULLCOMMITLSN) + return 0; + } + try_again: spin_lock(&log->l_icloglock); iclog = log->l_iclog; @@ -3315,7 +3352,7 @@ xfs_log_ticket_get( /* * Allocate and initialise a new log ticket. */ -STATIC xlog_ticket_t * +xlog_ticket_t * xlog_ticket_alloc( struct log *log, int unit_bytes, diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index 229d1f3..1764f11 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -114,6 +114,9 @@ struct xfs_log_vec { struct xfs_log_vec *lv_next; /* next lv in build list */ int lv_niovecs; /* number of iovecs in lv */ struct xfs_log_iovec *lv_iovecp; /* iovec array */ + struct xfs_log_item *lv_item; /* owner */ + char *lv_buf; /* formatted buffer */ + int lv_buf_len; /* size of formatted buffer */ }; /* @@ -134,6 +137,7 @@ struct xlog_in_core; struct xlog_ticket; struct xfs_log_item; struct xfs_item_ops; +struct xfs_trans; void xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item, @@ -187,9 +191,12 @@ int xfs_log_need_covered(struct xfs_mount *mp); void xlog_iodone(struct xfs_buf *); -struct xlog_ticket * xfs_log_ticket_get(struct xlog_ticket *ticket); +struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket); void xfs_log_ticket_put(struct xlog_ticket 
*ticket); +int xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp, + struct xfs_log_vec *log_vector, + xfs_lsn_t *commit_lsn, int flags); #endif diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c new file mode 100644 index 0000000..3cb1957 --- /dev/null +++ b/fs/xfs/xfs_log_cil.c @@ -0,0 +1,666 @@ +/* + * Copyright (c) 2010 Redhat, Inc. All Rights Reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_types.h" +#include "xfs_bit.h" +#include "xfs_log.h" +#include "xfs_inum.h" +#include "xfs_trans.h" +#include "xfs_trans_priv.h" +#include "xfs_log_priv.h" +#include "xfs_sb.h" +#include "xfs_ag.h" +#include "xfs_dir2.h" +#include "xfs_dmapi.h" +#include "xfs_mount.h" +#include "xfs_error.h" +#include "xfs_alloc.h" + +/* + * Perform initial CIL structure initialisation. If the CIL is not + * enabled in this filesystem, ensure the log->l_cilp is null so + * we can check this conditional to determine if we are doing delayed + * logging or not. 
+ */ +int +xlog_cil_init( + struct log *log) +{ + struct xfs_cil *cil; + struct xfs_cil_ctx *ctx; + + log->l_cilp = NULL; + if (!(log->l_mp->m_flags & XFS_MOUNT_DELAYLOG)) + return 0; + + cil = kmem_zalloc(sizeof(*cil), KM_SLEEP|KM_MAYFAIL); + if (!cil) + return ENOMEM; + + ctx = kmem_zalloc(sizeof(*ctx), KM_SLEEP|KM_MAYFAIL); + if (!ctx) { + kmem_free(cil); + return ENOMEM; + } + + INIT_LIST_HEAD(&cil->xc_cil); + INIT_LIST_HEAD(&cil->xc_committing); + spin_lock_init(&cil->xc_cil_lock); + init_rwsem(&cil->xc_ctx_lock); + sv_init(&cil->xc_commit_wait, SV_DEFAULT, "cilwait"); + + INIT_LIST_HEAD(&ctx->committing); + INIT_LIST_HEAD(&ctx->busy_extents); + ctx->sequence = 1; + ctx->cil = cil; + cil->xc_ctx = ctx; + + cil->xc_log = log; + log->l_cilp = cil; + return 0; +} + +void +xlog_cil_destroy( + struct log *log) +{ + if (!log->l_cilp) + return; + + if (log->l_cilp->xc_ctx) { + if (log->l_cilp->xc_ctx->ticket) + xfs_log_ticket_put(log->l_cilp->xc_ctx->ticket); + kmem_free(log->l_cilp->xc_ctx); + } + + ASSERT(list_empty(&log->l_cilp->xc_cil)); + kmem_free(log->l_cilp); +} + +/* + * Allocate a new ticket. Failing to get a new ticket makes it really hard to + * recover, so we don't allow failure here. Also, we allocate in a context that + * we don't want to be issuing transactions from, so we need to tell the + * allocation code this as well. + * + * We don't reserve any space for the ticket - we are going to steal whatever + * space we require from transactions as they commit. To ensure we reserve all + * the space required, we need to set the current reservation of the ticket to + * zero so that we know to steal the initial transaction overhead from the + * first transaction commit. 
+ */ +static struct xlog_ticket * +xlog_cil_ticket_alloc( + struct log *log) +{ + struct xlog_ticket *tic; + + tic = xlog_ticket_alloc(log, 0, 1, XFS_TRANSACTION, 0, + KM_SLEEP|KM_NOFS); + tic->t_trans_type = XFS_TRANS_CHECKPOINT; + + /* + * set the current reservation to zero so we know to steal the basic + * transaction overhead reservation from the first transaction commit. + */ + tic->t_curr_res = 0; + return tic; +} + +/* + * After the first stage of log recovery is done, we know where the head and + * tail of the log are. We need this log initialisation done before we can + * initialise the first CIL checkpoint context. + * + * Here we allocate a log ticket to track space usage during a CIL push. This + * ticket is passed to xlog_write() directly so that we don't slowly leak log + * space by failing to account for space used by log headers and additional + * region headers for split regions. + */ +int +xlog_cil_init_post_recovery( + struct log *log) +{ + if (!log->l_cilp) + return 0; + + log->l_cilp->xc_ctx->ticket = xlog_cil_ticket_alloc(log); + log->l_cilp->xc_ctx->sequence = 1; + log->l_cilp->xc_ctx->commit_lsn = xlog_assign_lsn(log->l_curr_cycle, + log->l_curr_block); + return 0; +} + +/* + * Insert the log item into the CIL and calculate the difference in space + * consumed by the item. Add the space to the checkpoint ticket and calculate + * if the change requires additional log metadata. If it does, take that space + * as well. Remove the amount of space we added to the checkpoint ticket from + * the current transaction ticket so that the accounting works out correctly. + * + * If this is the first time the item is being placed into the CIL in this + * context, pin it so it can't be written to disk until the CIL is flushed to + * the iclog and the iclog written to disk. 
+ */ +static void +xlog_cil_insert( + struct log *log, + struct xlog_ticket *ticket, + struct xfs_log_item *item, + struct xfs_log_vec *lv) +{ + struct xfs_cil *cil = log->l_cilp; + struct xfs_log_vec *old = lv->lv_item->li_lv; + struct xfs_cil_ctx *ctx = cil->xc_ctx; + int len; + int diff_iovecs; + int iclog_space; + + if (old) { + /* existing lv on log item, space used is a delta */ + ASSERT(!list_empty(&item->li_cil)); + ASSERT(old->lv_buf && old->lv_buf_len && old->lv_niovecs); + + len = lv->lv_buf_len - old->lv_buf_len; + diff_iovecs = lv->lv_niovecs - old->lv_niovecs; + kmem_free(old->lv_buf); + kmem_free(old); + } else { + /* new lv, must pin the log item */ + ASSERT(!lv->lv_item->li_lv); + ASSERT(list_empty(&item->li_cil)); + + len = lv->lv_buf_len; + diff_iovecs = lv->lv_niovecs; + IOP_PIN(lv->lv_item); + + } + len += diff_iovecs * sizeof(xlog_op_header_t); + + /* attach new log vector to log item */ + lv->lv_item->li_lv = lv; + + spin_lock(&cil->xc_cil_lock); + list_move_tail(&item->li_cil, &cil->xc_cil); + ctx->nvecs += diff_iovecs; + + /* + * Now transfer enough transaction reservation to the context ticket + * for the checkpoint. The context ticket is special - the unit + * reservation has to grow as well as the current reservation as we + * steal from tickets so we can correctly determine the space used + * during the transaction commit. + */ + if (ctx->ticket->t_curr_res == 0) { + /* first commit in checkpoint, steal the header reservation */ + ASSERT(ticket->t_curr_res >= ctx->ticket->t_unit_res + len); + ctx->ticket->t_curr_res = ctx->ticket->t_unit_res; + ticket->t_curr_res -= ctx->ticket->t_unit_res; + } + + /* do we need space for more log record headers? 
*/ + iclog_space = log->l_iclog_size - log->l_iclog_hsize; + if (len > 0 && (ctx->space_used / iclog_space != + (ctx->space_used + len) / iclog_space)) { + int hdrs; + + hdrs = (len + iclog_space - 1) / iclog_space; + /* need to take into account split region headers, too */ + hdrs *= log->l_iclog_hsize + sizeof(struct xlog_op_header); + ctx->ticket->t_unit_res += hdrs; + ctx->ticket->t_curr_res += hdrs; + ticket->t_curr_res -= hdrs; + ASSERT(ticket->t_curr_res >= len); + } + ticket->t_curr_res -= len; + ctx->space_used += len; + + spin_unlock(&cil->xc_cil_lock); +} + +/* + * Format log items into flat buffers + * + * For delayed logging, we need to hold a formatted buffer containing + * all the changes on the log item. This enables us to relog the item + * in memory and write it out asynchronously without needing to relock + * the object that was modified at the time it gets written into the + * iclog. + * + * This function works out the length of the buffer needed for each + * log item, allocates them and formats the log vector for the item + * into the buffer. The buffer is then attached to the log item and the + * vector is formatted into the buffer. The log item and formatted log vector + * are then inserted into the Committed Item List for tracking until the + * next checkpoint is written out. + */ +static void +xlog_cil_format_items( + struct log *log, + struct xfs_log_vec *log_vector, + struct xlog_ticket *ticket, + xfs_lsn_t *start_lsn) +{ + struct xfs_log_vec *lv; + + if (start_lsn) + *start_lsn = log->l_cilp->xc_ctx->sequence; + + /* + * we don't set up region headers here; we simply copy the regions into + * the flat buffer. We can do this because we still have to do a + * formatting step to write the regions into the iclog buffer. Writing + * the ophdrs during the iclog write means that we can support + * splitting large regions across iclog boundaries without needing a + * change in the format of the item/region encapsulation. 
+ * + * Hence what we need to do now is change the vector buffer pointer to + * point to the copied region inside the buffer we just allocated. This + * allows us to format the regions into the iclog as though they are + * being formatted directly out of the objects themselves. + */ + ASSERT(log_vector); + for (lv = log_vector; lv; lv = lv->lv_next) { + void *ptr; + int index; + int offset = 0; + int len = 0; + + for (index = 0; index < lv->lv_niovecs; index++) + len += lv->lv_iovecp[index].i_len; + + lv->lv_buf_len = len; + lv->lv_buf = kmem_zalloc(lv->lv_buf_len, KM_SLEEP|KM_NOFS); + ptr = lv->lv_buf; + + for (index = 0; index < lv->lv_niovecs; index++) { + struct xfs_log_iovec *vec = &lv->lv_iovecp[index]; + + memcpy(ptr, vec->i_addr, vec->i_len); + vec->i_addr = ptr; + xlog_write_adv_cnt(&ptr, &len, &offset, vec->i_len); + } + ASSERT(len == 0); + + xlog_cil_insert(log, ticket, lv->lv_item, lv); + } +} + +static void +xlog_cil_free_logvec( + struct xfs_log_vec *log_vector) +{ + struct xfs_log_vec *lv; + + for (lv = log_vector; lv; ) { + struct xfs_log_vec *next = lv->lv_next; + kmem_free(lv->lv_buf); + kmem_free(lv); + lv = next; + } +} + +/* + * Commit a transaction with the given vector to the Committed Item List. + * + * To do this, we need to format the item, pin it in memory if required and + * account for the space used by the transaction. Once we have done that we + * need to release the unused reservation for the transaction, attach the + * transaction to the checkpoint context so we carry the busy extents through + * to checkpoint completion, and then unlock all the items in the transaction. + * + * For more specific information about the order of operations in + * xfs_log_commit_cil() please refer to the comments in + * xfs_trans_commit_iclog(). 
+ */ +int +xfs_log_commit_cil( + struct xfs_mount *mp, + struct xfs_trans *tp, + struct xfs_log_vec *log_vector, + xfs_lsn_t *commit_lsn, + int flags) +{ + struct log *log = mp->m_log; + int log_flags = 0; + + if (flags & XFS_TRANS_RELEASE_LOG_RES) + log_flags = XFS_LOG_REL_PERM_RESERV; + + if (XLOG_FORCED_SHUTDOWN(log)) { + xlog_cil_free_logvec(log_vector); + return XFS_ERROR(EIO); + } + + /* lock out background commit */ + down_read(&log->l_cilp->xc_ctx_lock); + xlog_cil_format_items(log, log_vector, tp->t_ticket, commit_lsn); + + /* check we didn't blow the reservation */ + if (tp->t_ticket->t_curr_res < 0) + xlog_print_tic_res(log->l_mp, tp->t_ticket); + + /* attach the transaction to the CIL if it has any busy extents */ + if (!list_empty(&tp->t_busy)) { + spin_lock(&log->l_cilp->xc_cil_lock); + list_splice_init(&tp->t_busy, + &log->l_cilp->xc_ctx->busy_extents); + spin_unlock(&log->l_cilp->xc_cil_lock); + } + + tp->t_commit_lsn = *commit_lsn; + xfs_log_done(mp, tp->t_ticket, NULL, log_flags); + xfs_trans_unreserve_and_mod_sb(tp); + + /* background commit is allowed again */ + up_read(&log->l_cilp->xc_ctx_lock); + current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS); + + /* xfs_trans_free_items() unlocks them first */ + xfs_trans_free_items(tp, *commit_lsn, 0); + xfs_trans_free(tp); + return 0; +} + +/* + * Mark all items committed and clear busy extents. We free the log vector + * chains in a separate pass so that we unpin the log items as quickly as + * possible. + */ +static void +xlog_cil_committed( + void *args, + int abort) +{ + struct xfs_cil_ctx *ctx = args; + struct xfs_log_vec *lv; + int abortflag = abort ? 
XFS_LI_ABORTED : 0; + struct xfs_busy_extent *busyp, *n; + + /* unpin all the log items */ + for (lv = ctx->lv_chain; lv; lv = lv->lv_next ) { + xfs_trans_item_committed(lv->lv_item, ctx->start_lsn, + abortflag); + } + + list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list) + xfs_alloc_busy_clear(ctx->cil->xc_log->l_mp, busyp); + + spin_lock(&ctx->cil->xc_cil_lock); + list_del(&ctx->committing); + spin_unlock(&ctx->cil->xc_cil_lock); + + xlog_cil_free_logvec(ctx->lv_chain); + kmem_free(ctx); +} + +/* + * Push the Committed Item List to the log. If the push_now flag is not set, + * then it is a background flush and so we can choose to ignore it. + */ +int +xlog_cil_push( + struct log *log, + int push_now) +{ + struct xfs_cil *cil = log->l_cilp; + struct xfs_log_vec *lv; + struct xfs_cil_ctx *ctx; + struct xfs_cil_ctx *new_ctx; + struct xlog_in_core *commit_iclog; + struct xlog_ticket *tic; + int num_lv; + int num_iovecs; + int len; + int error = 0; + struct xfs_trans_header thdr; + struct xfs_log_iovec lhdr; + struct xfs_log_vec lvhdr = { NULL }; + xfs_lsn_t commit_lsn; + + if (!cil) + return 0; + + /* XXX: don't sleep for background? */ + new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_SLEEP|KM_NOFS); + new_ctx->ticket = xlog_cil_ticket_alloc(log); + + /* lock out transaction commit */ + down_write(&cil->xc_ctx_lock); + ctx = cil->xc_ctx; + + /* check if we've anything to push */ + if (list_empty(&cil->xc_cil)) { + up_write(&cil->xc_ctx_lock); + xfs_log_ticket_put(new_ctx->ticket); + kmem_free(new_ctx); + return 0; + } + + /* + * pull all the log vectors off the items in the CIL, and + * remove the items from the CIL. We don't need the CIL lock + * here because it's only needed on the transaction commit + * side which is currently locked out by the flush lock. 
+ */ + lv = NULL; + num_lv = 0; + num_iovecs = 0; + len = 0; + while (!list_empty(&cil->xc_cil)) { + struct xfs_log_item *item; + int i; + + item = list_first_entry(&cil->xc_cil, + struct xfs_log_item, li_cil); + list_del_init(&item->li_cil); + if (!ctx->lv_chain) + ctx->lv_chain = item->li_lv; + else + lv->lv_next = item->li_lv; + lv = item->li_lv; + item->li_lv = NULL; + + num_lv++; + num_iovecs += lv->lv_niovecs; + for (i = 0; i < lv->lv_niovecs; i++) + len += lv->lv_iovecp[i].i_len; + } + + /* + * initialise the new context and attach it to the CIL. Then attach + * the current context to the CIL committing list so it can be found + * during log forces to extract the commit lsn of the sequence that + * needs to be forced. + */ + INIT_LIST_HEAD(&new_ctx->committing); + INIT_LIST_HEAD(&new_ctx->busy_extents); + new_ctx->sequence = ctx->sequence + 1; + new_ctx->cil = cil; + cil->xc_ctx = new_ctx; + + /* + * The switch is now done, so we can drop the context lock and move out + * of a shared context. We can't just go straight to the commit record, + * though - we need to synchronise with previous and future commits so + * that the commit records are correctly ordered in the log to ensure + * that we process items during log IO completion in the correct order. + * + * For example, if we get an EFI in one checkpoint and the EFD in the + * next (e.g. due to log forces), we do not want the checkpoint with + * the EFD to be committed before the checkpoint with the EFI. Hence + * we must strictly order the commit records of the checkpoints so + * that: a) the checkpoint callbacks are attached to the iclogs in the + * correct order; and b) the checkpoints are replayed in correct order + * in log recovery. + * + * Hence we need to add this context to the committing context list so + * that higher sequences will wait for us to write out a commit record + * before they do. 
+ */ + spin_lock(&cil->xc_cil_lock); + list_add(&ctx->committing, &cil->xc_committing); + spin_unlock(&cil->xc_cil_lock); + up_write(&cil->xc_ctx_lock); + + /* + * Build a checkpoint transaction header and write it to the log to + * begin the transaction. We need to account for the space used by the + * transaction header here as it is not accounted for in xlog_write(). + * + * The LSN we need to pass to the log items on transaction commit is + * the LSN reported by the first log vector write. If we use the commit + * record lsn then we can move the tail beyond the grant write head. + */ + tic = ctx->ticket; + thdr.th_magic = XFS_TRANS_HEADER_MAGIC; + thdr.th_type = XFS_TRANS_CHECKPOINT; + thdr.th_tid = tic->t_tid; + thdr.th_num_items = num_iovecs; + lhdr.i_addr = (xfs_caddr_t)&thdr; + lhdr.i_len = sizeof(xfs_trans_header_t); + lhdr.i_type = XLOG_REG_TYPE_TRANSHDR; + tic->t_curr_res -= lhdr.i_len + sizeof(xlog_op_header_t); + + lvhdr.lv_niovecs = 1; + lvhdr.lv_iovecp = &lhdr; + lvhdr.lv_next = ctx->lv_chain; + + error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0); + if (error) + goto out_abort; + + /* + * now that we've written the checkpoint into the log, strictly + * order the commit records so replay will get them in the right order. + */ +restart: + spin_lock(&cil->xc_cil_lock); + list_for_each_entry(new_ctx, &cil->xc_committing, committing) { + /* + * Higher sequences will wait for this one so skip them. + * Don't wait for our own sequence, either. + */ + if (new_ctx->sequence >= ctx->sequence) + continue; + if (!new_ctx->commit_lsn) { + /* + * It is still being pushed! Wait for the push to + * complete, then start again from the beginning. 
+ */ + sv_wait(&cil->xc_commit_wait, 0, &cil->xc_cil_lock, 0); + goto restart; + } + } + spin_unlock(&cil->xc_cil_lock); + + commit_lsn = xfs_log_done(log->l_mp, tic, &commit_iclog, 0); + if (error || commit_lsn == -1) + goto out_abort; + + /* attach all the transactions w/ busy extents to iclog */ + ctx->log_cb.cb_func = xlog_cil_committed; + ctx->log_cb.cb_arg = ctx; + error = xfs_log_notify(log->l_mp, commit_iclog, &ctx->log_cb); + if (error) + goto out_abort; + + /* + * now the checkpoint commit is complete and we've attached the + * callbacks to the iclog we can assign the commit LSN to the context + * and wake up anyone who is waiting for the commit to complete. + */ + spin_lock(&cil->xc_cil_lock); + ctx->commit_lsn = commit_lsn; + sv_broadcast(&cil->xc_commit_wait); + spin_unlock(&cil->xc_cil_lock); + + /* release the hounds! */ + return xfs_log_release_iclog(log->l_mp, commit_iclog); + +out_abort: + xlog_cil_committed(ctx, XFS_LI_ABORTED); + return XFS_ERROR(EIO); +} + +/* + * Conditionally push the CIL based on the sequence passed in. + * + * We only need to push if we haven't already pushed the sequence + * number given. Hence the only time we will trigger a push here is + * if the push sequence is the same as the current context. + * + * We return the current commit lsn to allow the callers to determine if a + * iclog flush is necessary following this call. + * + * XXX: Initially, just push the CIL unconditionally and return whatever + * commit lsn is there. It'll be empty, so this is broken for now. 
+ */ +xfs_lsn_t +xlog_cil_push_lsn( + struct log *log, + xfs_lsn_t push_seq) +{ + struct xfs_cil *cil = log->l_cilp; + struct xfs_cil_ctx *ctx; + xfs_lsn_t commit_lsn = NULLCOMMITLSN; + +restart: + down_write(&cil->xc_ctx_lock); + ASSERT(push_seq <= cil->xc_ctx->sequence); + + /* check to see if we need to force out the current context */ + if (push_seq == cil->xc_ctx->sequence) { + up_write(&cil->xc_ctx_lock); + xlog_cil_push(log, 1); + goto restart; + } + + /* + * See if we can find a previous sequence still committing. + * We can drop the flush lock as soon as we have the cil lock + * because we are now only comparing contexts protected by + * the cil lock. + * + * We need to wait for all previous sequence commits to complete + * before allowing the force of push_seq to go ahead. Hence block + * on commits for those as well. + */ + spin_lock(&cil->xc_cil_lock); + up_write(&cil->xc_ctx_lock); + list_for_each_entry(ctx, &cil->xc_committing, committing) { + if (ctx->sequence > push_seq) + continue; + if (!ctx->commit_lsn) { + /* + * It is still being pushed! Wait for the push to + * complete, then start again from the beginning. + */ + sv_wait(&cil->xc_commit_wait, 0, &cil->xc_cil_lock, 0); + goto restart; + } + if (ctx->sequence != push_seq) + continue; + /* found it! */ + commit_lsn = ctx->commit_lsn; + } + spin_unlock(&cil->xc_cil_lock); + return commit_lsn; +} + diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 9cf6951..e9e8324 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -379,6 +379,54 @@ typedef struct xlog_in_core { } xlog_in_core_t; /* + * The CIL context is used to aggregate per-transaction details as well as being + * passed to the iclog for checkpoint post-commit processing. After being + * passed to the iclog, another context needs to be allocated for tracking the + * next set of transactions to be aggregated into a checkpoint. 
+ */ +struct xfs_cil; + +struct xfs_cil_ctx { + struct xfs_cil *cil; + xfs_lsn_t sequence; /* chkpt sequence # */ + xfs_lsn_t start_lsn; /* first LSN of chkpt commit */ + xfs_lsn_t commit_lsn; /* chkpt commit record lsn */ + struct xlog_ticket *ticket; /* chkpt ticket */ + int nvecs; /* number of regions */ + int space_used; /* aggregate size of regions */ + struct list_head busy_extents; /* busy extents in chkpt */ + struct xfs_log_vec *lv_chain; /* logvecs being pushed */ + xfs_log_callback_t log_cb; /* completion callback hook. */ + struct list_head committing; /* ctx committing list */ +}; + +/* + * Committed Item List structure + * + * This structure is used to track log items that have been committed but not + * yet written into the log. It is used only when the delayed logging mount + * option is enabled. + * + * This structure tracks the list of committing checkpoint contexts so + * we can avoid the problem of having to hold out new transactions during a + * flush until we have a the commit record LSN of the checkpoint. We can + * traverse the list of committing contexts in xlog_cil_push_lsn() to find a + * sequence match and extract the commit LSN directly from there. If the + * checkpoint is still in the process of committing, we can block waiting for + * the commit LSN to be determined as well. This should make synchronous + * operations almost as efficient as the old logging methods. + */ +struct xfs_cil { + struct log *xc_log; + struct list_head xc_cil; + spinlock_t xc_cil_lock; + struct xfs_cil_ctx *xc_ctx; + struct rw_semaphore xc_ctx_lock; + struct list_head xc_committing; + sv_t xc_commit_wait; +}; + +/* * The reservation head lsn is not made up of a cycle number and block number. * Instead, it uses a cycle number and byte number. 
Logs don't expect to * overflow 31 bits worth of byte offset, so using a byte number will mean @@ -388,6 +436,7 @@ typedef struct log { /* The following fields don't need locking */ struct xfs_mount *l_mp; /* mount point */ struct xfs_ail *l_ailp; /* AIL log is working with */ + struct xfs_cil *l_cilp; /* CIL log is working with */ struct xfs_buf *l_xbuf; /* extra buffer for log * wrapping */ struct xfs_buftarg *l_targ; /* buftarg of log */ @@ -438,14 +487,17 @@ typedef struct log { #define XLOG_FORCED_SHUTDOWN(log) ((log)->l_flags & XLOG_IO_ERROR) - /* common routines */ extern xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp); extern int xlog_recover(xlog_t *log); extern int xlog_recover_finish(xlog_t *log); extern void xlog_pack_data(xlog_t *log, xlog_in_core_t *iclog, int); -extern kmem_zone_t *xfs_log_ticket_zone; +extern kmem_zone_t *xfs_log_ticket_zone; +struct xlog_ticket *xlog_ticket_alloc(struct log *log, int unit_bytes, + int count, char client, uint xflags, + int alloc_flags); + static inline void xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes) @@ -455,6 +507,21 @@ xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes) *off += bytes; } +void xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket); +int xlog_write(struct log *log, struct xfs_log_vec *log_vector, + struct xlog_ticket *tic, xfs_lsn_t *start_lsn, + xlog_in_core_t **commit_iclog, uint flags); + +/* + * Committed Item List interfaces + */ +int xlog_cil_init(struct log *log); +int xlog_cil_init_post_recovery(struct log *log); +void xlog_cil_destroy(struct log *log); + +int xlog_cil_push(struct log *log, int push_now); +xfs_lsn_t xlog_cil_push_lsn(struct log *log, xfs_lsn_t push_sequence); + /* * Unmount record type is used as a pseudo transaction type for the ticket. * It's value must be outside the range of XFS_TRANS_* values. 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index 9ff48a1..1d2c7ee 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -268,6 +268,7 @@ typedef struct xfs_mount { #define XFS_MOUNT_WSYNC (1ULL << 0) /* for nfs - all metadata ops must be synchronous except for space allocations */ +#define XFS_MOUNT_DELAYLOG (1ULL << 1) /* delayed logging is enabled */ #define XFS_MOUNT_DMAPI (1ULL << 2) /* dmapi is enabled */ #define XFS_MOUNT_WAS_CLEAN (1ULL << 3) #define XFS_MOUNT_FS_SHUTDOWN (1ULL << 4) /* atomic stop of all filesystem diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 40d9595..9bdb492 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -253,7 +253,7 @@ _xfs_trans_alloc( * Free the transaction structure. If there is more clean up * to do when the structure is freed, add it here. */ -STATIC void +void xfs_trans_free( struct xfs_trans *tp) { @@ -655,7 +655,7 @@ xfs_trans_apply_sb_deltas( * XFS_TRANS_SB_DIRTY will not be set when the transaction is updated but we * still need to update the incore superblock with the changes. */ -STATIC void +void xfs_trans_unreserve_and_mod_sb( xfs_trans_t *tp) { @@ -883,7 +883,7 @@ xfs_trans_fill_vecs( * they could be immediately flushed and we'd have to race with the flusher * trying to pull the item from the AIL as we add it. */ -static void +void xfs_trans_item_committed( struct xfs_log_item *lip, xfs_lsn_t commit_lsn, @@ -994,7 +994,7 @@ xfs_trans_uncommit( xfs_trans_unreserve_and_mod_sb(tp); xfs_trans_unreserve_and_mod_dquots(tp); - xfs_trans_free_items(tp, flags); + xfs_trans_free_items(tp, NULLCOMMITLSN, flags); xfs_trans_free(tp); } @@ -1144,6 +1144,132 @@ xfs_trans_commit_iclog( return xfs_log_release_iclog(mp, commit_iclog); } +/* + * Walk the log items and allocate log vector structures for + * each item large enough to fit all the vectors they require. + * Note that this format differs from the old log vector format in + * that there is no transaction header in these log vectors. 
+ */ +STATIC struct xfs_log_vec * +xfs_trans_alloc_log_vecs( + xfs_trans_t *tp) +{ + xfs_log_item_desc_t *lidp; + struct xfs_log_vec *lv = NULL; + struct xfs_log_vec *ret_lv = NULL; + + lidp = xfs_trans_first_item(tp); + + /* Bail out if we didn't find a log item. */ + if (!lidp) { + ASSERT(0); + return NULL; + } + + while (lidp != NULL) { + struct xfs_log_vec *new_lv; + + /* Skip items which aren't dirty in this transaction. */ + if (!(lidp->lid_flags & XFS_LID_DIRTY)) { + lidp = xfs_trans_next_item(tp, lidp); + continue; + } + + /* Skip items that do not have any vectors for writing */ + lidp->lid_size = IOP_SIZE(lidp->lid_item); + if (!lidp->lid_size) { + lidp = xfs_trans_next_item(tp, lidp); + continue; + } + + new_lv = kmem_zalloc(sizeof(*new_lv) + + lidp->lid_size * sizeof(struct xfs_log_iovec), + KM_SLEEP); + + /* The allocated iovec region lies beyond the log vector. */ + new_lv->lv_iovecp = (struct xfs_log_iovec *)&new_lv[1]; + if (!ret_lv) + ret_lv = new_lv; + else + lv->lv_next = new_lv; + lv = new_lv; + lidp = xfs_trans_next_item(tp, lidp); + } + + return ret_lv; +} + +/* + * Fill in the vector with pointers to data to be logged + * by this transaction. + * Each dirty item takes the + * number of vectors it indicated it needed in xfs_trans_alloc_log_vecs(). + * There is no transaction header in this format. + * + * We do not pin the items here as they are formatted, we leave that to + * the CIL commit. This is done because the pinning of the item is + * conditional on whether the item is already pinned in the CIL. Hence + * the check and pin must be done under the protection of the flush lock. + */ +STATIC void +xfs_trans_fill_log_vecs( + struct xfs_trans *tp, + struct xfs_log_vec *log_vector) +{ + xfs_log_item_desc_t *lidp; + struct xfs_log_vec *lv = log_vector; + + lidp = xfs_trans_first_item(tp); + ASSERT(lidp); + while (lidp) { + /* + * Skip items which aren't dirty in this transaction. 
+ */ + if (!(lidp->lid_flags & XFS_LID_DIRTY)) { + lidp = xfs_trans_next_item(tp, lidp); + continue; + } + /* Skip items that do not have any vectors for writing */ + if (!lidp->lid_size) { + lidp = xfs_trans_next_item(tp, lidp); + continue; + } + IOP_FORMAT(lidp->lid_item, lv->lv_iovecp); + lv->lv_niovecs = lidp->lid_size; + lv->lv_item = lidp->lid_item; + + lidp = xfs_trans_next_item(tp, lidp); + lv = lv->lv_next; + } +} + +static int +xfs_trans_commit_cil( + struct xfs_mount *mp, + struct xfs_trans *tp, + xfs_lsn_t *commit_lsn, + int flags) +{ + struct xfs_log_vec *log_vector; + + /* + * Get each log item to allocate a vector structure for + * the log item to to pass to the log write code. + */ + log_vector = xfs_trans_alloc_log_vecs(tp); + if (!log_vector) + return ENOMEM; + + /* + * Fill in the log_vector and pin the logged items, and + * then write the transaction to the log. We have to lock + * out CIL flushes from this point as we are going to pin + */ + xfs_trans_fill_log_vecs(tp, log_vector); + + return xfs_log_commit_cil(mp, tp, log_vector, commit_lsn, flags); + +} /* * xfs_trans_commit @@ -1204,7 +1330,11 @@ _xfs_trans_commit( xfs_trans_apply_sb_deltas(tp); xfs_trans_apply_dquot_deltas(tp); - error = xfs_trans_commit_iclog(mp, tp, &commit_lsn, flags); + if (mp->m_flags & XFS_MOUNT_DELAYLOG) + error = xfs_trans_commit_cil(mp, tp, &commit_lsn, flags); + else + error = xfs_trans_commit_iclog(mp, tp, &commit_lsn, flags); + if (error == ENOMEM) { xfs_force_shutdown(mp, SHUTDOWN_LOG_IO_ERROR); error = XFS_ERROR(EIO); @@ -1242,7 +1372,7 @@ out_unreserve: error = XFS_ERROR(EIO); } current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS); - xfs_trans_free_items(tp, error ? XFS_TRANS_ABORT : 0); + xfs_trans_free_items(tp, NULLCOMMITLSN, error ? 
XFS_TRANS_ABORT : 0); xfs_trans_free(tp); XFS_STATS_INC(xs_trans_empty); @@ -1320,7 +1450,7 @@ xfs_trans_cancel( /* mark this thread as no longer being in a transaction */ current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS); - xfs_trans_free_items(tp, flags); + xfs_trans_free_items(tp, NULLCOMMITLSN, flags); xfs_trans_free(tp); } diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h index ff7e9e6..b1ea20c 100644 --- a/fs/xfs/xfs_trans.h +++ b/fs/xfs/xfs_trans.h @@ -106,7 +106,8 @@ typedef struct xfs_trans_header { #define XFS_TRANS_GROWFSRT_FREE 39 #define XFS_TRANS_SWAPEXT 40 #define XFS_TRANS_SB_COUNT 41 -#define XFS_TRANS_TYPE_MAX 41 +#define XFS_TRANS_CHECKPOINT 42 +#define XFS_TRANS_TYPE_MAX 42 /* new transaction types need to be reflected in xfs_logprint(8) */ #define XFS_TRANS_TYPES \ @@ -148,6 +149,7 @@ typedef struct xfs_trans_header { { XFS_TRANS_GROWFSRT_FREE, "GROWFSRT_FREE" }, \ { XFS_TRANS_SWAPEXT, "SWAPEXT" }, \ { XFS_TRANS_SB_COUNT, "SB_COUNT" }, \ + { XFS_TRANS_CHECKPOINT, "CHECKPOINT" }, \ { XFS_TRANS_DUMMY1, "DUMMY1" }, \ { XFS_TRANS_DUMMY2, "DUMMY2" }, \ { XLOG_UNMOUNT_REC_TYPE, "UNMOUNT" } @@ -829,6 +831,10 @@ typedef struct xfs_log_item { /* buffer item iodone */ /* callback func */ struct xfs_item_ops *li_ops; /* function list */ + + /* delayed logging */ + struct list_head li_cil; /* CIL pointers */ + struct xfs_log_vec *li_lv; /* active log vector */ } xfs_log_item_t; #define XFS_LI_IN_AIL 0x1 diff --git a/fs/xfs/xfs_trans_item.c b/fs/xfs/xfs_trans_item.c index 2937a1e..f11d37d 100644 --- a/fs/xfs/xfs_trans_item.c +++ b/fs/xfs/xfs_trans_item.c @@ -299,6 +299,7 @@ xfs_trans_next_item(xfs_trans_t *tp, xfs_log_item_desc_t *lidp) void xfs_trans_free_items( xfs_trans_t *tp, + xfs_lsn_t commit_lsn, int flags) { xfs_log_item_chunk_t *licp; @@ -311,7 +312,7 @@ xfs_trans_free_items( * Special case the embedded chunk so we don't free it below. 
*/ if (!xfs_lic_are_all_free(licp)) { - (void) xfs_trans_unlock_chunk(licp, 1, abort, NULLCOMMITLSN); + (void) xfs_trans_unlock_chunk(licp, 1, abort, commit_lsn); xfs_lic_all_free(licp); licp->lic_unused = 0; } @@ -322,7 +323,7 @@ xfs_trans_free_items( */ while (licp != NULL) { ASSERT(!xfs_lic_are_all_free(licp)); - (void) xfs_trans_unlock_chunk(licp, 1, abort, NULLCOMMITLSN); + (void) xfs_trans_unlock_chunk(licp, 1, abort, commit_lsn); next_licp = licp->lic_next; kmem_free(licp); licp = next_licp; diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h index 901dc0f..330fa5f 100644 --- a/fs/xfs/xfs_trans_priv.h +++ b/fs/xfs/xfs_trans_priv.h @@ -35,9 +35,15 @@ struct xfs_log_item_desc *xfs_trans_find_item(struct xfs_trans *, struct xfs_log_item_desc *xfs_trans_first_item(struct xfs_trans *); struct xfs_log_item_desc *xfs_trans_next_item(struct xfs_trans *, struct xfs_log_item_desc *); -void xfs_trans_free_items(struct xfs_trans *, int); -void xfs_trans_unlock_items(struct xfs_trans *, - xfs_lsn_t); + +void xfs_trans_unlock_items(struct xfs_trans *tp, xfs_lsn_t commit_lsn); +void xfs_trans_free_items(struct xfs_trans *tp, xfs_lsn_t commit_lsn, + int flags); + +void xfs_trans_item_committed(struct xfs_log_item *lip, + xfs_lsn_t commit_lsn, int aborted); +void xfs_trans_unreserve_and_mod_sb(struct xfs_trans *tp); +void xfs_trans_free(struct xfs_trans *tp); /* * AIL traversal cursor. 
-- 
1.5.6.5

From SRS0+QjcZ+65+fromorbit.com=dave@internode.on.net Wed May 5 20:44:12 2010
From: Dave Chinner
To: xfs@oss.sgi.com
Subject: [PATCH 07/11] xfs: Delayed logging design documentation
Date: Thu, 6 May 2010 11:45:47 +1000
Message-Id: <1273110351-2333-8-git-send-email-david@fromorbit.com>
X-Mailer: git-send-email 1.5.6.5
In-Reply-To: <1273110351-2333-1-git-send-email-david@fromorbit.com>
References: <1273110351-2333-1-git-send-email-david@fromorbit.com>

From: Dave Chinner

Document the design of the delayed logging implementation. This includes
assumptions made, dead ends followed, the reasoning behind the structuring
of the code, the layout of various structures, how things fit together,
traps and pit-falls avoided, etc. This is all too much to document in the
code itself, so do it in a separate file.

Signed-off-by: Dave Chinner
---
 .../filesystems/xfs-delayed-logging-design.txt | 819 ++++++++++++++++++++
 1 files changed, 819 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/xfs-delayed-logging-design.txt

diff --git a/Documentation/filesystems/xfs-delayed-logging-design.txt b/Documentation/filesystems/xfs-delayed-logging-design.txt
new file mode 100644
index 0000000..961cc8d
--- /dev/null
+++ b/Documentation/filesystems/xfs-delayed-logging-design.txt
@@ -0,0 +1,819 @@
+XFS Delayed Logging Design
+--------------------------
+
+Introduction to Re-logging in XFS
+---------------------------------
+
+XFS logging is a combination of logical and physical logging. Some objects,
+such as inodes and dquots, are logged in logical format where the details
+logged are made up of the changes to in-core structures rather than on-disk
+structures. Other objects - typically buffers - have their physical changes
+logged. The reason for these differences is to reduce the amount of log space
+required for objects that are frequently logged.
Parts of inodes are more
+frequently logged than others, and inodes are typically more frequently logged
+than any other object (except maybe the superblock buffer) so keeping the
+amount of metadata logged low is of prime importance.
+
+The reason that this is such a concern is that XFS allows multiple separate
+modifications to a single object to be carried in the log at any given time.
+This allows the log to avoid needing to flush each change to disk before
+recording a new change to the object. XFS does this via a method called
+"re-logging". Conceptually, this is quite simple - all it requires is that any
+new change to the object is recorded with a *new copy* of all the existing
+changes in the new transaction that is written to the log.
+
+That is, if we have a sequence of changes A through to F, and the object was
+written to disk after change D, we would see in the log the following series
+of transactions, their contents and the log sequence number (LSN) of the
+transaction:
+
+        Transaction     Contents        LSN
+           A               A               X
+           B              A+B             X+n
+           C             A+B+C           X+n+m
+           D            A+B+C+D         X+n+m+o
+            <object written to disk>
+           E               E               Y (> X+n+m+o)
+           F              E+F             Y+p
+
+In other words, each time an object is relogged, the new transaction contains
+the aggregation of all the previous changes currently held only in the log.
+
+This relogging technique also allows objects to be moved forward in the log so
+that an object being relogged does not prevent the tail of the log from ever
+moving forward. This can be seen in the table above by the changing
+(increasing) LSN of each subsequent transaction - the LSN is effectively a
+direct encoding of the location in the log of the transaction.
+
+This relogging is also used to implement long-running, multiple-commit
+transactions. These transactions are known as rolling transactions, and require
+a special log reservation known as a permanent transaction reservation.
A
+typical example of a rolling transaction is the removal of extents from an
+inode which can only be done at a rate of two extents per transaction because
+of reservation size limitations. Hence a rolling extent removal transaction
+keeps relogging the inode and btree buffers as they get modified in each
+removal operation. This keeps them moving forward in the log as the operation
+progresses, ensuring that the current operation never gets blocked by itself
+if the log wraps around.
+
+Hence it can be seen that the relogging operation is fundamental to the correct
+working of the XFS journalling subsystem. From the above description, most
+people should be able to see why the XFS metadata operations write so much to
+the log - repeated operations to the same objects write the same changes to
+the log over and over again. Worse is the fact that objects tend to get
+dirtier as they get relogged, so each subsequent transaction is writing more
+metadata into the log.
+
+Another feature of the XFS transaction subsystem is that most transactions are
+asynchronous. That is, they don't commit to disk until either a log buffer is
+filled (a log buffer can hold multiple transactions) or a synchronous operation
+forces the log buffers holding the transactions to disk. This means that XFS is
+doing aggregation of transactions in memory - batching them, if you like - to
+minimise the impact of the log IO on transaction throughput.
+
+The limitation on asynchronous transaction throughput is the number and size of
+log buffers made available by the log manager. By default there are 8 log
+buffers available and the size of each is 32kB - the size can be increased up
+to 256kB by use of a mount option.
+
+Effectively, this gives us the maximum bound of outstanding metadata changes
+that can be made to the filesystem at any point in time - if all the log
+buffers are full and under IO, then no more transactions can be committed until
+the current batch completes.
It is now common for a single current CPU core to
+be able to issue enough transactions to keep the log buffers full and under
+IO permanently. Hence the XFS journalling subsystem can be considered to be IO
+bound.
+
+Delayed Logging: Concepts
+-------------------------
+
+The key thing to note about the asynchronous logging combined with the
+relogging technique XFS uses is that we can be relogging changed objects
+multiple times before they are committed to disk in the log buffers. If we
+return to the previous relogging example, it is entirely possible that
+transactions A through D are committed to disk in the same log buffer.
+
+That is, a single log buffer may contain multiple copies of the same object,
+but only one of those copies needs to be there - the last one "D", as it
+contains all the changes from the previous changes. In other words, we have one
+necessary copy in the log buffer, and three stale copies that are simply
+wasting space. When we are doing repeated operations on the same set of
+objects, these "stale objects" can be over 90% of the space used in the log
+buffers. It is clear that reducing the number of stale objects written to the
+log would greatly reduce the amount of metadata we write to the log, and
+this is the fundamental goal of delayed logging.
+
+From a conceptual point of view, XFS is already doing relogging in memory (where
+memory == log buffer), only it is doing it extremely inefficiently. It is using
+logical to physical formatting to do the relogging because there is no
+infrastructure to keep track of logical changes in memory prior to physically
+formatting the changes in a transaction to the log buffer. Hence we cannot avoid
+accumulating stale objects in the log buffers.
+
+Delayed logging is the name we've given to keeping and tracking transactional
+changes to objects in memory outside the log buffer infrastructure.
Because of
+the relogging concept fundamental to the XFS journalling subsystem, this is
+actually relatively easy to do - all the changes to logged items are already
+tracked in the current infrastructure. The big problem is how to accumulate
+them and get them to the log in a consistent, recoverable manner.
+Describing the problems and how they have been solved is the focus of this
+document.
+
+One of the key changes that delayed logging makes to the operation of the
+journalling subsystem is that it dissociates the amount of outstanding metadata
+changes from the size and number of log buffers available. In other words,
+instead of there only being a maximum of 2MB of transaction changes not written to
+the log at any point in time, there may be a much greater amount being
+accumulated in memory. Hence the potential for loss of metadata on a crash is
+much greater than for the existing logging mechanism.
+
+It should be noted that this does not change the guarantee that log recovery
+will result in a consistent filesystem. What it does mean is that as far as the
+recovered filesystem is concerned, there may be many thousands of transactions
+that simply did not occur as a result of the crash. This makes it even more
+important that applications that care about their data use fsync() where they
+need to ensure application level data integrity is maintained.
+
+It should be noted that delayed logging is not an innovative new concept that
+warrants rigorous proofs to determine whether it is correct or not. The method
+of accumulating changes in memory for some period before writing them to the
+log is used effectively in many filesystems including ext3 and ext4. Hence
+no time is spent in this document trying to convince the reader that the
+concept is sound. Instead it is simply considered a "solved problem" and as
+such implementing it in XFS is purely an exercise in software engineering.
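
As a rough check on the numbers used so far, the sketch below (an illustration only, not part of the patch) works through two pieces of arithmetic: the 2MB bound on outstanding changes, which is 8 log buffers at the 256kB maximum, and the log space consumed by relogging changes A through D. The function names are hypothetical, and so is the assumption that every change dirties the same number of additional bytes in the object.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Upper bound on transaction data held in the log buffers: every buffer is
 * full and under IO.
 */
static size_t log_buffer_bound(size_t nbufs, size_t buf_bytes)
{
	return nbufs * buf_bytes;
}

/*
 * Bytes written when every change is relogged: transaction i carries a new
 * copy of changes 1..i, so the total is unit * (1 + 2 + ... + nchanges).
 * "unit" is a hypothetical per-change size chosen for illustration.
 */
static size_t relogged_bytes(size_t nchanges, size_t unit)
{
	size_t total = 0;
	size_t i;

	assert(unit > 0);
	for (i = 1; i <= nchanges; i++)
		total += i * unit;
	return total;
}

/* Bytes written if only the final aggregated copy reaches the log. */
static size_t aggregated_bytes(size_t nchanges, size_t unit)
{
	assert(unit > 0);
	return nchanges * unit;
}
```

Under these assumptions log_buffer_bound(8, 256 * 1024) gives the 2MB figure and log_buffer_bound(8, 32 * 1024) gives 256kB for the default buffer size. For four changes of 128 bytes each, relogging writes 1280 bytes where the aggregated copy needs 512, so 60% of what reaches the log is stale. In this simple model the stale fraction is 1 - 2/(n+1) for n relogs, which only passes 90% once the same object has been relogged 20 or more times.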
+
+The fundamental requirements for delayed logging in XFS are simple:
+
+        1. Reduce the amount of metadata written to the log by at least
+           an order of magnitude.
+        2. Supply sufficient statistics to validate Requirement #1.
+        3. Supply sufficient new tracing infrastructure to be able to debug
+           problems with the new code.
+        4. No on-disk format change (metadata or log format).
+        5. Enable and disable with a mount option.
+        6. No performance regressions for synchronous transaction workloads.
+
+Delayed Logging: Design
+-----------------------
+
+Storing Changes
+
+The problem with accumulating changes at a logical level (i.e. just using the
+existing log item dirty region tracking) is that when it comes to writing the
+changes to the log buffers, we need to ensure that the object we are formatting
+is not changing while we do this. This requires locking the object to prevent
+concurrent modification. Hence flushing the logical changes to the log would
+require us to lock every object, format them, and then unlock them again.
+
+This introduces lots of scope for deadlocks with transactions that are already
+running. For example, a transaction has object A locked and modified, but needs
+the delayed logging tracking lock to commit the transaction. However, the
+flushing thread has the delayed logging tracking lock already held, and is
+trying to get the lock on object A to flush it to the log buffer. This appears
+to be an unsolvable deadlock condition, and it was solving this problem that
+was the barrier to implementing delayed logging for so long.
+
+The solution is relatively simple - it just took a long time to recognise it.
+Put simply, the current logging code formats the changes to each item into a
+vector array that points to the changed regions in the item. The log write code
+simply copies the memory these vectors point to into the log buffer during
+transaction commit while the item is locked in the transaction.
Instead of
+using the log buffer as the destination of the formatting code, we can use
+an allocated memory buffer big enough to fit the formatted vector.
+
+If we then copy the vector into the memory buffer and then rewrite the vector
+to point to the memory buffer rather than the object itself, we now have a copy
+of the changes in a format that is compatible with the log buffer writing code
+that does not require us to lock the item to access. This formatting and
+rewriting can all be done while the object is locked during transaction commit,
+resulting in a vector that is transactionally consistent and can be accessed
+without needing to lock the owning item.
+
+Hence we avoid the need to lock items when we need to flush outstanding
+asynchronous transactions to the log. The differences between the existing
+formatting method and the delayed logging formatting can be seen in the
+diagram below.
+
+Current format log vector:
+
+Object    +---------------------------------------------+
+Vector 1      +----+
+Vector 2                    +----+
+Vector 3                                   +----------+
+
+After formatting:
+
+Log Buffer    +-V1-+-V2-+----V3----+
+
+Delayed logging vector:
+
+Object    +---------------------------------------------+
+Vector 1      +----+
+Vector 2                    +----+
+Vector 3                                   +----------+
+
+After formatting:
+
+Memory Buffer +-V1-+-V2-+----V3----+
+Vector 1      +----+
+Vector 2           +----+
+Vector 3                +----------+
+
+The memory buffer and associated vector need to be passed as a single object,
+but still need to be associated with the parent object so if the object is
+relogged we can replace the current memory buffer with a new memory buffer that
+contains the latest changes.
+
+The reason for keeping the vector around after we've formatted the memory
+buffer is to support splitting vectors across log buffer boundaries correctly.
+If we don't keep the vector around, we do not know where the region boundaries
+are in the item, so we'd need a new encapsulation method for regions in the log
+buffer writing (i.e.
double encapsulation). This would be an on-disk format
+change and as such is not desirable. It also means we'd have to write the log
+region headers in the formatting stage, which is problematic as there is
+per-region state that needs to be placed into the headers during the log write.
+
+Hence we need to keep the vector, but by attaching the memory buffer to it and
+rewriting the vector addresses to point at the memory buffer we end up with a
+self-describing object that can be passed to the log buffer write code to be
+handled in exactly the same manner as the existing log vectors are handled.
+Hence we avoid needing a new on-disk format to handle items that have been
+relogged in memory.
+
+
+Tracking Changes
+
+Now that we can record transactional changes in memory in a form that allows
+them to be used without limitations, we need to be able to track and accumulate
+them so that they can be written to the log at some later point in time. The
+log item is the natural place to store this vector and buffer, and it also
+makes sense for it to be the object that is used to track committed objects as
+it will always exist once the object has been included in a transaction.
+
+The log item is already used to track the log items that have been written to
+the log but not yet written to disk. Such log items are considered "active"
+and as such are stored in the Active Item List (AIL) which is an LSN-ordered
+doubly linked list. Items are inserted into this list during log buffer IO
+completion, after which they are unpinned and can be written to disk. An object
+that is in the AIL can be relogged, which causes the object to be pinned again
+and then moved forward in the AIL when the log buffer IO completes for that
+transaction.
+
+Essentially, this shows that an item that is in the AIL can still be modified
+and relogged, so any tracking must be separate to the AIL infrastructure.
As
+such, we cannot reuse the AIL list pointers for tracking committed items, nor
+can we store state in any field that is protected by the AIL lock. Hence the
+committed item tracking needs its own locks, lists and state fields in the log
+item.
+
+Similar to the AIL, tracking of committed items is done through a new list
+called the Committed Item List (CIL). The list tracks log items that have been
+committed and have formatted memory buffers attached to them. It tracks
+objects in transaction commit order, so when an object is relogged it is
+removed from its place in the list and re-inserted at the tail. This is entirely
+arbitrary and done to make it easy for debugging - the last items in the list
+are the ones that are most recently modified. Ordering of the CIL is not
+necessary for transactional integrity (as discussed in the next section)
+so the ordering is done for convenience/sanity of the developers.
+
+
+Delayed Logging: Checkpoints
+
+When we have a log synchronisation event, commonly known as a "log force",
+all the items in the CIL must be written into the log via the log buffers.
+We need to write these items in the order that they exist in the CIL, and they
+need to be written as an atomic transaction. The need for all the objects to be
+written as an atomic transaction comes from the requirements of relogging and
+log replay - all the changes in all the objects in a given transaction must
+either be completely replayed during log recovery, or not replayed at all. If
+a transaction is not replayed because it is not complete in the log, then
+no later transactions should be replayed, either.
+
+To fulfill this requirement, we need to write the entire CIL in a single log
+transaction. Fortunately, the XFS log code has no fixed limit on the size of a
+transaction, nor does the log replay code. The only fundamental limit is that
+the transaction cannot be larger than just under half the size of the log.
The +reason for this limit is that to find the head and tail of the log, there must +be at least one complete transaction in the log at any given time. If a +transaction is larger than half the log, then there is the possibility that a +crash during the write of such a transaction could partially overwrite the +only complete previous transaction in the log. This will result in a recovery +failure and an inconsistent filesystem and hence we must enforce the maximum +size of a checkpoint to be slightly less than half the log. + +Apart from this size requirement, a checkpoint transaction looks no different +to any other transaction - it contains a transaction header, a series of +formatted log items and a commit record at the tail. From a recovery +perspective, the checkpoint transaction is also no different - just a lot +bigger with a lot more items in it. The worst case effect of this is that we +might need to tune the recovery transaction object hash size. + +Because the checkpoint is just another transaction and all the changes to log +items are stored as log vectors, we can use the existing log buffer writing +code to write the changes into the log. To do this efficiently, we need to +minimise the time we hold the CIL locked while writing the checkpoint +transaction. The current log write code enables us to do this easily with the +way it separates the writing of the transaction contents (the log vectors) from +the transaction commit record, but tracking this requires us to have a +per-checkpoint context that travels through the log write process through to +checkpoint completion. + +Hence a checkpoint has a context that tracks the state of the current +checkpoint from initiation to checkpoint completion. A new context is initiated +at the same time a checkpoint transaction is started. That is, when we remove +all the current items from the CIL during a checkpoint operation, we move all +those changes into the current checkpoint context.
We then initialise a new +context and attach that to the CIL for aggregation of new transactions. + +This allows us to unlock the CIL immediately after transfer of all the +committed items and effectively allow new transactions to be issued while we +are formatting the checkpoint into the log. It also allows concurrent +checkpoints to be written into the log buffers in the case of log force heavy +workloads, just like the existing transaction commit code does. This, however, +requires that we strictly order the commit records in the log so that +checkpoint sequence order is maintained during log replay. + +To ensure that we can write an item into a checkpoint transaction at +the same time another transaction modifies the item and inserts the log item +into the new CIL, the checkpoint transaction commit code cannot use log items +to store the list of log vectors that need to be written into the transaction. +Hence log vectors need to be able to be chained together to allow them to be +detached from the log items. That is, when the CIL is flushed the memory +buffer and log vector attached to each log item need to be attached to the +checkpoint context so that the log item can be released. In diagrammatic form, +the CIL would look like this before the flush: + + CIL Head + | + V + Log Item <-> log vector 1 -> memory buffer + | -> vector array + V + Log Item <-> log vector 2 -> memory buffer + | -> vector array + V + ...... + | + V + Log Item <-> log vector N-1 -> memory buffer + | -> vector array + V + Log Item <-> log vector N -> memory buffer + -> vector array + +And after the flush the CIL head is empty, and the checkpoint context log +vector list would look like: + + Checkpoint Context + | + V + log vector 1 -> memory buffer + | -> vector array + | -> Log Item + V + log vector 2 -> memory buffer + | -> vector array + | -> Log Item + V + ......
+ | + V + log vector N-1 -> memory buffer + | -> vector array + | -> Log Item + V + log vector N -> memory buffer + -> vector array + -> Log Item + +Once this transfer is done, the CIL can be unlocked and new transactions can +start, while the checkpoint flush code works over the log vector chain to +commit the checkpoint. + +Once the checkpoint is written into the log buffers, the checkpoint context is +attached to the log buffer that the commit record was written to along with a +completion callback. Log IO completion will call that callback, which can then +run transaction committed processing for the log items (i.e. insert into AIL +and unpin) in the log vector chain and then free the log vector chain and +checkpoint context. + +Discussion Point: I am uncertain as to whether the log item is the most +efficient way to track vectors, even though it seems like the natural way to do +it. The fact that we walk the log items (in the CIL) just to chain the log +vectors and break the link between the log item and the log vector means that +we take a cache line hit for the log item list modification, then another for +the log vector chaining. If we track by the log vectors, then we only need to +break the link between the log item and the log vector, which means we should +dirty only the log item cachelines. Normally I wouldn't be concerned about one +vs two dirty cachelines except for the fact I've seen upwards of 80,000 log +vectors in one checkpoint transaction. I'd guess this is a "measure and +compare" situation that can be done after a working and reviewed implementation +is in the dev tree.... + +Delayed Logging: Checkpoint Sequencing + +One of the key aspects of the XFS transaction subsystem is that it tags +committed transactions with the log sequence number of the transaction commit. +This allows transactions to be issued asynchronously even though there may be +future operations that cannot be completed until that transaction is fully +committed to the log. 
In the rare case that a dependent operation occurs (e.g. +re-using a freed metadata extent for a data extent), a special, optimised log +force can be issued to force the dependent transaction to disk immediately. + +To do this, transactions need to record the LSN of the commit record of the +transaction. This LSN comes directly from the log buffer the transaction is +written into. While this works just fine for the existing transaction +mechanism, it does not work for delayed logging because transactions are not +written directly into the log buffers. Hence some other method of sequencing +transactions is required. + +As discussed in the checkpoint section, delayed logging uses per-checkpoint +contexts, and as such it is simple to assign a sequence number to each +checkpoint. Because the switching of checkpoint contexts must be done +atomically, it is simple to ensure that each new context has a monotonically +increasing sequence number assigned to it without the need for an external +atomic counter - we can just take the current context sequence number and add +one to it for the new context. + +Then, instead of assigning a log buffer LSN to the transaction commit LSN +during the commit, we can assign the current checkpoint sequence. This allows +operations that track transactions that have not yet completed to know what +checkpoint sequence needs to be committed before they can continue. As a +result, the code that forces the log to a specific LSN now needs to ensure that +the log forces to a specific checkpoint. + +To ensure that we can do this, we need to track all the checkpoint contexts +that are currently committing to the log. When we flush a checkpoint, the +context gets added to a "committing" list which can be searched. When a +checkpoint commit completes, it is removed from the committing list.
Because +the checkpoint context records the LSN of the commit record for the checkpoint, +we can also wait on the log buffer that contains the commit record, thereby +using the existing log force mechanisms to execute synchronous forces. + +It should be noted that the synchronous forces may need to be extended with +mitigation algorithms similar to the current log buffer code to allow +aggregation of multiple synchronous transactions if there are already +synchronous transactions being flushed. Investigation of the performance of the +current design is needed before making any decisions here. + +The main concern with log forces is to ensure that all the previous checkpoints +are also committed to disk before the one we need to wait for. Therefore we +need to check that all the prior contexts in the committing list are also +complete before waiting on the one we need to complete. We do this +synchronisation in the log force code so that we don't need to wait anywhere +else for such serialisation - it only matters when we do a log force. + +The only remaining complexity is that a log force now also has to handle the +case where the forcing sequence number is the same as the current context. That +is, we need to flush the CIL and potentially wait for it to complete. This is a +simple addition to the existing log forcing code to check the sequence numbers +and push if required. Indeed, placing the current sequence checkpoint flush in +the log force code enables the current mechanism for issuing synchronous +transactions to remain untouched (i.e. commit an asynchronous transaction, then +force the log at the LSN of that transaction) and so the higher level code +behaves the same regardless of whether delayed logging is being used or not. + +Delayed Logging: Checkpoint Log Space Accounting + +The big issue for a checkpoint transaction is the log space reservation for the +transaction. 
We don't know how big a checkpoint transaction is going to be +ahead of time, nor how many log buffers it will take to write out, nor how +many split log vector regions are going to be used. We can track the +amount of log space required as we add items to the commit item list, but we +still need to reserve the space in the log for the checkpoint. + +A typical transaction reserves enough space in the log for the worst case space +usage of the transaction. The reservation accounts for log record headers, +transaction and region headers, headers for split regions, buffer tail padding, +etc. as well as the actual space for all the changed metadata in the +transaction. While some of this is fixed overhead, much of it is dependent on +the size of the transaction and the number of regions being logged (the number +of log vectors in the transaction). + +An example of the differences would be logging directory changes versus logging +inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then +there are lots of transactions that only contain an inode core and an inode log +format structure. That is, two vectors totalling roughly 150 bytes. If we +modify 10,000 inodes, we have about 1.5MB of metadata to write in 20,000 +vectors. Each vector is 12 bytes, so the total to be logged is approximately +1.75MB. In comparison, if we are logging full directory buffers, they are +typically 4KB each, so in 1.5MB of directory buffers we'd have roughly 400 +buffers and a buffer format structure for each buffer - roughly 800 vectors or +1.51MB total space. From this, it should be obvious that a static log space +reservation is not particularly flexible and is difficult to select the +"optimal value" for all workloads. + +Further, if we are going to use a static reservation, which bit of the entire +reservation does it cover?
We account for space used by the transaction +reservation by tracking the space currently used by the object in the CIL and +then calculating the increase or decrease in space used as the object is +relogged. This allows for a checkpoint reservation to only have to account for +log buffer metadata used such as log header records. + +However, even using a static reservation for just the log metadata is +problematic. Typically log record headers use at least 16KB of log space per +1MB of log space consumed (512 bytes per 32k) and the reservation needs to be +large enough to handle arbitrary sized checkpoint transactions. This +reservation needs to be made before the checkpoint is started, and we need to +be able to reserve the space without sleeping. For an 8MB checkpoint, we need a +reservation of around 150KB, which is a non-trivial amount of space. + +A static reservation needs to manipulate the log grant counters - we can take a +permanent reservation on the space, but we still need to make sure we refresh +the write reservation (the actual space available to the transaction) after +every checkpoint transaction completion. Unfortunately, if this space is not +available when required, then the regrant code will sleep waiting for it. + +The problem with this is that it can lead to deadlocks as we may need to commit +checkpoints to be able to free up log space (refer back to the description of +rolling transactions for an example of this). Hence we *must* always have +space available in the log if we are to use static reservations, and that is +very difficult and complex to arrange. It is possible to do, but there is a +simpler way. + +The simpler way of doing this is tracking the entire log space used by the +items in the CIL and using this to dynamically calculate the amount of log +space required by the log metadata.
If this log metadata space changes as a +result of a transaction commit inserting a new memory buffer into the CIL, then +the difference in space required is removed from the transaction that causes +the change. Transactions at this level will *always* have enough space +available in their reservation for this as they have already reserved the +maximal amount of log metadata space they require, and such a delta reservation +will always be less than or equal to the maximal amount in the reservation. + +Hence we can grow the checkpoint transaction reservation dynamically as items +are added to the CIL and avoid the need for reserving and regranting log space +up front. This avoids deadlocks and removes a blocking point from the +checkpoint flush code. + +As mentioned earlier, transactions can't grow to more than half the size of the +log. Hence as part of the reservation growing, we need to also check the size +of the reservation against the maximum allowed transaction size. If we reach +the maximum threshold, we need to push the CIL to the log. This is effectively +a "background flush" and is done on demand. This is identical to +a CIL push triggered by a log force, only that there is no waiting for the +checkpoint commit to complete. This background push is checked and executed by +transaction commit code. + +If the transaction subsystem goes idle while we still have items in the CIL, +they will be flushed by the periodic log force issued by the xfssyncd. This log +force will push the CIL to disk, and if the transaction subsystem stays idle, +allow the idle log to be covered (effectively marked clean) in exactly the same +manner that is done for the existing logging method. A discussion point is +whether this log force needs to be done more frequently than the current rate +which is once every 30s. + + +Delayed Logging: Log Item Pinning + +Currently log items are pinned during transaction commit while the items are +still locked.
This happens just after the items are formatted, though it could +be done any time before the items are unlocked. The result of this mechanism is +that items get pinned once for every transaction that is committed to the log +buffers. Hence items that are relogged in the log buffers will have a pin count +for every outstanding transaction they were dirtied in. When each of these +transactions is completed, they will unpin the item once. As a result, the item +only becomes unpinned when all the transactions complete and there are no +pending transactions. Thus the pinning and unpinning of a log item is symmetric +as there is a 1:1 relationship with transaction commit and log item completion. + +For delayed logging, however, we have an asymmetric transaction commit to +completion relationship. Every time an object is relogged in the CIL it goes +through the commit process without a corresponding completion being registered. +That is, we now have a many-to-one relationship between transaction commit and +log item completion. The result of this is that pinning and unpinning of the +log items becomes unbalanced if we retain the "pin on transaction commit, unpin +on transaction completion" model. + +To keep pin/unpin symmetry, the algorithm needs to change to a "pin on +insertion into the CIL, unpin on checkpoint completion". In other words, the +pinning and unpinning becomes symmetric around a checkpoint context. We have to +pin the object the first time it is inserted into the CIL - if it is already in +the CIL during a transaction commit, then we do not pin it again. Because there +can be multiple outstanding checkpoint contexts, we can still see elevated pin +counts, but as each checkpoint completes the pin count will retain the correct +value according to its context. + +Just to make matters slightly more complex, this checkpoint level context +for the pin count means that the pinning of an item must take place under the +CIL commit/flush lock.
If we pin the object outside this lock, we cannot +guarantee which context the pin count is associated with. This is because of +the fact that pinning the item is dependent on whether the item is present in the +current CIL or not. If we don't lock the CIL first before we check and pin the +object, we have a race with the CIL being flushed between the check and the pin +(or not pinning, as the case may be). Hence we must hold the CIL flush/commit +lock to guarantee that we pin the items correctly. + +Delayed Logging: Concurrent Scalability + +A fundamental requirement for the CIL is that accesses through transaction +commits must scale to many concurrent commits. The current transaction commit +code does not break down even when there are transactions coming from 2048 +processors at once. The current transaction code does not go any faster than if +there was only one CPU using it, but it does not slow down either. + +As a result, the delayed logging transaction commit code needs to be designed +for concurrency from the ground up. It is obvious that there are serialisation +points in the design - the three important ones are: + + 1. Locking out new transaction commits while flushing the CIL + 2. Adding items to the CIL and updating item space accounting + 3. Checkpoint commit ordering + +Looking at the transaction commit and CIL flushing interactions, it is clear +that we have a many-to-one interaction here. That is, the only restriction on +the number of concurrent transactions that can be trying to commit at once is +the amount of space available in the log for their reservations. The practical +limit here is in the order of several hundred concurrent transactions for a +128MB log, which means that it is generally one per CPU in a machine.
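Returning briefly to the pin/unpin rule from the Log Item Pinning section, the
"pin on insertion into the CIL, unpin on checkpoint completion" model can be
sketched in user space as below. This is only an illustration under stated
assumptions: struct demo_log_item and the demo_*() helpers are hypothetical
names, not the kernel's log item or its API, and the CIL commit/flush lock
that must be held around the check-and-pin is assumed rather than modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a log item; not the kernel structure. */
struct demo_log_item {
	int  pin_count;	/* outstanding pins, one per checkpoint context */
	bool in_cil;	/* is the item already in the current CIL? */
};

/*
 * Transaction commit: pin only the first time the item enters the CIL.
 * Assumption: the caller holds the CIL commit/flush lock, so the check
 * and the pin cannot race with a CIL flush.
 */
static void demo_commit_item(struct demo_log_item *item)
{
	if (!item->in_cil) {
		item->pin_count++;
		item->in_cil = true;
	}
	/* A relog of an item already in the CIL takes no extra pin. */
}

/* CIL push: the item leaves the CIL and belongs to a checkpoint context. */
static void demo_cil_push_item(struct demo_log_item *item)
{
	item->in_cil = false;
}

/* Checkpoint completion: drop the single pin taken for that context. */
static void demo_checkpoint_complete_item(struct demo_log_item *item)
{
	assert(item->pin_count > 0);
	item->pin_count--;
}
```

With this model, relogging an item three times before a push leaves a pin count
of one. Relogging it once more after the push, while the first checkpoint is
still outstanding, raises the count to two, and each checkpoint completion then
drops exactly one pin.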
+ +The amount of time a transaction commit needs to hold out a flush is a +relatively long period of time - the pinning of log items needs to be done +while we are holding out a CIL flush, so at the moment that means it is held +across the formatting of the objects into memory buffers (i.e. while memcpy()s +are in progress). Ultimately a two pass algorithm where the formatting is done +separately to the pinning of objects could be used to reduce the hold time of +the transaction commit side. + +Because of the number of potential transaction commit side holders, the lock +really needs to be a sleeping lock - if the CIL flush takes the lock, we do not +want every other CPU in the machine spinning on the CIL lock. Given that +flushing the CIL could involve walking a list of tens of thousands of log +items, it will get held for a significant time and so spin contention is a +significant concern. Preventing lots of CPUs spinning doing nothing is the +main reason for choosing a sleeping lock even though nothing in either the +transaction commit or CIL flush side sleeps with the lock held. + +It should also be noted that CIL flushing is also a relatively rare operation +compared to transaction commit for asynchronous transaction workloads - only +time will tell if using a read-write semaphore for exclusion will limit +transaction commit concurrency due to cache line bouncing of the lock on the +read side. + +The second serialisation point is on the transaction commit side where items +are inserted into the CIL. Because transactions can enter this code +concurrently, the CIL needs to be protected separately from the above +commit/flush exclusion. It also needs to be an exclusive lock but it is only +held for a very short time and so a spin lock is appropriate here. It is +possible that this lock will become a contention point, but given the short +hold time once per transaction I think that contention is unlikely. 
+ +The final serialisation point is the checkpoint commit record ordering code +that is run as part of the checkpoint commit and log force sequencing. The code +path that triggers a CIL flush (i.e. whatever triggers the log force) will enter +an ordering loop after writing all the log vectors into the log buffers but +before writing the commit record. This loop walks the list of committing +checkpoints and needs to block waiting for checkpoints to complete their commit +record write. As a result it needs a lock and a wait variable. Log force +sequencing also requires the same lock, list walk, and blocking mechanism to +ensure completion of checkpoints. + +These two sequencing operations can use the same mechanism even though the +events they are waiting for are different. The checkpoint commit record +sequencing needs to wait until checkpoint contexts contain a commit LSN +(obtained through completion of a commit record write) while log force +sequencing needs to wait until previous checkpoint contexts are removed from +the committing list (i.e. they've completed). A simple wait variable and +broadcast wakeups (thundering herds) have been used to implement these two +serialisation queues. They use the same lock as the CIL, too. If we see too +much contention on the CIL lock, or too many context switches as a result of +the broadcast wakeups, these operations can be put under a new spinlock and +given separate wait lists to reduce lock contention and the number of processes +woken by the wrong event. + + +Lifecycle Changes + +The existing log item life cycle is as follows: + + 1. Transaction allocate + 2. Transaction reserve + 3. Lock item + 4. Join item to transaction + If not already attached, + Allocate log item + Attach log item to owner item + Attach log item to transaction + 5. Modify item + Record modifications in log item + 6.
Transaction commit + Pin item in memory + Format item into log buffer + Write commit LSN into transaction + Unlock item + Attach transaction to log buffer + + + + + 7. Transaction completion + Mark log item committed + Insert log item into AIL + Write commit LSN into log item + Unpin log item + 8. AIL traversal + Lock item + Mark log item clean + Flush item to disk + + + + 9. Log item removed from AIL + Moves log tail + Item unlocked + +Essentially, steps 1-6 operate independently from step 7, which is also +independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9 +at the same time step 7 is occurring, but steps 1-6 and steps 8-9 cannot occur +at the same time as each other. If the log item is in the AIL or between steps 6 and 7 +and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9 +are entered and completed is the object considered clean. + +With delayed logging, there are new steps inserted into the life cycle: + + 1. Transaction allocate + 2. Transaction reserve + 3. Lock item + 4. Join item to transaction + If not already attached, + Allocate log item + Attach log item to owner item + Attach log item to transaction + 5. Modify item + Record modifications in log item + 6. Transaction commit + Pin item in memory if not pinned in CIL + Format item into log vector + buffer + Attach log vector and buffer to log item + Insert log item into CIL + Write CIL context sequence into transaction + Unlock item + + + + 7. CIL push + lock CIL flush + Chain log vectors and buffers together + Remove items from CIL + unlock CIL flush + write log vectors into log + sequence commit records + attach checkpoint context to log buffer + + + + + 8. Checkpoint completion + Mark log item committed + Insert item into AIL + Write commit LSN into log item + Unpin log item + 9. AIL traversal + Lock item + Mark log item clean + Flush item to disk + + 10.
Log item removed from AIL + Moves log tail + Item unlocked + +From this, it can be seen that the only life cycle differences between the two +logging methods are in the middle of the life cycle - they still have the same +beginning and end and execution constraints. The only differences are in the +committing of the log items to the log itself and the completion processing. +Hence delayed logging should not introduce any constraints on log item +behaviour, allocation or freeing that don't already exist. + +As a result of this zero-impact "insertion" of delayed logging infrastructure +and the design of the internal structures to avoid on disk format changes, we +can basically switch between delayed logging and the existing mechanism with a +mount option. Fundamentally, there is no reason why the log manager would not +be able to swap methods automatically and transparently depending on load +characteristics, but this should not be necessary if delayed logging works as +designed. + +Roadmap: + +2.6.35 Inclusion in mainline as an experimental mount option + => approximately 2-3 months to merge window + => needs to be in xfs-dev tree in 4-6 weeks + => code is nearing readiness for review + +2.6.37 Remove experimental tag from mount option + => should be roughly 6 months after initial merge + => enough time to: + => gain confidence and fix problems reported by early + adopters (a.k.a.
guinea pigs) + => address worst performance regressions and undesired + behaviours + => start tuning/optimising code for parallelism + => start tuning/optimising algorithms consuming + excessive CPU time + +2.6.39 Switch default mount option to use delayed logging + => should be roughly 12 months after initial merge + => enough time to shake out remaining problems before next round of + enterprise distro kernel rebases + + + -- 1.5.6.5 From SRS0+QjcZ+65+fromorbit.com=dave@internode.on.net Wed May 5 20:44:07 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=unavailable version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o461i65F065055 for ; Wed, 5 May 2010 20:44:07 -0500 X-ASG-Debug-ID: 1273110374-137702fb0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 432203139C0 for ; Wed, 5 May 2010 18:46:15 -0700 (PDT) Received: from mail.internode.on.net (bld-mail18.adl2.internode.on.net [150.101.137.103]) by cuda.sgi.com with ESMTP id Uv8a7wHSHAtYPdSV for ; Wed, 05 May 2010 18:46:15 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23180513-1927428 for ; Thu, 06 May 2010 11:16:14 +0930 (CST) Received: from [192.168.1.9] (helo=disturbed) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1O9qAG-0005Do-QL for xfs@oss.sgi.com; Thu, 06 May 2010 11:46:12 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1O9qA5-0000cf-Tx for xfs@oss.sgi.com; Thu, 06 May 2010 11:46:01 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 10/11] xfs: enable background pushing of the CIL Subject: [PATCH 10/11] xfs: enable background pushing of the CIL 
Date: Thu, 6 May 2010 11:45:50 +1000 Message-Id: <1273110351-2333-11-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273110351-2333-1-git-send-email-david@fromorbit.com> References: <1273110351-2333-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: bld-mail18.adl2.internode.on.net[150.101.137.103] X-Barracuda-Start-Time: 1273110376 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29166 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner If we let the CIL grow without bound, it will grow large enough to violate recovery constraints (must be at least one complete transaction in the log at all times) or take forever to write out through the log buffers. Hence we need a check during asynchronous transactions as to whether the CIL needs to be pushed. We track the amount of log space the CIL consumes, so it is relatively simple to limit it on a pure size basis. Make the limit the smaller of a quarter of the log size (well inside the recovery constraint of just under half the log) or 8MB of log space (which is an awful lot of metadata).
Signed-off-by: Dave Chinner --- fs/xfs/xfs_log_cil.c | 25 ++++++++++++++++++++++++- fs/xfs/xfs_log_priv.h | 45 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 69 insertions(+), 1 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 3cb1957..806cf6b 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -339,6 +339,7 @@ xfs_log_commit_cil( { struct log *log = mp->m_log; int log_flags = 0; + int push = 0; if (flags & XFS_TRANS_RELEASE_LOG_RES) log_flags = XFS_LOG_REL_PERM_RESERV; @@ -368,13 +369,26 @@ xfs_log_commit_cil( xfs_log_done(mp, tp->t_ticket, NULL, log_flags); xfs_trans_unreserve_and_mod_sb(tp); - /* background commit is allowed again */ + /* check for background commit */ + if (log->l_cilp->xc_ctx->space_used > XLOG_CIL_SPACE_LIMIT(log)) + push = 1; + up_read(&log->l_cilp->xc_ctx_lock); current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS); /* xfs_trans_free_items() unlocks them first */ xfs_trans_free_items(tp, *commit_lsn, 0); xfs_trans_free(tp); + + /* + * We need to push CIL every so often so we don't cache more than we + * can fit in the log. The limit really is that a checkpoint can't be + * more than half the log (the current checkpoint is not allowed to + * overwrite the previous checkpoint), but commit latency and memory + * usage limit this to a smaller size in most cases. + */ + if (push) + xlog_cil_push(log, 0); return 0; } @@ -453,6 +467,15 @@ xlog_cil_push( return 0; } + /* check for spurious background flush */ + if (!push_now && + log->l_cilp->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) { + up_write(&cil->xc_ctx_lock); + xfs_log_ticket_put(new_ctx->ticket); + kmem_free(new_ctx); + return 0; + } + /* * pull all the log vectors off the items in the CIL, and * remove the items from the CIL. 
We don't need the CIL lock diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index e9e8324..7490277 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -427,6 +427,51 @@ struct xfs_cil { }; /* + * The amount of log space we should allow the CIL to aggregate is difficult to size. + * Whatever we choose, we have to make sure that we can get a reservation for the log space + * effectively, that it is large enough to capture sufficient relogging to + * reduce log buffer IO significantly, and that it is not too large for the log and does not + * induce too much latency when writing out through the iclogs. We track both + * space consumed and the number of vectors in the checkpoint context, so we + * need to decide which to use for limiting. + * + * Every log buffer we write out during a push needs a header reserved, which + * is at least one sector and more for v2 logs. Hence we need a reservation of + * at least 512 bytes per 32k of log space just for the LR headers. That means + * 16KB of reservation per megabyte of delayed logging space we will consume, + * plus various headers. The number of headers will vary based on the number of + * io vectors, so limiting on a specific number of vectors is going to result + * in transactions of varying size. IOWs, it is more consistent to track and + * limit space consumed in the log rather than by the number of objects being + * logged in order to prevent checkpoint ticket overruns. + * + * Further, use of static reservations through the log grant mechanism is + * problematic. It introduces a lot of complexity (e.g. reserve grant vs write + * grant) and a significant deadlock potential because regranting write space + * can block on log pushes. Hence if we have to regrant log space during a log + * push, we can deadlock.
+ * + * However, we can avoid this by use of a dynamic "reservation stealing" + * technique during transaction commit whereby unused reservation space in the + * transaction ticket is transferred to the CIL ctx commit ticket to cover the + * space needed by the checkpoint transaction. This means that we never need to + * specifically reserve space for the CIL checkpoint transaction, nor do we + * need to regrant space once the checkpoint completes. This also means the + * checkpoint transaction ticket is specific to the checkpoint context, rather + * than the CIL itself. + * + * With dynamic reservations, we can basically make up arbitrary limits for the + * checkpoint size so long as they don't violate any other size rules. Hence + * the initial maximum size for the checkpoint transaction will be set to a + * quarter of the log or 8MB, which ever is smaller. 8MB is an arbitrary limit + * right now based on the latency of writing out a large amount of data through + * the circular iclog buffers. + */ + +#define XLOG_CIL_SPACE_LIMIT(log) \ + (min((log->l_logsize >> 2), (8 * 1024 * 1024))) + +/* * The reservation head lsn is not made up of a cycle number and block number. * Instead, it uses a cycle number and byte number. 
Logs don't expect to * overflow 31 bits worth of byte offset, so using a byte number will mean -- 1.5.6.5 From SRS0+wf72+65+fromorbit.com=dave@internode.on.net Wed May 5 20:44:09 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-4.9 required=5.0 tests=BAYES_00,LOCAL_GNU_PATCH autolearn=unavailable version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o461i91r065079 for ; Wed, 5 May 2010 20:44:09 -0500 X-ASG-Debug-ID: 1273110376-179202780000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 8A9543139C1 for ; Wed, 5 May 2010 18:46:17 -0700 (PDT) Received: from mail.internode.on.net (bld-mail12.adl6.internode.on.net [150.101.137.97]) by cuda.sgi.com with ESMTP id Sn9SA8AFJrF7b0Ci for ; Wed, 05 May 2010 18:46:17 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23264927-1927428 for ; Thu, 06 May 2010 11:16:16 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1O9qA6-0005Dp-Uf for xfs@oss.sgi.com; Thu, 06 May 2010 11:46:02 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1O9qA5-0000ch-Ve for xfs@oss.sgi.com; Thu, 06 May 2010 11:46:01 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 11/11] xfs: Ensure inode allocation buffers are fully replayed Subject: [PATCH 11/11] xfs: Ensure inode allocation buffers are fully replayed Date: Thu, 6 May 2010 11:45:51 +1000 Message-Id: <1273110351-2333-12-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273110351-2333-1-git-send-email-david@fromorbit.com> References: <1273110351-2333-1-git-send-email-david@fromorbit.com> MIME-Version: 1.0 
Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Barracuda-Connect: bld-mail12.adl6.internode.on.net[150.101.137.97] X-Barracuda-Start-Time: 1273110378 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29166 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner With delayed logging, we can get inode allocation buffers in the same transaction as inode unlink buffers. We don't currently mark inode allocation buffers in the log, so inode unlink buffers take precedence over allocation buffers. The result is that when they are combined into the same checkpoint, only the unlinked inode chain fields are replayed, resulting in uninitialised inode buffers being detected when the next inode modification is replayed. To fix this, we need to ensure that we do not set the inode buffer flag in the buffer log item format flags if the inode allocation has not already hit the log. To avoid requiring a change to log recovery, we really need to make this a modification that relies only on in-memory state. We can do this by checking during buffer log formatting (while the CIL cannot be flushed) if we are still in the same sequence when we commit the unlink transaction as the inode allocation transaction. If we are, then we do not add the inode buffer flag to the buffer log format item flags. This means the entire buffer will be replayed, not just the unlinked fields.
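As a rough illustration of the decision described above (a minimal sketch, not the patch itself): the struct, the flag values and the helper names below are hypothetical stand-ins, and the "first commit sequence" field is assumed to be zero until the item has been committed once. The real change also checks whether the item is on the CIL list, which is omitted here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values, chosen for illustration only. */
#define BLI_INODE_ALLOC_BUF 0x10u /* in-memory: buffer came from inode allocation */
#define BLI_INODE_BUF       0x40u /* in-memory: buffer holds on-disk inodes */
#define BLF_INODE_BUF       0x01u /* log format: replay only the unlinked fields */

/* Hypothetical, simplified model of a buffer log item. */
struct buf_item {
	uint32_t mem_flags; /* in-memory state, never written to the log */
	uint32_t fmt_flags; /* flags that are formatted into the log */
	uint64_t first_seq; /* checkpoint sequence of the first commit, 0 if none */
};

/* True if the item was first committed in the checkpoint still being built. */
static bool in_current_checkpoint(const struct buf_item *bip, uint64_t cur_seq)
{
	return bip->first_seq != 0 && bip->first_seq == cur_seq;
}

/*
 * Transfer the inode-buffer state to the format flags unless the inode
 * allocation that initialised the buffer is still in the current
 * checkpoint. In that case the whole buffer has to be replayed, so the
 * format flag is left clear. The in-memory flag is always consumed.
 */
static void format_inode_buf_flag(struct buf_item *bip, uint64_t cur_seq)
{
	if (!(bip->mem_flags & BLI_INODE_BUF))
		return;

	bool alloc_pending = (bip->mem_flags & BLI_INODE_ALLOC_BUF) &&
			     in_current_checkpoint(bip, cur_seq);

	if (!alloc_pending)
		bip->fmt_flags |= BLF_INODE_BUF;
	bip->mem_flags &= ~BLI_INODE_BUF;
}
```

Under these assumptions, an unlink that lands in the same checkpoint as the allocation leaves the format flag clear, while an unlink in any later checkpoint sets it as before.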
We do this while CIL flushes are locked out to ensure that we don't race with the sequence numbers changing and hence fail to put the inode buffer flag in the buffer format flags when we really need to. Also, move an assert in the buffer release path outside the hash spinlock so that if the assert is hit the system continues to run in a debuggable state. Signed-off-by: Dave Chinner --- fs/xfs/linux-2.6/xfs_buf.c | 2 +- fs/xfs/xfs_buf_item.c | 14 ++++++++++++ fs/xfs/xfs_buf_item.h | 4 ++- fs/xfs/xfs_log.h | 1 + fs/xfs/xfs_log_cil.c | 48 ++++++++++++++++++++++++++++++++++++++++++- fs/xfs/xfs_trans.c | 16 +++++++++++-- fs/xfs/xfs_trans.h | 1 + fs/xfs/xfs_trans_buf.c | 20 +++++++++--------- 8 files changed, 89 insertions(+), 17 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c index 82678bf..e085eca 100644 --- a/fs/xfs/linux-2.6/xfs_buf.c +++ b/fs/xfs/linux-2.6/xfs_buf.c @@ -800,9 +800,9 @@ xfs_buf_rele( } else if (bp->b_flags & XBF_FS_MANAGED) { spin_unlock(&hash->bh_lock); } else { - ASSERT(!(bp->b_flags & (XBF_DELWRI|_XBF_DELWRI_Q))); list_del_init(&bp->b_hash_list); spin_unlock(&hash->bh_lock); + ASSERT(!(bp->b_flags & (XBF_DELWRI|_XBF_DELWRI_Q))); xfs_buf_free(bp); } } diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index bcbb661..02a8098 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -254,6 +254,20 @@ xfs_buf_item_format( vecp++; nvecs = 1; + /* + * If it is an inode buffer, transfer the in-memory state to the + * format flags and clear the in-memory state. We do not transfer + * this state if the inode buffer allocation has not yet been committed + * to the log as setting the XFS_BLI_INODE_BUF flag will prevent + * correct replay of the inode allocation.
+ */ + if (bip->bli_flags & XFS_BLI_INODE_BUF) { + if (!((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) && + xfs_log_item_in_current_chkpt(&bip->bli_item))) + bip->bli_format.blf_flags |= XFS_BLF_INODE_BUF; + bip->bli_flags &= ~XFS_BLI_INODE_BUF; + } + if (bip->bli_flags & XFS_BLI_STALE) { /* * The buffer is stale, so all we need to log diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h index 8cbb82b..f20bb47 100644 --- a/fs/xfs/xfs_buf_item.h +++ b/fs/xfs/xfs_buf_item.h @@ -69,6 +69,7 @@ typedef struct xfs_buf_log_format { #define XFS_BLI_LOGGED 0x08 #define XFS_BLI_INODE_ALLOC_BUF 0x10 #define XFS_BLI_STALE_INODE 0x20 +#define XFS_BLI_INODE_BUF 0x40 #define XFS_BLI_FLAGS \ { XFS_BLI_HOLD, "HOLD" }, \ @@ -76,7 +77,8 @@ typedef struct xfs_buf_log_format { { XFS_BLI_STALE, "STALE" }, \ { XFS_BLI_LOGGED, "LOGGED" }, \ { XFS_BLI_INODE_ALLOC_BUF, "INODE_ALLOC" }, \ - { XFS_BLI_STALE_INODE, "STALE_INODE" } + { XFS_BLI_STALE_INODE, "STALE_INODE" }, \ + { XFS_BLI_INODE_BUF, "INODE_BUF" } #ifdef __KERNEL__ diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index 1764f11..ff4c443 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -197,6 +197,7 @@ void xfs_log_ticket_put(struct xlog_ticket *ticket); int xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp, struct xfs_log_vec *log_vector, xfs_lsn_t *commit_lsn, int flags); +bool xfs_log_item_in_current_chkpt(struct xfs_log_item *lip); #endif diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 806cf6b..f6733eb 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -201,6 +201,15 @@ xlog_cil_insert( ctx->nvecs += diff_iovecs; /* + * If this is the first time the item is being committed to the CIL, + * store the sequence number on the log item so we can tell + * in future commits whether this is the first checkpoint the item is + * being committed into. 
+ */ + if (!item->li_seq) + item->li_seq = ctx->sequence; + + /* * Now transfer enough transaction reservation to the context ticket * for the checkpoint. The context ticket is special - the unit * reservation has to grow as well as the current reservation as we @@ -328,6 +337,10 @@ xlog_cil_free_logvec( * For more specific information about the order of operations in * xfs_log_commit_cil() please refer to the comments in * xfs_trans_commit_iclog(). + * + * Called with the context lock already held in read mode to lock out + * background commit, returns without it held once background commits are + * allowed again. */ int xfs_log_commit_cil( @@ -346,11 +359,10 @@ xfs_log_commit_cil( if (XLOG_FORCED_SHUTDOWN(log)) { xlog_cil_free_logvec(log_vector); + up_read(&log->l_cilp->xc_ctx_lock); return XFS_ERROR(EIO); } - /* lock out background commit */ - down_read(&log->l_cilp->xc_ctx_lock); xlog_cil_format_items(log, log_vector, tp->t_ticket, commit_lsn); /* check we didn't blow the reservation */ @@ -687,3 +699,35 @@ restart: return commit_lsn; } +/* + * Check if the current log item was first committed in this sequence. + * We can't rely on just the log item being in the CIL, we have to check + * the recorded commit sequence number. + * + * Note: for this to be used in a non-racy manner, it has to be called with + * CIL flushing locked out. As a result, it should only be used during the + * transaction commit process when deciding what to format into the item. + */ +bool +xfs_log_item_in_current_chkpt( + struct xfs_log_item *lip) +{ + struct xfs_cil_ctx *ctx; + + if (!(lip->li_mountp->m_flags & XFS_MOUNT_DELAYLOG)) + return false; + if (list_empty(&lip->li_cil)) + return false; + + ctx = lip->li_mountp->m_log->l_cilp->xc_ctx; + + /* + * li_seq is written on the first commit of a log item to record the + * first checkpoint it is written to. Hence if it is different to the + * current sequence, we're in a new checkpoint. 
+ */ + if (XFS_LSN_CMP(lip->li_seq, ctx->sequence) != 0) + return false; + return true; +} + diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 9bdb492..3e88c3f 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -45,6 +45,7 @@ #include "xfs_trans_space.h" #include "xfs_inode_item.h" #include "xfs_trace.h" +#include "xfs_log_priv.h" kmem_zone_t *xfs_trans_zone; @@ -1261,10 +1262,19 @@ xfs_trans_commit_cil( return ENOMEM; /* - * Fill in the log_vector and pin the logged items, and - * then write the transaction to the log. We have to lock - * out CIL flushes from this point as we are going to pin + * Now we need to fill in the log_vector and pin the logged items, and + * then write the transaction to the log. + * + * Important: We have to lock out CIL flushes from this point as + * transferring state from the in memory log items to the log item + * headers during formatting may require atomicity against log writes + * to ensure that state is transferred to the log without racing + * against flushes. + * + * xfs_log_commit_cil() will release the lock as part of the commit + * process. 
*/ + down_read(&mp->m_log->l_cilp->xc_ctx_lock); xfs_trans_fill_log_vecs(tp, log_vector); return xfs_log_commit_cil(mp, tp, log_vector, commit_lsn, flags); diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h index b1ea20c..8c69e78 100644 --- a/fs/xfs/xfs_trans.h +++ b/fs/xfs/xfs_trans.h @@ -835,6 +835,7 @@ typedef struct xfs_log_item { /* delayed logging */ struct list_head li_cil; /* CIL pointers */ struct xfs_log_vec *li_lv; /* active log vector */ + xfs_lsn_t li_seq; /* CIL commit seq */ } xfs_log_item_t; #define XFS_LI_IN_AIL 0x1 diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c index 3390c3e..63d81a2 100644 --- a/fs/xfs/xfs_trans_buf.c +++ b/fs/xfs/xfs_trans_buf.c @@ -792,7 +792,7 @@ xfs_trans_binval( XFS_BUF_UNDELAYWRITE(bp); XFS_BUF_STALE(bp); bip->bli_flags |= XFS_BLI_STALE; - bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_DIRTY); + bip->bli_flags &= ~(XFS_BLI_INODE_BUF | XFS_BLI_LOGGED | XFS_BLI_DIRTY); bip->bli_format.blf_flags &= ~XFS_BLF_INODE_BUF; bip->bli_format.blf_flags |= XFS_BLF_CANCEL; memset((char *)(bip->bli_format.blf_data_map), 0, @@ -802,16 +802,16 @@ xfs_trans_binval( } /* - * This call is used to indicate that the buffer contains on-disk - * inodes which must be handled specially during recovery. They - * require special handling because only the di_next_unlinked from - * the inodes in the buffer should be recovered. The rest of the - * data in the buffer is logged via the inodes themselves. + * This call is used to indicate that the buffer contains on-disk inodes which + * must be handled specially during recovery. They require special handling + * because only the di_next_unlinked from the inodes in the buffer should be + * recovered. The rest of the data in the buffer is logged via the inodes + * themselves. * - * All we do is set the XFS_BLI_INODE_BUF flag in the buffer's log - * format structure so that we'll know what to do at recovery time. 
+ * All we do is set the XFS_BLI_INODE_BUF flag in the items flags so it can be + * transferred to the buffer's log format structure so that we'll know what to + * do at recovery time. */ -/* ARGSUSED */ void xfs_trans_inode_buf( xfs_trans_t *tp, @@ -826,7 +826,7 @@ xfs_trans_inode_buf( bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); ASSERT(atomic_read(&bip->bli_refcount) > 0); - bip->bli_format.blf_flags |= XFS_BLF_INODE_BUF; + bip->bli_flags |= XFS_BLI_INODE_BUF; } /* -- 1.5.6.5 From SRS0+61RJ+65+fromorbit.com=dave@internode.on.net Wed May 5 20:44:07 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o461i6uO065054 for ; Wed, 5 May 2010 20:44:07 -0500 X-ASG-Debug-ID: 1273110374-137502c80000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 1E0C83139BD for ; Wed, 5 May 2010 18:46:14 -0700 (PDT) Received: from mail.internode.on.net (bld-mail13.adl6.internode.on.net [150.101.137.98]) by cuda.sgi.com with ESMTP id UEZVoSixDaFVed0Z for ; Wed, 05 May 2010 18:46:14 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23378077-1927428 for ; Thu, 06 May 2010 11:16:13 +0930 (CST) Received: from [192.168.1.9] (helo=disturbed) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1O9qA6-0005DO-AD for xfs@oss.sgi.com; Thu, 06 May 2010 11:46:02 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1O9q9v-0000cN-Cr for xfs@oss.sgi.com; Thu, 06 May 2010 11:45:51 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 04/11] xfs: modify buffer item reference counting V2 Subject: [PATCH 04/11] 
xfs: modify buffer item reference counting V2 Date: Thu, 6 May 2010 11:45:44 +1000 Message-Id: <1273110351-2333-5-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273110351-2333-1-git-send-email-david@fromorbit.com> References: <1273110351-2333-1-git-send-email-david@fromorbit.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Barracuda-Connect: bld-mail13.adl6.internode.on.net[150.101.137.98] X-Barracuda-Start-Time: 1273110376 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29166 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner The buffer log item reference counts used to take references for every transaction, similar to the pin counting. This is symmetric (like the pin/unpin) with respect to transaction completion, but with delayed logging becomes asymmetric as the pinning becomes asymmetric w.r.t. transaction completion. To make both cases the same, allow the buffer pinning to take a reference to the buffer log item and always drop the reference the transaction has on it when being unlocked. This is balanced correctly because the unpin operation always drops a reference to the log item. Hence reference counting becomes symmetric w.r.t. item pinning as well as w.r.t. active transactions and as a result the reference counting model remains consistent between normal and delayed logging.
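The balance described above can be sketched with a toy model. This is an illustration only: the struct and function names are hypothetical, plain ints stand in for the kernel's atomic counters, and freeing of the item is not modelled. Each transaction that joins the item takes one reference and drops it at unlock; each pin takes its own reference and the matching unpin drops it.

```c
/* Hypothetical toy model of a buffer log item's counters. */
struct bli_model {
	int refcount; /* references held by transactions and by pins */
	int pincount; /* number of outstanding pins */
};

/* A transaction joining the item takes one reference. */
static void trans_join(struct bli_model *b)
{
	b->refcount++;
}

/* Pinning takes its own reference, so the pin keeps the item alive. */
static void item_pin(struct bli_model *b)
{
	b->refcount++;
	b->pincount++;
}

/* Unlock always drops the transaction's reference, logged or not. */
static void trans_unlock(struct bli_model *b)
{
	b->refcount--;
}

/* Unpin always drops the reference that the pin took. */
static void item_unpin(struct bli_model *b)
{
	b->pincount--;
	b->refcount--;
}
```

In this model a second transaction can join and unlock the item while it stays pinned from an earlier commit, which is the delayed logging case, and the count still returns to zero once the single unpin runs.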
Signed-off-by: Dave Chinner --- fs/xfs/xfs_buf_item.c | 110 ++++++++++++++++++++++-------------------------- 1 files changed, 50 insertions(+), 60 deletions(-) diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index 240340a..4cd5f61 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -341,10 +341,15 @@ xfs_buf_item_format( } /* - * This is called to pin the buffer associated with the buf log - * item in memory so it cannot be written out. Simply call bpin() - * on the buffer to do this. + * This is called to pin the buffer associated with the buf log item in memory + * so it cannot be written out. Simply call bpin() on the buffer to do this. + * + * We also always take a reference to the buffer log item here so that the bli + * is held while the item is pinned in memory. This means that we can + * unconditionally drop the reference count a transaction holds when the + * transaction is completed. */ + STATIC void xfs_buf_item_pin( xfs_buf_log_item_t *bip) @@ -356,6 +361,7 @@ xfs_buf_item_pin( ASSERT(atomic_read(&bip->bli_refcount) > 0); ASSERT((bip->bli_flags & XFS_BLI_LOGGED) || (bip->bli_flags & XFS_BLI_STALE)); + atomic_inc(&bip->bli_refcount); trace_xfs_buf_item_pin(bip); xfs_bpin(bp); } @@ -489,20 +495,23 @@ xfs_buf_item_trylock( } /* - * Release the buffer associated with the buf log item. - * If there is no dirty logged data associated with the - * buffer recorded in the buf log item, then free the - * buf log item and remove the reference to it in the - * buffer. + * Release the buffer associated with the buf log item. If there is no dirty + * logged data associated with the buffer recorded in the buf log item, then + * free the buf log item and remove the reference to it in the buffer. * - * This call ignores the recursion count. It is only called - * when the buffer should REALLY be unlocked, regardless - * of the recursion count. + * This call ignores the recursion count. 
It is only called when the buffer + * should REALLY be unlocked, regardless of the recursion count. * - * If the XFS_BLI_HOLD flag is set in the buf log item, then - * free the log item if necessary but do not unlock the buffer. - * This is for support of xfs_trans_bhold(). Make sure the - * XFS_BLI_HOLD field is cleared if we don't free the item. + * We unconditionally drop the transaction's reference to the log item. If the + * item was logged, then another reference was taken when it was pinned, so we + * can safely drop the transaction reference now. This also allows us to avoid + * potential races with the unpin code freeing the bli by not referencing the + * bli after we've dropped the reference count. + * + * If the XFS_BLI_HOLD flag is set in the buf log item, then free the log item + * if necessary but do not unlock the buffer. This is for support of + * xfs_trans_bhold(). Make sure the XFS_BLI_HOLD field is cleared if we don't + * free the item. */ STATIC void xfs_buf_item_unlock( @@ -514,73 +523,54 @@ xfs_buf_item_unlock( bp = bip->bli_buf; - /* - * Clear the buffer's association with this transaction. - */ + /* Clear the buffer's association with this transaction. */ XFS_BUF_SET_FSPRIVATE2(bp, NULL); /* - * If this is a transaction abort, don't return early. - * Instead, allow the brelse to happen. - * Normally it would be done for stale (cancelled) buffers - * at unpin time, but we'll never go through the pin/unpin - * cycle if we abort inside commit. + * If this is a transaction abort, don't return early. Instead, allow + * the brelse to happen. Normally it would be done for stale + * (cancelled) buffers at unpin time, but we'll never go through the + * pin/unpin cycle if we abort inside commit. */ aborted = (bip->bli_item.li_flags & XFS_LI_ABORTED) != 0; /* - * If the buf item is marked stale, then don't do anything. - * We'll unlock the buffer and free the buf item when the - * buffer is unpinned for the last time. 
+ * Before possibly freeing the buf item, determine if we should + * release the buffer at the end of this routine. + */ + hold = bip->bli_flags & XFS_BLI_HOLD; + + /* Clear the per transaction state. */ + bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_HOLD); + + /* + * If the buf item is marked stale, then don't do anything. We'll + * unlock the buffer and free the buf item when the buffer is unpinned + * for the last time. */ if (bip->bli_flags & XFS_BLI_STALE) { - bip->bli_flags &= ~XFS_BLI_LOGGED; trace_xfs_buf_item_unlock_stale(bip); ASSERT(bip->bli_format.blf_flags & XFS_BLI_CANCEL); - if (!aborted) + if (!aborted) { + atomic_dec(&bip->bli_refcount); return; + } } - /* - * Drop the transaction's reference to the log item if - * it was not logged as part of the transaction. Otherwise - * we'll drop the reference in xfs_buf_item_unpin() when - * the transaction is really through with the buffer. - */ - if (!(bip->bli_flags & XFS_BLI_LOGGED)) { - atomic_dec(&bip->bli_refcount); - } else { - /* - * Clear the logged flag since this is per - * transaction state. - */ - bip->bli_flags &= ~XFS_BLI_LOGGED; - } - - /* - * Before possibly freeing the buf item, determine if we should - * release the buffer at the end of this routine. - */ - hold = bip->bli_flags & XFS_BLI_HOLD; trace_xfs_buf_item_unlock(bip); /* - * If the buf item isn't tracking any data, free it. - * Otherwise, if XFS_BLI_HOLD is set clear it. + * If the buf item isn't tracking any data, free it, otherwise drop the + * reference we hold to it. */ if (xfs_bitmap_empty(bip->bli_format.blf_data_map, - bip->bli_format.blf_map_size)) { + bip->bli_format.blf_map_size)) xfs_buf_item_relse(bp); - } else if (hold) { - bip->bli_flags &= ~XFS_BLI_HOLD; - } + else + atomic_dec(&bip->bli_refcount); - /* - * Release the buffer if XFS_BLI_HOLD was not set. 
- */ - if (!hold) { + if (!hold) xfs_buf_relse(bp); - } } /* -- 1.5.6.5 From nathans@aconex.com Wed May 5 23:15:11 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o464FAJ2070817 for ; Wed, 5 May 2010 23:15:10 -0500 X-ASG-Debug-ID: 1273119438-1ad4013f0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from postoffice2.aconex.com (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id C6324128138D for ; Wed, 5 May 2010 21:17:19 -0700 (PDT) Received: from postoffice2.aconex.com (mail.aconex.com [203.89.202.182]) by cuda.sgi.com with ESMTP id netBeEfaPKTvED5B for ; Wed, 05 May 2010 21:17:19 -0700 (PDT) Received: from postoffice.aconex.com (localhost [127.0.0.1]) by postoffice2.aconex.com (Spam & Virus Firewall) with ESMTP id 93C8A51790F; Thu, 6 May 2010 14:17:16 +1000 (EST) Received: from postoffice.aconex.com (postoffice.yarra.acx [192.168.102.1]) by postoffice2.aconex.com with ESMTP id 5B0DEFyCxjSYm0Gm; Thu, 06 May 2010 14:17:16 +1000 (EST) Received: from gatekeeper.aconex.com (gatekeeper.yarra.acx [192.168.102.10]) by postoffice.aconex.com (Postfix) with ESMTP id 54D9AA50283; Thu, 6 May 2010 14:14:03 +1000 (EST) Received: from localhost (localhost.localdomain [127.0.0.1]) by gatekeeper.aconex.com (Postfix) with ESMTP id 796224886E5; Thu, 6 May 2010 14:17:16 +1000 (EST) X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Scanned: amavisd-new at aconex.com Received: from gatekeeper.aconex.com ([127.0.0.1]) by localhost (gatekeeper.aconex.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id thZTbe4ceoS4; Thu, 6 May 2010 14:17:11 +1000 (EST) Received: from mail-au.aconex.com (mail-au.aconex.com [192.168.102.12]) by 
gatekeeper.aconex.com (Postfix) with ESMTP id BF1364886DF; Thu, 6 May 2010 14:17:11 +1000 (EST) Received: from mail-au.aconex.com (mail-au.aconex.com [192.168.102.12]) by mail-au.aconex.com (Postfix) with ESMTP id 9CFC06108EDE; Thu, 6 May 2010 14:17:05 +1000 (EST) Date: Thu, 6 May 2010 14:17:04 +1000 (EST) From: nathans@aconex.com Sender: nscott@aconex.com To: Dave Chinner Cc: xfs@oss.sgi.com Message-ID: <827899224.88951273119424327.JavaMail.root@mail-au> In-Reply-To: <921460991.88921273119221574.JavaMail.root@mail-au> X-ASG-Orig-Subj: Re: [PATCH 07/11] xfs: Delayed logging design documentation Subject: Re: [PATCH 07/11] xfs: Delayed logging design documentation MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [203.89.192.141] X-Mailer: Zimbra 5.0.18_GA_3011.RHEL5_64 (ZimbraWebClient - FF3.0 (Linux)/5.0.18_GA_3011.RHEL5_64) X-Barracuda-Connect: mail.aconex.com[203.89.202.182] X-Barracuda-Start-Time: 1273119439 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=NO_REAL_NAME X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29175 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.00 NO_REAL_NAME From: does not include a real name X-Virus-Status: Clean Yo, Looking good Dave! Thanks for writing this up, it made for some interesting lunch time reading. ;) Here's a few typos I noticed, marked 'em up as I read along... > +required for objects that are frequently logged. Parts inodes are "Some parts of ..."? 
> +The limitation on asynchrnous transaction throughput is the number "asynchronous" > +Effectively, this gives us the maximum bound out outstanding metadata "out" -> "of" > +formating the changes in a transaction to the log buffer. Hence we "formatting" > +One of the key changes that delayed logging makes to the operation of > the > +journalling subsystem is that is dissociates the amount of "is" -> "it", and "disassociates" (or maybe "dissociates" is what you meant, you chem freak you. ;) > +recovered filesysetm is concerned, there may be many thousands of filesystem. > +This introduces lots of scope of deadlocks with transactions that are "scope for deadlocks"? > +The solution is relatively simple - it just a long time to recognise "took a long time" > +rewrite the vector addresses to point at the memory buffer we end up "and rewriting the" > with a > +self-describing object that it can be passed to the log buffer write "that can be" > is entire > +arbitrary and done to make it easy for debugging - the last items in "is entirely" > +crash during the write of a such a transaction could partially > overwrites the "partially overwrite" > +format structure. That is, two vectors totalling roughly 150 bytes. "totaling" > +the write reservation (the actual space availble to the transaction) "available" > +checkpoint commit to complete. This background push checked and > executed by transaction commit code. "is checked" > +log item completion. THe result of this is that pinning and unpinning "The" > +are entered and completed is the object considered clean.` Spurious "`" there? > + => gain confidence and fix problems reported by early > + adopters (a.k.a. guinea pigs) :-) cheers. 
-- Nathan From SRS0+i3iU+65+fromorbit.com=david@internode.on.net Thu May 6 01:05:01 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o46651Ss076163 for ; Thu, 6 May 2010 01:05:01 -0500 X-ASG-Debug-ID: 1273126028-6b48033a0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 51A18314366 for ; Wed, 5 May 2010 23:07:08 -0700 (PDT) Received: from mail.internode.on.net (bld-mail12.adl6.internode.on.net [150.101.137.97]) by cuda.sgi.com with ESMTP id HP0sWxFeVfENdQDF for ; Wed, 05 May 2010 23:07:08 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23303081-1927428 for multiple; Thu, 06 May 2010 15:37:07 +0930 (CST) Received: from dave by dastard with local (Exim 4.71) (envelope-from ) id 1O9uEk-0005R3-6Y; Thu, 06 May 2010 16:07:06 +1000 Date: Thu, 6 May 2010 16:07:06 +1000 From: Dave Chinner To: nathans@aconex.com Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH 07/11] xfs: Delayed logging design documentation Subject: Re: [PATCH 07/11] xfs: Delayed logging design documentation Message-ID: <20100506060706.GB19579@dastard> References: <921460991.88921273119221574.JavaMail.root@mail-au> <827899224.88951273119424327.JavaMail.root@mail-au> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <827899224.88951273119424327.JavaMail.root@mail-au> User-Agent: Mutt/1.5.20 (2009-06-14) X-Barracuda-Connect: bld-mail12.adl6.internode.on.net[150.101.137.97] X-Barracuda-Start-Time: 1273126030 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com 
at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29181 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Thu, May 06, 2010 at 02:17:04PM +1000, nathans@aconex.com wrote: > Yo, > > Looking good Dave! Thanks for writing this up, it made for some > interesting lunch time reading. ;) > > Here's a few typos I noticed, marked 'em up as I read along... Thanks for the proof reading. ;) I've fixed all the bits you pointed out, and a couple more I found while fixing those.... Cheers, Dave. -- Dave Chinner david@fromorbit.com From SRS0+1LZ9+65+fromorbit.com=david@internode.on.net Thu May 6 08:24:51 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o46DOphc095183 for ; Thu, 6 May 2010 08:24:51 -0500 X-ASG-Debug-ID: 1273152405-0b6901100000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 1434B315364 for ; Thu, 6 May 2010 06:26:45 -0700 (PDT) Received: from mail.internode.on.net (bld-mail14.adl6.internode.on.net [150.101.137.99]) by cuda.sgi.com with ESMTP id 2xShjUvp74Qve0e2 for ; Thu, 06 May 2010 06:26:45 -0700 (PDT) Received: from dastard (unverified [121.44.229.111]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23594965-1927428 for ; Thu, 06 May 2010 22:56:43 +0930 (CST) Received: from dave by dastard with local (Exim 4.71) 
(envelope-from ) id 1OA169-0005qI-VT for xfs@oss.sgi.com; Thu, 06 May 2010 23:26:42 +1000 Date: Thu, 6 May 2010 23:26:41 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH 0/11] xfs: delayed logging Subject: Re: [PATCH 0/11] xfs: delayed logging Message-ID: <20100506132641.GC19579@dastard> References: <1273110351-2333-1-git-send-email-david@fromorbit.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1273110351-2333-1-git-send-email-david@fromorbit.com> User-Agent: Mutt/1.5.20 (2009-06-14) X-Barracuda-Connect: bld-mail14.adl6.internode.on.net[150.101.137.99] X-Barracuda-Start-Time: 1273152407 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-ASG-Whitelist: BODY (http://marc\.info/\?) X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Thu, May 06, 2010 at 11:45:40AM +1000, Dave Chinner wrote: > > Hi Folks, > > This is version 4 of the delayed logging series. > > I won't repeat everything about what it is, just point you > here: > > http://marc.info/?l=linux-xfs&m=126862777118946&w=2 > > for the description, and here: > > git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfs.git delayed-logging > > for the current code. Note that this is a rebased branch, so you'll > need to pull it again into a new branch to review. > > This version includes a number of fixes and cleanups related to the > busy extent tracking. This includes fixing a long standing > off-by-one that was causing assert failures when inserting busy > extents that overlapped with existing busy extents. Ok, so I'm still getting assert failures, but they are much harder to hit. 
However, here's the fragment of a trace that points out why delayed logging is now tripping over this problem: $ grep _busy: t.t |tail -20|cut -d ":" -f 2- xfs_alloc_busy: dev 253:16 trans 0xffff8801067eada0 tid 0xa9a1ac8d agno 1 agbno 133 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff8801067eada0 tid 0xa9a1ac8d agno 1 agbno 24 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff8801067eada0 tid 0xa9a1ac8d agno 1 agbno 91909 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff8801067eada0 tid 0xa9a1ac8d agno 1 agbno 75 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff8801067eada0 tid 0xfe95d7e3 agno 1 agbno 100504 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff8801067eada0 tid 0xfe95d7e3 agno 1 agbno 100505 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff8801067eada0 tid 0xfe95d7e3 agno 1 agbno 100506 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff8801067eada0 tid 0xfe95d7e3 agno 1 agbno 100507 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff8801067eada0 tid 0xfe95d7e3 agno 1 agbno 100508 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff8801067eada0 tid 0xfe95d7e3 agno 1 agbno 100509 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff880067c42da0 tid 0xbe3d9ca6 agno 0 agbno 37809 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff880067c42da0 tid 0x4575b77f agno 2 agbno 133387 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff880067c42da0 tid 0xeb882f7e agno 2 agbno 151935 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff880067c42da0 tid 0x362c7c31 agno 2 agbno 133386 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff880067c42da0 tid 0x4dda728f agno 2 agbno 151936 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff880067c42da0 tid 0xc1ca9675 agno 3 agbno 49832 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff880067c42da0 tid 0x8a3e0a41 agno 3 agbno 49833 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff880067c42da0 tid 0x3fe9fb72 agno 1 agbno 109637 len 1 async xfs_alloc_busy: dev 253:16 trans 0xffff8801067eada0 tid 0x69f15b1c agno 1 
agbno 91909 len 1 async

First thing to note is that there are only two different addresses
for transaction structures here, but there are 11 different
transaction IDs. That's a bit of a problem, really. The assert
failure was triggered by the last line:

xfs_alloc_busy: dev 253:16 trans 0xffff8801067eada0 tid 0x69f15b1c agno 1 agbno 91909 len 1 async

Which appears to have already been marked busy by a different
transaction earlier on:

xfs_alloc_busy: dev 253:16 trans 0xffff8801067eada0 tid 0xa9a1ac8d agno 1 agbno 91909 len 1 async

If I just pull out the last operations on that block:

$ grep "agno 1 agbno 919" t.t |tail -5
fs_mark-2741 [007] 795063.752648: xfs_alloc_busy: dev 253:16 trans 0xffff8801067eada0 tid 0xa9a1ac8d agno 1 agbno 91909 len 1 async
fs_mark-2742 [005] 795063.754591: xfs_alloc_busysearch: dev 253:16 agno 1 agbno 91909 len 1 found
fs_mark-2745 [007] 795063.775012: xfs_alloc_busy: dev 253:16 trans 0xffff8801067eada0 tid 0x69f15b1c agno 1 agbno 91909 len 1 async
xfslogd/4-497 [004] 795063.775540: xfs_alloc_unbusy: dev 253:16 agno 1 agbno 91909 len 1
xfslogd/4-497 [004] 795063.775542: xfs_alloc_busysearch: dev 253:16 agno 1 agbno 91909 len 1 found

We can see that there are two separate processes marking the same
extent busy, using the same transaction structure address. But they
have to be two different transactions, because a transaction is
always done in the context of a single kernel thread. i.e:

fs_mark-2741    fs_mark-2742        fs_mark-2743            xfslogd
xact alloc
free 1:91909
mark busy
commit xact
free xact
                xact alloc
                alloc 1:91909
                busy search
                mark xact sync
                commit xact
                free xact
                force log
                checkpoint starts
                ....
                                    xact alloc
                                    free 1:91909
                                    mark busy
                                    finds match, not sync
                                    *** KABOOM! ***
                                                            ....
                                                            log IO completes
                                                            unbusy 1:91909
                                                            checkpoint completes

So, now I think I can explain the causes of that assert failure.

Firstly, there is the off-by-one I found this morning in the search
code, which would trigger in both delayed and normal logging modes.
This had nothing to do with transaction commits and lifecycles; it
was just a search failing to set the transaction synchronous when it
should have.

Secondly, for delayed logging only, matching by transaction
structure address triggers the failure because busy extents have a
much longer life than the transaction structure. It is clear why the
transaction ID matching didn't trip over this - it would have
triggered a log force in this situation, and hence blocked until the
checkpoint that fs_mark-2742 had triggered was complete before
redoing the rbtree insert.

Right now I'm simply going to go back to using the transaction ID
for matching transactions, even though the above analysis points out
that even that is not as efficient as it could be for delayed
logging. That is, we don't even need to force the log or have a
synchronous transaction if the extent was first freed in the current
checkpoint sequence. Doing that, however, requires pinning the
checkpoint sequence (i.e. preventing a flush) until the current
transaction commits. While that is in the plan for delayed logging,
it is future functionality and hence I'm not going to attempt to
design and implement it this close to the 2.6.35-rc cycle. [*]

Christoph - does this answer all your concerns with the busy extent
tracking modifications, or is there still something left
unexplained?

Cheers,

Dave.

[*] Checkpoint pinning is needed to implement atomic
multi-transaction operations such as "create with attribute".
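[Editorial sketch, not part of the original mail.] The failure mode
analysed above can be illustrated with a small toy model. This is a
sketch in Python, not kernel code: the class BusyExtentTree, its
method names and the return strings are hypothetical, and the only
values taken from the trace are the transaction address, the two
tids and the extent 1:91909. It assumes the busy-extent tree records
one owner key per extent and treats a second free with the same key
as coming from the same transaction.

```python
# Toy model (assumption-laden, not kernel code): why matching busy
# extents by transaction structure address goes wrong once busy
# extents outlive the transaction structure.

class BusyExtentTree:
    """Maps (agno, agbno) to the key of the transaction that freed it."""

    def __init__(self):
        self.busy = {}

    def mark_busy(self, owner_key, agno, agbno):
        """Return 'inserted', 'same-owner' or 'conflict'."""
        extent = (agno, agbno)
        if extent not in self.busy:
            self.busy[extent] = owner_key
            return "inserted"
        if self.busy[extent] == owner_key:
            # Looks like a re-free inside one transaction: nothing
            # forces the log, so the stale entry is never waited on.
            return "same-owner"
        # A different owner: this is where a log force would be needed
        # before the insert could be retried.
        return "conflict"

    def unbusy(self, agno, agbno):
        self.busy.pop((agno, agbno), None)


# Two distinct transactions whose structures happen to reuse one address.
RECYCLED_ADDR = 0xffff8801067eada0
first = {"addr": RECYCLED_ADDR, "tid": 0xa9a1ac8d}
second = {"addr": RECYCLED_ADDR, "tid": 0x69f15b1c}


def run(key):
    """Free extent 1:91909 from both transactions, matching on `key`."""
    tree = BusyExtentTree()
    tree.mark_busy(first[key], 1, 91909)
    return tree.mark_busy(second[key], 1, 91909)


by_addr = run("addr")  # recycled address gives a false "same transaction"
by_tid = run("tid")    # distinct tids are recognised as a conflict
```

Matching on the address reports the second free as belonging to the
same transaction, while matching on the tid reports a conflict,
which is the case where blocking on a log force is the safe answer.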
-- Dave Chinner david@fromorbit.com From aelder@sgi.com Thu May 6 12:24:00 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o46HO0kk123514 for ; Thu, 6 May 2010 12:24:00 -0500 Received: from stout.americas.sgi.com (stout.americas.sgi.com [128.162.232.50]) by relay2.corp.sgi.com (Postfix) with ESMTP id 08ACA3040D1; Thu, 6 May 2010 10:26:08 -0700 (PDT) Received: from stout.americas.sgi.com (localhost6.localdomain6 [127.0.0.1]) by stout.americas.sgi.com (8.14.3/8.14.2) with ESMTP id o46HQ7WL018284; Thu, 6 May 2010 12:26:07 -0500 Received: (from aelder@localhost) by stout.americas.sgi.com (8.14.3/8.14.3/Submit) id o46HQ78v018283; Thu, 6 May 2010 12:26:07 -0500 From: Alex Elder Message-Id: <201005061726.o46HQ78v018283@stout.americas.sgi.com> Date: Thu, 06 May 2010 12:26:07 -0500 To: xfs@oss.sgi.com Subject: [PATCH] xfstests: honor comments in the test group file User-Agent: Heirloom mailx 12.4 7/29/08 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean (Re-posting unchanged in hopes it will get reviewed this time.) There are some spots in the "group" file where test numbers have groups listed after a '#' character, clearly intending for those groups to be commented out. But the way the group list gets generated that commenting doesn't work, and in fact these tests explicitly *are* included in such commented-out groups. This patch fixes that, stripping out all comments (which start with a '#' character and end with a newline) from the file before building the set of test numbers for a group. 
Signed-off-by: Alex Elder --- common | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) Index: b/common =================================================================== --- a/common +++ b/common @@ -58,9 +58,10 @@ do if $group then # arg after -g - group_list=`sed -n ; Thu, 6 May 2010 14:10:17 -0500 X-ASG-Debug-ID: 1273173141-43ba034f0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 90CA1317661 for ; Thu, 6 May 2010 12:12:21 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id jBgYk3hlSCwN5uc9 for ; Thu, 06 May 2010 12:12:21 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1OA6Uc-0005kv-Q1; Thu, 06 May 2010 19:12:18 +0000 Date: Thu, 6 May 2010 15:12:18 -0400 From: Christoph Hellwig To: Dave Chinner Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH 0/11] xfs: delayed logging Subject: Re: [PATCH 0/11] xfs: delayed logging Message-ID: <20100506191218.GA18555@infradead.org> References: <1273110351-2333-1-git-send-email-david@fromorbit.com> <20100506132641.GC19579@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100506132641.GC19579@dastard> User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1273173146 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean > Secondly, for delayed logging only, matching by transaction > structure address triggers the failure because busy extents > have a much longer life than the transaction structure. 
It is clear > why the transaction ID matching didn't trip over - it would have > triggered a log force in this situation, and hence blocked until > the checkpoint that fs_mark-2742 had triggered was complete before > redoing the rbtree insert. True, the busy extents get spliced over to the CIL context, so they outlive the transaction structure. > Right now I'm simply going to go back to using the transaction ID > for matching transactions, even though the above analysis points out > that even that is not as efficient as it could be for delayed > logging. That is, we don't even need to force the log or have a > synchronous transaction if the extent was first freed in the current > checkpoint sequence. Doing that, however, requires pinning the > checkpoint sequence (i.e. preventing a flush) until the current > transaction commits. While that is in the plan for delayed logging, > it is future functionality and hence I'm not going to attempt to > design and implement it this close to the 2.6.35-rc cycle. [*] Sounds fine to me. I'm not a fan of exporting the tid, but it seems like there's no good way around it for now. Please make the tid exporting a separate changeset so that it's easily revertable once this is sorted out. And documenting all this in comments in the code so that it's archived would be very useful!
From BATV+c80a52312441aaaaf662+2447+infradead.org+hch@bombadil.srs.infradead.org Thu May 6 15:33:07 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.8 required=5.0 tests=BAYES_00,J_CHICKENPOX_92 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o46KX610130167 for ; Thu, 6 May 2010 15:33:07 -0500 X-ASG-Debug-ID: 1273178116-3b08021f0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 4B20E31778B for ; Thu, 6 May 2010 13:35:16 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id IC44Fm77kNGZvLmz for ; Thu, 06 May 2010 13:35:16 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1OA7ms-0002Zn-Ud; Thu, 06 May 2010 20:35:14 +0000 Date: Thu, 6 May 2010 16:35:14 -0400 From: Christoph Hellwig To: DENIEL Philippe Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: Question : Using libhandle from xfsprogs and xfs actions made "by handle" Subject: Re: Question : Using libhandle from xfsprogs and xfs actions made "by handle" Message-ID: <20100506203514.GA6854@infradead.org> References: <4BE178B3.8030501@cea.fr>
MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BE178B3.8030501@cea.fr> User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1273178117 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Wed, May 05, 2010 at 03:54:59PM +0200, DENIEL Philippe wrote: > When looking at XFS, I saw there was "open_by_handle" and > "path_to_handle" calls. This sounds very very good to me : this sounds > like kind of bridge to build a handle-based API to address XFS. But so > far, I am a bit stuck : for exporting XFS through my NFS server, I would > need to do "by handle" everything that can be done through POSIX calls, > open/read/write/close files, create files/directories/symlinks, erasing > or moving files... and so on. I do not know if this is possible with the > calls in libhandle.so. But if I had such handle based tools, I think I > could make a nice NFS server on top of XFS (I did this kind of port for > LUSTRE (which has a full handle based API) in my NFS server and I had > really good performances). Can someone provide me with information about > this ? For some reason the handle code currently rejects special files (block/char/fifo) in open_by_handle. I can't see any good reason for that and plan to submit a patch to lift that restriction. Except for that libhandle is exactly what you want - there's various tools that use it for that kind of work, the most prominent is xfsdump/xfsrestore. 
From BATV+c80a52312441aaaaf662+2447+infradead.org+hch@bombadil.srs.infradead.org Thu May 6 15:34:53 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o46KYr3x130246 for ; Thu, 6 May 2010 15:34:53 -0500 X-ASG-Debug-ID: 1273178223-4c1400d00000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id C5A2B3177B2; Thu, 6 May 2010 13:37:03 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id EkUmFMuylpIHXAed; Thu, 06 May 2010 13:37:03 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1OA7oc-0002mF-UH; Thu, 06 May 2010 20:37:02 +0000 Date: Thu, 6 May 2010 16:37:02 -0400 From: Christoph Hellwig To: Alex Elder Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH] xfstests: honor comments in the test group file Subject: Re: [PATCH] xfstests: honor comments in the test group file Message-ID: <20100506203702.GB6854@infradead.org> References: <201005061726.o46HQ78v018283@stout.americas.sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201005061726.o46HQ78v018283@stout.americas.sgi.com> User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1273178223 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Thu, May 06, 2010 at 
12:26:07PM -0500, Alex Elder wrote: > (Re-posting unchanged in hopes it will get reviewed this time.) > > There are some spots in the "group" file where test numbers have > groups listed after a '#' character, clearly intending for those > groups to be commented out. But the way the group list gets > generated that commenting doesn't work, and in fact these tests > explicitly *are* included in such commented-out groups. > > This patch fixes that, stripping out all comments (which start > with a '#' character and end with a newline) from the file before > building the set of test numbers for a group. > > Signed-off-by: Alex Elder Looks good, Reviewed-by: Christoph Hellwig From BATV+c80a52312441aaaaf662+2447+infradead.org+hch@bombadil.srs.infradead.org Thu May 6 15:39:06 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o46Kd6jW130380 for ; Thu, 6 May 2010 15:39:06 -0500 X-ASG-Debug-ID: 1273178476-3afa02380000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id E8DD83177F5 for ; Thu, 6 May 2010 13:41:16 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id xiWjZZHr1HVNzcn8 for ; Thu, 06 May 2010 13:41:16 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1OA7sh-0003oo-E6; Thu, 06 May 2010 20:41:15 +0000 Date: Thu, 6 May 2010 16:41:15 -0400 From: Christoph Hellwig To: Dave Chinner Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH 01/11] xfs: Don't reuse the same transaciton ID for duplicated transactions. 
Subject: Re: [PATCH 01/11] xfs: Don't reuse the same transaciton ID for duplicated transactions. Message-ID: <20100506204115.GA14309@infradead.org> References: <1273110351-2333-1-git-send-email-david@fromorbit.com> <1273110351-2333-2-git-send-email-david@fromorbit.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1273110351-2333-2-git-send-email-david@fromorbit.com> User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1273178476 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Thu, May 06, 2010 at 11:45:41AM +1000, Dave Chinner wrote: > From: Dave Chinner > > The transaction ID is written into the log as the unique identifier > for transactions during recovery. When duplicating a transaction, we > reuse the log ticket, which means it has the same transaction ID as > the previous transaction. > > Rather than regenerating a random transaction ID for the duplicated > transaction, just add one to the current ID so that duplicated > transactions can be easily spotted in the log and during recovery > during problem diagnosis. Oh well, more fun with transaction ids. I'm still not happy with the not-guaranteed-unique schemes we have here, but for now this should do it.
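[Editorial sketch, not part of the original mail.] The scheme in the
quoted description can be sketched as below. This is a Python toy,
not the kernel change: the LogTicket class and dup_ticket_tid helper
are hypothetical names, and the 32-bit width with wrap-around is an
assumption made for the illustration.

```python
import random

TID_MASK = 0xFFFFFFFF  # assumption: transaction IDs are 32 bits wide


class LogTicket:
    """Hypothetical stand-in for a log ticket carrying a transaction ID."""

    def __init__(self, tid=None):
        # A fresh ticket gets a random ID unless one is supplied.
        self.tid = random.getrandbits(32) if tid is None else tid & TID_MASK


def dup_ticket_tid(ticket):
    """On transaction duplication, reuse the ticket but bump its ID by one."""
    ticket.tid = (ticket.tid + 1) & TID_MASK
    return ticket.tid


ticket = LogTicket(tid=0xa9a1ac8d)
next_tid = dup_ticket_tid(ticket)
wrapped_tid = dup_ticket_tid(LogTicket(tid=0xFFFFFFFF))
```

Consecutive IDs make a chain of duplicated transactions easy to spot
in a log dump, but as noted above nothing here guarantees the IDs
are unique across unrelated transactions.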
Reviewed-by: Christoph Hellwig From BATV+c80a52312441aaaaf662+2447+infradead.org+hch@bombadil.srs.infradead.org Thu May 6 15:41:01 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: * X-Spam-Status: No, score=1.1 required=5.0 tests=BAYES_00,SUBJ_TICKET autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o46Kf0mk130438 for ; Thu, 6 May 2010 15:41:01 -0500 X-ASG-Debug-ID: 1273178591-494501550000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 4F28D31781D for ; Thu, 6 May 2010 13:43:11 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id 8I7HWK0TMSjpkuhk for ; Thu, 06 May 2010 13:43:11 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1OA7uY-0003wV-RL; Thu, 06 May 2010 20:43:10 +0000 Date: Thu, 6 May 2010 16:43:10 -0400 From: Christoph Hellwig To: Dave Chinner Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH 03/11] xfs: allow log ticket allocation to take allocation flags Subject: Re: [PATCH 03/11] xfs: allow log ticket allocation to take allocation flags Message-ID: <20100506204310.GB14309@infradead.org> References: <1273110351-2333-1-git-send-email-david@fromorbit.com> <1273110351-2333-4-git-send-email-david@fromorbit.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1273110351-2333-4-git-send-email-david@fromorbit.com> User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1273178591 X-Barracuda-Virus-Scanned: by cuda.sgi.com at 
sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Thu, May 06, 2010 at 11:45:43AM +1000, Dave Chinner wrote: > From: Dave Chinner > > Delayed logging currently requires ticket allocation to succeed, so > we need to be able to sleep on allocation. It also should not allow > memory allocation to recurse into the filesystem. hence we need to > pass allocation flags directing the type of allocation the caller > requires. Looks good, Reviewed-by: Christoph Hellwig From BATV+c80a52312441aaaaf662+2447+infradead.org+hch@bombadil.srs.infradead.org Thu May 6 15:48:44 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: * X-Spam-Status: No, score=1.1 required=5.0 tests=BAYES_00,SUBJ_TICKET autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o46Kmiju130822 for ; Thu, 6 May 2010 15:48:44 -0500 X-ASG-Debug-ID: 1273179054-5580013a0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id DB328155BF26 for ; Thu, 6 May 2010 13:50:54 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id b9kHhq5cxLfv93ES for ; Thu, 06 May 2010 13:50:54 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1OA822-0006JR-CP; Thu, 06 May 2010 20:50:54 +0000 Date: Thu, 6 May 2010 16:50:54 -0400 From: Christoph Hellwig To: Dave Chinner Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH 06/11] xfs: clean up log ticket overrun debug output Subject: Re: [PATCH 06/11] xfs: clean up log ticket overrun debug output Message-ID: <20100506205054.GA24110@infradead.org> References: <1273110351-2333-1-git-send-email-david@fromorbit.com> 
<1273110351-2333-7-git-send-email-david@fromorbit.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1273110351-2333-7-git-send-email-david@fromorbit.com> User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1273179054 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Thu, May 06, 2010 at 11:45:46AM +1000, Dave Chinner wrote: > From: Dave Chinner > > Push the error message output when a ticket overrun is detected > into the ticket printing functions. Also remove the debug version > of the code as the production version will still panic just as > effectively on a debug kernel via the panic mask being set. Looks good, Reviewed-by: Christoph Hellwig From BATV+c80a52312441aaaaf662+2447+infradead.org+hch@bombadil.srs.infradead.org Thu May 6 15:49:49 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o46KnmE4130867 for ; Thu, 6 May 2010 15:49:48 -0500 X-ASG-Debug-ID: 1273179119-48da01580000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 401C93178D4 for ; Thu, 6 May 2010 13:51:59 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id PHcOyQzCYFwwwkKZ for ; Thu, 06 May 2010 13:51:59 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red 
Hat Linux)) id 1OA834-0006S1-V0; Thu, 06 May 2010 20:51:58 +0000 Date: Thu, 6 May 2010 16:51:58 -0400 From: Christoph Hellwig To: Dave Chinner Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH 05/11] xfs: Clean up XFS_BLI_* flag namespace Subject: Re: [PATCH 05/11] xfs: Clean up XFS_BLI_* flag namespace Message-ID: <20100506205158.GB24110@infradead.org> References: <1273110351-2333-1-git-send-email-david@fromorbit.com> <1273110351-2333-6-git-send-email-david@fromorbit.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1273110351-2333-6-git-send-email-david@fromorbit.com> User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1273179119 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Thu, May 06, 2010 at 11:45:45AM +1000, Dave Chinner wrote: > From: Dave Chinner > > Clean up the buffer log format (XFS_BLI_*) flags because they have a > polluted namespace. The XFS_BLI_ prefix is used for both in-memory > and on-disk flag fields, but these have overlapping values for different > flags. Rename the buffer log format flags to use the XFS_BLF_* > prefix to avoid confusing them with the in-memory XFS_BLI_* prefixed > flags.
> > Signed-off-by: Dave Chinner Looks good, Reviewed-by: Christoph Hellwig From BATV+c80a52312441aaaaf662+2447+infradead.org+hch@bombadil.srs.infradead.org Thu May 6 15:58:23 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o46KwMa9131264 for ; Thu, 6 May 2010 15:58:23 -0500 X-ASG-Debug-ID: 1273179626-554801630000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 4DA48155C494 for ; Thu, 6 May 2010 14:00:27 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id wjqI5KUYvLCvOomc for ; Thu, 06 May 2010 14:00:27 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1OA8BG-00007T-Pd; Thu, 06 May 2010 21:00:26 +0000 Date: Thu, 6 May 2010 17:00:26 -0400 From: Christoph Hellwig To: Dave Chinner Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH 04/11] xfs: modify buffer item reference counting V2 Subject: Re: [PATCH 04/11] xfs: modify buffer item reference counting V2 Message-ID: <20100506210026.GA30264@infradead.org> References: <1273110351-2333-1-git-send-email-david@fromorbit.com> <1273110351-2333-5-git-send-email-david@fromorbit.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1273110351-2333-5-git-send-email-david@fromorbit.com> User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1273179627 X-Barracuda-Virus-Scanned: by 
cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Thu, May 06, 2010 at 11:45:44AM +1000, Dave Chinner wrote: > From: Dave Chinner > > The buffer log item reference counts used to take references for every > transaction, similar to the pin counting. This is symmetric (like the > pin/unpin) with respect to transaction completion, but with delayed logging > becomes asymmetric as the pinning becomes asymmetric w.r.t. transaction > completion. > > To make both cases the same, allow the buffer pinning to take a reference to > the buffer log item and always drop the reference the transaction has on it > when being unlocked. This is balanced correctly because the unpin operation > always drops a reference to the log item. Hence reference counting becomes > symmetric w.r.t. item pinning as well as w.r.t. active transactions and as a > result the reference counting model remains consistent between normal and > delayed logging. > > Signed-off-by: Dave Chinner Looks good and makes the buffer refcounting model a lot more sensible. 
Reviewed-by: Christoph Hellwig From SRS0+SkGd+65+fromorbit.com=david@internode.on.net Thu May 6 18:51:55 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o46Npt8w137294 for ; Thu, 6 May 2010 18:51:55 -0500 X-ASG-Debug-ID: 1273190043-413200670000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 1C483136DE43 for ; Thu, 6 May 2010 16:54:04 -0700 (PDT) Received: from mail.internode.on.net (bld-mail13.adl6.internode.on.net [150.101.137.98]) by cuda.sgi.com with ESMTP id oFMA5tk8poayg6jb for ; Thu, 06 May 2010 16:54:04 -0700 (PDT) Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23509891-1927428 for multiple; Fri, 07 May 2010 09:24:02 +0930 (CST) Received: from dave by dastard with local (Exim 4.71) (envelope-from ) id 1OAAtF-0006Z4-Jx; Fri, 07 May 2010 09:54:01 +1000 Date: Fri, 7 May 2010 09:54:01 +1000 From: Dave Chinner To: Christoph Hellwig Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH 0/11] xfs: delayed logging Subject: Re: [PATCH 0/11] xfs: delayed logging Message-ID: <20100506235401.GD19579@dastard> References: <1273110351-2333-1-git-send-email-david@fromorbit.com> <20100506132641.GC19579@dastard> <20100506191218.GA18555@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100506191218.GA18555@infradead.org> User-Agent: Mutt/1.5.20 (2009-06-14) X-Barracuda-Connect: bld-mail13.adl6.internode.on.net[150.101.137.98] X-Barracuda-Start-Time: 1273190045 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0001 1.0000 -2.0204 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com 
X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29236 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Thu, May 06, 2010 at 03:12:18PM -0400, Christoph Hellwig wrote: > > Secondly, for delayed logging only, matching by transaction > > structure address triggers the failure because busy extents > > have a much longer life than the transaction structure. It is clear > > why the transaction ID matching didn't trip over - it would have > > triggered a log force in this situation, and hence blocked until > > the checkpoint that fs_mark-2742 had triggered was complete before > > redoing the rbtree insert. > > True, the busy extents get spliced over to the cil context, so they > outlive the transaction structure. > > > Right now I'm simply going to go back to using the transaction ID > > for matching transactions, even though the above analysis points out > > that even that is not as efficient as it could be for delayed > > logging. That is, we don't even need to force the log or have a > > synchronous transaction if the extent was first freed in the current > > checkpoint sequence. Doing that, however, requires pinning the > > checkpoint sequence (i.e. preventing a flush) until the current > > transaction commits. While that is in the plan for delayed logging, > > it is future functionality and hence I'm not going to attempt to > > design and implement it this close to 2.6.35-rc cycle. [*] > > Sounds fine to me. I'm not a fan of exporting the tid, but it > seems like there's no good way around it for now. Please make the > tid exporting a separate changeset so that it's easily revertable > once this is sorted out. 
Yeah, that's a good idea. > And documenting all this in comments in the code so that it's archived > would be very useful! Will do. Cheers, Dave. -- Dave Chinner david@fromorbit.com From SRS0+L8Ek+66+fromorbit.com=dave@internode.on.net Fri May 7 00:38:57 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.8 required=5.0 tests=BAYES_00,J_CHICKENPOX_65 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o475cvQS148795 for ; Fri, 7 May 2010 00:38:57 -0500 X-ASG-Debug-ID: 1273210865-63e901fe0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id BE4E69638F4 for ; Thu, 6 May 2010 22:41:06 -0700 (PDT) Received: from mail.internode.on.net (bld-mail19.adl2.internode.on.net [150.101.137.104]) by cuda.sgi.com with ESMTP id SxPb1cMAYAWmWS2N for ; Thu, 06 May 2010 22:41:06 -0700 (PDT) Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23295655-1927428 for ; Fri, 07 May 2010 15:11:03 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1OAGJ4-0006rB-51 for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:02 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1OAGJ2-00066F-HB for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:00 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 01/12] xfs: Don't reuse the same transaction ID for duplicated transactions. Subject: [PATCH 01/12] xfs: Don't reuse the same transaction ID for duplicated transactions. 
Date: Fri, 7 May 2010 15:40:49 +1000 Message-Id: <1273210860-23414-2-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273210860-23414-1-git-send-email-david@fromorbit.com> References: <1273210860-23414-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: bld-mail19.adl2.internode.on.net[150.101.137.104] X-Barracuda-Start-Time: 1273210867 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29248 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner The transaction ID is written into the log as the unique identifier for transactions during recovery. When duplicating a transaction, we reuse the log ticket, which means it has the same transaction ID as the previous transaction. Rather than regenerating a random transaction ID for the duplicated transaction, just add one to the current ID so that duplicated transactions can be easily spotted in the log and during recovery when diagnosing problems. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig --- fs/xfs/xfs_log.c | 9 +++++++++ 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 3038dd5..687b220 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -360,6 +360,15 @@ xfs_log_reserve( ASSERT(flags & XFS_LOG_PERM_RESERV); internal_ticket = *ticket; + /* + * this is a new transaction on the ticket, so we need to + * change the transaction ID so that the next transaction has a + * different TID in the log. 
Just add one to the existing tid + * so that we can see chains of rolling transactions in the log + * easily. + */ + internal_ticket->t_tid++; + trace_xfs_log_reserve(log, internal_ticket); xlog_grant_push_ail(mp, internal_ticket->t_unit_res); -- 1.5.6.5 From SRS0+KKyr+66+fromorbit.com=dave@internode.on.net Fri May 7 00:38:56 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o475cup2148792 for ; Fri, 7 May 2010 00:38:56 -0500 X-ASG-Debug-ID: 1273210865-66df01c80000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id E95A89638FA for ; Thu, 6 May 2010 22:41:06 -0700 (PDT) Received: from mail.internode.on.net (bld-mail12.adl6.internode.on.net [150.101.137.97]) by cuda.sgi.com with ESMTP id MRZALpYbNeukYAkv for ; Thu, 06 May 2010 22:41:06 -0700 (PDT) Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23442142-1927428 for ; Fri, 07 May 2010 15:11:03 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1OAGJ4-0006rA-4y for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:02 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1OAGJ2-00066D-F6 for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:00 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 0/12] xfs: delayed logging V5 Subject: [PATCH 0/12] xfs: delayed logging V5 Date: Fri, 7 May 2010 15:40:48 +1000 Message-Id: <1273210860-23414-1-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 X-Barracuda-Connect: bld-mail12.adl6.internode.on.net[150.101.137.97] X-Barracuda-Start-Time: 
1273210867 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-ASG-Whitelist: BODY (http://marc\.info/\?) X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Hi Folks, This is version 5 of the delayed logging series. I won't repeat everything about what it is, just point you here: http://marc.info/?l=linux-xfs&m=126862777118946&w=2 for the description, and here: git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfs.git delayed-logging for the current code. Note that this is a rebased branch, so you'll need to pull it again into a new branch to review. This version includes documentation updates and fixes to the busy extent tracking infrastructure. The patch series follows this mail to make it easier for people to respond to specific pieces of the code during review. I'm still making the entire patch set available through git, though. Changes from the previous versions: Version 5: 27 files changed, 2457 insertions(+), 513 deletions(-) Version 4: 26 files changed, 2351 insertions(+), 510 deletions(-) Version 3: 28 files changed, 2366 insertions(+), 506 deletions(-) Version 2: 22 files changed, 2188 insertions(+), 377 deletions(-) Version 1: 19 files changed, 2594 insertions(+), 580 deletions(-) Changes for V5: o fixed many typos in the design documentation - thanks to Nathan Scott for proof reading it. :) o found another transaction assert failure - un-reverting the change to transaction ID matching as the reason it avoided the assert failures is now known. (new commit for exporting the ticket ID). o added transaction ID to busy extent tracing o added lots of comments explaining the reason for needing transaction ID matching w/ delayed logging. 
o added overlap detection in busy extent inserts Changes for V4: o fixes duplicate transaction IDs on rolling transactions (new commit) o folded in a busy extent freeing cleanup from Christoph Hellwig o made API prefix consistent (xfs_alloc_busy_*) o combined xfs_alloc_mark_busy and xfs_alloc_busy_insert o reverted back to tracking transaction pointers in busy extents o removed exporting of transaction ID for busy extents o fixed an off-by-one in the extent range match in the busy extent search code that has been triggering assert failures o use list_splice_init() when splicing busy extents from the transaction to the checkpoint context to ensure we don't get transactions thinking they have busy extents to free after we've already attached them to the checkpoint. Changes for V3: o changed buffer log item reference counted model to be consistent for both logging modes o cleaned up XFS_BLI flags usage (new commit) o separated out log ticket overrun printing cleanup (new commit) o made sure "delaylog" option shows up in /proc/mounts o collapsed many of the intermediate commits together to make it easier to review o fixed inode buffer tagging issue that was causing shutdowns in log recovery in test 087 and 121 Changes for V2: o 22 files changed, 2188 insertions(+), 377 deletions(-) o fixed some memory leaks o fixed ticket allocation for checkpoints to use KM_NOFS o minor code cleanups o performed stress and scalability testing The following changes since commit 6ff75b78182c314112c1173edaab6c164c95d775: Christoph Hellwig (1): xfs: mark xfs_iomap_write_ helpers static are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfs.git delayed-logging Dave Chinner (12): xfs: Don't reuse the same transaction ID for duplicated transactions. 
xfs: allow log ticket allocation to take allocation flags xfs: modify buffer item reference counting xfs: Clean up XFS_BLI_* flag namespace xfs: clean up log ticket overrun debug output xfs: make the log ticket ID available outside the log infrastructure xfs: Improve scalability of busy extent tracking xfs: Delayed logging design documentation xfs: Introduce delayed logging core code xfs: forced unmounts need to push the CIL xfs: enable background pushing of the CIL xfs: Ensure inode allocation buffers are fully replayed .../filesystems/xfs-delayed-logging-design.txt | 819 ++++++++++++++++++++ fs/xfs/Makefile | 1 + fs/xfs/linux-2.6/xfs_buf.c | 11 +- fs/xfs/linux-2.6/xfs_quotaops.c | 1 + fs/xfs/linux-2.6/xfs_super.c | 12 +- fs/xfs/linux-2.6/xfs_trace.h | 83 ++- fs/xfs/quota/xfs_dquot.c | 6 +- fs/xfs/xfs_ag.h | 24 +- fs/xfs/xfs_alloc.c | 364 +++++++--- fs/xfs/xfs_alloc.h | 7 +- fs/xfs/xfs_alloc_btree.c | 2 +- fs/xfs/xfs_buf_item.c | 166 ++-- fs/xfs/xfs_buf_item.h | 18 +- fs/xfs/xfs_error.c | 2 +- fs/xfs/xfs_log.c | 123 +++- fs/xfs/xfs_log.h | 14 +- fs/xfs/xfs_log_cil.c | 733 ++++++++++++++++++ fs/xfs/xfs_log_priv.h | 118 +++- fs/xfs/xfs_log_recover.c | 46 +- fs/xfs/xfs_log_recover.h | 2 +- fs/xfs/xfs_mount.h | 1 + fs/xfs/xfs_trans.c | 195 ++++- fs/xfs/xfs_trans.h | 44 +- fs/xfs/xfs_trans_buf.c | 46 +- fs/xfs/xfs_trans_item.c | 114 +--- fs/xfs/xfs_trans_priv.h | 16 +- fs/xfs/xfs_types.h | 2 + 27 files changed, 2457 insertions(+), 513 deletions(-) create mode 100644 Documentation/filesystems/xfs-delayed-logging-design.txt create mode 100644 fs/xfs/xfs_log_cil.c From SRS0+V/pq+66+fromorbit.com=dave@internode.on.net Fri May 7 00:39:05 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: * X-Spam-Status: No, score=1.1 required=5.0 tests=BAYES_00,SUBJ_TICKET autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o475d5l2148826 for 
; Fri, 7 May 2010 00:39:05 -0500 X-ASG-Debug-ID: 1273210873-077501580000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 4F817318A32 for ; Thu, 6 May 2010 22:41:14 -0700 (PDT) Received: from mail.internode.on.net (bld-mail18.adl2.internode.on.net [150.101.137.103]) by cuda.sgi.com with ESMTP id hefrCQnZYFAMAo9J for ; Thu, 06 May 2010 22:41:14 -0700 (PDT) Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23359239-1927428 for ; Fri, 07 May 2010 15:11:13 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1OAGJ4-0006rH-CI for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:02 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1OAGJ2-00066Q-TI for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:00 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 05/12] xfs: clean up log ticket overrun debug output Subject: [PATCH 05/12] xfs: clean up log ticket overrun debug output Date: Fri, 7 May 2010 15:40:53 +1000 Message-Id: <1273210860-23414-6-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273210860-23414-1-git-send-email-david@fromorbit.com> References: <1273210860-23414-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: bld-mail18.adl2.internode.on.net[150.101.137.103] X-Barracuda-Start-Time: 1273210875 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29248 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 
X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner Push the error message output when a ticket overrun is detected into the ticket printing functions. Also remove the debug version of the code as the production version will still panic just as effectively on a debug kernel via the panic mask being set. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig --- fs/xfs/xfs_error.c | 2 +- fs/xfs/xfs_log.c | 19 +++++-------------- 2 files changed, 6 insertions(+), 15 deletions(-) diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c index ef96175..047b8a8 100644 --- a/fs/xfs/xfs_error.c +++ b/fs/xfs/xfs_error.c @@ -170,7 +170,7 @@ xfs_cmn_err(int panic_tag, int level, xfs_mount_t *mp, char *fmt, ...) va_list ap; #ifdef DEBUG - xfs_panic_mask |= XFS_PTAG_SHUTDOWN_CORRUPT; + xfs_panic_mask |= (XFS_PTAG_SHUTDOWN_CORRUPT | XFS_PTAG_LOGRES); #endif if (xfs_panic_mask && (xfs_panic_mask & panic_tag) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 83be6a6..1efb303 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -1645,6 +1645,10 @@ xlog_print_tic_res(xfs_mount_t *mp, xlog_ticket_t *ticket) "bad-rtype" : res_type_str[r_type-1]), ticket->t_res_arr[i].r_len); } + + xfs_cmn_err(XFS_PTAG_LOGRES, CE_ALERT, mp, + "xfs_log_write: reservation ran out. Need to up reservation"); + xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); } /* @@ -1897,21 +1901,8 @@ xlog_write( *start_lsn = 0; len = xlog_write_calc_vec_length(ticket, log_vector); - if (ticket->t_curr_res < len) { + if (ticket->t_curr_res < len) xlog_print_tic_res(log->l_mp, ticket); -#ifdef DEBUG - xlog_panic( - "xfs_log_write: reservation ran out. Need to up reservation"); -#else - /* Customer configurable panic */ - xfs_cmn_err(XFS_PTAG_LOGRES, CE_ALERT, log->l_mp, - "xfs_log_write: reservation ran out. 
Need to up reservation"); - - /* If we did not panic, shutdown the filesystem */ - xfs_force_shutdown(log->l_mp, SHUTDOWN_CORRUPT_INCORE); -#endif - } - ticket->t_curr_res -= len; index = 0; -- 1.5.6.5 From SRS0+KKyr+66+fromorbit.com=dave@internode.on.net Fri May 7 00:39:05 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: * X-Spam-Status: No, score=1.1 required=5.0 tests=BAYES_00,SUBJ_TICKET autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o475d5dd148834 for ; Fri, 7 May 2010 00:39:05 -0500 X-ASG-Debug-ID: 1273210874-574703400000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 032AD9638FB for ; Thu, 6 May 2010 22:41:15 -0700 (PDT) Received: from mail.internode.on.net (bld-mail12.adl6.internode.on.net [150.101.137.97]) by cuda.sgi.com with ESMTP id fcJANYZrLqe88Tc8 for ; Thu, 06 May 2010 22:41:15 -0700 (PDT) Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23442150-1927428 for ; Fri, 07 May 2010 15:11:13 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1OAGJ4-0006rJ-EI for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:02 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1OAGJ2-00066U-VI for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:00 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 06/12] xfs: make the log ticket ID available outside the log infrastructure Subject: [PATCH 06/12] xfs: make the log ticket ID available outside the log infrastructure Date: Fri, 7 May 2010 15:40:54 +1000 Message-Id: <1273210860-23414-7-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: 
<1273210860-23414-1-git-send-email-david@fromorbit.com> References: <1273210860-23414-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: bld-mail12.adl6.internode.on.net[150.101.137.97] X-Barracuda-Start-Time: 1273210877 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.52 X-Barracuda-Spam-Status: No, SCORE=-1.52 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=BSF_RULE7568M X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29248 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.50 BSF_RULE7568M Custom Rule 7568M X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner The ticket ID is needed to uniquely identify transactions when doing busy extent matching. Delayed logging changes the lifecycle of busy extents with respect to the transaction structure lifecycle. Hence we can no longer use the transaction structure as a means of determining the owner of the busy extent as it may be freed and reused while the busy extent is still active. This commit provides the infrastructure to access the xlog_tid_t held in the ticket from a transaction handle. This avoids the need for callers to peek into the transaction and log structures to find this out. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log.c | 7 +++++++ fs/xfs/xfs_log.h | 7 ++++++- fs/xfs/xfs_log_priv.h | 2 -- 3 files changed, 13 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 1efb303..19d0c5f 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -3312,6 +3312,13 @@ xfs_log_ticket_get( return ticket; } +xlog_tid_t +xfs_log_get_trans_ident( + struct xfs_trans *tp) +{ + return tp->t_ticket->t_tid; +} + /* * Allocate and initialise a new log ticket. 
*/ diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index 229d1f3..38af110 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -18,8 +18,10 @@ #ifndef __XFS_LOG_H__ #define __XFS_LOG_H__ -/* get lsn fields */ +/* transaction ID type */ +typedef __uint32_t xlog_tid_t; +/* get lsn fields */ #define CYCLE_LSN(lsn) ((uint)((lsn)>>32)) #define BLOCK_LSN(lsn) ((uint)(lsn)) @@ -134,6 +136,7 @@ struct xlog_in_core; struct xlog_ticket; struct xfs_log_item; struct xfs_item_ops; +struct xfs_trans; void xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item, @@ -190,6 +193,8 @@ void xlog_iodone(struct xfs_buf *); struct xlog_ticket * xfs_log_ticket_get(struct xlog_ticket *ticket); void xfs_log_ticket_put(struct xlog_ticket *ticket); +xlog_tid_t xfs_log_get_trans_ident(struct xfs_trans *tp); + #endif diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 9cf6951..ac97bdd 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -152,8 +152,6 @@ static inline uint xlog_get_client_id(__be32 i) #define XLOG_RECOVERY_NEEDED 0x4 /* log was recovered */ #define XLOG_IO_ERROR 0x8 /* log hit an I/O error, and being shutdown */ -typedef __uint32_t xlog_tid_t; - #ifdef __KERNEL__ /* -- 1.5.6.5 From SRS0+1K0y+66+fromorbit.com=dave@internode.on.net Fri May 7 00:39:06 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: * X-Spam-Status: No, score=1.1 required=5.0 tests=BAYES_00,SUBJ_TICKET autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o475d5Fk148836 for ; Fri, 7 May 2010 00:39:06 -0500 X-ASG-Debug-ID: 1273210874-38b400c10000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 3168F1370478 for ; Thu, 6 May 2010 22:41:14 -0700 (PDT) Received: from mail.internode.on.net 
(bld-mail13.adl6.internode.on.net [150.101.137.98]) by cuda.sgi.com with ESMTP id Zz885nI9sefibWLq for ; Thu, 06 May 2010 22:41:14 -0700 (PDT) Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23556234-1927428 for ; Fri, 07 May 2010 15:11:13 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1OAGJ4-0006rC-5U for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:02 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1OAGJ2-00066H-JK for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:00 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 02/12] xfs: allow log ticket allocation to take allocation flags Subject: [PATCH 02/12] xfs: allow log ticket allocation to take allocation flags Date: Fri, 7 May 2010 15:40:50 +1000 Message-Id: <1273210860-23414-3-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273210860-23414-1-git-send-email-david@fromorbit.com> References: <1273210860-23414-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: bld-mail13.adl6.internode.on.net[150.101.137.98] X-Barracuda-Start-Time: 1273210876 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29248 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner Delayed logging currently requires ticket allocation to succeed, so we need to be able to sleep on allocation. It also should not allow memory allocation to recurse into the filesystem. 
Hence we need to pass allocation flags directing the type of allocation the caller requires. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig --- fs/xfs/xfs_log.c | 16 ++++++++-------- 1 files changed, 8 insertions(+), 8 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 687b220..83be6a6 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -88,11 +88,9 @@ STATIC void xlog_ungrant_log_space(xlog_t *log, /* local ticket functions */ -STATIC xlog_ticket_t *xlog_ticket_alloc(xlog_t *log, - int unit_bytes, - int count, - char clientid, - uint flags); +STATIC xlog_ticket_t *xlog_ticket_alloc(xlog_t *log, int unit_bytes, int count, + char clientid, uint flags, + int alloc_flags); #if defined(DEBUG) STATIC void xlog_verify_dest_ptr(xlog_t *log, char *ptr); @@ -376,7 +374,8 @@ xfs_log_reserve( } else { /* may sleep if need to allocate more tickets */ internal_ticket = xlog_ticket_alloc(log, unit_bytes, cnt, - client, flags); + client, flags, + KM_SLEEP|KM_MAYFAIL); if (!internal_ticket) return XFS_ERROR(ENOMEM); internal_ticket->t_trans_type = t_type; @@ -3331,13 +3330,14 @@ xlog_ticket_alloc( int unit_bytes, int cnt, char client, - uint xflags) + uint xflags, + int alloc_flags) { struct xlog_ticket *tic; uint num_headers; int iclog_space; - tic = kmem_zone_zalloc(xfs_log_ticket_zone, KM_SLEEP|KM_MAYFAIL); + tic = kmem_zone_zalloc(xfs_log_ticket_zone, alloc_flags); if (!tic) return NULL; -- 1.5.6.5 From SRS0+1K0y+66+fromorbit.com=dave@internode.on.net Fri May 7 00:39:13 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-3.8 required=5.0 tests=BAYES_00,FRT_ADOBE2, J_CHICKENPOX_64,J_CHICKENPOX_65,LOCAL_GNU_PATCH autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o475dCq8148876 for ; Fri, 7 May 2010 00:39:13 -0500 X-ASG-Debug-ID: 1273210880-38b400c30000-NocioJ 
X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 917C71370485 for ; Thu, 6 May 2010 22:41:21 -0700 (PDT) Received: from mail.internode.on.net (bld-mail13.adl6.internode.on.net [150.101.137.98]) by cuda.sgi.com with ESMTP id cA2n9RAWwqlmR9vB for ; Thu, 06 May 2010 22:41:21 -0700 (PDT) Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23556247-1927428 for ; Fri, 07 May 2010 15:11:20 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1OAGJA-0006rV-JM for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:08 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1OAGJ2-00066O-RU for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:00 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 04/12] xfs: Clean up XFS_BLI_* flag namespace Subject: [PATCH 04/12] xfs: Clean up XFS_BLI_* flag namespace Date: Fri, 7 May 2010 15:40:52 +1000 Message-Id: <1273210860-23414-5-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273210860-23414-1-git-send-email-david@fromorbit.com> References: <1273210860-23414-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: bld-mail13.adl6.internode.on.net[150.101.137.98] X-Barracuda-Start-Time: 1273210882 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29248 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: 
Clean From: Dave Chinner Clean up the buffer log format (XFS_BLI_*) flags because they have a polluted namespace. The XFS_BLI_ prefix is used for both in-memory and on-disk flag fields, which have overlapping values for different flags. Rename the buffer log format flags to use the XFS_BLF_* prefix to avoid confusing them with the in-memory XFS_BLI_* prefixed flags. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig --- fs/xfs/linux-2.6/xfs_super.c | 2 +- fs/xfs/quota/xfs_dquot.c | 6 ++-- fs/xfs/xfs_buf_item.c | 42 +++++++++++++++++++------------------- fs/xfs/xfs_buf_item.h | 14 ++++++------ fs/xfs/xfs_log_recover.c | 46 +++++++++++++++++++++--------------------- fs/xfs/xfs_log_recover.h | 2 +- fs/xfs/xfs_trans_buf.c | 28 ++++++++++++------------ 7 files changed, 70 insertions(+), 70 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c index a43d09e..1e88c98 100644 --- a/fs/xfs/linux-2.6/xfs_super.c +++ b/fs/xfs/linux-2.6/xfs_super.c @@ -1753,7 +1753,7 @@ xfs_init_zones(void) * but it is much faster. */ xfs_buf_item_zone = kmem_zone_init((sizeof(xfs_buf_log_item_t) + - (((XFS_MAX_BLOCKSIZE / XFS_BLI_CHUNK) / + (((XFS_MAX_BLOCKSIZE / XFS_BLF_CHUNK) / NBWORD) * sizeof(int))), "xfs_buf_item"); if (!xfs_buf_item_zone) goto out_destroy_trans_zone; diff --git a/fs/xfs/quota/xfs_dquot.c b/fs/xfs/quota/xfs_dquot.c index b89ec5d..585e763 100644 --- a/fs/xfs/quota/xfs_dquot.c +++ b/fs/xfs/quota/xfs_dquot.c @@ -344,9 +344,9 @@ xfs_qm_init_dquot_blk( for (i = 0; i < q->qi_dqperchunk; i++, d++, curid++) xfs_qm_dqinit_core(curid, type, d); xfs_trans_dquot_buf(tp, bp, - (type & XFS_DQ_USER ? XFS_BLI_UDQUOT_BUF : - ((type & XFS_DQ_PROJ) ? XFS_BLI_PDQUOT_BUF : - XFS_BLI_GDQUOT_BUF))); + (type & XFS_DQ_USER ? XFS_BLF_UDQUOT_BUF : + ((type & XFS_DQ_PROJ) ? 
XFS_BLF_PDQUOT_BUF : + XFS_BLF_GDQUOT_BUF))); xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1); } diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index 4cd5f61..bcbb661 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -64,7 +64,7 @@ xfs_buf_item_log_debug( nbytes = last - first + 1; bfset(bip->bli_logged, first, nbytes); for (x = 0; x < nbytes; x++) { - chunk_num = byte >> XFS_BLI_SHIFT; + chunk_num = byte >> XFS_BLF_SHIFT; word_num = chunk_num >> BIT_TO_WORD_SHIFT; bit_num = chunk_num & (NBWORD - 1); wordp = &(bip->bli_format.blf_data_map[word_num]); @@ -166,7 +166,7 @@ xfs_buf_item_size( * cancel flag in it. */ trace_xfs_buf_item_size_stale(bip); - ASSERT(bip->bli_format.blf_flags & XFS_BLI_CANCEL); + ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL); return 1; } @@ -197,9 +197,9 @@ xfs_buf_item_size( } else if (next_bit != last_bit + 1) { last_bit = next_bit; nvecs++; - } else if (xfs_buf_offset(bp, next_bit * XFS_BLI_CHUNK) != - (xfs_buf_offset(bp, last_bit * XFS_BLI_CHUNK) + - XFS_BLI_CHUNK)) { + } else if (xfs_buf_offset(bp, next_bit * XFS_BLF_CHUNK) != + (xfs_buf_offset(bp, last_bit * XFS_BLF_CHUNK) + + XFS_BLF_CHUNK)) { last_bit = next_bit; nvecs++; } else { @@ -261,7 +261,7 @@ xfs_buf_item_format( * cancel flag in it. */ trace_xfs_buf_item_format_stale(bip); - ASSERT(bip->bli_format.blf_flags & XFS_BLI_CANCEL); + ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL); bip->bli_format.blf_size = nvecs; return; } @@ -294,28 +294,28 @@ xfs_buf_item_format( * keep counting and scanning. 
*/ if (next_bit == -1) { - buffer_offset = first_bit * XFS_BLI_CHUNK; + buffer_offset = first_bit * XFS_BLF_CHUNK; vecp->i_addr = xfs_buf_offset(bp, buffer_offset); - vecp->i_len = nbits * XFS_BLI_CHUNK; + vecp->i_len = nbits * XFS_BLF_CHUNK; vecp->i_type = XLOG_REG_TYPE_BCHUNK; nvecs++; break; } else if (next_bit != last_bit + 1) { - buffer_offset = first_bit * XFS_BLI_CHUNK; + buffer_offset = first_bit * XFS_BLF_CHUNK; vecp->i_addr = xfs_buf_offset(bp, buffer_offset); - vecp->i_len = nbits * XFS_BLI_CHUNK; + vecp->i_len = nbits * XFS_BLF_CHUNK; vecp->i_type = XLOG_REG_TYPE_BCHUNK; nvecs++; vecp++; first_bit = next_bit; last_bit = next_bit; nbits = 1; - } else if (xfs_buf_offset(bp, next_bit << XFS_BLI_SHIFT) != - (xfs_buf_offset(bp, last_bit << XFS_BLI_SHIFT) + - XFS_BLI_CHUNK)) { - buffer_offset = first_bit * XFS_BLI_CHUNK; + } else if (xfs_buf_offset(bp, next_bit << XFS_BLF_SHIFT) != + (xfs_buf_offset(bp, last_bit << XFS_BLF_SHIFT) + + XFS_BLF_CHUNK)) { + buffer_offset = first_bit * XFS_BLF_CHUNK; vecp->i_addr = xfs_buf_offset(bp, buffer_offset); - vecp->i_len = nbits * XFS_BLI_CHUNK; + vecp->i_len = nbits * XFS_BLF_CHUNK; vecp->i_type = XLOG_REG_TYPE_BCHUNK; /* You would think we need to bump the nvecs here too, but we do not * this number is used by recovery, and it gets confused by the boundary @@ -399,7 +399,7 @@ xfs_buf_item_unpin( ASSERT(XFS_BUF_VALUSEMA(bp) <= 0); ASSERT(!(XFS_BUF_ISDELAYWRITE(bp))); ASSERT(XFS_BUF_ISSTALE(bp)); - ASSERT(bip->bli_format.blf_flags & XFS_BLI_CANCEL); + ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL); trace_xfs_buf_item_unpin_stale(bip); /* @@ -550,7 +550,7 @@ xfs_buf_item_unlock( */ if (bip->bli_flags & XFS_BLI_STALE) { trace_xfs_buf_item_unlock_stale(bip); - ASSERT(bip->bli_format.blf_flags & XFS_BLI_CANCEL); + ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL); if (!aborted) { atomic_dec(&bip->bli_refcount); return; @@ -707,12 +707,12 @@ xfs_buf_item_init( } /* - * chunks is the number of XFS_BLI_CHUNK size pieces + * 
chunks is the number of XFS_BLF_CHUNK size pieces * the buffer can be divided into. Make sure not to * truncate any pieces. map_size is the size of the * bitmap needed to describe the chunks of the buffer. */ - chunks = (int)((XFS_BUF_COUNT(bp) + (XFS_BLI_CHUNK - 1)) >> XFS_BLI_SHIFT); + chunks = (int)((XFS_BUF_COUNT(bp) + (XFS_BLF_CHUNK - 1)) >> XFS_BLF_SHIFT); map_size = (int)((chunks + NBWORD) >> BIT_TO_WORD_SHIFT); bip = (xfs_buf_log_item_t*)kmem_zone_zalloc(xfs_buf_item_zone, @@ -780,8 +780,8 @@ xfs_buf_item_log( /* * Convert byte offsets to bit numbers. */ - first_bit = first >> XFS_BLI_SHIFT; - last_bit = last >> XFS_BLI_SHIFT; + first_bit = first >> XFS_BLF_SHIFT; + last_bit = last >> XFS_BLF_SHIFT; /* * Calculate the total number of bits to be set. diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h index df44545..8cbb82b 100644 --- a/fs/xfs/xfs_buf_item.h +++ b/fs/xfs/xfs_buf_item.h @@ -41,22 +41,22 @@ typedef struct xfs_buf_log_format { * This flag indicates that the buffer contains on disk inodes * and requires special recovery handling. */ -#define XFS_BLI_INODE_BUF 0x1 +#define XFS_BLF_INODE_BUF 0x1 /* * This flag indicates that the buffer should not be replayed * during recovery because its blocks are being freed. */ -#define XFS_BLI_CANCEL 0x2 +#define XFS_BLF_CANCEL 0x2 /* * This flag indicates that the buffer contains on disk * user or group dquots and may require special recovery handling. 
*/ -#define XFS_BLI_UDQUOT_BUF 0x4 -#define XFS_BLI_PDQUOT_BUF 0x8 -#define XFS_BLI_GDQUOT_BUF 0x10 +#define XFS_BLF_UDQUOT_BUF 0x4 +#define XFS_BLF_PDQUOT_BUF 0x8 +#define XFS_BLF_GDQUOT_BUF 0x10 -#define XFS_BLI_CHUNK 128 -#define XFS_BLI_SHIFT 7 +#define XFS_BLF_CHUNK 128 +#define XFS_BLF_SHIFT 7 #define BIT_TO_WORD_SHIFT 5 #define NBWORD (NBBY * sizeof(unsigned int)) diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 0de08e3..14a69ae 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -1576,7 +1576,7 @@ xlog_recover_reorder_trans( switch (ITEM_TYPE(item)) { case XFS_LI_BUF: - if (!(buf_f->blf_flags & XFS_BLI_CANCEL)) { + if (!(buf_f->blf_flags & XFS_BLF_CANCEL)) { trace_xfs_log_recover_item_reorder_head(log, trans, item, pass); list_move(&item->ri_list, &trans->r_itemq); @@ -1638,7 +1638,7 @@ xlog_recover_do_buffer_pass1( /* * If this isn't a cancel buffer item, then just return. */ - if (!(flags & XFS_BLI_CANCEL)) { + if (!(flags & XFS_BLF_CANCEL)) { trace_xfs_log_recover_buf_not_cancel(log, buf_f); return; } @@ -1696,7 +1696,7 @@ xlog_recover_do_buffer_pass1( * Check to see whether the buffer being recovered has a corresponding * entry in the buffer cancel record table. If it does then return 1 * so that it will be cancelled, otherwise return 0. If the buffer is - * actually a buffer cancel item (XFS_BLI_CANCEL is set), then decrement + * actually a buffer cancel item (XFS_BLF_CANCEL is set), then decrement * the refcount on the entry in the table and remove it from the table * if this is the last reference. * @@ -1721,7 +1721,7 @@ xlog_check_buffer_cancelled( * There is nothing in the table built in pass one, * so this buffer must not be cancelled. */ - ASSERT(!(flags & XFS_BLI_CANCEL)); + ASSERT(!(flags & XFS_BLF_CANCEL)); return 0; } @@ -1733,7 +1733,7 @@ xlog_check_buffer_cancelled( * There is no corresponding entry in the table built * in pass one, so this buffer has not been cancelled. 
*/ - ASSERT(!(flags & XFS_BLI_CANCEL)); + ASSERT(!(flags & XFS_BLF_CANCEL)); return 0; } @@ -1752,7 +1752,7 @@ xlog_check_buffer_cancelled( * one in the table and remove it if this is the * last reference. */ - if (flags & XFS_BLI_CANCEL) { + if (flags & XFS_BLF_CANCEL) { bcp->bc_refcount--; if (bcp->bc_refcount == 0) { if (prevp == NULL) { @@ -1772,7 +1772,7 @@ xlog_check_buffer_cancelled( * We didn't find a corresponding entry in the table, so * return 0 so that the buffer is NOT cancelled. */ - ASSERT(!(flags & XFS_BLI_CANCEL)); + ASSERT(!(flags & XFS_BLF_CANCEL)); return 0; } @@ -1874,8 +1874,8 @@ xlog_recover_do_inode_buffer( nbits = xfs_contig_bits(data_map, map_size, bit); ASSERT(nbits > 0); - reg_buf_offset = bit << XFS_BLI_SHIFT; - reg_buf_bytes = nbits << XFS_BLI_SHIFT; + reg_buf_offset = bit << XFS_BLF_SHIFT; + reg_buf_bytes = nbits << XFS_BLF_SHIFT; item_index++; } @@ -1889,7 +1889,7 @@ xlog_recover_do_inode_buffer( } ASSERT(item->ri_buf[item_index].i_addr != NULL); - ASSERT((item->ri_buf[item_index].i_len % XFS_BLI_CHUNK) == 0); + ASSERT((item->ri_buf[item_index].i_len % XFS_BLF_CHUNK) == 0); ASSERT((reg_buf_offset + reg_buf_bytes) <= XFS_BUF_COUNT(bp)); /* @@ -1955,9 +1955,9 @@ xlog_recover_do_reg_buffer( nbits = xfs_contig_bits(data_map, map_size, bit); ASSERT(nbits > 0); ASSERT(item->ri_buf[i].i_addr != NULL); - ASSERT(item->ri_buf[i].i_len % XFS_BLI_CHUNK == 0); + ASSERT(item->ri_buf[i].i_len % XFS_BLF_CHUNK == 0); ASSERT(XFS_BUF_COUNT(bp) >= - ((uint)bit << XFS_BLI_SHIFT)+(nbits<blf_flags & - (XFS_BLI_UDQUOT_BUF|XFS_BLI_PDQUOT_BUF|XFS_BLI_GDQUOT_BUF)) { + (XFS_BLF_UDQUOT_BUF|XFS_BLF_PDQUOT_BUF|XFS_BLF_GDQUOT_BUF)) { if (item->ri_buf[i].i_addr == NULL) { cmn_err(CE_ALERT, "XFS: NULL dquot in %s.", __func__); @@ -1987,9 +1987,9 @@ xlog_recover_do_reg_buffer( } memcpy(xfs_buf_offset(bp, - (uint)bit << XFS_BLI_SHIFT), /* dest */ + (uint)bit << XFS_BLF_SHIFT), /* dest */ item->ri_buf[i].i_addr, /* source */ - nbits<blf_flags & XFS_BLI_UDQUOT_BUF) + if 
(buf_f->blf_flags & XFS_BLF_UDQUOT_BUF) type |= XFS_DQ_USER; - if (buf_f->blf_flags & XFS_BLI_PDQUOT_BUF) + if (buf_f->blf_flags & XFS_BLF_PDQUOT_BUF) type |= XFS_DQ_PROJ; - if (buf_f->blf_flags & XFS_BLI_GDQUOT_BUF) + if (buf_f->blf_flags & XFS_BLF_GDQUOT_BUF) type |= XFS_DQ_GROUP; /* * This type of quotas was turned off, so ignore this buffer @@ -2173,7 +2173,7 @@ xlog_recover_do_dquot_buffer( * here which overlaps that may be stale. * * When meta-data buffers are freed at run time we log a buffer item - * with the XFS_BLI_CANCEL bit set to indicate that previous copies + * with the XFS_BLF_CANCEL bit set to indicate that previous copies * of the buffer in the log should not be replayed at recovery time. * This is so that if the blocks covered by the buffer are reused for * file data before we crash we don't end up replaying old, freed @@ -2207,7 +2207,7 @@ xlog_recover_do_buffer_trans( if (pass == XLOG_RECOVER_PASS1) { /* * In this pass we're only looking for buf items - * with the XFS_BLI_CANCEL bit set. + * with the XFS_BLF_CANCEL bit set. 
*/ xlog_recover_do_buffer_pass1(log, buf_f); return 0; @@ -2244,7 +2244,7 @@ xlog_recover_do_buffer_trans( mp = log->l_mp; buf_flags = XBF_LOCK; - if (!(flags & XFS_BLI_INODE_BUF)) + if (!(flags & XFS_BLF_INODE_BUF)) buf_flags |= XBF_MAPPED; bp = xfs_buf_read(mp->m_ddev_targp, blkno, len, buf_flags); @@ -2257,10 +2257,10 @@ xlog_recover_do_buffer_trans( } error = 0; - if (flags & XFS_BLI_INODE_BUF) { + if (flags & XFS_BLF_INODE_BUF) { error = xlog_recover_do_inode_buffer(mp, item, bp, buf_f); } else if (flags & - (XFS_BLI_UDQUOT_BUF|XFS_BLI_PDQUOT_BUF|XFS_BLI_GDQUOT_BUF)) { + (XFS_BLF_UDQUOT_BUF|XFS_BLF_PDQUOT_BUF|XFS_BLF_GDQUOT_BUF)) { xlog_recover_do_dquot_buffer(mp, log, item, bp, buf_f); } else { xlog_recover_do_reg_buffer(mp, item, bp, buf_f); diff --git a/fs/xfs/xfs_log_recover.h b/fs/xfs/xfs_log_recover.h index 75d7492..1c55ccb 100644 --- a/fs/xfs/xfs_log_recover.h +++ b/fs/xfs/xfs_log_recover.h @@ -28,7 +28,7 @@ #define XLOG_RHASH(tid) \ ((((__uint32_t)tid)>>XLOG_RHASH_SHIFT) & (XLOG_RHASH_SIZE-1)) -#define XLOG_MAX_REGIONS_IN_ITEM (XFS_MAX_BLOCKSIZE / XFS_BLI_CHUNK / 2 + 1) +#define XLOG_MAX_REGIONS_IN_ITEM (XFS_MAX_BLOCKSIZE / XFS_BLF_CHUNK / 2 + 1) /* diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c index 9cd8090..3390c3e 100644 --- a/fs/xfs/xfs_trans_buf.c +++ b/fs/xfs/xfs_trans_buf.c @@ -114,7 +114,7 @@ _xfs_trans_bjoin( xfs_buf_item_init(bp, tp->t_mountp); bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); ASSERT(!(bip->bli_flags & XFS_BLI_STALE)); - ASSERT(!(bip->bli_format.blf_flags & XFS_BLI_CANCEL)); + ASSERT(!(bip->bli_format.blf_flags & XFS_BLF_CANCEL)); ASSERT(!(bip->bli_flags & XFS_BLI_LOGGED)); if (reset_recur) bip->bli_recur = 0; @@ -511,7 +511,7 @@ xfs_trans_brelse(xfs_trans_t *tp, bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); ASSERT(bip->bli_item.li_type == XFS_LI_BUF); ASSERT(!(bip->bli_flags & XFS_BLI_STALE)); - ASSERT(!(bip->bli_format.blf_flags & XFS_BLI_CANCEL)); + ASSERT(!(bip->bli_format.blf_flags & 
XFS_BLF_CANCEL)); ASSERT(atomic_read(&bip->bli_refcount) > 0); /* @@ -619,7 +619,7 @@ xfs_trans_bhold(xfs_trans_t *tp, bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); ASSERT(!(bip->bli_flags & XFS_BLI_STALE)); - ASSERT(!(bip->bli_format.blf_flags & XFS_BLI_CANCEL)); + ASSERT(!(bip->bli_format.blf_flags & XFS_BLF_CANCEL)); ASSERT(atomic_read(&bip->bli_refcount) > 0); bip->bli_flags |= XFS_BLI_HOLD; trace_xfs_trans_bhold(bip); @@ -641,7 +641,7 @@ xfs_trans_bhold_release(xfs_trans_t *tp, bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); ASSERT(!(bip->bli_flags & XFS_BLI_STALE)); - ASSERT(!(bip->bli_format.blf_flags & XFS_BLI_CANCEL)); + ASSERT(!(bip->bli_format.blf_flags & XFS_BLF_CANCEL)); ASSERT(atomic_read(&bip->bli_refcount) > 0); ASSERT(bip->bli_flags & XFS_BLI_HOLD); bip->bli_flags &= ~XFS_BLI_HOLD; @@ -704,7 +704,7 @@ xfs_trans_log_buf(xfs_trans_t *tp, bip->bli_flags &= ~XFS_BLI_STALE; ASSERT(XFS_BUF_ISSTALE(bp)); XFS_BUF_UNSTALE(bp); - bip->bli_format.blf_flags &= ~XFS_BLI_CANCEL; + bip->bli_format.blf_flags &= ~XFS_BLF_CANCEL; } lidp = xfs_trans_find_item(tp, (xfs_log_item_t*)bip); @@ -762,8 +762,8 @@ xfs_trans_binval( ASSERT(!(XFS_BUF_ISDELAYWRITE(bp))); ASSERT(XFS_BUF_ISSTALE(bp)); ASSERT(!(bip->bli_flags & (XFS_BLI_LOGGED | XFS_BLI_DIRTY))); - ASSERT(!(bip->bli_format.blf_flags & XFS_BLI_INODE_BUF)); - ASSERT(bip->bli_format.blf_flags & XFS_BLI_CANCEL); + ASSERT(!(bip->bli_format.blf_flags & XFS_BLF_INODE_BUF)); + ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL); ASSERT(lidp->lid_flags & XFS_LID_DIRTY); ASSERT(tp->t_flags & XFS_TRANS_DIRTY); return; @@ -774,7 +774,7 @@ xfs_trans_binval( * in the buf log item. The STALE flag will be used in * xfs_buf_item_unpin() to determine if it should clean up * when the last reference to the buf item is given up. - * We set the XFS_BLI_CANCEL flag in the buf log format structure + * We set the XFS_BLF_CANCEL flag in the buf log format structure * and log the buf item. 
This will be used at recovery time * to determine that copies of the buffer in the log before * this should not be replayed. @@ -793,8 +793,8 @@ xfs_trans_binval( XFS_BUF_STALE(bp); bip->bli_flags |= XFS_BLI_STALE; bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_DIRTY); - bip->bli_format.blf_flags &= ~XFS_BLI_INODE_BUF; - bip->bli_format.blf_flags |= XFS_BLI_CANCEL; + bip->bli_format.blf_flags &= ~XFS_BLF_INODE_BUF; + bip->bli_format.blf_flags |= XFS_BLF_CANCEL; memset((char *)(bip->bli_format.blf_data_map), 0, (bip->bli_format.blf_map_size * sizeof(uint))); lidp->lid_flags |= XFS_LID_DIRTY; @@ -826,7 +826,7 @@ xfs_trans_inode_buf( bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); ASSERT(atomic_read(&bip->bli_refcount) > 0); - bip->bli_format.blf_flags |= XFS_BLI_INODE_BUF; + bip->bli_format.blf_flags |= XFS_BLF_INODE_BUF; } /* @@ -908,9 +908,9 @@ xfs_trans_dquot_buf( ASSERT(XFS_BUF_ISBUSY(bp)); ASSERT(XFS_BUF_FSPRIVATE2(bp, xfs_trans_t *) == tp); ASSERT(XFS_BUF_FSPRIVATE(bp, void *) != NULL); - ASSERT(type == XFS_BLI_UDQUOT_BUF || - type == XFS_BLI_PDQUOT_BUF || - type == XFS_BLI_GDQUOT_BUF); + ASSERT(type == XFS_BLF_UDQUOT_BUF || + type == XFS_BLF_PDQUOT_BUF || + type == XFS_BLF_GDQUOT_BUF); bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); ASSERT(atomic_read(&bip->bli_refcount) > 0); -- 1.5.6.5 From SRS0+V/pq+66+fromorbit.com=dave@internode.on.net Fri May 7 00:39:17 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o475dHsD148897 for ; Fri, 7 May 2010 00:39:17 -0500 X-ASG-Debug-ID: 1273210886-044e01830000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id C1F68318A37 for ; Thu, 6 May 2010 
22:41:26 -0700 (PDT) Received: from mail.internode.on.net (bld-mail15.adl6.internode.on.net [150.101.137.100]) by cuda.sgi.com with ESMTP id aLgmHDpLBmAaJvjC for ; Thu, 06 May 2010 22:41:26 -0700 (PDT) Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 11831201-1927428 for ; Fri, 07 May 2010 15:11:25 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1OAGJE-0006rf-Me for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:12 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1OAGJD-00066m-7i for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:11 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 11/12] xfs: enable background pushing of the CIL Subject: [PATCH 11/12] xfs: enable background pushing of the CIL Date: Fri, 7 May 2010 15:40:59 +1000 Message-Id: <1273210860-23414-12-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273210860-23414-1-git-send-email-david@fromorbit.com> References: <1273210860-23414-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: bld-mail15.adl6.internode.on.net[150.101.137.100] X-Barracuda-Start-Time: 1273210887 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29248 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner If we let the CIL grow without bound, it will grow large enough to violate recovery constraints (must be at least one complete transaction in the log at all times) or 
take forever to write out through the log buffers. Hence we need a check during asynchronous transactions as to whether the CIL needs to be pushed. We track the amount of log space the CIL consumes, so it is relatively simple to limit it on a pure size basis. Make the limit the minimum of just under half the log size (recovery constraint) or 8MB of log space (which is an awful lot of metadata). Signed-off-by: Dave Chinner --- fs/xfs/xfs_log_cil.c | 25 ++++++++++++++++++++++++- fs/xfs/xfs_log_priv.h | 45 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 69 insertions(+), 1 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 3cb1957..806cf6b 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -339,6 +339,7 @@ xfs_log_commit_cil( { struct log *log = mp->m_log; int log_flags = 0; + int push = 0; if (flags & XFS_TRANS_RELEASE_LOG_RES) log_flags = XFS_LOG_REL_PERM_RESERV; @@ -368,13 +369,26 @@ xfs_log_commit_cil( xfs_log_done(mp, tp->t_ticket, NULL, log_flags); xfs_trans_unreserve_and_mod_sb(tp); - /* background commit is allowed again */ + /* check for background commit */ + if (log->l_cilp->xc_ctx->space_used > XLOG_CIL_SPACE_LIMIT(log)) + push = 1; + up_read(&log->l_cilp->xc_ctx_lock); current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS); /* xfs_trans_free_items() unlocks them first */ xfs_trans_free_items(tp, *commit_lsn, 0); xfs_trans_free(tp); + + /* + * We need to push CIL every so often so we don't cache more than we + * can fit in the log. The limit really is that a checkpoint can't be + * more than half the log (the current checkpoint is not allowed to + * overwrite the previous checkpoint), but commit latency and memory + * usage limit this to a smaller size in most cases. 
+ */ + if (push) + xlog_cil_push(log, 0); return 0; } @@ -453,6 +467,15 @@ xlog_cil_push( return 0; } + /* check for spurious background flush */ + if (!push_now && + log->l_cilp->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) { + up_write(&cil->xc_ctx_lock); + xfs_log_ticket_put(new_ctx->ticket); + kmem_free(new_ctx); + return 0; + } + /* * pull all the log vectors off the items in the CIL, and * remove the items from the CIL. We don't need the CIL lock diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index f9a0e64..fa1aaa5 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -425,6 +425,51 @@ struct xfs_cil { }; /* + * The amount of log space we should the CIL to aggregate is difficult to size. + * Whatever we chose we have to make we can get a reservation for the log space + * effectively, that it is large enough to capture sufficient relogging to + * reduce log buffer IO significantly, but it is not too large for the log or + * induces too much latency when writing out through the iclogs. We track both + * space consumed and the number of vectors in the checkpoint context, so we + * need to decide which to use for limiting. + * + * Every log buffer we write out during a push needs a header reserved, which + * is at least one sector and more for v2 logs. Hence we need a reservation of + * at least 512 bytes per 32k of log space just for the LR headers. That means + * 16KB of reservation per megabyte of delayed logging space we will consume, + * plus various headers. The number of headers will vary based on the num of + * io vectors, so limiting on a specific number of vectors is going to result + * in transactions of varying size. IOWs, it is more consistent to track and + * limit space consumed in the log rather than by the number of objects being + * logged in order to prevent checkpoint ticket overruns. + * + * Further, use of static reservations through the log grant mechanism is + * problematic. 
It introduces a lot of complexity (e.g. reserve grant vs write + * grant) and a significant deadlock potential because regranting write space + * can block on log pushes. Hence if we have to regrant log space during a log + * push, we can deadlock. + * + * However, we can avoid this by use of a dynamic "reservation stealing" + * technique during transaction commit whereby unused reservation space in the + * transaction ticket is transferred to the CIL ctx commit ticket to cover the + * space needed by the checkpoint transaction. This means that we never need to + * specifically reserve space for the CIL checkpoint transaction, nor do we + * need to regrant space once the checkpoint completes. This also means the + * checkpoint transaction ticket is specific to the checkpoint context, rather + * than the CIL itself. + * + * With dynamic reservations, we can basically make up arbitrary limits for the + * checkpoint size so long as they don't violate any other size rules. Hence + * the initial maximum size for the checkpoint transaction will be set to a + * quarter of the log or 8MB, which ever is smaller. 8MB is an arbitrary limit + * right now based on the latency of writing out a large amount of data through + * the circular iclog buffers. + */ + +#define XLOG_CIL_SPACE_LIMIT(log) \ + (min((log->l_logsize >> 2), (8 * 1024 * 1024))) + +/* * The reservation head lsn is not made up of a cycle number and block number. * Instead, it uses a cycle number and byte number. 
Logs don't expect to * overflow 31 bits worth of byte offset, so using a byte number will mean -- 1.5.6.5 From SRS0+V/pq+66+fromorbit.com=dave@internode.on.net Fri May 7 00:39:18 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-4.9 required=5.0 tests=BAYES_00,LOCAL_GNU_PATCH autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o475dIQ9148905 for ; Fri, 7 May 2010 00:39:18 -0500 X-ASG-Debug-ID: 1273210886-045101860000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 1987A318A38 for ; Thu, 6 May 2010 22:41:26 -0700 (PDT) Received: from mail.internode.on.net (bld-mail15.adl6.internode.on.net [150.101.137.100]) by cuda.sgi.com with ESMTP id gtMjCLWSYf25lKZp for ; Thu, 06 May 2010 22:41:26 -0700 (PDT) Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 11831200-1927428 for ; Fri, 07 May 2010 15:11:25 +0930 (CST) Received: from [192.168.1.9] (helo=disturbed) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1OAGJO-0006rj-MM for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:22 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1OAGJD-00066p-9a for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:11 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 12/12] xfs: Ensure inode allocation buffers are fully replayed Subject: [PATCH 12/12] xfs: Ensure inode allocation buffers are fully replayed Date: Fri, 7 May 2010 15:41:00 +1000 Message-Id: <1273210860-23414-13-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273210860-23414-1-git-send-email-david@fromorbit.com> References: <1273210860-23414-1-git-send-email-david@fromorbit.com> MIME-Version: 1.0 
Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Barracuda-Connect: bld-mail15.adl6.internode.on.net[150.101.137.100] X-Barracuda-Start-Time: 1273210888 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29248 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner With delayed logging, we can get inode allocation buffers in the same transaction as inode unlink buffers. We don't currently mark inode allocation buffers in the log, so inode unlink buffers take precedence over allocation buffers. The result is that when they are combined into the same checkpoint, only the unlinked inode chain fields are replayed, resulting in uninitialised inode buffers being detected when the next inode modification is replayed. To fix this, we need to ensure that we do not set the inode buffer flag in the buffer log item format flags if the inode allocation has not already hit the log. To avoid requiring a change to log recovery, we really need to make this a modification that relies only on in-memory state. We can do this by checking during buffer log formatting (while the CIL cannot be flushed) if we are still in the same sequence when we commit the unlink transaction as the inode allocation transaction. If we are, then we do not add the inode buffer flag to the buffer log format item flags. This means the entire buffer will be replayed, not just the unlinked fields. 
We do this while CIL flushes are locked out to ensure that we don't race with the sequence numbers changing and hence fail to put the inode buffer flag in the buffer format flags when we really need to. Also, move an assert in the buffer release path outside the hash spinlock so that if the assert is hit the system continues to run in a debuggable state. Signed-off-by: Dave Chinner --- fs/xfs/linux-2.6/xfs_buf.c | 2 +- fs/xfs/xfs_buf_item.c | 14 ++++++++++++ fs/xfs/xfs_buf_item.h | 4 ++- fs/xfs/xfs_log.h | 1 + fs/xfs/xfs_log_cil.c | 48 ++++++++++++++++++++++++++++++++++++++++++- fs/xfs/xfs_trans.c | 16 +++++++++++-- fs/xfs/xfs_trans.h | 1 + fs/xfs/xfs_trans_buf.c | 20 +++++++++--------- 8 files changed, 89 insertions(+), 17 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c index 82678bf..e085eca 100644 --- a/fs/xfs/linux-2.6/xfs_buf.c +++ b/fs/xfs/linux-2.6/xfs_buf.c @@ -800,9 +800,9 @@ xfs_buf_rele( } else if (bp->b_flags & XBF_FS_MANAGED) { spin_unlock(&hash->bh_lock); } else { - ASSERT(!(bp->b_flags & (XBF_DELWRI|_XBF_DELWRI_Q))); list_del_init(&bp->b_hash_list); spin_unlock(&hash->bh_lock); + ASSERT(!(bp->b_flags & (XBF_DELWRI|_XBF_DELWRI_Q))); xfs_buf_free(bp); } } diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index bcbb661..02a8098 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -254,6 +254,20 @@ xfs_buf_item_format( vecp++; nvecs = 1; + /* + * If it is an inode buffer, transfer the in-memory state to the + * format flags and clear the in-memory state. We do not transfer + * this state if the inode buffer allocation has not yet been committed + * to the log as setting the XFS_BLI_INODE_BUF flag will prevent + * correct replay of the inode allocation. 
+ */ + if (bip->bli_flags & XFS_BLI_INODE_BUF) { + if (!((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) && + xfs_log_item_in_current_chkpt(&bip->bli_item))) + bip->bli_format.blf_flags |= XFS_BLF_INODE_BUF; + bip->bli_flags &= ~XFS_BLI_INODE_BUF; + } + if (bip->bli_flags & XFS_BLI_STALE) { /* * The buffer is stale, so all we need to log diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h index 8cbb82b..f20bb47 100644 --- a/fs/xfs/xfs_buf_item.h +++ b/fs/xfs/xfs_buf_item.h @@ -69,6 +69,7 @@ typedef struct xfs_buf_log_format { #define XFS_BLI_LOGGED 0x08 #define XFS_BLI_INODE_ALLOC_BUF 0x10 #define XFS_BLI_STALE_INODE 0x20 +#define XFS_BLI_INODE_BUF 0x40 #define XFS_BLI_FLAGS \ { XFS_BLI_HOLD, "HOLD" }, \ @@ -76,7 +77,8 @@ typedef struct xfs_buf_log_format { { XFS_BLI_STALE, "STALE" }, \ { XFS_BLI_LOGGED, "LOGGED" }, \ { XFS_BLI_INODE_ALLOC_BUF, "INODE_ALLOC" }, \ - { XFS_BLI_STALE_INODE, "STALE_INODE" } + { XFS_BLI_STALE_INODE, "STALE_INODE" }, \ + { XFS_BLI_INODE_BUF, "INODE_BUF" } #ifdef __KERNEL__ diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index 4a0c574..04c78e6 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -198,6 +198,7 @@ xlog_tid_t xfs_log_get_trans_ident(struct xfs_trans *tp); int xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp, struct xfs_log_vec *log_vector, xfs_lsn_t *commit_lsn, int flags); +bool xfs_log_item_in_current_chkpt(struct xfs_log_item *lip); #endif diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 806cf6b..f6733eb 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -201,6 +201,15 @@ xlog_cil_insert( ctx->nvecs += diff_iovecs; /* + * If this is the first time the item is being committed to the CIL, + * store the sequence number on the log item so we can tell + * in future commits whether this is the first checkpoint the item is + * being committed into. 
+ */ + if (!item->li_seq) + item->li_seq = ctx->sequence; + + /* * Now transfer enough transaction reservation to the context ticket * for the checkpoint. The context ticket is special - the unit * reservation has to grow as well as the current reservation as we @@ -328,6 +337,10 @@ xlog_cil_free_logvec( * For more specific information about the order of operations in * xfs_log_commit_cil() please refer to the comments in * xfs_trans_commit_iclog(). + * + * Called with the context lock already held in read mode to lock out + * background commit, returns without it held once background commits are + * allowed again. */ int xfs_log_commit_cil( @@ -346,11 +359,10 @@ xfs_log_commit_cil( if (XLOG_FORCED_SHUTDOWN(log)) { xlog_cil_free_logvec(log_vector); + up_read(&log->l_cilp->xc_ctx_lock); return XFS_ERROR(EIO); } - /* lock out background commit */ - down_read(&log->l_cilp->xc_ctx_lock); xlog_cil_format_items(log, log_vector, tp->t_ticket, commit_lsn); /* check we didn't blow the reservation */ @@ -687,3 +699,35 @@ restart: return commit_lsn; } +/* + * Check if the current log item was first committed in this sequence. + * We can't rely on just the log item being in the CIL, we have to check + * the recorded commit sequence number. + * + * Note: for this to be used in a non-racy manner, it has to be called with + * CIL flushing locked out. As a result, it should only be used during the + * transaction commit process when deciding what to format into the item. + */ +bool +xfs_log_item_in_current_chkpt( + struct xfs_log_item *lip) +{ + struct xfs_cil_ctx *ctx; + + if (!(lip->li_mountp->m_flags & XFS_MOUNT_DELAYLOG)) + return false; + if (list_empty(&lip->li_cil)) + return false; + + ctx = lip->li_mountp->m_log->l_cilp->xc_ctx; + + /* + * li_seq is written on the first commit of a log item to record the + * first checkpoint it is written to. Hence if it is different to the + * current sequence, we're in a new checkpoint. 
+ */ + if (XFS_LSN_CMP(lip->li_seq, ctx->sequence) != 0) + return false; + return true; +} + diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 9bdb492..3e88c3f 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -45,6 +45,7 @@ #include "xfs_trans_space.h" #include "xfs_inode_item.h" #include "xfs_trace.h" +#include "xfs_log_priv.h" kmem_zone_t *xfs_trans_zone; @@ -1261,10 +1262,19 @@ xfs_trans_commit_cil( return ENOMEM; /* - * Fill in the log_vector and pin the logged items, and - * then write the transaction to the log. We have to lock - * out CIL flushes from this point as we are going to pin + * Now we need to fill in the log_vector and pin the logged items, and + * then write the transaction to the log. + * + * Important: We have to lock out CIL flushes from this point as + * transferring state from the in memory log items to the log item + * headers during formatting may require atomicity against log writes + * to ensure that state is transferred to the log without racing + * against flushes. + * + * xfs_log_commit_cil() will release the lock as part of the commit + * process. 
*/ + down_read(&mp->m_log->l_cilp->xc_ctx_lock); xfs_trans_fill_log_vecs(tp, log_vector); return xfs_log_commit_cil(mp, tp, log_vector, commit_lsn, flags); diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h index b1ea20c..8c69e78 100644 --- a/fs/xfs/xfs_trans.h +++ b/fs/xfs/xfs_trans.h @@ -835,6 +835,7 @@ typedef struct xfs_log_item { /* delayed logging */ struct list_head li_cil; /* CIL pointers */ struct xfs_log_vec *li_lv; /* active log vector */ + xfs_lsn_t li_seq; /* CIL commit seq */ } xfs_log_item_t; #define XFS_LI_IN_AIL 0x1 diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c index 3390c3e..63d81a2 100644 --- a/fs/xfs/xfs_trans_buf.c +++ b/fs/xfs/xfs_trans_buf.c @@ -792,7 +792,7 @@ xfs_trans_binval( XFS_BUF_UNDELAYWRITE(bp); XFS_BUF_STALE(bp); bip->bli_flags |= XFS_BLI_STALE; - bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_DIRTY); + bip->bli_flags &= ~(XFS_BLI_INODE_BUF | XFS_BLI_LOGGED | XFS_BLI_DIRTY); bip->bli_format.blf_flags &= ~XFS_BLF_INODE_BUF; bip->bli_format.blf_flags |= XFS_BLF_CANCEL; memset((char *)(bip->bli_format.blf_data_map), 0, @@ -802,16 +802,16 @@ xfs_trans_binval( } /* - * This call is used to indicate that the buffer contains on-disk - * inodes which must be handled specially during recovery. They - * require special handling because only the di_next_unlinked from - * the inodes in the buffer should be recovered. The rest of the - * data in the buffer is logged via the inodes themselves. + * This call is used to indicate that the buffer contains on-disk inodes which + * must be handled specially during recovery. They require special handling + * because only the di_next_unlinked from the inodes in the buffer should be + * recovered. The rest of the data in the buffer is logged via the inodes + * themselves. * - * All we do is set the XFS_BLI_INODE_BUF flag in the buffer's log - * format structure so that we'll know what to do at recovery time. 
+ * All we do is set the XFS_BLI_INODE_BUF flag in the items flags so it can be + * transferred to the buffer's log format structure so that we'll know what to + * do at recovery time. */ -/* ARGSUSED */ void xfs_trans_inode_buf( xfs_trans_t *tp, @@ -826,7 +826,7 @@ xfs_trans_inode_buf( bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *); ASSERT(atomic_read(&bip->bli_refcount) > 0); - bip->bli_format.blf_flags |= XFS_BLF_INODE_BUF; + bip->bli_flags |= XFS_BLI_INODE_BUF; } /* -- 1.5.6.5 From SRS0+1K0y+66+fromorbit.com=dave@internode.on.net Fri May 7 00:39:20 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o475dKgR148915 for ; Fri, 7 May 2010 00:39:20 -0500 X-ASG-Debug-ID: 1273210889-0449018c0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 9E664318A3C for ; Thu, 6 May 2010 22:41:29 -0700 (PDT) Received: from mail.internode.on.net (bld-mail13.adl6.internode.on.net [150.101.137.98]) by cuda.sgi.com with ESMTP id txuk9a9suEHNbWCz for ; Thu, 06 May 2010 22:41:29 -0700 (PDT) Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23556264-1927428 for ; Fri, 07 May 2010 15:11:28 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1OAGJJ-0006rr-QY for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:17 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1OAGJD-00066j-5y for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:11 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 10/12] xfs: forced unmounts need to push the CIL Subject: [PATCH 10/12] xfs: 
forced unmounts need to push the CIL Date: Fri, 7 May 2010 15:40:58 +1000 Message-Id: <1273210860-23414-11-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273210860-23414-1-git-send-email-david@fromorbit.com> References: <1273210860-23414-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: bld-mail13.adl6.internode.on.net[150.101.137.98] X-Barracuda-Start-Time: 1273210890 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0026 1.0000 -2.0040 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.00 X-Barracuda-Spam-Status: No, SCORE=-2.00 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29248 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner If the filesystem is being shut down and there is no log error, the current code forces out the current log buffers. This code now needs to push the CIL before it forces out the log buffers to achieve the same result. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log.c | 15 +++++++++++++++ 1 files changed, 15 insertions(+), 0 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 23f2a05..7144a9e 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -3691,6 +3691,11 @@ xlog_state_ioerror( * c. nothing new gets queued up after (a) and (b) are done. * d. if !logerror, flush the iclogs to disk, then seal them off * for business. + * + * Note: for delayed logging the !logerror case needs to flush the regions + * held in memory out to the iclogs before flushing them to disk. This needs + * to be done before the log is marked as shutdown, otherwise the flush to the + * iclogs will fail. 
*/ int xfs_log_force_umount( @@ -3724,6 +3729,16 @@ xfs_log_force_umount( return 1; } retval = 0; + + /* + * Flush the in memory commit item list before marking the log as + * being shut down. We need to do it in this order to ensure all the + * completed transactions are flushed to disk with the xfs_log_force() + * call below. + */ + if (!logerror && (mp->m_flags & XFS_MOUNT_DELAYLOG)) + xlog_cil_push(log, 1); + /* * We must hold both the GRANT lock and the LOG lock, * before we mark the filesystem SHUTDOWN and wake -- 1.5.6.5 From SRS0+J6fy+66+fromorbit.com=dave@internode.on.net Fri May 7 00:39:22 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o475dLpY148934 for ; Fri, 7 May 2010 00:39:22 -0500 X-ASG-Debug-ID: 1273210864-045301780000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 782B0318A2E for ; Thu, 6 May 2010 22:41:05 -0700 (PDT) Received: from mail.internode.on.net (bld-mail17.adl2.internode.on.net [150.101.137.102]) by cuda.sgi.com with ESMTP id qCDbhUWBpw9QDd1U for ; Thu, 06 May 2010 22:41:05 -0700 (PDT) Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23267856-1927428 for ; Fri, 07 May 2010 15:11:03 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1OAGJ4-0006rD-87 for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:02 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1OAGJ2-00066K-Lh for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:00 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 03/12] xfs: modify buffer item 
reference counting Subject: [PATCH 03/12] xfs: modify buffer item reference counting Date: Fri, 7 May 2010 15:40:51 +1000 Message-Id: <1273210860-23414-4-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273210860-23414-1-git-send-email-david@fromorbit.com> References: <1273210860-23414-1-git-send-email-david@fromorbit.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Barracuda-Connect: bld-mail17.adl2.internode.on.net[150.101.137.102] X-Barracuda-Start-Time: 1273210866 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29248 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner The buffer log item reference counts used to take references for every transaction, similar to the pin counting. This is symmetric (like the pin/unpin) with respect to transaction completion, but with delayed logging becomes asymmetric as the pinning becomes asymmetric w.r.t. transaction completion. To make both cases the same, allow the buffer pinning to take a reference to the buffer log item and always drop the reference the transaction has on it when being unlocked. This is balanced correctly because the unpin operation always drops a reference to the log item. Hence reference counting becomes symmetric w.r.t. item pinning as well as w.r.t. active transactions and as a result the reference counting model remains consistent between normal and delayed logging. 
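The rule described above can be sketched as a small user-space model. This is an illustration only and not code from the patch: the struct and every model_* function are hypothetical names, and the sequence assumes that under delayed logging a second transaction relogs the item without pinning it again.

```c
#include <assert.h>

/*
 * Hypothetical user-space model of buffer log item reference counting.
 * None of these names exist in XFS; they only mirror the rules stated
 * in the commit message.
 */
struct bli_model {
	int refcount;	/* stands in for bli_refcount */
	int pinned;	/* number of outstanding pins */
};

/* A transaction takes one reference when it joins the item. */
static void model_trans_join(struct bli_model *bli)
{
	bli->refcount++;
}

/* Pinning takes its own reference so the item outlives the transaction. */
static void model_pin(struct bli_model *bli)
{
	assert(bli->refcount > 0);
	bli->refcount++;
	bli->pinned++;
}

/* Unlock always drops the transaction's reference, logged or not. */
static void model_trans_unlock(struct bli_model *bli)
{
	assert(bli->refcount > 0);
	bli->refcount--;
}

/* Unpin always drops the reference taken by the matching pin. */
static void model_unpin(struct bli_model *bli)
{
	assert(bli->pinned > 0);
	bli->pinned--;
	bli->refcount--;
}

/*
 * Assumed delayed-logging sequence: the first transaction logs and pins
 * the item, a second transaction relogs it without a new pin, and the
 * single unpin arrives after both transactions have unlocked.
 * Returns the final reference count.
 */
static int model_delayed_logging_sequence(void)
{
	struct bli_model bli = { 0, 0 };

	model_trans_join(&bli);
	model_pin(&bli);
	model_trans_unlock(&bli);	/* the pin reference keeps the item alive */
	assert(bli.refcount == 1);

	model_trans_join(&bli);		/* relog: no additional pin in this model */
	model_trans_unlock(&bli);
	assert(bli.refcount == 1 && bli.pinned == 1);

	model_unpin(&bli);		/* drops the last reference */
	return bli.refcount;
}
```

In this model the count is balanced by pairing join with unlock and pin with unpin, so it reaches zero however many transactions touch the item between the pin and the unpin.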
Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig --- fs/xfs/xfs_buf_item.c | 110 ++++++++++++++++++++++-------------------------- 1 files changed, 50 insertions(+), 60 deletions(-) diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index 240340a..4cd5f61 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -341,10 +341,15 @@ xfs_buf_item_format( } /* - * This is called to pin the buffer associated with the buf log - * item in memory so it cannot be written out. Simply call bpin() - * on the buffer to do this. + * This is called to pin the buffer associated with the buf log item in memory + * so it cannot be written out. Simply call bpin() on the buffer to do this. + * + * We also always take a reference to the buffer log item here so that the bli + * is held while the item is pinned in memory. This means that we can + * unconditionally drop the reference count a transaction holds when the + * transaction is completed. */ + STATIC void xfs_buf_item_pin( xfs_buf_log_item_t *bip) @@ -356,6 +361,7 @@ xfs_buf_item_pin( ASSERT(atomic_read(&bip->bli_refcount) > 0); ASSERT((bip->bli_flags & XFS_BLI_LOGGED) || (bip->bli_flags & XFS_BLI_STALE)); + atomic_inc(&bip->bli_refcount); trace_xfs_buf_item_pin(bip); xfs_bpin(bp); } @@ -489,20 +495,23 @@ xfs_buf_item_trylock( } /* - * Release the buffer associated with the buf log item. - * If there is no dirty logged data associated with the - * buffer recorded in the buf log item, then free the - * buf log item and remove the reference to it in the - * buffer. + * Release the buffer associated with the buf log item. If there is no dirty + * logged data associated with the buffer recorded in the buf log item, then + * free the buf log item and remove the reference to it in the buffer. * - * This call ignores the recursion count. It is only called - * when the buffer should REALLY be unlocked, regardless - * of the recursion count. + * This call ignores the recursion count. 
It is only called when the buffer + * should REALLY be unlocked, regardless of the recursion count. * - * If the XFS_BLI_HOLD flag is set in the buf log item, then - * free the log item if necessary but do not unlock the buffer. - * This is for support of xfs_trans_bhold(). Make sure the - * XFS_BLI_HOLD field is cleared if we don't free the item. + * We unconditionally drop the transaction's reference to the log item. If the + * item was logged, then another reference was taken when it was pinned, so we + * can safely drop the transaction reference now. This also allows us to avoid + * potential races with the unpin code freeing the bli by not referencing the + * bli after we've dropped the reference count. + * + * If the XFS_BLI_HOLD flag is set in the buf log item, then free the log item + * if necessary but do not unlock the buffer. This is for support of + * xfs_trans_bhold(). Make sure the XFS_BLI_HOLD field is cleared if we don't + * free the item. */ STATIC void xfs_buf_item_unlock( @@ -514,73 +523,54 @@ xfs_buf_item_unlock( bp = bip->bli_buf; - /* - * Clear the buffer's association with this transaction. - */ + /* Clear the buffer's association with this transaction. */ XFS_BUF_SET_FSPRIVATE2(bp, NULL); /* - * If this is a transaction abort, don't return early. - * Instead, allow the brelse to happen. - * Normally it would be done for stale (cancelled) buffers - * at unpin time, but we'll never go through the pin/unpin - * cycle if we abort inside commit. + * If this is a transaction abort, don't return early. Instead, allow + * the brelse to happen. Normally it would be done for stale + * (cancelled) buffers at unpin time, but we'll never go through the + * pin/unpin cycle if we abort inside commit. */ aborted = (bip->bli_item.li_flags & XFS_LI_ABORTED) != 0; /* - * If the buf item is marked stale, then don't do anything. - * We'll unlock the buffer and free the buf item when the - * buffer is unpinned for the last time. 
+ * Before possibly freeing the buf item, determine if we should + * release the buffer at the end of this routine. + */ + hold = bip->bli_flags & XFS_BLI_HOLD; + + /* Clear the per transaction state. */ + bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_HOLD); + + /* + * If the buf item is marked stale, then don't do anything. We'll + * unlock the buffer and free the buf item when the buffer is unpinned + * for the last time. */ if (bip->bli_flags & XFS_BLI_STALE) { - bip->bli_flags &= ~XFS_BLI_LOGGED; trace_xfs_buf_item_unlock_stale(bip); ASSERT(bip->bli_format.blf_flags & XFS_BLI_CANCEL); - if (!aborted) + if (!aborted) { + atomic_dec(&bip->bli_refcount); return; + } } - /* - * Drop the transaction's reference to the log item if - * it was not logged as part of the transaction. Otherwise - * we'll drop the reference in xfs_buf_item_unpin() when - * the transaction is really through with the buffer. - */ - if (!(bip->bli_flags & XFS_BLI_LOGGED)) { - atomic_dec(&bip->bli_refcount); - } else { - /* - * Clear the logged flag since this is per - * transaction state. - */ - bip->bli_flags &= ~XFS_BLI_LOGGED; - } - - /* - * Before possibly freeing the buf item, determine if we should - * release the buffer at the end of this routine. - */ - hold = bip->bli_flags & XFS_BLI_HOLD; trace_xfs_buf_item_unlock(bip); /* - * If the buf item isn't tracking any data, free it. - * Otherwise, if XFS_BLI_HOLD is set clear it. + * If the buf item isn't tracking any data, free it, otherwise drop the + * reference we hold to it. */ if (xfs_bitmap_empty(bip->bli_format.blf_data_map, - bip->bli_format.blf_map_size)) { + bip->bli_format.blf_map_size)) xfs_buf_item_relse(bp); - } else if (hold) { - bip->bli_flags &= ~XFS_BLI_HOLD; - } + else + atomic_dec(&bip->bli_refcount); - /* - * Release the buffer if XFS_BLI_HOLD was not set. 
- */ - if (!hold) { + if (!hold) xfs_buf_relse(bp); - } } /* -- 1.5.6.5 From SRS0+V/pq+66+fromorbit.com=dave@internode.on.net Fri May 7 00:39:21 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.7 required=5.0 tests=BAYES_00,J_CHICKENPOX_63, J_CHICKENPOX_64 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o475dLhM148929 for ; Fri, 7 May 2010 00:39:21 -0500 X-ASG-Debug-ID: 1273210888-63ee020b0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 1BA90963902 for ; Thu, 6 May 2010 22:41:29 -0700 (PDT) Received: from mail.internode.on.net (bld-mail16.adl2.internode.on.net [150.101.137.101]) by cuda.sgi.com with ESMTP id aGk7jjK98smAGyfJ for ; Thu, 06 May 2010 22:41:29 -0700 (PDT) Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23402973-1927428 for ; Fri, 07 May 2010 15:11:24 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1OAGJE-0006rX-Fl for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:12 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1OAGJC-00066W-WE for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:11 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 07/12] xfs: Improve scalability of busy extent tracking Subject: [PATCH 07/12] xfs: Improve scalability of busy extent tracking Date: Fri, 7 May 2010 15:40:55 +1000 Message-Id: <1273210860-23414-8-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273210860-23414-1-git-send-email-david@fromorbit.com> References: <1273210860-23414-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: 
bld-mail16.adl2.internode.on.net[150.101.137.101] X-Barracuda-Start-Time: 1273210891 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.52 X-Barracuda-Spam-Status: No, SCORE=-1.52 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=BSF_RULE7568M X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29248 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.50 BSF_RULE7568M Custom Rule 7568M X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean When we free a metadata extent, we record it in the per-AG busy extent array so that it is not re-used before the freeing transaction hits the disk. This array is fixed size, so when it overflows we make further allocation transactions synchronous because we cannot track more freed extents until those transactions hit the disk and are completed. Under heavy mixed allocation and freeing workloads with large log buffers, we can overflow this array quite easily. Further, the array is sparsely populated, which means that inserts need to search for a free slot, and array searches often have to search many more slots than are actually used to check all the busy extents. Quite inefficient, really. To enable this aspect of extent freeing to scale better, we need a structure that can grow dynamically. While in other areas of XFS we have used radix trees, the extents being freed are at random locations on disk so are better suited to being indexed by an rbtree. So, use a per-AG rbtree indexed by block number to track busy extents. This incurs a memory allocation when marking an extent busy, but should not occur too often in low memory situations. 
This should scale to an arbitrary number of extents so should not be a limitation for features such as in-memory aggregation of transactions. However, there are still situations where we can't avoid allocating busy extents (such as allocation from the AGFL). To minimise the overhead of such occurrences, we need to avoid doing a synchronous log force while holding the AGF locked to ensure that the previous transactions are safely on disk before we use the extent. We can do this by marking the transaction doing the allocation as synchronous rather than issuing a log force. Because of the locking involved and the ordering of transactions, the synchronous transaction provides the same guarantees as a synchronous log force because it ensures that all the prior transactions are already on disk when the synchronous transaction hits the disk. i.e. it preserves the free->allocate order of the extent correctly in recovery. By doing this, we avoid holding the AGF locked while log writes are in progress, hence reducing the length of time the lock is held and therefore we increase the rate at which we can allocate and free from the allocation group, thereby increasing overall throughput. The only problem with this approach is that when a metadata buffer is marked stale (e.g. a directory block is removed), the buffer remains pinned and locked until the log goes to disk. The issue here is that if that stale buffer is reallocated in a subsequent transaction, the attempt to lock that buffer in the transaction will hang waiting for the log to go to disk to unlock and unpin the buffer. Hence if someone tries to lock a pinned, stale, locked buffer we need to push on the log to get it unlocked ASAP. Effectively we are trading off a guaranteed log force for a much less common trigger for log force to occur. Ideally we should not reallocate busy extents. That is a much more complex fix to the problem as it involves direct intervention in the allocation btree searches in many places. 
This is left to a future set of modifications. Finally, now that we track busy extents in allocated memory, we don't need the descriptors in the transaction structure to point to them. We can replace the complex busy chunk infrastructure with a simple linked list of busy extents. This allows us to remove a large chunk of code, making the overall change a net reduction in code size. Signed-off-by: Dave Chinner --- fs/xfs/linux-2.6/xfs_buf.c | 9 + fs/xfs/linux-2.6/xfs_quotaops.c | 1 + fs/xfs/linux-2.6/xfs_trace.h | 83 ++++++--- fs/xfs/xfs_ag.h | 24 ++- fs/xfs/xfs_alloc.c | 364 ++++++++++++++++++++++++++++----------- fs/xfs/xfs_alloc.h | 7 +- fs/xfs/xfs_alloc_btree.c | 2 +- fs/xfs/xfs_log.h | 3 - fs/xfs/xfs_trans.c | 41 ++---- fs/xfs/xfs_trans.h | 35 +---- fs/xfs/xfs_trans_item.c | 109 ------------ fs/xfs/xfs_trans_priv.h | 4 - fs/xfs/xfs_types.h | 2 + 13 files changed, 361 insertions(+), 323 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c index 6873afc..82678bf 100644 --- a/fs/xfs/linux-2.6/xfs_buf.c +++ b/fs/xfs/linux-2.6/xfs_buf.c @@ -37,6 +37,7 @@ #include "xfs_sb.h" #include "xfs_inum.h" +#include "xfs_log.h" #include "xfs_ag.h" #include "xfs_dmapi.h" #include "xfs_mount.h" @@ -850,6 +851,12 @@ xfs_buf_lock_value( * Note that this in no way locks the underlying pages, so it is only * useful for synchronizing concurrent use of buffer objects, not for * synchronizing independent access to the underlying pages. + * + * If we come across a stale, pinned, locked buffer, we know that we + * are being asked to lock a buffer that has been reallocated. Because + * it is pinned, we know that the log has not been pushed to disk and + * hence it will still be locked. Rather than sleeping until someone + * else pushes the log, push it ourselves before trying to get the lock. 
*/ void xfs_buf_lock( @@ -857,6 +864,8 @@ xfs_buf_lock( { trace_xfs_buf_lock(bp, _RET_IP_); + if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE)) + xfs_log_force(bp->b_mount, 0); if (atomic_read(&bp->b_io_remaining)) blk_run_address_space(bp->b_target->bt_mapping); down(&bp->b_sema); diff --git a/fs/xfs/linux-2.6/xfs_quotaops.c b/fs/xfs/linux-2.6/xfs_quotaops.c index 1947514..2e73688 100644 --- a/fs/xfs/linux-2.6/xfs_quotaops.c +++ b/fs/xfs/linux-2.6/xfs_quotaops.c @@ -19,6 +19,7 @@ #include "xfs_dmapi.h" #include "xfs_sb.h" #include "xfs_inum.h" +#include "xfs_log.h" #include "xfs_ag.h" #include "xfs_mount.h" #include "xfs_quota.h" diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h index 8a319cf..ff6bc79 100644 --- a/fs/xfs/linux-2.6/xfs_trace.h +++ b/fs/xfs/linux-2.6/xfs_trace.h @@ -1059,83 +1059,112 @@ TRACE_EVENT(xfs_bunmap, ); +#define XFS_BUSY_SYNC \ + { 0, "async" }, \ + { 1, "sync" } + TRACE_EVENT(xfs_alloc_busy, - TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno, - xfs_extlen_t len, int slot), - TP_ARGS(mp, agno, agbno, len, slot), + TP_PROTO(struct xfs_trans *trans, xfs_agnumber_t agno, + xfs_agblock_t agbno, xfs_extlen_t len, int sync), + TP_ARGS(trans, agno, agbno, len, sync), TP_STRUCT__entry( __field(dev_t, dev) + __field(struct xfs_trans *, tp) + __field(int, tid) __field(xfs_agnumber_t, agno) __field(xfs_agblock_t, agbno) __field(xfs_extlen_t, len) - __field(int, slot) + __field(int, sync) ), TP_fast_assign( - __entry->dev = mp->m_super->s_dev; + __entry->dev = trans->t_mountp->m_super->s_dev; + __entry->tp = trans; + __entry->tid = trans->t_ticket->t_tid; __entry->agno = agno; __entry->agbno = agbno; __entry->len = len; - __entry->slot = slot; + __entry->sync = sync; ), - TP_printk("dev %d:%d agno %u agbno %u len %u slot %d", + TP_printk("dev %d:%d trans 0x%p tid 0x%x agno %u agbno %u len %u %s", MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->tp, + __entry->tid, __entry->agno, 
__entry->agbno, __entry->len, - __entry->slot) + __print_symbolic(__entry->sync, XFS_BUSY_SYNC)) ); -#define XFS_BUSY_STATES \ - { 0, "found" }, \ - { 1, "missing" } - TRACE_EVENT(xfs_alloc_unbusy, TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, - int slot, int found), - TP_ARGS(mp, agno, slot, found), + xfs_agblock_t agbno, xfs_extlen_t len), + TP_ARGS(mp, agno, agbno, len), TP_STRUCT__entry( __field(dev_t, dev) __field(xfs_agnumber_t, agno) - __field(int, slot) - __field(int, found) + __field(xfs_agblock_t, agbno) + __field(xfs_extlen_t, len) ), TP_fast_assign( __entry->dev = mp->m_super->s_dev; __entry->agno = agno; - __entry->slot = slot; - __entry->found = found; + __entry->agbno = agbno; + __entry->len = len; ), - TP_printk("dev %d:%d agno %u slot %d %s", + TP_printk("dev %d:%d agno %u agbno %u len %u", MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno, - __entry->slot, - __print_symbolic(__entry->found, XFS_BUSY_STATES)) + __entry->agbno, + __entry->len) ); +#define XFS_BUSY_STATES \ + { 0, "missing" }, \ + { 1, "found" } + TRACE_EVENT(xfs_alloc_busysearch, - TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno, - xfs_extlen_t len, xfs_lsn_t lsn), - TP_ARGS(mp, agno, agbno, len, lsn), + TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, + xfs_agblock_t agbno, xfs_extlen_t len, int found), + TP_ARGS(mp, agno, agbno, len, found), TP_STRUCT__entry( __field(dev_t, dev) __field(xfs_agnumber_t, agno) __field(xfs_agblock_t, agbno) __field(xfs_extlen_t, len) - __field(xfs_lsn_t, lsn) + __field(int, found) ), TP_fast_assign( __entry->dev = mp->m_super->s_dev; __entry->agno = agno; __entry->agbno = agbno; __entry->len = len; - __entry->lsn = lsn; + __entry->found = found; ), - TP_printk("dev %d:%d agno %u agbno %u len %u force lsn 0x%llx", + TP_printk("dev %d:%d agno %u agbno %u len %u %s", MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno, __entry->agbno, __entry->len, + __print_symbolic(__entry->found, XFS_BUSY_STATES)) +); + 
+TRACE_EVENT(xfs_trans_commit_lsn, + TP_PROTO(struct xfs_trans *trans), + TP_ARGS(trans), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(struct xfs_trans *, tp) + __field(xfs_lsn_t, lsn) + ), + TP_fast_assign( + __entry->dev = trans->t_mountp->m_super->s_dev; + __entry->tp = trans; + __entry->lsn = trans->t_commit_lsn; + ), + TP_printk("dev %d:%d trans 0x%p commit_lsn 0x%llx", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->tp, __entry->lsn) ); diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h index abb8222..401f364 100644 --- a/fs/xfs/xfs_ag.h +++ b/fs/xfs/xfs_ag.h @@ -175,14 +175,20 @@ typedef struct xfs_agfl { } xfs_agfl_t; /* - * Busy block/extent entry. Used in perag to mark blocks that have been freed - * but whose transactions aren't committed to disk yet. + * Busy block/extent entry. Indexed by a rbtree in perag to mark blocks that + * have been freed but whose transactions aren't committed to disk yet. + * + * Note that we use the transaction ID to record the transaction, not the + * transaction structure itself. See xfs_alloc_busy_insert() for details. 
*/ -typedef struct xfs_perag_busy { - xfs_agblock_t busy_start; - xfs_extlen_t busy_length; - struct xfs_trans *busy_tp; /* transaction that did the free */ -} xfs_perag_busy_t; +struct xfs_busy_extent { + struct rb_node rb_node; /* ag by-bno indexed search tree */ + struct list_head list; /* transaction busy extent list */ + xfs_agnumber_t agno; + xfs_agblock_t bno; + xfs_extlen_t length; + xlog_tid_t tid; /* transaction that created this */ +}; /* * Per-ag incore structure, copies of information in agf and agi, @@ -216,7 +222,8 @@ typedef struct xfs_perag { xfs_agino_t pagl_leftrec; xfs_agino_t pagl_rightrec; #ifdef __KERNEL__ - spinlock_t pagb_lock; /* lock for pagb_list */ + spinlock_t pagb_lock; /* lock for pagb_tree */ + struct rb_root pagb_tree; /* ordered tree of busy extents */ atomic_t pagf_fstrms; /* # of filestreams active in this AG */ @@ -226,7 +233,6 @@ typedef struct xfs_perag { int pag_ici_reclaimable; /* reclaimable inodes */ #endif int pagb_count; /* pagb slots in use */ - xfs_perag_busy_t pagb_list[XFS_PAGB_NUM_SLOTS]; /* unstable blocks */ } xfs_perag_t; /* diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c index 94cddbf..f8d592b 100644 --- a/fs/xfs/xfs_alloc.c +++ b/fs/xfs/xfs_alloc.c @@ -46,11 +46,9 @@ #define XFSA_FIXUP_BNO_OK 1 #define XFSA_FIXUP_CNT_OK 2 -STATIC void -xfs_alloc_search_busy(xfs_trans_t *tp, - xfs_agnumber_t agno, - xfs_agblock_t bno, - xfs_extlen_t len); +static int +xfs_alloc_busy_search(struct xfs_mount *mp, xfs_agnumber_t agno, + xfs_agblock_t bno, xfs_extlen_t len); /* * Prototypes for per-ag allocation routines @@ -540,9 +538,16 @@ xfs_alloc_ag_vextent( be32_to_cpu(agf->agf_length)); xfs_alloc_log_agf(args->tp, args->agbp, XFS_AGF_FREEBLKS); - /* search the busylist for these blocks */ - xfs_alloc_search_busy(args->tp, args->agno, - args->agbno, args->len); + /* + * Search the busylist for these blocks and mark the + * transaction as synchronous if blocks are found. 
This + * avoids the need to block in due to a synchronous log + * force to ensure correct ordering as the synchronous + * transaction will guarantee that for us. + */ + if (xfs_alloc_busy_search(args->mp, args->agno, + args->agbno, args->len)) + xfs_trans_set_sync(args->tp); } if (!args->isfl) xfs_trans_mod_sb(args->tp, @@ -1693,7 +1698,7 @@ xfs_free_ag_extent( * when the iclog commits to disk. If a busy block is allocated, * the iclog is pushed up to the LSN that freed the block. */ - xfs_alloc_mark_busy(tp, agno, bno, len); + xfs_alloc_busy_insert(tp, agno, bno, len); return 0; error0: @@ -1993,10 +1998,17 @@ xfs_alloc_get_freelist( * and remain there until the freeing transaction is committed to * disk. Now that we have allocated blocks, this list must be * searched to see if a block is being reused. If one is, then - * the freeing transaction must be pushed to disk NOW by forcing - * to disk all iclogs up that transaction's LSN. - */ - xfs_alloc_search_busy(tp, be32_to_cpu(agf->agf_seqno), bno, 1); + * the freeing transaction must be pushed to disk before this + * transaction. + * + * We do this by setting the current transaction + * to a sync transaction which guarantees that the freeing transaction + * is on disk before this transaction. This is done instead of a + * synchronous log force here so that we don't sit and wait with + * the AGF locked in the transaction during the log force. + */ + if (xfs_alloc_busy_search(mp, be32_to_cpu(agf->agf_seqno), bno, 1)) + xfs_trans_set_sync(tp); return 0; } @@ -2201,7 +2213,7 @@ xfs_alloc_read_agf( be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]); spin_lock_init(&pag->pagb_lock); pag->pagb_count = 0; - memset(pag->pagb_list, 0, sizeof(pag->pagb_list)); + pag->pagb_tree = RB_ROOT; pag->pagf_init = 1; } #ifdef DEBUG @@ -2479,127 +2491,273 @@ error0: * list is reused, the transaction that freed it must be forced to disk * before continuing to use the block. 
* - * xfs_alloc_mark_busy - add to the per-ag busy list - * xfs_alloc_clear_busy - remove an item from the per-ag busy list + * xfs_alloc_busy_insert - add to the per-ag busy list + * xfs_alloc_busy_clear - remove an item from the per-ag busy list + * xfs_alloc_busy_search - search for a busy extent + */ + +/* + * Insert a new extent into the busy tree. + * + * This is straightforward, except that we can get overlaps with existing busy + * extents, and even duplicate busy extents. There are two main cases we have + * to handle here. + * + * The first case is a transaction that triggers a "free - allocate - free" + * cycle. This can occur during btree manipulations as a btree block is freed + * to the freelist, then allocated from the freelist, then freed again. In + * this case, the second extent free is what triggers the duplicate and as such + * the transaction IDs should match. Because the extent was allocated in this + * transaction, the transaction must be marked as synchronous. This is true for + * all cases where the free/alloc/free occurs in the one transaction, hence the + * addition of the ASSERT(tp->t_flags & XFS_TRANS_SYNC) to this case. This + * serves to catch violations of the second case quite effectively. + * + * The second case is where the free/alloc/free occur in different + * transactions. In this case, we can't mark the extent busy immediately + * because it is already tracked in a transaction that may be committing. When + * the log commit completes, the busy extent will be removed from the tree. If + * we allow this busy insert to continue using that busy extent structure, it + * can be freed before this transaction is safely in the log. Hence our only + * option in this case is to force the log to remove the existing busy extent + * from the list before we insert the new one with the current transaction ID.
+ * + * The problem we are trying to avoid in the free-alloc-free in separate + * transactions is most easily described with a timeline: + * + * Thread 1 Thread 2 Thread 3 xfslogd + * xact alloc + * free X + * mark busy + * commit xact + * free xact + * xact alloc + * alloc X + * busy search + * mark xact sync + * commit xact + * free xact + * force log + * checkpoint starts + * .... + * xact alloc + * free X + * mark busy + * finds match + * *** KABOOM! *** + * .... + * log IO completes + * unbusy 1:91909 + * checkpoint completes + * + * By issuing a log force in thread 3 @ "KABOOM", the thread will block until + * the checkpoint completes, and the busy extent it matched will have been + * removed from the tree when it is woken. Hence it can then continue safely. + * + * However, to ensure this matching process is robust, we need to use the + * transaction ID for identifying the transaction, as delayed logging results in + * the busy extent and transaction lifecycles being different. i.e. the busy + * extent is active for a lot longer than the transaction. Hence the + * transaction structure can be freed and reallocated, then mark the same + * extent busy again in the new transaction. In this case the new transaction + * will have a different tid but can have the same address, and hence we need + * to check against the tid. + * + * Future: for delayed logging, we could avoid the log force if the extent was + * first freed in the current checkpoint sequence. This, however, requires the + * ability to pin the current checkpoint in memory until this transaction + * commits to ensure that both the original free and the current one combine + * logically into the one checkpoint. If the checkpoint sequences are + * different, however, we still need to wait on a log force.
*/ void -xfs_alloc_mark_busy(xfs_trans_t *tp, - xfs_agnumber_t agno, - xfs_agblock_t bno, - xfs_extlen_t len) +xfs_alloc_busy_insert( + struct xfs_trans *tp, + xfs_agnumber_t agno, + xfs_agblock_t bno, + xfs_extlen_t len) { - xfs_perag_busy_t *bsy; + struct xfs_busy_extent *new; + struct xfs_busy_extent *busyp; struct xfs_perag *pag; - int n; + struct rb_node **rbp; + struct rb_node *parent; + xfs_agblock_t uend, bend; + int match; - pag = xfs_perag_get(tp->t_mountp, agno); - spin_lock(&pag->pagb_lock); - /* search pagb_list for an open slot */ - for (bsy = pag->pagb_list, n = 0; - n < XFS_PAGB_NUM_SLOTS; - bsy++, n++) { - if (bsy->busy_tp == NULL) { - break; - } + new = kmem_zalloc(sizeof(struct xfs_busy_extent), KM_MAYFAIL); + if (!new) { + /* + * No Memory! Since it is now not possible to track the free + * block, make this a synchronous transaction to insure that + * the block is not reused before this transaction commits. + */ + trace_xfs_alloc_busy(tp, agno, bno, len, 1); + xfs_trans_set_sync(tp); + return; } - trace_xfs_alloc_busy(tp->t_mountp, agno, bno, len, n); + new->agno = agno; + new->bno = bno; + new->length = len; + new->tid = xfs_log_get_trans_ident(tp); - if (n < XFS_PAGB_NUM_SLOTS) { - bsy = &pag->pagb_list[n]; - pag->pagb_count++; - bsy->busy_start = bno; - bsy->busy_length = len; - bsy->busy_tp = tp; - xfs_trans_add_busy(tp, agno, n); - } else { + INIT_LIST_HEAD(&new->list); + + /* trace before insert to be able to see failed inserts */ + trace_xfs_alloc_busy(tp, agno, bno, len, 0); + + pag = xfs_perag_get(tp->t_mountp, new->agno); + uend = bno + len - 1; +restart: + spin_lock(&pag->pagb_lock); + rbp = &pag->pagb_tree.rb_node; + parent = NULL; + busyp = NULL; + match = 0; + while (*rbp) { + parent = *rbp; + busyp = rb_entry(parent, struct xfs_busy_extent, rb_node); + bend = busyp->bno + busyp->length - 1; + + if (new->bno < busyp->bno) { + /* may overlap, but exact start block is lower */ + rbp = &(*rbp)->rb_left; + if (uend >= busyp->bno) { + 
if (busyp->tid != new->tid) + match = -1; + else if (match >= 0) + match = 1; + } + } else if (new->bno > busyp->bno) { + /* may overlap, but exact start block is higher */ + rbp = &(*rbp)->rb_right; + if (bno <= bend) { + if (busyp->tid != new->tid) + match = -1; + else if (match >= 0) + match = 1; + } + } else { + if (busyp->tid != new->tid) + match = -1; + else if (match >= 0) + match = 1; + break; + } + busyp = NULL; + } + if (match < 0) { + /* overlap marked busy in different transaction */ + spin_unlock(&pag->pagb_lock); + xfs_log_force(tp->t_mountp, XFS_LOG_SYNC); + goto restart; + } + if (match > 0) { /* - * The busy list is full! Since it is now not possible to - * track the free block, make this a synchronous transaction - * to insure that the block is not reused before this - * transaction commits. + * overlap marked busy in same transaction. Update if exact + * start block match, otherwise combine the busy extents into + * a single range. */ - xfs_trans_set_sync(tp); + if (busyp->bno == new->bno) { + busyp->length = max(busyp->length, new->length); + spin_unlock(&pag->pagb_lock); + ASSERT(tp->t_flags & XFS_TRANS_SYNC); + xfs_perag_put(pag); + kmem_free(new); + return; + } + rb_erase(&busyp->rb_node, &pag->pagb_tree); + new->length = max(busyp->bno + busyp->length, + new->bno + new->length) - + min(busyp->bno, new->bno); + new->bno = min(busyp->bno, new->bno); } + rb_link_node(&new->rb_node, parent, rbp); + rb_insert_color(&new->rb_node, &pag->pagb_tree); + + list_add(&new->list, &tp->t_busy); spin_unlock(&pag->pagb_lock); xfs_perag_put(pag); + kmem_free(busyp); } -void -xfs_alloc_clear_busy(xfs_trans_t *tp, - xfs_agnumber_t agno, - int idx) +/* + * Search for a busy extent within the range of the extent we are about to + * allocate. You need to be holding the busy extent tree lock when calling + * xfs_alloc_busy_search(). 
This function returns 0 for no overlapping busy + * extent, -1 for an overlapping but not exact busy extent, and 1 for an exact + * match. This is done so that a non-zero return indicates an overlap that + * will require a synchronous transaction, but it can still be + * used to distinguish between a partial or exact match. + */ +static int +xfs_alloc_busy_search( + struct xfs_mount *mp, + xfs_agnumber_t agno, + xfs_agblock_t bno, + xfs_extlen_t len) { struct xfs_perag *pag; - xfs_perag_busy_t *list; + struct rb_node *rbp; + xfs_agblock_t uend, bend; + struct xfs_busy_extent *busyp; + int match = 0; - ASSERT(idx < XFS_PAGB_NUM_SLOTS); - pag = xfs_perag_get(tp->t_mountp, agno); + pag = xfs_perag_get(mp, agno); spin_lock(&pag->pagb_lock); - list = pag->pagb_list; - trace_xfs_alloc_unbusy(tp->t_mountp, agno, idx, list[idx].busy_tp == tp); - - if (list[idx].busy_tp == tp) { - list[idx].busy_tp = NULL; - pag->pagb_count--; + uend = bno + len - 1; + rbp = pag->pagb_tree.rb_node; + + /* find closest start bno overlap */ + while (rbp) { + busyp = rb_entry(rbp, struct xfs_busy_extent, rb_node); + bend = busyp->bno + busyp->length - 1; + if (bno < busyp->bno) { + /* may overlap, but exact start block is lower */ + if (uend >= busyp->bno) + match = -1; + rbp = rbp->rb_left; + } else if (bno > busyp->bno) { + /* may overlap, but exact start block is higher */ + if (bno <= bend) + match = -1; + rbp = rbp->rb_right; + } else { + /* bno matches busyp, length determines exact match */ + match = (busyp->length == len) ? 1 : -1; + break; + } } - spin_unlock(&pag->pagb_lock); + trace_xfs_alloc_busysearch(mp, agno, bno, len, !!match); xfs_perag_put(pag); + return match; } - -/* - * If we find the extent in the busy list, force the log out to get the - * extent out of the busy list so the caller can use it straight away. 
- */ -STATIC void -xfs_alloc_search_busy(xfs_trans_t *tp, - xfs_agnumber_t agno, - xfs_agblock_t bno, - xfs_extlen_t len) +void +xfs_alloc_busy_clear( + struct xfs_mount *mp, + struct xfs_busy_extent *busyp) { struct xfs_perag *pag; - xfs_perag_busy_t *bsy; - xfs_agblock_t uend, bend; - xfs_lsn_t lsn = 0; - int cnt; - pag = xfs_perag_get(tp->t_mountp, agno); - spin_lock(&pag->pagb_lock); - cnt = pag->pagb_count; + trace_xfs_alloc_unbusy(mp, busyp->agno, busyp->bno, + busyp->length); - /* - * search pagb_list for this slot, skipping open slots. We have to - * search the entire array as there may be multiple overlaps and - * we have to get the most recent LSN for the log force to push out - * all the transactions that span the range. - */ - uend = bno + len - 1; - for (cnt = 0; cnt < pag->pagb_count; cnt++) { - bsy = &pag->pagb_list[cnt]; - if (!bsy->busy_tp) - continue; - - bend = bsy->busy_start + bsy->busy_length - 1; - if (bno > bend || uend < bsy->busy_start) - continue; - - /* (start1,length1) within (start2, length2) */ - if (XFS_LSN_CMP(bsy->busy_tp->t_commit_lsn, lsn) > 0) - lsn = bsy->busy_tp->t_commit_lsn; - } + ASSERT(xfs_alloc_busy_search(mp, busyp->agno, busyp->bno, + busyp->length) == 1); + + list_del_init(&busyp->list); + + pag = xfs_perag_get(mp, busyp->agno); + spin_lock(&pag->pagb_lock); + rb_erase(&busyp->rb_node, &pag->pagb_tree); spin_unlock(&pag->pagb_lock); xfs_perag_put(pag); - trace_xfs_alloc_busysearch(tp->t_mountp, agno, bno, len, lsn); - /* - * If a block was found, force the log through the LSN of the - * transaction that freed the block - */ - if (lsn) - xfs_log_force_lsn(tp->t_mountp, lsn, XFS_LOG_SYNC); + kmem_free(busyp); } diff --git a/fs/xfs/xfs_alloc.h b/fs/xfs/xfs_alloc.h index 599bffa..6d05199 100644 --- a/fs/xfs/xfs_alloc.h +++ b/fs/xfs/xfs_alloc.h @@ -22,6 +22,7 @@ struct xfs_buf; struct xfs_mount; struct xfs_perag; struct xfs_trans; +struct xfs_busy_extent; /* * Freespace allocation types. Argument to xfs_alloc_[v]extent. 
@@ -119,15 +120,13 @@ xfs_alloc_longest_free_extent(struct xfs_mount *mp, #ifdef __KERNEL__ void -xfs_alloc_mark_busy(xfs_trans_t *tp, +xfs_alloc_busy_insert(xfs_trans_t *tp, xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len); void -xfs_alloc_clear_busy(xfs_trans_t *tp, - xfs_agnumber_t ag, - int idx); +xfs_alloc_busy_clear(struct xfs_mount *mp, struct xfs_busy_extent *busyp); #endif /* __KERNEL__ */ diff --git a/fs/xfs/xfs_alloc_btree.c b/fs/xfs/xfs_alloc_btree.c index b726e10..83f4942 100644 --- a/fs/xfs/xfs_alloc_btree.c +++ b/fs/xfs/xfs_alloc_btree.c @@ -134,7 +134,7 @@ xfs_allocbt_free_block( * disk. If a busy block is allocated, the iclog is pushed up to the * LSN that freed the block. */ - xfs_alloc_mark_busy(cur->bc_tp, be32_to_cpu(agf->agf_seqno), bno, 1); + xfs_alloc_busy_insert(cur->bc_tp, be32_to_cpu(agf->agf_seqno), bno, 1); xfs_trans_agbtree_delta(cur->bc_tp, -1); return 0; } diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index 38af110..05f205a 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -18,9 +18,6 @@ #ifndef __XFS_LOG_H__ #define __XFS_LOG_H__ -/* transaction ID type */ -typedef __uint32_t xlog_tid_t; - /* get lsn fields */ #define CYCLE_LSN(lsn) ((uint)((lsn)>>32)) #define BLOCK_LSN(lsn) ((uint)(lsn)) diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index be578ec..40d9595 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -44,6 +44,7 @@ #include "xfs_trans_priv.h" #include "xfs_trans_space.h" #include "xfs_inode_item.h" +#include "xfs_trace.h" kmem_zone_t *xfs_trans_zone; @@ -243,9 +244,8 @@ _xfs_trans_alloc( tp->t_type = type; tp->t_mountp = mp; tp->t_items_free = XFS_LIC_NUM_SLOTS; - tp->t_busy_free = XFS_LBC_NUM_SLOTS; xfs_lic_init(&(tp->t_items)); - XFS_LBC_INIT(&(tp->t_busy)); + INIT_LIST_HEAD(&tp->t_busy); return tp; } @@ -255,8 +255,13 @@ _xfs_trans_alloc( */ STATIC void xfs_trans_free( - xfs_trans_t *tp) + struct xfs_trans *tp) { + struct xfs_busy_extent *busyp, *n; + + list_for_each_entry_safe(busyp, n, 
&tp->t_busy, list) + xfs_alloc_busy_clear(tp->t_mountp, busyp); + atomic_dec(&tp->t_mountp->m_active_trans); xfs_trans_free_dqinfo(tp); kmem_zone_free(xfs_trans_zone, tp); @@ -285,9 +290,8 @@ xfs_trans_dup( ntp->t_type = tp->t_type; ntp->t_mountp = tp->t_mountp; ntp->t_items_free = XFS_LIC_NUM_SLOTS; - ntp->t_busy_free = XFS_LBC_NUM_SLOTS; xfs_lic_init(&(ntp->t_items)); - XFS_LBC_INIT(&(ntp->t_busy)); + INIT_LIST_HEAD(&ntp->t_busy); ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); ASSERT(tp->t_ticket != NULL); @@ -423,7 +427,6 @@ undo_blocks: return error; } - /* * Record the indicated change to the given field for application * to the file system's superblock when the transaction commits. @@ -930,26 +933,6 @@ xfs_trans_item_committed( IOP_UNPIN(lip); } -/* Clear all the per-AG busy list items listed in this transaction */ -static void -xfs_trans_clear_busy_extents( - struct xfs_trans *tp) -{ - xfs_log_busy_chunk_t *lbcp; - xfs_log_busy_slot_t *lbsp; - int i; - - for (lbcp = &tp->t_busy; lbcp != NULL; lbcp = lbcp->lbc_next) { - i = 0; - for (lbsp = lbcp->lbc_busy; i < lbcp->lbc_unused; i++, lbsp++) { - if (XFS_LBC_ISFREE(lbcp, i)) - continue; - xfs_alloc_clear_busy(tp, lbsp->lbc_ag, lbsp->lbc_idx); - } - } - xfs_trans_free_busy(tp); -} - /* * This is typically called by the LM when a transaction has been fully * committed to disk. 
It needs to unpin the items which have @@ -984,7 +967,6 @@ xfs_trans_committed( kmem_free(licp); } - xfs_trans_clear_busy_extents(tp); xfs_trans_free(tp); } @@ -1013,7 +995,6 @@ xfs_trans_uncommit( xfs_trans_unreserve_and_mod_dquots(tp); xfs_trans_free_items(tp, flags); - xfs_trans_free_busy(tp); xfs_trans_free(tp); } @@ -1075,6 +1056,8 @@ xfs_trans_commit_iclog( *commit_lsn = xfs_log_done(mp, tp->t_ticket, &commit_iclog, log_flags); tp->t_commit_lsn = *commit_lsn; + trace_xfs_trans_commit_lsn(tp); + if (nvec > XFS_TRANS_LOGVEC_COUNT) kmem_free(log_vector); @@ -1260,7 +1243,6 @@ out_unreserve: } current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS); xfs_trans_free_items(tp, error ? XFS_TRANS_ABORT : 0); - xfs_trans_free_busy(tp); xfs_trans_free(tp); XFS_STATS_INC(xs_trans_empty); @@ -1339,7 +1321,6 @@ xfs_trans_cancel( current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS); xfs_trans_free_items(tp, flags); - xfs_trans_free_busy(tp); xfs_trans_free(tp); } diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h index c62beee..ff7e9e6 100644 --- a/fs/xfs/xfs_trans.h +++ b/fs/xfs/xfs_trans.h @@ -813,6 +813,7 @@ struct xfs_log_item_desc; struct xfs_mount; struct xfs_trans; struct xfs_dquot_acct; +struct xfs_busy_extent; typedef struct xfs_log_item { struct list_head li_ail; /* AIL pointers */ @@ -872,34 +873,6 @@ typedef struct xfs_item_ops { #define XFS_ITEM_PUSHBUF 3 /* - * This structure is used to maintain a list of block ranges that have been - * freed in the transaction. The ranges are listed in the perag[] busy list - * between when they're freed and the transaction is committed to disk. 
- */ - -typedef struct xfs_log_busy_slot { - xfs_agnumber_t lbc_ag; - ushort lbc_idx; /* index in perag.busy[] */ -} xfs_log_busy_slot_t; - -#define XFS_LBC_NUM_SLOTS 31 -typedef struct xfs_log_busy_chunk { - struct xfs_log_busy_chunk *lbc_next; - uint lbc_free; /* free slots bitmask */ - ushort lbc_unused; /* first unused */ - xfs_log_busy_slot_t lbc_busy[XFS_LBC_NUM_SLOTS]; -} xfs_log_busy_chunk_t; - -#define XFS_LBC_MAX_SLOT (XFS_LBC_NUM_SLOTS - 1) -#define XFS_LBC_FREEMASK ((1U << XFS_LBC_NUM_SLOTS) - 1) - -#define XFS_LBC_INIT(cp) ((cp)->lbc_free = XFS_LBC_FREEMASK) -#define XFS_LBC_CLAIM(cp, slot) ((cp)->lbc_free &= ~(1 << (slot))) -#define XFS_LBC_SLOT(cp, slot) (&((cp)->lbc_busy[(slot)])) -#define XFS_LBC_VACANCY(cp) (((cp)->lbc_free) & XFS_LBC_FREEMASK) -#define XFS_LBC_ISFREE(cp, slot) ((cp)->lbc_free & (1 << (slot))) - -/* * This is the type of function which can be given to xfs_trans_callback() * to be called upon the transaction's commit to disk. */ @@ -950,8 +923,7 @@ typedef struct xfs_trans { unsigned int t_items_free; /* log item descs free */ xfs_log_item_chunk_t t_items; /* first log item desc chunk */ xfs_trans_header_t t_header; /* header for in-log trans */ - unsigned int t_busy_free; /* busy descs free */ - xfs_log_busy_chunk_t t_busy; /* busy/async free blocks */ + struct list_head t_busy; /* list of busy extents */ unsigned long t_pflags; /* saved process flags state */ } xfs_trans_t; @@ -1025,9 +997,6 @@ int _xfs_trans_commit(xfs_trans_t *, void xfs_trans_cancel(xfs_trans_t *, int); int xfs_trans_ail_init(struct xfs_mount *); void xfs_trans_ail_destroy(struct xfs_mount *); -xfs_log_busy_slot_t *xfs_trans_add_busy(xfs_trans_t *tp, - xfs_agnumber_t ag, - xfs_extlen_t idx); extern kmem_zone_t *xfs_trans_zone; diff --git a/fs/xfs/xfs_trans_item.c b/fs/xfs/xfs_trans_item.c index eb3fc57..2937a1e 100644 --- a/fs/xfs/xfs_trans_item.c +++ b/fs/xfs/xfs_trans_item.c @@ -438,112 +438,3 @@ xfs_trans_unlock_chunk( return freed; } - - -/* - * This is 
called to add the given busy item to the transaction's - * list of busy items. It must find a free busy item descriptor - * or allocate a new one and add the item to that descriptor. - * The function returns a pointer to busy descriptor used to point - * to the new busy entry. The log busy entry will now point to its new - * descriptor with its ???? field. - */ -xfs_log_busy_slot_t * -xfs_trans_add_busy(xfs_trans_t *tp, xfs_agnumber_t ag, xfs_extlen_t idx) -{ - xfs_log_busy_chunk_t *lbcp; - xfs_log_busy_slot_t *lbsp; - int i=0; - - /* - * If there are no free descriptors, allocate a new chunk - * of them and put it at the front of the chunk list. - */ - if (tp->t_busy_free == 0) { - lbcp = (xfs_log_busy_chunk_t*) - kmem_alloc(sizeof(xfs_log_busy_chunk_t), KM_SLEEP); - ASSERT(lbcp != NULL); - /* - * Initialize the chunk, and then - * claim the first slot in the newly allocated chunk. - */ - XFS_LBC_INIT(lbcp); - XFS_LBC_CLAIM(lbcp, 0); - lbcp->lbc_unused = 1; - lbsp = XFS_LBC_SLOT(lbcp, 0); - - /* - * Link in the new chunk and update the free count. - */ - lbcp->lbc_next = tp->t_busy.lbc_next; - tp->t_busy.lbc_next = lbcp; - tp->t_busy_free = XFS_LIC_NUM_SLOTS - 1; - - /* - * Initialize the descriptor and the generic portion - * of the log item. - * - * Point the new slot at this item and return it. - * Also point the log item at its currently active - * descriptor and set the item's mount pointer. - */ - lbsp->lbc_ag = ag; - lbsp->lbc_idx = idx; - return lbsp; - } - - /* - * Find the free descriptor. It is somewhere in the chunklist - * of descriptors. - */ - lbcp = &tp->t_busy; - while (lbcp != NULL) { - if (XFS_LBC_VACANCY(lbcp)) { - if (lbcp->lbc_unused <= XFS_LBC_MAX_SLOT) { - i = lbcp->lbc_unused; - break; - } else { - /* out-of-order vacancy */ - cmn_err(CE_DEBUG, "OOO vacancy lbcp 0x%p\n", lbcp); - ASSERT(0); - } - } - lbcp = lbcp->lbc_next; - } - ASSERT(lbcp != NULL); - /* - * If we find a free descriptor, claim it, - * initialize it, and return it. 
- */ - XFS_LBC_CLAIM(lbcp, i); - if (lbcp->lbc_unused <= i) { - lbcp->lbc_unused = i + 1; - } - lbsp = XFS_LBC_SLOT(lbcp, i); - tp->t_busy_free--; - lbsp->lbc_ag = ag; - lbsp->lbc_idx = idx; - return lbsp; -} - - -/* - * xfs_trans_free_busy - * Free all of the busy lists from a transaction - */ -void -xfs_trans_free_busy(xfs_trans_t *tp) -{ - xfs_log_busy_chunk_t *lbcp; - xfs_log_busy_chunk_t *lbcq; - - lbcp = tp->t_busy.lbc_next; - while (lbcp != NULL) { - lbcq = lbcp->lbc_next; - kmem_free(lbcp); - lbcp = lbcq; - } - - XFS_LBC_INIT(&tp->t_busy); - tp->t_busy.lbc_unused = 0; -} diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h index 73e2ad3..901dc0f 100644 --- a/fs/xfs/xfs_trans_priv.h +++ b/fs/xfs/xfs_trans_priv.h @@ -38,10 +38,6 @@ struct xfs_log_item_desc *xfs_trans_next_item(struct xfs_trans *, void xfs_trans_free_items(struct xfs_trans *, int); void xfs_trans_unlock_items(struct xfs_trans *, xfs_lsn_t); -void xfs_trans_free_busy(xfs_trans_t *tp); -xfs_log_busy_slot_t *xfs_trans_add_busy(xfs_trans_t *tp, - xfs_agnumber_t ag, - xfs_extlen_t idx); /* * AIL traversal cursor. diff --git a/fs/xfs/xfs_types.h b/fs/xfs/xfs_types.h index b099045..3207752 100644 --- a/fs/xfs/xfs_types.h +++ b/fs/xfs/xfs_types.h @@ -75,6 +75,8 @@ typedef __uint32_t xfs_dahash_t; /* dir/attr hash value */ typedef __uint16_t xfs_prid_t; /* prid_t truncated to 16bits in XFS */ +typedef __uint32_t xlog_tid_t; /* transaction ID type */ + /* * These types are 64 bits on disk but are either 32 or 64 bits in memory. 
* Disk based types: -- 1.5.6.5 From SRS0+V/pq+66+fromorbit.com=dave@internode.on.net Fri May 7 00:39:21 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=0.5 required=5.0 tests=BAYES_00, TVD_PH_BODY_ACCOUNTS_PRE autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o475dLSA148932 for ; Fri, 7 May 2010 00:39:21 -0500 X-ASG-Debug-ID: 1273210887-038c019e0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id EA8FD318A39 for ; Thu, 6 May 2010 22:41:28 -0700 (PDT) Received: from mail.internode.on.net (bld-mail18.adl2.internode.on.net [150.101.137.103]) by cuda.sgi.com with ESMTP id 8OFs8yUh6ZGRwTBV for ; Thu, 06 May 2010 22:41:28 -0700 (PDT) Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23359261-1927428 for ; Fri, 07 May 2010 15:11:24 +0930 (CST) Received: from [192.168.1.9] (helo=disturbed) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1OAGJO-0006rY-F5 for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:22 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1OAGJD-00066e-1v for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:11 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 08/12] xfs: Delayed logging design documentation Subject: [PATCH 08/12] xfs: Delayed logging design documentation Date: Fri, 7 May 2010 15:40:56 +1000 Message-Id: <1273210860-23414-9-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273210860-23414-1-git-send-email-david@fromorbit.com> References: <1273210860-23414-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: bld-mail18.adl2.internode.on.net[150.101.137.103] X-Barracuda-Start-Time: 1273210889 
X-Barracuda-Bayes: INNOCENT GLOBAL 0.4873 1.0000 0.0000 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: 0.00 X-Barracuda-Spam-Status: No, SCORE=0.00 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29248 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner Document the design of the delayed logging implementation. This includes assumptions made, dead ends followed, the reasoning behind the structuring of the code, the layout of various structures, how things fit together, traps and pit-falls avoided, etc. This is all too much to document in the code itself, so do it in a separate file. Signed-off-by: Dave Chinner --- .../filesystems/xfs-delayed-logging-design.txt | 819 ++++++++++++++++++++ 1 files changed, 819 insertions(+), 0 deletions(-) create mode 100644 Documentation/filesystems/xfs-delayed-logging-design.txt diff --git a/Documentation/filesystems/xfs-delayed-logging-design.txt b/Documentation/filesystems/xfs-delayed-logging-design.txt new file mode 100644 index 0000000..2b7ff2e --- /dev/null +++ b/Documentation/filesystems/xfs-delayed-logging-design.txt @@ -0,0 +1,819 @@ +XFS Delayed Logging Design +-------------------------- + +Introduction to Re-logging in XFS +--------------------------------- + +XFS logging is a combination of logical and physical logging. Some objects, +such as inodes and dquots, are logged in logical format where the details +logged are made up of the changes to in-core structures rather than on-disk +structures. Other objects - typically buffers - have their physical changes +logged. The reason for these differences is to reduce the amount of log space +required for objects that are frequently logged. 
Some parts of inodes are more +frequently logged than others, and inodes are typically more frequently logged +than any other object (except maybe the superblock buffer) so keeping the +amount of metadata logged low is of prime importance. + +The reason that this is such a concern is that XFS allows multiple separate +modifications to a single object to be carried in the log at any given time. +This allows the log to avoid needing to flush each change to disk before +recording a new change to the object. XFS does this via a method called +"re-logging". Conceptually, this is quite simple - all it requires is that any +new change to the object is recorded with a *new copy* of all the existing +changes in the new transaction that is written to the log. + +That is, if we have a sequence of changes A through to F, and the object was +written to disk after change D, we would see in the log the following series +of transactions, their contents and the log sequence number (LSN) of the +transaction: + + Transaction Contents LSN + A A X + B A+B X+n + C A+B+C X+n+m + D A+B+C+D X+n+m+o + + E E Y (> X+n+m+o) + F E+F Y+p + +In other words, each time an object is relogged, the new transaction contains +the aggregation of all the previous changes currently held only in the log. + +This relogging technique also allows objects to be moved forward in the log so +that an object being relogged does not prevent the tail of the log from ever +moving forward. This can be seen in the table above by the changing +(increasing) LSN of each subsequent transaction - the LSN is effectively a +direct encoding of the location in the log of the transaction. + +This relogging is also used to implement long-running, multiple-commit +transactions. These transactions are known as rolling transactions, and require +a special log reservation known as a permanent transaction reservation.
A +typical example of a rolling transaction is the removal of extents from an +inode which can only be done at a rate of two extents per transaction because +of reservation size limitations. Hence a rolling extent removal transaction +keeps relogging the inode and btree buffers as they get modified in each +removal operation. This keeps them moving forward in the log as the operation +progresses, ensuring that the current operation never gets blocked by itself if the +log wraps around. + +Hence it can be seen that the relogging operation is fundamental to the correct +working of the XFS journalling subsystem. From the above description, most +people should be able to see why XFS metadata operations write so much to +the log - repeated operations to the same objects write the same changes to +the log over and over again. Worse is the fact that objects tend to get +dirtier as they get relogged, so each subsequent transaction is writing more +metadata into the log. + +Another feature of the XFS transaction subsystem is that most transactions are +asynchronous. That is, they don't commit to disk until either a log buffer is +filled (a log buffer can hold multiple transactions) or a synchronous operation +forces the log buffers holding the transactions to disk. This means that XFS is +doing aggregation of transactions in memory - batching them, if you like - to +minimise the impact of the log IO on transaction throughput. + +The limitation on asynchronous transaction throughput is the number and size of +log buffers made available by the log manager. By default there are 8 log +buffers available and the size of each is 32kB - the size can be increased up +to 256kB by use of a mount option. + +Effectively, this gives us the maximum bound of outstanding metadata changes +that can be made to the filesystem at any point in time - if all the log +buffers are full and under IO, then no more transactions can be committed until +the current batch completes.
It is now common for a single current CPU core to +be able to issue enough transactions to keep the log buffers full and under +IO permanently. Hence the XFS journalling subsystem can be considered to be IO +bound. + +Delayed Logging: Concepts +------------------------- + +The key thing to note about the asynchronous logging combined with the +relogging technique XFS uses is that we can be relogging changed objects +multiple times before they are committed to disk in the log buffers. If we +return to the previous relogging example, it is entirely possible that +transactions A through D are committed to disk in the same log buffer. + +That is, a single log buffer may contain multiple copies of the same object, +but only one of those copies needs to be there - the last one "D", as it +contains all the changes from the previous transactions. In other words, we have one +necessary copy in the log buffer, and three stale copies that are simply +wasting space. When we are doing repeated operations on the same set of +objects, these "stale objects" can be over 90% of the space used in the log +buffers. It is clear that reducing the number of stale objects written to the +log would greatly reduce the amount of metadata we write to the log, and this +is the fundamental goal of delayed logging. + +From a conceptual point of view, XFS is already doing relogging in memory (where +memory == log buffer), only it is doing it extremely inefficiently. It is using +logical to physical formatting to do the relogging because there is no +infrastructure to keep track of logical changes in memory prior to physically +formatting the changes in a transaction to the log buffer. Hence we cannot avoid +accumulating stale objects in the log buffers. + +Delayed logging is the name we've given to keeping and tracking transactional +changes to objects in memory outside the log buffer infrastructure.
Because of +the relogging concept fundamental to the XFS journalling subsystem, this is +actually relatively easy to do - all the changes to logged items are already +tracked in the current infrastructure. The big problem is how to accumulate +them and get them to the log in a consistent, recoverable manner. +Describing the problems and how they have been solved is the focus of this +document. + +One of the key changes that delayed logging makes to the operation of the +journalling subsystem is that it disassociates the amount of outstanding +metadata changes from the size and number of log buffers available. In other +words, instead of there only being a maximum of 2MB of transaction changes not +written to the log at any point in time, there may be a much greater amount +being accumulated in memory. Hence the potential for loss of metadata on a +crash is much greater than for the existing logging mechanism. + +It should be noted that this does not change the guarantee that log recovery +will result in a consistent filesystem. What it does mean is that as far as the +recovered filesystem is concerned, there may be many thousands of transactions +that simply did not occur as a result of the crash. This makes it even more +important that applications that care about their data use fsync() where they +need to ensure application level data integrity is maintained. + +It should be noted that delayed logging is not an innovative new concept that +warrants rigorous proofs to determine whether it is correct or not. The method +of accumulating changes in memory for some period before writing them to the +log is used effectively in many filesystems including ext3 and ext4. Hence +no time is spent in this document trying to convince the reader that the +concept is sound. Instead it is simply considered a "solved problem" and as +such implementing it in XFS is purely an exercise in software engineering. 
+ +The fundamental requirements for delayed logging in XFS are simple: + + 1. Reduce the amount of metadata written to the log by at least + an order of magnitude. + 2. Supply sufficient statistics to validate Requirement #1. + 3. Supply sufficient new tracing infrastructure to be able to debug + problems with the new code. + 4. No on-disk format change (metadata or log format). + 5. Enable and disable with a mount option. + 6. No performance regressions for synchronous transaction workloads. + +Delayed Logging: Design +----------------------- + +Storing Changes + +The problem with accumulating changes at a logical level (i.e. just using the +existing log item dirty region tracking) is that when it comes to writing the +changes to the log buffers, we need to ensure that the object we are formatting +is not changing while we do this. This requires locking the object to prevent +concurrent modification. Hence flushing the logical changes to the log would +require us to lock every object, format them, and then unlock them again. + +This introduces lots of scope for deadlocks with transactions that are already +running. For example, a transaction has object A locked and modified, but needs +the delayed logging tracking lock to commit the transaction. However, the +flushing thread has the delayed logging tracking lock already held, and is +trying to get the lock on object A to flush it to the log buffer. This appears +to be an unsolvable deadlock condition, and it was solving this problem that +was the barrier to implementing delayed logging for so long. + +The solution is relatively simple - it just took a long time to recognise it. +Put simply, the current logging code formats the changes to each item into a +vector array that points to the changed regions in the item. The log write code +simply copies the memory these vectors point to into the log buffer during +transaction commit while the item is locked in the transaction.
Instead of +using the log buffer as the destination of the formatting code, we can use an +allocated memory buffer big enough to fit the formatted vector. + +If we then copy the vector into the memory buffer and rewrite the vector to +point to the memory buffer rather than the object itself, we now have a copy of +the changes in a format that is compatible with the log buffer writing code +and that does not require us to lock the item to access it. This formatting and +rewriting can all be done while the object is locked during transaction commit, +resulting in a vector that is transactionally consistent and can be accessed +without needing to lock the owning item. + +Hence we avoid the need to lock items when we need to flush outstanding +asynchronous transactions to the log. The differences between the existing +formatting method and the delayed logging formatting can be seen in the +diagram below. + +Current format log vector: + +Object +---------------------------------------------+ +Vector 1 +----+ +Vector 2 +----+ +Vector 3 +----------+ + +After formatting: + +Log Buffer +-V1-+-V2-+----V3----+ + +Delayed logging vector: + +Object +---------------------------------------------+ +Vector 1 +----+ +Vector 2 +----+ +Vector 3 +----------+ + +After formatting: + +Memory Buffer +-V1-+-V2-+----V3----+ +Vector 1 +----+ +Vector 2 +----+ +Vector 3 +----------+ + +The memory buffer and associated vector need to be passed as a single object, +but still need to be associated with the parent object so if the object is +relogged we can replace the current memory buffer with a new memory buffer that +contains the latest changes. + +The reason for keeping the vector around after we've formatted the memory +buffer is to support splitting vectors across log buffer boundaries correctly. +If we don't keep the vector around, we do not know where the region boundaries +are in the item, so we'd need a new encapsulation method for regions in the log +buffer writing (i.e.
double encapsulation). This would be an on-disk format +change and as such is not desirable. It also means we'd have to write the log +region headers in the formatting stage, which is problematic as there is per +region state that needs to be placed into the headers during the log write. + +Hence we need to keep the vector, but by attaching the memory buffer to it and +rewriting the vector addresses to point at the memory buffer we end up with a +self-describing object that can be passed to the log buffer write code to be +handled in exactly the same manner as the existing log vectors are handled. +Hence we avoid needing a new on-disk format to handle items that have been +relogged in memory. + + +Tracking Changes + +Now that we can record transactional changes in memory in a form that allows +them to be used without limitations, we need to be able to track and accumulate +them so that they can be written to the log at some later point in time. The +log item is the natural place to store this vector and buffer, and also makes sense +to be the object that is used to track committed objects as it will always +exist once the object has been included in a transaction. + +The log item is already used to track the log items that have been written to +the log but not yet written to disk. Such log items are considered "active" +and as such are stored in the Active Item List (AIL) which is a LSN-ordered +double linked list. Items are inserted into this list during log buffer IO +completion, after which they are unpinned and can be written to disk. An object +that is in the AIL can be relogged, which causes the object to be pinned again +and then moved forward in the AIL when the log buffer IO completes for that +transaction. + +Essentially, this shows that an item that is in the AIL can still be modified +and relogged, so any tracking must be separate to the AIL infrastructure. 
As +such, we cannot reuse the AIL list pointers for tracking committed items, nor +can we store state in any field that is protected by the AIL lock. Hence the +committed item tracking needs its own locks, lists and state fields in the log +item. + +Similar to the AIL, tracking of committed items is done through a new list +called the Committed Item List (CIL). The list tracks log items that have been +committed and have formatted memory buffers attached to them. It tracks objects +in transaction commit order, so when an object is relogged it is removed from +its place in the list and re-inserted at the tail. This is entirely arbitrary +and done to make it easy for debugging - the last items in the list are the +ones that are most recently modified. Ordering of the CIL is not necessary for +transactional integrity (as discussed in the next section) so the ordering is +done for convenience/sanity of the developers. + + +Delayed Logging: Checkpoints + +When we have a log synchronisation event, commonly known as a "log force", +all the items in the CIL must be written into the log via the log buffers. +We need to write these items in the order that they exist in the CIL, and they +need to be written as an atomic transaction. The need for all the objects to be +written as an atomic transaction comes from the requirements of relogging and +log replay - all the changes in all the objects in a given transaction must +either be completely replayed during log recovery, or not replayed at all. If +a transaction is not replayed because it is not complete in the log, then +no later transactions should be replayed, either. + +To fulfill this requirement, we need to write the entire CIL in a single log +transaction. Fortunately, the XFS log code has no fixed limit on the size of a +transaction, nor does the log replay code. The only fundamental limit is that +the transaction cannot be larger than just under half the size of the log.
The +reason for this limit is that to find the head and tail of the log, there must +be at least one complete transaction in the log at any given time. If a +transaction is larger than half the log, then there is the possibility that a +crash during the write of such a transaction could partially overwrite the +only complete previous transaction in the log. This will result in a recovery +failure and an inconsistent filesystem and hence we must enforce the maximum +size of a checkpoint to be slightly less than half the log. + +Apart from this size requirement, a checkpoint transaction looks no different +to any other transaction - it contains a transaction header, a series of +formatted log items and a commit record at the tail. From a recovery +perspective, the checkpoint transaction is also no different - just a lot +bigger with a lot more items in it. The worst case effect of this is that we +might need to tune the recovery transaction object hash size. + +Because the checkpoint is just another transaction and all the changes to log +items are stored as log vectors, we can use the existing log buffer writing +code to write the changes into the log. To do this efficiently, we need to +minimise the time we hold the CIL locked while writing the checkpoint +transaction. The current log write code enables us to do this easily with the +way it separates the writing of the transaction contents (the log vectors) from +the transaction commit record, but tracking this requires us to have a +per-checkpoint context that travels through the log write process through to +checkpoint completion. + +Hence a checkpoint has a context that tracks the state of the current +checkpoint from initiation to checkpoint completion. A new context is initiated +at the same time a checkpoint transaction is started. That is, when we remove +all the current items from the CIL during a checkpoint operation, we move all +those changes into the current checkpoint context.
We then initialise a new +context and attach that to the CIL for aggregation of new transactions. + +This allows us to unlock the CIL immediately after transfer of all the +committed items and effectively allow new transactions to be issued while we +are formatting the checkpoint into the log. It also allows concurrent +checkpoints to be written into the log buffers in the case of log force heavy +workloads, just like the existing transaction commit code does. This, however, +requires that we strictly order the commit records in the log so that +checkpoint sequence order is maintained during log replay. + +To ensure that we can be writing an item into a checkpoint transaction at +the same time another transaction modifies the item and inserts the log item +into the new CIL, the checkpoint transaction commit code cannot use log items +to store the list of log vectors that need to be written into the transaction. +Hence log vectors need to be able to be chained together to allow them to be +detached from the log items. That is, when the CIL is flushed the memory +buffer and log vector attached to each log item need to be attached to the +checkpoint context so that the log item can be released. In diagrammatic form, +the CIL would look like this before the flush: + + CIL Head + | + V + Log Item <-> log vector 1 -> memory buffer + | -> vector array + V + Log Item <-> log vector 2 -> memory buffer + | -> vector array + V + ...... + | + V + Log Item <-> log vector N-1 -> memory buffer + | -> vector array + V + Log Item <-> log vector N -> memory buffer + -> vector array + +And after the flush the CIL head is empty, and the checkpoint context log +vector list would look like: + + Checkpoint Context + | + V + log vector 1 -> memory buffer + | -> vector array + | -> Log Item + V + log vector 2 -> memory buffer + | -> vector array + | -> Log Item + V + ......
+ | + V + log vector N-1 -> memory buffer + | -> vector array + | -> Log Item + V + log vector N -> memory buffer + -> vector array + -> Log Item + +Once this transfer is done, the CIL can be unlocked and new transactions can +start, while the checkpoint flush code works over the log vector chain to +commit the checkpoint. + +Once the checkpoint is written into the log buffers, the checkpoint context is +attached to the log buffer that the commit record was written to along with a +completion callback. Log IO completion will call that callback, which can then +run transaction committed processing for the log items (i.e. insert into AIL +and unpin) in the log vector chain and then free the log vector chain and +checkpoint context. + +Discussion Point: I am uncertain as to whether the log item is the most +efficient way to track vectors, even though it seems like the natural way to do +it. The fact that we walk the log items (in the CIL) just to chain the log +vectors and break the link between the log item and the log vector means that +we take a cache line hit for the log item list modification, then another for +the log vector chaining. If we track by the log vectors, then we only need to +break the link between the log item and the log vector, which means we should +dirty only the log item cachelines. Normally I wouldn't be concerned about one +vs two dirty cachelines except for the fact I've seen upwards of 80,000 log +vectors in one checkpoint transaction. I'd guess this is a "measure and +compare" situation that can be done after a working and reviewed implementation +is in the dev tree.... + +Delayed Logging: Checkpoint Sequencing + +One of the key aspects of the XFS transaction subsystem is that it tags +committed transactions with the log sequence number of the transaction commit. +This allows transactions to be issued asynchronously even though there may be +future operations that cannot be completed until that transaction is fully +committed to the log. 
In the rare case that a dependent operation occurs (e.g. +re-using a freed metadata extent for a data extent), a special, optimised log +force can be issued to force the dependent transaction to disk immediately. + +To do this, transactions need to record the LSN of the commit record of the +transaction. This LSN comes directly from the log buffer the transaction is +written into. While this works just fine for the existing transaction +mechanism, it does not work for delayed logging because transactions are not +written directly into the log buffers. Hence some other method of sequencing +transactions is required. + +As discussed in the checkpoint section, delayed logging uses per-checkpoint +contexts, and as such it is simple to assign a sequence number to each +checkpoint. Because the switching of checkpoint contexts must be done +atomically, it is simple to ensure that each new context has a monotonically +increasing sequence number assigned to it without the need for an external +atomic counter - we can just take the current context sequence number and add +one to it for the new context. + +Then, instead of assigning a log buffer LSN to the transaction commit LSN +during the commit, we can assign the current checkpoint sequence. This allows +operations that track transactions that have not yet completed to know what +checkpoint sequence needs to be committed before they can continue. As a +result, the code that forces the log to a specific LSN now needs to ensure that +the log forces to a specific checkpoint. + +To ensure that we can do this, we need to track all the checkpoint contexts +that are currently committing to the log. When we flush a checkpoint, the +context gets added to a "committing" list which can be searched. When a +checkpoint commit completes, it is removed from the committing list.
Because +the checkpoint context records the LSN of the commit record for the checkpoint, +we can also wait on the log buffer that contains the commit record, thereby +using the existing log force mechanisms to execute synchronous forces. + +It should be noted that the synchronous forces may need to be extended with +mitigation algorithms similar to the current log buffer code to allow +aggregation of multiple synchronous transactions if there are already +synchronous transactions being flushed. Investigation of the performance of the +current design is needed before making any decisions here. + +The main concern with log forces is to ensure that all the previous checkpoints +are also committed to disk before the one we need to wait for. Therefore we +need to check that all the prior contexts in the committing list are also +complete before waiting on the one we need to complete. We do this +synchronisation in the log force code so that we don't need to wait anywhere +else for such serialisation - it only matters when we do a log force. + +The only remaining complexity is that a log force now also has to handle the +case where the forcing sequence number is the same as the current context. That +is, we need to flush the CIL and potentially wait for it to complete. This is a +simple addition to the existing log forcing code to check the sequence numbers +and push if required. Indeed, placing the current sequence checkpoint flush in +the log force code enables the current mechanism for issuing synchronous +transactions to remain untouched (i.e. commit an asynchronous transaction, then +force the log at the LSN of that transaction) and so the higher level code +behaves the same regardless of whether delayed logging is being used or not. + +Delayed Logging: Checkpoint Log Space Accounting + +The big issue for a checkpoint transaction is the log space reservation for the +transaction. 
We don't know how big a checkpoint transaction is going to be +ahead of time, nor how many log buffers it will take to write out, nor the +number of split log vector regions that are going to be used. We can track the +amount of log space required as we add items to the commit item list, but we +still need to reserve the space in the log for the checkpoint. + +A typical transaction reserves enough space in the log for the worst case space +usage of the transaction. The reservation accounts for log record headers, +transaction and region headers, headers for split regions, buffer tail padding, +etc. as well as the actual space for all the changed metadata in the +transaction. While some of this is fixed overhead, much of it is dependent on +the size of the transaction and the number of regions being logged (the number +of log vectors in the transaction). + +An example of the differences would be logging directory changes versus logging +inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then +there are lots of transactions that only contain an inode core and an inode log +format structure. That is, two vectors totaling roughly 150 bytes. If we modify +10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each +vector is 12 bytes, so the total to be logged is approximately 1.75MB. In +comparison, if we are logging full directory buffers, they are typically 4KB +each, so in 1.5MB of directory buffers we'd have roughly 400 buffers and a +buffer format structure for each buffer - roughly 800 vectors or 1.51MB total +space. From this, it should be obvious that a static log space reservation is +not particularly flexible and that it is difficult to select the "optimal value" for +all workloads. + +Further, if we are going to use a static reservation, which bit of the entire +reservation does it cover?
We account for space used by the transaction +reservation by tracking the space currently used by the object in the CIL and +then calculating the increase or decrease in space used as the object is +relogged. This allows for a checkpoint reservation to only have to account for +log buffer metadata used such as log header records. + +However, even using a static reservation for just the log metadata is +problematic. Typically log record headers use at least 16KB of log space per +1MB of log space consumed (512 bytes per 32k) and the reservation needs to be +large enough to handle arbitrary sized checkpoint transactions. This +reservation needs to be made before the checkpoint is started, and we need to +be able to reserve the space without sleeping. For an 8MB checkpoint, we need a +reservation of around 150KB, which is a non-trivial amount of space. + +A static reservation needs to manipulate the log grant counters - we can take a +permanent reservation on the space, but we still need to make sure we refresh +the write reservation (the actual space available to the transaction) after +every checkpoint transaction completion. Unfortunately, if this space is not +available when required, then the regrant code will sleep waiting for it. + +The problem with this is that it can lead to deadlocks as we may need to commit +checkpoints to be able to free up log space (refer back to the description of +rolling transactions for an example of this). Hence we *must* always have +space available in the log if we are to use static reservations, and that is +very difficult and complex to arrange. It is possible to do, but there is a +simpler way. + +The simpler way of doing this is tracking the entire log space used by the +items in the CIL and using this to dynamically calculate the amount of log +space required by the log metadata.
If this log metadata space changes as a +result of a transaction commit inserting a new memory buffer into the CIL, then +the difference in space required is removed from the transaction that causes +the change. Transactions at this level will *always* have enough space +available in their reservation for this as they have already reserved the +maximal amount of log metadata space they require, and such a delta reservation +will always be less than or equal to the maximal amount in the reservation. + +Hence we can grow the checkpoint transaction reservation dynamically as items +are added to the CIL and avoid the need for reserving and regranting log space +up front. This avoids deadlocks and removes a blocking point from the +checkpoint flush code. + +As mentioned earlier, transactions can't grow to more than half the size of the +log. Hence as part of the reservation growing, we need to also check the size +of the reservation against the maximum allowed transaction size. If we reach +the maximum threshold, we need to push the CIL to the log. This is effectively +a "background flush" and is done on demand. This is identical to +a CIL push triggered by a log force, except that there is no waiting for the +checkpoint commit to complete. This background push is checked and executed by +transaction commit code. + +If the transaction subsystem goes idle while we still have items in the CIL, +they will be flushed by the periodic log force issued by the xfssyncd. This log +force will push the CIL to disk, and if the transaction subsystem stays idle, +allow the idle log to be covered (effectively marked clean) in exactly the same +manner that is done for the existing logging method. A discussion point is +whether this log force needs to be done more frequently than the current rate +which is once every 30s. + + +Delayed Logging: Log Item Pinning + +Currently log items are pinned during transaction commit while the items are +still locked.
This happens just after the items are formatted, though it could +be done any time before the items are unlocked. The result of this mechanism is +that items get pinned once for every transaction that is committed to the log +buffers. Hence items that are relogged in the log buffers will have a pin count +for every outstanding transaction they were dirtied in. When each of these +transactions is completed, they will unpin the item once. As a result, the item +only becomes unpinned when all the transactions complete and there are no +pending transactions. Thus the pinning and unpinning of a log item is symmetric +as there is a 1:1 relationship between transaction commit and log item completion. + +For delayed logging, however, we have an asymmetric transaction commit to +completion relationship. Every time an object is relogged in the CIL it goes +through the commit process without a corresponding completion being registered. +That is, we now have a many-to-one relationship between transaction commit and +log item completion. The result of this is that pinning and unpinning of the +log items becomes unbalanced if we retain the "pin on transaction commit, unpin +on transaction completion" model. + +To keep pin/unpin symmetry, the algorithm needs to change to a "pin on +insertion into the CIL, unpin on checkpoint completion". In other words, the +pinning and unpinning becomes symmetric around a checkpoint context. We have to +pin the object the first time it is inserted into the CIL - if it is already in +the CIL during a transaction commit, then we do not pin it again. Because there +can be multiple outstanding checkpoint contexts, we can still see elevated pin +counts, but as each checkpoint completes the pin count will retain the correct +value according to its context. + +Just to make matters slightly more complex, this checkpoint level context +for the pin count means that the pinning of an item must take place under the +CIL commit/flush lock.
If we pin the object outside this lock, we cannot +guarantee which context the pin count is associated with. This is because of +the fact pinning the item is dependent on whether the item is present in the +current CIL or not. If we don't pin the CIL first before we check and pin the +object, we have a race with CIL being flushed between the check and the pin +(or not pinning, as the case may be). Hence we must hold the CIL flush/commit +lock to guarantee that we pin the items correctly. + +Delayed Logging: Concurrent Scalability + +A fundamental requirement for the CIL is that accesses through transaction +commits must scale to many concurrent commits. The current transaction commit +code does not break down even when there are transactions coming from 2048 +processors at once. The current transaction code does not go any faster than if +there was only one CPU using it, but it does not slow down either. + +As a result, the delayed logging transaction commit code needs to be designed +for concurrency from the ground up. It is obvious that there are serialisation +points in the design - the three important ones are: + + 1. Locking out new transaction commits while flushing the CIL + 2. Adding items to the CIL and updating item space accounting + 3. Checkpoint commit ordering + +Looking at the transaction commit and CIL flushing interactions, it is clear +that we have a many-to-one interaction here. That is, the only restriction on +the number of concurrent transactions that can be trying to commit at once is +the amount of space available in the log for their reservations. The practical +limit here is in the order of several hundred concurrent transactions for a +128MB log, which means that it is generally one per CPU in a machine. 
+ +The amount of time a transaction commit needs to hold out a flush is a +relatively long period of time - the pinning of log items needs to be done +while we are holding out a CIL flush, so at the moment that means it is held +across the formatting of the objects into memory buffers (i.e. while memcpy()s +are in progress). Ultimately a two pass algorithm where the formatting is done +separately to the pinning of objects could be used to reduce the hold time of +the transaction commit side. + +Because of the number of potential transaction commit side holders, the lock +really needs to be a sleeping lock - if the CIL flush takes the lock, we do not +want every other CPU in the machine spinning on the CIL lock. Given that +flushing the CIL could involve walking a list of tens of thousands of log +items, it will get held for a significant time and so spin contention is a +significant concern. Preventing lots of CPUs spinning doing nothing is the +main reason for choosing a sleeping lock even though nothing in either the +transaction commit or CIL flush side sleeps with the lock held. + +It should also be noted that CIL flushing is also a relatively rare operation +compared to transaction commit for asynchronous transaction workloads - only +time will tell if using a read-write semaphore for exclusion will limit +transaction commit concurrency due to cache line bouncing of the lock on the +read side. + +The second serialisation point is on the transaction commit side where items +are inserted into the CIL. Because transactions can enter this code +concurrently, the CIL needs to be protected separately from the above +commit/flush exclusion. It also needs to be an exclusive lock but it is only +held for a very short time and so a spin lock is appropriate here. It is +possible that this lock will become a contention point, but given the short +hold time once per transaction I think that contention is unlikely. 
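The commit-side and flush-side work that these first two serialisation points protect can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the XFS implementation: every name (struct log_item, struct cil, cil_commit_item, cil_flush) is hypothetical, the code is single-threaded with comments marking where the commit/flush lock and the CIL spin lock would be held, and for brevity the flush chains the log items themselves rather than separate log vectors.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the XFS structures. */
struct log_item {
	struct log_item *prev, *next;	/* CIL linkage */
	int in_cil;			/* already tracked by the current CIL? */
	int pin_count;
	size_t buf_bytes;		/* size of the attached memory buffer */
};

struct cil {
	struct log_item head;		/* list sentinel; head.prev is the tail */
	size_t space_used;		/* bytes tracked by the current context */
	unsigned long sequence;		/* current checkpoint context sequence */
};

static void cil_init(struct cil *cil)
{
	cil->head.prev = cil->head.next = &cil->head;
	cil->space_used = 0;
	cil->sequence = 1;
}

/*
 * Transaction commit side. The caller is assumed to hold the commit/flush
 * lock shared (so no flush can run) and to have the item locked.
 * Returns the checkpoint sequence the transaction would record.
 */
static unsigned long cil_commit_item(struct cil *cil, struct log_item *lip,
				     size_t new_bytes)
{
	/* Pin only on first insertion into this CIL; a relog does not re-pin. */
	if (!lip->in_cil)
		lip->pin_count++;

	/* --- the short-hold CIL spin lock would be taken here --- */
	if (lip->in_cil) {
		/* Relog: unlink and drop the old buffer from the accounting. */
		lip->prev->next = lip->next;
		lip->next->prev = lip->prev;
		cil->space_used -= lip->buf_bytes;
	}
	lip->buf_bytes = new_bytes;
	cil->space_used += new_bytes;

	/* (Re-)insert at the tail so the list stays in commit order. */
	lip->prev = cil->head.prev;
	lip->next = &cil->head;
	cil->head.prev->next = lip;
	cil->head.prev = lip;
	lip->in_cil = 1;
	/* --- CIL spin lock released --- */

	return cil->sequence;
}

/*
 * Flush side. The caller is assumed to hold the commit/flush lock
 * exclusively. Detaches the whole list as a NULL-terminated chain,
 * reports the sequence being checkpointed and starts a new context
 * with the next sequence number.
 */
static struct log_item *cil_flush(struct cil *cil, unsigned long *seq)
{
	struct log_item *first = NULL, *lip;

	if (cil->head.next != &cil->head) {
		first = cil->head.next;
		cil->head.prev->next = NULL;
		first->prev = NULL;
	}
	for (lip = first; lip; lip = lip->next)
		lip->in_cil = 0;	/* next commit inserts and pins afresh */

	*seq = cil->sequence;
	cil->head.prev = cil->head.next = &cil->head;
	cil->space_used = 0;
	cil->sequence++;		/* plain increment: flushes are serialised */
	return first;
}
```

In this sketch, relogging an item that is already in the CIL leaves its pin count alone and only adjusts the space accounting by the difference in buffer size. Committing the same item again after a flush pins it a second time, which corresponds to the elevated pin counts described for multiple outstanding checkpoint contexts.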
+ +The final serialisation point is the checkpoint commit record ordering code +that is run as part of the checkpoint commit and log force sequencing. The code +path that triggers a CIL flush (i.e. whatever triggers the log force) will enter +an ordering loop after writing all the log vectors into the log buffers but +before writing the commit record. This loop walks the list of committing +checkpoints and needs to block waiting for checkpoints to complete their commit +record write. As a result it needs a lock and a wait variable. Log force +sequencing also requires the same lock, list walk, and blocking mechanism to +ensure completion of checkpoints. + +These two sequencing operations can use the same mechanism even though the +events they are waiting for are different. The checkpoint commit record +sequencing needs to wait until checkpoint contexts contain a commit LSN +(obtained through completion of a commit record write) while log force +sequencing needs to wait until previous checkpoint contexts are removed from +the committing list (i.e. they've completed). A simple wait variable and +broadcast wakeups (thundering herds) have been used to implement these two +serialisation queues. They use the same lock as the CIL, too. If we see too +much contention on the CIL lock, or too many context switches as a result of +the broadcast wakeups these operations can be put under a new spinlock and +given separate wait lists to reduce lock contention and the number of processes +woken by the wrong event. + + +Lifecycle Changes + +The existing log item life cycle is as follows: + + 1. Transaction allocate + 2. Transaction reserve + 3. Lock item + 4. Join item to transaction + If not already attached, + Allocate log item + Attach log item to owner item + Attach log item to transaction + 5. Modify item + Record modifications in log item + 6.
Transaction commit + Pin item in memory + Format item into log buffer + Write commit LSN into transaction + Unlock item + Attach transaction to log buffer + + + + + 7. Transaction completion + Mark log item committed + Insert log item into AIL + Write commit LSN into log item + Unpin log item + 8. AIL traversal + Lock item + Mark log item clean + Flush item to disk + + + + 9. Log item removed from AIL + Moves log tail + Item unlocked + +Essentially, steps 1-6 operate independently from step 7, which is also +independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9 +at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur +at the same time. If the log item is in the AIL or between steps 6 and 7 +and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9 +are entered and completed is the object considered clean. + +With delayed logging, there are new steps inserted into the life cycle: + + 1. Transaction allocate + 2. Transaction reserve + 3. Lock item + 4. Join item to transaction + If not already attached, + Allocate log item + Attach log item to owner item + Attach log item to transaction + 5. Modify item + Record modifications in log item + 6. Transaction commit + Pin item in memory if not pinned in CIL + Format item into log vector + buffer + Attach log vector and buffer to log item + Insert log item into CIL + Write CIL context sequence into transaction + Unlock item + + + + 7. CIL push + lock CIL flush + Chain log vectors and buffers together + Remove items from CIL + unlock CIL flush + write log vectors into log + sequence commit records + attach checkpoint context to log buffer + + + + + 8. Checkpoint completion + Mark log item committed + Insert item into AIL + Write commit LSN into log item + Unpin log item + 9. AIL traversal + Lock item + Mark log item clean + Flush item to disk + + 10. 
Log item removed from AIL + Moves log tail + Item unlocked + +From this, it can be seen that the only life cycle differences between the two +logging methods are in the middle of the life cycle - they still have the same +beginning and end and execution constraints. The only differences are in the +committing of the log items to the log itself and the completion processing. +Hence delayed logging should not introduce any constraints on log item +behaviour, allocation or freeing that don't already exist. + +As a result of this zero-impact "insertion" of delayed logging infrastructure +and the design of the internal structures to avoid on disk format changes, we +can basically switch between delayed logging and the existing mechanism with a +mount option. Fundamentally, there is no reason why the log manager would not +be able to swap methods automatically and transparently depending on load +characteristics, but this should not be necessary if delayed logging works as +designed. + +Roadmap: + +2.6.35 Inclusion in mainline as an experimental mount option + => approximately 2-3 months to merge window + => needs to be in xfs-dev tree in 4-6 weeks + => code is nearing readiness for review + +2.6.37 Remove experimental tag from mount option + => should be roughly 6 months after initial merge + => enough time to: + => gain confidence and fix problems reported by early + adopters (a.k.a.
guinea pigs) + => address worst performance regressions and undesired + behaviours + => start tuning/optimising code for parallelism + => start tuning/optimising algorithms consuming + excessive CPU time + +2.6.39 Switch default mount option to use delayed logging + => should be roughly 12 months after initial merge + => enough time to shake out remaining problems before next round of + enterprise distro kernel rebases + + + -- 1.5.6.5 From SRS0+KKyr+66+fromorbit.com=dave@internode.on.net Fri May 7 01:01:37 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-2.3 required=5.0 tests=BAYES_00,J_CHICKENPOX_63, J_CHICKENPOX_66,LOCAL_GNU_PATCH,TVD_PH_BODY_ACCOUNTS_PRE autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o4761aG7150054 for ; Fri, 7 May 2010 01:01:37 -0500 X-ASG-Debug-ID: 1273210880-038e01a30000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 6D977318A34 for ; Thu, 6 May 2010 22:41:20 -0700 (PDT) Received: from mail.internode.on.net (bld-mail12.adl6.internode.on.net [150.101.137.97]) by cuda.sgi.com with ESMTP id GHCnM7mujBzhSuMJ for ; Thu, 06 May 2010 22:41:20 -0700 (PDT) Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23442163-1927428 for ; Fri, 07 May 2010 15:11:18 +0930 (CST) Received: from disturbed ([192.168.1.9]) by dastard with esmtp (Exim 4.71) (envelope-from ) id 1OAGJE-0006rZ-J3 for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:12 +1000 Received: from dave by disturbed with local (Exim 4.71) (envelope-from ) id 1OAGJD-00066g-4B for xfs@oss.sgi.com; Fri, 07 May 2010 15:41:11 +1000 From: Dave Chinner To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH 09/12] xfs: Introduce delayed logging core code Subject: 
[PATCH 09/12] xfs: Introduce delayed logging core code Date: Fri, 7 May 2010 15:40:57 +1000 Message-Id: <1273210860-23414-10-git-send-email-david@fromorbit.com> X-Mailer: git-send-email 1.5.6.5 In-Reply-To: <1273210860-23414-1-git-send-email-david@fromorbit.com> References: <1273210860-23414-1-git-send-email-david@fromorbit.com> X-Barracuda-Connect: bld-mail12.adl6.internode.on.net[150.101.137.97] X-Barracuda-Start-Time: 1273210882 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.52 X-Barracuda-Spam-Status: No, SCORE=-1.52 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=BSF_RULE7568M X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29248 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.50 BSF_RULE7568M Custom Rule 7568M X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean From: Dave Chinner The delayed logging code only changes in-memory structures and as such can be enabled and disabled with a mount option. Add the mount option and emit a warning that this is an experimental feature that should not be used in production yet. We also need infrastructure to track committed items that have not yet been written to the log. This is what the Committed Item List (CIL) is for. The log item also needs to be extended to track the current log vector, the associated memory buffer and its location in the Committed Item List. Extend the log item and log vector structures to enable this tracking. To maintain the current log format for transactions with delayed logging, we need to introduce a checkpoint transaction and a context for tracking each checkpoint from initiation to transaction completion. This includes adding a log ticket for tracking the log space required/used by the context checkpoint.
To track all the changes we need an io vector array per log item, rather than a single array for the entire transaction. Using the new log vector structure for this requires two passes - the first to allocate the log vector structures and chain them together, and the second to fill them out. This log vector chain can then be passed to the CIL for formatting, pinning and insertion into the CIL. Formatting of the log vector chain is relatively simple - it's just a loop over the iovecs on each log vector, but it is made slightly more complex because we re-write the iovec after the copy to point back at the memory buffer we just copied into. This code also needs to pin log items. If the log item is not already tracked in this checkpoint context, then it needs to be pinned. Otherwise it is already pinned and we don't need to pin it again. The only other complexity is calculating the amount of new log space the formatting has consumed. This needs to be accounted to the transaction in progress, and the accounting is made more complex because we also need to steal space from it for log metadata in the checkpoint transaction. Calculate all this at insert time and update all the tickets, counters, etc. correctly. Once we've formatted all the log items in the transaction, attach the busy extents to the checkpoint context so the busy extents live until checkpoint completion and can be processed at that point in time. Transactions can then be freed at this point in time. Now we need to issue checkpoints - we are tracking the amount of log space used by the items in the CIL, so we can trigger background checkpoints when the space usage gets to a certain threshold. Otherwise, checkpoints need to be triggered when a log synchronisation point is reached - a log force event. Because the log write code already handles chained log vectors, writing the transaction is trivial, too.
Construct a transaction header, add it to the head of the chain and write it into the log, then issue a commit record write. Then we can release the checkpoint log ticket and attach the context to the log buffer so it can be called during IO completion to complete the checkpoint. We also need to allow for synchronising multiple in-flight checkpoints. This is needed for two things - the first is to ensure that checkpoint commit records appear in the log in the correct sequence order (so they are replayed in the correct order). The second is so that xfs_log_force_lsn() operates correctly and only flushes and/or waits for the specific sequence it was provided with. To do this we need a wait variable and a list tracking the checkpoint commits in progress. We can walk this list and wait for the checkpoints to change state or complete easily, and this provides the necessary synchronisation for correct operation in both cases. Signed-off-by: Dave Chinner --- fs/xfs/Makefile | 1 + fs/xfs/linux-2.6/xfs_super.c | 10 + fs/xfs/xfs_log.c | 67 ++++- fs/xfs/xfs_log.h | 9 +- fs/xfs/xfs_log_cil.c | 666 ++++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_log_priv.h | 71 +++++- fs/xfs/xfs_mount.h | 1 + fs/xfs/xfs_trans.c | 144 +++++++++- fs/xfs/xfs_trans.h | 8 +- fs/xfs/xfs_trans_item.c | 5 +- fs/xfs/xfs_trans_priv.h | 12 +- 11 files changed, 963 insertions(+), 31 deletions(-) create mode 100644 fs/xfs/xfs_log_cil.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index b4769e4..c8fb13f 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -77,6 +77,7 @@ xfs-y += xfs_alloc.o \ xfs_itable.o \ xfs_dfrag.o \ xfs_log.o \ + xfs_log_cil.o \ xfs_log_recover.o \ xfs_mount.o \ xfs_mru_cache.o \ diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c index 1e88c98..6a7c8c9 100644 --- a/fs/xfs/linux-2.6/xfs_super.c +++ b/fs/xfs/linux-2.6/xfs_super.c @@ -118,6 +118,8 @@ mempool_t *xfs_ioend_pool; #define MNTOPT_DMAPI "dmapi" /* DMI enabled (DMAPI / XDSM) */ #define MNTOPT_XDSM
"xdsm" /* DMI enabled (DMAPI / XDSM) */ #define MNTOPT_DMI "dmi" /* DMI enabled (DMAPI / XDSM) */ +#define MNTOPT_DELAYLOG "delaylog" /* Delayed loging enabled */ +#define MNTOPT_NODELAYLOG "nodelaylog" /* Delayed loging disabled */ /* * Table driven mount option parser. @@ -373,6 +375,13 @@ xfs_parseargs( mp->m_flags |= XFS_MOUNT_DMAPI; } else if (!strcmp(this_char, MNTOPT_DMI)) { mp->m_flags |= XFS_MOUNT_DMAPI; + } else if (!strcmp(this_char, MNTOPT_DELAYLOG)) { + mp->m_flags |= XFS_MOUNT_DELAYLOG; + cmn_err(CE_WARN, + "Enabling EXPERIMENTAL delayed logging feature " + "- use at your own risk.\n"); + } else if (!strcmp(this_char, MNTOPT_NODELAYLOG)) { + mp->m_flags &= ~XFS_MOUNT_DELAYLOG; } else if (!strcmp(this_char, "ihashsize")) { cmn_err(CE_WARN, "XFS: ihashsize no longer used, option is deprecated."); @@ -534,6 +543,7 @@ xfs_showargs( { XFS_MOUNT_FILESTREAMS, "," MNTOPT_FILESTREAM }, { XFS_MOUNT_DMAPI, "," MNTOPT_DMAPI }, { XFS_MOUNT_GRPID, "," MNTOPT_GRPID }, + { XFS_MOUNT_DELAYLOG, "," MNTOPT_DELAYLOG }, { 0, NULL } }; static struct proc_xfs_info xfs_info_unset[] = { diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 19d0c5f..23f2a05 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -54,9 +54,6 @@ STATIC xlog_t * xlog_alloc_log(xfs_mount_t *mp, STATIC int xlog_space_left(xlog_t *log, int cycle, int bytes); STATIC int xlog_sync(xlog_t *log, xlog_in_core_t *iclog); STATIC void xlog_dealloc_log(xlog_t *log); -STATIC int xlog_write(struct log *log, struct xfs_log_vec *log_vector, - struct xlog_ticket *tic, xfs_lsn_t *start_lsn, - xlog_in_core_t **commit_iclog, uint flags); /* local state machine functions */ STATIC void xlog_state_done_syncing(xlog_in_core_t *iclog, int); @@ -86,12 +83,6 @@ STATIC int xlog_regrant_write_log_space(xlog_t *log, STATIC void xlog_ungrant_log_space(xlog_t *log, xlog_ticket_t *ticket); - -/* local ticket functions */ -STATIC xlog_ticket_t *xlog_ticket_alloc(xlog_t *log, int unit_bytes, int count, - char clientid, uint 
flags, - int alloc_flags); - #if defined(DEBUG) STATIC void xlog_verify_dest_ptr(xlog_t *log, char *ptr); STATIC void xlog_verify_grant_head(xlog_t *log, int equals); @@ -460,6 +451,16 @@ xfs_log_mount( /* Normal transactions can now occur */ mp->m_log->l_flags &= ~XLOG_ACTIVE_RECOVERY; + /* + * Now the log has been fully initialised and we know where our + * space grant counters are, we can initialise the permanent ticket + * needed for delayed logging to work. + */ + error = xlog_cil_init_post_recovery(mp->m_log); + if (error) { + ASSERT(0); + goto out_destroy_ail; + } return 0; out_destroy_ail: @@ -666,6 +667,10 @@ xfs_log_item_init( item->li_ailp = mp->m_ail; item->li_type = type; item->li_ops = ops; + item->li_lv = NULL; + + INIT_LIST_HEAD(&item->li_ail); + INIT_LIST_HEAD(&item->li_cil); } /* @@ -1176,6 +1181,9 @@ xlog_alloc_log(xfs_mount_t *mp, *iclogp = log->l_iclog; /* complete ring */ log->l_iclog->ic_prev = prev_iclog; /* re-write 1st prev ptr */ + error = xlog_cil_init(log); + if (error) + goto out_free_iclog; return log; out_free_iclog: @@ -1502,6 +1510,8 @@ xlog_dealloc_log(xlog_t *log) xlog_in_core_t *iclog, *next_iclog; int i; + xlog_cil_destroy(log); + iclog = log->l_iclog; for (i=0; i<log->l_iclog_bufs; i++) { sv_destroy(&iclog->ic_force_wait); @@ -1544,8 +1554,10 @@ xlog_state_finish_copy(xlog_t *log, * print out info relating to regions written which consume * the reservation */ -STATIC void -xlog_print_tic_res(xfs_mount_t *mp, xlog_ticket_t *ticket) +void +xlog_print_tic_res( + struct xfs_mount *mp, + struct xlog_ticket *ticket) { uint i; uint ophdr_spc = ticket->t_res_num_ophdrs * (uint)sizeof(xlog_op_header_t); @@ -1877,7 +1889,7 @@ xlog_write_copy_finish( * we don't update ic_offset until the end when we know exactly how many * bytes have been written out.
*/ -STATIC int +int xlog_write( struct log *log, struct xfs_log_vec *log_vector, @@ -1901,9 +1913,26 @@ xlog_write( *start_lsn = 0; len = xlog_write_calc_vec_length(ticket, log_vector); - if (ticket->t_curr_res < len) + if (log->l_cilp) { + /* + * Region headers and bytes are already accounted for. + * We only need to take into account start records and + * split regions in this function. + */ + if (ticket->t_flags & XLOG_TIC_INITED) + ticket->t_curr_res -= sizeof(xlog_op_header_t); + + /* + * Commit record headers need to be accounted for. These + * come in as separate writes so are easy to detect. + */ + if (flags & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) + ticket->t_curr_res -= sizeof(xlog_op_header_t); + } else + ticket->t_curr_res -= len; + + if (ticket->t_curr_res < 0) xlog_print_tic_res(log->l_mp, ticket); - ticket->t_curr_res -= len; index = 0; lv = log_vector; @@ -2999,6 +3028,8 @@ _xfs_log_force( XFS_STATS_INC(xs_log_force); + xlog_cil_push(log, 1); + spin_lock(&log->l_icloglock); iclog = log->l_iclog; @@ -3148,6 +3179,12 @@ _xfs_log_force_lsn( XFS_STATS_INC(xs_log_force); + if (log->l_cilp) { + lsn = xlog_cil_push_lsn(log, lsn); + if (lsn == NULLCOMMITLSN) + return 0; + } + try_again: spin_lock(&log->l_icloglock); iclog = log->l_iclog; @@ -3322,7 +3359,7 @@ xfs_log_get_trans_ident( /* * Allocate and initialise a new log ticket. 
*/ -STATIC xlog_ticket_t * +xlog_ticket_t * xlog_ticket_alloc( struct log *log, int unit_bytes, diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index 05f205a..4a0c574 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -113,6 +113,9 @@ struct xfs_log_vec { struct xfs_log_vec *lv_next; /* next lv in build list */ int lv_niovecs; /* number of iovecs in lv */ struct xfs_log_iovec *lv_iovecp; /* iovec array */ + struct xfs_log_item *lv_item; /* owner */ + char *lv_buf; /* formatted buffer */ + int lv_buf_len; /* size of formatted buffer */ }; /* @@ -187,11 +190,15 @@ int xfs_log_need_covered(struct xfs_mount *mp); void xlog_iodone(struct xfs_buf *); -struct xlog_ticket * xfs_log_ticket_get(struct xlog_ticket *ticket); +struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket); void xfs_log_ticket_put(struct xlog_ticket *ticket); xlog_tid_t xfs_log_get_trans_ident(struct xfs_trans *tp); +int xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp, + struct xfs_log_vec *log_vector, + xfs_lsn_t *commit_lsn, int flags); + #endif diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c new file mode 100644 index 0000000..3cb1957 --- /dev/null +++ b/fs/xfs/xfs_log_cil.c @@ -0,0 +1,666 @@ +/* + * Copyright (c) 2010 Redhat, Inc. All Rights Reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_types.h" +#include "xfs_bit.h" +#include "xfs_log.h" +#include "xfs_inum.h" +#include "xfs_trans.h" +#include "xfs_trans_priv.h" +#include "xfs_log_priv.h" +#include "xfs_sb.h" +#include "xfs_ag.h" +#include "xfs_dir2.h" +#include "xfs_dmapi.h" +#include "xfs_mount.h" +#include "xfs_error.h" +#include "xfs_alloc.h" + +/* + * Perform initial CIL structure initialisation. If the CIL is not + * enabled in this filesystem, ensure the log->l_cilp is null so + * we can check this conditional to determine if we are doing delayed + * logging or not. + */ +int +xlog_cil_init( + struct log *log) +{ + struct xfs_cil *cil; + struct xfs_cil_ctx *ctx; + + log->l_cilp = NULL; + if (!(log->l_mp->m_flags & XFS_MOUNT_DELAYLOG)) + return 0; + + cil = kmem_zalloc(sizeof(*cil), KM_SLEEP|KM_MAYFAIL); + if (!cil) + return ENOMEM; + + ctx = kmem_zalloc(sizeof(*ctx), KM_SLEEP|KM_MAYFAIL); + if (!ctx) { + kmem_free(cil); + return ENOMEM; + } + + INIT_LIST_HEAD(&cil->xc_cil); + INIT_LIST_HEAD(&cil->xc_committing); + spin_lock_init(&cil->xc_cil_lock); + init_rwsem(&cil->xc_ctx_lock); + sv_init(&cil->xc_commit_wait, SV_DEFAULT, "cilwait"); + + INIT_LIST_HEAD(&ctx->committing); + INIT_LIST_HEAD(&ctx->busy_extents); + ctx->sequence = 1; + ctx->cil = cil; + cil->xc_ctx = ctx; + + cil->xc_log = log; + log->l_cilp = cil; + return 0; +} + +void +xlog_cil_destroy( + struct log *log) +{ + if (!log->l_cilp) + return; + + if (log->l_cilp->xc_ctx) { + if (log->l_cilp->xc_ctx->ticket) + xfs_log_ticket_put(log->l_cilp->xc_ctx->ticket); + kmem_free(log->l_cilp->xc_ctx); + } + + ASSERT(list_empty(&log->l_cilp->xc_cil)); + kmem_free(log->l_cilp); +} + +/* + * Allocate a new ticket. 
Failing to get a new ticket makes it really hard to + * recover, so we don't allow failure here. Also, we allocate in a context that + * we don't want to be issuing transactions from, so we need to tell the + * allocation code this as well. + * + * We don't reserve any space for the ticket - we are going to steal whatever + * space we require from transactions as they commit. To ensure we reserve all + * the space required, we need to set the current reservation of the ticket to + * zero so that we know to steal the initial transaction overhead from the + * first transaction commit. + */ +static struct xlog_ticket * +xlog_cil_ticket_alloc( + struct log *log) +{ + struct xlog_ticket *tic; + + tic = xlog_ticket_alloc(log, 0, 1, XFS_TRANSACTION, 0, + KM_SLEEP|KM_NOFS); + tic->t_trans_type = XFS_TRANS_CHECKPOINT; + + /* + * set the current reservation to zero so we know to steal the basic + * transaction overhead reservation from the first transaction commit. + */ + tic->t_curr_res = 0; + return tic; +} + +/* + * After the first stage of log recovery is done, we know where the head and + * tail of the log are. We need this log initialisation done before we can + * initialise the first CIL checkpoint context. + * + * Here we allocate a log ticket to track space usage during a CIL push. This + * ticket is passed to xlog_write() directly so that we don't slowly leak log + * space by failing to account for space used by log headers and additional + * region headers for split regions. + */ +int +xlog_cil_init_post_recovery( + struct log *log) +{ + if (!log->l_cilp) + return 0; + + log->l_cilp->xc_ctx->ticket = xlog_cil_ticket_alloc(log); + log->l_cilp->xc_ctx->sequence = 1; + log->l_cilp->xc_ctx->commit_lsn = xlog_assign_lsn(log->l_curr_cycle, + log->l_curr_block); + return 0; +} + +/* + * Insert the log item into the CIL and calculate the difference in space + * consumed by the item. 
Add the space to the checkpoint ticket and calculate + * if the change requires additional log metadata. If it does, take that space + * as well. Remove the amount of space we addded to the checkpoint ticket from + * the current transaction ticket so that the accounting works out correctly. + * + * If this is the first time the item is being placed into the CIL in this + * context, pin it so it can't be written to disk until the CIL is flushed to + * the iclog and the iclog written to disk. + */ +static void +xlog_cil_insert( + struct log *log, + struct xlog_ticket *ticket, + struct xfs_log_item *item, + struct xfs_log_vec *lv) +{ + struct xfs_cil *cil = log->l_cilp; + struct xfs_log_vec *old = lv->lv_item->li_lv; + struct xfs_cil_ctx *ctx = cil->xc_ctx; + int len; + int diff_iovecs; + int iclog_space; + + if (old) { + /* existing lv on log item, space used is a delta */ + ASSERT(!list_empty(&item->li_cil)); + ASSERT(old->lv_buf && old->lv_buf_len && old->lv_niovecs); + + len = lv->lv_buf_len - old->lv_buf_len; + diff_iovecs = lv->lv_niovecs - old->lv_niovecs; + kmem_free(old->lv_buf); + kmem_free(old); + } else { + /* new lv, must pin the log item */ + ASSERT(!lv->lv_item->li_lv); + ASSERT(list_empty(&item->li_cil)); + + len = lv->lv_buf_len; + diff_iovecs = lv->lv_niovecs; + IOP_PIN(lv->lv_item); + + } + len += diff_iovecs * sizeof(xlog_op_header_t); + + /* attach new log vector to log item */ + lv->lv_item->li_lv = lv; + + spin_lock(&cil->xc_cil_lock); + list_move_tail(&item->li_cil, &cil->xc_cil); + ctx->nvecs += diff_iovecs; + + /* + * Now transfer enough transaction reservation to the context ticket + * for the checkpoint. The context ticket is special - the unit + * reservation has to grow as well as the current reservation as we + * steal from tickets so we can correctly determine the space used + * during the transaction commit. 
+ */ + if (ctx->ticket->t_curr_res == 0) { + /* first commit in checkpoint, steal the header reservation */ + ASSERT(ticket->t_curr_res >= ctx->ticket->t_unit_res + len); + ctx->ticket->t_curr_res = ctx->ticket->t_unit_res; + ticket->t_curr_res -= ctx->ticket->t_unit_res; + } + + /* do we need space for more log record headers? */ + iclog_space = log->l_iclog_size - log->l_iclog_hsize; + if (len > 0 && (ctx->space_used / iclog_space != + (ctx->space_used + len) / iclog_space)) { + int hdrs; + + hdrs = (len + iclog_space - 1) / iclog_space; + /* need to take into account split region headers, too */ + hdrs *= log->l_iclog_hsize + sizeof(struct xlog_op_header); + ctx->ticket->t_unit_res += hdrs; + ctx->ticket->t_curr_res += hdrs; + ticket->t_curr_res -= hdrs; + ASSERT(ticket->t_curr_res >= len); + } + ticket->t_curr_res -= len; + ctx->space_used += len; + + spin_unlock(&cil->xc_cil_lock); +} + +/* + * Format log item into a flat buffers + * + * For delayed logging, we need to hold a formatted buffer containing + * all the changes on the log item. This enables us to relog the item + * in memory and write it out asynchronously without needing to relock + * the object that was modified at the time it gets written into the + * iclog. + * + * This function works out the length of the buffer needed for each + * log item, allocates them and formats the the log vector for the item + * into the buffer. The buffer is then attached to the log item and the + * vector is formatted into the buffer. The log item and formatted log vector + * are then inserted into the Committed Item List for tracking until the + * next checkpoint is written out. + */ +static void +xlog_cil_format_items( + struct log *log, + struct xfs_log_vec *log_vector, + struct xlog_ticket *ticket, + xfs_lsn_t *start_lsn) +{ + struct xfs_log_vec *lv; + + if (start_lsn) + *start_lsn = log->l_cilp->xc_ctx->sequence; + + /* + * we don't set up region headers here; we simply copy the regions into + * the flat buffer. 
We can do this because we still have to do a + * formatting step to write the regions into the iclog buffer. Writing + * the ophdrs during the iclog write means that we can support + * splitting large regions across iclog boundares without needing a + * change in the format of the item/region encapsulation. + * + * Hence what we need to do now is change the vector buffer pointer to + * point to the copied region inside the buffer we just allocated. This + * allows us to format the regions into the iclog as though they are + * being formatted directly out of the objects themselves. + */ + ASSERT(log_vector); + for (lv = log_vector; lv; lv = lv->lv_next) { + void *ptr; + int index; + int offset = 0; + int len = 0; + + for (index = 0; index < lv->lv_niovecs; index++) + len += lv->lv_iovecp[index].i_len; + + lv->lv_buf_len = len; + lv->lv_buf = kmem_zalloc(lv->lv_buf_len, KM_SLEEP|KM_NOFS); + ptr = lv->lv_buf; + + for (index = 0; index < lv->lv_niovecs; index++) { + struct xfs_log_iovec *vec = &lv->lv_iovecp[index]; + + memcpy(ptr, vec->i_addr, vec->i_len); + vec->i_addr = ptr; + xlog_write_adv_cnt(&ptr, &len, &offset, vec->i_len); + } + ASSERT(len == 0); + + xlog_cil_insert(log, ticket, lv->lv_item, lv); + } +} + +static void +xlog_cil_free_logvec( + struct xfs_log_vec *log_vector) +{ + struct xfs_log_vec *lv; + + for (lv = log_vector; lv; ) { + struct xfs_log_vec *next = lv->lv_next; + kmem_free(lv->lv_buf); + kmem_free(lv); + lv = next; + } +} + +/* + * Commit a transaction with the given vector to the Committed Item List. + * + * To do this, we need to format the item, pin it in memory if required and + * account for the space used by the transaction. Once we have done that we + * need to release the unused reservation for the transaction, attach the + * transaction to the checkpoint context so we carry the busy extents through + * to checkpoint completion, and then unlock all the items in the transaction. 
+ * + * For more specific information about the order of operations in + * xfs_log_commit_cil() please refer to the comments in + * xfs_trans_commit_iclog(). + */ +int +xfs_log_commit_cil( + struct xfs_mount *mp, + struct xfs_trans *tp, + struct xfs_log_vec *log_vector, + xfs_lsn_t *commit_lsn, + int flags) +{ + struct log *log = mp->m_log; + int log_flags = 0; + + if (flags & XFS_TRANS_RELEASE_LOG_RES) + log_flags = XFS_LOG_REL_PERM_RESERV; + + if (XLOG_FORCED_SHUTDOWN(log)) { + xlog_cil_free_logvec(log_vector); + return XFS_ERROR(EIO); + } + + /* lock out background commit */ + down_read(&log->l_cilp->xc_ctx_lock); + xlog_cil_format_items(log, log_vector, tp->t_ticket, commit_lsn); + + /* check we didn't blow the reservation */ + if (tp->t_ticket->t_curr_res < 0) + xlog_print_tic_res(log->l_mp, tp->t_ticket); + + /* attach the transaction to the CIL if it has any busy extents */ + if (!list_empty(&tp->t_busy)) { + spin_lock(&log->l_cilp->xc_cil_lock); + list_splice_init(&tp->t_busy, + &log->l_cilp->xc_ctx->busy_extents); + spin_unlock(&log->l_cilp->xc_cil_lock); + } + + tp->t_commit_lsn = *commit_lsn; + xfs_log_done(mp, tp->t_ticket, NULL, log_flags); + xfs_trans_unreserve_and_mod_sb(tp); + + /* background commit is allowed again */ + up_read(&log->l_cilp->xc_ctx_lock); + current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS); + + /* xfs_trans_free_items() unlocks them first */ + xfs_trans_free_items(tp, *commit_lsn, 0); + xfs_trans_free(tp); + return 0; +} + +/* + * Mark all items committed and clear busy extents. We free the log vector + * chains in a separate pass so that we unpin the log items as quickly as + * possible. + */ +static void +xlog_cil_committed( + void *args, + int abort) +{ + struct xfs_cil_ctx *ctx = args; + struct xfs_log_vec *lv; + int abortflag = abort ? 
XFS_LI_ABORTED : 0; + struct xfs_busy_extent *busyp, *n; + + /* unpin all the log items */ + for (lv = ctx->lv_chain; lv; lv = lv->lv_next ) { + xfs_trans_item_committed(lv->lv_item, ctx->start_lsn, + abortflag); + } + + list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list) + xfs_alloc_busy_clear(ctx->cil->xc_log->l_mp, busyp); + + spin_lock(&ctx->cil->xc_cil_lock); + list_del(&ctx->committing); + spin_unlock(&ctx->cil->xc_cil_lock); + + xlog_cil_free_logvec(ctx->lv_chain); + kmem_free(ctx); +} + +/* + * Push the Committed Item List to the log. If the push_now flag is not set, + * then it is a background flush and so we can chose to ignore it. + */ +int +xlog_cil_push( + struct log *log, + int push_now) +{ + struct xfs_cil *cil = log->l_cilp; + struct xfs_log_vec *lv; + struct xfs_cil_ctx *ctx; + struct xfs_cil_ctx *new_ctx; + struct xlog_in_core *commit_iclog; + struct xlog_ticket *tic; + int num_lv; + int num_iovecs; + int len; + int error = 0; + struct xfs_trans_header thdr; + struct xfs_log_iovec lhdr; + struct xfs_log_vec lvhdr = { NULL }; + xfs_lsn_t commit_lsn; + + if (!cil) + return 0; + + /* XXX: don't sleep for background? */ + new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_SLEEP|KM_NOFS); + new_ctx->ticket = xlog_cil_ticket_alloc(log); + + /* lock out transaction commit */ + down_write(&cil->xc_ctx_lock); + ctx = cil->xc_ctx; + + /* check if we've anything to push */ + if (list_empty(&cil->xc_cil)) { + up_write(&cil->xc_ctx_lock); + xfs_log_ticket_put(new_ctx->ticket); + kmem_free(new_ctx); + return 0; + } + + /* + * pull all the log vectors off the items in the CIL, and + * remove the items from the CIL. We don't need the CIL lock + * here because it's only needed on the transaction commit + * side which is currently locked out by the flush lock. 
+ */ + lv = NULL; + num_lv = 0; + num_iovecs = 0; + len = 0; + while (!list_empty(&cil->xc_cil)) { + struct xfs_log_item *item; + int i; + + item = list_first_entry(&cil->xc_cil, + struct xfs_log_item, li_cil); + list_del_init(&item->li_cil); + if (!ctx->lv_chain) + ctx->lv_chain = item->li_lv; + else + lv->lv_next = item->li_lv; + lv = item->li_lv; + item->li_lv = NULL; + + num_lv++; + num_iovecs += lv->lv_niovecs; + for (i = 0; i < lv->lv_niovecs; i++) + len += lv->lv_iovecp[i].i_len; + } + + /* + * initialise the new context and attach it to the CIL. Then attach + * the current context to the CIL committing lsit so it can be found + * during log forces to extract the commit lsn of the sequence that + * needs to be forced. + */ + INIT_LIST_HEAD(&new_ctx->committing); + INIT_LIST_HEAD(&new_ctx->busy_extents); + new_ctx->sequence = ctx->sequence + 1; + new_ctx->cil = cil; + cil->xc_ctx = new_ctx; + + /* + * The switch is now done, so we can drop the context lock and move out + * of a shared context. We can't just go straight to the commit record, + * though - we need to synchronise with previous and future commits so + * that the commit records are correctly ordered in the log to ensure + * that we process items during log IO completion in the correct order. + * + * For example, if we get an EFI in one checkpoint and the EFD in the + * next (e.g. due to log forces), we do not want the checkpoint with + * the EFD to be committed before the checkpoint with the EFI. Hence + * we must strictly order the commit records of the checkpoints so + * that: a) the checkpoint callbacks are attached to the iclogs in the + * correct order; and b) the checkpoints are replayed in correct order + * in log recovery. + * + * Hence we need to add this context to the committing context list so + * that higher sequences will wait for us to write out a commit record + * before they do. 
+	 */
+	spin_lock(&cil->xc_cil_lock);
+	list_add(&ctx->committing, &cil->xc_committing);
+	spin_unlock(&cil->xc_cil_lock);
+	up_write(&cil->xc_ctx_lock);
+
+	/*
+	 * Build a checkpoint transaction header and write it to the log to
+	 * begin the transaction. We need to account for the space used by the
+	 * transaction header here as it is not accounted for in xlog_write().
+	 *
+	 * The LSN we need to pass to the log items on transaction commit is
+	 * the LSN reported by the first log vector write. If we use the commit
+	 * record lsn then we can move the tail beyond the grant write head.
+	 */
+	tic = ctx->ticket;
+	thdr.th_magic = XFS_TRANS_HEADER_MAGIC;
+	thdr.th_type = XFS_TRANS_CHECKPOINT;
+	thdr.th_tid = tic->t_tid;
+	thdr.th_num_items = num_iovecs;
+	lhdr.i_addr = (xfs_caddr_t)&thdr;
+	lhdr.i_len = sizeof(xfs_trans_header_t);
+	lhdr.i_type = XLOG_REG_TYPE_TRANSHDR;
+	tic->t_curr_res -= lhdr.i_len + sizeof(xlog_op_header_t);
+
+	lvhdr.lv_niovecs = 1;
+	lvhdr.lv_iovecp = &lhdr;
+	lvhdr.lv_next = ctx->lv_chain;
+
+	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0);
+	if (error)
+		goto out_abort;
+
+	/*
+	 * now that we've written the checkpoint into the log, strictly
+	 * order the commit records so replay will get them in the right order.
+	 */
+restart:
+	spin_lock(&cil->xc_cil_lock);
+	list_for_each_entry(new_ctx, &cil->xc_committing, committing) {
+		/*
+		 * Higher sequences will wait for this one so skip them.
+		 * Don't wait for our own sequence, either.
+		 */
+		if (new_ctx->sequence >= ctx->sequence)
+			continue;
+		if (!new_ctx->commit_lsn) {
+			/*
+			 * It is still being pushed! Wait for the push to
+			 * complete, then start again from the beginning.
+			 */
+			sv_wait(&cil->xc_commit_wait, 0, &cil->xc_cil_lock, 0);
+			goto restart;
+		}
+	}
+	spin_unlock(&cil->xc_cil_lock);
+
+	commit_lsn = xfs_log_done(log->l_mp, tic, &commit_iclog, 0);
+	if (error || commit_lsn == -1)
+		goto out_abort;
+
+	/* attach all the transactions w/ busy extents to iclog */
+	ctx->log_cb.cb_func = xlog_cil_committed;
+	ctx->log_cb.cb_arg = ctx;
+	error = xfs_log_notify(log->l_mp, commit_iclog, &ctx->log_cb);
+	if (error)
+		goto out_abort;
+
+	/*
+	 * now the checkpoint commit is complete and we've attached the
+	 * callbacks to the iclog we can assign the commit LSN to the context
+	 * and wake up anyone who is waiting for the commit to complete.
+	 */
+	spin_lock(&cil->xc_cil_lock);
+	ctx->commit_lsn = commit_lsn;
+	sv_broadcast(&cil->xc_commit_wait);
+	spin_unlock(&cil->xc_cil_lock);
+
+	/* release the hounds! */
+	return xfs_log_release_iclog(log->l_mp, commit_iclog);
+
+out_abort:
+	xlog_cil_committed(ctx, XFS_LI_ABORTED);
+	return XFS_ERROR(EIO);
+}
+
+/*
+ * Conditionally push the CIL based on the sequence passed in.
+ *
+ * We only need to push if we haven't already pushed the sequence
+ * number given. Hence the only time we will trigger a push here is
+ * if the push sequence is the same as the current context.
+ *
+ * We return the current commit lsn to allow the callers to determine if an
+ * iclog flush is necessary following this call.
+ *
+ * XXX: Initially, just push the CIL unconditionally and return whatever
+ * commit lsn is there. It'll be empty, so this is broken for now.
+ */
+xfs_lsn_t
+xlog_cil_push_lsn(
+	struct log	*log,
+	xfs_lsn_t	push_seq)
+{
+	struct xfs_cil		*cil = log->l_cilp;
+	struct xfs_cil_ctx	*ctx;
+	xfs_lsn_t		commit_lsn = NULLCOMMITLSN;
+
+restart:
+	down_write(&cil->xc_ctx_lock);
+	ASSERT(push_seq <= cil->xc_ctx->sequence);
+
+	/* check to see if we need to force out the current context */
+	if (push_seq == cil->xc_ctx->sequence) {
+		up_write(&cil->xc_ctx_lock);
+		xlog_cil_push(log, 1);
+		goto restart;
+	}
+
+	/*
+	 * See if we can find a previous sequence still committing.
+	 * We can drop the flush lock as soon as we have the cil lock
+	 * because we are now only comparing contexts protected by
+	 * the cil lock.
+	 *
+	 * We need to wait for all previous sequence commits to complete
+	 * before allowing the force of push_seq to go ahead. Hence block
+	 * on commits for those as well.
+	 */
+	spin_lock(&cil->xc_cil_lock);
+	up_write(&cil->xc_ctx_lock);
+	list_for_each_entry(ctx, &cil->xc_committing, committing) {
+		if (ctx->sequence > push_seq)
+			continue;
+		if (!ctx->commit_lsn) {
+			/*
+			 * It is still being pushed! Wait for the push to
+			 * complete, then start again from the beginning.
+			 */
+			sv_wait(&cil->xc_commit_wait, 0, &cil->xc_cil_lock, 0);
+			goto restart;
+		}
+		if (ctx->sequence != push_seq)
+			continue;
+		/* found it! */
+		commit_lsn = ctx->commit_lsn;
+	}
+	spin_unlock(&cil->xc_cil_lock);
+	return commit_lsn;
+}
+
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index ac97bdd..f9a0e64 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -377,6 +377,54 @@ typedef struct xlog_in_core {
 } xlog_in_core_t;
 
 /*
+ * The CIL context is used to aggregate per-transaction details as well as be
+ * passed to the iclog for checkpoint post-commit processing. After being
+ * passed to the iclog, another context needs to be allocated for tracking the
+ * next set of transactions to be aggregated into a checkpoint.
+ */
+struct xfs_cil;
+
+struct xfs_cil_ctx {
+	struct xfs_cil		*cil;
+	xfs_lsn_t		sequence;	/* chkpt sequence # */
+	xfs_lsn_t		start_lsn;	/* first LSN of chkpt commit */
+	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
+	struct xlog_ticket	*ticket;	/* chkpt ticket */
+	int			nvecs;		/* number of regions */
+	int			space_used;	/* aggregate size of regions */
+	struct list_head	busy_extents;	/* busy extents in chkpt */
+	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
+	xfs_log_callback_t	log_cb;		/* completion callback hook. */
+	struct list_head	committing;	/* ctx committing list */
+};
+
+/*
+ * Committed Item List structure
+ *
+ * This structure is used to track log items that have been committed but not
+ * yet written into the log. It is used only when the delayed logging mount
+ * option is enabled.
+ *
+ * This structure tracks the list of committing checkpoint contexts so
+ * we can avoid the problem of having to hold out new transactions during a
+ * flush until we have the commit record LSN of the checkpoint. We can
+ * traverse the list of committing contexts in xlog_cil_push_lsn() to find a
+ * sequence match and extract the commit LSN directly from there. If the
+ * checkpoint is still in the process of committing, we can block waiting for
+ * the commit LSN to be determined as well. This should make synchronous
+ * operations almost as efficient as the old logging methods.
+ */
+struct xfs_cil {
+	struct log		*xc_log;
+	struct list_head	xc_cil;
+	spinlock_t		xc_cil_lock;
+	struct xfs_cil_ctx	*xc_ctx;
+	struct rw_semaphore	xc_ctx_lock;
+	struct list_head	xc_committing;
+	sv_t			xc_commit_wait;
+};
+
+/*
  * The reservation head lsn is not made up of a cycle number and block number.
  * Instead, it uses a cycle number and byte number.
Logs don't expect to
 * overflow 31 bits worth of byte offset, so using a byte number will mean
@@ -386,6 +434,7 @@ typedef struct log {
 	/* The following fields don't need locking */
 	struct xfs_mount	*l_mp;		/* mount point */
 	struct xfs_ail		*l_ailp;	/* AIL log is working with */
+	struct xfs_cil		*l_cilp;	/* CIL log is working with */
 	struct xfs_buf		*l_xbuf;	/* extra buffer for log
 						 * wrapping */
 	struct xfs_buftarg	*l_targ;	/* buftarg of log */
@@ -436,14 +485,17 @@ typedef struct log {
 
 #define XLOG_FORCED_SHUTDOWN(log)	((log)->l_flags & XLOG_IO_ERROR)
 
-
 /* common routines */
 extern xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp);
 extern int	 xlog_recover(xlog_t *log);
 extern int	 xlog_recover_finish(xlog_t *log);
 extern void	 xlog_pack_data(xlog_t *log, xlog_in_core_t *iclog, int);
 
-extern kmem_zone_t	*xfs_log_ticket_zone;
+extern kmem_zone_t *xfs_log_ticket_zone;
+struct xlog_ticket *xlog_ticket_alloc(struct log *log, int unit_bytes,
+				int count, char client, uint xflags,
+				int alloc_flags);
+
 
 static inline void
 xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes)
@@ -453,6 +505,21 @@ xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes)
 	*off += bytes;
 }
 
+void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
+int	xlog_write(struct log *log, struct xfs_log_vec *log_vector,
+				struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
+				xlog_in_core_t **commit_iclog, uint flags);
+
+/*
+ * Committed Item List interfaces
+ */
+int	xlog_cil_init(struct log *log);
+int	xlog_cil_init_post_recovery(struct log *log);
+void	xlog_cil_destroy(struct log *log);
+
+int	xlog_cil_push(struct log *log, int push_now);
+xfs_lsn_t xlog_cil_push_lsn(struct log *log, xfs_lsn_t push_sequence);
+
 /*
  * Unmount record type is used as a pseudo transaction type for the ticket.
  * It's value must be outside the range of XFS_TRANS_* values.
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 9ff48a1..1d2c7ee 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -268,6 +268,7 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_WSYNC		(1ULL << 0)	/* for nfs - all metadata ops
 						   must be synchronous except
 						   for space allocations */
+#define XFS_MOUNT_DELAYLOG	(1ULL << 1)	/* delayed logging is enabled */
 #define XFS_MOUNT_DMAPI		(1ULL << 2)	/* dmapi is enabled */
 #define XFS_MOUNT_WAS_CLEAN	(1ULL << 3)
 #define XFS_MOUNT_FS_SHUTDOWN	(1ULL << 4)	/* atomic stop of all filesystem
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 40d9595..9bdb492 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -253,7 +253,7 @@ _xfs_trans_alloc(
  * Free the transaction structure. If there is more clean up
  * to do when the structure is freed, add it here.
  */
-STATIC void
+void
 xfs_trans_free(
 	struct xfs_trans	*tp)
 {
@@ -655,7 +655,7 @@ xfs_trans_apply_sb_deltas(
  * XFS_TRANS_SB_DIRTY will not be set when the transaction is updated but we
  * still need to update the incore superblock with the changes.
  */
-STATIC void
+void
 xfs_trans_unreserve_and_mod_sb(
 	xfs_trans_t	*tp)
 {
@@ -883,7 +883,7 @@ xfs_trans_fill_vecs(
  * they could be immediately flushed and we'd have to race with the flusher
  * trying to pull the item from the AIL as we add it.
  */
-static void
+void
 xfs_trans_item_committed(
 	struct xfs_log_item	*lip,
 	xfs_lsn_t		commit_lsn,
@@ -994,7 +994,7 @@ xfs_trans_uncommit(
 	xfs_trans_unreserve_and_mod_sb(tp);
 	xfs_trans_unreserve_and_mod_dquots(tp);
 
-	xfs_trans_free_items(tp, flags);
+	xfs_trans_free_items(tp, NULLCOMMITLSN, flags);
 	xfs_trans_free(tp);
 }
 
@@ -1144,6 +1144,132 @@ xfs_trans_commit_iclog(
 
 	return xfs_log_release_iclog(mp, commit_iclog);
 }
+/*
+ * Walk the log items and allocate log vector structures for
+ * each item large enough to fit all the vectors they require.
+ * Note that this format differs from the old log vector format in
+ * that there is no transaction header in these log vectors.
+ */
+STATIC struct xfs_log_vec *
+xfs_trans_alloc_log_vecs(
+	xfs_trans_t	*tp)
+{
+	xfs_log_item_desc_t	*lidp;
+	struct xfs_log_vec	*lv = NULL;
+	struct xfs_log_vec	*ret_lv = NULL;
+
+	lidp = xfs_trans_first_item(tp);
+
+	/* Bail out if we didn't find a log item. */
+	if (!lidp) {
+		ASSERT(0);
+		return NULL;
+	}
+
+	while (lidp != NULL) {
+		struct xfs_log_vec *new_lv;
+
+		/* Skip items which aren't dirty in this transaction. */
+		if (!(lidp->lid_flags & XFS_LID_DIRTY)) {
+			lidp = xfs_trans_next_item(tp, lidp);
+			continue;
+		}
+
+		/* Skip items that do not have any vectors for writing */
+		lidp->lid_size = IOP_SIZE(lidp->lid_item);
+		if (!lidp->lid_size) {
+			lidp = xfs_trans_next_item(tp, lidp);
+			continue;
+		}
+
+		new_lv = kmem_zalloc(sizeof(*new_lv) +
+				lidp->lid_size * sizeof(struct xfs_log_iovec),
+				KM_SLEEP);
+
+		/* The allocated iovec region lies beyond the log vector. */
+		new_lv->lv_iovecp = (struct xfs_log_iovec *)&new_lv[1];
+		if (!ret_lv)
+			ret_lv = new_lv;
+		else
+			lv->lv_next = new_lv;
+		lv = new_lv;
+		lidp = xfs_trans_next_item(tp, lidp);
+	}
+
+	return ret_lv;
+}
+
+/*
+ * Fill in the vector with pointers to data to be logged
+ * by this transaction.
+ * Each dirty item takes the
+ * number of vectors it indicated it needed in xfs_trans_alloc_log_vecs().
+ * There is no transaction header in this format.
+ *
+ * We do not pin the items here as they are formatted, we leave that to
+ * the CIL commit. This is done because the pinning of the item is
+ * conditional on whether the item is already pinned in the CIL. Hence
+ * the check and pin must be done under the protection of the flush lock.
+ */
+STATIC void
+xfs_trans_fill_log_vecs(
+	struct xfs_trans	*tp,
+	struct xfs_log_vec	*log_vector)
+{
+	xfs_log_item_desc_t	*lidp;
+	struct xfs_log_vec	*lv = log_vector;
+
+	lidp = xfs_trans_first_item(tp);
+	ASSERT(lidp);
+	while (lidp) {
+		/*
+		 * Skip items which aren't dirty in this transaction.
+		 */
+		if (!(lidp->lid_flags & XFS_LID_DIRTY)) {
+			lidp = xfs_trans_next_item(tp, lidp);
+			continue;
+		}
+		/* Skip items that do not have any vectors for writing */
+		if (!lidp->lid_size) {
+			lidp = xfs_trans_next_item(tp, lidp);
+			continue;
+		}
+		IOP_FORMAT(lidp->lid_item, lv->lv_iovecp);
+		lv->lv_niovecs = lidp->lid_size;
+		lv->lv_item = lidp->lid_item;
+
+		lidp = xfs_trans_next_item(tp, lidp);
+		lv = lv->lv_next;
+	}
+}
+
+static int
+xfs_trans_commit_cil(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_lsn_t		*commit_lsn,
+	int			flags)
+{
+	struct xfs_log_vec	*log_vector;
+
+	/*
+	 * Get each log item to allocate a vector structure for
+	 * the log item to pass to the log write code.
+	 */
+	log_vector = xfs_trans_alloc_log_vecs(tp);
+	if (!log_vector)
+		return ENOMEM;
+
+	/*
+	 * Fill in the log_vector and pin the logged items, and
+	 * then write the transaction to the log. We have to lock
+	 * out CIL flushes from this point as we are going to pin
+	 */
+	xfs_trans_fill_log_vecs(tp, log_vector);
+
+	return xfs_log_commit_cil(mp, tp, log_vector, commit_lsn, flags);
+
+}
 
 /*
  * xfs_trans_commit
@@ -1204,7 +1330,11 @@ _xfs_trans_commit(
 	xfs_trans_apply_sb_deltas(tp);
 	xfs_trans_apply_dquot_deltas(tp);
 
-	error = xfs_trans_commit_iclog(mp, tp, &commit_lsn, flags);
+	if (mp->m_flags & XFS_MOUNT_DELAYLOG)
+		error = xfs_trans_commit_cil(mp, tp, &commit_lsn, flags);
+	else
+		error = xfs_trans_commit_iclog(mp, tp, &commit_lsn, flags);
+
 	if (error == ENOMEM) {
 		xfs_force_shutdown(mp, SHUTDOWN_LOG_IO_ERROR);
 		error = XFS_ERROR(EIO);
@@ -1242,7 +1372,7 @@ out_unreserve:
 		error = XFS_ERROR(EIO);
 	}
 	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
-	xfs_trans_free_items(tp, error ? XFS_TRANS_ABORT : 0);
+	xfs_trans_free_items(tp, NULLCOMMITLSN, error ?
XFS_TRANS_ABORT : 0);
 	xfs_trans_free(tp);
 
 	XFS_STATS_INC(xs_trans_empty);
@@ -1320,7 +1450,7 @@ xfs_trans_cancel(
 	/* mark this thread as no longer being in a transaction */
 	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
 
-	xfs_trans_free_items(tp, flags);
+	xfs_trans_free_items(tp, NULLCOMMITLSN, flags);
 	xfs_trans_free(tp);
 }
 
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index ff7e9e6..b1ea20c 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -106,7 +106,8 @@ typedef struct xfs_trans_header {
 #define	XFS_TRANS_GROWFSRT_FREE		39
 #define	XFS_TRANS_SWAPEXT		40
 #define	XFS_TRANS_SB_COUNT		41
-#define	XFS_TRANS_TYPE_MAX		41
+#define	XFS_TRANS_CHECKPOINT		42
+#define	XFS_TRANS_TYPE_MAX		42
 /* new transaction types need to be reflected in xfs_logprint(8) */
 
 #define XFS_TRANS_TYPES \
@@ -148,6 +149,7 @@ typedef struct xfs_trans_header {
 	{ XFS_TRANS_GROWFSRT_FREE,	"GROWFSRT_FREE" }, \
 	{ XFS_TRANS_SWAPEXT,		"SWAPEXT" }, \
 	{ XFS_TRANS_SB_COUNT,		"SB_COUNT" }, \
+	{ XFS_TRANS_CHECKPOINT,		"CHECKPOINT" }, \
 	{ XFS_TRANS_DUMMY1,		"DUMMY1" }, \
 	{ XFS_TRANS_DUMMY2,		"DUMMY2" }, \
 	{ XLOG_UNMOUNT_REC_TYPE,	"UNMOUNT" }
@@ -829,6 +831,10 @@ typedef struct xfs_log_item {
 					/* buffer item iodone */
 					/* callback func */
 	struct xfs_item_ops		*li_ops;	/* function list */
+
+	/* delayed logging */
+	struct list_head		li_cil;		/* CIL pointers */
+	struct xfs_log_vec		*li_lv;		/* active log vector */
 } xfs_log_item_t;
 
 #define	XFS_LI_IN_AIL	0x1
diff --git a/fs/xfs/xfs_trans_item.c b/fs/xfs/xfs_trans_item.c
index 2937a1e..f11d37d 100644
--- a/fs/xfs/xfs_trans_item.c
+++ b/fs/xfs/xfs_trans_item.c
@@ -299,6 +299,7 @@ xfs_trans_next_item(xfs_trans_t *tp, xfs_log_item_desc_t *lidp)
 void
 xfs_trans_free_items(
 	xfs_trans_t	*tp,
+	xfs_lsn_t	commit_lsn,
 	int		flags)
 {
 	xfs_log_item_chunk_t	*licp;
@@ -311,7 +312,7 @@ xfs_trans_free_items(
 	 * Special case the embedded chunk so we don't free it below.
 	 */
 	if (!xfs_lic_are_all_free(licp)) {
-		(void) xfs_trans_unlock_chunk(licp, 1, abort, NULLCOMMITLSN);
+		(void) xfs_trans_unlock_chunk(licp, 1, abort, commit_lsn);
 		xfs_lic_all_free(licp);
 		licp->lic_unused = 0;
 	}
@@ -322,7 +323,7 @@ xfs_trans_free_items(
 	 */
 	while (licp != NULL) {
 		ASSERT(!xfs_lic_are_all_free(licp));
-		(void) xfs_trans_unlock_chunk(licp, 1, abort, NULLCOMMITLSN);
+		(void) xfs_trans_unlock_chunk(licp, 1, abort, commit_lsn);
 		next_licp = licp->lic_next;
 		kmem_free(licp);
 		licp = next_licp;
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 901dc0f..330fa5f 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -35,9 +35,15 @@ struct xfs_log_item_desc *xfs_trans_find_item(struct xfs_trans *,
 struct xfs_log_item_desc	*xfs_trans_first_item(struct xfs_trans *);
 struct xfs_log_item_desc	*xfs_trans_next_item(struct xfs_trans *,
 					struct xfs_log_item_desc *);
-void				xfs_trans_free_items(struct xfs_trans *, int);
-void				xfs_trans_unlock_items(struct xfs_trans *,
-							xfs_lsn_t);
+
+void	xfs_trans_unlock_items(struct xfs_trans *tp, xfs_lsn_t commit_lsn);
+void	xfs_trans_free_items(struct xfs_trans *tp, xfs_lsn_t commit_lsn,
+				int flags);
+
+void	xfs_trans_item_committed(struct xfs_log_item *lip,
+				xfs_lsn_t commit_lsn, int aborted);
+void	xfs_trans_unreserve_and_mod_sb(struct xfs_trans *tp);
+void	xfs_trans_free(struct xfs_trans *tp);
 
 /*
  * AIL traversal cursor.
-- 
1.5.6.5

From kb@sysmikro.com.pl Fri May 7 03:21:10 2010
From: Krzysztof Błaszkowski
Organization: Systemy mikroprocesorowe
To: xfs@oss.sgi.com
Subject: posix_fallocate
Date: Fri, 7 May 2010 10:22:37 +0200
Message-Id: <201005071022.37863.kb@sysmikro.com.pl>

Hello,

I use this to preallocate large space but found an issue. posix_fallocate
works right with sizes like 100G, 1T and even 10T on some boxes (on some
others it can fail after e.g. a 7T threshold), but if I try e.g. 16T the
user space process is "R"unning forever and is not interruptible.
Furthermore, some other unrelated processes like sshd and bash enter D
state. There is nothing in the kernel log.

So far I have made a few logs with the ftrace facility for 1G, 100G, 1T
and 10T sizes. I noticed that for the first three sizes the log is about
1.5M long (2M peak), while 10T generates a 94M long log. I couldn't
retrieve a log for the 17T case because "cat /sys ... /trace" enters D.

I would appreciate any help because I gave up on the ftrace log analysis.
xfs_vn_fallocate is covered in about 11k lines in the 1.5M log case, while
there are about 163k lines in the 94M log. All I could see is possibly
some relationship between the time spent in xfs_vn_fallocate subfunctions
and the requested space.

Box details:
16 Hitachi 2TB drives (backplane connected), dm, 1 LVM LUN of 25T size,
kernel 2.6.31.5; neither more recent kernels nor a newer XFS were tested.
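
For reference, the kind of call described above can be sketched as follows.
This is only a minimal sketch under assumptions: the helper name
preallocate() is hypothetical and not taken from the original test program,
the range is assumed to start at offset 0, and the fstat() check is an added
sanity test rather than part of the report.

```c
/* Feature-test macros must come before any #include. */
#define _POSIX_C_SOURCE 200809L
#define _FILE_OFFSET_BITS 64	/* ask for a 64-bit off_t on 32-bit userspace */

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not from the original report): preallocate `len`
 * bytes for `fd` starting at offset 0, then verify the resulting size.
 * Returns 0 on success or a positive errno-style value on failure.
 */
int preallocate(int fd, uint64_t len)
{
	struct stat st;
	int err;

	/* posix_fallocate() itself rejects a zero length with EINVAL. */
	if (len == 0)
		return EINVAL;

	/* Reject sizes that cannot be represented in a signed 64-bit off_t. */
	if (len > (uint64_t)INT64_MAX)
		return EFBIG;

	/* posix_fallocate() returns the error number directly, not via errno. */
	err = posix_fallocate(fd, 0, (off_t)len);
	if (err != 0)
		return err;

	if (fstat(fd, &st) != 0)
		return errno;

	/* After a successful call the file must be at least `len` bytes long. */
	return (uint64_t)st.st_size >= len ? 0 : EIO;
}
```

With such a helper, the 16T case would correspond to something like
preallocate(fd, 16ULL << 40) on a file opened on the 25T filesystem.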

Regards,
Krzysztof Blaszkowski

From stan@hardwarefreak.com Fri May 7 04:19:12 2010
Message-ID: <4BE3DC2D.3000607@hardwarefreak.com>
Date: Fri, 07 May 2010 04:23:57 -0500
From: Stan Hoeppner
To: xfs@oss.sgi.com
Subject: Re: posix_fallocate
References: <201005071022.37863.kb@sysmikro.com.pl>
In-Reply-To: <201005071022.37863.kb@sysmikro.com.pl>

Krzysztof Błaszkowski put forth on 5/7/2010 3:22 AM:
> Hello,
>
> I use this to preallocate large space but found an issue. posix_fallocate
> works right with sizes like 100G, 1T and even 10T on some boxes (on some
> others it can fail after e.g. a 7T threshold), but if I try e.g. 16T the
> user space process is "R"unning forever and is not interruptible.
> Furthermore, some other unrelated processes like sshd and bash enter D
> state. There is nothing in the kernel log.
>
> So far I have made a few logs with the ftrace facility for 1G, 100G, 1T
> and 10T sizes. I noticed that for the first three sizes the log is about
> 1.5M long (2M peak), while 10T generates a 94M long log. I couldn't
> retrieve a log for the 17T case because "cat /sys ... /trace" enters D.
>
> I would appreciate any help because I gave up on the ftrace log analysis.
> xfs_vn_fallocate is covered in about 11k lines in the 1.5M log case, while
> there are about 163k lines in the 94M log. All I could see is possibly
> some relationship between the time spent in xfs_vn_fallocate subfunctions
> and the requested space.
>
> Box details:
> 16 Hitachi 2TB drives (backplane connected), dm, 1 LVM LUN of 25T size,
> kernel 2.6.31.5; neither more recent kernels nor a newer XFS were tested.

32 or 64 bit kernel? What is the size of the XFS filesystem on the 25TB
LVM LUN against which you're running posix_fallocate? The reason I ask is
that XFS has a 16TB per filesystem limitation on 32 bit kernels. I can
only assume that your XFS filesystem is larger than 16TB since you're
attempting to posix_fallocate 16TB.
But, it's best to ask for confirmation rather than assume, especially given
that your problem is appearing near that magical 16TB boundary.

-- 
Stan

From kb@sysmikro.com.pl Fri May 7 04:46:31 2010
From: Krzysztof Błaszkowski
Organization: Systemy mikroprocesorowe
To: xfs@oss.sgi.com, Stan Hoeppner
Subject: Re: posix_fallocate
Date: Fri, 7 May 2010 11:48:02 +0200
References: <201005071022.37863.kb@sysmikro.com.pl> <4BE3DC2D.3000607@hardwarefreak.com>
In-Reply-To: <4BE3DC2D.3000607@hardwarefreak.com>
Message-Id: <201005071148.03012.kb@sysmikro.com.pl>

On Friday 07 May 2010 11:23, Stan Hoeppner wrote:
> Krzysztof Błaszkowski put forth on 5/7/2010 3:22 AM:
> > Hello,
> >
> > I use this to preallocate large space but found an issue.
> > posix_fallocate works right with sizes like 100G, 1T and even 10T on
> > some boxes (on some others it can fail after e.g. a 7T threshold), but
> > if I try e.g. 16T the user space process is "R"unning forever and is
> > not interruptible. Furthermore, some other unrelated processes like
> > sshd and bash enter D state. There is nothing in the kernel log.
> >
> > So far I have made a few logs with the ftrace facility for 1G, 100G, 1T
> > and 10T sizes. I noticed that for the first three sizes the log is
> > about 1.5M long (2M peak), while 10T generates a 94M long log. I
> > couldn't retrieve a log for the 17T case because
> > "cat /sys ... /trace" enters D.
> >
> > I would appreciate any help because I gave up on the ftrace log
> > analysis. xfs_vn_fallocate is covered in about 11k lines in the 1.5M
> > log case, while there are about 163k lines in the 94M log. All I could
> > see is possibly some relationship between the time spent in
> > xfs_vn_fallocate subfunctions and the requested space.
> >
> > Box details:
> > 16 Hitachi 2TB drives (backplane connected), dm, 1 LVM LUN of 25T size,
> > kernel 2.6.31.5; neither more recent kernels nor a newer XFS were
> > tested.
>
> 32 or 64 bit kernel?

Sorry, I meant 64 bit.

> What is the size of the XFS filesystem on the 25TB
> LVM LUN against which you're running posix_fallocate?

XFS occupies the whole LUN (i.e. 25TB).

> The reason I ask is
> that XFS has a 16TB per filesystem limitation on 32 bit kernels. I can
> only assume that your XFS filesystem is larger than 16TB since you're
> attempting to posix_fallocate 16TB. But, it's best to ask for confirmation
> rather than assume, especially given that your problem is appearing near
> that magical 16TB boundary.

Sure, I see. I use 64 bit by default.

Regards,
Krzysztof

From kb@sysmikro.com.pl Fri May 7 05:05:33 2010
From: Krzysztof Błaszkowski
Organization: Systemy mikroprocesorowe
To: xfs@oss.sgi.com
Cc: Stan Hoeppner
Subject: Re: posix_fallocate
Date: Fri, 7 May 2010 12:07:07 +0200
References: <201005071022.37863.kb@sysmikro.com.pl> <4BE3DC2D.3000607@hardwarefreak.com>
In-Reply-To: <4BE3DC2D.3000607@hardwarefreak.com>
Message-Id: <201005071207.07745.kb@sysmikro.com.pl>

Hello Stan,

It seems that your mail server blocks traffic from ".pl" domains, so don't
be offended if you do not see my reply sent to your address.

(Remote host said: 550 5.7.1 : Client
host rejected: We do not accept mail from .pl domains)

Krzysztof

On Friday 07 May 2010 11:23, Stan Hoeppner wrote:
> Krzysztof Błaszkowski put forth on 5/7/2010 3:22 AM:
> > Hello,
> >
> > I use this to preallocate large space but found an issue.
> > posix_fallocate works right with sizes like 100G, 1T and even 10T on
> > some boxes (on some others it can fail after e.g. a 7T threshold), but
> > if I try e.g. 16T the user space process is "R"unning forever and is
> > not interruptible. Furthermore, some other unrelated processes like
> > sshd and bash enter D state. There is nothing in the kernel log.
> >
> > So far I have made a few logs with the ftrace facility for 1G, 100G, 1T
> > and 10T sizes. I noticed that for the first three sizes the log is
> > about 1.5M long (2M peak), while 10T generates a 94M long log. I
> > couldn't retrieve a log for the 17T case because
> > "cat /sys ... /trace" enters D.
> >
> > I would appreciate any help because I gave up on the ftrace log analysis.
> > The xfs_vn_fallocate is covered in abt 11k lines for a 1.5M log case > > while there are abt 163k lines in 94M log. And all i could see is poss > > some relationship between time spent in xfs_vn_fallocate subfunctions vs > > requested space. > > > > Box details: > > 16 Hitachi 2TB drives (backplane connected), dm, 1 lvm lun of 25T size, > > kernel 2.6.31.5, more recent kernels neither xfs were not tested. > > 32 or 64 bit kernel? What is the size of the XFS filesystem on the 25TB > LVM LUN against which you're running posix_fallocate? The reason I ask is > that XFS has a 16TB per filesystem limitation on 32 bit kernels. I can > only assume that your XFS filesystem is larger than 16TB since you're > attempting to posix_fallocate 16TB. But, it's best to ask for confirmati= on > rather than assume, especially given that your problem is appearing near > that magical 16TB boundary. From stan@hardwarefreak.com Fri May 7 05:35:50 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-0.0 required=5.0 tests=BAYES_20 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o47AZoqF158870 for ; Fri, 7 May 2010 05:35:50 -0500 X-ASG-Debug-ID: 1273228678-4f4a02bf0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from greer.hardwarefreak.com (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id F030A31955D for ; Fri, 7 May 2010 03:37:58 -0700 (PDT) Received: from greer.hardwarefreak.com (mo-65-41-216-221.sta.embarqhsd.net [65.41.216.221]) by cuda.sgi.com with ESMTP id cDOP9pPgNLLxgXee for ; Fri, 07 May 2010 03:37:58 -0700 (PDT) Received: from [192.168.100.53] (gffx.hardwarefreak.com [192.168.100.53]) by greer.hardwarefreak.com (Postfix) with ESMTP id 0F70C6C054 for ; Fri, 7 May 2010 05:37:58 -0500 (CDT) Message-ID: <4BE3EE86.5090103@hardwarefreak.com> Date: Fri, 07 
May 2010 05:42:14 -0500 From: Stan Hoeppner User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.9.1.9) Gecko/20100317 Thunderbird/3.0.4 MIME-Version: 1.0 To: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: posix_fallocate Subject: Re: posix_fallocate References: <201005071022.37863.kb@sysmikro.com.pl> <4BE3DC2D.3000607@hardwarefreak.com> <201005071207.07745.kb@sysmikro.com.pl> In-Reply-To: <201005071207.07745.kb@sysmikro.com.pl> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Barracuda-Connect: mo-65-41-216-221.sta.embarqhsd.net[65.41.216.221] X-Barracuda-Start-Time: 1273228680 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0523 1.0000 -1.6855 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.09 X-Barracuda-Spam-Status: No, SCORE=-1.09 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=BSF_SC5_MJ1963, RDNS_DYNAMIC X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29267 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.10 RDNS_DYNAMIC Delivered to trusted network by host with dynamic-looking rDNS 0.50 BSF_SC5_MJ1963 Custom Rule MJ1963 X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Krzysztof Błaszkowski put forth on 5/7/2010 5:07 AM: > Hello Stan, > > It seems that your mail server blocks traffic from ".pl" domains so don't be > offended if you will not see my reply sent to your address. > > (Remote host said: 550 5.7.1 : Client > host rejected: We do not accept mail from .pl domains) Sorry about that. Due to the constant battle against spam I've implemented some pretty draconian countermeasures over time, including some ccTLD blocking of SMTP. All my "overseas" contacts are via public mailing lists, as in this case. Very rarely does a conversation need to go "off list", so I've not had much of a problem with this setup. 
If you'd like I can whitelist your address. -- Stan From kb@sysmikro.com.pl Fri May 7 05:54:34 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.6 required=5.0 tests=BAYES_00,MIME_8BIT_HEADER autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o47AsTK8159292 for ; Fri, 7 May 2010 05:54:34 -0500 X-ASG-Debug-ID: 1273229799-5eea01840000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from v007470.home.net.pl (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with SMTP id 18E621493898 for ; Fri, 7 May 2010 03:56:39 -0700 (PDT) Received: from v007470.home.net.pl (v007470.home.net.pl [212.85.125.104]) by cuda.sgi.com with SMTP id N6jXjGAs1BbFMdX1 for ; Fri, 07 May 2010 03:56:39 -0700 (PDT) Received: from 217.8.165.66 [217.8.165.66] (HELO linux2g2g.site) by sysmikro.home.pl [212.85.125.104] with SMTP (IdeaSmtpServer v0.70) id 64767627cdb82ba2; Fri, 7 May 2010 12:56:35 +0200 From: Krzysztof =?utf-8?q?B=C5=82aszkowski?= Organization: Systemy mikroprocesorowe To: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: posix_fallocate Subject: Re: posix_fallocate Date: Fri, 7 May 2010 12:56:00 +0200 User-Agent: KMail/1.9.5 Cc: Stan Hoeppner References: <201005071022.37863.kb@sysmikro.com.pl> <201005071207.07745.kb@sysmikro.com.pl> <4BE3EE86.5090103@hardwarefreak.com> In-Reply-To: <4BE3EE86.5090103@hardwarefreak.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Message-Id: <201005071256.00744.kb@sysmikro.com.pl> X-Barracuda-Connect: v007470.home.net.pl[212.85.125.104] X-Barracuda-Start-Time: 1273229801 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores 
of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29268 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Friday 07 May 2010 12:42, Stan Hoeppner wrote: > Krzysztof Błaszkowski put forth on 5/7/2010 5:07 AM: > > Hello Stan, > > > > It seems that your mail server blocks traffic from ".pl" domains so don't > > be offended if you will not see my reply sent to your address. > > > > (Remote host said: 550 5.7.1 : > > Client host rejected: We do not accept mail from .pl domains) > > Sorry about that. Due to the constant battle against spam I've implemented > some pretty draconian countermeasures over time, including some ccTLD > blocking of SMTP. All my "overseas" contacts are via public mailing lists, > as in this case. Very rarely does a conversation need to go "off list", so > I've not had much of a problem with this setup. If you'd like I can > whitelist your address. you are welcome. if it is not a big hassle then go ahead. 
thanks, Krzysztof From polmail@gmail.com Fri May 7 06:34:11 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: *** X-Spam-Status: No, score=3.8 required=5.0 tests=BAYES_50,FREEMAIL_FROM, HTML_MESSAGE,T_DKIM_INVALID,T_TO_NO_BRKTS_FREEMAIL autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o47BYBGX160463 for ; Fri, 7 May 2010 06:34:11 -0500 X-ASG-Debug-ID: 1273232180-020603c20000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail-wy0-f181.google.com (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 9E1591DE64DA for ; Fri, 7 May 2010 04:36:20 -0700 (PDT) Received: from mail-wy0-f181.google.com (mail-wy0-f181.google.com [74.125.82.181]) by cuda.sgi.com with ESMTP id mwekYfDXDFXYpsrC for ; Fri, 07 May 2010 04:36:20 -0700 (PDT) Received: by wyb36 with SMTP id 36so556176wyb.26 for ; Fri, 07 May 2010 04:36:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from :user-agent:mime-version:to:subject:content-type; bh=BIOGeVQXRP6c0PmGjx86zYRjnQk+ZTAeGzHJeUnDU2k=; b=eyfNcFgPz2yP95toDH/wrTU2BeqVfdYrJDcKNV97zr5cA0UTNjvNhkJjGm7x8+CDsv 4plVK1YUQWzLWZuBPWXhVqyc74QHPGHyVo8e4eG6m12QfzkN+JywTY1uLRVX8Sonyagr FP2jHYBsuDKM7o8W0ruPVn+RrBBgucmx90KO0= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:user-agent:mime-version:to:subject :content-type; b=AUrTWjWKx4mck0Q5S0S2A2mHmL6kTY2PXwQiEmc506vLsUQk/qyw8y4Mg63iYiRN25 R6IimJgBvVsu4FYX5dQpIqeUZKt8xhYYWbY4ijZRKNFHl0PfPzlT5QVV1jIMoCFh4BdX TgL6Qy/6eo3RavBwaVzOiEkuab5Nc+8wST3No= Received: by 10.216.168.135 with SMTP id k7mr3710013wel.129.1273232179692; Fri, 07 May 2010 04:36:19 -0700 (PDT) Received: from [192.168.1.101] (25.Red-80-59-137.staticIP.rima-tde.net [80.59.137.25]) by mx.google.com with ESMTPS id 
l46sm928299wed.22.2010.05.07.04.36.18 (version=SSLv3 cipher=RC4-MD5); Fri, 07 May 2010 04:36:19 -0700 (PDT) Message-ID: <4BE3FB32.8030709@gmail.com> Date: Fri, 07 May 2010 13:36:18 +0200 From: Pol User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.9) Gecko/20100423 Thunderbird/3.0.4 MIME-Version: 1.0 To: xfs@oss.sgi.com X-ASG-Orig-Subj: Please Help Subject: Please Help Content-Type: multipart/alternative; boundary="------------090901010707050403020108" X-Barracuda-Connect: mail-wy0-f181.google.com[74.125.82.181] X-Barracuda-Start-Time: 1273232181 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=DKIM_SIGNED, DKIM_VERIFIED, HTML_MESSAGE X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29270 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- -0.00 DKIM_VERIFIED Domain Keys Identified Mail: signature passes verification 0.00 DKIM_SIGNED Domain Keys Identified Mail: message has a signature 0.00 HTML_MESSAGE BODY: HTML included in message X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean This is a multi-part message in MIME format. --------------090901010707050403020108 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Good morning. I'm writing from Barcelona and English is not my born language, so I'd like to apologize in advance for any possible mistakes in my text. I'm a Windows user who has recently moved to Linux (Ubuntu 10.04), and I have a serious problem regarding my Hard Drives' File System. I have a desktop version of Ubuntu and I'm a complete regular user. I have two physical drives in my system: 1. 
36GB: EXT4 partition for /, another EXT4 for /home and a SWAP one. 2. 1TB (for data) drive. I generate a lot of video and music data per month (AVI, MKV, MP3, WAV) because of my job, and need to copy it to external hard drives to ensure I don't lose any of it. My question is about the FS to use in these data drives. I currently have all of them formatted with the XFS file system. Every file I generate is saved in my internal XFS drive, and whenever the hd is almost full I copy the important files to External Hard Drives which are also formatted as XFS. My problem comes after reading a couple of posts from 2006 in some forums on the web. They said that XFS is very insecure when a power failure happens and recommended EXT3 (EXT4 these days I guess). They said that after a power failure it's very common to see data loss (something that never happened to me in all my years using NTFS). As far as I know XFS is much more secure than NTFS so I don't really understand this issue. I assume these people were talking about systems which need to be continuously writing to the disk, but my knowledge about this is very limited. Did I choose the correct FS for my drives? Thank you very much for your time. --------------090901010707050403020108 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Good morning.
--------------090901010707050403020108-- From BATV+6a614bba7c2810bad1d9+2448+infradead.org+hch@bombadil.srs.infradead.org Fri May 7 06:39:24 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: * X-Spam-Status: No, score=1.1 required=5.0 tests=BAYES_00,SUBJ_TICKET autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o47BdMnu160578 for ; Fri, 7 May 2010 06:39:24 -0500 X-ASG-Debug-ID: 1273232492-020503df0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 1ED961DE652A for ; Fri, 7 May 2010 04:41:33 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id VuaqVjUV5deqSdLU for ; Fri, 07 May 2010 04:41:33 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1OALvu-0006hW-Px; Fri, 07 May 2010 11:41:30 +0000 Date: Fri, 7 May 2010 07:41:30 -0400 From: Christoph Hellwig To: Dave Chinner Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH 06/12] xfs: make the log ticket ID available outside the log infrastructure Subject: Re: [PATCH 06/12] xfs: make the log ticket ID available outside the log infrastructure Message-ID: <20100507114130.GA25474@infradead.org> References: <1273210860-23414-1-git-send-email-david@fromorbit.com> <1273210860-23414-7-git-send-email-david@fromorbit.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1273210860-23414-7-git-send-email-david@fromorbit.com> User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1273232493 
X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Fri, May 07, 2010 at 03:40:54PM +1000, Dave Chinner wrote: > From: Dave Chinner > > The ticket ID is needed to uniquely identify transactions when doing busy > extent matching. Delayed logging changes the lifecycle of busy extents with > respect to the transaction structure lifecycle. Hence we can no longer use > the transaction structure as a means of determining the owner of the busy > extent as it may be freed and reused while the busy extent is still active. > > This commit provides the infrastructure to access the xlog_tid_t held in the > ticket from a transaction handle. This avoids the need for callers to peek > into the transaction and log structures to find this out. Not happy about exposing the tid, but given that we need it for now: Reviewed-by: Christoph Hellwig From sandeen@sandeen.net Fri May 7 11:24:19 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.6 required=5.0 tests=BAYES_00,MIME_8BIT_HEADER autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o47GOJkY167996 for ; Fri, 7 May 2010 11:24:19 -0500 X-ASG-Debug-ID: 1273249589-0ddd01390000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.sandeen.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 5A963137E390 for ; Fri, 7 May 2010 09:26:29 -0700 (PDT) Received: from mail.sandeen.net (64-131-60-146.usfamily.net [64.131.60.146]) by cuda.sgi.com with ESMTP id YM75OmmeHcfO46OK for ; Fri, 07 May 2010 09:26:29 -0700 (PDT) Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.sandeen.net (Postfix) 
with ESMTP id 4F90D944B89; Fri, 7 May 2010 11:26:28 -0500 (CDT) Message-ID: <4BE43F34.40309@sandeen.net> Date: Fri, 07 May 2010 11:26:28 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.24 (Macintosh/20100228) MIME-Version: 1.0 To: =?UTF-8?B?S3J6eXN6dG9mIELFgmFzemtvd3NraQ==?= CC: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: posix_fallocate Subject: Re: posix_fallocate References: <201005071022.37863.kb@sysmikro.com.pl> In-Reply-To: <201005071022.37863.kb@sysmikro.com.pl> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Barracuda-Connect: 64-131-60-146.usfamily.net[64.131.60.146] X-Barracuda-Start-Time: 1273249590 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.92 X-Barracuda-Spam-Status: No, SCORE=-1.92 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=RDNS_DYNAMIC X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29288 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.10 RDNS_DYNAMIC Delivered to trusted network by host with dynamic-looking rDNS X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Krzysztof Błaszkowski wrote: > Hello, > > I use this to preallocate large space but found an issue. Posix_fallocate > works right with sizes like 100G, 1T and even 10T on some boxes (on some > other can fail after e.g. 7T threshold) but if i tried e.g. 16T the user > space process would be "R"unning forever and it is not interruptible. > Furthermore some other not related processes like sshd, bash enter D state. > There is nothing in kernel log. > > I made so far a few logs with ftrace facility for 1G, 100G, 1T and 10T sizes. > I noticed that for 1st three sizes the log is as long as abt 1.5M (2M peak) > while 10T generates 94M long log. 
I couldn't retrieve a log for 17T case > because "cat /sys ... /trace" enters D. > > I would appreciate any help because i gave up with ftrace logs analysis. The > xfs_vn_fallocate is covered in abt 11k lines for a 1.5M log case while there > are abt 163k lines in 94M log. And all i could see is poss some relationship > between time spent in xfs_vn_fallocate subfunctions vs requested space. > > Box details: > 16 Hitachi 2TB drives (backplane connected), dm, 1 lvm lun of 25T size, > kernel 2.6.31.5, more recent kernels neither xfs were not tested. It'd be great if you could test a more recent kernel. sysrq-t would give us all the backtraces, except I suppose not for the running process... I can try to scrape up >16T to test on at some point ... -Eric From sandeen@sandeen.net Fri May 7 11:51:18 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.6 required=5.0 tests=BAYES_00,MIME_8BIT_HEADER autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o47GpHOc168600 for ; Fri, 7 May 2010 11:51:17 -0500 X-ASG-Debug-ID: 1273251207-09c1024a0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.sandeen.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 1B03C1DE85D5 for ; Fri, 7 May 2010 09:53:27 -0700 (PDT) Received: from mail.sandeen.net (64-131-60-146.usfamily.net [64.131.60.146]) by cuda.sgi.com with ESMTP id PzflvdTHUEIJz1qj for ; Fri, 07 May 2010 09:53:27 -0700 (PDT) Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.sandeen.net (Postfix) with ESMTP id 682A7A745FD; Fri, 7 May 2010 11:53:27 -0500 (CDT) Message-ID: <4BE44587.6090603@sandeen.net> Date: Fri, 07 May 2010 11:53:27 -0500 From: Eric Sandeen User-Agent: Thunderbird 
2.0.0.24 (Macintosh/20100228) MIME-Version: 1.0 To: =?UTF-8?B?S3J6eXN6dG9mIELFgmFzemtvd3NraQ==?= CC: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: posix_fallocate Subject: Re: posix_fallocate References: <201005071022.37863.kb@sysmikro.com.pl> <4BE43F34.40309@sandeen.net> In-Reply-To: <4BE43F34.40309@sandeen.net> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Barracuda-Connect: 64-131-60-146.usfamily.net[64.131.60.146] X-Barracuda-Start-Time: 1273251208 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.92 X-Barracuda-Spam-Status: No, SCORE=-1.92 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=RDNS_DYNAMIC X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29290 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.10 RDNS_DYNAMIC Delivered to trusted network by host with dynamic-looking rDNS X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Eric Sandeen wrote: > Krzysztof Błaszkowski wrote: >> Hello, >> >> I use this to preallocate large space but found an issue. Posix_fallocate >> works right with sizes like 100G, 1T and even 10T on some boxes (on some >> other can fail after e.g. 7T threshold) but if i tried e.g. 16T the user >> space process would be "R"unning forever and it is not interruptible. >> Furthermore some other not related processes like sshd, bash enter D state. >> There is nothing in kernel log. Oh, one thing you should know is that depending on your version of glibc, posix_fallocate may be writing 0s and not using preallocation calls. Do you know which yours is using? strace should tell you on a small file test. Anyway, I am seeing things get stuck around 8T it seems... # touch /mnt/test/bigfile # xfs_io -c "resvsp 0 16t" /mnt/test/bigfile ... wait ... 
in other window ... # du -hc /mnt/test/bigfile 8.0G /mnt/test/bigfile 8.0G total # echo t > /proc/sysrq-trigger # dmesg | grep -A20 xfs_io xfs_io R running task 3576 29444 29362 0x00000006 ffff8809cfbb4920 ffffffff81478d9f ffffffffa032d3c5 0000000000000246 ffff8809cfbb4920 ffffffff814788bc 0000000000000000 ffffffff81ba3510 ffff8809d3429a68 ffffffffa032b60f ffff8809d3429aa8 000000000000001e Call Trace: [] ? __mutex_lock_common+0x36d/0x392 [] ? xfs_icsb_modify_counters+0x17f/0x1ac [xfs] [] ? xfs_icsb_unlock_all_counters+0x4d/0x60 [xfs] [] ? xfs_icsb_disable_counter+0x8c/0x95 [xfs] [] ? mutex_lock_nested+0x3e/0x43 [] ? xfs_icsb_modify_counters+0x18d/0x1ac [xfs] [] ? xfs_mod_incore_sb+0x29/0x6e [xfs] [] ? _xfs_trans_alloc+0x27/0x61 [xfs] [] ? xfs_trans_reserve+0x6c/0x19e [xfs] [] ? up_write+0x2b/0x32 [] ? xfs_alloc_file_space+0x163/0x306 [xfs] [] ? sched_clock_cpu+0xc3/0xce [] ? xfs_change_file_space+0x12a/0x2b8 [xfs] [] ? down_write_nested+0x80/0x8b [] ? xfs_ilock+0x30/0xb4 [xfs] [] ? xfs_vn_fallocate+0x80/0xf4 [xfs] -- R xfs_io 29444 86014624.786617 162 120 86014624.786617 137655.161327 408.979977 / # uname -r 2.6.34-0.4.rc0.git2.fc14.x86_64 I'll look into it. 
-Eric From aelder@sgi.com Fri May 7 14:50:52 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.8 required=5.0 tests=BAYES_00,J_CHICKENPOX_43 autolearn=no version=3.4.0-r929098 Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o47JopUm172499 for ; Fri, 7 May 2010 14:50:52 -0500 Received: from cf--amer001e--3.americas.sgi.com (cf--amer001e--3.americas.sgi.com [137.38.100.5]) by relay1.corp.sgi.com (Postfix) with ESMTP id 35FAA8F80AD for ; Fri, 7 May 2010 12:53:00 -0700 (PDT) Received: from [128.162.232.185] ([128.162.232.185]) by cf--amer001e--3.americas.sgi.com with Microsoft SMTPSVC(6.0.3790.3959); Fri, 7 May 2010 14:52:14 -0500 Subject: [ANNOUNCE] xfsprogs v3.1.2 From: Alex Elder Reply-To: aelder@sgi.com To: XFS Mailing List Content-Type: text/plain; charset="UTF-8" Date: Fri, 07 May 2010 14:52:14 -0500 Message-ID: <1273261934.2971.30.camel@doink> Mime-Version: 1.0 X-Mailer: Evolution 2.28.1 Content-Transfer-Encoding: 7bit X-OriginalArrivalTime: 07 May 2010 19:52:14.0865 (UTC) FILETIME=[CA94A810:01CAEE1E] X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Version 3.1.2 of xfsprogs has been released. A gzipped-tar archive of the source code is available here: ftp://oss.sgi.com/projects/xfs/cmd_tars/xfsprogs-3.1.2.tar.gz The source code can be accessed via git using this URL: git://oss.sgi.com/xfs/cmds/xfsprogs.git Below is a summary (from the doc/CHANGES file) of changes since release 3.1.1: xfsprogs-3.1.2 (6 May 2010) - Fix missing thread synchronization in xfs_repair duplicate extent tracking. - Fix handling of dynamic attribute fork roots in xfs_fsr. - Fix sb_bad_features2 manipulations when tweaking the lazy count flag. - Add support for building on Debian GNU/kFreeBSD, thanks to Petr Salinger. 
- Improvements to the mkfs.xfs manpage, thanks to Wengang Wang. - Various small blkid integration fixes in mkfs.xfs. - Fix build against stricter system headers. From SRS0+mjXB+66+fromorbit.com=david@internode.on.net Fri May 7 17:14:46 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o47MEknu174811 for ; Fri, 7 May 2010 17:14:46 -0500 X-ASG-Debug-ID: 1273270614-694a03540000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 21A5231C05D for ; Fri, 7 May 2010 15:16:54 -0700 (PDT) Received: from mail.internode.on.net (bld-mail12.adl6.internode.on.net [150.101.137.97]) by cuda.sgi.com with ESMTP id 69uyySI7mBvSoyaC for ; Fri, 07 May 2010 15:16:54 -0700 (PDT) Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23517764-1927428 for multiple; Sat, 08 May 2010 07:46:53 +0930 (CST) Received: from dave by dastard with local (Exim 4.71) (envelope-from ) id 1OAVqm-00080G-7e; Sat, 08 May 2010 08:16:52 +1000 Date: Sat, 8 May 2010 08:16:52 +1000 From: Dave Chinner To: Eric Sandeen Cc: Krzysztof =?utf-8?Q?B=C5=82aszkowski?= , xfs@oss.sgi.com X-ASG-Orig-Subj: Re: posix_fallocate Subject: Re: posix_fallocate Message-ID: <20100507221652.GB25419@dastard> References: <201005071022.37863.kb@sysmikro.com.pl> <4BE43F34.40309@sandeen.net> <4BE44587.6090603@sandeen.net> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <4BE44587.6090603@sandeen.net> User-Agent: Mutt/1.5.20 (2009-06-14) X-Barracuda-Connect: bld-mail12.adl6.internode.on.net[150.101.137.97] X-Barracuda-Start-Time: 
1273270616 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.52 X-Barracuda-Spam-Status: No, SCORE=-1.52 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=BSF_RULE7568M X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29306 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.50 BSF_RULE7568M Custom Rule 7568M X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Fri, May 07, 2010 at 11:53:27AM -0500, Eric Sandeen wrote: > Eric Sandeen wrote: > > Krzysztof Błaszkowski wrote: > >> Hello, > >> > >> I use this to preallocate large space but found an issue. Posix_fallocate > >> works right with sizes like 100G, 1T and even 10T on some boxes (on some > >> other can fail after e.g. 7T threshold) but if i tried e.g. 16T the user > >> space process would be "R"unning forever and it is not interruptible. > >> Furthermore some other not related processes like sshd, bash enter D state. > >> There is nothing in kernel log. > > Oh, one thing you should know is that depending on your version of glibc, > posix_fallocate may be writing 0s and not using preallocation calls. > > Do you know which yours is using? strace should tell you on a small > file test. > > Anyway, I am seeing things get stuck around 8T it seems... > > # touch /mnt/test/bigfile > # xfs_io -c "resvsp 0 16t" /mnt/test/bigfile > > ... wait ... in other window ... 
> > # du -hc /mnt/test/bigfile > 8.0G /mnt/test/bigfile > 8.0G total > > # echo t > /proc/sysrq-trigger > # dmesg | grep -A20 xfs_io > xfs_io R running task 3576 29444 29362 0x00000006 > ffff8809cfbb4920 ffffffff81478d9f ffffffffa032d3c5 0000000000000246 > ffff8809cfbb4920 ffffffff814788bc 0000000000000000 ffffffff81ba3510 > ffff8809d3429a68 ffffffffa032b60f ffff8809d3429aa8 000000000000001e > Call Trace: > [] ? __mutex_lock_common+0x36d/0x392 > [] ? xfs_icsb_modify_counters+0x17f/0x1ac [xfs] > [] ? xfs_icsb_unlock_all_counters+0x4d/0x60 [xfs] > [] ? xfs_icsb_disable_counter+0x8c/0x95 [xfs] > [] ? mutex_lock_nested+0x3e/0x43 > [] ? xfs_icsb_modify_counters+0x18d/0x1ac [xfs] > [] ? xfs_mod_incore_sb+0x29/0x6e [xfs] > [] ? _xfs_trans_alloc+0x27/0x61 [xfs] > [] ? xfs_trans_reserve+0x6c/0x19e [xfs] > [] ? up_write+0x2b/0x32 > [] ? xfs_alloc_file_space+0x163/0x306 [xfs] > [] ? sched_clock_cpu+0xc3/0xce > [] ? xfs_change_file_space+0x12a/0x2b8 [xfs] > [] ? down_write_nested+0x80/0x8b > [] ? xfs_ilock+0x30/0xb4 [xfs] > [] ? xfs_vn_fallocate+0x80/0xf4 [xfs] > -- > R xfs_io 29444 86014624.786617 162 120 86014624.786617 137655.161327 408.979977 / > > # uname -r > 2.6.34-0.4.rc0.git2.fc14.x86_64 > > I'll look into it. On my current delayed-logging branch on a 30GB filesystem: # xfs_io -f -c "resvsp 0 16t" /mnt/scratch/bigfile And in dmesg: [60173.119760] Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, file: fs/xfs/xfs_trans.c, line: 475 [60173.121263] ------------[ cut here ]------------ [60173.121771] kernel BUG at fs/xfs/support/debug.c:109! 
[60173.121771] invalid opcode: 0000 [#1] SMP [60173.121771] last sysfs file: /sys/devices/virtio-pci/virtio2/block/vdb/removable [60173.121771] CPU 7 [60173.121771] Modules linked in: [last unloaded: scsi_wait_scan] [60173.121771] [60173.121771] Pid: 3596, comm: xfs_io Not tainted 2.6.34-rc1-dgc #138 /Bochs [60173.121771] RIP: 0010:[] [] assfail+0x1f/0x30 [60173.121771] RSP: 0018:ffff880112043808 EFLAGS: 00010292 [60173.121771] RAX: 000000000000006d RBX: ffff880105038da0 RCX: 0000000000000000 [60173.121771] RDX: ffff880003600000 RSI: 0000000000000000 RDI: 0000000000000246 [60173.121771] RBP: ffff880112043808 R08: 0000000000000002 R09: 0000000000000000 [60173.121771] R10: ffffffff81a70bb8 R11: 0000000000000000 R12: ffffffffffe20082 [60173.121771] R13: ffff88011cea5000 R14: 0000000000000001 R15: 0000000000000000 [60173.121771] FS: 00007f0311cda6f0(0000) GS:ffff880003600000(0000) knlGS:0000000000000000 [60173.121771] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [60173.121771] CR2: 00007f031164d750 CR3: 000000011bc59000 CR4: 00000000000006e0 [60173.121771] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [60173.121771] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [60173.121771] Process xfs_io (pid: 3596, threadinfo ffff880112042000, task ffff88011629a740) [60173.121771] Stack: [60173.121771] ffff880112043838 ffffffff81342415 00000000001dff7e ffff880112043928 [60173.121771] <0> 00000000001dff7e 0000000000000004 ffff880112043858 ffffffff812e82ce [60173.121771] <0> ffff880112043928 ffff88011cea5000 ffff8801120438c8 ffffffff812e9008 [60173.121771] Call Trace: [60173.121771] [] xfs_trans_mod_sb+0x2f5/0x330 [60173.121771] [] xfs_alloc_ag_vextent+0x18e/0x2b0 [60173.121771] [] xfs_alloc_vextent+0x598/0x870 [60173.121771] [] xfs_bmap_btalloc+0x29f/0x7b0 [60173.121771] [] ? xfs_bmap_search_multi_extents+0x71/0x110 [60173.121771] [] xfs_bmap_alloc+0x21/0x40 [60173.121771] [] xfs_bmapi+0xf2c/0x1a90 [60173.121771] [] ? 
xlog_grant_log_space+0x35/0x640 [60173.121771] [] ? xfs_ilock+0x10b/0x190 [60173.121771] [] xfs_alloc_file_space+0x190/0x440 [60173.121771] [] ? trace_hardirqs_on+0xd/0x10 [60173.121771] [] xfs_change_file_space+0x2d4/0x380 [60173.121771] [] ? down_write_nested+0x9e/0xb0 [60173.121771] [] ? xfs_ilock+0xe8/0x190 [60173.121771] [] xfs_vn_fallocate+0x87/0x110 [60173.121771] [] ? __do_fault+0x12c/0x450 [60173.121771] [] ? might_fault+0x5c/0xb0 [60173.121771] [] ? __do_fault+0x399/0x450 [60173.121771] [] do_fallocate+0x103/0x110 [60173.121771] [] ioctl_preallocate+0x8c/0xb0 [60173.121771] [] do_vfs_ioctl+0x415/0x5b0 [60173.121771] [] ? up_read+0x23/0x40 [60173.121771] [] sys_ioctl+0x81/0xa0 [60173.121771] [] system_call_fastpath+0x16/0x1b So there's been a transaction overrun, which tends to imply we're allocating too much in a single transaction. I'd say there's an overflow happening somewhere in this path. Cheers, Dave. -- Dave Chinner david@fromorbit.com
From christian.affolter@purplehaze.ch Sat May 8 07:32:56 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.8 required=5.0 tests=BAYES_00,J_CHICKENPOX_45 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o48CWtW7192226 for ; Sat, 8 May 2010 07:32:56 -0500 X-ASG-Debug-ID: 1273322099-72de02320000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from smtp.stepping-stone.ch (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 02E5631DA24 for ; Sat, 8 May 2010 05:35:00 -0700 (PDT) Received: from smtp.stepping-stone.ch (smtp.stepping-stone.ch [194.176.109.228]) by cuda.sgi.com with ESMTP id ISrQPXlexF69bwhM for ; Sat, 08 May 2010 05:35:00 -0700 (PDT) Received: from localhost (mail-scanner-01.int.stepping-stone.ch [10.17.98.47]) by smtp.stepping-stone.ch (Postfix) with ESMTP id 2005340021B; Sat, 8 May 2010 14:34:59 +0200 (CEST) X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Scanned: amavisd-new at stepping-stone.ch Received: from smtp.stepping-stone.ch ([10.17.98.46]) by localhost
(mail-scanner-01.int.stepping-stone.ch [10.17.98.47]) (amavisd-new, port 10024) with LMTP id UgXI6hZTK9DK; Sat, 8 May 2010 14:34:46 +0200 (CEST) Received: from [192.168.1.4] (84-73-140-121.dclient.hispeed.ch [84.73.140.121]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by smtp.stepping-stone.ch (Postfix) with ESMTPSA id C575B40042A; Sat, 8 May 2010 14:34:44 +0200 (CEST) Message-ID: <4BE55A63.8070203@purplehaze.ch> Date: Sat, 08 May 2010 14:34:43 +0200 From: Christian Affolter User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100420 Thunderbird/3.0.4 MIME-Version: 1.0 To: xfs@oss.sgi.com X-ASG-Orig-Subj: failed to read root inode Subject: failed to read root inode Content-Type: multipart/mixed; boundary="------------040805010700020206070201" X-Barracuda-Connect: smtp.stepping-stone.ch[194.176.109.228] X-Barracuda-Start-Time: 1273322102 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29358 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Status: Clean This is a multi-part message in MIME format. --------------040805010700020206070201 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Hi After a disk crash within a hardware RAID-6 controller and kernel freeze, I'm unable to mount an XFS filesystem on top of an EVMS volume: Filesystem "dm-13": Disabling barriers, not supported by the underlying device XFS mounting filesystem dm-13 Starting XFS recovery on filesystem: dm-13 (logdev: internal) XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1599 of file fs/xfs/xfs_alloc.c. 
Caller 0xffffffff8035c58d Pid: 13473, comm: mount Not tainted 2.6.26-gentoo #1 Call Trace: [] xfs_free_extent+0xcd/0x110 [] xfs_free_ag_extent+0x4e3/0x740 [] xfs_free_extent+0xcd/0x110 [] xlog_recover_process_efi+0x18d/0x1d0 [] xlog_recover_process_efis+0x60/0xa0 [] xlog_recover_finish+0x23/0xf0 [] xfs_mountfs+0x4da/0x680 [] kmem_alloc+0x58/0x100 [] kmem_zalloc+0x2b/0x40 [] xfs_mount+0x36d/0x3a0 [] xfs_fs_fill_super+0xbd/0x220 [] get_sb_bdev+0x141/0x180 [] xfs_fs_fill_super+0x0/0x220 [] vfs_kern_mount+0x56/0xc0 [] do_kern_mount+0x53/0x110 [] do_new_mount+0x9b/0xe0 [] do_mount+0x1e6/0x220 [] __get_free_pages+0x15/0x60 [] sys_mount+0x9b/0x100 [] system_call_after_swapgs+0x7b/0x80 Filesystem "dm-13": XFS internal error xfs_trans_cancel at line 1163 of file fs/xfs/xfs_trans.c. Caller 0xffffffff80395eb1 Pid: 13473, comm: mount Not tainted 2.6.26-gentoo #1 Call Trace: [] xlog_recover_process_efi+0x1a1/0x1d0 [] xfs_trans_cancel+0x126/0x150 [] xlog_recover_process_efi+0x1a1/0x1d0 [] xlog_recover_process_efis+0x60/0xa0 [] xlog_recover_finish+0x23/0xf0 [] xfs_mountfs+0x4da/0x680 [] kmem_alloc+0x58/0x100 [] kmem_zalloc+0x2b/0x40 [] xfs_mount+0x36d/0x3a0 [] xfs_fs_fill_super+0xbd/0x220 [] get_sb_bdev+0x141/0x180 [] xfs_fs_fill_super+0x0/0x220 [] vfs_kern_mount+0x56/0xc0 [] do_kern_mount+0x53/0x110 [] do_new_mount+0x9b/0xe0 [] do_mount+0x1e6/0x220 [] __get_free_pages+0x15/0x60 [] sys_mount+0x9b/0x100 [] system_call_after_swapgs+0x7b/0x80 xfs_force_shutdown(dm-13,0x8) called from line 1164 of file fs/xfs/xfs_trans.c. Return address = 0xffffffff8039fd2f Filesystem "dm-13": Corruption of in-memory data detected. 
Shutting down filesystem: dm-13 Please umount the filesystem, and rectify the problem(s) Failed to recover EFIs on filesystem: dm-13 XFS: log mount finish failed I tried to repair the filesystem with the help of xfs_repair many times, without any luck: Filesystem "dm-13": Disabling barriers, not supported by the underlying device XFS mounting filesystem dm-13 XFS: failed to read root inode xfs_check output: cache_node_purge: refcount was 1, not zero (node=0x820010) xfs_check: cannot read root inode (117) cache_node_purge: refcount was 1, not zero (node=0x8226b0) xfs_check: cannot read realtime bitmap inode (117) block 0/8 expected type unknown got log block 0/9 expected type unknown got log block 0/10 expected type unknown got log block 0/11 expected type unknown got log bad magic number 0xfeed for inode 128 [...] Are there any other ways to fix the unreadable root inode or to restore the remaining data? Environment information: Linux Kernel: 2.6.26-gentoo (x86_64) xfsprogs: 3.0.3 Attached you'll find the xfs_repair and xfs_check output. Thanks in advance and kind regards Christian --------------040805010700020206070201 Content-Type: text/plain; name="xfs_repair.log" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="xfs_repair.log" Phase 1 - find and verify superblock... Phase 2 - using internal log - zero log... - scan filesystem freespace and inode maps... - found root inode chunk Phase 3 - for each AG... - scan and clear agi unlinked lists... - process known inodes and perform inode discovery... 
- agno = 0 bad magic number 0x0 on inode 128 bad version number 0x0 on inode 128 bad magic number 0x0 on inode 129 bad version number 0x0 on inode 129 bad magic number 0x0 on inode 130 bad version number 0x0 on inode 130 bad magic number 0x0 on inode 131 bad version number 0x0 on inode 131 bad magic number 0x0 on inode 132 bad version number 0x0 on inode 132 bad magic number 0x0 on inode 133 bad version number 0x0 on inode 133 bad magic number 0x0 on inode 134 bad version number 0x0 on inode 134 bad magic number 0x0 on inode 135 bad version number 0x0 on inode 135 bad magic number 0x0 on inode 136 bad version number 0x0 on inode 136 bad magic number 0x0 on inode 137 bad version number 0x0 on inode 137 bad magic number 0x0 on inode 138 bad version number 0x0 on inode 138 bad magic number 0x0 on inode 139 bad version number 0x0 on inode 139 bad magic number 0x0 on inode 140 bad version number 0x0 on inode 140 bad magic number 0x0 on inode 141 bad version number 0x0 on inode 141 bad magic number 0x0 on inode 142 bad version number 0x0 on inode 142 bad magic number 0x0 on inode 143 bad version number 0x0 on inode 143 bad magic number 0x0 on inode 144 bad version number 0x0 on inode 144 bad magic number 0x0 on inode 145 bad version number 0x0 on inode 145 bad magic number 0x0 on inode 146 bad version number 0x0 on inode 146 bad magic number 0x0 on inode 147 bad version number 0x0 on inode 147 bad magic number 0x0 on inode 148 bad version number 0x0 on inode 148 bad magic number 0x0 on inode 149 bad version number 0x0 on inode 149 bad magic number 0x0 on inode 150 bad version number 0x0 on inode 150 bad magic number 0x0 on inode 151 bad version number 0x0 on inode 151 bad magic number 0x0 on inode 152 bad version number 0x0 on inode 152 bad magic number 0x0 on inode 153 bad version number 0x0 on inode 153 bad magic number 0x0 on inode 154 bad version number 0x0 on inode 154 bad magic number 0x0 on inode 155 bad version number 0x0 on inode 155 bad magic number 0x0 on 
inode 156 bad version number 0x0 on inode 156 bad magic number 0x0 on inode 157 bad version number 0x0 on inode 157 bad magic number 0x0 on inode 158 bad version number 0x0 on inode 158 bad magic number 0x0 on inode 159 bad version number 0x0 on inode 159 bad magic number 0x0 on inode 160 bad version number 0x0 on inode 160 bad magic number 0x0 on inode 161 bad version number 0x0 on inode 161 bad magic number 0x0 on inode 162 bad version number 0x0 on inode 162 bad magic number 0x0 on inode 163 bad version number 0x0 on inode 163 bad magic number 0x0 on inode 164 bad version number 0x0 on inode 164 bad magic number 0x0 on inode 165 bad version number 0x0 on inode 165 bad magic number 0x0 on inode 166 bad version number 0x0 on inode 166 bad magic number 0x0 on inode 167 bad version number 0x0 on inode 167 bad magic number 0x0 on inode 168 bad version number 0x0 on inode 168 bad magic number 0x0 on inode 169 bad version number 0x0 on inode 169 bad magic number 0x0 on inode 170 bad version number 0x0 on inode 170 bad magic number 0x0 on inode 171 bad version number 0x0 on inode 171 bad magic number 0x0 on inode 172 bad version number 0x0 on inode 172 bad magic number 0x0 on inode 173 bad version number 0x0 on inode 173 bad magic number 0x0 on inode 174 bad version number 0x0 on inode 174 bad magic number 0x0 on inode 175 bad version number 0x0 on inode 175 bad magic number 0x0 on inode 176 bad version number 0x0 on inode 176 bad magic number 0x0 on inode 177 bad version number 0x0 on inode 177 bad magic number 0x0 on inode 178 bad version number 0x0 on inode 178 bad magic number 0x0 on inode 179 bad version number 0x0 on inode 179 bad magic number 0x0 on inode 180 bad version number 0x0 on inode 180 bad magic number 0x0 on inode 181 bad version number 0x0 on inode 181 bad magic number 0x0 on inode 182 bad version number 0x0 on inode 182 bad magic number 0x0 on inode 183 bad version number 0x0 on inode 183 bad magic number 0x0 on inode 184 bad version number 0x0 on 
inode 184 bad magic number 0x0 on inode 185 bad version number 0x0 on inode 185 bad magic number 0x0 on inode 186 bad version number 0x0 on inode 186 bad magic number 0x0 on inode 187 bad version number 0x0 on inode 187 bad magic number 0x0 on inode 188 bad version number 0x0 on inode 188 bad magic number 0x0 on inode 189 bad version number 0x0 on inode 189 bad magic number 0x0 on inode 190 bad version number 0x0 on inode 190 bad magic number 0x0 on inode 191 bad version number 0x0 on inode 191 bad magic number 0x0 on inode 128, resetting magic number bad version number 0x0 on inode 128, resetting version number imap claims a free inode 128 is in use, correcting imap and clearing inode cleared root inode 128 bad magic number 0x0 on inode 129, resetting magic number bad version number 0x0 on inode 129, resetting version number imap claims a free inode 129 is in use, correcting imap and clearing inode cleared realtime bitmap inode 129 bad magic number 0x0 on inode 130, resetting magic number bad version number 0x0 on inode 130, resetting version number imap claims a free inode 130 is in use, correcting imap and clearing inode cleared realtime summary inode 130 bad magic number 0x0 on inode 131, resetting magic number bad version number 0x0 on inode 131, resetting version number imap claims a free inode 131 is in use, correcting imap and clearing inode cleared inode 131 bad magic number 0x0 on inode 132, resetting magic number bad version number 0x0 on inode 132, resetting version number bad magic number 0x0 on inode 133, resetting magic number bad version number 0x0 on inode 133, resetting version number bad magic number 0x0 on inode 134, resetting magic number bad version number 0x0 on inode 134, resetting version number bad magic number 0x0 on inode 135, resetting magic number bad version number 0x0 on inode 135, resetting version number bad magic number 0x0 on inode 136, resetting magic number bad version number 0x0 on inode 136, resetting version number bad magic 
number 0x0 on inode 137, resetting magic number bad version number 0x0 on inode 137, resetting version number bad magic number 0x0 on inode 138, resetting magic number bad version number 0x0 on inode 138, resetting version number bad magic number 0x0 on inode 139, resetting magic number bad version number 0x0 on inode 139, resetting version number bad magic number 0x0 on inode 140, resetting magic number bad version number 0x0 on inode 140, resetting version number bad magic number 0x0 on inode 141, resetting magic number bad version number 0x0 on inode 141, resetting version number bad magic number 0x0 on inode 142, resetting magic number bad version number 0x0 on inode 142, resetting version number bad magic number 0x0 on inode 143, resetting magic number bad version number 0x0 on inode 143, resetting version number bad magic number 0x0 on inode 144, resetting magic number bad version number 0x0 on inode 144, resetting version number bad magic number 0x0 on inode 145, resetting magic number bad version number 0x0 on inode 145, resetting version number bad magic number 0x0 on inode 146, resetting magic number bad version number 0x0 on inode 146, resetting version number bad magic number 0x0 on inode 147, resetting magic number bad version number 0x0 on inode 147, resetting version number bad magic number 0x0 on inode 148, resetting magic number bad version number 0x0 on inode 148, resetting version number bad magic number 0x0 on inode 149, resetting magic number bad version number 0x0 on inode 149, resetting version number bad magic number 0x0 on inode 150, resetting magic number bad version number 0x0 on inode 150, resetting version number bad magic number 0x0 on inode 151, resetting magic number bad version number 0x0 on inode 151, resetting version number bad magic number 0x0 on inode 152, resetting magic number bad version number 0x0 on inode 152, resetting version number bad magic number 0x0 on inode 153, resetting magic number bad version number 0x0 on inode 
153, resetting version number bad magic number 0x0 on inode 154, resetting magic number bad version number 0x0 on inode 154, resetting version number bad magic number 0x0 on inode 155, resetting magic number bad version number 0x0 on inode 155, resetting version number bad magic number 0x0 on inode 156, resetting magic number bad version number 0x0 on inode 156, resetting version number bad magic number 0x0 on inode 157, resetting magic number bad version number 0x0 on inode 157, resetting version number bad magic number 0x0 on inode 158, resetting magic number bad version number 0x0 on inode 158, resetting version number bad magic number 0x0 on inode 159, resetting magic number bad version number 0x0 on inode 159, resetting version number bad magic number 0x0 on inode 160, resetting magic number bad version number 0x0 on inode 160, resetting version number bad magic number 0x0 on inode 161, resetting magic number bad version number 0x0 on inode 161, resetting version number bad magic number 0x0 on inode 162, resetting magic number bad version number 0x0 on inode 162, resetting version number bad magic number 0x0 on inode 163, resetting magic number bad version number 0x0 on inode 163, resetting version number bad magic number 0x0 on inode 164, resetting magic number bad version number 0x0 on inode 164, resetting version number bad magic number 0x0 on inode 165, resetting magic number bad version number 0x0 on inode 165, resetting version number bad magic number 0x0 on inode 166, resetting magic number bad version number 0x0 on inode 166, resetting version number bad magic number 0x0 on inode 167, resetting magic number bad version number 0x0 on inode 167, resetting version number bad magic number 0x0 on inode 168, resetting magic number bad version number 0x0 on inode 168, resetting version number bad magic number 0x0 on inode 169, resetting magic number bad version number 0x0 on inode 169, resetting version number bad magic number 0x0 on inode 170, resetting 
magic number bad version number 0x0 on inode 170, resetting version number bad magic number 0x0 on inode 171, resetting magic number bad version number 0x0 on inode 171, resetting version number bad magic number 0x0 on inode 172, resetting magic number bad version number 0x0 on inode 172, resetting version number bad magic number 0x0 on inode 173, resetting magic number bad version number 0x0 on inode 173, resetting version number bad magic number 0x0 on inode 174, resetting magic number bad version number 0x0 on inode 174, resetting version number bad magic number 0x0 on inode 175, resetting magic number bad version number 0x0 on inode 175, resetting version number bad magic number 0x0 on inode 176, resetting magic number bad version number 0x0 on inode 176, resetting version number bad magic number 0x0 on inode 177, resetting magic number bad version number 0x0 on inode 177, resetting version number bad magic number 0x0 on inode 178, resetting magic number bad version number 0x0 on inode 178, resetting version number bad magic number 0x0 on inode 179, resetting magic number bad version number 0x0 on inode 179, resetting version number bad magic number 0x0 on inode 180, resetting magic number bad version number 0x0 on inode 180, resetting version number bad magic number 0x0 on inode 181, resetting magic number bad version number 0x0 on inode 181, resetting version number bad magic number 0x0 on inode 182, resetting magic number bad version number 0x0 on inode 182, resetting version number bad magic number 0x0 on inode 183, resetting magic number bad version number 0x0 on inode 183, resetting version number bad magic number 0x0 on inode 184, resetting magic number bad version number 0x0 on inode 184, resetting version number bad magic number 0x0 on inode 185, resetting magic number bad version number 0x0 on inode 185, resetting version number bad magic number 0x0 on inode 186, resetting magic number bad version number 0x0 on inode 186, resetting version number bad 
magic number 0x0 on inode 187, resetting magic number bad version number 0x0 on inode 187, resetting version number bad magic number 0x0 on inode 188, resetting magic number bad version number 0x0 on inode 188, resetting version number bad magic number 0x0 on inode 189, resetting magic number bad version number 0x0 on inode 189, resetting version number bad magic number 0x0 on inode 190, resetting magic number bad version number 0x0 on inode 190, resetting version number bad magic number 0x0 on inode 191, resetting magic number bad version number 0x0 on inode 191, resetting version number - agno = 1 - process newly discovered inodes... Phase 4 - check for duplicate blocks... - setting up duplicate extent list... root inode lost - check for inodes claiming duplicate blocks... - agno = 1 - agno = 0 inode block 8 multiply claimed, state was 4 inode block 9 multiply claimed, state was 4 inode block 10 multiply claimed, state was 4 inode block 11 multiply claimed, state was 4 entry ".." at block 0 offset 32 in directory inode 424256 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 424256 entry ".." at block 0 offset 32 in directory inode 1075051913 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1075051913 entry ".." at block 0 offset 32 in directory inode 1075051915 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1075051915 entry ".." at block 0 offset 32 in directory inode 1075051918 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1075051918 entry ".." at block 0 offset 32 in directory inode 1075051920 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1075051920 entry ".." at block 0 offset 32 in directory inode 1075051921 references free inode 131 clearing inode number in entry at offset 32... no .. 
entry for directory 1075051921 entry ".." at block 0 offset 32 in directory inode 1075051923 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1075051923 entry ".." at block 0 offset 32 in directory inode 1075051930 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1075051930 entry ".." at block 0 offset 32 in directory inode 1075051931 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1075051931 entry ".." at block 0 offset 32 in directory inode 1075051932 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1075051932 entry ".." at block 0 offset 32 in directory inode 1075051934 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1075051934 entry ".." at block 0 offset 32 in directory inode 1075117889 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1075117889 entry ".." at block 0 offset 32 in directory inode 1075117896 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1075117896 entry ".." at block 0 offset 32 in directory inode 1075117907 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1075117907 entry ".." at block 0 offset 32 in directory inode 1075244397 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1075244397 entry ".." at block 0 offset 32 in directory inode 1075244399 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1075244399 entry ".." at block 0 offset 32 in directory inode 1075244401 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1075244401 entry ".." 
at block 0 offset 32 in directory inode 1076805266 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076805266 entry ".." at block 0 offset 32 in directory inode 1076811247 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076811247 entry ".." at block 0 offset 32 in directory inode 1076811248 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076811248 entry ".." at block 0 offset 32 in directory inode 1076811260 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076811260 entry ".." at block 0 offset 32 in directory inode 1076811261 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076811261 entry ".." at block 0 offset 32 in directory inode 1076811262 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076811262 entry ".." at block 0 offset 32 in directory inode 1076811263 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076811263 entry ".." at block 0 offset 32 in directory inode 1076818528 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076818528 entry ".." at block 0 offset 32 in directory inode 1076818529 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076818529 entry ".." at block 0 offset 32 in directory inode 1076818542 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076818542 entry ".." at block 0 offset 32 in directory inode 1076818543 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076818543 entry ".." 
at block 0 offset 32 in directory inode 1076818544 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076818544 entry ".." at block 0 offset 32 in directory inode 1076818545 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076818545 entry ".." at block 0 offset 32 in directory inode 1076818546 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076818546 entry ".." at block 0 offset 32 in directory inode 1076818548 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076818548 entry ".." at block 0 offset 32 in directory inode 1076818549 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076818549 entry ".." at block 0 offset 32 in directory inode 1076818554 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076818554 entry ".." at block 0 offset 32 in directory inode 1076818555 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076818555 entry ".." at block 0 offset 32 in directory inode 1076818556 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076818556 entry ".." at block 0 offset 32 in directory inode 1076818559 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076818559 entry ".." at block 0 offset 32 in directory inode 1076818562 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076818562 entry ".." at block 0 offset 32 in directory inode 1076818563 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076818563 entry ".." 
at block 0 offset 32 in directory inode 1076818591 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076818591 entry ".." at block 0 offset 32 in directory inode 1076828546 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076828546 entry ".." at block 0 offset 32 in directory inode 1076828549 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076828549 entry ".." at block 0 offset 32 in directory inode 1076828554 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076828554 entry ".." at block 0 offset 32 in directory inode 1076828555 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076828555 entry ".." at block 0 offset 32 in directory inode 1076828571 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076828571 entry ".." at block 0 offset 32 in directory inode 1076828573 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076828573 entry ".." at block 0 offset 32 in directory inode 1076828594 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076828594 entry ".." at block 0 offset 32 in directory inode 1076828599 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076828599 entry ".." at block 0 offset 32 in directory inode 1076828607 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076828607 entry ".." at block 0 offset 32 in directory inode 1076839776 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839776 entry ".." 
at block 0 offset 32 in directory inode 1076839777 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839777 entry ".." at block 0 offset 32 in directory inode 1076839779 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839779 entry ".." at block 0 offset 32 in directory inode 1076839780 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839780 entry ".." at block 0 offset 32 in directory inode 1076839784 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839784 entry ".." at block 0 offset 32 in directory inode 1076839785 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839785 entry ".." at block 0 offset 32 in directory inode 1076839786 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839786 entry ".." at block 0 offset 32 in directory inode 1076839792 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839792 entry ".." at block 0 offset 32 in directory inode 1076839794 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839794 entry ".." at block 0 offset 32 in directory inode 1076839795 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839795 entry ".." at block 0 offset 32 in directory inode 1076839823 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839823 entry ".." at block 0 offset 32 in directory inode 1076839825 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839825 entry ".." 
at block 0 offset 32 in directory inode 1076839826 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839826 entry ".." at block 0 offset 32 in directory inode 1076839828 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839828 entry ".." at block 0 offset 32 in directory inode 1076839831 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839831 entry ".." at block 0 offset 32 in directory inode 1076839835 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839835 entry ".." at block 0 offset 32 in directory inode 1076839836 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839836 entry ".." at block 0 offset 32 in directory inode 1076839837 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839837 entry ".." at block 0 offset 32 in directory inode 1076839838 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076839838 entry ".." at block 0 offset 32 in directory inode 1076848162 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076848162 entry ".." at block 0 offset 32 in directory inode 1076848202 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076848202 entry ".." at block 0 offset 32 in directory inode 1076848204 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076848204 entry ".." at block 0 offset 32 in directory inode 1076848221 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076848221 entry ".." 
at block 0 offset 32 in directory inode 1076848222 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076848222 entry ".." at block 0 offset 32 in directory inode 1076853088 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076853088 entry ".." at block 0 offset 32 in directory inode 1076853089 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076853089 entry ".." at block 0 offset 32 in directory inode 1076853091 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076853091 entry ".." at block 0 offset 32 in directory inode 1076853114 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076853114 entry ".." at block 0 offset 32 in directory inode 1076853126 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076853126 entry ".." at block 0 offset 32 in directory inode 1076853127 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076853127 entry ".." at block 0 offset 32 in directory inode 1076853133 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076853133 entry ".." at block 0 offset 32 in directory inode 1076853134 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076853134 entry ".." at block 0 offset 32 in directory inode 1076853135 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076853135 entry ".." at block 0 offset 32 in directory inode 1076853136 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076853136 entry ".." 
at block 0 offset 32 in directory inode 1076853138 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076853138 entry ".." at block 0 offset 32 in directory inode 1076853139 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076853139 entry ".." at block 0 offset 32 in directory inode 1076854593 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076854593 entry ".." at block 0 offset 32 in directory inode 1076854603 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076854603 entry ".." at block 0 offset 32 in directory inode 1076854604 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076854604 entry ".." at block 0 offset 32 in directory inode 1076854610 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076854610 entry ".." at block 0 offset 32 in directory inode 1076854621 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076854621 entry ".." at block 0 offset 32 in directory inode 1076854628 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076854628 entry ".." at block 0 offset 32 in directory inode 1076854639 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076854639 entry ".." at block 0 offset 32 in directory inode 1076854640 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076854640 entry ".." at block 0 offset 32 in directory inode 1076854643 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076854643 entry ".." 
at block 0 offset 32 in directory inode 1076854644 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1076854644 entry ".." at block 0 offset 32 in directory inode 1077602230 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1077602230 entry ".." at block 0 offset 32 in directory inode 1077652948 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1077652948 entry ".." at block 0 offset 32 in directory inode 1098465511 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1098465511 entry ".." at block 0 offset 32 in directory inode 1098465515 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1098465515 entry ".." at block 0 offset 32 in directory inode 1098465519 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1098465519 entry ".." at block 0 offset 32 in directory inode 1098465523 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1098465523 entry ".." at block 0 offset 32 in directory inode 1098465524 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1098465524 entry ".." at block 0 offset 32 in directory inode 1098465528 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1098465528 entry ".." at block 0 offset 32 in directory inode 1098465529 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1098465529 entry ".." at block 0 offset 32 in directory inode 1098465537 references free inode 131 clearing inode number in entry at offset 32... no .. entry for directory 1098465537 entry ".." 
at block 0 offset 32 in directory inode 125005222 references free inode 131
clearing inode number in entry at offset 32...
no .. entry for directory 125005222
entry ".." at block 0 offset 32 in directory inode 125005255 references free inode 131
clearing inode number in entry at offset 32...
no .. entry for directory 125005255
entry ".." at block 0 offset 32 in directory inode 125005256 references free inode 131
clearing inode number in entry at offset 32...
no .. entry for directory 125005256
entry ".." at block 0 offset 32 in directory inode 125005577 references free inode 131
clearing inode number in entry at offset 32...
no .. entry for directory 125005577
entry ".." at block 0 offset 32 in directory inode 125660420 references free inode 131
clearing inode number in entry at offset 32...
no .. entry for directory 125660420
entry ".." at block 0 offset 32 in directory inode 130503573 references free inode 131
clearing inode number in entry at offset 32...
no .. entry for directory 130503573
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
reinitializing root directory
reinitializing realtime bitmap inode
reinitializing realtime summary inode
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
bad hash table for directory inode 424256 (no data entry): rebuilding rebuilding directory inode 424256 bad hash table for directory inode 125005222 (no data entry): rebuilding rebuilding directory inode 125005222 bad hash table for directory inode 125005255 (no data entry): rebuilding rebuilding directory inode 125005255 bad hash table for directory inode 125005256 (no data entry): rebuilding rebuilding directory inode 125005256 bad hash table for directory inode 125005577 (no data entry): rebuilding rebuilding directory inode 125005577 bad hash table for directory inode 125660420 (no data entry): rebuilding rebuilding directory inode 125660420 bad hash table for directory inode 130503573 (no data entry): rebuilding rebuilding directory inode 130503573 bad hash table for directory inode 1075051913 (no data entry): rebuilding rebuilding directory inode 1075051913 bad hash table for directory inode 1075051915 (no data entry): rebuilding rebuilding directory inode 1075051915 bad hash table for directory inode 1075051918 (no data entry): rebuilding rebuilding directory inode 1075051918 bad hash table for directory inode 1075051920 (no data entry): rebuilding rebuilding directory inode 1075051920 bad hash table for directory inode 1075051921 (no data entry): rebuilding rebuilding directory inode 1075051921 bad hash table for directory inode 1075051923 (no data entry): rebuilding rebuilding directory inode 1075051923 bad hash table for directory inode 1075051930 (no data entry): rebuilding rebuilding directory inode 1075051930 bad hash table for directory inode 1075051931 (no data entry): rebuilding rebuilding directory inode 1075051931 bad hash table for directory inode 1075051932 (no data entry): rebuilding rebuilding directory inode 1075051932 bad hash table for directory inode 1075051934 (no data entry): rebuilding rebuilding directory inode 1075051934 bad hash table for directory inode 1075117889 (no data entry): rebuilding rebuilding directory inode 1075117889 bad 
hash table for directory inode 1075117896 (no data entry): rebuilding rebuilding directory inode 1075117896 bad hash table for directory inode 1075117907 (no data entry): rebuilding rebuilding directory inode 1075117907 bad hash table for directory inode 1075244397 (no data entry): rebuilding rebuilding directory inode 1075244397 bad hash table for directory inode 1075244399 (no data entry): rebuilding rebuilding directory inode 1075244399 bad hash table for directory inode 1075244401 (no data entry): rebuilding rebuilding directory inode 1075244401 bad hash table for directory inode 1076805266 (no data entry): rebuilding rebuilding directory inode 1076805266 bad hash table for directory inode 1076811247 (no data entry): rebuilding rebuilding directory inode 1076811247 bad hash table for directory inode 1076811248 (no data entry): rebuilding rebuilding directory inode 1076811248 bad hash table for directory inode 1076811260 (no data entry): rebuilding rebuilding directory inode 1076811260 bad hash table for directory inode 1076811261 (no data entry): rebuilding rebuilding directory inode 1076811261 bad hash table for directory inode 1076811262 (no data entry): rebuilding rebuilding directory inode 1076811262 bad hash table for directory inode 1076811263 (no data entry): rebuilding rebuilding directory inode 1076811263 bad hash table for directory inode 1076818528 (no data entry): rebuilding rebuilding directory inode 1076818528 bad hash table for directory inode 1076818529 (no data entry): rebuilding rebuilding directory inode 1076818529 bad hash table for directory inode 1076818542 (no data entry): rebuilding rebuilding directory inode 1076818542 bad hash table for directory inode 1076818543 (no data entry): rebuilding rebuilding directory inode 1076818543 bad hash table for directory inode 1076818544 (no data entry): rebuilding rebuilding directory inode 1076818544 bad hash table for directory inode 1076818545 (no data entry): rebuilding rebuilding directory 
inode 1076818545 bad hash table for directory inode 1076818546 (no data entry): rebuilding rebuilding directory inode 1076818546 bad hash table for directory inode 1076818548 (no data entry): rebuilding rebuilding directory inode 1076818548 bad hash table for directory inode 1076818549 (no data entry): rebuilding rebuilding directory inode 1076818549 bad hash table for directory inode 1076818554 (no data entry): rebuilding rebuilding directory inode 1076818554 bad hash table for directory inode 1076818555 (no data entry): rebuilding rebuilding directory inode 1076818555 bad hash table for directory inode 1076818556 (no data entry): rebuilding rebuilding directory inode 1076818556 bad hash table for directory inode 1076818559 (no data entry): rebuilding rebuilding directory inode 1076818559 bad hash table for directory inode 1076818562 (no data entry): rebuilding rebuilding directory inode 1076818562 bad hash table for directory inode 1076818563 (no data entry): rebuilding rebuilding directory inode 1076818563 bad hash table for directory inode 1076818591 (no data entry): rebuilding rebuilding directory inode 1076818591 bad hash table for directory inode 1076828546 (no data entry): rebuilding rebuilding directory inode 1076828546 bad hash table for directory inode 1076828549 (no data entry): rebuilding rebuilding directory inode 1076828549 bad hash table for directory inode 1076828554 (no data entry): rebuilding rebuilding directory inode 1076828554 bad hash table for directory inode 1076828555 (no data entry): rebuilding rebuilding directory inode 1076828555 bad hash table for directory inode 1076828571 (no data entry): rebuilding rebuilding directory inode 1076828571 bad hash table for directory inode 1076828573 (no data entry): rebuilding rebuilding directory inode 1076828573 bad hash table for directory inode 1076828594 (no data entry): rebuilding rebuilding directory inode 1076828594 bad hash table for directory inode 1076828599 (no data entry): rebuilding 
rebuilding directory inode 1076828599 bad hash table for directory inode 1076828607 (no data entry): rebuilding rebuilding directory inode 1076828607 bad hash table for directory inode 1076839776 (no data entry): rebuilding rebuilding directory inode 1076839776 bad hash table for directory inode 1076839777 (no data entry): rebuilding rebuilding directory inode 1076839777 bad hash table for directory inode 1076839779 (no data entry): rebuilding rebuilding directory inode 1076839779 bad hash table for directory inode 1076839780 (no data entry): rebuilding rebuilding directory inode 1076839780 bad hash table for directory inode 1076839784 (no data entry): rebuilding rebuilding directory inode 1076839784 bad hash table for directory inode 1076839785 (no data entry): rebuilding rebuilding directory inode 1076839785 bad hash table for directory inode 1076839786 (no data entry): rebuilding rebuilding directory inode 1076839786 bad hash table for directory inode 1076839792 (no data entry): rebuilding rebuilding directory inode 1076839792 bad hash table for directory inode 1076839794 (no data entry): rebuilding rebuilding directory inode 1076839794 bad hash table for directory inode 1076839795 (no data entry): rebuilding rebuilding directory inode 1076839795 bad hash table for directory inode 1076839823 (no data entry): rebuilding rebuilding directory inode 1076839823 bad hash table for directory inode 1076839825 (no data entry): rebuilding rebuilding directory inode 1076839825 bad hash table for directory inode 1076839826 (no data entry): rebuilding rebuilding directory inode 1076839826 bad hash table for directory inode 1076839828 (no data entry): rebuilding rebuilding directory inode 1076839828 bad hash table for directory inode 1076839831 (no data entry): rebuilding rebuilding directory inode 1076839831 bad hash table for directory inode 1076839835 (no data entry): rebuilding rebuilding directory inode 1076839835 bad hash table for directory inode 1076839836 (no data 
entry): rebuilding rebuilding directory inode 1076839836 bad hash table for directory inode 1076839837 (no data entry): rebuilding rebuilding directory inode 1076839837 bad hash table for directory inode 1076839838 (no data entry): rebuilding rebuilding directory inode 1076839838 bad hash table for directory inode 1076848162 (no data entry): rebuilding rebuilding directory inode 1076848162 bad hash table for directory inode 1076848202 (no data entry): rebuilding rebuilding directory inode 1076848202 bad hash table for directory inode 1076848204 (no data entry): rebuilding rebuilding directory inode 1076848204 bad hash table for directory inode 1076848221 (no data entry): rebuilding rebuilding directory inode 1076848221 bad hash table for directory inode 1076848222 (no data entry): rebuilding rebuilding directory inode 1076848222 bad hash table for directory inode 1076853088 (no data entry): rebuilding rebuilding directory inode 1076853088 bad hash table for directory inode 1076853089 (no data entry): rebuilding rebuilding directory inode 1076853089 bad hash table for directory inode 1076853091 (no data entry): rebuilding rebuilding directory inode 1076853091 bad hash table for directory inode 1076853114 (no data entry): rebuilding rebuilding directory inode 1076853114 bad hash table for directory inode 1076853126 (no data entry): rebuilding rebuilding directory inode 1076853126 bad hash table for directory inode 1076853127 (no data entry): rebuilding rebuilding directory inode 1076853127 bad hash table for directory inode 1076853133 (no data entry): rebuilding rebuilding directory inode 1076853133 bad hash table for directory inode 1076853134 (no data entry): rebuilding rebuilding directory inode 1076853134 bad hash table for directory inode 1076853135 (no data entry): rebuilding rebuilding directory inode 1076853135 bad hash table for directory inode 1076853136 (no data entry): rebuilding rebuilding directory inode 1076853136 bad hash table for directory inode 
1076853138 (no data entry): rebuilding rebuilding directory inode 1076853138 bad hash table for directory inode 1076853139 (no data entry): rebuilding rebuilding directory inode 1076853139 bad hash table for directory inode 1076854593 (no data entry): rebuilding rebuilding directory inode 1076854593 bad hash table for directory inode 1076854603 (no data entry): rebuilding rebuilding directory inode 1076854603 bad hash table for directory inode 1076854604 (no data entry): rebuilding rebuilding directory inode 1076854604 bad hash table for directory inode 1076854610 (no data entry): rebuilding rebuilding directory inode 1076854610 bad hash table for directory inode 1076854621 (no data entry): rebuilding rebuilding directory inode 1076854621 bad hash table for directory inode 1076854628 (no data entry): rebuilding rebuilding directory inode 1076854628 bad hash table for directory inode 1076854639 (no data entry): rebuilding rebuilding directory inode 1076854639 bad hash table for directory inode 1076854640 (no data entry): rebuilding rebuilding directory inode 1076854640 bad hash table for directory inode 1076854643 (no data entry): rebuilding rebuilding directory inode 1076854643 bad hash table for directory inode 1076854644 (no data entry): rebuilding rebuilding directory inode 1076854644 bad hash table for directory inode 1077602230 (no data entry): rebuilding rebuilding directory inode 1077602230 bad hash table for directory inode 1077652948 (no data entry): rebuilding rebuilding directory inode 1077652948 bad hash table for directory inode 1098465511 (no data entry): rebuilding rebuilding directory inode 1098465511 bad hash table for directory inode 1098465515 (no data entry): rebuilding rebuilding directory inode 1098465515 bad hash table for directory inode 1098465519 (no data entry): rebuilding rebuilding directory inode 1098465519 bad hash table for directory inode 1098465523 (no data entry): rebuilding rebuilding directory inode 1098465523 bad hash table for 
directory inode 1098465524 (no data entry): rebuilding
rebuilding directory inode 1098465524
bad hash table for directory inode 1098465528 (no data entry): rebuilding
rebuilding directory inode 1098465528
bad hash table for directory inode 1098465529 (no data entry): rebuilding
rebuilding directory inode 1098465529
bad hash table for directory inode 1098465537 (no data entry): rebuilding
rebuilding directory inode 1098465537
- traversal finished ...
- moving disconnected inodes to lost+found ...
disconnected dir inode 424256, moving to lost+found
disconnected inode 424257, moving to lost+found
disconnected inode 424258, moving to lost+found
disconnected inode 5667475, moving to lost+found
disconnected dir inode 125005217, moving to lost+found
disconnected dir inode 125005218, moving to lost+found
disconnected dir inode 125005219, moving to lost+found
disconnected dir inode 125005221, moving to lost+found
disconnected dir inode 125005222, moving to lost+found
disconnected dir inode 125005223, moving to lost+found
disconnected dir inode 125005224, moving to lost+found
disconnected dir inode 125005225, moving to lost+found
disconnected dir inode 125005226, moving to lost+found
disconnected dir inode 125005228, moving to lost+found
disconnected dir inode 125005229, moving to lost+found
disconnected dir inode 125005230, moving to lost+found
disconnected dir inode 125005231, moving to lost+found
disconnected dir inode 125005232, moving to lost+found
disconnected dir inode 125005233, moving to lost+found
disconnected dir inode 125005236, moving to lost+found
disconnected dir inode 125005237, moving to lost+found
disconnected dir inode 125005238, moving to lost+found
disconnected dir inode 125005239, moving to lost+found
disconnected dir inode 125005240, moving to lost+found
disconnected dir inode 125005241, moving to lost+found
disconnected dir inode 125005242, moving to lost+found
disconnected dir inode 125005243, moving to lost+found
disconnected dir inode 125005244,
moving to lost+found disconnected dir inode 125005245, moving to lost+found disconnected dir inode 125005247, moving to lost+found disconnected dir inode 125005248, moving to lost+found disconnected dir inode 125005249, moving to lost+found disconnected dir inode 125005250, moving to lost+found disconnected dir inode 125005251, moving to lost+found disconnected dir inode 125005252, moving to lost+found disconnected dir inode 125005253, moving to lost+found disconnected dir inode 125005254, moving to lost+found disconnected dir inode 125005255, moving to lost+found disconnected dir inode 125005256, moving to lost+found disconnected dir inode 125005257, moving to lost+found disconnected dir inode 125005259, moving to lost+found disconnected dir inode 125005260, moving to lost+found disconnected dir inode 125005262, moving to lost+found disconnected dir inode 125005263, moving to lost+found disconnected dir inode 125005264, moving to lost+found disconnected dir inode 125005265, moving to lost+found disconnected dir inode 125005266, moving to lost+found disconnected dir inode 125005267, moving to lost+found disconnected dir inode 125005268, moving to lost+found disconnected dir inode 125005269, moving to lost+found disconnected dir inode 125005270, moving to lost+found disconnected dir inode 125005271, moving to lost+found disconnected dir inode 125005272, moving to lost+found disconnected dir inode 125005273, moving to lost+found disconnected dir inode 125005274, moving to lost+found disconnected dir inode 125005275, moving to lost+found disconnected dir inode 125005276, moving to lost+found disconnected dir inode 125005277, moving to lost+found disconnected dir inode 125005278, moving to lost+found disconnected dir inode 125005279, moving to lost+found disconnected dir inode 125005568, moving to lost+found disconnected dir inode 125005569, moving to lost+found disconnected dir inode 125005571, moving to lost+found disconnected dir inode 125005572, moving to 
lost+found disconnected dir inode 125005573, moving to lost+found disconnected dir inode 125005574, moving to lost+found disconnected dir inode 125005575, moving to lost+found disconnected dir inode 125005576, moving to lost+found disconnected dir inode 125005577, moving to lost+found disconnected dir inode 125016316, moving to lost+found disconnected dir inode 125660420, moving to lost+found disconnected dir inode 125685326, moving to lost+found disconnected dir inode 125685351, moving to lost+found disconnected inode 125908558, moving to lost+found disconnected inode 125908559, moving to lost+found disconnected inode 125908560, moving to lost+found disconnected dir inode 126090218, moving to lost+found disconnected dir inode 126090220, moving to lost+found disconnected dir inode 126090224, moving to lost+found disconnected dir inode 130503573, moving to lost+found disconnected dir inode 164462811, moving to lost+found disconnected dir inode 164462812, moving to lost+found disconnected dir inode 164462813, moving to lost+found disconnected dir inode 164462814, moving to lost+found disconnected dir inode 164462815, moving to lost+found disconnected dir inode 164462816, moving to lost+found disconnected dir inode 164462817, moving to lost+found disconnected dir inode 164462818, moving to lost+found disconnected dir inode 164462819, moving to lost+found disconnected dir inode 164462820, moving to lost+found disconnected dir inode 164462821, moving to lost+found disconnected dir inode 164462822, moving to lost+found disconnected dir inode 164462823, moving to lost+found disconnected dir inode 164462824, moving to lost+found disconnected dir inode 164462825, moving to lost+found disconnected dir inode 164462826, moving to lost+found disconnected dir inode 190881713, moving to lost+found disconnected dir inode 240441693, moving to lost+found disconnected dir inode 1075051900, moving to lost+found disconnected dir inode 1075051901, moving to lost+found disconnected dir 
inode 1075051902, moving to lost+found disconnected dir inode 1075051903, moving to lost+found disconnected dir inode 1075051908, moving to lost+found disconnected dir inode 1075051909, moving to lost+found disconnected dir inode 1075051910, moving to lost+found disconnected dir inode 1075051912, moving to lost+found disconnected dir inode 1075051913, moving to lost+found disconnected dir inode 1075051915, moving to lost+found disconnected dir inode 1075051916, moving to lost+found disconnected dir inode 1075051917, moving to lost+found disconnected dir inode 1075051918, moving to lost+found disconnected dir inode 1075051919, moving to lost+found disconnected dir inode 1075051920, moving to lost+found disconnected dir inode 1075051921, moving to lost+found disconnected dir inode 1075051922, moving to lost+found disconnected dir inode 1075051923, moving to lost+found disconnected dir inode 1075051930, moving to lost+found disconnected dir inode 1075051931, moving to lost+found disconnected dir inode 1075051932, moving to lost+found disconnected dir inode 1075051933, moving to lost+found disconnected dir inode 1075051934, moving to lost+found disconnected dir inode 1075117888, moving to lost+found disconnected dir inode 1075117889, moving to lost+found disconnected dir inode 1075117890, moving to lost+found disconnected dir inode 1075117891, moving to lost+found disconnected dir inode 1075117896, moving to lost+found disconnected dir inode 1075117897, moving to lost+found disconnected dir inode 1075117902, moving to lost+found disconnected dir inode 1075117903, moving to lost+found disconnected dir inode 1075117907, moving to lost+found disconnected dir inode 1075117908, moving to lost+found disconnected dir inode 1075117939, moving to lost+found disconnected dir inode 1075244397, moving to lost+found disconnected dir inode 1075244399, moving to lost+found disconnected dir inode 1075244400, moving to lost+found disconnected dir inode 1075244401, moving to lost+found 
disconnected dir inode 1075244402, moving to lost+found disconnected dir inode 1076383229, moving to lost+found disconnected dir inode 1076383230, moving to lost+found disconnected dir inode 1076797385, moving to lost+found disconnected dir inode 1076797389, moving to lost+found disconnected dir inode 1076797390, moving to lost+found disconnected dir inode 1076797391, moving to lost+found disconnected dir inode 1076797393, moving to lost+found disconnected dir inode 1076797395, moving to lost+found disconnected dir inode 1076797399, moving to lost+found disconnected dir inode 1076797407, moving to lost+found disconnected dir inode 1076797408, moving to lost+found disconnected dir inode 1076797410, moving to lost+found disconnected dir inode 1076797411, moving to lost+found disconnected dir inode 1076797412, moving to lost+found disconnected dir inode 1076797413, moving to lost+found disconnected dir inode 1076797417, moving to lost+found disconnected dir inode 1076797418, moving to lost+found disconnected dir inode 1076797420, moving to lost+found disconnected dir inode 1076797421, moving to lost+found disconnected dir inode 1076797422, moving to lost+found disconnected dir inode 1076797423, moving to lost+found disconnected dir inode 1076797424, moving to lost+found disconnected dir inode 1076797425, moving to lost+found disconnected dir inode 1076797426, moving to lost+found disconnected dir inode 1076797427, moving to lost+found disconnected dir inode 1076797428, moving to lost+found disconnected dir inode 1076797429, moving to lost+found disconnected dir inode 1076797433, moving to lost+found disconnected dir inode 1076797437, moving to lost+found disconnected dir inode 1076797438, moving to lost+found disconnected dir inode 1076797440, moving to lost+found disconnected dir inode 1076797442, moving to lost+found disconnected dir inode 1076797445, moving to lost+found disconnected dir inode 1076797446, moving to lost+found disconnected dir inode 1076797447, 
moving to lost+found disconnected dir inode 1076797451, moving to lost+found disconnected dir inode 1076797455, moving to lost+found disconnected dir inode 1076797459, moving to lost+found disconnected dir inode 1076797461, moving to lost+found disconnected dir inode 1076797462, moving to lost+found disconnected dir inode 1076797463, moving to lost+found disconnected dir inode 1076797464, moving to lost+found disconnected dir inode 1076797465, moving to lost+found disconnected dir inode 1076797466, moving to lost+found disconnected dir inode 1076797467, moving to lost+found disconnected dir inode 1076797468, moving to lost+found disconnected dir inode 1076797469, moving to lost+found disconnected dir inode 1076797470, moving to lost+found disconnected dir inode 1076797471, moving to lost+found disconnected dir inode 1076797472, moving to lost+found disconnected dir inode 1076797473, moving to lost+found disconnected dir inode 1076797474, moving to lost+found disconnected dir inode 1076797479, moving to lost+found disconnected dir inode 1076797481, moving to lost+found disconnected dir inode 1076797488, moving to lost+found disconnected dir inode 1076797490, moving to lost+found disconnected dir inode 1076797491, moving to lost+found disconnected dir inode 1076797492, moving to lost+found disconnected dir inode 1076797493, moving to lost+found disconnected dir inode 1076797494, moving to lost+found disconnected dir inode 1076797495, moving to lost+found disconnected dir inode 1076797496, moving to lost+found disconnected dir inode 1076797497, moving to lost+found disconnected dir inode 1076797498, moving to lost+found disconnected dir inode 1076797500, moving to lost+found disconnected dir inode 1076797502, moving to lost+found disconnected dir inode 1076797503, moving to lost+found disconnected dir inode 1076805248, moving to lost+found disconnected dir inode 1076805249, moving to lost+found disconnected dir inode 1076805254, moving to lost+found disconnected dir 
inode 1076805255, moving to lost+found disconnected dir inode 1076805256, moving to lost+found disconnected dir inode 1076805257, moving to lost+found disconnected dir inode 1076805259, moving to lost+found disconnected dir inode 1076805260, moving to lost+found disconnected dir inode 1076805261, moving to lost+found disconnected dir inode 1076805262, moving to lost+found disconnected dir inode 1076805263, moving to lost+found disconnected dir inode 1076805266, moving to lost+found disconnected dir inode 1076805267, moving to lost+found disconnected dir inode 1076805268, moving to lost+found disconnected dir inode 1076805269, moving to lost+found disconnected dir inode 1076805270, moving to lost+found disconnected dir inode 1076805272, moving to lost+found disconnected dir inode 1076805273, moving to lost+found disconnected dir inode 1076805274, moving to lost+found disconnected dir inode 1076805276, moving to lost+found disconnected dir inode 1076805277, moving to lost+found disconnected dir inode 1076805278, moving to lost+found disconnected dir inode 1076805279, moving to lost+found disconnected dir inode 1076805280, moving to lost+found disconnected dir inode 1076805281, moving to lost+found disconnected dir inode 1076805286, moving to lost+found disconnected dir inode 1076805288, moving to lost+found disconnected dir inode 1076805293, moving to lost+found disconnected dir inode 1076805294, moving to lost+found disconnected dir inode 1076805295, moving to lost+found disconnected dir inode 1076805296, moving to lost+found disconnected dir inode 1076805297, moving to lost+found disconnected dir inode 1076805298, moving to lost+found disconnected dir inode 1076805299, moving to lost+found disconnected dir inode 1076805300, moving to lost+found disconnected dir inode 1076805301, moving to lost+found disconnected dir inode 1076805307, moving to lost+found disconnected dir inode 1076805308, moving to lost+found disconnected dir inode 1076809126, moving to lost+found 
disconnected dir inode 1076809135, moving to lost+found disconnected dir inode 1076809136, moving to lost+found disconnected dir inode 1076809137, moving to lost+found disconnected dir inode 1076809141, moving to lost+found disconnected dir inode 1076809143, moving to lost+found disconnected dir inode 1076809144, moving to lost+found disconnected dir inode 1076809145, moving to lost+found disconnected dir inode 1076809146, moving to lost+found disconnected dir inode 1076809147, moving to lost+found disconnected dir inode 1076809148, moving to lost+found disconnected dir inode 1076809149, moving to lost+found disconnected dir inode 1076809150, moving to lost+found disconnected dir inode 1076809158, moving to lost+found disconnected dir inode 1076809162, moving to lost+found disconnected dir inode 1076809163, moving to lost+found disconnected dir inode 1076809164, moving to lost+found disconnected dir inode 1076809165, moving to lost+found disconnected dir inode 1076809166, moving to lost+found disconnected dir inode 1076809167, moving to lost+found disconnected dir inode 1076809168, moving to lost+found disconnected dir inode 1076809169, moving to lost+found disconnected dir inode 1076809170, moving to lost+found disconnected dir inode 1076809171, moving to lost+found disconnected dir inode 1076809172, moving to lost+found disconnected dir inode 1076809173, moving to lost+found disconnected dir inode 1076809174, moving to lost+found disconnected dir inode 1076809175, moving to lost+found disconnected dir inode 1076809176, moving to lost+found disconnected dir inode 1076809177, moving to lost+found disconnected dir inode 1076809178, moving to lost+found disconnected dir inode 1076809179, moving to lost+found disconnected dir inode 1076809180, moving to lost+found disconnected dir inode 1076809181, moving to lost+found disconnected dir inode 1076809182, moving to lost+found disconnected dir inode 1076809183, moving to lost+found disconnected dir inode 1076811200, 
moving to lost+found disconnected dir inode 1076811201, moving to lost+found disconnected dir inode 1076811202, moving to lost+found disconnected dir inode 1076811203, moving to lost+found disconnected dir inode 1076811204, moving to lost+found disconnected dir inode 1076811206, moving to lost+found disconnected dir inode 1076811210, moving to lost+found disconnected dir inode 1076811211, moving to lost+found disconnected dir inode 1076811212, moving to lost+found disconnected dir inode 1076811213, moving to lost+found disconnected dir inode 1076811214, moving to lost+found disconnected dir inode 1076811215, moving to lost+found disconnected dir inode 1076811216, moving to lost+found disconnected dir inode 1076811219, moving to lost+found disconnected dir inode 1076811220, moving to lost+found disconnected dir inode 1076811221, moving to lost+found disconnected dir inode 1076811224, moving to lost+found disconnected dir inode 1076811227, moving to lost+found disconnected dir inode 1076811228, moving to lost+found disconnected dir inode 1076811229, moving to lost+found disconnected dir inode 1076811233, moving to lost+found disconnected dir inode 1076811238, moving to lost+found disconnected dir inode 1076811239, moving to lost+found disconnected dir inode 1076811240, moving to lost+found disconnected dir inode 1076811242, moving to lost+found disconnected dir inode 1076811247, moving to lost+found disconnected dir inode 1076811248, moving to lost+found disconnected dir inode 1076811251, moving to lost+found disconnected dir inode 1076811252, moving to lost+found disconnected dir inode 1076811253, moving to lost+found disconnected dir inode 1076811254, moving to lost+found disconnected dir inode 1076811255, moving to lost+found disconnected dir inode 1076811256, moving to lost+found disconnected dir inode 1076811258, moving to lost+found disconnected dir inode 1076811259, moving to lost+found disconnected dir inode 1076811260, moving to lost+found disconnected dir 
inode 1076811261, moving to lost+found disconnected dir inode 1076811262, moving to lost+found disconnected dir inode 1076811263, moving to lost+found disconnected dir inode 1076818528, moving to lost+found disconnected dir inode 1076818529, moving to lost+found disconnected dir inode 1076818531, moving to lost+found disconnected dir inode 1076818532, moving to lost+found disconnected dir inode 1076818541, moving to lost+found disconnected dir inode 1076818542, moving to lost+found disconnected dir inode 1076818543, moving to lost+found disconnected dir inode 1076818544, moving to lost+found disconnected dir inode 1076818545, moving to lost+found disconnected dir inode 1076818546, moving to lost+found disconnected dir inode 1076818548, moving to lost+found disconnected dir inode 1076818549, moving to lost+found disconnected dir inode 1076818550, moving to lost+found disconnected dir inode 1076818554, moving to lost+found disconnected dir inode 1076818555, moving to lost+found disconnected dir inode 1076818556, moving to lost+found disconnected dir inode 1076818557, moving to lost+found disconnected dir inode 1076818558, moving to lost+found disconnected dir inode 1076818559, moving to lost+found disconnected dir inode 1076818561, moving to lost+found disconnected dir inode 1076818562, moving to lost+found disconnected dir inode 1076818563, moving to lost+found disconnected dir inode 1076818564, moving to lost+found disconnected dir inode 1076818565, moving to lost+found disconnected dir inode 1076818566, moving to lost+found disconnected dir inode 1076818567, moving to lost+found disconnected dir inode 1076818569, moving to lost+found disconnected dir inode 1076818570, moving to lost+found disconnected dir inode 1076818577, moving to lost+found disconnected dir inode 1076818578, moving to lost+found disconnected dir inode 1076818587, moving to lost+found disconnected dir inode 1076818588, moving to lost+found disconnected dir inode 1076818589, moving to lost+found 
disconnected dir inode 1076818590, moving to lost+found disconnected dir inode 1076818591, moving to lost+found disconnected dir inode 1076828510, moving to lost+found disconnected dir inode 1076828544, moving to lost+found disconnected dir inode 1076828545, moving to lost+found disconnected dir inode 1076828546, moving to lost+found disconnected dir inode 1076828547, moving to lost+found disconnected dir inode 1076828548, moving to lost+found disconnected dir inode 1076828549, moving to lost+found disconnected dir inode 1076828550, moving to lost+found disconnected dir inode 1076828551, moving to lost+found disconnected dir inode 1076828554, moving to lost+found disconnected dir inode 1076828555, moving to lost+found disconnected dir inode 1076828562, moving to lost+found disconnected dir inode 1076828563, moving to lost+found disconnected dir inode 1076828564, moving to lost+found disconnected dir inode 1076828565, moving to lost+found disconnected dir inode 1076828570, moving to lost+found disconnected dir inode 1076828571, moving to lost+found disconnected dir inode 1076828572, moving to lost+found disconnected dir inode 1076828573, moving to lost+found disconnected dir inode 1076828574, moving to lost+found disconnected dir inode 1076828575, moving to lost+found disconnected dir inode 1076828576, moving to lost+found disconnected dir inode 1076828577, moving to lost+found disconnected dir inode 1076828578, moving to lost+found disconnected dir inode 1076828579, moving to lost+found disconnected dir inode 1076828594, moving to lost+found disconnected dir inode 1076828595, moving to lost+found disconnected dir inode 1076828596, moving to lost+found disconnected dir inode 1076828597, moving to lost+found disconnected dir inode 1076828598, moving to lost+found disconnected dir inode 1076828599, moving to lost+found disconnected dir inode 1076828600, moving to lost+found disconnected dir inode 1076828601, moving to lost+found disconnected dir inode 1076828602, 
moving to lost+found disconnected dir inode 1076828603, moving to lost+found disconnected dir inode 1076828604, moving to lost+found disconnected dir inode 1076828605, moving to lost+found disconnected dir inode 1076828606, moving to lost+found disconnected dir inode 1076828607, moving to lost+found disconnected dir inode 1076839776, moving to lost+found disconnected dir inode 1076839777, moving to lost+found disconnected dir inode 1076839778, moving to lost+found disconnected dir inode 1076839779, moving to lost+found disconnected dir inode 1076839780, moving to lost+found disconnected dir inode 1076839782, moving to lost+found disconnected dir inode 1076839783, moving to lost+found disconnected dir inode 1076839784, moving to lost+found disconnected dir inode 1076839785, moving to lost+found disconnected dir inode 1076839786, moving to lost+found disconnected dir inode 1076839790, moving to lost+found disconnected dir inode 1076839791, moving to lost+found disconnected dir inode 1076839792, moving to lost+found disconnected dir inode 1076839793, moving to lost+found disconnected dir inode 1076839794, moving to lost+found disconnected dir inode 1076839795, moving to lost+found disconnected dir inode 1076839796, moving to lost+found disconnected dir inode 1076839797, moving to lost+found disconnected dir inode 1076839798, moving to lost+found disconnected dir inode 1076839799, moving to lost+found disconnected dir inode 1076839800, moving to lost+found disconnected dir inode 1076839801, moving to lost+found disconnected dir inode 1076839802, moving to lost+found disconnected dir inode 1076839803, moving to lost+found disconnected dir inode 1076839804, moving to lost+found disconnected dir inode 1076839805, moving to lost+found disconnected dir inode 1076839806, moving to lost+found disconnected dir inode 1076839807, moving to lost+found disconnected dir inode 1076839808, moving to lost+found disconnected dir inode 1076839809, moving to lost+found disconnected dir 
inode 1076839810, moving to lost+found disconnected dir inode 1076839812, moving to lost+found disconnected dir inode 1076839817, moving to lost+found disconnected dir inode 1076839818, moving to lost+found disconnected dir inode 1076839819, moving to lost+found disconnected dir inode 1076839823, moving to lost+found disconnected dir inode 1076839825, moving to lost+found disconnected dir inode 1076839826, moving to lost+found disconnected dir inode 1076839827, moving to lost+found disconnected dir inode 1076839828, moving to lost+found disconnected dir inode 1076839831, moving to lost+found disconnected dir inode 1076839833, moving to lost+found disconnected dir inode 1076839834, moving to lost+found disconnected dir inode 1076839835, moving to lost+found disconnected dir inode 1076839836, moving to lost+found disconnected dir inode 1076839837, moving to lost+found disconnected dir inode 1076839838, moving to lost+found disconnected dir inode 1076848162, moving to lost+found disconnected dir inode 1076848164, moving to lost+found disconnected dir inode 1076848169, moving to lost+found disconnected dir inode 1076848170, moving to lost+found disconnected dir inode 1076848171, moving to lost+found disconnected dir inode 1076848173, moving to lost+found disconnected dir inode 1076848174, moving to lost+found disconnected dir inode 1076848175, moving to lost+found disconnected dir inode 1076848176, moving to lost+found disconnected dir inode 1076848177, moving to lost+found disconnected dir inode 1076848178, moving to lost+found disconnected dir inode 1076848179, moving to lost+found disconnected dir inode 1076848182, moving to lost+found disconnected dir inode 1076848183, moving to lost+found disconnected dir inode 1076848184, moving to lost+found disconnected dir inode 1076848185, moving to lost+found disconnected dir inode 1076848186, moving to lost+found disconnected dir inode 1076848187, moving to lost+found disconnected dir inode 1076848194, moving to lost+found 
disconnected dir inode 1076848195, moving to lost+found disconnected dir inode 1076848196, moving to lost+found disconnected dir inode 1076848197, moving to lost+found disconnected dir inode 1076848198, moving to lost+found disconnected dir inode 1076848199, moving to lost+found disconnected dir inode 1076848200, moving to lost+found disconnected dir inode 1076848201, moving to lost+found disconnected dir inode 1076848202, moving to lost+found disconnected dir inode 1076848203, moving to lost+found disconnected dir inode 1076848204, moving to lost+found disconnected dir inode 1076848205, moving to lost+found disconnected dir inode 1076848206, moving to lost+found disconnected dir inode 1076848207, moving to lost+found disconnected dir inode 1076848208, moving to lost+found disconnected dir inode 1076848209, moving to lost+found disconnected dir inode 1076848210, moving to lost+found disconnected dir inode 1076848211, moving to lost+found disconnected dir inode 1076848212, moving to lost+found disconnected dir inode 1076848213, moving to lost+found disconnected dir inode 1076848214, moving to lost+found disconnected dir inode 1076848219, moving to lost+found disconnected dir inode 1076848221, moving to lost+found disconnected dir inode 1076848222, moving to lost+found disconnected dir inode 1076853088, moving to lost+found disconnected dir inode 1076853089, moving to lost+found disconnected dir inode 1076853091, moving to lost+found disconnected dir inode 1076853104, moving to lost+found disconnected dir inode 1076853105, moving to lost+found disconnected dir inode 1076853106, moving to lost+found disconnected dir inode 1076853107, moving to lost+found disconnected dir inode 1076853109, moving to lost+found disconnected dir inode 1076853110, moving to lost+found disconnected dir inode 1076853111, moving to lost+found disconnected dir inode 1076853114, moving to lost+found disconnected dir inode 1076853119, moving to lost+found disconnected dir inode 1076853126, 
moving to lost+found disconnected dir inode 1076853127, moving to lost+found disconnected dir inode 1076853128, moving to lost+found disconnected dir inode 1076853129, moving to lost+found disconnected dir inode 1076853130, moving to lost+found disconnected dir inode 1076853131, moving to lost+found disconnected dir inode 1076853132, moving to lost+found disconnected dir inode 1076853133, moving to lost+found disconnected dir inode 1076853134, moving to lost+found disconnected dir inode 1076853135, moving to lost+found disconnected dir inode 1076853136, moving to lost+found disconnected dir inode 1076853137, moving to lost+found disconnected dir inode 1076853138, moving to lost+found disconnected dir inode 1076853139, moving to lost+found disconnected dir inode 1076853140, moving to lost+found disconnected dir inode 1076853143, moving to lost+found disconnected dir inode 1076853144, moving to lost+found disconnected dir inode 1076853145, moving to lost+found disconnected dir inode 1076853146, moving to lost+found disconnected dir inode 1076853147, moving to lost+found disconnected dir inode 1076853148, moving to lost+found disconnected dir inode 1076853149, moving to lost+found disconnected dir inode 1076853150, moving to lost+found disconnected dir inode 1076853151, moving to lost+found disconnected dir inode 1076854592, moving to lost+found disconnected dir inode 1076854593, moving to lost+found disconnected dir inode 1076854594, moving to lost+found disconnected dir inode 1076854595, moving to lost+found disconnected dir inode 1076854596, moving to lost+found disconnected dir inode 1076854597, moving to lost+found disconnected dir inode 1076854598, moving to lost+found disconnected dir inode 1076854599, moving to lost+found disconnected dir inode 1076854600, moving to lost+found disconnected dir inode 1076854601, moving to lost+found disconnected dir inode 1076854602, moving to lost+found disconnected dir inode 1076854603, moving to lost+found disconnected dir 
inode 1076854604, moving to lost+found disconnected dir inode 1076854605, moving to lost+found disconnected dir inode 1076854606, moving to lost+found disconnected dir inode 1076854607, moving to lost+found disconnected dir inode 1076854608, moving to lost+found disconnected dir inode 1076854609, moving to lost+found disconnected dir inode 1076854610, moving to lost+found disconnected dir inode 1076854611, moving to lost+found disconnected dir inode 1076854612, moving to lost+found disconnected dir inode 1076854613, moving to lost+found disconnected dir inode 1076854614, moving to lost+found disconnected dir inode 1076854615, moving to lost+found disconnected dir inode 1076854616, moving to lost+found disconnected dir inode 1076854617, moving to lost+found disconnected dir inode 1076854619, moving to lost+found disconnected dir inode 1076854620, moving to lost+found disconnected dir inode 1076854621, moving to lost+found disconnected dir inode 1076854622, moving to lost+found disconnected dir inode 1076854623, moving to lost+found disconnected dir inode 1076854624, moving to lost+found disconnected dir inode 1076854625, moving to lost+found disconnected dir inode 1076854626, moving to lost+found disconnected dir inode 1076854627, moving to lost+found disconnected dir inode 1076854628, moving to lost+found disconnected dir inode 1076854630, moving to lost+found disconnected dir inode 1076854631, moving to lost+found disconnected dir inode 1076854632, moving to lost+found disconnected dir inode 1076854633, moving to lost+found disconnected dir inode 1076854634, moving to lost+found disconnected dir inode 1076854635, moving to lost+found disconnected dir inode 1076854636, moving to lost+found disconnected dir inode 1076854637, moving to lost+found disconnected dir inode 1076854638, moving to lost+found disconnected dir inode 1076854639, moving to lost+found disconnected dir inode 1076854640, moving to lost+found disconnected dir inode 1076854641, moving to lost+found 
disconnected dir inode 1076854642, moving to lost+found disconnected dir inode 1076854643, moving to lost+found disconnected dir inode 1076854644, moving to lost+found disconnected dir inode 1076854645, moving to lost+found disconnected dir inode 1076854646, moving to lost+found disconnected dir inode 1076854647, moving to lost+found disconnected dir inode 1076854648, moving to lost+found disconnected dir inode 1076854649, moving to lost+found disconnected dir inode 1076854650, moving to lost+found disconnected dir inode 1076854651, moving to lost+found disconnected dir inode 1076854652, moving to lost+found disconnected dir inode 1076854653, moving to lost+found disconnected dir inode 1076854654, moving to lost+found disconnected dir inode 1076854655, moving to lost+found disconnected dir inode 1076860256, moving to lost+found disconnected dir inode 1076860257, moving to lost+found disconnected dir inode 1076860258, moving to lost+found disconnected dir inode 1076860259, moving to lost+found disconnected dir inode 1076860260, moving to lost+found disconnected dir inode 1076860261, moving to lost+found disconnected dir inode 1076860262, moving to lost+found disconnected dir inode 1076860263, moving to lost+found disconnected dir inode 1076860264, moving to lost+found disconnected dir inode 1076860265, moving to lost+found disconnected dir inode 1076860266, moving to lost+found disconnected dir inode 1076860267, moving to lost+found disconnected dir inode 1076860268, moving to lost+found disconnected dir inode 1076860269, moving to lost+found disconnected dir inode 1076860270, moving to lost+found disconnected dir inode 1077602207, moving to lost+found disconnected dir inode 1077602228, moving to lost+found disconnected dir inode 1077602230, moving to lost+found disconnected dir inode 1077652948, moving to lost+found disconnected dir inode 1077652957, moving to lost+found disconnected dir inode 1077652960, moving to lost+found disconnected dir inode 1080545558, 
moving to lost+found disconnected dir inode 1092459354, moving to lost+found disconnected dir inode 1092503224, moving to lost+found disconnected dir inode 1092503243, moving to lost+found disconnected dir inode 1098129057, moving to lost+found disconnected dir inode 1098129058, moving to lost+found disconnected dir inode 1098129059, moving to lost+found disconnected dir inode 1098129060, moving to lost+found disconnected dir inode 1098129061, moving to lost+found disconnected dir inode 1098129062, moving to lost+found disconnected dir inode 1098129063, moving to lost+found disconnected dir inode 1098129064, moving to lost+found disconnected dir inode 1098129065, moving to lost+found disconnected dir inode 1098129066, moving to lost+found disconnected dir inode 1098129067, moving to lost+found disconnected dir inode 1098129068, moving to lost+found disconnected dir inode 1098129069, moving to lost+found disconnected dir inode 1098129070, moving to lost+found disconnected dir inode 1098129071, moving to lost+found disconnected dir inode 1098129072, moving to lost+found disconnected dir inode 1098465506, moving to lost+found disconnected dir inode 1098465507, moving to lost+found disconnected dir inode 1098465510, moving to lost+found disconnected dir inode 1098465511, moving to lost+found disconnected dir inode 1098465515, moving to lost+found disconnected dir inode 1098465519, moving to lost+found disconnected dir inode 1098465523, moving to lost+found disconnected dir inode 1098465524, moving to lost+found disconnected dir inode 1098465528, moving to lost+found disconnected dir inode 1098465529, moving to lost+found disconnected dir inode 1098465530, moving to lost+found disconnected dir inode 1098465536, moving to lost+found disconnected dir inode 1098465537, moving to lost+found disconnected dir inode 1098465543, moving to lost+found disconnected dir inode 1098465544, moving to lost+found disconnected dir inode 1098465545, moving to lost+found disconnected dir 
inode 1098465547, moving to lost+found
disconnected dir inode 1098465548, moving to lost+found
disconnected dir inode 1098465549, moving to lost+found
disconnected dir inode 1098465550, moving to lost+found
disconnected dir inode 1098465552, moving to lost+found
disconnected dir inode 1147256007, moving to lost+found
Phase 7 - verify and correct link counts...
resetting inode 128 nlinks from 2 to 3
Note - quota info will be regenerated on next quota mount.
done

--------------040805010700020206070201
Content-Type: text/plain; name="xfs_check.log"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="xfs_check.log"

cache_node_purge: refcount was 1, not zero (node=0x820010)
xfs_check: cannot read root inode (117)
cache_node_purge: refcount was 1, not zero (node=0x8226b0)
xfs_check: cannot read realtime bitmap inode (117)
block 0/8 expected type unknown got log
block 0/9 expected type unknown got log
block 0/10 expected type unknown got log
block 0/11 expected type unknown got log
bad magic number 0xfeed for inode 128
bad magic number 0 for inode 129
bad magic number 0xfeed for inode 130
bad magic number 0 for inode 131
bad magic number 0xfeed for inode 132
bad magic number 0 for inode 133
bad magic number 0xfeed for inode 134
bad magic number 0 for inode 135
bad magic number 0xfeed for inode 136
bad magic number 0 for inode 137
bad magic number 0xfeed for inode 138
bad magic number 0 for inode 139
bad magic number 0xfeed for inode 140
bad magic number 0 for inode 141
bad magic number 0xfeed for inode 142
bad magic number 0 for inode 143
bad magic number 0xfeed for inode 144
bad magic number 0 for inode 145
bad magic number 0xfeed for inode 146
bad magic number 0 for inode 147
bad magic number 0xfeed for inode 148
bad magic number 0 for inode 149
bad magic number 0xfeed for inode 150
bad magic number 0 for inode 151
bad magic number 0xfeed for inode 152
bad magic number 0 for inode 153
bad magic number 0xfeed for inode 154
bad magic number 0 for
inode 155
bad magic number 0xfeed for inode 156
bad magic number 0 for inode 157
bad magic number 0xfeed for inode 158
bad magic number 0 for inode 159
bad magic number 0xfeed for inode 160
bad magic number 0 for inode 161
bad magic number 0xfeed for inode 162
bad magic number 0 for inode 163
bad magic number 0xfeed for inode 164
bad magic number 0 for inode 165
bad magic number 0xfeed for inode 166
bad magic number 0 for inode 167
bad magic number 0xfeed for inode 168
bad magic number 0 for inode 169
bad magic number 0xfeed for inode 170
bad magic number 0 for inode 171
bad magic number 0xfeed for inode 172
bad magic number 0 for inode 173
bad magic number 0xfeed for inode 174
bad magic number 0 for inode 175
bad magic number 0xfeed for inode 176
bad magic number 0 for inode 177
bad magic number 0xfeed for inode 178
bad magic number 0 for inode 179
bad magic number 0xfeed for inode 180
bad magic number 0 for inode 181
bad magic number 0xfeed for inode 182
bad magic number 0 for inode 183
bad magic number 0xfeed for inode 184
bad magic number 0 for inode 185
bad magic number 0xfeed for inode 186
bad magic number 0 for inode 187
bad magic number 0xfeed for inode 188
bad magic number 0 for inode 189
bad magic number 0xfeed for inode 190
bad magic number 0 for inode 191
root inode 128 is not a directory
block 0/389453 type unknown not expected
block 0/389457 type unknown not expected
block 0/389458 type unknown not expected
block 0/481472 type unknown not expected
block 0/8778188 type unknown not expected
block 0/8778189 type unknown not expected
block 0/8778190 type unknown not expected
block 0/17162349 type unknown not expected
allocated inode 128 has 0 link count
allocated inode 129 has 0 link count
allocated inode 130 has 0 link count
link count mismatch for inode 131 (name ?), nlink 0, counted 635
link count mismatch for inode 190881713 (name ?), nlink 2, counted 1
disconnected inode 5667475, nlink 1
link count mismatch for inode 240441693 (name ?), nlink 2,
counted 1 disconnected inode 125908558, nlink 1 disconnected inode 125908559, nlink 1 disconnected inode 125908560, nlink 1 link count mismatch for inode 125005217 (name ?), nlink 2, counted 1 link count mismatch for inode 125005218 (name ?), nlink 2, counted 1 link count mismatch for inode 125005219 (name ?), nlink 2, counted 1 link count mismatch for inode 125005221 (name ?), nlink 2, counted 1 link count mismatch for inode 125005222 (name ?), nlink 2, counted 1 link count mismatch for inode 125005223 (name ?), nlink 2, counted 1 link count mismatch for inode 125005224 (name ?), nlink 2, counted 1 link count mismatch for inode 125005225 (name ?), nlink 2, counted 1 link count mismatch for inode 125005226 (name ?), nlink 2, counted 1 link count mismatch for inode 125005228 (name ?), nlink 2, counted 1 link count mismatch for inode 125005229 (name ?), nlink 2, counted 1 link count mismatch for inode 125005230 (name ?), nlink 2, counted 1 link count mismatch for inode 125005231 (name ?), nlink 2, counted 1 link count mismatch for inode 125005232 (name ?), nlink 2, counted 1 link count mismatch for inode 125005233 (name ?), nlink 2, counted 1 link count mismatch for inode 125005236 (name ?), nlink 2, counted 1 link count mismatch for inode 125005237 (name ?), nlink 2, counted 1 link count mismatch for inode 125005238 (name ?), nlink 2, counted 1 link count mismatch for inode 125005239 (name ?), nlink 2, counted 1 link count mismatch for inode 125005240 (name ?), nlink 2, counted 1 link count mismatch for inode 125005241 (name ?), nlink 3, counted 2 link count mismatch for inode 125005242 (name ?), nlink 2, counted 1 link count mismatch for inode 125005243 (name ?), nlink 2, counted 1 link count mismatch for inode 125005244 (name ?), nlink 2, counted 1 link count mismatch for inode 125005245 (name ?), nlink 2, counted 1 link count mismatch for inode 125005247 (name ?), nlink 2, counted 1 link count mismatch for inode 125005248 (name ?), nlink 2, counted 1 link count 
mismatch for inode 125005249 (name ?), nlink 2, counted 1 link count mismatch for inode 125005250 (name ?), nlink 2, counted 1 link count mismatch for inode 125005251 (name ?), nlink 2, counted 1 link count mismatch for inode 125005252 (name ?), nlink 2, counted 1 link count mismatch for inode 125005253 (name ?), nlink 2, counted 1 link count mismatch for inode 125005254 (name ?), nlink 2, counted 1 link count mismatch for inode 125005255 (name ?), nlink 2, counted 1 link count mismatch for inode 125005256 (name ?), nlink 2, counted 1 link count mismatch for inode 125005257 (name ?), nlink 2, counted 1 link count mismatch for inode 125005259 (name ?), nlink 2, counted 1 link count mismatch for inode 125005260 (name ?), nlink 2, counted 1 link count mismatch for inode 125005262 (name ?), nlink 2, counted 1 link count mismatch for inode 125005263 (name ?), nlink 2, counted 1 link count mismatch for inode 125005264 (name ?), nlink 2, counted 1 link count mismatch for inode 125005265 (name ?), nlink 2, counted 1 link count mismatch for inode 125005266 (name ?), nlink 2, counted 1 link count mismatch for inode 125005267 (name ?), nlink 2, counted 1 link count mismatch for inode 125005268 (name ?), nlink 2, counted 1 link count mismatch for inode 125005269 (name ?), nlink 2, counted 1 link count mismatch for inode 125005270 (name ?), nlink 2, counted 1 link count mismatch for inode 125005271 (name ?), nlink 2, counted 1 link count mismatch for inode 125005272 (name ?), nlink 2, counted 1 link count mismatch for inode 125005273 (name ?), nlink 2, counted 1 link count mismatch for inode 125005274 (name ?), nlink 2, counted 1 link count mismatch for inode 125005275 (name ?), nlink 2, counted 1 link count mismatch for inode 125005276 (name ?), nlink 2, counted 1 link count mismatch for inode 125005277 (name ?), nlink 2, counted 1 link count mismatch for inode 125005278 (name ?), nlink 2, counted 1 link count mismatch for inode 125005279 (name ?), nlink 2, counted 1 link 
count mismatch for inode 125005568 (name ?), nlink 2, counted 1 link count mismatch for inode 125005569 (name ?), nlink 2, counted 1 link count mismatch for inode 125005571 (name ?), nlink 2, counted 1 link count mismatch for inode 125005572 (name ?), nlink 2, counted 1 link count mismatch for inode 125005573 (name ?), nlink 2, counted 1 link count mismatch for inode 125005574 (name ?), nlink 2, counted 1 link count mismatch for inode 125005575 (name ?), nlink 2, counted 1 link count mismatch for inode 125005576 (name ?), nlink 2, counted 1 link count mismatch for inode 125005577 (name ?), nlink 13, counted 12 link count mismatch for inode 125685326 (name ?), nlink 2, counted 1 link count mismatch for inode 125685351 (name ?), nlink 8, counted 7 link count mismatch for inode 424256 (name ?), nlink 21, counted 20 disconnected inode 424257, nlink 1 disconnected inode 424258, nlink 1 link count mismatch for inode 125016316 (name ?), nlink 6, counted 5 link count mismatch for inode 126090218 (name ?), nlink 2, counted 1 link count mismatch for inode 126090220 (name ?), nlink 4, counted 3 link count mismatch for inode 126090224 (name ?), nlink 3, counted 2 link count mismatch for inode 130503573 (name ?), nlink 5, counted 4 link count mismatch for inode 164462811 (name ?), nlink 2, counted 1 link count mismatch for inode 164462812 (name ?), nlink 3, counted 2 link count mismatch for inode 164462813 (name ?), nlink 2, counted 1 link count mismatch for inode 164462814 (name ?), nlink 2, counted 1 link count mismatch for inode 164462815 (name ?), nlink 2, counted 1 link count mismatch for inode 164462816 (name ?), nlink 2, counted 1 link count mismatch for inode 164462817 (name ?), nlink 2, counted 1 link count mismatch for inode 164462818 (name ?), nlink 5, counted 4 link count mismatch for inode 164462819 (name ?), nlink 2, counted 1 link count mismatch for inode 164462820 (name ?), nlink 2, counted 1 link count mismatch for inode 164462821 (name ?), nlink 2, counted 1 
link count mismatch for inode 164462822 (name ?), nlink 2, counted 1
link count mismatch for inode 164462823 (name ?), nlink 2, counted 1
link count mismatch for inode 164462824 (name ?), nlink 2, counted 1
link count mismatch for inode 164462825 (name ?), nlink 2, counted 1
link count mismatch for inode 164462826 (name ?), nlink 2, counted 1
link count mismatch for inode 125660420 (name ?), nlink 2, counted 1
link count mismatch for inode 1092503224 (name ?), nlink 2, counted 1
link count mismatch for inode 1092503243 (name ?), nlink 5, counted 4
link count mismatch for inode 1147256007 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853088 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853089 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853091 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853104 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853105 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853106 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853107 (name ?), nlink 3, counted 2
link count mismatch for inode 1076853109 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853110 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853111 (name ?), nlink 3, counted 2
link count mismatch for inode 1076853114 (name ?), nlink 6, counted 5
link count mismatch for inode 1076853119 (name ?), nlink 8, counted 7
link count mismatch for inode 1076853126 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853127 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853128 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853129 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853130 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853131 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853132 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853133 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853134 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853135 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853136 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853137 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853138 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853139 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853140 (name ?), nlink 4, counted 3
link count mismatch for inode 1076853143 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853144 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853145 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853146 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853147 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853148 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853149 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853150 (name ?), nlink 2, counted 1
link count mismatch for inode 1076853151 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797385 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797389 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797390 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797391 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797393 (name ?), nlink 3, counted 2
link count mismatch for inode 1076797395 (name ?), nlink 5, counted 4
link count mismatch for inode 1076797399 (name ?), nlink 3, counted 2
link count mismatch for inode 1076797407 (name ?), nlink 5, counted 4
link count mismatch for inode 1076797408 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797410 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797411 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797412 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797413 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797417 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797418 (name ?), nlink 3, counted 2
link count mismatch for inode 1076797420 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797421 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797422 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797423 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797424 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797425 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797426 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797427 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797428 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797429 (name ?), nlink 5, counted 4
link count mismatch for inode 1076797433 (name ?), nlink 5, counted 4
link count mismatch for inode 1076797437 (name ?), nlink 3, counted 2
link count mismatch for inode 1076797438 (name ?), nlink 3, counted 2
link count mismatch for inode 1076797440 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797442 (name ?), nlink 4, counted 3
link count mismatch for inode 1076797445 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797446 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797447 (name ?), nlink 5, counted 4
link count mismatch for inode 1076797451 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797455 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797459 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797461 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797462 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797463 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797464 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797465 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797466 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797467 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797468 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797469 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797470 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797471 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797472 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797473 (name ?), nlink 6, counted 5
link count mismatch for inode 1076797474 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797479 (name ?), nlink 3, counted 2
link count mismatch for inode 1076797481 (name ?), nlink 7, counted 6
link count mismatch for inode 1076797488 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797490 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797491 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797492 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797493 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797494 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797495 (name ?), nlink 7, counted 6
link count mismatch for inode 1076797496 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797497 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797498 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797500 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797502 (name ?), nlink 2, counted 1
link count mismatch for inode 1076797503 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854592 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854593 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854594 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854595 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854596 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854597 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854598 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854599 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854600 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854601 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854602 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854603 (name ?), nlink 3, counted 2
link count mismatch for inode 1076854604 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854605 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854606 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854607 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854608 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854609 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854610 (name ?), nlink 3, counted 2
link count mismatch for inode 1076854611 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854612 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854613 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854614 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854615 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854616 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854617 (name ?), nlink 5, counted 4
link count mismatch for inode 1076854619 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854620 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854621 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854622 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854623 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854624 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854625 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854626 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854627 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854628 (name ?), nlink 4, counted 3
link count mismatch for inode 1076854630 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854631 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854632 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854633 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854634 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854635 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854636 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854637 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854638 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854639 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854640 (name ?), nlink 3, counted 2
link count mismatch for inode 1076854641 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854642 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854643 (name ?), nlink 3, counted 2
link count mismatch for inode 1076854644 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854645 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854646 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854647 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854648 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854649 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854650 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854651 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854652 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854653 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854654 (name ?), nlink 2, counted 1
link count mismatch for inode 1076854655 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051900 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051901 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051902 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051903 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051908 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051909 (name ?), nlink 3, counted 2
link count mismatch for inode 1075051910 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051912 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051913 (name ?), nlink 3, counted 2
link count mismatch for inode 1075051915 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051916 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051917 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051918 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051919 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051920 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051921 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051922 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051923 (name ?), nlink 14, counted 13
link count mismatch for inode 1075051930 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051931 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051932 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051933 (name ?), nlink 2, counted 1
link count mismatch for inode 1075051934 (name ?), nlink 2, counted 1
link count mismatch for inode 1076860256 (name ?), nlink 2, counted 1
link count mismatch for inode 1076860257 (name ?), nlink 2, counted 1
link count mismatch for inode 1076860258 (name ?), nlink 2, counted 1
link count mismatch for inode 1076860259 (name ?), nlink 2, counted 1
link count mismatch for inode 1076860260 (name ?), nlink 2, counted 1
link count mismatch for inode 1076860261 (name ?), nlink 2, counted 1
link count mismatch for inode 1076860262 (name ?), nlink 2, counted 1
link count mismatch for inode 1076860263 (name ?), nlink 2, counted 1
link count mismatch for inode 1076860264 (name ?), nlink 2, counted 1
link count mismatch for inode 1076860265 (name ?), nlink 2, counted 1
link count mismatch for inode 1076860266 (name ?), nlink 2, counted 1
link count mismatch for inode 1076860267 (name ?), nlink 2, counted 1
link count mismatch for inode 1076860268 (name ?), nlink 2, counted 1
link count mismatch for inode 1076860269 (name ?), nlink 2, counted 1
link count mismatch for inode 1076860270 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805248 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805249 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805254 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805255 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805256 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805257 (name ?), nlink 3, counted 2
link count mismatch for inode 1076805259 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805260 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805261 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805262 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805263 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805266 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805267 (name ?), nlink 3, counted 2
link count mismatch for inode 1076805268 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805269 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805270 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805272 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805273 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805274 (name ?), nlink 5, counted 4
link count mismatch for inode 1076805276 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805277 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805278 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805279 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805280 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805281 (name ?), nlink 5, counted 4
link count mismatch for inode 1076805286 (name ?), nlink 3, counted 2
link count mismatch for inode 1076805288 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805293 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805294 (name ?), nlink 6, counted 5
link count mismatch for inode 1076805295 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805296 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805297 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805298 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805299 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805300 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805301 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805307 (name ?), nlink 2, counted 1
link count mismatch for inode 1076805308 (name ?), nlink 2, counted 1
link count mismatch for inode 1077652948 (name ?), nlink 8, counted 7
link count mismatch for inode 1077652957 (name ?), nlink 3, counted 2
link count mismatch for inode 1077652960 (name ?), nlink 5, counted 4
link count mismatch for inode 1092459354 (name ?), nlink 4, counted 3
link count mismatch for inode 1076809126 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809135 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809136 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809137 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809141 (name ?), nlink 3, counted 2
link count mismatch for inode 1076809143 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809144 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809145 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809146 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809147 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809148 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809149 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809150 (name ?), nlink 9, counted 8
link count mismatch for inode 1076809158 (name ?), nlink 5, counted 4
link count mismatch for inode 1076809162 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809163 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809164 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809165 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809166 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809167 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809168 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809169 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809170 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809171 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809172 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809173 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809174 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809175 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809176 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809177 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809178 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809179 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809180 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809181 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809182 (name ?), nlink 2, counted 1
link count mismatch for inode 1076809183 (name ?), nlink 2, counted 1
link count mismatch for inode 1077602207 (name ?), nlink 10, counted 9
link count mismatch for inode 1077602228 (name ?), nlink 5, counted 4
link count mismatch for inode 1077602230 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811200 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811201 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811202 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811203 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811204 (name ?), nlink 6, counted 5
link count mismatch for inode 1076811206 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811210 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811211 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811212 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811213 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811214 (name ?), nlink 5, counted 4
link count mismatch for inode 1076811215 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811216 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811219 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811220 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811221 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811224 (name ?), nlink 4, counted 3
link count mismatch for inode 1076811227 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811228 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811229 (name ?), nlink 5, counted 4
link count mismatch for inode 1076811233 (name ?), nlink 6, counted 5
link count mismatch for inode 1076811238 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811239 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811240 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811242 (name ?), nlink 5, counted 4
link count mismatch for inode 1076811247 (name ?), nlink 10, counted 9
link count mismatch for inode 1076811248 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811251 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811252 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811253 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811254 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811255 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811256 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811258 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811259 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811260 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811261 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811262 (name ?), nlink 2, counted 1
link count mismatch for inode 1076811263 (name ?), nlink 2, counted 1
link count mismatch for inode 1075117888 (name ?), nlink 2, counted 1
link count mismatch for inode 1075117889 (name ?), nlink 2, counted 1
link count mismatch for inode 1075117890 (name ?), nlink 2, counted 1
link count mismatch for inode 1075117891 (name ?), nlink 5, counted 4
link count mismatch for inode 1075117896 (name ?), nlink 2, counted 1
link count mismatch for inode 1075117897 (name ?), nlink 2, counted 1
link count mismatch for inode 1075117902 (name ?), nlink 2, counted 1
link count mismatch for inode 1075117903 (name ?), nlink 4, counted 3
link count mismatch for inode 1075117907 (name ?), nlink 2, counted 1
link count mismatch for inode 1075117908 (name ?), nlink 2, counted 1
link count mismatch for inode 1075117939 (name ?), nlink 2, counted 1
link count mismatch for inode 1080545558 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818528 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818529 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818531 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818532 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818541 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818542 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818543 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818544 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818545 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818546 (name ?), nlink 7, counted 6
link count mismatch for inode 1076818548 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818549 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818550 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818554 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818555 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818556 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818557 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818558 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818559 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818561 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818562 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818563 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818564 (name ?), nlink 5, counted 4
link count mismatch for inode 1076818565 (name ?), nlink 4, counted 3
link count mismatch for inode 1076818566 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818567 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818569 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818570 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818577 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818578 (name ?), nlink 6, counted 5
link count mismatch for inode 1076818587 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818588 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818589 (name ?), nlink 3, counted 2
link count mismatch for inode 1076818590 (name ?), nlink 2, counted 1
link count mismatch for inode 1076818591 (name ?), nlink 2, counted 1
link count mismatch for inode 1098465506 (name ?), nlink 2, counted 1
link count mismatch for inode 1098465507 (name ?), nlink 4, counted 3
link count mismatch for inode 1098465510 (name ?), nlink 6, counted 5
link count mismatch for inode 1098465511 (name ?), nlink 5, counted 4
link count mismatch for inode 1098465515 (name ?), nlink 5, counted 4
link count mismatch for inode 1098465519 (name ?), nlink 5, counted 4
link count mismatch for inode 1098465523 (name ?), nlink 2, counted 1
link count mismatch for inode 1098465524 (name ?), nlink 5, counted 4
link count mismatch for inode 1098465528 (name ?), nlink 2, counted 1
link count mismatch for inode 1098465529 (name ?), nlink 5, counted 4
link count mismatch for inode 1098465530 (name ?), nlink 2, counted 1
link count mismatch for inode 1098465536 (name ?), nlink 2, counted 1
link count mismatch for inode 1098465537 (name ?), nlink 6, counted 5
link count mismatch for inode 1098465543 (name ?), nlink 2, counted 1
link count mismatch for inode 1098465544 (name ?), nlink 2, counted 1
link count mismatch for inode 1098465545 (name ?), nlink 2, counted 1
link count mismatch for inode 1098465547 (name ?), nlink 2, counted 1
link count mismatch for inode 1098465548 (name ?), nlink 2, counted 1
link count mismatch for inode 1098465549 (name ?), nlink 2, counted 1
link count mismatch for inode 1098465550 (name ?), nlink 2, counted 1
link count mismatch for inode 1098465552 (name ?), nlink 2, counted 1
link count mismatch for inode 1075244397 (name ?), nlink 3, counted 2
link count mismatch for inode 1075244399 (name ?), nlink 2, counted 1
link count mismatch for inode 1075244400 (name ?), nlink 2, counted 1
link count mismatch for inode 1075244401 (name ?), nlink 2, counted 1
link count mismatch for inode 1075244402 (name ?), nlink 2, counted 1
link count mismatch for inode 1098129057 (name ?), nlink 2, counted 1
link count mismatch for inode 1098129058 (name ?), nlink 3, counted 2
link count mismatch for inode 1098129059 (name ?), nlink 2, counted 1
link count mismatch for inode 1098129060 (name ?), nlink 2, counted 1
link count mismatch for inode 1098129061 (name ?), nlink 2, counted 1
link count mismatch for inode 1098129062 (name ?), nlink 2, counted 1
link count mismatch for inode 1098129063 (name ?), nlink 2, counted 1
link count mismatch for inode 1098129064 (name ?), nlink 2, counted 1
link count mismatch for inode 1098129065 (name ?), nlink 2, counted 1
link count mismatch for inode 1098129066 (name ?), nlink 3, counted 2
link count mismatch for inode 1098129067 (name ?), nlink 2, counted 1
link count mismatch for inode 1098129068 (name ?), nlink 2, counted 1
link count mismatch for inode 1098129069 (name ?), nlink 2, counted 1
link count mismatch for inode 1098129070 (name ?), nlink 3, counted 2
link count mismatch for inode 1098129071 (name ?), nlink 3, counted 2
link count mismatch for inode 1098129072 (name ?), nlink 2, counted 1
link count mismatch for inode 1076828510 (name ?), nlink 2, counted 1
link count mismatch for inode 1076828544 (name ?), nlink 2, counted 1
link count mismatch for inode 1076828545 (name ?), nlink 2, counted 1
link count mismatch for inode 1076828546 (name ?), nlink 2, counted 1
link count mismatch for inode 1076828547 (name ?), nlink 2, counted 1
link count mismatch for inode 1076828548 (name ?), nlink 2, counted 1
link count mismatch for inode 1076828549 (name ?), nlink 2, counted 1
link count mismatch for inode 1076828550 (name ?), nlink 2, counted 1
link count mismatch for inode 1076828551 (name ?), nlink 2, counted 1
link count mismatch for inode 1076828554 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828555 (name ?), nlink 8, counted 7 link count mismatch for inode 1076828562 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828563 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828564 (name ?), nlink 3, counted 2 link count mismatch for inode 1076828565 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828570 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828571 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828572 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828573 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828574 (name ?), nlink 3, counted 2 link count mismatch for inode 1076828575 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828576 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828577 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828578 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828579 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828594 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828595 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828596 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828597 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828598 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828599 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828600 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828601 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828602 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828603 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828604 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828605 (name ?), nlink 2, counted 1 link count mismatch for inode 
1076828606 (name ?), nlink 2, counted 1 link count mismatch for inode 1076828607 (name ?), nlink 2, counted 1 link count mismatch for inode 1076383229 (name ?), nlink 2, counted 1 link count mismatch for inode 1076383230 (name ?), nlink 6, counted 5 link count mismatch for inode 1076839776 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839777 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839778 (name ?), nlink 3, counted 2 link count mismatch for inode 1076839779 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839780 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839782 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839783 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839784 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839785 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839786 (name ?), nlink 5, counted 4 link count mismatch for inode 1076839790 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839791 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839792 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839793 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839794 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839795 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839796 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839797 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839798 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839799 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839800 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839801 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839802 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839803 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839804 (name ?), nlink 2, counted 1 
link count mismatch for inode 1076839805 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839806 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839807 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839808 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839809 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839810 (name ?), nlink 7, counted 6 link count mismatch for inode 1076839812 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839817 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839818 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839819 (name ?), nlink 5, counted 4 link count mismatch for inode 1076839823 (name ?), nlink 3, counted 2 link count mismatch for inode 1076839825 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839826 (name ?), nlink 5, counted 4 link count mismatch for inode 1076839827 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839828 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839831 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839833 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839834 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839835 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839836 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839837 (name ?), nlink 2, counted 1 link count mismatch for inode 1076839838 (name ?), nlink 5, counted 4 link count mismatch for inode 1076848162 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848164 (name ?), nlink 6, counted 5 link count mismatch for inode 1076848169 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848170 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848171 (name ?), nlink 3, counted 2 link count mismatch for inode 1076848173 (name ?), nlink 2, counted 1 link count mismatch for inode 
1076848174 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848175 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848176 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848177 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848178 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848179 (name ?), nlink 4, counted 3 link count mismatch for inode 1076848182 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848183 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848184 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848185 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848186 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848187 (name ?), nlink 7, counted 6 link count mismatch for inode 1076848194 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848195 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848196 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848197 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848198 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848199 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848200 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848201 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848202 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848203 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848204 (name ?), nlink 6, counted 5 link count mismatch for inode 1076848205 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848206 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848207 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848208 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848209 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848210 (name ?), nlink 2, counted 1 
link count mismatch for inode 1076848211 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848212 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848213 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848214 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848219 (name ?), nlink 6, counted 5 link count mismatch for inode 1076848221 (name ?), nlink 2, counted 1 link count mismatch for inode 1076848222 (name ?), nlink 3, counted 2 --------------040805010700020206070201-- From sandeen@sandeen.net Sat May 8 10:04:46 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.8 required=5.0 tests=BAYES_00,J_CHICKENPOX_210 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o48F4kKX195212 for ; Sat, 8 May 2010 10:04:46 -0500 X-ASG-Debug-ID: 1273331224-70e200fc0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.sandeen.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 0AB831286E40 for ; Sat, 8 May 2010 08:07:04 -0700 (PDT) Received: from mail.sandeen.net (64-131-60-146.usfamily.net [64.131.60.146]) by cuda.sgi.com with ESMTP id 6vKGyPcGu0r2BEkE for ; Sat, 08 May 2010 08:07:04 -0700 (PDT) Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.sandeen.net (Postfix) with ESMTP id AE046944B87; Sat, 8 May 2010 10:06:54 -0500 (CDT) Message-ID: <4BE57E0D.3020601@sandeen.net> Date: Sat, 08 May 2010 10:06:53 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.24 (Macintosh/20100228) MIME-Version: 1.0 To: Christian Affolter CC: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: failed to read root inode Subject: Re: failed to read root inode References: <4BE55A63.8070203@purplehaze.ch> 
In-Reply-To: <4BE55A63.8070203@purplehaze.ch> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Barracuda-Connect: 64-131-60-146.usfamily.net[64.131.60.146] X-Barracuda-Start-Time: 1273331226 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.92 X-Barracuda-Spam-Status: No, SCORE=-1.92 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=RDNS_DYNAMIC X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29368 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.10 RDNS_DYNAMIC Delivered to trusted network by host with dynamic-looking rDNS X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Christian Affolter wrote: > Hi > > After a disk crash within a hardware RAID-6 controller and kernel > freeze, I'm unable to mount an XFS filesystem on top of an EVMS volume: Are you sure the volume is reassembled correctly? It seems like the fs has a ton of damage ... One trick I often recommend is to make a metadata image of the fs with xfs_metadump / xfs_mdrestore and run repair on that to see what repair -would- do, but I guess you've already run it on the real fs. So if repair isn't making a mountable fs, first suggestion would be to re-try with the latest version of repair. > Filesystem "dm-13": Disabling barriers, not supported by the underlying > device Honestly, that could be part of the problem too, if a bunch of disks with write caches all lost them, in the array. > XFS mounting filesystem dm-13 > Starting XFS recovery on filesystem: dm-13 (logdev: internal) > XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1599 of file > fs/xfs/xfs_alloc.c. Caller 0xffffffff8035c58d > Pid: 13473, comm: mount Not tainted 2.6.26-gentoo #1 ... 
> XFS: log mount finish failed So recovery is failing, you could try mount -o ro,norecovery at this point to see what's still left on the fs... but: > I tried to repair the filesystem with the help of xfs_repair many times, > without any luck: > Filesystem "dm-13": Disabling barriers, not supported by the underlying > device > XFS mounting filesystem dm-13 > XFS: failed to read root inode ... > xfs_check output: > cache_node_purge: refcount was 1, not zero (node=0x820010) > xfs_check: cannot read root inode (117) That's a bit of an odd root inode number, I think, which makes me think maybe there are still serious problems. > Are there any other ways to fix the unreadable root inode or to restore > the remaining data? > > > Environment informations: > Linux Kernel: 2.6.26-gentoo (x86_64) > xfsprogs: 3.0.3 Those are both pretty old at this point, I can't say there is anything specific in newer xfsprogs, but I'd probably give that a shot first. -Eric From BATV+692f5aa75ba2cdc8b286+2449+infradead.org+hch@bombadil.srs.infradead.org Sat May 8 12:13:38 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o48HDZA2197447 for ; Sat, 8 May 2010 12:13:38 -0500 X-ASG-Debug-ID: 1273338955-381701e00000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 184461078DCF for ; Sat, 8 May 2010 10:15:55 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id whVbNHPIOaAi99Lc for ; Sat, 08 May 2010 10:15:55 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1OAncv-00005F-3P; Sat, 
08 May 2010 17:15:45 +0000 Date: Sat, 8 May 2010 13:15:45 -0400 From: Christoph Hellwig To: Dave Chinner Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH 07/12] xfs: Improve scalability of busy extent tracking Subject: Re: [PATCH 07/12] xfs: Improve scalability of busy extent tracking Message-ID: <20100508171544.GA10971@infradead.org> References: <1273210860-23414-1-git-send-email-david@fromorbit.com> <1273210860-23414-8-git-send-email-david@fromorbit.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1273210860-23414-8-git-send-email-david@fromorbit.com> User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1273338956 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Looks good, but a couple minor comments below: Reviewed-by: Christoph Hellwig > diff --git a/fs/xfs/linux-2.6/xfs_quotaops.c b/fs/xfs/linux-2.6/xfs_quotaops.c > index 1947514..2e73688 100644 > --- a/fs/xfs/linux-2.6/xfs_quotaops.c > +++ b/fs/xfs/linux-2.6/xfs_quotaops.c > @@ -19,6 +19,7 @@ > #include "xfs_dmapi.h" > #include "xfs_sb.h" > #include "xfs_inum.h" > +#include "xfs_log.h" > #include "xfs_ag.h" > #include "xfs_mount.h" > #include "xfs_quota.h" This hunk is not needed. 
> --- a/fs/xfs/xfs_log.h > +++ b/fs/xfs/xfs_log.h > @@ -18,9 +18,6 @@ > #ifndef __XFS_LOG_H__ > #define __XFS_LOG_H__ > > -/* transaction ID type */ > -typedef __uint32_t xlog_tid_t; > - > --- a/fs/xfs/xfs_types.h > +++ b/fs/xfs/xfs_types.h > @@ -75,6 +75,8 @@ typedef __uint32_t xfs_dahash_t; /* dir/attr hash value */ > > typedef __uint16_t xfs_prid_t; /* prid_t truncated to 16bits in XFS */ > > +typedef __uint32_t xlog_tid_t; /* transaction ID type */ This should be in the patch introducing xfs_log_get_trans_ident. From BATV+692f5aa75ba2cdc8b286+2449+infradead.org+hch@bombadil.srs.infradead.org Sat May 8 14:06:34 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o48J6WIv199772 for ; Sat, 8 May 2010 14:06:34 -0500 X-ASG-Debug-ID: 1273345723-62ce02fc0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 144A91DE8EDB for ; Sat, 8 May 2010 12:08:43 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id xfE5sDfgADFmQSp3 for ; Sat, 08 May 2010 12:08:43 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1OApOF-0005Zz-3w; Sat, 08 May 2010 19:08:43 +0000 Date: Sat, 8 May 2010 15:08:43 -0400 From: Christoph Hellwig To: xfs@oss.sgi.com, linux-kernel@vger.kernel.org X-ASG-Orig-Subj: XFS status update for April 2010 Subject: XFS status update for April 2010 Message-ID: <20100508190843.GA20445@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path 
rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1273345724 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean In April, 2.6.34 was still in the release candidate phase, with a handful of XFS fixes making it into mainline. Development for the 2.6.35 merge window went ahead full steam at the same time. While a fair number of patches hit the development tree, these were largely cleanups, with the real development activity happening on the mailing list. There was another round of patches and follow-up discussion on the scalable busy extent tracking and delayed logging features mentioned last month. They are expected to be merged in May and queued up for the Linux 2.6.35 window. Last but not least, April saw a large number of XFS fixes backported to the 2.6.32 and 2.6.33 -stable series. In user land, xfsprogs has seen few but important updates, preparing for a new release next month. The xfs_repair tool saw a fix to correctly enable the lazy superblock counters on an existing filesystem, and xfs_fsr saw updates to better deal with dynamic attribute forks. Last but not least, a port to Debian GNU/kFreeBSD got merged. The xfstests test suite saw two new test cases and various smaller fixes. 
From stan@hardwarefreak.com Sat May 8 17:49:47 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o48Mnk5n204129 for ; Sat, 8 May 2010 17:49:46 -0500 X-ASG-Debug-ID: 1273359117-4f91002b0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from greer.hardwarefreak.com (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 6AE4631FDE9 for ; Sat, 8 May 2010 15:51:57 -0700 (PDT) Received: from greer.hardwarefreak.com (mo-65-41-216-221.sta.embarqhsd.net [65.41.216.221]) by cuda.sgi.com with ESMTP id 9Fu6EhN7nq3APVF5 for ; Sat, 08 May 2010 15:51:57 -0700 (PDT) Received: from [192.168.100.53] (gffx.hardwarefreak.com [192.168.100.53]) by greer.hardwarefreak.com (Postfix) with ESMTP id 2CA876C074 for ; Sat, 8 May 2010 17:51:57 -0500 (CDT) Message-ID: <4BE5EB5D.5020702@hardwarefreak.com> Date: Sat, 08 May 2010 17:53:17 -0500 From: Stan Hoeppner User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.9.1.9) Gecko/20100317 Thunderbird/3.0.4 MIME-Version: 1.0 To: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: failed to read root inode Subject: Re: failed to read root inode References: <4BE55A63.8070203@purplehaze.ch> In-Reply-To: <4BE55A63.8070203@purplehaze.ch> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Barracuda-Connect: mo-65-41-216-221.sta.embarqhsd.net[65.41.216.221] X-Barracuda-Start-Time: 1273359118 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0001 1.0000 -2.0207 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.42 X-Barracuda-Spam-Status: No, SCORE=-1.42 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=BSF_SC5_MJ1963, RDNS_DYNAMIC X-Barracuda-Spam-Report: Code version 3.2, rules 
version 3.2.2.29395 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.10 RDNS_DYNAMIC Delivered to trusted network by host with dynamic-looking rDNS 0.50 BSF_SC5_MJ1963 Custom Rule MJ1963 X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Christian Affolter put forth on 5/8/2010 7:34 AM: > Hi > > After a disk crash within a hardware RAID-6 controller and kernel > freeze, I'm unable to mount an XFS filesystem on top of an EVMS volume: What storage management operation(s) were you performing when this crash occurred? Were you adding, deleting, shrinking, or growing an EVMS volume when the "crash" occurred, or was the system just sitting idle with no load when the crash occurred? Why did the "crash" of a single disk in a hardware RAID6 cause a kernel freeze? What is your definition of "disk crash"? A single physical disk failure should not have caused this under any circumstances. The RAID card should have handled a single disk failure transparently. Exactly which make/model is the RAID card? What is the status of each of the remaining disks attached to the card as reported by its BIOS? What is the status of the RAID6 volume as reported by the RAID card BIOS? What is the status of each of your EVMS volumes as reported by the EVMS UI? I'm asking all of these questions because it seems rather clear that the root cause of your problem lies at a layer well below the XFS filesystem. You have two layers of physical disk abstraction below XFS: a hardware RAID6 and a software logical volume manager. You've apparently suffered a storage system hardware failure, according to your description. You haven't given any details of the current status of the hardware RAID, or of the logical volumes, merely that XFS is having problems. I think a "Well duh!" is in order. Please provide _detailed_ information from the RAID card BIOS and the EVMS UI. 
Even if the problem isn't XFS related I for one would be glad to assist you in getting this fixed. Right now we don't have enough information. At least I don't. -- Stan From eflorac@intellique.com Sun May 9 08:26:28 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o49DQRCD226345 for ; Sun, 9 May 2010 08:26:28 -0500 X-ASG-Debug-ID: 1273411712-373b02fa0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from smtp3-g21.free.fr (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id E733B178856F for ; Sun, 9 May 2010 06:28:36 -0700 (PDT) Received: from smtp3-g21.free.fr (smtp3-g21.free.fr [212.27.42.3]) by cuda.sgi.com with ESMTP id eCXiqKq69fo0A8Ff for ; Sun, 09 May 2010 06:28:36 -0700 (PDT) Received: from smtp3-g21.free.fr (localhost [127.0.0.1]) by smtp3-g21.free.fr (Postfix) with ESMTP id EBF0281807F; Sun, 9 May 2010 15:28:30 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp3-g21.free.fr (Postfix) with ESMTP; Sun, 9 May 2010 15:28:29 +0200 (CEST) Date: Sun, 9 May 2010 15:28:18 +0200 From: Emmanuel Florac To: Stan Hoeppner Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: failed to read root inode Subject: Re: failed to read root inode Message-ID: <20100509152818.7481c1e1@galadriel.home> In-Reply-To: <4BE5EB5D.5020702@hardwarefreak.com> References: <4BE55A63.8070203@purplehaze.ch> <4BE5EB5D.5020702@hardwarefreak.com> Organization: Intellique X-Mailer: Claws Mail 3.0.2 (GTK+ 2.12.9; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Barracuda-Connect: smtp3-g21.free.fr[212.27.42.3] X-Barracuda-Start-Time: 1273411718 X-Barracuda-Bayes: INNOCENT 
GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29448 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Le Sat, 08 May 2010 17:53:17 -0500 vous =E9criviez: > Why did the "crash" of a single disk in a hardware RAID6 cause a > kernel freeze? What is your definition of "disk crash"? A single > physical disk failure should not have caused this under any > circumstances. The RAID card should have handled a single disk > failure transparently. The RAID array may go west if the disk isn't properly set up, particularly if it's a desktop-class drive.=20 --=20 ------------------------------------------------------------------------ Emmanuel Florac | Direction technique | Intellique | | +33 1 78 94 84 02 ------------------------------------------------------------------------ From stan@hardwarefreak.com Sun May 9 09:49:33 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o49EnXqu228205 for ; Sun, 9 May 2010 09:49:33 -0500 X-ASG-Debug-ID: 1273416703-72e600520000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from greer.hardwarefreak.com (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 2BEEF3233F6 for ; Sun, 9 May 2010 07:51:43 -0700 (PDT) Received: from greer.hardwarefreak.com (mo-65-41-216-221.sta.embarqhsd.net [65.41.216.221]) by cuda.sgi.com 
with ESMTP id PkQJmiMZewozlawf for ; Sun, 09 May 2010 07:51:43 -0700 (PDT) Received: from [192.168.100.53] (gffx.hardwarefreak.com [192.168.100.53]) by greer.hardwarefreak.com (Postfix) with ESMTP id 9CC3D6C3D2 for ; Sun, 9 May 2010 09:51:43 -0500 (CDT) Message-ID: <4BE6CC83.5070305@hardwarefreak.com> Date: Sun, 09 May 2010 09:53:55 -0500 From: Stan Hoeppner User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.9.1.9) Gecko/20100317 Thunderbird/3.0.4 MIME-Version: 1.0 To: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: failed to read root inode Subject: Re: failed to read root inode References: <4BE55A63.8070203@purplehaze.ch> <4BE5EB5D.5020702@hardwarefreak.com> <20100509152818.7481c1e1@galadriel.home> In-Reply-To: <20100509152818.7481c1e1@galadriel.home> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-Barracuda-Connect: mo-65-41-216-221.sta.embarqhsd.net[65.41.216.221] X-Barracuda-Start-Time: 1273416704 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.42 X-Barracuda-Spam-Status: No, SCORE=-1.42 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=BSF_SC5_MJ1963, RDNS_DYNAMIC X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29454 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.10 RDNS_DYNAMIC Delivered to trusted network by host with dynamic-looking rDNS 0.50 BSF_SC5_MJ1963 Custom Rule MJ1963 X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Emmanuel Florac put forth on 5/9/2010 8:28 AM: > Le Sat, 08 May 2010 17:53:17 -0500 vous écriviez: > >> Why did the "crash" of a single disk in a hardware RAID6 cause a >> kernel freeze? What is your definition of "disk crash"? A single >> physical disk failure should not have caused this under any >> circumstances. 
The RAID card should have handled a single disk >> failure transparently. > > The RAID array may go west if the disk isn't properly set up, > particularly if it's a desktop-class drive. By design, a RAID6 pack should be able to handle two simultaneous drive failures before the array goes offline. According to the OP's post he lost one drive. Unless it's a really crappy RAID card or if he's using a bunch of dissimilar drives causing problems with the entire array, he shouldn't have had a problem. This is why I'm digging for more information. The information he presented here doesn't really make any sense. One physical disk failure _shouldn't_ have caused the problems he's experiencing. I don't think we got the full story. Oh, btw, when it comes to SATA drives, there is no difference between "desktop" and "enterprise" class drives. They're all the same. The ones sold as "enterprise" have merely been firmware matched and QC tested with a given vendor's SAN/NAS box and then certified for use with it. The vendor then sells only that one drive/firmware, maybe two certified drives so they have a second source in case of shortages or price gouging etc, in their arrays. According to the marketing droids, the only "true" "enterprise" drives currently on the market are SAS and fiber channel. The number of these drives actually shipping into the server/SAN/NAS storage marketplace is absolutely tiny compared to SATA drives. In total unit shipments, SATA is owning the datacenter as well as the desktop. Browse the various storage offerings across the big 3 and then 10 of the 2nd tier players and you'll find at least 8 out of 10 storage arrays are SATA, the remaining two being SAS and FC in the "high end" category, and usually over double the price of the SATA based arrays. This pricing of SAS/FC is what is driving SATA adoption. 
That and really large read/write caches on the SATA arrays boosting their performance for many workloads and negating the spindle speed advantage of the SAS and FC drives. -- Stan From christian.affolter@purplehaze.ch Sun May 9 09:50:56 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.8 required=5.0 tests=BAYES_00,J_CHICKENPOX_210 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o49Eot61228251 for ; Sun, 9 May 2010 09:50:56 -0500 X-ASG-Debug-ID: 1273416786-732f005c0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from smtp.stepping-stone.ch (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 189363233FB for ; Sun, 9 May 2010 07:53:06 -0700 (PDT) Received: from smtp.stepping-stone.ch (smtp.stepping-stone.ch [194.176.109.228]) by cuda.sgi.com with ESMTP id 2cH1zbSjBCB7Lo1e for ; Sun, 09 May 2010 07:53:06 -0700 (PDT) Received: from localhost (mail-scanner-01.int.stepping-stone.ch [10.17.98.47]) by smtp.stepping-stone.ch (Postfix) with ESMTP id 281E6400389 for ; Sun, 9 May 2010 16:53:05 +0200 (CEST) X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Scanned: amavisd-new at stepping-stone.ch Received: from smtp.stepping-stone.ch ([10.17.98.46]) by localhost (mail-scanner-01.int.stepping-stone.ch [10.17.98.47]) (amavisd-new, port 10024) with LMTP id Liq8XIXoy+du for ; Sun, 9 May 2010 16:53:02 +0200 (CEST) Received: from [192.168.1.4] (84-73-140-121.dclient.hispeed.ch [84.73.140.121]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by smtp.stepping-stone.ch (Postfix) with ESMTPSA id 27FE340018C for ; Sun, 9 May 2010 16:53:01 +0200 (CEST) Message-ID: <4BE6CC4C.3030501@purplehaze.ch> Date: Sun, 09 May 2010 16:53:00 +0200 From: Christian Affolter 
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100420 Thunderbird/3.0.4 MIME-Version: 1.0 To: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: failed to read root inode Subject: Re: failed to read root inode References: <4BE55A63.8070203@purplehaze.ch> <4BE57E0D.3020601@sandeen.net> In-Reply-To: <4BE57E0D.3020601@sandeen.net> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Barracuda-Connect: smtp.stepping-stone.ch[194.176.109.228] X-Barracuda-Start-Time: 1273416787 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29454 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Status: Clean Hi Eric Thanks for your answer. >> After a disk crash within a hardware RAID-6 controller and kernel >> freeze, I'm unable to mount an XFS filesystem on top of an EVMS volume: > > Are you sure the volume is reassembled correctly? It seems like the > fs has a ton of damage ... > > One trick I often recommend is to make a metadata image of the fs > with xfs_metadump / xfs_mdrestore and run repair on that to see > what repair -would- do, but I guess you've already run it on the > real fs. OK, I didn't know that. I actually cloned the failed volume using dd and ran xfs_repair on the clone. > So if repair isn't making a mountable fs, first suggestion would > be to re-try with the latest version of repair. OK, I will try that. Unfortunately the latest upstream version isn't included in the distribution's package repository, so I will have to compile it first.
>> Filesystem "dm-13": Disabling barriers, not supported by the underlying >> device > > Honestly, that could be part of the problem too, if a bunch of > disks with write caches all lost them, in the array. > >> XFS mounting filesystem dm-13 >> Starting XFS recovery on filesystem: dm-13 (logdev: internal) >> XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1599 of file >> fs/xfs/xfs_alloc.c. Caller 0xffffffff8035c58d >> Pid: 13473, comm: mount Not tainted 2.6.26-gentoo #1 > ... > >> XFS: log mount finish failed > > So recovery is failing, you could try mount -o ro,norecovery at this > point to see what's still left on the fs... but: This didn't make any difference; I'm still getting the same error message. >> I tried to repair the filesystem with the help of xfs_repair many times, >> without any luck: >> Filesystem "dm-13": Disabling barriers, not supported by the underlying >> device >> XFS mounting filesystem dm-13 >> XFS: failed to read root inode > > ... > >> xfs_check output: >> cache_node_purge: refcount was 1, not zero (node=0x820010) >> xfs_check: cannot read root inode (117) > > That's a bit of an odd root inode number, I think, which > makes me think maybe there are still serious problems. > >> Are there any other ways to fix the unreadable root inode or to restore >> the remaining data? >> >> >> Environment informations: >> Linux Kernel: 2.6.26-gentoo (x86_64) >> xfsprogs: 3.0.3 > > Those are both pretty old at this point, I can't say there is anything > specific in newer xfsprogs, but I'd probably give that a shot first. Yes I'm going to try the latest version.
Thanks Christian From eflorac@intellique.com Sun May 9 10:32:17 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o49FWHQM228970 for ; Sun, 9 May 2010 10:32:17 -0500 X-ASG-Debug-ID: 1273419263-018003160000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from smtp3-g21.free.fr (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 84E121206652 for ; Sun, 9 May 2010 08:34:27 -0700 (PDT) Received: from smtp3-g21.free.fr (smtp3-g21.free.fr [212.27.42.3]) by cuda.sgi.com with ESMTP id UTRKqcHqKX0Tkx81 for ; Sun, 09 May 2010 08:34:27 -0700 (PDT) Received: from smtp3-g21.free.fr (localhost [127.0.0.1]) by smtp3-g21.free.fr (Postfix) with ESMTP id D9FF5818157; Sun, 9 May 2010 17:34:21 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp3-g21.free.fr (Postfix) with ESMTP; Sun, 9 May 2010 17:34:19 +0200 (CEST) Date: Sun, 9 May 2010 17:34:07 +0200 From: Emmanuel Florac To: Stan Hoeppner Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: failed to read root inode Subject: Re: failed to read root inode Message-ID: <20100509173407.54467993@galadriel.home> In-Reply-To: <4BE6CC83.5070305@hardwarefreak.com> References: <4BE55A63.8070203@purplehaze.ch> <4BE5EB5D.5020702@hardwarefreak.com> <20100509152818.7481c1e1@galadriel.home> <4BE6CC83.5070305@hardwarefreak.com> Organization: Intellique X-Mailer: Claws Mail 3.0.2 (GTK+ 2.12.9; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Barracuda-Connect: smtp3-g21.free.fr[212.27.42.3] X-Barracuda-Start-Time: 1273419269 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at 
sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29456 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Sun, 09 May 2010 09:53:55 -0500 you wrote: > Oh, btw, when it comes to SATA drives, there is no difference between > "desktop" and "enterprise" class drives. They're all the same. Yes I know that, however there's an important difference: the default firmware setting of desktop drives makes them retry, retry, retry and retry on error, effectively freezing the array, while "enterprise" drives simply fail almost instantly at the slightest error (and usually work fine 5 minutes later). So a desktop drive failure in a RAID array may actually block all IO for a very long time (minutes), and everything goes west if you then decide that the system must have crashed and pull the plug, if your RAID controller has no backup battery.
That's why you shouldn't ever use desktop drives in RAID arrays unless you know how to painfully configure them, one by one, with a friggin DOS utility ;) -- ------------------------------------------------------------------------ Emmanuel Florac | Direction technique | Intellique | | +33 1 78 94 84 02 ------------------------------------------------------------------------ From christian.affolter@purplehaze.ch Sun May 9 10:33:27 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o49FXRsl229004 for ; Sun, 9 May 2010 10:33:27 -0500 X-ASG-Debug-ID: 1273419338-7d3603370000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from smtp.stepping-stone.ch (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id EC1EB1206662 for ; Sun, 9 May 2010 08:35:38 -0700 (PDT) Received: from smtp.stepping-stone.ch (smtp.stepping-stone.ch [194.176.109.228]) by cuda.sgi.com with ESMTP id DT0nneJ3aJUauj7V for ; Sun, 09 May 2010 08:35:38 -0700 (PDT) Received: from localhost (mail-scanner-01.int.stepping-stone.ch [10.17.98.47]) by smtp.stepping-stone.ch (Postfix) with ESMTP id CA70D4003CF for ; Sun, 9 May 2010 17:35:37 +0200 (CEST) X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Scanned: amavisd-new at stepping-stone.ch Received: from smtp.stepping-stone.ch ([10.17.98.46]) by localhost (mail-scanner-01.int.stepping-stone.ch [10.17.98.47]) (amavisd-new, port 10024) with LMTP id UTGyaxnZuxlH for ; Sun, 9 May 2010 17:35:34 +0200 (CEST) Received: from [192.168.1.4] (84-73-140-121.dclient.hispeed.ch [84.73.140.121]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by smtp.stepping-stone.ch (Postfix)
with ESMTPSA id AA2C24003D5 for ; Sun, 9 May 2010 17:35:34 +0200 (CEST) Message-ID: <4BE6D63F.3070404@purplehaze.ch> Date: Sun, 09 May 2010 17:35:27 +0200 From: Christian Affolter User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100420 Thunderbird/3.0.4 MIME-Version: 1.0 To: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: failed to read root inode Subject: Re: failed to read root inode References: <4BE55A63.8070203@purplehaze.ch> <4BE5EB5D.5020702@hardwarefreak.com> In-Reply-To: <4BE5EB5D.5020702@hardwarefreak.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Barracuda-Connect: smtp.stepping-stone.ch[194.176.109.228] X-Barracuda-Start-Time: 1273419338 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29456 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Status: Clean Hi >> After a disk crash within a hardware RAID-6 controller and kernel >> freeze, I'm unable to mount an XFS filesystem on top of an EVMS volume: > > What storage management operation(s) were you performing when this crash > occurred? Were you adding, deleting, shrinking, or growing an EVMS volume > when the "crash" occurred, or was the system just sitting idle with no load > when the crash occurred? No, there were no storage management operations in progress while the system crashed. It's a NFS file server with random read and write operations. > Why did the "crash" of a single disk in a hardware RAID6 cause a kernel > freeze? What is your definition of "disk crash"? A single physical disk > failure should not have caused this under any circumstances. 
The RAID card > should have handled a single disk failure transparently. That's a good question ;) Honestly I don't know how this could happen, all I saw were a bunch of errors from the RAID controller driver. In the past two other disks failed and the controller reported each failure correctly and started to rebuild the array automatically by using the hot-spare disk. So it did its job two times correctly. [...] kernel: arcmsr0: abort device command of scsi id = 1 lun = 0 kernel: arcmsr0: abort device command of scsi id = 4 lun = 0 kernel: arcmsr0: abort device command of scsi id = 4 lun = 0 kernel: arcmsr0: abort device command of scsi id = 4 lun = 0 kernel: arcmsr0: abort device command of scsi id = 4 lun = 0 kernel: arcmsr0: abort device command of scsi id = 2 lun = 0 kernel: arcmsr0: ccb ='0xffff8100bf819080' isr got aborted command kernel: arcmsr0: isr get an illegal ccb command done acb = '0xffff81013ec245c8'ccb = '0xffff8100bf819080' ccbacb = '0xffff81013ec245c8' startdone = 0x0 ccboutstandingcount = 0 kernel: arcmsr0: abort device command of scsi id = 2 lun = 0 kernel: arcmsr0: abort device command of scsi id = 4 lun = 0 kernel: arcmsr0: ccb ='0xffff8100bf814640' isr got aborted command kernel: arcmsr0: isr get an illegal ccb command done acb = '0xffff81013ec245c8'ccb = '0xffff8100bf814640' ccbacb = '0xffff81013ec245c8' startdone = 0x0 ccboutstandingcount = 13 kernel: sd 0:0:4:0: [sdd] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK,SUGGEST_OK kernel: end_request: I/O error, dev sdd, sector 1897479440 kernel: Device dm-79, XFS metadata write error block 0x1813c0 in dm-79 kernel: Device dm-65, XFS metadata write error block 0x2c1a0 in dm-65 kernel: xfs_force_shutdown(dm-65,0x1) called from line 1093 of file fs/xfs/xfs_buf_item.c. Return address = 0xffffffff80374359 kernel: Filesystem "dm-65": I/O Error Detected. 
Shutting down filesystem: dm-65 kernel: Please umount the filesystem, and rectify the problem(s) kernel: xfs_force_shutdown(dm-65,0x1) called from line 420 of file fs/xfs/xfs_rw.c. Return address = 0xffffffff803a9529 [...] Afterwards most of the volumes were shut down, and after a couple of hours the kernel froze with a kernel panic (the details of which I can't recall, as I had no serial console attached). > Exactly which make/model is the RAID card? Areca Technology Corp. ARC-1160 16-Port PCI-X to SATA RAID Controller Firmware Version : V1.42 2006-10-13 > What is the status of each of the remaining disks attached to the card as reported by its BIOS? After the hard reset, one disk was reported as 'failed' and the rebuild started. > What is the status of the RAID6 volume as reported by the RAID card BIOS? The rebuild has since finished, so the volume is in a normal, non-degraded state. > What is the status of each of your EVMS volumes as reported by the EVMS UI? They're all active. Do you need more information here? There are approximately 45 active volumes on this server. > I'm asking all of these questions because it seems rather clear that the > root cause of your problem lies at a layer well below the XFS filesystem. Yes, I never blamed XFS for being the cause of the problem. > You have two layers of physical disk abstraction below XFS: a hardware > RAID6 and a software logical volume manager. You've apparently suffered a > storage system hardware failure, according to your description. You haven't > given any details of the current status of the hardware RAID, or of the > logical volumes, merely that XFS is having problems. I think a "Well duh!" > is in order. > > Please provide _detailed_ information from the RAID card BIOS and the EVMS > UI. Even if the problem isn't XFS related I for one would be glad to assist > you in getting this fixed. Right now we don't have enough information. At > least I don't.
Thanks for your help Christian From eflorac@intellique.com Sun May 9 10:57:39 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o49FvcRv229577 for ; Sun, 9 May 2010 10:57:39 -0500 X-ASG-Debug-ID: 1273420785-45b100420000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from smtp3-g21.free.fr (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id AC0E51395674 for ; Sun, 9 May 2010 08:59:48 -0700 (PDT) Received: from smtp3-g21.free.fr (smtp3-g21.free.fr [212.27.42.3]) by cuda.sgi.com with ESMTP id gMhJV32WBStDBm9A for ; Sun, 09 May 2010 08:59:48 -0700 (PDT) Received: from smtp3-g21.free.fr (localhost [127.0.0.1]) by smtp3-g21.free.fr (Postfix) with ESMTP id 742D881810A; Sun, 9 May 2010 17:59:43 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp3-g21.free.fr (Postfix) with ESMTP; Sun, 9 May 2010 17:59:41 +0200 (CEST) Date: Sun, 9 May 2010 17:59:35 +0200 From: Emmanuel Florac To: Christian Affolter Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: failed to read root inode Subject: Re: failed to read root inode Message-ID: <20100509175935.15b6374e@galadriel.home> In-Reply-To: <4BE6D63F.3070404@purplehaze.ch> References: <4BE55A63.8070203@purplehaze.ch> <4BE5EB5D.5020702@hardwarefreak.com> <4BE6D63F.3070404@purplehaze.ch> Organization: Intellique X-Mailer: Claws Mail 3.0.2 (GTK+ 2.12.9; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Barracuda-Connect: smtp3-g21.free.fr[212.27.42.3] X-Barracuda-Start-Time: 1273420790 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: 
-2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29458 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Sun, 09 May 2010 17:35:27 +0200 you wrote: > kernel: arcmsr0: abort device command of scsi id = 1 lun = 0 > kernel: arcmsr0: abort device command of scsi id = 4 lun = 0 > kernel: arcmsr0: abort device command of scsi id = 4 lun = 0 > kernel: arcmsr0: abort device command of scsi id = 4 lun = 0 > kernel: arcmsr0: abort device command of scsi id = 4 lun = 0 > kernel: arcmsr0: abort device command of scsi id = 2 lun = 0 > kernel: arcmsr0: ccb ='0xffff8100bf819080' Looks like disks 1, 2 and 4 timed out there. What model are those disks?
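The reading above (three different ids aborting in the same burst) can be checked mechanically on a longer log. The following is only a minimal sketch in C: the helper names abort_scsi_id and tally_aborts and the bound MAX_SCSI_ID are hypothetical, the message text is taken from the arcmsr lines quoted in this thread, and whether a scsi id on this controller maps to a physical disk or to a volume is not established here.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical upper bound on scsi ids, chosen for illustration only. */
#define MAX_SCSI_ID 16

/*
 * Return the scsi id named in an arcmsr "abort device command" line,
 * or -1 if the line is some other message or the id is out of range.
 */
static int abort_scsi_id(const char *line)
{
    static const char key[] = "abort device command of scsi id = ";
    const char *p = strstr(line, key);
    int id;

    if (p == NULL)
        return -1;
    if (sscanf(p + sizeof(key) - 1, "%d", &id) != 1)
        return -1;
    if (id < 0 || id >= MAX_SCSI_ID)
        return -1;
    return id;
}

/*
 * Count abort messages per scsi id over n log lines.
 * counts[] is zeroed first; the return value is the number of
 * distinct ids that saw at least one abort.
 */
static int tally_aborts(const char *const lines[], size_t n,
                        int counts[MAX_SCSI_ID])
{
    int distinct = 0;

    memset(counts, 0, MAX_SCSI_ID * sizeof(counts[0]));
    for (size_t i = 0; i < n; i++) {
        int id = abort_scsi_id(lines[i]);

        if (id < 0)
            continue;               /* not an abort line */
        if (counts[id]++ == 0)
            distinct++;             /* first abort seen for this id */
    }
    return distinct;
}
```

Fed the six abort lines quoted above, tally_aborts() reports three distinct ids (1, 2 and 4), with id 4 accounting for four of the six aborts. More than one id in a single burst is what points away from a single failing drive.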
-- ------------------------------------------------------------------------ Emmanuel Florac | Direction technique | Intellique | | +33 1 78 94 84 02 ------------------------------------------------------------------------ From stan@hardwarefreak.com Sun May 9 12:29:36 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o49HTZ35231388 for ; Sun, 9 May 2010 12:29:36 -0500 X-ASG-Debug-ID: 1273426321-187002250000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from greer.hardwarefreak.com (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id CC1F41287E71 for ; Sun, 9 May 2010 10:32:01 -0700 (PDT) Received: from greer.hardwarefreak.com (mo-65-41-216-221.sta.embarqhsd.net [65.41.216.221]) by cuda.sgi.com with ESMTP id 799963EXAcfKOdgB for ; Sun, 09 May 2010 10:32:01 -0700 (PDT) Received: from [192.168.100.53] (gffx.hardwarefreak.com [192.168.100.53]) by greer.hardwarefreak.com (Postfix) with ESMTP id 31E976C32E for ; Sun, 9 May 2010 12:31:46 -0500 (CDT) Message-ID: <4BE6F22F.5090603@hardwarefreak.com> Date: Sun, 09 May 2010 12:34:39 -0500 From: Stan Hoeppner User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.9.1.9) Gecko/20100317 Thunderbird/3.0.4 MIME-Version: 1.0 To: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: failed to read root inode Subject: Re: failed to read root inode References: <4BE55A63.8070203@purplehaze.ch> <4BE5EB5D.5020702@hardwarefreak.com> <4BE6D63F.3070404@purplehaze.ch> In-Reply-To: <4BE6D63F.3070404@purplehaze.ch> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Barracuda-Connect: mo-65-41-216-221.sta.embarqhsd.net[65.41.216.221] X-Barracuda-Start-Time: 1273426321 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000
-2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.42 X-Barracuda-Spam-Status: No, SCORE=-1.42 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=BSF_SC5_MJ1963, RDNS_DYNAMIC X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29463 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.10 RDNS_DYNAMIC Delivered to trusted network by host with dynamic-looking rDNS 0.50 BSF_SC5_MJ1963 Custom Rule MJ1963 X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Christian Affolter put forth on 5/9/2010 10:35 AM: > That's a good question ;) Honestly I don't know how this could happen, > all I saw were a bunch of errors from the RAID controller driver. In the > past two other disks failed and the controller reported each failure > correctly and started to rebuild the array automatically by using the > hot-spare disk. So it did its job two times correctly. > [...] > > kernel: arcmsr0: abort device command of scsi id = 1 lun = 0 > kernel: arcmsr0: abort device command of scsi id = 4 lun = 0 > kernel: arcmsr0: abort device command of scsi id = 4 lun = 0 > kernel: arcmsr0: abort device command of scsi id = 4 lun = 0 > kernel: arcmsr0: abort device command of scsi id = 4 lun = 0 > kernel: arcmsr0: abort device command of scsi id = 2 lun = 0 > kernel: arcmsr0: ccb ='0xffff8100bf819080' Ok, that's not good. Looks like the Areca driver is showing communication failure with 3 physical drives simultaneously. Can't be a drive problem. I just read through about 20 Google hits, and it seems this Areca issue is pretty common. One OP said he had a defective card. The rest all report the same or similar errors, across many Areca models, using many different drives, under moderate to high I/O load, under multiple *nix OSes, usually lots of small file copies is the trigger. 
I've read plenty of less than flattering things about Areca RAID cards in the past. This is just more of the same. From everything I've read on this, the problem is either: A. A defective Areca card B. Firmware issue (card and/or drives) C. Driver issue D. More than one of the above > Areca Technology Corp. ARC-1160 16-Port PCI-X to SATA RAID Controller > Firmware Version : V1.42 2006-10-13 That's a 16 port card. How many total drives do you have connected? Are they all the same model/firmware rev? If different models, do you at least have identical models in each RAID pack? Mixing different brands/models/firmware revs within a RAID pack is always a very bad idea. In fact, using anything but identical drives/firmware on a single controller card is a bad idea. Some cards are more finicky than others, but almost all of them will have problems of one kind or another with a mixed bag o' drives. They can have problems with all identical drives if the drive firmware isn't to the card firmware's liking (see below). > After the hard reset, one disk was reported as 'faild' and the rebuild > started. Unfortunately the errors reported don't point to one bad drive; they implicate several drives at once. None of the drives are bad. The controller/firmware/driver have a problem, or have a problem with the drive(s) firmware. The Areca firmware marked one drive as bad because the logic says something besides the card/firmware/driver _must_ be bad. So, it marked one of the drives as bad and started rebuilding it. Back in the late '90s I had Mylex DAC960 cards doing exactly the same thing due to a problem with firmware on the Seagate ST118202 Cheetah drives. The DAC960 would just kick a drive offline willy nilly. This was with 8 identical firmware drives in RAID5 arrays on a single SCSI channel. Was really annoying. I was at customer sites twice weekly replacing and rebuilding drives until Seagate finally admitted the firmware bug and advance shipped us 50 new 3 series Cheetah drives.
That was really fun replacing drives one by one and rebuilding the arrays after each drive swap. We lost a lot of labor $$ over that and had some less than happy customers. Once all the drives were replaced with the 3 series, we never had another problem with any of those arrays. I'm still surprised I was able to rebuild the arrays without issues after adding each new drive, which was a slightly different size with a different firmware. I was just sure the rebuilds would puke. I got lucky. These systems were in production, thus the reason we didn't restore from tape, which would have saved a lot of time. >> What is the status of the RAID6 volume as reported by the RAID card BIOS? > > By now, the rebuild finished, therefor the volume is in normal > non-degraded state. That's good. >> What is the status of each of your EVMS volumes as reported by the EVMS UI? > > They're all active. Do you need more informations here? There are > approximately 45 active volumes on this server. No. Just wanted to know if they're all reported as healthy. >> I'm asking all of these questions because it seems rather clear that the >> root cause of your problem lies at a layer well below the XFS filesystem. > > Yes, I never blamed XFS for being the cause of the problem. I should have worded that differently. I didn't mean to imply that you were blaming XFS. I meant that I wanted to help you figure out the root cause which wasn't XFS. >> You have two layers of physical disk abstraction below XFS: a hardware >> RAID6 and a software logical volume manager. You've apparently suffered a >> storage system hardware failure, according to your description. You haven't >> given any details of the current status of the hardware RAID, or of the >> logical volumes, merely that XFS is having problems. I think a "Well duh!" >> is in order. >> >> Please provide _detailed_ information from the RAID card BIOS and the EVMS >> UI. 
Even if the problem isn't XFS related I for one would be glad to assist >> you in getting this fixed. Right now we don't have enough information. At >> least I don't. On second read, this looks rather preachy and antagonistic. I truly did not intend that tone. Please accept my apology if this came across that way. I think I was starting to get frustrated because I wanted to troubleshoot this further but didn't feel I had enough info. Again, this was less than professional, and I apologize. -- Stan From BATV+7e82a9e4c1b177193ee4+2450+infradead.org+hch@bombadil.srs.infradead.org Sun May 9 12:48:39 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o49Hmb9C231767 for ; Sun, 9 May 2010 12:48:38 -0500 X-ASG-Debug-ID: 1273427463-188302770000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 0D1468FB1B5 for ; Sun, 9 May 2010 10:51:03 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id mHKBod0nkVTuvSCp for ; Sun, 09 May 2010 10:51:03 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1OBAeO-0000cP-CF for xfs@oss.sgi.com; Sun, 09 May 2010 17:50:48 +0000 Date: Sun, 9 May 2010 13:50:48 -0400 From: Christoph Hellwig To: xfs@oss.sgi.com X-ASG-Orig-Subj: [PATCH] xfs: add discard support (at transaction commit) Subject: [PATCH] xfs: add discard support (at transaction commit) Message-ID: <20100509175048.GA1435@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: 
SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1273427464 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Now that we have reliable tracking of deleted extents in a transaction we can easily implement "online" discard support which calls blkdev_issue_discard once a transaction commits. We simply have to walk the list of busy extents after transaction commit, but before deleting them from the rbtree tracking these busy extents. This does not replace my background discard support patch, which is probably better for thin provisioned arrays - I will update it to apply on top of this patch when I'm ready to re-submit it. [Note: this patch needs Dave's delayed-logging series and the patch titled "xfs: simplify log item descriptor tracking" applied] Signed-off-by: Christoph Hellwig Index: xfs/fs/xfs/xfs_alloc.c =================================================================== --- xfs.orig/fs/xfs/xfs_alloc.c 2010-05-09 19:08:34.544262404 +0200 +++ xfs/fs/xfs/xfs_alloc.c 2010-05-09 19:09:24.975032203 +0200 @@ -2761,3 +2761,25 @@ xfs_alloc_busy_clear( kmem_free(busyp); } + +int +xfs_discard_extent( + struct xfs_mount *mp, + struct xfs_busy_extent *busyp) +{ + int error = 0; + xfs_daddr_t bno; + __int64_t len; + + if ((mp->m_flags & XFS_MOUNT_DISCARD) == 0) + return 0; + + bno = XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno); + len = XFS_FSB_TO_BB(mp, busyp->length); + + error = -blkdev_issue_discard(mp->m_ddev_targp->bt_bdev, bno, len, + GFP_NOFS, DISCARD_FL_BARRIER); + if (error && error != EOPNOTSUPP) + xfs_fs_cmn_err(CE_NOTE, mp, "discard failed, error %d", error); + return error; +} Index: xfs/fs/xfs/xfs_alloc.h =================================================================== --- xfs.orig/fs/xfs/xfs_alloc.h 2010-05-09
19:08:34.551265756 +0200 +++ xfs/fs/xfs/xfs_alloc.h 2010-05-09 19:09:16.213023263 +0200 @@ -128,6 +128,9 @@ xfs_alloc_busy_insert(xfs_trans_t *tp, void xfs_alloc_busy_clear(struct xfs_mount *mp, struct xfs_busy_extent *busyp); +int +xfs_discard_extent(struct xfs_mount *mp, struct xfs_busy_extent *busyp); + #endif /* __KERNEL__ */ /* Index: xfs/fs/xfs/linux-2.6/xfs_super.c =================================================================== --- xfs.orig/fs/xfs/linux-2.6/xfs_super.c 2010-05-09 19:08:34.562254302 +0200 +++ xfs/fs/xfs/linux-2.6/xfs_super.c 2010-05-09 19:09:16.223693748 +0200 @@ -120,6 +120,8 @@ mempool_t *xfs_ioend_pool; #define MNTOPT_DMI "dmi" /* DMI enabled (DMAPI / XDSM) */ #define MNTOPT_DELAYLOG "delaylog" /* Delayed loging enabled */ #define MNTOPT_NODELAYLOG "nodelaylog" /* Delayed loging disabled */ +#define MNTOPT_DISCARD "discard" /* Discard unused blocks */ +#define MNTOPT_NODISCARD "nodiscard" /* Do not discard unused blocks */ /* * Table driven mount option parser. 
@@ -382,6 +384,10 @@ xfs_parseargs( "- use at your own risk.\n"); } else if (!strcmp(this_char, MNTOPT_NODELAYLOG)) { mp->m_flags &= ~XFS_MOUNT_DELAYLOG; + } else if (!strcmp(this_char, MNTOPT_DISCARD)) { + mp->m_flags |= XFS_MOUNT_DISCARD; + } else if (!strcmp(this_char, MNTOPT_NODISCARD)) { + mp->m_flags &= ~XFS_MOUNT_DISCARD; } else if (!strcmp(this_char, "ihashsize")) { cmn_err(CE_WARN, "XFS: ihashsize no longer used, option is deprecated."); Index: xfs/fs/xfs/xfs_mount.h =================================================================== --- xfs.orig/fs/xfs/xfs_mount.h 2010-05-09 19:08:34.571254232 +0200 +++ xfs/fs/xfs/xfs_mount.h 2010-05-09 19:09:16.231005734 +0200 @@ -274,6 +274,7 @@ typedef struct xfs_mount { #define XFS_MOUNT_FS_SHUTDOWN (1ULL << 4) /* atomic stop of all filesystem operations, typically for disk errors in metadata */ +#define XFS_MOUNT_DISCARD (1ULL << 5) /* discard unused blocks */ #define XFS_MOUNT_RETERR (1ULL << 6) /* return alignment errors to user */ #define XFS_MOUNT_NOALIGN (1ULL << 7) /* turn off stripe alignment Index: xfs/fs/xfs/xfs_log_cil.c =================================================================== --- xfs.orig/fs/xfs/xfs_log_cil.c 2010-05-09 19:08:34.580004407 +0200 +++ xfs/fs/xfs/xfs_log_cil.c 2010-05-09 19:09:16.241255699 +0200 @@ -415,6 +415,7 @@ xlog_cil_committed( int abort) { struct xfs_cil_ctx *ctx = args; + struct xfs_mount *mp = ctx->cil->xc_log->l_mp; struct xfs_log_vec *lv; int abortflag = abort ? 
XFS_LI_ABORTED : 0; struct xfs_busy_extent *busyp, *n; @@ -425,8 +426,10 @@ xlog_cil_committed( abortflag); } - list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list) - xfs_alloc_busy_clear(ctx->cil->xc_log->l_mp, busyp); + list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list) { + xfs_discard_extent(mp, busyp); + xfs_alloc_busy_clear(mp, busyp); + } spin_lock(&ctx->cil->xc_cil_lock); list_del(&ctx->committing); Index: xfs/fs/xfs/xfs_trans.c =================================================================== --- xfs.orig/fs/xfs/xfs_trans.c 2010-05-09 19:08:34.596025010 +0200 +++ xfs/fs/xfs/xfs_trans.c 2010-05-09 19:09:16.252005804 +0200 @@ -1398,6 +1398,7 @@ xfs_trans_committed( int abortflag) { struct xfs_log_item_desc *lidp, *next; + struct xfs_busy_extent *busyp; /* Call the transaction's completion callback if there is one. */ if (tp->t_callback != NULL) @@ -1408,6 +1409,9 @@ xfs_trans_committed( xfs_trans_free_item_desc(lidp); } + list_for_each_entry(busyp, &tp->t_busy, list) + xfs_discard_extent(tp->t_mountp, busyp); + xfs_trans_free(tp); } From roger@filmlight.ltd.uk Sun May 9 13:01:45 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o49I1jYp231964 for ; Sun, 9 May 2010 13:01:45 -0500 X-ASG-Debug-ID: 1273428235-331c02160000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from b.mx.filmlight.ltd.uk (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with SMTP id C3734323B11 for ; Sun, 9 May 2010 11:03:55 -0700 (PDT) Received: from b.mx.filmlight.ltd.uk (bongo.filmlight.ltd.uk [77.107.81.251]) by cuda.sgi.com with SMTP id XB1sxCk9Y6w8qwyB for ; Sun, 09 May 2010 11:03:55 -0700 (PDT) Received: (dqd 12711 invoked from network); 9 May 2010 18:03:53 -0000 Received: 
from centprod.demon.co.uk (HELO ?192.168.1.102?) (roger@62.49.60.134) by b.mx.filmlight.ltd.uk with SMTP; 9 May 2010 18:03:53 -0000 Cc: xfs@oss.sgi.com Message-Id: From: Roger Willcocks To: Christian Affolter In-Reply-To: <4BE55A63.8070203@purplehaze.ch> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes Content-Transfer-Encoding: 7bit Mime-Version: 1.0 (Apple Message framework v936) X-ASG-Orig-Subj: Re: failed to read root inode Subject: Re: failed to read root inode Date: Sun, 9 May 2010 19:03:52 +0100 References: <4BE55A63.8070203@purplehaze.ch> X-Mailer: Apple Mail (2.936) X-Barracuda-Connect: bongo.filmlight.ltd.uk[77.107.81.251] X-Barracuda-Start-Time: 1273428236 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29466 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean The original xfs_repair log looks reasonably sane, and suggests that the block(s) containing inodes 128-191 had been zeroed out. xfs_repair claims to have fixed all that up, and rebuilt the root directory amongst others. 
However xfs_check still complains, and correspondence off-list shows that xfs_repair -n detects the same 'bad magic number 0xfeed' on alternating inodes as xfs_check: bad magic number 0xfeed on inode 128 bad version number 0x0 on inode 128 bad magic number 0x0 on inode 129 bad version number 0x0 on inode 129 bad (negative) size -8161755683656211562 on inode 129 bad magic number 0xfeed on inode 130 bad version number 0x0 on inode 130 bad magic number 0x0 on inode 131 bad version number 0x0 on inode 131 bad (negative) size -8161755683656211562 on inode 131 This pattern corresponds to the corruption described in http://oss.sgi.com/archives/xfs/2008-01/msg00696.html xfs_check also says: block 0/8 expected type unknown got log block 0/9 expected type unknown got log block 0/10 expected type unknown got log block 0/11 expected type unknown got log Could xfs_repair (or remounting the disk) have written log data over those blocks? -- Roger On 8 May 2010, at 13:34, Christian Affolter wrote: > Hi > > After a disk crash within a hardware RAID-6 controller and kernel > freeze, I'm unable to mount an XFS filesystem on top of an EVMS > volume: > > Filesystem "dm-13": Disabling barriers, not supported by the > underlying > device > XFS mounting filesystem dm-13 > Starting XFS recovery on filesystem: dm-13 (logdev: internal) > XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1599 of file > fs/xfs/xfs_alloc.c. 
Caller 0xffffffff8035c58d > Pid: 13473, comm: mount Not tainted 2.6.26-gentoo #1 > > Call Trace: > [] xfs_free_extent+0xcd/0x110 > [] xfs_free_ag_extent+0x4e3/0x740 > [] xfs_free_extent+0xcd/0x110 > [] xlog_recover_process_efi+0x18d/0x1d0 > [] xlog_recover_process_efis+0x60/0xa0 > [] xlog_recover_finish+0x23/0xf0 > [] xfs_mountfs+0x4da/0x680 > [] kmem_alloc+0x58/0x100 > [] kmem_zalloc+0x2b/0x40 > [] xfs_mount+0x36d/0x3a0 > [] xfs_fs_fill_super+0xbd/0x220 > [] get_sb_bdev+0x141/0x180 > [] xfs_fs_fill_super+0x0/0x220 > [] vfs_kern_mount+0x56/0xc0 > [] do_kern_mount+0x53/0x110 > [] do_new_mount+0x9b/0xe0 > [] do_mount+0x1e6/0x220 > [] __get_free_pages+0x15/0x60 > [] sys_mount+0x9b/0x100 > [] system_call_after_swapgs+0x7b/0x80 > > Filesystem "dm-13": XFS internal error xfs_trans_cancel at line 1163 > of > file fs/xfs/xfs_trans.c. Caller 0xffffffff80395eb1 > Pid: 13473, comm: mount Not tainted 2.6.26-gentoo #1 > > Call Trace: > [] xlog_recover_process_efi+0x1a1/0x1d0 > [] xfs_trans_cancel+0x126/0x150 > [] xlog_recover_process_efi+0x1a1/0x1d0 > [] xlog_recover_process_efis+0x60/0xa0 > [] xlog_recover_finish+0x23/0xf0 > [] xfs_mountfs+0x4da/0x680 > [] kmem_alloc+0x58/0x100 > [] kmem_zalloc+0x2b/0x40 > [] xfs_mount+0x36d/0x3a0 > [] xfs_fs_fill_super+0xbd/0x220 > [] get_sb_bdev+0x141/0x180 > [] xfs_fs_fill_super+0x0/0x220 > [] vfs_kern_mount+0x56/0xc0 > [] do_kern_mount+0x53/0x110 > [] do_new_mount+0x9b/0xe0 > [] do_mount+0x1e6/0x220 > [] __get_free_pages+0x15/0x60 > [] sys_mount+0x9b/0x100 > [] system_call_after_swapgs+0x7b/0x80 > > xfs_force_shutdown(dm-13,0x8) called from line 1164 of file > fs/xfs/xfs_trans.c. Return address = 0xffffffff8039fd2f > Filesystem "dm-13": Corruption of in-memory data detected. 
Shutting > down filesystem: dm-13 > Please umount the filesystem, and rectify the problem(s) > Failed to recover EFIs on filesystem: dm-13 > XFS: log mount finish failed > > > I tried to repair the filesystem with the help of xfs_repair many > times, > without any luck: > Filesystem "dm-13": Disabling barriers, not supported by the > underlying > device > XFS mounting filesystem dm-13 > XFS: failed to read root inode > > xfs_check output: > cache_node_purge: refcount was 1, not zero (node=0x820010) > xfs_check: cannot read root inode (117) > cache_node_purge: refcount was 1, not zero (node=0x8226b0) > xfs_check: cannot read realtime bitmap inode (117) > block 0/8 expected type unknown got log > block 0/9 expected type unknown got log > block 0/10 expected type unknown got log > block 0/11 expected type unknown got log > bad magic number 0xfeed for inode 128 > [...] > > Are there any other ways to fix the unreadable root inode or to > restore > the remaining data? > > > Environment informations: > Linux Kernel: 2.6.26-gentoo (x86_64) > xfsprogs: 3.0.3 > > Attached you'll find the xfs_repair and xfs_check output. 
> > > Thanks in advance and kind regards > Christian > < > xfs_repair > .log>_______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs From rfu@kaneda.iguw.tuwien.ac.at Sun May 9 13:45:52 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.8 required=5.0 tests=BAYES_00,J_CHICKENPOX_45 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o49IjpTJ232811 for ; Sun, 9 May 2010 13:45:52 -0500 X-ASG-Debug-ID: 1273430881-4b3601fe0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mx02.kabsi.at (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 06A28323FD0 for ; Sun, 9 May 2010 11:48:01 -0700 (PDT) Received: from mx02.kabsi.at (mx02.kabsi.at [62.40.128.130]) by cuda.sgi.com with ESMTP id MDJGXOt21pGPcL10 for ; Sun, 09 May 2010 11:48:01 -0700 (PDT) Received: from 192.168.5.201 (h081217058120.dyn.cm.kabsi.at [81.217.58.120]) by mx02.kabsi.at (8.13.6/8.13.6) with ESMTP id o49Ilw9F013564; Sun, 9 May 2010 20:48:00 +0200 Date: Sun, 9 May 2010 20:48:00 +0200 From: Rainer Fuegenstein X-Mailer: The Bat! 
(v1.62r) Business Reply-To: Rainer Fuegenstein Organization: Vienna University of Technology X-Priority: 3 (Normal) Message-ID: <1816344475.20100509204800@kaneda.iguw.tuwien.ac.at> To: xfs@oss.sgi.com CC: linux-raid@vger.kernel.org X-ASG-Orig-Subj: xfs and raid5 - "Structure needs cleaning for directory open" Subject: xfs and raid5 - "Structure needs cleaning for directory open" MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: quoted-printable X-Barracuda-Connect: mx02.kabsi.at[62.40.128.130] X-Barracuda-Start-Time: 1273430883 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.52 X-Barracuda-Spam-Status: No, SCORE=-1.52 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=BSF_RULE7568M X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29467 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.50 BSF_RULE7568M Custom Rule 7568M X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean today in the morning some daemon processes terminated because of errors in the xfs file system on top of a software raid5, consisting of 4*1.5TB WD caviar green SATA disks. current OS is centos 5.4, kernel is: Linux alfred 2.6.18-164.15.1.el5xen #1 SMP Wed Mar 17 12:04:23 EDT 2010 x86= _64 x86_64 x86_64 GNU/Linux the history: this raid was originally created on an ASUS M2N-X plus mainboard with all 4 drives connected to the on-board controller. (centos 5.4, current i386 kernel). it worked fine first, but after some months problems occured when copying files via SMB, in these situations dmesg showed a stack trace, starting with an interrupt problem deep in the kernel and reaching up to the xfs filesystem code. 
a few months ago the weekly raid check (/etc/cron.weekly/99-raid-check)
started a re-sync of the raid which (on the M2N-X board) took about 2.5
to 3 days to complete.

to overcome the interrupt problems, I recently bought an intel D510 atom
mainboard and a "Promise Technology, Inc. PDC40718 (SATA 300 TX4)
(rev 02)" sata controller, reinstalled centos 5.4 from scratch (x86_64
version) and attached the 4 sata disks. this worked fine until this
sunday night, when the 99-raid-check started again at 4:00 in the
morning; it lasted until just now (19:00).

around 12:00 noon (resync at about 50%) I noticed the first problems,
namely "Structure needs cleaning for directory open" messages. at this
time, a "du -sh *" revealed that around 50% of the data stored on the
xfs was lost (due to directories that couldn't be read because of the
"needs cleaning ..." error). a daring xfs_repair on the unmounted, but
still syncing filesystem revealed & fixed no errors (see output below).

after painfully waiting 7 hours for the resync to complete, it looks
like the filesystem is OK and back to normal again: du shows the
expected 3.5TB usage, there are no more "needs cleaning ..." errors and
a quick check of the previously lost directories suggests that the
files within are OK.

I wonder what caused this behaviour (and how to prevent it in the
future):

1) damage done to the xfs filesystem on the old board? shouldn't
   xfs_repair find & repair it?
2) does a re-syncing raid deliver bad/corrupt data to the filesystem
   layer above?
3) may this be a hardware/memory problem, since xfs reports "Corruption
   of in-memory data detected"?
4) is the Promise SATA controller to blame?

here's some output that may help. please let me know if you need more:

*** this is where it started:

May 9 04:22:01 alfred kernel: md: syncing RAID array md0
May 9 04:22:01 alfred kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
May 9 04:22:01 alfred kernel: md: using maximum available idle IO bandwidt= h (but not more than 200000 KB/sec) for reconstructio n. May 9 04:22:01 alfred kernel: md: using 128k window, over a total of 14651= 35936 blocks. May 9 04:24:06 alfred kernel: XFS internal error XFS_WANT_CORRUPTED_GOTO a= t line 4565 of file fs/xfs/xfs_bmap.c. Caller 0xffff ffff8835dba8 May 9 04:24:06 alfred kernel: May 9 04:24:06 alfred kernel: Call Trace: May 9 04:24:06 alfred kernel: [] :xfs:xfs_bmap_read_ext= ents+0x361/0x384 May 9 04:24:06 alfred kernel: [] :xfs:xfs_iread_extents= +0xac/0xc8 May 9 04:24:06 alfred kernel: [] :xfs:xfs_bmapi+0x226/0= xe79 May 9 04:24:06 alfred kernel: [] generic_make_request+0= x211/0x228 May 9 04:24:06 alfred kernel: [] :raid456:handle_stripe= +0x20a6/0x21ff May 9 04:24:06 alfred kernel: [] :xfs:xfs_iomap+0x144/0= x2a5 May 9 04:24:06 alfred kernel: [] :xfs:__xfs_get_blocks+= 0x7a/0x1bf May 9 04:24:06 alfred kernel: [] :raid456:make_request+= 0x4ba/0x4f4 May 9 04:24:06 alfred kernel: [] autoremove_wake_functi= on+0x0/0x2e May 9 04:24:06 alfred kernel: [] do_mpage_readpage+0x16= 7/0x474 May 9 04:24:06 alfred kernel: [] :xfs:xfs_get_blocks+0x= 0/0xe May 9 04:24:06 alfred kernel: [] :xfs:xfs_get_blocks+0x= 0/0xe May 9 04:24:06 alfred kernel: [] add_to_page_cache+0xb9= /0xc5 May 9 04:24:06 alfred kernel: [] :xfs:xfs_get_blocks+0x= 0/0xe May 9 04:24:06 alfred kernel: [] mpage_readpages+0x91/0= xd9 May 9 04:24:06 alfred kernel: [] :xfs:xfs_get_blocks+0x= 0/0xe May 9 04:24:06 alfred kernel: [] __alloc_pages+0x65/0x2= ce May 9 04:24:06 alfred kernel: [] __do_page_cache_readah= ead+0x130/0x1ab May 9 04:24:06 alfred kernel: [] blockable_page_cache_r= eadahead+0x53/0xb2 May 9 04:24:06 alfred kernel: [] page_cache_readahead+0= xd6/0x1af May 9 04:24:06 alfred kernel: [] do_generic_mapping_rea= d+0xc6/0x38a May 9 04:24:06 alfred kernel: [] file_read_actor+0x0/0x= 101 May 9 04:24:06 alfred kernel: [] __generic_file_aio_rea= d+0x14c/0x198 May 9 04:24:06 alfred kernel: 
[] :xfs:xfs_read+0x187/0x= 209 May 9 04:24:06 alfred kernel: [] :xfs:xfs_file_aio_read= +0x63/0x6b May 9 04:24:06 alfred kernel: [] do_sync_read+0xc7/0x104 May 9 04:24:06 alfred kernel: [] __dentry_open+0x101/0x= 1dc May 9 04:24:06 alfred kernel: [] autoremove_wake_functi= on+0x0/0x2e May 9 04:24:06 alfred kernel: [] do_filp_open+0x2a/0x38 May 9 04:24:06 alfred kernel: [] vfs_read+0xcb/0x171 May 9 04:24:06 alfred kernel: [] sys_read+0x45/0x6e May 9 04:24:06 alfred kernel: [] ia32_sysret+0x0/0x5 May 9 04:24:06 alfred kernel: May 9 04:24:06 alfred kernel: XFS internal error XFS_WANT_CORRUPTED_GOTO a= t line 4565 of file fs/xfs/xfs_bmap.c. Caller 0xffff ffff8835dba8 *** (many, many more) May 9 06:19:16 alfred kernel: Filesystem "md0": corrupt dinode 1610637790,= (btree extents). Unmount and run xfs_repair. May 9 06:19:16 alfred kernel: Filesystem "md0": XFS internal error xfs_bma= p_read_extents(1) at line 4560 of file fs/xfs/xfs_bma p.c. Caller 0xffffffff8835dba8 May 9 06:19:16 alfred kernel: May 9 06:19:16 alfred kernel: Call Trace: May 9 06:19:16 alfred kernel: [] :xfs:xfs_bmap_read_ext= ents+0x361/0x384 May 9 06:19:16 alfred kernel: [] :xfs:xfs_iread_extents= +0xac/0xc8 May 9 06:19:16 alfred kernel: [] :xfs:xfs_bmapi+0x226/0= xe79 May 9 06:19:16 alfred kernel: [] :ip_conntrack:tcp_pkt_= to_tuple+0x0/0x61 May 9 06:19:16 alfred kernel: [] :ip_conntrack:__ip_con= ntrack_find+0xd/0xb7 May 9 06:19:16 alfred kernel: [] lock_timer_base+0x1b/0= x3c May 9 06:19:16 alfred kernel: [] __mod_timer+0xb0/0xbe May 9 06:19:16 alfred kernel: [] :ip_conntrack:__ip_ct_= refresh_acct+0x10f/0x152 May 9 06:19:16 alfred kernel: [] :ip_conntrack:tcp_pack= et+0xa5f/0xa9f May 9 06:19:16 alfred kernel: [] :xfs:xfs_iomap+0x144/0= x2a5 May 9 06:19:16 alfred kernel: [] :xfs:__xfs_get_blocks+= 0x7a/0x1bf May 9 06:19:16 alfred kernel: [] alloc_buffer_head+0x31= /0x36 May 9 06:19:16 alfred kernel: [] alloc_page_buffers+0x8= 1/0xd3 May 9 06:19:16 alfred kernel: [] __block_prepare_write+= 
0x1ad/0x375 May 9 06:19:16 alfred kernel: [] :xfs:xfs_get_blocks+0x= 0/0xe May 9 06:19:16 alfred kernel: [] add_to_page_cache_lru+= 0x1c/0x22 May 9 06:19:16 alfred kernel: [] block_write_begin+0x80= /0xcf May 9 06:19:16 alfred kernel: [] :xfs:xfs_vm_write_begi= n+0x19/0x1e May 9 06:19:16 alfred kernel: [] :xfs:xfs_get_blocks+0x= 0/0xe May 9 06:19:16 alfred kernel: [] generic_file_buffered_= write+0x14b/0x60c May 9 06:19:16 alfred kernel: [] __d_lookup+0xb0/0xff May 9 06:19:16 alfred kernel: [] _spin_lock_irqsave+0x9= /0x14 May 9 06:19:16 alfred kernel: [] :xfs:xfs_write+0x49e/0= x69e May 9 06:19:16 alfred kernel: [] mntput_no_expire+0x19/= 0x89 May 9 06:19:16 alfred kernel: [] link_path_walk+0xa6/0x= b2 May 9 06:19:16 alfred kernel: [] :xfs:xfs_file_aio_writ= e+0x65/0x6a May 9 06:19:16 alfred kernel: [] do_sync_write+0xc7/0x1= 04 May 9 06:19:16 alfred kernel: [] __dentry_open+0x101/0x= 1dc May 9 06:19:16 alfred kernel: [] autoremove_wake_functi= on+0x0/0x2e May 9 06:19:16 alfred kernel: [] do_filp_open+0x2a/0x38 May 9 06:19:16 alfred kernel: [] vfs_write+0xce/0x174 May 9 06:19:16 alfred kernel: [] sys_write+0x45/0x6e May 9 06:19:16 alfred kernel: [] ia32_sysret+0x0/0x5 *** also many, many more, always the same dinode May 9 12:53:32 alfred kernel: Filesystem "md0": XFS internal error xfs_btr= ee_check_sblock at line 307 of file fs/xfs/xfs_btree. c. 
Caller 0xffffffff88358eb7 May 9 12:53:32 alfred kernel: May 9 12:53:32 alfred kernel: Call Trace: May 9 12:53:32 alfred kernel: [] :xfs:xfs_btree_check_s= block+0xaf/0xbe May 9 12:53:32 alfred kernel: [] :xfs:xfs_inobt_increme= nt+0x156/0x17e May 9 12:53:32 alfred kernel: [] :xfs:xfs_dialloc+0x4d0= /0x80c May 9 12:53:32 alfred kernel: [] find_or_create_page+0x= 3f/0xab May 9 12:53:32 alfred kernel: [] :xfs:xfs_ialloc+0x5f/0= x57f May 9 12:53:32 alfred kernel: [] :ext3:ext3_get_acl+0x6= 3/0x310 May 9 12:53:32 alfred kernel: [] kmem_cache_alloc+0x62/= 0x6d May 9 12:53:32 alfred kernel: [] :xfs:xfs_dir_ialloc+0x= 86/0x2b7 May 9 12:53:32 alfred kernel: [] :xfs:xlog_grant_log_sp= ace+0x204/0x25c May 9 12:53:32 alfred kernel: [] :xfs:xfs_create+0x237/= 0x45c May 9 12:53:32 alfred kernel: [] :xfs:xfs_attr_get+0x8e= /0x9f May 9 12:53:32 alfred kernel: [] :xfs:xfs_vn_mknod+0x14= 4/0x215 May 9 12:53:32 alfred kernel: [] vfs_create+0xe6/0x158 May 9 12:53:32 alfred kernel: [] open_namei+0x1a1/0x6ed May 9 12:53:32 alfred kernel: [] do_filp_open+0x1c/0x38 May 9 12:53:32 alfred kernel: [] do_sys_open+0x44/0xbe May 9 12:53:32 alfred kernel: [] ia32_sysret+0x0/0x5 May 9 12:53:32 alfred kernel: *** also many, many more May 9 13:44:35 alfred kernel: 00000000: ff ff ff ff ff ff ff ff 00 00 00 0= 0 00 00 00 00 =FF=FF=FF=FF=FF=FF=FF=FF........ May 9 13:44:35 alfred kernel: Filesystem "md0": XFS internal error xfs_da_= do_buf(2) at line 2112 of file fs/xfs/xfs_da_btree.c. 
Caller 0xffffffff8834b82e May 9 13:44:35 alfred kernel: May 9 13:44:35 alfred kernel: Call Trace: May 9 13:44:35 alfred kernel: [] :xfs:xfs_da_do_buf+0x5= 03/0x5b1 May 9 13:44:35 alfred kernel: [] :xfs:xfs_da_read_buf+0= x16/0x1b May 9 13:44:35 alfred kernel: [] _atomic_dec_and_lock+0= x39/0x57 May 9 13:44:35 alfred kernel: [] :xfs:xfs_da_read_buf+0= x16/0x1b May 9 13:44:35 alfred kernel: [] :xfs:xfs_dir2_leaf_get= dents+0x354/0x5ec May 9 13:44:35 alfred kernel: [] :xfs:xfs_dir2_leaf_get= dents+0x354/0x5ec May 9 13:44:35 alfred kernel: [] :xfs:xfs_hack_filldir+= 0x0/0x5b May 9 13:44:35 alfred kernel: [] :xfs:xfs_hack_filldir+= 0x0/0x5b May 9 13:44:35 alfred kernel: [] :xfs:xfs_readdir+0xa7/= 0xb6 May 9 13:44:35 alfred kernel: [] :xfs:xfs_file_readdir+= 0xff/0x14c May 9 13:44:35 alfred kernel: [] filldir+0x0/0xb7 May 9 13:44:35 alfred kernel: [] filldir+0x0/0xb7 May 9 13:44:35 alfred kernel: [] vfs_readdir+0x77/0xa9 May 9 13:44:35 alfred kernel: [] sys_getdents+0x75/0xbd May 9 13:44:35 alfred kernel: [] tracesys+0x47/0xb6 May 9 13:44:35 alfred kernel: [] tracesys+0xab/0xb6 May 9 13:44:35 alfred kernel: May 9 13:51:24 alfred kernel: Filesystem "md0": Disabling barriers, trial = barrier write failed May 9 13:51:24 alfred kernel: XFS mounting filesystem md0 *** these xfs_da_do_buf errors appear at a rate of about 5 per second until 14:40 o'clock, then stop. file system was still mounted, maybe one daemon was still accessing it. *** xfs_repair performed when raid was at 50% resync and filesystem was corrupted: [root@alfred ~]# xfs_repair /dev/md0 Phase 1 - find and verify superblock... Phase 2 - using internal log - zero log... - scan filesystem freespace and inode maps... - found root inode chunk Phase 3 - for each AG... - scan and clear agi unlinked lists... - process known inodes and perform inode discovery... - agno =3D 0 - agno =3D 1 - agno =3D 2 - agno =3D 3 - agno =3D 4 - agno =3D 5 - agno =3D 6 - agno =3D 7 [...] 
- agno =3D 62 - agno =3D 63 - process newly discovered inodes... Phase 4 - check for duplicate blocks... - setting up duplicate extent list... - check for inodes claiming duplicate blocks... - agno =3D 0 - agno =3D 1 - agno =3D 2 - agno =3D 3 [...] - agno =3D 61 - agno =3D 62 - agno =3D 63 Phase 5 - rebuild AG headers and trees... - reset superblock... Phase 6 - check inode connectivity... - resetting contents of realtime bitmap and summary inodes - traversing filesystem ... - traversal finished ... - moving disconnected inodes to lost+found ... Phase 7 - verify and correct link counts... done raid output after sync was finished: [root@alfred md]# cat /sys/block/md0/md/array_state clean [root@alfred md]# cat /sys/block/md0/md/mismatch_cnt 0 tnx & cu From rfu@kaneda.iguw.tuwien.ac.at Sun May 9 14:05:06 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o49J55R0233142 for ; Sun, 9 May 2010 14:05:05 -0500 X-ASG-Debug-ID: 1273432036-585a03790000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mx02.kabsi.at (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id DAAB213957BB for ; Sun, 9 May 2010 12:07:16 -0700 (PDT) Received: from mx02.kabsi.at (mx02.kabsi.at [62.40.128.130]) by cuda.sgi.com with ESMTP id B3upRdCDNJR3fPnx for ; Sun, 09 May 2010 12:07:16 -0700 (PDT) Received: from 192.168.5.201 (h081217058120.dyn.cm.kabsi.at [81.217.58.120]) by mx02.kabsi.at (8.13.6/8.13.6) with ESMTP id o49J798r025155 for ; Sun, 9 May 2010 21:07:10 +0200 Date: Sun, 9 May 2010 21:07:10 +0200 From: Rainer Fuegenstein X-Mailer: The Bat! 
(v1.62r) Business Reply-To: Rainer Fuegenstein Organization: Vienna University of Technology X-Priority: 3 (Normal) Message-ID: <462402327.20100509210710@kaneda.iguw.tuwien.ac.at> To: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: xfs and raid5 - "Structure needs cleaning for directory open" Subject: Re: xfs and raid5 - "Structure needs cleaning for directory open" In-Reply-To: <1816344475.20100509204800@kaneda.iguw.tuwien.ac.at> References: <1816344475.20100509204800@kaneda.iguw.tuwien.ac.at> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Barracuda-Connect: mx02.kabsi.at[62.40.128.130] X-Barracuda-Start-Time: 1273432037 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0002 1.0000 -2.0199 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29469 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean addendum - xfsprogs is of version: xfsprogs-2.9.4-1.el5.centos RF> today in the morning some daemon processes terminated because of RF> errors in the xfs file system on top of a software raid5, consisting RF> of 4*1.5TB WD caviar green SATA disks. [...] 
From SRS0+xPli+68+fromorbit.com=david@internode.on.net Sun May 9 16:57:38 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o49LvbwT236157 for ; Sun, 9 May 2010 16:57:38 -0500 X-ASG-Debug-ID: 1273442402-28b603410000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 06E0697ED36 for ; Sun, 9 May 2010 15:00:03 -0700 (PDT) Received: from mail.internode.on.net (bld-mail17.adl2.internode.on.net [150.101.137.102]) by cuda.sgi.com with ESMTP id E7uPCgbbx6EMBlny for ; Sun, 09 May 2010 15:00:03 -0700 (PDT) Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23482711-1927428 for multiple; Mon, 10 May 2010 07:29:45 +0930 (CST) Received: from dave by dastard with local (Exim 4.71) (envelope-from ) id 1OBEXI-0001kM-7J; Mon, 10 May 2010 07:59:44 +1000 Date: Mon, 10 May 2010 07:59:44 +1000 From: Dave Chinner To: Christoph Hellwig Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH] xfs: add discard support (at transaction commit) Subject: Re: [PATCH] xfs: add discard support (at transaction commit) Message-ID: <20100509215944.GF25419@dastard> References: <20100509175048.GA1435@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100509175048.GA1435@infradead.org> User-Agent: Mutt/1.5.20 (2009-06-14) X-Barracuda-Connect: bld-mail17.adl2.internode.on.net[150.101.137.102] X-Barracuda-Start-Time: 1273442405 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 
using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29478 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Sun, May 09, 2010 at 01:50:48PM -0400, Christoph Hellwig wrote: > Now that we have reliably tracking of deleted extents in a transaction > we can easily implement "online" discard support which calls > blkdev_issue_discard once a transaction commits. We simply have to > walk the list of busy extents after transaction commit, but before deleting > it from the rbtree tracking these busy extents. > > This does not replace by background discard support patch which is probably > better for thin provisioned arrays - I will updated it to apply ontop of > this patch when I'm ready to re-submit it. I think this can be made to work, but I don't really like it that much, especially the barrier flush part. Is there any particular reason we need to issue discards at this level apart from "other filesystems are doing it" rather than doing it lazily in a non-performance critical piece of code? Regardless of this, some questions about the patch come to mind: 1. is it safe to block the xfslogd in the block layer in, say, get_request()? i.e. should we be issuing IO from an IO completion handler? That raises red flags in my head... 2. issuing discards will block xfslogd and potentially stall the log if there are lots of discards to issue. 3. DISCARD_FL_BARRIER appears to be used to allow async issuing of the discard to ensure any followup write has the discard processed first. What happens if the device does not support barriers or barriers are turned off? To me it appears that a lack of barriers could result in a write being reordered in front of the discard. e.g. 
delalloc results in btree block freed, marked busy. New delalloc occurs, allocates block, marked sync, forces log, issues async discard, completes transaction and then writes data async. Which operation does the drive see and complete first - the discard or the data write? 4. A barrier IO on every discard? In a word: Ouch. Cheers, Dave. -- Dave Chinner david@fromorbit.com From rfu@kaneda.iguw.tuwien.ac.at Sun May 9 18:33:13 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o49NXCnf239450 for ; Sun, 9 May 2010 18:33:13 -0500 X-ASG-Debug-ID: 1273448123-058d01140000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mx05.kabsi.at (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 77CB312CFD0C for ; Sun, 9 May 2010 16:35:23 -0700 (PDT) Received: from mx05.kabsi.at (mx05.kabsi.at [195.202.128.131]) by cuda.sgi.com with ESMTP id kuqLNikG77dPr77h for ; Sun, 09 May 2010 16:35:23 -0700 (PDT) Received: from 192.168.5.201 (h081217058120.dyn.cm.kabsi.at [81.217.58.120]) by mx05.kabsi.at (8.13.8/8.13.8) with ESMTP id o49NZLsv002587 for ; Mon, 10 May 2010 01:35:22 +0200 Date: Mon, 10 May 2010 01:35:11 +0200 From: Rainer Fuegenstein X-Mailer: The Bat! 
(v1.62r) Business Reply-To: Rainer Fuegenstein Organization: Vienna University of Technology X-Priority: 3 (Normal) Message-ID: <1743435018.20100510013511@kaneda.iguw.tuwien.ac.at> To: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: xfs and raid5 - "Structure needs cleaning for directory open" Subject: Re: xfs and raid5 - "Structure needs cleaning for directory open" In-Reply-To: <1816344475.20100509204800@kaneda.iguw.tuwien.ac.at> References: <1816344475.20100509204800@kaneda.iguw.tuwien.ac.at> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: quoted-printable X-Barracuda-Connect: mx05.kabsi.at[195.202.128.131] X-Barracuda-Start-Time: 1273448124 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.52 X-Barracuda-Spam-Status: No, SCORE=-1.52 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=BSF_RULE7568M X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29484 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.50 BSF_RULE7568M Custom Rule 7568M X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean not sure if my original post made it to the list members since it never arrived here, but it is already listed in the archive: http://oss.sgi.com/pipermail/xfs/2010-May/045303.html another update: I restarted the server several times, always with the raid intact (no syncing) and started to backup as much data as possible. 
in the last hours the file system was heavily used (both read and write,
about 2TB copied to another SATA disk); during this time the error
occurred a few times. since the raid is stable, I suspect it is entirely
an XFS (or hardware) problem?

find: ./Hana: Structure needs cleaning

00000000: aa b9 6a 52 fa 62 8c 36 17 19 b8 75 0f 2a 92 f5 ª¹jRúb.6..žu.*.õ
Filesystem "md0": XFS internal error xfs_da_do_buf(2) at line 2112 of file fs/xfs/xfs_da_btree.c. Caller 0xffffffff8834482e

Call Trace:
[] :xfs:xfs_da_do_buf+0x503/0x5b1
[] :xfs:xfs_da_read_buf+0x16/0x1b
[] _atomic_dec_and_lock+0x39/0x57
[] :xfs:xfs_da_read_buf+0x16/0x1b
[] :xfs:xfs_dir2_leaf_getdents+0x354/0x5ec
[] :xfs:xfs_dir2_leaf_getdents+0x354/0x5ec
[] :xfs:xfs_hack_filldir+0x0/0x5b
[] :xfs:xfs_hack_filldir+0x0/0x5b
[] :xfs:xfs_readdir+0xa7/0xb6
[] :xfs:xfs_file_readdir+0xff/0x14c
[] filldir+0x0/0xb7
[] filldir+0x0/0xb7
[] vfs_readdir+0x77/0xa9
[] sys_getdents+0x75/0xbd
[] tracesys+0x47/0xb6
[] tracesys+0xab/0xb6

00000000: ea 4c ea e1 7a f1 2f 88 13 f9 a5 24 08 38 31 4e êLêázñ/..ùŠ$.81N
Filesystem "md0": XFS internal error xfs_da_do_buf(2) at line 2112 of file fs/xfs/xfs_da_btree.c. Caller 0xffffffff8834482e

Call Trace:
[] :xfs:xfs_da_do_buf+0x503/0x5b1
[] :xfs:xfs_da_read_buf+0x16/0x1b
[] _atomic_dec_and_lock+0x39/0x57
[] :xfs:xfs_da_read_buf+0x16/0x1b
[] :xfs:xfs_dir2_leaf_getdents+0x354/0x5ec
[] :xfs:xfs_dir2_leaf_getdents+0x354/0x5ec
[] :xfs:xfs_hack_filldir+0x0/0x5b
[] :xfs:xfs_hack_filldir+0x0/0x5b
[] :xfs:xfs_readdir+0xa7/0xb6
[] :xfs:xfs_file_readdir+0xff/0x14c
[] filldir+0x0/0xb7
[] filldir+0x0/0xb7
[] vfs_readdir+0x77/0xa9
[] sys_getdents+0x75/0xbd
[] tracesys+0x47/0xb6
[] tracesys+0xab/0xb6

00000000: ea 4c ea e1 7a f1 2f 88 13 f9 a5 24 08 38 31 4e êLêázñ/..ùŠ$.81N
Filesystem "md0": XFS internal error xfs_da_do_buf(2) at line 2112 of file fs/xfs/xfs_da_btree.c.
Caller 0xffffffff8834482e Call Trace: [] :xfs:xfs_da_do_buf+0x503/0x5b1 [] :xfs:xfs_da_read_buf+0x16/0x1b [] _atomic_dec_and_lock+0x39/0x57 [] :xfs:xfs_da_read_buf+0x16/0x1b [] :xfs:xfs_dir2_leaf_getdents+0x354/0x5ec [] :xfs:xfs_dir2_leaf_getdents+0x354/0x5ec [] :xfs:xfs_hack_filldir+0x0/0x5b [] :xfs:xfs_hack_filldir+0x0/0x5b [] :xfs:xfs_readdir+0xa7/0xb6 [] :xfs:xfs_file_readdir+0xff/0x14c [] filldir+0x0/0xb7 [] filldir+0x0/0xb7 [] vfs_readdir+0x77/0xa9 [] sys_getdents+0x75/0xbd [] tracesys+0x47/0xb6 [] tracesys+0xab/0xb6 00000000: ea 4c ea e1 7a f1 2f 88 13 f9 a5 24 08 38 31 4e =EAL=EA=E1z=F1/.= .=F9=A5$.81N Filesystem "md0": XFS internal error xfs_da_do_buf(2) at line 2112 of file = fs/xfs/xfs_da_btree.c. Caller 0xffffffff8834482e Call Trace: [] :xfs:xfs_da_do_buf+0x503/0x5b1 [] :xfs:xfs_da_read_buf+0x16/0x1b [] _atomic_dec_and_lock+0x39/0x57 [] :xfs:xfs_da_read_buf+0x16/0x1b [] :xfs:xfs_dir2_leaf_getdents+0x354/0x5ec [] :xfs:xfs_dir2_leaf_getdents+0x354/0x5ec [] :xfs:xfs_hack_filldir+0x0/0x5b [] :xfs:xfs_hack_filldir+0x0/0x5b [] :xfs:xfs_readdir+0xa7/0xb6 [] :xfs:xfs_file_readdir+0xff/0x14c [] filldir+0x0/0xb7 [] filldir+0x0/0xb7 [] vfs_readdir+0x77/0xa9 [] sys_getdents+0x75/0xbd [] tracesys+0x47/0xb6 [] tracesys+0xab/0xb6 please advise. tnx. 
From sandeen@sandeen.net Sun May 9 20:07:08 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o4A177Od241774 for ; Sun, 9 May 2010 20:07:08 -0500 X-ASG-Debug-ID: 1273453758-14e800650000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.sandeen.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id BFC8D325836 for ; Sun, 9 May 2010 18:09:19 -0700 (PDT) Received: from mail.sandeen.net (64-131-60-146.usfamily.net [64.131.60.146]) by cuda.sgi.com with ESMTP id MgM5jsxlbWQGmSON for ; Sun, 09 May 2010 18:09:19 -0700 (PDT) Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.sandeen.net (Postfix) with ESMTP id 9D82BA7B615; Sun, 9 May 2010 20:09:18 -0500 (CDT) Message-ID: <4BE75CBC.2000600@sandeen.net> Date: Sun, 09 May 2010 20:09:16 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.24 (Macintosh/20100228) MIME-Version: 1.0 To: Stan Hoeppner CC: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: failed to read root inode Subject: Re: failed to read root inode References: <4BE55A63.8070203@purplehaze.ch> <4BE5EB5D.5020702@hardwarefreak.com> <20100509152818.7481c1e1@galadriel.home> <4BE6CC83.5070305@hardwarefreak.com> In-Reply-To: <4BE6CC83.5070305@hardwarefreak.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-Barracuda-Connect: 64-131-60-146.usfamily.net[64.131.60.146] X-Barracuda-Start-Time: 1273453759 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.92 X-Barracuda-Spam-Status: No, SCORE=-1.92 using per-user scores of TAG_LEVEL=2.0 
QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=RDNS_DYNAMIC X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29490 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.10 RDNS_DYNAMIC Delivered to trusted network by host with dynamic-looking rDNS X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Stan Hoeppner wrote: > Emmanuel Florac put forth on 5/9/2010 8:28 AM: >> Le Sat, 08 May 2010 17:53:17 -0500 vous écriviez: >> >>> Why did the "crash" of a single disk in a hardware RAID6 cause a >>> kernel freeze? What is your definition of "disk crash"? A single >>> physical disk failure should not have caused this under any >>> circumstances. The RAID card should have handled a single disk >>> failure transparently. >> The RAID array may go west if the disk isn't properly set up, >> particularly if it's a desktop-class drive. > > By design, a RAID6 pack should be able to handle two simultaneous drive > failures before the array goes offline. According to the OP's post he lost > one drive. Unless it's a really crappy RAID card or if he's using a bunch > of dissimilar drives causing problems with the entire array, he shouldn't > have had a problem. > > This is why I'm digging for more information. The information he presented > here doesn't really make any sense. One physical disk failure _shouldn't_ > have caused the problems he's experiencing. I don't think we got the full > story. I tend to agree, something is missing here, which means my suggestions for repair will be unlikely to be terribly successful; I think more is wrong than we know... 
-Eric From SRS0+4l5F+69+fromorbit.com=david@internode.on.net Sun May 9 21:18:29 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-0.4 required=5.0 tests=BAYES_00,FAKE_REPLY_C autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o4A2ISRk243329 for ; Sun, 9 May 2010 21:18:29 -0500 X-ASG-Debug-ID: 1273458039-748201ab0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id C4936139AFBF for ; Sun, 9 May 2010 19:20:39 -0700 (PDT) Received: from mail.internode.on.net (bld-mail19.adl2.internode.on.net [150.101.137.104]) by cuda.sgi.com with ESMTP id cwIPGakiJVTvwfFx for ; Sun, 09 May 2010 19:20:39 -0700 (PDT) Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23540668-1927428 for multiple; Mon, 10 May 2010 11:50:35 +0930 (CST) Received: from dave by dastard with local (Exim 4.71) (envelope-from ) id 1OBIbi-00021a-0K; Mon, 10 May 2010 12:20:34 +1000 Date: Mon, 10 May 2010 12:20:33 +1000 From: Dave Chinner To: Rainer Fuegenstein Cc: xfs@oss.sgi.com, linux-raid@vger.kernel.org X-ASG-Orig-Subj: Re: xfs and raid5 - "Structure needs cleaning for directory open" Subject: Re: xfs and raid5 - "Structure needs cleaning for directory open" Message-ID: <20100510022033.GB7165@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1743435018.20100510013511@kaneda.iguw.tuwien.ac.at> <462402327.20100509210710@kaneda.iguw.tuwien.ac.at> <1816344475.20100509204800@kaneda.iguw.tuwien.ac.at> User-Agent: Mutt/1.5.20 (2009-06-14) X-Barracuda-Connect: bld-mail19.adl2.internode.on.net[150.101.137.104] X-Barracuda-Start-Time: 1273458040 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 
X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29496 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Sun, May 09, 2010 at 08:48:00PM +0200, Rainer Fuegenstein wrote: > > today in the morning some daemon processes terminated because of > errors in the xfs file system on top of a software raid5, consisting > of 4*1.5TB WD caviar green SATA disks. Reminds me of a recent(-ish) md/dm readahead cancellation fix - that would fit the symptoms of (btree corruption showing up under heavy IO load but no corruption on disk. However, I can't seem to find any references to it at the moment (can't remember the bug title), but perhaps your distro doesn't have the fix in it? Cheers, Dave. 
-- Dave Chinner david@fromorbit.com From goodwinos@gmail.com Mon May 10 01:51:48 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM, T_TO_NO_BRKTS_FREEMAIL autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o4A6plnp252709 for ; Mon, 10 May 2010 01:51:47 -0500 X-ASG-Debug-ID: 1273474455-5bf700590000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mx1.redhat.com (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 93ACE97E7CA for ; Sun, 9 May 2010 23:54:16 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by cuda.sgi.com with ESMTP id heFQw9eDfrPRgMHg for ; Sun, 09 May 2010 23:54:16 -0700 (PDT) X-ASG-Whitelist: Barracuda Reputation Received: from int-mx04.intmail.prod.int.phx2.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.17]) by mx1.redhat.com (8.13.8/8.13.8) with ESMTP id o4A6rvCr026731 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Mon, 10 May 2010 02:53:58 -0400 Received: from ns3.rdu.redhat.com (ns3.rdu.redhat.com [10.11.255.199]) by int-mx04.intmail.prod.int.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id o4A6rvK9019897 for ; Mon, 10 May 2010 02:53:57 -0400 Received: from [10.64.48.185] (vpn1-48-185.bne.redhat.com [10.64.48.185]) by ns3.rdu.redhat.com (8.13.8/8.13.8) with ESMTP id o4A6rtBq010916 for ; Mon, 10 May 2010 02:53:56 -0400 Message-ID: <4BE7AD82.90300@gmail.com> Date: Mon, 10 May 2010 16:53:54 +1000 From: Mark Goodwin User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.8) Gecko/20100301 Fedora/3.0.3-1.fc12 Thunderbird/3.0.3 MIME-Version: 1.0 To: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: xfs and raid5 - "Structure needs cleaning for directory open" Subject: Re: xfs and raid5 - "Structure needs cleaning for 
directory open" References: <20100510022033.GB7165@dastard> In-Reply-To: <20100510022033.GB7165@dastard> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.67 on 10.5.11.17 X-Barracuda-Connect: mx1.redhat.com[209.132.183.28] X-Barracuda-Start-Time: 1273474457 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On 05/10/2010 12:20 PM, Dave Chinner wrote: > On Sun, May 09, 2010 at 08:48:00PM +0200, Rainer Fuegenstein wrote: >> >> today in the morning some daemon processes terminated because of >> errors in the xfs file system on top of a software raid5, consisting >> of 4*1.5TB WD caviar green SATA disks. > > Reminds me of a recent(-ish) md/dm readahead cancellation fix - that > would fit the symptoms of (btree corruption showing up under heavy IO > load but no corruption on disk. However, I can't seem to find any > references to it at the moment (can't remember the bug title), but > perhaps your distro doesn't have the fix in it? 
If it's the bug that Dave's thinking of, it's an issue with md error handling and the BIO_UPTODATE flag, see : http://www.mail-archive.com/linux-raid@vger.kernel.org/msg06628.html For RHEL, this is addressed in Red Hat BZ 512552 https://bugzilla.redhat.com/show_bug.cgi?id=512552 upstream commit c2b00852fbae4f8c45c2651530ded3bd01bde814 Cheers -- Mark From kb@sysmikro.com.pl Mon May 10 02:09:55 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.6 required=5.0 tests=BAYES_00,MIME_8BIT_HEADER autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o4A79sth253109 for ; Mon, 10 May 2010 02:09:55 -0500 X-ASG-Debug-ID: 1273475525-034400a80000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from v007470.home.net.pl (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with SMTP id 58035139BD1F for ; Mon, 10 May 2010 00:12:05 -0700 (PDT) Received: from v007470.home.net.pl (v007470.home.net.pl [212.85.125.104]) by cuda.sgi.com with SMTP id GTQQX9oGsijPrwHD for ; Mon, 10 May 2010 00:12:05 -0700 (PDT) Received: from chello089072102205.chello.pl [89.72.102.205] (HELO linux2g2g.site) by sysmikro.home.pl [212.85.125.104] with SMTP (IdeaSmtpServer v0.70) id 2d4dc24a52072b1d; Mon, 10 May 2010 09:12:03 +0200 From: Krzysztof =?utf-8?q?B=C5=82aszkowski?= Organization: Systemy mikroprocesorowe To: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: posix_fallocate Subject: Re: posix_fallocate Date: Mon, 10 May 2010 09:11:52 +0200 User-Agent: KMail/1.9.5 Cc: Eric Sandeen References: <201005071022.37863.kb@sysmikro.com.pl> <4BE43F34.40309@sandeen.net> <4BE44587.6090603@sandeen.net> In-Reply-To: <4BE44587.6090603@sandeen.net> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Message-Id: <201005100911.52491.kb@sysmikro.com.pl> 
X-Barracuda-Connect: v007470.home.net.pl[212.85.125.104] X-Barracuda-Start-Time: 1273475526 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29512 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Friday 07 May 2010 18:53, Eric Sandeen wrote: > Eric Sandeen wrote: > > Krzysztof B=C5=82aszkowski wrote: > >> Hello, > >> > >> I use this to preallocate large space but found an issue. > >> Posix_fallocate works right with sizes like 100G, 1T and even 10T on > >> some boxes (on some other can fail after e.g. 7T threshold) but if i > >> tried e.g. 16T the user space process would be "R"unning forever and it > >> is not interruptible. Furthermore some other not related processes like > >> sshd, bash enter D state. There is nothing in kernel log. > > Oh, one thing you should know is that depending on your version of glibc, > posix_fallocate may be writing 0s and not using preallocation calls. I am absolutely sure that recent libc doesn't emulate this syscall=20 > > Do you know which yours is using? syscall (libc 2.9) > strace should tell you on a small=20 > file test. > > Anyway, I am seeing things get stuck around 8T it seems... yes, i noticed that sometimes the threshold point is higher. > > # touch /mnt/test/bigfile > # xfs_io -c "resvsp 0 16t" /mnt/test/bigfile > > ... wait ... in other window ... > > # du -hc /mnt/test/bigfile > 8.0G /mnt/test/bigfile > 8.0G total > > # echo t > /proc/sysrq-trigger It was good idea to use sysrq. 
I didn't think about this but rather focused on ftrace and how to analyse these megs of data.

> # dmesg | grep -A20 xfs_io
> xfs_io R running task 3576 29444 29362 0x00000006
> ffff8809cfbb4920 ffffffff81478d9f ffffffffa032d3c5 0000000000000246
> ffff8809cfbb4920 ffffffff814788bc 0000000000000000 ffffffff81ba3510
> ffff8809d3429a68 ffffffffa032b60f ffff8809d3429aa8 000000000000001e
> Call Trace:
> [] ? __mutex_lock_common+0x36d/0x392
> [] ? xfs_icsb_modify_counters+0x17f/0x1ac [xfs]
> [] ? xfs_icsb_unlock_all_counters+0x4d/0x60 [xfs]
> [] ? xfs_icsb_disable_counter+0x8c/0x95 [xfs]
> [] ? mutex_lock_nested+0x3e/0x43
> [] ? xfs_icsb_modify_counters+0x18d/0x1ac [xfs]
> [] ? xfs_mod_incore_sb+0x29/0x6e [xfs]
> [] ? _xfs_trans_alloc+0x27/0x61 [xfs]
> [] ? xfs_trans_reserve+0x6c/0x19e [xfs]
> [] ? up_write+0x2b/0x32
> [] ? xfs_alloc_file_space+0x163/0x306 [xfs]
> [] ? sched_clock_cpu+0xc3/0xce
> [] ? xfs_change_file_space+0x12a/0x2b8 [xfs]
> [] ? down_write_nested+0x80/0x8b
> [] ? xfs_ilock+0x30/0xb4 [xfs]
> [] ? xfs_vn_fallocate+0x80/0xf4 [xfs]
> --
> R xfs_io 29444 86014624.786617 162 120 86014624.786617
> 137655.161327 408.979977 /
>
> # uname -r
> 2.6.34-0.4.rc0.git2.fc14.x86_64
>
> I'll look into it.

We stick with 2.6.31.5, which seems to be good for us. We do not change kernels as soon as a higher revision arrives, because that doesn't make sense from a stability point of view. We have seen regression bugs too many times, so if we are confident in some revision then there is no point in changing it.
Krzysztof B=C5=82aszkowski > > -Eric > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs From rfu@kaneda.iguw.tuwien.ac.at Mon May 10 05:19:55 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o4AAJtZn257018 for ; Mon, 10 May 2010 05:19:55 -0500 X-ASG-Debug-ID: 1273486925-2f64029d0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from mx03.kabsi.at (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 04740327D0F for ; Mon, 10 May 2010 03:22:05 -0700 (PDT) Received: from mx03.kabsi.at (mx03.kabsi.at [195.202.128.130]) by cuda.sgi.com with ESMTP id 3DTmcBhVldyyL8XV for ; Mon, 10 May 2010 03:22:05 -0700 (PDT) Received: from 192.168.5.201 (h081217058120.dyn.cm.kabsi.at [81.217.58.120]) by mx03.kabsi.at (8.13.8/8.13.8) with ESMTP id o4AAM2Ew001600; Mon, 10 May 2010 12:22:03 +0200 Date: Mon, 10 May 2010 12:22:03 +0200 From: Rainer Fuegenstein X-Mailer: The Bat! 
(v1.62r) Business Reply-To: Rainer Fuegenstein Organization: Vienna University of Technology X-Priority: 3 (Normal) Message-ID: <23944308.20100510122203@kaneda.iguw.tuwien.ac.at> To: Mark Goodwin CC: xfs@oss.sgi.com X-ASG-Orig-Subj: Re[2]: xfs and raid5 - "Structure needs cleaning for directory open" Subject: Re[2]: xfs and raid5 - "Structure needs cleaning for directory open" In-Reply-To: <4BE7AD82.90300@gmail.com> References: <20100510022033.GB7165@dastard> <4BE7AD82.90300@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Barracuda-Connect: mx03.kabsi.at[195.202.128.130] X-Barracuda-Start-Time: 1273486927 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.62 X-Barracuda-Spam-Status: No, SCORE=-1.62 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=BSF_SC0_SA085b X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29524 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.40 BSF_SC0_SA085b Custom Rule SA085b X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean mark, yes, thank you, that looks very much like it. CentOS is still at 5.4, version 5.5 is long overdue :-( tnx & cu MG> On 05/10/2010 12:20 PM, Dave Chinner wrote: >> On Sun, May 09, 2010 at 08:48:00PM +0200, Rainer Fuegenstein wrote: >>> >>> today in the morning some daemon processes terminated because of >>> errors in the xfs file system on top of a software raid5, consisting >>> of 4*1.5TB WD caviar green SATA disks. >> >> Reminds me of a recent(-ish) md/dm readahead cancellation fix - that >> would fit the symptoms of (btree corruption showing up under heavy IO >> load but no corruption on disk. 
However, I can't seem to find any >> references to it at the moment (can't remember the bug title), but >> perhaps your distro doesn't have the fix in it? MG> If it's the bug that Dave's thinking of, it's an issue with md MG> error handling and the BIO_UPTODATE flag, see : MG> http://www.mail-archive.com/linux-raid@vger.kernel.org/msg06628.html MG> For RHEL, this is addressed in Red Hat BZ 512552 MG> https://bugzilla.redhat.com/show_bug.cgi?id=512552 MG> upstream commit c2b00852fbae4f8c45c2651530ded3bd01bde814 MG> Cheers MG> -- Mark MG> _______________________________________________ MG> xfs mailing list MG> xfs@oss.sgi.com MG> http://oss.sgi.com/mailman/listinfo/xfs ------------------------------------------------------------------------------ Unix gives you just enough rope to hang yourself -- and then a couple of more feet, just to be sure. (Eric Allman) ------------------------------------------------------------------------------ From BATV+c30e087d29709dd2b847+2451+infradead.org+hch@bombadil.srs.infradead.org Mon May 10 06:42:25 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.7 required=5.0 tests=BAYES_00,J_CHICKENPOX_63, J_CHICKENPOX_66 autolearn=no version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o4ABgOiL258900 for ; Mon, 10 May 2010 06:42:25 -0500 X-ASG-Debug-ID: 1273491895-7f5b00840000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 32512983416 for ; Mon, 10 May 2010 04:44:56 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id f2KOfdkhLXGFHtw1 for ; Mon, 10 May 2010 04:44:56 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 
1OBRPX-0002jp-J2; Mon, 10 May 2010 11:44:35 +0000 Date: Mon, 10 May 2010 07:44:35 -0400 From: Christoph Hellwig To: Dave Chinner Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH 09/12] xfs: Introduce delayed logging core code Subject: Re: [PATCH 09/12] xfs: Introduce delayed logging core code Message-ID: <20100510114435.GA27624@infradead.org> References: <1273210860-23414-1-git-send-email-david@fromorbit.com> <1273210860-23414-10-git-send-email-david@fromorbit.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1273210860-23414-10-git-send-email-david@fromorbit.com> User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1273491896 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean Looks good to me, Reviewed-by: Christoph Hellwig A couple comments below anyway: > +int > +xlog_cil_init_post_recovery( > + struct log *log) > +{ > + if (!log->l_cilp) > + return 0; > + > + log->l_cilp->xc_ctx->ticket = xlog_cil_ticket_alloc(log); > + log->l_cilp->xc_ctx->sequence = 1; > + log->l_cilp->xc_ctx->commit_lsn = xlog_assign_lsn(log->l_curr_cycle, > + log->l_curr_block); > + return 0; > +} This should return void. 
> +static void
> +xlog_cil_insert(
> +	struct log *log,
> +	struct xlog_ticket *ticket,
> +	struct xfs_log_item *item,
> +	struct xfs_log_vec *lv)
> +{
> +	struct xfs_cil *cil = log->l_cilp;
> +	struct xfs_log_vec *old = lv->lv_item->li_lv;
> +	struct xfs_cil_ctx *ctx = cil->xc_ctx;
> +	int len;
> +	int diff_iovecs;
> +	int iclog_space;
> +
> +	if (old) {
> +		/* existing lv on log item, space used is a delta */
> +		ASSERT(!list_empty(&item->li_cil));
> +		ASSERT(old->lv_buf && old->lv_buf_len && old->lv_niovecs);
> +
> +		len = lv->lv_buf_len - old->lv_buf_len;
> +		diff_iovecs = lv->lv_niovecs - old->lv_niovecs;

Add asserts that len and diff_iovecs aren't negative here?

> +	for (lv = log_vector; lv; lv = lv->lv_next) {
> +		void *ptr;
> +		int index;
> +		int offset = 0;
> +		int len = 0;
> +
> +		for (index = 0; index < lv->lv_niovecs; index++)
> +			len += lv->lv_iovecp[index].i_len;
> +
> +		lv->lv_buf_len = len;
> +		lv->lv_buf = kmem_zalloc(lv->lv_buf_len, KM_SLEEP|KM_NOFS);
> +		ptr = lv->lv_buf;
> +
> +		for (index = 0; index < lv->lv_niovecs; index++) {
> +			struct xfs_log_iovec *vec = &lv->lv_iovecp[index];
> +
> +			memcpy(ptr, vec->i_addr, vec->i_len);
> +			vec->i_addr = ptr;
> +			xlog_write_adv_cnt(&ptr, &len, &offset, vec->i_len);
> +		}
> +		ASSERT(len == 0);
> +
> +		xlog_cil_insert(log, ticket, lv->lv_item, lv);

The use of xlog_write_adv_cnt doesn't seem quite optimal to me. The offset variable is entirely unused, and len is only used for an assert that could easily be reformulated as

	ASSERT(ptr == lv->lv_buf + len);

if we replace the xlog_write_adv_cnt with a simple

	ptr += vec->i_len;

> +/*
> + * Push the Committed Item List to the log. If the push_now flag is not set,
> + * then it is a background flush and so we can chose to ignore it.
> + */
> +int
> +xlog_cil_push(
> +	struct log *log,
> +	int push_now)
> +{
> +	struct xfs_cil *cil = log->l_cilp;

The variables don't line up here. There's another instance of that in xlog_cil_insert, btw.
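For reference, the pointer-bump variant of the copy loop suggested in this review can be sketched in user space roughly as follows. This is only an illustration under assumptions: demo_iovec, demo_log_vec and demo_flatten are hypothetical stand-ins rather than the XFS types or functions from the patch, and calloc() stands in for kmem_zalloc().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for xfs_log_iovec / xfs_log_vec (not XFS code). */
struct demo_iovec {
	void	*i_addr;	/* start of the region */
	int	i_len;		/* length of the region in bytes */
};

struct demo_log_vec {
	struct demo_iovec	*lv_iovecp;	/* array of regions */
	int			lv_niovecs;	/* number of regions */
	char			*lv_buf;	/* flattened copy of all regions */
	int			lv_buf_len;	/* total bytes in lv_buf */
};

/*
 * Copy every region into one freshly allocated buffer and repoint each
 * iovec at its copy. Returns 0 on success, -1 if the allocation fails.
 */
int demo_flatten(struct demo_log_vec *lv)
{
	char	*ptr;
	int	len = 0;
	int	i;

	/* First pass: total size of all regions. */
	for (i = 0; i < lv->lv_niovecs; i++)
		len += lv->lv_iovecp[i].i_len;

	/* Allocate at least one byte so a zero-length vector still works. */
	lv->lv_buf = calloc(1, len ? (size_t)len : 1);
	if (!lv->lv_buf)
		return -1;
	lv->lv_buf_len = len;

	/* Second pass: copy, repoint, and advance with a plain pointer bump. */
	ptr = lv->lv_buf;
	for (i = 0; i < lv->lv_niovecs; i++) {
		struct demo_iovec *vec = &lv->lv_iovecp[i];

		memcpy(ptr, vec->i_addr, (size_t)vec->i_len);
		vec->i_addr = ptr;
		ptr += vec->i_len;
	}

	/* The end pointer must land exactly at the end of the buffer. */
	assert(ptr == lv->lv_buf + len);
	return 0;
}
```

With this shape, len keeps the total size for the whole function, so the end-of-buffer assert replaces both the running length counter and the unused offset.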
> +	/* check if we've anything to push */
> +	if (list_empty(&cil->xc_cil)) {
> +		up_write(&cil->xc_ctx_lock);
> +		xfs_log_ticket_put(new_ctx->ticket);
> +		kmem_free(new_ctx);
> +		return 0;
> +	}

Please add an out_skip label for this cleanup code, as it would be duplicated by the background flushing check added in a later patch.

> +		new_lv = kmem_zalloc(sizeof(*new_lv) +
> +				lidp->lid_size * sizeof(struct xfs_log_iovec),
> +				KM_SLEEP);
> +
> +		/* The allocated iovec region lies beyond the log vector. */
> +		new_lv->lv_iovecp = (struct xfs_log_iovec *)&new_lv[1];
> +		if (!ret_lv)
> +			ret_lv = new_lv;
> +		else
> +			lv->lv_next = new_lv;
> +		lv = new_lv;

I'd suggest already setting up lv->lv_niovecs and lv->lv_item here instead of in xfs_trans_fill_log_vecs. That way xfs_trans_fill_log_vecs can be simplified to:

STATIC void
xfs_trans_fill_log_vecs(
	struct xfs_trans	*tp,
	struct xfs_log_vec	*log_vector)
{
	struct xfs_log_vec	*lv;

	for (lv = log_vector; lv; lv = lv->lv_next)
		IOP_FORMAT(lv->lv_item, lv->lv_iovecp);
}

Or just inlined into the caller or even xfs_log_commit_cil given how simple it is now. Moving it to xfs_log_commit_cil would also help avoid the locking imbalance where xfs_log_commit_cil is called with xc_ctx_lock held but returns without it after the last patch in the series. That again might allow merging the IOP_FORMAT loop into xlog_cil_format_items.

Btw, I wonder if xfs_log_commit_cil should simply move to xfs_trans.c? That would avoid having to export xfs_trans_unreserve_and_mod_sb, xfs_trans_free_items and xfs_trans_free from there, and only require exporting xlog_cil_format_items (if we didn't move that one as well, then xlog_cil_insert), while keeping things a lot more symmetric with the traditional commit path.
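To make the allocation suggestion concrete, here is a rough user-space sketch of the pattern: each log vector is allocated together with its trailing iovec array, the item pointer and iovec count are recorded at allocation time, and the fill step becomes a plain walk over the chain. Everything here is an assumption for illustration only: the demo_* types and functions are hypothetical stand-ins, demo_item_format() merely plays the role of IOP_FORMAT, and calloc() replaces kmem_zalloc().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these are the real XFS types. */
struct demo_iovec {
	void	*i_addr;
	int	i_len;
};

struct demo_item {
	int	nregions;	/* how many iovecs this item formats */
	char	payload;	/* dummy data every region points at */
};

struct demo_log_vec {
	struct demo_log_vec	*lv_next;
	int			lv_niovecs;
	struct demo_iovec	*lv_iovecp;
	struct demo_item	*lv_item;
};

/* Plays the role of IOP_FORMAT: fill in the item's regions. */
void demo_item_format(struct demo_item *item, struct demo_iovec *vec)
{
	int i;

	for (i = 0; i < item->nregions; i++) {
		vec[i].i_addr = &item->payload;
		vec[i].i_len = 1;
	}
}

void demo_free_log_vecs(struct demo_log_vec *lv)
{
	while (lv) {
		struct demo_log_vec *next = lv->lv_next;

		free(lv);
		lv = next;
	}
}

/* Build one log vector per item; the iovec array lies beyond the vector. */
struct demo_log_vec *demo_alloc_log_vecs(struct demo_item *items, int nitems)
{
	struct demo_log_vec *ret_lv = NULL, *lv = NULL;
	int i;

	for (i = 0; i < nitems; i++) {
		struct demo_log_vec *new_lv;

		new_lv = calloc(1, sizeof(*new_lv) +
				items[i].nregions * sizeof(struct demo_iovec));
		if (!new_lv) {
			demo_free_log_vecs(ret_lv);
			return NULL;
		}
		new_lv->lv_iovecp = (struct demo_iovec *)&new_lv[1];
		/* Set up at allocation time, so the fill step needs no descriptor. */
		new_lv->lv_niovecs = items[i].nregions;
		new_lv->lv_item = &items[i];

		if (!ret_lv)
			ret_lv = new_lv;
		else
			lv->lv_next = new_lv;
		lv = new_lv;
	}
	return ret_lv;
}

/* The simplified fill step: walk the chain and format each item. */
void demo_fill_log_vecs(struct demo_log_vec *log_vector)
{
	struct demo_log_vec *lv;

	for (lv = log_vector; lv; lv = lv->lv_next)
		demo_item_format(lv->lv_item, lv->lv_iovecp);
}
```

Because each vector already knows its item, the fill loop no longer needs the transaction or the item descriptors, which is what would make inlining it into a caller straightforward.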
From BATV+c30e087d29709dd2b847+2451+infradead.org+hch@bombadil.srs.infradead.org Mon May 10 06:42:48 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o4ABglaw258917 for ; Mon, 10 May 2010 06:42:48 -0500 X-ASG-Debug-ID: 1273491911-35e302a60000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 5316E97CD24 for ; Mon, 10 May 2010 04:45:12 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id q6RUllitjLyziCDm for ; Mon, 10 May 2010 04:45:12 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1OBRPp-0002kO-3C; Mon, 10 May 2010 11:44:53 +0000 Date: Mon, 10 May 2010 07:44:53 -0400 From: Christoph Hellwig To: Dave Chinner Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH 10/12] xfs: forced unmounts need to push the CIL Subject: Re: [PATCH 10/12] xfs: forced unmounts need to push the CIL Message-ID: <20100510114453.GB27624@infradead.org> References: <1273210860-23414-1-git-send-email-david@fromorbit.com> <1273210860-23414-11-git-send-email-david@fromorbit.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1273210860-23414-11-git-send-email-david@fromorbit.com> User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1273491912 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter 
version 0.94.2 on oss.sgi.com X-Virus-Status: Clean On Fri, May 07, 2010 at 03:40:58PM +1000, Dave Chinner wrote: > From: Dave Chinner > > If the filesystem is being shut down and the there is no log error, > the current code forces out the current log buffers. This code now needs > to push the CIL before it forces out the log buffers to acheive the same > result. Looks good, Reviewed-by: Christoph Hellwig From BATV+c30e087d29709dd2b847+2451+infradead.org+hch@bombadil.srs.infradead.org Mon May 10 06:43:45 2010 X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0-r929098 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o4ABhjWe258989 for ; Mon, 10 May 2010 06:43:45 -0500 X-ASG-Debug-ID: 1273491976-7f6500890000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from bombadil.infradead.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id CB92E97CD39 for ; Mon, 10 May 2010 04:46:16 -0700 (PDT) Received: from bombadil.infradead.org (bombadil.infradead.org [18.85.46.34]) by cuda.sgi.com with ESMTP id lVwlGuxR7LkxLAkX for ; Mon, 10 May 2010 04:46:16 -0700 (PDT) X-ASG-Whitelist: Client Received: from hch by bombadil.infradead.org with local (Exim 4.69 #1 (Red Hat Linux)) id 1OBRQr-0003KX-Iw; Mon, 10 May 2010 11:45:57 +0000 Date: Mon, 10 May 2010 07:45:57 -0400 From: Christoph Hellwig To: Dave Chinner Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: [PATCH 11/12] xfs: enable background pushing of the CIL Subject: Re: [PATCH 11/12] xfs: enable background pushing of the CIL Message-ID: <20100510114557.GC27624@infradead.org> References: <1273210860-23414-1-git-send-email-david@fromorbit.com> <1273210860-23414-12-git-send-email-david@fromorbit.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: 
<1273210860-23414-12-git-send-email-david@fromorbit.com> User-Agent: Mutt/1.5.19 (2009-01-05) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html X-Barracuda-Connect: bombadil.infradead.org[18.85.46.34] X-Barracuda-Start-Time: 1273491976 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com X-Virus-Status: Clean

On Fri, May 07, 2010 at 03:40:59PM +1000, Dave Chinner wrote:
> From: Dave Chinner
>
> If we let the CIL grow without bound, it will grow large enough to violate
> recovery constraints (must be at least one complete transaction in the log at
> all times) or take forever to write out through the log buffers. Hence we need
> a check during asynchronous transactions as to whether the CIL needs to be
> pushed.
>
> We track the amount of log space the CIL consumes, so it is relatively simple
> to limit it on a pure size basis. Make the limit the minimum of just under half
> the log size (recovery constraint) or 8MB of log space (which is an awful lot
> of metadata).

Looks good except for the use of a cleanup label in xlog_cil_push mentioned in the review of the main patch.
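As a small worked illustration of the size limit described in the quoted commit message (the minimum of just under half the log size and 8MB), consider the sketch below. It is not the patch's actual expression: the function names are hypothetical, and the margin that makes the limit "just under" half the log is an assumed parameter, since the quoted text does not say how large it is.

```c
#include <assert.h>
#include <stddef.h>

/* 8MB hard cap on CIL space, as described in the commit message. */
#define DEMO_CIL_HARD_CAP	((size_t)8 * 1024 * 1024)

/*
 * Hypothetical helper: the limit is the smaller of the hard cap and
 * "just under half the log". The margin is an assumed parameter.
 */
size_t demo_cil_space_limit(size_t log_size, size_t margin)
{
	size_t half = log_size / 2;
	size_t limit = half > margin ? half - margin : 0;

	return limit < DEMO_CIL_HARD_CAP ? limit : DEMO_CIL_HARD_CAP;
}

/* A background push is wanted once the tracked CIL space reaches the limit. */
int demo_cil_needs_push(size_t space_used, size_t log_size, size_t margin)
{
	return space_used >= demo_cil_space_limit(log_size, margin);
}
```

For a 128MB log the 8MB cap applies; for a 10MB log with an assumed 4096-byte margin the limit is 5MB minus 4096 bytes, i.e. 5238784 bytes.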
Reviewed-by: Christoph Hellwig

From SRS0+vHgI+69+fromorbit.com=david@internode.on.net Mon May 10 07:14:22 2010
X-Spam-Checker-Version: SpamAssassin 3.4.0-r929098 (2010-03-30) on oss.sgi.com
X-Spam-Level:
X-Spam-Status: No, score=-1.7 required=5.0 tests=BAYES_00,J_CHICKENPOX_63, J_CHICKENPOX_66 autolearn=no version=3.4.0-r929098
Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o4ACEMi8260162 for ; Mon, 10 May 2010 07:14:22 -0500
X-ASG-Debug-ID: 1273493791-307201150000-NocioJ
X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi
Received: from mail.internode.on.net (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 0F34B1DE9E0C for ; Mon, 10 May 2010 05:16:32 -0700 (PDT)
Received: from mail.internode.on.net (bld-mail12.adl6.internode.on.net [150.101.137.97]) by cuda.sgi.com with ESMTP id 0H0c6QlW5nM5uLA6 for ; Mon, 10 May 2010 05:16:32 -0700 (PDT)
Received: from dastard (unverified [121.45.169.108]) by mail.internode.on.net (SurgeMail 3.8f2) with ESMTP id 23756651-1927428 for multiple; Mon, 10 May 2010 21:46:31 +0930 (CST)
Received: from dave by dastard with local (Exim 4.71) (envelope-from ) id 1OBRuP-0002c5-0s; Mon, 10 May 2010 22:16:29 +1000
Date: Mon, 10 May 2010 22:16:28 +1000
From: Dave Chinner
To: Christoph Hellwig
Cc: xfs@oss.sgi.com
X-ASG-Orig-Subj: Re: [PATCH 09/12] xfs: Introduce delayed logging core code
Subject: Re: [PATCH 09/12] xfs: Introduce delayed logging core code
Message-ID: <20100510121628.GD7165@dastard>
References: <1273210860-23414-1-git-send-email-david@fromorbit.com> <1273210860-23414-10-git-send-email-david@fromorbit.com> <20100510114435.GA27624@infradead.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20100510114435.GA27624@infradead.org>
User-Agent: Mutt/1.5.20 (2009-06-14)
X-Barracuda-Connect: bld-mail12.adl6.internode.on.net[150.101.137.97]
X-Barracuda-Start-Time: 1273493794
X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210
X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com
X-Barracuda-Spam-Score: -2.02
X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=
X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.29529
	Rule breakdown below
	 pts rule name              description
	---- ---------------------- --------------------------------------------------
X-Virus-Scanned: ClamAV version 0.94.2, clamav-milter version 0.94.2 on oss.sgi.com
X-Virus-Status: Clean

On Mon, May 10, 2010 at 07:44:35AM -0400, Christoph Hellwig wrote:
> Looks good to me,
>
>
> Reviewed-by: Christoph Hellwig
>
> A couple comments below anyway:
>
> > +int
> > +xlog_cil_init_post_recovery(
> > +	struct log	*log)
> > +{
> > +	if (!log->l_cilp)
> > +		return 0;
> > +
> > +	log->l_cilp->xc_ctx->ticket = xlog_cil_ticket_alloc(log);
> > +	log->l_cilp->xc_ctx->sequence = 1;
> > +	log->l_cilp->xc_ctx->commit_lsn = xlog_assign_lsn(log->l_curr_cycle,
> > +								log->l_curr_block);
> > +	return 0;
> > +}
>
> This should return void.

OK.

> > +static void
> > +xlog_cil_insert(
> > +	struct log		*log,
> > +	struct xlog_ticket	*ticket,
> > +	struct xfs_log_item	*item,
> > +	struct xfs_log_vec	*lv)
> > +{
> > +	struct xfs_cil		*cil = log->l_cilp;
> > +	struct xfs_log_vec	*old = lv->lv_item->li_lv;
> > +	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
> > +	int			len;
> > +	int			diff_iovecs;
> > +	int			iclog_space;
> > +
> > +	if (old) {
> > +		/* existing lv on log item, space used is a delta */
> > +		ASSERT(!list_empty(&item->li_cil));
> > +		ASSERT(old->lv_buf && old->lv_buf_len && old->lv_niovecs);
> > +
> > +		len = lv->lv_buf_len - old->lv_buf_len;
> > +		diff_iovecs = lv->lv_niovecs - old->lv_niovecs;
>
> Add asserts that len and diff_iovecs aren't negative here?
Actually, they can be negative here - a previously logged buffer that
is now stale will go from ((N dirty regions * 128 bytes) + format
header) to (zero dirty regions + format header), and effectively free
up space as what was previously logged is now ignored due to the
XFS_BLI_CANCEL flag in the format header.

> > +	for (lv = log_vector; lv; lv = lv->lv_next) {
> > +		void	*ptr;
> > +		int	index;
> > +		int	offset = 0;
> > +		int	len = 0;
> > +
> > +		for (index = 0; index < lv->lv_niovecs; index++)
> > +			len += lv->lv_iovecp[index].i_len;
> > +
> > +		lv->lv_buf_len = len;
> > +		lv->lv_buf = kmem_zalloc(lv->lv_buf_len, KM_SLEEP|KM_NOFS);
> > +		ptr = lv->lv_buf;
> > +
> > +		for (index = 0; index < lv->lv_niovecs; index++) {
> > +			struct xfs_