From owner-xfs@oss.sgi.com Sun Apr 1 15:45:13 2007
Received: with ECARTIS (v1.0.0; list xfs); Sun, 01 Apr 2007 15:45:16 -0700 (PDT)
X-Spam-oss-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00
    autolearn=ham version=3.2.0-pre1-r499012
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130])
    by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l31MjA6p010169
    for ; Sun, 1 Apr 2007 15:45:11 -0700
Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149])
    by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id IAA12073;
    Mon, 2 Apr 2007 08:45:01 +1000
Received: from snort.melbourne.sgi.com (localhost [127.0.0.1])
    by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l31Mj0Af40434916;
    Mon, 2 Apr 2007 08:45:00 +1000 (AEST)
Received: (from dgc@localhost)
    by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l31MiwTX46334670;
    Mon, 2 Apr 2007 08:44:58 +1000 (AEST)
Date: Mon, 2 Apr 2007 08:44:58 +1000
From: David Chinner
To: Martin Steigerwald
Cc: linux-xfs@oss.sgi.com
Subject: Re: write barrier and USB devices
Message-ID: <20070401224458.GR32597093@melbourne.sgi.com>
References: <200703301523.58027.Martin@lichtvoll.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200703301523.58027.Martin@lichtvoll.de>
User-Agent: Mutt/1.4.2.1i
X-archive-position: 11015
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: dgc@sgi.com
Precedence: bulk
X-list: xfs

On Fri, Mar 30, 2007 at 03:23:57PM +0200, Martin Steigerwald wrote:
>
> Hello!
>
> Does the usb mass storage driver support write barriers?

You should ask the usb folks that question....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

From owner-xfs@oss.sgi.com Mon Apr 2 11:24:06 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 02 Apr 2007 11:24:12 -0700 (PDT)
X-Spam-oss-Status: No, score=-0.6 required=5.0 tests=AWL,BAYES_50
    autolearn=no version=3.2.0-pre1-r499012
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130])
    by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l32IO2kH008050
    for ; Mon, 2 Apr 2007 11:24:04 -0700
Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141])
    by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA23891;
    Mon, 2 Apr 2007 16:30:51 +1000
Date: Mon, 02 Apr 2007 16:31:52 +1100
From: Timothy Shimmin
To: xfs-dev@sgi.com, xfs@oss.sgi.com
Subject: review: remove unused ilen var from xfs_vnodeops.c
Message-ID:
X-Mailer: Mulberry/4.0.8 (Mac OS X)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
X-archive-position: 11016
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tes@sgi.com
Precedence: bulk
X-list: xfs

simple cleanup patch

===========================================================================
Index: fs/xfs/xfs_vnodeops.c
===========================================================================

--- a/fs/xfs/xfs_vnodeops.c	2007-04-02 15:56:30.000000000 +1000
+++ b/fs/xfs/xfs_vnodeops.c	2007-04-02 15:19:42.926081759 +1000
@@ -4289,7 +4289,6 @@ xfs_free_file_space(
 	int			error;
 	xfs_fsblock_t		firstfsb;
 	xfs_bmap_free_t		free_list;
-	xfs_off_t		ilen;
 	xfs_bmbt_irec_t		imap;
 	xfs_off_t		ioffset;
 	xfs_extlen_t		mod=0;
@@ -4338,10 +4337,7 @@ xfs_free_file_space(
 	}
 
 	rounding = max_t(uint, 1 << mp->m_sb.sb_blocklog, NBPP);
-	ilen = len + (offset & (rounding - 1));
 	ioffset = offset & ~(rounding - 1);
-	if (ilen & (rounding - 1))
-		ilen = (ilen + rounding) & ~(rounding - 1);
 
 	if (VN_CACHED(vp) != 0) {
 		xfs_inval_cached_trace(&ip->i_iocore, ioffset, -1,
=============================================================================

I think ilen was removed a while back when we changed a call interface...

revision 1.534
date: 2002/07/08 22:09:30; author: lord; state: Exp; lines: +1 -2
modid: 2.4.x-xfs:slinx:122666a
changes xfs_inval_cached_pages interface

===========================================================================
Index: fs/xfs/xfs_vnodeops.c
===========================================================================

--- a/fs/xfs/xfs_vnodeops.c	2007-04-02 15:38:34.000000000 +1000
+++ b/fs/xfs/xfs_vnodeops.c	2007-04-02 15:38:34.000000000 +1000
@@ -5459,8 +5459,7 @@ xfs_free_file_space(
 	ioffset = offset & ~(rounding - 1);
 	if (ilen & (rounding - 1))
 		ilen = (ilen + rounding) & ~(rounding - 1);
-	xfs_inval_cached_pages(XFS_ITOV(ip), &(ip->i_iocore),
-			ioffset, ilen, NULL, 0);
+	xfs_inval_cached_pages(XFS_ITOV(ip), &(ip->i_iocore), ioffset, 0, 0);
 	/*
 	 * Need to zero the stuff we're not freeing, on disk.
 	 * If its specrt (realtime & can't use unwritten extents) then

--Tim

From owner-xfs@oss.sgi.com Mon Apr 2 11:35:11 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 02 Apr 2007 11:35:14 -0700 (PDT)
X-Spam-oss-Status: No, score=0.0 required=5.0 tests=BAYES_50
    autolearn=no version=3.2.0-pre1-r499012
Received: from mail.g-house.de (ns2.g-housing.de [81.169.133.75])
    by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l32IXskH011758
    for ; Mon, 2 Apr 2007 11:35:11 -0700
Received: from [77.99.119.196] (helo=77-99-119-196.cable.ubr04.linl.blueyonder.co.uk)
    by mail.g-house.de with esmtpsa (TLS-1.0:DHE_RSA_AES_256_CBC_SHA:32) (Exim 4.50)
    id 1HYR6a-0002wK-26 for xfs@oss.sgi.com; Mon, 02 Apr 2007 20:18:12 +0200
Date: Mon, 2 Apr 2007 19:18:01 +0100 (BST)
From: Christian Kujau
X-X-Sender: evil@sheep.housecafe.de
To: xfs@oss.sgi.com
Subject: possible recursive locking detected
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-archive-position: 11017
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: lists@nerdbynature.de
Precedence: bulk
X-list: xfs

Hi,

when I enabled a few more debug-options in the kernel (vanilla
2.6.21-rc5), I came across:

[ INFO: possible recursive locking detected ]
2.6.21-rc5 #2
---------------------------------------------
rm/32198 is trying to acquire lock:
xfs_ilock+0x71/0xa0

but task is already holding lock:
xfs_ilock+0x71/0xa0

other info that might help us debug this:
3 locks held by rm/32198:
do_unlinkat+0x96/0x160
vfs_unlink+0x75/0xe0
xfs_ilock+0x71/0xa0

stack backtrace:
__lock_acquire+0xa99/0x1010
lock_acquire+0x57/0x70
xfs_ilock+0x71/0xa0
down_write+0x38/0x50
xfs_ilock+0x71/0xa0
xfs_ilock+0x71/0xa0
xfs_lock_dir_and_entry+0xf6/0x100
xfs_remove+0x197/0x4e0
d_instantiate+0x19/0x40
d_rehash+0x20/0x50
vfs_unlink+0x75/0xe0
xfs_vn_unlink+0x23/0x60
__mutex_lock_slowpath+0x13f/0x280
mark_held_locks+0x6b/0x90
__mutex_lock_slowpath+0x13f/0x280
__mutex_lock_slowpath+0x13f/0x280
trace_hardirqs_on+0xb9/0x160
vfs_unlink+0x75/0xe0
__mutex_lock_slowpath+0x132/0x280
vfs_unlink+0x75/0xe0
permission+0x91/0xf0
vfs_unlink+0x89/0xe0
do_unlinkat+0xd2/0x160
sysenter_past_esp+0x8d/0x99
trace_hardirqs_on+0xb9/0x160
sysenter_past_esp+0x5d/0x99
=======================

Is this something I have to worry about? Please see
http://nerdbynature.de/bits/2.6.21-rc5/ for a few more details.

Thanks,
Christian.
-- 
BOFH excuse #372:

Forced to support NT servers; sysadmins quit.

From owner-xfs@oss.sgi.com Mon Apr 2 13:19:25 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 02 Apr 2007 13:19:27 -0700 (PDT)
X-Spam-oss-Status: No, score=-1.3 required=5.0 tests=AWL,BAYES_20
    autolearn=no version=3.2.0-pre1-r499012
Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40])
    by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l32KJOfB015559
    for ; Mon, 2 Apr 2007 13:19:25 -0700
Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux))
    id 1HYSjB-0006XR-RE; Mon, 02 Apr 2007 21:02:09 +0100
Date: Mon, 2 Apr 2007 21:02:09 +0100
From: Christoph Hellwig
To: Timothy Shimmin
Cc: xfs-dev@sgi.com, xfs@oss.sgi.com
Subject: Re: review: remove unused ilen var from xfs_vnodeops.c
Message-ID: <20070402200209.GA25101@infradead.org>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.4.2.2i
X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org
    See http://www.infradead.org/rpr.html
X-archive-position: 11018
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: hch@infradead.org
Precedence: bulk
X-list: xfs

On Mon, Apr 02, 2007 at 04:31:52PM +1100, Timothy Shimmin wrote:
> simple cleanup patch

looks good.
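
A side note on the arithmetic in the patch reviewed above: the expression
"offset & ~(rounding - 1)" rounds an offset down to a multiple of rounding,
and the removed ilen lines rounded a length up in the same style. Both only
work when rounding is a power of two. What follows is a minimal standalone C
sketch of that arithmetic, not XFS code. The helper names is_pow2,
round_down_pow2 and round_up_len_pow2 are made up for illustration, and the
4096-byte figure in the note after the code is an assumed page size.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if x is a power of two; the mask tricks below rely on this. */
static int is_pow2(uint64_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/*
 * Round off down to a multiple of rounding, in the style of
 * "ioffset = offset & ~(rounding - 1)" from the patch.
 */
static uint64_t round_down_pow2(uint64_t off, uint64_t rounding)
{
	assert(is_pow2(rounding));
	return off & ~(rounding - 1);
}

/*
 * Round len up to a multiple of rounding, mirroring the removed
 * "if (ilen & (rounding - 1)) ilen = (ilen + rounding) & ~(rounding - 1);"
 */
static uint64_t round_up_len_pow2(uint64_t len, uint64_t rounding)
{
	assert(is_pow2(rounding));
	if (len & (rounding - 1))
		len = (len + rounding) & ~(rounding - 1);
	return len;
}
```

With rounding = 4096, offset = 10000 and len = 5000, round_down_pow2(10000,
4096) gives 8192, and round_up_len_pow2(5000 + (10000 & 4095), 4096) gives
8192, so the aligned range [8192, 16384) covers the original [10000, 15000).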
From owner-xfs@oss.sgi.com Mon Apr 2 14:44:47 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 02 Apr 2007 14:44:50 -0700 (PDT)
X-Spam-oss-Status: No, score=3.0 required=5.0 tests=BAYES_95,HTML_MESSAGE
    autolearn=no version=3.2.0-pre1-r499012
Received: from an-out-0708.google.com (an-out-0708.google.com [209.85.132.243])
    by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l32LijfB006582
    for ; Mon, 2 Apr 2007 14:44:46 -0700
Received: by an-out-0708.google.com with SMTP id c5so1347883anc
    for ; Mon, 02 Apr 2007 14:44:44 -0700 (PDT)
DKIM-Signature: a=rsa-sha1; c=relaxed/relaxed; d=gmail.com; s=beta;
    h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:references;
    b=B0dKOXK/Z19QmQgxQu99WBEx++uR/fqbhpRomvr1bdl0Go0keJ5m4MEuIy4+8IbZbuj/tKmtmSIhGlVq68O7zFhNJj/YYQTVX3Ww3e5egV2T9HRYdwH4k3r0Stdep02hmpeRp05IoOK5iLST4BmJFwXeo8jk8ESupxftDNe7WiU=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta;
    h=received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:references;
    b=Pj9/4vi7bihLMXtvW2pCj7KVnv8jPvl2JSdnm0yBUnbZ1A0Se9Jqx7YsGqs6OIy0rbDBhPLrQ6AqHeOYCivaz7VLdgbi1g4kl+Jh0oRS0bBWJURBb74AgSkaLQs6teRmu/MAbrndaOH0MWOPIZHjLgkSq6n/YX0IjgltU5V8zO8=
Received: by 10.100.144.11 with SMTP id r11mr3858808and.1175548737613;
    Mon, 02 Apr 2007 14:18:57 -0700 (PDT)
Received: by 10.100.200.11 with HTTP; Mon, 2 Apr 2007 14:18:57 -0700 (PDT)
Message-ID: <817da7960704021418g6e1d4662y3250be14bc01ab69@mail.gmail.com>
Date: Mon, 2 Apr 2007 17:18:57 -0400
From: "Charles Weber"
To: "Roger Heflin"
Subject: Re: xfs partial dismount issue
Cc: linux-xfs@oss.sgi.com, sandeen@sandeen.net
In-Reply-To: <45EC868A.4060607@atipa.com>
MIME-Version: 1.0
References: <45EC3DEA.3000105@sandeen.net> <45EC868A.4060607@atipa.com>
Content-Type: text/plain
Content-Disposition: inline
Content-Transfer-Encoding: 7bit
Content-length: 2822
X-archive-position: 11019
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: chaweber@gmail.com
Precedence: bulk
X-list: xfs

Well actually I did a test with ext3 and got the same result. It partially
dismounted after a day or so of use and seemed identical to my previous xfs
filesystem failures. My guess now is that this has always occurred when all
6 raid controllers (2 per card) were in use. I could go quite some time with
5 of the 6 controllers used. I consolidated everything to 2 cards, removed
one card and put in a fiber channel card for my new storage array. So far no
problems. If so then it seems something is funny about the cciss driver.

thanks,
Chuck

On 3/5/07, Roger Heflin wrote:
>
> Charles Weber wrote:
> > Eric Sandeen sandeen.net> writes:
> >
> >> Chuck Weber wrote:
> >>> Hi everyone, I have a long running problem perhaps you can help with. I
> >>> will include as much detail as I can. I can set up a spare server-disk
> >>> set for testing if you have any bright ideas.
> >>>
> >>> We use XFS for samba and nfs on x86_64 Fedora Proliant DL585/385
> >>> servers. Our busiest server has disk partitions go away.
> >> What do you mean by this, exactly? The partitions themselves go away,
> >> or are you talking about the problem described below where processes
> >> start hanging?
> >>
> > Here is an example partition (1 of 6 or more xfs storage only).
> > /share/store3 with samba shares on /share/store3/lls, lds, lxs and so on.
> > I will get a call saying my groups share (lxs) is no longer accessible. I ssh
> > into server and can ls /share/store3 but ls will hang when I ls
> > /share/store3/lxs. Shortly thereafter ls will hang for the root or any
> > directory on the partition. Other partitions will be fine and other samba shares
> > will be fine until the queued up process load bogs the server down.
> >
>
> Charles,
>
> I have seen what may be a similar issue on SLES9SP2, we had 1 xfs
> partition, and under certain conditions it would stop responding, all
> non-xfs partitions were ok, and everything was fine after a reboot.
>
> Under sysrq-t it appeared to me that 2 separate processes were calling
> fsync and were causing each other to deadlock (and locking all others
> out of changing the xfs partition). I was not able to determine exactly
> what the underlying bug was, but all of the hung processes
> were waiting on locks in at least several widely different parts of the
> xfs and kernel code, and adjusting the application to not fsync has
> apparently resulted in the deadlock not occurring. In this case
> there were multiple (2-4) different instances of the application calling
> fsync apparently sometimes at close to the same time. With the
> given application the failure was almost a certainty on one machine
> (of 100) running the application overnight.
>
> Roger
>

[[HTML alternate version deleted]]

From owner-xfs@oss.sgi.com Mon Apr 2 16:46:21 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 02 Apr 2007 16:46:22 -0700 (PDT)
X-Spam-oss-Status: No, score=-2.2 required=5.0 tests=AWL,BAYES_00
    autolearn=no version=3.2.0-pre1-r499012
Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31])
    by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l32NkIfB010080
    for ; Mon, 2 Apr 2007 16:46:20 -0700
Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254])
    by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l32NkHUX007234;
    Mon, 2 Apr 2007 19:46:17 -0400
Received: from pobox-2.corp.redhat.com (pobox-2.corp.redhat.com [10.11.255.15])
    by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l32NkHod021971;
    Mon, 2 Apr 2007 19:46:17 -0400
Received: from [10.15.80.10] (neon.msp.redhat.com [10.15.80.10])
    by pobox-2.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l32NkGlR000765;
    Mon, 2 Apr 2007 19:46:16 -0400
Message-ID: <46119596.1020900@sandeen.net>
Date: Mon, 02 Apr 2007 18:45:26 -0500
From: Eric Sandeen
User-Agent: Thunderbird 1.5.0.10 (X11/20070302)
MIME-Version: 1.0
To: Timothy Shimmin
CC: xfs-dev@sgi.com, xfs@oss.sgi.com
Subject: Re: review: remove unused ilen var from xfs_vnodeops.c
References:
In-Reply-To:
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-archive-position: 11020
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: sandeen@sandeen.net
Precedence: bulk
X-list: xfs

Timothy Shimmin wrote:
> simple cleanup patch
>

Hey that's my job! ;-)

Looks good

-Eric

From owner-xfs@oss.sgi.com Mon Apr 2 17:54:12 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 02 Apr 2007 17:54:15 -0700 (PDT)
X-Spam-oss-Status: No, score=-1.9 required=5.0 tests=AWL,BAYES_00
    autolearn=no version=3.2.0-pre1-r499012
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130])
    by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l330s9fB000643
    for ; Mon, 2 Apr 2007 17:54:11 -0700
Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237])
    by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA23394
    for ; Tue, 3 Apr 2007 10:54:07 +1000
Received: by chook.melbourne.sgi.com (Postfix, from userid 1116)
    id C15D458FF80A; Tue, 3 Apr 2007 10:54:07 +1000 (EST)
To: xfs@oss.sgi.com
Subject: TAKE cleanup - remove ilen refs from vnodeops.c
Message-Id: <20070403005407.C15D458FF80A@chook.melbourne.sgi.com>
Date: Tue, 3 Apr 2007 10:54:07 +1000 (EST)
From: tes@sgi.com (Tim Shimmin)
X-archive-position: 11021
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tes@sgi.com
Precedence: bulk
X-list: xfs

Remove unused ilen variable and references.

Date: Tue Apr 3 10:53:20 AEST 2007
Workarea: chook.melbourne.sgi.com:/build/tes/2.6.x-xfs
Inspected by: lachlan@sgi.com,sandeen@sandeen.net

The following file(s) were checked into:
  longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb

Modid: xfs-linux-melb:xfs-kern:28344a
fs/xfs/xfs_vnodeops.c - 1.694 - changed
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vnodeops.c.diff?r1=text&tr1=1.694&r2=text&tr2=1.693&f=h
	- Remove unused ilen variable and references.

From owner-xfs@oss.sgi.com Mon Apr 2 23:09:38 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 02 Apr 2007 23:09:40 -0700 (PDT)
X-Spam-oss-Status: No, score=-0.6 required=5.0 tests=AWL,BAYES_50
    autolearn=no version=3.2.0-pre1-r499012
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130])
    by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3369YfB022611
    for ; Mon, 2 Apr 2007 23:09:36 -0700
Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141])
    by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA01072;
    Tue, 3 Apr 2007 16:09:33 +1000
Date: Tue, 03 Apr 2007 16:10:38 +1100
From: Timothy Shimmin
To: xfs@oss.sgi.com, xfs-dev@sgi.com
Subject: review: export xfs_buftarg_list for use by xfsidbg
Message-ID:
X-Mailer: Mulberry/4.0.8 (Mac OS X)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
X-archive-position: 11023
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tes@sgi.com
Precedence: bulk
X-list: xfs
Content-Length: 6296
Lines: 185

Hi,

This patch addresses the problem of having xfs_buftarg_list global
for use by kdb and xfsidbg, where otherwise it would be static.
If we are using xfsidbg, then we export the xfs_get_buftarg_list
function to it.
Previous to Dave's (dgc) static changes, we were only globalizing it
if we were in DEBUG - when really it is a question of xfsidbg.
Using CONFIG_KDB_MODULES as this is what we use in the Makefiles
for determining if xfsidbg is used.

--Tim

 linux-2.4/xfs_buf.c   |    8 ++++++++
 linux-2.4/xfs_buf.h   |    3 +++
 linux-2.4/xfs_ksyms.c |    5 ++---
 linux-2.6/xfs_buf.c   |   10 +++++++++-
 linux-2.6/xfs_buf.h   |    3 +++
 linux-2.6/xfs_ksyms.c |    5 ++---
 xfsidbg.c             |    9 +++------
 7 files changed, 30 insertions(+), 13 deletions(-)

===========================================================================
Index: fs/xfs/linux-2.4/xfs_buf.c
===========================================================================

--- a/fs/xfs/linux-2.4/xfs_buf.c	2007-04-03 15:47:15.000000000 +1000
+++ b/fs/xfs/linux-2.4/xfs_buf.c	2007-04-03 15:45:23.930213823 +1000
@@ -2335,3 +2335,11 @@ xfs_buf_terminate(void)
 	kmem_zone_destroy(xfs_buf_zone);
 	kmem_shake_deregister(xfs_buf_shake);
 }
+
+#ifdef CONFIG_KDB_MODULES
+struct list_head *
+xfs_get_buftarg_list(void)
+{
+	return &xfs_buftarg_list;
+}
+#endif

===========================================================================
Index: fs/xfs/linux-2.4/xfs_buf.h
===========================================================================

--- a/fs/xfs/linux-2.4/xfs_buf.h	2007-04-03 15:47:15.000000000 +1000
+++ b/fs/xfs/linux-2.4/xfs_buf.h	2007-04-03 15:46:59.997634360 +1000
@@ -500,6 +500,9 @@ extern void xfs_free_buftarg(xfs_buftarg
 extern void xfs_wait_buftarg(xfs_buftarg_t *);
 extern int xfs_setsize_buftarg(xfs_buftarg_t *, unsigned int, unsigned int);
 extern int xfs_flush_buftarg(xfs_buftarg_t *, int);
+#ifdef CONFIG_KDB_MODULES
+extern struct list_head *xfs_get_buftarg_list(void);
+#endif
 
 #define xfs_getsize_buftarg(buftarg) block_size((buftarg)->bt_kdev)
 #define xfs_readonly_buftarg(buftarg) is_read_only((buftarg)->bt_kdev)

===========================================================================
Index: fs/xfs/linux-2.4/xfs_ksyms.c
===========================================================================

--- a/fs/xfs/linux-2.4/xfs_ksyms.c	2007-04-03 15:47:15.000000000 +1000
+++ b/fs/xfs/linux-2.4/xfs_ksyms.c	2007-04-03 15:46:29.489629322 +1000
@@ -124,9 +124,8 @@ EXPORT_SYMBOL(xfs_params);
 EXPORT_SYMBOL(xfs_bmbt_disk_get_all);
 #endif
 
-#if defined(CONFIG_XFS_DEBUG)
-extern struct list_head xfs_buftarg_list;
-EXPORT_SYMBOL(xfs_buftarg_list);
+#if defined(CONFIG_KDB_MODULES)
+EXPORT_SYMBOL(xfs_get_buftarg_list);
 #endif
 
 /*

===========================================================================
Index: fs/xfs/linux-2.6/xfs_buf.c
===========================================================================

--- a/fs/xfs/linux-2.6/xfs_buf.c	2007-04-03 15:47:15.000000000 +1000
+++ b/fs/xfs/linux-2.6/xfs_buf.c	2007-04-03 15:11:15.778300965 +1000
@@ -1426,7 +1426,7 @@ xfs_free_bufhash(
 /*
  * buftarg list for delwrite queue processing
  */
-LIST_HEAD(xfs_buftarg_list);
+static LIST_HEAD(xfs_buftarg_list);
 static DEFINE_SPINLOCK(xfs_buftarg_lock);
 
 STATIC void
@@ -1867,3 +1867,11 @@ xfs_buf_terminate(void)
 	ktrace_free(xfs_buf_trace_buf);
 #endif
 }
+
+#ifdef CONFIG_KDB_MODULES
+struct list_head *
+xfs_get_buftarg_list(void)
+{
+	return &xfs_buftarg_list;
+}
+#endif

===========================================================================
Index: fs/xfs/linux-2.6/xfs_buf.h
===========================================================================

--- a/fs/xfs/linux-2.6/xfs_buf.h	2007-04-03 15:47:15.000000000 +1000
+++ b/fs/xfs/linux-2.6/xfs_buf.h	2007-04-03 15:22:51.547106965 +1000
@@ -411,6 +411,9 @@ extern void xfs_free_buftarg(xfs_buftarg
 extern void xfs_wait_buftarg(xfs_buftarg_t *);
 extern int xfs_setsize_buftarg(xfs_buftarg_t *, unsigned int, unsigned int);
 extern int xfs_flush_buftarg(xfs_buftarg_t *, int);
+#ifdef CONFIG_KDB_MODULES
+extern struct list_head *xfs_get_buftarg_list(void);
+#endif
 
 #define xfs_getsize_buftarg(buftarg) block_size((buftarg)->bt_bdev)
 #define xfs_readonly_buftarg(buftarg) bdev_read_only((buftarg)->bt_bdev)

===========================================================================
Index: fs/xfs/linux-2.6/xfs_ksyms.c
===========================================================================

--- a/fs/xfs/linux-2.6/xfs_ksyms.c	2007-04-03 15:47:15.000000000 +1000
+++ b/fs/xfs/linux-2.6/xfs_ksyms.c	2007-04-03 14:52:27.011000730 +1000
@@ -125,9 +125,8 @@ EXPORT_SYMBOL(xfs_params);
 EXPORT_SYMBOL(xfs_bmbt_disk_get_all);
 #endif
 
-#if defined(CONFIG_XFS_DEBUG)
-extern struct list_head xfs_buftarg_list;
-EXPORT_SYMBOL(xfs_buftarg_list);
+#if defined(CONFIG_KDB_MODULES)
+EXPORT_SYMBOL(xfs_get_buftarg_list);
 #endif
 
 /*

===========================================================================
Index: fs/xfs/xfsidbg.c
===========================================================================

--- a/fs/xfs/xfsidbg.c	2007-04-03 15:47:15.000000000 +1000
+++ b/fs/xfs/xfsidbg.c	2007-04-03 15:24:02.201877199 +1000
@@ -62,6 +62,7 @@
 #include "xfs_quota.h"
 #include "quota/xfs_qm.h"
 #include "xfs_iomap.h"
+#include "xfs_buf.h"
 
 MODULE_AUTHOR("Silicon Graphics, Inc.");
 MODULE_DESCRIPTION("Additional kdb commands for debugging XFS");
@@ -2350,8 +2351,7 @@ kdbm_bp(int argc, const char **argv)
 static int
 kdbm_bpdelay(int argc, const char **argv)
 {
-#ifdef DEBUG
-	extern struct list_head xfs_buftarg_list;
+	struct list_head *xfs_buftarg_list = xfs_get_buftarg_list();
 	struct list_head *curr, *next;
 	xfs_buftarg_t *tp, *n;
 	xfs_buf_t bp;
@@ -2372,7 +2372,7 @@ kdbm_bpdelay(int argc, const char **argv
 	}
 
 
-	list_for_each_entry_safe(tp, n, &xfs_buftarg_list, bt_list) {
+	list_for_each_entry_safe(tp, n, xfs_buftarg_list, bt_list) {
 		list_for_each_safe(curr, next, &tp->bt_delwrite_queue) {
 			addr = (unsigned long)list_entry(curr, xfs_buf_t, b_list);
 			if ((diag = kdb_getarea(bp, addr)))
@@ -2388,9 +2388,6 @@ kdbm_bpdelay(int argc, const char **argv
 			}
 		}
 	}
-#else
-	kdb_printf("bt_delwrite_queue inaccessible (non-debug)\n");
-#endif
 
 	return 0;
 }

From owner-xfs@oss.sgi.com Tue Apr 3 05:01:49 2007
Received: with ECARTIS (v1.0.0; list xfs); Tue, 03 Apr 2007 05:01:56 -0700 (PDT)
X-Spam-oss-Status: No, score=3.5 required=5.0 tests=BAYES_99
    autolearn=no version=3.2.0-pre1-r499012
Received: from an-out-0708.google.com (an-out-0708.google.com [209.85.132.245])
    by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l33C1lfB004228
    for ; Tue, 3 Apr 2007 05:01:48 -0700
Received: by an-out-0708.google.com with SMTP id c5so1529305anc
    for ; Tue, 03 Apr 2007 05:01:47 -0700 (PDT)
DKIM-Signature: a=rsa-sha1; c=relaxed/relaxed; d=gmail.com; s=beta;
    h=domainkey-signature:received:received:message-id:date:from:to:subject:mime-version:content-type:content-transfer-encoding:content-disposition;
    b=txHZ6dlfmDTAXzC1JJfxIC4zr23HysF+vBV63kPFJQojyrwYBxC+zarceB7LmuBqtg+GU2SQX9tFi5Miwz6XGDX79fdFvkAzmfEmatq5Ti3j0iqD9L3BDryfSKlBN2uqPNXX6oq/B6rX0trdWSr3ZltvhGKbXgbqyy2rdj5FEPg=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta;
    h=received:message-id:date:from:to:subject:mime-version:content-type:content-transfer-encoding:content-disposition;
    b=kzkQhrK0HCVnZ5TB83FgwDJEDJKg4Znm2oneaxe2FYT0dFJreyO4NtExZAfNS9pfZYvukd4EUpRjlmra1+0X7VaWc4xuFKNhB16tPPw3gefMiRbv7k7Rksrt3A70pAPDnCkyPdoHXcDXBe+Rls0h+3ic02qiNEV+oj2afLR68+I=
Received: by 10.100.91.6 with SMTP id o6mr4212834anb.1175598219092;
    Tue, 03 Apr 2007 04:03:39 -0700 (PDT)
Received: by 10.100.138.16 with HTTP; Tue, 3 Apr 2007 04:03:39 -0700 (PDT)
Message-ID: <12fac1030704030403t3ffc3599w5a0191476eb8b865@mail.gmail.com>
Date: Tue, 3 Apr 2007 13:03:39 +0200
From: Sencer
To: xfs@oss.sgi.com
Subject: md/dm devices, barrier support, commodity hardware and data integrity
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
X-archive-position: 11024
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: alisencer@gmail.com
Precedence: bulk
X-list: xfs

Hello,

After reading up on XFS, there are a couple of issues that still seem
kind of cloudy to me. I am merely a user of filesystems, so forgive me
if some issues seem obvious. If you could confirm/clarify/answer the
following issues, it would be very helpful to me.

Situation 1)
We are currently using XFS on a commodity x86 server with SATA drives
(with NCQ) on Debian Etch (Kernel: 2.6.18-3-k7). We are also using
Software-Raid1 (mdadm). All partitions except /boot are XFS. If I
understand the FAQ and recent ml-discussions right, then

1a) without software-raid, we would enjoy write barrier support,
however given that we are using md-devices this is not the case
(kern.log confirms this by explicitly stating barrier support is
disabled for mdX ...). Did this (barrier support with XFS on md)
change in later kernels or is it likely to change in the near (or far)
future? (I think I read mentions of md, and some kind of
barrier-awareness on the ml, but didn't quite understand what
effectively follows from it from a user's POV).

1b) Given the current circumstances above, we should disable write
cache as suggested in the faq (there are actually UPS's but they've
failed before) to reduce the possibility of losing data. Correct? We
did need to do some hard-resets, and had power failure, though as of
yet we never had problems with lost data on any xfs partitions, and
I'd like to make sure it stays that way.

1c) We have backup strategies in place, so I can live with having a
few partly damaged files and restoring them from backups. However I am
not sure how we would make sure that we can find out about all such
damaged files or if any such files exist ( referring to
http://oss.sgi.com/projects/xfs/faq.html#nulls ). Are there tools for
finding potential candidates for corruption? I am assuming that there
would be a way to find out which files were the most recently touched
with the help of the journal.
Or do I just use shell-magic and find files by mtime and check if there
are Nulls at the end of those files modified within the last minute or
two before the crash?

Situation 2)
I hear many people saying that using XFS on machines that have no UPS
(as in Notebooks [battery removed], Desktops etc.) is something that
is not recommended. But after reading up on the issues, the
recommendation should really go for every FS that only does meta-data
journaling, as alluded to in the FAQ.

2a) And with the recent changes (barrier support and sync on
truncated+modified+closed files) I assume there is really no reason to
choose another meta-data journaling FS over XFS for such machines in
terms of likelihood of damaged files after hard-resets and power
failures - would you agree?

2b) When dm-crypt+luks is being used, there is no barrier support
available (for XFS) even if the underlying hdd supports it, correct?
Should this be expected to change, or is it more likely to stay that
way? (due to limited dev. resources and priorities? or due to
principal issues with it?)

Thanks in advance
Sencer

From owner-xfs@oss.sgi.com Tue Apr 3 12:40:59 2007
Received: with ECARTIS (v1.0.0; list xfs); Tue, 03 Apr 2007 12:41:00 -0700 (PDT)
X-Spam-oss-Status: No, score=0.7 required=5.0 tests=AWL,BAYES_60,
    J_CHICKENPOX_45 autolearn=no version=3.2.0-pre1-r499012
Received: from barcelona.int.jammed.com (jammed.com [216.99.218.161])
    by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l33JevfB027379
    for ; Tue, 3 Apr 2007 12:40:58 -0700
Received: from barcelona.int.jammed.com (barcelona.int.jammed.com [172.16.64.15])
    by barcelona.int.jammed.com (Postfix) with ESMTP id D5AA2BD6E
    for ; Tue, 3 Apr 2007 12:11:46 -0700 (PDT)
Date: Tue, 3 Apr 2007 12:11:46 -0700 (PDT)
From: "James W. Abendschan"
X-X-Sender: jwa@barcelona.int.jammed.com
To: xfs@oss.sgi.com
Subject: xfs_repair segfault
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-archive-position: 11025
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: jwa@jammed.com
Precedence: bulk
X-list: xfs

Hi there -- I have a 6.9TB XFS volume that is acting up
after a power failure (I understand XFS + no UPS + PC
hardware == badness. Not my decision.)

The machine is a dual proc x86 (intel xeon 5130) w/ 8GB RAM
running a custom 2.6.18 kernel on top of Ubuntu 6.06.

Since xfs_check can't repair volumes of this size without
scads of memory, I've been using xfs_repair to correct
power-related problems before.

Unfortunately, for some reason xfs_repair is segfaulting:

# ulimit -c unlimited
# xfs_repair -v /dev/md1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
zero_log: head block 8 tail block 8
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - agno = 17
        - agno = 18
        - agno = 19
        - agno = 20
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
        - agno = 28
        - agno = 29
        - agno = 30
        - agno = 31
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - clear lost+found (if it exists) ...
        - clearing existing "lost+found" inode
Segmentation fault (core dumped)


gdb doesn't show anything useful (I don't know how to interpret
the I/O error) :


# gdb /sbin/xfs_repair core
GNU gdb 6.4-debian
Copyright 2005 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i486-linux-gnu"...(no debugging symbols found) Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1". (no debugging symbols found) Core was generated by `xfs_repair -v /dev/md1'. Program terminated with signal 11, Segmentation fault. warning: Can't read pathname for load map: Input/output error. Reading symbols from /lib/libuuid.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/libuuid.so.1 Reading symbols from /lib/tls/i686/cmov/libc.so.6...(no debugging symbols found) Loaded symbols for /lib/tls/i686/cmov/libc.so.6 Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done. Loaded symbols for /lib/ld-linux.so.2 #0 0x08052f42 in ?? () (gdb) bt #0 0x08052f42 in ?? () #1 0x000088e9 in ?? () #2 0x00000800 in ?? () #3 0x00000080 in ?? () #4 0x00000000 in ?? () What's the next step? Thanks, James From owner-xfs@oss.sgi.com Tue Apr 3 17:42:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 03 Apr 2007 17:42:42 -0700 (PDT) X-Spam-oss-Status: No, score=0.3 required=5.0 tests=AWL,BAYES_50, J_CHICKENPOX_45 autolearn=no version=3.2.0-pre1-r499012 Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l340gbfB023972 for ; Tue, 3 Apr 2007 17:42:39 -0700 Received: from pcbnaujok (pc-bnaujok.melbourne.sgi.com [134.14.55.58]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA01820; Wed, 4 Apr 2007 10:42:30 +1000 Message-Id: <200704040042.KAA01820@larry.melbourne.sgi.com> From: "Barry Naujok" To: "'James W. 
Abendschan'" , Subject: RE: xfs_repair segfault Date: Wed, 4 Apr 2007 10:45:47 +1000 MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook, Build 11.0.6353 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028 In-Reply-To: Thread-Index: Acd2KC2mjxULtKq3T3C2GrfXBg0FrAAKdumQ X-archive-position: 11026 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@melbourne.sgi.com Precedence: bulk X-list: xfs Hi James, Would it be possible for you to apply the patch I posted to xfs@oss in Feb http://oss.sgi.com/archives/xfs/2007-02/msg00072.html to the latest xfsprogs source, make and install it and run: # xfs_metadump /dev/md1 - | bzip2 > /tmp/bad_xfs.bz2 And make the image available for me to download and analyse? Regards, Barry. > -----Original Message----- > From: xfs-bounce@oss.sgi.com [mailto:xfs-bounce@oss.sgi.com] > On Behalf Of James W. Abendschan > Sent: Wednesday, 4 April 2007 5:12 AM > To: xfs@oss.sgi.com > Subject: xfs_repair segfault > > Hi there -- I have a 6.9TB XFS volume that is acting up > after a power failure (I understand XFS + no UPS + PC > hardware == badness. Not my decision.) > > The machine is a dual proc x86 (intel xeon 5130) w/ 8GB RAM > running a custom 2.6.18 kernel on top of Ubuntu 6.06. > > Since xfs_check can't repair volumes of this size without > scads of memory, I've been using xfs_repair to correct > power-related problems before. > > Unfortunately, for some reason xfs_repair is segfaulting: > > # ulimit -c unlimited > # xfs_repair -v /dev/md1 > Phase 1 - find and verify superblock... > Phase 2 - using internal log > - zero log... > zero_log: head block 8 tail block 8 > - scan filesystem freespace and inode maps... > - found root inode chunk > Phase 3 - for each AG... > - scan and clear agi unlinked lists... > - process known inodes and perform inode discovery... 
> - agno = 0 > - agno = 1 > - agno = 2 > - agno = 3 > - agno = 4 > - agno = 5 > - agno = 6 > - agno = 7 > - agno = 8 > - agno = 9 > - agno = 10 > - agno = 11 > - agno = 12 > - agno = 13 > - agno = 14 > - agno = 15 > - agno = 16 > - agno = 17 > - agno = 18 > - agno = 19 > - agno = 20 > - agno = 21 > - agno = 22 > - agno = 23 > - agno = 24 > - agno = 25 > - agno = 26 > - agno = 27 > - agno = 28 > - agno = 29 > - agno = 30 > - agno = 31 > - process newly discovered inodes... > Phase 4 - check for duplicate blocks... > - setting up duplicate extent list... > - clear lost+found (if it exists) ... > - clearing existing "lost+found" inode > Segmentation fault (core dumped) > > > gdb doesn't show anything useful (I don't know how to interpret > the I/O error) : > > > # gdb /sbin/xfs_repair core > GNU gdb 6.4-debian > Copyright 2005 Free Software Foundation, Inc. > GDB is free software, covered by the GNU General Public > License, and you are > welcome to change it and/or distribute copies of it under > certain conditions. > Type "show copying" to see the conditions. > There is absolutely no warranty for GDB. Type "show > warranty" for details. > This GDB was configured as "i486-linux-gnu"...(no debugging > symbols found) > Using host libthread_db library > "/lib/tls/i686/cmov/libthread_db.so.1". > > (no debugging symbols found) > Core was generated by `xfs_repair -v /dev/md1'. > Program terminated with signal 11, Segmentation fault. > > warning: Can't read pathname for load map: Input/output error. > Reading symbols from /lib/libuuid.so.1...(no debugging > symbols found)...done. > Loaded symbols for /lib/libuuid.so.1 > Reading symbols from /lib/tls/i686/cmov/libc.so.6...(no > debugging symbols found) > Loaded symbols for /lib/tls/i686/cmov/libc.so.6 > Reading symbols from /lib/ld-linux.so.2...(no debugging > symbols found)...done. > Loaded symbols for /lib/ld-linux.so.2 > > #0 0x08052f42 in ?? () > (gdb) bt > #0 0x08052f42 in ?? () > #1 0x000088e9 in ?? 
() > #2 0x00000800 in ?? () > #3 0x00000080 in ?? () > #4 0x00000000 in ?? () > > > What's the next step? > > Thanks, > James > > > From owner-xfs@oss.sgi.com Wed Apr 4 06:24:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 06:24:27 -0700 (PDT) X-Spam-oss-Status: No, score=0.8 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34DOKfB022642 for ; Wed, 4 Apr 2007 06:24:21 -0700 Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id 373B8BF30 for ; Wed, 4 Apr 2007 15:05:39 +0200 (CEST) Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 11168-04 for ; Wed, 4 Apr 2007 15:05:35 +0200 (CEST) Received: by mail3b1.westend.com (Postfix, from userid 1001) id 3AA7CBF84; Wed, 4 Apr 2007 15:05:35 +0200 (CEST) Date: Wed, 4 Apr 2007 15:05:35 +0200 From: Thomas Kaehn To: xfs@oss.sgi.com Subject: Strange delete performance using XFS Message-ID: <20070404130535.GE18320@mail3b.westend.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit User-Agent: Mutt/1.5.9i X-archive-position: 11033 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tk@westend.com Precedence: bulk X-list: xfs Content-Length: 2042 Lines: 62 Hi, I've got a strange problem on one machine using XFS. Deleting large directories (containing about 100000 files, 20k each) using "rm -rf" takes nearly as long as creating the files using a bash loop. The machine is running Debian Sarge with a vanilla 2.6.20.3 kernel. 
CPU: Dual Xeon(TM) CPU 3.20GHz RAM: 4 GB RAID10: 4x 320 GB disks connected to 3ware 9550SXU-8LP (Firmware Version = FE9X 3.08.00.004) The XFS filesystem was first created using default options and later on with "-d su=64k,sw=2 -l su=64k" which improved overall performance but not delete performance. Has anyone noticed similar effects? On a different server (Dell 6850) the directory can be deleted within seconds. What could be the reason for the huge difference in delete performance? Please see below for "time" output. | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done | | real 6m6.814s | user 0m30.290s | sys 2m42.562s | # time rm -rf y | | real 5m18.034s | user 0m0.036s | sys 0m8.169s In contrast to this the result on the Dell machine looks more reasonable: | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done | | real 9m26.658s | user 0m24.134s | sys 3m3.623s | # time rm -rf x | | real 0m10.254s | user 0m0.124s | sys 0m10.105s Ciao, Thomas PS: Using JFS and ext3 it is also possible to delete the above directory in a couple of seconds. Only XFS seems problematic in this regard on this system. 
-- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Wed Apr 4 06:29:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 06:29:55 -0700 (PDT) X-Spam-oss-Status: No, score=-0.9 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34DTlfB024034 for ; Wed, 4 Apr 2007 06:29:49 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id B8834F063B21; Wed, 4 Apr 2007 09:29:46 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id B314A7176724; Wed, 4 Apr 2007 09:29:46 -0400 (EDT) Date: Wed, 4 Apr 2007 09:29:46 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: <20070404130535.GE18320@mail3b.westend.com> Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-2087494152-1175693386=:7309" X-archive-position: 11034 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Content-Length: 2784 Lines: 86 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. 
---1463747160-2087494152-1175693386=:7309 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Wed, 4 Apr 2007, Thomas Kaehn wrote: > Hi, > > I've got a strange problem on one machine using XFS. Deleting large > directories (containing about 100000 files, 20k each) using "rm -rf" > takes nearly as long as creating the files using a bash loop. > > The machine is running Debian Sarge with a vanilla 2.6.20.3 kernel. > CPU: Dual Xeon(TM) CPU 3.20GHz > RAM: 4 GB > RAID10: 4x 320 GB disks connected to 3ware 9550SXU-8LP > (Firmware Version = FE9X 3.08.00.004) > > The XFS filesystem was first created using default options and later on with > "-d su=64k,sw=2 -l su=64k" which improved overall performance > but not delete performance. > > Has anyone noticed similar effects? On a different server (Dell 6850) > the directory can be deleted within seconds. What could be the reason > for the huge difference in delete performance? > > Please see below for "time" output. > > | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done > | > | real 6m6.814s > | user 0m30.290s > | sys 2m42.562s > | # time rm -rf y > | > | real 5m18.034s > | user 0m0.036s > | sys 0m8.169s > > In contrast to this the result on the Dell machine looks more > reasonable: > > | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done > | > | real 9m26.658s > | user 0m24.134s > | sys 3m3.623s > | # time rm -rf x > | > | real 0m10.254s > | user 0m0.124s > | sys 0m10.105s > > Ciao, > Thomas > > PS: Using JFS and ext3 it is also possible to delete the above directory > in a couple of seconds. Only XFS seems problematic in this regard on > this system. 
> -- > Thomas Kähn WESTEND GmbH | Internet-Business-Provider > Technik CISCO Systems Partner - Authorized Reseller > Im Süsterfeld 6 Tel 0241/701333-18 > tk@westend.com D-52072 Aachen Fax 0241/911879 > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 > Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb > > Deletes on XFS is one area that is a little slower than other filesystems. You can increase the log size during the creation of the filesystem and also increase logbufs to 8 and that might help. ---1463747160-2087494152-1175693386=:7309-- From owner-xfs@oss.sgi.com Wed Apr 4 06:47:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 06:47:38 -0700 (PDT) X-Spam-oss-Status: No, score=0.6 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34DlTfB027541 for ; Wed, 4 Apr 2007 06:47:31 -0700 Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id 74F7CC159; Wed, 4 Apr 2007 15:47:29 +0200 (CEST) Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 26351-05; Wed, 4 Apr 2007 15:47:24 +0200 (CEST) Received: by mail3b1.westend.com (Postfix, from userid 1001) id E983AC126; Wed, 4 Apr 2007 15:47:24 +0200 (CEST) Date: Wed, 4 Apr 2007 15:47:24 +0200 From: Thomas Kaehn To: Justin Piszcz Cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS Message-ID: <20070404134724.GF18320@mail3b.westend.com> References: <20070404130535.GE18320@mail3b.westend.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.9i X-archive-position: 11035 X-ecartis-version: 
Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tk@westend.com Precedence: bulk X-list: xfs Content-Length: 1850 Lines: 49 Hi Justin, On Wed, Apr 04, 2007 at 09:29:46AM -0400, Justin Piszcz wrote: > >Please see below for "time" output. > > > >| # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 > >>/dev/null 2>&1; done > >| > >| real 6m6.814s > >| user 0m30.290s > >| sys 2m42.562s > >| # time rm -rf y > >| > >| real 5m18.034s > >| user 0m0.036s > >| sys 0m8.169s > Deletes on XFS is one area that is a little slower than other filesystems. > You can increase the log size during the creation of the filesystem and > also increase logbufs to 8 and that might help. Thanks for your suggestions. I also tried to increase the log size and logbufs mount option. This optimizes create and delete times to the above values (with default options both are around 9-10 minutes). The strange thing is that on a similar Dell machine, also using XFS, deletes take only ten seconds which would match user and system time. More than five minutes for deleting 100000 files where ext3 needs 3 seconds on the same machine is actually more than a little bit slower - to my mind there must be something wrong. JFS needs around 18 seconds. However I am not sure if the problem is hardware or software related. I've also tried to use the newest 3ware firmware - but this did not lead to an improvement. 
Ciao, Thomas -- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Wed Apr 4 06:51:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 06:51:20 -0700 (PDT) X-Spam-oss-Status: No, score=-0.9 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34DpAfB028519 for ; Wed, 4 Apr 2007 06:51:12 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 5A323F063B21; Wed, 4 Apr 2007 09:51:10 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 553C57176724; Wed, 4 Apr 2007 09:51:10 -0400 (EDT) Date: Wed, 4 Apr 2007 09:51:10 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: <20070404134724.GF18320@mail3b.westend.com> Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-732358276-1175694670=:7309" X-archive-position: 11036 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Content-Length: 2419 Lines: 72 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. 
---1463747160-732358276-1175694670=:7309 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Wed, 4 Apr 2007, Thomas Kaehn wrote: > Hi Justin, > > On Wed, Apr 04, 2007 at 09:29:46AM -0400, Justin Piszcz wrote: >>> Please see below for "time" output. >>> >>> | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >>>> /dev/null 2>&1; done >>> | >>> | real 6m6.814s >>> | user 0m30.290s >>> | sys 2m42.562s >>> | # time rm -rf y >>> | >>> | real 5m18.034s >>> | user 0m0.036s >>> | sys 0m8.169s > >> Deletes on XFS is one area that is a little slower than other filesystems. >> You can increase the log size during the creation of the filesystem and >> also increase logbufs to 8 and that might help. > > Thanks for your suggestions. > > I also tried to increase the log size and logbufs mount option. This > optimizes create and delete times to the above values (with default options > both are around 9-10 minutes). > > The strange thing is that on a similar Dell machine, also using XFS, > deletes take only ten seconds which would match user and system time. > > More than five minutes for deleting 100000 files where ext3 needs > 3 seconds on the same machine is actually more than a little bit slower > - to my mind there must be something wrong. JFS needs around 18 seconds. > > However I am not sure if the problem is hardware or software related. > I've also tried to use the newest 3ware firmware - but this did not lead > to an improvement. 
> > Ciao, > Thomas > -- > Thomas Kähn WESTEND GmbH | Internet-Business-Provider > Technik CISCO Systems Partner - Authorized Reseller > Im Süsterfeld 6 Tel 0241/701333-18 > tk@westend.com D-52072 Aachen Fax 0241/911879 > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 > Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb > > I am running some benchmarks with SW raid and will present my findings shortly. ---1463747160-732358276-1175694670=:7309-- From owner-xfs@oss.sgi.com Wed Apr 4 06:57:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 06:57:26 -0700 (PDT) X-Spam-oss-Status: No, score=-0.9 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34DvKfB030429 for ; Wed, 4 Apr 2007 06:57:21 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id CFFF2F063B21; Wed, 4 Apr 2007 09:57:16 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id CAB0E7176724; Wed, 4 Apr 2007 09:57:16 -0400 (EDT) Date: Wed, 4 Apr 2007 09:57:16 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: <20070404134724.GF18320@mail3b.westend.com> Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-295697546-1175695036=:7309" X-archive-position: 11037 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Content-Length: 2757 Lines: 90 This message is in MIME format. 
The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463747160-295697546-1175695036=:7309 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Wed, 4 Apr 2007, Thomas Kaehn wrote: > Hi Justin, > > On Wed, Apr 04, 2007 at 09:29:46AM -0400, Justin Piszcz wrote: >>> Please see below for "time" output. >>> >>> | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >>>> /dev/null 2>&1; done >>> | >>> | real 6m6.814s >>> | user 0m30.290s >>> | sys 2m42.562s >>> | # time rm -rf y >>> | >>> | real 5m18.034s >>> | user 0m0.036s >>> | sys 0m8.169s > >> Deletes on XFS is one area that is a little slower than other filesystems. >> You can increase the log size during the creation of the filesystem and >> also increase logbufs to 8 and that might help. > > Thanks for your suggestions. > > I also tried to increase the log size and logbufs mount option. This > optimizes create and delete times to the above values (with default options > both are around 9-10 minutes). > > The strange thing is that on a similar Dell machine, also using XFS, > deletes take only ten seconds which would match user and system time. > > More than five minutes for deleting 100000 files where ext3 needs > 3 seconds on the same machine is actually more than a little bit slower > - to my mind there must be something wrong. JFS needs around 18 seconds. > > However I am not sure if the problem is hardware or software related. > I've also tried to use the newest 3ware firmware - but this did not lead > to an improvement. 
> > Ciao, > Thomas > -- > Thomas Kähn WESTEND GmbH | Internet-Business-Provider > Technik CISCO Systems Partner - Authorized Reseller > Im Süsterfeld 6 Tel 0241/701333-18 > tk@westend.com D-52072 Aachen Fax 0241/911879 > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 > Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb > > The benchmark: $ time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done 1. Six 400GB SATA drives using SW RAID5: real 6m24.411s user 0m43.097s sys 2m17.350s 2. Four Raptor 150 ADFD drives using SW RAID5: real 3m16.962s user 0m42.899s sys 2m15.420s 3. Two Raptor 74GB *GD drives using SW RAID1: real 3m19.241s user 0m41.731s sys 2m15.873s ---1463747160-295697546-1175695036=:7309-- From owner-xfs@oss.sgi.com Wed Apr 4 06:57:58 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 06:58:03 -0700 (PDT) X-Spam-oss-Status: No, score=-0.8 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34DvufB030663 for ; Wed, 4 Apr 2007 06:57:57 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 4A17FF063B21; Wed, 4 Apr 2007 09:57:56 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 269017176724; Wed, 4 Apr 2007 09:57:56 -0400 (EDT) Date: Wed, 4 Apr 2007 09:57:56 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-1613578982-1175695076=:7309" X-archive-position: 11038 
X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Content-Length: 2627 Lines: 79 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463747160-1613578982-1175695076=:7309 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Wed, 4 Apr 2007, Justin Piszcz wrote: > > > On Wed, 4 Apr 2007, Thomas Kaehn wrote: > >> Hi Justin, >> >> On Wed, Apr 04, 2007 at 09:29:46AM -0400, Justin Piszcz wrote: >>>> Please see below for "time" output. >>>> >>>> | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >>>>> /dev/null 2>&1; done >>>> | >>>> | real 6m6.814s >>>> | user 0m30.290s >>>> | sys 2m42.562s >>>> | # time rm -rf y >>>> | >>>> | real 5m18.034s >>>> | user 0m0.036s >>>> | sys 0m8.169s >> >>> Deletes on XFS is one area that is a little slower than other filesystems. >>> You can increase the log size during the creation of the filesystem and >>> also increase logbufs to 8 and that might help. >> >> Thanks for your suggestions. >> >> I also tried to increase the log size and logbufs mount option. This >> optimizes create and delete times to the above values (with default options >> both are around 9-10 minutes). >> >> The strange thing is that on a similar Dell machine, also using XFS, >> deletes take only ten seconds which would match user and system time. >> >> More than five minutes for deleting 100000 files where ext3 needs >> 3 seconds on the same machine is actually more than a little bit slower >> - to my mind there must be something wrong. JFS needs around 18 seconds. >> >> However I am not sure if the problem is hardware or software related. >> I've also tried to use the newest 3ware firmware - but this did not lead >> to an improvement. 
>> >> Ciao, >> Thomas >> -- >> Thomas Kähn WESTEND GmbH | Internet-Business-Provider >> Technik CISCO Systems Partner - Authorized Reseller >> Im Süsterfeld 6 Tel 0241/701333-18 >> tk@westend.com D-52072 Aachen Fax 0241/911879 >> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - >> Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 >> Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb >> >> > > I am running some benchmarks with SW raid and will present my findings > shortly. Removal tests coming shortly, benchmarking is always interesting. ---1463747160-1613578982-1175695076=:7309-- From owner-xfs@oss.sgi.com Wed Apr 4 07:12:51 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 07:12:57 -0700 (PDT) X-Spam-oss-Status: No, score=-0.8 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34ECmfB002491 for ; Wed, 4 Apr 2007 07:12:50 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 6A9EEF063B21; Wed, 4 Apr 2007 10:12:48 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 655AF7176724; Wed, 4 Apr 2007 10:12:48 -0400 (EDT) Date: Wed, 4 Apr 2007 10:12:48 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-1228164866-1175695968=:7309" X-archive-position: 11039 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Content-Length: 3274 Lines: 117 This 
message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463747160-1228164866-1175695968=:7309 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Wed, 4 Apr 2007, Justin Piszcz wrote: > > > On Wed, 4 Apr 2007, Thomas Kaehn wrote: > >> Hi Justin, >> >> On Wed, Apr 04, 2007 at 09:29:46AM -0400, Justin Piszcz wrote: >>>> Please see below for "time" output. >>>> >>>> | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >>>>> /dev/null 2>&1; done >>>> | >>>> | real 6m6.814s >>>> | user 0m30.290s >>>> | sys 2m42.562s >>>> | # time rm -rf y >>>> | >>>> | real 5m18.034s >>>> | user 0m0.036s >>>> | sys 0m8.169s >> >>> Deletes on XFS is one area that is a little slower than other filesystems. >>> You can increase the log size during the creation of the filesystem and >>> also increase logbufs to 8 and that might help. >> >> Thanks for your suggestions. >> >> I also tried to increase the log size and logbufs mount option. This >> optimizes create and delete times to the above values (with default options >> both are around 9-10 minutes). >> >> The strange thing is that on a similar Dell machine, also using XFS, >> deletes take only ten seconds which would match user and system time. >> >> More than five minutes for deleting 100000 files where ext3 needs >> 3 seconds on the same machine is actually more than a little bit slower >> - to my mind there must be something wrong. JFS needs around 18 seconds. >> >> However I am not sure if the problem is hardware or software related. >> I've also tried to use the newest 3ware firmware - but this did not lead >> to an improvement. 
>> >> Ciao, >> Thomas >> -- >> Thomas Kähn WESTEND GmbH | Internet-Business-Provider >> Technik CISCO Systems Partner - Authorized Reseller >> Im Süsterfeld 6 Tel 0241/701333-18 >> tk@westend.com D-52072 Aachen Fax 0241/911879 >> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - >> Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 >> Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb >> >> > > The benchmark: > $ time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >> /dev/null 2>&1; done > > 1. Six 400GB SATA drives using SW RAID5: > real 6m24.411s > user 0m43.097s > sys 2m17.350s > > 2. Four Raptor 150 ADFD drives using SW RAID5: > real 3m16.962s > user 0m42.899s > sys 2m15.420s > > 3. Two Raptor 74GB *GD drives using SW RAID1: > real 3m19.241s > user 0m41.731s > sys 2m15.873s > > The removals: The benchmark: $ time rm -rf test 1. Six 400GB SATA drives using SW RAID5: real 0m33.996s user 0m0.057s sys 0m8.101s 2. Four Raptor 150 ADFD drives using SW RAID5: real 0m43.967s user 0m0.071s sys 0m8.340s 3. 
Two Raptor 74GB *GD drives using SW RAID1: real 0m32.965s user 0m0.049s sys 0m6.307s ---1463747160-1228164866-1175695968=:7309-- From owner-xfs@oss.sgi.com Wed Apr 4 07:13:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 07:13:43 -0700 (PDT) X-Spam-oss-Status: No, score=-0.8 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34EDafB002895 for ; Wed, 4 Apr 2007 07:13:39 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 55C01F080A24; Wed, 4 Apr 2007 10:13:36 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 50D137176724; Wed, 4 Apr 2007 10:13:36 -0400 (EDT) Date: Wed, 4 Apr 2007 10:13:36 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-2094664383-1175696016=:7309" X-archive-position: 11040 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Content-Length: 3303 Lines: 109 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463747160-2094664383-1175696016=:7309 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Wed, 4 Apr 2007, Justin Piszcz wrote: > > > On Wed, 4 Apr 2007, Thomas Kaehn wrote: > >> Hi Justin, >> >> On Wed, Apr 04, 2007 at 09:29:46AM -0400, Justin Piszcz wrote: >>>> Please see below for "time" output. 
>>>> >>>> | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >>>>> /dev/null 2>&1; done >>>> | >>>> | real 6m6.814s >>>> | user 0m30.290s >>>> | sys 2m42.562s >>>> | # time rm -rf y >>>> | >>>> | real 5m18.034s >>>> | user 0m0.036s >>>> | sys 0m8.169s >> >>> Deletes on XFS is one area that is a little slower than other filesystems. >>> You can increase the log size during the creation of the filesystem and >>> also increase logbufs to 8 and that might help. >> >> Thanks for your suggestions. >> >> I also tried to increase the log size and logbufs mount option. This >> optimizes create and delete times to the above values (with default options >> both are around 9-10 minutes). >> >> The strange thing is that on a similar Dell machine, also using XFS, >> deletes take only ten seconds which would match user and system time. >> >> More than five minutes for deleting 100000 files where ext3 needs >> 3 seconds on the same machine is actually more than a little bit slower >> - to my mind there must be something wrong. JFS needs around 18 seconds. >> >> However I am not sure if the problem is hardware or software related. >> I've also tried to use the newest 3ware firmware - but this did not lead >> to an improvement. >> >> Ciao, >> Thomas >> -- >> Thomas Kähn WESTEND GmbH | Internet-Business-Provider >> Technik CISCO Systems Partner - Authorized Reseller >> Im Süsterfeld 6 Tel 0241/701333-18 >> tk@westend.com D-52072 Aachen Fax 0241/911879 >> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - >> Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 >> Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb >> >> > > The benchmark: > $ time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >> /dev/null 2>&1; done > > 1. Six 400GB SATA drives using SW RAID5: > real 6m24.411s > user 0m43.097s > sys 2m17.350s > > 2. 
Four Raptor 150 ADFD drives using SW RAID5:
> real 3m16.962s
> user 0m42.899s
> sys 2m15.420s
>
> 3. Two Raptor 74GB *GD drives using SW RAID1:
> real 3m19.241s
> user 0m41.731s
> sys 2m15.873s
>
>
I used the DEFAULT create options for XFS as I find it highly optimizes
itself (at least with SW raid) with the exception of the ROOT FS, I had
that optimized awhile ago and I kept it:

/dev/md2 / xfs logbufs=8,logbsize=262144,biosize=16,noatime,nodiratime,nobarrier 0 1

For my regular RAID5s though I use defaults,noatime.

Justin.
---1463747160-2094664383-1175696016=:7309--

From owner-xfs@oss.sgi.com Wed Apr 4 07:22:12 2007
Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 07:22:18 -0700 (PDT)
X-Spam-oss-Status: No, score=0.5 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012
Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34EMBfB005858 for ; Wed, 4 Apr 2007 07:22:12 -0700
Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id E39C7C16B; Wed, 4 Apr 2007 16:22:10 +0200 (CEST)
Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 05404-04-13; Wed, 4 Apr 2007 16:22:07 +0200 (CEST)
Received: by mail3b1.westend.com (Postfix, from userid 1001) id 6593AC12D; Wed, 4 Apr 2007 16:21:42 +0200 (CEST)
Date: Wed, 4 Apr 2007 16:21:42 +0200
From: Thomas Kaehn
To: Justin Piszcz
Cc: xfs@oss.sgi.com
Subject: Re: Strange delete performance using XFS
Message-ID: <20070404142142.GG18320@mail3b.westend.com>
References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:
User-Agent: Mutt/1.5.9i
X-archive-position: 11041
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tk@westend.com
Precedence: bulk
X-list: xfs
Content-Length: 1210
Lines: 37

Hi Justin,

On Wed, Apr 04, 2007 at 10:12:48AM -0400, Justin Piszcz wrote:
> On Wed, 4 Apr 2007, Justin Piszcz wrote:
> >On Wed, 4 Apr 2007, Thomas Kaehn wrote:
> >$ time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20
> >>/dev/null 2>&1; done
> >
> >1. Six 400GB SATA drives using SW RAID5:
> >real 6m24.411s
> >user 0m43.097s
> >sys 2m17.350s
> >
>
> The removals:
> The benchmark:
> $ time rm -rf test
>
> 1. Six 400GB SATA drives using SW RAID5:
> real 0m33.996s
> user 0m0.057s
> sys 0m8.101s

thanks for your benchmark. To my mind this clearly shows that my
setup is wrong at some point. I'll try again with your mount options.

Ciao,
Thomas
--
Thomas Kähn WESTEND GmbH | Internet-Business-Provider
Technik CISCO Systems Partner - Authorized Reseller
Im Süsterfeld 6 Tel 0241/701333-18
tk@westend.com D-52072 Aachen Fax 0241/911879
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608
Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb

From owner-xfs@oss.sgi.com Wed Apr 4 07:24:46 2007
Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 07:24:50 -0700 (PDT)
X-Spam-oss-Status: No, score=-0.7 required=5.0 tests=AWL,BAYES_50, J_CHICKENPOX_43 autolearn=no version=3.2.0-pre1-r499012
Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34EOhfB006635 for ; Wed, 4 Apr 2007 07:24:45 -0700
Received: by lucidpixels.com (Postfix, from userid 1001) id A39DAF063B21; Wed, 4 Apr 2007 10:24:42 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 9FC3C717672A; Wed, 4 Apr 2007 10:24:42 -0400 (EDT)
Date: Wed, 4 Apr 2007 10:24:42 -0400 (EDT)
From: Justin Piszcz
X-X-Sender: jpiszcz@p34.internal.lan
To: Thomas Kaehn
cc:
xfs@oss.sgi.com
Subject: Re: Strange delete performance using XFS
In-Reply-To: <20070404142142.GG18320@mail3b.westend.com>
Message-ID:
References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> <20070404142142.GG18320@mail3b.westend.com>
MIME-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-702127020-1175696682=:7309"
X-archive-position: 11042
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: jpiszcz@lucidpixels.com
Precedence: bulk
X-list: xfs
Content-Length: 1917
Lines: 63

This message is in MIME format. The first part should be readable text,
while the remaining parts are likely unreadable without MIME-aware tools.

---1463747160-702127020-1175696682=:7309
Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE

On Wed, 4 Apr 2007, Thomas Kaehn wrote:

> Hi Justin,
>
> On Wed, Apr 04, 2007 at 10:12:48AM -0400, Justin Piszcz wrote:
>> On Wed, 4 Apr 2007, Justin Piszcz wrote:
>>> On Wed, 4 Apr 2007, Thomas Kaehn wrote:
>>> $ time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20
>>>> /dev/null 2>&1; done
>>>
>>> 1. Six 400GB SATA drives using SW RAID5:
>>> real 6m24.411s
>>> user 0m43.097s
>>> sys 2m17.350s
>>>
>>
>> The removals:
>> The benchmark:
>> $ time rm -rf test
>>
>> 1. Six 400GB SATA drives using SW RAID5:
>> real 0m33.996s
>> user 0m0.057s
>> sys 0m8.101s
>
> thanks for your benchmark. To my mind this clearly shows that my
> setup is wrong at some point. I'll try again with your mount options.
>
> Ciao,
> Thomas
> --
> Thomas Kähn WESTEND GmbH | Internet-Business-Provider
> Technik CISCO Systems Partner - Authorized Reseller
> Im Süsterfeld 6 Tel 0241/701333-18
> tk@westend.com D-52072 Aachen Fax 0241/911879
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608
> Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb
>
>
My guess is mkfs.xfs cannot optimize for your array like it can with a SW
RAID device because it cannot see what is underneath it. Have you tried
making a SW RAID? I also use optimized parameters for my SW RAID1/5 as
well FYI.

Justin.
---1463747160-702127020-1175696682=:7309--

From owner-xfs@oss.sgi.com Wed Apr 4 07:35:47 2007
Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 07:35:52 -0700 (PDT)
X-Spam-oss-Status: No, score=0.5 required=5.0 tests=AWL,BAYES_50, J_CHICKENPOX_43 autolearn=no version=3.2.0-pre1-r499012
Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34EZkfB008940 for ; Wed, 4 Apr 2007 07:35:47 -0700
Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id 387EDC177; Wed, 4 Apr 2007 16:35:45 +0200 (CEST)
Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 10353-03-2; Wed, 4 Apr 2007 16:35:37 +0200 (CEST)
Received: by mail3b1.westend.com (Postfix, from userid 1001) id B287BC117; Wed, 4 Apr 2007 16:35:33 +0200 (CEST)
Date: Wed, 4 Apr 2007 16:35:33 +0200
From: Thomas Kaehn
To: Justin Piszcz
Cc: xfs@oss.sgi.com
Subject: Re: Strange delete performance using XFS
Message-ID: <20070404143533.GF12481@mail3b.westend.com>
References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> <20070404142142.GG18320@mail3b.westend.com>
Mime-Version: 1.0
Content-Type:
text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:
User-Agent: Mutt/1.5.9i
X-archive-position: 11043
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tk@westend.com
Precedence: bulk
X-list: xfs
Content-Length: 1933
Lines: 53

Hi Justin,

On Wed, Apr 04, 2007 at 10:24:42AM -0400, Justin Piszcz wrote:
> >On Wed, Apr 04, 2007 at 10:12:48AM -0400, Justin Piszcz wrote:
> >>On Wed, 4 Apr 2007, Justin Piszcz wrote:
> >>>On Wed, 4 Apr 2007, Thomas Kaehn wrote:
> >>>$ time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20
> >>>>/dev/null 2>&1; done
> >>>
> My guess is mkfs.xfs cannot optimize for your array like it can with a SW
> RAID device because it cannot see what is underneath it. Have you tried
> making a SW RAID? I also use optimized parameters for my SW RAID1/5 as
> well FYI.

I guess this might be the problem. I've already tried to alter
the stripe unit to match the RAID stripe size: "-d su=64k,sw=2 -l su=64k".

Maybe the 3ware controller can't deal with the kind of read and write
patterns needed by XFS. But in this case other people should have
realized similar problems.

On a different system with a 3ware 9500S-4LP using 4 disks as RAID5
setup I get a better (but not really good) result for delete
performance (I've taken only 50000 files in this case as the system's
CPU is much slower):

| # time for i in `seq 1 50000`; do dd if=/dev/zero of=$i
| bs=1k count=20 >/dev/null 2>&1; done
|
| real 18m21.643s
| user 0m55.727s
| sys 3m12.140s
| backup:/srv/x# cd ..
| backup:/srv# rm -rf x | | # time rm -rf x | | real 5m7.845s | user 0m0.160s | sys 0m11.369s Ciao, Thomas -- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Wed Apr 4 08:45:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 08:45:33 -0700 (PDT) X-Spam-oss-Status: No, score=-0.4 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from smtp108.sbc.mail.mud.yahoo.com (smtp108.sbc.mail.mud.yahoo.com [68.142.198.207]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l34FjRfB026359 for ; Wed, 4 Apr 2007 08:45:28 -0700 Received: (qmail 95791 invoked from network); 4 Apr 2007 15:45:26 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp108.sbc.mail.mud.yahoo.com with SMTP; 4 Apr 2007 15:45:25 -0000 X-YMail-OSG: VVG0PLEVM1mC181ic3eeqjwe.pA93jXFNrc6sdqfX4N8N.o06NwfUPyuHOiCpMxcsJnKi4Utiw-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 74C8D1826127; Wed, 4 Apr 2007 08:45:23 -0700 (PDT) Date: Wed, 4 Apr 2007 08:45:23 -0700 From: Chris Wedgwood To: Thomas Kaehn Cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS Message-ID: <20070404154523.GA20096@tuatara.stupidest.org> References: <20070404130535.GE18320@mail3b.westend.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070404130535.GE18320@mail3b.westend.com> X-archive-position: 11044 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs 
Content-Length: 1390
Lines: 46

On Wed, Apr 04, 2007 at 03:05:35PM +0200, Thomas Kaehn wrote:

> I've got a strange problem on one machine using XFS. Deleting large
> directories (containing about 100000 files, 20k each) using "rm -rf"
> lasts nearly as long as creating the files using a bash loop.

quite possible

> RAM: 4 GB
> RAID10: 4x 320 GB disks connected to 3ware 9550SXU-8LP
> (Firmware Version = FE9X 3.08.00.004)

> The XFS was first created using default options and later on with
> "-d su=64k,sw=2 -l su=64k" which improved overall performance
> but not delete performance.

have you tried w/o using the hw raid?

> Has anyone realized similar effects? On a different server (Dell
> 6850) the directory can be deleted within seconds. What could be the
> reason for the huge difference in delete performance?

a lot of log updates; does the other server have a battery-backed
write-cache like many cards do these days?

> | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done
> |
> | real 6m6.814s
> | user 0m30.290s
> | sys 2m42.562s

that's about the same as my quick single-spindle cheap-desktop test
here

> | # time rm -rf y
> |
> | real 5m18.034s
> | user 0m0.036s
> | sys 0m8.169s

v2 logs? what logbufs & logbsize is used?
testing with my cheap crappy desktop workstation thing with a single disk I get "1m25.004s" for the delete From owner-xfs@oss.sgi.com Wed Apr 4 11:36:41 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 11:36:54 -0700 (PDT) X-Spam-oss-Status: No, score=-0.7 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34IadfB001401 for ; Wed, 4 Apr 2007 11:36:40 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 8B561F063B21; Wed, 4 Apr 2007 14:36:37 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 8A4A4717672A; Wed, 4 Apr 2007 14:36:37 -0400 (EDT) Date: Wed, 4 Apr 2007 14:36:37 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: <20070404134724.GF18320@mail3b.westend.com> Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-1108556179-1175711797=:16731" X-archive-position: 11045 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Content-Length: 2388 Lines: 70 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463747160-1108556179-1175711797=:16731 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Wed, 4 Apr 2007, Thomas Kaehn wrote: > Hi Justin, > > On Wed, Apr 04, 2007 at 09:29:46AM -0400, Justin Piszcz wrote: >>> Please see below for "time" output. 
>>>
>>> | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20
>>>> /dev/null 2>&1; done
>>> |
>>> | real 6m6.814s
>>> | user 0m30.290s
>>> | sys 2m42.562s
>>> | # time rm -rf y
>>> |
>>> | real 5m18.034s
>>> | user 0m0.036s
>>> | sys 0m8.169s
>
>> Deletes on XFS is one area that is a little slower than other filesystems.
>> You can increase the log size during the creation of the filesystem and
>> also increase logbufs to 8 and that might help.
>
> Thanks for your suggestions.
>
> I also tried to increase the log size and logbufs mount option. This
> optimizes create and delete times to the above values (with default options
> both are around 9-10 minutes).
>
> The strange thing is that on a similar Dell machines using XFS, too,
> deletes take only ten seconds which would match user and system time.
>
> More than five minutes for deleting 100000 files where ext3 needs
> 3 seconds on the same machine is actually more than a little bit slower
> - to my mind there must be something wrong. JFS needs around 18 seconds.
>
> However I am not sure if the problem is hardware or software related.
> I've also tried to use the newest 3ware firmware - but this did not lead
> to an improvement.
>
> Ciao,
> Thomas
> --
> Thomas Kähn WESTEND GmbH | Internet-Business-Provider
> Technik CISCO Systems Partner - Authorized Reseller
> Im Süsterfeld 6 Tel 0241/701333-18
> tk@westend.com D-52072 Aachen Fax 0241/911879
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608
> Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb
>
For the ext3, try time bash -c 'rm -rf test; sync'
---1463747160-1108556179-1175711797=:16731--

From owner-xfs@oss.sgi.com Wed Apr 4 13:45:08 2007
Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 13:45:12 -0700 (PDT)
X-Spam-oss-Status: No, score=-0.6 required=5.0 tests=AWL,BAYES_50, J_CHICKENPOX_43 autolearn=no version=3.2.0-pre1-r499012
Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34Kj4fB005320 for ; Wed, 4 Apr 2007 13:45:08 -0700
Received: by lucidpixels.com (Postfix, from userid 1001) id D56A6F063B21; Wed, 4 Apr 2007 16:45:03 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id D21F9717672A; Wed, 4 Apr 2007 16:45:03 -0400 (EDT)
Date: Wed, 4 Apr 2007 16:45:03 -0400 (EDT)
From: Justin Piszcz
X-X-Sender: jpiszcz@p34.internal.lan
To: Thomas Kaehn
cc: xfs@oss.sgi.com
Subject: Re: Strange delete performance using XFS
In-Reply-To: <20070404143533.GF12481@mail3b.westend.com>
Message-ID:
References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> <20070404142142.GG18320@mail3b.westend.com> <20070404143533.GF12481@mail3b.westend.com>
MIME-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-1667215911-1175719503=:20373"
X-archive-position: 11046
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: jpiszcz@lucidpixels.com
Precedence: bulk
X-list: xfs
Content-Length: 2514
Lines: 73

This message is in MIME format. The first part should be readable text,
while the remaining parts are likely unreadable without MIME-aware tools.

---1463747160-1667215911-1175719503=:20373
Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE

On Wed, 4 Apr 2007, Thomas Kaehn wrote:

> Hi Justin,
>
> On Wed, Apr 04, 2007 at 10:24:42AM -0400, Justin Piszcz wrote:
>>> On Wed, Apr 04, 2007 at 10:12:48AM -0400, Justin Piszcz wrote:
>>>> On Wed, 4 Apr 2007, Justin Piszcz wrote:
>>>>> On Wed, 4 Apr 2007, Thomas Kaehn wrote:
>>>>> $ time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20
>>>>>> /dev/null 2>&1; done
>>>>>
>> My guess is mkfs.xfs cannot optimize for your array like it can with a SW
>> RAID device because it cannot see what is underneath it. Have you tried
>> making a SW RAID? I also use optimized parameters for my SW RAID1/5 as
>> well FYI.
>
> I guess this might be the problem. I've already tried to alter
> the stripe unit to match the RAID stripe size: "-d su=64k,sw=2 -l su=64k".
>
> Maybe the 3ware controller can't deal with the kind of read and write
> patterns needed by XFS. But in this case other people should have
> realized similar problems.
>
> On a different system with a 3ware 9500S-4LP using 4 disks as RAID5
> setup I get a better (but not really good) result for delete
> performance (I've taken only 50000 files in this case as the system's
> CPU is much slower):
>
> | # time for i in `seq 1 50000`; do dd if=/dev/zero of=$i
> | bs=1k count=20 >/dev/null 2>&1; done
> |
> | real 18m21.643s
> | user 0m55.727s
> | sys 3m12.140s
> | backup:/srv/x# cd ..
> | backup:/srv# rm -rf x
> |
> | # time rm -rf x
> |
> | real 5m7.845s
> | user 0m0.160s
> | sys 0m11.369s
>
>
> Ciao,
> Thomas
> --
> Thomas Kähn WESTEND GmbH | Internet-Business-Provider
> Technik CISCO Systems Partner - Authorized Reseller
> Im Süsterfeld 6 Tel 0241/701333-18
> tk@westend.com D-52072 Aachen Fax 0241/911879
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608
> Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb
>
What do you get with ext3 when using time bash -c 'rm -f file; sync'
---1463747160-1667215911-1175719503=:20373--

From owner-xfs@oss.sgi.com Wed Apr 4 13:58:27 2007
Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 13:58:30 -0700 (PDT)
X-Spam-oss-Status: No, score=0.6 required=5.0 tests=BAYES_50,J_CHICKENPOX_43, J_CHICKENPOX_44,J_CHICKENPOX_45,J_CHICKENPOX_46,J_CHICKENPOX_47, J_CHICKENPOX_48 autolearn=no version=3.2.0-pre1-r499012
Received: from mail.linbit.com (nudl.linbit.com [212.69.162.21]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34KwPfB008103 for ; Wed, 4 Apr 2007 13:58:27 -0700
Received: by mail.linbit.com (LINBIT Mail Daemon, from userid 1030) id 7AE342E10D59; Wed, 4 Apr 2007 22:36:06 +0200 (CEST)
Date: Wed, 4 Apr 2007 22:36:01 +0200
X-OfflineIMAP-1301118847-6c61727340696d61702e6c696e626974-494e424f582e4f7574626f78: 1175718968-0570832815641-v4.0.11
From: Lars Ellenberg
To: xfs@oss.sgi.com
Subject: xfs_repair leaves empty but undeletable dirs in lost+found
Message-ID: <20070404203601.GA11771@barkeeper1.linbit>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.11
X-archive-position: 11047
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: lars.ellenberg@linbit.com
Precedence: bulk
X-list: xfs
Content-Length: 9501
Lines: 343

this is a plain debian
sarge system, i386 (actually k7 athlon), nothing fancy.
kernel is a self-built kernel.org 2.6.16 kernel, repackaged for sarge,
so it is no longer obvious which exact sublevel.
I can dig that up however, if it seems relevant.

this is a backup volume with typically many (40 to 70, maybe?) hardlinks.

Filesystem               Inodes  IUsed  IFree  IUse%  Mounted on
/dev/mapper/vg00-backup  1.2G    19M    1.2G   2%     /mnt/backup

Filesystem               Size    Used   Avail  Use%   Mounted on
/dev/mapper/vg00-backup  1.2T    809G   365G   69%    /mnt/backup

somehow file system corruption crept in.
may be related to some power loss, may be related to some strange
3w-xxxx: scsi0: AEN: WARNING: ATA UDMA upgrade: Port #3
messages (also for port 5), which are actually DOWNgrades...
(unfortunately 2.6.16 has that double definition bug still in some
header file, I'll submit the one-line patch to Adrian together with some
other stuff I have pending).
maybe cosmic rays. memtest86+ is fine, btw.

anyhow.
after several runs of xfs_repair version 2.6.20
(and then cleaning up lost+found, where possible),
some "empty" but not so empty directories stay behind.

[root:/mnt/backup/lost+found]# rmdir 4059295137
rmdir: `4059295137': Directory not empty
[root:/mnt/backup/lost+found]# find 4059295137 -ls
4059295137 8 drwxrwxr-x 2 1049 1049 4096 Apr 3 12:39 4059295137

that is it!
[right now I have still 118 of these]

additional runs of xfs_repair don't change the situation.
one run takes about 8+ hours, though.

NOTE that I used the default sarge xfsprogs version 2.6.20,
not the upstream 2.8.20. yet.
I'll start an xfs_repair run with 2.8.20 right after this post, though...

from a post on xfs mailing list
Message-Id: <200608150145.LAA07105@larry.melbourne.sgi.com>
From: Barry Naujok
To: 'Paul Slootman' , xfs@oss.sgi.com
Subject: RE: cache_purge: shake on cache 0x5880a0 left 8 nodes!?
Date: Tue, 15 Aug 2006 11:49:13 +1000 I found a suggestion to investigate like this: xfs_db version 2.6.20 xfs_db /dev/vg00/backup [sorry, I skipped the blockget -n and therefor the ncheck, if absolutely necessary, I can do that still, but I suspect it will take hours, too? I tried it, but after a few minutes with 95% CPU and about 2GB RAM used, I killed it...] xfs_db> inode 4059295137 xfs_db> p core.magic = 0x494e core.mode = 040775 core.version = 1 core.format = 2 (extents) core.nlinkv1 = 2 core.uid = 1049 core.gid = 1049 core.flushiter = 24 core.atime.sec = Thu Feb 1 20:30:54 2007 core.atime.nsec = 533550800 core.mtime.sec = Tue Apr 3 12:39:00 2007 core.mtime.nsec = 472123752 core.ctime.sec = Tue Apr 3 12:39:00 2007 core.ctime.nsec = 472123752 core.size = 4096 core.nblocks = 2 core.extsize = 0 core.nextents = 2 core.naextents = 0 core.forkoff = 0 core.aformat = 2 (extents) core.dmevmask = 0 core.dmstate = 0 core.newrtbm = 0 core.prealloc = 0 core.realtime = 0 core.immutable = 0 core.append = 0 core.sync = 0 core.noatime = 0 core.nodump = 0 core.gen = 11 next_unlinked = null u.bmx[0-1] = [startoff,startblock,blockcount,extentflag] 0:[0,253808988,1,0] 1:[8388608,253808989,1,0] xfs_db> ncheck lost+found must run blockget -n first xfs_db> dblock 0 xfs_db> p dhdr.magic = 0x58443244 dhdr.bestfree[0].offset = 0x20 dhdr.bestfree[0].length = 0xfd0 dhdr.bestfree[1].offset = 0 dhdr.bestfree[1].length = 0 dhdr.bestfree[2].offset = 0 dhdr.bestfree[2].length = 0 du[0].inumber = 4059295137 du[0].namelen = 1 du[0].name = "." du[0].tag = 0x10 du[1].freetag = 0xffff du[1].length = 0xfd0 du[1].tag = 0x20 du[2].inumber = 656 du[2].namelen = 2 du[2].name = ".." 
du[2].tag = 0xff0 xfs_db> dblock 8388608 xfs_db> p lhdr.info.forw = 0 lhdr.info.back = 0 lhdr.info.magic = 0xd2f1 lhdr.count = 99 lhdr.stale = 97 lbests[0] = 0:0xfd0 lents[0].hashval = 0x2e lents[0].address = 0x2 lents[1].hashval = 0x172e lents[1].address = 0x1fe lents[2].hashval = 0x859d16c lents[2].address = 0 lents[3].hashval = 0xdc6133e lents[3].address = 0 lents[4].hashval = 0xeffc248 lents[4].address = 0 lents[5].hashval = 0xfed728e lents[5].address = 0 lents[6].hashval = 0x124f4f36 lents[6].address = 0 lents[7].hashval = 0x13625491 lents[7].address = 0 lents[8].hashval = 0x1372549d lents[8].address = 0 lents[9].hashval = 0x19ef9ac0 lents[9].address = 0 lents[10].hashval = 0x1b1d6dce lents[10].address = 0 lents[11].hashval = 0x1db2dd93 lents[11].address = 0 lents[12].hashval = 0x262a70f3 lents[12].address = 0 lents[13].hashval = 0x29460811 lents[13].address = 0 lents[14].hashval = 0x2956081d lents[14].address = 0 lents[15].hashval = 0x2d8b48ab lents[15].address = 0 lents[16].hashval = 0x31e3c314 lents[16].address = 0 lents[17].hashval = 0x3669d1ab lents[17].address = 0 lents[18].hashval = 0x3679d1a7 lents[18].address = 0 lents[19].hashval = 0x36b8264b lents[19].address = 0 lents[20].hashval = 0x3996aced lents[20].address = 0 lents[21].hashval = 0x3a35ab00 lents[21].address = 0 lents[22].hashval = 0x3c670bbd lents[22].address = 0 lents[23].hashval = 0x3f150fd1 lents[23].address = 0 lents[24].hashval = 0x4308c7cd lents[24].address = 0 lents[25].hashval = 0x463489b9 lents[25].address = 0 lents[26].hashval = 0x475261d5 lents[26].address = 0 lents[27].hashval = 0x53469be7 lents[27].address = 0 lents[28].hashval = 0x53569beb lents[28].address = 0 lents[29].hashval = 0x589604ba lents[29].address = 0 lents[30].hashval = 0x594bf53d lents[30].address = 0 lents[31].hashval = 0x5981bb01 lents[31].address = 0 lents[32].hashval = 0x5a46bb9c lents[32].address = 0 lents[33].hashval = 0x5c0b668e lents[33].address = 0 lents[34].hashval = 0x5deca0a1 lents[34].address = 0 
lents[35].hashval = 0x5f872eba lents[35].address = 0 lents[36].hashval = 0x5f8a3faa lents[36].address = 0 lents[37].hashval = 0x5f9a3fa6 lents[37].address = 0 lents[38].hashval = 0x691b5828 lents[38].address = 0 lents[39].hashval = 0x69d27a70 lents[39].address = 0 lents[40].hashval = 0x73c859c2 lents[40].address = 0 lents[41].hashval = 0x73d859ce lents[41].address = 0 lents[42].hashval = 0x75eb5177 lents[42].address = 0 lents[43].hashval = 0x75fb517b lents[43].address = 0 lents[44].hashval = 0x891e2e5b lents[44].address = 0 lents[45].hashval = 0x8965893f lents[45].address = 0 lents[46].hashval = 0x8cb8b142 lents[46].address = 0 lents[47].hashval = 0x8cce44a0 lents[47].address = 0 lents[48].hashval = 0x8de83f12 lents[48].address = 0 lents[49].hashval = 0x8f22b70e lents[49].address = 0 lents[50].hashval = 0x8f4ab64c lents[50].address = 0 lents[51].hashval = 0x90acb7a0 lents[51].address = 0 lents[52].hashval = 0x9302cac0 lents[52].address = 0 lents[53].hashval = 0x9312cacc lents[53].address = 0 lents[54].hashval = 0x950c4daa lents[54].address = 0 lents[55].hashval = 0x960164df lents[55].address = 0 lents[56].hashval = 0x97df8a76 lents[56].address = 0 lents[57].hashval = 0x9b1ca984 lents[57].address = 0 lents[58].hashval = 0x9be7549e lents[58].address = 0 lents[59].hashval = 0x9bf75492 lents[59].address = 0 lents[60].hashval = 0x9db319d9 lents[60].address = 0 lents[61].hashval = 0xa4926d71 lents[61].address = 0 lents[62].hashval = 0xa5677d7e lents[62].address = 0 lents[63].hashval = 0xa5777d72 lents[63].address = 0 lents[64].hashval = 0xaa01d033 lents[64].address = 0 lents[65].hashval = 0xaa11d03f lents[65].address = 0 lents[66].hashval = 0xaa8a84f0 lents[66].address = 0 lents[67].hashval = 0xaa9a84fc lents[67].address = 0 lents[68].hashval = 0xae810ff0 lents[68].address = 0 lents[69].hashval = 0xaedc9085 lents[69].address = 0 lents[70].hashval = 0xb55cb496 lents[70].address = 0 lents[71].hashval = 0xb8866b93 lents[71].address = 0 lents[72].hashval = 0xb98b5af9 
lents[72].address = 0 lents[73].hashval = 0xb9d6e5c3 lents[73].address = 0 lents[74].hashval = 0xbb1c6ddf lents[74].address = 0 lents[75].hashval = 0xbc34bda9 lents[75].address = 0 lents[76].hashval = 0xbd20f48a lents[76].address = 0 lents[77].hashval = 0xbe8bae2a lents[77].address = 0 lents[78].hashval = 0xc86644bc lents[78].address = 0 lents[79].hashval = 0xcdb62e47 lents[79].address = 0 lents[80].hashval = 0xcdcd8923 lents[80].address = 0 lents[81].hashval = 0xd43269bc lents[81].address = 0 lents[82].hashval = 0xde074636 lents[82].address = 0 lents[83].hashval = 0xde17463a lents[83].address = 0 lents[84].hashval = 0xe1321219 lents[84].address = 0 lents[85].hashval = 0xe1465aa0 lents[85].address = 0 lents[86].hashval = 0xe65770a0 lents[86].address = 0 lents[87].hashval = 0xe7e6d19e lents[87].address = 0 lents[88].hashval = 0xeb8d1d7a lents[88].address = 0 lents[89].hashval = 0xf02bb092 lents[89].address = 0 lents[90].hashval = 0xf03bb09e lents[90].address = 0 lents[91].hashval = 0xf494b02c lents[91].address = 0 lents[92].hashval = 0xf58811b2 lents[92].address = 0 lents[93].hashval = 0xf59811be lents[93].address = 0 lents[94].hashval = 0xf988f496 lents[94].address = 0 lents[95].hashval = 0xfb4d59cd lents[95].address = 0 lents[96].hashval = 0xfb5d59c1 lents[96].address = 0 lents[97].hashval = 0xfca78996 lents[97].address = 0 lents[98].hashval = 0xfe80e968 lents[98].address = 0 ltail.bestcount = 1 xfs_db> hope that helps someone figure out what is wrong. if I can provide further info, anything, just tell me. cheers, -- : Lars Ellenberg Tel +43-1-8178292-0 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com : __ please use the "List-Reply" function of your email client. 
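
A quick consistency check on the leaf block dump above: lhdr.stale = 97 should equal the number of lents[] entries whose address is 0, which leaves only the two live entries at addresses 0x2 and 0x1fe. The sketch below counts this from a saved copy of the xfs_db output. It is not part of xfsprogs: the function name summarize_leaf and the shortened SAMPLE text are made up for illustration, and it assumes, as the dump suggests, that an address of 0 marks a stale entry.

```python
import re

# Matches fields such as "lents[2].address = 0" or "lents[0].hashval = 0x2e".
LENT_RE = re.compile(r"lents\[(\d+)\]\.(hashval|address) = (0x[0-9a-fA-F]+|\d+)")
# Matches the two leaf header counters printed by xfs_db.
HDR_RE = re.compile(r"lhdr\.(count|stale) = (\d+)")


def parse_number(text):
    """Convert an xfs_db number, printed either as hex (0x...) or decimal."""
    return int(text, 16) if text.startswith("0x") else int(text)


def summarize_leaf(dump):
    """Compare lhdr.count / lhdr.stale with the lents[] entries in the dump.

    Assumption (not confirmed by the post): address == 0 means "stale".
    """
    header = {name: int(value) for name, value in HDR_RE.findall(dump)}
    entries = {}
    for index, field, value in LENT_RE.findall(dump):
        entries.setdefault(int(index), {})[field] = parse_number(value)
    zero = sum(1 for entry in entries.values() if entry.get("address") == 0)
    return {
        "count": header.get("count"),
        "stale": header.get("stale"),
        "entries_seen": len(entries),
        "address_zero": zero,
        "live": len(entries) - zero,
    }


# Shortened, hypothetical sample in the same format as the dump above.
SAMPLE = (
    "lhdr.count = 4 lhdr.stale = 2 "
    "lents[0].hashval = 0x2e lents[0].address = 0x2 "
    "lents[1].hashval = 0x172e lents[1].address = 0x1fe "
    "lents[2].hashval = 0x859d16c lents[2].address = 0 "
    "lents[3].hashval = 0xdc6133e lents[3].address = 0"
)

if __name__ == "__main__":
    print(summarize_leaf(SAMPLE))
```

If "stale" and "address_zero" agree and "live" is 2, the leaf block itself looks self-consistent, which would point at something other than the leaf entries as the reason rmdir still reports the directory as not empty.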
From owner-xfs@oss.sgi.com Thu Apr 5 00:28:08 2007
Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 00:28:11 -0700 (PDT)
X-Spam-oss-Status: No, score=0.4 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012
Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l357S6fB028589 for ; Thu, 5 Apr 2007 00:28:07 -0700
Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id 994E3C308; Thu, 5 Apr 2007 09:28:05 +0200 (CEST)
Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 25356-05; Thu, 5 Apr 2007 09:28:03 +0200 (CEST)
Received: by mail3b1.westend.com (Postfix, from userid 1001) id 93425C2BC; Thu, 5 Apr 2007 09:28:03 +0200 (CEST)
Date: Thu, 5 Apr 2007 09:28:03 +0200
From: Thomas Kaehn
To: Chris Wedgwood
Cc: xfs@oss.sgi.com
Subject: Re: Strange delete performance using XFS
Message-ID: <20070405072803.GB2759@mail3b.westend.com>
References: <20070404130535.GE18320@mail3b.westend.com> <20070404154523.GA20096@tuatara.stupidest.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20070404154523.GA20096@tuatara.stupidest.org>
User-Agent: Mutt/1.5.9i
X-archive-position: 11049
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tk@westend.com
Precedence: bulk
X-list: xfs
Content-Length: 2790
Lines: 71

Hi Chris,

On Wed, Apr 04, 2007 at 08:45:23AM -0700, Chris Wedgwood wrote:
> On Wed, Apr 04, 2007 at 03:05:35PM +0200, Thomas Kaehn wrote:
> > The XFS was first created using default options and later on with
> > "-d su=64k,sw=2 -l su=64k" which improved overall performance
> > but not delete performance.
>
> have you tried w/o using the hw raid?

I'm going to test with a single disk in the same machine.
> > Has anyone realized similar effects? On a different server (Dell > > 6850) the directory can be deleted within seconds. What could be the > > reason for the huge difference in delete performance? > > a lot of log updates; does the other server have a battery-backed > write-cache like many cards to these days? The Dell system has got a battery-backed write-cache. The 3ware system has no battery unit. However it's supposed to provide write cache, too. At least I've enabled it in the RAID's configuration. The controller has got more than 100MB memory. > > | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done > > | > > | real 6m6.814s > > | user 0m30.290s > > | sys 2m42.562s > > that's about the same as my quick single-spindle cheap-desktop test > here > > > | # time rm -rf y > > | > > | real 5m18.034s > > | user 0m0.036s > > | sys 0m8.169s > > v2 logs? what logbufs & logbsize is used? > > testing with my cheap crappy desktop workstation thing with a > single disk I get "1m25.004s" for the delete Your delete time sounds more sensible to me. 
The file system was created first with default options - later on I tried to match the RAID's stripe size, increased the log size and mounted with logbufs=8: log stripe unit specified, using v2 logs meta-data=/dev/sda1 isize=256 agcount=8, agsize=125008 blks = sectsz=512 data = bsize=4096 blocks=1000032, imaxpct=25 = sunit=16 swidth=32 blks, unwritten=1 naming =version 2 bsize=4096 log =internal log bsize=4096 blocks=16384, version=2 = sectsz=512 sunit=16 blks realtime =none extsz=65536 blocks=0, rtextents=0 Ciao, Thomas -- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Thu Apr 5 00:37:25 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 00:37:27 -0700 (PDT) X-Spam-oss-Status: No, score=0.3 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l357bOfB030945 for ; Thu, 5 Apr 2007 00:37:25 -0700 Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id A4D40C2EE; Thu, 5 Apr 2007 09:37:23 +0200 (CEST) Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 28182-02; Thu, 5 Apr 2007 09:37:21 +0200 (CEST) Received: by mail3b1.westend.com (Postfix, from userid 1001) id EEA00C2E0; Thu, 5 Apr 2007 09:37:21 +0200 (CEST) Date: Thu, 5 Apr 2007 09:37:21 +0200 From: Thomas Kaehn To: Justin Piszcz Cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS Message-ID: <20070405073721.GC2759@mail3b.westend.com> References: 
<20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.9i X-archive-position: 11050 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tk@westend.com Precedence: bulk X-list: xfs Content-Length: 751 Lines: 23 Hi Justin, On Wed, Apr 04, 2007 at 02:36:37PM -0400, Justin Piszcz wrote: > For the ext3, try time bash -c 'rm -rf test; sync' # time bash -c 'rm -rf y; sync' real 0m1.592s user 0m0.032s sys 0m1.408s Ciao, Thomas -- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Thu Apr 5 01:17:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 01:17:57 -0700 (PDT) X-Spam-oss-Status: No, score=0.3 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l358HrfB010719 for ; Thu, 5 Apr 2007 01:17:55 -0700 Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id 66206C17E; Thu, 5 Apr 2007 10:17:53 +0200 (CEST) Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 07235-04-3; Thu, 5 Apr 2007 10:17:51 +0200 (CEST) Received: by mail3b1.westend.com (Postfix, from userid 1001) id 795E4C117; Thu, 5 Apr 2007 10:17:51 +0200 (CEST) Date: Thu, 5 Apr 2007 10:17:51 +0200 From: Thomas Kaehn To: 
Justin Piszcz Cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS Message-ID: <20070405081751.GD2759@mail3b.westend.com> References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.9i X-archive-position: 11051 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tk@westend.com Precedence: bulk X-list: xfs Content-Length: 1654 Lines: 46 Hi Justin, On Wed, Apr 04, 2007 at 10:13:36AM -0400, Justin Piszcz wrote: > I used the DEFAULT create options for XFS as I find it highly optimizes > itself (at least with SW raid) with the exception of the ROOT FS, I had > that optimized awhile ago and I kept it: > > /dev/md2 / xfs > logbufs=8,logbsize=262144,biosize=16,noatime,nodiratime,nobarrier 0 1 > > > For my regular RAID5s though I use defaults,noatime. I've disabled barriers, too, and performance increased dramatically. However I am not aware of the consequences of disabling write barriers. The FAQ generally recommends using write barriers except when having a battery-backed cache (this 3ware has not). | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done | | real 3m52.182s | user 0m30.482s | sys 3m16.152s | | # time \rm -rf y | | real 0m16.327s | user 0m0.052s | sys 0m8.305s So I am unsure if disabling is an option for me. I could imagine that write barriers are not properly supported by 3ware or have to be fine tuned at the kernel or SCSI level. 
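The create/delete test above can be wrapped in a small script so that the barrier and nobarrier runs are done the same way. This is only a sketch: the function names make_files and timed_delete are made up for illustration, it measures whole seconds with date instead of using time, and it adds a sync after the rm (as suggested for the ext3 run) so that the delete is not only measured in the page cache.

```shell
#!/bin/sh
# Sketch of the create/delete timing test from this thread.
# Assumptions (not from the thread): the helper names and the use of
# `date +%s` for coarse timing.

# make_files DIR COUNT: create COUNT files of 20 KiB each in DIR.
make_files() {
    dir=$1
    count=$2
    mkdir -p "$dir" || return 1
    i=1
    while [ "$i" -le "$count" ]; do
        dd if=/dev/zero of="$dir/$i" bs=1k count=20 >/dev/null 2>&1 || return 1
        i=$((i + 1))
    done
}

# timed_delete DIR: remove DIR, sync, and print the elapsed seconds.
timed_delete() {
    dir=$1
    start=$(date +%s)
    rm -rf "$dir"
    sync
    end=$(date +%s)
    echo $((end - start))
}
```

Calling make_files y 100000 followed by timed_delete y on the filesystem under test, once mounted normally and once with nobarrier, would give the two numbers to compare.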
Ciao, Thomas -- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Thu Apr 5 01:30:32 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 01:30:35 -0700 (PDT) X-Spam-oss-Status: No, score=0.6 required=5.0 tests=AWL,BAYES_60,HTML_MESSAGE, RDNS_NONE autolearn=no version=3.2.0-pre1-r499012 Received: from ilsmtp.nds.com ([192.118.32.12]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l358UQfD013932 for ; Thu, 5 Apr 2007 01:30:31 -0700 X-MimeOLE: Produced By Microsoft Exchange V6.5 MIME-Version: 1.0 Subject: XFS Resiliency to the disk errors. Date: Thu, 5 Apr 2007 11:08:07 +0300 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: XFS Resiliency to the disk errors. Thread-Index: Acd3WYs9KYsrse46SLGHxEpphOjGKg== From: "Zak, Semion" To: Content-Type: text/plain Content-Disposition: inline Content-Transfer-Encoding: 7bit X-archive-position: 11052 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: SZak@nds.com Precedence: bulk X-list: xfs Content-Length: 481 Lines: 21 Hi, We are studying possibility to use XFS with cheap (not too reliable) discs, so we have some questions: What in XFS is done to survive the disk errors (bad sectors)? I know about superblock duplication in every AG. What else? What is XFS behavior in case of the disk errors (panic/no mount/partial data access)? What could be done to restore? If zero bad sector/dump to other device/format/restore will help? Thanks, Semion. 
[[HTML alternate version deleted]] From owner-xfs@oss.sgi.com Thu Apr 5 02:03:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 02:03:40 -0700 (PDT) X-Spam-oss-Status: No, score=0.4 required=5.0 tests=AWL,BAYES_50, J_CHICKENPOX_12,J_CHICKENPOX_43 autolearn=no version=3.2.0-pre1-r499012 Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3593afB021908 for ; Thu, 5 Apr 2007 02:03:38 -0700 Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id 0DF9FC14C; Thu, 5 Apr 2007 11:03:36 +0200 (CEST) Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 20480-03; Thu, 5 Apr 2007 11:03:33 +0200 (CEST) Received: by mail3b1.westend.com (Postfix, from userid 1001) id C3122BED3; Thu, 5 Apr 2007 11:03:33 +0200 (CEST) Date: Thu, 5 Apr 2007 11:03:33 +0200 From: Thomas Kaehn To: Chris Wedgwood Cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS Message-ID: <20070405090333.GE2759@mail3b.westend.com> References: <20070404130535.GE18320@mail3b.westend.com> <20070404154523.GA20096@tuatara.stupidest.org> <20070405072803.GB2759@mail3b.westend.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20070405072803.GB2759@mail3b.westend.com> User-Agent: Mutt/1.5.9i X-archive-position: 11053 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tk@westend.com Precedence: bulk X-list: xfs Content-Length: 1421 Lines: 40 Hi Chris, On Thu, Apr 05, 2007 at 09:28:03AM +0200, Thomas Kaehn wrote: > On Wed, Apr 04, 2007 at 08:45:23AM -0700, Chris Wedgwood wrote: > > On Wed, Apr 04, 2007 at 03:05:35PM +0200, Thomas Kaehn wrote: > > > The XFS was first created using default options and later on with > > > "-d su=64k,sw=2 -l su=64k" 
which improved overall performance > > > but not delete performance. > > > > have you tried w/o using the hw raid? > > I'am going to test with a single disk in the same machine. this is what I got with a single disk (defaults for mkfs.xfs, logbufs=8 for mount) in the same machine: | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done | | real 11m22.487s | user 0m30.278s | sys 2m33.762s | # time \rm -rf y | | real 8m20.963s | user 0m0.056s | sys 0m7.968s So there is no improvement for a single disk. Ciao, Thomas -- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Thu Apr 5 03:21:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 03:21:49 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l35ALifB006633 for ; Thu, 5 Apr 2007 03:21:45 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id BB6C8F063B21; Thu, 5 Apr 2007 06:21:43 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id B3139717672A; Thu, 5 Apr 2007 06:21:43 -0400 (EDT) Date: Thu, 5 Apr 2007 06:21:43 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: Chris Wedgwood , xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: <20070405090333.GE2759@mail3b.westend.com> Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404154523.GA20096@tuatara.stupidest.org> <20070405072803.GB2759@mail3b.westend.com> <20070405090333.GE2759@mail3b.westend.com> MIME-Version: 
1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-1411685764-1175768503=:17700" X-archive-position: 11054 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463747160-1411685764-1175768503=:17700 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Thu, 5 Apr 2007, Thomas Kaehn wrote: > Hi Chris, > > On Thu, Apr 05, 2007 at 09:28:03AM +0200, Thomas Kaehn wrote: >> On Wed, Apr 04, 2007 at 08:45:23AM -0700, Chris Wedgwood wrote: >>> On Wed, Apr 04, 2007 at 03:05:35PM +0200, Thomas Kaehn wrote: >>>> The XFS was first created using default options and later on with >>>> "-d su=64k,sw=2 -l su=64k" which improved overall performance >>>> but not delete performance. >>> >>> have you tried w/o using the hw raid? >> >> I'am going to test with a single disk in the same machine. > > this is what I got with a single disk (defaults for mkfs.xfs, logbufs=8 for > mount) in the same machine: > > | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done > | > | real 11m22.487s > | user 0m30.278s > | sys 2m33.762s > | # time \rm -rf y > | > | real 8m20.963s > | user 0m0.056s > | sys 0m7.968s > > So there is no improvement for a single disk.
> > Ciao, > Thomas > -- > Thomas Kähn WESTEND GmbH | Internet-Business-Provider > Technik CISCO Systems Partner - Authorized Reseller > Im Süsterfeld 6 Tel 0241/701333-18 > tk@westend.com D-52072 Aachen Fax 0241/911879 > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 > Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb > > What kind of disks are you using? Maybe just slow disks?? ---1463747160-1411685764-1175768503=:17700-- From owner-xfs@oss.sgi.com Thu Apr 5 03:50:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 03:50:17 -0700 (PDT) Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l35AoCfB013090 for ; Thu, 5 Apr 2007 03:50:13 -0700 Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id 92453C03B; Thu, 5 Apr 2007 12:50:08 +0200 (CEST) Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 17632-05; Thu, 5 Apr 2007 12:50:06 +0200 (CEST) Received: by mail3b1.westend.com (Postfix, from userid 1001) id 14D6FBFBF; Thu, 5 Apr 2007 12:50:06 +0200 (CEST) Date: Thu, 5 Apr 2007 12:50:06 +0200 From: Thomas Kaehn To: Justin Piszcz Cc: Chris Wedgwood , xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS Message-ID: <20070405105005.GF2759@mail3b.westend.com> References: <20070404130535.GE18320@mail3b.westend.com> <20070404154523.GA20096@tuatara.stupidest.org> <20070405072803.GB2759@mail3b.westend.com> <20070405090333.GE2759@mail3b.westend.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.9i X-archive-position: 11055 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tk@westend.com Precedence: bulk X-list: xfs Hi Justin, On Thu, Apr 05, 2007 at 06:21:43AM -0400, Justin Piszcz wrote: > >So there is no improvement for a single disk. > > > What kind of disks are you using? Maybe just slow disks?? I am using the following disks: http://www.westerndigital.com/en/products/products.asp?DriveID=233 When disabling write barriers delete times are OK. I think that the 3ware RAID controller could have a problem with it. I'll try to contact 3ware in order to come to know if this feature is supported or not. Additionally I am going to try out some advices presented in the 3ware knowledge base. Ciao, Thomas -- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Thu Apr 5 04:11:43 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 04:11:46 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l35BBgfB018902 for ; Thu, 5 Apr 2007 04:11:43 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id DFC69F063B21; Thu, 5 Apr 2007 07:11:41 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id D8A5B717672A; Thu, 5 Apr 2007 07:11:41 -0400 (EDT) Date: Thu, 5 Apr 2007 07:11:41 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: Chris Wedgwood , xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: <20070405105005.GF2759@mail3b.westend.com> Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404154523.GA20096@tuatara.stupidest.org> 
<20070405072803.GB2759@mail3b.westend.com> <20070405090333.GE2759@mail3b.westend.com> <20070405105005.GF2759@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-1637025469-1175771501=:17700" X-archive-position: 11056 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463747160-1637025469-1175771501=:17700 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Thu, 5 Apr 2007, Thomas Kaehn wrote: > Hi Justin, > > On Thu, Apr 05, 2007 at 06:21:43AM -0400, Justin Piszcz wrote: >>> So there is no improvement for a single disk. >>> >> What kind of disks are you using? Maybe just slow disks?? > > I am using the following disks: > > http://www.westerndigital.com/en/products/products.asp?DriveID=233 > > When disabling write barriers delete times are OK. I think that > the 3ware RAID controller could have a problem with it. I'll try to > contact 3ware in order to come to know if this feature is supported or > not. > > Additionally I am going to try out some advices presented in the > 3ware knowledge base. > > Ciao, > Thomas > -- > Thomas Kähn WESTEND GmbH | Internet-Business-Provider > Technik CISCO Systems Partner - Authorized Reseller > Im Süsterfeld 6 Tel 0241/701333-18 > tk@westend.com D-52072 Aachen Fax 0241/911879 > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 > Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb > > Ah, ok-- Keep us updated/let us know if you get any new findings/etc.
Something else you can try as well is turning off NCQ, that gave me a 10-35% performance boost depending on the benchmark. Justin. ---1463747160-1637025469-1175771501=:17700-- From owner-xfs@oss.sgi.com Thu Apr 5 04:44:15 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 04:44:19 -0700 (PDT) Received: from e6.ny.us.ibm.com (e6.ny.us.ibm.com [32.97.182.146]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l35BiDfB031345 for ; Thu, 5 Apr 2007 04:44:15 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e6.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l35Biv3S002799 for ; Thu, 5 Apr 2007 07:44:57 -0400 Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l35BiC7l280220 for ; Thu, 5 Apr 2007 07:44:12 -0400 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l35BiCDL001445 for ; Thu, 5 Apr 2007 07:44:12 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av02.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l35BiBr1001354; Thu, 5 Apr 2007 07:44:11 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 5A61929ECD4; Thu, 5 Apr 2007 17:14:17 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l35BiG93015298; Thu, 5 Apr 2007 17:14:16 +0530 Date: Thu, 5 Apr 2007 17:14:16 +0530 From: "Amit K.
Arora" To: Jakub Jelinek , Andrew Morton , torvalds@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: Interface for the new fallocate() system call Message-ID: <20070405114416.GB19982@amitarora.in.ibm.com> References: <20070225022326.137b4875.akpm@linux-foundation.org> <20070301183445.GA7911@amitarora.in.ibm.com> <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070405112619.GA19982@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070405112619.GA19982@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11057 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Thu, Apr 05, 2007 at 04:56:19PM +0530, Amit K. 
Arora wrote: Correction below: > asmlinkage long sys_s390_fallocate(int fd, loff_t offset, loff_t len, int mode) > { > return sys_fallocate(fd, offset, len, mode); return sys_fallocate(fd, mode, offset, len); > } -- Regards, Amit Arora From owner-xfs@oss.sgi.com Thu Apr 5 04:45:19 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 04:45:23 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l35BjHfB031746 for ; Thu, 5 Apr 2007 04:45:19 -0700 Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.153]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l35BQIr0010240 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Thu, 5 Apr 2007 07:26:19 -0400 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e35.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l35BQG2R000962 for ; Thu, 5 Apr 2007 07:26:16 -0400 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l35BQGB7176036 for ; Thu, 5 Apr 2007 05:26:16 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l35BQFCw023992 for ; Thu, 5 Apr 2007 05:26:16 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l35BQFnW023961; Thu, 5 Apr 2007 05:26:15 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id AB73B29ECD4; Thu, 5 Apr 2007 16:56:20 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l35BQJem007983; Thu, 5 Apr 2007 16:56:19 +0530 Date: Thu, 5 Apr 2007 16:56:19 +0530 From: "Amit K. 
Arora" To: Jakub Jelinek , Andrew Morton , torvalds@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: Interface for the new fallocate() system call Message-ID: <20070405112619.GA19982@amitarora.in.ibm.com> References: <20070117094658.GA17390@amitarora.in.ibm.com> <20070225022326.137b4875.akpm@linux-foundation.org> <20070301183445.GA7911@amitarora.in.ibm.com> <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070330071417.GI355@devserv.devel.redhat.com> User-Agent: Mutt/1.4.1i X-archive-position: 11058 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote: > Wouldn't > int fallocate(loff_t offset, loff_t len, int fd, int mode) > work on both s390 and ppc/arm? glibc will certainly wrap it and > reorder the arguments as needed, so there is no need to keep fd first. This should work on all the platforms. The only concern I can think of here is the convention being followed till now, where all the entities on which the action has to be performed by the kernel (say fd, file/device name, pid etc.) is the first argument of the system call. If we can live with the small exception here, fine. Or else, we may have to implement the int fd, int mode, loff_t offset, loff_t len as the layout of arguments here. 
I think only s390 will have a problem with this, and we can think of a workaround for it (may be similar to what ARM did to implement sync_file_range() system call) : asmlinkage long sys_s390_fallocate(int fd, loff_t offset, loff_t len, int mode) { return sys_fallocate(fd, offset, len, mode); } To me both the approaches look slightly unconventional. But, we need to compromise somewhere to make things work on all the platforms. Any thoughts on which one of the above should we finalize on ? Thanks! -- Regards, Amit Arora From owner-xfs@oss.sgi.com Thu Apr 5 08:29:24 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 08:29:30 -0700 (PDT) Received: from smtp101.sbc.mail.mud.yahoo.com (smtp101.sbc.mail.mud.yahoo.com [68.142.198.200]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l35FTKfB002141 for ; Thu, 5 Apr 2007 08:29:21 -0700 Received: (qmail 75615 invoked from network); 5 Apr 2007 15:29:19 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp101.sbc.mail.mud.yahoo.com with SMTP; 5 Apr 2007 15:29:18 -0000 X-YMail-OSG: 8JPF9LEVM1lVqLlXkX11bjo82.dG9Z1QiU9IWNNuDIvd.SD3BTlvPb5emYgszcu26.YWreJrHQ-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 43A481826127; Thu, 5 Apr 2007 08:29:17 -0700 (PDT) Date: Thu, 5 Apr 2007 08:29:17 -0700 From: Chris Wedgwood To: Thomas Kaehn Cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS Message-ID: <20070405152917.GB23893@tuatara.stupidest.org> References: <20070404130535.GE18320@mail3b.westend.com> <20070404154523.GA20096@tuatara.stupidest.org> <20070405072803.GB2759@mail3b.westend.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070405072803.GB2759@mail3b.westend.com> X-archive-position: 11060 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs 
Content-Length: 770 Lines: 18 On Thu, Apr 05, 2007 at 09:28:03AM +0200, Thomas Kaehn wrote: > The Dell system has got a battery-backed write-cache. The 3ware > system has no battery unit. However it's supposed to provide write > cache, too. That sounds like the main reason for the difference. The Dell's raid system can safely buffer outstanding writes and flush them, the 3ware can't so it stalls waiting for the disks to catch up. You could run blktrace and watch what's going on in both cases to verify this. The numbers do seem a little low for a raid array all the same, I'd be tempted to just use the 3ware as a JBOD and use sw, but I'm arguably biased, I've had so many reliability and performance problems with hw raid over the years I will almost always use sw raid given the choice. From owner-xfs@oss.sgi.com Thu Apr 5 09:06:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 09:06:40 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l35G6WfB013207 for ; Thu, 5 Apr 2007 09:06:33 -0700 Received: from [192.168.1.103] (c-76-17-197-128.hsd1.mn.comcast.net [76.17.197.128]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id 478131807E361; Thu, 5 Apr 2007 11:06:31 -0500 (CDT) Message-ID: <46151E86.2080704@sandeen.net> Date: Thu, 05 Apr 2007 11:06:30 -0500 From: Eric Sandeen User-Agent: Thunderbird 1.5.0.10 (Macintosh/20070221) MIME-Version: 1.0 To: "Zak, Semion" CC: xfs@oss.sgi.com Subject: Re: XFS Resiliency to the disk errors.
References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11061 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Content-Length: 1285 Lines: 39 Zak, Semion wrote: > Hi, > > We are studying possibility to use XFS with cheap (not too reliable) > discs, so we have some questions: > > What in XFS is done to survive the disk errors (bad sectors)? > I know about superblock duplication in every AG. What else? > > What is XFS behavior in case of the disk errors (panic/no mount/partial > data access)? generally metadata IO errors or bad magic found in metadata will shut down the filesystem gracefully if it can. IO errors on data will just be IO errors. > What could be done to restore? xfsdump/xfsrestore I suppose > If zero bad sector/dump to other device/format/restore will help? Well, you can't make data out of nothing. you could dd off the junk drive, zeroing out unreadable sectors, point xfs_repair at it and hope for the best. Which, depending on the problem, could wind up not being very good. If you want to know how to recover from disaster, it sounds like perhaps your data is important enough that you should not plan for failure, but rather find a way to avoid it? Seems to me the only way I'd want to put drives which are expected to fail regularly into a product is if the recovery method of "replace the disk and re-image the appliance" was acceptable, but that's just me. 
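A rough sketch of the "dd off the junk drive" step, in case it helps. The function name rescue_image and the 4 KiB block size are illustrative choices, not something from this thread. conv=noerror,sync are standard dd options: noerror keeps copying after a read error, and sync pads each short or failed block with zeros so the rest of the image stays at the right offsets. Note that sync also pads the last block, so the image can come out slightly larger than the source.

```shell
#!/bin/sh
# rescue_image SRC DST: copy SRC (a device or file) to DST block by block,
# replacing unreadable blocks with zeros instead of aborting.
# Hypothetical helper for illustration only.
rescue_image() {
    src=$1
    dst=$2
    # noerror: continue past read errors
    # sync:    pad every short/failed input block to the full block size
    dd if="$src" of="$dst" bs=4096 conv=noerror,sync 2>/dev/null
}

# Afterwards (not run here, needs xfsprogs), per the xfs_repair man page:
#   xfs_repair -n -f DST    # -n: check only, -f: DST is a regular file
#   xfs_repair -f DST       # actual repair attempt on the image
```

Working on the copy rather than the failing drive means a bad xfs_repair outcome does not make things worse on the original.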
:) -Eric From owner-xfs@oss.sgi.com Thu Apr 5 09:23:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 09:23:28 -0700 (PDT) Received: from mail.linbit.com (nudl.linbit.com [212.69.162.21]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l35GNJfB017294 for ; Thu, 5 Apr 2007 09:23:21 -0700 Received: by mail.linbit.com (LINBIT Mail Daemon, from userid 1030) id 8E70C2E114F9; Thu, 5 Apr 2007 18:23:17 +0200 (CEST) Date: Thu, 5 Apr 2007 18:22:35 +0200 X-OfflineIMAP-x597882765-6c61727340696d61702e6c696e626974-494e424f582e4f7574626f78: 1175790198-0130267256784-v4.0.11 From: Lars Ellenberg To: xfs@oss.sgi.com Subject: Re: xfs_repair leaves empty but undeletable dirs in lost+found Message-ID: <20070405162235.GA816@barkeeper1.linbit> Mail-Followup-To: Lars Ellenberg , xfs@oss.sgi.com References: <20070404203601.GA11771@barkeeper1.linbit> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070404203601.GA11771@barkeeper1.linbit> User-Agent: Mutt/1.5.11 X-archive-position: 11062 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lars.ellenberg@linbit.com Precedence: bulk X-list: xfs Content-Length: 1344 Lines: 39 On Wed, Apr 04, 2007 at 10:36:01PM +0200, Lars Ellenberg wrote: > NOTE that I used the default sarge xfsprogs version 2.6.20, not > the upstream 2.8.20. yet. I'll start an xfs_repair run with > 2.8.20 right after this post, though... done. now, this used seriously more memory, and cpu, and the box went thrashing. after some experimenting, xfs_repair -o bhash=512 got it going without using excessive amounts of swap, so it finally finished after about 12 hours (2.6.20 needed 8:30, repeatable). it did not change the situation, however. I know I could clean these using xfs_db and an additional run of xfs_repair, but I'm going to keep these around for some more time, in case you want me to have a look at some internals still. 
file system itself has gone live again, I hope it does not hurt having those strange directories around. maybe it is even "just" a problem on the kernel side, not being able to convert to the expected "form" of directory? sorry, I'm not too deep in the xfs internals, so I need some input from the developers here... Thanks, -- : Lars Ellenberg Tel +43-1-8178292-0 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com : __ please use the "List-Reply" function of your email client. From owner-xfs@oss.sgi.com Thu Apr 5 10:27:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 10:27:29 -0700 (PDT) Received: from rgminet02.oracle.com (rgminet02.oracle.com [148.87.113.119]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l35HRKfB000743 for ; Thu, 5 Apr 2007 10:27:22 -0700 Received: from rgminet01.oracle.com (rgminet01.oracle.com [148.87.113.118]) by rgminet02.oracle.com (Switch-3.2.4/Switch-3.1.7) with ESMTP id l35FoFgj014786 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO) for ; Thu, 5 Apr 2007 09:50:16 -0600 Received: from rgmgw3.us.oracle.com (rgmgw3.us.oracle.com [138.1.186.112]) by rgminet01.oracle.com (Switch-3.2.4/Switch-3.1.6) with ESMTP id l35FnVjP026456; Thu, 5 Apr 2007 09:49:31 -0600 Received: from acsmt350.oracle.com (acsmt350.oracle.com [141.146.40.150]) by rgmgw3.us.oracle.com (Switch-3.2.4/Switch-3.1.7) with ESMTP id l35FQpgT010519; Thu, 5 Apr 2007 09:49:29 -0600 Received: from pool-71-245-96-31.nycmny.fios.verizon.net by rcsmt252.oracle.com with ESMTP id 2592044011175788085; Thu, 05 Apr 2007 09:48:05 -0600 Date: Thu, 5 Apr 2007 08:50:16 -0700 From: Randy Dunlap To: "Amit K.
Arora" Cc: Jakub Jelinek , Andrew Morton , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: Interface for the new fallocate() system call Message-Id: <20070405085016.a513526b.randy.dunlap@oracle.com> In-Reply-To: <20070405112619.GA19982@amitarora.in.ibm.com> References: <20070117094658.GA17390@amitarora.in.ibm.com> <20070225022326.137b4875.akpm@linux-foundation.org> <20070301183445.GA7911@amitarora.in.ibm.com> <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070405112619.GA19982@amitarora.in.ibm.com> Organization: Oracle Linux Eng. X-Mailer: Sylpheed 2.3.1 (GTK+ 2.8.10; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Whitelist: TRUE X-Whitelist: TRUE X-Whitelist: TRUE X-Brightmail-Tracker: AAAAAQAAAAI= X-Brightmail-Tracker: AAAAAQAAAAI= X-archive-position: 11063 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: randy.dunlap@oracle.com Precedence: bulk X-list: xfs Content-Length: 1611 Lines: 44 On Thu, 5 Apr 2007 16:56:19 +0530 Amit K. Arora wrote: > On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote: > > Wouldn't > > int fallocate(loff_t offset, loff_t len, int fd, int mode) > > work on both s390 and ppc/arm? glibc will certainly wrap it and > > reorder the arguments as needed, so there is no need to keep fd first. > > This should work on all the platforms. 
The only concern I can think of > here is the convention being followed till now, where all the entities on > which the action has to be performed by the kernel (say fd, file/device > name, pid etc.) is the first argument of the system call. If we can live > with the small exception here, fine. > > Or else, we may have to implement the > > int fd, int mode, loff_t offset, loff_t len > > as the layout of arguments here. I think only s390 will have a problem > with this, and we can think of a workaround for it (may be similar to > what ARM did to implement sync_file_range() system call) : > > asmlinkage long sys_s390_fallocate(int fd, loff_t offset, loff_t len, int mode) > { > return sys_fallocate(fd, offset, len, mode); > } > > > To me both the approaches look slightly unconventional. But, we need to > compromise somewhere to make things work on all the platforms. > > Any thoughts on which one of the above should we finalize on ? > > Thanks! If s390 can work around the calling order that easily, I certainly prefer the more conventional ordering of: > int fd, int mode, loff_t offset, loff_t len --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** From owner-xfs@oss.sgi.com Fri Apr 6 03:19:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 06 Apr 2007 03:20:00 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l36AJsfB023236 for ; Fri, 6 Apr 2007 03:19:55 -0700 Received: from localhost.adilger.int (S01060004e23cfc51.cg.shawcable.net [68.147.252.160]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 54FE07BA30E; Fri, 6 Apr 2007 03:58:22 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id D2ECB407F; Fri, 6 Apr 2007 03:58:20 -0600 (MDT) Date: Fri, 6 Apr 2007 03:58:20 -0600 From: Andreas Dilger To: "Amit K. 
Arora" Cc: Jakub Jelinek , Andrew Morton , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: Interface for the new fallocate() system call Message-ID: <20070406095820.GF5967@schatzie.adilger.int> Mail-Followup-To: "Amit K. Arora" , Jakub Jelinek , Andrew Morton , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070225022326.137b4875.akpm@linux-foundation.org> <20070301183445.GA7911@amitarora.in.ibm.com> <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070405112619.GA19982@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070405112619.GA19982@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11066 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs Content-Length: 1005 Lines: 28 On Apr 05, 2007 16:56 +0530, Amit K. Arora wrote: > This should work on all the platforms. The only concern I can think of > here is the convention being followed till now, where all the entities on > which the action has to be performed by the kernel (say fd, file/device > name, pid etc.) is the first argument of the system call. If we can live > with the small exception here, fine. 
Yes, it is much cleaner to have fd first, like every other such syscall. > Or else, we may have to implement the > > int fd, int mode, loff_t offset, loff_t len > > as the layout of arguments here. I think only s390 will have a problem > with this, and we can think of a workaround for it (may be similar to > what ARM did to implement sync_file_range() system call) : > > asmlinkage long sys_s390_fallocate(int fd, loff_t offset, loff_t len, int mode) > { > return sys_fallocate(fd, offset, len, mode); > } Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Fri Apr 6 11:57:30 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 06 Apr 2007 11:57:38 -0700 (PDT) Received: from ty.sabi.co.UK (82-69-39-138.dsl.in-addr.zen.co.uk [82.69.39.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l36IvRfB017453 for ; Fri, 6 Apr 2007 11:57:29 -0700 Received: from from [127.0.0.1] (helo=base.ty.sabi.co.UK) by ty.sabi.co.UK with esmtp(Exim 4.62 #1) id 1HZtVR-0004j5-QX for ; Fri, 06 Apr 2007 19:49:54 +0100 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17942.38470.367699.402354@base.ty.sabi.co.UK> Date: Fri, 6 Apr 2007 19:49:42 +0100 X-Face: SMJE]JPYVBO-9UR%/8d'mG.F!@.,l@c[f'[%S8'BZIcbQc3/">GrXDwb#;fTRGNmHr^JFb SAptvwWc,0+z+~p~"Gdr4H$(|N(yF(wwCM2bW0~U?HPEE^fkPGx^u[*[yV.gyB!hDOli}EF[\cW*S H&spRGFL}{`bj1TaD^l/"[ msn( /TH#THs{Hpj>)]f> Subject: Re: XFS Resiliency to the disk errors. 
In-Reply-To: References: X-Mailer: VM 7.17 under 21.4 (patch 20) XEmacs Lucid From: pg_xfs@xfs.for.sabi.co.UK (Peter Grandi) X-Disclaimer: This message contains only personal opinions X-archive-position: 11067 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: pg_xfs@xfs.for.sabi.co.UK Precedence: bulk X-list: xfs Content-Length: 1251 Lines: 33 >>> On Thu, 5 Apr 2007 11:08:07 +0300, "Zak, Semion" >>> said: SZak> Hi, We are studying possibility to use XFS with cheap (not SZak> too reliable) discs, so we have some questions: Astute move :-). I hope that you are also thinking of using 16-wide RAID5 too :-). SZak> What in XFS is done to survive the disk errors (bad SZak> sectors)? [ ... ] My impression is that the XFS design is really meant for highly scalable performance on enterprise level hardware, where the block device layer abstracts away all drive error issues, including having UPSes. Sure you can use it otherwise, but it has a very different optimal usage envelope from 'ext3' or ReiserFS/Reiser4 (which have been designed with stronger resiliency and recoverability features, as they are more oriented to desktop and cheap server usage). Anyhow, a highly reliable block device layer can surely be built out of cheap disks, if one does it right, and people like EMC2 do it regularly with their midrange products.
It may be interesting for you to have a look at the disk reliability statistics in some recent papers by some Google and CMU researchers, discussed here: http://swik.net/User:dolander/All+Things+Distributed/On+the+Reliability+of+Hard+Disks/ From owner-xfs@oss.sgi.com Sat Apr 7 06:28:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 07 Apr 2007 06:28:12 -0700 (PDT) Received: from ty.sabi.co.UK (82-69-39-138.dsl.in-addr.zen.co.uk [82.69.39.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l37DS3fB021195 for ; Sat, 7 Apr 2007 06:28:06 -0700 Received: from from [127.0.0.1] (helo=base.ty.sabi.co.UK) by ty.sabi.co.UK with esmtp(Exim 4.62 #1) id 1HZthd-0005LS-2Q for ; Fri, 06 Apr 2007 20:02:29 +0100 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17942.39236.623189.817503@base.ty.sabi.co.UK> Date: Fri, 6 Apr 2007 20:02:28 +0100 X-Face: SMJE]JPYVBO-9UR%/8d'mG.F!@.,l@c[f'[%S8'BZIcbQc3/">GrXDwb#;fTRGNmHr^JFb SAptvwWc,0+z+~p~"Gdr4H$(|N(yF(wwCM2bW0~U?HPEE^fkPGx^u[*[yV.gyB!hDOli}EF[\cW*S H&spRGFL}{`bj1TaD^l/"[ msn( /TH#THs{Hpj>)]f> Subject: Re: Strange delete performance using XFS In-Reply-To: <20070405152917.GB23893@tuatara.stupidest.org> References: <20070404130535.GE18320@mail3b.westend.com> <20070404154523.GA20096@tuatara.stupidest.org> <20070405072803.GB2759@mail3b.westend.com> <20070405152917.GB23893@tuatara.stupidest.org> X-Mailer: VM 7.17 under 21.4 (patch 20) XEmacs Lucid From: pg_xfs@xfs.for.sabi.co.UK (Peter Grandi) X-Disclaimer: This message contains only personal opinions X-archive-position: 11069 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: pg_xfs@xfs.for.sabi.co.UK Precedence: bulk X-list: xfs Content-Length: 1071 Lines: 28 [ ... slowness deleting a lot of inodes ... ] >> The Dell system has got a battery-backed write-cache. The >> 3ware system has no battery unit.
However it's supposed to >> provide write cache, too. Whatever, but you cannot have both metadata consistency and high speed without fully reliable hw... > [ ... ] The Dell's raid system can safely buffer outstanding > writes and flush them, the 3ware can't so it stalls waiting > for the disks to catch up. [ ... ] I'd be tempted to just use > the 3ware as a JBOD and use sw, but I'm arguably biased, I've > had so many reliability and performance problems with hw raid > over the years Uhm, I had a friend that worked for a middling storage system vendor and he was telling me horror stories about bugs and misdesigns in their quite popular RAID products. 3ware seems to me one of the more reliable RAID brands, their cautious approach may be why they are slower above. > I will almost always use sw raid given the choice. That does not buy you a lot over a well-designed RAID host adapter. It is also a lot less convenient. From owner-xfs@oss.sgi.com Sat Apr 7 13:47:42 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 07 Apr 2007 13:47:45 -0700 (PDT) Received: from mail.lichtvoll.de (mondschein.lichtvoll.de [194.150.191.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l37KlefB023317 for ; Sat, 7 Apr 2007 13:47:42 -0700 Received: from localhost (dslb-084-056-094-087.pools.arcor-ip.net [84.56.94.87]) by mail.lichtvoll.de (Postfix) with ESMTP id 58DE55ADBB for ; Sat, 7 Apr 2007 22:47:39 +0200 (CEST) From: Martin Steigerwald To: linux-xfs@oss.sgi.com Subject: Re: XFS Resiliency to the disk errors.
Date: Sat, 7 Apr 2007 22:47:37 +0200 User-Agent: KMail/1.9.6 References: (sfid-20070405_112347_743716_6E82B98E) (sfid-20070405_112347_743716_6E82B98E) (sfid-20070405_112347_743716_6E82B98E) In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200704072247.38143.Martin@lichtvoll.de> X-archive-position: 11070 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: Martin@lichtvoll.de Precedence: bulk X-list: xfs On Thursday 05 April 2007, Zak, Semion wrote: > Hi, > > We are studying possibility to use XFS with cheap (not too reliable) > discs, so we have some questions: Hi Semion! I recommend at least monitoring the health status of the drives using smartmontools - with regular short and long self-tests - or a similar mechanism. So you *may* at least be warned *before* a disk fails. Otherwise I would go for a redundant RAID array, so that at least one drive in a bunch of drives can fail without data loss. Regards, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 From owner-xfs@oss.sgi.com Mon Apr 9 08:36:28 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 09 Apr 2007 08:36:35 -0700 (PDT) Received: from ozlabs.org (ozlabs.org [203.10.76.45]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l39FaQfB032204 for ; Mon, 9 Apr 2007 08:36:28 -0700 Received: by ozlabs.org (Postfix, from userid 1003) id D13CBDDFDB; Tue, 10 Apr 2007 01:36:11 +1000 (EST) MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Message-ID: <17946.14646.808334.441833@cargo.ozlabs.ibm.com> Date: Mon, 9 Apr 2007 23:01:42 +1000 From: Paul Mackerras To: =?utf-8?B?SsO2cm4=?= Engel Cc: Heiko Carstens , Andrew Morton , "Amit K.
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: Interface for the new fallocate() system call In-Reply-To: <20070330104449.GA9371@lazybastard.org> References: <20070225022326.137b4875.akpm@linux-foundation.org> <20070301183445.GA7911@amitarora.in.ibm.com> <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071929.GC8365@osiris.boeblingen.de.ibm.com> <17932.54606.323431.491736@cargo.ozlabs.ibm.com> <20070330104449.GA9371@lazybastard.org> X-Mailer: VM 7.19 under Emacs 21.4.1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l39FaSfB032208 X-archive-position: 11071 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: paulus@samba.org Precedence: bulk X-list: xfs Jörn Engel writes: > Wouldn't that work be confined to fallocate()? If I understand Heiko > correctly, the alternative would slow s390 down for every syscall, > including more performance-critical ones. The alternative that Jakub suggested wouldn't slow s390 down. Paul. 
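A note on the argument-order question discussed in this thread: the following is a minimal C sketch, not taken from any posted patch, of the idea that the kernel entry point could take the two 64-bit values first (the ordering Jakub proposed) while a libc-style wrapper still presents the conventional fd-first signature to callers. All `demo_` names are hypothetical stand-ins invented for illustration, and no real system call is made.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for loff_t so the sketch builds anywhere. */
typedef int64_t demo_loff_t;

/* Records what the pretend kernel entry point received. */
struct demo_call {
    demo_loff_t offset;
    demo_loff_t len;
    int fd;
    int mode;
};

struct demo_call demo_last_call;

/*
 * Pretend kernel entry point (assumption: not the real sys_fallocate).
 * It uses the ordering proposed in the thread: both 64-bit values
 * first, then the two ints.
 */
long demo_sys_fallocate(demo_loff_t offset, demo_loff_t len,
                        int fd, int mode)
{
    if (fd < 0 || offset < 0 || len <= 0)
        return -1;              /* reject obviously bad arguments */

    demo_last_call.offset = offset;
    demo_last_call.len = len;
    demo_last_call.fd = fd;
    demo_last_call.mode = mode;
    return 0;
}

/*
 * libc-style wrapper: callers see the conventional fd-first order;
 * the wrapper does nothing except reorder the arguments.
 */
long demo_fallocate(int fd, int mode, demo_loff_t offset,
                    demo_loff_t len)
{
    return demo_sys_fallocate(offset, len, fd, mode);
}
```

A caller would write something like `demo_fallocate(fd, 0, 0, 1 << 20)` and never see the kernel-side ordering, which is why the reordering costs nothing outside this one call.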
From owner-xfs@oss.sgi.com Mon Apr 9 09:39:01 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 09 Apr 2007 09:39:08 -0700 (PDT) Received: from longford.lazybastard.org (lazybastard.de [212.112.238.170]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l39Gd0fB011359 for ; Mon, 9 Apr 2007 09:39:00 -0700 Received: from joern by longford.lazybastard.org with local (Exim 4.50) id 1HawpE-0006QX-5Y; Mon, 09 Apr 2007 18:34:40 +0200 Date: Mon, 9 Apr 2007 18:34:37 +0200 From: =?utf-8?B?SsO2cm4=?= Engel To: Paul Mackerras Cc: Heiko Carstens , Andrew Morton , "Amit K. Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: Interface for the new fallocate() system call Message-ID: <20070409163436.GA24012@lazybastard.org> References: <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071929.GC8365@osiris.boeblingen.de.ibm.com> <17932.54606.323431.491736@cargo.ozlabs.ibm.com> <20070330104449.GA9371@lazybastard.org> <17946.14646.808334.441833@cargo.ozlabs.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <17946.14646.808334.441833@cargo.ozlabs.ibm.com> User-Agent: Mutt/1.5.9i X-archive-position: 11072 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: joern@lazybastard.org Precedence: bulk X-list: xfs On Mon, 9 April 2007 23:01:42 +1000, Paul Mackerras wrote: > Jörn Engel writes: > > > Wouldn't that work be confined to fallocate()? 
If I understand Heiko > > correctly, the alternative would slow s390 down for every syscall, > > including more performance-critical ones. > > The alternative that Jakub suggested wouldn't slow s390 down. True. And it appears to be one of the least offensive options we have. Jörn -- My second remark is that our intellectual powers are rather geared to master static relations and that our powers to visualize processes evolving in time are relatively poorly developed. -- Edsger W. Dijkstra From owner-xfs@oss.sgi.com Mon Apr 9 18:51:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 09 Apr 2007 18:51:18 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3A1pBfB026573 for ; Mon, 9 Apr 2007 18:51:13 -0700 Received: from pcbnaujok (pc-bnaujok.melbourne.sgi.com [134.14.55.58]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA27963; Tue, 10 Apr 2007 11:51:05 +1000 Message-Id: <200704100151.LAA27963@larry.melbourne.sgi.com> From: "Barry Naujok" To: "'Lars Ellenberg'" , Subject: RE: xfs_repair leaves empty but undeletable dirs in lost+found Date: Tue, 10 Apr 2007 11:55:14 +1000 MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook, Build 11.0.6353 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028 In-Reply-To: <20070405162235.GA816@barkeeper1.linbit> Thread-Index: Acd3nt9nMNud96faT2y1vCJJ8ylFNADdDMXw X-archive-position: 11073 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@melbourne.sgi.com Precedence: bulk X-list: xfs Hi Lars, Would it be possible for you apply the patch I posted to xfs@oss in Feb http://oss.sgi.com/archives/xfs/2007-02/msg00072.html to the latest xfsprogs source, make and install it and run: # xfs_metadump /dev/md1 - | bzip2 > /tmp/bad_xfs.bz2 And make the 
image available for me to download and analyse? Regards, Barry. > -----Original Message----- > From: xfs-bounce@oss.sgi.com [mailto:xfs-bounce@oss.sgi.com] > On Behalf Of Lars Ellenberg > Sent: Friday, 6 April 2007 2:23 AM > To: xfs@oss.sgi.com > Subject: Re: xfs_repair leaves empty but undeletable dirs in > lost+found > > On Wed, Apr 04, 2007 at 10:36:01PM +0200, Lars Ellenberg wrote: > > NOTE that I used the default sarge xfsprogs version 2.6.20, not > > the upstream 2.8.20. yet. I'll start an xfs_repair run with > > 2.8.20 right after this post, though... > > done. > now, this used seriously more memory, and cpu, > and the box went thrashing. > after some experimenting, > > xfs_repair -o bhash=512 > > got it going without using excessive amounts of swap, > so it finally finished after about 12 hours > (2.6.20 needed 8:30, repeatable). > > it did not change the situation, however. > > I know I could clean these using xfs_db and an additional run of > xfs_repair, but I'm going to keep these around for some more time, in > case you want me to have a look at some internals still. > > file system itself has gone life again, I hope it does not hurt having > those strange directories around. > > maybe it is even "just" a problem on the kernel side, > not being able to convert so the expected "form" of directory? > sorry, I'm not too deep in the xfs internals, so I need some > input from > the developers here... > > Thanks, > > -- > : Lars Ellenberg Tel +43-1-8178292-0 : > : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : > : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com : > __ > please use the "List-Reply" function of your email client. 
> > From owner-xfs@oss.sgi.com Mon Apr 9 23:49:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 09 Apr 2007 23:49:17 -0700 (PDT) Received: from ilsmtp.nds.com ([192.118.32.12]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3A6n8fB016419 for ; Mon, 9 Apr 2007 23:49:14 -0700 X-MimeOLE: Produced By Microsoft Exchange V6.5 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Subject: RE: XFS Resiliency to the disk errors. Date: Tue, 10 Apr 2007 09:49:06 +0300 Message-ID: In-Reply-To: <46151E86.2080704@sandeen.net> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: XFS Resiliency to the disk errors. Thread-Index: Acd3nHBtIHJQ+P54QPiwtt50fU4c0ADnmBnA From: "Zak, Semion" To: "Eric Sandeen" Cc: Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l3A6nEfB016437 X-archive-position: 11074 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: SZak@nds.com Precedence: bulk X-list: xfs Thank you very much. I have another question, about data loss on a crash or power cut. Is it possible to keep it no worse than in other file systems by opening the important files with the O_SYNC flag, or by using the fsync and sync functions? Thanks, Semion. -----Original Message----- From: Eric Sandeen [mailto:sandeen@sandeen.net] Sent: Thursday, April 05, 2007 7:07 PM To: Zak, Semion Cc: xfs@oss.sgi.com Subject: Re: XFS Resiliency to the disk errors. Zak, Semion wrote: > Hi, > > We are studying possibility to use XFS with cheap (not too reliable) > discs, so we have some questions: > > What in XFS is done to survive the disk errors (bad sectors)? > I know about superblock duplication in every AG. What else? > > What is XFS behavior in case of the disk errors (panic/no > mount/partial data access)? generally metadata IO errors or bad magic found in metadata will shut down the filesystem gracefully if it can.
IO errors on data will just be IO errors. > What could be done to restore? xfsdump/xfsrestore I suppose > If zero bad sector/dump to other device/format/restore will help? Well, you can't make data out of nothing. you could dd off the junk drive, zeroing out unreadable sectors, point xfs_repair at it and hope for the best. Which, depending on the problem, could wind up not being very good. If you want to know how to recover from disaster, it sounds like perhaps your data is important enough that you should not plan for failure, but rather find a way to avoid it? Seems to me the only way I'd want to put drives which are expected to fail regularly into a product is if the recovery method of "replace the disk and re-image the appliance" was acceptable, but that's just me. :) -Eric
From owner-xfs@oss.sgi.com Tue Apr 10 02:26:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 10 Apr 2007 02:26:49 -0700 (PDT) Received: from mail.linbit.com (nudl.linbit.com [212.69.162.21]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3A9QifB014917 for ; Tue, 10 Apr 2007 02:26:45 -0700 Received: by mail.linbit.com (LINBIT Mail Daemon, from userid 1030) id A9D432DF653D; Tue, 10 Apr 2007 11:26:42 +0200 (CEST) Date: Tue, 10 Apr 2007 11:24:43 +0200 X-OfflineIMAP-382579781-6c61727340696d61702e6c696e626974-494e424f582e4f7574626f78: 1176197203-0163058626383-v4.0.11 From: Lars Ellenberg To: Barry Naujok Cc: xfs@oss.sgi.com Subject: Re: xfs_repair leaves empty but undeletable dirs in lost+found Message-ID: <20070410092443.GA8496@barkeeper1.linbit> Mail-Followup-To: Lars Ellenberg , Barry Naujok , xfs@oss.sgi.com References: <20070405162235.GA816@barkeeper1.linbit> <200704100151.LAA27963@larry.melbourne.sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200704100151.LAA27963@larry.melbourne.sgi.com> User-Agent: Mutt/1.5.11 X-archive-position: 11075 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lars.ellenberg@linbit.com Precedence: bulk X-list: xfs On Tue, Apr 10, 2007 at 11:55:14AM +1000, Barry Naujok wrote: > Hi Lars, > > Would it be possible for you apply the patch I posted to xfs@oss > in Feb http://oss.sgi.com/archives/xfs/2007-02/msg00072.html > to the latest xfsprogs source, make and install it and run: > > # xfs_metadump /dev/md1 - | bzip2 > /tmp/bad_xfs.bz2 > > And make the image available for me to download and analyse? uhm. probably. I'll talk with the guy who owns the data :) out of curiosity: what exactly would you do with it?
I mean, would that be sufficient to restore the "badness", with the files all filled with zero, and you'd be able to reproduce locally? -- : Lars Ellenberg Tel +43-1-8178292-0 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com : __ please use the "List-Reply" function of your email client. From owner-xfs@oss.sgi.com Tue Apr 10 07:40:57 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 10 Apr 2007 07:41:05 -0700 (PDT) Received: from mexforward.lss.emc.com (mexforward.lss.emc.com [128.222.32.20]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3AEetfB020875 for ; Tue, 10 Apr 2007 07:40:56 -0700 Received: from mailhub.lss.emc.com (nagas.lss.emc.com [10.254.144.11]) by mexforward.lss.emc.com (Switch-3.2.5/Switch-3.1.7) with ESMTP id l3ADNodH011613; Tue, 10 Apr 2007 09:23:50 -0400 (EDT) Received: from [168.159.36.217] ([168.159.36.217]) by mailhub.lss.emc.com (Switch-3.2.5/Switch-3.1.7) with ESMTP id l3ADNm4A028359; Tue, 10 Apr 2007 09:23:48 -0400 (EDT) Message-ID: <461B8FE3.3010600@emc.com> Date: Tue, 10 Apr 2007 09:23:47 -0400 From: Ric Wheeler Reply-To: ric@emc.com User-Agent: Thunderbird 1.5.0.8 (X11/20061025) MIME-Version: 1.0 To: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, reiserfs-list@namesys.com, ext2-devel@lists.sourceforge.net, linux-ide@vger.kernel.org, ocfs2-devel@oss.oracle.com, linux-scsi@vger.kernel.org Subject: Linux 2007 File System & IO Workshop notes & talks Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-PMX-Version: 4.7.1.128075, Antispam-Engine: 2.5.1.298604, Antispam-Data: 2007.4.10.54233 X-PerlMx-Spam: Gauge=, SPAM=1%, Reasons='EMC_FROM_0+ -3, RDNS_NXDOMAIN 0, RDNS_SUSP 0, RDNS_SUSP_GENERIC 0, __CP_URI_IN_BODY 0, __CT 0, __CTE 0, __CT_TEXT_PLAIN 0, __HAS_MSGID 0, __MIME_TEXT_ONLY 0, __MIME_VERSION 0, __SANE_MSGID 0, __USER_AGENT 0' X-archive-position: 11077 X-ecartis-version: 
Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: ric@emc.com Precedence: bulk X-list: xfs Content-Length: 471 Lines: 20 We have some of the material reviewed and posted now from the IO & FS workshop. USENIX has posted the talks at: http://www.usenix.org/events/lsf07/tech/tech.html A write up of the workshop went out at LWN and invoked a healthy discussion: http://lwn.net/Articles/226351/ At that LWN article, there is a link to the Linux FS wiki with good notes: http://linuxfs.pbwiki.com/LSF07-Workshop-Notes Another summary will go out in the next USENIX ;login edition. ric From owner-xfs@oss.sgi.com Tue Apr 10 14:18:06 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 10 Apr 2007 14:18:09 -0700 (PDT) Received: from mail.lichtvoll.de (mondschein.lichtvoll.de [194.150.191.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3ALI4fB031461 for ; Tue, 10 Apr 2007 14:18:06 -0700 Received: from localhost (dslb-084-056-073-232.pools.arcor-ip.net [84.56.73.232]) by mail.lichtvoll.de (Postfix) with ESMTP id 478185AD2D; Tue, 10 Apr 2007 22:45:02 +0200 (CEST) From: Martin Steigerwald To: Lars Ellenberg , Barry Naujok , xfs@oss.sgi.com Subject: Re: xfs_repair leaves empty but undeletable dirs in lost+found Date: Tue, 10 Apr 2007 22:45:00 +0200 User-Agent: KMail/1.9.6 References: <20070405162235.GA816@barkeeper1.linbit> <200704100151.LAA27963@larry.melbourne.sgi.com> <20070410092443.GA8496@barkeeper1.linbit> (sfid-20070410_134231_652510_049D094B) In-Reply-To: <20070410092443.GA8496@barkeeper1.linbit> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200704102245.00734.Martin@lichtvoll.de> X-archive-position: 11078 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: Martin@lichtvoll.de Precedence: bulk X-list: xfs Am Dienstag 10 April 2007 schrieb Lars 
Ellenberg:
> > Would it be possible for you apply the patch I posted to xfs@oss
> > in Feb http://oss.sgi.com/archives/xfs/2007-02/msg00072.html
> > to the latest xfsprogs source, make and install it and run:
> >
> > # xfs_metadump /dev/md1 - | bzip2 > /tmp/bad_xfs.bz2
> >
> > And make the image available for me to download and analyse?
>
> uhm. probably. I'll talk with the guy who owns the data :)
>
> out of curiosity: what exactly would you do with it?
> I mean, would that be sufficient to restore the "badness",
> with the files all filled with zero,
> and you'd be able to reproduce locally?

Hi Lars!

As far as I understand, a metadata dump does not contain the actual data
in the files. That would be sufficient, as xfs_repair is for repairing
metadata corruption. For analysing the reason why a file is undeletable,
its actual contents should be quite irrelevant. The only thing that could
possibly matter is the amount and location, not the contents, of the
blocks a file occupies. But that doesn't seem to matter here either.

It would contain metadata information on the directory and file names as
well as timestamps, owner and rights - if you are concerned about the
privacy of your customer you may want to try to reproduce the problem
with different metadata information.

Regards, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 From owner-xfs@oss.sgi.com Tue Apr 10 17:11:24 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 10 Apr 2007 17:11:26 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3B0BMfB001815 for ; Tue, 10 Apr 2007 17:11:23 -0700 Received: from pcbnaujok (pc-bnaujok.melbourne.sgi.com [134.14.55.58]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA28739; Wed, 11 Apr 2007 10:11:16 +1000 Message-Id: <200704110011.KAA28739@larry.melbourne.sgi.com> From: "Barry Naujok" To: "'Lars Ellenberg'" Cc: Subject: RE: xfs_repair leaves empty but undeletable dirs in lost+found Date: Wed, 11 Apr 2007 10:16:57 +1000 MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook, Build 11.0.6353 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028 In-Reply-To: <20070410092443.GA8496@barkeeper1.linbit> Thread-Index: Acd7Ul9QlkRwFlL5SyqfAUwRC1EM9QAe85tQ X-archive-position: 11079 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@melbourne.sgi.com Precedence: bulk X-list: xfs Hi Lars, It copies the inodes and directory contents and other metadata. I restore it here and run xfs_repair over it to see how it failed. xfs_repair only operates on metadata and does not check data. Currently, it does not obfuscate file names if there is a privacy/confidentiality concern but that is a feature I intend on adding later. Regards, Barry. 
> -----Original Message----- > From: Lars Ellenberg [mailto:lars.ellenberg@linbit.com] > Sent: Tuesday, 10 April 2007 7:25 PM > To: Barry Naujok > Cc: xfs@oss.sgi.com > Subject: Re: xfs_repair leaves empty but undeletable dirs in > lost+found > > On Tue, Apr 10, 2007 at 11:55:14AM +1000, Barry Naujok wrote: > > Hi Lars, > > > > Would it be possible for you apply the patch I posted to xfs@oss > > in Feb http://oss.sgi.com/archives/xfs/2007-02/msg00072.html > > to the latest xfsprogs source, make and install it and run: > > > > # xfs_metadump /dev/md1 - | bzip2 > /tmp/bad_xfs.bz2 > > > > And make the image available for me to download and analyse? > > uhm. probably. I'll talk with the guy who owns the data :) > > out of curiosity: what exactly would you do with it? > I mean, would that be sufficient to restore the "badness", > with the files all filled with zero, > and you'd be able to reproduce locally? > > -- > : Lars Ellenberg Tel +43-1-8178292-0 : > : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : > : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com : > __ > please use the "List-Reply" function of your email client. 
From owner-xfs@oss.sgi.com Wed Apr 11 00:21:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 11 Apr 2007 00:21:25 -0700 (PDT) Received: from mail.linbit.com (nudl.linbit.com [212.69.162.21]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3B7LKfB021100 for ; Wed, 11 Apr 2007 00:21:22 -0700 Received: by mail.linbit.com (LINBIT Mail Daemon, from userid 1030) id 197612E2716C; Wed, 11 Apr 2007 09:21:19 +0200 (CEST) Date: Tue, 10 Apr 2007 23:05:02 +0200 X-OfflineIMAP-x1476802397-6c61727340696d61702e6c696e626974-494e424f582e4f7574626f78: 1176276079-0187333350209-v4.0.11 From: Lars Ellenberg To: xfs@oss.sgi.com Subject: Re: xfs_repair leaves empty but undeletable dirs in lost+found Message-ID: <20070410210502.GA19842@barkeeper1.linbit> Mail-Followup-To: Lars Ellenberg , xfs@oss.sgi.com References: <20070405162235.GA816@barkeeper1.linbit> <200704100151.LAA27963@larry.melbourne.sgi.com> <20070410092443.GA8496@barkeeper1.linbit> <200704102245.00734.Martin@lichtvoll.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200704102245.00734.Martin@lichtvoll.de> User-Agent: Mutt/1.5.11 X-archive-position: 11080 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lars.ellenberg@linbit.com Precedence: bulk X-list: xfs On Tue, Apr 10, 2007 at 10:45:00PM +0200, Martin Steigerwald wrote: > Am Dienstag 10 April 2007 schrieb Lars Ellenberg: > > > > Would it be possible for you apply the patch I posted to xfs@oss > > > in Feb http://oss.sgi.com/archives/xfs/2007-02/msg00072.html > > > to the latest xfsprogs source, make and install it and run: > > > > > > # xfs_metadump /dev/md1 - | bzip2 > /tmp/bad_xfs.bz2 > > > > > > And make the image available for me to download and analyse? > > > > uhm. probably. I'll talk with the guy who owns the data :) > > > > out of curiosity: what exactly would you do with it? 
> > I mean, would that be sufficient to restore the "badness",
> > with the files all filled with zero,
> > and you'd be able to reproduce locally?
>
> Hi Lars!
>
> As far as I understand a meta data dump does not contain the actual data
> in the files. That would be sufficient als xfs_repair is for repairing
> metadata corruption. For analysing the reason why a file is undeleteable
> its actual contents should be quite irrelevant. Only thing that could
> possibly matter is the amount and location, not the contents of blocks a
> file occupies. But that doesn't seem to matter here either.
>
> It would contain meta data information on the directory and file names as
> well as timestamps, owner and rights - if you are concerned about the
> privacy of your customer you may want to try to reproduce the problem
> with different meta data information.

I'm very well aware of these things.

what I meant to ask was: (how) could I do some "partial" metadump?
because,
 * yes, I am concerned about the privacy
   (not too much, though, but still, as this is not my data, I have to ask)
 * it's probably going to be huge anyway
 * it's going to take some time to produce, and this is a live system
 * it would probably really help for investigating further
   if I were able to reduce the amount of meta data involved
   (remember, one xfs_repair run took me 12 hours)
and, most importantly:
 * reducing the amount of meta data is probably the first step
   Barry would do once he has my full dump, because that's the only way
   to go about debugging this.
   I'd like to help to reduce the work needed to debug this.

so yes, I really would like to try to reproduce with different
meta data information, but I'd need a hint what to look for in my
existing bad data to be able to reproduce similar bad data.

@Barry: I'd probably be able to get a full dump tomorrow.
or even better, a partial dump, if you tell me what you'd need.

-- : Lars Ellenberg Tel +43-1-8178292-0 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com : __ please use the "List-Reply" function of your email client. From owner-xfs@oss.sgi.com Wed Apr 11 00:37:02 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 11 Apr 2007 00:37:04 -0700 (PDT) Received: from mail.gatrixx.com (mail.gatrixx.com [217.111.11.44]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3B7axfB024064 for ; Wed, 11 Apr 2007 00:37:01 -0700 Received: (qmail 8091 invoked by uid 1008); 11 Apr 2007 09:36:53 +0200 Received: from unknown (HELO majestix.gallier.de) (ojoa@gatrixx.com@89.54.93.154) by 0 with AES256-SHA encrypted SMTP; 11 Apr 2007 09:36:53 +0200 Received: from [192.168.10.3] (olli@gutemine.gallier.de [192.168.10.3]) by majestix.gallier.de (8.13.8/8.13.8/Debian-2) with ESMTP id l3B7aqqk007017; Wed, 11 Apr 2007 09:36:52 +0200 Message-ID: <461C9014.6040109@j-o-a.de> Date: Wed, 11 Apr 2007 09:36:52 +0200 From: Oliver Joa User-Agent: Icedove 1.5.0.10 (X11/20070329) MIME-Version: 1.0 To: xfs-oss CC: linux-kernel@vger.kernel.org Subject: Re: Corrupt XFS -Filesystems on new Hardware and Kernel References: <46094344.4090007@j-o-a.de> <20070328113141.GQ32597093@melbourne.sgi.com> In-Reply-To: <20070328113141.GQ32597093@melbourne.sgi.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11081 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: oliver@j-o-a.de Precedence: bulk X-list: xfs Hi, David Chinner wrote: > On Tue, Mar 27, 2007 at 06:16:04PM +0200, Oliver Joa wrote: >> Hi, >> >> since some weeks i try to get my new hardware running: >> >> Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz >> Intel DP965LT Mainboard >> Seagate SATA-Harddisk in AHCI-Mode >> >> After some hours of running or after some heavy file-i/o >> (find / | cpio -padm /test) I 
always get a corrupted
>> XFS-filesystem.

I solved the problem:

I ran memtest and found a lot of memory errors. Then I bought
another brand of memory, and everything is working fine. The first
memory I used was brand new; I bought it together with the board
and processor. It was from Kingston. Now I have one from Crucial,
which seems to work fine.

Thanks to everyone for the help.

Olli

From owner-xfs@oss.sgi.com Wed Apr 11 03:02:32 2007
Received: with ECARTIS (v1.0.0; list xfs); Wed, 11 Apr 2007 03:02:38 -0700 (PDT)
Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3BA2VfB011436 for ; Wed, 11 Apr 2007 03:02:32 -0700
Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id 67D6AC16A; Wed, 11 Apr 2007 11:36:25 +0200 (CEST)
Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 12154-01; Wed, 11 Apr 2007 11:36:22 +0200 (CEST)
Received: by mail3b1.westend.com (Postfix, from userid 1001) id EEC94C007; Wed, 11 Apr 2007 11:36:22 +0200 (CEST)
Date: Wed, 11 Apr 2007 11:36:22 +0200
From: Thomas Kaehn
To: Peter Grandi
Cc: Linux XFS
Subject: Re: Strange delete performance using XFS
Message-ID: <20070411093622.GB28503@mail3b.westend.com>
References: <20070404130535.GE18320@mail3b.westend.com> <20070404154523.GA20096@tuatara.stupidest.org> <20070405072803.GB2759@mail3b.westend.com> <20070405152917.GB23893@tuatara.stupidest.org> <17942.39236.623189.817503@base.ty.sabi.co.UK>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <17942.39236.623189.817503@base.ty.sabi.co.UK>
User-Agent: Mutt/1.5.9i
X-archive-position: 11082
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tk@westend.com
Precedence: bulk
X-list: xfs

Hi Peter,

On Fri, Apr 06, 2007 at
08:02:28PM +0100, Peter Grandi wrote:
> > [ ... ] The Dell's raid system can safely buffer outstanding
> > writes and flsuh them, the 3ware can't so it stalls waiting
> > fot the disks to catch up. [ ... ] I'd be tempted to just use
> > the 3ware as a JBOD and use sw, but I'm arguably biased, I've
> > had so many reliability and performance problems with hw raid
> > over the years
>
> Uhm, I had a friend that worked for a middling storage system
> vendor and he was telling me horror stories about bugs and
> misdesigns in their quite popular RAID products.
>
> 3ware seem to me one of the more reliable RAID brands, their
> cautious approach may be why they are slower above.

You are probably right. 3ware hasn't answered yet. However, I've found
an option in the controller to set the "storsave" policy. In the default
profile, FUA (force unit access) commands are only acknowledged
immediately if a BBU is present. Otherwise the controller waits
until the data is written to disk. When the "performance"
profile is selected, FUA commands are ignored and the delete time drops
to a couple of seconds.

So the behaviour of the controller should be considered a feature.
But I am still astonished at how slow deletes were in the first place.
This might be a bug or incompatibility anyhow.

Thanks to all others for your suggestions. I'll inform you in case
3ware has news for me.

Ciao, Thomas -- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Thu Apr 12 04:05:59 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 12 Apr 2007 04:06:02 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3CB5ufB007456 for ; Thu, 12 Apr 2007 04:05:58 -0700 Received: from localhost.adilger.int (S01060004e23cfc51.cg.shawcable.net [68.147.252.160]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id D15AA7BA307; Thu, 12 Apr 2007 05:05:51 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 1E3F0407F; Thu, 12 Apr 2007 05:05:50 -0600 (MDT) Date: Thu, 12 Apr 2007 05:05:50 -0600 From: Andreas Dilger To: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com Cc: hch@infradead.org Subject: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070412110550.GM5967@schatzie.adilger.int> Mail-Followup-To: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11087 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs Content-Length: 4714 Lines: 117 I'm interested in getting input for 
implementing an ioctl to efficiently
map file extents & holes (FIEMAP) instead of looping over FIBMAP a billion
times.  We already have customers with single files in the 10TB range and
we additionally need to get the mapping over the network so it needs to
be efficient in terms of how data is passed, and how easily it can be
extracted from the filesystem.

I had come up with a plan independently and was also steered toward
XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original
plan, though I think the XFS structs used there are a bit bloated.

There was also recent discussion about SEEK_HOLE and SEEK_DATA as
implemented by Sun, but even if we could skip the holes we still might
need to do millions of FIBMAPs to see how large files are allocated
on disk.  Conversely, having filesystems implement an efficient FIEMAP
ioctl (or ->fiemap() method) could in turn be leveraged for SEEK_HOLE
and SEEK_DATA instead of looping over ->bmap() inside the kernel
as I saw in one patch.


struct fibmap_extent {
    __u64 fe_start;                 /* starting offset in bytes */
    __u64 fe_len;                   /* length in bytes */
};

struct fibmap {
    struct fibmap_extent fm_start;  /* offset, length of desired mapping */
    __u32 fm_extent_count;          /* number of extents in array */
    __u32 fm_flags;                 /* flags (similar to XFS_IOC_GETBMAP) */
    __u64 unused;
    struct fibmap_extent fm_extents[0];
};

#define FIEMAP_LEN_MASK         0xff000000000000
#define FIEMAP_LEN_HOLE         0x01000000000000
#define FIEMAP_LEN_UNWRITTEN    0x02000000000000

All offsets are in bytes to allow cases where filesystems are not doing
block-aligned/sized allocations (e.g. tail packing).  The fm_extents array
returned contains the packed list of allocation extents for the file,
including entries for holes (which have fe_start == 0, and a flag).

The ->fm_extents[] array includes all of the holes in addition to
allocated extents because this avoids the need to return both the logical
and physical address for every extent and does not make processing any
harder.

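As a rough illustration of the packed layout described above (holes are
listed inline, so the logical offset of each entry is implied by the
running sum of the preceding lengths), a lookup could be sketched as
follows. This is only a sketch under assumptions: struct extent and
logical_to_physical() are hypothetical names that are not part of the
proposal, uint64_t stands in for __u64, and the flag bits are assumed to
have been masked out of the length already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, already-decoded view of one entry of fm_extents[]. */
struct extent {
    uint64_t start; /* physical offset in bytes (meaningless for holes) */
    uint64_t len;   /* length in bytes, flag bits already masked off */
    int      hole;  /* nonzero if this entry describes a hole */
};

/*
 * Walk a packed extent list that begins at logical offset map_start.
 * Returns 0 and stores the physical byte offset if 'logical' is mapped,
 * 1 if it falls inside a hole, and -1 if the list does not cover it.
 */
int logical_to_physical(const struct extent *ext, size_t n,
                        uint64_t map_start, uint64_t logical,
                        uint64_t *physical)
{
    uint64_t pos = map_start; /* logical offset where ext[i] begins */
    size_t i;

    assert(ext != NULL || n == 0);
    if (logical < map_start)
        return -1;

    for (i = 0; i < n; i++) {
        /* Invariant: logical >= pos, so the subtraction cannot wrap. */
        if (logical - pos < ext[i].len) {
            if (ext[i].hole)
                return 1;
            *physical = ext[i].start + (logical - pos);
            return 0;
        }
        pos += ext[i].len;
    }
    return -1;
}
```

Because every hole occupies a slot in the array, no separate logical
address has to be carried per extent; the cost is one addition per entry
while walking the list.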
One feature that XFS_IOC_GETBMAPX has that may be desirable is the
ability to return unwritten extent information.  In order to do this XFS
required expanding the per-extent struct from 32 to 48 bytes per extent,
but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship)
and keep 8 bits or so for input/output flags per extent (would need to
be masked before use).


Caller works something like:

    char buf[4096];
    struct fibmap *fm = (struct fibmap *)buf;
    int count = (sizeof(buf) - sizeof(*fm)) / sizeof(struct fibmap_extent);
    int fd, rc, i;

    fm->fm_start.fe_start = 0;      /* start of file */
    fm->fm_start.fe_len = -1;       /* end of file */
    fm->fm_extent_count = count;    /* max extents in fm_extents[] array */
    fm->fm_flags = 0;               /* maybe "no DMAPI", etc like XFS */

    fd = open(path, O_RDONLY);
    printf("logical\t\tphysical\t\tbytes\n");

    /* The last call will return fewer extents than the maximum */
    while (fm->fm_extent_count == count) {
        rc = ioctl(fd, FIEMAP, fm);
        if (rc)
            break;

        /* kernel filled in fm_extents[] array, set fm_extent_count
         * to be actual number of extents returned, leaves fm_start
         * alone (unlike XFS_IOC_GETBMAP). */

        for (i = 0; i < fm->fm_extent_count; i++) {
            /* clear the flag bits to get the length in bytes */
            __u64 len = fm->fm_extents[i].fe_len & ~FIEMAP_LEN_MASK;
            __u64 fm_next = fm->fm_start.fe_start + len;
            /* compare against 0 so the 64-bit flag survives the int */
            int hole = (fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE) != 0;
            int unwr = (fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN) != 0;

            printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
                fm->fm_start.fe_start, fm_next - 1,
                hole ? 0 : fm->fm_extents[i].fe_start,
                hole ? 0 : fm->fm_extents[i].fe_start + len - 1,
                len, hole ? "(hole) " : "",
                unwr ? "(unwritten) " : "");

            /* get ready for printing next extent, or next ioctl */
            fm->fm_start.fe_start = fm_next;
        }
    }

I'm not wedded to an ioctl interface, but it seems consistent with FIBMAP.
I'm quite open to suggestions at this point, both in terms of how usable
the fibmap data structures are by the caller, and if we need to add anything
to make them more flexible for the future.

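The per-extent flag idea (limit a length to 2^56 bytes and keep the top
8 bits of fe_len for flags) could be sketched as below. This is only an
illustration under assumptions: the EXT_* names and the helper functions
are hypothetical, uint64_t stands in for __u64, and the shift of 56 is
chosen to match the 2^56 limit mentioned in the text rather than copied
from the FIEMAP_LEN_* values as posted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: bits 56..63 hold flags, bits 0..55 the length. */
#define EXT_FLAG_SHIFT      56
#define EXT_FLAG_MASK       (UINT64_C(0xff) << EXT_FLAG_SHIFT)
#define EXT_FLAG_HOLE       (UINT64_C(0x01) << EXT_FLAG_SHIFT)
#define EXT_FLAG_UNWRITTEN  (UINT64_C(0x02) << EXT_FLAG_SHIFT)

/* Combine a byte length (below 2^56) with flag bits into one 64-bit word. */
uint64_t ext_pack(uint64_t len, uint64_t flags)
{
    assert((len & EXT_FLAG_MASK) == 0);     /* length must fit in 56 bits */
    assert((flags & ~EXT_FLAG_MASK) == 0);  /* flags must stay in top byte */
    return len | flags;
}

/* The length is recovered by clearing the flag bits, not by keeping them. */
uint64_t ext_len(uint64_t fe_len)
{
    return fe_len & ~EXT_FLAG_MASK;
}

/* Compare against zero: assigning the masked 64-bit value to an int
 * would silently truncate a flag that lives above bit 31. */
int ext_is_hole(uint64_t fe_len)
{
    return (fe_len & EXT_FLAG_HOLE) != 0;
}

int ext_is_unwritten(uint64_t fe_len)
{
    return (fe_len & EXT_FLAG_UNWRITTEN) != 0;
}
```

Keeping the flags in the same word keeps each entry at 16 bytes; the
price is that every consumer has to mask before doing arithmetic on the
length.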
In terms of implementing this in the kernel, there was originally code for this during the development of the ext3 extent patches and it was done via a callback in the extent tree iterator so it is very efficient. I believe it implements all that is needed to allow this interface to be mapped onto XFS_IOC_BMAP internally (or vice versa). Even for block-mapped filesystems, they can at least improve over the ->bmap() case by skipping holes in files that cover [dt]indirect blocks (saving thousands of calls). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Thu Apr 12 05:08:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 12 Apr 2007 05:08:17 -0700 (PDT) Received: from ppsw-9.csi.cam.ac.uk (ppsw-9.csi.cam.ac.uk [131.111.8.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3CC83fB024925 for ; Thu, 12 Apr 2007 05:08:04 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:52806) by ppsw-9.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.159]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HbxOC-00043y-U8 (Exim 4.63) (return-path ); Thu, 12 Apr 2007 12:22:56 +0100 In-Reply-To: <20070412110550.GM5967@schatzie.adilger.int> References: <20070412110550.GM5967@schatzie.adilger.int> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <97211C89-1810-4B22-B2F4-9D206D43C1F6@cam.ac.uk> Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Thu, 12 Apr 2007 12:22:55 +0100 To: Andreas Dilger X-Mailer: Apple Mail (2.752.3) X-archive-position: 11088 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com 
Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs Content-Length: 7053 Lines: 191 Hi Andreas, On 12 Apr 2007, at 12:05, Andreas Dilger wrote: > I'm interested in getting input for implementing an ioctl to > efficiently > map file extents & holes (FIEMAP) instead of looping over FIBMAP a > billion > times. We already have customers with single files in the 10TB > range and > we additionally need to get the mapping over the network so it > needs to > be efficient in terms of how data is passed, and how easily it can be > extracted from the filesystem. > > I had come up with a plan independently and was also steered toward > XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original > plan, though I think the XFS structs used there are a bit bloated. > > There was also recent discussion about SEEK_HOLE and SEEK_DATA as > implemented by Sun, but even if we could skip the holes we still might > need to do millions of FIBMAPs to see how large files are allocated > on disk. Conversely, having filesystems implement an efficient FIBMAP > ioctl (or ->fiemap() method) could in turn be leveraged for SEEK_HOLE > and SEEK_DATA instead of doing looping over ->bmap() inside the kernel > as I saw one patch. 
> > > struct fibmap_extent { > __u64 fe_start; /* starting offset in bytes */ > __u64 fe_len; /* length in bytes */ > } > > struct fibmap { > struct fibmap_extent fm_start; /* offset, length of desired > mapping */ > __u32 fm_extent_count; /* number of extents in array */ > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ > __u64 unused; > struct fibmap_extent fm_extents[0]; > } > > #define FIEMAP_LEN_MASK 0xff000000000000 > #define FIEMAP_LEN_HOLE 0x01000000000000 > #define FIEMAP_LEN_UNWRITTEN 0x02000000000000 Sound good but I would add: #define FIEMAP_LEN_NO_DIRECT_ACCESS This would say that the offset on disk can move at any time or that the data is compressed or encrypted on disk thus the data is not useful for direct disk access. On NTFS small files can be inside the inode and there direct access is not possible because the metadata on disk is protected with fixups which need to be removed when the inode is read into memory. If you access the data directly on disk, you would see corrupt data on reads and cause corruption on writes... Similarly both for compressed and encrypted files doing direct access to the on-disk data is totally nonsensical as you would see random junk on read and cause fatal data corruption on writes. Also why are you not using 0xff00000000000000, i.e. two more zeroes at the end? Seems unnecessary to drop an extra 8 bits of significance from the byte size... May not matter today but it almost certainly will do in the future (just remember what people said about the 640k limit in MSDOS when it first came out!)... Finally please make sure that the file system can return in one way or another errors for example when it fails to determine the extents because the system ran out of memory, there was an i/o error, whatever... 
It may even be useful to be able to say "here is an extent of size X bytes but we do not know where it is on disk because there was an error determining this particular extent's on-disk location for some reason or other"... > All offsets are in bytes to allow cases where filesystems are not > going Excellent! > block-aligned/sized allocations (e.g. tail packing). The > fm_extents array > returned contains the packed list of allocation extents for the file, > including entries for holes (which have fe_start == 0, and a flag). Why the fe_start == 0? Surely just the flag is sufficient... On NTFS it is perfectly valid to have fe_start == 0 and to have that not be sparse (normally the $Boot system file is stored in the first 8 sectors of the volume)... Best regards, Anton > The ->fm_extents[] array includes all of the holes in addition to > allocated extents because this avoids the need to return both the > logical > and physical address for every extent and does not make processing any > harder. > > One feature that XFS_IOC_GETBMAPX has that may be desirable is the > ability to return unwritten extent information. In order to do > this XFS > required expanding the per-extent struct from 32 to 48 bytes per > extent, > but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what > hardship) > and keep 8 bytes or so for input/output flags per extent (would > need to > be masked before use). 
> > > Caller works something like: > > char buf[4096]; > struct fibmap *fm = (struct fibmap *)buf; > int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent); > > fm->fm_extent.fe_start = 0; /* start of file */ > fm->fm_extent.fe_len = -1; /* end of file */ > fm->fm_extent_count = count; /* max extents in fm_extents[] array */ > fm->fm_flags = 0; /* maybe "no DMAPI", etc like XFS */ > > fd = open(path, O_RDONLY); > printf("logical\t\tphysical\t\tbytes\n"); > > /* The last entry will have less extents than the maximum */ > while (fm->fm_extent_count == count) { > rc = ioctl(fd, FIEMAP, fm); > if (rc) > break; > > /* kernel filled in fm_extents[] array, set fm_extent_count > * to be actual number of extents returned, leaves fm_start > * alone (unlike XFS_IOC_GETBMAP). */ > > for (i = 0; i < fm->fm_extent_count; i++) { > __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK; > __u64 fm_next = fm->fm_start + len; > int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE; > int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN; > > printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n", > fm->fm_start, fm_next - 1, > hole ? 0 : fm->fm_extents[i].fe_start, > hole ? 0 : fm->fm_extents[i].fe_start + > fm->fm_extents[i].fe_len - 1, > len, hole ? "(hole) " : "", > unwr ? "(unwritten) " : ""); > > /* get ready for printing next extent, or next ioctl */ > fm->fm_start = fm_next; > } > } > > I'm not wedded to an ioctl interface, but it seems consistent with > FIBMAP. > I'm quite open to suggestions at this point, both in terms of how > usable > the fibmap data structures are by the caller, and if we need to add > anything > to make them more flexible for the future. > > In terms of implementing this in the kernel, there was originally > code for > this during the development of the ext3 extent patches and it was > done via > a callback in the extent tree iterator so it is very efficient. 
I > believe > it implements all that is needed to allow this interface to be mapped > onto XFS_IOC_BMAP internally (or vice versa). Even for block-mapped > filesystems, they can at least improve over the ->bmap() case by > skipping > holes in files that cover [dt]indirect blocks (saving thousands of > calls). > > Cheers, Andreas > -- > Andreas Dilger > Principal Software Engineer > Cluster File Systems, Inc. -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Thu Apr 12 18:43:15 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 12 Apr 2007 18:43:35 -0700 (PDT) Received: from alnrmhc14.comcast.net (alnrmhc14.comcast.net [206.18.177.54]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3D1hDfB028328 for ; Thu, 12 Apr 2007 18:43:14 -0700 Received: from [192.168.1.10] (c-67-171-1-120.hsd1.wa.comcast.net[67.171.1.120]) by comcast.net (alnrmhc14) with SMTP id <20070413013301b1400npnf0e>; Fri, 13 Apr 2007 01:33:01 +0000 Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation From: Nicholas Miell To: Andreas Dilger Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org In-Reply-To: <20070412110550.GM5967@schatzie.adilger.int> References: <20070412110550.GM5967@schatzie.adilger.int> Content-Type: text/plain Date: Thu, 12 Apr 2007 18:33:00 -0700 Message-Id: <1176427980.3125.9.camel@entropy> Mime-Version: 1.0 X-Mailer: Evolution 2.8.3 (2.8.3-2.0.njm.1) Content-Transfer-Encoding: 7bit X-archive-position: 11089 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: nmiell@comcast.net Precedence: bulk X-list: xfs On Thu, 2007-04-12 at 05:05 -0600, Andreas Dilger wrote: > I'm interested in getting input for implementing an ioctl to efficiently > map file extents & holes (FIEMAP) instead of looping over FIBMAP 
a billion > times. We already have customers with single files in the 10TB range and > we additionally need to get the mapping over the network so it needs to > be efficient in terms of how data is passed, and how easily it can be > extracted from the filesystem. > > I had come up with a plan independently and was also steered toward > XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original > plan, though I think the XFS structs used there are a bit bloated. > > There was also recent discussion about SEEK_HOLE and SEEK_DATA as > implemented by Sun, but even if we could skip the holes we still might > need to do millions of FIBMAPs to see how large files are allocated > on disk. Conversely, having filesystems implement an efficient FIBMAP > ioctl (or ->fiemap() method) could in turn be leveraged for SEEK_HOLE > and SEEK_DATA instead of doing looping over ->bmap() inside the kernel > as I saw one patch. > I certainly hope not. SEEK_HOLE/SEEK_DATA is a poor interface and doesn't deserve to spread. OTOH, this is nicely done. 
> > struct fibmap_extent { > __u64 fe_start; /* starting offset in bytes */ > __u64 fe_len; /* length in bytes */ > } > > struct fibmap { > struct fibmap_extent fm_start; /* offset, length of desired mapping */ > __u32 fm_extent_count; /* number of extents in array */ > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ > __u64 unused; > struct fibmap_extent fm_extents[0]; > } > > #define FIEMAP_LEN_MASK 0xff000000000000 > #define FIEMAP_LEN_HOLE 0x01000000000000 > #define FIEMAP_LEN_UNWRITTEN 0x02000000000000 > -- Nicholas Miell From owner-xfs@oss.sgi.com Thu Apr 12 19:26:43 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 12 Apr 2007 19:26:49 -0700 (PDT) Received: from tyo200.gate.nec.co.jp (TYO200.gate.nec.co.jp [210.143.35.50]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3D2QZfB006786 for ; Thu, 12 Apr 2007 19:26:36 -0700 Received: from tyo202.gate.nec.co.jp ([10.7.69.202]) by tyo200.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l3D0c044010281 for ; Fri, 13 Apr 2007 09:38:01 +0900 (JST) Received: from mailgate3.nec.co.jp (mailgate53.nec.co.jp [10.7.69.161]) by tyo202.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l3D0bgSc024600 for ; Fri, 13 Apr 2007 09:37:42 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id l3D0bgH18718 for xfs@oss.sgi.com; Fri, 13 Apr 2007 09:37:42 +0900 (JST) Received: from secsv3.tnes.nec.co.jp (tnesvc2.tnes.nec.co.jp [10.1.101.15]) by mailsv4.nec.co.jp (8.11.7/3.7W-MAILSV4-NEC) with ESMTP id l3D0bgg19679 for ; Fri, 13 Apr 2007 09:37:42 +0900 (JST) Received: from tnesvc2.tnes.nec.co.jp ([10.1.101.15]) by secsv3.tnes.nec.co.jp (ExpressMail 5.10) with SMTP id 20070413.093821.29702412 for ; Fri, 13 Apr 2007 09:38:21 +0900 Received: FROM tnessv1.tnes.nec.co.jp BY tnesvc2.tnes.nec.co.jp ; Fri Apr 13 09:38:20 2007 +0900 Received: from rifu.bsd.tnes.nec.co.jp (rifu.bsd.tnes.nec.co.jp [10.1.104.1]) by tnessv1.tnes.nec.co.jp (Postfix) with ESMTP id 2E96FAE4B3; Fri, 13 Apr 2007 
09:37:34 +0900 (JST) Received: from TNESG9305.tnes.nec.co.jp (TNESG9305.bsd.tnes.nec.co.jp [10.1.104.199]) by rifu.bsd.tnes.nec.co.jp (8.12.11/3.7W/BSD-TNES-MX01) with SMTP id l3D0bfRL020548; Fri, 13 Apr 2007 09:37:41 +0900 Message-Id: <200704130037.AA05196@TNESG9305.tnes.nec.co.jp> Date: Fri, 13 Apr 2007 09:37:35 +0900 To: xfs@oss.sgi.com Subject: [PATCH] remove the unnecessary word in the log message. From: Utako Kusaka MIME-Version: 1.0 X-Mailer: AL-Mail32 Version 1.13 Content-Type: text/plain; charset=us-ascii X-archive-position: 11090 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: utako@tnes.nec.co.jp Precedence: bulk X-list: xfs Hi, This is the trivial fix to remove the unnecessary word in the log message. "required" is set in both xlog_recover() and xfs_dev_is_read_only(). Example: fsfile is the filesystem which needs log recovery. # losetup -r /dev/loop1 fsfile # mount -t xfs /dev/loop1 mpnt mount: block device /dev/loop1 is write-protected, mounting read-only mount: cannot mount block device /dev/loop1 read-only /var/log/messages: Jan 23 15:05:22 g9517 kernel: XFS: recovery required required on read-only device. Signed-off-by: Utako Kusaka --- --- linux-2.6.20-orig/fs/xfs/xfs_log_recover.c 2007-02-05 03:44:54.000000000 +0900 +++ linux-2.6.20/fs/xfs/xfs_log_recover.c 2007-04-11 13:23:04.000000000 +0900 @@ -3937,8 +3937,7 @@ xlog_recover( * under the vfs layer, so we can get away with it unless * the device itself is read-only, in which case we fail. 
*/ - if ((error = xfs_dev_is_read_only(log->l_mp, - "recovery required"))) { + if ((error = xfs_dev_is_read_only(log->l_mp, "recovery"))) { return error; } From owner-xfs@oss.sgi.com Thu Apr 12 21:02:01 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 12 Apr 2007 21:02:07 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3D420fB028758 for ; Thu, 12 Apr 2007 21:02:01 -0700 Received: from localhost.adilger.int (S01060004e23cfc51.cg.shawcable.net [68.147.252.160]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 0A4317BA305; Thu, 12 Apr 2007 22:01:59 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id E0C57407F; Thu, 12 Apr 2007 22:01:56 -0600 (MDT) Date: Thu, 12 Apr 2007 22:01:56 -0600 From: Andreas Dilger To: Anton Altaparmakov Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org, linux-ext4@vger.kernel.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070413040156.GU5967@schatzie.adilger.int> Mail-Followup-To: Anton Altaparmakov , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070412110550.GM5967@schatzie.adilger.int> <97211C89-1810-4B22-B2F4-9D206D43C1F6@cam.ac.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <97211C89-1810-4B22-B2F4-9D206D43C1F6@cam.ac.uk> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11091 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On Apr 12, 2007 12:22 +0100, Anton Altaparmakov wrote: > On 12 Apr 2007, at 12:05, 
Andreas Dilger wrote: > >I'm interested in getting input for implementing an ioctl to > >efficiently map file extents & holes (FIEMAP) instead of looping > >over FIBMAP a billion times. We already have customers with single > >files in the 10TB range and we additionally need to get the mapping > >over the network so it needs to be efficient in terms of how data > >is passed, and how easily it can be extracted from the filesystem. > > > >struct fibmap_extent { > > __u64 fe_start; /* starting offset in bytes */ > > __u64 fe_len; /* length in bytes */ > >} > > > >struct fibmap { > > struct fibmap_extent fm_start; /* offset, length of desired mapping */ > > __u32 fm_extent_count; /* number of extents in array */ > > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ > > __u64 unused; > > struct fibmap_extent fm_extents[0]; > >} > > > >#define FIEMAP_LEN_MASK 0xff000000000000 > >#define FIEMAP_LEN_HOLE 0x01000000000000 > >#define FIEMAP_LEN_UNWRITTEN 0x02000000000000 > > Sounds good but I would add: > > #define FIEMAP_LEN_NO_DIRECT_ACCESS > > This would say that the offset on disk can move at any time or that > the data is compressed or encrypted on disk thus the data is not > useful for direct disk access. This makes sense. Even for Reiserfs the same is true with packed tails, and I believe if FIBMAP is called on a tail it will migrate the tail into a block because this might be a sign that the file is a kernel that LILO wants to boot. I'd rather not have any such feature in FIEMAP, and just return the on-disk allocation for the file, so NO_DIRECT_ACCESS is fine with me. My main reason for FIEMAP is being able to investigate allocation patterns of files. By no means is my flag list exhaustive, just the ones that I thought would be needed to implement this for ext4 and Lustre. > Also why are you not using 0xff00000000000000, i.e. two more zeroes > at the end? Seems unnecessary to drop an extra 8 bits of > significance from the byte size... 
It was actually just a typo (this was the first time I'd written the structs and flags down, it is just at the discussion stage). I'd meant for it to be 2^56 bytes for the file size as I wrote later in the email. That said, I think that 2^48 bytes is probably sufficient for most uses, so that we get 16 bits for flags. As it is, this email already discusses 5 flags, and that would give little room for expansion in the future. Remember, this is the mapping for a single file (which can't practically be beyond 2^64 bytes as yet) so it wouldn't be hard for the filesystem to return a few separate extents which are actually contiguous (assuming that there will actually be files in filesystems with > 2^48 bytes of contiguous space). Since the API is that it will return the extent that contains the requested "start" byte, the kernel will be able to detect this case also, since it won't be able to specify a length for the extent that contains the start byte. At most we'd have to call the ioctl() 65536 times for a completely contiguous 2^64 byte file if the buffer was only large enough for a single extent. In reality, I expect any file to have some discontinuities and the buffer to be large enough for a thousand or more entries so the corner case is not very bad. > Finally please make sure that the file system can return in one way > or another errors for example when it fails to determine the extents > because the system ran out of memory, there was an i/o error, > whatever... It may even be useful to be able to say "here is an > extent of size X bytes but we do not know where it is on disk because > there was an error determining this particular extent's on-disk > location for some reason or other"... Yes, that makes sense also, something like FIEMAP_LEN_UNKNOWN, and FIEMAP_LEN_ERROR. Consider FIEMAP on a file that was migrated to tape and currently has no blocks allocated in the filesystem. 
We want to return some indication that there is actual file data and not just a hole, but at the same time we don't want this to actually return the file from tape just to generate block mappings for it. This concept is also present in XFS_IOC_GETBMAPX - BMV_IF_NO_DMAPI_READ, but this needs to be specified on input to prevent the file being mapped and I'd rather the opposite (not getting file from tape) be the default, by principle of least surprise. > >block-aligned/sized allocations (e.g. tail packing). The > >fm_extents array > >returned contains the packed list of allocation extents for the file, > >including entries for holes (which have fe_start == 0, and a flag). > > Why the fe_start == 0? Surely just the flag is sufficient... On > NTFS it is perfectly valid to have fe_start == 0 and to have that not > be sparse (normally the $Boot system file is stored in the first 8 > sectors of the volume)... I thought fe_start = 0 was pretty standard for a hole. It should be something and I'd rather 0 than anything else. The _HOLE flag is enough as you say though. PS - I'd thought about adding you to the CC list for this, because I know you've had opinions on FIBMAP in the past, but I didn't have your email handy and it was late, and I know you saw the NTFS kmap patch on fsdevel so I figured you would see this too... Thanks for your input. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. 
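[Editorial sketch, not part of the original thread.] The packed-flags idea discussed above (a 48-bit length with 16 flag bits in the top of fe_len, and a hole identified by its flag rather than by fe_start == 0) could look roughly like the C below. All names prefixed with sketch_/SKETCH_ and the exact bit positions are assumptions made for illustration; the thread had not settled on any layout.

```c
#include <stdint.h>

/*
 * Hypothetical layout: the top 16 bits of fe_len carry flags and the
 * low 48 bits carry the extent length in bytes. Bit positions are
 * illustrative only.
 */
#define SKETCH_LEN_FLAG_SHIFT 48
#define SKETCH_LEN_MASK       (~UINT64_C(0) << SKETCH_LEN_FLAG_SHIFT)
#define SKETCH_LEN_HOLE       (UINT64_C(0x0001) << SKETCH_LEN_FLAG_SHIFT)
#define SKETCH_LEN_UNWRITTEN  (UINT64_C(0x0002) << SKETCH_LEN_FLAG_SHIFT)

struct sketch_extent {
	uint64_t fe_start;	/* starting offset in bytes */
	uint64_t fe_len;	/* flags (high 16 bits) | length in bytes */
};

/* Length in bytes with the flag bits masked off. */
static inline uint64_t sketch_extent_bytes(const struct sketch_extent *e)
{
	return e->fe_len & ~SKETCH_LEN_MASK;
}

/*
 * A hole is identified by its flag only. fe_start == 0 is deliberately
 * not consulted, because block 0 can hold real data (e.g. $Boot on NTFS).
 */
static inline int sketch_extent_is_hole(const struct sketch_extent *e)
{
	return (e->fe_len & SKETCH_LEN_HOLE) != 0;
}
```

A caller would always mask fe_len before using it as a length; that masking step is the cost of packing flags into the length word.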
From owner-xfs@oss.sgi.com Thu Apr 12 21:16:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 12 Apr 2007 21:16:13 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3D4FufB031559 for ; Thu, 12 Apr 2007 21:16:00 -0700 Received: from pcbnaujok (pc-bnaujok.melbourne.sgi.com [134.14.55.58]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id OAA16550; Fri, 13 Apr 2007 14:15:54 +1000 Message-Id: <200704130415.OAA16550@larry.melbourne.sgi.com> From: "Barry Naujok" To: , "'xfs-dev'" Subject: [PATCH] xfs_repair - move realtime extent processing to a separate function Date: Fri, 13 Apr 2007 14:22:10 +1000 MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_NextPart_000_025A_01C77DD7.1FB6E8F0" X-Mailer: Microsoft Office Outlook, Build 11.0.6353 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028 Thread-Index: Acd9g03WEX2xICCNSPiJxMhRydgOEw== X-archive-position: 11092 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@melbourne.sgi.com Precedence: bulk X-list: xfs This is a multi-part message in MIME format. ------=_NextPart_000_025A_01C77DD7.1FB6E8F0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit While changing the process_bmbt_reclist_int() function, I observed a realtime check inside the block map get/set state loop which is quite CPU intensive. Upon further investigation, this loop is not used at all for realtime extents and that the two types of extents are pretty much processed exclusively. So, I simplified the functionality by moving the realtime extent processing into it's own function and fixing a bug at the same time when it comes to realtime inodes with attributes (it was comparing attr extents to the realtime volume bmap instead of the normal bmap). 
------=_NextPart_000_025A_01C77DD7.1FB6E8F0 Content-Type: application/octet-stream; name="separate_rt_extent_processing.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="separate_rt_extent_processing.patch" Index: repair/xfsprogs/repair/dinode.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- repair.orig/xfsprogs/repair/dinode.c 2007-04-13 13:07:16.000000000 +1000 +++ repair/xfsprogs/repair/dinode.c 2007-04-13 14:15:33.920960345 +1000 @@ -537,6 +537,121 @@ return (val); \ } while (0) =20 +static int +process_rt_rec( + xfs_mount_t *mp, + xfs_bmbt_rec_32_t *rp, + xfs_ino_t ino, + xfs_drfsbno_t *tot, + int check_dups) +{ + xfs_dfsbno_t b; + xfs_drtbno_t ext; + xfs_dfilblks_t c; /* count */ + xfs_dfsbno_t s; /* start */ + xfs_dfiloff_t o; /* offset */ + int state; + int flag; /* extent flag */ + int pwe; /* partially-written extent */ + + convert_extent(rp, &o, &s, &c, &flag); + + /* + * check numeric validity of the extent + */ + if (s >=3D mp->m_sb.sb_rblocks) { + do_warn(_("inode %llu - bad rt extent start block number " + "%llu, offset %llu\n"), ino, s, o); + return 1; + } + if (s + c - 1 >=3D mp->m_sb.sb_rblocks) { + do_warn(_("inode %llu - bad rt extent last block number %llu, " + "offset %llu\n"), ino, s + c - 1, o); + return 1; + } + if (s + c - 1 < s) { + do_warn(_("inode %llu - bad rt extent overflows - start %llu, " + "end %llu, offset %llu\n"), + ino, s, s + c - 1, o); + return 1; + } + + /* + * verify that the blocks listed in the record + * are multiples of an extent + */ + if (XFS_SB_VERSION_HASEXTFLGBIT(&mp->m_sb) =3D=3D 0 && + (s % mp->m_sb.sb_rextsize !=3D 0 || + c % mp->m_sb.sb_rextsize !=3D 0)) { + do_warn(_("malformed rt inode extent [%llu %llu] (fs rtext " + "size =3D %u)\n"), s, c, mp->m_sb.sb_rextsize); + return 1; + } + + /* + * set the 
appropriate number of extents + */ + for (b =3D s; b < s + c; b +=3D mp->m_sb.sb_rextsize) { + ext =3D (xfs_drtbno_t) b / mp->m_sb.sb_rextsize; + pwe =3D XFS_SB_VERSION_HASEXTFLGBIT(&mp->m_sb) && flag && + (b % mp->m_sb.sb_rextsize !=3D 0); + + if (check_dups =3D=3D 1) { + if (search_rt_dup_extent(mp, ext) && !pwe) { + do_warn(_("data fork in rt ino %llu claims " + "dup rt extent, off - %llu, " + "start - %llu, count %llu\n"), + ino, o, s, c); + return 1; + } + continue; + } + + state =3D get_rtbno_state(mp, ext); + + switch (state) { + case XR_E_FREE: + case XR_E_UNKNOWN: + set_rtbno_state(mp, ext, XR_E_INUSE); + break; + + case XR_E_BAD_STATE: + do_error(_("bad state in rt block map %llu\n"), + ext); + + case XR_E_FS_MAP: + case XR_E_INO: + case XR_E_INUSE_FS: + do_error(_("data fork in rt inode %llu found " + "metadata block %llu in rt bmap\n"), + ino, ext); + + case XR_E_INUSE: + if (pwe) + break; + + case XR_E_MULT: + set_rtbno_state(mp, ext, XR_E_MULT); + do_warn(_("data fork in rt inode %llu claims " + "used rt block %llu\n"), + ino, ext); + return 1; + + case XR_E_FREE1: + default: + do_error(_("illegal state %d in rt block map " + "%llu\n"), state, b); + } + } + + /* + * bump up the block counter + */ + *tot +=3D c; + + return 0; +} + /* * return 1 if inode should be cleared, 0 otherwise * if check_dups should be set to 1, that implies that @@ -560,7 +675,6 @@ int whichfork) { xfs_dfsbno_t b; - xfs_drtbno_t ext; xfs_dfilblks_t c; /* count */ xfs_dfilblks_t cp =3D 0; /* prev count */ xfs_dfsbno_t s; /* start */ @@ -572,7 +686,6 @@ int i; int state; int flag; /* extent flag */ - int pwe; /* partially-written extent */ xfs_dfsbno_t e; xfs_agnumber_t agno; xfs_agblock_t agbno; @@ -615,28 +728,22 @@ o, s, ino); PROCESS_BMBT_UNLOCK_RETURN(1); } - if (type =3D=3D XR_INO_RTDATA) { - if (s >=3D mp->m_sb.sb_rblocks) { - do_warn( - _("inode %llu - bad rt extent start block number %llu, offset %llu\n"), - ino, s, o); - PROCESS_BMBT_UNLOCK_RETURN(1); - } - if (s + c - 
1 >=3D mp->m_sb.sb_rblocks) { - do_warn( - _("inode %llu - bad rt extent last block number %llu, offset %llu\n"), - ino, s + c - 1, o); - PROCESS_BMBT_UNLOCK_RETURN(1); - } - if (s + c - 1 < s) { - do_warn( - _("inode %llu - bad rt extent overflows - start %llu, end %llu, " - "offset %llu\n"), - ino, s, s + c - 1, o); - PROCESS_BMBT_UNLOCK_RETURN(1); - } - } else { - switch (verify_dfsbno_range(mp, s, c)) { + + if (type =3D=3D XR_INO_RTDATA && whichfork =3D=3D XFS_DATA_FORK) { + /* + * realtime bitmaps don't use AG locks, so returning + * immediately is fine for this code path. + */ + if (process_rt_rec(mp, rp, ino, tot, check_dups)) + return 1; + /* + * skip rest of loop processing since that's + * all for regular file forks and attr forks + */ + continue; + } + + switch (verify_dfsbno_range(mp, s, c)) { case XR_DFSBNORANGE_VALID: break; case XR_DFSBNORANGE_BADSTART: @@ -656,109 +763,13 @@ "offset %llu\n"), ino, s, s + c - 1, o); PROCESS_BMBT_UNLOCK_RETURN(1); - } - if (o >=3D fs_max_file_offset) { - do_warn( + } + if (o >=3D fs_max_file_offset) { + do_warn( _("inode %llu - extent offset too large - start %llu, count %llu, " "offset %llu\n"), - ino, s, c, o); - PROCESS_BMBT_UNLOCK_RETURN(1); - } - } - - /* - * realtime file data fork - */ - if (type =3D=3D XR_INO_RTDATA && whichfork =3D=3D XFS_DATA_FORK) { - /* - * XXX - verify that the blocks listed in the record - * are multiples of an extent - */ - if (XFS_SB_VERSION_HASEXTFLGBIT(&mp->m_sb) =3D=3D 0 - && (s % mp->m_sb.sb_rextsize !=3D 0 || - c % mp->m_sb.sb_rextsize !=3D 0)) { - do_warn( - _("malformed rt inode extent [%llu %llu] (fs rtext size =3D %u)\n"), - s, c, mp->m_sb.sb_rextsize); - PROCESS_BMBT_UNLOCK_RETURN(1); - } - - /* - * XXX - set the appropriate number of extents - */ - for (b =3D s; b < s + c; b +=3D mp->m_sb.sb_rextsize) { - ext =3D (xfs_drtbno_t) b / mp->m_sb.sb_rextsize; - if (XFS_SB_VERSION_HASEXTFLGBIT(&mp->m_sb) && - flag && (b % mp->m_sb.sb_rextsize !=3D 0)) { - pwe =3D 1; - } else { - 
pwe =3D 0; - } - - if (check_dups =3D=3D 1) { - if (search_rt_dup_extent(mp, ext) && - !pwe) { - do_warn( - _("data fork in rt ino %llu claims dup rt extent, off - %llu, " - "start - %llu, count %llu\n"), - ino, o, s, c); - PROCESS_BMBT_UNLOCK_RETURN(1); - } - continue; - } - - state =3D get_rtbno_state(mp, ext); - - switch (state) { - case XR_E_FREE: -/* XXX - turn this back on after we - run process_rtbitmap() in phase2 - do_warn( - _("%s fork in rt ino %llu claims free rt block %llu\n"), - forkname, ino, ext); -*/ - /* fall through ... */ - case XR_E_UNKNOWN: - set_rtbno_state(mp, ext, XR_E_INUSE); - break; - case XR_E_BAD_STATE: - do_error( - _("bad state in rt block map %llu\n"), ext); - abort(); - break; - case XR_E_FS_MAP: - case XR_E_INO: - case XR_E_INUSE_FS: - do_error( - _("%s fork in rt inode %llu found metadata block %llu in %s bmap\n"), - forkname, ino, ext, ftype); - case XR_E_INUSE: - if (pwe) - break; - case XR_E_MULT: - set_rtbno_state(mp, ext, XR_E_MULT); - do_warn( - _("%s fork in rt inode %llu claims used rt block %llu\n"), - forkname, ino, ext); - PROCESS_BMBT_UNLOCK_RETURN(1); - case XR_E_FREE1: - default: - do_error( - _("illegal state %d in %s block map %llu\n"), - state, ftype, b); - } - } - - /* - * bump up the block counter - */ - *tot +=3D c; - - /* - * skip rest of loop processing since that's - * all for regular file forks and attr forks - */ - continue; + ino, s, c, o); + PROCESS_BMBT_UNLOCK_RETURN(1); } =20 =20 @@ -793,15 +804,6 @@ continue; } =20 - /* FIX FOR BUG 653709 -- EKN - * realtime attribute fork, should be valid block number - * in regular data space, not realtime partion. 
- */ - if (type =3D=3D XR_INO_RTDATA && whichfork =3D=3D XFS_ATTR_FORK) { - if (mp->m_sb.sb_agcount < agno) - PROCESS_BMBT_UNLOCK_RETURN(1); - } - /* Process in chunks of 16 (XR_BB_UNIT/XR_BB) * for common XR_E_UNKNOWN to XR_E_INUSE transition */ ------=_NextPart_000_025A_01C77DD7.1FB6E8F0-- From owner-xfs@oss.sgi.com Fri Apr 13 00:46:36 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 13 Apr 2007 00:46:45 -0700 (PDT) Received: from ppsw-3.csi.cam.ac.uk (ppsw-3.csi.cam.ac.uk [131.111.8.133]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3D7kYfB005684 for ; Fri, 13 Apr 2007 00:46:35 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49243) by ppsw-3.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.153]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HcGU8-0000Xj-BR (Exim 4.63) (return-path ); Fri, 13 Apr 2007 08:46:20 +0100 In-Reply-To: <20070413040156.GU5967@schatzie.adilger.int> References: <20070412110550.GM5967@schatzie.adilger.int> <97211C89-1810-4B22-B2F4-9D206D43C1F6@cam.ac.uk> <20070413040156.GU5967@schatzie.adilger.int> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Fri, 13 Apr 2007 08:46:18 +0100 To: Andreas Dilger X-Mailer: Apple Mail (2.752.3) X-archive-position: 11093 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs Hi Andreas, On 13 Apr 2007, at 05:01, Andreas Dilger wrote: > On Apr 12, 2007 12:22 +0100, Anton Altaparmakov wrote: >> On 12 Apr 2007, at 12:05, Andreas Dilger 
wrote: >>> I'm interested in getting input for implementing an ioctl to >>> efficiently map file extents & holes (FIEMAP) instead of looping >>> over FIBMAP a billion times. We already have customers with single >>> files in the 10TB range and we additionally need to get the mapping >>> over the network so it needs to be efficient in terms of how data >>> is passed, and how easily it can be extracted from the filesystem. >>> >>> struct fibmap_extent { >>> __u64 fe_start; /* starting offset in bytes */ >>> __u64 fe_len; /* length in bytes */ >>> } >>> >>> struct fibmap { >>> struct fibmap_extent fm_start; /* offset, length of desired >>> mapping */ >>> __u32 fm_extent_count; /* number of extents in array */ >>> __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ >>> __u64 unused; >>> struct fibmap_extent fm_extents[0]; >>> } >>> >>> #define FIEMAP_LEN_MASK 0xff000000000000 >>> #define FIEMAP_LEN_HOLE 0x01000000000000 >>> #define FIEMAP_LEN_UNWRITTEN 0x02000000000000 >> >> Sounds good but I would add: >> >> #define FIEMAP_LEN_NO_DIRECT_ACCESS >> >> This would say that the offset on disk can move at any time or that >> the data is compressed or encrypted on disk thus the data is not >> useful for direct disk access. > > This makes sense. Even for Reiserfs the same is true with packed > tails, > and I believe if FIBMAP is called on a tail it will migrate the > tail into > a block because this might be a sign that the file is a kernel that > LILO wants to boot. > > I'd rather not have any such feature in FIEMAP, and just return the > on-disk allocation for the file, so NO_DIRECT_ACCESS is fine with me. > My main reason for FIEMAP is being able to investigate allocation > patterns > of files. > > By no means is my flag list exhaustive, just the ones that I > thought would > be needed to implement this for ext4 and Lustre. Sure, hence why I made my comment for NTFS. (-: And yes, ReiserFS and even ext* could use such a flag. 
I believe there is a compression patch for ext somewhere isn't there? (Or at least there was one at some point I think...) >> Also why are you not using 0xff00000000000000, i.e. two more zeroes >> at the end? Seems unnecessary to drop an extra 8 bits of >> significance from the byte size... > > It was actually just a typo (this was the first time I'd written the > structs and flags down, it is just at the discussion stage). I'd > meant > for it to be 2^56 bytes for the file size as I wrote later in the > email. Ok. (-: > That said, I think that 2^48 bytes is probably sufficient for most > uses, > so that we get 16 bits for flags. As it is this email already > discusses > 5 flags, and that would give little room for expansion in the future. > > Remember, this is the mapping for a single file (which can't > practially > be beyond 2^64 bytes as yet) so it wouldn't be hard for the > filesystem to > return a few separate extents which are actually contiguous > (assuming that > there will actually be files in filesystems with > 2^48 bytes of > contiguous > space). Since the API is that it will return the extent that > contains the > requested "start" byte, the kernel will be able to detect this case > also, > since it won't be able to specify a length for the extent that > contains the > start byte. Valid point. As long as the "on-disk location" is maintained as full 64 bits then you are right we could just return multiple extents if the space does not fit. A bit of a kludge but it would certainly work. An alternative would be to have the flags in a separate field but that would add 8-bytes to the structure size if you want to maintain 8-byte alignment so that would not be great... > At most we'd have to call the ioctl() 65536 times for a completely > contiguous 2^64 byte file if the buffer was only large enough for a > single extent. 
In reality, I expect any file to have some > discontinuities > and the buffer to be large enough for a thousand or more entries so > the > corner case is not very bad. > >> Finally please make sure that the file system can return in one way >> or another errors for example when it fails to determine the extents >> because the system ran out of memory, there was an i/o error, >> whatever... It may even be useful to be able to say "here is an >> extent of size X bytes but we do not know where it is on disk because >> there was an error determining this particular extent's on-disk >> location for some reason or other"... > > Yes, that makes sense also, something like FIEMAP_LEN_UNKNOWN, and > FIEMAP_LEN_ERROR. Consider FIEMAP on a file that was migrated > to tape and currently has no blocks allocated in the filesystem. We > want to return some indication that there is actual file data and not > just a hole, but at the same time we don't want this to actually > return > the file from tape just to generate block mappings for it. Yes, NTFS also has off line storage (DFS - the Distributed File System I think it is called) but we don't support any of that. Perhaps one day... > This concept is also present in XFS_IOC_GETBMAPX - > BMV_IF_NO_DMAPI_READ, > but this needs to be specified on input to prevent the file being > mapped > and I'd rather the opposite (not getting file from tape) be the > default, > by principle of least surprise. > >>> block-aligned/sized allocations (e.g. tail packing). The >>> fm_extents array >>> returned contains the packed list of allocation extents for the >>> file, >>> including entries for holes (which have fe_start == 0, and a flag). >> >> Why the fe_start == 0? Surely just the flag is sufficient... On >> NTFS it is perfectly valid to have fe_start == 0 and to have that not >> be sparse (normally the $Boot system file is stored in the first 8 >> sectors of the volume)... > > I thought fe_start = 0 was pretty standard for a hole. 
It should be > something and I'd rather 0 than anything else. The _HOLE flag is > enough > as you say though. It is standard on Unix. I am trying to fight this standard because of NTFS... On NTFS a hole is -1 not 0 and zero is a valid block. But on NTFS device locations are "s64" not "u64" so the -1 is logical to use... As long as it is made clear that people MUST check the flag when fe_start == 0 rather than assume that fe_start == 0 means a hole I am happy with that. Hopefully not too many programmers will be lazy gits who will ignore this and just check fe_start == 0 or they will fail on NTFS and assume $Boot is sparse when it is not... > PS - I'd thought about adding you to the CC list for this, because > I know > you've had opinions on FIBMAP in the past, but I didn't have > your email handy and it was late, and I know you saw the NTFS > kmap > patch on fsdevel so I figured you would see this too... Thanks. Yes, I try to follow fsdevel closely and LKML not so closely (I often read it with "select all new, delete")... > Thanks for your input. You are welcome. Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Fri Apr 13 01:04:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 13 Apr 2007 01:04:29 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3D84PfB010564 for ; Fri, 13 Apr 2007 01:04:26 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HcGAh-0005dQ-77; Fri, 13 Apr 2007 08:26:15 +0100 Date: Fri, 13 Apr 2007 08:26:15 +0100 From: Christoph Hellwig To: Utako Kusaka Cc: xfs@oss.sgi.com Subject: Re: [PATCH] remove the unnecessary word in the log message. 
Message-ID: <20070413072615.GB20326@infradead.org> References: <200704130037.AA05196@TNESG9305.tnes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200704130037.AA05196@TNESG9305.tnes.nec.co.jp> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11094 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Fri, Apr 13, 2007 at 09:37:35AM +0900, Utako Kusaka wrote: > --- linux-2.6.20-orig/fs/xfs/xfs_log_recover.c 2007-02-05 03:44:54.000000000 +0900 > +++ linux-2.6.20/fs/xfs/xfs_log_recover.c 2007-04-11 13:23:04.000000000 +0900 > @@ -3937,8 +3937,7 @@ xlog_recover( > * under the vfs layer, so we can get away with it unless > * the device itself is read-only, in which case we fail. > */ > - if ((error = xfs_dev_is_read_only(log->l_mp, > - "recovery required"))) { > + if ((error = xfs_dev_is_read_only(log->l_mp, "recovery"))) { > return error; > } Looks good. 
(And gets rid of an ugly line-break, nice) From owner-xfs@oss.sgi.com Fri Apr 13 01:04:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 13 Apr 2007 01:04:32 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3D84RfB010574 for ; Fri, 13 Apr 2007 01:04:29 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HcGBD-0005dk-S2; Fri, 13 Apr 2007 08:26:47 +0100 Date: Fri, 13 Apr 2007 08:26:47 +0100 From: Christoph Hellwig To: Barry Naujok Cc: xfs@oss.sgi.com, "'xfs-dev'" Subject: Re: [PATCH] xfs_repair - move realtime extent processing to a separate function Message-ID: <20070413072647.GC20326@infradead.org> References: <200704130415.OAA16550@larry.melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200704130415.OAA16550@larry.melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11095 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Fri, Apr 13, 2007 at 02:22:10PM +1000, Barry Naujok wrote: > While changing the process_bmbt_reclist_int() function, I observed a > realtime check inside the block map get/set state loop which is quite > CPU intensive. Upon further investigation, this loop is not used at > all for realtime extents and that the two types of extents are pretty > much processed exclusively. > > So, I simplified the functionality by moving the realtime extent > processing into it's own function and fixing a bug at the same time > when it comes to realtime inodes with attributes (it was comparing > attr extents to the realtime volume bmap instead of the normal bmap). Nice cleanup, looks good. 
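[Editorial sketch, not part of the original thread.] The numeric validity checks that the patch moves into process_rt_rec() can be read in isolation as the small C function below. It assumes unsigned 64-bit block numbers, so that start + count - 1 wraps modulo 2^64; the enum and function names are hypothetical and do not appear in xfs_repair.

```c
#include <stdint.h>

/* Hypothetical result codes; xfs_repair itself warns and returns 1. */
enum sketch_rt_check {
	SKETCH_RT_OK,
	SKETCH_RT_BAD_START,	/* first block beyond the realtime volume */
	SKETCH_RT_BAD_END,	/* last block beyond the realtime volume */
	SKETCH_RT_OVERFLOW	/* start + count - 1 wrapped around */
};

/*
 * Mirror of the three range checks, in the same order as the patch:
 * start, then last block, then wrap-around. "rblocks" stands in for
 * mp->m_sb.sb_rblocks.
 */
static inline enum sketch_rt_check
sketch_check_rt_extent(uint64_t start, uint64_t count, uint64_t rblocks)
{
	uint64_t last = start + count - 1;	/* may wrap modulo 2^64 */

	if (start >= rblocks)
		return SKETCH_RT_BAD_START;
	if (last >= rblocks)
		return SKETCH_RT_BAD_END;
	if (last < start)
		return SKETCH_RT_OVERFLOW;
	return SKETCH_RT_OK;
}
```

The wrap-around test comes last because a wrapped "last" value can be small enough to pass the second check; only the comparison against "start" catches it.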
From owner-xfs@oss.sgi.com Fri Apr 13 03:15:09 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 13 Apr 2007 03:15:12 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3DAF7fB007394 for ; Fri, 13 Apr 2007 03:15:09 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HcIo7-0003PJ-50; Fri, 13 Apr 2007 11:15:07 +0100 Date: Fri, 13 Apr 2007 11:15:07 +0100 From: Christoph Hellwig To: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070413101507.GA11406@infradead.org> References: <20070412110550.GM5967@schatzie.adilger.int> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070412110550.GM5967@schatzie.adilger.int> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11096 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote: > struct fibmap_extent { > __u64 fe_start; /* starting offset in bytes */ > __u64 fe_len; /* length in bytes */ > } > > struct fibmap { > struct fibmap_extent fm_start; /* offset, length of desired mapping */ > __u32 fm_extent_count; /* number of extents in array */ > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ > __u64 unused; > struct fibmap_extent fm_extents[0]; > } > > #define FIEMAP_LEN_MASK 0xff000000000000 > #define FIEMAP_LEN_HOLE 0x01000000000000 > #define FIEMAP_LEN_UNWRITTEN 0x02000000000000 > > All offsets are in bytes to allow cases where filesystems are not going > block-aligned/sized allocations (e.g. 
tail packing). The fm_extents array > returned contains the packed list of allocation extents for the file, > including entries for holes (which have fe_start == 0, and a flag). > One feature that XFS_IOC_GETBMAPX has that may be desirable is the > ability to return unwritten extent information. In order to do this XFS > required expanding the per-extent struct from 32 to 48 bytes per extent, > but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship) > and keep 8 bytes or so for input/output flags per extent (would need to > be masked before use). I'd be much happier to have the separate per-extent flags value. For one thing this allows much nicer representations of unwritten extents or holes without taking away bits from the len value. It also allows interesting uses of this in the future, e.g. telling about an offline extent for use in HSM applications. Also for this kernel<->user interface the wasted space shouldn't matter too much - if you want to pass the above condensed structure over the wire in lustre that shouldn't be a problem, you'd have to convert to an endian-neutral on the wire format anyway. Not doing the masking also makes the interface quite a bit simpler to use. One additional feature from the XFS getbmapx interface we should provide is the ability to query the layout of xattrs. While other filesystems might not have the exact xattr fork XFS has, it fits nicely into the interface. Especially when we have Anton's suggested flag for inline data. 
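For illustration, the separate per-extent flags value argued for above could look roughly like the following. This is only a sketch: the struct name, the flag names and the padding field are assumptions made up for this example and are not taken from any posted patch.

```c
#include <stdint.h>

/* Hypothetical per-extent record with its own flags word, so that no
 * bits have to be stolen from the length. All names are illustrative. */
struct sk_fiemap_extent {
	uint64_t fe_start;	/* starting offset in bytes */
	uint64_t fe_len;	/* length in bytes, all 64 bits usable */
	uint32_t fe_flags;	/* per-extent flags, see SK_FE_* below */
	uint32_t fe_pad;	/* reserved, keeps the record 8-byte aligned */
};

#define SK_FE_HOLE	0x1u	/* no blocks allocated */
#define SK_FE_UNWRITTEN	0x2u	/* preallocated, not yet written */
#define SK_FE_OFFLINE	0x4u	/* e.g. migrated away by an HSM */

/* No masking of fe_len is needed before use; a flag test is one AND. */
static inline int sk_fe_is_hole(const struct sk_fiemap_extent *fe)
{
	return (fe->fe_flags & SK_FE_HOLE) != 0;
}
```

The cost is 24 bytes per extent instead of 16, which is the "wasted space" trade-off mentioned above.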
From owner-xfs@oss.sgi.com Fri Apr 13 04:39:20 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 13 Apr 2007 04:39:27 -0700 (PDT) Received: from ppsw-3.csi.cam.ac.uk (ppsw-3.csi.cam.ac.uk [131.111.8.133]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3DBdIfB026056 for ; Fri, 13 Apr 2007 04:39:20 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:50371) by ppsw-3.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.153]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HcK7H-0004AB-CS (Exim 4.63) (return-path ); Fri, 13 Apr 2007 12:38:59 +0100 In-Reply-To: <20070413101507.GA11406@infradead.org> References: <20070412110550.GM5967@schatzie.adilger.int> <20070413101507.GA11406@infradead.org> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <37B026AB-60FA-4595-B2B1-F57BB023D91C@cam.ac.uk> Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Fri, 13 Apr 2007 12:38:58 +0100 To: Christoph Hellwig X-Mailer: Apple Mail (2.752.3) X-archive-position: 11097 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 13 Apr 2007, at 11:15, Christoph Hellwig wrote: > On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote: >> struct fibmap_extent { >> __u64 fe_start; /* starting offset in bytes */ >> __u64 fe_len; /* length in bytes */ >> } >> >> struct fibmap { >> struct fibmap_extent fm_start; /* offset, length of desired >> mapping */ >> __u32 fm_extent_count; /* number of extents in array */ >> __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ >> __u64 unused; >> struct 
fibmap_extent fm_extents[0]; >> } >> >> #define FIEMAP_LEN_MASK 0xff000000000000 >> #define FIEMAP_LEN_HOLE 0x01000000000000 >> #define FIEMAP_LEN_UNWRITTEN 0x02000000000000 >> >> All offsets are in bytes to allow cases where filesystems are not >> going >> block-aligned/sized allocations (e.g. tail packing). The >> fm_extents array >> returned contains the packed list of allocation extents for the file, >> including entries for holes (which have fe_start == 0, and a flag). > >> One feature that XFS_IOC_GETBMAPX has that may be desirable is the >> ability to return unwritten extent information. In order to do >> this XFS >> required expanding the per-extent struct from 32 to 48 bytes per >> extent, >> but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what >> hardship) >> and keep 8 bytes or so for input/output flags per extent (would >> need to >> be masked before use). > > I'd be much happier to have the separate per-extent flags value. > For one thing this allows much nicer representations of unwritten > extents or holes without taking away bits from the len value. It also > allows to make interesting use of this in the future, e.g. telling > about an offline exttent for use in HSM applications. Also for > this kernel<->user interface the wasted space shouldn't matter too > much - if you want to pass the above condensed structure over the > wire in lustre that shouldn't a problem, you'd have to convert > to an endian-neutral on the wire format anyway. Not doing the > masking also make the interface quite a bit simpler to use. > > One addition freature from the XFS getbmapx interface we should > provide is the ability to query layout of xattrs. While other > filesystems might not have the exact xattr fork XFS has it fits > nicely into the interface. Especially when we have Anton's suggested > flag for inline data. 
Would it not be better to allow people to get a file descriptor on the xattr fork and then just run the normal FIEMAP ioctl on that file descriptor? I.e. "openat(base file descriptor, O_STREAM, streamname)" or O_XATTR or whatever... An alternative API would be to provide a "getxattrfd ()/fgetxattrfd()" call or similar that would instead of returning the value of an xattr return an fd to it. Then you do not need to modify openat() at all... Interface doesn't bother me, just some ideas... And for XFS you would define a magic streamname or xattrname (or whatever you want to call it) of say "com.sgi.filesystem.xfs.xattrstream" (or .xattrfork) or something and then XFS would intercept that and know what to do with it... Such an interface could then be used by NTFS named streams and other file systems providing such things... (Yes I know I will now totally get flamed about named streams not being wanted in Linux and crap like that but that is exactly what you are asking for except you want to special case a particular stream using a flag instead of calling it for what it really is and once you start doing that you might as well allow full named streams...) You can just see named streams as an alternative, non-atomic API to xattrs if you like, i.e. you can either use the atomic xattr API provided in Linux already or you can get a file descriptor to an xattr and then use the normal system calls to access it non- atomically thus you can use the FIEMAP ioctl also. (-: FWIW this two-API approach to xattrs/named streams is the direction OSX is heading towards also so it is not without precedent and Windows has had both APIs for many years. And Solaris has the "openat (O_XATTR)" interface so that is not without precedent either. Best regards, Anton PS. to all flamers: I am going to delete any non-technical flames without replying so please do us all a favour and don't bother... Thanks. 
-- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Fri Apr 13 08:12:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 13 Apr 2007 08:12:12 -0700 (PDT) Received: from mx1.suse.de (mail.suse.de [195.135.220.2]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3DFC5fB002711 for ; Fri, 13 Apr 2007 08:12:06 -0700 Received: from Relay2.suse.de (mail2.suse.de [195.135.221.8]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.suse.de (Postfix) with ESMTP id BA42B122ED; Fri, 13 Apr 2007 16:54:50 +0200 (CEST) Message-ID: <461F997E.30002@suse.com> Date: Fri, 13 Apr 2007 10:53:50 -0400 From: Jeff Mahoney Organization: SUSE Labs, Novell, Inc User-Agent: Thunderbird 1.5.0.10 (X11/20060911) MIME-Version: 1.0 To: Anton Altaparmakov , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation References: <20070412110550.GM5967@schatzie.adilger.int> <97211C89-1810-4B22-B2F4-9D206D43C1F6@cam.ac.uk> <20070413040156.GU5967@schatzie.adilger.int> In-Reply-To: <20070413040156.GU5967@schatzie.adilger.int> X-Enigmail-Version: 0.94.0.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11098 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeffm@suse.com Precedence: bulk X-list: xfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Andreas Dilger wrote: > On Apr 12, 2007 12:22 +0100, Anton Altaparmakov wrote: >> This would say that the offset on disk can move at any time or that >> the data is compressed or encrypted on disk thus the data is not >> useful for direct disk access. > > This makes sense. 
Even for Reiserfs the same is true with packed tails, > and I believe if FIBMAP is called on a tail it will migrate the tail into > a block because this is might be a sign that the file is a kernel that > LILO wants to boot. Actually, reiserfs_aop_bmap() returns 0 when the requested block is in a tail. There's a separate ioctl for unpacking them. - -Jeff - -- Jeff Mahoney SUSE Labs -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (GNU/Linux) Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org iD8DBQFGH5l+LPWxlyuTD7IRAn5/AJ9VcocIcDGr9wtAlgGZuOAQWqVASwCfVdWM uLZQq1mkf8hsGXOpZtKQH5w= =AxnN -----END PGP SIGNATURE----- From owner-xfs@oss.sgi.com Fri Apr 13 12:06:59 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 13 Apr 2007 12:07:02 -0700 (PDT) Received: from rwcrmhc15.comcast.net (rwcrmhc15.comcast.net [216.148.227.155]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3DJ6wfB025581 for ; Fri, 13 Apr 2007 12:06:59 -0700 Received: from [192.168.1.10] (c-67-171-1-120.hsd1.wa.comcast.net[67.171.1.120]) by comcast.net (rwcrmhc15) with SMTP id <20070413185549m15007ivg8e>; Fri, 13 Apr 2007 18:55:50 +0000 Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation From: Nicholas Miell To: Anton Altaparmakov Cc: Christoph Hellwig , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com In-Reply-To: <37B026AB-60FA-4595-B2B1-F57BB023D91C@cam.ac.uk> References: <20070412110550.GM5967@schatzie.adilger.int> <20070413101507.GA11406@infradead.org> <37B026AB-60FA-4595-B2B1-F57BB023D91C@cam.ac.uk> Content-Type: text/plain Date: Fri, 13 Apr 2007 11:55:49 -0700 Message-Id: <1176490549.3122.16.camel@entropy> Mime-Version: 1.0 X-Mailer: Evolution 2.8.3 (2.8.3-2.0.njm.1) Content-Transfer-Encoding: 7bit X-archive-position: 11099 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: nmiell@comcast.net Precedence: bulk X-list: xfs On Fri, 2007-04-13 at 12:38 +0100, Anton 
Altaparmakov wrote: > > One addition freature from the XFS getbmapx interface we should > > provide is the ability to query layout of xattrs. While other > > filesystems might not have the exact xattr fork XFS has it fits > > nicely into the interface. Especially when we have Anton's suggested > > flag for inline data. > > Would it not be better to allow people to get a file descriptor on > the xattr fork and then just run the normal FIEMAP ioctl on that file > descriptor? > > I.e. "openat(base file descriptor, O_STREAM, streamname)" or O_XATTR > or whatever... An alternative API would be to provide a "getxattrfd > ()/fgetxattrfd()" call or similar that would instead of returning the > value of an xattr return an fd to it. Then you do not need to modify > openat() at all... Interface doesn't bother me, just some ideas... > > And for XFS you would define a magic streamname or xattrname (or > whatever you want to call it) of say > "com.sgi.filesystem.xfs.xattrstream" (or .xattrfork) or something and > then XFS would intercept that and know what to do with it... > > Such an interface could then be used by NTFS named streams and other > file systems providing such things... > > (Yes I know I will now totally get flamed about named streams not > being wanted in Linux and crap like that but that is exactly what you > are asking for except you want to special case a particular stream > using a flag instead of calling it for what it really is and once you > start doing that you might as well allow full named streams...) > > You can just see named streams as an alternative, non-atomic API to > xattrs if you like, i.e. you can either use the atomic xattr API > provided in Linux already or you can get a file descriptor to an > xattr and then use the normal system calls to access it non- > atomically thus you can use the FIEMAP ioctl also. 
(-: > > FWIW this two-API approach to xattrs/named streams is the direction > OSX is heading towards also so it is not without precedent and > Windows has had both APIs for many years. And Solaris has the "openat > (O_XATTR)" interface so that is not without precedent either. Except that xattrs in Linux aren't streams, and providing a stream-like interface to them would be a weird abuse of the xattr concept. In essence, Linux xattrs are named extensions to struct stat, with getxattr() being in the same category as stat() and setxattr() being in the same category as chmod()/chown()/utime()/etc. The system namespace exists to provide a better interface than ioctl() to weird FS-specific features (DOS attribute bits, HFS+ creator/type, ext2/3/reiserfs/etc. immutable/append-only/secure-delete/etc. attributes and so on). The uptake of this feature isn't as high as I'd like, but that's what it's there for. The security namespace is there for all the neat LSM modules that need to attach metadata to files in order to function. Finally, the user namespace exists to allow users to attach small bits of information to their own files, since the API was already there and hey!, metadata is useful. Now, Solaris came along and totally confused the issue by using the same name for a completely different feature, but that isn't any real reason to mess up the existing Linux xattr concept just to graft named streams support into the kernel. (Not that I'm opposed to named streams in Linux, you just have to realize that xattrs aren't named streams, can't live in the same namespace as named streams, and certainly don't serve the same purpose as named streams.) 
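To make the "named extensions to struct stat" view concrete, here is a minimal sketch of reading one attribute in a single call with the existing Linux getxattr() interface. The helper name read_xattr_string and the attribute name used in the note below are assumptions for illustration only.

```c
#include <sys/types.h>
#include <sys/xattr.h>

/* Fetch one attribute value as a NUL-terminated string.
 * Returns the value length, or -1 on any error (missing file, missing
 * attribute, value too large for buf, xattrs unsupported, ...).
 * The whole value arrives in one call, much like stat() fills a struct
 * stat; there is no file descriptor and no seek position involved. */
static ssize_t read_xattr_string(const char *path, const char *name,
				 char *buf, size_t bufsize)
{
	ssize_t n;

	if (bufsize == 0)
		return -1;
	n = getxattr(path, name, buf, bufsize - 1);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return n;
}
```

A caller would pass something like "user.comment" as the name; whether that succeeds depends on the filesystem and mount options.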
-- Nicholas Miell From owner-xfs@oss.sgi.com Sun Apr 15 21:00:17 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 15 Apr 2007 21:00:19 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3G40DfB007691 for ; Sun, 15 Apr 2007 21:00:15 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id OAA05077; Mon, 16 Apr 2007 14:00:07 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 1116) id 0A0415903EDB; Mon, 16 Apr 2007 14:00:06 +1000 (EST) To: sgi.bugs.xfs@engr.sgi.com, xfs@oss.sgi.com Subject: TAKE 963465 - export xfs_buftarg_list for xfsidbg (using func) Message-Id: <20070416040007.0A0415903EDB@chook.melbourne.sgi.com> Date: Mon, 16 Apr 2007 14:00:06 +1000 (EST) From: tes@sgi.com (Tim Shimmin) X-archive-position: 11100 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Export via a function xfs_buftarg_list for use by kdb/xfsidbg. Date: Mon Apr 16 13:58:41 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/tes/2.6.x-xfs Inspected by: lachlan@sgi.com The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28414a fs/xfs/xfsidbg.c - 1.313 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfsidbg.c.diff?r1=text&tr1=1.313&r2=text&tr2=1.312&f=h - Export via a function xfs_buftarg_list for use by kdb/xfsidbg. fs/xfs/linux-2.6/xfs_buf.h - 1.119 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_buf.h.diff?r1=text&tr1=1.119&r2=text&tr2=1.118&f=h - Export via a function xfs_buftarg_list for use by kdb/xfsidbg. 
fs/xfs/linux-2.6/xfs_buf.c - 1.235 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_buf.c.diff?r1=text&tr1=1.235&r2=text&tr2=1.234&f=h - Export via a function xfs_buftarg_list for use by kdb/xfsidbg. fs/xfs/linux-2.4/xfs_buf.h - 1.118 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.4/xfs_buf.h.diff?r1=text&tr1=1.118&r2=text&tr2=1.117&f=h - Export via a function xfs_buftarg_list for use by kdb/xfsidbg. fs/xfs/linux-2.4/xfs_buf.c - 1.219 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.4/xfs_buf.c.diff?r1=text&tr1=1.219&r2=text&tr2=1.218&f=h - Export via a function xfs_buftarg_list for use by kdb/xfsidbg. fs/xfs/linux-2.6/xfs_ksyms.c - 1.57 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_ksyms.c.diff?r1=text&tr1=1.57&r2=text&tr2=1.56&f=h - Export via a function xfs_buftarg_list for use by kdb/xfsidbg. fs/xfs/linux-2.4/xfs_ksyms.c - 1.49 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.4/xfs_ksyms.c.diff?r1=text&tr1=1.49&r2=text&tr2=1.48&f=h - Export via a function xfs_buftarg_list for use by kdb/xfsidbg. 
From owner-xfs@oss.sgi.com Sun Apr 15 22:21:16 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 15 Apr 2007 22:21:19 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3G5LDfB027136 for ; Sun, 15 Apr 2007 22:21:15 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA06555; Mon, 16 Apr 2007 15:21:09 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 1116) id 0483F5903EDB; Mon, 16 Apr 2007 15:21:08 +1000 (EST) To: sgi.bugs.xfs@engr.sgi.com, xfs@oss.sgi.com Subject: TAKE 963466 - remove the unnecessary word in the log message Message-Id: <20070416052109.0483F5903EDB@chook.melbourne.sgi.com> Date: Mon, 16 Apr 2007 15:21:08 +1000 (EST) From: tes@sgi.com (Tim Shimmin) X-archive-position: 11101 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Thanks to Utako Kusaka. Signed-off-by: Utako Kusaka Get rid of redundant "required" in msg. Date: Mon Apr 16 15:19:51 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/tes/2.6.x-xfs Inspected by: utako@tnes.nec.co.jp,hch@infradead.org The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28416a fs/xfs/xfs_log_recover.c - 1.318 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_log_recover.c.diff?r1=text&tr1=1.318&r2=text&tr2=1.317&f=h - Signed-off-by: Utako Kusaka Get rid of redundant "required" in msg. 
From owner-xfs@oss.sgi.com Mon Apr 16 00:59:48 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 16 Apr 2007 00:59:51 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3G7xhfB010633 for ; Mon, 16 Apr 2007 00:59:46 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA10446; Mon, 16 Apr 2007 17:59:20 +1000 Date: Mon, 16 Apr 2007 18:01:17 +1000 From: Timothy Shimmin To: Andreas Dilger , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com cc: hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <31588A06562720FE1E0F93DF@timothy-shimmins-power-mac-g5.local> In-Reply-To: <20070412110550.GM5967@schatzie.adilger.int> References: <20070412110550.GM5967@schatzie.adilger.int> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11102 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Hi Andreas, --On 12 April 2007 5:05:50 AM -0600 Andreas Dilger wrote: > I'm interested in getting input for implementing an ioctl to efficiently > map file extents & holes (FIEMAP) instead of looping over FIBMAP a billion > times. ... > > I had come up with a plan independently and was also steered toward > XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original > plan, though I think the XFS structs used there are a bit bloated. They certainly seem to be (combining entries and header). 
> struct fibmap_extent { > __u64 fe_start; /* starting offset in bytes */ > __u64 fe_len; /* length in bytes */ > } > > struct fibmap { > struct fibmap_extent fm_start; /* offset, length of desired mapping */ > __u32 fm_extent_count; /* number of extents in array */ > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ > __u64 unused; > struct fibmap_extent fm_extents[0]; > } > ># define FIEMAP_LEN_MASK 0xff000000000000 ># define FIEMAP_LEN_HOLE 0x01000000000000 ># define FIEMAP_LEN_UNWRITTEN 0x02000000000000 > > All offsets are in bytes to allow cases where filesystems are not going > block-aligned/sized allocations (e.g. tail packing). The fm_extents array > returned contains the packed list of allocation extents for the file, > including entries for holes (which have fe_start == 0, and a flag). > > The ->fm_extents[] array includes all of the holes in addition to > allocated extents because this avoids the need to return both the logical > and physical address for every extent and does not make processing any > harder. Well, that's what stood out for me. I was wondering where the "fe_block" field had gone - the "physical address". So is your "fe_start; /* starting offset */" actually the disk location (not a logical file offset) _except_ in the header (fibmap) where it is the desired logical offset. Okay, looking at your example use below that's what it looks like. And when you refer to fm_start below, you mean fm_start.fe_start? Sorry, I realise this is just an approximation but this part confused me. So you get rid of all the logical file offsets in the extents because we report holes explicitly (and we know everything is contiguous if you include the holes). 
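The reading above (extents carry only a physical start, and logical offsets are recovered by summing lengths, holes included) can be sketched as a lookup routine. This assumes the flags occupy the top 8 bits of fe_len, following the "2^56 bytes" remark in the proposal; the SK_* constants and the function name are made up for this example and are not the FIEMAP_LEN_* values quoted above.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed packing: low 56 bits are the length, top 8 bits are flags. */
#define SK_FLAG_SHIFT	56
#define SK_LEN_MASK	((UINT64_C(1) << SK_FLAG_SHIFT) - 1)
#define SK_LEN_HOLE	(UINT64_C(0x01) << SK_FLAG_SHIFT)

struct sk_extent {
	uint64_t fe_start;	/* physical offset in bytes (0 for a hole) */
	uint64_t fe_len;	/* length in bytes, plus flag bits */
};

/* Translate a logical file offset into a physical offset.
 * map_start is the logical offset the mapping request started at.
 * Returns 0 and fills *physical on success; returns -1 if the offset
 * falls in a hole or outside the returned extents. */
static int sk_logical_to_physical(const struct sk_extent *ext, size_t n,
				  uint64_t map_start, uint64_t logical,
				  uint64_t *physical)
{
	uint64_t pos = map_start;	/* logical start of ext[i] */
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t len = ext[i].fe_len & SK_LEN_MASK;

		if (logical >= pos && logical - pos < len) {
			if (ext[i].fe_len & SK_LEN_HOLE)
				return -1;
			*physical = ext[i].fe_start + (logical - pos);
			return 0;
		}
		pos += len;	/* holes advance the logical position too */
	}
	return -1;
}
```

Because the running position is the only record of where each extent sits in the file, a missing or reordered entry silently shifts every later logical offset.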
--Tim > > Caller works something like: > > char buf[4096]; > struct fibmap *fm = (struct fibmap *)buf; > int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent); > > fm->fm_extent.fe_start = 0; /* start of file */ > fm->fm_extent.fe_len = -1; /* end of file */ > fm->fm_extent_count = count; /* max extents in fm_extents[] array */ > fm->fm_flags = 0; /* maybe "no DMAPI", etc like XFS */ > > fd = open(path, O_RDONLY); > printf("logical\t\tphysical\t\tbytes\n"); > > /* The last entry will have less extents than the maximum */ > while (fm->fm_extent_count == count) { > rc = ioctl(fd, FIEMAP, fm); > if (rc) > break; > > /* kernel filled in fm_extents[] array, set fm_extent_count > * to be actual number of extents returned, leaves fm_start > * alone (unlike XFS_IOC_GETBMAP). */ > > for (i = 0; i < fm->fm_extent_count; i++) { > __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK; > __u64 fm_next = fm->fm_start + len; > int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE; > int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN; > > printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n", > fm->fm_start, fm_next - 1, > hole ? 0 : fm->fm_extents[i].fe_start, > hole ? 0 : fm->fm_extents[i].fe_start + > fm->fm_extents[i].fe_len - 1, > len, hole ? "(hole) " : "", > unwr ? 
"(unwritten) " : ""); > > /* get ready for printing next extent, or next ioctl */ > fm->fm_start = fm_next; > } > } > From owner-xfs@oss.sgi.com Mon Apr 16 04:23:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 16 Apr 2007 04:23:14 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3GBN6fB021516 for ; Mon, 16 Apr 2007 04:23:08 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id VAA14873; Mon, 16 Apr 2007 21:22:56 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3GBMsAf66125042; Mon, 16 Apr 2007 21:22:55 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3GBMrmv66162864; Mon, 16 Apr 2007 21:22:53 +1000 (AEST) Date: Mon, 16 Apr 2007 21:22:53 +1000 From: David Chinner To: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070416112252.GJ48531920@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070412110550.GM5967@schatzie.adilger.int> User-Agent: Mutt/1.4.2.1i X-archive-position: 11103 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote: > I'm interested in getting input for implementing an ioctl to efficiently > map file extents & holes (FIEMAP) instead of looping over FIBMAP a billion > times. 
We already have customers with single files in the 10TB range and > we additionally need to get the mapping over the network so it needs to > be efficient in terms of how data is passed, and how easily it can be > extracted from the filesystem. > > I had come up with a plan independently and was also steered toward > XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original > plan, though I think the XFS structs used there are a bit bloated. Yeah, they were designed with having a long term stable ABI that limited expandability. Hence the "future" fields that never got used ;) > There was also recent discussion about SEEK_HOLE and SEEK_DATA as > implemented by Sun, but even if we could skip the holes we still might > need to do millions of FIBMAPs to see how large files are allocated > on disk. Conversely, having filesystems implement an efficient FIBMAP > ioctl (or ->fiemap() method) could in turn be leveraged for SEEK_HOLE > and SEEK_DATA instead of doing looping over ->bmap() inside the kernel > as I saw one patch. Yup. > struct fibmap_extent { > __u64 fe_start; /* starting offset in bytes */ > __u64 fe_len; /* length in bytes */ > } > > struct fibmap { > struct fibmap_extent fm_start; /* offset, length of desired mapping */ > __u32 fm_extent_count; /* number of extents in array */ > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ > __u64 unused; > struct fibmap_extent fm_extents[0]; > } > > #define FIEMAP_LEN_MASK 0xff000000000000 > #define FIEMAP_LEN_HOLE 0x01000000000000 > #define FIEMAP_LEN_UNWRITTEN 0x02000000000000 I'm not sure I like stealing bits from the length to use a flags - I'd prefer an explicit field per fibmap_extent for this. Given that xfs_bmap uses extra information from the filesystem (geometry) to display extra (and frequently used) information about the alignment of extents. 
ie: chook 681% xfs_bmap -vv fred fred: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..151]: 288444888..288445039 8 (1696536..1696687) 152 00010 FLAG Values: 010000 Unwritten preallocated extent 001000 Doesn't begin on stripe unit 000100 Doesn't end on stripe unit 000010 Doesn't begin on stripe width 000001 Doesn't end on stripe width This information could be easily passed up in the flags fields if the filesystem has geometry information (there go 4 more flags ;). Also - what are the explicit sync semantics of this ioctl? The XFS ioctl causes a fsync of the file first to convert delalloc extents to real extents before returning the bmap. Is this functionality going to be the same? If not, then we need a DELALLOC flag to indicate extents that haven't been allocated yet. This might be handy to have, anyway.... > All offsets are in bytes to allow cases where filesystems are not going > block-aligned/sized allocations (e.g. tail packing). So it'll be ok for a few years yet ;) > The fm_extents array > returned contains the packed list of allocation extents for the file, > including entries for holes (which have fe_start == 0, and a flag). Internally in XFS, we pass these around as: #define DELAYSTARTBLOCK ((xfs_fsblock_t)-1LL) #define HOLESTARTBLOCK ((xfs_fsblock_t)-2LL) And the offset passed out through XFS_IOC_GETBMAP[X] is a block number of -1 for the start of a hole. Hence we don't need a flag for this. We can expose delalloc extents like this as well without needing flags... > The ->fm_extents[] array includes all of the holes in addition to > allocated extents because this avoids the need to return both the logical > and physical address for every extent and does not make processing any > harder. Doesn't really make it any easier to map to disk, either. > One feature that XFS_IOC_GETBMAPX has that may be desirable is the > ability to return unwritten extent information. You got that with the unwritten flag above..... 
> required expanding the per-extent struct from 32 to 48 bytes per extent, not sure I follow your maths here? > but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship) > and keep 8 bytes or so for input/output flags per extent (would need to ^^^^^ bits? > be masked before use). > > > Caller works something like: > > char buf[4096]; > struct fibmap *fm = (struct fibmap *)buf; > int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent); > > fm->fm_extent.fe_start = 0; /* start of file */ > fm->fm_extent.fe_len = -1; /* end of file */ > fm->fm_extent_count = count; /* max extents in fm_extents[] array */ > fm->fm_flags = 0; /* maybe "no DMAPI", etc like XFS */ > > fd = open(path, O_RDONLY); > printf("logical\t\tphysical\t\tbytes\n"); > > /* The last entry will have less extents than the maximum */ > while (fm->fm_extent_count == count) { fm_extent_count is an in/out parameter? > rc = ioctl(fd, FIEMAP, fm); > if (rc) > break; > > /* kernel filled in fm_extents[] array, set fm_extent_count > * to be actual number of extents returned, leaves fm_start > * alone (unlike XFS_IOC_GETBMAP). */ Ok, it is. > for (i = 0; i < fm->fm_extent_count; i++) { > __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK; > __u64 fm_next = fm->fm_start + len; > int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE; > int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN; > > printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n", > fm->fm_start, fm_next - 1, > hole ? 0 : fm->fm_extents[i].fe_start, > hole ? 0 : fm->fm_extents[i].fe_start + > fm->fm_extents[i].fe_len - 1, > len, hole ? "(hole) " : "", > unwr ? "(unwritten) " : ""); > > /* get ready for printing next extent, or next ioctl */ > fm->fm_start = fm_next; Ok, so the only way you can determine where you are in the file is by adding up the length of each extent. What happens if the file is changing underneath you e.g. someone punches out a hole in the file, or truncates and extends it again between ioctl() calls? 
Also, what happens if you ask for an offset/len that doesn't map to any extent boundaries - are you truncating the extents returned to the off/len passed in? xfs_bmap gets around this by finding out how many extents there are in the file and allocating a buffer that big to hold all the extents so they are gathered in a single atomic call (think sparse matrix files).... > I'm not wedded to an ioctl interface, but it seems consistent with FIBMAP. > I'm quite open to suggestions at this point, both in terms of how usable > the fibmap data structures are by the caller, and if we need to add anything > to make them more flexible for the future. ioctl is fine by me. Perhaps a version number in the structure header would be handy so we can modify the interface easily in the future without having to worry about breaking userspace.... > In terms of implementing this in the kernel, there was originally code for > this during the development of the ext3 extent patches and it was done via > a callback in the extent tree iterator so it is very efficient. I believe > it implements all that is needed to allow this interface to be mapped > onto XFS_IOC_BMAP internally (or vice versa). I wouldn't map the ioctls - I'd just write another interface to xfs_getbmap(). That way we could eventually get rid of the XFS_IOC_BMAP interface. Is there any code yet? Cheers, Dave. 
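Two of the suggestions above, a version number in the header and a buffer sized up front so that every extent is gathered in one call, can be sketched as follows. Everything here is hypothetical: the struct layout, the sk_ names and the version value are assumptions for illustration, and no actual ioctl is issued.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SK_MAP_VERSION 1u	/* hypothetical interface version */

struct sk_map_extent {
	uint64_t start;		/* physical offset in bytes */
	uint64_t len;		/* length in bytes */
	uint32_t flags;		/* hole, unwritten, delalloc, ... */
	uint32_t pad;
};

struct sk_map_hdr {
	uint32_t fm_version;	/* lets the layout change without breaking userspace */
	uint32_t fm_flags;	/* request flags */
	uint64_t fm_start;	/* logical start of the requested range */
	uint64_t fm_len;	/* length of the requested range */
	uint32_t fm_extent_count; /* in: array capacity, out: entries filled */
	uint32_t fm_pad;
	struct sk_map_extent fm_extents[];
};

/* Allocate a zeroed request covering the whole file with room for
 * nextents entries. The caller would first learn the extent count by
 * some other query, then fetch them all in one call. Free with free(). */
static struct sk_map_hdr *sk_map_alloc(uint32_t nextents)
{
	size_t size = sizeof(struct sk_map_hdr) +
		      (size_t)nextents * sizeof(struct sk_map_extent);
	struct sk_map_hdr *fm = calloc(1, size);

	if (!fm)
		return NULL;
	fm->fm_version = SK_MAP_VERSION;
	fm->fm_len = UINT64_MAX;	/* to end of file */
	fm->fm_extent_count = nextents;
	return fm;
}
```

Sizing the buffer from a prior count still leaves a window in which the file can change, so a real interface would need to report when the array was too small.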
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Tue Apr 17 05:55:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 17 Apr 2007 05:55:50 -0700 (PDT) Received: from e5.ny.us.ibm.com (e5.ny.us.ibm.com [32.97.182.145]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3HCthfB002992 for ; Tue, 17 Apr 2007 05:55:45 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e5.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l3HCtgDX010326 for ; Tue, 17 Apr 2007 08:55:42 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l3HCtg8k444038 for ; Tue, 17 Apr 2007 08:55:42 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l3HCtPR3011241 for ; Tue, 17 Apr 2007 08:55:26 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l3HCtOU8010339; Tue, 17 Apr 2007 08:55:25 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id DDD1129ED6E; Tue, 17 Apr 2007 18:25:15 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l3HCtEZA015242; Tue, 17 Apr 2007 18:25:14 +0530 Date: Tue, 17 Apr 2007 18:25:14 +0530 From: "Amit K. 
Arora" To: Andrew Morton , Jakub Jelinek , torvalds@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com, suparna@in.ibm.com Subject: Re: Interface for the new fallocate() system call Message-ID: <20070417125514.GA7574@amitarora.in.ibm.com> References: <20070117094658.GA17390@amitarora.in.ibm.com> <20070225022326.137b4875.akpm@linux-foundation.org> <20070301183445.GA7911@amitarora.in.ibm.com> <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070330071417.GI355@devserv.devel.redhat.com> User-Agent: Mutt/1.4.1i X-archive-position: 11104 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs

On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote:
> Wouldn't
>     int fallocate(loff_t offset, loff_t len, int fd, int mode)
> work on both s390 and ppc/arm? glibc will certainly wrap it and
> reorder the arguments as needed, so there is no need to keep fd first.
> I think more people are comfortable with this approach.

Since glibc will wrap the system call and export the "conventional" interface
(with fd first) to applications, we may not worry about keeping fd first
in kernel code. I am personally fine with this approach.

Still, if people have major concerns, we can think of getting rid of the
"mode" argument itself. Anyhow we may, in future, need to have a policy
based system call (say, for providing the goal block by applications for
performance reasons).
"mode" can then be made part of it. Comments ? -- Regards, Amit Arora From owner-xfs@oss.sgi.com Tue Apr 17 13:57:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 17 Apr 2007 13:57:43 -0700 (PDT) Received: from mx2.redhat.com (mx2.redhat.com [66.187.237.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3HKvdfB029040 for ; Tue, 17 Apr 2007 13:57:40 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx2.redhat.com (8.13.1/8.13.1) with ESMTP id l3HKUfeT009781 for ; Tue, 17 Apr 2007 16:30:42 -0400 Received: from pobox-2.corp.redhat.com (pobox-2.corp.redhat.com [10.11.255.15]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l3HKUesm004074 for ; Tue, 17 Apr 2007 16:30:40 -0400 Received: from [10.15.80.10] (neon.msp.redhat.com [10.15.80.10]) by pobox-2.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l3HKUcAx017901 for ; Tue, 17 Apr 2007 16:30:40 -0400 Message-ID: <46252D94.2050106@sandeen.net> Date: Tue, 17 Apr 2007 15:27:00 -0500 From: Eric Sandeen User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: xfs mailing list Subject: when is a dmapi tarball not a dmapi tarball? Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11105 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs when you find it on oss, it seems :) [esandeen@neon tmp]$ wget ftp://oss.sgi.com/projects/xfs/cmd_tars/dmapi_2.2.8-1.tar.gz --15:11:31-- ftp://oss.sgi.com/projects/xfs/cmd_tars/dmapi_2.2.8-1.tar.gz => `dmapi_2.2.8-1.tar.gz' Resolving oss.sgi.com... 192.48.170.157 Connecting to oss.sgi.com|192.48.170.157|:21... connected. Logging in as anonymous ... Logged in! ==> SYST ... done. ==> PWD ... done. ==> TYPE I ... done. ==> CWD /projects/xfs/cmd_tars ... done. ==> SIZE dmapi_2.2.8-1.tar.gz ... 84649 ==> PASV ... done. ==> RETR dmapi_2.2.8-1.tar.gz ... done. 
Length: 84649 (83K) 100%[=======================================>] 84,649 133K/s in 0.6s 15:11:33 (133 KB/s) - `dmapi_2.2.8-1.tar.gz' saved [84649] [esandeen@neon tmp]$ tar xvzf dmapi_2.2.8-1.tar.gz gzip: stdin: not in gzip format tar: Child returned status 1 tar: Error exit delayed from previous errors [esandeen@neon tmp]$ file dmapi_2.2.8-1.tar.gz dmapi_2.2.8-1.tar.gz: RPM v3 src IA64 dmapi-2.2.8-1 [esandeen@neon tmp]$ rpm -qpl dmapi_2.2.8-1.tar.gz dmapi-2.2.8.src.tar.gz dmapi.spec Might want to fix that... -Eric From owner-xfs@oss.sgi.com Tue Apr 17 18:03:11 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 17 Apr 2007 18:03:14 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3I138fB018969 for ; Tue, 17 Apr 2007 18:03:10 -0700 Received: from pcbnaujok (pc-bnaujok.melbourne.sgi.com [134.14.55.58]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA09246; Wed, 18 Apr 2007 11:03:00 +1000 Message-Id: <200704180103.LAA09246@larry.melbourne.sgi.com> From: "Barry Naujok" To: "'Eric Sandeen'" , "'xfs mailing list'" Subject: RE: when is a dmapi tarball not a dmapi tarball? Date: Wed, 18 Apr 2007 11:08:45 +1000 MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook, Build 11.0.6353 In-Reply-To: <46252D94.2050106@sandeen.net> Thread-Index: AceBM0jqao+UQMxMRwqcjMZZKELIMQAIs3SA X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028 X-archive-position: 11106 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@melbourne.sgi.com Precedence: bulk X-list: xfs Fixed! 
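
A mislabelled download like the one above can be spotted without unpacking
anything, by looking at the first bytes of the file, which is roughly what
file(1) does. A minimal sketch, assuming only the gzip magic (0x1f 0x8b) and
the RPM lead magic (0xed 0xab 0xee 0xdb); the enum and function names are made
up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names, used only in this sketch. */
enum pkg_kind { PKG_UNKNOWN, PKG_GZIP, PKG_RPM };

/* Classify a buffer holding the first bytes of a file by its magic number. */
static enum pkg_kind sniff_pkg_kind(const unsigned char *buf, size_t len)
{
	static const unsigned char gzip_magic[] = { 0x1f, 0x8b };
	static const unsigned char rpm_magic[] = { 0xed, 0xab, 0xee, 0xdb };

	assert(buf != NULL || len == 0);

	if (len >= sizeof(rpm_magic) &&
	    memcmp(buf, rpm_magic, sizeof(rpm_magic)) == 0)
		return PKG_RPM;
	if (len >= sizeof(gzip_magic) &&
	    memcmp(buf, gzip_magic, sizeof(gzip_magic)) == 0)
		return PKG_GZIP;
	return PKG_UNKNOWN;
}
```

A release script could read the first four bytes of each file named *.tar.gz
and refuse to publish anything that does not come back as PKG_GZIP.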
From owner-xfs@oss.sgi.com Wed Apr 18 06:06:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 06:06:12 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3ID63fB007264 for ; Wed, 18 Apr 2007 06:06:05 -0700 Received: from localhost.adilger.int (S01060004e23cfc51.cg.shawcable.net [68.147.252.160]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 68EF64E4594; Wed, 18 Apr 2007 07:06:02 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 911374141; Wed, 18 Apr 2007 07:06:00 -0600 (MDT) Date: Wed, 18 Apr 2007 07:06:00 -0600 From: Andreas Dilger To: "Amit K. Arora" Cc: Andrew Morton , Jakub Jelinek , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com, suparna@in.ibm.com Subject: Re: Interface for the new fallocate() system call Message-ID: <20070418130600.GW5967@schatzie.adilger.int> Mail-Followup-To: "Amit K.
Arora" , Andrew Morton , Jakub Jelinek , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com, suparna@in.ibm.com References: <20070225022326.137b4875.akpm@linux-foundation.org> <20070301183445.GA7911@amitarora.in.ibm.com> <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070417125514.GA7574@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11107 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs

On Apr 17, 2007 18:25 +0530, Amit K. Arora wrote:
> On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote:
> > Wouldn't
> >     int fallocate(loff_t offset, loff_t len, int fd, int mode)
> > work on both s390 and ppc/arm? glibc will certainly wrap it and
> > reorder the arguments as needed, so there is no need to keep fd first.
> > I think more people are comfortable with this approach.

Really? I thought from the last postings that "fd first, wrap on s390"
was better.

> Since glibc
> will wrap the system call and export the "conventional" interface
> (with fd first) to applications, we may not worry about keeping fd first
> in kernel code. I am personally fine with this approach.

It would seem to make more sense to wrap the syscall on those architectures
that can't handle the "conventional" interface (fd first).
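
The "wrapping" being argued over in this thread amounts to a thin library
function that shows applications one argument order and hands the raw entry
point another. A minimal sketch of that idea follows. It makes no real system
call: raw_fallocate_sketch() is a hypothetical stand-in that only records its
arguments, and both function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Records what the stand-in "raw" entry point was called with. */
struct raw_call {
	int64_t offset;
	int64_t len;
	int fd;
	int mode;
};

static struct raw_call last_raw_call;

/*
 * Hypothetical stand-in for a raw entry point that takes the two 64-bit
 * arguments first. It only stores its arguments for inspection.
 */
static int raw_fallocate_sketch(int64_t offset, int64_t len, int fd, int mode)
{
	last_raw_call.offset = offset;
	last_raw_call.len = len;
	last_raw_call.fd = fd;
	last_raw_call.mode = mode;
	return 0;
}

/* Library-side wrapper: callers pass fd first, the raw entry gets offset first. */
static int fallocate_fd_first_sketch(int fd, int mode, int64_t offset, int64_t len)
{
	return raw_fallocate_sketch(offset, len, fd, mode);
}

/* Usage: the caller never sees the raw argument order. */
static void wrapper_demo(void)
{
	int rc = fallocate_fd_first_sketch(3, 0, 4096, 8192);

	assert(rc == 0);
	assert(last_raw_call.fd == 3 && last_raw_call.offset == 4096);
}
```

Whether the reordering lives in every architecture's wrapper or only in the
ones that need it is exactly the point of disagreement in the mails above; the
sketch looks the same either way.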

> Still, if people have major concerns, we can think of getting rid of the
> "mode" argument itself. Anyhow we may, in future, need to have a policy
> based system call (say, for providing the goal block by applications for
> performance reasons). "mode" can then be made part of it.

We need at least mode="unallocate" or a separate funallocate() call to
allow allocated-but-unwritten blocks to be unallocated without actually
punching out written data.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From owner-xfs@oss.sgi.com Wed Apr 18 10:57:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 10:57:41 -0700 (PDT) Received: from mail.lst.de (verein.lst.de [213.95.11.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3IHvXfB017246 for ; Wed, 18 Apr 2007 10:57:36 -0700 Received: from verein.lst.de (localhost [127.0.0.1]) by mail.lst.de (8.12.3/8.12.3/Debian-7.1) with ESMTP id l3IHvVLD018372 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO) for ; Wed, 18 Apr 2007 19:57:31 +0200 Received: (from hch@localhost) by verein.lst.de (8.12.3/8.12.3/Debian-6.6) id l3IHvUal018370 for xfs@oss.sgi.com; Wed, 18 Apr 2007 19:57:30 +0200 Date: Wed, 18 Apr 2007 19:57:30 +0200 From: Christoph Hellwig To: xfs@oss.sgi.com Subject: [PATCH] remove various useless min/max macros Message-ID: <20070418175730.GA18315@lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.28i X-Scanned-By: MIMEDefang 2.39 X-archive-position: 11108 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@lst.de Precedence: bulk X-list: xfs

xfs_btree.h has various macros to calculate a min/max after casting its
arguments to a specific type. This can be done much more simply by using
min_t/max_t with the type as first argument.
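
As a rough userspace illustration of the conversion described above (a sketch,
not the kernel's code): the min_t/max_t below are simplified stand-ins that,
unlike the kernel's versions, may evaluate an argument more than once, and
xfs_extlen_t is assumed here to be a 32-bit unsigned type.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: treat xfs_extlen_t as 32-bit unsigned. */
typedef uint32_t xfs_extlen_t;

/* Old style: one hand-written macro per type, casting both arguments. */
#define XFS_EXTLEN_MIN(a,b) \
	((xfs_extlen_t)(a) < (xfs_extlen_t)(b) ? \
	 (xfs_extlen_t)(a) : (xfs_extlen_t)(b))

/* Simplified stand-ins: the type is a parameter, so one macro serves all types. */
#define min_t(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))
#define max_t(type, x, y) ((type)(x) > (type)(y) ? (type)(x) : (type)(y))

/* The same clamp written both ways. */
static xfs_extlen_t clamp_len_old(uint64_t wantlen, uint64_t avail)
{
	return XFS_EXTLEN_MIN(wantlen, avail);
}

static xfs_extlen_t clamp_len_new(uint64_t wantlen, uint64_t avail)
{
	return min_t(xfs_extlen_t, wantlen, avail);
}

/* Usage: both forms cast first and compare second, so they agree. */
static void min_t_demo(void)
{
	assert(clamp_len_old(5, 9) == clamp_len_new(5, 9));
	assert(clamp_len_new(9, 5) == 5);
	assert(max_t(xfs_extlen_t, 5, 9) == 9);
}
```

Because the cast happens before the comparison in both forms, swapping one for
the other should not change which value is picked, which is what makes a
mechanical replacement like this one reasonable.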
Signed-off-by: Christoph Hellwig Index: linux-2.6/fs/xfs/xfs_alloc.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_alloc.c 2007-04-13 13:40:00.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_alloc.c 2007-04-13 13:44:07.000000000 +0200 @@ -151,11 +151,11 @@ xfs_alloc_compute_diff( if (newbno1 >= freeend) newbno1 = NULLAGBLOCK; else - newlen1 = XFS_EXTLEN_MIN(wantlen, freeend - newbno1); + newlen1 = min_t(xfs_extlen_t, wantlen, freeend - newbno1); if (newbno2 < freebno) newbno2 = NULLAGBLOCK; else - newlen2 = XFS_EXTLEN_MIN(wantlen, freeend - newbno2); + newlen2 = min_t(xfs_extlen_t, wantlen, freeend - newbno2); if (newbno1 != NULLAGBLOCK && newbno2 != NULLAGBLOCK) { if (newlen1 < newlen2 || (newlen1 == newlen2 && @@ -686,7 +686,7 @@ xfs_alloc_ag_vextent_exact( * End of extent will be smaller of the freespace end and the * maximal requested end. */ - end = XFS_AGBLOCK_MIN(fend, maxend); + end = min_t(xfs_agblock_t, fend, maxend); /* * Fix the length according to mod and prod if given. */ @@ -850,7 +850,7 @@ xfs_alloc_ag_vextent_near( args->alignment, args->minlen, &ltbnoa, &ltlena)) continue; - args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen); + args->len = min_t(xfs_extlen_t, ltlena, args->maxlen); xfs_alloc_fix_len(args); ASSERT(args->len >= args->minlen); if (args->len < blen) @@ -1007,7 +1007,7 @@ xfs_alloc_ag_vextent_near( /* * Fix up the length. */ - args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen); + args->len = min_t(xfs_extlen_t, ltlena, args->maxlen); xfs_alloc_fix_len(args); rlen = args->len; ltdiff = xfs_alloc_compute_diff(args->agbno, rlen, @@ -1045,7 +1045,7 @@ xfs_alloc_ag_vextent_near( */ if (gtlena >= args->minlen) { args->len = - XFS_EXTLEN_MIN(gtlena, + min_t(xfs_extlen_t, gtlena, args->maxlen); xfs_alloc_fix_len(args); rlen = args->len; @@ -1104,7 +1104,7 @@ xfs_alloc_ag_vextent_near( /* * Fix up the length.
*/ - args->len = XFS_EXTLEN_MIN(gtlena, args->maxlen); + args->len = min_t(xfs_extlen_t, gtlena, args->maxlen); xfs_alloc_fix_len(args); rlen = args->len; gtdiff = xfs_alloc_compute_diff(args->agbno, rlen, @@ -1141,7 +1141,7 @@ xfs_alloc_ag_vextent_near( * compare the two and pick the best. */ if (ltlena >= args->minlen) { - args->len = XFS_EXTLEN_MIN( + args->len = min_t(xfs_extlen_t, ltlena, args->maxlen); xfs_alloc_fix_len(args); rlen = args->len; @@ -1221,7 +1221,7 @@ xfs_alloc_ag_vextent_near( * Fix up the length and compute the useful address. */ ltend = ltbno + ltlen; - args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen); + args->len = min_t(xfs_extlen_t, ltlena, args->maxlen); xfs_alloc_fix_len(args); if (!xfs_alloc_fix_minleft(args)) { TRACE_ALLOC("nominleft", args); @@ -1320,7 +1320,7 @@ xfs_alloc_ag_vextent_size( */ xfs_alloc_compute_aligned(fbno, flen, args->alignment, args->minlen, &rbno, &rlen); - rlen = XFS_EXTLEN_MIN(args->maxlen, rlen); + rlen = min_t(xfs_extlen_t, args->maxlen, rlen); XFS_WANT_CORRUPTED_GOTO(rlen == 0 || (rlen <= flen && rbno + rlen <= fbno + flen), error0); if (rlen < args->maxlen) { @@ -1346,7 +1346,7 @@ xfs_alloc_ag_vextent_size( break; xfs_alloc_compute_aligned(fbno, flen, args->alignment, args->minlen, &rbno, &rlen); - rlen = XFS_EXTLEN_MIN(args->maxlen, rlen); + rlen = min_t(xfs_extlen_t, args->maxlen, rlen); XFS_WANT_CORRUPTED_GOTO(rlen == 0 || (rlen <= flen && rbno + rlen <= fbno + flen), error0); Index: linux-2.6/fs/xfs/xfs_bmap.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_bmap.c 2007-04-13 13:41:43.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_bmap.c 2007-04-13 13:45:14.000000000 +0200 @@ -994,7 +994,7 @@ xfs_bmap_add_extent_delay_real( LEFT.br_state))) goto done; } - temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp), + temp = min_t(xfs_filblks_t, xfs_bmap_worst_indlen(ip, temp), STARTBLOCKVAL(PREV.br_startblock)); xfs_bmbt_set_startblock(ep, NULLSTARTBLOCK((int)temp)); 
xfs_bmap_trace_post_update(fname, "LF|LC", ip, idx, @@ -1043,7 +1043,7 @@ xfs_bmap_add_extent_delay_real( if (error) goto done; } - temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp), + temp = min_t(xfs_filblks_t, xfs_bmap_worst_indlen(ip, temp), STARTBLOCKVAL(PREV.br_startblock) - (cur ? cur->bc_private.b.allocated : 0)); ep = xfs_iext_get_ext(ifp, idx + 1); @@ -1090,7 +1090,7 @@ xfs_bmap_add_extent_delay_real( RIGHT.br_state))) goto done; } - temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp), + temp = min_t(xfs_filblks_t, xfs_bmap_worst_indlen(ip, temp), STARTBLOCKVAL(PREV.br_startblock)); xfs_bmbt_set_startblock(ep, NULLSTARTBLOCK((int)temp)); xfs_bmap_trace_post_update(fname, "RF|RC", ip, idx, @@ -1138,7 +1138,7 @@ xfs_bmap_add_extent_delay_real( if (error) goto done; } - temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp), + temp = min_t(xfs_filblks_t, xfs_bmap_worst_indlen(ip, temp), STARTBLOCKVAL(PREV.br_startblock) - (cur ? cur->bc_private.b.allocated : 0)); ep = xfs_iext_get_ext(ifp, idx); @@ -3177,7 +3177,7 @@ xfs_bmap_del_extent( xfs_bmbt_set_blockcount(ep, temp); ifp->if_lastex = idx; if (delay) { - temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp), + temp = min_t(xfs_filblks_t, xfs_bmap_worst_indlen(ip, temp), da_old); xfs_bmbt_set_startblock(ep, NULLSTARTBLOCK((int)temp)); xfs_bmap_trace_post_update(fname, "2", ip, idx, @@ -3206,7 +3206,7 @@ xfs_bmap_del_extent( xfs_bmbt_set_blockcount(ep, temp); ifp->if_lastex = idx; if (delay) { - temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp), + temp = min_t(xfs_filblks_t, xfs_bmap_worst_indlen(ip, temp), da_old); xfs_bmbt_set_startblock(ep, NULLSTARTBLOCK((int)temp)); xfs_bmap_trace_post_update(fname, "1", ip, idx, @@ -4337,7 +4337,7 @@ xfs_bmap_first_unused( return 0; } lastaddr = off + xfs_bmbt_get_blockcount(ep); - max = XFS_FILEOFF_MAX(lastaddr, lowest); + max = max_t(xfs_fileoff_t, lastaddr, lowest); } *first_unused = max; return 0; @@ -4850,16 +4850,16 @@ xfs_bmapi( } } else if 
(wasdelay) { alen = (xfs_extlen_t) - XFS_FILBLKS_MIN(len, + min_t(xfs_filblks_t, len, (got.br_startoff + got.br_blockcount) - bno); aoff = bno; } else { alen = (xfs_extlen_t) - XFS_FILBLKS_MIN(len, MAXEXTLEN); + min_t(xfs_filblks_t, len, MAXEXTLEN); if (!eof) alen = (xfs_extlen_t) - XFS_FILBLKS_MIN(alen, + min_t(xfs_filblks_t, alen, got.br_startoff - bno); aoff = bno; } @@ -5087,7 +5087,7 @@ xfs_bmapi( mval->br_startoff = bno; mval->br_startblock = HOLESTARTBLOCK; mval->br_blockcount = - XFS_FILBLKS_MIN(len, got.br_startoff - bno); + min_t(xfs_filblks_t, len, got.br_startoff - bno); mval->br_state = XFS_EXT_NORM; bno += mval->br_blockcount; len -= mval->br_blockcount; @@ -5122,7 +5122,7 @@ xfs_bmapi( * didn't overlap what was asked for. */ mval->br_blockcount = - XFS_FILBLKS_MIN(end - bno, got.br_blockcount - + min_t(xfs_filblks_t, end - bno, got.br_blockcount - (bno - got.br_startoff)); mval->br_state = got.br_state; ASSERT(mval->br_blockcount <= len); @@ -5462,7 +5462,7 @@ xfs_bunmapi( * Is the last block of this extent before the range * we're supposed to delete? If so, we're done. */ - bno = XFS_FILEOFF_MIN(bno, + bno = min_t(xfs_fileoff_t, bno, got.br_startoff + got.br_blockcount - 1); if (bno < start) break; Index: linux-2.6/fs/xfs/xfs_btree.h =================================================================== --- linux-2.6.orig/fs/xfs/xfs_btree.h 2007-04-13 13:43:19.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_btree.h 2007-04-13 13:43:56.000000000 +0200 @@ -440,35 +440,6 @@ xfs_btree_setbuf( #endif /* __KERNEL__ */ - -/* - * Min and max functions for extlen, agblock, fileoff, and filblks types. - */ -#define XFS_EXTLEN_MIN(a,b) \ - ((xfs_extlen_t)(a) < (xfs_extlen_t)(b) ? \ - (xfs_extlen_t)(a) : (xfs_extlen_t)(b)) -#define XFS_EXTLEN_MAX(a,b) \ - ((xfs_extlen_t)(a) > (xfs_extlen_t)(b) ? \ - (xfs_extlen_t)(a) : (xfs_extlen_t)(b)) -#define XFS_AGBLOCK_MIN(a,b) \ - ((xfs_agblock_t)(a) < (xfs_agblock_t)(b) ? 
\ - (xfs_agblock_t)(a) : (xfs_agblock_t)(b)) -#define XFS_AGBLOCK_MAX(a,b) \ - ((xfs_agblock_t)(a) > (xfs_agblock_t)(b) ? \ - (xfs_agblock_t)(a) : (xfs_agblock_t)(b)) -#define XFS_FILEOFF_MIN(a,b) \ - ((xfs_fileoff_t)(a) < (xfs_fileoff_t)(b) ? \ - (xfs_fileoff_t)(a) : (xfs_fileoff_t)(b)) -#define XFS_FILEOFF_MAX(a,b) \ - ((xfs_fileoff_t)(a) > (xfs_fileoff_t)(b) ? \ - (xfs_fileoff_t)(a) : (xfs_fileoff_t)(b)) -#define XFS_FILBLKS_MIN(a,b) \ - ((xfs_filblks_t)(a) < (xfs_filblks_t)(b) ? \ - (xfs_filblks_t)(a) : (xfs_filblks_t)(b)) -#define XFS_FILBLKS_MAX(a,b) \ - ((xfs_filblks_t)(a) > (xfs_filblks_t)(b) ? \ - (xfs_filblks_t)(a) : (xfs_filblks_t)(b)) - #define XFS_FSB_SANITY_CHECK(mp,fsb) \ (XFS_FSB_TO_AGNO(mp, fsb) < mp->m_sb.sb_agcount && \ XFS_FSB_TO_AGBNO(mp, fsb) < mp->m_sb.sb_agblocks) Index: linux-2.6/fs/xfs/xfs_inode.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_inode.c 2007-04-13 13:42:08.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_inode.c 2007-04-13 13:42:19.000000000 +0200 @@ -1341,7 +1341,7 @@ xfs_file_last_byte( last_block = 0; } size_last_block = XFS_B_TO_FSB(mp, (xfs_ufsize_t)ip->i_d.di_size); - last_block = XFS_FILEOFF_MAX(last_block, size_last_block); + last_block = max_t(xfs_fileoff_t, last_block, size_last_block); last_byte = XFS_FSB_TO_B(mp, last_block); if (last_byte < 0) { Index: linux-2.6/fs/xfs/xfs_iomap.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_iomap.c 2007-04-13 13:42:08.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_iomap.c 2007-04-13 13:42:23.000000000 +0200 @@ -820,7 +820,7 @@ xfs_iomap_write_allocate( end_fsb = XFS_B_TO_FSB(mp, ip->i_d.di_size); xfs_bmap_last_offset(NULL, ip, &last_block, XFS_DATA_FORK); - last_block = XFS_FILEOFF_MAX(last_block, end_fsb); + last_block = max_t(xfs_fileoff_t, last_block, end_fsb); if ((map_start_fsb + count_fsb) > last_block) { count_fsb = last_block - map_start_fsb; if (count_fsb == 0) { From 
owner-xfs@oss.sgi.com Wed Apr 18 10:59:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 10:59:14 -0700 (PDT) Received: from mail.lst.de (verein.lst.de [213.95.11.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3IHx3fB017854 for ; Wed, 18 Apr 2007 10:59:05 -0700 Received: from verein.lst.de (localhost [127.0.0.1]) by mail.lst.de (8.12.3/8.12.3/Debian-7.1) with ESMTP id l3IHx0LD018425 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO) for ; Wed, 18 Apr 2007 19:59:00 +0200 Received: (from hch@localhost) by verein.lst.de (8.12.3/8.12.3/Debian-6.6) id l3IHx0WA018423 for xfs@oss.sgi.com; Wed, 18 Apr 2007 19:59:00 +0200 Date: Wed, 18 Apr 2007 19:59:00 +0200 From: Christoph Hellwig To: xfs@oss.sgi.com Subject: [PATCH] kill macro noise in xfs_dir2*.h Message-ID: <20070418175859.GB18315@lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.28i X-Scanned-By: MIMEDefang 2.39 X-archive-position: 11109 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@lst.de Precedence: bulk X-list: xfs Remove all the macros that just give inline functions uppercase names. 
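
The pattern this patch removes can be pictured with the small sketch below.
The names demo_entsize and DEMO_ENTSIZE are hypothetical, and the entry layout
(8-byte inode number, 1-byte name length, the name, a 2-byte tag, rounded up
to 8 bytes) is assumed purely for illustration rather than copied from
xfs_dir2_data.h.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical inline helper: size of a directory entry under the
 * layout assumed above, rounded up to a multiple of 8 bytes.
 */
static inline size_t demo_entsize(size_t namelen)
{
	return (8 + 1 + namelen + 2 + 7) & ~(size_t)7;
}

/* The "macro noise": an uppercase name that only forwards to the function. */
#define DEMO_ENTSIZE(n)	demo_entsize(n)

/* Usage: both spellings reach the same code, so callers can use the function. */
static void entsize_demo(void)
{
	assert(DEMO_ENTSIZE(1) == demo_entsize(1));
	assert(demo_entsize(1) == 16);
}
```

Since the macro adds nothing beyond a second name, calling the inline function
directly keeps type checking and drops one layer of indirection for the reader.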
Signed-off-by: Christoph Hellwig Index: linux-2.6/fs/xfs/xfs_dir2.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2.c 2007-04-13 14:02:24.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2.c 2007-04-13 14:07:18.000000000 +0200 @@ -55,9 +55,9 @@ xfs_dir_mount( XFS_MAX_BLOCKSIZE); mp->m_dirblksize = 1 << (mp->m_sb.sb_blocklog + mp->m_sb.sb_dirblklog); mp->m_dirblkfsbs = 1 << mp->m_sb.sb_dirblklog; - mp->m_dirdatablk = XFS_DIR2_DB_TO_DA(mp, XFS_DIR2_DATA_FIRSTDB(mp)); - mp->m_dirleafblk = XFS_DIR2_DB_TO_DA(mp, XFS_DIR2_LEAF_FIRSTDB(mp)); - mp->m_dirfreeblk = XFS_DIR2_DB_TO_DA(mp, XFS_DIR2_FREE_FIRSTDB(mp)); + mp->m_dirdatablk = xfs_dir2_db_to_da(mp, XFS_DIR2_DATA_FIRSTDB(mp)); + mp->m_dirleafblk = xfs_dir2_db_to_da(mp, XFS_DIR2_LEAF_FIRSTDB(mp)); + mp->m_dirfreeblk = xfs_dir2_db_to_da(mp, XFS_DIR2_FREE_FIRSTDB(mp)); mp->m_attr_node_ents = (mp->m_sb.sb_blocksize - (uint)sizeof(xfs_da_node_hdr_t)) / (uint)sizeof(xfs_da_node_entry_t); @@ -554,7 +554,7 @@ xfs_dir2_grow_inode( */ if (mapp != &map) kmem_free(mapp, sizeof(*mapp) * count); - *dbp = XFS_DIR2_DA_TO_DB(mp, (xfs_dablk_t)bno); + *dbp = xfs_dir2_da_to_db(mp, (xfs_dablk_t)bno); /* * Update file's size if this is the data space and it grew. */ @@ -706,7 +706,7 @@ xfs_dir2_shrink_inode( dp = args->dp; mp = dp->i_mount; tp = args->trans; - da = XFS_DIR2_DB_TO_DA(mp, db); + da = xfs_dir2_db_to_da(mp, db); /* * Unmap the fsblock(s). */ @@ -742,7 +742,7 @@ xfs_dir2_shrink_inode( /* * If the block isn't the last one in the directory, we're done. 
*/ - if (dp->i_d.di_size > XFS_DIR2_DB_OFF_TO_BYTE(mp, db + 1, 0)) + if (dp->i_d.di_size > xfs_dir2_db_off_to_byte(mp, db + 1, 0)) return 0; bno = da; if ((error = xfs_bmap_last_before(tp, dp, &bno, XFS_DATA_FORK))) { Index: linux-2.6/fs/xfs/xfs_dir2_block.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_block.c 2007-04-13 13:47:00.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_block.c 2007-04-13 14:08:20.000000000 +0200 @@ -115,13 +115,13 @@ xfs_dir2_block_addname( xfs_da_brelse(tp, bp); return XFS_ERROR(EFSCORRUPTED); } - len = XFS_DIR2_DATA_ENTSIZE(args->namelen); + len = xfs_dir2_data_entsize(args->namelen); /* * Set up pointers to parts of the block. */ bf = block->hdr.bestfree; - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, block); + blp = xfs_dir2_block_leaf_p(btp); /* * No stale entries? Need space for entry and new leaf. */ @@ -397,7 +397,7 @@ xfs_dir2_block_addname( * Fill in the leaf entry. */ blp[mid].hashval = cpu_to_be32(args->hashval); - blp[mid].address = cpu_to_be32(XFS_DIR2_BYTE_TO_DATAPTR(mp, + blp[mid].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp, (char *)dep - (char *)block)); xfs_dir2_block_log_leaf(tp, bp, lfloglow, lfloghigh); /* @@ -412,7 +412,7 @@ xfs_dir2_block_addname( dep->inumber = cpu_to_be64(args->inumber); dep->namelen = args->namelen; memcpy(dep->name, args->name, args->namelen); - tagp = XFS_DIR2_DATA_ENTRY_TAG_P(dep); + tagp = xfs_dir2_data_entry_tag_p(dep); *tagp = cpu_to_be16((char *)dep - (char *)block); /* * Clean up the bestfree array and log the header, tail, and entry. @@ -457,7 +457,7 @@ xfs_dir2_block_getdents( /* * If the block number in the offset is out of range, we're done. 
*/ - if (XFS_DIR2_DATAPTR_TO_DB(mp, uio->uio_offset) > mp->m_dirdatablk) { + if (xfs_dir2_dataptr_to_db(mp, uio->uio_offset) > mp->m_dirdatablk) { *eofp = 1; return 0; } @@ -473,15 +473,15 @@ xfs_dir2_block_getdents( * Extract the byte offset we start at from the seek pointer. * We'll skip entries before this. */ - wantoff = XFS_DIR2_DATAPTR_TO_OFF(mp, uio->uio_offset); + wantoff = xfs_dir2_dataptr_to_off(mp, uio->uio_offset); block = bp->data; xfs_dir2_data_check(dp, bp); /* * Set up values for the loop. */ - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); + btp = xfs_dir2_block_tail_p(mp, block); ptr = (char *)block->u; - endptr = (char *)XFS_DIR2_BLOCK_LEAF_P(btp); + endptr = (char *)xfs_dir2_block_leaf_p(btp); p.dbp = dbp; p.put = put; p.uio = uio; @@ -504,7 +504,7 @@ xfs_dir2_block_getdents( /* * Bump pointer for the next iteration. */ - ptr += XFS_DIR2_DATA_ENTSIZE(dep->namelen); + ptr += xfs_dir2_data_entsize(dep->namelen); /* * The entry is before the desired starting point, skip it. */ @@ -515,7 +515,7 @@ xfs_dir2_block_getdents( */ p.namelen = dep->namelen; - p.cook = XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, + p.cook = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, ptr - (char *)block); p.ino = be64_to_cpu(dep->inumber); #if XFS_BIG_INUMS @@ -533,7 +533,7 @@ xfs_dir2_block_getdents( */ if (!p.done) { uio->uio_offset = - XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, + xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, (char *)dep - (char *)block); xfs_da_brelse(tp, bp); return error; @@ -547,7 +547,7 @@ xfs_dir2_block_getdents( *eofp = 1; uio->uio_offset = - XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk + 1, 0); + xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk + 1, 0); xfs_da_brelse(tp, bp); @@ -571,8 +571,8 @@ xfs_dir2_block_log_leaf( mp = tp->t_mountp; block = bp->data; - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, block); + blp = xfs_dir2_block_leaf_p(btp); xfs_da_log_buf(tp, bp, 
(uint)((char *)&blp[first] - (char *)block), (uint)((char *)&blp[last + 1] - (char *)block - 1)); } @@ -591,7 +591,7 @@ xfs_dir2_block_log_tail( mp = tp->t_mountp; block = bp->data; - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); + btp = xfs_dir2_block_tail_p(mp, block); xfs_da_log_buf(tp, bp, (uint)((char *)btp - (char *)block), (uint)((char *)(btp + 1) - (char *)block - 1)); } @@ -625,13 +625,13 @@ xfs_dir2_block_lookup( mp = dp->i_mount; block = bp->data; xfs_dir2_data_check(dp, bp); - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, block); + blp = xfs_dir2_block_leaf_p(btp); /* * Get the offset from the leaf entry, to point to the data. */ dep = (xfs_dir2_data_entry_t *) - ((char *)block + XFS_DIR2_DATAPTR_TO_OFF(mp, be32_to_cpu(blp[ent].address))); + ((char *)block + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(blp[ent].address))); /* * Fill in inode number, release the block. */ @@ -677,8 +677,8 @@ xfs_dir2_block_lookup_int( ASSERT(bp != NULL); block = bp->data; xfs_dir2_data_check(dp, bp); - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, block); + blp = xfs_dir2_block_leaf_p(btp); /* * Loop doing a binary search for our hash value. * Find our entry, ENOENT if it's not there. @@ -715,7 +715,7 @@ xfs_dir2_block_lookup_int( * Get pointer to the entry from the leaf. */ dep = (xfs_dir2_data_entry_t *) - ((char *)block + XFS_DIR2_DATAPTR_TO_OFF(mp, addr)); + ((char *)block + xfs_dir2_dataptr_to_off(mp, addr)); /* * Compare, if it's right give back buffer & entry number. */ @@ -770,20 +770,20 @@ xfs_dir2_block_removename( tp = args->trans; mp = dp->i_mount; block = bp->data; - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, block); + blp = xfs_dir2_block_leaf_p(btp); /* * Point to the data entry using the leaf entry. 
*/ dep = (xfs_dir2_data_entry_t *) - ((char *)block + XFS_DIR2_DATAPTR_TO_OFF(mp, be32_to_cpu(blp[ent].address))); + ((char *)block + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(blp[ent].address))); /* * Mark the data entry's space free. */ needlog = needscan = 0; xfs_dir2_data_make_free(tp, bp, (xfs_dir2_data_aoff_t)((char *)dep - (char *)block), - XFS_DIR2_DATA_ENTSIZE(dep->namelen), &needlog, &needscan); + xfs_dir2_data_entsize(dep->namelen), &needlog, &needscan); /* * Fix up the block tail. */ @@ -846,13 +846,13 @@ xfs_dir2_block_replace( dp = args->dp; mp = dp->i_mount; block = bp->data; - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, block); + blp = xfs_dir2_block_leaf_p(btp); /* * Point to the data entry we need to change. */ dep = (xfs_dir2_data_entry_t *) - ((char *)block + XFS_DIR2_DATAPTR_TO_OFF(mp, be32_to_cpu(blp[ent].address))); + ((char *)block + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(blp[ent].address))); ASSERT(be64_to_cpu(dep->inumber) != args->inumber); /* * Change the inode number to the new value. @@ -915,7 +915,7 @@ xfs_dir2_leaf_to_block( mp = dp->i_mount; leaf = lbp->data; ASSERT(be16_to_cpu(leaf->hdr.info.magic) == XFS_DIR2_LEAF1_MAGIC); - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); /* * If there are data blocks other than the first one, take this * opportunity to remove trailing empty data blocks that may have @@ -923,7 +923,7 @@ xfs_dir2_leaf_to_block( * These will show up in the leaf bests table. */ while (dp->i_d.di_size > mp->m_dirblksize) { - bestsp = XFS_DIR2_LEAF_BESTS_P(ltp); + bestsp = xfs_dir2_leaf_bests_p(ltp); if (be16_to_cpu(bestsp[be32_to_cpu(ltp->bestcount) - 1]) == mp->m_dirblksize - (uint)sizeof(block->hdr)) { if ((error = @@ -977,14 +977,14 @@ xfs_dir2_leaf_to_block( /* * Initialize the block tail. 
*/ - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); + btp = xfs_dir2_block_tail_p(mp, block); btp->count = cpu_to_be32(be16_to_cpu(leaf->hdr.count) - be16_to_cpu(leaf->hdr.stale)); btp->stale = 0; xfs_dir2_block_log_tail(tp, dbp); /* * Initialize the block leaf area. We compact out stale entries. */ - lep = XFS_DIR2_BLOCK_LEAF_P(btp); + lep = xfs_dir2_block_leaf_p(btp); for (from = to = 0; from < be16_to_cpu(leaf->hdr.count); from++) { if (be32_to_cpu(leaf->ents[from].address) == XFS_DIR2_NULL_DATAPTR) continue; @@ -1071,7 +1071,7 @@ xfs_dir2_sf_to_block( ASSERT(dp->i_df.if_bytes == dp->i_d.di_size); ASSERT(dp->i_df.if_u1.if_data != NULL); sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; - ASSERT(dp->i_d.di_size >= XFS_DIR2_SF_HDR_SIZE(sfp->hdr.i8count)); + ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count)); /* * Copy the directory into the stack buffer. * Then pitch the incore inode data so we can make extents. @@ -1123,10 +1123,10 @@ xfs_dir2_sf_to_block( /* * Fill in the tail. */ - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); + btp = xfs_dir2_block_tail_p(mp, block); btp->count = cpu_to_be32(sfp->hdr.count + 2); /* ., .. */ btp->stale = 0; - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + blp = xfs_dir2_block_leaf_p(btp); endoffset = (uint)((char *)blp - (char *)block); /* * Remove the freespace, we'll manage it. @@ -1142,25 +1142,25 @@ xfs_dir2_sf_to_block( dep->inumber = cpu_to_be64(dp->i_ino); dep->namelen = 1; dep->name[0] = '.'; - tagp = XFS_DIR2_DATA_ENTRY_TAG_P(dep); + tagp = xfs_dir2_data_entry_tag_p(dep); *tagp = cpu_to_be16((char *)dep - (char *)block); xfs_dir2_data_log_entry(tp, bp, dep); blp[0].hashval = cpu_to_be32(xfs_dir_hash_dot); - blp[0].address = cpu_to_be32(XFS_DIR2_BYTE_TO_DATAPTR(mp, + blp[0].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp, (char *)dep - (char *)block)); /* * Create entry for .. 
*/ dep = (xfs_dir2_data_entry_t *) ((char *)block + XFS_DIR2_DATA_DOTDOT_OFFSET); - dep->inumber = cpu_to_be64(XFS_DIR2_SF_GET_INUMBER(sfp, &sfp->hdr.parent)); + dep->inumber = cpu_to_be64(xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent)); dep->namelen = 2; dep->name[0] = dep->name[1] = '.'; - tagp = XFS_DIR2_DATA_ENTRY_TAG_P(dep); + tagp = xfs_dir2_data_entry_tag_p(dep); *tagp = cpu_to_be16((char *)dep - (char *)block); xfs_dir2_data_log_entry(tp, bp, dep); blp[1].hashval = cpu_to_be32(xfs_dir_hash_dotdot); - blp[1].address = cpu_to_be32(XFS_DIR2_BYTE_TO_DATAPTR(mp, + blp[1].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp, (char *)dep - (char *)block)); offset = XFS_DIR2_DATA_FIRST_OFFSET; /* @@ -1169,7 +1169,7 @@ xfs_dir2_sf_to_block( if ((i = 0) == sfp->hdr.count) sfep = NULL; else - sfep = XFS_DIR2_SF_FIRSTENTRY(sfp); + sfep = xfs_dir2_sf_firstentry(sfp); /* * Need to preserve the existing offset values in the sf directory. * Insert holes (unused entries) where necessary. @@ -1181,7 +1181,7 @@ xfs_dir2_sf_to_block( if (sfep == NULL) newoffset = endoffset; else - newoffset = XFS_DIR2_SF_GET_OFFSET(sfep); + newoffset = xfs_dir2_sf_get_offset(sfep); /* * There should be a hole here, make one. */ @@ -1190,7 +1190,7 @@ xfs_dir2_sf_to_block( ((char *)block + offset); dup->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG); dup->length = cpu_to_be16(newoffset - offset); - *XFS_DIR2_DATA_UNUSED_TAG_P(dup) = cpu_to_be16( + *xfs_dir2_data_unused_tag_p(dup) = cpu_to_be16( ((char *)dup - (char *)block)); xfs_dir2_data_log_unused(tp, bp, dup); (void)xfs_dir2_data_freeinsert((xfs_dir2_data_t *)block, @@ -1202,22 +1202,22 @@ xfs_dir2_sf_to_block( * Copy a real entry. 
*/ dep = (xfs_dir2_data_entry_t *)((char *)block + newoffset); - dep->inumber = cpu_to_be64(XFS_DIR2_SF_GET_INUMBER(sfp, - XFS_DIR2_SF_INUMBERP(sfep))); + dep->inumber = cpu_to_be64(xfs_dir2_sf_get_inumber(sfp, + xfs_dir2_sf_inumberp(sfep))); dep->namelen = sfep->namelen; memcpy(dep->name, sfep->name, dep->namelen); - tagp = XFS_DIR2_DATA_ENTRY_TAG_P(dep); + tagp = xfs_dir2_data_entry_tag_p(dep); *tagp = cpu_to_be16((char *)dep - (char *)block); xfs_dir2_data_log_entry(tp, bp, dep); blp[2 + i].hashval = cpu_to_be32(xfs_da_hashname( (char *)sfep->name, sfep->namelen)); - blp[2 + i].address = cpu_to_be32(XFS_DIR2_BYTE_TO_DATAPTR(mp, + blp[2 + i].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp, (char *)dep - (char *)block)); offset = (int)((char *)(tagp + 1) - (char *)block); if (++i == sfp->hdr.count) sfep = NULL; else - sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep); + sfep = xfs_dir2_sf_nextentry(sfp, sfep); } /* Done with the temporary buffer */ kmem_free(buf, buf_len); Index: linux-2.6/fs/xfs/xfs_dir2_block.h =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_block.h 2007-04-13 13:48:21.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_block.h 2007-04-13 13:48:29.000000000 +0200 @@ -60,7 +60,6 @@ typedef struct xfs_dir2_block { /* * Pointer to the leaf header embedded in a data block (1-block format) */ -#define XFS_DIR2_BLOCK_TAIL_P(mp,block) xfs_dir2_block_tail_p(mp,block) static inline xfs_dir2_block_tail_t * xfs_dir2_block_tail_p(struct xfs_mount *mp, xfs_dir2_block_t *block) { @@ -71,7 +70,6 @@ xfs_dir2_block_tail_p(struct xfs_mount * /* * Pointer to the leaf entries embedded in a data block (1-block format) */ -#define XFS_DIR2_BLOCK_LEAF_P(btp) xfs_dir2_block_leaf_p(btp) static inline struct xfs_dir2_leaf_entry * xfs_dir2_block_leaf_p(xfs_dir2_block_tail_t *btp) { Index: linux-2.6/fs/xfs/xfs_dir2_data.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_data.c 
2007-04-13 13:47:12.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_data.c 2007-04-13 14:08:11.000000000 +0200 @@ -72,8 +72,8 @@ xfs_dir2_data_check( bf = d->hdr.bestfree; p = (char *)d->u; if (be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC) { - btp = XFS_DIR2_BLOCK_TAIL_P(mp, (xfs_dir2_block_t *)d); - lep = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, (xfs_dir2_block_t *)d); + lep = xfs_dir2_block_leaf_p(btp); endp = (char *)lep; } else endp = (char *)d + mp->m_dirblksize; @@ -107,7 +107,7 @@ xfs_dir2_data_check( */ if (be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG) { ASSERT(lastfree == 0); - ASSERT(be16_to_cpu(*XFS_DIR2_DATA_UNUSED_TAG_P(dup)) == + ASSERT(be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup)) == (char *)dup - (char *)d); dfp = xfs_dir2_data_freefind(d, dup); if (dfp) { @@ -131,12 +131,12 @@ xfs_dir2_data_check( dep = (xfs_dir2_data_entry_t *)p; ASSERT(dep->namelen != 0); ASSERT(xfs_dir_ino_validate(mp, be64_to_cpu(dep->inumber)) == 0); - ASSERT(be16_to_cpu(*XFS_DIR2_DATA_ENTRY_TAG_P(dep)) == + ASSERT(be16_to_cpu(*xfs_dir2_data_entry_tag_p(dep)) == (char *)dep - (char *)d); count++; lastfree = 0; if (be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC) { - addr = XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, + addr = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, (xfs_dir2_data_aoff_t) ((char *)dep - (char *)d)); hash = xfs_da_hashname((char *)dep->name, dep->namelen); @@ -147,7 +147,7 @@ xfs_dir2_data_check( } ASSERT(i < be32_to_cpu(btp->count)); } - p += XFS_DIR2_DATA_ENTSIZE(dep->namelen); + p += xfs_dir2_data_entsize(dep->namelen); } /* * Need to have seen all the entries and all the bestfree slots. 
@@ -349,8 +349,8 @@ xfs_dir2_data_freescan( if (aendp) endp = aendp; else if (be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC) { - btp = XFS_DIR2_BLOCK_TAIL_P(mp, (xfs_dir2_block_t *)d); - endp = (char *)XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, (xfs_dir2_block_t *)d); + endp = (char *)xfs_dir2_block_leaf_p(btp); } else endp = (char *)d + mp->m_dirblksize; /* @@ -363,7 +363,7 @@ xfs_dir2_data_freescan( */ if (be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG) { ASSERT((char *)dup - (char *)d == - be16_to_cpu(*XFS_DIR2_DATA_UNUSED_TAG_P(dup))); + be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup))); xfs_dir2_data_freeinsert(d, dup, loghead); p += be16_to_cpu(dup->length); } @@ -373,8 +373,8 @@ xfs_dir2_data_freescan( else { dep = (xfs_dir2_data_entry_t *)p; ASSERT((char *)dep - (char *)d == - be16_to_cpu(*XFS_DIR2_DATA_ENTRY_TAG_P(dep))); - p += XFS_DIR2_DATA_ENTSIZE(dep->namelen); + be16_to_cpu(*xfs_dir2_data_entry_tag_p(dep))); + p += xfs_dir2_data_entsize(dep->namelen); } } } @@ -405,7 +405,7 @@ xfs_dir2_data_init( /* * Get the buffer set up for the block. */ - error = xfs_da_get_buf(tp, dp, XFS_DIR2_DB_TO_DA(mp, blkno), -1, &bp, + error = xfs_da_get_buf(tp, dp, xfs_dir2_db_to_da(mp, blkno), -1, &bp, XFS_DATA_FORK); if (error) { return error; @@ -430,7 +430,7 @@ xfs_dir2_data_init( t=mp->m_dirblksize - (uint)sizeof(d->hdr); d->hdr.bestfree[0].length = cpu_to_be16(t); dup->length = cpu_to_be16(t); - *XFS_DIR2_DATA_UNUSED_TAG_P(dup) = cpu_to_be16((char *)dup - (char *)d); + *xfs_dir2_data_unused_tag_p(dup) = cpu_to_be16((char *)dup - (char *)d); /* * Log it and return it. 
*/ @@ -455,7 +455,7 @@ xfs_dir2_data_log_entry( ASSERT(be32_to_cpu(d->hdr.magic) == XFS_DIR2_DATA_MAGIC || be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC); xfs_da_log_buf(tp, bp, (uint)((char *)dep - (char *)d), - (uint)((char *)(XFS_DIR2_DATA_ENTRY_TAG_P(dep) + 1) - + (uint)((char *)(xfs_dir2_data_entry_tag_p(dep) + 1) - (char *)d - 1)); } @@ -500,8 +500,8 @@ xfs_dir2_data_log_unused( * Log the end (tag) of the unused entry. */ xfs_da_log_buf(tp, bp, - (uint)((char *)XFS_DIR2_DATA_UNUSED_TAG_P(dup) - (char *)d), - (uint)((char *)XFS_DIR2_DATA_UNUSED_TAG_P(dup) - (char *)d + + (uint)((char *)xfs_dir2_data_unused_tag_p(dup) - (char *)d), + (uint)((char *)xfs_dir2_data_unused_tag_p(dup) - (char *)d + sizeof(xfs_dir2_data_off_t) - 1)); } @@ -538,8 +538,8 @@ xfs_dir2_data_make_free( xfs_dir2_block_tail_t *btp; /* block tail */ ASSERT(be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC); - btp = XFS_DIR2_BLOCK_TAIL_P(mp, (xfs_dir2_block_t *)d); - endptr = (char *)XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, (xfs_dir2_block_t *)d); + endptr = (char *)xfs_dir2_block_leaf_p(btp); } /* * If this isn't the start of the block, then back up to @@ -590,7 +590,7 @@ xfs_dir2_data_make_free( * Fix up the new big freespace. 
*/ be16_add(&prevdup->length, len + be16_to_cpu(postdup->length)); - *XFS_DIR2_DATA_UNUSED_TAG_P(prevdup) = + *xfs_dir2_data_unused_tag_p(prevdup) = cpu_to_be16((char *)prevdup - (char *)d); xfs_dir2_data_log_unused(tp, bp, prevdup); if (!needscan) { @@ -624,7 +624,7 @@ xfs_dir2_data_make_free( else if (prevdup) { dfp = xfs_dir2_data_freefind(d, prevdup); be16_add(&prevdup->length, len); - *XFS_DIR2_DATA_UNUSED_TAG_P(prevdup) = + *xfs_dir2_data_unused_tag_p(prevdup) = cpu_to_be16((char *)prevdup - (char *)d); xfs_dir2_data_log_unused(tp, bp, prevdup); /* @@ -652,7 +652,7 @@ xfs_dir2_data_make_free( newdup = (xfs_dir2_data_unused_t *)((char *)d + offset); newdup->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG); newdup->length = cpu_to_be16(len + be16_to_cpu(postdup->length)); - *XFS_DIR2_DATA_UNUSED_TAG_P(newdup) = + *xfs_dir2_data_unused_tag_p(newdup) = cpu_to_be16((char *)newdup - (char *)d); xfs_dir2_data_log_unused(tp, bp, newdup); /* @@ -679,7 +679,7 @@ xfs_dir2_data_make_free( newdup = (xfs_dir2_data_unused_t *)((char *)d + offset); newdup->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG); newdup->length = cpu_to_be16(len); - *XFS_DIR2_DATA_UNUSED_TAG_P(newdup) = + *xfs_dir2_data_unused_tag_p(newdup) = cpu_to_be16((char *)newdup - (char *)d); xfs_dir2_data_log_unused(tp, bp, newdup); (void)xfs_dir2_data_freeinsert(d, newdup, needlogp); @@ -715,7 +715,7 @@ xfs_dir2_data_use_free( ASSERT(be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG); ASSERT(offset >= (char *)dup - (char *)d); ASSERT(offset + len <= (char *)dup + be16_to_cpu(dup->length) - (char *)d); - ASSERT((char *)dup - (char *)d == be16_to_cpu(*XFS_DIR2_DATA_UNUSED_TAG_P(dup))); + ASSERT((char *)dup - (char *)d == be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup))); /* * Look up the entry in the bestfree table. 
*/ @@ -748,7 +748,7 @@ xfs_dir2_data_use_free( newdup = (xfs_dir2_data_unused_t *)((char *)d + offset + len); newdup->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG); newdup->length = cpu_to_be16(oldlen - len); - *XFS_DIR2_DATA_UNUSED_TAG_P(newdup) = + *xfs_dir2_data_unused_tag_p(newdup) = cpu_to_be16((char *)newdup - (char *)d); xfs_dir2_data_log_unused(tp, bp, newdup); /* @@ -775,7 +775,7 @@ xfs_dir2_data_use_free( else if (matchback) { newdup = dup; newdup->length = cpu_to_be16(((char *)d + offset) - (char *)newdup); - *XFS_DIR2_DATA_UNUSED_TAG_P(newdup) = + *xfs_dir2_data_unused_tag_p(newdup) = cpu_to_be16((char *)newdup - (char *)d); xfs_dir2_data_log_unused(tp, bp, newdup); /* @@ -802,13 +802,13 @@ xfs_dir2_data_use_free( else { newdup = dup; newdup->length = cpu_to_be16(((char *)d + offset) - (char *)newdup); - *XFS_DIR2_DATA_UNUSED_TAG_P(newdup) = + *xfs_dir2_data_unused_tag_p(newdup) = cpu_to_be16((char *)newdup - (char *)d); xfs_dir2_data_log_unused(tp, bp, newdup); newdup2 = (xfs_dir2_data_unused_t *)((char *)d + offset + len); newdup2->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG); newdup2->length = cpu_to_be16(oldlen - len - be16_to_cpu(newdup->length)); - *XFS_DIR2_DATA_UNUSED_TAG_P(newdup2) = + *xfs_dir2_data_unused_tag_p(newdup2) = cpu_to_be16((char *)newdup2 - (char *)d); xfs_dir2_data_log_unused(tp, bp, newdup2); /* Index: linux-2.6/fs/xfs/xfs_dir2_data.h =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_data.h 2007-04-13 13:50:00.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_data.h 2007-04-13 14:04:36.000000000 +0200 @@ -44,7 +44,7 @@ struct xfs_trans; #define XFS_DIR2_DATA_SPACE 0 #define XFS_DIR2_DATA_OFFSET (XFS_DIR2_DATA_SPACE * XFS_DIR2_SPACE_SIZE) #define XFS_DIR2_DATA_FIRSTDB(mp) \ - XFS_DIR2_BYTE_TO_DB(mp, XFS_DIR2_DATA_OFFSET) + xfs_dir2_byte_to_db(mp, XFS_DIR2_DATA_OFFSET) /* * Offsets of . and .. 
in data space (always block 0) @@ -52,9 +52,9 @@ struct xfs_trans; #define XFS_DIR2_DATA_DOT_OFFSET \ ((xfs_dir2_data_aoff_t)sizeof(xfs_dir2_data_hdr_t)) #define XFS_DIR2_DATA_DOTDOT_OFFSET \ - (XFS_DIR2_DATA_DOT_OFFSET + XFS_DIR2_DATA_ENTSIZE(1)) + (XFS_DIR2_DATA_DOT_OFFSET + xfs_dir2_data_entsize(1)) #define XFS_DIR2_DATA_FIRST_OFFSET \ - (XFS_DIR2_DATA_DOTDOT_OFFSET + XFS_DIR2_DATA_ENTSIZE(2)) + (XFS_DIR2_DATA_DOTDOT_OFFSET + xfs_dir2_data_entsize(2)) /* * Structures. @@ -123,7 +123,6 @@ typedef struct xfs_dir2_data { /* * Size of a data entry. */ -#define XFS_DIR2_DATA_ENTSIZE(n) xfs_dir2_data_entsize(n) static inline int xfs_dir2_data_entsize(int n) { return (int)roundup(offsetof(xfs_dir2_data_entry_t, name[0]) + (n) + \ @@ -133,19 +132,16 @@ static inline int xfs_dir2_data_entsize( /* * Pointer to an entry's tag word. */ -#define XFS_DIR2_DATA_ENTRY_TAG_P(dep) xfs_dir2_data_entry_tag_p(dep) static inline __be16 * xfs_dir2_data_entry_tag_p(xfs_dir2_data_entry_t *dep) { return (__be16 *)((char *)dep + - XFS_DIR2_DATA_ENTSIZE(dep->namelen) - sizeof(__be16)); + xfs_dir2_data_entsize(dep->namelen) - sizeof(__be16)); } /* * Pointer to a freespace's tag word. */ -#define XFS_DIR2_DATA_UNUSED_TAG_P(dup) \ - xfs_dir2_data_unused_tag_p(dup) static inline __be16 * xfs_dir2_data_unused_tag_p(xfs_dir2_data_unused_t *dup) { Index: linux-2.6/fs/xfs/xfs_dir2_leaf.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_leaf.c 2007-04-13 13:47:18.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_leaf.c 2007-04-13 14:08:13.000000000 +0200 @@ -92,7 +92,7 @@ xfs_dir2_block_to_leaf( if ((error = xfs_da_grow_inode(args, &blkno))) { return error; } - ldb = XFS_DIR2_DA_TO_DB(mp, blkno); + ldb = xfs_dir2_da_to_db(mp, blkno); ASSERT(ldb == XFS_DIR2_LEAF_FIRSTDB(mp)); /* * Initialize the leaf block, get a buffer for it. 
@@ -104,8 +104,8 @@ xfs_dir2_block_to_leaf( leaf = lbp->data; block = dbp->data; xfs_dir2_data_check(dp, dbp); - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, block); + blp = xfs_dir2_block_leaf_p(btp); /* * Set the counts in the leaf header. */ @@ -138,9 +138,9 @@ xfs_dir2_block_to_leaf( /* * Set up leaf tail and bests table. */ - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); ltp->bestcount = cpu_to_be32(1); - bestsp = XFS_DIR2_LEAF_BESTS_P(ltp); + bestsp = xfs_dir2_leaf_bests_p(ltp); bestsp[0] = block->hdr.bestfree[0].length; /* * Log the data header and leaf bests table. @@ -210,9 +210,9 @@ xfs_dir2_leaf_addname( */ index = xfs_dir2_leaf_search_hash(args, lbp); leaf = lbp->data; - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); - bestsp = XFS_DIR2_LEAF_BESTS_P(ltp); - length = XFS_DIR2_DATA_ENTSIZE(args->namelen); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); + bestsp = xfs_dir2_leaf_bests_p(ltp); + length = xfs_dir2_data_entsize(args->namelen); /* * See if there are any entries with the same hash value * and space in their block for the new entry. 
@@ -224,7 +224,7 @@ xfs_dir2_leaf_addname( index++, lep++) { if (be32_to_cpu(lep->address) == XFS_DIR2_NULL_DATAPTR) continue; - i = XFS_DIR2_DATAPTR_TO_DB(mp, be32_to_cpu(lep->address)); + i = xfs_dir2_dataptr_to_db(mp, be32_to_cpu(lep->address)); ASSERT(i < be32_to_cpu(ltp->bestcount)); ASSERT(be16_to_cpu(bestsp[i]) != NULLDATAOFF); if (be16_to_cpu(bestsp[i]) >= length) { @@ -379,7 +379,7 @@ xfs_dir2_leaf_addname( */ else { if ((error = - xfs_da_read_buf(tp, dp, XFS_DIR2_DB_TO_DA(mp, use_block), + xfs_da_read_buf(tp, dp, xfs_dir2_db_to_da(mp, use_block), -1, &dbp, XFS_DATA_FORK))) { xfs_da_brelse(tp, lbp); return error; @@ -408,7 +408,7 @@ xfs_dir2_leaf_addname( dep->inumber = cpu_to_be64(args->inumber); dep->namelen = args->namelen; memcpy(dep->name, args->name, dep->namelen); - tagp = XFS_DIR2_DATA_ENTRY_TAG_P(dep); + tagp = xfs_dir2_data_entry_tag_p(dep); *tagp = cpu_to_be16((char *)dep - (char *)data); /* * Need to scan fix up the bestfree table. @@ -530,7 +530,7 @@ xfs_dir2_leaf_addname( * Fill in the new leaf entry. */ lep->hashval = cpu_to_be32(args->hashval); - lep->address = cpu_to_be32(XFS_DIR2_DB_OFF_TO_DATAPTR(mp, use_block, + lep->address = cpu_to_be32(xfs_dir2_db_off_to_dataptr(mp, use_block, be16_to_cpu(*tagp))); /* * Log the leaf fields and give up the buffers. @@ -568,13 +568,13 @@ xfs_dir2_leaf_check( * Should factor in the size of the bests table as well. * We can deduce a value for that from di_size. */ - ASSERT(be16_to_cpu(leaf->hdr.count) <= XFS_DIR2_MAX_LEAF_ENTS(mp)); - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); + ASSERT(be16_to_cpu(leaf->hdr.count) <= xfs_dir2_max_leaf_ents(mp)); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); /* * Leaves and bests don't overlap. */ ASSERT((char *)&leaf->ents[be16_to_cpu(leaf->hdr.count)] <= - (char *)XFS_DIR2_LEAF_BESTS_P(ltp)); + (char *)xfs_dir2_leaf_bests_p(ltp)); /* * Check hash value order, count stale entries. 
*/ @@ -816,12 +816,12 @@ xfs_dir2_leaf_getdents( * Inside the loop we keep the main offset value as a byte offset * in the directory file. */ - curoff = XFS_DIR2_DATAPTR_TO_BYTE(mp, uio->uio_offset); + curoff = xfs_dir2_dataptr_to_byte(mp, uio->uio_offset); /* * Force this conversion through db so we truncate the offset * down to get the start of the data block. */ - map_off = XFS_DIR2_DB_TO_DA(mp, XFS_DIR2_BYTE_TO_DB(mp, curoff)); + map_off = xfs_dir2_db_to_da(mp, xfs_dir2_byte_to_db(mp, curoff)); /* * Loop over directory entries until we reach the end offset. * Get more blocks and readahead as necessary. @@ -871,7 +871,7 @@ xfs_dir2_leaf_getdents( */ if (1 + ra_want > map_blocks && map_off < - XFS_DIR2_BYTE_TO_DA(mp, XFS_DIR2_LEAF_OFFSET)) { + xfs_dir2_byte_to_da(mp, XFS_DIR2_LEAF_OFFSET)) { /* * Get more bmaps, fill in after the ones * we already have in the table. @@ -879,7 +879,7 @@ xfs_dir2_leaf_getdents( nmap = map_size - map_valid; error = xfs_bmapi(tp, dp, map_off, - XFS_DIR2_BYTE_TO_DA(mp, + xfs_dir2_byte_to_da(mp, XFS_DIR2_LEAF_OFFSET) - map_off, XFS_BMAPI_METADATA, NULL, 0, &map[map_valid], &nmap, NULL, NULL); @@ -904,7 +904,7 @@ xfs_dir2_leaf_getdents( map[map_valid + nmap - 1].br_blockcount; else map_off = - XFS_DIR2_BYTE_TO_DA(mp, + xfs_dir2_byte_to_da(mp, XFS_DIR2_LEAF_OFFSET); /* * Look for holes in the mapping, and @@ -932,14 +932,14 @@ xfs_dir2_leaf_getdents( * No valid mappings, so no more data blocks. */ if (!map_valid) { - curoff = XFS_DIR2_DA_TO_BYTE(mp, map_off); + curoff = xfs_dir2_da_to_byte(mp, map_off); break; } /* * Read the directory block starting at the first * mapping. */ - curdb = XFS_DIR2_DA_TO_DB(mp, map->br_startoff); + curdb = xfs_dir2_da_to_db(mp, map->br_startoff); error = xfs_da_read_buf(tp, dp, map->br_startoff, map->br_blockcount >= mp->m_dirblkfsbs ? XFS_FSB_TO_DADDR(mp, map->br_startblock) : @@ -1015,7 +1015,7 @@ xfs_dir2_leaf_getdents( /* * Having done a read, we need to set a new offset. 
*/ - newoff = XFS_DIR2_DB_OFF_TO_BYTE(mp, curdb, 0); + newoff = xfs_dir2_db_off_to_byte(mp, curdb, 0); /* * Start of the current block. */ @@ -1025,7 +1025,7 @@ xfs_dir2_leaf_getdents( * Make sure we're in the right block. */ else if (curoff > newoff) - ASSERT(XFS_DIR2_BYTE_TO_DB(mp, curoff) == + ASSERT(xfs_dir2_byte_to_db(mp, curoff) == curdb); data = bp->data; xfs_dir2_data_check(dp, bp); @@ -1033,7 +1033,7 @@ xfs_dir2_leaf_getdents( * Find our position in the block. */ ptr = (char *)&data->u; - byteoff = XFS_DIR2_BYTE_TO_OFF(mp, curoff); + byteoff = xfs_dir2_byte_to_off(mp, curoff); /* * Skip past the header. */ @@ -1055,15 +1055,15 @@ xfs_dir2_leaf_getdents( } dep = (xfs_dir2_data_entry_t *)ptr; length = - XFS_DIR2_DATA_ENTSIZE(dep->namelen); + xfs_dir2_data_entsize(dep->namelen); ptr += length; } /* * Now set our real offset. */ curoff = - XFS_DIR2_DB_OFF_TO_BYTE(mp, - XFS_DIR2_BYTE_TO_DB(mp, curoff), + xfs_dir2_db_off_to_byte(mp, + xfs_dir2_byte_to_db(mp, curoff), (char *)ptr - (char *)data); if (ptr >= (char *)data + mp->m_dirblksize) { continue; @@ -1092,9 +1092,9 @@ xfs_dir2_leaf_getdents( p->namelen = dep->namelen; - length = XFS_DIR2_DATA_ENTSIZE(p->namelen); + length = xfs_dir2_data_entsize(p->namelen); - p->cook = XFS_DIR2_BYTE_TO_DATAPTR(mp, curoff + length); + p->cook = xfs_dir2_byte_to_dataptr(mp, curoff + length); p->ino = be64_to_cpu(dep->inumber); #if XFS_BIG_INUMS @@ -1122,10 +1122,10 @@ xfs_dir2_leaf_getdents( * All done. Set output offset value to current offset. */ *eofp = eof; - if (curoff > XFS_DIR2_DATAPTR_TO_BYTE(mp, XFS_DIR2_MAX_DATAPTR)) + if (curoff > xfs_dir2_dataptr_to_byte(mp, XFS_DIR2_MAX_DATAPTR)) uio->uio_offset = XFS_DIR2_MAX_DATAPTR; else - uio->uio_offset = XFS_DIR2_BYTE_TO_DATAPTR(mp, curoff); + uio->uio_offset = xfs_dir2_byte_to_dataptr(mp, curoff); kmem_free(map, map_size * sizeof(*map)); kmem_free(p, sizeof(*p)); if (bp) @@ -1160,7 +1160,7 @@ xfs_dir2_leaf_init( /* * Get the buffer for the block. 
*/ - error = xfs_da_get_buf(tp, dp, XFS_DIR2_DB_TO_DA(mp, bno), -1, &bp, + error = xfs_da_get_buf(tp, dp, xfs_dir2_db_to_da(mp, bno), -1, &bp, XFS_DATA_FORK); if (error) { return error; @@ -1182,7 +1182,7 @@ xfs_dir2_leaf_init( * the block. */ if (magic == XFS_DIR2_LEAF1_MAGIC) { - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); ltp->bestcount = 0; xfs_dir2_leaf_log_tail(tp, bp); } @@ -1207,9 +1207,9 @@ xfs_dir2_leaf_log_bests( leaf = bp->data; ASSERT(be16_to_cpu(leaf->hdr.info.magic) == XFS_DIR2_LEAF1_MAGIC); - ltp = XFS_DIR2_LEAF_TAIL_P(tp->t_mountp, leaf); - firstb = XFS_DIR2_LEAF_BESTS_P(ltp) + first; - lastb = XFS_DIR2_LEAF_BESTS_P(ltp) + last; + ltp = xfs_dir2_leaf_tail_p(tp->t_mountp, leaf); + firstb = xfs_dir2_leaf_bests_p(ltp) + first; + lastb = xfs_dir2_leaf_bests_p(ltp) + last; xfs_da_log_buf(tp, bp, (uint)((char *)firstb - (char *)leaf), (uint)((char *)lastb - (char *)leaf + sizeof(*lastb) - 1)); } @@ -1269,7 +1269,7 @@ xfs_dir2_leaf_log_tail( mp = tp->t_mountp; leaf = bp->data; ASSERT(be16_to_cpu(leaf->hdr.info.magic) == XFS_DIR2_LEAF1_MAGIC); - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); xfs_da_log_buf(tp, bp, (uint)((char *)ltp - (char *)leaf), (uint)(mp->m_dirblksize - 1)); } @@ -1313,7 +1313,7 @@ xfs_dir2_leaf_lookup( */ dep = (xfs_dir2_data_entry_t *) ((char *)dbp->data + - XFS_DIR2_DATAPTR_TO_OFF(dp->i_mount, be32_to_cpu(lep->address))); + xfs_dir2_dataptr_to_off(dp->i_mount, be32_to_cpu(lep->address))); /* * Return the found inode number. */ @@ -1382,7 +1382,7 @@ xfs_dir2_leaf_lookup_int( /* * Get the new data block number. */ - newdb = XFS_DIR2_DATAPTR_TO_DB(mp, be32_to_cpu(lep->address)); + newdb = xfs_dir2_dataptr_to_db(mp, be32_to_cpu(lep->address)); /* * If it's not the same as the old data block number, * need to pitch the old one and read the new one. 
@@ -1392,7 +1392,7 @@ xfs_dir2_leaf_lookup_int( xfs_da_brelse(tp, dbp); if ((error = xfs_da_read_buf(tp, dp, - XFS_DIR2_DB_TO_DA(mp, newdb), -1, &dbp, + xfs_dir2_db_to_da(mp, newdb), -1, &dbp, XFS_DATA_FORK))) { xfs_da_brelse(tp, lbp); return error; @@ -1405,7 +1405,7 @@ xfs_dir2_leaf_lookup_int( */ dep = (xfs_dir2_data_entry_t *) ((char *)dbp->data + - XFS_DIR2_DATAPTR_TO_OFF(mp, be32_to_cpu(lep->address))); + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address))); /* * If it matches then return it. */ @@ -1470,20 +1470,20 @@ xfs_dir2_leaf_removename( * Point to the leaf entry, use that to point to the data entry. */ lep = &leaf->ents[index]; - db = XFS_DIR2_DATAPTR_TO_DB(mp, be32_to_cpu(lep->address)); + db = xfs_dir2_dataptr_to_db(mp, be32_to_cpu(lep->address)); dep = (xfs_dir2_data_entry_t *) - ((char *)data + XFS_DIR2_DATAPTR_TO_OFF(mp, be32_to_cpu(lep->address))); + ((char *)data + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address))); needscan = needlog = 0; oldbest = be16_to_cpu(data->hdr.bestfree[0].length); - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); - bestsp = XFS_DIR2_LEAF_BESTS_P(ltp); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); + bestsp = xfs_dir2_leaf_bests_p(ltp); ASSERT(be16_to_cpu(bestsp[db]) == oldbest); /* * Mark the former data entry unused. */ xfs_dir2_data_make_free(tp, dbp, (xfs_dir2_data_aoff_t)((char *)dep - (char *)data), - XFS_DIR2_DATA_ENTSIZE(dep->namelen), &needlog, &needscan); + xfs_dir2_data_entsize(dep->namelen), &needlog, &needscan); /* * We just mark the leaf entry stale by putting a null in it. */ @@ -1603,7 +1603,7 @@ xfs_dir2_leaf_replace( */ dep = (xfs_dir2_data_entry_t *) ((char *)dbp->data + - XFS_DIR2_DATAPTR_TO_OFF(dp->i_mount, be32_to_cpu(lep->address))); + xfs_dir2_dataptr_to_off(dp->i_mount, be32_to_cpu(lep->address))); ASSERT(args->inumber != be64_to_cpu(dep->inumber)); /* * Put the new inode number in, log it. @@ -1699,7 +1699,7 @@ xfs_dir2_leaf_trim_data( /* * Read the offending data block. We need its buffer. 
*/ - if ((error = xfs_da_read_buf(tp, dp, XFS_DIR2_DB_TO_DA(mp, db), -1, &dbp, + if ((error = xfs_da_read_buf(tp, dp, xfs_dir2_db_to_da(mp, db), -1, &dbp, XFS_DATA_FORK))) { return error; } @@ -1713,7 +1713,7 @@ xfs_dir2_leaf_trim_data( */ leaf = lbp->data; - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); ASSERT(be16_to_cpu(data->hdr.bestfree[0].length) == mp->m_dirblksize - (uint)sizeof(data->hdr)); ASSERT(db == be32_to_cpu(ltp->bestcount) - 1); @@ -1728,7 +1728,7 @@ xfs_dir2_leaf_trim_data( /* * Eliminate the last bests entry from the table. */ - bestsp = XFS_DIR2_LEAF_BESTS_P(ltp); + bestsp = xfs_dir2_leaf_bests_p(ltp); be32_add(&ltp->bestcount, -1); memmove(&bestsp[1], &bestsp[0], be32_to_cpu(ltp->bestcount) * sizeof(*bestsp)); xfs_dir2_leaf_log_tail(tp, lbp); @@ -1839,12 +1839,12 @@ xfs_dir2_node_to_leaf( /* * Set up the leaf tail from the freespace block. */ - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); ltp->bestcount = free->hdr.nvalid; /* * Set up the leaf bests table. */ - memcpy(XFS_DIR2_LEAF_BESTS_P(ltp), free->bests, + memcpy(xfs_dir2_leaf_bests_p(ltp), free->bests, be32_to_cpu(ltp->bestcount) * sizeof(leaf->bests[0])); xfs_dir2_leaf_log_bests(tp, lbp, 0, be32_to_cpu(ltp->bestcount) - 1); xfs_dir2_leaf_log_tail(tp, lbp); Index: linux-2.6/fs/xfs/xfs_dir2_leaf.h =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_leaf.h 2007-04-13 13:54:13.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_leaf.h 2007-04-13 14:10:43.000000000 +0200 @@ -32,7 +32,7 @@ struct xfs_trans; #define XFS_DIR2_LEAF_SPACE 1 #define XFS_DIR2_LEAF_OFFSET (XFS_DIR2_LEAF_SPACE * XFS_DIR2_SPACE_SIZE) #define XFS_DIR2_LEAF_FIRSTDB(mp) \ - XFS_DIR2_BYTE_TO_DB(mp, XFS_DIR2_LEAF_OFFSET) + xfs_dir2_byte_to_db(mp, XFS_DIR2_LEAF_OFFSET) /* * Offset in data space of a data entry.
@@ -82,7 +82,6 @@ typedef struct xfs_dir2_leaf { * DB blocks here are logical directory block numbers, not filesystem blocks. */ -#define XFS_DIR2_MAX_LEAF_ENTS(mp) xfs_dir2_max_leaf_ents(mp) static inline int xfs_dir2_max_leaf_ents(struct xfs_mount *mp) { return (int)(((mp)->m_dirblksize - (uint)sizeof(xfs_dir2_leaf_hdr_t)) / @@ -92,7 +91,6 @@ static inline int xfs_dir2_max_leaf_ents /* * Get address of the bestcount field in the single-leaf block. */ -#define XFS_DIR2_LEAF_TAIL_P(mp,lp) xfs_dir2_leaf_tail_p(mp, lp) static inline xfs_dir2_leaf_tail_t * xfs_dir2_leaf_tail_p(struct xfs_mount *mp, xfs_dir2_leaf_t *lp) { @@ -104,7 +102,6 @@ xfs_dir2_leaf_tail_p(struct xfs_mount *m /* * Get address of the bests array in the single-leaf block. */ -#define XFS_DIR2_LEAF_BESTS_P(ltp) xfs_dir2_leaf_bests_p(ltp) static inline __be16 * xfs_dir2_leaf_bests_p(xfs_dir2_leaf_tail_t *ltp) { @@ -114,7 +111,6 @@ xfs_dir2_leaf_bests_p(xfs_dir2_leaf_tail /* * Convert dataptr to byte in file space */ -#define XFS_DIR2_DATAPTR_TO_BYTE(mp,dp) xfs_dir2_dataptr_to_byte(mp, dp) static inline xfs_dir2_off_t xfs_dir2_dataptr_to_byte(struct xfs_mount *mp, xfs_dir2_dataptr_t dp) { @@ -124,7 +120,6 @@ xfs_dir2_dataptr_to_byte(struct xfs_moun /* * Convert byte in file space to dataptr. It had better be aligned. 
*/ -#define XFS_DIR2_BYTE_TO_DATAPTR(mp,by) xfs_dir2_byte_to_dataptr(mp,by) static inline xfs_dir2_dataptr_t xfs_dir2_byte_to_dataptr(struct xfs_mount *mp, xfs_dir2_off_t by) { @@ -134,7 +129,6 @@ xfs_dir2_byte_to_dataptr(struct xfs_moun /* * Convert byte in space to (DB) block */ -#define XFS_DIR2_BYTE_TO_DB(mp,by) xfs_dir2_byte_to_db(mp, by) static inline xfs_dir2_db_t xfs_dir2_byte_to_db(struct xfs_mount *mp, xfs_dir2_off_t by) { @@ -145,17 +139,15 @@ xfs_dir2_byte_to_db(struct xfs_mount *mp /* * Convert dataptr to a block number */ -#define XFS_DIR2_DATAPTR_TO_DB(mp,dp) xfs_dir2_dataptr_to_db(mp, dp) static inline xfs_dir2_db_t xfs_dir2_dataptr_to_db(struct xfs_mount *mp, xfs_dir2_dataptr_t dp) { - return XFS_DIR2_BYTE_TO_DB(mp, XFS_DIR2_DATAPTR_TO_BYTE(mp, dp)); + return xfs_dir2_byte_to_db(mp, xfs_dir2_dataptr_to_byte(mp, dp)); } /* * Convert byte in space to offset in a block */ -#define XFS_DIR2_BYTE_TO_OFF(mp,by) xfs_dir2_byte_to_off(mp, by) static inline xfs_dir2_data_aoff_t xfs_dir2_byte_to_off(struct xfs_mount *mp, xfs_dir2_off_t by) { @@ -166,18 +158,15 @@ xfs_dir2_byte_to_off(struct xfs_mount *m /* * Convert dataptr to a byte offset in a block */ -#define XFS_DIR2_DATAPTR_TO_OFF(mp,dp) xfs_dir2_dataptr_to_off(mp, dp) static inline xfs_dir2_data_aoff_t xfs_dir2_dataptr_to_off(struct xfs_mount *mp, xfs_dir2_dataptr_t dp) { - return XFS_DIR2_BYTE_TO_OFF(mp, XFS_DIR2_DATAPTR_TO_BYTE(mp, dp)); + return xfs_dir2_byte_to_off(mp, xfs_dir2_dataptr_to_byte(mp, dp)); } /* * Convert block and offset to byte in space */ -#define XFS_DIR2_DB_OFF_TO_BYTE(mp,db,o) \ - xfs_dir2_db_off_to_byte(mp, db, o) static inline xfs_dir2_off_t xfs_dir2_db_off_to_byte(struct xfs_mount *mp, xfs_dir2_db_t db, xfs_dir2_data_aoff_t o) @@ -189,7 +178,6 @@ xfs_dir2_db_off_to_byte(struct xfs_mount /* * Convert block (DB) to block (dablk) */ -#define XFS_DIR2_DB_TO_DA(mp,db) xfs_dir2_db_to_da(mp, db) static inline xfs_dablk_t xfs_dir2_db_to_da(struct xfs_mount *mp, xfs_dir2_db_t db) { @@ 
-199,29 +187,25 @@ xfs_dir2_db_to_da(struct xfs_mount *mp, /* * Convert byte in space to (DA) block */ -#define XFS_DIR2_BYTE_TO_DA(mp,by) xfs_dir2_byte_to_da(mp, by) static inline xfs_dablk_t xfs_dir2_byte_to_da(struct xfs_mount *mp, xfs_dir2_off_t by) { - return XFS_DIR2_DB_TO_DA(mp, XFS_DIR2_BYTE_TO_DB(mp, by)); + return xfs_dir2_db_to_da(mp, xfs_dir2_byte_to_db(mp, by)); } /* * Convert block and offset to dataptr */ -#define XFS_DIR2_DB_OFF_TO_DATAPTR(mp,db,o) \ - xfs_dir2_db_off_to_dataptr(mp, db, o) static inline xfs_dir2_dataptr_t xfs_dir2_db_off_to_dataptr(struct xfs_mount *mp, xfs_dir2_db_t db, xfs_dir2_data_aoff_t o) { - return XFS_DIR2_BYTE_TO_DATAPTR(mp, XFS_DIR2_DB_OFF_TO_BYTE(mp, db, o)); + return xfs_dir2_byte_to_dataptr(mp, xfs_dir2_db_off_to_byte(mp, db, o)); } /* * Convert block (dablk) to block (DB) */ -#define XFS_DIR2_DA_TO_DB(mp,da) xfs_dir2_da_to_db(mp, da) static inline xfs_dir2_db_t xfs_dir2_da_to_db(struct xfs_mount *mp, xfs_dablk_t da) { @@ -231,11 +215,10 @@ xfs_dir2_da_to_db(struct xfs_mount *mp, /* * Convert block (dablk) to byte offset in space */ -#define XFS_DIR2_DA_TO_BYTE(mp,da) xfs_dir2_da_to_byte(mp, da) static inline xfs_dir2_off_t xfs_dir2_da_to_byte(struct xfs_mount *mp, xfs_dablk_t da) { - return XFS_DIR2_DB_OFF_TO_BYTE(mp, XFS_DIR2_DA_TO_DB(mp, da), 0); + return xfs_dir2_db_off_to_byte(mp, xfs_dir2_da_to_db(mp, da), 0); } /* Index: linux-2.6/fs/xfs/xfs_dir2_node.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_node.c 2007-04-13 13:49:23.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_node.c 2007-04-13 14:08:15.000000000 +0200 @@ -136,14 +136,14 @@ xfs_dir2_leaf_to_node( /* * Get the buffer for the new freespace block. 
*/ - if ((error = xfs_da_get_buf(tp, dp, XFS_DIR2_DB_TO_DA(mp, fdb), -1, &fbp, + if ((error = xfs_da_get_buf(tp, dp, xfs_dir2_db_to_da(mp, fdb), -1, &fbp, XFS_DATA_FORK))) { return error; } ASSERT(fbp != NULL); free = fbp->data; leaf = lbp->data; - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); /* * Initialize the freespace block header. */ @@ -155,7 +155,7 @@ xfs_dir2_leaf_to_node( * Copy freespace entries from the leaf block to the new block. * Count active entries. */ - for (i = n = 0, from = XFS_DIR2_LEAF_BESTS_P(ltp), to = free->bests; + for (i = n = 0, from = xfs_dir2_leaf_bests_p(ltp), to = free->bests; i < be32_to_cpu(ltp->bestcount); i++, from++, to++) { if ((off = be16_to_cpu(*from)) != NULLDATAOFF) n++; @@ -215,7 +215,7 @@ xfs_dir2_leafn_add( * a compact. */ - if (be16_to_cpu(leaf->hdr.count) == XFS_DIR2_MAX_LEAF_ENTS(mp)) { + if (be16_to_cpu(leaf->hdr.count) == xfs_dir2_max_leaf_ents(mp)) { if (!leaf->hdr.stale) return XFS_ERROR(ENOSPC); compact = be16_to_cpu(leaf->hdr.stale) > 1; @@ -327,7 +327,7 @@ xfs_dir2_leafn_add( * Insert the new entry, log everything. */ lep->hashval = cpu_to_be32(args->hashval); - lep->address = cpu_to_be32(XFS_DIR2_DB_OFF_TO_DATAPTR(mp, + lep->address = cpu_to_be32(xfs_dir2_db_off_to_dataptr(mp, args->blkno, args->index)); xfs_dir2_leaf_log_header(tp, bp); xfs_dir2_leaf_log_ents(tp, bp, lfloglow, lfloghigh); @@ -352,7 +352,7 @@ xfs_dir2_leafn_check( leaf = bp->data; mp = dp->i_mount; ASSERT(be16_to_cpu(leaf->hdr.info.magic) == XFS_DIR2_LEAFN_MAGIC); - ASSERT(be16_to_cpu(leaf->hdr.count) <= XFS_DIR2_MAX_LEAF_ENTS(mp)); + ASSERT(be16_to_cpu(leaf->hdr.count) <= xfs_dir2_max_leaf_ents(mp)); for (i = stale = 0; i < be16_to_cpu(leaf->hdr.count); i++) { if (i + 1 < be16_to_cpu(leaf->hdr.count)) { ASSERT(be32_to_cpu(leaf->ents[i].hashval) <= @@ -440,7 +440,7 @@ xfs_dir2_leafn_lookup_int( if (args->addname) { curfdb = curbp ? 
state->extrablk.blkno : -1; curdb = -1; - length = XFS_DIR2_DATA_ENTSIZE(args->namelen); + length = xfs_dir2_data_entsize(args->namelen); if ((free = (curbp ? curbp->data : NULL))) ASSERT(be32_to_cpu(free->hdr.magic) == XFS_DIR2_FREE_MAGIC); } @@ -465,7 +465,7 @@ xfs_dir2_leafn_lookup_int( /* * Pull the data block number from the entry. */ - newdb = XFS_DIR2_DATAPTR_TO_DB(mp, be32_to_cpu(lep->address)); + newdb = xfs_dir2_dataptr_to_db(mp, be32_to_cpu(lep->address)); /* * For addname, we're looking for a place to put the new entry. * We want to use a data block with an entry of equal @@ -482,7 +482,7 @@ xfs_dir2_leafn_lookup_int( * Convert the data block to the free block * holding its freespace information. */ - newfdb = XFS_DIR2_DB_TO_FDB(mp, newdb); + newfdb = xfs_dir2_db_to_fdb(mp, newdb); /* * If it's not the one we have in hand, * read it in. @@ -497,7 +497,7 @@ xfs_dir2_leafn_lookup_int( * Read the free block. */ if ((error = xfs_da_read_buf(tp, dp, - XFS_DIR2_DB_TO_DA(mp, + xfs_dir2_db_to_da(mp, newfdb), -1, &curbp, XFS_DATA_FORK))) { @@ -517,7 +517,7 @@ xfs_dir2_leafn_lookup_int( /* * Get the index for our entry. */ - fi = XFS_DIR2_DB_TO_FDINDEX(mp, curdb); + fi = xfs_dir2_db_to_fdindex(mp, curdb); /* * If it has room, return it. */ @@ -561,7 +561,7 @@ xfs_dir2_leafn_lookup_int( */ if ((error = xfs_da_read_buf(tp, dp, - XFS_DIR2_DB_TO_DA(mp, newdb), -1, + xfs_dir2_db_to_da(mp, newdb), -1, &curbp, XFS_DATA_FORK))) { return error; } @@ -573,7 +573,7 @@ xfs_dir2_leafn_lookup_int( */ dep = (xfs_dir2_data_entry_t *) ((char *)curbp->data + - XFS_DIR2_DATAPTR_TO_OFF(mp, be32_to_cpu(lep->address))); + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address))); /* * Compare the entry, return it if it matches. */ @@ -876,9 +876,9 @@ xfs_dir2_leafn_remove( /* * Extract the data block and offset from the entry. 
*/ - db = XFS_DIR2_DATAPTR_TO_DB(mp, be32_to_cpu(lep->address)); + db = xfs_dir2_dataptr_to_db(mp, be32_to_cpu(lep->address)); ASSERT(dblk->blkno == db); - off = XFS_DIR2_DATAPTR_TO_OFF(mp, be32_to_cpu(lep->address)); + off = xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address)); ASSERT(dblk->index == off); /* * Kill the leaf entry by marking it stale. @@ -898,7 +898,7 @@ xfs_dir2_leafn_remove( longest = be16_to_cpu(data->hdr.bestfree[0].length); needlog = needscan = 0; xfs_dir2_data_make_free(tp, dbp, off, - XFS_DIR2_DATA_ENTSIZE(dep->namelen), &needlog, &needscan); + xfs_dir2_data_entsize(dep->namelen), &needlog, &needscan); /* * Rescan the data block freespaces for bestfree. * Log the data block header if needed. @@ -924,8 +924,8 @@ xfs_dir2_leafn_remove( * Convert the data block number to a free block, * read in the free block. */ - fdb = XFS_DIR2_DB_TO_FDB(mp, db); - if ((error = xfs_da_read_buf(tp, dp, XFS_DIR2_DB_TO_DA(mp, fdb), + fdb = xfs_dir2_db_to_fdb(mp, db); + if ((error = xfs_da_read_buf(tp, dp, xfs_dir2_db_to_da(mp, fdb), -1, &fbp, XFS_DATA_FORK))) { return error; } @@ -937,7 +937,7 @@ xfs_dir2_leafn_remove( /* * Calculate which entry we need to fix. */ - findex = XFS_DIR2_DB_TO_FDINDEX(mp, db); + findex = xfs_dir2_db_to_fdindex(mp, db); longest = be16_to_cpu(data->hdr.bestfree[0].length); /* * If the data block is now empty we can get rid of it @@ -1073,7 +1073,7 @@ xfs_dir2_leafn_split( /* * Initialize the new leaf block. */ - error = xfs_dir2_leaf_init(args, XFS_DIR2_DA_TO_DB(mp, blkno), + error = xfs_dir2_leaf_init(args, xfs_dir2_da_to_db(mp, blkno), &newblk->bp, XFS_DIR2_LEAFN_MAGIC); if (error) { return error; @@ -1385,7 +1385,7 @@ xfs_dir2_node_addname_int( dp = args->dp; mp = dp->i_mount; tp = args->trans; - length = XFS_DIR2_DATA_ENTSIZE(args->namelen); + length = xfs_dir2_data_entsize(args->namelen); /* * If we came in with a freespace block that means that lookup * found an entry with our hash value. 
This is the freespace @@ -1438,7 +1438,7 @@ xfs_dir2_node_addname_int( if ((error = xfs_bmap_last_offset(tp, dp, &fo, XFS_DATA_FORK))) return error; - lastfbno = XFS_DIR2_DA_TO_DB(mp, (xfs_dablk_t)fo); + lastfbno = xfs_dir2_da_to_db(mp, (xfs_dablk_t)fo); fbno = ifbno; } /* @@ -1474,7 +1474,7 @@ xfs_dir2_node_addname_int( * to avoid it. */ if ((error = xfs_da_read_buf(tp, dp, - XFS_DIR2_DB_TO_DA(mp, fbno), -2, &fbp, + xfs_dir2_db_to_da(mp, fbno), -2, &fbp, XFS_DATA_FORK))) { return error; } @@ -1550,9 +1550,9 @@ xfs_dir2_node_addname_int( * Get the freespace block corresponding to the data block * that was just allocated. */ - fbno = XFS_DIR2_DB_TO_FDB(mp, dbno); + fbno = xfs_dir2_db_to_fdb(mp, dbno); if (unlikely(error = xfs_da_read_buf(tp, dp, - XFS_DIR2_DB_TO_DA(mp, fbno), -2, &fbp, + xfs_dir2_db_to_da(mp, fbno), -2, &fbp, XFS_DATA_FORK))) { xfs_da_buf_done(dbp); return error; @@ -1567,14 +1567,14 @@ xfs_dir2_node_addname_int( return error; } - if (unlikely(XFS_DIR2_DB_TO_FDB(mp, dbno) != fbno)) { + if (unlikely(xfs_dir2_db_to_fdb(mp, dbno) != fbno)) { cmn_err(CE_ALERT, "xfs_dir2_node_addname_int: dir ino " "%llu needed freesp block %lld for\n" " data block %lld, got %lld\n" " ifbno %llu lastfbno %d\n", (unsigned long long)dp->i_ino, - (long long)XFS_DIR2_DB_TO_FDB(mp, dbno), + (long long)xfs_dir2_db_to_fdb(mp, dbno), (long long)dbno, (long long)fbno, (unsigned long long)ifbno, lastfbno); if (fblk) { @@ -1598,7 +1598,7 @@ xfs_dir2_node_addname_int( * Get a buffer for the new block. */ if ((error = xfs_da_get_buf(tp, dp, - XFS_DIR2_DB_TO_DA(mp, fbno), + xfs_dir2_db_to_da(mp, fbno), -1, &fbp, XFS_DATA_FORK))) { return error; } @@ -1623,7 +1623,7 @@ xfs_dir2_node_addname_int( /* * Set the freespace block index from the data block number. */ - findex = XFS_DIR2_DB_TO_FDINDEX(mp, dbno); + findex = xfs_dir2_db_to_fdindex(mp, dbno); /* * If it's after the end of the current entries in the * freespace block, extend that table. 
@@ -1669,7 +1669,7 @@ xfs_dir2_node_addname_int( * Read the data block in. */ if (unlikely( - error = xfs_da_read_buf(tp, dp, XFS_DIR2_DB_TO_DA(mp, dbno), + error = xfs_da_read_buf(tp, dp, xfs_dir2_db_to_da(mp, dbno), -1, &dbp, XFS_DATA_FORK))) { if ((fblk == NULL || fblk->bp == NULL) && fbp != NULL) xfs_da_buf_done(fbp); @@ -1698,7 +1698,7 @@ xfs_dir2_node_addname_int( dep->inumber = cpu_to_be64(args->inumber); dep->namelen = args->namelen; memcpy(dep->name, args->name, dep->namelen); - tagp = XFS_DIR2_DATA_ENTRY_TAG_P(dep); + tagp = xfs_dir2_data_entry_tag_p(dep); *tagp = cpu_to_be16((char *)dep - (char *)data); xfs_dir2_data_log_entry(tp, dbp, dep); /* @@ -1904,7 +1904,7 @@ xfs_dir2_node_replace( ASSERT(be32_to_cpu(data->hdr.magic) == XFS_DIR2_DATA_MAGIC); dep = (xfs_dir2_data_entry_t *) ((char *)data + - XFS_DIR2_DATAPTR_TO_OFF(state->mp, be32_to_cpu(lep->address))); + xfs_dir2_dataptr_to_off(state->mp, be32_to_cpu(lep->address))); ASSERT(inum != be64_to_cpu(dep->inumber)); /* * Fill in the new inode number and log the entry. @@ -1980,7 +1980,7 @@ xfs_dir2_node_trim_free( * Blow the block away. 
*/ if ((error = - xfs_dir2_shrink_inode(args, XFS_DIR2_DA_TO_DB(mp, (xfs_dablk_t)fo), + xfs_dir2_shrink_inode(args, xfs_dir2_da_to_db(mp, (xfs_dablk_t)fo), bp))) { /* * Can't fail with ENOSPC since that only happens with no Index: linux-2.6/fs/xfs/xfs_dir2_node.h =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_node.h 2007-04-13 13:55:47.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_node.h 2007-04-13 14:04:32.000000000 +0200 @@ -36,7 +36,7 @@ struct xfs_trans; #define XFS_DIR2_FREE_SPACE 2 #define XFS_DIR2_FREE_OFFSET (XFS_DIR2_FREE_SPACE * XFS_DIR2_SPACE_SIZE) #define XFS_DIR2_FREE_FIRSTDB(mp) \ - XFS_DIR2_BYTE_TO_DB(mp, XFS_DIR2_FREE_OFFSET) + xfs_dir2_byte_to_db(mp, XFS_DIR2_FREE_OFFSET) #define XFS_DIR2_FREE_MAGIC 0x58443246 /* XD2F */ @@ -60,7 +60,6 @@ typedef struct xfs_dir2_free { /* * Convert data space db to the corresponding free db. */ -#define XFS_DIR2_DB_TO_FDB(mp,db) xfs_dir2_db_to_fdb(mp, db) static inline xfs_dir2_db_t xfs_dir2_db_to_fdb(struct xfs_mount *mp, xfs_dir2_db_t db) { @@ -70,7 +69,6 @@ xfs_dir2_db_to_fdb(struct xfs_mount *mp, /* * Convert data space db to the corresponding index in a free db. */ -#define XFS_DIR2_DB_TO_FDINDEX(mp,db) xfs_dir2_db_to_fdindex(mp, db) static inline int xfs_dir2_db_to_fdindex(struct xfs_mount *mp, xfs_dir2_db_t db) { Index: linux-2.6/fs/xfs/xfs_dir2_sf.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_sf.c 2007-04-13 13:47:23.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_sf.c 2007-04-13 14:08:17.000000000 +0200 @@ -89,8 +89,8 @@ xfs_dir2_block_sfsize( mp = dp->i_mount; count = i8count = namelen = 0; - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, block); + blp = xfs_dir2_block_leaf_p(btp); /* * Iterate over the block's data entries by using the leaf pointers. 
@@ -102,7 +102,7 @@ xfs_dir2_block_sfsize( * Calculate the pointer to the entry at hand. */ dep = (xfs_dir2_data_entry_t *) - ((char *)block + XFS_DIR2_DATAPTR_TO_OFF(mp, addr)); + ((char *)block + xfs_dir2_dataptr_to_off(mp, addr)); /* * Detect . and .., so we can special-case them. * . is not included in sf directories. @@ -124,7 +124,7 @@ xfs_dir2_block_sfsize( /* * Calculate the new size, see if we should give up yet. */ - size = XFS_DIR2_SF_HDR_SIZE(i8count) + /* header */ + size = xfs_dir2_sf_hdr_size(i8count) + /* header */ count + /* namelen */ count * (uint)sizeof(xfs_dir2_sf_off_t) + /* offset */ namelen + /* name */ @@ -139,7 +139,7 @@ xfs_dir2_block_sfsize( */ sfhp->count = count; sfhp->i8count = i8count; - XFS_DIR2_SF_PUT_INUMBER((xfs_dir2_sf_t *)sfhp, &parent, &sfhp->parent); + xfs_dir2_sf_put_inumber((xfs_dir2_sf_t *)sfhp, &parent, &sfhp->parent); return size; } @@ -199,15 +199,15 @@ xfs_dir2_block_to_sf( * Copy the header into the newly allocate local space. */ sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; - memcpy(sfp, sfhp, XFS_DIR2_SF_HDR_SIZE(sfhp->i8count)); + memcpy(sfp, sfhp, xfs_dir2_sf_hdr_size(sfhp->i8count)); dp->i_d.di_size = size; /* * Set up to loop over the block's entries. */ - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); + btp = xfs_dir2_block_tail_p(mp, block); ptr = (char *)block->u; - endptr = (char *)XFS_DIR2_BLOCK_LEAF_P(btp); - sfep = XFS_DIR2_SF_FIRSTENTRY(sfp); + endptr = (char *)xfs_dir2_block_leaf_p(btp); + sfep = xfs_dir2_sf_firstentry(sfp); /* * Loop over the active and unused entries. * Stop when we reach the leaf/tail portion of the block. @@ -233,22 +233,22 @@ xfs_dir2_block_to_sf( else if (dep->namelen == 2 && dep->name[0] == '.' && dep->name[1] == '.') ASSERT(be64_to_cpu(dep->inumber) == - XFS_DIR2_SF_GET_INUMBER(sfp, &sfp->hdr.parent)); + xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent)); /* * Normal entry, copy it into shortform. 
*/ else { sfep->namelen = dep->namelen; - XFS_DIR2_SF_PUT_OFFSET(sfep, + xfs_dir2_sf_put_offset(sfep, (xfs_dir2_data_aoff_t) ((char *)dep - (char *)block)); memcpy(sfep->name, dep->name, dep->namelen); temp = be64_to_cpu(dep->inumber); - XFS_DIR2_SF_PUT_INUMBER(sfp, &temp, - XFS_DIR2_SF_INUMBERP(sfep)); - sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep); + xfs_dir2_sf_put_inumber(sfp, &temp, + xfs_dir2_sf_inumberp(sfep)); + sfep = xfs_dir2_sf_nextentry(sfp, sfep); } - ptr += XFS_DIR2_DATA_ENTSIZE(dep->namelen); + ptr += xfs_dir2_data_entsize(dep->namelen); } ASSERT((char *)sfep - (char *)sfp == size); xfs_dir2_sf_check(args); @@ -294,11 +294,11 @@ xfs_dir2_sf_addname( ASSERT(dp->i_df.if_bytes == dp->i_d.di_size); ASSERT(dp->i_df.if_u1.if_data != NULL); sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; - ASSERT(dp->i_d.di_size >= XFS_DIR2_SF_HDR_SIZE(sfp->hdr.i8count)); + ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count)); /* * Compute entry (and change in) size. */ - add_entsize = XFS_DIR2_SF_ENTSIZE_BYNAME(sfp, args->namelen); + add_entsize = xfs_dir2_sf_entsize_byname(sfp, args->namelen); incr_isize = add_entsize; objchange = 0; #if XFS_BIG_INUMS @@ -392,7 +392,7 @@ xfs_dir2_sf_addname_easy( /* * Grow the in-inode space. */ - xfs_idata_realloc(dp, XFS_DIR2_SF_ENTSIZE_BYNAME(sfp, args->namelen), + xfs_idata_realloc(dp, xfs_dir2_sf_entsize_byname(sfp, args->namelen), XFS_DATA_FORK); /* * Need to set up again due to realloc of the inode data. @@ -403,10 +403,10 @@ xfs_dir2_sf_addname_easy( * Fill in the new entry. */ sfep->namelen = args->namelen; - XFS_DIR2_SF_PUT_OFFSET(sfep, offset); + xfs_dir2_sf_put_offset(sfep, offset); memcpy(sfep->name, args->name, sfep->namelen); - XFS_DIR2_SF_PUT_INUMBER(sfp, &args->inumber, - XFS_DIR2_SF_INUMBERP(sfep)); + xfs_dir2_sf_put_inumber(sfp, &args->inumber, + xfs_dir2_sf_inumberp(sfep)); /* * Update the header and inode. 
*/ @@ -463,14 +463,14 @@ xfs_dir2_sf_addname_hard( * If it's going to end up at the end then oldsfep will point there. */ for (offset = XFS_DIR2_DATA_FIRST_OFFSET, - oldsfep = XFS_DIR2_SF_FIRSTENTRY(oldsfp), - add_datasize = XFS_DIR2_DATA_ENTSIZE(args->namelen), + oldsfep = xfs_dir2_sf_firstentry(oldsfp), + add_datasize = xfs_dir2_data_entsize(args->namelen), eof = (char *)oldsfep == &buf[old_isize]; !eof; - offset = new_offset + XFS_DIR2_DATA_ENTSIZE(oldsfep->namelen), - oldsfep = XFS_DIR2_SF_NEXTENTRY(oldsfp, oldsfep), + offset = new_offset + xfs_dir2_data_entsize(oldsfep->namelen), + oldsfep = xfs_dir2_sf_nextentry(oldsfp, oldsfep), eof = (char *)oldsfep == &buf[old_isize]) { - new_offset = XFS_DIR2_SF_GET_OFFSET(oldsfep); + new_offset = xfs_dir2_sf_get_offset(oldsfep); if (offset + add_datasize <= new_offset) break; } @@ -495,10 +495,10 @@ xfs_dir2_sf_addname_hard( * Fill in the new entry, and update the header counts. */ sfep->namelen = args->namelen; - XFS_DIR2_SF_PUT_OFFSET(sfep, offset); + xfs_dir2_sf_put_offset(sfep, offset); memcpy(sfep->name, args->name, sfep->namelen); - XFS_DIR2_SF_PUT_INUMBER(sfp, &args->inumber, - XFS_DIR2_SF_INUMBERP(sfep)); + xfs_dir2_sf_put_inumber(sfp, &args->inumber, + xfs_dir2_sf_inumberp(sfep)); sfp->hdr.count++; #if XFS_BIG_INUMS if (args->inumber > XFS_DIR2_MAX_SHORT_INUM && !objchange) @@ -508,7 +508,7 @@ xfs_dir2_sf_addname_hard( * If there's more left to copy, do that. */ if (!eof) { - sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep); + sfep = xfs_dir2_sf_nextentry(sfp, sfep); memcpy(sfep, oldsfep, old_isize - nbytes); } kmem_free(buf, old_isize); @@ -544,9 +544,9 @@ xfs_dir2_sf_addname_pick( mp = dp->i_mount; sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; - size = XFS_DIR2_DATA_ENTSIZE(args->namelen); + size = xfs_dir2_data_entsize(args->namelen); offset = XFS_DIR2_DATA_FIRST_OFFSET; - sfep = XFS_DIR2_SF_FIRSTENTRY(sfp); + sfep = xfs_dir2_sf_firstentry(sfp); holefit = 0; /* * Loop over sf entries. 
@@ -555,10 +555,10 @@ xfs_dir2_sf_addname_pick( */ for (i = 0; i < sfp->hdr.count; i++) { if (!holefit) - holefit = offset + size <= XFS_DIR2_SF_GET_OFFSET(sfep); - offset = XFS_DIR2_SF_GET_OFFSET(sfep) + - XFS_DIR2_DATA_ENTSIZE(sfep->namelen); - sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep); + holefit = offset + size <= xfs_dir2_sf_get_offset(sfep); + offset = xfs_dir2_sf_get_offset(sfep) + + xfs_dir2_data_entsize(sfep->namelen); + sfep = xfs_dir2_sf_nextentry(sfp, sfep); } /* * Calculate data bytes used excluding the new entry, if this @@ -617,18 +617,18 @@ xfs_dir2_sf_check( sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; offset = XFS_DIR2_DATA_FIRST_OFFSET; - ino = XFS_DIR2_SF_GET_INUMBER(sfp, &sfp->hdr.parent); + ino = xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent); i8count = ino > XFS_DIR2_MAX_SHORT_INUM; - for (i = 0, sfep = XFS_DIR2_SF_FIRSTENTRY(sfp); + for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->hdr.count; - i++, sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep)) { - ASSERT(XFS_DIR2_SF_GET_OFFSET(sfep) >= offset); - ino = XFS_DIR2_SF_GET_INUMBER(sfp, XFS_DIR2_SF_INUMBERP(sfep)); + i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep)) { + ASSERT(xfs_dir2_sf_get_offset(sfep) >= offset); + ino = xfs_dir2_sf_get_inumber(sfp, xfs_dir2_sf_inumberp(sfep)); i8count += ino > XFS_DIR2_MAX_SHORT_INUM; offset = - XFS_DIR2_SF_GET_OFFSET(sfep) + - XFS_DIR2_DATA_ENTSIZE(sfep->namelen); + xfs_dir2_sf_get_offset(sfep) + + xfs_dir2_data_entsize(sfep->namelen); } ASSERT(i8count == sfp->hdr.i8count); ASSERT(XFS_BIG_INUMS || i8count == 0); @@ -671,7 +671,7 @@ xfs_dir2_sf_create( ASSERT(dp->i_df.if_flags & XFS_IFINLINE); ASSERT(dp->i_df.if_bytes == 0); i8count = pino > XFS_DIR2_MAX_SHORT_INUM; - size = XFS_DIR2_SF_HDR_SIZE(i8count); + size = xfs_dir2_sf_hdr_size(i8count); /* * Make a buffer for the data. */ @@ -684,7 +684,7 @@ xfs_dir2_sf_create( /* * Now can put in the inode number, since i8count is set. 
*/ - XFS_DIR2_SF_PUT_INUMBER(sfp, &pino, &sfp->hdr.parent); + xfs_dir2_sf_put_inumber(sfp, &pino, &sfp->hdr.parent); sfp->hdr.count = 0; dp->i_d.di_size = size; xfs_dir2_sf_check(args); @@ -727,12 +727,12 @@ xfs_dir2_sf_getdents( sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; - ASSERT(dp->i_d.di_size >= XFS_DIR2_SF_HDR_SIZE(sfp->hdr.i8count)); + ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count)); /* * If the block number in the offset is out of range, we're done. */ - if (XFS_DIR2_DATAPTR_TO_DB(mp, dir_offset) > mp->m_dirdatablk) { + if (xfs_dir2_dataptr_to_db(mp, dir_offset) > mp->m_dirdatablk) { *eofp = 1; return 0; } @@ -747,9 +747,9 @@ xfs_dir2_sf_getdents( * Put . entry unless we're starting past it. */ if (dir_offset <= - XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, + xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, XFS_DIR2_DATA_DOT_OFFSET)) { - p.cook = XFS_DIR2_DB_OFF_TO_DATAPTR(mp, 0, + p.cook = xfs_dir2_db_off_to_dataptr(mp, 0, XFS_DIR2_DATA_DOTDOT_OFFSET); p.ino = dp->i_ino; #if XFS_BIG_INUMS @@ -762,7 +762,7 @@ xfs_dir2_sf_getdents( if (!p.done) { uio->uio_offset = - XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, + xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, XFS_DIR2_DATA_DOT_OFFSET); return error; } @@ -772,11 +772,11 @@ xfs_dir2_sf_getdents( * Put .. entry unless we're starting past it. 
*/ if (dir_offset <= - XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, + xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, XFS_DIR2_DATA_DOTDOT_OFFSET)) { - p.cook = XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, + p.cook = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, XFS_DIR2_DATA_FIRST_OFFSET); - p.ino = XFS_DIR2_SF_GET_INUMBER(sfp, &sfp->hdr.parent); + p.ino = xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent); #if XFS_BIG_INUMS p.ino += mp->m_inoadd; #endif @@ -787,7 +787,7 @@ xfs_dir2_sf_getdents( if (!p.done) { uio->uio_offset = - XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, + xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, XFS_DIR2_DATA_DOTDOT_OFFSET); return error; } @@ -796,23 +796,23 @@ xfs_dir2_sf_getdents( /* * Loop while there are more entries and put'ing works. */ - for (i = 0, sfep = XFS_DIR2_SF_FIRSTENTRY(sfp); + for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->hdr.count; - i++, sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep)) { + i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep)) { - off = XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, - XFS_DIR2_SF_GET_OFFSET(sfep)); + off = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, + xfs_dir2_sf_get_offset(sfep)); if (dir_offset > off) continue; p.namelen = sfep->namelen; - p.cook = XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, - XFS_DIR2_SF_GET_OFFSET(sfep) + - XFS_DIR2_DATA_ENTSIZE(p.namelen)); + p.cook = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, + xfs_dir2_sf_get_offset(sfep) + + xfs_dir2_data_entsize(p.namelen)); - p.ino = XFS_DIR2_SF_GET_INUMBER(sfp, XFS_DIR2_SF_INUMBERP(sfep)); + p.ino = xfs_dir2_sf_get_inumber(sfp, xfs_dir2_sf_inumberp(sfep)); #if XFS_BIG_INUMS p.ino += mp->m_inoadd; #endif @@ -832,7 +832,7 @@ xfs_dir2_sf_getdents( *eofp = 1; uio->uio_offset = - XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk + 1, 0); + xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk + 1, 0); return 0; } @@ -865,7 +865,7 @@ xfs_dir2_sf_lookup( ASSERT(dp->i_df.if_bytes == dp->i_d.di_size); 
ASSERT(dp->i_df.if_u1.if_data != NULL); sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; - ASSERT(dp->i_d.di_size >= XFS_DIR2_SF_HDR_SIZE(sfp->hdr.i8count)); + ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count)); /* * Special case for . */ @@ -878,21 +878,21 @@ xfs_dir2_sf_lookup( */ if (args->namelen == 2 && args->name[0] == '.' && args->name[1] == '.') { - args->inumber = XFS_DIR2_SF_GET_INUMBER(sfp, &sfp->hdr.parent); + args->inumber = xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent); return XFS_ERROR(EEXIST); } /* * Loop over all the entries trying to match ours. */ - for (i = 0, sfep = XFS_DIR2_SF_FIRSTENTRY(sfp); + for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->hdr.count; - i++, sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep)) { + i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep)) { if (sfep->namelen == args->namelen && sfep->name[0] == args->name[0] && memcmp(args->name, sfep->name, args->namelen) == 0) { args->inumber = - XFS_DIR2_SF_GET_INUMBER(sfp, - XFS_DIR2_SF_INUMBERP(sfep)); + xfs_dir2_sf_get_inumber(sfp, + xfs_dir2_sf_inumberp(sfep)); return XFS_ERROR(EEXIST); } } @@ -934,19 +934,19 @@ xfs_dir2_sf_removename( ASSERT(dp->i_df.if_bytes == oldsize); ASSERT(dp->i_df.if_u1.if_data != NULL); sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; - ASSERT(oldsize >= XFS_DIR2_SF_HDR_SIZE(sfp->hdr.i8count)); + ASSERT(oldsize >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count)); /* * Loop over the old directory entries. * Find the one we're deleting. 
*/ - for (i = 0, sfep = XFS_DIR2_SF_FIRSTENTRY(sfp); + for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->hdr.count; - i++, sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep)) { + i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep)) { if (sfep->namelen == args->namelen && sfep->name[0] == args->name[0] && memcmp(sfep->name, args->name, args->namelen) == 0) { - ASSERT(XFS_DIR2_SF_GET_INUMBER(sfp, - XFS_DIR2_SF_INUMBERP(sfep)) == + ASSERT(xfs_dir2_sf_get_inumber(sfp, + xfs_dir2_sf_inumberp(sfep)) == args->inumber); break; } @@ -961,7 +961,7 @@ xfs_dir2_sf_removename( * Calculate sizes. */ byteoff = (int)((char *)sfep - (char *)sfp); - entsize = XFS_DIR2_SF_ENTSIZE_BYNAME(sfp, args->namelen); + entsize = xfs_dir2_sf_entsize_byname(sfp, args->namelen); newsize = oldsize - entsize; /* * Copy the part if any after the removed entry, sliding it down. @@ -1027,7 +1027,7 @@ xfs_dir2_sf_replace( ASSERT(dp->i_df.if_bytes == dp->i_d.di_size); ASSERT(dp->i_df.if_u1.if_data != NULL); sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; - ASSERT(dp->i_d.di_size >= XFS_DIR2_SF_HDR_SIZE(sfp->hdr.i8count)); + ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count)); #if XFS_BIG_INUMS /* * New inode number is large, and need to convert to 8-byte inodes. @@ -1067,28 +1067,28 @@ xfs_dir2_sf_replace( if (args->namelen == 2 && args->name[0] == '.' && args->name[1] == '.') { #if XFS_BIG_INUMS || defined(DEBUG) - ino = XFS_DIR2_SF_GET_INUMBER(sfp, &sfp->hdr.parent); + ino = xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent); ASSERT(args->inumber != ino); #endif - XFS_DIR2_SF_PUT_INUMBER(sfp, &args->inumber, &sfp->hdr.parent); + xfs_dir2_sf_put_inumber(sfp, &args->inumber, &sfp->hdr.parent); } /* * Normal entry, look for the name. 
*/ else { - for (i = 0, sfep = XFS_DIR2_SF_FIRSTENTRY(sfp); + for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->hdr.count; - i++, sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep)) { + i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep)) { if (sfep->namelen == args->namelen && sfep->name[0] == args->name[0] && memcmp(args->name, sfep->name, args->namelen) == 0) { #if XFS_BIG_INUMS || defined(DEBUG) - ino = XFS_DIR2_SF_GET_INUMBER(sfp, - XFS_DIR2_SF_INUMBERP(sfep)); + ino = xfs_dir2_sf_get_inumber(sfp, + xfs_dir2_sf_inumberp(sfep)); ASSERT(args->inumber != ino); #endif - XFS_DIR2_SF_PUT_INUMBER(sfp, &args->inumber, - XFS_DIR2_SF_INUMBERP(sfep)); + xfs_dir2_sf_put_inumber(sfp, &args->inumber, + xfs_dir2_sf_inumberp(sfep)); break; } } @@ -1189,22 +1189,22 @@ xfs_dir2_sf_toino4( */ sfp->hdr.count = oldsfp->hdr.count; sfp->hdr.i8count = 0; - ino = XFS_DIR2_SF_GET_INUMBER(oldsfp, &oldsfp->hdr.parent); - XFS_DIR2_SF_PUT_INUMBER(sfp, &ino, &sfp->hdr.parent); + ino = xfs_dir2_sf_get_inumber(oldsfp, &oldsfp->hdr.parent); + xfs_dir2_sf_put_inumber(sfp, &ino, &sfp->hdr.parent); /* * Copy the entries field by field. */ - for (i = 0, sfep = XFS_DIR2_SF_FIRSTENTRY(sfp), - oldsfep = XFS_DIR2_SF_FIRSTENTRY(oldsfp); + for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp), + oldsfep = xfs_dir2_sf_firstentry(oldsfp); i < sfp->hdr.count; - i++, sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep), - oldsfep = XFS_DIR2_SF_NEXTENTRY(oldsfp, oldsfep)) { + i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep), + oldsfep = xfs_dir2_sf_nextentry(oldsfp, oldsfep)) { sfep->namelen = oldsfep->namelen; sfep->offset = oldsfep->offset; memcpy(sfep->name, oldsfep->name, sfep->namelen); - ino = XFS_DIR2_SF_GET_INUMBER(oldsfp, - XFS_DIR2_SF_INUMBERP(oldsfep)); - XFS_DIR2_SF_PUT_INUMBER(sfp, &ino, XFS_DIR2_SF_INUMBERP(sfep)); + ino = xfs_dir2_sf_get_inumber(oldsfp, + xfs_dir2_sf_inumberp(oldsfep)); + xfs_dir2_sf_put_inumber(sfp, &ino, xfs_dir2_sf_inumberp(sfep)); } /* * Clean up the inode. 
@@ -1266,22 +1266,22 @@ xfs_dir2_sf_toino8( */ sfp->hdr.count = oldsfp->hdr.count; sfp->hdr.i8count = 1; - ino = XFS_DIR2_SF_GET_INUMBER(oldsfp, &oldsfp->hdr.parent); - XFS_DIR2_SF_PUT_INUMBER(sfp, &ino, &sfp->hdr.parent); + ino = xfs_dir2_sf_get_inumber(oldsfp, &oldsfp->hdr.parent); + xfs_dir2_sf_put_inumber(sfp, &ino, &sfp->hdr.parent); /* * Copy the entries field by field. */ - for (i = 0, sfep = XFS_DIR2_SF_FIRSTENTRY(sfp), - oldsfep = XFS_DIR2_SF_FIRSTENTRY(oldsfp); + for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp), + oldsfep = xfs_dir2_sf_firstentry(oldsfp); i < sfp->hdr.count; - i++, sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep), - oldsfep = XFS_DIR2_SF_NEXTENTRY(oldsfp, oldsfep)) { + i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep), + oldsfep = xfs_dir2_sf_nextentry(oldsfp, oldsfep)) { sfep->namelen = oldsfep->namelen; sfep->offset = oldsfep->offset; memcpy(sfep->name, oldsfep->name, sfep->namelen); - ino = XFS_DIR2_SF_GET_INUMBER(oldsfp, - XFS_DIR2_SF_INUMBERP(oldsfep)); - XFS_DIR2_SF_PUT_INUMBER(sfp, &ino, XFS_DIR2_SF_INUMBERP(sfep)); + ino = xfs_dir2_sf_get_inumber(oldsfp, + xfs_dir2_sf_inumberp(oldsfep)); + xfs_dir2_sf_put_inumber(sfp, &ino, xfs_dir2_sf_inumberp(sfep)); } /* * Clean up the inode. 
Index: linux-2.6/fs/xfs/xfs_dir2_sf.h =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_sf.h 2007-04-13 13:57:01.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_sf.h 2007-04-13 14:01:37.000000000 +0200 @@ -90,7 +90,6 @@ typedef struct xfs_dir2_sf { xfs_dir2_sf_entry_t list[1]; /* shortform entries */ } xfs_dir2_sf_t; -#define XFS_DIR2_SF_HDR_SIZE(i8count) xfs_dir2_sf_hdr_size(i8count) static inline int xfs_dir2_sf_hdr_size(int i8count) { return ((uint)sizeof(xfs_dir2_sf_hdr_t) - \ @@ -98,14 +97,11 @@ static inline int xfs_dir2_sf_hdr_size(i ((uint)sizeof(xfs_dir2_ino8_t) - (uint)sizeof(xfs_dir2_ino4_t))); } -#define XFS_DIR2_SF_INUMBERP(sfep) xfs_dir2_sf_inumberp(sfep) static inline xfs_dir2_inou_t *xfs_dir2_sf_inumberp(xfs_dir2_sf_entry_t *sfep) { return (xfs_dir2_inou_t *)&(sfep)->name[(sfep)->namelen]; } -#define XFS_DIR2_SF_GET_INUMBER(sfp, from) \ - xfs_dir2_sf_get_inumber(sfp, from) static inline xfs_intino_t xfs_dir2_sf_get_inumber(xfs_dir2_sf_t *sfp, xfs_dir2_inou_t *from) { @@ -114,8 +110,6 @@ xfs_dir2_sf_get_inumber(xfs_dir2_sf_t *s (xfs_intino_t)XFS_GET_DIR_INO8((from)->i8)); } -#define XFS_DIR2_SF_PUT_INUMBER(sfp,from,to) \ - xfs_dir2_sf_put_inumber(sfp,from,to) static inline void xfs_dir2_sf_put_inumber(xfs_dir2_sf_t *sfp, xfs_ino_t *from, xfs_dir2_inou_t *to) { @@ -125,24 +119,18 @@ static inline void xfs_dir2_sf_put_inumb XFS_PUT_DIR_INO8(*(from), (to)->i8); } -#define XFS_DIR2_SF_GET_OFFSET(sfep) \ - xfs_dir2_sf_get_offset(sfep) static inline xfs_dir2_data_aoff_t xfs_dir2_sf_get_offset(xfs_dir2_sf_entry_t *sfep) { return INT_GET_UNALIGNED_16_BE(&(sfep)->offset.i); } -#define XFS_DIR2_SF_PUT_OFFSET(sfep,off) \ - xfs_dir2_sf_put_offset(sfep,off) static inline void xfs_dir2_sf_put_offset(xfs_dir2_sf_entry_t *sfep, xfs_dir2_data_aoff_t off) { INT_SET_UNALIGNED_16_BE(&(sfep)->offset.i, off); } -#define XFS_DIR2_SF_ENTSIZE_BYNAME(sfp,len) \ - xfs_dir2_sf_entsize_byname(sfp,len) static inline int 
xfs_dir2_sf_entsize_byname(xfs_dir2_sf_t *sfp, int len) { return ((uint)sizeof(xfs_dir2_sf_entry_t) - 1 + (len) - \ @@ -150,8 +138,6 @@ static inline int xfs_dir2_sf_entsize_by ((uint)sizeof(xfs_dir2_ino8_t) - (uint)sizeof(xfs_dir2_ino4_t))); } -#define XFS_DIR2_SF_ENTSIZE_BYENTRY(sfp,sfep) \ - xfs_dir2_sf_entsize_byentry(sfp,sfep) static inline int xfs_dir2_sf_entsize_byentry(xfs_dir2_sf_t *sfp, xfs_dir2_sf_entry_t *sfep) { @@ -160,19 +146,17 @@ xfs_dir2_sf_entsize_byentry(xfs_dir2_sf_ ((uint)sizeof(xfs_dir2_ino8_t) - (uint)sizeof(xfs_dir2_ino4_t))); } -#define XFS_DIR2_SF_FIRSTENTRY(sfp) xfs_dir2_sf_firstentry(sfp) static inline xfs_dir2_sf_entry_t *xfs_dir2_sf_firstentry(xfs_dir2_sf_t *sfp) { return ((xfs_dir2_sf_entry_t *) \ - ((char *)(sfp) + XFS_DIR2_SF_HDR_SIZE(sfp->hdr.i8count))); + ((char *)(sfp) + xfs_dir2_sf_hdr_size(sfp->hdr.i8count))); } -#define XFS_DIR2_SF_NEXTENTRY(sfp,sfep) xfs_dir2_sf_nextentry(sfp,sfep) static inline xfs_dir2_sf_entry_t * xfs_dir2_sf_nextentry(xfs_dir2_sf_t *sfp, xfs_dir2_sf_entry_t *sfep) { return ((xfs_dir2_sf_entry_t *) \ - ((char *)(sfep) + XFS_DIR2_SF_ENTSIZE_BYENTRY(sfp,sfep))); + ((char *)(sfep) + xfs_dir2_sf_entsize_byentry(sfp,sfep))); } /* From owner-xfs@oss.sgi.com Wed Apr 18 16:03:56 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 16:03:59 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3IN3sfB029878 for ; Wed, 18 Apr 2007 16:03:56 -0700 Received: from localhost.adilger.int (S01060004e23cfc51.cg.shawcable.net [68.147.252.160]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 32A134E457A; Wed, 18 Apr 2007 17:03:53 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 7158B4141; Wed, 18 Apr 2007 17:03:50 -0600 (MDT) Date: Wed, 18 Apr 2007 17:03:50 -0600 From: Andreas Dilger To: Timothy Shimmin 
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070418230349.GJ5967@schatzie.adilger.int> Mail-Followup-To: Timothy Shimmin , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070412110550.GM5967@schatzie.adilger.int> <31588A06562720FE1E0F93DF@timothy-shimmins-power-mac-g5.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <31588A06562720FE1E0F93DF@timothy-shimmins-power-mac-g5.local> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11110 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On Apr 16, 2007 18:01 +1000, Timothy Shimmin wrote: > --On 12 April 2007 5:05:50 AM -0600 Andreas Dilger > wrote: > >struct fiemap_extent { > > __u64 fe_start; /* starting offset in bytes */ > > __u64 fe_len; /* length in bytes */ > >} > > > >struct fiemap { > > struct fiemap_extent fm_start; /* offset, length of desired mapping > > */ > > __u32 fm_extent_count; /* number of extents in array */ > > __u32 fm_flags; /* flags (similar to > > XFS_IOC_GETBMAP) */ > > __u64 unused; > > struct fiemap_extent fm_extents[0]; > >} > > > ># define FIEMAP_LEN_MASK 0xff000000000000 > ># define FIEMAP_LEN_HOLE 0x01000000000000 > ># define FIEMAP_LEN_UNWRITTEN 0x02000000000000 > > > >All offsets are in bytes to allow cases where filesystems are not going > >block-aligned/sized allocations (e.g. tail packing). The fm_extents array > >returned contains the packed list of allocation extents for the file, > >including entries for holes (which have fe_start == 0, and a flag). 
> >
> >The ->fm_extents[] array includes all of the holes in addition to
> >allocated extents because this avoids the need to return both the logical
> >and physical address for every extent and does not make processing any
> >harder.
>
> Well, that's what stood out for me. I was wondering where the "fe_block"
> field had gone - the "physical address".
> So is your "fe_start; /* starting offset */" actually the disk location
> (not a logical file offset)
> _except_ in the header (fiemap) where it is the desired logical offset.

Correct. The fm_extent in the request contains the logical start offset
and length in bytes of the requested fiemap region. In the returned header
it represents the logical start offset of the extent that contained the
requested start offset, and the logical length of all the returned extents.
I haven't decided whether the returned length should be until EOF, or have
the "virtual hole" at the end of the file. I think EOF makes more sense.

The fe_start + fe_len in the fm_extents represent the physical location on
the block device for that extent. fm_extent[i].fe_start (per Anton) is
undefined if FIEMAP_LEN_HOLE is set, and .fe_len is the length of the hole.

> Okay, looking at your example use below that's what it looks like.
> And when you refer to fm_start below, you mean fm_start.fe_start?
> Sorry, I realise this is just an approximation but this part confused me.

Right, I'll write up a new RFC based on feedback here, and correcting the
various errors in the original proposal.

> So you get rid of all the logical file offsets in the extents because we
> report holes explicitly (and we know everything is contiguous if you
> include the holes).

Correct. It saves space in the common case.

> >Caller works something like:
> >
> >        char buf[4096];
> >        struct fiemap *fm = (struct fiemap *)buf;
> >        int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);
> >
> >        fm->fm_start.fe_start = 0; /* start of file */
> >        fm->fm_start.fe_len = -1; /* end of file */
> >        fm->fm_extent_count = count; /* max extents in fm_extents[] array */
> >        fm->fm_flags = 0; /* maybe "no DMAPI", etc like XFS */
> >
> >        fd = open(path, O_RDONLY);
> >        printf("logical\t\tphysical\t\tbytes\n");
> >
> >        /* The last entry will have less extents than the maximum */
> >        while (fm->fm_extent_count == count) {
> >                rc = ioctl(fd, FIEMAP, fm);
> >                if (rc)
> >                        break;
> >
> >                /* kernel filled in fm_extents[] array, set fm_extent_count
> >                 * to be actual number of extents returned, leaves
> >                 * fm_start.fe_start alone (unlike XFS_IOC_GETBMAP). */
> >
> >                for (i = 0; i < fm->fm_extent_count; i++) {
> >                        __u64 len = fm->fm_extents[i].fe_len &
> >                        FIEMAP_LEN_MASK;
> >                        __u64 fm_next = fm->fm_start.fe_start + len;
> >                        int hole = fm->fm_extents[i].fe_len &
> >                        FIEMAP_LEN_HOLE;
> >                        int unwr = fm->fm_extents[i].fe_len &
> >                        FIEMAP_LEN_UNWRITTEN;
> >
> >                        printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
> >                                fm->fm_start.fe_start, fm_next - 1,
> >                                hole ? 0 : fm->fm_extents[i].fe_start,
> >                                hole ? 0 : fm->fm_extents[i].fe_start +
> >                                           fm->fm_extents[i].fe_len - 1,
> >                                len, hole ? "(hole) " : "",
> >                                unwr ? "(unwritten) " : "");
> >
> >                        /* get ready for printing next extent, or next ioctl
> >                        */
> >                        fm->fm_start.fe_start = fm_next;
> >                }
> >        }
> >

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

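The original proposal packs the per-extent flags into the high bits of fe_len. A minimal sketch of how a caller might separate the two, assuming the byte length is whatever lies outside FIEMAP_LEN_MASK (the quoted example masks the other way round, so this is one reading of the intent, not the author's code). The constants are copied as posted; with 14 hex digits the mask covers bits 48-55 of the 64-bit word. The helper names are hypothetical.

```c
#include <stdint.h>

/* Constants exactly as posted in the original RFC. */
#define FIEMAP_LEN_MASK      0xff000000000000ULL
#define FIEMAP_LEN_HOLE      0x01000000000000ULL
#define FIEMAP_LEN_UNWRITTEN 0x02000000000000ULL

/* Hypothetical helper: byte length with the flag bits cleared. */
static inline uint64_t fe_len_bytes(uint64_t fe_len)
{
	return fe_len & ~FIEMAP_LEN_MASK;
}

/*
 * Hypothetical helpers: test a flag by comparing against zero.
 * Assigning the masked 64-bit value straight to an int would lose
 * the bit, because bit 48 does not fit in a 32-bit int.
 */
static inline int fe_len_is_hole(uint64_t fe_len)
{
	return (fe_len & FIEMAP_LEN_HOLE) != 0;
}

static inline int fe_len_is_unwritten(uint64_t fe_len)
{
	return (fe_len & FIEMAP_LEN_UNWRITTEN) != 0;
}

/* Hypothetical helper: combine a byte length and flag bits into fe_len. */
static inline uint64_t fe_len_pack(uint64_t bytes, uint64_t flags)
{
	return (bytes & ~FIEMAP_LEN_MASK) | (flags & FIEMAP_LEN_MASK);
}
```

With this encoding an extent length is limited to 48 bits, which is one of the costs of sharing the word with the flags.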
From owner-xfs@oss.sgi.com Wed Apr 18 16:51:12 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 16:51:14 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3INp8fB014489 for ; Wed, 18 Apr 2007 16:51:10 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA12048; Thu, 19 Apr 2007 09:51:00 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3INowAf70438188; Thu, 19 Apr 2007 09:50:58 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3INouZT70152189; Thu, 19 Apr 2007 09:50:56 +1000 (AEST) Date: Thu, 19 Apr 2007 09:50:56 +1000 From: David Chinner To: Christoph Hellwig Cc: xfs@oss.sgi.com Subject: Re: [PATCH] remove various useless min/max macros Message-ID: <20070418235056.GJ48531920@melbourne.sgi.com> References: <20070418175730.GA18315@lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070418175730.GA18315@lst.de> User-Agent: Mutt/1.4.2.1i X-archive-position: 11111 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, Apr 18, 2007 at 07:57:30PM +0200, Christoph Hellwig wrote: > xfs_btree.h has various macros to calculate a min/max after casting > it's arguments to a specific type. This can be done much simpler > by using min_t/max_t with the type as first argument. Sure, but I NACKed that last October for good reason. 
http://marc.info/?t=116116017600003&r=1&w=2 Specifically: http://marc.info/?l=linux-kernel&m=116122285309389&w=2 I still have no objection to changing the implementation of these macros or even changing them to non-shouting static inlines but I don't want them removed.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed Apr 18 16:57:53 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 16:57:57 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3INvofB016054 for ; Wed, 18 Apr 2007 16:57:52 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA12162; Thu, 19 Apr 2007 09:57:45 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3INvhAf69889917; Thu, 19 Apr 2007 09:57:43 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3INvfAQ69832729; Thu, 19 Apr 2007 09:57:41 +1000 (AEST) Date: Thu, 19 Apr 2007 09:57:41 +1000 From: David Chinner To: Christoph Hellwig Cc: xfs@oss.sgi.com Subject: Re: [PATCH] kill macro noise in xfs_dir2*.h Message-ID: <20070418235741.GK48531920@melbourne.sgi.com> References: <20070418175859.GB18315@lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070418175859.GB18315@lst.de> User-Agent: Mutt/1.4.2.1i X-archive-position: 11112 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, Apr 18, 2007 at 07:59:00PM +0200, Christoph Hellwig wrote: > Remove all the macros that just give inline functions uppercase names. 
> > Signed-off-by: Christoph Hellwig Added to my QA tree. Thanks, Christoph. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed Apr 18 17:09:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 17:09:57 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J09qfB019240 for ; Wed, 18 Apr 2007 17:09:53 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA12511; Thu, 19 Apr 2007 10:09:44 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3J09gAf69610411; Thu, 19 Apr 2007 10:09:42 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3J09ekX69103188; Thu, 19 Apr 2007 10:09:40 +1000 (AEST) Date: Thu, 19 Apr 2007 10:09:40 +1000 From: David Chinner To: Christoph Hellwig Cc: xfs@oss.sgi.com Subject: Re: [PATCH] kill macro noise in xfs_dir2*.h Message-ID: <20070419000940.GL48531920@melbourne.sgi.com> References: <20070418175859.GB18315@lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070418175859.GB18315@lst.de> User-Agent: Mutt/1.4.2.1i X-archive-position: 11113 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, Apr 18, 2007 at 07:59:00PM +0200, Christoph Hellwig wrote: > Remove all the macros that just give inline functions uppercase names. > > Signed-off-by: Christoph Hellwig BTW, you'll need this patch to make debug kernels build.... Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/xfsidbg.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfsidbg.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfsidbg.c 2007-03-30 09:30:01.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfsidbg.c 2007-04-19 10:02:29.565671598 +1000 @@ -5490,7 +5490,7 @@ xfs_dir2data(void *addr, int size) /* XFS_DIR2_BLOCK_TAIL_P */ tail = (xfs_dir2_block_tail_t *) ((char *)bb + size - sizeof(xfs_dir2_block_tail_t)); - l = XFS_DIR2_BLOCK_LEAF_P(tail); + l = xfs_dir2_block_leaf_p(tail); t = (char *)l; } for (p = (char *)(h + 1); p < t; ) { @@ -5500,7 +5500,7 @@ xfs_dir2data(void *addr, int size) (unsigned long) (p - (char *)addr), INT_GET(u->freetag, ARCH_CONVERT), INT_GET(u->length, ARCH_CONVERT), - INT_GET(*XFS_DIR2_DATA_UNUSED_TAG_P(u), ARCH_CONVERT)); + INT_GET(*xfs_dir2_data_unused_tag_p(u), ARCH_CONVERT)); p += INT_GET(u->length, ARCH_CONVERT); continue; } @@ -5511,8 +5511,8 @@ xfs_dir2data(void *addr, int size) e->namelen); for (k = 0; k < e->namelen; k++) kdb_printf("%c", e->name[k]); - kdb_printf("\" tag 0x%x\n", INT_GET(*XFS_DIR2_DATA_ENTRY_TAG_P(e), ARCH_CONVERT)); - p += XFS_DIR2_DATA_ENTSIZE(e->namelen); + kdb_printf("\" tag 0x%x\n", INT_GET(*xfs_dir2_data_entry_tag_p(e), ARCH_CONVERT)); + p += xfs_dir2_data_entsize(e->namelen); } if (INT_GET(h->magic, ARCH_CONVERT) == XFS_DIR2_DATA_MAGIC) return; @@ -5557,7 +5557,7 @@ xfs_dir2leaf(xfs_dir2_leaf_t *leaf, int return; /* XFS_DIR2_LEAF_TAIL_P */ t = (xfs_dir2_leaf_tail_t *)((char *)leaf + size - sizeof(*t)); - b = XFS_DIR2_LEAF_BESTS_P(t); + b = xfs_dir2_leaf_bests_p(t); for (j = 0; j < INT_GET(t->bestcount, ARCH_CONVERT); j++, b++) { kdb_printf("0x%lx best %d 0x%x\n", (unsigned long) ((char *)b - (char *)leaf), j, @@ -5578,19 +5578,19 @@ xfsidbg_xdir2sf(xfs_dir2_sf_t *s) int i, j; sfh = &s->hdr; - ino = XFS_DIR2_SF_GET_INUMBER(s, 
&sfh->parent); + ino = xfs_dir2_sf_get_inumber(s, &sfh->parent); kdb_printf("hdr count %d i8count %d parent %llu\n", sfh->count, sfh->i8count, (unsigned long long) ino); - for (i = 0, sfe = XFS_DIR2_SF_FIRSTENTRY(s); i < sfh->count; i++) { - ino = XFS_DIR2_SF_GET_INUMBER(s, XFS_DIR2_SF_INUMBERP(sfe)); + for (i = 0, sfe = xfs_dir2_sf_firstentry(s); i < sfh->count; i++) { + ino = xfs_dir2_sf_get_inumber(s, xfs_dir2_sf_inumberp(sfe)); kdb_printf("entry %d inumber %llu offset 0x%x namelen %d name \"", i, (unsigned long long) ino, - XFS_DIR2_SF_GET_OFFSET(sfe), + xfs_dir2_sf_get_offset(sfe), sfe->namelen); for (j = 0; j < sfe->namelen; j++) kdb_printf("%c", sfe->name[j]); kdb_printf("\"\n"); - sfe = XFS_DIR2_SF_NEXTENTRY(s, sfe); + sfe = xfs_dir2_sf_nextentry(s, sfe); } } From owner-xfs@oss.sgi.com Wed Apr 18 17:21:47 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 17:21:50 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3J0LkfB023372 for ; Wed, 18 Apr 2007 17:21:47 -0700 Received: from localhost.adilger.int (S01060004e23cfc51.cg.shawcable.net [68.147.252.160]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 8472C4E457A; Wed, 18 Apr 2007 18:21:41 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id AF2124141; Wed, 18 Apr 2007 18:21:39 -0600 (MDT) Date: Wed, 18 Apr 2007 18:21:39 -0600 From: Andreas Dilger To: David Chinner Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070419002139.GK5967@schatzie.adilger.int> Mail-Followup-To: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070412110550.GM5967@schatzie.adilger.int> 
<20070416112252.GJ48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070416112252.GJ48531920@melbourne.sgi.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11114 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On Apr 16, 2007 21:22 +1000, David Chinner wrote: > On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote: > > struct fiemap_extent { > > __u64 fe_start; /* starting offset in bytes */ > > __u64 fe_len; /* length in bytes */ > > } > > > > struct fiemap { > > struct fiemap_extent fm_start; /* offset, length of desired mapping */ > > __u32 fm_extent_count; /* number of extents in array */ > > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ > > __u64 unused; > > struct fiemap_extent fm_extents[0]; > > } > > > > #define FIEMAP_LEN_MASK 0xff000000000000 > > #define FIEMAP_LEN_HOLE 0x01000000000000 > > #define FIEMAP_LEN_UNWRITTEN 0x02000000000000 > > I'm not sure I like stealing bits from the length to use a flags - > I'd prefer an explicit field per fiemap_extent for this. Christoph expressed the same concern. I'm not dead set against having an extra 8 bytes per extent (32-bit flags, 32-bit reserved), though it may mean the need for 50% more ioctls if the file is large. 

Below is an aggregation of the comments in this thread:

struct fiemap_extent {
        __u64 fe_start; /* starting offset in bytes */
        __u64 fe_len;   /* length in bytes */
        __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
        __u32 fe_lun;   /* logical storage device number in array */
}

struct fiemap {
        __u64 fm_start;         /* logical start offset of mapping (in/out) */
        __u64 fm_len;           /* logical length of mapping (in/out) */
        __u32 fm_flags;         /* FIEMAP_FLAG_* flags for request (in/out) */
        __u32 fm_extent_count;  /* number of extents in fm_extents (in/out) */
        __u64 fm_unused;
        struct fiemap_extent fm_extents[0];
}

/* flags for the fiemap request */
#define FIEMAP_FLAG_SYNC        0x00000001 /* flush delalloc data to disk*/
#define FIEMAP_FLAG_HSM_READ    0x00000002 /* retrieve data from HSM */
#define FIEMAP_FLAG_INCOMPAT    0xff000000 /* must understand these flags*/

/* flags for the returned extents */
#define FIEMAP_EXTENT_HOLE      0x00000001 /* no space allocated */
#define FIEMAP_EXTENT_UNWRITTEN 0x00000002 /* uninitialized space */
#define FIEMAP_EXTENT_UNKNOWN   0x00000004 /* in use, location unknown */
#define FIEMAP_EXTENT_ERROR     0x00000008 /* error mapping space */
#define FIEMAP_EXTENT_NO_DIRECT 0x00000010 /* no direct data access */

SUMMARY OF CHANGES
==================
- use fm_* fields directly in request instead of making it a fiemap_extent
  (though they are laid out identically)

- separate flags word for fm_flags:
  - FIEMAP_FLAG_SYNC = range should be synced to disk before returning
    mapping, may return FIEMAP_EXTENT_UNKNOWN for delalloc writes otherwise
  - FIEMAP_FLAG_HSM_READ = force retrieval + mapping from HSM if specified
    (this has the opposite meaning of XFS's BMV_IF_NO_DMAPI_READ flag)
  - FIEMAP_FLAG_XATTR = omitted for now, can address that in the future
    if there is agreement on whether that is desirable to have or if it is
    better to call ioctl(FIEMAP) on an XATTR fd.
  - FIEMAP_FLAG_INCOMPAT = if flags are set in this mask in request, kernel
    must understand them, or fail ioctl with e.g. EOPNOTSUPP, so that we
    don't request e.g. FIEMAP_FLAG_XATTR and kernel ignores it

- __u64 fm_unused does not take up an extra space on all power-of-two buffer
  sizes (would otherwise be at end of buffer), and may be handy in the future.

- add separate fe_flags word with flags from various suggestions:
  - FIEMAP_EXTENT_HOLE = extent has no space allocation
  - FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data
  - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown
    (e.g. HSM, delalloc awaiting sync, etc)
  - FIEMAP_EXTENT_ERROR = error mapping extent. Should fe_lun == errno?
  - FIEMAP_EXTENT_NO_DIRECT = data cannot be directly accessed (e.g. data
    encrypted, compressed, etc), may want separate flags for these?

- add new fe_lun word per extent for filesystems that manage multiple devices
  (e.g. OCFS, GFS, ZFS, Lustre). This would otherwise have been unused.

> Given that xfs_bmap uses extra information from the filesystem
> (geometry) to display extra (and frequently used) information
> about the alignment of extents. ie:
>
> chook 681% xfs_bmap -vv fred
> fred:
>  EXT: FILE-OFFSET  BLOCK-RANGE            AG AG-OFFSET          TOTAL FLAGS
>    0: [0..151]:    288444888..288445039    8 (1696536..1696687)   152 00010
>  FLAG Values:
>     010000 Unwritten preallocated extent
>     001000 Doesn't begin on stripe unit
>     000100 Doesn't end on stripe unit
>     000010 Doesn't begin on stripe width
>     000001 Doesn't end on stripe width

Can you clarify the terminology here? What is a "stripe unit" and what is
a "stripe width"? Are there "N * stripe_unit = stripe_width" in e.g. a
RAID 5 (N+1) array, or N-disk RAID 0? Maybe vice versa?

I don't mind adding this, as long as it's clear that some filesystems don't
have this kind of information.

> This information could be easily passed up in the flags fields if the
> filesystem has geometry information (there go 4 more flags ;).

Got lots of flag bits now.

> Also - what are the explicit sync semantics of this ioctl? The
> XFS ioctl causes a fsync of the file first to convert delalloc
> extents to real extents before returning the bmap. Is this functionality
> going to be the same? If not, then we need a DELALLOC flag to indicate
> extents that haven't been allocated yet. This might be handy to
> have, anyway....

Have added a FIEMAP_FLAG_SYNC on the request to sync if applications care,
and FIEMAP_EXTENT_UNKNOWN can handle unmapped extents for delalloc.

> > The fm_extents array
> > returned contains the packed list of allocation extents for the file,
> > including entries for holes (which have fe_start == 0, and a flag).
>
> Internally in XFS, we pass these around as:
>
> #define DELAYSTARTBLOCK ((xfs_fsblock_t)-1LL)
> #define HOLESTARTBLOCK ((xfs_fsblock_t)-2LL)

We could do this too, instead of having flags, but many of the proposed
flags are orthogonal so we'd end up needing a lot of separate values here
and it would just degenerate into the FIEMAP_LEN_MASK I previously suggested.

> > required expanding the per-extent struct from 32 to 48 bytes per extent,
>
> not sure I follow your maths here?

That was the case for XFS getbmap vs. getbmapx. For FIEMAP it increases the
extent size from 16 to 24 bytes.

> > Caller works something like:
> >
> >         char buf[4096];
> >         struct fiemap *fm = (struct fiemap *)buf;
> >         int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);
> >
> >         fm->fm_start.fe_start = 0; /* start of file */
> >         fm->fm_start.fe_len = -1; /* end of file */
> >         fm->fm_extent_count = count; /* max extents in fm_extents[] array */
> >         fm->fm_flags = 0; /* maybe "no DMAPI", etc like XFS */
> >
> >         fd = open(path, O_RDONLY);
> >         printf("logical\t\tphysical\t\tbytes\n");
> >
> >         /* The last entry will have less extents than the maximum */
> >         while (fm->fm_extent_count == count) {
>
> fm_extent_count is an in/out parameter?

Correct.

>
> >                 rc = ioctl(fd, FIEMAP, fm);
> >                 if (rc)
> >                         break;
> >
> >                 /* kernel filled in fm_extents[] array, set fm_extent_count
> >                  * to be actual number of extents returned, leaves fm_start
> >                  * alone (unlike XFS_IOC_GETBMAP). */
>
> Ok, it is.
>
> >                 for (i = 0; i < fm->fm_extent_count; i++) {
> >                         __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
> >                         __u64 fm_next = fm->fm_start.fe_start + len;
> >                         int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
> >                         int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;
> >
> >                         printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
> >                                 fm->fm_start.fe_start, fm_next - 1,
> >                                 hole ? 0 : fm->fm_extents[i].fe_start,
> >                                 hole ? 0 : fm->fm_extents[i].fe_start +
> >                                            fm->fm_extents[i].fe_len - 1,
> >                                 len, hole ? "(hole) " : "",
> >                                 unwr ? "(unwritten) " : "");
> >
> >                         /* get ready for printing next extent, or next ioctl */
> >                         fm->fm_start.fe_start = fm_next;
>
> Ok, so the only way you can determine where you are in the file
> is by adding up the length of each extent. What happens if the file
> is changing underneath you e.g. someone punches out a hole
> in the file, or truncates and extends it again between ioctl()
> calls?

Well, that is always true with data once it is out of the caller.

> Also, what happens if you ask for an offset/len that doesn't map to
> any extent boundaries - are you truncating the extents returned to
> the off/len passed in?

The request offset will be returned as the start of the actual extent that
it falls inside. And the returned extents will end with the extent that
ends at or after the requested fm_start + fm_len.

> xfs_bmap gets around this by finding out how many extents there are in the
> file and allocating a buffer that big to hold all the extents so they
> are gathered in a single atomic call (think sparse matrix files)....

Yeah, except this might be persistent for a long time if it isn't fully
read with a single ioctl and the app never continues reading but doesn't
close the fd.

> > I'm not wedded to an ioctl interface, but it seems consistent with FIBMAP.
> > I'm quite open to suggestions at this point, both in terms of how usable
> > the fiemap data structures are by the caller, and if we need to add anything
> > to make them more flexible for the future.
>
> ioctl is fine by me. perhaps a version number in the structure header
> would be handy so we can modify the interface easily in the future
> without having to worry about breaking userspace....

Yeah, but premature optimization and such. Would rather have INCOMPAT
flags instead of version numbers.

> > In terms of implementing this in the kernel, there was originally code for
> > this during the development of the ext3 extent patches and it was done via
> > a callback in the extent tree iterator so it is very efficient. I believe
> > it implements all that is needed to allow this interface to be mapped
> > onto XFS_IOC_BMAP internally (or vice versa).
>
> I wouldn't map the ioctls - I'd just write another interface to
> xfs_getbmap(). That way we could eventually get rid of the XFS_IOC_BMAP
> interface. is there any code yet?

Up to you, I was just suggesting "mapping" in the generic sense. The
flags and values would all have to be changed anyways.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

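A minimal sketch of how a caller might consume the revised layout, assuming the structs aggregated in this thread (this is the RFC's proposed layout, not a merged kernel ABI). Because holes are reported explicitly, the logical offset of each extent is just a running sum of fe_len starting from the returned fm_start. The helper names and the sample data in the usage note are hypothetical; no ioctl is issued here.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Layout as proposed in the thread, using fixed-width userspace types. */
struct fiemap_extent {
	uint64_t fe_start;	/* physical start in bytes (undefined for holes) */
	uint64_t fe_len;	/* length in bytes */
	uint32_t fe_flags;	/* FIEMAP_EXTENT_* flags for this extent */
	uint32_t fe_lun;	/* logical storage device number */
};

struct fiemap {
	uint64_t fm_start;
	uint64_t fm_len;
	uint32_t fm_flags;
	uint32_t fm_extent_count;
	uint64_t fm_unused;
	struct fiemap_extent fm_extents[];
};

#define FIEMAP_EXTENT_HOLE      0x00000001
#define FIEMAP_EXTENT_UNWRITTEN 0x00000002

/* Hypothetical helper: number of extents that fit in a caller buffer. */
static size_t fiemap_extents_per_buffer(size_t bufsize)
{
	if (bufsize < sizeof(struct fiemap))
		return 0;
	return (bufsize - sizeof(struct fiemap)) / sizeof(struct fiemap_extent);
}

/*
 * Hypothetical helper: print one line per extent and return the logical
 * offset just past the last one, i.e. the start for a follow-up request.
 */
static uint64_t fiemap_walk(uint64_t fm_start, const struct fiemap_extent *ext,
			    uint32_t count, FILE *out)
{
	uint64_t logical = fm_start;
	uint32_t i;

	for (i = 0; i < count; i++) {
		int hole = (ext[i].fe_flags & FIEMAP_EXTENT_HOLE) != 0;
		int unwr = (ext[i].fe_flags & FIEMAP_EXTENT_UNWRITTEN) != 0;

		fprintf(out, "%" PRIu64 "+%" PRIu64 "\t", logical, ext[i].fe_len);
		if (hole)
			fprintf(out, "(hole)\n");
		else
			fprintf(out, "phys %" PRIu64 " lun %" PRIu32 "%s\n",
				ext[i].fe_start, ext[i].fe_lun,
				unwr ? " (unwritten)" : "");

		/* Extents are contiguous in logical space once holes are included. */
		logical += ext[i].fe_len;
	}
	return logical;
}
```

On common ABIs the extent is 24 bytes and the header 32, so a 4096-byte buffer holds 169 extents, against 254 with the original 16-byte extent. That ratio is where the "50% more ioctls" estimate comes from. Walking a 4096-byte data extent, an 8192-byte hole and a 4096-byte unwritten extent from offset 0 returns 16384.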
From owner-xfs@oss.sgi.com Wed Apr 18 18:54:43 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 18:54:46 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J1sdfB017095 for ; Wed, 18 Apr 2007 18:54:41 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA14925; Thu, 19 Apr 2007 11:54:29 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3J1sSAf70586857; Thu, 19 Apr 2007 11:54:28 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3J1sQCO70511966; Thu, 19 Apr 2007 11:54:26 +1000 (AEST) Date: Thu, 19 Apr 2007 11:54:26 +1000 From: David Chinner To: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070419015426.GM48531920@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070419002139.GK5967@schatzie.adilger.int> User-Agent: Mutt/1.4.2.1i X-archive-position: 11115 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, Apr 18, 2007 at 06:21:39PM -0600, Andreas Dilger wrote: > On Apr 16, 2007 21:22 +1000, David Chinner wrote: > > On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote: > > > struct fiemap_extent { > > > __u64 fe_start; /* starting offset in bytes */ > > > __u64 fe_len; /* length in bytes */ > > > } > > > > > > struct 
fiemap { > > > struct fiemap_extent fm_start; /* offset, length of desired mapping */ > > > __u32 fm_extent_count; /* number of extents in array */ > > > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ > > > __u64 unused; > > > struct fiemap_extent fm_extents[0]; > > > } > > > > > > #define FIEMAP_LEN_MASK 0xff000000000000 > > > #define FIEMAP_LEN_HOLE 0x01000000000000 > > > #define FIEMAP_LEN_UNWRITTEN 0x02000000000000 > > > > I'm not sure I like stealing bits from the length to use a flags - > > I'd prefer an explicit field per fiemap_extent for this. > > Christoph expressed the same concern. I'm not dead set against having an > extra 8 bytes per extent (32-bit flags, 32-bit reserved), though it may > mean the need for 50% more ioctls if the file is large. I don't think this overhead is a huge problem - just pass in a larger buffer (e.g. xfs_bmap can ask for thousands of extents in a single ioctl call as we can extract the number of extents in an inode via XFS_IOC_FSGETXATTRA). > Below is an aggregation of the comments in this thread: > > struct fiemap_extent { > __u64 fe_start; /* starting offset in bytes */ > __u64 fe_len; /* length in bytes */ > __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ > __u32 fe_lun; /* logical storage device number in array */ > } Oh, I missed the bit about the fe_lun - I was thinking something like that might be useful in future.... 
> struct fiemap { > __u64 fm_start; /* logical start offset of mapping (in/out) */ > __u64 fm_len; /* logical length of mapping (in/out) */ > __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ > __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */ > __u64 fm_unused; > struct fiemap_extent fm_extents[0]; > } > > /* flags for the fiemap request */ > #define FIEMAP_FLAG_SYNC 0x00000001 /* flush delalloc data to disk*/ > #define FIEMAP_FLAG_HSM_READ 0x00000002 /* retrieve data from HSM */ > #define FIEMAP_FLAG_INCOMPAT 0xff000000 /* must understand these flags*/ No flags in the INCOMPAT range - shouldn't it be 0x3 at this point? > /* flags for the returned extents */ > #define FIEMAP_EXTENT_HOLE 0x00000001 /* no space allocated */ > #define FIEMAP_EXTENT_UNWRITTEN 0x00000002 /* uninitialized space */ > #define FIEMAP_EXTENT_UNKNOWN 0x00000004 /* in use, location unknown */ > #define FIEMAP_EXTENT_ERROR 0x00000008 /* error mapping space */ > #define FIEMAP_EXTENT_NO_DIRECT 0x00000010 /* no direct data access */ SO, there's a HSM_READ flag above. If we are going to make this interface useful for filesystems that have HSMs interacting with their extents, the HSM needs to be able to query whether the extent is online (on disk), has been migrated offline (on tape) or in dual-state (i.e. both online and offline). 
> SUMMARY OF CHANGES > ================== > - use fm_* fields directly in request instead of making it a fiemap_extent > (though they are layed out identically) > > - separate flags word for fm_flags: > - FIEMAP_FLAG_SYNC = range should be synced to disk before returning > mapping, may return FIEMAP_EXTENT_UNKNOWN for delalloc writes otherwise > - FIEMAP_FLAG_HSM_READ = force retrieval + mapping from HSM if specified > (this has the opposite meaning of XFS's BMV_IF_NO_DMAPI_READ flag) > - FIEMAP_FLAG_XATTR = omitted for now, can address that in the future > if there is agreement on whether that is desirable to have or if it is > better to call ioctl(FIEMAP) on an XATTR fd. > - FIEMAP_FLAG_INCOMPAT = if flags are set in this mask in request, kernel > must understand them, or fail ioctl with e.g. EOPNOTSUPP, so that we > don't request e.g. FIEMAP_FLAG_XATTR and kernel ignores it > > - __u64 fm_unused does not take up an extra space on all power-of-two buffer > sizes (would otherwise be at end of buffer), and may be handy in the future. > > - add separate fe_flags word with flags from various suggestions: > - FIEMAP_EXTENT_HOLE = extent has no space allocation > - FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data > - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown > (e.g. HSM, delalloc awaiting sync, etc) I'd like an explicit delalloc flag, not lumping it in with "unknown". we *know* the extent is delalloc ;) > - FIEMAP_EXTENT_ERROR = error mapping extent. Should fe_lun == errno? > - FIEMAP_EXTENT_NO_DIRECT = data cannot be directly accessed (e.g. data > encrypted, compressed, etc), may want separate flags for these? > > - add new fe_lun word per extent for filesystems that manage multiple devices > (e.g. OCFS, GFS, ZFS, Lustre). This would otherwise have been unused. 
>
>
> > Given that xfs_bmap uses extra information from the filesystem
> > (geometry) to display extra (and frequently used) information
> > about the alignment of extents. ie:
> >
> > chook 681% xfs_bmap -vv fred
> > fred:
> >  EXT: FILE-OFFSET   BLOCK-RANGE           AG AG-OFFSET           TOTAL FLAGS
> >    0: [0..151]:     288444888..288445039   8 (1696536..1696687)    152 00010
> >  FLAG Values:
> >     010000 Unwritten preallocated extent
> >     001000 Doesn't begin on stripe unit
> >     000100 Doesn't end on stripe unit
> >     000010 Doesn't begin on stripe width
> >     000001 Doesn't end on stripe width
>
> Can you clarify the terminology here? What is a "stripe unit" and what is
> a "stripe width"?

Stripe unit is the equivalent of the chunk size in an MD RAID. It's the
amount of data that is written to each lun in a stripe before moving
onto the next stripe element.

> Are there "N * stripe_unit = stripe_width" in e.g. a
> RAID 5 (N+1) array, or N-disk RAID 0? Maybe vice versa?

Yes, on simple configurations. In more complex HW RAID configurations,
we'll typically set the stripe unit to the width of the RAID5 lun
(N * segment size) and the stripe width to the number of luns we've
striped across.

The reason I want this to come out of the filesystem is that one of the
driving factors for multi-device support in XFS is to allow multiple
devices of different geometries to co-exist efficiently in the one
namespace (another reason I'm happy about the fe_lun addition).

Passing this information out with the extent is far simpler than trying
to find what device it lies on from userspace, then querying for the
geometry of that device and then converting it. Especially when extents
could lie on different devices with differing geometries....

> I don't mind adding this, as long as it's clear that some filesystems don't
> have this kind of information.

Sure.

> > This information could be easily passed up in the flags fields if the
> > filesystem has geometry information (there go 4 more flags ;).
>
> Got lots of flag bits now.

Time to start using them all up ;)

> > Also - what are the explicit sync semantics of this ioctl? The
> > XFS ioctl causes a fsync of the file first to convert delalloc
> > extents to real extents before returning the bmap. Is this functionality
> > going to be the same? If not, then we need a DELALLOC flag to indicate
> > extents that haven't been allocated yet. This might be handy to
> > have, anyway....
>
> Have added a FIEMAP_FLAG_SYNC on the request to sync if applications care,

OK.

> and FIEMAP_EXTENT_UNKNOWN can handle unmapped extents for delalloc.

I'd prefer explicit enumeration of them, as I said before...

> > > The fm_extents array
> > > returned contains the packed list of allocation extents for the file,
> > > including entries for holes (which have fe_start == 0, and a flag).
> >
> > Internally in XFS, we pass these around as:
> >
> > #define DELAYSTARTBLOCK ((xfs_fsblock_t)-1LL)
> > #define HOLESTARTBLOCK ((xfs_fsblock_t)-2LL)
>
> We could do this too, instead of having flags, but many of the proposed
> flags are orthogonal so we'd end up needing a lot of separate values here
> and it would just degenerate into the FIEMAP_LEN_MASK I previously suggested.

Yeah, fair enough.

> > > for (i = 0; i < fm->fm_extent_count; i++) {
> > >     __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
> > >     __u64 fm_next = fm->fm_start.fe_start + len;
> > >     int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
> > >     int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;
> > >
> > >     printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
> > >         fm->fm_start.fe_start, fm_next - 1,
> > >         hole ? 0 : fm->fm_extents[i].fe_start,
> > >         hole ? 0 : fm->fm_extents[i].fe_start +
> > >             fm->fm_extents[i].fe_len - 1,
> > >         len, hole ? "(hole) " : "",
> > >         unwr ? "(unwritten) " : "");
> > >
> > >     /* get ready for printing next extent, or next ioctl */
> > >     fm->fm_start.fe_start = fm_next;
> >
> > Ok, so the only way you can determine where you are in the file
> > is by adding up the length of each extent. What happens if the file
> > is changing underneath you e.g. someone punches out a hole
> > in the file, or truncates and extends it again between ioctl()
> > calls?
>
> Well, that is always true with data once it is out of the caller.

Sure, but this interface requires iterative calls where the n+1 call is
reliant on nothing changing since the first call to be accurate. My
question is how do you use this interface to reliably and accurately
get all the extents if you are using iterative summing like this?

> > Also, what happens if you ask for an offset/len that doesn't map to
> > any extent boundaries - are you truncating the extents returned to
> > the off/len passed in?
>
> The request offset will be returned as the start of the actual extent that
> it falls inside. And the returned extents will end with the extent that
> ends at or after the requested fm_start + fm_len.

Ok, so you round the start inwards and round the end outwards. Can you
ensure that this is documented in the header file that describes this
interface?

> > xfs_bmap gets around this by finding out how many extents there are in the
> > file and allocating a buffer that big to hold all the extents so they
> > are gathered in a single atomic call (think sparse matrix files)....
>
> Yeah, except this might be persistent for a long time if it isn't fully
> read with a single ioctl and the app never continues reading but doesn't
> close the fd.

Not sure I follow you here...

Cheers,

Dave.
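
The range semantics described in the reply above (the first returned
extent is the one the request offset falls inside, and the last is the
one that ends at or after fm_start + fm_len) might be sketched like
this. It is an illustration only: struct ext and fiemap_select() are
hypothetical names, and the extent list is assumed to be sorted,
contiguous and to contain entries for holes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent: logical start and length in bytes. */
struct ext {
	uint64_t start;
	uint64_t len;
};

/*
 * Pick the extents covering [req_start, req_start + req_len).
 * Assumes e[] is sorted by start, contiguous (holes are entries too)
 * and that req_len > 0. On success, stores the index of the first and
 * last selected extent and returns 0; returns -1 if req_start lies
 * beyond the last extent.
 */
static int fiemap_select(const struct ext *e, size_t n,
			 uint64_t req_start, uint64_t req_len,
			 size_t *first, size_t *last)
{
	uint64_t req_end = req_start + req_len;
	size_t i;

	/* First extent whose end lies past the requested start. */
	for (i = 0; i < n; i++)
		if (req_start < e[i].start + e[i].len)
			break;
	if (i == n)
		return -1;
	*first = i;

	/* Advance until an extent ends at or after the requested end. */
	while (i + 1 < n && e[i].start + e[i].len < req_end)
		i++;
	*last = i;
	return 0;
}
```

For extents covering bytes 0-4095, 4096-12287 and 12288-16383, a
request for offset 5000 and length 8000 selects the second and third
extents, so the caller gets back 4096-16383 rather than the exact
range it asked for.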
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed Apr 18 20:04:44 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 20:04:48 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J34efB001974 for ; Wed, 18 Apr 2007 20:04:42 -0700 Received: from pc-bnaujok.melbourne.sgi.com (pc-bnaujok.melbourne.sgi.com [134.14.55.58]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id NAA16453 for ; Thu, 19 Apr 2007 13:04:39 +1000 Date: Thu, 19 Apr 2007 13:10:08 +1000 To: xfs@oss.sgi.com Subject: [PATCH] Fix xfs_fsr so the temp dir is not world readable/writable From: "Barry Naujok" Organization: SGI Content-Type: multipart/mixed; boundary=----------Z2DvzjHEAFgw2CcUZZOzh8 MIME-Version: 1.0 Message-ID: User-Agent: Opera Mail/9.10 (Win32) X-archive-position: 11116 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@melbourne.sgi.com Precedence: bulk X-list: xfs ------------Z2DvzjHEAFgw2CcUZZOzh8 Content-Type: text/plain; format=flowed; delsp=yes; charset=iso-8859-15 Content-Transfer-Encoding: 7bit Just changed the ".fsr" directory to 0700. I also improved the usage text to give more information. Barry. 
------------Z2DvzjHEAFgw2CcUZZOzh8 Content-Disposition: attachment; filename=xfs_fsr.patch Content-Type: application/octet-stream; name=xfs_fsr.patch Content-Transfer-Encoding: Base64 LS0tIGEveGZzZHVtcC9mc3IveGZzX2Zzci5jCTIwMDctMDQtMTkgMTM6MDI6 MDMuMDAwMDAwMDAwICsxMDAwCisrKyBiL3hmc2R1bXAvZnNyL3hmc19mc3Iu YwkyMDA3LTA0LTE5IDEyOjM4OjMyLjkzNTMyMjcxMCArMTAwMApAQCAtMTks NyArMTksNyBAQAogLyoKICAqIGZzciAtIGZpbGUgc3lzdGVtIHJlb3JnYW5p emVyCiAgKgotICogZnNyIFstZF0gWy12XSBbLW5dIFstc10gWy1nXSBbLXQg bWluc10gWy1mIGxlZnRmXSBbLW0gbXRhYl0KKyAqIGZzciBbLWRdIFstdl0g Wy1uXSBbLXNdIFstZ10gWy10IHNlY3NdIFstZiBsZWZ0Zl0gWy1tIG10YWJd CiAgKiBmc3IgWy1kXSBbLXZdIFstbl0gWy1zXSBbLWddIHhmc2RldiB8IGRp ciB8IGZpbGUgLi4uCiAgKgogICogSWYgaW52b2tlZCBpbiB0aGUgZmlyc3Qg Zm9ybSBmc3IgZG9lcyB0aGUgZm9sbG93aW5nOiBzdGFydGluZyB3aXRoIHRo ZQpAQCAtMTAwLDcgKzEwMCw3IEBAIHN0YXRpYyBfX2ludDY0X3QJbWluaW11 bWZyZWUgPSAyMDQ4Owogc3RhdGljIHRpbWVfdCBob3dsb25nID0gNzIwMDsJ CS8qIGRlZmF1bHQgc2Vjb25kcyBvZiByZW9yZ2FuaXppbmcgKi8KIHN0YXRp YyBjaGFyICpsZWZ0b2ZmZmlsZSA9ICIvdmFyL3RtcC8uZnNybGFzdF94ZnMi Oy8qIHdoZXJlIHdlIGxlZnQgb2ZmIGxhc3QgKi8KIHN0YXRpYyBjaGFyICpt dGFiID0gTU9VTlRFRDsKLXN0YXRpYyB0aW1lX3QgZW5kdGltZTsKK3N0YXRp YyB0aW1lX3QgZW5kdGltZSA9IDA7CiBzdGF0aWMgdGltZV90IHN0YXJ0dGlt ZTsKIHN0YXRpYyB4ZnNfaW5vX3QJbGVmdG9mZmlubyA9IDA7CiBzdGF0aWMg aW50CXBhZ2VzaXplOwpAQCAtMzU4LDcgKzM1OCwyMSBAQCBtYWluKGludCBh cmdjLCBjaGFyICoqYXJndikKIHZvaWQKIHVzYWdlKGludCByZXQpCiB7Ci0J ZnByaW50ZihzdGRlcnIsIF8oIlVzYWdlOiAlcyBbeGZzZmlsZV0gLi4uXG4i KSwgcHJvZ25hbWUpOworCWZwcmludGYoc3RkZXJyLCBfKAorIlVzYWdlOiAl cyBbLWRdIFstdl0gWy1uXSBbLXNdIFstZ10gWy10IHRpbWVdIFstcCBwYXNz ZXNdIFstZiBsZWZ0Zl0gWy1tIG10YWJdXG4iCisiICAgICAgICVzIFstZF0g Wy12XSBbLW5dIFstc10gWy1nXSB4ZnNkZXYgfCBkaXIgfCBmaWxlIC4uLlxu XG4iCisiT3B0aW9uczpcbiIKKyIgICAgICAgLW4gICAgICAgICAgICAgIERv IG5vdGhpbmcsIG9ubHkgaW50ZXJlc3Rpbmcgd2l0aCAtdi4gTm90XG4iCisi ICAgICAgICAgICAgICAgICAgICAgICBlZmZlY3RpdmUgd2l0aCBpbiBtdGFi IG1vZGUuXG4iCisiICAgICAgIC1zCQlQcmludCBzdGF0aXN0aWNzIG9ubHku 
XG4iCisiICAgICAgIC1nICAgICAgICAgICAgICBQcmludCB0byBzeXNsb2cg KGRlZmF1bHQgaWYgc3Rkb3V0IG5vdCBhIHR0eSkuXG4iCisiICAgICAgIC10 IHRpbWUgICAgICAgICBIb3cgbG9uZyB0byBydW4gaW4gc2Vjb25kcy5cbiIK KyIgICAgICAgLXAgcGFzc2VzCU51bWJlciBvZiBwYXNzZXMgYmVmb3JlIHRl cm1pbmF0aW5nIGdsb2JhbCByZS1vcmcuXG4iCisiICAgICAgIC1mIGxlZnRv ZmYgICAgICBVc2UgdGhpcyBpbnN0ZWFkIG9mIC9ldGMvZnNybGFzdC5cbiIK KyIgICAgICAgLW0gbXRhYiAgICAgICAgIFVzZSBzb21ldGhpbmcgb3RoZXIg dGhhbiAvZXRjL210YWIuXG4iCisiICAgICAgIC1kICAgICAgICAgICAgICBE ZWJ1ZywgcHJpbnQgZXZlbiBtb3JlLlxuIgorIiAgICAgICAtdgkJVmVyYm9z ZSwgbW9yZSAtdidzIG1vcmUgdmVyYm9zZS5cbiIKKwkJKSwgcHJvZ25hbWUs IHByb2duYW1lKTsKIAlleGl0KHJldCk7CiB9CiAKQEAgLTkxNSw3ICs5Mjks NyBAQCBmc3JmaWxlX2NvbW1vbigKIAl9CiAJaWYgKGZzeC5mc3hfeGZsYWdz ICYgWEZTX1hGTEFHX05PREVGUkFHKSB7CiAJCWlmICh2ZmxhZykKLQkJCWZz cnByaW50ZihfKCIlczogbWFya2VkIGFzIGRvbid0IGRlZnJhZywgaWdub3Jp bmdcbiIpLCAKKwkJCWZzcnByaW50ZihfKCIlczogbWFya2VkIGFzIGRvbid0 IGRlZnJhZywgaWdub3JpbmdcbiIpLAogCQkJICAgIGZuYW1lKTsKIAkJcmV0 dXJuKDApOwogCX0KQEAgLTE1MzMsNyArMTU0Nyw3IEBAIHRtcF9pbml0KGNo YXIgKm1udCkKIAlzcHJpbnRmKGJ1ZiwgIiVzLy5mc3IiLCBtbnQpOwogCiAJ bWFzayA9IHVtYXNrKDApOwotCWlmIChta2RpcihidWYsIDA3NzcpIDwgMCkg eworCWlmIChta2RpcihidWYsIDA3MDApIDwgMCkgewogCQlpZiAoZXJybm8g PT0gRUVYSVNUKSB7CiAJCQlpZiAoZGZsYWcpCiAJCQkJZnNycHJpbnRmKF8o InRtcGRpciBhbHJlYWR5IGV4aXN0czogJXNcbiIpLAo= ------------Z2DvzjHEAFgw2CcUZZOzh8-- From owner-xfs@oss.sgi.com Wed Apr 18 20:41:27 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 20:41:31 -0700 (PDT) Received: from postoffice.aconex.com (mail.app.aconex.com [203.89.192.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3J3fPfB013491 for ; Wed, 18 Apr 2007 20:41:27 -0700 Received: from edge (unknown [203.89.192.141]) by postoffice.aconex.com (Postfix) with ESMTP id 07A6FAAC419; Thu, 19 Apr 2007 13:41:23 +1000 (EST) Subject: Re: [PATCH] Fix xfs_fsr so the temp dir is not world readable/writable From: Nathan Scott Reply-To: nscott@aconex.com To: Barry Naujok Cc: xfs@oss.sgi.com 
In-Reply-To: References: Content-Type: text/plain Organization: Aconex Date: Thu, 19 Apr 2007 13:42:35 +1000 Message-Id: <1176954155.6273.143.camel@edge> Mime-Version: 1.0 X-Mailer: Evolution 2.6.3 Content-Transfer-Encoding: 7bit X-archive-position: 11117 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: nscott@aconex.com Precedence: bulk X-list: xfs On Thu, 2007-04-19 at 13:10 +1000, Barry Naujok wrote: > Looks good. > -static time_t endtime; > +static time_t endtime = 0; This line of the change is unnecessary though. cheers. -- Nathan From owner-xfs@oss.sgi.com Wed Apr 18 23:21:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 23:21:33 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J6LQfB018388 for ; Wed, 18 Apr 2007 23:21:28 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA20363; Thu, 19 Apr 2007 16:21:08 +1000 Date: Thu, 19 Apr 2007 16:23:16 +1000 From: Timothy Shimmin To: Andreas Dilger , David Chinner cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <60F23AB8D50382586C1E0BFC@timothy-shimmins-power-mac-g5.local> In-Reply-To: <20070419002139.GK5967@schatzie.adilger.int> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11118 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com 
X-original-sender: tes@sgi.com
Precedence: bulk
X-list: xfs

--On 18 April 2007 6:21:39 PM -0600 Andreas Dilger wrote:

> Below is an aggregation of the comments in this thread:
>
> struct fiemap_extent {
>     __u64 fe_start; /* starting offset in bytes */
>     __u64 fe_len; /* length in bytes */
>     __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
>     __u32 fe_lun; /* logical storage device number in array */
> }
>
> struct fiemap {
>     __u64 fm_start; /* logical start offset of mapping (in/out) */
>     __u64 fm_len; /* logical length of mapping (in/out) */
>     __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
>     __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */
>     __u64 fm_unused;
>     struct fiemap_extent fm_extents[0];
> }
>
> /* flags for the fiemap request */
> #define FIEMAP_FLAG_SYNC 0x00000001 /* flush delalloc data to disk*/
> #define FIEMAP_FLAG_HSM_READ 0x00000002 /* retrieve data from HSM */
> #define FIEMAP_FLAG_INCOMPAT 0xff000000 /* must understand these flags*/
>
> /* flags for the returned extents */
> #define FIEMAP_EXTENT_HOLE 0x00000001 /* no space allocated */
> #define FIEMAP_EXTENT_UNWRITTEN 0x00000002 /* uninitialized space */
> #define FIEMAP_EXTENT_UNKNOWN 0x00000004 /* in use, location unknown */
> #define FIEMAP_EXTENT_ERROR 0x00000008 /* error mapping space */
> #define FIEMAP_EXTENT_NO_DIRECT 0x00000010 /* no direct data access */
>
>
>
> SUMMARY OF CHANGES
> ==================
> - use fm_* fields directly in request instead of making it a fiemap_extent
>   (though they are laid out identically)

I much prefer that - it makes it a lot clearer to me to have
fiemap_extent just for fm_extents (no different meanings now).
(Don't like the word "offset" in comment without "physical" or some such but whatever;-) I also prefer the flags as separate fields too :) --Tim From owner-xfs@oss.sgi.com Thu Apr 19 00:19:08 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 00:19:10 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J7J4fB001557 for ; Thu, 19 Apr 2007 00:19:06 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA22249; Thu, 19 Apr 2007 17:18:58 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3J7IvAf70629110; Thu, 19 Apr 2007 17:18:57 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3J7IuCf70659502; Thu, 19 Apr 2007 17:18:56 +1000 (AEST) Date: Thu, 19 Apr 2007 17:18:56 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: review: make xfs_dm_punch_hole() atomic when punching EOF Message-ID: <20070419071856.GR48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11119 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Currently punching a hole to EOF via xfs_dm_punch_hole() truncates the file and then extends it. This leaves a small window where applications can see an incorrect file size while the punch is in progress. This can cause problems with DMF leading to premature completion of recalls and hence data corruption. Use the UNRESVSP ioctl rather than FREESP+setattr to punch the hole at EOF. 
This can leave speculative allocations past EOF, so truncate them off so we don't leave blocks that can't be migrated away around in the filesystem. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/dmapi/xfs_dm.c | 47 +++++++++++++++++++++++++++---------------- fs/xfs/linux-2.6/xfs_ksyms.c | 1 fs/xfs/xfs_rw.h | 14 ++++++++++-- fs/xfs/xfs_vnodeops.c | 28 ++++++++++++++++--------- 4 files changed, 60 insertions(+), 30 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/dmapi/xfs_dm.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/dmapi/xfs_dm.c 2007-04-19 16:55:44.345586509 +1000 +++ 2.6.x-xfs-new/fs/xfs/dmapi/xfs_dm.c 2007-04-19 17:18:05.818466833 +1000 @@ -2601,9 +2601,9 @@ xfs_dm_punch_hole( xfs_inode_t *xip; xfs_mount_t *mp; u_int bsize; - int cmd = XFS_IOC_UNRESVSP; /* punch */ xfs_fsize_t realsize; bhv_vnode_t *vp = vn_from_inode(inode); + int punch_to_eof = 0; /* Returns negative errors to DMAPI */ @@ -2638,12 +2638,24 @@ xfs_dm_punch_hole( down_rw_sems(inode, DM_SEM_FLAG_WR); xfs_ilock(xip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL); - if ((off >= xip->i_size) || ((off+len) > xip->i_size)) { + realsize = xip->i_size; + + if ((off >= realsize) || ((off + len) > realsize)) { xfs_iunlock(xip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL); error = -E2BIG; goto up_and_out; } - realsize = xip->i_size; + if (len == 0) + punch_to_eof = 1; + + /* + * When we are punching to EOF, we have to make sure we punch the + * last partial block that contains EOF. Round up the length to + * make sure we punch the block and not just zero it.
+ */ + if (punch_to_eof) + len = roundup((realsize - off), bsize); + xfs_iunlock(xip, XFS_ILOCK_EXCL); bf.l_type = 0; @@ -2651,20 +2663,21 @@ xfs_dm_punch_hole( bf.l_start = (xfs_off_t)off; bf.l_len = (xfs_off_t)len; - if (len == 0) - cmd = XFS_IOC_FREESP; /* truncate */ - error = xfs_change_file_space(xbdp, cmd, &bf, (xfs_off_t)off, - sys_cred, - ATTR_DMI|ATTR_NOLOCK); - - /* If truncate, grow it back to its original size. */ - if ((error == 0) && (len == 0)) { - bhv_vattr_t va; - - va.va_mask = XFS_AT_SIZE; - va.va_size = realsize; - error = xfs_setattr(xbdp, &va, ATTR_DMI|ATTR_NOLOCK, - sys_cred); + error = xfs_change_file_space(xbdp, XFS_IOC_UNRESVSP, &bf, + (xfs_off_t)off, sys_cred, ATTR_DMI|ATTR_NOLOCK); + + /* + * if punching to end of file, kill any blocks past EOF that + * may have been (speculatively) preallocated. No point in + * leaving them around if we are migrating the file.... + */ + if (!error && punch_to_eof) { + error = xfs_free_eofblocks(mp, xip, XFS_FREE_EOF_NOLOCK); + if (!error) { + /* Update linux inode block count after free above */ + inode->i_blocks = XFS_FSB_TO_BB(mp, + xip->i_d.di_nblocks + xip->i_delayed_blks); + } } /* Let threads in send_data_event know we punched the file. 
*/ Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_ksyms.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_ksyms.c 2007-04-19 16:56:33.471205020 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_ksyms.c 2007-04-19 17:18:05.082563433 +1000 @@ -332,3 +332,4 @@ EXPORT_SYMBOL(xfs_xlatesb); EXPORT_SYMBOL(xfs_zero_eof); EXPORT_SYMBOL(xlog_recover_process_iunlinks); EXPORT_SYMBOL(xfs_ichgtime_fast); +EXPORT_SYMBOL(xfs_free_eofblocks); Index: 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_vnodeops.c 2007-04-19 16:56:33.655181121 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c 2007-04-19 17:18:04.890588633 +1000 @@ -1207,13 +1207,15 @@ xfs_fsync( } /* - * This is called by xfs_inactive to free any blocks beyond eof, - * when the link count isn't zero. + * This is called by xfs_inactive to free any blocks beyond eof + * when the link count isn't zero and by xfs_dm_punch_hole() when + * punching a hole to EOF. */ -STATIC int -xfs_inactive_free_eofblocks( +int +xfs_free_eofblocks( xfs_mount_t *mp, - xfs_inode_t *ip) + xfs_inode_t *ip, + int flags) { xfs_trans_t *tp; int error; @@ -1222,6 +1224,7 @@ xfs_inactive_free_eofblocks( xfs_filblks_t map_len; int nimaps; xfs_bmbt_irec_t imap; + int use_iolock = (flags & XFS_FREE_EOF_LOCK); /* * Figure out if there are any blocks beyond the end @@ -1262,11 +1265,13 @@ xfs_inactive_free_eofblocks( * cache and we can't * do that within a transaction. 
*/ - xfs_ilock(ip, XFS_IOLOCK_EXCL); + if (use_iolock) + xfs_ilock(ip, XFS_IOLOCK_EXCL); error = xfs_itruncate_start(ip, XFS_ITRUNC_DEFINITE, ip->i_size); if (error) { - xfs_iunlock(ip, XFS_IOLOCK_EXCL); + if (use_iolock) + xfs_iunlock(ip, XFS_IOLOCK_EXCL); return error; } @@ -1303,7 +1308,8 @@ xfs_inactive_free_eofblocks( error = xfs_trans_commit(tp, XFS_TRANS_RELEASE_LOG_RES); } - xfs_iunlock(ip, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL); + xfs_iunlock(ip, (use_iolock ? (XFS_IOLOCK_EXCL|XFS_ILOCK_EXCL) + : XFS_ILOCK_EXCL)); } return error; } @@ -1579,7 +1585,8 @@ xfs_release( (ip->i_df.if_flags & XFS_IFEXTENTS)) && (!(ip->i_d.di_flags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND)))) { - if ((error = xfs_inactive_free_eofblocks(mp, ip))) + error = xfs_free_eofblocks(mp, ip, XFS_FREE_EOF_LOCK); + if (error) return error; /* Update linux inode block count after free above */ vn_to_inode(vp)->i_blocks = XFS_FSB_TO_BB(mp, @@ -1660,7 +1667,8 @@ xfs_inactive( (!(ip->i_d.di_flags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND)) || (ip->i_delayed_blks != 0)))) { - if ((error = xfs_inactive_free_eofblocks(mp, ip))) + error = xfs_free_eofblocks(mp, ip, XFS_FREE_EOF_LOCK); + if (error) return VN_INACTIVE_CACHE; /* Update linux inode block count after free above */ vn_to_inode(vp)->i_blocks = XFS_FSB_TO_BB(mp, Index: 2.6.x-xfs-new/fs/xfs/xfs_rw.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_rw.h 2007-04-19 16:55:44.373582872 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_rw.h 2007-04-19 16:56:33.839157222 +1000 @@ -72,6 +72,12 @@ xfs_fsb_to_db_io(struct xfs_iocore *io, } /* + * Flags for xfs_free_eofblocks + */ +#define XFS_FREE_EOF_LOCK (1<<0) +#define XFS_FREE_EOF_NOLOCK (1<<1) + +/* * Prototypes for functions in xfs_rw.c. 
*/ extern int xfs_write_clear_setuid(struct xfs_inode *ip); @@ -91,10 +97,12 @@ extern void xfs_ioerror_alert(char *func extern int xfs_rwlock(bhv_desc_t *bdp, bhv_vrwlock_t write_lock); extern void xfs_rwunlock(bhv_desc_t *bdp, bhv_vrwlock_t write_lock); extern int xfs_setattr(bhv_desc_t *, bhv_vattr_t *vap, int flags, - cred_t *credp); + cred_t *credp); extern int xfs_change_file_space(bhv_desc_t *bdp, int cmd, xfs_flock64_t *bf, - xfs_off_t offset, cred_t *credp, int flags); + xfs_off_t offset, cred_t *credp, int flags); extern int xfs_set_dmattrs(bhv_desc_t *bdp, u_int evmask, u_int16_t state, - cred_t *credp); + cred_t *credp); +extern int xfs_free_eofblocks(struct xfs_mount *mp, struct xfs_inode *ip, + int flags); #endif /* __XFS_RW_H__ */ From owner-xfs@oss.sgi.com Thu Apr 19 00:25:16 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 00:25:18 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J7PCfB006340 for ; Thu, 19 Apr 2007 00:25:14 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA22351; Thu, 19 Apr 2007 17:25:07 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3J7P6Af70618500; Thu, 19 Apr 2007 17:25:06 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3J7P51e58858094; Thu, 19 Apr 2007 17:25:05 +1000 (AEST) Date: Thu, 19 Apr 2007 17:25:05 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: review: allocate bmapi args Message-ID: <20070419072505.GS48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11120 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com 
Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Save some stack space (64 bytes on 32bit systems, 80 bytes on 64bit systems) in a critical path by allocating the xfs_bmalloca_t structure rather than putting it on the stack. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/xfs_bmap.c | 62 ++++++++++++++++++++++++++++-------------------------- 1 file changed, 33 insertions(+), 29 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_bmap.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_bmap.c 2007-04-19 13:26:49.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_bmap.c 2007-04-19 13:47:03.161553644 +1000 @@ -4710,7 +4710,7 @@ xfs_bmapi( xfs_fsblock_t abno; /* allocated block number */ xfs_extlen_t alen; /* allocated extent length */ xfs_fileoff_t aoff; /* allocated file offset */ - xfs_bmalloca_t bma; /* args for xfs_bmap_alloc */ + xfs_bmalloca_t *bma; /* args for xfs_bmap_alloc */ xfs_btree_cur_t *cur; /* bmap btree cursor */ xfs_fileoff_t end; /* end of mapped file region */ int eof; /* we've hit the end of extents */ @@ -4763,6 +4763,9 @@ xfs_bmapi( } if (XFS_FORCED_SHUTDOWN(mp)) return XFS_ERROR(EIO); + bma = kmem_zalloc(sizeof(xfs_bmalloca_t), KM_SLEEP); + if (!bma) + return XFS_ERROR(ENOMEM); rt = (whichfork == XFS_DATA_FORK) && XFS_IS_REALTIME_INODE(ip); ifp = XFS_IFORK_PTR(ip, whichfork); ASSERT(ifp->if_ext_max == @@ -4816,7 +4819,7 @@ xfs_bmapi( n = 0; end = bno + len; obno = bno; - bma.ip = NULL; + bma->ip = NULL; if (delta) { delta->xed_startoff = NULLFILEOFF; delta->xed_blockcount = 0; @@ -4960,34 +4963,34 @@ xfs_bmapi( * If first time, allocate and fill in * once-only bma fields. 
*/ - if (bma.ip == NULL) { - bma.tp = tp; - bma.ip = ip; - bma.prevp = &prev; - bma.gotp = &got; - bma.total = total; - bma.userdata = 0; + if (bma->ip == NULL) { + bma->tp = tp; + bma->ip = ip; + bma->prevp = &prev; + bma->gotp = &got; + bma->total = total; + bma->userdata = 0; } /* Indicate if this is the first user data * in the file, or just any user data. */ if (!(flags & XFS_BMAPI_METADATA)) { - bma.userdata = (aoff == 0) ? + bma->userdata = (aoff == 0) ? XFS_ALLOC_INITIAL_USER_DATA : XFS_ALLOC_USERDATA; } /* * Fill in changeable bma fields. */ - bma.eof = eof; - bma.firstblock = *firstblock; - bma.alen = alen; - bma.off = aoff; - bma.conv = !!(flags & XFS_BMAPI_CONVERT); - bma.wasdel = wasdelay; - bma.minlen = minlen; - bma.low = flist->xbf_low; - bma.minleft = minleft; + bma->eof = eof; + bma->firstblock = *firstblock; + bma->alen = alen; + bma->off = aoff; + bma->conv = !!(flags & XFS_BMAPI_CONVERT); + bma->wasdel = wasdelay; + bma->minlen = minlen; + bma->low = flist->xbf_low; + bma->minleft = minleft; /* * Only want to do the alignment at the * eof if it is userdata and allocation length @@ -4997,30 +5000,30 @@ xfs_bmapi( (!(flags & XFS_BMAPI_METADATA)) && (whichfork == XFS_DATA_FORK)) { if ((error = xfs_bmap_isaeof(ip, aoff, - whichfork, &bma.aeof))) + whichfork, &bma->aeof))) goto error0; } else - bma.aeof = 0; + bma->aeof = 0; /* * Call allocator. */ - if ((error = xfs_bmap_alloc(&bma))) + if ((error = xfs_bmap_alloc(bma))) goto error0; /* * Copy out result fields. 
*/ - abno = bma.rval; - if ((flist->xbf_low = bma.low)) + abno = bma->rval; + if ((flist->xbf_low = bma->low)) minleft = 0; - alen = bma.alen; - aoff = bma.off; + alen = bma->alen; + aoff = bma->off; ASSERT(*firstblock == NULLFSBLOCK || XFS_FSB_TO_AGNO(mp, *firstblock) == - XFS_FSB_TO_AGNO(mp, bma.firstblock) || + XFS_FSB_TO_AGNO(mp, bma->firstblock) || (flist->xbf_low && XFS_FSB_TO_AGNO(mp, *firstblock) < - XFS_FSB_TO_AGNO(mp, bma.firstblock))); - *firstblock = bma.firstblock; + XFS_FSB_TO_AGNO(mp, bma->firstblock))); + *firstblock = bma->firstblock; if (cur) cur->bc_private.b.firstblock = *firstblock; @@ -5290,6 +5293,7 @@ error0: if (!error) xfs_bmap_validate_ret(orig_bno, orig_len, orig_flags, orig_mval, orig_nmap, *nmap); + kmem_free(bma, sizeof(xfs_bmalloca_t)); return error; } From owner-xfs@oss.sgi.com Thu Apr 19 00:32:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 00:32:34 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J7WPfB008491 for ; Thu, 19 Apr 2007 00:32:27 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA22571; Thu, 19 Apr 2007 17:32:18 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3J7WHAf70655162; Thu, 19 Apr 2007 17:32:18 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3J7WGYQ68515725; Thu, 19 Apr 2007 17:32:16 +1000 (AEST) Date: Thu, 19 Apr 2007 17:32:16 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: review: allocate alloc args Message-ID: <20070419073216.GT48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11121 X-ecartis-version: Ecartis 
v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Save some stack space in the critical allocator paths by allocating the xfs_alloc_arg_t structures (104 bytes on 64bit, 88 bytes on 32bit systems) rather than placing them on the stack. There can be more than one of these structures on the stack through the critical allocation path (e.g. xfs_bmap_btalloc() and xfs_alloc_fix_freelist()) so there are significant savings to be had here... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/xfs_alloc.c | 81 +++++++------ fs/xfs/xfs_bmap.c | 276 ++++++++++++++++++++++++---------------------- fs/xfs/xfs_bmap_btree.c | 131 +++++++++++---------- fs/xfs/xfs_ialloc.c | 163 ++++++++++++++------------- fs/xfs/xfs_ialloc_btree.c | 120 ++++++++++---------- 5 files changed, 412 insertions(+), 359 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_alloc.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_alloc.c 2007-03-30 11:31:24.239345301 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_alloc.c 2007-03-30 11:32:07.613682556 +1000 @@ -1826,7 +1826,7 @@ xfs_alloc_fix_freelist( xfs_mount_t *mp; /* file system mount point structure */ xfs_extlen_t need; /* total blocks needed in freelist */ xfs_perag_t *pag; /* per-ag information structure */ - xfs_alloc_arg_t targs; /* local allocation arguments */ + xfs_alloc_arg_t *targs; /* local allocation arguments */ xfs_trans_t *tp; /* transaction pointer */ mp = args->mp; @@ -1934,54 +1934,60 @@ xfs_alloc_fix_freelist( /* * Initialize the args structure. 
*/ - targs.tp = tp; - targs.mp = mp; - targs.agbp = agbp; - targs.agno = args->agno; - targs.mod = targs.minleft = targs.wasdel = targs.userdata = - targs.minalignslop = 0; - targs.alignment = targs.minlen = targs.prod = targs.isfl = 1; - targs.type = XFS_ALLOCTYPE_THIS_AG; - targs.pag = pag; - if ((error = xfs_alloc_read_agfl(mp, tp, targs.agno, &agflbp))) - return error; + targs = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!targs) + return XFS_ERROR(ENOMEM); + targs->tp = tp; + targs->mp = mp; + targs->agbp = agbp; + targs->agno = args->agno; + targs->mod = targs->minleft = targs->wasdel = targs->userdata = + targs->minalignslop = 0; + targs->alignment = targs->minlen = targs->prod = targs->isfl = 1; + targs->type = XFS_ALLOCTYPE_THIS_AG; + targs->pag = pag; + if ((error = xfs_alloc_read_agfl(mp, tp, targs->agno, &agflbp))) + goto out_error; /* * Make the freelist longer if it's too short. */ while (be32_to_cpu(agf->agf_flcount) < need) { - targs.agbno = 0; - targs.maxlen = need - be32_to_cpu(agf->agf_flcount); + targs->agbno = 0; + targs->maxlen = need - be32_to_cpu(agf->agf_flcount); /* * Allocate as many blocks as possible at once. */ - if ((error = xfs_alloc_ag_vextent(&targs))) { + if ((error = xfs_alloc_ag_vextent(targs))) { xfs_trans_brelse(tp, agflbp); - return error; + goto out_error; } /* * Stop if we run out. Won't happen if callers are obeying * the restrictions correctly. Can happen for free calls * on a completely full ag. */ - if (targs.agbno == NULLAGBLOCK) { + if (targs->agbno == NULLAGBLOCK) { if (flags & XFS_ALLOC_FLAG_FREEING) break; xfs_trans_brelse(tp, agflbp); args->agbp = NULL; - return 0; + error = 0; + goto out_error; } /* * Put each allocated block on the list. 
*/ - for (bno = targs.agbno; bno < targs.agbno + targs.len; bno++) { + for (bno = targs->agbno; bno < targs->agbno + targs->len; bno++) { if ((error = xfs_alloc_put_freelist(tp, agbp, agflbp, bno, 0))) - return error; + goto out_error; } } xfs_trans_brelse(tp, agflbp); args->agbp = agbp; - return 0; +out_error: + kmem_free(targs, sizeof(xfs_alloc_arg_t)); + return error; } /* @@ -2480,28 +2486,31 @@ xfs_free_extent( xfs_fsblock_t bno, /* starting block number of extent */ xfs_extlen_t len) /* length of extent */ { - xfs_alloc_arg_t args; + xfs_alloc_arg_t *args; int error; ASSERT(len != 0); - memset(&args, 0, sizeof(xfs_alloc_arg_t)); - args.tp = tp; - args.mp = tp->t_mountp; - args.agno = XFS_FSB_TO_AGNO(args.mp, bno); - ASSERT(args.agno < args.mp->m_sb.sb_agcount); - args.agbno = XFS_FSB_TO_AGBNO(args.mp, bno); - down_read(&args.mp->m_peraglock); - args.pag = &args.mp->m_perag[args.agno]; - if ((error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING))) + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); + args->tp = tp; + args->mp = tp->t_mountp; + args->agno = XFS_FSB_TO_AGNO(args->mp, bno); + ASSERT(args->agno < args->mp->m_sb.sb_agcount); + args->agbno = XFS_FSB_TO_AGBNO(args->mp, bno); + down_read(&args->mp->m_peraglock); + args->pag = &args->mp->m_perag[args->agno]; + if ((error = xfs_alloc_fix_freelist(args, XFS_ALLOC_FLAG_FREEING))) goto error0; #ifdef DEBUG - ASSERT(args.agbp != NULL); - ASSERT((args.agbno + len) <= - be32_to_cpu(XFS_BUF_TO_AGF(args.agbp)->agf_length)); + ASSERT(args->agbp != NULL); + ASSERT((args->agbno + len) <= + be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_length)); #endif - error = xfs_free_ag_extent(tp, args.agbp, args.agno, args.agbno, len, 0); + error = xfs_free_ag_extent(tp, args->agbp, args->agno, args->agbno, len, 0); error0: - up_read(&args.mp->m_peraglock); + up_read(&args->mp->m_peraglock); + kmem_free(args, sizeof(xfs_alloc_arg_t)); return error; } Index: 
2.6.x-xfs-new/fs/xfs/xfs_bmap.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_bmap.c 2007-03-30 11:31:24.239345301 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_bmap.c 2007-03-30 11:33:25.711487339 +1000 @@ -2701,7 +2701,7 @@ xfs_bmap_btalloc( xfs_agnumber_t ag; xfs_agnumber_t fb_agno; /* ag number of ap->firstblock */ xfs_agnumber_t startag; - xfs_alloc_arg_t args; + xfs_alloc_arg_t *args; xfs_extlen_t blen; xfs_extlen_t delta; xfs_extlen_t longest; @@ -2712,8 +2712,11 @@ xfs_bmap_btalloc( int isaligned; int notinit; int tryagain; - int error; + int error = 0; + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); mp = ap->ip->i_mount; align = (ap->userdata && ap->ip->i_d.di_extsize && (ap->ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE)) ? @@ -2746,29 +2749,29 @@ xfs_bmap_btalloc( * Normal allocation, done through xfs_alloc_vextent. */ tryagain = isaligned = 0; - args.tp = ap->tp; - args.mp = mp; - args.fsbno = ap->rval; - args.maxlen = MIN(ap->alen, mp->m_sb.sb_agblocks); - args.firstblock = ap->firstblock; + args->tp = ap->tp; + args->mp = mp; + args->fsbno = ap->rval; + args->maxlen = MIN(ap->alen, mp->m_sb.sb_agblocks); + args->firstblock = ap->firstblock; blen = 0; if (nullfb) { - args.type = XFS_ALLOCTYPE_START_BNO; - args.total = ap->total; + args->type = XFS_ALLOCTYPE_START_BNO; + args->total = ap->total; /* * Find the longest available space. * We're going to try for the whole allocation at once. */ - startag = ag = XFS_FSB_TO_AGNO(mp, args.fsbno); + startag = ag = XFS_FSB_TO_AGNO(mp, args->fsbno); notinit = 0; down_read(&mp->m_peraglock); while (blen < ap->alen) { pag = &mp->m_perag[ag]; if (!pag->pagf_init && - (error = xfs_alloc_pagf_init(mp, args.tp, + (error = xfs_alloc_pagf_init(mp, args->tp, ag, XFS_ALLOC_FLAG_TRYLOCK))) { up_read(&mp->m_peraglock); - return error; + goto out_error; } /* * See xfs_alloc_fix_freelist... 
@@ -2796,39 +2799,39 @@ xfs_bmap_btalloc( * possible that there is space for this request. */ if (notinit || blen < ap->minlen) - args.minlen = ap->minlen; + args->minlen = ap->minlen; /* * If the best seen length is less than the request * length, use the best as the minimum. */ else if (blen < ap->alen) - args.minlen = blen; + args->minlen = blen; /* * Otherwise we've seen an extent as big as alen, * use that as the minimum. */ else - args.minlen = ap->alen; + args->minlen = ap->alen; } else if (ap->low) { - args.type = XFS_ALLOCTYPE_START_BNO; - args.total = args.minlen = ap->minlen; + args->type = XFS_ALLOCTYPE_START_BNO; + args->total = args->minlen = ap->minlen; } else { - args.type = XFS_ALLOCTYPE_NEAR_BNO; - args.total = ap->total; - args.minlen = ap->minlen; + args->type = XFS_ALLOCTYPE_NEAR_BNO; + args->total = ap->total; + args->minlen = ap->minlen; } if (unlikely(ap->userdata && ap->ip->i_d.di_extsize && (ap->ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE))) { - args.prod = ap->ip->i_d.di_extsize; - if ((args.mod = (xfs_extlen_t)do_mod(ap->off, args.prod))) - args.mod = (xfs_extlen_t)(args.prod - args.mod); + args->prod = ap->ip->i_d.di_extsize; + if ((args->mod = (xfs_extlen_t)do_mod(ap->off, args->prod))) + args->mod = (xfs_extlen_t)(args->prod - args->mod); } else if (mp->m_sb.sb_blocksize >= NBPP) { - args.prod = 1; - args.mod = 0; + args->prod = 1; + args->mod = 0; } else { - args.prod = NBPP >> mp->m_sb.sb_blocklog; - if ((args.mod = (xfs_extlen_t)(do_mod(ap->off, args.prod)))) - args.mod = (xfs_extlen_t)(args.prod - args.mod); + args->prod = NBPP >> mp->m_sb.sb_blocklog; + if ((args->mod = (xfs_extlen_t)(do_mod(ap->off, args->prod)))) + args->mod = (xfs_extlen_t)(args->prod - args->mod); } /* * If we are not low on available data blocks, and the @@ -2841,25 +2844,25 @@ xfs_bmap_btalloc( */ if (!ap->low && ap->aeof) { if (!ap->off) { - args.alignment = mp->m_dalign; - atype = args.type; + args->alignment = mp->m_dalign; + atype = args->type; isaligned = 1; 
/* * Adjust for alignment */ - if (blen > args.alignment && blen <= ap->alen) - args.minlen = blen - args.alignment; - args.minalignslop = 0; + if (blen > args->alignment && blen <= ap->alen) + args->minlen = blen - args->alignment; + args->minalignslop = 0; } else { /* * First try an exact bno allocation. * If it fails then do a near or start bno * allocation with alignment turned on. */ - atype = args.type; + atype = args->type; tryagain = 1; - args.type = XFS_ALLOCTYPE_THIS_BNO; - args.alignment = 1; + args->type = XFS_ALLOCTYPE_THIS_BNO; + args->alignment = 1; /* * Compute the minlen+alignment for the * next case. Set slop so that the value @@ -2869,75 +2872,75 @@ xfs_bmap_btalloc( if (blen > mp->m_dalign && blen <= ap->alen) nextminlen = blen - mp->m_dalign; else - nextminlen = args.minlen; - if (nextminlen + mp->m_dalign > args.minlen + 1) - args.minalignslop = + nextminlen = args->minlen; + if (nextminlen + mp->m_dalign > args->minlen + 1) + args->minalignslop = nextminlen + mp->m_dalign - - args.minlen - 1; + args->minlen - 1; else - args.minalignslop = 0; + args->minalignslop = 0; } } else { - args.alignment = 1; - args.minalignslop = 0; + args->alignment = 1; + args->minalignslop = 0; } - args.minleft = ap->minleft; - args.wasdel = ap->wasdel; - args.isfl = 0; - args.userdata = ap->userdata; - if ((error = xfs_alloc_vextent(&args))) - return error; - if (tryagain && args.fsbno == NULLFSBLOCK) { + args->minleft = ap->minleft; + args->wasdel = ap->wasdel; + args->isfl = 0; + args->userdata = ap->userdata; + if ((error = xfs_alloc_vextent(args))) + goto out_error; + if (tryagain && args->fsbno == NULLFSBLOCK) { /* * Exact allocation failed. Now try with alignment * turned on. 
*/ - args.type = atype; - args.fsbno = ap->rval; - args.alignment = mp->m_dalign; - args.minlen = nextminlen; - args.minalignslop = 0; + args->type = atype; + args->fsbno = ap->rval; + args->alignment = mp->m_dalign; + args->minlen = nextminlen; + args->minalignslop = 0; isaligned = 1; - if ((error = xfs_alloc_vextent(&args))) - return error; + if ((error = xfs_alloc_vextent(args))) + goto out_error; } - if (isaligned && args.fsbno == NULLFSBLOCK) { + if (isaligned && args->fsbno == NULLFSBLOCK) { /* * allocation failed, so turn off alignment and * try again. */ - args.type = atype; - args.fsbno = ap->rval; - args.alignment = 0; - if ((error = xfs_alloc_vextent(&args))) - return error; - } - if (args.fsbno == NULLFSBLOCK && nullfb && - args.minlen > ap->minlen) { - args.minlen = ap->minlen; - args.type = XFS_ALLOCTYPE_START_BNO; - args.fsbno = ap->rval; - if ((error = xfs_alloc_vextent(&args))) - return error; - } - if (args.fsbno == NULLFSBLOCK && nullfb) { - args.fsbno = 0; - args.type = XFS_ALLOCTYPE_FIRST_AG; - args.total = ap->minlen; - args.minleft = 0; - if ((error = xfs_alloc_vextent(&args))) - return error; + args->type = atype; + args->fsbno = ap->rval; + args->alignment = 0; + if ((error = xfs_alloc_vextent(args))) + goto out_error; + } + if (args->fsbno == NULLFSBLOCK && nullfb && + args->minlen > ap->minlen) { + args->minlen = ap->minlen; + args->type = XFS_ALLOCTYPE_START_BNO; + args->fsbno = ap->rval; + if ((error = xfs_alloc_vextent(args))) + goto out_error; + } + if (args->fsbno == NULLFSBLOCK && nullfb) { + args->fsbno = 0; + args->type = XFS_ALLOCTYPE_FIRST_AG; + args->total = ap->minlen; + args->minleft = 0; + if ((error = xfs_alloc_vextent(args))) + goto out_error; ap->low = 1; } - if (args.fsbno != NULLFSBLOCK) { - ap->firstblock = ap->rval = args.fsbno; - ASSERT(nullfb || fb_agno == args.agno || - (ap->low && fb_agno < args.agno)); - ap->alen = args.len; - ap->ip->i_d.di_nblocks += args.len; + if (args->fsbno != NULLFSBLOCK) { + ap->firstblock = ap->rval =
args->fsbno; + ASSERT(nullfb || fb_agno == args->agno || + (ap->low && fb_agno < args->agno)); + ap->alen = args->len; + ap->ip->i_d.di_nblocks += args->len; xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE); if (ap->wasdel) - ap->ip->i_delayed_blks -= args.len; + ap->ip->i_delayed_blks -= args->len; /* * Adjust the disk quota also. This was reserved * earlier. @@ -2945,12 +2948,14 @@ xfs_bmap_btalloc( XFS_TRANS_MOD_DQUOT_BYINO(mp, ap->tp, ap->ip, ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT : XFS_TRANS_DQ_BCOUNT, - (long) args.len); + (long) args->len); } else { ap->rval = NULLFSBLOCK; ap->alen = 0; } - return 0; +out_error: + kmem_free(args, sizeof(xfs_alloc_arg_t)); + return error; } /* @@ -3395,7 +3400,7 @@ xfs_bmap_extents_to_btree( { xfs_bmbt_block_t *ablock; /* allocated (child) bt block */ xfs_buf_t *abp; /* buffer for ablock */ - xfs_alloc_arg_t args; /* allocation arguments */ + xfs_alloc_arg_t *args; /* allocation arguments */ xfs_bmbt_rec_t *arp; /* child record pointer */ xfs_bmbt_block_t *block; /* btree root block */ xfs_btree_cur_t *cur; /* bmap btree cursor */ @@ -3408,6 +3413,9 @@ xfs_bmap_extents_to_btree( xfs_extnum_t nextents; /* number of file extents */ xfs_bmbt_ptr_t *pp; /* root block address pointer */ + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); ifp = XFS_IFORK_PTR(ip, whichfork); ASSERT(XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_EXTENTS); ASSERT(ifp->if_ext_max == @@ -3439,42 +3447,42 @@ xfs_bmap_extents_to_btree( * Convert to a btree with two levels, one record in root. 
*/ XFS_IFORK_FMT_SET(ip, whichfork, XFS_DINODE_FMT_BTREE); - args.tp = tp; - args.mp = mp; - args.firstblock = *firstblock; + args->tp = tp; + args->mp = mp; + args->firstblock = *firstblock; if (*firstblock == NULLFSBLOCK) { - args.type = XFS_ALLOCTYPE_START_BNO; - args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino); + args->type = XFS_ALLOCTYPE_START_BNO; + args->fsbno = XFS_INO_TO_FSB(mp, ip->i_ino); } else if (flist->xbf_low) { - args.type = XFS_ALLOCTYPE_START_BNO; - args.fsbno = *firstblock; + args->type = XFS_ALLOCTYPE_START_BNO; + args->fsbno = *firstblock; } else { - args.type = XFS_ALLOCTYPE_NEAR_BNO; - args.fsbno = *firstblock; + args->type = XFS_ALLOCTYPE_NEAR_BNO; + args->fsbno = *firstblock; } - args.minlen = args.maxlen = args.prod = 1; - args.total = args.minleft = args.alignment = args.mod = args.isfl = - args.minalignslop = 0; - args.wasdel = wasdel; + args->minlen = args->maxlen = args->prod = 1; + args->total = args->minleft = args->alignment = args->mod = args->isfl = + args->minalignslop = 0; + args->wasdel = wasdel; *logflagsp = 0; - if ((error = xfs_alloc_vextent(&args))) { + if ((error = xfs_alloc_vextent(args))) { xfs_iroot_realloc(ip, -1, whichfork); xfs_btree_del_cursor(cur, XFS_BTREE_ERROR); - return error; + goto out_error; } /* * Allocation can't fail, the space was reserved. 
*/ - ASSERT(args.fsbno != NULLFSBLOCK); + ASSERT(args->fsbno != NULLFSBLOCK); ASSERT(*firstblock == NULLFSBLOCK || - args.agno == XFS_FSB_TO_AGNO(mp, *firstblock) || + args->agno == XFS_FSB_TO_AGNO(mp, *firstblock) || (flist->xbf_low && - args.agno > XFS_FSB_TO_AGNO(mp, *firstblock))); - *firstblock = cur->bc_private.b.firstblock = args.fsbno; + args->agno > XFS_FSB_TO_AGNO(mp, *firstblock))); + *firstblock = cur->bc_private.b.firstblock = args->fsbno; cur->bc_private.b.allocated++; ip->i_d.di_nblocks++; XFS_TRANS_MOD_DQUOT_BYINO(mp, tp, ip, XFS_TRANS_DQ_BCOUNT, 1L); - abp = xfs_btree_get_bufl(mp, tp, args.fsbno, 0); + abp = xfs_btree_get_bufl(mp, tp, args->fsbno, 0); /* * Fill in the child block. */ @@ -3502,7 +3510,7 @@ xfs_bmap_extents_to_btree( arp = XFS_BMAP_REC_IADDR(ablock, 1, cur); kp->br_startoff = cpu_to_be64(xfs_bmbt_disk_get_startoff(arp)); pp = XFS_BMAP_PTR_IADDR(block, 1, cur); - *pp = cpu_to_be64(args.fsbno); + *pp = cpu_to_be64(args->fsbno); /* * Do all this logging at the end so that * the root is at the right level. @@ -3512,7 +3520,9 @@ xfs_bmap_extents_to_btree( ASSERT(*curp == NULL); *curp = cur; *logflagsp = XFS_ILOG_CORE | XFS_ILOG_FBROOT(whichfork); - return 0; +out_error: + kmem_free(args, sizeof(xfs_alloc_arg_t)); + return error; } /* @@ -3572,13 +3582,16 @@ xfs_bmap_local_to_extents( flags = 0; error = 0; if (ifp->if_bytes) { - xfs_alloc_arg_t args; /* allocation arguments */ + xfs_alloc_arg_t *args; /* allocation arguments */ xfs_buf_t *bp; /* buffer for extent block */ xfs_bmbt_rec_t *ep; /* extent record pointer */ - args.tp = tp; - args.mp = ip->i_mount; - args.firstblock = *firstblock; + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); + args->tp = tp; + args->mp = ip->i_mount; + args->firstblock = *firstblock; ASSERT((ifp->if_flags & (XFS_IFINLINE|XFS_IFEXTENTS|XFS_IFEXTIREC)) == XFS_IFINLINE); /* @@ -3586,39 +3599,42 @@ xfs_bmap_local_to_extents( * file currently fits in an inode. 
*/ if (*firstblock == NULLFSBLOCK) { - args.fsbno = XFS_INO_TO_FSB(args.mp, ip->i_ino); - args.type = XFS_ALLOCTYPE_START_BNO; + args->fsbno = XFS_INO_TO_FSB(args->mp, ip->i_ino); + args->type = XFS_ALLOCTYPE_START_BNO; } else { - args.fsbno = *firstblock; - args.type = XFS_ALLOCTYPE_NEAR_BNO; + args->fsbno = *firstblock; + args->type = XFS_ALLOCTYPE_NEAR_BNO; } - args.total = total; - args.mod = args.minleft = args.alignment = args.wasdel = - args.isfl = args.minalignslop = 0; - args.minlen = args.maxlen = args.prod = 1; - if ((error = xfs_alloc_vextent(&args))) + args->total = total; + args->mod = args->minleft = args->alignment = args->wasdel = + args->isfl = args->minalignslop = 0; + args->minlen = args->maxlen = args->prod = 1; + if ((error = xfs_alloc_vextent(args))) { + kmem_free(args, sizeof(xfs_alloc_arg_t)); goto done; + } /* * Can't fail, the space was reserved. */ - ASSERT(args.fsbno != NULLFSBLOCK); - ASSERT(args.len == 1); - *firstblock = args.fsbno; - bp = xfs_btree_get_bufl(args.mp, tp, args.fsbno, 0); + ASSERT(args->fsbno != NULLFSBLOCK); + ASSERT(args->len == 1); + *firstblock = args->fsbno; + bp = xfs_btree_get_bufl(args->mp, tp, args->fsbno, 0); memcpy((char *)XFS_BUF_PTR(bp), ifp->if_u1.if_data, ifp->if_bytes); xfs_trans_log_buf(tp, bp, 0, ifp->if_bytes - 1); - xfs_bmap_forkoff_reset(args.mp, ip, whichfork); + xfs_bmap_forkoff_reset(args->mp, ip, whichfork); xfs_idata_realloc(ip, -ifp->if_bytes, whichfork); xfs_iext_add(ifp, 0, 1); ep = xfs_iext_get_ext(ifp, 0); - xfs_bmbt_set_allf(ep, 0, args.fsbno, 1, XFS_EXT_NORM); + xfs_bmbt_set_allf(ep, 0, args->fsbno, 1, XFS_EXT_NORM); xfs_bmap_trace_post_update(fname, "new", ip, 0, whichfork); XFS_IFORK_NEXT_SET(ip, whichfork, 1); ip->i_d.di_nblocks = 1; - XFS_TRANS_MOD_DQUOT_BYINO(args.mp, tp, ip, + XFS_TRANS_MOD_DQUOT_BYINO(args->mp, tp, ip, XFS_TRANS_DQ_BCOUNT, 1L); flags |= XFS_ILOG_FEXT(whichfork); + kmem_free(args, sizeof(xfs_alloc_arg_t)); } else { ASSERT(XFS_IFORK_NEXTENTS(ip, whichfork) == 0); 
xfs_bmap_forkoff_reset(ip->i_mount, ip, whichfork); Index: 2.6.x-xfs-new/fs/xfs/xfs_bmap_btree.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_bmap_btree.c 2007-03-30 11:31:24.239345301 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_bmap_btree.c 2007-03-30 11:32:42.257159915 +1000 @@ -1490,7 +1490,7 @@ xfs_bmbt_split( xfs_btree_cur_t **curp, int *stat) /* success/failure */ { - xfs_alloc_arg_t args; /* block allocation args */ + xfs_alloc_arg_t *args; /* block allocation args */ int error; /* error return value */ #ifdef XFS_BMBT_TRACE static char fname[] = "xfs_bmbt_split"; @@ -1510,50 +1510,54 @@ xfs_bmbt_split( xfs_buf_t *rrbp; /* right-right buffer pointer */ xfs_bmbt_rec_t *rrp; /* right record pointer */ + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); XFS_BMBT_TRACE_CURSOR(cur, ENTRY); XFS_BMBT_TRACE_ARGIFK(cur, level, *bnop, *startoff); - args.tp = cur->bc_tp; - args.mp = cur->bc_mp; + args->tp = cur->bc_tp; + args->mp = cur->bc_mp; lbp = cur->bc_bufs[level]; - lbno = XFS_DADDR_TO_FSB(args.mp, XFS_BUF_ADDR(lbp)); + lbno = XFS_DADDR_TO_FSB(args->mp, XFS_BUF_ADDR(lbp)); left = XFS_BUF_TO_BMBT_BLOCK(lbp); - args.fsbno = cur->bc_private.b.firstblock; - args.firstblock = args.fsbno; - if (args.fsbno == NULLFSBLOCK) { - args.fsbno = lbno; - args.type = XFS_ALLOCTYPE_START_BNO; + args->fsbno = cur->bc_private.b.firstblock; + args->firstblock = args->fsbno; + if (args->fsbno == NULLFSBLOCK) { + args->fsbno = lbno; + args->type = XFS_ALLOCTYPE_START_BNO; } else - args.type = XFS_ALLOCTYPE_NEAR_BNO; - args.mod = args.minleft = args.alignment = args.total = args.isfl = - args.userdata = args.minalignslop = 0; - args.minlen = args.maxlen = args.prod = 1; - args.wasdel = cur->bc_private.b.flags & XFS_BTCUR_BPRV_WASDEL; - if (!args.wasdel && xfs_trans_get_block_res(args.tp) == 0) { + args->type = XFS_ALLOCTYPE_NEAR_BNO; + args->mod = args->minleft = args->alignment = args->total 
= args->isfl = + args->userdata = args->minalignslop = 0; + args->minlen = args->maxlen = args->prod = 1; + args->wasdel = cur->bc_private.b.flags & XFS_BTCUR_BPRV_WASDEL; + if (!args->wasdel && xfs_trans_get_block_res(args->tp) == 0) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); + kmem_free(args, sizeof(xfs_alloc_arg_t)); return XFS_ERROR(ENOSPC); } - if ((error = xfs_alloc_vextent(&args))) { + if ((error = xfs_alloc_vextent(args))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } - if (args.fsbno == NULLFSBLOCK) { + if (args->fsbno == NULLFSBLOCK) { XFS_BMBT_TRACE_CURSOR(cur, EXIT); *stat = 0; - return 0; + goto out_error; } - ASSERT(args.len == 1); - cur->bc_private.b.firstblock = args.fsbno; + ASSERT(args->len == 1); + cur->bc_private.b.firstblock = args->fsbno; cur->bc_private.b.allocated++; cur->bc_private.b.ip->i_d.di_nblocks++; - xfs_trans_log_inode(args.tp, cur->bc_private.b.ip, XFS_ILOG_CORE); - XFS_TRANS_MOD_DQUOT_BYINO(args.mp, args.tp, cur->bc_private.b.ip, + xfs_trans_log_inode(args->tp, cur->bc_private.b.ip, XFS_ILOG_CORE); + XFS_TRANS_MOD_DQUOT_BYINO(args->mp, args->tp, cur->bc_private.b.ip, XFS_TRANS_DQ_BCOUNT, 1L); - rbp = xfs_btree_get_bufl(args.mp, args.tp, args.fsbno, 0); + rbp = xfs_btree_get_bufl(args->mp, args->tp, args->fsbno, 0); right = XFS_BUF_TO_BMBT_BLOCK(rbp); #ifdef DEBUG if ((error = xfs_btree_check_lblock(cur, left, level, rbp))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } #endif right->bb_magic = cpu_to_be32(XFS_BMAP_MAGIC); @@ -1572,7 +1576,7 @@ xfs_bmbt_split( for (i = 0; i < be16_to_cpu(right->bb_numrecs); i++) { if ((error = xfs_btree_check_lptr_disk(cur, lpp[i], level))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } } #endif @@ -1590,23 +1594,23 @@ xfs_bmbt_split( } be16_add(&left->bb_numrecs, -(be16_to_cpu(right->bb_numrecs))); right->bb_rightsib = left->bb_rightsib; - left->bb_rightsib = cpu_to_be64(args.fsbno); + left->bb_rightsib = 
cpu_to_be64(args->fsbno); right->bb_leftsib = cpu_to_be64(lbno); xfs_bmbt_log_block(cur, rbp, XFS_BB_ALL_BITS); xfs_bmbt_log_block(cur, lbp, XFS_BB_NUMRECS | XFS_BB_RIGHTSIB); if (be64_to_cpu(right->bb_rightsib) != NULLDFSBNO) { - if ((error = xfs_btree_read_bufl(args.mp, args.tp, + if ((error = xfs_btree_read_bufl(args->mp, args->tp, be64_to_cpu(right->bb_rightsib), 0, &rrbp, XFS_BMAP_BTREE_REF))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } rrblock = XFS_BUF_TO_BMBT_BLOCK(rrbp); if ((error = xfs_btree_check_lblock(cur, rrblock, level, rrbp))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } - rrblock->bb_leftsib = cpu_to_be64(args.fsbno); + rrblock->bb_leftsib = cpu_to_be64(args->fsbno); xfs_bmbt_log_block(cur, rrbp, XFS_BB_LEFTSIB); } if (cur->bc_ptrs[level] > be16_to_cpu(left->bb_numrecs) + 1) { @@ -1616,14 +1620,16 @@ xfs_bmbt_split( if (level + 1 < cur->bc_nlevels) { if ((error = xfs_btree_dup_cursor(cur, curp))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } (*curp)->bc_ptrs[level + 1]++; } - *bnop = args.fsbno; + *bnop = args->fsbno; XFS_BMBT_TRACE_CURSOR(cur, EXIT); *stat = 1; - return 0; +out_error: + kmem_free(args, sizeof(xfs_alloc_arg_t)); + return error; } @@ -2238,7 +2244,7 @@ xfs_bmbt_newroot( int *logflags, /* logging flags for inode */ int *stat) /* return status - 0 fail */ { - xfs_alloc_arg_t args; /* allocation arguments */ + xfs_alloc_arg_t *args; /* allocation arguments */ xfs_bmbt_block_t *block; /* bmap btree block */ xfs_buf_t *bp; /* buffer for block */ xfs_bmbt_block_t *cblock; /* child btree block */ @@ -2255,48 +2261,51 @@ xfs_bmbt_newroot( int level; /* btree level */ xfs_bmbt_ptr_t *pp; /* pointer to bmap block addr */ + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); XFS_BMBT_TRACE_CURSOR(cur, ENTRY); level = cur->bc_nlevels - 1; block = xfs_bmbt_get_block(cur, level, &bp); /* * Copy the root into a real 
block. */ - args.mp = cur->bc_mp; + args->mp = cur->bc_mp; pp = XFS_BMAP_PTR_IADDR(block, 1, cur); - args.tp = cur->bc_tp; - args.fsbno = cur->bc_private.b.firstblock; - args.mod = args.minleft = args.alignment = args.total = args.isfl = - args.userdata = args.minalignslop = 0; - args.minlen = args.maxlen = args.prod = 1; - args.wasdel = cur->bc_private.b.flags & XFS_BTCUR_BPRV_WASDEL; - args.firstblock = args.fsbno; - if (args.fsbno == NULLFSBLOCK) { + args->tp = cur->bc_tp; + args->fsbno = cur->bc_private.b.firstblock; + args->mod = args->minleft = args->alignment = args->total = args->isfl = + args->userdata = args->minalignslop = 0; + args->minlen = args->maxlen = args->prod = 1; + args->wasdel = cur->bc_private.b.flags & XFS_BTCUR_BPRV_WASDEL; + args->firstblock = args->fsbno; + if (args->fsbno == NULLFSBLOCK) { #ifdef DEBUG if ((error = xfs_btree_check_lptr_disk(cur, *pp, level))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } #endif - args.fsbno = be64_to_cpu(*pp); - args.type = XFS_ALLOCTYPE_START_BNO; + args->fsbno = be64_to_cpu(*pp); + args->type = XFS_ALLOCTYPE_START_BNO; } else - args.type = XFS_ALLOCTYPE_NEAR_BNO; - if ((error = xfs_alloc_vextent(&args))) { + args->type = XFS_ALLOCTYPE_NEAR_BNO; + if ((error = xfs_alloc_vextent(args))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } - if (args.fsbno == NULLFSBLOCK) { + if (args->fsbno == NULLFSBLOCK) { XFS_BMBT_TRACE_CURSOR(cur, EXIT); *stat = 0; - return 0; + goto out_error; } - ASSERT(args.len == 1); - cur->bc_private.b.firstblock = args.fsbno; + ASSERT(args->len == 1); + cur->bc_private.b.firstblock = args->fsbno; cur->bc_private.b.allocated++; cur->bc_private.b.ip->i_d.di_nblocks++; - XFS_TRANS_MOD_DQUOT_BYINO(args.mp, args.tp, cur->bc_private.b.ip, + XFS_TRANS_MOD_DQUOT_BYINO(args->mp, args->tp, cur->bc_private.b.ip, XFS_TRANS_DQ_BCOUNT, 1L); - bp = xfs_btree_get_bufl(args.mp, cur->bc_tp, args.fsbno, 0); + bp = xfs_btree_get_bufl(args->mp, 
cur->bc_tp, args->fsbno, 0); cblock = XFS_BUF_TO_BMBT_BLOCK(bp); *cblock = *block; be16_add(&block->bb_level, 1); @@ -2311,18 +2320,18 @@ xfs_bmbt_newroot( for (i = 0; i < be16_to_cpu(cblock->bb_numrecs); i++) { if ((error = xfs_btree_check_lptr_disk(cur, pp[i], level))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } } #endif memcpy(cpp, pp, be16_to_cpu(cblock->bb_numrecs) * sizeof(*pp)); #ifdef DEBUG - if ((error = xfs_btree_check_lptr(cur, args.fsbno, level))) { + if ((error = xfs_btree_check_lptr(cur, args->fsbno, level))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } #endif - *pp = cpu_to_be64(args.fsbno); + *pp = cpu_to_be64(args->fsbno); xfs_iroot_realloc(cur->bc_private.b.ip, 1 - be16_to_cpu(cblock->bb_numrecs), cur->bc_private.b.whichfork); xfs_btree_setbuf(cur, level, bp); @@ -2337,6 +2346,8 @@ xfs_bmbt_newroot( *logflags |= XFS_ILOG_CORE | XFS_ILOG_FBROOT(cur->bc_private.b.whichfork); *stat = 1; - return 0; +out_error: + kmem_free(args, sizeof(xfs_alloc_arg_t)); + return error; } Index: 2.6.x-xfs-new/fs/xfs/xfs_ialloc.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_ialloc.c 2007-03-30 11:31:24.239345301 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_ialloc.c 2007-03-30 11:32:50.168127184 +1000 @@ -119,7 +119,7 @@ xfs_ialloc_ag_alloc( int *alloc) { xfs_agi_t *agi; /* allocation group header */ - xfs_alloc_arg_t args; /* allocation argument structure */ + xfs_alloc_arg_t *args; /* allocation argument structure */ int blks_per_cluster; /* fs blocks per inode cluster */ xfs_btree_cur_t *cur; /* inode btree cursor */ xfs_daddr_t d; /* disk addr of buffer */ @@ -138,18 +138,23 @@ xfs_ialloc_ag_alloc( int isaligned = 0; /* inode allocation at stripe unit */ /* boundary */ - args.tp = tp; - args.mp = tp->t_mountp; + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); + args->tp = tp; + args->mp = tp->t_mountp; /* * Locking will ensure that we
don't have two callers in here * at one time. */ - newlen = XFS_IALLOC_INODES(args.mp); - if (args.mp->m_maxicount && - args.mp->m_sb.sb_icount + newlen > args.mp->m_maxicount) + newlen = XFS_IALLOC_INODES(args->mp); + if (args->mp->m_maxicount && + args->mp->m_sb.sb_icount + newlen > args->mp->m_maxicount) { + kmem_free(args, sizeof(xfs_alloc_arg_t)); return XFS_ERROR(ENOSPC); - args.minlen = args.maxlen = XFS_IALLOC_BLOCKS(args.mp); + } + args->minlen = args->maxlen = XFS_IALLOC_BLOCKS(args->mp); /* * First try to allocate inodes contiguous with the last-allocated * chunk of inodes. If the filesystem is striped, this will fill @@ -157,27 +162,27 @@ xfs_ialloc_ag_alloc( */ agi = XFS_BUF_TO_AGI(agbp); newino = be32_to_cpu(agi->agi_newino); - args.agbno = XFS_AGINO_TO_AGBNO(args.mp, newino) + - XFS_IALLOC_BLOCKS(args.mp); + args->agbno = XFS_AGINO_TO_AGBNO(args->mp, newino) + + XFS_IALLOC_BLOCKS(args->mp); if (likely(newino != NULLAGINO && - (args.agbno < be32_to_cpu(agi->agi_length)))) { - args.fsbno = XFS_AGB_TO_FSB(args.mp, - be32_to_cpu(agi->agi_seqno), args.agbno); - args.type = XFS_ALLOCTYPE_THIS_BNO; - args.mod = args.total = args.wasdel = args.isfl = - args.userdata = args.minalignslop = 0; - args.prod = 1; - args.alignment = 1; + (args->agbno < be32_to_cpu(agi->agi_length)))) { + args->fsbno = XFS_AGB_TO_FSB(args->mp, + be32_to_cpu(agi->agi_seqno), args->agbno); + args->type = XFS_ALLOCTYPE_THIS_BNO; + args->mod = args->total = args->wasdel = args->isfl = + args->userdata = args->minalignslop = 0; + args->prod = 1; + args->alignment = 1; /* * Allow space for the inode btree to split. 
*/ - args.minleft = XFS_IN_MAXLEVELS(args.mp) - 1; - if ((error = xfs_alloc_vextent(&args))) - return error; + args->minleft = XFS_IN_MAXLEVELS(args->mp) - 1; + if ((error = xfs_alloc_vextent(args))) + goto out_error; } else - args.fsbno = NULLFSBLOCK; + args->fsbno = NULLFSBLOCK; - if (unlikely(args.fsbno == NULLFSBLOCK)) { + if (unlikely(args->fsbno == NULLFSBLOCK)) { /* * Set the alignment for the allocation. * If stripe alignment is turned on then align at stripe unit @@ -187,82 +192,82 @@ xfs_ialloc_ag_alloc( * pieces, so don't need alignment anyway. */ isaligned = 0; - if (args.mp->m_sinoalign) { - ASSERT(!(args.mp->m_flags & XFS_MOUNT_NOALIGN)); - args.alignment = args.mp->m_dalign; + if (args->mp->m_sinoalign) { + ASSERT(!(args->mp->m_flags & XFS_MOUNT_NOALIGN)); + args->alignment = args->mp->m_dalign; isaligned = 1; - } else if (XFS_SB_VERSION_HASALIGN(&args.mp->m_sb) && - args.mp->m_sb.sb_inoalignmt >= - XFS_B_TO_FSBT(args.mp, - XFS_INODE_CLUSTER_SIZE(args.mp))) - args.alignment = args.mp->m_sb.sb_inoalignmt; + } else if (XFS_SB_VERSION_HASALIGN(&args->mp->m_sb) && + args->mp->m_sb.sb_inoalignmt >= + XFS_B_TO_FSBT(args->mp, + XFS_INODE_CLUSTER_SIZE(args->mp))) + args->alignment = args->mp->m_sb.sb_inoalignmt; else - args.alignment = 1; + args->alignment = 1; /* * Need to figure out where to allocate the inode blocks. * Ideally they should be spaced out through the a.g. * For now, just allocate blocks up front. */ - args.agbno = be32_to_cpu(agi->agi_root); - args.fsbno = XFS_AGB_TO_FSB(args.mp, - be32_to_cpu(agi->agi_seqno), args.agbno); + args->agbno = be32_to_cpu(agi->agi_root); + args->fsbno = XFS_AGB_TO_FSB(args->mp, + be32_to_cpu(agi->agi_seqno), args->agbno); /* * Allocate a fixed-size extent of inodes. 
*/ - args.type = XFS_ALLOCTYPE_NEAR_BNO; - args.mod = args.total = args.wasdel = args.isfl = - args.userdata = args.minalignslop = 0; - args.prod = 1; + args->type = XFS_ALLOCTYPE_NEAR_BNO; + args->mod = args->total = args->wasdel = args->isfl = + args->userdata = args->minalignslop = 0; + args->prod = 1; /* * Allow space for the inode btree to split. */ - args.minleft = XFS_IN_MAXLEVELS(args.mp) - 1; - if ((error = xfs_alloc_vextent(&args))) - return error; + args->minleft = XFS_IN_MAXLEVELS(args->mp) - 1; + if ((error = xfs_alloc_vextent(args))) + goto out_error; } /* * If stripe alignment is turned on, then try again with cluster * alignment. */ - if (isaligned && args.fsbno == NULLFSBLOCK) { - args.type = XFS_ALLOCTYPE_NEAR_BNO; - args.agbno = be32_to_cpu(agi->agi_root); - args.fsbno = XFS_AGB_TO_FSB(args.mp, - be32_to_cpu(agi->agi_seqno), args.agbno); - if (XFS_SB_VERSION_HASALIGN(&args.mp->m_sb) && - args.mp->m_sb.sb_inoalignmt >= - XFS_B_TO_FSBT(args.mp, XFS_INODE_CLUSTER_SIZE(args.mp))) - args.alignment = args.mp->m_sb.sb_inoalignmt; + if (isaligned && args->fsbno == NULLFSBLOCK) { + args->type = XFS_ALLOCTYPE_NEAR_BNO; + args->agbno = be32_to_cpu(agi->agi_root); + args->fsbno = XFS_AGB_TO_FSB(args->mp, + be32_to_cpu(agi->agi_seqno), args->agbno); + if (XFS_SB_VERSION_HASALIGN(&args->mp->m_sb) && + args->mp->m_sb.sb_inoalignmt >= + XFS_B_TO_FSBT(args->mp, XFS_INODE_CLUSTER_SIZE(args->mp))) + args->alignment = args->mp->m_sb.sb_inoalignmt; else - args.alignment = 1; - if ((error = xfs_alloc_vextent(&args))) - return error; + args->alignment = 1; + if ((error = xfs_alloc_vextent(args))) + goto out_error; } - if (args.fsbno == NULLFSBLOCK) { + if (args->fsbno == NULLFSBLOCK) { *alloc = 0; - return 0; + goto out_error; } - ASSERT(args.len == args.minlen); + ASSERT(args->len == args->minlen); /* * Convert the results. 
*/ - newino = XFS_OFFBNO_TO_AGINO(args.mp, args.agbno, 0); + newino = XFS_OFFBNO_TO_AGINO(args->mp, args->agbno, 0); /* * Loop over the new block(s), filling in the inodes. * For small block sizes, manipulate the inodes in buffers * which are multiples of the blocks size. */ - if (args.mp->m_sb.sb_blocksize >= XFS_INODE_CLUSTER_SIZE(args.mp)) { + if (args->mp->m_sb.sb_blocksize >= XFS_INODE_CLUSTER_SIZE(args->mp)) { blks_per_cluster = 1; - nbufs = (int)args.len; - ninodes = args.mp->m_sb.sb_inopblock; + nbufs = (int)args->len; + ninodes = args->mp->m_sb.sb_inopblock; } else { - blks_per_cluster = XFS_INODE_CLUSTER_SIZE(args.mp) / - args.mp->m_sb.sb_blocksize; - nbufs = (int)args.len / blks_per_cluster; - ninodes = blks_per_cluster * args.mp->m_sb.sb_inopblock; + blks_per_cluster = XFS_INODE_CLUSTER_SIZE(args->mp) / + args->mp->m_sb.sb_blocksize; + nbufs = (int)args->len / blks_per_cluster; + ninodes = blks_per_cluster * args->mp->m_sb.sb_inopblock; } /* * Figure out what version number to use in the inodes we create. @@ -271,7 +276,7 @@ xfs_ialloc_ag_alloc( * use the old version so that old kernels will continue to be * able to use the file system. */ - if (XFS_SB_VERSION_HASNLINK(&args.mp->m_sb)) + if (XFS_SB_VERSION_HASNLINK(&args->mp->m_sb)) version = XFS_DINODE_VERSION_2; else version = XFS_DINODE_VERSION_1; @@ -280,19 +285,19 @@ xfs_ialloc_ag_alloc( /* * Get the block. */ - d = XFS_AGB_TO_DADDR(args.mp, be32_to_cpu(agi->agi_seqno), - args.agbno + (j * blks_per_cluster)); - fbuf = xfs_trans_get_buf(tp, args.mp->m_ddev_targp, d, - args.mp->m_bsize * blks_per_cluster, + d = XFS_AGB_TO_DADDR(args->mp, be32_to_cpu(agi->agi_seqno), + args->agbno + (j * blks_per_cluster)); + fbuf = xfs_trans_get_buf(tp, args->mp->m_ddev_targp, d, + args->mp->m_bsize * blks_per_cluster, XFS_BUF_LOCK); ASSERT(fbuf); ASSERT(!XFS_BUF_GETERROR(fbuf)); /* * Set initial values for the inodes in this buffer. 
*/ - xfs_biozero(fbuf, 0, ninodes << args.mp->m_sb.sb_inodelog); + xfs_biozero(fbuf, 0, ninodes << args->mp->m_sb.sb_inodelog); for (i = 0; i < ninodes; i++) { - free = XFS_MAKE_IPTR(args.mp, fbuf, i); + free = XFS_MAKE_IPTR(args->mp, fbuf, i); INT_SET(free->di_core.di_magic, ARCH_CONVERT, XFS_DINODE_MAGIC); INT_SET(free->di_core.di_version, ARCH_CONVERT, version); INT_SET(free->di_next_unlinked, ARCH_CONVERT, NULLAGINO); @@ -304,14 +309,14 @@ xfs_ialloc_ag_alloc( be32_add(&agi->agi_count, newlen); be32_add(&agi->agi_freecount, newlen); agno = be32_to_cpu(agi->agi_seqno); - down_read(&args.mp->m_peraglock); - args.mp->m_perag[agno].pagi_freecount += newlen; - up_read(&args.mp->m_peraglock); + down_read(&args->mp->m_peraglock); + args->mp->m_perag[agno].pagi_freecount += newlen; + up_read(&args->mp->m_peraglock); agi->agi_newino = cpu_to_be32(newino); /* * Insert records describing the new inode chunk into the btree. */ - cur = xfs_btree_init_cursor(args.mp, tp, agbp, agno, + cur = xfs_btree_init_cursor(args->mp, tp, agbp, agno, XFS_BTNUM_INO, (xfs_inode_t *)0, 0); for (thisino = newino; thisino < newino + newlen; @@ -319,12 +324,12 @@ xfs_ialloc_ag_alloc( if ((error = xfs_inobt_lookup_eq(cur, thisino, XFS_INODES_PER_CHUNK, XFS_INOBT_ALL_FREE, &i))) { xfs_btree_del_cursor(cur, XFS_BTREE_ERROR); - return error; + goto out_error; } ASSERT(i == 0); if ((error = xfs_inobt_insert(cur, &i))) { xfs_btree_del_cursor(cur, XFS_BTREE_ERROR); - return error; + goto out_error; } ASSERT(i == 1); } @@ -340,7 +345,9 @@ xfs_ialloc_ag_alloc( xfs_trans_mod_sb(tp, XFS_TRANS_SB_ICOUNT, (long)newlen); xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, (long)newlen); *alloc = 1; - return 0; +out_error: + kmem_free(args, sizeof(xfs_alloc_arg_t)); + return error; } STATIC_INLINE xfs_agnumber_t Index: 2.6.x-xfs-new/fs/xfs/xfs_ialloc_btree.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_ialloc_btree.c 2007-03-30 11:31:24.239345301 +1000 +++ 
2.6.x-xfs-new/fs/xfs/xfs_ialloc_btree.c 2007-03-30 11:32:30.678671441 +1000 @@ -1185,7 +1185,7 @@ xfs_inobt_newroot( int *stat) /* success/failure */ { xfs_agi_t *agi; /* a.g. inode header */ - xfs_alloc_arg_t args; /* allocation argument structure */ + xfs_alloc_arg_t *args; /* allocation argument structure */ xfs_inobt_block_t *block; /* one half of the old root block */ xfs_buf_t *bp; /* buffer containing block */ int error; /* error return value */ @@ -1207,33 +1207,36 @@ xfs_inobt_newroot( /* * Get a block & a buffer. */ + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); agi = XFS_BUF_TO_AGI(cur->bc_private.i.agbp); - args.tp = cur->bc_tp; - args.mp = cur->bc_mp; - args.fsbno = XFS_AGB_TO_FSB(args.mp, cur->bc_private.i.agno, + args->tp = cur->bc_tp; + args->mp = cur->bc_mp; + args->fsbno = XFS_AGB_TO_FSB(args->mp, cur->bc_private.i.agno, be32_to_cpu(agi->agi_root)); - args.mod = args.minleft = args.alignment = args.total = args.wasdel = - args.isfl = args.userdata = args.minalignslop = 0; - args.minlen = args.maxlen = args.prod = 1; - args.type = XFS_ALLOCTYPE_NEAR_BNO; - if ((error = xfs_alloc_vextent(&args))) - return error; + args->mod = args->minleft = args->alignment = args->total = args->wasdel = + args->isfl = args->userdata = args->minalignslop = 0; + args->minlen = args->maxlen = args->prod = 1; + args->type = XFS_ALLOCTYPE_NEAR_BNO; + if ((error = xfs_alloc_vextent(args))) + goto out_error; /* * None available, we fail. */ - if (args.fsbno == NULLFSBLOCK) { + if (args->fsbno == NULLFSBLOCK) { *stat = 0; - return 0; + goto out_error; } - ASSERT(args.len == 1); - nbp = xfs_btree_get_bufs(args.mp, args.tp, args.agno, args.agbno, 0); + ASSERT(args->len == 1); + nbp = xfs_btree_get_bufs(args->mp, args->tp, args->agno, args->agbno, 0); new = XFS_BUF_TO_INOBT_BLOCK(nbp); /* * Set the root data in the a.g. inode structure. 
*/ - agi->agi_root = cpu_to_be32(args.agbno); + agi->agi_root = cpu_to_be32(args->agbno); be32_add(&agi->agi_level, 1); - xfs_ialloc_log_agi(args.tp, cur->bc_private.i.agbp, + xfs_ialloc_log_agi(args->tp, cur->bc_private.i.agbp, XFS_AGI_ROOT | XFS_AGI_LEVEL); /* * At the previous root level there are now two blocks: the old @@ -1245,41 +1248,41 @@ xfs_inobt_newroot( block = XFS_BUF_TO_INOBT_BLOCK(bp); #ifdef DEBUG if ((error = xfs_btree_check_sblock(cur, block, cur->bc_nlevels - 1, bp))) - return error; + goto out_error; #endif if (be32_to_cpu(block->bb_rightsib) != NULLAGBLOCK) { /* * Our block is left, pick up the right block. */ lbp = bp; - lbno = XFS_DADDR_TO_AGBNO(args.mp, XFS_BUF_ADDR(lbp)); + lbno = XFS_DADDR_TO_AGBNO(args->mp, XFS_BUF_ADDR(lbp)); left = block; rbno = be32_to_cpu(left->bb_rightsib); - if ((error = xfs_btree_read_bufs(args.mp, args.tp, args.agno, + if ((error = xfs_btree_read_bufs(args->mp, args->tp, args->agno, rbno, 0, &rbp, XFS_INO_BTREE_REF))) - return error; + goto out_error; bp = rbp; right = XFS_BUF_TO_INOBT_BLOCK(rbp); if ((error = xfs_btree_check_sblock(cur, right, cur->bc_nlevels - 1, rbp))) - return error; + goto out_error; nptr = 1; } else { /* * Our block is right, pick up the left block. 
*/ rbp = bp; - rbno = XFS_DADDR_TO_AGBNO(args.mp, XFS_BUF_ADDR(rbp)); + rbno = XFS_DADDR_TO_AGBNO(args->mp, XFS_BUF_ADDR(rbp)); right = block; lbno = be32_to_cpu(right->bb_leftsib); - if ((error = xfs_btree_read_bufs(args.mp, args.tp, args.agno, + if ((error = xfs_btree_read_bufs(args->mp, args->tp, args->agno, lbno, 0, &lbp, XFS_INO_BTREE_REF))) - return error; + goto out_error; bp = lbp; left = XFS_BUF_TO_INOBT_BLOCK(lbp); if ((error = xfs_btree_check_sblock(cur, left, cur->bc_nlevels - 1, lbp))) - return error; + goto out_error; nptr = 2; } /* @@ -1290,7 +1293,7 @@ xfs_inobt_newroot( new->bb_numrecs = cpu_to_be16(2); new->bb_leftsib = cpu_to_be32(NULLAGBLOCK); new->bb_rightsib = cpu_to_be32(NULLAGBLOCK); - xfs_inobt_log_block(args.tp, nbp, XFS_BB_ALL_BITS); + xfs_inobt_log_block(args->tp, nbp, XFS_BB_ALL_BITS); ASSERT(lbno != NULLAGBLOCK && rbno != NULLAGBLOCK); /* * Fill in the key data in the new root. @@ -1320,7 +1323,9 @@ xfs_inobt_newroot( cur->bc_ptrs[cur->bc_nlevels] = nptr; cur->bc_nlevels++; *stat = 1; - return 0; +out_error: + kmem_free(args, sizeof(xfs_alloc_arg_t)); + return error; } /* @@ -1466,7 +1471,7 @@ xfs_inobt_split( xfs_btree_cur_t **curp, /* output: new cursor */ int *stat) /* success/failure */ { - xfs_alloc_arg_t args; /* allocation argument structure */ + xfs_alloc_arg_t *args; /* allocation argument structure */ int error; /* error return value */ int i; /* loop index/record number */ xfs_agblock_t lbno; /* left (current) block number */ @@ -1481,30 +1486,33 @@ xfs_inobt_split( xfs_inobt_ptr_t *rpp; /* right btree address pointer */ xfs_inobt_rec_t *rrp; /* right btree record pointer */ + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); /* * Set up left block (current one). 
*/ lbp = cur->bc_bufs[level]; - args.tp = cur->bc_tp; - args.mp = cur->bc_mp; - lbno = XFS_DADDR_TO_AGBNO(args.mp, XFS_BUF_ADDR(lbp)); + args->tp = cur->bc_tp; + args->mp = cur->bc_mp; + lbno = XFS_DADDR_TO_AGBNO(args->mp, XFS_BUF_ADDR(lbp)); /* * Allocate the new block. * If we can't do it, we're toast. Give up. */ - args.fsbno = XFS_AGB_TO_FSB(args.mp, cur->bc_private.i.agno, lbno); - args.mod = args.minleft = args.alignment = args.total = args.wasdel = - args.isfl = args.userdata = args.minalignslop = 0; - args.minlen = args.maxlen = args.prod = 1; - args.type = XFS_ALLOCTYPE_NEAR_BNO; - if ((error = xfs_alloc_vextent(&args))) - return error; - if (args.fsbno == NULLFSBLOCK) { + args->fsbno = XFS_AGB_TO_FSB(args->mp, cur->bc_private.i.agno, lbno); + args->mod = args->minleft = args->alignment = args->total = args->wasdel = + args->isfl = args->userdata = args->minalignslop = 0; + args->minlen = args->maxlen = args->prod = 1; + args->type = XFS_ALLOCTYPE_NEAR_BNO; + if ((error = xfs_alloc_vextent(args))) + goto out_error; + if (args->fsbno == NULLFSBLOCK) { *stat = 0; - return 0; + goto out_error; } - ASSERT(args.len == 1); - rbp = xfs_btree_get_bufs(args.mp, args.tp, args.agno, args.agbno, 0); + ASSERT(args->len == 1); + rbp = xfs_btree_get_bufs(args->mp, args->tp, args->agno, args->agbno, 0); /* * Set up the new block as "right". */ @@ -1515,7 +1523,7 @@ xfs_inobt_split( left = XFS_BUF_TO_INOBT_BLOCK(lbp); #ifdef DEBUG if ((error = xfs_btree_check_sblock(cur, left, level, lbp))) - return error; + goto out_error; #endif /* * Fill in the btree header for the new block. 
@@ -1542,7 +1550,7 @@ xfs_inobt_split( #ifdef DEBUG for (i = 0; i < be16_to_cpu(right->bb_numrecs); i++) { if ((error = xfs_btree_check_sptr(cur, be32_to_cpu(lpp[i]), level))) - return error; + goto out_error; } #endif memcpy(rkp, lkp, be16_to_cpu(right->bb_numrecs) * sizeof(*rkp)); @@ -1567,10 +1575,10 @@ xfs_inobt_split( */ be16_add(&left->bb_numrecs, -(be16_to_cpu(right->bb_numrecs))); right->bb_rightsib = left->bb_rightsib; - left->bb_rightsib = cpu_to_be32(args.agbno); + left->bb_rightsib = cpu_to_be32(args->agbno); right->bb_leftsib = cpu_to_be32(lbno); - xfs_inobt_log_block(args.tp, rbp, XFS_BB_ALL_BITS); - xfs_inobt_log_block(args.tp, lbp, XFS_BB_NUMRECS | XFS_BB_RIGHTSIB); + xfs_inobt_log_block(args->tp, rbp, XFS_BB_ALL_BITS); + xfs_inobt_log_block(args->tp, lbp, XFS_BB_NUMRECS | XFS_BB_RIGHTSIB); /* * If there's a block to the new block's right, make that block * point back to right instead of to left. @@ -1579,15 +1587,15 @@ xfs_inobt_split( xfs_inobt_block_t *rrblock; /* rr btree block */ xfs_buf_t *rrbp; /* buffer for rrblock */ - if ((error = xfs_btree_read_bufs(args.mp, args.tp, args.agno, + if ((error = xfs_btree_read_bufs(args->mp, args->tp, args->agno, be32_to_cpu(right->bb_rightsib), 0, &rrbp, XFS_INO_BTREE_REF))) - return error; + goto out_error; rrblock = XFS_BUF_TO_INOBT_BLOCK(rrbp); if ((error = xfs_btree_check_sblock(cur, rrblock, level, rrbp))) - return error; - rrblock->bb_leftsib = cpu_to_be32(args.agbno); - xfs_inobt_log_block(args.tp, rrbp, XFS_BB_LEFTSIB); + goto out_error; + rrblock->bb_leftsib = cpu_to_be32(args->agbno); + xfs_inobt_log_block(args->tp, rrbp, XFS_BB_LEFTSIB); } /* * If the cursor is really in the right block, move it there. 
@@ -1604,12 +1612,14 @@ xfs_inobt_split( */ if (level + 1 < cur->bc_nlevels) { if ((error = xfs_btree_dup_cursor(cur, curp))) - return error; + goto out_error; (*curp)->bc_ptrs[level + 1]++; } - *bnop = args.agbno; + *bnop = args->agbno; *stat = 1; - return 0; +out_error: + kmem_free(args, sizeof(xfs_alloc_arg_t)); + return error; } /* From owner-xfs@oss.sgi.com Thu Apr 19 00:37:25 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 00:37:26 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J7bLfB009730 for ; Thu, 19 Apr 2007 00:37:23 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA22700; Thu, 19 Apr 2007 17:37:16 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3J7bFAf68298774; Thu, 19 Apr 2007 17:37:15 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3J7bEMo69779756; Thu, 19 Apr 2007 17:37:14 +1000 (AEST) Date: Thu, 19 Apr 2007 17:37:14 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: review: handle barriers being switched off dynamically. Message-ID: <20070419073714.GU48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11122 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs As pointed out by Neil Brown, MD can switch barriers off dynamically underneath a mounted filesystem. If this happens to XFS, it will shutdown the filesystem immediately. 
Handle this more sanely by yelling into the syslog, retrying the I/O without barriers and, if that is successful, turning off barriers. Also remove an unnecessary check when first checking to see if the underlying device supports barriers. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/linux-2.6/xfs_buf.c | 13 ++++++++++++- fs/xfs/linux-2.6/xfs_super.c | 8 -------- fs/xfs/xfs_log.c | 13 +++++++++++++ 3 files changed, 25 insertions(+), 9 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c 2007-04-19 13:26:49.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c 2007-04-19 13:27:01.733786992 +1000 @@ -1000,7 +1000,18 @@ xfs_buf_iodone_work( xfs_buf_t *bp = container_of(work, xfs_buf_t, b_iodone_work); - if (bp->b_iodone) + /* + * We can get an EOPNOTSUPP to ordered writes. Here we clear the + * ordered flag and reissue them. Because we can't tell the higher + * layers directly that they should not issue ordered I/O anymore, they + * need to check if the ordered flag was cleared during I/O completion. + */ + if ((bp->b_error == EOPNOTSUPP) && + (bp->b_flags & (XBF_ORDERED|XBF_ASYNC)) == (XBF_ORDERED|XBF_ASYNC)) { + XB_TRACE(bp, "ordered_retry", bp->b_iodone); + bp->b_flags &= ~XBF_ORDERED; + xfs_buf_iorequest(bp); + } else if (bp->b_iodone) (*(bp->b_iodone))(bp); else if (bp->b_flags & XBF_ASYNC) xfs_buf_relse(bp); Index: 2.6.x-xfs-new/fs/xfs/xfs_log.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_log.c 2007-04-19 13:27:00.245980891 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_log.c 2007-04-19 13:27:01.753784386 +1000 @@ -961,6 +961,19 @@ xlog_iodone(xfs_buf_t *bp) l = iclog->ic_log; /* + * If the ordered flag has been removed by a lower + * layer, it means the underlying device no longer supports + * barrier I/O. Warn loudly and turn off barriers.
+ */ + if ((l->l_mp->m_flags & XFS_MOUNT_BARRIER) && !XFS_BUF_ORDERED(bp)) { + l->l_mp->m_flags &= ~XFS_MOUNT_BARRIER; + xfs_fs_cmn_err(CE_WARN, l->l_mp, + "xlog_iodone: Barriers are no longer supported" + " by device. Disabling barriers\n"); + xfs_buftrace("XLOG_IODONE BARRIERS OFF", bp); + } + + /* * Race to shutdown the filesystem if we see an error. */ if (XFS_TEST_ERROR((XFS_BUF_GETERROR(bp)), l->l_mp, Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_super.c 2007-04-19 13:27:00.277976721 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c 2007-04-19 13:27:01.757783865 +1000 @@ -314,14 +314,6 @@ xfs_mountfs_check_barriers(xfs_mount_t * return; } - if (mp->m_ddev_targp->bt_bdev->bd_disk->queue->ordered == - QUEUE_ORDERED_NONE) { - xfs_fs_cmn_err(CE_NOTE, mp, - "Disabling barriers, not supported by the underlying device"); - mp->m_flags &= ~XFS_MOUNT_BARRIER; - return; - } - if (xfs_readonly_buftarg(mp->m_ddev_targp)) { xfs_fs_cmn_err(CE_NOTE, mp, "Disabling barriers, underlying device is readonly"); From owner-xfs@oss.sgi.com Thu Apr 19 00:49:50 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 00:49:53 -0700 (PDT) Received: from postoffice.aconex.com (mail.app.aconex.com [203.89.192.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3J7nnfB011915 for ; Thu, 19 Apr 2007 00:49:50 -0700 Received: from edge (unknown [203.89.192.141]) by postoffice.aconex.com (Postfix) with ESMTP id AE5ABAAC37A; Thu, 19 Apr 2007 17:49:48 +1000 (EST) Subject: Re: review: allocate bmapi args From: Nathan Scott Reply-To: nscott@aconex.com To: David Chinner Cc: xfs-dev , xfs-oss In-Reply-To: <20070419072505.GS48531920@melbourne.sgi.com> References: <20070419072505.GS48531920@melbourne.sgi.com> Content-Type: text/plain Organization: Aconex Date: Thu, 19 Apr 2007 17:51:02 +1000 Message-Id: <1176969062.6273.169.camel@edge> Mime-Version: 1.0 
X-Mailer: Evolution 2.6.3 Content-Transfer-Encoding: 7bit X-archive-position: 11123 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: nscott@aconex.com Precedence: bulk X-list: xfs On Thu, 2007-04-19 at 17:25 +1000, David Chinner wrote: > > + bma = kmem_zalloc(sizeof(xfs_bmalloca_t), KM_SLEEP); > + if (!bma) > + return XFS_ERROR(ENOMEM); I guess you meant KM_NOSLEEP? Are you sure this is legit though? (are all callers going to be able to handle this?) I'm thinking of the writeout paths where we're doing space allocation (unwritten extent conversion comes through here too) in order to free up some page cache so other memory allocs elsewhere can proceed. I don't see any other memory allocations in this area of the code, so I guess I'd be treading really carefully here.. (Oh, and why the _zalloc? Could just do an _alloc, since previous code was using non-zeroed memory - so, should have been filling in all fields). cheers. -- Nathan From owner-xfs@oss.sgi.com Thu Apr 19 00:53:48 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 00:53:49 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J7rifB013002 for ; Thu, 19 Apr 2007 00:53:46 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA23138; Thu, 19 Apr 2007 17:53:40 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3J7rdAf70617956; Thu, 19 Apr 2007 17:53:39 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3J7rc1f70015918; Thu, 19 Apr 2007 17:53:38 +1000 (AEST) Date: Thu, 19 Apr 2007 17:53:38 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: review: fix use after free of log buffers 
on shutdown. Message-ID: <20070419075338.GV48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11124 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs When unmounting the filesystem we write an unmount record into the log just before we start freeing up in memory structures. When we wait for the unmount record to hit the disk, we don't wait for the log buffers to be finished with, we only wait for part of the iodone callback to be run - the bit that processes the unmount record completion. Hence when the unmount wakes up, it races with the remainder of the log io completion and pretty much the first thing it does is free the log buffers. As a result, when iodone processing completes and we check the buffer's async status, the buffer can already have been freed. Luckily, all log I/O is issued asynchronously, so we don't really need the async check and so we can avoid this use after free easily. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/xfs_log.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_log.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_log.c 2007-04-19 17:18:14.097380099 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_log.c 2007-04-19 17:51:03.017078512 +1000 @@ -988,14 +988,16 @@ xlog_iodone(xfs_buf_t *bp) } else if (iclog->ic_state & XLOG_STATE_IOERROR) { aborted = XFS_LI_ABORTED; } + + /* log I/O is always issued ASYNC */ + ASSERT(XFS_BUF_ISASYNC(bp)); xlog_state_done_syncing(iclog, aborted); - if (!(XFS_BUF_ISASYNC(bp))) { - /* - * Corresponding psema() will be done in bwrite(). If we don't - * vsema() here, panic. 
- */ - XFS_BUF_V_IODONESEMA(bp); - } + /* + * do not reference the buffer (bp) here as we could race + * with it being freed after writing the unmount record to the + * log. + */ + } /* xlog_iodone */ /* From owner-xfs@oss.sgi.com Thu Apr 19 01:23:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 01:23:42 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J8NafB019495 for ; Thu, 19 Apr 2007 01:23:38 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id SAA24557; Thu, 19 Apr 2007 18:23:34 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3J8NXAf70514203; Thu, 19 Apr 2007 18:23:33 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3J8NVJY70575441; Thu, 19 Apr 2007 18:23:31 +1000 (AEST) Date: Thu, 19 Apr 2007 18:23:31 +1000 From: David Chinner To: Nathan Scott Cc: David Chinner , xfs-dev , xfs-oss Subject: Re: review: allocate bmapi args Message-ID: <20070419082331.GW48531920@melbourne.sgi.com> References: <20070419072505.GS48531920@melbourne.sgi.com> <1176969062.6273.169.camel@edge> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1176969062.6273.169.camel@edge> User-Agent: Mutt/1.4.2.1i X-archive-position: 11125 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, Apr 19, 2007 at 05:51:02PM +1000, Nathan Scott wrote: > On Thu, 2007-04-19 at 17:25 +1000, David Chinner wrote: > > > > + bma = kmem_zalloc(sizeof(xfs_bmalloca_t), KM_SLEEP); > > + if (!bma) > > + return XFS_ERROR(ENOMEM); > > I guess you meant KM_NOSLEEP? 
No, I meant a sleeping allocation. I guess that means I don't need
the error handling....

> Are you sure this is legit though?

It *must* be. We already rely on being able to do substantial
amounts of allocation in this path....

> (are all callers going to be able to handle this?) I'm thinking
> of the writeout paths where we're doing space allocation (unwritten
> extent conversion comes through here too) in order to free up some
> page cache so other memory allocs elsewhere can proceed. I don't
> see any other memory allocations in this area of the code, so I
> guess I'd be treading really carefully here..

We modify the incore extent list as it grows and shrinks in this
path. It is critical that we are able to allocate at least small
amounts of memory in these writeback paths, and we currently do that
with kmem_alloc(KM_SLEEP). A quick search of the xfs_iext_*
functions shows lots of allocations are done in manipulating the
incore extents....

Then there's needing new pages in the page cache and xfs_buf_t's if
we trigger a btree split during the allocation, and so on.

IOW, there's plenty of far larger allocations down through this path
already, and if any one of them doesn't succeed we are pretty much
fscked. Given that we haven't had any reports of writeback deadlocks
since the new incore extent handling went in, I think small
allocations like this are not a problem.

FWIW, I have done low memory testing and I wasn't able to trigger
any problems.....

> (Oh, and why the _zalloc? Could just do an _alloc, since previous
> code was using non-zeroed memory - so, should have been filling in
> all fields).

Habit. And it doesn't hurt performance at all - we've got to take
that cache miss somewhere along the line....

Cheers,

Dave.
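
A minimal userspace sketch of the point made above, assuming a sleeping allocator that retries until it succeeds. Every name here (demo_args, demo_zalloc_sleep, demo_newroot, demo_live) is hypothetical and is not XFS code; calloc() merely stands in for the kernel allocator. It shows why a NULL check after a KM_SLEEP-style allocation would be dead code, and how a single exit label can free the heap-allocated argument structure on every return path:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for an allocation-argument structure. */
struct demo_args {
	int	minlen;
	int	maxlen;
	long	fsbno;
};

/* Hypothetical outcomes of the simulated block allocation. */
enum demo_mode { DEMO_OK, DEMO_NO_SPACE, DEMO_IO_ERROR };

/* Outstanding allocations, so callers can check that nothing leaked. */
static int demo_live;

/*
 * Assumed model of a sleeping, zeroing allocation: it keeps retrying
 * until memory is available, so it never hands NULL back to the caller.
 * (A kernel allocator would sleep between attempts instead of spinning.)
 */
static void *demo_zalloc_sleep(size_t size)
{
	void *p;

	while ((p = calloc(1, size)) == NULL)
		;
	demo_live++;
	return p;
}

static void demo_free(void *p)
{
	free(p);
	demo_live--;
}

/*
 * All exits go through one label so the structure is freed whether the
 * operation succeeds, finds no space, or fails with an error.
 */
static int demo_newroot(enum demo_mode mode, int *stat)
{
	struct demo_args *args;
	int error = 0;

	/* No NULL check: the sleeping allocation cannot fail. */
	args = demo_zalloc_sleep(sizeof(*args));
	assert(args != NULL);
	args->minlen = args->maxlen = 1;

	if (mode == DEMO_IO_ERROR) {
		error = EIO;		/* hard failure */
		goto out;
	}
	if (mode == DEMO_NO_SPACE) {
		*stat = 0;		/* not an error, just nothing available */
		goto out;
	}
	*stat = 1;			/* success */
out:
	demo_free(args);
	return error;
}
```

In this sketch the "no space" case deliberately leaves error at 0 before jumping to the exit label, so the caller can tell it apart from a hard failure by looking at *stat.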
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu Apr 19 01:38:04 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 01:38:06 -0700 (PDT) Received: from tyo200.gate.nec.co.jp (TYO200.gate.nec.co.jp [210.143.35.50]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3J8c2fB023026 for ; Thu, 19 Apr 2007 01:38:04 -0700 Received: from tyo202.gate.nec.co.jp ([10.7.69.202]) by tyo200.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l3J8c0nB006255 for ; Thu, 19 Apr 2007 17:38:00 +0900 (JST) Received: from mailgate3.nec.co.jp (mailgate54.nec.co.jp [10.7.69.193]) by tyo202.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l3J8bK1d011041 for ; Thu, 19 Apr 2007 17:37:20 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id l3J8bKn00848 for xfs@oss.sgi.com; Thu, 19 Apr 2007 17:37:20 +0900 (JST) Received: from secsv3.tnes.nec.co.jp (tnesvc2.tnes.nec.co.jp [10.1.101.15]) by mailsv3.nec.co.jp (8.11.7/3.7W-MAILSV4-NEC) with ESMTP id l3J8bJL27926 for ; Thu, 19 Apr 2007 17:37:19 +0900 (JST) Received: from tnesvc2.tnes.nec.co.jp ([10.1.101.15]) by secsv3.tnes.nec.co.jp (ExpressMail 5.10) with SMTP id 20070419.173719.50002280 for ; Thu, 19 Apr 2007 17:37:19 +0900 Received: FROM tnessv1.tnes.nec.co.jp BY tnesvc2.tnes.nec.co.jp ; Thu Apr 19 17:37:19 2007 +0900 Received: from rifu.bsd.tnes.nec.co.jp (rifu.bsd.tnes.nec.co.jp [10.1.104.1]) by tnessv1.tnes.nec.co.jp (Postfix) with ESMTP id 57B77AE4B3; Thu, 19 Apr 2007 17:37:10 +0900 (JST) Received: from TNESG9305.tnes.nec.co.jp (TNESG9305.bsd.tnes.nec.co.jp [10.1.104.199]) by rifu.bsd.tnes.nec.co.jp (8.12.11/3.7W/BSD-TNES-MX01) with SMTP id l3J8bJrN008943; Thu, 19 Apr 2007 17:37:19 +0900 Message-Id: <200704190837.AA05238@TNESG9305.tnes.nec.co.jp> Date: Thu, 19 Apr 2007 17:37:11 +0900 To: xfs@oss.sgi.com Subject: [PATCH] Fix "quota -n" command in xfs_quota. 
From: Utako Kusaka MIME-Version: 1.0 X-Mailer: AL-Mail32 Version 1.13 Content-Type: text/plain; charset=us-ascii X-archive-position: 11126 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: utako@tnes.nec.co.jp Precedence: bulk X-list: xfs Hi, The "quota -n" command in xfs_quota doesn't work when a project id is specified. This patch fixes it. Example: # ./xfs_quota -x -c 'quota -p -n 42' ~utako/mpnt Disk quotas for Project logfiles (42) Filesystem Blocks Quota Limit Warn/Time Mounted on /dev/sda6 52 0 0 00 [--------] /home/utako/mpnt Signed-off-by: Utako Kusaka --- --- xfsprogs-2.8.20/quota/quota.orig 2007-04-18 10:36:38.000000000 +0900 +++ xfsprogs-2.8.20/quota/quota.c 2007-04-18 11:09:10.000000000 +0900 @@ -312,7 +312,7 @@ getprojectname( static char buffer[32]; fs_project_t *p; - if ((p = getprprid(prid))) + if (!numeric && (p = getprprid(prid))) return p->pr_name; snprintf(buffer, sizeof(buffer), "#%u", (unsigned int)prid); return &buffer[0]; From owner-xfs@oss.sgi.com Thu Apr 19 01:41:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 01:41:07 -0700 (PDT) Received: from ex-osl-dc03.exense.int (exense-pdc.exense.com [195.204.47.129]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3J8f2fB024171 for ; Thu, 19 Apr 2007 01:41:03 -0700 Received: from [127.0.0.1] ([10.1.3.13]) by ex-osl-dc03.exense.int with Microsoft SMTPSVC(6.0.3790.1830); Thu, 19 Apr 2007 10:28:45 +0200 Message-ID: <4627283E.7060000@start.no> Date: Thu, 19 Apr 2007 10:28:46 +0200 From: "Stein M.
Hugubakken" User-Agent: Thunderbird 1.5.0.10 (X11/20070317) MIME-Version: 1.0 To: xfs@oss.sgi.com Subject: Inode usage Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-OriginalArrivalTime: 19 Apr 2007 08:28:45.0181 (UTC) FILETIME=[BEB9FAD0:01C7825C] X-archive-position: 11127 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dulci@start.no Precedence: bulk X-list: xfs Hi! I have a lot of free inodes on my xfs-partitions and was wondering about what impact this has on performance or memory? Here is output from 'df': df -ih Filesystem Inodes IUsed IFree IUse% Mounted on /dev/hda2 5,1M 129K 5,0M 3% / /dev/hda3 31M 54K 31M 1% /home df -h Filesystem Size Used Avail Use% Mounted on /dev/hda2 5,1G 2,8G 2,4G 55% / /dev/hda3 31G 18G 13G 59% /home With xfs_growfs -m I can adjust the amount of free inodes, but it seems I can't change it for the root-partition, why is that a problem? Kind regards Stein From owner-xfs@oss.sgi.com Thu Apr 19 06:14:54 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 06:14:57 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3JDEpfB023951 for ; Thu, 19 Apr 2007 06:14:52 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id XAA00078; Thu, 19 Apr 2007 23:14:44 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3JDEhAf70798677; Thu, 19 Apr 2007 23:14:43 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3JDEgnG70177793; Thu, 19 Apr 2007 23:14:42 +1000 (AEST) Date: Thu, 19 Apr