From owner-xfs@oss.sgi.com Tue May 1 07:21:11 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 07:21:13 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l41EL5fB015382 for ; Tue, 1 May 2007 07:21:08 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id AAA04161; Wed, 2 May 2007 00:20:54 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l41EKqAf81627789; Wed, 2 May 2007 00:20:53 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l41EKn6U80735839; Wed, 2 May 2007 00:20:49 +1000 (AEST) Date: Wed, 2 May 2007 00:20:49 +1000 From: David Chinner To: Nicholas Miell Cc: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070501142049.GG77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1177994346.3362.5.camel@entropy> User-Agent: Mutt/1.4.2.1i X-archive-position: 11237 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, Apr 30, 2007 at 09:39:06PM -0700, Nicholas Miell wrote: > On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote: > > On Mon, Apr 30, 2007 at 04:44:01PM 
-0600, Andreas Dilger wrote: > > > This is actually for future use. Any flags that are added into this > > > range must be understood by both sides or it should be considered an > > > error. Flags outside the FIEMAP_FLAG_INCOMPAT do not necessarily need > > > to be supported. If it turns out that 8 bits is too small a range for > > > INCOMPAT flags, then we can make 0x01000000 an incompat flag that means > > > e.g. 0x00ff0000 are also incompat flags also. > > > > Ah, ok. So it's not really a set of "compatibility" flags, it's more a > > "compulsory" set. Under those terms, i don't really see why this is > > necessary - either the filesystem will understand the flags or it will > > return EINVAL or ignore them... > > > > > I'm assuming that all flags that will be in the original FIEMAP proposal > > > will be understood by the implementations. Most filesystems can safely > > > ignore FLAG_HSM_READ, for example, since they don't support HSM, and for > > > that matter FLAG_SYNC is probably moot for most filesystems also because > > > they do block allocation at preprw time. > > > > Exactly my point - so why do we really need to encode a compulsory set of > > > > Because flags have meaning, independent of whether or not the filesystem > understands them. And if the filesystem chooses to ignore critically > important flags (instead of returning EINVAL), bad things may happen. > > So, either the filesystem will understand the flag or iff the unknown flag > is in the incompat set, it will return EINVAL or else the unknown flag will > be safely ignored. My point was that there is a difference between specification and implementation - if the specification says something is compulsory, then they must be implemented in the filesystem. This is easy enough to ensure by code review - we don't need additional interface complexity for this.... Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Tue May 1 11:38:23 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 11:38:27 -0700 (PDT) Received: from ppsw-9.csi.cam.ac.uk (ppsw-9.csi.cam.ac.uk [131.111.8.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l41IcLfB004929 for ; Tue, 1 May 2007 11:38:23 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from altaparmakov.plus.com ([212.159.79.82]:49945 helo=[192.168.1.64]) by ppsw-9.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.159]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HixE1-00068o-W4 (Exim 4.63) (return-path ); Tue, 01 May 2007 19:37:22 +0100 In-Reply-To: <20070501042254.GD77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <1FA8E92B-954D-4624-A089-80D4AA7399FD@cam.ac.uk> Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Tue, 1 May 2007 19:37:20 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11238 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 1 May 2007, at 05:22, David Chinner wrote: > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: >> The FIBMAP ioctl is for privileged users >> only, and I wonder if FIEMAP 
should be the same, or at least >> disallow >> mapping files that the user can't access especially with >> FLAG_SYNC and/or >> FLAG_HSM_READ. > > I see little reason for restricting FI[BE]MAP to privileged users - > anyone should be able to determine if files they have permission to > access are fragmented. Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the machine. Perhaps for non-privileged users FIEMAP has to be read- only? As soon as any of the FLAG_* flags come into play you make it privileged. For example fancy any user being able to fill up your file system by calling FIEMAP with FLAG_HSM_READ on all files recursively? This should certainly not be simply dismissed as a non- issue without thinking about it first... Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Tue May 1 11:48:41 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 11:48:44 -0700 (PDT) Received: from ppsw-9.csi.cam.ac.uk (ppsw-9.csi.cam.ac.uk [131.111.8.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l41ImefB006913 for ; Tue, 1 May 2007 11:48:41 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from altaparmakov.plus.com ([212.159.79.82]:49949 helo=[192.168.1.64]) by ppsw-9.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.159]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HixNG-0000gV-WA (Exim 4.63) (return-path ); Tue, 01 May 2007 19:46:55 +0100 In-Reply-To: <20070501142049.GG77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> 
<1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Tue, 1 May 2007 19:46:53 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11239 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 1 May 2007, at 15:20, David Chinner wrote: > On Mon, Apr 30, 2007 at 09:39:06PM -0700, Nicholas Miell wrote: >> On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote: >>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: >>>> This is actually for future use. Any flags that are added into >>>> this >>>> range must be understood by both sides or it should be >>>> considered an >>>> error. Flags outside the FIEMAP_FLAG_INCOMPAT do not >>>> necessarily need >>>> to be supported. If it turns out that 8 bits is too small a >>>> range for >>>> INCOMPAT flags, then we can make 0x01000000 an incompat flag >>>> that means >>>> e.g. 0x00ff0000 are also incompat flags also. >>> >>> Ah, ok. So it's not really a set of "compatibility" flags, it's >>> more a >>> "compulsory" set. Under those terms, i don't really see why this is >>> necessary - either the filesystem will understand the flags or it >>> will >>> return EINVAL or ignore them... >>> >>>> I'm assuming that all flags that will be in the original FIEMAP >>>> proposal >>>> will be understood by the implementations. 
Most filesystems can >>>> safely >>>> ignore FLAG_HSM_READ, for example, since they don't support HSM, >>>> and for >>>> that matter FLAG_SYNC is probably moot for most filesystems also >>>> because >>>> they do block allocation at preprw time. >>> >>> Exactly my point - so why do we really need to encode a >>> compulsory set of >> >> Because flags have meaning, independent of whether or not the >> filesystem >> understands them. And if the filesystem chooses to ignore critically >> important flags (instead of returning EINVAL), bad things may happen. >> >> So, either the filesystem will understand the flag or iff the >> unknown flag >> is in the incompat set, it will return EINVAL or else the unknown >> flag will >> be safely ignored. > > My point was that there is a difference between specification and > implementation - if the specification says something is compulsory, > then they must be implemented in the filesystem. This is easy > enough to ensure by code review - we don't need additional interface > complexity for this.... You are wrong about this because you are missing the point that you have no code to review. The users that will use those flags are going to be applications that run in user space. Chances are you will never see their code. Heck, they might not even be open source applications... And all applications will run against a multitude of kernels. So version X of the application will run on kernel 2.4.*, 2.6.*, a.b.*, etc... For future expandability of the interface I think it is important to have both compulsory and non-compulsory flags. For example there is no reason why FIEMAP_HSM_READ needs to be compulsory. Most filesystems do not support HSM so can safely ignore it. 
And applications that want to read/write the data locations that are obtained with the FIEMAP call will likely always supply FIEMAP_HSM_READ because they want to ensure the file is brought in if it is offline, so they definitely want file systems that do not support this flag to ignore it.

And vice versa, an application might specify some weird and funky yet-to-be-developed feature that it expects the FS to perform, and if the FS cannot do it (either because it does not support it or because it failed to perform the operation) the application expects the FS to return an error and not to ignore the flag. An example could be the requested FIEMAP_XATTR_FORK flag. If that is implemented and the FS ignores it, it will return the extent map for the file data instead of the XATTR_FORK! Not what the application wanted at all. Ouch! So this is definitely a compulsory flag if I ever saw one.

So as you see, you must support both voluntary and compulsory flags...

Also consider what I said above about different kernels. Say a new feature is implemented in kernel 2.8.13 that was not there before, and an application is updated to use that feature. There will be lots of instances where that application will still be run on older kernels where this feature does not exist. Depending on the feature, it may be quite sensible for the kernel to simply ignore the fact that the application set an unknown flag, whilst for a different feature it may be the opposite.
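The compulsory/voluntary split argued above can be sketched in C as follows. This is only an illustration: the EX_ prefix, the individual flag values and the helper name are hypothetical and do not come from the proposal. The one assumption drawn from the thread is that the compulsory (incompat) flags occupy the top 8 bits of the flags word.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag values for illustration; not from any kernel header. */
#define EX_FLAG_SYNC        0x00000001u  /* voluntary: safe to ignore */
#define EX_FLAG_HSM_READ    0x00000002u  /* voluntary: safe to ignore */
#define EX_FLAG_XATTR_FORK  0x02000000u  /* compulsory: must be understood */
#define EX_FLAG_INCOMPAT    0xff000000u  /* assumed: top 8 bits are compulsory */

/*
 * Decide what to do with the flags a caller passed in.
 *
 * requested: flags set by the application
 * supported: flags this filesystem implements
 *
 * Returns 0 and stores the flags that will be acted on in *effective,
 * or returns -EINVAL if an unsupported flag lies in the compulsory range.
 * Unsupported flags outside that range are silently dropped.
 */
static int ex_check_flags(uint32_t requested, uint32_t supported,
                          uint32_t *effective)
{
    uint32_t unknown = requested & ~supported;

    if (unknown & EX_FLAG_INCOMPAT)
        return -EINVAL;

    *effective = requested & supported;
    return 0;
}
```

With this shape, an old kernel that has never heard of a new voluntary flag keeps working, while a new compulsory flag fails loudly instead of returning the wrong map.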
Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Tue May 1 15:32:43 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 15:32:47 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l41MWgfB012145 for ; Tue, 1 May 2007 15:32:43 -0700 Received: from localhost.adilger.int (72-254-21-136.client.stsn.net [72.254.21.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id DFE564E4564; Tue, 1 May 2007 16:32:41 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id AC4524179; Tue, 1 May 2007 15:32:36 -0700 (PDT) Date: Tue, 1 May 2007 15:32:36 -0700 From: Andreas Dilger To: David Chinner Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070501223236.GM5722@schatzie.adilger.int> Mail-Followup-To: David Chinner , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070501142049.GG77450368@melbourne.sgi.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 
X-archive-position: 11241 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 02, 2007 00:20 +1000, David Chinner wrote: > My point was that there is a difference between specification and > implementation - if the specification says something is compulsory, > then they must be implemented in the filesystem. This is easy > enough to ensure by code review - we don't need additional interface > complexity for this.... What you seem to be missing about my proposal is that the FLAG_INCOMPAT is for future use by that part of the specification we haven't thought of yet... Having COMPAT/INCOMPAT flags has been very useful for ext2/3/4, and is much better than having version numbers for the interface. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Tue May 1 15:30:50 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 15:30:53 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l41MUnfB011674 for ; Tue, 1 May 2007 15:30:50 -0700 Received: from localhost.adilger.int (72-254-21-136.client.stsn.net [72.254.21.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 459B44E4564; Tue, 1 May 2007 16:30:47 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id F06254179; Tue, 1 May 2007 15:30:40 -0700 (PDT) Date: Tue, 1 May 2007 15:30:40 -0700 From: Andreas Dilger To: David Chinner Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070501223040.GL5722@schatzie.adilger.int> Mail-Followup-To: David Chinner , linux-ext4@vger.kernel.org, 
linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070501042254.GD77450368@melbourne.sgi.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11240 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 01, 2007 14:22 +1000, David Chinner wrote: > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: > > Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't > > I disagree - why would you want to indicate the state is unknown when we know > very well that it is offline? If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a catch-all flag that indicates "this extent contains data but there is nothing sensible to be returned for the extent mapping." > Effectively, when your extent is offline in the HSM, it is inaccessable, and > you have to bring it back from tape so it becomes accessible again. i.e. some > action is necessary on behalf of the user to make it accessible. So I think > that OFFLINE is a good name for this state because it really is inaccessible. What you are calling OFFLINE I would prefer to call UNMAPPED, since that can be used by applications as a catch-all for "no mapping". There can be further flags that give refinements to UNMAPPED that some applications might care about them (e.g. HSM_RESIDENT), but many users/apps will not if they just want the number of fragments in a given file. 
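The catch-all idea can be sketched in C as follows. The flag values, the struct layout and all names are hypothetical, since the thread had not settled on any of them. The point is that a tool which only counts usable fragments tests the single UNMAPPED bit and never needs to know about refinements such as HSM_RESIDENT.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent flags; values are illustrative only. */
#define EX_EXTENT_UNMAPPED     0x1u /* data exists, but no usable block mapping */
#define EX_EXTENT_HSM_RESIDENT 0x2u /* refinement: data is held by an HSM archive */

/* Hypothetical extent record; not the layout proposed for FIEMAP. */
struct ex_extent {
    uint64_t logical;   /* byte offset within the file */
    uint64_t physical;  /* byte offset on the device; meaningless if UNMAPPED */
    uint64_t length;    /* length in bytes */
    uint32_t flags;
};

/*
 * Count the extents whose physical location can be trusted.  Only the
 * catch-all bit is examined, so refinement flags added later do not
 * change the result.
 */
static size_t ex_count_mapped(const struct ex_extent *ext, size_t n)
{
    size_t mapped = 0;

    for (size_t i = 0; i < n; i++)
        if (!(ext[i].flags & EX_EXTENT_UNMAPPED))
            mapped++;
    return mapped;
}
```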
> Also, I don't think "secondary" is a good term because most large systems > have more than one tier of storage. One possibility is "HSM_RESIDENT" > which indicates the extent is current and resident with a HSM's archive.... Sure. > > Can you propose reasonable flag names for these (I can't think of anything > > very good) and a clear explanation of what they mean. I suspect it will > > only be XFS that uses them initially. In mke2fs and ext4+mballoc there is > > the concept of stripe unit and stripe width, but as yet they are not > > communicated between the two very well. I'd be much happier if this info > > could be queried in a standard way from the block layer instead of the > > user having to specify it and the filesystem having to track it. > > My preference is definitely for a separate ioctl to grab the > filesystem geometry so this stuff can be calculated in userspace. > i.e. the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't > bother trying to define names until we decide which appraoch we take > to implement this. Hmm, previously you wrote "This information could be easily passed up in the flags fields if the filesystem has geometry information". So, I _think_ what you are saying is that you want 4 flags to convey this start/end alignment information, but the exact semantics of what a "stripe unit" and a "stripe width" is filesystem specific? I definitely do NOT want to get into any issues of querying the block device geometry here. I was just making a passing comment that ext4+mballoc can already do RAID-specific allocation alignment, but it depends on the admin to specify this information and it would be nice if there was some easy way to get this from userspace/kernel interfaces. Having an API that can request "tell me the number of blocks from this offset until the next physical disk boundary" or similar would be useful to any allocator, and the block layer already needs to know this when submitting IO. 
> In XFS, mkfs.xfs does the work of getting this information > to see in the filesystem superblock. Here's the code for getting > sunit/swidth from the underlying block device: > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/ > > Not much in common there ;) It looks like this might be just what e2fsprogs needs also. > > It does make sense to specify zero for the fm_extent_count array and a > > new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the > > extent data itself, for the non-verbose mode of filefrag, and for > > pre-allocating a buffer large enough to hold the file if that is important. > > Rather than rely on implicit behaviour of "pass in extent count of > zero and a don't try to return any extents" to return the number of > extents on the file, why not just explicitly define this as a valid > input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS That's what I said, isn't it? FIEMAP_FLAG_NO_EXTENTS. I wonder if my clever-clever for "return no extents" and "return number of extents" is wasted :-/. > > - does XFS return an extent for the metadata parts of the file (e.g. btree)? > > No, but we can return the extent map for the attribute fork (i.e. > extended attrs) if asked for (XFS_IOC_GETBMAPA). This seems like it would be a useful addition to the interface also, having FIEMAP_FLAG_METADATA request the return of metadata allocations too. > > - does XFS return preallocated extents beyond EOF? > > Yes - they are part of the extent map for the file. OK. > > - does XFS allow non-root users to call xfs_bmap on files they don't own, or > > use by non-root users at all? > > Users can run xfs_bmap on any file they have permission to > open(O_RDONLY). > > > The FIBMAP ioctl is for privileged users > > only, and I wonder if FIEMAP should be the same, or at least disallow > > mapping files that the user can't access especially with FLAG_SYNC and/or > > FLAG_HSM_READ. 
> > I see little reason for restricting FI[BE]MAP to privileged users - > anyone should be able to determine if files they have permission to > access are fragmented. I think I agree with Anton that allowing some of the flags for non-privileged users seems dangerous. I think this needs to be determined on a flag-by-flag basis, and -EPERM should be returned in some cases. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Tue May 1 17:07:20 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 17:07:22 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4207GfB029493 for ; Tue, 1 May 2007 17:07:18 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA19765; Wed, 2 May 2007 10:07:02 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4206xAf82132681; Wed, 2 May 2007 10:07:00 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4206tcJ81768258; Wed, 2 May 2007 10:06:55 +1000 (AEST) Date: Wed, 2 May 2007 10:06:54 +1000 From: David Chinner To: Anton Altaparmakov Cc: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502000654.GK77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1FA8E92B-954D-4624-A089-80D4AA7399FD@cam.ac.uk> Mime-Version: 1.0 
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1FA8E92B-954D-4624-A089-80D4AA7399FD@cam.ac.uk>
User-Agent: Mutt/1.4.2.1i
X-archive-position: 11242
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: dgc@sgi.com
Precedence: bulk
X-list: xfs

On Tue, May 01, 2007 at 07:37:20PM +0100, Anton Altaparmakov wrote:
> On 1 May 2007, at 05:22, David Chinner wrote:
> >On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
> >> The FIBMAP ioctl is for privileged users
> >> only, and I wonder if FIEMAP should be the same, or at least disallow
> >> mapping files that the user can't access especially with FLAG_SYNC and/or
> >> FLAG_HSM_READ.
> >
> >I see little reason for restricting FI[BE]MAP to privileged users -
> >anyone should be able to determine if files they have permission to
> >access are fragmented.
>
> Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the
> machine. Perhaps for non-privileged users FIEMAP has to be read-only?
> As soon as any of the FLAG_* flags come into play you make it
> privileged. For example fancy any user being able to fill up your
> file system by calling FIEMAP with FLAG_HSM_READ on all files
> recursively?

By that reasoning, users should not be allowed to recall any files without root privileges. HSMs don't work that way, though - any user is allowed to recall any files they have permission to access, either by manual command or by trying to read the file data. If that runs the filesystem out of space, then the HSM either hasn't been configured properly or it's failed to manage the space correctly. Either way, that's not the fault of the user for recalling their own files.

Hence allowing FIEMAP to be executed by the user does not open up any DOS conditions that don't already exist in a normal HSM-managed filesystem.

Cheers,

Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Tue May 1 19:27:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 19:27:06 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l422QxfB029690 for ; Tue, 1 May 2007 19:27:01 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id MAA22695; Wed, 2 May 2007 12:26:48 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l422QkAf82214176; Wed, 2 May 2007 12:26:47 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l422Qisa78652236; Wed, 2 May 2007 12:26:44 +1000 (AEST) Date: Wed, 2 May 2007 12:26:44 +1000 From: David Chinner To: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502022644.GO77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070501223040.GL5722@schatzie.adilger.int> User-Agent: Mutt/1.4.2.1i X-archive-position: 11243 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, May 01, 2007 at 03:30:40PM -0700, Andreas Dilger wrote: > On May 01, 2007 14:22 +1000, 
David Chinner wrote: > > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: > > > Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't > > > > I disagree - why would you want to indicate the state is unknown when we know > > very well that it is offline? > > If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a > catch-all flag that indicates "this extent contains data but there is > nothing sensible to be returned for the extent mapping." Yes, I like that much more. Good suggestion. ;) > > Effectively, when your extent is offline in the HSM, it is inaccessable, and > > you have to bring it back from tape so it becomes accessible again. i.e. some > > action is necessary on behalf of the user to make it accessible. So I think > > that OFFLINE is a good name for this state because it really is inaccessible. > > What you are calling OFFLINE I would prefer to call UNMAPPED, since that > can be used by applications as a catch-all for "no mapping". There can > be further flags that give refinements to UNMAPPED that some applications > might care about them (e.g. HSM_RESIDENT), but many users/apps will not > if they just want the number of fragments in a given file. Agreed - UNMAPPED does make a lot more sense in this case. > > > Can you propose reasonable flag names for these (I can't think of anything > > > very good) and a clear explanation of what they mean. I suspect it will > > > only be XFS that uses them initially. In mke2fs and ext4+mballoc there is > > > the concept of stripe unit and stripe width, but as yet they are not > > > communicated between the two very well. I'd be much happier if this info > > > could be queried in a standard way from the block layer instead of the > > > user having to specify it and the filesystem having to track it. > > > > My preference is definitely for a separate ioctl to grab the > > filesystem geometry so this stuff can be calculated in userspace. > > i.e. 
the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't > > bother trying to define names until we decide which appraoch we take > > to implement this. > > Hmm, previously you wrote "This information could be easily passed up in the > flags fields if the filesystem has geometry information". So, I _think_ > what you are saying is that you want 4 flags to convey this start/end > alignment information, but the exact semantics of what a "stripe unit" and > a "stripe width" is filesystem specific? Right. > I definitely do NOT want to get into any issues of querying the block > device geometry here. I was just making a passing comment that ext4+mballoc > can already do RAID-specific allocation alignment, but it depends on the > admin to specify this information and it would be nice if there was some > easy way to get this from userspace/kernel interfaces. > > Having an API that can request "tell me the number of blocks from this > offset until the next physical disk boundary" or similar would be useful > to any allocator, and the block layer already needs to know this when > submitting IO. The block layer knows this once you get inside the volume manager. I think the issue is that there is no common export interface for this information. > > In XFS, mkfs.xfs does the work of getting this information > > to see in the filesystem superblock. Here's the code for getting > > sunit/swidth from the underlying block device: > > > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/ > > > > Not much in common there ;) > > It looks like this might be just what e2fsprogs needs also. More than likely. > > > It does make sense to specify zero for the fm_extent_count array and a > > > new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the > > > extent data itself, for the non-verbose mode of filefrag, and for > > > pre-allocating a buffer large enough to hold the file if that is important. 
> > > > Rather than rely on implicit behaviour of "pass in extent count of > > zero and a don't try to return any extents" to return the number of > > extents on the file, why not just explicitly define this as a valid > > input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS > > That's what I said, isn't it? FIEMAP_FLAG_NO_EXTENTS. I wonder if my > clever-clever for "return no extents" and "return number of extents" > is wasted :-/. Too clever for an API, I think. ;) My point is mainly that if you are going to use an API for a specific function (e.g. query the number of extents) I think that the API should have an obvious method for executing that specific function. Using a command of "get no extents" to provide the query of "how many extents in this file" is kind of obscure. When you read the code it doesn't make a lot of sense, as opposed to seeing a clear statement of intent from the code itself. i.e. FIEMAP_FLAG_GET_NUMEXTENTS is self-documenting in both the API and the code that uses it... > > > - does XFS return an extent for the metadata parts of the file (e.g. btree)? > > > > No, but we can return the extent map for the attribute fork (i.e. > > extended attrs) if asked for (XFS_IOC_GETBMAPA). > > This seems like it would be a useful addition to the interface also, having > FIEMAP_FLAG_METADATA request the return of metadata allocations too. Agreed. The different types of requests need to be mutually exclusive, though - returning the map of the attribute fork mixed with the map of the data fork is going to be confusing.... > > > - does XFS allow non-root users to call xfs_bmap on files they don't own, or > > > use by non-root users at all? > > > > Users can run xfs_bmap on any file they have permission to > > open(O_RDONLY). > > > > > The FIBMAP ioctl is for privileged users > > > only, and I wonder if FIEMAP should be the same, or at least disallow > > > mapping files that the user can't access especially with FLAG_SYNC and/or > > > FLAG_HSM_READ. 
> > > > I see little reason for restricting FI[BE]MAP to privileged users - > > anyone should be able to determine if files they have permission to > > access are fragmented. > > I think I agree with Anton that allowing some of the flags for non-privileged > users seems dangerous. I think this needs to be determined on a flag-by-flag > basis, and -EPERM should be returned in some cases. Agreed, but I'm yet to see any flags where I think that is necessary yet. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 01:18:21 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 01:18:25 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l428IKfB012099 for ; Wed, 2 May 2007 01:18:21 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49210) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjA0M-0001Ma-PD (Exim 4.63) (return-path ); Wed, 02 May 2007 09:16:06 +0100 In-Reply-To: <20070502000654.GK77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1FA8E92B-954D-4624-A089-80D4AA7399FD@cam.ac.uk> <20070502000654.GK77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <8464EA47-03AC-4162-A2D0-683517568640@cam.ac.uk> Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: 
Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 09:16:04 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11244 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 01:06, David Chinner wrote: > On Tue, May 01, 2007 at 07:37:20PM +0100, Anton Altaparmakov wrote: >> On 1 May 2007, at 05:22, David Chinner wrote: >>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: >>>> The FIBMAP ioctl is for privileged users >>>> only, and I wonder if FIEMAP should be the same, or at least >>>> disallow >>>> mapping files that the user can't access especially with >>>> FLAG_SYNC and/or >>>> FLAG_HSM_READ. >>> >>> I see little reason for restricting FI[BE]MAP to privileged users - >>> anyone should be able to determine if files they have permission to >>> access are fragmented. >> >> Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the >> machine. Perhaps for non-privileged users FIEMAP has to be read- >> only? As soon as any of the FLAG_* flags come into play you make it >> privileged. For example fancy any user being able to fill up your >> file system by calling FIEMAP with FLAG_HSM_READ on all files >> recursively? > > By that reasoning, users should not be allowed to recall any files > without root privileges. HSMs don't work that way, though - any user > is allowed to recall any files they have permission to access either > by manual command or by trying to read the file data. > > If that runs the filesystem out of space, then the HSM either hasn't > been configured properly or it's failed to manage the space > correctly. Either way, that's not the fault of the user for > recalling their own files.
> > Hence allowing FIEMAP to be executed by the user does not open up > any DOS conditions that don't already exist in normal HSM-managed > filesystem. Sorry, it was not a great example. But the point still stands that there are/may be created flags that you do not want to allow everyone to use. I completely agree with Andreas that those can simply return -EPERM and the rest can be allowed through. Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 01:25:09 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 01:25:15 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l428P7fB013738 for ; Wed, 2 May 2007 01:25:08 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49214) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjA7h-0003fi-Mq (Exim 4.63) (return-path ); Wed, 02 May 2007 09:23:41 +0100 In-Reply-To: <20070501223040.GL5722@schatzie.adilger.int> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> Cc: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit 
From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 09:23:38 +0100 To: Andreas Dilger X-Mailer: Apple Mail (2.752.3) X-archive-position: 11245 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 1 May 2007, at 23:30, Andreas Dilger wrote: > On May 01, 2007 14:22 +1000, David Chinner wrote: >> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: >>> Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I >>> didn't >> >> I disagree - why would you want to indicate the state is unknown >> when we know >> very well that it is offline? > > If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a > catch-all flag that indicates "this extent contains data but there is > nothing sensible to be returned for the extent mapping." I like UNMAPPED. I even use it in NTFS internally for extents maps that have not been read into memory yet. (-: On a different issue, do you think it would be worth adding an option flag like FIEMAP_DONT_RELOCATE or something similar that would be a compulsory flag and if set the FS is not allowed to move the file around/change the block allocation of the file. My thinking is that the extent map is not terribly useful if the FS goes and relocates the file to somewhere else just after you have done the ioctl. For example HFS on OSX automatically defragments files whilst it is running... Linux file systems may one day do similar things. Or alternatively a flag like FIEMAP_MAKE_DIRECT or something to tell the FS we want to access the actual raw blocks so the FS can make sure the data is on block aligned boundaries and if the FS does not support this (e.g. ZFS or a compressed or encrypted NTFS file) then it can return -ENOTSUP.
Perhaps this is totally the wrong interface and such a "prepare file for direct access" API should be a different ioctl() or syscall or whatever. It just seems very simple and appropriate to combine it here as people who use FIEMAP are at least sometimes going to be wanting to access those blocks directly as well and it feels right to be able to communicate this to the FS in the same call, kind of like an "open intent" of "I want to use the data directly on disk"... What do you think? Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 01:31:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 01:31:39 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l428VZfB015273 for ; Wed, 2 May 2007 01:31:36 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49220) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjAE7-0006gx-N5 (Exim 4.63) (return-path ); Wed, 02 May 2007 09:30:19 +0100 In-Reply-To: <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <69B76939-CAAD-4F43-BE9F-6C3CA3ECCF5E@cam.ac.uk> Cc: 
David Chinner , linux-ext4@vger.kernel.org, Linux Filesystems , xfs@oss.sgi.com, Christoph Hellwig Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 09:30:17 +0100 To: Andreas Dilger X-Mailer: Apple Mail (2.752.3) X-archive-position: 11246 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 09:23, Anton Altaparmakov wrote: > On 1 May 2007, at 23:30, Andreas Dilger wrote: > >> On May 01, 2007 14:22 +1000, David Chinner wrote: >>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: >>>> Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but >>>> I didn't >>> >>> I disagree - why would you want to indicate the state is unknown >>> when we know >>> very well that it is offline? >> >> If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a >> catch-all flag that indicates "this extent contains data but there is >> nothing sensible to be returned for the extent mapping." > > I like UNMAPPED. I even use it in NTFS internally for extents maps > that have not been read into memory yet. (-: Oops, I use NOT_MAPPED in NTFS rather than UNMAPPED but I still like UNMAPPED, too. 
(-: Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 02:15:48 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 02:15:52 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l429FjfB025664 for ; Wed, 2 May 2007 02:15:47 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id TAA03102; Wed, 2 May 2007 19:15:33 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l429FUAf82146138; Wed, 2 May 2007 19:15:31 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l429FQYw81999881; Wed, 2 May 2007 19:15:26 +1000 (AEST) Date: Wed, 2 May 2007 19:15:26 +1000 From: David Chinner To: Anton Altaparmakov Cc: David Chinner , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502091526.GW77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> User-Agent: Mutt/1.4.2.1i X-archive-position: 11247 
X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: > On 1 May 2007, at 15:20, David Chinner wrote: > >> > >>So, either the filesystem will understand the flag or iff the > >>unknown flag > >>is in the incompat set, it will return EINVAL or else the unknown > >>flag will > >>be safely ignored. > > > >My point was that there is a difference between specification and > >implementation - if the specification says something is compulsory, > >then they must be implemented in the filesystem. This is easy > >enough to ensure by code review - we don't need additional interface > >complexity for this.... > > You are wrong about this because you are missing the point that you > have no code to review. The users that will use those flags are > going to be applications that run in user space. Chances are you > will never see their code. Heck, they might not even be open source > applications... Ummm - the specification defines what is compulsory for *filesystems* to implement, not what applications can use. We don't need to see what the applications do - what we care about is that all filesystems implement the compulsory part of the specification. That's the code we review, and that's what I was referring to. > And all applications will run against a multitude of > kernels. So version X of the application will run on kernel 2.4.*, > 2.6.*, a.b.*, etc... For future expandability of the interface I > think it is important to have both compulsory and non-compulsory flags. Ah, so that's what you want - a mutable interface. i.e. versioning. So how do compulsory flags help here? What happens if a voluntary flag now becomes compulsory? Or vice versa? How is the application supposed to deal with this dynamically?
I suggested a version number for this right back at the start of this discussion and got told that we don't want versioned interfaces because we should make the effort to get it right the first time. I don't think this can be called "getting it right". > For example there is no reason why FIEMAP_HSM_READ needs to be > compulsory. Most filesystems do not support HSM so can safely ignore > it. They might be able to safely ignore it, but in reality it should be saying "I don't understand this". If the application *needs* to use a flag like this, then it should be told that the filesystem is not capable of doing what it was asked! OTOH if the application does not need to use the flag, then it shouldn't be using it and we shouldn't be silently ignoring incorrect usage of the provided API. What you are effectively saying about these "voluntary" flags is that their behaviour is _undefined_. That is, if you use these flags what you get on a successful call is undefined; it may or may not contain what you asked for but you can't tell if it really did what you want or returned the information you asked for. This is a really bad semantic to encode into an API. > And vice versa, an application might specify some weird and funky yet > to be developed feature that it expects the FS to perform and if the > FS cannot do it (either because it does not support it or because it > failed to perform the operation) the application expects the FS to > return an error and not to ignore the flag. An example could be the > asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS > ignores it it will return the extent map for the file data instead of > the XATTR_FORK! Not what the application wanted at all. Ouch! So > this is definitely a compulsory flag if I ever saw one. Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But we don't need a flag defined in the user visible API to tell us that we need to return an error here. 
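To make the strict behaviour argued for above concrete, here is a minimal sketch in C of a filesystem-side check that returns an error for any flag the filesystem does not implement. The function name fiemap_check_flags_strict and the numeric flag values are assumptions made up for this illustration; nothing in the thread defines them.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define FIEMAP_FLAG_SYNC       0x00000001u
#define FIEMAP_FLAG_HSM_READ   0x00000002u
#define FIEMAP_FLAG_XATTR_FORK 0x00000004u

/*
 * Strict validation (hypothetical helper): every requested flag must be
 * one the filesystem implements. Any other bit is rejected, so a caller
 * never gets a silent no-op.
 *
 * Returns 0 if all bits in 'requested' are in 'supported', -EINVAL otherwise.
 */
int fiemap_check_flags_strict(uint32_t requested, uint32_t supported)
{
	if (requested & ~supported)
		return -EINVAL;
	return 0;
}
```

With this shape, a filesystem that only implements FIEMAP_FLAG_SYNC would reject a request carrying FIEMAP_FLAG_XATTR_FORK instead of returning the data fork mapping by mistake.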
> So as you see you must support both voluntary and compulsory flags... No, you've managed to convince me that they are not necessary and they are in fact a Bad Idea... ;) > Also consider what I said above about different kernels. A new > feature is implemented in kernel 2.8.13 say that was not there before > and an application is updated to use that feature. There will be > lots of instances where that application will still be run on older > kernels where this feature does not exist. This is *exactly* where silently ignoring flags really falls down. On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does something and it returns different structure contents for the same state. Now how does the application writer know which is correct or how to tell the difference? They have to guess or write detection code which is exactly what we want to avoid. I objected to the UNKNOWN flag because it wasn't explicit in its meaning - I'm doing the same thing here. An interface needs to be explicitly defined and should not have any undefined behaviour in it.... Cheers, Dave.
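One way to picture the application side of this argument: if unsupported flags are rejected rather than silently ignored, the caller gets an explicit signal and can decide what to do next. The sketch below is only an illustration under that assumption. The flag value, the stand-in function fake_fiemap_old_kernel and the helper fiemap_with_fallback are all hypothetical, and no real ioctl is issued.

```c
#include <errno.h>
#include <stdint.h>

#define FIEMAP_FLAG_HSM_READ 0x00000002u  /* hypothetical value */

/*
 * Stand-in for the ioctl on an older kernel that implements no flags
 * and rejects every flag it does not know (hypothetical behaviour).
 */
int fake_fiemap_old_kernel(uint32_t flags)
{
	return flags ? -EINVAL : 0;
}

/*
 * Try the call with the wanted flags. If they are rejected with -EINVAL,
 * retry without them. '*honoured' records whether the first attempt,
 * with the flags set, succeeded.
 */
int fiemap_with_fallback(int (*do_fiemap)(uint32_t), uint32_t wanted,
			 int *honoured)
{
	int ret = do_fiemap(wanted);

	*honoured = (ret == 0);
	if (ret == -EINVAL && wanted != 0)
		ret = do_fiemap(0);
	return ret;
}
```

If the flag were silently ignored instead, the first call would succeed on both kernels and '*honoured' would carry no information, which is the ambiguity the message objects to.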
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 02:38:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 02:38:32 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l429cPfB032340 for ; Wed, 2 May 2007 02:38:28 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49355) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjBFu-0000Yj-Ne (Exim 4.63) (return-path ); Wed, 02 May 2007 10:36:14 +0100 In-Reply-To: <20070502091526.GW77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 10:36:12 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11248 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 
10:15, David Chinner wrote: > On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: >> On 1 May 2007, at 15:20, David Chinner wrote: >>>> >>>> So, either the filesystem will understand the flag or iff the >>>> unknown flag >>>> is in the incompat set, it will return EINVAL or else the unknown >>>> flag will >>>> be safely ignored. >>> >>> My point was that there is a difference between specification and >>> implementation - if the specification says something is compulsory, >>> then they must be implemented in the filesystem. This is easy >>> enough to ensure by code review - we don't need additional interface >>> complexity for this.... >> >> You are wrong about this because you are missing the point that you >> have no code to review. The users that will use those flags are >> going to be applications that run in user space. Chances are you >> will never see their code. Heck, they might not even be open source >> applications... > > Ummm - the specification defines what is compulsory for *filesystems* > to implement, not what applications can use. We don't need to see > what the applications do - what we care about is that all filesystems > implement the compulsory part of the specification. That's the code > we review, and that's what I was referring to. > >> And all applications will run against a multitude of >> kernels. So version X of the application will run on kernel 2.4.*, >> 2.6.*, a.b.*, etc... For future expandability of the interface I >> think it is important to have both compulsory and non-compulsory >> flags. > > Ah, so that's what you want - a mutable interface. i.e. versioning. > > So how do compulsory flags help here? What happens if a voluntary > flag now becomes compulsory? Or vice versa? How is the application > supposed to deal with this dynamically?
> > I suggested a version number for this right back at the start of > this discussion and got told that we don't want versioned interfaces > because we should make the effort to get it right the first time. > I don't think this can be called "getting it right". Look at ext2/3/4. They do it that way and it works well. No versioning just compatible and incompatible flags... The proposal is to do the same here. >> For example there is no reason why FIEMAP_HSM_READ needs to be >> compulsory. Most filesystems do not support HSM so can safely ignore >> it. > > They might be able to safely ignore it, but in reality it should > be saying "I don't understand this". If the application *needs* to > use a flag like this, then it should be told that the filesystem is > not capable of doing what it was asked! That is where you are completely wrong! (-: Or rather you are wrong for my example, i.e. you are wrong/right depending on the type of flag in question. HSM_READ is definitely _NOT_ required because all it means is "if the file is OFFLINE, bring it ONLINE and then return the extent map". Clearly all file systems that do not support HSM can 100% ignore this flag as all files will ALWAYS be ONLINE so they will return the correct data ALWAYS so no need to do anything for HSM_READ. > OTOH if the application does not need to use the flag, then it > shouldn't be using it and we shouldn't be silently ignoring > incorrect usage of the provided API. > > What you are effectively saying about these "voluntary" flags > is that their behaviour is _undefined_. That is, if you use > these flags what you get on a successful call is undefined; > it may or may not contain what you asked for but you can't > tell if it really did what you want or returned the information > you asked for. > > This is a really bad semantic to encode into an API. That is your opinion. There is nothing undefined in the API at all. You just fail to understand it... 
>> And vice versa, an application might specify some weird and funky yet >> to be developed feature that it expects the FS to perform and if the >> FS cannot do it (either because it does not support it or because it >> failed to perform the operation) the application expects the FS to >> return an error and not to ignore the flag. An example could be the >> asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS >> ignores it it will return the extent map for the file data instead of >> the XATTR_FORK! Not what the application wanted at all. Ouch! So >> this is definitely a compulsory flag if I ever saw one. > > Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But > we don't need a flag defined in the user visible API to tell us > that we need to return an error here. Heh? What are you talking about? You need a flag to specify that you want XATTR_FORK. If not how the hell does the application specify that it wants XATTR_FORK instead of DATA_FORK (default)? Or are you of the opinion that FIEMAP should definitely not support XATTR_FORK. If the latter I fully agree. This should be a separate API with named streams and the FD of the named stream should be passed to FIEMAP without the silly XATTR_FORK flag... >> So as you see you must support both voluntary and compulsory flags... > > No, you've managed to convince me that they are not necessary and > they are in fact a Bad Idea... ;) We agree to disagree then. I think they are a very Good Idea(TM). (-; >> Also consider what I said above about different kernels. A new >> feature is implemented in kernel 2.8.13 say that was not there before >> and an application is updated to use that feature. There will be >> lots of instances where that application will still be run on older >> kernels where this feature does not exist. > > This is *exactly* where silently ignoring flags really falls down. It does not! > On 2.8.13, the flag is silently ignored. 
On 2.8.14, the flag does > something and it returns different structure contents for the same No it does not. You do NOT understand at all what we are talking about do you?!? If a flag would do something weird like returning different data then OBVIOUSLY you would make this a mandatory flag and it will NOT be ignored! You should know better than arguing with fallacies. Seriously... > state. Now how does the application writer know which is correct or > how to tell the difference? They have to guess or write detection > code which is exactly what we want to avoid. No they don't. It is then a compulsory flag so your argument is totally moot. > I objected to the UNKNOWN flag because it wasn't explicit > in its meaning - I'm doing the same thing here. An interface > needs to be explicitly defined and should not have any undefined > behaviour in it.... That is exactly the point. It is explicitly defined and has NO undefined behaviour in it. (-: Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 02:48:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 02:48:16 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l429mCfB003811 for ; Wed, 2 May 2007 02:48:14 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49362) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjBPJ-0006HX-NQ (Exim 4.63) (return-path ); Wed, 02 May 2007 10:45:57 +0100 In-Reply-To: <20070502091526.GW77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com>
<20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <1AFF1746-8313-4DC2-81D6-4271B5FB71A3@cam.ac.uk> Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 10:45:55 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11249 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 10:15, David Chinner wrote: > On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: >> And all applications will run against a multitude of >> kernels. So version X of the application will run on kernel 2.4.*, >> 2.6.*, a.b.*, etc... For future expandability of the interface I >> think it is important to have both compulsory and non-compulsory >> flags. > > Ah, so that's what you want - a mutable interface. i.e. versioning. > > So how do compulsory flags help here? A concrete example: Let's say that the FIEMAP interface goes live as is without any flags at all and just defined bits for "these are optional and those are compulsory". Then the next kernel adds support for optional flag HSM_READ and compulsory flag XATTR_READ. FS that do not support XATTR_READ will return -ENOTSUP as they cannot return the wanted data.
FS that do not support HSM_READ will still return the correct data in the majority of cases (except when the FS supports HSM and the data is actually OFFLINE which the application will need to be able to cope with anyway in case the FS failed to bring the file ONLINE even if it supports the HSM_READ flag so no added complexity for handling this case). > What happens if a voluntary flag now becomes compulsory? Or vice > versa? How is the application supposed to deal with this dynamically? Forgot to answer this bit: This cannot happen. There cannot be flags that move from compulsory to non-compulsory or anything stupid like that. It would have to be a totally new flag otherwise it breaks backwards compatibility and hence this interface becomes useless crap. > I suggested a version number for this right back at the start of > this discussion and got told that we don't want versioned interfaces > because we should make the effort to get it right the first time. > I don't think this can be called "getting it right". So all applications end up doing: if (version X, do blah) else if (version Y, do blob) else if (version Z, do foo) else if (version A, do bar) else exit(1); Every time a new version is added? And abort for unknown versions? Now that is a great interface! Not.
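For comparison, the optional/compulsory split from the concrete example above could look roughly like the following sketch in C. The flag values, the mask FIEMAP_FLAG_INCOMPAT covering the top eight bits, and the helper name fiemap_check_flags_incompat are assumptions for illustration only; the thread does not fix any of them.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical values for illustration only. */
#define FIEMAP_FLAG_HSM_READ   0x00000002u /* optional: safe to ignore */
#define FIEMAP_FLAG_XATTR_READ 0x01000000u /* compulsory: changes the result */
#define FIEMAP_FLAG_INCOMPAT   0xff000000u /* assumed mask of compulsory bits */

/*
 * Compat/incompat validation (hypothetical helper): an unsupported flag
 * is an error only if it falls inside the compulsory mask. Unsupported
 * optional flags are ignored.
 *
 * Returns 0 if the request can proceed, -ENOTSUP otherwise.
 */
int fiemap_check_flags_incompat(uint32_t requested, uint32_t supported)
{
	uint32_t unknown = requested & ~supported;

	if (unknown & FIEMAP_FLAG_INCOMPAT)
		return -ENOTSUP;
	return 0;
}
```

Under these assumptions a filesystem without HSM support accepts a request carrying FIEMAP_FLAG_HSM_READ unchanged, while one without attribute fork support rejects FIEMAP_FLAG_XATTR_READ.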
Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 02:49:18 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 02:49:21 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l429nEfB004317 for ; Wed, 2 May 2007 02:49:17 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id TAA03843; Wed, 2 May 2007 19:49:01 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l429mtAf82223314; Wed, 2 May 2007 19:48:57 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l429mqgl82278699; Wed, 2 May 2007 19:48:52 +1000 (AEST) Date: Wed, 2 May 2007 19:48:51 +1000 From: David Chinner To: Anton Altaparmakov Cc: Andreas Dilger , David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502094851.GX77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> User-Agent: Mutt/1.4.2.1i X-archive-position: 11250 X-ecartis-version: Ecartis v1.0.0 Sender: 
xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 02, 2007 at 09:23:38AM +0100, Anton Altaparmakov wrote: > On a different issue, do you think it would be worth adding an option > flags like FIEMAP_DONT_RELOCATE or something similar that would be a > compulsory flag and if set the FS is not allowed to move the file > around/change the block allocation of the file. We already have an inode flag in XFS to say this - the defrag tool checks it and ignores the file if it is set. > Or alternatively a flag like FIEMAP_MAKE_DIRECT or something to tell > the FS we want to access the actual raw blocks so the FS can make > sure the data is on block aligned boundaries and if the FS does not > support this (e.g. ZFS or a compressed or encrypted NTFS file) then > it can return -ENOTSUP. > > Perhaps this is totally the wrong interface and such a "prepare file > for direct access" API should be a different ioctl() or syscall or > whatever. It just seems very simple and appropriate to combine it > here as people who use FIEMAP are at least sometimes going to be > wanting to access those blocks directly as well and it feels right to > be able to communicate this to the FS in the same call, kind of like > an "open intent" of "I want to use the data directly on disk"... I think this is wrong interface for this. Sure, use it to get the mappings (that's what it's for) but what you do with the mappings after that is not part of FIEMAP.... Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 02:57:52 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 02:57:55 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l429vnfB007855 for ; Wed, 2 May 2007 02:57:52 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49383) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjBZ6-0001FA-PT (Exim 4.63) (return-path ); Wed, 02 May 2007 10:56:04 +0100 In-Reply-To: <20070502094851.GX77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> <20070502094851.GX77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: Cc: Andreas Dilger , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 10:56:03 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11251 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 10:48, David Chinner wrote: > On Wed, May 02, 2007 at 09:23:38AM +0100, Anton 
Altaparmakov wrote: >> On a different issue, do you think it would be worth adding an option >> flags like FIEMAP_DONT_RELOCATE or something similar that would be a >> compulsory flag and if set the FS is not allowed to move the file >> around/change the block allocation of the file. > > We already have an inode flag in XFS to say this - the defrag > tool checks it and ignores the file if it is set. That is great for XFS but you control the metadata. NTFS, HFS, etc are cases where we cannot add such a flag because we cannot modify the metadata format (ok we could in some kludgy manner like storing an EA with an inode to say "com.linux.ntfs.immutable" or something but I would rather not if I can avoid it). >> Or alternatively a flag like FIEMAP_MAKE_DIRECT or something to tell >> the FS we want to access the actual raw blocks so the FS can make >> sure the data is on block aligned boundaries and if the FS does not >> support this (e.g. ZFS or a compressed or encrypted NTFS file) then >> it can return -ENOTSUP. >> >> Perhaps this is totally the wrong interface and such a "prepare file >> for direct access" API should be a different ioctl() or syscall or >> whatever. It just seems very simple and appropriate to combine it >> here as people who use FIEMAP are at least sometimes going to be >> wanting to access those blocks directly as well and it feels right to >> be able to communicate this to the FS in the same call, kind of like >> an "open intent" of "I want to use the data directly on disk"... > > I think this is wrong interface for this. Sure, use it to get the > mappings (that's what it's for) but what you do with the mappings > after that is not part of FIEMAP.... Thanks for the comments. I am not sure it is a good idea either, just thought it would be worth discussing in case people thought it a good idea. 
Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 03:52:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 03:52:48 -0700 (PDT) Received: from mail.lst.de (verein.lst.de [213.95.11.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l42AqhfB021110 for ; Wed, 2 May 2007 03:52:44 -0700 Received: from verein.lst.de (localhost [127.0.0.1]) by mail.lst.de (8.12.3/8.12.3/Debian-7.1) with ESMTP id l42AqgmK016050 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO) for ; Wed, 2 May 2007 12:52:42 +0200 Received: (from hch@localhost) by verein.lst.de (8.12.3/8.12.3/Debian-6.6) id l42Aqfmi016048 for xfs@oss.sgi.com; Wed, 2 May 2007 12:52:41 +0200 Date: Wed, 2 May 2007 12:52:41 +0200 From: Christoph Hellwig To: xfs@oss.sgi.com Subject: Re: [Bug 756] New: File data corruption when writing to files with DM_EVENT_WRITE enabled over NFS (2.4 kernel) Message-ID: <20070502105241.GA15399@lst.de> References: <200705012104.l41L4CI3029767@oss.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200705012104.l41L4CI3029767@oss.sgi.com> User-Agent: Mutt/1.3.28i X-Scanned-By: MIMEDefang 2.39 X-archive-position: 11252 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@lst.de Precedence: bulk X-list: xfs > Note that a similar issue existed in the 2.6 SGI kernel up until it was resolved > by this recent change: > http://oss.sgi.com/cgi-bin/cvsweb.cgi/linux-2.6-xfs/fs/xfs/linux-2.6/xfs_lrw.c.diff?r1=1.258;r2=1.259;f=h Seems like someone forgot to send TAKEs to the xfs list once again.. 
From owner-xfs@oss.sgi.com Wed May 2 03:58:17 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 03:58:19 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l42AwCfB023745 for ; Wed, 2 May 2007 03:58:14 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id UAA05217; Wed, 2 May 2007 20:57:57 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l42AvrAf82323358; Wed, 2 May 2007 20:57:55 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l42AvnBI81446737; Wed, 2 May 2007 20:57:49 +1000 (AEST) Date: Wed, 2 May 2007 20:57:49 +1000 From: David Chinner To: Anton Altaparmakov Cc: David Chinner , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502105749.GY77450368@melbourne.sgi.com> References: <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> User-Agent: Mutt/1.4.2.1i X-archive-position: 11253 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: 
bulk X-list: xfs On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: > On 2 May 2007, at 10:15, David Chinner wrote: > >On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: > >>And all applications will run against a multitude of > >>kernels. So version X of the application will run on kernel 2.4.*, > >>2.6.*, a.b.*, etc... For future expandability of the interface I > >>think it is important to have both compulsory and non-compulsory > >>flags. > > > >Ah, so that's what you want - a mutable interface. i.e. versioning. > > > >So how does compusory flags help here? What happens if a voluntary > >flag now becomes compulsory? Or vice versa? How is the application > >supposed to deal with this dynamically? > > > >I suggested a version number for this right back at the start of > >this discussion and got told that we don't want versioned interfaces > >because we should make the effort to get it right the first time. > >I don't think this can be called "getting it right". > > Look at ext2/3/4. They do it that way and it works well. No > versioning just compatible and incompatible flags... The proposal is > to do the same here. Just because it works for extN doesn't make it right for this interface. > >>For example there is no reason why FIEMAP_HSM_READ needs to be > >>compulsory. Most filesystems do not support HSM so can safely ignore > >>it. > > > >They might be able to safely ignore it, but in reality it should > >be saying "I don't understand this". If the application *needs* to > >use a flag like this, then it should be told that the filesystem is > >not capable of doing what it was asked! > > That is where you are completely wrong! (-: Or rather you are wrong > for my example, i.e. you are wrong/right depending on the type of > flag in question. And that is the crux of the argument. My point is that *any* flag returns an error if the filesystem does not support it. 
> HSM_READ is definitely _NOT_ required because all
> it means is "if the file is OFFLINE, bring it ONLINE and then return
> the extent map".

You've got the definition of HSM_READ wrong. If the flag is *not* set, then we bring everything back online and return the full extent map. Specifying the flag indicates that we do *not* want the offline extents brought back online, i.e. it is an HSM or a datamover (e.g. a backup program) that is querying the extents and we want to know *exactly* what the current state of the file is right now.

So, if the HSM_READ flag is set, then the application is expecting the filesystem to be part of an HSM. Hence if it's not, it should return an error because somebody has done something wrong.

> >OTOH if the application does not need to use the flag, then it
> >shouldn't be using it and we shouldn't be silently ignoring
> >incorrect usage of the provided API.
> >
> >What you are effectively saying about these "voluntary" flags
> >is that their behaviour is _undefined_. That is, if you use
> >these flags what you get on a successful call is undefined;
> >it may or may not contain what you asked for but you can't
> >tell if it really did what you want or returned the information
> >you asked for.
> >
> >This is a really bad semantic to encode into an API.
>
> That is your opinion. There is nothing undefined in the API at all.
> You just fail to understand it...

FIEMAP returned success. Did it do what I asked? I don't know, because it's allowed to return success when it ignored me.

This is as silly an interface definition as saying you can implement fsync() with { return 0; }. So, when fsync() succeeded did it write my data to disk? I don't know; it's allowed to return success when it ignored me.

It's crazy, isn't it? It makes writing applications portable across operating systems a real PITA (ask the MySQL folk ;) because POSIX really does allow fsync() to be implemented like this.

I use this example because "allow some filesystems to silently ignore flags they don't understand" is a portability problem for applications - rather than a cross-OS issue it is a cross-filesystem issue. That is, if different filesystems behave differently to the same request, they will have to be handled specifically by the application. Every filesystem should behave in *exactly* the same way to the FIEMAP ioctls - if they don't support something they throw an error, if they do then they return the correct data.

> >>And vice versa, an application might specify some weird and funky yet
> >>to be developed feature that it expects the FS to perform and if the
> >>FS cannot do it (either because it does not support it or because it
> >>failed to perform the operation) the application expects the FS to
> >>return an error and not to ignore the flag. An example could be the
> >>asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS
> >>ignores it it will return the extent map for the file data instead of
> >>the XATTR_FORK! Not what the application wanted at all. Ouch! So
> >>this is definitely a compulsory flag if I ever saw one.
> >
> >Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But
> >we don't need a flag defined in the user visible API to tell us
> >that we need to return an error here.
>
> Heh? What are you talking about? You need a flag to specify that you
> want XATTR_FORK. If not how the hell does the application specify
> that it wants XATTR_FORK instead of DATA_FORK (default)? Or are you
> of the opinion that FIEMAP should definitely not support XATTR_FORK.
> If the latter I fully agree. This should be a separate API with
> named streams and the FD of the named stream should be passed to
> FIEMAP without the silly XATTR_FORK flag...

Ummmm - I think you misunderstood what I was saying. I was agreeing with you that if a FS does not support FIEMAP_XATTR_FORK "the correct answer is -EOPNOTSUPP or -EINVAL".

What I was saying is that we don't need a COMPAT flag bit to tell us the obvious error return if the filesystem does not support this functionality....

> >>Also consider what I said above about different kernels. A new
> >>feature is implemented in kernel 2.8.13 say that was not there before
> >>and an application is updated to use that feature. There will be
> >>lots of instances where that application will still be run on older
> >>kernels where this feature does not exist.
> >
> >This is *exactly* where silently ignoring flags really falls down.
>
> It does not!
>
> >On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does
> >something and it returns different structure contents for the same
>
> No it does not. You do NOT understand at all what we are talking
> about do you?!?
>
> If a flag would do something weird like returning different data then
> OBVIOUSLY you would make this a mandatory flag and it will NOT be
> ignored!

You've just successfully argued my case for me.

By your reasoning, if we have voluntary flags 1, 2 and 3 and filesystems A, B and C, and filesystem A is the only filesystem to implement 1, then when B implements 1 it must become a compulsory flag and hence C must now return an error despite being unchanged. Likewise when C implements 3, 3 must become a compulsory flag and A and B must now return an error despite being unchanged.

IOWs, whenever *any* filesystem implements a voluntary feature that it didn't previously support, we have to make that a mandatory feature and all other filesystems that don't support it must now return an error. You're guaranteeing that the application sees changes in behaviour with this interface, not preventing them.

Can we simply mandate that filesystems return an error to commands they don't support or don't understand and drop this silly interface mutation thing?

Cheers, Dave.
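
The stricter rule asked for in this message, where any flag the filesystem does not implement is an error, might be sketched as below. The EX_ flag names, their values and ex_check_flags_strict() are hypothetical and exist only for illustration; -EOPNOTSUPP is used because the message names it as one acceptable answer, and -EINVAL would serve equally well.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag values, for illustration only. */
#define EX_FLAG_HSM_READ   0x1u
#define EX_FLAG_XATTR_FORK 0x2u

/*
 * Strict validation: every flag the caller sets must be implemented by
 * the filesystem, so success always means each requested flag was honoured.
 */
static int ex_check_flags_strict(uint32_t flags, uint32_t supported)
{
    if (flags & ~supported)
        return -EOPNOTSUPP; /* at least one flag is not implemented */
    return 0;
}
```

Under this rule the same request gives the same answer on every filesystem that returns success, at the cost that an application must be prepared to retry without a flag it can live without.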
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 04:19:27 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 04:19:31 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l42BJPfB003965 for ; Wed, 2 May 2007 04:19:27 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49519) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjCpx-000625-Oy (Exim 4.63) (return-path ); Wed, 02 May 2007 12:17:33 +0100 In-Reply-To: <20070502105749.GY77450368@melbourne.sgi.com> References: <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> <20070502105749.GY77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 12:17:32 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11254 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 
11:57, David Chinner wrote: > On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: >> On 2 May 2007, at 10:15, David Chinner wrote: >>> On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: >>>> And all applications will run against a multitude of >>>> kernels. So version X of the application will run on kernel 2.4.*, >>>> 2.6.*, a.b.*, etc... For future expandability of the interface I >>>> think it is important to have both compulsory and non-compulsory >>>> flags. >>> >>> Ah, so that's what you want - a mutable interface. i.e. versioning. >>> >>> So how does compusory flags help here? What happens if a voluntary >>> flag now becomes compulsory? Or vice versa? How is the application >>> supposed to deal with this dynamically? >>> >>> I suggested a version number for this right back at the start of >>> this discussion and got told that we don't want versioned interfaces >>> because we should make the effort to get it right the first time. >>> I don't think this can be called "getting it right". >> >> Look at ext2/3/4. They do it that way and it works well. No >> versioning just compatible and incompatible flags... The proposal is >> to do the same here. > > Just because it works for extN doesn't make it right for this > interface. > >>>> For example there is no reason why FIEMAP_HSM_READ needs to be >>>> compulsory. Most filesystems do not support HSM so can safely >>>> ignore >>>> it. >>> >>> They might be able to safely ignore it, but in reality it should >>> be saying "I don't understand this". If the application *needs* to >>> use a flag like this, then it should be told that the filesystem is >>> not capable of doing what it was asked! >> >> That is where you are completely wrong! (-: Or rather you are wrong >> for my example, i.e. you are wrong/right depending on the type of >> flag in question. > > And that is the crux of the argument. > > My point is that *any* flag returns an error if the filesystem > does not support it. 
Yes and my point is that it should not do so as there are flags where it is not necessary. >> HSM_READ is definitely _NOT_ required because all >> it means is "if the file is OFFLINE, bring it ONLINE and then return >> the extent map". > > You've got the definition of HSM_READ wrong. If the flag is *not* > set, then we bring everything back online and return the full extent > map. Ah, sorry, I did indeed misunderstand what it was meant to mean. >>> OTOH if the application does not need to use the flag, then it >>> shouldn't be using it and we shouldn't be silently ignoring >>> incorrect usage of the provided API. >>> >>> What you are effectively saying about these "voluntary" flags >>> is that their behaviour is _undefined_. That is, if you use >>> these flags what you get on a successful call is undefined; >>> it may or may not contain what you asked for but you can't >>> tell if it really did what you want or returned the information >>> you asked for. >>> >>> This is a really bad semantic to encode into an API. >> >> That is your opinion. There is nothing undefined in the API at all. >> You just fail to understand it... > > FIEMAP returned success. Did it do what I asked? I don't > know because it's allowed to return success when it did ignored me. So what? > This is as silly an interface definition as saying you can > implement fsync() with { return 0; }. So, when fsync() succeeded > did it write my data to disk? I don't know; it's allowed to return > success when it ignored me. No it is not silly at all. There can be flags that fail but still the operation is a success. Example from admittedly unrelated area: when truncating a file to smaller size if the freeing of the allocated blocks fails it does not cause the truncate to fail, it just means some space is wasted/marked used when it is unused on the volume and running fsck fixes this. At least that is how I have implemented it for NTFS and I think this is the most sensible way to do it. 
The user does not care if some blocks could not be freed. All they care about is that the file is now truncated. The volume is then marked dirty, so running fsck/chkdsk will reclaim the lost space.

> It's crazy, isn't it? It makes writing applications portable
> across operating systems a real PITA (ask the MySQL folk ;)
> because POSIX really does allow fsync() to be implemented like this.
>
> I use this example because the "allow some filesystems to silently
> ignore flags they don't understand" is a portability problem for
> applications - rather than a cross-OS issue it is a cross-filesystem
> issue. That is, if different filesystems behave differently to
> the same request they will have to be handled specifically by
> the application. Every filesystem should behave in *exactly* the
> same way to the FIEMAP ioctls - if they don't support something
> they throw an error, if they do then they return the correct
> data.

It is only a problem if you do not choose wisely which flags may be ignored silently...

>>>> And vice versa, an application might specify some weird and
>>>> funky yet
>>>> to be developed feature that it expects the FS to perform and if
>>>> the
>>>> FS cannot do it (either because it does not support it or
>>>> because it
>>>> failed to perform the operation) the application expects the FS to
>>>> return an error and not to ignore the flag. An example could be
>>>> the
>>>> asked for FIEMAP_XATTR_FORK flag. If that is implemented, and
>>>> the FS
>>>> ignores it it will return the extent map for the file data
>>>> instead of
>>>> the XATTR_FORK! Not what the application wanted at all. Ouch! So
>>>> this is definitely a compulsory flag if I ever saw one.
>>>
>>> Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But
>>> we don't need a flag defined in the user visible API to tell us
>>> that we need to return an error here.
>>
>> Heh? What are you talking about? You need a flag to specify that you
>> want XATTR_FORK.
If not how the hell does the application specify >> that it wants XATTR_FORK instead of DATA_FORK (default)? Or are you >> of the opinion that FIEMAP should definitely not support XATTR_FORK. >> If the latter I fully agree. This should be a separate API with >> named streams and the FD of the named stream should be passed to >> FIEMAP without the silly XATTR_FORK flag... > > Ummmm - I think you misunderstood what I was saying. I was agreeing > with you that is a FS does not support FIEMAP_XATTR_FORK "the correct > answer is -EOPNOTSUPP or -EINVAL". > > What I was saying is that we don't need a COMPAT flag bit to tell > us the obvious error return if the filesystem does not support this > functionality.... But there is no COMPAT bit. I don't understand what you are saying... >>>> Also consider what I said above about different kernels. A new >>>> feature is implemented in kernel 2.8.13 say that was not there >>>> before >>>> and an application is updated to use that feature. There will be >>>> lots of instances where that application will still be run on older >>>> kernels where this feature does not exist. >>> >>> This is *exactly* where silently ignoring flags really falls down. >> >> It does not! >> >>> On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does >>> something and it returns different structure contents for the same >> >> No it does not. You do NOT understand at all what we are talking >> about do you?!? >> >> If a flag would do something weird like returning different data then >> OBVIOUSLY you would make this a mandatory flag and it will NOT be >> ignored! > > You've just successfully argued my case for me. No I have not at all. > By your reasoning, if we have voluntary flags 1, 2 and 3 and > filesystems A, B and C and filesystem A is the only filesystem to > implement 1, when B implements 1 bit must become a compulsory flag WHY? It does not at all. Flags CANNOT move from voluntary to compulsory. Read my argument again... 
> and hence C must now return an error despite being unchanged. Nope. > Likewise when C implement 3, 3 must become a comulsory flag and > A and B must now return an error despite being unchanged. Again no. > IOWs, whenever *any* filesystem implements a voluntary feature that > it didn't previously support, we have to make that a mandatory > feature and all other filesystems that don't support it now This is total crap. > must return an error. You're guaranteeing th application sees > changes in behaviour with this interface, not preventing. > > Can we simply mandate that filesystems return an error > to commands they don't support or don't understand and > drop this silly interface mutation thing? Can we simply not and drop this silly argument? Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 05:19:34 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 05:19:37 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l42CJWfB016412 for ; Wed, 2 May 2007 05:19:33 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HjDMP-0005ml-DC; Wed, 02 May 2007 12:51:05 +0100 Date: Wed, 2 May 2007 12:51:05 +0100 From: Christoph Hellwig To: Lachlan McIlroy Cc: xfs@oss.sgi.com, linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk Subject: Re: TAKE 963965 - Add lockdep support for XFS Message-ID: <20070502115105.GA21031@infradead.org> References: <20070427085045.D7C6E5910FF9@chook.melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070427085045.D7C6E5910FF9@chook.melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See 
http://www.infradead.org/rpr.html
X-archive-position: 11255
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: hch@infradead.org
Precedence: bulk
X-list: xfs

On Fri, Apr 27, 2007 at 06:50:45PM +1000, Lachlan McIlroy wrote:
> Add lockdep support for XFS

I don't think this is entirely correct, and it misses some of the most interesting cases. I've Cc'ed -fsdevel and Al to get some comments on the more tricky issues in the rename section at the end of the mail.

> Modid: xfs-linux-melb:xfs-kern:28485a
> fs/xfs/xfs_vnodeops.c - 1.695 - changed
> http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vnodeops.c.diff?r1=text&tr1=1.695&r2=text&tr2=1.694&f=h

The XFS_ILOCK_PARENT uses in xfs_create, xfs_mkdir and xfs_symlink look good.

xfs_lock_dir_and_entry should go away and just become an open-coded

    xfs_ilock(dp, XFS_ILOCK_EXCL | XFS_ILOCK_PARENT);
    xfs_ilock(ip, XFS_ILOCK_EXCL);

in the two callers, once we have made sure to have a sufficient locking protocol where we always lock the parent before the child.

xfs_lock_dir_and_entry can be totally removed and replaced with just the two ilock calls if we sort out the locking as proposed in this mail.

>
> fs/xfs/xfs_vfsops.c - 1.518 - changed
> http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vfsops.c.diff?r1=text&tr1=1.518&r2=text&tr2=1.517&f=h

This looks a bit odd to me - the rt inodes are not connected to the filesystem namespace, so the root inode can't really be their parent. Why are we locking the root inode so early? Is there a good reason we don't delay the locking until we're done with the rt inodes?
If not, the parent annotation is probably safe because we never lock
the rt inode at the same time as any other inode, but it at least needs
a big comment describing what's going on.

Now what seems to be completely lacking is any kind of annotation in
xfs_rename.c, which is the most difficult thing to get right for
inode locking because we may have to lock up to four inodes. I suggest
implementing the same locking protocol that the VFS uses for locking
i_mutex, as documented in Documentation/filesystems/directory-locking:

Also xfs_lock_inodes lacks any kind of annotation.

Let's start with the xfs_lock_inodes that don't fall into rename or
xfs_lock_dir_and_entry handled above:

 - xfs_swap_extents locks two inodes of the same type, but these
   could be directories, so there is a chance we can get into
   conflicts with the parent->child type locking
 - xfs_link locks the source inode and the target directory
   inode. The VFS locking rule is lock parent, lock source, and
   we should follow this as it's in line with the directory
   before child rule, except that the source doesn't always
   have to be a child, in which case we don't have a problem
   anyway

And now rename gets ugly; we should follow the VFS rules with
the following required adjustments:

 - XFS needs both source and target inode (if existing) locked.
   Because both must be non-directories, sorting by inode number
   should be okay
 - Doing a lock_rename equivalent for locking the parent directories
   requires dentries, but only inodes are passed down from the VFS.
   On the other hand they are obviously guaranteed to be directories,
   so i_dentry has exactly one dentry on which we can do the upwards
   walk. s_vfs_rename_mutex is already held by the VFS so we don't
   need to do that again.

I'd suggest having a copy of the directory-locking file with the XFS
adjustments somewhere so all this is actually well documented.

 - case for source directory == parent directory is trivial.
lock parent From owner-xfs@oss.sgi.com Wed May 2 05:53:18 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 05:53:21 -0700 (PDT) Received: from e5.ny.us.ibm.com (e5.ny.us.ibm.com [32.97.182.145]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l42CrGfB024777 for ; Wed, 2 May 2007 05:53:17 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e5.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l42CrBgT015874 for ; Wed, 2 May 2007 08:53:11 -0400 Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l42CrB9i554574 for ; Wed, 2 May 2007 08:53:11 -0400 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l42CrAET015347 for ; Wed, 2 May 2007 08:53:10 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av02.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l42Cr9Ww015185; Wed, 2 May 2007 08:53:09 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id B3BC793BC1; Wed, 2 May 2007 18:23:13 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l42CrCw4025574; Wed, 2 May 2007 18:23:12 +0530 Date: Wed, 2 May 2007 18:23:12 +0530 From: "Amit K. 
Arora" To: Chris Wedgwood Cc: David Chinner , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5] fallocate system call Message-ID: <20070502125312.GA5845@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070430004702.GM32602149@melbourne.sgi.com> <20070430052559.GA13145@tuatara.stupidest.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070430052559.GA13145@tuatara.stupidest.org> User-Agent: Mutt/1.4.1i X-archive-position: 11256 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote: > On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote: > > > For FA_ALLOCATE, it's supposed to change the file size if we > > allocate past EOF, right? > > I would argue no. Use truncate for that. The patch I posted for ext4 *does* change the filesize after preallocation, if required (i.e. when preallocation is after EOF). I may have to change that, if we decide on not doing this. 
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Wed May 2 06:12:02 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 06:12:04 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l42DBvfB029629 for ; Wed, 2 May 2007 06:12:00 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id XAA08194; Wed, 2 May 2007 23:11:48 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l42DBkAf82475833; Wed, 2 May 2007 23:11:47 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l42DBipP82488324; Wed, 2 May 2007 23:11:44 +1000 (AEST) Date: Wed, 2 May 2007 23:11:44 +1000 From: David Chinner To: Christoph Hellwig Cc: xfs@oss.sgi.com Subject: Missing TAKE 958522 (was Re: [Bug 756] New: File data corruption.....) Message-ID: <20070502131144.GZ77450368@melbourne.sgi.com> References: <200705012104.l41L4CI3029767@oss.sgi.com> <20070502105241.GA15399@lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070502105241.GA15399@lst.de> User-Agent: Mutt/1.4.2.1i X-archive-position: 11257 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 02, 2007 at 12:52:41PM +0200, Christoph Hellwig wrote: > > Note that a similar issue existed in the 2.6 SGI kernel up until it was resolved > > by this recent change: > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/linux-2.6-xfs/fs/xfs/linux-2.6/xfs_lrw.c.diff?r1=1.258;r2=1.259;f=h > > Seems like someone forgot to send TAKEs to the xfs list once again.. Hmmm - that was a bad one to miss considering the importance of the problem it fixes...... 
-----

TAKE 958522 - XFS has conflicting strategies between metadata and file data flushing

Fix to prevent the notorious 'NULL files' problem after a crash.

The problem that has been addressed is that of synchronising updates of
the file size with writes that extend a file. Without the fix the update
of a file's size, as a result of a write beyond eof, is independent of
when the cached data is flushed to disk. Often the file size update would
be written to the filesystem log before the data is flushed to disk. When
a system crashes between these two events and the filesystem log is
replayed on mount the file's size will be set but since the contents never
made it to disk the file is full of holes. If some of the cached data was
flushed to disk then it may just be a section of the file at the end that
has holes.

There are existing fixes to help alleviate this problem, particularly in
the case where a file has been truncated, that force cached data to be
flushed to disk when the file is closed. If the system crashes while the
file(s) are still open then this flushing will never occur.

The fix that we have implemented is to introduce a second file size,
called the in-memory file size, that represents the current file size as
viewed by the user. The existing file size, called the on-disk file size,
is the one that gets written to the filesystem log and we only update it
when it is safe to do so. When we write to a file beyond eof we only
update the in-memory file size in the write operation. Later, when the
I/O operation that flushes the cached data to disk completes, an I/O
completion routine will update the on-disk file size. The on-disk file
size will be updated to the maximum offset of the I/O or to the value of
the in-memory file size if the I/O includes eof.
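The size-update rule described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: `struct file_sizes`, `write_extend()` and `io_complete()` are hypothetical names made up for this sketch, not the XFS inode fields or functions touched by the checkin, and locking, truncation and logging are left out entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two sizes; not the real XFS inode. */
struct file_sizes {
    uint64_t in_memory; /* size as seen by the user, moved by write() */
    uint64_t on_disk;   /* size that gets logged, moved at I/O completion */
};

/* Write path: a write beyond eof only moves the in-memory size. */
static void write_extend(struct file_sizes *f, uint64_t offset, uint64_t len)
{
    uint64_t end = offset + len;

    if (end > f->in_memory)
        f->in_memory = end;
}

/* I/O completion: the flushed range is on disk, so the on-disk size may grow. */
static void io_complete(struct file_sizes *f, uint64_t offset, uint64_t len)
{
    uint64_t end = offset + len;

    /* Assumed invariant of this sketch: the logged size never runs ahead. */
    assert(f->on_disk <= f->in_memory);

    /* If the I/O includes eof, clamp to the in-memory size. */
    if (end > f->in_memory)
        end = f->in_memory;

    /* Otherwise use the maximum offset of the I/O; never shrink here. */
    if (end > f->on_disk)
        f->on_disk = end;
}
```

With this split, a crash between the write and the I/O completion leaves the logged size at its old value, so log replay cannot expose a size that covers data which never reached disk.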
Date: Fri Mar 30 02:24:06 AEST 2007 Workarea: vpn-emea-sw-emea-160-18.emea.sgi.com:/home/lachlan/isms/2.6.x-null Inspected by: dgc,tes The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28322a fs/xfs/xfsidbg.c - 1.312 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfsidbg.c.diff?r1=text&tr1=1.312&r2=text&tr2=1.311&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_vnodeops.c - 1.693 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vnodeops.c.diff?r1=text&tr1=1.693&r2=text&tr2=1.692&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_iocore.c - 1.52 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_iocore.c.diff?r1=text&tr1=1.52&r2=text&tr2=1.51&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_inode.c - 1.463 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_inode.c.diff?r1=text&tr1=1.463&r2=text&tr2=1.462&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_inode.h - 1.219 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_inode.h.diff?r1=text&tr1=1.219&r2=text&tr2=1.218&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_bmap.c - 1.367 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_bmap.c.diff?r1=text&tr1=1.367&r2=text&tr2=1.366&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_iomap.h - 1.10 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_iomap.h.diff?r1=text&tr1=1.10&r2=text&tr2=1.9&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_iomap.c - 1.52 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_iomap.c.diff?r1=text&tr1=1.52&r2=text&tr2=1.51&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. 
fs/xfs/linux-2.6/xfs_lrw.c - 1.259 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_lrw.c.diff?r1=text&tr1=1.259&r2=text&tr2=1.258&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/linux-2.6/xfs_aops.c - 1.142 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_aops.c.diff?r1=text&tr1=1.142&r2=text&tr2=1.141&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/dmapi/xfs_dm.c - 1.34 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/dmapi/xfs_dm.c.diff?r1=text&tr1=1.34&r2=text&tr2=1.33&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 23:45:17 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 23:45:20 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l436jDfB003835 for ; Wed, 2 May 2007 23:45:15 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA04360; Thu, 3 May 2007 16:45:03 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l436j2Af82987621; Thu, 3 May 2007 16:45:02 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l436ixY983041938; Thu, 3 May 2007 16:44:59 +1000 (AEST) Date: Thu, 3 May 2007 16:44:59 +1000 From: David Chinner To: Christoph Hellwig Cc: Lachlan McIlroy , xfs@oss.sgi.com, linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk Subject: Re: TAKE 963965 - Add lockdep support for XFS Message-ID: <20070503064459.GJ77450368@melbourne.sgi.com> References: <20070427085045.D7C6E5910FF9@chook.melbourne.sgi.com> <20070502115105.GA21031@infradead.org> Mime-Version: 1.0 
Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070502115105.GA21031@infradead.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11258 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 02, 2007 at 12:51:05PM +0100, Christoph Hellwig wrote: > On Fri, Apr 27, 2007 at 06:50:45PM +1000, Lachlan McIlroy wrote: > > Add lockdep support for XFS > > I don't think this is entirely correct, and it misses some of the > most interesting cases. Yeah, we decided it was better to get something out there that fixes the obvious and frequently reported false positives than hold it up on the hard stuff.... > I've Cc'ed -fsdevel and Al to get some comments on the more tricky > issues in the rename section at the end of the mail. There's several other tricky cases that we're not sure to handle as well - they are mainly due to *valid* lock inversions. i.e. we do "lock A, lock B" in most places, but in others we do "lock B, *trylock* A" to avoid deadlocks. I think the MOUNT_ILOCK/inode ilock is one of these pairs. > > > > Modid: xfs-linux-melb:xfs-kern:28485a > > fs/xfs/xfs_vnodeops.c - 1.695 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/> > xfs_vnodeops.c.diff?r1=text&tr1=1.695&r2=text&tr2=1.694&f=h > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vnodeops.c.diff?r1=text&tr1=1.695&r2=text&tr2=1.694&f=h > > The XFS_ILOCK_PARENT uses in xfs_create, xfs_mkdir and xfs_symlink look good. > > xfs_lock_dir_and_entry should go away and just become and opencoded > > xfs_ilock(dp, XFS_ILOCK_EXCL | XFS_ILOCK_PARENT); > xfs_ilock(ip, XFS_ILOCK_EXCL); > > in the two callers, once we made sure to have a sufficient locking > protocol where we always lock the parent before the child. 
>
> xfs_lock_dir_and_entry can be totally removed and replaced with just
> the two ilock calls if we sort out the locking as proposed in this
> mail.

I'm not sure it is that simple - we currently always group locking
of multiple inodes in increasing inode number order. I don't know
what deadlock that is protecting against.

There's also the case that we can't sleep on the ilock if the inode
is in the AIL while we hold the directory lock. Once again I'm not
sure what the deadlock is, but given we are now in a transaction it's
probably a tail-pushing deadlock that it is avoiding.

Without knowing for certain what these are avoiding, I don't think we
should be removing the code blindly....

> > fs/xfs/xfs_vfsops.c - 1.518 - changed
> > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vfsops.c.diff?r1=text&tr1=1.518&r2=text&tr2=1.517&f=h
>
> This looks a bit odd to me - the rt inodes are not connected to the
> filesystem namespace so the root inode can't really be its parent.
>
> Why are we locking the root inode so early. Is there a good reason we
> don't delay the locking until we're done with the rt inodes?

No idea - it's like that on irix too, and I don't have time right now
to discover why....

> If not the parent annotation is probably safe because we never lock
> the rt inode at the same time as any other inode, but it at least needs
> a big comment describing what's going on.
>
>
>
> Now what seems to be completely lacking is any kind of annotation in
> xfs_rename.c, which is the most difficult thing to get right for
> inode locking because we may have to lock up to four inodes. I suggest
> to implement the same locking protocol the VFS uses for locking
> i_mutex, as documented in Documentation/filesystems/directory-locking:
>
> Also xfs_lock_inodes lacks any kind of annotation.

It calls xfs_lock_inumorder() to set up the annotation.
The inode number in the set of inodes to be locked drives the lock subclass for nesting. Also xfs_rename locking ends up calling xfs_lock_inodes() and so it does get annotated. > Let's start with the xfs_lock_inodes that don't fall into rename or > xfs_lock_dir_and_entry handled above: > > > - xfs_swap_extents locks two inodes of the same type, but these > could be directories, so there is a chance we can get into > conflicts with the parent->child type locking Uses xfs_lock_inodes() so subclass nesting is used instead of parent/child. > - xfs_link locks the source inode and the target directory > inode. vfs locking rule is lock parent, lock source and > we should follow this as it's in line with the directory > before child rule except that the source doesn't always > have to be a child, in which case we don't have a problem > anyway It locks in inode number order as per xfs_lock_dir_and_entry() and uses xfs_lock_inodes() for annotation. > And now rename gets ugly, we should follow the VFS rules with > the following required adjustments: > > - XFS needs both source and target inode (if existing) locked. > Because both must be non-directories sorting by inode number > should be okay > - Doing a lock_rename equivalent for locking the parent directories > requires dentries, but only inodes are passed down from the VFS. > On the other hand they are obviously guranteed to be directories > so i_dentry has exactly one dentry on which we can do the upwards > walk. This is a lot of churn that I don't really see as necessary - why should we risk deadlocks and difficult to diagnose problems when the current code works and is now annotated? Cheers, Dave. 
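
The ordering rule discussed in this reply (lock multiple inodes in increasing inode number order, with the position in that order driving the lockdep subclass) can be sketched as follows. This is one reading of the description above, not the XFS code: `struct lock_req` and `order_by_inumber()` are hypothetical names, and the sketch only computes the order and subclasses; it takes no locks and does not model the trylock back-off cases.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical request: which inode to lock and the nesting subclass to use. */
struct lock_req {
    unsigned long long ino; /* inode number */
    int subclass;           /* lock nesting subclass, filled in below */
};

/* qsort comparator: ascending inode number, without overflow-prone subtraction. */
static int cmp_ino(const void *a, const void *b)
{
    const struct lock_req *x = a;
    const struct lock_req *y = b;

    return (x->ino > y->ino) - (x->ino < y->ino);
}

/*
 * Sort the set by inode number and give each entry a subclass equal to its
 * position, so every caller that locks the same set nests in the same order.
 */
static void order_by_inumber(struct lock_req *reqs, size_t n)
{
    size_t i;

    qsort(reqs, n, sizeof(*reqs), cmp_ino);
    for (i = 0; i < n; i++) {
        /* The same inode must not appear twice in one set. */
        assert(i == 0 || reqs[i - 1].ino < reqs[i].ino);
        reqs[i].subclass = (int)i;
    }
}
```

A caller would then take the locks by walking the array from index 0 upwards; because every path sorts the same way, two tasks locking overlapping sets cannot end up waiting on each other in opposite orders.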
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 3 00:49:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 00:49:17 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l437nCfB010685 for ; Thu, 3 May 2007 00:49:13 -0700 Received: from localhost.adilger.int (S0106000bdb95b39c.cg.shawcable.net [70.72.213.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id C2D3C4E456B; Thu, 3 May 2007 01:49:10 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 4864A406D; Thu, 3 May 2007 00:49:09 -0700 (PDT) Date: Thu, 3 May 2007 00:49:09 -0700 From: Andreas Dilger To: David Chinner Cc: Anton Altaparmakov , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070503074909.GA6220@schatzie.adilger.int> Mail-Followup-To: David Chinner , Anton Altaparmakov , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> <20070502105749.GY77450368@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070502105749.GY77450368@melbourne.sgi.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 
39F5 0D35 BED6 X-archive-position: 11259 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 02, 2007 20:57 +1000, David Chinner wrote: > On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: > > HSM_READ is definitely _NOT_ required because all > > it means is "if the file is OFFLINE, bring it ONLINE and then return > > the extent map". > > You've got the definition of HSM_READ wrong. If the flag is *not* > set, then we bring everything back online and return the full extent > map. > > Specifying the flag indicates that we do *not* want the offline > extents brought back online. i.e. it is a HSM or a datamover > (e.g. backup program) that is querying the extents and we want to > known *exactly* what the current state of the file is right now. > > So, if the HSM_READ flag is set, then the application is > expecting the filesytem to be part of a HSM. Hence if it's not, > it should return an error because somebody has done something wrong. In my original proposal I specifically pointed out that the FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour as the XFS_IOC_GETBMAPX BMV_IF_NO_DMAPI_READ flag. Data is retrieved from HSM only if the HSM_READ flag is set. That's why the flag is called "HSM_READ" instead of "HSM_NO_READ". The reason is that it seems bad if the default behaviour for calling ioctl(FIEMAP) would be to force retrieval of data from HSM, and this is only disabled by specifying a flag. It makes a lot more sense to just leave the data as it is and return the extent mapping by default (i.e. this is the principle of least surprise). It would probably be equally surprising and undesirable if the default behaviour was to force all data out to HSM. For that matter, I'm also beginning to wonder if the FLAG_HSM_READ should even be a part of this interface? 
I have no problem with returning a flag that reports if the data is
migrated to HSM and whether it is UNMAPPED.

Having FIEMAP force the retrieval of data from HSM strikes me as something
that should be a part of a separate HSM interface, which also needs to be
able to do things like push specific files or parts thereof out to HSM,
set the aging policy, and return information like "where does the HSM
file live" and "how many copies are there".

Do you know the reasoning behind including this into XFS_IOC_GETBMAPX?
Looking at the bmap.c comments it appears it is simply because the API
isn't able to return something like UNMAPPED|HSM_RESIDENT to indicate
there is data in HSM but it has no blocks allocated in the filesystem.
I don't think it makes the operation significantly more efficient than,
say, "ioctl(DMAPI_FORCE_READ); ioctl(FIEMAP)" if an application actually
needs the data to be present instead of just returning mapping info that
includes "UNMAPPED".

Cheers, Andreas
--
Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
From owner-xfs@oss.sgi.com Thu May 3 01:24:27 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 01:24:31 -0700 (PDT) Received: from ppsw-2.csi.cam.ac.uk (ppsw-2.csi.cam.ac.uk [131.111.8.132]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l438OPfB031233 for ; Thu, 3 May 2007 01:24:27 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49510) by ppsw-2.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.152]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjWb8-0005kD-8u (Exim 4.63) (return-path ); Thu, 03 May 2007 09:23:34 +0100 In-Reply-To: <20070503074909.GA6220@schatzie.adilger.int> References: <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> <20070502105749.GY77450368@melbourne.sgi.com> <20070503074909.GA6220@schatzie.adilger.int> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <13539C2E-16DA-4F86-9CBB-D16050EDDC44@cam.ac.uk> Cc: David Chinner , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Thu, 3 May 2007 09:23:33 +0100 To: Andreas Dilger X-Mailer: Apple Mail (2.752.3) X-archive-position: 11260 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 3 May 2007, at 
08:49, Andreas Dilger wrote: > On May 02, 2007 20:57 +1000, David Chinner wrote: >> On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: >>> HSM_READ is definitely _NOT_ required because all >>> it means is "if the file is OFFLINE, bring it ONLINE and then return >>> the extent map". >> >> You've got the definition of HSM_READ wrong. If the flag is *not* >> set, then we bring everything back online and return the full extent >> map. >> >> Specifying the flag indicates that we do *not* want the offline >> extents brought back online. i.e. it is a HSM or a datamover >> (e.g. backup program) that is querying the extents and we want to >> known *exactly* what the current state of the file is right now. >> >> So, if the HSM_READ flag is set, then the application is >> expecting the filesytem to be part of a HSM. Hence if it's not, >> it should return an error because somebody has done something wrong. > > In my original proposal I specifically pointed out that the > FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour as the > XFS_IOC_GETBMAPX > BMV_IF_NO_DMAPI_READ flag. Data is retrieved from HSM only if the > HSM_READ flag is set. That's why the flag is called "HSM_READ" > instead > of "HSM_NO_READ". Cool. I did not misunderstand after all then. (-: > The reason is that it seems bad if the default behaviour for calling > ioctl(FIEMAP) would be to force retrieval of data from HSM, and > this is > only disabled by specifying a flag. It makes a lot more sense to just > leave the data as it is and return the extent mapping by default (i.e. > this is the principle of least surprise). It would probably be > equally > surprising and undesirable if the default behaviour was to force all > data out to HSM. > > For that matter, I'm also beginning to wonder if the FLAG_HSM_READ > should > even be a part of this interface? I have no problem with returning a > flag that reports if the data is migrated to HSM and whether it is > UNMAPPED. 
> > Having FIEMAP force the retrieval of data from HSM strikes me as > something > that should be a part of a separate HSM interface, which also needs > to be > able to do things like push specific files or parts thereof out to > HSM, > set the aging policy, and return information like "where does the HSM > file live" and "how many copies are there". That would seem sensible to me also. Just like David argued that causing the data to be in a fixed location should be a separate interface rather than part of FIEMAP so by analogy the same should apply to touching HSM. Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Thu May 3 03:34:28 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 03:34:34 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l43AYRfB018100 for ; Thu, 3 May 2007 03:34:28 -0700 Received: from localhost.adilger.int (S0106000bdb95b39c.cg.shawcable.net [70.72.213.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 6152E7BA319; Thu, 3 May 2007 04:34:26 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 2F13B4153; Thu, 3 May 2007 03:34:25 -0700 (PDT) Date: Thu, 3 May 2007 03:34:25 -0700 From: Andreas Dilger To: "Amit K. Arora" Cc: Chris Wedgwood , David Chinner , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5] fallocate system call Message-ID: <20070503103425.GE6220@schatzie.adilger.int> Mail-Followup-To: "Amit K. 
Arora" , Chris Wedgwood , David Chinner , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070430004702.GM32602149@melbourne.sgi.com> <20070430052559.GA13145@tuatara.stupidest.org> <20070502125312.GA5845@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070502125312.GA5845@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11261 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 02, 2007 18:23 +0530, Amit K. Arora wrote: > On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote: > > On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote: > > > > > For FA_ALLOCATE, it's supposed to change the file size if we > > > allocate past EOF, right? > > > > I would argue no. Use truncate for that. > > The patch I posted for ext4 *does* change the filesize after > preallocation, if required (i.e. when preallocation is after EOF). > I may have to change that, if we decide on not doing this. I think I'd agree - it may be useful to allow preallocation beyond EOF for some kinds of applications (e.g. PVR preallocating live TV in 10 minute segments or something, but not knowing in advance how long the show will actually be recorded or the final encoded size). 
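
The two behaviours being debated here differ only in what happens to the file size when the preallocated range ends past EOF. A minimal sketch of that difference, assuming a hypothetical helper `size_after_prealloc()` and a hypothetical `extend_size` switch (neither is part of the posted fallocate patches):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the file size after preallocating [offset, offset + len).
 * extend_size != 0: the size grows to cover the preallocated range
 *                   (what the posted ext4 patch is described as doing).
 * extend_size == 0: blocks are reserved beyond EOF but the size is left
 *                   alone, and the caller uses truncate to change it.
 */
static uint64_t size_after_prealloc(uint64_t cur_size, uint64_t offset,
                                    uint64_t len, int extend_size)
{
    uint64_t end = offset + len;

    assert(end >= offset); /* reject a range that wraps around */

    if (extend_size && end > cur_size)
        return end;
    return cur_size;
}
```

In the PVR example, the recorder would preallocate each 10 minute segment with the size left alone, and only the data actually written would move EOF.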
Cheers, Andreas
--
Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.

From owner-xfs@oss.sgi.com Thu May 3 08:01:58 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 08:02:00 -0700 (PDT) Received: from smtp-ft6.fr.colt.net (smtp-ft6.fr.colt.net [213.41.78.198]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l43F1ufB005799 for ; Thu, 3 May 2007 08:01:57 -0700 Received: from harpe.intellique.com (host.93.124.68.195.rev.coltfrance.com [195.68.124.93]) by smtp-ft6.fr.colt.net (8.13.8/8.13.8/Debian-3) with ESMTP id l43EjJQV005258 for ; Thu, 3 May 2007 16:45:19 +0200 Date: Thu, 3 May 2007 16:45:21 +0200 From: Emmanuel Florac To: xfs@oss.sgi.com Subject: XFS crash on linux raid Message-ID: <20070503164521.16efe075@harpe.intellique.com> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 11262 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs

Hello,

Apparently quite a lot of people do encounter the same problem from
time to time, but I couldn't find any solution.

When writing quite a lot to the filesystem (heavy load on the
fileserver), the filesystem crashes when filled to 2.5~3TB (varies from
time to time). The filesystems tested were always running on a software
raid 0, with disabled barriers. I tend to think that disabled write
barriers are causing the crash but I'll do some more tests to be sure.

I've met this problem for the first time on 12/23 (yup... merry
christmas :) when a 13 TB filesystem went belly up :

Dec 23 01:38:10 storiq1 -- MARK --
Dec 23 01:58:10 storiq1 -- MARK --
Dec 23 02:10:29 storiq1 kernel: xfs_iunlink_remove: xfs_itobp() returned an error 990 on md0. Returning error.
Dec 23 02:10:29 storiq1 kernel: xfs_inactive:^Ixfs_ifree() returned an error = 990 on md0
Dec 23 02:10:29 storiq1 kernel: xfs_force_shutdown(md0,0x1) called from line 1763 of file fs/xfs/xfs_vnodeops.c. Return address = 0xc027f78b
Dec 23 02:38:11 storiq1 -- MARK --
Dec 23 02:58:11 storiq1 -- MARK --

When mounting, it did this :

Filesystem "md0": Disabling barriers, not supported by the underlying device
XFS mounting filesystem md0
Starting XFS recovery on filesystem: md0 (logdev: internal)
Filesystem "md0": xfs_inode_recover: Bad inode magic number, dino ptr = 0xf7196600, dino bp = 0xf718e980, ino = 119318
Filesystem "md0": XFS internal error xlog_recover_do_inode_trans(1) at line 2352 of file fs/xfs/xfs_log_recover.c. Caller 0xc025d180
xlog_recover_do_inode_trans+0x93d/0xa00
xlog_recover_do_trans+0x140/0x160
xfs_buf_delwri_queue+0x2b/0xb0
xlog_recover_do_trans+0x140/0x160
kmem_zalloc+0x1f/0x50
xlog_recover_commit_trans+0x3f/0x50
xlog_recover_process_data+0xea/0x240
xlog_do_recovery_pass+0x39a/0xb70
hrtimer_run_queues+0x29/0x110
xlog_do_log_recovery+0x96/0xd0
xlog_do_recover+0x3b/0x170
xlog_recover+0xdd/0xf0
xfs_log_mount+0xa1/0x110
xfs_mountfs+0x825/0xf30
xfs_fs_cmn_err+0x27/0x30
xfs_ioinit+0x27/0x50
xfs_mount+0x2ff/0x520
vfs_mount+0x43/0x50
xfs_fs_fill_super+0x9a/0x200
debug_mutex_add_waiter+0x3d/0xd0
snprintf+0x27/0x30
disk_name+0xb4/0xc0
sb_set_blocksize+0x1f/0x50
get_sb_bdev+0x106/0x150
xfs_fs_get_sb+0x30/0x40
xfs_fs_fill_super+0x0/0x200
do_kern_mount+0x5f/0xe0
do_new_mount+0x77/0xc0
do_mount+0x18d/0x1f0
take_cpu_down+0xb/0x20
copy_mount_options+0x63/0xc0
sys_mount+0x9f/0xe0
syscall_call+0x7/0xb
XFS: log mount/recovery failed: error 990
XFS: log mount failed

xfs_repair (too old a version...) hosed the filesystem and destroyed most of the 2.6TB of data. Yes, there was no backup; I wrote a recovery tool to restore the video data from the raw device, but that is a different story.

The system was running vanilla 2.6.17.9, and md0 was a stripe of 3 RAID-5 arrays on 3 3Ware-9550 cards, each hardware RAID-5 made of 8 750 GB drives.

On similar hardware with 2 3Ware-9550 16x750GB striped together, but running 2.6.17.13, I had a similar fs crash last week. Unfortunately I don't have the logs at hand, but we were able to reproduce the crash several times at home :

Filesystem "md0": XFS internal error xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c. Caller 0xc01fb282
xfs_btree_check_sblock+0x58/0xe0
xfs_alloc_lookup+0x142/0x400
xfs_alloc_lookup+0x142/0x400
kmem_zone_alloc+0x59/0xd0
xfs_btree_init_cursor+0x23/0x190
xfs_alloc_ag_vextent_near+0x54/0x9e0
xfs_bmap_add_extent+0x383/0x430
xfs_bmap_search_multi_extents+0x76/0xf0
xfs_alloc_ag_vextent+0x119/0x120
xfs_alloc_vextent+0x3db/0x4f0
xfs_bmap_btalloc+0x3ee/0x890
xfs_bmapi+0x1216/0x1690
xfs_dir2_grow_inode+0xf6/0x400
cache_alloc_refill+0xb6/0x1e0
xfs_idata_realloc+0x3b/0x130
xfs_dir2_sf_to_block+0xac/0x5d0
xfs_dir2_lookup+0x129/0x130
xfs_dir2_sf_addname+0x97/0x110
xfs_dir2_createname+0x144/0x150
xfs_trans_ijoin+0x2b/0x80
xfs_rename+0x354/0x9f0
xfs_access+0x3f/0x50
xfs_vn_rename+0x48/0xa0
__link_path_walk+0xc7c/0xc90
xfs_getattr+0x23f/0x2f0
mntput_no_expire+0x1b/0x80
cache_alloc_refill+0xb6/0x1e0
vfs_rename_other+0x96/0xd0
vfs_rename+0x258/0x2d0
do_rename+0x171/0x1a0
cache_grow+0x10b/0x160
cache_alloc_refill+0xb6/0x1e0
do_getname+0x4b/0x80
sys_renameat+0x47/0x80
sys_rename+0x28/0x30
syscall_call+0x7/0xb
Filesystem "md0": XFS internal error xfs_trans_cancel at line 1150 of file fs/xfs/xfs_trans.c.
Caller 0xc0245ec7
xfs_trans_cancel+0xd0/0x100
xfs_rename+0x6a7/0x9f0
xfs_rename+0x6a7/0x9f0
xfs_access+0x3f/0x50
xfs_vn_rename+0x48/0xa0
__link_path_walk+0xc7c/0xc90
xfs_getattr+0x23f/0x2f0
mntput_no_expire+0x1b/0x80
cache_alloc_refill+0xb6/0x1e0
vfs_rename_other+0x96/0xd0
vfs_rename+0x258/0x2d0
do_rename+0x171/0x1a0
cache_grow+0x10b/0x160
cache_alloc_refill+0xb6/0x1e0
do_getname+0x4b/0x80
sys_renameat+0x47/0x80
sys_rename+0x28/0x30
syscall_call+0x7/0xb
xfs_force_shutdown(md0,0x8) called from line 1151 of file fs/xfs/xfs_trans.c. Return address = 0xc025f7b9
Filesystem "md0": Corruption of in-memory data detected. Shutting down filesystem: md0
Please umount the filesystem, and rectify the problem(s)
xfs_force_shutdown(md0,0x1) called from line 338 of file fs/xfs/xfs_rw.c. Return address = 0xc025f7b9
xfs_force_shutdown(md0,0x1) called from line 338 of file fs/xfs/xfs_rw.c. Return address = 0xc025f7b9

After xfs_repair, the fs is fine. However, it crashes again after writing another couple of GBs of data. The same happens under 2.6.17.13, 2.6.17.13 SMP, 2.6.18.8, 2.6.16.36...

Out of curiosity, I've tried to use reiserfs (just to see how it compares regarding this). Reiserfs crashed before even writing 100MB!

So I tend to believe this is a "write barrier" problem and it looks really nasty!!! To sort this out I've started a test on a single 3Ware raid, without software raid.

Any idea on how to circumvent the problem to make software RAID/LVM usable?
-- ---------------------------------------- Emmanuel Florac | Intellique ---------------------------------------- From owner-xfs@oss.sgi.com Thu May 3 16:02:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 16:02:10 -0700 (PDT) Received: from smtp111.sbc.mail.mud.yahoo.com (smtp111.sbc.mail.mud.yahoo.com [68.142.198.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l43N25fB020477 for ; Thu, 3 May 2007 16:02:06 -0700 Received: (qmail 94083 invoked from network); 3 May 2007 23:02:04 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp111.sbc.mail.mud.yahoo.com with SMTP; 3 May 2007 23:02:03 -0000 X-YMail-OSG: ArKZSuYVM1kqn6qAuVrrwBMH7q78gcbdZ1PV.SHTJD7BztaEkuYJYhv3Ob5ff5ZJrgc4r7nNHw-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id B6EFB1827265; Thu, 3 May 2007 16:02:02 -0700 (PDT) Date: Thu, 3 May 2007 16:02:02 -0700 From: Chris Wedgwood To: Emmanuel Florac Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070503230202.GA12747@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503164521.16efe075@harpe.intellique.com> X-archive-position: 11263 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 04:45:21PM +0200, Emmanuel Florac wrote: > After xfs_repair, the fs is fine. However, it crashes again when > writing again a couple of GBs of data. It crashes again under > 2.6.17.13, 2.6.17.13 SMP, 2.6.18.8, 2.6.16.36... 4K stacks? > So I tend to believe this is a "write barrier" problem and it looks > really nasty!!! You could try "mount -o nobarrier ...." 
From owner-xfs@oss.sgi.com Thu May 3 17:59:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 17:59:40 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l440xXfB009201 for ; Thu, 3 May 2007 17:59:35 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA29867; Fri, 4 May 2007 10:59:25 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l440xNAf83828843; Fri, 4 May 2007 10:59:24 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l440xMeV83970284; Fri, 4 May 2007 10:59:22 +1000 (AEST) Date: Fri, 4 May 2007 10:59:22 +1000 From: David Chinner To: Emmanuel Florac Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504005922.GC32602149@melbourne.sgi.com> References: <20070503164521.16efe075@harpe.intellique.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503164521.16efe075@harpe.intellique.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11264 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 04:45:21PM +0200, Emmanuel Florac wrote: > > Hello, > Apparently quite a lot of people do encounter the same problem from > time to time, but I couldn't find any solution. > > When writing quite a lot to the filesystem (heavy load on the > fileserver), the filesystem crashes when filled at 2.5~3TB (varies from > time to time). The filesystems tested where always running on a software > raid 0, with disabled barriers. 
I tend to think that disabled write
> barriers are causing the crash but I'll do some more tests to get sure.
>
> I've met this problem for the first time on 12/23 (yup... merry
> christmas :) when a 13 TB filesystem went belly up :
>
> Dec 23 01:38:10 storiq1 -- MARK --
> Dec 23 01:58:10 storiq1 -- MARK --
> Dec 23 02:10:29 storiq1 kernel: xfs_iunlink_remove: xfs_itobp()
> returned an error 990 on md0. Returning error.
> Dec 23 02:10:29 storiq1 kernel: xfs_inactive:^Ixfs_ifree() returned an
> error = 990 on md0
> Dec 23 02:10:29 storiq1 kernel: xfs_force_shutdown(md0,0x1) called from
> line 1763 of file fs/xfs/xfs_vnodeops.c. Return address = 0xc027f78b
> Dec 23 02:38:11 storiq1 -- MARK --
> Dec 23 02:58:11 storiq1 -- MARK --

So, while trying to remove an inode, a corruption was found on disk and the filesystem was shut down. Were there any I/O errors reported before the shutdown?

> When mounting, it did that :
>
> Filesystem "md0": Disabling barriers, not supported by the underlying
> device XFS mounting filesystem md0
> Starting XFS recovery on filesystem: md0 (logdev: internal)
> Filesystem "md0": xfs_inode_recover: Bad inode magic number, dino ptr =
> 0xf7196600, dino bp = 0xf718e980, ino = 119318 Filesystem "md0": XFS

Which was found again during log recovery.

> The system was running vanilla 2.6.17.9, and md0 was made of 3 striped
> RAID-5 on 3 3Ware-9550 cards, each hardware RAID-5 made of 8 750 GB
> drives.
>
> On a similar hardware with 2 3Ware-9550 16x750GB striped together, but
> running 2.6.17.13, I had a similar fs crash last week. Unfortunately I
> don't have the logs at hand, but we where able to reproduce several
> times the crash at home :

Hmm - 750GB drives are brand new. I wouldn't rule out media issues at this point...

> Filesystem "md0": XFS internal error xfs_btree_check_sblock at line 336
> of file fs/xfs/xfs_btree.c. Caller 0xc01fb282

Memory corruption?

> line 1151 of file fs/xfs/xfs_trans.c.
Return address = 0xc025f7b9 > Filesystem "md0": Corruption of in-memory data detected. Shutting down > filesystem: md0 Please umount the filesystem, and rectify the > problem(s) xfs_force_shutdown(md0,0x1) called from line 338 of file > fs/xfs/xfs_rw.c. Return address = 0xc025f7b9 > xfs_force_shutdown(md0,0x1) called from line 338 of file > fs/xfs/xfs_rw.c. Return address = 0xc025f7b9 > > After xfs_repair, the fs is fine. However, it crashes again when > writing again a couple of GBs of data. It crashes again under 2.6.17.13, > 2.6.17.13 SMP, 2.6.18.8, 2.6.16.36... > > Out of curiosity, I've tried to use reiserfs (just to see how it > compares regarding this). Reiserfs crashed before even writing 100MB! That indicates there's something wrong other than the filesystem. I'd suggest making sure your raid arrays, memory, etc are all functioning correctly first. What platform are you running on? Are you running ia32 with 4k stacks? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 3 19:46:02 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 19:46:05 -0700 (PDT) Received: from mailsecure1.itc.griffith.edu.au (mailsecure1-out.itc.griffith.edu.au [132.234.242.61]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l442jwfB031706 for ; Thu, 3 May 2007 19:46:01 -0700 Received: from mailsecure1.itc.griffith.edu.au (unknown [127.0.0.1]) by mailsecure1.itc.griffith.edu.au (Symantec Mail Security) with ESMTP id 04449286 for ; Fri, 4 May 2007 12:45:57 +1000 (EST) X-AuditID: 84eaf23c-af2f2bb000004912-c9-463a9e64a23b Received: from nox-1.itc.griffith.edu.au (sc2bigip02-242.nms.griffith.edu.au [132.234.242.254]) by mailsecure1.itc.griffith.edu.au (Symantec Mail Security) with ESMTP id 4AF7730187 for ; Fri, 4 May 2007 12:45:56 +1000 (EST) Received: from [132.234.242.254] (helo=studentemail.griffith.edu.au) by nox-1.itc.griffith.edu.au with esmtp (Exim 4.63) (envelope-from ) id 1Hjnnw-0006gz-52 
for xfs@oss.sgi.com; Fri, 04 May 2007 12:45:56 +1000 Received: from ss64.me.griffith.edu.au ([132.234.103.168]) by studentemail.griffith.edu.au (Sun Java System Messaging Server 6.2-6.01 (built Apr 3 2006)) with ESMTPA id <0JHH002HWX0KTM40@studentemail.griffith.edu.au> for xfs@oss.sgi.com; Fri, 04 May 2007 12:45:56 +1000 (EST) Date: Fri, 04 May 2007 12:45:55 +1000 From: Stephen So Subject: Re: Slow performance when extracting tarballs In-reply-to: <20070430213538.GA30809@tuatara.stupidest.org> To: xfs@oss.sgi.com Message-id: <463A9E63.7010007@griffith.edu.au> Organization: Griffith School of Engineering, Griffith University, Australia MIME-version: 1.0 Content-type: text/plain; charset=ISO-8859-1 Content-transfer-encoding: 7BIT X-Enigmail-Version: 0.95.0 References: <4635DAA4.4070402@griffith.edu.au> <20070430213538.GA30809@tuatara.stupidest.org> User-Agent: Thunderbird 2.0.0.0 (X11/20070326) X-Brightmail-Tracker: AAAAAA== X-archive-position: 11265 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: S.So@griffith.edu.au Precedence: bulk X-list: xfs Hi, thanks for the reply xfs-bounce@oss.sgi.com wrote: > what does "vmstat 1" look like during this? 
> I did a vmstat 1 and this is the output: % vmstat 1 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ r b swpd free buff cache si so bi bo in cs us sy id wa st 0 0 0 1002716 3540 745316 0 0 59 15 559 560 1 2 96 1 0 0 0 0 1002700 3540 745316 0 0 0 12 1091 1543 2 2 97 0 0 1 0 0 995464 3540 750300 0 0 2060 401 1134 2569 18 3 76 4 0 2 0 0 980884 3540 762652 0 0 3712 1376 1238 4850 43 7 43 8 0 1 0 0 968368 3540 776152 0 0 3968 1568 1224 5155 43 7 44 7 0 2 0 0 954660 3540 787264 0 0 3584 1344 1244 4542 38 6 45 11 0 1 0 0 942668 3540 797556 0 0 2944 1431 1224 4376 36 6 48 11 0 1 0 0 932852 3540 807304 0 0 3072 1312 1229 4164 33 6 46 15 0 3 0 0 922724 3540 817912 0 0 3072 1440 1215 4378 37 7 44 12 0 0 1 0 911612 3540 828552 0 0 3328 1568 1242 4558 37 5 46 12 0 1 0 0 900804 3540 839140 0 0 3072 1568 1222 4279 36 5 45 13 0 0 0 0 887824 3540 848788 0 0 3072 1427 1250 3862 35 5 46 14 0 1 0 0 880036 3540 857700 0 0 2560 1529 1229 3775 31 7 47 16 0 1 0 0 867552 3540 867548 0 0 3072 1632 1250 4035 36 5 46 14 0 0 1 0 859156 3540 877576 0 0 2944 1696 1239 4291 33 6 45 16 0 1 0 0 852904 3540 883628 0 0 1664 5403 1229 3111 23 4 48 25 0 0 1 0 846328 3540 888188 0 0 1536 5300 1188 2622 21 6 61 12 0 0 1 0 842076 3540 892752 0 0 1280 5383 1232 2478 21 5 62 12 0 1 1 0 837312 3540 897396 0 0 1408 5330 1211 2476 20 5 53 24 0 6 1 0 828876 3540 903572 0 0 1920 5771 1245 2904 24 5 46 25 0 1 0 0 822016 3540 912304 0 0 2304 1203 1216 3897 30 7 55 7 0 0 1 0 818404 3540 915628 0 0 1024 9446 1181 2028 14 5 63 17 0 0 1 0 809552 3540 923336 0 0 2432 1109 1228 3344 28 5 46 22 0 1 0 0 801124 3540 928892 0 0 1664 9195 1201 2821 22 6 59 13 0 0 0 0 794364 3540 935364 0 0 1792 5296 1218 3052 24 6 52 18 0 2 1 0 789784 3540 941564 0 0 2048 4992 1194 3116 23 4 51 23 0 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ r b swpd free buff cache si so bi bo in cs us sy id wa st 1 0 0 781540 3540 947180 0 0 1536 6434 1226 2942 23 6 50 22 0 4 0 0 
777300 3540 953628 0 0 1920 1088 1200 2970 25 5 56 14 0 0 1 0 772892 3540 957032 0 0 1152 9440 1201 2141 17 4 59 21 0 1 0 0 764432 3540 964572 0 0 2304 1253 1216 3198 29 4 46 22 0 2 0 0 756516 3540 970284 0 0 1664 9720 1222 2832 22 5 57 17 0 1 1 0 750880 3540 977204 0 0 2176 1100 1207 2973 25 5 49 20 0 0 0 0 745424 3540 980768 0 0 1024 9140 1200 2205 16 4 66 14 0 0 1 0 741928 3540 986200 0 0 1664 1376 1193 2746 20 5 61 15 0 0 1 0 734536 3540 992480 0 0 1920 5516 1226 2874 24 5 57 14 0 0 1 0 729072 3540 997168 0 0 1408 5328 1199 2473 21 5 62 13 0 0 1 0 723228 3540 1003288 0 0 1792 5509 1243 2959 24 6 54 15 0 2 0 0 717948 3540 1007752 0 0 1408 5308 1196 2418 20 4 59 18 0 4 0 0 709940 3540 1013564 0 0 1536 5568 1217 3145 25 4 55 16 0 0 0 0 701132 3540 1021948 0 0 2816 5612 1224 3562 32 6 47 16 0 0 1 0 702448 3540 1023140 0 0 256 5108 1203 1538 6 5 73 15 0 0 1 0 691688 3540 1032264 0 0 2688 1852 1239 3630 32 5 45 18 0 0 1 0 688292 3540 1034228 0 0 768 9348 1198 1671 10 3 60 27 0 1 0 0 682636 3540 1039248 0 0 1408 1069 1198 2729 20 5 47 29 0 1 0 0 676848 3540 1044456 0 0 1408 5704 1234 2897 20 5 59 16 0 1 0 0 672460 3540 1049428 0 0 1536 5484 1215 2813 19 5 55 22 0 1 0 0 663820 3540 1056108 0 0 2176 5258 1241 3245 27 5 49 20 0 1 0 0 660064 3540 1061708 0 0 1664 1688 1222 3100 22 6 60 11 0 0 0 0 653400 3540 1065924 0 0 1152 5496 1221 2495 17 4 51 28 0 0 1 0 651468 3540 1069324 0 0 1152 5278 1187 2157 16 3 67 14 0 2 0 0 645132 3540 1073620 0 0 1152 5466 1221 2714 19 5 61 17 0 3 0 0 640544 3540 1078720 0 0 1664 5587 1219 2830 21 6 51 21 0 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ r b swpd free buff cache si so bi bo in cs us sy id wa st 1 0 0 634040 3540 1083872 0 0 1536 5223 1208 2996 20 3 64 14 0 0 1 0 629024 3540 1090772 0 0 2048 5342 1199 3141 26 5 49 20 0 0 0 0 621116 3540 1095840 0 0 1664 4410 1211 2631 22 4 52 22 0 0 0 0 615760 3540 1100840 0 0 1408 6032 1186 2601 20 6 61 14 0 0 0 0 608852 3540 1107448 0 0 1920 1192 1215 3228 
24 6 50 21 0 0 1 0 605872 3540 1112248 0 0 1536 5424 1220 2779 22 4 63 12 0 0 0 0 598016 3540 1117476 0 0 1536 5603 1227 3016 23 4 53 21 0 2 0 0 592416 3540 1122576 0 0 1536 5407 1217 2671 22 7 56 16 0 0 0 0 587504 3540 1127404 0 0 1408 4624 1230 2599 19 5 55 21 0 2 1 0 585800 3540 1130704 0 0 1152 1880 1175 2431 15 2 53 30 0 0 1 0 582732 3540 1133696 0 0 896 5293 1210 2357 16 4 74 6 0 2 0 0 575528 3540 1138696 0 0 1536 5424 1214 2585 22 5 48 26 0 1 0 0 569992 3540 1145872 0 0 2176 1519 1245 3267 27 5 50 17 0 0 0 0 563568 3540 1149772 0 0 1152 8164 1189 2364 15 4 74 6 0 0 0 0 559936 3540 1153020 0 0 896 2483 1198 2145 16 6 64 15 0 1 0 0 556504 3540 1156720 0 0 1408 5248 1206 2152 17 6 62 14 0 0 1 0 553568 3540 1161280 0 0 1280 5716 1231 2620 19 4 59 18 0 1 0 0 544820 3540 1167580 0 0 2048 1545 1234 2947 26 5 51 18 0 1 0 0 541096 3540 1170748 0 0 1024 5272 1205 2107 17 3 73 8 0 0 1 0 535092 3540 1176848 0 0 1792 6132 1225 2861 25 6 49 19 0 0 1 0 531696 3540 1181220 0 0 1280 969 1215 2758 18 3 66 14 0 0 1 0 528920 3540 1184220 0 0 896 5268 1192 2248 16 4 71 10 0 0 0 0 520532 3540 1189884 0 0 1664 5425 1252 3008 21 4 64 10 0 0 0 0 514012 3540 1196084 0 0 1920 1920 1214 3110 25 6 60 10 0 1 0 0 511804 3540 1199608 0 0 1152 5336 1240 2224 20 6 60 15 0 0 0 0 503212 3540 1206108 0 0 2048 6516 1227 2963 26 4 58 12 0 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ r b swpd free buff cache si so bi bo in cs us sy id wa st 2 0 0 500968 3540 1208380 0 0 512 4684 1214 2066 13 5 79 3 0 2 0 0 496216 3540 1212580 0 0 1408 5727 1214 2399 21 7 65 8 0 4 0 0 491268 3540 1217184 0 0 1408 4304 1243 2593 21 5 64 11 0 2 0 0 488856 3540 1219784 0 0 896 2058 1189 1849 15 4 70 11 0 0 1 0 483660 3540 1224340 0 0 1408 5824 1240 2571 23 4 52 22 0 0 0 0 477704 3540 1229940 0 0 1536 5170 1173 2855 21 6 52 21 0 0 1 0 474500 3540 1234952 0 0 1536 5163 1212 2629 20 3 55 23 0 1 0 0 465196 3540 1242552 0 0 2304 940 1204 3265 28 4 47 22 0 0 0 0 458280 3540 1247892 0 0 
1664 9382 1211 2719 19 6 70 4 0 1 0 0 453276 3540 1252592 0 0 1408 5040 1176 2827 19 5 58 18 0 0 0 0 446840 3540 1258496 0 0 1792 5676 1221 3025 24 5 56 14 0 1 0 0 443180 3540 1264096 0 0 1664 932 1193 2680 21 5 56 19 0 1 0 0 435748 3540 1269060 0 0 1664 5182 1209 2635 21 4 49 27 0 0 1 0 432060 3540 1274860 0 0 1536 5376 1183 2860 21 7 51 20 0 0 1 0 426376 3540 1279492 0 0 1408 5177 1214 2480 19 3 51 27 0 0 1 0 422356 3540 1283992 0 0 1280 5256 1196 2516 18 4 55 23 0 0 0 0 410112 3540 1292848 0 0 2560 1916 1254 3839 32 5 49 14 0 1 0 0 407296 3540 1295448 0 0 896 8244 1203 1816 15 6 74 6 0 1 0 0 405256 3540 1297456 0 0 384 2276 1192 1729 10 5 78 7 0 1 0 0 401044 3540 1303756 0 0 1920 2004 1260 2779 29 5 56 11 0 1 0 0 397668 3540 1306976 0 0 1024 5432 1229 2264 18 4 68 9 0 1 0 0 393720 3540 1310076 0 0 1024 5520 1219 1983 17 6 66 11 0 1 0 0 384148 3540 1316896 0 0 2048 2224 1279 3279 33 5 54 9 0 1 0 0 384716 3540 1318996 0 0 336 5291 1194 2084 12 3 74 13 0 0 0 0 384716 3540 1319252 0 0 0 149 1115 1467 1 1 98 0 0 0 0 0 384716 3540 1319252 0 0 0 92 1065 1075 2 3 95 0 0 > have you also tried setting (increasing) logbsize? (i think you need > > v2 logs to make that work) > I read in the man page for mount that the max logbsize is 32K and the default value for machines with more than 32 MB of memory is 32768, so I assumed it was already set to maximum. Best regards, Steve. -- __________________________________________________ Dr Stephen So, PhD, MIEEE Griffith School of Engineering & Institute for Integrated and Intelligent Systems Griffith University, Gold Coast Campus PMB 50 Gold Coast Mail Centre Gold Coast, QLD, 9726, Australia. 
E-mail: s.so@griffith.edu.au Phone: +61 7 5552 8663 Fax: +61 7 5552 8065 __________________________________________________ From owner-xfs@oss.sgi.com Thu May 3 21:30:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 21:30:17 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l444UDfB026817 for ; Thu, 3 May 2007 21:30:14 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l444U3vS017825 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 21:30:04 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l444U2BH028973; Thu, 3 May 2007 21:30:02 -0700 Date: Thu, 3 May 2007 21:30:02 -0700 From: Andrew Morton To: "Amit K. Arora" Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 3/5] ext4: Extent overlap bugfix Message-Id: <20070503213002.eff696db.akpm@linux-foundation.org> In-Reply-To: <20070426181101.GC7209@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181101.GC7209@amitarora.in.ibm.com> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit 
X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11267 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Thu, 26 Apr 2007 23:41:01 +0530 "Amit K. Arora" wrote: > +unsigned int ext4_ext_check_overlap(struct inode *inode, > + struct ext4_extent *newext, > + struct ext4_ext_path *path) > +{ > + unsigned long b1, b2; > + unsigned int depth, len1; > + > + b1 = le32_to_cpu(newext->ee_block); > + len1 = le16_to_cpu(newext->ee_len); > + depth = ext_depth(inode); > + if (!path[depth].p_ext) > + goto out; > + b2 = le32_to_cpu(path[depth].p_ext->ee_block); > + > + /* get the next allocated block if the extent in the path > + * is before the requested block(s) */ > + if (b2 < b1) { > + b2 = ext4_ext_next_allocated_block(path); > + if (b2 == EXT_MAX_BLOCK) > + goto out; > + } > + > + if (b1 + len1 > b2) { Are we sure that b1+len cannot wrap through zero here? > + newext->ee_len = cpu_to_le16(b2 - b1); > + return 1; > + } From owner-xfs@oss.sgi.com Thu May 3 21:30:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 21:30:11 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l444U6fB026766 for ; Thu, 3 May 2007 21:30:06 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l444Tu2f017820 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 21:29:57 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l444TtUT028928; Thu, 3 May 2007 21:29:55 -0700 Date: Thu, 3 May 2007 21:29:55 -0700 From: Andrew Morton To: "Amit K. 
Arora" Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-Id: <20070503212955.b1b6443c.akpm@linux-foundation.org> In-Reply-To: <20070426180332.GA7209@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11266 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > This patch implements the fallocate() system call and adds support for > i386, x86_64 and powerpc. > > ... > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) Please add a comment over this function which specifies its behaviour. Really it should be enough material from which a full manpage can be written. If that's all too much, this material should at least be spelled out in the changelog. Because there's no way in which this change can be fully reviewed unless someone (ie: you) tells us what it is setting out to achieve. 
If we 100% implement some standard then a URL for what we claim to implement would suffice. Given that we're at least using different types from posix I doubt if such a thing would be sufficient. And given the complexity and potential variability within the filesystem implementations of this, I'd expect that _something_ additional needs to be said?

> +{
> +	struct file *file;
> +	struct inode *inode;
> +	long ret = -EINVAL;
> +
> +	if (len == 0 || offset < 0)
> +		goto out;

The posix spec implies that negative `len' is permitted - presumably "allocate ahead of `offset'". How peculiar.

> +	ret = -EBADF;
> +	file = fget(fd);
> +	if (!file)
> +		goto out;
> +	if (!(file->f_mode & FMODE_WRITE))
> +		goto out_fput;
> +
> +	inode = file->f_path.dentry->d_inode;
> +
> +	ret = -ESPIPE;
> +	if (S_ISFIFO(inode->i_mode))
> +		goto out_fput;
> +
> +	ret = -ENODEV;
> +	if (!S_ISREG(inode->i_mode))
> +		goto out_fput;

So we return ENODEV against an S_ISBLK fd, as per the posix spec. That seems a bit silly of them.

> +	ret = -EFBIG;
> +	if (offset + len > inode->i_sb->s_maxbytes)
> +		goto out_fput;

This code does handle offset+len going negative, but only by accident, I suspect. It happens that s_maxbytes has unsigned type. Perhaps a comment here would settle the reader's mind.

> +	if (inode->i_op && inode->i_op->fallocate)
> +		ret = inode->i_op->fallocate(inode, mode, offset, len);
> +	else
> +		ret = -ENOSYS;

If we _are_ going to support negative `len', as posix suggests, I think we should perform the appropriate sanity conversions to `offset' and `len' right here, rather than expecting each filesystem to do it.

If we're not going to handle negative `len' then we should check for it.

> +out_fput:
> +	fput(file);
> +out:
> +	return ret;
> +}
> +EXPORT_SYMBOL(sys_fallocate);

I don't believe this needs to be exported to modules?

> +/*
> + * fallocate() modes
> + */
> +#define FA_ALLOCATE	0x1
> +#define FA_DEALLOCATE	0x2

Now those aren't in posix.
They should be documented, along with their expected semantics. > #ifdef __KERNEL__ > > #include > @@ -1125,6 +1131,7 @@ struct inode_operations { > ssize_t (*listxattr) (struct dentry *, char *, size_t); > int (*removexattr) (struct dentry *, const char *); > void (*truncate_range)(struct inode *, loff_t, loff_t); > + long (*fallocate)(struct inode *, int, loff_t, loff_t); I really do think it's better to put the variable names in definitions such as this. Especially when we have two identically-typed variables next to each other like that. Quick: which one is the offset and which is the length? From owner-xfs@oss.sgi.com Thu May 3 21:31:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 21:31:48 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l444VhfB027433 for ; Thu, 3 May 2007 21:31:44 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l444VY8K017921 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 21:31:35 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l444VXbq029006; Thu, 3 May 2007 21:31:33 -0700 Date: Thu, 3 May 2007 21:31:33 -0700 From: Andrew Morton To: "Amit K. 
Arora" Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-Id: <20070503213133.d1559f52.akpm@linux-foundation.org> In-Reply-To: <20070426181332.GD7209@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11268 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" wrote: > This patch has the ext4 implemtation of fallocate system call. > > ... > > + /* ext4_can_extents_be_merged should have checked that either > + * both extents are uninitialized, or both aren't. Thus we > + * need to check only one of them here. > + */ Please always format multiline comments like this: /* * ext4_can_extents_be_merged should have checked that either * both extents are uninitialized, or both aren't. Thus we * need to check only one of them here. */ > ... > > +/* > + * ext4_fallocate: > + * preallocate space for a file > + * mode is for future use, e.g. 
for unallocating preallocated blocks etc. > + */ This description is rather thin. What is the filesystem's actual behaviour here? If the file is using extents then the implementation will do . If the file is using bitmaps then we will do . But what? Here is where it should be described. > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) > +{ > + handle_t *handle; > + ext4_fsblk_t block, max_blocks; > + int ret, ret2, nblocks = 0, retries = 0; > + struct buffer_head map_bh; > + unsigned int credits, blkbits = inode->i_blkbits; > + > + /* Currently supporting (pre)allocate mode _only_ */ > + if (mode != FA_ALLOCATE) > + return -EOPNOTSUPP; > + > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > + return -ENOTTY; So we don't implement fallocate on bitmap-based files! Well that's huge news. The changelog would be an appropriate place to communicate this, along with reasons why, or a description of the plan to fix it. Also, posix says nothing about fallocate() returning ENOTTY. > + block = offset >> blkbits; > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > + - block; > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); Now I'm mystified. Given that we're allocating an arbitrary amount of disk space, and that this disk space will require an arbitrary amount of metadata, how can we work out how much journal space we'll be needing without at least looking at `len'? 
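To make the dependence on `len' concrete, here is a minimal userspace sketch of the block arithmetic in the quoted hunk. It is not the ext4 code: sketch_block_align() and sketch_blocks_in_range() are hypothetical names, and it assumes offset + len does not overflow and that blkbits is smaller than 64. The block count it returns grows with the requested range, which is why a credit estimate that never looks at `len' seems suspicious.

```c
#include <stdint.h>

/* Round size up to the next multiple of the block size (1 << blkbits). */
static inline uint64_t sketch_block_align(uint64_t size, unsigned int blkbits)
{
	uint64_t mask = ((uint64_t)1 << blkbits) - 1;

	return (size + mask) & ~mask;
}

/*
 * Number of filesystem blocks touched by the byte range
 * [offset, offset + len).  Assumption: offset + len does not overflow.
 */
static inline uint64_t sketch_blocks_in_range(uint64_t offset, uint64_t len,
					      unsigned int blkbits)
{
	uint64_t first = offset >> blkbits;
	uint64_t end = sketch_block_align(offset + len, blkbits) >> blkbits;

	return end - first;
}
```

With 4096-byte blocks (blkbits = 12), a two-byte request that straddles a block boundary needs two blocks, while a 1 GiB request needs 262144.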
> + handle=ext4_journal_start(inode, credits + Please always put spaces around "=" > + EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1); And around "+" > + if (IS_ERR(handle)) > + return PTR_ERR(handle); > +retry: > + ret = 0; > + while (ret >= 0 && ret < max_blocks) { > + block = block + ret; > + max_blocks = max_blocks - ret; > + ret = ext4_ext_get_blocks(handle, inode, block, > + max_blocks, &map_bh, > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > + BUG_ON(!ret); BUG_ON is vicious. Is it really justified here? Possibly a WARN_ON and ext4_error() would be safer and more useful here. > + if (ret > 0 && test_bit(BH_New, &map_bh.b_state) Use buffer_new() here. A separate patch which fixes the three existing instances of open-coded BH_foo usage would be appreciated. > + && ((block + ret) > (i_size_read(inode) << blkbits))) Check for wrap through the sign bit and through zero please. > + nblocks = nblocks + ret; > + } > + > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) > + goto retry; > + > + /* Time to update the file size. > + * Update only when preallocation was requested beyond the file size. > + */ Fix comment layout. > + if ((offset + len) > i_size_read(inode)) { Both the lhs and the rhs here are signed. Please review for possible overflows through the sign bit and through zero. Perhaps a comment explaining why it's correct would be appropriate. > + if (ret > 0) { > + /* if no error, we assume preallocation succeeded completely */ > + mutex_lock(&inode->i_mutex); > + i_size_write(inode, offset + len); > + EXT4_I(inode)->i_disksize = i_size_read(inode); > + mutex_unlock(&inode->i_mutex); > + } else if (ret < 0 && nblocks) { > + /* Handle partial allocation scenario */ The above two comments should be indented one additional tabstop.
> + loff_t newsize; > + mutex_lock(&inode->i_mutex); > + newsize = (nblocks << blkbits) + i_size_read(inode); > + i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits)); > + EXT4_I(inode)->i_disksize = i_size_read(inode); > + mutex_unlock(&inode->i_mutex); > + } > + } > + ext4_mark_inode_dirty(handle, inode); > + ret2 = ext4_journal_stop(handle); > + if (ret > 0) > + ret = ret2; > + > + return ret > 0 ? 0 : ret; > +} > + > EXPORT_SYMBOL(ext4_mark_inode_dirty); > EXPORT_SYMBOL(ext4_ext_invalidate_cache); > EXPORT_SYMBOL(ext4_ext_insert_extent); > EXPORT_SYMBOL(ext4_ext_walk_space); > EXPORT_SYMBOL(ext4_ext_find_goal); > EXPORT_SYMBOL(ext4_ext_calc_credits_for_insert); > +EXPORT_SYMBOL(ext4_fallocate); > > Index: linux-2.6.21/fs/ext4/file.c > =================================================================== > --- linux-2.6.21.orig/fs/ext4/file.c > +++ linux-2.6.21/fs/ext4/file.c > @@ -135,5 +135,6 @@ const struct inode_operations ext4_file_ > .removexattr = generic_removexattr, > #endif > .permission = ext4_permission, > + .fallocate = ext4_fallocate, > }; > > Index: linux-2.6.21/include/linux/ext4_fs.h > =================================================================== > --- linux-2.6.21.orig/include/linux/ext4_fs.h > +++ linux-2.6.21/include/linux/ext4_fs.h > @@ -102,6 +102,8 @@ > EXT4_GOOD_OLD_FIRST_INO : \ > (s)->s_first_ino) > #endif > +#define EXT4_BLOCK_ALIGN(size, blkbits) (((size)+(1 << blkbits)-1) & \ > + (~((1 << blkbits)-1))) Maybe a comment describing what this does? Probably it's obvious enough. I think it could use the standard ALIGN macro. Is blkbits sufficiently parenthesised here? Even if it is, adding the parens would be better practice. > /* > * Macro-instructions used to manage fragments > @@ -225,6 +227,10 @@ struct ext4_new_group_data { > __u32 free_blocks_count; > }; > > +/* Following is used by preallocation logic to tell get_blocks() that we > + * want uninitialzed extents. 
> + */ Please convert all newly-added multiline comments to the preferred layout. > +#define EXT4_CREATE_UNINITIALIZED_EXT 2 > > /* > * ioctl commands > @@ -976,6 +982,7 @@ extern int ext4_ext_get_blocks(handle_t > extern void ext4_ext_truncate(struct inode *, struct page *); > extern void ext4_ext_init(struct super_block *); > extern void ext4_ext_release(struct super_block *); > +extern int ext4_fallocate(struct inode *, int, loff_t, loff_t); argh. And feel free to give these args some useful names. > static inline int > ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, > unsigned long max_blocks, struct buffer_head *bh, > Index: linux-2.6.21/include/linux/ext4_fs_extents.h > =================================================================== > --- linux-2.6.21.orig/include/linux/ext4_fs_extents.h > +++ linux-2.6.21/include/linux/ext4_fs_extents.h > @@ -125,6 +125,19 @@ struct ext4_ext_path { > #define EXT4_EXT_CACHE_EXTENT 2 > > /* > + * Macro-instructions to handle (mark/unmark/check/create) unitialized > + * extents. Applications can issue an IOCTL for preallocation, which results > + * in assigning unitialized extents to the file. > + */ > +#define ext4_ext_mark_uninitialized(ext) ((ext)->ee_len |= \ > + cpu_to_le16(0x8000)) > +#define ext4_ext_is_uninitialized(ext) ((le16_to_cpu((ext)->ee_len))& \ > + 0x8000) > +#define ext4_ext_get_actual_len(ext) ((le16_to_cpu((ext)->ee_len))& \ > + 0x7FFF) inlined C functions are preferred, and I think these could be implemented that way. 
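A minimal sketch of what the inline-function form of the three macros above might look like. This is a userspace illustration under stated assumptions, not the patch author's code: struct sketch_extent carries only the ee_len field, and sketch_cpu_to_le16()/sketch_le16_to_cpu() are hypothetical stand-ins for the kernel's cpu_to_le16()/le16_to_cpu(). The 0x8000 flag bit and 0x7FFF length mask are taken from the quoted macros.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's cpu_to_le16(). */
static inline uint16_t sketch_cpu_to_le16(uint16_t v)
{
	uint8_t b[2] = { (uint8_t)(v & 0xff), (uint8_t)(v >> 8) };
	uint16_t out;

	memcpy(&out, b, sizeof(out));
	return out;
}

/* Hypothetical stand-in for the kernel's le16_to_cpu(). */
static inline uint16_t sketch_le16_to_cpu(uint16_t v)
{
	uint8_t b[2];

	memcpy(b, &v, sizeof(b));
	return (uint16_t)(b[0] | (b[1] << 8));
}

/* Assumption: only the field the macros touch; stored little-endian. */
struct sketch_extent {
	uint16_t ee_len;
};

/* Set the top bit of ee_len to flag the extent as uninitialized. */
static inline void sketch_ext_mark_uninitialized(struct sketch_extent *ext)
{
	ext->ee_len |= sketch_cpu_to_le16(0x8000);
}

/* Non-zero if the extent is flagged as uninitialized. */
static inline int sketch_ext_is_uninitialized(const struct sketch_extent *ext)
{
	return (sketch_le16_to_cpu(ext->ee_len) & 0x8000) != 0;
}

/* Length in blocks with the uninitialized flag masked off. */
static inline uint16_t sketch_ext_get_actual_len(const struct sketch_extent *ext)
{
	return sketch_le16_to_cpu(ext->ee_len) & 0x7FFF;
}
```

Compared with the macros, the functions give the compiler a typed argument to check and evaluate `ext' exactly once.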
From owner-xfs@oss.sgi.com Thu May 3 21:32:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 21:32:51 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l444WmfB027913 for ; Thu, 3 May 2007 21:32:49 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l444WdFD017959 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 21:32:40 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l444Wc1E029024; Thu, 3 May 2007 21:32:39 -0700 Date: Thu, 3 May 2007 21:32:38 -0700 From: Andrew Morton To: "Amit K. Arora" Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents Message-Id: <20070503213238.5cdb1585.akpm@linux-foundation.org> In-Reply-To: <20070426181623.GE7209@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181623.GE7209@amitarora.in.ibm.com> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 
X-archive-position: 11269 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Thu, 26 Apr 2007 23:46:23 +0530 "Amit K. Arora" wrote: > This patch adds write support for preallocated (using fallocate system > call) blocks/extents. The preallocated extents in ext4 are marked > "uninitialized", hence they need special handling especially while > writing to them. This patch takes care of that. > > ... > > /* > + * ext4_ext_try_to_merge: > + * tries to merge the "ex" extent to the next extent in the tree. > + * It always tries to merge towards right. If you want to merge towards > + * left, pass "ex - 1" as argument instead of "ex". > + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns > + * 1 if they got merged. OK. > + */ > +int ext4_ext_try_to_merge(struct inode *inode, > + struct ext4_ext_path *path, > + struct ext4_extent *ex) > +{ > + struct ext4_extent_header *eh; > + unsigned int depth, len; > + int merge_done=0, uninitialized = 0; space around "=", please. Many people prefer not to do the multiple-definitions-per-line, btw: int merge_done = 0; int uninitialized = 0; reasons: - It gives you some space for a nice comment - It makes patches much more readable, and it makes rejects easier to fix - standardisation. > + depth = ext_depth(inode); > + BUG_ON(path[depth].p_hdr == NULL); > + eh = path[depth].p_hdr; > + > + while (ex < EXT_LAST_EXTENT(eh)) { > + if (!ext4_can_extents_be_merged(inode, ex, ex + 1)) > + break; > + /* merge with next extent!
*/ > + if (ext4_ext_is_uninitialized(ex)) > + uninitialized = 1; > + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) > + + ext4_ext_get_actual_len(ex + 1)); > + if (uninitialized) > + ext4_ext_mark_uninitialized(ex); > + > + if (ex + 1 < EXT_LAST_EXTENT(eh)) { > + len = (EXT_LAST_EXTENT(eh) - ex - 1) > + * sizeof(struct ext4_extent); > + memmove(ex + 1, ex + 2, len); > + } > + eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1); Kernel convention is to put spaces around "-" > + merge_done = 1; > + BUG_ON(eh->eh_entries == 0); eek, scary BUG_ON. Do we really need to be that severe? Would it be better to warn and run ext4_error() here? > + } > + > + return merge_done; > +} > + > + > > ... > > +/* > + * ext4_ext_convert_to_initialized: > + * this function is called by ext4_ext_get_blocks() if someone tries to write > + * to an uninitialized extent. It may result in splitting the uninitialized > + * extent into multiple extents (upto three). Atleast one initialized extent > + * and atmost two uninitialized extents can result. There are some typos here > + * There are three possibilities: > + * a> No split required: Entire extent should be initialized. > + * b> Split into two extents: Only one end of the extent is being written to. > + * c> Split into three extents: Somone is writing in middle of the extent.
and here > + */ > +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, > + struct ext4_ext_path *path, > + ext4_fsblk_t iblock, > + unsigned long max_blocks) > +{ > + struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex; > + struct ext4_extent_header *eh; > + unsigned int allocated, ee_block, ee_len, depth; > + ext4_fsblk_t newblock; > + int err = 0, ret = 0; > + > + depth = ext_depth(inode); > + eh = path[depth].p_hdr; > + ex = path[depth].p_ext; > + ee_block = le32_to_cpu(ex->ee_block); > + ee_len = ext4_ext_get_actual_len(ex); > + allocated = ee_len - (iblock - ee_block); > + newblock = iblock - ee_block + ext_pblock(ex); > + ex2 = ex; > + > + /* ex1: ee_block to iblock - 1 : uninitialized */ > + if (iblock > ee_block) { > + ex1 = ex; > + ex1->ee_len = cpu_to_le16(iblock - ee_block); > + ext4_ext_mark_uninitialized(ex1); > + ex2 = &newex; > + } > + /* for sanity, update the length of the ex2 extent before > + * we insert ex3, if ex1 is NULL. This is to avoid temporary > + * overlap of blocks. > + */ > + if (!ex1 && allocated > max_blocks) > + ex2->ee_len = cpu_to_le16(max_blocks); > + /* ex3: to ee_block + ee_len : uninitialised */ > + if (allocated > max_blocks) { > + unsigned int newdepth; > + ex3 = &newex; > + ex3->ee_block = cpu_to_le32(iblock + max_blocks); > + ext4_ext_store_pblock(ex3, newblock + max_blocks); > + ex3->ee_len = cpu_to_le16(allocated - max_blocks); > + ext4_ext_mark_uninitialized(ex3); > + err = ext4_ext_insert_extent(handle, inode, path, ex3); > + if (err) > + goto out; > + /* The depth, and hence eh & ex might change > + * as part of the insert above. 
> + */ > + newdepth = ext_depth(inode); > + if (newdepth != depth) > + { Use if (newdepth != depth) { > + depth=newdepth; spaces > + path = ext4_ext_find_extent(inode, iblock, NULL); > + if (IS_ERR(path)) { > + err = PTR_ERR(path); > + path = NULL; > + goto out; > + } > + eh = path[depth].p_hdr; > + ex = path[depth].p_ext; > + if (ex2 != &newex) > + ex2 = ex; > + } > + allocated = max_blocks; > + } > + /* If there was a change of depth as part of the > + * insertion of ex3 above, we need to update the length > + * of the ex1 extent again here > + */ > + if (ex1 && ex1 != ex) { > + ex1 = ex; > + ex1->ee_len = cpu_to_le16(iblock - ee_block); > + ext4_ext_mark_uninitialized(ex1); > + ex2 = &newex; > + } > + /* ex2: iblock to iblock + maxblocks-1 : initialised */ > + ex2->ee_block = cpu_to_le32(iblock); > + ex2->ee_start = cpu_to_le32(newblock); > + ext4_ext_store_pblock(ex2, newblock); > + ex2->ee_len = cpu_to_le16(allocated); > + if (ex2 != ex) > + goto insert; > + if ((err = ext4_ext_get_access(handle, inode, path + depth))) > + goto out; The preferred style is err = ext4_ext_get_access(handle, inode, path + depth); if (err) goto out; > + /* New (initialized) extent starts from the first block > + * in the current extent. i.e., ex2 == ex > + * We have to see if it can be merged with the extent > + * on the left. > + */ > + if (ex2 > EXT_FIRST_EXTENT(eh)) { > + /* To merge left, pass "ex2 - 1" to try_to_merge(), > + * since it merges towards right _only_. > + */ > + ret = ext4_ext_try_to_merge(inode, path, ex2 - 1); > + if (ret) { > + err = ext4_ext_correct_indexes(handle, inode, path); > + if (err) > + goto out; > + depth = ext_depth(inode); > + ex2--; > + } > + } > + /* Try to Merge towards right. This might be required > + * only when the whole extent is being written to. > + * i.e. ex2==ex and ex3==NULL. 
> + */ > + if (!ex3) { > + ret = ext4_ext_try_to_merge(inode, path, ex2); > + if (ret) { > + err = ext4_ext_correct_indexes(handle, inode, path); > + if (err) > + goto out; > + } > + } > + /* Mark modified extent as dirty */ > + err = ext4_ext_dirty(handle, inode, path + depth); > + goto out; > +insert: > + err = ext4_ext_insert_extent(handle, inode, path, &newex); > +out: > + return err ? err : allocated; > +} Sigh. I hope you guys know how all this works, because the extent code is a mystery to me. Is the on-disk layout and the allocation strategy described anywhere? > +extern int ext4_ext_try_to_merge(struct inode *, struct ext4_ext_path *, struct ext4_extent *); Again, I do think that sticking the identifiers in there helps readability. Although it is not as important in a boring old declaration as it is in, say, inode_operations, etc. Please try to keep the code looking nice in an 80-column display. From owner-xfs@oss.sgi.com Thu May 3 21:55:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 21:55:43 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l444tdfB002706 for ; Thu, 3 May 2007 21:55:40 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l444tTgs018661 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 21:55:31 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l444tSik029320; Thu, 3 May 2007 21:55:29 -0700 Date: Thu, 3 May 2007 21:55:28 -0700 From: Andrew Morton To: "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-Id: <20070503215528.d8ab4e47.akpm@linux-foundation.org> In-Reply-To: <20070503212955.b1b6443c.akpm@linux-foundation.org> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11270 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Thu, 3 May 2007 21:29:55 -0700 Andrew Morton wrote: > > + ret = -EFBIG; > > + if (offset + len > inode->i_sb->s_maxbytes) > > + goto out_fput; > > This code does handle offset+len going negative, but only by accident, I > suspect. But it doesn't handle offset+len wrapping through zero. 
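One way to make such a range check robust against both kinds of wrap is to never compute offset + len at all and compare len against the remaining headroom instead. The sketch below is a userspace illustration, not the code from the patch: sketch_check_range() and its maxbytes parameter (standing in for s_maxbytes) are hypothetical, int64_t stands in for loff_t, and rejecting len <= 0 with -EINVAL is an assumption.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Validate the byte range [offset, offset + len) against maxbytes
 * without ever computing offset + len, so nothing can wrap.
 * Returns 0 if the range is acceptable, -EINVAL or -EFBIG otherwise.
 */
static int sketch_check_range(int64_t offset, int64_t len, int64_t maxbytes)
{
	/* Assumption: negative offsets and non-positive lengths are invalid. */
	if (offset < 0 || len <= 0)
		return -EINVAL;

	/*
	 * offset >= 0 here.  The subtraction is evaluated only when
	 * offset <= maxbytes, so maxbytes - offset lies in [0, maxbytes]
	 * and cannot overflow.
	 */
	if (offset > maxbytes || len > maxbytes - offset)
		return -EFBIG;

	return 0;
}
```

For example, offset = INT64_MAX with len = 1 would overflow a naive signed addition, whereas here it is rejected with -EFBIG by plain comparisons.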
From owner-xfs@oss.sgi.com Thu May 3 22:16:46 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 22:16:49 -0700 (PDT) Received: from ozlabs.org (ozlabs.org [203.10.76.45]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l445GifB007660 for ; Thu, 3 May 2007 22:16:45 -0700 Received: by ozlabs.org (Postfix, from userid 1003) id 8F525DDFF5; Fri, 4 May 2007 15:16:43 +1000 (EST) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17978.47502.786970.196554@cargo.ozlabs.ibm.com> Date: Fri, 4 May 2007 14:41:50 +1000 From: Paul Mackerras To: Andrew Morton Cc: "Amit K. Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc In-Reply-To: <20070503212955.b1b6443c.akpm@linux-foundation.org> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> X-Mailer: VM 7.19 under Emacs 21.4.1 X-archive-position: 11271 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: paulus@samba.org Precedence: bulk X-list: xfs Andrew Morton writes: > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > This patch implements the fallocate() system call and adds support for > > i386, x86_64 and powerpc. > > > > ... 
> > > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > Please add a comment over this function which specifies its behaviour. > Really it should be enough material from which a full manpage can be > written. This looks like it will have the same problem on s390 as sys_sync_file_range. Maybe the prototype should be: asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) Paul. From owner-xfs@oss.sgi.com Thu May 3 23:08:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 23:08:13 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l44687fB021573 for ; Thu, 3 May 2007 23:08:09 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA06552; Fri, 4 May 2007 16:07:46 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4467cAf83970051; Fri, 4 May 2007 16:07:38 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4467VZ384026819; Fri, 4 May 2007 16:07:31 +1000 (AEST) Date: Fri, 4 May 2007 16:07:31 +1000 From: David Chinner To: Andrew Morton Cc: "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070504060731.GJ32602149@melbourne.sgi.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503212955.b1b6443c.akpm@linux-foundation.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11272 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > This patch implements the fallocate() system call and adds support for > > i386, x86_64 and powerpc. > > > > ... > > +{ > > + struct file *file; > > + struct inode *inode; > > + long ret = -EINVAL; > > + > > + if (len == 0 || offset < 0) > > + goto out; > > The posix spec implies that negative `len' is permitted - presumably "allocate > ahead of `offset'". How peculiar. I just checked the man page for posix_fallocate() and it says: EINVAL offset or len was less than zero. We should probably follow this lead. > > + > > + ret = -ENODEV; > > + if (!S_ISREG(inode->i_mode)) > > + goto out_fput; > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > seems a bit silly of them. 
Hmmmm - I thought that the intention of sys_fallocate() was to be generic enough to eventually allow preallocation on directories. If that is the case, then this check will prevent that.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 3 23:28:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 23:28:37 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l446STfB031053 for ; Thu, 3 May 2007 23:28:30 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l446SGLQ021546 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 23:28:18 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l446SFXl030589; Thu, 3 May 2007 23:28:16 -0700 Date: Thu, 3 May 2007 23:28:15 -0700 From: Andrew Morton To: David Chinner Cc: "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-Id: <20070503232815.2f62a75e.akpm@linux-foundation.org> In-Reply-To: <20070504060731.GJ32602149@melbourne.sgi.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <20070504060731.GJ32602149@melbourne.sgi.com> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11273 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Fri, 4 May 2007 16:07:31 +1000 David Chinner wrote: > On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > > > This patch implements the fallocate() system call and adds support for > > > i386, x86_64 and powerpc. > > > > > > ... > > > +{ > > > + struct file *file; > > > + struct inode *inode; > > > + long ret = -EINVAL; > > > + > > > + if (len == 0 || offset < 0) > > > + goto out; > > > > The posix spec implies that negative `len' is permitted - presumably "allocate > > ahead of `offset'". How peculiar. 
> > I just checked the man page for posix_fallocate() and it says: > > EINVAL offset or len was less than zero. > > We should probably follow this lead. Yes, I think so. I'm suspecting that http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html is just buggy. Or I can't read. I mean, if we're going to support negative `len' then is the byte at `offset' inside or outside the segment? Head spins. However it would be neat if someone could test $OTHER_OS and, perhaps more importantly, the present glibc emulation (which I assume your manpage is referring to, so this would be a manpage test ;)). > > > + > > > + ret = -ENODEV; > > > + if (!S_ISREG(inode->i_mode)) > > > + goto out_fput; > > > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > > seems a bit silly of them. > > Hmmmm - I thought that the intention of sys_fallocate() was to > be generic enough to eventually allow preallocation on directories. > If that is the case, then this check will prevent that.... The above opengroup page only permits S_ISREG. Preallocating directories sounds quite useful to me, although it's something which would be pretty hard to emulate if the FS doesn't support it. And there's a decent case to be made for emulating it - run-anywhere reasons. Does glibc emulation support directories? Quite unlikely. But yes, sounds like a desirable thing. Would XFS support it easily if the above check was relaxed? 
From owner-xfs@oss.sgi.com Thu May 3 23:57:00 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 23:57:07 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l446uwfB004955 for ; Thu, 3 May 2007 23:57:00 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l446ujPL002724; Fri, 4 May 2007 02:56:46 -0400 Received: from devserv.devel.redhat.com (devserv.devel.redhat.com [172.16.58.1]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l446uef7021912; Fri, 4 May 2007 02:56:40 -0400 Received: from devserv.devel.redhat.com (localhost.localdomain [127.0.0.1]) by devserv.devel.redhat.com (8.12.11.20060308/8.12.11) with ESMTP id l446ueGH007487; Fri, 4 May 2007 02:56:40 -0400 Received: (from jakub@localhost) by devserv.devel.redhat.com (8.12.11.20060308/8.12.11/Submit) id l446uQr9007476; Fri, 4 May 2007 02:56:26 -0400 Date: Fri, 4 May 2007 02:56:26 -0400 From: Jakub Jelinek To: Andrew Morton Cc: Ulrich Drepper , David Chinner , "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070504065626.GW355@devserv.devel.redhat.com> Reply-To: Jakub Jelinek References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <20070504060731.GJ32602149@melbourne.sgi.com> <20070503232815.2f62a75e.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503232815.2f62a75e.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11274 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jakub@redhat.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote: > > > The posix spec implies that negative `len' is permitted - presumably "allocate > > > ahead of `offset'". How peculiar. > > > > I just checked the man page for posix_fallocate() and it says: > > > > EINVAL offset or len was less than zero. That describes the current glibc implementation. > > We should probably follow this lead. > > Yes, I think so. I'm suspecting that > http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html > is just buggy. Or I can't read. > > I mean, if we're going to support negative `len' then is the byte at > `offset' inside or outside the segment? Head spins. 
> > However it would be neat if someone could test $OTHER_OS and, perhaps more > importantly, the present glibc emulation (which I assume your manpage is > referring to, so this would be a manpage test ;)). int posix_fallocate (int fd, __off_t offset, __off_t len) { struct stat64 st; struct statfs f; /* `off_t' is a signed type. Therefore we can determine whether OFFSET + LEN is too large if it is a negative value. */ if (offset < 0 || len < 0) return EINVAL; if (offset + len < 0) return EFBIG; /* First thing we have to make sure is that this is really a regular file. */ if (__fxstat64 (_STAT_VER, fd, &st) != 0) return EBADF; if (S_ISFIFO (st.st_mode)) return ESPIPE; if (! S_ISREG (st.st_mode)) return ENODEV; if (len == 0) { if (st.st_size < offset) { int ret = __ftruncate (fd, offset); if (ret != 0) ret = errno; return ret; } return 0; } ... is what glibc does ATM. Seems we violate the case where len == 0, as EINVAL in that case is "shall fail". But reading the standard to imply negative len is ok is too much guessing, there is no word what it means when len is negative and "required storage for regular file data starting at offset and continuing for len bytes" doesn't make sense for negative size. And given the general "Implementations may support additional errors not included in this list, may generate errors included in this list under circumstances other than those described here, or may contain extensions or limitations that prevent some errors from occurring." I believe returning EINVAL for len < 0 is not a POSIX violation. That doesn't mean the standard shouldn't be clarified, whether by saying EINVAL must be returned for non-positive len or saying that using negative len has undefined or implementation defined behavior. > The above opengroup page only permits S_ISREG. Preallocating directories > sounds quite useful to me, although it's something which would be pretty > hard to emulate if the FS doesn't support it. 
And there's a decent case to > be made for emulating it - run-anywhere reasons. Does glibc emulation support > directories? Quite unlikely. No, see above. Jakub From owner-xfs@oss.sgi.com Fri May 4 00:28:21 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 00:28:25 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l447SHfB017704 for ; Fri, 4 May 2007 00:28:19 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA08305; Fri, 4 May 2007 17:27:56 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l447RnAf84055039; Fri, 4 May 2007 17:27:50 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l447Rg2j84042753; Fri, 4 May 2007 17:27:42 +1000 (AEST) Date: Fri, 4 May 2007 17:27:42 +1000 From: David Chinner To: Andrew Morton Cc: David Chinner , "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070504072742.GK32602149@melbourne.sgi.com> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <20070504060731.GJ32602149@melbourne.sgi.com> <20070503232815.2f62a75e.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503232815.2f62a75e.akpm@linux-foundation.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11275 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote: > On Fri, 4 May 2007 16:07:31 +1000 David Chinner wrote: > > On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > > > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > > > > > This patch implements the fallocate() system call and adds support for > > > > i386, x86_64 and powerpc. > > > > > > > > ... > > > > +{ > > > > + struct file *file; > > > > + struct inode *inode; > > > > + long ret = -EINVAL; > > > > + > > > > + if (len == 0 || offset < 0) > > > > + goto out; > > > > > > The posix spec implies that negative `len' is permitted - presumably "allocate > > > ahead of `offset'". How peculiar. > > > > I just checked the man page for posix_fallocate() and it says: > > > > EINVAL offset or len was less than zero. 
> > > > We should probably follow this lead. > > Yes, I think so. I'm suspecting that > http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html > is just buggy. Or I can't read. > > I mean, if we're going to support negative `len' then is the byte at > `offset' inside or outside the segment? Head spins. I don't think we should care. If we provide a syscall with the semantics of "allocate from offset to offset+len" then glibc's implementation can turn negative length into two separate fallocate syscalls.... > > > > + ret = -ENODEV; > > > > + if (!S_ISREG(inode->i_mode)) > > > > + goto out_fput; > > > > > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > > > seems a bit silly of them. > > > > Hmmmm - I thought that the intention of sys_fallocate() was to > > be generic enough to eventually allow preallocation on directories. > > If that is the case, then this check will prevent that.... > > The above opengroup page only permits S_ISREG. Preallocating directories > sounds quite useful to me, although it's something which would be pretty > hard to emulate if the FS doesn't support it. And there's a decent case to > be made for emulating it - run-anywhere reasons. Does glibc emulation support > directories? Quite unlikely. > > But yes, sounds like a desirable thing. Would XFS support it easily if the above > check was relaxed? No - right now empty blocks are pruned from the directory immediately so I don't think we really have a concept of empty blocks in the btree structure. dir2 is bloody complex, so adding preallocation is probably not going to be simple to do. It's not high on my list to add, either, because we can typically avoid the worst case directory fragmentation by using larger directory block sizes (e.g. 16k instead of the default 4k on a 4k block size fs). IIRC directory preallocation has been talked about more for ext3/4.... Cheers, Dave. 
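[Editorial sketch] The range checks debated in this thread (EINVAL for a negative offset or len, EFBIG when offset + len does not fit in a signed offset, and the open question of whether len == 0 is an error) can be illustrated with a small stand-alone helper. This is only a sketch under stated assumptions: `check_prealloc_range` and its `zero_len_is_error` parameter are hypothetical names invented here, not part of glibc or of the posted fallocate() patch, and `int64_t` stands in for `off_t`.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper (not glibc or kernel code): validate an
 * (offset, len) pair along the lines discussed in the thread.
 * Returns 0 if the range is acceptable, otherwise an errno value.
 * zero_len_is_error selects which reading of len == 0 to apply.
 */
int check_prealloc_range(int64_t offset, int64_t len, int zero_len_is_error)
{
    /* Negative values are rejected, as in the man page text quoted above. */
    if (offset < 0 || len < 0)
        return EINVAL;

    /* The thread disagrees on len == 0, so the caller picks the policy. */
    if (len == 0 && zero_len_is_error)
        return EINVAL;

    /*
     * The glibc code quoted in this thread tests offset + len < 0.
     * Comparing against INT64_MAX expresses the same limit without
     * performing the overflowing addition.
     */
    if (len > INT64_MAX - offset)
        return EFBIG;

    /* Invariant: the end of the range is now representable. */
    assert(offset + len >= offset);
    return 0;
}
```

Passing 1 for `zero_len_is_error` mirrors the `len == 0` rejection in the posted patch; passing 0 mirrors the glibc emulation quoted earlier, which accepts a zero length.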
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Fri May 4 00:29:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 00:29:41 -0700 (PDT) Received: from postfix2-g20.free.fr (postfix2-g20.free.fr [212.27.60.43]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l447TbfB018149 for ; Fri, 4 May 2007 00:29:38 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix2-g20.free.fr (Postfix) with ESMTP id CD3ADFA8658 for ; Fri, 4 May 2007 08:06:38 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id CD49517BA4; Fri, 4 May 2007 09:06:13 +0200 (CEST) Date: Fri, 4 May 2007 09:06:13 +0200 From: Emmanuel Florac To: David Chinner , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504090613.7c0f97d3@galadriel.home> In-Reply-To: <20070504005922.GC32602149@melbourne.sgi.com> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l447TcfB018166 X-archive-position: 11276 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Fri, 4 May 2007 10:59:22 +1000 you wrote: > Were there any I/O errors reported before the shutdown? > Nope. To be clear: the problem can be reproduced on several different systems, with different motherboards, different drives, different RAID controllers... This isn't a hardware problem. > > On similar hardware with 2 3Ware-9550 16x750GB striped together, > > but running 2.6.17.13, I had a similar fs crash last week.
> > Unfortunately I don't have the logs at hand, but we were able to > > reproduce the crash several times at home: > > Hmm - 750GB drives are brand new. I wouldn't rule out media issues > at this point... The problem is quite easily reproduced with 500GB drives too. > > Filesystem "md0": XFS internal error xfs_btree_check_sblock at line > > 336 of file fs/xfs/xfs_btree.c. Caller 0xc01fb282 > > Memory corruption? I tried different RAM modules, and the problem occurs with ECC RAM too. > > > > Out of curiosity, I've tried to use reiserfs (just to see how it > > compares regarding this). Reiserfs crashed before even writing > > 100MB! > > That indicates there's something wrong other than the filesystem. > I'd suggest making sure your raid arrays, memory, etc. are all > functioning correctly first. They are. I've tested 5 different machines so far (Supermicro or Tyan motherboards, Kingston RAM, Intel or AMD CPUs, Hitachi and Seagate drives...) > What platform are you running on? Are you running ia32 with 4k stacks? Yes. I'll try this week 2.6.18.8 thoroughly and 2.6.20.11 too. Then jfs, just to be sure.
-- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Fri May 4 00:34:00 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 00:34:03 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l447XvfB019746 for ; Fri, 4 May 2007 00:33:59 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA08568; Fri, 4 May 2007 17:33:47 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l447XkAf83983180; Fri, 4 May 2007 17:33:46 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l447Xi8582990264; Fri, 4 May 2007 17:33:44 +1000 (AEST) Date: Fri, 4 May 2007 17:33:44 +1000 From: David Chinner To: Emmanuel Florac Cc: David Chinner , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504073344.GL32602149@melbourne.sgi.com> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20070504090613.7c0f97d3@galadriel.home> User-Agent: Mutt/1.4.2.1i X-archive-position: 11277 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Fri, May 04, 2007 at 09:06:13AM +0200, Emmanuel Florac wrote: > On Fri, 4 May 2007 10:59:22 +1000 you wrote: > > What platform are you running on? Are you running ia32 with 4k stacks? > > Yes. I'll try this week 2.6.18.8 thoroughly and 2.6.20.11 too.
Then > jfs, just to be sure. Well, there's your problem. Stack overflows. IMO, if you use a filesystem, you shouldn't use 4k stacks. ;) If you rebuild your kernel with 8k stacks then your problems will most likely go away. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Fri May 4 06:25:50 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 06:25:53 -0700 (PDT) Received: from smtp-ft5.fr.colt.net (smtp-ft5.fr.colt.net [213.41.78.197]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l44DPlfB025077 for ; Fri, 4 May 2007 06:25:49 -0700 Received: from harpe.intellique.com (host.93.124.68.195.rev.coltfrance.com [195.68.124.93]) by smtp-ft5.fr.colt.net (8.13.8/8.13.8/Debian-3) with ESMTP id l44DPhpu000578; Fri, 4 May 2007 15:25:43 +0200 Date: Fri, 4 May 2007 15:25:46 +0200 From: Emmanuel Florac To: David Chinner Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504152546.614374ac@harpe.intellique.com> In-Reply-To: <20070504073344.GL32602149@melbourne.sgi.com> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l44DPofB025089 X-archive-position: 11278 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Fri, 4 May 2007 17:33:44 +1000 David Chinner wrote: > Well, there's your problem. Stack overflows. IMO, if you use a > filesystem, you shouldn't use 4k stacks. ;) > > If you rebuild your kernel with 8k stacks then your problems will > most likely go away.
Well, I've double-checked asm-i386/module.h, and it actually looks like 4K stacks are NOT the default, so I must be using 8K, right? I ran the same test on the same machine but WITHOUT software raid-0 (so write barriers are in use), and all went well: more than 3TB written without a glitch. I still think the problem is related to write barriers. I'll try another RAID controller, an Adaptec for instance, to make sure the 3ware driver isn't involved. I'll also try again with an amd64 kernel. I'd really like to sort this out... -- ---------------------------------------- Emmanuel Florac | Intellique ---------------------------------------- From owner-xfs@oss.sgi.com Fri May 4 07:55:35 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 07:55:38 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l44EtXfB019895 for ; Fri, 4 May 2007 07:55:35 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id C945B18022E01; Fri, 4 May 2007 09:55:30 -0500 (CDT) Message-ID: <463B4962.70904@sandeen.net> Date: Fri, 04 May 2007 09:55:30 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Emmanuel Florac CC: David Chinner , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> In-Reply-To: <20070504152546.614374ac@harpe.intellique.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-archive-position: 11279 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to:
xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Emmanuel Florac wrote: > On Fri, 4 May 2007 17:33:44 +1000 > David Chinner wrote: > >> Well, there's your problem. Stack overflows. IMO, if you use a >> filesystem, you shouldn't use 4k stacks. ;) >> >> If you rebuild your kernel with 8k stacks then your problems will >> most likely go away. > > Well, I've double-checked asm-i386/module.h, and it actually looks > like 4K stacks are NOT the default, so I must be using 8K, right? Depends on how you configured it; just look at the .config you built with and search for CONFIG_4KSTACKS. On Fedora at least (I can't remember for sure, but I don't think this is a Fedora-ism...) you can run "modinfo" on some module and see: vermagic: 2.6.21 SMP mod_unload 686 4KSTACKS -Eric From owner-xfs@oss.sgi.com Fri May 4 08:30:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 08:30:51 -0700 (PDT) Received: from smtp-ft1.fr.colt.net (smtp-ft1.fr.colt.net [213.41.78.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l44FUlfB030646 for ; Fri, 4 May 2007 08:30:49 -0700 Received: from harpe.intellique.com (host.93.124.68.195.rev.coltfrance.com [195.68.124.93]) by smtp-ft1.fr.colt.net (8.13.8/8.13.8/Debian-3) with ESMTP id l44FUdlH008756; Fri, 4 May 2007 17:30:41 +0200 Date: Fri, 4 May 2007 17:30:49 +0200 From: Emmanuel Florac To: Eric Sandeen Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504173049.14606033@harpe.intellique.com> In-Reply-To: <463B4962.70904@sandeen.net> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 X-Antivirus:
checked in 0.023sec at smtp-ft1.fr.colt.net ([213.41.78.210]) by smf-clamd Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l44FUnfB030656 X-archive-position: 11280 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Fri, 04 May 2007 09:55:30 -0500 Eric Sandeen wrote: > Emmanuel Florac wrote: > > On Fri, 4 May 2007 17:33:44 +1000 > > David Chinner wrote: > > > >> Well, there's your problem. Stack overflows. IMO, if you use a > >> filesystem, you shouldn't use 4k stacks. ;) > >> > >> If you rebuild your kernel with 8k stacks then your problems will > >> most likely go away. > > > > Well, I've double-checked asm-i386/module.h, and it actually > > looks like 4K stacks are NOT the default, so I must be using 8K, > > right? > > Depends on how you configured it; just look at the .config you built > with, and search for CONFIG_4KSTACKS config-2.6.17.13: # CONFIG_4KSTACKS is not set So the problem lies elsewhere...
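[Editorial sketch] The .config check suggested above distinguishes an enabled option ("CONFIG_4KSTACKS=y") from the commented-out form ("# CONFIG_4KSTACKS is not set"). A minimal illustration of that whole-line match, assuming the config file has already been read into a string; `config_enables_4kstacks` is a hypothetical name used only for this example, and the result only tells which stack layout the kernel was built with.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if the kernel config text contains a
 * line that is exactly "CONFIG_4KSTACKS=y", else 0. A line such as
 * "# CONFIG_4KSTACKS is not set" does not count as enabled.
 */
int config_enables_4kstacks(const char *config_text)
{
    static const char key[] = "CONFIG_4KSTACKS=y";
    const size_t key_len = sizeof(key) - 1;
    const char *line = config_text;

    assert(config_text != NULL);

    while (*line != '\0') {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);

        /* Match whole lines only, so comments and longer option
         * names (e.g. CONFIG_I386_4KSTACKS) are ignored. */
        if (len == key_len && strncmp(line, key, key_len) == 0)
            return 1;

        if (end == NULL)
            break;
        line = end + 1;
    }
    return 0;
}
```

For the config line quoted in the message above, the helper returns 0, meaning the option is not enabled.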
-- ---------------------------------------- Emmanuel Florac | Intellique ---------------------------------------- From owner-xfs@oss.sgi.com Fri May 4 08:58:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 08:58:29 -0700 (PDT) Received: from mail.lichtvoll.de (mondschein.lichtvoll.de [194.150.191.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l44FwOfB005594 for ; Fri, 4 May 2007 08:58:25 -0700 Received: from localhost (dslb-084-057-112-255.pools.arcor-ip.net [84.57.112.255]) by mail.lichtvoll.de (Postfix) with ESMTP id AF67E5AD3F for ; Fri, 4 May 2007 17:58:22 +0200 (CEST) From: Martin Steigerwald To: linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Date: Fri, 4 May 2007 17:58:21 +0200 User-Agent: KMail/1.9.6 References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> (sfid-20070504_161005_263297_AD8C4AAD) In-Reply-To: <20070504152546.614374ac@harpe.intellique.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705041758.21320.Martin@lichtvoll.de> X-archive-position: 11281 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: Martin@lichtvoll.de Precedence: bulk X-list: xfs On Friday 04 May 2007, Emmanuel Florac wrote: > I ran the same test on the same machine but WITHOUT software raid-0 > (so write barriers are in use), and all went well: more than 3TB > written without a glitch. I still think the problem is related to > write barriers. I'll try another RAID controller, an Adaptec > for instance, to make sure the 3ware driver isn't involved. I'll also > try again with an amd64 kernel. Hello Emmanuel!
When you can't use write barriers, as XFS tells you in the logs, you had better switch off write caching for the hard disks / RAID controller, unless you happen to have NVRAM or a safe power supply. But then, using the write cache without barriers should not make any difference unless you actually have a crash or power failure during a write operation. Did you test with ext3 as well? You wrote that it crashes with ReiserFS (version 3) even faster. When it crashes with several filesystems, it's unlikely to be a filesystem issue. Regards, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 From owner-xfs@oss.sgi.com Fri May 4 15:12:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 15:12:33 -0700 (PDT) Received: from postfix2-g20.free.fr (postfix2-g20.free.fr [212.27.60.43]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l44MCTfB015022 for ; Fri, 4 May 2007 15:12:30 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix2-g20.free.fr (Postfix) with ESMTP id 48027FA5D2B for ; Fri, 4 May 2007 22:44:22 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 89A381838B; Fri, 4 May 2007 23:43:56 +0200 (CEST) Date: Fri, 4 May 2007 23:43:57 +0200 From: Emmanuel Florac To: Martin Steigerwald Cc: linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504234357.24d22883@galadriel.home> In-Reply-To: <200705041758.21320.Martin@lichtvoll.de> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id
l44MCVfB015030 X-archive-position: 11282 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Fri, 4 May 2007 17:58:21 +0200 you wrote: > Did you test with ext3 as well? You wrote that it crashes with ReiserFS > (version 3) even faster. When it crashes with several filesystems, it's > unlikely to be a filesystem issue. Unfortunately ext3 doesn't support volumes bigger than 8TB, so that's useless to me. I plan to test jfs, however. I think it's more of a dm/md issue, but I'm not sure... -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Fri May 4 16:20:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 16:20:38 -0700 (PDT) Received: from smtp108.sbc.mail.mud.yahoo.com (smtp108.sbc.mail.mud.yahoo.com [68.142.198.207]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l44NKVfB031820 for ; Fri, 4 May 2007 16:20:32 -0700 Received: (qmail 71668 invoked from network); 4 May 2007 23:20:30 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp108.sbc.mail.mud.yahoo.com with SMTP; 4 May 2007 23:20:29 -0000 X-YMail-OSG: OPP1hd4VM1lfaVvz3tObISaM4S9Wsbmdmu7ru90QC85M5NGiDwRjeqFhPzMWSgDFVI.VQ1CzkQ-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 7B9EE1827261; Fri, 4 May 2007 16:20:28 -0700 (PDT) Date: Fri, 4 May 2007 16:20:28 -0700 From: Chris Wedgwood To: Emmanuel Florac Cc: Eric Sandeen , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504232028.GA19744@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com>
<463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070504173049.14606033@harpe.intellique.com> X-archive-position: 11283 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Fri, May 04, 2007 at 05:30:49PM +0200, Emmanuel Florac wrote: > # CONFIG_4KSTACKS is not set > > So the problem lies elsewhere... CONFIG_4KSTACKS is badly named. It means you have 4K process + 4K interrupt stacks. Without this set you have just a single 8K stack for processes and interrupts. One argument for 4K+4K stacks is that 8K+0K isn't really safer in many cases --- it just appears that way because the problems are harder to hit. Almost three years ago I posted patches to split the CONFIG_4KSTACKS option into two options. I just ported that to 2.6.21 (very quickly; I might have goofed fixing up the rejects). If you have time, you could try this: enable CONFIG_I386_IRQSTACKS but leave CONFIG_I386_4KSTACKS disabled, and see if that helps... diff --git a/arch/i386/Kconfig.debug b/arch/i386/Kconfig.debug index 458bc16..f32fbec 100644 --- a/arch/i386/Kconfig.debug +++ b/arch/i386/Kconfig.debug @@ -56,15 +56,22 @@ config DEBUG_RODATA portion of the kernel code won't be covered by a 2MB TLB anymore. If in doubt, say "N". -config 4KSTACKS +config I386_4KSTACKS bool "Use 4Kb for kernel stacks instead of 8Kb" depends on DEBUG_KERNEL help If you say Y here the kernel will use a 4Kb stacksize for the kernel stack attached to each process/thread. This facilitates running more threads on a system and also reduces the pressure - on the VM subsystem for higher order allocations. This option - will also use IRQ stacks to compensate for the reduced stackspace. + on the VM subsystem for higher order allocations.
+ +config I386_IRQSTACKS + bool "Allocate separate IRQ stacks" + depends on DEBUG_KERNEL + default y + help + If you say Y here the kernel will allocate and use separate + stacks for interrupts. config X86_FIND_SMP_CONFIG bool diff --git a/arch/i386/defconfig b/arch/i386/defconfig diff --git a/arch/i386/kernel/irq.c b/arch/i386/kernel/irq.c index 8db8d51..f6224fd 100644 --- a/arch/i386/kernel/irq.c +++ b/arch/i386/kernel/irq.c @@ -47,7 +47,7 @@ void ack_bad_irq(unsigned int irq) #endif } -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS /* * per-CPU IRQ handling contexts (thread information and stack) */ @@ -58,7 +58,7 @@ union irq_ctx { static union irq_ctx *hardirq_ctx[NR_CPUS] __read_mostly; static union irq_ctx *softirq_ctx[NR_CPUS] __read_mostly; -#endif +#endif /* CONFIG_I386_IRQSTACKS */ /* * do_IRQ handles all normal device IRQ's (the special @@ -71,7 +71,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) /* high bit used in ret_from_ code */ int irq = ~regs->orig_eax; struct irq_desc *desc = irq_desc + irq; -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS union irq_ctx *curctx, *irqctx; u32 *isp; #endif @@ -99,7 +99,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) } #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS curctx = (union irq_ctx *) current_thread_info(); irqctx = hardirq_ctx[smp_processor_id()]; @@ -136,7 +136,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) : "memory", "cc" ); } else -#endif +#endif /* CONFIG_I386_IRQSTACKS */ desc->handle_irq(irq, desc); irq_exit(); @@ -144,7 +144,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) return 1; } -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS /* * These should really be __section__(".bss.page_aligned") as well, but @@ -234,7 +234,7 @@ asmlinkage void do_softirq(void) } EXPORT_SYMBOL(do_softirq); -#endif +#endif /* CONFIG_I386_IRQSTACKS */ /* * Interrupt statistics: diff --git a/include/asm-i386/irq.h b/include/asm-i386/irq.h index 11761cd..7db95e1 
100644 --- a/include/asm-i386/irq.h +++ b/include/asm-i386/irq.h @@ -24,14 +24,14 @@ static __inline__ int irq_canonicalize(int irq) # define ARCH_HAS_NMI_WATCHDOG /* See include/linux/nmi.h */ #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS extern void irq_ctx_init(int cpu); extern void irq_ctx_exit(int cpu); # define __ARCH_HAS_DO_SOFTIRQ -#else +#else /* !CONFIG_I386_IRQSTACKS */ # define irq_ctx_init(cpu) do { } while (0) # define irq_ctx_exit(cpu) do { } while (0) -#endif +#endif /* CONFIG_I386_IRQSTACKS */ #ifdef CONFIG_IRQBALANCE extern int irqbalance_disable(char *str); diff --git a/include/asm-i386/module.h b/include/asm-i386/module.h index 02f8f54..7d5d2df 100644 --- a/include/asm-i386/module.h +++ b/include/asm-i386/module.h @@ -62,11 +62,11 @@ struct mod_arch_specific #error unknown processor family #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_4KSTACKS #define MODULE_STACKSIZE "4KSTACKS " -#else +#else /* not using CONFIG_I386_4KSTACKS */ #define MODULE_STACKSIZE "" -#endif +#endif /* CONFIG_I386_4KSTACKS */ #define MODULE_ARCH_VERMAGIC MODULE_PROC_FAMILY MODULE_STACKSIZE diff --git a/include/asm-i386/thread_info.h b/include/asm-i386/thread_info.h index 4b187bb..f5268e0 100644 --- a/include/asm-i386/thread_info.h +++ b/include/asm-i386/thread_info.h @@ -53,7 +53,7 @@ struct thread_info { #endif #define PREEMPT_ACTIVE 0x10000000 -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_4KSTACKS #define THREAD_SIZE (4096) #else #define THREAD_SIZE (8192) From owner-xfs@oss.sgi.com Fri May 4 22:21:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 22:21:40 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l455LZfB010386 for ; Fri, 4 May 2007 22:21:36 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with 
ESMTP id 659901802EE36; Fri, 4 May 2007 23:49:31 -0500 (CDT) Message-ID: <463C0CD8.4090402@sandeen.net> Date: Fri, 04 May 2007 23:49:28 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Emmanuel Florac CC: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> In-Reply-To: <20070504234357.24d22883@galadriel.home> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-archive-position: 11284 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Emmanuel Florac wrote: > On Fri, 4 May 2007 17:58:21 +0200 you wrote: > >> Did you test with ext3 as well? You wrote that it crashes with ReiserFS >> (version 3) even faster. When it crashes with several filesystems, it's >> unlikely to be a filesystem issue. > > Unfortunately ext3 doesn't support volumes bigger than 8TB, so that's > useless to me. I plan to test jfs, however. > I think it's more of a dm/md issue, but I'm not sure... > The most recent kernels (2.6.19 or so, IIRC) with CVS e2fsprogs (or the version from RHEL5/CentOS5) can handle ext3 filesystems of up to 16T, so you should be able to test that if you like.
-Eric From owner-xfs@oss.sgi.com Sat May 5 08:20:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 08:20:34 -0700 (PDT) Received: from postfix1-g20.free.fr (postfix1-g20.free.fr [212.27.60.42]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45FKUfB010573 for ; Sat, 5 May 2007 08:20:31 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix1-g20.free.fr (Postfix) with ESMTP id 26851F15FD1 for ; Sat, 5 May 2007 17:20:30 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 19169182B0; Sat, 5 May 2007 17:20:27 +0200 (CEST) Date: Sat, 5 May 2007 17:19:31 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: Eric Sandeen , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505171931.6fe9b6f5@galadriel.home> In-Reply-To: <20070504232028.GA19744@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com>
<20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45FKVfB010597 X-archive-position: 11287 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Fri, 4 May 2007 16:20:28 -0700 vous criviez: > You could if you have time try this and enable CONFIG_I386_IRQSTACKS > but don't enable CONFIG_I386_4KSTACKS and see if that helps... That sounds very interesting, I'll give it a try monday. 
-- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 08:18:25 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 08:18:29 -0700 (PDT) Received: from postfix1-g20.free.fr (postfix1-g20.free.fr [212.27.60.42]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45FIOfB006549 for ; Sat, 5 May 2007 08:18:24 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix1-g20.free.fr (Postfix) with ESMTP id 2E267F15D45 for ; Sat, 5 May 2007 17:18:23 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id A524E17BFE; Sat, 5 May 2007 17:18:20 +0200 (CEST) Date: Sat, 5 May 2007 17:18:20 +0200 From: Emmanuel Florac To: Eric Sandeen Cc: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505171820.6e92d437@galadriel.home> In-Reply-To: <463C0CD8.4090402@sandeen.net> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45FIPfB006557 X-archive-position: 11286 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Fri, 04 May 2007 23:49:28 -0500 vous criviez: > > Most recent kernels (2.6.19 or so IIRC) & cvs e2fsprogs (or that from > rhel5/centos5) can do up to 16T ext3 filesystems, so you should be > able to 
test that if you like. Thanks, I'll try that too. Though it won't cover all my needs (I plan to set up 50 and 150TB systems really soon). -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 09:33:51 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 09:33:55 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45GXofB000741 for ; Sat, 5 May 2007 09:33:51 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 8860BB02F5B2; Sat, 5 May 2007 12:33:49 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 849C85000166; Sat, 5 May 2007 12:33:49 -0400 (EDT) Date: Sat, 5 May 2007 12:33:49 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: linux-raid@vger.kernel.org cc: xfs@oss.sgi.com Subject: Linux SW RAID: HW Raid Controller/JBOD vs. Multiple PCI-e Cards? Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-archive-position: 11288 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Question, I currently have a 965 chipset-based motherboard, use 4 port onboard and several PCI-e x1 controller cards for a raid 5 of 10 raptor drives. 
I get pretty decent speeds:

user@host$ time dd if=/dev/zero of=100gb bs=1M count=102400
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 247.134 seconds, 434 MB/s

real 4m7.164s
user 0m0.223s
sys 3m3.505s

user@host$ time dd if=100gb of=/dev/null bs=1M count=102400
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 172.588 seconds, 622 MB/s

real 2m52.631s
user 0m0.212s
sys 1m50.905s
user@host$

Also, when I run simultaneous dd's from all of the drives, I see 850-860MB/s. I am curious whether some limitation in software RAID explains why I am not getting better than 500MB/s for sequential write speed.

With 7 disks, I got about the same speed; adding 3 more for a total of 10 did not seem to help with writes. However, reads improved to 622MB/s from about 420-430MB/s.

However, if I want to upgrade to more than 12 disks, I am out of PCI-e slots, so I was wondering: does anyone on this list run a 16 port Areca or 3ware card and use it for JBOD? What kind of performance do you see when using mdadm with such a card? Or if anyone uses mdadm with a card that has fewer than 16 ports, I'd like to hear what kind of experience you have had with that type of configuration.

Justin.
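As a cross-check on the dd figures above, the reported rates can be recomputed from the byte count and the elapsed time. This is a minimal sketch, assuming dd reports decimal megabytes (10^6 bytes) per second; the function name is hypothetical.

```python
def throughput_mb_s(total_bytes, seconds):
    """Throughput in decimal megabytes per second (assumed dd unit)."""
    return total_bytes / seconds / 1e6


# bs=1M count=102400, as in the two dd runs above
TOTAL_BYTES = 102400 * 1024 * 1024

write_rate = throughput_mb_s(TOTAL_BYTES, 247.134)  # dd if=/dev/zero of=100gb
read_rate = throughput_mb_s(TOTAL_BYTES, 172.588)   # dd if=100gb of=/dev/null

if __name__ == "__main__":
    print(f"write: {write_rate:.1f} MB/s")
    print(f"read:  {read_rate:.1f} MB/s")
```

Rounded to whole numbers, both values agree with the 434 MB/s and 622 MB/s that dd printed.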
From owner-xfs@oss.sgi.com Sat May 5 09:48:01 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 09:48:03 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45GlxfB005748 for ; Sat, 5 May 2007 09:48:00 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id 93A4518022E01; Sat, 5 May 2007 11:47:58 -0500 (CDT) Message-ID: <463CB53E.8000202@sandeen.net> Date: Sat, 05 May 2007 11:47:58 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Emmanuel Florac CC: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> <20070505171820.6e92d437@galadriel.home> In-Reply-To: <20070505171820.6e92d437@galadriel.home> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-archive-position: 11289 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Emmanuel Florac wrote: > Le Fri, 04 May 2007 23:49:28 -0500 vous criviez: > >> Most recent kernels (2.6.19 or so IIRC) & cvs e2fsprogs (or that from >> rhel5/centos5) can do up to 16T ext3 filesystems, so you should be >> able to test that if you like. > > Thanks, I'll try that too. Though it won't cover all my needs (I plan > to set up 50 and 150TB systems really soon). > Sure, I understand - it may be helpful in figuring out what the problem is, though. I'll be curious to see how it goes... 
Oh, btw, you'll need the -F (force) flag for mkfs.ext3 -Eric From owner-xfs@oss.sgi.com Sat May 5 09:50:15 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 09:50:17 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45GoDfB006809 for ; Sat, 5 May 2007 09:50:15 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id C4EDC18022E01; Sat, 5 May 2007 11:50:12 -0500 (CDT) Message-ID: <463CB5C4.7040803@sandeen.net> Date: Sat, 05 May 2007 11:50:12 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Emmanuel Florac CC: Chris Wedgwood , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> In-Reply-To: <20070505171931.6fe9b6f5@galadriel.home> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-archive-position: 11290 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Emmanuel Florac wrote: > Le Fri, 4 May 2007 16:20:28 -0700 vous criviez: > >> You could if you have time try this and enable CONFIG_I386_IRQSTACKS >> but don't enable CONFIG_I386_4KSTACKS and see if that helps... > > That sounds very interesting, I'll give it a try monday. 
> There are also stack debugging config options; one that will warn if you are about to overflow (CONFIG_DEBUG_STACKOVERFLOW) and one that will print max stack depth in sysrq-t output (CONFIG_DEBUG_STACK_USAGE). -Eric From owner-xfs@oss.sgi.com Sat May 5 13:35:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 13:35:35 -0700 (PDT) Received: from smtp111.sbc.mail.mud.yahoo.com (smtp111.sbc.mail.mud.yahoo.com [68.142.198.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l45KZSfB010426 for ; Sat, 5 May 2007 13:35:29 -0700 Received: (qmail 92356 invoked from network); 5 May 2007 20:35:28 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp111.sbc.mail.mud.yahoo.com with SMTP; 5 May 2007 20:35:27 -0000 X-YMail-OSG: NfkFI3wVM1l2KAzcA7Gpvf5kMfsvZM8GGJA_DL2tvbfn03E9cLQ8rwaGzn2fNG.7uUhguDxBvQ-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 887111827261; Sat, 5 May 2007 13:35:25 -0700 (PDT) Date: Sat, 5 May 2007 13:35:25 -0700 From: Chris Wedgwood To: Eric Sandeen Cc: Emmanuel Florac , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505203525.GA16477@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> <463CB5C4.7040803@sandeen.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <463CB5C4.7040803@sandeen.net> X-archive-position: 11291 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Sat, May 05, 2007 at 11:50:12AM 
-0500, Eric Sandeen wrote: > There are also stack debugging config options; one that will warn if > you are about to overflow (CONFIG_DEBUG_STACKOVERFLOW) and one that > will print max stack depth in sysrq-t output > (CONFIG_DEBUG_STACK_USAGE). I was in such a hurry I don't think I tweaked that sanely. I'll go over the patch checking that and test it later today. Is there some preferred kernel version people would like? From owner-xfs@oss.sgi.com Sat May 5 13:55:00 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 13:55:02 -0700 (PDT) Received: from smtp104.sbc.mail.mud.yahoo.com (smtp104.sbc.mail.mud.yahoo.com [68.142.198.203]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l45KsxfB015700 for ; Sat, 5 May 2007 13:54:59 -0700 Received: (qmail 62153 invoked from network); 5 May 2007 20:54:58 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp104.sbc.mail.mud.yahoo.com with SMTP; 5 May 2007 20:54:58 -0000 X-YMail-OSG: m97Te.QVM1mbh8aFrqo95Qk4qrjAE4R81UJBHQJ1y14F1mB3VHCW427ig.b06hW2BI2KGF6gBQ-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id C851A1827261; Sat, 5 May 2007 13:54:56 -0700 (PDT) Date: Sat, 5 May 2007 13:54:56 -0700 From: Chris Wedgwood To: Justin Piszcz Cc: linux-raid@vger.kernel.org, xfs@oss.sgi.com Subject: Re: Linux SW RAID: HW Raid Controller/JBOD vs. Multiple PCI-e Cards? 
Message-ID: <20070505205456.GA17112@tuatara.stupidest.org> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-archive-position: 11292 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Sat, May 05, 2007 at 12:33:49PM -0400, Justin Piszcz wrote: > Also, when I run simultaenous dd's from all of the drives, I see > 850-860MB/s, I am curious if there is some kind of limitation with > software raid as to why I am not getting better than 500MB/s for > sequential write speed? What does "vmstat 1" output look like in both cases? My guess is that for large writes it's NOT CPU bound but it can't hurt to check. > With 7 disks, I got about the same speed, adding 3 more for a total > of 10 did not seem to help in regards to write. However, read > improved to 622MBs/ from about 420-430MB/s. RAID is quirky. It's worth fiddling with the stripe size as that can have a big difference in terms of performance --- it's far from clear why on some setups some values work well and other setups you want very different values. It would be good to know if anyone has ever studied stripe size and also controller interleave/layout issues to get a good understanding of why certain values are good and others are very poor and why it varies so much from one setup to the other. Also, 'dd performance' varies between the start of a disk and the end. Typically you get better performance at the start of the disk so dd might not be a very good benchmark here. > However, if I want to upgrade to more than 12 disks, I am out of > PCI-e slots, so I was wondering, does anyone on this list run a 16 > port Areca or 3ware card and use it for JBOD? What kind of > performance do you see when using mdadm with such a card? 
Or if > anyone uses mdadm with less than a 16 port card, I'd like to hear > what kind of experiences you have seen with that type of > configuration. I've used some 2, 4 and 8 port 3ware cards. As JBODS they worked fine, as RAID cards I had no end of problems. I'm happy to test larger cards if someone wants to donate them :-) From owner-xfs@oss.sgi.com Sat May 5 13:56:54 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 13:56:58 -0700 (PDT) Received: from postfix1-g20.free.fr (postfix1-g20.free.fr [212.27.60.42]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45KuqfB016458 for ; Sat, 5 May 2007 13:56:53 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix1-g20.free.fr (Postfix) with ESMTP id 5B167F2888F for ; Sat, 5 May 2007 22:56:48 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id B30BE15507; Sat, 5 May 2007 22:56:46 +0200 (CEST) Date: Sat, 5 May 2007 22:56:46 +0200 From: Emmanuel Florac To: Eric Sandeen Cc: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505225646.1e16b0c4@galadriel.home> In-Reply-To: <463CB53E.8000202@sandeen.net> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> <20070505171820.6e92d437@galadriel.home> <463CB53E.8000202@sandeen.net> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45KusfB016478 X-archive-position: 11293 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com 
X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Sat, 05 May 2007 11:47:58 -0500 vous criviez: > Sure, I understand - it may be helpful in figuring out what the > problem is, though. I'll be curious to see how it goes... Sure, stay tuned! > Oh, btw, you'll need the -F (force) flag for mkfs.ext3 Thanks! -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 13:57:28 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 13:57:31 -0700 (PDT) Received: from postfix2-g20.free.fr (postfix2-g20.free.fr [212.27.60.43]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45KvQfB016809 for ; Sat, 5 May 2007 13:57:28 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix2-g20.free.fr (Postfix) with ESMTP id 26B7EFBB29D for ; Sat, 5 May 2007 21:57:49 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 134EF17BA4; Sat, 5 May 2007 22:57:23 +0200 (CEST) Date: Sat, 5 May 2007 22:57:23 +0200 From: Emmanuel Florac To: xfs@oss.sgi.com Cc: Chris Wedgwood Subject: Re: XFS crash on linux raid Message-ID: <20070505225723.012cc38b@galadriel.home> In-Reply-To: <463CB5C4.7040803@sandeen.net> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> <463CB5C4.7040803@sandeen.net> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit 
X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45KvSfB016842 X-archive-position: 11294 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Sat, 05 May 2007 11:50:12 -0500 vous criviez: > There are also stack debugging config options; one that will warn if > you are about to overflow (CONFIG_DEBUG_STACKOVERFLOW) and one that > will print max stack depth in sysrq-t output > (CONFIG_DEBUG_STACK_USAGE). Fine, I'll try that. -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 14:00:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 14:00:25 -0700 (PDT) Received: from postfix1-g20.free.fr (postfix1-g20.free.fr [212.27.60.42]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45L0LfB018275 for ; Sat, 5 May 2007 14:00:22 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix1-g20.free.fr (Postfix) with ESMTP id 8F802F10C62 for ; Sat, 5 May 2007 22:58:21 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 9CFA618302; Sat, 5 May 2007 22:58:19 +0200 (CEST) Date: Sat, 5 May 2007 22:58:19 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: Eric Sandeen , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505225819.0dd3c0fa@galadriel.home> In-Reply-To: <20070505203525.GA16477@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> 
<20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> <463CB5C4.7040803@sandeen.net> <20070505203525.GA16477@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45L0MfB018282 X-archive-position: 11295 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Sat, 5 May 2007 13:35:25 -0700 vous criviez: > Is there some preferred kernel version people would like? > Well I prefer staying away from the very latest bleeding edge, so I stick to 2.6.20.11 for now. -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 14:18:48 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 14:18:50 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45LIkfB023937 for ; Sat, 5 May 2007 14:18:47 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 118A517BD2; Sat, 5 May 2007 23:18:45 +0200 (CEST) Date: Sat, 5 May 2007 23:18:45 +0200 From: Emmanuel Florac To: Justin Piszcz Cc: linux-raid@vger.kernel.org, xfs@oss.sgi.com Subject: Re: Linux SW RAID: HW Raid Controller/JBOD vs. Multiple PCI-e Cards? 
Message-ID: <20070505231845.7b1cbdc5@galadriel.home> In-Reply-To: References: Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45LImfB023945 X-archive-position: 11296 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Sat, 5 May 2007 12:33:49 -0400 (EDT) vous criviez: > However, if I want to upgrade to more than 12 disks, I am out of > PCI-e slots, so I was wondering, does anyone on this list run a 16 > port Areca or 3ware card and use it for JBOD? I don't use this setup in production, but I tried it with 8 ports 3Ware cards. I didn't try the latest 9650 though. > What kind of > performance do you see when using mdadm with such a card? 3Ghz Supermicro P4D 1 GB RAM, 3Ware 9550SX with 8x250GB 8MB cache 7200 RPM Seagate drives, raid 0 Tested XFS and reiserfs, with 64 and 256K stripes. tested under Linux 2.6.15.1, with bonnie++ in "fast mode" (-f option). use bon_csv2html to translate, or see bonnie++ documentation, roughly : 2G is the file size tested, then numbers on the first line are : write speed (KB/s), CPU usage (%), rewrite speed (overwrite), cpu usage, read speed, cpu usage. Then follow sequential and random seeks, reads, writes and delete with their cpu usage. "+++++" means "no significant value". 
# XFS, stripe 256k
storiq,2G,,,353088,69,76437,17,,,197376,16,410.8,0,16,11517,57,+++++,+++,10699,51,11502,59,+++++,+++,12158,61
storiq,2G,,,349166,71,75397,17,,,196057,16,433.3,0,16,12744,64,+++++,+++,12700,58,13008,67,+++++,+++,9890,51
storiq,2G,,,336683,68,72581,16,,,191254,18,419.9,0,16,12377,62,+++++,+++,10991,52,12947,67,+++++,+++,10580,52
storiq,2G,,,335646,65,77938,17,,,195350,17,397.4,0,16,14578,74,+++++,+++,11085,53,14377,74,+++++,+++,10852,54
storiq,2G,,,330022,67,73004,17,,,197846,18,412.3,0,16,12534,65,+++++,+++,10983,52,12161,63,+++++,+++,11752,61
storiq,2G,,,279454,55,75256,17,,,196065,18,412.7,0,16,13022,67,+++++,+++,10802,52,13759,72,+++++,+++,9800,47
storiq,2G,,,314606,61,74883,16,,,194131,16,401.2,0,16,11665,58,+++++,+++,10723,52,11880,61,+++++,+++,6659,33
storiq,2G,,,264382,53,72011,15,,,196690,18,411.5,0,16,10194,52,+++++,+++,12202,57,10367,52,+++++,+++,9175,45
storiq,2G,,,360252,72,75845,17,,,199721,18,432.7,0,16,12067,61,+++++,+++,11047,54,12156,62,+++++,+++,12372,60
storiq,2G,,,280746,57,74541,17,,,193562,19,414.0,0,16,12418,61,+++++,+++,11090,52,11135,57,+++++,+++,11309,55
storiq,2G,,,309464,61,79153,18,,,191533,17,419.5,0,16,12705,62,+++++,+++,11889,57,12027,61,+++++,+++,10960,54
storiq,2G,,,342122,67,68113,15,,,195572,16,413.5,0,16,13667,69,+++++,+++,10596,55,12731,66,+++++,+++,10766,54
storiq,2G,,,329945,63,72183,15,,,193082,18,421.8,0,16,12627,62,+++++,+++,9270,43,12455,63,+++++,+++,8878,44
storiq,2G,,,309570,63,69628,16,,,192415,19,413.1,0,16,13568,69,+++++,+++,10104,48,13512,70,+++++,+++,9261,45
storiq,2G,,,298528,58,70029,15,,,193531,17,399.5,0,16,13028,64,+++++,+++,9990,47,10098,52,+++++,+++,7544,38
storiq,2G,,,260341,52,66979,15,,,197199,18,393.1,0,16,10633,53,+++++,+++,9189,43,11159,56,+++++,+++,11696,58
# XFS, stripe 64k
storiq,2G,,,351241,70,90868,22,,,305222,29,408.7,0,16,8593,43,+++++,+++,6639,31,7555,39,+++++,+++,6639,33
storiq,2G,,,340145,67,83790,19,,,297148,28,401.4,0,16,9132,46,+++++,+++,6790,34,8881,45,+++++,+++,6305,31
storiq,2G,,,325791,65,81314,19,,,282439,26,395.5,0,16,9095,44,+++++,+++,6255,29,8173,42,+++++,+++,6194,31
storiq,2G,,,266009,53,83362,20,,,308438,26,407.7,0,16,8362,43,+++++,+++,6443,30,9264,47,+++++,+++,6339,33
storiq,2G,,,322776,65,76466,17,,,288001,26,399.7,0,16,8038,41,+++++,+++,5387,26,6389,34,+++++,+++,6545,31
storiq,2G,,,309007,60,77846,18,,,290613,29,392.8,0,16,7183,37,+++++,+++,6492,30,8270,41,+++++,+++,6813,35
storiq,2G,,,287662,58,72920,17,,,287911,26,398.4,0,16,8893,44,+++++,+++,7777,36,8150,41,+++++,+++,7717,39
storiq,2G,,,288149,56,75743,17,,,300949,29,386.2,0,16,9545,47,+++++,+++,7572,35,9115,46,+++++,+++,7211,36
# reiser, stripe 256k
storiq,2G,,,289179,98,102775,26,,,188307,22,444.0,0,16,27326,100,+++++,+++,21887,99,26726,99,+++++,+++,20633,98
storiq,2G,,,275847,93,101970,25,,,190551,21,450.2,0,16,27397,100,+++++,+++,21926,100,26609,100,+++++,+++,20895,99
storiq,2G,,,289414,99,105080,26,,,189022,22,423.9,0,16,27212,100,+++++,+++,21757,100,26651,99,+++++,+++,20863,100
storiq,2G,,,292746,99,103681,25,,,186303,21,431.5,0,16,27375,100,+++++,+++,21989,99,26251,99,+++++,+++,20924,99
storiq,2G,,,290222,99,104135,26,,,189656,22,449.7,0,16,27453,99,+++++,+++,21849,100,26757,99,+++++,+++,20845,99
storiq,2G,,,291716,99,103872,26,,,187410,23,437.0,0,16,27419,99,+++++,+++,22119,99,26516,100,+++++,+++,20934,100
storiq,2G,,,285545,99,101637,25,,,189788,21,422.1,0,16,27224,99,+++++,+++,21742,99,26500,99,+++++,+++,20922,100
storiq,2G,,,293042,98,100272,24,,,185631,22,453.8,0,16,27268,99,+++++,+++,21944,100,26777,100,+++++,+++,21042,99
# reiser stripe 64k
storiq,2G,,,295569,99,112563,29,,,282178,32,434.5,0,16,27631,99,+++++,+++,22015,99,27021,100,+++++,+++,21028,99
storiq,2G,,,287830,98,112449,29,,,271047,33,425.1,0,16,27447,99,+++++,+++,21973,99,26810,99,+++++,+++,21008,100
storiq,2G,,,271668,95,114410,30,,,282419,33,438.7,0,16,27495,100,+++++,+++,22158,100,26707,100,+++++,+++,21106,100
storiq,2G,,,282535,99,118620,30,,,272089,33,425.0,0,16,27569,100,+++++,+++,22021,100,26778,100,+++++,+++,20629,98
storiq,2G,,,294392,98,119654,32,,,273269,32,429.7,0,16,27591,100,+++++,+++,21984,99,26786,100,+++++,+++,20994,99
storiq,2G,,,296652,99,118420,31,,,279586,33,425.5,0,16,15007,78,+++++,+++,21889,99,26998,99,+++++,+++,20952,100
storiq,2G,,,290551,98,124374,32,,,273852,32,424.0,0,16,27534,99,+++++,+++,21974,99,26746,100,+++++,+++,20786,99
storiq,2G,,,287033,99,100559,26,,,204845,24,390.9,0,16,27620,99,+++++,+++,21996,99,26811,100,+++++,+++,21009,100

Here are the tests I did with a similar system, but with 500GB drives, XFS only, 64KB stripe (3ware default). I tested software RAID 5 against hardware RAID 5 (3Ware 9550).

# software raid 5
storiq-5U,2G,,,155913,22,23390,4,,,84327,9,531.5,0,16,1323,3,+++++,+++,634,1,657,2,+++++,+++,903,3
storiq-5U,2G,,,168104,24,23964,4,,,81666,8,534.2,0,16,605,2,+++++,+++,608,2,770,2,+++++,+++,706,1
storiq-5U,2G,,,149516,21,22612,4,,,82111,9,571.3,0,16,606,2,+++++,+++,590,2,729,2,+++++,+++,450,1
storiq-5U,2G,,,141883,20,22966,4,,,78116,8,568.5,0,16,615,2,+++++,+++,553,2,684,2,+++++,+++,508,2
# hardware raid 5
storiq-1,2G,,,148500,29,43043,9,,,148808,14,442.3,0,16,5953,27,+++++,+++,4408,20,4994,24,+++++,+++,2399,11
storiq-1,2G,,,191440,37,38092,8,,,155494,15,420.9,0,16,3074,15,+++++,+++,3356,17,4246,21,+++++,+++,2513,12
storiq-1,2G,,,150460,29,40018,9,,,144936,14,386.9,0,16,4206,20,+++++,+++,2497,11,5182,26,+++++,+++,2440,11
storiq-1,2G,,,163132,34,34525,8,,,132131,13,369.7,0,16,6796,33,+++++,+++,10002,47,5475,28,+++++,+++,3652,17

As you can see, hardware RAID-5 doesn't perform significantly faster at writing, but read throughput and rewrite performance are far better, and seeks are an order of magnitude faster. That's why I use striped 3Ware hardware RAID-5 to build high capacity systems instead of software RAID 5.
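For anyone who wants to compare the software and hardware RAID 5 runs numerically, the rows above can be averaged with a few lines of Python. This is only a sketch: the column positions (4 = block write, 6 = rewrite, 10 = block read, all in KB/s) are an assumption inferred from the field description earlier in this message and from counting the columns, and the function and constant names are hypothetical.

```python
import csv
import io

# Assumed column positions (not confirmed against bonnie++ docs).
WRITE_KBS, REWRITE_KBS, READ_KBS = 4, 6, 10

SOFTWARE_RAID5 = """\
storiq-5U,2G,,,155913,22,23390,4,,,84327,9,531.5,0,16,1323,3,+++++,+++,634,1,657,2,+++++,+++,903,3
storiq-5U,2G,,,168104,24,23964,4,,,81666,8,534.2,0,16,605,2,+++++,+++,608,2,770,2,+++++,+++,706,1
storiq-5U,2G,,,149516,21,22612,4,,,82111,9,571.3,0,16,606,2,+++++,+++,590,2,729,2,+++++,+++,450,1
storiq-5U,2G,,,141883,20,22966,4,,,78116,8,568.5,0,16,615,2,+++++,+++,553,2,684,2,+++++,+++,508,2
"""

HARDWARE_RAID5 = """\
storiq-1,2G,,,148500,29,43043,9,,,148808,14,442.3,0,16,5953,27,+++++,+++,4408,20,4994,24,+++++,+++,2399,11
storiq-1,2G,,,191440,37,38092,8,,,155494,15,420.9,0,16,3074,15,+++++,+++,3356,17,4246,21,+++++,+++,2513,12
storiq-1,2G,,,150460,29,40018,9,,,144936,14,386.9,0,16,4206,20,+++++,+++,2497,11,5182,26,+++++,+++,2440,11
storiq-1,2G,,,163132,34,34525,8,,,132131,13,369.7,0,16,6796,33,+++++,+++,10002,47,5475,28,+++++,+++,3652,17
"""


def mean_throughput(csv_text):
    """Return the mean (write, rewrite, read) in KB/s over all data rows."""
    rows = [
        row
        for row in csv.reader(io.StringIO(csv_text))
        if row and not row[0].startswith("#")  # skip blanks and "# ..." labels
    ]

    def mean(index):
        return sum(int(row[index]) for row in rows) / len(rows)

    return mean(WRITE_KBS), mean(REWRITE_KBS), mean(READ_KBS)


if __name__ == "__main__":
    for label, text in (("software", SOFTWARE_RAID5), ("hardware", HARDWARE_RAID5)):
        write, rewrite, read = mean_throughput(text)
        print(f"{label}: write {write:.0f}, rewrite {rewrite:.0f}, read {read:.0f} KB/s")
```

With these four runs each, the averages put the two setups within a few percent on block writes, while the hardware configuration is well ahead on rewrite and read.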
-- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 14:23:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 14:23:05 -0700 (PDT) Received: from smtp107.sbc.mail.mud.yahoo.com (smtp107.sbc.mail.mud.yahoo.com [68.142.198.206]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l45LN2fB026375 for ; Sat, 5 May 2007 14:23:03 -0700 Received: (qmail 55252 invoked from network); 5 May 2007 20:56:18 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp107.sbc.mail.mud.yahoo.com with SMTP; 5 May 2007 20:56:18 -0000 X-YMail-OSG: B34Ic84VM1nv8UOP1HdKHiretfaGEYgoETcAHAVgODfGX.6akLlrLcUL7d6IS85Oo1moL3QNRw-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 624B21827261; Sat, 5 May 2007 13:56:17 -0700 (PDT) Date: Sat, 5 May 2007 13:56:17 -0700 From: Chris Wedgwood To: Emmanuel Florac Cc: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505205617.GB17112@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070504234357.24d22883@galadriel.home> X-archive-position: 11297 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Fri, May 04, 2007 at 11:43:57PM +0200, Emmanuel Florac wrote: > Unfortunately ext3 doesn't support volumes bigger than 8TB, so > that's useless to me. I plan to test jfs, however. Is jfs supported by anyone right now? 
From owner-xfs@oss.sgi.com Sat May 5 14:32:50 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 14:32:53 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45LWmfB030926 for ; Sat, 5 May 2007 14:32:49 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 5F5C4B02F5B2; Sat, 5 May 2007 17:32:47 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 5887A5000177; Sat, 5 May 2007 17:32:47 -0400 (EDT) Date: Sat, 5 May 2007 17:32:47 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Emmanuel Florac cc: linux-raid@vger.kernel.org, xfs@oss.sgi.com Subject: Re: Linux SW RAID: HW Raid Controller/JBOD vs. Multiple PCI-e Cards? In-Reply-To: <20070505231845.7b1cbdc5@galadriel.home> Message-ID: References: <20070505231845.7b1cbdc5@galadriel.home> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-1478584756-1178400767=:18820" X-archive-position: 11298 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463747160-1478584756-1178400767=:18820 Content-Type: TEXT/PLAIN; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Sat, 5 May 2007, Emmanuel Florac wrote: > Le Sat, 5 May 2007 12:33:49 -0400 (EDT) vous =E9criviez: > >> However, if I want to upgrade to more than 12 disks, I am out of >> PCI-e slots, so I was wondering, does anyone on this list run a 16 >> port Areca or 3ware card and use it for JBOD? > > I don't use this setup in production, but I tried it with 8 ports 3Ware > cards. > I didn't try the latest 9650 though. 
> >> What kind of >> performance do you see when using mdadm with such a card? > > 3Ghz Supermicro P4D 1 GB RAM, 3Ware 9550SX with 8x250GB 8MB cache 7200 > RPM Seagate drives, raid 0 > > Tested XFS and reiserfs, with 64 and 256K stripes. > > tested under Linux 2.6.15.1, with bonnie++ in "fast mode" (-f option). > use bon_csv2html to translate, or see bonnie++ documentation, roughly : > 2G is the file size tested, then numbers on the first line are : write > speed (KB/s), CPU usage (%), rewrite speed (overwrite), cpu usage, read > speed, cpu usage. Then follow sequential and random seeks, reads, > writes and delete with their cpu usage. "+++++" means "no significant > value". > > # XFS, stripe 256k > storiq,2G,,,353088,69,76437,17,,,197376,16,410.8,0,16,11517,57,+++++,+++,= 10699,51,11502,59,+++++,+++,12158,61 > storiq,2G,,,349166,71,75397,17,,,196057,16,433.3,0,16,12744,64,+++++,+++,= 12700,58,13008,67,+++++,+++,9890,51 > storiq,2G,,,336683,68,72581,16,,,191254,18,419.9,0,16,12377,62,+++++,+++,= 10991,52,12947,67,+++++,+++,10580,52 > storiq,2G,,,335646,65,77938,17,,,195350,17,397.4,0,16,14578,74,+++++,+++,= 11085,53,14377,74,+++++,+++,10852,54 > storiq,2G,,,330022,67,73004,17,,,197846,18,412.3,0,16,12534,65,+++++,+++,= 10983,52,12161,63,+++++,+++,11752,61 > storiq,2G,,,279454,55,75256,17,,,196065,18,412.7,0,16,13022,67,+++++,+++,= 10802,52,13759,72,+++++,+++,9800,47 > storiq,2G,,,314606,61,74883,16,,,194131,16,401.2,0,16,11665,58,+++++,+++,= 10723,52,11880,61,+++++,+++,6659,33 > storiq,2G,,,264382,53,72011,15,,,196690,18,411.5,0,16,10194,52,+++++,+++,= 12202,57,10367,52,+++++,+++,9175,45 > storiq,2G,,,360252,72,75845,17,,,199721,18,432.7,0,16,12067,61,+++++,+++,= 11047,54,12156,62,+++++,+++,12372,60 > storiq,2G,,,280746,57,74541,17,,,193562,19,414.0,0,16,12418,61,+++++,+++,= 11090,52,11135,57,+++++,+++,11309,55 > storiq,2G,,,309464,61,79153,18,,,191533,17,419.5,0,16,12705,62,+++++,+++,= 11889,57,12027,61,+++++,+++,10960,54 > 
storiq,2G,,,342122,67,68113,15,,,195572,16,413.5,0,16,13667,69,+++++,+++,= 10596,55,12731,66,+++++,+++,10766,54 > storiq,2G,,,329945,63,72183,15,,,193082,18,421.8,0,16,12627,62,+++++,+++,= 9270,43,12455,63,+++++,+++,8878,44 > storiq,2G,,,309570,63,69628,16,,,192415,19,413.1,0,16,13568,69,+++++,+++,= 10104,48,13512,70,+++++,+++,9261,45 > storiq,2G,,,298528,58,70029,15,,,193531,17,399.5,0,16,13028,64,+++++,+++,= 9990,47,10098,52,+++++,+++,7544,38 > storiq,2G,,,260341,52,66979,15,,,197199,18,393.1,0,16,10633,53,+++++,+++,= 9189,43,11159,56,+++++,+++,11696,58 > # XFS, stripe 64k > storiq,2G,,,351241,70,90868,22,,,305222,29,408.7,0,16,8593,43,+++++,+++,6= 639,31,7555,39,+++++,+++,6639,33 > storiq,2G,,,340145,67,83790,19,,,297148,28,401.4,0,16,9132,46,+++++,+++,6= 790,34,8881,45,+++++,+++,6305,31 > storiq,2G,,,325791,65,81314,19,,,282439,26,395.5,0,16,9095,44,+++++,+++,6= 255,29,8173,42,+++++,+++,6194,31 > storiq,2G,,,266009,53,83362,20,,,308438,26,407.7,0,16,8362,43,+++++,+++,6= 443,30,9264,47,+++++,+++,6339,33 > storiq,2G,,,322776,65,76466,17,,,288001,26,399.7,0,16,8038,41,+++++,+++,5= 387,26,6389,34,+++++,+++,6545,31 > storiq,2G,,,309007,60,77846,18,,,290613,29,392.8,0,16,7183,37,+++++,+++,6= 492,30,8270,41,+++++,+++,6813,35 > storiq,2G,,,287662,58,72920,17,,,287911,26,398.4,0,16,8893,44,+++++,+++,7= 777,36,8150,41,+++++,+++,7717,39 > storiq,2G,,,288149,56,75743,17,,,300949,29,386.2,0,16,9545,47,+++++,+++,7= 572,35,9115,46,+++++,+++,7211,36 > # reiser, stripe 256k > storiq,2G,,,289179,98,102775,26,,,188307,22,444.0,0,16,27326,100,+++++,++= +,21887,99,26726,99,+++++,+++,20633,98 > storiq,2G,,,275847,93,101970,25,,,190551,21,450.2,0,16,27397,100,+++++,++= +,21926,100,26609,100,+++++,+++,20895,99 > storiq,2G,,,289414,99,105080,26,,,189022,22,423.9,0,16,27212,100,+++++,++= +,21757,100,26651,99,+++++,+++,20863,100 > storiq,2G,,,292746,99,103681,25,,,186303,21,431.5,0,16,27375,100,+++++,++= +,21989,99,26251,99,+++++,+++,20924,99 > 
storiq,2G,,,290222,99,104135,26,,,189656,22,449.7,0,16,27453,99,+++++,+++= ,21849,100,26757,99,+++++,+++,20845,99 > storiq,2G,,,291716,99,103872,26,,,187410,23,437.0,0,16,27419,99,+++++,+++= ,22119,99,26516,100,+++++,+++,20934,100 > storiq,2G,,,285545,99,101637,25,,,189788,21,422.1,0,16,27224,99,+++++,+++= ,21742,99,26500,99,+++++,+++,20922,100 > storiq,2G,,,293042,98,100272,24,,,185631,22,453.8,0,16,27268,99,+++++,+++= ,21944,100,26777,100,+++++,+++,21042,99 > # reiser stripe 64k > storiq,2G,,,295569,99,112563,29,,,282178,32,434.5,0,16,27631,99,+++++,+++= ,22015,99,27021,100,+++++,+++,21028,99 > storiq,2G,,,287830,98,112449,29,,,271047,33,425.1,0,16,27447,99,+++++,+++= ,21973,99,26810,99,+++++,+++,21008,100 > storiq,2G,,,271668,95,114410,30,,,282419,33,438.7,0,16,27495,100,+++++,++= +,22158,100,26707,100,+++++,+++,21106,100 > storiq,2G,,,282535,99,118620,30,,,272089,33,425.0,0,16,27569,100,+++++,++= +,22021,100,26778,100,+++++,+++,20629,98 > storiq,2G,,,294392,98,119654,32,,,273269,32,429.7,0,16,27591,100,+++++,++= +,21984,99,26786,100,+++++,+++,20994,99 > storiq,2G,,,296652,99,118420,31,,,279586,33,425.5,0,16,15007,78,+++++,+++= ,21889,99,26998,99,+++++,+++,20952,100 > storiq,2G,,,290551,98,124374,32,,,273852,32,424.0,0,16,27534,99,+++++,+++= ,21974,99,26746,100,+++++,+++,20786,99 > storiq,2G,,,287033,99,100559,26,,,204845,24,390.9,0,16,27620,99,+++++,+++= ,21996,99,26811,100,+++++,+++,21009,100 > > Here are the tests I did with a similar system, but with 500GB drives, > XFS only, 64KB stripe (3ware default).I tested RAID 5 software RAID > compared to RAID-5 hardware (3Ware 9550). 
> > # software raid 5 > storiq-5U,2G,,,155913,22,23390,4,,,84327,9,531.5,0,16,1323,3,+++++,+++,63= 4,1,657,2,+++++,+++,903,3 > storiq-5U,2G,,,168104,24,23964,4,,,81666,8,534.2,0,16,605,2,+++++,+++,608= ,2,770,2,+++++,+++,706,1 > storiq-5U,2G,,,149516,21,22612,4,,,82111,9,571.3,0,16,606,2,+++++,+++,590= ,2,729,2,+++++,+++,450,1 > storiq-5U,2G,,,141883,20,22966,4,,,78116,8,568.5,0,16,615,2,+++++,+++,553= ,2,684,2,+++++,+++,508,2 > # hardware raid 5 > storiq-1,2G,,,148500,29,43043,9,,,148808,14,442.3,0,16,5953,27,+++++,+++,= 4408,20,4994,24,+++++,+++,2399,11 > storiq-1,2G,,,191440,37,38092,8,,,155494,15,420.9,0,16,3074,15,+++++,+++,= 3356,17,4246,21,+++++,+++,2513,12 > storiq-1,2G,,,150460,29,40018,9,,,144936,14,386.9,0,16,4206,20,+++++,+++,= 2497,11,5182,26,+++++,+++,2440,11 > storiq-1,2G,,,163132,34,34525,8,,,132131,13,369.7,0,16,6796,33,+++++,+++,= 10002,47,5475,28,+++++,+++,3652,17 > > As you can see, hardware RAID-5 doesn't perform significantly faster > at writing, but read thruput and rewrite performance is way better, and > seeks are an order of magnitude faster. That's why I use striped 3Ware > hardware RAID-5 to build high capacity systems instead of software RAID > 5. > > --=20 > -------------------------------------------------- > Emmanuel Florac www.intellique.com > -------------------------------------------------- > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Wow, very impressive benchmarks, thank you very much for this. 
Justin.= ---1463747160-1478584756-1178400767=:18820-- From owner-xfs@oss.sgi.com Sat May 5 15:12:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 15:13:00 -0700 (PDT) Received: from smtp113.sbc.mail.mud.yahoo.com (smtp113.sbc.mail.mud.yahoo.com [68.142.198.212]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l45MCrfB009680 for ; Sat, 5 May 2007 15:12:54 -0700 Received: (qmail 80976 invoked from network); 5 May 2007 22:12:52 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp113.sbc.mail.mud.yahoo.com with SMTP; 5 May 2007 22:12:52 -0000 X-YMail-OSG: kYmy1WoVM1lLSqHi8kH_YIg9mAqfQe1Fv.gpSI1oIdhpFszmQ05A3stLHe0TtQ_9tudApt0ekA-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 2A6251827261; Sat, 5 May 2007 15:12:50 -0700 (PDT) Date: Sat, 5 May 2007 15:12:50 -0700 From: Chris Wedgwood To: Emmanuel Florac Cc: Eric Sandeen , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505221249.GA21960@tuatara.stupidest.org> References: <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> <463CB5C4.7040803@sandeen.net> <20070505203525.GA16477@tuatara.stupidest.org> <20070505225819.0dd3c0fa@galadriel.home> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070505225819.0dd3c0fa@galadriel.home> X-archive-position: 11299 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Sat, May 05, 2007 at 10:58:19PM +0200, Emmanuel Florac wrote: > Well I prefer staying away from the very latest bleeding edge, so I > stick to 2.6.20.11 for now. 
diff --git a/arch/i386/Kconfig.debug b/arch/i386/Kconfig.debug index f68cc6f..908b755 100644 --- a/arch/i386/Kconfig.debug +++ b/arch/i386/Kconfig.debug @@ -56,15 +56,22 @@ config DEBUG_RODATA portion of the kernel code won't be covered by a 2MB TLB anymore. If in doubt, say "N". -config 4KSTACKS +config I386_4KSTACKS bool "Use 4Kb for kernel stacks instead of 8Kb" depends on DEBUG_KERNEL help If you say Y here the kernel will use a 4Kb stacksize for the kernel stack attached to each process/thread. This facilitates running more threads on a system and also reduces the pressure - on the VM subsystem for higher order allocations. This option - will also use IRQ stacks to compensate for the reduced stackspace. + on the VM subsystem for higher order allocations. + +config I386_IRQSTACKS + bool "Allocate separate IRQ stacks" + depends on DEBUG_KERNEL + default y + help + If you say Y here the kernel will allocate and use separate + stacks for interrupts. config X86_FIND_SMP_CONFIG bool diff --git a/arch/i386/kernel/irq.c b/arch/i386/kernel/irq.c index 3201d42..0da8251 100644 --- a/arch/i386/kernel/irq.c +++ b/arch/i386/kernel/irq.c @@ -33,7 +33,7 @@ void ack_bad_irq(unsigned int irq) } #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS /* * per-CPU IRQ handling contexts (thread information and stack) */ @@ -44,7 +44,7 @@ union irq_ctx { static union irq_ctx *hardirq_ctx[NR_CPUS] __read_mostly; static union irq_ctx *softirq_ctx[NR_CPUS] __read_mostly; -#endif +#endif /* CONFIG_I386_IRQSTACKS */ /* * do_IRQ handles all normal device IRQ's (the special @@ -57,7 +57,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) /* high bit used in ret_from_ code */ int irq = ~regs->orig_eax; struct irq_desc *desc = irq_desc + irq; -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS union irq_ctx *curctx, *irqctx; u32 *isp; #endif @@ -85,7 +85,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) } #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS curctx = 
(union irq_ctx *) current_thread_info(); irqctx = hardirq_ctx[smp_processor_id()]; @@ -122,7 +122,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) : "memory", "cc" ); } else -#endif +#endif /* CONFIG_I386_IRQSTACKS */ desc->handle_irq(irq, desc); irq_exit(); @@ -130,7 +130,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) return 1; } -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS /* * These should really be __section__(".bss.page_aligned") as well, but @@ -220,7 +220,7 @@ asmlinkage void do_softirq(void) } EXPORT_SYMBOL(do_softirq); -#endif +#endif /* CONFIG_I386_IRQSTACKS */ /* * Interrupt statistics: diff --git a/include/asm-i386/irq.h b/include/asm-i386/irq.h index 11761cd..7db95e1 100644 --- a/include/asm-i386/irq.h +++ b/include/asm-i386/irq.h @@ -24,14 +24,14 @@ static __inline__ int irq_canonicalize(int irq) # define ARCH_HAS_NMI_WATCHDOG /* See include/linux/nmi.h */ #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS extern void irq_ctx_init(int cpu); extern void irq_ctx_exit(int cpu); # define __ARCH_HAS_DO_SOFTIRQ -#else +#else /* !CONFIG_I386_IRQSTACKS */ # define irq_ctx_init(cpu) do { } while (0) # define irq_ctx_exit(cpu) do { } while (0) -#endif +#endif /* CONFIG_I386_IRQSTACKS */ #ifdef CONFIG_IRQBALANCE extern int irqbalance_disable(char *str); diff --git a/include/asm-i386/module.h b/include/asm-i386/module.h index 02f8f54..7d5d2df 100644 --- a/include/asm-i386/module.h +++ b/include/asm-i386/module.h @@ -62,11 +62,11 @@ struct mod_arch_specific #error unknown processor family #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_4KSTACKS #define MODULE_STACKSIZE "4KSTACKS " -#else +#else /* not using CONFIG_I386_4KSTACKS */ #define MODULE_STACKSIZE "" -#endif +#endif /* CONFIG_I386_4KSTACKS */ #define MODULE_ARCH_VERMAGIC MODULE_PROC_FAMILY MODULE_STACKSIZE diff --git a/include/asm-i386/thread_info.h b/include/asm-i386/thread_info.h index 4b187bb..f5268e0 100644 --- a/include/asm-i386/thread_info.h +++ 
b/include/asm-i386/thread_info.h @@ -53,7 +53,7 @@ struct thread_info { #endif #define PREEMPT_ACTIVE 0x10000000 -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_4KSTACKS #define THREAD_SIZE (4096) #else #define THREAD_SIZE (8192) From owner-xfs@oss.sgi.com Sun May 6 10:21:02 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 10:21:05 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l46HL1fB028324 for ; Sun, 6 May 2007 10:21:02 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id CCD9D18762; Sun, 6 May 2007 19:21:00 +0200 (CEST) Date: Sun, 6 May 2007 19:21:04 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070506192104.3becdd81@galadriel.home> In-Reply-To: <20070505210002.GC17112@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> <20070505171820.6e92d437@galadriel.home> <20070505210002.GC17112@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l46HL2fB028330 X-archive-position: 11301 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Sat, 5 May 2007 14:00:02 -0700 you wrote: > A 50TB filesystem might suck horribly on a 32-bit platform. I'm not > sure there is *ANY* way you could fsck that should you need to in some > cases.
> > Is that what you're planning to do? Nope, I'll use an x86_64 system running an x86_64 kernel :) -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sun May 6 10:21:44 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 10:21:46 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l46HLhfB028657 for ; Sun, 6 May 2007 10:21:44 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id CC0CF18718; Sun, 6 May 2007 19:21:42 +0200 (CEST) Date: Sun, 6 May 2007 19:21:46 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: Eric Sandeen , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070506192146.7f03cd4e@galadriel.home> In-Reply-To: <20070505221249.GA21960@tuatara.stupidest.org> References: <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> <463CB5C4.7040803@sandeen.net> <20070505203525.GA16477@tuatara.stupidest.org> <20070505225819.0dd3c0fa@galadriel.home> <20070505221249.GA21960@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l46HLifB028674 X-archive-position: 11302 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Sat, 5 May 2007 15:12:50 -0700 vous criviez: > diff --git 
a/arch/i386/Kconfig.debug b/arch/i386/Kconfig.debug Thanks! -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sun May 6 10:19:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 10:19:53 -0700 (PDT) Received: from postfix1-g20.free.fr (postfix1-g20.free.fr [212.27.60.42]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l46HJmfB027935 for ; Sun, 6 May 2007 10:19:49 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix1-g20.free.fr (Postfix) with ESMTP id 582C9F34029 for ; Sun, 6 May 2007 19:19:47 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id E2A6818206; Sun, 6 May 2007 19:19:43 +0200 (CEST) Date: Sun, 6 May 2007 19:19:47 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070506191947.75a2058a@galadriel.home> In-Reply-To: <20070505205617.GB17112@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <20070505205617.GB17112@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l46HJofB027943 X-archive-position: 11300 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Sat, 5 May 2007 13:56:17 -0700 you wrote: > Is jfs supported by anyone right now?
Huh, IBM I hope :) -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sun May 6 10:26:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 10:26:14 -0700 (PDT) Received: from smtp114.sbc.mail.mud.yahoo.com (smtp114.sbc.mail.mud.yahoo.com [68.142.198.213]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l46HQ8fB030347 for ; Sun, 6 May 2007 10:26:09 -0700 Received: (qmail 92936 invoked from network); 6 May 2007 17:26:08 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp114.sbc.mail.mud.yahoo.com with SMTP; 6 May 2007 17:26:08 -0000 X-YMail-OSG: JpyOiz0VM1mldeiku.Hr8o32aTLyos4dDOQSFemrA1zdTVKzh2MZehYlzOHEUP1wl41_FWvOGg-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id DA4271827261; Sun, 6 May 2007 10:26:06 -0700 (PDT) Date: Sun, 6 May 2007 10:26:06 -0700 From: Chris Wedgwood To: Emmanuel Florac Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070506172606.GB4823@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> <20070505171820.6e92d437@galadriel.home> <20070505210002.GC17112@tuatara.stupidest.org> <20070506192104.3becdd81@galadriel.home> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070506192104.3becdd81@galadriel.home> X-archive-position: 11303 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Sun, May 06, 2007 at 07:21:04PM +0200, Emmanuel Florac wrote: > Nope, I'll use an x86_64 system running an x86_64 kernel :) How much RAM? 
I think you'll want 10s of GBs possibly (well, it depends very much on what you're storing but you can fit a lot of small files in 150TB...) From owner-xfs@oss.sgi.com Sun May 6 10:56:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 10:56:08 -0700 (PDT) Received: from mail.lichtvoll.de (mondschein.lichtvoll.de [194.150.191.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l46Hu4fB005398 for ; Sun, 6 May 2007 10:56:05 -0700 Received: from localhost (dslb-084-057-122-104.pools.arcor-ip.net [84.57.122.104]) by mail.lichtvoll.de (Postfix) with ESMTP id 3A3FF5AD40 for ; Sun, 6 May 2007 19:56:03 +0200 (CEST) From: Martin Steigerwald To: linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Date: Sun, 6 May 2007 19:56:02 +0200 User-Agent: KMail/1.9.6 References: <20070503164521.16efe075@harpe.intellique.com> <20070504234357.24d22883@galadriel.home> <20070505205617.GB17112@tuatara.stupidest.org> (sfid-20070506_174955_742323_AFBCDD13) In-Reply-To: <20070505205617.GB17112@tuatara.stupidest.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705061956.02375.Martin@lichtvoll.de> X-archive-position: 11304 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: Martin@lichtvoll.de Precedence: bulk X-list: xfs On Saturday 05 May 2007, Chris Wedgwood wrote: > On Fri, May 04, 2007 at 11:43:57PM +0200, Emmanuel Florac wrote: > > Unfortunately ext3 doesn't support volumes bigger than 8TB, so > > that's useless to me. I plan to test jfs, however. > > Is jfs supported by anyone right now? David 'Dave' Kleikamp was still taking care of JFS when I asked him some questions about write barrier support back in July 2007. He concentrated on bug fixes though, not on new features.
Regards, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 From owner-xfs@oss.sgi.com Sun May 6 11:37:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 11:37:16 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l46IbCfB014325 for ; Sun, 6 May 2007 11:37:13 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 88D3D18737; Sun, 6 May 2007 20:37:11 +0200 (CEST) Date: Sun, 6 May 2007 20:36:49 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070506203649.1c4d9d14@galadriel.home> In-Reply-To: <20070506172606.GB4823@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> <20070505171820.6e92d437@galadriel.home> <20070505210002.GC17112@tuatara.stupidest.org> <20070506192104.3becdd81@galadriel.home> <20070506172606.GB4823@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l46IbDfB014343 X-archive-position: 11305 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Sun, 6 May 2007 10:26:06 -0700 you wrote: > How much RAM? I think you'll want 10s of GBs possibly (well, it > depends very much on what you're storing but you can fit a lot of > small files in 150TB...)
It will be video storage, mainly big to huge files. But I'll remember to stick in as much RAM as I can :) -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sun May 6 18:38:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 18:38:15 -0700 (PDT) Received: from tyo201.gate.nec.co.jp (TYO201.gate.nec.co.jp [202.32.8.193]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l471c5fB018505 for ; Sun, 6 May 2007 18:38:08 -0700 Received: from mailgate3.nec.co.jp (mailgate54.nec.co.jp [10.7.69.197]) by tyo201.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l471c2sH008659 for ; Mon, 7 May 2007 10:38:02 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id l471c2s11594 for xfs@oss.sgi.com; Mon, 7 May 2007 10:38:02 +0900 (JST) Received: from secsv3.tnes.nec.co.jp (tnesvc2.tnes.nec.co.jp [10.1.101.15]) by mailsv.nec.co.jp (8.11.7/3.7W-MAILSV-NEC) with ESMTP id l471c2O04063 for ; Mon, 7 May 2007 10:38:02 +0900 (JST) Received: from tnesvc2.tnes.nec.co.jp ([10.1.101.15]) by secsv3.tnes.nec.co.jp (ExpressMail 5.10) with SMTP id 20070507.092351.98402312 for ; Mon, 7 May 2007 09:23:52 +0900 Received: FROM tnessv1.tnes.nec.co.jp BY tnesvc2.tnes.nec.co.jp ; Mon May 07 09:23:51 2007 +0900 Received: from rifu.bsd.tnes.nec.co.jp (rifu.bsd.tnes.nec.co.jp [10.1.104.1]) by tnessv1.tnes.nec.co.jp (Postfix) with ESMTP id 36BF6AE4B3; Mon, 7 May 2007 10:38:01 +0900 (JST) Received: from TNESG9305.tnes.nec.co.jp (TNESG9305.bsd.tnes.nec.co.jp [10.1.104.199]) by rifu.bsd.tnes.nec.co.jp (8.12.11/3.7W/BSD-TNES-MX01) with SMTP id l471c1ok001475; Mon, 7 May 2007 10:38:01 +0900 Message-Id: <200705070137.AA05294@TNESG9305.tnes.nec.co.jp> Date: Mon, 07 May 2007 10:37:56 +0900 To: xfs@oss.sgi.com Cc: tes@sgi.com Subject: [PATCH] Fix disable, enable, off and remove commands in xfs_quota.
From: Utako Kusaka MIME-Version: 1.0 X-Mailer: AL-Mail32 Version 1.13 Content-Type: multipart/mixed; boundary="--------------------0751065352324900" X-archive-position: 11306 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: utako@tnes.nec.co.jp Precedence: bulk X-list: xfs This is multipart message. ----------------------0751065352324900 Content-Type: text/plain; charset=iso-2022-jp Hi, I sent this mail 10 days ago but it got lost... The disable, enable, off and remove commands in xfs_quota don't work, because: 1) The argument type passed to quotactl() is wrong. "addr" is an fs_quota_stat_t structure in the original code, but it should be an unsigned int as shown in the man page. (disable, enable, off and remove) 2) The wrong flag is used for the -ugp option check. (disable, enable, off and remove) 3) The accounting flag (XFS_QUOTA_*DQ_ACCT) is incorrectly used for disabling quota enforcement. (disable) 4) The accounting and enforcement flags are incorrectly used for removing space. (remove) 5) The quota types must be passed to quotactl() one at a time, but multiple quota types are passed to quotactl() when the -ug|-up option is specified. (remove) The attached patch fixes these problems.
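For readers without the attachment, here is a minimal sketch of the ideas behind points 3) and 5). It is not the attached patch. The EX_* constants and both ex_* helpers are hypothetical stand-ins invented for illustration (the real flag names and values come from the XFS quota headers), and no quotactl() call is made here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the quota type bits (illustration only). */
enum {
    EX_USER_QUOTA  = 1 << 0,
    EX_PROJ_QUOTA  = 1 << 1,
    EX_GROUP_QUOTA = 1 << 2
};

/* Hypothetical stand-ins for the accounting/enforcement flag bits. */
enum {
    EX_UDQ_ACCT = 1 << 0, EX_UDQ_ENFD = 1 << 1,
    EX_GDQ_ACCT = 1 << 2, EX_GDQ_ENFD = 1 << 3,
    EX_PDQ_ACCT = 1 << 4, EX_PDQ_ENFD = 1 << 5
};

/*
 * Point 3: disabling enforcement should select only the enforcement
 * (ENFD) bits for the requested quota types, never the accounting bits.
 */
unsigned int ex_enforcement_flags(unsigned int type)
{
    unsigned int qflags = 0;

    if (type & EX_USER_QUOTA)
        qflags |= EX_UDQ_ENFD;
    if (type & EX_GROUP_QUOTA)
        qflags |= EX_GDQ_ENFD;
    if (type & EX_PROJ_QUOTA)
        qflags |= EX_PDQ_ENFD;
    return qflags;
}

/*
 * Point 5: split a combined type mask into single-type masks so that a
 * caller can issue one request per quota type. Writes at most three
 * entries to out[] and returns how many were written.
 */
size_t ex_split_quota_types(unsigned int type, unsigned int out[3])
{
    static const unsigned int order[3] = {
        EX_USER_QUOTA, EX_GROUP_QUOTA, EX_PROJ_QUOTA
    };
    size_t n = 0;
    size_t i;

    assert(out != NULL);
    for (i = 0; i < 3; i++) {
        if (type & order[i])
            out[n++] = order[i];
    }
    return n;
}
```

A caller would loop over the masks returned by ex_split_quota_types() and make one removal request per entry, stopping at the first failure. Unlike an if/else-if chain, the independent checks handle every combination of -u, -g and -p.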
Signed-off-by: Utako Kusaka --- ----------------------0751065352324900 Content-Type: application/octet-stream; name="state.diff" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="state.diff" LS0tIHhmc3Byb2dzLTIuOC4yMC9xdW90YS9zdGF0ZS5vcmlnCTIwMDctMDQt MTkgMTM6MDc6MzguMDAwMDAwMDAwICswOTAwCisrKyB4ZnNwcm9ncy0yLjgu MjAvcXVvdGEvc3RhdGUuYwkyMDA3LTA0LTI2IDExOjQ2OjQ2LjAwMDAwMDAw MCArMDkwMApAQCAtMjUwLDEwICsyNTAsNiBAQCBlbmFibGVfZW5mb3JjZW1l bnQoCiAJdWludAkJZmxhZ3MpCiB7CiAJZnNfcGF0aF90CSptb3VudDsKLQlm c19xdW90YV9zdGF0X3QJcXN0YXQgPSB7IDAgfTsKLQotCXFzdGF0LnFzX3Zl cnNpb24gPSBGU19RU1RBVF9WRVJTSU9OOwotCXFzdGF0LnFzX2ZsYWdzID0g cWZsYWdzOwogCiAJbW91bnQgPSBmc190YWJsZV9sb29rdXAoZGlyLCBGU19N T1VOVF9QT0lOVCk7CiAJaWYgKCFtb3VudCkgewpAQCAtMjYxLDcgKzI1Nyw3 IEBAIGVuYWJsZV9lbmZvcmNlbWVudCgKIAkJcmV0dXJuOwogCX0KIAlkaXIg PSBtb3VudC0+ZnNfbmFtZTsKLQlpZiAoeGZzcXVvdGFjdGwoWEZTX1FVT1RB T04sIGRpciwgdHlwZSwgMCwgKHZvaWQgKikmcXN0YXQpIDwgMCkKKwlpZiAo eGZzcXVvdGFjdGwoWEZTX1FVT1RBT04sIGRpciwgdHlwZSwgMCwgKHZvaWQg KikmcWZsYWdzKSA8IDApCiAJCXBlcnJvcigiWEZTX1FVT1RBT04iKTsKIAll bHNlIGlmIChmbGFncyAmIFZFUkJPU0VfRkxBRykKIAkJc3RhdGVfcXVvdGFm aWxlX21vdW50KHN0ZG91dCwgdHlwZSwgbW91bnQsIGZsYWdzKTsKQEAgLTI3 NSwxMCArMjcxLDYgQEAgZGlzYWJsZV9lbmZvcmNlbWVudCgKIAl1aW50CQlm bGFncykKIHsKIAlmc19wYXRoX3QJKm1vdW50OwotCWZzX3F1b3RhX3N0YXRf dAlxc3RhdCA9IHsgMCB9OwotCi0JcXN0YXQucXNfdmVyc2lvbiA9IEZTX1FT VEFUX1ZFUlNJT047Ci0JcXN0YXQucXNfZmxhZ3MgPSBxZmxhZ3M7CiAKIAlt b3VudCA9IGZzX3RhYmxlX2xvb2t1cChkaXIsIEZTX01PVU5UX1BPSU5UKTsK IAlpZiAoIW1vdW50KSB7CkBAIC0yODYsNyArMjc4LDcgQEAgZGlzYWJsZV9l bmZvcmNlbWVudCgKIAkJcmV0dXJuOwogCX0KIAlkaXIgPSBtb3VudC0+ZnNf bmFtZTsKLQlpZiAoeGZzcXVvdGFjdGwoWEZTX1FVT1RBT0ZGLCBkaXIsIHR5 cGUsIDAsICh2b2lkICopJnFzdGF0KSA8IDApCisJaWYgKHhmc3F1b3RhY3Rs KFhGU19RVU9UQU9GRiwgZGlyLCB0eXBlLCAwLCAodm9pZCAqKSZxZmxhZ3Mp IDwgMCkKIAkJcGVycm9yKCJYRlNfUVVPVEFPRkYiKTsKIAllbHNlIGlmIChm bGFncyAmIFZFUkJPU0VfRkxBRykKIAkJc3RhdGVfcXVvdGFmaWxlX21vdW50 KHN0ZG91dCwgdHlwZSwgbW91bnQsIGZsYWdzKTsKQEAgLTMwMCwxMCArMjky 
LDYgQEAgcXVvdGFvZmYoCiAJdWludAkJZmxhZ3MpCiB7CiAJZnNfcGF0aF90 CSptb3VudDsKLQlmc19xdW90YV9zdGF0X3QJcXN0YXQgPSB7IDAgfTsKLQot CXFzdGF0LnFzX3ZlcnNpb24gPSBGU19RU1RBVF9WRVJTSU9OOwotCXFzdGF0 LnFzX2ZsYWdzID0gcWZsYWdzOwogCiAJbW91bnQgPSBmc190YWJsZV9sb29r dXAoZGlyLCBGU19NT1VOVF9QT0lOVCk7CiAJaWYgKCFtb3VudCkgewpAQCAt MzExLDI0ICsyOTksMzEgQEAgcXVvdGFvZmYoCiAJCXJldHVybjsKIAl9CiAJ ZGlyID0gbW91bnQtPmZzX25hbWU7Ci0JaWYgKHhmc3F1b3RhY3RsKFhGU19R VU9UQU9GRiwgZGlyLCB0eXBlLCAwLCAodm9pZCAqKSZxc3RhdCkgPCAwKQor CWlmICh4ZnNxdW90YWN0bChYRlNfUVVPVEFPRkYsIGRpciwgdHlwZSwgMCwg KHZvaWQgKikmcWZsYWdzKSA8IDApCiAJCXBlcnJvcigiWEZTX1FVT1RBT0ZG Iik7CiAJZWxzZSBpZiAoZmxhZ3MgJiBWRVJCT1NFX0ZMQUcpCiAJCXN0YXRl X3F1b3RhZmlsZV9tb3VudChzdGRvdXQsIHR5cGUsIG1vdW50LCBmbGFncyk7 CiB9CiAKK3N0YXRpYyBpbnQKK3JlbW92ZV9xdHlwZV9leHRlbnRzKAorCWNo YXIJCSpkaXIsCisJdWludAkJdHlwZSkKK3sKKwlpbnQJZXJyb3IgPSAwOwor CisJaWYgKChlcnJvciA9IHhmc3F1b3RhY3RsKFhGU19RVU9UQVJNLCBkaXIs IHR5cGUsIDAsICh2b2lkICopJnR5cGUpKSA8IDApCisJCXBlcnJvcigiWEZT X1FVT1RBUk0iKTsKKwlyZXR1cm4gZXJyb3I7Cit9CisKIHN0YXRpYyB2b2lk CiByZW1vdmVfZXh0ZW50cygKIAljaGFyCQkqZGlyLAogCXVpbnQJCXR5cGUs Ci0JdWludAkJcWZsYWdzLAogCXVpbnQJCWZsYWdzKQogewogCWZzX3BhdGhf dAkqbW91bnQ7Ci0JZnNfcXVvdGFfc3RhdF90CXFzdGF0ID0geyAwIH07Ci0K LQlxc3RhdC5xc192ZXJzaW9uID0gRlNfUVNUQVRfVkVSU0lPTjsKLQlxc3Rh dC5xc19mbGFncyA9IHFmbGFnczsKIAogCW1vdW50ID0gZnNfdGFibGVfbG9v a3VwKGRpciwgRlNfTU9VTlRfUE9JTlQpOwogCWlmICghbW91bnQpIHsKQEAg LTMzNiw5ICszMzEsMTggQEAgcmVtb3ZlX2V4dGVudHMoCiAJCXJldHVybjsK IAl9CiAJZGlyID0gbW91bnQtPmZzX25hbWU7Ci0JaWYgKHhmc3F1b3RhY3Rs KFhGU19RVU9UQVJNLCBkaXIsIHR5cGUsIDAsICh2b2lkICopJnFzdGF0KSA8 IDApCi0JCXBlcnJvcigiWEZTX1FVT1RBUk0iKTsKLQllbHNlIGlmIChmbGFn cyAmIFZFUkJPU0VfRkxBRykKKwlpZiAodHlwZSAmIFhGU19VU0VSX1FVT1RB KSB7CisJCWlmIChyZW1vdmVfcXR5cGVfZXh0ZW50cyhkaXIsIFhGU19VU0VS X1FVT1RBKSA8IDApIAorCQkJcmV0dXJuOworCX0KKwlpZiAodHlwZSAmIFhG U19HUk9VUF9RVU9UQSkgeworCQlpZiAocmVtb3ZlX3F0eXBlX2V4dGVudHMo ZGlyLCBYRlNfR1JPVVBfUVVPVEEpIDwgMCkgCisJCQlyZXR1cm47CisJfSBl 
bHNlIGlmICh0eXBlICYgWEZTX1BST0pfUVVPVEEpIHsKKwkJaWYgKHJlbW92 ZV9xdHlwZV9leHRlbnRzKGRpciwgWEZTX1BST0pfUVVPVEEpIDwgMCkgCisJ CQlyZXR1cm47CisJfQorCWlmIChmbGFncyAmIFZFUkJPU0VfRkxBRykKIAkJ c3RhdGVfcXVvdGFmaWxlX21vdW50KHN0ZG91dCwgdHlwZSwgbW91bnQsIGZs YWdzKTsKIH0KIApAQCAtMzc0LDcgKzM3OCw3IEBAIGVuYWJsZV9mKAogCWlm IChhcmdjICE9IG9wdGluZCkKIAkJcmV0dXJuIGNvbW1hbmRfdXNhZ2UoJmVu YWJsZV9jbWQpOwogCi0JaWYgKCFmbGFncykgeworCWlmICghdHlwZSkgewog CQl0eXBlIHw9IFhGU19VU0VSX1FVT1RBOwogCQlxZmxhZ3MgfD0gWEZTX1FV T1RBX1VEUV9BQ0NUIHwgWEZTX1FVT1RBX1VEUV9FTkZEOwogCX0KQEAgLTM5 NSwxNSArMzk5LDE1IEBAIGRpc2FibGVfZigKIAkJc3dpdGNoIChjKSB7CiAJ CWNhc2UgJ2cnOgogCQkJdHlwZSB8PSBYRlNfR1JPVVBfUVVPVEE7Ci0JCQlx ZmxhZ3MgfD0gWEZTX1FVT1RBX0dEUV9BQ0NUOworCQkJcWZsYWdzIHw9IFhG U19RVU9UQV9HRFFfRU5GRDsKIAkJCWJyZWFrOwogCQljYXNlICdwJzoKIAkJ CXR5cGUgfD0gWEZTX1BST0pfUVVPVEE7Ci0JCQlxZmxhZ3MgfD0gWEZTX1FV T1RBX1BEUV9BQ0NUOworCQkJcWZsYWdzIHw9IFhGU19RVU9UQV9QRFFfRU5G RDsKIAkJCWJyZWFrOwogCQljYXNlICd1JzoKIAkJCXR5cGUgfD0gWEZTX1VT RVJfUVVPVEE7Ci0JCQlxZmxhZ3MgfD0gWEZTX1FVT1RBX1VEUV9BQ0NUOwor CQkJcWZsYWdzIHw9IFhGU19RVU9UQV9VRFFfRU5GRDsKIAkJCWJyZWFrOwog CQljYXNlICd2JzoKIAkJCWZsYWdzIHw9IFZFUkJPU0VfRkxBRzsKQEAgLTQx Niw5ICs0MjAsOSBAQCBkaXNhYmxlX2YoCiAJaWYgKGFyZ2MgIT0gb3B0aW5k KQogCQlyZXR1cm4gY29tbWFuZF91c2FnZSgmZGlzYWJsZV9jbWQpOwogCi0J aWYgKCFmbGFncykgeworCWlmICghdHlwZSkgewogCQl0eXBlIHw9IFhGU19V U0VSX1FVT1RBOwotCQlxZmxhZ3MgfD0gWEZTX1FVT1RBX1VEUV9BQ0NUOwor CQlxZmxhZ3MgfD0gWEZTX1FVT1RBX1VEUV9FTkZEOwogCX0KIAogCWlmIChm c19wYXRoLT5mc19mbGFncyAmIEZTX01PVU5UX1BPSU5UKQpAQCAtNDU4LDcg KzQ2Miw3IEBAIG9mZl9mKAogCWlmIChhcmdjICE9IG9wdGluZCkKIAkJcmV0 dXJuIGNvbW1hbmRfdXNhZ2UoJm9mZl9jbWQpOwogCi0JaWYgKCFmbGFncykg eworCWlmICghdHlwZSkgewogCQl0eXBlIHw9IFhGU19VU0VSX1FVT1RBOwog CQlxZmxhZ3MgfD0gWEZTX1FVT1RBX1VEUV9BQ0NUIHwgWEZTX1FVT1RBX1VE UV9FTkZEOwogCX0KQEAgLTQ3MywyMSArNDc3LDE4IEBAIHJlbW92ZV9mKAog CWludAkJYXJnYywKIAljaGFyCQkqKmFyZ3YpCiB7Ci0JaW50CQljLCBmbGFn cyA9IDAsIHFmbGFncyA9IDAsIHR5cGUgPSAwOworCWludAkJYywgZmxhZ3Mg 
PSAwLCB0eXBlID0gMDsKIAogCXdoaWxlICgoYyA9IGdldG9wdChhcmdjLCBh cmd2LCAiZ3B1diIpKSAhPSBFT0YpIHsKIAkJc3dpdGNoIChjKSB7CiAJCWNh c2UgJ2cnOgogCQkJdHlwZSB8PSBYRlNfR1JPVVBfUVVPVEE7Ci0JCQlxZmxh Z3MgfD0gWEZTX1FVT1RBX0dEUV9BQ0NUIHwgWEZTX1FVT1RBX0dEUV9FTkZE OwogCQkJYnJlYWs7CiAJCWNhc2UgJ3AnOgogCQkJdHlwZSB8PSBYRlNfUFJP Sl9RVU9UQTsKLQkJCXFmbGFncyB8PSBYRlNfUVVPVEFfUERRX0FDQ1QgfCBY RlNfUVVPVEFfUERRX0VORkQ7CiAJCQlicmVhazsKIAkJY2FzZSAndSc6CiAJ CQl0eXBlIHw9IFhGU19VU0VSX1FVT1RBOwotCQkJcWZsYWdzIHw9IFhGU19R VU9UQV9VRFFfQUNDVCB8IFhGU19RVU9UQV9VRFFfRU5GRDsKIAkJCWJyZWFr OwogCQljYXNlICd2JzoKIAkJCWZsYWdzIHw9IFZFUkJPU0VfRkxBRzsKQEAg LTUwMCwxMyArNTAxLDEyIEBAIHJlbW92ZV9mKAogCWlmIChhcmdjICE9IG9w dGluZCkKIAkJcmV0dXJuIGNvbW1hbmRfdXNhZ2UoJnJlbW92ZV9jbWQpOwog Ci0JaWYgKCFmbGFncykgeworCWlmICghdHlwZSkgewogCQl0eXBlIHw9IFhG U19VU0VSX1FVT1RBOwotCQlxZmxhZ3MgfD0gWEZTX1FVT1RBX1VEUV9BQ0NU IHwgWEZTX1FVT1RBX1VEUV9FTkZEOwogCX0KIAogCWlmIChmc19wYXRoLT5m c19mbGFncyAmIEZTX01PVU5UX1BPSU5UKQotCQlyZW1vdmVfZXh0ZW50cyhm c19wYXRoLT5mc19kaXIsIHR5cGUsIHFmbGFncywgZmxhZ3MpOworCQlyZW1v dmVfZXh0ZW50cyhmc19wYXRoLT5mc19kaXIsIHR5cGUsIGZsYWdzKTsKIAly ZXR1cm4gMDsKIH0KIAo= ----------------------0751065352324900-- From owner-xfs@oss.sgi.com Sun May 6 19:11:35 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 19:11:38 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l472BXfB024497 for ; Sun, 6 May 2007 19:11:34 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id MAA17577; Mon, 7 May 2007 12:11:25 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l472BOAf86463303; Mon, 7 May 2007 12:11:24 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l472BMNV85439097; Mon, 7 May 2007 12:11:22 
+1000 (AEST) Date: Mon, 7 May 2007 12:11:22 +1000 From: David Chinner To: Emmanuel Florac Cc: David Chinner , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070507021122.GQ32602149@melbourne.sgi.com> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20070504152546.614374ac@harpe.intellique.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11307 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Fri, May 04, 2007 at 03:25:46PM +0200, Emmanuel Florac wrote: > On Fri, 4 May 2007 17:33:44 +1000 > David Chinner wrote: > > > Well, there's your problem. Stack overflows. IMO, if you use a > > filesystem, you shouldn't use 4k stacks. ;) > > > > If you remake your kernel with 8k stacks then your problems will > > most likely go away. > > Well, I've double-checked the asm-i386/module.h, and it actually looks > like 4K stacks is NOT the default, so I must be using 8K, right? Yes. > I've run the same test on the same machine but WITHOUT software raid-0 > (so write barriers are in use), and all went well, more than 3TB > written without a glitch. I still think there's something related to > the write barriers here. I'll try with another RAID controller, Adaptec > for instance, to make sure the 3ware driver isn't involved. I'll also try > again with an amd64 kernel. So you use software raid and you get corruptions, right? I doubt this has anything to do with write barriers - if it does that's an indication of broken drivers or hardware.....
Can you run with "-o nobarrier" and no software raid and see if you still have a problem? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 7 03:07:56 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 03:07:59 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47A7tfB005622 for ; Mon, 7 May 2007 03:07:56 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id A440617CFA; Mon, 7 May 2007 12:07:54 +0200 (CEST) Date: Mon, 7 May 2007 12:07:54 +0200 From: Emmanuel Florac To: David Chinner Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070507120754.289deffd@galadriel.home> In-Reply-To: <20070507021122.GQ32602149@melbourne.sgi.com> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <20070507021122.GQ32602149@melbourne.sgi.com> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l47A7vfB005627 X-archive-position: 11308 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Mon, 7 May 2007 12:11:22 +1000 you wrote: > So you use software raid and you get corruptions, right? I doubt this > has anything to do with write barriers - if it does that's an > indication of broken drivers or hardware..... > > Can you run with "-o nobarrier" and no software raid and see if you > still have a problem?
I tried on the same machine without software RAID and barriers, and it worked OK. I'll try today with nobarrier. Stay tuned :) -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Mon May 7 04:03:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 04:03:55 -0700 (PDT) Received: from e4.ny.us.ibm.com (e4.ny.us.ibm.com [32.97.182.144]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47B3lfB017677 for ; Mon, 7 May 2007 04:03:49 -0700 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e4.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l47B3hN7031521 for ; Mon, 7 May 2007 07:03:43 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay02.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47B3hcJ515560 for ; Mon, 7 May 2007 07:03:43 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47B3g21005021 for ; Mon, 7 May 2007 07:03:43 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47B3frL004964; Mon, 7 May 2007 07:03:42 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 5AB9D94BBD; Mon, 7 May 2007 16:33:49 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47B3nwC010945; Mon, 7 May 2007 16:33:49 +0530 Date: Mon, 7 May 2007 16:33:48 +0530 From: "Amit K.
Arora" To: Andrew Morton Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070507110348.GA7012@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503212955.b1b6443c.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11309 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs Andrew, Thanks for the review comments! On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > This patch implements the fallocate() system call and adds support for > > i386, x86_64 and powerpc. > > > > ... > > > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > Please add a comment over this function which specifies its behaviour. > Really it should be enough material from which a full manpage can be > written. > > If that's all too much, this material should at least be spelled out in the > changelog. Because there's no way in which this change can be fully > reviewed unless someone (ie: you) tells us what it is setting out to > achieve. 
> > If we 100% implement some standard then a URL for what we claim to > implement would suffice. Given that we're at least using different types from > posix I doubt if such a thing would be sufficient. > > And given the complexity and potential variability within the filesystem > implementations of this, I'd expect that _something_ additional needs to be > said? Ok. I will add a detailed comment here. > > > +{ > > + struct file *file; > > + struct inode *inode; > > + long ret = -EINVAL; > > + > > + if (len == 0 || offset < 0) > > + goto out; > > The posix spec implies that negative `len' is permitted - presumably "allocate > ahead of `offset'". How peculiar. I think we should go ahead with the current glibc implementation (which Jakub pointed at) of not allowing a negative 'len', since posix also doesn't explicitly say anything about allowing negative 'len'. > > > + ret = -EBADF; > > + file = fget(fd); > > + if (!file) > > + goto out; > > + if (!(file->f_mode & FMODE_WRITE)) > > + goto out_fput; > > + > > + inode = file->f_path.dentry->d_inode; > > + > > + ret = -ESPIPE; > > + if (S_ISFIFO(inode->i_mode)) > > + goto out_fput; > > + > > + ret = -ENODEV; > > + if (!S_ISREG(inode->i_mode)) > > + goto out_fput; > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > seems a bit silly of them. True. > > + ret = -EFBIG; > > + if (offset + len > inode->i_sb->s_maxbytes) > > + goto out_fput; > > This code does handle offset+len going negative, but only by accident, I > suspect. It happens that s_maxbytes has unsigned type. Perhaps a comment > here would settle the reader's mind. Ok. I will add a check here for wrap through zero.
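[Archive note] The wrap-through-zero concern on the `offset + len > s_maxbytes` test can be illustrated with a small user-space sketch. This is not the check that was eventually merged; `check_falloc_range`, `loff_sketch_t` and the `max_bytes` parameter are hypothetical names invented here, and the rejection of non-positive `len` is an assumption in line with the glibc behaviour mentioned in the reply. The idea is only to compare `len` against the headroom left above `offset` before ever forming the sum.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the kernel's signed loff_t. */
typedef int64_t loff_sketch_t;

/*
 * Validate an (offset, len) pair against a size limit (assumed to play
 * the role of s_maxbytes). Returns 0 if the range is acceptable,
 * -EINVAL for a negative offset or non-positive len, and -EFBIG if the
 * end of the range would overflow or exceed max_bytes.
 */
static int check_falloc_range(loff_sketch_t offset, loff_sketch_t len,
			      uint64_t max_bytes)
{
	if (offset < 0 || len <= 0)
		return -EINVAL;

	/*
	 * Both values are non-negative here, so INT64_MAX - offset cannot
	 * overflow. If len exceeds that headroom, offset + len would wrap.
	 */
	if (len > INT64_MAX - offset)
		return -EFBIG;

	/* The sum is now known to be representable; compare as unsigned. */
	if ((uint64_t)(offset + len) > max_bytes)
		return -EFBIG;

	return 0;
}
```

Checking the headroom first means the signed addition is never evaluated when it could overflow, so the result does not depend on the limit happening to have an unsigned type.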
> > + if (inode->i_op && inode->i_op->fallocate) > > + ret = inode->i_op->fallocate(inode, mode, offset, len); > > + else > > + ret = -ENOSYS; > > If we _are_ going to support negative `len', as posix suggests, I think we > should perform the appropriate sanity conversions to `offset' and `len' > right here, rather than expecting each filesystem to do it. > > If we're not going to handle negative `len' then we should check for it. Will add a check for negative 'len' and return -EINVAL. This will be done where currently we check for negative offset (i.e. at the start of the function). > > +out_fput: > > + fput(file); > > +out: > > + return ret; > > +} > > +EXPORT_SYMBOL(sys_fallocate); > > I don't believe this needs to be exported to modules? Ok. Will remove it. > > +/* > > + * fallocate() modes > > + */ > > +#define FA_ALLOCATE 0x1 > > +#define FA_DEALLOCATE 0x2 > > Now those aren't in posix. They should be documented, along with their > expected semantics. Will add a comment describing the role of these modes. > > #ifdef __KERNEL__ > > > > #include > > @@ -1125,6 +1131,7 @@ struct inode_operations { > > ssize_t (*listxattr) (struct dentry *, char *, size_t); > > int (*removexattr) (struct dentry *, const char *); > > void (*truncate_range)(struct inode *, loff_t, loff_t); > > + long (*fallocate)(struct inode *, int, loff_t, loff_t); > > I really do think it's better to put the variable names in definitions such > as this. Especially when we have two identically-typed variables next to > each other like that. Quick: which one is the offset and which is the > length? Ok. Will add the variable names here. 
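[Archive note] Two of the review points above (name the arguments in the `fallocate` prototype, and do argument checking once in the caller rather than in every filesystem) can be sketched in user space as follows. Everything prefixed `sketch_` or `toy_` is hypothetical and only mimics the shape of the quoted patch; the `FA_*` values are the ones from the quoted hunk, and the toy filesystem's behaviour is invented for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Mode values as they appear in the quoted patch. */
#define FA_ALLOCATE   0x1
#define FA_DEALLOCATE 0x2

struct sketch_inode;

/* Operations table with named arguments, so offset and len cannot be confused. */
struct sketch_inode_ops {
	long (*fallocate)(struct sketch_inode *inode, int mode,
			  int64_t offset, int64_t len);
};

/* Hypothetical stand-in for struct inode. */
struct sketch_inode {
	const struct sketch_inode_ops *i_op;
	int64_t allocated_end;	/* toy bookkeeping, not a kernel field */
};

/* Generic entry point: validate once, then dispatch or report -ENOSYS. */
static long sketch_do_fallocate(struct sketch_inode *inode, int mode,
				int64_t offset, int64_t len)
{
	if (offset < 0 || len <= 0)
		return -EINVAL;
	if (len > INT64_MAX - offset)	/* offset + len would wrap */
		return -EFBIG;
	if (!inode->i_op || !inode->i_op->fallocate)
		return -ENOSYS;
	return inode->i_op->fallocate(inode, mode, offset, len);
}

/* Toy filesystem that only understands FA_ALLOCATE. */
static long toy_fallocate(struct sketch_inode *inode, int mode,
			  int64_t offset, int64_t len)
{
	if (mode != FA_ALLOCATE)
		return -EOPNOTSUPP;
	if (offset + len > inode->allocated_end)
		inode->allocated_end = offset + len;
	return 0;
}

static const struct sketch_inode_ops toy_ops = {
	.fallocate = toy_fallocate,
};
```

Because the generic function has already rejected bad arguments, the per-filesystem callback can assume a non-negative offset and a positive, non-wrapping length.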
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 04:10:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 04:10:42 -0700 (PDT) Received: from e6.ny.us.ibm.com (e6.ny.us.ibm.com [32.97.182.146]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47BAbfB018894 for ; Mon, 7 May 2007 04:10:39 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e6.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l47BBYno028901 for ; Mon, 7 May 2007 07:11:34 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47BAbK5550866 for ; Mon, 7 May 2007 07:10:37 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47BAaUk020671 for ; Mon, 7 May 2007 07:10:36 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47BAZ5f020654; Mon, 7 May 2007 07:10:36 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id B3CE494BBD; Mon, 7 May 2007 16:40:38 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47BAcV7013746; Mon, 7 May 2007 16:40:38 +0530 Date: Mon, 7 May 2007 16:40:38 +0530 From: "Amit K. 
Arora" To: Andrew Morton Cc: David Chinner , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070507111038.GB7012@amitarora.in.ibm.com> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <20070504060731.GJ32602149@melbourne.sgi.com> <20070503232815.2f62a75e.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503232815.2f62a75e.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11310 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote: > The above opengroup page only permits S_ISREG. Preallocating directories > sounds quite useful to me, although it's something which would be pretty > hard to emulate if the FS doesn't support it. And there's a decent case to > be made for emulating it - run-anywhere reasons. Does glibc emulation support > directories? Quite unlikely. > > But yes, sounds like a desirable thing. Would XFS support it easily if the above > check was relaxed? I think we may relax the check here and let the individual file system decide if they support preallocation for directories or not. What do you think ? 
One thing to consider in this case is the error code that the file system implementation should return in case it doesn't support preallocation for directories. Should it be -ENODEV (to match what posix says), or something else (which might make more sense in this case)? -- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 04:46:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 04:46:48 -0700 (PDT) Received: from e2.ny.us.ibm.com (e2.ny.us.ibm.com [32.97.182.142]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47BkifB032153 for ; Mon, 7 May 2007 04:46:45 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e2.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l47Bkh21026930 for ; Mon, 7 May 2007 07:46:43 -0400 Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47BkhAb550744 for ; Mon, 7 May 2007 07:46:43 -0400 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47Bkh3g010826 for ; Mon, 7 May 2007 07:46:43 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av02.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47Bkg9x010807; Mon, 7 May 2007 07:46:42 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id ECD1694BBD; Mon, 7 May 2007 17:16:49 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47BknTn028767; Mon, 7 May 2007 17:16:49 +0530 Date: Mon, 7 May 2007 17:16:49 +0530 From: "Amit K.
Arora" To: Andrew Morton Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 3/5] ext4: Extent overlap bugfix Message-ID: <20070507114649.GC7012@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181101.GC7209@amitarora.in.ibm.com> <20070503213002.eff696db.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503213002.eff696db.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11311 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 09:30:02PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:41:01 +0530 "Amit K. 
Arora" wrote: > > > +unsigned int ext4_ext_check_overlap(struct inode *inode, > > + struct ext4_extent *newext, > > + struct ext4_ext_path *path) > > +{ > > + unsigned long b1, b2; > > + unsigned int depth, len1; > > + > > + b1 = le32_to_cpu(newext->ee_block); > > + len1 = le16_to_cpu(newext->ee_len); > > + depth = ext_depth(inode); > > + if (!path[depth].p_ext) > > + goto out; > > + b2 = le32_to_cpu(path[depth].p_ext->ee_block); > > + > > + /* get the next allocated block if the extent in the path > > + * is before the requested block(s) */ > > + if (b2 < b1) { > > + b2 = ext4_ext_next_allocated_block(path); > > + if (b2 == EXT_MAX_BLOCK) > > + goto out; > > + } > > + > > + if (b1 + len1 > b2) { > > Are we sure that b1+len cannot wrap through zero here? No. Will add a check here for this. Thanks! > > + newext->ee_len = cpu_to_le16(b2 - b1); > > + return 1; > > + } -- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 05:11:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 05:11:58 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47CBqfB003761 for ; Mon, 7 May 2007 05:11:54 -0700 Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47C7E7Y027576 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Mon, 7 May 2007 08:07:15 -0400 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e32.co.us.ibm.com (8.12.11.20060308/8.13.8) with ESMTP id l47C3sWu029214 for ; Mon, 7 May 2007 08:03:54 -0400 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47C7CFZ184746 for ; Mon, 7 May 2007 06:07:12 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id 
l47C7CKS012675 for ; Mon, 7 May 2007 06:07:12 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47C7Bi2012612; Mon, 7 May 2007 06:07:11 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 2FC5D94BBD; Mon, 7 May 2007 17:37:19 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47C7Jq7004761; Mon, 7 May 2007 17:37:19 +0530 Date: Mon, 7 May 2007 17:37:19 +0530 From: "Amit K. Arora" To: Andrew Morton Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070507120719.GD7012@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503213133.d1559f52.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11312 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" wrote: > > > This patch has the ext4 implemtation of fallocate system call. > > > > ... 
> > > > + /* ext4_can_extents_be_merged should have checked that either > > + * both extents are uninitialized, or both aren't. Thus we > > + * need to check only one of them here. > > + */ > > Please always format multiline comments like this: > > /* > * ext4_can_extents_be_merged should have checked that either > * both extents are uninitialized, or both aren't. Thus we > * need to check only one of them here. > */ Ok. > > ... > > > > +/* > > + * ext4_fallocate: > > + * preallocate space for a file > > + * mode is for future use, e.g. for unallocating preallocated blocks etc. > > + */ > > This description is rather thin. What is the filesystem's actual behaviour > here? If the file is using extents then the implementation will do > . If the file is using bitmaps then we will do . > > But what? Here is where it should be described. Ok. Will expand the description. > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) > > +{ > > + handle_t *handle; > > + ext4_fsblk_t block, max_blocks; > > + int ret, ret2, nblocks = 0, retries = 0; > > + struct buffer_head map_bh; > > + unsigned int credits, blkbits = inode->i_blkbits; > > + > > + /* Currently supporting (pre)allocate mode _only_ */ > > + if (mode != FA_ALLOCATE) > > + return -EOPNOTSUPP; > > + > > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > > + return -ENOTTY; > > So we don't implement fallocate on bitmap-based files! Well that's huge > news. The changelog would be an appropriate place to communicate this, > along with reasons why, or a description of the plan to fix it. Ok. Will add this in the function description as well. > Also, posix says nothing about fallocate() returning ENOTTY. Right. I don't seem to find any suitable error from posix description. Can you please suggest an error code which might make more sense here ? Will -ENOTSUPP be ok ? Since we want to say here that we don't support non-extent files. 
> > + block = offset >> blkbits; > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > > + - block; > > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); > > Now I'm mystified. Given that we're allocating an arbitrary amount of disk > space, and that this disk space will require an arbitrary amount of > metadata, how can we work out how much journal space we'll be needing > without at least looking at `len'? You are right to say that the credits cannot be fixed here. But 'len' will not directly tell us how many extents might need to be inserted and how many block groups (if any - think about the "segment range" already being allocated case) the allocation request might touch. One solution I have thought of is to check the buffer credits after a call to ext4_ext_get_blocks (in the while loop) and do a journal_extend if the credits are falling short. In case journal_extend fails, we call journal_restart. This will automatically take care of how much journal space we might need for any value of "len". > > + handle=ext4_journal_start(inode, credits + > > Please always put spaces around "=" Ok. > > > + EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1); > > And around "+" Ok. > > > + if (IS_ERR(handle)) > > + return PTR_ERR(handle); > > +retry: > > + ret = 0; > > + while (ret >= 0 && ret < max_blocks) { > > + block = block + ret; > > + max_blocks = max_blocks - ret; > > + ret = ext4_ext_get_blocks(handle, inode, block, > > + max_blocks, &map_bh, > > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > > + BUG_ON(!ret); > > BUG_ON is vicious. Is it really justified here? Possibly a WARN_ON and > ext4_error() would be safer and more useful here. Ok. Will do that. > > > + if (ret > 0 && test_bit(BH_New, &map_bh.b_state) > > Use buffer_new() here. A separate patch which fixes the three existing > instances of open-coded BH_foo usage would be appreciated. Ok.
> > > + && ((block + ret) > (i_size_read(inode) << blkbits))) > > Check for wrap through the sign bit and through zero please. Ok. > > > + nblocks = nblocks + ret; > > + } > > + > > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) > > + goto retry; > > + > > + /* Time to update the file size. > > + * Update only when preallocation was requested beyond the file size. > > + */ > > Fix comment layout. Ok. > > > + if ((offset + len) > i_size_read(inode)) { > > Both the lhs and the rhs here are signed. Please review for possible > overflows through the sign bit and through zero. Perhaps a comment > explaining why it's correct would be appropriate. Ok. > > > > + if (ret > 0) { > > + /* if no error, we assume preallocation succeeded completely */ > > + mutex_lock(&inode->i_mutex); > > + i_size_write(inode, offset + len); > > + EXT4_I(inode)->i_disksize = i_size_read(inode); > > + mutex_unlock(&inode->i_mutex); > > + } else if (ret < 0 && nblocks) { > > + /* Handle partial allocation scenario */ > > The above two comments should be indented one additional tabstop. Ok. > > > + loff_t newsize; > > + mutex_lock(&inode->i_mutex); > > + newsize = (nblocks << blkbits) + i_size_read(inode); > > + i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits)); > > + EXT4_I(inode)->i_disksize = i_size_read(inode); > > + mutex_unlock(&inode->i_mutex); > > + } > > + } > > + ext4_mark_inode_dirty(handle, inode); > > + ret2 = ext4_journal_stop(handle); > > + if (ret > 0) > > + ret = ret2; > > + > > + return ret > 0 ?
0 : ret; > > +} > > + > > EXPORT_SYMBOL(ext4_mark_inode_dirty); > > EXPORT_SYMBOL(ext4_ext_invalidate_cache); > > EXPORT_SYMBOL(ext4_ext_insert_extent); > > EXPORT_SYMBOL(ext4_ext_walk_space); > > EXPORT_SYMBOL(ext4_ext_find_goal); > > EXPORT_SYMBOL(ext4_ext_calc_credits_for_insert); > > +EXPORT_SYMBOL(ext4_fallocate); > > > > Index: linux-2.6.21/fs/ext4/file.c > > =================================================================== > > --- linux-2.6.21.orig/fs/ext4/file.c > > +++ linux-2.6.21/fs/ext4/file.c > > @@ -135,5 +135,6 @@ const struct inode_operations ext4_file_ > > .removexattr = generic_removexattr, > > #endif > > .permission = ext4_permission, > > + .fallocate = ext4_fallocate, > > }; > > > > Index: linux-2.6.21/include/linux/ext4_fs.h > > =================================================================== > > --- linux-2.6.21.orig/include/linux/ext4_fs.h > > +++ linux-2.6.21/include/linux/ext4_fs.h > > @@ -102,6 +102,8 @@ > > EXT4_GOOD_OLD_FIRST_INO : \ > > (s)->s_first_ino) > > #endif > > +#define EXT4_BLOCK_ALIGN(size, blkbits) (((size)+(1 << blkbits)-1) & \ > > + (~((1 << blkbits)-1))) > > Maybe a comment describing what this does? Probably it's obvious enough. > > I think it could use the standard ALIGN macro. > > Is blkbits sufficiently parenthesised here? Even if it is, adding the > parens would be better practice. I agree. Will change it. > > > /* > > * Macro-instructions used to manage fragments > > @@ -225,6 +227,10 @@ struct ext4_new_group_data { > > __u32 free_blocks_count; > > }; > > > > +/* Following is used by preallocation logic to tell get_blocks() that we > > + * want uninitialzed extents. > > + */ > > Please convert all newly-added multiline comments to the preferred layout. Ok. 
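[Archive note] For readers following the `EXT4_BLOCK_ALIGN` discussion, here is a user-space sketch of a fully parenthesised round-up macro together with the byte-range to block-range conversion used earlier in the quoted `ext4_fallocate()`. `SKETCH_BLOCK_ALIGN`, `struct block_range` and `bytes_to_blocks` are hypothetical names, the 64-bit unsigned types are an assumption made to keep the example free of shift overflow, and this is not the macro that ended up in ext4.

```c
#include <stdint.h>

/*
 * Round size up to the next multiple of (1 << blkbits). Every macro
 * argument is parenthesised, and 1ULL keeps the shift in 64 bits.
 */
#define SKETCH_BLOCK_ALIGN(size, blkbits) \
	(((uint64_t)(size) + ((1ULL << (blkbits)) - 1)) & \
	 ~((1ULL << (blkbits)) - 1))

/* First block touched by a byte range and the number of blocks it spans. */
struct block_range {
	uint64_t first;
	uint64_t count;
};

/*
 * Same arithmetic as the quoted patch: the first block is offset rounded
 * down, and the count is the rounded-up end minus the first block.
 * The caller is assumed to have checked that offset + len does not wrap.
 */
static struct block_range bytes_to_blocks(uint64_t offset, uint64_t len,
					  unsigned int blkbits)
{
	struct block_range r;

	r.first = offset >> blkbits;
	r.count = (SKETCH_BLOCK_ALIGN(offset + len, blkbits) >> blkbits) - r.first;
	return r;
}
```

With 4 KiB blocks (`blkbits` of 12), a 2-byte request starting at byte 4095 straddles a block boundary, so it maps to two blocks starting at block 0.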
> > > +#define EXT4_CREATE_UNINITIALIZED_EXT 2 > > > > /* > > * ioctl commands > > @@ -976,6 +982,7 @@ extern int ext4_ext_get_blocks(handle_t > > extern void ext4_ext_truncate(struct inode *, struct page *); > > extern void ext4_ext_init(struct super_block *); > > extern void ext4_ext_release(struct super_block *); > > +extern int ext4_fallocate(struct inode *, int, loff_t, loff_t); > > argh. And feel free to give these args some useful names. Ok. > > > static inline int > > ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, > > unsigned long max_blocks, struct buffer_head *bh, > > Index: linux-2.6.21/include/linux/ext4_fs_extents.h > > =================================================================== > > --- linux-2.6.21.orig/include/linux/ext4_fs_extents.h > > +++ linux-2.6.21/include/linux/ext4_fs_extents.h > > @@ -125,6 +125,19 @@ struct ext4_ext_path { > > #define EXT4_EXT_CACHE_EXTENT 2 > > > > /* > > + * Macro-instructions to handle (mark/unmark/check/create) unitialized > > + * extents. Applications can issue an IOCTL for preallocation, which results > > + * in assigning unitialized extents to the file. > > + */ > > +#define ext4_ext_mark_uninitialized(ext) ((ext)->ee_len |= \ > > + cpu_to_le16(0x8000)) > > +#define ext4_ext_is_uninitialized(ext) ((le16_to_cpu((ext)->ee_len))& \ > > + 0x8000) > > +#define ext4_ext_get_actual_len(ext) ((le16_to_cpu((ext)->ee_len))& \ > > + 0x7FFF) > > inlined C functions are preferred, and I think these could be implemented > that way. Ok. Will convert them to inline functions. Thanks! 
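The conversion of the three macros into inline C functions could look roughly like the sketch below. It is a userspace illustration only: struct toy_extent and the toy_ names are hypothetical, and the length is kept in host byte order for simplicity (the real ee_len is little-endian on disk and goes through cpu_to_le16()/le16_to_cpu(), as in the quoted macros). Only the 0x8000 flag bit and the 0x7FFF length mask are taken from the patch.

```c
#include <stdint.h>

#define TOY_EXT_UNINIT_FLAG 0x8000u  /* MSB of the 16-bit length, as in the patch */
#define TOY_EXT_LEN_MASK    0x7FFFu  /* remaining 15 bits hold the real length */

/* Hypothetical stand-in for struct ext4_extent; host-endian for simplicity. */
struct toy_extent {
    uint32_t ee_block;  /* first logical block covered */
    uint16_t ee_len;    /* length in blocks, MSB = "uninitialized" */
};

/* Set the "uninitialized" flag without touching the length bits. */
static inline void toy_ext_mark_uninitialized(struct toy_extent *ext)
{
    ext->ee_len |= TOY_EXT_UNINIT_FLAG;
}

/* Non-zero if the extent is flagged as uninitialized. */
static inline int toy_ext_is_uninitialized(const struct toy_extent *ext)
{
    return (ext->ee_len & TOY_EXT_UNINIT_FLAG) != 0;
}

/* Length in blocks with the flag bit masked off. */
static inline uint16_t toy_ext_get_actual_len(const struct toy_extent *ext)
{
    return ext->ee_len & TOY_EXT_LEN_MASK;
}
```

Unlike the macro forms, these type-check their argument and evaluate it exactly once.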
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 05:24:56 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 05:24:59 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47COsfB005440 for ; Mon, 7 May 2007 05:24:55 -0700 Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47CBCnQ029352 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Mon, 7 May 2007 08:11:12 -0400 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e32.co.us.ibm.com (8.12.11.20060308/8.13.8) with ESMTP id l47C7pKq031566 for ; Mon, 7 May 2007 08:07:51 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47CB9dj171804 for ; Mon, 7 May 2007 06:11:09 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47CB9VC024191 for ; Mon, 7 May 2007 06:11:09 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47CB7of024105; Mon, 7 May 2007 06:11:08 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 2B6F894BBD; Mon, 7 May 2007 17:41:16 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47CBFgC006463; Mon, 7 May 2007 17:41:15 +0530 Date: Mon, 7 May 2007 17:41:15 +0530 From: "Amit K. 
Arora" To: Andrew Morton Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents Message-ID: <20070507121115.GE7012@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181623.GE7209@amitarora.in.ibm.com> <20070503213238.5cdb1585.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503213238.5cdb1585.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11313 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 09:32:38PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:46:23 +0530 "Amit K. Arora" wrote: > > + */ > > +int ext4_ext_try_to_merge(struct inode *inode, > > + struct ext4_ext_path *path, > > + struct ext4_extent *ex) > > +{ > > + struct ext4_extent_header *eh; > > + unsigned int depth, len; > > + int merge_done=0, uninitialized = 0; > > space around "=", please. > > Many people prefer not to do the multiple-definitions-per-line, btw: > > int merge_done = 0; > int uninitialized = 0; Ok. Will make the change. > > reasons: > > - If gives you some space for a nice comment > > - It makes patches much more readable, and it makes rejects easier to fix > > - standardisation. 
> > > + depth = ext_depth(inode); > > + BUG_ON(path[depth].p_hdr == NULL); > > + eh = path[depth].p_hdr; > > + > > + while (ex < EXT_LAST_EXTENT(eh)) { > > + if (!ext4_can_extents_be_merged(inode, ex, ex + 1)) > > + break; > > + /* merge with next extent! */ > > + if (ext4_ext_is_uninitialized(ex)) > > + uninitialized = 1; > > + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) > > + + ext4_ext_get_actual_len(ex + 1)); > > + if (uninitialized) > > + ext4_ext_mark_uninitialized(ex); > > + > > + if (ex + 1 < EXT_LAST_EXTENT(eh)) { > > + len = (EXT_LAST_EXTENT(eh) - ex - 1) > > + * sizeof(struct ext4_extent); > > + memmove(ex + 1, ex + 2, len); > > + } > > + eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1); > > Kenrel convention is to put spaces around "-" Will fix this. > > > + merge_done = 1; > > + BUG_ON(eh->eh_entries == 0); > > eek, scary BUG_ON. Do we really need to be that severe? Would it be > better to warn and run ext4_error() here? Ok. > > > + } > > + > > + return merge_done; > > +} > > + > > + > > > > ... > > > > +/* > > + * ext4_ext_convert_to_initialized: > > + * this function is called by ext4_ext_get_blocks() if someone tries to write > > + * to an uninitialized extent. It may result in splitting the uninitialized > > + * extent into multiple extents (upto three). Atleast one initialized extent > > + * and atmost two uninitialized extents can result. > > There are some typos here > > > + * There are three possibilities: > > + * a> No split required: Entire extent should be initialized. > > + * b> Split into two extents: Only one end of the extent is being written to. > > + * c> Split into three extents: Somone is writing in middle of the extent. > > and here > Ok. Will fix them. 
> > + */ > > +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, > > + struct ext4_ext_path *path, > > + ext4_fsblk_t iblock, > > + unsigned long max_blocks) > > +{ > > + struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex; > > + struct ext4_extent_header *eh; > > + unsigned int allocated, ee_block, ee_len, depth; > > + ext4_fsblk_t newblock; > > + int err = 0, ret = 0; > > + > > + depth = ext_depth(inode); > > + eh = path[depth].p_hdr; > > + ex = path[depth].p_ext; > > + ee_block = le32_to_cpu(ex->ee_block); > > + ee_len = ext4_ext_get_actual_len(ex); > > + allocated = ee_len - (iblock - ee_block); > > + newblock = iblock - ee_block + ext_pblock(ex); > > + ex2 = ex; > > + > > + /* ex1: ee_block to iblock - 1 : uninitialized */ > > + if (iblock > ee_block) { > > + ex1 = ex; > > + ex1->ee_len = cpu_to_le16(iblock - ee_block); > > + ext4_ext_mark_uninitialized(ex1); > > + ex2 = &newex; > > + } > > + /* for sanity, update the length of the ex2 extent before > > + * we insert ex3, if ex1 is NULL. This is to avoid temporary > > + * overlap of blocks. > > + */ > > + if (!ex1 && allocated > max_blocks) > > + ex2->ee_len = cpu_to_le16(max_blocks); > > + /* ex3: to ee_block + ee_len : uninitialised */ > > + if (allocated > max_blocks) { > > + unsigned int newdepth; > > + ex3 = &newex; > > + ex3->ee_block = cpu_to_le32(iblock + max_blocks); > > + ext4_ext_store_pblock(ex3, newblock + max_blocks); > > + ex3->ee_len = cpu_to_le16(allocated - max_blocks); > > + ext4_ext_mark_uninitialized(ex3); > > + err = ext4_ext_insert_extent(handle, inode, path, ex3); > > + if (err) > > + goto out; > > + /* The depth, and hence eh & ex might change > > + * as part of the insert above. > > + */ > > + newdepth = ext_depth(inode); > > + if (newdepth != depth) > > + { > > Use > > if (newdepth != depth) { Ok. > > > + depth=newdepth; > > spaces Ok. 
> > > + path = ext4_ext_find_extent(inode, iblock, NULL); > > + if (IS_ERR(path)) { > > + err = PTR_ERR(path); > > + path = NULL; > > + goto out; > > + } > > + eh = path[depth].p_hdr; > > + ex = path[depth].p_ext; > > + if (ex2 != &newex) > > + ex2 = ex; > > + } > > + allocated = max_blocks; > > + } > > + /* If there was a change of depth as part of the > > + * insertion of ex3 above, we need to update the length > > + * of the ex1 extent again here > > + */ > > + if (ex1 && ex1 != ex) { > > + ex1 = ex; > > + ex1->ee_len = cpu_to_le16(iblock - ee_block); > > + ext4_ext_mark_uninitialized(ex1); > > + ex2 = &newex; > > + } > > + /* ex2: iblock to iblock + maxblocks-1 : initialised */ > > + ex2->ee_block = cpu_to_le32(iblock); > > + ex2->ee_start = cpu_to_le32(newblock); > > + ext4_ext_store_pblock(ex2, newblock); > > + ex2->ee_len = cpu_to_le16(allocated); > > + if (ex2 != ex) > > + goto insert; > > + if ((err = ext4_ext_get_access(handle, inode, path + depth))) > > + goto out; > > The preferred style is > > err = ext4_ext_get_access(handle, inode, path + depth); > if (err) > goto out; Right. Will change it. > > + /* New (initialized) extent starts from the first block > > + * in the current extent. i.e., ex2 == ex > > + * We have to see if it can be merged with the extent > > + * on the left. > > + */ > > + if (ex2 > EXT_FIRST_EXTENT(eh)) { > > + /* To merge left, pass "ex2 - 1" to try_to_merge(), > > + * since it merges towards right _only_. > > + */ > > + ret = ext4_ext_try_to_merge(inode, path, ex2 - 1); > > + if (ret) { > > + err = ext4_ext_correct_indexes(handle, inode, path); > > + if (err) > > + goto out; > > + depth = ext_depth(inode); > > + ex2--; > > + } > > + } > > + /* Try to Merge towards right. This might be required > > + * only when the whole extent is being written to. > > + * i.e. ex2==ex and ex3==NULL. 
> > +	 */
> > +	if (!ex3) {
> > +		ret = ext4_ext_try_to_merge(inode, path, ex2);
> > +		if (ret) {
> > +			err = ext4_ext_correct_indexes(handle, inode, path);
> > +			if (err)
> > +				goto out;
> > +		}
> > +	}
> > +	/* Mark modified extent as dirty */
> > +	err = ext4_ext_dirty(handle, inode, path + depth);
> > +	goto out;
> > +insert:
> > +	err = ext4_ext_insert_extent(handle, inode, path, &newex);
> > +out:
> > +	return err ? err : allocated;
> > +}
>
> Sigh. I hope you guys know how all this works, because the extent code is
> a mystery to me. Is the on-disk layout and the allocation strategy
> described anywhere?
>
> > +extern int ext4_ext_try_to_merge(struct inode *, struct ext4_ext_path *, struct ext4_extent *);
>
> Again, I do think that sticking the identifiers in there helps
> readability. Although it is not as important in a boring old declaration
> as it is in, say, inode_operations, etc.
>
> Please try to keep the code looking nice in an 80-column display.

Ok. Will make the required changes.

Thanks again for your comments!
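The splitting done by ext4_ext_convert_to_initialized() is easier to follow with the tree manipulation stripped away. The sketch below is a hypothetical userspace model (struct piece, struct split_plan and plan_split() are not part of the patch). Given an uninitialized extent and a write of max_blocks blocks starting at iblock, it computes the ex1/ex2/ex3 ranges using the same arithmetic as the quoted code, without inserting or merging anything.

```c
#include <stdint.h>

/* Hypothetical model of one piece of the split; len == 0 means "not present". */
struct piece {
    uint32_t start;
    uint32_t len;
};

struct split_plan {
    struct piece ex1;  /* left remainder, stays uninitialized */
    struct piece ex2;  /* range being written, becomes initialized */
    struct piece ex3;  /* right remainder, stays uninitialized */
};

/*
 * Assumes ee_block <= iblock < ee_block + ee_len and max_blocks > 0,
 * i.e. the write starts inside the extent.
 */
static struct split_plan plan_split(uint32_t ee_block, uint32_t ee_len,
                                    uint32_t iblock, uint32_t max_blocks)
{
    struct split_plan p = { {0, 0}, {0, 0}, {0, 0} };
    /* Blocks from iblock up to the end of the original extent. */
    uint32_t allocated = ee_len - (iblock - ee_block);

    if (iblock > ee_block) {
        p.ex1.start = ee_block;
        p.ex1.len = iblock - ee_block;
    }
    if (allocated > max_blocks) {
        p.ex3.start = iblock + max_blocks;
        p.ex3.len = allocated - max_blocks;
        allocated = max_blocks;
    }
    p.ex2.start = iblock;
    p.ex2.len = allocated;
    return p;
}
```

For an extent covering blocks 100..149, a write of 20 blocks at block 110 yields all three pieces (the "split into three" case), while a write covering the whole extent yields only ex2 (the "no split" case).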
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 06:04:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 06:04:30 -0700 (PDT) Received: from e34.co.us.ibm.com (e34.co.us.ibm.com [32.97.110.152]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47D4PfB010686 for ; Mon, 7 May 2007 06:04:26 -0700 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e34.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l47D4Odp031307 for ; Mon, 7 May 2007 09:04:24 -0400 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47D4ObH130260 for ; Mon, 7 May 2007 07:04:24 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47D4NSZ024479 for ; Mon, 7 May 2007 07:04:23 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47D4MAa024384; Mon, 7 May 2007 07:04:23 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 774D594BBD; Mon, 7 May 2007 18:34:30 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47D4T2s028246; Mon, 7 May 2007 18:34:29 +0530 Date: Mon, 7 May 2007 18:34:29 +0530 From: "Amit K. 
Arora"
To: Pekka Enberg
Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com
Subject: Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents
Message-ID: <20070507130429.GA6681@amitarora.in.ibm.com>
References: <20070329101010.7a2b8783.akpm@linux-foundation.org>
 <20070330071417.GI355@devserv.devel.redhat.com>
 <20070417125514.GA7574@amitarora.in.ibm.com>
 <20070418130600.GW5967@schatzie.adilger.int>
 <20070420135146.GA21352@amitarora.in.ibm.com>
 <20070420145918.GY355@devserv.devel.redhat.com>
 <20070424121632.GA10136@amitarora.in.ibm.com>
 <20070426175056.GA25321@amitarora.in.ibm.com>
 <20070426181623.GE7209@amitarora.in.ibm.com>
 <84144f020705070540tf3b1986yd4b1ab65e3a17d5e@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <84144f020705070540tf3b1986yd4b1ab65e3a17d5e@mail.gmail.com>
User-Agent: Mutt/1.4.1i
X-archive-position: 11314
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: aarora@linux.vnet.ibm.com
Precedence: bulk
X-list: xfs

On Mon, May 07, 2007 at 03:40:26PM +0300, Pekka Enberg wrote:
> On 4/26/07, Amit K. Arora wrote:
> > /*
> >+ * ext4_ext_try_to_merge:
> >+ * tries to merge the "ex" extent to the next extent in the tree.
> >+ * It always tries to merge towards right. If you want to merge towards
> >+ * left, pass "ex - 1" as argument instead of "ex".
> >+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
> >+ * 1 if they got merged.
> >+ */
> >+int ext4_ext_try_to_merge(struct inode *inode,
> >+			struct ext4_ext_path *path,
> >+			struct ext4_extent *ex)
> >+{
>
> Please either use proper kerneldoc format or drop
> "ext4_ext_try_to_merge" from the comment.

Ok, Thanks.
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 06:07:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 06:07:09 -0700 (PDT) Received: from ug-out-1314.google.com (ug-out-1314.google.com [66.249.92.174]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47D75fB011315 for ; Mon, 7 May 2007 06:07:06 -0700 Received: by ug-out-1314.google.com with SMTP id t39so888009ugd for ; Mon, 07 May 2007 06:07:04 -0700 (PDT) DKIM-Signature: a=rsa-sha1; c=relaxed/relaxed; d=gmail.com; s=beta; h=domainkey-signature:received:received:message-id:date:from:sender:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references:x-google-sender-auth; b=urrBzjEDw6lND4d8kP5iZTiYWtAhrbTo0ORHvRO5Ac0osqutU/p2ps7ovA3enA6g5I7Jm25FfmAzxoA7atZKmc+vZuTtqPfy9vd5MJb3PzeWa9bscWaYVyMm7LqyoXNAnbXxY1thOUP7Bbzn5Tcc1AexjGTILIVaW1ugvFG3vGM= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta; h=received:message-id:date:from:sender:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references:x-google-sender-auth; b=EMuKhDR2U7hdITRMtxB/Oej5BaFNrGc5hvurNCbd0H9fYPUePT20nmMki3NBZJPq657HruCk61mcjc92u/jpzZQ8RuphVoJtrLKhzBQT/CvM4E+FcNF5nfiW9ei7sZh0QH9smIIL1eDa9egvH4kK/9Z+4XVOjfUOcHV5JVm02qg= Received: by 10.67.90.19 with SMTP id s19mr3525671ugl.1178541626311; Mon, 07 May 2007 05:40:26 -0700 (PDT) Received: by 10.67.9.19 with HTTP; Mon, 7 May 2007 05:40:26 -0700 (PDT) Message-ID: <84144f020705070540tf3b1986yd4b1ab65e3a17d5e@mail.gmail.com> Date: Mon, 7 May 2007 15:40:26 +0300 From: "Pekka Enberg" To: "Amit K. 
Arora"
Subject: Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents
Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com
In-Reply-To: <20070426181623.GE7209@amitarora.in.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <20070321120425.GA27273@amitarora.in.ibm.com>
 <20070329101010.7a2b8783.akpm@linux-foundation.org>
 <20070330071417.GI355@devserv.devel.redhat.com>
 <20070417125514.GA7574@amitarora.in.ibm.com>
 <20070418130600.GW5967@schatzie.adilger.int>
 <20070420135146.GA21352@amitarora.in.ibm.com>
 <20070420145918.GY355@devserv.devel.redhat.com>
 <20070424121632.GA10136@amitarora.in.ibm.com>
 <20070426175056.GA25321@amitarora.in.ibm.com>
 <20070426181623.GE7209@amitarora.in.ibm.com>
X-Google-Sender-Auth: 7ffddca7cb123766
X-archive-position: 11315
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: penberg@cs.helsinki.fi
Precedence: bulk
X-list: xfs

On 4/26/07, Amit K. Arora wrote:
> /*
> + * ext4_ext_try_to_merge:
> + * tries to merge the "ex" extent to the next extent in the tree.
> + * It always tries to merge towards right. If you want to merge towards
> + * left, pass "ex - 1" as argument instead of "ex".
> + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
> + * 1 if they got merged.
> + */
> +int ext4_ext_try_to_merge(struct inode *inode,
> +			struct ext4_ext_path *path,
> +			struct ext4_extent *ex)
> +{

Please either use proper kerneldoc format or drop
"ext4_ext_try_to_merge" from the comment.
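For reference, a kernel-doc style comment for a function of this shape might look like the sketch below. The comment layout (an opening /**, a "name - summary" line, @param lines and a Return: section) follows the kernel-doc convention. The function itself is a simplified, hypothetical userspace stand-in (simple_try_to_merge() on an array of struct simple_extent), not the ext4 code, and it assumes two extents are mergeable whenever they are logically contiguous.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical simplified extent: logical start block and length only. */
struct simple_extent {
    unsigned int block;
    unsigned int len;
};

/**
 * simple_try_to_merge - merge an extent with its right-hand neighbour
 * @exts:  array of extents sorted by starting block
 * @count: number of valid entries in @exts; decremented on a merge
 * @idx:   index of the extent to merge towards the right
 *
 * Only merges towards the right. To merge towards the left, pass
 * @idx - 1 instead of @idx.
 *
 * Return: 1 if the extents at @idx and @idx + 1 were merged, 0 otherwise.
 */
static int simple_try_to_merge(struct simple_extent *exts, size_t *count,
                               size_t idx)
{
    if (idx + 1 >= *count)
        return 0;
    /* Toy mergeability rule: the two extents must be logically contiguous. */
    if (exts[idx].block + exts[idx].len != exts[idx + 1].block)
        return 0;

    exts[idx].len += exts[idx + 1].len;
    /* Close the gap left by the absorbed extent. */
    memmove(&exts[idx + 1], &exts[idx + 2],
            (*count - idx - 2) * sizeof(exts[0]));
    (*count)--;
    return 1;
}
```

The other option mentioned, dropping the function name from a plain /* ... */ comment, avoids the comment looking like malformed kernel-doc.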
From owner-xfs@oss.sgi.com Mon May 7 06:22:18 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 06:22:21 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47DMGfB013748 for ; Mon, 7 May 2007 06:22:17 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l47D8pKs005754; Mon, 7 May 2007 09:08:51 -0400 Received: from lacrosse.corp.redhat.com (lacrosse.corp.redhat.com [172.16.52.154]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l47D8otS020113; Mon, 7 May 2007 09:08:50 -0400 Received: from myware66.akkadia.org (vpn-14-5.rdu.redhat.com [10.11.14.5]) by lacrosse.corp.redhat.com (8.12.11.20060308/8.11.6) with ESMTP id l47D8med016906; Mon, 7 May 2007 09:08:49 -0400 Message-ID: <463F24DB.5040406@redhat.com> Date: Mon, 07 May 2007 06:08:43 -0700 From: Ulrich Drepper Organization: Red Hat, Inc. User-Agent: Thunderbird 2.0.0.0 (X11/20070419) MIME-Version: 1.0 To: Jakub Jelinek CC: Andrew Morton , David Chinner , "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com
Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
References: <20070417125514.GA7574@amitarora.in.ibm.com>
 <20070418130600.GW5967@schatzie.adilger.int>
 <20070420135146.GA21352@amitarora.in.ibm.com>
 <20070420145918.GY355@devserv.devel.redhat.com>
 <20070424121632.GA10136@amitarora.in.ibm.com>
 <20070426175056.GA25321@amitarora.in.ibm.com>
 <20070426180332.GA7209@amitarora.in.ibm.com>
 <20070503212955.b1b6443c.akpm@linux-foundation.org>
 <20070504060731.GJ32602149@melbourne.sgi.com>
 <20070503232815.2f62a75e.akpm@linux-foundation.org>
 <20070504065626.GW355@devserv.devel.redhat.com>
In-Reply-To: <20070504065626.GW355@devserv.devel.redhat.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-archive-position: 11316
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: drepper@redhat.com
Precedence: bulk
X-list: xfs

Jakub Jelinek wrote:
> is what glibc does ATM. Seems we violate the case where len == 0, as
> EINVAL in that case is "shall fail". But reading the standard to imply
> negative len is ok is too much guessing, there is no word what it means
> when len is negative and
> "required storage for regular file data starting at offset and continuing for len bytes"
> doesn't make sense for negative size.

This wording has already been cleaned up. The current draft for the
next revision reads:

[EINVAL]
    The len argument is less than or equal to zero, or the offset
    argument is less than zero, or the underlying file system does not
    support this operation.

I still don't like it since len==0 shouldn't create an error (it's
inconsistent) but len<0 is already outlawed.

--
➧ Ulrich Drepper ➧ Red Hat, Inc.
➧ 444 Castro St ➧ Mountain View, CA ❖ From owner-xfs@oss.sgi.com Mon May 7 08:48:21 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 08:48:24 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47FmJfB009196 for ; Mon, 7 May 2007 08:48:21 -0700 Received: from e1.ny.us.ibm.com ([192.168.1.101]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47FOkpf023745 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Mon, 7 May 2007 11:24:46 -0400 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e1.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l47FOffK009339 for ; Mon, 7 May 2007 11:24:41 -0400 Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47FOfJk549588 for ; Mon, 7 May 2007 11:24:41 -0400 Received: from d01av04.pok.ibm.com (loopback [127.0.0.1]) by d01av04.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47FOewe027863 for ; Mon, 7 May 2007 11:24:40 -0400 Received: from [9.53.41.190] (kleikamp.austin.ibm.com [9.53.41.190]) by d01av04.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47FOdo3027760; Mon, 7 May 2007 11:24:39 -0400 Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 From: Dave Kleikamp To: "Amit K. 
Arora" Cc: Andrew Morton , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com In-Reply-To: <20070507120719.GD7012@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507120719.GD7012@amitarora.in.ibm.com> Content-Type: text/plain Date: Mon, 07 May 2007 10:24:37 -0500 Message-Id: <1178551477.12900.6.camel@kleikamp.austin.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.8.3 Content-Transfer-Encoding: 7bit X-archive-position: 11317 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: shaggy@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote: > On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote: > > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" wrote: > > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) > > > +{ > > > + handle_t *handle; > > > + ext4_fsblk_t block, max_blocks; > > > + int ret, ret2, nblocks = 0, retries = 0; > > > + struct buffer_head map_bh; > > > + unsigned int credits, blkbits = inode->i_blkbits; > > > + > > > + /* Currently supporting (pre)allocate mode _only_ */ > > > + if (mode != FA_ALLOCATE) > > > + return -EOPNOTSUPP; > > > + > > > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > > > + return -ENOTTY; > > > > So we don't implement fallocate on bitmap-based files! Well that's huge > > news. 
The changelog would be an appropriate place to communicate this, > > along with reasons why, or a description of the plan to fix it. > > Ok. Will add this in the function description as well. > > > Also, posix says nothing about fallocate() returning ENOTTY. > > Right. I don't seem to find any suitable error from posix description. > Can you please suggest an error code which might make more sense here ? > Will -ENOTSUPP be ok ? Since we want to say here that we don't support > non-extent files. Isn't the idea that libc will interpret -ENOTTY, or whatever is returned here, and fall back to the current library code to do preallocation? This way, the caller of fallocate() will never see this return code, so it won't violate posix. -- David Kleikamp IBM Linux Technology Center From owner-xfs@oss.sgi.com Mon May 7 11:35:04 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 11:35:08 -0700 (PDT) Received: from tur.go2.pl (tur.go2.pl [193.17.41.50]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47IZ0fB008787 for ; Mon, 7 May 2007 11:35:02 -0700 Received: from poczta.o2.pl (mx10.go2.pl [193.17.41.74]) by tur.go2.pl (o2.pl Mailer 2.0.1) with ESMTP id CF0912349DA for ; Mon, 7 May 2007 20:04:28 +0200 (CEST) Received: from poczta.o2.pl (mx10.go2.pl [127.0.0.1]) by poczta.o2.pl (Postfix) with ESMTP id 07A2C58113 for ; Mon, 7 May 2007 20:04:26 +0200 (CEST) Received: from lucke.localnet (xdsl-7687.bielsko.dialog.net.pl [62.87.234.135]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by poczta.o2.pl (Postfix) with ESMTP for ; Mon, 7 May 2007 20:04:25 +0200 (CEST) From: =?utf-8?q?=C5=81ukasz_Fibinger?= Reply-To: lucke@o2.pl To: xfs@oss.sgi.com Subject: RESVSP problems Date: Mon, 7 May 2007 20:04:22 +0200 User-Agent: KMail/1.9.6 MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705072004.22848.lucke@o2.pl> 
X-archive-position: 11318
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: lucke@o2.pl
Precedence: bulk
X-list: xfs

Hello, guys,

I've been trying to implement RESVSP-based allocation in rtorrent. From the
very beginning it has, alas, misbehaved, thus (also considering my very basic
programming skills and experience and unfamiliarity with rtorrent's code)
after hours of trying to determine what's wrong, I finally observed that
blocks of files allocated with RESVSP (previously ftruncated to a proper
size) and being downloaded in rtorrent don't have their unwritten flags
removed (as confirmed by xfs_bmap -vp). In the effect downloaded file
promptly corrupts (read: changes its md5sum). What is interesting, files
RESVSP-allocated in ktorrent and then imported to rtorrent seem to download
properly.

Everything works properly with ALLOCSP (although I've noticed that while
RESVSP worked with l_start = 0 and l_length = size, ALLOCSP worked with
l_start = size and l_length = 0; is that intended?).

I'm not quite sure what's at fault here. Perhaps rtorrent, as it prides itself
on "directly between file pages mapped to memory by the mmap() function and
the network stack". I haven't been yet able to determine how it actually
writes chunks to files (aforementioned lacks of skills, experience and
familiarity). Perhaps it's somehow XFS's fault, hence my posting to this ML.
Any help/suggestions would be appreciated.
Cheers, Luke From owner-xfs@oss.sgi.com Mon May 7 11:46:12 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 11:46:21 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47IkAfB010835 for ; Mon, 7 May 2007 11:46:12 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l47Ik5aL021039; Mon, 7 May 2007 14:46:06 -0400 Received: from pobox-2.corp.redhat.com (pobox-2.corp.redhat.com [10.11.255.15]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l47Ik5VF013479; Mon, 7 May 2007 14:46:05 -0400 Received: from [10.15.80.10] (neon.msp.redhat.com [10.15.80.10]) by pobox-2.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l47Ik4LP014643; Mon, 7 May 2007 14:46:04 -0400 Message-ID: <463F7368.8090101@sandeen.net> Date: Mon, 07 May 2007 13:43:52 -0500 From: Eric Sandeen User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: lucke@o2.pl CC: xfs@oss.sgi.com Subject: Re: RESVSP problems References: <200705072004.22848.lucke@o2.pl> In-Reply-To: <200705072004.22848.lucke@o2.pl> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-archive-position: 11319 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Łukasz Fibinger wrote: > Hello, guys, > > I've been trying to implement RESVSP-based allocation in rtorrent. From the > very beginning it has, alas, misbehaved, thus (also considering my very basic > programming skills and experience and unfamiliarity with rtorrent's code) > after hours of trying to determine what's wrong, I finally observed that > blocks of files allocated with RESVSP (previously ftruncated to a proper > size) and being downloaded in rtorrent don't have their unwritten flags > removed (as confirmed by xfs_bmap -vp). 
You've probably hit:
http://oss.sgi.com/bugzilla/show_bug.cgi?id=418
unwritten extents remain unwritten after mmap() modifies them

Bug dchinner about it... ;-)

> In the effect downloaded file
> promptly corrupts (read: changes its md5sum). What is interesting, files
> RESVSP-allocated in ktorrent and then imported to rtorrent seem to download
> properly.
>
> Everything works properly with ALLOCSP (although I've noticed that while
> RESVSP worked with l_start = 0 and l_length = size, ALLOCSP worked with
> l_start = size and l_length = 0; is that intended?).

yeah... ISTR that the arguments are funky. I can't remember if it's a
bug or not. :) FWIW, allocsp just writes zeros to the file, so you
could do it just as well from userspace w/ no fancy ioctls... ALLOCSP
is a bit pointless if you ask me... though maybe someone knows why it's
there :)

-Eric

> I'm not quite sure what's at fault here. Perhaps rtorrent, as it prides itself
> on "directly between file pages mapped to memory by the mmap() function and
> the network stack". I haven't been yet able to determine how it actually
> writes chunks to files (aforementioned lacks of skills, experience and
> familiarity). Perhaps it's somehow XFS's fault, hence my posting to this ML.
> Any help/suggestions would be appreciated.
> > Cheers, > > Luke > > From owner-xfs@oss.sgi.com Mon May 7 11:58:46 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 11:58:53 -0700 (PDT) Received: from poczta.o2.pl (mx12.go2.pl [193.17.41.142]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47IwjfB013062 for ; Mon, 7 May 2007 11:58:46 -0700 Received: from poczta.o2.pl (mx12 [127.0.0.1]) by poczta.o2.pl (Postfix) with ESMTP id 1FBB83E81A6; Mon, 7 May 2007 20:58:37 +0200 (CEST) Received: from lucke.localnet (xdsl-7687.bielsko.dialog.net.pl [62.87.234.135]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by poczta.o2.pl (Postfix) with ESMTP; Mon, 7 May 2007 20:58:37 +0200 (CEST) From: =?utf-8?q?=C5=81ukasz_Fibinger?= Reply-To: lucke@o2.pl To: Eric Sandeen Subject: Re: RESVSP problems Date: Mon, 7 May 2007 20:58:32 +0200 User-Agent: KMail/1.9.6 References: <200705072004.22848.lucke@o2.pl> <463F7368.8090101@sandeen.net> In-Reply-To: <463F7368.8090101@sandeen.net> Cc: xfs@oss.sgi.com MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705072058.32679.lucke@o2.pl> X-archive-position: 11320 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lucke@o2.pl Precedence: bulk X-list: xfs On Monday 07 of May 2007, you wrote: > You've probably hit: > http://oss.sgi.com/bugzilla/show_bug.cgi?id=418 > unwritten extents remain unwritten after mmap() modifies them > > Bug dchinner about it... ;-) Dave, consider it a bugging from my humble self :-) > yeah... ISTR that the arguments are funky. I can't remember if it's a > bug or not. :) FWIW, allocsp just writes zeros to the file, so you > could do it just as well from userspace w/ no fancy ioctls... ALLOCSP > is a bit pointless if you ask me... 
though maybe someone knows why it's > there :) Let me say that I have noticed that using ALLOCSP seems to create less extents than posix_fallocate/manual zeroing. Thanks for your answer. Incidentally, I'm really happy that XFS has been bestowed upon linux users. Thanks for all your work, guys :-) Cheers, Luke From owner-xfs@oss.sgi.com Mon May 7 12:49:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 12:49:44 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47JnbfB023617 for ; Mon, 7 May 2007 12:49:39 -0700 Received: from localhost.adilger.int (dhcp215-19.nersc.gov [128.55.19.215]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id EB3E17BA315; Mon, 7 May 2007 13:49:36 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 2DAA8406D; Mon, 7 May 2007 05:37:54 -0600 (MDT) Date: Mon, 7 May 2007 05:37:54 -0600 From: Andreas Dilger To: Andrew Morton Cc: "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070507113753.GA5439@schatzie.adilger.int> Mail-Followup-To: Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503213133.d1559f52.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11321 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 03, 2007 21:31 -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" wrote: > > + * ext4_fallocate: > > + * preallocate space for a file > > + * mode is for future use, e.g. for unallocating preallocated blocks etc. > > + */ > > This description is rather thin. What is the filesystem's actual behaviour > here? If the file is using extents then the implementation will do > . If the file is using bitmaps then we will do . > > But what? Here is where it should be described. My understanding is that glibc will handle zero-filling of files for filesystems that do not support fallocate(). 
> > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> > +{
> > +        handle_t *handle;
> > +        ext4_fsblk_t block, max_blocks;
> > +        int ret, ret2, nblocks = 0, retries = 0;
> > +        struct buffer_head map_bh;
> > +        unsigned int credits, blkbits = inode->i_blkbits;
> > +
> > +        /* Currently supporting (pre)allocate mode _only_ */
> > +        if (mode != FA_ALLOCATE)
> > +                return -EOPNOTSUPP;
> > +
> > +        if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > +                return -ENOTTY;
>
> So we don't implement fallocate on bitmap-based files! Well that's huge
> news. The changelog would be an appropriate place to communicate this,
> along with reasons why, or a description of the plan to fix it.
>
> Also, posix says nothing about fallocate() returning ENOTTY.

I _think_ this is to convince glibc to do the zero-filling in userspace,
but I'm not up on the API specifics.

> > +        block = offset >> blkbits;
> > +        max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > +                        - block;
> > +        mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > +        credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > +        mutex_unlock(&EXT4_I(inode)->truncate_mutex);
>
> Now I'm mystified. Given that we're allocating an arbitrary amount of disk
> space, and that this disk space will require an arbitrary amount of
> metadata, how can we work out how much journal space we'll be needing
> without at least looking at `len'?

Good question.

The uninitialized extent can cover up to 128MB with a single entry.
If @path isn't specified, then ext4_ext_calc_credits_for_insert()
function returns the maximum number of extents needed to insert a leaf,
including splitting all of the index blocks. That would allow up to 43GB
(340 extents/block * 128MB) to be preallocated, but it still needs to take
the size of the preallocation into account (adding 3 blocks per 43GB - a
leaf block, a bitmap block and a group descriptor).

Also, since @path is not being given then truncate_mutex is not needed.
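
The 43GB figure above is plain arithmetic. A minimal sketch in C, assuming
a 4096-byte block and 12 bytes each for the on-disk extent header and
extent entry (those sizes are assumptions here, the mail does not state
them); the function names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed on-disk sizes, in bytes; not taken from the mail. */
#define EXTENT_HEADER_BYTES 12u
#define EXTENT_ENTRY_BYTES  12u

/* Number of extent entries that fit in one leaf block after its header. */
uint64_t extents_per_leaf(uint64_t block_size)
{
    assert(block_size > EXTENT_HEADER_BYTES);
    return (block_size - EXTENT_HEADER_BYTES) / EXTENT_ENTRY_BYTES;
}

/* Bytes one full leaf block can map if every extent has the maximum length. */
uint64_t max_prealloc_bytes(uint64_t block_size, uint64_t max_extent_bytes)
{
    return extents_per_leaf(block_size) * max_extent_bytes;
}
```

With a 4096-byte block this gives 340 entries per leaf, and 340 * 128MB =
43520MB, which is the roughly 43GB quoted above.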
> > + ret = ext4_ext_get_blocks(handle, inode, block, > > + max_blocks, &map_bh, > > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > > + BUG_ON(!ret); > > BUG_ON is vicious. Is it really justified here? Possibly a WARN_ON and > ext4_error() would be safer and more useful here. Ouch, not very friendly error handling. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Mon May 7 13:58:42 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 13:58:46 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47KwffB028868 for ; Mon, 7 May 2007 13:58:42 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l47KwQAQ005761 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 7 May 2007 13:58:27 -0700 Received: from akpm.corp.google.com (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l47KwPPl005141; Mon, 7 May 2007 13:58:25 -0700 Date: Mon, 7 May 2007 13:58:25 -0700 From: Andrew Morton To: Andreas Dilger Cc: "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-Id: <20070507135825.f8545a65.akpm@linux-foundation.org> In-Reply-To: <20070507113753.GA5439@schatzie.adilger.int> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.6; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11322 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Mon, 7 May 2007 05:37:54 -0600 Andreas Dilger wrote: > > > + block = offset >> blkbits; > > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > > > + - block; > > > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); > > > > Now I'm mystified. Given that we're allocating an arbitrary amount of disk > > space, and that this disk space will require an arbitrary amount of > > metadata, how can we work out how much journal space we'll be needing > > without at least looking at `len'? > > Good question. 
>
> The uninitialized extent can cover up to 128MB with a single entry.
> If @path isn't specified, then ext4_ext_calc_credits_for_insert()
> function returns the maximum number of extents needed to insert a leaf,
> including splitting all of the index blocks. That would allow up to 43GB
> (340 extents/block * 128MB) to be preallocated, but it still needs to take
> the size of the preallocation into account (adding 3 blocks per 43GB - a
> leaf block, a bitmap block and a group descriptor).

I think the use of ext4_journal_extend() (as Amit has proposed) will help
here, but it is not sufficient.

Because under some circumstances, a journal_extend() failure could mean that
we fail to allocate all the required disk space. If it is infrequent
enough, that is acceptable when the caller is using fallocate() for
performance reasons.

But it is very much not acceptable if the caller is using fallocate() for
space-reservation reasons. If you used fallocate to reserve 1GB of disk and
fallocate() "succeeded" and you later get ENOSPC then you'd have a right to
get a bit upset.

So I think the ext3/4 fallocate() implementation will need to be
implemented as a loop:

        while (len) {
                journal_start();
                len -= do_fallocate(len, ...);
                journal_stop();
        }

Now the interesting question is: what do we do if we get halfway through
this loop and then run out of space? We could leave the disk all filled up
and then return failure to the caller, but that's pretty poor behaviour,
IMO.

Does the proposed implementation handle quotas correctly, btw? Has that
been tested?

Final point: it's fairly disappointing that the present implementation is
ext4-only, and extent-only. I do think we should be aiming at an ext4
bitmap-based implementation and an ext3 implementation.
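
The one-transaction-per-chunk loop, and what the caller sees when space
runs out halfway through, can be modelled in userspace. This is only a toy
sketch: struct toy_fs and the toy_* functions are hypothetical stand-ins
for the journal and the allocator, not ext4 code.

```c
#include <assert.h>
#include <errno.h>

/* Toy model of filesystem state, for illustration only. */
struct toy_fs {
    long long free_bytes;   /* space still available */
    long long max_per_txn;  /* most that one transaction may allocate */
    int txn_open;           /* 1 while a transaction is running */
    int txn_count;          /* transactions started so far */
};

/* Stand-in for journal_start(): open exactly one transaction. */
static void toy_journal_start(struct toy_fs *fs)
{
    assert(!fs->txn_open);
    fs->txn_open = 1;
    fs->txn_count++;
}

/* Stand-in for journal_stop(): close the running transaction. */
static void toy_journal_stop(struct toy_fs *fs)
{
    assert(fs->txn_open);
    fs->txn_open = 0;
}

/* Stand-in for do_fallocate(): bytes allocated, or -ENOSPC when full. */
static long long toy_do_fallocate(struct toy_fs *fs, long long len)
{
    long long got = len < fs->max_per_txn ? len : fs->max_per_txn;

    assert(fs->txn_open);
    if (fs->free_bytes == 0)
        return -ENOSPC;
    if (got > fs->free_bytes)
        got = fs->free_bytes;
    fs->free_bytes -= got;
    return got;
}

/*
 * Allocate len bytes, one transaction per chunk.  Returns 0 or a negative
 * errno; *done reports how much was allocated before the loop stopped.
 */
int toy_fallocate(struct toy_fs *fs, long long len, long long *done)
{
    assert(fs->max_per_txn > 0);
    *done = 0;
    while (len > 0) {
        long long got;

        toy_journal_start(fs);
        got = toy_do_fallocate(fs, len);
        toy_journal_stop(fs);
        if (got < 0)
            return (int)got;    /* earlier chunks stay allocated */
        len -= got;
        *done += got;
    }
    return 0;
}
```

In this model a failed call leaves the earlier chunks allocated, which is
exactly the half-filled state in question; undoing it would need the *done
value or something like it.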
From owner-xfs@oss.sgi.com Mon May 7 15:21:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 15:21:14 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47ML9fB005559 for ; Mon, 7 May 2007 15:21:10 -0700 Received: from localhost.adilger.int (unknown [64.166.152.82]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 99AB27BA306; Mon, 7 May 2007 16:21:08 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 6A5173F57; Mon, 7 May 2007 15:21:04 -0700 (PDT) Date: Mon, 7 May 2007 15:21:04 -0700 From: Andreas Dilger To: Andrew Morton Cc: "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070507222103.GJ8181@schatzie.adilger.int> Mail-Followup-To: Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070507135825.f8545a65.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11323 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 07, 2007 13:58 -0700, Andrew Morton wrote: > Final point: it's fairly disappointing that the present implementation is > ext4-only, and extent-only. I do think we should be aiming at an ext4 > bitmap-based implementation and an ext3 implementation. Actually, this is a non-issue. The reason that it is handled for extent-only is that this is the only way to allocate space in the filesystem without doing the explicit zeroing. For other filesystems (including ext3 and ext4 with block-mapped files) the filesystem should return an error (e.g. -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. 
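
From an application's point of view, the fallback described above looks
roughly like the following. A hedged sketch, assuming the Linux
fallocate(2) call as it was eventually exposed to userspace (the interface
was still under discussion in this thread) and covering only the simple
case of extending a file past its current end, where writing zeros cannot
clobber data. preallocate_tail() and zero_fill() are names invented here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Write zeros over [offset, offset + len).  Only safe where no data exists. */
static int zero_fill(int fd, off_t offset, off_t len)
{
    char buf[65536];

    memset(buf, 0, sizeof(buf));
    while (len > 0) {
        size_t want = len < (off_t)sizeof(buf) ? (size_t)len : sizeof(buf);
        ssize_t n = pwrite(fd, buf, want, offset);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -errno;
        }
        if (n == 0)
            return -EIO;
        offset += n;
        len -= n;
    }
    return 0;
}

/*
 * Grow the file to new_size with allocated blocks.  Try the kernel first;
 * if the filesystem does not support it, zero-fill the tail by hand.
 * Returns 0 or a negative errno.
 */
int preallocate_tail(int fd, off_t new_size)
{
    struct stat st;

    assert(new_size >= 0);
    if (fstat(fd, &st) < 0)
        return -errno;
    if (new_size <= st.st_size)
        return 0;                       /* nothing to extend */

    if (fallocate(fd, 0, st.st_size, new_size - st.st_size) == 0)
        return 0;                       /* fast path: no zeroing needed */
    if (errno != EOPNOTSUPP && errno != ENOSYS)
        return -errno;                  /* real failure, e.g. ENOSPC */

    return zero_fill(fd, st.st_size, new_size - st.st_size);
}
```

Preallocating inside an existing file needs more care than this, since
blindly writing zeros there would overwrite live data.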
From owner-xfs@oss.sgi.com Mon May 7 15:39:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 15:39:43 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47MdcfB007821 for ; Mon, 7 May 2007 15:39:39 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l47McuH6010334 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 7 May 2007 15:38:58 -0700 Received: from akpm.corp.google.com (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l47Mcut9007194; Mon, 7 May 2007 15:38:56 -0700 Date: Mon, 7 May 2007 15:38:56 -0700 From: Andrew Morton To: Andreas Dilger Cc: "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-Id: <20070507153856.d56a5133.akpm@linux-foundation.org> In-Reply-To: <20070507222103.GJ8181@schatzie.adilger.int> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.6; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11324 
X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Mon, 7 May 2007 15:21:04 -0700 Andreas Dilger wrote: > On May 07, 2007 13:58 -0700, Andrew Morton wrote: > > Final point: it's fairly disappointing that the present implementation is > > ext4-only, and extent-only. I do think we should be aiming at an ext4 > > bitmap-based implementation and an ext3 implementation. > > Actually, this is a non-issue. The reason that it is handled for extent-only > is that this is the only way to allocate space in the filesystem without > doing the explicit zeroing. For other filesystems (including ext3 and > ext4 with block-mapped files) the filesystem should return an error (e.g. > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. hrm, spose so. It can be a bit suboptimal from the layout POV. The reservations code will largely save us here, but kernel support might make it a bit better. Totally blowing pagecache could be a problem. Fixable in userspace by using sync_file_range()+fadvise() or O_DIRECT, but I bet it doesn't. 
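
The sync_file_range()+fadvise() variant mentioned above might look like
this in userspace. A sketch only: zero_fill_uncached() and the 1MB chunk
size are assumptions chosen for illustration, and nothing here claims that
glibc does this.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK_BYTES (1 << 20)   /* assumed chunk size, tune as needed */

/*
 * Zero-fill [offset, offset + len) without leaving the pages in pagecache.
 * Each chunk is written, flushed, and then dropped from the cache.
 * Returns 0 or a negative errno.
 */
int zero_fill_uncached(int fd, off_t offset, off_t len)
{
    static const char zeros[CHUNK_BYTES];   /* static storage starts zeroed */

    assert(offset >= 0 && len >= 0);
    while (len > 0) {
        size_t want = len < (off_t)CHUNK_BYTES ? (size_t)len
                                               : (size_t)CHUNK_BYTES;
        ssize_t n = pwrite(fd, zeros, want, offset);
        int rc;

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -errno;
        }
        if (n == 0)
            return -EIO;

        /* Start writeback for this chunk and wait until it completes. */
        if (sync_file_range(fd, offset, n,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER) < 0)
            return -errno;

        /* The pages are clean now, so the kernel is free to drop them. */
        rc = posix_fadvise(fd, offset, n, POSIX_FADV_DONTNEED);
        if (rc != 0)
            return -rc;     /* posix_fadvise returns the error directly */

        offset += n;
        len -= n;
    }
    return 0;
}
```

Waiting on every chunk keeps the amount of dirty pagecache bounded by the
chunk size, at the cost of some throughput.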
From owner-xfs@oss.sgi.com Mon May 7 16:31:52 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 16:31:56 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47NVnfB016106 for ; Mon, 7 May 2007 16:31:52 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l47NVZ47012460 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 7 May 2007 16:31:37 -0700 Received: from akpm.corp.google.com (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l47NVZ6H008256; Mon, 7 May 2007 16:31:35 -0700 Date: Mon, 7 May 2007 16:31:35 -0700 From: Andrew Morton To: Theodore Tso Cc: Andreas Dilger , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-Id: <20070507163135.cf455103.akpm@linux-foundation.org> In-Reply-To: <20070507231442.GA29907@thunk.org> References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <20070507153856.d56a5133.akpm@linux-foundation.org> <20070507231442.GA29907@thunk.org> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.6; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11325 
X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Mon, 7 May 2007 19:14:42 -0400 Theodore Tso wrote: > On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote: > > > Actually, this is a non-issue. The reason that it is handled for extent-only > > > is that this is the only way to allocate space in the filesystem without > > > doing the explicit zeroing. For other filesystems (including ext3 and > > > ext4 with block-mapped files) the filesystem should return an error (e.g. > > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. > > > > It can be a bit suboptimal from the layout POV. The reservations code will > > largely save us here, but kernel support might make it a bit better. > > Actually, the reservations code won't matter, since glibc will fall > back to its current behavior, which is it will do the preallocation by > explicitly writing zeros to the file. No! Reservations code is *critical* here. Without reservations, we get disastrously-bad layout if two processes were running a large fallocate() at the same time. (This is an SMP-only problem, btw: on UP the timeslice lengths save us). My point is that even though reservations save us, we could do even-better in-kernel. But then, a smart application would bypass the glibc fallocate() implementation and would tune the reservation window size and would use direct-IO or sync_file_range()+fadvise(FADV_DONTNEED). > This will result in the same > layout as if we had done the persistent preallocation, but of course > it will mean the posix_fallocate() could potentially take a long time > if you're a PVR and you're reserving a gig or two for a two hour movie > at high quality. That seems suboptimal, granted, and ideally the > application should be warned about this before it calls > posix_fallocate().
On the other hand, it's what happens today, all > the time, so applications won't be too badly surprised. A PVR implementor would take all this over and would do it themselves, for sure. > If we think applications programmers badly need to know in advance if > posix_fallocate() will be fast or slow, probably the right thing is to > define a new fpathconf() configuration option so they can query to see > whether a particular file will support a fast posix_fallocate(). I'm > not 100% convinced such complexity is really needed, but I'm willing > to be convinced.... what do folks think? > An application could do sys_fallocate(one-byte) to work out whether it's supported in-kernel, I guess. From owner-xfs@oss.sgi.com Mon May 7 16:36:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 16:36:41 -0700 (PDT) Received: from thunker.thunk.org (THUNK.ORG [69.25.196.29]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47NabfB016849 for ; Mon, 7 May 2007 16:36:38 -0700 Received: from root (helo=candygram.thunk.org) by thunker.thunk.org with local-esmtps (tls_cipher TLS-1.0:RSA_AES_256_CBC_SHA:32) (Exim 4.50 #1 (Debian)) id 1HlCrZ-00083r-GU; Mon, 07 May 2007 19:43:30 -0400 Received: from tytso by candygram.thunk.org with local (Exim 4.63) (envelope-from ) id 1HlCkg-0006Ub-B6; Mon, 07 May 2007 19:36:22 -0400 Date: Mon, 7 May 2007 19:36:22 -0400 From: Theodore Tso To: Jeff Garzik Cc: Andrew Morton , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070507233622.GB29907@thunk.org> Mail-Followup-To: Theodore Tso , Jeff Garzik , Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <463FB008.3080706@garzik.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <463FB008.3080706@garzik.org> User-Agent: Mutt/1.5.13 (2006-08-11) X-SA-Exim-Connect-IP: X-SA-Exim-Mail-From: tytso@thunk.org X-SA-Exim-Scanned: No (on thunker.thunk.org); SAEximRunCond expanded to false X-archive-position: 11326 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tytso@mit.edu Precedence: bulk X-list: xfs On Mon, May 07, 2007 at 07:02:32PM -0400, Jeff Garzik wrote: > Andreas Dilger wrote: > >On May 07, 2007 13:58 -0700, Andrew Morton wrote: > >>Final point: it's fairly disappointing that the present implementation is > >>ext4-only, and extent-only. I do think we should be aiming at an ext4 > >>bitmap-based implementation and an ext3 implementation. > > > >Actually, this is a non-issue. The reason that it is handled for > >extent-only > >is that this is the only way to allocate space in the filesystem without > >doing the explicit zeroing. For other filesystems (including ext3 and > > Precisely /how/ do you avoid the zeroing issue, for extents? > > If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, > otherwise the implementation is broken. There is a bit in the extent structure which indicates that the extent has not been initialized. 
When reading from a block where the extent is marked as uninitialized, ext4 returns zeros, to avoid returning the uninitialized contents of the disk, which might contain someone else's love letters, p0rn, or other information which we shouldn't leak out. When writing to an extent which is uninitialized, we may potentially have to split the extent into three extents in the worst case. My understanding is that XFS uses a similar implementation; it's a pretty obvious and standard way to implement allocated-but-not-initialized extents. We thought about supporting persistent preallocation for inodes using indirect blocks, but it would require stealing a bit from each entry in the indirect block, reducing the maximum size of the filesystem by a factor of two (i.e., to 2**31 blocks). It was decided it wasn't worth the complexity, given the tradeoffs. - Ted From owner-xfs@oss.sgi.com Mon May 7 16:44:09 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 16:44:13 -0700 (PDT) Received: from thunker.thunk.org (THUNK.ORG [69.25.196.29]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47Ni8fB017858 for ; Mon, 7 May 2007 16:44:09 -0700 Received: from root (helo=candygram.thunk.org) by thunker.thunk.org with local-esmtps (tls_cipher TLS-1.0:RSA_AES_256_CBC_SHA:32) (Exim 4.50 #1 (Debian)) id 1HlCWZ-0007zC-Rg; Mon, 07 May 2007 19:21:48 -0400 Received: from tytso by candygram.thunk.org with local (Exim 4.63) (envelope-from ) id 1HlCPi-0003Ie-6y; Mon, 07 May 2007 19:14:42 -0400 Date: Mon, 7 May 2007 19:14:42 -0400 From: Theodore Tso To: Andrew Morton Cc: Andreas Dilger , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070507231442.GA29907@thunk.org> Mail-Followup-To: Theodore Tso , Andrew Morton , Andreas Dilger , "Amit K.
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <20070507153856.d56a5133.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070507153856.d56a5133.akpm@linux-foundation.org> User-Agent: Mutt/1.5.13 (2006-08-11) X-SA-Exim-Connect-IP: X-SA-Exim-Mail-From: tytso@thunk.org X-SA-Exim-Scanned: No (on thunker.thunk.org); SAEximRunCond expanded to false X-archive-position: 11327 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tytso@mit.edu Precedence: bulk X-list: xfs On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote: > > Actually, this is a non-issue. The reason that it is handled for extent-only > > is that this is the only way to allocate space in the filesystem without > > doing the explicit zeroing. For other filesystems (including ext3 and > > ext4 with block-mapped files) the filesystem should return an error (e.g. > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. > > It can be a bit suboptimal from the layout POV. The reservations code will > largely save us here, but kernel support might make it a bit better. Actually, the reservations code won't matter, since glibc will fall back to its current behavior, which is it will do the preallocation by explicitly writing zeros to the file. 
This will result in the same layout as if we had done the persistent preallocation, but of course it will mean that posix_fallocate() could potentially take a long time if you're a PVR and you're reserving a gig or two for a two-hour movie at high quality. That seems suboptimal, granted, and ideally the application should be warned about this before it calls posix_fallocate(). On the other hand, it's what happens today, all the time, so applications won't be too badly surprised. If we think application programmers badly need to know in advance if posix_fallocate() will be fast or slow, probably the right thing is to define a new fpathconf() configuration option so they can query to see whether a particular file will support a fast posix_fallocate(). I'm not 100% convinced such complexity is really needed, but I'm willing to be convinced.... what do folks think? - Ted From owner-xfs@oss.sgi.com Mon May 7 16:57:48 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 16:57:51 -0700 (PDT) Received: from mail.dvmed.net (srv5.dvmed.net [207.36.208.214]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47NvkfB019596 for ; Mon, 7 May 2007 16:57:47 -0700 Received: from cpe-065-190-194-075.nc.res.rr.com ([65.190.194.75] helo=[10.10.10.10]) by mail.dvmed.net with esmtpsa (Exim 4.63 #1 (Red Hat Linux)) id 1HlCDx-0000ze-FS; Mon, 07 May 2007 23:02:33 +0000 Message-ID: <463FB008.3080706@garzik.org> Date: Mon, 07 May 2007 19:02:32 -0400 From: Jeff Garzik User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Andrew Morton , "Amit K.
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> In-Reply-To: <20070507222103.GJ8181@schatzie.adilger.int> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11328 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeff@garzik.org Precedence: bulk X-list: xfs Andreas Dilger wrote: > On May 07, 2007 13:58 -0700, Andrew Morton wrote: >> Final point: it's fairly disappointing that the present implementation is >> ext4-only, and extent-only. I do think we should be aiming at an ext4 >> bitmap-based implementation and an ext3 implementation. > > Actually, this is a non-issue. The reason that it is handled for extent-only > is that this is the only way to allocate space in the filesystem without > doing the explicit zeroing. For other filesystems (including ext3 and Precisely /how/ do you avoid the zeroing issue, for extents? If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, otherwise the implementation is broken. 
Jeff From owner-xfs@oss.sgi.com Mon May 7 17:16:12 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 17:16:15 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l480GBfB022224 for ; Mon, 7 May 2007 17:16:12 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l480Fii3015034 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 7 May 2007 17:15:46 -0700 Received: from akpm.corp.google.com (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l480FfvA009569; Mon, 7 May 2007 17:15:41 -0700 Date: Mon, 7 May 2007 17:15:41 -0700 From: Andrew Morton To: cmm@us.ibm.com Cc: Andreas Dilger , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-Id: <20070507171541.5370a36a.akpm@linux-foundation.org> In-Reply-To: <1178582424.3933.39.camel@dyn9047017103.beaverton.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <1178582424.3933.39.camel@dyn9047017103.beaverton.ibm.com> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.6; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 
7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11329 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Mon, 07 May 2007 17:00:24 -0700 Mingming Cao wrote: > > + while (ret >= 0 && ret < max_blocks) { > > + block = block + ret; > > + max_blocks = max_blocks - ret; > > + ret = ext4_ext_get_blocks(handle, inode, block, > > + max_blocks, &map_bh, > > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > > + BUG_ON(!ret); > > + if (ret > 0 && test_bit(BH_New, &map_bh.b_state) > > + && ((block + ret) > (i_size_read(inode) << blkbits))) > > + nblocks = nblocks + ret; > > + } > > + > > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) > > + goto retry; > > + > > Now the interesting question is: what do we do if we get halfway through > > this loop and then run out of space? We could leave the disk all filled up > > and then return failure to the caller, but that's pretty poor behaviour, > > IMO. > > > The current code handles earlier ENOSPC by three times retries. After > that if we still run out of space, then it's propably right to notify > the caller there isn't much space left. > > We could extend the block reservation window size before the while loop > so we could get a lower chance to get more fragmented. yes, but my point is that the proposed behaviour is really quite bad. We will attempt to allocate the disk space and then we will return failure, having consumed all the disk space and having partially and uselessly populated an unknown amount of the file. Userspace could presumably repair the mess in most situations by truncating the file back again. The kernel cannot do that because there might be live data in amongst there. 
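The userspace repair mentioned above (truncating the file back after a failed preallocation) might look roughly like the sketch below. It is only an illustration built on posix_fallocate() and ftruncate(); the helper name prealloc_or_rollback is made up here, and it assumes nothing else is extending the file at the same time.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch under discussion): try to
 * preallocate [offset, offset + len) and, if that fails, truncate the
 * file back to the size it had before the call.
 * Returns 0 on success, otherwise a positive errno value.
 */
static int prealloc_or_rollback(int fd, off_t offset, off_t len)
{
	struct stat st;
	int err;

	assert(fd >= 0);
	if (fstat(fd, &st) != 0)
		return errno;

	/* posix_fallocate() returns the error number; it does not set errno. */
	err = posix_fallocate(fd, offset, len);
	if (err == 0)
		return 0;

	/*
	 * Drop whatever was added past the old end of file. Blocks that were
	 * allocated inside the old file size are not undone by this.
	 */
	if (ftruncate(fd, st.st_size) != 0)
		return errno;
	return err;
}
```

The limitation in the second comment is the userspace counterpart of the "live data" problem: a plain truncate can only undo growth past the old end of file.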
So we'd need to either keep track of which blocks were newly-allocated and then free them all again on the error path (doesn't work right across commit+crash+recovery) or we could later use the space-reservation scheme which delayed allocation will need to introduce. Or we could decide to live with the above IMO-crappy behaviour. From owner-xfs@oss.sgi.com Mon May 7 17:20:02 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 17:20:05 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l480K1fB022916 for ; Mon, 7 May 2007 17:20:02 -0700 Received: from e34.co.us.ibm.com (e34.co.us.ibm.com [32.97.110.152]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4800TK3031704 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Mon, 7 May 2007 20:00:30 -0400 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e34.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4800Rgk009304 for ; Mon, 7 May 2007 20:00:27 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4800RNf186946 for ; Mon, 7 May 2007 18:00:27 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4800QhU006734 for ; Mon, 7 May 2007 18:00:27 -0600 Received: from dyn9047017103.beaverton.ibm.com (dyn9047017103.beaverton.ibm.com [9.47.17.103]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4800OlA006675; Mon, 7 May 2007 18:00:25 -0600 Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 From: Mingming Cao Reply-To: cmm@us.ibm.com To: Andrew Morton Cc: Andreas Dilger , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com In-Reply-To: <20070507135825.f8545a65.akpm@linux-foundation.org> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> Content-Type: text/plain Organization: IBM LTC Date: Mon, 07 May 2007 17:00:24 -0700 Message-Id: <1178582424.3933.39.camel@dyn9047017103.beaverton.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-7) Content-Transfer-Encoding: 7bit X-archive-position: 11330 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cmm@us.ibm.com Precedence: bulk X-list: xfs On Mon, 2007-05-07 at 13:58 -0700, Andrew Morton wrote: > On Mon, 7 May 2007 05:37:54 -0600 > Andreas Dilger wrote: > > > > > + block = offset >> blkbits; > > > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > > > > + - block; > > > > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > > > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > > > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); > > > > > > Now I'm mystified. Given that we're allocating an arbitrary amount of disk > > > space, and that this disk space will require an arbitrary amount of > > > metadata, how can we work out how much journal space we'll be needing > > > without at least looking at `len'? > > > > Good question. 
> >
> > The uninitialized extent can cover up to 128MB with a single entry.
> > If @path isn't specified, then ext4_ext_calc_credits_for_insert()
> > function returns the maximum number of extents needed to insert a leaf,
> > including splitting all of the index blocks. That would allow up to 43GB
> > (340 extents/block * 128MB) to be preallocated, but it still needs to take
> > the size of the preallocation into account (adding 3 blocks per 43GB - a
> > leaf block, a bitmap block and a group descriptor).
>
> I think the use of ext4_journal_extend() (as Amit has proposed) will help
> here, but it is not sufficient.
>
> Because under some circumstances, a journal_extend() failure could mean
> that we fail to allocate all the required disk space. If it is infrequent
> enough, that is acceptable when the caller is using fallocate() for
> performance reasons.
>
> But it is very much not acceptable if the caller is using fallocate() for
> space-reservation reasons. If you used fallocate to reserve 1GB of disk
> and fallocate() "succeeded" and you later get ENOSPC then you'd have a
> right to get a bit upset.
>
> So I think the ext3/4 fallocate() implementation will need to be
> implemented as a loop:
>
> 	while (len) {
> 		journal_start();
> 		len -= do_fallocate(len, ...);
> 		journal_stop();
> 	}
>
>
I agree. There is already a loop in Amit's current patch that calls ext4_ext_get_blocks(), though. The question is how many credits ext4 should ask for in each journal_start().

> +/*
> + * ext4_fallocate:
> + * preallocate space for a file
> + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> + */
> +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> +{
....
> +	mutex_lock(&EXT4_I(inode)->truncate_mutex);
> +	credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> +	mutex_unlock(&EXT4_I(inode)->truncate_mutex);

I think the calculation is based on the assumption that there is only a single extent to be inserted, which is the ideal case. But in some cases we may end up allocating several chunks of blocks (extents) for this single preallocation request, when the fs is fragmented or when part of the preallocation request is already fulfilled.

I think we should move this calculation inside the loop as well, and we really do not need to grab the lock to calculate the credits: if @path is always NULL, all the function does is arithmetic.

I can't think of any good way to estimate the total credits needed for this whole preallocation request. I looked at ext4_get_block(), which the DIO code uses to deal with large block allocations. The credit reservation is quite weak there too: DIO_CREDIT is only (EXT4_RESERVE_TRANS_BLOCKS + 32).

> +	handle=ext4_journal_start(inode, credits +
> +			EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1);
> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +retry:
> +	ret = 0;
> +	while (ret >= 0 && ret < max_blocks) {
> +		block = block + ret;
> +		max_blocks = max_blocks - ret;
> +		ret = ext4_ext_get_blocks(handle, inode, block,
> +					max_blocks, &map_bh,
> +					EXT4_CREATE_UNINITIALIZED_EXT, 0);
> +		BUG_ON(!ret);
> +		if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> +			&& ((block + ret) > (i_size_read(inode) << blkbits)))
> +			nblocks = nblocks + ret;
> +	}
> +
> +	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> +		goto retry;
> +
> Now the interesting question is: what do we do if we get halfway through
> this loop and then run out of space? We could leave the disk all filled up
> and then return failure to the caller, but that's pretty poor behaviour,
> IMO.
>
The current code handles an early ENOSPC by retrying three times.
After that, if we still run out of space, it's probably right to notify the caller that there isn't much space left.

We could extend the block reservation window size before the while loop to lower the chance of fragmentation.

>
> Does the proposed implementation handle quotas correctly, btw? Has that
> been tested?
>
I think so. ext4_ext_get_blocks() ends up calling ext4_new_blocks() to do the real block allocation, and quota is handled there, so it is tested already.

Mingming

From owner-xfs@oss.sgi.com Mon May 7 17:30:39 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 17:30:43 -0700 (PDT)
Received: from e31.co.us.ibm.com (e31.co.us.ibm.com [32.97.110.149]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l480UbfB024196 for ; Mon, 7 May 2007 17:30:38 -0700
Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e31.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l480UWm5018735 for ; Mon, 7 May 2007 20:30:32 -0400
Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l480UWTX162186 for ; Mon, 7 May 2007 18:30:32 -0600
Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l480UVng005071 for ; Mon, 7 May 2007 18:30:32 -0600
Received: from dyn9047017103.beaverton.ibm.com (dyn9047017103.beaverton.ibm.com [9.47.17.103]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l480UU96005046; Mon, 7 May 2007 18:30:30 -0600
Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4
From: Mingming Cao
Reply-To: cmm@us.ibm.com
To: Andrew Morton
Cc: Theodore Tso , Andreas Dilger , "Amit K.
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com In-Reply-To: <20070507163135.cf455103.akpm@linux-foundation.org> References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <20070507153856.d56a5133.akpm@linux-foundation.org> <20070507231442.GA29907@thunk.org> <20070507163135.cf455103.akpm@linux-foundation.org> Content-Type: text/plain Organization: IBM LTC Date: Mon, 07 May 2007 17:30:29 -0700 Message-Id: <1178584229.3933.60.camel@dyn9047017103.beaverton.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-7) Content-Transfer-Encoding: 7bit X-archive-position: 11331 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cmm@us.ibm.com Precedence: bulk X-list: xfs On Mon, 2007-05-07 at 16:31 -0700, Andrew Morton wrote: > On Mon, 7 May 2007 19:14:42 -0400 > Theodore Tso wrote: > > > On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote: > > > > Actually, this is a non-issue. The reason that it is handled for extent-only > > > > is that this is the only way to allocate space in the filesystem without > > > > doing the explicit zeroing. For other filesystems (including ext3 and > > > > ext4 with block-mapped files) the filesystem should return an error (e.g. > > > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. > > > > > > It can be a bit suboptimal from the layout POV. The reservations code will > > > largely save us here, but kernel support might make it a bit better. 
> >
> > Actually, the reservations code won't matter, since glibc will fall
> > back to its current behavior, which is it will do the preallocation by
> > explicitly writing zeros to the file.
>
> No! Reservations code is *critical* here. Without reservations, we get
> disastrously-bad layout if two processes were running a large fallocate()
> at the same time. (This is an SMP-only problem, btw: on UP the timeslice
> lengths save us).
>
> My point is that even though reservations save us, we could do even-better
> in-kernel.
>
In this case, since the number of blocks to preallocate (e.g. N=10GB) is known, we could improve the current reservation code to let callers explicitly ask for a new window that has at least N free blocks for the blocks to be preallocated (rather than just at least 1 free block). Before ext4_fallocate() is called, the right reservation window size is set, with a flag indicating "please spend time if needed to find a window that covers at least N free blocks". So for ext4 block-mapped files, later when glibc is doing the allocation and zeroing, the ext4 block-mapped allocator will know to reserve the right number of free blocks before allocating and zeroing the 10GB of space. I am not sure whether this is worth the effort, though.

> But then, a smart application would bypass the glibc() fallocate()
> implementation and would tune the reservation window size and would use
> direct-IO or sync_file_range()+fadvise(FADV_DONTNEED).
>
> > This will result in the same
> > layout as if we had done the persistent preallocation, but of course
> > it will mean the posix_fallocate() could potentially take a long time
> > if you're a PVR and you're reserving a gig or two for a two hour movie
> > at high quality. That seems suboptimal, granted, and ideally the
> > application should be warned about this before it calls
> > posix_fallocate(). On the other hand, it's what happens today, all
> > the time, so applications won't be too badly surprised.
> > A PVR implementor would take all this over and would do it themselves, for > sure. > > > If we think applications programmers badly need to know in advance if > > posix_fallocate() will be fast or slow, probably the right thing is to > > define a new fpathconf() configuration option so they can query to see > > whether a particular file will support a fast posix_fallocate(). I'm > > not 100% convinced such complexity is really needed, but I'm willing > > to be convinced.... what do folks think? > > > > An application could do sys_fallocate(one-byte) to work out whether it's > supported in-kernel, I guess. > From owner-xfs@oss.sgi.com Mon May 7 17:41:56 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 17:42:01 -0700 (PDT) Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l480ftfB025394 for ; Mon, 7 May 2007 17:41:56 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e32.co.us.ibm.com (8.12.11.20060308/8.13.8) with ESMTP id l480cMBk002369 for ; Mon, 7 May 2007 20:38:22 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l480fgcB102890 for ; Mon, 7 May 2007 18:41:42 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l480ffZE025439 for ; Mon, 7 May 2007 18:41:42 -0600 Received: from dyn9047017103.beaverton.ibm.com (dyn9047017103.beaverton.ibm.com [9.47.17.103]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l480fePn025409; Mon, 7 May 2007 18:41:40 -0600 Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 From: Mingming Cao Reply-To: cmm@us.ibm.com To: Andrew Morton Cc: Andreas Dilger , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com In-Reply-To: <20070507171541.5370a36a.akpm@linux-foundation.org> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070507171541.5370a36a.akpm@linux-foundation.org> Content-Type: text/plain Organization: IBM LTC Date: Mon, 07 May 2007 17:41:39 -0700 Message-Id: <1178584899.3933.73.camel@dyn9047017103.beaverton.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-7) Content-Transfer-Encoding: 7bit X-archive-position: 11332 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cmm@us.ibm.com Precedence: bulk X-list: xfs On Mon, 2007-05-07 at 17:15 -0700, Andrew Morton wrote: > On Mon, 07 May 2007 17:00:24 -0700 > Mingming Cao wrote: > > > > + while (ret >= 0 && ret < max_blocks) { > > > + block = block + ret; > > > + max_blocks = max_blocks - ret; > > > + ret = ext4_ext_get_blocks(handle, inode, block, > > > + max_blocks, &map_bh, > > > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > > > + BUG_ON(!ret); > > > + if (ret > 0 && test_bit(BH_New, &map_bh.b_state) > > > + && ((block + ret) > (i_size_read(inode) << blkbits))) > > > + nblocks = nblocks + ret; > > > + } > > > + > > > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) > > > + goto retry; > > > + > > > Now the interesting question is: what do we do if we get halfway through > > > this loop and then run out of space? We could leave the disk all filled up > > > and then return failure to the caller, but that's pretty poor behaviour, > > > IMO. > > > > > The current code handles earlier ENOSPC by three times retries. After > > that if we still run out of space, then it's propably right to notify > > the caller there isn't much space left. 
> >
> > We could extend the block reservation window size before the while loop
> > so we could get a lower chance to get more fragmented.
>
> yes, but my point is that the proposed behaviour is really quite bad.
>
I agree with your point; that's why I mentioned that it only helps with the fragmentation issue, not with the ENOSPC case.

> We will attempt to allocate the disk space and then we will return failure,
> having consumed all the disk space and having partially and uselessly
> populated an unknown amount of the file.
>
Not totally useless, I think. If only half of the space is preallocated because we ran out of space, the application can decide whether that is good enough to start using the preallocated space, or whether to wait for the fs to have more free space.

> Userspace could presumably repair the mess in most situations by truncating
> the file back again. The kernel cannot do that because there might be live
> data in amongst there.
>
> So we'd need to either keep track of which blocks were newly-allocated and
> then free them all again on the error path (doesn't work right across
> commit+crash+recovery) or we could later use the space-reservation scheme which
> delayed allocation will need to introduce.
>
> Or we could decide to live with the above IMO-crappy behaviour.

In fact Amit and I had raised this issue before: whether it's okay to allow partial preallocation. At that time the feedback was that it's not much different from the current zero-out preallocation behavior: people might preallocate halfway and then have to deal with ENOSPC later.

We could check the fs free block count before preallocation happens; if there isn't enough space left, there is no need to bother preallocating. If there is enough free space, we could make a reservation window that has at least N free blocks and mark it as not stealable by other files, so that later we will not run into the ENOSPC error. The fs free block count is just an estimate, though.
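As a rough userspace analogue of the free-block check suggested above, a caller could ask statvfs() how much space the filesystem reports before trying to preallocate. This is a sketch only: fs_has_room is a hypothetical name, the answer is a snapshot that can be stale by the time the allocation runs, and it says nothing about how ext4 would do the check in-kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/statvfs.h>

/*
 * Hypothetical helper: report whether the filesystem containing `path`
 * currently advertises at least `len` bytes as available to unprivileged
 * callers. Returns 1 if so, 0 if not, -1 if statvfs() fails.
 */
static int fs_has_room(const char *path, uint64_t len)
{
	struct statvfs vfs;
	uint64_t needed;

	assert(path != NULL);
	if (statvfs(path, &vfs) != 0 || vfs.f_frsize == 0)
		return -1;

	/* f_bavail is counted in units of f_frsize; round the request up. */
	needed = len / vfs.f_frsize;
	if (len % vfs.f_frsize != 0)
		needed++;

	return needed <= (uint64_t)vfs.f_bavail;
}
```

Because another process can consume the space between the check and the allocation, a caller would still have to handle ENOSPC from the preallocation itself.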
Mingming From owner-xfs@oss.sgi.com Mon May 7 17:59:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 17:59:41 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l480xYfB027310 for ; Mon, 7 May 2007 17:59:36 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA18644; Tue, 8 May 2007 10:59:27 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l480xOAf87607519; Tue, 8 May 2007 10:59:25 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l480xNoZ87644275; Tue, 8 May 2007 10:59:23 +1000 (AEST) Date: Tue, 8 May 2007 10:59:23 +1000 From: David Chinner To: =?iso-8859-1?Q?=C5=81ukasz?= Fibinger Cc: Eric Sandeen , xfs@oss.sgi.com Subject: Re: RESVSP problems Message-ID: <20070508005923.GS77450368@melbourne.sgi.com> References: <200705072004.22848.lucke@o2.pl> <463F7368.8090101@sandeen.net> <200705072058.32679.lucke@o2.pl> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <200705072058.32679.lucke@o2.pl> User-Agent: Mutt/1.4.2.1i X-archive-position: 11333 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, May 07, 2007 at 08:58:32PM +0200, ?ukasz Fibinger wrote: > On Monday 07 of May 2007, you wrote: > > You've probably hit: > > http://oss.sgi.com/bugzilla/show_bug.cgi?id=418 > > unwritten extents remain unwritten after mmap() modifies them > > > > Bug dchinner about it... 
;-) > > Dave, consider it a bugging from my humble self :-) Yeah, yeah ;) I'm waiting to see what happens with Nick's patches in .22 before going any further. If they are not merged into .22, then I think we should push the XFS specific fix in.... > > yeah... ISTR that the arguments are funky. I can't remember if it's a > > bug or not. :) FWIW, allocsp just writes zeros to the file, so you > > could do it just as well from userspace w/ no fancy ioctls... ALLOCSP > > is a bit pointless if you ask me... though maybe someone knows why it's > > there :) > > Let me say that I have noticed that using ALLOCSP seems to create less extents > than posix_fallocate/manual zeroing. Yes, that's likely ;) There's work currently active to make posix_fallocate() do the same thing as ALLOCSP (i.e. call into the filesystem and let it do smart stuff), but that's a ways off yet... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 7 18:07:42 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 18:07:45 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4817efB028619 for ; Mon, 7 May 2007 18:07:42 -0700 Received: from localhost.adilger.int (unknown [64.166.152.82]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 7F4124E457A; Mon, 7 May 2007 19:07:38 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 1941F3F57; Mon, 7 May 2007 18:07:36 -0700 (PDT) Date: Mon, 7 May 2007 18:07:36 -0700 From: Andreas Dilger To: Jeff Garzik Cc: Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070508010736.GO8181@schatzie.adilger.int> Mail-Followup-To: Jeff Garzik , Andrew Morton , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <463FB008.3080706@garzik.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <463FB008.3080706@garzik.org> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11334 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 07, 2007 19:02 -0400, Jeff Garzik wrote: > Andreas Dilger wrote: > >Actually, this is a non-issue. The reason that it is handled for > >extent-only is that this is the only way to allocate space in the > >filesystem without doing the explicit zeroing. > > Precisely /how/ do you avoid the zeroing issue, for extents? > > If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, > otherwise the implementation is broken. In ext4 (as in XFS) there is a flag stored in the extent that tells if the extent is initialized or not. 
Reads from uninitialized extents will return zero-filled data, and writes that don't span the whole extent will cause the uninitialized extent to be split into a regular extent and one or two uninitialized extents (depending where the write is). My comment was just that the extent doesn't have to be explicitly zero filled on the disk, by virtue of the fact that the uninitialized flag will cause reads to return zero. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Mon May 7 18:26:06 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 18:26:10 -0700 (PDT) Received: from mail.dvmed.net (srv5.dvmed.net [207.36.208.214]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l481Q5fB030567 for ; Mon, 7 May 2007 18:26:06 -0700 Received: from cpe-065-190-194-075.nc.res.rr.com ([65.190.194.75] helo=[10.10.10.10]) by mail.dvmed.net with esmtpsa (Exim 4.63 #1 (Red Hat Linux)) id 1HlESi-0001kC-7i; Tue, 08 May 2007 01:25:56 +0000 Message-ID: <463FD1A2.1020505@garzik.org> Date: Mon, 07 May 2007 21:25:54 -0400 From: Jeff Garzik User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Jeff Garzik , Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <463FB008.3080706@garzik.org> <20070508010736.GO8181@schatzie.adilger.int> In-Reply-To: <20070508010736.GO8181@schatzie.adilger.int> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11335 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeff@garzik.org Precedence: bulk X-list: xfs Andreas Dilger wrote: > My comment was just that the extent doesn't have to be explicitly zero > filled on the disk, by virtue of the fact that the uninitialized flag > will cause reads to return zero. Agreed, thanks for the clarification. 
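The zero-on-read behaviour agreed on here can be observed from userspace. The sketch below assumes a freshly created, empty file; prealloc_reads_zero is a hypothetical helper. It passes both on filesystems that mark preallocated extents as uninitialized and on those where the C library emulates the call by writing to the file, so it shows the guarantee, not which mechanism provided it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical check: preallocate `len` bytes at offset 0 of an empty
 * file, then read the range back and confirm every byte is zero.
 * Returns 1 if all bytes are zero, 0 if a non-zero byte is found,
 * -1 on any error.
 */
static int prealloc_reads_zero(int fd, off_t len)
{
	unsigned char buf[4096];
	off_t pos = 0;

	assert(fd >= 0);
	if (posix_fallocate(fd, 0, len) != 0)
		return -1;

	while (pos < len) {
		size_t want = sizeof(buf);
		ssize_t got, i;

		if (len - pos < (off_t)sizeof(buf))
			want = (size_t)(len - pos);

		got = pread(fd, buf, want, pos);
		if (got <= 0)
			return -1;	/* error or unexpected end of file */

		for (i = 0; i < got; i++)
			if (buf[i] != 0)
				return 0;
		pos += got;
	}
	return 1;
}
```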
Jeff From owner-xfs@oss.sgi.com Mon May 7 18:43:54 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 18:43:59 -0700 (PDT) Received: from thunker.thunk.org (THUNK.ORG [69.25.196.29]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l481hrfB032410 for ; Mon, 7 May 2007 18:43:54 -0700 Received: from root (helo=candygram.thunk.org) by thunker.thunk.org with local-esmtps (tls_cipher TLS-1.0:RSA_AES_256_CBC_SHA:32) (Exim 4.50 #1 (Debian)) id 1HlEqi-0008QR-T3; Mon, 07 May 2007 21:50:45 -0400 Received: from tytso by candygram.thunk.org with local (Exim 4.63) (envelope-from ) id 1HlEjp-00009j-MT; Mon, 07 May 2007 21:43:37 -0400 Date: Mon, 7 May 2007 21:43:37 -0400 From: Theodore Tso To: Mingming Cao Cc: Andrew Morton , Andreas Dilger , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070508014337.GA14072@thunk.org> Mail-Followup-To: Theodore Tso , Mingming Cao , Andrew Morton , Andreas Dilger , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com
References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070507171541.5370a36a.akpm@linux-foundation.org> <1178584899.3933.73.camel@dyn9047017103.beaverton.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1178584899.3933.73.camel@dyn9047017103.beaverton.ibm.com>
User-Agent: Mutt/1.5.13 (2006-08-11)
X-SA-Exim-Connect-IP:
X-SA-Exim-Mail-From: tytso@thunk.org
X-SA-Exim-Scanned: No (on thunker.thunk.org); SAEximRunCond expanded to false
X-archive-position: 11336
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tytso@mit.edu
Precedence: bulk
X-list: xfs

On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
> We could check the total number of fs free blocks account before
> preallocation happens, if there isn't enough space left, there is no
> need to bother preallocating.

Checking against the fs free blocks is a good idea, since it will prevent the obvious error case where someone tries to preallocate 10GB when there is only 2GB left. But it won't help if there are multiple processes trying to allocate blocks at the same time. On the other hand, that case is probably relatively rare, and in that case, the filesystem was probably going to be left completely full in any case.

On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote:
> Userspace could presumably repair the mess in most situations by truncating
> the file back again. The kernel cannot do that because there might be live
> data in amongst there.

Actually, the kernel could do it, in that it could simply release all uninitialized extents back to the system.
The problem is distinguishing between the uninitialized extents that had just been newly added and the ones that had been there from before. (On the other hand, if the filesystem was completely full, releasing uninitialized blocks wouldn't be the worst thing in the world to do, although releasing previously fallocated blocks probably does violate the principle of least surprise, even if it's what the user would have wanted.)

On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
> If there is enough free space, we could make a reservation window that
> have at least N free blocks and mark it not stealable by other files. So
> later we will not run into the ENOSPC error.

Could you really use a single reservation window? When the filesystem is almost full, the free extents are likely going to be scattered all over the disk. The general principle of grabbing all of the extents and keeping them in an in-memory data structure, and only adding them to the extent tree would work, though; I'm just not sure we could do it using the existing reservation window code, since it only supports a single reservation window per file, yes?
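The gather-then-commit idea above can be modelled with a toy free list: plan which free extents to take, and only modify the free list once the whole request is known to fit. This is an illustrative sketch, not ext4 code; struct extent and gather_extents are names invented for the example, and a real allocator would also need locking and a journal transaction.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: a run of `len` free blocks starting at block `start`. */
struct extent {
	unsigned long start;
	unsigned long len;
};

/*
 * Collect `want` blocks from scattered free extents into `out`.
 * Phase 1 only plans; the free list is changed in phase 2, and only if
 * the whole request can be satisfied.
 * Returns the number of extents written to `out`, or -1 if the request
 * cannot be met (the free list is then left untouched).
 */
static int gather_extents(struct extent *free_list, size_t nfree,
			  unsigned long want,
			  struct extent *out, size_t out_cap)
{
	unsigned long remaining = want;
	size_t n = 0, i, j;

	assert(free_list != NULL && out != NULL);

	/* Phase 1: plan, taking blocks from the front of each free extent. */
	for (i = 0; i < nfree && remaining > 0; i++) {
		unsigned long take = free_list[i].len;

		if (take == 0)
			continue;
		if (take > remaining)
			take = remaining;
		if (n == out_cap)
			return -1;	/* caller's array is too small */
		out[n].start = free_list[i].start;
		out[n].len = take;
		n++;
		remaining -= take;
	}
	if (remaining > 0)
		return -1;		/* not enough free space in total */

	/* Phase 2: commit. Planned extent j came from the j-th non-empty free extent. */
	for (i = 0, j = 0; i < nfree && j < n; i++) {
		if (free_list[i].len == 0)
			continue;
		free_list[i].start += out[j].len;
		free_list[i].len -= out[j].len;
		j++;
	}
	return (int)n;
}
```

A request larger than the total free space fails without side effects, which is the property the partial-allocation discussion is after; the cost is holding the planned extents in memory until the commit.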
- Ted From owner-xfs@oss.sgi.com Mon May 7 22:03:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 22:03:13 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48539fB010772 for ; Mon, 7 May 2007 22:03:10 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id 1C86018077E7E; Tue, 8 May 2007 00:03:08 -0500 (CDT) Message-ID: <4640048B.6070803@sandeen.net> Date: Tue, 08 May 2007 00:03:07 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: David Chinner CC: =?UTF-8?B?xYF1a2FzeiBGaWJpbmdlcg==?= , xfs@oss.sgi.com Subject: Re: RESVSP problems References: <200705072004.22848.lucke@o2.pl> <463F7368.8090101@sandeen.net> <200705072058.32679.lucke@o2.pl> <20070508005923.GS77450368@melbourne.sgi.com> In-Reply-To: <20070508005923.GS77450368@melbourne.sgi.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11337 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs David Chinner wrote: >>> yeah... ISTR that the arguments are funky. I can't remember if it's a >>> bug or not. :) FWIW, allocsp just writes zeros to the file, so you >>> could do it just as well from userspace w/ no fancy ioctls... ALLOCSP >>> is a bit pointless if you ask me... though maybe someone knows why it's >>> there :) >> Let me say that I have noticed that using ALLOCSP seems to create less extents >> than posix_fallocate/manual zeroing. > > Yes, that's likely ;) > > There's work currently active to make posix_fallocate() do the same thing > as ALLOCSP (i.e. call into the filesystem and let it do smart stuff), but > that's a ways off yet... 
Dave, doesn't ALLOCSP actually create actual zeroed space though? Pretty much as posix_fallocate from userspace does today, maybe with better allocation... And "smart stuff" would be *not* needing to write zeros.... i.e. what RESVSP does. -Eric From owner-xfs@oss.sgi.com Mon May 7 22:25:30 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 22:25:32 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l485PQfB014571 for ; Mon, 7 May 2007 22:25:28 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA24103; Tue, 8 May 2007 15:25:24 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l485PNAf83137079; Tue, 8 May 2007 15:25:23 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l485PLnZ87479850; Tue, 8 May 2007 15:25:21 +1000 (AEST) Date: Tue, 8 May 2007 15:25:21 +1000 From: David Chinner To: Eric Sandeen Cc: David Chinner , =?iso-8859-1?Q?=C5=81ukasz?= Fibinger , xfs@oss.sgi.com Subject: Re: RESVSP problems Message-ID: <20070508052521.GH32602149@melbourne.sgi.com> References: <200705072004.22848.lucke@o2.pl> <463F7368.8090101@sandeen.net> <200705072058.32679.lucke@o2.pl> <20070508005923.GS77450368@melbourne.sgi.com> <4640048B.6070803@sandeen.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4640048B.6070803@sandeen.net> User-Agent: Mutt/1.4.2.1i X-archive-position: 11338 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, May 08, 2007 at 12:03:07AM -0500, Eric Sandeen wrote: > David Chinner wrote: > > >>>yeah... ISTR that the arguments are funky. 
I can't remember if it's a
> >>>bug or not. :) FWIW, allocsp just writes zeros to the file, so you
> >>>could do it just as well from userspace w/ no fancy ioctls... ALLOCSP
> >>>is a bit pointless if you ask me... though maybe someone knows why it's
> >>>there :)
> >>Let me say that I have noticed that using ALLOCSP seems to create less
> >>extents than posix_fallocate/manual zeroing.
> >
> >Yes, that's likely ;)
> >
> >There's work currently active to make posix_fallocate() do the same thing
> >as ALLOCSP (i.e. call into the filesystem and let it do smart stuff), but
> >that's a ways off yet...
>
> Dave, doesn't ALLOCSP actually create actual zeroed space though?

Ah, yes it does - I was sort of lumping allocsp/resvsp together as one
there.

> Pretty much as posix_fallocate from userspace does today, maybe with
> better allocation...

Better allocations and with no ENOSPC-after-partial-zeroing problems,
either.

> And "smart stuff" would be *not* needing to write
> zeros.... i.e. what RESVSP does.

Yup. I've implemented fallocate() with the equivalent of RESVSP.
xfs_zero_eof() is smart enough to not try to zero unwritten extents
so changing the filesize after preallocation is effectively a no-op ;)

Cheers,

Dave.
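
As a rough userspace illustration of the RESVSP-style behaviour
described above (space reserved, no zeros written, file size left
alone), here is a minimal sketch. It assumes the generic Linux
fallocate(2) call with FALLOC_FL_KEEP_SIZE as found in later kernels
and glibc, not the XFS_IOC_RESVSP64 ioctl itself, and reserve_space()
is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper: reserve blocks for [0, len) of fd without
 * writing zeros and without changing the file size. Returns 0 on
 * success or an errno value (e.g. EOPNOTSUPP if the filesystem cannot
 * preallocate) on failure.
 */
int reserve_space(int fd, off_t len)
{
	/* Precondition: a zero or negative length makes no sense here. */
	assert(len > 0);

	/* FALLOC_FL_KEEP_SIZE: allocate blocks, leave st_size untouched. */
	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, len) != 0)
		return errno;
	return 0;
}
```

Extending the file size afterwards (for example with ftruncate()) is
then a separate, cheap step, which matches the "effectively a no-op"
remark above; whether the reserved range is tracked as unwritten
extents depends on the filesystem.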
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

From owner-xfs@oss.sgi.com Mon May 7 23:51:35 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 23:51:37 -0700 (PDT)
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l486pWfB030333 for ; Mon, 7 May 2007 23:51:34 -0700
Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA26143; Tue, 8 May 2007 16:51:28 +1000
Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l486pRAf87623879; Tue, 8 May 2007 16:51:27 +1000 (AEST)
Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l486pQsF87768021; Tue, 8 May 2007 16:51:26 +1000 (AEST)
Date: Tue, 8 May 2007 16:51:26 +1000
From: David Chinner
To: xfs-dev
Cc: xfs-oss
Subject: Review: unwritten extent conversion vs synchronous direct I/O
Message-ID: <20070508065126.GK32602149@melbourne.sgi.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.4.2.1i
X-archive-position: 11339
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: dgc@sgi.com
Precedence: bulk
X-list: xfs

Back in 2.6.13, unwritten extent conversion was changed to be done via
a workqueue because we can't do conversion in interrupt context (AIO
issue).

The problem was that this change made extent conversion run
asynchronously w.r.t. I/O completion. Under heavy load (e.g. 100
fsstress processes), a direct write into an unwritten extent can
complete and return to userspace before the unwritten extent is
converted. If that range of the file is then read immediately, it will
return zeros - unwritten - instead of the data that was written and is
present on disk.

A simple test case to show this is to run 100 fsstress processes, then
loop doing:

	prealloc
	direct write
	bmap

and at some point during this time, the bmap will return an unwritten
extent spanning a range that has already been written.

The following patch fixes the synchronous direct I/O case by triggering
a workqueue flush on detection of a sync direct I/O into an unwritten
extent after queuing the conversion work.

The other approach that could be taken is to simply do the conversion
without passing it off to a work queue. Anyone have a preference on
which would be the better method to choose?

The patch below passes the QA test I wrote to exercise this bug.

Comments?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/linux-2.6/xfs_aops.c |   28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c	2007-04-26 09:25:26.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c	2007-05-08 14:28:20.854616591 +1000
@@ -108,14 +108,19 @@ xfs_page_trace(
 
 /*
  * Schedule IO completion handling on a xfsdatad if this was
- * the final hold on this ioend.
+ * the final hold on this ioend. If we are asked to wait,
+ * flush the workqueue.
  */
 STATIC void
 xfs_finish_ioend(
-	xfs_ioend_t		*ioend)
+	xfs_ioend_t	*ioend,
+	int		wait)
 {
-	if (atomic_dec_and_test(&ioend->io_remaining))
+	if (atomic_dec_and_test(&ioend->io_remaining)) {
 		queue_work(xfsdatad_workqueue, &ioend->io_work);
+		if (wait)
+			flush_workqueue(xfsdatad_workqueue);
+	}
 }
 
 /*
@@ -334,7 +339,7 @@ xfs_end_bio(
 	bio->bi_end_io = NULL;
 	bio_put(bio);
 
-	xfs_finish_ioend(ioend);
+	xfs_finish_ioend(ioend, 0);
 	return 0;
 }
 
@@ -470,7 +475,7 @@ xfs_submit_ioend(
 		}
 		if (bio)
 			xfs_submit_ioend_bio(ioend, bio);
-		xfs_finish_ioend(ioend);
+		xfs_finish_ioend(ioend, 0);
 	} while ((ioend = next) != NULL);
 }
 
@@ -1408,6 +1413,13 @@ xfs_end_io_direct(
 	 * This is not necessary for synchronous direct I/O, but we do
 	 * it anyway to keep the code uniform and simpler.
 	 *
+	 * Well, if only it were that simple. Because synchronous direct I/O
+	 * requires extent conversion to occur *before* we return to userspace,
+	 * we have to wait for extent conversion to complete. Look at the
+	 * iocb that has been passed to use to determine if this is AIO or
+	 * not. If it is synchronous, tell xfs_finish_ioend() to kick the
+	 * workqueue and wait for it to complete.
+	 *
 	 * The core direct I/O code might be changed to always call the
 	 * completion handler in the future, in which case all this can
 	 * go away.
@@ -1415,9 +1427,9 @@ xfs_end_io_direct(
 	ioend->io_offset = offset;
 	ioend->io_size = size;
 	if (ioend->io_type == IOMAP_READ) {
-		xfs_finish_ioend(ioend);
+		xfs_finish_ioend(ioend, 0);
 	} else if (private && size > 0) {
-		xfs_finish_ioend(ioend);
+		xfs_finish_ioend(ioend, is_sync_kiocb(iocb) ? 1 : 0);
 	} else {
 		/*
 		 * A direct I/O write ioend starts it's life in unwritten
@@ -1426,7 +1438,7 @@ xfs_end_io_direct(
 		 * handler.
 		 */
 		INIT_WORK(&ioend->io_work, xfs_end_bio_written);
-		xfs_finish_ioend(ioend);
+		xfs_finish_ioend(ioend, 0);
 	}
 
 	/*

From owner-xfs@oss.sgi.com Mon May 7 23:53:37 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 23:53:40 -0700 (PDT)
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l486rYfB030943 for ; Mon, 7 May 2007 23:53:35 -0700
Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA26229; Tue, 8 May 2007 16:53:29 +1000
Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l486rSAf87718760; Tue, 8 May 2007 16:53:28 +1000 (AEST)
Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l486rRnx87261508; Tue, 8 May 2007 16:53:27 +1000 (AEST)
Date: Tue, 8 May 2007 16:53:27 +1000
From: David Chinner
To: xfs-dev
Cc: xfs-oss
Subject: Review: XFSQA: unwritten extent conversion vs synchronous direct I/O
Message-ID: <20070508065327.GL32602149@melbourne.sgi.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.4.2.1i
X-archive-position: 11340
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: dgc@sgi.com
Precedence: bulk
X-list: xfs

Test to exercise synchronous direct I/O into unwritten extents.

Cheers,

Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group --- xfstests/167 | 65 ++++++++++++++++ xfstests/167.out | 3 xfstests/group | 1 xfstests/src/Makefile | 5 + xfstests/src/unwritten_sync.c | 167 ++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 240 insertions(+), 1 deletion(-) Index: xfs-cmds/xfstests/src/Makefile =================================================================== --- xfs-cmds.orig/xfstests/src/Makefile 2007-05-03 17:10:54.000000000 +1000 +++ xfs-cmds/xfstests/src/Makefile 2007-05-07 10:54:08.296322074 +1000 @@ -10,7 +10,7 @@ TARGETS = dirstress fill fill2 getpagesi mmapcat append_reader append_writer dirperf metaperf \ devzero feature alloc fault fstest t_access_root \ godown resvtest writemod makeextents itrash \ - multi_open_unlink dmiperf + multi_open_unlink dmiperf unwritten_sync LINUX_TARGETS = loggen xfsctl bstat t_mtab getdevicesize \ preallo_rw_pattern_reader preallo_rw_pattern_writer ftrunc trunc \ @@ -111,6 +111,9 @@ looptest: looptest.o locktest: locktest.o $(LINKTEST) +unwritten_sync: unwritten_sync.o + $(LINKTEST) + ifeq ($(PKG_PLATFORM),irix) fill2: fill2.o $(LINKTEST) -lgen Index: xfs-cmds/xfstests/src/unwritten_sync.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ xfs-cmds/xfstests/src/unwritten_sync.c 2007-05-07 11:44:38.668980258 +1000 @@ -0,0 +1,167 @@ +#include +#include +#include +#include +#include +#include + +/* test thanks to judith@sgi.com */ + +#define IO_SIZE 1048576 + +void +print_getbmapx( + const char *pathname, + int fd, + int64_t start, + int64_t limit); + +int +main(int argc, char *argv[]) +{ + int i; + int fd; + char *buf; + struct dioattr dio; + xfs_flock64_t flock; + off_t offset; + char *file; + int loops; + + if(argc != 3) { + fprintf(stderr, "%s \n", argv[0]); + exit(1); + } + + errno = 0; + loops = strtoull(argv[1], NULL, 0); + if (errno) { + perror("strtoull"); + exit(errno); + } + file = argv[2]; + + while 
(loops-- > 0) { + sleep(1); + fd = open(file, O_RDWR|O_CREAT|O_DIRECT, 0666); + if (fd < 0) { + perror("open"); + exit(1); + } + if (xfsctl(file, fd, XFS_IOC_DIOINFO, &dio) < 0) { + perror("dioinfo"); + exit(1); + } + + if ((dio.d_miniosz > IO_SIZE) || (dio.d_maxiosz < IO_SIZE)) { + fprintf(stderr,"Test won't work. Sorry\n"); + exit(1); + } + buf = (char *)memalign(dio.d_mem , IO_SIZE); + if (buf == NULL) { + fprintf(stderr,"Can't get memory\n"); + exit(1); + } + memset(buf,'Z',IO_SIZE); + offset = 0; + + flock.l_whence = 0; + flock.l_start= 0; + flock.l_len = IO_SIZE*21; + if (xfsctl(file, fd, XFS_IOC_RESVSP64, &flock) < 0) { + perror("xfsctl "); + exit(1); + } + for (i = 0; i < 21; i++) { + if (pwrite(fd, buf, IO_SIZE, offset) != IO_SIZE) { + perror("pwrite"); + exit(1); + } + offset += IO_SIZE; + } + + print_getbmapx(file, fd, 0, 0); + + flock.l_whence = 0; + flock.l_start= 0; + flock.l_len = 0; + xfsctl(file, fd, XFS_IOC_FREESP64, &flock); + print_getbmapx(file, fd, 0, 0); + close(fd); + } +} + + + +int +get_getbmapx( + const char *pathname, + int fd, + struct getbmapx *bmapx) +{ + int rc; + + rc = ioctl(fd, XFS_IOC_GETBMAPX, bmapx); + if (rc < 0) { + perror("xfs_ioc_getbmapx"); + exit(1); + } +} + +void +print_getbmapx( +const char *pathname, + int fd, + int64_t start, + int64_t limit) +{ + struct getbmapx bmapx[50]; + int array_size = sizeof(bmapx) / sizeof(bmapx[0]); + int x; + int foundone = 0; + int foundany = 0; + +again: + foundone = 0; + memset(bmapx, '\0', sizeof(bmapx)); + + bmapx[0].bmv_offset = start; + bmapx[0].bmv_length = -1; /* limit - start; */ + bmapx[0].bmv_count = array_size; + bmapx[0].bmv_entries = 0; /* no entries filled in yet */ + + bmapx[0].bmv_iflags = BMV_IF_PREALLOC; + + x = array_size; + for (;;) { + if (x > bmapx[0].bmv_entries) { + if (x != array_size) { + break; /* end of file */ + } + if (get_getbmapx(pathname, fd, bmapx) < 0) { + fprintf(stderr, "getbmapx failed\n"); + exit(1); + } + if (bmapx[0].bmv_entries == 0) { + break; + 
} + x = 1; /* back at first extent in buffer */ + } + if (bmapx[x].bmv_oflags & 1) { + fprintf(stderr, "FOUND ONE %lld %lld %x\n", + bmapx[x].bmv_offset, bmapx[x].bmv_length,bmapx[x].bmv_oflags); + foundone = 1; + foundany = 1; + } + x++; + } + if (foundone) { + sleep(1); + fprintf(stderr,"Repeat\n"); + goto again; + } + if (foundany) { + exit(1); + } +} + Index: xfs-cmds/xfstests/167 =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ xfs-cmds/xfstests/167 2007-05-07 16:02:58.993892587 +1000 @@ -0,0 +1,65 @@ +#! /bin/sh +# FSQA Test No. 167 +# +# unwritten extent conversion test +# +#----------------------------------------------------------------------- +# Copyright (c) 2007 Silicon Graphics, Inc. All Rights Reserved. +#----------------------------------------------------------------------- +# +# creator +owner=dgc@sgi.com + +seq=`basename $0` +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +rm -f $seq.full +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + killall -q -TERM fsstress 2> /dev/null + _cleanup_testdir +} + +workout() +{ + procs=100 + nops=15000 + $FSSTRESS_PROG -d $SCRATCH_MNT -p $procs -n $nops $FSSTRESS_AVOID \ + >>$seq.full & + sleep 2 +} + +# get standard environment, filters and checks +. ./common.rc +. 
./common.filter + +# real QA test starts here +_supported_fs xfs +_supported_os Linux + +_setup_testdir +_require_scratch +_scratch_mkfs_xfs >/dev/null 2>&1 +_scratch_mount + +TEST_FILE=$SCRATCH_MNT/test_file +TEST_PROG=$here/src/unwritten_sync +LOOPS=100 + +echo "*** test unwritten extent conversion under heavy I/O" + +workout + +rm -f $TEST_FILE +$TEST_PROG $LOOPS $TEST_FILE +killall -q -TERM fsstress 2> /dev/null + +echo " *** test done" + +status=0 +exit Index: xfs-cmds/xfstests/167.out =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ xfs-cmds/xfstests/167.out 2007-05-07 11:46:46.560202917 +1000 @@ -0,0 +1,3 @@ +QA output created by 167 +*** test unwritten extent conversion under heavy I/O + *** test done Index: xfs-cmds/xfstests/group =================================================================== --- xfs-cmds.orig/xfstests/group 2007-04-23 16:22:06.000000000 +1000 +++ xfs-cmds/xfstests/group 2007-05-07 10:57:00.721817454 +1000 @@ -246,3 +246,4 @@ pattern ajones@sgi.com 164 rw pattern auto 165 rw pattern auto 166 rw metadata auto +167 rw metadata auto From owner-xfs@oss.sgi.com Tue May 8 01:08:24 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 01:08:26 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4888KfB019878 for ; Tue, 8 May 2007 01:08:22 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id SAA27849; Tue, 8 May 2007 18:08:13 +1000 Date: Tue, 08 May 2007 18:11:37 +1000 From: Timothy Shimmin To: torvalds@linux-foundation.org cc: akpm@osdl.org, xfs@oss.sgi.com Subject: [GIT] XFS updates for 2.6.22 Message-ID: <82BEB52DDD1E753B4E8C2F81@timothy-shimmins-power-mac-g5.local> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; 
charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11341 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Hi Linus, Please pull from: git pull git://oss.sgi.com:8090/xfs/xfs-2.6 --Tim This will update the following files: fs/xfs/linux-2.6/mrlock.h | 12 +++ fs/xfs/linux-2.6/xfs_aops.c | 89 +++++++++++++++++++--- fs/xfs/linux-2.6/xfs_buf.c | 10 ++ fs/xfs/linux-2.6/xfs_buf.h | 3 + fs/xfs/linux-2.6/xfs_fs_subr.c | 21 +++-- fs/xfs/linux-2.6/xfs_fs_subr.h | 2 fs/xfs/linux-2.6/xfs_lrw.c | 163 +++++++++++++++++++++++----------------- fs/xfs/linux-2.6/xfs_vnode.h | 2 fs/xfs/quota/xfs_dquot.c | 3 - fs/xfs/quota/xfs_qm.c | 16 +++- fs/xfs/quota/xfs_qm_syscalls.c | 19 +++-- fs/xfs/quota/xfs_trans_dquot.c | 4 + fs/xfs/support/debug.c | 17 ---- fs/xfs/support/debug.h | 2 fs/xfs/xfs_alloc.c | 2 fs/xfs/xfs_attr.c | 12 +-- fs/xfs/xfs_attr_leaf.c | 2 fs/xfs/xfs_bmap.c | 28 +++---- fs/xfs/xfs_dfrag.c | 6 + fs/xfs/xfs_dir2_block.c | 14 +-- fs/xfs/xfs_dir2_data.c | 7 -- fs/xfs/xfs_dir2_data.h | 2 fs/xfs/xfs_dir2_leaf.c | 7 +- fs/xfs/xfs_dir2_node.c | 4 - fs/xfs/xfs_error.c | 2 fs/xfs/xfs_fsops.c | 4 - fs/xfs/xfs_iget.c | 15 ++-- fs/xfs/xfs_inode.c | 58 +++++++++++--- fs/xfs/xfs_inode.h | 65 ++++++++++++---- fs/xfs/xfs_iocore.c | 2 fs/xfs/xfs_iomap.c | 15 ++-- fs/xfs/xfs_iomap.h | 1 fs/xfs/xfs_log_recover.c | 15 +--- fs/xfs/xfs_mount.c | 2 fs/xfs/xfs_qmops.c | 2 fs/xfs/xfs_quota.h | 3 - fs/xfs/xfs_rename.c | 2 fs/xfs/xfs_rtalloc.c | 6 + fs/xfs/xfs_rw.c | 4 - fs/xfs/xfs_trans.c | 6 - fs/xfs/xfs_trans.h | 4 - fs/xfs/xfs_utils.c | 11 ++- fs/xfs/xfs_vfsops.c | 6 + fs/xfs/xfs_vnodeops.c | 125 ++++++++++++++++++------------- 44 files changed, 491 insertions(+), 304 deletions(-) through these commits: commit f7c66ce3f70d8417de0cfb481ca4e5430382ec5d Author: Lachlan McIlroy Date: Tue May 8 13:50:19 2007 +1000 [XFS] Add lockdep 
support for XFS SGI-PV: 963965 SGI-Modid: xfs-linux-melb:xfs-kern:28485a Signed-off-by: Lachlan McIlroy Signed-off-by: David Chinner Signed-off-by: Tim Shimmin commit 71dfd5a396d11512aa6c8ed0d35b268bc084bb9b Author: Lachlan McIlroy Date: Tue May 8 13:50:12 2007 +1000 [XFS] Fix race in xfs_write() b/w dmapi callout and direct I/O checks. In xfs_write() the iolock is dropped and reacquired in XFS_SEND_DATA() which means that the file could change from not-cached to cached and we need to redo the direct I/O checks. We should also redo the direct I/O checks when the file size changes regardless if O_APPEND is set or not. SGI-PV: 963483 SGI-Modid: xfs-linux-melb:xfs-kern:28440a Signed-off-by: Lachlan McIlroy Signed-off-by: David Chinner Signed-off-by: Tim Shimmin commit 3a02ee1828915d6540b415a160344775e2a4f918 Author: Utako Kusaka Date: Tue May 8 13:50:06 2007 +1000 [XFS] Get rid of redundant "required" in msg. SGI-PV: 963466 SGI-Modid: xfs-linux-melb:xfs-kern:28416a Signed-off-by: Utako Kusaka Signed-off-by: Tim Shimmin Signed-off-by: Christoph Hellwig commit e6a0e9cdff79e1406e5653f759aaf9f59b7ce4c8 Author: Tim Shimmin Date: Tue May 8 13:49:59 2007 +1000 [XFS] Export via a function xfs_buftarg_list for use by kdb/xfsidbg. SGI-PV: 963465 SGI-Modid: xfs-linux-melb:xfs-kern:28414a Signed-off-by: Tim Shimmin Signed-off-by: Lachlan McIlroy commit f10bb2dad02a846966064a531ba6eec301bbb9e0 Author: Tim Shimmin Date: Tue May 8 13:49:53 2007 +1000 [XFS] Remove unused ilen variable and references. SGI-PV: 907752 SGI-Modid: xfs-linux-melb:xfs-kern:28344a Signed-off-by: Tim Shimmin Signed-off-by: Lachlan McIlroy Signed-off-by: Eric Sandeen commit ba87ea699ebd9dd577bf055ebc4a98200e337542 Author: Lachlan McIlroy Date: Tue May 8 13:49:46 2007 +1000 [XFS] Fix to prevent the notorious 'NULL files' problem after a crash. The problem that has been addressed is that of synchronising updates of the file size with writes that extend a file. 
Without the fix the update of a file's size, as a result of a write beyond eof, is independent of when the cached data is flushed to disk. Often the file size update would be written to the filesystem log before the data is flushed to disk. When a system crashes between these two events and the filesystem log is replayed on mount the file's size will be set but since the contents never made it to disk the file is full of holes. If some of the cached data was flushed to disk then it may just be a section of the file at the end that has holes. There are existing fixes to help alleviate this problem, particularly in the case where a file has been truncated, that force cached data to be flushed to disk when the file is closed. If the system crashes while the file(s) are still open then this flushing will never occur. The fix that we have implemented is to introduce a second file size, called the in-memory file size, that represents the current file size as viewed by the user. The existing file size, called the on-disk file size, is the one that get's written to the filesystem log and we only update it when it is safe to do so. When we write to a file beyond eof we only update the in- memory file size in the write operation. Later when the I/O operation, that flushes the cached data to disk completes, an I/O completion routine will update the on-disk file size. The on-disk file size will be updated to the maximum offset of the I/O or to the value of the in-memory file size if the I/O includes eof. SGI-PV: 958522 SGI-Modid: xfs-linux-melb:xfs-kern:28322a Signed-off-by: Lachlan McIlroy Signed-off-by: David Chinner Signed-off-by: Tim Shimmin commit 2a32963130aec5e157b58ff7dfa3dfa1afdf7ca1 Author: Lachlan McIlroy Date: Tue May 8 13:49:39 2007 +1000 [XFS] Fix race condition in xfs_write(). This change addresses a race in xfs_write() where, for direct I/O, the flags need_i_mutex and need_flush are setup before the iolock is acquired. 
The logic used to setup the flags may change between setting the flags and acquiring the iolock resulting in these flags having incorrect values. For example, if a file is not currently cached then need_i_mutex is set to zero and then if the file is cached before the iolock is acquired we will fail to do the flushinval before the direct write. The flush (and also the call to xfs_zero_eof()) need to be done with the iolock held exclusive so we need to acquire the iolock before checking for cached data (or if the write begins after eof) to prevent this state from changing. For direct I/O I've chosen to always acquire the iolock in shared mode initially and if there is a need to promote it then drop it and reacquire it. There's also some other tidy-ups including removing the O_APPEND offset adjustment since that work is done in generic_write_checks() (and we don't use offset as an input parameter anywhere). SGI-PV: 962170 SGI-Modid: xfs-linux-melb:xfs-kern:28319a Signed-off-by: Lachlan McIlroy Signed-off-by: David Chinner Signed-off-by: Tim Shimmin commit e6d29426bc8a5d07d0eebd0842fe0cf6ecc862cd Author: Kouta Ooizumi Date: Tue May 8 13:49:33 2007 +1000 [XFS] Fix uquota and oquota enforcement problems. When uquota and oquota (gquota/pquota) are enabled for accounting both are enforced if ether has enforcement active. Conditions: - Both XFS_UQUOTA_ACCT and XFS_GQUOTA_ACCT are enabled. - Either XFS_UQUOTA_ENFD or XFS_OQUOTA_ENFD is enabled. - The usage without enforce is reached at the soft limit. Problems: 1. "repquota" shows all grace time even if no enforcement. 2. we cannot make a file over a hard limits even if no enforcement. 
SGI-PV: 962291 SGI-Modid: xfs-linux-melb:xfs-kern:28272a Signed-off-by: Kouta Ooizumi Signed-off-by: Donald Douwsma Signed-off-by: Tim Shimmin commit d3cf209476b72c83907a412b6708c5e498410aa7 Author: Lachlan McIlroy Date: Tue May 8 13:49:27 2007 +1000 [XFS] propogate return codes from flush routines This patch handles error return values in fs_flush_pages and fs_flushinval_pages. It changes the prototype of fs_flushinval_pages so we can propogate the errors and handle them at higher layers. I also modified xfs_itruncate_start so that it could propogate the error further. SGI-PV: 961990 SGI-Modid: xfs-linux-melb:xfs-kern:28231a Signed-off-by: Lachlan McIlroy Signed-off-by: Stewart Smith Signed-off-by: Tim Shimmin commit 424ea91ba61c1cdc2dac68576c97030cbf47d84f Author: Donald Douwsma Date: Tue May 8 13:49:15 2007 +1000 [XFS] Fix quotaon syscall failures for group enforcement requests. xfs_qm_scall_quotaon was incorrectly failing requests to enable group quota enforcement. Fixes logic error in OQUOTA handling. SGI-PV: 961964 SGI-Modid: xfs-linux-melb:xfs-kern:28227a Signed-off-by: Donald Douwsma Signed-off-by: Tim Shimmin commit 646d5bdab38c88f4b9088d4e517986a3f3b0edb9 Author: Donald Douwsma Date: Tue May 8 13:49:09 2007 +1000 [XFS] Invalidate quotacheck when mounting without a quota type. When quotas are mounted or remounted without a particular quota type the quota accounting for that type becomes invalid. Previously we were ignoring this leading to accounting errors. SGI-PV: 961964 SGI-Modid: xfs-linux-melb:xfs-kern:28225a Signed-off-by: Donald Douwsma Signed-off-by: Utako Kusaka Signed-off-by: Vlad Apostolov Signed-off-by: Tim Shimmin commit e7a23a9b37c395a153a541d4c50e166eef6abe49 Author: Joe Perches Date: Tue May 8 13:49:03 2007 +1000 [XFS] reducing the number of random number functions. 
Patch provided by Joe Perches SGI-PV: 961696 SGI-Modid: xfs-linux-melb:xfs-kern:28209a Signed-off-by: Joe Perches Signed-off-by: Lachlan McIlroy Signed-off-by: Tim Shimmin commit e9ed9d2240c71014a84043095af4465ffce61367 Author: Eric Sandeen Date: Tue May 8 13:48:56 2007 +1000 [XFS] remove more misc. unused args Patch provided by Eric Sandeen. SGI-PV: 961695 SGI-Modid: xfs-linux-melb:xfs-kern:28205a Signed-off-by: Eric Sandeen Signed-off-by: Lachlan McIlroy Signed-off-by: Tim Shimmin commit ef497f8a1eafe0447f0473940ff2e0f6c8519a14 Author: Eric Sandeen Date: Tue May 8 13:48:49 2007 +1000 [XFS] the "aendp" arg to xfs_dir2_data_freescan is always NULL, remove it. Patch provided by Eric Sandeen. SGI-PV: 961694 SGI-Modid: xfs-linux-melb:xfs-kern:28204a Signed-off-by: Eric Sandeen Signed-off-by: Lachlan McIlroy Signed-off-by: Tim Shimmin commit 1c72bf90037f32fc2b10e0a05dff2640abce8ee2 Author: Eric Sandeen Date: Tue May 8 13:48:42 2007 +1000 [XFS] The last argument "lsn" of xfs_trans_commit() is always called with NULL. Patch provided by Eric Sandeen. 
SGI-PV: 961693 SGI-Modid: xfs-linux-melb:xfs-kern:28199a Signed-off-by: Eric Sandeen Signed-off-by: Lachlan McIlroy Signed-off-by: Tim Shimmin From owner-xfs@oss.sgi.com Tue May 8 03:00:06 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 03:00:09 -0700 (PDT) Received: from atlas.informatik.uni-freiburg.de (atlas.informatik.uni-freiburg.de [132.230.150.3]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48A00fB010168 for ; Tue, 8 May 2007 03:00:01 -0700 Received: from login.informatik.uni-freiburg.de ([132.230.151.6]) by atlas.informatik.uni-freiburg.de with esmtps (TLSv1:DES-CBC3-SHA:168) (Exim 4.66) (envelope-from ) id 1HlLns-00067j-VF for xfs@oss.sgi.com; Tue, 08 May 2007 11:16:17 +0200 Received: from login.informatik.uni-freiburg.de (localhost [127.0.0.1]) by login.informatik.uni-freiburg.de (8.13.8+Sun/8.12.11) with ESMTP id l489GFYW008121 for ; Tue, 8 May 2007 11:16:15 +0200 (MEST) Received: (from zeisberg@localhost) by login.informatik.uni-freiburg.de (8.13.8+Sun/8.12.11/Submit) id l489GEui008120 for xfs@oss.sgi.com; Tue, 8 May 2007 11:16:14 +0200 (MEST) Date: Tue, 8 May 2007 11:16:14 +0200 From: Uwe =?iso-8859-1?Q?Kleine-K=F6nig?= To: xfs@oss.sgi.com Subject: Problems with XFS in a power failure Message-ID: <20070508091613.GA5852@cepheus> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit User-Agent: Mutt/1.5.14+cvs20070321 (2007-03-20) Organization: Universitaet Freiburg, Institut f. Informatik X-archive-position: 11342 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: ukleinek@informatik.uni-freiburg.de Precedence: bulk X-list: xfs Hello, my machine suffered a power failure while doing a apt-get upgrade. This damaged several files. E.g. root@cepheus:~# xxd /var/lib/dpkg/info/myspell-en-us.postrm 0000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 
0000010: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0000020: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0000030: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0000040: 0000 0000 0000 0000 0000 0000 0000 .............. while the repaired file has a size of 78 (= 0x4e) bytes. Some other files got broken with random data. I checked with debsums -c and it reported: root@cepheus:~# debsums -c >=������2[j�b��nw�������� in md5sums for irssi-scripts: ����gV{ڛ� �N���L���Mg{����.�����`����ӈL���j$�kC1'� ��S� ���ݏ�� debsums: invalid line (2) in md5sums for irssi-scripts: ����g�DPH�� 숐���}�]g�����N�ci�5�h �w�W{SZ��q��F_�sR�[���ie�A|��Sv��@�@��;�5�'#c��$��l%���� ��T���$�!d�B�y debsums: invalid line (3) in md5sums for irssi-scripts: �����-ċE��yq�/7đ>�������Ў������Vu����V �+ɋA�f��:%O��_l���}������}� ��1���ȴϘ��=?��&��������F���mT�trZ� ���1���enO%.�YN��=�k��@����\{8ɔw�x����z��-P!g�j����QV9u������)�m���5�l�8l �Rk5�;M���R��� �fx��O gѝ��;����٠�HYfrc��9�����u�q���Ox߀`����~_�ƃ2"J;�Q$vl?�{�=V������ �[��\�d��n�!�UH��Y�D��j2I���*� [�c��G�������[��h*���������2A��m&����������ޥGЉ�;R�0��̦��� ... I don't know exactly, but I think the damaged file here was /var/lib/dpkg/info/irssi.md5sums and debsums repaired it!? In theory this should not happen with a journaled fs, does it? This is a 2.6.19.3 kernel, unfortunately tainted by madwifi. There was nothing logged in dmesg and/or syslog. root@cepheus:~# xfs_info /var meta-data=/dev/hda9 isize=256 agcount=8, agsize=91619 blks = sectsz=512 attr=0 data = bsize=4096 blocks=732952, imaxpct=25 = sunit=0 swidth=0 blks, unwritten=1 naming =version 2 bsize=4096 log =internal bsize=4096 blocks=2560, version=1 = sectsz=512 sunit=0 blks realtime =none extsz=65536 blocks=0, rtextents=0 I have to shutdown that machine because of a power cut by my supplier, but probably I will use a boot cd to bring it up again ... 
Best regards Uwe -- Uwe Kleine-König http://www.google.com/search?q=1+degree+celsius+in+kelvin From owner-xfs@oss.sgi.com Tue May 8 03:25:53 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 03:25:56 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48APqfB015225 for ; Tue, 8 May 2007 03:25:53 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l48APnoq007956 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 8 May 2007 03:25:50 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l48APkHi021147; Tue, 8 May 2007 03:25:47 -0700 Date: Tue, 8 May 2007 03:25:46 -0700 From: Andrew Morton To: Timothy Shimmin Cc: torvalds@linux-foundation.org, xfs@oss.sgi.com Subject: Re: [GIT] XFS updates for 2.6.22 Message-Id: <20070508032546.0728ae95.akpm@linux-foundation.org> In-Reply-To: <82BEB52DDD1E753B4E8C2F81@timothy-shimmins-power-mac-g5.local> References: <82BEB52DDD1E753B4E8C2F81@timothy-shimmins-power-mac-g5.local> X-Mailer: Sylpheed 2.4.1 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11343 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Tue, 08 May 2007 18:11:37 +1000 Timothy Shimmin wrote: > Please pull from: > git pull git://oss.sgi.com:8090/xfs/xfs-2.6 > I pull that regularly and it's always empty. Where did all this code suddenly come from? 
From owner-xfs@oss.sgi.com Tue May 8 03:52:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 03:52:49 -0700 (PDT) Received: from e2.ny.us.ibm.com (e2.ny.us.ibm.com [32.97.182.142]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48AqhfB021631 for ; Tue, 8 May 2007 03:52:45 -0700 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e2.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l48AqgVH002005 for ; Tue, 8 May 2007 06:52:42 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay02.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l48Aqg6c518946 for ; Tue, 8 May 2007 06:52:42 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l48Aqg30020671 for ; Tue, 8 May 2007 06:52:42 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l48AqeM6020596; Tue, 8 May 2007 06:52:41 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id BDCF194C6E; Tue, 8 May 2007 16:22:48 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l48AqmIq011407; Tue, 8 May 2007 16:22:48 +0530 Date: Tue, 8 May 2007 16:22:47 +0530 From: "Amit K. 
Arora" To: Dave Kleikamp Cc: Andrew Morton , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070508105247.GA1950@amitarora.in.ibm.com> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507120719.GD7012@amitarora.in.ibm.com> <1178551477.12900.6.camel@kleikamp.austin.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1178551477.12900.6.camel@kleikamp.austin.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11344 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Mon, May 07, 2007 at 10:24:37AM -0500, Dave Kleikamp wrote: > On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote: > > On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote: > > > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. 
Arora" wrote: > > > > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) > > > > +{ > > > > + handle_t *handle; > > > > + ext4_fsblk_t block, max_blocks; > > > > + int ret, ret2, nblocks = 0, retries = 0; > > > > + struct buffer_head map_bh; > > > > + unsigned int credits, blkbits = inode->i_blkbits; > > > > + > > > > + /* Currently supporting (pre)allocate mode _only_ */ > > > > + if (mode != FA_ALLOCATE) > > > > + return -EOPNOTSUPP; > > > > + > > > > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > > > > + return -ENOTTY; > > > > > > So we don't implement fallocate on bitmap-based files! Well that's huge > > > news. The changelog would be an appropriate place to communicate this, > > > along with reasons why, or a description of the plan to fix it. > > > > Ok. Will add this in the function description as well. > > > > > Also, posix says nothing about fallocate() returning ENOTTY. > > > > Right. I don't seem to find any suitable error from posix description. > > Can you please suggest an error code which might make more sense here ? > > Will -ENOTSUPP be ok ? Since we want to say here that we don't support > > non-extent files. > > Isn't the idea that libc will interpret -ENOTTY, or whatever is returned > here, and fall back to the current library code to do preallocation? > This way, the caller of fallocate() will never see this return code, so > it won't violate posix. You are right. But, we still need to "standardize" (and limit) the error codes which we should return from kernel when we want to fall back on the library implementation. The posix_fallocate() library function will have to look for a set of errors from fallocate() system call, upon receiving which it will do preallocation from user level; or else, it will return success/error-code returned by the system call to the user. 
I think we can make it fall back to the library implementation of fallocate whenever posix_fallocate() receives any of the following errors from the fallocate() system call: 1. ENOSYS 2. EOPNOTSUPP 3. ENOTTY (?) Now the question is - should we limit the set of errors for this purpose to just 1 & 2 above ? In that case I will need to change the error being returned here to -EOPNOTSUPP (from current -ENOTTY). -- Regards, Amit Arora From owner-xfs@oss.sgi.com Tue May 8 06:28:57 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 06:29:00 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48DStfB001716 for ; Tue, 8 May 2007 06:28:57 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id D1B2E18723; Tue, 8 May 2007 15:28:53 +0200 (CEST) Date: Tue, 8 May 2007 15:28:53 +0200 From: Emmanuel Florac To: Uwe =?ISO-8859-1?Q?Kleine-K=F6nig?= Cc: xfs@oss.sgi.com Subject: Re: Problems with XFS in a power failure Message-ID: <20070508152853.1d387fea@galadriel.home> In-Reply-To: <20070508091613.GA5852@cepheus> References: <20070508091613.GA5852@cepheus> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l48DSvfB001742 X-archive-position: 11345 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Tue, 8 May 2007 11:16:14 +0200 you wrote: > In theory this should not happen with a journaled fs, does it? It's the opposite. It's the expected behaviour. It's especially important to guarantee proper power when using journaling filesystems.
-- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Tue May 8 07:12:56 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 07:12:59 -0700 (PDT) Received: from amanpulo.fs3.ph (amanpulo.fs3.ph [72.51.42.241]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48ECufB017002 for ; Tue, 8 May 2007 07:12:56 -0700 Received: from localhost (localhost [127.0.0.1]) by amanpulo.fs3.ph (Postfix) with ESMTP id 25E8E1E0D5967 for ; Tue, 8 May 2007 21:55:59 +0800 (PHT) Received: from amanpulo.fs3.ph ([127.0.0.1]) by localhost (amanpulo.fs3.ph [127.0.0.1]) (amavisd-new, port 10024) with LMTP id 1NVf-U3wH6+s for ; Tue, 8 May 2007 21:55:56 +0800 (PHT) Received: from musang.fs3.ph (smtp01.globe.com.ph [203.177.91.252]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by amanpulo.fs3.ph (Postfix) with ESMTP id A97BD1E0D5953 for ; Tue, 8 May 2007 21:55:55 +0800 (PHT) Received: by musang.fs3.ph (Postfix, from userid 1000) id BD35A2017683; Tue, 8 May 2007 21:55:42 +0800 (PHT) Date: Tue, 8 May 2007 21:55:42 +0800 From: Federico Sevilla III To: xfs@oss.sgi.com Subject: Re: Problems with XFS in a power failure Message-ID: <20070508135542.GF5621@fs3.ph> Mail-Followup-To: xfs@oss.sgi.com References: <20070508091613.GA5852@cepheus> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20070508091613.GA5852@cepheus> X-Personal-URL: http://jijo.free.net.ph User-Agent: Mutt/1.5.13 (2006-08-11) X-archive-position: 11346 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jijo@fs3.ph Precedence: bulk X-list: xfs On Tue, May 08, 2007 at 11:16:14AM +0200, Uwe Kleine-König wrote: > my machine suffered a power failure while doing a apt-get upgrade. Uh-oh.
You hit the (in)famous binary nulls "issue". You may want to read the FAQ entry: http://oss.sgi.com/projects/xfs/faq.html#nulls. -- Federico Sevilla III F S 3 Consulting Inc. http://www.fs3.ph From owner-xfs@oss.sgi.com Tue May 8 07:47:59 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 07:48:02 -0700 (PDT) Received: from e1.ny.us.ibm.com (e1.ny.us.ibm.com [32.97.182.141]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48ElvfB024039 for ; Tue, 8 May 2007 07:47:58 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e1.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l48EluvR013297 for ; Tue, 8 May 2007 10:47:56 -0400 Received: from d01av03.pok.ibm.com (d01av03.pok.ibm.com [9.56.224.217]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l48EluWL551990 for ; Tue, 8 May 2007 10:47:56 -0400 Received: from d01av03.pok.ibm.com (loopback [127.0.0.1]) by d01av03.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l48ElujD011654 for ; Tue, 8 May 2007 10:47:56 -0400 Received: from [9.53.41.190] (kleikamp.austin.ibm.com [9.53.41.190]) by d01av03.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l48Eltlu011606; Tue, 8 May 2007 10:47:55 -0400 Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 From: Dave Kleikamp To: "Amit K. 
Arora" Cc: Andrew Morton , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com In-Reply-To: <20070508105247.GA1950@amitarora.in.ibm.com> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507120719.GD7012@amitarora.in.ibm.com> <1178551477.12900.6.camel@kleikamp.austin.ibm.com> <20070508105247.GA1950@amitarora.in.ibm.com> Content-Type: text/plain Date: Tue, 08 May 2007 09:47:54 -0500 Message-Id: <1178635675.11344.10.camel@kleikamp.austin.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.8.3 Content-Transfer-Encoding: 7bit X-archive-position: 11347 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: shaggy@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Tue, 2007-05-08 at 16:22 +0530, Amit K. Arora wrote: > On Mon, May 07, 2007 at 10:24:37AM -0500, Dave Kleikamp wrote: > > On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote: > > > On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote: > > > > So we don't implement fallocate on bitmap-based files! Well that's huge > > > > news. The changelog would be an appropriate place to communicate this, > > > > along with reasons why, or a description of the plan to fix it. > > > > > > Ok. Will add this in the function description as well. > > > > > > > Also, posix says nothing about fallocate() returning ENOTTY. > > > > > > Right. I don't seem to find any suitable error from posix description. > > > Can you please suggest an error code which might make more sense here ? > > > Will -ENOTSUPP be ok ? 
Since we want to say here that we don't support > > > non-extent files. > > > > Isn't the idea that libc will interpret -ENOTTY, or whatever is returned > > here, and fall back to the current library code to do preallocation? > > This way, the caller of fallocate() will never see this return code, so > > it won't violate posix. > > You are right. > > But, we still need to "standardize" (and limit) the error codes > which we should return from kernel when we want to fall back on the > library implementation. The posix_fallocate() library function will have > to look for a set of errors from fallocate() system call, upon receiving > which it will do preallocation from user level; or else, it will return > success/error-code returned by the system call to the user. > > I think we can make it fall back to library implementation of fallocate, > whenever posix_fallocate() receives any of the following errors from > fallocate() system call: > > 1. ENOSYS > 2. EOPNOTSUPP > 3. ENOTTY (?) > > Now the question is - should we limit the set of errors for this purpose > to just 1 & 2 above ? In that case I will need to change the error being > returned here to -EOPNOTSUPP (from current -ENOTTY). If you want my opinion, -EOPNOTSUPP is better than -ENOTTY. 
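The fallback decision discussed in this thread can be sketched roughly as follows. This is only an illustration, assuming the fallback set ends up limited to ENOSYS and EOPNOTSUPP as suggested above; the names fallocate_should_fall_back, posix_fallocate_sketch and prealloc_fn are hypothetical and are not taken from glibc or from the posted patches.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: returns true when an errno value reported by the
 * fallocate() system call means "not supported here", so the library
 * should emulate preallocation at user level instead.
 * Assumption: only ENOSYS and EOPNOTSUPP trigger the fallback.
 */
bool fallocate_should_fall_back(int err)
{
    switch (err) {
    case ENOSYS:      /* kernel has no fallocate() system call */
    case EOPNOTSUPP:  /* filesystem or file type cannot preallocate */
        return true;
    default:          /* e.g. ENOSPC, EBADF, ENOTTY: pass through */
        return false;
    }
}

/* Hypothetical callback type: returns 0 on success or a positive errno. */
typedef int (*prealloc_fn)(int fd, long long offset, long long len);

/*
 * Sketch of the library-side flow: try the system call first, fall back
 * to user-level emulation only for the agreed error set, and otherwise
 * hand the result of the system call straight back to the caller.
 */
int posix_fallocate_sketch(prealloc_fn sys_call, prealloc_fn emulate,
                           int fd, long long offset, long long len)
{
    int err = sys_call(fd, offset, len);

    if (err != 0 && fallocate_should_fall_back(err))
        return emulate(fd, offset, len);
    return err;
}
```

With a set like this, a filesystem returning -ENOTTY would surface that error to the caller rather than trigger the emulation, which is one reason to return -EOPNOTSUPP from the non-extent case.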
Shaggy -- David Kleikamp IBM Linux Technology Center From owner-xfs@oss.sgi.com Tue May 8 09:53:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 09:53:09 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48Gr2fB021897 for ; Tue, 8 May 2007 09:53:04 -0700 Received: from localhost.adilger.int (unknown [64.166.152.82]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id EA0D24E4557; Tue, 8 May 2007 10:53:00 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 28A3A3FB4; Tue, 8 May 2007 09:52:59 -0700 (PDT) Date: Tue, 8 May 2007 09:52:59 -0700 From: Andreas Dilger To: Theodore Tso , Mingming Cao , Andrew Morton , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070508165259.GD6375@schatzie.adilger.int> Mail-Followup-To: Theodore Tso , Mingming Cao , Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070507171541.5370a36a.akpm@linux-foundation.org> <1178584899.3933.73.camel@dyn9047017103.beaverton.ibm.com> <20070508014337.GA14072@thunk.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070508014337.GA14072@thunk.org> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11348 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 07, 2007 21:43 -0400, Theodore Tso wrote: > On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote: > > Userspace could presumably repair the mess in most situations by truncating > > the file back again. The kernel cannot do that because there might be live > > data in amongst there. > > Actually, the kernel could do it, in that could simply release all > unitialized extents back to the system. The problem is distinguishing > between the unitialized extents that had just been newly added, versus > the ones that had there from before. (On the other hand, if the > filesystem was completely full, releasing unitialized blocks wouldn't > be the worse thing in the world to do, although releasing previously > fallocated blocks probably does violate the princple of least > surprise, even if it's what the user would have wanted.) I tend to agree with this. Having fallocate() fill up the filesystem is exactly what the caller asked for. A write() that hits ENOSPC doesn't truncate off the whole write either, nor does "dd" delete the whole file when the filesystem is full.
Even checking the statfs() space before doing the fallocate() may be counter intuitive, since it will return ENOSPC but the filesystem will not actually be full. Some applications (e.g. database) may WANT to fill the filesystem and then get the actual file size back to avoid trusting statfs() because of metadata overhead (e.g. indirect blocks). One of the design goals for sys_fallocate() was to allow FA_DELALLOC to deallocate unwritten extents in a safe manner. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Tue May 8 10:46:11 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 10:46:15 -0700 (PDT) Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.153]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48Hk9fB032642 for ; Tue, 8 May 2007 10:46:10 -0700 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e35.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l48Hk7kD027946 for ; Tue, 8 May 2007 13:46:07 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l48Hk5a2140062 for ; Tue, 8 May 2007 11:46:05 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l48Hk44h029943 for ; Tue, 8 May 2007 11:46:05 -0600 Received: from dyn9047017103.beaverton.ibm.com (dyn9047017103.beaverton.ibm.com [9.47.17.103]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l48Hk2Vq029851; Tue, 8 May 2007 11:46:03 -0600 Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 From: Mingming Cao Reply-To: cmm@us.ibm.com To: Theodore Tso Cc: Andrew Morton , Andreas Dilger , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com In-Reply-To: <20070508014337.GA14072@thunk.org> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070507171541.5370a36a.akpm@linux-foundation.org> <1178584899.3933.73.camel@dyn9047017103.beaverton.ibm.com> <20070508014337.GA14072@thunk.org> Content-Type: text/plain Organization: IBM LTC Date: Tue, 08 May 2007 10:46:01 -0700 Message-Id: <1178646362.4135.17.camel@dyn9047017103.beaverton.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-7) Content-Transfer-Encoding: 7bit X-archive-position: 11349 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cmm@us.ibm.com Precedence: bulk X-list: xfs On Mon, 2007-05-07 at 21:43 -0400, Theodore Tso wrote: > On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote: > > We could check the total number of fs free blocks account before > > preallocation happens, if there isn't enough space left, there is no > > need to bother preallocating. > > Checking against the fs free blocks is a good idea, since it will > prevent the obvious error case where someone tries to preallocate 10GB > when there is only 2GB left. Thinking about it again, this check is useful when preallocating blocks at EOF. It's not very useful when preallocating a range with holes. In that case 2GB of free space might be enough even if the application tries to preallocate a 10GB range. > But it won't help if there are multiple > processes trying to allocate blocks the same time. On the other hand, > that case is probably relatively rare, and in that case, the > filesystem was probably going to be left completely full in any case.
> On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote: > > Userspace could presumably repair the mess in most situations by truncating > > the file back again. The kernel cannot do that because there might be live > > data in amongst there. > > Actually, the kernel could do it, in that could simply release all > unitialized extents back to the system. The problem is distinguishing > between the unitialized extents that had just been newly added, versus > the ones that had there from before. True. The new uninitialized extents can be merged with adjacent old uninitialized extents, so there is no way to distinguish the just-added uninitialized extents within the merged one. > (On the other hand, if the > filesystem was completely full, releasing unitialized blocks wouldn't > be the worse thing in the world to do, although releasing previously > fallocated blocks probably does violate the princple of least > surprise, even if it's what the user would have wanted.) > > On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote: > > If there is enough free space, we could make a reservation window that > > have at least N free blocks and mark it not stealable by other files. So > > later we will not run into the ENOSPC error. > > Could you really use a single reservation window? When the filesystem > is almost full, the free extents are likely going to be scattered all > over the disk. The general principle of grabbing all of the extents > and keeping them in an in-memory data structure, and only adding them > to the extent tree would work, though; I'm just not sure we could do > it using the existing reservation window code, since it only supports > a single reservation window per file, yes? > You are right. There is only one reservation window per file (and there is a limit to the maximum window size).
So yeah this way it's not going to prevent ENOSPC for sure:( Mingming From owner-xfs@oss.sgi.com Tue May 8 18:06:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 18:06:36 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4916UfB010738 for ; Tue, 8 May 2007 18:06:32 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA24949; Wed, 9 May 2007 11:06:24 +1000 Date: Wed, 09 May 2007 11:09:51 +1000 From: Timothy Shimmin To: Andrew Morton cc: torvalds@linux-foundation.org, xfs@oss.sgi.com Subject: Re: [GIT] XFS updates for 2.6.22 Message-ID: In-Reply-To: <20070508032546.0728ae95.akpm@linux-foundation.org> References: <82BEB52DDD1E753B4E8C2F81@timothy-shimmins-power-mac-g5.local> <20070508032546.0728ae95.akpm@linux-foundation.org> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11350 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Hi Andrew, --On 8 May 2007 3:25:46 AM -0700 Andrew Morton wrote: > On Tue, 08 May 2007 18:11:37 +1000 Timothy Shimmin wrote: > >> Please pull from: >> git pull git://oss.sgi.com:8090/xfs/xfs-2.6 >> > > I pull that regularly and it's always empty. Where did all this > code suddenly come from? It came from our internal tree which is also mirrored in cvs on oss. Our internal tree gets updated (non-xfs) from mainline every so often and has latest kdb patches applied (and dmapi patches). The internal tree is where our changes are originated from before moving out to an absolutely ridiculous number of trees. 
I only update the git tree every so often (start of rc's, important fixes, when I remember:) for Linus. Should I be updating a git branch for you more often? Cheers, Tim. From owner-xfs@oss.sgi.com Tue May 8 18:44:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 18:44:16 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l491iCfB019017 for ; Tue, 8 May 2007 18:44:12 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l491iADi017404 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 8 May 2007 18:44:11 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l491i9Fx007238; Tue, 8 May 2007 18:44:09 -0700 Date: Tue, 8 May 2007 18:44:09 -0700 From: Andrew Morton To: Timothy Shimmin Cc: torvalds@linux-foundation.org, xfs@oss.sgi.com Subject: Re: [GIT] XFS updates for 2.6.22 Message-Id: <20070508184409.e6ad4c8b.akpm@linux-foundation.org> In-Reply-To: References: <82BEB52DDD1E753B4E8C2F81@timothy-shimmins-power-mac-g5.local> <20070508032546.0728ae95.akpm@linux-foundation.org> X-Mailer: Sylpheed 2.4.1 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11351 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Wed, 09 May 2007 11:09:51 +1000 Timothy Shimmin wrote: > Should I be updating a git branch for you more often? Only if you want it tested ;) Yes please. 
From owner-xfs@oss.sgi.com Tue May 8 22:59:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 22:59:29 -0700 (PDT) Received: from tyo202.gate.nec.co.jp (TYO202.gate.nec.co.jp [202.32.8.206]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l495xOfB008438 for ; Tue, 8 May 2007 22:59:26 -0700 Received: from mailgate3.nec.co.jp (mailgate54.nec.co.jp [10.7.69.195]) by tyo202.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l495xLcV009966 for ; Wed, 9 May 2007 14:59:21 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id l495xL321403 for xfs@oss.sgi.com; Wed, 9 May 2007 14:59:21 +0900 (JST) Received: from secsv3.tnes.nec.co.jp (tnesvc2.tnes.nec.co.jp [10.1.101.15]) by mailsv.nec.co.jp (8.11.7/3.7W-MAILSV-NEC) with ESMTP id l495xLO26246 for ; Wed, 9 May 2007 14:59:21 +0900 (JST) Received: from tnesvc2.tnes.nec.co.jp ([10.1.101.15]) by secsv3.tnes.nec.co.jp (ExpressMail 5.10) with SMTP id 20070509.145924.21802300 for ; Wed, 9 May 2007 14:59:24 +0900 Received: FROM tnessv1.tnes.nec.co.jp BY tnesvc2.tnes.nec.co.jp ; Wed May 09 14:59:24 2007 +0900 Received: from rifu.bsd.tnes.nec.co.jp (rifu.bsd.tnes.nec.co.jp [10.1.104.1]) by tnessv1.tnes.nec.co.jp (Postfix) with ESMTP id 07047AE4B3; Wed, 9 May 2007 14:59:21 +0900 (JST) Received: from TNESG9305.tnes.nec.co.jp (TNESG9305.bsd.tnes.nec.co.jp [10.1.104.199]) by rifu.bsd.tnes.nec.co.jp (8.12.11/3.7W/BSD-TNES-MX01) with SMTP id l495xKD3006884; Wed, 9 May 2007 14:59:20 +0900 Message-Id: <200705090559.AA05331@TNESG9305.tnes.nec.co.jp> Date: Wed, 09 May 2007 14:59:11 +0900 To: xfs@oss.sgi.com Subject: [PATCH] Fix xfs_quota path command. 
From: Utako Kusaka MIME-Version: 1.0 X-Mailer: AL-Mail32 Version 1.13 Content-Type: text/plain; charset=us-ascii X-archive-position: 11352 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: utako@tnes.nec.co.jp Precedence: bulk X-list: xfs Hi, In the path command of xfs_quota, the range in the error message is incorrectly shown as 0 to -1 when a list number is specified but the path list is empty. I think the message is unnecessary in this case, just as nothing is printed when no list number is specified. Example: # ./xfs_quota -x xfs_quota> path xfs_quota> path 0 value 0 is out of range (0--1) Signed-off-by: Utako Kusaka --- --- xfsprogs-2.8.20/quota/path.orig 2007-04-26 14:14:00.000000000 +0900 +++ xfsprogs-2.8.20/quota/path.c 2007-04-27 11:27:56.000000000 +0900 @@ -102,6 +102,9 @@ path_f( if (argc <= 1) return pathlist_f(); + if (!fs_count) + return 0; + i = atoi(argv[1]); if (i < 0 || i >= fs_count) { printf(_("value %d is out of range (0-%d)\n"), From owner-xfs@oss.sgi.com Wed May 9 03:52:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 03:52:33 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49AqPfB028412 for ; Wed, 9 May 2007 03:52:26 -0700 Received: from e36.co.us.ibm.com (e36.co.us.ibm.com [32.97.110.154]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l49AEJTa016628 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Wed, 9 May 2007 06:14:20 -0400 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e36.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l49AECOf004909 for ; Wed, 9 May 2007 06:14:13 -0400 Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l49AECD9171760 for ; Wed, 9 May 2007 04:14:12 -0600 Received: from
d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l49AECFt015558 for ; Wed, 9 May 2007 04:14:12 -0600 Received: from qubit.in.ibm.com (wks184594wss.in.ibm.com [9.184.236.184]) by d03av04.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l49AEAHi014659; Wed, 9 May 2007 04:14:11 -0600 Received: from qubit.in.ibm.com (localhost.localdomain [127.0.0.1]) by qubit.in.ibm.com (Postfix) with ESMTP id A8B2A67FFD; Wed, 9 May 2007 15:45:19 +0530 (IST) Received: (from suparna@localhost) by qubit.in.ibm.com (8.13.1/8.13.1/Submit) id l49AFDbE001436; Wed, 9 May 2007 15:45:13 +0530 Date: Wed, 9 May 2007 15:45:07 +0530 From: Suparna Bhattacharya To: Paul Mackerras Cc: Andrew Morton , "Amit K. Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070509101507.GA26056@in.ibm.com> Reply-To: suparna@in.ibm.com References: <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <17978.47502.786970.196554@cargo.ozlabs.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <17978.47502.786970.196554@cargo.ozlabs.ibm.com> User-Agent: Mutt/1.5.11 X-archive-position: 11354 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: suparna@in.ibm.com Precedence: bulk X-list: xfs On Fri, May 04, 2007 at 02:41:50PM +1000, Paul Mackerras wrote: > Andrew 
Morton writes: > > > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > > > This patch implements the fallocate() system call and adds support for > > > i386, x86_64 and powerpc. > > > > > > ... > > > > > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > > > Please add a comment over this function which specifies its behaviour. > > Really it should be enough material from which a full manpage can be > > written. > > This looks like it will have the same problem on s390 as > sys_sync_file_range. Maybe the prototype should be: > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) Yes, but the trouble is that there was a contrary viewpoint preferring that fd first be maintained as a convention like other syscalls (see the following posts) http://marc.info/?l=linux-fsdevel&m=117585330016809&w=2 (Andreas) http://marc.info/?l=linux-fsdevel&m=117690157917378&w=2 (Andreas) http://marc.info/?l=linux-fsdevel&m=117578821827323&w=2 (Randy) So we are kind of deadlocked, aren't we ? The debates on the proposed solution for s390 http://marc.info/?l=linux-fsdevel&m=117760995610639&w=2 http://marc.info/?l=linux-fsdevel&m=117708124913098&w=2 http://marc.info/?l=linux-fsdevel&m=117767607229807&w=2 Are there any better ideas ? Regards Suparna > > Paul. 
> - > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India From owner-xfs@oss.sgi.com Wed May 9 03:51:42 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 03:51:46 -0700 (PDT) Received: from ozlabs.org (ozlabs.org [203.10.76.45]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49ApefB028095 for ; Wed, 9 May 2007 03:51:42 -0700 Received: by ozlabs.org (Postfix, from userid 1003) id C15B7DDE44; Wed, 9 May 2007 20:51:39 +1000 (EST) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17985.42884.971318.859402@cargo.ozlabs.ibm.com> Date: Wed, 9 May 2007 20:50:44 +1000 From: Paul Mackerras To: suparna@in.ibm.com Cc: Andrew Morton , "Amit K. Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc In-Reply-To: <20070509101507.GA26056@in.ibm.com> References: <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <17978.47502.786970.196554@cargo.ozlabs.ibm.com> <20070509101507.GA26056@in.ibm.com> X-Mailer: VM 7.19 under Emacs 21.4.1 X-archive-position: 11353 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: paulus@samba.org Precedence: bulk X-list: xfs 
Suparna Bhattacharya writes: > > This looks like it will have the same problem on s390 as > > sys_sync_file_range. Maybe the prototype should be: > > > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) > > Yes, but the trouble is that there was a contrary viewpoint preferring that fd > first be maintained as a convention like other syscalls (see the following > posts) Of course the interface used by an application program would have the fd first. Glibc can do the translation. Paul. From owner-xfs@oss.sgi.com Wed May 9 04:08:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 04:08:58 -0700 (PDT) Received: from e31.co.us.ibm.com (e31.co.us.ibm.com [32.97.110.149]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49B8rfB001993 for ; Wed, 9 May 2007 04:08:55 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e31.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l49B8g6b005958 for ; Wed, 9 May 2007 07:08:42 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l49B8g8B178256 for ; Wed, 9 May 2007 05:08:42 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l49B8fSu020108 for ; Wed, 9 May 2007 05:08:42 -0600 Received: from qubit.in.ibm.com (wks184594wss.in.ibm.com [9.184.236.184]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l49B8fFH020062; Wed, 9 May 2007 05:08:41 -0600 Received: from qubit.in.ibm.com (localhost.localdomain [127.0.0.1]) by qubit.in.ibm.com (Postfix) with ESMTP id F2E5767FFD; Wed, 9 May 2007 16:40:17 +0530 (IST) Received: (from suparna@localhost) by qubit.in.ibm.com (8.13.1/8.13.1/Submit) id l49BAHY1024578; Wed, 9 May 2007 16:40:17 +0530 Date: Wed, 9 May 2007 16:40:11 +0530 From: Suparna Bhattacharya To: Paul Mackerras Cc: Andrew Morton , "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070509111011.GA21619@in.ibm.com> Reply-To: suparna@in.ibm.com References: <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <17978.47502.786970.196554@cargo.ozlabs.ibm.com> <20070509101507.GA26056@in.ibm.com> <17985.42884.971318.859402@cargo.ozlabs.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <17985.42884.971318.859402@cargo.ozlabs.ibm.com> User-Agent: Mutt/1.5.11 X-archive-position: 11355 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: suparna@in.ibm.com Precedence: bulk X-list: xfs On Wed, May 09, 2007 at 08:50:44PM +1000, Paul Mackerras wrote: > Suparna Bhattacharya writes: > > > > This looks like it will have the same problem on s390 as > > > sys_sync_file_range. Maybe the prototype should be: > > > > > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) > > > > Yes, but the trouble is that there was a contrary viewpoint preferring that fd > > first be maintained as a convention like other syscalls (see the following > > posts) > > Of course the interface used by an application program would have the > fd first. Glibc can do the translation. I think that was understood. Regards Suparna > > Paul. 
-- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India From owner-xfs@oss.sgi.com Wed May 9 04:37:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 04:37:34 -0700 (PDT) Received: from ozlabs.org (ozlabs.org [203.10.76.45]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49BbSfB013664 for ; Wed, 9 May 2007 04:37:29 -0700 Received: by ozlabs.org (Postfix, from userid 1003) id EA6BDDDE1A; Wed, 9 May 2007 21:37:27 +1000 (EST) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17985.45682.284634.969153@cargo.ozlabs.ibm.com> Date: Wed, 9 May 2007 21:37:22 +1000 From: Paul Mackerras To: suparna@in.ibm.com Cc: Andrew Morton , "Amit K. Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc In-Reply-To: <20070509111011.GA21619@in.ibm.com> References: <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <17978.47502.786970.196554@cargo.ozlabs.ibm.com> <20070509101507.GA26056@in.ibm.com> <17985.42884.971318.859402@cargo.ozlabs.ibm.com> <20070509111011.GA21619@in.ibm.com> X-Mailer: VM 7.19 under Emacs 21.4.1 X-archive-position: 11356 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: paulus@samba.org Precedence: bulk X-list: xfs Suparna Bhattacharya writes: > > Of course the interface used by an application program would have the > > fd first. Glibc can do the translation. > > I think that was understood. 
OK, then what does it matter what the glibc/kernel interface is, as long as it works? It's only a minor point; the order of arguments can vary between architectures if necessary, but it's nicer if they don't have to. 32-bit powerpc will need to have the two int arguments adjacent in order to avoid using more than 6 argument registers at the user/kernel boundary, and s390 will need to avoid having a 64-bit argument last (if I understand it correctly). Paul. From owner-xfs@oss.sgi.com Wed May 9 05:05:59 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 05:06:03 -0700 (PDT) Received: from e1.ny.us.ibm.com (e1.ny.us.ibm.com [32.97.182.141]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49C5wfB021602 for ; Wed, 9 May 2007 05:05:59 -0700 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e1.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l49C5vv6004481 for ; Wed, 9 May 2007 08:05:57 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay02.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l49C5vn6531388 for ; Wed, 9 May 2007 08:05:57 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l49C5vNd025124 for ; Wed, 9 May 2007 08:05:57 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l49C5umt024820; Wed, 9 May 2007 08:05:56 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 5139793C0B; Wed, 9 May 2007 17:36:00 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l49C5xBc002968; Wed, 9 May 2007 17:35:59 +0530 Date: Wed, 9 May 2007 17:35:59 +0530 From: "Amit K. 
Arora" To: Paul Mackerras Cc: suparna@in.ibm.com, Andrew Morton , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070509120559.GA19430@amitarora.in.ibm.com> References: <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <17978.47502.786970.196554@cargo.ozlabs.ibm.com> <20070509101507.GA26056@in.ibm.com> <17985.42884.971318.859402@cargo.ozlabs.ibm.com> <20070509111011.GA21619@in.ibm.com> <17985.45682.284634.969153@cargo.ozlabs.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <17985.45682.284634.969153@cargo.ozlabs.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11357 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Wed, May 09, 2007 at 09:37:22PM +1000, Paul Mackerras wrote: > Suparna Bhattacharya writes: > > > > Of course the interface used by an application program would have the > > > fd first. Glibc can do the translation. > > > > I think that was understood. > > OK, then what does it matter what the glibc/kernel interface is, as > long as it works? > > It's only a minor point; the order of arguments can vary between > architectures if necessary, but it's nicer if they don't have to. > 32-bit powerpc will need to have the two int arguments adjacent in > order to avoid using more than 6 argument registers at the user/kernel > boundary, and s390 will need to avoid having a 64-bit argument last > (if I understand it correctly). You are right to say that. 
But, it may not be _that_ a minor point, especially for the arch which is getting affected. It has other implications like what Heiko noticed in his post below: http://lkml.org/lkml/2007/4/27/377 - implications like modifying glibc and *trace utilities for a particular arch. -- Regards, Amit Arora From owner-xfs@oss.sgi.com Wed May 9 05:28:51 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 05:28:55 -0700 (PDT) Received: from an-out-0708.google.com (an-out-0708.google.com [209.85.132.248]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49CSmfB028386 for ; Wed, 9 May 2007 05:28:50 -0700 Received: by an-out-0708.google.com with SMTP id c25so33820ana for ; Wed, 09 May 2007 05:28:48 -0700 (PDT) DKIM-Signature: a=rsa-sha1; c=relaxed/relaxed; d=googlemail.com; s=beta; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=k82pNmCQG6xCtzvBGnA0DcnHTp3uogf8iXrFeu2ESrspAtjyuI5di7SXd8oVHD6YcfovSY0Cy+TEdHl51eX07BRjau7AZKfFVGpw9W7y77qimQUhMP2b/y1TXK/aAK9hiIz5EQqGaVAtcfl0erlPsKGkeEkzwi/ajHMuLybLOzI= DomainKey-Signature: a=rsa-sha1; c=nofws; d=googlemail.com; s=beta; h=received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=a2VKPe4iK52E4mrndjj8ctLhkaJWneH5H3bKROdSv/dILYa6Il+YpwSYwjgNWXo8yMpsqv9m3mzVE7dcB55H+bm9QEqow1Aef2PRwGza9eo0XyC6lQL4WYeeflCjtZmrSVaXpmc66I5a2fCoiSavV49zRM9eb4UXNL4BTd/DSFs= Received: by 10.100.119.14 with SMTP id r14mr314081anc.1178712019528; Wed, 09 May 2007 05:00:19 -0700 (PDT) Received: by 10.100.44.6 with HTTP; Wed, 9 May 2007 05:00:19 -0700 (PDT) Message-ID: <6e0cfd1d0705090500u3423877u579ebace44100b77@mail.gmail.com> Date: Wed, 9 May 2007 14:00:19 +0200 From: "Martin Schwidefsky" To: "Paul Mackerras" Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Cc: suparna@in.ibm.com, "Andrew Morton" , 
"Amit K. Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com In-Reply-To: <17985.45682.284634.969153@cargo.ozlabs.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <20070418130600.GW5967@schatzie.adilger.int> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <17978.47502.786970.196554@cargo.ozlabs.ibm.com> <20070509101507.GA26056@in.ibm.com> <17985.42884.971318.859402@cargo.ozlabs.ibm.com> <20070509111011.GA21619@in.ibm.com> <17985.45682.284634.969153@cargo.ozlabs.ibm.com> X-archive-position: 11358 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: schwidefsky@googlemail.com Precedence: bulk X-list: xfs On 5/9/07, Paul Mackerras wrote: > Suparna Bhattacharya writes: > > > > Of course the interface used by an application program would have the > > > fd first. Glibc can do the translation. > > > > I think that was understood. > > OK, then what does it matter what the glibc/kernel interface is, as > long as it works? > > It's only a minor point; the order of arguments can vary between > architectures if necessary, but it's nicer if they don't have to. > 32-bit powerpc will need to have the two int arguments adjacent in > order to avoid using more than 6 argument registers at the user/kernel > boundary, and s390 will need to avoid having a 64-bit argument last > (if I understand it correctly). Ah, almost but not quite the point. But I admit it is hard to understand.. The trouble started with the futex call which has been the first system call with 6 arguments. s390 supported only 5 arguments up to that point (%r2 - %r6). 
For futex we added a wrapper to glibc that loaded the 6th argument to %r7. In entry.S we set up things so that %r7 gets stored to the kernel stack where normal C code expects the first overflow argument. This enabled us to use the standard futex system call with 6 arguments. fallocate now has an additional problem: the last argument is a 64-bit integer AND registers %r2-%r5 are already used. In this case the 64-bit number would have to be split into the high part in %r6 and the low part on the stack so that the glibc wrapper can load the low part to %r7. But the C compiler will skip %r6 and store the 64-bit number on the stack. If the order of the arguments is modified so that %r6 is assigned to a 32-bit argument, then the entry.S magic with %r7 would work. -- blue skies, Martin From owner-xfs@oss.sgi.com Wed May 9 09:01:19 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 09:01:23 -0700 (PDT) Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49G1IfB007772 for ; Wed, 9 May 2007 09:01:19 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e32.co.us.ibm.com (8.12.11.20060308/8.13.8) with ESMTP id l49FvoGr002654 for ; Wed, 9 May 2007 11:57:50 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l49G1Aa6133268 for ; Wed, 9 May 2007 10:01:11 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l49G19kR006076 for ; Wed, 9 May 2007 10:01:10 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l49G1835004970; Wed, 9 May 2007 10:01:09 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id
3EA4C29EB6E; Wed, 9 May 2007 21:31:03 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l49G12nc004261; Wed, 9 May 2007 21:31:02 +0530 Date: Wed, 9 May 2007 21:31:02 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070509160102.GA30745@amitarora.in.ibm.com> References: <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070426180332.GA7209@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11359 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs I have the updated patches ready, which take care of Andrew's comments. I will run some tests and post them soon. But, before submitting these patches, I think it will be better to finalize certain things which might be worth some discussion here: 1) Should the file size change when preallocation is done beyond EOF ? - Andreas and Chris Wedgwood are in favor of not changing the file size in this case. I also tend to agree with them. Does anyone have an argument in favor of changing the filesize ?
If not, I will remove the code which changes the filesize, before I resubmit the concerned ext4 patch. 2) For FA_UNALLOCATE mode, should the file system allow unallocation of normal (non-preallocated) blocks (blocks allocated via regular write/truncate operations) also (i.e. work as punch()) ? - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still we need to finalize on the convention here as a general guideline to all the filesystems that implement fallocate. 3) If above is true, the file size will need to be changed for "unallocation" when block holding the EOF gets unallocated. - If we do not "unallocate" normal (non-preallocated) blocks and we do not change the file size on preallocation, then this is a non-issue. 4) Should we update mtime & ctime on a successful allocation/ unallocation ? - David Chinner raised this question in the following post: http://lkml.org/lkml/2007/4/29/407 I think it makes sense to update the [mc]time for a successful preallocation/unallocation. Does anyone feel otherwise ? It will be interesting to know how XFS behaves currently. Does XFS update [mc]time for preallocation ? -- Regards, Amit Arora From owner-xfs@oss.sgi.com Wed May 9 09:54:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 09:54:08 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49Gs4fB021251 for ; Wed, 9 May 2007 09:54:05 -0700 Received: from localhost.adilger.int (unknown [64.166.152.82]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 603794E4569; Wed, 9 May 2007 10:54:03 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 6891C3FB2; Wed, 9 May 2007 09:54:02 -0700 (PDT) Date: Wed, 9 May 2007 09:54:02 -0700 From: Andreas Dilger To: "Amit K.
Arora" Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070509165402.GO6375@schatzie.adilger.int> Mail-Followup-To: "Amit K. Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070509160102.GA30745@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070509160102.GA30745@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11360 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 09, 2007 21:31 +0530, Amit K. Arora wrote: > 2) For FA_UNALLOCATE mode, should the file system allow unallocation > of normal (non-preallocated) blocks (blocks allocated via > regular write/truncate operations) also (i.e. work as punch()) ? > - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still > we need to finalize on the convention here as a general guideline > to all the filesystems that implement fallocate. I would only allow this on FA_ALLOCATE extents. 
That means it won't be possible to do this for filesystems that don't understand unwritten extents unless there are blocks allocated beyond EOF. > 3) If above is true, the file size will need to be changed > for "unallocation" when block holding the EOF gets unallocated. > - If we do not "unallocate" normal (non-preallocated) blocks and we > do not change the file size on preallocation, then this is a > non-issue. Not necessarily. That will just make the file sparse. If FA_ALLOCATE does not change the file size, why should FA_UNALLOCATE? > 4) Should we update mtime & ctime on a successful allocation/ > unallocation ? I would say yes. If glibc does the fallback fallocate via write() the mtime/ctime will be updated, so it makes sense to be consistent for both methods. Also, it just makes sense from the "this file was modified" point of view. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Wed May 9 10:07:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 10:07:34 -0700 (PDT) Received: from e34.co.us.ibm.com (e34.co.us.ibm.com [32.97.110.152]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49H7TfB023872 for ; Wed, 9 May 2007 10:07:31 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e34.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l49H7TOQ015642 for ; Wed, 9 May 2007 13:07:29 -0400 Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l49H7SWv169794 for ; Wed, 9 May 2007 11:07:28 -0600 Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l49H7RwY000330 for ; Wed, 9 May 2007 11:07:28 -0600 Received: from dyn9047017103.beaverton.ibm.com (dyn9047017103.beaverton.ibm.com [9.47.17.103]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id
l49H7Qvq032702; Wed, 9 May 2007 11:07:26 -0600 Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc From: Mingming Cao Reply-To: cmm@us.ibm.com To: "Amit K. Arora" Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com In-Reply-To: <20070509160102.GA30745@amitarora.in.ibm.com> References: <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070509160102.GA30745@amitarora.in.ibm.com> Content-Type: text/plain Organization: IBM LTC Date: Wed, 09 May 2007 10:07:25 -0700 Message-Id: <1178730446.3815.8.camel@dyn9047017103.beaverton.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-7) Content-Transfer-Encoding: 7bit X-archive-position: 11361 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cmm@us.ibm.com Precedence: bulk X-list: xfs On Wed, 2007-05-09 at 21:31 +0530, Amit K. Arora wrote: > I have the updated patches ready which take care of Andrew's comments. > Will run some tests and post them soon. > > But, before submitting these patches, I think it will be better to finalize > on certain things which might be worth some discussion here: > > 1) Should the file size change when preallocation is done beyond EOF ? > - Andreas and Chris Wedgwood are in favor of not changing the > file size in this case. I also tend to agree with them. Does anyone > has an argument in favor of changing the filesize ? 
> If not, I will remove the code which changes the filesize, before I > resubmit the concerned ext4 patch. > If we choose not to update the file size beyond EOF, then for filesystems without fallocate() support (ext2 and ext3 currently), posix_fallocate() will fall back to the hard way (zeroing out) to do preallocation. We will then get different behavior on filesystems without fallocate() support. It makes sense to be consistent, IMO. From my point of view, preallocation is just an efficient way of allocating blocks for files without zeroing them out; other than that, the new behavior should be consistent with the old way: file size update, mtime/ctime, ENOSPC, etc. Mingming From owner-xfs@oss.sgi.com Wed May 9 16:16:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 16:16:57 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l49NGqfB008827 for ; Wed, 9 May 2007 16:16:54 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA26430; Thu, 10 May 2007 09:16:49 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l49NGlAf89734777; Thu, 10 May 2007 09:16:47 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l49NGibq89702380; Thu, 10 May 2007 09:16:44 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 10 May 2007 09:16:43 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem?
Message-ID: <20070509231643.GM85884050@sgi.com> References: <4642389E.4080804@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4642389E.4080804@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11362 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 09, 2007 at 02:09:50PM -0700, Jeremy Fitzhardinge wrote: > I've had a couple of instances of a linux-2.6 mercurial repo getting > corrupted in some odd way this morning. It looks like files are being > truncated; not to size 0, but losing something off the end. > > This is on an xfs filesystem. I haven't had any crashes/oops, and I > don't think its the normal files getting filled with 0 problem. I saw > this before the most recent set of xfs updates, but it happened again > afterwards too. It looks like the latest XFS changes haven't been pulled yet, so it's not new code that is triggering this.... > Mercurial uses a strictly append-only model for updating its repo files, > but it looks like maybe an append operation didn't stick. > > I'm repulling a fresh copy of the repo; I'll be able to compare > before/after. Update: yep, definitely truncated: > > $ ls -l .hg-new/store/data/_documentation/pi-futex.txt.i .hg-broken/store/data/_documentation/pi-futex.txt.i > 4 -rw-rw-r-- 1 jeremy jeremy 3309 May 9 09:43 .hg-broken/store/data/_documentation/pi-futex.txt.i > 4 -rw-rw-r-- 1 jeremy jeremy 3797 May 9 13:38 .hg-new/store/data/_documentation/pi-futex.txt.i > > also > 3476 -rw-rw-r-- 1 jeremy jeremy 3558208 May 9 13:55 00manifest.i > 3476 -rw-rw-r-- 1 jeremy jeremy 3555200 May 9 09:41 00manifest.i~ > > > where 00manifest.i~ is the broken one. The files are identical up to the > truncation point. Hmmm - that is bizarre. What is the output of xfs_bmap -vvp on each of those files? What happens to these files after they are downloaded?
Does it only happen to append-only files or are other files affected as well? BTW, what's the 'xfs_info ' output for this filesystem? > The repo passed "hg verify" just after I pulled it, so this corruption > came about after a while. > > Hm, the other possibility is that nlinks is being misreported. When > cloning a repo, mercurial will generally hard-link files where possible, > and then break the link if it sees nlink > 1. If xfs is mis-reporting > the link count, then this will cause havok. Is that possible? Seems > unlikely, but it would also explain the symptoms. I just did a linking > clone with an older kernel, and the link count is as expected. I'd be surprised if it was a link count problem - that would cause all sorts of other problems as well.... > xfs_check passes without any output, which I presume is good. Yes, it means everything is ok. You only have to worry when xfs_check says something - it only brings bad news ;) Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 9 16:59:46 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 16:59:48 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49NxifB019372 for ; Wed, 9 May 2007 16:59:45 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id 223212C8046; Wed, 9 May 2007 16:29:29 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 9922E2C803C; Wed, 9 May 2007 16:29:28 -0700 (PDT) Received: from [192.168.1.25] (adsl-69-107-77-42.dsl.pltn13.pacbell.net [69.107.77.42]) by lurch.goop.org (Postfix) with ESMTP; Wed, 9 May 2007 16:29:28 -0700 (PDT) Message-ID: <4642598E.3000607@goop.org> Date: Wed, 09 May 2007 16:30:22 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com
Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> In-Reply-To: <20070509231643.GM85884050@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11363 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs David Chinner wrote: > On Wed, May 09, 2007 at 02:09:50PM -0700, Jeremy Fitzhardinge wrote: > >> I've had a couple of instances of a linux-2.6 mercurial repo getting >> corrupted in some odd way this morning. It looks like files are being >> truncated; not to size 0, but losing something off the end. >> >> This is on an xfs filesystem. I haven't had any crashes/oops, and I >> don't think its the normal files getting filled with 0 problem. I saw >> this before the most recent set of xfs updates, but it happened again >> afterwards too. >> > > It looks like the latest XFS changes haven't been pulled yet, so > it's not new code that is triggering this.... > A bunch of xfs changes appeared in git this morning, I thought. But all this first happened from a kernel compiled yesterday. >> Mercurial uses a strictly append-only model for updating its repo files, >> but it looks like maybe an append operation didn't stick. >> >> I'm repulling a fresh copy of the repo; I'll be able to compare >> before/after. 
Update: yep, definitely truncated: >> >> $ ls -l .hg-new/store/data/_documentation/pi-futex.txt.i .hg-broken/store/data/_documentation/pi-futex.txt.i >> 4 -rw-rw-r-- 1 jeremy jeremy 3309 May 9 09:43 .hg-broken/store/data/_documentation/pi-futex.txt.i >> 4 -rw-rw-r-- 1 jeremy jeremy 3797 May 9 13:38 .hg-new/store/data/_documentation/pi-futex.txt.i >> >> also >> 3476 -rw-rw-r-- 1 jeremy jeremy 3558208 May 9 13:55 00manifest.i >> 3476 -rw-rw-r-- 1 jeremy jeremy 3555200 May 9 09:41 00manifest.i~ >> >> >> where 00manifest.i~ is the broken one. The files are identical up to the >> truncation point. >> > > Hmmm - that is bizarre. What is the output of xfs_bmap -vvp > on each of those files? > 00manifest.i~ is linux-2.6-broken/.hg/store/00manifest.i $ xfs_bmap -vvp linux-2.6/.hg/store/00manifest.i linux-2.6-broken/.hg/store/00manifest.i linux-2.6/.hg/store/00manifest.i: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..895]: 8135128..8136023 1 (270808..271703) 896 1: [896..1407]: 8207424..8207935 1 (343104..343615) 512 2: [1408..2047]: 8211520..8212159 1 (347200..347839) 640 3: [2048..3071]: 8212904..8213927 1 (348584..349607) 1024 4: [3072..4991]: 8215672..8217591 1 (351352..353271) 1920 5: [4992..6143]: 8344408..8345559 1 (480088..481239) 1152 6: [6144..6951]: 7930840..7931647 1 (66520..67327) 808 linux-2.6-broken/.hg/store/00manifest.i: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..383]: 27132064..27132447 3 (3539104..3539487) 384 1: [384..511]: 27132912..27133039 3 (3539952..3540079) 128 2: [512..895]: 27136216..27136599 3 (3543256..3543639) 384 3: [896..1151]: 27147816..27148071 3 (3554856..3555111) 256 4: [1152..1535]: 27148680..27149063 3 (3555720..3556103) 384 5: [1536..2175]: 27154152..27154791 3 (3561192..3561831) 640 6: [2176..3711]: 27158944..27160479 3 (3565984..3567519) 1536 7: [3712..4607]: 27161016..27161911 3 (3568056..3568951) 896 8: [4608..5247]: 27162880..27163519 3 (3569920..3570559) 640 9: [5248..5375]: 27164096..27164223 3 
(3571136..3571263) 128 10: [5376..5759]: 27165080..27165463 3 (3572120..3572503) 384 11: [5760..5887]: 27166664..27166791 3 (3573704..3573831) 128 12: [5888..6015]: 27171400..27171527 3 (3578440..3578567) 128 13: [6016..6399]: 27172904..27173287 3 (3579944..3580327) 384 14: [6400..6527]: 27173336..27173463 3 (3580376..3580503) 128 15: [6528..6911]: 27173784..27174167 3 (3580824..3581207) 384 16: [6912..6943]: 27174568..27174599 3 (3581608..3581639) 32 > what happens to these files after then are downloaded? Does it only > happen to append-only files or are other files affected as well? > I saw similar damage in another repo, but I was using the "mq" extension on that, which means the files are no longer append-only. I explicitly checked that repo was OK after I downloaded it. It became broken again after a while. It was as if the dirty inode data was dropped without being written to disk, so once it had to read back it got a stale file length. Or something like that - I'm just guessing. > BTW, what's the 'xfs_info ' output for this filesystem? 
> meta-data=/dev/vg00/homexfs isize=256 agcount=19, agsize=983040 blks = sectsz=512 attr=1 data = bsize=4096 blocks=18350080, imaxpct=25 = sunit=0 swidth=0 blks, unwritten=1 naming =version 2 bsize=4096 log =internal bsize=4096 blocks=7680, version=1 = sectsz=512 sunit=0 blks realtime =none extsz=65536 blocks=0, rtextents=0 J From owner-xfs@oss.sgi.com Wed May 9 17:01:28 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 17:01:31 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4A01OfB019947 for ; Wed, 9 May 2007 17:01:26 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA27818; Thu, 10 May 2007 10:01:22 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4A01LAf88776765; Thu, 10 May 2007 10:01:21 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4A01JGB89629843; Thu, 10 May 2007 10:01:19 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 10 May 2007 10:01:19 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070510000119.GO85884050@sgi.com> References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4642598E.3000607@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11364 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 09, 2007 at 04:30:22PM -0700, Jeremy Fitzhardinge wrote: > David Chinner wrote: > > On Wed, May 09, 2007 at 02:09:50PM -0700, Jeremy Fitzhardinge wrote: > > > >> I've had a couple of instances of a linux-2.6 mercurial repo getting > >> corrupted in some odd way this morning. It looks like files are being > >> truncated; not to size 0, but losing something off the end. > >> > >> This is on an xfs filesystem. I haven't had any crashes/oops, and I > >> don't think its the normal files getting filled with 0 problem. I saw > >> this before the most recent set of xfs updates, but it happened again > >> afterwards too. > >> > > > > It looks like the latest XFS changes haven't been pulled yet, so > > it's not new code that is triggering this.... > > > > A bunch of xfs changes appeared in git this morning, I thought. But all > this first happened from a kernel compiled yesterday. Ah, yes so it did - damn browser caching.... > >> Mercurial uses a strictly append-only model for updating its repo files, > >> but it looks like maybe an append operation didn't stick. > >> > >> I'm repulling a fresh copy of the repo; I'll be able to compare > >> before/after. 
Update: yep, definitely truncated: > >> > >> $ ls -l .hg-new/store/data/_documentation/pi-futex.txt.i .hg-broken/store/data/_documentation/pi-futex.txt.i > >> 4 -rw-rw-r-- 1 jeremy jeremy 3309 May 9 09:43 .hg-broken/store/data/_documentation/pi-futex.txt.i > >> 4 -rw-rw-r-- 1 jeremy jeremy 3797 May 9 13:38 .hg-new/store/data/_documentation/pi-futex.txt.i > >> > >> also > >> 3476 -rw-rw-r-- 1 jeremy jeremy 3558208 May 9 13:55 00manifest.i > >> 3476 -rw-rw-r-- 1 jeremy jeremy 3555200 May 9 09:41 00manifest.i~ > >> > >> > >> where 00manifest.i~ is the broken one. The files are identical up to the > >> truncation point. > >> > > > > Hmmm - that is bizarre. What is the output of xfs_bmap -vvp > > on each of those files? > > > 00manifest.i~ is linux-2.6-broken/.hg/store/00manifest.i > > $ xfs_bmap -vvp linux-2.6/.hg/store/00manifest.i linux-2.6-broken/.hg/store/00manifest.i > linux-2.6/.hg/store/00manifest.i: > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL ...... > 6: [6144..6951]: 7930840..7931647 1 (66520..67327) 808 > linux-2.6-broken/.hg/store/00manifest.i: > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL ..... > 16: [6912..6943]: 27174568..27174599 3 (3581608..3581639) 32 Yeah, there's one extra filesystem block in the good case compared to the broken case. If that was once good, then something has had to truncate the file to remove that block.... > > what happens to these files after then are downloaded? Does it only > > happen to append-only files or are other files affected as well? > > > > I saw similar damage in another repo, but I was using the "mq" extension > on that, which means the files are no longer append-only. > > I explicitly checked that repo was OK after I downloaded it. It became > broken again after a while. > > It was as if the dirty inode data was dropped without being written to > disk, so once it had to read back it got a stale file length. Or > something like that - I'm just guessing. Seems very unlikely. 
Have you unmounted and mounted the filesystem (or rebooted or suspended) between the files being seen good and the files being seen bad? > > BTW, what's the 'xfs_info ' output for this filesystem? > > > > meta-data=/dev/vg00/homexfs isize=256 agcount=19, agsize=983040 blks > = sectsz=512 attr=1 > data = bsize=4096 blocks=18350080, imaxpct=25 > = sunit=0 swidth=0 blks, unwritten=1 > naming =version 2 bsize=4096 > log =internal bsize=4096 blocks=7680, version=1 > = sectsz=512 sunit=0 blks > realtime =none extsz=65536 blocks=0, rtextents=0 Ok, nothing unusual there. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 9 17:04:30 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 17:04:33 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4A04UfB021087 for ; Wed, 9 May 2007 17:04:30 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id A2C6B2C8046; Wed, 9 May 2007 17:03:43 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 11C732C803C; Wed, 9 May 2007 17:03:43 -0700 (PDT) Received: from [192.168.1.25] (adsl-69-107-77-42.dsl.pltn13.pacbell.net [69.107.77.42]) by lurch.goop.org (Postfix) with ESMTP; Wed, 9 May 2007 17:03:42 -0700 (PDT) Message-ID: <46426194.3040403@goop.org> Date: Wed, 09 May 2007 17:04:36 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> In-Reply-To: <20070510000119.GO85884050@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11365 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs David Chinner wrote: > Seems very unlikely. Have you unmounted and mounted the filesystem > (or rebooted or suspended) between the files being seen good and > the files being seen bad? > There was definitely a suspend-resume, and maybe a reboot. I'll try again later on. J From owner-xfs@oss.sgi.com Wed May 9 17:49:30 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 17:49:33 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4A0nQfB029955 for ; Wed, 9 May 2007 17:49:29 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA28968; Thu, 10 May 2007 10:49:21 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4A0nKAf89693614; Thu, 10 May 2007 10:49:20 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4A0nIsn89739459; Thu, 10 May 2007 10:49:18 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 10 May 2007 10:49:18 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070510004918.GS85884050@sgi.com> References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <46426194.3040403@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11366 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 09, 2007 at 05:04:36PM -0700, Jeremy Fitzhardinge wrote: > David Chinner wrote: > > Seems very unlikely. Have you unmounted and mounted the filesystem > > (or rebooted or suspended) between the files being seen good and > > the files being seen bad? > > > > There was definitely a suspend-resume, and maybe a reboot. I'll try > again later on. Suspend-resume, eh? There's an immediate suspect. Can you test this specifically for us? i.e. download a known good file set, do some stuff, suspend, resume, then check the files? If it doesn't show up the first time, can you do it a few times just to rule it out? If suspend/resume does cause the problem, can you try again but this time please run 'xfs_freeze -f ' on the filesystem before suspend, and then 'xfs_freeze -u ' after the resume and see if the problem still occurs? Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 9 17:54:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 17:54:07 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4A0s4fB031191 for ; Wed, 9 May 2007 17:54:05 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id 0E7CE2C8048; Wed, 9 May 2007 17:53:18 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 0CC2D2C8043; Wed, 9 May 2007 17:53:17 -0700 (PDT) Received: from [192.168.1.25] (adsl-69-107-77-42.dsl.pltn13.pacbell.net [69.107.77.42]) by lurch.goop.org (Postfix) with ESMTP; Wed, 9 May 2007 17:53:16 -0700 (PDT) Message-ID: <46426D31.8070000@goop.org> Date: Wed, 09 May 2007 17:54:09 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> In-Reply-To: <20070510004918.GS85884050@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11367 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs David Chinner wrote: > Suspend-resume, eh? > > There's an immediate suspect. Can you test this specifically for us? > i.e. download a known good file set, do some stuff, suspend, resume, > then check the files? If it doesn't show up the first time, can > you do it a few times just to rule it out? 
> Well, I've been doing suspend-resume with xfs for a while without problems; the problems seem to be recent and easily repeatable. Which just means that it could be a new suspend-resume problem, of course. > If suspend/resume does cause the problem, can you try again but this > time please run 'xfs_freeze -f ' on the filesystem before > suspend, and then 'xfs_freeze -u ' after the resume and see if > the problem still occurs? OK, but I tend to find that xfs_freeze ends up locking up large parts of the system... (For example, I tried to do the xfs_freeze + lvm snapshot thing, but the lvm snapshot just blocked on the frozen filesystem until I unfroze it). But I'll try it out. Hm, is there some script I can stick it into? J From owner-xfs@oss.sgi.com Wed May 9 18:00:00 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 18:00:05 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4A0xvfB000479 for ; Wed, 9 May 2007 17:59:59 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA29194; Thu, 10 May 2007 10:59:40 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4A0xYAf89540312; Thu, 10 May 2007 10:59:35 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4A0xQUX89504537; Thu, 10 May 2007 10:59:26 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 10 May 2007 10:59:26 +1000 From: David Chinner To: "Amit K. 
Arora" Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070510005926.GT85884050@sgi.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070509160102.GA30745@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070509160102.GA30745@amitarora.in.ibm.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11368 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote: > I have the updated patches ready which take care of Andrew's comments. > Will run some tests and post them soon. > > But, before submitting these patches, I think it will be better to finalize > on certain things which might be worth some discussion here: > > 1) Should the file size change when preallocation is done beyond EOF ? > - Andreas and Chris Wedgwood are in favor of not changing the > file size in this case. I also tend to agree with them. Does anyone > has an argument in favor of changing the filesize ? > If not, I will remove the code which changes the filesize, before I > resubmit the concerned ext4 patch. I think there needs to be both. 
If we don't have a mechanism to atomically change the file size with the preallocation, then applications that use stat() to work out if they need to preallocate more space will end up racing. > 2) For FA_UNALLOCATE mode, should the file system allow unallocation > of normal (non-preallocated) blocks (blocks allocated via > regular write/truncate operations) also (i.e. work as punch()) ? Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what i did for FA_UNALLOCATE as well. > - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still > we need to finalize on the convention here as a general guideline > to all the filesystems that implement fallocate. > > 3) If above is true, the file size will need to be changed > for "unallocation" when block holding the EOF gets unallocated. No - we punch a hole. If you want the filesize to change, then you use ftruncate() to remove the blocks at EOF and change the file size atomically. > 4) Should we update mtime & ctime on a successfull allocation/ > unallocation ? > - David Chinner raised this question in following post: > http://lkml.org/lkml/2007/4/29/407 > I think it makes sense to update the [mc]time for a successfull > preallocation/unallocation. Does anyone feel otherwise ? > It will be interesting to know how XFS behaves currently. Does XFS > update [mc]time for preallocation ? No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size changes. If the filesize changes, it behaves exactly the same way that ftruncate() behaves. Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 9 18:26:19 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 18:26:24 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4A1QGfB005302 for ; Wed, 9 May 2007 18:26:18 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA29908; Thu, 10 May 2007 11:26:14 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4A1QCAf89698910; Thu, 10 May 2007 11:26:12 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4A1Q9Ad89731643; Thu, 10 May 2007 11:26:09 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 10 May 2007 11:26:09 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? Message-ID: <20070510012609.GU85884050@sgi.com> References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <46426D31.8070000@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11369 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 09, 2007 at 05:54:09PM -0700, Jeremy Fitzhardinge wrote: > David Chinner wrote: > > Suspend-resume, eh? 
> > > > There's an immediate suspect. Can you test this specifically for us? > > i.e. download a known good file set, do some stuff, suspend, resume, > > then check the files? If it doesn't show up the first time, can > > you do it a few times just to rule it out? > > Well, I've been doing suspend-resume with xfs for a while without > problems; the problems seem to be recent and easily repeatable. Which > just means that it could be a new suspend-resume problem, of course. Ok. I'm just trying to find a relatively simple test case for the problem - seeing as you seem to be able to reliably reproduce this we should be able to work out the trigger... > > If suspend/resume does cause the problem, can you try again but this > > time please run 'xfs_freeze -f ' on the filesystem before > > suspend, and then 'xfs_freeze -u ' after the resume and see if > > the problem still occurs? > > OK, but I tend to find that xfs_freeze ends up locking up large parts of > the system... (For example, I tried to do the xfs_freeze + lvm snapshot > thing, but the lvm snapshot just blocked on the frozen filesystem until > I unfroze it). Yes, because LVM snapshot freezes the filesystem for you - if you've already frozen the filesystem the snapshot will block until you unfreeze it and then it will freeze it itself to take the snapshot. > But I'll try it out. Hm, is there some script I can > stick it into? No idea..... Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 9 23:07:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 23:07:42 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4A67afB002726 for ; Wed, 9 May 2007 23:07:37 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA05849; Thu, 10 May 2007 16:07:30 +1000 Date: Thu, 10 May 2007 16:11:01 +1000 From: Timothy Shimmin To: David Chinner , xfs-dev cc: xfs-oss Subject: Re: Review: unwritten extent conversion vs synchronous direct I/O Message-ID: In-Reply-To: <20070508065126.GK32602149@melbourne.sgi.com> References: <20070508065126.GK32602149@melbourne.sgi.com> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11370 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Hi Dave, --On 8 May 2007 4:51:26 PM +1000 David Chinner wrote: > > Back in 2.6.13, unwritten extent conversion was changed to be done > via a workqueue because we can't do conversion in interrupt context > (AIO issue). The problem was that the changes extent conversion to > run asynchronously w.r.t I/o completion. Oh ok, and at the same time they used the workqueue also (apart from AIO) for synchronous direct writes even though they didn't have to. i.e the existing comment: * This is not necessary for synchronous direct I/O, but we do * it anyway to keep the code uniform and simpler. 
So you were tossing up whether to flush the queue as in the patch given or to effectively call the code of xfs_end_bio_unwritten to do the unwritten extent conversion straight away. Hmmm....I dunno :) Does it matter? What are the pros and cons? :) Does it matter if we flush the whole queue now or later? Is it nicer/simpler for this to always happen in the queue? Is it a bit silly to queue and immediately flush? * Possible typo in comment: s/passed to use to determine/passed to us to determine/ * Don't really need the "? 1 : 0" is_sync_kiocb(iocb) ? 1 : 0 => is_sync_kiocb(iocb) --Tim > > Under heavy load (e.g. 100 fsstress processes), a direct write into > an unwritten extent can complete and return to userspace before > the unwritten extent is converted. If that range of the file is > then read immediately, it will return zeros - unwritten - instead > of the data that was written and is present on disk. > > A simple test case to show this is to run 100 fsstress processes, > the loop doing: > > prealloc > direct write > bmap > > and at some point during this time, the bmap will return an > unwritten extent spanning a range that has already been written. > > The following patch fixes the synchronous direct I/O by triggering > a workqueue flush on detection of a sync direct I/O into an > unwritten extent after queuing the conversion work. The other > approach that could be taken is to simply do the conversion > without passing it off to a work queue. Anyone have a preference > on which would be the better method to choose? > > The patch below passes the QA test I wrote to exercise this > bug. > > Comments? > > Cheers, > > Dave.
> -- > Dave Chinner > Principal Engineer > SGI Australian Software Group > > > --- > fs/xfs/linux-2.6/xfs_aops.c | 28 ++++++++++++++++++++-------- > 1 file changed, 20 insertions(+), 8 deletions(-) > > Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c > =================================================================== > --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-04-26 09:25:26.000000000 +1000 > +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c 2007-05-08 14:28:20.854616591 +1000 > @@ -108,14 +108,19 @@ xfs_page_trace( > > /* > * Schedule IO completion handling on a xfsdatad if this was > - * the final hold on this ioend. > + * the final hold on this ioend. If we are asked to wait, > + * flush the workqueue. > */ > STATIC void > xfs_finish_ioend( > - xfs_ioend_t *ioend) > + xfs_ioend_t *ioend, > + int wait) > { > - if (atomic_dec_and_test(&ioend->io_remaining)) > + if (atomic_dec_and_test(&ioend->io_remaining)) { > queue_work(xfsdatad_workqueue, &ioend->io_work); > + if (wait) > + flush_workqueue(xfsdatad_workqueue); > + } > } > > /* > @@ -334,7 +339,7 @@ xfs_end_bio( > bio->bi_end_io = NULL; > bio_put(bio); > > - xfs_finish_ioend(ioend); > + xfs_finish_ioend(ioend, 0); > return 0; > } > > @@ -470,7 +475,7 @@ xfs_submit_ioend( > } > if (bio) > xfs_submit_ioend_bio(ioend, bio); > - xfs_finish_ioend(ioend); > + xfs_finish_ioend(ioend, 0); > } while ((ioend = next) != NULL); > } > > @@ -1408,6 +1413,13 @@ xfs_end_io_direct( > * This is not necessary for synchronous direct I/O, but we do > * it anyway to keep the code uniform and simpler. > * > + * Well, if only it were that simple. Because synchronous direct I/O > + * requires extent conversion to occur *before* we return to userspace, > + * we have to wait for extent conversion to complete. Look at the > + * iocb that has been passed to use to determine if this is AIO or > + * not. If it is synchronous, tell xfs_finish_ioend() to kick the > + * workqueue and wait for it to complete. 
> +	 *
>  	 * The core direct I/O code might be changed to always call the
>  	 * completion handler in the future, in which case all this can
>  	 * go away.
> @@ -1415,9 +1427,9 @@ xfs_end_io_direct(
>  	ioend->io_offset = offset;
>  	ioend->io_size = size;
>  	if (ioend->io_type == IOMAP_READ) {
> -		xfs_finish_ioend(ioend);
> +		xfs_finish_ioend(ioend, 0);
>  	} else if (private && size > 0) {
> -		xfs_finish_ioend(ioend);
> +		xfs_finish_ioend(ioend, is_sync_kiocb(iocb) ? 1 : 0);
>  	} else {
>  		/*
>  		 * A direct I/O write ioend starts it's life in unwritten
> @@ -1426,7 +1438,7 @@ xfs_end_io_direct(
>  		 * handler.
>  		 */
>  		INIT_WORK(&ioend->io_work, xfs_end_bio_written);
> -		xfs_finish_ioend(ioend);
> +		xfs_finish_ioend(ioend, 0);
>  	}
>
>  	/*

From owner-xfs@oss.sgi.com Wed May 9 23:51:58 2007
Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 23:52:02 -0700 (PDT)
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4A6pufB013695 for ; Wed, 9 May 2007 23:51:57 -0700
Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA06774; Thu, 10 May 2007 16:51:55 +1000
Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4A6psAf89700828; Thu, 10 May 2007 16:51:54 +1000 (AEST)
Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4A6prj689784308; Thu, 10 May 2007 16:51:53 +1000 (AEST)
X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f
Date: Thu, 10 May 2007 16:51:53 +1000
From: David Chinner
To: Timothy Shimmin
Cc: David Chinner , xfs-dev , xfs-oss
Subject: Re: Review: unwritten extent conversion vs synchronous direct I/O
Message-ID: <20070510065153.GY85884050@sgi.com>
References: <20070508065126.GK32602149@melbourne.sgi.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.4.2.1i
X-archive-position: 11371
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: dgc@sgi.com
Precedence: bulk
X-list: xfs

On Thu, May 10, 2007 at 04:11:01PM +1000, Timothy Shimmin wrote:
> Hi Dave,
>
> --On 8 May 2007 4:51:26 PM +1000 David Chinner wrote:
>
> >
> >Back in 2.6.13, unwritten extent conversion was changed to be done
> >via a workqueue because we can't do conversion in interrupt context
> >(AIO issue). The problem was that this changes extent conversion to
> >run asynchronously w.r.t. I/O completion.
>
> Oh ok, and at the same time they used the workqueue also (apart
> from AIO) for synchronous direct writes even though they didn't have to.
> i.e. the existing comment:
>  * This is not necessary for synchronous direct I/O, but we do
>  * it anyway to keep the code uniform and simpler.

Yes, exactly.

> So you were tossing up whether to flush the queue as in the patch given
> or to effectively call the code of xfs_end_bio_unwritten to
> do the unwritten extent conversion straight away.
> Hmmm....I dunno :)
> Does it matter? What are the pros and cons? :)

I think with async buffered writes we are doing I/O completion in
IRQ context as well so it seems to me that we have to push the
unwritten extent conversion off to a workqueue in that case.

I don't think there's any great overhead from flushing only when
we are doing sync dio writes - all that calling
xfs_end_bio_unwritten() directly saves us is a couple of context
switches. However, that could promote I/O completion ahead of
other I/Os waiting in the workqueue....

I think I'm convincing myself that the workqueue flush is the
correct thing to do here ;)

> Does it matter if we flush the whole queue now or later?

We have to wait for it to complete, and that's what the flush does;
it waits for the queued work up to the flush entrance sequence
to complete.
It's really the only way we can wait for a specific
item in a workqueue to be run. So yes, it needs to be run now,
not later.

> Is it nicer/simpler for this to always happen in the queue?

I think so.

> Is it a bit silly to queue and immediately flush?

I think that's the way you're supposed to do things ;)

> * Possible typo in comment:
>   s/passed to use to determine/passed to us to determine/
>
> * Don't really need the "? 1 : 0"
>   is_sync_kiocb(iocb) ? 1 : 0
>   =>
>   is_sync_kiocb(iocb)

Right - I'll fix that.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

From owner-xfs@oss.sgi.com Thu May 10 00:22:05 2007
Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 00:22:08 -0700 (PDT)
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4A7M2fB023732 for ; Thu, 10 May 2007 00:22:04 -0700
Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA07415; Thu, 10 May 2007 17:21:56 +1000
Date: Thu, 10 May 2007 17:25:28 +1000
From: Timothy Shimmin
To: David Chinner
cc: xfs-dev , xfs-oss
Subject: Re: Review: unwritten extent conversion vs synchronous direct I/O
Message-ID: <0072C73B201FC2AD6A7E01D0@timothy-shimmins-power-mac-g5.local>
In-Reply-To: <20070510065153.GY85884050@sgi.com>
References: <20070508065126.GK32602149@melbourne.sgi.com> <20070510065153.GY85884050@sgi.com>
X-Mailer: Mulberry/4.0.8 (Mac OS X)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
X-archive-position: 11372
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tes@sgi.com
Precedence: bulk
X-list: xfs

--On 10 May 2007 4:51:53 PM +1000 David Chinner wrote:

> On Thu, May 10, 2007 at 04:11:01PM +1000, Timothy Shimmin wrote:
>> So you were tossing up whether to flush the queue as in the patch given
>> or to effectively call the code of xfs_end_bio_unwritten to
>> do the unwritten extent conversion straight away.
>> Hmmm....I dunno :)
>> Does it matter? What are the pros and cons? :)
>
> I think with async buffered writes we are doing I/O completion in
> IRQ context as well so it seems to me that we have to push the
> unwritten extent conversion off to a workqueue in that case.
>
> I don't think there's any great overhead from flushing only when
> we are doing sync dio writes - all that calling
> xfs_end_bio_unwritten() directly saves us is a couple of context
> switches. However, that could promote I/O completion ahead of
> other I/Os waiting in the workqueue....

That's true.

>
> I think I'm convincing myself that the workqueue flush is the
> correct thing to do here ;)

:)

>
>
>> Does it matter if we flush the whole queue now or later?
>
> We have to wait for it to complete, and that's what the flush does;
> it waits for the queued work up to the flush entrance sequence
> to complete. It's really the only way we can wait for a specific
> item in a workqueue to be run. So yes, it needs to be run now,
> not later.
>

I was meaning for any I/Os previously existing in the queue which didn't
need to do completion straight away - we are now handling those ones too.
Not that it may matter but was just trying to see any differences in old
behaviour.
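
The behaviour being discussed here can be modelled with a small userspace
sketch. This is only a toy under stated assumptions, not kernel code: the
names toy_queue_work(), toy_flush() and toy_demo() are made up for
illustration, and where the real flush_workqueue() blocks until worker
threads have run the queued items, this version simply runs them inline.
It shows the point raised above: flushing in order to wait for one sync
direct write also completes whatever was queued ahead of it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel. */
#define TOY_QUEUE_MAX 16

struct toy_work {
	int id;		/* which I/O this completion belongs to */
	int done;	/* set once the "conversion" has run */
};

struct toy_queue {
	struct toy_work *items[TOY_QUEUE_MAX];
	size_t head;	/* next item to run */
	size_t tail;	/* next free slot */
};

/* Append work in FIFO order; returns 0 on success, -1 if the queue is full. */
int toy_queue_work(struct toy_queue *q, struct toy_work *w)
{
	if (q->tail == TOY_QUEUE_MAX)
		return -1;
	w->done = 0;
	q->items[q->tail++] = w;
	return 0;
}

/*
 * Model of a flush: everything queued before the flush was entered is
 * run, in order. Returns how many items were run.
 */
size_t toy_flush(struct toy_queue *q)
{
	size_t barrier = q->tail;	/* work queued up to here must finish */
	size_t ran = 0;

	while (q->head < barrier) {
		q->items[q->head++]->done = 1;
		ran++;
	}
	return ran;
}

/* Example: two older completions are queued ahead of a sync write. */
int toy_demo(void)
{
	struct toy_queue q = { .head = 0, .tail = 0 };
	struct toy_work older1 = { 1, 0 }, older2 = { 2, 0 }, sync_dio = { 3, 0 };
	size_t ran;

	toy_queue_work(&q, &older1);
	toy_queue_work(&q, &older2);
	toy_queue_work(&q, &sync_dio);

	/* Waiting for sync_dio also completes the two older items. */
	ran = toy_flush(&q);
	assert(older1.done && older2.done && sync_dio.done);
	return (int)ran;
}
```

Calling toy_demo() runs all three queued items in one flush, which is the
difference from the old behaviour: the older items no longer complete
whenever the worker gets to them, but at the latest when the sync write
flushes the queue.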

--Tim

From owner-xfs@oss.sgi.com Thu May 10 04:56:22 2007
Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 04:56:27 -0700 (PDT)
Received: from e1.ny.us.ibm.com (e1.ny.us.ibm.com [32.97.182.141]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4ABuKfB026834 for ; Thu, 10 May 2007 04:56:21 -0700
Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e1.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4ABuGoC001834 for ; Thu, 10 May 2007 07:56:16 -0400
Received: from d01av03.pok.ibm.com (d01av03.pok.ibm.com [9.56.224.217]) by d01relay02.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4ABuGkY522062 for ; Thu, 10 May 2007 07:56:16 -0400
Received: from d01av03.pok.ibm.com (loopback [127.0.0.1]) by d01av03.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4ABuFsC021127 for ; Thu, 10 May 2007 07:56:16 -0400
Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av03.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4ABuE3U021087; Thu, 10 May 2007 07:56:15 -0400
Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id EF6C729EB6E; Thu, 10 May 2007 17:26:21 +0530 (IST)
Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4ABuKMX027980; Thu, 10 May 2007 17:26:20 +0530
Date: Thu, 10 May 2007 17:26:20 +0530
From: "Amit K. Arora"
To: David Chinner
Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com
Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
Message-ID: <20070510115620.GB21400@amitarora.in.ibm.com>
References: <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070509160102.GA30745@amitarora.in.ibm.com> <20070510005926.GT85884050@sgi.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20070510005926.GT85884050@sgi.com>
User-Agent: Mutt/1.4.1i
X-archive-position: 11373
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: aarora@linux.vnet.ibm.com
Precedence: bulk
X-list: xfs

On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
> On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> > I have the updated patches ready which take care of Andrew's comments.
> > Will run some tests and post them soon.
> >
> > But, before submitting these patches, I think it will be better to finalize
> > on certain things which might be worth some discussion here:
> >
> > 1) Should the file size change when preallocation is done beyond EOF ?
> > - Andreas and Chris Wedgwood are in favor of not changing the
> >   file size in this case. I also tend to agree with them. Does anyone
> >   have an argument in favor of changing the filesize ?
> >   If not, I will remove the code which changes the filesize, before I
> >   resubmit the concerned ext4 patch.
>
> I think there needs to be both. If we don't have a mechanism to
> atomically change the file size with the preallocation, then
> applications that use stat() to work out if they need to preallocate
> more space will end up racing.

By "both" above, do you mean we should give the user the flexibility to
choose whether the filesize is changed or not ? It can be done by having
*two* modes for preallocation in the system call - say FA_PREALLOCATE and
FA_ALLOCATE. If we use FA_PREALLOCATE mode, fallocate() will allocate
blocks, but will not change the filesize and [cm]time. If FA_ALLOCATE mode
is used, fallocate() will change the filesize if required (i.e. when
allocation is beyond EOF) and also update [cm]time. This way, the
application can decide what it wants.
This will be helpful for the partial allocation scenario also. Think of the
case when we do not change the filesize in fallocate() and expect
applications/posix_fallocate() to do ftruncate() after fallocate() for this.
Now if fallocate() results in a partial allocation with -ENOSPC error
returned, applications/posix_fallocate() will not know for what length
ftruncate() has to be called. :(

Hence it may be a good idea to give the user the flexibility to choose
whether to atomically change the file size with preallocation or not. But,
with more flexibility there comes inconsistency in behavior, which is worth
considering.

> >
> > 2) For FA_UNALLOCATE mode, should the file system allow unallocation
> >    of normal (non-preallocated) blocks (blocks allocated via
> >    regular write/truncate operations) also (i.e. work as punch()) ?
>
> Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and
> what i did for FA_UNALLOCATE as well.

Ok. But, some people may not expect/like this. I think, we can keep it on
the backburner for a while, till other issues are sorted out.

> > - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
> >   we need to finalize on the convention here as a general guideline
> >   to all the filesystems that implement fallocate.
> >
> > 3) If above is true, the file size will need to be changed
> >    for "unallocation" when block holding the EOF gets unallocated.
>
> No - we punch a hole. If you want the filesize to change, then
> you use ftruncate() to remove the blocks at EOF and change the
> file size atomically.

Ok.

> >
> > 4) Should we update mtime & ctime on a successful allocation/
> >    unallocation ?
> > - David Chinner raised this question in following post:
> >   http://lkml.org/lkml/2007/4/29/407
> >   I think it makes sense to update the [mc]time for a successful
> >   preallocation/unallocation. Does anyone feel otherwise ?
> >   It will be interesting to know how XFS behaves currently. Does XFS
> >   update [mc]time for preallocation ?
>
> No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size
> changes. If the filesize changes, it behaves exactly the same way that
> ftruncate() behaves.

Having an additional mode (FA_PREALLOCATE) might help here too. Please see
above.

--
Regards,
Amit Arora

From owner-xfs@oss.sgi.com Thu May 10 07:46:37 2007
Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 07:46:40 -0700 (PDT)
Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4AEkTfB010149 for ; Thu, 10 May 2007 07:46:31 -0700
Received: by lurch.goop.org (Postfix, from userid 525) id 366E12C8048; Thu, 10 May 2007 07:45:42 -0700 (PDT)
Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 453BB2C8043; Thu, 10 May 2007 07:45:41 -0700 (PDT)
Received: from [192.168.28.126] (outer-dhcp-126.goop.org [192.168.28.126]) by lurch.goop.org (Postfix) with ESMTP; Thu, 10 May 2007 07:45:41 -0700 (PDT)
Message-ID: <46433049.4020003@goop.org>
Date: Thu, 10 May 2007 07:46:33 -0700
From: Jeremy Fitzhardinge
User-Agent: Thunderbird 1.5.0.10 (X11/20070302)
MIME-Version: 1.0
To: David Chinner
CC: Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com
Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem?
References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com>
In-Reply-To: <20070510012609.GU85884050@sgi.com>
Content-Type: multipart/mixed; boundary="------------000301090406040003030205"
X-archive-position: 11374
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: jeremy@goop.org
Precedence: bulk
X-list: xfs

This is a multi-part message in MIME format.
--------------000301090406040003030205
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

David Chinner wrote:
> On Wed, May 09, 2007 at 05:54:09PM -0700, Jeremy Fitzhardinge wrote:
>
>> David Chinner wrote:
>>
>>> Suspend-resume, eh?
>>>
>>> There's an immediate suspect. Can you test this specifically for us?
>>> i.e. download a known good file set, do some stuff, suspend, resume,
>>> then check the files? If it doesn't show up the first time, can
>>> you do it a few times just to rule it out?
>>>
>> Well, I've been doing suspend-resume with xfs for a while without
>> problems; the problems seem to be recent and easily repeatable. Which
>> just means that it could be a new suspend-resume problem, of course.
>>
>
> Ok. I'm just trying to find a relatively simple test case for the
> problem - seeing as you seem to be able to reliably reproduce this
> we should be able to work out the trigger...
>

OK, I was able to reproduce it reliably with a script which did basically:

for i in `seq 20`; do
	hg clone -U --pull a b-$i
	hg verify b-$i		# always OK
	umount /home
	sleep 5
	mount /home
	hg verify b-$i		# often found truncated files
done

No suspend/resumes involved.
The trees are linux kernel ones, so fairly large, but small enough to
fit entirely in core.

My script also captured xfs_bmap before/after output for files which had
tended to be corrupted in the past, but unfortunately none of them got
corrupted in these tests. But I do have all the trees lying around to
extract more detail from if you like.

Interestingly, the corruption happened in each case around the same
place in the tree, often in the sata drivers. I wonder if that was just
related to the timing of this script.

Attaching script and results.

    J

--------------000301090406040003030205
Content-Type: text/plain; name="clonetest"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="clonetest"

#!/bin/sh

#set -x
#set -e

D=/home/jeremy/hg
F=linux-clone-test

function emit() {
	#echo "$@" > /dev/tty
	echo "$@"
}

function run() {
	emit " $@"
	if ! eval "$@"; then
		echo "Command failed"
		exit
	fi
}

function nofail() {
	emit " $@"
	eval "$@"
}

function validaterepo() {
	nofail hg -R "$1" verify
	run xfs_bmap -vvp $1/.hg/store/*
	run ls -ld $1/.hg/store/*
}

[ -d "$D" ] || nofail mount /home

validaterepo $D/$F || exit

for i in $(seq 20); do
	emit "Iteration $i" $(date)
	run hg clone -U --pull $D/$F $D/$F-$i
	validaterepo $D/$F-$i
	run umount /home
	#run xfs_check /dev/vg00/homexfs || exit
	run sleep 5
	run mount /home
	nofail hg -R "$D/$F-$i" verify
	run xfs_bmap -vvp $D/$F-$i/.hg/store/*
	run ls -l $D/$F-$i/.hg/store/*
	emit
done

--------------000301090406040003030205
Content-Type: application/x-gzip; name="clonetest.log.gz"
Content-Transfer-Encoding: base64
Content-Disposition: inline; filename="clonetest.log.gz"

H4sICMcvQ0YCA2Nsb25ldGVzdC5sb2cA7F1bbxxHdn4XoP/QefMiplX3i96c xLtYxLsOdhUgQBAQfamWJiZnmJmhLf37/c6p6p5uyiSnm5aE2BRos9ic+qrq 3M/p082/t/vNzbE6HOv9MXXVblu9eXdbSVH9pf5QKSF8JeRraV/jyrd/qf7j 3968qKrqene7PVav3u2u00v6+d3b6uJv+edX/5v26frDq3dvX11ttrfvL9qr 3TZdHNPhWP2U9pv+w8sXf90dq+P+9nDcbN9W/eYqPTr11Te4+O7tvq36/e66 wuo0Hfu9PaR9led9Xb3d725vyk8vX7TvUvsjrdC+q7dv0yEdD5OL1/V20wOZ
ru13h8P4C9rPodpsJ9OqetvNJsw++/KFstL6/MPXlbVOysnkryvjrNCmOu6O 9VW1Tz9tDpvd9sCEe98fLpvr+qa6+Omnm/OocDju9umVEHmFq93bb7q1EzfL Jg4UWLrgOG/Rel19rJd8/nbb7V6+WEvC12DHd//15nX1xz9//93FD3/849+/ e1Pxv3/5/od//feLv3371z99V5V/3/4JX7MP0b83P7z59nvmqnhd/bf45huj pIz/8zr/VkYnrDC4jJHx0saqUtVXOnhnBK7qKIOV8Q8VZimx+iCbT3EQF60c zlHJ4KIOIeAgwQsRdD6Iis4qg6sqeiUjDlJhmlp4kJOI3X+O8w/y8Umk1XY8 CY6itLLB0FGUNuBWPgrUNRpcxcAaT0epMNExkAQQ/UDsVbpgYbqKIboMBEg9 AEllHQMBUROQciEwkAIQILBMMMZPgKzzioEI0mYgE73QioGAaAmI5jKQBhAg iCEi+CJwNF0rFrcMKQcg40jcMqLkHRmtGMjQ0UQINEnLkA9H00XU+WiAHNht IvbnGMhYZneltMg7sgQEBCKsGXSAgYQoOwLkeDSpXd4REPloVSzEdgRkoA80 SYtxR9IGmXcESYvDjoKXPu9Iass7MtgDA3kCAgIxVU52JPWwI4IcdhRM2REh 8o4U1mOgACAg0I68xe8zUJSK/iPdxhaUL7qto1bESw0GO007skDO7I8AIgRm W3SaFQzTYQaykQDXYpEjbZxWbCTANc1yBGk1fDQpmGtAoOVBLE8CzjCBRI9G URcaaWOdIsoBUfhCI1uAWLIJwDFQ3hIDsYEqkHYAst7xjggxcw1DBlIZqOwo WOUzkIeGFCA1AYJgZyBTgDQ+x0CagQBARFIiqwkB+WJICXIwpNCAbEgJUWal jWRJ8Z0lmwAYSHuXgZjEJhM7QrwKENEYV7UVwjlWWq0LEEs2ATBQ0KwkEnOt 501GLeHqC5DFPom/GqIFZWNiuyxHkiWbAGh5wJgCJFXWfkiOAgcKEE5GxIRU g5OstBGkZCCWbAJgZfMgRwaCFcrE1iCrL8R2MENMbAcN0UzsOOwosK75zCKo SsxyBGApmEbax0H7Md8aRzRCaJO1X1uZlVbGomtsHRzokeVIR6HzjowLKg47 ctLzjryGA6EdeV+0X5FkMwAdHRfLjownBrAfFWRXM5A3gCDJDqRq2JHUyjmz 1v98Ej8qlZ86Hw0ahkxtGAJE2mx9nHIs3TSQzrMcq3DyPSRncAFjXEGTrVUF xhXbg9mWLRLj6TiFIf1kt6SDnsLoyGQlGFmUgWC0ZkYqgOspDGkny6+VcgYT WM0yoB1hlAgZRmfBG2BINy3tFGybwThnCoyPcoQR5DFpIK2cwpBmOqJbVu4T jNXlUAAcD2VCLIcSanYo0kvW7hBnJNZGDIeyejyU4Y9mvNmhSCtzPCDH0Ikm 22zLaOTDuBvH4RQGXpYgwZ68DRAUi87o2aFqZiAxEo9xNyoUEsOYzXYTs9yQ qBnhTzBwAUVuKHAYYWSRG7j7qdxkT5PpMIZRPNnJQmLhzXgoydLEeGJKYnmK oCQ4MsIYH1RmOFmxgeHsddiOwDZOGZ69TJYRyMcExuU4jAH1CCNsNkfAm+2G fQwsG7uTOIGxLhQYN0RhDq6U1Y/xpiTOHoZtK9zMicRwWaEcClHJeCgnTT4U sXwKQ1IMAFojO/IMA99UdoMQZtyNkD7vRsBHznbjOEph+g9RU4YpESED+hEm R4SMN7U37FtUDpqMPpFYwfkMuxGT3ehhN2a+G46ZTA5fbZzAhLIbAjztRg27 0fPdxGy2+JfBT2BkMVsnH47ZwmazNfrwAlO8CiscIr4TjA3F3sBwDvbGIpdh e4OBntkbRVKsOApV0ZxIDItdaCPdGL3bEIrZDkHNaKPYFnOyoYWXExhbDgXA 
4VA2eA7ECUbMD8W2WHCsiYh4AqNyLoHREAMSjMvMCL7EgAMMSXGJNLXyExiW 1wIoRxjLosF4M9qQFGuWBooTJzBCDLSReqSNZ9mmgZkZUUVSTPElR1ATEttQ vCYB+hGGZTvjTeVGkRRrzvK0lxMSG1MYDnkdGR5syAwPNs4ZTlIMAI5F3IQ2 o5+SSOFHEhc/ZSn6jROTrkiKNeuPEXFCG11yLOxGuJHhNudYGJiZn9IkxYZD rFMamicPtDHqRBuE+Jk21s5oo+WYhMJSTeRGqmLSEcefGK5YKGkQSuqYD6U5 lzW0MuUVExgRCsMBODJcsffKeFOGa5JiABBt/MTBIF8rfgq2wY8k1lpmKUbw PJViTVJsOBg0YeJgKEcbYPSJU5qDJxo4P4MhKTZsFEyMU7kZjCictTuROBtR DOZGVJMUA0BxnHQisYixeAYppBxJLHUuPASi+xTG5zAJa1g1cTAi+qKaIoYT p6QsqklHncKQFOc6DcKqE4lFtnM80oPzRYjCakMDP3O+mqTYchRu7cTBCC+K 8yVAPcKYXAQhvKkUG5Jiy1bEuomDESon05TyDwE+Zhs2ZjTQccopQ1Js2aZB oCYkzp6UR0MKhNk5lKOBnYVJhqQ4Z7WwMBMSy8EWC3myxV4VW+z13BYbkmLL ER182EhiBQ+SUyuMTBjkBp6MgwYMgpnKjSEpJt8MGHlyMCoEnWmDkRlpY6Pz xb/6OW04LmbzTaZ7AiNL3YsA5Qhjs/gR3pThhqQ4OxNyJCcYPVS94MtHhkOj MsMtkoEZbXyO0qmmaE8OBgbBlFKVgagMtIGhzzolbZzplCEpdhxKO3dyMJg8 0AZ4I22kLbQBH+e0ISkGAJHYnxwMJg+0IUA5whTaEN6UNpak2HGc5+LJwWAy x5wFUI8whoMGxpvSxpIUOxb8U72VJusS0CrKcO0IkwNaGpSimVlWNab6+qdI cicpLjlfRCqcD+YRVX109ZVRMpeQeVD2X4VF+6d6/+dJ0odwA5ZqkqSrIUnX ZyXp0LdcacQI0lxgLMUwBGMRLvkzknT4oyxUGI1lZgSLORLFIJpzkvSQxZ5H p0QU+lZggpzD3JOkh2iKg4cZGBy8gwdnjmMQZw7+viQ9xFLOxSiOGaSTOYPB QItfTtKlmJTgdRzTmzjJktyQ3rghSxqyay6dcZx2ytJpshxcKnLFwaUiicgu FYO5S70nSUcUV/QWo1MiCi3IiSg4NjON9yTpRhqZA0OM1FjAQESXCxiIO2aB 4T1JOiYXl2rkWDPD7GzSaTB3qfck6UZFpTMM1RgKjEeyRfENDbyMjyfpmGyy TmHkBp2im0msUxjEmU5Nk/RwcmJG56JSHoURxnHqTgOXXWoVuFZaXR2qi6vu +Tbr02+zXux/5q+LCyJ6njh8Q24MoTRRcwtBFSk7FWo11R9ZDHE6sjdhfpXF Ng8vphDPIasXdlxMvYZHW8ndMw4WIN+/xlo4V7f/+T19XbyvkPvcXYvDh+hW E5GkaH4e+fEaEW7lCWtkyfvzMe3r42a3xQrUqUJoUlCTihKvjaQOFW5bGfpS GKq6+M/q4uLm9urq8SaVxz5wIT9XA8s+/d9tymvU2HlpKXn5ou66u20t5dKk R6Vc4c1NZ2LJux0q1c+b47uhS6Vcro67atLWUn31z7J6l+ru8IcFDT8Xcmz5 +e0141zI1X7i3qmbpVMX+Ir7Zi5c8yx/MZtxXmPO/eT85K05MOsi3xbHyARd kiykW54zf6Rv0VGScmZrzv3s/fTNOdzVke9peymGKquO3nD2qhFp5yrrec05 94rbr3CSj48y3uQtXKG7ZRTJCeRgQy+EMXRbl1JHOEfuheBZbv1ZPglXJukW h6pBu1yLMhSBD4EzZZQcqlprc+CMJGvMIjnhQvoXxxRHDh0jGIVTv4opKbax 
IverUPF9zCI1R9XTojVNzmH8pIGCZ3MYP2mgmGaRHxWtKaTPmRLF5mOHkRX5 vgAG8/sCnEXmovWswgHIHNpDvYzxQxOO4KovBjoICuNVsEu5zDWOarur0vtj 2pK3WGy5PkOJAcce25nUmP+SoOR2JqQ38/x3LDGcsiFiB/UNcFYlYc1GGC19 bmaCXWMYSm+XpEPPjm6RoysRcAmC9zuEivy/eVpUQlb7BOo/sNQsKXr6Upv7 l5qnRGUpvZ67jxxqTIievNKQEl0MKdF0pWk6tI58k4ToF88yTYbWrZClDUi3 d7vxD1cp3VT25fpO/efA/dmePQfuz4H7c+D+HLh/ucD9k9+ctF46bgLKIyKE xgmk0pkQNMh9D2fdnPzd5Q7PrvY5dfh9pg6fPrBfZzBzdGzoEbNJdqPm2c3d 1MZU9b59N5sB9X94RnO1a398iA5BqDvkvmwvd5c3lx8uN5fbywclo3JSB/fR 7P1luuww+3h5uJO+Pbrddv/h5ribzXFuvl+40Dsrdrv29jptj3y76aG5H+eK 3X6DBOqwbFI//7wxj3x+s22vbru0hBI0aXNcOOOmfYjR1EB2h3A/Nrebq+4u j+wj6/yY9tt0tWzO1aa5f2+IcZBS3tnb9WXNAngs3xPE6vCg8UA0pMxZKA+I NJyzj/4uSv1jonx3mTTb6vp6Lijykc9v0/GB80l6WuzOzkjVaijb9WV62NzR A1y/MPcGqr4HdVjVLy6by9vLtx9pLaY8vO8Dv5tgrhXVY1pxSO3tfnP8MF8q PDZrB7P7EBdApjvu9/awJzN9uiGt7tyQ1va1sZ/hhrR6viF9bl1L/ZbrWmp9 sK3WB9tqdbCtVgfbanGwrZbXtdQXrGtx3cRwBSXK4WHMXDgx/FiD4ocxz61r qS9W1yoHKEcJojz2nU+QjxKNXFDXUl+wrhXoUWiRCwNm6MW2UirOnS0CBe7F PruupT5jXesOV7hcmgunduydzPVSLpyq0ju5gisrbrqqz1E4kfOXmNAjVVwD c8GND9FGmxv5MfD5IVrq7D0VTnKb76QHlRLY3ObrqIt5gHEiP/ISqYmZm469 WtSD+mzMFxnzJZUTo55A/WWVk6cstbByosN67i6rnDxhpYU3XZeSb/lN16Ur fNKbrl8iOC2FileH9rB5dQD9Lm+QDmwO6ZsWQF16X7W77bHebGG2pFdk0/d1 1Xw40uTb7U19QqsewupUa+kZ/EbRc5Xf7fe7fXVhq5/f8cTU7q5v9ulwIKzM xe2uS1UvopK+sbJBtNC1sia579vUyJh6DIIPfRK+ozOPh4TrOdKFB7ZT4G3T Oau0NE3defzrhe6ST6KniBxOPWpIC73OZSV8HVVnbS2N9qIxAsekpxQbU9fO Gvy2MU7FXqyEj43Tvmm0Bo3gfZqgOziZFFOojTCyp8d5ag3VXQdf+2DrFIJy EYTx5Mdk0q1trU49TKoDZQLYsBK+T8Z3utOqlwASsW1MbDqprGlberdQl2RS opYr4V0t664nQnilXKpbepAUMtSkRoSI/9eiad3q3Rtla+tqU+tWBEuP7bTO us4mi2C27+sA8dExtmuJI1Lf2N4b703TRh98gyWUiq1qatnoAE5oWa8VTI3o BrJiuz5E5T0EMUlIkxKm17prELWYVvdyLXGgsl3TQyL7VIfGtr2Jfexk0yvh nJACSYAIdbtWMG0XIX0BAtN5aUUrRXKdaaKG6ioDO5Nk3yq9lvaazl5rXYPK NWAgKUA2TtShNcnB8rikwP61rMWOtUwpyVbGTjt6fUUvZeOT7zola98G2Dq7 lvbOgPC97Pok6Ym3hKC3Cb424HQno6lhUBNMp10L3wp6oK1Xbeu17mNNT5N5 oxJUDcdSXR89jPNara271oVQqyYm3dWN8NCmVhlIkIud8qnTsY5QvJXwHYTF 
ux4mwaRawmx2gh64BXmiCbATTdfbularBVPTa31MA3ujPLTWp6aBjRbJArmW yVjZd41QayWncw7+qm1kLyAkdYSUkkutbepgEHq41gBdWCv3yGlVY5H00GOS urNk9RvZdnRf2dVBhxp8BoPW0r7t4MitgI1pRd1biTWavoE6mBbrpGhUsK3x y+HPqK9J9XF9TRkgH9NbqhVXiQKSQ5W2LYVuCNG6f3quwD1X4J4rcM8VuP+v FbinnuCR3iljdXnDG4+oyU9XX9FbHqkdib7n59LP65z6vRUAn13Jc/3v91n/ +/TVuXXGck3nlI5LO6cwY2nnFMj9hM6pPHtt5xS2u7hzilZc2zll5IrOKUxa 1Dll1IrOKZ60qHOKZizqnCLCreicwjqLO6cwZ1nnFO3t6Z1T96Ms6ZxilDWd Uzj1os4pfH5Z5xTtbG3n1DD3V+icwr6Xd07RpOWdUzRrYecUpnzUOaXvdE5Z 9Vp/js4p/dw5de7NKf1b7pzS64NtvT7Y1quDbb062NaLg229vG6jv2TdJpY3 rAlnxPDWz1zjoLQ6Sslv/Ty3bqO/YN2GD1CO4mV5Z2A+QT5KMGFBhUB/ubpN rtbQUaRTwZS/a5HLNfQO2RDh4+WCuo3+cnUb5gU/3OisGV5HnJlBj9lFZfPr iFdwZUXnlP78hZPhPX00kuOr0ulN6fyexeh1flX6I4UTel97vkZ1sPIQIr1k kd/hioEtf59oaefUszFfYsyXVE5sfAL1l1VOnrLUwsrJGW/Vu5e7yyonT1hp YefUUvIt75xausIn7Zz6EsFpvqsJB02v+nddsAlfVkTlml63KvUNfuOVhnEL 59403f403i+lrhLfIwJwzvmWHtDttbe2sU0XkjC2T30USprlyJ0Mfd31SenY NqpttWyMcYDznRcmWSXaxtDfylmx5xhb65tEf2xHB9O3RrfYvuyN1kG1TQi+ 02ff450gA8k3sa1DZ2STgmxcL+FSmh6eqHPJ9Tr0MkS3HNlSP0QXdVO3yTfC CJBYtVp4K9pO16Ktdejc2V1UUzpjohYqiM7KrnE+RZ+oG6NXsY6NUqlxiHb8 Cg4akikRpDW2aa3VNWLeVihnY+/wQ4+LIdVmxZ7bLkmvIbu1M6b1rUumi7ED 2+pGCg/UptUhNcuRdS9Uj5xVqDpYi80hjgVtEiJCnOIf7J1Zri23DUWnJIlq h0OK0vyHkMX6SZD8vKs0uIBjGO4AV9WRyM211c56VsqxSOvnTy7WzVfaMmqW 1RxcsxyXBVya4sSKlVKz2kM77xG9FD3Gr8+pi9jp4TdkTfelpGMStfXzJ/fa h1rXW8+x5qXTrj0voi/fvXyXuHlhrfOQKbTElDvGlq2HPG7nDE2XjCT9NqC9 R7fTHp5sedlJ2Kc7lA70VaXWmA+/d8edDEM2nfoQGznrXOXMm2wYyZ008/my CfArER7nTtfzEM/q0/bhG5G5lfoeWUuXo3FRyr03xSU9KNaLbhzTxv+6V/G0 6/JDscfOaMtKkLQs1vZ6UdFbVU5eEW61jkMQ96ZT45R0Qt3rkdT4rw/xfBCy YeKZ7pOC9letsdbKaJhSdfZ7T0TLQw5mKdpW34YwWVwhdFcmRdagagmdqZVU kfSgonpur0700rr54r207COUGD6Y7NbebpLxonWLENiqjgKVPq+nOwthcVY+ c6oh3MTGfPhmHEztq+qiQKlXQkKyH42Vome0uloluc8qD7rRTO6Weo2nivWT xkGPa+pZgZC+UjXfp7/o86YAZiph3Bfay4E24hARz/fOmZPK8CYzP2TKPKLd Nw5fd9xhtQnqXuSYV2qVm7c07YVkup5YVbZqnOKS6xyi1ACbjQo4tg7vJVHd H9o52Kjq7ljUuLGjDxDJZEnIiPUAJKrVny9h/UdGMlyBjOWxsj3XWvjxnusu 
126NSpX5D7099KDDQghzkarnAC+lIHOny4W4IrJXopL5iyLFHazj5rhP+KS4 TTJuZNF9EaImKVMH4zftB8rleXm3MC9DPYLXSMIKZlBjd1m5UcPjNpmHb+49 IXNLRvF2oQ2XpnmX0Y63nhFnuN3WQ3ajP76u6hCwpd0SADYchpzU73ES5Xvu qQ9Rt3PcuYg9zRnWBZ6BI2hXNq5iCMo/1536wkj89uh/1wlyZdOhY4wVjZK6 Qkhp9O33JZ6pTD2Lp0rMXoIvJ+d70VVKImyUqL0j8St+/uTlwAU5QgkpNyMe FR90zmr73nJiK0kCe+VBN9rQuD5nmOZZtzmSB0HmWL1PKbx9tu1uL7zRp8ym t2v7qGX0aNi79M5mCwRrq/JwfehBqipsiCw1v2IlblyCndPibcAzLOmQHrry 4lOQ4+ZtThzgbqcQGLH4l94cN46fS1OUbn7hZ8D7TPATKtfa8FQtD4Ka9NiZ Bqm8Aib7+ZMjfe0gdKveuAyVeLBxJo/EZe0M6Er13B5ycPCheEsCQi4xNihR lSSBm3LePR0SB184XuJ5jtj1M3iqa0k6UdWz66VE7cKTqZPd8gsVVMsCmOMl L0XV/VITvflOfkZf+0JgIGV6oVwvXuk2ArqR1iZ7xgLum8lB32uHu9h4z4d2 9nyiomxxVx6sOKBz0yiTCguIUHeVdz208ykUPR0XOponxgwOHx3D0tahkLha rpNJ5YWRejLFPCyeM8lFG0W/ZkjfvocqWtqetb/wRonhh1Pwaa0KykTWUBtL 7BjrO4bWOxz8EBurYB4CNw2tg+0xEoPPzHNro/XREZpd14Nr24SBB1SUQqyB WVocKBgxlhtXYVLFgKbz0IN6hFy7SZsfWpgyaJTUlS92M9P8K3ZXzHpfcrDn QlokvnLQ2DCT2xLlq3dFmGvsCJnzgTdgz1MdB4sP5gmJ/ENCyROkqRaBFw6o Wh7aGQK6VNVTqdaBu2hHXyscYEsD4Ci8bWKRHr5ZaQON6wXrqaMf4NHqtBm3 dWPW9NrIQewPutEpTpfMHqMOPzff1Sala/qF8BagOxK89KD8U8zpQRlxU2bk SDRDveJxyWEL+WgL5H35ZqOiELMF8Ig9TDeuIs47Oq2f5hR0r3OXh2+mESnU gis5Q04NEE0VN+/4nlZiu5xeJf0fandB4W+X1g6FKvWZcd0VY09TzHWwtBAS 2f/AGykuOgwe8IsjQeOiBHQ/FkMElkON9rjpxbXhjN3mmVV2RpcpgOpT/Bxg Rta5JykA8vBkAydw7GTKgjO0HKehyY7wiTFMgx9K6WXkJIB8JJIZjzZjuDFG B4iGGtZqxJhPyWjJeCFGW3iVmZChFnfeEYGHPtwHd7G32pRa+os+y7yktN8z 9W4kGh6Y202gyEp6Tp3CX9NDOwMpLdO4onOpix2RGLv7hlvn6pP+I2rSg/KX naQN2mJAXXxly2mfHRe8pWTXYYW9D0H5oBsLJIh8tt1uJzFq1RiJtpOE4p3F BLvyUmERJAyhb4SZPyx3cDzSJEHUe48zUKVOezzkYDRADGNsG5bUxO9F70l2 yXYB57pWw3G+jOWWBdAKBSrvvfvAeW/PORY1aPfJV6/Yq/tQrQSLemMZzozd lJkUpE08LcgrJvyHj5gaehkryHH38u7Fzj7EQ9xbbTcb3sWsrR4jjEfry7xV 2PY5kKC4uFvt8tMduTc5s8iJMV6rR+7T2JdZDGoMDGvauYFfi1oYC2Ob56o1 waP1IepI4gR7FoyJU7SLJpwPlhYgX2WjUafyyfaQKQcX4S0BXr3o6jG5lCBT ogJlRqB3hpbGy8iJ0BKWAFq+DMg6HebSnmJkI2s/u0Nn2cpDO4O2MXGF2gOi Bt03kXKstqJx4WUl6Ki44yHqBtKp6FnngasjEk13pVHOMuoUvwexq+XF0avg 
INL+jHEoKYbzYomICh/NlpGLmT59GIWonbwY/Hbd/Plt1d4EyxxC5Y2BJLgO +N0vYzKgRuULo0bFPnvgf9MohVf2ravQHhekfOjBz0dljAnPrYqsrS08bpea sJ4l5td7vi81ZTSgkVYWn5fWgAsU8kdVm9RMjhRyHoJ+qN2d+kFRigFiJGSD jjPFUdmGb0kHLPA7q/rLk8FxoAAHlCq03Le0W6BPbMqEnkARiC89kEy6ogQA IKAXe+z8a6QIMn01Lcfox35vf9FnbzFQmWY6DSccE4RDJTssgDp7QlJW7Ct/ 0LrTE5Dofd4wlL3vszSuMK+rgOfnuoKm+UGRJnw0ui/NduJBNwQPr7Zdgp5J xxMHt+yXdQUrDQIEM4/HWvx6cCCR7uBSzjhcOgFQeKlWnSxG3U0BmQnl94R4 Jgy4r3IIHIy99ZeRE9JCv3UJNHUsbkT1dgYfRzrJJ50KoU57odyZvwNBkMyY X2pt39l7uRbDpWvza6LYansZy00tDvyAO51Po0DFuS4tnU58tK+C5RYnsrzw BiG1bi1nyzi1rQrUEb9GDFJm+6pOJr2sskhSwGYljnGFql02eocej8Cxa7XU qOs/nE/5yUlOFO/Xg5xUYpVMJ29u//ODnAB6BThxqgcf1u+IwVBg5S5EA1XH P5AY/nqQE15sq2M5toObMUMXx5K0vfE8jWqnMedNhD0+fsTCCaeOdErIieEA kzHKooL7+BjfiJT6eiRM8Xo3ReOWdTGa3dRGJw/5G+ZqbZIl3+Sv54booj5h miB88JlydyhaTvQBpWfFKtuMfMp5PdgjTjoSNC6HJ1l4Yrk7rLE0SyXD5DmV P5/e/efHY4EL1Ttl/ETqOjQDMQB77bV1eoAfgKlYz22/cd+0gKtKwcTK3jZb cSsETIt1EZojVp8PcvqvHmB2baUckFcpCmvekS5uecVCgJiMzAVPk2d5/fqE lx9xzgXgH8saY6FINgw5zYMtkBo6pd1evz5XGqTHgUUmRa7bJOIbgUnX9nFj tKVJfn08SU+FBdZvDBrMhAkfoy1cfr7V+ZdCVtl6PsBst1vVjEA8FMhZC5gZ Z9tAQafHJCiEJTL+kmfTDQI+74rejEZDrzgZyUfMb7e2iFnSmZKaXzUnDs66 4CWaWUedLcfJRHXpyLxvmdRtqa3no5B6zW5jrY0nbaojTkQrBGscwwZ54tjj 7Ah9ffzZCTFrNkwwCZeIF0gQx7ZzvXlqGQPn+ty1ezdMDaG/mol7rLiVVjp8 gZbenFKsccryesrVvSkJZAj8uMd5OGoEfm7TCCZNKaqZ9Ofj1+D6C3Tt6EVP lPIjGQbXm+Y++DV4rE7w4bVxKqYdxLfcI3fp4nViZWDIDSJH8p7RfL2eMBZj JkOSyS3pFkSBv9Tisa6umCnVpka1tN955KNZKTSGKvWjozpHqLu8IjStrmBe QURr+Z0nShLWtnHcUmpe+BYf59Ja9HOSNjN+AwjK+VXSLHsDfrtCCtPO7ESo Tb4fGdso23USbtzXts9+7kAi0V4vMZ1qHU4YBJDdb0S89356fg1MUspP82zN ekPv+7U2PMbAO+8cFN+hgNr+/2me//nTPEcXw+cSiR9bHsqUCQUWaMCti+tR VH+/Pr7NGGTG9GsCkddE1M6IwF8WAxigiMgduzxDoM2R6U3vvcYRhHBD2fyT 4LqQfbp9dZLgtWvjLEML+ijU7JFjucOFa2OVwrytQwoLKs/PR50Wm3pGt9Lu iFW4NZbdDCIVELkrxxRbG6/uZPaIfMUlEI0+95x5lzNmajYX+sZPya398fzr v6TVkBqrFmVPgnDHkCjNP0EQSnyL2cd55c8XBP6Pj5ml64CnmIY40JScmSep m+CGSsfKkBgJ8P5MyCmsk2ACiXZaYRGjM4JT68LUzREydJ8bB9tHGBaEfSI2 
q+j4+J48K8vj10wsb6uvXz/WhbIpgtieby9JiSmWbScWCFSn5fIUbelZkDM+ Byw4fKY0ImYVgWVpGLQAugXe4BJ9lrStSBnBvvRuOjEy1ja5u2ccqDqKlNG0 /BtjCncDOKVksql5q4aiBYLcHmOFuafW22taHfe71qBaHwWfqFKYIRm33dhR pxu57/f9qNN9YzMr6URNrbq+jT3YlLR02vIZsxrz9PLa9lljpUcMI6+0YpQF IC8D9BvklhgA2sTWemXMUZ18vTR/yfBS17aG81S5PZdzck0p79WfbXOLAwMp eqjZjIWPAGDk074zzhMueryO90NyKw2bY58INWuQqhfAUVChVN+XHthF/IKE r4EpdcSeixqrH2IEECjkTbBmVbubV0SHSH0eUzgtlmZ7hwluuTInLzkT+7lg 2G53tnnys7dS6GkdsLj5hqYQtnSa0Sop9Yk5BEdWba8Iy7NITlCqoS+Ey54I M26OVjEglmaHZEt9HewqE8tzx+kHDUuUXVrr3vAOVMk7Er1byLTXyDlQwY69 niBBgpX3wIfCrXa/6lgIpxTC8+xrw/S7p1iP42POFasMNOFCwQYL0SQFxvyd R7prCBmMPXuDx1bsrqErfF4geR1cv5N1bffnWruIPkmIO5wD19878T4VRFZP 6MWOUzbTezn5++UTGwjcvmJmFqdFo+3ldUnGtOvvPO8+8dxtHjuT+ex1Yz36 sgHkp7qh2nvW0nRfNcdil2s5CQkrLfYB4Pe9n62x04JcC0kOZPidp/XvmWJ7 iJgWQdzzKklkzRQTpLEkck3iauZX+7D7FjMP+cXzy3YtiMMQ9NLw+q3FcsM5 n0cC11qx3mFcwsTb3isfeLBAVjF2TCIgdzryX/IqAxiBYph7VqogPnn4oRpS 273GyrIaryn+DIFC3Az/VnYFIkvxXNutM1a2LInd40Rpn6+Px0cB8x1dudua EaKu8yTpvADjb/MmNDW/Ts0IcRm7i2KEEcKxvnsT6LWdQtVaejoMJak9T0lK zGSMhRcM1MQjZ8FTzVgN13hf+NF+8++8peLEYutijq5Nby2PZXyt9CTYUCAz 5h+QDHs2nrPHJjqaKDYkL+ge24O11aaDvgbZxl7ntVotuo26EdspRg/LbHM1 n7HrHhmN6zCWYc/L8wh4bKiLIxhauztG6aASaTdEjiJDMcjY3vJqm2fBohEw y1DMPdRzjrWg0NnVKQL9XwM2X6vVuUb69JhtHmJJ5j6HoNyLCg8/FTP8xCj5 fUJ1rH5WnMiAUUNe+mqzxSgaDjSWFk7lDa+uPEafcK+9r6y649yKODFgutWF F0fi+j688FUxY+hDZoUxWz+Nv+OxqlAKFWy+NE2noqR+novh8TIrsDDFBOcW ZYoU47eULWGgNdu7fSAv59/YO7cePY7jDN8L0H/4cmchXqkP0yfdOYlsBHHs wFGAAEGw6CO5Mclllkvb+vd53lmKFimbh7YUGEiIBbn8dqdnprqr6n2rq6qP iu4YuHcM53Ep2txXUVRV4vcB1DR2/HUeXTNKspmlbQZM0CvdrbhcbU/qtwEK D4C0ZLbX/Y98Mg5sxKkpQatq45ZTKUH9Mc6O9iowGlDearcD7HF1DHyFMDDB 1qq6I4JHRpvHQGdXVvZo2963Kg0WiNmMYx0DnpLw5RbYJpkr1Si2AKfuuzgn 9+ChVHCrgUNnITacR8IomBrHPNZQBkpdfVv2mEdWfFcS+tHnaMGp0BupTw9b HKKhte4KJ0+B+mgb+DiGDo9aqy2bG75x1A6sAnr2uh8mBd8k11V5qmJOrVJZ 5NbNHKYB2WCOeWwfVWan8hMAyCz2WGYBbR9zKcyIhjWH3FYtc3s7GFsGSlAu hC/A14bMj15b6dh+23EHYE27vfug9gDIwoMwR4km4DyOFpT+VvJa1UXlZEEm 
/jrPs3LYsX6YBjlUOBqmE5yi06kNA/7s3A5N3gbgcJ2uEpypUNdZfglaVm7O isDOyVT7vNL2tljC4sQGrTfhOOAkEB3MZBIatBVdOA4VtfXtMGkvVU3eGmwq oKHafjMqeF8A/mkxOavCjOb2YV/KFrCpR6CIMS6oXA4PWQA/FkPnQ/ZY1e1D 7orqlqEMi5nlNib5oLRG65GRyv27zWpgtBuyUFo8nrAY+GDEtWbwcgQIOpaR i0XdgcBpu/g+26H+9Gf7Z9/x6EhkGcZf/lgq3h+KbLq1PTz8oShmmXFWIGNj j2G1X+VjdTWv0iZkIm4D8KCaGhCsqmhUXhIwYID9bo6ZUNiGW4fl/p88hU7T dwTsCuveKDh0YB7cimEFtYgD/hjdIv11HnJXZ1KvpCDqULJnmZ7B/O4GtByN in6oGeJ2TKGCC7wp4G6vO0Ul6XUcblAGuekjqZHSdgz5xzui78EodBgJTH+p 24/yc0as5lw+xxiwRVCyjXk7qy4k7FbKav0Ewkw2qVtMDyNVWH91GE/WUN8G Ihnh9DNmZqpLxbuacOwrphynML9ttQac7nY+ploxpAMgf0TTQo5MAq7d1qil qt5Qths3tmNpKaP/a8aDb9STwKtLSlZnzpbdoRiA7NIuUnC48gNK2CcaipCq yolQXdgJMHwNQJSLadfXmpqw8rkt81AlFoCsrdgDVe4zrOVVndfXxtR+/4cv bp6440+VTOSPLJn4dqCQYGoDTFB7/fCCCd/biWzHnN067RYfuU9XJWDmKhZl iX54C4xXD/O6GgOLBGlah/Z2wbs5YvdUNsGdjsMMe0SWm98aHIRS1MYKM854 yyVnlVXpiwqLjxQVXsrF7T05QBrKu4ZXpxiV+xweQI1xqqtChRseSKS+74kF HY/qJIoT1oYWsIh34KPcTEdCuQOt59x78qP6NsCzwanHKrdYOqpp9gIRKwGo lDCG9oP3oN4cvMKMbFVVLWbUYaMOX48wzFQyRgRhdPTe1z2xYHYqnsUeHSbA Mj4cZlxlmcrCi8Pk6XDTMW4NHjOYuQEelN2nkG1xiAGQkoO1FoFltYZrm0+u dMqosi31MlGRAT7NWMVJMhjGd/nL44PjwG9NKDTobMLoJh4enK6tbQvsDHA7 xQfQK1s/uO7lLbHI42bbUlLzRTWSBMJN6JHSSpxIn528UNgafOWkLNDMqgNv gp0T4j+qUufLUsWKB6+MDwagbw2uVPYJa4RMDCu+haWusc0YZppTy6VW49LW 4EZtSJL3xqL7+N6ubInSpEg2qDVt9abPvDehqxjM9HDNQdbtqTXwa8YOXi0M DBZYW5Vla/A0Whv8ZUpWVF+pdUA5vsWvQXRbKWcz012xlGGaoIgt6lW87FAv BGvlLJL2MEPPGMm9wW1f3Y+zAPKoAOSkIM/M6j00hPtXHhMd3hocuDcAgK7A /pdaj9tsnF2hISq0R9XqB+t878lzV0BQm4dwuTZMmKXM5EuaCcjQY4KnYAvm 1uB2oZ+45AzNDxgSn5v63ube1YGtaasAvBX3VstyBxrPSjdAZF9bjglXmkE/ wwAQQ0L1BwZgz7Y4JYuApRgq9wwhcSxFVavPOY0qosZqdW1CC5es0iSDEr5H Z1IdlO4Y6JDtQVXAWJf5wSGptzUUq93A8ANdZNk1VgueqUWrHBGWvQ3KqtzE Lao68yDvnvClyhZbWEWsezx7NMG25kx97C3FcBzFtWpMba5W5JvNsjATnJ3p K02vHob2gzu5vPXkNTPOUhdZbXG6HgISH9N3M5EMXDGqtMJtDQ5csWGACpOS CBb0GbU0x3HWQQVWIYJPM++5OWRssuqA1Wc3ZgA9fiO5oRa+FSCjFBHFH/e8 v9JRpzCo6naBnUrP5nYq2SgHQrGhAlb3VkuXGfesPBxabFjx6MEX5365V655 
OerA9m7all6Kes7gbWzvrteuniaH125q67wHYNK4XfCfspo9pGKKmr35wAyq RZjrLT4UuTd8/6b3V0VJNzCeUVR+6VXCHNWSTK3OsoodcNUxbypRaBUoB9ks wFHLuGmGs7YWJ5pbPeLiZcqeyT1cdOOIPYHMgUJlYmtan0DyqVYWKwaMTsl7 GupHHiUyjSYpqaojDNDGUp1AgCLbid/Gre7BOXuWtlfV0cOOvQU2huAz9kSh Lb9qLqn7sidzYTivxrLNZXXUBMOg8mdXv4ylUcmVV//cPbEsdfxxKOUqDeCi jmXyRFkhe+U4Hwgu97113lXmNBhENXcZnKiMCjA/rnQiIWtl3Hdxi7pyVvic sjpHA1Uki1FUNYyKh4pJx4KL9T04hxeuJgbtfQy1Ke4xpywCsLzKDipuKSiA vgfnsFPDHLUeg6lDXyyCPssyABwdKDBXOrDve54oDRWvWpXSNwV8UposS51G koqgObgG6lU2l+JABYESYFzf7WGVS9FVP6U9xs69iiKvcdNZVB0FYdVSpQWx iaZe58UNxeeri0OHwGwCUQdKUR1ohaakmgFCMUIV6+oqc2r6SI1690yuKuPC MsPKwwEnBrOrIqTp1X8TgtfVxWbsTehQ9+rmlLaZmFyjrZWaQsiYKwVxaz7b xOwZriLa2dNC1eGbxZqgPkfKQzhT/bCHYI+W66azSDN6/LyrduHY3IFzXgK2 BU5aFZg/OpO6xywsCxnNB5l3eKjxTAJUV63qW8C0Q3FFlnYNF9qdtcnKQ2MK UfVDbrpgEjIalHwyBRy5GW8xsa1qk6poGFaBWgb1rmOwYqwVv8Gd92Q+g1kY 9A6CcxFAClhU46ejNZ0U08+EjBI2ZX54f6hpdgVYoK04/66UGgQ0vYI8ajpv rN8M5oyMYcpALZAcXM6kGhP2V3EiX9Y0x3SlmU2rWMFT1WCijgNZsMYLw9YC Ph8AaBgwbHdsrhbjEhLAlrvgZoZ7dndGydXlKzsdvDWUDxc34dzZxQGU4gq0 3KvzlM53BJqqsX2CclUl/u4FLaEpcfUEuCrgTpusx+Q6rJU/Ezan2gKusRf6 g5pMFyNwaLHej+hxneqMcgwZA/VKUZLN2MMtani6FJ2YfRocB1KC2rVYOq9S i8NjN/PhXaTfGhwQ7pyOiYGY+6ojQkAuIP7sg1qHnYebpLU3oQ08rgKanuws UyXstSZlGTmDC/L8ZOaQk9mE0CwUTSVuAgdaje1ZZaoY3GjQn9GUNnvsiWWo YatqxdQcicWyRtZyPKBGII7IbJpD9RCbuAWQ4lYuSvxOOq/i0ElGceBGu1LL ivqN2L0nV0leVClzUa+GppReiApYrtQDiXQPLVLLvT2TC6718pdOpQTxMDrJ SXsJ9axVmFCl5Q636aCViAA4DICUUFo1qrdFXQ8Qi+nBGvUOn3nvyUsrOlOu ljrV9URHPGGvwoGkSmyKdbMeW9ib0KTONtpAU9/1Wm1RsWSyMUyl6Lhelkdw ZVOJJiZlACYg/rDoaIBHYF6PLi2Irve82WFr3oyIjhgCdBkNbVEZWGUpL8Sp oz5GRSlRzbZdqwiiKGDcVaHQaSWgo7DFGAF/lzG2yqUbezLvM/fDDgOBxtQm 0DP3KSHaqeMRhDhcV6vNvdXidJZmgqVk8Lhy+Ks1OqMF7tm9yV3tFTGRmyHu MZdHbax2zHqKadlDh9xZazHqGK+Ve9yMiBr5ndmUvjImMB/kEr0athSwEgC9 mT7ULHlv8DiAtkFtWSMkIzv1A7BNxSzKi8dRzyrntOeg+1RSbtcmyFzYdjXU qwMEpzO858A0DLXJ2LPnbsRyNBPd0ProXvUs1geLSQSjB4+2zr65lTN6BrUB TxKPrlMIoFpNCT7LrxVmagUvV+0mVoTw5KKMjHn2Sgg6ivQAWTC6unEXnZRb 
5h5WrCs3FptV1jisH0ObUfalBvmA3QSfQc1C2ZtQcx65gyLqrDvs7zxrbuJU +USBSrdso8ow90huG44nraM5cXQ73XnmojFwdgc6h/9Dinb35ipuzSAVRKze Smc72FEwhKMyH7WqWWfbZBaLpVCxIKqLy+rwKO6PWQE/OtDcPBNVx+ZmK5pT VUPC1wEnz61nSJbJccETffQqGBDd2/P+KkCqqlJDPZsyot1Kh9rYWu0s2NS8 FeDdVP9o0BasB0x6GBXe4UrBLDkIsHelfSunYXP3fBzHAhpJFUfRyT+tKj10 qHl+ACDZqPLZPe+PAODKHUMb8wxiXE79gTGQPoLxoulLx9PsrXP17RH+B/kX BN2VKSpyno8DaQelQvQVNu25omdJ02iPMyS1UvFqIq2duenccvjBMPcc9Frd HgqfT/As3r+KozcU1JmjNHVtA9AEuzbDUEp9ZJ04+GFGYSCKCSt+rCGHpHNq MMF5b0JVtJqT9moRgQ6vMObIwFIdCr3SgiqhtGbsyRyWmI2yrQBAPSCcqlSO fq5LiLPHAZp07O77Q8+znLzOVlVBJRBamTqzLWFqPFyovEDdTFiq7Twcjwf3 nVU3ILqlDjmk1KG+IF00qW1OqJ8zVt90hLeKqcEqRhu6qka3AD0ciWJFm/kt h06CgHIGHVfsWdUlF52p4ora4K8ZHa50T0O73NBKbmIFbTLybkpxOaxqYqPO 9zbgvc1ci1qVj8DoY5izwlmyD5iWkKPB7aurzoLkbW4TN55U+0EptO5OByEr XLwHdiniqMO+N9U/4z2DCpLPptYRh6ztuV6PyoMrngEJAKvGzdUSKxw0ZCBc d43phGyBAMKBoQEZRJjGtJtiiSpSx27ZjhXHcThvYP3w3N6qd1ahLpa93+RE ylFkztQ1oEsgeeSsil5sYjk37ZqKWPesImv40KERyy418soDdB68KdoJtdpD 1+kgzuzRlhSPFVx7AMrF4qlxn12wt43gnJJFI7xlc7clLLx7UZXIVEqYB17J ruSpDVw17QUFx004B7MPBuO3dN5mQYUmXg79Tx2KlKFyONaS2x4QHSqpTdCT YQCebTKVSuSEZmUDADtA1bzW5oRGfGSFMPsOCWqw5gjTYFifYhzJ+AVm8TaY zdw5BT4zOLxWpfchFWz7sFoiTrEzHKmid3uGS515wEQzTXj5DJDeHoZRHH0o K1wHW5bk96xiskxYwYcaEJ1T93rsNy4Ca67MJXgLKz98fEhECZvpzLB+8dNL wHrZS39cnz2aL+Y9n+jg5xQu97f39cnlbv7u5sXN7bMXn37iz1OO7+eju5v7 by5TOdcvLvNZv33Jh3dz/M2nn1wulz+sF9ftaX1+ufrd755fvnh8+3R+8V/8 +Ok3Xzx+9MWTm2cv/3DVn9w+m1f3PO6V/+JzPn5xf3s3vzDm4TGe3D76fOxf evOxl34ru4+/6esrP/Keykn/uCtePhu3n36yL84vmZyv/v3rLy8//8dffnX1 65///F+/+vpy/vm7X/767//p6jc/+9Uvvrq8+vOzX/D1xi/pz9e//vpnvzzn 2Hx5+Q/z+eeHs7b855cPP3U6ncqE+PnnMAyxysBnl58cORYwAr9cLEY6fHbh Kmf+gle5+TFeBTJpv32Ty6sXePUqybr08CrnGzy8Sj5y+uxy4TL30a/yx+X2 A7zJ919FYfUSXr0MsvZH0kMDuhVgsuergHqAKnyKUyigis8u51Vx/13+d2aF ucjnrAgBlW9nhcnImhUXQCx7syKF/DHeIL1+fL1BsVDm4HiDh+94A6/JQPiF Tx++0RvwJ3/kG8hA/BhvwJr/4xzYnIzJhz7mO1vKw3JyRQ0OjN4LnhpYThcu c+cwlmH0n/OSVyPZnL2OrdNn8JSYHobRSWPe8inf6AQNCSLJVvDvkxeXqyf/ 
705+WHdydff7q7srfTEnl7tbkMP5F2YjRJiAv/xz/eZizcXYL1ms+9J/x60u Xkcbm+MHutXNn7+VMzonpJjw3VuF/dl9z0tlZ/IPc6ebd93pInpY4l8gvoe1 sGcwH0Ai9ldVf7//w92VvjB0bzziYd58vnipd/3xG1eg/u++oj257b99lxyy cW+J+7pf314/v/7m+ub62fU7V8ZFtSPxe1ffXc/rwdX31y909cc8br/75vn9 7RvXxPjm8+pQ+jfvOG77y6fz2X29B3C/69q375e/Rfsfd9F68/eP452/L8bb n7wc82MkoYtu7j/yiuf9XRPtcBlvCe637eXNk/H2HIX33Oe38+7ZfPJx1zy5 aX/+2XRWtn3Lulw/va7nArx/9e9kWb14p/E4zzH9oFHesaQtrrikt0epv52i fR+3msvl6dM3F4p9z+8/m/fveD9Ig397BqVqFWV7ej3fbe58jPZPXPscVb9D OqeqX12365fXj76ntVzy7ud+0e9unt+/qRWX92nFi9lfihG/eav8vqtgzeNd s4CY3nK/L1/cyUz/I2T7tBA82dePX77+HfelKV/acvmXf/gaV2fSCZseP7qc Bvxy9W+Xq6vnL5+8H0O933kcn37yK57x/u7li/vXtdrvu+r0HY8f3fXLurt9 enn57Lx8Dl5r3l0ervvp5dHd7cvnr/736Sd3879fzod7VJ78VZDi00/qGPrs j0GL1x+9DoO8/uR8uO9eyS3fjnlcfn9z//hynK1Pvv34cn97+U6g5PKTv7WX x7OOF599K9ir33yAqC6Y5ZvFu/TH86Gy/btP/frD7zx3/x/2rm3Jltu2/ooe /eTiBQBJ/UyK4KXsqlh26SROPj9r9R5JMzpKtJtjaVzxnAdpbpvdTYLAAhpc 6/u/fvny4y8eF//zd6/vtn8333zgzd8+Ud65HvOr8s792o2cg205B9tyDLbl GGzLbbAt92s38oG1GybUNVypdTPkdS+pNTJqJnvSrAClP1+7kY+s3TR75KKh IIfU+PIoxn5JPkpBkI7PVwnkw2o3OaTEEjh+HIxsUvEqF5jlcFXUSkgS453a jXxc7eZai3StSgwx5JdV4WIkrkqNYvlsVa5U5Jvv/koqEaDnLzc//vsUTlrG Q4vgx/gqSntURymhVQJ+ii9SVf3VwknLNV3lPH6VRV+G4fZlOY9f1KhfFU7m pzP/xzrz5ysn6dto75j9O5WT913qVuUEl0rnq3uncvKuK/0M/78tS/wcl9+f vge10PO1mftXeFgbRvrPv/A12+PD1w++/Ptaf/tGr69//rt/YnD6wsSw8so5 pzbmRJ6qJXscg0oGc3edwaPkuJ/uyPj7f/30hp19pGZryMgpeSzUEuh91ujk VA5lik3dfjA0z1pMHm9ZY/FUKgYbzkM7de812ooj8JhJOBi6zLbX9DBb6Gp5 dwtBUkek36uOtgsb18rzLYevhm5ECzNsMxs8pdYiJoUHmDDhMkNaW8mFmw+G 9t77Zv9PDEOne9rWM6XyehfSnpttNnyWg6HxxMOqUEl375GttL5TXpYs9211 1L7LtnEydI9uW2XYtFTnCrsvtqkskgtUHr+0ukmRdGJ8a4fWcnEqQo66B3vp esiBR7BycKuZLeR6Yny05ZLybjOnpSWmPJq1uUNQTBAsT7LvdTTXdqk/etWB GxWr3i6DLnVYttlhhjP0p9nfXw89u/mkHE+fq22fWMueUw+xxJqiUQhVMNnz ZMuwDRcTITV4v+SD4TjYydhg0FJbZvt1y+1gaEwlnvmSdLfFLYLNudz33Eh/ nLRu7t2HHllIHT0OXdIz1WB2SjKWpl4w22RJ04JVPrOQMJLsmFTZxSlWYG0h hZYKD9KLDtWgdZ5YSAghu8ORwGP0sqhVFskWo72wua1MFQ12dNcdg7SojY2P 
WhabzzCWINHMkcK45P/xUU6GXnHhZpetlqZnp6ElHpWf3kdnW5R3+MSnSVXf 3DXCy0QQGErd5pRtY4tQ/6vKlkpqgYXU5nkukdcWkklOGeCIBmYglDUyNhDQ /u5uxc0RGlZ//kD+a7ueMw2Lc1X6KF+L/J17eqnktGqZx/5II39ifJQD6dZG ZwICG1wW+0jXUUgvYVIHHX77xKnC6VM+E0tYKAw0l/VGXeNNQrUgFqoUylef BDCpkxoFccdShbr28NZIsChgiv/sYpp2CTfn+hd+8d+/QH6p5S335Q8CN4hI iKVtDx0tYs15QBNBlnqQpGQuo7anJcGuC/+wRMYGY6nX2YDRqHzlPLMvbHhs g5p1VtfzO+310MgPsDaXCGkP++r1XiZAd0h1F2DBNC/PM1O/HnrFsuaKsKSR VnLYrkdq8FJeFm6f3ZTwHU9Lbr4eWq3hgWcbsrKweY9ayzzhFEibq/PS6r7R F/x6Qhy+suRBmnTqTrPTXReslZobExAJOAm/Phm6wySXkOjKOoKzAswChbG9 1uCHqQWWyziba+8FSV6ITV2Av3JzPEZIPAZf2og81BDXfFqS4fXQ1M+7JpUS sp3kUdXJzJt3pSodkr5FyvpyMDSArJNykZqOFBcIw7UpAjWP6wN+8syrxKcV Tl6G/voXf/8yfoHBVtpNBtvHML4QpjdtebVvv/lb/4/xp29G/4535Q8C27nm j4SbE8bOzphZkBiIU7yhrjWAUo1dvdjTGp72UtcN/Kj+GC7ikZoApsgfvxHa gyH0dCAFfMPzsSuNg6HJJosnBEgFAkwFuHXALnctKQrvHlAYv8npYGh2kfeK zQ879+amWF51nezyXwGLvxvw+9OCP6+HXoVctTCbVtOk3tfYBu+D+x+ZSjHC 3nWY28HQgDSN5FpIwgLTsOFGDbYNf0/R00IGqP38Wfs3Q8OBsU0xiwSZi3xD 2FsN+xceLIYUyTDpTyvevR4aqZYBM0SqEvEAXyCLB9vXkQUjoiPiA2PXMk+W EYGs4h6RNmtWeMMMbIKcBkE5r22N54Vrs5O5Dm0s5ovNqDUKN6/mgFCNSjC2 XHVLh4nIwdAOuNGi7AAkz5wRiS8SXUoU+S5DEKFM1i4nyzhTLsz2Z7AoyEin VhiJwXNOntSsI44GgHYydI+YaCeHVlXSlSNrlNI1Sk9I0r3QxWuyky2DpBng tC2ykRd2MK5ZW8P3UneYNDs4GNWjLZOx5RTRspO5aJccnTIRbQBeI0elhmxN z/OkvR66FUTOnpmBIWnEVcjFNgTxDZfawiO7FYmDn9y1booFUtN8ZWbViHaX 2tNeEVOfZiSh3i4n/jry4BITfeRzAosWErrGhZ3DZbVBcSBbJ3bdEIn7bkKp A98YtjpQIiK0GrIcTQ2bNeV9MtcC1OrA2FkmeTOzEmRTqm3AteSdEAg64Fc/ GJryD0GwraPy+J/m0HnyyHgWOFWAw5AyQOjJ0AFzjEyx2AjaCZknfODGBsLX CeG28LQkkpETz2fiyKCd9Yo8x+ZhHaSRJDG8JMI6/ACPZh8MDc+BGIL0wUhB H8qsXe06jJ76TL2RRtOfV9B5u2V6Xb725vHfJDwhRtILSzZHmEj3giGDjyfu qSZZwpV0ihYjEibkjHB+yE0nQrrPuHN1OZnraT3a3jnGALddNia52gYqJ09M pRZsM741Oxma/ieTwpbyvlT3IOLfZUaKqkSsw7BS9AjihIjVoshVi8uQ+yLp XeWS7DM4WA0G1C9WjuyaTFO6Gtwyq06UpQSwlASHVRAzO6wQkE+PPB/2I0au 2DhKqlUeWoIPGSUCpi5Si2cJ5QROWpvaaqeATRs82VpmAfBo62Is7ll2DgiX J0NTB7ogz+bpUHL6AVLubkDmS8g7BwxIQckZj5YxbawdFk5JH6hwrtHiQkTA 
/gdGqVWAXFc7GPrBMFEGz58jecVuHPgHtxpKaIIwRhqXsygjSAMpSbY3Kcgx 5205PMeaVIFoa8Wawpp2EmXEHGkepgAZWpgbgct5tn2mi1SlIduspC082TIJ SXysvUbWN4r0kpEhlwWQCcfV0phpkHLtZEKoMrRhCYCn14sp7JxIktxRERKA GvhGQrOHo4huhbKrwSU1sp0prTEgfGnyANQTItB8PzI+uLUesCeLASmQ+dHd gEdmoIRDE8A05EpyMtcFU8myTuzYPN5K4cEoTEfCbiT3bAfcnllPfAh5lAXB Za5Wlg8ytPIMeC6NIt7R+eZqt3G0jMBMowND1klRXvLtw88J5dMoimoAOh1/ k4/QUwHG6Ql4YSCVlkhu3FxK6o6HKfBMoZF24mQZkS+bGrZKbrZn9n4Jz+6I u28w7ZQMedlRlAmjCWZ59Qn/RkxTPc3dG2Bq0rgiI0L1cGLXgEme3WtKA64C CdhAaq3d4J0UW3xQ+q3OerKMnuGCAutJE1YBh7Qz9rc7blYoMjGANK3nk2XM itlGJj46BZ322ghohshC3S6AvT4z4IQcYT5k4IDnEkm2vXfFnYdtwMGKa2Fi xka81zJPtkySvZKTWGOnCtRbKI3pzKcRu+BbJgBmwFqf2HV3ql4ECnJPvm6k cHBXiiZijoEuL8mEo5SUpjeQ2nWyYStufO5COaZICY+k8HvIygCRT+Y6A36Q 8iJQihHL58IT3qod2ehI1p1kB+Mk20Ue0+CsyaanpNtJRt3LCkBJakO4U2FD 1ziJjTmOmTd8duyWkZlmsl3CJZEyZOICoyI87HgSCpRv4o0NaNNT2pPlwp5m QnAPEQaJeY7heYWnN3OtGbGx8UUFhUHHWgUepYkhL0B0qAUuwL3vI39NgsFu Ywv2uO6khLyCKAk/3rDVK0XF5WTogS0eAltNghjcaYVpUMkU+0QrvAn8akmt nxgfYoDm6MJGgjzKnsjmNoJCt+1KRknKVpieLGO+SEYQAhAfVQYAE6aeLwPD RtAFHO6XFNaJD0G6sZt21uCVxK5w270nahi52OhCFQxg1aOKme2GtGhh+ToW cfbGHhES1cVCtn7gCG95nNSvyT2pW0snCyrwOlxpXeyxCFRCha0DS1rQkyij q8xRTTKMGqF7WAJs7ySliQGLiYw9GZOdk6ELcUdUeJKkOwZZcFC57lXDEAQ1 pEzZo9YjH9KB3eHgdE4AaZU5AKJyg9FkbWEh44Dj8xN/vWBjexevI7PhWAtZ /zpsUZGSrcz6QvGSjpwq2Yp3GCTmqTFFivyypQg/Ipu7+2Qt204mhBROVUNd ojqNvJAKFA+kylIFefWQNIR+VGpBhE25seqRlBxusBhgSN9jIEfKbpSTr4j5 J9nuhNPE/kMWlpAedJJOdyAPbspBHve5CFpPhgbsH51d8mQOMbYWsxQaG9Aa 6eLcDAnpnidDsxYeAe2Qd3g0Ix18jMsGsSUegCXnHHJaR3ByFJIWUFepAgYD 3sUQ8Q1fJiKiBxabYztxqgWZF0z66okj/G8b+BczHCl8gBRpxFUa4MmJhTQA mh0mzHvx1b0HUsEDZWdsJmQE12ZPcRzl6HGHNMYm+XTNYfJN84oAqCNkr405 AqtPJwFsR4qqsKoHwKt8v6iZOpNGtkKgBKTUmKaT3VjhIPhylg0NmJsqyH4R 2an308nMPzIA5VlsbKSSv45JsPmkL+zs2pB9EMUDos1F13g21/D38Jp1ygJA T2GwmQ0IZLQ+8XPOPnxi20dFi4j8rc+EvJRSU4Jkq/iKVavxRQEeAkbe0gl0 7xQG9gSnQU756MWQboxmcOK1eVobecyWo7KWbL49kS67YgOOiHhuoUVEB0kJ M1ULfIAdvYBVZOfsXkOCC7+xANcRwspV9iOhkm5Hpo2wf2J8FEzfJWFj0EVR 
Iktw59SAK/CwyPiQkvqR8UVWgRrQqGw3XzAMzDVZ2zLQSCiNNcAJmHL0Dsxg HFjC3SZVMbenVr07LHpL36HT+uI+mZDZ1HotJuy4yz6GK3wpO7N7h/vjG2VN O9vRRk9w0ogvggUErBaAaqQbwPFOEkutgDjxeeqqN7GxIpSXHdeCHcQKpFRn WHuROTjWgHS3YXQ/QaowOXaPh2FI6bLxiH5afbXdEztf8vJKiz/a6KukBIwT 4aDh84GUSLuv1P6uZPBNiDm+11Gdr/a64DSZxcEDpklwrZH1oS5IpEkF3XI+ mWtA6pCaZpsLOIcCx5J3cDik2kgFD6BABsETC9GEe54rUWUAWR0yDkTZjEzA NjtcapQ+SWJ/MvTFGY8U3Zz9nQvfAgLi5inMVrE1SwcCXEcQB0g9kpHSnbVV DGPB9mJDoiDE50YC6HW00XFnQXsZQF9X+Z0NwmFRHhXzArdtytMkRzk6bhn7 3BO7anPXghsXpeolYkBjA2/JorOdgAVgUbgPvjfJZQVCVUbaLEWEggYRuwf7 Rk52I/BA2A2bvAmZxis1DWbe8BsDyEdN3NkZevZS0Ed0KgptinXOxs3iE5EY mImktxiecpInmA8Yr12aVMB2MW8EGGQvSPixlNJjwlZC0G0nPgRrN0dCNFRk t9vY2Lwx6xpW7YvCdDNRE/3krmG2MGXfsTQq6TFDNVJsxg1Ez6Z6w3zZ0Zu7 4vDW+NdLgOkV5AaYIMCyLAhsCohK8bQQTiAOW+d7ERvA0o23y17QPChzh+QR Wx7L3OpRSqpsowS0k9a91m1I5cjO3ykvlEjihcuE2I9ewOZCmBpIDs31BFCC 4S3nmQ6EBSVcILHuUdNd91FGi2xhJkcavF5P8NOZotcLkC1jJx29Jc0TUcom RlYsHwAPmX9HqrmqI9yOjsgGKz8JBQvItJSymODyDQ3teDk7Zfj+ZDak5wg0 R1V34DviGzw1wCMwA9wHkF6xMIGw5/RRL+W1I2BWPSDUbqTO5pOV1TphFbvM EcccQp0kXOjE8wE6zbSGduxsqsGPgccYWrD3KfW3G8tctR2F3QLPoYJAWxd7 H3IFuEQ0u4RnkYPlphMo+yQ2ZiRdie8zKtJnCWw3b5SGgJXDl461oiyqjBwN HSQ17OU8hbiyRFgh8wykTwBTHfiGhcCbE/LrPKgwla+JMqKFfzwP6ufx608u jU8ujU8ujX8GLo3fgQeVe+TiC72+Ki88qCVdW+TxRXmeB/Vfjc7jM5x8snn8 a7J5/PZcG2cO8z4PKu4v3+NBvT5xjwf1mu5jHtQfPn3Gg3rd7k0e1McVz3hQ 8Vm9zYN6fegGD+r193d5UF8+dIMH9fGJGzyoj4m7zYN6bZKbPKjXZ+7woD7u 7b08qP/XKM/zoL6Mcp8H9XrqGzyo19/f4UF93NkZD+pPn303D+p133d5UB8f usuD+vjULR7U6yNf8aDqz3hQk32L4Pbb86DqJw/qs1RT+v+ZB1XPwbaeg209 Btt6DLb1NtjW+7Ub/cDaDRLq9pJaaw7lh4IHT5lSoCMknsN/vnajH1i7KaVG 5qKJZ37DS2qpfPVnwkdRkyu1fLJKoB9WuxEjIU5haaOEGIsV4pM/lNR4mvaP fyzWQrRyo3ajH1i7QZL3UBYqeKwXZSE4onApCylPWN1RFtL38aDq71E4CSm/ KpxQPJrFw6u/hNriV8Wjsa3iWvjKTjwWTrDXfiqc4JvXhZOXype81MDaj5VJ lr7kqoFhx14CMlXqLR7UT2d+x5nfqZwA7J7P/r3KyXsudbNygmz7eHXvVU7e caWbPKh3p+8+D+rdK/ymPKj6UTyoI8Y4VukuGqL0zPeyyVfabcUSbOTurvCI v/5as38//vLl+5f//dufvP/0cj2VDUNKcbtP8VLNbdfYeQp0ycjUTU72jELl 
/36NsVrclltiy5Vsts70VkfEE1Up0dnxNkqY77nGTmGG/T/sXduOJbcN/JV9 zFNCiZIo5WcCURcgQGIDcR42f5+qnl3H8M7Gp3W8niAewEZmJm6pj0SRRR1W cZAlx++SO+lh1a01TJILyUBRRjN56nPUaXjl3cdoMlizuPsqM5qH6ps10H15 foQV9PU5Qt9NJNksq60226o1I7DZrLnBZQJ25sxmvM/MEbNhuRsrqa9C5S6U Z91UiPTg2xvAh7W5n7KrRrUi6d1aqsHynMMJLnvAsuUGgNx78EeUN/7Lnred ABFk4Vywdfgwo+oTCzTrwiHBn2utPT0zh66We0jkq5FZnHpKrfG7tLL7jpiq XbVW+tznSGmHRU28lPpK2VqMuXTY615dLAkLWvdT+wHcOCkmUj0mqRtbkOYs 8Lq9aW7dxdMMHuMzc5CvxJofmRZhYUBwc6xiJbH/essjliWy81NnsPiSYor3 ZQ3rXmwzKHGOsvqsK2pf2JuHaPxfn4PlbbDZujDZqDjULK2sOI4OTL6s2gwa 53Nz7GppapkWYGGNMoM7UrKypDZZKpRmbL23h3zJ3//Ux/f/+I6//WedcthK ZeQ00sorDHdnFbT0mGXxO26ZwMJ6On4m334uawIXxfbM+5LxkJ51BE0RJ6VR W+B4/OS7kMkDY2ph7lrhp7wLi9vhRaovRER4ktPxty/zuMQ9tTLdio+M3zUj osIH7h6tiO/j9a+CQHeJvATKPG6caU2XpoeOMqkeJyusnI7HL4MaMvBvJr1N pcRfQCK/xbH2TS5xbAnxdPylGxGCslcs2kJg1Z7TvCiJNbJrb982lvTT8eO1 KnFgK3NevZIqY9rSpub2EvZRXynZ8ftTKMiGUyacdwVrzaoUowjC0m6zSqZc S3I6vsyRsCqJZekV7ijShcKQLloDMvcRtbk+Uo79+vheqOFVytXiPHshEwpe Isc68lg6ekwaQz8ev20PADNkS8hsG8gsAvyxyBJhICQ2EN8a1vH6i3pPfTO0 eGbleplAgAB+wALrUkYZ7PccHht/rR++//iTvR1SEgBe67AdvDShS1fFQhlg slZqGj4kKffl2LB144JPSlFGBYAUOPt4MXFtG7Ycjh8n7mTsNPBiDoSKz43o EjICcICxCLy+jqQFITqFsY/GBiwlpxJ5ARWVyd+k6CnwA9sajFBCL+Q7n4w9 LbF3PbD6proXhXvbKBv+flD0Qi45BRM5GRvDUP8d/pBoxNZEQpAbSU1U6UQo Hz0/JhXz5dgsIocXJ2EfHovSclYQVAYLciQAWSUsvM+j9e4UY0SQomC4pGWl zRpK7HPWTozTe6oIjXZkg8lFdyP3eJNolSnip0Xy1QhjN7hOig+Ws/fOindO SB8bIBlJ5YBNSGoA1YDVTRCaKOh0ZIOWggBhCCU0F3mbFguhE/ArZdookmq1 HZ0dizTvIXNvrHeDB2cCI5SO6aP3CUwAnJmPbFDbzK6B0mRlwPgyTn70sTAl nOEucGMDn+1oTRaSQl0UpiTpbwhWvzfqbA5AcaoHbu0aVz8Zm1rvgxlWJVsz 9g1XgpxXE9JS5vQ9bfibVc7WhA0CcAwL/KDJdqpsso59G+D81VOhaI5Ha4II jAVQxDPd2D6G4l3TFJuUQEt7U7o45Adj2+4/xFB+ItdmI/SqbeGkzKWCUMae DC0n2sxasBg49Hw0uANR8XsFUheZO+/OYlfkBtsHRcpqG3o8eNrUccU/JSHh 6HCydVFd1yaSQ7hEDA9A8QhR75XBS9k7Iwcvu8nqvQpCsQ58HKXq1YIbbPRj 6WhwgeVV8qiH4ohWYJYiMI+odcLdUsw7bntIOeuVwUnLYLcYwBSTiWygh+0A o0zGkVc6sBASttyPBkfEwfMC3AnHQuIfgk0COESgnrXGEs2XBrOjwe1SQVJg 
wQJMTqH60pbqHgOp0xSS9hxpztmyFPJp6mzeNjtWpIzNQ1qEbUCCZwPRGZFZ RjkanN+TtN1JlHRKxlP6rFChSygFQVmLGYLY2ZoDyQK6Ob15swCDHMgoRukV f4zUqSl4921yZopDHacG4Wy2hoRLpsNILMLZjEoFCmTZPc0z3xItIOJ0BWij egASbATSTAHOBdO0lRGwY2hnG9qxLHCpa/ctAIIwmUmuE46WJZ0T+HMOq/Fs cKOSbslR4EToAnemfChwHdwA8NcIFmD/ZR8NHvquMahNOHEfSCMEWGiXud2H bniYnlazGs6WJcBRebhWF+dy5tDCQOoTm13gjl/IU47jzBSrKfJPZFmBknEG 95iwmwB2WqnoWABkZn40C/35m6+ABEpK7tQBW4vyuhU5c1kUhAAoygoUMP3M K+pOC0EHeWFd8LcLEBexHmEaqW7GJHP1jRB4ZuesU5+CHNYoClnIaAZEGqIN Pr2oIH4s13kWQ8MSiusuKmoBL8POqXAfBgWaR01+6cH42YauumUZ1Z0owBGc AA5OEFECiCJRnBlQ5iFhgVcGH5Q4RZrsSM0X0jci51SUrN3eNlmwOL4P6ei/ tubOSxUgIy/IPgG2pE0sccS615gjttgQp9UOfYuRvhdynAUhk9IeXmvxyb9O 3gaP4aHc3tBfpsghyH1ZQ3UdqF+ZIff+xfx7ldV7ldV7ldX/QpXVb8CQq1SJ vBhyvMHVTwy5Fl4+AX+4yqQeZMj93gq93sPJe53X77PO69tXYZ05zBOGnMpd hhyeuMuQwy9PMORenj5lyOF1bzPksGTHDDkNBww5jfcYcldPt7sMueuhWww5 PnGLIceFO2DIYZ7bDDk8c48hx3d7niH39VHuMOSuUU4YcvjUtxhy+O/vMeT4 ZqcMuc/P/goMObzTfYYcH7rPkONTNxlyeOQLhlz5GUMu8Z/fgCFX3hlyjxYh l/9nhlw5B9vlHGyXY7BdjsF2uQ22y/27m/J2dzehaDUKn+AnNtl7SeuIy68s jZKmMerjdzflLe9uKOv9klq2Zvb57ubS+uMtATPYG3c35Zvd3fzCzc3VpLFW 3tzU2K5Uu3z4A7IyK/hjSXrJAT18b1Pe7N4mlFiTxsu4NH++M8gVu9N4Mdik vNwZHOzIATuu/NaXJry4YhXAy82o6afbjizsaFmvm9Fq123HL12a8C6yfLqV bJ8Uol4uI6/7r2zxUoi6zY57d+R3HPmdW5PUnlj9e7cmz0x189Yk5fPdvXdr 8sRMN9lxd5fvPjvu7gzflB1X3oodJ7HLZoMVkzX2SH3kPkv0omotLDbZpqjt IzUqfx328eNH/i/S8p/ozkqoM3mxXoJt2b21GTW1NLpmdgGJV8uV4wk8winX ufvaXljwVuHUx/CVLC98rEnmlD7STf4rE+AIyNxhByniOXsNLGzukqioPDWW HdkDvR5PoKtOJEIjtrRHiFitli2z62XN02XvtqTs/MQERUNxdrTFYiFY1cZi z97ZlMZWGHNh8v1QGfzrE2TBore9jMsPVEF92lo1hVXYnIc1SbOV7scTVM+a BjvMsTFX3lmae+h51lk7DHfBdBG95/EEQ+uKO6ZLKd/EeyTFJAVlLSErHyUF TzUfT9Bhlxqopl6HhrxwHGqJsTRPhrM22Ue1PaS//JUJDEsEdLJGaOIWR6sA cjmw83ws1P9mB7MRzjeZ7dVSXTk0pfB1TF2C1zFLDfhEcQbPyds63wP2oFLJ sbNqa+NElYaBtTjOclrDsT0x6U7HE2BlUqfgP9sY9aSAvHW3qtqIivvGuVs7 PFS/9foEc5GIM/GeFrCdy2mu+BA4eMkcGcRSdgQ490XwFH2SBBhEdRil2cdq Jn3BVFngevE12xOfoDoLFIVt6SeWac4+SLcKLOkaA5a1W2/z3EwD+0gvJIIj 
z16jq9iw2jaPgJOOUmzG9YhS/dd8UesDkaaQ2l2k7h6RWwObBbHtrsFirOJP fIK0pcM8sc55h9pS7jFtHDyFE8+xLNLIsHZ2PMEuITVW00bzFhQBM6SV1RGN qVTKpn5jwwLOfZF26fClkTWAlG5nW+GN2C8+SJH3/EIVPJ6gtAa0NXadksYc PjoWBUa7c2MPphIBxrz4+SYjiilcsi9FrE+xsfJNYZsYOqSJoD+0ZZXzTWZJ cN8aFRG4exkjSFiBBYgyEoI+CcAl5fXEJ1jd2OnVisJrbGOzuVlJ3mNlqXbv CdHtfAIkCnW0gLNqVhfeF9NhPyrOcchwg/DmMmyee1P45QSrRMyKOvow/L7X hENFvIGP6KOlENMTyA4h3bCLKZH4FhxHwqLa8MqGvrVXksg8SjkPmUnZDoiL YnCgAgRDMQVA0ih5A5FhNwQn8GSC1/7vv3z/t/lHnqq5Pn4Y33/3z/7X7374 EHlb8o/+wf/1zx9hua40PQzrqwCfpzjoKgsr9TsC2ITrNDGNj8fal6l/lHlA 3BswYKCERDqXw4FtGSHHCFAowM3kX+nZ6MYgPbZ1KUggFxwjkRmbm86cUxEc XAmIXmejV4A9JdWFl3DkH0uNOiusrxriYHRWu7UuZ6MH2TMAr1bjS5dSkBqt HJdWnZ46q1/ZFeLw3SMWdsKhszgdWY+OAMxUG1v+IZjMILLy3vFwdKQIgIAa OhKG1nodQN9YkJp3g/vKkw1QkQCcjs4mjRJhhizrRtI223RsaQtuvjMb3rML UTwb3dlGq+PgKbuCIatiI+KVFlE+NlZxSCMS00OLFJLyyPVeBjyPbUXSGz3A Mgdy24bAgWhrp6dJYRTDETeRv7EfcZgI12QAwjXKFgTSuJfsw5XBwTHspO+Z 2GZbEYZKWBMrXh3uzyNApmTzs9GRuBGnNiT/DRCg4JVT1RUHbbMD2JfMQG2H 7458A3uYFXGfyis1sQ4bi2XcW4Q9iowMPTyrbHEfEo5MncgQKjtLsTdnLp7g JRuVWXoCRDjcVeR9DRknwBFbqjg2t+6AY4SQWunHOmYadrirWFJAsEjOR8+V MRNLAqDUkejmRMjBL3vK4bsjVASbQFpk6BtvP+KlgiSld1dj2b0MTHvoI2OD H/SEbA04eOH94ZG19JKnzjUjfP8iveds9D4HO6bVUeAWJ7yXyoTTbWFHiYv8 ZpVVvB+++5BOpZmeLSPPob+cU5bD6YeqvMl3yaWkQz/T2XJ2w+CrxDQRQjts JIW1ZsLKr5RT9tgP132l0W0kQIMSMkwltUqYtfjVjYzRI6xxxXW47jASDAtH ArC4t+BwjRnGbAngrq2cM6BwA4g8RByAmrbGsKvQOWg19k0NFDMqsMSm2Gjg hMPThAHhW1JCAg/0UhCimTB1NorgvUECmmsD/55aJMxCvMKVb0DqnJFGItUA ioFfYCNSV2Sv+XDd1wQakoDYR4+esUg5U2JmkNGHaIJfKaBz6AmGZSw1PCN7 nwG/4CeVsiyZh+zCTSb78xSLARsDY6y6MkUkSkL6BYs3r6nDyZirszPf4cqw UeoEhJTZ40KOh6VAXFq1VSxYDiWXvbbNU5uheFaBZQNwxLnCxZdGSte1XuJw oQJ/13x4VpkEsf98VLLJ2gT48DjZFR2n1kmR32HBWxyue8ZYk2S1KeTxtmwA GRP+BSkrjbKyq28sh5EPZzP1aYA1SJCWwx/GCUev87oIcmAaJfY4tMgJmCiN TSuVF2PFYeQwc2lR1uw1+yppxENMwO/ms1/0sj3hEwG9Qpmpqe5iA6gbCd0A xDncVXgWoPaG0wRXy5ZuJS+2VQQyXhMbkvGB4PNPz+ooAa4mZkpUDKp/kcNn CdEuXkkD3Nvoh2eV8gkk9S6AC4Uf146UN0ef1UtE0MO0LWL8s9ERJnYFQjJK 
iRFzlDWuDG2tZiPCOwLShNN1r8AtlNlySxIaoumOszKYhD0JimOghso6tEjg Op+IPqqpJoSJSSAfNxtzYlfha+A7Q06HK8P+kAo/tkuPIogcVVJEWhz5nRhc fFXy5PXQv/OaajGcJmTbpGsGNaR/8IoB4Cw2qogBex9GPqxvRiIfssaJ1K9v /JBmQoCC2wQkjm3B4ZTDdd8JeLTljdO/yxZHwPaGQzXhGWr1sQf29TGhu9ds BicpbTiCFlW755HwusbsfntFPrWQ8TWkaYdR2xMCKjV2dgNuN8PuRhbDUnpn YdhWKc14ek+QAE9XDXFiefMKTTGgM4YsKv0UH73nHk7xDIU35VJQ42UMzK9G mDs1bJoO+N65zfs8RXpwih3eUR3ZgDMz9ezefCDtnki590S8je0wX61INzJy p8xe8kJhoxH5ZW9GLp+RUNnCub3zBcvn0R/g4ALhfVGlWdOvz8F9L/95r+N8 r+N8r+N86zrOb8+/jYKUnzZ1/aQv7ULhe6VcvG7+YOFx/u3vrZT0PZS8V5L+ PitJv32d55nDPOHfpnKXf4sn7vJvsdxP8G9fnj7l3+J1b/NvOeMp/zbVA/4t HrrFv03tgH97PXSLf8snbvFvuXAH/FvMc5t/i2fu8W/5bs/zb78+yh3+7TXK Cf8Wn/oW/xb//T3+Ld/slH/7+dlfgX+L977Pv+VD9/m3fOom/xaPfMG/tZ/x b3P7c7TfgH9r7/zbR2kO9v/Mv7VzsG3nYNuOwbYdg227Dbbt/r2NveW9DeAI JaBisJh+FBxrWSPlonJr+RIce/Text7u3ublA7x8FA3wg58+Cj8B/loE+Wy5 odJlb6adZpoMYAdptQGdlswLj/bhD7G0JIK/qrAA8M7djb2ddlq9ki8m/FVq Du1Twm+xxevewP7N3rXu3HEjx1fRCyzAazfb//McAclmIwtkEyNGktdP1fmk WIqEzTdUbC+y8gJeAfJw5pB9qSLZ1TbU7lblogZXf4+Nk/w/JiC9DPDjVHwS j9Os+trx4EzQLPFY+XXjJFPQ8PONk1FKsrdhShn5066kUZ2fw1jLyl3JD0pH fVCD+yOYPwnm7985qT8l+Y7Zf7Jz8n2verRzgleV+9V9snPyXW96VIP7fPqe 1uA+f8NvWoP7R4DTL44kf8H8/ePPoAN//uV8XSbwJ6TyLysF/v1ffp6/jvbh r43lZXcYrKxS+08f/oEHlR/+1D/85z+9Hjz7X//y87+dX37hWG+r+PHWngE2 rJ5XYfVXnrT72GdlO4E/DB1xkr7j3vSXn/PpAu9y6aXmtqYr/olU/ehh45fa AFuswlrqe7qrfXv4acU7+w5WTasl/Ezu1rMHh/SGv11NikW6HN6WVF3rVTw4 ZKxRHUnm2BmzIbcFEEqa9T0XDL45/GQ3hsPuCYaJ0Yw8xvvqffd6AiFVMDPD 3nOD7JvDx2nq1SvvR1VM9V7NludCxe/eVfxQdX3my+Fl5unBidBS5MxdUzWK wp/FqwxnzbS2XH99K312mY3NNwCfykhbung/xOURc7BQ9F2Xmb49OenE6qG8 ALe26dCFV5Riu6xJFXSsRM3z1jCrdYWtsMgOIEdhiCfDmkpqUasvigHvGvl2 cuCyvgIWybZVq+9oFsYWPCXxTnUCn0lj7lvD7LzDKAMG45p74tU38baswnVL Q5w5OXgp43Zy+NtnrROzzMIyWApbLEiaY7cjiDwsz3pPo5JvLy2+uOZzTt7Z vCIAZI2clx5ek81T2UZD+u3cS8PER/ZgraDkA9C7hs6GlfZsbSKgHoTOfjv8 Tnm0FGVvrTUMAZkyy+XA1V412GGK4HzrtdO3jDHLslN9rqTwpl0aLEjMix6v vHDe5XJ4h7GoBEJCOzMjbHoSywXTw5uDpSyPPme5NswKsASHRbwpCq/VsxZi 
dDq85TjzaT2Hr1RuLcdFkK/2ypFgJGxbXJlSZz+OgBBIrQO+cGv3uSA1sW1W Hr1X74z6K29njw6ZvCWIdZ5+O7xvRyLvYF1sFRc94x0rFtyhsQXmsVYG+3M9 H/7rv/zlz//8NXzp3ypz5IVFOTszmrKAwQ98r7LApiJDl5nGEmbqd37V68Wf bgIn9np7BSlEu8r5dETbnfpcAYtWR4541+3Cr4aGJ+TsZsG7xEXWyHkbPV5r gonJWXvzsy+GXqGhSI9DJDpvXkZg4DrV6krSeXkxiIxuJqRUGlMZlVXeEhXB BMFiInizP2aCzw+4Zb4YuuEjZQeA1uhjRj21wKhq7jOSHMC4DcfWvC+Gnsgm +DYYbYJ7n4acyR5ziaVkPc4QhA+b1S+GRvI9xTzgE7xCmLRG7V4Hoh1mx7i2 bI/TLoY+W/LmfWr2OHN9wz0H9i2ZE5SXaDOt62JoCcyoARamM+rh/dvDa9bK +5YARMkny3jXzVd3947wCQeZPlj6Q0sRxLaGyCEbIH3pPDouho4w2EQn/ah7 sPxsBiBiZU9fGM9prQ0KitzMNZcR+dsZmhObsM92wgmCJoDs0iGNzWUuhs5j gqOANSG3BtDN2NNy8wMkEisjQu21ucd0MbQ7qINuxDvXDishehos+3cs5PIS 7DvoO26GLiNHZw+sLqILbAdfWUah0onaiIklbj3fxJA0JBfWXEdriN2jpODB PbsMw0DyRJLqW/vNhAD+YgZmDsAYNnwe1GYpVjrryA0mWNmf/N1k8Au7NuTV JC3VXnr3fepMfchcDUZT60JIDXbvuRjaSplrw1nSOSxdXpQJAiGhFkJQUStH gGb1G+NTeEwEy590K0zQ2RncdhwsQJ6WSrC5w83QcyK2eQKSqx1GlmElgAFn gMcCfgzAvrpyjnw1IUkB5ZoLTcQdEHsu3bKsVFr0UQESmTe5ERkkKTijy5RU WFrW0xodKQYBoAdlOsDb9k1QXQWYY9clNk9h67wEL0EcTO5IBxvJizXM5Sby wR8aYilCpzmwtYqCeACN9E5m1qxg3D7lJjw1fOA5nrqehnSrsw9nqf90pESe 95xVY+tNbkTYKKC902pLMDsfoBlsTdcxJsvNluVVdtabCUHM6OCPTeAXZBrq BpiXbbKrLvh13oUVpzcZna3tzHcPBItXyW0dCEgnKtvmsuF3qliKm6F5ICOH 6lAzNcFYyA0NPHQjYTprVoBLEGNvLIS7F+xUBtzYcqGUUwU8ADtaQt0ZGA5l kuod5otSJM4+7EzaKQAGzDdbk6YrJij1gs3Um2UscciluxoiKv4Hfi5JwFuA fGXYTFNGbfUG8yVlBdNcDgwNCv1qsQiY56vTqIW9+QAD042jl1SQAQ+bQQOH APAtLCAAZgUmG0F8kxBd5Gau8+zcOzAsl9fNHcZylAiwIq8Xlh8icK+48ca5 ESJqU8AE8FogMAXzcPxfFkN+OctaTpiSm3i92ma9NMs7DbEE/oiItdsgEFTw Ahr6LuUGmAHiICoBdHgxJNsOPJbzwOrBCIWKifD9dJUKQFaRq2ALAGAHrHWp cJN32pkbLwRMK3u3eYOeWoIJI/2xoxVFvkI7zG9b3utU0AGYI2DJHcQZxbKs lPKEFU7wvLIWoDpy1iKIKk2BpvRqQl7gv4M/T0rwrcMM4LUh0vazaO0nOzDU FQOrHiLdTpFdDXwDb+G+Sj5BCZ4Dy1O74uj44QeYkWMsTw3zmoayCnhaC4Rb ijNUWTcxBPA8HzwMc+MFDkPUZxPRIVWcndemdbBdu7EQJVSHb++GeWBncCA/ YXIBnudHY+KBpewq7UYHODqjJ0md9KCnOgBHnJxX8QIHsBxXdt1hxHARryml omnYyVtZvUh+VF797A87rF5hPswrWyZjqZBrbDby/U0hvIP0MNbxAqQdV8uI 
oN8S6EoqaSOyGo1uMYWBibGmHolRr7ZaNJfYmZvAbLC7GwJGa4UqY5EAHZw4 G25zE/lyBSeV6ArMrsdAB3RMGwkgoTHSghYAeF8ZX9YBWlETshcw8JY98k5w IeB58Gh4j2fY5Tw3y+iZW0BItSm7gI3vCRxPHV0EPVE1/BrnHYvnQwPplXMs KJsLQ8yUs8VQMsnAYCtNu5SerrZa2D7ejiSdu2Xp1Tc3EpC7XiXRgcSbQO5u 7Hp15Ku1wLqQs46zTFlLGmSRRtmSNsD83tVj92vjQ158qTV4qmbstN25wx5w dVabg/pZZC03yygrA/8n6tltjT28xCSx3oPddWF2ubacr4AZPhRJ2zife76a aQcDnwCUwKBh2R0osJabCQF3OYmyuFSH7uCgh6Ahl9pg1eAwHPn0KxyCSYjQ sZSnyUV1v5ivDzUAP26wDypnXBFpPJhnHKkAfli0MRmZkOWR3ROVo6wHgvi8 SQX9gCCCv+Uy10k8iczgNchqbYEhgJn2QIBpN9CdGrK5AT4tfCq4HeDSWtK4 ZYm/2EAjufsdMAOmho+zN/cCcoeDTNj5Cp5i94MhNzUWV9zs860Jn/GxEu/5 wOkBl9KpFAtJWUAHEmIAaNKNhewtxQg7QEkTD3SAokBC12jIaAi33gwI/GqD CDwmAfzWfcRAk1RixqIqD8DUyXD7uoreWcjgdnvuddYJU+CSYgErGMFLv71L Ro7v8yqodmCyngrlSOKAUANWOzLjSbUBN2Tu58NQ1o2FhDZKeHnmZjXmfIOZ CwlkX6w6UuQfcIUhV5S0UQ8DSMkXEgqcHgBn4UUVGb0huIKPIaTY1YQAHYBL gy5nRD4AsUBKkBm2geoxdAU+3uJXW+OLQWnIapTwAiLuTTP3ggMJmHdaR7N6 dQyx8MGBmOQdwOYUZBuYR84Frzjw8o0f4gAkV97YT7RGwbPUuPM2KOwwCuhG EWp4IJQkhLCbLAOqsuF1jblqUX4EQH4qSKR2fKyODUQyLN8YXy2lNDBmsFzm k3QQ/g7v7rTF44eqeRdDgrhhYK2SMzaE5oTfLwFamgBZjVrd4OodbAzLqFc0 yWC4ncd0PEejaAd+vrpa6guYZOZTYCo3y6hCj6ADMoUD4lE5ORqg6gE9Gk0F Flm8XBEOEC3uUA4KGW6LbgtThPBx9mjAEXMSxd5sWvCscbMNQ+q8ZoS5hXlw t1xNzhlIDvB/+NUNxKF+CVBe4zULN8udKsQr6YDJjAQAr2DV8+qUdHpL5pM7 +lRj9Uy5GJ5vHgA1oNjGZgbzxkJapYxOr4nidMaLRQk2l4G2c6oUX8l4a4ub GFJtZCDgolQzA+StKUDZGfdefyqxeCtwXaEnflVykLpU01IkKyAPA7Ish/dc wKWzVDs3IJhCi+yb0ZEheWLQDxijU/YOXgnQo9TGrueGgQ144mosbuSNM4QM UC+qJcLA4eq9HUrkH5Wr7VogEBiFIdEakmBBBJ15yeS5Dxy8IhxyY//q2LhL A7YGC5suVADiFjPFhLiYll9iEbKuHB15BJAd5GLjUw+iReY1S7NaxKeAKWXi qxuXEd0TwQKYMlOWD8BAkciQdljVgjeWmo0Xrm7AAjf4AMAUsbNhylfige4h HNN20ipSQaf7lTfOGWnX0nmxDD7v1OcLdaJJNtgBLlm92dUy9g5qASsAu4Pf ba8zA1AFuBKYzNACoBLr6lZLtQx87aMsL9UBAC1RebkA/eoSapiNCgjfrk7u 4qhOOAq+ffDYalerm4qrxVfpa+6DKHWzO8ldyARnR/4aDeTgWFuygdfhNgmm UjNi9h1Nqra7IUchMxrikgOFIC9o2z31emYJqnz5usrooOI8SDN8tpOHTQy7 krxkr8HbhVc6wNNvgupC3LTTdl0eiERS9fCQcVkMOqojQM1qN+jpRF48FOyI 
Fci9yLVUdeUiRhkNWQCBFvN9tYUIa0YOnx0ZRnngYMi5ZwGcyRHt6/TK47Ub R98ZJN3A0a1RFQ/wI0lv3GM2kxCk9j0l243xBfi5sonDQDj1qK/UAFq33HnD APCMeTPfJLAY5+TBG/JAYcGs0KKuigU8E3ZyzuGt8321oZ9ON5sKSgp4Wl+n R3nKykolW+RNULK80w10B5IEpAYjhRkM4KQyXte3KOgMslrGMEK1qyzD+Gat IX+BJHFQMSBIBNLFO7shm52J5tXWuIvDvhhBk6a+MQuttlzq4H3v8ESpXAQR vdtZWLMgKCXww64Z5BaUyXtBrp2J5+l7co/n2dD/u1jdMPm6pDm3/H+vVvej UO5H1fOPqucfVc9/C1XPv4NiXSnKGuSPf7KPinXlo+be6w/2fsW6v7fC6x/p 5Efd9d9n3fVvXxV9FzCfK9bh++ozxbrXE88U617Tfa1Y9+npO8W61+c+VKx7 e+OdYh2e7Y8V614PPVCse1nVU8W6jw89UKx7e+KBYt3bxD1WrHu956Fi3euZ J4p1b9/2vYp1f22U9yvWfRzluWLd61c/UKx7/fdPFOvevuxOse7XZ79bse71 3U8V694eeqpY9/bUI8W61yNfKdaNLxTr6k9ZfkIe/e0V68YPxbr3ioKM/8+K deMebI97sD2uwfa4BtvjMdgez/duxh+4d6O1vVTFC3lcbm8K4pJSr5Tnl9S0 pwedBsYfuHczciv60rUfrFf+9FMGxVv4U3jd+IGu/fgD926MVejcu2lVWpK3 vZuaLTfu6FSeTo8nezfjj+s2oKy4eDOwMT4Rft5SL+VlYJLeCP/Fqlwo1o3f Y+Pkv7c7PnpYxhq+bSnWnvMnD+OF+teWYhuvZgV4rDySmvsRhZ9E4SdbHogd 97P/bMvje171cMsD0PR6dZ9teXzHmx5KzT2dvudSc0/f8JtKzY2/Cam5v/zH N1TmBGH+oczcaxwtbWTKM8W29yvM5VkpntaLhNtuyUvzsaqz5IxFI74PyPR6 7yVBfsmnk9fIvA1+fO2ImQdvceR0is+Cn1i8UnFov/umzGcjUzHFD/sdvu7w aRqLXQl3Xy31sFZ2KYDo8nxktbME3Nq7+IzcQ0opGQOfgr/S1GZokaPPR66y c+rDc/Zlc8xlkdxYMchu0OyEvKv5WhcjbywuCy57iuytpJ3YZB4DlkDkyu59 l+X+fOTiSSV2V8pLlJGp7hTD02DZlhY3nVXPLBfzPC0l66v7woxUcRmwN2q1 bPwDHM2bqu+X0frc6qrrkumnN0slaXQW2oXt2BW4nBpShb3wLuY5nDeMe8W/ ZDR8dTSNsqbHklc9/pkaMp+P7GVm1kz1ES1ONWmHUl/ipY4aWa1ollRvvvmA ikweLJupqLD8SLUF9XxOP1jK1U68u8rp83nGqGmLwO5y4009zDvmtfmszr6v mKxwfXfN/OdWt0LxfcJ7IIs6d80mjXAWYHgWKGGe04p6ETeU0ih7bNYGydx2 6mpWju01qZXUx6B2xsUKTrhYWT3rDF50Chba0KJZlF9g0T335ho3K9jX6rWr ySzb2CpyHhPPO+CPmCPkWC9l3nxz1gSnkDIiR/dqY2FoHQgf5m5WBtxeUlzY s0bLYyHgtS4NnvJf7J3djly5kYRfxZe+5D+Zfhv+Yq8GC9gw9vH3i6ORIXu0 2D7UeMbe7QEGaEldrFNkMjKCxYwMKtclsvMOuv8fmgB8zvcj532A5EIeGbum rg5yvfSd6ixbHWlLTodwvEDRrmtNZWxCgK3ha4msYFVBBUCEtmgqianjApE6 GcXbWCooTCRXK0hNC2Be6SqI83MTKOEiD1bm1HhOEu2epvjrprohGKtyYGH6 13TlYg9unlOVlWF2eQQQG8e5lZfFNdUknchTYUG+mA1yUo+qEUCxxmpyELMe 
cpY7l4r5OiCy61VshATqI4g7hKUuGZ/pjrytcLw/aUa2YToXz+x0IuJUetBL jKmsOCuJJMUN1smsh6Sb4B4Xu3s4fVwrG1DKM7ERn9K9akk+PUqTBQJ2kWEj STqdIsML3zTFrjuzzrKFo6INtyEi01+gqLd0jkp3HIu1t0sxN7dhkCSCsVUf mNoiSC6i7kCN9mytpTjyhtltv6toBvnEIBy5gs/uYmSgQvZ3hy2SpoOBHWIa Bmby/9H9HW/5zH6xU2Qwwif31Q+ANAvsR+3l2Mm6ftnTOqTxeMPrRhzDp1pc OAc6fqyXmddyS+ncNxD0MCM3yH82n9WSzw68HGjOMlfOYGhyW9QJBIScXozs oJ2W3OmQgRiaM/aFSiM3oRFIYIEtREp8N8+//PufvqOb6nfcLRc4np2ss1sm Qnd24C7UYZKNRtFVmxgjDOODj/PTX7+pCEtQPtBxJcC3+VHAmOeWKhKq7HLY f153QF+PrDp9pFjUlXy5gjlYcpjR1exUbOBmj22VD7tafzMyimtFx6qsLGuj SrLfcsdV7YKNEJA+sVhN70dOmkXXVL8xZgaAraFC0GDQOP5w+Mu2e7p45vnU 4rNaHSIxleYT1ATQXX141YSRLmLb4/3IkLRw1gJce8vqZq26TIAMssynaICy 87J5eD9yGGUsc+gNqI+RNJlkLwOlw1RsOQgH8ty4mOdZtUpaMT69J3/GsYu+ hyBBQ9nUmdtFpOX7kUsqFaLST9obVRYK81q8bk37M22xcXNLZvtipzATTbf0 yb8wtVIkd7sDHsuAY6BLZiVN5YuRB7Rqk+/RNJ0FlF9mSo7/z5k1RatxsqgX sdHJYlO11hNRDjhW31WNhEgA285xAmAUycXuPgny401Ll1KV/w1MCkKkdvUD Vryjy/ztRWxsQKGOuEjKKQaYGzlPPtKDxw9S1nL53B9OF9/GMwS+ZytzsMlH DDAhnSqwckeedDP2RNhFd4FIHeqb5DCkby9O9coRG1olQ052Si/5uFhvcMNY qNl1PJRCaWfB3gKLt81D8fsABFnBdvHMJcvTL5I3EexxQNfqY9riiu8kmqKU t+YuN4g0gXwP9pM50GUQ75zIYMvDt5p3PdaVY/MXOQUS38uaqlrmU1ebhF4J cY+VQOc1VpZfxEXUIb7ka20JAIEHo5g6qCdTyAFa97pKcOSz8H5kVEYLCISl DhgIp8AjLvmSnoG6BEFhL63ki3le5GgAI8Qk+joln6OXao+pKkpUdqdCqvcj ixPPrJP12pcCYUg0PKbOcA7zmQyQc7jAOl+KY2NbFAs+OtWLWYdGNe+Vix95 wnOGXcSzTL/t9F4fP88TlL4rpNUa6F838kkeO/1iBadH7INszqumeIJKXgI7 TlhYjWBds9P6TYbls2uVVketbz967bVW06S40smvTh6nZ17gBlhcfFwuEVmH EPEOsedAkmRL5rLkhOr4FO9HtkVqIpIBTfQoGzHV2fa2PM8JW41hHKQpXuxB NIA6ldfREQVzLOAD/uHViwPwR2+rPpiJv9jdLbaMYu8idvsgRZhYdMhp0Hwd Alhi8H6zU4bsilTguU4cumjQYF7Oshy6IHUy+gf/7IblAm0ZAdqsrpl1cK9M hSaBgNXE7myxs8zvR9YmGxvQMB0yIe52HnU3ZgImPVHZrN7y+WKnqBw+yeMW 4tgDAskjVpDvQKeOcTb4XYZuu76PDdWbIh1YwhxlCT+bGhActPBZ06bY2ISr vx+5Lr8hBmlG1VqOqIObfZBULZr8kXP2nfe62CmxuNEhW0bGasTIqKE/D+ue 7hop9pBnSxfIH2KQqNpB7twpsmNa7WB2UF+iMkOQc8gpNysYoHEw5D3Yg7As KF3lMb2OdZgj4pvJQWVdoCiLtZSSQpBzNtQ/rD5P1Rfg0bzQNaWxL+a57+hn 
--------------000301090406040003030205-- From owner-xfs@oss.sgi.com Thu May 10 08:55:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 08:55:10 -0700 (PDT) Received: from waste.org (waste.org [66.93.16.53]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4AFt4fB031181 for ; Thu, 10 May 2007 08:55:06 -0700 Received: from waste.org (localhost [127.0.0.1]) by waste.org (8.13.8/8.13.8/Debian-3) with ESMTP id l4AFcYjQ028027 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Thu, 10 May 2007 10:38:34 -0500 Received: (from oxymoron@localhost) by waste.org (8.13.8/8.13.8/Submit) id l4AFcXV8028020; Thu, 10 May 2007 10:38:33 -0500 Date: Thu, 10 May 2007
10:38:33 -0500 From: Matt Mackall To: Jeremy Fitzhardinge Cc: David Chinner , Linux Kernel Mailing List , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? Message-ID: <20070510153832.GQ11115@waste.org> References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <46433049.4020003@goop.org> User-Agent: Mutt/1.5.13 (2006-08-11) X-archive-position: 11375 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: mpm@selenic.com Precedence: bulk X-list: xfs On Thu, May 10, 2007 at 07:46:33AM -0700, Jeremy Fitzhardinge wrote: > David Chinner wrote: > > On Wed, May 09, 2007 at 05:54:09PM -0700, Jeremy Fitzhardinge wrote: > > > >> David Chinner wrote: > >> > >>> Suspend-resume, eh? > >>> > >>> There's an immediate suspect. Can you test this specifically for us? > >>> i.e. download a known good file set, do some stuff, suspend, resume, > >>> then check the files? If it doesn't show up the first time, can > >>> you do it a few times just to rule it out? > >>> > >> Well, I've been doing suspend-resume with xfs for a while without > >> problems; the problems seem to be recent and easily repeatable. Which > >> just means that it could be a new suspend-resume problem, of course. > >> > > > > Ok. I'm just trying to find a relatively simple test case for the > > problem - seeing as you seem to be able to reliably reproduce this > > we should be able to work out the trigger... 
> > > > OK, I was able to reproduce it reliably with a script which did basically: > > for i in `seq 20`; do > hg clone -U --pull a b-$i > hg verify b-$i # always OK > umount /home > sleep 5 > mount /home > hg verify b-$i # often found truncated files > done > > > No suspend/resumes involved. The trees are linux kernel ones, so fairly > large, but small enough to fit entirely in core. My script also > captured xfs_bmap before/after output for files which had tended to be > corrupted in the past, but unfortunately none of them got corrupted in > these tests. But I do have all the trees lying around to extract more > detail for you if you like. > > Interestingly, the corruption happened in each case around the same > place in the tree, often in the sata drivers. I wonder if that was just > related to the timing of this script. I guess this pins it as an XFS problem pretty solidly. This test looks like it should consist solely of open-for-append and write on about 20k files in the target directory. Because of the --pull, no hardlinks are involved. It shouldn't be all that different from doing tar cf - a | tar xf - b. The files get visited in alphabetical order, so the start of the corruption may be telling. -- Mathematics is the supreme nostalgia of our time. 
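Since the files are visited in alphabetical order, it may help to record every file's size before the umount and compare after the remount, so the first shortened file in sort order is easy to spot. A minimal sketch, assuming a plain directory walk is good enough; the function names snapshot_sizes and shrunken_files are made up for illustration and are not part of hg or the XFS tools:

```python
import os


def snapshot_sizes(root):
    """Map each file's path (relative to root) to its size in bytes."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            sizes[os.path.relpath(path, root)] = os.path.getsize(path)
    return sizes


def shrunken_files(before, after):
    """Return (path, old_size, new_size) for files that got shorter or vanished.

    The result is sorted by path, so the first entry is where the damage
    starts in alphabetical order.
    """
    result = []
    for path in sorted(before):
        new_size = after.get(path, 0)
        if new_size < before[path]:
            result.append((path, before[path], new_size))
    return result
```

Usage would be something like taking snapshot_sizes("b-1") right after the first hg verify, doing the umount/mount, taking a second snapshot and printing shrunken_files(first, second).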
From owner-xfs@oss.sgi.com Thu May 10 14:14:11 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 14:14:15 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4ALE7fB007064 for ; Thu, 10 May 2007 14:14:10 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id HAA29144; Fri, 11 May 2007 07:13:58 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4ALDsAf89781116; Fri, 11 May 2007 07:13:55 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4ALDm4I90595453; Fri, 11 May 2007 07:13:48 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 11 May 2007 07:13:48 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070510211348.GC86004887@sgi.com> References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <46433049.4020003@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11376 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 10, 2007 at 07:46:33AM -0700, Jeremy Fitzhardinge wrote: > David Chinner wrote: > > On Wed, May 09, 2007 at 05:54:09PM -0700, Jeremy Fitzhardinge wrote: > > > >> David Chinner wrote: > >> > >>> Suspend-resume, eh? > >>> > >>> There's an immediate suspect. Can you test this specifically for us? > >>> i.e. download a known good file set, do some stuff, suspend, resume, > >>> then check the files? If it doesn't show up the first time, can > >>> you do it a few times just to rule it out? > >>> > >> Well, I've been doing suspend-resume with xfs for a while without > >> problems; the problems seem to be recent and easily repeatable. Which > >> just means that it could be a new suspend-resume problem, of course. > >> > > > > Ok. I'm just trying to find a relatively simple test case for the > > problem - seeing as you seem to be able to reliably reproduce this > > we should be able to work out the trigger... > > > > OK, I was able to reproduce it reliably with a script which did basically: > > for i in `seq 20`; do > hg clone -U --pull a b-$i > hg verify b-$i # always OK > umount /home > sleep 5 > mount /home > hg verify b-$i # often found truncated files > done > > > No suspend/resumes involved. The trees are linux kernel ones, so fairly > large, but small enough to fit entirely in core. 
My script also > captured xfs_bmap before/after output for files which had tended to be > corrupted in the past, but unfortunately none of them got corrupted in > these tests. But I do have all the trees lying around to extract more > detail for you if you like. Ok, so most of the integrity errors are processed by an error like this: drivers/scsi/sata_sil24.c index contains -98 extra bytes unpacking file drivers/scsi/sata_sil24.c 5715cdfceaca: Error -5 while decompressing data That's an -EIO and not a normal error to report. Are there any errors in dmesg or syslog corresponding to this? The errors tend to imply problems decompressing and patching files, not that truncates are occurring once the files have been patched. Can you check that what is being pulled from the repository is correct before it gets uncompressed? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 10 14:23:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 14:23:42 -0700 (PDT) Received: from waste.org (waste.org [66.93.16.53]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4ALNcfB008822 for ; Thu, 10 May 2007 14:23:39 -0700 Received: from waste.org (localhost [127.0.0.1]) by waste.org (8.13.8/8.13.8/Debian-3) with ESMTP id l4ALNOYW012939 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Thu, 10 May 2007 16:23:24 -0500 Received: (from oxymoron@localhost) by waste.org (8.13.8/8.13.8/Submit) id l4ALNNob012937; Thu, 10 May 2007 16:23:23 -0500 Date: Thu, 10 May 2007 16:23:23 -0500 From: Matt Mackall To: David Chinner Cc: Jeremy Fitzhardinge , Linux Kernel Mailing List , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070510212323.GS11115@waste.org> References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> <20070510211348.GC86004887@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070510211348.GC86004887@sgi.com> User-Agent: Mutt/1.5.13 (2006-08-11) X-archive-position: 11377 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: mpm@selenic.com Precedence: bulk X-list: xfs On Fri, May 11, 2007 at 07:13:48AM +1000, David Chinner wrote: > On Thu, May 10, 2007 at 07:46:33AM -0700, Jeremy Fitzhardinge wrote: > > David Chinner wrote: > > > On Wed, May 09, 2007 at 05:54:09PM -0700, Jeremy Fitzhardinge wrote: > > > > > >> David Chinner wrote: > > >> > > >>> Suspend-resume, eh? > > >>> > > >>> There's an immediate suspect. Can you test this specifically for us? > > >>> i.e. download a known good file set, do some stuff, suspend, resume, > > >>> then check the files? If it doesn't show up the first time, can > > >>> you do it a few times just to rule it out? > > >>> > > >> Well, I've been doing suspend-resume with xfs for a while without > > >> problems; the problems seem to be recent and easily repeatable. Which > > >> just means that it could be a new suspend-resume problem, of course. > > >> > > > > > > Ok. I'm just trying to find a relatively simple test case for the > > > problem - seeing as you seem to be able to reliably reproduce this > > > we should be able to work out the trigger... 
> > > > > > > OK, I was able to reproduce it reliably with a script which did basically: > > > > for i in `seq 20`; do > > hg clone -U --pull a b-$i > > hg verify b-$i # always OK > > umount /home > > sleep 5 > > mount /home > > hg verify b-$i # often found truncated files > > done > > > > > > No suspend/resumes involved. The trees are linux kernel ones, so fairly > > large, but small enough to fit entirely in core. My script also > > captured xfs_bmap before/after output for files which had tended to be > > corrupted in the past, but unfortunately none of them got corrupted in > > these tests. But I do have all the trees lying around to extract more > > detail for you if you like. > > Ok, so most of the integrity errors are processed by an > error like this: > > drivers/scsi/sata_sil24.c index contains -98 extra bytes > unpacking file drivers/scsi/sata_sil24.c 5715cdfceaca: Error -5 while decompressing data > > That's an -EIO and not a normal error to report. Are there any > errors in dmesg or syslog corresponding to this? > > The errors tend to imply problems decompressing and patching files, > not that truncates are occurring once the files have been patched. > Can you check that what is being pulled from the repository is correct > before it gets uncompressed? Notice that verify gets run twice. Before unmount, it's fine; after remount, it's not. That message saying that the file contains -98 extra bytes is Mercurial detecting the truncation before it tries to read and decompress the truncated bit. -- Mathematics is the supreme nostalgia of our time. 
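The "-98 extra bytes" check can be pictured as plain arithmetic: add up the lengths the index says it should hold and subtract that from the size of the file on disk. The sketch below is a simplified model only, not Mercurial's actual revlog code; the function name extra_bytes and its parameters are hypothetical.

```python
def extra_bytes(record_lengths, actual_size, header_size=0):
    """Difference between the on-disk size and the size the records imply.

    Zero means the file is exactly as long as expected, a positive value
    means trailing junk, and a negative value means the file was cut short.
    """
    expected_size = header_size + sum(record_lengths)
    return actual_size - expected_size


# Records totalling 398 bytes, but only 300 bytes on disk:
# the file is 98 bytes short, reported as -98 "extra" bytes.
print(extra_bytes([100, 200, 98], 300))
```

A negative result is known before any decompression is attempted, which fits with the size message showing up ahead of the decompression error.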
From owner-xfs@oss.sgi.com Thu May 10 14:32:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 14:32:28 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4ALWPfB010384 for ; Thu, 10 May 2007 14:32:26 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id AF1BB2C8048; Thu, 10 May 2007 14:31:38 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 51A9E2C8043; Thu, 10 May 2007 14:31:38 -0700 (PDT) Received: from [10.100.2.2] (207.47.60.4.static.nextweb.net [207.47.60.4]) by lurch.goop.org (Postfix) with ESMTP; Thu, 10 May 2007 14:31:38 -0700 (PDT) Message-ID: <46438F67.9060503@goop.org> Date: Thu, 10 May 2007 14:32:23 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> <20070510211348.GC86004887@sgi.com> In-Reply-To: <20070510211348.GC86004887@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11378 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs David Chinner wrote: > Ok, so most of the integrity errors are processed by an > error like this: > > drivers/scsi/sata_sil24.c index contains -98 extra bytes > unpacking file drivers/scsi/sata_sil24.c 5715cdfceaca: Error -5 while decompressing data > > That's an -EIO and not a normal error to report. Are there any > errors in dmesg or syslog corresponding to this? > No, that's an error code from zlib: #define Z_BUF_ERROR (-5) I think it means it got a truncated buffer while decompressing. > The errors tend to imply problems decompressing and patching files, > not that truncates are occurring once the files have been patched. > Can you check that what is being pulled from the repository is correct > before it gets uncompressed? > The hg verify checks the integrity of all the files by decompressing them and making sure their sha1 hashes are correct. The fact that the first hg verify passed is a very strong check that the whole repo's integrity is sound, both in structure and content. The second, failing hg verify's messages are all related to truncation. I haven't checked this comprehensively, but in every instance I've checked the files are identical up to the truncation point. All the error messages are consistent with pure truncation, not content differences or IO errors. 
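The zlib reading of "Error -5" is easy to check from Python (the language Mercurial is written in): cutting bytes off the end of an otherwise valid zlib stream gives the same message text, with no kernel I/O error involved. A minimal sketch; the payload and the helper name try_decompress are invented for illustration:

```python
import zlib

# Some compressible stand-in data; the real revlog contents are not needed.
payload = b"drivers/scsi/sata_sil24.c\n" * 200
compressed = zlib.compress(payload)


def try_decompress(data):
    """Return the decompressed bytes, or the zlib error text on failure."""
    try:
        return zlib.decompress(data)
    except zlib.error as exc:
        return str(exc)


# The intact stream round-trips.
assert try_decompress(compressed) == payload

# Dropping the last few bytes simulates a file that was cut short on disk.
print(try_decompress(compressed[:-10]))
```

The message printed for the shortened stream starts with "Error -5 while decompressing data", i.e. Z_BUF_ERROR from zlib rather than an errno from a read() call.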
J From owner-xfs@oss.sgi.com Thu May 10 14:46:32 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 14:46:34 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4ALkVfB012587 for ; Thu, 10 May 2007 14:46:32 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id 934AE2C8048; Thu, 10 May 2007 14:45:44 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 0EE002C8043; Thu, 10 May 2007 14:45:43 -0700 (PDT) Received: from [10.100.2.2] (207.47.60.4.static.nextweb.net [207.47.60.4]) by lurch.goop.org (Postfix) with ESMTP; Thu, 10 May 2007 14:45:42 -0700 (PDT) Message-ID: <464392B4.3070009@goop.org> Date: Thu, 10 May 2007 14:46:28 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Chuck Ebbert CC: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> In-Reply-To: <46439185.5060207@redhat.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11379 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs Chuck Ebbert wrote: > What CPU architecture is this happening on? Not i686 with PAE by > any chance? Yes. Why? 
J From owner-xfs@oss.sgi.com Thu May 10 14:49:36 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 14:49:38 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4ALnZfB013092 for ; Thu, 10 May 2007 14:49:36 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id E76F82C804A; Thu, 10 May 2007 14:48:48 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id BFAA32C8043; Thu, 10 May 2007 14:48:46 -0700 (PDT) Received: from [10.100.2.2] (207.47.60.4.static.nextweb.net [207.47.60.4]) by lurch.goop.org (Postfix) with ESMTP; Thu, 10 May 2007 14:48:46 -0700 (PDT) Message-ID: <4643936B.8060708@goop.org> Date: Thu, 10 May 2007 14:49:31 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Jeremy Fitzhardinge CC: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> <20070510211348.GC86004887@sgi.com> <46438F67.9060503@goop.org> In-Reply-To: <46438F67.9060503@goop.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11380 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs Jeremy Fitzhardinge wrote: > I haven't checked > this comprehensively I just did. They're all pure truncations. 
J From owner-xfs@oss.sgi.com Thu May 10 14:51:34 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 14:51:37 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4ALpXfB013510 for ; Thu, 10 May 2007 14:51:34 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l4ALpVx1013624; Thu, 10 May 2007 17:51:31 -0400 Received: from mail.boston.redhat.com (mail.boston.redhat.com [172.16.76.12]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l4ALpUUE007437; Thu, 10 May 2007 17:51:30 -0400 Received: from [172.16.83.145] (dhcp83-145.boston.redhat.com [172.16.83.145]) by mail.boston.redhat.com (8.12.11.20060308/8.12.11) with ESMTP id l4ALpTJA025798; Thu, 10 May 2007 17:51:29 -0400 Message-ID: <464393E1.3050705@redhat.com> Date: Thu, 10 May 2007 17:51:29 -0400 From: Chuck Ebbert Organization: Red Hat User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Jeremy Fitzhardinge CC: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> In-Reply-To: <464392B4.3070009@goop.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11381 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cebbert@redhat.com Precedence: bulk X-list: xfs Jeremy Fitzhardinge wrote: > Chuck Ebbert wrote: >> What CPU architecture is this happening on? Not i686 with PAE by >> any chance? > > Yes. Why? I have a bug report where NFS files are corrupted only with PAE clients. 
Corruption is at the end of the (newly untarred) files. Doesn't happen without PAE. From owner-xfs@oss.sgi.com Thu May 10 14:54:30 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 14:54:32 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4ALsRfB014138 for ; Thu, 10 May 2007 14:54:29 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id 8762C2C8048; Thu, 10 May 2007 14:53:40 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 182372C8043; Thu, 10 May 2007 14:53:40 -0700 (PDT) Received: from [10.100.2.2] (207.47.60.4.static.nextweb.net [207.47.60.4]) by lurch.goop.org (Postfix) with ESMTP; Thu, 10 May 2007 14:53:39 -0700 (PDT) Message-ID: <46439491.9010604@goop.org> Date: Thu, 10 May 2007 14:54:25 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Chuck Ebbert CC: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> In-Reply-To: <464393E1.3050705@redhat.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11382 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs Chuck Ebbert wrote: > Jeremy Fitzhardinge wrote: > >> Chuck Ebbert wrote: >> >>> What CPU architecture is this happening on? Not i686 with PAE by >>> any chance? >>> >> Yes. Why? >> > > I have a bug report where NFS files are corrupted only with PAE clients. 
> Corruption is at the end of the (newly untarred) files. Doesn't happen > without PAE. > Hm, suggestive, but I'm not convinced. Two differences to this situation: 1. Immediately after the clone ("untar"), the contents are completely OK; it's only after a umount/mount cycle that problems appear. 2. There's no corruption as such; the files are just too short. And it seems they're at a previously OK length, not some random size. J From owner-xfs@oss.sgi.com Thu May 10 15:05:17 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 15:05:19 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4AM5FfB015355 for ; Thu, 10 May 2007 15:05:16 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l4ALfQWk007310; Thu, 10 May 2007 17:41:26 -0400 Received: from mail.boston.redhat.com (mail.boston.redhat.com [172.16.76.12]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l4ALfPM4004474; Thu, 10 May 2007 17:41:26 -0400 Received: from [172.16.83.145] (dhcp83-145.boston.redhat.com [172.16.83.145]) by mail.boston.redhat.com (8.12.11.20060308/8.12.11) with ESMTP id l4ALfPUP024073; Thu, 10 May 2007 17:41:25 -0400 Message-ID: <46439185.5060207@redhat.com> Date: Thu, 10 May 2007 17:41:25 -0400 From: Chuck Ebbert Organization: Red Hat User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Jeremy Fitzhardinge CC: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> In-Reply-To: <46426194.3040403@goop.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11383 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cebbert@redhat.com Precedence: bulk X-list: xfs Jeremy Fitzhardinge wrote: > David Chinner wrote: >> Seems very unlikely. Have you unmounted and mounted the filesystem >> (or rebooted or suspended) between the files being seen good and >> the files being seen bad? >> > > There was definitely a suspend-resume, and maybe a reboot. I'll try > again later on. > What CPU architecture is this happening on? Not i686 with PAE by any chance? From owner-xfs@oss.sgi.com Thu May 10 15:40:23 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 15:40:27 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4AMeKfB018533 for ; Thu, 10 May 2007 15:40:22 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id IAA02051; Fri, 11 May 2007 08:40:03 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4AMdvAf90673838; Fri, 11 May 2007 08:39:58 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4AMdpXS90607975; Fri, 11 May 2007 08:39:51 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 11 May 2007 08:39:50 +1000 From: David Chinner To: "Amit K. 
Arora" Cc: David Chinner , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070510223950.GD86004887@sgi.com> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070509160102.GA30745@amitarora.in.ibm.com> <20070510005926.GT85884050@sgi.com> <20070510115620.GB21400@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070510115620.GB21400@amitarora.in.ibm.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11384 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote: > On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote: > > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote: > > > I have the updated patches ready which take care of Andrew's comments. > > > Will run some tests and post them soon. > > > > > > But, before submitting these patches, I think it will be better to > > > finalize on certain things which might be worth some discussion here: > > > > > > 1) Should the file size change when preallocation is done beyond EOF ? > > > - Andreas and Chris Wedgwood are in favor of not changing the file size > > > in this case. I also tend to agree with them. Does anyone has an > > > argument in favor of changing the filesize ? 
If not, I will remove the > > > code which changes the filesize, before I resubmit the concerned ext4 > > > patch. > > > > I think there needs to be both. If we don't have a mechanism to atomically > > change the file size with the preallocation, then applications that use > > stat() to work out if they need to preallocate more space will end up > > racing. > > By "both" above, do you mean we should give user the flexibility if it wants > the filesize changed or not ? It can be done by having *two* modes for > preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we > use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not > change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate() > will change the filesize if required (i.e. when allocation is beyond EOF) > and also update [cm]time. This way, the application can decide what it > wants. Yes, that's right. > This will be helpful for the partial allocation scenario also. Think of the > case when we do not change the filesize in fallocate() and expect > applications/posix_fallocate() to do ftruncate() after fallocate() for this. > Now if fallocate() results in a partial allocation with -ENOSPC error > returned, applications/posix_fallocate() will not know for what length > ftruncate() has to be called. :( Well, posix_fallocate() either gets all the space or it fails. If you truncate to extend the file size after an ENOSPC, then that is a buggy implementation. The same could be said for any application, or even the fallocate() call itself if it changes the filesize without having completely preallocated the space asked.... > Hence it may be a good idea to give user the flexibility if it wants to > atomically change the file size with preallocation or not. But, with more > flexibility there comes inconsistency in behavior, which is worth > considering. We've got different modes to specify different behaviour. 
That's what the mode field was put there for in the first place - the interface is *designed* to support different preallocation behaviours.... > > > 2) For FA_UNALLOCATE mode, should the file system allow unallocation of > > > normal (non-preallocated) blocks (blocks allocated via regular > > > write/truncate operations) also (i.e. work as punch()) ? > > > > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what > > I did for FA_UNALLOCATE as well. > > Ok. But, some people may not expect/like this. I think, we can keep it on > the backburner for a while, till other issues are sorted out. How can it be a "backburner" issue when it defines the implementation? I've already implemented something in XFS that sort of does what I think the interface is supposed to do, but I need that interface to be nailed down before proceeding any further. All I'm really interested in right now is that the fallocate _interface_ can be used as a *complete replacement* for the pre-existing XFS-specific ioctls that are already used by applications. What ext4 can or can't do right now is irrelevant to this discussion - the interface definition needs to take priority over implementation.... 
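The FA_ALLOCATE / FA_PREALLOCATE split discussed above can be sketched as a small user-space model. This is only an illustration under stated assumptions, not kernel code and not any of the patches in this thread: the mode names are the ones proposed here, but their numeric values, struct toy_file and toy_fallocate() are hypothetical, and space is tracked as a simple high-water mark rather than real extents. It shows the two points being argued: a request is all-or-nothing on ENOSPC, and the file size moves together with the allocation only in FA_ALLOCATE mode.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Mode names follow the proposal in this thread; the values are invented
 * for this sketch and are not a real ABI. */
enum fa_mode { FA_ALLOCATE = 1, FA_PREALLOCATE = 2 };

/* Hypothetical model of a file, not a kernel structure. */
struct toy_file {
	long size;      /* what stat() would report */
	long reserved;  /* high-water mark of space reserved from offset 0 */
};

/*
 * Reserve space for [offset, offset + len).  free_space models what the
 * filesystem has left.  Either the whole range is reserved or nothing
 * changes; the size is updated in the same step only for FA_ALLOCATE.
 */
int toy_fallocate(struct toy_file *f, long *free_space,
		  enum fa_mode mode, long offset, long len)
{
	long end, extra;

	assert(f != NULL && free_space != NULL);
	if (offset < 0 || len <= 0)
		return -EINVAL;

	end = offset + len;
	extra = end > f->reserved ? end - f->reserved : 0;
	if (extra > *free_space)
		return -ENOSPC;	/* no partial allocation, size untouched */

	*free_space -= extra;
	f->reserved += extra;
	if (mode == FA_ALLOCATE && end > f->size)
		f->size = end;	/* size changes together with the space */
	return 0;
}
```

Because a failed call leaves both the reservation and the size untouched, a caller in this model never has to guess what length to pass to ftruncate() after an error.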
Cheers, Dave, -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 10 15:58:46 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 15:58:52 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4AMwhfB019892 for ; Thu, 10 May 2007 15:58:45 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id IAA02503; Fri, 11 May 2007 08:58:40 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4AMwbAf90654675; Fri, 11 May 2007 08:58:38 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4AMwYWr90638489; Fri, 11 May 2007 08:58:34 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 11 May 2007 08:58:34 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: Chuck Ebbert , David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070510225834.GF86004887@sgi.com> References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <46439491.9010604@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11385 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 10, 2007 at 02:54:25PM -0700, Jeremy Fitzhardinge wrote: > Chuck Ebbert wrote: > > Jeremy Fitzhardinge wrote: > > > >> Chuck Ebbert wrote: > >> > >>> What CPU architecture is this happening on? Not i686 with PAE by > >>> any chance? > >>> > >> Yes. Why? > >> > > > > I have a bug report where NFS files are corrupted only with PAE clients. > > Corruption is at the end of the (newly untarred) files. Doesn't happen > > without PAE. > > > > Hm, suggestive, but I'm not convinced. Two differences to this situation: > > 1. Immediately after the clone ("untar"), the contents are completely > OK; it's only after a umount/mount cycle that problems appear > 2. There's no corruption as such; the files are just too short. And > it seems they're at a previously OK length, not some random size. Just to confirm this isn't a result of a recent change, can you reproduce this on a 2.6.20 or 2.6.21 kernel? (sorry if you've already done this - I'm juggling so many things at once it's easy to forget little things). Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 10 16:07:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 16:07:40 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4AN7WfB020556 for ; Thu, 10 May 2007 16:07:33 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id 750E12C8048; Thu, 10 May 2007 16:06:45 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 1DF2F2C8043; Thu, 10 May 2007 16:06:45 -0700 (PDT) Received: from [10.100.2.2] (207.47.60.4.static.nextweb.net [207.47.60.4]) by lurch.goop.org (Postfix) with ESMTP; Thu, 10 May 2007 16:06:45 -0700 (PDT) Message-ID: <4643A5B2.3060906@goop.org> Date: Thu, 10 May 2007 16:07:30 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Chuck Ebbert , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070510225834.GF86004887@sgi.com> In-Reply-To: <20070510225834.GF86004887@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11386 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs David Chinner wrote: > Just to confirm this isn't a result of a recent change, can you reproduce > this on a 2.6.20 or 2.6.21 kernel? 
(sorry if you've already done this - I've juggling > some many things at once it's easy to forget little things). It is the result of a recent change. I had seen no problem until around 2.6.21-git8-11. I will try again with a plain 2.6.21 kernel, just to confirm. J From owner-xfs@oss.sgi.com Thu May 10 16:08:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 16:08:12 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4AN82fB020669 for ; Thu, 10 May 2007 16:08:04 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA02814; Fri, 11 May 2007 09:08:01 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4AN7vAf90414844; Fri, 11 May 2007 09:07:57 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4AN7t2o89605496; Fri, 11 May 2007 09:07:55 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 11 May 2007 09:07:55 +1000 From: David Chinner To: Chuck Ebbert Cc: Jeremy Fitzhardinge , David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070510230755.GG86004887@sgi.com> References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <464393E1.3050705@redhat.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11387 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 10, 2007 at 05:51:29PM -0400, Chuck Ebbert wrote: > Jeremy Fitzhardinge wrote: > > Chuck Ebbert wrote: > >> What CPU architecture is this happening on? Not i686 with PAE by > >> any chance? > > > > Yes. Why? > > I have a bug report where NFS files are corrupted only with PAE clients. > Corruption is at the end of the (newly untarred) files. Doesn't happen > without PAE. Chuck, can you post a pointer to this thread? Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 10 16:27:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 16:27:47 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4ANRafB025565 for ; Thu, 10 May 2007 16:27:39 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA03331; Fri, 11 May 2007 09:27:33 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4ANRWAf90008022; Fri, 11 May 2007 09:27:32 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4ANRT8U90616905; Fri, 11 May 2007 09:27:29 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 11 May 2007 09:27:29 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: David Chinner , Chuck Ebbert , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070510232729.GH86004887@sgi.com> References: <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070510225834.GF86004887@sgi.com> <4643A5B2.3060906@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4643A5B2.3060906@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11388 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 10, 2007 at 04:07:30PM -0700, Jeremy Fitzhardinge wrote: > David Chinner wrote: > > Just to confirm this isn't a result of a recent change, can you reproduce > > this on a 2.6.20 or 2.6.21 kernel? (sorry if you've already done this - I'm juggling > > so many things at once it's easy to forget little things). > > It is the result of a recent change. I had seen no problem until around > 2.6.21-git8-11. I will try again with a plain 2.6.21 kernel, just to > confirm. Ok, this is important to know because we merged a mod around that time that changes the way we handle the updates to the file size i.e. the fix for the NULL-files-on-crash problem: http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ba87ea699ebd9dd577bf055ebc4a98200e337542 and that means the size of the file is not updated to the incore cached inode until after the data write is complete. The symptoms being seen would match with an inode-not-being-written-after-last-data-write bug in this mod.... Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 10 16:49:41 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 16:49:43 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4ANnefB028196 for ; Thu, 10 May 2007 16:49:41 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id D3E782C804B; Thu, 10 May 2007 16:48:51 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 965EA2C8047; Thu, 10 May 2007 16:48:50 -0700 (PDT) Received: from [10.100.2.2] (207.47.60.4.static.nextweb.net [207.47.60.4]) by lurch.goop.org (Postfix) with ESMTP; Thu, 10 May 2007 16:48:50 -0700 (PDT) Message-ID: <4643AF8F.5040705@goop.org> Date: Thu, 10 May 2007 16:49:35 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Chuck Ebbert , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070510225834.GF86004887@sgi.com> <4643A5B2.3060906@goop.org> <20070510232729.GH86004887@sgi.com> In-Reply-To: <20070510232729.GH86004887@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11389 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs David Chinner wrote: > Ok, this is important to kow becase we merged a mod around that time > that changes the way we handle the updates to the file size i.e. 
the > fix for the NULL-files-on-crash problem: > > http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ba87ea699ebd9dd577bf055ebc4a98200e337542 > > and that means the size of the file is not updated to the incore > cached inode until after the data write is complete. The symptoms > being seen would match with a inode-not-being-written-after-last- > data-write-bug in this mod.... > Yes, that does look like a good candidate. Should I try to before-and-after this change? J From owner-xfs@oss.sgi.com Thu May 10 17:33:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 17:33:12 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4B0X7fB001179 for ; Thu, 10 May 2007 17:33:08 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA04793; Fri, 11 May 2007 10:33:02 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4B0X0Af90668740; Fri, 11 May 2007 10:33:00 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4B0WvOp90586679; Fri, 11 May 2007 10:32:57 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 11 May 2007 10:32:57 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: David Chinner , Chuck Ebbert , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070511003257.GL86004887@sgi.com> References: <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070510225834.GF86004887@sgi.com> <4643A5B2.3060906@goop.org> <20070510232729.GH86004887@sgi.com> <4643AF8F.5040705@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4643AF8F.5040705@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11390 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 10, 2007 at 04:49:35PM -0700, Jeremy Fitzhardinge wrote: > David Chinner wrote: > > Ok, this is important to kow becase we merged a mod around that time > > that changes the way we handle the updates to the file size i.e. the > > fix for the NULL-files-on-crash problem: > > > > http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ba87ea699ebd9dd577bf055ebc4a98200e337542 > > > > and that means the size of the file is not updated to the incore > > cached inode until after the data write is complete. The symptoms > > being seen would match with a inode-not-being-written-after-last- > > data-write-bug in this mod.... > > > > Yes, that does look like a good candidate. Should I try to > before-and-after this change? Yes please! Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 10 17:36:19 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 17:36:24 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4B0aEfB001875 for ; Thu, 10 May 2007 17:36:16 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA04854; Fri, 11 May 2007 10:36:08 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4B0a7Af90635377; Fri, 11 May 2007 10:36:08 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4B0a7LC90645469; Fri, 11 May 2007 10:36:07 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 11 May 2007 10:36:07 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: Review: Concurrent Multi-File Data Streams Message-ID: <20070511003606.GB85884050@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11391 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Concurrent Multi-File Data Streams In media spaces, video is often stored in a frame-per-file format. When dealing with uncompressed realtime HD video streams in this format, it is crucial that files do not get fragmented and that multiple files are placed contiguously on disk. 
When multiple streams are being ingested and played out at the same time, it is critical that the filesystem does not cross the streams and interleave them together as this creates seek and readahead cache miss latency and prevents both ingest and playout from meeting frame rate targets. This patch introduces a "stream of files" concept into the allocator to place all the data from a single stream contiguously on disk so that RAID array readahead can be used effectively. Each additional stream gets placed in different allocation groups within the filesystem, thereby ensuring that we don't cross any streams. When an AG fills up, we select a new AG for the stream that is not in use. The core of the functionality is the stream tracking - each inode that we create in a directory needs to be associated with the directory's stream. Hence every time we create a file, we look up the directory's stream object and associate the new file with that object. Once we have a stream object for a file, we use the AG that the stream object points to for allocations. If we can't allocate in that AG (e.g. it is full) we move the entire stream to another AG. Other inodes in the same stream are moved to the new AG on their next allocation (i.e. lazy update). Stream objects are kept in a cache and hold a reference on the inode. Hence the inode cannot be reclaimed while there is an outstanding stream reference. This means that on unlink we need to remove the stream association and we also need to flush all the associations on certain events that want to reclaim all unreferenced inodes (e.g. filesystem freeze). The following patch survives XFSQA with timeouts set to minimum, default, 500s and maximum. The patch has not had a great deal of low memory testing, and the object cache may need a shrinker interface to work in low memory conditions. Comments? 
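The stream-to-AG association described above can be illustrated with a minimal user-space sketch. It is a toy model under loud assumptions, not the code in the patch below: struct toy_fs, struct toy_stream, toy_pick_ag() and toy_stream_alloc() are hypothetical names, there are only four AGs, and the MRU cache, inode references and timeouts are left out entirely. Each directory owns one stream object, every file created in that directory allocates through it, and when the current AG cannot satisfy a request the whole stream moves to an AG that no other stream is using.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_AGCOUNT	4
#define TOY_NULLAG	(-1)

/* Hypothetical filesystem model: free space and stream count per AG. */
struct toy_fs {
	long free_blocks[TOY_AGCOUNT];
	int  streams[TOY_AGCOUNT];	/* streams currently using each AG */
};

/* One per directory; files in the directory share it. */
struct toy_stream {
	int ag;
};

/* Pick the first AG that no stream is using and that can hold len blocks. */
int toy_pick_ag(const struct toy_fs *fs, long len)
{
	int ag;

	for (ag = 0; ag < TOY_AGCOUNT; ag++)
		if (fs->streams[ag] == 0 && fs->free_blocks[ag] >= len)
			return ag;
	return TOY_NULLAG;
}

/*
 * Allocate len blocks for a file belonging to stream s.  Returns the AG
 * used, or TOY_NULLAG if no unused AG can hold the request.
 */
int toy_stream_alloc(struct toy_fs *fs, struct toy_stream *s, long len)
{
	assert(fs != NULL && s != NULL && len > 0);

	if (s->ag == TOY_NULLAG || fs->free_blocks[s->ag] < len) {
		int new_ag = toy_pick_ag(fs, len);

		if (new_ag == TOY_NULLAG)
			return TOY_NULLAG;
		if (s->ag != TOY_NULLAG)
			fs->streams[s->ag]--;	/* stream leaves the old AG */
		fs->streams[new_ag]++;
		s->ag = new_ag;			/* the whole stream moves */
	}
	fs->free_blocks[s->ag] -= len;
	return s->ag;
}
```

Because sibling files share the same stream object in this model, they pick up the new AG on their next allocation without being touched individually, which is the lazy update mentioned above.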
Credits: The original filestream allocator on Irix was written by Glen Overby, the Linux port and rewrite by Nathan Scott and Sam Vaughan (none of whom work at SGI any more). I just picked the pieces and beat it repeatedly with a big stick until it passed XFSQA. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/Makefile-linux-2.6 | 2 fs/xfs/linux-2.6/xfs_globals.c | 1 fs/xfs/linux-2.6/xfs_linux.h | 1 fs/xfs/linux-2.6/xfs_sysctl.c | 11 fs/xfs/linux-2.6/xfs_sysctl.h | 2 fs/xfs/quota/xfs_qm.c | 3 fs/xfs/xfs_ag.h | 1 fs/xfs/xfs_bmap.c | 337 +++++++++++++++++ fs/xfs/xfs_clnt.h | 2 fs/xfs/xfs_dinode.h | 4 fs/xfs/xfs_filestream.c | 777 +++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_filestream.h | 59 +++ fs/xfs/xfs_fs.h | 1 fs/xfs/xfs_fsops.c | 2 fs/xfs/xfs_inode.c | 17 fs/xfs/xfs_mount.c | 11 fs/xfs/xfs_mount.h | 4 fs/xfs/xfs_mru_cache.c | 607 ++++++++++++++++++++++++++++++++ fs/xfs/xfs_mru_cache.h | 225 +++++++++++ fs/xfs/xfs_vfsops.c | 25 + fs/xfs/xfs_vnodeops.c | 28 + 21 files changed, 2114 insertions(+), 6 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/Makefile-linux-2.6 =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/Makefile-linux-2.6 2007-05-10 17:22:43.486754830 +1000 +++ 2.6.x-xfs-new/fs/xfs/Makefile-linux-2.6 2007-05-10 17:24:12.975025602 +1000 @@ -54,6 +54,7 @@ xfs-y += xfs_alloc.o \ xfs_dir2_sf.o \ xfs_error.o \ xfs_extfree_item.o \ + xfs_filestream.o \ xfs_fsops.o \ xfs_ialloc.o \ xfs_ialloc_btree.o \ @@ -67,6 +68,7 @@ xfs-y += xfs_alloc.o \ xfs_log.o \ xfs_log_recover.o \ xfs_mount.o \ + xfs_mru_cache.o \ xfs_rename.o \ xfs_trans.o \ xfs_trans_ail.o \ Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_globals.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_globals.c 2007-05-10 17:22:43.486754830 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_globals.c 2007-05-10 17:24:12.987024029 +1000 @@ -49,6 +49,7 @@ xfs_param_t 
xfs_params = { .inherit_nosym = { 0, 0, 1 }, .rotorstep = { 1, 1, 255 }, .inherit_nodfrg = { 0, 1, 1 }, + .fstrm_timer = { 1, 50, 3600*100}, }; /* Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_linux.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_linux.h 2007-05-10 17:22:43.486754830 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_linux.h 2007-05-10 17:24:12.991023505 +1000 @@ -132,6 +132,7 @@ #define xfs_inherit_nosymlinks xfs_params.inherit_nosym.val #define xfs_rotorstep xfs_params.rotorstep.val #define xfs_inherit_nodefrag xfs_params.inherit_nodfrg.val +#define xfs_fstrm_centisecs xfs_params.fstrm_timer.val #define current_cpu() (raw_smp_processor_id()) #define current_pid() (current->pid) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_sysctl.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_sysctl.c 2007-05-10 17:22:43.486754830 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_sysctl.c 2007-05-10 17:24:12.991023505 +1000 @@ -243,6 +243,17 @@ static ctl_table xfs_table[] = { .extra1 = &xfs_params.inherit_nodfrg.min, .extra2 = &xfs_params.inherit_nodfrg.max }, + { + .ctl_name = XFS_FILESTREAM_TIMER, + .procname = "filestream_centisecs", + .data = &xfs_params.fstrm_timer.val, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec_minmax, + .strategy = &sysctl_intvec, + .extra1 = &xfs_params.fstrm_timer.min, + .extra2 = &xfs_params.fstrm_timer.max, + }, /* please keep this the last entry */ #ifdef CONFIG_PROC_FS { Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_sysctl.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_sysctl.h 2007-05-10 17:22:43.486754830 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_sysctl.h 2007-05-10 17:24:12.991023505 +1000 @@ -50,6 +50,7 @@ typedef struct xfs_param { xfs_sysctl_val_t inherit_nosym; /* Inherit the "nosymlinks" flag. 
*/ xfs_sysctl_val_t rotorstep; /* inode32 AG rotoring control knob */ xfs_sysctl_val_t inherit_nodfrg;/* Inherit the "nodefrag" inode flag. */ + xfs_sysctl_val_t fstrm_timer; /* Filestream dir-AG assoc'n timeout. */ } xfs_param_t; /* @@ -89,6 +90,7 @@ enum { XFS_INHERIT_NOSYM = 19, XFS_ROTORSTEP = 20, XFS_INHERIT_NODFRG = 21, + XFS_FILESTREAM_TIMER = 22, }; extern xfs_param_t xfs_params; Index: 2.6.x-xfs-new/fs/xfs/xfs_ag.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_ag.h 2007-05-10 17:22:43.494753782 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_ag.h 2007-05-10 17:24:12.995022981 +1000 @@ -196,6 +196,7 @@ typedef struct xfs_perag lock_t pagb_lock; /* lock for pagb_list */ #endif xfs_perag_busy_t *pagb_list; /* unstable blocks */ + atomic_t pagf_fstrms; /* # of filestreams active in this AG */ /* * inode allocation search lookup optimisation. Index: 2.6.x-xfs-new/fs/xfs/xfs_bmap.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_bmap.c 2007-05-10 17:22:43.494753782 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_bmap.c 2007-05-10 17:24:13.011020884 +1000 @@ -52,6 +52,7 @@ #include "xfs_quota.h" #include "xfs_trans_space.h" #include "xfs_buf_item.h" +#include "xfs_filestream.h" #ifdef DEBUG @@ -171,6 +172,14 @@ xfs_bmap_alloc( xfs_bmalloca_t *ap); /* bmap alloc argument struct */ /* + * xfs_bmap_filestreams is the underlying allocator when filestreams are + * enabled. + */ +STATIC int /* error */ +xfs_bmap_filestreams( + xfs_bmalloca_t *ap); /* bmap alloc argument struct */ + +/* * Transform a btree format file with only one leaf node, where the * extents list will fit in the inode, into an extents format file. 
  * Since the file extents are already in-core, all we have to do is
@@ -2968,10 +2977,338 @@ xfs_bmap_alloc(
 {
 	if ((ap->ip->i_d.di_flags & XFS_DIFLAG_REALTIME) && ap->userdata)
 		return xfs_bmap_rtalloc(ap);
+	if ((ap->ip->i_mount->m_flags & XFS_MOUNT_FILESTREAMS) ||
+	    (ap->ip->i_d.di_flags & XFS_DIFLAG_FILESTREAM))
+		return xfs_bmap_filestreams(ap);
 	return xfs_bmap_btalloc(ap);
 }

 /*
+ * xfs_filestreams called by xfs_bmapi for multi-file data stream filesystems.
+ *
+ * Allocate files in a directory all in the same AG. When an AG fills, pick
+ * a new AG.
+ */
+int		/* error */
+xfs_bmap_filestreams(
+	xfs_bmalloca_t	*ap)		/* bmap alloc argument struct */
+{
+	xfs_alloctype_t	atype;		/* type for allocation routines */
+	int		error;		/* error return value */
+	xfs_agnumber_t	fb_agno;	/* ag number of ap->firstblock */
+	xfs_mount_t	*mp;		/* mount point structure */
+	int		nullfb;		/* true if ap->firstblock isn't set */
+	int		rt;		/* true if inode is realtime */
+	xfs_extlen_t	align;		/* minimum allocation alignment */
+	xfs_agnumber_t	ag;
+	xfs_alloc_arg_t	args;
+	xfs_extlen_t	blen;
+	xfs_extlen_t	delta;
+	int		isaligned;
+	xfs_extlen_t	longest;
+	xfs_extlen_t	need;
+	xfs_extlen_t	nextminlen = 0;
+	int		notinit;
+	xfs_perag_t	*pag;
+	xfs_agnumber_t	startag;
+	int		tryagain;
+
+	/*
+	 * Set up variables.
+	 */
+	mp = ap->ip->i_mount;
+	rt = (ap->ip->i_d.di_flags & XFS_DIFLAG_REALTIME) && ap->userdata;
+	align = (ap->userdata && ap->ip->i_d.di_extsize &&
+		(ap->ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE)) ?
+		ap->ip->i_d.di_extsize : 0;
+	if (align) {
+		error = xfs_bmap_extsize_align(mp, ap->gotp, ap->prevp,
+						align, rt,
+						ap->eof, 0, ap->conv,
+						&ap->off, &ap->alen);
+		ASSERT(!error);
+		ASSERT(ap->alen);
+	}
+	nullfb = ap->firstblock == NULLFSBLOCK;
+	fb_agno = nullfb ? NULLAGNUMBER : XFS_FSB_TO_AGNO(mp, ap->firstblock);
+	if (nullfb) {
+		ag = xfs_filestream_get_ag(ap->ip);
+		ag = (ag != NULLAGNUMBER) ? ag : 0;
+		ap->rval = (ap->userdata) ? XFS_AGB_TO_FSB(mp, ag, 0) :
+					    XFS_INO_TO_FSB(mp, ap->ip->i_ino);
+	} else {
+		ap->rval = ap->firstblock;
+	}
+
+	xfs_bmap_adjacent(ap);
+
+	/*
+	 * If allowed, use ap->rval; otherwise must use firstblock since
+	 * it's in the right allocation group.
+	 */
+	if (nullfb || XFS_FSB_TO_AGNO(mp, ap->rval) == fb_agno)
+		;
+	else
+		ap->rval = ap->firstblock;
+	/*
+	 * Normal allocation, done through xfs_alloc_vextent.
+	 */
+	tryagain = isaligned = 0;
+	args.tp = ap->tp;
+	args.mp = mp;
+	args.fsbno = ap->rval;
+	args.maxlen = MIN(ap->alen, mp->m_sb.sb_agblocks);
+	blen = 0;
+	if (nullfb) {
+		/* _vextent doesn't pick an AG */
+		args.type = XFS_ALLOCTYPE_NEAR_BNO;
+		args.total = ap->total;
+		/*
+		 * Find the longest available space.
+		 * We're going to try for the whole allocation at once.
+		 */
+		startag = ag = XFS_FSB_TO_AGNO(mp, args.fsbno);
+		if (startag == NULLAGNUMBER) {
+			startag = ag = 0;
+		}
+		notinit = 0;
+		/*
+		 * Search for an allocation group with a single extent
+		 * large enough for the request.
+		 *
+		 * If one isn't found, then adjust the minimum allocation
+		 * size to the largest space found.
+		 */
+		down_read(&mp->m_peraglock);
+		while (blen < ap->alen) {
+			pag = &mp->m_perag[ag];
+			if (!pag->pagf_init &&
+			    (error = xfs_alloc_pagf_init(mp, args.tp,
+				    ag, XFS_ALLOC_FLAG_TRYLOCK))) {
+				up_read(&mp->m_peraglock);
+				return error;
+			}
+			/*
+			 * See xfs_alloc_fix_freelist...
+			 */
+			if (pag->pagf_init) {
+				need = XFS_MIN_FREELIST_PAG(pag, mp);
+				delta = need > pag->pagf_flcount ?
+					need - pag->pagf_flcount : 0;
+				longest = (pag->pagf_longest > delta) ?
+					(pag->pagf_longest - delta) :
+					(pag->pagf_flcount > 0 ||
+					 pag->pagf_longest > 0);
+				if (blen < longest)
+					blen = longest;
+			} else {
+				notinit = 1;
+			}
+
+			if (blen >= ap->alen)
+				break;
+
+			if (ap->userdata) {
+				if (startag == NULLAGNUMBER) {
+					/*
+					 * If startag is an invalid AG,
+					 * we've come here once before and
+					 * xfs_filestream_new_ag picked the best
+					 * currently available.
+					 *
+					 * Don't continue looping, since we
+					 * could loop forever.
+					 */
+					break;
+				}
+
+				if ((error = xfs_filestream_new_ag(ap, &ag))) {
+					up_read(&mp->m_peraglock);
+					return error;
+				}
+
+				startag = NULLAGNUMBER;
+
+				/* Go around the loop once more to set 'blen'*/
+			} else {
+				if (++ag == mp->m_sb.sb_agcount)
+					ag = 0;
+
+				if (ag == startag)
+					break;
+			}
+		}
+		up_read(&mp->m_peraglock);
+		/*
+		 * Since the above loop did a BUF_TRYLOCK, it is
+		 * possible that there is space for this request.
+		 */
+		if (notinit || blen < ap->minlen)
+			args.minlen = ap->minlen;
+		/*
+		 * If the best seen length is less than the request
+		 * length, use the best as the minimum.
+		 */
+		else if (blen < ap->alen)
+			args.minlen = blen;
+		/*
+		 * Otherwise we've seen an extent as big as alen,
+		 * use that as the minimum.
+		 */
+		else
+			args.minlen = ap->alen;
+		ap->rval = args.fsbno = XFS_AGB_TO_FSB(mp, ag, 0);
+	} else if (ap->low) {
+		args.type = XFS_ALLOCTYPE_FIRST_AG;
+		args.total = args.minlen = ap->minlen;
+	} else {
+		args.type = XFS_ALLOCTYPE_NEAR_BNO;
+		args.total = ap->total;
+		args.minlen = ap->minlen;
+	}
+	if (ap->userdata && ap->ip->i_d.di_extsize &&
+	    (ap->ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE)) {
+		args.prod = ap->ip->i_d.di_extsize;
+		if ((args.mod = (xfs_extlen_t)(do_mod(ap->off, args.prod))))
+			args.mod = (xfs_extlen_t)(args.prod - args.mod);
+	} else if (mp->m_sb.sb_blocksize >= NBPP) {
+		args.prod = 1;
+		args.mod = 0;
+	} else {
+		args.prod = NBPP >> mp->m_sb.sb_blocklog;
+		if ((args.mod = (xfs_extlen_t)(do_mod(ap->off, args.prod))))
+			args.mod = (xfs_extlen_t)(args.prod - args.mod);
+	}
+	/*
+	 * If we are not low on available data blocks, and the
+	 * underlying logical volume manager is a stripe, and
+	 * the file offset is zero then try to allocate data
+	 * blocks on stripe unit boundary.
+	 * NOTE: ap->aeof is only set if the allocation length
+	 * is >= the stripe unit and the allocation offset is
+	 * at the end of file.
+	 */
+	atype = args.type;
+	if (!ap->low && ap->aeof) {
+		if (!ap->off) {
+			args.alignment = mp->m_dalign;
+			atype = args.type;
+			isaligned = 1;
+			/*
+			 * Adjust for alignment
+			 */
+			if (blen > args.alignment && blen <= ap->alen)
+				args.minlen = blen - args.alignment;
+			args.minalignslop = 0;
+		} else {
+			/*
+			 * First try an exact bno allocation.
+			 * If it fails then do a near or start bno
+			 * allocation with alignment turned on.
+			 */
+			atype = args.type;
+			tryagain = 1;
+			args.type = XFS_ALLOCTYPE_THIS_BNO;
+			args.alignment = 1;
+			/*
+			 * Compute the minlen+alignment for the
+			 * next case. Set slop so that the value
+			 * of minlen+alignment+slop doesn't go up
+			 * between the calls.
+			 */
+			if (blen > mp->m_dalign && blen <= ap->alen)
+				nextminlen = blen - mp->m_dalign;
+			else
+				nextminlen = args.minlen;
+			if (nextminlen + mp->m_dalign > args.minlen + 1)
+				args.minalignslop =
+					nextminlen + mp->m_dalign -
+					args.minlen - 1;
+			else
+				args.minalignslop = 0;
+		}
+	} else {
+		args.alignment = 1;
+		args.minalignslop = 0;
+	}
+	args.minleft = ap->minleft;
+	args.wasdel = ap->wasdel;
+	args.isfl = 0;
+	args.userdata = ap->userdata;
+	if ((error = xfs_alloc_vextent(&args)))
+		return error;
+	if (tryagain && args.fsbno == NULLFSBLOCK) {
+		/*
+		 * Exact allocation failed. Now try with alignment
+		 * turned on.
+		 */
+		args.type = atype;
+		args.fsbno = ap->rval;
+		args.alignment = mp->m_dalign;
+		args.minlen = nextminlen;
+		args.minalignslop = 0;
+		isaligned = 1;
+		if ((error = xfs_alloc_vextent(&args)))
+			return error;
+	}
+	if (isaligned && args.fsbno == NULLFSBLOCK) {
+		/*
+		 * allocation failed, so turn off alignment and
+		 * try again.
+		 */
+		args.type = atype;
+		args.fsbno = ap->rval;
+		args.alignment = 0;
+		if ((error = xfs_alloc_vextent(&args)))
+			return error;
+	}
+	if (args.fsbno == NULLFSBLOCK && nullfb &&
+	    args.minlen > ap->minlen) {
+		args.minlen = ap->minlen;
+		args.type = XFS_ALLOCTYPE_START_BNO;
+		args.fsbno = ap->rval;
+		if ((error = xfs_alloc_vextent(&args)))
+			return error;
+	}
+	if (args.fsbno == NULLFSBLOCK && nullfb) {
+		args.fsbno = 0;
+		args.type = XFS_ALLOCTYPE_FIRST_AG;
+		args.total = ap->minlen;
+		args.minleft = 0;
+		if ((error = xfs_alloc_vextent(&args)))
+			return error;
+		ap->low = 1;
+	}
+	if (args.fsbno != NULLFSBLOCK) {
+		ap->firstblock = ap->rval = args.fsbno;
+		ASSERT(nullfb || fb_agno == args.agno ||
+		       (ap->low && fb_agno < args.agno));
+		ap->alen = args.len;
+		ap->ip->i_d.di_nblocks += args.len;
+		xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
+		if (ap->wasdel)
+			ap->ip->i_delayed_blks -= args.len;
+		/*
+		 * Adjust the disk quota also. This was reserved
+		 * earlier.
+		 */
+		if (XFS_IS_QUOTA_ON(mp) &&
+		    ap->ip->i_ino != mp->m_sb.sb_uquotino &&
+		    ap->ip->i_ino != mp->m_sb.sb_gquotino) {
+			XFS_TRANS_MOD_DQUOT_BYINO(mp, ap->tp, ap->ip,
+					ap->wasdel ?
+					XFS_TRANS_DQ_DELBCOUNT :
+					XFS_TRANS_DQ_BCOUNT,
+					(long)args.len);
+		}
+	} else {
+		ap->rval = NULLFSBLOCK;
+		ap->alen = 0;
+	}
+	return 0;
+}
+
+/*
  * Transform a btree format file with only one leaf node, where the
  * extents list will fit in the inode, into an extents format file.
  * Since the file extents are already in-core, all we have to do is
Index: 2.6.x-xfs-new/fs/xfs/xfs_clnt.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_clnt.h	2007-05-10 17:22:43.494753782 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_clnt.h	2007-05-10 17:24:13.011020884 +1000
@@ -99,5 +99,7 @@ struct xfs_mount_args {
  */
 #define XFSMNT2_COMPAT_IOSIZE	0x00000001	/* don't report large preferred
 						 * I/O size in stat(2) */
+#define XFSMNT2_FILESTREAMS	0x00000002	/* enable the filestreams
+						 * allocator */

 #endif	/* __XFS_CLNT_H__ */
Index: 2.6.x-xfs-new/fs/xfs/xfs_dinode.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_dinode.h	2007-05-10 17:22:43.494753782 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_dinode.h	2007-05-10 17:24:13.015020360 +1000
@@ -257,6 +257,7 @@ typedef enum xfs_dinode_fmt
 #define XFS_DIFLAG_EXTSIZE_BIT      11	/* inode extent size allocator hint */
 #define XFS_DIFLAG_EXTSZINHERIT_BIT 12	/* inherit inode extent size */
 #define XFS_DIFLAG_NODEFRAG_BIT     13	/* do not reorganize/defragment */
+#define XFS_DIFLAG_FILESTREAM_BIT   14	/* use filestream allocator */
 #define XFS_DIFLAG_REALTIME      (1 << XFS_DIFLAG_REALTIME_BIT)
 #define XFS_DIFLAG_PREALLOC      (1 << XFS_DIFLAG_PREALLOC_BIT)
 #define XFS_DIFLAG_NEWRTBM       (1 << XFS_DIFLAG_NEWRTBM_BIT)
@@ -271,12 +272,13 @@ typedef enum xfs_dinode_fmt
 #define XFS_DIFLAG_EXTSIZE       (1 << XFS_DIFLAG_EXTSIZE_BIT)
 #define XFS_DIFLAG_EXTSZINHERIT  (1 << XFS_DIFLAG_EXTSZINHERIT_BIT)
 #define XFS_DIFLAG_NODEFRAG      (1 << XFS_DIFLAG_NODEFRAG_BIT)
+#define XFS_DIFLAG_FILESTREAM    (1 << XFS_DIFLAG_FILESTREAM_BIT)

 #define XFS_DIFLAG_ANY \
 	(XFS_DIFLAG_REALTIME | XFS_DIFLAG_PREALLOC | XFS_DIFLAG_NEWRTBM | \
 	 XFS_DIFLAG_IMMUTABLE | XFS_DIFLAG_APPEND | XFS_DIFLAG_SYNC | \
 	 XFS_DIFLAG_NOATIME | XFS_DIFLAG_NODUMP | XFS_DIFLAG_RTINHERIT | \
 	 XFS_DIFLAG_PROJINHERIT | XFS_DIFLAG_NOSYMLINKS | XFS_DIFLAG_EXTSIZE | \
-	 XFS_DIFLAG_EXTSZINHERIT | XFS_DIFLAG_NODEFRAG)
+	 XFS_DIFLAG_EXTSZINHERIT | XFS_DIFLAG_NODEFRAG | XFS_DIFLAG_FILESTREAM)

 #endif	/* __XFS_DINODE_H__ */
Index: 2.6.x-xfs-new/fs/xfs/xfs_filestream.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ 2.6.x-xfs-new/fs/xfs/xfs_filestream.c	2007-05-10 17:24:13.019019836 +1000
@@ -0,0 +1,777 @@
+/*
+ * Copyright (c) 2000-2005 Silicon Graphics, Inc.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+#include "xfs.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_inum.h"
+#include "xfs_dir2.h"
+#include "xfs_dir2_sf.h"
+#include "xfs_attr_sf.h"
+#include "xfs_dinode.h"
+#include "xfs_inode.h"
+#include "xfs_ag.h"
+#include "xfs_dmapi.h"
+#include "xfs_log.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_bmap.h"
+#include "xfs_alloc.h"
+#include "xfs_utils.h"
+#include "xfs_mru_cache.h"
+#include "xfs_filestream.h"
+
+#ifdef DEBUG_FILESTREAMS
+#define dprint(fmt, args...)	do {					\
+	printk(KERN_DEBUG "%4d %s: " fmt "\n",				\
+	       current_pid(), __FUNCTION__, ##args);			\
+} while(0)
+#else
+#define dprint(args...)	do {} while (0)
+#endif
+
+static kmem_zone_t *item_zone;
+
+/*
+ * Per-mount point data structure to maintain its active filestreams. Currently
+ * only contains a single pointer, but set up and allocated as a structure to
+ * ease future expansion, if any.
+ */
+typedef struct fstrm_mnt_data
+{
+	struct xfs_mru_cache *fstrm_items;
+} fstrm_mnt_data_t;
+
+/*
+ * Structure for associating a file or a directory with an allocation group.
+ * The parent directory pointer is only needed for files, but since there will
+ * generally be vastly more files than directories in the cache, using the same
+ * data structure simplifies the code with very little memory overhead.
+ */
+typedef struct fstrm_item
+{
+	xfs_agnumber_t	ag;	/* AG currently in use for the file/directory. */
+	xfs_inode_t	*ip;	/* inode self-pointer. */
+	xfs_inode_t	*pip;	/* Parent directory inode pointer. */
+} fstrm_item_t;
+
+/*
+ * Allocation group filestream associations are tracked with per-ag atomic
+ * counters. These counters allow _xfs_filestream_pick_ag() to tell whether a
+ * particular AG already has active filestreams associated with it. The mount
+ * point's m_peraglock is used to protect these counters from per-ag array
+ * re-allocation during a growfs operation. When xfs_growfs_data_private() is
+ * about to reallocate the array, it calls xfs_filestream_flush() with the
+ * m_peraglock held in write mode.
+ *
+ * Since xfs_mru_cache_flush() guarantees that all the free functions for all
+ * the cache elements have finished executing before it returns, it's safe for
+ * the free functions to use the atomic counters without m_peraglock protection.
+ * This allows the implementation of xfs_fstrm_free_func() to be agnostic about
+ * whether it was called with the m_peraglock held in read mode, write mode or
+ * not held at all. The race condition this addresses is the following:
+ *
+ * - The work queue scheduler fires and pulls a filestream directory cache
+ *   element off the LRU end of the cache for deletion, then gets pre-empted.
+ * - A growfs operation grabs the m_peraglock in write mode, flushes all the
+ *   remaining items from the cache and reallocates the mount point's per-ag
+ *   array, resetting all the counters to zero.
+ * - The work queue thread resumes and calls the free function for the element
+ *   it started cleaning up earlier. In the process it decrements the
+ *   filestreams counter for an AG that now has no references.
+ *
+ * With a shrinkfs feature, the above scenario could panic the system.
+ *
+ * All other uses of the following macros should be protected by either the
+ * m_peraglock held in read mode, or the cache's internal locking exposed by the
+ * interval between a call to xfs_mru_cache_lookup() and a call to
+ * xfs_mru_cache_done(). In addition, the m_peraglock must be held in read mode
+ * when new elements are added to the cache.
+ *
+ * Combined, these locking rules ensure that no associations will ever exist in
+ * the cache that reference per-ag array elements that have since been
+ * reallocated.
+ */
+#define GET_AG_REF(mp, ag)	atomic_read(&(mp)->m_perag[ag].pagf_fstrms)
+#define INC_AG_REF(mp, ag)	atomic_inc_return(&(mp)->m_perag[ag].pagf_fstrms)
+#define DEC_AG_REF(mp, ag)	atomic_dec_return(&(mp)->m_perag[ag].pagf_fstrms)
+
+#define XFS_PICK_USERDATA	1
+#define XFS_PICK_LOWSPACE	2
+
+/*
+ * Scan the AGs starting at startag looking for an AG that isn't in use and has
+ * at least minlen blocks free.
+ */
+static int
+_xfs_filestream_pick_ag(
+	xfs_mount_t	*mp,
+	xfs_agnumber_t	startag,
+	xfs_agnumber_t	*agp,
+	int		flags,
+	xfs_extlen_t	minlen)
+{
+	int		err, trylock, nscan;
+	xfs_extlen_t	delta, longest, need, free, minfree, maxfree = 0;
+	xfs_agnumber_t	ag, max_ag = NULLAGNUMBER;
+	struct xfs_perag *pag;
+
+	/* 2% of an AG's blocks must be free for it to be chosen. */
+	minfree = mp->m_sb.sb_agblocks / 50;
+
+	ag = startag;
+	*agp = NULLAGNUMBER;
+
+	/* For the first pass, don't sleep trying to init the per-AG. */
+	trylock = XFS_ALLOC_FLAG_TRYLOCK;
+
+	for (nscan = 0; 1; nscan++) {
+
+		//dprint("scanning AG %d[%d]", ag, GET_AG_REF(mp, ag));
+
+		pag = mp->m_perag + ag;
+
+		if (!pag->pagf_init &&
+		    (err = xfs_alloc_pagf_init(mp, NULL, ag, trylock)) &&
+		    !trylock) {
+			dprint("xfs_alloc_pagf_init returned %d", err);
+			return err;
+		}
+
+		/* Might fail sometimes during the 1st pass with trylock set. */
+		if (!pag->pagf_init) {
+			dprint("!pagf_init");
+			goto next_ag;
+		}
+
+		/* Keep track of the AG with the most free blocks. */
+		if (pag->pagf_freeblks > maxfree) {
+			maxfree = pag->pagf_freeblks;
+			max_ag = ag;
+		}
+
+		/*
+		 * The AG reference count does two things: it enforces mutual
+		 * exclusion when examining the suitability of an AG in this
+		 * loop, and it guards against two filestreams being established
+		 * in the same AG as each other.
+		 */
+		if (INC_AG_REF(mp, ag) > 1) {
+			DEC_AG_REF(mp, ag);
+			goto next_ag;
+		}
+
+		need = XFS_MIN_FREELIST_PAG(pag, mp);
+		delta = need > pag->pagf_flcount ? need - pag->pagf_flcount : 0;
+		longest = (pag->pagf_longest > delta) ?
+			  (pag->pagf_longest - delta) :
+			  (pag->pagf_flcount > 0 || pag->pagf_longest > 0);
+
+		if (((minlen && longest >= minlen) ||
+		     (!minlen && pag->pagf_freeblks >= minfree)) &&
+		    (!pag->pagf_metadata || !(flags & XFS_PICK_USERDATA) ||
+		     (flags & XFS_PICK_LOWSPACE))) {
+
+			/* Break out, retaining the reference on the AG. */
+			free = pag->pagf_freeblks;
+			*agp = ag;
+			break;
+		}
+
+		/* Drop the reference on this AG, it's not usable. */
+		DEC_AG_REF(mp, ag);
+next_ag:
+		/* Move to the next AG, wrapping to AG 0 if necessary. */
+		if (++ag >= mp->m_sb.sb_agcount)
+			ag = 0;
+
+		/* If a full pass of the AGs hasn't been done yet, continue. */
+		if (ag != startag)
+			continue;
+
+		/* Allow sleeping in xfs_alloc_pagf_init() on the 2nd pass. */
+		if (trylock != 0) {
+			trylock = 0;
+			continue;
+		}
+
+		/* Finally, if lowspace wasn't set, set it for the 3rd pass. */
+		if (!(flags & XFS_PICK_LOWSPACE)) {
+			flags |= XFS_PICK_LOWSPACE;
+			continue;
+		}
+
+		/*
+		 * Take the AG with the most free space, regardless of whether
+		 * it's already in use by another filestream.
+		 */
+		if (max_ag != NULLAGNUMBER) {
+			INC_AG_REF(mp, max_ag);
+			dprint("using max_ag %d[1] with maxfree %d", max_ag,
+				maxfree);
+
+			free = maxfree;
+			*agp = max_ag;
+			break;
+		}
+
+		dprint("giving up, returning AG 0");
+		*agp = 0;
+		return 0;
+	}
+
+	/*
+	dprint("mp %p startag %d newag %d[%d] free %d minlen %d minfree %d "
+		"scanned %d trylock %d flags 0x%x", mp, startag, *agp,
+		GET_AG_REF(mp, *agp), free, minlen, minfree, nscan, trylock,
+		flags);
+	*/
+
+	return 0;
+}
+
+/*
+ * Set the allocation group number for a file or a directory, updating inode
+ * references and per-AG references as appropriate. Must be called with the
+ * m_peraglock held in read mode.
+ */
+static int
+_xfs_filestream_set_ag(
+	xfs_inode_t	*ip,
+	xfs_inode_t	*pip,
+	xfs_agnumber_t	ag)
+{
+	int		err = 0;
+	xfs_mount_t	*mp;
+	xfs_mru_cache_t	*cache;
+	fstrm_item_t	*item;
+	xfs_agnumber_t	old_ag;
+	xfs_inode_t	*old_pip;
+
+	/*
+	 * Either ip is a regular file and pip is a directory, or ip is a
+	 * directory and pip is NULL.
+	 */
+	ASSERT(ip && (((ip->i_d.di_mode & S_IFREG) && pip &&
+		       (pip->i_d.di_mode & S_IFDIR)) ||
+		      ((ip->i_d.di_mode & S_IFDIR) && !pip)));
+
+	mp = ip->i_mount;
+	cache = mp->m_filestream->fstrm_items;
+
+	if ((item = (fstrm_item_t*)xfs_mru_cache_lookup(cache, ip->i_ino))) {
+		ASSERT(item->ip == ip);
+		old_ag = item->ag;
+		item->ag = ag;
+		old_pip = item->pip;
+		item->pip = pip;
+		xfs_mru_cache_done(cache);
+
+		/*
+		 * If the AG has changed, drop the old ref and take a new one,
+		 * effectively transferring the reference from old to new AG.
+		 */
+		if (ag != old_ag) {
+			DEC_AG_REF(mp, old_ag);
+			INC_AG_REF(mp, ag);
+		}
+
+		/*
+		 * If ip is a file and its pip has changed, drop the old ref and
+		 * take a new one.
+		 */
+		if (pip && pip != old_pip) {
+			IRELE(old_pip);
+			IHOLD(pip);
+		}
+
+		if (ag != old_ag)
+			dprint("found ip %p ino %lld, AG %d[%d] -> %d[%d]", ip,
+				ip->i_ino, old_ag, GET_AG_REF(mp, old_ag), ag,
+				GET_AG_REF(mp, ag));
+		else
+			dprint("found ip %p ino %lld, AG %d[%d]", ip, ip->i_ino,
+				ag, GET_AG_REF(mp, ag));
+
+		return 0;
+	}
+
+	if (!(item = (fstrm_item_t*)kmem_zone_zalloc(item_zone, KM_SLEEP)))
+		return ENOMEM;
+
+	item->ag = ag;
+	item->ip = ip;
+	item->pip = pip;
+
+	if ((err = xfs_mru_cache_insert(cache, ip->i_ino, item))) {
+		kmem_zone_free(item_zone, item);
+		return err;
+	}
+
+	/* Take a reference on the AG. */
+	INC_AG_REF(mp, ag);
+
+	/*
+	 * Take a reference on the inode itself regardless of whether it's a
+	 * regular file or a directory.
+	 */
+	IHOLD(ip);
+
+	/*
+	 * In the case of a regular file, take a reference on the parent inode
+	 * as well to ensure it remains in-core.
+	 */
+	if (pip)
+		IHOLD(pip);
+
+	dprint("put ip %p ino %lld into AG %d[%d]", ip, ip->i_ino, ag,
+		GET_AG_REF(mp, ag));
+
+	return 0;
+}
+
+/* xfs_fstrm_free_func(): callback for freeing cached stream items. */
+void
+xfs_fstrm_free_func(
+	xfs_ino_t	ino,
+	fstrm_item_t	*item)
+{
+	xfs_inode_t	*ip = item->ip;
+	int		ref;
+
+	ASSERT(ip->i_ino == ino);
+
+	/* Drop the reference taken on the AG when the item was added. */
+	ref = DEC_AG_REF(ip->i_mount, item->ag);
+
+	ASSERT(ref >= 0);
+
+	/*
+	 * _xfs_filestream_set_ag() always takes a reference on the inode
+	 * itself, whether it's a file or a directory. Release it here.
+	 */
+	IRELE(ip);
+
+	/*
+	 * In the case of a regular file, _xfs_filestream_set_ag() also takes a
+	 * ref on the parent inode to keep it in-core. Release that too.
+	 */
+	if (item->pip)
+		IRELE(item->pip);
+
+	if (ip->i_d.di_mode & S_IFDIR)
+		dprint("deleting dip %p ino %lld, AG %d[%d]", ip, ip->i_ino,
+			item->ag, GET_AG_REF(ip->i_mount, item->ag));
+	else
+		dprint("deleting file %p ino %lld, pip %p ino %lld, AG %d[%d]",
+			ip, ip->i_ino, item->pip,
+			item->pip ? item->pip->i_ino : 0, item->ag,
+			GET_AG_REF(ip->i_mount, item->ag));
+
+	/* Finally, free the memory allocated for the item. */
+	kmem_zone_free(item_zone, item);
+}
+
+/*
+ * xfs_filestream_init() is called at xfs initialisation time to set up the
+ * memory zone that will be used for filestream data structure allocation.
+ */
+void
+xfs_filestream_init(void)
+{
+	item_zone = kmem_zone_init(sizeof(fstrm_item_t), "fstrm_item");
+	ASSERT(item_zone);
+}
+
+/*
+ * xfs_filestream_uninit() is called at xfs termination time to destroy the
+ * memory zone that was used for filestream data structure allocation.
+ */
+void
+xfs_filestream_uninit(void)
+{
+	if (item_zone) {
+		kmem_zone_destroy(item_zone);
+		item_zone = NULL;
+	}
+}
+
+/*
+ * xfs_filestream_mount() is called when a file system is mounted with the
+ * filestream option. It is responsible for allocating the data structures
+ * needed to track the new file system's file streams.
+ */
+int
+xfs_filestream_mount(
+	xfs_mount_t	*mp)
+{
+	int		err = 0;
+	unsigned int	lifetime, grp_count;
+	fstrm_mnt_data_t *md;
+
+	if (!(md = (fstrm_mnt_data_t*)kmem_zalloc(sizeof(*md), KM_SLEEP)))
+		return ENOMEM;
+
+	/*
+	 * The filestream timer tunable is currently fixed within the range of
+	 * one second to four minutes, with five seconds being the default. The
+	 * group count is somewhat arbitrary, but it'd be nice to adhere to the
+	 * timer tunable to within about 10 percent. This requires at least 10
+	 * groups.
+	 */
+	lifetime = xfs_fstrm_centisecs * 10;
+	grp_count = 10;
+
+	if ((err = xfs_mru_cache_create(&md->fstrm_items, lifetime, grp_count,
+			(xfs_mru_cache_free_func_t)xfs_fstrm_free_func))) {
+		kmem_free(md, sizeof(*md));
+		return err;
+	}
+
+	mp->m_filestream = md;
+
+	dprint("created fstrm_items %p for mount %p", md->fstrm_items, mp);
+
+	return 0;
+}
+
+/*
+ * xfs_filestream_unmount() is called when a file system that was mounted with
+ * the filestream option is unmounted. It drains the data structures created
+ * to track the file system's file streams and frees all the memory that was
+ * allocated.
+ */
+void
+xfs_filestream_unmount(
+	xfs_mount_t	*mp)
+{
+	xfs_mru_cache_destroy(mp->m_filestream->fstrm_items);
+	kmem_free(mp->m_filestream, sizeof(*mp->m_filestream));
+}
+
+/*
+ * If the mount point's m_perag array is going to be reallocated, all
+ * outstanding cache entries must be flushed to avoid accessing reference count
+ * addresses that have been freed. The call to xfs_filestream_flush() must be
+ * made inside the block that holds the m_peraglock in write mode to do the
+ * reallocation.
+ */
+void
+xfs_filestream_flush(
+	xfs_mount_t	*mp)
+{
+	/* point in time flush, so keep the reaper running */
+	xfs_mru_cache_flush(mp->m_filestream->fstrm_items, 1);
+}
+
+/*
+ * Return the AG of the filestream the file or directory belongs to, or
+ * NULLAGNUMBER otherwise.
+ */
+xfs_agnumber_t
+xfs_filestream_get_ag(
+	xfs_inode_t	*ip)
+{
+	xfs_mru_cache_t	*cache;
+	fstrm_item_t	*item;
+	xfs_agnumber_t	ag;
+	int		ref;
+
+	ASSERT(ip->i_d.di_mode & (S_IFREG | S_IFDIR));
+	if (!(ip->i_d.di_mode & (S_IFREG | S_IFDIR)))
+		return NULLAGNUMBER;
+
+	cache = ip->i_mount->m_filestream->fstrm_items;
+	if (!(item = (fstrm_item_t*)xfs_mru_cache_lookup(cache, ip->i_ino))) {
+		dprint("lookup on %s ip %p ino %lld failed, returning %d",
+			ip->i_d.di_mode & S_IFREG ? "file" : "dir", ip,
+			ip->i_ino, NULLAGNUMBER);
+		return NULLAGNUMBER;
+	}
+
+	ASSERT(ip == item->ip);
+	ag = item->ag;
+	ref = GET_AG_REF(ip->i_mount, ag);
+	xfs_mru_cache_done(cache);
+
+	if (ip->i_d.di_mode & S_IFREG)
+		dprint("lookup on file ip %p ino %lld dir %p dino %lld got AG "
+			"%d[%d]", ip, ip->i_ino, item->pip, item->pip->i_ino, ag,
+			ref);
+	else
+		dprint("lookup on dir ip %p ino %lld got AG %d[%d]", ip,
+			ip->i_ino, ag, ref);
+
+	return ag;
+}
+
+/*
+ * xfs_filestream_associate() should only be called to associate a regular file
+ * with its parent directory. Calling it with a child directory isn't
+ * appropriate because filestreams don't apply to entire directory hierarchies.
+ * Creating a file in a child directory of an existing filestream directory
+ * starts a new filestream with its own allocation group association.
+ */
+int
+xfs_filestream_associate(
+	xfs_inode_t	*pip,
+	xfs_inode_t	*ip)
+{
+	xfs_mount_t	*mp;
+	xfs_mru_cache_t	*cache;
+	fstrm_item_t	*item;
+	xfs_agnumber_t	ag, rotorstep, startag;
+	int		err = 0;
+
+	ASSERT(pip->i_d.di_mode & S_IFDIR);
+	ASSERT(ip->i_d.di_mode & S_IFREG);
+	if (!(pip->i_d.di_mode & S_IFDIR) || !(ip->i_d.di_mode & S_IFREG))
+		return EINVAL;
+
+	mp = pip->i_mount;
+	cache = mp->m_filestream->fstrm_items;
+	down_read(&mp->m_peraglock);
+	xfs_ilock(pip, XFS_IOLOCK_EXCL);
+
+	/* If the parent directory is already in the cache, use its AG. */
+	if ((item = (fstrm_item_t*)xfs_mru_cache_lookup(cache, pip->i_ino))) {
+		ASSERT(item->ip == pip);
+		ag = item->ag;
+		xfs_mru_cache_done(cache);
+
+		dprint("got cached dir %p ino %lld with AG %d[%d]", pip,
+			pip->i_ino, ag, GET_AG_REF(mp, ag));
+
+		if ((err = _xfs_filestream_set_ag(ip, pip, ag)))
+			dprint("_xfs_filestream_set_ag(%p, %p, %d) -> err %d",
+				ip, pip, ag, err);
+
+		goto exit;
+	}
+
+	/*
+	 * Set the starting AG using the rotor for inode32, otherwise
+	 * use the directory inode's AG.
+	 */
+	if (mp->m_flags & XFS_MOUNT_32BITINODES) {
+		rotorstep = xfs_rotorstep;
+		startag = (mp->m_agfrotor / rotorstep) % mp->m_sb.sb_agcount;
+		mp->m_agfrotor = (mp->m_agfrotor + 1) %
+				 (mp->m_sb.sb_agcount * rotorstep);
+	} else
+		startag = XFS_INO_TO_AGNO(mp, pip->i_ino);
+
+	/* Pick a new AG for the parent inode starting at startag. */
+	if ((err = _xfs_filestream_pick_ag(mp, startag, &ag, 0, 0)) ||
+	    ag == NULLAGNUMBER)
+		goto exit_did_pick;
+
+	/* Associate the parent inode with the AG. */
+	if ((err = _xfs_filestream_set_ag(pip, NULL, ag))) {
+		dprint("_xfs_filestream_set_ag(%p (%lld), NULL, %d) -> err %d",
+			pip, pip->i_ino, ag, err);
+		goto exit_did_pick;
+	}
+
+	/* Associate the file inode with the AG. */
+	if ((err = _xfs_filestream_set_ag(ip, pip, ag))) {
+		dprint("_xfs_filestream_set_ag(%p (%lld), %p (%lld), %d) -> "
+			"err %d", ip, ip->i_ino, pip, pip->i_ino, ag, err);
+		goto exit_did_pick;
+	}
+
+	dprint("pip %p ino %lld and ip %p ino %lld given ag %d[%d]",
+		pip, pip->i_ino, ip, ip->i_ino, ag, GET_AG_REF(mp, ag));
+
+exit_did_pick:
+	/*
+	 * If _xfs_filestream_pick_ag() returned a valid AG, remove the
+	 * reference it took on it, since the file and directory will have taken
+	 * their own now if they were successfully cached.
+	 */
+	if (ag != NULLAGNUMBER)
+		DEC_AG_REF(mp, ag);
+	else
+		dprint("_pick_ag() returned invalid AG %d, no stream set", ag);
+
+exit:
+	xfs_iunlock(pip, XFS_IOLOCK_EXCL);
+	up_read(&mp->m_peraglock);
+	return err;
+}
+
+/*
+ * Pick a new allocation group for the current file and its file stream. This
+ * function is called by xfs_bmap_filestreams() with the mount point's per-ag
+ * lock held.
+ */
+int
+xfs_filestream_new_ag(
+	xfs_bmalloca_t	*ap,
+	xfs_agnumber_t	*agp)
+{
+	int		flags, err;
+	xfs_inode_t	*ip, *pip = NULL;
+	xfs_mount_t	*mp;
+	xfs_mru_cache_t	*cache;
+	xfs_extlen_t	minlen;
+	fstrm_item_t	*dir, *file;
+	xfs_agnumber_t	ag = NULLAGNUMBER;
+
+	ip = ap->ip;
+	mp = ip->i_mount;
+	cache = mp->m_filestream->fstrm_items;
+	minlen = ap->alen;
+	*agp = NULLAGNUMBER;
+
+	/*
+	 * Look for the file in the cache, removing it if it's found. Doing
+	 * this allows it to be held across the dir lookup that follows.
+	 */
+	if ((file = (fstrm_item_t*)xfs_mru_cache_remove(cache, ip->i_ino))) {
+		ASSERT(ip == file->ip);
+
+		/* Save the file's parent inode and old AG number for later. */
+		pip = file->pip;
+		ag = file->ag;
+
+		/* Look for the file's directory in the cache. */
+		dir = (fstrm_item_t*)xfs_mru_cache_lookup(cache, pip->i_ino);
+		if (dir) {
+			ASSERT(pip == dir->ip);
+
+			/*
+			 * If the directory has already moved on to a new AG,
+			 * use that AG as the new AG for the file. Don't
+			 * forget to twiddle the AG refcounts to match the
+			 * movement.
+			 */
+			if (dir->ag != file->ag) {
+				DEC_AG_REF(mp, file->ag);
+				INC_AG_REF(mp, dir->ag);
+				*agp = file->ag = dir->ag;
+			}
+
+			xfs_mru_cache_done(cache);
+		}
+
+		/*
+		 * Put the file back in the cache. If this fails, the free
+		 * function needs to be called to tidy up in the same way as if
+		 * the item had simply expired from the cache.
+		 */
+		if ((err = xfs_mru_cache_insert(cache, ip->i_ino, file))) {
+			xfs_fstrm_free_func(ip->i_ino, file);
+			return err;
+		}
+
+		/*
+		 * If the file's AG was moved to the directory's new AG, there's
+		 * nothing more to be done.
+		 */
+		if (*agp != NULLAGNUMBER) {
+			dprint("dir %p ino %lld for file %p ino %lld has "
+				"already moved %d[%d] -> %d[%d]", pip,
+				pip->i_ino, ip, ip->i_ino, ag,
+				GET_AG_REF(mp, ag), *agp, GET_AG_REF(mp, *agp));
+			return 0;
+		}
+	}
+
+	/*
+	 * If the file's parent directory is known, take its iolock in exclusive
+	 * mode to prevent two sibling files from racing each other to migrate
+	 * themselves and their parent to different AGs.
+	 */
+	if (pip)
+		xfs_ilock(pip, XFS_IOLOCK_EXCL);
+
+	/*
+	 * A new AG needs to be found for the file. If the file's parent
+	 * directory is also known, it will be moved to the new AG as well to
+	 * ensure that files created inside it in future use the new AG.
+	 */
+	ag = (ag == NULLAGNUMBER) ? 0 : (ag + 1) % mp->m_sb.sb_agcount;
+	flags = (ap->userdata ? XFS_PICK_USERDATA : 0) |
+		(ap->low ? XFS_PICK_LOWSPACE : 0);
+
+	if ((err = _xfs_filestream_pick_ag(mp, ag, agp, flags, minlen)) ||
+	    *agp == NULLAGNUMBER)
+		goto exit;
+
+	/*
+	 * If the file wasn't found in the file cache, then its parent directory
+	 * inode isn't known. For this to have happened, the file must either
+	 * be pre-existing, or it was created long enough ago that its cache
+	 * entry has expired. This isn't the sort of usage that the filestreams
+	 * allocator is trying to optimise, so there's no point trying to track
+	 * its new AG somehow in the filestream data structures.
+	 */
+	if (!pip) {
+		dprint("gave ag %d to orphan ip %p ino %lld", *agp, ip,
+			ip->i_ino);
+		goto exit;
+	}
+
+	/* Associate the parent inode with the AG. */
+	if ((err = _xfs_filestream_set_ag(pip, NULL, *agp))) {
+		dprint("_xfs_filestream_set_ag(%p (%lld), NULL, %d) -> err %d",
+			pip, pip->i_ino, *agp, err);
+		goto exit;
+	}
+
+	/* Associate the file inode with the AG. */
+	if ((err = _xfs_filestream_set_ag(ip, pip, *agp))) {
+		dprint("_xfs_filestream_set_ag(%p (%lld), %p (%lld), %d) -> "
+			"err %d", ip, ip->i_ino, pip, pip->i_ino, *agp, err);
+		goto exit;
+	}
+
+	dprint("pip %p ino %lld and ip %p ino %lld moved to new ag %d[%d]",
+		pip, pip->i_ino, ip, ip->i_ino, *agp, GET_AG_REF(mp, *agp));
+
+exit:
+	/*
+	 * If _xfs_filestream_pick_ag() returned a valid AG, remove the
+	 * reference it took on it, since the file and directory will have taken
+	 * their own now if they were successfully cached.
+	 */
+	if (*agp != NULLAGNUMBER)
+		DEC_AG_REF(mp, *agp);
+	else {
+		dprint("_pick_ag() returned invalid AG %d, using AG 0", *agp);
+		*agp = 0;
+	}
+
+	if (pip)
+		xfs_iunlock(pip, XFS_IOLOCK_EXCL);
+
+	return err;
+}
+
+/*
+ * Remove an association between an inode and a filestream object.
+ * Typically this is done on last close of an unlinked file.
+ */
+void
+xfs_filestream_deassociate(
+	xfs_inode_t	*ip)
+{
+	xfs_mru_cache_t	*cache = ip->i_mount->m_filestream->fstrm_items;
+
+	xfs_mru_cache_delete(cache, ip->i_ino);
+}
Index: 2.6.x-xfs-new/fs/xfs/xfs_filestream.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ 2.6.x-xfs-new/fs/xfs/xfs_filestream.h	2007-05-10 17:24:13.107008304 +1000
@@ -0,0 +1,59 @@
+/*
+ * Copyright (c) 2000-2002,2005 Silicon Graphics, Inc.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+#ifndef __XFS_FILESTREAM_H__
+#define __XFS_FILESTREAM_H__
+
+#ifdef __KERNEL__
+
+struct xfs_mount;
+struct xfs_inode;
+struct xfs_perag;
+struct xfs_bmalloca;
+
+void
+xfs_filestream_init(void);
+
+void
+xfs_filestream_uninit(void);
+
+int
+xfs_filestream_mount(struct xfs_mount *mp);
+
+void
+xfs_filestream_unmount(struct xfs_mount *mp);
+
+void
+xfs_filestream_flush(struct xfs_mount *mp);
+
+xfs_agnumber_t
+xfs_filestream_get_ag(struct xfs_inode *ip);
+
+int
+xfs_filestream_associate(struct xfs_inode *dip,
+			 struct xfs_inode *ip);
+
+void
+xfs_filestream_deassociate(struct xfs_inode *ip);
+
+int
+xfs_filestream_new_ag(struct xfs_bmalloca *ap,
+		      xfs_agnumber_t *agp);
+
+#endif /* __KERNEL__ */
+
+#endif /* __XFS_FILESTREAM_H__ */
Index: 2.6.x-xfs-new/fs/xfs/xfs_fs.h
=================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_fs.h 2007-05-10 17:22:43.506752209 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_fs.h 2007-05-10 17:24:13.123006207 +1000 @@ -66,6 +66,7 @@ struct fsxattr { #define XFS_XFLAG_EXTSIZE 0x00000800 /* extent size allocator hint */ #define XFS_XFLAG_EXTSZINHERIT 0x00001000 /* inherit inode extent size */ #define XFS_XFLAG_NODEFRAG 0x00002000 /* do not defragment */ +#define XFS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator */ #define XFS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */ /* Index: 2.6.x-xfs-new/fs/xfs/xfs_fsops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_fsops.c 2007-05-10 17:22:43.506752209 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_fsops.c 2007-05-10 17:24:13.131005159 +1000 @@ -44,6 +44,7 @@ #include "xfs_trans_space.h" #include "xfs_rtalloc.h" #include "xfs_rw.h" +#include "xfs_filestream.h" /* * File system operations @@ -163,6 +164,7 @@ xfs_growfs_data_private( new = nb - mp->m_sb.sb_dblocks; oagcount = mp->m_sb.sb_agcount; if (nagcount > oagcount) { + xfs_filestream_flush(mp); down_write(&mp->m_peraglock); mp->m_perag = kmem_realloc(mp->m_perag, sizeof(xfs_perag_t) * nagcount, Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c 2007-05-10 17:22:43.506752209 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c 2007-05-10 17:24:13.143003586 +1000 @@ -48,6 +48,7 @@ #include "xfs_dir2_trace.h" #include "xfs_quota.h" #include "xfs_acl.h" +#include "xfs_filestream.h" kmem_zone_t *xfs_ifork_zone; @@ -817,6 +818,8 @@ _xfs_dic2xflags( flags |= XFS_XFLAG_EXTSZINHERIT; if (di_flags & XFS_DIFLAG_NODEFRAG) flags |= XFS_XFLAG_NODEFRAG; + if (di_flags & XFS_DIFLAG_FILESTREAM) + flags |= XFS_XFLAG_FILESTREAM; } return flags; @@ -1099,7 +1102,7 @@ xfs_ialloc( * Call the space management code to pick * the on-disk inode to be 
allocated. */ - error = xfs_dialloc(tp, pip->i_ino, mode, okalloc, + error = xfs_dialloc(tp, pip ? pip->i_ino : 0, mode, okalloc, ialloc_context, call_again, &ino); if (error != 0) { return error; @@ -1153,7 +1156,7 @@ xfs_ialloc( if ( (prid != 0) && (ip->i_d.di_version == XFS_DINODE_VERSION_1)) xfs_bump_ino_vers2(tp, ip); - if (XFS_INHERIT_GID(pip, vp->v_vfsp)) { + if (pip && XFS_INHERIT_GID(pip, vp->v_vfsp)) { ip->i_d.di_gid = pip->i_d.di_gid; if ((pip->i_d.di_mode & S_ISGID) && (mode & S_IFMT) == S_IFDIR) { ip->i_d.di_mode |= S_ISGID; @@ -1195,8 +1198,14 @@ xfs_ialloc( flags |= XFS_ILOG_DEV; break; case S_IFREG: + if (unlikely(pip && + ((ip->i_mount->m_flags & XFS_MOUNT_FILESTREAMS) || + (pip->i_d.di_flags & XFS_DIFLAG_FILESTREAM)) && + (error = xfs_filestream_associate(pip, ip)))) + return error; + /* fall through */ case S_IFDIR: - if (unlikely(pip->i_d.di_flags & XFS_DIFLAG_ANY)) { + if (unlikely(pip && (pip->i_d.di_flags & XFS_DIFLAG_ANY))) { uint di_flags = 0; if ((mode & S_IFMT) == S_IFDIR) { @@ -1233,6 +1242,8 @@ xfs_ialloc( if ((pip->i_d.di_flags & XFS_DIFLAG_NODEFRAG) && xfs_inherit_nodefrag) di_flags |= XFS_DIFLAG_NODEFRAG; + if (pip->i_d.di_flags & XFS_DIFLAG_FILESTREAM) + di_flags |= XFS_DIFLAG_FILESTREAM; ip->i_d.di_flags |= di_flags; } /* FALLTHROUGH */ Index: 2.6.x-xfs-new/fs/xfs/xfs_mount.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_mount.h 2007-05-10 17:22:43.506752209 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_mount.h 2007-05-10 17:24:13.147003062 +1000 @@ -66,6 +66,7 @@ struct xfs_bmbt_irec; struct xfs_bmap_free; struct xfs_extdelta; struct xfs_swapext; +struct xfs_filestream; extern struct bhv_vfsops xfs_vfsops; extern struct bhv_vnodeops xfs_vnodeops; @@ -436,6 +437,7 @@ typedef struct xfs_mount { struct notifier_block m_icsb_notifier; /* hotplug cpu notifier */ struct mutex m_icsb_mutex; /* balancer sync lock */ #endif + struct fstrm_mnt_data *m_filestream; /* per-mount filestream data */ } 
xfs_mount_t; /* @@ -475,6 +477,8 @@ typedef struct xfs_mount { * I/O size in stat() */ #define XFS_MOUNT_NO_PERCPU_SB (1ULL << 23) /* don't use per-cpu superblock counters */ +#define XFS_MOUNT_FILESTREAMS (1ULL << 24) /* enable the filestreams + allocator */ /* Index: 2.6.x-xfs-new/fs/xfs/xfs_mru_cache.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ 2.6.x-xfs-new/fs/xfs/xfs_mru_cache.c 2007-05-10 17:24:13.151002538 +1000 @@ -0,0 +1,607 @@ +/* + * Copyright (c) 2000-2002,2006 Silicon Graphics, Inc. + * All Rights Reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ +//#define DEBUG_MRU_CACHE 1 +#include "xfs.h" +#include "xfs_mru_cache.h" + +/* + * An MRU Cache is a dynamic data structure that stores its elements in a way + * that allows efficient lookups, but also groups them into discrete time + * intervals based on insertion time. This allows elements to be efficiently + * and automatically reaped after a fixed period of inactivity. + */ + +#ifdef DEBUG_MRU_CACHE +#define dprint(fmt, args...) 
do { \ + printk(KERN_DEBUG "%4d %s: " fmt "\n", \ + current_pid(), __FUNCTION__, ##args); \ +} while(0) + +#define DEBUG_DECL_CACHE_FIELDS \ + unsigned int *list_elems; \ + unsigned int reap_elems; \ + unsigned long allocs; \ + unsigned long frees; + +#define DEBUG_INIT_CACHE(mru) \ + ((mru)->list_elems = (unsigned int*) \ + kmem_zalloc((mru)->grp_count * sizeof(*(mru)->list_elems), \ + KM_SLEEP)) + +#define DEBUG_UNINIT_CACHE(mru) \ + kmem_free((mru)->list_elems, \ + (mru)->grp_count * sizeof(*(mru)->list_elems)) + +#define DEBUG_INC_ALLOCS(mru) (mru)->allocs++ +#define DEBUG_INC_FREES(mru) (mru)->frees++ + +STATIC int +_xfs_mru_cache_print(struct xfs_mru_cache *mru, char *buf); + +#define DEBUG_PRINT_STACK_VARS \ + char buf[256]; \ + char *bufp = buf; + +#define DEBUG_PRINT_BEFORE_REAP \ + bufp += _xfs_mru_cache_print(mru, bufp) + +#define DEBUG_PRINT_AFTER_REAP \ + bufp += sprintf(bufp, " -> "); \ + bufp += _xfs_mru_cache_print(mru, bufp); \ + dprint("[%p]: %s", mru, buf) +#else /* !defined DEBUG_MRU_CACHE */ +#define dprint(args...) do {} while (0) +#define DEBUG_DECL_CACHE_FIELDS +#define DEBUG_INIT_CACHE(mru) 1 +#define DEBUG_UNINIT_CACHE(mru) do {} while (0) +#define DEBUG_INC_ALLOCS(mru) do {} while (0) +#define DEBUG_INC_FREES(mru) do {} while (0) +#define DEBUG_PRINT_STACK_VARS +#define DEBUG_PRINT_BEFORE_REAP do {} while (0) +#define DEBUG_PRINT_AFTER_REAP do {} while (0) +#endif /* DEBUG_MRU_CACHE */ + + +/* + * When a client data pointer is stored in the MRU Cache it needs to be added to + * both the data store and to one of the lists. It must also be possible to + * access each of these entries via the other, i.e. to: + * + * a) Walk a list, removing the corresponding data store entry for each item. + * b) Look up a data store entry, then access its list entry directly. + * + * To achieve both of these goals, each entry must contain both a list entry and + * a key, in addition to the user's data pointer. 
Note that it's not a good + * idea to have the client embed one of these structures at the top of their own + * data structure, because inserting the same item more than once would most + * likely result in a loop in one of the lists. That's a sure-fire recipe for + * an infinite loop in the code. + */ +typedef struct xfs_mru_cache_elem +{ + struct list_head list_node; + unsigned long key; + void *value; +} xfs_mru_cache_elem_t; + +static kmem_zone_t *elem_zone; +static struct workqueue_struct *reap_wq; + +/* + * When inserting, destroying or reaping, it's first necessary to update the + * lists relative to a particular time. In the case of destroying, that time + * will be well in the future to ensure that all items are moved to the reap + * list. In all other cases though, the time will be the current time. + * + * This function enters a loop, moving the contents of the LRU list to the reap + * list again and again until either a) the lists are all empty, or b) time zero + * has been advanced sufficiently to be within the immediate element lifetime. + * + * Case a) above is detected by counting how many groups are migrated and + * stopping when they've all been moved. Case b) is detected by monitoring the + * time_zero field, which is updated as each group is migrated. + * + * The return value is the earliest time that more migration could be needed, or + * zero if there's no need to schedule more work because the lists are empty. + */ +STATIC unsigned long +_xfs_mru_cache_migrate( + xfs_mru_cache_t *mru, + unsigned long now) +{ + unsigned int grp; + unsigned int migrated = 0; + struct list_head *lru_list; + + /* Nothing to do if the data store is empty. */ + if (!mru->time_zero) + return 0; + + /* While time zero is older than the time spanned by all the lists. */ + while (mru->time_zero <= now - mru->grp_count * mru->grp_time) { + + /* + * If the LRU list isn't empty, migrate its elements to the tail + * of the reap list. 
+ */ + lru_list = mru->lists + mru->lru_grp; + if (!list_empty(lru_list)) + list_splice_init(lru_list, mru->reap_list.prev); + + /* + * Advance the LRU group number, freeing the old LRU list to + * become the new MRU list; advance time zero accordingly. + */ + mru->lru_grp = (mru->lru_grp + 1) % mru->grp_count; + mru->time_zero += mru->grp_time; + + /* + * If reaping is so far behind that all the elements on all the + * lists have been migrated to the reap list, it's now empty. + */ + if (++migrated == mru->grp_count) { + mru->lru_grp = 0; + mru->time_zero = 0; + return 0; + } + } + + /* Find the first non-empty list from the LRU end. */ + for (grp = 0; grp < mru->grp_count; grp++) { + + /* Check the grp'th list from the LRU end. */ + lru_list = mru->lists + ((mru->lru_grp + grp) % mru->grp_count); + if (!list_empty(lru_list)) + return mru->time_zero + + (mru->grp_count + grp) * mru->grp_time; + } + + /* All the lists must be empty. */ + mru->lru_grp = 0; + mru->time_zero = 0; + return 0; +} + +/* + * When inserting or doing a lookup, an element needs to be inserted into the + * MRU list. The lists must be migrated first to ensure that they're + * up-to-date, otherwise the new element could be given a shorter lifetime in + * the cache than it should. + */ +STATIC void +_xfs_mru_cache_list_insert( + xfs_mru_cache_t *mru, + xfs_mru_cache_elem_t *elem) +{ + unsigned int grp = 0; + unsigned long now = jiffies; + + /* + * If the data store is empty, initialise time zero, leave grp set to + * zero and start the work queue timer if necessary. Otherwise, set grp + * to the number of group times that have elapsed since time zero. + */ + if (!_xfs_mru_cache_migrate(mru, now)) { + mru->time_zero = now; + if (!mru->next_reap) + mru->next_reap = mru->grp_count * mru->grp_time; + } else { + grp = (now - mru->time_zero) / mru->grp_time; + grp = (mru->lru_grp + grp) % mru->grp_count; + } + + /* Insert the element at the tail of the corresponding list. 
*/ + list_add_tail(&elem->list_node, mru->lists + grp); +} + +/* + * When destroying or reaping, all the elements that were migrated to the reap + * list need to be deleted. For each element this involves removing it from the + * data store, removing it from the reap list, calling the client's free + * function and deleting the element from the element zone. + */ +STATIC void +_xfs_mru_cache_clear_reap_list( + xfs_mru_cache_t *mru) +{ + xfs_mru_cache_elem_t *elem, *next; + struct list_head tmp; + + INIT_LIST_HEAD(&tmp); + list_for_each_entry_safe(elem, next, &mru->reap_list, list_node) { + + /* Remove the element from the data store. */ + radix_tree_delete(&mru->store, elem->key); + + /* + * remove to temp list so it can be freed without + * needing to hold the lock + */ + list_move(&elem->list_node, &tmp); + } + mutex_spinunlock(&mru->lock, 0); + + list_for_each_entry_safe(elem, next, &tmp, list_node) { + + /* Remove the element from the reap list. */ + list_del_init(&elem->list_node); + + /* Call the client's free function with the key and value pointer. */ + mru->free_func(elem->key, elem->value); + + /* Free the element structure. */ + kmem_zone_free(elem_zone, elem); + DEBUG_INC_FREES(mru); + } + + mutex_spinlock(&mru->lock); +} + +/* + * We fire the reap timer every group expiry interval so + * we always have a reaper ready to run. This makes shutdown + * and flushing of the reaper easy to do. Hence we need to + * keep track of when the next reap must occur so we can determine + * at each interval whether there is anything we need to do. 
+ */ +STATIC void +_xfs_mru_cache_reap( + struct work_struct *work) +{ + xfs_mru_cache_t *mru = container_of(work, xfs_mru_cache_t, work.work); + unsigned long now, next; + DEBUG_PRINT_STACK_VARS; + + ASSERT(mru && mru->lists); + if (!mru || !mru->lists) + return; + + mutex_spinlock(&mru->lock); + now = jiffies; + if (mru->reap_all || + (mru->next_reap && time_after(now, mru->next_reap))) { + DEBUG_PRINT_BEFORE_REAP; + if (mru->reap_all) + now += mru->grp_count * mru->grp_time * 2; + mru->next_reap = _xfs_mru_cache_migrate(mru, now); + _xfs_mru_cache_clear_reap_list(mru); + DEBUG_PRINT_AFTER_REAP; + } + + /* + * the process that triggered the reap_all is responsible + * for restarting the periodic reap if it is required. + */ + if (!mru->reap_all) + queue_delayed_work(reap_wq, &mru->work, mru->grp_time); + mru->reap_all = 0; + mutex_spinunlock(&mru->lock, 0); +} + +int +xfs_mru_cache_init(void) +{ + if (!(elem_zone = kmem_zone_init(sizeof(xfs_mru_cache_elem_t), + "xfs_mru_cache_elem"))) + return ENOMEM; + + if (!(reap_wq = create_singlethread_workqueue("xfs_mru_cache"))) { + kmem_zone_destroy(elem_zone); + elem_zone = NULL; + return ENOMEM; + } + + return 0; +} + +void +xfs_mru_cache_uninit(void) +{ + if (reap_wq) { + destroy_workqueue(reap_wq); + reap_wq = NULL; + } + + if (elem_zone) { + kmem_zone_destroy(elem_zone); + elem_zone = NULL; + } +} + +int +xfs_mru_cache_create( + xfs_mru_cache_t **mrup, + unsigned int lifetime_ms, + unsigned int grp_count, + xfs_mru_cache_free_func_t free_func) +{ + xfs_mru_cache_t *mru = NULL; + int err = 0, grp; + unsigned int grp_time; + + if (mrup) + *mrup = NULL; + + if (!mrup || !grp_count || !lifetime_ms || !free_func) + return EINVAL; + + if (!(grp_time = msecs_to_jiffies(lifetime_ms) / grp_count)) + return EINVAL; + + if (!(mru = kmem_zalloc(sizeof(*mru), KM_SLEEP))) + return ENOMEM; + + /* An extra list is needed to avoid reaping up to a grp_time early. 
*/ + mru->grp_count = grp_count + 1; + mru->lists = (struct list_head*) + kmem_alloc(mru->grp_count * sizeof(*mru->lists), KM_SLEEP); + + if (!mru->lists || !DEBUG_INIT_CACHE(mru)) { + err = ENOMEM; + goto exit; + } + + for (grp = 0; grp < mru->grp_count; grp++) + INIT_LIST_HEAD(mru->lists + grp); + + /* + * We use GFP_KERNEL radix tree preload and do inserts under a + * spinlock so GFP_ATOMIC is appropriate for the radix tree itself. + */ + INIT_RADIX_TREE(&mru->store, GFP_ATOMIC); + INIT_LIST_HEAD(&mru->reap_list); + spinlock_init(&mru->lock, "xfs_mru_cache"); + INIT_DELAYED_WORK(&mru->work, _xfs_mru_cache_reap); + + mru->grp_time = grp_time; + mru->free_func = free_func; + + /* start up the reaper event */ + mru->next_reap = 0; + mru->reap_all = 0; + queue_delayed_work(reap_wq, &mru->work, mru->grp_time); + + *mrup = mru; + +exit: + if (err && mru && mru->lists) + kmem_free(mru->lists, mru->grp_count * sizeof(*mru->lists)); + if (err && mru) + kmem_free(mru, sizeof(*mru)); + + return err; +} + +/* + * When flushing, we stop the periodic reaper from running first + * so we don't race with it. If we are flushing on unmount, we + * don't want to restart the reaper again, so the restart is conditional. + * + * Because reaping can drop the last refcount on inodes which can free + * extents, we have to push the reaping off to the workqueue thread + * because we could be called holding locks that extent freeing requires. 
+ */ +void +xfs_mru_cache_flush( + xfs_mru_cache_t *mru, + int restart) +{ + DEBUG_PRINT_STACK_VARS; + + if (!mru || !mru->lists) + return; + + cancel_rearming_delayed_workqueue(reap_wq, &mru->work); + + mutex_spinlock(&mru->lock); + mru->reap_all = 1; + mutex_spinunlock(&mru->lock, 0); + + queue_work(reap_wq, &mru->work.work); + flush_workqueue(reap_wq); + + mutex_spinlock(&mru->lock); + WARN_ON_ONCE(mru->reap_all != 0); + mru->reap_all = 0; + if (restart) + queue_delayed_work(reap_wq, &mru->work, mru->grp_time); + mutex_spinunlock(&mru->lock, 0); +} + +void +xfs_mru_cache_destroy( + xfs_mru_cache_t *mru) +{ + if (!mru || !mru->lists) + return; + + /* we don't want the reaper to restart here */ + xfs_mru_cache_flush(mru, 0); + + DEBUG_UNINIT_CACHE(mru); + kmem_free(mru->lists, mru->grp_count * sizeof(*mru->lists)); + kmem_free(mru, sizeof(*mru)); +} + +int +xfs_mru_cache_insert( + xfs_mru_cache_t *mru, + unsigned long key, + void *value) +{ + xfs_mru_cache_elem_t *elem; + + ASSERT(mru && mru->lists); + if (!mru || !mru->lists) + return EINVAL; + + elem = (xfs_mru_cache_elem_t*)kmem_zone_zalloc(elem_zone, KM_SLEEP); + if (!elem) + return ENOMEM; + + if (radix_tree_preload(GFP_KERNEL)) { + kmem_zone_free(elem_zone, elem); + return ENOMEM; + } + + INIT_LIST_HEAD(&elem->list_node); + elem->key = key; + elem->value = value; + + mutex_spinlock(&mru->lock); + + radix_tree_insert(&mru->store, key, elem); + radix_tree_preload_end(); + + _xfs_mru_cache_list_insert(mru, elem); + + DEBUG_INC_ALLOCS(mru); + + mutex_spinunlock(&mru->lock, 0); + + return 0; +} + +void* +xfs_mru_cache_remove( + xfs_mru_cache_t *mru, + unsigned long key) +{ + xfs_mru_cache_elem_t *elem; + void *value = NULL; + + ASSERT(mru && mru->lists); + if (!mru || !mru->lists) + return NULL; + + mutex_spinlock(&mru->lock); + elem = (xfs_mru_cache_elem_t*)radix_tree_delete(&mru->store, key); + if (elem) { + value = elem->value; + list_del(&elem->list_node); + DEBUG_INC_FREES(mru); + } + + 
mutex_spinunlock(&mru->lock, 0); + + if (elem) + kmem_zone_free(elem_zone, elem); + + return value; +} + +void +xfs_mru_cache_delete( + xfs_mru_cache_t *mru, + unsigned long key) +{ + void *value; + + if ((value = xfs_mru_cache_remove(mru, key))) + mru->free_func(key, value); +} + +void* +xfs_mru_cache_lookup( + xfs_mru_cache_t *mru, + unsigned long key) +{ + xfs_mru_cache_elem_t *elem; + + ASSERT(mru && mru->lists); + if (!mru || !mru->lists) + return NULL; + + mutex_spinlock(&mru->lock); + elem = (xfs_mru_cache_elem_t*)radix_tree_lookup(&mru->store, key); + if (elem) { + list_del(&elem->list_node); + _xfs_mru_cache_list_insert(mru, elem); + } + else + mutex_spinunlock(&mru->lock, 0); + + return elem ? elem->value : NULL; +} + +void* +xfs_mru_cache_peek( + xfs_mru_cache_t *mru, + unsigned long key) +{ + xfs_mru_cache_elem_t *elem; + + ASSERT(mru && mru->lists); + if (!mru || !mru->lists) + return NULL; + + mutex_spinlock(&mru->lock); + elem = (xfs_mru_cache_elem_t*)radix_tree_lookup(&mru->store, key); + if (!elem) + mutex_spinunlock(&mru->lock, 0); + + return elem ? 
elem->value : NULL; +} + +void +xfs_mru_cache_done( + xfs_mru_cache_t *mru) +{ + mutex_spinunlock(&mru->lock, 0); +} + +#ifdef DEBUG_MRU_CACHE +STATIC int +_xfs_mru_cache_print( + xfs_mru_cache_t *mru, + char *buf) +{ + unsigned int grp; + struct list_head *node; + char *bufp = buf; + + for (grp = 0; grp < mru->grp_count; grp++) { + mru->list_elems[grp] = 0; + list_for_each(node, mru->lists + grp) + mru->list_elems[grp]++; + } + mru->reap_elems = 0; + list_for_each(node, &mru->reap_list) + mru->reap_elems++; + + bufp += sprintf(bufp, "(%d) ", mru->reap_elems); + + for (grp = 0; grp < mru->grp_count; grp++) + { + if (grp == mru->lru_grp) + *bufp++ = '*'; + + bufp += sprintf(bufp, "%u", mru->list_elems[grp]); + + if (grp == mru->lru_grp) + *bufp++ = '*'; + + if (grp < mru->grp_count - 1) + *bufp++ = ' '; + } + + bufp += sprintf(bufp, " [%lu/%lu]", mru->allocs, mru->frees); + + return bufp - buf; +} +#endif /* DEBUG_MRU_CACHE */ Index: 2.6.x-xfs-new/fs/xfs/xfs_mru_cache.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ 2.6.x-xfs-new/fs/xfs/xfs_mru_cache.h 2007-05-10 17:24:13.155002014 +1000 @@ -0,0 +1,225 @@ +/* + * Copyright (c) 2000-2002,2006 Silicon Graphics, Inc. + * All Rights Reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ +#ifndef __XFS_MRU_CACHE_H__ +#define __XFS_MRU_CACHE_H__ + +/* + * The MRU Cache data structure consists of a data store, an array of lists and + * a lock to protect its internal state. At initialisation time, the client + * supplies an element lifetime in milliseconds and a group count, as well as a + * function pointer to call when deleting elements. A data structure for + * queueing up work in the form of timed callbacks is also included. + * + * The group count controls how many lists are created, and thereby how finely + * the elements are grouped in time. When reaping occurs, all the elements in + * all the lists whose time has expired are deleted. + * + * To give an example of how this works in practice, consider a client that + * initialises an MRU Cache with a lifetime of ten seconds and a group count of + * five. Five internal lists will be created, each representing a two second + * period in time. When the first element is added, time zero for the data + * structure is initialised to the current time. + * + * All the elements added in the first two seconds are appended to the first + * list. Elements added in the third second go into the second list, and so on. + * If an element is accessed at any point, it is removed from its list and + * inserted at the head of the current most-recently-used list. + * + * The reaper function will have nothing to do until at least twelve seconds + * have elapsed since the first element was added. The reason for this is that + * if it were called at t=11s, there could be elements in the first list that + * have only been inactive for nine seconds, so it still does nothing. If it is + * called anywhere between t=12 and t=14 seconds, it will delete all the + * elements that remain in the first list. 
It's therefore possible for elements + * to remain in the data store even after they've been inactive for up to + * (t + t/g) seconds, where t is the inactive element lifetime and g is the + * number of groups. + * + * The above example assumes that the reaper function gets called at least once + * every (t/g) seconds. If it is called less frequently, unused elements will + * accumulate in the reap list until the reaper function is eventually called. + * The current implementation uses work queue callbacks to carefully time the + * reaper function calls, so this should happen rarely, if at all. + * + * From a design perspective, the primary reason for the choice of a list array + * representing discrete time intervals is that it's only practical to reap + * expired elements in groups of some appreciable size. This automatically + * introduces a granularity to element lifetimes, so there's no point storing an + * individual timeout with each element that specifies a more precise reap time. + * The bonus is a saving of sizeof(long) bytes of memory per element stored. + * + * The elements could have been stored in just one list, but an array of + * counters or pointers would need to be maintained to allow them to be divided + * up into discrete time groups. More critically, the process of touching or + * removing an element would involve walking large portions of the entire list, + * which would have a detrimental effect on performance. The additional memory + * requirement for the array of list heads is minimal. + * + * When an element is touched or deleted, it needs to be removed from its + * current list. Doubly linked lists are used to make the list maintenance + * portion of these operations O(1). Since reaper timing can be imprecise, + * inserts and lookups can occur when there are no free lists available. When + * this happens, all the elements on the LRU list need to be migrated to the end + * of the reap list. 
To keep the list maintenance portion of these operations + * O(1) also, list tails need to be accessible without walking the entire list. + * This is the reason why doubly linked list heads are used. + */ + +/* Function pointer type for callback to free a client's data pointer. */ +typedef void (*xfs_mru_cache_free_func_t)(void*, void*); + +typedef struct xfs_mru_cache +{ + struct radix_tree_root store; /* Core storage data structure. */ + struct list_head *lists; /* Array of lists, one per grp. */ + struct list_head reap_list; /* Elements overdue for reaping. */ + spinlock_t lock; /* Lock to protect this struct. */ + unsigned int grp_count; /* Number of discrete groups. */ + unsigned int grp_time; /* Time period spanned by grps. */ + unsigned int lru_grp; /* Group containing time zero. */ + unsigned long time_zero; /* Time first element was added. */ + unsigned long next_reap; /* Time that the reaper should + next do something. */ + unsigned int reap_all; /* if set, reap all lists */ + xfs_mru_cache_free_func_t free_func; /* Function pointer for freeing. */ + struct delayed_work work; /* Workqueue data for reaping. */ +#ifdef DEBUG_MRU_CACHE + unsigned int *list_elems; + unsigned int reap_elems; + unsigned long allocs; + unsigned long frees; +#endif +} xfs_mru_cache_t; + +/* + * xfs_mru_cache_init() prepares memory zones and any other globally scoped + * resources. + */ +int +xfs_mru_cache_init(void); + +/* + * xfs_mru_cache_uninit() tears down all the globally scoped resources prepared + * in xfs_mru_cache_init(). + */ +void +xfs_mru_cache_uninit(void); + +/* + * To initialise a struct xfs_mru_cache pointer, call xfs_mru_cache_create() + * with the address of the pointer, a lifetime value in milliseconds, a group + * count and a free function to use when deleting elements. This function + * returns 0 if the initialisation was successful. 
+ */ +int +xfs_mru_cache_create(struct xfs_mru_cache **mrup, + unsigned int lifetime_ms, + unsigned int grp_count, + xfs_mru_cache_free_func_t free_func); + +/* + * Call xfs_mru_cache_flush() to flush out all cached entries, calling their + * free functions as they're deleted. When this function returns, the caller is + * guaranteed that all the free functions for all the elements have finished + * executing. + * + * While we are flushing, we stop the periodic reaper event from triggering. + * Normally, we want to restart this periodic event, but if we are shutting + * down the cache we do not want it restarted. Hence the restart parameter, + * where 0 = do not restart reaper and 1 = restart reaper. + */ +void +xfs_mru_cache_flush( + xfs_mru_cache_t *mru, + int restart); + +/* + * Call xfs_mru_cache_destroy() with the MRU Cache pointer when the cache is no + * longer needed. + */ +void +xfs_mru_cache_destroy(struct xfs_mru_cache *mru); + +/* + * To insert an element, call xfs_mru_cache_insert() with the data store, the + * element's key and the client data pointer. This function returns 0 on + * success or ENOMEM if memory for the data element couldn't be allocated. + */ +int +xfs_mru_cache_insert(struct xfs_mru_cache *mru, + unsigned long key, + void *value); + +/* + * To remove an element without calling the free function, call + * xfs_mru_cache_remove() with the data store and the element's key. On success + * the client data pointer for the removed element is returned, otherwise this + * function will return a NULL pointer. + */ +void* +xfs_mru_cache_remove(struct xfs_mru_cache *mru, + unsigned long key); + +/* + * To remove an element and call the free function, call xfs_mru_cache_delete() + * with the data store and the element's key. + */ +void +xfs_mru_cache_delete(struct xfs_mru_cache *mru, + unsigned long key); + +/* + * To look up an element using its key, call xfs_mru_cache_lookup() with the + * data store and the element's key. 
If found, the element will be moved to the + * head of the MRU list to indicate that it's been touched. + * + * The internal data structures are protected by a spinlock that is STILL HELD + * when this function returns. Call xfs_mru_cache_done() to release it. Note + * that it is not safe to call any function that might sleep in the interim. + * + * The implementation could have used reference counting to avoid this + * restriction, but since most clients simply want to get, set or test a member + * of the returned data structure, the extra per-element memory isn't warranted. + * + * If the element isn't found, this function returns NULL and the spinlock is + * released. xfs_mru_cache_done() should NOT be called when this occurs. + */ +void* +xfs_mru_cache_lookup(struct xfs_mru_cache *mru, + unsigned long key); + +/* + * To look up an element using its key, but leave its location in the internal + * lists alone, call xfs_mru_cache_peek(). If the element isn't found, this + * function returns NULL. + * + * See the comments above the declaration of the xfs_mru_cache_lookup() function + * for important locking information pertaining to this call. + */ +void* +xfs_mru_cache_peek(struct xfs_mru_cache *mru, + unsigned long key); +/* + * To release the internal data structure spinlock after having performed an + * xfs_mru_cache_lookup() or an xfs_mru_cache_peek(), call xfs_mru_cache_done() + * with the data store pointer. 
+ */ +void +xfs_mru_cache_done(struct xfs_mru_cache *mru); + +#endif /* __XFS_MRU_CACHE_H__ */ Index: 2.6.x-xfs-new/fs/xfs/xfs_vfsops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_vfsops.c 2007-05-10 17:22:43.506752209 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_vfsops.c 2007-05-10 17:24:13.163000966 +1000 @@ -51,6 +51,8 @@ #include "xfs_acl.h" #include "xfs_attr.h" #include "xfs_clnt.h" +#include "xfs_mru_cache.h" +#include "xfs_filestream.h" #include "xfs_fsops.h" STATIC int xfs_sync(bhv_desc_t *, int, cred_t *); @@ -81,6 +83,8 @@ xfs_init(void) xfs_dabuf_zone = kmem_zone_init(sizeof(xfs_dabuf_t), "xfs_dabuf"); xfs_ifork_zone = kmem_zone_init(sizeof(xfs_ifork_t), "xfs_ifork"); xfs_acl_zone_init(xfs_acl_zone, "xfs_acl"); + xfs_mru_cache_init(); + xfs_filestream_init(); /* * The size of the zone allocated buf log item is the maximum @@ -164,6 +168,8 @@ xfs_cleanup(void) xfs_cleanup_procfs(); xfs_sysctl_unregister(); xfs_refcache_destroy(); + xfs_filestream_uninit(); + xfs_mru_cache_uninit(); xfs_acl_zone_destroy(xfs_acl_zone); #ifdef XFS_DIR2_TRACE @@ -320,6 +326,9 @@ xfs_start_flags( else mp->m_flags &= ~XFS_MOUNT_BARRIER; + if (ap->flags2 & XFSMNT2_FILESTREAMS) + mp->m_flags |= XFS_MOUNT_FILESTREAMS; + return 0; } @@ -518,6 +527,9 @@ xfs_mount( if (mp->m_flags & XFS_MOUNT_BARRIER) xfs_mountfs_check_barriers(mp); + if ((error = xfs_filestream_mount(mp))) + goto error2; + error = XFS_IOINIT(vfsp, args, flags); if (error) goto error2; @@ -575,6 +587,13 @@ xfs_unmount( */ xfs_refcache_purge_mp(mp); + /* + * Blow away any referenced inode in the filestreams cache. + * This can and will cause log traffic as inodes go inactive + * here. 
+ */ + xfs_filestream_unmount(mp); + XFS_bflush(mp->m_ddev_targp); error = xfs_unmount_flush(mp, 0); if (error) @@ -682,6 +701,7 @@ xfs_mntupdate( mp->m_flags &= ~XFS_MOUNT_BARRIER; } } else if (!(vfsp->vfs_flag & VFS_RDONLY)) { /* rw -> ro */ + xfs_filestream_flush(mp); bhv_vfs_sync(vfsp, SYNC_FSDATA|SYNC_BDFLUSH|SYNC_ATTR, NULL); xfs_quiesce_fs(mp); xfs_log_sbcount(mp, 1); @@ -909,6 +929,9 @@ xfs_sync( { xfs_mount_t *mp = XFS_BHVTOM(bdp); + if (flags & SYNC_IOWAIT) + xfs_filestream_flush(mp); + return xfs_syncsub(mp, flags, NULL); } @@ -1869,6 +1892,8 @@ xfs_parseargs( } else if (!strcmp(this_char, "irixsgid")) { cmn_err(CE_WARN, "XFS: irixsgid is now a sysctl(2) variable, option is deprecated."); + } else if (!strcmp(this_char, "filestreams")) { + args->flags2 |= XFSMNT2_FILESTREAMS; } else { cmn_err(CE_WARN, "XFS: unknown mount option [%s].", this_char); Index: 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_vnodeops.c 2007-05-10 17:22:43.506752209 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c 2007-05-10 17:24:13.170999917 +1000 @@ -51,6 +51,7 @@ #include "xfs_refcache.h" #include "xfs_trans_space.h" #include "xfs_log_priv.h" +#include "xfs_filestream.h" STATIC int xfs_open( @@ -94,6 +95,19 @@ xfs_close( return 0; /* + * If we are using filestreams, and we have an unlinked + * file that we are processing the last close on, then nothing + * will be able to reopen and write to this file. Purge this + * inode from the filestreams cache so that it doesn't delay + * teardown of the inode. + */ + if ((ip->i_d.di_nlink == 0) && + ((ip->i_mount->m_flags & XFS_MOUNT_FILESTREAMS) || + (ip->i_d.di_flags & XFS_DIFLAG_FILESTREAM))) { + xfs_filestream_deassociate(ip); + } + + /* * If we previously truncated this file and removed old data in * the process, we want to initiate "early" writeout on the last * close. 
This is an attempt to combat the notorious NULL files @@ -820,6 +834,8 @@ xfs_setattr( di_flags |= XFS_DIFLAG_PROJINHERIT; if (vap->va_xflags & XFS_XFLAG_NODEFRAG) di_flags |= XFS_DIFLAG_NODEFRAG; + if (vap->va_xflags & XFS_XFLAG_FILESTREAM) + di_flags |= XFS_DIFLAG_FILESTREAM; if ((ip->i_d.di_mode & S_IFMT) == S_IFDIR) { if (vap->va_xflags & XFS_XFLAG_RTINHERIT) di_flags |= XFS_DIFLAG_RTINHERIT; @@ -2564,6 +2580,18 @@ xfs_remove( */ xfs_refcache_purge_ip(ip); + /* + * If we are using filestreams, kill the stream association. + * If the file is still open it may get a new one but that + * will get killed on last close in xfs_close() so we don't + * have to worry about that. + */ + if (link_zero && + ((ip->i_mount->m_flags & XFS_MOUNT_FILESTREAMS) || + (ip->i_d.di_flags & XFS_DIFLAG_FILESTREAM))) { + xfs_filestream_deassociate(ip); + } + vn_trace_exit(XFS_ITOV(ip), __FUNCTION__, (inst_t *)__return_address); /* Index: 2.6.x-xfs-new/fs/xfs/quota/xfs_qm.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/quota/xfs_qm.c 2007-05-10 17:22:43.506752209 +1000 +++ 2.6.x-xfs-new/fs/xfs/quota/xfs_qm.c 2007-05-10 17:24:13.186997821 +1000 @@ -65,7 +65,6 @@ kmem_zone_t *qm_dqtrxzone; static struct shrinker *xfs_qm_shaker; static cred_t xfs_zerocr; -static xfs_inode_t xfs_zeroino; STATIC void xfs_qm_list_init(xfs_dqlist_t *, char *, int); STATIC void xfs_qm_list_destroy(xfs_dqlist_t *); @@ -1415,7 +1414,7 @@ xfs_qm_qino_alloc( return error; } - if ((error = xfs_dir_ialloc(&tp, &xfs_zeroino, S_IFREG, 1, 0, + if ((error = xfs_dir_ialloc(&tp, NULL, S_IFREG, 1, 0, &xfs_zerocr, 0, 1, ip, &committed))) { xfs_trans_cancel(tp, XFS_TRANS_RELEASE_LOG_RES | XFS_TRANS_ABORT); From owner-xfs@oss.sgi.com Thu May 10 18:11:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 18:11:57 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id 
l4B1BpfB008635 for ; Thu, 10 May 2007 18:11:53 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA05863; Fri, 11 May 2007 11:11:47 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4B1BkAf88912650; Fri, 11 May 2007 11:11:47 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4B1BjMD90624383; Fri, 11 May 2007 11:11:45 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 11 May 2007 11:11:45 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: Review: fix b0rked test 030 behaviour. Message-ID: <20070511011145.GN86004887@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit User-Agent: Mutt/1.4.2.1i X-archive-position: 11392 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Test 030 is not testing things as it should. Specifically, corrupting the AGFL with "-1" is a no-op on a freshly repaired filesystem, because xfs_repair rebuilds the AGF btrees and AGFL from scratch and does not populate the AGFL. The current test does: repair mount create file remove file umount And it does the filesystem twiddling to check that the filesystem is usable after repair. The problem is that this doesn't dirty the filesystem - the create is followed by a remove, so nothing is actually allocated and so the AGFL lists do not get modified. Hence, after a repair/check/corruption cycle, writing "-1" to the AGFL is a no-op because it is already full of "-1" fields (NULL blocks).
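The no-op described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the slot count, the function names and the block numbers are hypothetical and are not taken from xfs_repair or from test 030; the only thing borrowed from the message is that a freshly rebuilt AGFL holds nothing but "-1" (NULL) entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NULL_BLOCK 0xffffffffu  /* the "-1" value the test writes */
#define AGFL_SLOTS 8            /* hypothetical size, for illustration only */

/* Model of a freshly repaired AGFL: every slot is NULL (hypothetical helper). */
void rebuild_agfl(uint32_t *agfl, size_t n)
{
    for (size_t i = 0; i < n; i++)
        agfl[i] = NULL_BLOCK;
}

/*
 * Overwrite every slot with `value` and report how many slots actually
 * changed. A return of 0 means the "corruption" was a no-op.
 */
size_t corrupt_agfl(uint32_t *agfl, size_t n, uint32_t value)
{
    size_t changed = 0;

    for (size_t i = 0; i < n; i++) {
        if (agfl[i] != value) {
            agfl[i] = value;
            changed++;
        }
    }
    return changed;
}
```

In this model, calling corrupt_agfl() straight after rebuild_agfl() changes nothing, while calling it after a few slots have been filled with real block numbers changes exactly those slots, which is the difference the test needs to exercise.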
With filestreams, the create/remove pair *does* modify the filesystem and so when we write "-1" to the AGFL, we get different output because the filesystem detects new corruptions and the test "fails". So, to make behaviour consistent, dirty the filesystem before corrupting it on each cycle. Hence it doesn't matter if we are using filestreams or not, we'll really test out corrupting the AGFL with NULL blocks (-1) now. Comments? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- xfstests/030.out.irix | 4 ++++ xfstests/030.out.linux | 4 ++++ xfstests/common.repair | 10 ++++++++++ 3 files changed, 18 insertions(+) Index: xfs-cmds/xfstests/030.out.irix =================================================================== --- xfs-cmds.orig/xfstests/030.out.irix 2007-05-03 17:10:29.554803451 +1000 +++ xfs-cmds/xfstests/030.out.irix 2007-05-03 17:10:54.227585189 +1000 @@ -262,6 +262,10 @@ Wrote X.XXKb (value 0xffffffff) Phase 1 - find and verify superblock... Phase 2 - zero log... - scan filesystem freespace and inode maps... +bad agbno AGBNO in agfl, agno 0 +bad agbno AGBNO in agfl, agno 0 +bad agbno AGBNO in agfl, agno 0 +bad agbno AGBNO in agfl, agno 0 - found root inode chunk Phase 3 - for each AG... - scan and clear agi unlinked lists... Index: xfs-cmds/xfstests/030.out.linux =================================================================== --- xfs-cmds.orig/xfstests/030.out.linux 2007-05-03 17:10:29.554803451 +1000 +++ xfs-cmds/xfstests/030.out.linux 2007-05-03 17:10:54.231584667 +1000 @@ -270,6 +270,10 @@ Phase 1 - find and verify superblock... Phase 2 - using log - zero log... - scan filesystem freespace and inode maps... +bad agbno AGBNO in agfl, agno 0 +bad agbno AGBNO in agfl, agno 0 +bad agbno AGBNO in agfl, agno 0 +bad agbno AGBNO in agfl, agno 0 - found root inode chunk Phase 3 - for each AG... - scan and clear agi unlinked lists... 
Index: xfs-cmds/xfstests/common.repair =================================================================== --- xfs-cmds.orig/xfstests/common.repair 2007-05-03 17:10:29.554803451 +1000 +++ xfs-cmds/xfstests/common.repair 2007-05-03 17:10:54.231584667 +1000 @@ -72,8 +72,18 @@ _check_repair() { value=$1 structure="$2" + + #ensure the filesystem has been dirtied since last repair + _scratch_mount + POSIXLY_CORRECT=yes \ + dd if=/bin/sh of=$SCRATCH_MNT/sh 2>&1 |_filter_dd + sync + rm -f $SCRATCH_MNT/sh + umount $SCRATCH_MNT + _zero_position $value "$structure" _scratch_xfs_repair 2>&1 | _filter_repair + # some basic sanity checks... _check_scratch_fs _scratch_mount #mount From owner-xfs@oss.sgi.com Thu May 10 20:39:16 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 20:39:18 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4B3dDfB032370 for ; Thu, 10 May 2007 20:39:15 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id NAA08907; Fri, 11 May 2007 13:39:07 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id 423A158CA530; Fri, 11 May 2007 13:39:06 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: TAKE 964546 - Only use refcounted pages for I/O Message-Id: <20070511033907.423A158CA530@chook.melbourne.sgi.com> Date: Fri, 11 May 2007 13:39:06 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11393 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Only use refcounted pages for I/O Many block drivers (aoe, iscsi) really want refcountable pages in bios, which is what almost everyone send down. 
XFS unfortunately has a few places where it sends down buffers that may come from kmalloc, which breaks them. Fix the places that use kmalloc()d buffers. Signed-Off-By: Christoph Hellwig Date: Fri May 11 13:37:22 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs Inspected by: dgc,tes The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28562a fs/xfs/xfs_log.c - 1.328 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_log.c.diff?r1=text&tr1=1.328&r2=text&tr2=1.327&f=h - Convert log buffers to use xfs_buf_get_noaddr rather than using kmem_alloc()d buffers. fs/xfs/linux-2.6/xfs_buf.h - 1.120 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_buf.h.diff?r1=text&tr1=1.120&r2=text&tr2=1.119&f=h - Use alloc_page() rather than kmem_alloc() for buffers that do not use page cache backed pages. fs/xfs/linux-2.6/xfs_buf.c - 1.236 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_buf.c.diff?r1=text&tr1=1.236&r2=text&tr2=1.235&f=h - Use alloc_page() rather than kmem_alloc() for buffers that do not use page cache backed pages. 
From owner-xfs@oss.sgi.com Thu May 10 21:01:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 21:01:41 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4B41ZfB003462 for ; Thu, 10 May 2007 21:01:37 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id OAA09369; Fri, 11 May 2007 14:01:31 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id C7E7A58CA530; Fri, 11 May 2007 14:01:31 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: PARTIAL TAKE 957886 - xfs_growfs should refuse to grow fs past 16Tb on a 32 bit system Message-Id: <20070511040131.C7E7A58CA530@chook.melbourne.sgi.com> Date: Fri, 11 May 2007 14:01:31 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11394 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Don't grow filesystems past the size they can index. When growing a filesystem we don't check to see if the new size overflows the page cache index range, so we can do silly things like grow a filesystem past 16TB on a 32-bit system. Check new filesystem sizes against the limits the kernel can support. Signed-Off-By: Nathan Scott Date: Fri May 11 14:00:21 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs Inspected by: dgc The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28563a fs/xfs/xfs_rtalloc.c - 1.107 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_rtalloc.c.diff?r1=text&tr1=1.107&r2=text&tr2=1.106&f=h - Check new rt volume size against the maximum the system can support.
fs/xfs/xfs_mount.h - 1.235 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_mount.h.diff?r1=text&tr1=1.235&r2=text&tr2=1.234&f=h - Factor maximum supported filesystem size checks to allow other callers to use it. fs/xfs/xfs_mount.c - 1.394 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_mount.c.diff?r1=text&tr1=1.394&r2=text&tr2=1.393&f=h - Factor maximum supported filesystem size checks to allow other callers to use it. fs/xfs/xfs_fsops.c - 1.123 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_fsops.c.diff?r1=text&tr1=1.123&r2=text&tr2=1.122&f=h - Check new volume size against the maximum the system can support. From owner-xfs@oss.sgi.com Thu May 10 22:03:47 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 22:03:50 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4B53ifB014004 for ; Thu, 10 May 2007 22:03:46 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA10567; Fri, 11 May 2007 15:03:40 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id A25B858CA530; Fri, 11 May 2007 15:03:40 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: TAKE 963674 - Don't hold ilock when calling vn_iowait. Message-Id: <20070511050340.A25B858CA530@chook.melbourne.sgi.com> Date: Fri, 11 May 2007 15:03:40 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11395 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Sleeping with the ilock waiting for I/O completion is Bad. Recent fixes to the filesystem freezing code introduced a vn_iowait call in the middle of the sync code. Unfortunately, at the point where this call was added we are holding the ilock. 
The ilock is needed by I/O completion for unwritten extent conversion and now also for updating the file size. Hence I/O cannot complete if we hold the ilock while waiting for I/O completion. Fix up the bug and clean the code up around it. Date: Fri May 11 15:02:29 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs Inspected by: hch@infradead.org,tes The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28566a fs/xfs/xfs_vfsops.c - 1.519 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vfsops.c.diff?r1=text&tr1=1.519&r2=text&tr2=1.518&f=h - Drop the ilock before calling vn_iowait() when doing a SYNC_IOWAIT sync operation. Make the code easier to understand as well. From owner-xfs@oss.sgi.com Thu May 10 22:25:27 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 22:25:28 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4B5POfB018617 for ; Thu, 10 May 2007 22:25:26 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA10961; Fri, 11 May 2007 15:25:20 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id 2957D58CA530; Fri, 11 May 2007 15:25:20 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: TAKE 964545 - use-after-free of xfs_buf_t during log unmount Message-Id: <20070511052520.2957D58CA530@chook.melbourne.sgi.com> Date: Fri, 11 May 2007 15:25:20 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11396 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Fix use-after-free during log unmount.
Don't reference the log buffer after running the callbacks as the callback can trigger the log buffers to be freed during unmount. Date: Fri May 11 15:24:46 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs Inspected by: hch@infradead.org The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28567a fs/xfs/xfs_log.c - 1.329 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_log.c.diff?r1=text&tr1=1.329&r2=text&tr2=1.328&f=h - Don't reference the log buffer after running the callbacks as it may have been freed during the unmount. From owner-xfs@oss.sgi.com Thu May 10 22:35:52 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 22:35:54 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4B5ZnfB020332 for ; Thu, 10 May 2007 22:35:51 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA11162; Fri, 11 May 2007 15:35:45 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id 54D9758CA530; Fri, 11 May 2007 15:35:45 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: TAKE 964544 - Barriers need to be dynamically checked and switched off Message-Id: <20070511053545.54D9758CA530@chook.melbourne.sgi.com> Date: Fri, 11 May 2007 15:35:45 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11397 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Barriers need to be dynamically checked and switched off. If the underlying block device suddenly stops supporting barriers, we need to handle the -EOPNOTSUPP error in a sane manner rather than shutting down the filesystem.
If we get this error, clear the barrier flag, reissue the I/O, and tell the world bad things are occurring. Date: Fri May 11 15:35:19 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs Inspected by: hch@infradead.org The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28568a fs/xfs/xfs_log.c - 1.330 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_log.c.diff?r1=text&tr1=1.330&r2=text&tr2=1.329&f=h - If we have barriers enabled and we see a barrier log write come back without the barrier flag on it, then we need to stop issuing barriers on the log writes. Make noise about it, too. fs/xfs/linux-2.6/xfs_super.c - 1.380 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_super.c.diff?r1=text&tr1=1.380&r2=text&tr2=1.379&f=h - We shouldn't peer down into the backing device to see if barriers are supported or not - the test I/O is sufficient to tell us this. fs/xfs/linux-2.6/xfs_buf.c - 1.237 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_buf.c.diff?r1=text&tr1=1.237&r2=text&tr2=1.236&f=h - If the buffer gets a EOPNOTSUPP I/O error and it is a barrier write, clear the barrier and reissue the I/O. 
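The fallback described in the message above (clear the barrier flag, reissue the I/O, make noise) can be sketched in userspace roughly as follows. This is a minimal model, not the code that was checked in: struct fake_dev, struct fake_mount, fake_submit() and log_write() are hypothetical names invented for illustration, and barrier_enabled merely plays the role that XFS_MOUNT_BARRIER plays in the real mount flags.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a block device whose barrier support can vanish. */
struct fake_dev {
    bool supports_barriers;
    int  writes_issued;
};

/* Hypothetical stand-in for the mount; barrier_enabled models XFS_MOUNT_BARRIER. */
struct fake_mount {
    bool barrier_enabled;
    int  warnings;
};

/* Simulated submission: a barrier write fails if the device lacks support. */
int fake_submit(struct fake_dev *dev, bool barrier)
{
    dev->writes_issued++;
    if (barrier && !dev->supports_barriers)
        return -EOPNOTSUPP;
    return 0;
}

/*
 * Issue one log write. If a barrier write comes back with -EOPNOTSUPP,
 * stop using barriers on this mount, warn, and reissue the same write
 * without the barrier instead of treating it as a fatal I/O error.
 */
int log_write(struct fake_mount *mp, struct fake_dev *dev)
{
    bool barrier = mp->barrier_enabled;
    int error = fake_submit(dev, barrier);

    if (error == -EOPNOTSUPP && barrier) {
        mp->barrier_enabled = false;
        mp->warnings++;
        fprintf(stderr, "barrier writes not supported, disabling barriers\n");
        error = fake_submit(dev, false);
    }
    return error;
}
```

With a device that has lost barrier support, the first log_write() in this model issues two writes and warns once; later calls go straight through without barriers and without further warnings.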
From owner-xfs@oss.sgi.com Fri May 11 04:01:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 11 May 2007 04:01:44 -0700 (PDT) Received: from e5.ny.us.ibm.com (e5.ny.us.ibm.com [32.97.182.145]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4BB1bfB020236 for ; Fri, 11 May 2007 04:01:39 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e5.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4BB1ak9030146 for ; Fri, 11 May 2007 07:01:36 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4BB1aIw544592 for ; Fri, 11 May 2007 07:01:36 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4BB1ZDE031697 for ; Fri, 11 May 2007 07:01:36 -0400 Received: from qubit.in.ibm.com ([9.124.219.214]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4BB1XPB031541; Fri, 11 May 2007 07:01:34 -0400 Received: from qubit.in.ibm.com (localhost.localdomain [127.0.0.1]) by qubit.in.ibm.com (Postfix) with ESMTP id C81A667FFD; Fri, 11 May 2007 16:33:11 +0530 (IST) Received: (from suparna@localhost) by qubit.in.ibm.com (8.13.1/8.13.1/Submit) id l4BB37tF003708; Fri, 11 May 2007 16:33:07 +0530 Date: Fri, 11 May 2007 16:33:01 +0530 From: Suparna Bhattacharya To: David Chinner Cc: "Amit K. 
Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070511110301.GB28425@in.ibm.com> Reply-To: suparna@in.ibm.com References: <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070509160102.GA30745@amitarora.in.ibm.com> <20070510005926.GT85884050@sgi.com> <20070510115620.GB21400@amitarora.in.ibm.com> <20070510223950.GD86004887@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070510223950.GD86004887@sgi.com> User-Agent: Mutt/1.5.11 X-archive-position: 11398 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: suparna@in.ibm.com Precedence: bulk X-list: xfs On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote: > On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote: > > On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote: > > > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote: > > > > I have the updated patches ready which take care of Andrew's comments. > > > > Will run some tests and post them soon. > > > > > > > > But, before submitting these patches, I think it will be better to > > > > finalize on certain things which might be worth some discussion here: > > > > > > > > 1) Should the file size change when preallocation is done beyond EOF ? > > > > - Andreas and Chris Wedgwood are in favor of not changing the file size > > > > in this case. I also tend to agree with them. Does anyone has an > > > > argument in favor of changing the filesize ? 
If not, I will remove the > > > > code which changes the filesize, before I resubmit the concerned ext4 > > > > patch. > > > > > > I think there needs to be both. If we don't have a mechanism to atomically > > > change the file size with the preallocation, then applications that use > > > stat() to work out if they need to preallocate more space will end up > > > racing. > > > > By "both" above, do you mean we should give user the flexibility if it wants > > the filesize changed or not ? It can be done by having *two* modes for > > preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we > > use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not > > change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate() > > will change the filesize if required (i.e. when allocation is beyond EOF) > > and also update [cm]time. This way, the application can decide what it > > wants. > > Yes, that's right. > > > This will be helpfull for the partial allocation scenario also. Think of the > > case when we do not change the filesize in fallocate() and expect > > applications/posix_fallocate() to do ftruncate() after fallocate() for this. > > Now if fallocate() results in a partial allocation with -ENOSPC error > > returned, applications/posix_fallocate() will not know for what length > > ftruncate() has to be called. :( > > Well, posix_fallocate() either gets all the space or it fails. If > you truncate to extend the file size after an ENOSPC, then that is > a buggy implementation. > > The same could be said for any application, or even the fallocate() > call itself if it changes the filesize without having completely > preallocated the space asked.... > > > Hence it may be a good idea to give user the flexibility if it wants to > > atomically change the file size with preallocation or not. But, with more > > flexibility there comes inconsistency in behavior, which is worth > > considering. 
> > We've got different modes to specify different behaviour. That's > what the mode field was put there for in the first place - the > interface is *designed* to support different preallocation > behaviours.... > > > > > 2) For FA_UNALLOCATE mode, should the file system allow unallocation of > > > > normal (non-preallocated) blocks (blocks allocated via regular > > > > write/truncate operations) also (i.e. work as punch()) ? > > > > > > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what > > > i did for FA_UNALLOCATE as well. > > > > Ok. But, some people may not expect/like this. I think, we can keep it on > > the backburner for a while, till other issues are sorted out. > > How can it be a "backburner" issue when it defines the > implementation? I've already implemented some thing in XFS that > sort of does what I think that the interface is supposed to do, but > I need that interface to be nailed down before proceeding any > further. > > All I'm really interested in right now is that the fallocate > _interface_ can be used as a *complete replacement* for the > pre-existing XFS-specific ioctls that are already used by > applications. What ext4 can or can't do right now is irrelevant to > this discussion - the interface definition needs to take priority > over implementation.... Would you like to write up an interface definition description (likely man page) and post it for review, possibly with a mention of apps using it today ? One reason for introducing the mode parameter was to allow the interface to evolve incrementally as more options / semantic questions are proposed, so that we don't have to make all the decisions right now. So it would be good to start with a *minimal* definition, even just one mode. The rest could follow as subsequent patches, each being reviewed and debated separately. Otherwise this discussion can drag on for a long time. 
Regards Suparna > > Cheers, > > Dave, > -- > Dave Chinner > Principal Engineer > SGI Australian Software Group > - > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India From owner-xfs@oss.sgi.com Fri May 11 07:48:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 11 May 2007 07:48:33 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4BEmSfB026547 for ; Fri, 11 May 2007 07:48:29 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id F047B2C804B; Fri, 11 May 2007 07:47:38 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id B19862C8043; Fri, 11 May 2007 07:47:38 -0700 (PDT) Received: from [192.168.28.126] (outer-dhcp-126.goop.org [192.168.28.126]) by lurch.goop.org (Postfix) with ESMTP; Fri, 11 May 2007 07:47:38 -0700 (PDT) Message-ID: <4644823A.8090104@goop.org> Date: Fri, 11 May 2007 07:48:26 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Chuck Ebbert , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
References: <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070510225834.GF86004887@sgi.com> <4643A5B2.3060906@goop.org> <20070510232729.GH86004887@sgi.com> <4643AF8F.5040705@goop.org> <20070511003257.GL86004887@sgi.com> In-Reply-To: <20070511003257.GL86004887@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11399 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs David Chinner wrote: >> Yes, that does look like a good candidate. Should I try to >> before-and-after this change? >> > > Yes please! > OK, definite result. Before ba87ea699ebd9dd577bf055ebc4a98200e337542: all OK. After: truncated files. I also got a bmap of a particular truncated file, linux-clone-test-1/.hg/store/00manifest.i, diffing before with after: --rw-r--r-- 1 root root 3558208 May 11 01:16 /home/jeremy/hg/linux-clone-test-1/.hg/store/00manifest.i +-rw-r--r-- 1 root root 3541760 May 11 01:16 /home/jeremy/hg/linux-clone-test-1/.hg/store/00manifest.i 16: [6144..6271]: 18141808..18141935 2 (2413168..2413295) 128 17: [6272..6399]: 18140608..18140735 2 (2411968..2412095) 128 18: [6400..6911]: 18136464..18136975 2 (2407824..2408335) 512 - 19: [6912..6951]: 18136336..18136375 2 (2407696..2407735) 40 + 19: [6912..6919]: 18136336..18136343 2 (2407696..2407703) 8 J From owner-xfs@oss.sgi.com Fri May 11 09:52:25 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 11 May 2007 09:52:30 -0700 (PDT) Received: from gab.dneg.com (mail.dneg.com [193.203.82.196]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4BGqNfB019117 for ; Fri, 11 May 2007 09:52:25 -0700 Received: from localhost (localhost.localdomain [127.0.0.1]) by gab.dneg.com (Postfix) with ESMTP id DF7594D5F86 for ; Fri, 11 May 2007 
17:34:53 +0100 (BST)
Received: from gab.dneg.com ([127.0.0.1]) by localhost (gab.dneg.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id TpQDzJTEYiUv for ; Fri, 11 May 2007 17:34:52 +0100 (BST)
Received: from [172.16.10.24] (spinach.dneg.com [172.16.10.24]) by gab.dneg.com (Postfix) with ESMTP id E13BE4D5EFD for ; Fri, 11 May 2007 17:34:52 +0100 (BST)
Message-ID: <46449B2C.30208@dneg.com>
Date: Fri, 11 May 2007 17:34:52 +0100
From: Evan Fraser
User-Agent: Thunderbird 2.0.0.0 (X11/20070326)
MIME-Version: 1.0
To: xfs@oss.sgi.com
Subject: updatedb triggers XFS internal error
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-archive-position: 11400
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: evan@dneg.com
Precedence: bulk
X-list: xfs

Hello,

I'm having a problem with one of my Linux servers. Whenever updatedb is
run, the following errors occur in the system log.

0x0: c9 00 5a f1 3a 7f 66 be a3 c1 d4 7f e8 1d 6b c9
Filesystem "md0": XFS internal error xfs_da_do_buf(2) at line 2271 of file fs/xfs/xfs_da_btree.c. Caller 0xffffffff8817957f
Call Trace:{:xfs:xfs_da_do_buf+1513} {:xfs:xfs_da_read_buf+36} {do_lookup+83} {:xfs:xfs_dir2_put_dirent64_direct+0} {:xfs:xfs_da_read_buf+36} {:xfs:xfs_dir2_block_getdents+183} {:xfs:xfs_dir2_block_getdents+183} {:xfs:xfs_dir2_put_dirent64_direct+0} {link_path_walk+196} {:xfs:xfs_bmap_last_offset+226} {:xfs:xfs_dir2_put_dirent64_direct+0} {:xfs:xfs_dir2_getdents+222} {:xfs:xfs_readdir+84} {:xfs:linvfs_readdir+213} {filldir64+0} {filldir64+0} {vfs_readdir+154} {sys_getdents64+116} {tracesys+209}

0x0: 28 ab a5 ec 3e 42 55 1f 76 9e 01 72 72 ee bd f1
Filesystem "md0": XFS internal error xfs_da_do_buf(2) at line 2271 of file fs/xfs/xfs_da_btree.c. Caller 0xffffffff8817957f
Call Trace:{:xfs:xfs_da_do_buf+1513} {:xfs:xfs_da_read_buf+36} {find_or_create_page+30} {:xfs:xfs_da_read_buf+36} {:xfs:xfs_dir2_leaf_getdents+1107} {:xfs:xfs_dir2_leaf_getdents+1107} {:xfs:xfs_dir2_put_dirent64_direct+0} {:xfs:xfs_bmap_last_offset+226} {:xfs:xfs_dir2_put_dirent64_direct+0} {:xfs:xfs_dir2_getdents+250} {:xfs:xfs_readdir+84} {:xfs:linvfs_readdir+213} {filldir64+0} {filldir64+0} {vfs_readdir+154} {sys_getdents64+116} {tracesys+209}

It's a dual Opteron system running Fedora Core 4 and running the
fedora packaged 2.6.12-1.1456_FC4smp kernel.

The filesystem in question is on a md stripe raid running across an
Infortrend 1.4TB hardware SCSI raid.

The output from xfs_info is:

meta-data=/user_data             isize=256    agcount=32, agsize=11180624 blks
         =                       sectsz=512
data     =                       bsize=4096   blocks=357779520, imaxpct=25
         =                       sunit=16     swidth=32 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks
realtime =none                   extsz=131072 blocks=0, rtextents=0

Any help will be gratefully received!

Cheers,
Evan.
-- evan@dneg.com Linux Systems Administrator Double Negative tel: +44 (0)20 7534 4400 fax: +44 (0)20 7534 4452 77 shaftesbury avenue, w1d 5du, London From owner-xfs@oss.sgi.com Sat May 12 00:56:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 00:56:43 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4C7uVfB013039 for ; Sat, 12 May 2007 00:56:32 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA11477; Sat, 12 May 2007 17:56:25 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4C7uMAf91189703; Sat, 12 May 2007 17:56:23 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4C7uIhK87536032; Sat, 12 May 2007 17:56:18 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Sat, 12 May 2007 17:56:18 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: David Chinner , Chuck Ebbert , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070512075618.GE85884050@sgi.com> References: <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070510225834.GF86004887@sgi.com> <4643A5B2.3060906@goop.org> <20070510232729.GH86004887@sgi.com> <4643AF8F.5040705@goop.org> <20070511003257.GL86004887@sgi.com> <4644823A.8090104@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4644823A.8090104@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11401 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Fri, May 11, 2007 at 07:48:26AM -0700, Jeremy Fitzhardinge wrote: > David Chinner wrote: > >> Yes, that does look like a good candidate. Should I try to > >> before-and-after this change? > >> > > > > Yes please! > > > > OK, definite result. Before ba87ea699ebd9dd577bf055ebc4a98200e337542: > all OK. After: truncated files. > > I also got a bmap of a particular truncated file, > linux-clone-test-1/.hg/store/00manifest.i, diffing before with after: > > --rw-r--r-- 1 root root 3558208 May 11 01:16 /home/jeremy/hg/linux-clone-test-1/.hg/store/00manifest.i > +-rw-r--r-- 1 root root 3541760 May 11 01:16 /home/jeremy/hg/linux-clone-test-1/.hg/store/00manifest.i > > 16: [6144..6271]: 18141808..18141935 2 (2413168..2413295) 128 > 17: [6272..6399]: 18140608..18140735 2 (2411968..2412095) 128 > 18: [6400..6911]: 18136464..18136975 2 (2407824..2408335) 512 > - 19: [6912..6951]: 18136336..18136375 2 (2407696..2407735) 40 > + 19: [6912..6919]: 18136336..18136343 2 (2407696..2407703) 8 Ok, thanks for confirming the cause of the regression. I'll post a patch when I've got something for you to try. Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Sat May 12 01:02:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 01:02:32 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4C82NfB014463 for ; Sat, 12 May 2007 01:02:24 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id SAA11674; Sat, 12 May 2007 18:02:08 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4C824Af90572564; Sat, 12 May 2007 18:02:04 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4C81vTq91481671; Sat, 12 May 2007 18:01:57 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Sat, 12 May 2007 18:01:57 +1000 From: David Chinner To: Suparna Bhattacharya Cc: David Chinner , "Amit K. 
Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070512080157.GF85884050@sgi.com> References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070509160102.GA30745@amitarora.in.ibm.com> <20070510005926.GT85884050@sgi.com> <20070510115620.GB21400@amitarora.in.ibm.com> <20070510223950.GD86004887@sgi.com> <20070511110301.GB28425@in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070511110301.GB28425@in.ibm.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11402 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote: > On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote: > > All I'm really interested in right now is that the fallocate > > _interface_ can be used as a *complete replacement* for the > > pre-existing XFS-specific ioctls that are already used by > > applications. What ext4 can or can't do right now is irrelevant to > > this discussion - the interface definition needs to take priority > > over implementation.... > > Would you like to write up an interface definition description (likely > man page) and post it for review, possibly with a mention of apps using > it today ? Yeah, I started doing that yesterday as i figured it was the only way to cut the discussion short.... 
> One reason for introducing the mode parameter was to allow the interface to
> evolve incrementally as more options / semantic questions are proposed, so
> that we don't have to make all the decisions right now.
> So it would be good to start with a *minimal* definition, even just one mode.
> The rest could follow as subsequent patches, each being reviewed and debated
> separately. Otherwise this discussion can drag on for a long time.

The minimal definition to replace what applications use on XFS and to
support posix_fallocate is the three modes that have been mentioned so far
(FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them all
in a man page...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

From owner-xfs@oss.sgi.com Sat May 12 05:46:56 2007
Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 05:47:04 -0700 (PDT)
Received: from waste.org (waste.org [66.93.16.53]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4CCksfB023990 for ; Sat, 12 May 2007 05:46:56 -0700
Received: from waste.org (localhost [127.0.0.1]) by waste.org (8.13.8/8.13.8/Debian-3) with ESMTP id l4CCkgpd012114 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Sat, 12 May 2007 07:46:43 -0500
Received: (from oxymoron@localhost) by waste.org (8.13.8/8.13.8/Submit) id l4CCkf7o012113; Sat, 12 May 2007 07:46:41 -0500
Date: Sat, 12 May 2007 07:46:41 -0500
From: Matt Mackall
To: Jan Engelhardt
Cc: Jeremy Fitzhardinge , David Chinner , Linux Kernel Mailing List , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com
Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem?
Message-ID: <20070512124641.GZ11115@waste.org> References: <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> <20070510153832.GQ11115@waste.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.13 (2006-08-11) X-archive-position: 11403 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: mpm@selenic.com Precedence: bulk X-list: xfs On Sat, May 12, 2007 at 01:21:41PM +0200, Jan Engelhardt wrote: > > On May 10 2007 10:38, Matt Mackall wrote: > >> > >> for i in `seq 20`; do > >> hg clone -U --pull a b-$i > >> hg verify b-$i # always OK > >> umount /home > >> sleep 5 > >> mount /home > >> hg verify b-$i # often found truncated files > >> done > >> > [...] > > > >This test looks like it should consist solely of open-for-append and > >write on about 20k files in the target directory. Because of the > >--pull, no hardlinks are involved. It shouldn't be all that different > >from doing tar cf - a | tar xf - b. > > > >The files get visited in alphabetical order, so the start of the > >corruption may be telling. > > You should not assume alphabetical order. Filesystems may be free to > reorder things and return them (1) randomly like in a hash (2) by > creation time during readdir(). There is no assumption. Mercurial explicitly visits files in alphabetical order for the above commands. -- Mathematics is the supreme nostalgia of our time. 
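
Since the files are visited in alphabetical order, a name-sorted list of the files that shrank across the remount would show where the truncation starts. The following is only a minimal sketch of such a check and is not something posted in this thread: snapshot_sizes, changed_sizes and the $clone variable are hypothetical names, and it assumes paths contain no tabs or newlines.

```shell
#!/bin/sh
# Hypothetical helpers (not from the thread) for spotting files whose
# size changes across an umount/mount cycle.

# Print "<size><TAB><path>" for every regular file under $1, sorted by path.
snapshot_sizes() {
    find "$1" -type f | LC_ALL=C sort | while IFS= read -r f; do
        # wc -c may pad its output with spaces on some systems; strip them.
        printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done
}

# Compare two snapshots and print "<path>: <old> -> <new>" for each file
# that is present in both and whose size differs.
changed_sizes() {
    awk -F '\t' '
        NR == FNR { before[$2] = $1; next }
        ($2 in before) && before[$2] != $1 {
            print $2 ": " before[$2] " -> " $1
        }
    ' "$1" "$2"
}

# Intended use around the remount step of the loop quoted above
# (needs root, so it is left commented out; $clone is the clone path):
#   snapshot_sizes "$clone" > /tmp/before
#   umount /home; sleep 5; mount /home
#   snapshot_sizes "$clone" > /tmp/after
#   changed_sizes /tmp/before /tmp/after
```

Because the second snapshot is already sorted by path, the report comes out in the same order, so the first line printed is the first file affected.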
From owner-xfs@oss.sgi.com Sat May 12 06:02:15 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 06:02:22 -0700 (PDT) Received: from mailer.gwdg.de (mailer.gwdg.de [134.76.10.26]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4CD2AfB028338 for ; Sat, 12 May 2007 06:02:12 -0700 Received: from linux01.gwdg.de ([134.76.13.21]) by mailer.gwdg.de with esmtps (TLSv1:AES256-SHA:256) (Exim 4.66) (envelope-from ) id 1Hmpj7-0000iM-5t; Sat, 12 May 2007 13:25:29 +0200 Received: from linux01.gwdg.de (localhost [127.0.0.1]) by linux01.gwdg.de (8.13.3/8.13.3/SuSE Linux 0.7) with ESMTP id l4CBLgW1023744; Sat, 12 May 2007 13:21:44 +0200 Received: from localhost (jengelh@localhost) by linux01.gwdg.de (8.13.3/8.13.3/Submit) with ESMTP id l4CBLfOW023691; Sat, 12 May 2007 13:21:41 +0200 Date: Sat, 12 May 2007 13:21:41 +0200 (MEST) From: Jan Engelhardt To: Matt Mackall cc: Jeremy Fitzhardinge , David Chinner , Linux Kernel Mailing List , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? In-Reply-To: <20070510153832.GQ11115@waste.org> Message-ID: References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> <20070510153832.GQ11115@waste.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 11405 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jengelh@linux01.gwdg.de Precedence: bulk X-list: xfs On May 10 2007 10:38, Matt Mackall wrote: >> >> for i in `seq 20`; do >> hg clone -U --pull a b-$i >> hg verify b-$i # always OK >> umount /home >> sleep 5 >> mount /home >> hg verify b-$i # often found truncated files >> done >> [...] 
> >This test looks like it should consist solely of open-for-append and >write on about 20k files in the target directory. Because of the >--pull, no hardlinks are involved. It shouldn't be all that different >from doing tar cf - a | tar xf - b. > >The files get visited in alphabetical order, so the start of the >corruption may be telling. You should not assume alphabetical order. Filesystems may be free to reorder things and return them (1) randomly like in a hash (2) by creation time during readdir(). Jan -- From owner-xfs@oss.sgi.com Sat May 12 06:02:15 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 06:02:22 -0700 (PDT) Received: from mailer.gwdg.de (mailer.gwdg.de [134.76.10.26]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4CD2AfC028338 for ; Sat, 12 May 2007 06:02:15 -0700 Received: from linux01.gwdg.de ([134.76.13.21]) by mailer.gwdg.de with esmtps (TLSv1:AES256-SHA:256) (Exim 4.66) (envelope-from ) id 1Hmpkn-00010V-61; Sat, 12 May 2007 13:27:13 +0200 Received: from linux01.gwdg.de (localhost [127.0.0.1]) by linux01.gwdg.de (8.13.3/8.13.3/SuSE Linux 0.7) with ESMTP id l4CBNRWc003792; Sat, 12 May 2007 13:23:30 +0200 Received: from localhost (jengelh@localhost) by linux01.gwdg.de (8.13.3/8.13.3/Submit) with ESMTP id l4CBNRpV003734; Sat, 12 May 2007 13:23:27 +0200 Date: Sat, 12 May 2007 13:23:27 +0200 (MEST) From: Jan Engelhardt To: Jeremy Fitzhardinge cc: Chuck Ebbert , David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
In-Reply-To: <46439491.9010604@goop.org> Message-ID: References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 11404 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jengelh@linux01.gwdg.de Precedence: bulk X-list: xfs On May 10 2007 14:54, Jeremy Fitzhardinge wrote: >>>> What CPU architecture is this happening on? Not i686 with PAE by >>>> any chance? >>>> >>> Yes. Why? >> >> I have a bug report where NFS files are corrupted only with PAE clients. >> Corruption is at the end of the (newly untarred) files. Doesn't happen >> without PAE. > >Hm, suggestive, but I'm not convinced. Two differences to this situation: > > 1. Immediately after the clone ("untar"), the contents are completely > OK; it's only after a umount/mount cycle to problems appear And if you do a "sync" rather than umount/mount? > 2. There's no corruption as such; the files are just too short. And > it seems they're at a previously OK length, not some random size. 
Jan -- From owner-xfs@oss.sgi.com Sat May 12 06:52:01 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 06:52:24 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4CDpvfB012422 for ; Sat, 12 May 2007 06:51:59 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id XAA16432; Sat, 12 May 2007 23:51:51 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4CDpmAf91517894; Sat, 12 May 2007 23:51:49 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4CDpiT591658269; Sat, 12 May 2007 23:51:44 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Sat, 12 May 2007 23:51:43 +1000 From: David Chinner To: Jan Engelhardt Cc: Jeremy Fitzhardinge , Chuck Ebbert , David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070512135143.GG85884050@sgi.com>
References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.4.2.1i
X-archive-position: 11406
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: dgc@sgi.com
Precedence: bulk
X-list: xfs

On Sat, May 12, 2007 at 01:23:27PM +0200, Jan Engelhardt wrote:
>
> On May 10 2007 14:54, Jeremy Fitzhardinge wrote:
> >>>> What CPU architecture is this happening on? Not i686 with PAE by
> >>>> any chance?
> >>>>
> >>> Yes. Why?
> >>
> >> I have a bug report where NFS files are corrupted only with PAE clients.
> >> Corruption is at the end of the (newly untarred) files. Doesn't happen
> >> without PAE.
> >
> >Hm, suggestive, but I'm not convinced. Two differences to this situation:
> >
> > 1. Immediately after the clone ("untar"), the contents are completely
> >    OK; it's only after a umount/mount cycle to problems appear
>
> And if you do a "sync" rather than umount/mount?

I doubt it will matter - I don't think we are marking the inode dirty
at the right point.

The change that was at fault modifies the way we update the file size
on the inode. We added an in-memory copy of the file size to the
in-memory copy of the disk inode's file size that we already keep. We
now only update the disk inode's (in memory copy) file size on I/O
completion.

Because the generic code writes the inode out before waiting for I/O
to complete, the old file size gets written out instead of the new
one. If the write was extending the file into an existing block there
would be no delalloc transaction to redirty the inode (happens on log
I/O completion). Hence when the I/O completes and the file size gets
updated to the in-core disk inode (which is marked dirty), the Linux
inode remains clean. As a result, a sync will never flush the inode to
get the updated file size to disk.

What I don't understand is that on unmount dirty xfs inodes get
written out. Clearly this is not happening - either there's a hole
in the writeback logic (unlikely - it was unchanged) or we've missed
some case where we need to update the filesize and mark the inode
dirty.

Hmmmm - if the write was just a short append to the file, then the
block that was written to should already be mapped. Then we'll just
look up the extent by doing a BMAPI_READ lookup, set the type to
IOMAP_READ and add the block to ioend we are building.

The type IOMAP_READ determines the I/O completion behaviour - in this case
it is xfs_end_bio_read(), which fails to update the file size....

Bingo.

A patch for you to try, Jeremy. I've just started a test run on it...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/linux-2.6/xfs_aops.c |   23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c	2007-05-11 16:03:59.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c	2007-05-12 23:35:42.691464799 +1000
@@ -973,8 +973,9 @@ xfs_page_state_convert(
 	bh = head = page_buffers(page);
 	offset = page_offset(page);
-	flags = -1;
-	type = IOMAP_READ;
+	iomap_valid = 0;
+	flags = BMAPI_READ;
+	type = IOMAP_NEW;
 
 	/* TODO: cleanup count and page_dirty */
 
@@ -1004,14 +1005,14 @@ xfs_page_state_convert(
 		 *
 		 * Third case, an unmapped buffer was found, and we are
 		 * in a path where we need to write the whole page out.
-		 */
+		 */
 		if (buffer_unwritten(bh) || buffer_delay(bh) ||
 		    ((buffer_uptodate(bh) || PageUptodate(page)) &&
 		     !buffer_mapped(bh) && (unmapped || startio))) {
-			/*
+			/*
 			 * Make sure we don't use a read-only iomap
 			 */
-			if (flags == BMAPI_READ)
+			if (flags == BMAPI_READ)
 				iomap_valid = 0;
 
 			if (buffer_unwritten(bh)) {
@@ -1060,7 +1061,7 @@ xfs_page_state_convert(
 			 * That means it must already have extents allocated
 			 * underneath it. Map the extent by reading it.
 			 */
-			if (!iomap_valid || type != IOMAP_READ) {
+			if (!iomap_valid || flags != BMAPI_READ) {
 				flags = BMAPI_READ;
 				size = xfs_probe_cluster(inode, page, bh,
 						head, 1);
@@ -1071,7 +1072,15 @@ xfs_page_state_convert(
 				iomap_valid = xfs_iomap_valid(&iomap, offset);
 			}
 
-			type = IOMAP_READ;
+			/*
+			 * We set the type to IOMAP_NEW in case we are doing a
+			 * small write at EOF that is extending the file but
+			 * without needing an allocation. We need to update the
+			 * file size on I/O completion in this case so it is
+			 * the same case as having just allocated a new extent
+			 * that we are writing into for the first time.
+			 */
+			type = IOMAP_NEW;
 			if (!test_and_set_bit(BH_Lock, &bh->b_state)) {
 				ASSERT(buffer_mapped(bh));
 				if (iomap_valid)

From owner-xfs@oss.sgi.com Sat May 12 07:56:20 2007
Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 07:56:27 -0700 (PDT)
Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4CEuJfB029390 for ; Sat, 12 May 2007 07:56:20 -0700
Received: by lurch.goop.org (Postfix, from userid 525) id 4A9C62C8042; Sat, 12 May 2007 07:55:30 -0700 (PDT)
Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 252522C803B; Sat, 12 May 2007 07:55:30 -0700 (PDT)
Received: from [192.168.28.126] (outer-dhcp-126.goop.org [192.168.28.126]) by lurch.goop.org (Postfix) with ESMTP; Sat, 12 May 2007 07:55:30 -0700 (PDT)
Message-ID: <4645D594.4070801@goop.org>
Date: Sat, 12 May 2007 07:56:20 -0700
From: Jeremy Fitzhardinge
User-Agent: Thunderbird 1.5.0.10 (X11/20070302)
MIME-Version: 1.0
To: David Chinner
CC: Jan Engelhardt , Chuck Ebbert , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com
Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem?
References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070512135143.GG85884050@sgi.com>
In-Reply-To: <20070512135143.GG85884050@sgi.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-archive-position: 11407
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: jeremy@goop.org
Precedence: bulk
X-list: xfs

David Chinner wrote:
> What I don't understand is that on unmount dirty xfs inodes get
> written out.
Clearly this is not happening - either there's a hole > in the writeback logic (unlikely - it was unchanged) or we've missed > some case where we need to update the filesize and mark the inode > dirty. > > Hmmmm - if the write was just a short append to the file, then the > block that was written to should already be mapped. Then we'll just > look up the extent by doing a BMAPI_READ lookup, set the type to > IOMAP_READ and add the block to ioend we are building. > Well, that result I mailed you showed that the difference was just over 16k, and that there was a 32 block difference in the final extent length. Does that fit with this theory? > The type IOMAP_READ determines the I/O completion behaviour - in this case > it is xfs_end_bio_read(), which fails to update the file size.... > > Bingo. > > A patch for you to try, Jeremy. I've just started a test run on it... > Thanks, I'll give it a spin. Have you reproduced the bug yourself? J From owner-xfs@oss.sgi.com Sat May 12 10:49:04 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 10:49:24 -0700 (PDT) Received: from mx1.suse.de (cantor.suse.de [195.135.220.2]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4CHn2fB010718 for ; Sat, 12 May 2007 10:49:04 -0700 Received: from Relay1.suse.de (mail2.suse.de [195.135.221.8]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.suse.de (Postfix) with ESMTP id 9DA3D122E4; Sat, 12 May 2007 19:49:01 +0200 (CEST) To: David Chinner Cc: xfs-dev , xfs-oss Subject: Re: Review: Concurrent Multi-File Data Streams References: <20070511003606.GB85884050@sgi.com> From: Andi Kleen Date: 12 May 2007 20:46:19 +0200 In-Reply-To: <20070511003606.GB85884050@sgi.com> Message-ID: Lines: 21 User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 11408 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com 
X-original-sender: andi@firstfloor.org Precedence: bulk X-list: xfs David Chinner writes: > > The following patch survives XFSQA with timeouts set to minimum, > default, 500s and maximum. The patch has not had a great > deal of low memory testing, and the object cache may need a shrinker > interface to work in low memory conditions. > > Comments? It seems to be an optimization for a relatively small number of streams. When you do a large number on average you should get similar readahead benefits from round robing the streams over some AGs vs keeping it in a single AG, right? The fallback to AG 0 if nstreams>AGs seems pretty lousy. Wouldn't it be better to do the normal XFS allocation algorithm then? I think right now it will go into low space mode in this case, which might give worse results. Also centisecs is a really ugly unit whose use should be probably not propagated. -Andi From owner-xfs@oss.sgi.com Sat May 12 20:02:17 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 20:02:21 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4D32GfB025910 for ; Sat, 12 May 2007 20:02:17 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id 7BEE81806FDAD; Sat, 12 May 2007 22:02:14 -0500 (CDT) Message-ID: <46467FB5.8080301@sandeen.net> Date: Sat, 12 May 2007 22:02:13 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Evan Fraser CC: xfs@oss.sgi.com Subject: Re: updatedb triggers XFS internal error References: <46449B2C.30208@dneg.com> In-Reply-To: <46449B2C.30208@dneg.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11409 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: 
xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Evan Fraser wrote: > Hello, > I'm having a problem with one of my linux servers. whenever updatedb is > run, the following errors occur in the system log. > > 0x0: c9 00 5a f1 3a 7f 66 be a3 c1 d4 7f e8 1d 6b c9 > Filesystem "md0": XFS internal error xfs_da_do_buf(2) at line 2271 of > file fs/xfs/xfs_da_btree.c. Caller 0xffffffff8817957f This indicates bad metadata magic read from disk. You'll probably want to run xfs_repair; you can run it with -n to see what it *would* do first, to get an idea of how drastic the repair might be. Repairing 1.4T won't probably be lots of fun, but you've got corruption in there somewhere... -Eric From owner-xfs@oss.sgi.com Sat May 12 20:08:58 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 20:09:07 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4D38vfB027310 for ; Sat, 12 May 2007 20:08:58 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id 5DE311806FDAD; Sat, 12 May 2007 22:08:57 -0500 (CDT) Message-ID: <46468148.7000708@sandeen.net> Date: Sat, 12 May 2007 22:08:56 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Andi Kleen CC: David Chinner , xfs-dev , xfs-oss Subject: Re: Review: Concurrent Multi-File Data Streams References: <20070511003606.GB85884050@sgi.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11410 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Andi Kleen wrote: > Also centisecs is a really ugly unit whose use should be 
probably not propagated.
>
> -Andi

Hmm at one point I thought the preferred unit for this sort of tuneable
*was* centisecs. What's the unit du jour?

[root@neon ~]# sysctl -a |grep cent
vm.dirty_expire_centisecs = 2999
vm.dirty_writeback_centisecs = 499
fs.xfs.age_buffer_centisecs = 1500
fs.xfs.xfsbufd_centisecs = 100
fs.xfs.xfssyncd_centisecs = 3000

I think xfs was following the vm lead at one point.

-Eric

From owner-xfs@oss.sgi.com Sun May 13 14:19:38 2007
Received: with ECARTIS (v1.0.0; list xfs); Sun, 13 May 2007 14:19:40 -0700 (PDT)
Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4DLJafB015043 for ; Sun, 13 May 2007 14:19:37 -0700
Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HnLAX-0003eh-NS; Sun, 13 May 2007 21:59:53 +0100
Date: Sun, 13 May 2007 21:59:53 +0100
From: Christoph Hellwig
To: David Chinner
Cc: xfs-dev , xfs-oss
Subject: Re: Review: Concurrent Multi-File Data Streams
Message-ID: <20070513205953.GA14030@infradead.org>
References: <20070511003606.GB85884050@sgi.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=unknown-8bit
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20070511003606.GB85884050@sgi.com>
User-Agent: Mutt/1.4.2.2i
X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html
X-archive-position: 11411
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: hch@infradead.org
Precedence: bulk
X-list: xfs

I already had some comments on this when discussing it with Sam in
person, but it seems like they didn't make it to you.

First, the mru cache, while being quite nice code, is heavily
overengineered for this case. Unless there are many hundreds of
filestreams per filesystem it will be a lot faster to just have a
simple wrap-around array of linked lists.
We don't want to feed the argument that xfs has lots of useless bloated code, do we? :) All the pip != NULL checks are superflous in Linux. A regular file can never have a non-null parent inode, and a directory can only have a non-NULL parent inode in very odd corner cases involving NFS exports, but it has to be connect again once you start doing namespace modifying operations on it. There some naming confusion: xfs_mount.h forward-declares struct xfs_filestream but everything else uses struct fstrm_mnt_data. The former is very non-descriptive and the latter but ugly, I'd suggestjust putting the mru-cache replacement directly in there as xfs_filestream_cache instead of the wrapping. The xfs_zeroino changes looks good but should be a separate commit. Some comments on the actual code in xfs_filestream.c > +#ifdef DEBUG_FILESTREAMS > +#define dprint(fmt, args...) do { \ > + printk(KERN_DEBUG "%4d %s: " fmt "\n", \ > + current_pid(), __FUNCTION__, ##args); \ > +} while(0) > +#else > +#define dprint(args...) do {} while (0) > +#endif This should probably be killed entirely. > +#define GET_AG_REF(mp, ag) atomic_read(&(mp)->m_perag[ag].pagf_fstrms) > +#define INC_AG_REF(mp, ag) atomic_inc_return(&(mp)->m_perag[ag].pagf_fstrms) > +#define DEC_AG_REF(mp, ag) atomic_dec_return(&(mp)->m_perag[ag].pagf_fstrms) These should be inlines with more descriptive lower case names. > +#define XFS_PICK_USERDATA 1 > +#define XFS_PICK_LOWSPACE 2 enum. > + > +/* > + * Scan the AGs starting at startag looking for an AG that isn't in use and has > + * at least minlen blocks free. > + */ > +static int > +_xfs_filestream_pick_ag( > + xfs_mount_t *mp, > + xfs_agnumber_t startag, > + xfs_agnumber_t *agp, > + int flags, > + xfs_extlen_t minlen) > +{ > + int err, trylock, nscan; > + xfs_extlen_t delta, longest, need, free, minfree, maxfree = 0; > + xfs_agnumber_t ag, max_ag = NULLAGNUMBER; > + struct xfs_perag *pag; > + > + /* 2% of an AG's blocks must be free for it to be chosen. 
*/ > + minfree = mp->m_sb.sb_agblocks / 50; > + > + ag = startag; > + *agp = NULLAGNUMBER; > + > + /* For the first pass, don't sleep trying to init the per-AG. */ > + trylock = XFS_ALLOC_FLAG_TRYLOCK; > + > + for (nscan = 0; 1; nscan++) { > + > + //dprint("scanning AG %d[%d]", ag, GET_AG_REF(mp, ag)); Please don't leave commented-out debug code in. > + pag = mp->m_perag + ag; > + > + if (!pag->pagf_init && > + (err = xfs_alloc_pagf_init(mp, NULL, ag, trylock)) && > + !trylock) { > + dprint("xfs_alloc_pagf_init returned %d", err); > + return err; > + } if (!pag->pagf_init) { err = xfs_alloc_pagf_init(mp, NULL, ag, trylock); if (err && !trylock) return err; } > +static int > +_xfs_filestream_set_ag( > + xfs_inode_t *ip, > + xfs_inode_t *pip, > + xfs_agnumber_t ag) > +{ > + int err = 0; > + xfs_mount_t *mp; > + xfs_mru_cache_t *cache; > + fstrm_item_t *item; > + xfs_agnumber_t old_ag; > + xfs_inode_t *old_pip; > + > + /* > + * Either ip is a regular file and pip is a directory, or ip is a > + * directory and pip is NULL. > + */ We have parent information for parents as well, so this should probably be made more regular. > + ASSERT(ip && (((ip->i_d.di_mode & S_IFREG) && pip && > + (pip->i_d.di_mode & S_IFDIR)) || > + ((ip->i_d.di_mode & S_IFDIR) && !pip))); > + mp = ip->i_mount; > + cache = mp->m_filestream->fstrm_items; > + > + if ((item = (fstrm_item_t*)xfs_mru_cache_lookup(cache, ip->i_ino))) { Assignment and conditional on separate lines please (also elsewhere in the code), and no needless casts from void * either (also in various places). > +void > +xfs_filestream_init(void) > +{ > + item_zone = kmem_zone_init(sizeof(fstrm_item_t), "fstrm_item"); > + ASSERT(item_zone); Please check for errors instead and propagate them. > +/* > + * xfs_filestream_uninit() is called at xfs termination time to destroy the > + * memory zone that was used for filestream data structure allocation.
> + */ > +void > +xfs_filestream_uninit(void) > +{ > + if (item_zone) { > + kmem_zone_destroy(item_zone); > + item_zone = NULL; > + } > +} No need for the NULL check or setting it to NULL. > + if (!(md = (fstrm_mnt_data_t*)kmem_zalloc(sizeof(*md), KM_SLEEP))) Please use KM_MAYFAIL for all new code outside of transactions. > + ASSERT(ip->i_d.di_mode & (S_IFREG | S_IFDIR)); > + if (!(ip->i_d.di_mode & (S_IFREG | S_IFDIR))) > + return NULLAGNUMBER; Either the assert or the if clause checking for it, please. Now comes the worst part, the new allocator function. If we look at a diff between xfs_bmap_filestreams and xfs_bmap_btalloc we see that it's a pretty bad cut & paste job: --- btalloc 2007-05-12 12:43:03.000000000 +0200 +++ fsalloc 2007-05-12 12:42:28.000000000 +0200 @@ -1,44 +1,54 @@ > + rt = (ap->ip->i_d.di_flags & XFS_DIFLAG_REALTIME) && ap->userdata; xfs_bmap_alloc() never calls xfs_bmap_filestreams if this is true so all code guarded by if (rt) is dead. > - if (unlikely(align)) { > + if (align) { align should have the same likelihood for both. > - if (nullfb) > - ap->rval = XFS_INO_TO_FSB(mp, ap->ip->i_ino); > - else > + if (nullfb) { > + ag = xfs_filestream_get_ag(ap->ip); > + ag = (ag != NULLAGNUMBER) ? ag : 0; > + ap->rval = (ap->userdata) ? XFS_AGB_TO_FSB(mp, ag, 0) : > + XFS_INO_TO_FSB(mp, ap->ip->i_ino); > + } else { > ap->rval = ap->firstblock; > + } Some real changes :) But this could be just a third if case for the filestream case. > - args.firstblock = ap->firstblock; Backout of parts of rev 1.349 blen = 0; if (nullfb) { - args.type = XFS_ALLOCTYPE_START_BNO; + /* _vextent doesn't pick an AG */ + args.type = XFS_ALLOCTYPE_NEAR_BNO; /* > @@ -117,18 +167,19 @@ > */ > else > args.minlen = ap->alen; > + ap->rval = args.fsbno = XFS_AGB_TO_FSB(mp, ag, 0); > } else if (ap->low) { > - args.type = XFS_ALLOCTYPE_START_BNO; > + args.type = XFS_ALLOCTYPE_FIRST_AG; > args.total = args.minlen = ap->minlen; Why is this different?
} > - if (unlikely(ap->userdata && ap->ip->i_d.di_extsize && > - (ap->ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE))) { > + if (ap->userdata && ap->ip->i_d.di_extsize && > + (ap->ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE)) { args.prod = ap->ip->i_d.di_extsize; > - if ((args.mod = (xfs_extlen_t)do_mod(ap->off, args.prod))) > + if ((args.mod = (xfs_extlen_t)(do_mod(ap->off, args.prod)))) Gratuitous difference. * is >= the stripe unit and the allocation offset is * at the end of file. */ > + atype = args.type; I don't quite understand why we'd need this in one, but not the other. if (!ap->low && ap->aeof) { if (!ap->off) { args.alignment = mp->m_dalign; > - * First try an exact bno allocation. > + * First try an exact bno allocation. > * If it fails then do a near or start bno > * allocation with alignment turned on. > - */ > + */ Backout of whitespace adjustments. > - XFS_TRANS_MOD_DQUOT_BYINO(mp, ap->tp, ap->ip, > - ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT : > + if (XFS_IS_QUOTA_ON(mp) && > + ap->ip->i_ino != mp->m_sb.sb_uquotino && > + ap->ip->i_ino != mp->m_sb.sb_gquotino) { > + XFS_TRANS_MOD_DQUOT_BYINO(mp, ap->tp, ap->ip, > + ap->wasdel ? > + XFS_TRANS_DQ_DELBCOUNT : > XFS_TRANS_DQ_BCOUNT, > - (long) args.len); > + (long)args.len); > + } Gratuitous differences, but okay because there won't be file streams for quota inodes. Based on that, my conclusion is that xfs_bmap_filestreams and xfs_bmap_btalloc should be merged to avoid further maintenance overhead.
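The review comments above ask for the GET_AG_REF/INC_AG_REF/DEC_AG_REF macros to become inline functions with descriptive lower-case names, and for the XFS_PICK_* defines to become an enum. A minimal userspace sketch of that shape follows. It is an illustration only, not code from the patch or from the reviewer: struct perag_sketch, the filestream_* helper names and the enum name are all hypothetical, and C11 <stdatomic.h> stands in for the kernel's atomic_t API (atomic_fetch_add returns the old value, so 1 is added to mimic atomic_inc_return).

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the per-AG structure; only the counter matters. */
struct perag_sketch {
	atomic_int pagf_fstrms;	/* number of filestreams referencing this AG */
};

/* Picker flags as an enum instead of two bare #defines (names assumed). */
enum filestream_pick_flags {
	FILESTREAM_PICK_USERDATA = 1,
	FILESTREAM_PICK_LOWSPACE = 2,
};

/* Read the current reference count without changing it. */
static inline int filestream_peek_ag(struct perag_sketch *pag)
{
	return atomic_load(&pag->pagf_fstrms);
}

/* Take a reference and return the new count. */
static inline int filestream_get_ag_ref(struct perag_sketch *pag)
{
	return atomic_fetch_add(&pag->pagf_fstrms, 1) + 1;
}

/* Drop a reference and return the new count. */
static inline int filestream_put_ag_ref(struct perag_sketch *pag)
{
	return atomic_fetch_sub(&pag->pagf_fstrms, 1) - 1;
}
```

Compared with the macro versions, typed helpers like these let the compiler check the argument type and give each operation a name that says whether it takes or drops a reference.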
From owner-xfs@oss.sgi.com Sun May 13 22:32:04 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 13 May 2007 22:32:08 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4E5W1fB012122 for ; Sun, 13 May 2007 22:32:03 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA23946; Mon, 14 May 2007 15:31:44 +1000 Date: Mon, 14 May 2007 15:35:31 +1000 From: Timothy Shimmin To: Eric Sandeen , Andi Kleen cc: David Chinner , xfs-dev , xfs-oss Subject: Re: Review: Concurrent Multi-File Data Streams - centisecs Message-ID: In-Reply-To: <46468148.7000708@sandeen.net> References: <20070511003606.GB85884050@sgi.com> <46468148.7000708@sandeen.net> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11412 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Yeah, I thought we were told off in the past for not using centisecs and so Nathan changed stuff so it was in centisecs. Looking in logs and bug db.... ---------------- xfs_sysctl.c revision 1.28 date: 2004/05/14 03:13:52; author: nathans; state: Exp; lines: +7 -7 modid: xfs-linux:xfs-kern:171825a Export/import tunable time intervals as centisecs not jiffies. Description: Not sure what we were smoking when we made these interfaces converse with userspace in terms of jiffies, I guess it was just more expedient at the time. Time to clean this up so regular humans know what time intervals they're asking for, and so that the interface works consistently for different HZ values. 
The kernel pdflush daemon in 2.6 uses centisecs, so we may as well make our units consistent with that (since that guy plays a big role in flushing our data & it is likely to be tuned along with any XFS-specific parameter changes). cheers. On Tue, May 11, 2004 at 03:40:57PM -0700, Andrew Morton wrote: > bart@samwel.tk wrote: > > > > The laptop mode control script incorrectly guesses XFS_HZ=1000. > > aargh. XFS is broken. It shouldn't be exposing jiffy-based tunables into > /proc, or `mount -o remount' or whatever. > > It would be much better to rework XFS so that these user-visible tunables > are in units of milliseconds, centiseconds or whatever. > > Is this possible, please? > > If so, please make the /proc filename reflect the tunable's units: > > /proc/sys/fs/xfs/lm_sync_centisecs > /proc/sys/fs/xfs/age_buffer_centisecs > etc. > > thanks. ---------------------------- --Tim --On 12 May 2007 10:08:56 PM -0500 Eric Sandeen wrote: > Andi Kleen wrote: > >> Also centisecs is a really ugly unit whose use should be probably not propagated. >> >> -Andi > > Hmm at one point I thought the preferred unit for this sort of tuneable *was* centisecs. What's > the unit du jour? > > [root@neon ~]# sysctl -a |grep cent > vm.dirty_expire_centisecs = 2999 > vm.dirty_writeback_centisecs = 499 > fs.xfs.age_buffer_centisecs = 1500 > fs.xfs.xfsbufd_centisecs = 100 > fs.xfs.xfssyncd_centisecs = 3000 > > I think xfs was following the vm lead at one point. 
> > -Eric From owner-xfs@oss.sgi.com Mon May 14 01:17:06 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 01:17:13 -0700 (PDT) Received: from gab.dneg.com (mail.dneg.com [193.203.82.196]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4E8H4fB026858 for ; Mon, 14 May 2007 01:17:06 -0700 Received: from localhost (localhost.localdomain [127.0.0.1]) by gab.dneg.com (Postfix) with ESMTP id 94AD64D6074; Mon, 14 May 2007 09:17:03 +0100 (BST) Received: from gab.dneg.com ([127.0.0.1]) by localhost (gab.dneg.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id YlGqSKEZ+p2z; Mon, 14 May 2007 09:17:02 +0100 (BST) Received: from [172.16.10.24] (spinach.dneg.com [172.16.10.24]) by gab.dneg.com (Postfix) with ESMTP id 0F4A44D5F20; Mon, 14 May 2007 09:17:02 +0100 (BST) Message-ID: <46481AFD.8090109@dneg.com> Date: Mon, 14 May 2007 09:17:01 +0100 From: Evan Fraser User-Agent: Thunderbird 2.0.0.0 (X11/20070326) MIME-Version: 1.0 To: Eric Sandeen CC: xfs@oss.sgi.com Subject: Re: updatedb triggers XFS internal error References: <46449B2C.30208@dneg.com> <46467FB5.8080301@sandeen.net> In-Reply-To: <46467FB5.8080301@sandeen.net> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11413 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: evan@dneg.com Precedence: bulk X-list: xfs Thanks for your help. Cheers, Evan. Eric Sandeen wrote: > Evan Fraser wrote: >> Hello, >> I'm having a problem with one of my linux servers. whenever updatedb >> is run, the following errors occur in the system log. >> >> 0x0: c9 00 5a f1 3a 7f 66 be a3 c1 d4 7f e8 1d 6b c9 >> Filesystem "md0": XFS internal error xfs_da_do_buf(2) at line 2271 of >> file fs/xfs/xfs_da_btree.c. Caller 0xffffffff8817957f > > This indicates bad metadata magic read from disk. 
You'll probably > want to run xfs_repair; you can run it with -n to see what it *would* > do first, to get an idea of how drastic the repair might be. > Repairing 1.4T won't probably be lots of fun, but you've got > corruption in there somewhere... > > -Eric > -- evan@dneg.com Linux Systems Administrator Double Negative tel: +44 (0)20 7534 4400 fax: +44 (0)20 7534 4452 77 shaftesbury avenue, w1d 5du, London From owner-xfs@oss.sgi.com Mon May 14 06:29:41 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 06:29:46 -0700 (PDT) Received: from e34.co.us.ibm.com (e34.co.us.ibm.com [32.97.110.152]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EDTdfB021279 for ; Mon, 14 May 2007 06:29:41 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e34.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4EDTZQv003673 for ; Mon, 14 May 2007 09:29:35 -0400 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4EDTZNT268006 for ; Mon, 14 May 2007 07:29:35 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4EDTYRa015478 for ; Mon, 14 May 2007 07:29:35 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4EDTX0M014657; Mon, 14 May 2007 07:29:34 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 0A54F29EBD3; Mon, 14 May 2007 18:59:28 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4EDTRVE003530; Mon, 14 May 2007 18:59:27 +0530 Date: Mon, 14 May 2007 18:59:26 +0530 From: "Amit K. 
Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 0/5][TAKE2] fallocate system call Message-ID: <20070514132926.GA30768@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070426175056.GA25321@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11414 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This is the new set of patches which takes care of the review comments received from the community (mainly from Andrew). Description: ----------- fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. Each file system implementation that wants to use this feature will need to support an inode operation called fallocate. Applications can use this feature to avoid fragmentation to a certain level and thus get faster access speed. With preallocation, applications also get a guarantee of space for particular file(s) - even if later the system becomes full. Currently, glibc provides an interface called posix_fallocate() which can be used for a similar purpose.
Though this has the advantage of working on all file systems, it is quite slow (since it writes zeroes to each block that has to be preallocated). Without a doubt, file systems can do this more efficiently within the kernel, by implementing the proposed fallocate() system call. It is expected that posix_fallocate() will be modified to call this new system call first and, in case the kernel/filesystem does not implement it, fall back to the current implementation of writing zeroes to the new blocks. Interface: --------- The proposed system call's layout is: asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) fd: The descriptor of the open file. mode*: This specifies the behavior of the system call. Currently the system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE. FA_ALLOCATE: Applications can use this mode to preallocate blocks to a given file (specified by fd). This mode changes the file size if the preallocation is done beyond the EOF. It also updates the ctime/mtime in the inode of the corresponding file, marking a successful allocation. FA_DEALLOCATE: This mode can be used by applications to deallocate the previously preallocated blocks. This also may change the file size and the ctime/mtime. * New modes might get added in the future. One such new mode which is already under discussion is FA_PREALLOCATE, which when used will preallocate space but will not change the filesize and [cm]time. Since the semantics of this new mode are not yet clear and agreed upon, this patchset does not implement it currently. offset: This is the offset in bytes, from where the preallocation should start. len: This is the number of bytes requested for preallocation (from offset). sys_fallocate() on s390: ----------------------- The s390 ABI makes it problematic to implement sys_fallocate() with the proposed order of arguments. Martin Schwidefsky has suggested a patch to solve this problem which makes use of a wrapper in the kernel.
This will require special handling of this system call on s390 in glibc as well. But this seems to be the best solution so far. Known Problem: ------------- Mmapped writes into uninitialized extents are a known problem with the current ext4 patches. Like XFS, ext4 may need to implement ->page_mkwrite() to solve this. See: Since there is talk of ->fault() replacing ->page_mkwrite() and also with a generic block_page_mkwrite() implementation already posted, we can implement this some time later. See: ToDos: ----- 1> Implementation on other architectures (other than i386, x86_64, ppc64 and s390(x)). David Chinner has already posted a patch for ia64. 2> A generic file system operation to handle fallocate (generic_fallocate), for filesystems that do _not_ have the fallocate inode operation implemented. 3> Changes to glibc, a) to support fallocate() system call b) to make posix_fallocate() and posix_fallocate64() call fallocate() Changelog: --------- Each post will have an individual changelog for the particular patch. The following posts with patches follow: Patch 1/5 : fallocate() implementation on i86, x86_64 and powerpc Patch 2/5 : fallocate() on s390 Patch 3/5 : ext4: Extent overlap bugfix Patch 4/5 : ext4: fallocate support in ext4 Patch 5/5 : ext4: write support for preallocated blocks -- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 14 07:01:00 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 07:01:07 -0700 (PDT) Received: from atrey.karlin.mff.cuni.cz (atrey.karlin.mff.cuni.cz [195.113.31.123]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EE0wfB028645 for ; Mon, 14 May 2007 07:01:00 -0700 Received: by atrey.karlin.mff.cuni.cz (Postfix, from userid 4043) id 1130BC7D2C; Mon, 14 May 2007 15:34:46 +0200 (CEST) Date: Mon, 14 May 2007 15:34:46 +0200 From: Jan Kara To: Andrew Morton Cc: Andreas Dilger , "Amit K.
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070514133445.GA28875@atrey.karlin.mff.cuni.cz> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070507135825.f8545a65.akpm@linux-foundation.org> User-Agent: Mutt/1.5.9i X-archive-position: 11415 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jack@suse.cz Precedence: bulk X-list: xfs > On Mon, 7 May 2007 05:37:54 -0600 > > Does the proposed implementation handle quotas correctly, btw? Has that > been tested? It seems to handle quotas fine - the block allocation itself does not differ from the usual case, just the extents in the tree are marked as uninitialized... The only question is whether DQUOT_PREALLOC_BLOCK() shouldn't be called instead of DQUOT_ALLOC_BLOCK(). Then fallocate() won't be able to allocate anything after the softlimit has been reached, which makes some sense, but the current behavior is probably less surprising.
Honza -- Jan Kara SuSE CR Labs From owner-xfs@oss.sgi.com Mon May 14 07:45:21 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 07:45:26 -0700 (PDT) Received: from e5.ny.us.ibm.com (e5.ny.us.ibm.com [32.97.182.145]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EEjJfB010035 for ; Mon, 14 May 2007 07:45:21 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e5.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4EEjIg8009223 for ; Mon, 14 May 2007 10:45:18 -0400 Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4EEjIje554160 for ; Mon, 14 May 2007 10:45:18 -0400 Received: from d01av04.pok.ibm.com (loopback [127.0.0.1]) by d01av04.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4EEjHM5029882 for ; Mon, 14 May 2007 10:45:18 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av04.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4EEjF8e029685; Mon, 14 May 2007 10:45:16 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id DC29929EBD3; Mon, 14 May 2007 20:15:24 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4EEjOuJ007049; Mon, 14 May 2007 20:15:24 +0530 Date: Mon, 14 May 2007 20:15:24 +0530 From: "Amit K. 
Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc Message-ID: <20070514144524.GA31748@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070514142820.GA31468@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070514142820.GA31468@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11416 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch implements sys_fallocate() and adds support on i386, x86_64 and powerpc platforms. Changelog: --------- The following changes were made to the previous version: 1) Added description before sys_fallocate() definition. 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to, posix_fallocate should return EINVAL for len <= 0.) 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE. 4) Do not return ENODEV for dirs (let individual file systems decide if they want to support preallocation to directories or not). 5) Check for wrap through zero. 6) Update c/mtime if fallocate() succeeds.
7) Added mode descriptions in fs.h 8) Added variable names to function definition (fallocate inode op) Here is the new patch: Signed-off-by: Amit Arora --- arch/i386/kernel/syscall_table.S | 1 arch/powerpc/kernel/sys_ppc32.c | 7 +++ arch/x86_64/kernel/functionlist | 1 fs/open.c | 89 +++++++++++++++++++++++++++++++++++++++ include/asm-i386/unistd.h | 3 - include/asm-powerpc/systbl.h | 1 include/asm-powerpc/unistd.h | 3 - include/asm-x86_64/unistd.h | 4 + include/linux/fs.h | 13 +++++ include/linux/syscalls.h | 1 10 files changed, 120 insertions(+), 3 deletions(-) Index: linux-2.6.21/arch/i386/kernel/syscall_table.S =================================================================== --- linux-2.6.21.orig/arch/i386/kernel/syscall_table.S +++ linux-2.6.21/arch/i386/kernel/syscall_table.S @@ -319,3 +319,4 @@ ENTRY(sys_call_table) .long sys_move_pages .long sys_getcpu .long sys_epoll_pwait + .long sys_fallocate /* 320 */ Index: linux-2.6.21/arch/x86_64/kernel/functionlist =================================================================== --- linux-2.6.21.orig/arch/x86_64/kernel/functionlist +++ linux-2.6.21/arch/x86_64/kernel/functionlist @@ -931,6 +931,7 @@ *(.text.sys_getitimer) *(.text.sys_getgroups) *(.text.sys_ftruncate) +*(.text.sys_fallocate) *(.text.sysfs_lookup) *(.text.sys_exit_group) *(.text.stub_fork) Index: linux-2.6.21/fs/open.c =================================================================== --- linux-2.6.21.orig/fs/open.c +++ linux-2.6.21/fs/open.c @@ -351,6 +351,95 @@ asmlinkage long sys_ftruncate64(unsigned #endif /* + * sys_fallocate - preallocate blocks or free preallocated blocks + * @fd: the file descriptor + * @mode: mode specifies if fallocate should preallocate blocks OR free + * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and + * FA_DEALLOCATE modes are supported. + * @offset: The offset within file, from where (un)allocation is being + * requested. It should not have a negative value. 
+ * @len: The amount (in bytes) of space to be (un)allocated, from the offset. + * + * This system call, depending on the mode, preallocates or unallocates blocks + * for a file. The range of blocks depends on the value of offset and len + * arguments provided by the user/application. For FA_ALLOCATE mode, if this + * system call succeeds, subsequent writes to the file in the given range + * (specified by offset & len) should not fail - even if the file system + * later becomes full. Hence the preallocation done is persistent (valid + * even after reopen of the file and remount/reboot). + * + * Note: Incase the file system does not support preallocation, + * posix_fallocate() should fall back to the library implementation (i.e. + * allocating zero-filled new blocks to the file). + * + * Return Values + * 0 : On SUCCESS a value of zero is returned. + * error : On Failure, an error code will be returned. + * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate() + * fall back on library implementation of fallocate. + * + * Generic fallocate to be added for file systems that do not + * support fallocate it. + */ +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (offset < 0 || len <= 0) + goto out; + + /* Return error if mode is not supported */ + ret = -EOPNOTSUPP; + if (mode != FA_ALLOCATE && mode !=FA_DEALLOCATE) + goto out; + + ret = -EBADF; + file = fget(fd); + if (!file) + goto out; + if (!(file->f_mode & FMODE_WRITE)) + goto out_fput; + + inode = file->f_path.dentry->d_inode; + + ret = -ESPIPE; + if (S_ISFIFO(inode->i_mode)) + goto out_fput; + + ret = -ENODEV; + /* + * Let individual file system decide if it supports preallocation + * for directories or not. 
+ */ + if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode)) + goto out_fput; + + ret = -EFBIG; + /* Check for wrap through zero too */ + if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0)) + goto out_fput; + + if (inode->i_op && inode->i_op->fallocate) + ret = inode->i_op->fallocate(inode, mode, offset, len); + else + ret = -ENOSYS; + + /* + * Update [cm]time. + * Partial allocation will not result in the time stamp changes, + * since ->fallocate will return error (say, -ENOSPC) in this case. + */ + if (!ret) + file_update_time(file); +out_fput: + fput(file); +out: + return ret; +} + +/* * access() needs to use the real uid/gid, not the effective uid/gid. * We do this by temporarily clearing all FS-related capabilities and * switching the fsuid/fsgid around to the real ones. Index: linux-2.6.21/include/asm-i386/unistd.h =================================================================== --- linux-2.6.21.orig/include/asm-i386/unistd.h +++ linux-2.6.21/include/asm-i386/unistd.h @@ -325,10 +325,11 @@ #define __NR_move_pages 317 #define __NR_getcpu 318 #define __NR_epoll_pwait 319 +#define __NR_fallocate 320 #ifdef __KERNEL__ -#define NR_syscalls 320 +#define NR_syscalls 321 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.21/include/asm-powerpc/systbl.h =================================================================== --- linux-2.6.21.orig/include/asm-powerpc/systbl.h +++ linux-2.6.21/include/asm-powerpc/systbl.h @@ -307,3 +307,4 @@ COMPAT_SYS_SPU(set_robust_list) COMPAT_SYS_SPU(move_pages) SYSCALL_SPU(getcpu) COMPAT_SYS(epoll_pwait) +COMPAT_SYS(fallocate) Index: linux-2.6.21/include/asm-powerpc/unistd.h =================================================================== --- linux-2.6.21.orig/include/asm-powerpc/unistd.h +++ linux-2.6.21/include/asm-powerpc/unistd.h @@ -326,10 +326,11 @@ #define __NR_move_pages 301 #define __NR_getcpu 302 #define __NR_epoll_pwait 303 +#define __NR_fallocate 304 #ifdef 
__KERNEL__ -#define __NR_syscalls 304 +#define __NR_syscalls 305 #define __NR__exit __NR_exit #define NR_syscalls __NR_syscalls Index: linux-2.6.21/include/asm-x86_64/unistd.h =================================================================== --- linux-2.6.21.orig/include/asm-x86_64/unistd.h +++ linux-2.6.21/include/asm-x86_64/unistd.h @@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync __SYSCALL(__NR_vmsplice, sys_vmsplice) #define __NR_move_pages 279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_fallocate 280 +__SYSCALL(__NR_fallocate, sys_fallocate) -#define __NR_syscall_max __NR_move_pages +#define __NR_syscall_max __NR_fallocate #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.21/include/linux/fs.h =================================================================== --- linux-2.6.21.orig/include/linux/fs.h +++ linux-2.6.21/include/linux/fs.h @@ -264,6 +264,17 @@ extern int dir_notify_enable; #define SYNC_FILE_RANGE_WRITE 2 #define SYNC_FILE_RANGE_WAIT_AFTER 4 +/* + * sys_fallocate modes + * Currently sys_fallocate supports two modes: + * FA_ALLOCATE : This is the preallocate mode, using which an application/user + * may request (pre)allocation of blocks. + * FA_DEALLOCATE: This is the deallocate mode, which can be used to free + * the preallocated blocks. 
+ */ +#define FA_ALLOCATE 0x1 +#define FA_DEALLOCATE 0x2 + #ifdef __KERNEL__ #include @@ -1125,6 +1136,8 @@ struct inode_operations { ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); void (*truncate_range)(struct inode *, loff_t, loff_t); + long (*fallocate)(struct inode *inode, int mode, loff_t offset, + loff_t len); }; struct seq_file; Index: linux-2.6.21/include/linux/syscalls.h =================================================================== --- linux-2.6.21.orig/include/linux/syscalls.h +++ linux-2.6.21/include/linux/syscalls.h @@ -602,6 +602,7 @@ asmlinkage long sys_get_robust_list(int asmlinkage long sys_set_robust_list(struct robust_list_head __user *head, size_t len); asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache); +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len); int kernel_execve(const char *filename, char *const argv[], char *const envp[]); Index: linux-2.6.21/arch/powerpc/kernel/sys_ppc32.c =================================================================== --- linux-2.6.21.orig/arch/powerpc/kernel/sys_ppc32.c +++ linux-2.6.21/arch/powerpc/kernel/sys_ppc32.c @@ -777,6 +777,13 @@ asmlinkage int compat_sys_truncate64(con return sys_truncate(path, (high << 32) | low); } +asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo, + u32 lenhi, u32 lenlo) +{ + return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo, + ((loff_t)lenhi << 32) | lenlo); +} + asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high, unsigned long low) { From owner-xfs@oss.sgi.com Mon May 14 07:48:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 07:48:37 -0700 (PDT) Received: from e3.ny.us.ibm.com (e3.ny.us.ibm.com [32.97.182.143]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EEmQfB010838 for ; Mon, 14 May 2007 07:48:28 -0700 Received: from 
d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e3.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4EDkcKw025338 for ; Mon, 14 May 2007 09:46:38 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4EEmQ4f517452 for ; Mon, 14 May 2007 10:48:26 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4EEmPPX005575 for ; Mon, 14 May 2007 10:48:25 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4EEmOEJ005493; Mon, 14 May 2007 10:48:24 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 0DF5529EBD3; Mon, 14 May 2007 20:18:34 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4EEmY3x008439; Mon, 14 May 2007 20:18:34 +0530 Date: Mon, 14 May 2007 20:18:34 +0530 From: "Amit K. 
Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 2/5][TAKE2] fallocate() on s390 Message-ID: <20070514144833.GB31748@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070514142820.GA31468@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070514142820.GA31468@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11417 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This is the patch suggested by Martin Schwidefsky. Here are the comments and patch from him. ------------- From: Martin Schwidefsky This patch implements support of fallocate system call on s390(x) platform. A wrapper is added to address the issue which s390 ABI has with the arguments of this system call. 
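For readers unfamiliar with the problem the wrapper solves: on a 31/32-bit ABI a 64-bit loff_t reaches the kernel as two 32-bit halves, and they have to be joined again before sys_fallocate() can be called. A minimal user-space sketch of that join follows. The names join_halves() and split_halves() are hypothetical and exist only for illustration; they are not part of this patch or of the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustration only): rebuild a 64-bit offset or
 * length from the high and low 32-bit words passed by a 32-bit caller.
 */
static int64_t join_halves(uint32_t high, uint32_t low)
{
	return (int64_t)(((uint64_t)high << 32) | low);
}

/*
 * Hypothetical inverse: what a 32-bit user-space stub would do before
 * loading the two words into adjacent argument registers.
 */
static void split_halves(int64_t value, uint32_t *high, uint32_t *low)
{
	assert(high != NULL && low != NULL);
	*high = (uint32_t)((uint64_t)value >> 32);
	*low = (uint32_t)((uint64_t)value & 0xffffffffu);
}
```

The shift-and-or form avoids any dependence on byte order; a union of one u64 and two u32 members gives the same result only when the "high" member comes first on a big-endian machine such as s390.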
Signed-off-by: Martin Schwidefsky --- arch/s390/kernel/compat_wrapper.S | 10 ++++++++++ arch/s390/kernel/sys_s390.c | 29 +++++++++++++++++++++++++++++ arch/s390/kernel/syscalls.S | 1 + include/asm-s390/unistd.h | 3 ++- 4 files changed, 42 insertions(+), 1 deletion(-) Index: linux-2.6.21/arch/s390/kernel/compat_wrapper.S =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/compat_wrapper.S +++ linux-2.6.21/arch/s390/kernel/compat_wrapper.S @@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper: llgtr %r2,%r2 # char * llgtr %r3,%r3 # struct compat_timeval * jg compat_sys_utimes + + .globl sys_fallocate_wrapper +sys_fallocate_wrapper: + lgfr %r2,%r2 # int + lgfr %r3,%r3 # int + sllg %r4,%r4,32 # get high word of 64bit loff_t + lr %r4,%r5 # get low word of 64bit loff_t + sllg %r5,%r6,32 # get high word of 64bit loff_t + l %r5,164(%r15) # get low word of 64bit loff_t + jg sys_fallocate Index: linux-2.6.21/arch/s390/kernel/syscalls.S =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/syscalls.S +++ linux-2.6.21/arch/s390/kernel/syscalls.S @@ -322,3 +322,4 @@ NI_SYSCALL /* 310 sys_move_pages * SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper) SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper) SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper) +SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper) Index: linux-2.6.21/arch/s390/kernel/sys_s390.c =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/sys_s390.c +++ linux-2.6.21/arch/s390/kernel/sys_s390.c @@ -286,3 +286,32 @@ int kernel_execve(const char *filename, "d" (__arg3) : "memory"); return __svcres; } + +#ifndef CONFIG_64BIT +/* + * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last + * 64 bit argument "len" is split into the upper and lower 32 bits. 
The + * system call wrapper in the user space loads the value to %r6/%r7. + * The code in entry.S keeps the values in %r2 - %r6 where they are and + * stores %r7 to 96(%r15). But the standard C linkage requires that + * the whole 64 bit value for len is stored on the stack and doesn't + * use %r6 at all. So s390_fallocate has to convert the arguments from + * %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len + * to + * %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len + */ +asmlinkage long s390_fallocate(int fd, int mode, loff_t offset, + u32 len_high, u32 len_low) +{ + union { + u64 len; + struct { + u32 high; + u32 low; + }; + } cv; + cv.high = len_high; + cv.low = len_low; + return sys_fallocate(fd, mode, offset, cv.len); +} +#endif Index: linux-2.6.21/include/asm-s390/unistd.h =================================================================== --- linux-2.6.21.orig/include/asm-s390/unistd.h +++ linux-2.6.21/include/asm-s390/unistd.h @@ -251,8 +251,9 @@ #define __NR_getcpu 311 #define __NR_epoll_pwait 312 #define __NR_utimes 313 +#define __NR_fallocate 314 -#define NR_syscalls 314 +#define NR_syscalls 315 /* * There are some system calls that are not present on 64 bit, some From owner-xfs@oss.sgi.com Mon May 14 07:50:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 07:50:07 -0700 (PDT) Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.153]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EEo1fB011383 for ; Mon, 14 May 2007 07:50:02 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e35.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4EEo0N1016384 for ; Mon, 14 May 2007 10:50:00 -0400 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4EEo0mQ256682 for ; Mon, 14 May 2007 08:50:00 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by 
d03av02.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4EEnxeB025855 for ; Mon, 14 May 2007 08:50:00 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4EEnwDG025737; Mon, 14 May 2007 08:49:59 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 6CC0529EBD3; Mon, 14 May 2007 20:20:08 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4EEo8Ad009129; Mon, 14 May 2007 20:20:08 +0530 Date: Mon, 14 May 2007 20:20:08 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 3/5][TAKE2] ext4: Extent overlap bugfix Message-ID: <20070514145008.GC31748@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070514142820.GA31468@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070514142820.GA31468@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11418 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch adds a check for overlap of extents and cuts short the new extent to be inserted, if there is a chance of overlap. 
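The trimming logic can be sketched outside the kernel roughly as below. This is a simplified model under stated assumptions, not the code in the patch: it uses plain host-endian 32-bit integers instead of the on-disk little-endian fields, SKETCH_MAX_BLOCK is a stand-in for EXT_MAX_BLOCK, clamp_new_extent() is a hypothetical name, and the "next_start > start" guard is an addition of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_BLOCK 0xffffffffu	/* stand-in for EXT_MAX_BLOCK */

/*
 * Trim *len so that the range [start, start + *len) neither wraps past
 * the last logical block nor runs into the next allocated extent, which
 * begins at next_start (SKETCH_MAX_BLOCK means "no next extent").
 * Returns 1 if the length was shortened, 0 otherwise.
 */
static int clamp_new_extent(uint32_t start, uint32_t *len, uint32_t next_start)
{
	int trimmed = 0;

	assert(len != NULL);

	/* check for wrap through zero */
	if ((uint32_t)(start + *len) < start) {
		*len = SKETCH_MAX_BLOCK - start;
		trimmed = 1;
	}

	/* check for overlap with the extent that follows */
	if (next_start > start && start + *len > next_start) {
		*len = next_start - start;
		trimmed = 1;
	}

	return trimmed;
}
```

For example, a request for 10 blocks at logical block 100 with the next extent starting at block 105 is cut down to 5 blocks, while the same request with the next extent at block 200 is left alone.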
Changelog: --------- As suggested by Andrew, a check for wrap through zero has been added. Here is the new patch: Signed-off-by: Amit Arora --- fs/ext4/extents.c | 60 ++++++++++++++++++++++++++++++++++++++-- include/linux/ext4_fs_extents.h | 1 2 files changed, 59 insertions(+), 2 deletions(-) Index: linux-2.6.21/fs/ext4/extents.c =================================================================== --- linux-2.6.21.orig/fs/ext4/extents.c +++ linux-2.6.21/fs/ext4/extents.c @@ -1129,6 +1129,55 @@ ext4_can_extents_be_merged(struct inode } /* + * check if a portion of the "newext" extent overlaps with an + * existing extent. + * + * If there is an overlap discovered, it updates the length of the newext + * such that there will be no overlap, and then returns 1. + * If there is no overlap found, it returns 0. + */ +unsigned int ext4_ext_check_overlap(struct inode *inode, + struct ext4_extent *newext, + struct ext4_ext_path *path) +{ + unsigned long b1, b2; + unsigned int depth, len1; + unsigned int ret = 0; + + b1 = le32_to_cpu(newext->ee_block); + len1 = le16_to_cpu(newext->ee_len); + depth = ext_depth(inode); + if (!path[depth].p_ext) + goto out; + b2 = le32_to_cpu(path[depth].p_ext->ee_block); + + /* + * get the next allocated block if the extent in the path + * is before the requested block(s) + */ + if (b2 < b1) { + b2 = ext4_ext_next_allocated_block(path); + if (b2 == EXT_MAX_BLOCK) + goto out; + } + + /* check for wrap through zero */ + if (b1 + len1 < b1) { + len1 = EXT_MAX_BLOCK - b1; + newext->ee_len = cpu_to_le16(len1); + ret = 1; + } + + /* check for overlap */ + if (b1 + len1 > b2) { + newext->ee_len = cpu_to_le16(b2 - b1); + ret = 1; + } +out: + return ret; +} + +/* * ext4_ext_insert_extent: * tries to merge requsted extent into the existing extent or * inserts requested extent as new one into the tree, @@ -2032,7 +2081,15 @@ int ext4_ext_get_blocks(handle_t *handle /* allocate new block */ goal = ext4_ext_find_goal(inode, path, iblock); - allocated =
max_blocks; + + /* Check if we can really insert (iblock)::(iblock+max_blocks) extent */ + newex.ee_block = cpu_to_le32(iblock); + newex.ee_len = cpu_to_le16(max_blocks); + err = ext4_ext_check_overlap(inode, &newex, path); + if (err) + allocated = le16_to_cpu(newex.ee_len); + else + allocated = max_blocks; newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err); if (!newblock) goto out2; @@ -2040,7 +2097,6 @@ int ext4_ext_get_blocks(handle_t *handle goal, newblock, allocated); /* try to insert new extent into found leaf and return */ - newex.ee_block = cpu_to_le32(iblock); ext4_ext_store_pblock(&newex, newblock); newex.ee_len = cpu_to_le16(allocated); err = ext4_ext_insert_extent(handle, inode, path, &newex); Index: linux-2.6.21/include/linux/ext4_fs_extents.h =================================================================== --- linux-2.6.21.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.21/include/linux/ext4_fs_extents.h @@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode * extern int ext4_extent_tree_init(handle_t *, struct inode *); extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *); +extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *); extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *); extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct ext4_ext_path *); From owner-xfs@oss.sgi.com Mon May 14 07:52:25 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 07:52:29 -0700 (PDT) Received: from e4.ny.us.ibm.com (e4.ny.us.ibm.com [32.97.182.144]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EEqNfB012317 for ; Mon, 14 May 2007 07:52:25 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e4.ny.us.ibm.com 
(8.13.8/8.13.8) with ESMTP id l4EEqNxj004164 for ; Mon, 14 May 2007 10:52:23 -0400 Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4EEqNe6552736 for ; Mon, 14 May 2007 10:52:23 -0400 Received: from d01av04.pok.ibm.com (loopback [127.0.0.1]) by d01av04.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4EEqMmZ024628 for ; Mon, 14 May 2007 10:52:23 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av04.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4EEqKDR024534; Mon, 14 May 2007 10:52:20 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 2ADCE29EBD3; Mon, 14 May 2007 20:22:30 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4EEqUK8010124; Mon, 14 May 2007 20:22:30 +0530 Date: Mon, 14 May 2007 20:22:30 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 4/5][TAKE2] ext4: fallocate support in ext4 Message-ID: <20070514145230.GD31748@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070514142820.GA31468@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070514142820.GA31468@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11419 
X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch implements the ->fallocate() inode operation in ext4. With this patch, users of ext4 file systems will be able to use the fallocate() system call for persistent preallocation. The current implementation only supports preallocation for regular files with extent maps; directories and block-mapped files are not supported yet. Only FA_ALLOCATE mode is supported for now; supporting FA_DEALLOCATE mode is a "To Do" item. Changelog: --------- Here are the changes from the previous post: 1) Added more description for ext4_fallocate(). 2) Now returning EOPNOTSUPP when files are block-mapped (non-extent). 3) Moved journal_start & journal_stop inside the while loop. 4) Replaced BUG_ON with WARN_ON & ext4_error. 5) Made EXT4_BLOCK_ALIGN use the ALIGN macro internally. 6) Added variable names in the function declaration of ext4_fallocate(). 7) Converted macros that handle uninitialized extents into inline functions.
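To illustrate item 7: the helpers ext4_ext_mark_uninitialized(), ext4_ext_is_uninitialized() and ext4_ext_get_actual_len() treat the top bit of the 16-bit ee_len field as an "uninitialized" flag and the low 15 bits as the real length. The sketch below models that encoding in user space. It is an approximation under stated assumptions: it works on host-endian values and leaves out the le16 conversions, and every sketch_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_UNINIT_BIT 0x8000u	/* top bit of ee_len: extent is uninitialized */
#define SKETCH_LEN_MASK   0x7fffu	/* low 15 bits: real length in blocks */

struct sketch_extent {
	uint16_t ee_len;	/* host-endian here; little-endian on disk */
};

static void sketch_mark_uninitialized(struct sketch_extent *ex)
{
	assert(ex != NULL);
	ex->ee_len = (uint16_t)(ex->ee_len | SKETCH_UNINIT_BIT);
}

static int sketch_is_uninitialized(const struct sketch_extent *ex)
{
	return (ex->ee_len & SKETCH_UNINIT_BIT) != 0;
}

static unsigned int sketch_actual_len(const struct sketch_extent *ex)
{
	return ex->ee_len & SKETCH_LEN_MASK;
}

/*
 * Merge pattern used when two adjacent extents are combined: add the
 * real lengths, then put the flag back if the first extent carried it.
 * The caller is assumed to have checked that both extents agree on the
 * flag and that the sum still fits in 15 bits.
 */
static void sketch_merge_len(struct sketch_extent *ex,
			     const struct sketch_extent *next)
{
	int uninit = sketch_is_uninitialized(ex);

	ex->ee_len = (uint16_t)(sketch_actual_len(ex) + sketch_actual_len(next));
	if (uninit)
		sketch_mark_uninitialized(ex);
}
```

Going through the length accessor everywhere matters because a raw read of ee_len on an uninitialized extent would report a length that is too large by 0x8000 blocks.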
Here is the updated patch: Signed-off-by: Amit Arora --- fs/ext4/extents.c | 241 +++++++++++++++++++++++++++++++++------- fs/ext4/file.c | 1 include/linux/ext4_fs.h | 8 + include/linux/ext4_fs_extents.h | 12 + 4 files changed, 221 insertions(+), 41 deletions(-) Index: linux-2.6.21/fs/ext4/extents.c =================================================================== --- linux-2.6.21.orig/fs/ext4/extents.c +++ linux-2.6.21/fs/ext4/extents.c @@ -283,7 +283,7 @@ static void ext4_ext_show_path(struct in } else if (path->p_ext) { ext_debug(" %d:%d:%llu ", le32_to_cpu(path->p_ext->ee_block), - le16_to_cpu(path->p_ext->ee_len), + ext4_ext_get_actual_len(path->p_ext), ext_pblock(path->p_ext)); } else ext_debug(" []"); @@ -306,7 +306,7 @@ static void ext4_ext_show_leaf(struct in for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) { ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block), - le16_to_cpu(ex->ee_len), ext_pblock(ex)); + ext4_ext_get_actual_len(ex), ext_pblock(ex)); } ext_debug("\n"); } @@ -426,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode, ext_debug(" -> %d:%llu:%d ", le32_to_cpu(path->p_ext->ee_block), ext_pblock(path->p_ext), - le16_to_cpu(path->p_ext->ee_len)); + ext4_ext_get_actual_len(path->p_ext)); #ifdef CHECK_BINSEARCH { @@ -687,7 +687,7 @@ static int ext4_ext_split(handle_t *hand ext_debug("move %d:%llu:%d in new leaf %llu\n", le32_to_cpu(path[depth].p_ext->ee_block), ext_pblock(path[depth].p_ext), - le16_to_cpu(path[depth].p_ext->ee_len), + ext4_ext_get_actual_len(path[depth].p_ext), newblock); /*memmove(ex++, path[depth].p_ext++, sizeof(struct ext4_extent)); @@ -1107,7 +1107,19 @@ static int ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1, struct ext4_extent *ex2) { - if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) != + unsigned short ext1_ee_len, ext2_ee_len; + + /* + * Make sure that either both extents are uninitialized, or + * both are _not_. 
+ */ + if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2)) + return 0; + + ext1_ee_len = ext4_ext_get_actual_len(ex1); + ext2_ee_len = ext4_ext_get_actual_len(ex2); + + if (le32_to_cpu(ex1->ee_block) + ext1_ee_len != le32_to_cpu(ex2->ee_block)) return 0; @@ -1116,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode * as an RO_COMPAT feature, refuse to merge to extents if * this can result in the top bit of ee_len being set. */ - if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN) + if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN) return 0; #ifdef AGGRESSIVE_TEST if (le16_to_cpu(ex1->ee_len) >= 4) return 0; #endif - if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2)) + if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2)) return 1; return 0; } @@ -1145,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru unsigned int ret = 0; b1 = le32_to_cpu(newext->ee_block); - len1 = le16_to_cpu(newext->ee_len); + len1 = ext4_ext_get_actual_len(newext); depth = ext_depth(inode); if (!path[depth].p_ext) goto out; @@ -1192,8 +1204,9 @@ int ext4_ext_insert_extent(handle_t *han struct ext4_extent *nearex; /* nearest extent */ struct ext4_ext_path *npath = NULL; int depth, len, err, next; + unsigned uninitialized = 0; - BUG_ON(newext->ee_len == 0); + BUG_ON(ext4_ext_get_actual_len(newext) == 0); depth = ext_depth(inode); ex = path[depth].p_ext; BUG_ON(path[depth].p_hdr == NULL); @@ -1201,14 +1214,24 @@ int ext4_ext_insert_extent(handle_t *han /* try to insert block into found extent and return */ if (ex && ext4_can_extents_be_merged(inode, ex, newext)) { ext_debug("append %d block to %d:%d (from %llu)\n", - le16_to_cpu(newext->ee_len), + ext4_ext_get_actual_len(newext), le32_to_cpu(ex->ee_block), - le16_to_cpu(ex->ee_len), ext_pblock(ex)); + ext4_ext_get_actual_len(ex), ext_pblock(ex)); err = ext4_ext_get_access(handle, inode, path + depth); if (err) return err; - ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len) - + 
le16_to_cpu(newext->ee_len)); + + /* + * ext4_can_extents_be_merged should have checked that either + * both extents are uninitialized, or both aren't. Thus we + * need to check only one of them here. + */ + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) + + ext4_ext_get_actual_len(newext)); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); eh = path[depth].p_hdr; nearex = ex; goto merge; @@ -1264,7 +1287,7 @@ has_space: ext_debug("first extent in the leaf: %d:%llu:%d\n", le32_to_cpu(newext->ee_block), ext_pblock(newext), - le16_to_cpu(newext->ee_len)); + ext4_ext_get_actual_len(newext)); path[depth].p_ext = EXT_FIRST_EXTENT(eh); } else if (le32_to_cpu(newext->ee_block) > le32_to_cpu(nearex->ee_block)) { @@ -1277,7 +1300,7 @@ has_space: "move %d from 0x%p to 0x%p\n", le32_to_cpu(newext->ee_block), ext_pblock(newext), - le16_to_cpu(newext->ee_len), + ext4_ext_get_actual_len(newext), nearex, len, nearex + 1, nearex + 2); memmove(nearex + 2, nearex + 1, len); } @@ -1290,7 +1313,7 @@ has_space: "move %d from 0x%p to 0x%p\n", le32_to_cpu(newext->ee_block), ext_pblock(newext), - le16_to_cpu(newext->ee_len), + ext4_ext_get_actual_len(newext), nearex, len, nearex + 1, nearex + 2); memmove(nearex + 1, nearex, len); path[depth].p_ext = nearex; @@ -1309,8 +1332,13 @@ merge: if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1)) break; /* merge with next extent! 
*/ - nearex->ee_len = cpu_to_le16(le16_to_cpu(nearex->ee_len) - + le16_to_cpu(nearex[1].ee_len)); + if (ext4_ext_is_uninitialized(nearex)) + uninitialized = 1; + nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex) + + ext4_ext_get_actual_len(nearex + 1)); + if (uninitialized) + ext4_ext_mark_uninitialized(nearex); + if (nearex + 1 < EXT_LAST_EXTENT(eh)) { len = (EXT_LAST_EXTENT(eh) - nearex - 1) * sizeof(struct ext4_extent); @@ -1380,8 +1408,8 @@ int ext4_ext_walk_space(struct inode *in end = le32_to_cpu(ex->ee_block); if (block + num < end) end = block + num; - } else if (block >= - le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len)) { + } else if (block >= le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex)) { /* need to allocate space after found extent */ start = block; end = block + num; @@ -1393,7 +1421,8 @@ int ext4_ext_walk_space(struct inode *in * by found extent */ start = block; - end = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len); + end = le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex); if (block + num < end) end = block + num; exists = 1; @@ -1409,7 +1438,7 @@ int ext4_ext_walk_space(struct inode *in cbex.ec_type = EXT4_EXT_CACHE_GAP; } else { cbex.ec_block = le32_to_cpu(ex->ee_block); - cbex.ec_len = le16_to_cpu(ex->ee_len); + cbex.ec_len = ext4_ext_get_actual_len(ex); cbex.ec_start = ext_pblock(ex); cbex.ec_type = EXT4_EXT_CACHE_EXTENT; } @@ -1482,15 +1511,15 @@ ext4_ext_put_gap_in_cache(struct inode * ext_debug("cache gap(before): %lu [%lu:%lu]", (unsigned long) block, (unsigned long) le32_to_cpu(ex->ee_block), - (unsigned long) le16_to_cpu(ex->ee_len)); + (unsigned long) ext4_ext_get_actual_len(ex)); } else if (block >= le32_to_cpu(ex->ee_block) - + le16_to_cpu(ex->ee_len)) { + + ext4_ext_get_actual_len(ex)) { lblock = le32_to_cpu(ex->ee_block) - + le16_to_cpu(ex->ee_len); + + ext4_ext_get_actual_len(ex); len = ext4_ext_next_allocated_block(path); ext_debug("cache gap(after): [%lu:%lu] %lu", (unsigned long) 
le32_to_cpu(ex->ee_block), - (unsigned long) le16_to_cpu(ex->ee_len), + (unsigned long) ext4_ext_get_actual_len(ex), (unsigned long) block); BUG_ON(len == lblock); len = len - lblock; @@ -1620,12 +1649,12 @@ static int ext4_remove_blocks(handle_t * unsigned long from, unsigned long to) { struct buffer_head *bh; + unsigned short ee_len = ext4_ext_get_actual_len(ex); int i; #ifdef EXTENTS_STATS { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); - unsigned short ee_len = le16_to_cpu(ex->ee_len); spin_lock(&sbi->s_ext_stats_lock); sbi->s_ext_blocks += ee_len; sbi->s_ext_extents++; @@ -1639,12 +1668,12 @@ static int ext4_remove_blocks(handle_t * } #endif if (from >= le32_to_cpu(ex->ee_block) - && to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) { + && to == le32_to_cpu(ex->ee_block) + ee_len - 1) { /* tail removal */ unsigned long num; ext4_fsblk_t start; - num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from; - start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num; + num = le32_to_cpu(ex->ee_block) + ee_len - from; + start = ext_pblock(ex) + ee_len - num; ext_debug("free last %lu blocks starting %llu\n", num, start); for (i = 0; i < num; i++) { bh = sb_find_get_block(inode->i_sb, start + i); @@ -1652,12 +1681,12 @@ static int ext4_remove_blocks(handle_t * } ext4_free_blocks(handle, inode, start, num); } else if (from == le32_to_cpu(ex->ee_block) - && to <= le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) { + && to <= le32_to_cpu(ex->ee_block) + ee_len - 1) { printk("strange request: removal %lu-%lu from %u:%u\n", - from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len)); + from, to, le32_to_cpu(ex->ee_block), ee_len); } else { printk("strange request: removal(2) %lu-%lu from %u:%u\n", - from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len)); + from, to, le32_to_cpu(ex->ee_block), ee_len); } return 0; } @@ -1672,6 +1701,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc unsigned a, b, block, num; unsigned long ex_ee_block; 
unsigned short ex_ee_len; + unsigned uninitialized = 0; struct ext4_extent *ex; ext_debug("truncate since %lu in leaf\n", start); @@ -1686,7 +1716,9 @@ ext4_ext_rm_leaf(handle_t *handle, struc ex = EXT_LAST_EXTENT(eh); ex_ee_block = le32_to_cpu(ex->ee_block); - ex_ee_len = le16_to_cpu(ex->ee_len); + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex_ee_len = ext4_ext_get_actual_len(ex); while (ex >= EXT_FIRST_EXTENT(eh) && ex_ee_block + ex_ee_len > start) { @@ -1754,6 +1786,8 @@ ext4_ext_rm_leaf(handle_t *handle, struc ex->ee_block = cpu_to_le32(block); ex->ee_len = cpu_to_le16(num); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); err = ext4_ext_dirty(handle, inode, path + depth); if (err) @@ -1763,7 +1797,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc ext_pblock(ex)); ex--; ex_ee_block = le32_to_cpu(ex->ee_block); - ex_ee_len = le16_to_cpu(ex->ee_len); + ex_ee_len = ext4_ext_get_actual_len(ex); } if (correct_index && eh->eh_entries) @@ -2039,7 +2073,7 @@ int ext4_ext_get_blocks(handle_t *handle if (ex) { unsigned long ee_block = le32_to_cpu(ex->ee_block); ext4_fsblk_t ee_start = ext_pblock(ex); - unsigned short ee_len = le16_to_cpu(ex->ee_len); + unsigned short ee_len; /* * Allow future support for preallocated extents to be added @@ -2047,8 +2081,9 @@ int ext4_ext_get_blocks(handle_t *handle * Uninitialized extents are treated as holes, except that * we avoid (fail) allocating new blocks during a write. 
*/ - if (ee_len > EXT_MAX_LEN) + if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN) goto out2; + ee_len = ext4_ext_get_actual_len(ex); /* if found extent covers block, simply return it */ if (iblock >= ee_block && iblock < ee_block + ee_len) { newblock = iblock - ee_block + ee_start; @@ -2056,8 +2091,11 @@ int ext4_ext_get_blocks(handle_t *handle allocated = ee_len - (iblock - ee_block); ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock, ee_block, ee_len, newblock); - ext4_ext_put_in_cache(inode, ee_block, ee_len, - ee_start, EXT4_EXT_CACHE_EXTENT); + /* Do not put uninitialized extent in the cache */ + if (!ext4_ext_is_uninitialized(ex)) + ext4_ext_put_in_cache(inode, ee_block, + ee_len, ee_start, + EXT4_EXT_CACHE_EXTENT); goto out; } } @@ -2099,6 +2137,8 @@ int ext4_ext_get_blocks(handle_t *handle /* try to insert new extent into found leaf and return */ ext4_ext_store_pblock(&newex, newblock); newex.ee_len = cpu_to_le16(allocated); + if (create == EXT4_CREATE_UNINITIALIZED_EXT) /* Mark uninitialized */ + ext4_ext_mark_uninitialized(&newex); err = ext4_ext_insert_extent(handle, inode, path, &newex); if (err) goto out2; @@ -2110,8 +2150,10 @@ int ext4_ext_get_blocks(handle_t *handle newblock = ext_pblock(&newex); __set_bit(BH_New, &bh_result->b_state); - ext4_ext_put_in_cache(inode, iblock, allocated, newblock, - EXT4_EXT_CACHE_EXTENT); + /* Cache only when it is _not_ an uninitialized extent */ + if (create!=EXT4_CREATE_UNINITIALIZED_EXT) + ext4_ext_put_in_cache(inode, iblock, allocated, newblock, + EXT4_EXT_CACHE_EXTENT); out: if (allocated > max_blocks) allocated = max_blocks; @@ -2215,10 +2257,127 @@ int ext4_ext_writepage_trans_blocks(stru return needed; } +/* + * preallocate space for a file. This implements ext4's fallocate inode + * operation, which gets called from sys_fallocate system call. + * Currently only FA_ALLOCATE mode is supported on extent based files. 
+ * We may have more modes supported in future - like FA_DEALLOCATE, which + * tells fallocate to unallocate previously (pre)allocated blocks. + * For block-mapped files, posix_fallocate should fall back to the method + * of writing zeroes to the required new blocks (the same behavior which is + * expected for file systems which do not support fallocate() system call). + */ +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) +{ + handle_t *handle; + ext4_fsblk_t block, max_blocks; + ext4_fsblk_t nblocks = 0; + int ret = 0; + int ret2 = 0; + int retries = 0; + struct buffer_head map_bh; + unsigned int credits, blkbits = inode->i_blkbits; + + /* + * currently supporting (pre)allocate mode for extent-based + * files _only_ + */ + if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) + return -EOPNOTSUPP; + + /* preallocation to directories is currently not supported */ + if (S_ISDIR(inode->i_mode)) + return -ENODEV; + + block = offset >> blkbits; + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) + - block; + + /* + * credits to insert 1 extent into extent tree + buffers to be able to + * modify 1 super block, 1 block bitmap and 1 group descriptor. + */ + credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3; +retry: + while (ret >= 0 && ret < max_blocks) { + block = block + ret; + max_blocks = max_blocks - ret; + handle = ext4_journal_start(inode, credits); + if (IS_ERR(handle)) { + ret = PTR_ERR(handle); + break; + } + + ret = ext4_ext_get_blocks(handle, inode, block, + max_blocks, &map_bh, + EXT4_CREATE_UNINITIALIZED_EXT, 0); + WARN_ON(!ret); + if (!ret) { + ext4_error(inode->i_sb, "ext4_fallocate", + "ext4_ext_get_blocks returned 0! 
inode#%lu" + ", block=%llu, max_blocks=%llu", + inode->i_ino, block, max_blocks); + ret = -EIO; + ext4_mark_inode_dirty(handle, inode); + ret2 = ext4_journal_stop(handle); + break; + } + if (ret > 0) { + /* check wrap through sign-bit/zero here */ + if ((block + ret) < 0 || (block + ret) < block) { + ret = -EIO; + ext4_mark_inode_dirty(handle, inode); + ret2 = ext4_journal_stop(handle); + break; + } + if (buffer_new(&map_bh) && ((block + ret) > + (EXT4_BLOCK_ALIGN(i_size_read(inode), blkbits) + >> blkbits))) + nblocks = nblocks + ret; + } + ext4_mark_inode_dirty(handle, inode); + ret2 = ext4_journal_stop(handle); + if (ret2) + break; + } + + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) + goto retry; + + /* + * Time to update the file size. + * Update only when preallocation was requested beyond the file size. + */ + if ((offset + len) > i_size_read(inode)) { + if (ret > 0) { + /* + * if no error, we assume preallocation succeeded + * completely + */ + mutex_lock(&inode->i_mutex); + i_size_write(inode, offset + len); + EXT4_I(inode)->i_disksize = i_size_read(inode); + mutex_unlock(&inode->i_mutex); + } else if (ret < 0 && nblocks) { + /* Handle partial allocation scenario */ + loff_t newsize; + + mutex_lock(&inode->i_mutex); + newsize = (nblocks << blkbits) + i_size_read(inode); + i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits)); + EXT4_I(inode)->i_disksize = i_size_read(inode); + mutex_unlock(&inode->i_mutex); + } + } + + return ret > 0 ? 
ret2 : ret; +} + EXPORT_SYMBOL(ext4_mark_inode_dirty); EXPORT_SYMBOL(ext4_ext_invalidate_cache); EXPORT_SYMBOL(ext4_ext_insert_extent); EXPORT_SYMBOL(ext4_ext_walk_space); EXPORT_SYMBOL(ext4_ext_find_goal); EXPORT_SYMBOL(ext4_ext_calc_credits_for_insert); +EXPORT_SYMBOL(ext4_fallocate); Index: linux-2.6.21/fs/ext4/file.c =================================================================== --- linux-2.6.21.orig/fs/ext4/file.c +++ linux-2.6.21/fs/ext4/file.c @@ -135,5 +135,6 @@ const struct inode_operations ext4_file_ .removexattr = generic_removexattr, #endif .permission = ext4_permission, + .fallocate = ext4_fallocate, }; Index: linux-2.6.21/include/linux/ext4_fs.h =================================================================== --- linux-2.6.21.orig/include/linux/ext4_fs.h +++ linux-2.6.21/include/linux/ext4_fs.h @@ -102,6 +102,7 @@ EXT4_GOOD_OLD_FIRST_INO : \ (s)->s_first_ino) #endif +#define EXT4_BLOCK_ALIGN(size, blkbits) ALIGN((size),(1 << (blkbits))) /* * Macro-instructions used to manage fragments @@ -225,6 +226,11 @@ struct ext4_new_group_data { __u32 free_blocks_count; }; +/* + * Following is used by preallocation code to tell get_blocks() that we + * want uninitialized extents.
+ */
+#define EXT4_CREATE_UNINITIALIZED_EXT	2
 
 /*
  * ioctl commands
@@ -976,6 +982,8 @@ extern int ext4_ext_get_blocks(handle_t
 extern void ext4_ext_truncate(struct inode *, struct page *);
 extern void ext4_ext_init(struct super_block *);
 extern void ext4_ext_release(struct super_block *);
+extern int ext4_fallocate(struct inode *inode, int mode, loff_t offset,
+				loff_t len);
 static inline int
 ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
 			unsigned long max_blocks, struct buffer_head *bh,
Index: linux-2.6.21/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.21.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.21/include/linux/ext4_fs_extents.h
@@ -188,6 +188,18 @@ ext4_ext_invalidate_cache(struct inode *
 	EXT4_I(inode)->i_cached_extent.ec_type = EXT4_EXT_CACHE_NO;
 }
 
+static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext) {
+	ext->ee_len |= cpu_to_le16(0x8000);
+}
+
+static inline int ext4_ext_is_uninitialized(struct ext4_extent *ext) {
+	return (int)(le16_to_cpu((ext)->ee_len) & 0x8000);
+}
+
+static inline int ext4_ext_get_actual_len(struct ext4_extent *ext) {
+	return (int)(le16_to_cpu((ext)->ee_len) & 0x7FFF);
+}
+
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
 extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);

From owner-xfs@oss.sgi.com Mon May 14 07:54:05 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 07:54:08 -0700 (PDT)
Received: from e36.co.us.ibm.com (e36.co.us.ibm.com [32.97.110.154]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EEs2fB013001 for ; Mon, 14 May 2007 07:54:04 -0700
Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e36.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4EEs1i4020276 for ; Mon, 14 May 2007 10:54:01
-0400
Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4EEs1Fk200374 for ; Mon, 14 May 2007 08:54:01 -0600
Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4EEs08G028241 for ; Mon, 14 May 2007 08:54:01 -0600
Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4EErx5F028145; Mon, 14 May 2007 08:53:59 -0600
Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 1264029EBD3; Mon, 14 May 2007 20:24:09 +0530 (IST)
Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4EEs97l010839; Mon, 14 May 2007 20:24:09 +0530
Date: Mon, 14 May 2007 20:24:09 +0530
From: "Amit K. Arora"
To: torvalds@osdl.org, akpm@linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com
Subject: [PATCH 5/5][TAKE2] ext4: write support for preallocated blocks
Message-ID: <20070514145408.GE31748@amitarora.in.ibm.com>
References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070514142820.GA31468@amitarora.in.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20070514142820.GA31468@amitarora.in.ibm.com>
User-Agent: Mutt/1.4.1i
X-archive-position: 11420
X-ecartis-version: Ecartis v1.0.0
Sender:
xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: aarora@linux.vnet.ibm.com
Precedence: bulk
X-list: xfs

This patch adds write support to the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (up to three) extents and merging the
new split extents with neighbouring ones, if possible.

Changelog:
---------
1) Replaced BUG_ON with WARN_ON & ext4_error.
2) Added variable names to the function declaration of
   ext4_ext_try_to_merge().
3) Updated variable declarations to use multiple-definitions-per-line.
4) "if((a=foo())).." was broken into "a=foo(); if(a).."
5) Removed extra spaces.

Here is the updated patch:

Signed-off-by: Amit Arora
---
 fs/ext4/extents.c               |  234 +++++++++++++++++++++++++++++++++++-----
 include/linux/ext4_fs_extents.h |    3
 2 files changed, 210 insertions(+), 27 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===================================================================
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -1141,6 +1141,54 @@ ext4_can_extents_be_merged(struct inode
 }
 
 /*
+ * This function tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+				struct ext4_ext_path *path,
+				struct ext4_extent *ex)
+{
+	struct ext4_extent_header *eh;
+	unsigned int depth, len;
+	int merge_done = 0;
+	int uninitialized = 0;
+
+	depth = ext_depth(inode);
+	BUG_ON(path[depth].p_hdr == NULL);
+	eh = path[depth].p_hdr;
+
+	while (ex < EXT_LAST_EXTENT(eh))
+	{
+		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+			break;
+		/* merge with next extent! */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+				+ ext4_ext_get_actual_len(ex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
+
+		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+			len = (EXT_LAST_EXTENT(eh) - ex - 1)
+				* sizeof(struct ext4_extent);
+			memmove(ex + 1, ex + 2, len);
+		}
+		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+		merge_done = 1;
+		WARN_ON(eh->eh_entries == 0);
+		if (!eh->eh_entries)
+			ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+			   "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+	}
+
+	return merge_done;
+}
+
+/*
  * check if a portion of the "newext" extent overlaps with an
  * existing extent.
  *
@@ -1328,25 +1376,7 @@ has_space:
 
 merge:
 	/* try to merge extents to the right */
-	while (nearex < EXT_LAST_EXTENT(eh)) {
-		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-			break;
-		/* merge with next extent! */
-		if (ext4_ext_is_uninitialized(nearex))
-			uninitialized = 1;
-		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-					+ ext4_ext_get_actual_len(nearex + 1));
-		if (uninitialized)
-			ext4_ext_mark_uninitialized(nearex);
-
-		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-					* sizeof(struct ext4_extent);
-			memmove(nearex + 1, nearex + 2, len);
-		}
-		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-		BUG_ON(eh->eh_entries == 0);
-	}
+	ext4_ext_try_to_merge(inode, path, nearex);
 
 	/* try to merge extents to the left */
 
@@ -2012,15 +2042,152 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (up to three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a> There is no split required: Entire extent should be initialized
+ *   b> Splits in two extents: Write is happening at either end of the extent
+ *   c> Splits in three extents: Someone is writing in middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+					struct ext4_ext_path *path,
+					ext4_fsblk_t iblock,
+					unsigned long max_blocks)
+{
+	struct ext4_extent *ex, newex;
+	struct ext4_extent *ex1 = NULL;
+	struct ext4_extent *ex2 = NULL;
+	struct ext4_extent *ex3 = NULL;
+	struct ext4_extent_header *eh;
+	unsigned int allocated, ee_block, ee_len, depth;
+	ext4_fsblk_t newblock;
+	int err = 0;
+	int ret = 0;
+
+	depth = ext_depth(inode);
+	eh = path[depth].p_hdr;
+	ex = path[depth].p_ext;
+	ee_block = le32_to_cpu(ex->ee_block);
+	ee_len = ext4_ext_get_actual_len(ex);
+	allocated = ee_len - (iblock - ee_block);
+	newblock = iblock - ee_block + ext_pblock(ex);
+	ex2 = ex;
+
+	/* ex1: ee_block to iblock - 1 : uninitialized */
+	if (iblock > ee_block) {
+		ex1 = ex;
+		ex1->ee_len = cpu_to_le16(iblock - ee_block);
+		ext4_ext_mark_uninitialized(ex1);
+		ex2 = &newex;
+	}
+	/* for sanity, update the length of the ex2 extent before
+	 * we insert ex3, if ex1 is NULL. This is to avoid temporary
+	 * overlap of blocks.
+	 */
+	if (!ex1 && allocated > max_blocks)
+		ex2->ee_len = cpu_to_le16(max_blocks);
+	/* ex3: to ee_block + ee_len : uninitialised */
+	if (allocated > max_blocks) {
+		unsigned int newdepth;
+		ex3 = &newex;
+		ex3->ee_block = cpu_to_le32(iblock + max_blocks);
+		ext4_ext_store_pblock(ex3, newblock + max_blocks);
+		ex3->ee_len = cpu_to_le16(allocated - max_blocks);
+		ext4_ext_mark_uninitialized(ex3);
+		err = ext4_ext_insert_extent(handle, inode, path, ex3);
+		if (err)
+			goto out;
+		/* The depth, and hence eh & ex might change
+		 * as part of the insert above.
+		 */
+		newdepth = ext_depth(inode);
+		if (newdepth != depth) {
+			depth = newdepth;
+			path = ext4_ext_find_extent(inode, iblock, NULL);
+			if (IS_ERR(path)) {
+				err = PTR_ERR(path);
+				path = NULL;
+				goto out;
+			}
+			eh = path[depth].p_hdr;
+			ex = path[depth].p_ext;
+			if (ex2 != &newex)
+				ex2 = ex;
+		}
+		allocated = max_blocks;
+	}
+	/* If there was a change of depth as part of the
+	 * insertion of ex3 above, we need to update the length
+	 * of the ex1 extent again here
+	 */
+	if (ex1 && ex1 != ex) {
+		ex1 = ex;
+		ex1->ee_len = cpu_to_le16(iblock - ee_block);
+		ext4_ext_mark_uninitialized(ex1);
+		ex2 = &newex;
+	}
+	/* ex2: iblock to iblock + maxblocks-1 : initialised */
+	ex2->ee_block = cpu_to_le32(iblock);
+	ex2->ee_start = cpu_to_le32(newblock);
+	ext4_ext_store_pblock(ex2, newblock);
+	ex2->ee_len = cpu_to_le16(allocated);
+	if (ex2 != ex)
+		goto insert;
+	err = ext4_ext_get_access(handle, inode, path + depth);
+	if (err)
+		goto out;
+	/* New (initialized) extent starts from the first block
+	 * in the current extent. i.e., ex2 == ex
+	 * We have to see if it can be merged with the extent
+	 * on the left.
+	 */
+	if (ex2 > EXT_FIRST_EXTENT(eh)) {
+		/* To merge left, pass "ex2 - 1" to try_to_merge(),
+		 * since it merges towards right _only_.
+		 */
+		ret = ext4_ext_try_to_merge(inode, path, ex2 - 1);
+		if (ret) {
+			err = ext4_ext_correct_indexes(handle, inode, path);
+			if (err)
+				goto out;
+			depth = ext_depth(inode);
+			ex2--;
+		}
+	}
+	/* Try to Merge towards right. This might be required
+	 * only when the whole extent is being written to.
+	 * i.e. ex2 == ex and ex3 == NULL.
+	 */
+	if (!ex3) {
+		ret = ext4_ext_try_to_merge(inode, path, ex2);
+		if (ret) {
+			err = ext4_ext_correct_indexes(handle, inode, path);
+			if (err)
+				goto out;
+		}
+	}
+	/* Mark modified extent as dirty */
+	err = ext4_ext_dirty(handle, inode, path + depth);
+	goto out;
+insert:
+	err = ext4_ext_insert_extent(handle, inode, path, &newex);
+out:
+	return err ? err : allocated;
+}
+
 int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
 			ext4_fsblk_t iblock,
 			unsigned long max_blocks, struct buffer_head *bh_result,
 			int create, int extend_disksize)
 {
 	struct ext4_ext_path *path = NULL;
+	struct ext4_extent_header *eh;
 	struct ext4_extent newex, *ex;
 	ext4_fsblk_t goal, newblock;
-	int err = 0, depth;
+	int err = 0, depth, ret;
 	unsigned long allocated = 0;
 
 	__clear_bit(BH_New, &bh_result->b_state);
@@ -2068,6 +2235,7 @@ int ext4_ext_get_blocks(handle_t *handle
 	 * this is why assert can't be put in ext4_ext_find_extent()
 	 */
 	BUG_ON(path[depth].p_ext == NULL && depth != 0);
+	eh = path[depth].p_hdr;
 
 	ex = path[depth].p_ext;
 	if (ex) {
@@ -2076,13 +2244,9 @@ int ext4_ext_get_blocks(handle_t *handle
 		unsigned short ee_len;
 
 		/*
-		 * Allow future support for preallocated extents to be added
-		 * as an RO_COMPAT feature:
 		 * Uninitialized extents are treated as holes, except that
-		 * we avoid (fail) allocating new blocks during a write.
+		 * we split out initialized portions during a write.
 		 */
-		if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
-			goto out2;
 		ee_len = ext4_ext_get_actual_len(ex);
 		/* if found extent covers block, simply return it */
 		if (iblock >= ee_block && iblock < ee_block + ee_len) {
@@ -2091,12 +2255,27 @@ int ext4_ext_get_blocks(handle_t *handle
 			allocated = ee_len - (iblock - ee_block);
 			ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
 					ee_block, ee_len, newblock);
+
 			/* Do not put uninitialized extent in the cache */
-			if (!ext4_ext_is_uninitialized(ex))
+			if (!ext4_ext_is_uninitialized(ex)) {
 				ext4_ext_put_in_cache(inode, ee_block,
 							ee_len, ee_start,
 							EXT4_EXT_CACHE_EXTENT);
-			goto out;
+				goto out;
+			}
+			if (create == EXT4_CREATE_UNINITIALIZED_EXT)
+				goto out;
+			if (!create)
+				goto out2;
+
+			ret = ext4_ext_convert_to_initialized(handle, inode,
+								path, iblock,
+								max_blocks);
+			if (ret <= 0)
+				goto out2;
+			else
+				allocated = ret;
+			goto outnew;
 		}
 	}
 
@@ -2148,6 +2327,7 @@ int ext4_ext_get_blocks(handle_t *handle
 
 	/* previous routine could use block we allocated */
 	newblock = ext_pblock(&newex);
+outnew:
 	__set_bit(BH_New, &bh_result->b_state);
 
 	/* Cache only when it is _not_ an uninitialized extent */
Index: linux-2.6.21/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.21.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.21/include/linux/ext4_fs_extents.h
@@ -202,6 +202,9 @@ static inline int ext4_ext_get_actual_le
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern int ext4_ext_try_to_merge(struct inode *inode,
+					struct ext4_ext_path *path,
+					struct ext4_extent *);
 extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);

From owner-xfs@oss.sgi.com Mon May 14 08:33:29 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 08:33:32 -0700 (PDT)
Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.153]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EFXSfB023882 for ; Mon, 14 May 2007 08:33:29 -0700
Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e35.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4EFXSGR025359 for ; Mon, 14 May 2007 11:33:28 -0400
Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4EFXR0K219552 for ; Mon, 14 May 2007 09:33:27 -0600
Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4EFXQqb012695 for ; Mon, 14 May 2007 09:33:27 -0600
Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4EFXPoT012613; Mon, 14 May 2007 09:33:26 -0600
Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 9342629EBD3; Mon, 14 May 2007 21:03:30 +0530 (IST)
Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4EFXURX028658; Mon, 14 May 2007 21:03:30 +0530
Date: Mon, 14 May 2007 21:03:30 +0530
From: "Amit K.
Arora"
To: torvalds@osdl.org, akpm@linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com
Subject: Re: [PATCH 2/5][TAKE2] fallocate() on s390 - glibc wrapper
Message-ID: <20070514153330.GA25249@amitarora.in.ibm.com>
References: <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070514142820.GA31468@amitarora.in.ibm.com> <20070514144833.GB31748@amitarora.in.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20070514144833.GB31748@amitarora.in.ibm.com>
User-Agent: Mutt/1.4.1i
X-archive-position: 11421
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: aarora@linux.vnet.ibm.com
Precedence: bulk
X-list: xfs

On Mon, May 14, 2007 at 08:18:34PM +0530, Amit K. Arora wrote:
> This is the patch suggested by Martin Schwidefsky. Here are the comments
> and patch from him.

Martin also suggested a wrapper in glibc to handle this system call on s390.
Posting it here so that we get feedback for this too.

Here it is:

.globl __fallocate
ENTRY(__fallocate)
	stm	%r6,%r7,28(%r15)	/* save %r6/%r7 on stack */
	cfi_offset (%r7, -68)
	cfi_offset (%r6, -72)
	lm	%r6,%r7,96(%r15)	/* load loff_t len from stack */
	svc	SYS_ify(fallocate)
	lm	%r6,%r7,28(%r15)	/* restore %r6/%r7 from stack */
	br	%r14
PSEUDO_END(__fallocate)

--
Regards,
Amit Arora

> -------------
> From: Martin Schwidefsky
>
> This patch implements support of fallocate system call on s390(x)
> platform. A wrapper is added to address the issue which s390 ABI has
> with the arguments of this system call.
>
> Signed-off-by: Martin Schwidefsky
> ---
>
>  arch/s390/kernel/compat_wrapper.S |   10 ++++++++++
>  arch/s390/kernel/sys_s390.c       |   29 +++++++++++++++++++++++++++++
>  arch/s390/kernel/syscalls.S       |    1 +
>  include/asm-s390/unistd.h         |    3 ++-
>  4 files changed, 42 insertions(+), 1 deletion(-)
>
> Index: linux-2.6.21/arch/s390/kernel/compat_wrapper.S
> ===================================================================
> --- linux-2.6.21.orig/arch/s390/kernel/compat_wrapper.S
> +++ linux-2.6.21/arch/s390/kernel/compat_wrapper.S
> @@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper:
>  	llgtr	%r2,%r2			# char *
>  	llgtr	%r3,%r3			# struct compat_timeval *
>  	jg	compat_sys_utimes
> +
> +	.globl	sys_fallocate_wrapper
> +sys_fallocate_wrapper:
> +	lgfr	%r2,%r2			# int
> +	lgfr	%r3,%r3			# int
> +	sllg	%r4,%r4,32		# get high word of 64bit loff_t
> +	lr	%r4,%r5			# get low word of 64bit loff_t
> +	sllg	%r5,%r6,32		# get high word of 64bit loff_t
> +	l	%r5,164(%r15)		# get low word of 64bit loff_t
> +	jg	sys_fallocate
> Index: linux-2.6.21/arch/s390/kernel/syscalls.S
> ===================================================================
> --- linux-2.6.21.orig/arch/s390/kernel/syscalls.S
> +++ linux-2.6.21/arch/s390/kernel/syscalls.S
> @@ -322,3 +322,4 @@ NI_SYSCALL /* 310 sys_move_pages *
>  SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
>  SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
>  SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
> +SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
> Index: linux-2.6.21/arch/s390/kernel/sys_s390.c
> ===================================================================
> --- linux-2.6.21.orig/arch/s390/kernel/sys_s390.c
> +++ linux-2.6.21/arch/s390/kernel/sys_s390.c
> @@ -286,3 +286,32 @@ int kernel_execve(const char *filename,
>  		  "d" (__arg3) : "memory");
>  	return __svcres;
>  }
> +
> +#ifndef CONFIG_64BIT
> +/*
> + * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
> + * 64 bit argument "len" is split into the upper and lower 32 bits. The
> + * system call wrapper in the user space loads the value to %r6/%r7.
> + * The code in entry.S keeps the values in %r2 - %r6 where they are and
> + * stores %r7 to 96(%r15). But the standard C linkage requires that
> + * the whole 64 bit value for len is stored on the stack and doesn't
> + * use %r6 at all. So s390_fallocate has to convert the arguments from
> + *   %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
> + * to
> + *   %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
> + */
> +asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
> +			       u32 len_high, u32 len_low)
> +{
> +	union {
> +		u64 len;
> +		struct {
> +			u32 high;
> +			u32 low;
> +		};
> +	} cv;
> +	cv.high = len_high;
> +	cv.low = len_low;
> +	return sys_fallocate(fd, mode, offset, cv.len);
> +}
> +#endif
> Index: linux-2.6.21/include/asm-s390/unistd.h
> ===================================================================
> --- linux-2.6.21.orig/include/asm-s390/unistd.h
> +++ linux-2.6.21/include/asm-s390/unistd.h
> @@ -251,8 +251,9 @@
>  #define __NR_getcpu		311
>  #define __NR_epoll_pwait	312
>  #define __NR_utimes		313
> +#define __NR_fallocate		314
>
> -#define NR_syscalls 314
> +#define NR_syscalls 315
>
>  /*
>   * There are some system calls that are not present on 64 bit, some
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

From owner-xfs@oss.sgi.com Mon May 14 13:20:05 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 13:20:08 -0700 (PDT)
Received: from mailer.gwdg.de (mailer.gwdg.de [134.76.10.26]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EKK3fB012622 for ; Mon, 14 May 2007 13:20:04 -0700
Received: from linux01.gwdg.de ([134.76.13.21]) by
mailer.gwdg.de with esmtps (TLSv1:AES256-SHA:256) (Exim 4.66) (envelope-from ) id 1Hnh1I-0001dN-FS; Mon, 14 May 2007 22:19:48 +0200
Received: from linux01.gwdg.de (localhost [127.0.0.1]) by linux01.gwdg.de (8.13.3/8.13.3/SuSE Linux 0.7) with ESMTP id l4EKG1gG018684; Mon, 14 May 2007 22:16:03 +0200
Received: from localhost (jengelh@localhost) by linux01.gwdg.de (8.13.3/8.13.3/Submit) with ESMTP id l4EKG0JO018678; Mon, 14 May 2007 22:16:00 +0200
Date: Mon, 14 May 2007 22:16:00 +0200 (MEST)
From: Jan Engelhardt
To: Matt Mackall
cc: Jeremy Fitzhardinge , David Chinner , Linux Kernel Mailing List , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com
Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem?
In-Reply-To: <20070512124641.GZ11115@waste.org>
Message-ID:
References: <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> <20070510153832.GQ11115@waste.org> <20070512124641.GZ11115@waste.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-archive-position: 11422
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: jengelh@linux01.gwdg.de
Precedence: bulk
X-list: xfs

On May 12 2007 07:46, Matt Mackall wrote:
>>
>> You should not assume alphabetical order. Filesystems may be free to
>> reorder things and return them (1) randomly like in a hash (2) by
>> creation time during readdir().
>
>There is no assumption. Mercurial explicitly visits files in
>alphabetical order for the above commands.

But who says that

  for i in {a..z}; do ## {..} is a bash3 extension
    touch $i;
  done;

actually makes readdir() return them in the same order?

Jan
--

From owner-xfs@oss.sgi.com Mon May 14 13:27:47 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 13:27:49 -0700 (PDT)
Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EKRkfB013776 for ; Mon, 14 May 2007 13:27:47 -0700
Received: by lurch.goop.org (Postfix, from userid 525) id 902FD2C8046; Mon, 14 May 2007 13:26:57 -0700 (PDT)
Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id EB7702C803C; Mon, 14 May 2007 13:26:54 -0700 (PDT)
Received: from [75.210.82.29] (29.sub-75-210-82.myvzw.com [75.210.82.29]) by lurch.goop.org (Postfix) with ESMTP; Mon, 14 May 2007 13:26:54 -0700 (PDT)
Message-ID: <4648C63F.7020800@goop.org>
Date: Mon, 14 May 2007 13:27:43 -0700
From: Jeremy Fitzhardinge
User-Agent: Thunderbird 1.5.0.10 (X11/20070302)
MIME-Version: 1.0
To: Jan Engelhardt
CC: Matt Mackall , David Chinner , Linux Kernel Mailing List , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com
Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem?
References: <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> <20070510153832.GQ11115@waste.org> <20070512124641.GZ11115@waste.org>
In-Reply-To:
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-archive-position: 11423
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: jeremy@goop.org
Precedence: bulk
X-list: xfs

Jan Engelhardt wrote:
> On May 12 2007 07:46, Matt Mackall wrote:
>
>>> You should not assume alphabetical order. Filesystems may be free to
>>> reorder things and return them (1) randomly like in a hash (2) by
>>> creation time during readdir().
>>>
>> There is no assumption. Mercurial explicitly visits files in
>> alphabetical order for the above commands.
>>
>
> But who says that
>
>   for i in {a..z}; do ## {..} is a bash3 extension
>     touch $i;
>   done;
>
> actually makes readdir() return them in the same order?

Nobody. But doing a readdir, sorting the results and visiting the files
in that order does mean you'll visit them in alphabetical order. Hence
"explicitly visits".

    J

From owner-xfs@oss.sgi.com Mon May 14 16:06:05 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 16:06:07 -0700 (PDT)
Received: from one.firstfloor.org (one.firstfloor.org [213.235.205.2]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EN61fB008398 for ; Mon, 14 May 2007 16:06:05 -0700
Received: by one.firstfloor.org (Postfix, from userid 503) id 8529B18902A2; Tue, 15 May 2007 00:39:46 +0200 (CEST)
Date: Tue, 15 May 2007 00:39:46 +0200
From: Andi Kleen
To: David Chatterton
Cc: "'Andi Kleen'" , "'xfs-dev'" , "'xfs-oss'" , "'David Chinner'"
Subject: Re: Review: Concurrent Multi-File Data Streams
Message-ID: <20070514223946.GA19487@one.firstfloor.org>
References: <000001c79544$44076ac0$0501010a@DCHATTERTONLAPTOP>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <000001c79544$44076ac0$0501010a@DCHATTERTONLAPTOP>
User-Agent: Mutt/1.4.2.1i
X-archive-position: 11424
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: andi@firstfloor.org
Precedence: bulk
X-list: xfs

> So yes this is designed for a workload where the number of AGs is a multiple
> of the number of streams since mixing streams in the one AG is the problem
> it tries to avoid.

Sounds like a awful special case. Is that common?

-Andi

From owner-xfs@oss.sgi.com Mon May 14 17:05:42 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 17:05:45 -0700 (PDT)
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4F05efB021310 for ; Mon, 14 May 2007 17:05:41 -0700
Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA22854; Tue, 15 May 2007 10:05:33 +1000
Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4F05VAf93938633; Tue, 15 May 2007 10:05:32 +1000 (AEST)
Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4F05NdH93919177; Tue, 15 May 2007 10:05:23 +1000 (AEST)
X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f
Date: Tue, 15 May 2007 10:05:23 +1000
From: David Chinner
To: Andi Kleen
Cc: David Chatterton , "'xfs-dev'" , "'xfs-oss'" , "'David Chinner'"
Subject: Re: Review: Concurrent Multi-File Data Streams
Message-ID: <20070515000523.GQ86004887@sgi.com>
References: <000001c79544$44076ac0$0501010a@DCHATTERTONLAPTOP> <20070514223946.GA19487@one.firstfloor.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20070514223946.GA19487@one.firstfloor.org>
User-Agent: Mutt/1.4.2.1i
X-archive-position: 11425
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: dgc@sgi.com
Precedence: bulk
X-list: xfs

On Tue, May 15, 2007 at 12:39:46AM +0200, Andi Kleen wrote:
> > So yes this is designed for a workload where the number of AGs is a multiple
> > of the number of streams since mixing streams in the one AG is the problem
> > it tries to avoid.
>
> Sounds like a awful special case. Is that common?

Common enough to be a serious problem when running multiple 2k ingest
and playout streams (320MB/s each).

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

From owner-xfs@oss.sgi.com Mon May 14 17:12:35 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 17:12:39 -0700 (PDT)
Received: from smtps.tip.net.au (chilli.pcug.org.au [203.10.76.44]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4F0CYfB022809 for ; Mon, 14 May 2007 17:12:35 -0700
Received: from localhost (bh02i525f01.au.ibm.com [202.81.18.30]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by smtps.tip.net.au (Postfix) with ESMTP id B0EFB368012; Tue, 15 May 2007 09:44:40 +1000 (EST)
Date: Tue, 15 May 2007 09:44:36 +1000
From: Stephen Rothwell
To: "Amit K. Arora"
Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com
Subject: Re: [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc
Message-Id: <20070515094436.d441098f.sfr@canb.auug.org.au>
In-Reply-To: <20070514144524.GA31748@amitarora.in.ibm.com>
References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070514142820.GA31468@amitarora.in.ibm.com> <20070514144524.GA31748@amitarora.in.ibm.com>
X-Mailer: Sylpheed 2.4.0 (GTK+ 2.10.12; i486-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: multipart/signed; protocol="application/pgp-signature"; micalg="PGP-SHA1"; boundary="Signature=_Tue__15_May_2007_09_44_36_+1000_FmQQEkgMAA8ZAwtk"
X-archive-position: 11426
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: sfr@canb.auug.org.au
Precedence: bulk
X-list: xfs

--Signature=_Tue__15_May_2007_09_44_36_+1000_FmQQEkgMAA8ZAwtk
Content-Type: text/plain; charset=US-ASCII
Content-Disposition: inline
Content-Transfer-Encoding: 7bit

On Mon, 14 May 2007 20:15:24 +0530 "Amit K. Arora" wrote:
>
> This patch implements sys_fallocate() and adds support on i386, x86_64
> and powerpc platforms.

This patch no longer applies to Linus' tree - for a start there is no
file arch/x86_64/kernel/functionlist any more.

Can you rebase it, please?

--
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

--Signature=_Tue__15_May_2007_09_44_36_+1000_FmQQEkgMAA8ZAwtk
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFGSPRpFdBgD/zoJvwRAuHCAJsEB8TyYfKxqEtWnHM7smTPNqRiPwCfYj2B
kUd5qmBLOd+TYg003bKAuVw=
=ap96
-----END PGP SIGNATURE-----

--Signature=_Tue__15_May_2007_09_44_36_+1000_FmQQEkgMAA8ZAwtk--

From owner-xfs@oss.sgi.com Mon May 14 17:15:06 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 17:15:08 -0700 (PDT)
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4F0F3fB023732 for ; Mon, 14 May 2007 17:15:05 -0700
Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA22997; Tue, 15 May 2007 10:14:57 +1000
Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4F0EsAf91589765; Tue, 15 May 2007 10:14:54 +1000 (AEST)
Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4F0Epqv93878086; Tue, 15 May 2007 10:14:51 +1000 (AEST)
X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com
using -f
Date: Tue, 15 May 2007 10:14:50 +1000
From: David Chinner
To: Jeremy Fitzhardinge
Cc: David Chinner, Jan Engelhardt, Chuck Ebbert, Linux Kernel Mailing List, Matt Mackall, xfs@oss.sgi.com
Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem?
Message-ID: <20070515001450.GS86004887@sgi.com>
References: <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070512135143.GG85884050@sgi.com> <4645D594.4070801@goop.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4645D594.4070801@goop.org>
User-Agent: Mutt/1.4.2.1i
X-archive-position: 11427
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: dgc@sgi.com
Precedence: bulk
X-list: xfs

On Sat, May 12, 2007 at 07:56:20AM -0700, Jeremy Fitzhardinge wrote:
> David Chinner wrote:
> > What I don't understand is that on unmount dirty xfs inodes get
> > written out. Clearly this is not happening - either there's a hole
> > in the writeback logic (unlikely - it was unchanged) or we've missed
> > some case where we need to update the filesize and mark the inode
> > dirty.
> >
> > Hmmmm - if the write was just a short append to the file, then the
> > block that was written to should already be mapped. Then we'll just
> > look up the extent by doing a BMAPI_READ lookup, set the type to
> > IOMAP_READ and add the block to ioend we are building.
> >
>
> Well, that result I mailed you showed that the difference was just over
> 16k, and that there was a 32 block difference in the final extent
> length. Does that fit with this theory?

Yes - because we do speculative allocation of 64k beyond EOF by
default on appends....
> > The type IOMAP_READ determines the I/O completion behaviour - in this case > > it is xfs_end_bio_read(), which fails to update the file size.... > > > > Bingo. > > > > A patch for you to try, Jeremy. I've just started a test run on it... > > > > Thanks, I'll give it a spin. Have you reproduced the bug yourself? No, not yet. I haven't had chance because I'm travelling at the moment.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 14 17:15:50 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 17:15:54 -0700 (PDT) Received: from postoffice.aconex.com (mail.app.aconex.com [203.89.192.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4F0FmfB024021 for ; Mon, 14 May 2007 17:15:50 -0700 Received: from DCHATTERTONLAPTOP (unknown [203.89.192.141]) by postoffice.aconex.com (Postfix) with ESMTP id BCDA592C5DE; Tue, 15 May 2007 10:15:46 +1000 (EST) From: "David Chatterton" To: "'Andi Kleen'" Cc: "'xfs-dev'" , "'xfs-oss'" , "'David Chinner'" Subject: RE: Review: Concurrent Multi-File Data Streams Date: Tue, 15 May 2007 10:15:50 +1000 Message-ID: <00f501c79686$319a0530$0501010a@DCHATTERTONLAPTOP> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 In-Reply-To: <20070514223946.GA19487@one.firstfloor.org> Thread-Index: AceWeMpxeBXomGDQT3ag+pGdmo367AACszxA X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028 X-archive-position: 11428 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dchatterton@aconex.com Precedence: bulk X-list: xfs Andi, Dave just beat me to it, this represents the workload used by all post-production houses since they moved to digital where each stream is 320MB/s (2K format) or 1.3GB/s (4K format). 
Making sure those files are written sequentially on disk and do not overlap other streams has a huge benefit when supporting multiple streams. There is no reason why other workloads that would benefit from files in the same directory being written sequentially into their "own AG" would not use this feature. Post-production just tends to push the filesystem to the limits earlier than some other workloads. David > -----Original Message----- > From: Andi Kleen [mailto:andi@firstfloor.org] > Sent: Tuesday, 15 May 2007 8:40 AM > To: David Chatterton > Cc: 'Andi Kleen'; 'xfs-dev'; 'xfs-oss'; 'David Chinner' > Subject: Re: Review: Concurrent Multi-File Data Streams > > > So yes this is designed for a workload where the number of AGs is a > > multiple of the number of streams since mixing streams in > the one AG > > is the problem it tries to avoid. > > Sounds like a awful special case. Is that common? > > -Andi > From owner-xfs@oss.sgi.com Mon May 14 17:53:16 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 17:53:18 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4F0rFfB031079 for ; Mon, 14 May 2007 17:53:15 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id 3EADE2C8046; Mon, 14 May 2007 17:52:26 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 4DD452C803C; Mon, 14 May 2007 17:52:24 -0700 (PDT) Received: from [75.210.82.29] (29.sub-75-210-82.myvzw.com [75.210.82.29]) by lurch.goop.org (Postfix) with ESMTP; Mon, 14 May 2007 17:52:23 -0700 (PDT) Message-ID: <46490478.3010409@goop.org> Date: Mon, 14 May 2007 17:53:12 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: xfs@oss.sgi.com, Linux Kernel Mailing List Subject: 2.6.22-rc1 xfs lockdep messages Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit 
X-archive-position: 11429 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs I tend to get this when doing unlinks or rms in xfs: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.22-rc1-paravirt #1382 ------------------------------------------------------- rm/1451 is trying to acquire lock: (&(&ip->i_lock)->mr_lock/1){--..}, at: [] xfs_ilock+0x64/0x8d [xfs] but task is already holding lock: (&(&ip->i_lock)->mr_lock){----}, at: [] xfs_ilock+0x64/0x8d [xfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&(&ip->i_lock)->mr_lock){----}: [] __lock_acquire+0xa1f/0xbab [] lock_acquire+0x7b/0x9f [] down_write_nested+0x3d/0x58 [] xfs_ilock+0x64/0x8d [xfs] [] xfs_iget_core+0x2bd/0x605 [xfs] [] xfs_iget+0xac/0x133 [xfs] [] xfs_trans_iget+0xdc/0x142 [xfs] [] xfs_ialloc+0xa5/0x457 [xfs] [] xfs_dir_ialloc+0x6d/0x260 [xfs] [] xfs_create+0x2f4/0x5a6 [xfs] [] xfs_vn_mknod+0x130/0x1e5 [xfs] [] xfs_vn_create+0x12/0x14 [xfs] [] vfs_create+0x9b/0xe5 [] open_namei+0x176/0x593 [] do_filp_open+0x26/0x3b [] do_sys_open+0x43/0xc7 [] sys_open+0x1c/0x1e [] syscall_call+0x7/0xb [] 0xffffffff -> #0 (&(&ip->i_lock)->mr_lock/1){--..}: [] __lock_acquire+0x903/0xbab [] lock_acquire+0x7b/0x9f [] down_write_nested+0x3d/0x58 [] xfs_ilock+0x64/0x8d [xfs] [] xfs_lock_inodes+0x11d/0x12f [xfs] [] xfs_lock_dir_and_entry+0xc2/0xcc [xfs] [] xfs_remove+0x213/0x425 [xfs] [] xfs_vn_unlink+0x1c/0x44 [xfs] [] vfs_unlink+0x75/0xb3 [] do_unlinkat+0x96/0x12c [] sys_unlink+0x13/0x15 [] syscall_call+0x7/0xb [] 0xffffffff other info that might help us debug this: 3 locks held by rm/1451: #0: (&inode->i_mutex/1){--..}, at: [] do_unlinkat+0x5e/0x12c #1: (&inode->i_mutex){--..}, at: [] mutex_lock+0x1f/0x23 #2: (&(&ip->i_lock)->mr_lock){----}, at: [] xfs_ilock+0x64/0x8d [xfs] stack 
backtrace: [] show_trace_log_lvl+0x1a/0x30 [] show_trace+0x12/0x14 [] dump_stack+0x16/0x18 [] print_circular_bug_tail+0x5f/0x68 [] __lock_acquire+0x903/0xbab [] lock_acquire+0x7b/0x9f [] down_write_nested+0x3d/0x58 [] xfs_ilock+0x64/0x8d [xfs] [] xfs_lock_inodes+0x11d/0x12f [xfs] [] xfs_lock_dir_and_entry+0xc2/0xcc [xfs] [] xfs_remove+0x213/0x425 [xfs] [] xfs_vn_unlink+0x1c/0x44 [xfs] [] vfs_unlink+0x75/0xb3 [] do_unlinkat+0x96/0x12c [] sys_unlink+0x13/0x15 [] syscall_call+0x7/0xb ======================= J From owner-xfs@oss.sgi.com Mon May 14 23:23:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 23:23:40 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4F6NWfB020802 for ; Mon, 14 May 2007 23:23:34 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA00542; Tue, 15 May 2007 16:23:29 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4F6NSAf93905926; Tue, 15 May 2007 16:23:28 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4F6NRxD93828017; Tue, 15 May 2007 16:23:27 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Tue, 15 May 2007 16:23:27 +1000 From: David Chinner To: Christoph Hellwig Cc: David Chinner , xfs-dev , xfs-oss Subject: Re: Review: Concurrent Multi-File Data Streams Message-ID: <20070515062327.GI85884050@sgi.com> References: <20070511003606.GB85884050@sgi.com> <20070513205953.GA14030@infradead.org> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20070513205953.GA14030@infradead.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11430 
X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Sun, May 13, 2007 at 09:59:53PM +0100, Christoph Hellwig wrote: > I already had some comments on this when discussing it with Sam in person, > but it seems like they didn't make it to you. Some people vaguely remembered some stuff (I did ask around) but it no-one knew the exact details of what you and Sam talked about. > First the mru cache while beeing quite nice code is heavily overengineered > for this case. Unless there are a many hundred filestreams per filesystem > it will be a lot faster to just have a simple wrap-around array of > linked lists. Well.... The mru cache is a wrap-around array of linked lists. i.e. There's a linked list for each time quanta group, and an array that holds all the head of each list. As each time quanta expires, we reclaim the oldest list and move the head pointer to the just emptied list for the new or newly referenced entries. I guess then you're commenting on the fact that it is also indexed by a radix tree? Given that during QA I've seen the cache grow to over 30,000 elements (one mru cache entry per cached inode), this cache can grow very large. In that particular test (083 - multiple fsstress at ENOSPC) each AG had around 2,000 stream references. That's far too large to search based on linked lists and the cache size variation pretty much rules out a hashing based solution. Radix tree gives pretty good lookup performance in these cases.... So the issue here is not that we have hundreds of streams but we have the possibility of having to search hundreds of thousands of cache objects to find the association for a given inode..... > We don't want to feed the argument that xfs has lots of > useless bloated code, do we? :) I've got two or three other things lined up that will use the mru cache so I don't think this is an issue at all... 
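For readers following the thread, the MRU cache layout described above (an array of list heads, one list per time quantum, with the oldest list reaped and reused as each quantum expires, plus a keyed index for lookups) can be sketched roughly as below. This is a minimal user-space illustration under stated assumptions, not the XFS code: every name here (mru_cache, mru_insert, mru_lookup, mru_expire) and the group count are invented for the example, and a small direct-indexed table stands in for the radix tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MRU_GROUPS   4     /* number of time-quantum lists (assumed value) */
#define MRU_MAX_KEYS 1024  /* toy key space; stands in for the radix tree */

struct mru_elem {
	struct mru_elem *prev, *next;  /* links within one group's list */
	unsigned long    key;
	int              value;
};

struct mru_cache {
	struct mru_elem  heads[MRU_GROUPS];    /* sentinel list head per group */
	unsigned int     cur;                  /* group receiving new references */
	struct mru_elem *index[MRU_MAX_KEYS];  /* key -> element lookup */
};

static void list_init(struct mru_elem *h) { h->prev = h->next = h; }

static void list_del(struct mru_elem *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

static void list_add(struct mru_elem *h, struct mru_elem *e)
{
	e->next = h->next;
	e->prev = h;
	h->next->prev = e;
	h->next = e;
}

void mru_init(struct mru_cache *c)
{
	for (int i = 0; i < MRU_GROUPS; i++)
		list_init(&c->heads[i]);
	c->cur = 0;
	for (int i = 0; i < MRU_MAX_KEYS; i++)
		c->index[i] = NULL;
}

/* Insert a new element into the current (youngest) group. */
int mru_insert(struct mru_cache *c, unsigned long key, int value)
{
	struct mru_elem *e;

	if (key >= MRU_MAX_KEYS || c->index[key])
		return -1;
	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->key = key;
	e->value = value;
	list_add(&c->heads[c->cur], e);
	c->index[key] = e;
	return 0;
}

/* Look up by key; a hit moves the element to the youngest group. */
struct mru_elem *mru_lookup(struct mru_cache *c, unsigned long key)
{
	struct mru_elem *e = key < MRU_MAX_KEYS ? c->index[key] : NULL;

	if (e) {
		list_del(e);
		list_add(&c->heads[c->cur], e);
	}
	return e;
}

/* One time quantum expired: reap the oldest group and reuse its list. */
unsigned int mru_expire(struct mru_cache *c)
{
	unsigned int oldest = (c->cur + 1) % MRU_GROUPS;
	struct mru_elem *h = &c->heads[oldest];
	unsigned int reaped = 0;

	while (h->next != h) {
		struct mru_elem *e = h->next;

		list_del(e);
		c->index[e->key] = NULL;
		free(e);
		reaped++;
	}
	c->cur = oldest;
	return reaped;
}
```

With four groups, an element that is never referenced again survives three calls to mru_expire() and is freed on the fourth, while any lookup hit resets its age. The keyed index is what keeps lookups cheap once the cache holds tens of thousands of entries; walking the lists would not.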
> All the pip != NULL checks are superfluous in Linux. A regular
> file can never have a non-null parent inode, and a directory can only
> have a non-NULL parent inode in very odd corner cases involving NFS
> exports, but it has to be connected again once you start doing
> namespace modifying operations on it.

Yes - I was told you'd said that about the code but I couldn't
understand how or why it was even relevant because the code has
nothing at all to do with dentries or looking up parent inodes.
Now I have the full context....

So, we do this:

	/* Pick a new AG for the parent inode starting at startag. */
	if ((err = _xfs_filestream_pick_ag(mp, startag, &ag, 0, 0)) ||
	    ag == NULLAGNUMBER)
		goto exit_did_pick;

	/* Associate the parent inode with the AG. */
	if ((err = _xfs_filestream_set_ag(pip, NULL, ag))) {
		dprint("_xfs_filestream_set_ag(%p (%lld), NULL, %d) -> err %d",
		       pip, pip->i_ino, ag, err);
		goto exit_did_pick;
	}

	/* Associate the file inode with the AG. */
	if ((err = _xfs_filestream_set_ag(ip, pip, ag))) {
		dprint("_xfs_filestream_set_ag(%p (%lld), %p (%lld), %d) -> "
		       "err %d", ip, ip->i_ino, pip, pip->i_ino, ag, err);
		goto exit_did_pick;
	}

_xfs_filestream_set_ag() is called in two cases here - once without a
parent inode, and once with. When we associate a directory with an AG,
we don't care what its parent association is - we want that directory
to be associated with the ag we got from _xfs_filestream_pick_ag(), not
its parent's association.

With regular file inodes we want it to be associated with the parent inode's
AG so we need to pass in a pip. Hence all the checks for pip being/not being
NULL are required in this function. It really has nothing to do with
whether an inode has a parent connected to it in the dentry tree or
not....

> There's some naming confusion: xfs_mount.h forward-declares struct
> xfs_filestream but everything else uses struct fstrm_mnt_data.
> The former is very non-descriptive and the latter but ugly, I'd > suggestjust putting the mru-cache replacement directly in there > as xfs_filestream_cache instead of the wrapping. I'll look at changing names to something more sensible, but at this point I don't see that the mru cache going away... > The xfs_zeroino changes looks good but should be a separate commit. Ok, I'll extract that out.... > Some comments on the actual code in xfs_filestream.c > > > +#ifdef DEBUG_FILESTREAMS > > +#define dprint(fmt, args...) do { \ > > + printk(KERN_DEBUG "%4d %s: " fmt "\n", \ > > + current_pid(), __FUNCTION__, ##args); \ > > +} while(0) > > +#else > > +#define dprint(args...) do {} while (0) > > +#endif > > This should probably be killed entirely. I think it needs to be replaced with real tracing code rather than printk()s - this stuff is pretty much impossible to debug in a finite time period without some form of tracing telling us what happened. Is converting this to ktrace infrastructure acceptible? > > +#define GET_AG_REF(mp, ag) atomic_read(&(mp)->m_perag[ag].pagf_fstrms) > > +#define INC_AG_REF(mp, ag) atomic_inc_return(&(mp)->m_perag[ag].pagf_fstrms) > > +#define DEC_AG_REF(mp, ag) atomic_dec_return(&(mp)->m_perag[ag].pagf_fstrms) > > These should be inlines with more descriptive lower case names. *nod* > > +#define XFS_PICK_USERDATA 1 > > +#define XFS_PICK_LOWSPACE 2 > > enum. Yup. > > + for (nscan = 0; 1; nscan++) { > > + > > + //dprint("scanning AG %d[%d]", ag, GET_AG_REF(mp, ag)); > > please don't leave commented out debug code in. I missed that one :/ > > + pag = mp->m_perag + ag; > > + > > + if (!pag->pagf_init && > > + (err = xfs_alloc_pagf_init(mp, NULL, ag, trylock)) && > > + !trylock) { > > + dprint("xfs_alloc_pagf_init returned %d", err); > > + return err; > > + } > > if (!pag->pagf_init) { > err = xfs_alloc_pagf_init(mp, NULL, ag, trylock); > if (err && !trylock) > return err; > } Yup, I'll convert all those. 
> > +static int
> > +_xfs_filestream_set_ag(
> > +	xfs_inode_t	*ip,
> > +	xfs_inode_t	*pip,
> > +	xfs_agnumber_t	ag)
> > +{
> > +	int		err = 0;
> > +	xfs_mount_t	*mp;
> > +	xfs_mru_cache_t	*cache;
> > +	fstrm_item_t	*item;
> > +	xfs_agnumber_t	old_ag;
> > +	xfs_inode_t	*old_pip;
> > +
> > +	/*
> > +	 * Either ip is a regular file and pip is a directory, or ip is a
> > +	 * directory and pip is NULL.
> > +	 */
>
> We have parent information for parents as well so this should probably
> be made more regular.

As explained above, the association of the parent of a directory is
irrelevant which is why we do not use it...

> > +void
> > +xfs_filestream_init(void)
> > +{
> > +	item_zone = kmem_zone_init(sizeof(fstrm_item_t), "fstrm_item");
> > +	ASSERT(item_zone);
>
> Please check for errors instead and propagate them.

Ooo. I missed that one.

> > +/*
> > + * xfs_filestream_uninit() is called at xfs termination time to destroy the
> > + * memory zone that was used for filestream data structure allocation.
> > + */
> > +void
> > +xfs_filestream_uninit(void)
> > +{
> > +	if (item_zone) {
> > +		kmem_zone_destroy(item_zone);
> > +		item_zone = NULL;
> > +	}
> > +}
>
> no need for the NULL check or setting it to NULL.

*nod*

> > +	if (!(md = (fstrm_mnt_data_t*)kmem_zalloc(sizeof(*md), KM_SLEEP)))
>
> Please use KM_MAYFAIL for all new code outside of transactions.

Yeah - that is pretty silly - checking if a KM_SLEEP allocation failed....

> > +	ASSERT(ip->i_d.di_mode & (S_IFREG | S_IFDIR));
> > +	if (!(ip->i_d.di_mode & (S_IFREG | S_IFDIR)))
> > +		return NULLAGNUMBER;
>
> either the assert or the if clause checking for it, please.

Purely defensive - on a production system we'll return NULLAGNUMBER if
we get called for the wrong type so the system will silently continue
without issues. On a debug kernel we'll get an assert failure so we can
debug why we got here incorrectly.
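As an illustration of the assert-plus-defensive-return idiom just described, here is a small user-space sketch. It is not the XFS implementation: lookup_stream_ag() is a hypothetical helper, ASSERT() is a stand-in that is only active when DEBUG is defined, and the MODE_* constants are local copies of the usual Linux file-type bit values. The sketch compares the masked file type for equality, because a plain bitwise test against S_IFREG | S_IFDIR, as in the quoted hunk, is also non-zero for other types such as symlinks.

```c
#include <assert.h>

/* Stand-in for XFS's ASSERT(): only active when DEBUG is defined. */
#ifdef DEBUG
#define ASSERT(expr)	assert(expr)
#else
#define ASSERT(expr)	((void)0)
#endif

/* Local copies of the usual Linux file-type bits (assumed values). */
#define MODE_FMT	0170000u
#define MODE_REG	0100000u
#define MODE_DIR	0040000u
#define MODE_LNK	0120000u

#define NULLAGNUMBER	((unsigned int)-1)	/* "no AG" sentinel */

/*
 * Hypothetical helper: return the cached AG for a regular file or
 * directory, or NULLAGNUMBER if called for any other inode type.
 */
unsigned int lookup_stream_ag(unsigned int mode, unsigned int cached_ag)
{
	unsigned int type = mode & MODE_FMT;
	int ok = (type == MODE_REG) || (type == MODE_DIR);

	ASSERT(ok);			/* debug build: stop and investigate */
	if (!ok)
		return NULLAGNUMBER;	/* production build: carry on quietly */
	return cached_ag;
}
```

Built without DEBUG, a call such as lookup_stream_ag(MODE_LNK | 0777, 3) quietly returns NULLAGNUMBER; built with -DDEBUG the same call trips the assertion, which is the split in behaviour the reply argues for.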
This is a common way of handling should-not-happen-but-not-fatal error conditions in XFS - look at all the places where we have "ASSERT(0)" in error cases that a non-debug kernel will just return an error. What is the accepted way of coding this? > Now comes the worst part the new allocator function > i > IF we look at a diff between xfs_bmap_filestreams and xfs_bmap_btalloc > we see that it's a pretty bad cut & paste job: FWIW, it was done that way originally so that it didn't perturb the existing allocator code. > > --- btalloc 2007-05-12 12:43:03.000000000 +0200 > +++ fsalloc 2007-05-12 12:42:28.000000000 +0200 > @@ -1,44 +1,54 @@ > > > + rt = (ap->ip->i_d.di_flags & XFS_DIFLAG_REALTIME) && ap->userdata; > > xfs_bmap_alloc() never calls xfs_bmap_filestreams if this is > true so all code guarded by if (rt) is dead. Will kill. > > - if (unlikely(align)) { > > + if (align) { > > lign should have the same likelyhood for oth > > > - if (nullfb) > > - ap->rval = XFS_INO_TO_FSB(mp, ap->ip->i_ino); > > - else > > + if (nullfb) { > > + ag = xfs_filestream_get_ag(ap->ip); > > + ag = (ag != NULLAGNUMBER) ? ag : 0; > > + ap->rval = (ap->userdata) ? XFS_AGB_TO_FSB(mp, ag, 0) : > > + XFS_INO_TO_FSB(mp, ap->ip->i_ino); > > + } else { > > ap->rval = ap->firstblock; > > + } > > Some rreal changes :) But this could be just a third if case > for the filesystream case. Yes, it could..... > > @@ -117,18 +167,19 @@ > > */ > > else > > args.minlen = ap->alen; > > + ap->rval = args.fsbno = XFS_AGB_TO_FSB(mp, ag, 0); > > } else if (ap->low) { > > - args.type = XFS_ALLOCTYPE_START_BNO; > > + args.type = XFS_ALLOCTYPE_FIRST_AG; > > args.total = args.minlen = ap->minlen; > > Why is this different? Because when we are low on space stream associations typically fail and we associate with AG 0 in that case. 
> } > > - if (unlikely(ap->userdata && ap->ip->i_d.di_extsize && > > - (ap->ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE))) { > > + if (ap->userdata && ap->ip->i_d.di_extsize && > > + (ap->ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE)) { > args.prod = ap->ip->i_d.di_extsize; > > - if ((args.mod = (xfs_extlen_t)do_mod(ap->off, args.prod))) > > + if ((args.mod = (xfs_extlen_t)(do_mod(ap->off, args.prod)))) > > Gratious difference. > > * is >= the stripe unit and the allocation offset is > * at the end of file. > */ > > + atype = args.type; > > I don't quite undersatnd why we'd nee this in one, but not the other. I don't think it's needed in either. Possibly it was added to remove a used-uninitialised warning... > Based onthat my conclusion is that xfs_bmap_filestreams and xfs_bmap_btalloc > should be merged to avoid further maintaince overhead. Yes, agreed - they could be. Christoph - thanks for taking the time to review this code. I'll post a new version in a few days when I've had a chance to incorporate your suggestions... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 14 23:31:27 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 23:31:31 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4F6VPfB022437 for ; Mon, 14 May 2007 23:31:27 -0700 Received: from localhost.adilger.int (S0106000bdb95b39c.cg.shawcable.net [70.72.213.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 61CD44E4595; Tue, 15 May 2007 00:31:22 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 10F4E4078; Tue, 15 May 2007 00:31:21 -0600 (MDT) Date: Tue, 15 May 2007 00:31:21 -0600 From: Andreas Dilger To: "Amit K. 
Arora" Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5][TAKE2] fallocate system call Message-ID: <20070515063120.GI5286@schatzie.adilger.int> Mail-Followup-To: "Amit K. Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070514132926.GA30768@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11431 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 14, 2007 18:59 +0530, Amit K. Arora wrote: > asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > fd: The descriptor of the open file. > > mode*: This specifies the behavior of the system call. Currently the > system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE. > FA_ALLOCATE: Applications can use this mode to preallocate blocks to > a given file (specified by fd). This mode changes the file size if > the preallocation is done beyond the EOF. 
It also updates the > ctime/mtime in the inode of the corresponding file, marking a > successfull allocation. > FA_DEALLOCATE: This mode can be used by applications to deallocate the > previously preallocated blocks. This also may change the file size > and the ctime/mtime. > * New modes might get added in future. One such new mode which is > already under discussion is FA_PREALLOCATE, which when used will > preallocate space but will not change the filesize and [cm]time. > Since the semantics of this new mode is not clear and agreed upon yet, > this patchset does not implement it currently. > > offset: This is the offset in bytes, from where the preallocation should > start. > > len: This is the number of bytes requested for preallocation (from > offset). What is the return value? I'd hope it is the number of bytes preallocated, in case of interrupted preallocation for whatever reason (interrupt, out of space, etc) like a regular write(2) call. In this case the return type needs to also be an loff_t to match @len. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. 
From owner-xfs@oss.sgi.com Tue May 15 01:12:43 2007
Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 01:12:44 -0700 (PDT)
Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4F8CffB011266 for ; Tue, 15 May 2007 01:12:43 -0700
Received: by lucidpixels.com (Postfix, from userid 1001) id 4240BB00B91A; Tue, 15 May 2007 04:12:41 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 1C06E50001A7 for ; Tue, 15 May 2007 04:12:41 -0400 (EDT)
Date: Tue, 15 May 2007 04:12:40 -0400 (EDT)
From: Justin Piszcz
X-X-Sender: jpiszcz@p34.internal.lan
To: xfs@oss.sgi.com
Subject: xfs_db: segfault: error 4
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-archive-position: 11432
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: jpiszcz@lucidpixels.com
Precedence: bulk
X-list: xfs

Kernel: 2.6.21.1

# xfs_db -V
xfs_db version 2.8.18

May 14 22:15:54 p34 kernel: [186121.414224] xfs_db[18999]: segfault at 00000000005b6ff8 rip 00002ac92ced40ce rsp 00007fff7e0a9a68 error 4

While running an xfs_db -c frag -f /dev/md3 (which runs nightly) this is
the first time I have seen this problem.

Justin.
From owner-xfs@oss.sgi.com Tue May 15 02:49:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 02:49:40 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4F9nafB026820 for ; Tue, 15 May 2007 02:49:38 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HntFJ-0002YO-LE; Tue, 15 May 2007 10:23:05 +0100 Date: Tue, 15 May 2007 10:23:05 +0100 From: Christoph Hellwig To: David Chinner Cc: Christoph Hellwig , xfs-dev , xfs-oss Subject: Re: Review: Concurrent Multi-File Data Streams Message-ID: <20070515092305.GA9409@infradead.org> References: <20070511003606.GB85884050@sgi.com> <20070513205953.GA14030@infradead.org> <20070515062327.GI85884050@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070515062327.GI85884050@sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11433 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Tue, May 15, 2007 at 04:23:27PM +1000, David Chinner wrote: > Well.... The mru cache is a wrap-around array of linked lists. i.e. > There's a linked list for each time quanta group, and an array that > holds all the head of each list. As each time quanta expires, we > reclaim the oldest list and move the head pointer to the just > emptied list for the new or newly referenced entries. > > I guess then you're commenting on the fact that it is also indexed by > a radix tree? Yes. > > Given that during QA I've seen the cache grow to over 30,000 > elements (one mru cache entry per cached inode), this cache can grow > very large. 
In that particular test (083 - multiple fsstress at > ENOSPC) each AG had around 2,000 stream references. That's far too > large to search based on linked lists and the cache size variation > pretty much rules out a hashing based solution. Radix tree gives > pretty good lookup performance in these cases.... > > So the issue here is not that we have hundreds of streams but we > have the possibility of having to search hundreds of thousands of > cache objects to find the association for a given inode..... Okay, convinced. > > > We don't want to feed the argument that xfs has lots of > > useless bloated code, do we? :) > > I've got two or three other things lined up that will use the > mru cache so I don't think this is an issue at all... In that case however the code should move into lib/ instead of beeing in XFS. That also means updating it to kernel standard style, e.g. getting rid of all the odd XFS wrappers, removing useless casts, converting the documentation to kerneldoc style, return negative error values, etc.. Probably wants splitting into a separate patch. > > > All the pip != NULL checks are superflous in Linux. A regular > > file can never have a non-null parent inode, and a directory can only > > have a non-NULL parent inode in very odd corner cases involving NFS > > exports, but it has to be connect again once you start doing > > namespace modifying operations on it. > > Yes - I was told you'd said that about the code but I couldn't > understand how or why it was even relevant because the code has > nothing at all to do with dentries or looking up parent inodes. > Now I have the full context.... Actually here I meant a different context :) This is in reference to the xfs_inode.c changes, which are namespace operations only called from the VFS so the normal Linux gurantees should always apply here. > _xfs_filestream_set_ag() is called in two cases here - once without a > parent inode, and once with. 
When we associate a directory with an AG,
> we don't care what its parent association is - we want that directory
> to be associated with the ag we got from _xfs_filestream_pick_ag(), not
> its parent's association.
>
> With regular file inodes we want it to be associated with the parent inode's
> AG so we need to pass in a pip. Hence all the checks for pip being/not being
> NULL are required in this function. It really has nothing to do with
> whether an inode has a parent connected to it in the dentry tree or
> not....

> > There's some naming confusion: xfs_mount.h forward-declares struct
> > xfs_filestream but everything else uses struct fstrm_mnt_data.
> > The former is very non-descriptive and the latter a bit ugly, I'd
> > suggest just putting the mru-cache replacement directly in there
> > as xfs_filestream_cache instead of the wrapping.
>
> I'll look at changing names to something more sensible, but at this
> point I don't see the mru cache going away...

Well in that case s/replacement//. Just have a

	struct mru_cache *m_filestreams;

in struct xfs_mount.

> > Some comments on the actual code in xfs_filestream.c
> >
> > > +#ifdef DEBUG_FILESTREAMS
> > > +#define dprint(fmt, args...)	do {	\
> > > +	printk(KERN_DEBUG "%4d %s: " fmt "\n",	\
> > > +		current_pid(), __FUNCTION__, ##args);	\
> > > +} while(0)
> > > +#else
> > > +#define dprint(args...)	do {} while (0)
> > > +#endif
> >
> > This should probably be killed entirely.
>
> I think it needs to be replaced with real tracing code rather than
> printk()s - this stuff is pretty much impossible to debug in a finite
> time period without some form of tracing telling us what happened.
> Is converting this to ktrace infrastructure acceptable?

Sounds fine to me, that way it's consistent with the rest of XFS. And now
that the kernel tracing work is making progress we might actually be
able to use that in mainline soon.
> > > +	ASSERT(ip->i_d.di_mode & (S_IFREG | S_IFDIR));
> > > +	if (!(ip->i_d.di_mode & (S_IFREG | S_IFDIR)))
> > > +		return NULLAGNUMBER;
> >
> > either the assert or the if clause checking for it, please.
>
> Purely defensive - on a production system we'll return NULLAGNUMBER if
> we get called for the wrong type so the system will silently continue
> without issues. On a debug kernel we'll get an assert failure so we can
> debug why we got here incorrectly.
>
> This is a common way of handling should-not-happen-but-not-fatal error
> conditions in XFS - look at all the places where we have "ASSERT(0)" in
> error cases where a non-debug kernel will just return an error.
>
> What is the accepted way of coding this?

In normal kernel code this would be a BUG() in the taken branch of the if,
that would probably translate to an ASSERT(0) in XFS.

> > Now comes the worst part, the new allocator function
> >
> > If we look at a diff between xfs_bmap_filestreams and xfs_bmap_btalloc
> > we see that it's a pretty bad cut & paste job:
>
> FWIW, it was done that way originally so that it didn't perturb the
> existing allocator code.

That might be a good strategy for delivering an IRIX patch to a customer,
but for long-term maintenance this kind of duplication should rather be
avoided.

> > > 	} else if (ap->low) {
> > > -		args.type = XFS_ALLOCTYPE_START_BNO;
> > > +		args.type = XFS_ALLOCTYPE_FIRST_AG;
> > > 		args.total = args.minlen = ap->minlen;
> >
> > Why is this different?
>
> Because when we are low on space stream associations typically fail
> and we associate with AG 0 in that case.

As Andi already mentioned that might be a bad default and some kind of
round robin might be better. Or just falling back to the default
allocator scheme so we don't get subtle differences.
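To make the round-robin suggestion concrete, the following is one possible shape for such a fallback, written as a stand-alone sketch. It is only an assumption about what "some kind of round robin" could look like: struct mount_sketch, its rotor field and pick_fallback_ag() are hypothetical names, not anything from the patch under review, and the sketch ignores locking.

```c
#include <assert.h>

#define NULLAGNUMBER	((unsigned int)-1)	/* "no AG" sentinel */

/* Hypothetical per-mount state for the sketch. */
struct mount_sketch {
	unsigned int agcount;	/* number of allocation groups */
	unsigned int rotor;	/* next AG to hand out on fallback */
};

/*
 * Use the stream's AG if an association exists; otherwise rotate
 * through the AGs instead of always falling back to AG 0.
 */
unsigned int pick_fallback_ag(struct mount_sketch *mp, unsigned int stream_ag)
{
	unsigned int ag;

	if (stream_ag != NULLAGNUMBER)
		return stream_ag;

	ag = mp->rotor;
	mp->rotor = (mp->rotor + 1) % mp->agcount;
	return ag;
}
```

Compared with always choosing AG 0 when the association lookup fails, a rotor like this spreads low-space allocations across the AGs. A real implementation would need to update the rotor atomically or under a lock.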
From owner-xfs@oss.sgi.com Tue May 15 05:15:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 05:15:17 -0700 (PDT) Received: from gk.uu.epigenomics.net (gk.uu.epigenomics.net [195.127.125.226]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FCFBfB025527 for ; Tue, 15 May 2007 05:15:13 -0700 Received: (qmail 2398 invoked from network); 15 May 2007 11:48:30 -0000 Received: from perl.epigenomics.epi (192.168.48.4) by salam.epigenomics.epi with SMTP; 15 May 2007 11:48:30 -0000 Received: (qmail 24312 invoked by uid 9); 15 May 2007 11:48:30 -0000 From: linux-xfs@ml.epigenomics.com X-Newsgroups: epi.ml.linux.xfs Subject: xfs_repair: buf calloc failed (4132 bytes): Cannot allocate memory Date: Tue, 15 May 2007 11:48:30 +0000 (UTC) Organization: Epigenomics AG Lines: 73 Message-ID: X-Complaints-To: usenet@epigenomics.net User-Agent: slrn/0.9.8.1pl1 (Debian) To: xfs@oss.sgi.com X-archive-position: 11434 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: linux-xfs@ml.epigenomics.com Precedence: bulk X-list: xfs Hi! We have a RAID0 set of 3 400GB disks. After a crash we needed to run xfs_repair, but it bails out with the error message: - ensuring existence of lost+found directory - traversing filesystem starting at / ... xfs_repair: buf calloc failed (4132 bytes): Cannot allocate memory The filesystem contains many hardlinked files as it is a dirvish repository (www.dirvish.org) with the hardlinks created by rsync. 
This is the xfs_db info: # xfs_db -r -c "sb 0" -c "p" /dev/md0 magicnum = 0x58465342 blocksize = 4096 dblocks = 293031424 rblocks = 0 rextents = 0 uuid = e8d3a22c-716f-4f3e-9e95-e06afb3559d0 logstart = 268435472 rootino = 256 rbmino = 257 rsumino = 258 rextsize = 48 agblocks = 9157232 agcount = 32 rbmblocks = 0 logblocks = 32768 versionnum = 0x3184 sectsize = 512 inodesize = 256 inopblock = 16 fname = "\000\000\000\000\000\000\000\000\000\000\000\000" blocklog = 12 sectlog = 9 inodelog = 8 inopblog = 4 agblklog = 24 rextslog = 0 inprogress = 0 imax_pct = 25 icount = 18882496 ifree = 373596 fdblocks = 27494887 frextents = 0 uquotino = 0 gquotino = 0 qflags = 0 flags = 0 shared_vn = 0 inoalignmt = 2 unit = 16 width = 48 dirblklog = 0 logsectlog = 0 logsectsize = 0 logsunit = 0 features2 = 0 Kernel is 2.6.20.6 on a dual PIII machine with 1GB RAM and 10GB swap. Mounting the filesystem is possible, but what about its current state? Greetings -- Robert Sander Senior Manager Information Systems Epigenomics AG Kleine Praesidentenstr. 
1 10178 Berlin, Germany phone:+49-30-24345-0 fax:+49-30-24345-555 http://www.epigenomics.com robert.sander@epigenomics.com From owner-xfs@oss.sgi.com Tue May 15 05:40:25 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 05:40:30 -0700 (PDT) Received: from e5.ny.us.ibm.com (e5.ny.us.ibm.com [32.97.182.145]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FCeNfB031295 for ; Tue, 15 May 2007 05:40:25 -0700 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e5.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4FCeN0N009204 for ; Tue, 15 May 2007 08:40:23 -0400 Received: from d01av03.pok.ibm.com (d01av03.pok.ibm.com [9.56.224.217]) by d01relay02.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4FCeNkl492006 for ; Tue, 15 May 2007 08:40:23 -0400 Received: from d01av03.pok.ibm.com (loopback [127.0.0.1]) by d01av03.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4FCeM5B024328 for ; Tue, 15 May 2007 08:40:23 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av03.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4FCeLZX024194; Tue, 15 May 2007 08:40:21 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id F3A7D94C82; Tue, 15 May 2007 18:10:26 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4FCeLsU004120; Tue, 15 May 2007 18:10:21 +0530 Date: Tue, 15 May 2007 18:10:20 +0530 From: "Amit K. 
Arora" To: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5][TAKE2] fallocate system call Message-ID: <20070515124020.GA12964@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070515063120.GI5286@schatzie.adilger.int> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070515063120.GI5286@schatzie.adilger.int> User-Agent: Mutt/1.4.1i X-archive-position: 11435 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Tue, May 15, 2007 at 12:31:21AM -0600, Andreas Dilger wrote: > On May 14, 2007 18:59 +0530, Amit K. Arora wrote: > > asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > > > fd: The descriptor of the open file. > > > > mode*: This specifies the behavior of the system call. Currently the > > system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE. > > FA_ALLOCATE: Applications can use this mode to preallocate blocks to > > a given file (specified by fd). This mode changes the file size if > > the preallocation is done beyond the EOF. It also updates the > > ctime/mtime in the inode of the corresponding file, marking a > > successfull allocation. > > FA_DEALLOCATE: This mode can be used by applications to deallocate the > > previously preallocated blocks. 
This also may change the file size > > and the ctime/mtime. > > * New modes might get added in future. One such new mode which is > > already under discussion is FA_PREALLOCATE, which when used will > > preallocate space but will not change the filesize and [cm]time. > > Since the semantics of this new mode is not clear and agreed upon yet, > > this patchset does not implement it currently. > > > > offset: This is the offset in bytes, from where the preallocation should > > start. > > > > len: This is the number of bytes requested for preallocation (from > > offset). > > What is the return value? I'd hope it is the number of bytes preallocated, > in case of interrupted preallocation for whatever reason (interrupt, out of > space, etc) like a regular write(2) call. In this case the return type needs > to also be an loff_t to match @len. The return value in the current implementation has been kept as "long", where zero is returned for success and an error on failure. This is done to keep it in line with posix_fallocate() behavior. This point was brought up some time back by Badari. At that time it was decided to keep it the way posix_fallocate() is designed. Here are the posts related to this: Still, if you feel that we should be returning the number of bytes preallocated, we can again ask for opinions here. Thanks!
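For reference, the posix_fallocate() convention mentioned above is to return 0 on success and the error number itself on failure; it neither sets errno nor reports a byte count. A minimal user-space sketch, assuming a POSIX system; the helper name reserve_bytes() is made up for illustration:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Reserve len bytes starting at offset in the file behind fd.
 * Returns 0 on success or a positive error number (EBADF, EINVAL,
 * ENOSPC, ...), exactly as posix_fallocate() reports it.
 */
static int reserve_bytes(int fd, off_t offset, off_t len)
{
	int rc = posix_fallocate(fd, offset, len);

	/* Error numbers are positive; "-1 plus errno" is not the convention. */
	assert(rc >= 0);
	if (rc != 0)
		fprintf(stderr, "preallocation failed: %s\n", strerror(rc));
	return rc;
}
```

A caller therefore cannot learn how much of the range was reserved when the call fails, which is the gap the question about returning a byte count is pointing at.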
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Tue May 15 06:24:04 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 06:24:08 -0700 (PDT) Received: from e31.co.us.ibm.com (e31.co.us.ibm.com [32.97.110.149]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FDO1fB007752 for ; Tue, 15 May 2007 06:24:04 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e31.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4FDNw43004993 for ; Tue, 15 May 2007 09:23:58 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4FDNw4o269312 for ; Tue, 15 May 2007 07:23:58 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4FDNvU0029955 for ; Tue, 15 May 2007 07:23:58 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4FDNuUx029126; Tue, 15 May 2007 07:23:57 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id A399C94C82; Tue, 15 May 2007 18:53:53 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4FDNrws021935; Tue, 15 May 2007 18:53:53 +0530 Date: Tue, 15 May 2007 18:53:53 +0530 From: "Amit K. 
Arora" To: Stephen Rothwell Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc Message-ID: <20070515132353.GB12964@amitarora.in.ibm.com> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070514142820.GA31468@amitarora.in.ibm.com> <20070514144524.GA31748@amitarora.in.ibm.com> <20070515094436.d441098f.sfr@canb.auug.org.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070515094436.d441098f.sfr@canb.auug.org.au> User-Agent: Mutt/1.4.1i X-archive-position: 11436 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Tue, May 15, 2007 at 09:44:36AM +1000, Stephen Rothwell wrote: > On Mon, 14 May 2007 20:15:24 +0530 "Amit K. Arora" wrote: > > > > This patch implements sys_fallocate() and adds support on i386, x86_64 > > and powerpc platforms. > > This patch no longer applies to Linus' tree - for a start there is no file > arch/x86_64/kernel/functionlist any more. > > Can you rebase it, please? I will rebase it to 2.6.22-rc1 and repost the patches soon. Thanks! 
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Tue May 15 12:24:23 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 12:24:28 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FJOKfB016847 for ; Tue, 15 May 2007 12:24:21 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id 680712C8047; Tue, 15 May 2007 12:23:30 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 05DDE2C8043; Tue, 15 May 2007 12:23:28 -0700 (PDT) Received: from [75.208.159.192] (192.sub-75-208-159.myvzw.com [75.208.159.192]) by lurch.goop.org (Postfix) with ESMTP; Tue, 15 May 2007 12:23:27 -0700 (PDT) Message-ID: <464A08DC.7030303@goop.org> Date: Tue, 15 May 2007 12:24:12 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Jan Engelhardt , Chuck Ebbert , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070512135143.GG85884050@sgi.com> In-Reply-To: <20070512135143.GG85884050@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11437 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs David Chinner wrote: > A patch for you to try, Jeremy. I've just started a test run on it... > OK, it seems to work. I haven't given it an overnight run, but its run longer without failing than it did before. 
J From owner-xfs@oss.sgi.com Tue May 15 12:37:23 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 12:37:27 -0700 (PDT) Received: from e4.ny.us.ibm.com (e4.ny.us.ibm.com [32.97.182.144]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FJbMfB019056 for ; Tue, 15 May 2007 12:37:23 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e4.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4FJbL1g008123 for ; Tue, 15 May 2007 15:37:21 -0400 Received: from d01av03.pok.ibm.com (d01av03.pok.ibm.com [9.56.224.217]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4FJbL8M528288 for ; Tue, 15 May 2007 15:37:21 -0400 Received: from d01av03.pok.ibm.com (loopback [127.0.0.1]) by d01av03.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4FJbKHH022655 for ; Tue, 15 May 2007 15:37:21 -0400 Received: from amitarora.in.ibm.com ([9.126.238.191]) by d01av03.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4FJbIoq022564; Tue, 15 May 2007 15:37:19 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 8853010CFF8; Wed, 16 May 2007 01:07:25 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4FJbN4Q026374; Wed, 16 May 2007 01:07:23 +0530 Date: Wed, 16 May 2007 01:07:22 +0530 From: "Amit K. 
Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 0/5][TAKE3] fallocate system call Message-ID: <20070515193722.GA3487@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070426175056.GA25321@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11438 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ P L E A S E N O T E : *********************** 1. Patches have been now rebased to 2.6.22-rc1 kernel. Earlier they were based on 2.6.21. 2. An unnecessary export of symbol is removed from the ext4 preallocate patch. Details in the corresponding post (PATCH 4/5). 3. Return type now described in the interface description below. 4. Besides above points, everything is exactly same as TAKE2. -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ This is the new set of patches which take care of the review comments received from the community (mainly from Andrew). Description: ----------- fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. 
Each file system implementation that wants to use this feature will need to support an inode operation called fallocate. Applications can use this feature to avoid fragmentation to a certain level and thus get faster access speed. With preallocation, applications also get a guarantee of space for particular file(s) - even if later the system becomes full. Currently, glibc provides an interface called posix_fallocate() which can be used for a similar purpose. Though this has the advantage of working on all file systems, it is quite slow (since it writes zeroes to each block that has to be preallocated). Without a doubt, file systems can do this more efficiently within the kernel, by implementing the proposed fallocate() system call. It is expected that posix_fallocate() will be modified to call this new system call first and, in case the kernel/filesystem does not implement it, it should fall back to the current implementation of writing zeroes to the new blocks. Interface: --------- The proposed system call's layout is: asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) fd: The descriptor of the open file. mode*: This specifies the behavior of the system call. Currently the system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE. FA_ALLOCATE: Applications can use this mode to preallocate blocks to a given file (specified by fd). This mode changes the file size if the preallocation is done beyond the EOF. It also updates the ctime/mtime in the inode of the corresponding file, marking a successful allocation. FA_DEALLOCATE: This mode can be used by applications to deallocate the previously preallocated blocks. This also may change the file size and the ctime/mtime. * New modes might get added in future. One such new mode which is already under discussion is FA_PREALLOCATE, which when used will preallocate space but will not change the filesize and [cm]time.
Since the semantics of this new mode are not clear and agreed upon yet, this patchset does not implement it currently. offset: This is the offset in bytes, from where the preallocation should start. len: This is the number of bytes requested for preallocation (from offset). RETURN VALUE: The system call returns 0 on success and an error on failure. This is done to keep the semantics the same as those of posix_fallocate(). sys_fallocate() on s390: ----------------------- There is a problem implementing sys_fallocate() on the s390 ABI with the proposed order of arguments. Martin Schwidefsky has suggested a patch to solve this problem which makes use of a wrapper in the kernel. This will require special handling of this system call on s390 in glibc as well. Still, this seems to be the best solution so far. Known Problem: ------------- mmapped writes into uninitialized extents are a known problem with the current ext4 patches. Like XFS, ext4 may need to implement ->page_mkwrite() to solve this. See: Since there is talk of ->fault() replacing ->page_mkwrite() and also with a generic block_page_mkwrite() implementation already posted, we can implement this some time later. See: ToDos: ----- 1> Implementation on other architectures (other than i386, x86_64, ppc64 and s390(x)). David Chinner has already posted a patch for ia64. 2> A generic file system operation to handle fallocate (generic_fallocate), for filesystems that do _not_ have the fallocate inode operation implemented. 3> Changes to glibc, a) to support fallocate() system call b) to make posix_fallocate() and posix_fallocate64() call fallocate() Changelog: --------- Each post will have an individual changelog for a particular patch.
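The glibc change listed under ToDo item 3 (try the in-kernel path first, fall back to writing zeroes when the kernel or file system does not implement it) could look roughly like the sketch below. This is an illustration, not the glibc implementation: prealloc_fn, emulate_by_writing_zeroes() and fallocate_with_fallback() are hypothetical names, the fast path is passed in as a callback that returns 0 or a negative errno value (mirroring the kernel convention in this patch set), and the emulation simply overwrites the whole range with zeroes, so it is only safe for ranges that hold no data yet.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600
#endif
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fast path: returns 0 or a negative errno value. */
typedef long (*prealloc_fn)(int fd, off_t offset, off_t len);

/* Slow path: write zeroes over [offset, offset + len). */
static int emulate_by_writing_zeroes(int fd, off_t offset, off_t len)
{
	char zeroes[4096];

	memset(zeroes, 0, sizeof(zeroes));
	while (len > 0) {
		size_t chunk = len < (off_t)sizeof(zeroes) ?
				(size_t)len : sizeof(zeroes);
		ssize_t done = pwrite(fd, zeroes, chunk, offset);

		if (done < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry this chunk */
			return errno;
		}
		if (done == 0)
			return EIO;	/* no progress, give up */
		offset += done;
		len -= done;
	}
	return 0;
}

/* Returns 0 on success or a positive error number, like posix_fallocate(). */
static int fallocate_with_fallback(prealloc_fn fast, int fd,
				   off_t offset, off_t len)
{
	long ret;

	if (offset < 0 || len <= 0)
		return EINVAL;

	/* A missing fast path is treated like an unimplemented syscall. */
	ret = fast ? fast(fd, offset, len) : -ENOSYS;
	assert(ret <= 0);	/* fast-path contract: 0 or negative errno */
	if (ret == 0)
		return 0;

	/* Only "not implemented" results trigger the emulation. */
	if (ret != -ENOSYS && ret != -EOPNOTSUPP)
		return (int)-ret;

	return emulate_by_writing_zeroes(fd, offset, len);
}
```

Any other failure from the fast path, such as ENOSPC, is passed straight back to the caller, since writing zeroes would not succeed in that case either.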
Following patches follow: Patch 1/5 : fallocate() implementation on i86, x86_64 and powerpc Patch 2/5 : fallocate() on s390 Patch 3/5 : ext4: Extent overlap bugfix Patch 4/5 : ext4: fallocate support in ext4 Patch 5/5 : ext4: write support for preallocated blocks -- Regards, Amit Arora From owner-xfs@oss.sgi.com Tue May 15 13:04:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 13:04:07 -0700 (PDT) Received: from e31.co.us.ibm.com (e31.co.us.ibm.com [32.97.110.149]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FK40fB023066 for ; Tue, 15 May 2007 13:04:03 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e31.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4FK3umn028133 for ; Tue, 15 May 2007 16:03:56 -0400 Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4FK3uXw270492 for ; Tue, 15 May 2007 14:03:56 -0600 Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4FK3tN8001616 for ; Tue, 15 May 2007 14:03:56 -0600 Received: from amitarora.in.ibm.com ([9.126.238.191]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4FK3rtw001348; Tue, 15 May 2007 14:03:54 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 66FD010CFF8; Wed, 16 May 2007 01:34:01 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4FK40Tf008752; Wed, 16 May 2007 01:34:00 +0530 Date: Wed, 16 May 2007 01:33:59 +0530 From: "Amit K. 
Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc Message-ID: <20070515200359.GA5834@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070515193722.GA3487@amitarora.in.ibm.com> <20070515195421.GA2948@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070515195421.GA2948@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11439 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch implements sys_fallocate() and adds support on i386, x86_64 and powerpc platforms. Changelog: --------- Note: The changes below are from the initial post (dated 26th April, 2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel version on which this patch is based. TAKE2 was based on 2.6.21 and this is based on 2.6.22-rc1. Following changes were made to the previous version: 1) Added description before sys_fallocate() definition. 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to, posix_fallocate should return EINVAL for len <= 0. 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE 4) Do not return ENODEV for dirs (let individual file systems decide if they want to support preallocation to directories or not. 
5) Check for wrap through zero. 6) Update c/mtime if fallocate() succeeds. 7) Added mode descriptions in fs.h 8) Added variable names to function definition (fallocate inode op) Here is the new patch: Signed-off-by: Amit Arora --- arch/i386/kernel/syscall_table.S | 1 arch/powerpc/kernel/sys_ppc32.c | 7 +++ arch/x86_64/ia32/ia32entry.S | 1 fs/open.c | 89 +++++++++++++++++++++++++++++++++++++++ include/asm-i386/unistd.h | 3 - include/asm-powerpc/systbl.h | 1 include/asm-powerpc/unistd.h | 3 - include/asm-x86_64/unistd.h | 2 include/linux/fs.h | 13 +++++ include/linux/syscalls.h | 1 10 files changed, 119 insertions(+), 2 deletions(-) Index: linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S =================================================================== --- linux-2.6.22-rc1.orig/arch/i386/kernel/syscall_table.S +++ linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S @@ -323,3 +323,4 @@ ENTRY(sys_call_table) .long sys_signalfd .long sys_timerfd .long sys_eventfd + .long sys_fallocate Index: linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c =================================================================== --- linux-2.6.22-rc1.orig/arch/powerpc/kernel/sys_ppc32.c +++ linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c @@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con return sys_truncate(path, (high << 32) | low); } +asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo, + u32 lenhi, u32 lenlo) +{ + return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo, + ((loff_t)lenhi << 32) | lenlo); +} + asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high, unsigned long low) { Index: linux-2.6.22-rc1/fs/open.c =================================================================== --- linux-2.6.22-rc1.orig/fs/open.c +++ linux-2.6.22-rc1/fs/open.c @@ -353,6 +353,95 @@ asmlinkage long sys_ftruncate64(unsigned #endif /* + * sys_fallocate - preallocate blocks or free preallocated blocks + * @fd: the file descriptor + * 
@mode: mode specifies if fallocate should preallocate blocks OR free + * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and + * FA_DEALLOCATE modes are supported. + * @offset: The offset within file, from where (un)allocation is being + * requested. It should not have a negative value. + * @len: The amount (in bytes) of space to be (un)allocated, from the offset. + * + * This system call, depending on the mode, preallocates or unallocates blocks + * for a file. The range of blocks depends on the value of offset and len + * arguments provided by the user/application. For FA_ALLOCATE mode, if this + * system call succeeds, subsequent writes to the file in the given range + * (specified by offset & len) should not fail - even if the file system + * later becomes full. Hence the preallocation done is persistent (valid + * even after reopen of the file and remount/reboot). + * + * Note: In case the file system does not support preallocation, + * posix_fallocate() should fall back to the library implementation (i.e. + * allocating zero-filled new blocks to the file). + * + * Return Values + * 0 : On SUCCESS a value of zero is returned. + * error : On Failure, an error code will be returned. + * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate() + * fall back on library implementation of fallocate. + * + * Generic fallocate to be added for file systems that do not + * support fallocate.
+ */ +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (offset < 0 || len <= 0) + goto out; + + /* Return error if mode is not supported */ + ret = -EOPNOTSUPP; + if (mode != FA_ALLOCATE && mode !=FA_DEALLOCATE) + goto out; + + ret = -EBADF; + file = fget(fd); + if (!file) + goto out; + if (!(file->f_mode & FMODE_WRITE)) + goto out_fput; + + inode = file->f_path.dentry->d_inode; + + ret = -ESPIPE; + if (S_ISFIFO(inode->i_mode)) + goto out_fput; + + ret = -ENODEV; + /* + * Let individual file system decide if it supports preallocation + * for directories or not. + */ + if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode)) + goto out_fput; + + ret = -EFBIG; + /* Check for wrap through zero too */ + if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0)) + goto out_fput; + + if (inode->i_op && inode->i_op->fallocate) + ret = inode->i_op->fallocate(inode, mode, offset, len); + else + ret = -ENOSYS; + + /* + * Update [cm]time. + * Partial allocation will not result in the time stamp changes, + * since ->fallocate will return error (say, -ENOSPC) in this case. + */ + if (!ret) + file_update_time(file); +out_fput: + fput(file); +out: + return ret; +} + +/* * access() needs to use the real uid/gid, not the effective uid/gid. * We do this by temporarily clearing all FS-related capabilities and * switching the fsuid/fsgid around to the real ones. 
Index: linux-2.6.22-rc1/include/asm-i386/unistd.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-i386/unistd.h +++ linux-2.6.22-rc1/include/asm-i386/unistd.h @@ -329,10 +329,11 @@ #define __NR_signalfd 321 #define __NR_timerfd 322 #define __NR_eventfd 323 +#define __NR_fallocate 324 #ifdef __KERNEL__ -#define NR_syscalls 324 +#define NR_syscalls 325 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.22-rc1/include/asm-powerpc/systbl.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-powerpc/systbl.h +++ linux-2.6.22-rc1/include/asm-powerpc/systbl.h @@ -308,3 +308,4 @@ COMPAT_SYS_SPU(move_pages) SYSCALL_SPU(getcpu) COMPAT_SYS(epoll_pwait) COMPAT_SYS_SPU(utimensat) +COMPAT_SYS(fallocate) Index: linux-2.6.22-rc1/include/asm-powerpc/unistd.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-powerpc/unistd.h +++ linux-2.6.22-rc1/include/asm-powerpc/unistd.h @@ -327,10 +327,11 @@ #define __NR_getcpu 302 #define __NR_epoll_pwait 303 #define __NR_utimensat 304 +#define __NR_fallocate 305 #ifdef __KERNEL__ -#define __NR_syscalls 305 +#define __NR_syscalls 306 #define __NR__exit __NR_exit #define NR_syscalls __NR_syscalls Index: linux-2.6.22-rc1/include/asm-x86_64/unistd.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-x86_64/unistd.h +++ linux-2.6.22-rc1/include/asm-x86_64/unistd.h @@ -630,6 +630,8 @@ __SYSCALL(__NR_signalfd, sys_signalfd) __SYSCALL(__NR_timerfd, sys_timerfd) #define __NR_eventfd 283 __SYSCALL(__NR_eventfd, sys_eventfd) +#define __NR_fallocate 284 +__SYSCALL(__NR_fallocate, sys_fallocate) #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.22-rc1/include/linux/fs.h =================================================================== --- linux-2.6.22-rc1.orig/include/linux/fs.h +++ 
linux-2.6.22-rc1/include/linux/fs.h @@ -266,6 +266,17 @@ extern int dir_notify_enable; #define SYNC_FILE_RANGE_WRITE 2 #define SYNC_FILE_RANGE_WAIT_AFTER 4 +/* + * sys_fallocate modes + * Currently sys_fallocate supports two modes: + * FA_ALLOCATE : This is the preallocate mode, using which an application/user + * may request (pre)allocation of blocks. + * FA_DEALLOCATE: This is the deallocate mode, which can be used to free + * the preallocated blocks. + */ +#define FA_ALLOCATE 0x1 +#define FA_DEALLOCATE 0x2 + #ifdef __KERNEL__ #include @@ -1137,6 +1148,8 @@ struct inode_operations { ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); void (*truncate_range)(struct inode *, loff_t, loff_t); + long (*fallocate)(struct inode *inode, int mode, loff_t offset, + loff_t len); }; struct seq_file; Index: linux-2.6.22-rc1/include/linux/syscalls.h =================================================================== --- linux-2.6.22-rc1.orig/include/linux/syscalls.h +++ linux-2.6.22-rc1/include/linux/syscalls.h @@ -608,6 +608,7 @@ asmlinkage long sys_signalfd(int ufd, si asmlinkage long sys_timerfd(int ufd, int clockid, int flags, const struct itimerspec __user *utmr); asmlinkage long sys_eventfd(unsigned int count); +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len); int kernel_execve(const char *filename, char *const argv[], char *const envp[]); Index: linux-2.6.22-rc1/arch/x86_64/ia32/ia32entry.S =================================================================== --- linux-2.6.22-rc1.orig/arch/x86_64/ia32/ia32entry.S +++ linux-2.6.22-rc1/arch/x86_64/ia32/ia32entry.S @@ -719,4 +719,5 @@ ia32_sys_call_table: .quad compat_sys_signalfd .quad compat_sys_timerfd .quad sys_eventfd + .quad sys_fallocate ia32_syscall_end: From owner-xfs@oss.sgi.com Tue May 15 13:10:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 13:10:42 -0700 (PDT) Received: from e1.ny.us.ibm.com (e1.ny.us.ibm.com 
[32.97.182.141]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FKAbfB024550 for ; Tue, 15 May 2007 13:10:38 -0700
Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e1.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4FKAaUa023072 for ; Tue, 15 May 2007 16:10:36 -0400
Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4FKAaP7557306 for ; Tue, 15 May 2007 16:10:36 -0400
Received: from d01av04.pok.ibm.com (loopback [127.0.0.1]) by d01av04.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4FKAaG5010348 for ; Tue, 15 May 2007 16:10:36 -0400
Received: from amitarora.in.ibm.com ([9.126.238.191]) by d01av04.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4FKAYbe010301; Tue, 15 May 2007 16:10:35 -0400
Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 310A410CFF8; Wed, 16 May 2007 01:40:42 +0530 (IST)
Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4FKAf4G012434; Wed, 16 May 2007 01:40:41 +0530
Date: Wed, 16 May 2007 01:40:40 +0530
From: "Amit K. Arora"
To: torvalds@osdl.org, akpm@linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com
Subject: [PATCH 2/5][TAKE3] fallocate() on s390
Message-ID: <20070515201040.GB5834@amitarora.in.ibm.com>
References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070515193722.GA3487@amitarora.in.ibm.com> <20070515195421.GA2948@amitarora.in.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20070515195421.GA2948@amitarora.in.ibm.com>
User-Agent: Mutt/1.4.1i
X-archive-position: 11440
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: aarora@linux.vnet.ibm.com
Precedence: bulk
X-list: xfs

This is the patch suggested by Martin Schwidefsky to support
sys_fallocate() on the s390(x) platform.

He also suggested a wrapper in glibc to handle this system call on s390.
I am posting it here so that we get feedback on it too.

.globl __fallocate
ENTRY(__fallocate)
	stm	%r6,%r7,28(%r15)	/* save %r6/%r7 on stack */
	cfi_offset (%r7, -68)
	cfi_offset (%r6, -72)
	lm	%r6,%r7,96(%r15)	/* load loff_t len from stack */
	svc	SYS_ify(fallocate)
	lm	%r6,%r7,28(%r15)	/* restore %r6/%r7 from stack */
	br	%r14
PSEUDO_END(__fallocate)

Here are his comments and his patch to the Linux kernel.

-------------

From: Martin Schwidefsky

This patch implements support for the fallocate system call on the s390(x)
platform. A wrapper is added to address the issue which the s390 ABI has with
the arguments of this system call.
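
For illustration only, and not part of the patch: the argument issue is that
on 31-bit s390 the 64-bit "len" arrives in the kernel as two 32-bit words
(high and low) and has to be joined back into one loff_t before
sys_fallocate() is called. A minimal sketch of that joining step in portable
C follows. The helper name join_halves() is an assumption made for this
example; the patch itself uses a union, which relies on s390 being
big-endian, whereas the shift below does not depend on byte order.

```c
#include <stdint.h>

/*
 * Hypothetical helper (name invented for this sketch, not in the patch):
 * rebuild a 64-bit length from the two 32-bit words it was split into
 * for the system call. The shift/or form is independent of byte order.
 */
static uint64_t join_halves(uint32_t high, uint32_t low)
{
	return ((uint64_t)high << 32) | low;
}
```

For example, join_halves(0x1, 0x80000000) gives 0x180000000, i.e. a length
of 6 GiB, which cannot be passed in a single 32-bit register.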
Signed-off-by: Martin Schwidefsky
---
 arch/s390/kernel/compat_wrapper.S |   10 ++++++++++
 arch/s390/kernel/sys_s390.c       |   29 +++++++++++++++++++++++++++++
 arch/s390/kernel/syscalls.S       |    1 +
 include/asm-s390/unistd.h         |    3 ++-
 4 files changed, 42 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S
===================================================================
--- linux-2.6.22-rc1.orig/arch/s390/kernel/compat_wrapper.S
+++ linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S
@@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper:
 	llgtr	%r2,%r2			# char *
 	llgtr	%r3,%r3			# struct compat_timeval *
 	jg	compat_sys_utimes
+
+	.globl	sys_fallocate_wrapper
+sys_fallocate_wrapper:
+	lgfr	%r2,%r2			# int
+	lgfr	%r3,%r3			# int
+	sllg	%r4,%r4,32		# get high word of 64bit loff_t
+	lr	%r4,%r5			# get low word of 64bit loff_t
+	sllg	%r5,%r6,32		# get high word of 64bit loff_t
+	l	%r5,164(%r15)		# get low word of 64bit loff_t
+	jg	sys_fallocate
Index: linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c
===================================================================
--- linux-2.6.22-rc1.orig/arch/s390/kernel/sys_s390.c
+++ linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c
@@ -265,3 +265,32 @@ s390_fadvise64_64(struct fadvise64_64_ar
 		return -EFAULT;
 	return sys_fadvise64_64(a.fd, a.offset, a.len, a.advice);
 }
+
+#ifndef CONFIG_64BIT
+/*
+ * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
+ * 64 bit argument "len" is split into the upper and lower 32 bits. The
+ * system call wrapper in the user space loads the value to %r6/%r7.
+ * The code in entry.S keeps the values in %r2 - %r6 where they are and
+ * stores %r7 to 96(%r15). But the standard C linkage requires that
+ * the whole 64 bit value for len is stored on the stack and doesn't
+ * use %r6 at all. So s390_fallocate has to convert the arguments from
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
+ * to
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
+ */
+asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
+			       u32 len_high, u32 len_low)
+{
+	union {
+		u64 len;
+		struct {
+			u32 high;
+			u32 low;
+		};
+	} cv;
+	cv.high = len_high;
+	cv.low = len_low;
+	return sys_fallocate(fd, mode, offset, cv.len);
+}
+#endif
Index: linux-2.6.22-rc1/arch/s390/kernel/syscalls.S
===================================================================
--- linux-2.6.22-rc1.orig/arch/s390/kernel/syscalls.S
+++ linux-2.6.22-rc1/arch/s390/kernel/syscalls.S
@@ -322,3 +322,4 @@ NI_SYSCALL /* 310 sys_move_pages *
 SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
 SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
 SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
Index: linux-2.6.22-rc1/include/asm-s390/unistd.h
===================================================================
--- linux-2.6.22-rc1.orig/include/asm-s390/unistd.h
+++ linux-2.6.22-rc1/include/asm-s390/unistd.h
@@ -251,8 +251,9 @@
 #define __NR_getcpu		311
 #define __NR_epoll_pwait	312
 #define __NR_utimes		313
+#define __NR_fallocate		314
 
-#define NR_syscalls 314
+#define NR_syscalls 315
 
 /*
  * There are some system calls that are not present on 64 bit, some

From owner-xfs@oss.sgi.com Tue May 15 13:13:31 2007
Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 13:13:37 -0700 (PDT)
Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FKDUfB025378 for ; Tue, 15 May 2007 13:13:31 -0700
Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e32.co.us.ibm.com (8.12.11.20060308/8.13.8) with ESMTP id l4FK9pSH002775 for ; Tue, 15 May 2007 16:09:51 -0400
Received: from
d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4FKDQ8U269646 for ; Tue, 15 May 2007 14:13:26 -0600 Received: from d03av04.bo