From owner-xfs@oss.sgi.com Tue May 1 07:21:11 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 07:21:13 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l41EL5fB015382 for ; Tue, 1 May 2007 07:21:08 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id AAA04161; Wed, 2 May 2007 00:20:54 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l41EKqAf81627789; Wed, 2 May 2007 00:20:53 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l41EKn6U80735839; Wed, 2 May 2007 00:20:49 +1000 (AEST) Date: Wed, 2 May 2007 00:20:49 +1000 From: David Chinner To: Nicholas Miell Cc: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070501142049.GG77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1177994346.3362.5.camel@entropy> User-Agent: Mutt/1.4.2.1i X-archive-position: 11237 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, Apr 30, 2007 at 09:39:06PM -0700, Nicholas Miell wrote: > On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote: > > On Mon, Apr 30, 2007 at 04:44:01PM 
-0600, Andreas Dilger wrote: > > > This is actually for future use. Any flags that are added into this > > > range must be understood by both sides or it should be considered an > > > error. Flags outside the FIEMAP_FLAG_INCOMPAT do not necessarily need > > > to be supported. If it turns out that 8 bits is too small a range for > > > INCOMPAT flags, then we can make 0x01000000 an incompat flag that means > > > e.g. 0x00ff0000 are also incompat flags also. > > > > Ah, ok. So it's not really a set of "compatibility" flags, it's more a > > "compulsory" set. Under those terms, i don't really see why this is > > necessary - either the filesystem will understand the flags or it will > > return EINVAL or ignore them... > > > > > I'm assuming that all flags that will be in the original FIEMAP proposal > > > will be understood by the implementations. Most filesystems can safely > > > ignore FLAG_HSM_READ, for example, since they don't support HSM, and for > > > that matter FLAG_SYNC is probably moot for most filesystems also because > > > they do block allocation at preprw time. > > > > Exactly my point - so why do we really need to encode a compulsory set of > > > > Because flags have meaning, independent of whether or not the filesystem > understands them. And if the filesystem chooses to ignore critically > important flags (instead of returning EINVAL), bad things may happen. > > So, either the filesystem will understand the flag or iff the unknown flag > is in the incompat set, it will return EINVAL or else the unknown flag will > be safely ignored. My point was that there is a difference between specification and implementation - if the specification says something is compulsory, then they must be implemented in the filesystem. This is easy enough to ensure by code review - we don't need additional interface complexity for this.... Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Tue May 1 11:38:23 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 11:38:27 -0700 (PDT) Received: from ppsw-9.csi.cam.ac.uk (ppsw-9.csi.cam.ac.uk [131.111.8.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l41IcLfB004929 for ; Tue, 1 May 2007 11:38:23 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from altaparmakov.plus.com ([212.159.79.82]:49945 helo=[192.168.1.64]) by ppsw-9.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.159]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HixE1-00068o-W4 (Exim 4.63) (return-path ); Tue, 01 May 2007 19:37:22 +0100 In-Reply-To: <20070501042254.GD77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <1FA8E92B-954D-4624-A089-80D4AA7399FD@cam.ac.uk> Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Tue, 1 May 2007 19:37:20 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11238 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 1 May 2007, at 05:22, David Chinner wrote: > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: >> The FIBMAP ioctl is for privileged users >> only, and I wonder if FIEMAP 
should be the same, or at least >> disallow >> mapping files that the user can't access especially with >> FLAG_SYNC and/or >> FLAG_HSM_READ. > > I see little reason for restricting FI[BE]MAP to privileged users - > anyone should be able to determine if files they have permission to > access are fragmented. Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the machine. Perhaps for non-privileged users FIEMAP has to be read- only? As soon as any of the FLAG_* flags come into play you make it privileged. For example fancy any user being able to fill up your file system by calling FIEMAP with FLAG_HSM_READ on all files recursively? This should certainly not be simply dismissed as a non- issue without thinking about it first... Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Tue May 1 11:48:41 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 11:48:44 -0700 (PDT) Received: from ppsw-9.csi.cam.ac.uk (ppsw-9.csi.cam.ac.uk [131.111.8.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l41ImefB006913 for ; Tue, 1 May 2007 11:48:41 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from altaparmakov.plus.com ([212.159.79.82]:49949 helo=[192.168.1.64]) by ppsw-9.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.159]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HixNG-0000gV-WA (Exim 4.63) (return-path ); Tue, 01 May 2007 19:46:55 +0100 In-Reply-To: <20070501142049.GG77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> 
<1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Tue, 1 May 2007 19:46:53 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11239 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 1 May 2007, at 15:20, David Chinner wrote: > On Mon, Apr 30, 2007 at 09:39:06PM -0700, Nicholas Miell wrote: >> On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote: >>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: >>>> This is actually for future use. Any flags that are added into >>>> this >>>> range must be understood by both sides or it should be >>>> considered an >>>> error. Flags outside the FIEMAP_FLAG_INCOMPAT do not >>>> necessarily need >>>> to be supported. If it turns out that 8 bits is too small a >>>> range for >>>> INCOMPAT flags, then we can make 0x01000000 an incompat flag >>>> that means >>>> e.g. 0x00ff0000 are also incompat flags also. >>> >>> Ah, ok. So it's not really a set of "compatibility" flags, it's >>> more a >>> "compulsory" set. Under those terms, i don't really see why this is >>> necessary - either the filesystem will understand the flags or it >>> will >>> return EINVAL or ignore them... >>> >>>> I'm assuming that all flags that will be in the original FIEMAP >>>> proposal >>>> will be understood by the implementations. 
Most filesystems can >>>> safely >>>> ignore FLAG_HSM_READ, for example, since they don't support HSM, >>>> and for >>>> that matter FLAG_SYNC is probably moot for most filesystems also >>>> because >>>> they do block allocation at preprw time. >>> >>> Exactly my point - so why do we really need to encode a >>> compulsory set of >> >> Because flags have meaning, independent of whether or not the >> filesystem >> understands them. And if the filesystem chooses to ignore critically >> important flags (instead of returning EINVAL), bad things may happen. >> >> So, either the filesystem will understand the flag or iff the >> unknown flag >> is in the incompat set, it will return EINVAL or else the unknown >> flag will >> be safely ignored. > > My point was that there is a difference between specification and > implementation - if the specification says something is compulsory, > then they must be implemented in the filesystem. This is easy > enough to ensure by code review - we don't need additional interface > complexity for this.... You are wrong about this because you are missing the point that you have no code to review. The users that will use those flags are going to be applications that run in user space. Chances are you will never see their code. Heck, they might not even be open source applications... And all applications will run against a multitude of kernels. So version X of the application will run on kernel 2.4.*, 2.6.*, a.b.*, etc... For future expandability of the interface I think it is important to have both compulsory and non-compulsory flags. For example there is no reason why FIEMAP_HSM_READ needs to be compulsory. Most filesystems do not support HSM so can safely ignore it. 
And applications that want to read/write the data locations that are obtained with the FIEMAP call will likely always supply FIEMAP_HSM_READ because they want to ensure the file is brought in if it is off line so they definitely want file systems that do not support this flag to ignore it. And vice versa, an application might specify some weird and funky yet to be developed feature that it expects the FS to perform and if the FS cannot do it (either because it does not support it or because it failed to perform the operation) the application expects the FS to return an error and not to ignore the flag. An example could be the asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS ignores it it will return the extent map for the file data instead of the XATTR_FORK! Not what the application wanted at all. Ouch! So this is definitely a compulsory flag if I ever saw one. So as you see you must support both voluntary and compulsory flags... Also consider what I said above about different kernels. A new feature is implemented in kernel 2.8.13 say that was not there before and an application is updated to use that feature. There will be lots of instances where that application will still be run on older kernels where this feature does not exist. Depending on the feature it may be quite sensible to simply ignore in the kernel that the application set an unknown flag whilst for a different feature it may be the opposite. 
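To make the distinction concrete, here is a minimal sketch of how a filesystem could treat the two classes of flags. It is only an illustration: the flag values, the 0xff000000 INCOMPAT mask and the helper name fiemap_check_flags() are assumptions made up for this example, not part of any agreed interface.

```c
#include <errno.h>
#include <stdint.h>

/* All values below are hypothetical and chosen for illustration only. */
#define FIEMAP_FLAG_SYNC        0x00000001u  /* voluntary: may be ignored */
#define FIEMAP_FLAG_HSM_READ    0x00000002u  /* voluntary: may be ignored */
#define FIEMAP_FLAG_XATTR_FORK  0x02000000u  /* compulsory: inside the INCOMPAT range */
#define FIEMAP_FLAG_INCOMPAT    0xff000000u  /* assumed mask covering compulsory flags */

/*
 * Hypothetical helper: decide what to do with the flags a caller passed in.
 * 'supported' is the set of flags this filesystem actually implements.
 * Returns 0 if the request may proceed, -EINVAL if a compulsory flag
 * is set that the filesystem does not understand.
 */
int fiemap_check_flags(uint32_t requested, uint32_t supported)
{
	/* Flags the caller set that this filesystem knows nothing about. */
	uint32_t unknown = requested & ~supported;

	/* An unknown compulsory flag must fail rather than be ignored. */
	if (unknown & FIEMAP_FLAG_INCOMPAT)
		return -EINVAL;

	/* Unknown voluntary flags are silently ignored. */
	return 0;
}
```

With this split, an old kernel that has never heard of the XATTR_FORK flag returns an error instead of handing back the data fork map, while an application that always sets HSM_READ keeps working on filesystems without HSM support.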
Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Tue May 1 15:32:43 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 15:32:47 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l41MWgfB012145 for ; Tue, 1 May 2007 15:32:43 -0700 Received: from localhost.adilger.int (72-254-21-136.client.stsn.net [72.254.21.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id DFE564E4564; Tue, 1 May 2007 16:32:41 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id AC4524179; Tue, 1 May 2007 15:32:36 -0700 (PDT) Date: Tue, 1 May 2007 15:32:36 -0700 From: Andreas Dilger To: David Chinner Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070501223236.GM5722@schatzie.adilger.int> Mail-Followup-To: David Chinner , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070501142049.GG77450368@melbourne.sgi.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 
X-archive-position: 11241 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 02, 2007 00:20 +1000, David Chinner wrote: > My point was that there is a difference between specification and > implementation - if the specification says something is compulsory, > then they must be implemented in the filesystem. This is easy > enough to ensure by code review - we don't need additional interface > complexity for this.... What you seem to be missing about my proposal is that the FLAG_INCOMPAT is for future use by that part of the specification we haven't thought of yet... Having COMPAT/INCOMPAT flags has been very useful for ext2/3/4, and is much better than having version numbers for the interface. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Tue May 1 15:30:50 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 15:30:53 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l41MUnfB011674 for ; Tue, 1 May 2007 15:30:50 -0700 Received: from localhost.adilger.int (72-254-21-136.client.stsn.net [72.254.21.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 459B44E4564; Tue, 1 May 2007 16:30:47 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id F06254179; Tue, 1 May 2007 15:30:40 -0700 (PDT) Date: Tue, 1 May 2007 15:30:40 -0700 From: Andreas Dilger To: David Chinner Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070501223040.GL5722@schatzie.adilger.int> Mail-Followup-To: David Chinner , linux-ext4@vger.kernel.org, 
linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070501042254.GD77450368@melbourne.sgi.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11240 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 01, 2007 14:22 +1000, David Chinner wrote: > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: > > Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't > > I disagree - why would you want to indicate the state is unknown when we know > very well that it is offline? If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a catch-all flag that indicates "this extent contains data but there is nothing sensible to be returned for the extent mapping." > Effectively, when your extent is offline in the HSM, it is inaccessible, and > you have to bring it back from tape so it becomes accessible again. i.e. some > action is necessary on behalf of the user to make it accessible. So I think > that OFFLINE is a good name for this state because it really is inaccessible. What you are calling OFFLINE I would prefer to call UNMAPPED, since that can be used by applications as a catch-all for "no mapping". There can be further flags that give refinements to UNMAPPED that some applications might care about (e.g. HSM_RESIDENT), but many users/apps will not if they just want the number of fragments in a given file. 
> Also, I don't think "secondary" is a good term because most large systems > have more than one tier of storage. One possibility is "HSM_RESIDENT" > which indicates the extent is current and resident with a HSM's archive.... Sure. > > Can you propose reasonable flag names for these (I can't think of anything > > very good) and a clear explanation of what they mean. I suspect it will > > only be XFS that uses them initially. In mke2fs and ext4+mballoc there is > > the concept of stripe unit and stripe width, but as yet they are not > > communicated between the two very well. I'd be much happier if this info > > could be queried in a standard way from the block layer instead of the > > user having to specify it and the filesystem having to track it. > > My preference is definitely for a separate ioctl to grab the > filesystem geometry so this stuff can be calculated in userspace. > i.e. the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't > bother trying to define names until we decide which approach we take > to implement this. Hmm, previously you wrote "This information could be easily passed up in the flags fields if the filesystem has geometry information". So, I _think_ what you are saying is that you want 4 flags to convey this start/end alignment information, but the exact semantics of a "stripe unit" and a "stripe width" are filesystem specific? I definitely do NOT want to get into any issues of querying the block device geometry here. I was just making a passing comment that ext4+mballoc can already do RAID-specific allocation alignment, but it depends on the admin to specify this information and it would be nice if there was some easy way to get this from userspace/kernel interfaces. Having an API that can request "tell me the number of blocks from this offset until the next physical disk boundary" or similar would be useful to any allocator, and the block layer already needs to know this when submitting IO. 
> In XFS, mkfs.xfs does the work of getting this information > to see in the filesystem superblock. Here's the code for getting > sunit/swidth from the underlying block device: > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/ > > Not much in common there ;) It looks like this might be just what e2fsprogs needs also. > > It does make sense to specify zero for the fm_extent_count array and a > > new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the > > extent data itself, for the non-verbose mode of filefrag, and for > > pre-allocating a buffer large enough to hold the file if that is important. > > Rather than rely on implicit behaviour of "pass in extent count of > zero and a don't try to return any extents" to return the number of > extents on the file, why not just explicitly define this as a valid > input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS That's what I said, isn't it? FIEMAP_FLAG_NO_EXTENTS. I wonder if my clever-clever for "return no extents" and "return number of extents" is wasted :-/. > > - does XFS return an extent for the metadata parts of the file (e.g. btree)? > > No, but we can return the extent map for the attribute fork (i.e. > extended attrs) if asked for (XFS_IOC_GETBMAPA). This seems like it would be a useful addition to the interface also, having FIEMAP_FLAG_METADATA request the return of metadata allocations too. > > - does XFS return preallocated extents beyond EOF? > > Yes - they are part of the extent map for the file. OK. > > - does XFS allow non-root users to call xfs_bmap on files they don't own, or > > use by non-root users at all? > > Users can run xfs_bmap on any file they have permission to > open(O_RDONLY). > > > The FIBMAP ioctl is for privileged users > > only, and I wonder if FIEMAP should be the same, or at least disallow > > mapping files that the user can't access especially with FLAG_SYNC and/or > > FLAG_HSM_READ. 
> > I see little reason for restricting FI[BE]MAP to privileged users - > anyone should be able to determine if files they have permission to > access are fragmented. I think I agree with Anton that allowing some of the flags for non-privileged users seems dangerous. I think this needs to be determined on a flag-by-flag basis, and -EPERM should be returned in some cases. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Tue May 1 17:07:20 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 17:07:22 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4207GfB029493 for ; Tue, 1 May 2007 17:07:18 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA19765; Wed, 2 May 2007 10:07:02 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4206xAf82132681; Wed, 2 May 2007 10:07:00 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4206tcJ81768258; Wed, 2 May 2007 10:06:55 +1000 (AEST) Date: Wed, 2 May 2007 10:06:54 +1000 From: David Chinner To: Anton Altaparmakov Cc: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502000654.GK77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1FA8E92B-954D-4624-A089-80D4AA7399FD@cam.ac.uk> Mime-Version: 1.0 
Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1FA8E92B-954D-4624-A089-80D4AA7399FD@cam.ac.uk> User-Agent: Mutt/1.4.2.1i X-archive-position: 11242 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, May 01, 2007 at 07:37:20PM +0100, Anton Altaparmakov wrote: > On 1 May 2007, at 05:22, David Chinner wrote: > >On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: > >> The FIBMAP ioctl is for privileged users > >> only, and I wonder if FIEMAP should be the same, or at least > >>disallow > >> mapping files that the user can't access especially with > >>FLAG_SYNC and/or > >> FLAG_HSM_READ. > > > >I see little reason for restricting FI[BE]MAP to privileged users - > >anyone should be able to determine if files they have permission to > >access are fragmented. > > Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the > machine. Perhaps for non-privileged users FIEMAP has to be read- > only? As soon as any of the FLAG_* flags come into play you make it > privileged. For example fancy any user being able to fill up your > file system by calling FIEMAP with FLAG_HSM_READ on all files > recursively? By that reasoning, users should not be allowed to recall any files without root privileges. HSMs don't work that way, though - any user is allowed to recall any files they have permission to access either by manual command or by trying to read the file data. If that runs the filesystem out of space, then the HSM either hasn't been configured properly or it's failed to manage the space correctly. Either way, that's not the fault of the user for recalling their own files. Hence allowing FIEMAP to be executed by the user does not open up any DOS conditions that don't already exist in a normal HSM-managed filesystem. Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Tue May 1 19:27:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 19:27:06 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l422QxfB029690 for ; Tue, 1 May 2007 19:27:01 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id MAA22695; Wed, 2 May 2007 12:26:48 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l422QkAf82214176; Wed, 2 May 2007 12:26:47 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l422Qisa78652236; Wed, 2 May 2007 12:26:44 +1000 (AEST) Date: Wed, 2 May 2007 12:26:44 +1000 From: David Chinner To: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502022644.GO77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070501223040.GL5722@schatzie.adilger.int> User-Agent: Mutt/1.4.2.1i X-archive-position: 11243 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, May 01, 2007 at 03:30:40PM -0700, Andreas Dilger wrote: > On May 01, 2007 14:22 +1000, 
David Chinner wrote: > > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: > > > Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't > > > > I disagree - why would you want to indicate the state is unknown when we know > > very well that it is offline? > > If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a > catch-all flag that indicates "this extent contains data but there is > nothing sensible to be returned for the extent mapping." Yes, I like that much more. Good suggestion. ;) > > Effectively, when your extent is offline in the HSM, it is inaccessible, and > > you have to bring it back from tape so it becomes accessible again. i.e. some > > action is necessary on behalf of the user to make it accessible. So I think > > that OFFLINE is a good name for this state because it really is inaccessible. > > What you are calling OFFLINE I would prefer to call UNMAPPED, since that > can be used by applications as a catch-all for "no mapping". There can > be further flags that give refinements to UNMAPPED that some applications > might care about (e.g. HSM_RESIDENT), but many users/apps will not > if they just want the number of fragments in a given file. Agreed - UNMAPPED does make a lot more sense in this case. > > > Can you propose reasonable flag names for these (I can't think of anything > > > very good) and a clear explanation of what they mean. I suspect it will > > > only be XFS that uses them initially. In mke2fs and ext4+mballoc there is > > > the concept of stripe unit and stripe width, but as yet they are not > > > communicated between the two very well. I'd be much happier if this info > > > could be queried in a standard way from the block layer instead of the > > > user having to specify it and the filesystem having to track it. > > > > My preference is definitely for a separate ioctl to grab the > > filesystem geometry so this stuff can be calculated in userspace. > > i.e. 
the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't > > bother trying to define names until we decide which approach we take > > to implement this. > > Hmm, previously you wrote "This information could be easily passed up in the > flags fields if the filesystem has geometry information". So, I _think_ > what you are saying is that you want 4 flags to convey this start/end > alignment information, but the exact semantics of a "stripe unit" and > a "stripe width" are filesystem specific? Right. > I definitely do NOT want to get into any issues of querying the block > device geometry here. I was just making a passing comment that ext4+mballoc > can already do RAID-specific allocation alignment, but it depends on the > admin to specify this information and it would be nice if there was some > easy way to get this from userspace/kernel interfaces. > > Having an API that can request "tell me the number of blocks from this > offset until the next physical disk boundary" or similar would be useful > to any allocator, and the block layer already needs to know this when > submitting IO. The block layer knows this once you get inside the volume manager. I think the issue is that there is no common export interface for this information. > > In XFS, mkfs.xfs does the work of getting this information > > to see in the filesystem superblock. Here's the code for getting > > sunit/swidth from the underlying block device: > > > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/ > > > > Not much in common there ;) > > It looks like this might be just what e2fsprogs needs also. More than likely. > > > It does make sense to specify zero for the fm_extent_count array and a > > > new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the > > > extent data itself, for the non-verbose mode of filefrag, and for > > > pre-allocating a buffer large enough to hold the file if that is important. 
> > > > Rather than rely on implicit behaviour of "pass in extent count of > > zero and a don't try to return any extents" to return the number of > > extents on the file, why not just explicitly define this as a valid > > input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS > > That's what I said, isn't it? FIEMAP_FLAG_NO_EXTENTS. I wonder if my > clever-clever for "return no extents" and "return number of extents" > is wasted :-/. Too clever for an API, I think. ;) My point is mainly that if you are going to use an API for a specific function (e.g. query the number of extents) I think that the API should have an obvious method for executing that specific function. Using a command of "get no extents" to provide the query of "how many extents in this file" is kind of obscure. When you read the code it doesn't make a lot of sense, as opposed to seeing a clear statement of intent from the code itself. i.e. FIEMAP_FLAG_GET_NUMEXTENTS is self-documenting in both the API and the code that uses it... > > > - does XFS return an extent for the metadata parts of the file (e.g. btree)? > > > > No, but we can return the extent map for the attribute fork (i.e. > > extended attrs) if asked for (XFS_IOC_GETBMAPA). > > This seems like it would be a useful addition to the interface also, having > FIEMAP_FLAG_METADATA request the return of metadata allocations too. Agreed. The different types of requests need to be mutually exclusive, though - returning the map of the attribute fork mixed with the map of the data fork is going to be confusing.... > > > - does XFS allow non-root users to call xfs_bmap on files they don't own, or > > > use by non-root users at all? > > > > Users can run xfs_bmap on any file they have permission to > > open(O_RDONLY). > > > > > The FIBMAP ioctl is for privileged users > > > only, and I wonder if FIEMAP should be the same, or at least disallow > > > mapping files that the user can't access especially with FLAG_SYNC and/or > > > FLAG_HSM_READ. 
> > > > I see little reason for restricting FI[BE]MAP to privileged users - > > anyone should be able to determine if files they have permission to > > access are fragmented. > > I think I agree with Anton that allowing some of the flags for non-privileged > users seems dangerous. I think this needs to be determined on a flag-by-flag > basis, and -EPERM should be returned in some cases. Agreed, but I'm yet to see any flags where I think that is necessary yet. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 01:18:21 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 01:18:25 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l428IKfB012099 for ; Wed, 2 May 2007 01:18:21 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49210) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjA0M-0001Ma-PD (Exim 4.63) (return-path ); Wed, 02 May 2007 09:16:06 +0100 In-Reply-To: <20070502000654.GK77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1FA8E92B-954D-4624-A089-80D4AA7399FD@cam.ac.uk> <20070502000654.GK77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <8464EA47-03AC-4162-A2D0-683517568640@cam.ac.uk> Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: 
Anton Altaparmakov
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
Date: Wed, 2 May 2007 09:16:04 +0100
To: David Chinner
X-Mailer: Apple Mail (2.752.3)
X-archive-position: 11244
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: aia21@cam.ac.uk
Precedence: bulk
X-list: xfs

On 2 May 2007, at 01:06, David Chinner wrote:
> On Tue, May 01, 2007 at 07:37:20PM +0100, Anton Altaparmakov wrote:
>> On 1 May 2007, at 05:22, David Chinner wrote:
>>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
>>>> The FIBMAP ioctl is for privileged users
>>>> only, and I wonder if FIEMAP should be the same, or at least
>>>> disallow
>>>> mapping files that the user can't access especially with
>>>> FLAG_SYNC and/or
>>>> FLAG_HSM_READ.
>>>
>>> I see little reason for restricting FI[BE]MAP to privileged users -
>>> anyone should be able to determine if files they have permission to
>>> access are fragmented.
>>
>> Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the
>> machine. Perhaps for non-privileged users FIEMAP has to be read-only?
>> As soon as any of the FLAG_* flags come into play you make it
>> privileged. For example fancy any user being able to fill up your
>> file system by calling FIEMAP with FLAG_HSM_READ on all files
>> recursively?
>
> By that reasoning, users should not be allowed to recall any files
> without root privileges. HSMs don't work that way, though - any user
> is allowed to recall any files they have permission to access either
> by manual command or by trying to read the file data.
>
> If that runs the filesystem out of space, then the HSM either hasn't
> been configured properly or it's failed to manage the space
> correctly. Either way, that's not the fault of the user for
> recalling their own files.
> > Hence allowing FIEMAP to be executed by the user does not open up > any DOS conditions that don't already exist in normal HSM-managed > filesystem. Sorry, it was not a great example. But the point still stands that there are/may be created flags that you do not want to allow everyone to use. I completely agree with Andreas that those can simply return -EPERM and the rest can be allowed through. Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 01:25:09 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 01:25:15 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l428P7fB013738 for ; Wed, 2 May 2007 01:25:08 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49214) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjA7h-0003fi-Mq (Exim 4.63) (return-path ); Wed, 02 May 2007 09:23:41 +0100 In-Reply-To: <20070501223040.GL5722@schatzie.adilger.int> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> Cc: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit 
From: Anton Altaparmakov
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
Date: Wed, 2 May 2007 09:23:38 +0100
To: Andreas Dilger
X-Mailer: Apple Mail (2.752.3)
X-archive-position: 11245
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: aia21@cam.ac.uk
Precedence: bulk
X-list: xfs

On 1 May 2007, at 23:30, Andreas Dilger wrote:
> On May 01, 2007 14:22 +1000, David Chinner wrote:
>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
>>> Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I
>>> didn't
>>
>> I disagree - why would you want to indicate the state is unknown
>> when we know
>> very well that it is offline?
>
> If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a
> catch-all flag that indicates "this extent contains data but there is
> nothing sensible to be returned for the extent mapping."

I like UNMAPPED. I even use it in NTFS internally for extent maps
that have not been read into memory yet. (-:

On a different issue, do you think it would be worth adding an option
flag like FIEMAP_DONT_RELOCATE or something similar that would be a
compulsory flag and, if set, the FS is not allowed to move the file
around/change the block allocation of the file?

My thinking is that the extent map is not terribly useful if the FS
goes and relocates the file to somewhere else just after you have done
the ioctl. For example HFS on OSX automatically defragments files
whilst it is running... Linux file systems may one day do similar
things.

Or alternatively a flag like FIEMAP_MAKE_DIRECT or something to tell
the FS we want to access the actual raw blocks so the FS can make
sure the data is on block aligned boundaries and if the FS does not
support this (e.g. ZFS or a compressed or encrypted NTFS file) then
it can return -ENOTSUP.
Perhaps this is totally the wrong interface and such a "prepare file for direct access" API should be a different ioctl() or syscall or whatever. It just seems very simple and appropriate to combine it here as people who use FIEMAP are at least sometimes going to be wanting to access those blocks directly as well and it feels right to be able to communicate this to the FS in the same call, kind of like an "open intent" of "I want to use the data directly on disk"... What do you think? Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 01:31:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 01:31:39 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l428VZfB015273 for ; Wed, 2 May 2007 01:31:36 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49220) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjAE7-0006gx-N5 (Exim 4.63) (return-path ); Wed, 02 May 2007 09:30:19 +0100 In-Reply-To: <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <69B76939-CAAD-4F43-BE9F-6C3CA3ECCF5E@cam.ac.uk> Cc: 
David Chinner , linux-ext4@vger.kernel.org, Linux Filesystems , xfs@oss.sgi.com, Christoph Hellwig Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 09:30:17 +0100 To: Andreas Dilger X-Mailer: Apple Mail (2.752.3) X-archive-position: 11246 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 09:23, Anton Altaparmakov wrote: > On 1 May 2007, at 23:30, Andreas Dilger wrote: > >> On May 01, 2007 14:22 +1000, David Chinner wrote: >>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: >>>> Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but >>>> I didn't >>> >>> I disagree - why would you want to indicate the state is unknown >>> when we know >>> very well that it is offline? >> >> If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a >> catch-all flag that indicates "this extent contains data but there is >> nothing sensible to be returned for the extent mapping." > > I like UNMAPPED. I even use it in NTFS internally for extents maps > that have not been read into memory yet. (-: Oops, I use NOT_MAPPED in NTFS rather than UNMAPPED but I still like UNMAPPED, too. 
(-: Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 02:15:48 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 02:15:52 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l429FjfB025664 for ; Wed, 2 May 2007 02:15:47 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id TAA03102; Wed, 2 May 2007 19:15:33 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l429FUAf82146138; Wed, 2 May 2007 19:15:31 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l429FQYw81999881; Wed, 2 May 2007 19:15:26 +1000 (AEST) Date: Wed, 2 May 2007 19:15:26 +1000 From: David Chinner To: Anton Altaparmakov Cc: David Chinner , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502091526.GW77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> User-Agent: Mutt/1.4.2.1i X-archive-position: 11247 
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: dgc@sgi.com
Precedence: bulk
X-list: xfs

On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote:
> On 1 May 2007, at 15:20, David Chinner wrote:
> >>
> >>So, either the filesystem will understand the flag or iff the
> >>unknown flag
> >>is in the incompat set, it will return EINVAL or else the unknown
> >>flag will
> >>be safely ignored.
> >
> >My point was that there is a difference between specification and
> >implementation - if the specification says something is compulsory,
> >then they must be implemented in the filesystem. This is easy
> >enough to ensure by code review - we don't need additional interface
> >complexity for this....
>
> You are wrong about this because you are missing the point that you
> have no code to review. The users that will use those flags are
> going to be applications that run in user space. Chances are you
> will never see their code. Heck, they might not even be open source
> applications...

Ummm - the specification defines what is compulsory for *filesystems*
to implement, not what applications can use. We don't need to see
what the applications do - what we care about is that all filesystems
implement the compulsory part of the specification. That's the code
we review, and that's what I was referring to.

> And all applications will run against a multitude of
> kernels. So version X of the application will run on kernel 2.4.*,
> 2.6.*, a.b.*, etc... For future expandability of the interface I
> think it is important to have both compulsory and non-compulsory flags.

Ah, so that's what you want - a mutable interface. i.e. versioning.

So how do compulsory flags help here? What happens if a voluntary
flag now becomes compulsory? Or vice versa? How is the application
supposed to deal with this dynamically?
I suggested a version number for this right back at the start of this discussion and got told that we don't want versioned interfaces because we should make the effort to get it right the first time. I don't think this can be called "getting it right". > For example there is no reason why FIEMAP_HSM_READ needs to be > compulsory. Most filesystems do not support HSM so can safely ignore > it. They might be able to safely ignore it, but in reality it should be saying "I don't understand this". If the application *needs* to use a flag like this, then it should be told that the filesystem is not capable of doing what it was asked! OTOH if the application does not need to use the flag, then it shouldn't be using it and we shouldn't be silently ignoring incorrect usage of the provided API. What you are effectively saying about these "voluntary" flags is that their behaviour is _undefined_. That is, if you use these flags what you get on a successful call is undefined; it may or may not contain what you asked for but you can't tell if it really did what you want or returned the information you asked for. This is a really bad semantic to encode into an API. > And vice versa, an application might specify some weird and funky yet > to be developed feature that it expects the FS to perform and if the > FS cannot do it (either because it does not support it or because it > failed to perform the operation) the application expects the FS to > return an error and not to ignore the flag. An example could be the > asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS > ignores it it will return the extent map for the file data instead of > the XATTR_FORK! Not what the application wanted at all. Ouch! So > this is definitely a compulsory flag if I ever saw one. Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But we don't need a flag defined in the user visible API to tell us that we need to return an error here. 
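A minimal sketch of that strict model, where a filesystem rejects every flag it does not implement and so needs no separate compat/incompat range. The flag values and the examplefs_* names are hypothetical, chosen only for illustration; they are not taken from the proposal:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag values, for illustration only. */
#define FIEMAP_FLAG_SYNC       0x00000001u
#define FIEMAP_FLAG_HSM_READ   0x00000002u
#define FIEMAP_FLAG_XATTR_FORK 0x00000004u

/* The only flag this imaginary filesystem implements. */
#define EXAMPLEFS_SUPPORTED_FLAGS (FIEMAP_FLAG_SYNC)

/*
 * Strict model: any flag outside the supported set is an error,
 * so nothing is ever silently ignored.
 * Returns 0 if all requested flags are implemented, -EINVAL otherwise.
 */
int examplefs_check_flags(uint32_t flags)
{
    if (flags & ~EXAMPLEFS_SUPPORTED_FLAGS)
        return -EINVAL;
    return 0;
}
```

An application that gets -EINVAL back knows the request was not honoured, and can retry without the flag if it does not strictly need it.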
> So as you see you must support both voluntary and compulsory flags...

No, you've managed to convince me that they are not necessary and
they are in fact a Bad Idea... ;)

> Also consider what I said above about different kernels. A new
> feature is implemented in kernel 2.8.13 say that was not there before
> and an application is updated to use that feature. There will be
> lots of instances where that application will still be run on older
> kernels where this feature does not exist.

This is *exactly* where silently ignoring flags really falls down.
On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does
something and it returns different structure contents for the same
state. Now how does the application writer know which is correct or
how to tell the difference? They have to guess or write detection
code which is exactly what we want to avoid.

I objected to the UNKNOWN flag because it wasn't explicit
in its meaning - I'm doing the same thing here. An interface
needs to be explicitly defined and should not have any undefined
behaviour in it....

Cheers,

Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 02:38:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 02:38:32 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l429cPfB032340 for ; Wed, 2 May 2007 02:38:28 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49355) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjBFu-0000Yj-Ne (Exim 4.63) (return-path ); Wed, 02 May 2007 10:36:14 +0100 In-Reply-To: <20070502091526.GW77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 10:36:12 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11248 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 
10:15, David Chinner wrote: > On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: >> On 1 May 2007, at 15:20, David Chinner wrote: >>>> >>>> So, either the filesystem will understand the flag or iff the >>>> unknown flag >>>> is in the incompat set, it will return EINVAL or else the unknown >>>> flag will >>>> be safely ignored. >>> >>> My point was that there is a difference between specification and >>> implementation - if the specification says something is compulsory, >>> then they must be implemented in the filesystem. This is easy >>> enough to ensure by code review - we don't need additional interface >>> complexity for this.... >> >> You are wrong about this because you are missing the point that you >> have no code to review. The users that will use those flags are >> going to be applications that run in user space. Chances are you >> will never see their code. Heck, they might not even be open source >> applications... > > Ummm - the specification defines what is compulsory for *filesystems* > to implement, not what applications can use. We don't need to see > what the applications do - what we care about is that all filesystems > implement the compulsory part of the specification. That's the code > we review, and that's what I was referring to. > >> And all applications will run against a multitude of >> kernels. So version X of the application will run on kernel 2.4.*, >> 2.6.*, a.b.*, etc... For future expandability of the interface I >> think it is important to have both compulsory and non-compulsory >> flags. > > Ah, so that's what you want - a mutable interface. i.e. versioning. > > So how does compusory flags help here? What happens if a voluntary > flag now becomes compulsory? Or vice versa? How is the application > supposed to deal with this dynamically? 
> > I suggested a version number for this right back at the start of > this discussion and got told that we don't want versioned interfaces > because we should make the effort to get it right the first time. > I don't think this can be called "getting it right". Look at ext2/3/4. They do it that way and it works well. No versioning just compatible and incompatible flags... The proposal is to do the same here. >> For example there is no reason why FIEMAP_HSM_READ needs to be >> compulsory. Most filesystems do not support HSM so can safely ignore >> it. > > They might be able to safely ignore it, but in reality it should > be saying "I don't understand this". If the application *needs* to > use a flag like this, then it should be told that the filesystem is > not capable of doing what it was asked! That is where you are completely wrong! (-: Or rather you are wrong for my example, i.e. you are wrong/right depending on the type of flag in question. HSM_READ is definitely _NOT_ required because all it means is "if the file is OFFLINE, bring it ONLINE and then return the extent map". Clearly all file systems that do not support HSM can 100% ignore this flag as all files will ALWAYS be ONLINE so they will return the correct data ALWAYS so no need to do anything for HSM_READ. > OTOH if the application does not need to use the flag, then it > shouldn't be using it and we shouldn't be silently ignoring > incorrect usage of the provided API. > > What you are effectively saying about these "voluntary" flags > is that their behaviour is _undefined_. That is, if you use > these flags what you get on a successful call is undefined; > it may or may not contain what you asked for but you can't > tell if it really did what you want or returned the information > you asked for. > > This is a really bad semantic to encode into an API. That is your opinion. There is nothing undefined in the API at all. You just fail to understand it... 
>> And vice versa, an application might specify some weird and funky yet >> to be developed feature that it expects the FS to perform and if the >> FS cannot do it (either because it does not support it or because it >> failed to perform the operation) the application expects the FS to >> return an error and not to ignore the flag. An example could be the >> asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS >> ignores it it will return the extent map for the file data instead of >> the XATTR_FORK! Not what the application wanted at all. Ouch! So >> this is definitely a compulsory flag if I ever saw one. > > Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But > we don't need a flag defined in the user visible API to tell us > that we need to return an error here. Heh? What are you talking about? You need a flag to specify that you want XATTR_FORK. If not how the hell does the application specify that it wants XATTR_FORK instead of DATA_FORK (default)? Or are you of the opinion that FIEMAP should definitely not support XATTR_FORK. If the latter I fully agree. This should be a separate API with named streams and the FD of the named stream should be passed to FIEMAP without the silly XATTR_FORK flag... >> So as you see you must support both voluntary and compulsory flags... > > No, you've managed to convince me that they are not necessary and > they are in fact a Bad Idea... ;) We agree to disagree then. I think they are a very Good Idea(TM). (-; >> Also consider what I said above about different kernels. A new >> feature is implemented in kernel 2.8.13 say that was not there before >> and an application is updated to use that feature. There will be >> lots of instances where that application will still be run on older >> kernels where this feature does not exist. > > This is *exactly* where silently ignoring flags really falls down. It does not! > On 2.8.13, the flag is silently ignored. 
On 2.8.14, the flag does > something and it returns different structure contents for the same No it does not. You do NOT understand at all what we are talking about do you?!? If a flag would do something weird like returning different data then OBVIOUSLY you would make this a mandatory flag and it will NOT be ignored! You should know better than arguing with fallacies. Seriously... > state. Now how does the application writer know which is correct or > how to tell the difference? They have to guess or write detection > code which is exactly what we want to avoid. No they don't. It is then a compulsory flag so your argument is totally moot. > I objected to the UNKNOWN flag because it wasn't explicit > in it's meaning - I'm doing the same thing here. An interface > needs to be explicitly defined and should not have and undefined > behaviour in it.... That is exactly the point. It is explicitly defined and has NO undefined behaviour in it. (-: Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 02:48:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 02:48:16 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l429mCfB003811 for ; Wed, 2 May 2007 02:48:14 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49362) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjBPJ-0006HX-NQ (Exim 4.63) (return-path ); Wed, 02 May 2007 10:45:57 +0100 In-Reply-To: <20070502091526.GW77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> 
<20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <1AFF1746-8313-4DC2-81D6-4271B5FB71A3@cam.ac.uk> Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 10:45:55 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11249 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 10:15, David Chinner wrote: > On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: >> And all applications will run against a multitude of >> kernels. So version X of the application will run on kernel 2.4.*, >> 2.6.*, a.b.*, etc... For future expandability of the interface I >> think it is important to have both compulsory and non-compulsory >> flags. > > Ah, so that's what you want - a mutable interface. i.e. versioning. > > So how does compusory flags help here? A concrete example: Let's say that the FIEMAP interface goes live as is without any flags at all and just defined bits for "these are optional and those are compulsory". Then the next kernel adds support for optional flag HSM_READ and compulsory flag XATTR_READ. FS that do not support XATTR_READ will return -ENOTSUP as they cannot return the wanted data. 
FS that do not support HSM_READ will still return the correct data in
the majority of cases. The exception is when the FS supports HSM and
the data is actually OFFLINE, but the application needs to be able to
cope with that anyway in case the FS fails to bring the file ONLINE
even though it supports the HSM_READ flag, so handling this case adds
no complexity.

> What happens if a voluntary flag now becomes compulsory? Or vice
> versa? How is the application supposed to deal with this dynamically?

Forgot to answer this bit: This cannot happen. There cannot be flags
that move from compulsory to non-compulsory or anything stupid like
that. It would have to be a totally new flag, otherwise it breaks
backwards compatibility and hence this interface becomes useless crap.

> I suggested a version number for this right back at the start of
> this discussion and got told that we don't want versioned interfaces
> because we should make the effort to get it right the first time.
> I don't think this can be called "getting it right".

So all applications end up doing:

	if (version X, do blah)
	else if (version Y, do blob)
	else if (version Z, do foo)
	else if (version A, do bar)
	else exit(1);

Every time a new version is added? And abort for unknown versions?
Now that is a great interface! Not.
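The optional/compulsory split from the concrete example above can be sketched as follows. This is only an illustration: the 0xff000000 incompat range is an assumption based on the earlier description of FIEMAP_FLAG_INCOMPAT, and the individual flag values and the fiemap_check_flags() helper are hypothetical:

```c
#include <errno.h>
#include <stdint.h>

/* Assumed range: flags in this mask must be understood or the call fails. */
#define FIEMAP_FLAG_INCOMPAT   0xff000000u

/* Hypothetical flag values, for illustration only. */
#define FIEMAP_FLAG_HSM_READ   0x00000002u  /* optional: outside the incompat range */
#define FIEMAP_FLAG_XATTR_READ 0x01000000u  /* compulsory: inside the incompat range */

/*
 * 'supported' is the set of flags a given filesystem implements.
 * Unknown optional flags are ignored; unknown compulsory flags
 * make the request fail with -ENOTSUP.
 */
int fiemap_check_flags(uint32_t requested, uint32_t supported)
{
    uint32_t unknown = requested & ~supported;

    if (unknown & FIEMAP_FLAG_INCOMPAT)
        return -ENOTSUP;
    return 0;
}
```

With this scheme an older kernel ignores an unknown optional flag such as HSM_READ, but fails any request that carries an unknown compulsory flag such as XATTR_READ, without the application checking a version number.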
Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 02:49:18 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 02:49:21 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l429nEfB004317 for ; Wed, 2 May 2007 02:49:17 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id TAA03843; Wed, 2 May 2007 19:49:01 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l429mtAf82223314; Wed, 2 May 2007 19:48:57 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l429mqgl82278699; Wed, 2 May 2007 19:48:52 +1000 (AEST) Date: Wed, 2 May 2007 19:48:51 +1000 From: David Chinner To: Anton Altaparmakov Cc: Andreas Dilger , David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502094851.GX77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> User-Agent: Mutt/1.4.2.1i X-archive-position: 11250 X-ecartis-version: Ecartis v1.0.0 Sender: 
xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs

On Wed, May 02, 2007 at 09:23:38AM +0100, Anton Altaparmakov wrote:
> On a different issue, do you think it would be worth adding an option
> flags like FIEMAP_DONT_RELOCATE or something similar that would be a
> compulsory flag and if set the FS is not allowed to move the file
> around/change the block allocation of the file.

We already have an inode flag in XFS to say this - the defrag tool checks it and ignores the file if it is set.

> Or alternatively a flag like FIEMAP_MAKE_DIRECT or something to tell
> the FS we want to access the actual raw blocks so the FS can make
> sure the data is on block aligned boundaries and if the FS does not
> support this (e.g. ZFS or a compressed or encrypted NTFS file) then
> it can return -ENOTSUP.
>
> Perhaps this is totally the wrong interface and such a "prepare file
> for direct access" API should be a different ioctl() or syscall or
> whatever. It just seems very simple and appropriate to combine it
> here as people who use FIEMAP are at least sometimes going to be
> wanting to access those blocks directly as well and it feels right to
> be able to communicate this to the FS in the same call, kind of like
> an "open intent" of "I want to use the data directly on disk"...

I think this is the wrong interface for this. Sure, use it to get the mappings (that's what it's for) but what you do with the mappings after that is not part of FIEMAP....

Cheers,

Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 02:57:52 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 02:57:55 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l429vnfB007855 for ; Wed, 2 May 2007 02:57:52 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49383) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjBZ6-0001FA-PT (Exim 4.63) (return-path ); Wed, 02 May 2007 10:56:04 +0100 In-Reply-To: <20070502094851.GX77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> <20070502094851.GX77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: Cc: Andreas Dilger , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 10:56:03 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11251 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 10:48, David Chinner wrote: > On Wed, May 02, 2007 at 09:23:38AM +0100, Anton 
Altaparmakov wrote: >> On a different issue, do you think it would be worth adding an option >> flags like FIEMAP_DONT_RELOCATE or something similar that would be a >> compulsory flag and if set the FS is not allowed to move the file >> around/change the block allocation of the file. > > We already have an inode flag in XFS to say this - the defrag > tool checks it and ignores the file if it is set. That is great for XFS but you control the metadata. NTFS, HFS, etc are cases where we cannot add such a flag because we cannot modify the metadata format (ok we could in some kludgy manner like storing an EA with an inode to say "com.linux.ntfs.immutable" or something but I would rather not if I can avoid it). >> Or alternatively a flag like FIEMAP_MAKE_DIRECT or something to tell >> the FS we want to access the actual raw blocks so the FS can make >> sure the data is on block aligned boundaries and if the FS does not >> support this (e.g. ZFS or a compressed or encrypted NTFS file) then >> it can return -ENOTSUP. >> >> Perhaps this is totally the wrong interface and such a "prepare file >> for direct access" API should be a different ioctl() or syscall or >> whatever. It just seems very simple and appropriate to combine it >> here as people who use FIEMAP are at least sometimes going to be >> wanting to access those blocks directly as well and it feels right to >> be able to communicate this to the FS in the same call, kind of like >> an "open intent" of "I want to use the data directly on disk"... > > I think this is wrong interface for this. Sure, use it to get the > mappings (that's what it's for) but what you do with the mappings > after that is not part of FIEMAP.... Thanks for the comments. I am not sure it is a good idea either, just thought it would be worth discussing in case people thought it a good idea. 
Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 03:52:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 03:52:48 -0700 (PDT) Received: from mail.lst.de (verein.lst.de [213.95.11.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l42AqhfB021110 for ; Wed, 2 May 2007 03:52:44 -0700 Received: from verein.lst.de (localhost [127.0.0.1]) by mail.lst.de (8.12.3/8.12.3/Debian-7.1) with ESMTP id l42AqgmK016050 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO) for ; Wed, 2 May 2007 12:52:42 +0200 Received: (from hch@localhost) by verein.lst.de (8.12.3/8.12.3/Debian-6.6) id l42Aqfmi016048 for xfs@oss.sgi.com; Wed, 2 May 2007 12:52:41 +0200 Date: Wed, 2 May 2007 12:52:41 +0200 From: Christoph Hellwig To: xfs@oss.sgi.com Subject: Re: [Bug 756] New: File data corruption when writing to files with DM_EVENT_WRITE enabled over NFS (2.4 kernel) Message-ID: <20070502105241.GA15399@lst.de> References: <200705012104.l41L4CI3029767@oss.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200705012104.l41L4CI3029767@oss.sgi.com> User-Agent: Mutt/1.3.28i X-Scanned-By: MIMEDefang 2.39 X-archive-position: 11252 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@lst.de Precedence: bulk X-list: xfs > Note that a similar issue existed in the 2.6 SGI kernel up until it was resolved > by this recent change: > http://oss.sgi.com/cgi-bin/cvsweb.cgi/linux-2.6-xfs/fs/xfs/linux-2.6/xfs_lrw.c.diff?r1=1.258;r2=1.259;f=h Seems like someone forgot to send TAKEs to the xfs list once again.. 
From owner-xfs@oss.sgi.com Wed May 2 03:58:17 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 03:58:19 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l42AwCfB023745 for ; Wed, 2 May 2007 03:58:14 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id UAA05217; Wed, 2 May 2007 20:57:57 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l42AvrAf82323358; Wed, 2 May 2007 20:57:55 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l42AvnBI81446737; Wed, 2 May 2007 20:57:49 +1000 (AEST) Date: Wed, 2 May 2007 20:57:49 +1000 From: David Chinner To: Anton Altaparmakov Cc: David Chinner , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502105749.GY77450368@melbourne.sgi.com> References: <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> User-Agent: Mutt/1.4.2.1i X-archive-position: 11253 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: 
bulk X-list: xfs On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: > On 2 May 2007, at 10:15, David Chinner wrote: > >On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: > >>And all applications will run against a multitude of > >>kernels. So version X of the application will run on kernel 2.4.*, > >>2.6.*, a.b.*, etc... For future expandability of the interface I > >>think it is important to have both compulsory and non-compulsory > >>flags. > > > >Ah, so that's what you want - a mutable interface. i.e. versioning. > > > >So how does compusory flags help here? What happens if a voluntary > >flag now becomes compulsory? Or vice versa? How is the application > >supposed to deal with this dynamically? > > > >I suggested a version number for this right back at the start of > >this discussion and got told that we don't want versioned interfaces > >because we should make the effort to get it right the first time. > >I don't think this can be called "getting it right". > > Look at ext2/3/4. They do it that way and it works well. No > versioning just compatible and incompatible flags... The proposal is > to do the same here. Just because it works for extN doesn't make it right for this interface. > >>For example there is no reason why FIEMAP_HSM_READ needs to be > >>compulsory. Most filesystems do not support HSM so can safely ignore > >>it. > > > >They might be able to safely ignore it, but in reality it should > >be saying "I don't understand this". If the application *needs* to > >use a flag like this, then it should be told that the filesystem is > >not capable of doing what it was asked! > > That is where you are completely wrong! (-: Or rather you are wrong > for my example, i.e. you are wrong/right depending on the type of > flag in question. And that is the crux of the argument. My point is that *any* flag returns an error if the filesystem does not support it. 
> HSM_READ is definitely _NOT_ required because all
> it means is "if the file is OFFLINE, bring it ONLINE and then return
> the extent map".

You've got the definition of HSM_READ wrong. If the flag is *not* set, then we bring everything back online and return the full extent map. Specifying the flag indicates that we do *not* want the offline extents brought back online, i.e. it is an HSM or a datamover (e.g. a backup program) that is querying the extents and we want to know *exactly* what the current state of the file is right now.

So, if the HSM_READ flag is set, then the application is expecting the filesystem to be part of an HSM. Hence if it's not, it should return an error because somebody has done something wrong.

> >OTOH if the application does not need to use the flag, then it
> >shouldn't be using it and we shouldn't be silently ignoring
> >incorrect usage of the provided API.
> >
> >What you are effectively saying about these "voluntary" flags
> >is that their behaviour is _undefined_. That is, if you use
> >these flags what you get on a successful call is undefined;
> >it may or may not contain what you asked for but you can't
> >tell if it really did what you want or returned the information
> >you asked for.
> >
> >This is a really bad semantic to encode into an API.
>
> That is your opinion. There is nothing undefined in the API at all.
> You just fail to understand it...

FIEMAP returned success. Did it do what I asked? I don't know because it's allowed to return success when it ignored me.

This is as silly an interface definition as saying you can implement fsync() with { return 0; }. So, when fsync() succeeded did it write my data to disk? I don't know; it's allowed to return success when it ignored me.

It's crazy, isn't it? It makes writing applications portable across operating systems a real PITA (ask the MySQL folk ;) because POSIX really does allow fsync() to be implemented like this.
I use this example because the "allow some filesystems to silently ignore flags they don't understand" is a portability problem for applications - rather than a cross-OS issue it is a cross-filesystem issue. That is, if different filesystems behave differently to the same request they will have to be handled specifically by the application. Every filesystem should behave in *exactly* the same way to the FIEMAP ioctls - if they don't support something they throw an error, if they do then they return the correct data.

> >>And vice versa, an application might specify some weird and funky yet
> >>to be developed feature that it expects the FS to perform and if the
> >>FS cannot do it (either because it does not support it or because it
> >>failed to perform the operation) the application expects the FS to
> >>return an error and not to ignore the flag. An example could be the
> >>asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS
> >>ignores it it will return the extent map for the file data instead of
> >>the XATTR_FORK! Not what the application wanted at all. Ouch! So
> >>this is definitely a compulsory flag if I ever saw one.
> >
> >Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But
> >we don't need a flag defined in the user visible API to tell us
> >that we need to return an error here.
>
> Heh? What are you talking about? You need a flag to specify that you
> want XATTR_FORK. If not how the hell does the application specify
> that it wants XATTR_FORK instead of DATA_FORK (default)? Or are you
> of the opinion that FIEMAP should definitely not support XATTR_FORK.
> If the latter I fully agree. This should be a separate API with
> named streams and the FD of the named stream should be passed to
> FIEMAP without the silly XATTR_FORK flag...

Ummmm - I think you misunderstood what I was saying. I was agreeing with you that if a FS does not support FIEMAP_XATTR_FORK "the correct answer is -EOPNOTSUPP or -EINVAL".
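
For comparison, the "every unsupported flag is an error" rule argued for above could be sketched roughly as follows. This is a minimal illustration, not code from any FIEMAP patch: the flag names and bit values are invented, the `implemented` argument stands in for whatever a given filesystem supports, and the caller-side retry helper is a hypothetical way an application might cope with the error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits, for illustration only. */
#define SKETCH_FLAG_SYNC        0x1u
#define SKETCH_FLAG_XATTR_FORK  0x2u

/* Strict policy: any flag the filesystem does not implement is an error. */
int sketch_strict_check(uint32_t requested, uint32_t implemented)
{
    if (requested & ~implemented)
        return -EINVAL;
    return 0;
}

/*
 * Caller side: try the full request first, then retry once without the
 * flags the caller can live without. On success *accepted holds exactly
 * the flags that were honoured, so success is never ambiguous.
 */
int sketch_negotiate(uint32_t wanted, uint32_t droppable,
                     uint32_t implemented, uint32_t *accepted)
{
    uint32_t reduced = wanted & ~droppable;
    int err;

    assert(accepted != NULL);

    err = sketch_strict_check(wanted, implemented);
    if (err == 0) {
        *accepted = wanted;
        return 0;
    }

    /* Nothing can be dropped, so the first error stands. */
    if (reduced == wanted)
        return err;

    err = sketch_strict_check(reduced, implemented);
    if (err == 0)
        *accepted = reduced;
    return err;
}
```

The design point of this variant is that a zero return always means every requested flag was acted on; the cost is that the application, not the filesystem, decides which flags it can do without.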
What I was saying is that we don't need a COMPAT flag bit to tell us the obvious error return if the filesystem does not support this functionality....

> >>Also consider what I said above about different kernels. A new
> >>feature is implemented in kernel 2.8.13 say that was not there before
> >>and an application is updated to use that feature. There will be
> >>lots of instances where that application will still be run on older
> >>kernels where this feature does not exist.
> >
> >This is *exactly* where silently ignoring flags really falls down.
>
> It does not!
>
> >On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does
> >something and it returns different structure contents for the same
>
> No it does not. You do NOT understand at all what we are talking
> about do you?!?
>
> If a flag would do something weird like returning different data then
> OBVIOUSLY you would make this a mandatory flag and it will NOT be
> ignored!

You've just successfully argued my case for me.

By your reasoning, if we have voluntary flags 1, 2 and 3 and filesystems A, B and C and filesystem A is the only filesystem to implement 1, when B implements 1 it must become a compulsory flag and hence C must now return an error despite being unchanged. Likewise when C implements 3, 3 must become a compulsory flag and A and B must now return an error despite being unchanged.

IOWs, whenever *any* filesystem implements a voluntary feature that it didn't previously support, we have to make that a mandatory feature and all other filesystems that don't support it now must return an error. You're guaranteeing the application sees changes in behaviour with this interface, not preventing them.

Can we simply mandate that filesystems return an error to commands they don't support or don't understand and drop this silly interface mutation thing?

Cheers,

Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 04:19:27 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 04:19:31 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l42BJPfB003965 for ; Wed, 2 May 2007 04:19:27 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49519) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjCpx-000625-Oy (Exim 4.63) (return-path ); Wed, 02 May 2007 12:17:33 +0100 In-Reply-To: <20070502105749.GY77450368@melbourne.sgi.com> References: <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> <20070502105749.GY77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 12:17:32 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11254 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 
11:57, David Chinner wrote: > On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: >> On 2 May 2007, at 10:15, David Chinner wrote: >>> On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: >>>> And all applications will run against a multitude of >>>> kernels. So version X of the application will run on kernel 2.4.*, >>>> 2.6.*, a.b.*, etc... For future expandability of the interface I >>>> think it is important to have both compulsory and non-compulsory >>>> flags. >>> >>> Ah, so that's what you want - a mutable interface. i.e. versioning. >>> >>> So how does compusory flags help here? What happens if a voluntary >>> flag now becomes compulsory? Or vice versa? How is the application >>> supposed to deal with this dynamically? >>> >>> I suggested a version number for this right back at the start of >>> this discussion and got told that we don't want versioned interfaces >>> because we should make the effort to get it right the first time. >>> I don't think this can be called "getting it right". >> >> Look at ext2/3/4. They do it that way and it works well. No >> versioning just compatible and incompatible flags... The proposal is >> to do the same here. > > Just because it works for extN doesn't make it right for this > interface. > >>>> For example there is no reason why FIEMAP_HSM_READ needs to be >>>> compulsory. Most filesystems do not support HSM so can safely >>>> ignore >>>> it. >>> >>> They might be able to safely ignore it, but in reality it should >>> be saying "I don't understand this". If the application *needs* to >>> use a flag like this, then it should be told that the filesystem is >>> not capable of doing what it was asked! >> >> That is where you are completely wrong! (-: Or rather you are wrong >> for my example, i.e. you are wrong/right depending on the type of >> flag in question. > > And that is the crux of the argument. > > My point is that *any* flag returns an error if the filesystem > does not support it. 
Yes and my point is that it should not do so as there are flags where it is not necessary. >> HSM_READ is definitely _NOT_ required because all >> it means is "if the file is OFFLINE, bring it ONLINE and then return >> the extent map". > > You've got the definition of HSM_READ wrong. If the flag is *not* > set, then we bring everything back online and return the full extent > map. Ah, sorry, I did indeed misunderstand what it was meant to mean. >>> OTOH if the application does not need to use the flag, then it >>> shouldn't be using it and we shouldn't be silently ignoring >>> incorrect usage of the provided API. >>> >>> What you are effectively saying about these "voluntary" flags >>> is that their behaviour is _undefined_. That is, if you use >>> these flags what you get on a successful call is undefined; >>> it may or may not contain what you asked for but you can't >>> tell if it really did what you want or returned the information >>> you asked for. >>> >>> This is a really bad semantic to encode into an API. >> >> That is your opinion. There is nothing undefined in the API at all. >> You just fail to understand it... > > FIEMAP returned success. Did it do what I asked? I don't > know because it's allowed to return success when it did ignored me. So what? > This is as silly an interface definition as saying you can > implement fsync() with { return 0; }. So, when fsync() succeeded > did it write my data to disk? I don't know; it's allowed to return > success when it ignored me. No it is not silly at all. There can be flags that fail but still the operation is a success. Example from admittedly unrelated area: when truncating a file to smaller size if the freeing of the allocated blocks fails it does not cause the truncate to fail, it just means some space is wasted/marked used when it is unused on the volume and running fsck fixes this. At least that is how I have implemented it for NTFS and I think this is the most sensible way to do it. 
The user does not care if some blocks could not be freed. All they care about is that the file is now truncated. The volume is then marked dirty, so running fsck/chkdsk will reclaim the lost space.

> It's crazy, isn't it? It makes writing applications portable
> across operating systems a real PITA (ask the MySQL folk ;)
> because POSIX really does allow fsync() to be implemented like this.
>
> I use this example because the "allow some filesystems to silently
> ignore flags they don't understand" is a portability problem for
> applications - rather than a cross-OS issue it is a cross-filesystem
> issue. That is, if different filesystems behave differently to
> the same request they will have to be handled specifically by
> the application. Every filesystem should behave in *exactly* the
> same way to the FIEMAP ioctls - if they don't support something
> they throw an error, if they do then they return the correct
> data.

It is only a problem if you do not choose wisely which flags may be ignored silently...

>>>> And vice versa, an application might specify some weird and
>>>> funky yet
>>>> to be developed feature that it expects the FS to perform and if
>>>> the
>>>> FS cannot do it (either because it does not support it or
>>>> because it
>>>> failed to perform the operation) the application expects the FS to
>>>> return an error and not to ignore the flag. An example could be
>>>> the
>>>> asked for FIEMAP_XATTR_FORK flag. If that is implemented, and
>>>> the FS
>>>> ignores it it will return the extent map for the file data
>>>> instead of
>>>> the XATTR_FORK! Not what the application wanted at all. Ouch! So
>>>> this is definitely a compulsory flag if I ever saw one.
>>>
>>> Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But
>>> we don't need a flag defined in the user visible API to tell us
>>> that we need to return an error here.
>>
>> Heh? What are you talking about? You need a flag to specify that you
>> want XATTR_FORK.
If not how the hell does the application specify >> that it wants XATTR_FORK instead of DATA_FORK (default)? Or are you >> of the opinion that FIEMAP should definitely not support XATTR_FORK. >> If the latter I fully agree. This should be a separate API with >> named streams and the FD of the named stream should be passed to >> FIEMAP without the silly XATTR_FORK flag... > > Ummmm - I think you misunderstood what I was saying. I was agreeing > with you that is a FS does not support FIEMAP_XATTR_FORK "the correct > answer is -EOPNOTSUPP or -EINVAL". > > What I was saying is that we don't need a COMPAT flag bit to tell > us the obvious error return if the filesystem does not support this > functionality.... But there is no COMPAT bit. I don't understand what you are saying... >>>> Also consider what I said above about different kernels. A new >>>> feature is implemented in kernel 2.8.13 say that was not there >>>> before >>>> and an application is updated to use that feature. There will be >>>> lots of instances where that application will still be run on older >>>> kernels where this feature does not exist. >>> >>> This is *exactly* where silently ignoring flags really falls down. >> >> It does not! >> >>> On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does >>> something and it returns different structure contents for the same >> >> No it does not. You do NOT understand at all what we are talking >> about do you?!? >> >> If a flag would do something weird like returning different data then >> OBVIOUSLY you would make this a mandatory flag and it will NOT be >> ignored! > > You've just successfully argued my case for me. No I have not at all. > By your reasoning, if we have voluntary flags 1, 2 and 3 and > filesystems A, B and C and filesystem A is the only filesystem to > implement 1, when B implements 1 bit must become a compulsory flag WHY? It does not at all. Flags CANNOT move from voluntary to compulsory. Read my argument again... 
> and hence C must now return an error despite being unchanged. Nope. > Likewise when C implement 3, 3 must become a comulsory flag and > A and B must now return an error despite being unchanged. Again no. > IOWs, whenever *any* filesystem implements a voluntary feature that > it didn't previously support, we have to make that a mandatory > feature and all other filesystems that don't support it now This is total crap. > must return an error. You're guaranteeing th application sees > changes in behaviour with this interface, not preventing. > > Can we simply mandate that filesystems return an error > to commands they don't support or don't understand and > drop this silly interface mutation thing? Can we simply not and drop this silly argument? Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 05:19:34 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 05:19:37 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l42CJWfB016412 for ; Wed, 2 May 2007 05:19:33 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HjDMP-0005ml-DC; Wed, 02 May 2007 12:51:05 +0100 Date: Wed, 2 May 2007 12:51:05 +0100 From: Christoph Hellwig To: Lachlan McIlroy Cc: xfs@oss.sgi.com, linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk Subject: Re: TAKE 963965 - Add lockdep support for XFS Message-ID: <20070502115105.GA21031@infradead.org> References: <20070427085045.D7C6E5910FF9@chook.melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070427085045.D7C6E5910FF9@chook.melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See 
http://www.infradead.org/rpr.html X-archive-position: 11255 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs

On Fri, Apr 27, 2007 at 06:50:45PM +1000, Lachlan McIlroy wrote:
> Add lockdep support for XFS

I don't think this is entirely correct, and it misses some of the most interesting cases. I've Cc'ed -fsdevel and Al to get some comments on the more tricky issues in the rename section at the end of the mail.

> Modid: xfs-linux-melb:xfs-kern:28485a
> fs/xfs/xfs_vnodeops.c - 1.695 - changed
> http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vnodeops.c.diff?r1=text&tr1=1.695&r2=text&tr2=1.694&f=h

The XFS_ILOCK_PARENT uses in xfs_create, xfs_mkdir and xfs_symlink look good.

xfs_lock_dir_and_entry should go away and just become an open-coded

	xfs_ilock(dp, XFS_ILOCK_EXCL | XFS_ILOCK_PARENT);
	xfs_ilock(ip, XFS_ILOCK_EXCL);

in the two callers, once we have made sure we have a sufficient locking protocol where we always lock the parent before the child. xfs_lock_dir_and_entry can be totally removed and replaced with just the two ilock calls if we sort out the locking as proposed in this mail.

> fs/xfs/xfs_vfsops.c - 1.518 - changed
> http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vfsops.c.diff?r1=text&tr1=1.518&r2=text&tr2=1.517&f=h

This looks a bit odd to me - the rt inodes are not connected to the filesystem namespace so the root inode can't really be their parent. Why are we locking the root inode so early? Is there a good reason we don't delay the locking until we're done with the rt inodes?
If not, the parent annotation is probably safe because we never lock
the rt inode at the same time as any other inode, but it at least needs
a big comment describing what's going on.

Now what seems to be completely lacking is any kind of annotation in
xfs_rename.c, which is the most difficult thing to get right for
inode locking because we may have to lock up to four inodes. I suggest
implementing the same locking protocol that the VFS uses for locking
i_mutex, as documented in Documentation/filesystems/directory-locking.

Also xfs_lock_inodes lacks any kind of annotation.

Let's start with the xfs_lock_inodes callers that don't fall into rename or
the xfs_lock_dir_and_entry case handled above:

 - xfs_swap_extents locks two inodes of the same type, but these
   could be directories, so there is a chance we can get into
   conflicts with the parent->child type locking
 - xfs_link locks the source inode and the target directory
   inode. The VFS locking rule is lock parent, then lock source, and
   we should follow this as it's in line with the directory
   before child rule, except that the source doesn't always
   have to be a child, in which case we don't have a problem
   anyway

And now rename gets ugly. We should follow the VFS rules with
the following required adjustments:

 - XFS needs both source and target inode (if existing) locked.
   Because both must be non-directories, sorting by inode number
   should be okay
 - Doing a lock_rename equivalent for locking the parent directories
   requires dentries, but only inodes are passed down from the VFS.
   On the other hand they are obviously guaranteed to be directories,
   so i_dentry has exactly one dentry on which we can do the upwards
   walk. s_vfs_rename_mutex is already held by the VFS so we don't
   need to do that again.

I'd suggest having a copy of the directory-locking file with the XFS
adjustments somewhere so all this is actually well documented.

 - case for source directory == parent directory is trivial.
lock parent From owner-xfs@oss.sgi.com Wed May 2 05:53:18 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 05:53:21 -0700 (PDT) Received: from e5.ny.us.ibm.com (e5.ny.us.ibm.com [32.97.182.145]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l42CrGfB024777 for ; Wed, 2 May 2007 05:53:17 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e5.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l42CrBgT015874 for ; Wed, 2 May 2007 08:53:11 -0400 Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l42CrB9i554574 for ; Wed, 2 May 2007 08:53:11 -0400 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l42CrAET015347 for ; Wed, 2 May 2007 08:53:10 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av02.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l42Cr9Ww015185; Wed, 2 May 2007 08:53:09 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id B3BC793BC1; Wed, 2 May 2007 18:23:13 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l42CrCw4025574; Wed, 2 May 2007 18:23:12 +0530 Date: Wed, 2 May 2007 18:23:12 +0530 From: "Amit K. 
Arora" To: Chris Wedgwood Cc: David Chinner , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5] fallocate system call Message-ID: <20070502125312.GA5845@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070430004702.GM32602149@melbourne.sgi.com> <20070430052559.GA13145@tuatara.stupidest.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070430052559.GA13145@tuatara.stupidest.org> User-Agent: Mutt/1.4.1i X-archive-position: 11256 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote: > On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote: > > > For FA_ALLOCATE, it's supposed to change the file size if we > > allocate past EOF, right? > > I would argue no. Use truncate for that. The patch I posted for ext4 *does* change the filesize after preallocation, if required (i.e. when preallocation is after EOF). I may have to change that, if we decide on not doing this. 
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Wed May 2 06:12:02 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 06:12:04 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l42DBvfB029629 for ; Wed, 2 May 2007 06:12:00 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id XAA08194; Wed, 2 May 2007 23:11:48 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l42DBkAf82475833; Wed, 2 May 2007 23:11:47 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l42DBipP82488324; Wed, 2 May 2007 23:11:44 +1000 (AEST) Date: Wed, 2 May 2007 23:11:44 +1000 From: David Chinner To: Christoph Hellwig Cc: xfs@oss.sgi.com Subject: Missing TAKE 958522 (was Re: [Bug 756] New: File data corruption.....) Message-ID: <20070502131144.GZ77450368@melbourne.sgi.com> References: <200705012104.l41L4CI3029767@oss.sgi.com> <20070502105241.GA15399@lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070502105241.GA15399@lst.de> User-Agent: Mutt/1.4.2.1i X-archive-position: 11257 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 02, 2007 at 12:52:41PM +0200, Christoph Hellwig wrote: > > Note that a similar issue existed in the 2.6 SGI kernel up until it was resolved > > by this recent change: > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/linux-2.6-xfs/fs/xfs/linux-2.6/xfs_lrw.c.diff?r1=1.258;r2=1.259;f=h > > Seems like someone forgot to send TAKEs to the xfs list once again.. Hmmm - that was a bad one to miss considering the importance of the problem it fixes...... 
-----

TAKE 958522 - XFS has conflicting strategies between metadata and file data flushing

Fix to prevent the notorious 'NULL files' problem after a crash.

The problem that has been addressed is that of synchronising updates of
the file size with writes that extend a file. Without the fix, the update
of a file's size, as a result of a write beyond eof, is independent of
when the cached data is flushed to disk. Often the file size update would
be written to the filesystem log before the data is flushed to disk. When
a system crashes between these two events and the filesystem log is
replayed on mount, the file's size will be set, but since the contents
never made it to disk the file is full of holes. If some of the cached
data was flushed to disk then it may just be a section of the file at the
end that has holes.

There are existing fixes to help alleviate this problem, particularly in
the case where a file has been truncated, that force cached data to be
flushed to disk when the file is closed. If the system crashes while the
file(s) are still open then this flushing will never occur.

The fix that we have implemented is to introduce a second file size,
called the in-memory file size, that represents the current file size as
viewed by the user. The existing file size, called the on-disk file size,
is the one that gets written to the filesystem log and we only update it
when it is safe to do so. When we write to a file beyond eof we only
update the in-memory file size in the write operation. Later, when the
I/O operation that flushes the cached data to disk completes, an I/O
completion routine will update the on-disk file size. The on-disk file
size will be updated to the maximum offset of the I/O or to the value of
the in-memory file size if the I/O includes eof.
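
The two-size scheme described above can be sketched in a few lines of C. This is only an illustration of the rule as stated in the TAKE text, not the code that was checked in: the struct and function names are hypothetical, and the guard that the on-disk size never moves backwards is an assumption.

```c
#include <stdint.h>

/* Hypothetical, simplified inode; not the real XFS structures. */
struct sketch_inode {
    uint64_t in_memory_size; /* file size as viewed by the user */
    uint64_t on_disk_size;   /* file size that gets logged */
};

/* Write path: a write beyond eof only extends the in-memory size. */
static void sketch_write_extend(struct sketch_inode *ip,
                                uint64_t offset, uint64_t len)
{
    uint64_t end = offset + len;

    if (end > ip->in_memory_size)
        ip->in_memory_size = end;
}

/*
 * I/O completion: move the on-disk size up to the end of the completed
 * I/O, or to the in-memory size if the I/O includes eof.
 */
static void sketch_io_done(struct sketch_inode *ip,
                           uint64_t offset, uint64_t len)
{
    uint64_t end = offset + len;

    if (end > ip->in_memory_size)
        end = ip->in_memory_size;   /* I/O includes eof: clamp to it */
    if (end > ip->on_disk_size)     /* assumption: never shrink here */
        ip->on_disk_size = end;
}
```

With this split, a crash after the write but before the data I/O completes leaves the logged size at its old value, so log replay cannot expose a range whose data never reached disk.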

Date:  Fri Mar 30 02:24:06 AEST 2007
Workarea:  vpn-emea-sw-emea-160-18.emea.sgi.com:/home/lachlan/isms/2.6.x-null
Inspected by:  dgc,tes

The following file(s) were checked into:
  longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb

Modid:  xfs-linux-melb:xfs-kern:28322a

fs/xfs/xfsidbg.c - 1.312 - changed
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfsidbg.c.diff?r1=text&tr1=1.312&r2=text&tr2=1.311&f=h
        - Fix to prevent the notorious 'NULL files' problem after a crash.

fs/xfs/xfs_vnodeops.c - 1.693 - changed
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vnodeops.c.diff?r1=text&tr1=1.693&r2=text&tr2=1.692&f=h
        - Fix to prevent the notorious 'NULL files' problem after a crash.

fs/xfs/xfs_iocore.c - 1.52 - changed
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_iocore.c.diff?r1=text&tr1=1.52&r2=text&tr2=1.51&f=h
        - Fix to prevent the notorious 'NULL files' problem after a crash.

fs/xfs/xfs_inode.c - 1.463 - changed
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_inode.c.diff?r1=text&tr1=1.463&r2=text&tr2=1.462&f=h
        - Fix to prevent the notorious 'NULL files' problem after a crash.

fs/xfs/xfs_inode.h - 1.219 - changed
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_inode.h.diff?r1=text&tr1=1.219&r2=text&tr2=1.218&f=h
        - Fix to prevent the notorious 'NULL files' problem after a crash.

fs/xfs/xfs_bmap.c - 1.367 - changed
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_bmap.c.diff?r1=text&tr1=1.367&r2=text&tr2=1.366&f=h
        - Fix to prevent the notorious 'NULL files' problem after a crash.

fs/xfs/xfs_iomap.h - 1.10 - changed
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_iomap.h.diff?r1=text&tr1=1.10&r2=text&tr2=1.9&f=h
        - Fix to prevent the notorious 'NULL files' problem after a crash.

fs/xfs/xfs_iomap.c - 1.52 - changed
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_iomap.c.diff?r1=text&tr1=1.52&r2=text&tr2=1.51&f=h
        - Fix to prevent the notorious 'NULL files' problem after a crash.

fs/xfs/linux-2.6/xfs_lrw.c - 1.259 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_lrw.c.diff?r1=text&tr1=1.259&r2=text&tr2=1.258&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/linux-2.6/xfs_aops.c - 1.142 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_aops.c.diff?r1=text&tr1=1.142&r2=text&tr2=1.141&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/dmapi/xfs_dm.c - 1.34 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/dmapi/xfs_dm.c.diff?r1=text&tr1=1.34&r2=text&tr2=1.33&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 23:45:17 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 23:45:20 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l436jDfB003835 for ; Wed, 2 May 2007 23:45:15 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA04360; Thu, 3 May 2007 16:45:03 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l436j2Af82987621; Thu, 3 May 2007 16:45:02 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l436ixY983041938; Thu, 3 May 2007 16:44:59 +1000 (AEST) Date: Thu, 3 May 2007 16:44:59 +1000 From: David Chinner To: Christoph Hellwig Cc: Lachlan McIlroy , xfs@oss.sgi.com, linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk Subject: Re: TAKE 963965 - Add lockdep support for XFS Message-ID: <20070503064459.GJ77450368@melbourne.sgi.com> References: <20070427085045.D7C6E5910FF9@chook.melbourne.sgi.com> <20070502115105.GA21031@infradead.org> Mime-Version: 1.0 
Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070502115105.GA21031@infradead.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11258 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 02, 2007 at 12:51:05PM +0100, Christoph Hellwig wrote: > On Fri, Apr 27, 2007 at 06:50:45PM +1000, Lachlan McIlroy wrote: > > Add lockdep support for XFS > > I don't think this is entirely correct, and it misses some of the > most interesting cases. Yeah, we decided it was better to get something out there that fixes the obvious and frequently reported false positives than hold it up on the hard stuff.... > I've Cc'ed -fsdevel and Al to get some comments on the more tricky > issues in the rename section at the end of the mail. There's several other tricky cases that we're not sure to handle as well - they are mainly due to *valid* lock inversions. i.e. we do "lock A, lock B" in most places, but in others we do "lock B, *trylock* A" to avoid deadlocks. I think the MOUNT_ILOCK/inode ilock is one of these pairs. > > > > Modid: xfs-linux-melb:xfs-kern:28485a > > fs/xfs/xfs_vnodeops.c - 1.695 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/> > xfs_vnodeops.c.diff?r1=text&tr1=1.695&r2=text&tr2=1.694&f=h > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vnodeops.c.diff?r1=text&tr1=1.695&r2=text&tr2=1.694&f=h > > The XFS_ILOCK_PARENT uses in xfs_create, xfs_mkdir and xfs_symlink look good. > > xfs_lock_dir_and_entry should go away and just become and opencoded > > xfs_ilock(dp, XFS_ILOCK_EXCL | XFS_ILOCK_PARENT); > xfs_ilock(ip, XFS_ILOCK_EXCL); > > in the two callers, once we made sure to have a sufficient locking > protocol where we always lock the parent before the child. 
> > xfs_lock_dir_and_entry can be totally removed and replaced with just > the two ilock calls if we sort out the locking as proposed in this > mail. I'm not sure it is that simple - we currently always group locking of multiple inodes in increasing inode number order. i don't know what deadlock that is protecting against. There's also the case that we can't sleep on the ilock if the inode in the AIL while we hold the directory lock. Once again I'm not sure what the deadlock is, but given we are now in a transaction it's probably a tail-pushing deadlock that it is avoiding. Without knowing for certain what these are avoiding, I don't think we should be removing the code blindly.... > > fs/xfs/xfs_vfsops.c - 1.518 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/> > xfs_vfsops.c.diff?r1=text&tr1=1.518&r2=text&tr2=1.517&f=h > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vfsops.c.diff?r1=text&tr1=1.518&r2=text&tr2=1.517&f=h > > This looks a bit odd to me - the rt inodes are not connected to the > filesystem namespace so the root inode can't really be it's parent. > > Why are we locking the root inode so early. Is there a good reason we > don't delay the locking until we're done with the rt inodes? No idea - it's like that on irix too, and I don't have time right now to discover why.... > If not the parent annotation is probably safe beause we never lock > the rt inode at the same time as any other inode, but it at least needs > a big comment describing what's going on. > > > > Now what seems to be completely lacking is any kind of annotation in > xfs_rename.c, which is the most difficult thing to get right for > inode locking because we may have to lock up to four inodes. I suggest > to implement the same locking protocol the the VFS uses for locking > i_mutex, as document in Documentation/filesystems/directory-locking: > > Also xfs_lock_inodes lacks any kind of annotation. It calls xfs_lock_inumorder() to set up the annotation. 
The inode number in the set of inodes to be locked drives the lock subclass for nesting. Also xfs_rename locking ends up calling xfs_lock_inodes() and so it does get annotated. > Let's start with the xfs_lock_inodes that don't fall into rename or > xfs_lock_dir_and_entry handled above: > > > - xfs_swap_extents locks two inodes of the same type, but these > could be directories, so there is a chance we can get into > conflicts with the parent->child type locking Uses xfs_lock_inodes() so subclass nesting is used instead of parent/child. > - xfs_link locks the source inode and the target directory > inode. vfs locking rule is lock parent, lock source and > we should follow this as it's in line with the directory > before child rule except that the source doesn't always > have to be a child, in which case we don't have a problem > anyway It locks in inode number order as per xfs_lock_dir_and_entry() and uses xfs_lock_inodes() for annotation. > And now rename gets ugly, we should follow the VFS rules with > the following required adjustments: > > - XFS needs both source and target inode (if existing) locked. > Because both must be non-directories sorting by inode number > should be okay > - Doing a lock_rename equivalent for locking the parent directories > requires dentries, but only inodes are passed down from the VFS. > On the other hand they are obviously guranteed to be directories > so i_dentry has exactly one dentry on which we can do the upwards > walk. This is a lot of churn that I don't really see as necessary - why should we risk deadlocks and difficult to diagnose problems when the current code works and is now annotated? Cheers, Dave. 
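
The ordering rule described in this reply (lock a group of inodes in increasing inode number order, and let each inode's position in that order pick its lockdep subclass) can be sketched roughly as follows. This is a minimal illustration only: the types and helper names are hypothetical, no real locks are taken, and it is not the body of xfs_lock_inodes() or xfs_lock_inumorder().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an inode that is about to be locked. */
struct sketch_inode {
    uint64_t ino;           /* inode number */
    int      lock_subclass; /* nesting subclass that lockdep would see */
};

/* qsort comparator: ascending inode number. */
static int sketch_cmp_ino(const void *a, const void *b)
{
    const struct sketch_inode *x = *(struct sketch_inode *const *)a;
    const struct sketch_inode *y = *(struct sketch_inode *const *)b;

    return (x->ino > y->ino) - (x->ino < y->ino);
}

/*
 * Sort the set by inode number and give each inode a subclass equal to
 * its position, so every caller nests the locks in the same order.
 */
static void sketch_order_for_locking(struct sketch_inode **ips, size_t n)
{
    qsort(ips, n, sizeof(*ips), sketch_cmp_ino);
    for (size_t i = 0; i < n; i++)
        ips[i]->lock_subclass = (int)i;
}
```

Because every path that takes several inode locks sorts them the same way, two tasks can never hold the same pair in opposite orders, and the per-position subclass tells lockdep that the nesting is deliberate.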
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 3 00:49:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 00:49:17 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l437nCfB010685 for ; Thu, 3 May 2007 00:49:13 -0700 Received: from localhost.adilger.int (S0106000bdb95b39c.cg.shawcable.net [70.72.213.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id C2D3C4E456B; Thu, 3 May 2007 01:49:10 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 4864A406D; Thu, 3 May 2007 00:49:09 -0700 (PDT) Date: Thu, 3 May 2007 00:49:09 -0700 From: Andreas Dilger To: David Chinner Cc: Anton Altaparmakov , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070503074909.GA6220@schatzie.adilger.int> Mail-Followup-To: David Chinner , Anton Altaparmakov , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> <20070502105749.GY77450368@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070502105749.GY77450368@melbourne.sgi.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 
39F5 0D35 BED6 X-archive-position: 11259 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 02, 2007 20:57 +1000, David Chinner wrote: > On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: > > HSM_READ is definitely _NOT_ required because all > > it means is "if the file is OFFLINE, bring it ONLINE and then return > > the extent map". > > You've got the definition of HSM_READ wrong. If the flag is *not* > set, then we bring everything back online and return the full extent > map. > > Specifying the flag indicates that we do *not* want the offline > extents brought back online. i.e. it is a HSM or a datamover > (e.g. backup program) that is querying the extents and we want to > known *exactly* what the current state of the file is right now. > > So, if the HSM_READ flag is set, then the application is > expecting the filesytem to be part of a HSM. Hence if it's not, > it should return an error because somebody has done something wrong. In my original proposal I specifically pointed out that the FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour as the XFS_IOC_GETBMAPX BMV_IF_NO_DMAPI_READ flag. Data is retrieved from HSM only if the HSM_READ flag is set. That's why the flag is called "HSM_READ" instead of "HSM_NO_READ". The reason is that it seems bad if the default behaviour for calling ioctl(FIEMAP) would be to force retrieval of data from HSM, and this is only disabled by specifying a flag. It makes a lot more sense to just leave the data as it is and return the extent mapping by default (i.e. this is the principle of least surprise). It would probably be equally surprising and undesirable if the default behaviour was to force all data out to HSM. For that matter, I'm also beginning to wonder if the FLAG_HSM_READ should even be a part of this interface? 
I have no problem with returning a
flag that reports if the data is migrated to HSM and whether it is UNMAPPED.

Having FIEMAP force the retrieval of data from HSM strikes me as something
that should be a part of a separate HSM interface, which also needs to be
able to do things like push specific files or parts thereof out to HSM,
set the aging policy, and return information like "where does the HSM
file live" and "how many copies are there".

Do you know the reasoning behind including this into XFS_IOC_GETBMAPX?
Looking at the bmap.c comments it appears it is simply because the API
isn't able to return something like UNMAPPED|HSM_RESIDENT to indicate
there is data in HSM but it has no blocks allocated in the filesystem.
I don't think it makes the operation significantly more efficient than,
say, "ioctl(DMAPI_FORCE_READ); ioctl(FIEMAP)" if an application actually
needs the data to be present instead of just returning mapping info
that includes "UNMAPPED".

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
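
The opt-in semantics argued for in this message (no recall from HSM by default, with offline data merely reported in the returned extent flags) might look roughly like the sketch below. All names and flag values here are hypothetical placeholders chosen for illustration; they are not taken from the FIEMAP proposal or from any kernel header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request flag: recall offline data before mapping. */
#define SKETCH_FLAG_HSM_READ    0x1u

/* Hypothetical per-extent result flags. */
#define SKETCH_EXTENT_UNMAPPED  0x1u /* no blocks allocated in the fs */
#define SKETCH_EXTENT_HSM       0x2u /* data currently lives in HSM */

struct sketch_extent {
    bool     offline;   /* true if the data has been migrated to HSM */
    uint32_t out_flags; /* flags reported back to the caller */
};

/*
 * Map one extent. Data is recalled only when the caller asked for it;
 * otherwise the extent is left alone and reported as UNMAPPED|HSM.
 * Returns true if a recall was triggered.
 */
static bool sketch_map_extent(struct sketch_extent *ext, uint32_t req_flags)
{
    bool recalled = false;

    if (ext->offline && (req_flags & SKETCH_FLAG_HSM_READ)) {
        ext->offline = false; /* stand-in for the real recall */
        recalled = true;
    }
    ext->out_flags = ext->offline
        ? (SKETCH_EXTENT_UNMAPPED | SKETCH_EXTENT_HSM)
        : 0;
    return recalled;
}
```

Under this reading a plain mapping query never changes where the data lives, which is the "principle of least surprise" behaviour described above.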
From owner-xfs@oss.sgi.com Thu May 3 01:24:27 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 01:24:31 -0700 (PDT) Received: from ppsw-2.csi.cam.ac.uk (ppsw-2.csi.cam.ac.uk [131.111.8.132]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l438OPfB031233 for ; Thu, 3 May 2007 01:24:27 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49510) by ppsw-2.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.152]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjWb8-0005kD-8u (Exim 4.63) (return-path ); Thu, 03 May 2007 09:23:34 +0100 In-Reply-To: <20070503074909.GA6220@schatzie.adilger.int> References: <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> <20070502105749.GY77450368@melbourne.sgi.com> <20070503074909.GA6220@schatzie.adilger.int> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <13539C2E-16DA-4F86-9CBB-D16050EDDC44@cam.ac.uk> Cc: David Chinner , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Thu, 3 May 2007 09:23:33 +0100 To: Andreas Dilger X-Mailer: Apple Mail (2.752.3) X-archive-position: 11260 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 3 May 2007, at 
08:49, Andreas Dilger wrote: > On May 02, 2007 20:57 +1000, David Chinner wrote: >> On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: >>> HSM_READ is definitely _NOT_ required because all >>> it means is "if the file is OFFLINE, bring it ONLINE and then return >>> the extent map". >> >> You've got the definition of HSM_READ wrong. If the flag is *not* >> set, then we bring everything back online and return the full extent >> map. >> >> Specifying the flag indicates that we do *not* want the offline >> extents brought back online. i.e. it is a HSM or a datamover >> (e.g. backup program) that is querying the extents and we want to >> known *exactly* what the current state of the file is right now. >> >> So, if the HSM_READ flag is set, then the application is >> expecting the filesytem to be part of a HSM. Hence if it's not, >> it should return an error because somebody has done something wrong. > > In my original proposal I specifically pointed out that the > FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour as the > XFS_IOC_GETBMAPX > BMV_IF_NO_DMAPI_READ flag. Data is retrieved from HSM only if the > HSM_READ flag is set. That's why the flag is called "HSM_READ" > instead > of "HSM_NO_READ". Cool. I did not misunderstand after all then. (-: > The reason is that it seems bad if the default behaviour for calling > ioctl(FIEMAP) would be to force retrieval of data from HSM, and > this is > only disabled by specifying a flag. It makes a lot more sense to just > leave the data as it is and return the extent mapping by default (i.e. > this is the principle of least surprise). It would probably be > equally > surprising and undesirable if the default behaviour was to force all > data out to HSM. > > For that matter, I'm also beginning to wonder if the FLAG_HSM_READ > should > even be a part of this interface? I have no problem with returning a > flag that reports if the data is migrated to HSM and whether it is > UNMAPPED. 
> > Having FIEMAP force the retrieval of data from HSM strikes me as > something > that should be a part of a separate HSM interface, which also needs > to be > able to do things like push specific files or parts thereof out to > HSM, > set the aging policy, and return information like "where does the HSM > file live" and "how many copies are there". That would seem sensible to me also. Just like David argued that causing the data to be in a fixed location should be a separate interface rather than part of FIEMAP so by analogy the same should apply to touching HSM. Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Thu May 3 03:34:28 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 03:34:34 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l43AYRfB018100 for ; Thu, 3 May 2007 03:34:28 -0700 Received: from localhost.adilger.int (S0106000bdb95b39c.cg.shawcable.net [70.72.213.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 6152E7BA319; Thu, 3 May 2007 04:34:26 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 2F13B4153; Thu, 3 May 2007 03:34:25 -0700 (PDT) Date: Thu, 3 May 2007 03:34:25 -0700 From: Andreas Dilger To: "Amit K. Arora" Cc: Chris Wedgwood , David Chinner , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5] fallocate system call Message-ID: <20070503103425.GE6220@schatzie.adilger.int> Mail-Followup-To: "Amit K. 
Arora" , Chris Wedgwood , David Chinner , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070430004702.GM32602149@melbourne.sgi.com> <20070430052559.GA13145@tuatara.stupidest.org> <20070502125312.GA5845@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070502125312.GA5845@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11261 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 02, 2007 18:23 +0530, Amit K. Arora wrote: > On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote: > > On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote: > > > > > For FA_ALLOCATE, it's supposed to change the file size if we > > > allocate past EOF, right? > > > > I would argue no. Use truncate for that. > > The patch I posted for ext4 *does* change the filesize after > preallocation, if required (i.e. when preallocation is after EOF). > I may have to change that, if we decide on not doing this. I think I'd agree - it may be useful to allow preallocation beyond EOF for some kinds of applications (e.g. PVR preallocating live TV in 10 minute segments or something, but not knowing in advance how long the show will actually be recorded or the final encoded size). 
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From owner-xfs@oss.sgi.com Thu May 3 08:01:58 2007
Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 08:02:00 -0700 (PDT)
Received: from smtp-ft6.fr.colt.net (smtp-ft6.fr.colt.net [213.41.78.198]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l43F1ufB005799 for ; Thu, 3 May 2007 08:01:57 -0700
Received: from harpe.intellique.com (host.93.124.68.195.rev.coltfrance.com [195.68.124.93]) by smtp-ft6.fr.colt.net (8.13.8/8.13.8/Debian-3) with ESMTP id l43EjJQV005258 for ; Thu, 3 May 2007 16:45:19 +0200
Date: Thu, 3 May 2007 16:45:21 +0200
From: Emmanuel Florac
To: xfs@oss.sgi.com
Subject: XFS crash on linux raid
Message-ID: <20070503164521.16efe075@harpe.intellique.com>
Organization: Intellique
X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-archive-position: 11262
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: eflorac@intellique.com
Precedence: bulk
X-list: xfs

Hello,

Apparently quite a lot of people encounter the same problem from
time to time, but I couldn't find any solution.

When writing quite a lot to the filesystem (heavy load on the
fileserver), the filesystem crashes when filled to 2.5~3TB (varies from
time to time). The filesystems tested were always running on a software
raid 0, with barriers disabled. I tend to think that disabled write
barriers are causing the crash but I'll do some more tests to be sure.

I first met this problem on 12/23 (yup... merry
christmas :) when a 13 TB filesystem went belly up:

Dec 23 01:38:10 storiq1 -- MARK --
Dec 23 01:58:10 storiq1 -- MARK --
Dec 23 02:10:29 storiq1 kernel: xfs_iunlink_remove: xfs_itobp() returned an error 990 on md0. Returning error.
Dec 23 02:10:29 storiq1 kernel: xfs_inactive:^Ixfs_ifree() returned an error = 990 on md0 Dec 23 02:10:29 storiq1 kernel: xfs_force_shutdown(md0,0x1) called from line 1763 of file fs/xfs/xfs_vnodeops.c. Return address = 0xc027f78b Dec 23 02:38:11 storiq1 -- MARK -- Dec 23 02:58:11 storiq1 -- MARK -- When mounting, it did that : Filesystem "md0": Disabling barriers, not supported by the underlying device XFS mounting filesystem md0 Starting XFS recovery on filesystem: md0 (logdev: internal) Filesystem "md0": xfs_inode_recover: Bad inode magic number, dino ptr = 0xf7196600, dino bp = 0xf718e980, ino = 119318 Filesystem "md0": XFS internal error xlog_recover_do_inode_trans(1) at line 2352 of file fs/xfs/xfs_log_recover.c. Caller 0xc025d180 xlog_recover_do_inode_trans+0x93d/0xa00 xlog_recover_do_trans+0x140/0x160 xfs_buf_delwri_queue+0x2b/0xb0 xlog_recover_do_trans+0x140/0x160 kmem_zalloc+0x1f/0x50 xlog_recover_commit_trans+0x3f/0x50 xlog_recover_process_data+0xea/0x240 xlog_do_recovery_pass+0x39a/0xb70 hrtimer_run_queues+0x29/0x110 xlog_do_log_recovery+0x96/0xd0 xlog_do_recover+0x3b/0x170 xlog_recover+0xdd/0xf0 xfs_log_mount+0xa1/0x110 xfs_mountfs+0x825/0xf30 xfs_fs_cmn_err+0x27/0x30 xfs_ioinit+0x27/0x50 xfs_mount+0x2ff/0x520 vfs_mount+0x43/0x50 xfs_fs_fill_super+0x9a/0x200 debug_mutex_add_waiter+0x3d/0xd0 snprintf+0x27/0x30 disk_name+0xb4/0xc0 sb_set_blocksize+0x1f/0x50 get_sb_bdev+0x106/0x150 xfs_fs_get_sb+0x30/0x40 xfs_fs_fill_super+0x0/0x200 do_kern_mount+0x5f/0xe0 do_new_mount+0x77/0xc0 do_mount+0x18d/0x1f0 take_cpu_down+0xb/0x20 copy_mount_options+0x63/0xc0 sys_mount+0x9f/0xe0 syscall_call+0x7/0xb XFS: log mount/recovery failed: error 990 XFS: log mount failed XFS_repair (too old a version...) hosed the filesystem and destroyed most of the 2.6TB of data. Yes, there were no backup, I wrote a recovery tool to restore the video data from the raw device but the is a different story. 
The system was running vanilla 2.6.17.9, and md0 was made of 3 striped RAID-5 on 3 3Ware-9550 cards, each hardware RAID-5 made of 8 750 GB drives. On a similar hardware with 2 3Ware-9550 16x750GB striped together, but running 2.6.17.13, I had a similar fs crash last week. Unfortunately I don't have the logs at hand, but we where able to reproduce several times the crash at home : Filesystem "md0": XFS internal error xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c. Caller 0xc01fb282 xfs_btree_check_sblock+0x58/0xe0 xfs_alloc_lookup+0x142/0x400 xfs_alloc_lookup+0x142/0x400 kmem_zone_alloc+0x59/0xd0 xfs_btree_init_cursor+0x23/0x190 xfs_alloc_ag_vextent_near+0x54/0x9e0 xfs_bmap_add_extent+0x383/0x430 xfs_bmap_search_multi_extents+0x76/0xf0 xfs_alloc_ag_vextent+0x119/0x120 xfs_alloc_vextent+0x3db/0x4f0 xfs_bmap_btalloc+0x3ee/0x890 xfs_bmapi+0x1216/0x1690 xfs_dir2_grow_inode+0xf6/0x400 cache_alloc_refill+0xb6/0x1e0 xfs_idata_realloc+0x3b/0x130 xfs_dir2_sf_to_block+0xac/0x5d0 xfs_dir2_lookup+0x129/0x130 xfs_dir2_sf_addname+0x97/0x110 xfs_dir2_createname+0x144/0x150 xfs_trans_ijoin+0x2b/0x80 xfs_rename+0x354/0x9f0 xfs_access+0x3f/0x50 xfs_vn_rename+0x48/0xa0 __link_path_walk+0xc7c/0xc90 xfs_getattr+0x23f/0x2f0 mntput_no_expire+0x1b/0x80 cache_alloc_refill+0xb6/0x1e0 vfs_rename_other+0x96/0xd0 vfs_rename+0x258/0x2d0 do_rename+0x171/0x1a0 cache_grow+0x10b/0x160 cache_alloc_refill+0xb6/0x1e0 do_getname+0x4b/0x80 sys_renameat+0x47/0x80 sys_rename+0x28/0x30 syscall_call+0x7/0xb Filesystem "md0": XFS internal error xfs_trans_cancel at line 1150 of file fs/xfs/xfs_trans.c. 
Caller 0xc0245ec7 xfs_trans_cancel+0xd0/0x100 xfs_rename+0x6a7/0x9f0 xfs_rename+0x6a7/0x9f0 xfs_access+0x3f/0x50 xfs_vn_rename+0x48/0xa0 __link_path_walk+0xc7c/0xc90 xfs_getattr+0x23f/0x2f0 mntput_no_expire+0x1b/0x80 cache_alloc_refill+0xb6/0x1e0 vfs_rename_other+0x96/0xd0 vfs_rename+0x258/0x2d0 do_rename+0x171/0x1a0 cache_grow+0x10b/0x160 cache_alloc_refill+0xb6/0x1e0 do_getname+0x4b/0x80 sys_renameat+0x47/0x80 sys_rename+0x28/0x30 syscall_call+0x7/0xb xfs_force_shutdown(md0,0x8) called from line 1151 of file fs/xfs/xfs_trans.c. Return address = 0xc025f7b9 Filesystem "md0": Corruption of in-memory data detected. Shutting down filesystem: md0 Please umount the filesystem, and rectify the problem(s) xfs_force_shutdown(md0,0x1) called from line 338 of file fs/xfs/xfs_rw.c. Return address = 0xc025f7b9 xfs_force_shutdown(md0,0x1) called from line 338 of file fs/xfs/xfs_rw.c. Return address = 0xc025f7b9 After xfs_repair, the fs is fine. However, it crashes again when writing again a couple of GBs of data. It crashes again under 2.6.17.13, 2.6.17.13 SMP, 2.6.18.8, 2.6.16.36... Out of curiosity, I've tried to use reiserfs (just to see how it compares regarding this). Reiserfs crashed before even writing 100MB! So I tend to believe this is a "write barrier" problem and it looks really nasty!!! To sort this out I've started a test on a single 3Ware raid, without software raid. Any idea on how to circumvent the problem to make software RAID/LVM usable? 
-- ---------------------------------------- Emmanuel Florac | Intellique ---------------------------------------- From owner-xfs@oss.sgi.com Thu May 3 16:02:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 16:02:10 -0700 (PDT) Received: from smtp111.sbc.mail.mud.yahoo.com (smtp111.sbc.mail.mud.yahoo.com [68.142.198.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l43N25fB020477 for ; Thu, 3 May 2007 16:02:06 -0700 Received: (qmail 94083 invoked from network); 3 May 2007 23:02:04 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp111.sbc.mail.mud.yahoo.com with SMTP; 3 May 2007 23:02:03 -0000 X-YMail-OSG: ArKZSuYVM1kqn6qAuVrrwBMH7q78gcbdZ1PV.SHTJD7BztaEkuYJYhv3Ob5ff5ZJrgc4r7nNHw-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id B6EFB1827265; Thu, 3 May 2007 16:02:02 -0700 (PDT) Date: Thu, 3 May 2007 16:02:02 -0700 From: Chris Wedgwood To: Emmanuel Florac Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070503230202.GA12747@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503164521.16efe075@harpe.intellique.com> X-archive-position: 11263 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 04:45:21PM +0200, Emmanuel Florac wrote: > After xfs_repair, the fs is fine. However, it crashes again when > writing again a couple of GBs of data. It crashes again under > 2.6.17.13, 2.6.17.13 SMP, 2.6.18.8, 2.6.16.36... 4K stacks? > So I tend to believe this is a "write barrier" problem and it looks > really nasty!!! You could try "mount -o nobarrier ...." 
From owner-xfs@oss.sgi.com Thu May 3 17:59:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 17:59:40 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l440xXfB009201 for ; Thu, 3 May 2007 17:59:35 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA29867; Fri, 4 May 2007 10:59:25 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l440xNAf83828843; Fri, 4 May 2007 10:59:24 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l440xMeV83970284; Fri, 4 May 2007 10:59:22 +1000 (AEST) Date: Fri, 4 May 2007 10:59:22 +1000 From: David Chinner To: Emmanuel Florac Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504005922.GC32602149@melbourne.sgi.com> References: <20070503164521.16efe075@harpe.intellique.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503164521.16efe075@harpe.intellique.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11264 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 04:45:21PM +0200, Emmanuel Florac wrote: > > Hello, > Apparently quite a lot of people do encounter the same problem from > time to time, but I couldn't find any solution. > > When writing quite a lot to the filesystem (heavy load on the > fileserver), the filesystem crashes when filled at 2.5~3TB (varies from > time to time). The filesystems tested where always running on a software > raid 0, with disabled barriers. 
> I tend to think that disabled write
> barriers are causing the crash but I'll do some more tests to get sure.
>
> I've met this problem for the first time on 12/23 (yup... merry
> christmas :) when a 13 TB filesystem went belly up :
>
> Dec 23 01:38:10 storiq1 -- MARK --
> Dec 23 01:58:10 storiq1 -- MARK --
> Dec 23 02:10:29 storiq1 kernel: xfs_iunlink_remove: xfs_itobp()
> returned an error 990 on md0. Returning error.
> Dec 23 02:10:29 storiq1 kernel: xfs_inactive:^Ixfs_ifree() returned an
> error = 990 on md0
> Dec 23 02:10:29 storiq1 kernel: xfs_force_shutdown(md0,0x1) called from
> line 1763 of file fs/xfs/xfs_vnodeops.c. Return address = 0xc027f78b
> Dec 23 02:38:11 storiq1 -- MARK --
> Dec 23 02:58:11 storiq1 -- MARK --

So, while trying to remove an inode, corruption was found on disk and the
filesystem was shut down. Were there any I/O errors reported before the
shutdown?

> When mounting, it did that :
>
> Filesystem "md0": Disabling barriers, not supported by the underlying
> device XFS mounting filesystem md0
> Starting XFS recovery on filesystem: md0 (logdev: internal)
> Filesystem "md0": xfs_inode_recover: Bad inode magic number, dino ptr =
> 0xf7196600, dino bp = 0xf718e980, ino = 119318 Filesystem "md0": XFS

Which was found again during log recovery.

> The system was running vanilla 2.6.17.9, and md0 was made of 3 striped
> RAID-5 on 3 3Ware-9550 cards, each hardware RAID-5 made of 8 750 GB
> drives.
>
> On a similar hardware with 2 3Ware-9550 16x750GB striped together, but
> running 2.6.17.13, I had a similar fs crash last week. Unfortunately I
> don't have the logs at hand, but we where able to reproduce several
> times the crash at home :

Hmm - 750GB drives are brand new. I wouldn't rule out media issues at this
point...

> Filesystem "md0": XFS internal error xfs_btree_check_sblock at line 336
> of file fs/xfs/xfs_btree.c. Caller 0xc01fb282

Memory corruption?

> line 1151 of file fs/xfs/xfs_trans.c.
Return address = 0xc025f7b9 > Filesystem "md0": Corruption of in-memory data detected. Shutting down > filesystem: md0 Please umount the filesystem, and rectify the > problem(s) xfs_force_shutdown(md0,0x1) called from line 338 of file > fs/xfs/xfs_rw.c. Return address = 0xc025f7b9 > xfs_force_shutdown(md0,0x1) called from line 338 of file > fs/xfs/xfs_rw.c. Return address = 0xc025f7b9 > > After xfs_repair, the fs is fine. However, it crashes again when > writing again a couple of GBs of data. It crashes again under 2.6.17.13, > 2.6.17.13 SMP, 2.6.18.8, 2.6.16.36... > > Out of curiosity, I've tried to use reiserfs (just to see how it > compares regarding this). Reiserfs crashed before even writing 100MB! That indicates there's something wrong other than the filesystem. I'd suggest making sure your raid arrays, memory, etc are all functioning correctly first. What platform are you running on? Are you running ia32 with 4k stacks? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 3 19:46:02 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 19:46:05 -0700 (PDT) Received: from mailsecure1.itc.griffith.edu.au (mailsecure1-out.itc.griffith.edu.au [132.234.242.61]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l442jwfB031706 for ; Thu, 3 May 2007 19:46:01 -0700 Received: from mailsecure1.itc.griffith.edu.au (unknown [127.0.0.1]) by mailsecure1.itc.griffith.edu.au (Symantec Mail Security) with ESMTP id 04449286 for ; Fri, 4 May 2007 12:45:57 +1000 (EST) X-AuditID: 84eaf23c-af2f2bb000004912-c9-463a9e64a23b Received: from nox-1.itc.griffith.edu.au (sc2bigip02-242.nms.griffith.edu.au [132.234.242.254]) by mailsecure1.itc.griffith.edu.au (Symantec Mail Security) with ESMTP id 4AF7730187 for ; Fri, 4 May 2007 12:45:56 +1000 (EST) Received: from [132.234.242.254] (helo=studentemail.griffith.edu.au) by nox-1.itc.griffith.edu.au with esmtp (Exim 4.63) (envelope-from ) id 1Hjnnw-0006gz-52 
for xfs@oss.sgi.com; Fri, 04 May 2007 12:45:56 +1000 Received: from ss64.me.griffith.edu.au ([132.234.103.168]) by studentemail.griffith.edu.au (Sun Java System Messaging Server 6.2-6.01 (built Apr 3 2006)) with ESMTPA id <0JHH002HWX0KTM40@studentemail.griffith.edu.au> for xfs@oss.sgi.com; Fri, 04 May 2007 12:45:56 +1000 (EST) Date: Fri, 04 May 2007 12:45:55 +1000 From: Stephen So Subject: Re: Slow performance when extracting tarballs In-reply-to: <20070430213538.GA30809@tuatara.stupidest.org> To: xfs@oss.sgi.com Message-id: <463A9E63.7010007@griffith.edu.au> Organization: Griffith School of Engineering, Griffith University, Australia MIME-version: 1.0 Content-type: text/plain; charset=ISO-8859-1 Content-transfer-encoding: 7BIT X-Enigmail-Version: 0.95.0 References: <4635DAA4.4070402@griffith.edu.au> <20070430213538.GA30809@tuatara.stupidest.org> User-Agent: Thunderbird 2.0.0.0 (X11/20070326) X-Brightmail-Tracker: AAAAAA== X-archive-position: 11265 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: S.So@griffith.edu.au Precedence: bulk X-list: xfs Hi, thanks for the reply xfs-bounce@oss.sgi.com wrote: > what does "vmstat 1" look like during this? 
>

I did a vmstat 1 and this is the output:

% vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 1002716   3540 745316    0    0    59    15  559  560  1  2 96  1  0
 0  0      0 1002700   3540 745316    0    0     0    12 1091 1543  2  2 97  0  0
 1  0      0 995464   3540 750300    0    0  2060   401 1134 2569 18  3 76  4  0
 2  0      0 980884   3540 762652    0    0  3712  1376 1238 4850 43  7 43  8  0
 1  0      0 968368   3540 776152    0    0  3968  1568 1224 5155 43  7 44  7  0
 2  0      0 954660   3540 787264    0    0  3584  1344 1244 4542 38  6 45 11  0
 1  0      0 942668   3540 797556    0    0  2944  1431 1224 4376 36  6 48 11  0
 1  0      0 932852   3540 807304    0    0  3072  1312 1229 4164 33  6 46 15  0
 3  0      0 922724   3540 817912    0    0  3072  1440 1215 4378 37  7 44 12  0
 0  1      0 911612   3540 828552    0    0  3328  1568 1242 4558 37  5 46 12  0
 1  0      0 900804   3540 839140    0    0  3072  1568 1222 4279 36  5 45 13  0
 0  0      0 887824   3540 848788    0    0  3072  1427 1250 3862 35  5 46 14  0
 1  0      0 880036   3540 857700    0    0  2560  1529 1229 3775 31  7 47 16  0
 1  0      0 867552   3540 867548    0    0  3072  1632 1250 4035 36  5 46 14  0
 0  1      0 859156   3540 877576    0    0  2944  1696 1239 4291 33  6 45 16  0
 1  0      0 852904   3540 883628    0    0  1664  5403 1229 3111 23  4 48 25  0
 0  1      0 846328   3540 888188    0    0  1536  5300 1188 2622 21  6 61 12  0
 0  1      0 842076   3540 892752    0    0  1280  5383 1232 2478 21  5 62 12  0
 1  1      0 837312   3540 897396    0    0  1408  5330 1211 2476 20  5 53 24  0
 6  1      0 828876   3540 903572    0    0  1920  5771 1245 2904 24  5 46 25  0
 1  0      0 822016   3540 912304    0    0  2304  1203 1216 3897 30  7 55  7  0
 0  1      0 818404   3540 915628    0    0  1024  9446 1181 2028 14  5 63 17  0
 0  1      0 809552   3540 923336    0    0  2432  1109 1228 3344 28  5 46 22  0
 1  0      0 801124   3540 928892    0    0  1664  9195 1201 2821 22  6 59 13  0
 0  0      0 794364   3540 935364    0    0  1792  5296 1218 3052 24  6 52 18  0
 2  1      0 789784   3540 941564    0    0  2048  4992 1194 3116 23  4 51 23  0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 781540   3540 947180    0    0  1536  6434 1226 2942 23  6 50 22  0
 4  0      0 777300   3540 953628    0    0  1920  1088 1200 2970 25  5 56 14  0
 0  1      0 772892   3540 957032    0    0  1152  9440 1201 2141 17  4 59 21  0
 1  0      0 764432   3540 964572    0    0  2304  1253 1216 3198 29  4 46 22  0
 2  0      0 756516   3540 970284    0    0  1664  9720 1222 2832 22  5 57 17  0
 1  1      0 750880   3540 977204    0    0  2176  1100 1207 2973 25  5 49 20  0
 0  0      0 745424   3540 980768    0    0  1024  9140 1200 2205 16  4 66 14  0
 0  1      0 741928   3540 986200    0    0  1664  1376 1193 2746 20  5 61 15  0
 0  1      0 734536   3540 992480    0    0  1920  5516 1226 2874 24  5 57 14  0
 0  1      0 729072   3540 997168    0    0  1408  5328 1199 2473 21  5 62 13  0
 0  1      0 723228   3540 1003288    0    0  1792  5509 1243 2959 24  6 54 15  0
 2  0      0 717948   3540 1007752    0    0  1408  5308 1196 2418 20  4 59 18  0
 4  0      0 709940   3540 1013564    0    0  1536  5568 1217 3145 25  4 55 16  0
 0  0      0 701132   3540 1021948    0    0  2816  5612 1224 3562 32  6 47 16  0
 0  1      0 702448   3540 1023140    0    0   256  5108 1203 1538  6  5 73 15  0
 0  1      0 691688   3540 1032264    0    0  2688  1852 1239 3630 32  5 45 18  0
 0  1      0 688292   3540 1034228    0    0   768  9348 1198 1671 10  3 60 27  0
 1  0      0 682636   3540 1039248    0    0  1408  1069 1198 2729 20  5 47 29  0
 1  0      0 676848   3540 1044456    0    0  1408  5704 1234 2897 20  5 59 16  0
 1  0      0 672460   3540 1049428    0    0  1536  5484 1215 2813 19  5 55 22  0
 1  0      0 663820   3540 1056108    0    0  2176  5258 1241 3245 27  5 49 20  0
 1  0      0 660064   3540 1061708    0    0  1664  1688 1222 3100 22  6 60 11  0
 0  0      0 653400   3540 1065924    0    0  1152  5496 1221 2495 17  4 51 28  0
 0  1      0 651468   3540 1069324    0    0  1152  5278 1187 2157 16  3 67 14  0
 2  0      0 645132   3540 1073620    0    0  1152  5466 1221 2714 19  5 61 17  0
 3  0      0 640544   3540 1078720    0    0  1664  5587 1219 2830 21  6 51 21  0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 634040   3540 1083872    0    0  1536  5223 1208 2996 20  3 64 14  0
 0  1      0 629024   3540 1090772    0    0  2048  5342 1199 3141 26  5 49 20  0
 0  0      0 621116   3540 1095840    0    0  1664  4410 1211 2631 22  4 52 22  0
 0  0      0 615760   3540 1100840    0    0  1408  6032 1186 2601 20  6 61 14  0
 0  0      0 608852   3540 1107448    0    0  1920  1192 1215 3228 24  6 50 21  0
 0  1      0 605872   3540 1112248    0    0  1536  5424 1220 2779 22  4 63 12  0
 0  0      0 598016   3540 1117476    0    0  1536  5603 1227 3016 23  4 53 21  0
 2  0      0 592416   3540 1122576    0    0  1536  5407 1217 2671 22  7 56 16  0
 0  0      0 587504   3540 1127404    0    0  1408  4624 1230 2599 19  5 55 21  0
 2  1      0 585800   3540 1130704    0    0  1152  1880 1175 2431 15  2 53 30  0
 0  1      0 582732   3540 1133696    0    0   896  5293 1210 2357 16  4 74  6  0
 2  0      0 575528   3540 1138696    0    0  1536  5424 1214 2585 22  5 48 26  0
 1  0      0 569992   3540 1145872    0    0  2176  1519 1245 3267 27  5 50 17  0
 0  0      0 563568   3540 1149772    0    0  1152  8164 1189 2364 15  4 74  6  0
 0  0      0 559936   3540 1153020    0    0   896  2483 1198 2145 16  6 64 15  0
 1  0      0 556504   3540 1156720    0    0  1408  5248 1206 2152 17  6 62 14  0
 0  1      0 553568   3540 1161280    0    0  1280  5716 1231 2620 19  4 59 18  0
 1  0      0 544820   3540 1167580    0    0  2048  1545 1234 2947 26  5 51 18  0
 1  0      0 541096   3540 1170748    0    0  1024  5272 1205 2107 17  3 73  8  0
 0  1      0 535092   3540 1176848    0    0  1792  6132 1225 2861 25  6 49 19  0
 0  1      0 531696   3540 1181220    0    0  1280   969 1215 2758 18  3 66 14  0
 0  1      0 528920   3540 1184220    0    0   896  5268 1192 2248 16  4 71 10  0
 0  0      0 520532   3540 1189884    0    0  1664  5425 1252 3008 21  4 64 10  0
 0  0      0 514012   3540 1196084    0    0  1920  1920 1214 3110 25  6 60 10  0
 1  0      0 511804   3540 1199608    0    0  1152  5336 1240 2224 20  6 60 15  0
 0  0      0 503212   3540 1206108    0    0  2048  6516 1227 2963 26  4 58 12  0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 500968   3540 1208380    0    0   512  4684 1214 2066 13  5 79  3  0
 2  0      0 496216   3540 1212580    0    0  1408  5727 1214 2399 21  7 65  8  0
 4  0      0 491268   3540 1217184    0    0  1408  4304 1243 2593 21  5 64 11  0
 2  0      0 488856   3540 1219784    0    0   896  2058 1189 1849 15  4 70 11  0
 0  1      0 483660   3540 1224340    0    0  1408  5824 1240 2571 23  4 52 22  0
 0  0      0 477704   3540 1229940    0    0  1536  5170 1173 2855 21  6 52 21  0
 0  1      0 474500   3540 1234952    0    0  1536  5163 1212 2629 20  3 55 23  0
 1  0      0 465196   3540 1242552    0    0  2304   940 1204 3265 28  4 47 22  0
 0  0      0 458280   3540 1247892    0    0  1664  9382 1211 2719 19  6 70  4  0
 1  0      0 453276   3540 1252592    0    0  1408  5040 1176 2827 19  5 58 18  0
 0  0      0 446840   3540 1258496    0    0  1792  5676 1221 3025 24  5 56 14  0
 1  0      0 443180   3540 1264096    0    0  1664   932 1193 2680 21  5 56 19  0
 1  0      0 435748   3540 1269060    0    0  1664  5182 1209 2635 21  4 49 27  0
 0  1      0 432060   3540 1274860    0    0  1536  5376 1183 2860 21  7 51 20  0
 0  1      0 426376   3540 1279492    0    0  1408  5177 1214 2480 19  3 51 27  0
 0  1      0 422356   3540 1283992    0    0  1280  5256 1196 2516 18  4 55 23  0
 0  0      0 410112   3540 1292848    0    0  2560  1916 1254 3839 32  5 49 14  0
 1  0      0 407296   3540 1295448    0    0   896  8244 1203 1816 15  6 74  6  0
 1  0      0 405256   3540 1297456    0    0   384  2276 1192 1729 10  5 78  7  0
 1  0      0 401044   3540 1303756    0    0  1920  2004 1260 2779 29  5 56 11  0
 1  0      0 397668   3540 1306976    0    0  1024  5432 1229 2264 18  4 68  9  0
 1  0      0 393720   3540 1310076    0    0  1024  5520 1219 1983 17  6 66 11  0
 1  0      0 384148   3540 1316896    0    0  2048  2224 1279 3279 33  5 54  9  0
 1  0      0 384716   3540 1318996    0    0   336  5291 1194 2084 12  3 74 13  0
 0  0      0 384716   3540 1319252    0    0     0   149 1115 1467  1  1 98  0  0
 0  0      0 384716   3540 1319252    0    0     0    92 1065 1075  2  3 95  0  0

> have you also tried setting (increasing) logbsize? (i think you need
> > v2 logs to make that work)
>

I read in the man page for mount that the max logbsize is 32K and the
default value for machines with more than 32 MB of memory is 32768, so I
assumed it was already set to maximum.

Best regards,
Steve.

--
__________________________________________________
Dr Stephen So, PhD, MIEEE
Griffith School of Engineering &
Institute for Integrated and Intelligent Systems
Griffith University, Gold Coast Campus
PMB 50 Gold Coast Mail Centre
Gold Coast, QLD, 9726, Australia.
E-mail: s.so@griffith.edu.au Phone: +61 7 5552 8663 Fax: +61 7 5552 8065 __________________________________________________ From owner-xfs@oss.sgi.com Thu May 3 21:30:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 21:30:17 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l444UDfB026817 for ; Thu, 3 May 2007 21:30:14 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l444U3vS017825 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 21:30:04 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l444U2BH028973; Thu, 3 May 2007 21:30:02 -0700 Date: Thu, 3 May 2007 21:30:02 -0700 From: Andrew Morton To: "Amit K. Arora" Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 3/5] ext4: Extent overlap bugfix Message-Id: <20070503213002.eff696db.akpm@linux-foundation.org> In-Reply-To: <20070426181101.GC7209@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181101.GC7209@amitarora.in.ibm.com> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit 
X-MIMEDefang-Filter: osdl$Revision: 1.177 $
X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25
X-archive-position: 11267
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: akpm@linux-foundation.org
Precedence: bulk
X-list: xfs

On Thu, 26 Apr 2007 23:41:01 +0530 "Amit K. Arora" wrote:

> +unsigned int ext4_ext_check_overlap(struct inode *inode,
> +				    struct ext4_extent *newext,
> +				    struct ext4_ext_path *path)
> +{
> +	unsigned long b1, b2;
> +	unsigned int depth, len1;
> +
> +	b1 = le32_to_cpu(newext->ee_block);
> +	len1 = le16_to_cpu(newext->ee_len);
> +	depth = ext_depth(inode);
> +	if (!path[depth].p_ext)
> +		goto out;
> +	b2 = le32_to_cpu(path[depth].p_ext->ee_block);
> +
> +	/* get the next allocated block if the extent in the path
> +	 * is before the requested block(s) */
> +	if (b2 < b1) {
> +		b2 = ext4_ext_next_allocated_block(path);
> +		if (b2 == EXT_MAX_BLOCK)
> +			goto out;
> +	}
> +
> +	if (b1 + len1 > b2) {

Are we sure that b1+len cannot wrap through zero here?

> +		newext->ee_len = cpu_to_le16(b2 - b1);
> +		return 1;
> +	}

From owner-xfs@oss.sgi.com Thu May 3 21:30:07 2007
Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 21:30:11 -0700 (PDT)
Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l444U6fB026766 for ; Thu, 3 May 2007 21:30:06 -0700
Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l444Tu2f017820 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 21:29:57 -0700
Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l444TtUT028928; Thu, 3 May 2007 21:29:55 -0700
Date: Thu, 3 May 2007 21:29:55 -0700
From: Andrew Morton
To: "Amit K.
Arora"
Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com
Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
Message-Id: <20070503212955.b1b6443c.akpm@linux-foundation.org>
In-Reply-To: <20070426180332.GA7209@amitarora.in.ibm.com>
References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com>
X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-MIMEDefang-Filter: osdl$Revision: 1.177 $
X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25
X-archive-position: 11266
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: akpm@linux-foundation.org
Precedence: bulk
X-list: xfs

On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote:

> This patch implements the fallocate() system call and adds support for
> i386, x86_64 and powerpc.
>
> ...
>
> +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

Please add a comment over this function which specifies its behaviour.
Really it should be enough material from which a full manpage can be
written.

If that's all too much, this material should at least be spelled out in the
changelog. Because there's no way in which this change can be fully
reviewed unless someone (ie: you) tells us what it is setting out to
achieve.
If we 100% implement some standard then a URL for what we claim to
implement would suffice. Given that we're at least using different types
from posix I doubt if such a thing would be sufficient.

And given the complexity and potential variability within the filesystem
implementations of this, I'd expect that _something_ additional needs to be
said?

> +{
> +	struct file *file;
> +	struct inode *inode;
> +	long ret = -EINVAL;
> +
> +	if (len == 0 || offset < 0)
> +		goto out;

The posix spec implies that negative `len' is permitted - presumably
"allocate ahead of `offset'". How peculiar.

> +	ret = -EBADF;
> +	file = fget(fd);
> +	if (!file)
> +		goto out;
> +	if (!(file->f_mode & FMODE_WRITE))
> +		goto out_fput;
> +
> +	inode = file->f_path.dentry->d_inode;
> +
> +	ret = -ESPIPE;
> +	if (S_ISFIFO(inode->i_mode))
> +		goto out_fput;
> +
> +	ret = -ENODEV;
> +	if (!S_ISREG(inode->i_mode))
> +		goto out_fput;

So we return ENODEV against an S_ISBLK fd, as per the posix spec. That
seems a bit silly of them.

> +	ret = -EFBIG;
> +	if (offset + len > inode->i_sb->s_maxbytes)
> +		goto out_fput;

This code does handle offset+len going negative, but only by accident, I
suspect. It happens that s_maxbytes has unsigned type. Perhaps a comment
here would settle the reader's mind.

> +	if (inode->i_op && inode->i_op->fallocate)
> +		ret = inode->i_op->fallocate(inode, mode, offset, len);
> +	else
> +		ret = -ENOSYS;

If we _are_ going to support negative `len', as posix suggests, I think we
should perform the appropriate sanity conversions to `offset' and `len'
right here, rather than expecting each filesystem to do it.

If we're not going to handle negative `len' then we should check for it.

> +out_fput:
> +	fput(file);
> +out:
> +	return ret;
> +}
> +EXPORT_SYMBOL(sys_fallocate);

I don't believe this needs to be exported to modules?

> +/*
> + * fallocate() modes
> + */
> +#define FA_ALLOCATE	0x1
> +#define FA_DEALLOCATE	0x2

Now those aren't in posix.
They should be documented, along with their expected semantics.

> #ifdef __KERNEL__
>
> #include
> @@ -1125,6 +1131,7 @@ struct inode_operations {
> 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
> 	int (*removexattr) (struct dentry *, const char *);
> 	void (*truncate_range)(struct inode *, loff_t, loff_t);
> +	long (*fallocate)(struct inode *, int, loff_t, loff_t);

I really do think it's better to put the variable names in definitions such
as this. Especially when we have two identically-typed variables next to
each other like that. Quick: which one is the offset and which is the
length?

From owner-xfs@oss.sgi.com Thu May 3 21:31:45 2007
Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 21:31:48 -0700 (PDT)
Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l444VhfB027433 for ; Thu, 3 May 2007 21:31:44 -0700
Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l444VY8K017921 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 21:31:35 -0700
Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l444VXbq029006; Thu, 3 May 2007 21:31:33 -0700
Date: Thu, 3 May 2007 21:31:33 -0700
From: Andrew Morton
To: "Amit K.
Arora"
Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com
Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4
Message-Id: <20070503213133.d1559f52.akpm@linux-foundation.org>
In-Reply-To: <20070426181332.GD7209@amitarora.in.ibm.com>
References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com>
X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-MIMEDefang-Filter: osdl$Revision: 1.177 $
X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25
X-archive-position: 11268
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: akpm@linux-foundation.org
Precedence: bulk
X-list: xfs

On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" wrote:

> This patch has the ext4 implemtation of fallocate system call.
>
> ...
>
> +	/* ext4_can_extents_be_merged should have checked that either
> +	 * both extents are uninitialized, or both aren't. Thus we
> +	 * need to check only one of them here.
> +	 */

Please always format multiline comments like this:

	/*
	 * ext4_can_extents_be_merged should have checked that either
	 * both extents are uninitialized, or both aren't. Thus we
	 * need to check only one of them here.
	 */

> ...
>
> +/*
> + * ext4_fallocate:
> + * preallocate space for a file
> + * mode is for future use, e.g.
for unallocating preallocated blocks etc. > + */ This description is rather thin. What is the filesystem's actual behaviour here? If the file is using extents then the implementation will do . If the file is using bitmaps then we will do . But what? Here is where it should be described. > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) > +{ > + handle_t *handle; > + ext4_fsblk_t block, max_blocks; > + int ret, ret2, nblocks = 0, retries = 0; > + struct buffer_head map_bh; > + unsigned int credits, blkbits = inode->i_blkbits; > + > + /* Currently supporting (pre)allocate mode _only_ */ > + if (mode != FA_ALLOCATE) > + return -EOPNOTSUPP; > + > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > + return -ENOTTY; So we don't implement fallocate on bitmap-based files! Well that's huge news. The changelog would be an appropriate place to communicate this, along with reasons why, or a description of the plan to fix it. Also, posix says nothing about fallocate() returning ENOTTY. > + block = offset >> blkbits; > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > + - block; > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); Now I'm mystified. Given that we're allocating an arbitrary amount of disk space, and that this disk space will require an arbitrary amount of metadata, how can we work out how much journal space we'll be needing without at least looking at `len'? 
> +	handle=ext4_journal_start(inode, credits +

Please always put spaces around "="

> +					EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1);

And around "+"

> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +retry:
> +	ret = 0;
> +	while (ret >= 0 && ret < max_blocks) {
> +		block = block + ret;
> +		max_blocks = max_blocks - ret;
> +		ret = ext4_ext_get_blocks(handle, inode, block,
> +					max_blocks, &map_bh,
> +					EXT4_CREATE_UNINITIALIZED_EXT, 0);
> +		BUG_ON(!ret);

BUG_ON is vicious. Is it really justified here? Possibly a WARN_ON and
ext4_error() would be safer and more useful here.

> +		if (ret > 0 && test_bit(BH_New, &map_bh.b_state)

Use buffer_new() here. A separate patch which fixes the three existing
instances of open-coded BH_foo usage would be appreciated.

> +			&& ((block + ret) > (i_size_read(inode) << blkbits)))

Check for wrap through the sign bit and through zero please.

> +			nblocks = nblocks + ret;
> +	}
> +
> +	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> +		goto retry;
> +
> +	/* Time to update the file size.
> +	 * Update only when preallocation was requested beyond the file size.
> +	 */

Fix comment layout.

> +	if ((offset + len) > i_size_read(inode)) {

Both the lhs and the rhs here are signed. Please review for possible
overflows through the sign bit and through zero. Perhaps a comment
explaining why it's correct would be appropriate.

> +		if (ret > 0) {
> +		/* if no error, we assume preallocation succeeded completely */
> +			mutex_lock(&inode->i_mutex);
> +			i_size_write(inode, offset + len);
> +			EXT4_I(inode)->i_disksize = i_size_read(inode);
> +			mutex_unlock(&inode->i_mutex);
> +		} else if (ret < 0 && nblocks) {
> +		/* Handle partial allocation scenario */

The above two comments should be indented one additional tabstop.
> + loff_t newsize; > + mutex_lock(&inode->i_mutex); > + newsize = (nblocks << blkbits) + i_size_read(inode); > + i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits)); > + EXT4_I(inode)->i_disksize = i_size_read(inode); > + mutex_unlock(&inode->i_mutex); > + } > + } > + ext4_mark_inode_dirty(handle, inode); > + ret2 = ext4_journal_stop(handle); > + if (ret > 0) > + ret = ret2; > + > + return ret > 0 ? 0 : ret; > +} > + > EXPORT_SYMBOL(ext4_mark_inode_dirty); > EXPORT_SYMBOL(ext4_ext_invalidate_cache); > EXPORT_SYMBOL(ext4_ext_insert_extent); > EXPORT_SYMBOL(ext4_ext_walk_space); > EXPORT_SYMBOL(ext4_ext_find_goal); > EXPORT_SYMBOL(ext4_ext_calc_credits_for_insert); > +EXPORT_SYMBOL(ext4_fallocate); > > Index: linux-2.6.21/fs/ext4/file.c > =================================================================== > --- linux-2.6.21.orig/fs/ext4/file.c > +++ linux-2.6.21/fs/ext4/file.c > @@ -135,5 +135,6 @@ const struct inode_operations ext4_file_ > .removexattr = generic_removexattr, > #endif > .permission = ext4_permission, > + .fallocate = ext4_fallocate, > }; > > Index: linux-2.6.21/include/linux/ext4_fs.h > =================================================================== > --- linux-2.6.21.orig/include/linux/ext4_fs.h > +++ linux-2.6.21/include/linux/ext4_fs.h > @@ -102,6 +102,8 @@ > EXT4_GOOD_OLD_FIRST_INO : \ > (s)->s_first_ino) > #endif > +#define EXT4_BLOCK_ALIGN(size, blkbits) (((size)+(1 << blkbits)-1) & \ > + (~((1 << blkbits)-1))) Maybe a comment describing what this does? Probably it's obvious enough. I think it could use the standard ALIGN macro. Is blkbits sufficiently parenthesised here? Even if it is, adding the parens would be better practice. > /* > * Macro-instructions used to manage fragments > @@ -225,6 +227,10 @@ struct ext4_new_group_data { > __u32 free_blocks_count; > }; > > +/* Following is used by preallocation logic to tell get_blocks() that we > + * want uninitialzed extents. 
> + */ Please convert all newly-added multiline comments to the preferred layout. > +#define EXT4_CREATE_UNINITIALIZED_EXT 2 > > /* > * ioctl commands > @@ -976,6 +982,7 @@ extern int ext4_ext_get_blocks(handle_t > extern void ext4_ext_truncate(struct inode *, struct page *); > extern void ext4_ext_init(struct super_block *); > extern void ext4_ext_release(struct super_block *); > +extern int ext4_fallocate(struct inode *, int, loff_t, loff_t); argh. And feel free to give these args some useful names. > static inline int > ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, > unsigned long max_blocks, struct buffer_head *bh, > Index: linux-2.6.21/include/linux/ext4_fs_extents.h > =================================================================== > --- linux-2.6.21.orig/include/linux/ext4_fs_extents.h > +++ linux-2.6.21/include/linux/ext4_fs_extents.h > @@ -125,6 +125,19 @@ struct ext4_ext_path { > #define EXT4_EXT_CACHE_EXTENT 2 > > /* > + * Macro-instructions to handle (mark/unmark/check/create) unitialized > + * extents. Applications can issue an IOCTL for preallocation, which results > + * in assigning unitialized extents to the file. > + */ > +#define ext4_ext_mark_uninitialized(ext) ((ext)->ee_len |= \ > + cpu_to_le16(0x8000)) > +#define ext4_ext_is_uninitialized(ext) ((le16_to_cpu((ext)->ee_len))& \ > + 0x8000) > +#define ext4_ext_get_actual_len(ext) ((le16_to_cpu((ext)->ee_len))& \ > + 0x7FFF) inlined C functions are preferred, and I think these could be implemented that way. 
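[Editor's sketch, not part of the thread.] To illustrate the closing suggestion about preferring inlined C functions over macros, here is a minimal userspace sketch of what the three helpers might look like as static inlines. The struct and all demo_* names are hypothetical stand-ins, and the sketch keeps ee_len in CPU byte order, so the le16_to_cpu()/cpu_to_le16() conversions used by the quoted macros are deliberately left out. It is not the code that was eventually merged.

```c
#include <stdint.h>

/* Hypothetical stand-in for the on-disk extent; only ee_len matters here. */
struct demo_extent {
	uint16_t ee_len;	/* CPU byte order in this sketch (assumption) */
};

/* Top bit flags an uninitialized extent; low 15 bits hold the length. */
#define DEMO_EXT_UNINIT_FLAG	0x8000u
#define DEMO_EXT_LEN_MASK	0x7FFFu

static inline void demo_ext_mark_uninitialized(struct demo_extent *ext)
{
	ext->ee_len = (uint16_t)(ext->ee_len | DEMO_EXT_UNINIT_FLAG);
}

static inline int demo_ext_is_uninitialized(const struct demo_extent *ext)
{
	return (ext->ee_len & DEMO_EXT_UNINIT_FLAG) != 0;
}

static inline unsigned int demo_ext_get_actual_len(const struct demo_extent *ext)
{
	return ext->ee_len & DEMO_EXT_LEN_MASK;
}
```

Compared with the macros, the inline versions type-check their argument and evaluate it exactly once, which is presumably the motivation behind the review comment.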
From owner-xfs@oss.sgi.com Thu May 3 21:32:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 21:32:51 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l444WmfB027913 for ; Thu, 3 May 2007 21:32:49 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l444WdFD017959 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 21:32:40 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l444Wc1E029024; Thu, 3 May 2007 21:32:39 -0700 Date: Thu, 3 May 2007 21:32:38 -0700 From: Andrew Morton To: "Amit K. Arora" Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents Message-Id: <20070503213238.5cdb1585.akpm@linux-foundation.org> In-Reply-To: <20070426181623.GE7209@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181623.GE7209@amitarora.in.ibm.com> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 
X-archive-position: 11269 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs

On Thu, 26 Apr 2007 23:46:23 +0530 "Amit K. Arora" wrote:

> This patch adds write support for preallocated (using fallocate system
> call) blocks/extents. The preallocated extents in ext4 are marked
> "uninitialized", hence they need special handling especially while
> writing to them. This patch takes care of that.
>
> ...
>
> /*
> + * ext4_ext_try_to_merge:
> + * tries to merge the "ex" extent to the next extent in the tree.
> + * It always tries to merge towards right. If you want to merge towards
> + * left, pass "ex - 1" as argument instead of "ex".
> + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
> + * 1 if they got merged.

OK.

> + */
> +int ext4_ext_try_to_merge(struct inode *inode,
> +				struct ext4_ext_path *path,
> +				struct ext4_extent *ex)
> +{
> +	struct ext4_extent_header *eh;
> +	unsigned int depth, len;
> +	int merge_done=0, uninitialized = 0;

space around "=", please.

Many people prefer not to do the multiple-definitions-per-line, btw:

	int merge_done = 0;
	int uninitialized = 0;

reasons:

- It gives you some space for a nice comment

- It makes patches much more readable, and it makes rejects easier to fix

- standardisation.

> +	depth = ext_depth(inode);
> +	BUG_ON(path[depth].p_hdr == NULL);
> +	eh = path[depth].p_hdr;
> +
> +	while (ex < EXT_LAST_EXTENT(eh)) {
> +		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
> +			break;
> +		/* merge with next extent! */
> +		if (ext4_ext_is_uninitialized(ex))
> +			uninitialized = 1;
> +		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
> +				+ ext4_ext_get_actual_len(ex + 1));
> +		if (uninitialized)
> +			ext4_ext_mark_uninitialized(ex);
> +
> +		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
> +			len = (EXT_LAST_EXTENT(eh) - ex - 1)
> +					* sizeof(struct ext4_extent);
> +			memmove(ex + 1, ex + 2, len);
> +		}
> +		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);

Kernel convention is to put spaces around "-"

> +		merge_done = 1;
> +		BUG_ON(eh->eh_entries == 0);

eek, scary BUG_ON. Do we really need to be that severe? Would it be
better to warn and run ext4_error() here?

> +	}
> +
> +	return merge_done;
> +}
> +
> +
>
> ...
>
> +/*
> + * ext4_ext_convert_to_initialized:
> + * this function is called by ext4_ext_get_blocks() if someone tries to write
> + * to an uninitialized extent. It may result in splitting the uninitialized
> + * extent into multiple extents (upto three). Atleast one initialized extent
> + * and atmost two uninitialized extents can result.

There are some typos here

> + * There are three possibilities:
> + * a> No split required: Entire extent should be initialized.
> + * b> Split into two extents: Only one end of the extent is being written to.
> + * c> Split into three extents: Somone is writing in middle of the extent.
and here > + */ > +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, > + struct ext4_ext_path *path, > + ext4_fsblk_t iblock, > + unsigned long max_blocks) > +{ > + struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex; > + struct ext4_extent_header *eh; > + unsigned int allocated, ee_block, ee_len, depth; > + ext4_fsblk_t newblock; > + int err = 0, ret = 0; > + > + depth = ext_depth(inode); > + eh = path[depth].p_hdr; > + ex = path[depth].p_ext; > + ee_block = le32_to_cpu(ex->ee_block); > + ee_len = ext4_ext_get_actual_len(ex); > + allocated = ee_len - (iblock - ee_block); > + newblock = iblock - ee_block + ext_pblock(ex); > + ex2 = ex; > + > + /* ex1: ee_block to iblock - 1 : uninitialized */ > + if (iblock > ee_block) { > + ex1 = ex; > + ex1->ee_len = cpu_to_le16(iblock - ee_block); > + ext4_ext_mark_uninitialized(ex1); > + ex2 = &newex; > + } > + /* for sanity, update the length of the ex2 extent before > + * we insert ex3, if ex1 is NULL. This is to avoid temporary > + * overlap of blocks. > + */ > + if (!ex1 && allocated > max_blocks) > + ex2->ee_len = cpu_to_le16(max_blocks); > + /* ex3: to ee_block + ee_len : uninitialised */ > + if (allocated > max_blocks) { > + unsigned int newdepth; > + ex3 = &newex; > + ex3->ee_block = cpu_to_le32(iblock + max_blocks); > + ext4_ext_store_pblock(ex3, newblock + max_blocks); > + ex3->ee_len = cpu_to_le16(allocated - max_blocks); > + ext4_ext_mark_uninitialized(ex3); > + err = ext4_ext_insert_extent(handle, inode, path, ex3); > + if (err) > + goto out; > + /* The depth, and hence eh & ex might change > + * as part of the insert above. 
> + */ > + newdepth = ext_depth(inode); > + if (newdepth != depth) > + { Use if (newdepth != depth) { > + depth=newdepth; spaces > + path = ext4_ext_find_extent(inode, iblock, NULL); > + if (IS_ERR(path)) { > + err = PTR_ERR(path); > + path = NULL; > + goto out; > + } > + eh = path[depth].p_hdr; > + ex = path[depth].p_ext; > + if (ex2 != &newex) > + ex2 = ex; > + } > + allocated = max_blocks; > + } > + /* If there was a change of depth as part of the > + * insertion of ex3 above, we need to update the length > + * of the ex1 extent again here > + */ > + if (ex1 && ex1 != ex) { > + ex1 = ex; > + ex1->ee_len = cpu_to_le16(iblock - ee_block); > + ext4_ext_mark_uninitialized(ex1); > + ex2 = &newex; > + } > + /* ex2: iblock to iblock + maxblocks-1 : initialised */ > + ex2->ee_block = cpu_to_le32(iblock); > + ex2->ee_start = cpu_to_le32(newblock); > + ext4_ext_store_pblock(ex2, newblock); > + ex2->ee_len = cpu_to_le16(allocated); > + if (ex2 != ex) > + goto insert; > + if ((err = ext4_ext_get_access(handle, inode, path + depth))) > + goto out; The preferred style is err = ext4_ext_get_access(handle, inode, path + depth); if (err) goto out; > + /* New (initialized) extent starts from the first block > + * in the current extent. i.e., ex2 == ex > + * We have to see if it can be merged with the extent > + * on the left. > + */ > + if (ex2 > EXT_FIRST_EXTENT(eh)) { > + /* To merge left, pass "ex2 - 1" to try_to_merge(), > + * since it merges towards right _only_. > + */ > + ret = ext4_ext_try_to_merge(inode, path, ex2 - 1); > + if (ret) { > + err = ext4_ext_correct_indexes(handle, inode, path); > + if (err) > + goto out; > + depth = ext_depth(inode); > + ex2--; > + } > + } > + /* Try to Merge towards right. This might be required > + * only when the whole extent is being written to. > + * i.e. ex2==ex and ex3==NULL. 
> + */ > + if (!ex3) { > + ret = ext4_ext_try_to_merge(inode, path, ex2); > + if (ret) { > + err = ext4_ext_correct_indexes(handle, inode, path); > + if (err) > + goto out; > + } > + } > + /* Mark modified extent as dirty */ > + err = ext4_ext_dirty(handle, inode, path + depth); > + goto out; > +insert: > + err = ext4_ext_insert_extent(handle, inode, path, &newex); > +out: > + return err ? err : allocated; > +} Sigh. I hope you guys know how all this works, because the extent code is a mystery to me. Is the on-disk layout and the allocation strategy described anywhere? > +extern int ext4_ext_try_to_merge(struct inode *, struct ext4_ext_path *, struct ext4_extent *); Again, I do think that sticking the identifiers in there helps readability. Although it is not as important in a boring old declaration as it is in, say, inode_operations, etc. Please try to keep the code looking nice in an 80-column display. From owner-xfs@oss.sgi.com Thu May 3 21:55:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 21:55:43 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l444tdfB002706 for ; Thu, 3 May 2007 21:55:40 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l444tTgs018661 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 21:55:31 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l444tSik029320; Thu, 3 May 2007 21:55:29 -0700 Date: Thu, 3 May 2007 21:55:28 -0700 From: Andrew Morton To: "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-Id: <20070503215528.d8ab4e47.akpm@linux-foundation.org> In-Reply-To: <20070503212955.b1b6443c.akpm@linux-foundation.org> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11270 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Thu, 3 May 2007 21:29:55 -0700 Andrew Morton wrote: > > + ret = -EFBIG; > > + if (offset + len > inode->i_sb->s_maxbytes) > > + goto out_fput; > > This code does handle offset+len going negative, but only by accident, I > suspect. But it doesn't handle offset+len wrapping through zero. 
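[Editor's sketch, not part of the thread.] The wrap-around concern above can be made concrete. One common way to avoid it is to rearrange the comparison so that offset + len is never computed. The helper name demo_check_range() and the use of int64_t in place of loff_t are assumptions for illustration; the sketch also assumes maxbytes is non-negative and that a non-positive len is rejected with EINVAL, which is a policy choice rather than something the quoted patch settles.

```c
#include <stdint.h>
#include <errno.h>

/*
 * Hypothetical helper: validate an (offset, len) pair against a maximum
 * file size without ever forming the sum offset + len, so the check
 * cannot wrap through the sign bit or through zero.
 * Assumes maxbytes >= 0.
 */
static int demo_check_range(int64_t offset, int64_t len, int64_t maxbytes)
{
	if (offset < 0 || len <= 0)
		return -EINVAL;
	/*
	 * Equivalent to "offset + len > maxbytes". Since
	 * 0 <= offset <= maxbytes at the point of subtraction,
	 * maxbytes - offset cannot overflow.
	 */
	if (offset > maxbytes || len > maxbytes - offset)
		return -EFBIG;
	return 0;
}
```

For example, demo_check_range(INT64_MAX - 1, INT64_MAX, INT64_MAX) returns -EFBIG, whereas a naive signed addition of those two values would overflow before the comparison.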
From owner-xfs@oss.sgi.com Thu May 3 22:16:46 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 22:16:49 -0700 (PDT) Received: from ozlabs.org (ozlabs.org [203.10.76.45]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l445GifB007660 for ; Thu, 3 May 2007 22:16:45 -0700 Received: by ozlabs.org (Postfix, from userid 1003) id 8F525DDFF5; Fri, 4 May 2007 15:16:43 +1000 (EST) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17978.47502.786970.196554@cargo.ozlabs.ibm.com> Date: Fri, 4 May 2007 14:41:50 +1000 From: Paul Mackerras To: Andrew Morton Cc: "Amit K. Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc In-Reply-To: <20070503212955.b1b6443c.akpm@linux-foundation.org> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> X-Mailer: VM 7.19 under Emacs 21.4.1 X-archive-position: 11271 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: paulus@samba.org Precedence: bulk X-list: xfs Andrew Morton writes: > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > This patch implements the fallocate() system call and adds support for > > i386, x86_64 and powerpc. > > > > ... 
> > > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > Please add a comment over this function which specifies its behaviour. > Really it should be enough material from which a full manpage can be > written. This looks like it will have the same problem on s390 as sys_sync_file_range. Maybe the prototype should be: asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) Paul. From owner-xfs@oss.sgi.com Thu May 3 23:08:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 23:08:13 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l44687fB021573 for ; Thu, 3 May 2007 23:08:09 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA06552; Fri, 4 May 2007 16:07:46 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4467cAf83970051; Fri, 4 May 2007 16:07:38 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4467VZ384026819; Fri, 4 May 2007 16:07:31 +1000 (AEST) Date: Fri, 4 May 2007 16:07:31 +1000 From: David Chinner To: Andrew Morton Cc: "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070504060731.GJ32602149@melbourne.sgi.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503212955.b1b6443c.akpm@linux-foundation.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11272 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > This patch implements the fallocate() system call and adds support for > > i386, x86_64 and powerpc. > > > > ... > > +{ > > + struct file *file; > > + struct inode *inode; > > + long ret = -EINVAL; > > + > > + if (len == 0 || offset < 0) > > + goto out; > > The posix spec implies that negative `len' is permitted - presumably "allocate > ahead of `offset'". How peculiar. I just checked the man page for posix_fallocate() and it says: EINVAL offset or len was less than zero. We should probably follow this lead. > > + > > + ret = -ENODEV; > > + if (!S_ISREG(inode->i_mode)) > > + goto out_fput; > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > seems a bit silly of them. 
Hmmmm - I thought that the intention of sys_fallocate() was to be generic enough to eventually allow preallocation on directories. If that is the case, then this check will prevent that.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 3 23:28:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 23:28:37 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l446STfB031053 for ; Thu, 3 May 2007 23:28:30 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l446SGLQ021546 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 23:28:18 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l446SFXl030589; Thu, 3 May 2007 23:28:16 -0700 Date: Thu, 3 May 2007 23:28:15 -0700 From: Andrew Morton To: David Chinner Cc: "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-Id: <20070503232815.2f62a75e.akpm@linux-foundation.org> In-Reply-To: <20070504060731.GJ32602149@melbourne.sgi.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <20070504060731.GJ32602149@melbourne.sgi.com> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11273 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Fri, 4 May 2007 16:07:31 +1000 David Chinner wrote: > On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > > > This patch implements the fallocate() system call and adds support for > > > i386, x86_64 and powerpc. > > > > > > ... > > > +{ > > > + struct file *file; > > > + struct inode *inode; > > > + long ret = -EINVAL; > > > + > > > + if (len == 0 || offset < 0) > > > + goto out; > > > > The posix spec implies that negative `len' is permitted - presumably "allocate > > ahead of `offset'". How peculiar. 
> > I just checked the man page for posix_fallocate() and it says: > > EINVAL offset or len was less than zero. > > We should probably follow this lead. Yes, I think so. I'm suspecting that http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html is just buggy. Or I can't read. I mean, if we're going to support negative `len' then is the byte at `offset' inside or outside the segment? Head spins. However it would be neat if someone could test $OTHER_OS and, perhaps more importantly, the present glibc emulation (which I assume your manpage is referring to, so this would be a manpage test ;)). > > > + > > > + ret = -ENODEV; > > > + if (!S_ISREG(inode->i_mode)) > > > + goto out_fput; > > > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > > seems a bit silly of them. > > Hmmmm - I thought that the intention of sys_fallocate() was to > be generic enough to eventually allow preallocation on directories. > If that is the case, then this check will prevent that.... The above opengroup page only permits S_ISREG. Preallocating directories sounds quite useful to me, although it's something which would be pretty hard to emulate if the FS doesn't support it. And there's a decent case to be made for emulating it - run-anywhere reasons. Does glibc emulation support directories? Quite unlikely. But yes, sounds like a desirable thing. Would XFS support it easily if the above check was relaxed? 
From owner-xfs@oss.sgi.com Thu May 3 23:57:00 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 23:57:07 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l446uwfB004955 for ; Thu, 3 May 2007 23:57:00 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l446ujPL002724; Fri, 4 May 2007 02:56:46 -0400 Received: from devserv.devel.redhat.com (devserv.devel.redhat.com [172.16.58.1]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l446uef7021912; Fri, 4 May 2007 02:56:40 -0400 Received: from devserv.devel.redhat.com (localhost.localdomain [127.0.0.1]) by devserv.devel.redhat.com (8.12.11.20060308/8.12.11) with ESMTP id l446ueGH007487; Fri, 4 May 2007 02:56:40 -0400 Received: (from jakub@localhost) by devserv.devel.redhat.com (8.12.11.20060308/8.12.11/Submit) id l446uQr9007476; Fri, 4 May 2007 02:56:26 -0400 Date: Fri, 4 May 2007 02:56:26 -0400 From: Jakub Jelinek To: Andrew Morton Cc: Ulrich Drepper , David Chinner , "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070504065626.GW355@devserv.devel.redhat.com> Reply-To: Jakub Jelinek References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <20070504060731.GJ32602149@melbourne.sgi.com> <20070503232815.2f62a75e.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503232815.2f62a75e.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11274 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jakub@redhat.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote: > > > The posix spec implies that negative `len' is permitted - presumably "allocate > > > ahead of `offset'". How peculiar. > > > > I just checked the man page for posix_fallocate() and it says: > > > > EINVAL offset or len was less than zero. That describes the current glibc implementation. > > We should probably follow this lead. > > Yes, I think so. I'm suspecting that > http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html > is just buggy. Or I can't read. > > I mean, if we're going to support negative `len' then is the byte at > `offset' inside or outside the segment? Head spins. 
> > However it would be neat if someone could test $OTHER_OS and, perhaps more > importantly, the present glibc emulation (which I assume your manpage is > referring to, so this would be a manpage test ;)). int posix_fallocate (int fd, __off_t offset, __off_t len) { struct stat64 st; struct statfs f; /* `off_t' is a signed type. Therefore we can determine whether OFFSET + LEN is too large if it is a negative value. */ if (offset < 0 || len < 0) return EINVAL; if (offset + len < 0) return EFBIG; /* First thing we have to make sure is that this is really a regular file. */ if (__fxstat64 (_STAT_VER, fd, &st) != 0) return EBADF; if (S_ISFIFO (st.st_mode)) return ESPIPE; if (! S_ISREG (st.st_mode)) return ENODEV; if (len == 0) { if (st.st_size < offset) { int ret = __ftruncate (fd, offset); if (ret != 0) ret = errno; return ret; } return 0; } ... is what glibc does ATM. Seems we violate the case where len == 0, as EINVAL in that case is "shall fail". But reading the standard to imply negative len is ok is too much guessing, there is no word what it means when len is negative and "required storage for regular file data starting at offset and continuing for len bytes" doesn't make sense for negative size. And given the general "Implementations may support additional errors not included in this list, may generate errors included in this list under circumstances other than those described here, or may contain extensions or limitations that prevent some errors from occurring." I believe returning EINVAL for len < 0 is not a POSIX violation. That doesn't mean the standard shouldn't be clarified, whether by saying EINVAL must be returned for non-positive len or saying that using negative len has undefined or implementation defined behavior. > The above opengroup page only permits S_ISREG. Preallocating directories > sounds quite useful to me, although it's something which would be pretty > hard to emulate if the FS doesn't support it. 
And there's a decent case to > be made for emulating it - run-anywhere reasons. Does glibc emulation support > directories? Quite unlikely. No, see above. Jakub From owner-xfs@oss.sgi.com Fri May 4 00:28:21 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 00:28:25 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l447SHfB017704 for ; Fri, 4 May 2007 00:28:19 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA08305; Fri, 4 May 2007 17:27:56 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l447RnAf84055039; Fri, 4 May 2007 17:27:50 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l447Rg2j84042753; Fri, 4 May 2007 17:27:42 +1000 (AEST) Date: Fri, 4 May 2007 17:27:42 +1000 From: David Chinner To: Andrew Morton Cc: David Chinner , "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070504072742.GK32602149@melbourne.sgi.com> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <20070504060731.GJ32602149@melbourne.sgi.com> <20070503232815.2f62a75e.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503232815.2f62a75e.akpm@linux-foundation.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11275 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote: > On Fri, 4 May 2007 16:07:31 +1000 David Chinner wrote: > > On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > > > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > > > > > This patch implements the fallocate() system call and adds support for > > > > i386, x86_64 and powerpc. > > > > > > > > ... > > > > +{ > > > > + struct file *file; > > > > + struct inode *inode; > > > > + long ret = -EINVAL; > > > > + > > > > + if (len == 0 || offset < 0) > > > > + goto out; > > > > > > The posix spec implies that negative `len' is permitted - presumably "allocate > > > ahead of `offset'". How peculiar. > > > > I just checked the man page for posix_fallocate() and it says: > > > > EINVAL offset or len was less than zero. 
> > > > We should probably follow this lead. > > Yes, I think so. I'm suspecting that > http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html > is just buggy. Or I can't read. > > I mean, if we're going to support negative `len' then is the byte at > `offset' inside or outside the segment? Head spins. I don't think we should care. If we provide a syscall with the semantics of "allocate from offset to offset+len" then glibc's implementation can turn negative length into two separate fallocate syscalls.... > > > > + ret = -ENODEV; > > > > + if (!S_ISREG(inode->i_mode)) > > > > + goto out_fput; > > > > > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > > > seems a bit silly of them. > > > > Hmmmm - I thought that the intention of sys_fallocate() was to > > be generic enough to eventually allow preallocation on directories. > > If that is the case, then this check will prevent that.... > > The above opengroup page only permits S_ISREG. Preallocating directories > sounds quite useful to me, although it's something which would be pretty > hard to emulate if the FS doesn't support it. And there's a decent case to > be made for emulating it - run-anywhere reasons. Does glibc emulation support > directories? Quite unlikely. > > But yes, sounds like a desirable thing. Would XFS support it easily if the above > check was relaxed? No - right now empty blocks are pruned from the directory immediately so I don't think we really have a concept of empty blocks in the btree structure. dir2 is bloody complex, so adding preallocation is probably not going to be simple to do. It's not high on my list to add, either, because we can typically avoid the worst case directory fragmentation by using larger directory block sizes (e.g. 16k instead of the default 4k on a 4k block size fs). IIRC directory preallocation has been talked about more for ext3/4.... Cheers, Dave. 
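
The argument checks debated in this thread can be sketched in a few lines
of C. This is only an illustration, not code from the fallocate patch or
from glibc: the helper name check_prealloc_range is made up, and it assumes
that a zero or negative len is rejected with EINVAL (the patch rejects
len == 0, the man page rejects negative values) and that an offset + len
overflow maps to EFBIG, as in the glibc emulation.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch or from glibc: validate an
 * (offset, len) pair along the lines discussed in the thread.  Returns 0 if
 * the range is acceptable, otherwise a positive errno value.
 */
int check_prealloc_range(int64_t offset, int64_t len)
{
	/* Assumption: negative offset and zero or negative len are EINVAL. */
	if (offset < 0 || len <= 0)
		return EINVAL;

	/* Detect offset + len overflow without performing the addition,
	 * since signed overflow is undefined behaviour in C. */
	if (len > INT64_MAX - offset)
		return EFBIG;

	return 0;
}
```

Under these assumptions check_prealloc_range(0, 4096) returns 0, while a
zero or negative len returns EINVAL. Whether a negative len should instead
be given a meaning is exactly the open question above.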
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

From owner-xfs@oss.sgi.com Fri May 4 00:29:38 2007
Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 00:29:41 -0700 (PDT)
Received: from postfix2-g20.free.fr (postfix2-g20.free.fr [212.27.60.43]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l447TbfB018149 for ; Fri, 4 May 2007 00:29:38 -0700
Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix2-g20.free.fr (Postfix) with ESMTP id CD3ADFA8658 for ; Fri, 4 May 2007 08:06:38 +0200 (CEST)
Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id CD49517BA4; Fri, 4 May 2007 09:06:13 +0200 (CEST)
Date: Fri, 4 May 2007 09:06:13 +0200
From: Emmanuel Florac
To: David Chinner , xfs@oss.sgi.com
Subject: Re: XFS crash on linux raid
Message-ID: <20070504090613.7c0f97d3@galadriel.home>
In-Reply-To: <20070504005922.GC32602149@melbourne.sgi.com>
References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com>
Organization: Intellique
X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l447TcfB018166
X-archive-position: 11276
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: eflorac@intellique.com
Precedence: bulk
X-list: xfs

On Fri, 4 May 2007 10:59:22 +1000 you wrote:

> Were there any I/O errors reported before the shutdown?
>

Nope. To make it clear: the problem can be reproduced on several
different systems, with different motherboards, different drives,
different RAID controllers... This isn't a hardware problem.

> > On similar hardware with 2 3Ware-9550 16x750GB striped together,
> > but running 2.6.17.13, I had a similar fs crash last week.
> > Unfortunately I don't have the logs at hand, but we were able to
> > reproduce the crash several times at home:
>
> Hmm - 750GB drives are brand new. I wouldn't rule out media issues
> at this point...

The problem is quite easily reproduced with 500GB drives too.

> > Filesystem "md0": XFS internal error xfs_btree_check_sblock at line
> > 336 of file fs/xfs/xfs_btree.c. Caller 0xc01fb282
>
> Memory corruption?

I tried different RAM modules, and the problem occurs with ECC RAM too.

> >
> > Out of curiosity, I've tried to use reiserfs (just to see how it
> > compares regarding this). Reiserfs crashed before even writing
> > 100MB!
>
> That indicates there's something wrong other than the filesystem.
> I'd suggest making sure your raid arrays, memory, etc are all
> functioning correctly first.

They are. I've tested 5 different machines so far (Supermicro or Tyan
motherboards, Kingston RAM, Intel or AMD CPUs, Hitachi and Seagate
drives...)

> What platform are you running on? Are you running ia32 with 4k stacks?

Yes. This week I'll test 2.6.18.8 thoroughly, and 2.6.20.11 too. Then
jfs, just to be sure.
-- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Fri May 4 00:34:00 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 00:34:03 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l447XvfB019746 for ; Fri, 4 May 2007 00:33:59 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA08568; Fri, 4 May 2007 17:33:47 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l447XkAf83983180; Fri, 4 May 2007 17:33:46 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l447Xi8582990264; Fri, 4 May 2007 17:33:44 +1000 (AEST) Date: Fri, 4 May 2007 17:33:44 +1000 From: David Chinner To: Emmanuel Florac Cc: David Chinner , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504073344.GL32602149@melbourne.sgi.com> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20070504090613.7c0f97d3@galadriel.home> User-Agent: Mutt/1.4.2.1i X-archive-position: 11277 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Fri, May 04, 2007 at 09:06:13AM +0200, Emmanuel Florac wrote: > Le Fri, 4 May 2007 10:59:22 +1000 vous criviez: > > What platform are you running on? Are you running ia32 with 4k stacks? > > Yes. I'll try this week 2.6.18.8 thoroughly and 2.6.20.11 too. 
Then
> jfs, just to be sure.

Well, there's your problem. Stack overflows. IMO, if you use a
filesystem, you shouldn't use 4k stacks. ;)

If you rebuild your kernel with 8k stacks then your problems will
most likely go away.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

From owner-xfs@oss.sgi.com Fri May 4 06:25:50 2007
Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 06:25:53 -0700 (PDT)
Received: from smtp-ft5.fr.colt.net (smtp-ft5.fr.colt.net [213.41.78.197]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l44DPlfB025077 for ; Fri, 4 May 2007 06:25:49 -0700
Received: from harpe.intellique.com (host.93.124.68.195.rev.coltfrance.com [195.68.124.93]) by smtp-ft5.fr.colt.net (8.13.8/8.13.8/Debian-3) with ESMTP id l44DPhpu000578; Fri, 4 May 2007 15:25:43 +0200
Date: Fri, 4 May 2007 15:25:46 +0200
From: Emmanuel Florac
To: David Chinner
Cc: xfs@oss.sgi.com
Subject: Re: XFS crash on linux raid
Message-ID: <20070504152546.614374ac@harpe.intellique.com>
In-Reply-To: <20070504073344.GL32602149@melbourne.sgi.com>
References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com>
Organization: Intellique
X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l44DPofB025089
X-archive-position: 11278
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: eflorac@intellique.com
Precedence: bulk
X-list: xfs

On Fri, 4 May 2007 17:33:44 +1000 David Chinner wrote:

> Well, there's your problem. Stack overflows. IMO, if you use a
> filesystem, you shouldn't use 4k stacks. ;)
>
> If you rebuild your kernel with 8k stacks then your problems will
> most likely go away.

Well, I've double-checked asm-i386/module.h, and it actually looks
like 4K stacks are NOT the default, so I must be using 8K, mustn't I?

I've run the same test on the same machine but WITHOUT software raid-0
(so write barriers are in use), and all went well: more than 3TB
written without a glitch. I still think there's something related to
the write barriers here. I'll try with another RAID controller, Adaptec
for instance, to make sure the 3ware driver isn't involved. I'll also
try again with an amd64 kernel.

I'd really like to sort this out...

--
----------------------------------------
Emmanuel Florac | Intellique
----------------------------------------

From owner-xfs@oss.sgi.com Fri May 4 07:55:35 2007
Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 07:55:38 -0700 (PDT)
Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l44EtXfB019895 for ; Fri, 4 May 2007 07:55:35 -0700
Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id C945B18022E01; Fri, 4 May 2007 09:55:30 -0500 (CDT)
Message-ID: <463B4962.70904@sandeen.net>
Date: Fri, 04 May 2007 09:55:30 -0500
From: Eric Sandeen
User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326)
MIME-Version: 1.0
To: Emmanuel Florac
CC: David Chinner , xfs@oss.sgi.com
Subject: Re: XFS crash on linux raid
References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com>
In-Reply-To: <20070504152546.614374ac@harpe.intellique.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
X-archive-position: 11279
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to:
xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Emmanuel Florac wrote: > Le Fri, 4 May 2007 17:33:44 +1000 > David Chinner crivait: > >> Well, there's your problem. Stack overflows. IMO, if you use a >> filesystem, you shouldn't use 4k stacks. ;) >> >> If you remake you kernel with 8k stacks then your problems will >> most likely go away. > > Well, I've double-checked the asm-i386/module.h, and it actually looks > like 4K stacks is NOT the default, so I must be using 8K, isn't it? Depends on how you config'd it, just look at the .config you built with, and search for CONFIG_4KSTACKS On Fedora at least (and I can't remember - I don't think this is a fedora-ism...) you can do "modinfo" on some module, and see: vermagic: 2.6.21 SMP mod_unload 686 4KSTACKS -Eric From owner-xfs@oss.sgi.com Fri May 4 08:30:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 08:30:51 -0700 (PDT) Received: from smtp-ft1.fr.colt.net (smtp-ft1.fr.colt.net [213.41.78.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l44FUlfB030646 for ; Fri, 4 May 2007 08:30:49 -0700 Received: from harpe.intellique.com (host.93.124.68.195.rev.coltfrance.com [195.68.124.93]) by smtp-ft1.fr.colt.net (8.13.8/8.13.8/Debian-3) with ESMTP id l44FUdlH008756; Fri, 4 May 2007 17:30:41 +0200 Date: Fri, 4 May 2007 17:30:49 +0200 From: Emmanuel Florac To: Eric Sandeen Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504173049.14606033@harpe.intellique.com> In-Reply-To: <463B4962.70904@sandeen.net> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 X-Antivirus: 
checked in 0.023sec at smtp-ft1.fr.colt.net ([213.41.78.210]) by smf-clamd Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l44FUnfB030656 X-archive-position: 11280 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Fri, 04 May 2007 09:55:30 -0500 Eric Sandeen crivait: > Emmanuel Florac wrote: > > Le Fri, 4 May 2007 17:33:44 +1000 > > David Chinner crivait: > > > >> Well, there's your problem. Stack overflows. IMO, if you use a > >> filesystem, you shouldn't use 4k stacks. ;) > >> > >> If you remake you kernel with 8k stacks then your problems will > >> most likely go away. > > > > Well, I've double-checked the asm-i386/module.h, and it actually > > looks like 4K stacks is NOT the default, so I must be using 8K, > > isn't it? > > Depends on how you config'd it, just look at the .config you built > with, and search for CONFIG_4KSTACKS config-2.6.17.13: # CONFIG_4KSTACKS is not set So the problem lies elsewhere... 
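
Reading the .config by eye works; the same check can also be sketched in
C. The function name config_line_enables is hypothetical, and the sketch
only assumes the usual .config convention: an enabled boolean option
appears as OPTION=y, a disabled one as a "# OPTION is not set" comment.

```c
#include <string.h>

/*
 * Hypothetical helper: report whether one line of a kernel .config enables
 * the given boolean option, i.e. the line is exactly "<option>=y",
 * optionally followed by a newline.  A comment line such as
 * "# CONFIG_4KSTACKS is not set" yields 0.
 */
int config_line_enables(const char *line, const char *option)
{
	size_t n = strlen(option);

	/* The line must start with the option name... */
	if (strncmp(line, option, n) != 0)
		return 0;

	/* ...followed immediately by "=y" and then the end of the line. */
	if (strncmp(line + n, "=y", 2) != 0)
		return 0;

	return line[n + 2] == '\0' || line[n + 2] == '\n';
}
```

For the line quoted above, config_line_enables("# CONFIG_4KSTACKS is not
set", "CONFIG_4KSTACKS") returns 0, which matches the conclusion that the
kernel was not built with 4K stacks.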
-- ---------------------------------------- Emmanuel Florac | Intellique ---------------------------------------- From owner-xfs@oss.sgi.com Fri May 4 08:58:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 08:58:29 -0700 (PDT) Received: from mail.lichtvoll.de (mondschein.lichtvoll.de [194.150.191.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l44FwOfB005594 for ; Fri, 4 May 2007 08:58:25 -0700 Received: from localhost (dslb-084-057-112-255.pools.arcor-ip.net [84.57.112.255]) by mail.lichtvoll.de (Postfix) with ESMTP id AF67E5AD3F for ; Fri, 4 May 2007 17:58:22 +0200 (CEST) From: Martin Steigerwald To: linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Date: Fri, 4 May 2007 17:58:21 +0200 User-Agent: KMail/1.9.6 References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> (sfid-20070504_161005_263297_AD8C4AAD) In-Reply-To: <20070504152546.614374ac@harpe.intellique.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705041758.21320.Martin@lichtvoll.de> X-archive-position: 11281 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: Martin@lichtvoll.de Precedence: bulk X-list: xfs Am Freitag 04 Mai 2007 schrieb Emmanuel Florac: > I've ran the same test on the same machine but WITHOUT software raid-0 > (so write barriers are in use), and all went well, more than 3TB > written without a glitch. I still think there's something related to > the write barriers here. I'll try with another RAID controller, Adaptec > for instance, to get sure the 3ware driver isn't involved. I'll also > try again with an amd64 kernel. Hello Emmanuel! 

If you can't use write barriers, as XFS tells you in the logs, you had
better switch off write caching for the hard disks / RAID controller,
unless you happen to have NVRAM or a safe power supply. But then, using
the write cache without barriers should not make any difference unless
you actually have a crash or power failure during a write operation.

Did you test with ext3 as well? You wrote it crashes with ReiserFS
(version 3) even faster. When it crashes with several filesystems, it's
unlikely to be a filesystem issue.

Regards,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7

From owner-xfs@oss.sgi.com Fri May 4 15:12:31 2007
Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 15:12:33 -0700 (PDT)
Received: from postfix2-g20.free.fr (postfix2-g20.free.fr [212.27.60.43]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l44MCTfB015022 for ; Fri, 4 May 2007 15:12:30 -0700
Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix2-g20.free.fr (Postfix) with ESMTP id 48027FA5D2B for ; Fri, 4 May 2007 22:44:22 +0200 (CEST)
Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 89A381838B; Fri, 4 May 2007 23:43:56 +0200 (CEST)
Date: Fri, 4 May 2007 23:43:57 +0200
From: Emmanuel Florac
To: Martin Steigerwald
Cc: linux-xfs@oss.sgi.com
Subject: Re: XFS crash on linux raid
Message-ID: <20070504234357.24d22883@galadriel.home>
In-Reply-To: <200705041758.21320.Martin@lichtvoll.de>
References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de>
Organization: Intellique
X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id
l44MCVfB015030 X-archive-position: 11282 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Fri, 4 May 2007 17:58:21 +0200 vous criviez: > Did you test with ext3 as well? You wrote it crashes with ReiserFS > (version 3) even faster. When it crashes with several filesystems its > unlikely to be a filesystem issue. Unfortunately ext3 doesn't support volumes bigger than 8TB, so that's useless to me. I plan to test jfs, however. I think it's more a dm/md issue, but I'm not sure... -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Fri May 4 16:20:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 16:20:38 -0700 (PDT) Received: from smtp108.sbc.mail.mud.yahoo.com (smtp108.sbc.mail.mud.yahoo.com [68.142.198.207]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l44NKVfB031820 for ; Fri, 4 May 2007 16:20:32 -0700 Received: (qmail 71668 invoked from network); 4 May 2007 23:20:30 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp108.sbc.mail.mud.yahoo.com with SMTP; 4 May 2007 23:20:29 -0000 X-YMail-OSG: OPP1hd4VM1lfaVvz3tObISaM4S9Wsbmdmu7ru90QC85M5NGiDwRjeqFhPzMWSgDFVI.VQ1CzkQ-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 7B9EE1827261; Fri, 4 May 2007 16:20:28 -0700 (PDT) Date: Fri, 4 May 2007 16:20:28 -0700 From: Chris Wedgwood To: Emmanuel Florac Cc: Eric Sandeen , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504232028.GA19744@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> 
<463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20070504173049.14606033@harpe.intellique.com>
X-archive-position: 11283
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: cw@f00f.org
Precedence: bulk
X-list: xfs

On Fri, May 04, 2007 at 05:30:49PM +0200, Emmanuel Florac wrote:

> # CONFIG_4KSTACKS is not set
>
> So the problem lies elsewhere...

CONFIG_4KSTACKS is badly named. It means you have 4K process + 4K
interrupt stacks. Without this set you have just a single 8K stack
for processes and interrupts.

One argument for 4K+4K stacks is that 8K+0K isn't really safer in many
cases --- it just appears that way because the problems are harder to
hit.

Almost three years ago I posted patches to split the CONFIG_4KSTACKS
option into two options. I just ported that to 2.6.21 (very quickly; I
might have goofed fixing up the rejects).

If you have time, you could try this: enable CONFIG_I386_IRQSTACKS
but not CONFIG_I386_4KSTACKS, and see if that helps...

diff --git a/arch/i386/Kconfig.debug b/arch/i386/Kconfig.debug
index 458bc16..f32fbec 100644
--- a/arch/i386/Kconfig.debug
+++ b/arch/i386/Kconfig.debug
@@ -56,15 +56,22 @@ config DEBUG_RODATA
 	  portion of the kernel code won't be covered by a 2MB TLB anymore.
 	  If in doubt, say "N".
 
-config 4KSTACKS
+config I386_4KSTACKS
 	bool "Use 4Kb for kernel stacks instead of 8Kb"
 	depends on DEBUG_KERNEL
 	help
 	  If you say Y here the kernel will use a 4Kb stacksize for the
 	  kernel stack attached to each process/thread. This facilitates
 	  running more threads on a system and also reduces the pressure
-	  on the VM subsystem for higher order allocations. This option
-	  will also use IRQ stacks to compensate for the reduced stackspace.
+	  on the VM subsystem for higher order allocations.
+
+config I386_IRQSTACKS
+	bool "Allocate separate IRQ stacks"
+	depends on DEBUG_KERNEL
+	default y
+	help
+	  If you say Y here the kernel will allocate and use separate
+	  stacks for interrupts.
 
 config X86_FIND_SMP_CONFIG
 	bool
diff --git a/arch/i386/defconfig b/arch/i386/defconfig
diff --git a/arch/i386/kernel/irq.c b/arch/i386/kernel/irq.c
index 8db8d51..f6224fd 100644
--- a/arch/i386/kernel/irq.c
+++ b/arch/i386/kernel/irq.c
@@ -47,7 +47,7 @@ void ack_bad_irq(unsigned int irq)
 #endif
 }
 
-#ifdef CONFIG_4KSTACKS
+#ifdef CONFIG_I386_IRQSTACKS
 /*
  * per-CPU IRQ handling contexts (thread information and stack)
  */
@@ -58,7 +58,7 @@ union irq_ctx {
 
 static union irq_ctx *hardirq_ctx[NR_CPUS] __read_mostly;
 static union irq_ctx *softirq_ctx[NR_CPUS] __read_mostly;
-#endif
+#endif /* CONFIG_I386_IRQSTACKS */
 
 /*
  * do_IRQ handles all normal device IRQ's (the special
@@ -71,7 +71,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs)
 	/* high bit used in ret_from_ code */
 	int irq = ~regs->orig_eax;
 	struct irq_desc *desc = irq_desc + irq;
-#ifdef CONFIG_4KSTACKS
+#ifdef CONFIG_I386_IRQSTACKS
 	union irq_ctx *curctx, *irqctx;
 	u32 *isp;
 #endif
@@ -99,7 +99,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs)
 	}
 #endif
 
-#ifdef CONFIG_4KSTACKS
+#ifdef CONFIG_I386_IRQSTACKS
 
 	curctx = (union irq_ctx *) current_thread_info();
 	irqctx = hardirq_ctx[smp_processor_id()];
@@ -136,7 +136,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs)
 			: "memory", "cc"
 		);
 	} else
-#endif
+#endif /* CONFIG_I386_IRQSTACKS */
 		desc->handle_irq(irq, desc);
 
 	irq_exit();
@@ -144,7 +144,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs)
 	return 1;
 }
 
-#ifdef CONFIG_4KSTACKS
+#ifdef CONFIG_I386_IRQSTACKS
 
 /*
  * These should really be __section__(".bss.page_aligned") as well, but
@@ -234,7 +234,7 @@ asmlinkage void do_softirq(void)
 }
 
 EXPORT_SYMBOL(do_softirq);
-#endif
+#endif /* CONFIG_I386_IRQSTACKS */
 
 /*
  * Interrupt statistics:
diff --git a/include/asm-i386/irq.h b/include/asm-i386/irq.h
index 11761cd..7db95e1
100644
--- a/include/asm-i386/irq.h
+++ b/include/asm-i386/irq.h
@@ -24,14 +24,14 @@ static __inline__ int irq_canonicalize(int irq)
 # define ARCH_HAS_NMI_WATCHDOG /* See include/linux/nmi.h */
 #endif
 
-#ifdef CONFIG_4KSTACKS
+#ifdef CONFIG_I386_IRQSTACKS
   extern void irq_ctx_init(int cpu);
   extern void irq_ctx_exit(int cpu);
 # define __ARCH_HAS_DO_SOFTIRQ
-#else
+#else /* !CONFIG_I386_IRQSTACKS */
 # define irq_ctx_init(cpu) do { } while (0)
 # define irq_ctx_exit(cpu) do { } while (0)
-#endif
+#endif /* CONFIG_I386_IRQSTACKS */
 
 #ifdef CONFIG_IRQBALANCE
 extern int irqbalance_disable(char *str);
diff --git a/include/asm-i386/module.h b/include/asm-i386/module.h
index 02f8f54..7d5d2df 100644
--- a/include/asm-i386/module.h
+++ b/include/asm-i386/module.h
@@ -62,11 +62,11 @@ struct mod_arch_specific
 #error unknown processor family
 #endif
 
-#ifdef CONFIG_4KSTACKS
+#ifdef CONFIG_I386_4KSTACKS
 #define MODULE_STACKSIZE "4KSTACKS "
-#else
+#else /* not using CONFIG_I386_4KSTACKS */
 #define MODULE_STACKSIZE ""
-#endif
+#endif /* CONFIG_I386_4KSTACKS */
 
 #define MODULE_ARCH_VERMAGIC MODULE_PROC_FAMILY MODULE_STACKSIZE
 
diff --git a/include/asm-i386/thread_info.h b/include/asm-i386/thread_info.h
index 4b187bb..f5268e0 100644
--- a/include/asm-i386/thread_info.h
+++ b/include/asm-i386/thread_info.h
@@ -53,7 +53,7 @@ struct thread_info {
 #endif
 
 #define PREEMPT_ACTIVE 0x10000000
-#ifdef CONFIG_4KSTACKS
+#ifdef CONFIG_I386_4KSTACKS
 #define THREAD_SIZE (4096)
 #else
 #define THREAD_SIZE (8192)

From owner-xfs@oss.sgi.com Fri May 4 22:21:37 2007
Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 22:21:40 -0700 (PDT)
Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l455LZfB010386 for ; Fri, 4 May 2007 22:21:36 -0700
Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with
ESMTP id 659901802EE36; Fri, 4 May 2007 23:49:31 -0500 (CDT) Message-ID: <463C0CD8.4090402@sandeen.net> Date: Fri, 04 May 2007 23:49:28 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Emmanuel Florac CC: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> In-Reply-To: <20070504234357.24d22883@galadriel.home> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-archive-position: 11284 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Emmanuel Florac wrote: > Le Fri, 4 May 2007 17:58:21 +0200 vous criviez: > >> Did you test with ext3 as well? You wrote it crashes with ReiserFS >> (version 3) even faster. When it crashes with several filesystems its >> unlikely to be a filesystem issue. > > Unfortunately ext3 doesn't support volumes bigger than 8TB, so that's > useless to me. I plan to test jfs, however. > I think it's more a dm/md issue, but I'm not sure... > Most recent kernels (2.6.19 or so IIRC) & cvs e2fsprogs (or that from rhel5/centos5) can do up to 16T ext3 filesystems, so you should be able to test that if you like. 
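
As a rough sanity check on the 8TB and 16T figures: assuming 4 KiB blocks
and block numbers limited to 31 or 32 usable bits (an assumption; the
thread does not say where the limits come from), the maximum filesystem
size works out to 8 TiB and 16 TiB respectively. The helper name
max_fs_bytes is made up for this sketch.

```c
#include <stdint.h>

/*
 * Hypothetical helper: largest filesystem size in bytes when block numbers
 * have `block_bits` usable bits and each block is `block_size` bytes.
 * Assumes block_bits < 64 and that the product fits in 64 bits.
 */
uint64_t max_fs_bytes(unsigned int block_bits, uint32_t block_size)
{
	return ((uint64_t)1 << block_bits) * block_size;
}
```

With these inputs, max_fs_bytes(31, 4096) is 2^43 bytes (8 TiB) and
max_fs_bytes(32, 4096) is 2^44 bytes (16 TiB).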
-Eric From owner-xfs@oss.sgi.com Fri May 4 23:06:47 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 23:06:50 -0700 (PDT) Received: from mta5.adelphia.net (mta5.adelphia.net [68.168.78.187]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4566jfB021481 for ; Fri, 4 May 2007 23:06:47 -0700 Subject: Re: Mail System Error - Returned Mail To: linux-xfs@oss.sgi.com From: "Auto-reply from pjmarkert@adelphia.net" In-Reply-To: <20070505053606.FRLF26012.mta9.adelphia.net@oss.sgi.com> Precedence: bulk Date: Sat, 5 May 2007 01:36:08 -0400 Message-ID: <20070505053608.FRPH26012.mta9.adelphia.net@mta9> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 11285 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: pjmarkert@adelphia.net Precedence: bulk X-list: xfs My email address is changed to pjmarkert@verizon.net From owner-xfs@oss.sgi.com Sat May 5 08:20:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 08:20:34 -0700 (PDT) Received: from postfix1-g20.free.fr (postfix1-g20.free.fr [212.27.60.42]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45FKUfB010573 for ; Sat, 5 May 2007 08:20:31 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix1-g20.free.fr (Postfix) with ESMTP id 26851F15FD1 for ; Sat, 5 May 2007 17:20:30 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 19169182B0; Sat, 5 May 2007 17:20:27 +0200 (CEST) Date: Sat, 5 May 2007 17:19:31 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: Eric Sandeen , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505171931.6fe9b6f5@galadriel.home> In-Reply-To: <20070504232028.GA19744@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> 
<20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45FKVfB010597 X-archive-position: 11287 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Fri, 4 May 2007 16:20:28 -0700 vous criviez: > You could if you have time try this and enable CONFIG_I386_IRQSTACKS > but don't enable CONFIG_I386_4KSTACKS and see if that helps... That sounds very interesting, I'll give it a try monday. 
-- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 08:18:25 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 08:18:29 -0700 (PDT) Received: from postfix1-g20.free.fr (postfix1-g20.free.fr [212.27.60.42]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45FIOfB006549 for ; Sat, 5 May 2007 08:18:24 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix1-g20.free.fr (Postfix) with ESMTP id 2E267F15D45 for ; Sat, 5 May 2007 17:18:23 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id A524E17BFE; Sat, 5 May 2007 17:18:20 +0200 (CEST) Date: Sat, 5 May 2007 17:18:20 +0200 From: Emmanuel Florac To: Eric Sandeen Cc: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505171820.6e92d437@galadriel.home> In-Reply-To: <463C0CD8.4090402@sandeen.net> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45FIPfB006557 X-archive-position: 11286 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Fri, 04 May 2007 23:49:28 -0500 vous criviez: > > Most recent kernels (2.6.19 or so IIRC) & cvs e2fsprogs (or that from > rhel5/centos5) can do up to 16T ext3 filesystems, so you should be > able to 
test that if you like. Thanks, I'll try that too. Though it won't cover all my needs (I plan to set up 50 and 150TB systems really soon). -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 09:33:51 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 09:33:55 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45GXofB000741 for ; Sat, 5 May 2007 09:33:51 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 8860BB02F5B2; Sat, 5 May 2007 12:33:49 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 849C85000166; Sat, 5 May 2007 12:33:49 -0400 (EDT) Date: Sat, 5 May 2007 12:33:49 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: linux-raid@vger.kernel.org cc: xfs@oss.sgi.com Subject: Linux SW RAID: HW Raid Controller/JBOD vs. Multiple PCI-e Cards? Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-archive-position: 11288 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Question, I currently have a 965 chipset-based motherboard, use 4 port onboard and several PCI-e x1 controller cards for a raid 5 of 10 raptor drives. 
I get pretty decent speeds:

user@host$ time dd if=/dev/zero of=100gb bs=1M count=102400
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 247.134 seconds, 434 MB/s

real 4m7.164s
user 0m0.223s
sys 3m3.505s

user@host$ time dd if=100gb of=/dev/null bs=1M count=102400
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 172.588 seconds, 622 MB/s

real 2m52.631s
user 0m0.212s
sys 1m50.905s
user@host$

Also, when I run simultaneous dd's from all of the drives, I see 850-860MB/s. Is there some kind of limitation in software RAID that explains why I am not getting better than 500MB/s for sequential writes? With 7 disks I got about the same write speed; adding 3 more for a total of 10 did not seem to help writes. Reads, however, improved to 622MB/s from about 420-430MB/s.

If I want to upgrade to more than 12 disks, I am out of PCI-e slots, so I was wondering: does anyone on this list run a 16-port Areca or 3ware card and use it for JBOD? What kind of performance do you see when using mdadm with such a card? If anyone uses mdadm with a card that has fewer than 16 ports, I'd also like to hear about your experience with that type of configuration.

Justin.
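As a quick sanity check on the dd figures above, the reported rates can be reproduced from the byte count and elapsed time. This is a minimal sketch: the helper name is made up for illustration, and it assumes dd's "MB" means 10^6 bytes, which matches the "107 GB" it prints for 107374182400 bytes.

```python
def throughput_mb_s(total_bytes, seconds):
    """Throughput in decimal megabytes (10**6 bytes) per second."""
    return total_bytes / seconds / 1_000_000


# bs=1M count=102400 means 102400 blocks of 1 MiB each.
TOTAL_BYTES = 102400 * 1024 * 1024

write_rate = throughput_mb_s(TOTAL_BYTES, 247.134)  # elapsed time of the write run
read_rate = throughput_mb_s(TOTAL_BYTES, 172.588)   # elapsed time of the read run

print(f"write: {write_rate:.0f} MB/s, read: {read_rate:.0f} MB/s")
```

The same helper can be applied to per-drive dd runs to compare the aggregate raw-disk rate against what the md array delivers.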
From owner-xfs@oss.sgi.com Sat May 5 09:48:01 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 09:48:03 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45GlxfB005748 for ; Sat, 5 May 2007 09:48:00 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id 93A4518022E01; Sat, 5 May 2007 11:47:58 -0500 (CDT) Message-ID: <463CB53E.8000202@sandeen.net> Date: Sat, 05 May 2007 11:47:58 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Emmanuel Florac CC: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> <20070505171820.6e92d437@galadriel.home> In-Reply-To: <20070505171820.6e92d437@galadriel.home> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-archive-position: 11289 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Emmanuel Florac wrote: > Le Fri, 04 May 2007 23:49:28 -0500 vous criviez: > >> Most recent kernels (2.6.19 or so IIRC) & cvs e2fsprogs (or that from >> rhel5/centos5) can do up to 16T ext3 filesystems, so you should be >> able to test that if you like. > > Thanks, I'll try that too. Though it won't cover all my needs (I plan > to set up 50 and 150TB systems really soon). > Sure, I understand - it may be helpful in figuring out what the problem is, though. I'll be curious to see how it goes... 
Oh, btw, you'll need the -F (force) flag for mkfs.ext3 -Eric From owner-xfs@oss.sgi.com Sat May 5 09:50:15 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 09:50:17 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45GoDfB006809 for ; Sat, 5 May 2007 09:50:15 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id C4EDC18022E01; Sat, 5 May 2007 11:50:12 -0500 (CDT) Message-ID: <463CB5C4.7040803@sandeen.net> Date: Sat, 05 May 2007 11:50:12 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Emmanuel Florac CC: Chris Wedgwood , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> In-Reply-To: <20070505171931.6fe9b6f5@galadriel.home> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-archive-position: 11290 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Emmanuel Florac wrote: > Le Fri, 4 May 2007 16:20:28 -0700 vous criviez: > >> You could if you have time try this and enable CONFIG_I386_IRQSTACKS >> but don't enable CONFIG_I386_4KSTACKS and see if that helps... > > That sounds very interesting, I'll give it a try monday. 
> There are also stack debugging config options; one that will warn if you are about to overflow (CONFIG_DEBUG_STACKOVERFLOW) and one that will print max stack depth in sysrq-t output (CONFIG_DEBUG_STACK_USAGE). -Eric From owner-xfs@oss.sgi.com Sat May 5 13:35:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 13:35:35 -0700 (PDT) Received: from smtp111.sbc.mail.mud.yahoo.com (smtp111.sbc.mail.mud.yahoo.com [68.142.198.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l45KZSfB010426 for ; Sat, 5 May 2007 13:35:29 -0700 Received: (qmail 92356 invoked from network); 5 May 2007 20:35:28 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp111.sbc.mail.mud.yahoo.com with SMTP; 5 May 2007 20:35:27 -0000 X-YMail-OSG: NfkFI3wVM1l2KAzcA7Gpvf5kMfsvZM8GGJA_DL2tvbfn03E9cLQ8rwaGzn2fNG.7uUhguDxBvQ-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 887111827261; Sat, 5 May 2007 13:35:25 -0700 (PDT) Date: Sat, 5 May 2007 13:35:25 -0700 From: Chris Wedgwood To: Eric Sandeen Cc: Emmanuel Florac , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505203525.GA16477@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> <463CB5C4.7040803@sandeen.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <463CB5C4.7040803@sandeen.net> X-archive-position: 11291 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Sat, May 05, 2007 at 11:50:12AM 
-0500, Eric Sandeen wrote: > There are also stack debugging config options; one that will warn if > you are about to overflow (CONFIG_DEBUG_STACKOVERFLOW) and one that > will print max stack depth in sysrq-t output > (CONFIG_DEBUG_STACK_USAGE). I was in such a hurry I don't think I tweaked that sanely. I'll go over the patch checking that and test it later today. Is there some preferred kernel version people would like? From owner-xfs@oss.sgi.com Sat May 5 13:55:00 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 13:55:02 -0700 (PDT) Received: from smtp104.sbc.mail.mud.yahoo.com (smtp104.sbc.mail.mud.yahoo.com [68.142.198.203]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l45KsxfB015700 for ; Sat, 5 May 2007 13:54:59 -0700 Received: (qmail 62153 invoked from network); 5 May 2007 20:54:58 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp104.sbc.mail.mud.yahoo.com with SMTP; 5 May 2007 20:54:58 -0000 X-YMail-OSG: m97Te.QVM1mbh8aFrqo95Qk4qrjAE4R81UJBHQJ1y14F1mB3VHCW427ig.b06hW2BI2KGF6gBQ-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id C851A1827261; Sat, 5 May 2007 13:54:56 -0700 (PDT) Date: Sat, 5 May 2007 13:54:56 -0700 From: Chris Wedgwood To: Justin Piszcz Cc: linux-raid@vger.kernel.org, xfs@oss.sgi.com Subject: Re: Linux SW RAID: HW Raid Controller/JBOD vs. Multiple PCI-e Cards? 
Message-ID: <20070505205456.GA17112@tuatara.stupidest.org> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-archive-position: 11292 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Sat, May 05, 2007 at 12:33:49PM -0400, Justin Piszcz wrote: > Also, when I run simultaenous dd's from all of the drives, I see > 850-860MB/s, I am curious if there is some kind of limitation with > software raid as to why I am not getting better than 500MB/s for > sequential write speed? What does "vmstat 1" output look like in both cases? My guess is that for large writes it's NOT CPU bound but it can't hurt to check. > With 7 disks, I got about the same speed, adding 3 more for a total > of 10 did not seem to help in regards to write. However, read > improved to 622MBs/ from about 420-430MB/s. RAID is quirky. It's worth fiddling with the stripe size as that can have a big difference in terms of performance --- it's far from clear why on some setups some values work well and other setups you want very different values. It would be good to know if anyone has ever studied stripe size and also controller interleave/layout issues to get a good understanding of why certain values are good and others are very poor and why it varies so much from one setup to the other. Also, 'dd performance' varies between the start of a disk and the end. Typically you get better performance at the start of the disk so dd might not be a very good benchmark here. > However, if I want to upgrade to more than 12 disks, I am out of > PCI-e slots, so I was wondering, does anyone on this list run a 16 > port Areca or 3ware card and use it for JBOD? What kind of > performance do you see when using mdadm with such a card? 
Or if > anyone uses mdadm with less than a 16 port card, I'd like to hear > what kind of experiences you have seen with that type of > configuration. I've used some 2, 4 and 8 port 3ware cards. As JBODS they worked fine, as RAID cards I had no end of problems. I'm happy to test larger cards if someone wants to donate them :-) From owner-xfs@oss.sgi.com Sat May 5 13:56:54 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 13:56:58 -0700 (PDT) Received: from postfix1-g20.free.fr (postfix1-g20.free.fr [212.27.60.42]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45KuqfB016458 for ; Sat, 5 May 2007 13:56:53 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix1-g20.free.fr (Postfix) with ESMTP id 5B167F2888F for ; Sat, 5 May 2007 22:56:48 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id B30BE15507; Sat, 5 May 2007 22:56:46 +0200 (CEST) Date: Sat, 5 May 2007 22:56:46 +0200 From: Emmanuel Florac To: Eric Sandeen Cc: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505225646.1e16b0c4@galadriel.home> In-Reply-To: <463CB53E.8000202@sandeen.net> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> <20070505171820.6e92d437@galadriel.home> <463CB53E.8000202@sandeen.net> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45KusfB016478 X-archive-position: 11293 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com 
X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Sat, 05 May 2007 11:47:58 -0500 vous criviez: > Sure, I understand - it may be helpful in figuring out what the > problem is, though. I'll be curious to see how it goes... Sure, stay tuned! > Oh, btw, you'll need the -F (force) flag for mkfs.ext3 Thanks! -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 13:57:28 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 13:57:31 -0700 (PDT) Received: from postfix2-g20.free.fr (postfix2-g20.free.fr [212.27.60.43]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45KvQfB016809 for ; Sat, 5 May 2007 13:57:28 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix2-g20.free.fr (Postfix) with ESMTP id 26B7EFBB29D for ; Sat, 5 May 2007 21:57:49 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 134EF17BA4; Sat, 5 May 2007 22:57:23 +0200 (CEST) Date: Sat, 5 May 2007 22:57:23 +0200 From: Emmanuel Florac To: xfs@oss.sgi.com Cc: Chris Wedgwood Subject: Re: XFS crash on linux raid Message-ID: <20070505225723.012cc38b@galadriel.home> In-Reply-To: <463CB5C4.7040803@sandeen.net> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> <463CB5C4.7040803@sandeen.net> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit 
X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45KvSfB016842 X-archive-position: 11294 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Sat, 05 May 2007 11:50:12 -0500 vous criviez: > There are also stack debugging config options; one that will warn if > you are about to overflow (CONFIG_DEBUG_STACKOVERFLOW) and one that > will print max stack depth in sysrq-t output > (CONFIG_DEBUG_STACK_USAGE). Fine, I'll try that. -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 14:00:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 14:00:25 -0700 (PDT) Received: from postfix1-g20.free.fr (postfix1-g20.free.fr [212.27.60.42]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45L0LfB018275 for ; Sat, 5 May 2007 14:00:22 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix1-g20.free.fr (Postfix) with ESMTP id 8F802F10C62 for ; Sat, 5 May 2007 22:58:21 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 9CFA618302; Sat, 5 May 2007 22:58:19 +0200 (CEST) Date: Sat, 5 May 2007 22:58:19 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: Eric Sandeen , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505225819.0dd3c0fa@galadriel.home> In-Reply-To: <20070505203525.GA16477@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> 
<20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> <463CB5C4.7040803@sandeen.net> <20070505203525.GA16477@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45L0MfB018282 X-archive-position: 11295 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Sat, 5 May 2007 13:35:25 -0700 vous criviez: > Is there some preferred kernel version people would like? > Well I prefer staying away from the very latest bleeding edge, so I stick to 2.6.20.11 for now. -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 14:18:48 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 14:18:50 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45LIkfB023937 for ; Sat, 5 May 2007 14:18:47 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 118A517BD2; Sat, 5 May 2007 23:18:45 +0200 (CEST) Date: Sat, 5 May 2007 23:18:45 +0200 From: Emmanuel Florac To: Justin Piszcz Cc: linux-raid@vger.kernel.org, xfs@oss.sgi.com Subject: Re: Linux SW RAID: HW Raid Controller/JBOD vs. Multiple PCI-e Cards? 
Message-ID: <20070505231845.7b1cbdc5@galadriel.home> In-Reply-To: References: Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45LImfB023945 X-archive-position: 11296 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs

On Sat, 5 May 2007 12:33:49 -0400 (EDT), you wrote:

> However, if I want to upgrade to more than 12 disks, I am out of
> PCI-e slots, so I was wondering, does anyone on this list run a 16
> port Areca or 3ware card and use it for JBOD?

I don't use this setup in production, but I tried it with 8-port 3Ware cards. I didn't try the latest 9650 though.

> What kind of
> performance do you see when using mdadm with such a card?

Test system: 3GHz Supermicro P4D, 1 GB RAM, 3Ware 9550SX with 8x250GB 7200 RPM Seagate drives (8MB cache), RAID 0.

I tested XFS and reiserfs with 64K and 256K stripes, under Linux 2.6.15.1, with bonnie++ in "fast mode" (-f option). Use bon_csv2html to translate the output, or see the bonnie++ documentation. Roughly: 2G is the file size tested, then the numbers on the first line are write speed (KB/s), CPU usage (%), rewrite speed (overwrite), CPU usage, read speed, CPU usage. Then follow sequential and random seeks, reads, writes and deletes with their CPU usage. "+++++" means "no significant value".

# XFS, stripe 256k
storiq,2G,,,353088,69,76437,17,,,197376,16,410.8,0,16,11517,57,+++++,+++,10699,51,11502,59,+++++,+++,12158,61
storiq,2G,,,349166,71,75397,17,,,196057,16,433.3,0,16,12744,64,+++++,+++,12700,58,13008,67,+++++,+++,9890,51
storiq,2G,,,336683,68,72581,16,,,191254,18,419.9,0,16,12377,62,+++++,+++,10991,52,12947,67,+++++,+++,10580,52
storiq,2G,,,335646,65,77938,17,,,195350,17,397.4,0,16,14578,74,+++++,+++,11085,53,14377,74,+++++,+++,10852,54
storiq,2G,,,330022,67,73004,17,,,197846,18,412.3,0,16,12534,65,+++++,+++,10983,52,12161,63,+++++,+++,11752,61
storiq,2G,,,279454,55,75256,17,,,196065,18,412.7,0,16,13022,67,+++++,+++,10802,52,13759,72,+++++,+++,9800,47
storiq,2G,,,314606,61,74883,16,,,194131,16,401.2,0,16,11665,58,+++++,+++,10723,52,11880,61,+++++,+++,6659,33
storiq,2G,,,264382,53,72011,15,,,196690,18,411.5,0,16,10194,52,+++++,+++,12202,57,10367,52,+++++,+++,9175,45
storiq,2G,,,360252,72,75845,17,,,199721,18,432.7,0,16,12067,61,+++++,+++,11047,54,12156,62,+++++,+++,12372,60
storiq,2G,,,280746,57,74541,17,,,193562,19,414.0,0,16,12418,61,+++++,+++,11090,52,11135,57,+++++,+++,11309,55
storiq,2G,,,309464,61,79153,18,,,191533,17,419.5,0,16,12705,62,+++++,+++,11889,57,12027,61,+++++,+++,10960,54
storiq,2G,,,342122,67,68113,15,,,195572,16,413.5,0,16,13667,69,+++++,+++,10596,55,12731,66,+++++,+++,10766,54
storiq,2G,,,329945,63,72183,15,,,193082,18,421.8,0,16,12627,62,+++++,+++,9270,43,12455,63,+++++,+++,8878,44
storiq,2G,,,309570,63,69628,16,,,192415,19,413.1,0,16,13568,69,+++++,+++,10104,48,13512,70,+++++,+++,9261,45
storiq,2G,,,298528,58,70029,15,,,193531,17,399.5,0,16,13028,64,+++++,+++,9990,47,10098,52,+++++,+++,7544,38
storiq,2G,,,260341,52,66979,15,,,197199,18,393.1,0,16,10633,53,+++++,+++,9189,43,11159,56,+++++,+++,11696,58

# XFS, stripe 64k
storiq,2G,,,351241,70,90868,22,,,305222,29,408.7,0,16,8593,43,+++++,+++,6639,31,7555,39,+++++,+++,6639,33
storiq,2G,,,340145,67,83790,19,,,297148,28,401.4,0,16,9132,46,+++++,+++,6790,34,8881,45,+++++,+++,6305,31
storiq,2G,,,325791,65,81314,19,,,282439,26,395.5,0,16,9095,44,+++++,+++,6255,29,8173,42,+++++,+++,6194,31
storiq,2G,,,266009,53,83362,20,,,308438,26,407.7,0,16,8362,43,+++++,+++,6443,30,9264,47,+++++,+++,6339,33
storiq,2G,,,322776,65,76466,17,,,288001,26,399.7,0,16,8038,41,+++++,+++,5387,26,6389,34,+++++,+++,6545,31
storiq,2G,,,309007,60,77846,18,,,290613,29,392.8,0,16,7183,37,+++++,+++,6492,30,8270,41,+++++,+++,6813,35
storiq,2G,,,287662,58,72920,17,,,287911,26,398.4,0,16,8893,44,+++++,+++,7777,36,8150,41,+++++,+++,7717,39
storiq,2G,,,288149,56,75743,17,,,300949,29,386.2,0,16,9545,47,+++++,+++,7572,35,9115,46,+++++,+++,7211,36

# reiser, stripe 256k
storiq,2G,,,289179,98,102775,26,,,188307,22,444.0,0,16,27326,100,+++++,+++,21887,99,26726,99,+++++,+++,20633,98
storiq,2G,,,275847,93,101970,25,,,190551,21,450.2,0,16,27397,100,+++++,+++,21926,100,26609,100,+++++,+++,20895,99
storiq,2G,,,289414,99,105080,26,,,189022,22,423.9,0,16,27212,100,+++++,+++,21757,100,26651,99,+++++,+++,20863,100
storiq,2G,,,292746,99,103681,25,,,186303,21,431.5,0,16,27375,100,+++++,+++,21989,99,26251,99,+++++,+++,20924,99
storiq,2G,,,290222,99,104135,26,,,189656,22,449.7,0,16,27453,99,+++++,+++,21849,100,26757,99,+++++,+++,20845,99
storiq,2G,,,291716,99,103872,26,,,187410,23,437.0,0,16,27419,99,+++++,+++,22119,99,26516,100,+++++,+++,20934,100
storiq,2G,,,285545,99,101637,25,,,189788,21,422.1,0,16,27224,99,+++++,+++,21742,99,26500,99,+++++,+++,20922,100
storiq,2G,,,293042,98,100272,24,,,185631,22,453.8,0,16,27268,99,+++++,+++,21944,100,26777,100,+++++,+++,21042,99

# reiser, stripe 64k
storiq,2G,,,295569,99,112563,29,,,282178,32,434.5,0,16,27631,99,+++++,+++,22015,99,27021,100,+++++,+++,21028,99
storiq,2G,,,287830,98,112449,29,,,271047,33,425.1,0,16,27447,99,+++++,+++,21973,99,26810,99,+++++,+++,21008,100
storiq,2G,,,271668,95,114410,30,,,282419,33,438.7,0,16,27495,100,+++++,+++,22158,100,26707,100,+++++,+++,21106,100
storiq,2G,,,282535,99,118620,30,,,272089,33,425.0,0,16,27569,100,+++++,+++,22021,100,26778,100,+++++,+++,20629,98
storiq,2G,,,294392,98,119654,32,,,273269,32,429.7,0,16,27591,100,+++++,+++,21984,99,26786,100,+++++,+++,20994,99
storiq,2G,,,296652,99,118420,31,,,279586,33,425.5,0,16,15007,78,+++++,+++,21889,99,26998,99,+++++,+++,20952,100
storiq,2G,,,290551,98,124374,32,,,273852,32,424.0,0,16,27534,99,+++++,+++,21974,99,26746,100,+++++,+++,20786,99
storiq,2G,,,287033,99,100559,26,,,204845,24,390.9,0,16,27620,99,+++++,+++,21996,99,26811,100,+++++,+++,21009,100

Here are the tests I did on a similar system, but with 500GB drives, XFS only, and a 64KB stripe (the 3ware default). I compared software RAID 5 with hardware RAID 5 (3Ware 9550).

# software raid 5
storiq-5U,2G,,,155913,22,23390,4,,,84327,9,531.5,0,16,1323,3,+++++,+++,634,1,657,2,+++++,+++,903,3
storiq-5U,2G,,,168104,24,23964,4,,,81666,8,534.2,0,16,605,2,+++++,+++,608,2,770,2,+++++,+++,706,1
storiq-5U,2G,,,149516,21,22612,4,,,82111,9,571.3,0,16,606,2,+++++,+++,590,2,729,2,+++++,+++,450,1
storiq-5U,2G,,,141883,20,22966,4,,,78116,8,568.5,0,16,615,2,+++++,+++,553,2,684,2,+++++,+++,508,2

# hardware raid 5
storiq-1,2G,,,148500,29,43043,9,,,148808,14,442.3,0,16,5953,27,+++++,+++,4408,20,4994,24,+++++,+++,2399,11
storiq-1,2G,,,191440,37,38092,8,,,155494,15,420.9,0,16,3074,15,+++++,+++,3356,17,4246,21,+++++,+++,2513,12
storiq-1,2G,,,150460,29,40018,9,,,144936,14,386.9,0,16,4206,20,+++++,+++,2497,11,5182,26,+++++,+++,2440,11
storiq-1,2G,,,163132,34,34525,8,,,132131,13,369.7,0,16,6796,33,+++++,+++,10002,47,5475,28,+++++,+++,3652,17

As you can see, hardware RAID 5 doesn't write significantly faster, but read throughput and rewrite performance are much better, and seeks are an order of magnitude faster. That's why I use striped 3Ware hardware RAID 5 instead of software RAID 5 to build high-capacity systems.
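For anyone who wants to compare the software and hardware RAID 5 runs above without bon_csv2html, the rows can be averaged with a few lines of Python. This is only a sketch: the column positions (4 = write KB/s, 6 = rewrite KB/s, 10 = read KB/s, counting from 0) are taken from the field description in this message rather than from the bonnie++ documentation, and the function names are made up for illustration.

```python
from statistics import mean

# Assumed 0-based column positions, per the description in this message.
COLUMNS = {"write": 4, "rewrite": 6, "read": 10}


def parse_row(line):
    """Extract the throughput columns (KB/s) from one bonnie++ CSV row."""
    fields = line.strip().split(",")
    return {name: float(fields[idx]) for name, idx in COLUMNS.items()}


def summarize(lines):
    """Average each throughput column over several runs."""
    rows = [parse_row(line) for line in lines]
    return {name: mean(row[name] for row in rows) for name in COLUMNS}


software = [
    "storiq-5U,2G,,,155913,22,23390,4,,,84327,9,531.5,0,16,1323,3,+++++,+++,634,1,657,2,+++++,+++,903,3",
    "storiq-5U,2G,,,168104,24,23964,4,,,81666,8,534.2,0,16,605,2,+++++,+++,608,2,770,2,+++++,+++,706,1",
    "storiq-5U,2G,,,149516,21,22612,4,,,82111,9,571.3,0,16,606,2,+++++,+++,590,2,729,2,+++++,+++,450,1",
    "storiq-5U,2G,,,141883,20,22966,4,,,78116,8,568.5,0,16,615,2,+++++,+++,553,2,684,2,+++++,+++,508,2",
]
hardware = [
    "storiq-1,2G,,,148500,29,43043,9,,,148808,14,442.3,0,16,5953,27,+++++,+++,4408,20,4994,24,+++++,+++,2399,11",
    "storiq-1,2G,,,191440,37,38092,8,,,155494,15,420.9,0,16,3074,15,+++++,+++,3356,17,4246,21,+++++,+++,2513,12",
    "storiq-1,2G,,,150460,29,40018,9,,,144936,14,386.9,0,16,4206,20,+++++,+++,2497,11,5182,26,+++++,+++,2440,11",
    "storiq-1,2G,,,163132,34,34525,8,,,132131,13,369.7,0,16,6796,33,+++++,+++,10002,47,5475,28,+++++,+++,3652,17",
]

sw = summarize(software)
hw = summarize(hardware)
for name in COLUMNS:
    ratio = hw[name] / sw[name]
    print(f"{name}: software {sw[name]:.0f} KB/s, hardware {hw[name]:.0f} KB/s, ratio {ratio:.2f}")
```

Only the three block-throughput columns are read, so the "+++++" placeholders in the file-operation columns never need to be converted to numbers.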
-- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 14:23:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 14:23:05 -0700 (PDT) Received: from smtp107.sbc.mail.mud.yahoo.com (smtp107.sbc.mail.mud.yahoo.com [68.142.198.206]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l45LN2fB026375 for ; Sat, 5 May 2007 14:23:03 -0700 Received: (qmail 55252 invoked from network); 5 May 2007 20:56:18 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp107.sbc.mail.mud.yahoo.com with SMTP; 5 May 2007 20:56:18 -0000 X-YMail-OSG: B34Ic84VM1nv8UOP1HdKHiretfaGEYgoETcAHAVgODfGX.6akLlrLcUL7d6IS85Oo1moL3QNRw-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 624B21827261; Sat, 5 May 2007 13:56:17 -0700 (PDT) Date: Sat, 5 May 2007 13:56:17 -0700 From: Chris Wedgwood To: Emmanuel Florac Cc: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505205617.GB17112@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070504234357.24d22883@galadriel.home> X-archive-position: 11297 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Fri, May 04, 2007 at 11:43:57PM +0200, Emmanuel Florac wrote: > Unfortunately ext3 doesn't support volumes bigger than 8TB, so > that's useless to me. I plan to test jfs, however. Is jfs supported by anyone right now? 
From owner-xfs@oss.sgi.com Sat May 5 14:32:50 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 14:32:53 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45LWmfB030926 for ; Sat, 5 May 2007 14:32:49 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 5F5C4B02F5B2; Sat, 5 May 2007 17:32:47 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 5887A5000177; Sat, 5 May 2007 17:32:47 -0400 (EDT) Date: Sat, 5 May 2007 17:32:47 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Emmanuel Florac cc: linux-raid@vger.kernel.org, xfs@oss.sgi.com Subject: Re: Linux SW RAID: HW Raid Controller/JBOD vs. Multiple PCI-e Cards? In-Reply-To: <20070505231845.7b1cbdc5@galadriel.home> Message-ID: References: <20070505231845.7b1cbdc5@galadriel.home> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-1478584756-1178400767=:18820" X-archive-position: 11298 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463747160-1478584756-1178400767=:18820 Content-Type: TEXT/PLAIN; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE

On Sat, 5 May 2007, Emmanuel Florac wrote:

> On Sat, 5 May 2007 12:33:49 -0400 (EDT), you wrote:
>
>> However, if I want to upgrade to more than 12 disks, I am out of
>> PCI-e slots, so I was wondering, does anyone on this list run a 16
>> port Areca or 3ware card and use it for JBOD?
>
> I don't use this setup in production, but I tried it with 8 ports 3Ware
> cards.
> I didn't try the latest 9650 though.
>
>> What kind of
>> performance do you see when using mdadm with such a card?
>
> 3Ghz Supermicro P4D 1 GB RAM, 3Ware 9550SX with 8x250GB 8MB cache 7200
> RPM Seagate drives, raid 0
>
> Tested XFS and reiserfs, with 64 and 256K stripes.
>
> tested under Linux 2.6.15.1, with bonnie++ in "fast mode" (-f option).
> use bon_csv2html to translate, or see bonnie++ documentation, roughly :
> 2G is the file size tested, then numbers on the first line are : write
> speed (KB/s), CPU usage (%), rewrite speed (overwrite), cpu usage, read
> speed, cpu usage. Then follow sequential and random seeks, reads,
> writes and delete with their cpu usage. "+++++" means "no significant
> value".
>
> # XFS, stripe 256k
> storiq,2G,,,353088,69,76437,17,,,197376,16,410.8,0,16,11517,57,+++++,+++,10699,51,11502,59,+++++,+++,12158,61
> storiq,2G,,,349166,71,75397,17,,,196057,16,433.3,0,16,12744,64,+++++,+++,12700,58,13008,67,+++++,+++,9890,51
> storiq,2G,,,336683,68,72581,16,,,191254,18,419.9,0,16,12377,62,+++++,+++,10991,52,12947,67,+++++,+++,10580,52
> storiq,2G,,,335646,65,77938,17,,,195350,17,397.4,0,16,14578,74,+++++,+++,11085,53,14377,74,+++++,+++,10852,54
> storiq,2G,,,330022,67,73004,17,,,197846,18,412.3,0,16,12534,65,+++++,+++,10983,52,12161,63,+++++,+++,11752,61
> storiq,2G,,,279454,55,75256,17,,,196065,18,412.7,0,16,13022,67,+++++,+++,10802,52,13759,72,+++++,+++,9800,47
> storiq,2G,,,314606,61,74883,16,,,194131,16,401.2,0,16,11665,58,+++++,+++,10723,52,11880,61,+++++,+++,6659,33
> storiq,2G,,,264382,53,72011,15,,,196690,18,411.5,0,16,10194,52,+++++,+++,12202,57,10367,52,+++++,+++,9175,45
> storiq,2G,,,360252,72,75845,17,,,199721,18,432.7,0,16,12067,61,+++++,+++,11047,54,12156,62,+++++,+++,12372,60
> storiq,2G,,,280746,57,74541,17,,,193562,19,414.0,0,16,12418,61,+++++,+++,11090,52,11135,57,+++++,+++,11309,55
> storiq,2G,,,309464,61,79153,18,,,191533,17,419.5,0,16,12705,62,+++++,+++,11889,57,12027,61,+++++,+++,10960,54
> storiq,2G,,,342122,67,68113,15,,,195572,16,413.5,0,16,13667,69,+++++,+++,10596,55,12731,66,+++++,+++,10766,54
> storiq,2G,,,329945,63,72183,15,,,193082,18,421.8,0,16,12627,62,+++++,+++,9270,43,12455,63,+++++,+++,8878,44
> storiq,2G,,,309570,63,69628,16,,,192415,19,413.1,0,16,13568,69,+++++,+++,10104,48,13512,70,+++++,+++,9261,45
> storiq,2G,,,298528,58,70029,15,,,193531,17,399.5,0,16,13028,64,+++++,+++,9990,47,10098,52,+++++,+++,7544,38
> storiq,2G,,,260341,52,66979,15,,,197199,18,393.1,0,16,10633,53,+++++,+++,9189,43,11159,56,+++++,+++,11696,58
> # XFS, stripe 64k
> storiq,2G,,,351241,70,90868,22,,,305222,29,408.7,0,16,8593,43,+++++,+++,6639,31,7555,39,+++++,+++,6639,33
> storiq,2G,,,340145,67,83790,19,,,297148,28,401.4,0,16,9132,46,+++++,+++,6790,34,8881,45,+++++,+++,6305,31
> storiq,2G,,,325791,65,81314,19,,,282439,26,395.5,0,16,9095,44,+++++,+++,6255,29,8173,42,+++++,+++,6194,31
> storiq,2G,,,266009,53,83362,20,,,308438,26,407.7,0,16,8362,43,+++++,+++,6443,30,9264,47,+++++,+++,6339,33
> storiq,2G,,,322776,65,76466,17,,,288001,26,399.7,0,16,8038,41,+++++,+++,5387,26,6389,34,+++++,+++,6545,31
> storiq,2G,,,309007,60,77846,18,,,290613,29,392.8,0,16,7183,37,+++++,+++,6492,30,8270,41,+++++,+++,6813,35
> storiq,2G,,,287662,58,72920,17,,,287911,26,398.4,0,16,8893,44,+++++,+++,7777,36,8150,41,+++++,+++,7717,39
> storiq,2G,,,288149,56,75743,17,,,300949,29,386.2,0,16,9545,47,+++++,+++,7572,35,9115,46,+++++,+++,7211,36
> # reiser, stripe 256k
> storiq,2G,,,289179,98,102775,26,,,188307,22,444.0,0,16,27326,100,+++++,+++,21887,99,26726,99,+++++,+++,20633,98
> storiq,2G,,,275847,93,101970,25,,,190551,21,450.2,0,16,27397,100,+++++,+++,21926,100,26609,100,+++++,+++,20895,99
> storiq,2G,,,289414,99,105080,26,,,189022,22,423.9,0,16,27212,100,+++++,+++,21757,100,26651,99,+++++,+++,20863,100
> storiq,2G,,,292746,99,103681,25,,,186303,21,431.5,0,16,27375,100,+++++,+++,21989,99,26251,99,+++++,+++,20924,99
>
storiq,2G,,,290222,99,104135,26,,,189656,22,449.7,0,16,27453,99,+++++,+++= ,21849,100,26757,99,+++++,+++,20845,99 > storiq,2G,,,291716,99,103872,26,,,187410,23,437.0,0,16,27419,99,+++++,+++= ,22119,99,26516,100,+++++,+++,20934,100 > storiq,2G,,,285545,99,101637,25,,,189788,21,422.1,0,16,27224,99,+++++,+++= ,21742,99,26500,99,+++++,+++,20922,100 > storiq,2G,,,293042,98,100272,24,,,185631,22,453.8,0,16,27268,99,+++++,+++= ,21944,100,26777,100,+++++,+++,21042,99 > # reiser stripe 64k > storiq,2G,,,295569,99,112563,29,,,282178,32,434.5,0,16,27631,99,+++++,+++= ,22015,99,27021,100,+++++,+++,21028,99 > storiq,2G,,,287830,98,112449,29,,,271047,33,425.1,0,16,27447,99,+++++,+++= ,21973,99,26810,99,+++++,+++,21008,100 > storiq,2G,,,271668,95,114410,30,,,282419,33,438.7,0,16,27495,100,+++++,++= +,22158,100,26707,100,+++++,+++,21106,100 > storiq,2G,,,282535,99,118620,30,,,272089,33,425.0,0,16,27569,100,+++++,++= +,22021,100,26778,100,+++++,+++,20629,98 > storiq,2G,,,294392,98,119654,32,,,273269,32,429.7,0,16,27591,100,+++++,++= +,21984,99,26786,100,+++++,+++,20994,99 > storiq,2G,,,296652,99,118420,31,,,279586,33,425.5,0,16,15007,78,+++++,+++= ,21889,99,26998,99,+++++,+++,20952,100 > storiq,2G,,,290551,98,124374,32,,,273852,32,424.0,0,16,27534,99,+++++,+++= ,21974,99,26746,100,+++++,+++,20786,99 > storiq,2G,,,287033,99,100559,26,,,204845,24,390.9,0,16,27620,99,+++++,+++= ,21996,99,26811,100,+++++,+++,21009,100 > > Here are the tests I did with a similar system, but with 500GB drives, > XFS only, 64KB stripe (3ware default).I tested RAID 5 software RAID > compared to RAID-5 hardware (3Ware 9550). 
> > # software raid 5 > storiq-5U,2G,,,155913,22,23390,4,,,84327,9,531.5,0,16,1323,3,+++++,+++,63= 4,1,657,2,+++++,+++,903,3 > storiq-5U,2G,,,168104,24,23964,4,,,81666,8,534.2,0,16,605,2,+++++,+++,608= ,2,770,2,+++++,+++,706,1 > storiq-5U,2G,,,149516,21,22612,4,,,82111,9,571.3,0,16,606,2,+++++,+++,590= ,2,729,2,+++++,+++,450,1 > storiq-5U,2G,,,141883,20,22966,4,,,78116,8,568.5,0,16,615,2,+++++,+++,553= ,2,684,2,+++++,+++,508,2 > # hardware raid 5 > storiq-1,2G,,,148500,29,43043,9,,,148808,14,442.3,0,16,5953,27,+++++,+++,= 4408,20,4994,24,+++++,+++,2399,11 > storiq-1,2G,,,191440,37,38092,8,,,155494,15,420.9,0,16,3074,15,+++++,+++,= 3356,17,4246,21,+++++,+++,2513,12 > storiq-1,2G,,,150460,29,40018,9,,,144936,14,386.9,0,16,4206,20,+++++,+++,= 2497,11,5182,26,+++++,+++,2440,11 > storiq-1,2G,,,163132,34,34525,8,,,132131,13,369.7,0,16,6796,33,+++++,+++,= 10002,47,5475,28,+++++,+++,3652,17 > > As you can see, hardware RAID-5 doesn't perform significantly faster > at writing, but read thruput and rewrite performance is way better, and > seeks are an order of magnitude faster. That's why I use striped 3Ware > hardware RAID-5 to build high capacity systems instead of software RAID > 5. > > --=20 > -------------------------------------------------- > Emmanuel Florac www.intellique.com > -------------------------------------------------- > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Wow, very impressive benchmarks, thank you very much for this. 
Justin.= ---1463747160-1478584756-1178400767=:18820-- From owner-xfs@oss.sgi.com Sat May 5 15:12:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 15:13:00 -0700 (PDT) Received: from smtp113.sbc.mail.mud.yahoo.com (smtp113.sbc.mail.mud.yahoo.com [68.142.198.212]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l45MCrfB009680 for ; Sat, 5 May 2007 15:12:54 -0700 Received: (qmail 80976 invoked from network); 5 May 2007 22:12:52 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp113.sbc.mail.mud.yahoo.com with SMTP; 5 May 2007 22:12:52 -0000 X-YMail-OSG: kYmy1WoVM1lLSqHi8kH_YIg9mAqfQe1Fv.gpSI1oIdhpFszmQ05A3stLHe0TtQ_9tudApt0ekA-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 2A6251827261; Sat, 5 May 2007 15:12:50 -0700 (PDT) Date: Sat, 5 May 2007 15:12:50 -0700 From: Chris Wedgwood To: Emmanuel Florac Cc: Eric Sandeen , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505221249.GA21960@tuatara.stupidest.org> References: <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> <463CB5C4.7040803@sandeen.net> <20070505203525.GA16477@tuatara.stupidest.org> <20070505225819.0dd3c0fa@galadriel.home> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070505225819.0dd3c0fa@galadriel.home> X-archive-position: 11299 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Sat, May 05, 2007 at 10:58:19PM +0200, Emmanuel Florac wrote: > Well I prefer staying away from the very latest bleeding edge, so I > stick to 2.6.20.11 for now. 
diff --git a/arch/i386/Kconfig.debug b/arch/i386/Kconfig.debug index f68cc6f..908b755 100644 --- a/arch/i386/Kconfig.debug +++ b/arch/i386/Kconfig.debug @@ -56,15 +56,22 @@ config DEBUG_RODATA portion of the kernel code won't be covered by a 2MB TLB anymore. If in doubt, say "N". -config 4KSTACKS +config I386_4KSTACKS bool "Use 4Kb for kernel stacks instead of 8Kb" depends on DEBUG_KERNEL help If you say Y here the kernel will use a 4Kb stacksize for the kernel stack attached to each process/thread. This facilitates running more threads on a system and also reduces the pressure - on the VM subsystem for higher order allocations. This option - will also use IRQ stacks to compensate for the reduced stackspace. + on the VM subsystem for higher order allocations. + +config I386_IRQSTACKS + bool "Allocate separate IRQ stacks" + depends on DEBUG_KERNEL + default y + help + If you say Y here the kernel will allocate and use separate + stacks for interrupts. config X86_FIND_SMP_CONFIG bool diff --git a/arch/i386/kernel/irq.c b/arch/i386/kernel/irq.c index 3201d42..0da8251 100644 --- a/arch/i386/kernel/irq.c +++ b/arch/i386/kernel/irq.c @@ -33,7 +33,7 @@ void ack_bad_irq(unsigned int irq) } #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS /* * per-CPU IRQ handling contexts (thread information and stack) */ @@ -44,7 +44,7 @@ union irq_ctx { static union irq_ctx *hardirq_ctx[NR_CPUS] __read_mostly; static union irq_ctx *softirq_ctx[NR_CPUS] __read_mostly; -#endif +#endif /* CONFIG_I386_IRQSTACKS */ /* * do_IRQ handles all normal device IRQ's (the special @@ -57,7 +57,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) /* high bit used in ret_from_ code */ int irq = ~regs->orig_eax; struct irq_desc *desc = irq_desc + irq; -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS union irq_ctx *curctx, *irqctx; u32 *isp; #endif @@ -85,7 +85,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) } #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS curctx = 
(union irq_ctx *) current_thread_info(); irqctx = hardirq_ctx[smp_processor_id()]; @@ -122,7 +122,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) : "memory", "cc" ); } else -#endif +#endif /* CONFIG_I386_IRQSTACKS */ desc->handle_irq(irq, desc); irq_exit(); @@ -130,7 +130,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) return 1; } -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS /* * These should really be __section__(".bss.page_aligned") as well, but @@ -220,7 +220,7 @@ asmlinkage void do_softirq(void) } EXPORT_SYMBOL(do_softirq); -#endif +#endif /* CONFIG_I386_IRQSTACKS */ /* * Interrupt statistics: diff --git a/include/asm-i386/irq.h b/include/asm-i386/irq.h index 11761cd..7db95e1 100644 --- a/include/asm-i386/irq.h +++ b/include/asm-i386/irq.h @@ -24,14 +24,14 @@ static __inline__ int irq_canonicalize(int irq) # define ARCH_HAS_NMI_WATCHDOG /* See include/linux/nmi.h */ #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS extern void irq_ctx_init(int cpu); extern void irq_ctx_exit(int cpu); # define __ARCH_HAS_DO_SOFTIRQ -#else +#else /* !CONFIG_I386_IRQSTACKS */ # define irq_ctx_init(cpu) do { } while (0) # define irq_ctx_exit(cpu) do { } while (0) -#endif +#endif /* CONFIG_I386_IRQSTACKS */ #ifdef CONFIG_IRQBALANCE extern int irqbalance_disable(char *str); diff --git a/include/asm-i386/module.h b/include/asm-i386/module.h index 02f8f54..7d5d2df 100644 --- a/include/asm-i386/module.h +++ b/include/asm-i386/module.h @@ -62,11 +62,11 @@ struct mod_arch_specific #error unknown processor family #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_4KSTACKS #define MODULE_STACKSIZE "4KSTACKS " -#else +#else /* not using CONFIG_I386_4KSTACKS */ #define MODULE_STACKSIZE "" -#endif +#endif /* CONFIG_I386_4KSTACKS */ #define MODULE_ARCH_VERMAGIC MODULE_PROC_FAMILY MODULE_STACKSIZE diff --git a/include/asm-i386/thread_info.h b/include/asm-i386/thread_info.h index 4b187bb..f5268e0 100644 --- a/include/asm-i386/thread_info.h +++ 
b/include/asm-i386/thread_info.h @@ -53,7 +53,7 @@ struct thread_info { #endif #define PREEMPT_ACTIVE 0x10000000 -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_4KSTACKS #define THREAD_SIZE (4096) #else #define THREAD_SIZE (8192) From owner-xfs@oss.sgi.com Sun May 6 10:21:02 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 10:21:05 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l46HL1fB028324 for ; Sun, 6 May 2007 10:21:02 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id CCD9D18762; Sun, 6 May 2007 19:21:00 +0200 (CEST) Date: Sun, 6 May 2007 19:21:04 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070506192104.3becdd81@galadriel.home> In-Reply-To: <20070505210002.GC17112@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> <20070505171820.6e92d437@galadriel.home> <20070505210002.GC17112@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l46HL2fB028330 X-archive-position: 11301 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Sat, 5 May 2007 14:00:02 -0700 you wrote: > A 50TB filesystem might suck horribly on a 32-bit platform. I'm not > sure there is *ANY* way you could fsck that, should you need to, in some > cases. 
> > Is that what you're planning to do? Nope, I'll use an x86_64 system running an x86_64 kernel :) -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sun May 6 10:21:44 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 10:21:46 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l46HLhfB028657 for ; Sun, 6 May 2007 10:21:44 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id CC0CF18718; Sun, 6 May 2007 19:21:42 +0200 (CEST) Date: Sun, 6 May 2007 19:21:46 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: Eric Sandeen , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070506192146.7f03cd4e@galadriel.home> In-Reply-To: <20070505221249.GA21960@tuatara.stupidest.org> References: <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> <463CB5C4.7040803@sandeen.net> <20070505203525.GA16477@tuatara.stupidest.org> <20070505225819.0dd3c0fa@galadriel.home> <20070505221249.GA21960@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l46HLifB028674 X-archive-position: 11302 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Sat, 5 May 2007 15:12:50 -0700 you wrote: > diff --git 
a/arch/i386/Kconfig.debug b/arch/i386/Kconfig.debug Thanks! -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sun May 6 10:19:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 10:19:53 -0700 (PDT) Received: from postfix1-g20.free.fr (postfix1-g20.free.fr [212.27.60.42]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l46HJmfB027935 for ; Sun, 6 May 2007 10:19:49 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix1-g20.free.fr (Postfix) with ESMTP id 582C9F34029 for ; Sun, 6 May 2007 19:19:47 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id E2A6818206; Sun, 6 May 2007 19:19:43 +0200 (CEST) Date: Sun, 6 May 2007 19:19:47 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070506191947.75a2058a@galadriel.home> In-Reply-To: <20070505205617.GB17112@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <20070505205617.GB17112@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l46HJofB027943 X-archive-position: 11300 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Sat, 5 May 2007 13:56:17 -0700 you wrote: > Is jfs supported by anyone right now? 
Huh, IBM I hope :) -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sun May 6 10:26:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 10:26:14 -0700 (PDT) Received: from smtp114.sbc.mail.mud.yahoo.com (smtp114.sbc.mail.mud.yahoo.com [68.142.198.213]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l46HQ8fB030347 for ; Sun, 6 May 2007 10:26:09 -0700 Received: (qmail 92936 invoked from network); 6 May 2007 17:26:08 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp114.sbc.mail.mud.yahoo.com with SMTP; 6 May 2007 17:26:08 -0000 X-YMail-OSG: JpyOiz0VM1mldeiku.Hr8o32aTLyos4dDOQSFemrA1zdTVKzh2MZehYlzOHEUP1wl41_FWvOGg-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id DA4271827261; Sun, 6 May 2007 10:26:06 -0700 (PDT) Date: Sun, 6 May 2007 10:26:06 -0700 From: Chris Wedgwood To: Emmanuel Florac Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070506172606.GB4823@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> <20070505171820.6e92d437@galadriel.home> <20070505210002.GC17112@tuatara.stupidest.org> <20070506192104.3becdd81@galadriel.home> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070506192104.3becdd81@galadriel.home> X-archive-position: 11303 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Sun, May 06, 2007 at 07:21:04PM +0200, Emmanuel Florac wrote: > Nope, I'll use an x86_64 system running an x86_64 kernel :) How much RAM? 
I think you'll want 10s of GBs possibly (well, it depends very much on what you're storing but you can fit a lot of small files in 150TB...) From owner-xfs@oss.sgi.com Sun May 6 10:56:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 10:56:08 -0700 (PDT) Received: from mail.lichtvoll.de (mondschein.lichtvoll.de [194.150.191.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l46Hu4fB005398 for ; Sun, 6 May 2007 10:56:05 -0700 Received: from localhost (dslb-084-057-122-104.pools.arcor-ip.net [84.57.122.104]) by mail.lichtvoll.de (Postfix) with ESMTP id 3A3FF5AD40 for ; Sun, 6 May 2007 19:56:03 +0200 (CEST) From: Martin Steigerwald To: linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Date: Sun, 6 May 2007 19:56:02 +0200 User-Agent: KMail/1.9.6 References: <20070503164521.16efe075@harpe.intellique.com> <20070504234357.24d22883@galadriel.home> <20070505205617.GB17112@tuatara.stupidest.org> (sfid-20070506_174955_742323_AFBCDD13) In-Reply-To: <20070505205617.GB17112@tuatara.stupidest.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705061956.02375.Martin@lichtvoll.de> X-archive-position: 11304 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: Martin@lichtvoll.de Precedence: bulk X-list: xfs On Saturday 05 May 2007, Chris Wedgwood wrote: > On Fri, May 04, 2007 at 11:43:57PM +0200, Emmanuel Florac wrote: > > Unfortunately ext3 doesn't support volumes bigger than 8TB, so > > that's useless to me. I plan to test jfs, however. > > Is jfs supported by anyone right now? David 'Dave' Kleikamp was still taking care of JFS when I asked him some questions about write barrier support back in July 2007. He concentrated on bug fixes though, not on new features. 
Regards, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 From owner-xfs@oss.sgi.com Sun May 6 11:37:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 11:37:16 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l46IbCfB014325 for ; Sun, 6 May 2007 11:37:13 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 88D3D18737; Sun, 6 May 2007 20:37:11 +0200 (CEST) Date: Sun, 6 May 2007 20:36:49 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070506203649.1c4d9d14@galadriel.home> In-Reply-To: <20070506172606.GB4823@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> <20070505171820.6e92d437@galadriel.home> <20070505210002.GC17112@tuatara.stupidest.org> <20070506192104.3becdd81@galadriel.home> <20070506172606.GB4823@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l46IbDfB014343 X-archive-position: 11305 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Sun, 6 May 2007 10:26:06 -0700 you wrote: > How much RAM? I think you'll want 10s of GBs possibly (well, it > depends very much on what you're storing but you can fit a lot of > small files in 150TB...) 
It will be video storage, mainly big to huge files. But I'll remember to put in as much RAM as I can :) -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sun May 6 18:38:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 18:38:15 -0700 (PDT) Received: from tyo201.gate.nec.co.jp (TYO201.gate.nec.co.jp [202.32.8.193]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l471c5fB018505 for ; Sun, 6 May 2007 18:38:08 -0700 Received: from mailgate3.nec.co.jp (mailgate54.nec.co.jp [10.7.69.197]) by tyo201.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l471c2sH008659 for ; Mon, 7 May 2007 10:38:02 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id l471c2s11594 for xfs@oss.sgi.com; Mon, 7 May 2007 10:38:02 +0900 (JST) Received: from secsv3.tnes.nec.co.jp (tnesvc2.tnes.nec.co.jp [10.1.101.15]) by mailsv.nec.co.jp (8.11.7/3.7W-MAILSV-NEC) with ESMTP id l471c2O04063 for ; Mon, 7 May 2007 10:38:02 +0900 (JST) Received: from tnesvc2.tnes.nec.co.jp ([10.1.101.15]) by secsv3.tnes.nec.co.jp (ExpressMail 5.10) with SMTP id 20070507.092351.98402312 for ; Mon, 7 May 2007 09:23:52 +0900 Received: FROM tnessv1.tnes.nec.co.jp BY tnesvc2.tnes.nec.co.jp ; Mon May 07 09:23:51 2007 +0900 Received: from rifu.bsd.tnes.nec.co.jp (rifu.bsd.tnes.nec.co.jp [10.1.104.1]) by tnessv1.tnes.nec.co.jp (Postfix) with ESMTP id 36BF6AE4B3; Mon, 7 May 2007 10:38:01 +0900 (JST) Received: from TNESG9305.tnes.nec.co.jp (TNESG9305.bsd.tnes.nec.co.jp [10.1.104.199]) by rifu.bsd.tnes.nec.co.jp (8.12.11/3.7W/BSD-TNES-MX01) with SMTP id l471c1ok001475; Mon, 7 May 2007 10:38:01 +0900 Message-Id: <200705070137.AA05294@TNESG9305.tnes.nec.co.jp> Date: Mon, 07 May 2007 10:37:56 +0900 To: xfs@oss.sgi.com Cc: tes@sgi.com Subject: [PATCH] Fix disable, enable, off and remove commands in xfs_quota. 
From: Utako Kusaka MIME-Version: 1.0 X-Mailer: AL-Mail32 Version 1.13 Content-Type: multipart/mixed; boundary="--------------------0751065352324900" X-archive-position: 11306 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: utako@tnes.nec.co.jp Precedence: bulk X-list: xfs This is a multipart message. ----------------------0751065352324900 Content-Type: text/plain; charset=iso-2022-jp

Hi,

I sent this mail 10 days ago but it got lost...

The disable, enable, off and remove commands in xfs_quota don't work, because:

1) The argument type passed to quotactl() is wrong. "addr" is a fs_quota_stat_t structure in the original code, but it should be an unsigned int, as shown in the man page. (disable, enable, off and remove)
2) The wrong flag is used for the -ugp option check. (disable, enable, off and remove)
3) The accounting flag (XFS_QUOTA_*DQ_ACCT) is incorrectly used for disabling quota enforcement. (disable)
4) The accounting and enforcement flags are incorrectly used for removing space. (remove)
5) Quota types must be passed to quotactl() one at a time, but multiple quota types are passed to quotactl() when the -ug or -up option is specified. (remove)

The attached patch fixes these problems.
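The "one quota type per call" requirement in point 5 can be sketched as a small loop. This is only an illustration under stated assumptions, not the attached patch: the DEMO_* constants, remove_each_type() and the demo_record() callback are hypothetical names chosen so the example runs without root privileges or an XFS mount. A real tool would issue its quotactl() request at the point where the callback is invoked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the per-type quota bits (not the real XFS values). */
#define DEMO_USER_QUOTA  (1u << 0)
#define DEMO_PROJ_QUOTA  (1u << 1)
#define DEMO_GROUP_QUOTA (1u << 2)

/* Callback that removes quota space for exactly one type; returns < 0 on error. */
typedef int (*quota_rm_fn)(const char *dir, unsigned int type);

/*
 * Split a mask of requested quota types and invoke rm() once per type,
 * never with several bits set at once.
 * Returns the number of calls made, or -1 as soon as one call fails.
 */
static int remove_each_type(const char *dir, unsigned int types, quota_rm_fn rm)
{
	static const unsigned int all_types[] = {
		DEMO_USER_QUOTA, DEMO_GROUP_QUOTA, DEMO_PROJ_QUOTA
	};
	int calls = 0;
	size_t i;

	assert(dir != NULL && rm != NULL);

	for (i = 0; i < sizeof(all_types) / sizeof(all_types[0]); i++) {
		if (!(types & all_types[i]))
			continue;		/* type not requested */
		if (rm(dir, all_types[i]) < 0)
			return -1;		/* stop at the first failure */
		calls++;
	}
	return calls;
}

/* Demo callback: records each type it is asked to remove instead of doing I/O. */
static unsigned int demo_seen[3];
static int demo_count;

static int demo_record(const char *dir, unsigned int type)
{
	(void)dir;
	if (demo_count >= 3)
		return -1;
	demo_seen[demo_count++] = type;
	return 0;
}
```

Calling remove_each_type("/mnt", DEMO_USER_QUOTA | DEMO_GROUP_QUOTA, demo_record) invokes the callback twice, once per type, which is the behaviour the -ug case needs.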
Signed-off-by: Utako Kusaka --- ----------------------0751065352324900 Content-Type: application/octet-stream; name="state.diff" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="state.diff" LS0tIHhmc3Byb2dzLTIuOC4yMC9xdW90YS9zdGF0ZS5vcmlnCTIwMDctMDQt MTkgMTM6MDc6MzguMDAwMDAwMDAwICswOTAwCisrKyB4ZnNwcm9ncy0yLjgu MjAvcXVvdGEvc3RhdGUuYwkyMDA3LTA0LTI2IDExOjQ2OjQ2LjAwMDAwMDAw MCArMDkwMApAQCAtMjUwLDEwICsyNTAsNiBAQCBlbmFibGVfZW5mb3JjZW1l bnQoCiAJdWludAkJZmxhZ3MpCiB7CiAJZnNfcGF0aF90CSptb3VudDsKLQlm c19xdW90YV9zdGF0X3QJcXN0YXQgPSB7IDAgfTsKLQotCXFzdGF0LnFzX3Zl cnNpb24gPSBGU19RU1RBVF9WRVJTSU9OOwotCXFzdGF0LnFzX2ZsYWdzID0g cWZsYWdzOwogCiAJbW91bnQgPSBmc190YWJsZV9sb29rdXAoZGlyLCBGU19N T1VOVF9QT0lOVCk7CiAJaWYgKCFtb3VudCkgewpAQCAtMjYxLDcgKzI1Nyw3 IEBAIGVuYWJsZV9lbmZvcmNlbWVudCgKIAkJcmV0dXJuOwogCX0KIAlkaXIg PSBtb3VudC0+ZnNfbmFtZTsKLQlpZiAoeGZzcXVvdGFjdGwoWEZTX1FVT1RB T04sIGRpciwgdHlwZSwgMCwgKHZvaWQgKikmcXN0YXQpIDwgMCkKKwlpZiAo eGZzcXVvdGFjdGwoWEZTX1FVT1RBT04sIGRpciwgdHlwZSwgMCwgKHZvaWQg KikmcWZsYWdzKSA8IDApCiAJCXBlcnJvcigiWEZTX1FVT1RBT04iKTsKIAll bHNlIGlmIChmbGFncyAmIFZFUkJPU0VfRkxBRykKIAkJc3RhdGVfcXVvdGFm aWxlX21vdW50KHN0ZG91dCwgdHlwZSwgbW91bnQsIGZsYWdzKTsKQEAgLTI3 NSwxMCArMjcxLDYgQEAgZGlzYWJsZV9lbmZvcmNlbWVudCgKIAl1aW50CQlm bGFncykKIHsKIAlmc19wYXRoX3QJKm1vdW50OwotCWZzX3F1b3RhX3N0YXRf dAlxc3RhdCA9IHsgMCB9OwotCi0JcXN0YXQucXNfdmVyc2lvbiA9IEZTX1FT VEFUX1ZFUlNJT047Ci0JcXN0YXQucXNfZmxhZ3MgPSBxZmxhZ3M7CiAKIAlt b3VudCA9IGZzX3RhYmxlX2xvb2t1cChkaXIsIEZTX01PVU5UX1BPSU5UKTsK IAlpZiAoIW1vdW50KSB7CkBAIC0yODYsNyArMjc4LDcgQEAgZGlzYWJsZV9l bmZvcmNlbWVudCgKIAkJcmV0dXJuOwogCX0KIAlkaXIgPSBtb3VudC0+ZnNf bmFtZTsKLQlpZiAoeGZzcXVvdGFjdGwoWEZTX1FVT1RBT0ZGLCBkaXIsIHR5 cGUsIDAsICh2b2lkICopJnFzdGF0KSA8IDApCisJaWYgKHhmc3F1b3RhY3Rs KFhGU19RVU9UQU9GRiwgZGlyLCB0eXBlLCAwLCAodm9pZCAqKSZxZmxhZ3Mp IDwgMCkKIAkJcGVycm9yKCJYRlNfUVVPVEFPRkYiKTsKIAllbHNlIGlmIChm bGFncyAmIFZFUkJPU0VfRkxBRykKIAkJc3RhdGVfcXVvdGFmaWxlX21vdW50 KHN0ZG91dCwgdHlwZSwgbW91bnQsIGZsYWdzKTsKQEAgLTMwMCwxMCArMjky 
LDYgQEAgcXVvdGFvZmYoCiAJdWludAkJZmxhZ3MpCiB7CiAJZnNfcGF0aF90 CSptb3VudDsKLQlmc19xdW90YV9zdGF0X3QJcXN0YXQgPSB7IDAgfTsKLQot CXFzdGF0LnFzX3ZlcnNpb24gPSBGU19RU1RBVF9WRVJTSU9OOwotCXFzdGF0 LnFzX2ZsYWdzID0gcWZsYWdzOwogCiAJbW91bnQgPSBmc190YWJsZV9sb29r dXAoZGlyLCBGU19NT1VOVF9QT0lOVCk7CiAJaWYgKCFtb3VudCkgewpAQCAt MzExLDI0ICsyOTksMzEgQEAgcXVvdGFvZmYoCiAJCXJldHVybjsKIAl9CiAJ ZGlyID0gbW91bnQtPmZzX25hbWU7Ci0JaWYgKHhmc3F1b3RhY3RsKFhGU19R VU9UQU9GRiwgZGlyLCB0eXBlLCAwLCAodm9pZCAqKSZxc3RhdCkgPCAwKQor CWlmICh4ZnNxdW90YWN0bChYRlNfUVVPVEFPRkYsIGRpciwgdHlwZSwgMCwg KHZvaWQgKikmcWZsYWdzKSA8IDApCiAJCXBlcnJvcigiWEZTX1FVT1RBT0ZG Iik7CiAJZWxzZSBpZiAoZmxhZ3MgJiBWRVJCT1NFX0ZMQUcpCiAJCXN0YXRl X3F1b3RhZmlsZV9tb3VudChzdGRvdXQsIHR5cGUsIG1vdW50LCBmbGFncyk7 CiB9CiAKK3N0YXRpYyBpbnQKK3JlbW92ZV9xdHlwZV9leHRlbnRzKAorCWNo YXIJCSpkaXIsCisJdWludAkJdHlwZSkKK3sKKwlpbnQJZXJyb3IgPSAwOwor CisJaWYgKChlcnJvciA9IHhmc3F1b3RhY3RsKFhGU19RVU9UQVJNLCBkaXIs IHR5cGUsIDAsICh2b2lkICopJnR5cGUpKSA8IDApCisJCXBlcnJvcigiWEZT X1FVT1RBUk0iKTsKKwlyZXR1cm4gZXJyb3I7Cit9CisKIHN0YXRpYyB2b2lk CiByZW1vdmVfZXh0ZW50cygKIAljaGFyCQkqZGlyLAogCXVpbnQJCXR5cGUs Ci0JdWludAkJcWZsYWdzLAogCXVpbnQJCWZsYWdzKQogewogCWZzX3BhdGhf dAkqbW91bnQ7Ci0JZnNfcXVvdGFfc3RhdF90CXFzdGF0ID0geyAwIH07Ci0K LQlxc3RhdC5xc192ZXJzaW9uID0gRlNfUVNUQVRfVkVSU0lPTjsKLQlxc3Rh dC5xc19mbGFncyA9IHFmbGFnczsKIAogCW1vdW50ID0gZnNfdGFibGVfbG9v a3VwKGRpciwgRlNfTU9VTlRfUE9JTlQpOwogCWlmICghbW91bnQpIHsKQEAg LTMzNiw5ICszMzEsMTggQEAgcmVtb3ZlX2V4dGVudHMoCiAJCXJldHVybjsK IAl9CiAJZGlyID0gbW91bnQtPmZzX25hbWU7Ci0JaWYgKHhmc3F1b3RhY3Rs KFhGU19RVU9UQVJNLCBkaXIsIHR5cGUsIDAsICh2b2lkICopJnFzdGF0KSA8 IDApCi0JCXBlcnJvcigiWEZTX1FVT1RBUk0iKTsKLQllbHNlIGlmIChmbGFn cyAmIFZFUkJPU0VfRkxBRykKKwlpZiAodHlwZSAmIFhGU19VU0VSX1FVT1RB KSB7CisJCWlmIChyZW1vdmVfcXR5cGVfZXh0ZW50cyhkaXIsIFhGU19VU0VS X1FVT1RBKSA8IDApIAorCQkJcmV0dXJuOworCX0KKwlpZiAodHlwZSAmIFhG U19HUk9VUF9RVU9UQSkgeworCQlpZiAocmVtb3ZlX3F0eXBlX2V4dGVudHMo ZGlyLCBYRlNfR1JPVVBfUVVPVEEpIDwgMCkgCisJCQlyZXR1cm47CisJfSBl 
bHNlIGlmICh0eXBlICYgWEZTX1BST0pfUVVPVEEpIHsKKwkJaWYgKHJlbW92 ZV9xdHlwZV9leHRlbnRzKGRpciwgWEZTX1BST0pfUVVPVEEpIDwgMCkgCisJ CQlyZXR1cm47CisJfQorCWlmIChmbGFncyAmIFZFUkJPU0VfRkxBRykKIAkJ c3RhdGVfcXVvdGFmaWxlX21vdW50KHN0ZG91dCwgdHlwZSwgbW91bnQsIGZs YWdzKTsKIH0KIApAQCAtMzc0LDcgKzM3OCw3IEBAIGVuYWJsZV9mKAogCWlm IChhcmdjICE9IG9wdGluZCkKIAkJcmV0dXJuIGNvbW1hbmRfdXNhZ2UoJmVu YWJsZV9jbWQpOwogCi0JaWYgKCFmbGFncykgeworCWlmICghdHlwZSkgewog CQl0eXBlIHw9IFhGU19VU0VSX1FVT1RBOwogCQlxZmxhZ3MgfD0gWEZTX1FV T1RBX1VEUV9BQ0NUIHwgWEZTX1FVT1RBX1VEUV9FTkZEOwogCX0KQEAgLTM5 NSwxNSArMzk5LDE1IEBAIGRpc2FibGVfZigKIAkJc3dpdGNoIChjKSB7CiAJ CWNhc2UgJ2cnOgogCQkJdHlwZSB8PSBYRlNfR1JPVVBfUVVPVEE7Ci0JCQlx ZmxhZ3MgfD0gWEZTX1FVT1RBX0dEUV9BQ0NUOworCQkJcWZsYWdzIHw9IFhG U19RVU9UQV9HRFFfRU5GRDsKIAkJCWJyZWFrOwogCQljYXNlICdwJzoKIAkJ CXR5cGUgfD0gWEZTX1BST0pfUVVPVEE7Ci0JCQlxZmxhZ3MgfD0gWEZTX1FV T1RBX1BEUV9BQ0NUOworCQkJcWZsYWdzIHw9IFhGU19RVU9UQV9QRFFfRU5G RDsKIAkJCWJyZWFrOwogCQljYXNlICd1JzoKIAkJCXR5cGUgfD0gWEZTX1VT RVJfUVVPVEE7Ci0JCQlxZmxhZ3MgfD0gWEZTX1FVT1RBX1VEUV9BQ0NUOwor CQkJcWZsYWdzIHw9IFhGU19RVU9UQV9VRFFfRU5GRDsKIAkJCWJyZWFrOwog CQljYXNlICd2JzoKIAkJCWZsYWdzIHw9IFZFUkJPU0VfRkxBRzsKQEAgLTQx Niw5ICs0MjAsOSBAQCBkaXNhYmxlX2YoCiAJaWYgKGFyZ2MgIT0gb3B0aW5k KQogCQlyZXR1cm4gY29tbWFuZF91c2FnZSgmZGlzYWJsZV9jbWQpOwogCi0J aWYgKCFmbGFncykgeworCWlmICghdHlwZSkgewogCQl0eXBlIHw9IFhGU19V U0VSX1FVT1RBOwotCQlxZmxhZ3MgfD0gWEZTX1FVT1RBX1VEUV9BQ0NUOwor CQlxZmxhZ3MgfD0gWEZTX1FVT1RBX1VEUV9FTkZEOwogCX0KIAogCWlmIChm c19wYXRoLT5mc19mbGFncyAmIEZTX01PVU5UX1BPSU5UKQpAQCAtNDU4LDcg KzQ2Miw3IEBAIG9mZl9mKAogCWlmIChhcmdjICE9IG9wdGluZCkKIAkJcmV0 dXJuIGNvbW1hbmRfdXNhZ2UoJm9mZl9jbWQpOwogCi0JaWYgKCFmbGFncykg eworCWlmICghdHlwZSkgewogCQl0eXBlIHw9IFhGU19VU0VSX1FVT1RBOwog CQlxZmxhZ3MgfD0gWEZTX1FVT1RBX1VEUV9BQ0NUIHwgWEZTX1FVT1RBX1VE UV9FTkZEOwogCX0KQEAgLTQ3MywyMSArNDc3LDE4IEBAIHJlbW92ZV9mKAog CWludAkJYXJnYywKIAljaGFyCQkqKmFyZ3YpCiB7Ci0JaW50CQljLCBmbGFn cyA9IDAsIHFmbGFncyA9IDAsIHR5cGUgPSAwOworCWludAkJYywgZmxhZ3Mg 
PSAwLCB0eXBlID0gMDsKIAogCXdoaWxlICgoYyA9IGdldG9wdChhcmdjLCBh cmd2LCAiZ3B1diIpKSAhPSBFT0YpIHsKIAkJc3dpdGNoIChjKSB7CiAJCWNh c2UgJ2cnOgogCQkJdHlwZSB8PSBYRlNfR1JPVVBfUVVPVEE7Ci0JCQlxZmxh Z3MgfD0gWEZTX1FVT1RBX0dEUV9BQ0NUIHwgWEZTX1FVT1RBX0dEUV9FTkZE OwogCQkJYnJlYWs7CiAJCWNhc2UgJ3AnOgogCQkJdHlwZSB8PSBYRlNfUFJP Sl9RVU9UQTsKLQkJCXFmbGFncyB8PSBYRlNfUVVPVEFfUERRX0FDQ1QgfCBY RlNfUVVPVEFfUERRX0VORkQ7CiAJCQlicmVhazsKIAkJY2FzZSAndSc6CiAJ CQl0eXBlIHw9IFhGU19VU0VSX1FVT1RBOwotCQkJcWZsYWdzIHw9IFhGU19R VU9UQV9VRFFfQUNDVCB8IFhGU19RVU9UQV9VRFFfRU5GRDsKIAkJCWJyZWFr OwogCQljYXNlICd2JzoKIAkJCWZsYWdzIHw9IFZFUkJPU0VfRkxBRzsKQEAg LTUwMCwxMyArNTAxLDEyIEBAIHJlbW92ZV9mKAogCWlmIChhcmdjICE9IG9w dGluZCkKIAkJcmV0dXJuIGNvbW1hbmRfdXNhZ2UoJnJlbW92ZV9jbWQpOwog Ci0JaWYgKCFmbGFncykgeworCWlmICghdHlwZSkgewogCQl0eXBlIHw9IFhG U19VU0VSX1FVT1RBOwotCQlxZmxhZ3MgfD0gWEZTX1FVT1RBX1VEUV9BQ0NU IHwgWEZTX1FVT1RBX1VEUV9FTkZEOwogCX0KIAogCWlmIChmc19wYXRoLT5m c19mbGFncyAmIEZTX01PVU5UX1BPSU5UKQotCQlyZW1vdmVfZXh0ZW50cyhm c19wYXRoLT5mc19kaXIsIHR5cGUsIHFmbGFncywgZmxhZ3MpOworCQlyZW1v dmVfZXh0ZW50cyhmc19wYXRoLT5mc19kaXIsIHR5cGUsIGZsYWdzKTsKIAly ZXR1cm4gMDsKIH0KIAo= ----------------------0751065352324900-- From owner-xfs@oss.sgi.com Sun May 6 19:11:35 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 19:11:38 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l472BXfB024497 for ; Sun, 6 May 2007 19:11:34 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id MAA17577; Mon, 7 May 2007 12:11:25 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l472BOAf86463303; Mon, 7 May 2007 12:11:24 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l472BMNV85439097; Mon, 7 May 2007 12:11:22 
+1000 (AEST) Date: Mon, 7 May 2007 12:11:22 +1000 From: David Chinner To: Emmanuel Florac Cc: David Chinner , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070507021122.GQ32602149@melbourne.sgi.com> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20070504152546.614374ac@harpe.intellique.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11307 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Fri, May 04, 2007 at 03:25:46PM +0200, Emmanuel Florac wrote: > On Fri, 4 May 2007 17:33:44 +1000 > David Chinner wrote: > > > Well, there's your problem. Stack overflows. IMO, if you use a > > filesystem, you shouldn't use 4k stacks. ;) > > > > If you remake you kernel with 8k stacks then your problems will > > most likely go away. > > Well, I've double-checked the asm-i386/module.h, and it actually looks > like 4K stacks is NOT the default, so I must be using 8K, isn't it? Yes. > I've ran the same test on the same machine but WITHOUT software raid-0 > (so write barriers are in use), and all went well, more than 3TB > written without a glitch. I still think there's something related to > the write barriers here. I'll try with another RAID controller, Adaptec > for instance, to get sure the 3ware driver isn't involved. I'll also try > again with an amd64 kernel. So you use software raid and you get corruptions, right? I doubt this has anything to do with write barriers - if it does that's an indication of broken drivers or hardware.....
Can you run with "-o nobarrier" and no software raid and see if you still have a problem? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 7 03:07:56 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 03:07:59 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47A7tfB005622 for ; Mon, 7 May 2007 03:07:56 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id A440617CFA; Mon, 7 May 2007 12:07:54 +0200 (CEST) Date: Mon, 7 May 2007 12:07:54 +0200 From: Emmanuel Florac To: David Chinner Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070507120754.289deffd@galadriel.home> In-Reply-To: <20070507021122.GQ32602149@melbourne.sgi.com> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <20070507021122.GQ32602149@melbourne.sgi.com> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l47A7vfB005627 X-archive-position: 11308 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Mon, 7 May 2007 12:11:22 +1000 you wrote: > So you use software raid and you get corruptions, right? I doubt this > has anything to do with write barriers - if it does that's an > indication of broken drivers or hardware..... > > Can you run with "-o nobarrier" and no software raid and see if you > still have a problem?
I tried on the same machine without software RAID and with barriers enabled, and it worked OK. I'll try today with nobarrier. Stay tuned :) -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Mon May 7 04:03:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 04:03:55 -0700 (PDT) Received: from e4.ny.us.ibm.com (e4.ny.us.ibm.com [32.97.182.144]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47B3lfB017677 for ; Mon, 7 May 2007 04:03:49 -0700 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e4.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l47B3hN7031521 for ; Mon, 7 May 2007 07:03:43 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay02.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47B3hcJ515560 for ; Mon, 7 May 2007 07:03:43 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47B3g21005021 for ; Mon, 7 May 2007 07:03:43 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47B3frL004964; Mon, 7 May 2007 07:03:42 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 5AB9D94BBD; Mon, 7 May 2007 16:33:49 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47B3nwC010945; Mon, 7 May 2007 16:33:49 +0530 Date: Mon, 7 May 2007 16:33:48 +0530 From: "Amit K.
Arora" To: Andrew Morton Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070507110348.GA7012@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503212955.b1b6443c.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11309 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs Andrew, Thanks for the review comments! On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > This patch implements the fallocate() system call and adds support for > > i386, x86_64 and powerpc. > > > > ... > > > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > Please add a comment over this function which specifies its behaviour. > Really it should be enough material from which a full manpage can be > written. > > If that's all too much, this material should at least be spelled out in the > changelog. Because there's no way in which this change can be fully > reviewed unless someone (ie: you) tells us what it is setting out to > achieve. 
> > If we 100% implement some standard then a URL for what we claim to > implement would suffice. Given that we're at least using different types from > posix I doubt if such a thing would be sufficient. > > And given the complexity and potential variability within the filesystem > implementations of this, I'd expect that _something_ additional needs to be > said? Ok. I will add a detailed comment here. > > > +{ > > + struct file *file; > > + struct inode *inode; > > + long ret = -EINVAL; > > + > > + if (len == 0 || offset < 0) > > + goto out; > > The posix spec implies that negative `len' is permitted - presumably "allocate > ahead of `offset'". How peculiar. I think we should go with the current glibc implementation (which Jakub pointed at) of not allowing a negative 'len', since posix also doesn't explicitly say anything about allowing a negative 'len'. > > > + ret = -EBADF; > > + file = fget(fd); > > + if (!file) > > + goto out; > > + if (!(file->f_mode & FMODE_WRITE)) > > + goto out_fput; > > + > > + inode = file->f_path.dentry->d_inode; > > + > > + ret = -ESPIPE; > > + if (S_ISFIFO(inode->i_mode)) > > + goto out_fput; > > + > > + ret = -ENODEV; > > + if (!S_ISREG(inode->i_mode)) > > + goto out_fput; > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > seems a bit silly of them. True. > > + ret = -EFBIG; > > + if (offset + len > inode->i_sb->s_maxbytes) > > + goto out_fput; > > This code does handle offset+len going negative, but only by accident, I > suspect. It happens that s_maxbytes has unsigned type. Perhaps a comment > here would settle the reader's mind. Ok. I will add a check here for wrap through zero.
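A minimal sketch of the wrap check discussed above, written as standalone userspace C rather than kernel code. The helper name range_ok and its max_bytes parameter are hypothetical stand-ins for the check in sys_fallocate() and for inode->i_sb->s_maxbytes, and int64_t stands in for loff_t; the actual patch may well do this differently.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): returns true when
 * [offset, offset + len) is a non-wrapping range that ends at or
 * below max_bytes. int64_t stands in for loff_t.
 */
static bool range_ok(int64_t offset, int64_t len, int64_t max_bytes)
{
	/* Reject negative offsets and zero or negative lengths up front. */
	if (offset < 0 || len <= 0)
		return false;
	/* offset + len would overflow if len exceeds the remaining headroom. */
	if (len > INT64_MAX - offset)
		return false;
	/* Safe to add now; compare against the filesystem limit. */
	return offset + len <= max_bytes;
}
```

Testing the headroom before adding avoids relying on signed overflow, which is undefined behaviour in C, and does not depend on the limit happening to have an unsigned type.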
> > + if (inode->i_op && inode->i_op->fallocate) > > + ret = inode->i_op->fallocate(inode, mode, offset, len); > > + else > > + ret = -ENOSYS; > > If we _are_ going to support negative `len', as posix suggests, I think we > should perform the appropriate sanity conversions to `offset' and `len' > right here, rather than expecting each filesystem to do it. > > If we're not going to handle negative `len' then we should check for it. Will add a check for negative 'len' and return -EINVAL. This will be done where currently we check for negative offset (i.e. at the start of the function). > > +out_fput: > > + fput(file); > > +out: > > + return ret; > > +} > > +EXPORT_SYMBOL(sys_fallocate); > > I don't believe this needs to be exported to modules? Ok. Will remove it. > > +/* > > + * fallocate() modes > > + */ > > +#define FA_ALLOCATE 0x1 > > +#define FA_DEALLOCATE 0x2 > > Now those aren't in posix. They should be documented, along with their > expected semantics. Will add a comment describing the role of these modes. > > #ifdef __KERNEL__ > > > > #include > > @@ -1125,6 +1131,7 @@ struct inode_operations { > > ssize_t (*listxattr) (struct dentry *, char *, size_t); > > int (*removexattr) (struct dentry *, const char *); > > void (*truncate_range)(struct inode *, loff_t, loff_t); > > + long (*fallocate)(struct inode *, int, loff_t, loff_t); > > I really do think it's better to put the variable names in definitions such > as this. Especially when we have two identically-typed variables next to > each other like that. Quick: which one is the offset and which is the > length? Ok. Will add the variable names here. 
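As a rough illustration of the two points agreed above (an early -EINVAL for a negative 'len', and named parameters in the operation prototype), here is a userspace sketch. The types struct inode_ops and struct inode and the function do_fallocate are simplified, hypothetical stand-ins for the kernel structures and for the tail of the quoted sys_fallocate(); none of this is taken from the final patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct inode;

/* Simplified stand-in for struct inode_operations; illustrative only. */
struct inode_ops {
	/* Named parameters make the offset/len order obvious at a glance. */
	long (*fallocate)(struct inode *inode, int mode,
			  int64_t offset, int64_t len);
};

/* Simplified stand-in for struct inode. */
struct inode {
	const struct inode_ops *i_op;
};

/* Hypothetical dispatcher modelled on the quoted sys_fallocate() logic. */
static long do_fallocate(struct inode *inode, int mode,
			 int64_t offset, int64_t len)
{
	/* Negative offset, and zero or negative len, are rejected early. */
	if (offset < 0 || len <= 0)
		return -EINVAL;
	/* Hand off to the filesystem hook when one is provided. */
	if (inode->i_op && inode->i_op->fallocate)
		return inode->i_op->fallocate(inode, mode, offset, len);
	return -ENOSYS;
}
```

Doing the argument validation once in the common entry point means each filesystem hook can assume a non-negative offset and a positive length.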
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 04:10:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 04:10:42 -0700 (PDT) Received: from e6.ny.us.ibm.com (e6.ny.us.ibm.com [32.97.182.146]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47BAbfB018894 for ; Mon, 7 May 2007 04:10:39 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e6.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l47BBYno028901 for ; Mon, 7 May 2007 07:11:34 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47BAbK5550866 for ; Mon, 7 May 2007 07:10:37 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47BAaUk020671 for ; Mon, 7 May 2007 07:10:36 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47BAZ5f020654; Mon, 7 May 2007 07:10:36 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id B3CE494BBD; Mon, 7 May 2007 16:40:38 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47BAcV7013746; Mon, 7 May 2007 16:40:38 +0530 Date: Mon, 7 May 2007 16:40:38 +0530 From: "Amit K. 
Arora" To: Andrew Morton Cc: David Chinner , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070507111038.GB7012@amitarora.in.ibm.com> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <20070504060731.GJ32602149@melbourne.sgi.com> <20070503232815.2f62a75e.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503232815.2f62a75e.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11310 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote: > The above opengroup page only permits S_ISREG. Preallocating directories > sounds quite useful to me, although it's something which would be pretty > hard to emulate if the FS doesn't support it. And there's a decent case to > be made for emulating it - run-anywhere reasons. Does glibc emulation support > directories? Quite unlikely. > > But yes, sounds like a desirable thing. Would XFS support it easily if the above > check was relaxed? I think we may relax the check here and let the individual file system decide if they support preallocation for directories or not. What do you think ? 
One thing to consider in this case is the error code the file system implementation should return in case it doesn't support preallocation for directories. Should it be -ENODEV (to match what posix says), or something else (which might make more sense in this case)? -- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 04:46:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 04:46:48 -0700 (PDT) Received: from e2.ny.us.ibm.com (e2.ny.us.ibm.com [32.97.182.142]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47BkifB032153 for ; Mon, 7 May 2007 04:46:45 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e2.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l47Bkh21026930 for ; Mon, 7 May 2007 07:46:43 -0400 Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47BkhAb550744 for ; Mon, 7 May 2007 07:46:43 -0400 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47Bkh3g010826 for ; Mon, 7 May 2007 07:46:43 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av02.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47Bkg9x010807; Mon, 7 May 2007 07:46:42 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id ECD1694BBD; Mon, 7 May 2007 17:16:49 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47BknTn028767; Mon, 7 May 2007 17:16:49 +0530 Date: Mon, 7 May 2007 17:16:49 +0530 From: "Amit K.
Arora" To: Andrew Morton Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 3/5] ext4: Extent overlap bugfix Message-ID: <20070507114649.GC7012@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181101.GC7209@amitarora.in.ibm.com> <20070503213002.eff696db.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503213002.eff696db.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11311 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 09:30:02PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:41:01 +0530 "Amit K. 
Arora" wrote: > > > +unsigned int ext4_ext_check_overlap(struct inode *inode, > > + struct ext4_extent *newext, > > + struct ext4_ext_path *path) > > +{ > > + unsigned long b1, b2; > > + unsigned int depth, len1; > > + > > + b1 = le32_to_cpu(newext->ee_block); > > + len1 = le16_to_cpu(newext->ee_len); > > + depth = ext_depth(inode); > > + if (!path[depth].p_ext) > > + goto out; > > + b2 = le32_to_cpu(path[depth].p_ext->ee_block); > > + > > + /* get the next allocated block if the extent in the path > > + * is before the requested block(s) */ > > + if (b2 < b1) { > > + b2 = ext4_ext_next_allocated_block(path); > > + if (b2 == EXT_MAX_BLOCK) > > + goto out; > > + } > > + > > + if (b1 + len1 > b2) { > > Are we sure that b1+len cannot wrap through zero here? No. Will add a check here for this. Thanks! > > + newext->ee_len = cpu_to_le16(b2 - b1); > > + return 1; > > + } -- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 05:11:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 05:11:58 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47CBqfB003761 for ; Mon, 7 May 2007 05:11:54 -0700 Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47C7E7Y027576 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Mon, 7 May 2007 08:07:15 -0400 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e32.co.us.ibm.com (8.12.11.20060308/8.13.8) with ESMTP id l47C3sWu029214 for ; Mon, 7 May 2007 08:03:54 -0400 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47C7CFZ184746 for ; Mon, 7 May 2007 06:07:12 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id 
l47C7CKS012675 for ; Mon, 7 May 2007 06:07:12 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47C7Bi2012612; Mon, 7 May 2007 06:07:11 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 2FC5D94BBD; Mon, 7 May 2007 17:37:19 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47C7Jq7004761; Mon, 7 May 2007 17:37:19 +0530 Date: Mon, 7 May 2007 17:37:19 +0530 From: "Amit K. Arora" To: Andrew Morton Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070507120719.GD7012@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503213133.d1559f52.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11312 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" wrote: > > > This patch has the ext4 implemtation of fallocate system call. > > > > ... 
> > > > + /* ext4_can_extents_be_merged should have checked that either > > + * both extents are uninitialized, or both aren't. Thus we > > + * need to check only one of them here. > > + */ > > Please always format multiline comments like this: > > /* > * ext4_can_extents_be_merged should have checked that either > * both extents are uninitialized, or both aren't. Thus we > * need to check only one of them here. > */ Ok. > > ... > > > > +/* > > + * ext4_fallocate: > > + * preallocate space for a file > > + * mode is for future use, e.g. for unallocating preallocated blocks etc. > > + */ > > This description is rather thin. What is the filesystem's actual behaviour > here? If the file is using extents then the implementation will do > . If the file is using bitmaps then we will do . > > But what? Here is where it should be described. Ok. Will expand the description. > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) > > +{ > > + handle_t *handle; > > + ext4_fsblk_t block, max_blocks; > > + int ret, ret2, nblocks = 0, retries = 0; > > + struct buffer_head map_bh; > > + unsigned int credits, blkbits = inode->i_blkbits; > > + > > + /* Currently supporting (pre)allocate mode _only_ */ > > + if (mode != FA_ALLOCATE) > > + return -EOPNOTSUPP; > > + > > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > > + return -ENOTTY; > > So we don't implement fallocate on bitmap-based files! Well that's huge > news. The changelog would be an appropriate place to communicate this, > along with reasons why, or a description of the plan to fix it. Ok. Will add this in the function description as well. > Also, posix says nothing about fallocate() returning ENOTTY. Right. I don't seem to find any suitable error from posix description. Can you please suggest an error code which might make more sense here ? Will -ENOTSUPP be ok ? Since we want to say here that we don't support non-extent files. 
> > + block = offset >> blkbits; > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > > + - block; > > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); > > Now I'm mystified. Given that we're allocating an arbitrary amount of disk > space, and that this disk space will require an arbitrary amount of > metadata, how can we work out how much journal space we'll be needing > without at least looking at `len'? You are right to say that the credits cannot be fixed here. But 'len' will not directly tell us how many extents might need to be inserted and how many block groups (if any - think about the "segment range" already being allocated case) the allocation request might touch. One solution I have thought of is to check the buffer credits after a call to ext4_ext_get_blocks (in the while loop) and do a journal_extend if the credits are falling short. In case journal_extend fails, we call journal_restart. This will automatically take care of how much journal space we might need for any value of "len". > > + handle=ext4_journal_start(inode, credits + > > Please always put spaces around "=" Ok. > > > + EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1); > > And around "+" Ok. > > > + if (IS_ERR(handle)) > > + return PTR_ERR(handle); > > +retry: > > + ret = 0; > > + while (ret >= 0 && ret < max_blocks) { > > + block = block + ret; > > + max_blocks = max_blocks - ret; > > + ret = ext4_ext_get_blocks(handle, inode, block, > > + max_blocks, &map_bh, > > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > > + BUG_ON(!ret); > > BUG_ON is vicious. Is it really justified here? Possibly a WARN_ON and > ext4_error() would be safer and more useful here. Ok. Will do that. > > > + if (ret > 0 && test_bit(BH_New, &map_bh.b_state) > > Use buffer_new() here. A separate patch which fixes the three existing > instances of open-coded BH_foo usage would be appreciated. Ok.
> > > + && ((block + ret) > (i_size_read(inode) << blkbits))) > > Check for wrap though the sign bit and through zero please. Ok. > > > + nblocks = nblocks + ret; > > + } > > + > > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) > > + goto retry; > > + > > + /* Time to update the file size. > > + * Update only when preallocation was requested beyond the file size. > > + */ > > Fix comment layout. Ok. > > > + if ((offset + len) > i_size_read(inode)) { > > Both the lhs and the rhs here are signed. Please review for possible > overflows through the sign bit and through zero. Perhaps a comment > explaining why it's correct would be appropriate. Ok. > > > > + if (ret > 0) { > > + /* if no error, we assume preallocation succeeded completely */ > > + mutex_lock(&inode->i_mutex); > > + i_size_write(inode, offset + len); > > + EXT4_I(inode)->i_disksize = i_size_read(inode); > > + mutex_unlock(&inode->i_mutex); > > + } else if (ret < 0 && nblocks) { > > + /* Handle partial allocation scenario */ > > The above two comments should be indented one additional tabstop. Ok. > > > + loff_t newsize; > > + mutex_lock(&inode->i_mutex); > > + newsize = (nblocks << blkbits) + i_size_read(inode); > > + i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits)); > > + EXT4_I(inode)->i_disksize = i_size_read(inode); > > + mutex_unlock(&inode->i_mutex); > > + } > > + } > > + ext4_mark_inode_dirty(handle, inode); > > + ret2 = ext4_journal_stop(handle); > > + if (ret > 0) > > + ret = ret2; > > + > > + return ret > 0 ? 
0 : ret; > > +} > > + > > EXPORT_SYMBOL(ext4_mark_inode_dirty); > > EXPORT_SYMBOL(ext4_ext_invalidate_cache); > > EXPORT_SYMBOL(ext4_ext_insert_extent); > > EXPORT_SYMBOL(ext4_ext_walk_space); > > EXPORT_SYMBOL(ext4_ext_find_goal); > > EXPORT_SYMBOL(ext4_ext_calc_credits_for_insert); > > +EXPORT_SYMBOL(ext4_fallocate); > > > > Index: linux-2.6.21/fs/ext4/file.c > > =================================================================== > > --- linux-2.6.21.orig/fs/ext4/file.c > > +++ linux-2.6.21/fs/ext4/file.c > > @@ -135,5 +135,6 @@ const struct inode_operations ext4_file_ > > .removexattr = generic_removexattr, > > #endif > > .permission = ext4_permission, > > + .fallocate = ext4_fallocate, > > }; > > > > Index: linux-2.6.21/include/linux/ext4_fs.h > > =================================================================== > > --- linux-2.6.21.orig/include/linux/ext4_fs.h > > +++ linux-2.6.21/include/linux/ext4_fs.h > > @@ -102,6 +102,8 @@ > > EXT4_GOOD_OLD_FIRST_INO : \ > > (s)->s_first_ino) > > #endif > > +#define EXT4_BLOCK_ALIGN(size, blkbits) (((size)+(1 << blkbits)-1) & \ > > + (~((1 << blkbits)-1))) > > Maybe a comment describing what this does? Probably it's obvious enough. > > I think it could use the standard ALIGN macro. > > Is blkbits sufficiently parenthesised here? Even if it is, adding the > parens would be better practice. I agree. Will change it. > > > /* > > * Macro-instructions used to manage fragments > > @@ -225,6 +227,10 @@ struct ext4_new_group_data { > > __u32 free_blocks_count; > > }; > > > > +/* Following is used by preallocation logic to tell get_blocks() that we > > + * want uninitialzed extents. > > + */ > > Please convert all newly-added multiline comments to the preferred layout. Ok. 
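For reference, one way the alignment macro could look with every argument parenthesised. This is only a sketch: the name BLOCK_ALIGN is hypothetical, and the 64-bit constant is a choice made here for the standalone example, not something taken from the patch or from the kernel's ALIGN macro.

```c
#include <stdint.h>

/*
 * Illustrative only: round size up to the next multiple of the block
 * size (1 << blkbits). Both macro arguments are parenthesised so that
 * expressions passed as arguments keep their intended grouping, and the
 * mask is built from a 64-bit constant so large sizes are preserved.
 */
#define BLOCK_ALIGN(size, blkbits) \
	(((uint64_t)(size) + ((UINT64_C(1) << (blkbits)) - 1)) & \
	 ~((UINT64_C(1) << (blkbits)) - 1))
```

With blkbits = 12 (4096-byte blocks), a size of 4097 rounds up to 8192 while 4096 stays at 4096.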
> > > +#define EXT4_CREATE_UNINITIALIZED_EXT 2 > > > > /* > > * ioctl commands > > @@ -976,6 +982,7 @@ extern int ext4_ext_get_blocks(handle_t > > extern void ext4_ext_truncate(struct inode *, struct page *); > > extern void ext4_ext_init(struct super_block *); > > extern void ext4_ext_release(struct super_block *); > > +extern int ext4_fallocate(struct inode *, int, loff_t, loff_t); > > argh. And feel free to give these args some useful names. Ok. > > > static inline int > > ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, > > unsigned long max_blocks, struct buffer_head *bh, > > Index: linux-2.6.21/include/linux/ext4_fs_extents.h > > =================================================================== > > --- linux-2.6.21.orig/include/linux/ext4_fs_extents.h > > +++ linux-2.6.21/include/linux/ext4_fs_extents.h > > @@ -125,6 +125,19 @@ struct ext4_ext_path { > > #define EXT4_EXT_CACHE_EXTENT 2 > > > > /* > > + * Macro-instructions to handle (mark/unmark/check/create) unitialized > > + * extents. Applications can issue an IOCTL for preallocation, which results > > + * in assigning unitialized extents to the file. > > + */ > > +#define ext4_ext_mark_uninitialized(ext) ((ext)->ee_len |= \ > > + cpu_to_le16(0x8000)) > > +#define ext4_ext_is_uninitialized(ext) ((le16_to_cpu((ext)->ee_len))& \ > > + 0x8000) > > +#define ext4_ext_get_actual_len(ext) ((le16_to_cpu((ext)->ee_len))& \ > > + 0x7FFF) > > inlined C functions are preferred, and I think these could be implemented > that way. Ok. Will convert them to inline functions. Thanks! 
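
As a rough idea of what the inline-function form of the three macros above could look like, here is a user-space sketch. It is not the code from the follow-up patch: struct ext_sketch stands in for struct ext4_extent, the sketch_cpu_to_le16()/sketch_le16_to_cpu() helpers stand in for the kernel's cpu_to_le16()/le16_to_cpu(), and all names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the on-disk extent; only ee_len matters for this sketch. */
struct ext_sketch {
	uint16_t ee_len;	/* little-endian; the top bit flags "uninitialized" */
};

#define EXT_SKETCH_UNINIT	0x8000u
#define EXT_SKETCH_LEN_MASK	0x7FFFu

/* Portable stand-ins for the kernel byte-order helpers. */
static inline uint16_t sketch_cpu_to_le16(uint16_t v)
{
	uint8_t b[2] = { (uint8_t)(v & 0xFF), (uint8_t)(v >> 8) };
	uint16_t out;

	memcpy(&out, b, sizeof(out));
	return out;
}

static inline uint16_t sketch_le16_to_cpu(uint16_t v)
{
	uint8_t b[2];

	memcpy(b, &v, sizeof(b));
	return (uint16_t)(b[0] | (b[1] << 8));
}

/* Set the "uninitialized" flag without touching the length bits. */
static inline void ext_mark_uninitialized(struct ext_sketch *ext)
{
	ext->ee_len |= sketch_cpu_to_le16(EXT_SKETCH_UNINIT);
}

/* Non-zero if the extent is flagged as uninitialized. */
static inline int ext_is_uninitialized(const struct ext_sketch *ext)
{
	return (sketch_le16_to_cpu(ext->ee_len) & EXT_SKETCH_UNINIT) != 0;
}

/* Length in blocks with the flag bit masked off. */
static inline unsigned int ext_get_actual_len(const struct ext_sketch *ext)
{
	return sketch_le16_to_cpu(ext->ee_len) & EXT_SKETCH_LEN_MASK;
}
```

Compared with the macros, the functions give the compiler a typed argument to check and evaluate it exactly once.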
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 05:24:56 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 05:24:59 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47COsfB005440 for ; Mon, 7 May 2007 05:24:55 -0700 Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47CBCnQ029352 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Mon, 7 May 2007 08:11:12 -0400 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e32.co.us.ibm.com (8.12.11.20060308/8.13.8) with ESMTP id l47C7pKq031566 for ; Mon, 7 May 2007 08:07:51 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47CB9dj171804 for ; Mon, 7 May 2007 06:11:09 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47CB9VC024191 for ; Mon, 7 May 2007 06:11:09 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47CB7of024105; Mon, 7 May 2007 06:11:08 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 2B6F894BBD; Mon, 7 May 2007 17:41:16 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47CBFgC006463; Mon, 7 May 2007 17:41:15 +0530 Date: Mon, 7 May 2007 17:41:15 +0530 From: "Amit K. 
Arora" To: Andrew Morton Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents Message-ID: <20070507121115.GE7012@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181623.GE7209@amitarora.in.ibm.com> <20070503213238.5cdb1585.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503213238.5cdb1585.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11313 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 09:32:38PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:46:23 +0530 "Amit K. Arora" wrote: > > + */ > > +int ext4_ext_try_to_merge(struct inode *inode, > > + struct ext4_ext_path *path, > > + struct ext4_extent *ex) > > +{ > > + struct ext4_extent_header *eh; > > + unsigned int depth, len; > > + int merge_done=0, uninitialized = 0; > > space around "=", please. > > Many people prefer not to do the multiple-definitions-per-line, btw: > > int merge_done = 0; > int uninitialized = 0; Ok. Will make the change. > > reasons: > > - If gives you some space for a nice comment > > - It makes patches much more readable, and it makes rejects easier to fix > > - standardisation. 
>
> > +	depth = ext_depth(inode);
> > +	BUG_ON(path[depth].p_hdr == NULL);
> > +	eh = path[depth].p_hdr;
> > +
> > +	while (ex < EXT_LAST_EXTENT(eh)) {
> > +		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
> > +			break;
> > +		/* merge with next extent! */
> > +		if (ext4_ext_is_uninitialized(ex))
> > +			uninitialized = 1;
> > +		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
> > +				+ ext4_ext_get_actual_len(ex + 1));
> > +		if (uninitialized)
> > +			ext4_ext_mark_uninitialized(ex);
> > +
> > +		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
> > +			len = (EXT_LAST_EXTENT(eh) - ex - 1)
> > +					* sizeof(struct ext4_extent);
> > +			memmove(ex + 1, ex + 2, len);
> > +		}
> > +		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
>
> Kernel convention is to put spaces around "-".

Will fix this.

>
> > +		merge_done = 1;
> > +		BUG_ON(eh->eh_entries == 0);
>
> eek, scary BUG_ON. Do we really need to be that severe? Would it be
> better to warn and run ext4_error() here?

Ok.

>
> > +	}
> > +
> > +	return merge_done;
> > +}
> > +
> > +
> >
> > ...
> >
> > +/*
> > + * ext4_ext_convert_to_initialized:
> > + * this function is called by ext4_ext_get_blocks() if someone tries to write
> > + * to an uninitialized extent. It may result in splitting the uninitialized
> > + * extent into multiple extents (upto three). Atleast one initialized extent
> > + * and atmost two uninitialized extents can result.
>
> There are some typos here
>
> > + * There are three possibilities:
> > + *   a> No split required: Entire extent should be initialized.
> > + *   b> Split into two extents: Only one end of the extent is being written to.
> > + *   c> Split into three extents: Somone is writing in middle of the extent.
>
> and here
>

Ok. Will fix them.

> > + */ > > +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, > > + struct ext4_ext_path *path, > > + ext4_fsblk_t iblock, > > + unsigned long max_blocks) > > +{ > > + struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex; > > + struct ext4_extent_header *eh; > > + unsigned int allocated, ee_block, ee_len, depth; > > + ext4_fsblk_t newblock; > > + int err = 0, ret = 0; > > + > > + depth = ext_depth(inode); > > + eh = path[depth].p_hdr; > > + ex = path[depth].p_ext; > > + ee_block = le32_to_cpu(ex->ee_block); > > + ee_len = ext4_ext_get_actual_len(ex); > > + allocated = ee_len - (iblock - ee_block); > > + newblock = iblock - ee_block + ext_pblock(ex); > > + ex2 = ex; > > + > > + /* ex1: ee_block to iblock - 1 : uninitialized */ > > + if (iblock > ee_block) { > > + ex1 = ex; > > + ex1->ee_len = cpu_to_le16(iblock - ee_block); > > + ext4_ext_mark_uninitialized(ex1); > > + ex2 = &newex; > > + } > > + /* for sanity, update the length of the ex2 extent before > > + * we insert ex3, if ex1 is NULL. This is to avoid temporary > > + * overlap of blocks. > > + */ > > + if (!ex1 && allocated > max_blocks) > > + ex2->ee_len = cpu_to_le16(max_blocks); > > + /* ex3: to ee_block + ee_len : uninitialised */ > > + if (allocated > max_blocks) { > > + unsigned int newdepth; > > + ex3 = &newex; > > + ex3->ee_block = cpu_to_le32(iblock + max_blocks); > > + ext4_ext_store_pblock(ex3, newblock + max_blocks); > > + ex3->ee_len = cpu_to_le16(allocated - max_blocks); > > + ext4_ext_mark_uninitialized(ex3); > > + err = ext4_ext_insert_extent(handle, inode, path, ex3); > > + if (err) > > + goto out; > > + /* The depth, and hence eh & ex might change > > + * as part of the insert above. > > + */ > > + newdepth = ext_depth(inode); > > + if (newdepth != depth) > > + { > > Use > > if (newdepth != depth) { Ok. > > > + depth=newdepth; > > spaces Ok. 
> > > + path = ext4_ext_find_extent(inode, iblock, NULL); > > + if (IS_ERR(path)) { > > + err = PTR_ERR(path); > > + path = NULL; > > + goto out; > > + } > > + eh = path[depth].p_hdr; > > + ex = path[depth].p_ext; > > + if (ex2 != &newex) > > + ex2 = ex; > > + } > > + allocated = max_blocks; > > + } > > + /* If there was a change of depth as part of the > > + * insertion of ex3 above, we need to update the length > > + * of the ex1 extent again here > > + */ > > + if (ex1 && ex1 != ex) { > > + ex1 = ex; > > + ex1->ee_len = cpu_to_le16(iblock - ee_block); > > + ext4_ext_mark_uninitialized(ex1); > > + ex2 = &newex; > > + } > > + /* ex2: iblock to iblock + maxblocks-1 : initialised */ > > + ex2->ee_block = cpu_to_le32(iblock); > > + ex2->ee_start = cpu_to_le32(newblock); > > + ext4_ext_store_pblock(ex2, newblock); > > + ex2->ee_len = cpu_to_le16(allocated); > > + if (ex2 != ex) > > + goto insert; > > + if ((err = ext4_ext_get_access(handle, inode, path + depth))) > > + goto out; > > The preferred style is > > err = ext4_ext_get_access(handle, inode, path + depth); > if (err) > goto out; Right. Will change it. > > + /* New (initialized) extent starts from the first block > > + * in the current extent. i.e., ex2 == ex > > + * We have to see if it can be merged with the extent > > + * on the left. > > + */ > > + if (ex2 > EXT_FIRST_EXTENT(eh)) { > > + /* To merge left, pass "ex2 - 1" to try_to_merge(), > > + * since it merges towards right _only_. > > + */ > > + ret = ext4_ext_try_to_merge(inode, path, ex2 - 1); > > + if (ret) { > > + err = ext4_ext_correct_indexes(handle, inode, path); > > + if (err) > > + goto out; > > + depth = ext_depth(inode); > > + ex2--; > > + } > > + } > > + /* Try to Merge towards right. This might be required > > + * only when the whole extent is being written to. > > + * i.e. ex2==ex and ex3==NULL. 
> > + */ > > + if (!ex3) { > > + ret = ext4_ext_try_to_merge(inode, path, ex2); > > + if (ret) { > > + err = ext4_ext_correct_indexes(handle, inode, path); > > + if (err) > > + goto out; > > + } > > + } > > + /* Mark modified extent as dirty */ > > + err = ext4_ext_dirty(handle, inode, path + depth); > > + goto out; > > +insert: > > + err = ext4_ext_insert_extent(handle, inode, path, &newex); > > +out: > > + return err ? err : allocated; > > +} > > Sigh. I hope you guys know how all this works, because the extent code is > a mystery to me. Is the on-disk layout and the allocation strategy > described anywhere? > > > +extern int ext4_ext_try_to_merge(struct inode *, struct ext4_ext_path *, struct ext4_extent *); > > Again, I do think that sticking the identifiers in there helps > readability. Although it is not as important in a boring old declaration > as it is in, say, inode_operations, etc. > > Please try to keep the code looking nice in an 80-column display. Ok. Will make the required changes. Thanks again for your comments! 
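
For readers trying to follow ext4_ext_convert_to_initialized(), the block arithmetic behind the three cases in its comment (no split, split in two, split in three) can be modelled in isolation. The sketch below is a simplified user-space model, not the patch's code: it computes only logical block ranges and ignores physical blocks, journalling, merging and tree depth changes. struct piece, struct split and split_uninit() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

struct piece {
	uint32_t start;		/* first logical block */
	uint32_t len;		/* number of blocks; 0 means the piece is absent */
};

struct split {
	struct piece left;	/* "ex1": stays uninitialized */
	struct piece mid;	/* "ex2": becomes initialized */
	struct piece right;	/* "ex3": stays uninitialized */
};

/*
 * Model a write of up to max_blocks blocks starting at iblock into the
 * uninitialized extent [ee_block, ee_block + ee_len).
 */
static inline struct split split_uninit(uint32_t ee_block, uint32_t ee_len,
					uint32_t iblock, uint32_t max_blocks)
{
	struct split s = { {0, 0}, {0, 0}, {0, 0} };
	uint32_t allocated;

	/* the write must start inside the extent and cover something */
	assert(max_blocks > 0);
	assert(iblock >= ee_block && iblock - ee_block < ee_len);

	/* blocks from iblock to the end of the original extent */
	allocated = ee_len - (iblock - ee_block);

	if (iblock > ee_block) {
		s.left.start = ee_block;
		s.left.len = iblock - ee_block;
	}
	if (allocated > max_blocks) {
		s.right.start = iblock + max_blocks;
		s.right.len = allocated - max_blocks;
		allocated = max_blocks;
	}
	s.mid.start = iblock;
	s.mid.len = allocated;
	return s;
}
```

For an extent covering blocks 100 to 109, writing all ten blocks from block 100 gives case (a), writing four blocks from block 100 gives case (b), and writing four blocks from block 103 gives case (c).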
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 06:04:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 06:04:30 -0700 (PDT) Received: from e34.co.us.ibm.com (e34.co.us.ibm.com [32.97.110.152]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47D4PfB010686 for ; Mon, 7 May 2007 06:04:26 -0700 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e34.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l47D4Odp031307 for ; Mon, 7 May 2007 09:04:24 -0400 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47D4ObH130260 for ; Mon, 7 May 2007 07:04:24 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47D4NSZ024479 for ; Mon, 7 May 2007 07:04:23 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47D4MAa024384; Mon, 7 May 2007 07:04:23 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 774D594BBD; Mon, 7 May 2007 18:34:30 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47D4T2s028246; Mon, 7 May 2007 18:34:29 +0530 Date: Mon, 7 May 2007 18:34:29 +0530 From: "Amit K. 
Arora" To: Pekka Enberg Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents Message-ID: <20070507130429.GA6681@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181623.GE7209@amitarora.in.ibm.com> <84144f020705070540tf3b1986yd4b1ab65e3a17d5e@mail.gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <84144f020705070540tf3b1986yd4b1ab65e3a17d5e@mail.gmail.com> User-Agent: Mutt/1.4.1i X-archive-position: 11314 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Mon, May 07, 2007 at 03:40:26PM +0300, Pekka Enberg wrote: > On 4/26/07, Amit K. Arora wrote: > > /* > >+ * ext4_ext_try_to_merge: > >+ * tries to merge the "ex" extent to the next extent in the tree. > >+ * It always tries to merge towards right. If you want to merge towards > >+ * left, pass "ex - 1" as argument instead of "ex". > >+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns > >+ * 1 if they got merged. > >+ */ > >+int ext4_ext_try_to_merge(struct inode *inode, > >+ struct ext4_ext_path *path, > >+ struct ext4_extent *ex) > >+{ > > Please either use proper kerneldoc format or drop > "ext4_ext_try_to_merge" from the comment. Ok, Thanks. 
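
To make the kerneldoc suggestion concrete, here is what such a header could look like on a merge-to-the-right helper. This is a user-space model over a plain array, not the ext4 code. The merge condition (same state, logically and physically contiguous, combined length at most 0x7FFF) is an assumption standing in for ext4_can_extents_be_merged(), whose body does not appear in this thread, and struct sext and sext_try_to_merge() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sext {
	uint32_t lblock;	/* first logical block */
	uint64_t pblock;	/* first physical block */
	uint16_t len;		/* length in blocks */
	int uninit;		/* non-zero if uninitialized */
};

/**
 * sext_try_to_merge - merge an extent with its right-hand neighbours
 * @ext: array of extents sorted by logical block
 * @count: number of valid entries in @ext; updated when entries are merged
 * @i: index of the extent to merge from
 *
 * Merging only ever goes to the right. To merge towards the left, pass
 * @i - 1 instead of @i.
 *
 * Return: 1 if at least one merge happened, 0 otherwise.
 */
static inline int sext_try_to_merge(struct sext *ext, size_t *count, size_t i)
{
	int merged = 0;

	while (i + 1 < *count) {
		struct sext *a = &ext[i];
		struct sext *b = &ext[i + 1];

		/* assumed merge condition, see the note above */
		if (a->uninit != b->uninit ||
		    a->lblock + a->len != b->lblock ||
		    a->pblock + a->len != b->pblock ||
		    (unsigned int)a->len + b->len > 0x7FFF)
			break;

		a->len = (uint16_t)(a->len + b->len);
		/* close the gap left by the absorbed extent */
		memmove(b, b + 1, (*count - i - 2) * sizeof(*ext));
		(*count)--;
		merged = 1;
	}
	return merged;
}
```

The "name - description" first line, the "@argument:" lines and the "Return:" section are the parts that the kerneldoc tooling picks up.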
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 06:07:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 06:07:09 -0700 (PDT) Received: from ug-out-1314.google.com (ug-out-1314.google.com [66.249.92.174]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47D75fB011315 for ; Mon, 7 May 2007 06:07:06 -0700 Received: by ug-out-1314.google.com with SMTP id t39so888009ugd for ; Mon, 07 May 2007 06:07:04 -0700 (PDT) DKIM-Signature: a=rsa-sha1; c=relaxed/relaxed; d=gmail.com; s=beta; h=domainkey-signature:received:received:message-id:date:from:sender:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references:x-google-sender-auth; b=urrBzjEDw6lND4d8kP5iZTiYWtAhrbTo0ORHvRO5Ac0osqutU/p2ps7ovA3enA6g5I7Jm25FfmAzxoA7atZKmc+vZuTtqPfy9vd5MJb3PzeWa9bscWaYVyMm7LqyoXNAnbXxY1thOUP7Bbzn5Tcc1AexjGTILIVaW1ugvFG3vGM= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta; h=received:message-id:date:from:sender:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references:x-google-sender-auth; b=EMuKhDR2U7hdITRMtxB/Oej5BaFNrGc5hvurNCbd0H9fYPUePT20nmMki3NBZJPq657HruCk61mcjc92u/jpzZQ8RuphVoJtrLKhzBQT/CvM4E+FcNF5nfiW9ei7sZh0QH9smIIL1eDa9egvH4kK/9Z+4XVOjfUOcHV5JVm02qg= Received: by 10.67.90.19 with SMTP id s19mr3525671ugl.1178541626311; Mon, 07 May 2007 05:40:26 -0700 (PDT) Received: by 10.67.9.19 with HTTP; Mon, 7 May 2007 05:40:26 -0700 (PDT) Message-ID: <84144f020705070540tf3b1986yd4b1ab65e3a17d5e@mail.gmail.com> Date: Mon, 7 May 2007 15:40:26 +0300 From: "Pekka Enberg" To: "Amit K. 
Arora" Subject: Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com In-Reply-To: <20070426181623.GE7209@amitarora.in.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181623.GE7209@amitarora.in.ibm.com> X-Google-Sender-Auth: 7ffddca7cb123766 X-archive-position: 11315 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: penberg@cs.helsinki.fi Precedence: bulk X-list: xfs On 4/26/07, Amit K. Arora wrote: > /* > + * ext4_ext_try_to_merge: > + * tries to merge the "ex" extent to the next extent in the tree. > + * It always tries to merge towards right. If you want to merge towards > + * left, pass "ex - 1" as argument instead of "ex". > + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns > + * 1 if they got merged. > + */ > +int ext4_ext_try_to_merge(struct inode *inode, > + struct ext4_ext_path *path, > + struct ext4_extent *ex) > +{ Please either use proper kerneldoc format or drop "ext4_ext_try_to_merge" from the comment. 
From owner-xfs@oss.sgi.com Mon May 7 06:22:18 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 06:22:21 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47DMGfB013748 for ; Mon, 7 May 2007 06:22:17 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l47D8pKs005754; Mon, 7 May 2007 09:08:51 -0400 Received: from lacrosse.corp.redhat.com (lacrosse.corp.redhat.com [172.16.52.154]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l47D8otS020113; Mon, 7 May 2007 09:08:50 -0400 Received: from myware66.akkadia.org (vpn-14-5.rdu.redhat.com [10.11.14.5]) by lacrosse.corp.redhat.com (8.12.11.20060308/8.11.6) with ESMTP id l47D8med016906; Mon, 7 May 2007 09:08:49 -0400 Message-ID: <463F24DB.5040406@redhat.com> Date: Mon, 07 May 2007 06:08:43 -0700 From: Ulrich Drepper Organization: Red Hat, Inc. User-Agent: Thunderbird 2.0.0.0 (X11/20070419) MIME-Version: 1.0 To: Jakub Jelinek CC: Andrew Morton , David Chinner , "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <20070504060731.GJ32602149@melbourne.sgi.com> <20070503232815.2f62a75e.akpm@linux-foundation.org> <20070504065626.GW355@devserv.devel.redhat.com> In-Reply-To: <20070504065626.GW355@devserv.devel.redhat.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-archive-position: 11316 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: drepper@redhat.com Precedence: bulk X-list: xfs Jakub Jelinek wrote: > is what glibc does ATM. Seems we violate the case where len == 0, as > EINVAL in that case is "shall fail". But reading the standard to imply > negative len is ok is too much guessing, there is no word what it means > when len is negative and > "required storage for regular file data starting at offset and continuing for len bytes" > doesn't make sense for negative size. This wording has already been cleaned up. The current draft for the next revision reads: [EINVAL] The len argument is less than or equal to zero, or the offset argument is less than zero, or the underlying file system does not support this operation. I still don't like it since len==0 shouldn't create an error (it's inconsistent) but len<0 is already outlawed. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. 
➧ 444 Castro St ➧ Mountain View, CA ❖ From owner-xfs@oss.sgi.com Mon May 7 08:48:21 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 08:48:24 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47FmJfB009196 for ; Mon, 7 May 2007 08:48:21 -0700 Received: from e1.ny.us.ibm.com ([192.168.1.101]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47FOkpf023745 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Mon, 7 May 2007 11:24:46 -0400 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e1.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l47FOffK009339 for ; Mon, 7 May 2007 11:24:41 -0400 Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47FOfJk549588 for ; Mon, 7 May 2007 11:24:41 -0400 Received: from d01av04.pok.ibm.com (loopback [127.0.0.1]) by d01av04.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47FOewe027863 for ; Mon, 7 May 2007 11:24:40 -0400 Received: from [9.53.41.190] (kleikamp.austin.ibm.com [9.53.41.190]) by d01av04.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47FOdo3027760; Mon, 7 May 2007 11:24:39 -0400 Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 From: Dave Kleikamp To: "Amit K. 
Arora" Cc: Andrew Morton , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com In-Reply-To: <20070507120719.GD7012@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507120719.GD7012@amitarora.in.ibm.com> Content-Type: text/plain Date: Mon, 07 May 2007 10:24:37 -0500 Message-Id: <1178551477.12900.6.camel@kleikamp.austin.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.8.3 Content-Transfer-Encoding: 7bit X-archive-position: 11317 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: shaggy@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote: > On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote: > > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" wrote: > > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) > > > +{ > > > + handle_t *handle; > > > + ext4_fsblk_t block, max_blocks; > > > + int ret, ret2, nblocks = 0, retries = 0; > > > + struct buffer_head map_bh; > > > + unsigned int credits, blkbits = inode->i_blkbits; > > > + > > > + /* Currently supporting (pre)allocate mode _only_ */ > > > + if (mode != FA_ALLOCATE) > > > + return -EOPNOTSUPP; > > > + > > > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > > > + return -ENOTTY; > > > > So we don't implement fallocate on bitmap-based files! Well that's huge > > news. 
The changelog would be an appropriate place to communicate this, > > along with reasons why, or a description of the plan to fix it. > > Ok. Will add this in the function description as well. > > > Also, posix says nothing about fallocate() returning ENOTTY. > > Right. I don't seem to find any suitable error from posix description. > Can you please suggest an error code which might make more sense here ? > Will -ENOTSUPP be ok ? Since we want to say here that we don't support > non-extent files. Isn't the idea that libc will interpret -ENOTTY, or whatever is returned here, and fall back to the current library code to do preallocation? This way, the caller of fallocate() will never see this return code, so it won't violate posix. -- David Kleikamp IBM Linux Technology Center From owner-xfs@oss.sgi.com Mon May 7 11:35:04 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 11:35:08 -0700 (PDT) Received: from tur.go2.pl (tur.go2.pl [193.17.41.50]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47IZ0fB008787 for ; Mon, 7 May 2007 11:35:02 -0700 Received: from poczta.o2.pl (mx10.go2.pl [193.17.41.74]) by tur.go2.pl (o2.pl Mailer 2.0.1) with ESMTP id CF0912349DA for ; Mon, 7 May 2007 20:04:28 +0200 (CEST) Received: from poczta.o2.pl (mx10.go2.pl [127.0.0.1]) by poczta.o2.pl (Postfix) with ESMTP id 07A2C58113 for ; Mon, 7 May 2007 20:04:26 +0200 (CEST) Received: from lucke.localnet (xdsl-7687.bielsko.dialog.net.pl [62.87.234.135]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by poczta.o2.pl (Postfix) with ESMTP for ; Mon, 7 May 2007 20:04:25 +0200 (CEST) From: =?utf-8?q?=C5=81ukasz_Fibinger?= Reply-To: lucke@o2.pl To: xfs@oss.sgi.com Subject: RESVSP problems Date: Mon, 7 May 2007 20:04:22 +0200 User-Agent: KMail/1.9.6 MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705072004.22848.lucke@o2.pl> 
X-archive-position: 11318
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: lucke@o2.pl
Precedence: bulk
X-list: xfs

Hello, guys,

I've been trying to implement RESVSP-based allocation in rtorrent. From the
very beginning it has, alas, misbehaved, thus (also considering my very basic
programming skills and experience and unfamiliarity with rtorrent's code)
after hours of trying to determine what's wrong, I finally observed that
blocks of files allocated with RESVSP (previously ftruncated to a proper
size) and being downloaded in rtorrent don't have their unwritten flags
removed (as confirmed by xfs_bmap -vp). As a result, the downloaded file
promptly gets corrupted (read: changes its md5sum). What is interesting, files
RESVSP-allocated in ktorrent and then imported to rtorrent seem to download
properly.

Everything works properly with ALLOCSP (although I've noticed that while
RESVSP worked with l_start = 0 and l_len = size, ALLOCSP worked with
l_start = size and l_len = 0; is that intended?).

I'm not quite sure what's at fault here. Perhaps rtorrent, as it prides itself
on transferring data "directly between file pages mapped to memory by the
mmap() function and the network stack". I haven't yet been able to determine
how it actually writes chunks to files (the aforementioned lack of skills,
experience and familiarity). Perhaps it's somehow XFS's fault, hence my
posting to this ML. Any help/suggestions would be appreciated.

Cheers, Luke From owner-xfs@oss.sgi.com Mon May 7 11:46:12 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 11:46:21 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47IkAfB010835 for ; Mon, 7 May 2007 11:46:12 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l47Ik5aL021039; Mon, 7 May 2007 14:46:06 -0400 Received: from pobox-2.corp.redhat.com (pobox-2.corp.redhat.com [10.11.255.15]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l47Ik5VF013479; Mon, 7 May 2007 14:46:05 -0400 Received: from [10.15.80.10] (neon.msp.redhat.com [10.15.80.10]) by pobox-2.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l47Ik4LP014643; Mon, 7 May 2007 14:46:04 -0400 Message-ID: <463F7368.8090101@sandeen.net> Date: Mon, 07 May 2007 13:43:52 -0500 From: Eric Sandeen User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: lucke@o2.pl CC: xfs@oss.sgi.com Subject: Re: RESVSP problems References: <200705072004.22848.lucke@o2.pl> In-Reply-To: <200705072004.22848.lucke@o2.pl> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-archive-position: 11319 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Łukasz Fibinger wrote: > Hello, guys, > > I've been trying to implement RESVSP-based allocation in rtorrent. From the > very beginning it has, alas, misbehaved, thus (also considering my very basic > programming skills and experience and unfamiliarity with rtorrent's code) > after hours of trying to determine what's wrong, I finally observed that > blocks of files allocated with RESVSP (previously ftruncated to a proper > size) and being downloaded in rtorrent don't have their unwritten flags > removed (as confirmed by xfs_bmap -vp). 

You've probably hit:
http://oss.sgi.com/bugzilla/show_bug.cgi?id=418
unwritten extents remain unwritten after mmap() modifies them

Bug dchinner about it... ;-)

> As a result, the downloaded file
> promptly gets corrupted (read: changes its md5sum). What is interesting, files
> RESVSP-allocated in ktorrent and then imported to rtorrent seem to download
> properly.
>
> Everything works properly with ALLOCSP (although I've noticed that while
> RESVSP worked with l_start = 0 and l_len = size, ALLOCSP worked with
> l_start = size and l_len = 0; is that intended?).

yeah... ISTR that the arguments are funky. I can't remember if it's a
bug or not. :) FWIW, allocsp just writes zeros to the file, so you
could do it just as well from userspace w/ no fancy ioctls... ALLOCSP
is a bit pointless if you ask me... though maybe someone knows why it's
there :)

-Eric

> I'm not quite sure what's at fault here. Perhaps rtorrent, as it prides itself
> on transferring data "directly between file pages mapped to memory by the
> mmap() function and the network stack". I haven't yet been able to determine
> how it actually writes chunks to files (the aforementioned lack of skills,
> experience and familiarity). Perhaps it's somehow XFS's fault, hence my
> posting to this ML. Any help/suggestions would be appreciated.
> > Cheers, > > Luke > > From owner-xfs@oss.sgi.com Mon May 7 11:58:46 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 11:58:53 -0700 (PDT) Received: from poczta.o2.pl (mx12.go2.pl [193.17.41.142]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47IwjfB013062 for ; Mon, 7 May 2007 11:58:46 -0700 Received: from poczta.o2.pl (mx12 [127.0.0.1]) by poczta.o2.pl (Postfix) with ESMTP id 1FBB83E81A6; Mon, 7 May 2007 20:58:37 +0200 (CEST) Received: from lucke.localnet (xdsl-7687.bielsko.dialog.net.pl [62.87.234.135]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by poczta.o2.pl (Postfix) with ESMTP; Mon, 7 May 2007 20:58:37 +0200 (CEST) From: =?utf-8?q?=C5=81ukasz_Fibinger?= Reply-To: lucke@o2.pl To: Eric Sandeen Subject: Re: RESVSP problems Date: Mon, 7 May 2007 20:58:32 +0200 User-Agent: KMail/1.9.6 References: <200705072004.22848.lucke@o2.pl> <463F7368.8090101@sandeen.net> In-Reply-To: <463F7368.8090101@sandeen.net> Cc: xfs@oss.sgi.com MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705072058.32679.lucke@o2.pl> X-archive-position: 11320 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lucke@o2.pl Precedence: bulk X-list: xfs On Monday 07 of May 2007, you wrote: > You've probably hit: > http://oss.sgi.com/bugzilla/show_bug.cgi?id=418 > unwritten extents remain unwritten after mmap() modifies them > > Bug dchinner about it... ;-) Dave, consider it a bugging from my humble self :-) > yeah... ISTR that the arguments are funky. I can't remember if it's a > bug or not. :) FWIW, allocsp just writes zeros to the file, so you > could do it just as well from userspace w/ no fancy ioctls... ALLOCSP > is a bit pointless if you ask me... 
though maybe someone knows why it's > there :) Let me say that I have noticed that using ALLOCSP seems to create fewer extents than posix_fallocate/manual zeroing. Thanks for your answer. Incidentally, I'm really happy that XFS has been bestowed upon Linux users. Thanks for all your work, guys :-) Cheers, Luke From owner-xfs@oss.sgi.com Mon May 7 12:49:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 12:49:44 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47JnbfB023617 for ; Mon, 7 May 2007 12:49:39 -0700 Received: from localhost.adilger.int (dhcp215-19.nersc.gov [128.55.19.215]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id EB3E17BA315; Mon, 7 May 2007 13:49:36 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 2DAA8406D; Mon, 7 May 2007 05:37:54 -0600 (MDT) Date: Mon, 7 May 2007 05:37:54 -0600 From: Andreas Dilger To: Andrew Morton Cc: "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070507113753.GA5439@schatzie.adilger.int> Mail-Followup-To: Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503213133.d1559f52.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11321 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 03, 2007 21:31 -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" wrote: > > + * ext4_fallocate: > > + * preallocate space for a file > > + * mode is for future use, e.g. for unallocating preallocated blocks etc. > > + */ > > This description is rather thin. What is the filesystem's actual behaviour > here? If the file is using extents then the implementation will do > . If the file is using bitmaps then we will do . > > But what? Here is where it should be described. My understanding is that glibc will handle zero-filling of files for filesystems that do not support fallocate(). 
> > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) > > +{ > > + handle_t *handle; > > + ext4_fsblk_t block, max_blocks; > > + int ret, ret2, nblocks = 0, retries = 0; > > + struct buffer_head map_bh; > > + unsigned int credits, blkbits = inode->i_blkbits; > > + > > + /* Currently supporting (pre)allocate mode _only_ */ > > + if (mode != FA_ALLOCATE) > > + return -EOPNOTSUPP; > > + > > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > > + return -ENOTTY; > > So we don't implement fallocate on bitmap-based files! Well that's huge > news. The changelog would be an appropriate place to communicate this, > along with reasons why, or a description of the plan to fix it. > > Also, posix says nothing about fallocate() returning ENOTTY. I _think_ this is to convince glibc to do the zero-filling in userspace, but I'm not up on the API specifics. > > + block = offset >> blkbits; > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > > + - block; > > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); > > Now I'm mystified. Given that we're allocating an arbitrary amount of disk > space, and that this disk space will require an arbitrary amount of > metadata, how can we work out how much journal space we'll be needing > without at least looking at `len'? Good question. The uninitialized extent can cover up to 128MB with a single entry. If @path isn't specified, then ext4_ext_calc_credits_for_insert() function returns the maximum number of extents needed to insert a leaf, including splitting all of the index blocks. That would allow up to 43GB (340 extents/block * 128MB) to be preallocated, but it still needs to take the size of the preallocation into account (adding 3 blocks per 43GB - a leaf block, a bitmap block and a group descriptor). Also, since @path is not being given then truncate_mutex is not needed. 
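As a rough check of the arithmetic above, here is a minimal sketch in C. The two constants (128MB per uninitialized extent, 340 extents per leaf block) are taken from this message, not from the ext4 sources, and the helper names are hypothetical. It also assumes the best case, where every extent is contiguous and reaches its maximum length.

```c
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/* Figures quoted in the message above; not read from ext4 headers. */
#define MAX_UNINIT_EXTENT_BYTES (128ULL * MIB)
#define EXTENTS_PER_LEAF        340ULL

/* Integer division rounding up; b must be non-zero. */
static uint64_t div_round_up(uint64_t a, uint64_t b)
{
    return (a + b - 1) / b;
}

/* Space one full leaf block of maximum-length extents can map, in MiB. */
static uint64_t leaf_coverage_mib(void)
{
    return EXTENTS_PER_LEAF * MAX_UNINIT_EXTENT_BYTES / MIB;
}

/* Best-case number of extents needed to cover len_bytes. */
static uint64_t extents_needed(uint64_t len_bytes)
{
    return div_round_up(len_bytes, MAX_UNINIT_EXTENT_BYTES);
}

/*
 * Extra metadata blocks on top of the base estimate: 3 blocks (leaf,
 * bitmap, group descriptor) for every leaf's worth of preallocation.
 */
static uint64_t extra_metadata_blocks(uint64_t len_bytes)
{
    return 3 * div_round_up(len_bytes,
                            EXTENTS_PER_LEAF * MAX_UNINIT_EXTENT_BYTES);
}
```

With these figures one leaf maps 43520 MiB (42.5 GiB, the "43GB" above), a 1 GiB request needs 8 extents, and a 50 GiB request spills into a second leaf and so needs 6 extra blocks rather than 3.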
> > + ret = ext4_ext_get_blocks(handle, inode, block, > > + max_blocks, &map_bh, > > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > > + BUG_ON(!ret); > > BUG_ON is vicious. Is it really justified here? Possibly a WARN_ON and > ext4_error() would be safer and more useful here. Ouch, not very friendly error handling. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Mon May 7 13:58:42 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 13:58:46 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47KwffB028868 for ; Mon, 7 May 2007 13:58:42 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l47KwQAQ005761 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 7 May 2007 13:58:27 -0700 Received: from akpm.corp.google.com (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l47KwPPl005141; Mon, 7 May 2007 13:58:25 -0700 Date: Mon, 7 May 2007 13:58:25 -0700 From: Andrew Morton To: Andreas Dilger Cc: "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-Id: <20070507135825.f8545a65.akpm@linux-foundation.org> In-Reply-To: <20070507113753.GA5439@schatzie.adilger.int> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.6; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11322 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Mon, 7 May 2007 05:37:54 -0600 Andreas Dilger wrote: > > > + block = offset >> blkbits; > > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > > > + - block; > > > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); > > > > Now I'm mystified. Given that we're allocating an arbitrary amount of disk > > space, and that this disk space will require an arbitrary amount of > > metadata, how can we work out how much journal space we'll be needing > > without at least looking at `len'? > > Good question. 
> > The uninitialized extent can cover up to 128MB with a single entry. > If @path isn't specified, then ext4_ext_calc_credits_for_insert() > function returns the maximum number of extents needed to insert a leaf, > including splitting all of the index blocks. That would allow up to 43GB > (340 extents/block * 128MB) to be preallocated, but it still needs to take > the size of the preallocation into account (adding 3 blocks per 43GB - a > leaf block, a bitmap block and a group descriptor). I think the use of ext4_journal_extend() (as Amit has proposed) will help here, but it is not sufficient. Because under some circumstances, a journal_extend() failure could mean that we fail to allocate all the required disk space. If it is infrequent enough, that is acceptable when the caller is using fallocate() for performance reasons. But it is very much not acceptable if the caller is using fallocate() for space-reservation reasons. If you used fallocate to reserve 1GB of disk and fallocate() "succeeded" and you later get ENOSPC then you'd have a right to get a bit upset. So I think the ext3/4 fallocate() implementation will need to be implemented as a loop: while (len) { journal_start(); len -= do_fallocate(len, ...); journal_stop(); } Now the interesting question is: what do we do if we get halfway through this loop and then run out of space? We could leave the disk all filled up and then return failure to the caller, but that's pretty poor behaviour, IMO. Does the proposed implementation handle quotas correctly, btw? Has that been tested? Final point: it's fairly disappointing that the present implementation is ext4-only, and extent-only. I do think we should be aiming at an ext4 bitmap-based implementation and an ext3 implementation. 
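The loop above, and the partial-failure problem it raises, can be illustrated with a toy userspace model. This is only a sketch under stated assumptions: the "filesystem" is a bare free-block counter, struct toy_fs and both function names are hypothetical, and the journal calls appear only as comments.

```c
#include <errno.h>
#include <stdint.h>

/* Toy model: the whole filesystem is just a free-block counter. */
struct toy_fs {
    uint64_t free_blocks;
};

/*
 * Stand-in for one bounded allocation step. Allocates at most
 * max_per_txn blocks and returns how many it actually got
 * (0 once the filesystem is full).
 */
static uint64_t toy_do_fallocate(struct toy_fs *fs, uint64_t want,
                                 uint64_t max_per_txn)
{
    uint64_t n = want < max_per_txn ? want : max_per_txn;

    if (n > fs->free_blocks)
        n = fs->free_blocks;
    fs->free_blocks -= n;
    return n;
}

/*
 * One transaction per iteration, as in the loop sketched above.
 * Returns 0 on success or -ENOSPC; *allocated reports how many blocks
 * were consumed either way, which is exactly the partial state the
 * caller is left with on failure.
 */
static int toy_fallocate_loop(struct toy_fs *fs, uint64_t len,
                              uint64_t max_per_txn, uint64_t *allocated)
{
    *allocated = 0;
    while (len) {
        /* journal_start() would go here */
        uint64_t got = toy_do_fallocate(fs, len, max_per_txn);
        /* journal_stop() would go here */

        if (got == 0)
            return -ENOSPC;
        len -= got;
        *allocated += got;
    }
    return 0;
}
```

Asking this model for more blocks than remain returns -ENOSPC with every free block already consumed, which is the "disk all filled up, then failure" outcome described above.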
From owner-xfs@oss.sgi.com Mon May 7 15:21:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 15:21:14 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47ML9fB005559 for ; Mon, 7 May 2007 15:21:10 -0700 Received: from localhost.adilger.int (unknown [64.166.152.82]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 99AB27BA306; Mon, 7 May 2007 16:21:08 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 6A5173F57; Mon, 7 May 2007 15:21:04 -0700 (PDT) Date: Mon, 7 May 2007 15:21:04 -0700 From: Andreas Dilger To: Andrew Morton Cc: "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070507222103.GJ8181@schatzie.adilger.int> Mail-Followup-To: Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070507135825.f8545a65.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11323 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 07, 2007 13:58 -0700, Andrew Morton wrote: > Final point: it's fairly disappointing that the present implementation is > ext4-only, and extent-only. I do think we should be aiming at an ext4 > bitmap-based implementation and an ext3 implementation. Actually, this is a non-issue. The reason that it is handled for extent-only is that this is the only way to allocate space in the filesystem without doing the explicit zeroing. For other filesystems (including ext3 and ext4 with block-mapped files) the filesystem should return an error (e.g. -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. 
From owner-xfs@oss.sgi.com Mon May 7 15:39:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 15:39:43 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47MdcfB007821 for ; Mon, 7 May 2007 15:39:39 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l47McuH6010334 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 7 May 2007 15:38:58 -0700 Received: from akpm.corp.google.com (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l47Mcut9007194; Mon, 7 May 2007 15:38:56 -0700 Date: Mon, 7 May 2007 15:38:56 -0700 From: Andrew Morton To: Andreas Dilger Cc: "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-Id: <20070507153856.d56a5133.akpm@linux-foundation.org> In-Reply-To: <20070507222103.GJ8181@schatzie.adilger.int> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.6; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11324 
X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Mon, 7 May 2007 15:21:04 -0700 Andreas Dilger wrote: > On May 07, 2007 13:58 -0700, Andrew Morton wrote: > > Final point: it's fairly disappointing that the present implementation is > > ext4-only, and extent-only. I do think we should be aiming at an ext4 > > bitmap-based implementation and an ext3 implementation. > > Actually, this is a non-issue. The reason that it is handled for extent-only > is that this is the only way to allocate space in the filesystem without > doing the explicit zeroing. For other filesystems (including ext3 and > ext4 with block-mapped files) the filesystem should return an error (e.g. > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. hrm, spose so. It can be a bit suboptimal from the layout POV. The reservations code will largely save us here, but kernel support might make it a bit better. Totally blowing pagecache could be a problem. Fixable in userspace by using sync_file_range()+fadvise() or O_DIRECT, but I bet it doesn't. 
From owner-xfs@oss.sgi.com Mon May 7 16:31:52 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 16:31:56 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47NVnfB016106 for ; Mon, 7 May 2007 16:31:52 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l47NVZ47012460 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 7 May 2007 16:31:37 -0700 Received: from akpm.corp.google.com (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l47NVZ6H008256; Mon, 7 May 2007 16:31:35 -0700 Date: Mon, 7 May 2007 16:31:35 -0700 From: Andrew Morton To: Theodore Tso Cc: Andreas Dilger , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-Id: <20070507163135.cf455103.akpm@linux-foundation.org> In-Reply-To: <20070507231442.GA29907@thunk.org> References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <20070507153856.d56a5133.akpm@linux-foundation.org> <20070507231442.GA29907@thunk.org> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.6; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11325 
X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Mon, 7 May 2007 19:14:42 -0400 Theodore Tso wrote: > On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote: > > > Actually, this is a non-issue. The reason that it is handled for extent-only > > > is that this is the only way to allocate space in the filesystem without > > > doing the explicit zeroing. For other filesystems (including ext3 and > > > ext4 with block-mapped files) the filesystem should return an error (e.g. > > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. > > > > It can be a bit suboptimal from the layout POV. The reservations code will > > largely save us here, but kernel support might make it a bit better. > > Actually, the reservations code won't matter, since glibc will fall > back to its current behavior, which is it will do the preallocation by > explicitly writing zeros to the file. No! Reservations code is *critical* here. Without reservations, we get disastrously bad layout if two processes are running a large fallocate() at the same time. (This is an SMP-only problem, btw: on UP the timeslice lengths save us). My point is that even though reservations save us, we could do even better in-kernel. But then, a smart application would bypass the glibc fallocate() implementation and would tune the reservation window size and would use direct-IO or sync_file_range()+fadvise(FADV_DONTNEED). > This will result in the same > layout as if we had done the persistent preallocation, but of course > it will mean the posix_fallocate() could potentially take a long time > if you're a PVR and you're reserving a gig or two for a two hour movie > at high quality. That seems suboptimal, granted, and ideally the > application should be warned about this before it calls > posix_fallocate(). 
On the other hand, it's what happens today, all > the time, so applications won't be too badly surprised. A PVR implementor would take all this over and would do it themselves, for sure. > If we think applications programmers badly need to know in advance if > posix_fallocate() will be fast or slow, probably the right thing is to > define a new fpathconf() configuration option so they can query to see > whether a particular file will support a fast posix_fallocate(). I'm > not 100% convinced such complexity is really needed, but I'm willing > to be convinced.... what do folks think? > An application could do sys_fallocate(one-byte) to work out whether it's supported in-kernel, I guess. From owner-xfs@oss.sgi.com Mon May 7 16:36:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 16:36:41 -0700 (PDT) Received: from thunker.thunk.org (THUNK.ORG [69.25.196.29]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47NabfB016849 for ; Mon, 7 May 2007 16:36:38 -0700 Received: from root (helo=candygram.thunk.org) by thunker.thunk.org with local-esmtps (tls_cipher TLS-1.0:RSA_AES_256_CBC_SHA:32) (Exim 4.50 #1 (Debian)) id 1HlCrZ-00083r-GU; Mon, 07 May 2007 19:43:30 -0400 Received: from tytso by candygram.thunk.org with local (Exim 4.63) (envelope-from ) id 1HlCkg-0006Ub-B6; Mon, 07 May 2007 19:36:22 -0400 Date: Mon, 7 May 2007 19:36:22 -0400 From: Theodore Tso To: Jeff Garzik Cc: Andrew Morton , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070507233622.GB29907@thunk.org> Mail-Followup-To: Theodore Tso , Jeff Garzik , Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <463FB008.3080706@garzik.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <463FB008.3080706@garzik.org> User-Agent: Mutt/1.5.13 (2006-08-11) X-SA-Exim-Connect-IP: X-SA-Exim-Mail-From: tytso@thunk.org X-SA-Exim-Scanned: No (on thunker.thunk.org); SAEximRunCond expanded to false X-archive-position: 11326 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tytso@mit.edu Precedence: bulk X-list: xfs On Mon, May 07, 2007 at 07:02:32PM -0400, Jeff Garzik wrote: > Andreas Dilger wrote: > >On May 07, 2007 13:58 -0700, Andrew Morton wrote: > >>Final point: it's fairly disappointing that the present implementation is > >>ext4-only, and extent-only. I do think we should be aiming at an ext4 > >>bitmap-based implementation and an ext3 implementation. > > > >Actually, this is a non-issue. The reason that it is handled for > >extent-only > >is that this is the only way to allocate space in the filesystem without > >doing the explicit zeroing. For other filesystems (including ext3 and > > Precisely /how/ do you avoid the zeroing issue, for extents? > > If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, > otherwise the implementation is broken. There is a bit in the extent structure which indicates that the extent has not been initialized. 
When reading from a block where the extent is marked as uninitialized, ext4 returns zeros, to avoid returning the uninitialized contents of the disk, which might contain someone else's love letters, p0rn, or other information which we shouldn't leak out. When writing to an extent which is uninitialized, we may potentially have to split the extent into three extents in the worst case. My understanding is that XFS uses a similar implementation; it's a pretty obvious and standard way to implement allocated-but-not-initialized extents. We thought about supporting persistent preallocation for inodes using indirect blocks, but it would require stealing a bit from each entry in the indirect block, reducing the maximum size of the filesystem by a factor of two (i.e., to 2**31 blocks). It was decided it wasn't worth the complexity, given the tradeoffs. - Ted From owner-xfs@oss.sgi.com Mon May 7 16:44:09 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 16:44:13 -0700 (PDT) Received: from thunker.thunk.org (THUNK.ORG [69.25.196.29]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47Ni8fB017858 for ; Mon, 7 May 2007 16:44:09 -0700 Received: from root (helo=candygram.thunk.org) by thunker.thunk.org with local-esmtps (tls_cipher TLS-1.0:RSA_AES_256_CBC_SHA:32) (Exim 4.50 #1 (Debian)) id 1HlCWZ-0007zC-Rg; Mon, 07 May 2007 19:21:48 -0400 Received: from tytso by candygram.thunk.org with local (Exim 4.63) (envelope-from ) id 1HlCPi-0003Ie-6y; Mon, 07 May 2007 19:14:42 -0400 Date: Mon, 7 May 2007 19:14:42 -0400 From: Theodore Tso To: Andrew Morton Cc: Andreas Dilger , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070507231442.GA29907@thunk.org> Mail-Followup-To: Theodore Tso , Andrew Morton , Andreas Dilger , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <20070507153856.d56a5133.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070507153856.d56a5133.akpm@linux-foundation.org> User-Agent: Mutt/1.5.13 (2006-08-11) X-SA-Exim-Connect-IP: X-SA-Exim-Mail-From: tytso@thunk.org X-SA-Exim-Scanned: No (on thunker.thunk.org); SAEximRunCond expanded to false X-archive-position: 11327 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tytso@mit.edu Precedence: bulk X-list: xfs On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote: > > Actually, this is a non-issue. The reason that it is handled for extent-only > > is that this is the only way to allocate space in the filesystem without > > doing the explicit zeroing. For other filesystems (including ext3 and > > ext4 with block-mapped files) the filesystem should return an error (e.g. > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. > > It can be a bit suboptimal from the layout POV. The reservations code will > largely save us here, but kernel support might make it a bit better. Actually, the reservations code won't matter, since glibc will fall back to its current behavior, which is it will do the preallocation by explicitly writing zeros to the file. 
This will result in the same layout as if we had done the persistent preallocation, but of course it will mean the posix_fallocate() could potentially take a long time if you're a PVR and you're reserving a gig or two for a two hour movie at high quality. That seems suboptimal, granted, and ideally the application should be warned about this before it calls posix_fallocate(). On the other hand, it's what happens today, all the time, so applications won't be too badly surprised. If we think application programmers badly need to know in advance if posix_fallocate() will be fast or slow, probably the right thing is to define a new fpathconf() configuration option so they can query to see whether a particular file will support a fast posix_fallocate(). I'm not 100% convinced such complexity is really needed, but I'm willing to be convinced.... what do folks think? - Ted From owner-xfs@oss.sgi.com Mon May 7 16:57:48 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 16:57:51 -0700 (PDT) Received: from mail.dvmed.net (srv5.dvmed.net [207.36.208.214]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47NvkfB019596 for ; Mon, 7 May 2007 16:57:47 -0700 Received: from cpe-065-190-194-075.nc.res.rr.com ([65.190.194.75] helo=[10.10.10.10]) by mail.dvmed.net with esmtpsa (Exim 4.63 #1 (Red Hat Linux)) id 1HlCDx-0000ze-FS; Mon, 07 May 2007 23:02:33 +0000 Message-ID: <463FB008.3080706@garzik.org> Date: Mon, 07 May 2007 19:02:32 -0400 From: Jeff Garzik User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> In-Reply-To: <20070507222103.GJ8181@schatzie.adilger.int> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11328 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeff@garzik.org Precedence: bulk X-list: xfs Andreas Dilger wrote: > On May 07, 2007 13:58 -0700, Andrew Morton wrote: >> Final point: it's fairly disappointing that the present implementation is >> ext4-only, and extent-only. I do think we should be aiming at an ext4 >> bitmap-based implementation and an ext3 implementation. > > Actually, this is a non-issue. The reason that it is handled for extent-only > is that this is the only way to allocate space in the filesystem without > doing the explicit zeroing. For other filesystems (including ext3 and Precisely /how/ do you avoid the zeroing issue, for extents? If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, otherwise the implementation is broken. 
Jeff From owner-xfs@oss.sgi.com Mon May 7 17:16:12 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 17:16:15 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l480GBfB022224 for ; Mon, 7 May 2007 17:16:12 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l480Fii3015034 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 7 May 2007 17:15:46 -0700 Received: from akpm.corp.google.com (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l480FfvA009569; Mon, 7 May 2007 17:15:41 -0700 Date: Mon, 7 May 2007 17:15:41 -0700 From: Andrew Morton To: cmm@us.ibm.com Cc: Andreas Dilger , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-Id: <20070507171541.5370a36a.akpm@linux-foundation.org> In-Reply-To: <1178582424.3933.39.camel@dyn9047017103.beaverton.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <1178582424.3933.39.camel@dyn9047017103.beaverton.ibm.com> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.6; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 
7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11329 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Mon, 07 May 2007 17:00:24 -0700 Mingming Cao wrote: > > + while (ret >= 0 && ret < max_blocks) { > > + block = block + ret; > > + max_blocks = max_blocks - ret; > > + ret = ext4_ext_get_blocks(handle, inode, block, > > + max_blocks, &map_bh, > > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > > + BUG_ON(!ret); > > + if (ret > 0 && test_bit(BH_New, &map_bh.b_state) > > + && ((block + ret) > (i_size_read(inode) << blkbits))) > > + nblocks = nblocks + ret; > > + } > > + > > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) > > + goto retry; > > + > > Now the interesting question is: what do we do if we get halfway through > > this loop and then run out of space? We could leave the disk all filled up > > and then return failure to the caller, but that's pretty poor behaviour, > > IMO. > > > The current code handles earlier ENOSPC by three times retries. After > that if we still run out of space, then it's propably right to notify > the caller there isn't much space left. > > We could extend the block reservation window size before the while loop > so we could get a lower chance to get more fragmented. yes, but my point is that the proposed behaviour is really quite bad. We will attempt to allocate the disk space and then we will return failure, having consumed all the disk space and having partially and uselessly populated an unknown amount of the file. Userspace could presumably repair the mess in most situations by truncating the file back again. The kernel cannot do that because there might be live data in amongst there. 
So we'd need to either keep track of which blocks were newly-allocated and then free them all again on the error path (doesn't work right across commit+crash+recovery) or we could later use the space-reservation scheme which delayed allocation will need to introduce. Or we could decide to live with the above IMO-crappy behaviour. From owner-xfs@oss.sgi.com Mon May 7 17:20:02 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 17:20:05 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l480K1fB022916 for ; Mon, 7 May 2007 17:20:02 -0700 Received: from e34.co.us.ibm.com (e34.co.us.ibm.com [32.97.110.152]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4800TK3031704 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Mon, 7 May 2007 20:00:30 -0400 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e34.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4800Rgk009304 for ; Mon, 7 May 2007 20:00:27 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4800RNf186946 for ; Mon, 7 May 2007 18:00:27 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4800QhU006734 for ; Mon, 7 May 2007 18:00:27 -0600 Received: from dyn9047017103.beaverton.ibm.com (dyn9047017103.beaverton.ibm.com [9.47.17.103]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4800OlA006675; Mon, 7 May 2007 18:00:25 -0600 Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 From: Mingming Cao Reply-To: cmm@us.ibm.com To: Andrew Morton Cc: Andreas Dilger , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com In-Reply-To: <20070507135825.f8545a65.akpm@linux-foundation.org> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> Content-Type: text/plain Organization: IBM LTC Date: Mon, 07 May 2007 17:00:24 -0700 Message-Id: <1178582424.3933.39.camel@dyn9047017103.beaverton.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-7) Content-Transfer-Encoding: 7bit X-archive-position: 11330 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cmm@us.ibm.com Precedence: bulk X-list: xfs On Mon, 2007-05-07 at 13:58 -0700, Andrew Morton wrote: > On Mon, 7 May 2007 05:37:54 -0600 > Andreas Dilger wrote: > > > > > + block = offset >> blkbits; > > > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > > > > + - block; > > > > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > > > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > > > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); > > > > > > Now I'm mystified. Given that we're allocating an arbitrary amount of disk > > > space, and that this disk space will require an arbitrary amount of > > > metadata, how can we work out how much journal space we'll be needing > > > without at least looking at `len'? > > > > Good question. 
> >
> > The uninitialized extent can cover up to 128MB with a single entry.
> > If @path isn't specified, then ext4_ext_calc_credits_for_insert()
> > function returns the maximum number of extents needed to insert a leaf,
> > including splitting all of the index blocks. That would allow up to 43GB
> > (340 extents/block * 128MB) to be preallocated, but it still needs to take
> > the size of the preallocation into account (adding 3 blocks per 43GB - a
> > leaf block, a bitmap block and a group descriptor).
>
> I think the use of ext4_journal_extend() (as Amit has proposed) will help
> here, but it is not sufficient.
>
> Because under some circumstances, a journal_extend() failure could mean
> that we fail to allocate all the required disk space. If it is infrequent
> enough, that is acceptable when the caller is using fallocate() for
> performance reasons.
>
> But it is very much not acceptable if the caller is using fallocate() for
> space-reservation reasons. If you used fallocate to reserve 1GB of disk
> and fallocate() "succeeded" and you later get ENOSPC then you'd have a
> right to get a bit upset.
>
> So I think the ext3/4 fallocate() implementation will need to be
> implemented as a loop:
>
> 	while (len) {
> 		journal_start();
> 		len -= do_fallocate(len, ...);
> 		journal_stop();
> 	}
>
>

I agree. There is already a loop in Amit's current patch that calls ext4_ext_get_blocks(), though. The question is how many credits ext4 should ask for in each journal_start().

> +/*
> + * ext4_fallocate:
> + * preallocate space for a file
> + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> + */
> +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> +{
....
> +	mutex_lock(&EXT4_I(inode)->truncate_mutex);
> +	credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> +	mutex_unlock(&EXT4_I(inode)->truncate_mutex);

I think the calculation is based on the assumption that there is only a single extent to be inserted, which is the ideal case. But in some cases we may end up allocating several chunks of blocks (extents) for this single preallocation request, when the fs is fragmented (or when part of the preallocation request is already fulfilled).

I think we should move this calculation inside the loop as well, and we really do not need to grab the lock to calculate the credits: if @path is always NULL, all the function does is arithmetic.

I can't think of any good way to estimate the total credits needed for the whole preallocation request. I looked at ext4_get_block(), which the DIO code uses to deal with large block allocations. The credit reservation is quite weak there too: the DIO_CREDIT is only (EXT4_RESERVE_TRANS_BLOCKS + 32).

> +	handle=ext4_journal_start(inode, credits +
> +			EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1);
> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +retry:
> +	ret = 0;
> +	while (ret >= 0 && ret < max_blocks) {
> +		block = block + ret;
> +		max_blocks = max_blocks - ret;
> +		ret = ext4_ext_get_blocks(handle, inode, block,
> +					max_blocks, &map_bh,
> +					EXT4_CREATE_UNINITIALIZED_EXT, 0);
> +		BUG_ON(!ret);
> +		if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> +			&& ((block + ret) > (i_size_read(inode) << blkbits)))
> +			nblocks = nblocks + ret;
> +	}
> +
> +	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> +		goto retry;
> +
> Now the interesting question is: what do we do if we get halfway through
> this loop and then run out of space? We could leave the disk all filled up
> and then return failure to the caller, but that's pretty poor behaviour,
> IMO.
>

The current code handles an earlier ENOSPC by retrying three times.
After that, if we still run out of space, then it's probably right to notify the caller that there isn't much space left.

We could extend the block reservation window size before the while loop to lower the chance of the allocation getting fragmented.

>
> Does the proposed implementation handle quotas correctly, btw? Has that
> been tested?
>

I think so. ext4_ext_get_blocks() will end up calling ext4_new_blocks() to do the real block allocation; quota is handled there, and is therefore already tested.

Mingming
From owner-xfs@oss.sgi.com Mon May 7 17:30:39 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 17:30:43 -0700 (PDT)
Received: from e31.co.us.ibm.com (e31.co.us.ibm.com [32.97.110.149]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l480UbfB024196 for ; Mon, 7 May 2007 17:30:38 -0700
Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e31.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l480UWm5018735 for ; Mon, 7 May 2007 20:30:32 -0400
Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l480UWTX162186 for ; Mon, 7 May 2007 18:30:32 -0600
Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l480UVng005071 for ; Mon, 7 May 2007 18:30:32 -0600
Received: from dyn9047017103.beaverton.ibm.com (dyn9047017103.beaverton.ibm.com [9.47.17.103]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l480UU96005046; Mon, 7 May 2007 18:30:30 -0600
Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4
From: Mingming Cao
Reply-To: cmm@us.ibm.com
To: Andrew Morton
Cc: Theodore Tso , Andreas Dilger , "Amit K.
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com In-Reply-To: <20070507163135.cf455103.akpm@linux-foundation.org> References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <20070507153856.d56a5133.akpm@linux-foundation.org> <20070507231442.GA29907@thunk.org> <20070507163135.cf455103.akpm@linux-foundation.org> Content-Type: text/plain Organization: IBM LTC Date: Mon, 07 May 2007 17:30:29 -0700 Message-Id: <1178584229.3933.60.camel@dyn9047017103.beaverton.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-7) Content-Transfer-Encoding: 7bit X-archive-position: 11331 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cmm@us.ibm.com Precedence: bulk X-list: xfs On Mon, 2007-05-07 at 16:31 -0700, Andrew Morton wrote: > On Mon, 7 May 2007 19:14:42 -0400 > Theodore Tso wrote: > > > On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote: > > > > Actually, this is a non-issue. The reason that it is handled for extent-only > > > > is that this is the only way to allocate space in the filesystem without > > > > doing the explicit zeroing. For other filesystems (including ext3 and > > > > ext4 with block-mapped files) the filesystem should return an error (e.g. > > > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. > > > > > > It can be a bit suboptimal from the layout POV. The reservations code will > > > largely save us here, but kernel support might make it a bit better. 
> >
> > Actually, the reservations code won't matter, since glibc will fall
> > back to its current behavior, which is it will do the preallocation by
> > explicitly writing zeros to the file.
>
> No! Reservations code is *critical* here. Without reservations, we get
> disastrously-bad layout if two processes were running a large fallocate()
> at the same time. (This is an SMP-only problem, btw: on UP the timeslice
> lengths save us).
>
> My point is that even though reservations save us, we could do even-better
> in-kernel.
>

In this case, since the number of blocks to preallocate (e.g. N=10GB) is known, we could improve the current reservation code to allow callers to explicitly ask for a new window that has at least N free blocks for the blocks to be preallocated (rather than just at least 1 free block). Before ext4_fallocate() is called, the right reservation window size is set, with a flag indicating "please spend time if needed to find a window that covers at least N free blocks". So for ext4 block-mapped files, later when glibc is doing the allocation and zeroing, the ext4 block-mapped allocator will know to reserve the right number of free blocks before allocating and zeroing the 10GB of space.

I am not sure whether this is worth the effort, though.

> But then, a smart application would bypass the glibc() fallocate()
> implementation and would tune the reservation window size and would use
> direct-IO or sync_file_range()+fadvise(FADV_DONTNEED).
>
> > This will result in the same
> > layout as if we had done the persistent preallocation, but of course
> > it will mean the posix_fallocate() could potentially take a long time
> > if you're a PVR and you're reserving a gig or two for a two hour movie
> > at high quality. That seems suboptimal, granted, and ideally the
> > application should be warned about this before it calls
> > posix_fallocate(). On the other hand, it's what happens today, all
> > the time, so applications won't be too badly surprised.
> > A PVR implementor would take all this over and would do it themselves, for > sure. > > > If we think applications programmers badly need to know in advance if > > posix_fallocate() will be fast or slow, probably the right thing is to > > define a new fpathconf() configuration option so they can query to see > > whether a particular file will support a fast posix_fallocate(). I'm > > not 100% convinced such complexity is really needed, but I'm willing > > to be convinced.... what do folks think? > > > > An application could do sys_fallocate(one-byte) to work out whether it's > supported in-kernel, I guess. > From owner-xfs@oss.sgi.com Mon May 7 17:41:56 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 17:42:01 -0700 (PDT) Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l480ftfB025394 for ; Mon, 7 May 2007 17:41:56 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e32.co.us.ibm.com (8.12.11.20060308/8.13.8) with ESMTP id l480cMBk002369 for ; Mon, 7 May 2007 20:38:22 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l480fgcB102890 for ; Mon, 7 May 2007 18:41:42 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l480ffZE025439 for ; Mon, 7 May 2007 18:41:42 -0600 Received: from dyn9047017103.beaverton.ibm.com (dyn9047017103.beaverton.ibm.com [9.47.17.103]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l480fePn025409; Mon, 7 May 2007 18:41:40 -0600 Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 From: Mingming Cao Reply-To: cmm@us.ibm.com To: Andrew Morton Cc: Andreas Dilger , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com In-Reply-To: <20070507171541.5370a36a.akpm@linux-foundation.org> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070507171541.5370a36a.akpm@linux-foundation.org> Content-Type: text/plain Organization: IBM LTC Date: Mon, 07 May 2007 17:41:39 -0700 Message-Id: <1178584899.3933.73.camel@dyn9047017103.beaverton.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-7) Content-Transfer-Encoding: 7bit X-archive-position: 11332 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cmm@us.ibm.com Precedence: bulk X-list: xfs On Mon, 2007-05-07 at 17:15 -0700, Andrew Morton wrote: > On Mon, 07 May 2007 17:00:24 -0700 > Mingming Cao wrote: > > > > + while (ret >= 0 && ret < max_blocks) { > > > + block = block + ret; > > > + max_blocks = max_blocks - ret; > > > + ret = ext4_ext_get_blocks(handle, inode, block, > > > + max_blocks, &map_bh, > > > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > > > + BUG_ON(!ret); > > > + if (ret > 0 && test_bit(BH_New, &map_bh.b_state) > > > + && ((block + ret) > (i_size_read(inode) << blkbits))) > > > + nblocks = nblocks + ret; > > > + } > > > + > > > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) > > > + goto retry; > > > + > > > Now the interesting question is: what do we do if we get halfway through > > > this loop and then run out of space? We could leave the disk all filled up > > > and then return failure to the caller, but that's pretty poor behaviour, > > > IMO. > > > > > The current code handles earlier ENOSPC by three times retries. After > > that if we still run out of space, then it's propably right to notify > > the caller there isn't much space left. 
> >
> > We could extend the block reservation window size before the while loop
> > so we could get a lower chance to get more fragmented.
>
> yes, but my point is that the proposed behaviour is really quite bad.
>

I agree with your point; that's why I mentioned that it only helps the fragmentation issue, not the ENOSPC case.

> We will attempt to allocate the disk space and then we will return failure,
> having consumed all the disk space and having partially and uselessly
> populated an unknown amount of the file.
>

Not totally useless, I think. If only half of the space is preallocated because we ran out of space, the application can decide whether that is good enough to start using the preallocated space, or whether to wait for the fs to have more free space.

> Userspace could presumably repair the mess in most situations by truncating
> the file back again. The kernel cannot do that because there might be live
> data in amongst there.
>
> So we'd need to either keep track of which blocks were newly-allocated and
> then free them all again on the error path (doesn't work right across
> commit+crash+recovery) or we could later use the space-reservation scheme which
> delayed allocation will need to introduce.
>
> Or we could decide to live with the above IMO-crappy behaviour.

In fact Amit and I had raised this issue before: whether it's okay to allow partial preallocation. At the time the feedback was that it's not much different from the current zero-out preallocation behavior: people might preallocate half-way and then later deal with ENOSPC.

We could check the total fs free block count before preallocation happens; if there isn't enough space left, there is no need to bother preallocating. If there is enough free space, we could make a reservation window that has at least N free blocks and mark it not stealable by other files. So later we will not run into the ENOSPC error. The fs free block count is just an estimate, though.
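
The free-block pre-check discussed in this message can be illustrated with a toy userspace model in C. This is only a sketch under stated assumptions: struct toy_fs, toy_alloc_loop() and toy_alloc_checked() are hypothetical names, the filesystem is reduced to a single free-block counter, and nothing here is ext4 code. It contrasts the partial result of a plain allocation loop with an up-front check that leaves the free space untouched when the request obviously cannot fit.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical toy filesystem: only a free-block counter, no real allocator. */
struct toy_fs {
	unsigned long free_blocks;
};

/*
 * Allocate up to 'want' blocks in chunks of at most 'chunk' blocks,
 * loosely mimicking the while loop in the quoted patch. Returns the number
 * of blocks actually allocated; *err is set to 0 or -ENOSPC.
 */
unsigned long toy_alloc_loop(struct toy_fs *fs, unsigned long want,
			     unsigned long chunk, int *err)
{
	unsigned long done = 0;

	assert(chunk > 0);
	*err = 0;
	while (done < want) {
		unsigned long n = want - done;

		if (n > chunk)
			n = chunk;
		if (fs->free_blocks == 0) {
			*err = -ENOSPC;	/* ran out halfway: partial result */
			break;
		}
		if (n > fs->free_blocks)
			n = fs->free_blocks;
		fs->free_blocks -= n;
		done += n;
	}
	return done;
}

/*
 * Same request, but refuse up front when the (estimated) free count is
 * too small, so nothing is consumed in the obvious failure case.
 */
unsigned long toy_alloc_checked(struct toy_fs *fs, unsigned long want,
				unsigned long chunk, int *err)
{
	if (want > fs->free_blocks) {
		*err = -ENOSPC;
		return 0;
	}
	return toy_alloc_loop(fs, want, chunk, err);
}
```

With 100 free blocks and a 250-block request, the plain loop consumes all 100 blocks before reporting -ENOSPC, while the checked variant fails without consuming anything. Since the free count is only an estimate, a concurrent allocator could still make the checked variant stop halfway.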
Mingming From owner-xfs@oss.sgi.com Mon May 7 17:59:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 17:59:41 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l480xYfB027310 for ; Mon, 7 May 2007 17:59:36 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA18644; Tue, 8 May 2007 10:59:27 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l480xOAf87607519; Tue, 8 May 2007 10:59:25 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l480xNoZ87644275; Tue, 8 May 2007 10:59:23 +1000 (AEST) Date: Tue, 8 May 2007 10:59:23 +1000 From: David Chinner To: =?iso-8859-1?Q?=C5=81ukasz?= Fibinger Cc: Eric Sandeen , xfs@oss.sgi.com Subject: Re: RESVSP problems Message-ID: <20070508005923.GS77450368@melbourne.sgi.com> References: <200705072004.22848.lucke@o2.pl> <463F7368.8090101@sandeen.net> <200705072058.32679.lucke@o2.pl> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <200705072058.32679.lucke@o2.pl> User-Agent: Mutt/1.4.2.1i X-archive-position: 11333 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, May 07, 2007 at 08:58:32PM +0200, ?ukasz Fibinger wrote: > On Monday 07 of May 2007, you wrote: > > You've probably hit: > > http://oss.sgi.com/bugzilla/show_bug.cgi?id=418 > > unwritten extents remain unwritten after mmap() modifies them > > > > Bug dchinner about it... 
;-) > > Dave, consider it a bugging from my humble self :-) Yeah, yeah ;) I'm waiting to see what happens with Nick's patches in .22 before going any further. If they are not merged into .22, then I think we should push the XFS specific fix in.... > > yeah... ISTR that the arguments are funky. I can't remember if it's a > > bug or not. :) FWIW, allocsp just writes zeros to the file, so you > > could do it just as well from userspace w/ no fancy ioctls... ALLOCSP > > is a bit pointless if you ask me... though maybe someone knows why it's > > there :) > > Let me say that I have noticed that using ALLOCSP seems to create less extents > than posix_fallocate/manual zeroing. Yes, that's likely ;) There's work currently active to make posix_fallocate() do the same thing as ALLOCSP (i.e. call into the filesystem and let it do smart stuff), but that's a ways off yet... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 7 18:07:42 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 18:07:45 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4817efB028619 for ; Mon, 7 May 2007 18:07:42 -0700 Received: from localhost.adilger.int (unknown [64.166.152.82]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 7F4124E457A; Mon, 7 May 2007 19:07:38 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 1941F3F57; Mon, 7 May 2007 18:07:36 -0700 (PDT) Date: Mon, 7 May 2007 18:07:36 -0700 From: Andreas Dilger To: Jeff Garzik Cc: Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070508010736.GO8181@schatzie.adilger.int> Mail-Followup-To: Jeff Garzik , Andrew Morton , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <463FB008.3080706@garzik.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <463FB008.3080706@garzik.org> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11334 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 07, 2007 19:02 -0400, Jeff Garzik wrote: > Andreas Dilger wrote: > >Actually, this is a non-issue. The reason that it is handled for > >extent-only is that this is the only way to allocate space in the > >filesystem without doing the explicit zeroing. > > Precisely /how/ do you avoid the zeroing issue, for extents? > > If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, > otherwise the implementation is broken. In ext4 (as in XFS) there is a flag stored in the extent that tells if the extent is initialized or not. 
Reads from uninitialized extents will return zero-filled data, and writes that don't span the whole extent will cause the uninitialized extent to be split into a regular extent and one or two uninitialized extents (depending where the write is). My comment was just that the extent doesn't have to be explicitly zero filled on the disk, by virtue of the fact that the uninitialized flag will cause reads to return zero. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Mon May 7 18:26:06 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 18:26:10 -0700 (PDT) Received: from mail.dvmed.net (srv5.dvmed.net [207.36.208.214]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l481Q5fB030567 for ; Mon, 7 May 2007 18:26:06 -0700 Received: from cpe-065-190-194-075.nc.res.rr.com ([65.190.194.75] helo=[10.10.10.10]) by mail.dvmed.net with esmtpsa (Exim 4.63 #1 (Red Hat Linux)) id 1HlESi-0001kC-7i; Tue, 08 May 2007 01:25:56 +0000 Message-ID: <463FD1A2.1020505@garzik.org> Date: Mon, 07 May 2007 21:25:54 -0400 From: Jeff Garzik User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Jeff Garzik , Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <463FB008.3080706@garzik.org> <20070508010736.GO8181@schatzie.adilger.int> In-Reply-To: <20070508010736.GO8181@schatzie.adilger.int> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11335 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeff@garzik.org Precedence: bulk X-list: xfs Andreas Dilger wrote: > My comment was just that the extent doesn't have to be explicitly zero > filled on the disk, by virtue of the fact that the uninitialized flag > will cause reads to return zero. Agreed, thanks for the clarification. 
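
The behaviour agreed on here (reads from an uninitialized extent return zeros, and a partial write splits the extent) can be sketched as a small userspace model in C. This is a hedged illustration, not the ext4 or XFS implementation: struct toy_extent, toy_read_block() and toy_split_on_write() are hypothetical names, and the on-disk format, journalling and locking are all ignored.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-memory extent: a block range plus an "uninitialized" flag. */
struct toy_extent {
	unsigned long start;	/* first logical block */
	unsigned long len;	/* number of blocks */
	int uninit;		/* 1: preallocated but never written */
};

/*
 * Read one block. For an uninitialized extent the on-disk contents are
 * ignored and zeros are returned, so stale data is never exposed.
 */
void toy_read_block(const struct toy_extent *ex, const unsigned char *disk,
		    unsigned char *buf, size_t blksz)
{
	if (ex->uninit)
		memset(buf, 0, blksz);
	else
		memcpy(buf, disk, blksz);
}

/*
 * Mark blocks [wstart, wstart + wlen) of an uninitialized extent as written.
 * The extent is split into one initialized extent plus zero, one or two
 * uninitialized remainders. The pieces are stored in out[] (room for 3)
 * and the number of pieces is returned.
 */
int toy_split_on_write(const struct toy_extent *ex, unsigned long wstart,
		       unsigned long wlen, struct toy_extent out[3])
{
	unsigned long end = ex->start + ex->len;
	unsigned long wend = wstart + wlen;
	int n = 0;

	assert(ex->uninit && wlen > 0);
	assert(wstart >= ex->start && wend <= end);

	if (wstart > ex->start)		/* untouched head stays uninitialized */
		out[n++] = (struct toy_extent){ ex->start, wstart - ex->start, 1 };
	out[n++] = (struct toy_extent){ wstart, wlen, 0 };
	if (wend < end)			/* untouched tail stays uninitialized */
		out[n++] = (struct toy_extent){ wend, end - wend, 1 };
	return n;
}
```

A write to the middle of the extent yields three pieces, a write at either edge yields two, and a write covering the whole extent simply yields one initialized extent.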
Jeff From owner-xfs@oss.sgi.com Mon May 7 18:43:54 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 18:43:59 -0700 (PDT) Received: from thunker.thunk.org (THUNK.ORG [69.25.196.29]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l481hrfB032410 for ; Mon, 7 May 2007 18:43:54 -0700 Received: from root (helo=candygram.thunk.org) by thunker.thunk.org with local-esmtps (tls_cipher TLS-1.0:RSA_AES_256_CBC_SHA:32) (Exim 4.50 #1 (Debian)) id 1HlEqi-0008QR-T3; Mon, 07 May 2007 21:50:45 -0400 Received: from tytso by candygram.thunk.org with local (Exim 4.63) (envelope-from ) id 1HlEjp-00009j-MT; Mon, 07 May 2007 21:43:37 -0400 Date: Mon, 7 May 2007 21:43:37 -0400 From: Theodore Tso To: Mingming Cao Cc: Andrew Morton , Andreas Dilger , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070508014337.GA14072@thunk.org> Mail-Followup-To: Theodore Tso , Mingming Cao , Andrew Morton , Andreas Dilger , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com
References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070507171541.5370a36a.akpm@linux-foundation.org> <1178584899.3933.73.camel@dyn9047017103.beaverton.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1178584899.3933.73.camel@dyn9047017103.beaverton.ibm.com>
User-Agent: Mutt/1.5.13 (2006-08-11)
X-SA-Exim-Connect-IP:
X-SA-Exim-Mail-From: tytso@thunk.org
X-SA-Exim-Scanned: No (on thunker.thunk.org); SAEximRunCond expanded to false
X-archive-position: 11336
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tytso@mit.edu
Precedence: bulk
X-list: xfs

On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
> We could check the total number of fs free blocks account before
> preallocation happens, if there isn't enough space left, there is no
> need to bother preallocating.

Checking against the fs free blocks is a good idea, since it will prevent the obvious error case where someone tries to preallocate 10GB when there is only 2GB left. But it won't help if there are multiple processes trying to allocate blocks at the same time. On the other hand, that case is probably relatively rare, and in that case, the filesystem was probably going to be left completely full in any case.

On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote:
> Userspace could presumably repair the mess in most situations by truncating
> the file back again. The kernel cannot do that because there might be live
> data in amongst there.

Actually, the kernel could do it, in that it could simply release all uninitialized extents back to the system.
The problem is distinguishing between the uninitialized extents that had just been newly added and the ones that had been there from before. (On the other hand, if the filesystem was completely full, releasing uninitialized blocks wouldn't be the worst thing in the world to do, although releasing previously fallocated blocks probably does violate the principle of least surprise, even if it's what the user would have wanted.)

On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
> If there is enough free space, we could make a reservation window that
> have at least N free blocks and mark it not stealable by other files. So
> later we will not run into the ENOSPC error.

Could you really use a single reservation window? When the filesystem is almost full, the free extents are likely going to be scattered all over the disk. The general principle of grabbing all of the extents and keeping them in an in-memory data structure, and only adding them to the extent tree would work, though; I'm just not sure we could do it using the existing reservation window code, since it only supports a single reservation window per file, yes?
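
The "grab the extents into an in-memory structure first" principle can be sketched as a toy model in C. Everything here is an assumption made for illustration: struct toy_free_extent, the TOY_* states and toy_claim_blocks() are hypothetical, whole extents are claimed without trimming, and this is not how the ext3/ext4 reservation window code works. The point is only that tentative claims spread over several scattered free extents can be committed together or dropped together.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_FREE = 0, TOY_TENTATIVE = 1, TOY_CLAIMED = 2 };

/* Hypothetical free extent: a run of free blocks somewhere on disk. */
struct toy_free_extent {
	unsigned long start;
	unsigned long len;
	int state;	/* TOY_FREE, TOY_TENTATIVE or TOY_CLAIMED */
};

/*
 * Try to cover 'want' blocks using a scattered set of free extents.
 * Extents are first marked tentative in memory only; nothing would be
 * added to the file's extent tree at this stage. If the request can be
 * covered, the tentative claims are committed; otherwise they are all
 * dropped, so extents claimed earlier by someone else are never touched.
 * Returns 0 on success, -1 if the free extents cannot cover the request.
 */
int toy_claim_blocks(struct toy_free_extent *fe, size_t nr, unsigned long want)
{
	unsigned long got = 0;
	size_t i;
	int ok;

	for (i = 0; i < nr && got < want; i++) {
		if (fe[i].state != TOY_FREE)
			continue;
		fe[i].state = TOY_TENTATIVE;
		got += fe[i].len;
	}
	ok = got >= want;

	/* Commit the tentative claims on success, drop them on failure. */
	for (i = 0; i < nr; i++)
		if (fe[i].state == TOY_TENTATIVE)
			fe[i].state = ok ? TOY_CLAIMED : TOY_FREE;

	return ok ? 0 : -1;
}
```

Keeping a separate tentative state is what lets the failure path tell the newly grabbed extents apart from ones that were already claimed before the call.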
- Ted From owner-xfs@oss.sgi.com Mon May 7 22:03:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 22:03:13 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48539fB010772 for ; Mon, 7 May 2007 22:03:10 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id 1C86018077E7E; Tue, 8 May 2007 00:03:08 -0500 (CDT) Message-ID: <4640048B.6070803@sandeen.net> Date: Tue, 08 May 2007 00:03:07 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: David Chinner CC: =?UTF-8?B?xYF1a2FzeiBGaWJpbmdlcg==?= , xfs@oss.sgi.com Subject: Re: RESVSP problems References: <200705072004.22848.lucke@o2.pl> <463F7368.8090101@sandeen.net> <200705072058.32679.lucke@o2.pl> <20070508005923.GS77450368@melbourne.sgi.com> In-Reply-To: <20070508005923.GS77450368@melbourne.sgi.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11337 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs David Chinner wrote: >>> yeah... ISTR that the arguments are funky. I can't remember if it's a >>> bug or not. :) FWIW, allocsp just writes zeros to the file, so you >>> could do it just as well from userspace w/ no fancy ioctls... ALLOCSP >>> is a bit pointless if you ask me... though maybe someone knows why it's >>> there :) >> Let me say that I have noticed that using ALLOCSP seems to create less extents >> than posix_fallocate/manual zeroing. > > Yes, that's likely ;) > > There's work currently active to make posix_fallocate() do the same thing > as ALLOCSP (i.e. call into the filesystem and let it do smart stuff), but > that's a ways off yet... 
Dave, doesn't ALLOCSP actually create actual zeroed space though? Pretty much as posix_fallocate from userspace does today, maybe with better allocation... And "smart stuff" would be *not* needing to write zeros.... i.e. what RESVSP does. -Eric From owner-xfs@oss.sgi.com Mon May 7 22:25:30 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 22:25:32 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l485PQfB014571 for ; Mon, 7 May 2007 22:25:28 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA24103; Tue, 8 May 2007 15:25:24 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l485PNAf83137079; Tue, 8 May 2007 15:25:23 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l485PLnZ87479850; Tue, 8 May 2007 15:25:21 +1000 (AEST) Date: Tue, 8 May 2007 15:25:21 +1000 From: David Chinner To: Eric Sandeen Cc: David Chinner , =?iso-8859-1?Q?=C5=81ukasz?= Fibinger , xfs@oss.sgi.com Subject: Re: RESVSP problems Message-ID: <20070508052521.GH32602149@melbourne.sgi.com> References: <200705072004.22848.lucke@o2.pl> <463F7368.8090101@sandeen.net> <200705072058.32679.lucke@o2.pl> <20070508005923.GS77450368@melbourne.sgi.com> <4640048B.6070803@sandeen.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4640048B.6070803@sandeen.net> User-Agent: Mutt/1.4.2.1i X-archive-position: 11338 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, May 08, 2007 at 12:03:07AM -0500, Eric Sandeen wrote: > David Chinner wrote: > > >>>yeah... ISTR that the arguments are funky. 
I can't remember if it's a > >>>bug or not. :) FWIW, allocsp just writes zeros to the file, so you > >>>could do it just as well from userspace w/ no fancy ioctls... ALLOCSP > >>>is a bit pointless if you ask me... though maybe someone knows why it's > >>>there :) > >>Let me say that I have noticed that using ALLOCSP seems to create less > >>extents than posix_fallocate/manual zeroing. > > > >Yes, that's likely ;) > > > >There's work currently active to make posix_fallocate() do the same thing > >as ALLOCSP (i.e. call into the filesystem and let it do smart stuff), but > >that's a ways off yet... > > Dave, doesn't ALLOCSP actually create actual zeroed space though? Ah, yes it does - I was sort of lumping allocsp/resvsp together as one there. > Pretty much as posix_fallocate from userspace does today, maybe with > better allocation... Better allocations and with no ENOSPC-after-partial-zeroing problems, either. > And "smart stuff" would be *not* needing to write > zeros.... i.e. what RESVSP does. Yup. I've implemented fallocate() with the equivalent of RESVSP. xfs_zero_eof() is smart enough to not try to zero unwritten extents so changing the filesize after preallocation is effectively a no-op ;) Cheers, Dave. 
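As a rough userspace illustration of the interface being discussed: posix_fallocate() guarantees that later writes to the range will not fail for lack of space, but whether the space is zero-filled or left as unwritten extents (as with RESVSP) is up to the C library and the filesystem. A minimal sketch, with a hypothetical helper name prealloc_file():

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L	/* for posix_fallocate() */
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create (or open) 'path' and preallocate 'len'
 * bytes starting at offset 0. Returns the resulting file size, or -1
 * on failure.
 */
off_t prealloc_file(const char *path, off_t len)
{
	struct stat st;
	int fd = open(path, O_RDWR | O_CREAT, 0644);

	if (fd < 0)
		return -1;

	/* posix_fallocate() returns an error number directly, not via errno. */
	if (posix_fallocate(fd, 0, len) != 0 || fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	close(fd);
	return st.st_size;
}
```

Note that the file size grows to cover the preallocated range, which is why the cost of zeroing (or not zeroing) up to the new EOF matters here.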
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 7 23:51:35 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 23:51:37 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l486pWfB030333 for ; Mon, 7 May 2007 23:51:34 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA26143; Tue, 8 May 2007 16:51:28 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l486pRAf87623879; Tue, 8 May 2007 16:51:27 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l486pQsF87768021; Tue, 8 May 2007 16:51:26 +1000 (AEST) Date: Tue, 8 May 2007 16:51:26 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: Review: unwritten extent conversion vs synchronous direct I/O Message-ID: <20070508065126.GK32602149@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11339 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Back in 2.6.13, unwritten extent conversion was changed to be done via a workqueue because we can't do conversion in interrupt context (AIO issue). The problem was that the changes extent conversion to run asynchronously w.r.t I/o completion. Under heavy load (e.g. 100 fsstress processes), a direct write into an unwritten extent can complete and return to userspace before the unwritten extent is converted. If that range of the file is then read immediately, it will return zeros - unwritten - instead of the data that was written and is present on disk. 
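To make the expected behaviour concrete, here is a minimal userspace sketch of the invariant that is being violated: once a synchronous write has returned, an immediate read of the same range must return the written data. This is not the QA test from this thread and will not reproduce the race by itself (that needs heavy concurrent load on XFS). The helper name write_then_read_back() and the 4096-byte block size are assumptions, and the code falls back to buffered I/O where O_DIRECT is unavailable.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* exposes O_DIRECT on glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0		/* headers do not expose it: buffered I/O only */
#endif

#define BLK 4096		/* assumed to satisfy direct I/O alignment rules */

/*
 * Hypothetical check: preallocate one block, write 'Z' into it, then
 * read it straight back. Returns 0 if the data matches, 1 if stale
 * data (e.g. zeros) came back, -1 on setup or I/O error.
 */
int write_then_read_back(const char *path)
{
	void *wbuf = NULL, *rbuf = NULL;
	int fd, ret = -1;

	fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0 && errno == EINVAL)	/* filesystem rejects O_DIRECT */
		fd = open(path, O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		return -1;

	if (posix_memalign(&wbuf, BLK, BLK) != 0)
		goto out_close;
	if (posix_memalign(&rbuf, BLK, BLK) != 0)
		goto out_free;
	memset(wbuf, 'Z', BLK);
	memset(rbuf, 0, BLK);

	if (posix_fallocate(fd, 0, BLK) == 0 &&
	    pwrite(fd, wbuf, BLK, 0) == BLK &&
	    pread(fd, rbuf, BLK, 0) == BLK)
		ret = memcmp(wbuf, rbuf, BLK) != 0;

	free(rbuf);
out_free:
	free(wbuf);
out_close:
	close(fd);
	return ret;
}
```

On a correct filesystem this should always return 0; a return of 1 would correspond to the "reads back zeros" symptom described above.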
A simple test case to show this is to run 100 fsstress processes, then loop doing: prealloc, direct write, bmap. At some point during this time, the bmap will return an unwritten extent spanning a range that has already been written. The following patch fixes the synchronous direct I/O case by triggering a workqueue flush on detection of a sync direct I/O into an unwritten extent after queuing the conversion work. The other approach that could be taken is to simply do the conversion without passing it off to a work queue. Anyone have a preference on which would be the better method to choose? The patch below passes the QA test I wrote to exercise this bug. Comments? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/linux-2.6/xfs_aops.c | 28 ++++++++++++++++++++-------- 1 file changed, 20 insertions(+), 8 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-04-26 09:25:26.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c 2007-05-08 14:28:20.854616591 +1000 @@ -108,14 +108,19 @@ xfs_page_trace( /* * Schedule IO completion handling on a xfsdatad if this was - * the final hold on this ioend. + * the final hold on this ioend. If we are asked to wait, + * flush the workqueue.
*/ STATIC void xfs_finish_ioend( - xfs_ioend_t *ioend) + xfs_ioend_t *ioend, + int wait) { - if (atomic_dec_and_test(&ioend->io_remaining)) + if (atomic_dec_and_test(&ioend->io_remaining)) { queue_work(xfsdatad_workqueue, &ioend->io_work); + if (wait) + flush_workqueue(xfsdatad_workqueue); + } } /* @@ -334,7 +339,7 @@ xfs_end_bio( bio->bi_end_io = NULL; bio_put(bio); - xfs_finish_ioend(ioend); + xfs_finish_ioend(ioend, 0); return 0; } @@ -470,7 +475,7 @@ xfs_submit_ioend( } if (bio) xfs_submit_ioend_bio(ioend, bio); - xfs_finish_ioend(ioend); + xfs_finish_ioend(ioend, 0); } while ((ioend = next) != NULL); } @@ -1408,6 +1413,13 @@ xfs_end_io_direct( * This is not necessary for synchronous direct I/O, but we do * it anyway to keep the code uniform and simpler. * + * Well, if only it were that simple. Because synchronous direct I/O + * requires extent conversion to occur *before* we return to userspace, + * we have to wait for extent conversion to complete. Look at the + * iocb that has been passed to use to determine if this is AIO or + * not. If it is synchronous, tell xfs_finish_ioend() to kick the + * workqueue and wait for it to complete. + * * The core direct I/O code might be changed to always call the * completion handler in the future, in which case all this can * go away. @@ -1415,9 +1427,9 @@ xfs_end_io_direct( ioend->io_offset = offset; ioend->io_size = size; if (ioend->io_type == IOMAP_READ) { - xfs_finish_ioend(ioend); + xfs_finish_ioend(ioend, 0); } else if (private && size > 0) { - xfs_finish_ioend(ioend); + xfs_finish_ioend(ioend, is_sync_kiocb(iocb) ? 1 : 0); } else { /* * A direct I/O write ioend starts it's life in unwritten @@ -1426,7 +1438,7 @@ xfs_end_io_direct( * handler. 
*/ INIT_WORK(&ioend->io_work, xfs_end_bio_written); - xfs_finish_ioend(ioend); + xfs_finish_ioend(ioend, 0); } /* From owner-xfs@oss.sgi.com Mon May 7 23:53:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 23:53:40 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l486rYfB030943 for ; Mon, 7 May 2007 23:53:35 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA26229; Tue, 8 May 2007 16:53:29 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l486rSAf87718760; Tue, 8 May 2007 16:53:28 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l486rRnx87261508; Tue, 8 May 2007 16:53:27 +1000 (AEST) Date: Tue, 8 May 2007 16:53:27 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: Review: XFSQA: unwritten extent conversion vs synchronous direct I/O Message-ID: <20070508065327.GL32602149@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11340 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Test to exercise synchronous direct I/O into unwritten extents. Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group --- xfstests/167 | 65 ++++++++++++++++ xfstests/167.out | 3 xfstests/group | 1 xfstests/src/Makefile | 5 + xfstests/src/unwritten_sync.c | 167 ++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 240 insertions(+), 1 deletion(-) Index: xfs-cmds/xfstests/src/Makefile =================================================================== --- xfs-cmds.orig/xfstests/src/Makefile 2007-05-03 17:10:54.000000000 +1000 +++ xfs-cmds/xfstests/src/Makefile 2007-05-07 10:54:08.296322074 +1000 @@ -10,7 +10,7 @@ TARGETS = dirstress fill fill2 getpagesi mmapcat append_reader append_writer dirperf metaperf \ devzero feature alloc fault fstest t_access_root \ godown resvtest writemod makeextents itrash \ - multi_open_unlink dmiperf + multi_open_unlink dmiperf unwritten_sync LINUX_TARGETS = loggen xfsctl bstat t_mtab getdevicesize \ preallo_rw_pattern_reader preallo_rw_pattern_writer ftrunc trunc \ @@ -111,6 +111,9 @@ looptest: looptest.o locktest: locktest.o $(LINKTEST) +unwritten_sync: unwritten_sync.o + $(LINKTEST) + ifeq ($(PKG_PLATFORM),irix) fill2: fill2.o $(LINKTEST) -lgen Index: xfs-cmds/xfstests/src/unwritten_sync.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ xfs-cmds/xfstests/src/unwritten_sync.c 2007-05-07 11:44:38.668980258 +1000 @@ -0,0 +1,167 @@ +#include +#include +#include +#include +#include +#include + +/* test thanks to judith@sgi.com */ + +#define IO_SIZE 1048576 + +void +print_getbmapx( + const char *pathname, + int fd, + int64_t start, + int64_t limit); + +int +main(int argc, char *argv[]) +{ + int i; + int fd; + char *buf; + struct dioattr dio; + xfs_flock64_t flock; + off_t offset; + char *file; + int loops; + + if(argc != 3) { + fprintf(stderr, "%s \n", argv[0]); + exit(1); + } + + errno = 0; + loops = strtoull(argv[1], NULL, 0); + if (errno) { + perror("strtoull"); + exit(errno); + } + file = argv[2]; + + while 
(loops-- > 0) { + sleep(1); + fd = open(file, O_RDWR|O_CREAT|O_DIRECT, 0666); + if (fd < 0) { + perror("open"); + exit(1); + } + if (xfsctl(file, fd, XFS_IOC_DIOINFO, &dio) < 0) { + perror("dioinfo"); + exit(1); + } + + if ((dio.d_miniosz > IO_SIZE) || (dio.d_maxiosz < IO_SIZE)) { + fprintf(stderr,"Test won't work. Sorry\n"); + exit(1); + } + buf = (char *)memalign(dio.d_mem , IO_SIZE); + if (buf == NULL) { + fprintf(stderr,"Can't get memory\n"); + exit(1); + } + memset(buf,'Z',IO_SIZE); + offset = 0; + + flock.l_whence = 0; + flock.l_start= 0; + flock.l_len = IO_SIZE*21; + if (xfsctl(file, fd, XFS_IOC_RESVSP64, &flock) < 0) { + perror("xfsctl "); + exit(1); + } + for (i = 0; i < 21; i++) { + if (pwrite(fd, buf, IO_SIZE, offset) != IO_SIZE) { + perror("pwrite"); + exit(1); + } + offset += IO_SIZE; + } + + print_getbmapx(file, fd, 0, 0); + + flock.l_whence = 0; + flock.l_start= 0; + flock.l_len = 0; + xfsctl(file, fd, XFS_IOC_FREESP64, &flock); + print_getbmapx(file, fd, 0, 0); + close(fd); + } +} + + + +int +get_getbmapx( + const char *pathname, + int fd, + struct getbmapx *bmapx) +{ + int rc; + + rc = ioctl(fd, XFS_IOC_GETBMAPX, bmapx); + if (rc < 0) { + perror("xfs_ioc_getbmapx"); + exit(1); + } +} + +void +print_getbmapx( +const char *pathname, + int fd, + int64_t start, + int64_t limit) +{ + struct getbmapx bmapx[50]; + int array_size = sizeof(bmapx) / sizeof(bmapx[0]); + int x; + int foundone = 0; + int foundany = 0; + +again: + foundone = 0; + memset(bmapx, '\0', sizeof(bmapx)); + + bmapx[0].bmv_offset = start; + bmapx[0].bmv_length = -1; /* limit - start; */ + bmapx[0].bmv_count = array_size; + bmapx[0].bmv_entries = 0; /* no entries filled in yet */ + + bmapx[0].bmv_iflags = BMV_IF_PREALLOC; + + x = array_size; + for (;;) { + if (x > bmapx[0].bmv_entries) { + if (x != array_size) { + break; /* end of file */ + } + if (get_getbmapx(pathname, fd, bmapx) < 0) { + fprintf(stderr, "getbmapx failed\n"); + exit(1); + } + if (bmapx[0].bmv_entries == 0) { + break; + 
} + x = 1; /* back at first extent in buffer */ + } + if (bmapx[x].bmv_oflags & 1) { + fprintf(stderr, "FOUND ONE %lld %lld %x\n", + bmapx[x].bmv_offset, bmapx[x].bmv_length,bmapx[x].bmv_oflags); + foundone = 1; + foundany = 1; + } + x++; + } + if (foundone) { + sleep(1); + fprintf(stderr,"Repeat\n"); + goto again; + } + if (foundany) { + exit(1); + } +} + Index: xfs-cmds/xfstests/167 =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ xfs-cmds/xfstests/167 2007-05-07 16:02:58.993892587 +1000 @@ -0,0 +1,65 @@ +#! /bin/sh +# FSQA Test No. 167 +# +# unwritten extent conversion test +# +#----------------------------------------------------------------------- +# Copyright (c) 2007 Silicon Graphics, Inc. All Rights Reserved. +#----------------------------------------------------------------------- +# +# creator +owner=dgc@sgi.com + +seq=`basename $0` +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +rm -f $seq.full +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + killall -q -TERM fsstress 2> /dev/null + _cleanup_testdir +} + +workout() +{ + procs=100 + nops=15000 + $FSSTRESS_PROG -d $SCRATCH_MNT -p $procs -n $nops $FSSTRESS_AVOID \ + >>$seq.full & + sleep 2 +} + +# get standard environment, filters and checks +. ./common.rc +. 
./common.filter + +# real QA test starts here +_supported_fs xfs +_supported_os Linux + +_setup_testdir +_require_scratch +_scratch_mkfs_xfs >/dev/null 2>&1 +_scratch_mount + +TEST_FILE=$SCRATCH_MNT/test_file +TEST_PROG=$here/src/unwritten_sync +LOOPS=100 + +echo "*** test unwritten extent conversion under heavy I/O" + +workout + +rm -f $TEST_FILE +$TEST_PROG $LOOPS $TEST_FILE +killall -q -TERM fsstress 2> /dev/null + +echo " *** test done" + +status=0 +exit Index: xfs-cmds/xfstests/167.out =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ xfs-cmds/xfstests/167.out 2007-05-07 11:46:46.560202917 +1000 @@ -0,0 +1,3 @@ +QA output created by 167 +*** test unwritten extent conversion under heavy I/O + *** test done Index: xfs-cmds/xfstests/group =================================================================== --- xfs-cmds.orig/xfstests/group 2007-04-23 16:22:06.000000000 +1000 +++ xfs-cmds/xfstests/group 2007-05-07 10:57:00.721817454 +1000 @@ -246,3 +246,4 @@ pattern ajones@sgi.com 164 rw pattern auto 165 rw pattern auto 166 rw metadata auto +167 rw metadata auto From owner-xfs@oss.sgi.com Tue May 8 01:08:24 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 01:08:26 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4888KfB019878 for ; Tue, 8 May 2007 01:08:22 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id SAA27849; Tue, 8 May 2007 18:08:13 +1000 Date: Tue, 08 May 2007 18:11:37 +1000 From: Timothy Shimmin To: torvalds@linux-foundation.org cc: akpm@osdl.org, xfs@oss.sgi.com Subject: [GIT] XFS updates for 2.6.22 Message-ID: <82BEB52DDD1E753B4E8C2F81@timothy-shimmins-power-mac-g5.local> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; 
charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11341 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Hi Linus, Please pull from: git pull git://oss.sgi.com:8090/xfs/xfs-2.6 --Tim This will update the following files: fs/xfs/linux-2.6/mrlock.h | 12 +++ fs/xfs/linux-2.6/xfs_aops.c | 89 +++++++++++++++++++--- fs/xfs/linux-2.6/xfs_buf.c | 10 ++ fs/xfs/linux-2.6/xfs_buf.h | 3 + fs/xfs/linux-2.6/xfs_fs_subr.c | 21 +++-- fs/xfs/linux-2.6/xfs_fs_subr.h | 2 fs/xfs/linux-2.6/xfs_lrw.c | 163 +++++++++++++++++++++++----------------- fs/xfs/linux-2.6/xfs_vnode.h | 2 fs/xfs/quota/xfs_dquot.c | 3 - fs/xfs/quota/xfs_qm.c | 16 +++- fs/xfs/quota/xfs_qm_syscalls.c | 19 +++-- fs/xfs/quota/xfs_trans_dquot.c | 4 + fs/xfs/support/debug.c | 17 ---- fs/xfs/support/debug.h | 2 fs/xfs/xfs_alloc.c | 2 fs/xfs/xfs_attr.c | 12 +-- fs/xfs/xfs_attr_leaf.c | 2 fs/xfs/xfs_bmap.c | 28 +++---- fs/xfs/xfs_dfrag.c | 6 + fs/xfs/xfs_dir2_block.c | 14 +-- fs/xfs/xfs_dir2_data.c | 7 -- fs/xfs/xfs_dir2_data.h | 2 fs/xfs/xfs_dir2_leaf.c | 7 +- fs/xfs/xfs_dir2_node.c | 4 - fs/xfs/xfs_error.c | 2 fs/xfs/xfs_fsops.c | 4 - fs/xfs/xfs_iget.c | 15 ++-- fs/xfs/xfs_inode.c | 58 +++++++++++--- fs/xfs/xfs_inode.h | 65 ++++++++++++---- fs/xfs/xfs_iocore.c | 2 fs/xfs/xfs_iomap.c | 15 ++-- fs/xfs/xfs_iomap.h | 1 fs/xfs/xfs_log_recover.c | 15 +--- fs/xfs/xfs_mount.c | 2 fs/xfs/xfs_qmops.c | 2 fs/xfs/xfs_quota.h | 3 - fs/xfs/xfs_rename.c | 2 fs/xfs/xfs_rtalloc.c | 6 + fs/xfs/xfs_rw.c | 4 - fs/xfs/xfs_trans.c | 6 - fs/xfs/xfs_trans.h | 4 - fs/xfs/xfs_utils.c | 11 ++- fs/xfs/xfs_vfsops.c | 6 + fs/xfs/xfs_vnodeops.c | 125 ++++++++++++++++++------------- 44 files changed, 491 insertions(+), 304 deletions(-) through these commits: commit f7c66ce3f70d8417de0cfb481ca4e5430382ec5d Author: Lachlan McIlroy Date: Tue May 8 13:50:19 2007 +1000 [XFS] Add lockdep 
support for XFS SGI-PV: 963965 SGI-Modid: xfs-linux-melb:xfs-kern:28485a Signed-off-by: Lachlan McIlroy Signed-off-by: David Chinner Signed-off-by: Tim Shimmin commit 71dfd5a396d11512aa6c8ed0d35b268bc084bb9b Author: Lachlan McIlroy Date: Tue May 8 13:50:12 2007 +1000 [XFS] Fix race in xfs_write() b/w dmapi callout and direct I/O checks. In xfs_write() the iolock is dropped and reacquired in XFS_SEND_DATA() which means that the file could change from not-cached to cached and we need to redo the direct I/O checks. We should also redo the direct I/O checks when the file size changes regardless if O_APPEND is set or not. SGI-PV: 963483 SGI-Modid: xfs-linux-melb:xfs-kern:28440a Signed-off-by: Lachlan McIlroy Signed-off-by: David Chinner Signed-off-by: Tim Shimmin commit 3a02ee1828915d6540b415a160344775e2a4f918 Author: Utako Kusaka Date: Tue May 8 13:50:06 2007 +1000 [XFS] Get rid of redundant "required" in msg. SGI-PV: 963466 SGI-Modid: xfs-linux-melb:xfs-kern:28416a Signed-off-by: Utako Kusaka Signed-off-by: Tim Shimmin Signed-off-by: Christoph Hellwig commit e6a0e9cdff79e1406e5653f759aaf9f59b7ce4c8 Author: Tim Shimmin Date: Tue May 8 13:49:59 2007 +1000 [XFS] Export via a function xfs_buftarg_list for use by kdb/xfsidbg. SGI-PV: 963465 SGI-Modid: xfs-linux-melb:xfs-kern:28414a Signed-off-by: Tim Shimmin Signed-off-by: Lachlan McIlroy commit f10bb2dad02a846966064a531ba6eec301bbb9e0 Author: Tim Shimmin Date: Tue May 8 13:49:53 2007 +1000 [XFS] Remove unused ilen variable and references. SGI-PV: 907752 SGI-Modid: xfs-linux-melb:xfs-kern:28344a Signed-off-by: Tim Shimmin Signed-off-by: Lachlan McIlroy Signed-off-by: Eric Sandeen commit ba87ea699ebd9dd577bf055ebc4a98200e337542 Author: Lachlan McIlroy Date: Tue May 8 13:49:46 2007 +1000 [XFS] Fix to prevent the notorious 'NULL files' problem after a crash. The problem that has been addressed is that of synchronising updates of the file size with writes that extend a file. 
Without the fix the update of a file's size, as a result of a write beyond eof, is independent of when the cached data is flushed to disk. Often the file size update would be written to the filesystem log before the data is flushed to disk. When a system crashes between these two events and the filesystem log is replayed on mount the file's size will be set but since the contents never made it to disk the file is full of holes. If some of the cached data was flushed to disk then it may just be a section of the file at the end that has holes. There are existing fixes to help alleviate this problem, particularly in the case where a file has been truncated, that force cached data to be flushed to disk when the file is closed. If the system crashes while the file(s) are still open then this flushing will never occur. The fix that we have implemented is to introduce a second file size, called the in-memory file size, that represents the current file size as viewed by the user. The existing file size, called the on-disk file size, is the one that get's written to the filesystem log and we only update it when it is safe to do so. When we write to a file beyond eof we only update the in- memory file size in the write operation. Later when the I/O operation, that flushes the cached data to disk completes, an I/O completion routine will update the on-disk file size. The on-disk file size will be updated to the maximum offset of the I/O or to the value of the in-memory file size if the I/O includes eof. SGI-PV: 958522 SGI-Modid: xfs-linux-melb:xfs-kern:28322a Signed-off-by: Lachlan McIlroy Signed-off-by: David Chinner Signed-off-by: Tim Shimmin commit 2a32963130aec5e157b58ff7dfa3dfa1afdf7ca1 Author: Lachlan McIlroy Date: Tue May 8 13:49:39 2007 +1000 [XFS] Fix race condition in xfs_write(). This change addresses a race in xfs_write() where, for direct I/O, the flags need_i_mutex and need_flush are setup before the iolock is acquired. 
The logic used to setup the flags may change between setting the flags and acquiring the iolock resulting in these flags having incorrect values. For example, if a file is not currently cached then need_i_mutex is set to zero and then if the file is cached before the iolock is acquired we will fail to do the flushinval before the direct write. The flush (and also the call to xfs_zero_eof()) need to be done with the iolock held exclusive so we need to acquire the iolock before checking for cached data (or if the write begins after eof) to prevent this state from changing. For direct I/O I've chosen to always acquire the iolock in shared mode initially and if there is a need to promote it then drop it and reacquire it. There's also some other tidy-ups including removing the O_APPEND offset adjustment since that work is done in generic_write_checks() (and we don't use offset as an input parameter anywhere). SGI-PV: 962170 SGI-Modid: xfs-linux-melb:xfs-kern:28319a Signed-off-by: Lachlan McIlroy Signed-off-by: David Chinner Signed-off-by: Tim Shimmin commit e6d29426bc8a5d07d0eebd0842fe0cf6ecc862cd Author: Kouta Ooizumi Date: Tue May 8 13:49:33 2007 +1000 [XFS] Fix uquota and oquota enforcement problems. When uquota and oquota (gquota/pquota) are enabled for accounting both are enforced if ether has enforcement active. Conditions: - Both XFS_UQUOTA_ACCT and XFS_GQUOTA_ACCT are enabled. - Either XFS_UQUOTA_ENFD or XFS_OQUOTA_ENFD is enabled. - The usage without enforce is reached at the soft limit. Problems: 1. "repquota" shows all grace time even if no enforcement. 2. we cannot make a file over a hard limits even if no enforcement. 
SGI-PV: 962291 SGI-Modid: xfs-linux-melb:xfs-kern:28272a Signed-off-by: Kouta Ooizumi Signed-off-by: Donald Douwsma Signed-off-by: Tim Shimmin commit d3cf209476b72c83907a412b6708c5e498410aa7 Author: Lachlan McIlroy Date: Tue May 8 13:49:27 2007 +1000 [XFS] propogate return codes from flush routines This patch handles error return values in fs_flush_pages and fs_flushinval_pages. It changes the prototype of fs_flushinval_pages so we can propogate the errors and handle them at higher layers. I also modified xfs_itruncate_start so that it could propogate the error further. SGI-PV: 961990 SGI-Modid: xfs-linux-melb:xfs-kern:28231a Signed-off-by: Lachlan McIlroy Signed-off-by: Stewart Smith Signed-off-by: Tim Shimmin commit 424ea91ba61c1cdc2dac68576c97030cbf47d84f Author: Donald Douwsma Date: Tue May 8 13:49:15 2007 +1000 [XFS] Fix quotaon syscall failures for group enforcement requests. xfs_qm_scall_quotaon was incorrectly failing requests to enable group quota enforcement. Fixes logic error in OQUOTA handling. SGI-PV: 961964 SGI-Modid: xfs-linux-melb:xfs-kern:28227a Signed-off-by: Donald Douwsma Signed-off-by: Tim Shimmin commit 646d5bdab38c88f4b9088d4e517986a3f3b0edb9 Author: Donald Douwsma Date: Tue May 8 13:49:09 2007 +1000 [XFS] Invalidate quotacheck when mounting without a quota type. When quotas are mounted or remounted without a particular quota type the quota accounting for that type becomes invalid. Previously we were ignoring this leading to accounting errors. SGI-PV: 961964 SGI-Modid: xfs-linux-melb:xfs-kern:28225a Signed-off-by: Donald Douwsma Signed-off-by: Utako Kusaka Signed-off-by: Vlad Apostolov Signed-off-by: Tim Shimmin commit e7a23a9b37c395a153a541d4c50e166eef6abe49 Author: Joe Perches Date: Tue May 8 13:49:03 2007 +1000 [XFS] reducing the number of random number functions. 
Patch provided by Joe Perches SGI-PV: 961696 SGI-Modid: xfs-linux-melb:xfs-kern:28209a Signed-off-by: Joe Perches Signed-off-by: Lachlan McIlroy Signed-off-by: Tim Shimmin commit e9ed9d2240c71014a84043095af4465ffce61367 Author: Eric Sandeen Date: Tue May 8 13:48:56 2007 +1000 [XFS] remove more misc. unused args Patch provided by Eric Sandeen. SGI-PV: 961695 SGI-Modid: xfs-linux-melb:xfs-kern:28205a Signed-off-by: Eric Sandeen Signed-off-by: Lachlan McIlroy Signed-off-by: Tim Shimmin commit ef497f8a1eafe0447f0473940ff2e0f6c8519a14 Author: Eric Sandeen Date: Tue May 8 13:48:49 2007 +1000 [XFS] the "aendp" arg to xfs_dir2_data_freescan is always NULL, remove it. Patch provided by Eric Sandeen. SGI-PV: 961694 SGI-Modid: xfs-linux-melb:xfs-kern:28204a Signed-off-by: Eric Sandeen Signed-off-by: Lachlan McIlroy Signed-off-by: Tim Shimmin commit 1c72bf90037f32fc2b10e0a05dff2640abce8ee2 Author: Eric Sandeen Date: Tue May 8 13:48:42 2007 +1000 [XFS] The last argument "lsn" of xfs_trans_commit() is always called with NULL. Patch provided by Eric Sandeen. 
SGI-PV: 961693 SGI-Modid: xfs-linux-melb:xfs-kern:28199a Signed-off-by: Eric Sandeen Signed-off-by: Lachlan McIlroy Signed-off-by: Tim Shimmin From owner-xfs@oss.sgi.com Tue May 8 03:00:06 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 03:00:09 -0700 (PDT) Received: from atlas.informatik.uni-freiburg.de (atlas.informatik.uni-freiburg.de [132.230.150.3]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48A00fB010168 for ; Tue, 8 May 2007 03:00:01 -0700 Received: from login.informatik.uni-freiburg.de ([132.230.151.6]) by atlas.informatik.uni-freiburg.de with esmtps (TLSv1:DES-CBC3-SHA:168) (Exim 4.66) (envelope-from ) id 1HlLns-00067j-VF for xfs@oss.sgi.com; Tue, 08 May 2007 11:16:17 +0200 Received: from login.informatik.uni-freiburg.de (localhost [127.0.0.1]) by login.informatik.uni-freiburg.de (8.13.8+Sun/8.12.11) with ESMTP id l489GFYW008121 for ; Tue, 8 May 2007 11:16:15 +0200 (MEST) Received: (from zeisberg@localhost) by login.informatik.uni-freiburg.de (8.13.8+Sun/8.12.11/Submit) id l489GEui008120 for xfs@oss.sgi.com; Tue, 8 May 2007 11:16:14 +0200 (MEST) Date: Tue, 8 May 2007 11:16:14 +0200 From: Uwe =?iso-8859-1?Q?Kleine-K=F6nig?= To: xfs@oss.sgi.com Subject: Problems with XFS in a power failure Message-ID: <20070508091613.GA5852@cepheus> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit User-Agent: Mutt/1.5.14+cvs20070321 (2007-03-20) Organization: Universitaet Freiburg, Institut f. Informatik X-archive-position: 11342 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: ukleinek@informatik.uni-freiburg.de Precedence: bulk X-list: xfs Hello, my machine suffered a power failure while doing a apt-get upgrade. This damaged several files. E.g. root@cepheus:~# xxd /var/lib/dpkg/info/myspell-en-us.postrm 0000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 
0000010: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0000020: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0000030: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0000040: 0000 0000 0000 0000 0000 0000 0000 .............. while the repaired file has a size of 78 (= 0x4e) bytes. Some other files got broken with random data. I checked with debsums -c and it reported: root@cepheus:~# debsums -c >=������2[j�b��nw�������� in md5sums for irssi-scripts: ����gV{ڛ� �N���L���Mg{����.�����`����ӈL���j$�kC1'� ��S� ���ݏ�� debsums: invalid line (2) in md5sums for irssi-scripts: ����g�DPH�� 숐���}�]g�����N�ci�5�h �w�W{SZ��q��F_�sR�[���ie�A|��Sv��@�@��;�5�'#c��$��l%���� ��T���$�!d�B�y debsums: invalid line (3) in md5sums for irssi-scripts: �����-ċE��yq�/7đ>�������Ў������Vu����V �+ɋA�f��:%O��_l���}������}� ��1���ȴϘ��=?��&��������F���mT�trZ� ���1���enO%.�YN��=�k��@����\{8ɔw�x����z��-P!g�j����QV9u������)�m���5�l�8l �Rk5�;M���R��� �fx��O gѝ��;����٠�HYfrc��9�����u�q���Ox߀`����~_�ƃ2"J;�Q$vl?�{�=V������ �[��\�d��n�!�UH��Y�D��j2I���*� [�c��G�������[��h*���������2A��m&����������ޥGЉ�;R�0��̦��� ... I don't know exactly, but I think the damaged file here was /var/lib/dpkg/info/irssi.md5sums and debsums repaired it!? In theory this should not happen with a journaled fs, does it? This is a 2.6.19.3 kernel, unfortunately tainted by madwifi. There was nothing logged in dmesg and/or syslog. root@cepheus:~# xfs_info /var meta-data=/dev/hda9 isize=256 agcount=8, agsize=91619 blks = sectsz=512 attr=0 data = bsize=4096 blocks=732952, imaxpct=25 = sunit=0 swidth=0 blks, unwritten=1 naming =version 2 bsize=4096 log =internal bsize=4096 blocks=2560, version=1 = sectsz=512 sunit=0 blks realtime =none extsz=65536 blocks=0, rtextents=0 I have to shutdown that machine because of a power cut by my supplier, but probably I will use a boot cd to bring it up again ... 
Best regards Uwe -- Uwe Kleine-König http://www.google.com/search?q=1+degree+celsius+in+kelvin From owner-xfs@oss.sgi.com Tue May 8 03:25:53 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 03:25:56 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48APqfB015225 for ; Tue, 8 May 2007 03:25:53 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l48APnoq007956 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 8 May 2007 03:25:50 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l48APkHi021147; Tue, 8 May 2007 03:25:47 -0700 Date: Tue, 8 May 2007 03:25:46 -0700 From: Andrew Morton To: Timothy Shimmin Cc: torvalds@linux-foundation.org, xfs@oss.sgi.com Subject: Re: [GIT] XFS updates for 2.6.22 Message-Id: <20070508032546.0728ae95.akpm@linux-foundation.org> In-Reply-To: <82BEB52DDD1E753B4E8C2F81@timothy-shimmins-power-mac-g5.local> References: <82BEB52DDD1E753B4E8C2F81@timothy-shimmins-power-mac-g5.local> X-Mailer: Sylpheed 2.4.1 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11343 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Tue, 08 May 2007 18:11:37 +1000 Timothy Shimmin wrote: > Please pull from: > git pull git://oss.sgi.com:8090/xfs/xfs-2.6 > I pull that regularly and it's always empty. Where did all this code suddenly come from? 
From owner-xfs@oss.sgi.com Tue May 8 03:52:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 03:52:49 -0700 (PDT) Received: from e2.ny.us.ibm.com (e2.ny.us.ibm.com [32.97.182.142]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48AqhfB021631 for ; Tue, 8 May 2007 03:52:45 -0700 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e2.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l48AqgVH002005 for ; Tue, 8 May 2007 06:52:42 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay02.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l48Aqg6c518946 for ; Tue, 8 May 2007 06:52:42 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l48Aqg30020671 for ; Tue, 8 May 2007 06:52:42 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l48AqeM6020596; Tue, 8 May 2007 06:52:41 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id BDCF194C6E; Tue, 8 May 2007 16:22:48 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l48AqmIq011407; Tue, 8 May 2007 16:22:48 +0530 Date: Tue, 8 May 2007 16:22:47 +0530 From: "Amit K. 
Arora" To: Dave Kleikamp Cc: Andrew Morton , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070508105247.GA1950@amitarora.in.ibm.com> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507120719.GD7012@amitarora.in.ibm.com> <1178551477.12900.6.camel@kleikamp.austin.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1178551477.12900.6.camel@kleikamp.austin.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11344 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Mon, May 07, 2007 at 10:24:37AM -0500, Dave Kleikamp wrote: > On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote: > > On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote: > > > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. 
Arora" wrote: > > > > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) > > > > +{ > > > > + handle_t *handle; > > > > + ext4_fsblk_t block, max_blocks; > > > > + int ret, ret2, nblocks = 0, retries = 0; > > > > + struct buffer_head map_bh; > > > > + unsigned int credits, blkbits = inode->i_blkbits; > > > > + > > > > + /* Currently supporting (pre)allocate mode _only_ */ > > > > + if (mode != FA_ALLOCATE) > > > > + return -EOPNOTSUPP; > > > > + > > > > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > > > > + return -ENOTTY; > > > > > > So we don't implement fallocate on bitmap-based files! Well that's huge > > > news. The changelog would be an appropriate place to communicate this, > > > along with reasons why, or a description of the plan to fix it. > > > > Ok. Will add this in the function description as well. > > > > > Also, posix says nothing about fallocate() returning ENOTTY. > > > > Right. I don't seem to find any suitable error from posix description. > > Can you please suggest an error code which might make more sense here ? > > Will -ENOTSUPP be ok ? Since we want to say here that we don't support > > non-extent files. > > Isn't the idea that libc will interpret -ENOTTY, or whatever is returned > here, and fall back to the current library code to do preallocation? > This way, the caller of fallocate() will never see this return code, so > it won't violate posix. You are right. But, we still need to "standardize" (and limit) the error codes which we should return from kernel when we want to fall back on the library implementation. The posix_fallocate() library function will have to look for a set of errors from fallocate() system call, upon receiving which it will do preallocation from user level; or else, it will return success/error-code returned by the system call to the user. 
I think we can make it fall back to the library implementation of fallocate whenever posix_fallocate() receives any of the following errors from the fallocate() system call: 1. ENOSYS 2. EOPNOTSUPP 3. ENOTTY (?) Now the question is - should we limit the set of errors for this purpose to just 1 & 2 above ? In that case I will need to change the error being returned here to -EOPNOTSUPP (from current -ENOTTY). -- Regards, Amit Arora From owner-xfs@oss.sgi.com Tue May 8 06:28:57 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 06:29:00 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48DStfB001716 for ; Tue, 8 May 2007 06:28:57 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id D1B2E18723; Tue, 8 May 2007 15:28:53 +0200 (CEST) Date: Tue, 8 May 2007 15:28:53 +0200 From: Emmanuel Florac To: Uwe =?ISO-8859-1?Q?Kleine-K=F6nig?= Cc: xfs@oss.sgi.com Subject: Re: Problems with XFS in a power failure Message-ID: <20070508152853.1d387fea@galadriel.home> In-Reply-To: <20070508091613.GA5852@cepheus> References: <20070508091613.GA5852@cepheus> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l48DSvfB001742 X-archive-position: 11345 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Tue, 8 May 2007 11:16:14 +0200 you wrote: > In theory this should not happen with a journaled fs, does it? Quite the opposite: it's the expected behaviour. It's especially important to guarantee proper power when using journaling filesystems.
-- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Tue May 8 07:12:56 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 07:12:59 -0700 (PDT) Received: from amanpulo.fs3.ph (amanpulo.fs3.ph [72.51.42.241]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48ECufB017002 for ; Tue, 8 May 2007 07:12:56 -0700 Received: from localhost (localhost [127.0.0.1]) by amanpulo.fs3.ph (Postfix) with ESMTP id 25E8E1E0D5967 for ; Tue, 8 May 2007 21:55:59 +0800 (PHT) Received: from amanpulo.fs3.ph ([127.0.0.1]) by localhost (amanpulo.fs3.ph [127.0.0.1]) (amavisd-new, port 10024) with LMTP id 1NVf-U3wH6+s for ; Tue, 8 May 2007 21:55:56 +0800 (PHT) Received: from musang.fs3.ph (smtp01.globe.com.ph [203.177.91.252]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by amanpulo.fs3.ph (Postfix) with ESMTP id A97BD1E0D5953 for ; Tue, 8 May 2007 21:55:55 +0800 (PHT) Received: by musang.fs3.ph (Postfix, from userid 1000) id BD35A2017683; Tue, 8 May 2007 21:55:42 +0800 (PHT) Date: Tue, 8 May 2007 21:55:42 +0800 From: Federico Sevilla III To: xfs@oss.sgi.com Subject: Re: Problems with XFS in a power failure Message-ID: <20070508135542.GF5621@fs3.ph> Mail-Followup-To: xfs@oss.sgi.com References: <20070508091613.GA5852@cepheus> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20070508091613.GA5852@cepheus> X-Personal-URL: http://jijo.free.net.ph User-Agent: Mutt/1.5.13 (2006-08-11) X-archive-position: 11346 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jijo@fs3.ph Precedence: bulk X-list: xfs On Tue, May 08, 2007 at 11:16:14AM +0200, Uwe Kleine-König wrote: > my machine suffered a power failure while doing a apt-get upgrade. Uh-oh.
You hit the (in)famous binary nulls "issue". You may want to read the FAQ entry: http://oss.sgi.com/projects/xfs/faq.html#nulls. -- Federico Sevilla III F S 3 Consulting Inc. http://www.fs3.ph From owner-xfs@oss.sgi.com Tue May 8 07:47:59 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 07:48:02 -0700 (PDT) Received: from e1.ny.us.ibm.com (e1.ny.us.ibm.com [32.97.182.141]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48ElvfB024039 for ; Tue, 8 May 2007 07:47:58 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e1.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l48EluvR013297 for ; Tue, 8 May 2007 10:47:56 -0400 Received: from d01av03.pok.ibm.com (d01av03.pok.ibm.com [9.56.224.217]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l48EluWL551990 for ; Tue, 8 May 2007 10:47:56 -0400 Received: from d01av03.pok.ibm.com (loopback [127.0.0.1]) by d01av03.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l48ElujD011654 for ; Tue, 8 May 2007 10:47:56 -0400 Received: from [9.53.41.190] (kleikamp.austin.ibm.com [9.53.41.190]) by d01av03.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l48Eltlu011606; Tue, 8 May 2007 10:47:55 -0400 Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 From: Dave Kleikamp To: "Amit K. 
Arora" Cc: Andrew Morton , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com In-Reply-To: <20070508105247.GA1950@amitarora.in.ibm.com> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507120719.GD7012@amitarora.in.ibm.com> <1178551477.12900.6.camel@kleikamp.austin.ibm.com> <20070508105247.GA1950@amitarora.in.ibm.com> Content-Type: text/plain Date: Tue, 08 May 2007 09:47:54 -0500 Message-Id: <1178635675.11344.10.camel@kleikamp.austin.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.8.3 Content-Transfer-Encoding: 7bit X-archive-position: 11347 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: shaggy@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Tue, 2007-05-08 at 16:22 +0530, Amit K. Arora wrote: > On Mon, May 07, 2007 at 10:24:37AM -0500, Dave Kleikamp wrote: > > On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote: > > > On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote: > > > > So we don't implement fallocate on bitmap-based files! Well that's huge > > > > news. The changelog would be an appropriate place to communicate this, > > > > along with reasons why, or a description of the plan to fix it. > > > > > > Ok. Will add this in the function description as well. > > > > > > > Also, posix says nothing about fallocate() returning ENOTTY. > > > > > > Right. I don't seem to find any suitable error from posix description. > > > Can you please suggest an error code which might make more sense here ? > > > Will -ENOTSUPP be ok ? 
Since we want to say here that we don't support > > > non-extent files. > > > > Isn't the idea that libc will interpret -ENOTTY, or whatever is returned > > here, and fall back to the current library code to do preallocation? > > This way, the caller of fallocate() will never see this return code, so > > it won't violate posix. > > You are right. > > But, we still need to "standardize" (and limit) the error codes > which we should return from kernel when we want to fall back on the > library implementation. The posix_fallocate() library function will have > to look for a set of errors from fallocate() system call, upon receiving > which it will do preallocation from user level; or else, it will return > success/error-code returned by the system call to the user. > > I think we can make it fall back to library implementation of fallocate, > whenever posix_fallocate() receives any of the following errors from > fallocate() system call: > > 1. ENOSYS > 2. EOPNOTSUPP > 3. ENOTTY (?) > > Now the question is - should we limit the set of errors for this purpose > to just 1 & 2 above ? In that case I will need to change the error being > returned here to -EOPNOTSUPP (from current -ENOTTY). If you want my opinion, -EOPNOTSUPP is better than -ENOTTY. 
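The fallback rule discussed in this thread can be sketched in C as follows. This is only an illustrative sketch, not code from the patch series or from glibc: the names prealloc_fn, should_fall_back(), prealloc_with_fallback() and the stand-in callbacks are hypothetical, and the error set assumes the list is limited to ENOSYS and EOPNOTSUPP (options 1 and 2 above). Functions return 0 or a positive errno value, the way posix_fallocate() reports errors.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/* Hypothetical callback type: returns 0 on success or a positive errno
 * value, mirroring how posix_fallocate() reports errors. */
typedef int (*prealloc_fn)(int fd, off_t offset, off_t len);

/* Assumed fallback set (errors 1 and 2 from the list in this thread):
 * the kernel or the filesystem cannot preallocate at all. */
static int should_fall_back(int err)
{
    return err == ENOSYS || err == EOPNOTSUPP;
}

/* Try the kernel path first; emulate at user level only for the agreed
 * "not supported" errors. Any other error (ENOSPC, EBADF, ...) is passed
 * back to the caller unchanged. */
static int prealloc_with_fallback(prealloc_fn try_kernel, prealloc_fn emulate,
                                  int fd, off_t offset, off_t len)
{
    int err = try_kernel(fd, offset, len);

    if (err == 0)
        return 0;
    if (should_fall_back(err))
        return emulate(fd, offset, len);
    return err;
}

/* Stand-ins for the real system call and the library emulation, used only
 * to exercise the decision logic without touching a filesystem. */
static int kernel_unsupported(int fd, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    return EOPNOTSUPP;          /* e.g. a non-extent file */
}

static int kernel_full(int fd, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    return ENOSPC;              /* a real failure, must not be masked */
}

static int emulate_ok(int fd, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    return 0;
}

void fallback_demo(void)
{
    /* Unsupported in the kernel: the emulation result is returned. */
    assert(prealloc_with_fallback(kernel_unsupported, emulate_ok, 3, 0, 4096) == 0);
    /* Out of space: reported as-is, no emulation attempted. */
    assert(prealloc_with_fallback(kernel_full, emulate_ok, 3, 0, 4096) == ENOSPC);
}
```

With the set kept this small, a filesystem that returns -ENOTTY would surface that error to the caller instead of triggering the emulation, which is one argument for returning -EOPNOTSUPP from the filesystem.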
Shaggy -- David Kleikamp IBM Linux Technology Center From owner-xfs@oss.sgi.com Tue May 8 09:53:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 09:53:09 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48Gr2fB021897 for ; Tue, 8 May 2007 09:53:04 -0700 Received: from localhost.adilger.int (unknown [64.166.152.82]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id EA0D24E4557; Tue, 8 May 2007 10:53:00 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 28A3A3FB4; Tue, 8 May 2007 09:52:59 -0700 (PDT) Date: Tue, 8 May 2007 09:52:59 -0700 From: Andreas Dilger To: Theodore Tso , Mingming Cao , Andrew Morton , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070508165259.GD6375@schatzie.adilger.int> Mail-Followup-To: Theodore Tso , Mingming Cao , Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070507171541.5370a36a.akpm@linux-foundation.org> <1178584899.3933.73.camel@dyn9047017103.beaverton.ibm.com> <20070508014337.GA14072@thunk.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070508014337.GA14072@thunk.org> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11348 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 07, 2007 21:43 -0400, Theodore Tso wrote: > On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote: > > Userspace could presumably repair the mess in most situations by truncating > > the file back again. The kernel cannot do that because there might be live > > data in amongst there. > > Actually, the kernel could do it, in that it could simply release all > uninitialized extents back to the system. The problem is distinguishing > between the uninitialized extents that had just been newly added, versus > the ones that had been there from before. (On the other hand, if the > filesystem was completely full, releasing uninitialized blocks wouldn't > be the worst thing in the world to do, although releasing previously > fallocated blocks probably does violate the principle of least > surprise, even if it's what the user would have wanted.) I tend to agree with this. Having fallocate() fill up the filesystem is exactly what the caller asked for. A write() that hits ENOSPC doesn't truncate off the whole write either, nor does "dd" delete the whole file when the filesystem is full.
Even checking the statfs() space before doing the fallocate() may be counter intuitive, since it will return ENOSPC but the filesystem will not actually be full. Some applications (e.g. database) may WANT to fill the filesystem and then get the actual file size back to avoid trusting statfs() because of metadata overhead (e.g. indirect blocks). One of the design goals for sys_fallocate() was to allow FA_DELALLOC to deallocate unwritten extents in a safe manner. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Tue May 8 10:46:11 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 10:46:15 -0700 (PDT) Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.153]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48Hk9fB032642 for ; Tue, 8 May 2007 10:46:10 -0700 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e35.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l48Hk7kD027946 for ; Tue, 8 May 2007 13:46:07 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l48Hk5a2140062 for ; Tue, 8 May 2007 11:46:05 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l48Hk44h029943 for ; Tue, 8 May 2007 11:46:05 -0600 Received: from dyn9047017103.beaverton.ibm.com (dyn9047017103.beaverton.ibm.com [9.47.17.103]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l48Hk2Vq029851; Tue, 8 May 2007 11:46:03 -0600 Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 From: Mingming Cao Reply-To: cmm@us.ibm.com To: Theodore Tso Cc: Andrew Morton , Andreas Dilger , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com In-Reply-To: <20070508014337.GA14072@thunk.org> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070507171541.5370a36a.akpm@linux-foundation.org> <1178584899.3933.73.camel@dyn9047017103.beaverton.ibm.com> <20070508014337.GA14072@thunk.org> Content-Type: text/plain Organization: IBM LTC Date: Tue, 08 May 2007 10:46:01 -0700 Message-Id: <1178646362.4135.17.camel@dyn9047017103.beaverton.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-7) Content-Transfer-Encoding: 7bit X-archive-position: 11349 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cmm@us.ibm.com Precedence: bulk X-list: xfs On Mon, 2007-05-07 at 21:43 -0400, Theodore Tso wrote: > On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote: > > We could check the total number of fs free blocks account before > > preallocation happens, if there isn't enough space left, there is no > > need to bother preallocating. > > Checking against the fs free blocks is a good idea, since it will > prevent the obvious error case where someone tries to preallocate 10GB > when there is only 2GB left. Thinking about it again, this check is useful when preallocating blocks at EOF. It is not much use when preallocating a range that contains holes: in that case 2GB of free space might be enough even if the application asks to preallocate a 10GB range. > But it won't help if there are multiple > processes trying to allocate blocks at the same time. On the other hand, > that case is probably relatively rare, and in that case, the > filesystem was probably going to be left completely full in any case.
> On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote: > > Userspace could presumably repair the mess in most situations by truncating > > the file back again. The kernel cannot do that because there might be live > > data in amongst there. > > Actually, the kernel could do it, in that it could simply release all > uninitialized extents back to the system. The problem is distinguishing > between the uninitialized extents that had just been newly added, versus > the ones that had been there from before. True. The new uninitialized extents can be merged with adjacent old uninitialized extents, so there is no way to distinguish the just-added uninitialized extents from the merged ones. > (On the other hand, if the > filesystem was completely full, releasing uninitialized blocks wouldn't > be the worst thing in the world to do, although releasing previously > fallocated blocks probably does violate the principle of least > surprise, even if it's what the user would have wanted.) > > On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote: > > If there is enough free space, we could make a reservation window that > > has at least N free blocks and mark it not stealable by other files. So > > later we will not run into the ENOSPC error. > > Could you really use a single reservation window? When the filesystem > is almost full, the free extents are likely going to be scattered all > over the disk. The general principle of grabbing all of the extents > and keeping them in an in-memory data structure, and only adding them > to the extent tree would work, though; I'm just not sure we could do > it using the existing reservation window code, since it only supports > a single reservation window per file, yes? > You are right. There is one reservation window per file (and there is a limit on the maximum window size).
So yeah this way it's not going to prevent ENOSPC for sure:( Mingming From owner-xfs@oss.sgi.com Tue May 8 18:06:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 18:06:36 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4916UfB010738 for ; Tue, 8 May 2007 18:06:32 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA24949; Wed, 9 May 2007 11:06:24 +1000 Date: Wed, 09 May 2007 11:09:51 +1000 From: Timothy Shimmin To: Andrew Morton cc: torvalds@linux-foundation.org, xfs@oss.sgi.com Subject: Re: [GIT] XFS updates for 2.6.22 Message-ID: In-Reply-To: <20070508032546.0728ae95.akpm@linux-foundation.org> References: <82BEB52DDD1E753B4E8C2F81@timothy-shimmins-power-mac-g5.local> <20070508032546.0728ae95.akpm@linux-foundation.org> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11350 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Hi Andrew, --On 8 May 2007 3:25:46 AM -0700 Andrew Morton wrote: > On Tue, 08 May 2007 18:11:37 +1000 Timothy Shimmin wrote: > >> Please pull from: >> git pull git://oss.sgi.com:8090/xfs/xfs-2.6 >> > > I pull that regularly and it's always empty. Where did all this > code suddenly come from? It came from our internal tree which is also mirrored in cvs on oss. Our internal tree gets updated (non-xfs) from mainline every so often and has latest kdb patches applied (and dmapi patches). The internal tree is where our changes are originated from before moving out to an absolutely ridiculous number of trees. 
I only update the git tree every so often (start of rc's, important fixes, when I remember:) for Linus. Should I be updating a git branch for you more often? Cheers, Tim. From owner-xfs@oss.sgi.com Tue May 8 18:44:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 18:44:16 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l491iCfB019017 for ; Tue, 8 May 2007 18:44:12 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l491iADi017404 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 8 May 2007 18:44:11 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l491i9Fx007238; Tue, 8 May 2007 18:44:09 -0700 Date: Tue, 8 May 2007 18:44:09 -0700 From: Andrew Morton To: Timothy Shimmin Cc: torvalds@linux-foundation.org, xfs@oss.sgi.com Subject: Re: [GIT] XFS updates for 2.6.22 Message-Id: <20070508184409.e6ad4c8b.akpm@linux-foundation.org> In-Reply-To: References: <82BEB52DDD1E753B4E8C2F81@timothy-shimmins-power-mac-g5.local> <20070508032546.0728ae95.akpm@linux-foundation.org> X-Mailer: Sylpheed 2.4.1 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11351 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Wed, 09 May 2007 11:09:51 +1000 Timothy Shimmin wrote: > Should I be updating a git branch for you more often? Only if you want it tested ;) Yes please. 
From owner-xfs@oss.sgi.com Tue May 8 22:59:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 22:59:29 -0700 (PDT) Received: from tyo202.gate.nec.co.jp (TYO202.gate.nec.co.jp [202.32.8.206]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l495xOfB008438 for ; Tue, 8 May 2007 22:59:26 -0700 Received: from mailgate3.nec.co.jp (mailgate54.nec.co.jp [10.7.69.195]) by tyo202.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l495xLcV009966 for ; Wed, 9 May 2007 14:59:21 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id l495xL321403 for xfs@oss.sgi.com; Wed, 9 May 2007 14:59:21 +0900 (JST) Received: from secsv3.tnes.nec.co.jp (tnesvc2.tnes.nec.co.jp [10.1.101.15]) by mailsv.nec.co.jp (8.11.7/3.7W-MAILSV-NEC) with ESMTP id l495xLO26246 for ; Wed, 9 May 2007 14:59:21 +0900 (JST) Received: from tnesvc2.tnes.nec.co.jp ([10.1.101.15]) by secsv3.tnes.nec.co.jp (ExpressMail 5.10) with SMTP id 20070509.145924.21802300 for ; Wed, 9 May 2007 14:59:24 +0900 Received: FROM tnessv1.tnes.nec.co.jp BY tnesvc2.tnes.nec.co.jp ; Wed May 09 14:59:24 2007 +0900 Received: from rifu.bsd.tnes.nec.co.jp (rifu.bsd.tnes.nec.co.jp [10.1.104.1]) by tnessv1.tnes.nec.co.jp (Postfix) with ESMTP id 07047AE4B3; Wed, 9 May 2007 14:59:21 +0900 (JST) Received: from TNESG9305.tnes.nec.co.jp (TNESG9305.bsd.tnes.nec.co.jp [10.1.104.199]) by rifu.bsd.tnes.nec.co.jp (8.12.11/3.7W/BSD-TNES-MX01) with SMTP id l495xKD3006884; Wed, 9 May 2007 14:59:20 +0900 Message-Id: <200705090559.AA05331@TNESG9305.tnes.nec.co.jp> Date: Wed, 09 May 2007 14:59:11 +0900 To: xfs@oss.sgi.com Subject: [PATCH] Fix xfs_quota path command. 
From: Utako Kusaka MIME-Version: 1.0 X-Mailer: AL-Mail32 Version 1.13 Content-Type: text/plain; charset=us-ascii X-archive-position: 11352 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: utako@tnes.nec.co.jp Precedence: bulk X-list: xfs Hi, In the path command of xfs_quota, the range shown in the error message incorrectly reads 0 to -1 when a list number is specified while the path list is empty. I think the message is unnecessary in this case, just as nothing is printed when no list number is specified. Example: # ./xfs_quota -x xfs_quota> path xfs_quota> path 0 value 0 is out of range (0--1) Signed-off-by: Utako Kusaka --- --- xfsprogs-2.8.20/quota/path.orig 2007-04-26 14:14:00.000000000 +0900 +++ xfsprogs-2.8.20/quota/path.c 2007-04-27 11:27:56.000000000 +0900 @@ -102,6 +102,9 @@ path_f( if (argc <= 1) return pathlist_f(); + if (!fs_count) + return 0; + i = atoi(argv[1]); if (i < 0 || i >= fs_count) { printf(_("value %d is out of range (0-%d)\n"), From owner-xfs@oss.sgi.com Wed May 9 03:52:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 03:52:33 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49AqPfB028412 for ; Wed, 9 May 2007 03:52:26 -0700 Received: from e36.co.us.ibm.com (e36.co.us.ibm.com [32.97.110.154]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l49AEJTa016628 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Wed, 9 May 2007 06:14:20 -0400 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e36.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l49AECOf004909 for ; Wed, 9 May 2007 06:14:13 -0400 Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l49AECD9171760 for ; Wed, 9 May 2007 04:14:12 -0600 Received: from
d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l49AECFt015558 for ; Wed, 9 May 2007 04:14:12 -0600 Received: from qubit.in.ibm.com (wks184594wss.in.ibm.com [9.184.236.184]) by d03av04.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l49AEAHi014659; Wed, 9 May 2007 04:14:11 -0600 Received: from qubit.in.ibm.com (localhost.localdomain [127.0.0.1]) by qubit.in.ibm.com (Postfix) with ESMTP id A8B2A67FFD; Wed, 9 May 2007 15:45:19 +0530 (IST) Received: (from suparna@localhost) by qubit.in.ibm.com (8.13.1/8.13.1/Submit) id l49AFDbE001436; Wed, 9 May 2007 15:45:13 +0530 Date: Wed, 9 May 2007 15:45:07 +0530 From: Suparna Bhattacharya To: Paul Mackerras Cc: Andrew Morton , "Amit K. Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070509101507.GA26056@in.ibm.com> Reply-To: suparna@in.ibm.com References: <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <17978.47502.786970.196554@cargo.ozlabs.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <17978.47502.786970.196554@cargo.ozlabs.ibm.com> User-Agent: Mutt/1.5.11 X-archive-position: 11354 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: suparna@in.ibm.com Precedence: bulk X-list: xfs On Fri, May 04, 2007 at 02:41:50PM +1000, Paul Mackerras wrote: > Andrew 
Morton writes: > > > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > > > This patch implements the fallocate() system call and adds support for > > > i386, x86_64 and powerpc. > > > > > > ... > > > > > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > > > Please add a comment over this function which specifies its behaviour. > > Really it should be enough material from which a full manpage can be > > written. > > This looks like it will have the same problem on s390 as > sys_sync_file_range. Maybe the prototype should be: > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) Yes, but the trouble is that there was a contrary viewpoint preferring that fd first be maintained as a convention like other syscalls (see the following posts) http://marc.info/?l=linux-fsdevel&m=117585330016809&w=2 (Andreas) http://marc.info/?l=linux-fsdevel&m=117690157917378&w=2 (Andreas) http://marc.info/?l=linux-fsdevel&m=117578821827323&w=2 (Randy) So we are kind of deadlocked, aren't we ? The debates on the proposed solution for s390 http://marc.info/?l=linux-fsdevel&m=117760995610639&w=2 http://marc.info/?l=linux-fsdevel&m=117708124913098&w=2 http://marc.info/?l=linux-fsdevel&m=117767607229807&w=2 Are there any better ideas ? Regards Suparna > > Paul. > - > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India From owner-xfs@oss.sgi.com Wed May 9 03:51:42 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 03:51:46 -0700 (PDT) Received: from ozlabs.org (ozlabs.org [203.10.76.45]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49ApefB028095 for ; Wed, 9 May 2007 03:51:42 -0700 Received: by ozlabs.org (Postfix, fr