[Top] [All Lists]

Re: [PATCH 1/6] fs: add hole punching to fallocate

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH 1/6] fs: add hole punching to fallocate
From: Lawrence Greenfield <leg@xxxxxxxxxx>
Date: Tue, 11 Jan 2011 16:13:42 -0500
Cc: "Ted Ts'o" <tytso@xxxxxxx>, Josef Bacik <josef@xxxxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, linux-btrfs@xxxxxxxxxxxxxxx, linux-ext4@xxxxxxxxxxxxxxx, linux-fsdevel@xxxxxxxxxxxxxxx, xfs@xxxxxxxxxxx, joel.becker@xxxxxxxxxx, cmm@xxxxxxxxxx, cluster-devel@xxxxxxxxxx
Dkim-signature: v=1; a=rsa-sha1; c=relaxed/relaxed; d=google.com; s=beta; t=1294780424; bh=+PGLfYmGYm2n8Mu18mLMNGNxQs0=; h=MIME-Version:In-Reply-To:References:Date:Message-ID:Subject:From: To:Cc:Content-Type:Content-Transfer-Encoding; b=gFogpmcK1VzSqNoXPE8S3C4m1G6OjUngufBZtZcCGN9Gwv4ygAYaKsIZPDgEYHuNL 1iSoCH3TZggOuGsLQOB4A==
Dkim-signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=beta; h=domainkey-signature:mime-version:in-reply-to:references:date :message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=ccjFxex9Uz0aOishYrYcXdZ7+geEtyiqm5Xs9UquBJE=; b=uvLufrAEd6WstG5c4sXcpEvNxoEAzqkSBuTxupXe/VNN6icZfW44vktvHUIebMtGw4 RPVlogLGftEa+KUjOWgQ==
Domainkey-signature: a=rsa-sha1; c=nofws; d=google.com; s=beta; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=Bs7+u7VQVJWuODo3TugeIYwbDoun5Eq3/ozikVmRb2C/th7CBHY5FoWevBmxpLBS7T 3tHEdtPwpqB/bWPqVpPg==
In-reply-to: <20101109234049.GQ2715@dastard>
References: <1289248327-16308-1-git-send-email-josef@xxxxxxxxxx> <20101109011222.GD2715@dastard> <20101109033038.GF3099@xxxxxxxxx> <20101109044242.GH2715@dastard> <20101109214147.GK3099@xxxxxxxxx> <20101109234049.GQ2715@dastard>
On Tue, Nov 9, 2010 at 6:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Tue, Nov 09, 2010 at 04:41:47PM -0500, Ted Ts'o wrote:
>> On Tue, Nov 09, 2010 at 03:42:42PM +1100, Dave Chinner wrote:
>> > Implementation is up to the filesystem. However, XFS does (b)
>> > because:
>> >
>> >     1) it was extremely simple to implement (one of the
>> >        advantages of having an exceedingly complex allocation
>> >        interface to begin with :P)
>> >     2) conversion is atomic, fast and reliable
>> >     3) it is independent of the underlying storage; and
>> >     4) reads of unwritten extents operate at memory speed,
>> >        not disk speed.
>> Yeah, I was thinking that using a device-style TRIM might be better
>> since future attempts to write to it won't require a separate seek to
>> modify the extent tree.  But yeah, there are a bunch of advantages of
>> simply mutating the extent tree.
>> While we're on the subject of changes to fallocate, what do people
>> think of FALLOC_FL_EXPOSE_OLD_DATA, which requires either root
>> privileges or (if capabilities are in use) CAP_DAC_OVERRIDE &&
>> CAP_MAC_OVERRIDE && CAP_SYS_ADMIN.  This would allow a trusted process
>> to fallocate blocks with the extent already marked initialized.  I've
>> had two requests for such functionality for ext4 already.
> We removed that ability from XFS about three years ago because it's
> a massive security hole. e.g. what happens if the file is world
> readable, even though the process that called
> FALLOC_FL_EXPOSE_OLD_DATA was privileged and was allowed to expose
> such data? Or the file is chmod 777 after being exposed?
> The historical reason for such behaviour existing in XFS was that in
> 1997 the CPU and IO latency cost of unwritten extent conversion was
> significant, so users with real physical security (i.e. marines with
> guns) were able to make use of fast preallocation with no conversion
> overhead without caring about the security implications. These days,
> the performance overhead of unwritten extent conversion is minimal -
> I generally can't measure a difference in IO performance as a result
> of it - so there is simply no good reaѕon for leaving such a gaping
> security hole in the system.
> If anyone wants to read the underlying data, then use fiemap to map
> the physical blocks and read it directly from the block device. That
> requires root privileges but does not open any new stale data
> exposure problems....
>> (Take for example a trusted cluster filesystem backend that checks the
>> object checksum before returning any data to the user; and if the
>> check fails the cluster file system will try to use some other replica
>> stored on some other server.)
> IOWs, all they want to do is avoid the unwritten extent conversion
> overhead. Time has shown that a bad security/performance tradeoff
> decision was made 13 years ago in XFS, so I see little reason to
> repeat it for ext4 today....

I'd make use of FALLOC_FL_EXPOSE_OLD_DATA. It's not the CPU overhead
of extent conversion. It's that extent conversion causes more metadata
operations than what you'd have otherwise, which means systems that
want to use O_DIRECT and make sure the data doesn't go away either
have to write O_DIRECT|O_DSYNC or need to call fdatasync().

cluster file system implementor,

> Cheers,
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

<Prev in Thread] Current Thread [Next in Thread>