On Tue, Nov 15, 2011 at 03:14:12PM -0500, Christoph Hellwig wrote:
> Do not use unlogged metadata updates and the VFS dirty bit for updating
> the file size after writeback. In addition to causing various problems
> with updates getting delayed for far too log this also drags in the
> unscalable VFS dirty tracking, and is one of the few remaining unlogged
> metadata updates.
> Note that we allocate a new transaction from the I/O completion handler.
> While this sounds fairly dangerous it isn't an issue in practice given
> that any appending write alreay had to start a transaction in writepages
> to allocate blocks, and we'll start throtteling there if we run low on
> log space or memory.
Actually, that is not true: we know that transaction reservation in
Io completion can hang systems. That's why we specifically shifted
unwritten extent conversion to it's own workqueue.
$ gl -n 1 c626d17
Author: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Mon Apr 6 18:42:11 2009 +0200
xfs: prevent unwritten extent conversion from blocking I/O completion
Unwritten extent conversion can recurse back into the filesystem due
to memory allocation. Memory reclaim requires I/O completions to be
processed to allow the callers to make progress. If the I/O
completion workqueue thread is doing the recursion, then we have a
Move unwritten extent completion into it's own workqueue so it
doesn't block I/O completions for normal delayed allocation or
Signed-off-by: Dave Chinner <david@xxxxxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
The condition that I saw most frequently was transaction reservation
blocking and preventing IO from completing and therefore marking
memory clean and freeable, whilst the unwritten extent conversion
transaction reservation was blocked because the tail of the log
could not be moved due to memory allocation for inode buffers to
write back the inodes pinning the tail of the log failing and hence
looping endlessly in the buffer cache. If we do transaction
reservations for every single extending IO completion during Io
completion, we are effectively making the deadlock possible for most
extending delayed allocation write IOs (i.e. the majority of write
IO in a typical system).
Contrast that to when we are in the .writepage calling context:
transaction reservation does not block IO completions that do
not require unwritten extent conversion, so dirty pages can still be
cleaned even when transactions and/or transaction reservations stall
waiting for memory to be freed.
> We could still occasionally stall in the completion handler, but given
> that we have per-filesystems workqueues for the I/O completions,
> and completions that do not have to either convert unwritten extents
> or update the file size are processed from interrupt context we do not
> have to worry about this stalling a system to death.
per-filesystem workqueues does not fix the problem - the deadlock
still exists when you only have one active filesystem.
> In addition to that implementing log reservations from ->writepage that
> are only released by a different thread requires a lot of work,
That's trivial to do - there's nothing thread specific about a
transaction. All you need to do is ensure that everything is logged
before you pass the struct xfs_trans off to another thread. We
can do that simply by adding it to the ioend.
> and even
> with that wasn't quite doable in a deadlock-free manner.
What new deadlocks did you find? I can't think of any off the top of
my head, so I'm curious to know what you hit...