xfs
[Top] [All Lists]

Re: [PATCH 3/3] xfs: always log file size updates

To: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Subject: Re: [PATCH 3/3] xfs: always log file size updates
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Thu, 25 Aug 2011 11:04:21 +1000
Cc: xfs@xxxxxxxxxxx
In-reply-to: <20110824060150.245096567@xxxxxxxxxxxxxxxxxxxxxx>
References: <20110824055924.139283426@xxxxxxxxxxxxxxxxxxxxxx> <20110824060150.245096567@xxxxxxxxxxxxxxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Wed, Aug 24, 2011 at 01:59:27AM -0400, Christoph Hellwig wrote:
> Instead of updating the inode size through the VFS dirty mechanism and
> ->write_inode just log it directly.    This fixes a nasty bug where we'd
> lose file size updates if they happen from writeout during inode eviction,
> and prepares for getting rid of non-transaction metadata updates entirely.
> 
> This may cause a few additional
> transactions for inode core updates, but with the delaylog code those
> are cheap enough to not bother.

Now that I've thought about this a little longer, the only concern I
have about this is that xfs_trans_reserve() can block for long
periods of time and require IO completion to make progress. This is
one of the reasons we push unwritten extent conversion off into it's
workqueue, so that it doesn't block non-conversion IO completions.

Given that once we have a reservation we can hold it for as long as
we want, maybe we should check before IO submission whether the
completion is beyond the current EOF and allocate and reserve the
transaction space before IO submission then attach it to the ioend.
All that leaves on IO completion is this:

        if (xfs_ioend_is_append(ioend)) {
                lock inode
                update size
                join inode to transaction
                log inode core
                commit transaction
        } else
                cancel transaction

That way we minimise the amount blocking we potentially do in IO
completion.

> -
> -/*
>   * Schedule IO completion handling on the final put of an ioend.
>   *
>   * If there is no work to do we might as well call it a day and free the
> @@ -203,10 +179,29 @@ xfs_end_io(
>       }
>  
>       /*
> -      * We might have to update the on-disk file size after extending
> -      * writes.
> -      */
> -     xfs_setfilesize(ioend);
> +      * Update on-disk file size now that data has been written to disk.
> +      *
> +      * The current in-memory file size is i_size.  If a write is beyond
> +      * eof i_new_size will be the intended file size until i_size is
> +      * updated.  If this write does not extend all the way to the valid
> +      * file size then restrict this update to the end of the write.
> +      */
> +     if (xfs_ioend_is_append(ioend)) {
> +             xfs_fsize_t     new_size;
> +
> +             xfs_ilock(ip, XFS_ILOCK_EXCL);
> +             new_size = xfs_ioend_new_eof(ioend);
> +             if (new_size) {
> +                     trace_xfs_setfilesize(ip, ioend->io_offset,
> +                                           ioend->io_size);
> +                     error = xfs_setfilesize(ip, new_size);
> +             }
> +             xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +
> +             if (error)
> +                     ioend->io_error = -error;
> +     }
> +
>  done:
>       xfs_destroy_ioend(ioend);
>  }
> @@ -364,14 +359,6 @@ xfs_submit_ioend_bio(
>       atomic_inc(&ioend->io_remaining);
>       bio->bi_private = ioend;
>       bio->bi_end_io = xfs_end_bio;
> -
> -     /*
> -      * If the I/O is beyond EOF we mark the inode dirty immediately
> -      * but don't update the inode size until I/O completion.
> -      */
> -     if (xfs_ioend_new_eof(ioend))
> -             xfs_mark_inode_dirty(XFS_I(ioend->io_inode));
> -
>       submit_bio(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE, bio);
>  }
>  
> Index: xfs/fs/xfs/xfs_file.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_file.c        2011-08-24 00:41:55.864686316 +0200
> +++ xfs/fs/xfs/xfs_file.c     2011-08-24 00:41:59.691332251 +0200
> @@ -409,7 +409,7 @@ xfs_aio_write_newsize_update(
>               xfs_rw_ilock(ip, XFS_ILOCK_EXCL);
>               ip->i_new_size = 0;
>               if (ip->i_d.di_size > ip->i_size)
> -                     ip->i_d.di_size = ip->i_size;
> +                     xfs_setfilesize(ip, ip->i_size);
>               xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);
>       }
>  }
> Index: xfs/fs/xfs/xfs_inode.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_inode.c       2011-08-24 00:41:52.724703326 +0200
> +++ xfs/fs/xfs/xfs_inode.c    2011-08-24 00:42:36.824464417 +0200
> @@ -1413,6 +1413,34 @@ xfs_itruncate_data(
>  }
>  
>  /*
> + * Update the inode size during I/O completions or error handling.
> + */
> +int
> +xfs_setfilesize(
> +     struct xfs_inode        *ip,
> +     xfs_fsize_t             isize)
> +{
> +     struct xfs_mount        *mp = ip->i_mount;
> +     struct xfs_trans        *tp;
> +     int                     error = 0;
> +
> +     ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> +
> +     tp = _xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS, KM_NOFS);
> +     error = xfs_trans_reserve(tp, 0, XFS_FSYNC_TS_LOG_RES(mp), 0, 0, 0);

That can deadlock - if the xfs_trans_reserve() call requires tail
pushing and this inode pins the tail of the log, then it cannot be
flushed because we are already holding the ilock and so the flush
cannot get it. hence the tail of the log cannot be moved forward,
and we are stuck.

So the XFS_ILOCK_EXCL must be taken only after the
xfs_trans_reserve() call.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

<Prev in Thread] Current Thread [Next in Thread>