To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH 2/3] xfs: punch new delalloc blocks out of failed writes inside EOF.
From: Ben Myers <bpm@xxxxxxx>
Date: Mon, 7 May 2012 17:00:59 -0500
Cc: xfs@xxxxxxxxxxx
In-reply-to: <1335519922-14371-3-git-send-email-david@xxxxxxxxxxxxx>
References: <1335519922-14371-1-git-send-email-david@xxxxxxxxxxxxx> <1335519922-14371-3-git-send-email-david@xxxxxxxxxxxxx>
User-agent: Mutt/1.5.20 (2009-06-14)
On Fri, Apr 27, 2012 at 07:45:21PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> When a partial write inside EOF fails, it can leave delayed
> allocation blocks lying around because they don't get punched back
> out. This leads to assert failures like:
> 
> XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 847
> 
> when evicting inodes from the cache. This can be trivially triggered
> by xfstests 083, which takes between 5 and 15 executions on a 512
> byte block size filesystem to trip over this. Debugging shows a
> failed write due to ENOSPC calling xfs_vm_write_failed such as:
> 
> [ 5012.329024] ino 0xa0026: vwf to 0x17000, sze 0x1c85ae
> 
> and no action is taken on it. This leaves behind a delayed
> allocation extent that has no page covering it and no data in it:
> 
> [ 5015.867162] ino 0xa0026: blks: 0x83 delay blocks 0x1, size 0x2538c0
> [ 5015.868293] ext 0: off 0x4a, fsb 0x50306, len 0x1
> [ 5015.869095] ext 1: off 0x4b, fsb 0x7899, len 0x6b
> [ 5015.869900] ext 2: off 0xb6, fsb 0xffffffffe0008, len 0x1
>                                     ^^^^^^^^^^^^^^^
> [ 5015.871027] ext 3: off 0x36e, fsb 0x7a27, len 0xd
> [ 5015.872206] ext 4: off 0x4cf, fsb 0x7a1d, len 0xa
> 
> So the delayed allocation extent is one block long at offset
> 0x16c00. Tracing shows that a bigger write:
> 
> xfs_file_buffered_write: size 0x1c85ae offset 0x959d count 0x1ca3f ioflags
> 
> allocates the block, and then fails with ENOSPC trying to allocate
> the last block on the page, leading to a failed write with stale
> delalloc blocks on it.
> 
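For reference, ext 2 in that dump works out to the right spot: file offset 0xb6
blocks * 0x200 bytes (512 byte blocks) = 0x16c00, one block long, i.e. the
range [0x16c00, 0x16e00).
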
> Because we've had an ENOSPC when trying to allocate 0x16e00, it
> means that we are never going to call ->write_end on the page and
> so the allocated new buffer will not get marked dirty or have the
> buffer_new state cleared. In other words, what the above write is
> supposed to end up with is this mapping for the page:
> 
>     +------+------+------+------+------+------+------+------+
>       UMA    UMA    UMA    UMA    UMA    UMA    UND    FAIL
> 
> where:  U = uptodate
>         M = mapped
>         N = new
>         A = allocated
>         D = delalloc
>         FAIL = block we ENOSPC'd on.
> 
> and the key point being the buffer_new() state for the newly
> allocated delayed allocation block. Except it doesn't end up that way - we're not
> marking buffers new correctly.
> 
> That buffer_new() problem goes back to the xfs_iomap removal days,
> where xfs_iomap() used to return a "new" status for any map with
> newly allocated blocks, so that __xfs_get_blocks() could call
> set_buffer_new() on it. We still have the "new" variable and the
> check for it in the set_buffer_new() logic - except we never set it
> now!
> 
> Hence that newly allocated delalloc block doesn't have the new flag
> set on it, so when the write fails we cannot tell which blocks we
> are supposed to punch out. Why do we need the buffer_new flag? Well,
> that's because we can have this case:
> 
>     +------+------+------+------+------+------+------+------+
>       UMD    UMD    UMD    UMD    UMD    UMD    UND    FAIL
> 
> where all the UMD buffers contain valid data from a previously
> successful write() system call. We only want to punch the UND buffer
> because that's the only one that we added in this write and it was
> only this write that failed.
> 
> That implies that even the old buffer_new() logic was wrong -
> because it would result in all those UMD buffers on the page having
> set_buffer_new() called on them even though they aren't new. Hence
> we should only be calling set_buffer_new() for delalloc buffers that
> were allocated (i.e. were a hole before xfs_iomap_write_delay() was
> called).
> 
> So, fix this set_buffer_new logic according to how we need it to
> work for handling failed writes correctly. Also, restore the new
> buffer logic handling for blocks allocated via
> xfs_iomap_write_direct(), because it should still set the buffer_new
> flag appropriately for newly allocated blocks, too.
> 
> So, now that we have buffer_new() being set appropriately in
> __xfs_get_blocks(), we can detect the exact delalloc ranges that
> we allocated in a failed write, and hence can now do a walk of the
> buffers on a page to find them.
> 
> Except, it's not that easy. When block_write_begin() fails, it
> unlocks and releases the page that we just had an error on, so we
> can't use that page to handle errors anymore. We have to get access
> to the page while it is still locked to walk the buffers. Hence we
> have to open code block_write_begin() in xfs_vm_write_begin() to be
> able to insert xfs_vm_write_failed() in the right place.
> 
> With that, we can pass the page and write range to
> xfs_vm_write_failed() and walk the buffers on the page, looking for
> delalloc buffers that are either new or beyond EOF and punch them
> out. Handling buffers beyond EOF ensures we still handle the
> existing case that xfs_vm_write_failed() handles.
> 
> Of special note is the truncate_pagecache() handling - that should only
> be done for pages outside EOF - pages within EOF can still
> contain valid, dirty data so we must not punch them out of the
> cache.
> 
> That just leaves the xfs_vm_write_end() failure handling.
> The only failure case here is that we didn't copy the entire range,
> and generic_write_end() handles that by zeroing the region of the
> page that wasn't copied,

Are you referring to 
xfs_vm_write_end
  generic_write_end
    block_write_end
      page_zero_new_buffers?

>       we don't have to punch out blocks within
> the file because they are guaranteed to contain zeros. Hence we only
> have to handle the existing "beyond EOF" case and don't need access
> to the buffers on the page. Hence it remains largely unchanged.
> 
> Note that xfs_getbmap() can still trip over delalloc blocks beyond
> EOF that are left there by speculative delayed allocation. Hence
> this bug fix does not solve all known issues with bmap vs delalloc,
> but it does fix all the known accidental occurrences of the
> problem.
> 
> Signed-off-by: Dave Chinner <david@xxxxxxxxxxxxx>
> ---
>  fs/xfs/xfs_aops.c |  173 +++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 127 insertions(+), 46 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 64ed87a..ae31c31 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1184,11 +1184,18 @@ __xfs_get_blocks(
>                                                      &imap, nimaps);
>                       if (error)
>                               return -error;
> +                     new = 1;
>               } else {
>                       /*
>                        * Delalloc reservations do not require a transaction,
> -                      * we can go on without dropping the lock here.
> +                      * we can go on without dropping the lock here. If we
> +                      * are allocating a new delalloc block, make sure that
> +                      * we set the new flag so that we mark the buffer new so
> +                      * that we know that it is newly allocated if the write
> +                      * fails.
>                        */
> +                     if (nimaps && imap.br_startblock == HOLESTARTBLOCK)
> +                             new = 1;
>                       error = xfs_iomap_write_delay(ip, offset, size, &imap);
>                       if (error)
>                               goto out_unlock;
> @@ -1405,52 +1412,91 @@ out_destroy_ioend:
>       return ret;
>  }
>  
> +/*
> + * Punch out the delalloc blocks we have already allocated.

This language is a little confusing.  I suggest saying that delalloc blocks are
reserved and real blocks are allocated, e.g. "Punch out the delalloc blocks we
have already reserved."  Tomato, tomahto.

> + *
> + * Don't bother with xfs_setattr given that nothing can have made it to disk yet
> + * as the page is still locked at this point.
> + */
> +STATIC void
> +xfs_vm_kill_delalloc_range(
> +     struct inode            *inode,
> +     loff_t                  start,
> +     loff_t                  end)
> +{
> +     struct xfs_inode        *ip = XFS_I(inode);
> +     xfs_fileoff_t           start_fsb;
> +     xfs_fileoff_t           end_fsb;
> +     int                     error;
> +
> +     start_fsb = XFS_B_TO_FSB(ip->i_mount, start);
> +     end_fsb = XFS_B_TO_FSB(ip->i_mount, end);
> +     if (end_fsb <= start_fsb)
> +             return;
> +
> +     xfs_ilock(ip, XFS_ILOCK_EXCL);
> +     error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> +                                             end_fsb - start_fsb);
> +     if (error) {
> +             /* something screwed, just bail */
> +             if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> +                     xfs_alert(ip->i_mount,
> +             "xfs_vm_write_failed: unable to clean up ino %lld",

Consider updating the function name in this error message and printing out the
value of error.
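Something along these lines, maybe (the function name and format string are
just my suggestion):

	xfs_alert(ip->i_mount,
	"xfs_vm_kill_delalloc_range: unable to clean up ino %lld, error %d",
			ip->i_ino, error);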

> +                                     ip->i_ino);
> +             }
> +     }
> +     xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +}
> +
>  STATIC void
>  xfs_vm_write_failed(
> -     struct address_space    *mapping,
> -     loff_t                  to)
> +     struct inode            *inode,
> +     struct page             *page,
> +     loff_t                  pos,
> +     unsigned                len)
>  {
> -     struct inode            *inode = mapping->host;
> +     loff_t                  block_offset = pos & PAGE_MASK;
> +     loff_t                  block_start;
> +     loff_t                  block_end;
> +     loff_t                  from = pos & (PAGE_CACHE_SIZE - 1);
> +     loff_t                  to = from + len;
> +     struct buffer_head      *bh, *head;
>  
> -     if (to > inode->i_size) {
> -             /*
> -              * Punch out the delalloc blocks we have already allocated.
> -              *
> -              * Don't bother with xfs_setattr given that nothing can have
> -              * made it to disk yet as the page is still locked at this
> -              * point.
> -              */
> -             struct xfs_inode        *ip = XFS_I(inode);
> -             xfs_fileoff_t           start_fsb;
> -             xfs_fileoff_t           end_fsb;
> -             int                     error;
> +     ASSERT(block_offset + from == pos);
>  
> -             truncate_pagecache(inode, to, inode->i_size);
> +     head = page_buffers(page);
> +     block_start = 0;
> +     for (bh = head; bh != head || !block_start;
> +          bh = bh->b_this_page, block_start = block_end,
> +                                block_offset += bh->b_size) {
> +             block_end = block_start + bh->b_size;
>  
> -             /*
> -              * Check if there are any blocks that are outside of i_size
> -              * that need to be trimmed back.
> -              */
> -             start_fsb = XFS_B_TO_FSB(ip->i_mount, inode->i_size);
> -             end_fsb = XFS_B_TO_FSB(ip->i_mount, to);
> -             if (end_fsb <= start_fsb)
> -                     return;
> -
> -             xfs_ilock(ip, XFS_ILOCK_EXCL);
> -             error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> -                                                     end_fsb - start_fsb);
> -             if (error) {
> -                     /* something screwed, just bail */
> -                     if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> -                             xfs_alert(ip->i_mount,
> -                     "xfs_vm_write_failed: unable to clean up ino %lld",
> -                                             ip->i_ino);
> -                     }
> -             }
> -             xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +             /* skip buffers before the write */
> +             if (block_end <= from)
> +                     continue;
> +
> +             /* if the buffer is after the write, we're done */
> +             if (block_start >= to)

*blink*  I was looking pretty hard at that for an off-by-one.  Mark
straightened me out.  Eesh.  (block_end and to are both exclusive ends, so
'block_end <= from' and 'block_start >= to' are exact.)

> +                     break;
> +
> +             if (!buffer_delay(bh))
> +                     continue;
> +
> +             if (!buffer_new(bh) && block_offset < i_size_read(inode))
> +                     continue;
> +
> +             xfs_vm_kill_delalloc_range(inode, block_offset,
> +                                        block_offset + bh->b_size);
>       }
> +
>  }
>  
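If I follow the walk with the example page from the commit message (512 byte
buffers, page covering 0x16000-0x17000): the six UMA/UMD buffers get skipped by
the buffer_delay() or buffer_new()/i_size checks, the UND buffer at 0x16c00 is
both delalloc and new so it gets xfs_vm_kill_delalloc_range(inode, 0x16c00,
0x16e00), and the block we ENOSPC'd on never got mapped so !buffer_delay()
skips it.  Looks right to me.
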
> +/*
> + * This used to call block_write_begin(), but it unlocks and releases the page
> + * on error, and we need that page to be able to punch stale delalloc blocks out
> + * on failure. hence we copy-n-waste it here and call xfs_vm_write_failed() at
> + * the appropriate point.
> + */
>  STATIC int
>  xfs_vm_write_begin(
>       struct file             *file,
> @@ -1461,15 +1507,40 @@ xfs_vm_write_begin(
>       struct page             **pagep,
>       void                    **fsdata)
>  {
> -     int                     ret;
> +     pgoff_t                 index = pos >> PAGE_CACHE_SHIFT;
> +     struct page             *page;
> +     int                     status;
>  
> -     ret = block_write_begin(mapping, pos, len, flags | AOP_FLAG_NOFS,
> -                             pagep, xfs_get_blocks);
> -     if (unlikely(ret))
> -             xfs_vm_write_failed(mapping, pos + len);
> -     return ret;
> +     ASSERT(len <= PAGE_CACHE_SIZE);
> +
> +     page = grab_cache_page_write_begin(mapping, index,
> +                                        flags | AOP_FLAG_NOFS);
> +     if (!page)
> +             return -ENOMEM;
> +
> +     status = __block_write_begin(page, pos, len, xfs_get_blocks);
> +     if (unlikely(status)) {
> +             struct inode    *inode = mapping->host;
> +
> +             xfs_vm_write_failed(inode, page, pos, len);
> +             unlock_page(page);

Consistent with block_write_begin.

> +
> +             if (pos + len > i_size_read(inode))
> +                     truncate_pagecache(inode, pos + len, i_size_read(inode));
> +
> +             page_cache_release(page);
> +             page = NULL;
> +     }
> +
> +     *pagep = page;
> +     return status;
>  }
>  
> +/*
> + * On failure, we only need to kill delalloc blocks beyond EOF because they
> + * will never be written. For blocks within EOF, generic_write_end() zeros them
> + * so they are safe to leave alone and be written with all the other valid data.
> + */
>  STATIC int
>  xfs_vm_write_end(
>       struct file             *file,
> @@ -1482,9 +1553,19 @@ xfs_vm_write_end(
>  {
>       int                     ret;
>  
> +     ASSERT(len <= PAGE_CACHE_SIZE);
> +
>       ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> -     if (unlikely(ret < len))
> -             xfs_vm_write_failed(mapping, pos + len);
> +     if (unlikely(ret < len)) {
> +             struct inode    *inode = mapping->host;
> +             size_t          isize = i_size_read(inode);
> +             loff_t          to = pos + len;
> +
> +             if (to > isize) {
> +                     truncate_pagecache(inode, to, isize);
> +                     xfs_vm_kill_delalloc_range(inode, isize, to);
> +             }
> +     }
>       return ret;
>  }

Aside from a few nits this is looking good.

Reviewed-by: Ben Myers <bpm@xxxxxxx>
