[Top] [All Lists]


To: Nathan Scott <nathans@xxxxxxx>
Subject: Re: XFS Writes: IOLOCK_EXCL
From: Steve Lord <lord@xxxxxxx>
Date: Wed, 25 Feb 2004 07:58:24 -0600
Cc: Alex Wun <alexwun@xxxxxxxxxxxxx>, linux-xfs@xxxxxxxxxxx
In-reply-to: <20040225062907.GB1832@frodo>
References: <Pine.LNX.4.44.0402240914430.12803-100000@xxxxxxxxxxxxxxxxxxxxxx> <01fa01c3fafa$7c3482d0$9002a8c0@xxxxxxxxxxxxx> <20040225062907.GB1832@frodo>
Sender: linux-xfs-bounce@xxxxxxxxxxx
User-agent: Mozilla Thunderbird 0.5 (X11/20040208)
Nathan Scott wrote:
On Tue, Feb 24, 2004 at 12:20:22PM -0500, Alex Wun wrote:


hi there.

What I've been trying to do is squeeze some more write performance out of my

Good stuff.

I've ended up fiddling with some XFS code and have found that write
performance can improve significantly in certain applications (i.e.
exporting the filesystem over a network).  I've left the system running (and

Can you describe your workload there a bit more?  Do you have
multiple writers to the same file (from different NFS clients?)?
Readers too?  thanks.

writing) over the weekend and it seems to be stable.  Although things look
good for now, I'd like to be certain that I haven't broken anything.

Essentially, this is what I've done:

In the function xfs_fsync():

Instead of:


I essentially have:
                 xfs_iunlock(ip, XFS_IOLOCK_EXCL);
                 xfs_ilock(ip, XFS_IOLOCK_EXCL);

Where VOP_COMMIT_PAGES() + VOP_CLEANUP_PAGES() does the same thing as
VOP_FLUSH_PAGES() but in two parts, first the writing of the pages then the
cleanup.  So I've basically released the XFS_IOLOCK_EXCL during page

By "cleanup" I guess you mean the filemap_fdatawait() call in
fs_flush_pages?  So, you're releasing the IOLOCK inbetween the
initiation of IO, and the wait-for-completion of that IO.  Hmm,
I'll need to think about that - Steve, anything immediately
obvious to you as to whether thats going to go bad?

filemap_fdatawait is only processing already-locked pages, so
other writers would be synchronised by the page lock.  However
another writer can sneak into the window created there, and do
nasty things like updating the inode size or do direct IO when
you weren't looking, etc.  I think, not convinced yet though.

What pointed out this spot in the code as a problem initially?
Is this the call down from NFS where it wants to write a range
within a file, but since theres no such interface it has to sync
all the dirty data?  Someone else was asking me if that could
be improved a little while ago, but there are no VFS interfaces
to provide that functionality (yet?) - XFS would be able to use
that if it existed.

This may not work for everyone but it seems to give me better writing (maybe
strict operating conditions will apply?). But I'd just like to ask, from a
conceptual stand-point, whether this is a legitimate thing to do.

I'm a little concerned that this may be opening a window for
data corruption, but I need to think about it a whole lot.

Thanks for your time,

Thanks for looking into this.


Only speaking for 2.6 here, so take that under advisement.

There are very few callers of fsync in the kernel, and they all do the
data flush and wait for data themselves, the filesystem fsync code is
not involved in the data flush at all for most filesystems, just the
metadata for the file. The callers hold the i_sem, but not the nfs
commit case.

So, in 2.6, xfs_fsync needs gutting, the VOP_FLUSH_PAGES call can go
completely. Since FSYNC_INVAL is not used (at least not in the kernel
tree code, Nathan you know where to look), the VOP_FLUSHINVAL_PAGES
could go to, and with it all the code associated with the iolock.

As for flushing data without the IO lock, I think this is OK in general,
if a new read comes along there is no problem, if a new write modifies
the page then it will get redirtied and flushed again later. If a
truncate removes the page then it will delete the page before after
it is flushed. The holding of the IO lock is partially a hold over
from Irix where it is required around the VOP_FLUSH..... calls.

The tosspages and flushinval pages cases do need the lock though.

This is untested, but should generally speed things up in 2.6. A very
quick scan of 2.4 seems to suggest a similar change will work there


        bhv_desc_t      *bdp,
        int             flag,
        cred_t          *credp,
        xfs_off_t       start,
        xfs_off_t       stop)
        xfs_inode_t     *ip;
        int             error;
        xfs_trans_t     *tp;

vn_trace_entry(BHV_TO_VNODE(bdp), __FUNCTION__, (inst_t *) __return_address);

        ip = XFS_BHVTOI(bdp);

        ASSERT(start >= 0 && stop >= -1);

        if (XFS_FORCED_SHUTDOWN(ip->i_mount))
                return XFS_ERROR(EIO);

        xfs_ilock(ip, XFS_IOLOCK_EXCL);

         * We always need to make sure that the required inode state
         * is safe on disk.  The vnode might be clean but because
         * of committed transactions that haven't hit the disk yet.
         * Likewise, there could be unflushed non-transactional
         * changes to the inode core that have to go to disk.
         * The following code depends on one assumption:  that
         * any transaction that changes an inode logs the core
         * because it has to change some field in the inode core
         * (typically nextents or nblocks).  That assumption
         * implies that any transactions against an inode will
         * catch any non-transactional updates.  If inode-altering
         * transactions exist that violate this assumption, the
         * code breaks.  Right now, it figures that if the involved
         * update_* field is clear and the inode is unpinned, the
         * inode is clean.  Either it's been flushed or it's been
         * committed and the commit has hit the disk unpinning the inode.
         * (Note that xfs_inode_item_format() called at commit clears
         * the update_* fields.)
        xfs_ilock(ip, XFS_ILOCK_SHARED);

        /* If we are flushing data then we care about update_size
         * being set, otherwise we care about update_core
        if ((flag & FSYNC_DATA) ?
                        (ip->i_update_size == 0) :
                        (ip->i_update_core == 0)) {
                 * Timestamps/size haven't changed since last inode
                 * flush or inode transaction commit.  That means
                 * either nothing got written or a transaction
                 * committed which caught the updates.  If the
                 * latter happened and the transaction hasn't
                 * hit the disk yet, the inode will be still
                 * be pinned.  If it is, force the log.

                xfs_iunlock(ip, XFS_ILOCK_SHARED);

                if (xfs_ipincount(ip)) {
                        xfs_log_force(ip->i_mount, (xfs_lsn_t)0,
                                      XFS_LOG_FORCE |
                                      ((flag & FSYNC_WAIT)
                                       ? XFS_LOG_SYNC : 0));
                error = 0;
        } else  {
                 * Kick off a transaction to log the inode
                 * core to get the updates.  Make it
                 * sync if FSYNC_WAIT is passed in (which
                 * is done by everybody but specfs).  The
                 * sync transaction will also force the log.
                xfs_iunlock(ip, XFS_ILOCK_SHARED);
                tp = xfs_trans_alloc(ip->i_mount, XFS_TRANS_FSYNC_TS);
                if ((error = xfs_trans_reserve(tp, 0,
                                0, 0, 0)))  {
                        xfs_trans_cancel(tp, 0);
                        return error;
                xfs_ilock(ip, XFS_ILOCK_EXCL);

                 * Note - it's possible that we might have pushed
                 * ourselves out of the way during trans_reserve
                 * which would flush the inode.  But there's no
                 * guarantee that the inode buffer has actually
                 * gone out yet (it's delwri).  Plus the buffer
                 * could be pinned anyway if it's part of an
                 * inode in another recent transaction.  So we
                 * play it safe and fire off the transaction anyway.
                xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
                xfs_trans_ihold(tp, ip);
                xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
                if (flag & FSYNC_WAIT)
                error = xfs_trans_commit(tp, 0, NULL);

                xfs_iunlock(ip, XFS_ILOCK_EXCL);
        return error;

<Prev in Thread] Current Thread [Next in Thread>