On Thu 17-05-12 17:43:08, Dave Chinner wrote:
> On Wed, May 16, 2012 at 03:04:45PM +0200, Jan Kara wrote:
> > On Wed 16-05-12 12:14:23, Dave Chinner wrote:
> > > On Wed, May 16, 2012 at 12:48:05AM +0200, Jan Kara wrote:
> > > > It's not easy to protect against these races. For truncate, i_size
> > > > protects
> > > > us against similar races but for hole punching we don't have any such
> > > > mechanism. One way to avoid the race would be to hold mmap_sem while we
> > > > are
> > > > invalidating the page cache and punching hole but that sounds a bit
> > > > ugly.
> > > > Alternatively we could just have some special lock (rwsem?) held during
> > > > page_mkwrite() (for reading) and during whole hole punching (for
> > > > writing)
> > > > to serialize these two operations.
> > >
> > > What really needs to happen is that .page_mkwrite() can be made to
> > > fail with -EAGAIN and retry the entire page fault from the start an
> > > arbitrary number of times instead of just once as the current code
> > > does with VM_FAULT_RETRY. That would allow us to try to take the
> > > filesystem lock that provides IO exclusion for all other types of IO
> > > and fail with EAGAIN if we can't get it without blocking. For XFS,
> > > that is the i_iolock rwsem, for others it is the i_mutex, and some
> > > other filesystems might take other locks.
> > Actually, I've been playing with VM_FAULT_RETRY recently (for freezing
> > patches) and it's completely unhandled for .page_mkwrite() callbacks.
> Yeah, it's a mess.
> > Also
> > only x86 really tries to handle it at all. Other architectures just don't
> > allow it at all. Also there's a ton of callers of things like
> > get_user_pages() which would need to handle VM_FAULT_RETRY and for some of
> > them it would be actually non-trivial.
> Seems kind of silly to me to have a generic retry capability in the
> page fault handler and then not implement it in a useful manner for
Yeah. It's only tailored for one specific use in filemap_fault() on
> > But in this particular case, I don't think VM_FAULT_RETRY is strictly
> > necessary. We can have a lock, which ranks below mmap_sem (and thus
> > i_mutex / i_iolock) and above i_mmap_mutex (thus page lock), transaction
> > start, etc. Such lock could be taken in page_mkwrite() before taking page
> > lock, in truncate() and punch_hold() just after i_mutex, and direct IO
> > paths could be tweaked to use it as well I think.
> Which means we'd be adding another layer of mostly redundant locking
> just to avoid i_mutex/mmap_sem inversion. But I don't see how it
> solves the direct IO problem because we still need to grab the
> mmap_sem inside the IO during get_user_pages_fast() while holding
Yeah, direct IO case won't be trivial. We would have to take the lock
after dio_get_page() and release it before going for next page. While the
lock was released someone could fault in pages into the area where direct
IO is happening so we would have to invalidate them while holding the lock
again. It seems it would work but I agree it's probably too ugly to live
given how abstract the problem is...
> > > FWIW, I've been running at "use the IO lock in page_mkwrite" patch
> > > for XFS for several months now, but I haven't posted it because
> > > without the VM side being able to handle such locking failures
> > > gracefully there's not much point in making the change. I did this
> > > patch to reduce the incidence of mmap vs direct IO races that are
> > > essentially identical in nature to rule them out of the cause of
> > > stray delalloc blocks in files that fsstress has been producing on
> > > XFS. FYI, this race condition hasn't been responsible for any of the
> > > problems I've found recently....
> > Yeah, I've been trying to hit the race window for a while and I failed as
> > well...
> IIRC, it's a rare case (that I consider insane, BTW): read from a
> file with into a buffer that is a mmap()d region of the same file
> that has not been faulted in yet.....
With punch hole, the race is less insane - just punching hole in the area
which is accessed via mmap could race in a bad way AFAICS.
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR