On Fri, May 20, 2011 at 05:49:20PM +0200, Marc Lehmann wrote:
> On Fri, May 20, 2011 at 12:56:59PM +1000, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> [thanks for the thorough explanation]
> > So the question there: how is your workload accessing the files? Is
> > it opening and closing them multiple times in quick succession after
> > writing them?
> I don't think so, but of course, when compiling a file, it will be linked
> afterwards, so I guess it would be accessed at least once.
Ok, I'll see if I can reproduce it locally.
> > I think it is triggering the "NFS server access pattern" logic and so
> > keeping speculative preallocation around for longer.
> Longer meaning practically infinitely :)
No, longer meaning the in-memory lifecycle of the inode.
> > I'd suggest removing the allocsize mount option - you shouldn't need
> > it anymore because the new default behaviour resists fragmentation a
> > whole lot better than pre-2.6.38 kernels.
> I did remove it already, and will actively try this on our production
> servers, which suffer from severe fragmentation (but xfs_fsr fixes that,
> with some work suspending the logfile writing, anyway).
Log file writing - append-only workloads - is one case where dynamic
speculative preallocation can make a significant difference.
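To illustrate why, here's a toy model of the effect (illustrative only -
this is not the actual kernel algorithm, and the doubling rule and
constants are simplified assumptions):

```python
# Toy model: speculative preallocation for an append-only file.
# Each write that extends past the current on-disk reservation forces
# a new allocation (a new extent, in the worst case). With dynamic
# preallocation, the reservation roughly doubles as the file grows,
# so an append-only log ends up with a handful of large extents
# instead of one tiny extent per write.

def extents_for_append(total_writes, write_size, dynamic=True):
    """Worst-case number of allocations for an append-only workload."""
    extents = 0
    allocated = 0   # space already reserved on disk
    eof = 0         # logical file size
    for _ in range(total_writes):
        eof += write_size
        if eof > allocated:
            if dynamic:
                # speculative preallocation: reserve out past EOF,
                # roughly doubling the reservation each time
                allocated = max(eof, allocated * 2, write_size)
            else:
                # no preallocation: allocate exactly what is needed
                allocated = eof
            extents += 1
    return extents

# 10,000 appends of 4 KiB each:
no_prealloc = extents_for_append(10_000, 4096, dynamic=False)  # 10,000
with_prealloc = extents_for_append(10_000, 4096, dynamic=True) # ~15
```

The point being: the allocation count drops from linear in the number of
writes to logarithmic in the file size, which is exactly the
fragmentation resistance the append-only case benefits from.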
> However, I would suggest that whatever heuristic 2.6.38 uses is deeply
> broken at the moment,
One bug report two months after general availability != deeply broken.
> as NFS was not involved here at all (so no need for
> it), the usage pattern was a simple compile-then-link-pattern (which is
> very common),
While using a large allocsize mount option, which is relatively
rare. Basically, you've told XFS to optimise allocation for large
files and then are running workloads with lots of small files. It's
no surprise that there are issues, and you don't need the changes
in 2.6.38 to get bitten by this problem....
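A back-of-envelope illustration of the mismatch (the numbers here are
hypothetical - allocsize=512m and the file counts are just example
values, not taken from your report):

```python
# Hypothetical worst case: with a fixed allocsize mount option, every
# file being written can hold up to allocsize of speculative
# preallocation for as long as the preallocation is kept around.

MiB = 1024 * 1024

def apparent_usage(n_files, file_size, allocsize):
    """Worst-case space held while preallocations are still in place."""
    per_file = max(file_size, allocsize)  # small files rounded up
    return n_files * per_file

# 1,000 object files of ~50 KiB each:
held = apparent_usage(1000, 50 * 1024, 512 * MiB)  # ~500 GiB held
real = apparent_usage(1000, 50 * 1024, 0)          # ~50 MiB of data
```

Three orders of magnitude between real data and held space is what a
large-file tuning applied to a small-file workload looks like.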
> and there is really no need to cache this preallocation for
> files that have been closed 8 hours ago and never touched since then.
If the preallocation had been sized by the dynamic behaviour, you
wouldn't even have noticed this. So really what you are saying is
that it is excessive for your current configuration and workload.
If I can reproduce it, I'll have a think about how to tweak it
better for allocsize filesystems. However, I'm not going to start to
add lots of workload-dependent tweaks to this code - the default
behaviour is much better and in most cases removes the problems that
led to using allocsize in the first place. So removing allocsize
from your config is, IMO, the correct fix, not tweaking heuristics in
the kernel.
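For reference, dropping allocsize is just a matter of removing the
option from the mount entry and remounting (the device, mountpoint and
other options below are placeholders):

```
# /etc/fstab -- before:
/dev/sdb1  /data  xfs  allocsize=512m,noatime  0 0

# after (defaults let XFS size preallocation dynamically):
/dev/sdb1  /data  xfs  noatime  0 0
```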