On Mon, 2010-10-04 at 21:13 +1100, Dave Chinner wrote:
> When multiple concurrent streaming writes land in the same AG,
> allocation of extents interleaves between inodes and causes
> excessive fragmentation of the files being written. That is, instead
> of getting maximally sized extents, we'll get writeback range sized
> extents interleaved on disk. That is, for four files A, B, C and D,
> we'll end up with extents like:
> A1 B1 C1 D1 A2 B2 C2 A3 D2 C3 B3 D3 .....
> instead of:
> A B C D
> It is well known that using the allocsize mount option makes the
> allocator behaviour much better and more likely to result in
> the second layout above than the first, but that doesn't work in all
> situations (e.g. writes from the NFS server). I think that we should
> not be relying on manual configuration to solve this problem.
. . . (deleting some of your demonstration detail)
> The same results occur for tests running 16 and 64 sequential
> writers into the same AG - extents of 8GB in all files, so
> this is a major improvement in default behaviour and effectively
> means we do not need the allocsize mount option anymore.
> Worth noting is that the extents still interleave between files -
> that problem still exists - but the size of the extents now means
> that sequential read and write rates are not going to be affected
> by excessive seeks between extents within each file.
Just curious--do we have any current and meaningful
information about the trade-off between the size of an
extent and seek time? Obviously maximizing the extent
size will maximize the bang (data read) for the buck (seek
cost) but can we quantify that with current storage device
specs? (This is really a theoretical aside...)
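To put a rough shape on that question, here is a back-of-envelope
sketch. It assumes one seek per extent and a fixed sequential transfer
rate; the 8 ms and 100 MB/s figures are made-up round numbers, not
measured device specs, and the function names are only illustrative.

```python
# Back-of-envelope model; every number here is an assumption, not a device spec.
def streaming_efficiency(extent_bytes, seek_s, bandwidth_bps):
    """Fraction of wall time spent transferring data when each extent costs one seek."""
    transfer_s = extent_bytes / bandwidth_bps
    return transfer_s / (seek_s + transfer_s)


def extent_for_efficiency(target, seek_s, bandwidth_bps):
    """Extent size (bytes) at which `target` of the time goes to data transfer."""
    # Solve target = t / (seek + t) for the transfer time t, then convert to bytes.
    return target / (1.0 - target) * seek_s * bandwidth_bps


if __name__ == "__main__":
    SEEK_S = 0.008          # assumed 8 ms average seek plus rotational delay
    BANDWIDTH_BPS = 100e6   # assumed 100 MB/s sequential transfer rate
    for size in (64 * 1024, 1 << 20, 16 << 20, 1 << 30):
        eff = streaming_efficiency(size, SEEK_S, BANDWIDTH_BPS)
        print(f"{size:>12} bytes: {eff:.1%} of time spent transferring")
```

Under this model the break-even point, where half the time goes to
seeking, is simply seek time multiplied by bandwidth (about 800 KB
with the assumed numbers), and the gain from making extents larger
flattens out quickly beyond a few multiples of that.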
> Given this demonstrably improves allocation patterns, the only
> question that remains in my mind is exactly what algorithm to use to
> scale the preallocation. The current patch records the last
> prealloc size and increases the next one from that. While that
> provides good results, it will cause problems when interacting with
> truncation. It also means that a file may have a substantial amount
> of preallocation beyond EOF - maybe several times the size of the
I honestly haven't looked into this yet, but can you expand on
the truncation problems you mention? Is it that the preallocated
blocks should be dropped and the scaling algorithm should be
reset when a truncation occurs or something?
> However, the current algorithm does work well when writing lots of
> relatively small files (e.g. up to a few tens of megabytes), as
> increasing the preallocation size fast reduces the chances of
> interleaving small allocations.
One thing that I keep wondering about as I think about this
is what the effect is as the file system (or AG) gets full,
and what level of "full" is enough to make any adverse
effects of a change like this start to show up. The other
thing is, what sort of workloads are reasonable things to
use to gauge the effect? NFS is perhaps common, but it's
unique in how it closes files all the time. What happens
when there's a more "normal" (non-NFS) workload? For AGs
with enough free space I suppose it's a win overall.
> I've been thinking that basing the preallocation size on the current
> file size - say preallocate half the size of the file, is a better
> option once file sizes start to grow large (more than a few tens of
> megabytes), so maybe a combination of the two is a better idea
> (increase exponentially up to default^2 (4MB prealloc), then take
> min(max(i_size / 2, default^2), XFS_MAXEXTLEN) as the prealloc size
> so that we don't do excessive amounts of preallocation)?
I think basing it on the file size is a good idea; it
scales the (initial) preallocation size to the specific
file. This would assume that files tend to grow by
amounts comparable to their size rather than suddenly
and dramatically changing. That seems reasonable but
I have nothing empirical to back up that assumption.
Similarly, the assumption that once a file starts to
grow you should rapidly increase the EOF preallocation
goal seems sensible--certainly for the hindsight case
of a stream of appends--but I have no proof that a
normal use case wouldn't trigger this algorithm when
it might be better not to.
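To make sure I am reading the combined proposal correctly, here is
how I would sketch it. This is only my interpretation: the doubling
step, the 64 KiB starting size, the 4 KiB block size and the function
name are assumptions of mine, and sizes are in bytes rather than
filesystem blocks to keep it simple.

```python
# Sketch of the combined scaling rule as I read it; not the patch's actual code.
DEFAULT_PREALLOC = 64 * 1024             # assumed default speculative prealloc
EXP_LIMIT = 4 * 1024 * 1024              # the "default^2 (4MB prealloc)" ceiling
MAX_EXTENT_BYTES = (2**21 - 1) * 4096    # assumed XFS_MAXEXTLEN with 4 KiB blocks


def next_prealloc(last_prealloc, i_size):
    """Return the next EOF preallocation size in bytes."""
    if last_prealloc == 0:
        # First speculative allocation for this file.
        return DEFAULT_PREALLOC
    if last_prealloc < EXP_LIMIT:
        # Exponential phase: double until the ceiling is reached.
        return min(last_prealloc * 2, EXP_LIMIT)
    # File-size phase: half the file size, clamped to [EXP_LIMIT, max extent].
    return min(max(i_size // 2, EXP_LIMIT), MAX_EXTENT_BYTES)


if __name__ == "__main__":
    size = 0
    for i_size in (0, 0, 0, 0, 0, 0, 0, 100 << 20, 1 << 30, 1 << 40):
        size = next_prealloc(size, i_size)
        print(f"i_size={i_size:>14}  prealloc={size}")
```

If that matches what you have in mind, the open question is still
what should reset last_prealloc (truncate, close, or something else).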
> We need to make the same write patterns result in equivalent
> allocation patterns even when they come through the NFS server.
> Right now the NFS server uses a file descriptor for each write that
> comes across the wire. This means that the ->release function is
> called after every write, and that means XFS will be truncating away
> the speculative preallocation it did during the write. Hence we get
> interleaving files and fragmentation.
It could be useful to base the behavior on actual knowledge
that a file system is being exported by NFS. But it may well
be that other applications (like shell scripts that loop and
append to the same file repeatedly) might benefit.
> To avoid this problem, detect when the ->release function is being
> called repeatedly on an inode that has delayed allocation
> outstanding. If this happens twice in a row, then don't truncate the
> speculative allocation away. This ensures that the speculative
> preallocation is preserved until the delalloc blocks are converted
> to real extents during writeback.
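For my own understanding, the heuristic as described could be
modelled roughly like this. It is a toy model and not the patch: the
Inode class, the dirty_release flag and the helper names are all
hypothetical stand-ins for whatever the real inode state looks like.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    """Toy inode state; field names are hypothetical."""
    delalloc_outstanding: bool = False   # delayed allocation not yet written back
    dirty_release: bool = False          # a release already saw outstanding delalloc
    eof_prealloc: int = 0                # bytes of speculative prealloc beyond EOF


def write(inode, prealloc_bytes):
    """A buffered write creates delalloc blocks plus speculative preallocation."""
    inode.delalloc_outstanding = True
    inode.eof_prealloc = prealloc_bytes


def writeback(inode):
    """Writeback converts delalloc blocks to real extents."""
    inode.delalloc_outstanding = False


def release(inode):
    """Model of ->release; returns True if the EOF preallocation was truncated."""
    if inode.dirty_release and inode.delalloc_outstanding:
        # Second release in a row with delalloc outstanding: keep the prealloc.
        return False
    inode.eof_prealloc = 0
    # Remember whether this release happened with delalloc still outstanding.
    inode.dirty_release = inode.delalloc_outstanding
    return True
```

In this model the first write/release pair still loses its
preallocation, and every following pair keeps it until a release
arrives with no delalloc outstanding, which clears the flag again.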
. . .
I have a few other comments in my reviews of your two patches.
. . .
> Comments welcome.
You got some...