
[RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc

To: xfs@xxxxxxxxxxx
Subject: [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Mon, 4 Oct 2010 21:13:54 +1100
When multiple concurrent streaming writes land in the same AG,
allocation of extents interleaves between inodes and causes
excessive fragmentation of the files being written. Instead of
getting maximally sized extents, we get writeback-range-sized
extents interleaved on disk. That is, for four files A, B, C and D,
we'll end up with extents like:

   +---+---+---+---+---+---+---+---+---+---+---+---+
     A1  B1  C1  D1  A2  B2  C2  A3  D2  C3  B3  D3 .....

instead of:

   +-----------+-----------+-----------+-----------+
         A           B           C           D

It is well known that using the allocsize mount option makes the
allocator behaviour much better and more likely to result in
the second layout above than the first, but that doesn't work in all
situations (e.g. writes from the NFS server). I think that we should
not be relying on manual configuration to solve this problem.

To demonstrate, writing 4 x 64GB files in parallel (16TB volume,
inode64 so all files land in same AG, 700MB/s write speed)

$ for i in `seq 0 1 3`; do
> dd if=/dev/zero of=/mnt/scratch/test.$i bs=64k count=1048576 &
> done
....

results in:

$ for i in `seq 0 1 3`; do
> sudo xfs_bmap -vvp /mnt/scratch/test.$i | grep ": \[" | wc -l
> done
777
196
804
784
$

This shows an average extent size of roughly 80MB on three of the
files, and 320MB on the other. The level of fragmentation varies
throughout the files, and varies greatly from run to run. To
demonstrate allocsize=1g:

$ for i in `seq 0 1 3`; do
> sudo xfs_bmap -vvp /mnt/scratch/test.$i | grep ": \[" | wc -l
> done
64
64
64
64
$

Which is 64x1GB extents per file, as we would expect. However, we
can do better than that - with this dynamic speculative
preallocation patch:

$ for i in `seq 0 1 3`; do
> sudo xfs_bmap -vvp /mnt/scratch/test.$i | grep ": \[" | wc -l
> done
9
9
9
9
$

Which gives maximally sized extents of 8GB (i.e. perfect):

$ sudo xfs_bmap -vv /mnt/scratch/test.0
/mnt/scratch/test.0:
 EXT: FILE-OFFSET             BLOCK-RANGE           AG AG-OFFSET                   TOTAL
   0: [0..16777207]:          96..16777303           0 (96..16777303)           16777208
   1: [16777208..33554295]:   91344616..108121703    0 (91344616..108121703)   16777088
   2: [33554296..50331383]:   158452968..175230055   0 (158452968..175230055)  16777088
   3: [50331384..67108471]:   225561320..242338407   0 (225561320..242338407)  16777088
   4: [67108472..83885559]:   292669672..309446759   0 (292669672..309446759)  16777088
   5: [83885560..100662647]:  359778024..376555111   0 (359778024..376555111)  16777088
   6: [100662648..117439735]: 426886376..443663463   0 (426886376..443663463)  16777088
   7: [117439736..134216823]: 510771816..527548903   0 (510771816..527548903)  16777088
   8: [134216824..134217727]: 594657256..594658159   0 (594657256..594658159)       904
$

The same results occur for tests running 16 and 64 sequential
writers into the same AG - extents of 8GB in all files, so
this is a major improvement in default behaviour and effectively
means we do not need the allocsize mount option anymore.

Worth noting is that the extents still interleave between files -
that problem still exists - but the size of the extents now means
that sequential read and write rates are not going to be affected
by excessive seeks between extents within each file.

Given this demonstrably improves allocation patterns, the only
question that remains in my mind is exactly what algorithm to use to
scale the preallocation. The current patch records the last
prealloc size and increases the next one from that. While that
provides good results, it will cause problems when interacting with
truncation. It also means that a file may have a substantial amount
of preallocation beyond EOF - maybe several times the size of the
file.

However, the current algorithm does work well when writing lots of
relatively small files (e.g. up to a few tens of megabytes), as
quickly increasing the preallocation size reduces the chances of
interleaving small allocations.

I've been thinking that basing the preallocation size on the current
file size - say, preallocating half the size of the file - is a
better option once file sizes start to grow large (more than a few
tens of megabytes). So maybe a combination of the two is a better
idea: increase exponentially up to default^2 (4MB prealloc), then
take min(max(i_size / 2, default^2), XFS_MAXEXTLEN) as the prealloc
size so that we don't do excessive amounts of preallocation?
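To make the combined algorithm concrete, here is a rough userspace
sketch of that scaling. All names and block counts are illustrative
stand-ins, not the real kernel values: with 4k blocks, a "default" of
32 blocks squared gives the 4MB cap mentioned above, and XFS_MAXEXTLEN
is the real maximum extent length in filesystem blocks.

```c
#include <stdint.h>

/* Illustrative stand-ins, not the real XFS defaults. Units are
 * filesystem blocks (assume 4k): 32^2 = 1024 blocks = 4MB. */
#define DEF_BLOCKS      32ULL
#define PREALLOC_CAP    (DEF_BLOCKS * DEF_BLOCKS)   /* default^2 */
#define XFS_MAXEXTLEN   ((1ULL << 21) - 1)          /* max extent length */

/*
 * Sketch of the proposed scaling: double the previous preallocation
 * until it reaches default^2, then size off the file instead:
 * min(max(i_size / 2, default^2), XFS_MAXEXTLEN).
 */
static uint64_t next_prealloc(uint64_t last_prealloc, uint64_t isize_blocks)
{
    if (last_prealloc < PREALLOC_CAP)
        return last_prealloc ? last_prealloc * 2 : DEF_BLOCKS;

    uint64_t want = isize_blocks / 2;
    if (want < PREALLOC_CAP)
        want = PREALLOC_CAP;            /* never shrink below the cap */
    if (want > XFS_MAXEXTLEN)
        want = XFS_MAXEXTLEN;           /* bounded by max extent length */
    return want;
}
```

The exponential phase keeps small files from interleaving; the
i_size-based phase bounds preallocation to a fraction of the file so
it never grows to several times the file size.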

--

We need to make the same write patterns result in equivalent
allocation patterns even when they come through the NFS server.
Right now the NFS server uses a file descriptor for each write that
comes across the wire. This means that the ->release function is
called after every write, and that means XFS will be truncating away
the speculative preallocation it did during the write. Hence we get
extents interleaved between files, and fragmentation.

To avoid this problem, detect when the ->release function is being
called repeatedly on an inode that has delayed allocation
outstanding. If this happens twice in a row, then don't truncate the
speculative allocation away. This ensures that the speculative
preallocation is preserved until the delalloc blocks are converted
to real extents during writeback.
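A minimal sketch of that ->release heuristic, with hypothetical names
and the inode state reduced to two flags purely for illustration:

```c
#include <stdbool.h>

/* Hypothetical, reduced inode state for illustration only. */
struct sketch_inode {
    bool has_delalloc;          /* delayed allocation still outstanding */
    bool prev_release_delalloc; /* previous ->release saw delalloc too */
};

/* Returns true when ->release should truncate prealloc beyond EOF. */
static bool release_should_truncate(struct sketch_inode *ip)
{
    if (!ip->has_delalloc) {
        ip->prev_release_delalloc = false;
        return true;    /* nothing pending; safe to trim as usual */
    }
    if (ip->prev_release_delalloc)
        return false;   /* second release in a row with delalloc:
                         * preserve the speculative prealloc */
    ip->prev_release_delalloc = true;
    return true;        /* first release with delalloc: truncate once */
}
```

Under this sketch, the first ->release with delalloc outstanding still
truncates (hence the small first extent noted below), and every
repeated one leaves the preallocation in place until writeback
converts the delalloc blocks.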

The result of this is that concurrent files written by NFS will tend
to have a small first extent (due to speculative prealloc being
truncated once), followed by 4-8GB extents that interleave
identically to the above local dd examples. I have tested this for
4, 16 and 64 concurrent writers from multiple NFS clients. The
result for 2 clients each writing 16x16GB files (32 all up):

$ for i in `seq 0 1 31`; do
> sudo xfs_bmap -vv /mnt/scratch/test.$i |grep ": \[" | wc -l
> done | uniq -c
    1 2
   31 3

Mostly a combination of 4GB and 8GB extents, instead of severe
fragmentation. The typical layout was:

/mnt/scratch/test.1:
 EXT: FILE-OFFSET           BLOCK-RANGE            AG AG-OFFSET                  TOTAL
   0: [0..8388607]:         225562280..233950887    0 (225562280..233950887)   8388608
   1: [8388608..25165815]:  410111608..426888815    0 (410111608..426888815)  16777208
   2: [25165816..33554431]: 896648152..905036767    0 (896648152..905036767)   8388616

These results are using NFSv3, and the per-file write rate is only
~3MB/s.  Hence it can be seen that the dynamic preallocation works
for both high and low per-file write throughput.

Comments welcome.
