[ ... ]
>> As to this, in theory even having split the files among 4
>> AGs, the upload from system RAM to host adapter RAM and then
>> to disk could happen by writing first all the dirty blocks
>> for one AG, then a long seek to the next AG, and so on, and
>> the additional cost of 3 long seeks would be negligible.
> Yes, that’s exactly what I had in mind, and what prompted me
> to write this post. It would be about 10 times as fast.
Ahhh yes, but let's go back to this and summarize some of my
points:
* If the scheduling order was by AG, and the hardware was
parallel, the available parallelism would not be exploited,
(and fragmentation might be worse) as if there were only a
single AG. And XFS does let you configure the number of AGs
in part for that reason.
* Your storage layer does not seem to deliver parallel
operations: as the ~100MB/s overall 'ext4' speed and the
seek graphs show, in effect your 4+2 RAID6 performs in this
case as if it were a single drive with a single arm.
* Even with the actual scheduling at the Linux level being by
interleaving AGs in XFS, your host adapter with a BBWC
should be able to reorder them, in 256MiB lots, ignoring
Linux level barriers and ordering, but it seems that this is
not happening.
So the major things to look into seem to me:
* Ensure that your RAID set can deliver the parallelism at
which XFS is targeted, with the bulk transfer rates that it
* Otherwise figure out ways to ensure that the IO transactions
generated by XFS are not in interleave-AG order.
* Otherwise figure out ways to get the XFS IO ordering
rearranged at the storage layer in spacewise order.
Summarizing some of the things to try, and some of them are
rather tentative, because you have a rather peculiar corner
case:
* Change the flusher to writeout incrementally instead of just
at 'sync' time, e.g. every 1-2 seconds. In some similar
cases this makes things a lot better, as large 'uploads' to
the storage layer from the page cache can cause damaging
latencies. But the success of this may depend on having a
properly parallel storage layer, at least for XFS.
* Use a different RAID setup. If the RAID set is used only for
reproducible data, a RAID0, else a RAID10, or even a RAID5
with a small chunk size.
* Check the elevator and cache policy on the P400, if they are
settable. Too bad many RAID host adapters have (euphemism)
hideous fw (many older 3ware models come to mind) with some
undocumented (euphemism) peculiarities as to scheduling.
* Tweak 'queue/nr_requests' and 'device/queue_depth'. Probably
they should be big (hundreds/thousands), but various
settings should be tried as fw sometimes is so weird.
* Given that it is now established that your host adapter has
BBWC, consider switching the Linux elevator to 'noop', so as
to leave IO scheduling to the host adapter fw, and reduce
issue latency. 'queue/nr_requests' may be set to a very low
number here perhaps, but my guess is that it shouldn't matter.
* Alternatively if the host adapter fw insists on not
reordering IO from the Linux level, use Linux elevator
settings that behave similarly to 'anticipatory'.
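
To make the flusher and queue suggestions in the list above a bit
more concrete, here is a rough sketch of where those knobs live.
The device name 'sda' and every number are my assumptions, to be
varied by experiment, not recommendations. Note also that
'device/queue_depth' is only present for some drivers, and that
'noop' is only offered by kernels with the legacy block layer. The
script defaults to a dry run that just prints what it would write:

```shell
#!/bin/sh
# Sketch only: DEV and all values are assumptions to experiment with.
DEV="${DEV:-sda}"          # hypothetical block device behind the host adapter
DRY_RUN="${DRY_RUN:-1}"    # set to 0 (as root) to really apply the settings

set_knob() {
    # $1 = file under /proc or /sys, $2 = value to write into it
    if [ "$DRY_RUN" = "1" ]; then
        echo "would write '$2' to $1"
    else
        echo "$2" > "$1"
    fi
}

tune_flusher() {
    # Wake the flusher every 1s and consider dirty pages old after 2s,
    # so that writeout is incremental rather than one burst at 'sync'.
    set_knob /proc/sys/vm/dirty_writeback_centisecs 100
    set_knob /proc/sys/vm/dirty_expire_centisecs 200
}

tune_queue() {
    # Leave ordering to the host adapter fw and give it a deep queue.
    set_knob "/sys/block/$DEV/queue/scheduler" noop
    set_knob "/sys/block/$DEV/queue/nr_requests" 512
    set_knob "/sys/block/$DEV/device/queue_depth" 128
}

tune_flusher
tune_queue
```

Run it as is to see the planned writes, then with 'DRY_RUN=0' as
root to apply them; nothing here survives a reboot.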
It may help to use Bonnie (Garloff's 1.4 version with
'-o_direct') to give a rough feel of filetree speed profile, for
example I tend to use these options:
Bonnie -y -u -o_direct -s 2000 -v 2 -d "$DIR"
Ultimately even 'ext4' does not seem the right filesystem for
this workload either, because all these "legacy" filesystems are
targeted at situations where data is much bigger than memory,
and you are trying to fit them into a very specific corner case
where the opposite is true.
Making my fantasy run wild, my guess is that your workload is
not 'tar x', but release building, where sources and objects fit
entirely in memory, and you are only concerned with persisting
the sources because you want to do several builds from that set
of sources without re-tar-x-ing them, and ideally you would like
to reduce build times by building several objects in parallel.
BTW your corner case then has another property here: that disk
writes greatly exceed disk reads, because you would only write
once the sources and then read them from cache every time
thereafter while the system is up. I doubt also that you would
want to persist the generated objects themselves, but only the
generated final "package" containing them, which might suggest
building the objects to a 'tmpfs', unless you want them
persisted (a bit) to make builds restartable.
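
As a guess at what building the objects to a 'tmpfs' could look
like, here is a small sketch. The directory, the size, the job
count and the 'make O=' convention for an out-of-tree object
directory are all my assumptions about your build, so adapt them.
It only prints the commands, since the mount needs root:

```shell
#!/bin/sh
# Sketch only: every name and value below is hypothetical.
OBJDIR="${OBJDIR:-/tmp/build-obj}"   # where the objects would go
SIZE="${SIZE:-4g}"                   # upper bound on the tmpfs size
JOBS="${JOBS:-4}"                    # number of parallel compile jobs

plan_build() {
    # Print the commands rather than run them.
    echo "mkdir -p $OBJDIR"
    echo "mount -t tmpfs -o size=$SIZE tmpfs $OBJDIR"
    echo "make -j$JOBS O=$OBJDIR"
}

plan_build
```

The objects then never reach the RAID set at all, and only the
final "package" needs to be copied to persistent storage.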
If that's the case, and you cannot fix the storage layer to be
more suitable for 'ext4' or XFS, consider using NILFS2, or even
'ext2' (with a long flusher interval perhaps).
Note: or "cheat" and do your builds to a flash SSD, as they
both run a fw layer that implements a COW/logging allocation
strategy, and have nicer seek times :-).
> That’s what bothers me so much.
And in case you did not get this before, I have a long standing
pet peeve about abusing filesystems for small file IO, or other
ways of going against the grain of what is plausible, which I
call the "syntactic approach" (every syntactically valid system
configuration is assumed to work equally well...).
Some technical postscripts:
* It seems that most if not all RAID6 implementations don't do
shortened RMWs, where only the updated blocks and the PQ
blocks are involved, but they always do full stripe RMW.
Even with a BBWC in the host adapter this is one major
reason to avoid RAID6 in favor of at least RAID5, for your
setup in particular. But hey, RAID6 setups are all
syntactically valid! :-)
* The 'ext3' on disk layout and allocation policies seem to
deliver very good compact locality on bulk writeouts and
on relatively fresh filetrees, but then locality can degrade
apocalyptically over time, like seven times:
I suspect that the same applies to 'ext4', even if perhaps a
bit less. You have tried to "age" the filetree a bit, but I
suspect you did not succeed enough, as the graphed Linux
level seek patterns with 'ext4' show a mostly-linear write.
* Hopefully your storage layer does not use DM/LVMs...
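
To put rough numbers on the RAID6 postscript above, here is a
back-of-the-envelope count. It assumes a very simple model of my
own, in which every chunk involved in a read-modify-write is read
once and written once; what the P400 fw really does may differ:

```shell
#!/bin/sh
# Counting model (an assumption): each involved chunk costs 1 read + 1 write.
rmw_ios() {
    # $1 = data chunks per stripe, $2 = parity chunks, $3 = chunks updated
    data=$1
    parity=$2
    updated=$3
    short=$(( 2 * (updated + parity) ))   # updated chunks plus the P and Q chunks
    full=$(( 2 * (data + parity) ))       # every chunk in the stripe
    echo "shortened=$short full=$full"
}

rmw_ios 4 2 1   # 4+2 RAID6, one chunk updated: prints shortened=6 full=12
```

So under this model a single-chunk update on your 4+2 set costs
twice the chunk IOs with full stripe RMW.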