[ ... ]
> The filesystem operations I care about the most are the likes which
> involve thousands of small files across lots of directories, like
> large trees of source code. For my test, I created a tarball of a
> finished IcedTea6 build, about 2.5 GB in size. It contains roughly
> 200,000 files in 20,000 directories.
Ah, another totally inappropriate "test" of something (euphemism)
insipid. The XFS mailing list regularly gets queries on this topic.
Apparently not many people in the Linux culture have figured out
that general purpose filesystems cannot handle large groups of
small files well. Since the beginning of computing various forms
of "aggregate" files have been used for that, like 'ar' ('.a')
files from UNIX, which should have been used far more commonly
than has happened, and never mind things like BDB/GDBM.
But many lazy application programmers like to use the filesystem
as a small-record database, it is so easy...
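
To make the "aggregate file" point concrete, here is a minimal
sketch, assuming Python and its standard 'dbm' module (a relative
of the BDB/GDBM family): many small records go into one database
file instead of one file each. The helper names 'pack_records' and
'read_record' and the sample keys are made up for illustration.

```python
import dbm
import os
import tempfile


def pack_records(db_path, records):
    """Store many small records in one dbm database (hypothetical helper)."""
    with dbm.open(db_path, "c") as db:
        for key, value in records.items():
            db[key.encode()] = value


def read_record(db_path, key):
    """Fetch one record by key without touching any other record."""
    with dbm.open(db_path, "r") as db:
        return db[key.encode()]


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    db_path = os.path.join(workdir, "objects")
    # Two tiny "files" stored as records; names are examples only.
    pack_records(db_path, {"src/a.o": b"\x01\x02", "src/b.o": b"\x03"})
    print(read_record(db_path, "src/b.o"))
```

The filesystem then sees one (or a few) large files, which is the
case general purpose filesystems handle well.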
> [ ... ] I ran the tests with a current RHEL 6.2 kernel and
> also with a 3.3rc2 kernel. Both of them exhibited the same
> behavior. The disk hardware used was a SmartArray p400
> controller with 6x 10k rpm 300GB SAS disks in RAID 6. The
> server has plenty of RAM (64 GB). [ ... ]
Huge hardware, but (euphemism) imaginative setup, as among its
many defects RAID6 is particularly inappropriate for most
small-file or metadata-heavy workloads.
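
A rough back-of-the-envelope sketch of why, in Python. It assumes
the textbook read-modify-write cost of a sub-stripe RAID6 write
(read old data, P and Q, then write all three back, so 6 physical
I/Os) and a made-up round figure of 150 random IOPS per 10k rpm
disk. Controller cache and full-stripe writes can change the
picture, so treat the numbers as illustrative only.

```python
def small_write_iops(disks, iops_per_disk, ios_per_write):
    """Aggregate random small-write rate when every logical write
    costs 'ios_per_write' physical I/Os."""
    return disks * iops_per_disk / ios_per_write


DISKS = 6            # from the quoted setup
IOPS_PER_DISK = 150  # assumption: rough figure for one 10k rpm disk

RAID6_PENALTY = 6    # read data, P, Q; write data, P, Q
RAID10_PENALTY = 2   # write both halves of a mirror

print(small_write_iops(DISKS, IOPS_PER_DISK, RAID6_PENALTY))   # 150.0
print(small_write_iops(DISKS, IOPS_PER_DISK, RAID10_PENALTY))  # 450.0
```

Under these assumptions the six-disk RAID6 set does small random
writes about as fast as a single disk.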
> [ ... ] I created two directory hierarchies, each containing
> the unpacked tarball 20 times, which I rsynced simultaneously
> to the target filesystem. When this was done, I deleted one
> half of them, creating some free space fragmentation, and what
> I hoped would mimic real-world conditions to some degree.
Your test is less (euphemism) insignificant because you tried
to cope with filetree lifetime issues.
> [ ... ] disk head jumps about wildly between four zones which
> are written to in almost perfectly linear fashion.
> [ ... ] I am aware that no filesystem can be optimal,
Every filesystem can be close to optimal, just not for every
> but given that the entire write set -- all 2.5 GB of it -- is
> "known" to the file system, that is, in memory, wouldn't it be
> possible to write it out to disk in a somewhat more reasonable
That sounds to me like a (euphemism) strategic aim: why ever
should a filesystem optimize that special case? Especially given
that XFS spreads file allocations across AGs because it aims for
multithreaded operations, especially on RAID sets with several
independent (that is, not RAID6 with small writes) arms.
Unfortunately filesystems are not psychic and cannot use
predictive allocation policies, and have to cope with poorly
written applications that don't do advising (or 'fsync' properly,
which is even worse). So some policies get hard-wired in the
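
For what "advising" and a proper 'fsync' can look like from the
application side, a minimal sketch in Python using the POSIX calls
it exposes ('os.posix_fallocate', 'os.posix_fadvise', 'os.fsync').
The function name 'write_file_with_hints' is made up, and whether
a given filesystem acts on the hints is not guaranteed.

```python
import os
import tempfile


def write_file_with_hints(path, data):
    """Write 'data' to 'path', telling the kernel what we know up front."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        try:
            if data and hasattr(os, "posix_fallocate"):
                # Announce the final size so space can be reserved at once.
                os.posix_fallocate(fd, 0, len(data))
            if hasattr(os, "posix_fadvise"):
                # Hint that the file will be accessed sequentially.
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        except OSError:
            pass  # hints are optional; some filesystems reject them
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Only after fsync returns is the data known to be on stable storage.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "example.bin")
    write_file_with_hints(target, b"payload" * 100)
    print(os.path.getsize(target))
```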
Your remedy, as you have noticed, is to tweak the filesystem
logic by changing the number of AGs, and you might also want to
experiment with the elevator (you seem to have forgotten about
that) and other block subsystem policies, and/or with the safety
vs. latency tradeoffs available at the filesystem and storage
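
To check which elevator a device is using before experimenting, a
small sketch in Python that reads the sysfs file
'/sys/block/<device>/queue/scheduler', where the active elevator
is the name in square brackets. The device name 'sda' and the
function names are examples only; switching elevators is done by
writing one of the listed names to the same file as root.

```python
def parse_scheduler(text):
    """Return (active, available) from a sysfs 'queue/scheduler' line.

    The active elevator is the entry shown in square brackets.
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def read_scheduler(device):
    """Read the elevator setting for a block device such as 'sda'."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return parse_scheduler(f.read())


if __name__ == "__main__":
    # Sample line in the format the kernel uses; not read from a real device.
    print(parse_scheduler("noop deadline [cfq]\n"))
```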
There are many annoying details, and recentish versions of XFS
try to help with the hideous hack of building an elevator inside
the filesystem code itself:
which however is sort of effective, because the Linux block IO
subsystem has several (euphemism) appalling issues.
> As can be seen from the time scale in the bottom part, the ext4
> version performed about 5 times as fast because of a much more
> disk-friendly write pattern.
Is it really disk friendly for every workload? Think about what
happens on 'ext4' there when it jumps between block groups: it
is in effect doing commits in a different order. What 'ext4'
does costs dearly on other workload types.