Chris Mason reported recently that a concurrent stress test
(basically copying the linux kernel tree 20 times, verifying md5sums
and deleting it in a loop concurrently) under low memory conditions
was triggering the OOM killer much more easily on XFS than on btrfs.
Turns out there are two main problems. The first is that unlinked
inodes were not being reclaimed fast enough, leading to OOM
being declared when there are large numbers of reclaimable inodes
still around. The second is that atime updates due to the verify
step were creating large numbers of dirty inodes at the VFS level
that were not being written back and hence made reclaimable before
the system declared OOM and killed stuff.
The first problem is fixed by making background inode reclaim more
frequent and faster, kicking background reclaim from the inode cache
shrinker so that when memory is low we always have background inode
reclaim in progress, and finally making the shrinker reclaim scan
block waiting on inodes to reclaim. This last step throttles memory
reclaim to the speed at which we can reclaim inodes, a key step
needed to prevent reclaim from declaring OOM while there are
still reclaimable inodes around. The background inode reclaim
prevents this synchronous flush from finding dirty inodes and blocking
on them in most cases and hence prevents performance regressions in
more common workloads due to reclaim stalls.
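
As a rough illustration of the interaction described above, here is a
toy userspace model, not the actual XFS code. All names
(inode_cache, background_reclaim, shrink_inodes) are hypothetical,
and "blocking" on a dirty inode is modelled by simply counting a
stall. It only shows why having background reclaim run ahead of the
shrinker keeps the synchronous scan from stalling.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of an inode cache; all names here are hypothetical. */
struct inode_cache {
	int clean;       /* inodes that can be freed immediately */
	int dirty;       /* inodes that need writeback before freeing */
	bool bg_queued;  /* background reclaim has been kicked */
	int stalls;      /* times the shrinker had to wait on a dirty inode */
};

/*
 * Background pass: does not stall the memory reclaim caller. Frees up
 * to 'batch' inodes, dealing with dirty ones first so that a later
 * synchronous scan finds only clean inodes.
 */
static int background_reclaim(struct inode_cache *c, int batch)
{
	int freed = 0;

	assert(batch >= 0);
	c->bg_queued = false;
	while (freed < batch && c->dirty > 0) {
		c->dirty--;
		freed++;
	}
	while (freed < batch && c->clean > 0) {
		c->clean--;
		freed++;
	}
	return freed;
}

/*
 * Shrinker scan: kick background reclaim, then reclaim synchronously.
 * Waiting on a dirty inode is modelled by counting a stall, which is
 * what throttles the caller to the speed of inode reclaim.
 * Returns the number of inodes still left to reclaim.
 */
static int shrink_inodes(struct inode_cache *c, int nr_to_scan)
{
	c->bg_queued = true;
	while (nr_to_scan-- > 0) {
		if (c->clean > 0) {
			c->clean--;
		} else if (c->dirty > 0) {
			c->stalls++;
			c->dirty--;
		} else {
			break;
		}
	}
	return c->clean + c->dirty;
}
```

In this model, a shrinker scan that runs on its own against a cache
holding dirty inodes records stalls, while the same scan run after a
background pass finds only clean inodes and does not stall.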
To enable this new functionality, the xfssyncd thread is replaced
with a workqueue and the existing xfssyncd work replaced with a
global workqueue. Hence all filesystems will share the same
workqueue and we remove all the xfssyncd threads from the system.
The ENOSPC inode flush is converted to use the workqueue, and
optimised to only allow a single flush at a time. This significantly
speeds up ENOSPC processing under concurrent workloads as it removes
all the unnecessary scanning that every single ENOSPC event
currently queues to the xfssyncd. Finally, a new inode reclaim
worker is added to the workqueue that runs 5x more frequently than
the xfssyncd to do the background inode reclaim scan.
The second problem could be fixed by making the XFS inode cache
shrinker kick the bdi flusher to write back inodes if the bdi
flusher is not already active. This, however, causes deadlocks when
the bdi-flusher thread needs to be forked under memory pressure, so
these patches have been dropped for now.
An addition to the series is to push the AIL when we are under
memory stress to speed up the cleaning of dirty metadata. While this
does not avoid the problems with VFS level dirty objects, it does
ensure we don't keep lots of dirty objects that the VFS knows
nothing about pinned in memory when there is a shortage.
By also pushing the AIL when doing the periodic syncd work, we ensure
that we always clear out dirty objects from the AIL regularly and
hence allow the filesystem to idle correctly. Conversion of the
xfsaild thread to a workqueue also simplifies the push trigger
mechanism for cleaning the AIL and removes the need for a thread per
filesystem for this work.
- drop writeback_inodes_sb_nr_if_idle() patch
- no need for work_pending checks before queueing - the queueing
  already does this atomically.
- remove xfs_syncd_lock as it is not necessary anymore
- simplify inode reclaim trigger
- add AIL pushing patches.