On Thu, Sep 02, 2010 at 03:41:59AM -0500, Stan Hoeppner wrote:
> Dave Chinner put forth on 9/2/2010 2:01 AM:
> > No, that's definitely not the case. A different kernel in the
> > same 8p VM, 12x2TB SAS storage, w/ 4 threads, mount options
> > "logbsize=262144"
> > FSUse% Count Size Files/sec App Overhead
> > 0 800000 0 39554.2 7590355
> > 4 threads with mount options "logbsize=262144,delaylog"
> > FSUse% Count Size Files/sec App Overhead
> > 0 800000 0 67269.7 5697246
> What happens when you bump each of these to 8 threads, 1 per core? If
FSUse% Count Size Files/sec App Overhead
0 1600000 0 127979.3 13156823
So, 1 thread does 19k files/s, 2 thread does 37k files/s, 4 gets
67k, and 8 gets 128k. I'd say that's almost linear scaling and CPU
bound at each load point ;)
> the test consumes all cpus/cores, what instrumentation are you viewing
> that tells you the cpu utilization _isn't_ due to memory b/w starvation?
1) profiling like 'perf top' or oprofile, using hardware counters to
profile on cpu cycles, l1/l2 cache misses, etc
2) the delayed logging code uses significantly more memory bandwidth
than the original code because it copies changed information twice
(instead of once) before it is written to disk. Given that single
threaded performance of delayed logging is identical to the original
code and scalability from 1 to 8 cores is almost linear, it cannot
be memory bandwidth bound....
The code might be memory _latency_ bound (i.e on cache misses), but
it is certainly not stressing pure memory bandwidth.
> A modern 64 bit 2 GHz core from AMD or Intel has an L1 instruction issue
> rate of 8 bytes/cycle * 2,000 MHz = 16,000 MB/s = 16 GB/s per core. An
> 8 core machine would therefore have an instruction issue rate of 8 * 16
> GB/s = 128 GB/s. A modern dual socket system is going to top out at
> 24-48 GB/s, well short of the instruction issue rate. Now, this doesn't
> even take the b/w of data load/store operations into account, but I'm
> guessing the data size per directory operation is smaller than the total
> instruction sequence, which operates on the same variable(s).
> So, if the CPUs are pegging, and we're not running out of memory b/w,
> then this would lead me to believe that the hot kernel code, core
> fs_mark code and the filesystem data are fully, or near fully, contained
> in level 2 and 3 CPU caches. Is this correct, more or less?
However (and it is a big however!), I generally don't care to
analyse performance at this level because it's getting into
micro-optimisation territory. Sure, it will get you a few percent
here and there, but then you lose focus on improving the algorithms.
An algorithmic change can provide an order of magnitude improvement,
not a few percent. The delayed logging code is a clear example of
Another example - perf top shows this on the above 8p load on
a plain 2.6.36-rc3 kernel (and it gets about 40k files/s):
426043.00 27.4% _xfs_buf_find
87491.00 5.6% __ticket_spin_lock
67204.00 4.3% xfs_dir2_node_addname
60434.00 3.9% dso__find_symbol
48407.00 3.1% kmem_cache_alloc
37006.00 2.4% __d_lookup
31625.00 2.0% xfs_trans_buf_item_match
20036.00 1.3% xfs_log_commit_cil
18728.00 1.2% _raw_spin_unlock_irqrestore
18428.00 1.2% __memset
18001.00 1.2% __memcpy
17781.00 1.1% xfs_da_do_buf
17732.00 1.1% xfs_iflush_cluster
16831.00 1.1% kmem_cache_free
14836.00 1.0% __kmalloc
It is clear that buffer lookup is consuming the most CPU of any
operation. Why? Because the buffer hash table is too small. I've
already posted patches for a short term solution (increase the size
of the hash table) and the above 127k files/s result is using that
patch. hence it is clear that the micro-optimisation works, but at
the cost of 16x increase in memory usage for the hash table. And
that still isn't really large enough, because now the load is
already pushing the limits of the enlarged hash table.
As Christoph has already suggested, the correct way to fix the
problem is to change the caching algorithm to something that is
self-scaling (e.g. a rb-tree or a btree). That will keep memory
usage low on small filesystems, yet scale efficiently to large
numbers of buffers, something a hash cannot easily do.
IOWs, an algorithmic change will solve the problem far better for
more situations than the micro-optimisation of tweaking the hash
sizes. Reduced to the simplest argument, scalability is all
about choosing the right algorithm so you don't have to care about
minute details to obtain the performance you require.
> I'll have to dig around. I've never even looked for the archives for
> this list. It's hopefully mirrored in the usual places.
> Out of curiosity, have you ever run into memory b/w starvation before
> peaking all CPUs while running this test?
No. Last time I ran out of bandwidth doing an IO workloads was doing
6-7GB/s of buffered writes to disk on a 24p ia64 Altix. The disk
subsystem could handle 11GB/s, we got 10GB/s with direct IO, but
buffered IO was limited by the cross-sectional memory bandwidth of
the machine (25GB/s) because of the extra copy buffered IO
> I could see that maybe
> occurring with dual 1GHz+ P3 class systems with their smallish caches
> and lowly single channel PC100, back before the switch to DDR memory,
> but those machines were probably gone before XFS was open sourced, IIRC,
> so you may not have had the pleasure (if you could call it that).
The ratio between CPU cycles and memory bandwidth really hasn't
changed much since then. The CPUs weren't powerful enough then,
either, to run enough metadata ops to get near memory bandwidth