
Re: Performance problem - reads slower than writes

To: Joe Landman <landman@xxxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: Performance problem - reads slower than writes
From: Brian Candler <B.Candler@xxxxxxxxx>
Date: Tue, 7 Feb 2012 17:30:20 +0000
Cc: xfs@xxxxxxxxxxx
In-reply-to: <4F2D98A9.4090709@xxxxxxxxxxxxxxxxxxxxxxx>
References: <20120131103126.GA46170@xxxxxxxx> <20120131145205.GA6607@xxxxxxxxxxxxx> <20120203115434.GA649@xxxxxxxx> <4F2C38BE.2010002@xxxxxxxxxxxxxxxxx> <20120203221015.GA2675@xxxxxxxx> <4F2D016C.9020406@xxxxxxxxxxxxxxxxx> <20120204112436.GA3167@xxxxxxxx> <4F2D2953.2020906@xxxxxxxxxxxxxxxxx> <20120204200417.GA3362@xxxxxxxx> <4F2D98A9.4090709@xxxxxxxxxxxxxxxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Sat, Feb 04, 2012 at 03:44:25PM -0500, Joe Landman wrote:
> >Sure it can. A gluster volume consists of "bricks". Each brick is served by
> >a glusterd process listening on a different TCP port. Those bricks can be on
> >the same server or on different servers.
> I seem to remember that the Gluster folks abandoned this model
> (using their code versus MD raid) on single servers due to
> performance issues.  We did play with this a few times, and the
> performance wasn't that good.  Basically limited by single disk
> seek/write speed.

It does appear to scale up, although not as linearly as I'd like.

Here are some performance stats [1][2].
#p = number of concurrent client processes; files read first in sequence
and then randomly.

With a 12-brick distributed replicated volume (6 bricks each on 2 servers),
the servers connected by 10GE and the gluster volume mounted locally on one
of the servers:

 #p  files/sec  dd_args
  1      95.77  bs=1024k 
  1      24.42  bs=1024k [random]
  2     126.03  bs=1024k 
  2      43.53  bs=1024k [random]
  5     284.35  bs=1024k 
  5      82.23  bs=1024k [random]
 10     280.75  bs=1024k 
 10     146.47  bs=1024k [random]
 20     316.31  bs=1024k 
 20     209.67  bs=1024k [random]
 30     381.11  bs=1024k 
 30     241.55  bs=1024k [random]

With a 12-drive md raid10 "far" array, exported as a single brick and
accessed using glusterfs over 10GE:

 #p  files/sec  dd_args
  1     114.60  bs=1024k 
  1      38.58  bs=1024k [random]
  2     169.88  bs=1024k 
  2      70.68  bs=1024k [random]
  5     181.94  bs=1024k 
  5     141.74  bs=1024k [random]
 10     250.96  bs=1024k 
 10     209.76  bs=1024k [random]
 20     315.51  bs=1024k 
 20     277.99  bs=1024k [random]
 30     343.84  bs=1024k 
 30     316.24  bs=1024k [random]

This is a rather unfair comparison because the RAID10 "far" configuration
allows it to find all data on the first half of each drive, reducing the
seek times and giving faster read throughput.  Unsurprisingly, it wins on
all the random reads.

For sequential reads with 5+ concurrent clients, the gluster distribution
wins (because of the locality of files to their directory).
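To make the sequential comparison easier to eyeball, here is a small sketch that puts the sequential files/sec figures from the two tables above side by side and prints which layout was faster at each process count. The function names (seq_rates, compare_seq) are made up for illustration; only the numbers come from the tables.

```shell
# Sequential files/sec copied from the two tables above.
# Columns: processes, 12-brick gluster volume, md raid10 "far" single brick.
seq_rates() {
cat <<'EOF'
1 95.77 114.60
2 126.03 169.88
5 284.35 181.94
10 280.75 250.96
20 316.31 315.51
30 381.11 343.84
EOF
}

# For each process count, print the faster layout and the ratio
# of the faster rate to the slower one.
compare_seq() {
  seq_rates | awk '{
    winner = ($2 > $3) ? "gluster" : "raid10"
    ratio  = ($2 > $3) ? $2 / $3 : $3 / $2
    printf "%2d %-7s %.2fx\n", $1, winner, ratio
  }'
}

compare_seq
```

The same approach works for the [random] rows if you swap in those figures.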

In the limiting case, because the filesystems are independent you can read
off them separately and concurrently:

# for i in /brick{1..6}; do find $i | time cpio -o >/dev/null & done

This completed in 127 seconds for the entire corpus of 100,352 files (65GB
of data), i.e. 790 files/sec or 513MB/sec.  If your main use case were to
copy or process all the files at once, this would win hands-down.
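The files/sec figure is just the file count divided by the elapsed time. A minimal sketch of that arithmetic, with hypothetical helper names (files_per_sec, mb_per_sec); note that the MB/sec result depends on the exact byte count, which is only given above as a rounded 65GB:

```shell
# files_per_sec FILES SECONDS: rate rounded to a whole number.
files_per_sec() {
  awk -v f="$1" -v s="$2" 'BEGIN { printf "%.0f\n", f / s }'
}

# mb_per_sec BYTES SECONDS: rate in decimal megabytes (10^6 bytes) per second.
# Pass the real byte count of the corpus (e.g. from "du -sb") for an exact figure.
mb_per_sec() {
  awk -v b="$1" -v s="$2" 'BEGIN { printf "%.0f\n", b / s / 1000000 }'
}

# 100,352 files in 127 seconds, as reported above.
files_per_sec 100352 127
```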

In fact, since the data is duplicated, we can read half the directories
from each disk in the pair.

root@storage1:~# for i in /brick{1..6}; do find $i | egrep '/[0-9]{4}[02468]/' |
    time cpio -o >/dev/null & done
root@storage2:~# for i in /brick{1..6}; do find $i | egrep '/[0-9]{4}[13579]/' |
    time cpio -o >/dev/null & done

This read the whole corpus in 69 seconds: i.e. 1454 files/sec or 945MB/s. 
Clearly you have to jump through some hoops to get this, but actually
reading through all the files (in any order) is an important use case for
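
To show how the two egrep patterns divide the work, here is a small sketch run against made-up paths. The five-digit directory names and the file name are assumptions inferred from the pattern, not taken from the real corpus, and the helper names are hypothetical.

```shell
# Hypothetical sample paths: five-digit directory names are an assumption
# inferred from the '/[0-9]{4}[02468]/' pattern used above.
sample_paths() {
  for d in 00000 00001 00002 00013 12344 12345; do
    printf '/brick1/%s/file.dat\n' "$d"
  done
}

# Paths whose directory name ends in an even digit (read on storage1).
even_dirs() { grep -E '/[0-9]{4}[02468]/'; }

# Paths whose directory name ends in an odd digit (read on storage2).
odd_dirs() { grep -E '/[0-9]{4}[13579]/'; }

echo "storage1 would read:"; sample_paths | even_dirs
echo "storage2 would read:"; sample_paths | odd_dirs
```

Because the pattern requires a trailing slash after the digits, the directory entries themselves (which find prints without a trailing slash) are filtered out and only the paths beneath them are passed to cpio. Every file falls into exactly one of the two sets, so the two servers between them cover the corpus once.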

Maybe the RAID10 array could score better if I used a really big stripe size
(I'm using 1MB at the moment).



[1] Test script shown at

[2] Tuned by:
gluster volume set <volname> performance.io-thread-count 32
and with the patch at
