Re: file system defragmentation

To: linux-xfs@xxxxxxxxxxx
Subject: Re: file system defragmentation
From: "Linda A. Walsh" <xfs@xxxxxxxxx>
Date: Wed, 21 Sep 2005 01:10:15 -0700
Cc: Austin Gonyou <austin@xxxxxxxxxxxxxxx>
In-reply-to: <4312913F.6040205@xxxxxxxxxxxxxxx>
References: <43128F82.4010004@xxxxxxxxx> <4312913F.6040205@xxxxxxxxxxxxxxx>
Sender: linux-xfs-bounce@xxxxxxxxxxx
User-agent: Mozilla Thunderbird 1.0.6 (Windows/20050716)

Sorry for the long delay in answering, I must have missed this
coming in, but yes, you are right...if you have 1 gigantic
file that grows fragmented over time (while not regularly
running fsr), you might have a highly fragmented file even
though the % of fragmented files would read "low".  (I think
that is what you were saying in addition to mentioning
the concept of many small individual files that could raise
a fragmentation number).
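
To make that concrete, here is a small sketch in Python. It assumes the fragmentation factor is computed over extents as (actual - ideal) / actual, which is consistent with the xfs_db figures quoted further down in this thread. The file counts and the helper name `frag_factor` are made up for illustration, and the "ideal" of one extent for the big file is also an assumption.

```python
def frag_factor(actual_extents, ideal_extents):
    """Percent of extents in excess of the ideal count (assumed formula)."""
    if actual_extents == 0:
        return 0.0
    return (actual_extents - ideal_extents) * 100.0 / actual_extents

# Hypothetical volume: a million small files with one extent each,
# plus one huge file that has splintered into 2000 extents.
small_files = 1_000_000
big_file_extents = 2_000

actual = small_files + big_file_extents   # extents actually on disk
ideal = small_files + 1                   # assume one extent per file at best

overall = frag_factor(actual, ideal)
big_file_only = frag_factor(big_file_extents, 1)

print(f"whole volume:   {overall:.2f}%")
print(f"big file alone: {big_file_only:.2f}%")
```

Under these assumed numbers the volume-wide figure stays well under 1% while the one big file is almost entirely fragmented, which is the situation described above.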

However, if one were to run 'fsr' daily and if there was a large
enough blank contiguous area on the disk to hold the multi-Gig
file, fsr would still unfold it;  in fact, that was why fsr was
created.  Given xfs's delayed allocation (even slightly more so
on its native IRIX, _I believe_), xfs was originally designed
without "fsr" because it was thought it wouldn't really be
likely to need it.  The file system was designed to handle
large recording sessions and large files of streaming media such
as live audio and video feed -- which used to be one of SGI's
target customers.

This is 2nd hand, so I may have some details off, but the way
I heard it was that one customer came up with an unplanned scenario
where they would do something like recording different parts of weekly
shows interwoven throughout a day and throughout a week.  Even with
delayed writes, one can only delay so long, and, perhaps, while
producing multiple programs, there would still be a limit to
system memory to buffer multiple feeds in memory before a
write to disk would be forced. I think this was back in the mid
90's (?), so bandwidth and main memory were more limited as well.
Thus 'xfs_fsr' was born to re-arrange the data portions of files
to re-optimize streaming performance.

Now admittedly, I don't know what happens with DBMS files that
are 'locked'.  Since "xfs_fsr" can be run by any user that has
read/write access to the specified file, it wouldn't appear to need
or use any extra privileges, so I doubt it can get around such locks.
DBMS files that remain locked 24 hours/day but grow slowly over time
could therefore, theoretically, become highly internally fragmented.
I'm not sure how one would easily work around that problem --
especially in the case of locked records of a database that could be
getting updated in real-time.

It would likely be "best" to allocate/create the DBMS files at the max
size they will be allowed to grow to before putting them into use, make
sure they are defragmented, then use them in place.
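
A minimal sketch of the "allocate up front" idea, in Python and assuming a Linux system where `os.posix_fallocate` is available. The function names, the file name, and the size are hypothetical. Reserving the space this way does not by itself guarantee a contiguous layout; that still depends on the free space available at the time.

```python
import os
import tempfile

def preallocate(path, size_bytes):
    """Create `path` if needed and reserve `size_bytes` of disk space for it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask the file system to allocate the blocks now,
        # rather than piecemeal as the file grows.
        os.posix_fallocate(fd, 0, size_bytes)
    finally:
        os.close(fd)

def demo(size_bytes=1024 * 1024):
    """Preallocate a scratch file and report its size (illustration only)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.db")
        preallocate(path, size_bytes)
        return os.path.getsize(path)
```

For a real database file one would call `preallocate` once with the largest size the file is allowed to reach, then check the layout (and run xfs_fsr on that file if needed) before putting it into service.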

If you can't predict the maximum file size, then you may have to
periodically checkpoint the database, make a copy of it,
defragment the new copy (since xfs_fsr allows defragmenting by
file), then temporarily lock the database and update the new copy
with all the changes since the copy was made (perhaps some sort
of database journal can be started after the checkpoint that can
be replayed on the new copy? Guess it depends on the app).

As for *many* small files, XFS may have some mixed performance
characteristics on those.  Some small amount of file data or directory
information, I believe, may be stored internal to an inode, but
that is very limited and I don't know when XFS makes use of that
feature -- it might be symlink data or stuff like that only. Dunno.
I have seen (and experienced) bad performance for
"removing" large numbers of files on XFS.  It is probably better
than when first ported to Linux, but I believe it's still a noticeable
slow spot.  It _could_ be (guess), that when a file is deleted,
and the space is released, the file subsystem attempts to merge the
free space released with adjacent blocks of free space -- doing this
recursively over each block of space released so that the resulting
free space ends up in the fewest number of "extents" (variable
sized allocation units) possible.  While this creates a performance
penalty for file deletes, it has the effect of automatically
coalescing free space into large segments that can
be quickly allocated when needed for streaming 'write' performance.
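
To illustrate the guess above (this is a toy model of the idea, not XFS's actual code), here is a short Python sketch that merges a released extent with any free neighbours. The free list is a sorted list of (start, length) pairs; the function name and the numbers are made up.

```python
import bisect

def release(free_extents, start, length):
    """Return a new sorted free list with (start, length) added and merged.

    free_extents is a sorted list of non-overlapping (start, length) tuples.
    """
    extents = list(free_extents)
    i = bisect.bisect_left(extents, (start, length))
    # Merge with the free extent just before, if it ends where we begin.
    if i > 0 and extents[i - 1][0] + extents[i - 1][1] == start:
        prev_start, prev_len = extents.pop(i - 1)
        start, length = prev_start, prev_len + length
        i -= 1
    # Merge with the free extent just after, if it begins where we end.
    if i < len(extents) and extents[i][0] == start + length:
        _, next_len = extents.pop(i)
        length += next_len
    extents.insert(i, (start, length))
    return extents

free = [(0, 10), (20, 10), (50, 5)]
free = release(free, 10, 10)   # fills the gap between the first two extents
print(free)                    # [(0, 30), (50, 5)]
```

Each release costs a lookup plus up to two merges, which is the kind of extra work on delete that is being speculated about, in exchange for fewer and larger free extents.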

With the delayed write feature of XFS, the file
subsystem can choose a more optimally sized free "extent" on write
(rather than allocating single blocks on a first free, first allocated
basis as is common in other file systems).  I.e. if I already have
buffered 256K in memory for a file that is still being written to,
XFS could choose a 512K extent that may be followed by other
large, contiguous free extents.  Whereas if I have written a file and
closed it, say using cp, and it was a 23.5K file, XFS would
know that an extent between 16-31K in size might be an exact fit, or
it could split a 32+K extent and leave a left-over 8K extent.
Either way, it would quickly find a contiguous free block of space
for the file and not use the fragmentation-prone approach of
allocating the first-free blocks first.
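
The difference between the two strategies can be sketched in a few lines of Python. This is a toy model under my own assumptions (a flat list of free extents as (start, length) pairs in KB, and a 23.5K file rounded up to 24K), not a description of how XFS actually searches its free space.

```python
def first_fit(free_extents, size):
    """Pick the lowest-addressed free extent that is large enough."""
    for start, length in sorted(free_extents):
        if length >= size:
            return (start, length)
    return None

def best_fit(free_extents, size):
    """Pick the smallest free extent that is large enough."""
    candidates = [(length, start)
                  for start, length in free_extents if length >= size]
    if not candidates:
        return None
    length, start = min(candidates)
    return (start, length)

# Hypothetical free list as (start, length) pairs, in KB.
free = [(0, 512), (600, 24), (700, 32)]

print(first_fit(free, 24))  # (0, 512): carves up the big extent
print(best_fit(free, 24))   # (600, 24): exact fit, big extent stays whole
```

Knowing the final size before allocating is what makes the second choice possible; an allocator that has to place blocks one at a time as they arrive cannot make it.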

I guess I've been pretty impressed with the XFS design strategy --
especially considering it's well over 10 years old.  It keeps
fragmentation low in _most_ cases w/o needing "fsr" (the stock
IRIX release was configured to run it *weekly*), and it was one of
the first file systems I was exposed to that provided journaling and
eliminated my long fsck waits.  It also reduced format time on
a multi Gig volume from large fractions of an hour to some number
of seconds.  It was designed with extended attributes, allowing
both system and user-level attribute space per file; special
"real-time" recording zones that can allow for faster access than
going through the file system and its hierarchy and allocation
mechanism might allow; detailed layout specification to optimally
tune RAID performance on generic hardware; and other features we
take for granted in a modern file system.  I think it's sorta neat
that things that might normally need a data-block under other file
systems, like "symlink" data, are, I believe, actually stored in the
inode.   That's gotta reduce the large numbers of single,
"allocation-unit" sized chunks that would be necessary in other
filesystems.

While SGI's IRIX systems were up to 1024 cpu's/node (one OS image)
about 3-4 years ago, they had to play catchup with the Intel Itanium
architecture but seem to have gotten up to speed there as well,
delivering a 10,240 processor system (I'm guessing a 10-OS-node cluster,
but dunno, maybe they were able to do some custom config -- never
can tell what some of the SGI engineers might come up with).  There
are still some damn smart engineers there, even though many were lost
over the past several years due to layoffs and attrition as new
management was brought in to 'control' costs by increased scrutiny
and critical review/control of the creative process...er...
<voice, character=Hagrid>I shouldn't have said that last part.
Nope...Forget I said anything.</voice>  Bad habit I have of saying
a bit too much sometimes.

On that note...time to shut up...would like to minimize
foot-in-mouth disease...:-)


On Sunday, August 28th (more than 3 weeks ago), Austin Gonyou wrote:

I believe in the FAQ/man-info page for xfs_fsr, it refers to the fact that it will only work on file data and that the percent of frag is a factor of the size of files, number of inodes, and number of files.

This is important because if you run a mail/news/file server of some sort, and you have *high* frag, then there's probably a performance problem there. If you have large files, say, 2GB files, and have tons of them, (read RDBMS), then each file if it were fragged, would yield a higher amount of fragmentation per file. Just remember that when using FSR. ;)


Linda A. Walsh wrote:

XFS_FSR has been very stable -- I've been running it since before
xfs was merged into the kernel.

I have the following file, named "fsr" in my /etc/cron.daily:
I would think /bin/sh or /bin/sh would work equally well.

_File_ fragmentation is a virtual non-issue on my disks, however,
xfs_fsr only works on my file data.  It doesn't work
on directory data.  On a 118G partition nearing 79% capacity,
printing the defragmentation data from xfs_db:

actual 316870, ideal 316652, fragmentation factor 0.07%

actual 2538, ideal 1330, fragmentation factor 47.60%

I don't know about other data types (attr, symlink, quota,
realtime, realtime control) as I don't believe I have
any of those types on my volumes...But, hey, file fragmentation
seems to be handled! :-)
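
As a sanity check, the two factors above can be reproduced from the actual and ideal extent counts. The Python sketch below assumes the factor is (actual - ideal) / actual, expressed as a percentage; the function name is made up for illustration.

```python
def frag_factor(actual, ideal):
    """Fragmentation factor in percent, assuming (actual - ideal) / actual."""
    return (actual - ideal) * 100.0 / actual if actual else 0.0

# Extent counts quoted above.
first = frag_factor(316870, 316652)
second = frag_factor(2538, 1330)

print(f"{first:.2f}%")   # 0.07%
print(f"{second:.2f}%")  # 47.60%
```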


Fong Vang wrote:

>This is very useful.  Thank you.  How stable is xfs_fsr?
