Sonny Rao wrote:
On Fri, Jan 21, 2005 at 07:32:29AM +0100, Andi Kleen wrote:
Sonny Rao <sonny@xxxxxxxxxxx> writes:
Interesting, since xfs_fsr already works online, I assume the only
remaining kernel function requirement is to allow locking off
allocation to a particular AG while the extents and metadata are moved
off? Then I assume there's some bookkeeping to get rid of refs to
that AG, which I guess might be fairly difficult ?
One issue is that you cannot move inodes. The inode number contains
the AG number, and you would need to renumber the inode which
would be fairly intrusive and visible to user space.
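
To make the point concrete, here is a minimal sketch of an inode number
that carries its AG number in the high bits. The bit widths (AGBLKLOG,
INOPBLOG) and the helper names are illustrative assumptions, not values
or functions taken from the XFS code.

```python
# Toy model of an inode number that embeds its allocation group (AG).
# The bit widths are made-up illustrative values, not real superblock fields.
AGBLKLOG = 16   # assumed: log2(blocks per AG)
INOPBLOG = 4    # assumed: log2(inodes per block)
AGINO_BITS = AGBLKLOG + INOPBLOG  # low bits: inode position inside its AG


def make_ino(agno: int, agino: int) -> int:
    """Combine an AG number and an AG-relative inode number."""
    if not 0 <= agino < (1 << AGINO_BITS):
        raise ValueError("agino does not fit in the AG-relative field")
    return (agno << AGINO_BITS) | agino


def ino_to_agno(ino: int) -> int:
    """Recover the AG number from the high bits."""
    return ino >> AGINO_BITS


def ino_to_agino(ino: int) -> int:
    """Recover the AG-relative inode number from the low bits."""
    return ino & ((1 << AGINO_BITS) - 1)


# The "same" inode slot in two different AGs gets two different numbers.
old = make_ino(agno=7, agino=1234)
new = make_ino(agno=2, agino=1234)
```

Under this kind of encoding, moving an inode from AG 7 to AG 2 necessarily
gives it a different inode number, which is the change user space would see.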
Ack, this is bad. I think JFS has a similar issue where inode numbers
are related to their ag numbers (shifted to the MSBs), but IIRC JFS
AGs are dynamically allocated, so maybe less of a problem there?
One problem used to be that XFS couldn't free any inodes, so you
couldn't get them out of AGs. But SGI added that recently, so it may
be more feasible now.
So you're saying one possible solution is to make copies of the inodes
in other AGs and then free them after that? Of course this is
contingent on being able to get exclusive access to all these inodes,
However to be interesting it would need online shrink, and that will
add lots of interesting races, because basically all operations
would need to check for their AG going away under them. Also changing
inode numbers in this case would be nasty.
As I don't really know the XFS code, I'm guessing there is no per-AG
lock and that "m_peraglock" is just protecting the list of AGs itself?
What about holding that lock and waiting for all transactions to
complete like in xfs_freeze ? (Apologies if I'm talking out of my ass here)
I did describe how to do this once, but I no longer have that email, so
I have to recreate it.
1. Add to the kernel the ability to turn off new allocations to an
allocation group. Some special under-the-cover allocations into the
group still need to work, though: freeing up the space in the
allocation group may require btree splits for the free space, and
these must keep working in the interim.
2. Find all the directories with inodes or blocks in the allocation
group - this requires walking all the extents of all the directory
inodes, so it is not fast. Note that just because an inode is not in
the last allocation group does not mean it has no disk blocks there.
3. Recreate these directories from user space with a temp name, link all
their contents over to the new directory, switch the names of the
two inodes atomically inside the kernel, then remove the old links and
directory. There needs to be logic to detect new files appearing in
the old directory; these need to be renamed into the new parent.
There are now only file blocks and inodes in the allocation group.
4. Repeat the above process with files whose inode or extents are in
the allocation group. If just the inode is there (unlikely), then
no need to move blocks. xfs_fsr contains most of the logic for this.
5. Fix up the superblock counters so that the allocation group count
shrinks. Note this could be applied to several allocation
groups at once.
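
The overall flow of steps 1, 2, 4 and 5 can be sketched as a toy
user-space model. Everything here (ToyFS, File, shrink_last_ag) is a
hypothetical illustration, not XFS code; it ignores locking,
transactions, free-space btree splits, and the directory relinking of
step 3.

```python
from dataclasses import dataclass, field


@dataclass
class File:
    name: str
    inode_ag: int                                   # AG holding the inode
    extent_ags: list = field(default_factory=list)  # AG of each extent


@dataclass
class ToyFS:
    ag_count: int
    files: list
    noalloc: set = field(default_factory=set)  # AGs closed to new allocations

    def pick_ag(self) -> int:
        """Pick an AG still open for allocation (first fit, for simplicity)."""
        for ag in range(self.ag_count):
            if ag not in self.noalloc:
                return ag
        raise RuntimeError("no allocation group accepts new allocations")

    def touches(self, f: File, ag: int) -> bool:
        # A file can have extents in an AG even if its inode lives elsewhere.
        return f.inode_ag == ag or ag in f.extent_ags

    def shrink_last_ag(self) -> list:
        victim = self.ag_count - 1
        self.noalloc.add(victim)                        # step 1
        moved = [f for f in self.files
                 if self.touches(f, victim)]            # steps 2 and 4: find
        for f in moved:
            target = self.pick_ag()
            if f.inode_ag == victim:
                f.inode_ag = target                     # inode number changes
            f.extent_ags = [target if a == victim else a
                            for a in f.extent_ags]      # move the extents
        self.noalloc.discard(victim)
        self.ag_count -= 1                              # step 5
        return [f.name for f in moved]


fs = ToyFS(ag_count=4, files=[
    File("a", inode_ag=0, extent_ags=[0, 3]),  # inode elsewhere, blocks in AG 3
    File("b", inode_ag=3, extent_ags=[3]),     # inode and blocks in AG 3
    File("c", inode_ag=1, extent_ags=[1]),     # untouched
])
moved = fs.shrink_last_ag()
```

File "a" shows the case from step 2: its inode is not in the last
allocation group, but one of its extents is, so it still has to be
processed.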
As Andi pointed out, this results in the inode numbers changing, so
there is no way to do this while the filesystem is exported. It also
probably messes with backups, which would need redoing afterwards.
There are several months of effort in this to get it all right.
Given the low price of storage nowadays, it is a lot cheaper to buy
another disk than to pay someone to do this. At current rates for
an experienced XFS developer, you are talking about 120 Gbytes/hour
at current disk prices ;-)
Now, what would be really neat is for a layer underneath the filesystem
to dynamically detect failing storage (SMART?), take some storage from
a free pool of drives, and remap the filesystem blocks out to the new
space while it is live.
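
As a rough illustration of that idea, a remapping layer could keep a
table from logical blocks to physical locations and rewrite entries
when a device is flagged as failing. The RemapLayer class and the
device names below are hypothetical; this sketch assumes the spare
slots are on healthy devices and skips the actual data copy and any
synchronisation with in-flight I/O.

```python
class RemapLayer:
    """Toy block-remapping layer: logical block -> (device, physical block)."""

    def __init__(self, mapping, spare_pool):
        self.mapping = dict(mapping)        # logical -> (device, physical)
        self.spare_pool = list(spare_pool)  # free (device, physical) slots

    def resolve(self, logical):
        """Return where a logical block currently lives."""
        return self.mapping[logical]

    def evacuate(self, failing_device):
        """Move every logical block off failing_device onto spare slots."""
        moved = []
        for logical, (dev, _phys) in sorted(self.mapping.items()):
            if dev != failing_device:
                continue
            if not self.spare_pool:
                raise RuntimeError("spare pool exhausted")
            # A real layer would copy the data before switching the mapping.
            self.mapping[logical] = self.spare_pool.pop(0)
            moved.append(logical)
        return moved


layer = RemapLayer(
    {0: ("sda", 10), 1: ("sdb", 11), 2: ("sda", 12)},
    [("sdc", 0), ("sdc", 1)],
)
moved = layer.evacuate("sda")
```

The filesystem above keeps using the same logical block numbers
throughout, which is why nothing like inode renumbering is needed at
this level.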