On Mon, Mar 26, 2007 at 02:27:24PM +1000, Neil Brown wrote:
> On Monday March 26, dgc@xxxxxxx wrote:
> > On Mon, Mar 26, 2007 at 09:21:43AM +1000, Neil Brown wrote:
> > > My point was that if the functionality cannot be provided in the
> > > lowest-level firmware (as it cannot for raid0 as there is no single
> > > lowest-level firmware), then it should be implemented at the
> > > filesystem level. Implementing barriers in md or dm doesn't make any
> > > sense (though passing barriers through can in some situations).
> > Hold on - you've said that the barrier support in a block device
> > can change because of MD doing hot swap. Now you're saying
> > there is no barrier implementation in md. Can you explain
> > *exactly* what barrier support there is in MD?
> For all levels other than md/raid1, md rejects bio_barrier() requests
> as -EOPNOTSUPP.
> For raid1 it tests barrier support when writing the superblock and
> if all devices support barriers, then md/raid1 will allow
> bio_barrier() down. If it gets an unexpected failure it just rewrites
> it without the barrier flag and fails any future write requests (which
> isn't ideal, but is the best available, and should happen effectively
> never).
> So md/raid1 barrier support is completely dependent on the underlying
> devices. md/raid1 is aware of barriers but does not *implement*
> them. Does that make it clearer?
Ah, that clears up the picture - thanks Neil.
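To make that concrete, the pattern Neil describes - submit with the
barrier flag, and on -EOPNOTSUPP drop the flag and resubmit - looks
roughly like this. This is a userspace sketch with stand-in functions
(submit_write() etc. are illustrative, not real kernel APIs):

```c
#include <assert.h>
#include <stdbool.h>

#define EOPNOTSUPP 95

/* flipped when e.g. md/raid1 hot-swaps in a non-barrier device */
static bool device_supports_barriers = false;

/* stand-in for submitting a block write, optionally as a barrier */
static int submit_write(int block, bool barrier)
{
	if (barrier && !device_supports_barriers)
		return -EOPNOTSUPP;	/* the barrier bio is rejected */
	(void)block;
	return 0;			/* write queued successfully */
}

/* On -EOPNOTSUPP, drop the barrier flag, remember that barriers are
 * gone, and resubmit the same block as an ordinary write. */
static int write_with_barrier_fallback(int block, bool *barriers_enabled)
{
	int err = submit_write(block, *barriers_enabled);

	if (err == -EOPNOTSUPP) {
		*barriers_enabled = false;	/* future writes unflagged */
		err = submit_write(block, false);
	}
	return err;
}
```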
> > > The most straight-forward way to implement this is to make sure all
> > > preceding blocks have been written before writing the barrier block.
> > > All filesystems should be able to do this (if it is important to them).
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^
> > And that is the key point - XFS provides no guarantee that your
> > data is on spinning rust other than I/O barriers when you have
> > volatile write caches.
> > IOWs, if you turn barriers off, we provide *no guarantees*
> > about the consistency of your filesystem after a power failure
> > if you are using volatile write caching. This mode is for
> > use with non-cached disks or disks with NVRAM caches where there
> > is no need for barriers.
> But.... as the block layer can re-order writes, even non-cached disks
> could get the writes in a different order to the one in which you sent
> them.
But on a non-cached disk we have to have received an I/O completion
before the tail of the log moves, and hence the metadata is on stable
storage. The problem arises when volatile write caches are used and
I/O completion no longer means "data on stable storage".
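In other words, the log tail only moves once the metadata writeback has
completed, and on a non-cached disk "completed" means "stable". A toy
model of that (illustrative only, nothing like the real XFS code):

```c
#include <assert.h>
#include <stdbool.h>

struct log_state {
	int tail;		/* oldest entry still needed for recovery */
	bool write_cache;	/* volatile write cache on the device? */
};

/* completion callback for a metadata writeback covering log entry 'lsn' */
static bool metadata_write_done(struct log_state *log, int lsn)
{
	if (!log->write_cache) {
		/* no volatile cache: completion == stable storage,
		 * so the tail can safely move past this entry */
		if (lsn >= log->tail)
			log->tail = lsn + 1;
		return true;
	}
	/* with a volatile cache, completion proves nothing about
	 * stability - a barrier/flush is needed before moving the tail */
	return false;
}
```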
> I have a report of xfs over md/raid1 going about 10% faster once we
> managed to let barrier writes through, so presumably XFS does
> something different if barriers are not enabled ??? What does it do
> instead?
I bet that the disk doesn't have its write cache turned on. For
disks with write cache turned on, barriers can slow down XFS by a factor
of 5. Safety, not speed, is what we were after with barriers.
> > > Because block IO tends to have long pipelines and because this
> > > operation will stall the pipeline, it makes sense for a block IO
> > > subsystem to provide the possibility of implementing this sequencing
> > > without a complete stall, and the 'barrier' flag makes that possible.
> > > But that doesn't mean it is block-layer functionality. It means (to
> > > me) it is common fs functionality that the block layer is helping out
> > > with.
> > I disagree - it is a function supported and defined by the block
> > layer. Errors returned to the filesystem are directly defined
> > in the block layer, the ordering guarantees are provided by the
> > block layer and changes in semantics appear to be defined by
> > the block layer......
> You can tell we are on different sides of the fence, can't you?
Yup - no fence sitting here ;)
> There is certainly some validity in your position...
And likewise yours - I just don't think the responsibility here
is quite so black and white...
> > > > wait for all queued I/Os to complete
> > > > call blkdev_issue_flush
> > > > schedule the write of the 'barrier' block
> > > > call blkdev_issue_flush again.
> > > >
> > > > And not involve the filesystem at all? i.e. why should the filesystem
> > > > have to do this?
> > >
> > > Certainly it could.
> > > However
> > > a/ The block layer would have to wait for *all* queued I/O,
> > > where-as the filesystem would only have to wait for queued IO
> > > which has a semantic dependence on the barrier block. So the
> > > filesystem can potentially perform the operation more efficiently.
> > Assuming the filesystem can do it more efficiently. What if it
> > can't? What if, like XFS, when barriers are turned off, the
> > filesystem provides *no* guarantees?
> (Yes.... Ted Ts'o likes casting aspersions on XFS... I guess this is
> why :-)
Different design criteria. ext3 is great at doing what it was designed
for, and the same can be said for XFS. Take them outside their
comfort area (like putting XFS on commodity disks with volatile
write caches or putting millions of files into a single directory in
ext3) and you get problems. It's just that they were designed for
different purposes, and that includes data resilience during failures.
That being said, we are doing a lot in XFS to address some of these
shortcomings - it's just that ordered writes can be very difficult
to retrofit to an existing filesystem....
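Going back to the generic sequence Neil sketched earlier (wait for
queued I/O, flush, write the barrier block, flush again), it would look
something like this. Userspace sketch only; issue_flush() stands in for
blkdev_issue_flush(), and the other helpers are made up for illustration:

```c
#include <assert.h>

static int flushes_issued;
static int blocks_written;

static void wait_for_queued_io(void) { /* drain the request queue */ }
static void issue_flush(void) { flushes_issued++; }	/* ~ blkdev_issue_flush() */
static void write_block(int blk) { blocks_written++; (void)blk; }

/* Emulate one barrier write without hardware barrier support. */
static void barrier_write(int blk)
{
	wait_for_queued_io();	/* everything before the barrier is queued out */
	issue_flush();		/* ...and forced out of the volatile cache */
	write_block(blk);	/* the barrier block itself */
	issue_flush();		/* barrier block stable before anything after it */
}
```

The cost is obvious from the shape of it: the whole pipeline stalls for
two cache flushes per barrier, which is why doing this only for I/O that
semantically depends on the barrier block is attractive.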
> Is there some mount flag to say "cope without barriers" or "require
> barriers" ??
XFS has "-o nobarrier" to say don't use barriers, and this is
*not* the default. If barriers don't work, we drop back to "-o nobarrier"
behaviour after leaving a loud warning in the log....
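The fallback amounts to a mount-time probe: try a barrier write, and if
it fails, warn loudly and behave as if "-o nobarrier" had been given.
A hedged sketch of that logic (names are illustrative, not the real
XFS functions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct mount_opts {
	bool use_barriers;	/* barriers on by default, *not* nobarrier */
};

/* stand-in for issuing a test barrier write to the device */
static bool barrier_test_write(bool device_has_barriers)
{
	return device_has_barriers;
}

/* Mount-time check: keep barriers only if the test write succeeds. */
static void check_barriers(struct mount_opts *mp, bool device_has_barriers)
{
	if (mp->use_barriers && !barrier_test_write(device_has_barriers)) {
		fprintf(stderr,
			"Filesystem: disabling barriers, not supported by the underlying device\n");
		mp->use_barriers = false;	/* as if mounted -o nobarrier */
	}
}
```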
> I can imagine implementing barriers in raid5 (which keeps careful
> track of everything) but I suspect it would be a performance hit. It
> might be nice if the sysadmin has to explicitly ask...
> For that matter, I could get raid1 to reject replacement devices that
> didn't support barriers, if there was a way for the filesystem to
> explicitly ask for them. I think we are getting back to interface
> issues, aren't we?
Yeah, very much so. If you need the filesystem to be aware of smart
things the block device can do or tell it, then we really don't
want to have to communicate them via mount options ;)
SGI Australian Software Group