Re: XFS corruption with failover

To: Eric Sandeen <sandeen@xxxxxxxxxxx>
Subject: Re: XFS corruption with failover
From: Lachlan McIlroy <lmcilroy@xxxxxxxxxx>
Date: Thu, 13 Aug 2009 21:43:27 -0400 (EDT)
Cc: John Quigley <jquigley@xxxxxxxxxxxx>, XFS Development <xfs@xxxxxxxxxxx>, Felix Blyakher <felixb@xxxxxxx>
In-reply-to: <835473717.1935811250214078456.JavaMail.root@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
Reply-to: Lachlan McIlroy <lmcilroy@xxxxxxxxxx>
----- "Eric Sandeen" <sandeen@xxxxxxxxxxx> wrote:

> Lachlan McIlroy wrote:
> > ----- "Eric Sandeen" <sandeen@xxxxxxxxxxx> wrote:
> > 
> >> Felix Blyakher wrote:
> >>> On Aug 13, 2009, at 3:17 PM, John Quigley wrote:
> >>>
> >>>> Folks:
> >>>>
> >>>> We're deploying XFS in a configuration where the file system is
> >>>> being exported with NFS.  XFS is being mounted on Linux, with
> >>>> default options; an iSCSI volume is the formatted media.  We're
> >>>> working out a failover solution for this deployment utilizing Linux
> >>>> HA.  Things appear to work correctly in the general case, but in
> >>>> continuous testing we're getting XFS superblock corruption on a very
> >>>> reproducible basis.
> >>>> The sequence of events in our test scenario:
> >>>>
> >>>> 1. NFS server #1 online
> >>>> 2. Run IO to NFS server #1 from NFS client
> >>>> 3. NFS server #1 offline (via passing 'b' to /proc/sysrq-trigger)
> >>>> 4. NFS server #2 online
> >>>> 5. XFS mounted as part of failover mechanism, mount fails
> >>>>
> >>>> The mount fails with the following:
> >>>>
> >>>> <snip>
> >>>> kernel: XFS mounting filesystem sde
> >>>> kernel: Starting XFS recovery on filesystem: sde (logdev: internal)
> >>>> kernel: XFS: xlog_recover_process_data: bad clientid
> >>>> kernel: XFS: log mount/recovery failed: error 5
> >>> This is an IO error. Is the block device (/dev/sde) accessible
> >>> from the server #2 OK? Can you dd from that device?
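
For reference, the readability check suggested here can be done with dd,
or with a few lines of C along the following lines.  This is only a rough
sketch: the helper names (has_xfs_magic, read_first_sector) are made up
for illustration, and the 512-byte sector size is an assumption.  It
reads the first sector of the device and looks for the "XFSB" magic at
the start of the superblock.  A pass only shows that the device is
readable from server #2 and that the primary superblock magic is intact;
it says nothing about the state of the log.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

#define SECTOR_SIZE 512 /* assumed sector size for this sketch */

/* Returns 1 if the buffer starts with the XFS superblock magic "XFSB". */
static int has_xfs_magic(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    return len >= 4 && memcmp(buf, "XFSB", 4) == 0;
}

/*
 * Read the first sector of 'path' into 'buf'.
 * Returns 0 on success, -1 if the open fails, the read fails,
 * or fewer than SECTOR_SIZE bytes come back.
 */
static int read_first_sector(const char *path, unsigned char buf[SECTOR_SIZE])
{
    int fd;
    ssize_t n;

    assert(path != NULL && buf != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = read(fd, buf, SECTOR_SIZE);
    close(fd);
    return n == SECTOR_SIZE ? 0 : -1;
}
```

On server #2 one would call read_first_sector("/dev/sde", buf) and then
has_xfs_magic(buf, SECTOR_SIZE); a -1 from the first call would point at
a genuine IO problem rather than a bad log.
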
> >> Are you sure?
> >>
> >>                 if (ohead->oh_clientid != XFS_TRANSACTION &&
> >>                     ohead->oh_clientid != XFS_LOG) {
> >>                         xlog_warn(
> >>                 "XFS: xlog_recover_process_data: bad clientid");
> >>                         ASSERT(0);
> >>                         return (XFS_ERROR(EIO));
> >>                 }
> >>
> >> so it does say EIO but that seems to me to be the wrong error; looks
> >> more like a bad log to me.
> >>
> >> It does make me wonder if there's any sort of per-initiator caching on
> >> the iscsi target or something.  </handwave>
> > Should barriers be enabled in XFS then?
> 
> Could try it but I bet the iscsi target doesn't claim to support
> them...
You're probably right.

Is it possible for a transaction record to span two log buffers, with
only one of them making it to disk, so that the rest of the transaction
record appears corrupt?
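
To make the question concrete, here is a toy model of what I have in
mind.  It is a sketch under heavy assumptions, not the real recovery
code: the op header layout, the 64-byte "log buffer" size and all the
toy_* names are invented for illustration, the client ID values are as I
remember them from the XFS headers, and the model ignores whatever the
real log does to detect partial writes.  Three ops are written across two
buffers, the second op straddles the boundary, and only the first buffer
reaches the disk.  Recovery then walks into stale bytes and trips the
same kind of clientid check Eric quoted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Client IDs; values as recalled from the XFS log headers (assumption). */
#define XFS_TRANSACTION 0x69
#define XFS_LOG         0xaa

#define TOY_BUF 64 /* hypothetical "log buffer" size */

/* Toy op header: NOT the real on-disk op header layout. */
struct toy_op_header {
    uint32_t oh_tid;      /* transaction id */
    uint32_t oh_len;      /* payload bytes following this header */
    uint8_t  oh_clientid; /* who wrote the op */
};

/* Append one op (header plus 'len' payload bytes) at *off; -1 if it won't fit. */
static int toy_log_append(uint8_t *log, size_t size, size_t *off,
                          uint32_t tid, uint32_t len)
{
    struct toy_op_header h;

    assert(log != NULL && off != NULL);
    if (*off > size || size - *off < sizeof(h) + len)
        return -1;
    memset(&h, 0, sizeof(h));
    h.oh_tid = tid;
    h.oh_len = len;
    h.oh_clientid = XFS_TRANSACTION;
    memcpy(log + *off, &h, sizeof(h));
    memset(log + *off + sizeof(h), 0x5a, len); /* dummy payload */
    *off += sizeof(h) + len;
    return 0;
}

/* Walk ops in log[0..end); returns ops seen, or -1 on a bad clientid. */
static int toy_log_recover(const uint8_t *log, size_t end)
{
    size_t off = 0;
    int ops = 0;

    while (off + sizeof(struct toy_op_header) <= end) {
        struct toy_op_header h;

        memcpy(&h, log + off, sizeof(h));
        if (h.oh_clientid != XFS_TRANSACTION && h.oh_clientid != XFS_LOG)
            return -1; /* the "bad clientid" case */
        off += sizeof(h) + h.oh_len;
        ops++;
    }
    return ops;
}

/* Write three ops over two buffers, optionally losing the second buffer. */
static int toy_torn_write_demo(int second_buffer_reached_disk)
{
    uint8_t mem[2 * TOY_BUF];  /* what the server meant to write */
    uint8_t disk[2 * TOY_BUF]; /* what the target actually holds */
    size_t off = 0;

    memset(mem, 0, sizeof(mem));
    memset(disk, 0xff, sizeof(disk)); /* stale contents on disk */

    if (toy_log_append(mem, sizeof(mem), &off, 1, 20) ||
        toy_log_append(mem, sizeof(mem), &off, 1, 40) || /* straddles the buffers */
        toy_log_append(mem, sizeof(mem), &off, 1, 20))
        return -2;

    memcpy(disk, mem, TOY_BUF); /* first buffer always lands */
    if (second_buffer_reached_disk)
        memcpy(disk + TOY_BUF, mem + TOY_BUF, TOY_BUF);

    return toy_log_recover(disk, off);
}
```

In this model toy_torn_write_demo(1) recovers all three ops, while
toy_torn_write_demo(0) fails on the third op header because it lies
entirely in the buffer that never reached the disk.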

> 
> -eric
> 
> >> -Eric
