On Thu, Sep 08, 2011 at 10:43:24AM -0700, Simon Kirby wrote:
> On Thu, Sep 08, 2011 at 05:13:05PM +0200, Lars Ellenberg wrote:
> > Sorry for double posting on drbd-dev, I managed to strip the other lists
> > from Cc.
> > > We upgraded from 2.6.36 which seemed to have a page leak (file pages left
> > > on the LRU) and so would eventually perform very poorly. 2.6.37 and
> > > 2.6.38 seemed to have some unix socket issue that caused heartbeat to
> > > wedge. Shall we enable lock debugging or something here?
> > That could help us understand that stack trace.
> > It looks like cpu 1 blocks in
> > > [ 1532.427149] [<ffffffff8103d512>] ? try_to_wake_up+0xc2/0x270
> > > [ 1532.427149] <<EOE>> <IRQ> [<ffffffff8103d6cd>] default_wake_function+0xd/0x10
> > Which does not make sense to me at all.
> Well, good news, I think.. I believe this may be related to
> "PCI: Set PCI-E Max Payload Size on fabric", added by b03e7495a862b02829.
> 3.1-rc5 is running now with a patch to basically disable those changes,
> and has been stable for 12 hours. It usually hung within a few minutes.
> The XFS people say it was very likely not 58d84c4ee0389ddeb86238d5 which
> is the only other thing that changed between these versions that seems to
> be at all in the hang path.
> Also, when the thing hangs, it stops pinging immediately, and with the
> PCI-E max payload thing active, the device that raises a bus error is
> actually the PCI-E to PCI-X bridge chip used to support the BCM5708 NICs,
> so that all seems related.
Except that I accidentally git reset out the patch, and so it's been
running unmodified 79016f648872549392d232cd648bd02298c2d2bb (past -rc5),
and still hasn't crashed, so I guess it _was_ the XFS changes, or
something else. Boggle. In any event, it's still running well. :)