On Tue, Sep 07, 2010 at 02:26:54PM +0200, Tejun Heo wrote:
> On 09/07/2010 12:35 PM, Tejun Heo wrote:
> > Can you please help me a bit more? Are you saying the following?
> > Work w0 starts execution on wq0. w0 tries locking but fails. Does
> > delay(1) and requeues itself on wq0 hoping another work w1 would be
> > queued on wq0 which will release the lock. The requeueing should make
> > w0 queued and executed after w1, but instead w1 never gets executed
> > while w0 hogs the CPU constantly by re-executing itself. Also, how
> > does delay(1) help with chewing up CPU? Are you talking about
> > avoiding constant lock/unlock ops starving other lockers? In such
> > case, wouldn't cpu_relax() make more sense?
> Ooh, almost forgot. There was an nr_active underflow bug in workqueue
> code which could lead to malfunctioning max_active regulation and
> problems during queue freezing, so you could be hitting that too. I
> sent out a pull request some time ago but it hasn't been pulled into
> mainline yet. Can you please pull from the following branch, add
> WQ_HIGHPRI as discussed before and see whether the problem is still
> reproducible?

Ok, it looks as if the WQ_HIGHPRI flag is all that was required to avoid
the log IO completion starvation livelocks. I haven't yet pulled
the tree below, but I've now created about a billion inodes without
seeing any evidence of the livelock occurring.
Hence it looks like I've been seeing two separate livelocks - one in
the VM that Mel's patches fix, and one caused by the workqueue
changeover that the WQ_HIGHPRI change fixes.
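For anyone following along, the change boils down to creating the log
completion workqueue with WQ_HIGHPRI so its work items go to the head of
the per-CPU worklist instead of waiting behind long-running work. This is
only a minimal sketch, not the actual XFS patch: it assumes the workqueue
is created with alloc_workqueue() as in Tejun's cmwq tree, and the
surrounding init function is illustrative.

    /*
     * Sketch only: create the "xfslogd" workqueue as high priority so
     * log IO completion work cannot be starved by other work items
     * executing on the same CPU. alloc_workqueue() takes the name,
     * flags, and max_active; names other than "xfslogd" here are
     * hypothetical.
     */
    static struct workqueue_struct *xfslogd_workqueue;

    static int example_init(void)
    {
    	xfslogd_workqueue = alloc_workqueue("xfslogd", WQ_HIGHPRI, 1);
    	if (!xfslogd_workqueue)
    		return -ENOMEM;
    	return 0;
    }

Work queued with queue_work(xfslogd_workqueue, ...) then jumps ahead of
pending normal-priority work, which is what breaks the starvation cycle
described above.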
Thanks for your insights, Tejun - I'll push the workqueue change
through the XFS tree to Linus.