From: Mike Krieger
Subject: XFS / xfssyncd lock-ups on 2.6.38-8
Date: Sun, 20 Nov 2011 08:34:02 -0800
To: xfs@oss.sgi.com

We're running a dozen Amazon AWS instances (on Ubuntu Natty Narwhal, kernel 2.6.38-8). We've recently brought up several machines based on some previous snapshots (EBS snapshots, rather than LVM), and they've been locking up under load. The dmesg output is below; does this issue look familiar, or is it perhaps fixed in a later kernel? Or could it be indicative of some data corruption in the snapshot process?

The drive is being used for Postgres write-ahead logs, so it's a write-heavy, read-light drive. When the array (4 drives in RAID0) freezes up, nothing seems to fix it short of a hard restart of the machine. We've tried things like stopping Postgres, issuing a 'drop cache' to the kernel, and trying to kill the locked process, to no avail.

Would appreciate any thoughts/pointers to fixes or workarounds if this is a known issue.

Thanks,
Mike

====

(The /dev/md126 array that is locking up is an XFS RAID0 across 4 volumes)

The errors look like this:

[558307.361854] INFO: task xfssyncd/md126:1029 blocked for more than 120 seconds.
[558307.361867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[558307.361874] xfssyncd/md126 D ffff881116f13b00 0 1029 2 0x00000000
[558307.361879] ffff881088989d00 0000000000000246 ffff881088989fd8 ffff881088988000
[558307.361884] 0000000000013b00 ffff8810866c3120 ffff881088989fd8 0000000000013b00
[558307.361889] ffff881089b84440 ffff8810866c2d80 ffffffff815dc13e ffff88108916e400
[558307.361894] Call Trace:
[558307.361904] [] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[558307.361932] [] xlog_grant_log_space+0x4a8/0x500 [xfs]
[558307.361937] [] ? default_wake_function+0x0/0x20
[558307.361951] [] xfs_log_reserve+0xff/0x140 [xfs]
[558307.361967] [] xfs_trans_reserve+0x9c/0x200 [xfs]
[558307.361980] [] xfs_fs_log_dummy+0x43/0x90 [xfs]
[558307.361995] [] xfs_sync_worker+0x81/0x90 [xfs]
[558307.362009] [] xfssyncd+0x183/0x230 [xfs]
[558307.362025] [] ? xfssyncd+0x0/0x230 [xfs]
[558307.362030] [] kthread+0x96/0xa0
[558307.362035] [] kernel_thread_helper+0x4/0x10
[558307.362038] [] ? int_ret_from_sys_call+0x7/0x1b
[558307.362041] [] ? retint_restore_args+0x5/0x6
[558307.362045] [] ? kernel_thread_helper+0x0/0x10
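
To check whether the same warning is showing up on the other instances, a minimal Python sketch along these lines could pull the hung-task reports out of saved dmesg text. The function name find_hung_tasks and the regular expression are illustrative assumptions, written only against the "blocked for more than N seconds" message format shown in the trace above; other kernel versions may word the message differently.

```python
import re

# Matches the hung-task warning line as it appears in the trace above, e.g.
# "[558307.361854] INFO: task xfssyncd/md126:1029 blocked for more than 120 seconds."
# (Hypothetical pattern; adjust if your kernel words the message differently.)
HUNG_TASK = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(dmesg_text):
    """Return one dict per hung-task warning found in dmesg_text."""
    reports = []
    for line in dmesg_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            reports.append({
                "timestamp": float(match.group("ts")),      # seconds since boot
                "task": match.group("name"),                # e.g. xfssyncd/md126
                "pid": int(match.group("pid")),
                "threshold_secs": int(match.group("secs")),
            })
    return reports


sample = (
    "[558307.361854] INFO: task xfssyncd/md126:1029 "
    "blocked for more than 120 seconds."
)
print(find_hung_tasks(sample))
```

The text to scan could be the output of dmesg redirected to a file on each instance; the sketch only reads text and does not touch the kernel or the filesystem.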