Date: Mon, 12 Sep 2011 20:59:38 -0700
Subject: Re: 3.1-rc4: spectacular kernel errors / filesystem crash
From: Jesse Brandeburg
To: Justin Piszcz
Cc: linux-kernel@vger.kernel.org, xfs@oss.sgi.com, Alan Piszcz, NetDEV list

added netdev because it appears to start with an igb tx hang

On Sun, Sep 11, 2011 at 2:40 AM, Justin Piszcz wrote:
> Hi,
>
> Over the past 24-48 hours I was running some CPU-intensive jobs and there
> was heavy I/O on the RAID (9750-24i4e + a RAID6).
>
> I believe most of the problem started when I included many kernel options as
> modules (before, I only compiled in [*] the drivers I used). Something appears
> to have gone awry in the kernel, and afterwards disks started going in and
> out, XFS shut down, etcetera.
>
> I'm opening a case with LSI to see what happened with the 3ware card;
> however, after a power cycle, everything came back OK (the drives and HW are
> physically OK). It is rebuilding onto those two drives with CFG-OP-FAIL, but
> other than that, everything 'seems' OK; I still need to do an fsck.
>
> Something went wrong in the kernel and caused a cascading effect of errors.
> This occurred (I believe) when I started to run a lot of encoding jobs;
> however, I was doing a lot of data transfer for the past 24-48 hours on the
> RAID array. The system (separate SSD/EXT4) remained unaffected, but other
> weird stuff happened as well.
>
> I still see these in the logs after the reboot (not often), while the RAID
> controller is rebuilding onto the two drives that showed CFG-OP-FAIL (the
> physical drives are 100% healthy):
>
> [ 1062.925904] 3w-sas 0000:83:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.
>
> So, my plan:
>
> 1. Report this error to LKML+XFS mailing lists.
> 2. Open case with LSI support.
> 3. Recompile the kernel the way I have for many years (only compile in [*]
>    the options that are needed; do not build drivers as modules).
> 4. Reboot Linux systems and see if this recurs under the same
>    workload, after the RAID is done rebuilding.
>
> --
>
> These errors are quite long, so I will upload them to HTTP and paste the
> relevant bits below.
>
> --
>
> URLs for FULL logs:
>
> 1. tw_cli /cX show diag:
>    http://home.comcast.net/~jpiszcz/20110911/show_diag.txt
>
> 2. Full kernel log (and previous morning of kernel crash)
>    http://home.comcast.net/~jpiszcz/20110911/kern.log.txt
>
> 3. tw_cli /cX show all
>    http://home.comcast.net/~jpiszcz/20110911/cfg-fail.txt
>
> --
>
> Summary (what seems to have occurred; I have not done a full analysis yet)
>
> 1. 3ware card freaked out due to kernel/RCU/APIC(?) errors.
>
> 2. Then, the time source went unstable (this happens with weird kernel bugs
>    on many different hosts; I have seen this over time).
>
> 3. Then, on the 3ware card, drives started leaving and being re-inserted
>    by themselves; XFS went off-line to protect the filesystem due to the
>    3ware issues.
>
> --
>
> 3ware/RAID -- Interesting errors:
>
> I've never seen this before on a 3ware RAID controller, at least from what
> I can remember, and I've been using 3ware cards for many years.
>
> p2    CFG-OP-FAIL    -    2.73 TB   SATA  2   -    Hitachi HDS723030AL
> p3    CFG-OP-FAIL    -    2.73 TB   SATA  3   -    Hitachi HDS723030AL
>
> --
>
> Kernel/ERRORS:
>
> FWIW it all seems to start during an encoding job around 21:00:
>
> Sep 10 18:00:00 p34 kernel: [520427.143054] ixgbe 0000:03:00.0: eth6: NIC Link is Down
> Sep 10 19:20:04 p34 kernel: [525223.256098] 3w-sas: scsi1: AEN: INFO (0x04:0x002B): Verify completed:unit=0.
> Sep 10 20:59:39 p34 kernel: [531189.671361] ------------[ cut here ]------------
> Sep 10 20:59:39 p34 kernel: [531189.671376] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x23f/0x250()
> Sep 10 20:59:39 p34 kernel: [531189.671378] Hardware name: X8DTH-i/6/iF/6F
> Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb): transmit queue 5 timed out
> Sep 10 20:59:39 p34 kernel: [531189.671382] Modules linked in: dm_mod tcp_diag parport_pc ppdev lp parport inet_diag pl2303 ftdi_sio snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd soundcore ub cdc_acm usbserial joydev serio_raw nouveau ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi i7core_edac edac_core video
> Sep 10 20:59:39 p34 kernel: [531189.671414] Pid: 83, comm: ksoftirqd/19 Not tainted 3.1.0-rc4 #1
> Sep 10 20:59:39 p34 kernel: [531189.671415] Call Trace:
> Sep 10 20:59:39 p34 kernel: [531189.671424]  [] warn_slowpath_common+0x7a/0xb0
> Sep 10 20:59:39 p34 kernel: [531189.671427]  [] warn_slowpath_fmt+0x41/0x50
> Sep 10 20:59:39 p34 kernel: [531189.671433]  [] ? schedule+0x2e4/0x950
> Sep 10 20:59:39 p34 kernel: [531189.671436]  [] dev_watchdog+0x23f/0x250
> Sep 10 20:59:39 p34 kernel: [531189.671440]  [] run_timer_softirq+0xf2/0x220
> Sep 10 20:59:39 p34 kernel: [531189.671443]  [] ? qdisc_reset+0x50/0x50
> Sep 10 20:59:39 p34 kernel: [531189.671446]  [] __do_softirq+0x98/0x120
> Sep 10 20:59:39 p34 kernel: [531189.671448]  [] run_ksoftirqd+0xb5/0x160
> Sep 10 20:59:39 p34 kernel: [531189.671454]  [] ? __do_softirq+0x120/0x120
> Sep 10 20:59:39 p34 kernel: [531189.671458]  [] kthread+0x87/0x90
> Sep 10 20:59:39 p34 kernel: [531189.671462]  [] kernel_thread_helper+0x4/0x10
> Sep 10 20:59:39 p34 kernel: [531189.671465]  [] ? kthread_worker_fn+0x130/0x130
> Sep 10 20:59:39 p34 kernel: [531189.671467]  [] ? gs_change+0xb/0xb
> Sep 10 20:59:39 p34 kernel: [531189.671468] ---[ end trace 553dfe731fce91ba ]---
> Sep 10 20:59:39 p34 kernel: [531189.671478] igb 0000:01:00.1: eth1: Reset adapter
> Sep 10 20:59:42 p34 kernel: [531192.826058] igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
> Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:947]
>
> --
>
> Currently...
>
> After all of this happened, I stopped all I/O and all processes on the
> system, shut down the host, removed the power, and powered it back up. The
> drives that showed CFG-OP-FAIL before now show as REBUILDING; I am waiting
> for them to rebuild before doing anything else.
>
> Justin.
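
To check the ordering claim (igb tx hang first, soft lockup after), something like the rough Python sketch below could be run over the full kern.log.txt. It is only a sketch under assumptions: it expects unwrapped syslog-style lines like the ones quoted above, the event labels and helper names (extract_events, first_event) are made up for illustration, and the regexes are copied from the messages in this thread rather than from any driver documentation. The sample lines are the ones quoted in the report.

```python
import re

# Syslog-style kernel line: "<Mon DD HH:MM:SS> <host> kernel: [<uptime>] <message>"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+kernel:\s+\[\s*(\d+\.\d+)\]\s+(.*)$"
)

# Hypothetical event labels; the regexes mirror messages quoted in this thread.
EVENT_PATTERNS = [
    ("link_down", re.compile(r"NIC Link is Down")),
    ("tx_timeout", re.compile(r"NETDEV WATCHDOG: \S+ \(\w+\): transmit queue \d+ timed out")),
    ("adapter_reset", re.compile(r"Reset adapter")),
    ("soft_lockup", re.compile(r"BUG: soft lockup - CPU#\d+ stuck for \d+s")),
]


def extract_events(lines):
    """Return (uptime, wallclock, label, message) tuples sorted by kernel uptime."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a kernel log line in the assumed format
        wallclock, _host, uptime, message = match.groups()
        for label, pattern in EVENT_PATTERNS:
            if pattern.search(message):
                events.append((float(uptime), wallclock, label, message))
                break
    events.sort(key=lambda event: event[0])
    return events


def first_event(events, label):
    """Return the earliest event with the given label, or None if absent."""
    for event in events:
        if event[2] == label:
            return event
    return None


# Lines taken from the quoted report, with the mail client's wrapping undone.
SAMPLE_LINES = [
    "Sep 10 18:00:00 p34 kernel: [520427.143054] ixgbe 0000:03:00.0: eth6: NIC Link is Down",
    "Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb): transmit queue 5 timed out",
    "Sep 10 20:59:39 p34 kernel: [531189.671478] igb 0000:01:00.1: eth1: Reset adapter",
    "Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:947]",
]

if __name__ == "__main__":
    for uptime, wallclock, label, message in extract_events(SAMPLE_LINES):
        print(f"{uptime:14.6f}  {wallclock}  {label:13s}  {message}")
```

Sorting on the bracketed kernel uptime rather than the wall-clock stamp is deliberate, since the report says the time source went unstable. On the real log, open() the file and pass the file object to extract_events() instead of SAMPLE_LINES.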