Date: Tue, 12 Jun 2001 13:42:29 -0600
From: "D. Stimits"
Reply-To: stimits@idcomm.com
CC: linux-xfs@oss.sgi.com
Subject: Re: XFS and knfsd

At one point I had a NULL pointer dereference at the same address during
boot of a new kernel that did not have the SGI patches. It appeared to be
an aic7xxx failure, but was not. I keep mentioning one patch that is part
of Alan Cox's 2.4.5 ac series but is not yet in the main kernel source.
Even if you cannot use this patch at all times, can you test the
following?

In the Linux source, fs/block_dev.c, near line 596 (depending on kernel
version), there is a function "ioctl_by_bdev".
In that function, add the line below that starts with "+" (do not include
the "+" itself):

int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)
{
        kdev_t rdev = to_kdev_t(bdev->bd_dev);
        struct inode inode_fake;
        int res;
        mm_segment_t old_fs = get_fs();

        if (!bdev->bd_op->ioctl)
                return -EINVAL;
        inode_fake.i_rdev=rdev;
+       inode_fake.i_bdev=bdev;
        init_waitqueue_head(&inode_fake.i_wait);
        set_fs(KERNEL_DS);
        res = bdev->bd_op->ioctl(&inode_fake, NULL, cmd, arg);
        set_fs(old_fs);
        return res;
}

Please note that on my SMP system, several kernel versions or
combinations die at bootup while trying to work with filesystems. Without
this line, loopback encrypted partitions are also very likely to cause a
hard lockup on this machine (about 90% of loopback encrypted partition
commands caused a lockup). I have added this line to every kernel I have
used since then, and have never seen the Oops again.

Here is my first Oops (no ksymoops because it was fatal and unbootable),
without XFS:

Trying to unmount old root ... <1>Unable to handle kernel NULL pointer dereference at virtual address 00000010
 printing eip:
c01c5bda
*pde = 00000000
Oops: 0000
CPU: 1
EIP: 0010:[]
EFLAGS: 00010202
eax: 00000000 ebx: 00000000 ecx: 00001261 edx: c1479d98
esi: 00000000 edi: c1479e2c ebp: ffffffff esp: c1479d68
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 1, stackpage=c1479000)
Stack: c1478000 cfe97f60 c1479e2c c013b0fb c1479d98 00000000 00001261 00000000
       cfeac560 00000000 fffffffe cfe97f60 cff06ea4 cff06e10 c0346340 0001def6
       cfef0001 c01ea7b2 cff06e00 00000082 00000202 c14e1cb8 cfef8200 cff07e60
Call Trace: [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] []
Code: 8b 40 10 83 f8 02 7e 0e b8 f0 ff ff ff eb 7e 8d b4 26 00 00
Kernel panic: Attempted to kill init!

While it is quite possible these are not the same thing, it is still
fatal in some filesystem cases if this fix is not added (with or without
XFS; it is a general bug).

D. Stimits, stimits@idcomm.com

PS: I'm in the Boulder area too.

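For anyone who wants to see why the one added line matters without rebuilding a kernel, here is a rough user-space sketch of the same pattern. Everything in it is an assumption made for illustration: the mock_* types and functions are hypothetical stand-ins, not the 2.4 kernel structures, and the fake inode is zero-initialized so the unpatched case behaves predictably (in the real function it is an uninitialized stack variable). The idea is only that a driver ioctl handler which follows inode->i_bdev finds nothing useful there unless ioctl_by_bdev fills the field in first.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct mock_bdev;

/* Hypothetical stand-in for the kernel's struct inode. */
struct mock_inode {
    unsigned short i_rdev;        /* stands in for kdev_t */
    struct mock_bdev *i_bdev;     /* the field the patch initializes */
};

/* Hypothetical stand-in for the block device operations table. */
struct mock_bdev_ops {
    int (*ioctl)(struct mock_inode *inode, unsigned cmd, unsigned long arg);
};

/* Hypothetical stand-in for struct block_device. */
struct mock_bdev {
    unsigned short bd_dev;
    int bd_openers;
    const struct mock_bdev_ops *bd_op;
};

/* A made-up driver handler that trusts inode->i_bdev. In a kernel the
 * NULL case would be the dereference that oopses; here it returns -1. */
static int mock_driver_ioctl(struct mock_inode *inode, unsigned cmd,
                             unsigned long arg)
{
    (void)cmd;
    (void)arg;
    assert(inode != NULL);
    if (inode->i_bdev == NULL)
        return -1;
    return inode->i_bdev->bd_openers;
}

const struct mock_bdev_ops mock_ops = { mock_driver_ioctl };

/* Mirrors the shape of ioctl_by_bdev; apply_fix toggles the "+" line. */
int mock_ioctl_by_bdev(struct mock_bdev *bdev, unsigned cmd,
                       unsigned long arg, int apply_fix)
{
    struct mock_inode inode_fake = { 0, NULL };

    if (!bdev->bd_op || !bdev->bd_op->ioctl)
        return -EINVAL;
    inode_fake.i_rdev = bdev->bd_dev;
    if (apply_fix)
        inode_fake.i_bdev = bdev;   /* the line added by the patch */
    return bdev->bd_op->ioctl(&inode_fake, cmd, arg);
}
```

Calling mock_ioctl_by_bdev with apply_fix set to 1 lets the handler reach the device through the inode; with apply_fix set to 0 the handler sees a NULL i_bdev, which is the situation the patch is meant to avoid.
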
Kirk Thoning wrote:
>
> Has this issue been resolved? I am getting the same problem on a Redhat
> 7.1 system
> using the SGI kernel-2.4.2-SGI_XFS_1.0.i686.rpm. Almost all clients are
> Redhat 6.2 (~15 clients) with a couple 7.0 and 3 HP-UX 10.20. My
> impression is that the load wasn't that high, since it takes 1-2 weeks
> for this to occur.
>
> Here's my output:
>
> Jun 6 09:05:40 ccgg kernel: Unable to handle kernel NULL pointer
> dereference at virtual address 00000010
> Jun 6 09:05:40 ccgg kernel: printing eip:
> Jun 6 09:05:40 ccgg kernel: c88e7e83
> Jun 6 09:05:40 ccgg kernel: pgd entry c793b000: 0000000000000000
> Jun 6 09:05:40 ccgg kernel: pmd entry c793b000: 0000000000000000
> Jun 6 09:05:40 ccgg kernel: ... pmd not present!
> Jun 6 09:05:40 ccgg kernel: Oops: 0000
> Jun 6 09:05:40 ccgg kernel: CPU: 0
> Jun 6 09:05:41 ccgg kernel: EIP:
> 0010:[ipchains:__insmod_ipchains_S.bss_L1076+564547/16321505]
> Jun 6 09:05:41 ccgg kernel: EIP: 0010:[]
> Jun 6 09:05:41 ccgg kernel: EFLAGS: 00010246
> Jun 6 09:05:41 ccgg kernel: eax: 00000000 ebx: 00000000 ecx:
> c7fdadb0 edx: 00000010
> Jun 6 09:05:41 ccgg kernel: esi: c31db5a0 edi: c31dbaa0 ebp:
> c31dbaa0 esp: c23f7edc
> Jun 6 09:05:41 ccgg kernel: ds: 0018 es: 0018 ss: 0018
> Jun 6 09:05:41 ccgg kernel: Process nfsd (pid: 1005,
> stackpage=c23f7000)
> Jun 6 09:05:41 ccgg kernel: Stack: 00000003 0b01e79b c88e8286 c31dbaa0
> 00000003 c2338410 46000000 c2338400
> Jun 6 09:05:41 ccgg kernel: c23f7f4c c7f56be0 c02fa000 ffffff8c
> 00000000 c88e8614 c7f56a00 0b01e79b
> Jun 6 09:05:41 ccgg kernel: 00000000 00000000 00000001 c2338400
> c2338290 c2338490 c2338000 c24eb800
> Jun 6 09:05:41 ccgg kernel: Call Trace:
> [ipchains:__insmod_ipchains_S.bss_L1076+565574/16320478]
> [ipchains:__insmod_ipchains_S.bss_L1076+566484/16319568]
> [ipchains:__insmod_ipchains_S.bss_L1076+559621/16326431]
> [ipchains:__insmod_ipchains_S.bss_L1076+624832/16261220]
> [ipchains:__insmod_ipchains_S.bss_L1076+558131/16327921]
> [ipchains:__insmod_ipchains_S.bss_L1076+624832/16261220]
> [ipchains:__insmod_ipchains_S.bss_L1076+372168/16513884]
> Jun 6 09:05:41 ccgg kernel: Call Trace: [] []
> [] [] [] []
> []
> Jun 6 09:05:41 ccgg kernel:
> [ipchains:__insmod_ipchains_S.bss_L1076+625984/16260068]
> [ipchains:__insmod_ipchains_S.bss_L1076+624648/16261404]
> [ipchains:__insmod_ipchains_S.bss_L1076+557553/16328499]
> [kernel_thread+35/48]
> Jun 6 09:05:41 ccgg kernel: [] []
> [] []
> Jun 6 09:05:41 ccgg kernel:
> Jun 6 09:05:41 ccgg kernel: Code: 8b 40 10 39 d0 74 21 8d 58 c8 39 f3
> 75 06 8b 5a 04 83 c3 c8
>
> >
> > Are there any tricks to getting knfsd working with XFS? Our server which
> > serves an XFS partition gets an "oops" after about 15 hours of extremely
> > heavy use. I'm using the latest distro from CVS (2.4.3-XFS). Here's the
> > kdb output of the oops, just in case someone has any ideas on how to
> > debug this:
> >
> > Unable to handle kernel NULL pointer dereference at virtual address 00000000
> > printing eip:
> > c0145933
> > *pde = 00000000
> >
> > Entering kdb (current=0xceffc000, pid 625) on processor 0 Oops: Oops
> > due to oops @ 0xc0145933
> > eax = 0xcff9ae70 ebx = 0xffffffe8 ecx = 0x0000000f edx = 0xcff80000
> > esi = 0x00000000 edi = 0xceffdeb4 esp = 0xceffde48 eip = 0xc0145933
> > ebp = 0xceffde68 xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010207
> > xds = 0xcff80018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xceffde14
> > [0]kdb> bt
> > EBP EIP Function(args)
> > 0xceffde68 0xc0145933 d_lookup+0x67 (0xc83e70c0, 0xceffdeb4)
> > kernel .text 0xc0100000 0xc01458cc 0xc01459e8
> > 0xceffde7c 0xc013c911 cached_lookup+0x11 (0xc83e70c0, 0xceffdeb4, 0x0)
> > kernel .text 0xc0100000 0xc013c900 0xc013c954
> > 0xceffdea0 0xc013d862 lookup_hash+0x52 (0xceffdeb4, 0xc83e70c0)
> > kernel .text 0xc0100000 0xc013d810 0xc013d914
> > 0xceffdec0 0xc013d969 lookup_one+0x55 (0xc4b860e0, 0xc83e70c0)
> > kernel .text 0xc0100000 0xc013d914 0xc013d97c
> > 0xceffdf04 0xc016d882 nfsd_lookup+0x3b2 (0xcf05ac00, 0xcf05aa00, 0xc4b860e0,
> > 0x6, 0xcf05a800)
> > kernel .text 0xc0100000 0xc016d4d0 0xc016d9cc
> > 0xceffdf2c 0xc016b50b nfsd_proc_lookup+0x87 (0xcf05ac00, 0xcf05aa00, 0xcf05a800)
> > kernel .text 0xc0100000 0xc016b484 0xc016b520
> > 0xceffdf4c 0xc016adf9 nfsd_dispatch+0xc5 (0xcf05ac00, 0xceff8014)
> > kernel .text 0xc0100000 0xc016ad34 0xc016ae90
> > 0xceffdfa8 0xc0293bca svc_process+0x2ca (0xcfef5b20, 0xcf05ac00)
> > kernel .text 0xc0100000 0xc0293900 0xc0293e30
> > 0xceffdfec 0xc016abba nfsd+0x1a2
> > kernel .text 0xc0100000 0xc016aa18 0xc016ad34
> > 0xc0105547 kernel_thread+0x23
> > kernel .text 0xc0100000 0xc0105524 0xc010555c
> >
> >
> > Any help is appreciated.
> >
> > Ajay
>
> --
> ************************************************************
> * Kirk Thoning             Phone:  303 497-6078            *
> * NOAA/CMDL                Fax:    303 497-6290            *
> * R/CMDL1                  e-mail: Kirk.W.Thoning@noaa.gov *
> * 325 Broadway                                             *
> * Boulder, Colorado 80303                                  *
> ************************************************************