Date: Tue, 14 Aug 2001 21:15:36 +0200
From: Andi Kleen
To: linux-xfs@oss.sgi.com
Subject: Some XFS problems
Message-ID: <20010814211536.A5490@gruyere.muc.suse.de>

Here are some things I stumbled over during XFS stress testing. I don't have time to fix them, and they're not that critical (except perhaps for the last one, which is admittedly a bit vague), but I thought I would just note them.

- When the log replay fails, not all objects in the XFS zones are freed. As a result, one of the zones is not freed on module unload afterwards, and when the XFS module is loaded again it stumbles over a BUG() in slab.c that checks for duplicate zones.

- The pagebuf layer will surely not run on sparc32: it passes the _irqsave interrupt mask between functions, which causes quick crashes there because on that architecture the interrupt mask includes the register window pointer and can only be restored in the same function (e.g. _pagebuf_free_lockable_buffer violates that).
  [I personally do not care about sparc32, I just thought I would note it. It is probably fairly easy to fix if anyone on the list is motivated.]

- When a filesystem shutdown occurs, there seem to be some problems in the error cleanup paths. For example, page locks seem to get leaked. In one case I had a whole bunch of processes stuck in lock_page and wait_on_page after a shutdown.

- I had a buffer leak for some time in my version of XFS that eventually caused it to run out of memory because pagecache pages were not freed (no other corruption, though). When that happened under heavy fsstress load, I usually pressed reset because user space was dead. I think the page leak itself didn't cause any corruption. In several cases I got a corrupted file system afterwards that needed an xfs_repair; otherwise, even after a remount with log replay, it would shut down the file system (from various places) while accessing the fsstress test directory. In one case I got a corrupted log with garbage in it that caused log replay to fail.

-Andi