From: "Foris, Jim (MED)"
To: Eric Sandeen
Cc: Kai Leibrandt, "'Simon Matter'", "'Axel Thimm'", linux-xfs@oss.sgi.com
Subject: Re: Patch 1300 & rpm issue with 1.3.0
Date: Thu, 28 Aug 2003 14:41:07 -0500
Message-ID: <3F4E5AD3.80101@med.ge.com>
X-list: linux-xfs

Eric Sandeen wrote:
> On Thu, 28 Aug 2003, Kai Leibrandt wrote:
>
>
>>That's just what I was thinking; is rpm only an indication that other apps
>>might have issues as well? If so, how do we identify them and
>>rectify the problems? In the kernel, or in the app?
>
>
> That's not clear to me yet, but we have done some O_DIRECT stresstesting
> and it's all been fine. So this doesn't seem to be a problem with
> O_DIRECT in general, which makes me think it might be the app.
>

Using "strace" on a RH 2.4.20-20.9.XFS1.3.0 system to follow what "rpm"
does during an install, the key difference seems to be the following
sequence:

WORKS (created an EXT3 partition, copied /var/lib/rpm/* to it, then
mounted it at /var/lib/rpm)

4217 access("/var/lib/rpm", W_OK) = 0 <0.000011>
4217 access("/var/lib/rpm/__db.001", F_OK) = -1 ENOENT (No such file or directory) <0.000011>
4217 access("/var/lib/rpm/Packages", F_OK) = 0 <0.000011>
4217 stat64("/var/lib/rpm/DB_CONFIG", 0xbffeeb60) = -1 ENOENT (No such file or directory) <0.000019>
4217 brk(0) = 0x807e000 <0.000006>
4217 brk(0x807f000) = 0x807f000 <0.000008>
4217 open("/var/lib/rpm/DB_CONFIG", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory) <0.000011>
4217 stat64("/var/lib/rpm/__db.001", 0xbffeeb90) = -1 ENOENT (No such file or directory) <0.000010>
4217 open("/var/lib/rpm/__db.001", O_RDWR|O_CREAT|O_EXCL|O_DIRECT|O_LARGEFILE, 0644) = 4 <0.000044>
4217 fcntl64(4, F_SETFD, FD_CLOEXEC) = 0 <0.000007>
4217 open("/var/lib/rpm/__db.001", O_RDWR|O_CREAT|O_DIRECT|O_LARGEFILE, 0644) = 5 <0.000011>
4217 fcntl64(5, F_SETFD, FD_CLOEXEC) = 0 <0.000006>
4217 _llseek(5, 0, [0], SEEK_END) = 0 <0.000006>
4217 _llseek(5, 8192, [8192], SEEK_CUR) = 0 <0.000007>
4217 write(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192 <0.000137>
4217 mmap2(NULL, 16384, PROT_READ|PROT_WRITE, MAP_SHARED, 5, 0) = 0x40019000 <0.000011>
4217 close(5) = 0 <0.000007>

FAILS (/var/lib/rpm resides on an XFS partition)

4144 access("/var/lib/rpm/__db.001", F_OK) = -1 ENOENT (No such file or directory) <0.000010>
4144 access("/var/lib/rpm/Packages", F_OK) = 0 <0.000011>
4144 stat64("/var/lib/rpm/DB_CONFIG", 0xbffef0e0) = -1 ENOENT (No such file or directory) <0.000010>
4144 brk(0) = 0x807e000 <0.000006>
4144 brk(0x807f000) = 0x807f000 <0.000008>
4144 open("/var/lib/rpm/DB_CONFIG", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory) <0.000012>
4144 stat64("/var/lib/rpm/__db.001", 0xbffef110) = -1 ENOENT (No such file or directory) <0.000010>
4144 open("/var/lib/rpm/__db.001", O_RDWR|O_CREAT|O_EXCL|O_DIRECT|O_LARGEFILE, 0644) = 4 <0.000103>
4144 fcntl64(4, F_SETFD, FD_CLOEXEC) = 0 <0.000006>
4144 open("/var/lib/rpm/__db.001", O_RDWR|O_CREAT|O_DIRECT|O_LARGEFILE, 0644) = 5 <0.000012>
4144 fcntl64(5, F_SETFD, FD_CLOEXEC) = 0 <0.000006>
4144 _llseek(5, 0, [0], SEEK_END) = 0 <0.000006>
4144 _llseek(5, 8192, [8192], SEEK_CUR) = 0 <0.000006>
4144 write(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = -1 EINVAL (Invalid argument) <0.000007>
4144 open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000016>
4144 open("/usr/share/locale/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000011>
4144 open("/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000013>
4144 open("/usr/share/locale/en.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000011>
4144 open("/usr/share/locale/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000010>
4144 open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000013>
4144 write(2, "rpmdb: ", 7) = 7 <0.000017>
4144 write(2, "write: 0xbffed120, 8192: Invalid"..., 41) = 41 <0.000012>
4144 write(2, "\n", 1) = 1 <0.000012>
4144 close(5) = 0 <0.000008>

According to the RPM 4.2 source, the file "__db.001" contains database
environment information and is also used to synchronize between multiple
threads/processes.
But the details of how/why "rpm" uses this file are not as significant as
the different behavior shown in the example above: XFS and EXT3 differ in
how sparse files are created/handled. From the above example it looks
like XFS+O_DIRECT+sparse file creation is broken/not supported.

(Things work if LD_ASSUME_KERNEL is set, because then "rpm" uses a
different method to control its database accesses and never runs through
the offending code above. Although it finds that __db.001 is not there,
it does not try to create it.)

Does this ring any bells with anyone?

On the bright side, it DOES look like an application-specific combination
of factors causes the failure, so the problem is not likely to be widely
seen.

Jim Foris

> -Eric
>
>
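
For anyone who wants to poke at this outside of "rpm", the open/seek/write part of the trace can be replayed with a few lines of Python. This is only a sketch under stated assumptions, not code from the rpm or Berkeley DB sources: the names `replay`, `is_aligned`, `BLOCK` and `MISALIGN` are made up for illustration, it does a single open() rather than the two seen in the trace, and the 512-byte boundary is an assumed sector size. It also lets the write buffer be shifted, because the address in the rpmdb error message (0xbffed120) is not a multiple of 512 and O_DIRECT normally places alignment requirements on the user buffer. Whether that, rather than the hole in the file, is what XFS rejects is not established here.

```python
import errno
import mmap
import os
import tempfile

BLOCK = 8192      # seek offset and write size seen in the strace output
MISALIGN = 0x120  # low bits of the buffer address 0xbffed120 in rpmdb's error


def is_aligned(address, boundary=512):
    """Return True if address is a multiple of boundary (512 is an assumption)."""
    return address % boundary == 0


def replay(path, misalign=0):
    """Replay open + 2x lseek + write; return bytes written or an errno name."""
    # An anonymous mmap is page-aligned; slicing at `misalign` shifts the start.
    backing = mmap.mmap(-1, 2 * BLOCK)
    whole = memoryview(backing)
    view = whole[misalign:misalign + BLOCK]
    fd = -1
    try:
        # O_DIRECT is Linux-specific; fall back to 0 where Python lacks it.
        flags = os.O_RDWR | os.O_CREAT | getattr(os, "O_DIRECT", 0)
        fd = os.open(path, flags, 0o644)
        os.lseek(fd, 0, os.SEEK_END)
        os.lseek(fd, BLOCK, os.SEEK_CUR)  # leaves a hole in front of the write
        return os.write(fd, view)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        if fd >= 0:
            os.close(fd)
        view.release()
        whole.release()
        backing.close()


if __name__ == "__main__":
    # Run from a directory on the filesystem under test (XFS, EXT3, ...).
    workdir = tempfile.mkdtemp(dir=".")
    print("aligned buffer:", replay(os.path.join(workdir, "aligned.db")))
    print("shifted buffer:", replay(os.path.join(workdir, "shifted.db"), MISALIGN))
```

Running it once on each filesystem and comparing the two printed results should show whether the sparse write itself fails, or only the write from a shifted buffer. The outcome depends on the kernel and filesystem, so no expected output is given.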