From ahu@outpost.ds9a.nl Fri Apr 1 01:01:20 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 01:01:26 -0800 (PST) Received: from outpost.ds9a.nl (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j3191Ju3017215 for ; Fri, 1 Apr 2005 01:01:19 -0800 Received: by outpost.ds9a.nl (Postfix, from userid 1000) id CE4E33FC3; Fri, 1 Apr 2005 11:01:16 +0200 (CEST) Date: Fri, 1 Apr 2005 11:01:16 +0200 From: bert hubert To: Ben Greear Cc: hadi@cyberus.ca, "David S. Miller" , netdev Subject: Re: RFC: Redirect-Device Message-ID: <20050401090116.GA21361@outpost.ds9a.nl> Mail-Followup-To: bert hubert , Ben Greear , hadi@cyberus.ca, "David S. Miller" , netdev References: <424C6089.1080507@candelatech.com> <1112303627.1073.71.camel@jzny.localdomain> <424C6B10.6030200@candelatech.com> <1112306031.1073.109.camel@jzny.localdomain> <424C7813.4000101@candelatech.com> <20050331143531.30f4eb8f.davem@davemloft.net> <424C7F96.4070002@candelatech.com> <1112311618.1090.20.camel@jzny.localdomain> <424C8E2C.70302@candelatech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <424C8E2C.70302@candelatech.com> User-Agent: Mutt/1.3.28i X-Virus-Scanned: ClamAV 0.83/798/Thu Mar 31 01:54:41 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1185 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ahu@ds9a.nl Precedence: bulk X-list: netdev On Thu, Mar 31, 2005 at 03:56:28PM -0800, Ben Greear wrote: > >I think you are more comfortable with using netdevices and ioctls and > >/proc. > > Definately. Ever tried to sniff a socket with ethereal? :) On loopback, all the time. I'm probably dense but I don't understand what problem you've solved with this interface. Could you elaborate a bit? 
-- http://www.PowerDNS.com Open source, database driven DNS Software http://netherlabs.nl Open and Closed source services From pekkas@netcore.fi Fri Apr 1 01:28:54 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 01:28:59 -0800 (PST) Received: from netcore.fi (netcore.fi [193.94.160.1]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j319SqP5018568 for ; Fri, 1 Apr 2005 01:28:53 -0800 Received: from localhost (pekkas@localhost) by netcore.fi (8.11.6/8.11.6) with ESMTP id j319SiR11426; Fri, 1 Apr 2005 12:28:44 +0300 Date: Fri, 1 Apr 2005 12:28:44 +0300 (EEST) From: Pekka Savola To: Ben Greear cc: "'netdev@oss.sgi.com'" Subject: Re: RFC: Redirect-Device In-Reply-To: <424CDBA9.80703@candelatech.com> Message-ID: References: <424C6089.1080507@candelatech.com> <424CDBA9.80703@candelatech.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Virus-Scanned: ClamAV 0.83/798/Thu Mar 31 01:54:41 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1186 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: pekkas@netcore.fi Precedence: bulk X-list: netdev On Thu, 31 Mar 2005, Ben Greear wrote: >> Is there something in your problem statement I'm missing? > > That would be similar to what I'm doing, but I'm not really trying > to tunnel anything. I am trying to duplicate the behaviour of two > ethernet interfaces connected by an external cross-over cable, and I'm > trying to duplicate it at the network-device interface level so that > common tools (and my own tools) can treat these virtual interfaces > just like ethernet interfaces. Oh ok, what you seem to want is some kind of "Ethernet loopback++", but the "looped" packets should come back from a virtual interface instead of the same interface? Btw, does the kernel support traditional loopback, so that at the last stage, just before sending a packet on the wire, it would be pushed back. 
-- Pekka Savola "You each name yourselves king, yet the Netcore Oy kingdom bleeds." Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings From herbert@gondor.apana.org.au Fri Apr 1 01:37:18 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 01:37:26 -0800 (PST) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j319bGoK019244 for ; Fri, 1 Apr 2005 01:37:17 -0800 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHIaL-00028u-00; Fri, 01 Apr 2005 19:37:01 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHIZt-0000N0-00; Fri, 01 Apr 2005 19:36:33 +1000 Date: Fri, 1 Apr 2005 19:36:33 +1000 To: "David S. Miller" Cc: netdev@oss.sgi.com Subject: [NETLINK] cb_lock does not needs ref count on sk Message-ID: <20050401093633.GA32707@gondor.apana.org.au> References: <20050327091524.GA23215@elte.hu> <20050327133811.GA5569@elte.hu> <20050329104906.GA19836@gondor.apana.org.au> <20050329114926.GA14986@elte.hu> <20050330082640.GA8269@gondor.apana.org.au> <20050330170236.2bddf666.davem@davemloft.net> <20050331231922.GA26587@gondor.apana.org.au> <20050331232322.GA26693@gondor.apana.org.au> <20050331203313.57e1c5c3.davem@davemloft.net> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="Nq2Wo0NMKNjxTN9z" Content-Disposition: inline In-Reply-To: <20050331203313.57e1c5c3.davem@davemloft.net> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/798/Thu Mar 31 01:54:41 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1187 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev --Nq2Wo0NMKNjxTN9z Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Hi Dave: Here is a little 
optimisation for the cb_lock used by netlink_dump.

While fixing that race earlier, I noticed that the reference count held
by cb_lock is completely useless. The reason is that in order to obtain
the protection of the reference count, you have to take the cb_lock.
But the only way to take the cb_lock is through dereferencing the socket.

That is, you must already possess a reference count on the socket before
you can take advantage of the reference count held by cb_lock. As a
corollary, we can remove the reference count held by the cb_lock.

Signed-off-by: Herbert Xu

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~}
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

--Nq2Wo0NMKNjxTN9z
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=p

===== net/netlink/af_netlink.c 1.75 vs edited =====
--- 1.75/net/netlink/af_netlink.c	2005-04-01 16:25:14 +10:00
+++ edited/net/netlink/af_netlink.c	2005-04-01 19:30:22 +10:00
@@ -374,7 +374,6 @@
 		nlk->cb->done(nlk->cb);
 		netlink_destroy_callback(nlk->cb);
 		nlk->cb = NULL;
-		__sock_put(sk);
 	}
 	spin_unlock(&nlk->cb_lock);
@@ -1100,7 +1099,6 @@
 	spin_unlock(&nlk->cb_lock);
 	netlink_destroy_callback(cb);
-	__sock_put(sk);
 	return 0;
 }
@@ -1139,7 +1137,6 @@
 		return -EBUSY;
 	}
 	nlk->cb = cb;
-	sock_hold(sk);
 	spin_unlock(&nlk->cb_lock);
 	netlink_dump(sk);

--Nq2Wo0NMKNjxTN9z--

From abhishek@pal.ece.iisc.ernet.in Fri Apr 1 01:40:56 2005
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 01:41:01 -0800 (PST)
Received: from ece.iisc.ernet.in (ece.iisc.ernet.in [144.16.64.2]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j319em4l019848 for ; Fri, 1 Apr 2005 01:40:54 -0800
Received: from pal.ece.iisc.ernet.in (pal.ece.iisc.ernet.in [144.16.64.149]) by ece.iisc.ernet.in (8.12.6/8.12.6) with ESMTP id j319cS8V023201 for ; Fri, 1 Apr 2005 15:08:28 +0530 (IST) (envelope-from abhishek@pal.ece.iisc.ernet.in)
Received: by 
pal.ece.iisc.ernet.in (Postfix, from userid 1047) id 97D6331E59; Fri, 1 Apr 2005 15:10:40 +0530 (IST)
Received: from localhost (localhost [127.0.0.1]) by pal.ece.iisc.ernet.in (Postfix) with ESMTP id 8C98A31E57 for ; Fri, 1 Apr 2005 15:10:40 +0530 (IST)
Date: Fri, 1 Apr 2005 15:10:40 +0530 (IST)
From: Abhishek Gupta
To: netdev@oss.sgi.com
Subject: Problem using HTB
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Virus-Scanned: ClamAV 0.83/798/Thu Mar 31 01:54:41 2005 on oss.sgi.com
X-Virus-Status: Clean
X-archive-position: 1188
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: abhishek@pal.ece.iisc.ernet.in
Precedence: bulk
X-list: netdev

hello everybody

I am working on a project related to QoS. I am using Linux's tc to
configure my PC based router. My setup is as follows:-

           eth0            eth1       eth0           eth0
PC-based server|----------|PC-based Router|---------|PC-Based Client
                             (using tc)

* All my ethernet cards are on 100Mbps lan
* Traffic generators being used:
        > UDP: gen_send @ about 1Mbps
          (http://www.citi.umich.edu/projects/qbone/generator.html)
* Kernel versions being used:-
        > At Router: linux-2.4.20
        > At Client and Server: Linux-2.4.7-10
* iproute2 versions:-
        > At Router: iproute2-ss020116
        > At Client and Server: iproute2-ss010824
* Packets before leaving server and client are being marked with DSCP
  bits using Linux's tc option; Marking is done based on two-tuples:
  destination ip address and port number
* At the Router, I have the following configuration(only related to HTB)
  for eth0 and similar configuration exists for eth1 too:

---Router Configuration Starts Here-----
DEV0='eth0'

tc qdisc add dev $DEV0 parent 1: handle 2: htb default 30

tc class add dev $DEV0 parent 2: classid 2:1 htb rate 100kbit burst 100 \
    ceil 100kbit
tc class add dev $DEV0 parent 2:1 classid 2:10 htb rate 60kbit burst 100 \
    ceil 100kbit
tc class add dev $DEV0 parent 2:1 classid 2:20 htb rate 30kbit burst 60 \
    ceil 100kbit
tc class add dev $DEV0 parent 2:1 classid 2:30 htb rate 10kbit burst 80 \
    ceil 100kbit

tc qdisc add dev $DEV0 parent 2:10 gred setup DPs 3 default 3 grio
tc qdisc change dev $DEV0 parent 2:10 gred limit 185000 min 11394 \
    max 11395 burst 100 avpkt 128 bandwidth 100kbit DP 1 probability 1 \
    prio 1
tc qdisc change dev $DEV0 parent 2:10 gred limit 17972 min 4748 max 9493 \
    burst 50 avpkt 1000 bandwidth 100kbit DP 2 probability 0.01 prio 2
tc qdisc change dev $DEV0 parent 2:10 gred limit 4368 min 1796 max 3582 \
    burst 25 avpkt 1000 bandwidth 100kbit DP 3 probability 0.01 prio 2

tc qdisc add dev $DEV0 parent 2:20 gred setup DPs 2 default 2 grio
tc qdisc change dev $DEV0 parent 2:20 gred limit 52480 min 11311 \
    max 11312 burst 60 avpkt 256 bandwidth 100kbit DP 1 probability 1 \
    prio 1
tc qdisc change dev $DEV0 parent 2:20 gred limit 47184 min 5898 \
    max 11796 burst 60 avpkt 1000 bandwidth 100kbit DP 2 probability 0.01 \
    prio 2

tc qdisc add dev $DEV0 parent 2:30 gred setup DPs 1 default 1 grio
tc qdisc change dev $DEV0 parent 2:30 gred limit 15728 min 1966 \
    max 3932 burst 80 avpkt 200 bandwidth 100kbit DP 1 probability 0.04 \
    prio 1
-----Router Configuration Ends Here------

Now, the problem is that when I am sending packets from just one UDP
source(at server), I am getting outbound bit rate at eth0(of Router) as
12kbps even though I have ceiled the corresponding HTB class to 100kbps;
similar thing happens when I have two UDP sources(both at server). So,
even though I have configured for 100kbps, I am getting only 12kbps as
the link speed.

Please help me out.
Abhishek ========================================================================= ABHISHEK GUPTA E-mail:abhishek_it_bhu@yahoo.co.in ========================================================================= From akpm@osdl.org Fri Apr 1 02:11:54 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 02:12:01 -0800 (PST) Received: from smtp.osdl.org (fire.osdl.org [65.172.181.4]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31ABrpd020835 for ; Fri, 1 Apr 2005 02:11:53 -0800 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp.osdl.org (8.12.8/8.12.8) with ESMTP id j31ABgs4005803 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Fri, 1 Apr 2005 02:11:42 -0800 Received: from bix (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id j31ABXgB002239; Fri, 1 Apr 2005 02:11:34 -0800 Date: Fri, 1 Apr 2005 02:11:21 -0800 From: Andrew Morton To: netdev@oss.sgi.com Cc: lukeross@sys3175.co.uk Subject: Fw: [Bugme-new] [Bug 4430] New: Virtual interfaces cannot have their own mtu Message-Id: <20050401021121.76da449b.akpm@osdl.org> X-Mailer: Sylpheed version 0.9.7 (GTK+ 1.2.10; i386-redhat-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.106 $ X-Scanned-By: MIMEDefang 2.36 X-Virus-Scanned: ClamAV 0.83/798/Thu Mar 31 01:54:41 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1189 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: akpm@osdl.org Precedence: bulk X-list: netdev hm, mtu is implemented in the device driver - you might be out of luck. 
Begin forwarded message: Date: Fri, 1 Apr 2005 02:01:19 -0800 From: bugme-daemon@osdl.org To: bugme-new@lists.osdl.org Subject: [Bugme-new] [Bug 4430] New: Virtual interfaces cannot have their own mtu http://bugme.osdl.org/show_bug.cgi?id=4430 Summary: Virtual interfaces cannot have their own mtu Kernel Version: kernel-2.6.9-1.6_FC2 Status: NEW Severity: low Owner: acme@conectiva.com.br Submitter: lukeross@sys3175.co.uk Distribution: Fedora Core 2,3 Hardware Environment: Broadcom gigabit card using tg3 (Tyan s2885 onboard) Problem Description: eth0 and eth0:1 cannot have different mtus. I have a jumbo-frame capable switch with three devices plugged in. Two are PCs with jumbo-capable cards, the other is a wireless router which isn't, and hangs if either PC attempts to discover whether it can support jumbo frames. To get the benefit of jumbo frames between the two PCs, I tried to set up eth0:1 - on a different subnet to the wireless router - on both PCs, and set the mtu of the eth0:1 to 9000. However it is not possible to set the mtu for eth0:1 to 9000 without setting the mtu of eth0 to 9000 as well. Also noted in http://xcat.org/pipermail/xcat-user/2003-April/002358.html ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. 
From hadi@cyberus.ca Fri Apr 1 03:03:28 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 03:03:35 -0800 (PST) Received: from mx02.cybersurf.com (mx02.cybersurf.com [209.197.145.105]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31B3SJF024059 for ; Fri, 1 Apr 2005 03:03:28 -0800 Received: from mail.cyberus.ca ([209.197.145.21]) by mx02.cybersurf.com with esmtp (Exim 4.30) id 1DHJvw-0003ME-QW for netdev@oss.sgi.com; Fri, 01 Apr 2005 06:03:24 -0500 Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DHJvt-0007Pl-5B; Fri, 01 Apr 2005 06:03:21 -0500 Subject: Re: PATCH: IPSEC xfrm events From: jamal Reply-To: hadi@cyberus.ca To: Herbert Xu Cc: Patrick McHardy , Masahide NAKAMURA , "David S. Miller" , netdev In-Reply-To: <20050401042106.GA27762@gondor.apana.org.au> References: <1112319441.1089.83.camel@jzny.localdomain> <20050401042106.GA27762@gondor.apana.org.au> Content-Type: text/plain Organization: jamalopolous Message-Id: <1112353398.1096.116.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 01 Apr 2005 06:03:18 -0500 Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/798/Thu Mar 31 01:54:41 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1190 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Thu, 2005-03-31 at 23:21, Herbert Xu wrote: > On Thu, Mar 31, 2005 at 08:37:21PM -0500, jamal wrote: > > > --- a/include/net/xfrm.h 2005-03-25 22:28:26.000000000 -0500 > > +++ b/include/net/xfrm.h 2005-03-31 19:26:24.000000000 -0500 > > > > +/* callback structure passed from either netlink or pfkey */ > > +struct km_cb > > This name is a bit non-specific. > note: used by both SP/SA > > +{ > > + u32 data; /* callee to caller */ > > +}; > > Might as well put the event into it if we're going to keep this > structure. 
It'll help to shorten the function prototypes that > use it. > > And then we can just call this structure km_event. > sure. > > -extern void km_policy_expired(struct xfrm_policy *pol, int dir, int hard); > > +extern void km_policy_expired(struct xfrm_policy *pol, int dir, int event); > > Bogus prototype change. > agreed. > > +void xfrm_state_del_flush(struct xfrm_state *x) > > +{ > > + spin_lock_bh(&x->lock); > > + __xfrm_state_delete(x); > > + spin_unlock_bh(&x->lock); > > +} > > Sorry, I've changed my mind on this. This demonstrates why the > km_notify_* calls should be made from af_key/xfrm_user directly > instead of here. > > > Some of these functions are called internally as you discovered. > Since the notifications should only be generated by user requests, > calls to km_notify_* should be made at the places where the user > requests are handled, which is in the KM itself. > You need to be able to generate events at every km not just the one that generated the request. You also (most of the time) need to do it before affected object dissapears. So I am missing your point on this one. > Otherwise we'll have to add hacks like this to avoid the > notification for internal users. > I may be paranoid but i do this because x could be garbage collected way before i send the km user message - and i need it to use it to generate the event. I could take a copy of it ... > > void xfrm_state_delete(struct xfrm_state *x) > > { > > + int notif = 0; > > spin_lock_bh(&x->lock); > > + /* > > + * its unfortunate we have to freeze gc for this > > + * one moment - the other alternative would involve > > + * memcopying the state and then announcing that. 
> > + * think SMP where theres an iota where this could mess > > + * up - JHS > > + */ > > + spin_lock_bh(&xfrm_state_gc_lock); > > + if (x->km.state != XFRM_STATE_DEAD) > > + notif = 1; > > __xfrm_state_delete(x); > > + > > + if (notif) > > + km_state_notify(x, NULL, XFRM_SAP_DELETED); > > You've caught a real bug for af_key here. It's currently possible to > receive two delete notifications for the same state. Can you elaborate? > However, may I suggest that we code this differently. Make > __xfrm_state_delete return 0 if the state was really deleted > and -ESRCH otherwise. > > Then af_key/xfrm_user can simply call km_state_notify if the > return value was zero. > Again like i said: I need to tell every km user about the event, not just the originator. > BTW there is no need to grab xfrm_state_gc_lock. You've got > a reference count on the state from your caller. > Aha! I missed that - I will remove it. > > @@ -270,6 +319,10 @@ > > } > > } > > spin_unlock_bh(&xfrm_state_lock); > > + if (count) { > > + c.data = proto; > > + km_state_notify(NULL, &c, XFRM_SAP_FLUSHED); > > + } > > The notification should occur in all cases, even if count == 0. > Well, Masahide-San and I actually did discuss this and he was of the same opinion as you. My opinion: We only generate events when something happens, not just because someone issues a command. If flush was issued and there was nothing to flush why generate an event? does the PFKEY RFC say anything on this? > > @@ -957,8 +1020,9 @@ > > if (x->tunnel) { > > struct xfrm_state *t = x->tunnel; > > > > + /* XXX: Avoid announce?? */ > > if (atomic_read(&t->tunnel_users) == 2) > > - xfrm_state_delete(t); > > + xfrm_state_del_flush(t); > > That's right. We don't want to announce internal states to the world. > I will remove that comment. Thats achieved in the above code although the called funtion may not have the appropriate name . 
> > --- a/net/xfrm/xfrm_policy.c 2005-03-25 22:28:21.000000000 -0500 > > +++ b/net/xfrm/xfrm_policy.c 2005-03-31 19:26:24.000000000 -0500 > > @@ -298,7 +298,7 @@ > > * entry dead. The rule must be unlinked from lists to the moment. > > */ > > > > -static void xfrm_policy_kill(struct xfrm_policy *policy) > > +static void xfrm_policy_kill(struct xfrm_policy *policy, int dir, int notif) > > Again, had you done the km_* calls from af_key/xfrm_user, then there'd > be no need to check notif here. > Refer to my comments above on being able to tell multiple managers about the events originated by one. Actually, given that this function is being called in many places i would say this is the exact central location you want to issue the announce from. > BTW, as it is you're announcing expired policies twice. Once as an > expire event and once as a delete event. This problem will also go > away if you move the km_* calls into af_key/xfrm_user. > Theres an announcement only when policy goes dead ;-> So only one not two. Same with the state as well. And again cant do it from af_key/xfrm_user if you want to have events generated by one km to be sent to another as well. Its pf_key that needs fixing. > > @@ -579,7 +586,7 @@ > > write_unlock_bh(&xfrm_policy_lock); > > > > if (old_pol) { > > - xfrm_policy_kill(old_pol); > > + xfrm_policy_kill(old_pol, dir, 1); > > } > > Please don't announce socket policies :) > I missed this one - sorry. > > --- a/net/xfrm/xfrm_user.c 2005-03-25 22:28:22.000000000 -0500 > > +++ b/net/xfrm/xfrm_user.c 2005-03-31 19:26:24.000000000 -0500 > > @@ -683,6 +683,10 @@ > > if (!xp) > > return err; > > > > + /* shouldnt excl be based on nlh flags?? > > + * Aha! this is anti-netlink really i.e more pfkey derived > > + * in netlink excl is a flag and you wouldnt need > > + * a type XFRM_MSG_UPDPOLICY - JHS */ > > Good point. Care to provide a patch to treat NEW + NLM_F_REPLACE > as UPD? 
> > > @@ -1053,10 +1057,10 @@ > > return -1; > > } > > > > -static int xfrm_send_state_notify(struct xfrm_state *x, int hard) > > +static int xfrm_exp_state_notify(struct xfrm_state *x, u32 hard) > > How about calling this xfrm_notify_sa_expired for consistency? > Ditto for the policy function. sure. > > > +static int xfrm_notify_sa_flush(struct km_cb *c) > > +{ > > + struct xfrm_usersa_flush *p; > > + struct nlmsghdr *nlh; > > + struct sk_buff *skb; > > + unsigned char *b; > > + u32 ppid = 0; > > + int len = NLMSG_LENGTH(sizeof(struct xfrm_usersa_flush)); > > + > > + skb = alloc_skb(len, GFP_ATOMIC); > > + if (skb == NULL) > > + return -ENOMEM; > > + b = skb->tail; > > + > > + nlh = NLMSG_PUT(skb, ppid, jiffies, > > If we're serious about providing sequence numbers then please > set it up as an atomic integer and use it throughout this file. > > Otherwise just pop zero in there. > I was just being lazy. I could send a 0 but whats wrong with using jiffies? > > + p = NLMSG_DATA(nlh); > > + if (!c) { > > + printk("xfrm_notify_sa_flush NULL km cb\n"); > > + p->proto = 0; > > Is anyone expected to call this with a NULL pointer? If not then > just let it OOPS. Same comment applies to the cb checks later on. > Will fix this. > > +static int xfrm_notify_sa( struct xfrm_state *x, int event, struct km_cb *c) > > > + if (event == XFRM_SAP_ADDED) > > + nlt = XFRM_MSG_NEWSA; > > + else if (event == XFRM_SAP_UPDATED) > > + nlt = XFRM_MSG_UPDSA; > > + else if (event == XFRM_SAP_DELETED) > > + nlt = XFRM_MSG_DELSA; > > + else > > + goto nlmsg_failure; > > Please use a switch. > sure. 
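
The switch asked for above could look roughly like the following. This is only a sketch: the enum values are hypothetical stand-ins so that the snippet compiles on its own (the patch itself uses the XFRM_SAP_* event codes and the XFRM_MSG_* netlink types), and sa_event_to_msg is an invented name, not a function from the patch.

```c
/* Hypothetical stand-ins for the patch's XFRM_SAP_* events and the
 * XFRM_MSG_* netlink message types; the numeric values are made up. */
enum sa_event { SA_ADDED, SA_UPDATED, SA_DELETED, SA_FLUSHED, SA_EXPIRED };
enum sa_msg { SA_MSG_NEW = 1, SA_MSG_UPD, SA_MSG_DEL };

/* Map an SA event to the message type to announce.
 * Returns -1 for events that have no per-SA message in this sketch
 * (the patch jumps to nlmsg_failure in that case). */
static int sa_event_to_msg(enum sa_event event)
{
	switch (event) {
	case SA_ADDED:
		return SA_MSG_NEW;
	case SA_UPDATED:
		return SA_MSG_UPD;
	case SA_DELETED:
		return SA_MSG_DEL;
	default:
		return -1;
	}
}
```

Compared with the if/else chain, the default case makes the "unknown event" path explicit in one place.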
> > +static int xfrm_send_state_notify(struct xfrm_state *x, int event, struct km_cb *c) > > +{ > > + > > + if ((event == XFRM_SAP_ADDED) || > > + (event == XFRM_SAP_UPDATED) || > > + (event == XFRM_SAP_DELETED)) > > + return xfrm_notify_sa(x, event, c); > > + > > + if (event == XFRM_SAP_FLUSHED) > > + xfrm_notify_sa_flush(c); > > + > > + if (event != XFRM_SAP_EXPIRED) > > + return 0; > > Again a switch would be perfect. > Will fix this. BTW, Herbert, thanks for taking the time; appreciated. cheers, jamal From hadi@cyberus.ca Fri Apr 1 03:15:54 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 03:16:00 -0800 (PST) Received: from mx03.cybersurf.com (mx03.cybersurf.com [209.197.145.106]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31BFrHe024882 for ; Fri, 1 Apr 2005 03:15:54 -0800 Received: from mail.cyberus.ca ([209.197.145.21]) by mx03.cybersurf.com with esmtp (Exim 4.30) id 1DHK7s-0005FE-I0 for netdev@oss.sgi.com; Fri, 01 Apr 2005 06:15:44 -0500 Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DHK7p-0008Tf-9u; Fri, 01 Apr 2005 06:15:41 -0500 Subject: Re: Resend: Re: PATCH: IPSEC acquire in presence of multiple managers From: jamal Reply-To: hadi@cyberus.ca To: "David S. Miller" Cc: herbert@gondor.apana.org.au, "David S. 
Miller" , nakam@linux-ipv6.org, shinta.sugimoto@ericsson.com, netdev In-Reply-To: <20050331211340.0e6fbdfb.davem@davemloft.net> References: <1111795927.1089.749.camel@jzny.localdomain> <1111862131.1092.872.camel@jzny.localdomain> <20050331211340.0e6fbdfb.davem@davemloft.net> Content-Type: text/plain Organization: jamalopolous Message-Id: <1112354137.1090.129.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 01 Apr 2005 06:15:38 -0500 Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/798/Thu Mar 31 01:54:41 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1191 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Fri, 2005-04-01 at 00:13, David S. Miller wrote: > On 26 Mar 2005 13:35:31 -0500 > jamal wrote: > > > Apologies, The last patch had some a glitch in the filename. Dave please > > apply this one instead > > Doesn't apply, in the current tree km_query() is marked static. > > Please regenerate your patch and sorry for not getting to this > sooner. Dave, I am combining this with the other event patch that is under discussion right now which i will end up sending to you. If you want it separate i could do that. 
cheers, jamal From herbert@gondor.apana.org.au Fri Apr 1 03:45:00 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 03:45:08 -0800 (PST) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31Biv2i026089 for ; Fri, 1 Apr 2005 03:44:58 -0800 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHKZX-00032I-00; Fri, 01 Apr 2005 21:44:19 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHKYE-0000p4-00; Fri, 01 Apr 2005 21:42:58 +1000 Date: Fri, 1 Apr 2005 21:42:58 +1000 To: jamal Cc: Patrick McHardy , Masahide NAKAMURA , "David S. Miller" , netdev Subject: Re: PATCH: IPSEC xfrm events Message-ID: <20050401114258.GA2932@gondor.apana.org.au> References: <1112319441.1089.83.camel@jzny.localdomain> <20050401042106.GA27762@gondor.apana.org.au> <1112353398.1096.116.camel@jzny.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1112353398.1096.116.camel@jzny.localdomain> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/798/Thu Mar 31 01:54:41 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1192 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev On Fri, Apr 01, 2005 at 06:03:18AM -0500, jamal wrote: > > > Some of these functions are called internally as you discovered. > > Since the notifications should only be generated by user requests, > > calls to km_notify_* should be made at the places where the user > > requests are handled, which is in the KM itself. > > You need to be able to generate events at every km not just the one that > generated the request. You also (most of the time) need to do it before I understand. 
However, that's not determined by where you put the km_notify call itself. Even when you call km_notify from af_key or xfrm_user it will notify every km in the system. It's the fact that we're calling km_notify instead of pfkey_broadcast or netlink_broadcast that's important, not the location. Having the km_notify call made in af_key/xfrm_user is convenient though for the reason I outlined above. > I may be paranoid but i do this because x could be garbage collected way > before i send the km user message - and i need it to use it to generate > the event. I could take a copy of it ... That's what the ref counter is for. > > You've caught a real bug for af_key here. It's currently possible to > > receive two delete notifications for the same state. > > Can you elaborate? Imagine you've got a KM that's trying to delete a state via af_key that's about to expire. If pfkey_delete looks up the state successfully, and then the timer triggers before the actual xfrm_state_delete, you will get one event generated by the timer and another by pfkey_delete. > Again like i said: I need to tell every km user about the event, not > just the originator. I'm suggesting that you add the km_notify calls to af_key and xfrm_user. That will take care of notifying everyone. > Well, Masahide-San and I actually did discuss this and he was of the > same opinion as you. My opinion: We only generate events when something > happens, not just because someone issues a command. If flush was issued > and there was nothing to flush why generate an event? does the PFKEY RFC > say anything on this? RFC 2367 says that: The messaging behavior for SADB_FLUSH is: Send an SADB_FLUSH message from a user process to the kernel. The kernel will return an SADB_FLUSH message to all listening sockets. As you can see, there is no exception for the case of an empty database. So my interpretation would be that a broadcast is needed. 
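
Two of the points above can be put into a small toy model: a delete helper that returns 0 only to the caller that really deleted the state (so a delete racing with the expiry timer is announced once), and a flush that announces even when the database was empty, following the reading of RFC 2367 given here. This is a sketch under loud assumptions: every name below is hypothetical, none of it is kernel code, and the locking that the real code would need around the dead check is left out.

```c
#include <errno.h>
#include <stddef.h>

/* Toy model, not kernel code: every name here is hypothetical. */
struct toy_state {
	int dead;	/* set once the state has been deleted */
};

static int notifications;	/* counts events sent to the key managers */

static void toy_notify(void)
{
	notifications++;
}

/* Returns 0 only for the caller that really deleted the state,
 * -ESRCH if it was already dead (e.g. the timer got there first).
 * Real code would do this check under the state lock. */
static int toy_state_delete(struct toy_state *x)
{
	if (x->dead)
		return -ESRCH;
	x->dead = 1;
	return 0;
}

/* User-request path: announce the delete only if we performed it. */
static int toy_user_delete(struct toy_state *x)
{
	int err = toy_state_delete(x);

	if (err == 0)
		toy_notify();
	return err;
}

/* Flush: delete everything, then announce once, even if nothing
 * was deleted. Returns the number of states actually deleted. */
static int toy_user_flush(struct toy_state *states, size_t n)
{
	int count = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (toy_state_delete(&states[i]) == 0)
			count++;
	toy_notify();
	return count;
}
```

With this shape the internal callers can use toy_state_delete directly and stay silent, while only the user-request paths generate events.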
> Refer to my comments above on being able to tell multiple managers about > the events originated by one. May I also refer you to my comment above about this being achieved by calling km_notify, even if you do it from within af_key or xfrm_user :) > Actually, given that this function is being called in many places i > would say this is the exact central location you want to issue the > announce from. Try this as an exercise. List all the xfrm_policy_kills that need notifications and all those that don't, you will find that the former all originate from delete/flush commands in af_key/xfrm_user, while the latter originate from other callers. In other words, by placing the call in af_key/xfrm_user you simplify the logic and make it more maintainable. > > BTW, as it is you're announcing expired policies twice. Once as an > > expire event and once as a delete event. This problem will also go > > away if you move the km_* calls into af_key/xfrm_user. > > Theres an announcement only when policy goes dead ;-> > So only one not two. Same with the state as well. Well when the policy expires you will get one expire notification from the current timer code and a new one from your patch since the timer calls xfrm_policy_delete. See my point? By putting the call in xfrm_policy.c you have to be really careful in dividing the internal users which shouldn't generate notifications and the external users which should. By doing it in af_key/xfrm_user you can avoid all this work. > And again cant do it from af_key/xfrm_user if you want to have events > generated by one km to be sent to another as well. Its pf_key that needs > fixing. Well I must repeat that if you were calling km_notify from af_key/xfrm_user you will be sending these events to all km's no matter what their affiliation is :) > > If we're serious about providing sequence numbers then please > > set it up as an atomic integer and use it throughout this file. > > > > Otherwise just pop zero in there. 
> > I was just being lazy. I could send a 0 but whats wrong with using > jiffies? Using jiffies means that you can have two successive messages that share the same sequence number. It's not a big deal of course. But if we're going to indicate ordering, we might as well go the full length. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From herbert@gondor.apana.org.au Fri Apr 1 03:47:14 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 03:47:20 -0800 (PST) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31BlDko026849 for ; Fri, 1 Apr 2005 03:47:13 -0800 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHKc0-000330-00; Fri, 01 Apr 2005 21:46:52 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHKbd-0000pi-00; Fri, 01 Apr 2005 21:46:29 +1000 From: Herbert Xu To: akpm@osdl.org (Andrew Morton) Subject: Re: Fw: [Bugme-new] [Bug 4430] New: Virtual interfaces cannot have their own mtu Cc: netdev@oss.sgi.com, lukeross@sys3175.co.uk Organization: Core In-Reply-To: <20050401021121.76da449b.akpm@osdl.org> X-Newsgroups: apana.lists.os.linux.netdev User-Agent: tin/1.7.4-20040225 ("Benbecula") (UNIX) (Linux/2.4.27-hx-1-686-smp (i686)) Message-Id: Date: Fri, 01 Apr 2005 21:46:29 +1000 X-Virus-Scanned: ClamAV 0.83/798/Thu Mar 31 01:54:41 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1193 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev Andrew Morton wrote: > > the eth0:1 to 9000. 
However it is not possible to set the mtu for eth0:1 to 9000 > without setting the mtu of eth0 to 9000 as well. The solution is to set the mtu using ip route in addition to setting it on eth0, e.g., ip ro add x.0.0.0/8 via gw dev eth0 mtu 1500 src a.b.c.d ip ro add y.0.0.0/8 via gw2 dev eth0 mtu 9000 src e.f.g.h You still have to set the mtu on eth0 to 9000 since that determines the maximum receive size as well (MRU). -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From hadi@cyberus.ca Fri Apr 1 04:24:49 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 04:24:54 -0800 (PST) Received: from mx02.cybersurf.com (mx02.cybersurf.com [209.197.145.105]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31COmBE032004 for ; Fri, 1 Apr 2005 04:24:49 -0800 Received: from mail.cyberus.ca ([209.197.145.21]) by mx02.cybersurf.com with esmtp (Exim 4.30) id 1DHLCf-0004Uu-JU for netdev@oss.sgi.com; Fri, 01 Apr 2005 07:24:45 -0500 Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DHLCb-0007cd-Jd; Fri, 01 Apr 2005 07:24:41 -0500 Subject: Re: PATCH: IPSEC xfrm events From: jamal Reply-To: hadi@cyberus.ca To: Herbert Xu Cc: Patrick McHardy , Masahide NAKAMURA , "David S. 
Miller" , netdev In-Reply-To: <20050401114258.GA2932@gondor.apana.org.au> References: <1112319441.1089.83.camel@jzny.localdomain> <20050401042106.GA27762@gondor.apana.org.au> <1112353398.1096.116.camel@jzny.localdomain> <20050401114258.GA2932@gondor.apana.org.au> Content-Type: text/plain Organization: jamalopolous Message-Id: <1112358278.1096.160.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 01 Apr 2005 07:24:38 -0500 Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1194 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Fri, 2005-04-01 at 06:42, Herbert Xu wrote: > On Fri, Apr 01, 2005 at 06:03:18AM -0500, jamal wrote: > > > > > Some of these functions are called internally as you discovered. > > > Since the notifications should only be generated by user requests, > > > calls to km_notify_* should be made at the places where the user > > > requests are handled, which is in the KM itself. > > > > You need to be able to generate events at every km not just the one that > > generated the request. You also (most of the time) need to do it before > > I understand. However, that's not determined by where you put the > km_notify call itself. Even when you call km_notify from af_key > or xfrm_user it will notify every km in the system. > > It's the fact that we're calling km_notify instead of pfkey_broadcast > or netlink_broadcast that's important, not the location. > > Having the km_notify call made in af_key/xfrm_user is convenient though > for the reason I outlined above. I think either scheme is fine really;-> I will definetely go back and consider the approach you are suggesting and see if it results into more maintanable code - then fair. 
Otherwise you realize it's more work for me ;->

> > > You've caught a real bug for af_key here. It's currently possible to
> > > receive two delete notifications for the same state.
> >
> > Can you elaborate?
>
> Imagine you've got a KM that's trying to delete a state via af_key that's
> about to expire. If pfkey_delete looks up the state successfully, and
> then the timer triggers before the actual xfrm_state_delete, you will
> get one event generated by the timer and another by pfkey_delete.
>

I haven't checked the state machine closely, but the following seems to
make sense:
The first thing that happens to delete the state/policy should win if
the state/policy is transitioned to dead.

> RFC 2367 says that:
>
>     The messaging behavior for SADB_FLUSH is:
>
>       Send an SADB_FLUSH message from a user process to the kernel.
>
>       The kernel will return an SADB_FLUSH message to all listening
>       sockets.
>
> As you can see, there is no exception for the case of an empty database.
> So my interpretation would be that a broadcast is needed.
>

Does it really make sense, Herbert? ;-> What is it that you just flushed
that results in the event?
The RFC is ambiguous in my opinion. Look at what it says about deleting
(same ambiguity).

----
3.1.4 SADB_DELETE

   The SADB_DELETE message causes the kernel to delete a Security
   Association from the key table. The delete message consists of the
   base header followed by the association, and the source and
   destination sockaddrs in the address extension. The kernel deletes
   the security association matching the type, spi, source address, and
   destination address in the message.

   The message behavior for SADB_DELETE is as follows:

     Send an SADB_DELETE message from a user process to the kernel.

     The kernel returns the SADB_DELETE message to all listening
     processes.
------

So why would you generate an event in the case when you didn't delete anything?
> > Actually, given that this function is being called in many places i
> > would say this is the exact central location you want to issue the
> > announce from.
>
> Try this as an exercise. List all the xfrm_policy_kills that need
> notifications and all those that don't, you will find that the former
> all originate from delete/flush commands in af_key/xfrm_user, while
> the latter originate from other callers.
>
> In other words, by placing the call in af_key/xfrm_user you simplify
> the logic and make it more maintainable.
>

I will go over the code and review. You may be absolutely right - that's
the better approach to take.

> > > BTW, as it is you're announcing expired policies twice. Once as an
> > > expire event and once as a delete event. This problem will also go
> > > away if you move the km_* calls into af_key/xfrm_user.
> >
> > There's an announcement only when policy goes dead ;->
> > So only one not two. Same with the state as well.
>
> Well when the policy expires you will get one expire notification from
> the current timer code and a new one from your patch since the timer
> calls xfrm_policy_delete.
>
> See my point? By putting the call in xfrm_policy.c you have to be
> really careful in dividing the internal users which shouldn't
> generate notifications and the external users which should. By doing
> it in af_key/xfrm_user you can avoid all this work.
>

That's a bug really which is being exposed now. So it has nothing to do
with the approach taken ;->
No expire should be sent if the policy has transitioned to dead. The bug
is trivial to fix - and actually should be fixed regardless of this
patch.

> > I was just being lazy. I could send a 0 but what's wrong with using
> > jiffies?
>
> Using jiffies means that you can have two successive messages that
> share the same sequence number. It's not a big deal of course. But
> if we're going to indicate ordering, we might as well go the full
> length.
>

Good point.
I will stay lazy and just set a 0 ;-> cheers, jamal From herbert@gondor.apana.org.au Fri Apr 1 04:37:40 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 04:37:50 -0800 (PST) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31Cbcta032644 for ; Fri, 1 Apr 2005 04:37:39 -0800 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHLNt-0003Fl-00; Fri, 01 Apr 2005 22:36:21 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHLNS-0000ud-00; Fri, 01 Apr 2005 22:35:54 +1000 Date: Fri, 1 Apr 2005 22:35:54 +1000 To: jamal Cc: Patrick McHardy , Masahide NAKAMURA , "David S. Miller" , netdev Subject: Re: PATCH: IPSEC xfrm events Message-ID: <20050401123554.GA3468@gondor.apana.org.au> References: <1112319441.1089.83.camel@jzny.localdomain> <20050401042106.GA27762@gondor.apana.org.au> <1112353398.1096.116.camel@jzny.localdomain> <20050401114258.GA2932@gondor.apana.org.au> <1112358278.1096.160.camel@jzny.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1112358278.1096.160.camel@jzny.localdomain> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1195 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev On Fri, Apr 01, 2005 at 07:24:38AM -0500, jamal wrote: > > I think either scheme is fine really;-> I will definetely go back and > consider the approach you are suggesting and see if it results into > more maintanable code - then fair. 
> Otherwise you realize it's more work
> for me ;->

Well I'm happy to code that part if you want :)

> I haven't checked the state machine closely, but the following seems to
> make sense:
> The first thing that happens to delete the state/policy should win if
> the state/policy is transitioned to dead.

Agreed. That's what we'll get if we make __xfrm_state_delete return
success/failure.

> So why would you generate an event in the case when you didn't delete anything?

You're right that the RFC isn't very clear.

Let's forget about the RFC and simply consider the usefulness of this.
I contend that it is useful to see a FLUSH notification even when
it flushed nothing.

The reason is that this is an indication to all listeners that the
database is completely empty.

> > Well when the policy expires you will get one expire notification from
> > the current timer code and a new one from your patch since the timer
> > calls xfrm_policy_delete.
> >
> > See my point? By putting the call in xfrm_policy.c you have to be
> > really careful in dividing the internal users which shouldn't
> > generate notifications and the external users which should. By doing
> > it in af_key/xfrm_user you can avoid all this work.
>
> That's a bug really which is being exposed now. So it has nothing to do
> with the approach taken ;->

You're right that it is a bug. However, this bug would've never triggered
before because we simply didn't have delete policy notifications :)

> No expire should be sent if the policy has transitioned to dead. The bug
> is trivial to fix - and actually should be fixed regardless of this
> patch.

Yes the same fix to __xfrm_state_delete can be applied to
xfrm_policy_delete.
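[Editor's sketch] The "first deleter wins" idea, with the delete routine
returning success or failure, can be illustrated with a small user-space
model. Everything here is hypothetical and is not the kernel's code: the
State class, state_delete and delete_and_notify are illustration names, and
a Python lock stands in for whatever locking the real state uses. The point
is only that a notification is generated by the one caller that actually
moved the state to dead.

```python
import threading


class State:
    """Hypothetical stand-in for a state or policy object."""

    def __init__(self):
        self.lock = threading.Lock()
        self.dead = False


def state_delete(state):
    """Mark the state dead; return True only for the caller that did it."""
    with state.lock:
        if state.dead:
            return False  # somebody else already deleted it
        state.dead = True
        return True


def delete_and_notify(state, source, events):
    """Generate one event, and only if this caller won the delete."""
    if state_delete(state):
        events.append(source)


# The expiry timer and a user-requested delete both hit the same state.
events = []
s = State()
delete_and_notify(s, "timer", events)
delete_and_notify(s, "pfkey_delete", events)
```

In this model only the first caller records an event; the second sees the
failure return and stays silent, so listeners get a single notification.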
Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From hadi@cyberus.ca Fri Apr 1 04:59:48 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 04:59:52 -0800 (PST) Received: from mx02.cybersurf.com (mx02.cybersurf.com [209.197.145.105]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31Cxl2i001350 for ; Fri, 1 Apr 2005 04:59:48 -0800 Received: from mail.cyberus.ca ([209.197.145.21]) by mx02.cybersurf.com with esmtp (Exim 4.30) id 1DHLkW-0001aH-N2 for netdev@oss.sgi.com; Fri, 01 Apr 2005 07:59:44 -0500 Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DHLkU-0002gZ-2l; Fri, 01 Apr 2005 07:59:42 -0500 Subject: Re: PATCH: IPSEC xfrm events From: jamal Reply-To: hadi@cyberus.ca To: Herbert Xu Cc: Patrick McHardy , Masahide NAKAMURA , "David S. Miller" , netdev In-Reply-To: <20050401123554.GA3468@gondor.apana.org.au> References: <1112319441.1089.83.camel@jzny.localdomain> <20050401042106.GA27762@gondor.apana.org.au> <1112353398.1096.116.camel@jzny.localdomain> <20050401114258.GA2932@gondor.apana.org.au> <1112358278.1096.160.camel@jzny.localdomain> <20050401123554.GA3468@gondor.apana.org.au> Content-Type: text/plain Organization: jamalopolous Message-Id: <1112360379.1096.193.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 01 Apr 2005 07:59:39 -0500 Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1196 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Fri, 2005-04-01 at 07:35, Herbert Xu wrote: > On Fri, Apr 01, 2005 at 07:24:38AM -0500, jamal wrote: > > > > I think either scheme is fine really;-> I will definetely go back and > > 
consider the approach you are suggesting and see if it results in
> > more maintainable code - then fair. Otherwise you realize it's more work
> > for me ;->
>
> Well I'm happy to code that part if you want :)
>

Let me review first. If it is valuable (we may have to leave expire
alone). If i can get it done within next day or two fine - else if i get
busyed out elsewhere i will hand it to you. Actually if you have plenty
cycles and are very enthusiastic about this i can hand it to you right
now ;-> Masahide and myself have some momentum going right now but i
don't think this will be that disruptive.

> You're right that the RFC isn't very clear.
>
> Let's forget about the RFC and simply consider the usefulness of this.
> I contend that it is useful to see a FLUSH notification even when
> it flushed nothing.
>
> The reason is that this is an indication to all listeners that the
> database is completely empty.
>

Ok, let me hear from Masahide-san: If he still holds the same opinion as
you then i will make the change.

> > That's a bug really which is being exposed now. So it has nothing to do
> > with the approach taken ;->
>
> You're right that it is a bug. However, this bug would've never triggered
> before because we simply didn't have delete policy notifications :)
>

indeed.

> > No expire should be sent if the policy has transitioned to dead. The bug
> > is trivial to fix - and actually should be fixed regardless of this
> > patch.
>
> Yes the same fix to __xfrm_state_delete can be applied to
> xfrm_policy_delete.
>

agreed.
cheers, jamal From hadi@cyberus.ca Fri Apr 1 05:18:45 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 05:18:49 -0800 (PST) Received: from mx01.cybersurf.com (mx01.cybersurf.com [209.197.145.104]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31DIif1002619 for ; Fri, 1 Apr 2005 05:18:45 -0800 Received: from mail.cyberus.ca ([209.197.145.21]) by mx01.cybersurf.com with esmtp (Exim 4.30) id 1DHM2o-00010O-TF for netdev@oss.sgi.com; Fri, 01 Apr 2005 06:18:38 -0700 Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DHM2q-00055C-BD; Fri, 01 Apr 2005 08:18:40 -0500 Subject: Re: PATCH: IPSEC xfrm events From: jamal Reply-To: hadi@cyberus.ca To: Herbert Xu Cc: Patrick McHardy , Masahide NAKAMURA , "David S. Miller" , netdev In-Reply-To: <1112360379.1096.193.camel@jzny.localdomain> References: <1112319441.1089.83.camel@jzny.localdomain> <20050401042106.GA27762@gondor.apana.org.au> <1112353398.1096.116.camel@jzny.localdomain> <20050401114258.GA2932@gondor.apana.org.au> <1112358278.1096.160.camel@jzny.localdomain> <20050401123554.GA3468@gondor.apana.org.au> <1112360379.1096.193.camel@jzny.localdomain> Content-Type: text/plain Organization: jamalopolous Message-Id: <1112361517.1089.197.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 01 Apr 2005 08:18:37 -0500 Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1197 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Fri, 2005-04-01 at 07:59, jamal wrote: > Let me review first. If it is valuable (we may have to leave expire > alone). Ok, from a first review I would agree with you the result of doing it in km user will be more maintainable. It will result in a larger patch but in the long run more maintainable. 
> If i can get it done within next day or two fine - else if i get > busyed out elsewhere i will hand it to you. Let me code away at it - The offer still stands though ;-> cheers, jamal From nakam@linux-ipv6.org Fri Apr 1 06:20:00 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 06:20:05 -0800 (PST) Received: from mail406.noc.n-bone.net (mail4.noc.n-bone.net [138.243.50.144]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31EJxCd004700 for ; Fri, 1 Apr 2005 06:19:59 -0800 Received: from [192.168.2.196] (polaris.linux-ipv6.org [203.178.140.10]) by mail406.noc.n-bone.net (NBONE-MTA) with ESMTP id CD2CBFD9; Fri, 1 Apr 2005 23:19:47 +0900 (JST) Message-ID: <424D5881.4010005@linux-ipv6.org> Date: Fri, 01 Apr 2005 23:19:45 +0900 From: Masahide NAKAMURA User-Agent: Debian Thunderbird 1.0 (X11/20050116) X-Accept-Language: en-us, en MIME-Version: 1.0 To: hadi@cyberus.ca, Herbert Xu Cc: Patrick McHardy , "David S. Miller" , netdev Subject: Re: PATCH: IPSEC xfrm events References: <1112319441.1089.83.camel@jzny.localdomain> <20050401042106.GA27762@gondor.apana.org.au> <1112353398.1096.116.camel@jzny.localdomain> <20050401114258.GA2932@gondor.apana.org.au> <1112358278.1096.160.camel@jzny.localdomain> <20050401123554.GA3468@gondor.apana.org.au> <1112360379.1096.193.camel@jzny.localdomain> In-Reply-To: <1112360379.1096.193.camel@jzny.localdomain> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1198 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nakam@linux-ipv6.org Precedence: bulk X-list: netdev Hello Jamal and Herbert, jamal wrote: > Let me review first. If it is valuable (we may have to leave expire > alone). If i can get it done within next day or two fine - else if i get > busyed out elsewhere i will hand it to you. 
Actually if you have plenty > cycles and are very enthusiastic about this i can hand it to you right > now ;-> Masahide and myself have some momentum going right now but i > dont think this will be that disruptive. > > >>You're right that the RFC isn't very clear. >> >>Let's forget about the RFC and simply consider the usefulness of this. >>I contend that it is useful to see a FLUSH notification even when >>it flushed nothing. >> >>The reason is that this is an indication to all listeners that the >>database is completely empty. >> > > > Ok, let me hear from Masahide-san: If he still holds the same opinion as > you then i will make the change. I think FLUSH should be sent in such case. Because flushing empty SADB/SPD is not an error (at current code), it is reasonable to broadcast it. Regards, -- Masahide NAKAMURA From dada1@cosmosbay.com Fri Apr 1 06:39:58 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 06:40:05 -0800 (PST) Received: from gw1.cosmosbay.com (gw1.cosmosbay.com [62.23.185.226]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31EdvAi005615 for ; Fri, 1 Apr 2005 06:39:58 -0800 Received: from [172.16.0.131] (edumazet-port [172.16.0.131]) by gw1.cosmosbay.com (8.13.3/8.13.3) with ESMTP id j31Edm5v023180; Fri, 1 Apr 2005 16:39:49 +0200 Message-ID: <424D5D34.4030800@cosmosbay.com> Date: Fri, 01 Apr 2005 16:39:48 +0200 From: Eric Dumazet User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206) X-Accept-Language: fr, en MIME-Version: 1.0 To: "David S. 
Miller" CC: netdev@oss.sgi.com Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() References: <42370997.6010302@cosmosbay.com> <20050315103253.590c8bfc.davem@davemloft.net> <42380EC6.60100@cosmosbay.com> <20050316140915.0f6b9528.davem@davemloft.net> <4239E00C.4080309@cosmosbay.com> <20050331221352.13695124.davem@davemloft.net> In-Reply-To: <20050331221352.13695124.davem@davemloft.net> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-1.6 (gw1.cosmosbay.com [172.16.8.80]); Fri, 01 Apr 2005 16:39:49 +0200 (CEST) X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1199 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: dada1@cosmosbay.com Precedence: bulk X-list: netdev David S. Miller a écrit : > On Thu, 17 Mar 2005 20:52:44 +0100 > Eric Dumazet wrote: > > >> - Move the spinlocks out of tr_hash_table[] to a fixed size table : Saves a lot of memory (particulary on UP) > > > If spinlock_t is a zero sized structure on UP, how can this save memory > on UP? :-) Because I deleted the __attribute__((__aligned__(8))) constraint on struct rt_hash_bucket. So sizeof(struct rt_hash_bucket) is now 4 instead of 8 on 32 bits architectures. May I remind you some people still use 32 bits CPU ? :-) By the way I have an updated patch... surviving very serious loads. > > Anyways, I think perhaps you should dynamically allocate this lock table. Maybe I should make a static sizing, (replace the 256 constant by something based on MAX_CPUS) ? > Otherwise it looks fine. 
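[Editor's sketch] The idea of moving the per-bucket locks into a fixed-size
table shared by all hash buckets can be modelled as below. This is a sketch
under assumptions and not the patch under discussion: the size 256 is the
constant mentioned in this thread, but the mapping from bucket index to lock
(a simple mask) and the names lock_table and lock_for_bucket are
hypothetical, and Python locks only stand in for spinlocks.

```python
import threading

# Fixed number of locks, independent of how many hash buckets exist.
# The mask below assumes this is a power of two.
LOCK_TABLE_SIZE = 256

lock_table = [threading.Lock() for _ in range(LOCK_TABLE_SIZE)]


def lock_for_bucket(bucket_index):
    """Return the shared lock guarding the given hash bucket.

    Many buckets map to the same lock, so each bucket no longer
    needs to carry a lock of its own.
    """
    return lock_table[bucket_index & (LOCK_TABLE_SIZE - 1)]


# Buckets 5 and 5 + 256 share a lock; buckets 5 and 6 do not.
same = lock_for_bucket(5) is lock_for_bucket(5 + LOCK_TABLE_SIZE)
different = lock_for_bucket(5) is not lock_for_bucket(6)
```

The trade-off in such a scheme is between memory (fewer locks) and
contention (unrelated buckets can now wait on the same lock), which is why
sizing the table from the CPU count comes up in the reply.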
> > From Robert.Olsson@data.slu.se Fri Apr 1 07:53:07 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 07:53:12 -0800 (PST) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31Fr6ax007887 for ; Fri, 1 Apr 2005 07:53:07 -0800 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j31Fr21P015728; Fri, 1 Apr 2005 17:53:02 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id 43674EE2B1; Fri, 1 Apr 2005 17:53:02 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16973.28254.203492.400896@robur.slu.se> Date: Fri, 1 Apr 2005 17:53:02 +0200 To: Eric Dumazet Cc: "David S. Miller" , netdev@oss.sgi.com Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() In-Reply-To: <424D5D34.4030800@cosmosbay.com> References: <42370997.6010302@cosmosbay.com> <20050315103253.590c8bfc.davem@davemloft.net> <42380EC6.60100@cosmosbay.com> <20050316140915.0f6b9528.davem@davemloft.net> <4239E00C.4080309@cosmosbay.com> <20050331221352.13695124.davem@davemloft.net> <424D5D34.4030800@cosmosbay.com> X-Mailer: VM 7.18 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1200 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Hello! Eric Dumazet writes: > By the way I have an updated patch... surviving very serious loads. Did you check for performance changes too? From what I understand we can add new lookup and cache miss in the fast packet path. > > Anyways, I think perhaps you should dynamically allocate this lock table. > > Maybe I should make a static sizing, (replace the 256 constant by something based on MAX_CPUS) ? 
IMO we should be careful with adding new complexity the route hash. Also was this dynamic behavior gc_interval needed to fix the overflow? gc_interval is only sort of last resort timer. --ro From greearb@candelatech.com Fri Apr 1 08:29:22 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 08:29:27 -0800 (PST) Received: from www.lanforge.com (ns1.lanforge.com [66.165.47.210]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31GTLSX014658 for ; Fri, 1 Apr 2005 08:29:22 -0800 Received: from [4.33.45.22] (evrtwa1-ar2-4-33-045-022.evrtwa1.dsl-verizon.net [4.33.45.22]) (authenticated bits=0) by www.lanforge.com (8.12.8/8.12.8) with ESMTP id j31GtHLH009322; Fri, 1 Apr 2005 08:55:17 -0800 Message-ID: <424D76DF.5070002@candelatech.com> Date: Fri, 01 Apr 2005 08:29:19 -0800 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.3) Gecko/20041020 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Pekka Savola CC: "'netdev@oss.sgi.com'" Subject: Re: RFC: Redirect-Device References: <424C6089.1080507@candelatech.com> <424CDBA9.80703@candelatech.com> In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1202 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Pekka Savola wrote: > On Thu, 31 Mar 2005, Ben Greear wrote: > >>> Is there something in your problem statement I'm missing? >> >> >> That would be similar to what I'm doing, but I'm not really trying >> to tunnel anything. 
I am trying to duplicate the behaviour of two >> ethernet interfaces connected by an external cross-over cable, and I'm >> trying to duplicate it at the network-device interface level so that >> common tools (and my own tools) can treat these virtual interfaces >> just like ethernet interfaces. > > > Oh ok, what you seem to want is some kind of "Ethernet loopback++", but > the "looped" packets should come back from a virtual interface instead > of the same interface? Yes. In practice, I use a pair of virtual interfaces, so I send on one virtual and receive on the other. I use separate software to bridge, or the normal linux stacks to route, the packets to other interfaces, including real interfaces. > Btw, does the kernel support traditional loopback, so that at the last > stage, just before sending a packet on the wire, it would be pushed back. Not that I'm aware of. -- Ben Greear Candela Technologies Inc http://www.candelatech.com From dada1@cosmosbay.com Fri Apr 1 08:34:29 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 08:34:33 -0800 (PST) Received: from gw1.cosmosbay.com (gw1.cosmosbay.com [62.23.185.226]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31GYSnC015284 for ; Fri, 1 Apr 2005 08:34:29 -0800 Received: from [172.16.0.131] (edumazet-port [172.16.0.131]) by gw1.cosmosbay.com (8.13.3/8.13.3) with ESMTP id j31GYIaH026085; Fri, 1 Apr 2005 18:34:19 +0200 Message-ID: <424D780A.9000101@cosmosbay.com> Date: Fri, 01 Apr 2005 18:34:18 +0200 From: Eric Dumazet User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206) X-Accept-Language: fr, en MIME-Version: 1.0 To: Robert Olsson CC: "David S. 
Miller" , netdev@oss.sgi.com Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() References: <42370997.6010302@cosmosbay.com> <20050315103253.590c8bfc.davem@davemloft.net> <42380EC6.60100@cosmosbay.com> <20050316140915.0f6b9528.davem@davemloft.net> <4239E00C.4080309@cosmosbay.com> <20050331221352.13695124.davem@davemloft.net> <424D5D34.4030800@cosmosbay.com> <16973.28254.203492.400896@robur.slu.se> In-Reply-To: <16973.28254.203492.400896@robur.slu.se> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-1.6 (gw1.cosmosbay.com [172.16.8.80]); Fri, 01 Apr 2005 18:34:19 +0200 (CEST) X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1203 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: dada1@cosmosbay.com Precedence: bulk X-list: netdev Robert Olsson a écrit : > Hello! > > Did you check for performance changes too? From what I understand > we can add new lookup and cache miss in the fast packet path. Performance is better because in case of stress (lot of incoming packets per second), the 1024 bytes of the locks are all in cache. As the size of the hash is divided by a 2 factor, rt_check_expire() and/or rt_garbage_collect() have to touch less cache lines. According to oprofile, an unpatched kernel was spending more than 15% of time in route.c routines, now I see ip_route_input() at 1.88% > > > > Anyways, I think perhaps you should dynamically allocate this lock table. > > > > Maybe I should make a static sizing, (replace the 256 constant by something based on MAX_CPUS) ? > > IMO we should be careful with adding new complexity the route hash. > Also was this dynamic behavior gc_interval needed to fix the overflow? In my case yes, because I have huge route cache. > gc_interval is only sort of last resort timer. 
Actually not: gc_interval controls the rt_check_expire() to clean the hash table after use.
All old enough entries can be deleted smoothly, on behalf of a timer tick (so network interrupts can still occur)

I found it was better to adjust gc_interval to 1 (to let it fire every second and examine 1/300 table slots, or more if the dynamic behavior
triggers), and adjust params so that rt_garbage_collect() doesn't run at all: rt_garbage_collect() can take forever to complete, blocking network
traffic.

Eric Dumazet

From ak@muc.de Fri Apr 1 08:40:10 2005
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 08:40:16 -0800 (PST)
Received: from one.firstfloor.org (one.firstfloor.org [213.235.205.2])
	by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31Ge9g6015890
	for ; Fri, 1 Apr 2005 08:40:10 -0800
Received: by one.firstfloor.org (Postfix, from userid 502)
	id 97B16D033E; Fri, 1 Apr 2005 18:40:07 +0200 (CEST)
To: Rick Jones
Cc: netdev@oss.sgi.com
Subject: Re: [RFC] netif_rx: receive path optimization
References: <20050330132815.605c17d0@dxpl.pdx.osdl.net> <20050331120410.7effa94d@dxpl.pdx.osdl.net> <1112303431.1073.67.camel@jzny.localdomain> <424C6A98.1070509@hp.com>
From: Andi Kleen
Date: Fri, 01 Apr 2005 18:40:07 +0200
In-Reply-To: <424C6A98.1070509@hp.com> (Rick Jones's message of "Thu, 31 Mar 2005 13:24:40 -0800")
Message-ID:
User-Agent: Gnus/5.110002 (No Gnus v0.2) Emacs/21.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com
X-Virus-Status: Clean
X-archive-position: 1204
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: ak@muc.de
Precedence: bulk
X-list: netdev

Rick Jones writes:

> At the risk of again chewing on my toes (yum), if multiple CPUs are
> pulling packets from the per-device queue there will be packet
> reordering.
HP-UX 10.0 did just that and it was quite nasty even at > low CPU counts (<=4). It was changed by HP-UX 10.20 (ca 1995) to > per-CPU queues with queue selection computed from packet headers (hash > the IP and TCP/UDP header to pick a CPU) It was called IPS for Inbound > Packet Scheduling. 11.0 (ca 1998) later changed that to "find where > the connection last ran and queue to that CPU" That was called TOPS - > Thread Optimized Packet Scheduling. We went over this a lot several years ago when Linux got multi threaded RX with softnet in 2.1. You might want to go over the archives. Some things that came out of it was a sender side TCP optimization to tolerate reordering without slowing down (works great with other Linux peers) and NAPI style polling mode (which was mostly designed for routing and still seems to have regressions for the client/server case :/) Something like TOPS was discussed, but afaik nobody ever implemented it. Of course benchmark guys do it manually by setting interrupt and scheduler affinity. -Andi From greearb@candelatech.com Fri Apr 1 08:58:57 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 08:59:02 -0800 (PST) Received: from www.lanforge.com (ns1.lanforge.com [66.165.47.210]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31GwuW4016989 for ; Fri, 1 Apr 2005 08:58:57 -0800 Received: from [4.33.45.22] (evrtwa1-ar2-4-33-045-022.evrtwa1.dsl-verizon.net [4.33.45.22]) (authenticated bits=0) by www.lanforge.com (8.12.8/8.12.8) with ESMTP id j31HOoLH009680; Fri, 1 Apr 2005 09:24:51 -0800 Message-ID: <424D7DCC.5030202@candelatech.com> Date: Fri, 01 Apr 2005 08:58:52 -0800 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.3) Gecko/20041020 X-Accept-Language: en-us, en MIME-Version: 1.0 To: bert hubert CC: hadi@cyberus.ca, "David S. 
Miller" , netdev Subject: Re: RFC: Redirect-Device References: <424C6089.1080507@candelatech.com> <1112303627.1073.71.camel@jzny.localdomain> <424C6B10.6030200@candelatech.com> <1112306031.1073.109.camel@jzny.localdomain> <424C7813.4000101@candelatech.com> <20050331143531.30f4eb8f.davem@davemloft.net> <424C7F96.4070002@candelatech.com> <1112311618.1090.20.camel@jzny.localdomain> <424C8E2C.70302@candelatech.com> <20050401090116.GA21361@outpost.ds9a.nl> In-Reply-To: <20050401090116.GA21361@outpost.ds9a.nl> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1205 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev bert hubert wrote: > On Thu, Mar 31, 2005 at 03:56:28PM -0800, Ben Greear wrote: > > >>>I think you are more comfortable with using netdevices and ioctls and >>>/proc. >> >>Definately. Ever tried to sniff a socket with ethereal? :) > > > On loopback, all the time. I'm probably dense but I don't understand what > problem you've solved with this interface. Could you elaborate a bit? It allows me to place a software bridge that can intercept all packets from user-space via raw packet sockets, and kernel space via registering an 'all' protocol on the device. Please note that to bridge in this manner I have to remove the IP protocol (set IP to 0.0.0.0), otherwise the IP stack can interfere with the bridging behaviour. By using a virtual pair of interfaces that are looped back, I can add an IP to the second virtual network interface that does not interfere with the two bridged interfaces (one physical, one redirect, both with 0.0.0.0 IP addresses). 
If there were an API to register handlers dynamically that act like the netpoll hook (ie, with ability to consume frames), then I would not have to remove the IP from the physical interface and I probably would not have had to create these redirect devices. But, when I was suggesting such a hook in the past, it was shot down because it could allow someone to write their own TCP stack, and the network guys did not want to allow this possibility. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From Robert.Olsson@data.slu.se Fri Apr 1 09:26:42 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 09:26:46 -0800 (PST) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31HQfZm018140 for ; Fri, 1 Apr 2005 09:26:42 -0800 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j31HQWQG025702; Fri, 1 Apr 2005 19:26:32 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id 9CDC6EE2B1; Fri, 1 Apr 2005 19:26:32 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16973.33864.613333.389857@robur.slu.se> Date: Fri, 1 Apr 2005 19:26:32 +0200 To: Eric Dumazet Cc: Robert Olsson , "David S. 
Miller" , netdev@oss.sgi.com Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() In-Reply-To: <424D780A.9000101@cosmosbay.com> References: <42370997.6010302@cosmosbay.com> <20050315103253.590c8bfc.davem@davemloft.net> <42380EC6.60100@cosmosbay.com> <20050316140915.0f6b9528.davem@davemloft.net> <4239E00C.4080309@cosmosbay.com> <20050331221352.13695124.davem@davemloft.net> <424D5D34.4030800@cosmosbay.com> <16973.28254.203492.400896@robur.slu.se> <424D780A.9000101@cosmosbay.com> X-Mailer: VM 7.18 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1206 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Eric Dumazet writes: > According to oprofile, an unpatched kernel was spending more than 15% of time in route.c routines, now I see ip_route_input() at 1.88% Would like to see absolute numbers for UP/SMP single flow and DoS to be confident. > I found it was better to adjust gc_interval to 1 (to let it fire every second and examine 1/300 table slots, or more if the dynamic behavior > triggers), and ajust params so that rt_garbage_collect() doesnt run at all : rt_garbage_collect() can take forever to complete, blocking > network trafic. I don't think you can depend on timer for GC solely. Timer tick is eternity for todays packet rates. You can distribute the GC load by allowing it to run more frequent this in combination with huge cache seems to be a very interesting approach given that you have memory. 
--ro From nakam@linux-ipv6.org Fri Apr 1 09:28:16 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 09:28:20 -0800 (PST) Received: from mail406.noc.n-bone.net (mail4.noc.n-bone.net [138.243.50.144]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31HSFxd018552 for ; Fri, 1 Apr 2005 09:28:16 -0800 Received: from [192.168.2.195] (polaris.linux-ipv6.org [203.178.140.10]) by mail406.noc.n-bone.net (NBONE-MTA) with ESMTP id BDA70AE5; Sat, 2 Apr 2005 02:28:09 +0900 (JST) Message-ID: <424D84A7.6060707@linux-ipv6.org> Date: Sat, 02 Apr 2005 02:28:07 +0900 From: Masahide NAKAMURA User-Agent: Debian Thunderbird 1.0 (X11/20050116) X-Accept-Language: en-us, en MIME-Version: 1.0 To: hadi@cyberus.ca, Herbert Xu Cc: Patrick McHardy , "David S. Miller" , netdev Subject: Re: PATCH: IPSEC xfrm events References: <1112319441.1089.83.camel@jzny.localdomain> In-Reply-To: <1112319441.1089.83.camel@jzny.localdomain> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1207 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nakam@linux-ipv6.org Precedence: bulk X-list: netdev Jamal and Herbert, jamal wrote: > Herbert et al, > > Ok, heres the final patch with all the changes discussed. > > include/linux/xfrm.h | 2 > include/net/xfrm.h | 29 ++++++- > net/key/af_key.c | 24 +++++- > net/xfrm/xfrm_policy.c | 25 ++++-- > net/xfrm/xfrm_state.c | 84 +++++++++++++++++++-- > net/xfrm/xfrm_user.c | 188 > ++++++++++++++++++++++++++++++++++++++++++++++++- > 6 files changed, 323 insertions(+), 29 deletions(-) > > I have tested this with both setkey and iproute2 (about 10 scenarios or > so). Masahide-san is doing a lot more thorough testing with key servers > as well. He has not tested this patch yet (time difference) but it is > based on the last one he tested. 
Short report: I've tested on this patched kernel and it works. - add/del/flush for SA/SP and allocspi/acquire/upd for SA through netlink socket - racoon runs fine (pfkey works for normal operation) both without and with opening netlink socket to listen Since we have discussion which is still going on about the patch, the code will be change and I'll need to test again anyway. Thanks, -- Masahide NAKAMURA From roland@topspin.com Fri Apr 1 09:53:53 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 09:54:00 -0800 (PST) Received: from exch-1.topspincom.com (webmail.topspin.com [12.162.17.3]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31HrqZc019816 for ; Fri, 1 Apr 2005 09:53:53 -0800 Received: from localhost.localdomain ([10.3.1.93]) by exch-1.topspincom.com with Microsoft SMTPSVC(5.0.2195.5329); Fri, 1 Apr 2005 09:45:33 -0800 Received: by localhost.localdomain (Postfix, from userid 1113) id 7EA6C4FDF2; Fri, 1 Apr 2005 09:45:33 -0800 (PST) To: akpm@osdl.org Cc: linux-kernel@vger.kernel.org, openib-general@openib.org, netdev@oss.sgi.com, davem@davemloft.net Subject: [PATCH][4/3] IPoIB: document conversion to debugfs X-Message-Flag: Warning: May contain useful information References: <20053311936.XaQmN4N9new7dTCP@topspin.com> From: Roland Dreier Date: Fri, 01 Apr 2005 09:45:33 -0800 In-Reply-To: <20053311936.XaQmN4N9new7dTCP@topspin.com> (Roland Dreier's message of "Thu, 31 Mar 2005 19:36:12 -0800") Message-ID: <52r7hujsqq.fsf@topspin.com> User-Agent: Gnus/5.1006 (Gnus v5.10.6) XEmacs/21.4 (Jumbo Shrimp, linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-OriginalArrivalTime: 01 Apr 2005 17:45:33.0676 (UTC) FILETIME=[9AC0C2C0:01C536E2] X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1208 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: roland@topspin.com Precedence: bulk X-list: netdev Update 
IPoIB documentation now that multicast debugging files have moved from
ipoib_debugfs to debugfs.

Signed-off-by: Roland Dreier

--- linux-export.orig/Documentation/infiniband/ipoib.txt 2005-03-31 19:07:01.000000000 -0800
+++ linux-export/Documentation/infiniband/ipoib.txt 2005-04-01 09:43:27.122520190 -0800
@@ -32,14 +32,13 @@
   mcast_debug_level to 1.  These parameters can be controlled at
   runtime through files in /sys/module/ib_ipoib/.
 
-  CONFIG_INFINIBAND_IPOIB_DEBUG also enables the "ipoib_debugfs"
+  CONFIG_INFINIBAND_IPOIB_DEBUG also enables files in the debugfs
   virtual filesystem.  By mounting this filesystem, for example with
 
-    mkdir -p /ipoib_debugfs
-    mount -t ipoib_debugfs none /ipoib_debufs
+    mount -t debugfs none /sys/kernel/debug
 
   it is possible to get statistics about multicast groups from the
-  files /ipoib_debugfs/ib0_mcg and so on.
+  files /sys/kernel/debug/ipoib/ib0_mcg and so on.
 
   The performance impact of this option is negligible, so it is safe
   to enable this option with debug_level set to 0 for normal

From rick.jones2@hp.com Fri Apr 1 10:55:59 2005
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 10:56:03 -0800 (PST)
Received: from palrel11.hp.com (palrel11.hp.com [156.153.255.246]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31Itxgb022131 for ; Fri, 1 Apr 2005 10:55:59 -0800
Received: from tardy.cup.hp.com (tardy.cup.hp.com [15.244.44.58]) by palrel11.hp.com (Postfix) with ESMTP id 29A4E1F36E7 for ; Fri, 1 Apr 2005 10:22:52 -0800 (PST)
Received: from hp.com (localhost [127.0.0.1]) by tardy.cup.hp.com (8.9.3 (PHNE_28810)/8.9.3 SMKit7.02) with ESMTP id KAA01022 for ; Fri, 1 Apr 2005 10:22:51 -0800 (PST)
Message-ID: <424D917B.2060108@hp.com>
Date: Fri, 01 Apr 2005 10:22:51 -0800
From: Rick Jones
User-Agent: Mozilla/5.0 (X11; U; HP-UX 9000/785; en-US; rv:1.6) Gecko/20040304
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: netdev
Subject: Re: [RFC] netif_rx: receive path
optimization References: <20050330132815.605c17d0@dxpl.pdx.osdl.net> <20050331120410.7effa94d@dxpl.pdx.osdl.net> <1112303431.1073.67.camel@jzny.localdomain> <424C6A98.1070509@hp.com> <1112305084.1073.94.camel@jzny.localdomain> <424C7CDC.8050801@hp.com> <1112312206.1096.25.camel@jzny.localdomain> <424C90DA.7030600@hp.com> <1112318229.1090.63.camel@jzny.localdomain> In-Reply-To: <1112318229.1090.63.camel@jzny.localdomain> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1209 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: rick.jones2@hp.com Precedence: bulk X-list: netdev >>The main idea behind TOPS and prior to that IPS was to spread-out >>the processing of packets across as many CPUs as we could, as "correctly" as we >>could. > > > Very very hard to do. Why do you say that? "Correct" can be defined as either the same CPU for each packet in a given flow (IPS) or the same CPU as last accessed the endpoint (TOPS). > Isnt MSI supposed to give you ability such that a > NIC can pick a CPU to interupt? That would help in a small way That gives the NIC the knowledge of how to direct to a CPU, but as you know does not tell it how to decide where. Since I doubt that the NIC wants to reach-out and touch connection state in the host (nor I suppose do we want it to either) the best a NIC with MSI could do would be IPS >>TOPS lets the process (I suppose the scheduler really) decide where some of the >>processing for the packet will happen - the part after the handoff. >> > > I think this last part should be easy to do - but perhaps the expense of > landing on the wrong CPU may override any benefits perceived. 

Unless one has a scheduler that likes to migrate processes, the chances of landing on the wrong CPU are minimal and short-lived, and overall, the chances of being right are greater than if not doing anything and sticking with the interrupt CPU. (Handwaving based on experience-driven intuition and a bit of math as one increases the CPU count.)

This is all on the premise that one is running with numNIC << numCPU. With numNIC == numCPU one does things as seen in certain networking-intensive benchmarks :)

rick jones

From shemminger@osdl.org Fri Apr 1 12:07:36 2005
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 12:07:41 -0800 (PST)
Received: from smtp.osdl.org (fire.osdl.org [65.172.181.4]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31K7aJG024341 for ; Fri, 1 Apr 2005 12:07:36 -0800
Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp.osdl.org (8.12.8/8.12.8) with ESMTP id j31K7Rs4028918 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Fri, 1 Apr 2005 12:07:27 -0800
Received: from dxpl.pdx.osdl.net (dxpl.pdx.osdl.net [172.20.1.103]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with ESMTP id j31K7RLd030565; Fri, 1 Apr 2005 12:07:27 -0800
Date: Fri, 1 Apr 2005 12:07:27 -0800
From: Stephen Hemminger
To: lartc@mailman.ds9a.nl, linux-kernel@vger.kernel.org
Cc: linux-net@vger.kernel.org, netdev@oss.sgi.com
Subject: [ANNOUNCE] iproute2 2.6.11-050330
Message-ID: <20050401120727.62700e8c@dxpl.pdx.osdl.net>
Organization: Open Source Development Lab
X-Mailer: Sylpheed-Claws 1.0.4 (GTK+ 1.2.10; x86_64-unknown-linux-gnu)
X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-MIMEDefang-Filter: osdl$Revision: 1.106 $
X-Scanned-By: MIMEDefang 2.36
X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com
X-Virus-Status: Clean
X-archive-position: 1210
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: shemminger@osdl.org
Precedence: bulk
X-list: netdev

An updated version of the iproute2 utilities is available at:
  http://developer.osdl.org/dev/iproute2/download/iproute2-2.6.11-050330.tar.gz

It supports the latest features from 2.6, but is backwards compatible with 2.4.

This update includes several bugfixes and build cleanups since the previous
version (2.6.11-050314):

[Jamal Hadi Salim]
	* Proper version of iptables headers (from 1.3.1)
	* Set revision file in m_ipt
	* Fix action_util naming in mirred
	* don't call ll_init_map in mirred

[Thomas Graf]
	* Warn about wildcard deletions and provide IFA_ADDRESS upon
	  deletions to enforce prefix length validation for IPv4.
	* Fix netlink message alignment when the last routing attribute
	  added has a data length not aligned to RTA_ALIGNTO.

[Masahide NAKAMURA]
	* ipv6 xfrm allocspi and monitor support.

[Stephen Hemminger]
	* include/linux/netfilter_ipv4/ip_tables.h: don't include compiler.h
	  because it isn't needed and not on all systems
	* Update rtnetlink.h and pkt_cls.h to be stripped versions of
	  headers from 2.6.12-rc1
	* switch to stack for netem tables
	* add -force option to batch mode
	* handle midline comments in batch mode
	* sum per cpu fields in lnstat correctly

From sds@tycho.nsa.gov Fri Apr 1 12:15:22 2005
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 12:15:29 -0800 (PST)
Received: from jazzhorn.ncsc.mil (mummy.ncsc.mil [144.51.88.129]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31KFLo9025229 for ; Fri, 1 Apr 2005 12:15:22 -0800
Received: from tycho.ncsc.mil (jazzhorn.ncsc.mil [144.51.5.9]) by jazzhorn.ncsc.mil (8.12.10/8.12.10) with ESMTP id j31KBvhV026499; Fri, 1 Apr 2005 20:11:57 GMT
Received: from moss-spartans.epoch.ncsc.mil (moss-spartans [144.51.25.121]) by tycho.ncsc.mil (8.12.8/8.12.8) with ESMTP id j31KG5Do015003; Fri, 1 Apr 2005 15:16:05 -0500 (EST)
Subject: [PATCH] Fix SELinux for removal of i_sock From: Stephen Smalley To: "David S. Miller" , James Morris , lkml , netdev@oss.sgi.com, matthew@wil.cx Content-Type: text/plain Organization: National Security Agency Date: Fri, 01 Apr 2005 15:06:37 -0500 Message-Id: <1112385997.14481.192.camel@moss-spartans.epoch.ncsc.mil> Mime-Version: 1.0 X-Mailer: Evolution 2.0.2 (2.0.2-14) Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1211 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: sds@tycho.nsa.gov Precedence: bulk X-list: netdev Hi, This patch against -bk eliminates the use of i_sock by SELinux as it appears to have been removed recently, breaking the build of SELinux in -bk. Simply replacing the i_sock test with an S_ISSOCK test would be unsafe in the SELinux code, as the latter will also return true for the inodes of socket files in the filesystem, not just the actual socket objects IIUC. Hence this patch reworks the SELinux code to avoid the need to apply such a test in the first place, part of which was obsoleted anyway by earlier changes to SELinux. Please apply. 
Signed-off-by: Stephen Smalley Signed-off-by: James Morris security/selinux/hooks.c | 21 +++------------------ 1 files changed, 3 insertions(+), 18 deletions(-) ===== security/selinux/hooks.c 1.93 vs edited ===== --- 1.93/security/selinux/hooks.c 2005-03-28 17:21:19 -05:00 +++ edited/security/selinux/hooks.c 2005-04-01 15:01:58 -05:00 @@ -877,18 +877,8 @@ static int inode_doinit_with_dentry(stru isec->initialized = 1; out: - if (inode->i_sock) { - struct socket *sock = SOCKET_I(inode); - if (sock->sk) { - isec->sclass = socket_type_to_security_class(sock->sk->sk_family, - sock->sk->sk_type, - sock->sk->sk_protocol); - } else { - isec->sclass = SECCLASS_SOCKET; - } - } else { + if (isec->sclass == SECCLASS_FILE) isec->sclass = inode_mode_to_security_class(inode->i_mode); - } if (hold_sem) up(&isec->sem); @@ -2979,18 +2969,15 @@ out: static void selinux_socket_post_create(struct socket *sock, int family, int type, int protocol, int kern) { - int err; struct inode_security_struct *isec; struct task_security_struct *tsec; - err = inode_doinit(SOCK_INODE(sock)); - if (err < 0) - return; isec = SOCK_INODE(sock)->i_security; tsec = current->security; isec->sclass = socket_type_to_security_class(family, type, protocol); isec->sid = kern ? 
SECINITSID_KERNEL : tsec->sid; + isec->initialized = 1; return; } @@ -3158,14 +3145,12 @@ static int selinux_socket_accept(struct if (err) return err; - err = inode_doinit(SOCK_INODE(newsock)); - if (err < 0) - return err; newisec = SOCK_INODE(newsock)->i_security; isec = SOCK_INODE(sock)->i_security; newisec->sclass = isec->sclass; newisec->sid = isec->sid; + newisec->initialized = 1; return 0; } -- Stephen Smalley National Security Agency From davem@davemloft.net Fri Apr 1 12:28:55 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 12:29:01 -0800 (PST) Received: from cheetah.davemloft.net (mail@dsl027-180-174.sfo1.dsl.speakeasy.net [216.27.180.174]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31KSt7x029634 for ; Fri, 1 Apr 2005 12:28:55 -0800 Received: from localhost ([127.0.0.1] helo=cheetah.davemloft.net ident=davem) by cheetah.davemloft.net with smtp (Exim 3.36 #1 (Debian)) id 1DHSkM-0002UR-00; Fri, 01 Apr 2005 12:28:02 -0800 Date: Fri, 1 Apr 2005 12:28:02 -0800 From: "David S. 
Miller" To: Eric Dumazet Cc: netdev@oss.sgi.com Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() Message-Id: <20050401122802.7c71afbc.davem@davemloft.net> In-Reply-To: <424D5D34.4030800@cosmosbay.com> References: <42370997.6010302@cosmosbay.com> <20050315103253.590c8bfc.davem@davemloft.net> <42380EC6.60100@cosmosbay.com> <20050316140915.0f6b9528.davem@davemloft.net> <4239E00C.4080309@cosmosbay.com> <20050331221352.13695124.davem@davemloft.net> <424D5D34.4030800@cosmosbay.com> X-Mailer: Sylpheed version 1.0.4 (GTK+ 1.2.10; sparc-unknown-linux-gnu) X-Face: "_;p5u5aPsO,_Vsx"^v-pEq09'CU4&Dc1$fQExov$62l60cgCc%FnIwD=.UF^a>?5'9Kn[;433QFVV9M..2eN.@4ZWPGbdi<=?[:T>y?SD(R*-3It"Vj:)"dP Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1212 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@davemloft.net Precedence: bulk X-list: netdev On Fri, 01 Apr 2005 16:39:48 +0200 Eric Dumazet wrote: > > If spinlock_t is a zero sized structure on UP, how can this save memory > > on UP? :-) > > Because I deleted the __attribute__((__aligned__(8))) constraint on struct rt_hash_bucket. Right. > > Anyways, I think perhaps you should dynamically allocate this lock table. > > Maybe I should make a static sizing, (replace the 256 constant by something based on MAX_CPUS) ? Even for NR_CPUS, I think the table should be dynamically allocated. It is a goal to eliminate all of these huge arrays in the static kernel image, which has grown incredibly too much in recent times. 
I work often to eliminate such things, let's not add new ones :-) From davem@davemloft.net Fri Apr 1 12:36:13 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 12:36:18 -0800 (PST) Received: from cheetah.davemloft.net (mail@dsl027-180-174.sfo1.dsl.speakeasy.net [216.27.180.174]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31KaD6Z030336 for ; Fri, 1 Apr 2005 12:36:13 -0800 Received: from localhost ([127.0.0.1] helo=cheetah.davemloft.net ident=davem) by cheetah.davemloft.net with smtp (Exim 3.36 #1 (Debian)) id 1DHSrQ-0002Xi-00; Fri, 01 Apr 2005 12:35:20 -0800 Date: Fri, 1 Apr 2005 12:35:20 -0800 From: "David S. Miller" To: Stephen Smalley Cc: jmorris@redhat.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, matthew@wil.cx Subject: Re: [PATCH] Fix SELinux for removal of i_sock Message-Id: <20050401123520.7532528b.davem@davemloft.net> In-Reply-To: <1112385997.14481.192.camel@moss-spartans.epoch.ncsc.mil> References: <1112385997.14481.192.camel@moss-spartans.epoch.ncsc.mil> X-Mailer: Sylpheed version 1.0.4 (GTK+ 1.2.10; sparc-unknown-linux-gnu) X-Face: "_;p5u5aPsO,_Vsx"^v-pEq09'CU4&Dc1$fQExov$62l60cgCc%FnIwD=.UF^a>?5'9Kn[;433QFVV9M..2eN.@4ZWPGbdi<=?[:T>y?SD(R*-3It"Vj:)"dP Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1213 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@davemloft.net Precedence: bulk X-list: netdev On Fri, 01 Apr 2005 15:06:37 -0500 Stephen Smalley wrote: > This patch against -bk eliminates the use of i_sock by SELinux as it > appears to have been removed recently, breaking the build of SELinux in > -bk. 
Simply replacing the i_sock test with an S_ISSOCK test would be > unsafe in the SELinux code, as the latter will also return true for the > inodes of socket files in the filesystem, not just the actual socket > objects IIUC. Hence this patch reworks the SELinux code to avoid the > need to apply such a test in the first place, part of which was > obsoleted anyway by earlier changes to SELinux. Please apply. > > Signed-off-by: Stephen Smalley > Signed-off-by: James Morris Applied, thanks Stephen. From dada1@cosmosbay.com Fri Apr 1 13:05:52 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 13:05:58 -0800 (PST) Received: from gw1.cosmosbay.com (gw1.cosmosbay.com [62.23.185.226]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31L5plG031537 for ; Fri, 1 Apr 2005 13:05:52 -0800 Received: from [192.168.0.3] ([84.5.129.64]) by gw1.cosmosbay.com (8.13.3/8.13.3) with ESMTP id j31L5csf030409; Fri, 1 Apr 2005 23:05:43 +0200 Message-ID: <424DB7A1.8090803@cosmosbay.com> Date: Fri, 01 Apr 2005 23:05:37 +0200 From: Eric Dumazet User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206) X-Accept-Language: fr, en MIME-Version: 1.0 To: "David S. 
Miller"
CC: netdev@oss.sgi.com
Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire()
References: <42370997.6010302@cosmosbay.com> <20050315103253.590c8bfc.davem@davemloft.net> <42380EC6.60100@cosmosbay.com> <20050316140915.0f6b9528.davem@davemloft.net> <4239E00C.4080309@cosmosbay.com> <20050331221352.13695124.davem@davemloft.net> <424D5D34.4030800@cosmosbay.com> <20050401122802.7c71afbc.davem@davemloft.net>
In-Reply-To: <20050401122802.7c71afbc.davem@davemloft.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-1.6 (gw1.cosmosbay.com [62.23.185.226]); Fri, 01 Apr 2005 23:05:44 +0200 (CEST)
X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com
X-Virus-Status: Clean
X-archive-position: 1214
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: dada1@cosmosbay.com
Precedence: bulk
X-list: netdev

David S. Miller wrote:
> On Fri, 01 Apr 2005 16:39:48 +0200
> Eric Dumazet wrote:
>
>>Maybe I should make a static sizing, (replace the 256 constant by something based on MAX_CPUS) ?
>
>
> Even for NR_CPUS, I think the table should be dynamically allocated.
>
> It is a goal to eliminate all of these huge arrays in the static
> kernel image, which has grown incredibly too much in recent times.
> I work often to eliminate such things, let's not add new ones :-)

You mean you prefer:

static spinlock_t *rt_hash_lock; /* rt_hash_lock = alloc_memory_at_boot_time(...) */

instead of

static spinlock_t rt_hash_lock[RT_HASH_LOCK_SZ];

In both cases, memory is taken from lowmem, and the size of the kernel image is roughly the same (the bss section takes no space in the image).

Then the runtime cost is higher in the 'dynamic case' because of the extra indirection...?
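
For readers following the thread, the two layouts being compared can be sketched in user space. This is a minimal illustration under stated assumptions, not the route.c code: demo_lock_t stands in for spinlock_t, DEMO_LOCK_SZ for RT_HASH_LOCK_SZ, and calloc() for whatever boot-time allocator the kernel would use; all of these names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the kernel's spinlock_t (hypothetical, illustration only). */
typedef struct { volatile int locked; } demo_lock_t;

#define DEMO_LOCK_SZ 256            /* must be a power of two */

/* Variant 1: statically sized table; it lives in .bss. */
static demo_lock_t static_locks[DEMO_LOCK_SZ];

/* Variant 2: table allocated once at init time. */
static demo_lock_t *dynamic_locks;

/* Allocate the dynamic table; returns 0 on success, -1 on failure. */
static int demo_locks_init(void)
{
    dynamic_locks = calloc(DEMO_LOCK_SZ, sizeof(*dynamic_locks));
    return dynamic_locks ? 0 : -1;
}

/* Static case: the table base is a link-time constant. */
static demo_lock_t *static_lock_for(unsigned int hash)
{
    return &static_locks[hash & (DEMO_LOCK_SZ - 1)];
}

/* Dynamic case: the base pointer must be loaded before indexing. */
static demo_lock_t *dynamic_lock_for(unsigned int hash)
{
    return &dynamic_locks[hash & (DEMO_LOCK_SZ - 1)];
}
```

Both lookups pick the same slot for a given hash; the only difference is the extra load of dynamic_locks in the second variant, which is the indirection cost raised in the message above.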
From jheffner@psc.edu Fri Apr 1 13:05:56 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 13:06:03 -0800 (PST) Received: from mailer2.psc.edu (mailer2.psc.edu [128.182.66.106]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31L5ufx031548 for ; Fri, 1 Apr 2005 13:05:56 -0800 Received: from dexter.psc.edu (dexter.psc.edu [128.182.61.232]) by mailer2.psc.edu (8.13.3/8.13.3) with ESMTP id j31LAYiG018305 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 1 Apr 2005 16:10:38 -0500 (EST) Received: from dexter.psc.edu (localhost.psc.edu [127.0.0.1]) by dexter.psc.edu (8.12.11/8.12.10) with ESMTP id j31L5nhA018741; Fri, 1 Apr 2005 16:05:50 -0500 Received: from localhost (jheffner@localhost) by dexter.psc.edu (8.12.11/8.12.11/Submit) with ESMTP id j31L5nZa018738; Fri, 1 Apr 2005 16:05:49 -0500 X-Authentication-Warning: dexter.psc.edu: jheffner owned process doing -bs Date: Fri, 1 Apr 2005 16:05:49 -0500 (EST) From: John Heffner To: davem@davemloft.net, netdev@oss.sgi.com Subject: [PATCH] skb pcount with MTU discovery Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1215 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jheffner@psc.edu Precedence: bulk X-list: netdev The problem is that when doing MTU discovery, the too-large segments in the write queue will be calculated as having a pcount of >1. When tcp_write_xmit() is trying to send, tcp_snd_test() fails the cwnd test when pcount > cwnd. The segments are eventually transmitted one at a time by keepalive, but this can take a long time. This patch checks if TSO is enabled when setting pcount. 
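
The arithmetic behind the stall can be sketched in user space. This is a simplified model under stated assumptions, not the kernel code: demo_pcount() mirrors the patched segment-count computation with a plain tso_enabled flag in place of the sk_route_caps test, and demo_cwnd_ok() reduces the congestion-window part of tcp_snd_test() to a single comparison. Both function names are hypothetical.

```c
/*
 * Hypothetical model of the patched pcount computation: a segment
 * counts as one packet unless it exceeds the MSS and TSO is enabled,
 * in which case it counts as ceil(len / mss) packets.
 * Passing tso_enabled = 1 reproduces the unpatched computation.
 */
static unsigned int demo_pcount(unsigned int len, unsigned int mss,
                                int tso_enabled)
{
    if (len <= mss || !tso_enabled)
        return 1;
    return (len + mss - 1) / mss;
}

/*
 * Simplified congestion-window check (an assumption, not the full
 * tcp_snd_test()): the segment may go out only if all of its packets
 * fit in the window together with what is already in flight.
 */
static int demo_cwnd_ok(unsigned int pcount, unsigned int in_flight,
                        unsigned int cwnd)
{
    return in_flight + pcount <= cwnd;
}
```

For example, a 1448-byte segment queued before the path MTU dropped, with an MSS of 512 afterwards, counts as 3 packets under the old computation; with a cwnd of 2 and nothing in flight it can never pass the check. Counted as 1 packet when TSO is off, it passes and is then resegmented on transmit.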
-John Signed-off-by: John Heffner ===== include/net/tcp.h 1.114 vs edited ===== --- 1.114/include/net/tcp.h 2005-03-31 11:51:09 -05:00 +++ edited/include/net/tcp.h 2005-04-01 14:44:13 -05:00 @@ -1470,19 +1470,20 @@ tcp_minshall_check(tp)))); } -extern void tcp_set_skb_tso_segs(struct sk_buff *, unsigned int); +extern void tcp_set_skb_tso_segs(struct sock *, struct sk_buff *); /* This checks if the data bearing packet SKB (usually sk->sk_send_head) * should be put on the wire right now. */ -static __inline__ int tcp_snd_test(const struct tcp_sock *tp, +static __inline__ int tcp_snd_test(struct sock *sk, struct sk_buff *skb, unsigned cur_mss, int nonagle) { + struct tcp_sock *tp = tcp_sk(sk); int pkts = tcp_skb_pcount(skb); if (!pkts) { - tcp_set_skb_tso_segs(skb, tp->mss_cache_std); + tcp_set_skb_tso_segs(sk, skb); pkts = tcp_skb_pcount(skb); } @@ -1543,7 +1544,7 @@ if (skb) { if (!tcp_skb_is_last(sk, skb)) nonagle = TCP_NAGLE_PUSH; - if (!tcp_snd_test(tp, skb, cur_mss, nonagle) || + if (!tcp_snd_test(sk, skb, cur_mss, nonagle) || tcp_write_xmit(sk, nonagle)) tcp_check_probe_timer(sk, tp); } @@ -1561,7 +1562,7 @@ struct sk_buff *skb = sk->sk_send_head; return (skb && - tcp_snd_test(tp, skb, tcp_current_mss(sk, 1), + tcp_snd_test(sk, skb, tcp_current_mss(sk, 1), tcp_skb_is_last(sk, skb) ? TCP_NAGLE_PUSH : tp->nonagle)); } ===== net/ipv4/tcp_output.c 1.90 vs edited ===== --- 1.90/net/ipv4/tcp_output.c 2005-04-01 09:08:34 -05:00 +++ edited/net/ipv4/tcp_output.c 2005-04-01 14:45:27 -05:00 @@ -433,7 +433,7 @@ struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb = sk->sk_send_head; - if (tcp_snd_test(tp, skb, cur_mss, TCP_NAGLE_PUSH)) { + if (tcp_snd_test(sk, skb, cur_mss, TCP_NAGLE_PUSH)) { /* Send it out now. 
*/ TCP_SKB_CB(skb)->when = tcp_time_stamp; tcp_tso_set_push(skb); @@ -446,9 +446,12 @@ } } -void tcp_set_skb_tso_segs(struct sk_buff *skb, unsigned int mss_std) +void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb) { - if (skb->len <= mss_std) { + struct tcp_sock *tp = tcp_sk(sk); + + if (skb->len <= tp->mss_cache_std || + !(sk->sk_route_caps & NETIF_F_TSO)) { /* Avoid the costly divide in the normal * non-TSO case. */ @@ -457,10 +460,10 @@ } else { unsigned int factor; - factor = skb->len + (mss_std - 1); - factor /= mss_std; + factor = skb->len + (tp->mss_cache_std - 1); + factor /= tp->mss_cache_std; skb_shinfo(skb)->tso_segs = factor; - skb_shinfo(skb)->tso_size = mss_std; + skb_shinfo(skb)->tso_size = tp->mss_cache_std; } } @@ -531,8 +534,8 @@ } /* Fix up tso_factor for both original and new SKB. */ - tcp_set_skb_tso_segs(skb, tp->mss_cache_std); - tcp_set_skb_tso_segs(buff, tp->mss_cache_std); + tcp_set_skb_tso_segs(sk, skb); + tcp_set_skb_tso_segs(sk, buff); if (TCP_SKB_CB(skb)->sacked & TCPCB_LOST) { tp->lost_out += tcp_skb_pcount(skb); @@ -607,7 +610,7 @@ * factor and mss. */ if (tcp_skb_pcount(skb) > 1) - tcp_set_skb_tso_segs(skb, tcp_skb_mss(skb)); + tcp_set_skb_tso_segs(sk, skb); return 0; } @@ -815,7 +818,7 @@ sk_stream_free_skb(sk, skb); } else { TCP_SKB_CB(skb)->seq += copy; - tcp_set_skb_tso_segs(skb, tp->mss_cache_std); + tcp_set_skb_tso_segs(sk, skb); } len += copy; @@ -824,7 +827,7 @@ __skb_insert(nskb, skb->prev, skb, &sk->sk_write_queue); sk->sk_send_head = nskb; - tcp_set_skb_tso_segs(nskb, tp->mss_cache_std); + tcp_set_skb_tso_segs(sk, nskb); /* We're ready to send. If this fails, the probe will * be resegmented into mss-sized pieces by tcp_write_xmit(). */ @@ -885,7 +888,7 @@ mss_now = tcp_current_mss(sk, 1); while ((skb = sk->sk_send_head) && - tcp_snd_test(tp, skb, mss_now, + tcp_snd_test(sk, skb, mss_now, tcp_skb_is_last(sk, skb) ? 
nonagle : TCP_NAGLE_PUSH)) { if (skb->len > mss_now) { @@ -1822,7 +1825,7 @@ tp->mss_cache = tp->mss_cache_std; } } else if (!tcp_skb_pcount(skb)) - tcp_set_skb_tso_segs(skb, tp->mss_cache_std); + tcp_set_skb_tso_segs(sk, skb); TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH; TCP_SKB_CB(skb)->when = tcp_time_stamp; From davem@davemloft.net Fri Apr 1 13:09:23 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 13:09:27 -0800 (PST) Received: from cheetah.davemloft.net (mail@dsl027-180-174.sfo1.dsl.speakeasy.net [216.27.180.174]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31L9Nls032679 for ; Fri, 1 Apr 2005 13:09:23 -0800 Received: from localhost ([127.0.0.1] helo=cheetah.davemloft.net ident=davem) by cheetah.davemloft.net with smtp (Exim 3.36 #1 (Debian)) id 1DHTNY-0002m8-00; Fri, 01 Apr 2005 13:08:32 -0800 Date: Fri, 1 Apr 2005 13:08:32 -0800 From: "David S. Miller" To: Eric Dumazet Cc: netdev@oss.sgi.com Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() Message-Id: <20050401130832.1f972a3b.davem@davemloft.net> In-Reply-To: <424DB7A1.8090803@cosmosbay.com> References: <42370997.6010302@cosmosbay.com> <20050315103253.590c8bfc.davem@davemloft.net> <42380EC6.60100@cosmosbay.com> <20050316140915.0f6b9528.davem@davemloft.net> <4239E00C.4080309@cosmosbay.com> <20050331221352.13695124.davem@davemloft.net> <424D5D34.4030800@cosmosbay.com> <20050401122802.7c71afbc.davem@davemloft.net> <424DB7A1.8090803@cosmosbay.com> X-Mailer: Sylpheed version 1.0.4 (GTK+ 1.2.10; sparc-unknown-linux-gnu) X-Face: "_;p5u5aPsO,_Vsx"^v-pEq09'CU4&Dc1$fQExov$62l60cgCc%FnIwD=.UF^a>?5'9Kn[;433QFVV9M..2eN.@4ZWPGbdi<=?[:T>y?SD(R*-3It"Vj:)"dP Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1216 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: 
davem@davemloft.net
Precedence: bulk
X-list: netdev

On Fri, 01 Apr 2005 23:05:37 +0200
Eric Dumazet wrote:

> You mean you prefer :
>
> static spinlock_t *rt_hash_lock ; /* rt_hash_lock =
> alloc_memory_at_boot_time(...) */
>
> instead of
>
> static spinlock_t rt_hash_lock[RT_HASH_LOCK_SZ] ;
>
> In both cases, memory is taken from lowmem, and size of kernel image
> is roughly the same (bss section takes no space in image)

In the former case the kernel image the bootloader has to
load is smaller. That's important, believe it or not. It
means less TLB entries need to be locked permanently into
the MMU on certain platforms.

From davem@davemloft.net Fri Apr 1 13:11:36 2005
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 13:11:42 -0800 (PST)
Received: from cheetah.davemloft.net (mail@dsl027-180-174.sfo1.dsl.speakeasy.net [216.27.180.174]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31LBaXI000825 for ; Fri, 1 Apr 2005 13:11:36 -0800
Received: from localhost ([127.0.0.1] helo=cheetah.davemloft.net ident=davem) by cheetah.davemloft.net with smtp (Exim 3.36 #1 (Debian)) id 1DHTPh-0002mS-00; Fri, 01 Apr 2005 13:10:45 -0800
Date: Fri, 1 Apr 2005 13:10:45 -0800
From: "David S.
Miller"
To: John Heffner
Cc: netdev@oss.sgi.com
Subject: Re: [PATCH] skb pcount with MTU discovery
Message-Id: <20050401131045.4e558f65.davem@davemloft.net>
In-Reply-To:
References:
X-Mailer: Sylpheed version 1.0.4 (GTK+ 1.2.10; sparc-unknown-linux-gnu)
X-Face: "_;p5u5aPsO,_Vsx"^v-pEq09'CU4&Dc1$fQExov$62l60cgCc%FnIwD=.UF^a>?5'9Kn[;433QFVV9M..2eN.@4ZWPGbdi<=?[:T>y?SD(R*-3It"Vj:)"dP
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com
X-Virus-Status: Clean
X-archive-position: 1217
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: davem@davemloft.net
Precedence: bulk
X-list: netdev

On Fri, 1 Apr 2005 16:05:49 -0500 (EST)
John Heffner wrote:

> The problem is that when doing MTU discovery, the too-large segments in
> the write queue will be calculated as having a pcount of >1. When
> tcp_write_xmit() is trying to send, tcp_snd_test() fails the cwnd test
> when pcount > cwnd.
>
> The segments are eventually transmitted one at a time by keepalive, but
> this can take a long time.
>
> This patch checks if TSO is enabled when setting pcount.

Why isn't the MSS properly updated at this point in time?
If it were, the pcount setting would do the right thing.

That's how this code is supposed to work.

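To make the arithmetic in this exchange concrete, here is a minimal userspace sketch (not kernel code) of the pcount calculation and the cwnd comparison being discussed. The function names `model_pcount` and `model_cwnd_ok` are hypothetical, and the sample numbers used below (an 8948-byte segment, a 1448-byte MSS, a cwnd of 2) are assumptions chosen for illustration; the real logic lives in tcp_set_skb_tso_segs() and tcp_snd_test().

```c
/* Hypothetical model of the divide done when setting tso_segs:
 * the number of MSS-sized pieces needed to cover len bytes,
 * rounded up. mss is assumed to be non-zero. */
static unsigned int model_pcount(unsigned int len, unsigned int mss)
{
	if (len <= mss)
		return 1;	/* common case: no divide needed */
	return (len + mss - 1) / mss;
}

/* Hypothetical model of the cwnd part of the send test:
 * the skb may go out only if all of its pieces fit in the
 * congestion window alongside what is already in flight. */
static int model_cwnd_ok(unsigned int in_flight, unsigned int pcount,
			 unsigned int cwnd)
{
	return in_flight + pcount <= cwnd;
}
```

With those sample numbers, model_pcount(8948, 1448) rounds up to 7 and model_cwnd_ok(0, 7, 2) is false, so the segment is held back even though nothing is in flight. Forcing the count to 1 when the segment will not be split by the hardware makes the same test pass.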
From jheffner@psc.edu Fri Apr 1 13:23:06 2005
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 13:23:13 -0800 (PST)
Received: from mailer2.psc.edu (mailer2.psc.edu [128.182.66.106]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31LN5Pf001621 for ; Fri, 1 Apr 2005 13:23:06 -0800
Received: from dexter.psc.edu (dexter.psc.edu [128.182.61.232]) by mailer2.psc.edu (8.13.3/8.13.3) with ESMTP id j31LRi33009348 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 1 Apr 2005 16:27:48 -0500 (EST)
Received: from dexter.psc.edu (localhost.psc.edu [127.0.0.1]) by dexter.psc.edu (8.12.11/8.12.10) with ESMTP id j31LMxdx018810; Fri, 1 Apr 2005 16:22:59 -0500
Received: from localhost (jheffner@localhost) by dexter.psc.edu (8.12.11/8.12.11/Submit) with ESMTP id j31LMx4H018807; Fri, 1 Apr 2005 16:22:59 -0500
X-Authentication-Warning: dexter.psc.edu: jheffner owned process doing -bs
Date: Fri, 1 Apr 2005 16:22:59 -0500 (EST)
From: John Heffner
To: "David S. Miller"
cc: netdev@oss.sgi.com
Subject: Re: [PATCH] skb pcount with MTU discovery
In-Reply-To: <20050401131045.4e558f65.davem@davemloft.net>
Message-ID:
References: <20050401131045.4e558f65.davem@davemloft.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com
X-Virus-Status: Clean
X-archive-position: 1218
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: jheffner@psc.edu
Precedence: bulk
X-list: netdev

On Fri, 1 Apr 2005, David S. Miller wrote:

> On Fri, 1 Apr 2005 16:05:49 -0500 (EST)
> John Heffner wrote:
>
> > The problem is that when doing MTU discovery, the too-large segments in
> > the write queue will be calculated as having a pcount of >1. When
> > tcp_write_xmit() is trying to send, tcp_snd_test() fails the cwnd test
> > when pcount > cwnd.
> >
> > The segments are eventually transmitted one at a time by keepalive, but
> > this can take a long time.
> >
> > This patch checks if TSO is enabled when setting pcount.
>
> Why isn't the MSS properly updated at this point in time?
> If it were, the pcount setting would do the right thing.
>
> That's how this code is supposed to work.

The problem occurs when TSO is disabled.

Common case, start out with mss of 8948. Send 2 segments; neither are
acknowledged, and we receive an ICMP can't fragment indicating a pmtu of
1500 so mss is set down to 1448. Now tcp_set_skb_tso_segs() sets tso_segs
to 6, so tcp_snd_test thinks we are doing TSO and will send the full 6
mss, and fails the cwnd test since cwnd == 2.

-John

From colin@colino.net Fri Apr 1 13:28:05 2005
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 13:28:11 -0800 (PST)
Received: from paperstreet.colino.net (colino.net [213.41.131.56]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31LS3F5002218 for ; Fri, 1 Apr 2005 13:28:04 -0800
Received: by paperstreet.colino.net (Postfix, from userid 1015) id 3D0C3101D9; Fri, 1 Apr 2005 23:27:52 +0200 (CEST)
Received: from jack.colino.net (jack.colino.net [192.168.0.11]) by paperstreet.colino.net (Postfix) with ESMTP id 974A9101A2; Fri, 1 Apr 2005 23:27:49 +0200 (CEST)
Date: Fri, 1 Apr 2005 23:27:47 +0200
From: Colin Leroy
To: David Brownell
Cc: linux-usb-devel@lists.sourceforge.net, Andrew Morton , Jeroen Vreeken , netdev@oss.sgi.com
Subject: Re: [linux-usb-devel] [PATCH] PM support for zd1201
Message-ID: <20050401232747.3f9ed365@jack.colino.net>
In-Reply-To: <200504011030.57978.david-b@pacbell.net>
References: <20050330144423.0dde5b71@jack.colino.net> <200504011030.57978.david-b@pacbell.net>
X-Mailer: Sylpheed-Claws 1.9.6cvs18 (GTK+ 2.6.4; powerpc-unknown-linux-gnu)
X-Face:
Fy:*XpRna1/tz}cJ@O'0^:qYs:8b[Rg`*8,+o^[fI?<%5LeB,Xz8ZJK[r7V0hBs8G)*&C+XA0qHoR=LoTohe@7X5K$A-@cN6n~~J/]+{[)E4h'lK$13WQf$.R+Pi;E09tk&{t|;~dakRD%CLHrk6m!?gA,5|Sb=fJ=>[9#n1Bu8?VngkVM4{'^'V_qgdA.8yn3) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1219 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: colin@colino.net Precedence: bulk X-list: netdev On 01 Apr 2005 at 10h04, David Brownell wrote: Hi, > Looked ok to me, other than needing to change "u32 state" into > a "pm_message_t message". And I'm not sure why "mac_enabled" > would be the right test, rather than maybe netif_running(). Here it is. Signed-off-by: Colin Leroy --- drivers/usb/net/zd1201.c.orig 2005-03-30 14:35:23.000000000 +0200 +++ drivers/usb/net/zd1201.c 2005-04-01 23:24:04.000000000 +0200 @@ -1896,12 +1896,50 @@ kfree(zd); } +#ifdef CONFIG_PM + +static int zd1201_suspend (struct usb_interface *interface, + pm_message_t message) +{ + struct zd1201 *zd = (struct zd1201 *)usb_get_intfdata(interface); + + netif_device_detach(zd->dev); + + zd->was_enabled = zd->mac_enabled; + + if (zd->was_enabled) + return zd1201_disable(zd); + else + return 0; +} + +static int zd1201_resume (struct usb_interface *interface) +{ + struct zd1201 *zd = (struct zd1201 *)usb_get_intfdata(interface); + + netif_device_attach(zd->dev); + + if (zd->was_enabled) + return zd1201_enable(zd); + else + return 0; +} + +#else + +#define zd1201_suspend NULL +#define zd1201_resume NULL + +#endif + struct usb_driver zd1201_usb = { .owner = THIS_MODULE, .name = "zd1201", .probe = zd1201_probe, .disconnect = zd1201_disconnect, .id_table = zd1201_table, + .suspend = zd1201_suspend, + .resume = zd1201_resume, }; static int __init zd1201_init(void) --- drivers/usb/net/zd1201.h.orig 2005-03-30 14:35:36.000000000 +0200 
+++ drivers/usb/net/zd1201.h 2005-03-30 14:24:33.000000000 +0200 @@ -46,6 +46,7 @@ char essid[IW_ESSID_MAX_SIZE+1]; int essidlen; int mac_enabled; + int was_enabled; int monitor; int encode_enabled; int encode_restricted; From dada1@cosmosbay.com Fri Apr 1 13:43:53 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 13:43:58 -0800 (PST) Received: from gw1.cosmosbay.com (gw1.cosmosbay.com [62.23.185.226]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31LhqNT003067 for ; Fri, 1 Apr 2005 13:43:53 -0800 Received: from [192.168.0.3] ([84.5.129.64]) by gw1.cosmosbay.com (8.13.3/8.13.3) with ESMTP id j31Lhd6I031012; Fri, 1 Apr 2005 23:43:45 +0200 Message-ID: <424DC08A.3020204@cosmosbay.com> Date: Fri, 01 Apr 2005 23:43:38 +0200 From: Eric Dumazet User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206) X-Accept-Language: fr, en MIME-Version: 1.0 To: "David S. Miller" CC: netdev@oss.sgi.com Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() References: <42370997.6010302@cosmosbay.com> <20050315103253.590c8bfc.davem@davemloft.net> <42380EC6.60100@cosmosbay.com> <20050316140915.0f6b9528.davem@davemloft.net> <4239E00C.4080309@cosmosbay.com> <20050331221352.13695124.davem@davemloft.net> <424D5D34.4030800@cosmosbay.com> <20050401122802.7c71afbc.davem@davemloft.net> <424DB7A1.8090803@cosmosbay.com> <20050401130832.1f972a3b.davem@davemloft.net> In-Reply-To: <20050401130832.1f972a3b.davem@davemloft.net> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-1.6 (gw1.cosmosbay.com [62.23.185.226]); Fri, 01 Apr 2005 23:43:45 +0200 (CEST) X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1220 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: dada1@cosmosbay.com Precedence: bulk X-list: netdev David S. 
Miller wrote:
> On Fri, 01 Apr 2005 23:05:37 +0200
> Eric Dumazet wrote:
>
>
>>You mean you prefer :
>>
>>static spinlock_t *rt_hash_lock ; /* rt_hash_lock =
>>alloc_memory_at_boot_time(...) */
>>
>>instead of
>>
>>static spinlock_t rt_hash_lock[RT_HASH_LOCK_SZ] ;
>>
>>In both cases, memory is taken from lowmem, and size of kernel image
>>is roughly the same (bss section takes no space in image)
>
>
> In the former case the kernel image the bootloader has to
> load is smaller. That's important, believe it or not. It
> means less TLB entries need to be locked permanently into
> the MMU on certain platforms.
>
>

OK thanks for this clarification. I changed to :

#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
/*
 * Instead of using one spinlock for each rt_hash_bucket, we use a table of spinlocks
 * The size of this table is a power of two and depends on the number of CPUS.
 */
#if NR_CPUS >= 32
#define RT_HASH_LOCK_SZ 4096
#elif NR_CPUS >= 16
#define RT_HASH_LOCK_SZ 2048
#elif NR_CPUS >= 8
#define RT_HASH_LOCK_SZ 1024
#elif NR_CPUS >= 4
#define RT_HASH_LOCK_SZ 512
#else
#define RT_HASH_LOCK_SZ 256
#endif

static spinlock_t *rt_hash_locks;
# define rt_hash_lock_addr(slot) &rt_hash_locks[slot & (RT_HASH_LOCK_SZ - 1)]
# define rt_hash_lock_init() { \
	int i; \
	rt_hash_locks = kmalloc(sizeof(spinlock_t) * RT_HASH_LOCK_SZ, GFP_KERNEL); \
	if (!rt_hash_locks) panic("IP: failed to allocate rt_hash_locks\n"); \
	for (i = 0; i < RT_HASH_LOCK_SZ; i++) \
		spin_lock_init(&rt_hash_locks[i]); \
	}
#else
# define rt_hash_lock_addr(slot) NULL
# define rt_hash_lock_init()
#endif

Are you OK if I also use alloc_large_system_hash() to allocate
rt_hash_table, instead of the current method ? This new method is used
in net/ipv4/tcp.c for tcp_ehash and tcp_bhash and permits NUMA tuning.

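For readers who want to try the shared lock table idea outside the kernel, the following is a minimal userspace sketch of it. It assumes a toy spinlock built on C11 `atomic_flag` in place of the kernel's spinlock_t, and the names `LOCK_TABLE_SZ`, `toy_spinlock`, `lock_table_init`, `lock_for_slot`, `toy_lock` and `toy_unlock` are hypothetical, not taken from the patch. It shows only the mapping: many hash buckets share one lock, picked by masking the bucket index with a power-of-two table size.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical table size; it must be a power of two so that
 * masking with (size - 1) yields a valid index. */
#define LOCK_TABLE_SZ 256

typedef struct {
	atomic_flag held;
} toy_spinlock;

static toy_spinlock *lock_table;

/* Allocate the lock table at run time and mark every lock free.
 * Returns 0 on success, -1 if the allocation fails. */
static int lock_table_init(void)
{
	size_t i;

	lock_table = malloc(sizeof(*lock_table) * LOCK_TABLE_SZ);
	if (!lock_table)
		return -1;
	for (i = 0; i < LOCK_TABLE_SZ; i++)
		atomic_flag_clear(&lock_table[i].held);
	return 0;
}

/* Map a hash bucket index to the lock that protects it. */
static toy_spinlock *lock_for_slot(unsigned int slot)
{
	return &lock_table[slot & (LOCK_TABLE_SZ - 1)];
}

/* Busy-wait until the flag was previously clear. */
static void toy_lock(toy_spinlock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->held,
						 memory_order_acquire))
		; /* spin */
}

static void toy_unlock(toy_spinlock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

In this model, buckets 5 and 5 + LOCK_TABLE_SZ resolve to the same lock, which is the trade-off being made: a much smaller lock table in exchange for occasional contention between unrelated buckets.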
Eric

From davem@davemloft.net Fri Apr 1 14:35:37 2005
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 14:35:47 -0800 (PST)
Received: from cheetah.davemloft.net (mail@dsl027-180-174.sfo1.dsl.speakeasy.net [216.27.180.174]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31MZbk5005035 for ; Fri, 1 Apr 2005 14:35:37 -0800
Received: from localhost ([127.0.0.1] helo=cheetah.davemloft.net ident=davem) by cheetah.davemloft.net with smtp (Exim 3.36 #1 (Debian)) id 1DHUiw-0003Dx-00; Fri, 01 Apr 2005 14:34:42 -0800
Date: Fri, 1 Apr 2005 14:34:42 -0800
From: "David S. Miller"
To: Eric Dumazet
Cc: netdev@oss.sgi.com
Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire()
Message-Id: <20050401143442.62ed8bb9.davem@davemloft.net>
In-Reply-To: <424DC08A.3020204@cosmosbay.com>
References: <42370997.6010302@cosmosbay.com> <20050315103253.590c8bfc.davem@davemloft.net> <42380EC6.60100@cosmosbay.com> <20050316140915.0f6b9528.davem@davemloft.net> <4239E00C.4080309@cosmosbay.com> <20050331221352.13695124.davem@davemloft.net> <424D5D34.4030800@cosmosbay.com> <20050401122802.7c71afbc.davem@davemloft.net> <424DB7A1.8090803@cosmosbay.com> <20050401130832.1f972a3b.davem@davemloft.net> <424DC08A.3020204@cosmosbay.com>
X-Mailer: Sylpheed version 1.0.4 (GTK+ 1.2.10; sparc-unknown-linux-gnu)
X-Face: "_;p5u5aPsO,_Vsx"^v-pEq09'CU4&Dc1$fQExov$62l60cgCc%FnIwD=.UF^a>?5'9Kn[;433QFVV9M..2eN.@4ZWPGbdi<=?[:T>y?SD(R*-3It"Vj:)"dP
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com
X-Virus-Status: Clean
X-archive-position: 1221
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: davem@davemloft.net
Precedence: bulk
X-list: netdev

On Fri, 01 Apr 2005 23:43:38 +0200
Eric Dumazet wrote:

> Are you OK if I also use alloc_large_system_hash() to allocate
> rt_hash_table, instead of the current method ? This new method is used
> in net/ipv4/tcp.c for tcp_ehash and tcp_bhash and permits NUMA tuning.

Sure, that's fine.

BTW, please line-wrap your emails. :-/

From herbert@gondor.apana.org.au Fri Apr 1 14:48:43 2005
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 14:48:50 -0800 (PST)
Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31MmfmE005961 for ; Fri, 1 Apr 2005 14:48:42 -0800
Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHUvx-0000Cu-00; Sat, 02 Apr 2005 08:48:09 +1000
Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHUvP-00050C-00; Sat, 02 Apr 2005 08:47:35 +1000
From: Herbert Xu
To: jheffner@psc.edu (John Heffner)
Subject: Re: [PATCH] skb pcount with MTU discovery
Cc: davem@davemloft.net, netdev@oss.sgi.com
Organization: Core
In-Reply-To:
X-Newsgroups: apana.lists.os.linux.netdev
User-Agent: tin/1.7.4-20040225 ("Benbecula") (UNIX) (Linux/2.4.27-hx-1-686-smp (i686))
Message-Id:
Date: Sat, 02 Apr 2005 08:47:35 +1000
X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com
X-Virus-Status: Clean
X-archive-position: 1222
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: herbert@gondor.apana.org.au
Precedence: bulk
X-list: netdev

John Heffner wrote:
>
> Common case, start out with mss of 8948. Send 2 segments; neither are
> acknowledged, and we receive an ICMP can't fragment indicating a pmtu of
> 1500 so mss is set down to 1448. Now tcp_set_skb_tso_segs() sets tso_segs
> to 6, so tcp_snd_test thinks we are doing TSO and will send the full 6
> mss, and fails the cwnd test since cwnd == 2.

How about fixing tcp_snd_test directly like this?

Of course all this will be moot once Dave finishes his TSO rewrite :)
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~}
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
===== include/net/tcp.h 1.107 vs edited =====
--- 1.107/include/net/tcp.h 2005-03-16 10:15:03 +11:00
+++ edited/include/net/tcp.h 2005-04-02 08:45:48 +10:00
@@ -1433,6 +1433,9 @@
 		pkts = tcp_skb_pcount(skb);
 	}
 
+	if (!(tp->inet.sk.sk_route_caps & NETIF_F_TSO))
+		pkts = 1;
+
 	/* RFC 1122 - section 4.2.3.4
 	 *
 	 * We must queue if

From dada1@cosmosbay.com Fri Apr 1 15:22:03 2005
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 15:22:11 -0800 (PST)
Received: from gw1.cosmosbay.com (gw1.cosmosbay.com [62.23.185.226]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31NM2Gf007563 for ; Fri, 1 Apr 2005 15:22:03 -0800
Received: from [192.168.0.3] ([84.5.129.64]) by gw1.cosmosbay.com (8.13.3/8.13.3) with ESMTP id j31NLm32032417; Sat, 2 Apr 2005 01:21:54 +0200
Message-ID: <424DD78D.7070001@cosmosbay.com>
Date: Sat, 02 Apr 2005 01:21:49 +0200
From: Eric Dumazet
User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206)
X-Accept-Language: fr, en
MIME-Version: 1.0
To: "David S.
Miller" CC: netdev@oss.sgi.com Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() References: <42370997.6010302@cosmosbay.com> <20050315103253.590c8bfc.davem@davemloft.net> <42380EC6.60100@cosmosbay.com> <20050316140915.0f6b9528.davem@davemloft.net> <4239E00C.4080309@cosmosbay.com> <20050331221352.13695124.davem@davemloft.net> <424D5D34.4030800@cosmosbay.com> <20050401122802.7c71afbc.davem@davemloft.net> <424DB7A1.8090803@cosmosbay.com> <20050401130832.1f972a3b.davem@davemloft.net> <424DC08A.3020204@cosmosbay.com> <20050401143442.62ed8bb9.davem@davemloft.net> In-Reply-To: <20050401143442.62ed8bb9.davem@davemloft.net> Content-Type: multipart/mixed; boundary="------------090807070004040008080507" X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-1.6 (gw1.cosmosbay.com [62.23.185.226]); Sat, 02 Apr 2005 01:21:55 +0200 (CEST) X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1223 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: dada1@cosmosbay.com Precedence: bulk X-list: netdev This is a multi-part message in MIME format. --------------090807070004040008080507 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit David S. Miller a écrit : > On Fri, 01 Apr 2005 23:43:38 +0200 > Eric Dumazet wrote: > > >>Are you OK if I also use alloc_large_system_hash() to allocate >>rt_hash_table, instead of the current method ? This new method is used >>in net/ipv4/tcp.c for tcp_ehash and tcp_bhash and permits NUMA tuning. > > > Sure, that's fine. > > BTW, please line-wrap your emails. :-/ > > :-) OK this patch includes everything... 
- Locking abstraction - rt_check_expire() fixes - New gc_interval_ms sysctl to be able to have timer gc_interval < 1 second - New gc_debug sysctl to let sysadmin tune gc - Less memory used by hash table (spinlocks moved to a smaller table) - sizing of spinlocks table depends on NR_CPUS - hash table allocated using alloc_large_system_hash() function - header fix for /proc/net/stat/rt_cache Thank you Eric --------------090807070004040008080507 Content-Type: text/plain; name="diff" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="diff" diff -Nru linux-2.6.11/net/ipv4/route.c linux-2.6.11-ed/net/ipv4/route.c --- linux-2.6.11/net/ipv4/route.c 2005-03-02 08:38:38.000000000 +0100 +++ linux-2.6.11-ed/net/ipv4/route.c 2005-04-02 01:10:37.000000000 +0200 @@ -54,6 +54,8 @@ * Marc Boucher : routing by fwmark * Robert Olsson : Added rt_cache statistics * Arnaldo C. Melo : Convert proc stuff to seq_file + * Eric Dumazet : hashed spinlocks and rt_check_expire() fixes. + * : bugfix in rt_cpu_seq_show() * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License @@ -70,6 +72,7 @@ #include #include #include +#include #include #include #include @@ -107,12 +110,13 @@ #define IP_MAX_MTU 0xFFF0 #define RT_GC_TIMEOUT (300*HZ) +#define RT_GC_INTERVAL (RT_GC_TIMEOUT/10) /* rt_check_expire() scans 1/10 of the table each round */ static int ip_rt_min_delay = 2 * HZ; static int ip_rt_max_delay = 10 * HZ; static int ip_rt_max_size; static int ip_rt_gc_timeout = RT_GC_TIMEOUT; -static int ip_rt_gc_interval = 60 * HZ; +static int ip_rt_gc_interval = RT_GC_INTERVAL; static int ip_rt_gc_min_interval = HZ / 2; static int ip_rt_redirect_number = 9; static int ip_rt_redirect_load = HZ / 50; @@ -124,6 +128,7 @@ static int ip_rt_min_pmtu = 512 + 20 + 20; static int ip_rt_min_advmss = 256; static int ip_rt_secret_interval = 10 * 60 * HZ; +static int ip_rt_debug; static unsigned long rt_deadline; #define 
RTprint(a...) printk(KERN_DEBUG a) @@ -197,8 +202,38 @@ struct rt_hash_bucket { struct rtable *chain; - spinlock_t lock; -} __attribute__((__aligned__(8))); +}; + +#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) +/* + * Instead of using one spinlock for each rt_hash_bucket, we use a table of spinlocks + * The size of this table is a power of two and depends on the number of CPUS. + */ +#if NR_CPUS >= 32 +#define RT_HASH_LOCK_SZ 4096 +#elif NR_CPUS >= 16 +#define RT_HASH_LOCK_SZ 2048 +#elif NR_CPUS >= 8 +#define RT_HASH_LOCK_SZ 1024 +#elif NR_CPUS >= 4 +#define RT_HASH_LOCK_SZ 512 +#else +#define RT_HASH_LOCK_SZ 256 +#endif + + static spinlock_t *rt_hash_locks; +# define rt_hash_lock_addr(slot) &rt_hash_locks[slot & (RT_HASH_LOCK_SZ - 1)] +# define rt_hash_lock_init() { \ + int i; \ + rt_hash_locks = kmalloc(sizeof(spinlock_t) * RT_HASH_LOCK_SZ, GFP_KERNEL); \ + if (!rt_hash_locks) panic("IP: failed to allocate rt_hash_locks\n"); \ + for (i = 0; i < RT_HASH_LOCK_SZ; i++) \ + spin_lock_init(&rt_hash_locks[i]); \ + } +#else +# define rt_hash_lock_addr(slot) NULL +# define rt_hash_lock_init() +#endif static struct rt_hash_bucket *rt_hash_table; static unsigned rt_hash_mask; @@ -393,7 +428,7 @@ struct rt_cache_stat *st = v; if (v == SEQ_START_TOKEN) { - seq_printf(seq, "entries in_hit in_slow_tot in_no_route in_brd in_martian_dst in_martian_src out_hit out_slow_tot out_slow_mc gc_total gc_ignored gc_goal_miss gc_dst_overflow in_hlist_search out_hlist_search\n"); + seq_printf(seq, "entries in_hit in_slow_tot in_slow_mc in_no_route in_brd in_martian_dst in_martian_src out_hit out_slow_tot out_slow_mc gc_total gc_ignored gc_goal_miss gc_dst_overflow in_hlist_search out_hlist_search\n"); return 0; } @@ -470,7 +505,7 @@ rth->u.dst.expires; } -static int rt_may_expire(struct rtable *rth, unsigned long tmo1, unsigned long tmo2) +static __inline__ int rt_may_expire(struct rtable *rth, unsigned long tmo1, unsigned long tmo2) { unsigned long age; int ret = 0; @@ -516,45 
+551,93 @@ /* This runs via a timer and thus is always in BH context. */ static void rt_check_expire(unsigned long dummy) { - static int rover; - int i = rover, t; + static unsigned int rover; + static unsigned int effective_interval = RT_GC_INTERVAL; + static unsigned int cached_gc_interval = RT_GC_INTERVAL; + unsigned int i, goal; struct rtable *rth, **rthp; unsigned long now = jiffies; + unsigned int freed = 0 , t0; + u64 mult; - for (t = ip_rt_gc_interval << rt_hash_log; t >= 0; - t -= ip_rt_gc_timeout) { - unsigned long tmo = ip_rt_gc_timeout; - + if (cached_gc_interval != ip_rt_gc_interval) { /* ip_rt_gc_interval may have changed with sysctl */ + cached_gc_interval = ip_rt_gc_interval; + effective_interval = cached_gc_interval; + } + /* Computes the number of slots we should examin in this run : + * We want to perform a full scan every ip_rt_gc_timeout, and + * the timer is started every 'effective_interval' ticks. + * so goal = (number_of_slots) * (effective_interval / ip_rt_gc_timeout) + */ + mult = ((u64)effective_interval) << rt_hash_log; + do_div(mult, ip_rt_gc_timeout); + goal = (unsigned int)mult; + + i = atomic_read(&ipv4_dst_ops.entries) << 3; + if (i > ip_rt_max_size) { + goal <<= 1; /* be more aggressive */ + i >>= 1; + if (i > ip_rt_max_size) { + goal <<= 1; /* be more aggressive */ + i >>= 1; + if (i > ip_rt_max_size) { + goal <<= 1; /* be more aggressive */ + now++; /* give us one more tick (time) to do our job */ + } + } + } + if (goal > rt_hash_mask) goal = rt_hash_mask + 1; + t0 = goal; + i = rover; + for ( ; goal > 0; goal--) { i = (i + 1) & rt_hash_mask; rthp = &rt_hash_table[i].chain; - - spin_lock(&rt_hash_table[i].lock); - while ((rth = *rthp) != NULL) { - if (rth->u.dst.expires) { - /* Entry is expired even if it is in use */ - if (time_before_eq(now, rth->u.dst.expires)) { + if (*rthp) { + unsigned long tmo = ip_rt_gc_timeout; + spin_lock(rt_hash_lock_addr(i)); + while ((rth = *rthp) != NULL) { + if (rth->u.dst.expires) { + /* Entry is 
expired even if it is in use */ + if (time_before_eq(now, rth->u.dst.expires)) { + tmo >>= 1; + rthp = &rth->u.rt_next; + continue; + } + } else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) { tmo >>= 1; rthp = &rth->u.rt_next; continue; } - } else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) { - tmo >>= 1; - rthp = &rth->u.rt_next; - continue; - } - /* Cleanup aged off entries. */ - *rthp = rth->u.rt_next; - rt_free(rth); + /* Cleanup aged off entries. */ + *rthp = rth->u.rt_next; + freed++; + rt_free(rth); + } + spin_unlock(rt_hash_lock_addr(i)); } - spin_unlock(&rt_hash_table[i].lock); - /* Fallback loop breaker. */ if (time_after(jiffies, now)) break; } rover = i; - mod_timer(&rt_periodic_timer, now + ip_rt_gc_interval); + if (goal != 0) { + /* Not enough time to perform our job, try to adjust the timer. + * Firing the timer sooner means less planned work. + * We allow the timer to be 1/8 of the sysctl value. + */ + effective_interval = (effective_interval + cached_gc_interval/8)/2; + } + else { + /* We finished our job before time limit, try to increase the timer + * The limit is the sysctl value, we use a weight of 3/1 to + * increase slowly. 
+ */ + effective_interval = (3*effective_interval + cached_gc_interval + 3)/4; + } + if (ip_rt_debug & 1) + printk(KERN_WARNING "rt_check_expire() : %u freed, goal=%u/%u, interval=%u ticks\n", freed, goal, t0, effective_interval); + mod_timer(&rt_periodic_timer, jiffies + effective_interval); } /* This can run from both BH and non-BH contexts, the latter @@ -570,11 +653,11 @@ get_random_bytes(&rt_hash_rnd, 4); for (i = rt_hash_mask; i >= 0; i--) { - spin_lock_bh(&rt_hash_table[i].lock); + spin_lock_bh(rt_hash_lock_addr(i)); rth = rt_hash_table[i].chain; if (rth) rt_hash_table[i].chain = NULL; - spin_unlock_bh(&rt_hash_table[i].lock); + spin_unlock_bh(rt_hash_lock_addr(i)); for (; rth; rth = next) { next = rth->u.rt_next; @@ -704,7 +787,7 @@ k = (k + 1) & rt_hash_mask; rthp = &rt_hash_table[k].chain; - spin_lock_bh(&rt_hash_table[k].lock); + spin_lock_bh(rt_hash_lock_addr(k)); while ((rth = *rthp) != NULL) { if (!rt_may_expire(rth, tmo, expire)) { tmo >>= 1; @@ -715,7 +798,7 @@ rt_free(rth); goal--; } - spin_unlock_bh(&rt_hash_table[k].lock); + spin_unlock_bh(rt_hash_lock_addr(k)); if (goal <= 0) break; } @@ -792,7 +875,7 @@ rthp = &rt_hash_table[hash].chain; - spin_lock_bh(&rt_hash_table[hash].lock); + spin_lock_bh(rt_hash_lock_addr(hash)); while ((rth = *rthp) != NULL) { if (compare_keys(&rth->fl, &rt->fl)) { /* Put it first */ @@ -813,7 +896,7 @@ rth->u.dst.__use++; dst_hold(&rth->u.dst); rth->u.dst.lastuse = now; - spin_unlock_bh(&rt_hash_table[hash].lock); + spin_unlock_bh(rt_hash_lock_addr(hash)); rt_drop(rt); *rp = rth; @@ -854,7 +937,7 @@ if (rt->rt_type == RTN_UNICAST || rt->fl.iif == 0) { int err = arp_bind_neighbour(&rt->u.dst); if (err) { - spin_unlock_bh(&rt_hash_table[hash].lock); + spin_unlock_bh(rt_hash_lock_addr(hash)); if (err != -ENOBUFS) { rt_drop(rt); @@ -895,7 +978,7 @@ } #endif rt_hash_table[hash].chain = rt; - spin_unlock_bh(&rt_hash_table[hash].lock); + spin_unlock_bh(rt_hash_lock_addr(hash)); *rp = rt; return 0; } @@ -962,7 +1045,7 @@ { 
struct rtable **rthp; - spin_lock_bh(&rt_hash_table[hash].lock); + spin_lock_bh(rt_hash_lock_addr(hash)); ip_rt_put(rt); for (rthp = &rt_hash_table[hash].chain; *rthp; rthp = &(*rthp)->u.rt_next) @@ -971,7 +1054,7 @@ rt_free(rt); break; } - spin_unlock_bh(&rt_hash_table[hash].lock); + spin_unlock_bh(rt_hash_lock_addr(hash)); } void ip_rt_redirect(u32 old_gw, u32 daddr, u32 new_gw, @@ -2569,6 +2652,23 @@ .strategy = &sysctl_jiffies, }, { + .ctl_name = NET_IPV4_ROUTE_GC_INTERVAL_MS, + .procname = "gc_interval_ms", + .data = &ip_rt_gc_interval, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec_ms_jiffies, + .strategy = &sysctl_ms_jiffies, + }, + { + .ctl_name = NET_IPV4_ROUTE_GC_DEBUG, + .procname = "gc_debug", + .data = &ip_rt_debug, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { .ctl_name = NET_IPV4_ROUTE_REDIRECT_LOAD, .procname = "redirect_load", .data = &ip_rt_redirect_load, @@ -2718,12 +2818,13 @@ int __init ip_rt_init(void) { - int i, order, goal, rc = 0; rt_hash_rnd = (int) ((num_physpages ^ (num_physpages>>8)) ^ (jiffies ^ (jiffies >> 7))); #ifdef CONFIG_NET_CLS_ROUTE + { + int order; for (order = 0; (PAGE_SIZE << order) < 256 * sizeof(struct ip_rt_acct) * NR_CPUS; order++) /* NOTHING */; @@ -2731,6 +2832,7 @@ if (!ip_rt_acct) panic("IP: failed to allocate ip_rt_acct\n"); memset(ip_rt_acct, 0, PAGE_SIZE << order); + } #endif ipv4_dst_ops.kmem_cachep = kmem_cache_create("ip_dst_cache", @@ -2741,39 +2843,24 @@ if (!ipv4_dst_ops.kmem_cachep) panic("IP: failed to allocate ip_dst_cache\n"); - goal = num_physpages >> (26 - PAGE_SHIFT); - if (rhash_entries) - goal = (rhash_entries * sizeof(struct rt_hash_bucket)) >> PAGE_SHIFT; - for (order = 0; (1UL << order) < goal; order++) - /* NOTHING */; - - do { - rt_hash_mask = (1UL << order) * PAGE_SIZE / - sizeof(struct rt_hash_bucket); - while (rt_hash_mask & (rt_hash_mask - 1)) - rt_hash_mask--; - rt_hash_table = (struct rt_hash_bucket *) - 
__get_free_pages(GFP_ATOMIC, order); - } while (rt_hash_table == NULL && --order > 0); - - if (!rt_hash_table) - panic("Failed to allocate IP route cache hash table\n"); - - printk(KERN_INFO "IP: routing cache hash table of %u buckets, %ldKbytes\n", - rt_hash_mask, - (long) (rt_hash_mask * sizeof(struct rt_hash_bucket)) / 1024); + rt_hash_table = (struct rt_hash_bucket *) + alloc_large_system_hash("IP route cache", + sizeof(struct rt_hash_bucket), + rhash_entries, + (num_physpages >= 128 * 1024) ? + (27 - PAGE_SHIFT) : + (29 - PAGE_SHIFT), + HASH_HIGHMEM, + &rt_hash_log, + &rt_hash_mask, + 0); - for (rt_hash_log = 0; (1 << rt_hash_log) != rt_hash_mask; rt_hash_log++) - /* NOTHING */; + memset(rt_hash_table, 0, rt_hash_mask * sizeof(struct rt_hash_bucket)); + rt_hash_lock_init(); + ipv4_dst_ops.gc_thresh = rt_hash_mask; + ip_rt_max_size = rt_hash_mask * 16; rt_hash_mask--; - for (i = 0; i <= rt_hash_mask; i++) { - spin_lock_init(&rt_hash_table[i].lock); - rt_hash_table[i].chain = NULL; - } - - ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1); - ip_rt_max_size = (rt_hash_mask + 1) * 16; rt_cache_stat = alloc_percpu(struct rt_cache_stat); if (!rt_cache_stat) @@ -2819,7 +2906,7 @@ xfrm_init(); xfrm4_init(); #endif - return rc; + return 0; } EXPORT_SYMBOL(__ip_select_ident); diff -Nru linux-2.6.11/Documentation/filesystems/proc.txt linux-2.6.11-ed/Documentation/filesystems/proc.txt --- linux-2.6.11/Documentation/filesystems/proc.txt 2005-04-02 01:19:15.000000000 +0200 +++ linux-2.6.11-ed/Documentation/filesystems/proc.txt 2005-04-02 01:19:04.000000000 +0200 @@ -1709,12 +1709,13 @@ Writing to this file results in a flush of the routing cache. -gc_elasticity, gc_interval, gc_min_interval_ms, gc_timeout, gc_thresh +gc_elasticity, gc_interval_ms, gc_min_interval_ms, gc_timeout, gc_thresh, gc_debug --------------------------------------------------------------------- Values to control the frequency and behavior of the garbage collection algorithm for the routing cache. 
gc_min_interval is deprecated and replaced -by gc_min_interval_ms. +by gc_min_interval_ms. gc_interval is deprecated and replaced by +gc_interval_ms. gc_debug enables some printk() max_size diff -Nru linux-2.6.11/include/linux/sysctl.h linux-2.6.11-ed/include/linux/sysctl.h --- linux-2.6.11/include/linux/sysctl.h 2005-03-02 08:38:10.000000000 +0100 +++ linux-2.6.11-ed/include/linux/sysctl.h 2005-04-02 00:43:11.000000000 +0200 @@ -367,6 +367,8 @@ NET_IPV4_ROUTE_MIN_ADVMSS=17, NET_IPV4_ROUTE_SECRET_INTERVAL=18, NET_IPV4_ROUTE_GC_MIN_INTERVAL_MS=19, + NET_IPV4_ROUTE_GC_INTERVAL_MS=20, + NET_IPV4_ROUTE_GC_DEBUG=21, }; enum --------------090807070004040008080507-- From tgraf@suug.ch Fri Apr 1 15:26:38 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 15:26:45 -0800 (PST) Received: from b.mx.projectdream.org (eth0-0.arisu.projectdream.org [194.158.4.191]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31NQbov008184 for ; Fri, 1 Apr 2005 15:26:38 -0800 Received: from postel.suug.ch (postel.suug.ch [195.134.158.23]) (using TLSv1 with cipher EDH-RSA-DES-CBC3-SHA (168/168 bits)) (No client certificate requested) by b.mx.projectdream.org (Postfix) with ESMTP id C77938A; Sat, 2 Apr 2005 01:26:12 +0200 (CEST) Received: by postel.suug.ch (Postfix, from userid 10001) id 354251C0EB; Sat, 2 Apr 2005 01:26:54 +0200 (CEST) Date: Sat, 2 Apr 2005 01:26:54 +0200 From: Thomas Graf To: "David S. 
Miller"
Cc: netdev@oss.sgi.com
Subject: [PATCHSET] action statistics dumping fix & gnet_stats improvements
Message-ID: <20050401232654.GJ3086@postel.suug.ch>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com
X-Virus-Status: Clean
X-archive-position: 1224
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: tgraf@suug.ch
Precedence: bulk
X-list: netdev

Fixes a stupid bug I introduced in the last patchset which for some
reason didn't get caught in the testing process. The other two
patches change the behaviour of yet unused but likely use cases
to what one would expect without reading the code.

Please do a

  bk pull bk://kernel.bkbits.net/tgraf/net-2.6-tcf_exts

This will update the following files:

 include/net/gen_stats.h |    3 ++-
 net/core/gen_stats.c    |   48 +++++++++++++++++++++++++++++++-----------------
 net/sched/act_api.c     |    2 ++
 3 files changed, 35 insertions(+), 18 deletions(-)

through these ChangeSets:

(05/04/01 1.2181.44.3)
   [NET]: Improve gnet_stats_* dumping logic to be less error prone

   The recent additions to make gnet_stats_* useable for action
   statistics dumping in two steps introduced a few error prone
   assumptions which can easily be forgotten. This patch fixes this
   up by simplifying the process of adding new fields to struct
   gnet_dump or adding additional backward compatibility TLVs.

   Signed-off-by: Thomas Graf
   Signed-off-by: David S. Miller

(05/04/01 1.2181.44.2)
   [NET]: Allow dumping of application specific statistics if no primary TLV is used

   Although this case is hypothetical at the moment, more advanced actions are
   likely to need this in the future.

   Signed-off-by: Thomas Graf
   Signed-off-by: David S.
Miller

(05/04/01 1.2181.44.1)
   [PKT_SCHED]: Properly return when no backward compatibility action statistics are to be dumped

   Fixes a stupid bug introduced in my "Fix action statistics dumping in
   compatibility mode" patch, no clue why it actually worked without this fix.

   Signed-off-by: Thomas Graf
   Signed-off-by: David S. Miller

From tgraf@suug.ch Fri Apr 1 15:27:10 2005
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 15:27:15 -0800 (PST)
Received: from b.mx.projectdream.org (eth0-0.arisu.projectdream.org [194.158.4.191])
	by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31NR9eD008422
	for ; Fri, 1 Apr 2005 15:27:09 -0800
Received: from postel.suug.ch (postel.suug.ch [195.134.158.23])
	(using TLSv1 with cipher EDH-RSA-DES-CBC3-SHA (168/168 bits))
	(No client certificate requested)
	by b.mx.projectdream.org (Postfix) with ESMTP id F1549F;
	Sat, 2 Apr 2005 01:26:46 +0200 (CEST)
Received: by postel.suug.ch (Postfix, from userid 10001)
	id 5D8641C0EA; Sat, 2 Apr 2005 01:27:30 +0200 (CEST)
Date: Sat, 2 Apr 2005 01:27:30 +0200
From: Thomas Graf
To: "David S. Miller"
Cc: netdev@oss.sgi.com
Subject: [PATCH 1/3] [PKT_SCHED]: Properly return when no backward compatibility action statistics are to be dumped
Message-ID: <20050401232730.GK3086@postel.suug.ch>
References: <20050401232654.GJ3086@postel.suug.ch>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20050401232654.GJ3086@postel.suug.ch>
X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com
X-Virus-Status: Clean
X-archive-position: 1225
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: tgraf@suug.ch
Precedence: bulk
X-list: netdev

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2005/04/01 14:05:21+02:00 tgraf@suug.ch
#   [PKT_SCHED]: Properly return when no backward compatibility action statistics are to be dumped
#
#   Fixes a stupid bug introduced in my "Fix action statistics dumping in
#   compatibility mode" patch, no clue why it actually worked without this fix.
#
#   Signed-off-by: Thomas Graf
#   Signed-off-by: David S. Miller
#
# net/sched/act_api.c
#   2005/04/01 14:05:09+02:00 tgraf@suug.ch +2 -0
#   [PKT_SCHED]: Properly return when no backward compatibility action statistics are to be dumped
#
diff -Nru a/net/sched/act_api.c b/net/sched/act_api.c
--- a/net/sched/act_api.c	2005-04-02 01:18:40 +02:00
+++ b/net/sched/act_api.c	2005-04-02 01:18:40 +02:00
@@ -397,6 +397,8 @@
 		if (a->type == TCA_OLD_COMPAT)
 			err = gnet_stats_start_copy_compat(skb, 0,
 				TCA_STATS, TCA_XSTATS, h->stats_lock, &d);
+		else
+			return 0;
 	} else
 		err = gnet_stats_start_copy(skb, TCA_ACT_STATS,
 			h->stats_lock, &d);

From tgraf@suug.ch Fri Apr 1 15:27:42 2005
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 15:27:47 -0800 (PST)
Received: from b.mx.projectdream.org (eth0-0.arisu.projectdream.org [194.158.4.191])
	by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31NRfgu008967
	for ; Fri, 1 Apr 2005 15:27:41 -0800
Received: from postel.suug.ch (postel.suug.ch [195.134.158.23])
	(using TLSv1 with cipher EDH-RSA-DES-CBC3-SHA (168/168 bits))
	(No client certificate requested)
	by b.mx.projectdream.org (Postfix) with ESMTP id A1105F;
	Sat, 2 Apr 2005 01:27:18 +0200 (CEST)
Received: by postel.suug.ch (Postfix, from userid 10001)
	id 100F31C0EB; Sat, 2 Apr 2005 01:28:02 +0200 (CEST)
Date: Sat, 2 Apr 2005 01:28:01 +0200
From: Thomas Graf
To: "David S.
Miller" Cc: netdev@oss.sgi.com Subject: [PATCH 2/3] [NET]: Allow dumping of application specific statistics if no primary TLV is used Message-ID: <20050401232801.GL3086@postel.suug.ch> References: <20050401232654.GJ3086@postel.suug.ch> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050401232654.GJ3086@postel.suug.ch> X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1226 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: tgraf@suug.ch Precedence: bulk X-list: netdev # This is a BitKeeper generated diff -Nru style patch. # # ChangeSet # 2005/04/01 14:24:14+02:00 tgraf@suug.ch # [NET]: Allow dumping of application specific statistics if no primary TLV is used # # Although this case is hypothetical at the moment, more advanced actions are # likely to need this in the future. # # Signed-off-by: Thomas Graf # Signed-off-by: David S. 
Miller # # net/core/gen_stats.c # 2005/04/01 14:23:57+02:00 tgraf@suug.ch +7 -4 # [NET]: Allow dumping of application specific statistics if no primary TLV is used # # include/net/gen_stats.h # 2005/04/01 14:23:57+02:00 tgraf@suug.ch +2 -1 # [NET]: Allow dumping of application specific statistics if no primary TLV is used # diff -Nru a/include/net/gen_stats.h b/include/net/gen_stats.h --- a/include/net/gen_stats.h 2005-04-02 01:18:33 +02:00 +++ b/include/net/gen_stats.h 2005-04-02 01:18:33 +02:00 @@ -15,7 +15,8 @@ /* Backward compatability */ int compat_tc_stats; int compat_xstats; - struct rtattr * xstats; + void * xstats; + int xstats_len; struct tc_stats tc_stats; }; diff -Nru a/net/core/gen_stats.c b/net/core/gen_stats.c --- a/net/core/gen_stats.c 2005-04-02 01:18:33 +02:00 +++ b/net/core/gen_stats.c 2005-04-02 01:18:33 +02:00 @@ -177,8 +177,11 @@ int gnet_stats_copy_app(struct gnet_dump *d, void *st, int len) { - if (d->compat_xstats) - d->xstats = (struct rtattr *) d->skb->tail; + if (d->compat_xstats) { + d->xstats = st; + d->xstats_len = len; + } + return gnet_stats_copy(d, TCA_STATS_APP, st, len); } @@ -206,8 +209,8 @@ return -1; if (d->compat_xstats && d->xstats) { - if (gnet_stats_copy(d, d->compat_xstats, RTA_DATA(d->xstats), - RTA_PAYLOAD(d->xstats)) < 0) + if (gnet_stats_copy(d, d->compat_xstats, d->xstats, + d->xstats_len) < 0) return -1; } From tgraf@suug.ch Fri Apr 1 15:28:16 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 15:28:21 -0800 (PST) Received: from b.mx.projectdream.org (eth0-0.arisu.projectdream.org [194.158.4.191]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31NSFVt009537 for ; Fri, 1 Apr 2005 15:28:15 -0800 Received: from postel.suug.ch (postel.suug.ch [195.134.158.23]) (using TLSv1 with cipher EDH-RSA-DES-CBC3-SHA (168/168 bits)) (No client certificate requested) by b.mx.projectdream.org (Postfix) with ESMTP id ABB7AF; Sat, 2 Apr 2005 01:27:52 +0200 (CEST) Received: by postel.suug.ch (Postfix, from userid 10001) 
id 2556D1C0EA; Sat, 2 Apr 2005 01:28:36 +0200 (CEST) Date: Sat, 2 Apr 2005 01:28:36 +0200 From: Thomas Graf To: "David S. Miller" Cc: netdev@oss.sgi.com Subject: [PATCH 3/3] [NET]: Improve gnet_stats_* dumping logic to be less error prone Message-ID: <20050401232835.GM3086@postel.suug.ch> References: <20050401232654.GJ3086@postel.suug.ch> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050401232654.GJ3086@postel.suug.ch> X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1227 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: tgraf@suug.ch Precedence: bulk X-list: netdev # This is a BitKeeper generated diff -Nru style patch. # # ChangeSet # 2005/04/01 15:01:24+02:00 tgraf@suug.ch # [NET]: Improve gnet_stats_* dumping logic to be less error prone # # The recent additions to make gnet_stats_* useable for action # statistics dumping in two steps introcuded a few error prone # assumptions which can easly be forgotten. This patch fixes this # up by simplifying the process of adding new fields to struct # gnet_dump or adding additional backward compatibility TLVs. # # Signed-off-by: Thomas Graf # Signed-off-by: David S. 
Miller # # net/core/gen_stats.c # 2005/04/01 15:01:12+02:00 tgraf@suug.ch +24 -13 # [NET]: Improve gnet_stats_* dumping logic to be less error prone # diff -Nru a/net/core/gen_stats.c b/net/core/gen_stats.c --- a/net/core/gen_stats.c 2005-04-02 01:18:26 +02:00 +++ b/net/core/gen_stats.c 2005-04-02 01:18:26 +02:00 @@ -26,9 +26,7 @@ static inline int gnet_stats_copy(struct gnet_dump *d, int type, void *buf, int size) { - if (type) - RTA_PUT(d->skb, type, size, buf); - + RTA_PUT(d->skb, type, size, buf); return 0; rtattr_failure: @@ -58,6 +56,8 @@ gnet_stats_start_copy_compat(struct sk_buff *skb, int type, int tc_stats_type, int xstats_type, spinlock_t *lock, struct gnet_dump *d) { + memset(d, 0, sizeof(*d)); + spin_lock_bh(lock); d->lock = lock; if (type) @@ -65,12 +65,11 @@ d->skb = skb; d->compat_tc_stats = tc_stats_type; d->compat_xstats = xstats_type; - d->xstats = NULL; - if (d->compat_tc_stats) - memset(&d->tc_stats, 0, sizeof(d->tc_stats)); + if (d->tail) + return gnet_stats_copy(d, type, NULL, 0); - return gnet_stats_copy(d, type, NULL, 0); + return 0; } /** @@ -111,8 +110,11 @@ d->tc_stats.bytes = b->bytes; d->tc_stats.packets = b->packets; } - - return gnet_stats_copy(d, TCA_STATS_BASIC, b, sizeof(*b)); + + if (d->tail) + return gnet_stats_copy(d, TCA_STATS_BASIC, b, sizeof(*b)); + + return 0; } /** @@ -134,7 +136,10 @@ d->tc_stats.pps = r->pps; } - return gnet_stats_copy(d, TCA_STATS_RATE_EST, r, sizeof(*r)); + if (d->tail) + return gnet_stats_copy(d, TCA_STATS_RATE_EST, r, sizeof(*r)); + + return 0; } /** @@ -157,8 +162,11 @@ d->tc_stats.backlog = q->backlog; d->tc_stats.overlimits = q->overlimits; } - - return gnet_stats_copy(d, TCA_STATS_QUEUE, q, sizeof(*q)); + + if (d->tail) + return gnet_stats_copy(d, TCA_STATS_QUEUE, q, sizeof(*q)); + + return 0; } /** @@ -182,7 +190,10 @@ d->xstats_len = len; } - return gnet_stats_copy(d, TCA_STATS_APP, st, len); + if (d->tail) + return gnet_stats_copy(d, TCA_STATS_APP, st, len); + + return 0; } /** From 
shemminger@osdl.org Fri Apr 1 15:44:08 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 15:44:22 -0800 (PST) Received: from smtp.osdl.org (fire.osdl.org [65.172.181.4]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31Ni4j6010722 for ; Fri, 1 Apr 2005 15:44:08 -0800 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp.osdl.org (8.12.8/8.12.8) with ESMTP id j31Nhns4014569 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Fri, 1 Apr 2005 15:43:49 -0800 Received: from dxpl.pdx.osdl.net (dxpl.pdx.osdl.net [172.20.1.103]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with ESMTP id j31Nhm42010798; Fri, 1 Apr 2005 15:43:48 -0800 Date: Fri, 1 Apr 2005 15:43:48 -0800 From: Stephen Hemminger To: jaganav@us.ibm.com Cc: Roland Dreier , Benjamin LaHaise , Dmitry Yusupov , open-iscsi@googlegroups.com, "David S. Miller" , mpm@selenic.com, andrea@suse.de, michaelc@cs.wisc.edu, James.Bottomley@HansenPartnership.com, ksummit-2005-discuss@thunk.org, netdev@oss.sgi.com, bmt@zurich.ibm.com Subject: Re: Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics) Message-ID: <20050401154348.553f3c46@dxpl.pdx.osdl.net> In-Reply-To: <1112321619.424cae539e75e@imap.linux.ibm.com> References: <1112321619.424cae539e75e@imap.linux.ibm.com> Organization: Open Source Development Lab X-Mailer: Sylpheed-Claws 1.0.4 (GTK+ 1.2.10; x86_64-unknown-linux-gnu) X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[ Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.106 $ X-Scanned-By: MIMEDefang 2.36 X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1228 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: 
shemminger@osdl.org
Precedence: bulk
X-list: netdev

On Thu, 31 Mar 2005 21:13:39 -0500
jaganav@us.ibm.com wrote:

> Quoting Roland Dreier :
> > I have to admit I don't know much about the TOE / RDMA/TCP / RNIC (or
> > whatever you want to call it) world. However I know that the large
> > majority of InfiniBand use right now is running on Linux, and I hope
> > the Linux community is willing to work with the IB community.
> >
>
> Just want to let everyone know that we have started an opensource
> effort (www.openrdma.org) for enablement of RNICs (RDMA enabled NICs). This
> community has now come up with an architecture
> (http://rdma.sourceforge.net/architecture.pdf) to build this support in Linux.
> Would really appreciate if you review and provide any comments. We have just
> started to hack but no code is available on this project yet.
>
> Thanks
> Venkat

OpenRdma is a misnomer, because as I read your architecture you are
trying to create a "kernel abstraction layer" for closed source vendor
RDMA drivers. This will never be accepted, please go back to the
drawing board and figure out how to make real open source drivers.

From davem@davemloft.net Fri Apr 1 15:50:51 2005
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 15:50:56 -0800 (PST)
Received: from cheetah.davemloft.net (mail@dsl027-180-174.sfo1.dsl.speakeasy.net [216.27.180.174])
	by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31NooXX011406
	for ; Fri, 1 Apr 2005 15:50:51 -0800
Received: from localhost ([127.0.0.1] helo=cheetah.davemloft.net ident=davem)
	by cheetah.davemloft.net with smtp (Exim 3.36 #1 (Debian))
	id 1DHVtn-0003gY-00; Fri, 01 Apr 2005 15:49:59 -0800
Date: Fri, 1 Apr 2005 15:49:59 -0800
From: "David S.
Miller"
To: Thomas Graf
Cc: netdev@oss.sgi.com
Subject: Re: [PATCHSET] action statistics dumping fix & gnet_stats improvements
Message-Id: <20050401154959.1eef4880.davem@davemloft.net>
In-Reply-To: <20050401232654.GJ3086@postel.suug.ch>
References: <20050401232654.GJ3086@postel.suug.ch>
X-Mailer: Sylpheed version 1.0.4 (GTK+ 1.2.10; sparc-unknown-linux-gnu)
X-Face: "_;p5u5aPsO,_Vsx"^v-pEq09'CU4&Dc1$fQExov$62l60cgCc%FnIwD=.UF^a>?5'9Kn[;433QFVV9M..2eN.@4ZWPGbdi<=?[:T>y?SD(R*-3It"Vj:)"dP
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com
X-Virus-Status: Clean
X-archive-position: 1229
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: davem@davemloft.net
Precedence: bulk
X-list: netdev

On Sat, 2 Apr 2005 01:26:54 +0200
Thomas Graf wrote:

> Fixes a stupid bug I introduced in the last patchset which for some
> reason didn't get caught in the testing process. The other two
> patches change the behaviour of yet unused but likely use cases
> to what one would expect without reading the code.
>
> Please do a
>
> bk pull bk://kernel.bkbits.net/tgraf/net-2.6-tcf_exts

All looks good. Pulled, thanks Thomas.
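
For readers who want the gist of patches 2/3 and 3/3 without walking through the diffs, the following is a minimal userspace sketch of the behaviour they describe: the dump state is zeroed up front, application statistics are only written as a primary TLV when one was opened, and a pointer plus length is remembered so the backward-compatibility TLV can still be written at the end. It is not the kernel code. Every name prefixed with toy_ or TOY_ is hypothetical, the flat buffer merely stands in for an skb, and emit_primary plays the role that d->tail plays in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an skb: a flat buffer that TLVs are appended to. */
struct toy_buf {
	unsigned char data[256];
	size_t len;
};

/* Hypothetical, trimmed-down model of struct gnet_dump. */
struct toy_dump {
	struct toy_buf *buf;
	int emit_primary;    /* like d->tail: non-zero if a primary TLV was opened */
	int compat_xstats;   /* TLV type for old-style xstats, 0 if unused */
	const void *xstats;  /* remembered application statistics */
	size_t xstats_len;   /* length of the remembered statistics */
};

enum { TOY_STATS_APP = 4 }; /* illustrative TLV type, chosen arbitrarily */

/* Append one type/length/value record; 0 on success, -1 if it does not fit. */
static int toy_put(struct toy_buf *b, int type, const void *val, size_t len)
{
	if (len > 255 || b->len + 2 + len > sizeof(b->data))
		return -1;
	b->data[b->len++] = (unsigned char)type;
	b->data[b->len++] = (unsigned char)len;
	if (len)
		memcpy(b->data + b->len, val, len);
	b->len += len;
	return 0;
}

/* Zero the whole state first so newly added fields never start out stale. */
static void toy_start(struct toy_dump *d, struct toy_buf *b,
		      int primary, int compat_xstats)
{
	assert(d != NULL && b != NULL);
	memset(d, 0, sizeof(*d));
	d->buf = b;
	d->emit_primary = primary;
	d->compat_xstats = compat_xstats;
}

/* Remember the app stats for compat mode; emit them now only in primary mode. */
static int toy_copy_app(struct toy_dump *d, const void *st, size_t len)
{
	if (d->compat_xstats) {
		d->xstats = st;
		d->xstats_len = len;
	}
	if (d->emit_primary)
		return toy_put(d->buf, TOY_STATS_APP, st, len);
	return 0;
}

/* At the end, write the compat TLV from the remembered pointer and length. */
static int toy_finish(struct toy_dump *d)
{
	if (d->compat_xstats && d->xstats)
		return toy_put(d->buf, d->compat_xstats, d->xstats, d->xstats_len);
	return 0;
}
```

In this sketch, a dump started with primary set to 0 and a non-zero compat_xstats writes nothing from toy_copy_app() and exactly one record from toy_finish(), which mirrors the "no primary TLV" case that patch 2/3 is meant to allow.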
From asgeir@chelsio.com Fri Apr 1 15:51:53 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 15:52:00 -0800 (PST) Received: from stargate.chelsio.com (stargate.chelsio.com [64.186.171.138] (may be forged)) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31NprqB011864 for ; Fri, 1 Apr 2005 15:51:53 -0800 Received: from YOGI.asicdesigners.com (yogi.asicdesigners.com [10.192.160.7]) by stargate.chelsio.com (8.12.5/8.12.5) with SMTP id j31NotfZ012683; Fri, 1 Apr 2005 15:50:55 -0800 X-MimeOLE: Produced By Microsoft Exchange V6.0.6487.1 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Subject: RE: Linux support for RDMA Date: Fri, 1 Apr 2005 15:50:55 -0800 Message-ID: <67D69596DDF0C2448DB0F0547D0F947E01781F1A@yogi.asicdesigners.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: Linux support for RDMA Thread-Index: AcU2XRJTLR7JHDJ6RnC21gxkbfiaswAtOgbQ From: "Asgeir Eiriksson" To: , "H. Peter Anvin" Cc: "Roland Dreier" , "Dmitry Yusupov" , , "David S. Miller" , , , , , , , "Benjamin LaHaise" X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id j31NprqB011864 X-archive-position: 1230 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: asgeir@chelsio.com Precedence: bulk X-list: netdev Venkat Your assessment of the IB vs. Ethernet latencies isn't necessarily correct. 
- you already have available low latency 10GE switches (< 1us port-to-port) - you already have available low latency (cut-through processing) 10GE TOE engines The Veritest verified 10GE TOE end-to-end latency is < 10us today (end-to-end being from a Linux user-space-process to a Linux user-space-process through a switch; full report with detail of the setup is available at http://www.chelsio.com/technology/Chelsio10GbE_Fujitsu.pdf) For comparison: the published IB latency numbers are around 5us today and those use a polling receiver, and those don't include a context switch(es) as does the Ethernet number quoted above. 'Asgeir > -----Original Message----- > From: netdev-bounce@oss.sgi.com [mailto:netdev-bounce@oss.sgi.com] On > Behalf Of jaganav@us.ibm.com > Sent: Thursday, March 31, 2005 5:49 PM > To: H. Peter Anvin > Cc: Roland Dreier; Dmitry Yusupov; open-iscsi@googlegroups.com; David S. > Miller; mpm@selenic.com; andrea@suse.de; michaelc@cs.wisc.edu; > James.Bottomley@HansenPartnership.com; ksummit-2005-discuss@thunk.org; > netdev@oss.sgi.com; Benjamin LaHaise > Subject: Re: Linux support for RDMA > > Quoting "H. Peter Anvin" : > > Benjamin LaHaise wrote: > > > > > > I'm curious how the 10Gig ethernet market will pan out. Time and > again > > > the market has shown that ethernet always has the cost advantage in > the > > > end. If something like Intel's I/O Acceleration Technology makes it > > > that much easier for commodity ethernet to achieve similar performance > > > characteristics over ethernet to that of IB and fibre channel, the > cost > > > advantage alone might switch some new customers over. But the > hardware > > > isn't near what IB offers today, making IB an important niche filler. > > > > > > > From what I've seen coming down the pipe, I think 10GE is going to > > eventually win over IB, just like previous generations did over Token > > Ring, FDDI and other niche filler technologies. It doesn't, as you say, > > mean that e.g. 
IB doesn't matter *now*; furthermore, it also matters for > > the purpose of fixing the kind of issues that are going to have to be > > fixed anyway. > > > > -hpa > > > > > > > > No doubt, Ethernet will eventually win .. btw, Hasn't history proven this > over > ATM? More specifically when the industry predicted that ATM will replace > ethernet :) > > However, I'll have to agree with Ben that IB technolgy will fill an > important > niche segment, more specifically so in the low end of High Performance > Computing > (HPC) segment which is in a transition mode currently moving away from > proprietary interconnects to industry standards based IB technology. > Eventhough, > ethernet may eventually may catch up with IB in terms of the bandwidth but > IB > fabrics can offer better latencies. > > Thanks > Venkat From davem@davemloft.net Fri Apr 1 15:55:38 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 15:55:43 -0800 (PST) Received: from cheetah.davemloft.net (mail@dsl027-180-174.sfo1.dsl.speakeasy.net [216.27.180.174]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j31NtbXM012578 for ; Fri, 1 Apr 2005 15:55:38 -0800 Received: from localhost ([127.0.0.1] helo=cheetah.davemloft.net ident=davem) by cheetah.davemloft.net with smtp (Exim 3.36 #1 (Debian)) id 1DHVyM-0003hr-00; Fri, 01 Apr 2005 15:54:42 -0800 Date: Fri, 1 Apr 2005 15:54:42 -0800 From: "David S. 
Miller"
To: Eric Dumazet
Cc: netdev@oss.sgi.com
Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire()
Message-Id: <20050401155442.3bbd6a73.davem@davemloft.net>
In-Reply-To: <424DD78D.7070001@cosmosbay.com>
References: <42370997.6010302@cosmosbay.com>
	<20050315103253.590c8bfc.davem@davemloft.net>
	<42380EC6.60100@cosmosbay.com>
	<20050316140915.0f6b9528.davem@davemloft.net>
	<4239E00C.4080309@cosmosbay.com>
	<20050331221352.13695124.davem@davemloft.net>
	<424D5D34.4030800@cosmosbay.com>
	<20050401122802.7c71afbc.davem@davemloft.net>
	<424DB7A1.8090803@cosmosbay.com>
	<20050401130832.1f972a3b.davem@davemloft.net>
	<424DC08A.3020204@cosmosbay.com>
	<20050401143442.62ed8bb9.davem@davemloft.net>
	<424DD78D.7070001@cosmosbay.com>
X-Mailer: Sylpheed version 1.0.4 (GTK+ 1.2.10; sparc-unknown-linux-gnu)
X-Face: "_;p5u5aPsO,_Vsx"^v-pEq09'CU4&Dc1$fQExov$62l60cgCc%FnIwD=.UF^a>?5'9Kn[;433QFVV9M..2eN.@4ZWPGbdi<=?[:T>y?SD(R*-3It"Vj:)"dP
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com
X-Virus-Status: Clean
X-archive-position: 1231
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: davem@davemloft.net
Precedence: bulk
X-list: netdev

On Sat, 02 Apr 2005 01:21:49 +0200
Eric Dumazet wrote:

> OK this patch includes everything...
>
> - Locking abstraction
> - rt_check_expire() fixes
> - New gc_interval_ms sysctl to be able to have timer gc_interval < 1 second
> - New gc_debug sysctl to let sysadmin tune gc
> - Less memory used by hash table (spinlocks moved to a smaller table)
> - sizing of spinlocks table depends on NR_CPUS
> - hash table allocated using alloc_large_system_hash() function
> - header fix for /proc/net/stat/rt_cache

Looks fine to me. I'd like to see some feedback from folks like
Robert Olsson and co. before applying this, so let's allow the patch
to simmer over the weekend, ok?
:-) From dima@neterion.com Fri Apr 1 16:03:46 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 16:03:51 -0800 (PST) Received: from ns1.s2io.com (ns1.s2io.com [142.46.200.198]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j3203jR6013216 for ; Fri, 1 Apr 2005 16:03:46 -0800 Received: from guinness.s2io.com (sentry.s2io.com [142.46.200.199]) by ns1.s2io.com (8.12.10/8.12.10) with ESMTP id j3202eOC027166; Fri, 1 Apr 2005 19:02:40 -0500 (EST) Received: from beastie ([10.16.16.220]) by guinness.s2io.com (8.12.6/8.12.6) with ESMTP id j3202cDD002273; Fri, 1 Apr 2005 19:02:38 -0500 (EST) Subject: RE: Linux support for RDMA From: Dmitry Yusupov To: Asgeir Eiriksson Cc: jaganav@us.ibm.com, "H. Peter Anvin" , Roland Dreier , open-iscsi@googlegroups.com, "David S. Miller" , mpm@selenic.com, andrea@suse.de, michaelc@cs.wisc.edu, James.Bottomley@HansenPartnership.com, ksummit-2005-discuss@thunk.org, netdev@oss.sgi.com, Benjamin LaHaise In-Reply-To: <67D69596DDF0C2448DB0F0547D0F947E01781F1A@yogi.asicdesigners.com> References: <67D69596DDF0C2448DB0F0547D0F947E01781F1A@yogi.asicdesigners.com> Content-Type: text/plain Organization: Neterion, Inc Date: Fri, 01 Apr 2005 16:02:37 -0800 Message-Id: <1112400157.9559.98.camel@beastie> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-2) Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.34 X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1232 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: dima@neterion.com Precedence: bulk X-list: netdev On Fri, 2005-04-01 at 15:50 -0800, Asgeir Eiriksson wrote: > Venkat > > Your assessment of the IB vs. Ethernet latencies isn't necessarily > correct. 
> - you already have available low latency 10GE switches (< 1us > port-to-port) > - you already have available low latency (cut-through processing) 10GE > TOE engines > > The Veritest verified 10GE TOE end-to-end latency is < 10us today > (end-to-end being from a Linux user-space-process to a Linux > user-space-process through a switch; full report with detail of the > setup is available at > http://www.chelsio.com/technology/Chelsio10GbE_Fujitsu.pdf) > > For comparison: the published IB latency numbers are around 5us today > and those use a polling receiver, and those don't include a context > switch(es) as does the Ethernet number quoted above. yep. I should agree in here. On 10Gbps network latencies numbers are around 5-15us. Even with non-TOE card, I managed to get 13us latency with regular TCP/IP stack. [root@localhost root]# ./nptcp -a -t -l 256 -u 98304 -i 256 -p 5100 -P - h 17.1.1.227 Latency: 0.000013 Now starting main loop 0: 256 bytes 7 times --> 131.37 Mbps in 0.000015 sec 1: 512 bytes 65 times --> 239.75 Mbps in 0.000016 sec Dima > 'Asgeir > > > > -----Original Message----- > > From: netdev-bounce@oss.sgi.com [mailto:netdev-bounce@oss.sgi.com] On > > Behalf Of jaganav@us.ibm.com > > Sent: Thursday, March 31, 2005 5:49 PM > > To: H. Peter Anvin > > Cc: Roland Dreier; Dmitry Yusupov; open-iscsi@googlegroups.com; David > S. > > Miller; mpm@selenic.com; andrea@suse.de; michaelc@cs.wisc.edu; > > James.Bottomley@HansenPartnership.com; ksummit-2005-discuss@thunk.org; > > netdev@oss.sgi.com; Benjamin LaHaise > > Subject: Re: Linux support for RDMA > > > > Quoting "H. Peter Anvin" : > > > Benjamin LaHaise wrote: > > > > > > > > I'm curious how the 10Gig ethernet market will pan out. Time and > > again > > > > the market has shown that ethernet always has the cost advantage > in > > the > > > > end. 
> > > > If something like Intel's I/O Acceleration Technology makes it
> > > > that much easier for commodity ethernet to achieve similar performance
> > > > characteristics over ethernet to that of IB and fibre channel, the cost
> > > > advantage alone might switch some new customers over. But the hardware
> > > > isn't near what IB offers today, making IB an important niche filler.
> > > >
> > >
> > > From what I've seen coming down the pipe, I think 10GE is going to
> > > eventually win over IB, just like previous generations did over Token
> > > Ring, FDDI and other niche filler technologies. It doesn't, as you say,
> > > mean that e.g. IB doesn't matter *now*; furthermore, it also matters for
> > > the purpose of fixing the kind of issues that are going to have to be
> > > fixed anyway.
> > >
> > > -hpa
> > >
> >
> > No doubt, Ethernet will eventually win. BTW, hasn't history proven this over
> > ATM? More specifically, when the industry predicted that ATM would replace
> > Ethernet :)
> >
> > However, I'll have to agree with Ben that IB technology will fill an important
> > niche segment, more specifically in the low end of the High Performance Computing
> > (HPC) segment, which is currently in transition, moving away from
> > proprietary interconnects to industry-standards-based IB technology.
> > Even though Ethernet may eventually catch up with IB in terms of bandwidth,
> > IB fabrics can offer better latencies.
> > > > Thanks > > Venkat > > > > From herbert@gondor.apana.org.au Fri Apr 1 16:51:41 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 16:51:50 -0800 (PST) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j320pdAt018588 for ; Fri, 1 Apr 2005 16:51:40 -0800 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHWqX-0001AE-00; Sat, 02 Apr 2005 10:50:41 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHWpo-0006Mj-00; Sat, 02 Apr 2005 10:49:56 +1000 Date: Sat, 2 Apr 2005 10:49:56 +1000 To: "David S. Miller" Cc: kaber@trash.net, kuznet@ms2.inr.ac.ru, jmorris@redhat.com, yoshfuji@linux-ipv6.org, netdev@oss.sgi.com Subject: [IPSEC]: Kill nested read lock by deleting xfrm_init_tempsel Message-ID: <20050402004956.GA24339@gondor.apana.org.au> References: <20050214221006.GA18415@gondor.apana.org.au> <20050214221200.GA18465@gondor.apana.org.au> <20050214221433.GB18465@gondor.apana.org.au> <20050214221607.GC18465@gondor.apana.org.au> <424864CE.5060802@trash.net> <20050328233917.GB15369@gondor.apana.org.au> <424B40C2.90304@trash.net> <20050331004658.GA26395@gondor.apana.org.au> <20050331212325.5e996432.davem@davemloft.net> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="YiEDa0DAkWCtVeE4" Content-Disposition: inline In-Reply-To: <20050331212325.5e996432.davem@davemloft.net> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1233 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev --YiEDa0DAkWCtVeE4 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Hi Dave: On Thu, Mar 31, 2005 at 09:23:25PM -0800, 
David S. Miller wrote: > On Thu, 31 Mar 2005 10:46:58 +1000 > Herbert Xu wrote: > > > > # This is a BitKeeper generated diff -Nru style patch. > > > # > > > # ChangeSet > > > # 2005/03/30 06:02:45+02:00 kaber@coreworks.de > > > # [IPSEC]: Check SPI in xfrm_state_find() > > > # > > > # Signed-off-by: Patrick McHardy > > > > Looks good. > > > > Signed-off-by: Herbert Xu > > To me too, both patches applied, thanks Patrick. Actually I only signed off on the first patch :) The second patch creates a dead lock since it does a nested read lock. The solution is simply to get rid of xfrm_init_tempsel and call the afinfo version directly. Signed-off-by: Herbert Xu BTW I'd like to start cleaning up the locking in net/xfrm. I don't want these changes to go into 2.6.12. However, I'd like to have them sit in mm for a while so that they get some testing coverage. What's the best way to do this? Could you create a tree slated for 2.6.13? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --YiEDa0DAkWCtVeE4 Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename=p ===== net/xfrm/xfrm_state.c 1.60 vs edited ===== --- 1.60/net/xfrm/xfrm_state.c 2005-04-01 15:19:54 +10:00 +++ edited/net/xfrm/xfrm_state.c 2005-04-02 10:35:06 +10:00 @@ -283,20 +283,6 @@ } EXPORT_SYMBOL(xfrm_state_flush); -static int -xfrm_init_tempsel(struct xfrm_state *x, struct flowi *fl, - struct xfrm_tmpl *tmpl, - xfrm_address_t *daddr, xfrm_address_t *saddr, - unsigned short family) -{ - struct xfrm_state_afinfo *afinfo = xfrm_state_get_afinfo(family); - if (!afinfo) - return -1; - afinfo->init_tempsel(x, fl, tmpl, daddr, saddr); - xfrm_state_put_afinfo(afinfo); - return 0; -} - struct xfrm_state * xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr, struct flowi *fl, struct xfrm_tmpl *tmpl, @@ -370,7 +356,7 @@ } /* Initialize temporary 
selector matching only * to current session. */ - xfrm_init_tempsel(x, fl, tmpl, daddr, saddr, family); + afinfo->init_tempsel(x, fl, tmpl, daddr, saddr); if (km_query(x, tmpl, pol) == 0) { x->km.state = XFRM_STATE_ACQ; --YiEDa0DAkWCtVeE4-- From hadi@cyberus.ca Fri Apr 1 17:04:16 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 17:04:21 -0800 (PST) Received: from mx03.cybersurf.com (mx03.cybersurf.com [209.197.145.106]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j3214FuF019427 for ; Fri, 1 Apr 2005 17:04:16 -0800 Received: from mail.cyberus.ca ([209.197.145.21]) by mx03.cybersurf.com with esmtp (Exim 4.30) id 1DHX3g-0006qh-TC for netdev@oss.sgi.com; Fri, 01 Apr 2005 20:04:16 -0500 Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DHX3a-0006Je-Ca; Fri, 01 Apr 2005 20:04:10 -0500 Subject: Re: PATCH: IPSEC xfrm events From: jamal Reply-To: hadi@cyberus.ca To: Herbert Xu Cc: Patrick McHardy , Masahide NAKAMURA , "David S. Miller" , netdev In-Reply-To: <20050401123554.GA3468@gondor.apana.org.au> References: <1112319441.1089.83.camel@jzny.localdomain> <20050401042106.GA27762@gondor.apana.org.au> <1112353398.1096.116.camel@jzny.localdomain> <20050401114258.GA2932@gondor.apana.org.au> <1112358278.1096.160.camel@jzny.localdomain> <20050401123554.GA3468@gondor.apana.org.au> Content-Type: text/plain Organization: jamalopolous Message-Id: <1112403845.1088.14.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 01 Apr 2005 20:04:05 -0500 Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1234 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Herbert, Staring at the code, obversation: -> PFKEY is going to be interesting to have it actually generate events as a result of some 
app using netlink such as ip x - the reverse is actually easier to deal with. This problem doesn't exist with the current approach i am taking.

The issue is that pfkey echoes back a few things from the original message - important ones being version, pid, seq, and msgtype (as a sample take a look at pfkey_add()). So these need to be remembered... This brings back the original behavior i had netlink doing, which was similar (but inaccurate now that i stare at this). At the time i carried the nlmsg header around in the cb. So we would have to do the same for netlink[1]. The good news is all these fields happen to exist on netlink (except for the version - for which, for netlink created events, we could pass a hardcoded matching PFKEY2). In other words the structure i called km_cb will now have to have these fields i mentioned above.

Thoughts before i start?

cheers,
jamal

[1] I actually would have no problems using a pid/seq etc generated by pfkey on a netlink header and vice versa. It shouldn't be an issue.

From davem@davemloft.net Fri Apr 1 17:21:52 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 17:21:58 -0800 (PST) Received: from cheetah.davemloft.net (mail@dsl027-180-174.sfo1.dsl.speakeasy.net [216.27.180.174]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j321Lp5B024479 for ; Fri, 1 Apr 2005 17:21:52 -0800 Received: from localhost ([127.0.0.1] helo=cheetah.davemloft.net ident=davem) by cheetah.davemloft.net with smtp (Exim 3.36 #1 (Debian)) id 1DHXJ1-0005XH-00; Fri, 01 Apr 2005 17:20:07 -0800 Date: Fri, 1 Apr 2005 17:20:07 -0800 From: "David S.
Miller" To: Herbert Xu Cc: kaber@trash.net, kuznet@ms2.inr.ac.ru, jmorris@redhat.com, yoshfuji@linux-ipv6.org, netdev@oss.sgi.com Subject: Re: [IPSEC]: Kill nested read lock by deleting xfrm_init_tempsel Message-Id: <20050401172007.7296eced.davem@davemloft.net> In-Reply-To: <20050402004956.GA24339@gondor.apana.org.au> References: <20050214221006.GA18415@gondor.apana.org.au> <20050214221200.GA18465@gondor.apana.org.au> <20050214221433.GB18465@gondor.apana.org.au> <20050214221607.GC18465@gondor.apana.org.au> <424864CE.5060802@trash.net> <20050328233917.GB15369@gondor.apana.org.au> <424B40C2.90304@trash.net> <20050331004658.GA26395@gondor.apana.org.au> <20050331212325.5e996432.davem@davemloft.net> <20050402004956.GA24339@gondor.apana.org.au> X-Mailer: Sylpheed version 1.0.4 (GTK+ 1.2.10; sparc-unknown-linux-gnu) X-Face: "_;p5u5aPsO,_Vsx"^v-pEq09'CU4&Dc1$fQExov$62l60cgCc%FnIwD=.UF^a>?5'9Kn[;433QFVV9M..2eN.@4ZWPGbdi<=?[:T>y?SD(R*-3It"Vj:)"dP Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1235 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@davemloft.net Precedence: bulk X-list: netdev On Sat, 2 Apr 2005 10:49:56 +1000 Herbert Xu wrote: > The second patch creates a dead lock since it does a nested read > lock. The solution is simply to get rid of xfrm_init_tempsel > and call the afinfo version directly. 
read locks nest even in the presence of pending writers From hadi@cyberus.ca Fri Apr 1 17:25:55 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 17:26:01 -0800 (PST) Received: from mx04.cybersurf.com (mx04.cybersurf.com [209.197.145.108]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j321Pt8b025072 for ; Fri, 1 Apr 2005 17:25:55 -0800 Received: from mail.cyberus.ca ([209.197.145.21]) by mx04.cybersurf.com with esmtp (Exim 4.30) id 1DHXOc-0007og-Kj for netdev@oss.sgi.com; Fri, 01 Apr 2005 20:25:54 -0500 Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DHXOY-0008VP-6i; Fri, 01 Apr 2005 20:25:50 -0500 Subject: IPSEC: on behavior of acquire From: jamal Reply-To: hadi@cyberus.ca To: Herbert Xu , "David S. Miller" , Masahide NAKAMURA Cc: psec-tools-devel@lists.sourceforge.net, netdev@oss.sgi.com, kaber@trash.net, kuznet@ms2.inr.ac.ru, jmorris@redhat.com Content-Type: text/plain Organization: jamalopolous Message-Id: <1112405144.1096.33.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 01 Apr 2005 20:25:44 -0500 Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1236 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Folks, Theres something wrong in the way acquire works - IMO in both pfkey and netlink. I asked this before but didnt get satisfactory answer. Masahide-san and myself have had private exchanges and we are both unsatisfied with current situation. Theres probably a spec or known good practise documented somewhere ... Let me provide some testcases then theorize. The idea is to simulate a situation where the kernel thinks a km is listening (it could be there but just non-responsive) or just a scenario where the acquire gets lost. 
You need the current events patches to see this.

test1) In one window run setkey -x, then:

ping -c 1 someDST

-1) packet arrives towards outbound
0) Larval state created
1) one acquire sent.
2) timeout.
3) packet dropped. -ESRCH returned.
4) larval state deleted

So question 1) Shouldn't the return code be -ERESTART to ask the app to retry?
question 2) Why is there a hardcoding of 1 try only?

ping -c2 someDST
Same as above (steps -1 to 4) repeated twice, once for each packet sent.

ping -c3 DST
Same as above repeated 3 times.

test2) With ip x m (but not setkey).

ping -c 1 DST

-1) packet arrives
0) Larval state created
Loop:
1) one acquire sent.
2) timeout. go to loop.

So the loop has no way to break. ping hangs waiting; the only way to break out is by hitting control-c at the prompt. I think ping gets a -ERESTART, which i believe is the correct signal? When you hit control-c the larval state is deleted. Clearly this is not desirable. We want at some point to give up.

Question: Can we have a configurable max retries (sysctl settable) for acquire - or does it already exist and is just not being used? Couldn't find any staring at the code.

ping -c2/3 DST does not change the above behavior. Ping hangs after the first packet - so it doesn't matter.

The conclusion we reached in our discussion is:
a) -ERESTART is the correct signal to return
b) the number of acquire retries should be configurable, preferably as a system-wide value.

Thoughts?
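
To make point b) concrete, here is a rough userspace sketch of a bounded acquire loop. It is only an illustration under stated assumptions, not kernel code: `acquire_with_retries`, `km_answer_fn`, `struct acquire_stats` and the `SKETCH_*` codes are all made-up names, and the "timeout" is modelled by a callback that simply says whether the key manager answered a given attempt.

```c
#include <stddef.h>

/* Hypothetical result codes for this sketch only; the kernel uses its own
 * errno values (-ERESTART, -ESRCH, ...). */
enum { SKETCH_OK = 0, SKETCH_ERESTART = -1 };

/* Hypothetical callback: returns nonzero if the key manager answered the
 * acquire sent on this attempt, zero if the attempt timed out. */
typedef int (*km_answer_fn)(int attempt, void *ctx);

/* Bookkeeping so the caller can see how many acquires went out. */
struct acquire_stats {
    int acquires_sent;
};

/* Send one acquire, then retry up to max_retries more times.
 * Gives up with SKETCH_ERESTART instead of looping forever. */
static int acquire_with_retries(int max_retries, km_answer_fn km_answers,
                                void *ctx, struct acquire_stats *stats)
{
    if (max_retries < 0)
        max_retries = 0;        /* treat nonsense as "no retries" */

    stats->acquires_sent = 0;
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        stats->acquires_sent++;             /* "one acquire sent" */
        if (km_answers(attempt, ctx))
            return SKETCH_OK;               /* SA negotiated, stop */
        /* otherwise: "timeout", go around the loop again */
    }
    return SKETCH_ERESTART;                 /* retries exhausted, give up */
}

/* Simulates a key manager that is listening but never responds. */
static int never_answers(int attempt, void *ctx)
{
    (void)attempt;
    (void)ctx;
    return 0;
}

/* Simulates a key manager that responds on the attempt number in *ctx. */
static int answers_on(int attempt, void *ctx)
{
    return attempt == *(int *)ctx;
}
```

With max_retries set to 0 this behaves like the single-try pfkey case in test1 (apart from the return code), and with a non-responsive key manager it terminates after max_retries + 1 acquires rather than hanging as in test2. Where the limit would live (sysctl or otherwise) is left open here.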
cheers, jamal From herbert@gondor.apana.org.au Fri Apr 1 17:28:47 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 17:28:55 -0800 (PST) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j321Sjxf025671 for ; Fri, 1 Apr 2005 17:28:46 -0800 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHXR8-0001Ml-00; Sat, 02 Apr 2005 11:28:30 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHXQr-0006Qw-00; Sat, 02 Apr 2005 11:28:13 +1000 Date: Sat, 2 Apr 2005 11:28:13 +1000 To: jamal Cc: Patrick McHardy , Masahide NAKAMURA , "David S. Miller" , netdev Subject: Re: PATCH: IPSEC xfrm events Message-ID: <20050402012813.GA24575@gondor.apana.org.au> References: <1112319441.1089.83.camel@jzny.localdomain> <20050401042106.GA27762@gondor.apana.org.au> <1112353398.1096.116.camel@jzny.localdomain> <20050401114258.GA2932@gondor.apana.org.au> <1112358278.1096.160.camel@jzny.localdomain> <20050401123554.GA3468@gondor.apana.org.au> <1112403845.1088.14.camel@jzny.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1112403845.1088.14.camel@jzny.localdomain> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1237 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev Hi Jamal: On Fri, Apr 01, 2005 at 08:04:05PM -0500, jamal wrote: > > The issue is that pfkey echoes back a few things from the original > message - important ones being version, pid, seq, and msgtype (as a > sample take a look at pfkey_add()). So these need to be remembered... You're right. 
The pid and seq should be stored in km_event by af_key and xfrm_user before they call km_notify. In fact bring back that the km_type field too and put it in km_event. That'll become useful when we figure out a way to include it in the netlink message so that the originator can be uniquely identified. The version should always be set by the kernel though. This is because the packet we're broadcasting has been regenerated by the kernel. If we ever get PFKEY v3 then in order that all existing applications understand these messages you'll have to reformat them as PFKEY v2 anyway. msgtype should be derived from the event as you did in xfrm_user. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From jaganav@us.ibm.com Fri Apr 1 17:37:26 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 17:37:33 -0800 (PST) Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.130]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j321bQ3b026512 for ; Fri, 1 Apr 2005 17:37:26 -0800 Received: from westrelay01.boulder.ibm.com (westrelay01.boulder.ibm.com [9.17.195.10]) by e32.co.us.ibm.com (8.12.10/8.12.9) with ESMTP id j321bK5j733728 for ; Fri, 1 Apr 2005 20:37:20 -0500 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by westrelay01.boulder.ibm.com (8.12.10/NCO/VER6.6) with ESMTP id j321bKlD200322 for ; Fri, 1 Apr 2005 18:37:20 -0700 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id j321bJ1U014970 for ; Fri, 1 Apr 2005 18:37:20 -0700 Received: from imap.linux.ibm.com (imap.rtp.raleigh.ibm.com [9.42.107.100]) by d03av01.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id j321bFBZ014935; Fri, 1 Apr 2005 18:37:19 -0700 Received: by imap.linux.ibm.com (Postfix, from userid 48) id 3D36E7C015; Fri, 1 Apr 2005 20:37:14 -0500 (EST) Received: 
from dyn9047018082.beaverton.ibm.com (dyn9047018082.beaverton.ibm.com [9.47.18.82]) by imap.rtp.raleigh.ibm.com (IMP) with HTTP for ; Fri, 1 Apr 2005 20:37:13 -0500 Message-ID: <1112405833.424df749e61b5@imap.linux.ibm.com> Date: Fri, 1 Apr 2005 20:37:13 -0500 From: jaganav@us.ibm.com To: Stephen Hemminger Cc: Roland Dreier , Benjamin LaHaise , Dmitry Yusupov , open-iscsi@googlegroups.com, "David S. Miller" , mpm@selenic.com, andrea@suse.de, michaelc@cs.wisc.edu, James.Bottomley@HansenPartnership.com, ksummit-2005-discuss@thunk.org, netdev@oss.sgi.com, bmt@zurich.ibm.com Subject: Re: Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics) MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit User-Agent: Internet Messaging Program (IMP) 3.2.7 X-Originating-IP: 9.47.18.82 X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1238 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jaganav@us.ibm.com Precedence: bulk X-list: netdev Quoting Stephen Hemminger : > On Thu, 31 Mar 2005 21:13:39 -0500 > jaganav@us.ibm.com wrote: > > > Quoting Roland Dreier : > > > I have to admit I don't know much about the TOE / RDMA/TCP / RNIC (or > > > whatever you want to call it) world. However I know that the large > > > majority of InfiniBand use right now is running on Linux, and I hope > > > the Linux community is willing to work with the IB community. > > > > > > > Just want to let everyone know know that we have started an opensource > > effort (www.openrdma.org) for enablement of RNICs (RDMA enabled NICs). > This > > community has now come up with an architecture > > (http://rdma.sourceforge.net/architecture.pdf) to build this support in > Linux. > > Would really appreciate if you review and provide any comments. 
> > We have just started to hack but no code is available on this project yet.
> >
> > Thanks
> > Venkat
>
> OpenRdma is a misnomer, because as I read your architecture you are trying to
> create a "kernel abstraction layer" for closed source vendor RDMA drivers.
> This will never be accepted, please go back to the drawing board and figure
> out how to make real open source drivers.
>

First let me say that the purpose of this project is to make the entire stack (with all of the enablement layers), including the drivers, open source. The kernel abstraction layer will be built around the standards-based (opengroup.org/icsc) RNIC-PI interface, which allows the RNIC vendors to open-source their drivers using that interface. BTW, the RNIC-PI interface is a work in progress and the first draft is targeted to be published soon. Several RNIC adapter vendors who contribute to the openRDMA effort are quite willing to open-source their drivers through the openRDMA project.

BTW, I understand why you got the impression that this is for closed source vendor drivers. Our intention is not to allow the kernel verbs provider code (kVP) to be private; that was an error. Thanks for pointing this out; we'll make this change soon.
Thanks Venkat From hadi@cyberus.ca Fri Apr 1 17:42:54 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 17:43:00 -0800 (PST) Received: from mx01.cybersurf.com (mx01.cybersurf.com [209.197.145.104]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j321gsmJ027199 for ; Fri, 1 Apr 2005 17:42:54 -0800 Received: from mail.cyberus.ca ([209.197.145.21]) by mx01.cybersurf.com with esmtp (Exim 4.30) id 1DHXex-0007wz-2o for netdev@oss.sgi.com; Fri, 01 Apr 2005 18:42:47 -0700 Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DHXf1-0001oe-Em; Fri, 01 Apr 2005 20:42:51 -0500 Subject: Re: PATCH: IPSEC xfrm events From: jamal Reply-To: hadi@cyberus.ca To: Herbert Xu Cc: Patrick McHardy , Masahide NAKAMURA , "David S. Miller" , netdev In-Reply-To: <20050402012813.GA24575@gondor.apana.org.au> References: <1112319441.1089.83.camel@jzny.localdomain> <20050401042106.GA27762@gondor.apana.org.au> <1112353398.1096.116.camel@jzny.localdomain> <20050401114258.GA2932@gondor.apana.org.au> <1112358278.1096.160.camel@jzny.localdomain> <20050401123554.GA3468@gondor.apana.org.au> <1112403845.1088.14.camel@jzny.localdomain> <20050402012813.GA24575@gondor.apana.org.au> Content-Type: text/plain Organization: jamalopolous Message-Id: <1112406164.1088.54.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 01 Apr 2005 20:42:45 -0500 Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1239 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Herbert, On Fri, 2005-04-01 at 20:28, Herbert Xu wrote: > Hi Jamal: > > On Fri, Apr 01, 2005 at 08:04:05PM -0500, jamal wrote: > > > > The issue is that pfkey echoes back a few things from the original > > message - important ones being version, pid, seq, and msgtype (as 
> > a sample take a look at pfkey_add()). So these need to be remembered...
>
> You're right. The pid and seq should be stored in km_event by
> af_key and xfrm_user before they call km_notify. In fact bring
> back that the km_type field too and put it in km_event.

Do we need km_type? Given that we have the event, seq and pid (regardless of where it was generated), we have sufficient info to create either a netlink or a pfkey message.

> That'll
> become useful when we figure out a way to include it in the netlink
> message so that the originator can be uniquely identified.
>

The pid seems pretty accurate to describe what process generated the initial message.
hold on: Ah, I think i may get what you are trying to get to: You want iproute to display something along the lines of "this was created by a pfkey app pid 1534". Did i read you correctly?

> The version should always be set by the kernel though. This is because
> the packet we're broadcasting has been regenerated by the kernel. If
> we ever get PFKEY v3 then in order that all existing applications
> understand these messages you'll have to reformat them as PFKEY v2
> anyway.
>

So always go v2?

> msgtype should be derived from the event as you did in xfrm_user.
>

indeed.

cheers,
jamal

From herbert@gondor.apana.org.au Fri Apr 1 17:46:38 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 17:46:47 -0800 (PST) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j321kbQW027801 for ; Fri, 1 Apr 2005 17:46:38 -0800 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHXiE-0001SM-00; Sat, 02 Apr 2005 11:46:10 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHXhn-0006TH-00; Sat, 02 Apr 2005 11:45:43 +1000 Date: Sat, 2 Apr 2005 11:45:43 +1000 To: jamal Cc: Patrick McHardy , Masahide NAKAMURA , "David S.
Miller" , netdev Subject: Re: PATCH: IPSEC xfrm events Message-ID: <20050402014543.GA24861@gondor.apana.org.au> References: <1112319441.1089.83.camel@jzny.localdomain> <20050401042106.GA27762@gondor.apana.org.au> <1112353398.1096.116.camel@jzny.localdomain> <20050401114258.GA2932@gondor.apana.org.au> <1112358278.1096.160.camel@jzny.localdomain> <20050401123554.GA3468@gondor.apana.org.au> <1112403845.1088.14.camel@jzny.localdomain> <20050402012813.GA24575@gondor.apana.org.au> <1112406164.1088.54.camel@jzny.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1112406164.1088.54.camel@jzny.localdomain> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1240 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev On Fri, Apr 01, 2005 at 08:42:45PM -0500, jamal wrote: > > hold on: Ah, I think i may get what you are trying to get to: You want > iproute to display something along the lines of "this was created by a > pfkey app pid 1534". Did i read you correctly? That's right. 
Someone with a pathological mind might do pfkey and netlink from the same pid :) -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From herbert@gondor.apana.org.au Fri Apr 1 17:46:53 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 17:46:58 -0800 (PST) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j321kqw2027873 for ; Fri, 1 Apr 2005 17:46:52 -0800 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHXiY-0001TB-00; Sat, 02 Apr 2005 11:46:30 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHXiN-0006Tc-00; Sat, 02 Apr 2005 11:46:19 +1000 Date: Sat, 2 Apr 2005 11:46:19 +1000 To: jamal Cc: Patrick McHardy , Masahide NAKAMURA , "David S. Miller" , netdev Subject: Re: PATCH: IPSEC xfrm events Message-ID: <20050402014619.GB24861@gondor.apana.org.au> References: <1112319441.1089.83.camel@jzny.localdomain> <20050401042106.GA27762@gondor.apana.org.au> <1112353398.1096.116.camel@jzny.localdomain> <20050401114258.GA2932@gondor.apana.org.au> <1112358278.1096.160.camel@jzny.localdomain> <20050401123554.GA3468@gondor.apana.org.au> <1112403845.1088.14.camel@jzny.localdomain> <20050402012813.GA24575@gondor.apana.org.au> <1112406164.1088.54.camel@jzny.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1112406164.1088.54.camel@jzny.localdomain> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1241 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk 
X-list: netdev On Fri, Apr 01, 2005 at 08:42:45PM -0500, jamal wrote: > > So always go v2? Yes since that's the only version that the kernel knows how to generate. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From jaganav@us.ibm.com Fri Apr 1 18:00:15 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 18:00:21 -0800 (PST) Received: from e33.co.us.ibm.com (e33.co.us.ibm.com [32.97.110.131]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32208Nk029126 for ; Fri, 1 Apr 2005 18:00:14 -0800 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e33.co.us.ibm.com (8.12.10/8.12.9) with ESMTP id j322014I563788 for ; Fri, 1 Apr 2005 21:00:01 -0500 Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay04.boulder.ibm.com (8.12.10/NCO/VER6.6) with ESMTP id j32200dg184050 for ; Fri, 1 Apr 2005 19:00:00 -0700 Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id j321xx6X031851 for ; Fri, 1 Apr 2005 19:00:00 -0700 Received: from imap.linux.ibm.com (imap.rtp.raleigh.ibm.com [9.42.107.100]) by d03av04.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id j321xwQY031721; Fri, 1 Apr 2005 18:59:59 -0700 Received: by imap.linux.ibm.com (Postfix, from userid 48) id 34C8D7C015; Fri, 1 Apr 2005 20:59:47 -0500 (EST) Received: from dyn9047018082.beaverton.ibm.com (dyn9047018082.beaverton.ibm.com [9.47.18.82]) by imap.rtp.raleigh.ibm.com (IMP) with HTTP for ; Fri, 1 Apr 2005 20:59:46 -0500 Message-ID: <1112407186.424dfc92dc37a@imap.linux.ibm.com> Date: Fri, 1 Apr 2005 20:59:46 -0500 From: jaganav@us.ibm.com To: Dmitry Yusupov Cc: Asgeir Eiriksson , "H. Peter Anvin" , Roland Dreier , open-iscsi@googlegroups.com, "David S. 
Miller" , mpm@selenic.com, andrea@suse.de, michaelc@cs.wisc.edu, James.Bottomley@HansenPartnership.com, ksummit-2005-discuss@thunk.org, netdev@oss.sgi.com, Benjamin LaHaise Subject: RE: Linux support for RDMA MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit User-Agent: Internet Messaging Program (IMP) 3.2.7 X-Originating-IP: 9.47.18.82 X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1242 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jaganav@us.ibm.com Precedence: bulk X-list: netdev Quoting Dmitry Yusupov : > On Fri, 2005-04-01 at 15:50 -0800, Asgeir Eiriksson wrote: > > Venkat > > > > Your assessment of the IB vs. Ethernet latencies isn't necessarily > > correct. > > - you already have available low latency 10GE switches (< 1us > > port-to-port) > > - you already have available low latency (cut-through processing) 10GE > > TOE engines > > > > The Veritest verified 10GE TOE end-to-end latency is < 10us today > > (end-to-end being from a Linux user-space-process to a Linux > > user-space-process through a switch; full report with detail of the > > setup is available at > > http://www.chelsio.com/technology/Chelsio10GbE_Fujitsu.pdf) > > > > For comparison: the published IB latency numbers are around 5us today > > and those use a polling receiver, and those don't include a context > > switch(es) as does the Ethernet number quoted above. > > yep. I should agree in here. On 10Gbps network latencies numbers are > around 5-15us. Even with non-TOE card, I managed to get 13us latency > with regular TCP/IP stack. 
>
> [root@localhost root]# ./nptcp -a -t -l 256 -u 98304 -i 256 -p 5100 -P - h
> 17.1.1.227
> Latency: 0.000013
> Now starting main loop
> 0: 256 bytes 7 times --> 131.37 Mbps in 0.000015 sec
> 1: 512 bytes 65 times --> 239.75 Mbps in 0.000016 sec
>
> Dima

When I mentioned latency, I meant the end-to-end measurement (i.e. from app to app), not just the switching or port-to-port latency. With IB, the best numbers I have seen range from 5 to 7 us, which is far better than what Ethernet achieves today (15 to 35 us) on the network we have. I am not denying that Ethernet is trying to close the gap, but IB has a relative advantage for now. Good to see you got 5us in one case, but what were the switch and adapter latencies in that case?

Thanks
Venkat

From herbert@gondor.apana.org.au Fri Apr 1 18:11:00 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 18:11:07 -0800 (PST) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j322AwI7030178 for ; Fri, 1 Apr 2005 18:10:59 -0800 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHY5c-0001Zm-00; Sat, 02 Apr 2005 12:10:20 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHY55-0006Vk-00; Sat, 02 Apr 2005 12:09:47 +1000 Date: Sat, 2 Apr 2005 12:09:47 +1000 To: "David S.
Miller" Cc: kaber@trash.net, kuznet@ms2.inr.ac.ru, jmorris@redhat.com, yoshfuji@linux-ipv6.org, netdev@oss.sgi.com Subject: Re: [IPSEC]: Kill nested read lock by deleting xfrm_init_tempsel Message-ID: <20050402020947.GA24998@gondor.apana.org.au> References: <20050214221200.GA18465@gondor.apana.org.au> <20050214221433.GB18465@gondor.apana.org.au> <20050214221607.GC18465@gondor.apana.org.au> <424864CE.5060802@trash.net> <20050328233917.GB15369@gondor.apana.org.au> <424B40C2.90304@trash.net> <20050331004658.GA26395@gondor.apana.org.au> <20050331212325.5e996432.davem@davemloft.net> <20050402004956.GA24339@gondor.apana.org.au> <20050401172007.7296eced.davem@davemloft.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050401172007.7296eced.davem@davemloft.net> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1243 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev

On Fri, Apr 01, 2005 at 05:20:07PM -0800, David S. Miller wrote:
> On Sat, 2 Apr 2005 10:49:56 +1000
> Herbert Xu wrote:
>
> > The second patch creates a dead lock since it does a nested read
> > lock. The solution is simply to get rid of xfrm_init_tempsel
> > and call the afinfo version directly.
>
> read locks nest even in the presence of pending writers

Doh! I should've read the code first :) It's still a valid clean-up patch though.

There is another reason why it won't deadlock. We don't actually ever hold the write lock on afinfo :)

Is there any reason why we don't just use xfrm_state_afinfo_lock instead of afinfo->lock?
Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From herbert@gondor.apana.org.au Fri Apr 1 18:14:30 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 18:14:36 -0800 (PST) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j322ES0j030810 for ; Fri, 1 Apr 2005 18:14:29 -0800 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHY8k-0001bB-00; Sat, 02 Apr 2005 12:13:35 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHY7o-0006WQ-00; Sat, 02 Apr 2005 12:12:36 +1000 Date: Sat, 2 Apr 2005 12:12:36 +1000 To: jamal Cc: "David S. Miller" , Masahide NAKAMURA , psec-tools-devel@lists.sourceforge.net, netdev@oss.sgi.com, kaber@trash.net, kuznet@ms2.inr.ac.ru, jmorris@redhat.com Subject: Re: IPSEC: on behavior of acquire Message-ID: <20050402021236.GA25054@gondor.apana.org.au> References: <1112405144.1096.33.camel@jzny.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1112405144.1096.33.camel@jzny.localdomain> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1244 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev On Fri, Apr 01, 2005 at 08:25:44PM -0500, jamal wrote: > > The conclusion we reached in our discussion is: > a) -ERESTART is the correct signal to return > b) number of acquire retries should be configurable preferably a system > wide value. > > Thoughts? 
Once we have the xfrm resolution stuff that Patrick is working on, we can have knobs for these cases just like those in the neighbour code. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From greg@kroah.com Fri Apr 1 21:29:17 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 21:29:29 -0800 (PST) Received: from perch.kroah.org (mail.kroah.org [69.55.234.183]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j325TG8L007523 for ; Fri, 1 Apr 2005 21:29:16 -0800 Received: from [192.168.0.10] (c-24-22-118-199.hsd1.or.comcast.net [24.22.118.199]) (authenticated) by perch.kroah.org (8.11.6/8.11.6) with ESMTP id j325Rsi06304; Fri, 1 Apr 2005 21:27:54 -0800 Received: from greg by echidna.kroah.org with local (masqmail 0.2.19) id 1DHbAY-4ZB-00; Fri, 01 Apr 2005 21:27:38 -0800 Date: Fri, 1 Apr 2005 21:27:38 -0800 From: Greg KH To: jaganav@us.ibm.com Cc: Stephen Hemminger , Roland Dreier , Benjamin LaHaise , Dmitry Yusupov , open-iscsi@googlegroups.com, "David S. 
Miller" , mpm@selenic.com, andrea@suse.de, michaelc@cs.wisc.edu, James.Bottomley@HansenPartnership.com, ksummit-2005-discuss@thunk.org, netdev@oss.sgi.com, bmt@zurich.ibm.com Subject: Re: Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics) Message-ID: <20050402052738.GA17506@kroah.com> References: <20050401154348.553f3c46@dxpl.pdx.osdl.net> <1112405833.424df749e61b5@imap.linux.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1112405833.424df749e61b5@imap.linux.ibm.com> User-Agent: Mutt/1.5.8i X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1245 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greg@kroah.com Precedence: bulk X-list: netdev On Fri, Apr 01, 2005 at 08:37:13PM -0500, jaganav@us.ibm.com wrote: > > Several RNIC adapter vendors, who contribute to the > openRDMA effort, are quite willing to opensource > their drivers through openRDMA project. "Several"? Why not all? And why the dual license? What good is writing Linux kernel code that is BSD licensed for such a core component? Didn't you all learn from the openib licensing mess? 
thanks, greg k-h From greg@kroah.com Fri Apr 1 22:02:41 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 22:02:53 -0800 (PST) Received: from perch.kroah.org (mail.kroah.org [69.55.234.183]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j3262fL9008963 for ; Fri, 1 Apr 2005 22:02:41 -0800 Received: from [192.168.0.10] (c-24-22-118-199.hsd1.or.comcast.net [24.22.118.199]) (authenticated) by perch.kroah.org (8.11.6/8.11.6) with ESMTP id j3262Ri06657; Fri, 1 Apr 2005 22:02:27 -0800 Received: from greg by echidna.kroah.org with local (masqmail 0.2.19) id 1DHbi4-4dJ-00; Fri, 01 Apr 2005 22:02:16 -0800 Date: Fri, 1 Apr 2005 22:02:16 -0800 From: Greg KH To: jaganav@us.ibm.com Cc: Stephen Hemminger , Roland Dreier , Benjamin LaHaise , Dmitry Yusupov , open-iscsi@googlegroups.com, "David S. Miller" , mpm@selenic.com, andrea@suse.de, michaelc@cs.wisc.edu, James.Bottomley@HansenPartnership.com, ksummit-2005-discuss@thunk.org, netdev@oss.sgi.com, bmt@zurich.ibm.com Subject: Re: Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics) Message-ID: <20050402060216.GA17766@kroah.com> References: <20050401154348.553f3c46@dxpl.pdx.osdl.net> <1112405833.424df749e61b5@imap.linux.ibm.com> <20050402052738.GA17506@kroah.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050402052738.GA17506@kroah.com> User-Agent: Mutt/1.5.8i X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1246 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greg@kroah.com Precedence: bulk X-list: netdev On Fri, Apr 01, 2005 at 09:27:38PM -0800, Greg KH wrote: > On Fri, Apr 01, 2005 at 08:37:13PM -0500, jaganav@us.ibm.com wrote: > > > > Several RNIC adapter vendors, who contribute to the > > openRDMA effort, are quite willing to opensource > > their drivers through 
openRDMA project. > > "Several"? Why not all? > > And why the dual license? What good is writing Linux kernel code that > is BSD licensed for such a core component? Didn't you all learn from > the openib licensing mess? Oh, and for those of you who might not know what mess I am talking about: The openib code was set up to be dual GPL and BSD licensed for the express purpose of taking the openib code and placing it into a closed source operating system (not any of the *BSDs). Needless to say, this has prevented me from doing any openib work, and probably the same for a number of other Linux kernel developers. If you all wish to duplicate this stupidity, feel free, but do not expect to get any help from the community... thanks, greg k-h From a.kasparas@gmc.lt Fri Apr 1 23:10:14 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 23:10:19 -0800 (PST) Received: from smtp02.omnitel.sun (smtp02-neptunas.omnitel.net [194.176.45.2]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j327ADSH011054 for ; Fri, 1 Apr 2005 23:10:14 -0800 Received: from smtp04-neptunas.omnitel.net ([194.176.45.42]) by smtp02.omnitel.sun (Sun Java System Messaging Server 6.1 HotFix 0.01 (built Jun 24 2004)) with ESMTP id <0IEB0018L58VLY00@smtp02.omnitel.sun> for netdev@oss.sgi.com; Sat, 02 Apr 2005 10:10:07 +0300 (EEST) Received: from smtp04-neptunas.omnitel.net (localhost [127.0.0.1]) by smtp04-neptunas.omnitel.net (Postfix) with SMTP id 59872398079; Sat, 02 Apr 2005 10:10:05 +0300 (EEST) Received: from [192.168.0.128] (unknown [62.212.195.62]) by smtp04-neptunas.omnitel.net (Postfix) with ESMTP id DB5F9398069; Sat, 02 Apr 2005 10:10:04 +0300 (EEST) Date: Sat, 02 Apr 2005 10:10:05 +0300 From: Aidas Kasparas Subject: Re: IPSEC: on behavior of acquire In-reply-to: <1112405303.1096.37.camel@jzny.localdomain> To: hadi@cyberus.ca Cc: ipsec-tools-devel@lists.sourceforge.net, netdev@oss.sgi.com, nakam@linux-ipv6.org Message-id: <424E454D.4090402@gmc.lt> MIME-version: 1.0 Content-type: 
text/plain; charset=UTF-8; format=flowed Content-transfer-encoding: 7BIT X-Accept-Language: lt, en, ru, fr X-Enigmail-Version: 0.90.0.0 X-Enigmail-Supports: pgp-inline, pgp-mime References: <1112405303.1096.37.camel@jzny.localdomain> User-Agent: Debian Thunderbird 1.0 (X11/20050116) X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1247 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: a.kasparas@gmc.lt Precedence: bulk X-list: netdev

jamal wrote:
> test1)on one window run setkey -x:
>
> ping -c 1 someDST
>
> -1) packet arrives towards outbound
> 0) Larval state created
> 1) one acquire sent.
> 2) timeout.
> 3) packet dropped. -ESRCH returned.
> 4) larval state deleted
>
> So question 1): Shouldnt the return code be -ERESTART to ask
> the app to retry?
> question 2) Why is there a hardcoding of 1 try only?

Re 1 try only. There is little sense in doing more tries. If there is no daemon listening to pfkey messages, then no connection will be made no matter how many retries you do. If the daemon/link/peer is slow and the SA was not established before the timeout expired, then a repeated acquire will simply be ignored (the daemon will find that negotiation is already in progress, so there is no reason to start another one, and it will drop that acquire request). The only situation where repeated acquires may help is when pfkey messages are lost. But pfkey was not designed to survive message losses, so you should not operate your boxes in a mode where lost pfkey messages are the rule rather than the exception. On the other hand, occasional pfkey message losses can be worked around by an application/user retry.

Re error code returned. Error codes returned by pfkey were never perfect. But your experiment is not perfect either. You sent pings with no KE daemon running.
pfkey code found that there is nothing receiving acquire messages => there is no chance that any process will setup required SAs and tried to inform about that (I agree, return code is not very informative, at least until you learn about reasons why it is such). If you would have racoon (or other pfkey based ISAKMP daemon) running, you would get "resource temporarily unavailable" (don't know which error code corresponds to that message), which IMHO is ok (if it is not, please explain). Re netlink behaviour I can not comment as I don't use it for ipsec purposes, but would like to read similar explanation. Reason for that - idea that ipsec-tools one day could support operation via netlink is not ruled out of our minds. Yet, afaik nobody is working on it at the moment. -- Aidas Kasparas IT administrator GM Consult Group, UAB From jaganav@us.ibm.com Fri Apr 1 23:30:21 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Apr 2005 23:30:27 -0800 (PST) Received: from e31.co.us.ibm.com (e31.co.us.ibm.com [32.97.110.129]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j327UFIh012085 for ; Fri, 1 Apr 2005 23:30:21 -0800 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e31.co.us.ibm.com (8.12.10/8.12.9) with ESMTP id j327U8ua333834 for ; Sat, 2 Apr 2005 02:30:08 -0500 Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay04.boulder.ibm.com (8.12.10/NCO/VER6.6) with ESMTP id j327U8dg153986 for ; Sat, 2 Apr 2005 00:30:08 -0700 Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id j327U70n002154 for ; Sat, 2 Apr 2005 00:30:08 -0700 Received: from imap.linux.ibm.com (imap.rtp.raleigh.ibm.com [9.42.107.100]) by d03av04.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id j327U3OI001848; Sat, 2 Apr 2005 00:30:07 -0700 Received: by imap.linux.ibm.com (Postfix, from userid 48) id 05DA67C015; Sat, 2 Apr 2005 02:29:51 -0500 (EST) Received: from 
sig-9-65-29-50.mts.ibm.com (sig-9-65-29-50.mts.ibm.com [9.65.29.50]) by imap.rtp.raleigh.ibm.com (IMP) with HTTP for ; Sat, 2 Apr 2005 02:29:51 -0500 Message-ID: <1112426991.424e49ef57e2b@imap.linux.ibm.com> Date: Sat, 2 Apr 2005 02:29:51 -0500 From: jaganav@us.ibm.com To: Greg KH Cc: Stephen Hemminger , Roland Dreier , Benjamin LaHaise , Dmitry Yusupov , open-iscsi@googlegroups.com, "David S. Miller" , mpm@selenic.com, andrea@suse.de, michaelc@cs.wisc.edu, James.Bottomley@HansenPartnership.com, ksummit-2005-discuss@thunk.org, netdev@oss.sgi.com, bmt@zurich.ibm.com Subject: Re: Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics) MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit User-Agent: Internet Messaging Program (IMP) 3.2.7 X-Originating-IP: 9.65.29.50 X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1248 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jaganav@us.ibm.com Precedence: bulk X-list: netdev Quoting Greg KH : > On Fri, Apr 01, 2005 at 09:27:38PM -0800, Greg KH wrote: > > On Fri, Apr 01, 2005 at 08:37:13PM -0500, jaganav@us.ibm.com wrote: > > > > > > Several RNIC adapter vendors, who contribute to the > > > openRDMA effort, are quite willing to opensource > > > their drivers through openRDMA project. > > > > "Several"? Why not all? Because I haven't heard from 'all' of them yet that they would opensource. I am sure every vendor will do when the most of the other vendors are opensourcing it but I can't speak for them. I have asked in the past and will continue to ask every vendor to opensource their driver and make it part of openRDMA stack. > > > > And why the dual license? What good is writing Linux kernel code that > > is BSD licensed for such a core component? Didn't you all learn from > > the openib licensing mess? 
> > Oh, and for those of you who might not know what mess I am talking > about: > > The openib code was set up to be dual GPL and BSD licensed for the > express purpose of taking the openib code and placing it into a closed > source operating system (not any of the *BSDs). Needless to say, this > has prevented me from doing any openib work, and probably the same for a > number of other Linux kernel developers. > Absolutely understand the dual-license mess with openIB code. -:) However the intention of dual license with OpenRDMA is not for placing the code in closed source OSes but specifically for BSD* and in fact, the request is specifically made by the most adapter vendors as they wanted to offer the same on BSD platforms as well. BTW, unlike OpenIB initial stack (i.e. Gen1) which was already developed when it got opensourced, the openRDMA code is developed from scratch in true opensource fashion (of course, OpenIB has also followed this approach for their next generation stack though) with no ifdef code for BSD*. If this dual license is a concern to other kernel developers as well from contributing to OpenRDMA, we would seriously consider this and discuss with the adapter vendors. 
Thanks Venkat From herbert@gondor.apana.org.au Sat Apr 2 00:22:25 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 00:22:34 -0800 (PST) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j328MLgN017287 for ; Sat, 2 Apr 2005 00:22:24 -0800 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHdt3-0002t0-00; Sat, 02 Apr 2005 18:21:45 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHdsP-0003Lr-00; Sat, 02 Apr 2005 18:21:05 +1000 From: Herbert Xu To: dada1@cosmosbay.com (Eric Dumazet) Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() Cc: davem@davemloft.net, netdev@oss.sgi.com, Robert.Olsson@data.slu.se Organization: Core In-Reply-To: <424DD78D.7070001@cosmosbay.com> X-Newsgroups: apana.lists.os.linux.netdev User-Agent: tin/1.7.4-20040225 ("Benbecula") (UNIX) (Linux/2.4.27-hx-1-686-smp (i686)) Message-Id: Date: Sat, 02 Apr 2005 18:21:05 +1000 X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1249 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev Eric Dumazet wrote: > > OK this patch includes everything... > > - Locking abstraction > - rt_check_expire() fixes > - New gc_interval_ms sysctl to be able to have timer gc_interval < 1 second > - New gc_debug sysctl to let sysadmin tune gc > - Less memory used by hash table (spinlocks moved to a smaller table) > - sizing of spinlocks table depends on NR_CPUS > - hash table allocated using alloc_large_system_hash() function > - header fix for /proc/net/stat/rt_cache This patch is doing too many things. How about splitting it up? 
For instance the spin lock stuff is pretty straightforward and should be in its own patch. The benefits of the GC changes are not obvious to me. rt_check_expire is simply meant to kill off old entries. It's not really meant to be used to free up entries when the table gets full. rt_garbage_collect on the other hand is designed to free entries when it is needed. Eric raised the point that rt_garbage_collect is pretty expensive. So what about amortising its cost a bit more? For instance, we can set a new threshold that's lower than gc_thresh and perform GC on the chain being inserted in rt_intern_hash if we're above that threshold. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From mroos@tartu.cyber.ee Sat Apr 2 00:41:18 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 00:41:22 -0800 (PST) Received: from tartu.cyber.ee (tartu.cyber.ee [193.40.6.68]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j328fHft018280 for ; Sat, 2 Apr 2005 00:41:18 -0800 Received: Message by Barricade tartu.cyber.ee with ESMTP id j328LgA06688; Sat, 2 Apr 2005 11:21:42 +0300 Received: from rhn.tartu-labor (rhn.tartu-labor [192.168.74.17]) by ondatra.tartu-labor (Postfix) with ESMTP id 65A2314C48; Sat, 2 Apr 2005 10:41:11 +0200 (EET) Received: from mroos by rhn.tartu-labor with local (Exim 4.50) id 1DHeBr-0002mb-2L; Sat, 02 Apr 2005 11:41:11 +0300 From: Meelis Roos To: hadi@cyberus.ca, netdev@oss.sgi.com Subject: Re: RFC: Redirect-Device In-Reply-To: <1112303627.1073.71.camel@jzny.localdomain> User-Agent: tin/1.7.8-20050315 ("Scalpay") (UNIX) (Linux/2.6.12-rc1 (i686)) Message-Id: Date: Sat, 02 Apr 2005 11:41:11 +0300 X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1250 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com 
X-original-sender: mroos@linux.ee Precedence: bulk X-list: netdev j> I must be missing something: What is it that this device can do that the j> mirred action cant do? I know what I am missing here: documentation. There is very basic documentation about tc qdisc+class+filter level and almost nothing on the newer features. Without good documentation only some developers understand it. -- Meelis Roos From dada1@cosmosbay.com Sat Apr 2 01:23:17 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 01:23:23 -0800 (PST) Received: from gw1.cosmosbay.com (gw1.cosmosbay.com [62.23.185.226]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j329NGlo020425 for ; Sat, 2 Apr 2005 01:23:16 -0800 Received: from [192.168.0.3] ([84.5.129.64]) by gw1.cosmosbay.com (8.13.3/8.13.3) with ESMTP id j329LV9q008040; Sat, 2 Apr 2005 11:21:36 +0200 Message-ID: <424E641A.1020609@cosmosbay.com> Date: Sat, 02 Apr 2005 11:21:30 +0200 From: Eric Dumazet User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206) X-Accept-Language: fr, en MIME-Version: 1.0 To: Herbert Xu CC: davem@davemloft.net, netdev@oss.sgi.com, Robert.Olsson@data.slu.se Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-1.6 (gw1.cosmosbay.com [62.23.185.226]); Sat, 02 Apr 2005 11:21:37 +0200 (CEST) X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1251 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: dada1@cosmosbay.com Precedence: bulk X-list: netdev Herbert Xu a écrit : > Eric Dumazet wrote: > >>OK this patch includes everything... 
>>
>> - Locking abstraction
>> - rt_check_expire() fixes
>> - New gc_interval_ms sysctl to be able to have timer gc_interval < 1 second
>> - New gc_debug sysctl to let sysadmin tune gc
>> - Less memory used by hash table (spinlocks moved to a smaller table)
>> - sizing of spinlocks table depends on NR_CPUS
>> - hash table allocated using alloc_large_system_hash() function
>> - header fix for /proc/net/stat/rt_cache
>
>
> This patch is doing too many things. How about splitting it up?
>
> For instance the spin lock stuff is pretty straightforward and
> should be in its own patch.
>
> The benefits of the GC changes are not obvious to me. rt_check_expire
> is simply meant to kill off old entries. It's not really meant to be
> used to free up entries when the table gets full.

Well, I began my work because of the overflow bug in rt_check_expire()...
Then I realized this function could not work as expected. On a loaded machine, one timer tick is 1 ms. During that time, the number of chains scanned is ridiculously small. With the standard 60-second timer, rt_check_expire() is in fact useless.

>
> rt_garbage_collect on the other hand is designed to free entries
> when it is needed. Eric raised the point that rt_garbage_collect
> is pretty expensive. So what about amortising its cost a bit more?

Yes. rt_garbage_collect() has serious problems. But this function is so complex that I don't want to touch it; I'd rather let the experts do it if they want.

But then one may wonder why we have two similar functions that do basically the same thing: garbage collection.
The rtstat -i 1 output from one production machine is:

rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|
 entries|  in_hit|in_slow_|in_slow_|in_no_ro|  in_brd|in_marti|in_marti| out_hit|out_slow|out_slow|gc_total|gc_ignor|gc_goal_|gc_dst_o|in_hlist|out_hlis|
        |        |     tot|      mc|     ute|        |  an_dst|  an_src|        |    _tot|     _mc|        |      ed|    miss| verflow| _search|t_search|
 2618087|   28581|    7673|       0|       0|       0|       0|       0|    1800|    1450|       0|       0|       0|       0|       0|   37630|    4783|
 2618689|   25444|    4918|       0|       0|       0|       0|       0|    2051|    1699|       0|       0|       0|       0|       0|   27741|    5461|
 2619369|   25000|    4567|       0|       0|       0|       0|       0|    1860|    1304|       0|       0|       0|       0|       0|   26606|    4563|
 2618396|   24830|    4633|       0|       0|       0|       0|       0|    1959|    1492|       0|       0|       0|       0|       0|   26643|    4930|

Without serious tuning, this machine could not handle this load, or even half of it.

Crashes usually occur when secret_interval has elapsed: rt_cache_flush(0) is called, and the whole machine begins to die.

>
> For instance, we can set a new threshold that's lower than gc_thresh
> and perform GC on the chain being inserted in rt_intern_hash if we're
> above that threshold.

We could also try to perform GC on L1_CACHE_SIZE/sizeof(struct rt_hash_bucket) chains, not only the 'current chain', to fully use the cache miss.
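A rough sketch of the two ideas above, in plain userspace C rather than kernel code: Herbert's soft threshold below gc_thresh, and scanning every bucket that shares a cache line with the one being inserted into. The names CACHE_LINE_BYTES, struct bucket, bucket_group() and should_gc_on_insert() are made up for illustration, the 64-byte line size is an assumption (in the kernel the line size would presumably come from L1_CACHE_BYTES), and the table is assumed to start on a cache-line boundary. The real route.c structures differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real kernel types and constants differ. */
#define CACHE_LINE_BYTES 64          /* assumed L1 cache line size */

struct rt_entry;                     /* opaque in this sketch */

struct bucket {                      /* stand-in for struct rt_hash_bucket */
    struct rt_entry *chain;
};

/* How many bucket heads share one cache line (at least 1). */
static size_t buckets_per_line(void)
{
    size_t n = CACHE_LINE_BYTES / sizeof(struct bucket);
    return n ? n : 1;
}

/*
 * Given the bucket index an entry hashes to, compute the half-open range
 * [*first, *last) of bucket indices that live in the same cache line,
 * assuming the table itself starts on a cache-line boundary.
 */
static void bucket_group(size_t index, size_t table_size,
                         size_t *first, size_t *last)
{
    size_t n = buckets_per_line();

    assert(index < table_size);
    *first = index - (index % n);
    *last = *first + n;
    if (*last > table_size)          /* clamp the last, partial group */
        *last = table_size;
}

/* Only do insert-time GC once the cache is above a soft threshold
 * (chosen by the caller to be lower than gc_thresh). */
static int should_gc_on_insert(size_t entries, size_t soft_thresh)
{
    return entries > soft_thresh;
}
```

An insert path would call should_gc_on_insert() first and, if it returns true, walk the chains of buckets first..last-1 instead of only the chain being inserted into, since those bucket heads were pulled in by the same cache miss.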
> > Cheers, Thank you From akpm@osdl.org Sat Apr 2 01:56:42 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 01:56:49 -0800 (PST) Received: from smtp.osdl.org (fire.osdl.org [65.172.181.4]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j329ufvY021947 for ; Sat, 2 Apr 2005 01:56:42 -0800 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp.osdl.org (8.12.8/8.12.8) with ESMTP id j329uas4032011 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Sat, 2 Apr 2005 01:56:36 -0800 Received: from bix (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id j329uZIu002985; Sat, 2 Apr 2005 01:56:35 -0800 Date: Sat, 2 Apr 2005 01:56:22 -0800 From: Andrew Morton To: netdev@oss.sgi.com Cc: kernel@wpascanner.com Subject: Fw: [Bugme-new] [Bug 4434] New: Tulip based NIC card causes hard lock up of PC Message-Id: <20050402015622.41dff439.akpm@osdl.org> X-Mailer: Sylpheed version 0.9.7 (GTK+ 1.2.10; i386-redhat-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.106 $ X-Scanned-By: MIMEDefang 2.36 X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1252 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: akpm@osdl.org Precedence: bulk X-list: netdev Begin forwarded message: Date: Sat, 2 Apr 2005 01:49:50 -0800 From: bugme-daemon@osdl.org To: bugme-new@lists.osdl.org Subject: [Bugme-new] [Bug 4434] New: Tulip based NIC card causes hard lock up of PC http://bugme.osdl.org/show_bug.cgi?id=4434 Summary: Tulip based NIC card causes hard lock up of PC Kernel Version: 2.6.11 Status: NEW Severity: high Owner: acme@conectiva.com.br Submitter: kernel@wpascanner.com Distribution: Knoppix V3.8 CeBIT, V3.7 PC-Welt, ANY Knoppix under kernel 2.6.x Hardware Environment: #1 FIC PA-2007 MB 160MB RAM 
BIOS V1.09CD12 #2 ABIT K7R MB 384MB RAM LAN Cards OEM DEC Tulip 21041 DLink DE-530+ LAN Cards Intel 21143 Tulip based Software Environment: de4x5 Problem Description: Hard lock up on setting up LAN/NIC card Steps to reproduce: Can not boot to working enviroment with DHCP enabled (default for Knoppix) or after booting via NODHCP cheat code on command line and using netcardconfig results in the hard lock up. See: http://www.knoppix.net/forum/viewtopic.php?t=17985&highlight= http://www.knoppix.net/forum/viewtopic.php?t=17986&highlight= ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From linux781@gmail.com Sat Apr 2 02:31:41 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 02:31:45 -0800 (PST) Received: from zproxy.gmail.com (zproxy.gmail.com [64.233.162.197]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32AVex2023541 for ; Sat, 2 Apr 2005 02:31:41 -0800 Received: by zproxy.gmail.com with SMTP id 34so92309nzf for ; Sat, 02 Apr 2005 02:31:35 -0800 (PST) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:reply-to:to:subject:mime-version:content-type:content-transfer-encoding; b=m8KmkOw5qebPHfeH+i3pMbrY/IbFik0mgjgGVabMStvpnnuBglcvbkF810DYXX7mhlZyBICiTYXcoExX3TB/uLXNrMC9g5TzzrVn9jW1V0kD9Z8MyHmIMWY30VUCqz/HWq37msrhni/axG7jZg18xPMzdOqKIHcs/U7DZb8EZL0= Received: by 10.36.5.5 with SMTP id 5mr6599nze; Sat, 02 Apr 2005 02:31:35 -0800 (PST) Received: by 10.36.58.7 with HTTP; Sat, 2 Apr 2005 02:31:35 -0800 (PST) Message-ID: <72252ed05040202313a309e77@mail.gmail.com> Date: Sat, 2 Apr 2005 05:31:35 -0500 From: Akshay Kawale Reply-To: Akshay Kawale To: netdev@oss.sgi.com Subject: Problem accessing IP header fields. 
Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1253 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: linux781@gmail.com Precedence: bulk X-list: netdev Hi, I am trying to access the tot_len field in the IP Header using a sk_buff structure inside a Netfilter hook. I do something like: (**skb).nh.iph->tot_len += 64 I have tried other variants of the same statement but none of them work. I want to increment the length by 64 bytes, but it gives me an error saying that I am trying to access an 'incomplete data type'. Can anyone shed some light on this problem? tot_len if of type __u16 (unsigned short int). Thanks. - Akshay From herbert@gondor.apana.org.au Sat Apr 2 03:24:56 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 03:25:06 -0800 (PST) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32BOrml027752 for ; Sat, 2 Apr 2005 03:24:54 -0800 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHgjp-0003nQ-00; Sat, 02 Apr 2005 21:24:25 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHgiW-00063U-00; Sat, 02 Apr 2005 21:23:04 +1000 Date: Sat, 2 Apr 2005 21:23:04 +1000 To: Eric Dumazet Cc: davem@davemloft.net, netdev@oss.sgi.com, Robert.Olsson@data.slu.se, hadi@cyberus.ca Subject: Get rid of rt_check_expire and rt_garbage_collect Message-ID: <20050402112304.GA11321@gondor.apana.org.au> References: <424E641A.1020609@cosmosbay.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <424E641A.1020609@cosmosbay.com> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: 
ClamAV 0.83/799/Fri Apr 1 02:49:13 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1254 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev

On Sat, Apr 02, 2005 at 11:21:30AM +0200, Eric Dumazet wrote:
>
> Well, I began my work because of the overflow bug in rt_check_expire()...
> Then I realize this function could not work as expected. On a loaded
> machine, one timer tick is 1 ms.
> During this time, number of chains that are scanned is ridiculous.
> With the standard timer of 60 second, fact is rt_check_expire() is useless.

I see. What we've got here is a scalability problem with respect
to the number of hash buckets. As the number of buckets increases,
the amount of work the timer GC has to perform increases proportionally.

Since the timer GC parameters are fixed, this will eventually break.

Rather than changing the timer GC so that it runs more often to keep
up with the large routing cache, we should get out of this by reducing
the amount of work we have to do.

Imagine an ideal balanced hash table with 2.6 million entries. That
is, all incoming/outgoing packets belong to flows that are already in
the hash table. Imagine also that there is no PMTU/link failure taking
place so all entries are valid forever.

In this state there is absolutely no need to execute the timer GC.

Let's remove one of those assumptions and allow there to be entries
which need to expire after a set period.

Instead of having the timer GC clean them up, we can move the expire
check to the place where the entries are used. That is, we make
ip_route_input/ip_route_output/ipv4_dst_check check whether the
entry has expired.

On the face of it we're doing more work since every routing cache
hit will need to check the validity of the dst. However, because
it's a single subtraction it is actually pretty cheap.
There is also no additional cache miss compared to doing it in the timer
GC since we have to read the dst anyway.

Let's go one step further and make the routing cache come to life.
Now there are new entries coming in and we need to remove old ones
in order to make room for them.

That task is currently carried out by the timer GC in rt_check_expire
and on demand by rt_garbage_collect. Either way we have to walk the
entire routing cache looking for entries to get rid of.

This is quite expensive when the routing cache is large. However,
there is a better way.

The reason we keep a cap on the routing cache (for a given hash size)
is so that individual chains do not degenerate into long linked lists.

In other words, we don't really care about how many entries there are
in the routing cache. But we do care about how long each hash chain
is.

So instead of walking the entire routing cache to keep the number of
entries down, what we should do is keep each hash chain as short as
possible.

Assuming that the hash function is good, this should achieve the
same end result.

Here is how it can be done: Every time a routing entry is inserted into
a hash chain, we perform GC on that chain unconditionally.

It might seem that we're doing more work again. However, as before
because we're traversing the chain anyway, it is very cheap to perform
the GC operations which mainly involve the checks in rt_may_expire.
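
The two ideas above (checking expiry on the lookup path, and collecting a chain whenever an entry is inserted into it) can be sketched in user-space C. This is only an illustration under assumptions, not code from net/ipv4/route.c: struct entry, struct chain, entry_expired, chain_lookup and chain_insert are hypothetical names, and a single expiry timestamp stands in for the checks in rt_may_expire, which are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy cache entry; NOT the kernel's struct rtable. */
struct entry {
    struct entry *next;
    unsigned key;
    unsigned long expires;  /* absolute time after which the entry is stale */
};

/* One hash bucket: a singly linked chain plus its current length. */
struct chain {
    struct entry *head;
    size_t len;
};

/* Wrap-safe "now is later than expires": one subtraction and a sign test. */
static int entry_expired(const struct entry *e, unsigned long now)
{
    return (long)(now - e->expires) > 0;
}

/* Lookup with the expiry check folded into the hit path. */
static struct entry *chain_lookup(struct chain *c, unsigned key,
                                  unsigned long now)
{
    for (struct entry *e = c->head; e != NULL; e = e->next)
        if (e->key == key)
            return entry_expired(e, now) ? NULL : e;
    return NULL;
}

/*
 * Insert a new entry and garbage-collect the chain in the same pass:
 * expired entries and any older entry with the same key are unlinked
 * while the chain is being walked anyway. Returns 0 on success and
 * -1 if memory allocation fails.
 */
static int chain_insert(struct chain *c, unsigned key,
                        unsigned long now, unsigned long ttl)
{
    assert(c != NULL);

    struct entry **pp = &c->head;
    while (*pp != NULL) {
        struct entry *e = *pp;
        if (entry_expired(e, now) || e->key == key) {
            *pp = e->next;      /* unlink and free the stale entry */
            free(e);
            c->len--;
        } else {
            pp = &e->next;
        }
    }

    struct entry *n = malloc(sizeof(*n));
    if (n == NULL)
        return -1;
    n->key = key;
    n->expires = now + ttl;
    n->next = c->head;
    c->head = n;
    c->len++;
    return 0;
}
```

In this sketch the length of a chain is bounded by the number of live entries that hash to it plus whatever has gone stale since the last insert into that bucket, which is the property the argument above relies on. Buckets that never see another insert keep their stale entries, so memory use is not bounded by it.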
OK that's enough thinking and it's time to write some code to see whether this is all bullshit :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From zilvinas@barclay.balt.net Sat Apr 2 04:26:46 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 04:28:20 -0800 (PST) Received: from barclay.balt.net (root@barclay.balt.net [195.14.162.78]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32CQiid001289 for ; Sat, 2 Apr 2005 04:26:45 -0800 Received: from barclay.balt.net (zilvinas@localhost [127.0.0.1]) by barclay.balt.net (8.13.2/8.13.1/Debian-15) with ESMTP id j32CPsuD007894; Sat, 2 Apr 2005 15:25:54 +0300 Received: (from zilvinas@localhost) by barclay.balt.net (8.13.2/8.13.1/Submit) id j32CPrsa007893; Sat, 2 Apr 2005 15:25:53 +0300 Date: Sat, 2 Apr 2005 15:25:53 +0300 From: Zilvinas Valinskas To: Aidas Kasparas Cc: hadi@cyberus.ca, ipsec-tools-devel@lists.sourceforge.net, netdev@oss.sgi.com, nakam@linux-ipv6.org Subject: Re: [Ipsec-tools-devel] Re: IPSEC: on behavior of acquire Message-ID: <20050402122553.GA7521@gemtek.lt> Reply-To: Zilvinas Valinskas References: <1112405303.1096.37.camel@jzny.localdomain> <424E454D.4090402@gmc.lt> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <424E454D.4090402@gmc.lt> X-Attribution: Zilvinas X-Url: http://www.gemtek.lt/ User-Agent: Mutt/1.5.6+20040907i X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1255 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: zilvinas@gemtek.lt Precedence: bulk X-list: netdev On Sat, Apr 02, 2005 at 10:10:05AM +0300, Aidas Kasparas wrote: > > > jamal wrote: > >test1)on one window run setkey -x: > > > >ping -c 1 someDST > > > >-1) packet arrives towards outbound 
> >0) Larval state created
> >1) one acquire sent.
> >2) timeout.
> >3) packet dropped. -ESRCH returned.
> >4) larval state deleted
> >
> >So question 1): Shouldn't the return code be -ERESTART to ask
> >the app to retry?
> >question 2) Why is there a hardcoding of 1 try only?
>
> Re 1 try only. There is little sense to do more tries. If there is no
> daemon listening to pfkey messages, then no connection will be made no
> matter how many retries you'll do. If daemon/link/peer is slow and SA
> was not established before timeout expired, then repeated acquire will
> be simply ignored (daemon will find out that negotiation is already in
> progress, there is no reason to start another negotiation and therefore
> will drop that acquire request). And the only situation where repeated
> acquires may help is when pfkey messages are lost. But pfkey was not
> designed to survive message losses, therefore you should not operate your
> boxes in mode when lost pfkey messages are a rule, not an exception. And
> on the other hand, occasional pfkey message losses can be worked around
> by applications/user retry.
>
> Re error code returned. Error codes returned by pfkey never were
> perfect. But your experiment is not perfect too. You sent pings with no
> KE daemon running. pfkey code found that there is nothing receiving
> acquire messages => there is no chance that any process will setup
> required SAs and tried to inform about that (I agree, return code is not
> very informative, at least until you learn about reasons why it is
> such). If you would have racoon (or other pfkey based ISAKMP daemon)
> running, you would get "resource temporarily unavailable" (don't know
> which error code corresponds to that message), which IMHO is ok (if it
> is not, please explain).

EBUSY, I think it is. I am not entirely sure it is OK to return such an error; some applications do not cope nicely with it.
Perhaps ECONNREFUSED is more reasonable, as it doesn't break the assumption old apps make (the connection cannot be established, and it doesn't matter whether that is due to routing, the IPsec SPD or anything else). Although it is quite simple to fix applications to handle EBUSY and retry ... I thought it was annoying that applications quit because of EBUSY when I first tried IPsec. Now I think it is quite handy, especially from scripts: I am sure that if something goes wrong, ping (or another application) won't block ...

>
> Re netlink behaviour I can not comment as I don't use it for ipsec
> purposes, but would like to read similar explanation. Reason for that -
> idea that ipsec-tools one day could support operation via netlink is not
> ruled out of our minds. Yet, afaik nobody is working on it at the moment.
>
>
> --
> Aidas Kasparas
> IT administrator
> GM Consult Group, UAB

From khc@pm.waw.pl Sat Apr 2 05:29:19 2005
Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 05:29:30 -0800 (PST) Received: from khc.piap.pl (khc.piap.pl [195.187.100.11]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32DTEYT004128 for ; Sat, 2 Apr 2005 05:29:16 -0800 Received: by khc.piap.pl (Postfix, from userid 500) id F0E7E1084C; Sat, 2 Apr 2005 15:29:12 +0200 (CEST) To: Jeff Garzik Cc: Subject: [PATCH] Generic HDLC update From: Krzysztof Halasa Date: Sat, 02 Apr 2005 15:29:12 +0200 Message-ID: MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1256 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: khc@pm.waw.pl Precedence: bulk X-list: netdev

--=-=-=

Hi,

The attached patch updates generic HDLC to version 1.18. Lab-tested.
Please apply to Linux 2.6. Thanks.

Changes:
- doc updates
- added Cisco LMI support to Frame-Relay code
- cleaned hdlc_fr.c a bit, removed some orphaned #defines etc.
- fixed a problem with non-functional LMI in FR DCE mode.
- changed diagnostic messages to better conform to FR standards
- all protocols: information about carrier changes (DCD line) is now
  printed to kernel logs.

Signed-Off-By: Krzysztof Halasa -- Krzysztof Halasa --=-=-= Content-Type: text/x-patch Content-Disposition: inline; filename=hdlc-2.6-1.18.patch --- linux-2.6/Documentation/networking/generic-hdlc.txt 25 May 2003 22:13:37 -0000 1.4 +++ linux-2.6/Documentation/networking/generic-hdlc.txt 2 Apr 2005 13:12:18 -0000 @@ -1,21 +1,21 @@ Generic HDLC layer Krzysztof Halasa -January, 2003 Generic HDLC layer currently supports: -- Frame Relay (ANSI, CCITT and no LMI), with ARP support (no InARP). - Normal (routed) and Ethernet-bridged (Ethernet device emulation) - interfaces can share a single PVC. -- raw HDLC - either IP (IPv4) interface or Ethernet device emulation. -- Cisco HDLC, -- PPP (uses syncppp.c), -- X.25 (uses X.25 routines). - -There are hardware drivers for the following cards: -- C101 by Moxa Technologies Co., Ltd. -- RISCom/N2 by SDL Communications Inc. -- and others, some not in the official kernel. +1. Frame Relay (ANSI, CCITT, Cisco and no LMI). + - Normal (routed) and Ethernet-bridged (Ethernet device emulation) + interfaces can share a single PVC. + - ARP support (no InARP support in the kernel - there is an + experimental InARP user-space daemon available on: + http://www.kernel.org/pub/linux/utils/net/hdlc/). +2. raw HDLC - either IP (IPv4) interface or Ethernet device emulation. +3. Cisco HDLC. +4. PPP (uses syncppp.c). +5. X.25 (uses X.25 routines). + +Generic HDLC is a protocol driver only - it needs a low-level driver +for your particular hardware. Ethernet device emulation (using HDLC or Frame-Relay PVC) is compatible with IEEE 802.1Q (VLANs) and 802.1D (Ethernet bridging). @@ -24,7 +24,7 @@ Make sure the hdlc.o and the hardware driver are loaded. It should create a number of "hdlc" (hdlc0 etc) network devices, one for each WAN port. 
You'll need the "sethdlc" utility, get it from: - http://hq.pm.waw.pl/hdlc/ + http://www.kernel.org/pub/linux/utils/net/hdlc/ Compile sethdlc.c utility: gcc -O2 -Wall -o sethdlc sethdlc.c @@ -52,12 +52,12 @@ * v35 | rs232 | x21 | t1 | e1 - sets physical interface for a given port if the card has software-selectable interfaces loopback - activate hardware loopback (for testing only) -* clock ext - external clock (uses DTE RX and TX clock) -* clock int - internal clock (provides clock signal on DCE clock output) -* clock txint - TX internal, RX external (provides TX clock on DCE output) -* clock txfromrx - TX clock derived from RX clock (TX clock on DCE output) -* rate - sets clock rate in bps (not required for external clock or - for txfromrx) +* clock ext - both RX clock and TX clock external +* clock int - both RX clock and TX clock internal +* clock txint - RX clock external, TX clock internal +* clock txfromrx - RX clock external, TX clock derived from RX clock +* rate - sets clock rate in bps (for "int" or "txint" clock only) + Setting protocol: @@ -79,7 +79,7 @@ * x25 - sets X.25 mode * fr - Frame Relay mode - lmi ansi / ccitt / none - LMI (link management) type + lmi ansi / ccitt / cisco / none - LMI (link management) type dce - Frame Relay DCE (network) side LMI instead of default DTE (user). It has nothing to do with clocks! t391 - link integrity verification polling timer (in seconds) - user @@ -119,13 +119,14 @@ -If you have a problem with N2 or C101 card, you can issue the "private" -command to see port's packet descriptor rings (in kernel logs): +If you have a problem with N2, C101 or PLX200SYN card, you can issue the +"private" command to see port's packet descriptor rings (in kernel logs): sethdlc hdlc0 private -The hardware driver has to be build with CONFIG_HDLC_DEBUG_RINGS. +The hardware driver has to be build with #define DEBUG_RINGS. Attaching this info to bug reports would be helpful. Anyway, let me know if you have problems using this. 
-For patches and other info look at http://hq.pm.waw.pl/hdlc/ +For patches and other info look at: +. --- linux-2.6/include/linux/hdlc.h 28 Oct 2004 06:16:08 -0000 1.12 +++ linux-2.6/include/linux/hdlc.h 2 Apr 2005 13:12:18 -0000 @@ -1,7 +1,7 @@ /* * Generic HDLC support routines for Linux * - * Copyright (C) 1999-2003 Krzysztof Halasa + * Copyright (C) 1999-2005 Krzysztof Halasa * * This program is free software; you can redistribute it and/or modify it * under the terms of version 2 of the GNU General Public License @@ -41,6 +41,7 @@ #define LMI_NONE 1 /* No LMI, all PVCs are static */ #define LMI_ANSI 2 /* ANSI Annex D */ #define LMI_CCITT 3 /* ITU-T Annex A */ +#define LMI_CISCO 4 /* The "original" LMI, aka Gang of Four */ #define HDLC_MAX_MTU 1500 /* Ethernet 1500 bytes */ #define HDLC_MAX_MRU (HDLC_MAX_MTU + 10 + 14 + 4) /* for ETH+VLAN over FR */ @@ -89,6 +90,7 @@ unsigned int deleted: 1; unsigned int fecn: 1; unsigned int becn: 1; + unsigned int bandwidth; /* Cisco LMI reporting only */ }state; }pvc_device; --- linux-2.6/drivers/net/wan/hdlc_fr.c 22 Jun 2004 03:25:28 -0000 1.13 +++ linux-2.6/drivers/net/wan/hdlc_fr.c 2 Apr 2005 13:12:18 -0000 @@ -2,7 +2,7 @@ * Generic HDLC support routines for Linux * Frame Relay support * - * Copyright (C) 1999 - 2003 Krzysztof Halasa + * Copyright (C) 1999 - 2005 Krzysztof Halasa * * This program is free software; you can redistribute it and/or modify it * under the terms of version 2 of the GNU General Public License @@ -27,6 +27,10 @@ active = open and "link reliable" exist = new = not used + CCITT LMI: ITU-T Q.933 Annex A + ANSI LMI: ANSI T1.617 Annex D + CISCO LMI: the original, aka "Gang of Four" LMI + */ #include @@ -49,45 +53,41 @@ #undef DEBUG_ECN #undef DEBUG_LINK -#define MAXLEN_LMISTAT 20 /* max size of status enquiry frame */ +#define FR_UI 0x03 +#define FR_PAD 0x00 + +#define NLPID_IP 0xCC +#define NLPID_IPV6 0x8E +#define NLPID_SNAP 0x80 +#define NLPID_PAD 0x00 +#define NLPID_CCITT_ANSI_LMI 0x08 +#define 
NLPID_CISCO_LMI 0x09 + + +#define LMI_CCITT_ANSI_DLCI 0 /* LMI DLCI */ +#define LMI_CISCO_DLCI 1023 + +#define LMI_CALLREF 0x00 /* Call Reference */ +#define LMI_ANSI_LOCKSHIFT 0x95 /* ANSI locking shift */ +#define LMI_ANSI_CISCO_REPTYPE 0x01 /* report type */ +#define LMI_CCITT_REPTYPE 0x51 +#define LMI_ANSI_CISCO_ALIVE 0x03 /* keep alive */ +#define LMI_CCITT_ALIVE 0x53 +#define LMI_ANSI_CISCO_PVCSTAT 0x07 /* PVC status */ +#define LMI_CCITT_PVCSTAT 0x57 + +#define LMI_FULLREP 0x00 /* full report */ +#define LMI_INTEGRITY 0x01 /* link integrity report */ +#define LMI_SINGLE 0x02 /* single PVC report */ -#define PVC_STATE_NEW 0x01 -#define PVC_STATE_ACTIVE 0x02 -#define PVC_STATE_FECN 0x08 /* FECN condition */ -#define PVC_STATE_BECN 0x10 /* BECN condition */ - - -#define FR_UI 0x03 -#define FR_PAD 0x00 - -#define NLPID_IP 0xCC -#define NLPID_IPV6 0x8E -#define NLPID_SNAP 0x80 -#define NLPID_PAD 0x00 -#define NLPID_Q933 0x08 - - -#define LMI_DLCI 0 /* LMI DLCI */ -#define LMI_PROTO 0x08 -#define LMI_CALLREF 0x00 /* Call Reference */ -#define LMI_ANSI_LOCKSHIFT 0x95 /* ANSI lockshift */ -#define LMI_REPTYPE 1 /* report type */ -#define LMI_CCITT_REPTYPE 0x51 -#define LMI_ALIVE 3 /* keep alive */ -#define LMI_CCITT_ALIVE 0x53 -#define LMI_PVCSTAT 7 /* pvc status */ -#define LMI_CCITT_PVCSTAT 0x57 -#define LMI_FULLREP 0 /* full report */ -#define LMI_INTEGRITY 1 /* link integrity report */ -#define LMI_SINGLE 2 /* single pvc report */ #define LMI_STATUS_ENQUIRY 0x75 #define LMI_STATUS 0x7D /* reply */ #define LMI_REPT_LEN 1 /* report type element length */ #define LMI_INTEG_LEN 2 /* link integrity element length */ -#define LMI_LENGTH 13 /* standard LMI frame length */ -#define LMI_ANSI_LENGTH 14 +#define LMI_CCITT_CISCO_LENGTH 13 /* LMI frame lengths */ +#define LMI_ANSI_LENGTH 14 typedef struct { @@ -223,51 +223,34 @@ } -static inline u16 status_to_dlci(u8 *status, int *active, int *new) -{ - *new = (status[2] & 0x08) ? 1 : 0; - *active = (status[2] & 0x02) ? 
1 : 0; - - return ((status[0] & 0x3F) << 4) | ((status[1] & 0x78) >> 3); -} - - -static inline void dlci_to_status(u16 dlci, u8 *status, int active, int new) -{ - status[0] = (dlci >> 4) & 0x3F; - status[1] = ((dlci << 3) & 0x78) | 0x80; - status[2] = 0x80; - - if (new) - status[2] |= 0x08; - else if (active) - status[2] |= 0x02; -} - - - static int fr_hard_header(struct sk_buff **skb_p, u16 dlci) { u16 head_len; struct sk_buff *skb = *skb_p; switch (skb->protocol) { - case __constant_ntohs(ETH_P_IP): + case __constant_ntohs(NLPID_CCITT_ANSI_LMI): head_len = 4; skb_push(skb, head_len); - skb->data[3] = NLPID_IP; + skb->data[3] = NLPID_CCITT_ANSI_LMI; break; - case __constant_ntohs(ETH_P_IPV6): + case __constant_ntohs(NLPID_CISCO_LMI): head_len = 4; skb_push(skb, head_len); - skb->data[3] = NLPID_IPV6; + skb->data[3] = NLPID_CISCO_LMI; break; - case __constant_ntohs(LMI_PROTO): + case __constant_ntohs(ETH_P_IP): head_len = 4; skb_push(skb, head_len); - skb->data[3] = LMI_PROTO; + skb->data[3] = NLPID_IP; + break; + + case __constant_ntohs(ETH_P_IPV6): + head_len = 4; + skb_push(skb, head_len); + skb->data[3] = NLPID_IPV6; break; case __constant_ntohs(ETH_P_802_3): @@ -461,13 +444,14 @@ hdlc_device *hdlc = dev_to_hdlc(dev); struct sk_buff *skb; pvc_device *pvc = hdlc->state.fr.first_pvc; - int len = (hdlc->state.fr.settings.lmi == LMI_ANSI) ? LMI_ANSI_LENGTH - : LMI_LENGTH; - int stat_len = 3; + int lmi = hdlc->state.fr.settings.lmi; + int dce = hdlc->state.fr.settings.dce; + int len = lmi == LMI_ANSI ? LMI_ANSI_LENGTH : LMI_CCITT_CISCO_LENGTH; + int stat_len = (lmi == LMI_CISCO) ? 
6 : 3; u8 *data; int i = 0; - if (hdlc->state.fr.settings.dce && fullrep) { + if (dce && fullrep) { len += hdlc->state.fr.dce_pvc_count * (2 + stat_len); if (len > HDLC_MAX_MRU) { printk(KERN_WARNING "%s: Too many PVCs while sending " @@ -484,29 +468,31 @@ } memset(skb->data, 0, len); skb_reserve(skb, 4); - skb->protocol = __constant_htons(LMI_PROTO); - fr_hard_header(&skb, LMI_DLCI); + if (lmi == LMI_CISCO) { + skb->protocol = __constant_htons(NLPID_CISCO_LMI); + fr_hard_header(&skb, LMI_CISCO_DLCI); + } else { + skb->protocol = __constant_htons(NLPID_CCITT_ANSI_LMI); + fr_hard_header(&skb, LMI_CCITT_ANSI_DLCI); + } data = skb->tail; data[i++] = LMI_CALLREF; - data[i++] = hdlc->state.fr.settings.dce - ? LMI_STATUS : LMI_STATUS_ENQUIRY; - if (hdlc->state.fr.settings.lmi == LMI_ANSI) + data[i++] = dce ? LMI_STATUS : LMI_STATUS_ENQUIRY; + if (lmi == LMI_ANSI) data[i++] = LMI_ANSI_LOCKSHIFT; - data[i++] = (hdlc->state.fr.settings.lmi == LMI_CCITT) - ? LMI_CCITT_REPTYPE : LMI_REPTYPE; + data[i++] = lmi == LMI_CCITT ? LMI_CCITT_REPTYPE : + LMI_ANSI_CISCO_REPTYPE; data[i++] = LMI_REPT_LEN; data[i++] = fullrep ? LMI_FULLREP : LMI_INTEGRITY; - - data[i++] = (hdlc->state.fr.settings.lmi == LMI_CCITT) - ? LMI_CCITT_ALIVE : LMI_ALIVE; + data[i++] = lmi == LMI_CCITT ? LMI_CCITT_ALIVE : LMI_ANSI_CISCO_ALIVE; data[i++] = LMI_INTEG_LEN; data[i++] = hdlc->state.fr.txseq =fr_lmi_nextseq(hdlc->state.fr.txseq); data[i++] = hdlc->state.fr.rxseq; - if (hdlc->state.fr.settings.dce && fullrep) { + if (dce && fullrep) { while (pvc) { - data[i++] = (hdlc->state.fr.settings.lmi == LMI_CCITT) - ? LMI_CCITT_PVCSTAT : LMI_PVCSTAT; + data[i++] = lmi == LMI_CCITT ? 
LMI_CCITT_PVCSTAT : + LMI_ANSI_CISCO_PVCSTAT; data[i++] = stat_len; /* LMI start/restart */ @@ -523,8 +509,20 @@ fr_log_dlci_active(pvc); } - dlci_to_status(pvc->dlci, data + i, - pvc->state.active, pvc->state.new); + if (lmi == LMI_CISCO) { + data[i] = pvc->dlci >> 8; + data[i + 1] = pvc->dlci & 0xFF; + } else { + data[i] = (pvc->dlci >> 4) & 0x3F; + data[i + 1] = ((pvc->dlci << 3) & 0x78) | 0x80; + data[i + 2] = 0x80; + } + + if (pvc->state.new) + data[i + 2] |= 0x08; + else if (pvc->state.active) + data[i + 2] |= 0x02; + i += stat_len; pvc = pvc->next; } @@ -569,6 +567,8 @@ pvc_carrier(0, pvc); pvc->state.exist = pvc->state.active = 0; pvc->state.new = 0; + if (!hdlc->state.fr.settings.dce) + pvc->state.bandwidth = 0; pvc = pvc->next; } } @@ -583,11 +583,12 @@ int i, cnt = 0, reliable; u32 list; - if (hdlc->state.fr.settings.dce) + if (hdlc->state.fr.settings.dce) { reliable = hdlc->state.fr.request && time_before(jiffies, hdlc->state.fr.last_poll + hdlc->state.fr.settings.t392 * HZ); - else { + hdlc->state.fr.request = 0; + } else { hdlc->state.fr.last_errors <<= 1; /* Shift the list */ if (hdlc->state.fr.request) { if (hdlc->state.fr.reliable) @@ -634,65 +635,88 @@ static int fr_lmi_recv(struct net_device *dev, struct sk_buff *skb) { hdlc_device *hdlc = dev_to_hdlc(dev); - int stat_len; pvc_device *pvc; - int reptype = -1, error, no_ram; u8 rxseq, txseq; - int i; + int lmi = hdlc->state.fr.settings.lmi; + int dce = hdlc->state.fr.settings.dce; + int stat_len = (lmi == LMI_CISCO) ? 6 : 3, reptype, error, no_ram, i; - if (skb->len < ((hdlc->state.fr.settings.lmi == LMI_ANSI) - ? LMI_ANSI_LENGTH : LMI_LENGTH)) { + if (skb->len < (lmi == LMI_ANSI ? LMI_ANSI_LENGTH : + LMI_CCITT_CISCO_LENGTH)) { printk(KERN_INFO "%s: Short LMI frame\n", dev->name); return 1; } - if (skb->data[5] != (!hdlc->state.fr.settings.dce ? 
- LMI_STATUS : LMI_STATUS_ENQUIRY)) { - printk(KERN_INFO "%s: LMI msgtype=%x, Not LMI status %s\n", - dev->name, skb->data[2], - hdlc->state.fr.settings.dce ? "enquiry" : "reply"); + if (skb->data[3] != (lmi == LMI_CISCO ? NLPID_CISCO_LMI : + NLPID_CCITT_ANSI_LMI)) { + printk(KERN_INFO "%s: Received non-LMI frame with LMI" + " DLCI\n", dev->name); return 1; } - i = (hdlc->state.fr.settings.lmi == LMI_ANSI) ? 7 : 6; + if (skb->data[4] != LMI_CALLREF) { + printk(KERN_INFO "%s: Invalid LMI Call reference (0x%02X)\n", + dev->name, skb->data[4]); + return 1; + } + + if (skb->data[5] != (dce ? LMI_STATUS_ENQUIRY : LMI_STATUS)) { + printk(KERN_INFO "%s: Invalid LMI Message type (0x%02X)\n", + dev->name, skb->data[5]); + return 1; + } + + if (lmi == LMI_ANSI) { + if (skb->data[6] != LMI_ANSI_LOCKSHIFT) { + printk(KERN_INFO "%s: Not ANSI locking shift in LMI" + " message (0x%02X)\n", dev->name, skb->data[6]); + return 1; + } + i = 7; + } else + i = 6; - if (skb->data[i] != - ((hdlc->state.fr.settings.lmi == LMI_CCITT) - ? LMI_CCITT_REPTYPE : LMI_REPTYPE)) { - printk(KERN_INFO "%s: Not a report type=%x\n", + if (skb->data[i] != (lmi == LMI_CCITT ? LMI_CCITT_REPTYPE : + LMI_ANSI_CISCO_REPTYPE)) { + printk(KERN_INFO "%s: Not an LMI Report type IE (0x%02X)\n", dev->name, skb->data[i]); return 1; } - i++; - i++; /* Skip length field */ + if (skb->data[++i] != LMI_REPT_LEN) { + printk(KERN_INFO "%s: Invalid LMI Report type IE length" + " (%u)\n", dev->name, skb->data[i]); + return 1; + } - reptype = skb->data[i++]; + reptype = skb->data[++i]; + if (reptype != LMI_INTEGRITY && reptype != LMI_FULLREP) { + printk(KERN_INFO "%s: Unsupported LMI Report type (0x%02X)\n", + dev->name, reptype); + return 1; + } - if (skb->data[i]!= - ((hdlc->state.fr.settings.lmi == LMI_CCITT) - ? LMI_CCITT_ALIVE : LMI_ALIVE)) { - printk(KERN_INFO "%s: Unsupported status element=%x\n", - dev->name, skb->data[i]); + if (skb->data[++i] != (lmi == LMI_CCITT ? 
LMI_CCITT_ALIVE : + LMI_ANSI_CISCO_ALIVE)) { + printk(KERN_INFO "%s: Not an LMI Link integrity verification" + " IE (0x%02X)\n", dev->name, skb->data[i]); return 1; } - i++; - i++; /* Skip length field */ + if (skb->data[++i] != LMI_INTEG_LEN) { + printk(KERN_INFO "%s: Invalid LMI Link integrity verification" + " IE length (%u)\n", dev->name, skb->data[i]); + return 1; + } + i++; hdlc->state.fr.rxseq = skb->data[i++]; /* TX sequence from peer */ rxseq = skb->data[i++]; /* Should confirm our sequence */ txseq = hdlc->state.fr.txseq; - if (hdlc->state.fr.settings.dce) { - if (reptype != LMI_FULLREP && reptype != LMI_INTEGRITY) { - printk(KERN_INFO "%s: Unsupported report type=%x\n", - dev->name, reptype); - return 1; - } + if (dce) hdlc->state.fr.last_poll = jiffies; - } error = 0; if (!hdlc->state.fr.reliable) @@ -703,7 +727,7 @@ error = 1; } - if (hdlc->state.fr.settings.dce) { + if (dce) { if (hdlc->state.fr.fullrep_sent && !error) { /* Stop sending full report - the last one has been confirmed by DTE */ hdlc->state.fr.fullrep_sent = 0; @@ -725,6 +749,7 @@ hdlc->state.fr.dce_changed = 0; } + hdlc->state.fr.request = 1; /* got request */ fr_lmi_send(dev, reptype == LMI_FULLREP ? 1 : 0); return 0; } @@ -739,7 +764,6 @@ if (reptype != LMI_FULLREP) return 0; - stat_len = 3; pvc = hdlc->state.fr.first_pvc; while (pvc) { @@ -750,24 +774,35 @@ no_ram = 0; while (skb->len >= i + 2 + stat_len) { u16 dlci; + u32 bw; unsigned int active, new; - if (skb->data[i] != ((hdlc->state.fr.settings.lmi == LMI_CCITT) - ? LMI_CCITT_PVCSTAT : LMI_PVCSTAT)) { - printk(KERN_WARNING "%s: Invalid PVCSTAT ID: %x\n", - dev->name, skb->data[i]); + if (skb->data[i] != (lmi == LMI_CCITT ? 
LMI_CCITT_PVCSTAT : + LMI_ANSI_CISCO_PVCSTAT)) { + printk(KERN_INFO "%s: Not an LMI PVC status IE" + " (0x%02X)\n", dev->name, skb->data[i]); return 1; } - i++; - if (skb->data[i] != stat_len) { - printk(KERN_WARNING "%s: Invalid PVCSTAT length: %x\n", - dev->name, skb->data[i]); + if (skb->data[++i] != stat_len) { + printk(KERN_INFO "%s: Invalid LMI PVC status IE length" + " (%u)\n", dev->name, skb->data[i]); return 1; } i++; - dlci = status_to_dlci(skb->data + i, &active, &new); + new = !! (skb->data[i + 2] & 0x08); + active = !! (skb->data[i + 2] & 0x02); + if (lmi == LMI_CISCO) { + dlci = (skb->data[i] << 8) | skb->data[i + 1]; + bw = (skb->data[i + 3] << 16) | + (skb->data[i + 4] << 8) | + (skb->data[i + 5]); + } else { + dlci = ((skb->data[i] & 0x3F) << 4) | + ((skb->data[i + 1] & 0x78) >> 3); + bw = 0; + } pvc = add_pvc(dev, dlci); @@ -783,9 +818,11 @@ pvc->state.deleted = 0; if (active != pvc->state.active || new != pvc->state.new || + bw != pvc->state.bandwidth || !pvc->state.exist) { pvc->state.new = new; pvc->state.active = active; + pvc->state.bandwidth = bw; pvc_carrier(active, pvc); fr_log_dlci_active(pvc); } @@ -801,6 +838,7 @@ pvc_carrier(0, pvc); pvc->state.active = pvc->state.new = 0; pvc->state.exist = 0; + pvc->state.bandwidth = 0; fr_log_dlci_active(pvc); } pvc = pvc->next; @@ -829,22 +867,15 @@ dlci = q922_to_dlci(skb->data); - if (dlci == LMI_DLCI) { - if (hdlc->state.fr.settings.lmi == LMI_NONE) - goto rx_error; /* LMI packet with no LMI? 
*/ - - if (data[3] == LMI_PROTO) { - if (fr_lmi_recv(ndev, skb)) - goto rx_error; - else { - dev_kfree_skb_any(skb); - return NET_RX_SUCCESS; - } - } - - printk(KERN_INFO "%s: Received non-LMI frame with LMI DLCI\n", - ndev->name); - goto rx_error; + if ((dlci == LMI_CCITT_ANSI_DLCI && + (hdlc->state.fr.settings.lmi == LMI_ANSI || + hdlc->state.fr.settings.lmi == LMI_CCITT)) || + (dlci == LMI_CISCO_DLCI && + hdlc->state.fr.settings.lmi == LMI_CISCO)) { + if (fr_lmi_recv(ndev, skb)) + goto rx_error; + dev_kfree_skb_any(skb); + return NET_RX_SUCCESS; } pvc = find_pvc(hdlc, dlci); @@ -1170,7 +1201,8 @@ if ((new_settings.lmi != LMI_NONE && new_settings.lmi != LMI_ANSI && - new_settings.lmi != LMI_CCITT) || + new_settings.lmi != LMI_CCITT && + new_settings.lmi != LMI_CISCO) || new_settings.t391 < 1 || new_settings.t392 < 2 || new_settings.n391 < 1 || --- linux-2.6/drivers/net/wan/hdlc_generic.c 3 Jun 2004 05:04:21 -0000 1.15 +++ linux-2.6/drivers/net/wan/hdlc_generic.c 2 Apr 2005 13:12:18 -0000 @@ -1,7 +1,7 @@ /* * Generic HDLC support routines for Linux * - * Copyright (C) 1999 - 2003 Krzysztof Halasa + * Copyright (C) 1999 - 2005 Krzysztof Halasa * * This program is free software; you can redistribute it and/or modify it * under the terms of version 2 of the GNU General Public License @@ -38,7 +38,7 @@ #include -static const char* version = "HDLC support module revision 1.17"; +static const char* version = "HDLC support module revision 1.18"; #undef DEBUG_LINK @@ -126,10 +126,13 @@ if (!hdlc->open) goto carrier_exit; - if (hdlc->carrier) + if (hdlc->carrier) { + printk(KERN_INFO "%s: Carrier detected\n", dev->name); __hdlc_set_carrier_on(dev); - else + } else { + printk(KERN_INFO "%s: Carrier lost\n", dev->name); __hdlc_set_carrier_off(dev); + } carrier_exit: spin_unlock_irqrestore(&hdlc->state_lock, flags); @@ -157,8 +160,11 @@ spin_lock_irq(&hdlc->state_lock); - if (hdlc->carrier) + if (hdlc->carrier) { + printk(KERN_INFO "%s: Carrier detected\n", dev->name); 
 __hdlc_set_carrier_on(dev); + } else + printk(KERN_INFO "%s: No carrier\n", dev->name); hdlc->open = 1; --=-=-=--

From Robert.Olsson@data.slu.se Sat Apr 2 05:48:45 2005
Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 05:48:49 -0800 (PST) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32Dmir2005056 for ; Sat, 2 Apr 2005 05:48:44 -0800 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j32DmWJo028236; Sat, 2 Apr 2005 15:48:33 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id AA1B9EE2B1; Sat, 2 Apr 2005 15:48:32 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16974.41648.568927.54429@robur.slu.se> Date: Sat, 2 Apr 2005 15:48:32 +0200 To: Eric Dumazet Cc: Herbert Xu , davem@davemloft.net, netdev@oss.sgi.com, Robert.Olsson@data.slu.se Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() In-Reply-To: <424E641A.1020609@cosmosbay.com> References: <424E641A.1020609@cosmosbay.com> X-Mailer: VM 7.18 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1257 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev

Eric Dumazet writes:

> > This patch is doing too many things. How about splitting it up?
> >
> > For instance the spin lock stuff is pretty straightforward and
> > should be in its own patch.

Yes, a good idea, so it can be tested separately.

> > The benefits of the GC changes are not obvious to me. rt_check_expire
> > is simply meant to kill off old entries. It's not really meant to be
> > used to free up entries when the table gets full.

Agree with Herbert...

> entries|  in_hit|in_slow_|in_slow_|in_no_ro|  in_brd|in_marti|in_marti| out_hit|out_slow|out_slow|gc_total|gc_ignor|gc_goal_|gc_dst_o|in_hlist|out_hlis|
>        |        |     tot|      mc|     ute|        |  an_dst|  an_src|        |    _tot|     _mc|        |      ed|    miss| verflow| _search|t_search|
> 2618087|   28581|    7673|       0|       0|       0|       0|       0|    1800|    1450|       0|       0|       0|       0|       0|

> Without serious tuning, this machine could not handle this load, or even half of it.

Yes, that's quite a load. Very short flows for some reason? What's your ip_rt_gc_min_interval? GC should be allowed to run frequently to smooth out the GC load. It is also a good idea to decrease gc_thresh, and your hash is really huge.

> Crashes usually occurs when secret_interval interval is elapsed : rt_cache_flush(0); is called, and the whole machine begins to die.

It is a good idea to increase secret_interval, but the machine should survive.

--ro

From dada1@cosmosbay.com Sat Apr 2 06:00:19 2005
Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 06:00:24 -0800 (PST) Received: from gw1.cosmosbay.com (gw1.cosmosbay.com [62.23.185.226]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32E0I23005893 for ; Sat, 2 Apr 2005 06:00:19 -0800 Received: from [192.168.0.3] ([84.5.129.64]) by gw1.cosmosbay.com (8.13.3/8.13.3) with ESMTP id j32DwuRD012090; Sat, 2 Apr 2005 15:59:02 +0200 Message-ID: <424EA51F.6000300@cosmosbay.com> Date: Sat, 02 Apr 2005 15:58:55 +0200 From: Eric Dumazet User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206) X-Accept-Language: fr, en MIME-Version: 1.0 To: Herbert Xu CC: davem@davemloft.net, netdev@oss.sgi.com, Robert.Olsson@data.slu.se, hadi@cyberus.ca Subject: Re: Get rid of rt_check_expire and rt_garbage_collect References: <424E641A.1020609@cosmosbay.com> <20050402112304.GA11321@gondor.apana.org.au> In-Reply-To: <20050402112304.GA11321@gondor.apana.org.au> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-1.6 (gw1.cosmosbay.com [62.23.185.226]);
Sat, 02 Apr 2005 15:59:03 +0200 (CEST) X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1258 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: dada1@cosmosbay.com Precedence: bulk X-list: netdev Herbert Xu a écrit : > On Sat, Apr 02, 2005 at 11:21:30AM +0200, Eric Dumazet wrote: > >>Well, I began my work because of the overflow bug in rt_check_expire()... >>Then I realize this function could not work as expected. On a loaded >>machine, one timer tick is 1 ms. >>During this time, number of chains that are scanned is ridiculous. >>With the standard timer of 60 second, fact is rt_check_expire() is useless. > > > I see. What we've got here is a scalability problem with respect > to the number of hash buckets. As the number of buckets increases, > the amount of work the timer GC has to perform inreases proportionally. > > Since the timer GC parameters are fixed, this will eventually break. > > Rather than changing the timer GC so that it runs more often to keep > up with the large routing cache, we should get out of this by reducing > the amount of work we have to do. > > Imagine an ideal balanced hash table with 2.6 million entries. That > is, all incoming/outgoing packets belong to flows that are already in > the hash table. Imagine also that there is no PMTU/link failure taking > place so all entries are valid forever. > > In this state there is absolutely no need to execute the timer GC. > > Let's remove one of those assumptions and allow there to be entries > which need to expire after a set period. > > Instead of having the timer GC clean them up, we can move the expire > check to the place where the entries are used. That is, we make > ip_route_input/ip_route_output/ipv4_dst_check check whether the > entry has expired. 
> > On the face of it we're doing more work since every routing cache > hit will need to check the validity of the dst. However, because > it's a single subtraction it is actually pretty cheap. There is > also no additional cache miss compared to doing it in the timer > GC since we have to read the dst anyway. > > Let's go one step further and make the routing cache come to life. > Now there are new entries coming in and we need to remove old ones > in order to make room for them. > > That task is currently carried out by the timer GC in rt_check_expire > and on demand by rt_garbage_collect. Either way we have to walk the > entire routing cache looking for entries to get rid of. > > This is quite expensive when the routing cache is large. However, > there is a better way. > > The reason we keep a cap on the routing cache (for a given hash size) > is so that individual chains do not degenerate into long linked lists. > > In other words, we don't really care about how many entries there are > in the routing cache. But we do care about how long each hash chain > is. > > So instead of walking the entire routing cache to keep the number of > entries down, what we should do is keep each hash chain as short as > possible. > > Assuming that the hash function is good, this should achieve the > same end result. > > Here is how it can be done: Every time a routing entry is inserted into > a hash chain, we perform GC on that chain unconditionally. > > It might seem that we're doing more work again. However, as before > because we're traversing the chain anyway, it is very cheap to perform > the GC operations which mainly involve the checks in rt_may_expire. > > OK that's enough thinking and it's time to write some code to see > whether this is all bullshit :) > > Cheers, Well, it may work if you dont care about memory used. 
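The scheme quoted above (expiry checked where an entry is used, plus GC of a single chain on every insert) can be sketched in userspace C. This is only an illustration under stated assumptions, not the code in net/ipv4/route.c: `struct cache`, `cache_lookup`, `cache_insert`, and `NBUCKETS` are hypothetical names, time is a plain counter, and wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NBUCKETS 8u            /* hypothetical; a real routing cache is far larger */

struct entry {
    struct entry *next;
    unsigned key;
    unsigned long expires;     /* absolute time at which the entry becomes stale */
};

struct cache {
    struct entry *bucket[NBUCKETS];
    size_t count;
};

static unsigned bucket_of(unsigned key)
{
    return key % NBUCKETS;
}

/* Lookup with the expiry check moved to the point of use:
 * an expired entry is reported as a miss but is not unlinked here. */
static struct entry *cache_lookup(struct cache *c, unsigned key, unsigned long now)
{
    struct entry *e;

    for (e = c->bucket[bucket_of(key)]; e != NULL; e = e->next)
        if (e->key == key)
            return now < e->expires ? e : NULL;
    return NULL;
}

/* Insert with unconditional GC of the one chain being touched. */
static int cache_insert(struct cache *c, unsigned key,
                        unsigned long now, unsigned long ttl)
{
    struct entry **pp = &c->bucket[bucket_of(key)];
    struct entry *e;

    /* The chain is walked anyway, so drop expired entries (and any
     * older entry for the same key) while passing over them. */
    while ((e = *pp) != NULL) {
        if (now >= e->expires || e->key == key) {
            *pp = e->next;
            free(e);
            assert(c->count > 0);
            c->count--;
        } else {
            pp = &e->next;
        }
    }

    e = malloc(sizeof(*e));
    if (e == NULL)
        return -1;
    e->key = key;
    e->expires = now + ttl;
    e->next = c->bucket[bucket_of(key)];
    c->bucket[bucket_of(key)] = e;
    c->count++;
    return 0;
}
```

In this sketch an expired entry stays linked until some later insert lands in the same bucket, so a cache that stops receiving inserts never shrinks. That is the memory concern raised in this reply.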
# grep dst /proc/slabinfo ip_dst_cache 2825575 2849590 384 10 1 : tunables 54 27 8 : slabdata 284959 284959 0 On this machine, route cache takes 1.1 GB of ram... impressive. Then if the network load decrease (or completely stop), only a timer driven gc could purge the cache. So rt_check_expire() is *needed* You are right saying that gc parameters are fixed, thus gc breaks at high load. Eric From kuznet@yakov.inr.ac.ru Sat Apr 2 06:01:51 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 06:01:55 -0800 (PST) Received: from yakov.inr.ac.ru (yakov.inr.ac.ru [194.67.69.111]) by oss.sgi.com (8.13.0/8.13.0) with SMTP id j32E1nGS006071 for ; Sat, 2 Apr 2005 06:01:50 -0800 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=ms2.inr.ac.ru; b=B8c1/dbp/mKlnqQmR1uEiCDuXy7JrqjD3TOaLRO6GavyKIR5pkkfrtTkqwrL2rqtDeKJl2ixtsTFnEwsjwFigj4zaLU4CR6XietT3qfLyqFjhM4vvihNur9oHqnMiVZdxEMSzE2amGZimpXr59CxLROBlINBibpA7S6BSMOpbq8=; Received: (from kuznet@localhost) envelope-from=kuznet by yakov.inr.ac.ru (8.6.13/ANK) id SAA13068; Sat, 2 Apr 2005 18:00:19 +0400 Date: Sat, 2 Apr 2005 18:00:19 +0400 From: Alexey Kuznetsov To: jamal Cc: Herbert Xu , "David S. Miller" , Masahide NAKAMURA , psec-tools-devel@lists.sourceforge.net, netdev@oss.sgi.com, kaber@trash.net, kuznet@ms2.inr.ac.ru, jmorris@redhat.com Subject: Re: IPSEC: on behavior of acquire Message-ID: <20050402140019.GA13017@yakov.inr.ac.ru> References: <1112405144.1096.33.camel@jzny.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1112405144.1096.33.camel@jzny.localdomain> User-Agent: Mutt/1.5.6i X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1259 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kuznet@ms2.inr.ac.ru Precedence: bulk X-list: netdev Hello! 
> a) -ERESTART is the correct signal to return Right behaviour is to behave like ARP. A few of packets are queued, no errors (until timeout), no blocking. Alexey From Robert.Olsson@data.slu.se Sat Apr 2 06:04:26 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 06:04:30 -0800 (PST) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32E4PT9007106 for ; Sat, 2 Apr 2005 06:04:25 -0800 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j32E3gvg029545; Sat, 2 Apr 2005 16:03:43 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id C0F16EE2B2; Sat, 2 Apr 2005 16:03:42 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16974.42558.753736.846391@robur.slu.se> Date: Sat, 2 Apr 2005 16:03:42 +0200 To: Herbert Xu Cc: Eric Dumazet , davem@davemloft.net, netdev@oss.sgi.com, Robert.Olsson@data.slu.se, hadi@cyberus.ca Subject: Get rid of rt_check_expire and rt_garbage_collect In-Reply-To: <20050402112304.GA11321@gondor.apana.org.au> References: <424E641A.1020609@cosmosbay.com> <20050402112304.GA11321@gondor.apana.org.au> X-Mailer: VM 7.18 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1260 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Herbert Xu writes: > Rather than changing the timer GC so that it runs more often to keep > up with the large routing cache, we should get out of this by reducing > the amount of work we have to do. Yeep. > Imagine an ideal balanced hash table with 2.6 million entries. That > is, all incoming/outgoing packets belong to flows that are already in > the hash table. 
Imagine also that there is no PMTU/link failure taking > place so all entries are valid forever. > > In this state there is absolutely no need to execute the timer GC. > Let's remove one of those assumptions and allow there to be entries > which need to expire after a set period. > > Instead of having the timer GC clean them up, we can move the expire > check to the place where the entries are used. That is, we make > ip_route_input/ip_route_output/ipv4_dst_check check whether the > entry has expired. > > On the face of it we're doing more work since every routing cache > hit will need to check the validity of the dst. However, because > it's a single subtraction it is actually pretty cheap. There is > also no additional cache miss compared to doing it in the timer > GC since we have to read the dst anyway. > > Let's go one step further and make the routing cache come to life. > Now there are new entries coming in and we need to remove old ones > in order to make room for them. > > That task is currently carried out by the timer GC in rt_check_expire > and on demand by rt_garbage_collect. Either way we have to walk the > entire routing cache looking for entries to get rid of. > > This is quite expensive when the routing cache is large. However, > there is a better way. > > The reason we keep a cap on the routing cache (for a given hash size) > is so that individual chains do not degenerate into long linked lists. > > In other words, we don't really care about how many entries there are > in the routing cache. But we do care about how long each hash chain > is. > > So instead of walking the entire routing cache to keep the number of > entries down, what we should do is keep each hash chain as short as > possible. > > Assuming that the hash function is good, this should achieve the > same end result. > > Here is how it can be done: Every time a routing entry is inserted into > a hash chain, we perform GC on that chain unconditionally. 
> > It might seem that we're doing more work again. However, as before > because we're traversing the chain anyway, it is very cheap to perform > the GC operations which mainly involve the checks in rt_may_expire. Agree... It's very interesting and worth to test something like this. also it could clean up the GC process and the need for tuning which would be very welcome. --ro From dada1@cosmosbay.com Sat Apr 2 06:10:50 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 06:10:55 -0800 (PST) Received: from gw1.cosmosbay.com (gw1.cosmosbay.com [62.23.185.226]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32EAnSm007751 for ; Sat, 2 Apr 2005 06:10:50 -0800 Received: from [192.168.0.3] ([84.5.129.64]) by gw1.cosmosbay.com (8.13.3/8.13.3) with ESMTP id j32EACvQ012304; Sat, 2 Apr 2005 16:10:17 +0200 Message-ID: <424EA7C2.6060308@cosmosbay.com> Date: Sat, 02 Apr 2005 16:10:10 +0200 From: Eric Dumazet User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206) X-Accept-Language: fr, en MIME-Version: 1.0 To: Robert Olsson CC: Herbert Xu , davem@davemloft.net, netdev@oss.sgi.com Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() References: <424E641A.1020609@cosmosbay.com> <16974.41648.568927.54429@robur.slu.se> In-Reply-To: <16974.41648.568927.54429@robur.slu.se> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-1.6 (gw1.cosmosbay.com [62.23.185.226]); Sat, 02 Apr 2005 16:10:18 +0200 (CEST) X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1261 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: dada1@cosmosbay.com Precedence: bulk X-list: netdev Robert Olsson a écrit : > Eric Dumazet writes: > Yes thats a pretty much load. Very short flows some reason? Well... yes. 
This is a real server, not a DOS simulation. 1 million TCP flows, and about 3 million peers using UDP frames. > What's your ip_rt_gc_min_interval? GC should be allowed to > run frequent to smoothen out the GC load. Also good idea > to decrease gc_thresh and you hash is really huge. No. As soon as I lower gc_thresh (and let gc running), the machine starts to drop connections and crash some seconds later. I found I had to make the hash table very large (but lowering elasticity, ie chain length) . It needs lot of ram, but at least CPU usage of net/ipv4/route.c is close to 0. # grep . /proc/sys/net/ipv4/route/* /proc/sys/net/ipv4/route/error_burst:5000 /proc/sys/net/ipv4/route/error_cost:1000 /proc/sys/net/ipv4/route/gc_elasticity:2 /proc/sys/net/ipv4/route/gc_interval:1 /proc/sys/net/ipv4/route/gc_min_interval:0 /proc/sys/net/ipv4/route/gc_min_interval_ms:500 /proc/sys/net/ipv4/route/gc_thresh:2900000 /proc/sys/net/ipv4/route/gc_timeout:155 /proc/sys/net/ipv4/route/max_delay:10 /proc/sys/net/ipv4/route/max_size:16777216 /proc/sys/net/ipv4/route/min_adv_mss:256 /proc/sys/net/ipv4/route/min_delay:2 /proc/sys/net/ipv4/route/min_pmtu:552 /proc/sys/net/ipv4/route/mtu_expires:600 /proc/sys/net/ipv4/route/redirect_load:20 /proc/sys/net/ipv4/route/redirect_number:9 /proc/sys/net/ipv4/route/redirect_silence:20480 /proc/sys/net/ipv4/route/secret_interval:36000 From Robert.Olsson@data.slu.se Sat Apr 2 06:47:05 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 06:47:09 -0800 (PST) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32El4NW009228 for ; Sat, 2 Apr 2005 06:47:05 -0800 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j32EkV4m000845; Sat, 2 Apr 2005 16:46:31 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id 63253EE2B1; Sat, 2 Apr 2005 16:46:31 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii 
Content-Transfer-Encoding: 7bit Message-ID: <16974.45127.318022.377635@robur.slu.se> Date: Sat, 2 Apr 2005 16:46:31 +0200 To: Eric Dumazet Cc: Robert Olsson , Herbert Xu , davem@davemloft.net, netdev@oss.sgi.com Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() In-Reply-To: <424EA7C2.6060308@cosmosbay.com> References: <424E641A.1020609@cosmosbay.com> <16974.41648.568927.54429@robur.slu.se> <424EA7C2.6060308@cosmosbay.com> X-Mailer: VM 7.18 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1262 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Eric Dumazet writes: > Well... yes. This is a real server, not a DOS simulation. > 1 million TCP flows, and about 3 million peers using UDP frames. I see. > > What's your ip_rt_gc_min_interval? GC should be allowed to > > run frequent to smoothen out the GC load. Also good idea > > to decrease gc_thresh and you hash is really huge. > No. As soon as I lower gc_thresh (and let gc running), the machine starts to drop connections and crash some seconds later. > I found I had to make the hash table very large (but lowering elasticity, ie chain length) . > It needs lot of ram, but at least CPU usage of net/ipv4/route.c is close to 0. OK! Not so bad. Most of your GC likely happens in rt_intern_hash chain pruning. This way you keep hash-chains short and get "datadriven" GC. But there must be bugs causing the crash... Maybe there should be an explicit control hash lengths not via elasticity but adding even more tuning knobs hurts. 
:) --ro From andrea@suse.de Sat Apr 2 07:01:29 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 07:01:33 -0800 (PST) Received: from g5.random (ppp-217-133-42-200.cust-adsl.tiscali.it [217.133.42.200]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32F1RNf010182 for ; Sat, 2 Apr 2005 07:01:28 -0800 Received: by g5.random (Postfix, from userid 500) id 308A05753AA; Sat, 2 Apr 2005 17:01:17 +0200 (CEST) Date: Sat, 2 Apr 2005 17:01:17 +0200 From: Andrea Arcangeli To: Greg KH Cc: jaganav@us.ibm.com, Stephen Hemminger , Roland Dreier , Benjamin LaHaise , Dmitry Yusupov , open-iscsi@googlegroups.com, "David S. Miller" , mpm@selenic.com, michaelc@cs.wisc.edu, James.Bottomley@HansenPartnership.com, ksummit-2005-discuss@thunk.org, netdev@oss.sgi.com, bmt@zurich.ibm.com Subject: Re: Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics) Message-ID: <20050402150116.GU29492@g5.random> References: <20050401154348.553f3c46@dxpl.pdx.osdl.net> <1112405833.424df749e61b5@imap.linux.ibm.com> <20050402052738.GA17506@kroah.com> <20050402060216.GA17766@kroah.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050402060216.GA17766@kroah.com> X-GPG-Key: 1024D/68B9CB43 13D9 8355 295F 4823 7C49 C012 DFA1 686E 68B9 CB43 User-Agent: Mutt/1.5.9i X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1263 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: andrea@suse.de Precedence: bulk X-list: netdev On Fri, Apr 01, 2005 at 10:02:16PM -0800, Greg KH wrote: > If you all wish to duplicate this stupidity, feel free, but do not > expect to get any help from the community... And just in case: do not expect to be allowed to use stuff like the rbtree.[ch] which is GPL'd (not LGPL). (ib patches from topspin originally relicensed rbtree.[ch] under BSD...) 
From jheffner@psc.edu Sat Apr 2 07:33:11 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 07:33:23 -0800 (PST) Received: from mailer2.psc.edu (mailer2.psc.edu [128.182.66.106]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32FXAWu011778 for ; Sat, 2 Apr 2005 07:33:11 -0800 Received: from dexter.psc.edu (dexter.psc.edu [128.182.61.232]) by mailer2.psc.edu (8.13.3/8.13.3) with ESMTP id j32FbOUl023976 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 2 Apr 2005 10:37:25 -0500 (EST) Received: from dexter.psc.edu (localhost.psc.edu [127.0.0.1]) by dexter.psc.edu (8.12.11/8.12.10) with ESMTP id j32FWXP1021028; Sat, 2 Apr 2005 10:32:33 -0500 Received: from localhost (jheffner@localhost) by dexter.psc.edu (8.12.11/8.12.11/Submit) with ESMTP id j32FWWVo021025; Sat, 2 Apr 2005 10:32:33 -0500 X-Authentication-Warning: dexter.psc.edu: jheffner owned process doing -bs Date: Sat, 2 Apr 2005 10:32:32 -0500 (EST) From: John Heffner To: Herbert Xu cc: davem@davemloft.net, netdev@oss.sgi.com Subject: Re: [PATCH] skb pcount with MTU discovery In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1264 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jheffner@psc.edu Precedence: bulk X-list: netdev On Sat, 2 Apr 2005, Herbert Xu wrote: > How about fixing tcp_snd_test directly like this? I tried that first, but it caused a panic. I assumed some other point in the code assumed that invariant that if TSO is disabled then tso_segs==1. I didn't investigate though. > Of course all this will be moot once Dave finishes his TSO rewrite :) That will make things much simpler. 
;) -John From dmitry_yus@yahoo.com Sat Apr 2 10:08:42 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 10:08:48 -0800 (PST) Received: from smtp014.mail.yahoo.com (smtp014.mail.yahoo.com [216.136.173.58]) by oss.sgi.com (8.13.0/8.13.0) with SMTP id j32I8gNY020564 for ; Sat, 2 Apr 2005 10:08:42 -0800 Received: from unknown (HELO ?172.10.7.7?) (dmitry?yus@24.7.114.77 with plain) by smtp014.mail.yahoo.com with SMTP; 2 Apr 2005 18:08:42 -0000 Subject: Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics From: Dmitry Yusupov To: "open-iscsi@googlegroups.com" Cc: "David S. Miller" , mpm@selenic.com, andrea@suse.de, michaelc@cs.wisc.edu, James.Bottomley@HansenPartnership.com, ksummit-2005-discuss@thunk.org, netdev@oss.sgi.com In-Reply-To: <20050328223203.GC28983@kvack.org> References: <20050324215922.GT14202@opteron.random> <424346FE.20704@cs.wisc.edu> <20050324233921.GZ14202@opteron.random> <20050325034341.GV32638@waste.org> <20050327035149.GD4053@g5.random> <20050327054831.GA15453@waste.org> <1111905181.4753.15.camel@mylaptop> <20050326224621.61f6d917.davem@davemloft.net> <52vf7bwo4w.fsf@topspin.com> <1112042936.5088.22.camel@beastie> <20050328223203.GC28983@kvack.org> Content-Type: text/plain Date: Sat, 02 Apr 2005 10:08:37 -0800 Message-Id: <1112465317.24936.10.camel@mylaptop> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-2) Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1265 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: dmitry_yus@yahoo.com Precedence: bulk X-list: netdev On Mon, 2005-03-28 at 17:32 -0500, Benjamin LaHaise wrote: > On Mon, Mar 28, 2005 at 12:48:56PM -0800, Dmitry Yusupov wrote: > > If you have plans to start new project such as SoftRDMA than yes. 
lets > > discuss it since set of problems will be similar to what we've got with > > software iSCSI Initiators. > > I'm somewhat interested in seeing a SoftRDMA project get off the ground. > At least the NatSemi 83820 gige MAC is able to provide early-rx interrupts > that allow one to get an rx interrupt before the full payload has arrived > making it possible to write out a new rx descriptor to place the payload > wherever it is ultimately desired. It would be fun to work on if not the > most performant RDMA implementation. I see a lot of skepticism around early-rx interrupt schema. It might work for gige, but i'm not sure if it will fit into 10g. What RDMA gives us is zero-copy on receive and new networking api which has a potential to be HW accelerated. SoftRDMA will never avoid copying on receive. But benefit for SoftRDMA would be its availability on client sides. It is free and it could be easily deployed. Soon Intel & Co will give us 2,4,8... multi-core CPUs for around 200$ :), So, who cares if one of those cores will do receive side copying? From willy@www.linux.org.uk Sat Apr 2 10:27:19 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 10:27:29 -0800 (PST) Received: from parcelfarce.linux.theplanet.co.uk (IDENT:93@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32IRI3F021472 for ; Sat, 2 Apr 2005 10:27:19 -0800 Received: from willy by parcelfarce.linux.theplanet.co.uk with local (Exim 4.33) id 1DHnKy-000079-IH; Sat, 02 Apr 2005 19:27:12 +0100 Date: Sat, 2 Apr 2005 19:27:12 +0100 From: Matthew Wilcox To: jaganav@us.ibm.com Cc: Greg KH , Stephen Hemminger , Roland Dreier , Benjamin LaHaise , Dmitry Yusupov , open-iscsi@googlegroups.com, "David S. 
Miller" , mpm@selenic.com, andrea@suse.de, michaelc@cs.wisc.edu, James.Bottomley@HansenPartnership.com, ksummit-2005-discuss@thunk.org, netdev@oss.sgi.com, bmt@zurich.ibm.com Subject: Re: Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics) Message-ID: <20050402182712.GA24234@parcelfarce.linux.theplanet.co.uk> References: <1112426991.424e49ef57e2b@imap.linux.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1112426991.424e49ef57e2b@imap.linux.ibm.com> User-Agent: Mutt/1.4.1i X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1266 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: matthew@wil.cx Precedence: bulk X-list: netdev On Sat, Apr 02, 2005 at 02:29:51AM -0500, jaganav@us.ibm.com wrote: > If this dual license is a concern to other kernel developers as well from > contributing to OpenRDMA, we would seriously consider this and discuss with the > adapter vendors. Yes, it's a serious concern. Please release the code under the GPL only. -- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." 
-- Mark Twain From linux781@gmail.com Sat Apr 2 10:44:15 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 10:44:19 -0800 (PST) Received: from zproxy.gmail.com (zproxy.gmail.com [64.233.162.205]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32IiEuN022356 for ; Sat, 2 Apr 2005 10:44:15 -0800 Received: by zproxy.gmail.com with SMTP id 8so55531nzo for ; Sat, 02 Apr 2005 10:44:07 -0800 (PST) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:reply-to:to:subject:in-reply-to:mime-version:content-type:content-transfer-encoding:references; b=aKGmZ4hpLJ7N6OferqIHuGfei+vozd7D7DoJc0CZObBsEX+Mu5qTTB2axEchIpMcZemzRqQPeQ3kqgriTNrooAzSnyHNVIfRqNKlktYZmVTUwbwciM5zfsmetes3V//dC4xOy547FvLkVteeFFTwsAJCOhvnu2XlwfTox/mL+Kw= Received: by 10.36.74.14 with SMTP id w14mr23466nza; Sat, 02 Apr 2005 10:44:07 -0800 (PST) Received: by 10.36.58.7 with HTTP; Sat, 2 Apr 2005 10:44:07 -0800 (PST) Message-ID: <72252ed0504021044e69d634@mail.gmail.com> Date: Sat, 2 Apr 2005 13:44:07 -0500 From: Akshay Kawale Reply-To: Akshay Kawale To: netdev@oss.sgi.com Subject: Re: Difference between skb_put() and skb_push() In-Reply-To: <72252ed05033021463a1f45b6@mail.gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit References: <72252ed05033021463a1f45b6@mail.gmail.com> X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1267 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: linux781@gmail.com Precedence: bulk X-list: netdev Hi, I am trying to access the tot_len field in the IP Header using a sk_buff structure inside a Netfilter hook. I do something like: (**skb).nh.iph->tot_len += 64 I have tried other variants of the same statement but none of them work. 
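Two general C points bear on a statement of this form. First, if the compiler rejects the field access with an "incomplete type" error, the struct has only been forward-declared in that translation unit; for struct iphdr in kernel code this usually means <linux/ip.h> has not been included. Second, tot_len is carried in network byte order, so adding to it directly is only correct on big-endian hosts. The userspace sketch below shows the byte-order-safe form. `struct demo_iphdr` and `demo_grow_tot_len` are hypothetical stand-ins, not kernel names, and the sketch neither enlarges the packet buffer nor updates the IP header checksum, both of which a real hook would also need to do.

```c
#include <arpa/inet.h>   /* htons, ntohs */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct iphdr:
 * only the one field discussed here. */
struct demo_iphdr {
    uint16_t tot_len;    /* stored in network byte order */
};

/* Add 'extra' bytes to the total length: convert to host order,
 * add, convert back to network order. */
static void demo_grow_tot_len(struct demo_iphdr *iph, uint16_t extra)
{
    assert(iph != NULL);
    iph->tot_len = htons((uint16_t)(ntohs(iph->tot_len) + extra));
}
```

Called as `demo_grow_tot_len(&hdr, 64)`, this gives the same result on little-endian and big-endian machines, which a bare `+= 64` on the raw field does not.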
I want to increment the length by 64 bytes, but it gives me an error saying that I am trying to access an 'incomplete data type'. Can anyone shed some light on this problem? tot_len if of type __u16 (unsigned short int). Thanks. - Akshay From asgeir@chelsio.com Sat Apr 2 11:08:18 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 11:08:23 -0800 (PST) Received: from stargate.chelsio.com (stargate.chelsio.com [64.186.171.138] (may be forged)) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32J8HZ2024757 for ; Sat, 2 Apr 2005 11:08:18 -0800 Received: from YOGI.asicdesigners.com (yogi.asicdesigners.com [10.192.160.7]) by stargate.chelsio.com (8.12.5/8.12.5) with SMTP id j32J7SfZ015126; Sat, 2 Apr 2005 11:07:28 -0800 Subject: RE: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit ProposedTopics Date: Sat, 2 Apr 2005 11:07:28 -0800 MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Message-ID: <67D69596DDF0C2448DB0F0547D0F947E01781F2E@yogi.asicdesigners.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: X-MimeOLE: Produced By Microsoft Exchange V6.0.6487.1 Thread-Topic: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit ProposedTopics content-class: urn:content-classes:message Thread-Index: AcU3rw1xnl0wdez6QdSv3xCWP+9qxgABjiog From: "Asgeir Eiriksson" To: "Dmitry Yusupov" , Cc: "David S. Miller" , , , , , , X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id j32J8HZ2024757 X-archive-position: 1268 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: asgeir@chelsio.com Precedence: bulk X-list: netdev Dmitry The CPU cycles is only at most half of the story with the other half being the memory sub-system BW. So the validity of your observation depends on the BW we're talking about, i.e. 
if the client is using a fraction of 10Gbps for RDMA (or DDP, e.g. iSCSI DDP), yes then that fraction amounts to a fraction of the memory sub-system total BW so we don't much care about the extra copy. The situation is different if the client wants something close to 10Gbps (already have such client applications), because today 10Gbps is still a big chunk of the overall memory BW so you really care about eliminating that copy via DDP. 'Asgeir > -----Original Message----- > From: netdev-bounce@oss.sgi.com [mailto:netdev-bounce@oss.sgi.com] On > Behalf Of Dmitry Yusupov > Sent: Saturday, April 02, 2005 10:09 AM > To: open-iscsi@googlegroups.com > Cc: David S. Miller; mpm@selenic.com; andrea@suse.de; > michaelc@cs.wisc.edu; James.Bottomley@HansenPartnership.com; ksummit-2005- > discuss@thunk.org; netdev@oss.sgi.com > Subject: Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit > ProposedTopics > > On Mon, 2005-03-28 at 17:32 -0500, Benjamin LaHaise wrote: > > On Mon, Mar 28, 2005 at 12:48:56PM -0800, Dmitry Yusupov wrote: > > > If you have plans to start new project such as SoftRDMA than yes. lets > > > discuss it since set of problems will be similar to what we've got > with > > > software iSCSI Initiators. > > > > I'm somewhat interested in seeing a SoftRDMA project get off the ground. > > At least the NatSemi 83820 gige MAC is able to provide early-rx > interrupts > > that allow one to get an rx interrupt before the full payload has > arrived > > making it possible to write out a new rx descriptor to place the payload > > wherever it is ultimately desired. It would be fun to work on if not > the > > most performant RDMA implementation. > > I see a lot of skepticism around early-rx interrupt schema. It might > work for gige, but i'm not sure if it will fit into 10g. > > What RDMA gives us is zero-copy on receive and new networking api which > has a potential to be HW accelerated. SoftRDMA will never avoid copying > on receive. 
But benefit for SoftRDMA would be its availability on client > sides. It is free and it could be easily deployed. Soon Intel & Co will > give us 2,4,8... multi-core CPUs for around 200$ :), So, who cares if > one of those cores will do receive side copying? > From laforge@gnumonks.org Sat Apr 2 11:11:40 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 11:11:49 -0800 (PST) Received: from ganesha.gnumonks.org (Debian-exim@ganesha.gnumonks.org [213.95.27.120]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32JBd6w025449 for ; Sat, 2 Apr 2005 11:11:40 -0800 Received: from sunbeam.hmw-consulting.de ([83.236.178.203] helo=sunbeam.gnumonks.org) by ganesha.gnumonks.org with asmtp (TLS-1.0:RSA_AES_128_CBC_SHA:16) (Exim 4.34) id 1DHo1t-0000J3-L8 for netdev@oss.sgi.com; Sat, 02 Apr 2005 21:11:33 +0200 Received: from laforge by sunbeam.gnumonks.org with local (Exim 4.50) id 1DHo1s-0000mb-FR for netdev@oss.sgi.com; Sat, 02 Apr 2005 21:11:32 +0200 Date: Sat, 2 Apr 2005 21:11:32 +0200 From: Harald Welte To: netdev@oss.sgi.com Subject: pktgen problem (skb refcount) in 2.6.12-rc1 Message-ID: <20050402191132.GF1890@sunbeam.de.gnumonks.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="hK8Uo4Yp55NZU70L" Content-Disposition: inline User-Agent: mutt-ng 1.5.8-r168i (Debian) X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1269 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: laforge@gnumonks.org Precedence: bulk X-list: netdev --hK8Uo4Yp55NZU70L Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Hi! I've tried to get pktgen running on 2.6.12-rc1 (dual-opteron system, two dual e1000 boards). 
It transmits the requested number of packets, but the kernel thread(s) will continue to use 100% cpu even after that. I've tried to track the problem down, and I've confirmed that skb->users never goes down to 1 but instead stays at '2'. Therefore the while loop at line 2706 loops forever. Killing the kernel thread or configuring the interface down helps (as a kludge). However, the e1000 module will refuse to unload since apparently it's still referenced by that skb. The system is otherwise idle, and no fancy modules such as netfilter/iptables are loaded. The same system with the same pktgen script works fine with 2.6.11.6. I'm reporting this since it looks like we have an skb usage count leak somewhere :( -- - Harald Welte http://gnumonks.org/ ============================================================================ "Privacy in residential applications is a desirable marketing option." (ETSI EN 300 175-7 Ch. 
A6) --hK8Uo4Yp55NZU70L Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.0 (GNU/Linux) iD8DBQFCTu5kXaXGVTD0i/8RAnFTAJ0Zx+raxRpD3NBQYYp0vIh8uxK7lgCdFuqS cxrYExXXXuNnx4NAXVGfono= =9ULo -----END PGP SIGNATURE----- --hK8Uo4Yp55NZU70L-- From mingz@ele.uri.edu Sat Apr 2 11:13:30 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 11:13:35 -0800 (PST) Received: from leviathan.ele.uri.edu (leviathan.ele.uri.edu [131.128.51.64]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32JDTJI025678 for ; Sat, 2 Apr 2005 11:13:30 -0800 Received: from [127.0.0.1] (leviathan [131.128.51.64]) by leviathan.ele.uri.edu (8.12.9/8.12.9) with ESMTP id j32JDLCu013331; Sat, 2 Apr 2005 14:13:22 -0500 (EST) Subject: Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics From: Ming Zhang Reply-To: mingz@ele.uri.edu To: open-iscsi Cc: "David S. Miller" , mpm@selenic.com, andrea@suse.de, michaelc@cs.wisc.edu, James.Bottomley@HansenPartnership.com, ksummit-2005-discuss@thunk.org, netdev@oss.sgi.com In-Reply-To: <1112465317.24936.10.camel@mylaptop> References: <20050324215922.GT14202@opteron.random> <424346FE.20704@cs.wisc.edu> <20050324233921.GZ14202@opteron.random> <20050325034341.GV32638@waste.org> <20050327035149.GD4053@g5.random> <20050327054831.GA15453@waste.org> <1111905181.4753.15.camel@mylaptop> <20050326224621.61f6d917.davem@davemloft.net> <52vf7bwo4w.fsf@topspin.com> <1112042936.5088.22.camel@beastie> <20050328223203.GC28983@kvack.org> <1112465317.24936.10.camel@mylaptop> Content-Type: text/plain Message-Id: <1112469200.4599.4.camel@localhost.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.4.6 (1.4.6-2) Date: Sat, 02 Apr 2005 14:13:21 -0500 Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1270 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: 
netdev-bounce@oss.sgi.com X-original-sender: mingz@ele.uri.edu Precedence: bulk X-list: netdev On Sat, 2005-04-02 at 13:08, Dmitry Yusupov wrote: > On Mon, 2005-03-28 at 17:32 -0500, Benjamin LaHaise wrote: > > On Mon, Mar 28, 2005 at 12:48:56PM -0800, Dmitry Yusupov wrote: > > > If you have plans to start new project such as SoftRDMA than yes. lets > > > discuss it since set of problems will be similar to what we've got with > > > software iSCSI Initiators. > > > > I'm somewhat interested in seeing a SoftRDMA project get off the ground. > > At least the NatSemi 83820 gige MAC is able to provide early-rx interrupts > > that allow one to get an rx interrupt before the full payload has arrived > > making it possible to write out a new rx descriptor to place the payload > > wherever it is ultimately desired. It would be fun to work on if not the > > most performant RDMA implementation. > > I see a lot of skepticism around early-rx interrupt schema. It might > work for gige, but i'm not sure if it will fit into 10g. > > What RDMA gives us is zero-copy on receive and new networking api which > has a potential to be HW accelerated. SoftRDMA will never avoid copying > on receive. But benefit for SoftRDMA would be its availability on client > sides. It is free and it could be easily deployed. Soon Intel & Co will > give us 2,4,8... multi-core CPUs for around 200$ :), So, who cares if > one of those cores will do receive side copying? > dedicated core to dealing with interrupt is fine. but the memory bandwidth is still over-used right? 
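The memory-bandwidth question above can be put into rough numbers. What follows is only a back-of-the-envelope sketch, not a measurement: it assumes the NIC DMA-writes every received byte into memory once, that each additional copy reads the byte and writes it again, and that CPU caches absorb none of that traffic. The function names are made up for this illustration.

```c
#include <stdio.h>

/* Payload rate in MB/s for a link speed given in Mbit/s (8 bits per byte). */
unsigned long payload_mbytes_per_sec(unsigned long link_mbps)
{
    return link_mbps / 8;
}

/*
 * Memory traffic in MB/s caused by receiving at link_mbps.
 * Assumed model (hypothetical, caches ignored):
 *   - the NIC DMA-writes each received byte once            -> 1 access
 *   - every extra copy reads the byte and writes it again   -> 2 accesses
 */
unsigned long rx_memory_traffic(unsigned long link_mbps, unsigned int copies)
{
    return payload_mbytes_per_sec(link_mbps) * (1 + 2 * copies);
}

/* Print direct placement (no copy) next to the one-copy receive path. */
void print_rx_cost(unsigned long link_mbps)
{
    printf("%lu Mbit/s: %lu MB/s with direct placement, %lu MB/s with one copy\n",
           link_mbps,
           rx_memory_traffic(link_mbps, 0),
           rx_memory_traffic(link_mbps, 1));
}
```

Under these assumptions a receive-side copy triples the memory traffic for the same payload, whichever core performs it. Whether that matters depends on how the result compares with the memory bandwidth the machine can actually sustain, which is not stated in this thread.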
ming From mingz@ele.uri.edu Sat Apr 2 11:14:54 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 11:15:00 -0800 (PST) Received: from leviathan.ele.uri.edu (leviathan.ele.uri.edu [131.128.51.64]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32JErRE026229 for ; Sat, 2 Apr 2005 11:14:54 -0800 Received: from [127.0.0.1] (leviathan [131.128.51.64]) by leviathan.ele.uri.edu (8.12.9/8.12.9) with ESMTP id j32JElCu013372; Sat, 2 Apr 2005 14:14:47 -0500 (EST) Subject: RE: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit ProposedTopics From: Ming Zhang Reply-To: mingz@ele.uri.edu To: open-iscsi Cc: Dmitry Yusupov , "David S. Miller" , mpm@selenic.com, andrea@suse.de, michaelc@cs.wisc.edu, James.Bottomley@HansenPartnership.com, ksummit-2005-discuss@thunk.org, netdev@oss.sgi.com In-Reply-To: <67D69596DDF0C2448DB0F0547D0F947E01781F2E@yogi.asicdesigners.com> References: <67D69596DDF0C2448DB0F0547D0F947E01781F2E@yogi.asicdesigners.com> Content-Type: text/plain Message-Id: <1112469286.4599.7.camel@localhost.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.4.6 (1.4.6-2) Date: Sat, 02 Apr 2005 14:14:47 -0500 Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1271 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: mingz@ele.uri.edu Precedence: bulk X-list: netdev yes, thx for explaining this in more detail. copy avoidance is one main goal of rdma. the BW gap is the bottleneck. ming On Sat, 2005-04-02 at 14:07, Asgeir Eiriksson wrote: > Dmitry > > The CPU cycles is only at most half of the story with the other half > being the memory sub-system BW. > > So the validity of your observation depends on the BW we're talking > about, i.e. if the client is using a fraction of 10Gbps for RDMA (or > DDP, e.g. 
iSCSI DDP), yes then that fraction amounts to a fraction of > the memory sub-system total BW so we don't much care about the extra > copy. > > The situation is different if the client wants something close to 10Gbps > (already have such client applications), because today 10Gbps is still a > big chunk of the overall memory BW so you really care about eliminating > that copy via DDP. > > 'Asgeir > > > -----Original Message----- > > From: netdev-bounce@oss.sgi.com [mailto:netdev-bounce@oss.sgi.com] On > > Behalf Of Dmitry Yusupov > > Sent: Saturday, April 02, 2005 10:09 AM > > To: open-iscsi@googlegroups.com > > Cc: David S. Miller; mpm@selenic.com; andrea@suse.de; > > michaelc@cs.wisc.edu; James.Bottomley@HansenPartnership.com; > ksummit-2005- > > discuss@thunk.org; netdev@oss.sgi.com > > Subject: Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit > > ProposedTopics > > > > On Mon, 2005-03-28 at 17:32 -0500, Benjamin LaHaise wrote: > > > On Mon, Mar 28, 2005 at 12:48:56PM -0800, Dmitry Yusupov wrote: > > > > If you have plans to start new project such as SoftRDMA than yes. > lets > > > > discuss it since set of problems will be similar to what we've got > > with > > > > software iSCSI Initiators. > > > > > > I'm somewhat interested in seeing a SoftRDMA project get off the > ground. > > > At least the NatSemi 83820 gige MAC is able to provide early-rx > > interrupts > > > that allow one to get an rx interrupt before the full payload has > > arrived > > > making it possible to write out a new rx descriptor to place the > payload > > > wherever it is ultimately desired. It would be fun to work on if > not > > the > > > most performant RDMA implementation. > > > > I see a lot of skepticism around early-rx interrupt schema. It might > > work for gige, but i'm not sure if it will fit into 10g. > > > > What RDMA gives us is zero-copy on receive and new networking api > which > > has a potential to be HW accelerated. SoftRDMA will never avoid > copying > > on receive. 
But benefit for SoftRDMA would be its availability on > client > > sides. It is free and it could be easily deployed. Soon Intel & Co > will > > give us 2,4,8... multi-core CPUs for around 200$ :), So, who cares if > > one of those cores will do receive side copying? > > > > From hadi@cyberus.ca Sat Apr 2 11:20:18 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 11:20:26 -0800 (PST) Received: from mx02.cybersurf.com (mx02.cybersurf.com [209.197.145.105]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32JKIHq027275 for ; Sat, 2 Apr 2005 11:20:18 -0800 Received: from mail.cyberus.ca ([209.197.145.21]) by mx02.cybersurf.com with esmtp (Exim 4.30) id 1DHoAJ-00041A-W6 for netdev@oss.sgi.com; Sat, 02 Apr 2005 14:20:15 -0500 Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DHoAA-0004ZR-4O; Sat, 02 Apr 2005 14:20:06 -0500 Subject: take 2 WAS(Re: PATCH: IPSEC xfrm events From: jamal Reply-To: hadi@cyberus.ca To: Herbert Xu Cc: Patrick McHardy , Masahide NAKAMURA , "David S. 
Miller" , netdev In-Reply-To: <20050402014619.GB24861@gondor.apana.org.au> References: <1112319441.1089.83.camel@jzny.localdomain> <20050401042106.GA27762@gondor.apana.org.au> <1112353398.1096.116.camel@jzny.localdomain> <20050401114258.GA2932@gondor.apana.org.au> <1112358278.1096.160.camel@jzny.localdomain> <20050401123554.GA3468@gondor.apana.org.au> <1112403845.1088.14.camel@jzny.localdomain> <20050402012813.GA24575@gondor.apana.org.au> <1112406164.1088.54.camel@jzny.localdomain> <20050402014619.GB24861@gondor.apana.org.au> Content-Type: multipart/mixed; boundary="=-g5r6p/Y+YZcaZoLnsgWz" Organization: jamalopolous Message-Id: <1112469601.1088.173.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 02 Apr 2005 14:20:01 -0500 X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1272 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev --=-g5r6p/Y+YZcaZoLnsgWz Content-Type: text/plain Content-Transfer-Encoding: 7bit On Fri, 2005-04-01 at 20:46, Herbert Xu wrote: > On Fri, Apr 01, 2005 at 08:42:45PM -0500, jamal wrote: > > > > So always go v2? > > Yes since that's the only version that the kernel knows how to generate. Ok, here's a first cut of a general patch; I think I got everything that was discussed in there. I've done some basic five-minute tests on it. Once we have agreement I will pass it on to Masahide-san to do more thorough testing. Look at the XXX comments in the patch. A couple of interesting things: 1) We've discussed this before, Herbert, and I think you misspoke when you said that pfkey delivers to all listeners. pfkey Add/del/upd now really do tell all processes about what happened. Before, pfkey would skip the originating process. So far this doesn't seem to be an issue in the basic testing. 2) I ended up adding a policy_notify to the pfkey manager to make the code generic. 
Interesting thing is i dont think pfkey knows what to do with policy expiration or i am misreading the code. I dont see any message type for policy expiration as i do for sa expiration. Ive put some hooks and a little noise. I could remove the printks - for now they are just place holders. cheers, jamal --=-g5r6p/Y+YZcaZoLnsgWz Content-Disposition: attachment; filename=ipsec-event-take2 Content-Type: text/plain; name=ipsec-event-take2; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit --- a/include/net/xfrm.h 2005-03-25 22:28:26.000000000 -0500 +++ b/include/net/xfrm.h 2005-04-02 11:59:17.000000000 -0500 @@ -157,6 +157,28 @@ XFRM_STATE_DEAD }; +/* events that could be sent by kernel */ +enum { + XFRM_SAP_INVALID, + XFRM_SAP_EXPIRED, + XFRM_SAP_ADDED, + XFRM_SAP_UPDATED, + XFRM_SAP_DELETED, + XFRM_SAP_FLUSHED, + __XFRM_SAP_MAX +}; +#define XFRM_SAP_MAX (__XFRM_SAP_MAX - 1) + +/* callback structure passed from either netlink or pfkey */ +struct km_event +{ + u32 data; + u32 seq; + u32 pid; + u32 event; +}; + + struct xfrm_type; struct xfrm_dst; struct xfrm_policy_afinfo { @@ -178,6 +200,9 @@ extern int xfrm_policy_register_afinfo(struct xfrm_policy_afinfo *afinfo); extern int xfrm_policy_unregister_afinfo(struct xfrm_policy_afinfo *afinfo); +extern void km_policy_notify(struct xfrm_policy *xp, int dir, struct km_event *c); +extern void km_state_notify(struct xfrm_state *x, struct km_event *c); + #define XFRM_ACQ_EXPIRES 30 @@ -283,17 +308,17 @@ struct xfrm_tmpl xfrm_vec[XFRM_MAX_DEPTH]; }; -#define XFRM_KM_TIMEOUT 30 +#define XFRM_KM_TIMEOUT 30 struct xfrm_mgr { struct list_head list; char *id; - int (*notify)(struct xfrm_state *x, int event); + int (*notify)(struct xfrm_state *x, struct km_event *c); int (*acquire)(struct xfrm_state *x, struct xfrm_tmpl *, struct xfrm_policy *xp, int dir); struct xfrm_policy *(*compile_policy)(u16 family, int opt, u8 *data, int len, int *dir); int (*new_mapping)(struct xfrm_state *x, xfrm_address_t *ipaddr, u16 sport); - int 
(*notify_policy)(struct xfrm_policy *x, int dir, int event); + int (*notify_policy)(struct xfrm_policy *x, int dir, struct km_event *c); }; extern int xfrm_register_km(struct xfrm_mgr *km); @@ -802,7 +827,7 @@ extern int xfrm_state_update(struct xfrm_state *x); extern struct xfrm_state *xfrm_state_lookup(xfrm_address_t *daddr, u32 spi, u8 proto, unsigned short family); extern struct xfrm_state *xfrm_find_acq_byseq(u32 seq); -extern void xfrm_state_delete(struct xfrm_state *x); +extern int xfrm_state_delete(struct xfrm_state *x); extern void xfrm_state_flush(u8 proto); extern int xfrm_replay_check(struct xfrm_state *x, u32 seq); extern void xfrm_replay_advance(struct xfrm_state *x, u32 seq); --- a/include/linux/xfrm.h 2005-03-25 22:28:39.000000000 -0500 +++ b/include/linux/xfrm.h 2005-04-02 09:53:03.000000000 -0500 @@ -254,5 +254,7 @@ #define XFRMGRP_ACQUIRE 1 #define XFRMGRP_EXPIRE 2 +#define XFRMGRP_SA 4 +#define XFRMGRP_POLICY 8 #endif /* _LINUX_XFRM_H */ --- a/net/xfrm/xfrm_state.c 2005-03-25 22:28:25.000000000 -0500 +++ b/net/xfrm/xfrm_state.c 2005-04-02 12:15:37.000000000 -0500 @@ -48,7 +48,7 @@ static struct list_head xfrm_state_gc_list = LIST_HEAD_INIT(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); -static void __xfrm_state_delete(struct xfrm_state *x); +static int __xfrm_state_delete(struct xfrm_state *x); static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned short family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -208,8 +208,10 @@ } EXPORT_SYMBOL(__xfrm_state_destroy); -static void __xfrm_state_delete(struct xfrm_state *x) +static int __xfrm_state_delete(struct xfrm_state *x) { + int err = -ESRCH; + if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); @@ -236,14 +238,47 @@ * is what we are dropping here. 
*/ atomic_dec(&x->refcnt); + err = 0; } + + return err; } -void xfrm_state_delete(struct xfrm_state *x) +static DEFINE_RWLOCK(xfrm_km_lock); +static struct list_head xfrm_km_list = LIST_HEAD_INIT(xfrm_km_list); + +void km_policy_notify(struct xfrm_policy *xp, int dir, struct km_event *c) { + struct xfrm_mgr *km; + + read_lock(&xfrm_km_lock); + list_for_each_entry(km, &xfrm_km_list, list) + if (km->notify_policy) + km->notify_policy(xp, dir, c); + read_unlock(&xfrm_km_lock); +} + +void km_state_notify(struct xfrm_state *x, struct km_event *c) +{ + struct xfrm_mgr *km; + read_lock(&xfrm_km_lock); + list_for_each_entry(km, &xfrm_km_list, list) + km->notify(x, c); + read_unlock(&xfrm_km_lock); +} + +EXPORT_SYMBOL(km_policy_notify); +EXPORT_SYMBOL(km_state_notify); + +int xfrm_state_delete(struct xfrm_state *x) +{ + int err; + spin_lock_bh(&x->lock); - __xfrm_state_delete(x); + err = __xfrm_state_delete(x); spin_unlock_bh(&x->lock); + + return err; } EXPORT_SYMBOL(xfrm_state_delete); @@ -402,6 +437,7 @@ static struct xfrm_state *__xfrm_find_acq_byseq(u32 seq); + int xfrm_state_add(struct xfrm_state *x) { struct xfrm_state_afinfo *afinfo; @@ -764,37 +800,45 @@ } EXPORT_SYMBOL(xfrm_replay_advance); -static struct list_head xfrm_km_list = LIST_HEAD_INIT(xfrm_km_list); -static DEFINE_RWLOCK(xfrm_km_lock); static void km_state_expired(struct xfrm_state *x, int hard) { - struct xfrm_mgr *km; + struct km_event c; if (hard) x->km.state = XFRM_STATE_EXPIRED; else x->km.dying = 1; - read_lock(&xfrm_km_lock); - list_for_each_entry(km, &xfrm_km_list, list) - km->notify(x, hard); - read_unlock(&xfrm_km_lock); + /* XXX: Do we wanna do this right at the top?? 
+ * if the state is dead we dont want to announce + * the expire - a delete may already have announced + * it + */ + if (x->km.state == XFRM_STATE_DEAD) + return; + c.data = hard; + c.event = XFRM_SAP_EXPIRED; + km_state_notify(x, &c); if (hard) wake_up(&km_waitq); } +/* + * We send to all registered managers regardless of failure + * We are happy with one success +*/ static int km_query(struct xfrm_state *x, struct xfrm_tmpl *t, struct xfrm_policy *pol) { - int err = -EINVAL; + int err = -EINVAL, acqret; struct xfrm_mgr *km; read_lock(&xfrm_km_lock); list_for_each_entry(km, &xfrm_km_list, list) { - err = km->acquire(x, t, pol, XFRM_POLICY_OUT); - if (!err) - break; + acqret = km->acquire(x, t, pol, XFRM_POLICY_OUT); + if (!acqret) + err = acqret; } read_unlock(&xfrm_km_lock); return err; @@ -819,13 +863,19 @@ void km_policy_expired(struct xfrm_policy *pol, int dir, int hard) { - struct xfrm_mgr *km; + struct km_event c; - read_lock(&xfrm_km_lock); - list_for_each_entry(km, &xfrm_km_list, list) - if (km->notify_policy) - km->notify_policy(pol, dir, hard); - read_unlock(&xfrm_km_lock); + /* XXX: Do we still wanna wakeup km_waitq? + * if the policy is dead we dont want to announce + * the expire - a delete may already have announced + * it + */ + if (pol->dead) + return; + + c.data = hard; + c.event = XFRM_SAP_EXPIRED; + km_policy_notify(pol, dir, &c); if (hard) wake_up(&km_waitq); --- a/net/xfrm/xfrm_policy.c 2005-03-25 22:28:21.000000000 -0500 +++ b/net/xfrm/xfrm_policy.c 2005-04-02 12:16:30.000000000 -0500 @@ -298,7 +298,7 @@ * entry dead. The rule must be unlinked from lists to the moment. 
*/ -static void xfrm_policy_kill(struct xfrm_policy *policy) +static void xfrm_policy_kill(struct xfrm_policy *policy, int dir) { write_lock_bh(&policy->lock); if (policy->dead) @@ -378,7 +378,7 @@ write_unlock_bh(&xfrm_policy_lock); if (delpol) { - xfrm_policy_kill(delpol); + xfrm_policy_kill(delpol, dir); } return 0; } @@ -402,7 +402,7 @@ if (pol && delete) { atomic_inc(&flow_cache_genid); - xfrm_policy_kill(pol); + xfrm_policy_kill(pol, dir); } return pol; } @@ -425,7 +425,7 @@ if (pol && delete) { atomic_inc(&flow_cache_genid); - xfrm_policy_kill(pol); + xfrm_policy_kill(pol, dir); } return pol; } @@ -442,7 +442,7 @@ xfrm_policy_list[dir] = xp->next; write_unlock_bh(&xfrm_policy_lock); - xfrm_policy_kill(xp); + xfrm_policy_kill(xp, dir); write_lock_bh(&xfrm_policy_lock); } @@ -558,7 +558,7 @@ if (pol) { if (dir < XFRM_POLICY_MAX) atomic_inc(&flow_cache_genid); - xfrm_policy_kill(pol); + xfrm_policy_kill(pol, dir); } } @@ -579,7 +579,7 @@ write_unlock_bh(&xfrm_policy_lock); if (old_pol) { - xfrm_policy_kill(old_pol); + xfrm_policy_kill(old_pol, dir); } return 0; } --- a/net/xfrm/xfrm_user.c 2005-03-25 22:28:22.000000000 -0500 +++ b/net/xfrm/xfrm_user.c 2005-04-02 12:21:32.000000000 -0500 @@ -268,6 +268,7 @@ struct xfrm_usersa_info *p = NLMSG_DATA(nlh); struct xfrm_state *x; int err; + struct km_event c; err = verify_newsa_info(p, (struct rtattr **) xfrma); if (err) @@ -285,14 +286,26 @@ if (err < 0) { x->km.state = XFRM_STATE_DEAD; xfrm_state_put(x); + return err; } + c.seq = nlh->nlmsg_seq; + c.pid = nlh->nlmsg_pid; + if (nlh->nlmsg_type == XFRM_MSG_NEWSA) + c.event = XFRM_SAP_ADDED; + else + c.event = XFRM_SAP_UPDATED; + + km_state_notify(x, &c); + return err; } static int xfrm_del_sa(struct sk_buff *skb, struct nlmsghdr *nlh, void **xfrma) { struct xfrm_state *x; + int err; + struct km_event c; struct xfrm_usersa_id *p = NLMSG_DATA(nlh); x = xfrm_state_lookup(&p->daddr, p->spi, p->proto, p->family); @@ -304,10 +317,20 @@ return -EPERM; } - 
xfrm_state_delete(x); + err = xfrm_state_delete(x); + if (err < 0) { + x->km.state = XFRM_STATE_DEAD; + xfrm_state_put(x); + return err; + } + + c.seq = nlh->nlmsg_seq; + c.pid = nlh->nlmsg_pid; + c.event = XFRM_SAP_DELETED; + km_state_notify(x, &c); xfrm_state_put(x); - return 0; + return err; } static void copy_to_user_state(struct xfrm_state *x, struct xfrm_usersa_info *p) @@ -672,6 +695,7 @@ { struct xfrm_userpolicy_info *p = NLMSG_DATA(nlh); struct xfrm_policy *xp; + struct km_event c; int err; int excl; @@ -683,6 +707,10 @@ if (!xp) return err; + /* shouldnt excl be based on nlh flags?? + * Aha! this is anti-netlink really i.e more pfkey derived + * in netlink excl is a flag and you wouldnt need + * a type XFRM_MSG_UPDPOLICY - JHS */ excl = nlh->nlmsg_type == XFRM_MSG_NEWPOLICY; err = xfrm_policy_insert(p->dir, xp, excl); if (err) { @@ -690,6 +718,16 @@ return err; } + + if (!excl) + c.event = XFRM_SAP_UPDATED; + else + c.event = XFRM_SAP_ADDED; + + c.seq = nlh->nlmsg_seq; + c.pid = nlh->nlmsg_pid; + km_policy_notify(xp, p->dir, &c); + xfrm_pol_put(xp); return 0; @@ -807,8 +845,10 @@ struct xfrm_policy *xp; struct xfrm_userpolicy_id *p; int err; + struct km_event c; int delete; + p = NLMSG_DATA(nlh); delete = nlh->nlmsg_type == XFRM_MSG_DELPOLICY; @@ -834,6 +874,11 @@ NETLINK_CB(skb).pid, MSG_DONTWAIT); } + } else { + c.event = XFRM_SAP_DELETED; + c.seq = nlh->nlmsg_seq; + c.pid = nlh->nlmsg_pid; + km_policy_notify(xp, p->dir, &c); } xfrm_pol_put(xp); @@ -843,15 +888,28 @@ static int xfrm_flush_sa(struct sk_buff *skb, struct nlmsghdr *nlh, void **xfrma) { + struct km_event c; struct xfrm_usersa_flush *p = NLMSG_DATA(nlh); xfrm_state_flush(p->proto); + c.data = p->proto; + c.event = XFRM_SAP_FLUSHED; + c.seq = nlh->nlmsg_seq; + c.pid = nlh->nlmsg_pid; + km_state_notify(NULL, &c); + return 0; } static int xfrm_flush_policy(struct sk_buff *skb, struct nlmsghdr *nlh, void **xfrma) { + struct km_event c; + xfrm_policy_flush(); + c.event = XFRM_SAP_FLUSHED; + c.seq 
= nlh->nlmsg_seq; + c.pid = nlh->nlmsg_pid; + km_policy_notify(NULL, 0, &c); return 0; } @@ -1053,10 +1111,11 @@ return -1; } -static int xfrm_send_state_notify(struct xfrm_state *x, int hard) +static int xfrm_exp_state_notify(struct xfrm_state *x, struct km_event *c) { struct sk_buff *skb; - + int hard = c ->data; + /* fix to do alloc using NLM macros */ skb = alloc_skb(sizeof(struct xfrm_user_expire) + 16, GFP_ATOMIC); if (skb == NULL) return -ENOMEM; @@ -1069,6 +1128,94 @@ return netlink_broadcast(xfrm_nl, skb, 0, XFRMGRP_EXPIRE, GFP_ATOMIC); } +static int xfrm_notify_sa_flush(struct km_event *c) +{ + struct xfrm_usersa_flush *p; + struct nlmsghdr *nlh; + struct sk_buff *skb; + unsigned char *b; + int len = NLMSG_LENGTH(sizeof(struct xfrm_usersa_flush)); + + skb = alloc_skb(len, GFP_ATOMIC); + if (skb == NULL) + return -ENOMEM; + b = skb->tail; + + nlh = NLMSG_PUT(skb, c->pid, c->seq, + XFRM_MSG_FLUSHSA, sizeof(*p)); + nlh->nlmsg_flags = 0; + + p = NLMSG_DATA(nlh); + p->proto = c->data; + + nlh->nlmsg_len = skb->tail - b; + + return netlink_broadcast(xfrm_nl, skb, 0, XFRMGRP_SA, GFP_ATOMIC); + +nlmsg_failure: + kfree_skb(skb); + return -1; +} + +static int xfrm_notify_sa( struct xfrm_state *x, struct km_event *c) +{ + struct xfrm_usersa_info *p; + struct nlmsghdr *nlh; + struct sk_buff *skb; + u32 nlt; + unsigned char *b; + int len = NLMSG_LENGTH(sizeof(struct xfrm_usersa_info)); + + skb = alloc_skb(len, GFP_ATOMIC); + if (skb == NULL) + return -ENOMEM; + b = skb->tail; + + if (c->event == XFRM_SAP_ADDED) + nlt = XFRM_MSG_NEWSA; + else if (c->event == XFRM_SAP_UPDATED) + nlt = XFRM_MSG_UPDSA; + else if (c->event == XFRM_SAP_DELETED) + nlt = XFRM_MSG_DELSA; + else + goto nlmsg_failure; + + nlh = NLMSG_PUT(skb, c->pid, c->seq, nlt, sizeof(*p)); + nlh->nlmsg_flags = 0; + + p = NLMSG_DATA(nlh); + copy_to_user_state(x, p); + + nlh->nlmsg_len = skb->tail - b; + + return netlink_broadcast(xfrm_nl, skb, 0, XFRMGRP_SA, GFP_ATOMIC); + +nlmsg_failure: + kfree_skb(skb); + 
return -1; +} + +static int xfrm_send_state_notify(struct xfrm_state *x, struct km_event *c) +{ + + switch (c->event) { + case XFRM_SAP_EXPIRED: + return xfrm_exp_state_notify(x, c); + case XFRM_SAP_DELETED: + case XFRM_SAP_UPDATED: + case XFRM_SAP_ADDED: + return xfrm_notify_sa(x, c); + case XFRM_SAP_FLUSHED: + return xfrm_notify_sa_flush(c); + default: + printk("pfkey: Unknown SA event %d\n",c->event); + break; + } + + return 0; + +} + static int build_acquire(struct sk_buff *skb, struct xfrm_state *x, struct xfrm_tmpl *xt, struct xfrm_policy *xp, int dir) @@ -1202,7 +1349,8 @@ return -1; } -static int xfrm_send_policy_notify(struct xfrm_policy *xp, int dir, int hard) + +static int xfrm_exp_policy_notify(struct xfrm_policy *xp, int dir, struct km_event *c) { struct sk_buff *skb; size_t len; @@ -1213,7 +1361,7 @@ if (skb == NULL) return -ENOMEM; - if (build_polexpire(skb, xp, dir, hard) < 0) + if (build_polexpire(skb, xp, dir, c->data) < 0) BUG(); NETLINK_CB(skb).dst_groups = XFRMGRP_EXPIRE; @@ -1221,6 +1369,90 @@ return netlink_broadcast(xfrm_nl, skb, 0, XFRMGRP_EXPIRE, GFP_ATOMIC); } +static int xfrm_notify_policy( struct xfrm_policy *xp, int dir, struct km_event *c) +{ + struct xfrm_userpolicy_info *p; + struct nlmsghdr *nlh; + struct sk_buff *skb; + u32 nlt = 0 ; + unsigned char *b; + int len = NLMSG_LENGTH(sizeof(struct xfrm_userpolicy_info)); + + skb = alloc_skb(len, GFP_ATOMIC); + if (skb == NULL) + return -ENOMEM; + b = skb->tail; + + if (c->event == XFRM_SAP_ADDED) + nlt = XFRM_MSG_NEWPOLICY; + else if (c->event == XFRM_SAP_UPDATED) + nlt = XFRM_MSG_UPDPOLICY; + else if (c->event == XFRM_SAP_DELETED) + nlt = XFRM_MSG_DELPOLICY; + else + goto nlmsg_failure; + + nlh = NLMSG_PUT(skb, c->pid, c->seq, nlt, sizeof(*p)); + + p = NLMSG_DATA(nlh); + + nlh->nlmsg_flags = 0; + + copy_to_user_policy(xp, p, dir); + + nlh->nlmsg_len = skb->tail - b; + + return netlink_broadcast(xfrm_nl, skb, 0, XFRMGRP_POLICY, GFP_ATOMIC); + +nlmsg_failure: + kfree_skb(skb); + return 
-1; +} + +static int xfrm_notify_policy_flush(struct km_event *c) +{ + struct nlmsghdr *nlh; + struct sk_buff *skb; + unsigned char *b; + int len = NLMSG_LENGTH(0); + + skb = alloc_skb(len, GFP_ATOMIC); + if (skb == NULL) + return -ENOMEM; + b = skb->tail; + + + nlh = NLMSG_PUT(skb, c->pid, c->seq, XFRM_MSG_FLUSHPOLICY, 0); + + nlh->nlmsg_len = skb->tail - b; + + return netlink_broadcast(xfrm_nl, skb, 0, XFRMGRP_POLICY, GFP_ATOMIC); + +nlmsg_failure: + kfree_skb(skb); + return -1; +} + +static int xfrm_send_policy_notify(struct xfrm_policy *xp, int dir, struct km_event *c) +{ + + switch (c->event) { + case XFRM_SAP_ADDED: + case XFRM_SAP_UPDATED: + case XFRM_SAP_DELETED: + return xfrm_notify_policy(xp, dir, c); + case XFRM_SAP_FLUSHED: + return xfrm_notify_policy_flush(c); + case XFRM_SAP_EXPIRED: + return xfrm_exp_policy_notify(xp, dir, c); + default: + printk("Netlink Unknown Policy event %d\n",c->event); + } + + return 0; + +} + static struct xfrm_mgr netlink_mgr = { .id = "netlink", .notify = xfrm_send_state_notify, --- a/net/key/af_key.c 2005-03-25 22:28:39.000000000 -0500 +++ b/net/key/af_key.c 2005-04-02 12:25:49.000000000 -0500 @@ -1240,13 +1240,85 @@ return 0; } +static inline int event2poltype (int event) +{ + switch (event) { + case XFRM_SAP_DELETED: + return SADB_X_SPDDELETE; + case XFRM_SAP_ADDED: + return SADB_X_SPDADD; + case XFRM_SAP_UPDATED: + return SADB_X_SPDUPDATE; + case XFRM_SAP_EXPIRED: + // return SADB_X_SPDEXPIRE; + default: + printk("pfkey: Unknown policy event %d\n",event); + break; + } + + return 0; +} + +static inline int event2keytype (int event) +{ + switch (event) { + case XFRM_SAP_DELETED: + return SADB_DELETE; + case XFRM_SAP_ADDED: + return SADB_ADD; + case XFRM_SAP_UPDATED: + return SADB_UPDATE; + case XFRM_SAP_EXPIRED: + return SADB_EXPIRE; + default: + printk("pfkey: Unknown SA event %d\n",event); + break; + } + + return 0; +} + +/* ADD/UPD/DEL */ +static int key_notify_sa(struct xfrm_state *x, struct km_event *c) +{ + struct 
sk_buff *skb; + struct sadb_msg *hdr; + int hsc = 3; + + if (c->event == XFRM_SAP_DELETED) + hsc = 0; + + if (c->event == XFRM_SAP_EXPIRED) { + if (c->data) + hsc = 2; + else + hsc = 1; + } + + skb = pfkey_xfrm_state2msg(x, 0, hsc); + + if (IS_ERR(skb)) + return PTR_ERR(skb); + + hdr = (struct sadb_msg *) skb->data; + hdr->sadb_msg_version = PF_KEY_V2; + hdr->sadb_msg_type = event2keytype(c->event); + hdr->sadb_msg_satype = pfkey_proto2satype(x->id.proto); + hdr->sadb_msg_errno = 0; + hdr->sadb_msg_reserved = 0; + hdr->sadb_msg_seq = c->seq; + hdr->sadb_msg_pid = c->pid; + + pfkey_broadcast(skb, GFP_ATOMIC, BROADCAST_ALL, NULL); + + return 0; +} static int pfkey_add(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs) { - struct sk_buff *out_skb; - struct sadb_msg *out_hdr; struct xfrm_state *x; int err; + struct km_event c; xfrm_probe_algs(); @@ -1256,7 +1328,7 @@ if (hdr->sadb_msg_type == SADB_ADD) err = xfrm_state_add(x); - else + else err = xfrm_state_update(x); if (err < 0) { @@ -1265,27 +1337,22 @@ return err; } - out_skb = pfkey_xfrm_state2msg(x, 0, 3); - if (IS_ERR(out_skb)) - return PTR_ERR(out_skb); /* XXX Should we return 0 here ? 
*/ - - out_hdr = (struct sadb_msg *) out_skb->data; - out_hdr->sadb_msg_version = hdr->sadb_msg_version; - out_hdr->sadb_msg_type = hdr->sadb_msg_type; - out_hdr->sadb_msg_satype = pfkey_proto2satype(x->id.proto); - out_hdr->sadb_msg_errno = 0; - out_hdr->sadb_msg_reserved = 0; - out_hdr->sadb_msg_seq = hdr->sadb_msg_seq; - out_hdr->sadb_msg_pid = hdr->sadb_msg_pid; - - pfkey_broadcast(out_skb, GFP_ATOMIC, BROADCAST_ALL, sk); + if (hdr->sadb_msg_type == SADB_ADD) + c.event = XFRM_SAP_ADDED; + else + c.event = XFRM_SAP_UPDATED; + c.seq = hdr->sadb_msg_seq; + c.pid = hdr->sadb_msg_pid; + km_state_notify(x, &c); - return 0; + return err; } static int pfkey_delete(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs) { struct xfrm_state *x; + struct km_event c; + int err; if (!ext_hdrs[SADB_EXT_SA-1] || !present_and_same_family(ext_hdrs[SADB_EXT_ADDRESS_SRC-1], @@ -1301,13 +1368,20 @@ return -EPERM; } - xfrm_state_delete(x); - xfrm_state_put(x); + err = xfrm_state_delete(x); + if (err < 0) { + x->km.state = XFRM_STATE_DEAD; + xfrm_state_put(x); + return err; + } - pfkey_broadcast(skb_clone(skb, GFP_KERNEL), GFP_KERNEL, - BROADCAST_ALL, sk); + c.seq = hdr->sadb_msg_seq; + c.pid = hdr->sadb_msg_pid; + c.event = XFRM_SAP_DELETED; + km_state_notify(x, &c); + xfrm_state_put(x); - return 0; + return err; } static int pfkey_get(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs) @@ -1445,28 +1519,42 @@ return 0; } +static int key_notify_sa_flush(struct km_event *c) +{ + struct sk_buff *skb; + struct sadb_msg *hdr; + + skb = alloc_skb(sizeof(struct sadb_msg) + 16, GFP_KERNEL); + if (!skb) + return -ENOBUFS; + hdr = (struct sadb_msg *) skb_put(skb, sizeof(struct sadb_msg)); + // XXX:do we have to pass proto as well? 
+ hdr->sadb_msg_seq = c->seq; + hdr->sadb_msg_pid = c->pid; + hdr->sadb_msg_version = PF_KEY_V2; + hdr->sadb_msg_errno = (uint8_t) 0; + hdr->sadb_msg_len = (sizeof(struct sadb_msg) / sizeof(uint64_t)); + + pfkey_broadcast(skb, GFP_KERNEL, BROADCAST_ALL, NULL); + + return 0; +} + static int pfkey_flush(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs) { unsigned proto; - struct sk_buff *skb_out; - struct sadb_msg *hdr_out; + struct km_event c; proto = pfkey_satype2proto(hdr->sadb_msg_satype); if (proto == 0) return -EINVAL; - skb_out = alloc_skb(sizeof(struct sadb_msg) + 16, GFP_KERNEL); - if (!skb_out) - return -ENOBUFS; - xfrm_state_flush(proto); - - hdr_out = (struct sadb_msg *) skb_put(skb_out, sizeof(struct sadb_msg)); - pfkey_hdr_dup(hdr_out, hdr); - hdr_out->sadb_msg_errno = (uint8_t) 0; - hdr_out->sadb_msg_len = (sizeof(struct sadb_msg) / sizeof(uint64_t)); - - pfkey_broadcast(skb_out, GFP_KERNEL, BROADCAST_ALL, NULL); + c.data = proto; + c.seq = hdr->sadb_msg_seq; + c.pid = hdr->sadb_msg_pid; + c.event = XFRM_SAP_FLUSHED; + km_state_notify(NULL, &c); return 0; } @@ -1859,6 +1947,31 @@ hdr->sadb_msg_reserved = atomic_read(&xp->refcnt); } +static int key_notify_policy( struct xfrm_policy *xp, int dir, struct km_event *c) +{ + struct sk_buff *out_skb; + struct sadb_msg *out_hdr; + int err; + + out_skb = pfkey_xfrm_policy2msg_prep(xp); + if (IS_ERR(out_skb)) { + err = PTR_ERR(out_skb); + goto out; + } + pfkey_xfrm_policy2msg(out_skb, xp, dir); + + out_hdr = (struct sadb_msg *) out_skb->data; + out_hdr->sadb_msg_version = PF_KEY_V2; + out_hdr->sadb_msg_type = event2poltype(c->event); + out_hdr->sadb_msg_errno = 0; + out_hdr->sadb_msg_seq = c->seq; + out_hdr->sadb_msg_pid = c->pid; + pfkey_broadcast(out_skb, GFP_ATOMIC, BROADCAST_ALL, NULL); +out: + return 0; + +} + static int pfkey_spdadd(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs) { int err; @@ -1866,8 +1979,7 @@ struct sadb_address *sa; struct 
sadb_x_policy *pol; struct xfrm_policy *xp; - struct sk_buff *out_skb; - struct sadb_msg *out_hdr; + struct km_event c; if (!present_and_same_family(ext_hdrs[SADB_EXT_ADDRESS_SRC-1], ext_hdrs[SADB_EXT_ADDRESS_DST-1]) || @@ -1935,31 +2047,25 @@ (err = parse_ipsecrequests(xp, pol)) < 0) goto out; - out_skb = pfkey_xfrm_policy2msg_prep(xp); - if (IS_ERR(out_skb)) { - err = PTR_ERR(out_skb); - goto out; - } err = xfrm_policy_insert(pol->sadb_x_policy_dir-1, xp, hdr->sadb_msg_type != SADB_X_SPDUPDATE); + if (err) { - kfree_skb(out_skb); - goto out; + kfree(xp); + return err; } - pfkey_xfrm_policy2msg(out_skb, xp, pol->sadb_x_policy_dir-1); + if (hdr->sadb_msg_type == SADB_X_SPDUPDATE) + c.event = XFRM_SAP_UPDATED; + else + c.event = XFRM_SAP_ADDED; - xfrm_pol_put(xp); + c.seq = hdr->sadb_msg_seq; + c.pid = hdr->sadb_msg_pid; - out_hdr = (struct sadb_msg *) out_skb->data; - out_hdr->sadb_msg_version = hdr->sadb_msg_version; - out_hdr->sadb_msg_type = hdr->sadb_msg_type; - out_hdr->sadb_msg_satype = 0; - out_hdr->sadb_msg_errno = 0; - out_hdr->sadb_msg_seq = hdr->sadb_msg_seq; - out_hdr->sadb_msg_pid = hdr->sadb_msg_pid; - pfkey_broadcast(out_skb, GFP_ATOMIC, BROADCAST_ALL, sk); + km_policy_notify(xp, pol->sadb_x_policy_dir-1, &c); + xfrm_pol_put(xp); return 0; out: @@ -1973,9 +2079,8 @@ struct sadb_address *sa; struct sadb_x_policy *pol; struct xfrm_policy *xp; - struct sk_buff *out_skb; - struct sadb_msg *out_hdr; struct xfrm_selector sel; + struct km_event c; if (!present_and_same_family(ext_hdrs[SADB_EXT_ADDRESS_SRC-1], ext_hdrs[SADB_EXT_ADDRESS_DST-1]) || @@ -2010,24 +2115,11 @@ err = 0; - out_skb = pfkey_xfrm_policy2msg_prep(xp); - if (IS_ERR(out_skb)) { - err = PTR_ERR(out_skb); - goto out; - } - pfkey_xfrm_policy2msg(out_skb, xp, pol->sadb_x_policy_dir-1); - - out_hdr = (struct sadb_msg *) out_skb->data; - out_hdr->sadb_msg_version = hdr->sadb_msg_version; - out_hdr->sadb_msg_type = SADB_X_SPDDELETE; - out_hdr->sadb_msg_satype = 0; - out_hdr->sadb_msg_errno = 0; - 
out_hdr->sadb_msg_seq = hdr->sadb_msg_seq; - out_hdr->sadb_msg_pid = hdr->sadb_msg_pid; - pfkey_broadcast(out_skb, GFP_ATOMIC, BROADCAST_ALL, sk); - err = 0; + c.seq = hdr->sadb_msg_seq; + c.pid = hdr->sadb_msg_pid; + c.event = XFRM_SAP_DELETED; + km_policy_notify(xp, pol->sadb_x_policy_dir-1, &c); -out: xfrm_pol_put(xp); return err; } @@ -2037,8 +2129,7 @@ int err; struct sadb_x_policy *pol; struct xfrm_policy *xp; - struct sk_buff *out_skb; - struct sadb_msg *out_hdr; + struct km_event c; if ((pol = ext_hdrs[SADB_X_EXT_POLICY-1]) == NULL) return -EINVAL; @@ -2050,24 +2141,19 @@ err = 0; - out_skb = pfkey_xfrm_policy2msg_prep(xp); - if (IS_ERR(out_skb)) { - err = PTR_ERR(out_skb); - goto out; + /* + * XXX: previous get was doing a broadcast-all _always_ + * which didnt seem right for non-deletion case - JHS + * This is like the way netlink behaves .. + * Shall i restore original behavior? + */ + if (hdr->sadb_msg_type == SADB_X_SPDDELETE2) { + c.seq = hdr->sadb_msg_seq; + c.pid = hdr->sadb_msg_pid; + c.event = XFRM_SAP_DELETED; + km_policy_notify(xp, pol->sadb_x_policy_dir-1, &c); } - pfkey_xfrm_policy2msg(out_skb, xp, pol->sadb_x_policy_dir-1); - - out_hdr = (struct sadb_msg *) out_skb->data; - out_hdr->sadb_msg_version = hdr->sadb_msg_version; - out_hdr->sadb_msg_type = hdr->sadb_msg_type; - out_hdr->sadb_msg_satype = 0; - out_hdr->sadb_msg_errno = 0; - out_hdr->sadb_msg_seq = hdr->sadb_msg_seq; - out_hdr->sadb_msg_pid = hdr->sadb_msg_pid; - pfkey_broadcast(out_skb, GFP_ATOMIC, BROADCAST_ALL, sk); - err = 0; -out: xfrm_pol_put(xp); return err; } @@ -2102,22 +2188,33 @@ return xfrm_policy_walk(dump_sp, &data); } -static int pfkey_spdflush(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs) +static int key_notify_policy_flush(struct km_event *c) { struct sk_buff *skb_out; - struct sadb_msg *hdr_out; - - skb_out = alloc_skb(sizeof(struct sadb_msg) + 16, GFP_KERNEL); + struct sadb_msg *hdr; + skb_out = alloc_skb(sizeof(struct sadb_msg) + 16, 
GFP_ATOMIC); if (!skb_out) return -ENOBUFS; + hdr = (struct sadb_msg *) skb_put(skb_out, sizeof(struct sadb_msg)); + hdr->sadb_msg_seq = c->seq; + hdr->sadb_msg_pid = c->pid; + hdr->sadb_msg_version = PF_KEY_V2; + hdr->sadb_msg_errno = (uint8_t) 0; + hdr->sadb_msg_len = (sizeof(struct sadb_msg) / sizeof(uint64_t)); + pfkey_broadcast(skb_out, GFP_KERNEL, BROADCAST_ALL, NULL); + return 0; - xfrm_policy_flush(); +} - hdr_out = (struct sadb_msg *) skb_put(skb_out, sizeof(struct sadb_msg)); - pfkey_hdr_dup(hdr_out, hdr); - hdr_out->sadb_msg_errno = (uint8_t) 0; - hdr_out->sadb_msg_len = (sizeof(struct sadb_msg) / sizeof(uint64_t)); - pfkey_broadcast(skb_out, GFP_KERNEL, BROADCAST_ALL, NULL); +static int pfkey_spdflush(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs) +{ + struct km_event c; + + xfrm_policy_flush(); + c.event = XFRM_SAP_FLUSHED; + c.pid = hdr->sadb_msg_pid; + c.seq = hdr->sadb_msg_seq; + km_policy_notify(NULL, 0, &c); return 0; } @@ -2317,11 +2414,25 @@ } } -static int pfkey_send_notify(struct xfrm_state *x, int hard) +/* XXX: Noisy for now */ +static int key_notify_policy_expire(struct xfrm_policy *xp, struct km_event *c) +{ + printk("pfkey doesnt deal with expired policies ..\n"); + return 0; +} + +static int key_notify_sa_expire(struct xfrm_state *x, struct km_event *c) { struct sk_buff *out_skb; struct sadb_msg *out_hdr; - int hsc = (hard ? 
2 : 1); + int hard; + int hsc; + + hard = c->data; + if (hard) + hsc = 2; + else + hsc = 1; out_skb = pfkey_xfrm_state2msg(x, 0, hsc); if (IS_ERR(out_skb)) @@ -2340,6 +2451,43 @@ return 0; } +static int pfkey_send_notify(struct xfrm_state *x, struct km_event *c) +{ + switch (c->event) { + case XFRM_SAP_EXPIRED: + return key_notify_sa_expire(x, c); + case XFRM_SAP_DELETED: + case XFRM_SAP_ADDED: + case XFRM_SAP_UPDATED: + return key_notify_sa(x, c); + case XFRM_SAP_FLUSHED: + return key_notify_sa_flush(c); + default: + printk("pfkey: Unknown SA event %d\n",c->event); + break; + } + + return 0; +} + +static int pfkey_send_policy_notify(struct xfrm_policy *xp, int dir, struct km_event *c) +{ + switch (c->event) { + case XFRM_SAP_EXPIRED: + return key_notify_policy_expire(xp, c); + case XFRM_SAP_DELETED: + case XFRM_SAP_ADDED: + case XFRM_SAP_UPDATED: + return key_notify_policy(xp, dir, c); + case XFRM_SAP_FLUSHED: + return key_notify_policy_flush(c); + default: + printk("pfkey: Unknown policy event %d\n",c->event); + break; + } + + return 0; +} static u32 get_acqseq(void) { u32 res; @@ -2856,6 +3004,7 @@ .acquire = pfkey_send_acquire, .compile_policy = pfkey_compile_policy, .new_mapping = pfkey_send_new_mapping, + .notify_policy = pfkey_send_policy_notify, }; static void __exit ipsec_pfkey_exit(void) --=-g5r6p/Y+YZcaZoLnsgWz-- From liontooth@cogweb.net Sat Apr 2 11:25:49 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 11:25:57 -0800 (PST) Received: from weber.sscnet.ucla.edu (weber.sscnet.ucla.edu [128.97.42.3]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32JPnCI027930 for ; Sat, 2 Apr 2005 11:25:49 -0800 Received: from localhost (localhost [127.0.0.1]) by weber.sscnet.ucla.edu (8.13.4/8.13.4) with ESMTP id j32JPj40013828; Sat, 2 Apr 2005 11:25:45 -0800 (PST) Received: from weber.sscnet.ucla.edu ([127.0.0.1]) by localhost (weber [127.0.0.1]) (amavisd-new, port 10024) with LMTP id 12686-02; Sat, 2 Apr 2005 11:25:45 -0800 (PST) Received: from 
[128.97.221.35] (clitunno.sscnet.ucla.edu [128.97.221.35]) by weber.sscnet.ucla.edu (8.13.4/8.13.4) with ESMTP id j32JPKtv013762; Sat, 2 Apr 2005 11:25:20 -0800 (PST) Message-ID: <424EF19B.7030105@cogweb.net> Date: Sat, 02 Apr 2005 11:25:15 -0800 From: David Liontooth User-Agent: Debian Thunderbird 1.0 (X11/20050118) X-Accept-Language: en-us, en MIME-Version: 1.0 To: venza@brownhat.org, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: ICS1883 LAN PHY not detected X-Enigmail-Version: 0.90.0.0 X-Enigmail-Supports: pgp-inline, pgp-mime Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Scanned: by amavisd-new at weber.sscnet.ucla.edu X-Virus-Status: Clean X-archive-position: 1273 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: liontooth@cogweb.net Precedence: bulk X-list: netdev Gigabyte's K8NS Ultra-939 mobo has a 100/10 LAN PHY chip, ICS1883, which isn't detected by the 2.6.12-rc1 kernel (and likely not previous kernels). http://www.giga-byte.com/MotherBoard/Products/Products_Spec_GA-K8NS%20Ultra-939.htm On the other hand, the ports light up when connected. The device may be similar to ICS1893, which is supported by the sis900 driver. However, I figure the device first has to be detected? Any advice appreciated. 
Dave # lspci 0000:00:00.0 Host bridge: nVidia Corporation: Unknown device 00e1 (rev a1) 0000:00:01.0 ISA bridge: nVidia Corporation: Unknown device 00e0 (rev a2) 0000:00:01.1 SMBus: nVidia Corporation: Unknown device 00e4 (rev a1) 0000:00:02.0 USB Controller: nVidia Corporation: Unknown device 00e7 (rev a1) 0000:00:02.1 USB Controller: nVidia Corporation: Unknown device 00e7 (rev a1) 0000:00:02.2 USB Controller: nVidia Corporation: Unknown device 00e8 (rev a2) 0000:00:05.0 Bridge: nVidia Corporation: Unknown device 00df (rev a2) 0000:00:06.0 Multimedia audio controller: nVidia Corporation: Unknown device 00ea (rev a1) 0000:00:08.0 IDE interface: nVidia Corporation: Unknown device 00e5 (rev a2) 0000:00:0a.0 IDE interface: nVidia Corporation: Unknown device 00e3 (rev a2) 0000:00:0b.0 PCI bridge: nVidia Corporation: Unknown device 00e2 (rev a2) 0000:00:0e.0 PCI bridge: nVidia Corporation: Unknown device 00ed (rev a2) 0000:00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge 0000:00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge 0000:00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge 0000:00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge 0000:02:0b.0 Ethernet controller: Marvell Technology Group Ltd. Yukon Gigabit Ethernet 10/100/1000Base-T Adapter (rev 13) 0000:02:0d.0 Unknown mass storage controller: Silicon Image, Inc. 
(formerly CMD Technology Inc)SiI 3512 [SATALink/SATARaid] Serial ATA Controller (rev 01) 0000:02:0e.0 FireWire (IEEE 1394): Texas Instruments TSB82AA2 IEEE-1394b Link Layer Controller (rev 01) From herbert@gondor.apana.org.au Sat Apr 2 11:33:22 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 11:33:31 -0800 (PST) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32JXLwe028654 for ; Sat, 2 Apr 2005 11:33:22 -0800 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHoMW-0006q1-00; Sun, 03 Apr 2005 05:32:52 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHoM5-0006YU-00; Sun, 03 Apr 2005 05:32:25 +1000 Date: Sun, 3 Apr 2005 05:32:24 +1000 To: Robert Olsson Cc: Eric Dumazet , davem@davemloft.net, netdev@oss.sgi.com Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() Message-ID: <20050402193224.GA25157@gondor.apana.org.au> References: <424E641A.1020609@cosmosbay.com> <16974.41648.568927.54429@robur.slu.se> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <16974.41648.568927.54429@robur.slu.se> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1274 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev On Sat, Apr 02, 2005 at 03:48:32PM +0200, Robert Olsson wrote: > > > Crashes usually occurs when secret_interval interval is elapsed : rt_cache_flush(0); is called, and the whole machine begins to die. > > A good idea to increase the secret_interval interval but it should survive. Incidentally we should change the way the rehashing is triggered. 
Instead of doing it regularly, we can do it when we notice that a specific hash chain grows beyond a certain size. The idea is that if someone is attacking our hash then they can only do so by lengthening the chains. If they're not doing that then even if they knew how to attack us we don't really care. Of course when it does happen it'll still kill your machine unless we can find a way to amortise this. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From herbert@gondor.apana.org.au Sat Apr 2 11:36:28 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 11:36:34 -0800 (PST) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32JaQoK029236 for ; Sat, 2 Apr 2005 11:36:27 -0800 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHoPX-0006qk-00; Sun, 03 Apr 2005 05:35:59 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHoPL-0006ZD-00; Sun, 03 Apr 2005 05:35:47 +1000 Date: Sun, 3 Apr 2005 05:35:47 +1000 To: John Heffner Cc: davem@davemloft.net, netdev@oss.sgi.com Subject: Re: [PATCH] skb pcount with MTU discovery Message-ID: <20050402193547.GB25157@gondor.apana.org.au> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1275 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev On Sat, Apr 02, 2005 at 10:32:32AM -0500, John Heffner wrote: > On Sat, 2 Apr 2005, Herbert Xu 
wrote: > > > How about fixing tcp_snd_test directly like this? > > I tried that first, but it caused a panic. I assumed some other point in > the code assumed that invariant that if TSO is disabled then tso_segs==1. > I didn't investigate though. Do you remember what the panic looked like? Perhaps it was because tso_segs wasn't set at all? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From davem@davemloft.net Sat Apr 2 11:56:27 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 11:56:33 -0800 (PST) Received: from cheetah.davemloft.net (mail@dsl027-180-174.sfo1.dsl.speakeasy.net [216.27.180.174]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32JuRLZ030268 for ; Sat, 2 Apr 2005 11:56:27 -0800 Received: from localhost ([127.0.0.1] helo=cheetah.davemloft.net ident=davem) by cheetah.davemloft.net with smtp (Exim 3.36 #1 (Debian)) id 1DHoiO-0005Rw-00; Sat, 02 Apr 2005 11:55:28 -0800 Date: Sat, 2 Apr 2005 11:55:28 -0800 From: "David S. 
Miller" To: Herbert Xu Cc: Robert.Olsson@data.slu.se, dada1@cosmosbay.com, netdev@oss.sgi.com Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() Message-Id: <20050402115528.11f71a3c.davem@davemloft.net> In-Reply-To: <20050402193224.GA25157@gondor.apana.org.au> References: <424E641A.1020609@cosmosbay.com> <16974.41648.568927.54429@robur.slu.se> <20050402193224.GA25157@gondor.apana.org.au> X-Mailer: Sylpheed version 1.0.4 (GTK+ 1.2.10; sparc-unknown-linux-gnu) X-Face: "_;p5u5aPsO,_Vsx"^v-pEq09'CU4&Dc1$fQExov$62l60cgCc%FnIwD=.UF^a>?5'9Kn[;433QFVV9M..2eN.@4ZWPGbdi<=?[:T>y?SD(R*-3It"Vj:)"dP Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1276 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@davemloft.net Precedence: bulk X-list: netdev On Sun, 3 Apr 2005 05:32:24 +1000 Herbert Xu wrote: > On Sat, Apr 02, 2005 at 03:48:32PM +0200, Robert Olsson wrote: > > > > > Crashes usually occurs when secret_interval interval is elapsed : rt_cache_flush(0); is called, and the whole machine begins to die. > > > > A good idea to increase the secret_interval interval but it should survive. > > Incidentally we should change the way the rehashing is triggered. > Instead of doing it regularly, we can do it when we notice that a > specific hash chain grows beyond a certain size. > > The idea is that if someone is attacking our hash then they can > only do so by lengthening the chains. If they're not doing that > then even if they knew how to attack us we don't really care. Yes, the secret_interval is way too short. It is a very paranoid default value selected when initially fixing that DoS. I think we should, in the short term, increase the secret interval where it exists in the tree (netfilter conntrack is another instance for example). 
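Herbert's suggestion quoted above, rehashing only when a specific chain grows beyond a certain size rather than on a timer, can be sketched in user space roughly as follows. This is a toy illustration, not the route cache code: the table layout, the hash function, the CHAIN_MAX threshold and every function name are assumptions made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS  8u  /* toy table size, power of two */
#define CHAIN_MAX 4u  /* hypothetical chain length that triggers a rehash */

struct entry { uint32_t key; struct entry *next; };

struct table {
	struct entry *bucket[NBUCKETS];
	uint32_t secret;    /* mixed into the hash, replaced on rehash */
	unsigned rehashes;  /* number of times the secret was replaced */
};

/* Toy keyed hash (not the kernel's): the bucket depends on the secret. */
static unsigned hash_key(uint32_t key, uint32_t secret)
{
	uint32_t h = (key ^ secret) * 2654435761u;
	return (h >> 16) & (NBUCKETS - 1u);
}

/* Link e at the head of its chain and return the new chain length. */
static unsigned chain_push(struct table *t, struct entry *e)
{
	unsigned idx = hash_key(e->key, t->secret), len = 0;
	struct entry *p;

	e->next = t->bucket[idx];
	t->bucket[idx] = e;
	for (p = e; p; p = p->next)
		len++;
	return len;
}

/* Replace the secret and redistribute every entry over the buckets. */
static void table_rehash(struct table *t, uint32_t new_secret)
{
	struct entry *all = NULL, *e;
	unsigned i;

	for (i = 0; i < NBUCKETS; i++)
		while ((e = t->bucket[i]) != NULL) {
			t->bucket[i] = e->next;
			e->next = all;
			all = e;
		}
	t->secret = new_secret;
	t->rehashes++;
	while ((e = all) != NULL) {
		all = e->next;
		chain_push(t, e);
	}
}

/* Insert a key; rehash only if its chain grew beyond CHAIN_MAX. */
static int table_insert(struct table *t, uint32_t key)
{
	struct entry *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->key = key;
	if (chain_push(t, e) > CHAIN_MAX)
		table_rehash(t, t->secret * 1103515245u + 12345u);
	return 0;
}

/* Insert n keys that all collide under the initial secret; count rehashes. */
static unsigned demo_rehash_count(unsigned n)
{
	struct table t = { .secret = 0 };
	struct entry *e;
	uint32_t key;
	unsigned i, done = 0;

	for (key = 0; done < n; key++)
		if (hash_key(key, 0) == 0 && table_insert(&t, key) == 0)
			done++;
	for (i = 0; i < NBUCKETS; i++)
		while ((e = t.bucket[i]) != NULL) {
			t.bucket[i] = e->next;
			free(e);
		}
	return t.rehashes;
}
```

With these assumed values, four colliding inserts leave the secret alone and the fifth one triggers a single rehash. The cost of that rehash is still paid all at once, which is the amortisation problem Herbert mentions.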
From hadi@cyberus.ca Sat Apr 2 12:47:44 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 12:47:49 -0800 (PST) Received: from mx03.cybersurf.com (mx03.cybersurf.com [209.197.145.106]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32KlhrM003188 for ; Sat, 2 Apr 2005 12:47:44 -0800 Received: from mail.cyberus.ca ([209.197.145.21]) by mx03.cybersurf.com with esmtp (Exim 4.30) id 1DHpWx-000554-Hj for netdev@oss.sgi.com; Sat, 02 Apr 2005 15:47:43 -0500 Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DHpWt-00064y-PT; Sat, 02 Apr 2005 15:47:40 -0500 Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() From: jamal Reply-To: hadi@cyberus.ca To: Eric Dumazet Cc: Robert Olsson , Herbert Xu , "David S. Miller" , netdev In-Reply-To: <424EA7C2.6060308@cosmosbay.com> References: <424E641A.1020609@cosmosbay.com> <16974.41648.568927.54429@robur.slu.se> <424EA7C2.6060308@cosmosbay.com> Content-Type: text/plain; charset=ISO-8859-1 Organization: jamalopolous Message-Id: <1112474855.1096.274.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 02 Apr 2005 15:47:36 -0500 Content-Transfer-Encoding: 8bit X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1277 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Sat, 2005-04-02 at 09:10, Eric Dumazet wrote: > Robert Olsson wrote: > > Eric Dumazet writes: > > > Yes, that's quite a load. Very short flows for some reason? > > Well... yes. This is a real server, not a DOS simulation. > 1 million TCP flows, and about 3 million peers using UDP frames. SMP? How many processors? 
cheers, jamal From hadi@cyberus.ca Sat Apr 2 13:06:05 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 13:06:09 -0800 (PST) Received: from mx02.cybersurf.com (mx02.cybersurf.com [209.197.145.105]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32L64CT004180 for ; Sat, 2 Apr 2005 13:06:04 -0800 Received: from mail.cyberus.ca ([209.197.145.21]) by mx02.cybersurf.com with esmtp (Exim 4.30) id 1DHpog-0001SQ-QE for netdev@oss.sgi.com; Sat, 02 Apr 2005 16:06:02 -0500 Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DHpoe-00081M-08; Sat, 02 Apr 2005 16:06:00 -0500 Subject: Re: Get rid of rt_check_expire and rt_garbage_collect From: jamal Reply-To: hadi@cyberus.ca To: Herbert Xu Cc: Eric Dumazet , "David S. Miller" , netdev , Robert Olsson In-Reply-To: <20050402112304.GA11321@gondor.apana.org.au> References: <424E641A.1020609@cosmosbay.com> <20050402112304.GA11321@gondor.apana.org.au> Content-Type: text/plain Organization: jamalopolous Message-Id: <1112475955.1088.294.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 02 Apr 2005 16:05:55 -0500 Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1278 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Sat, 2005-04-02 at 06:23, Herbert Xu wrote: > On Sat, Apr 02, 2005 at 11:21:30AM +0200, Eric Dumazet wrote: > > > > Well, I began my work because of the overflow bug in rt_check_expire()... > > Then I realize this function could not work as expected. On a loaded > > machine, one timer tick is 1 ms. > > During this time, number of chains that are scanned is ridiculous. > > With the standard timer of 60 second, fact is rt_check_expire() is useless. > > I see. 
What we've got here is a scalability problem with respect > to the number of hash buckets. As the number of buckets increases, > the amount of work the timer GC has to perform increases proportionally. > It's a classical incremental garbage collection algorithm that's being used, i.e. something along the lines of what's typically referred to as mark-and-sweep. Could the main issue be not the number of routes in the cache but rather the locking as the number of CPUs goes up? Increasing the timer frequency would certainly help, but may have adverse effects if the frequency is too high, because of the system-wide locking, IMO. > Since the timer GC parameters are fixed, this will eventually break. > > Rather than changing the timer GC so that it runs more often to keep > up with the large routing cache, we should get out of this by reducing > the amount of work we have to do. > Refer to my hint above: perhaps per-CPU caches? > Imagine an ideal balanced hash table with 2.6 million entries. That > is, all incoming/outgoing packets belong to flows that are already in > the hash table. Imagine also that there is no PMTU/link failure taking > place so all entries are valid forever. > > > In this state there is absolutely no need to execute the timer GC. > Yeah, but memory is finite, friend. True, if you can imagine infinite memory, we would not need GC ;-> > Let's remove one of those assumptions and allow there to be entries > which need to expire after a set period. > Instead of having the timer GC clean them up, we can move the expire > check to the place where the entries are used. That is, we make > ip_route_input/ip_route_output/ipv4_dst_check check whether the > entry has expired. > If you can show that lock grabbing is the main point of contention (I believe it is as the number of CPUs goes up), then this is a valuable idea, since you are already grabbing the locks anyway. > On the face of it we're doing more work since every routing cache > hit will need to check the validity of the dst. 
However, because > it's a single subtraction it is actually pretty cheap. There is > also no additional cache miss compared to doing it in the timer > GC since we have to read the dst anyway. > In the case of a slower machine, the compute is also an issue. To be honest, I feel like I am handwaving - experimenting and collecting profiles would help nail it. > Let's go one step further and make the routing cache come to life. > Now there are new entries coming in and we need to remove old ones > in order to make room for them. > > That task is currently carried out by the timer GC in rt_check_expire > and on demand by rt_garbage_collect. Either way we have to walk the > entire routing cache looking for entries to get rid of. > We don't really walk the whole route cache every time - I am sure you know that. > This is quite expensive when the routing cache is large. However, > there is a better way. > > The reason we keep a cap on the routing cache (for a given hash size) > is so that individual chains do not degenerate into long linked lists. > > In other words, we don't really care about how many entries there are > in the routing cache. But we do care about how long each hash chain > is. > > So instead of walking the entire routing cache to keep the number of > entries down, what we should do is keep each hash chain as short as > possible. > That's certainly one solution .. reading on to see how you achieve this .. > Assuming that the hash function is good, this should achieve the > same end result. > > Here is how it can be done: Every time a routing entry is inserted into > a hash chain, we perform GC on that chain unconditionally. > It may not be a good idea to do it unconditionally - in particular on SMP, where another CPU may be spinning, waiting for you to let go of the bucket lock. If a burst of packets accessing the same bucket shows up on different processors, this would be aggravated. 
You may want to kick in this algorithm only when things start going past a certain threshold. > It might seem that we're doing more work again. However, as before > because we're traversing the chain anyway, it is very cheap to perform > the GC operations which mainly involve the checks in rt_may_expire. > > OK that's enough thinking and it's time to write some code to see > whether this is all bullshit :) > I think there are some good ideas in there; the bottleneck could be either that the locks are too expensive (clearly so on SMP as the number of CPUs goes up) or that the compute is taking too long (clearly so on slower systems, but a general fact of life as well). For the first issue, amortizing the lock grabbing via compute as you suggest may be of value, or else per-CPU caches. cheers, jamal From hadi@cyberus.ca Sat Apr 2 13:09:03 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 13:09:09 -0800 (PST) Received: from mx01.cybersurf.com (mx01.cybersurf.com [209.197.145.104]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32L93mq004713 for ; Sat, 2 Apr 2005 13:09:03 -0800 Received: from mail.cyberus.ca ([209.197.145.21]) by mx01.cybersurf.com with esmtp (Exim 4.30) id 1DHprT-0005n7-Rj for netdev@oss.sgi.com; Sat, 02 Apr 2005 14:08:55 -0700 Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DHprX-0008HI-Mo; Sat, 02 Apr 2005 16:09:00 -0500 Subject: Re: RFC: Redirect-Device From: jamal Reply-To: hadi@cyberus.ca To: Meelis Roos Cc: netdev In-Reply-To: References: Content-Type: text/plain Organization: jamalopolous Message-Id: <1112476135.1087.298.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 02 Apr 2005 16:08:55 -0500 Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1279 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com 
X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Sat, 2005-04-02 at 03:41, Meelis Roos wrote: > j> I must be missing something: What is it that this device can do that the > j> mirred action can't do? > > I know what I am missing here: documentation. There is very basic > documentation about tc qdisc+class+filter level and almost nothing on the > newer features. Without good documentation only some developers > understand it. Have you tried looking at iproute2 doc/examples? There's some new stuff in there. Over time more stuff will be added - and contributions are welcome as well. cheers, jamal From hadi@cyberus.ca Sat Apr 2 13:28:57 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 13:29:01 -0800 (PST) Received: from mx04.cybersurf.com (mx04.cybersurf.com [209.197.145.108]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32LSu4O006074 for ; Sat, 2 Apr 2005 13:28:57 -0800 Received: from mail.cyberus.ca ([209.197.145.21]) by mx04.cybersurf.com with esmtp (Exim 4.30) id 1DHqAo-0005rD-Nc for netdev@oss.sgi.com; Sat, 02 Apr 2005 16:28:54 -0500 Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DHqAm-0001r3-CC; Sat, 02 Apr 2005 16:28:52 -0500 Subject: Re: IPSEC: on behavior of acquire From: jamal Reply-To: hadi@cyberus.ca To: Aidas Kasparas Cc: ipsec-tools-devel@lists.sourceforge.net, netdev , nakam@linux-ipv6.org In-Reply-To: <424E454D.4090402@gmc.lt> References: <1112405303.1096.37.camel@jzny.localdomain> <424E454D.4090402@gmc.lt> Content-Type: text/plain Organization: jamalopolous Message-Id: <1112477326.1088.321.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 02 Apr 2005 16:28:46 -0500 Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1280 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: 
hadi@cyberus.ca Precedence: bulk X-list: netdev On Sat, 2005-04-02 at 02:10, Aidas Kasparas wrote: > > Re 1 try only. There is little sense to do more tries. If there is no > daemon listening to pfkey messages, then no connection will be made no > matter how many retries you'll do. If daemon/link/peer is slow and SA > was not established before timeout expired, then repeated acquire will > be simply ignored (daemon will find out that negotiation is already in > progress, there is no reason to start another negotiation and therefore > will drop that acquire request). And the only situation where repeated > acquires may help is when pfkey messages are lost. Exactly what I was trying to emulate - lost messages. I would not expect losing messages to be the rule - but given there's no guarantee of delivery, messages could be lost. > But pfkey was not > designed to survive message losses, therefore you should not operate your > boxes in mode when lost pfkey messages are a rule, not an exception. And > on the other hand, occasional pfkey message losses can be worked around > by applications/user retry. > I think it's more than just pfkey (or netlink) - rather the ipsec framework itself. One could look at the acquire as part of the "connection" setup (for lack of a better description). Without the acquire succeeding, there's no connection (assuming that to be a policy). Therefore, if the acquire is not supposed to be delivered with some certainty (read: retries), then there are some resiliency issues IMO. Note: Sometimes there's no app. For example, a packet coming into a gateway. > Re error code returned. Error codes returned by pfkey never were > perfect. But your experiment is not perfect too. You sent pings with no > KE daemon running. Note what my goals were. 
> pfkey code found that there is nothing receiving > acquire messages => there is no chance that any process will setup > required SAs and tried to inform about that (I agree, return code is not > very informative, at least until you learn about reasons why it is > such). If you would have racoon (or other pfkey based ISAKMP daemon) > running, you would get "resource temporarily unavailable" (don't know > which error code corresponds to that message), which IMHO is ok (if it > is not, please explain). > Haven't tried that - the reason I said restart was the right signal was mainly that an app could translate that to mean "try again". In other words, even in the case of ping -c1, the ping app could have reattempted. On Sat, 2005-04-02 at 07:25, Zilvinas Valinskas wrote: > EBUSY I think it is. > > I am not entirely sure it is ok to return such error, some applications are > not coping nicely with it. Perhaps ECONNREFUSED is more reasonable - as it > doesn't break old apps' assumption (connection cannot be established, > doesn't matter if that is due to routing or IPsec SPD or anything else). > What about ERESTART the way netlink does it right now? ECONNREFUSED is probably not a bad idea. ping was clearly dumb and didn't do anything with the info. Overall, I think the errors are unfortunately not descriptive at all. 
cheers, jamal From tgraf@suug.ch Sat Apr 2 13:36:26 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 13:36:35 -0800 (PST) Received: from b.mx.projectdream.org (eth0-0.arisu.projectdream.org [194.158.4.191]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32LaPOw006809 for ; Sat, 2 Apr 2005 13:36:26 -0800 Received: from postel.suug.ch (postel.suug.ch [195.134.158.23]) (using TLSv1 with cipher EDH-RSA-DES-CBC3-SHA (168/168 bits)) (No client certificate requested) by b.mx.projectdream.org (Postfix) with ESMTP id 58E7BF; Sat, 2 Apr 2005 23:36:02 +0200 (CEST) Received: by postel.suug.ch (Postfix, from userid 10001) id 1FDD41C0EA; Sat, 2 Apr 2005 23:36:43 +0200 (CEST) Date: Sat, 2 Apr 2005 23:36:42 +0200 From: Thomas Graf To: Abhishek Gupta Cc: netdev@oss.sgi.com Subject: Re: Problem using HTB Message-ID: <20050402213642.GO3086@postel.suug.ch> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1281 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: tgraf@suug.ch Precedence: bulk X-list: netdev * Abhishek Gupta 2005-04-01 15:10 > tc class add dev $DEV0 parent 2: classid 2:1 htb rate 100kbit burst 100 \ > ceil 100kbit > [...] > I have configured for 100kbps, I am getting only 12kbps as the link speed. Before I look into this, are you aware of 1kbps=8kbit? 
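The unit point in the question above can be checked with a little arithmetic. The sketch below assumes tc's documented suffix convention, where "kbit" means kilobits per second and "kbps" means kilobytes per second; the helper name kbit_to_kbps is made up for illustration and is not part of tc.

```python
# Assumption: tc suffixes, "kbit" = kilobits per second, "kbps" = kilobytes per second.
# kbit_to_kbps is a hypothetical helper used only for this illustration.

def kbit_to_kbps(rate_kbit: float) -> float:
    """Convert a rate in kilobits/s into kilobytes/s (8 bits per byte)."""
    return rate_kbit / 8.0


configured_kbit = 100  # "htb rate 100kbit ... ceil 100kbit" from the quoted command
print(kbit_to_kbps(configured_kbit))
```

Under that convention a class limited to 100kbit would be expected to top out at about 12.5 kilobytes per second, which is close to the 12kbps figure reported, so the shaper may well be doing what it was told.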
From hadi@cyberus.ca Sat Apr 2 13:43:02 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 13:43:07 -0800 (PST) Received: from mx01.cybersurf.com (mx01.cybersurf.com [209.197.145.104]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32Lh2nq007527 for ; Sat, 2 Apr 2005 13:43:02 -0800 Received: from mail.cyberus.ca ([209.197.145.21]) by mx01.cybersurf.com with esmtp (Exim 4.30) id 1DHqON-0007KX-5Z for netdev@oss.sgi.com; Sat, 02 Apr 2005 14:42:55 -0700 Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DHqOM-00035F-Qn; Sat, 02 Apr 2005 16:42:55 -0500 Subject: Re: IPSEC: on behavior of acquire From: jamal Reply-To: hadi@cyberus.ca To: Alexey Kuznetsov Cc: Herbert Xu , "David S. Miller" , Masahide NAKAMURA , ipsec-tools-devel@lists.sourceforge.net, netdev , kaber@trash.net, jmorris@redhat.com In-Reply-To: <20050402140019.GA13017@yakov.inr.ac.ru> References: <1112405144.1096.33.camel@jzny.localdomain> <20050402140019.GA13017@yakov.inr.ac.ru> Content-Type: text/plain Organization: jamalopolous Message-Id: <1112478168.1088.337.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 02 Apr 2005 16:42:48 -0500 Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1282 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Sat, 2005-04-02 at 09:00, Alexey Kuznetsov wrote: > Hello! > > > a) -ERESTART is the correct signal to return > > Right behaviour is to behave like ARP. A few of packets are queued, > no errors (until timeout), no blocking. Herbert also mentions something along the same lines in his email. This would make a lot of sense! Is the state machine going to look something along the same lines as ARP? i.e incomplete->reachable etc? 
What would be a good code to return when you queue the packet? cheers, jamal From tgraf@suug.ch Sat Apr 2 13:52:38 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 13:52:42 -0800 (PST) Received: from b.mx.projectdream.org (eth0-0.arisu.projectdream.org [194.158.4.191]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32LqbVE008323 for ; Sat, 2 Apr 2005 13:52:38 -0800 Received: from postel.suug.ch (postel.suug.ch [195.134.158.23]) (using TLSv1 with cipher EDH-RSA-DES-CBC3-SHA (168/168 bits)) (No client certificate requested) by b.mx.projectdream.org (Postfix) with ESMTP id 93A0282; Sat, 2 Apr 2005 23:52:14 +0200 (CEST) Received: by postel.suug.ch (Postfix, from userid 10001) id 9099E1C0EA; Sat, 2 Apr 2005 23:52:56 +0200 (CEST) Date: Sat, 2 Apr 2005 23:52:56 +0200 From: Thomas Graf To: jamal Cc: Alexey Kuznetsov , Herbert Xu , "David S. Miller" , Masahide NAKAMURA , ipsec-tools-devel@lists.sourceforge.net, netdev , kaber@trash.net, jmorris@redhat.com Subject: Re: IPSEC: on behavior of acquire Message-ID: <20050402215256.GP3086@postel.suug.ch> References: <1112405144.1096.33.camel@jzny.localdomain> <20050402140019.GA13017@yakov.inr.ac.ru> <1112478168.1088.337.camel@jzny.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1112478168.1088.337.camel@jzny.localdomain> X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1283 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: tgraf@suug.ch Precedence: bulk X-list: netdev * jamal <1112478168.1088.337.camel@jzny.localdomain> 2005-04-02 16:42 > Herbert also mentions something along the same lines in his email. > This would make a lot of sense! > Is the state machine going to look something along the same lines as > ARP? i.e incomplete->reachable etc? > > What would be a good code to return when you queue the packet? 
EINPROGRESS? From juhl-lkml@dif.dk Sat Apr 2 14:36:45 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 14:36:51 -0800 (PST) Received: from saerimmer.dif.dk (mail.dif.dk [193.138.115.101]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j32MaiVo009757 for ; Sat, 2 Apr 2005 14:36:45 -0800 Received: from localhost (localhost [127.0.0.1]) by saerimmer.dif.dk (Postfix) with ESMTP id 9219BFFD23 for ; Sun, 3 Apr 2005 00:46:20 +0200 (CEST) Received: from saerimmer.dif.dk ([127.0.0.1]) by localhost (saerimmer [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 15976-02 for ; Sun, 3 Apr 2005 00:46:19 +0200 (CEST) Received: from diftmgw2.backbone.dif.dk (diftmgw2.backbone.dif.dk [10.227.136.246]) by saerimmer.dif.dk (Postfix) with ESMTP id 445ACFFCA9 for ; Sun, 3 Apr 2005 00:46:19 +0200 (CEST) Received: from DIFPST1A.backbone.dif.dk ([10.227.136.220]) by diftmgw2.backbone.dif.dk with InterScan Messaging Security Suite; Sun, 03 Apr 2005 00:35:29 +0200 Received: from [172.16.2.11] (10.227.136.29 [10.227.136.29]) by DIFPST1A.backbone.dif.dk with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2657.72) id HNMVRDHM; Sun, 3 Apr 2005 00:36:33 +0200 Date: Sun, 3 Apr 2005 00:38:54 +0200 (CEST) From: Jesper Juhl To: Maciej Soltysiak Cc: "James P. Ketrenos" , netdev@oss.sgi.com, "David S. 
Miller" , linux-kernel@vger.kernel.org Subject: Re: [2.6.12-rc1-mm4] swapped memset arguments In-Reply-To: <74334709.20050402233007@dns.toxicfilms.tv> Message-ID: References: <74334709.20050402233007@dns.toxicfilms.tv> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Scanned: amavisd-new at dif.dk X-Virus-Status: Clean X-archive-position: 1284 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: juhl-lkml@dif.dk Precedence: bulk X-list: netdev

On Sat, 2 Apr 2005, Maciej Soltysiak wrote:

> Hi,
>
> out of boredom I grepped 2.6.12-rc1-mm4 for swapped memset arguments.
> I found one:
>
> # grep -nr "memset.*\,\(\ \|\)0\(\ \|\));" *
> net/ieee80211/ieee80211_tx.c:226: memset(txb, sizeof(struct ieee80211_txb), 0);
>

And here's a patch :

Fix swapped memset() arguments in net/ieee80211/ieee80211_tx.c
found by Maciej Soltysiak.

Signed-off-by: Jesper Juhl

--- linux-2.6.12-rc1-mm4-orig/net/ieee80211/ieee80211_tx.c	2005-03-31 21:20:08.000000000 +0200
+++ linux-2.6.12-rc1-mm4/net/ieee80211/ieee80211_tx.c	2005-04-03 00:34:22.000000000 +0200
@@ -223,7 +223,7 @@ struct ieee80211_txb *ieee80211_alloc_tx
 	if (!txb)
 		return NULL;
 
-	memset(txb, sizeof(struct ieee80211_txb), 0);
+	memset(txb, 0, sizeof(struct ieee80211_txb));
 	txb->nr_frags = nr_frags;
 	txb->frag_size = txb_size;
 

From jgarzik@pobox.com Sat Apr 2 17:25:29 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 17:25:34 -0800 (PST) Received: from parcelfarce.linux.theplanet.co.uk (IDENT:93@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j331PShU018360 for ; Sat, 2 Apr 2005 17:25:28 -0800 Received: from cpe-024-025-022-197.nc.res.rr.com ([24.25.22.197] helo=[10.10.10.88]) by parcelfarce.linux.theplanet.co.uk with asmtp (TLSv1:AES256-SHA:256) (Exim 4.33) id 1DHtri-0006Jy-F3; Sun, 03 Apr 2005 02:25:26 +0100 
Message-ID: <424F45F0.1000504@pobox.com> Date: Sat, 02 Apr 2005 20:25:04 -0500 From: Jeff Garzik User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050328 Fedora/1.7.6-1.2.5 X-Accept-Language: en-us, en MIME-Version: 1.0 To: David Liontooth CC: venza@brownhat.org, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: ICS1883 LAN PHY not detected References: <424EF19B.7030105@cogweb.net> In-Reply-To: <424EF19B.7030105@cogweb.net> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1286 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev David Liontooth wrote: > 0000:02:0b.0 Ethernet controller: Marvell Technology Group Ltd. Yukon > Gigabit Ethernet 10/100/1000Base-T Adapter (rev 13) You want the sk98lin or skge drivers. Jeff From grundler@lackof.org Sat Apr 2 17:24:49 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 17:25:03 -0800 (PST) Received: from colo.lackof.org (colo.lackof.org [198.49.126.79]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j331OnPg018306 for ; Sat, 2 Apr 2005 17:24:49 -0800 Received: from localhost (localhost [127.0.0.1]) by colo.lackof.org (Postfix) with ESMTP id 4727429802F; Sat, 2 Apr 2005 18:26:37 -0700 (MST) Received: from colo.lackof.org ([127.0.0.1]) by localhost (colo.lackof.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 04205-03; Sat, 2 Apr 2005 18:26:35 -0700 (MST) Received: by colo.lackof.org (Postfix, from userid 27253) id C5BFF298010; Sat, 2 Apr 2005 18:26:35 -0700 (MST) Date: Sat, 2 Apr 2005 18:26:35 -0700 From: Grant Grundler To: jaganav@us.ibm.com Cc: Greg KH , Stephen Hemminger , Roland Dreier , Benjamin LaHaise , Dmitry Yusupov , open-iscsi@googlegroups.com, "David S. 
Miller" , mpm@selenic.com, andrea@suse.de, michaelc@cs.wisc.edu, James.Bottomley@HansenPartnership.com, ksummit-2005-discuss@thunk.org, netdev@oss.sgi.com, bmt@zurich.ibm.com Subject: Re: Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics) Message-ID: <20050403012635.GA4218@colo.lackof.org> References: <1112426991.424e49ef57e2b@imap.linux.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1112426991.424e49ef57e2b@imap.linux.ibm.com> X-Home-Page: http://www.parisc-linux.org/ User-Agent: Mutt/1.5.6+20040907i X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Scanned: by amavisd-new-20030616-p10 (Debian) at lackof.org X-Virus-Status: Clean X-archive-position: 1285 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: grundler@parisc-linux.org Precedence: bulk X-list: netdev On Sat, Apr 02, 2005 at 02:29:51AM -0500, jaganav@us.ibm.com wrote: > If this dual license is a concern to other kernel developers as well from > contributing to OpenRDMA, we would seriously consider this and discuss > with the adapter vendors. I'm not concerned with it. If *BSD can thrive with it's license, I don't see why it's a problem for linux. HP is going to pay me to work on the code regardless of the license. Projects I work on privately happen to be GPL though I'm not religous about it. If people choose NOT to volunteer time/effort on dual licensed code, I understand and respect that. There are enough worthy GPL only projects out there. I'm speaking for myself and NOT for HP. 
grant From liontooth@cogweb.net Sat Apr 2 21:28:36 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 21:28:41 -0800 (PST) Received: from weber.sscnet.ucla.edu (weber.sscnet.ucla.edu [128.97.42.3]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j335SZJl028429 for ; Sat, 2 Apr 2005 21:28:35 -0800 Received: from localhost (localhost [127.0.0.1]) by weber.sscnet.ucla.edu (8.13.4/8.13.4) with ESMTP id j335SZTF008913; Sat, 2 Apr 2005 21:28:35 -0800 (PST) Received: from weber.sscnet.ucla.edu ([127.0.0.1]) by localhost (weber [127.0.0.1]) (amavisd-new, port 10024) with LMTP id 08242-01; Sat, 2 Apr 2005 21:28:35 -0800 (PST) Received: from [128.97.221.35] (clitunno.sscnet.ucla.edu [128.97.221.35]) by weber.sscnet.ucla.edu (8.13.4/8.13.4) with ESMTP id j335RdWF008432; Sat, 2 Apr 2005 21:27:40 -0800 (PST) Message-ID: <424F7EC4.1000107@cogweb.net> Date: Sat, 02 Apr 2005 21:27:32 -0800 From: David Liontooth User-Agent: Debian Thunderbird 1.0 (X11/20050118) X-Accept-Language: en-us, en MIME-Version: 1.0 To: Jeff Garzik CC: venza@brownhat.org, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: ICS1883 LAN PHY not detected References: <424EF19B.7030105@cogweb.net> <424F45F0.1000504@pobox.com> In-Reply-To: <424F45F0.1000504@pobox.com> X-Enigmail-Version: 0.90.0.0 X-Enigmail-Supports: pgp-inline, pgp-mime Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Scanned: by amavisd-new at weber.sscnet.ucla.edu X-Virus-Status: Clean X-archive-position: 1287 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: liontooth@cogweb.net Precedence: bulk X-list: netdev Jeff Garzik wrote: > David Liontooth wrote: > >> 0000:02:0b.0 Ethernet controller: Marvell Technology Group Ltd. Yukon >> Gigabit Ethernet 10/100/1000Base-T Adapter (rev 13) > > You want the sk98lin or skge drivers. 
Correct -- that one worked already in Debian-Installer. What was confusing is that the Gigabyte K8NS Ultra-939 board has a second gigabit NIC, identified in the motherboard manual as a 100/10 ICS1883 LAN PHY, that is in fact an nforce gigabit controller, part of the nforce3 250 chipset (cf. http://cogweb.net/owens/Images/Gigabyte-K8NS-Ultra-939.jpg line 5). For some reason the PCI ID 00E6 doesn't show up in lspci, so I thought it was not detected by the kernel. However, the forcedeth driver brought it to life. Dave From herbert@gondor.apana.org.au Sat Apr 2 23:40:37 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 23:40:46 -0800 (PST) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j337eZm9031028 for ; Sat, 2 Apr 2005 23:40:36 -0800 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHziD-0001d3-00; Sun, 03 Apr 2005 17:40:01 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHzgu-00027R-00; Sun, 03 Apr 2005 17:38:40 +1000 Date: Sun, 3 Apr 2005 17:38:40 +1000 To: jamal Cc: Eric Dumazet , "David S. 
Miller" , netdev , Robert Olsson Subject: Re: Get rid of rt_check_expire and rt_garbage_collect Message-ID: <20050403073840.GA8105@gondor.apana.org.au> References: <424E641A.1020609@cosmosbay.com> <20050402112304.GA11321@gondor.apana.org.au> <1112475955.1088.294.camel@jzny.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1112475955.1088.294.camel@jzny.localdomain> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1288 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev On Sat, Apr 02, 2005 at 04:05:55PM -0500, jamal wrote: > > > In this state there is absolutely no need to execute the timer GC. > > Yeah, but memory is finite friend. True, if you can imagine infinite > memory we would not need gc ;-> True. However running the GC when you can't free most of the entries is a waste of time. On a busy system where the routing cache is near capacity and new entries are coming in all the time, we should arrange it so that the old entries are expired when entries are inserted. Assuming the hash function is good, then as long as there is a steady stream of entries coming in, the old entries will be expired automatically. Of course, we should not leave the systems that have experienced a burst of flows at a disadvantage. Indeed there is a rather simple way of doing GC for them without having to do work that's proportional to the number of hash chains in the routing cache. The key is that the GC is only useful when the routing cache contains enough entries that can be freed. Let's say that if we can free more than 1/3 of the entries then the GC should be run. Of course you can define this to be whatever you want. 
So now the problem is to quickly determine whether there are enough entries in the cache that can be freed. What we can do is take a leaf out of the politicians' book :) We take a poll on a small sample of the routing cache. That is, we run the GC on a fixed number of chains, e.g., 256 chains. After that we tally the total number of entries and the number of entries freed. Since the hash function should be spreading entries throughout the chains evenly, the ratio here can be extrapolated out to the entire cache. Therefore once the ratio exceeds the defined threshold, we perform GC over the entire cache, preferably in a kernel thread. If not then we'll simply let the GC roam along at the constant pace of 256 chains. The advantage of this is that the GC will free entries in the entire table as soon as that becomes possible without having to do work proportional to the number of chains in each GC interval. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From herbert@gondor.apana.org.au Sat Apr 2 23:41:51 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 23:41:56 -0800 (PST) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j337foUj031138 for ; Sat, 2 Apr 2005 23:41:50 -0800 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHzjf-0001dk-00; Sun, 03 Apr 2005 17:41:31 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHzjZ-00027w-00; Sun, 03 Apr 2005 17:41:25 +1000 Date: Sun, 3 Apr 2005 17:41:25 +1000 To: jamal Cc: Eric Dumazet , "David S. 
Miller" , netdev , Robert Olsson Subject: Re: Get rid of rt_check_expire and rt_garbage_collect Message-ID: <20050403074125.GB8105@gondor.apana.org.au> References: <424E641A.1020609@cosmosbay.com> <20050402112304.GA11321@gondor.apana.org.au> <1112475955.1088.294.camel@jzny.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1112475955.1088.294.camel@jzny.localdomain> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1289 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev On Sat, Apr 02, 2005 at 04:05:55PM -0500, jamal wrote: > > > Here is how it can be done: Every time a routing entry is inserted into > > a hash chain, we perform GC on that chain unconditionally. > > May not be a good idea to do it unconditionally - in particular on SMP > where another CPU maybe spinning waiting for you to let go of bucket > lock. In particular if a burst of packets accessing the same bucket show > up on different processors, this would be aggravated. > You may wanna kick in this algorithm only when things start going past a > certain threshold. This isn't too bad because: 1. The fast path is lockless using RCU. 2. The number of locks exceeds the number of CPUs by some insane amount. 3. The cost of performing GC is really cheap, it's just a matter of calling rt_may_expire. Anyway, I agree that all of these ideas are simply fantasy until we have some code. 
So let me work on that and then we can let the benchmarks do the talking :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From herbert@gondor.apana.org.au Sat Apr 2 23:44:02 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Apr 2005 23:44:09 -0800 (PST) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j337hxIh031843 for ; Sat, 2 Apr 2005 23:44:01 -0800 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DHzln-0001eh-00; Sun, 03 Apr 2005 17:43:43 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DHzlh-00028S-00; Sun, 03 Apr 2005 17:43:37 +1000 Date: Sun, 3 Apr 2005 17:43:37 +1000 To: "David S. Miller" Cc: Robert.Olsson@data.slu.se, dada1@cosmosbay.com, netdev@oss.sgi.com Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() Message-ID: <20050403074337.GA8083@gondor.apana.org.au> References: <424E641A.1020609@cosmosbay.com> <16974.41648.568927.54429@robur.slu.se> <20050402193224.GA25157@gondor.apana.org.au> <20050402115528.11f71a3c.davem@davemloft.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050402115528.11f71a3c.davem@davemloft.net> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1290 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev On Sat, Apr 02, 2005 at 11:55:28AM -0800, David S. 
Miller wrote: > > I think we should, in the short term, increase the secret interval > where it exists in the tree (netfilter conntrack is another instance > for example). We could also move rt_cache_flush into a kernel thread. When the number of chains is large this function is really expensive for a softirq handler. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From a.kasparas@gmc.lt Sun Apr 3 00:30:32 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 00:30:39 -0800 (PST) Received: from smtp02.omnitel.sun (smtp02-neptunas.omnitel.net [194.176.45.2]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j338UVJn004479 for ; Sun, 3 Apr 2005 00:30:32 -0800 Received: from smtp04-neptunas.omnitel.net ([194.176.45.42]) by smtp02.omnitel.sun (Sun Java System Messaging Server 6.1 HotFix 0.01 (built Jun 24 2004)) with ESMTP id <0IED004S43K8BK40@smtp02.omnitel.sun> for netdev@oss.sgi.com; Sun, 03 Apr 2005 11:28:57 +0300 (EEST) Received: from smtp04-neptunas.omnitel.net (localhost [127.0.0.1]) by smtp04-neptunas.omnitel.net (Postfix) with SMTP id 6928139804F; Sun, 03 Apr 2005 11:28:54 +0300 (EEST) Received: from [192.168.0.128] (unknown [62.212.195.62]) by smtp04-neptunas.omnitel.net (Postfix) with ESMTP id D1DD939804A; Sun, 03 Apr 2005 11:28:53 +0300 (EEST) Date: Sun, 03 Apr 2005 11:28:54 +0300 From: Aidas Kasparas Subject: Re: IPSEC: on behavior of acquire In-reply-to: <1112477326.1088.321.camel@jzny.localdomain> To: hadi@cyberus.ca Cc: ipsec-tools-devel@lists.sourceforge.net, netdev , nakam@linux-ipv6.org Message-id: <424FA946.70809@gmc.lt> MIME-version: 1.0 Content-type: text/plain; charset=UTF-8; format=flowed Content-transfer-encoding: 7BIT X-Accept-Language: lt, en, ru, fr X-Enigmail-Version: 0.90.0.0 X-Enigmail-Supports: pgp-inline, pgp-mime References: <1112405303.1096.37.camel@jzny.localdomain> 
<424E454D.4090402@gmc.lt> <1112477326.1088.321.camel@jzny.localdomain> User-Agent: Debian Thunderbird 1.0 (X11/20050116) X-Virus-Scanned: ClamAV 0.83/801/Sat Apr 2 02:36:25 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1291 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: a.kasparas@gmc.lt Precedence: bulk X-list: netdev jamal wrote: > On Sat, 2005-04-02 at 02:10, Aidas Kasparas wrote: > > >>Re 1 try only. There is little sense to do more tries. If there is no >>deamon listening to pfkey messages, then no connection will be made no >>matter how many retries you'll do. If deamon/link/peer is slow and SA >>was not established before timeout expired, then repeated acquire will >>be simply ignored (deamon will find out that negotiation is already in >>progress, there is no reason to start another negotiation and therefore >>will drop that acquire request). And the only situation where repeated >>acquires may help is when pfkey messages are lost. > > > Exactly what i was trying to emulate - lost messages. Your emulation was not correct. It would have been more correct to start the KE daemon, let it fully initialize (open the pfkey socket, inform the kernel that it is interested in acquire messages), then stop it (via a debugger or kill -STOP), and only then send pings or other traffic and see what happens. This is because there are different paths in xfrm+pfkey for two cases: 1) when there is no KE daemon, and 2) when a daemon is present but for some reason does not establish an SA; the reaction to traffic therefore differs. In the first case it's xfrm_lookup() ->xfrm_tmpl_resolve() ->xfrm_state_find() ->xfrm_state.c:km_query() ->pfkey_send_acquire() ->pfkey_broadcast() ->return -ESRCH. This error code goes unchanged back to xfrm_state_find(), where it is remapped into itself (other possible values are -EAGAIN and -ENOMEM). Then this error code goes back to the application. 
In the second case it's xfrm_lookup() ->xfrm_tmpl_resolve() ->xfrm_state_find() ->xfrm_state.c:km_query() ->pfkey_send_acquire() ->pfkey_broadcast() ->pfkey_broadcast_one() -> return 0, which is also passed unchanged back to xfrm_state_find(), where the SA is put into state XFRM_STATE_ACQ. xfrm_tmpl_resolve() returns -EAGAIN. xfrm_lookup() then sets up a timeout and, if the state has not changed after that timeout, returns -EAGAIN to the application. On the other hand, the analysis above shows that the return code is chosen by the xfrm framework; therefore, if the error code has to be changed, it should be changed in xfrm, not in pfkey or netlink code. > I would expect it > to be the rule to loose messages - but given theres no guarantee of > delivery, messages could be lost. > > >>But pfkey was not >>designed to survive message loses, therefore you should not operate your >>boxes in mode when lost pfkey messages are a rule, not an exception. And >>on the other hand, occasional pfkey message loses can be worked around >>by applications/user retry. >> > > > I think its more than just pfkey (or netlink) - rather the ipsec > framework itself. > > One could look at the acquire as part of the "connection" setup > (for lack of better description). Without the acquire succeeding, theres > no connection..(assuming that to be a policy). > Therefore if acquire is not supposed to be delivered with some certainty > (read: retries) then theres some resiliciency issues IMO. OK, to avoid comparing apples and oranges, let's first find out where you see the problem. In the ipsec framework there are the following players (I'm speaking about the pfkey case; netlink may be a little different): xfrm <-> pfkey <-> KE daemon <-> remote peer. xfrm-pfkey communication is based on function calls. For them to fail, something really weird has to happen with your kernel. KE daemon - remote peer communications are done on UDP/500, UDP/4500 according to internet standards. 
Packet retransmissions are implemented the way the standards require, therefore it is not a fatal condition if some packet is lost on the way. And there is no 1:1 correspondence between packets sent over the internet and those sent over the pfkey socket. These communications are performed relatively independently. There is no need to receive an extra acquire pfkey message to retransmit the packet which initiates SA setup with the remote peer. pfkey - KE daemon communication is performed over a message socket. All the communication is performed within a single box. Moreover, only the kernel and a userspace process are involved. Therefore I see only the following cases in which a message may not be delivered: 1) the message is too big to fit into the socket's buffer; 2) the kernel decides to drop that socket buffer and reuse the memory for something else; 3) the KE daemon does not get [enough] CPU time to handle messages; 4) a bug in the KE daemon prevents it from reading messages. If you know of another case, please let me know. (1) does happen when there is a big SPD/SAD and setkey/racoon requests to dump it all. It is a known pfkey architectural limitation. Acquire messages are small, therefore this can happen only when such a call is made right after the response to a big DUMP was generated. In the racoon case the SPD dump is performed only on daemon startup (and even then it is possible that it is not strictly necessary). An extra acquire message may make sense only if it is sent after some timeout. But again, KE daemon start is more the exception than the rule, and applications can be started only after some delay after the KE daemon has started. I'm not sure how realistic (2) is. But it and (3) are clear resource shortage cases. Under no circumstances should they be allowed. And in case (3) an extra acquire message definitely won't help the situation. In case (4) it is the KE daemon that is at fault, not pfkey. An extra message will not cure this case either. > > Note: Sometimes theres no app. Example a packet coming into a gateway. > What do you have in mind? 
If it is an ISAKMP negotiation from a remote peer, then it comes over UDP/500 or UDP/4500 on an IP socket and not via an acquire message on the pfkey socket. If it is an ESP/AH packet with an unknown SPI, then the kernel simply drops it and does not send any acquire messages. If it is something else, please explain. >> pfkey code found that there is nothing receiving >>acquire messages => there is no chance that any process will setup >>required SAs and tried to inform about that (I agree, return code is not >>very informative, at least until you learn about reasons why it is >>such). If you would have racoon (or other pfkey based ISAKMP daemon) >>running, you would get "resource temporarily unavailable" (don't know >>which error code corresponds to that message), which IMHO is ok (if it >>is not, please explain). >> > > > Havent tried that - the reason i said restart was the right signal was > mainly that an app could translate that to mean "try again". > In other words even in the case of ping -c1 the ping app could have > reattempted. If there is a security policy which is not satisfied and there is nobody who could make it satisfied, then why should we give the application false hope that things will change on retry? > > On Sat, 2005-04-02 at 07:25, Zilvinas Valinskas wrote: > >>EBUSY I think it is. >> >>I am not entirely sure it is ok to return such error, some applications are >>not coping nicely with it. Perhaps ECONNREFUSED is more reasonable - as it >>doesn't brake old apps assumption (connection cannot be established, >>doesn't matter if that is due to routing or IPsec SPD or anything else). >> > > > What about ERESTART the way netlink does it right now? I suspect that ERESTART is generated not by netlink, but by the xfrm_lookup() function when signal_pending(current) is true. Why that function returns true in the netlink case but not in the pfkey case I don't know. IMHO, xfrm_lookup() returns correct error codes in that case. > ECONNREFUSED is probably not a bad idea. 
> ping was clearly dumb and didnt do anything with the info.
> Overall, I think the errors are unfortunately not descriptive at all.

I don't like ECONNREFUSED in this place. As a user, if I received an ECONNREFUSED message then I would contact the application server admin or the remote host admin to resolve the problem. But the problem is in the network setup, and therefore the person responsible for networks should be contacted. Therefore I would prefer ENETUNREACH or EHOSTUNREACH.

P.S. For this analysis, kernel source from the Debian distribution was used (v.2.6.9).

--
Aidas Kasparas
IT administrator
GM Consult Group, UAB

From hadi@cyberus.ca Sun Apr 3 07:29:38 2005
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 07:29:43 -0700 (PDT)
Received: from mx03.cybersurf.com (mx03.cybersurf.com [209.197.145.106]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j33ETawn023082 for ; Sun, 3 Apr 2005 07:29:36 -0700
Received: from mail.cyberus.ca ([209.197.145.21]) by mx03.cybersurf.com with esmtp (Exim 4.30) id 1DI66Y-00019y-FV for netdev@oss.sgi.com; Sun, 03 Apr 2005 10:29:34 -0400
Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DI66U-00068f-JN; Sun, 03 Apr 2005 10:29:30 -0400
Subject: Re: IPSEC: on behavior of acquire
From: jamal
Reply-To: hadi@cyberus.ca
To: Aidas Kasparas
Cc: ipsec-tools-devel@lists.sourceforge.net, netdev , nakam@linux-ipv6.org
In-Reply-To: <424FA946.70809@gmc.lt>
References: <1112405303.1096.37.camel@jzny.localdomain> <424E454D.4090402@gmc.lt> <1112477326.1088.321.camel@jzny.localdomain> <424FA946.70809@gmc.lt>
Content-Type: text/plain
Organization: jamalopolous
Message-Id: <1112538566.1096.391.camel@jzny.localdomain>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.2.2
Date: 03 Apr 2005 10:29:27 -0400
Content-Transfer-Encoding: 7bit
X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com
X-Virus-Status: Clean
X-archive-position: 1292
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: hadi@cyberus.ca
Precedence: bulk
X-list: netdev

On Sun, 2005-04-03 at 04:28, Aidas Kasparas wrote:
> jamal wrote:
> > Exactly what i was trying to emulate - lost messages.
>
> Your emulation was not correct. More correct would have been to start KE
> daemon, let it fully initialize (open pfkey socket, inform kernel that
> it is interested in acquire messages), then stop it (via debugger or
> kill -STOP) and only then send pings or other traffic and see what will
> happen. This is because there are different paths in xfrm+pfkey for
> cases 1) when there is no KE daemon and 2) when daemon is, but for some
> reason it does not establish a SA and therefore reaction to traffic is
> different.
>

I don't think that would work.
To summarize what happens in the kernel: everything leads to km_query(), as you have indicated in your text. If the kernel finds that someone/something has either a pfkey or netlink socket open, it sends an acquire to them. In the code you are probably looking at (before I created the patch), the first user/daemon the kernel sees (either pfkey or netlink based) that has a socket open will receive an acquire, and the kernel will give up after that. As an example, if the first pfkey user was just doing "setkey -x" and the second was in fact pluto, then pluto will never see the acquire. This is what got me looking at it to begin with. Look at the earlier postings on the subject.

So in other words, just killing the IKE server as you propose would mean the kernel has no open sockets and will therefore never bother to send an acquire.

Still, all this is moot and is distracting us from the main discussion. Let's define "lost" simply as the case where an acquire never got to the server (which may be sitting elsewhere on the network). In that case what I did is sufficient, i.e. the methods to create this are not the issue. The issue at stake is the behavior of the kernel in generating the acquires.

[..]
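A minimal userspace sketch of the two km_query() loop behaviours discussed above: stop at the first manager that accepts the acquire, versus offer it to every manager and count one acceptance as success (the shape of the km_query() hunk in the patch later in this thread). This is an illustration only, not kernel code: struct key_manager, acquire_ok(), acquire_fail(), query_first_success() and query_all() are hypothetical names, and the manager list is a plain array with no locking.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a registered key manager (pfkey, netlink). */
struct key_manager {
    const char *id;
    int (*acquire)(struct key_manager *km); /* 0 on success, negative errno on failure */
    int acquires_seen;                      /* how many acquires this manager was offered */
};

/* Manager that accepts the acquire. */
int acquire_ok(struct key_manager *km)
{
    km->acquires_seen++;
    return 0;
}

/* Manager that has nobody listening and so rejects the acquire. */
int acquire_fail(struct key_manager *km)
{
    km->acquires_seen++;
    return -ESRCH;
}

/* Old behaviour: give up after the first manager that accepts. */
int query_first_success(struct key_manager *kms, size_t n)
{
    int err = -EINVAL;

    assert(kms != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        err = kms[i].acquire(&kms[i]);
        if (!err)
            break; /* later managers never see the acquire */
    }
    return err;
}

/* Patched behaviour: offer the acquire to every manager;
 * the query succeeds if at least one of them accepted it. */
int query_all(struct key_manager *kms, size_t n)
{
    int err = -EINVAL;

    assert(kms != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int acqret = kms[i].acquire(&kms[i]);
        if (!acqret)
            err = acqret;
    }
    return err;
}
```

With two managers that both accept, query_first_success() leaves the second one with acquires_seen == 0, while query_all() delivers to both and still returns 0 as long as one of them accepted.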
> On the other hand, analysis above shows that return code is choosen by
> xfrm framework, therefore if error code has to be changed, it should be
> changed in xfrm, not in pfkey or netlink code.

The control for both is under generic code. The end return code - you are right, that's user behavior and should match.

> > One could look at the acquire as part of the "connection" setup
> > (for lack of better description). Without the acquire succeeding, theres
> > no connection..(assuming that to be a policy).
> > Therefore if acquire is not supposed to be delivered with some certainty
> > (read: retries) then theres some resiliciency issues IMO.
>
> OK, To avoid speaking about apples and oranges let's first find out
> where you see the problem. In the ipsec framework there are the
> following players (I'm speaking about pfkey case; netlink may be little
> different):
>
> xfrm <-> pfkey <-> KE daemon <-> remote peer
>
> xfrm-pfkey communication is based on function calls. For them to fail
> something really weird has to happen with your kernel.
>
> KE deamon - remote peer communications are done on UDP/500, UDP/4500
> according to internet standards. Packet retransmissions are implemented
> the way standards require, therefore it is not a fatal condition if some
> packet will be lost on the way.

Please refer to my earlier definition of what "lost" means. It doesn't matter where the breakage happens really. Think of everything to the right of "xfrm" in your diagram as a black box (i.e. that second thing could be pfkey or netlink - that's not the issue). Think of some message that is supposed to reach the KE daemon (make it interesting and say it is a remote KE), then think of that message never making it because something in the black box swallowed it. If that packet is the first one and it needs to get through for the sake of setup for subsequent packets, then the desire to have it reach its destination is very important.
There is no progress for it or subsequent packets if it doesn't make it.

The solution being proposed for Linux, to treat that xfrm piece in the same fashion as ARP, is correct. Read the email from Alexey. Imagine if ARP was only issued once (as does pfkey) or forever (as does netlink). I believe this is an issue with the ipsec architecture itself - someone needs to write an IETF draft on it.

>
> >
> > Note: Sometimes theres no app. Example a packet coming into a gateway.
> >
>
> What do you have in mind?
>
> If it is ISAKMP negotiation from remote peer, then it comes over UDP/500
> or UDP/4500 over IP socket and not via acquire message via pfkey socket.
>
> If it is ESP/AH packet with unknown SPI, then kernel simply drops it and
> do not send any acquire messages.
>

I was thinking more of this second scenario, with traffic incoming from the clear text domain and the gateway encrypting, assuming proper policy setup. I would have to go and reread the "opportunistic" encryption draft closely to make sense of it.

> > Havent tried that - the reason i said restart was the right signal was
> > mainly that an app could translate that to mean "try again".
> > In other words even in the case of ping -c1 the ping app could have
> > reattempted.
>
> If there is security policy which is not satisfied and there is nobody
> which could make it satisfied, then why should we give application false
> hope that on retry things will change?
>

In the case of knowing it is the policy that is not satisfied, I think it would make sense not to tell the app to retry.

> >
> > What about ERESTART the way netlink does it right now?
>
> I suspect that ERESTART is generated not by netlink, but by
> xfrm_lookup() function when signal_pending(current) is true. Why that
> function returns true in netlink case but not in pfkey case I don't
> know. IMHO, xfrm_lookup() returns correct error codes in that case.
>

Yes, you are correct.

> > ECONNREFUSED is probably not a bad idea.
> > ping was clearly dumb and didnt do anything with the info.
> > Overall, I think the errors are unfortunately not descriptive at all.
>
> I don't like ECONNREFUSED in this place. As a user if I would receive
> ECONNREFUSED message then I would address application server admin or
> remote host admin to resolve the problem. But the problem is in network
> setup and therefore person responsible for networks should be contacted.
> Therefore, I would like more ENETUNREACH or EHOSTUNREACH.
>

Agreed to this as well. I think this is what would happen in the case of ARP failure as well. ECONNREFUSED would make sense in the case where the policy rejected progress.

cheers,
jamal

From hadi@cyberus.ca Sun Apr 3 07:32:10 2005
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 07:32:18 -0700 (PDT)
Received: from mx01.cybersurf.com (mx01.cybersurf.com [209.197.145.104]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j33EW93e023317 for ; Sun, 3 Apr 2005 07:32:10 -0700
Received: from mail.cyberus.ca ([209.197.145.21]) by mx01.cybersurf.com with esmtp (Exim 4.30) id 1DI68x-0001Jp-W1 for netdev@oss.sgi.com; Sun, 03 Apr 2005 08:32:03 -0600
Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DI68v-0006MV-VK; Sun, 03 Apr 2005 10:32:02 -0400
Subject: Re: take 2 WAS(Re: PATCH: IPSEC xfrm events
From: jamal
Reply-To: hadi@cyberus.ca
To: Herbert Xu
Cc: Patrick McHardy , Masahide NAKAMURA , "David S.
Miller" , netdev In-Reply-To: <1112469601.1088.173.camel@jzny.localdomain> References: <1112319441.1089.83.camel@jzny.localdomain> <20050401042106.GA27762@gondor.apana.org.au> <1112353398.1096.116.camel@jzny.localdomain> <20050401114258.GA2932@gondor.apana.org.au> <1112358278.1096.160.camel@jzny.localdomain> <20050401123554.GA3468@gondor.apana.org.au> <1112403845.1088.14.camel@jzny.localdomain> <20050402012813.GA24575@gondor.apana.org.au> <1112406164.1088.54.camel@jzny.localdomain> <20050402014619.GB24861@gondor.apana.org.au> <1112469601.1088.173.camel@jzny.localdomain> Content-Type: multipart/mixed; boundary="=-CbZvGNdJ/zGTATpkMExl" Organization: jamalopolous Message-Id: <1112538718.1096.394.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 03 Apr 2005 10:31:58 -0400 X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1293 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev --=-CbZvGNdJ/zGTATpkMExl Content-Type: text/plain Content-Transfer-Encoding: 7bit Small change after some testing. Herbert havent heard back from you - this looks very palatable in my opinion with comments below still in effect. cheers, jamal On Sat, 2005-04-02 at 14:20, jamal wrote: > Ok, heres a general patch first cut i think i got all that was discussed > in there. ive done some basic 5 minutes tests on. > Once we have agreement i will pass it on to Masahide-san to do more > thorough testing. > Look at the XXX comments in the patch. > > A couple of interesting things: > > 1) Weve discussed this before Herbert and i think you misspoke that > pfkey delivers to all listerners. > > pfkey Add/del/upd now really do tell all processes about what happened. > Before pfkey would skip the originating process. So far this doesnt seem > to be an issue in the basic testing. 
> > 2) I ended adding a policy_notify to the pfkey manager to make the code > generic. Interesting thing is i dont think pfkey knows what to do with > policy expiration or i am misreading the code. > I dont see any message type for policy expiration as i do for sa > expiration. Ive put some hooks and a little noise. I could remove the > printks - for now they are just place holders. > > cheers, > jamal --=-CbZvGNdJ/zGTATpkMExl Content-Disposition: attachment; filename=ipsec-event-take2-1 Content-Type: text/plain; name=ipsec-event-take2-1; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit --- a/include/net/xfrm.h 2005-03-25 22:28:26.000000000 -0500 +++ b/include/net/xfrm.h 2005-04-02 11:59:17.000000000 -0500 @@ -157,6 +157,28 @@ XFRM_STATE_DEAD }; +/* events that could be sent by kernel */ +enum { + XFRM_SAP_INVALID, + XFRM_SAP_EXPIRED, + XFRM_SAP_ADDED, + XFRM_SAP_UPDATED, + XFRM_SAP_DELETED, + XFRM_SAP_FLUSHED, + __XFRM_SAP_MAX +}; +#define XFRM_SAP_MAX (__XFRM_SAP_MAX - 1) + +/* callback structure passed from either netlink or pfkey */ +struct km_event +{ + u32 data; + u32 seq; + u32 pid; + u32 event; +}; + + struct xfrm_type; struct xfrm_dst; struct xfrm_policy_afinfo { @@ -178,6 +200,9 @@ extern int xfrm_policy_register_afinfo(struct xfrm_policy_afinfo *afinfo); extern int xfrm_policy_unregister_afinfo(struct xfrm_policy_afinfo *afinfo); +extern void km_policy_notify(struct xfrm_policy *xp, int dir, struct km_event *c); +extern void km_state_notify(struct xfrm_state *x, struct km_event *c); + #define XFRM_ACQ_EXPIRES 30 @@ -283,17 +308,17 @@ struct xfrm_tmpl xfrm_vec[XFRM_MAX_DEPTH]; }; -#define XFRM_KM_TIMEOUT 30 +#define XFRM_KM_TIMEOUT 30 struct xfrm_mgr { struct list_head list; char *id; - int (*notify)(struct xfrm_state *x, int event); + int (*notify)(struct xfrm_state *x, struct km_event *c); int (*acquire)(struct xfrm_state *x, struct xfrm_tmpl *, struct xfrm_policy *xp, int dir); struct xfrm_policy *(*compile_policy)(u16 family, int opt, u8 *data, int 
len, int *dir); int (*new_mapping)(struct xfrm_state *x, xfrm_address_t *ipaddr, u16 sport); - int (*notify_policy)(struct xfrm_policy *x, int dir, int event); + int (*notify_policy)(struct xfrm_policy *x, int dir, struct km_event *c); }; extern int xfrm_register_km(struct xfrm_mgr *km); @@ -802,7 +827,7 @@ extern int xfrm_state_update(struct xfrm_state *x); extern struct xfrm_state *xfrm_state_lookup(xfrm_address_t *daddr, u32 spi, u8 proto, unsigned short family); extern struct xfrm_state *xfrm_find_acq_byseq(u32 seq); -extern void xfrm_state_delete(struct xfrm_state *x); +extern int xfrm_state_delete(struct xfrm_state *x); extern void xfrm_state_flush(u8 proto); extern int xfrm_replay_check(struct xfrm_state *x, u32 seq); extern void xfrm_replay_advance(struct xfrm_state *x, u32 seq); --- a/include/linux/xfrm.h 2005-03-25 22:28:39.000000000 -0500 +++ b/include/linux/xfrm.h 2005-04-02 09:53:03.000000000 -0500 @@ -254,5 +254,7 @@ #define XFRMGRP_ACQUIRE 1 #define XFRMGRP_EXPIRE 2 +#define XFRMGRP_SA 4 +#define XFRMGRP_POLICY 8 #endif /* _LINUX_XFRM_H */ --- a/net/xfrm/xfrm_state.c 2005-03-25 22:28:25.000000000 -0500 +++ b/net/xfrm/xfrm_state.c 2005-04-02 12:15:37.000000000 -0500 @@ -48,7 +48,7 @@ static struct list_head xfrm_state_gc_list = LIST_HEAD_INIT(xfrm_state_gc_list); static DEFINE_SPINLOCK(xfrm_state_gc_lock); -static void __xfrm_state_delete(struct xfrm_state *x); +static int __xfrm_state_delete(struct xfrm_state *x); static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned short family); static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo); @@ -208,8 +208,10 @@ } EXPORT_SYMBOL(__xfrm_state_destroy); -static void __xfrm_state_delete(struct xfrm_state *x) +static int __xfrm_state_delete(struct xfrm_state *x) { + int err = -ESRCH; + if (x->km.state != XFRM_STATE_DEAD) { x->km.state = XFRM_STATE_DEAD; spin_lock(&xfrm_state_lock); @@ -236,14 +238,47 @@ * is what we are dropping here. 
*/ atomic_dec(&x->refcnt); + err = 0; } + + return err; } -void xfrm_state_delete(struct xfrm_state *x) +static DEFINE_RWLOCK(xfrm_km_lock); +static struct list_head xfrm_km_list = LIST_HEAD_INIT(xfrm_km_list); + +void km_policy_notify(struct xfrm_policy *xp, int dir, struct km_event *c) { + struct xfrm_mgr *km; + + read_lock(&xfrm_km_lock); + list_for_each_entry(km, &xfrm_km_list, list) + if (km->notify_policy) + km->notify_policy(xp, dir, c); + read_unlock(&xfrm_km_lock); +} + +void km_state_notify(struct xfrm_state *x, struct km_event *c) +{ + struct xfrm_mgr *km; + read_lock(&xfrm_km_lock); + list_for_each_entry(km, &xfrm_km_list, list) + km->notify(x, c); + read_unlock(&xfrm_km_lock); +} + +EXPORT_SYMBOL(km_policy_notify); +EXPORT_SYMBOL(km_state_notify); + +int xfrm_state_delete(struct xfrm_state *x) +{ + int err; + spin_lock_bh(&x->lock); - __xfrm_state_delete(x); + err = __xfrm_state_delete(x); spin_unlock_bh(&x->lock); + + return err; } EXPORT_SYMBOL(xfrm_state_delete); @@ -402,6 +437,7 @@ static struct xfrm_state *__xfrm_find_acq_byseq(u32 seq); + int xfrm_state_add(struct xfrm_state *x) { struct xfrm_state_afinfo *afinfo; @@ -764,37 +800,45 @@ } EXPORT_SYMBOL(xfrm_replay_advance); -static struct list_head xfrm_km_list = LIST_HEAD_INIT(xfrm_km_list); -static DEFINE_RWLOCK(xfrm_km_lock); static void km_state_expired(struct xfrm_state *x, int hard) { - struct xfrm_mgr *km; + struct km_event c; if (hard) x->km.state = XFRM_STATE_EXPIRED; else x->km.dying = 1; - read_lock(&xfrm_km_lock); - list_for_each_entry(km, &xfrm_km_list, list) - km->notify(x, hard); - read_unlock(&xfrm_km_lock); + /* XXX: Do we wanna do this right at the top?? 
+ * if the state is dead we dont want to announce + * the expire - a delete may already have announced + * it + */ + if (x->km.state == XFRM_STATE_DEAD) + return; + c.data = hard; + c.event = XFRM_SAP_EXPIRED; + km_state_notify(x, &c); if (hard) wake_up(&km_waitq); } +/* + * We send to all registered managers regardless of failure + * We are happy with one success +*/ static int km_query(struct xfrm_state *x, struct xfrm_tmpl *t, struct xfrm_policy *pol) { - int err = -EINVAL; + int err = -EINVAL, acqret; struct xfrm_mgr *km; read_lock(&xfrm_km_lock); list_for_each_entry(km, &xfrm_km_list, list) { - err = km->acquire(x, t, pol, XFRM_POLICY_OUT); - if (!err) - break; + acqret = km->acquire(x, t, pol, XFRM_POLICY_OUT); + if (!acqret) + err = acqret; } read_unlock(&xfrm_km_lock); return err; @@ -819,13 +863,20 @@ void km_policy_expired(struct xfrm_policy *pol, int dir, int hard) { - struct xfrm_mgr *km; + struct km_event c; - read_lock(&xfrm_km_lock); - list_for_each_entry(km, &xfrm_km_list, list) - if (km->notify_policy) - km->notify_policy(pol, dir, hard); - read_unlock(&xfrm_km_lock); + /* XXX: Do we still wanna wakeup km_waitq? + * if the policy is dead we dont want to announce + * the expire - a delete may already have announced + * it + */ + if (pol->dead) + return; + + c.data = hard; + c.data = hard; + c.event = XFRM_SAP_EXPIRED; + km_policy_notify(pol, dir, &c); if (hard) wake_up(&km_waitq); --- a/net/xfrm/xfrm_policy.c 2005-03-25 22:28:21.000000000 -0500 +++ b/net/xfrm/xfrm_policy.c 2005-04-02 12:16:30.000000000 -0500 @@ -298,7 +298,7 @@ * entry dead. The rule must be unlinked from lists to the moment. 
*/ -static void xfrm_policy_kill(struct xfrm_policy *policy) +static void xfrm_policy_kill(struct xfrm_policy *policy, int dir) { write_lock_bh(&policy->lock); if (policy->dead) @@ -378,7 +378,7 @@ write_unlock_bh(&xfrm_policy_lock); if (delpol) { - xfrm_policy_kill(delpol); + xfrm_policy_kill(delpol, dir); } return 0; } @@ -402,7 +402,7 @@ if (pol && delete) { atomic_inc(&flow_cache_genid); - xfrm_policy_kill(pol); + xfrm_policy_kill(pol, dir); } return pol; } @@ -425,7 +425,7 @@ if (pol && delete) { atomic_inc(&flow_cache_genid); - xfrm_policy_kill(pol); + xfrm_policy_kill(pol, dir); } return pol; } @@ -442,7 +442,7 @@ xfrm_policy_list[dir] = xp->next; write_unlock_bh(&xfrm_policy_lock); - xfrm_policy_kill(xp); + xfrm_policy_kill(xp, dir); write_lock_bh(&xfrm_policy_lock); } @@ -558,7 +558,7 @@ if (pol) { if (dir < XFRM_POLICY_MAX) atomic_inc(&flow_cache_genid); - xfrm_policy_kill(pol); + xfrm_policy_kill(pol, dir); } } @@ -579,7 +579,7 @@ write_unlock_bh(&xfrm_policy_lock); if (old_pol) { - xfrm_policy_kill(old_pol); + xfrm_policy_kill(old_pol, dir); } return 0; } --- a/net/xfrm/xfrm_user.c 2005-03-25 22:28:22.000000000 -0500 +++ b/net/xfrm/xfrm_user.c 2005-04-02 12:21:32.000000000 -0500 @@ -268,6 +268,7 @@ struct xfrm_usersa_info *p = NLMSG_DATA(nlh); struct xfrm_state *x; int err; + struct km_event c; err = verify_newsa_info(p, (struct rtattr **) xfrma); if (err) @@ -285,14 +286,26 @@ if (err < 0) { x->km.state = XFRM_STATE_DEAD; xfrm_state_put(x); + return err; } + c.seq = nlh->nlmsg_seq; + c.pid = nlh->nlmsg_pid; + if (nlh->nlmsg_type == XFRM_MSG_NEWSA) + c.event = XFRM_SAP_ADDED; + else + c.event = XFRM_SAP_UPDATED; + + km_state_notify(x, &c); + return err; } static int xfrm_del_sa(struct sk_buff *skb, struct nlmsghdr *nlh, void **xfrma) { struct xfrm_state *x; + int err; + struct km_event c; struct xfrm_usersa_id *p = NLMSG_DATA(nlh); x = xfrm_state_lookup(&p->daddr, p->spi, p->proto, p->family); @@ -304,10 +317,20 @@ return -EPERM; } - 
xfrm_state_delete(x); + err = xfrm_state_delete(x); + if (err < 0) { + x->km.state = XFRM_STATE_DEAD; + xfrm_state_put(x); + return err; + } + + c.seq = nlh->nlmsg_seq; + c.pid = nlh->nlmsg_pid; + c.event = XFRM_SAP_DELETED; + km_state_notify(x, &c); xfrm_state_put(x); - return 0; + return err; } static void copy_to_user_state(struct xfrm_state *x, struct xfrm_usersa_info *p) @@ -672,6 +695,7 @@ { struct xfrm_userpolicy_info *p = NLMSG_DATA(nlh); struct xfrm_policy *xp; + struct km_event c; int err; int excl; @@ -683,6 +707,10 @@ if (!xp) return err; + /* shouldnt excl be based on nlh flags?? + * Aha! this is anti-netlink really i.e more pfkey derived + * in netlink excl is a flag and you wouldnt need + * a type XFRM_MSG_UPDPOLICY - JHS */ excl = nlh->nlmsg_type == XFRM_MSG_NEWPOLICY; err = xfrm_policy_insert(p->dir, xp, excl); if (err) { @@ -690,6 +718,16 @@ return err; } + + if (!excl) + c.event = XFRM_SAP_UPDATED; + else + c.event = XFRM_SAP_ADDED; + + c.seq = nlh->nlmsg_seq; + c.pid = nlh->nlmsg_pid; + km_policy_notify(xp, p->dir, &c); + xfrm_pol_put(xp); return 0; @@ -807,8 +845,10 @@ struct xfrm_policy *xp; struct xfrm_userpolicy_id *p; int err; + struct km_event c; int delete; + p = NLMSG_DATA(nlh); delete = nlh->nlmsg_type == XFRM_MSG_DELPOLICY; @@ -834,6 +874,11 @@ NETLINK_CB(skb).pid, MSG_DONTWAIT); } + } else { + c.event = XFRM_SAP_DELETED; + c.seq = nlh->nlmsg_seq; + c.pid = nlh->nlmsg_pid; + km_policy_notify(xp, p->dir, &c); } xfrm_pol_put(xp); @@ -843,15 +888,28 @@ static int xfrm_flush_sa(struct sk_buff *skb, struct nlmsghdr *nlh, void **xfrma) { + struct km_event c; struct xfrm_usersa_flush *p = NLMSG_DATA(nlh); xfrm_state_flush(p->proto); + c.data = p->proto; + c.event = XFRM_SAP_FLUSHED; + c.seq = nlh->nlmsg_seq; + c.pid = nlh->nlmsg_pid; + km_state_notify(NULL, &c); + return 0; } static int xfrm_flush_policy(struct sk_buff *skb, struct nlmsghdr *nlh, void **xfrma) { + struct km_event c; + xfrm_policy_flush(); + c.event = XFRM_SAP_FLUSHED; + c.seq 
= nlh->nlmsg_seq; + c.pid = nlh->nlmsg_pid; + km_policy_notify(NULL, 0, &c); return 0; } @@ -1053,10 +1111,11 @@ return -1; } -static int xfrm_send_state_notify(struct xfrm_state *x, int hard) +static int xfrm_exp_state_notify(struct xfrm_state *x, struct km_event *c) { struct sk_buff *skb; - + int hard = c ->data; + /* fix to do alloc using NLM macros */ skb = alloc_skb(sizeof(struct xfrm_user_expire) + 16, GFP_ATOMIC); if (skb == NULL) return -ENOMEM; @@ -1069,6 +1128,94 @@ return netlink_broadcast(xfrm_nl, skb, 0, XFRMGRP_EXPIRE, GFP_ATOMIC); } +static int xfrm_notify_sa_flush(struct km_event *c) +{ + struct xfrm_usersa_flush *p; + struct nlmsghdr *nlh; + struct sk_buff *skb; + unsigned char *b; + int len = NLMSG_LENGTH(sizeof(struct xfrm_usersa_flush)); + + skb = alloc_skb(len, GFP_ATOMIC); + if (skb == NULL) + return -ENOMEM; + b = skb->tail; + + nlh = NLMSG_PUT(skb, c->pid, c->seq, + XFRM_MSG_FLUSHSA, sizeof(*p)); + nlh->nlmsg_flags = 0; + + p = NLMSG_DATA(nlh); + p->proto = c->data; + + nlh->nlmsg_len = skb->tail - b; + + return netlink_broadcast(xfrm_nl, skb, 0, XFRMGRP_SA, GFP_ATOMIC); + +nlmsg_failure: + kfree_skb(skb); + return -1; +} + +static int xfrm_notify_sa( struct xfrm_state *x, struct km_event *c) +{ + struct xfrm_usersa_info *p; + struct nlmsghdr *nlh; + struct sk_buff *skb; + u32 nlt; + unsigned char *b; + int len = NLMSG_LENGTH(sizeof(struct xfrm_usersa_info)); + + skb = alloc_skb(len, GFP_ATOMIC); + if (skb == NULL) + return -ENOMEM; + b = skb->tail; + + if (c->event == XFRM_SAP_ADDED) + nlt = XFRM_MSG_NEWSA; + else if (c->event == XFRM_SAP_UPDATED) + nlt = XFRM_MSG_UPDSA; + else if (c->event == XFRM_SAP_DELETED) + nlt = XFRM_MSG_DELSA; + else + goto nlmsg_failure; + + nlh = NLMSG_PUT(skb, c->pid, c->seq, nlt, sizeof(*p)); + nlh->nlmsg_flags = 0; + + p = NLMSG_DATA(nlh); + copy_to_user_state(x, p); + + nlh->nlmsg_len = skb->tail - b; + + return netlink_broadcast(xfrm_nl, skb, 0, XFRMGRP_SA, GFP_ATOMIC); + +nlmsg_failure: + kfree_skb(skb); + 
return -1; +} + +static int xfrm_send_state_notify(struct xfrm_state *x, struct km_event *c) +{ + + switch (c->event) { + case XFRM_SAP_EXPIRED: + return xfrm_exp_state_notify(x, c); + case XFRM_SAP_DELETED: + case XFRM_SAP_UPDATED: + case XFRM_SAP_ADDED: + return xfrm_notify_sa(x, c); + case XFRM_SAP_FLUSHED: + return xfrm_notify_sa_flush(c); + default: + printk("pfkey: Unknown SA event %d\n",c->event); + break; + } + + return 0; + +} + static int build_acquire(struct sk_buff *skb, struct xfrm_state *x, struct xfrm_tmpl *xt, struct xfrm_policy *xp, int dir) @@ -1202,7 +1349,8 @@ return -1; } -static int xfrm_send_policy_notify(struct xfrm_policy *xp, int dir, int hard) + +static int xfrm_exp_policy_notify(struct xfrm_policy *xp, int dir, struct km_event *c) { struct sk_buff *skb; size_t len; @@ -1213,7 +1361,7 @@ if (skb == NULL) return -ENOMEM; - if (build_polexpire(skb, xp, dir, hard) < 0) + if (build_polexpire(skb, xp, dir, c->data) < 0) BUG(); NETLINK_CB(skb).dst_groups = XFRMGRP_EXPIRE; @@ -1221,6 +1369,90 @@ return netlink_broadcast(xfrm_nl, skb, 0, XFRMGRP_EXPIRE, GFP_ATOMIC); } +static int xfrm_notify_policy( struct xfrm_policy *xp, int dir, struct km_event *c) +{ + struct xfrm_userpolicy_info *p; + struct nlmsghdr *nlh; + struct sk_buff *skb; + u32 nlt = 0 ; + unsigned char *b; + int len = NLMSG_LENGTH(sizeof(struct xfrm_userpolicy_info)); + + skb = alloc_skb(len, GFP_ATOMIC); + if (skb == NULL) + return -ENOMEM; + b = skb->tail; + + if (c->event == XFRM_SAP_ADDED) + nlt = XFRM_MSG_NEWPOLICY; + else if (c->event == XFRM_SAP_UPDATED) + nlt = XFRM_MSG_UPDPOLICY; + else if (c->event == XFRM_SAP_DELETED) + nlt = XFRM_MSG_DELPOLICY; + else + goto nlmsg_failure; + + nlh = NLMSG_PUT(skb, c->pid, c->seq, nlt, sizeof(*p)); + + p = NLMSG_DATA(nlh); + + nlh->nlmsg_flags = 0; + + copy_to_user_policy(xp, p, dir); + + nlh->nlmsg_len = skb->tail - b; + + return netlink_broadcast(xfrm_nl, skb, 0, XFRMGRP_POLICY, GFP_ATOMIC); + +nlmsg_failure: + kfree_skb(skb); + return 
-1; +} + +static int xfrm_notify_policy_flush(struct km_event *c) +{ + struct nlmsghdr *nlh; + struct sk_buff *skb; + unsigned char *b; + int len = NLMSG_LENGTH(0); + + skb = alloc_skb(len, GFP_ATOMIC); + if (skb == NULL) + return -ENOMEM; + b = skb->tail; + + + nlh = NLMSG_PUT(skb, c->pid, c->seq, XFRM_MSG_FLUSHPOLICY, 0); + + nlh->nlmsg_len = skb->tail - b; + + return netlink_broadcast(xfrm_nl, skb, 0, XFRMGRP_POLICY, GFP_ATOMIC); + +nlmsg_failure: + kfree_skb(skb); + return -1; +} + +static int xfrm_send_policy_notify(struct xfrm_policy *xp, int dir, struct km_event *c) +{ + + switch (c->event) { + case XFRM_SAP_ADDED: + case XFRM_SAP_UPDATED: + case XFRM_SAP_DELETED: + return xfrm_notify_policy(xp, dir, c); + case XFRM_SAP_FLUSHED: + return xfrm_notify_policy_flush(c); + case XFRM_SAP_EXPIRED: + return xfrm_exp_policy_notify(xp, dir, c); + default: + printk("Netlink Unknown Policy event %d\n",c->event); + } + + return 0; + +} + static struct xfrm_mgr netlink_mgr = { .id = "netlink", .notify = xfrm_send_state_notify, --- a/net/key/af_key.c 2005-03-25 22:28:39.000000000 -0500 +++ b/net/key/af_key.c 2005-04-02 18:05:24.000000000 -0500 @@ -1240,13 +1240,85 @@ return 0; } +static inline int event2poltype (int event) +{ + switch (event) { + case XFRM_SAP_DELETED: + return SADB_X_SPDDELETE; + case XFRM_SAP_ADDED: + return SADB_X_SPDADD; + case XFRM_SAP_UPDATED: + return SADB_X_SPDUPDATE; + case XFRM_SAP_EXPIRED: + // return SADB_X_SPDEXPIRE; + default: + printk("pfkey: Unknown policy event %d\n",event); + break; + } + + return 0; +} + +static inline int event2keytype (int event) +{ + switch (event) { + case XFRM_SAP_DELETED: + return SADB_DELETE; + case XFRM_SAP_ADDED: + return SADB_ADD; + case XFRM_SAP_UPDATED: + return SADB_UPDATE; + case XFRM_SAP_EXPIRED: + return SADB_EXPIRE; + default: + printk("pfkey: Unknown SA event %d\n",event); + break; + } + + return 0; +} + +/* ADD/UPD/DEL */ +static int key_notify_sa(struct xfrm_state *x, struct km_event *c) +{ + struct 
sk_buff *skb; + struct sadb_msg *hdr; + int hsc = 3; + + if (c->event == XFRM_SAP_DELETED) + hsc = 0; + + if (c->event == XFRM_SAP_EXPIRED) { + if (c->data) + hsc = 2; + else + hsc = 1; + } + + skb = pfkey_xfrm_state2msg(x, 0, hsc); + + if (IS_ERR(skb)) + return PTR_ERR(skb); + + hdr = (struct sadb_msg *) skb->data; + hdr->sadb_msg_version = PF_KEY_V2; + hdr->sadb_msg_type = event2keytype(c->event); + hdr->sadb_msg_satype = pfkey_proto2satype(x->id.proto); + hdr->sadb_msg_errno = 0; + hdr->sadb_msg_reserved = 0; + hdr->sadb_msg_seq = c->seq; + hdr->sadb_msg_pid = c->pid; + + pfkey_broadcast(skb, GFP_ATOMIC, BROADCAST_ALL, NULL); + + return 0; +} static int pfkey_add(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs) { - struct sk_buff *out_skb; - struct sadb_msg *out_hdr; struct xfrm_state *x; int err; + struct km_event c; xfrm_probe_algs(); @@ -1256,7 +1328,7 @@ if (hdr->sadb_msg_type == SADB_ADD) err = xfrm_state_add(x); - else + else err = xfrm_state_update(x); if (err < 0) { @@ -1265,27 +1337,22 @@ return err; } - out_skb = pfkey_xfrm_state2msg(x, 0, 3); - if (IS_ERR(out_skb)) - return PTR_ERR(out_skb); /* XXX Should we return 0 here ? 
*/ - - out_hdr = (struct sadb_msg *) out_skb->data; - out_hdr->sadb_msg_version = hdr->sadb_msg_version; - out_hdr->sadb_msg_type = hdr->sadb_msg_type; - out_hdr->sadb_msg_satype = pfkey_proto2satype(x->id.proto); - out_hdr->sadb_msg_errno = 0; - out_hdr->sadb_msg_reserved = 0; - out_hdr->sadb_msg_seq = hdr->sadb_msg_seq; - out_hdr->sadb_msg_pid = hdr->sadb_msg_pid; - - pfkey_broadcast(out_skb, GFP_ATOMIC, BROADCAST_ALL, sk); + if (hdr->sadb_msg_type == SADB_ADD) + c.event = XFRM_SAP_ADDED; + else + c.event = XFRM_SAP_UPDATED; + c.seq = hdr->sadb_msg_seq; + c.pid = hdr->sadb_msg_pid; + km_state_notify(x, &c); - return 0; + return err; } static int pfkey_delete(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs) { struct xfrm_state *x; + struct km_event c; + int err; if (!ext_hdrs[SADB_EXT_SA-1] || !present_and_same_family(ext_hdrs[SADB_EXT_ADDRESS_SRC-1], @@ -1301,13 +1368,20 @@ return -EPERM; } - xfrm_state_delete(x); - xfrm_state_put(x); + err = xfrm_state_delete(x); + if (err < 0) { + x->km.state = XFRM_STATE_DEAD; + xfrm_state_put(x); + return err; + } - pfkey_broadcast(skb_clone(skb, GFP_KERNEL), GFP_KERNEL, - BROADCAST_ALL, sk); + c.seq = hdr->sadb_msg_seq; + c.pid = hdr->sadb_msg_pid; + c.event = XFRM_SAP_DELETED; + km_state_notify(x, &c); + xfrm_state_put(x); - return 0; + return err; } static int pfkey_get(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs) @@ -1445,28 +1519,42 @@ return 0; } +static int key_notify_sa_flush(struct km_event *c) +{ + struct sk_buff *skb; + struct sadb_msg *hdr; + + skb = alloc_skb(sizeof(struct sadb_msg) + 16, GFP_ATOMIC); + if (!skb) + return -ENOBUFS; + hdr = (struct sadb_msg *) skb_put(skb, sizeof(struct sadb_msg)); + // XXX:do we have to pass proto as well? 
+ hdr->sadb_msg_seq = c->seq; + hdr->sadb_msg_pid = c->pid; + hdr->sadb_msg_version = PF_KEY_V2; + hdr->sadb_msg_errno = (uint8_t) 0; + hdr->sadb_msg_len = (sizeof(struct sadb_msg) / sizeof(uint64_t)); + + pfkey_broadcast(skb, GFP_ATOMIC, BROADCAST_ALL, NULL); + + return 0; +} + static int pfkey_flush(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs) { unsigned proto; - struct sk_buff *skb_out; - struct sadb_msg *hdr_out; + struct km_event c; proto = pfkey_satype2proto(hdr->sadb_msg_satype); if (proto == 0) return -EINVAL; - skb_out = alloc_skb(sizeof(struct sadb_msg) + 16, GFP_KERNEL); - if (!skb_out) - return -ENOBUFS; - xfrm_state_flush(proto); - - hdr_out = (struct sadb_msg *) skb_put(skb_out, sizeof(struct sadb_msg)); - pfkey_hdr_dup(hdr_out, hdr); - hdr_out->sadb_msg_errno = (uint8_t) 0; - hdr_out->sadb_msg_len = (sizeof(struct sadb_msg) / sizeof(uint64_t)); - - pfkey_broadcast(skb_out, GFP_KERNEL, BROADCAST_ALL, NULL); + c.data = proto; + c.seq = hdr->sadb_msg_seq; + c.pid = hdr->sadb_msg_pid; + c.event = XFRM_SAP_FLUSHED; + km_state_notify(NULL, &c); return 0; } @@ -1859,6 +1947,31 @@ hdr->sadb_msg_reserved = atomic_read(&xp->refcnt); } +static int key_notify_policy( struct xfrm_policy *xp, int dir, struct km_event *c) +{ + struct sk_buff *out_skb; + struct sadb_msg *out_hdr; + int err; + + out_skb = pfkey_xfrm_policy2msg_prep(xp); + if (IS_ERR(out_skb)) { + err = PTR_ERR(out_skb); + goto out; + } + pfkey_xfrm_policy2msg(out_skb, xp, dir); + + out_hdr = (struct sadb_msg *) out_skb->data; + out_hdr->sadb_msg_version = PF_KEY_V2; + out_hdr->sadb_msg_type = event2poltype(c->event); + out_hdr->sadb_msg_errno = 0; + out_hdr->sadb_msg_seq = c->seq; + out_hdr->sadb_msg_pid = c->pid; + pfkey_broadcast(out_skb, GFP_ATOMIC, BROADCAST_ALL, NULL); +out: + return 0; + +} + static int pfkey_spdadd(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs) { int err; @@ -1866,8 +1979,7 @@ struct sadb_address *sa; struct 
sadb_x_policy *pol; struct xfrm_policy *xp; - struct sk_buff *out_skb; - struct sadb_msg *out_hdr; + struct km_event c; if (!present_and_same_family(ext_hdrs[SADB_EXT_ADDRESS_SRC-1], ext_hdrs[SADB_EXT_ADDRESS_DST-1]) || @@ -1935,31 +2047,25 @@ (err = parse_ipsecrequests(xp, pol)) < 0) goto out; - out_skb = pfkey_xfrm_policy2msg_prep(xp); - if (IS_ERR(out_skb)) { - err = PTR_ERR(out_skb); - goto out; - } err = xfrm_policy_insert(pol->sadb_x_policy_dir-1, xp, hdr->sadb_msg_type != SADB_X_SPDUPDATE); + if (err) { - kfree_skb(out_skb); - goto out; + kfree(xp); + return err; } - pfkey_xfrm_policy2msg(out_skb, xp, pol->sadb_x_policy_dir-1); + if (hdr->sadb_msg_type == SADB_X_SPDUPDATE) + c.event = XFRM_SAP_UPDATED; + else + c.event = XFRM_SAP_ADDED; - xfrm_pol_put(xp); + c.seq = hdr->sadb_msg_seq; + c.pid = hdr->sadb_msg_pid; - out_hdr = (struct sadb_msg *) out_skb->data; - out_hdr->sadb_msg_version = hdr->sadb_msg_version; - out_hdr->sadb_msg_type = hdr->sadb_msg_type; - out_hdr->sadb_msg_satype = 0; - out_hdr->sadb_msg_errno = 0; - out_hdr->sadb_msg_seq = hdr->sadb_msg_seq; - out_hdr->sadb_msg_pid = hdr->sadb_msg_pid; - pfkey_broadcast(out_skb, GFP_ATOMIC, BROADCAST_ALL, sk); + km_policy_notify(xp, pol->sadb_x_policy_dir-1, &c); + xfrm_pol_put(xp); return 0; out: @@ -1973,9 +2079,8 @@ struct sadb_address *sa; struct sadb_x_policy *pol; struct xfrm_policy *xp; - struct sk_buff *out_skb; - struct sadb_msg *out_hdr; struct xfrm_selector sel; + struct km_event c; if (!present_and_same_family(ext_hdrs[SADB_EXT_ADDRESS_SRC-1], ext_hdrs[SADB_EXT_ADDRESS_DST-1]) || @@ -2010,24 +2115,11 @@ err = 0; - out_skb = pfkey_xfrm_policy2msg_prep(xp); - if (IS_ERR(out_skb)) { - err = PTR_ERR(out_skb); - goto out; - } - pfkey_xfrm_policy2msg(out_skb, xp, pol->sadb_x_policy_dir-1); - - out_hdr = (struct sadb_msg *) out_skb->data; - out_hdr->sadb_msg_version = hdr->sadb_msg_version; - out_hdr->sadb_msg_type = SADB_X_SPDDELETE; - out_hdr->sadb_msg_satype = 0; - out_hdr->sadb_msg_errno = 0; - 
out_hdr->sadb_msg_seq = hdr->sadb_msg_seq; - out_hdr->sadb_msg_pid = hdr->sadb_msg_pid; - pfkey_broadcast(out_skb, GFP_ATOMIC, BROADCAST_ALL, sk); - err = 0; + c.seq = hdr->sadb_msg_seq; + c.pid = hdr->sadb_msg_pid; + c.event = XFRM_SAP_DELETED; + km_policy_notify(xp, pol->sadb_x_policy_dir-1, &c); -out: xfrm_pol_put(xp); return err; } @@ -2037,8 +2129,7 @@ int err; struct sadb_x_policy *pol; struct xfrm_policy *xp; - struct sk_buff *out_skb; - struct sadb_msg *out_hdr; + struct km_event c; if ((pol = ext_hdrs[SADB_X_EXT_POLICY-1]) == NULL) return -EINVAL; @@ -2050,24 +2141,19 @@ err = 0; - out_skb = pfkey_xfrm_policy2msg_prep(xp); - if (IS_ERR(out_skb)) { - err = PTR_ERR(out_skb); - goto out; + /* + * XXX: previous get was doing a broadcast-all _always_ + * which didnt seem right for non-deletion case - JHS + * This is like the way netlink behaves .. + * Shall i restore original behavior? + */ + if (hdr->sadb_msg_type == SADB_X_SPDDELETE2) { + c.seq = hdr->sadb_msg_seq; + c.pid = hdr->sadb_msg_pid; + c.event = XFRM_SAP_DELETED; + km_policy_notify(xp, pol->sadb_x_policy_dir-1, &c); } - pfkey_xfrm_policy2msg(out_skb, xp, pol->sadb_x_policy_dir-1); - - out_hdr = (struct sadb_msg *) out_skb->data; - out_hdr->sadb_msg_version = hdr->sadb_msg_version; - out_hdr->sadb_msg_type = hdr->sadb_msg_type; - out_hdr->sadb_msg_satype = 0; - out_hdr->sadb_msg_errno = 0; - out_hdr->sadb_msg_seq = hdr->sadb_msg_seq; - out_hdr->sadb_msg_pid = hdr->sadb_msg_pid; - pfkey_broadcast(out_skb, GFP_ATOMIC, BROADCAST_ALL, sk); - err = 0; -out: xfrm_pol_put(xp); return err; } @@ -2102,22 +2188,33 @@ return xfrm_policy_walk(dump_sp, &data); } -static int pfkey_spdflush(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs) +static int key_notify_policy_flush(struct km_event *c) { struct sk_buff *skb_out; - struct sadb_msg *hdr_out; - - skb_out = alloc_skb(sizeof(struct sadb_msg) + 16, GFP_KERNEL); + struct sadb_msg *hdr; + skb_out = alloc_skb(sizeof(struct sadb_msg) + 16, 
GFP_ATOMIC); if (!skb_out) return -ENOBUFS; + hdr = (struct sadb_msg *) skb_put(skb_out, sizeof(struct sadb_msg)); + hdr->sadb_msg_seq = c->seq; + hdr->sadb_msg_pid = c->pid; + hdr->sadb_msg_version = PF_KEY_V2; + hdr->sadb_msg_errno = (uint8_t) 0; + hdr->sadb_msg_len = (sizeof(struct sadb_msg) / sizeof(uint64_t)); + pfkey_broadcast(skb_out, GFP_ATOMIC, BROADCAST_ALL, NULL); + return 0; - xfrm_policy_flush(); +} - hdr_out = (struct sadb_msg *) skb_put(skb_out, sizeof(struct sadb_msg)); - pfkey_hdr_dup(hdr_out, hdr); - hdr_out->sadb_msg_errno = (uint8_t) 0; - hdr_out->sadb_msg_len = (sizeof(struct sadb_msg) / sizeof(uint64_t)); - pfkey_broadcast(skb_out, GFP_KERNEL, BROADCAST_ALL, NULL); +static int pfkey_spdflush(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs) +{ + struct km_event c; + + xfrm_policy_flush(); + c.event = XFRM_SAP_FLUSHED; + c.pid = hdr->sadb_msg_pid; + c.seq = hdr->sadb_msg_seq; + km_policy_notify(NULL, 0, &c); return 0; } @@ -2317,11 +2414,25 @@ } } -static int pfkey_send_notify(struct xfrm_state *x, int hard) +/* XXX: Noisy for now */ +static int key_notify_policy_expire(struct xfrm_policy *xp, struct km_event *c) +{ + printk("pfkey doesnt deal with expired policies ..\n"); + return 0; +} + +static int key_notify_sa_expire(struct xfrm_state *x, struct km_event *c) { struct sk_buff *out_skb; struct sadb_msg *out_hdr; - int hsc = (hard ? 
2 : 1); + int hard; + int hsc; + + hard = c->data; + if (hard) + hsc = 2; + else + hsc = 1; out_skb = pfkey_xfrm_state2msg(x, 0, hsc); if (IS_ERR(out_skb)) @@ -2340,6 +2451,43 @@ return 0; } +static int pfkey_send_notify(struct xfrm_state *x, struct km_event *c) +{ + switch (c->event) { + case XFRM_SAP_EXPIRED: + return key_notify_sa_expire(x, c); + case XFRM_SAP_DELETED: + case XFRM_SAP_ADDED: + case XFRM_SAP_UPDATED: + return key_notify_sa(x, c); + case XFRM_SAP_FLUSHED: + return key_notify_sa_flush(c); + default: + printk("pfkey: Unknown SA event %d\n",c->event); + break; + } + + return 0; +} + +static int pfkey_send_policy_notify(struct xfrm_policy *xp, int dir, struct km_event *c) +{ + switch (c->event) { + case XFRM_SAP_EXPIRED: + return key_notify_policy_expire(xp, c); + case XFRM_SAP_DELETED: + case XFRM_SAP_ADDED: + case XFRM_SAP_UPDATED: + return key_notify_policy(xp, dir, c); + case XFRM_SAP_FLUSHED: + return key_notify_policy_flush(c); + default: + printk("pfkey: Unknown policy event %d\n",c->event); + break; + } + + return 0; +} static u32 get_acqseq(void) { u32 res; @@ -2856,6 +3004,7 @@ .acquire = pfkey_send_acquire, .compile_policy = pfkey_compile_policy, .new_mapping = pfkey_send_new_mapping, + .notify_policy = pfkey_send_policy_notify, }; static void __exit ipsec_pfkey_exit(void) --=-CbZvGNdJ/zGTATpkMExl-- From abhishek@pal.ece.iisc.ernet.in Sun Apr 3 08:00:53 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 08:01:00 -0700 (PDT) Received: from ece.iisc.ernet.in (ece.iisc.ernet.in [144.16.64.2]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j33F0o2B024756 for ; Sun, 3 Apr 2005 08:00:52 -0700 Received: from pal.ece.iisc.ernet.in (pal.ece.iisc.ernet.in [144.16.64.149]) by ece.iisc.ernet.in (8.12.6/8.12.6) with ESMTP id j33Ew58V086581; Sun, 3 Apr 2005 20:28:10 +0530 (IST) (envelope-from abhishek@pal.ece.iisc.ernet.in) Received: by pal.ece.iisc.ernet.in (Postfix, from userid 1047) id CB30F31E59; Sun, 3 Apr 2005 20:30:19 +0530 (IST) 
Received: from localhost (localhost [127.0.0.1]) by pal.ece.iisc.ernet.in (Postfix) with ESMTP id C73C631E57; Sun, 3 Apr 2005 20:30:19 +0530 (IST) Date: Sun, 3 Apr 2005 20:30:19 +0530 (IST) From: Abhishek Gupta To: Thomas Graf Cc: netdev@oss.sgi.com Subject: Re: Problem using HTB In-Reply-To: <20050402213642.GO3086@postel.suug.ch> Message-ID: References: <20050402213642.GO3086@postel.suug.ch> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1294 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: abhishek@pal.ece.iisc.ernet.in Precedence: bulk X-list: netdev hello Thanks Mr. Graf for replying. Yes, I was making the mistake of reading KBps as Kbit per second. I got confused by the notation used in Linux's RH monitor, which I used for the speed measurements. But the problem is still not yet solved as I tried with 1Mbit speed as the setting for link speed in the htb configuration and got about 30KBps which amounts to about 240Kbitps even though my UDP source is sending at speed of about 1MBps(8Mbps), according to RH monitor readings. Is it possible that the problem is due to the source that I am using for UDP packets? abhishek ========================================================================= ABHISHEK GUPTA E-mail:abhishek_it_bhu@yahoo.co.in ========================================================================= On Sat, 2 Apr 2005, Thomas Graf wrote: > * Abhishek Gupta 2005-04-01 15:10 > > tc class add dev $DEV0 parent 2: classid 2:1 htb rate 100kbit burst 100 \ > > ceil 100kbit > > [...] > > I have configured for 100kbps, I am getting only 12kbps as the link speed. > > Before I look into this, are you aware of 1kbps=8kbit?
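For reference, the unit arithmetic behind the figures in this exchange can be checked with a small sketch. It assumes tc's suffix convention (kbit is kilobits per second, kbps is kilobytes per second); the helper names are made up for illustration and are not part of tc or the kernel.

```c
#include <stdio.h>

/* Hypothetical helpers, for illustration only. */

/* Kilobytes per second -> kilobits per second (1 byte = 8 bits). */
static double kbytes_to_kbits(double kbytes_per_sec)
{
    return kbytes_per_sec * 8.0;
}

/* Kilobits per second -> kilobytes per second. */
static double kbits_to_kbytes(double kbits_per_sec)
{
    return kbits_per_sec / 8.0;
}

/* Print the conversions for the figures quoted in the thread. */
static void print_thread_figures(void)
{
    printf("100 kbit/s  = %.1f kbyte/s\n", kbits_to_kbytes(100.0));
    printf("1000 kbit/s = %.1f kbyte/s\n", kbits_to_kbytes(1000.0));
    printf("30 kbyte/s  = %.1f kbit/s\n", kbytes_to_kbits(30.0));
}
```

Calling print_thread_figures() from a main() shows that a 100kbit class works out to about 12.5 kilobytes per second, which is consistent with the earlier "12kbps" reading, while 30 kilobytes per second against a 1Mbit class (roughly 125 kilobytes per second) is a real shortfall.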
> From kaber@trash.net Sun Apr 3 08:49:12 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 08:49:16 -0700 (PDT) Received: from kaber.coreworks.de ([62.206.217.67]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j33FnBfm030687 for ; Sun, 3 Apr 2005 08:49:12 -0700 Received: from localhost ([127.0.0.1]) by kaber.coreworks.de with esmtp (Exim 4.50) id 1DI7KJ-0008Pn-Od; Sun, 03 Apr 2005 17:47:51 +0200 Message-ID: <42501027.6010609@trash.net> Date: Sun, 03 Apr 2005 17:47:51 +0200 From: Patrick McHardy User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.6) Gecko/20050324 Debian/1.7.6-1 X-Accept-Language: en MIME-Version: 1.0 To: hadi@cyberus.ca CC: Herbert Xu , Masahide NAKAMURA , "David S. Miller" , netdev Subject: Re: take 2 WAS(Re: PATCH: IPSEC xfrm events References: <1112319441.1089.83.camel@jzny.localdomain> <20050401042106.GA27762@gondor.apana.org.au> <1112353398.1096.116.camel@jzny.localdomain> <20050401114258.GA2932@gondor.apana.org.au> <1112358278.1096.160.camel@jzny.localdomain> <20050401123554.GA3468@gondor.apana.org.au> <1112403845.1088.14.camel@jzny.localdomain> <20050402012813.GA24575@gondor.apana.org.au> <1112406164.1088.54.camel@jzny.localdomain> <20050402014619.GB24861@gondor.apana.org.au> <1112469601.1088.173.camel@jzny.localdomain> <1112538718.1096.394.camel@jzny.localdomain> In-Reply-To: <1112538718.1096.394.camel@jzny.localdomain> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1295 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kaber@trash.net Precedence: bulk X-list: netdev jamal wrote: >>+void km_policy_notify(struct xfrm_policy *xp, int dir, struct km_event *c) >> { >>+ struct xfrm_mgr *km; >>+ >>+ read_lock(&xfrm_km_lock); >>+ list_for_each_entry(km, &xfrm_km_list, list) >>+ if 
(km->notify_policy) >>+ km->notify_policy(xp, dir, c); >>+ read_unlock(&xfrm_km_lock); >>+} >>+ >>+void km_state_notify(struct xfrm_state *x, struct km_event *c) >>+{ >>+ struct xfrm_mgr *km; >>+ read_lock(&xfrm_km_lock); >>+ list_for_each_entry(km, &xfrm_km_list, list) >>+ km->notify(x, c); >>+ read_unlock(&xfrm_km_lock); >>+} You call these functions from both softirq- and user-context, so you need to protect against BHs. Regards Patrick From kaber@trash.net Sun Apr 3 08:53:34 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 08:53:38 -0700 (PDT) Received: from kaber.coreworks.de ([62.206.217.67]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j33FrXtl031326 for ; Sun, 3 Apr 2005 08:53:34 -0700 Received: from localhost ([127.0.0.1]) by kaber.coreworks.de with esmtp (Exim 4.50) id 1DI7Ok-0008QN-9f; Sun, 03 Apr 2005 17:52:26 +0200 Message-ID: <4250113A.4080202@trash.net> Date: Sun, 03 Apr 2005 17:52:26 +0200 From: Patrick McHardy User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.6) Gecko/20050324 Debian/1.7.6-1 X-Accept-Language: en MIME-Version: 1.0 To: hadi@cyberus.ca CC: Alexey Kuznetsov , Herbert Xu , "David S. Miller" , Masahide NAKAMURA , ipsec-tools-devel@lists.sourceforge.net, netdev , jmorris@redhat.com Subject: Re: IPSEC: on behavior of acquire References: <1112405144.1096.33.camel@jzny.localdomain> <20050402140019.GA13017@yakov.inr.ac.ru> <1112478168.1088.337.camel@jzny.localdomain> In-Reply-To: <1112478168.1088.337.camel@jzny.localdomain> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1296 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kaber@trash.net Precedence: bulk X-list: netdev jamal wrote: > Herbert also mentions something along the same lines in his email. > This would make a lot of sense! 
> Is the state machine going to look something along the same lines as > ARP? i.e incomplete->reachable etc? Yes, from a bundle POV. In my current approach a single state is resolved at a time and resolution is driven by XFRM_STATE_ACQ->* state transitions. > What would be a good code to return when you queue the packet? It should be transparent, so 0. Regards Patrick From kaber@trash.net Sun Apr 3 09:13:38 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 09:13:44 -0700 (PDT) Received: from kaber.coreworks.de ([62.206.217.67]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j33GDbAA032335 for ; Sun, 3 Apr 2005 09:13:38 -0700 Received: from localhost ([127.0.0.1]) by kaber.coreworks.de with esmtp (Exim 4.50) id 1DI7if-0008Tj-3S; Sun, 03 Apr 2005 18:13:01 +0200 Message-ID: <4250160D.2040405@trash.net> Date: Sun, 03 Apr 2005 18:13:01 +0200 From: Patrick McHardy User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.6) Gecko/20050324 Debian/1.7.6-1 X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" CC: Herbert Xu , netdev Subject: [IPSEC]: Protect against BHs in xfrm_user_policy() Content-Type: multipart/mixed; boundary="------------040106090202000803080206" X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1297 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kaber@trash.net Precedence: bulk X-list: netdev This is a multi-part message in MIME format. --------------040106090202000803080206 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit xfrm_user_policy() is called from ip_setsockopt with enabled BHs, so it needs to protect against them when grabbing xfrm_km_lock. 
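A minimal userspace sketch of the rule behind this change, assuming the usual reasoning for the _bh lock variants: if a lock can also be taken from softirq context, a process-context holder that leaves BHs enabled can be interrupted on the same CPU by a softirq that then waits on that lock. The lock is modelled here as a plain exclusive lock on a single CPU (the kernel's rwlock is more permissive towards nested readers), and every name below is hypothetical, not a kernel API.

```c
#include <stdbool.h>

/* Hypothetical single-CPU model; none of these names exist in the kernel. */
struct cpu_model {
    bool lock_held;   /* lock is held by process context on this CPU */
    int bh_disabled;  /* nesting depth of "BHs disabled" */
};

/* Process context takes the lock; with_bh_off mimics the _bh variants. */
static void model_lock(struct cpu_model *cpu, bool with_bh_off)
{
    if (with_bh_off)
        cpu->bh_disabled++;
    cpu->lock_held = true;
}

/* Process context drops the lock and, if it disabled BHs, re-enables them. */
static void model_unlock(struct cpu_model *cpu, bool with_bh_off)
{
    cpu->lock_held = false;
    if (with_bh_off)
        cpu->bh_disabled--;
}

/*
 * A softirq that wants the same lock becomes pending on this CPU.
 * Returns 0 if it is deferred or can run safely, -1 if it would wait
 * on a lock held by the very context it interrupted.
 */
static int model_softirq_arrives(const struct cpu_model *cpu)
{
    if (cpu->bh_disabled > 0)
        return 0;   /* deferred until BHs are enabled again */
    if (cpu->lock_held)
        return -1;  /* would wait on the interrupted holder */
    return 0;
}
```

In this model, model_lock(&cpu, false) followed by model_softirq_arrives(&cpu) reports the hazard, while model_lock(&cpu, true) does not, which is the difference between the plain and the _bh lock calls in the patch.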
--------------040106090202000803080206 Content-Type: text/plain; name="x" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="x" # This is a BitKeeper generated diff -Nru style patch. # # ChangeSet # 2005/04/03 17:36:10+02:00 kaber@coreworks.de # [IPSEC]: Protect against BHs in xfrm_user_policy() # # Signed-off-by: Patrick McHardy # # net/xfrm/xfrm_state.c # 2005/04/03 17:36:00+02:00 kaber@coreworks.de +2 -2 # [IPSEC]: Protect against BHs in xfrm_user_policy() # # Signed-off-by: Patrick McHardy # diff -Nru a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c --- a/net/xfrm/xfrm_state.c 2005-04-03 18:04:38 +02:00 +++ b/net/xfrm/xfrm_state.c 2005-04-03 18:04:38 +02:00 @@ -878,14 +878,14 @@ goto out; err = -EINVAL; - read_lock(&xfrm_km_lock); + read_lock_bh(&xfrm_km_lock); list_for_each_entry(km, &xfrm_km_list, list) { pol = km->compile_policy(sk->sk_family, optname, data, optlen, &err); if (err >= 0) break; } - read_unlock(&xfrm_km_lock); + read_unlock_bh(&xfrm_km_lock); if (err >= 0) { xfrm_sk_policy_insert(sk, err, pol); --------------040106090202000803080206-- From hadi@cyberus.ca Sun Apr 3 09:29:15 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 09:29:19 -0700 (PDT) Received: from mx03.cybersurf.com (mx03.cybersurf.com [209.197.145.106]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j33GTFs6000902 for ; Sun, 3 Apr 2005 09:29:15 -0700 Received: from mail.cyberus.ca ([209.197.145.21]) by mx03.cybersurf.com with esmtp (Exim 4.30) id 1DI7yM-0005vy-Au for netdev@oss.sgi.com; Sun, 03 Apr 2005 12:29:14 -0400 Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DI7yI-00089r-7W; Sun, 03 Apr 2005 12:29:10 -0400 Subject: Re: take 2 WAS(Re: PATCH: IPSEC xfrm events From: jamal Reply-To: hadi@cyberus.ca To: Patrick McHardy Cc: Herbert Xu , Masahide NAKAMURA , "David S. 
Miller" , netdev In-Reply-To: <42501027.6010609@trash.net> References: <1112319441.1089.83.camel@jzny.localdomain> <20050401042106.GA27762@gondor.apana.org.au> <1112353398.1096.116.camel@jzny.localdomain> <20050401114258.GA2932@gondor.apana.org.au> <1112358278.1096.160.camel@jzny.localdomain> <20050401123554.GA3468@gondor.apana.org.au> <1112403845.1088.14.camel@jzny.localdomain> <20050402012813.GA24575@gondor.apana.org.au> <1112406164.1088.54.camel@jzny.localdomain> <20050402014619.GB24861@gondor.apana.org.au> <1112469601.1088.173.camel@jzny.localdomain> <1112538718.1096.394.camel@jzny.localdomain> <42501027.6010609@trash.net> Content-Type: text/plain Organization: jamalopolous Message-Id: <1112545744.1087.397.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 03 Apr 2005 12:29:05 -0400 Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1298 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Sun, 2005-04-03 at 11:47, Patrick McHardy wrote: > >>+void km_state_notify(struct xfrm_state *x, struct km_event *c) > >>+{ > >>+ struct xfrm_mgr *km; > >>+ read_lock(&xfrm_km_lock); > >>+ list_for_each_entry(km, &xfrm_km_list, list) > >>+ km->notify(x, c); > >>+ read_unlock(&xfrm_km_lock); > >>+} > > You call these functions from both softirq- and user-context, so you > need to protect against BHs. > You are absolutely correct. Thanks for catching this. 
cheers, jamal From hadi@cyberus.ca Sun Apr 3 09:36:44 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 09:36:48 -0700 (PDT) Received: from mx03.cybersurf.com (mx03.cybersurf.com [209.197.145.106]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j33Gaiv4001707 for ; Sun, 3 Apr 2005 09:36:44 -0700 Received: from mail.cyberus.ca ([209.197.145.21]) by mx03.cybersurf.com with esmtp (Exim 4.30) id 1DI85b-00083D-L5 for netdev@oss.sgi.com; Sun, 03 Apr 2005 12:36:43 -0400 Received: from [24.103.99.32] (helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1DI85Y-0000LD-Dg; Sun, 03 Apr 2005 12:36:40 -0400 Subject: Re: take 2 WAS(Re: PATCH: IPSEC xfrm events From: jamal Reply-To: hadi@cyberus.ca To: Patrick McHardy Cc: Herbert Xu , Masahide NAKAMURA , "David S. Miller" , netdev In-Reply-To: <42501027.6010609@trash.net> References: <1112319441.1089.83.camel@jzny.localdomain> <20050401042106.GA27762@gondor.apana.org.au> <1112353398.1096.116.camel@jzny.localdomain> <20050401114258.GA2932@gondor.apana.org.au> <1112358278.1096.160.camel@jzny.localdomain> <20050401123554.GA3468@gondor.apana.org.au> <1112403845.1088.14.camel@jzny.localdomain> <20050402012813.GA24575@gondor.apana.org.au> <1112406164.1088.54.camel@jzny.localdomain> <20050402014619.GB24861@gondor.apana.org.au> <1112469601.1088.173.camel@jzny.localdomain> <1112538718.1096.394.camel@jzny.localdomain> <42501027.6010609@trash.net> Content-Type: multipart/mixed; boundary="=-HzY6ovv3o1agHd1AZ7ya" Organization: jamalopolous Message-Id: <1112546194.1096.401.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 03 Apr 2005 12:36:35 -0400 X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1299 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev --=-HzY6ovv3o1agHd1AZ7ya Content-Type: text/plain 
Content-Transfer-Encoding: 7bit Masahide, Attached is incremental patch on top of the one posted earlier. Looks ok from my basic testing. Please run it against your tests and see if it stands. cheers, jamal On Sun, 2005-04-03 at 11:47, Patrick McHardy wrote: > You call these functions from both softirq- and user-context, so you > need to protect against BHs. > > Regards > Patrick > --=-HzY6ovv3o1agHd1AZ7ya Content-Disposition: attachment; filename=ipsec-event-take2-1-1 Content-Type: text/plain; name=ipsec-event-take2-1-1; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit --- a/net/xfrm/xfrm_state.c 2005/04/03 16:30:31 1.2 +++ b/net/xfrm/xfrm_state.c 2005/04/03 16:31:27 @@ -251,20 +251,20 @@ { struct xfrm_mgr *km; - read_lock(&xfrm_km_lock); + read_lock_bh(&xfrm_km_lock); list_for_each_entry(km, &xfrm_km_list, list) if (km->notify_policy) km->notify_policy(xp, dir, c); - read_unlock(&xfrm_km_lock); + read_unlock_bh(&xfrm_km_lock); } void km_state_notify(struct xfrm_state *x, struct km_event *c) { struct xfrm_mgr *km; - read_lock(&xfrm_km_lock); + read_lock_bh(&xfrm_km_lock); list_for_each_entry(km, &xfrm_km_list, list) km->notify(x, c); - read_unlock(&xfrm_km_lock); + read_unlock_bh(&xfrm_km_lock); } EXPORT_SYMBOL(km_policy_notify); --=-HzY6ovv3o1agHd1AZ7ya-- From kaber@trash.net Sun Apr 3 09:49:10 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 09:49:15 -0700 (PDT) Received: from kaber.coreworks.de ([62.206.217.67]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j33Gn9Fr002547 for ; Sun, 3 Apr 2005 09:49:10 -0700 Received: from localhost ([127.0.0.1]) by kaber.coreworks.de with esmtp (Exim 4.50) id 1DI8Go-0003lv-39; Sun, 03 Apr 2005 18:48:18 +0200 Message-ID: <42501E51.3000401@trash.net> Date: Sun, 03 Apr 2005 18:48:17 +0200 From: Patrick McHardy User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.6) Gecko/20050324 Debian/1.7.6-1 X-Accept-Language: en MIME-Version: 1.0 To: Herbert Xu CC: "David S. 
Miller" , kuznet@ms2.inr.ac.ru, jmorris@redhat.com, yoshfuji@linux-ipv6.org, netdev@oss.sgi.com Subject: Re: [IPSEC]: Kill nested read lock by deleting xfrm_init_tempsel References: <20050214221200.GA18465@gondor.apana.org.au> <20050214221433.GB18465@gondor.apana.org.au> <20050214221607.GC18465@gondor.apana.org.au> <424864CE.5060802@trash.net> <20050328233917.GB15369@gondor.apana.org.au> <424B40C2.90304@trash.net> <20050331004658.GA26395@gondor.apana.org.au> <20050331212325.5e996432.davem@davemloft.net> <20050402004956.GA24339@gondor.apana.org.au> <20050401172007.7296eced.davem@davemloft.net> <20050402020947.GA24998@gondor.apana.org.au> In-Reply-To: <20050402020947.GA24998@gondor.apana.org.au> Content-Type: multipart/mixed; boundary="------------070809070105060803010504" X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1300 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kaber@trash.net Precedence: bulk X-list: netdev This is a multi-part message in MIME format. --------------070809070105060803010504 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Herbert Xu wrote: > It's still a valid clean-up patch though. Agreed. There is also a bug in my patch: tmpl->daddr can be 0, in which case the daddr passed as an argument to xfrm_state_find() will be used. My patch only checked tmpl->daddr; this patch fixes it. It also uses afinfo->init_tempsel directly, but I didn't kill xfrm_init_tempsel() yet because I need it for xfrm resolution. > There is another reason why it won't deadlock. We don't actually > ever hold the write lock on afinfo :) Is there any reason why we > don't just use xfrm_state_afinfo_lock instead of afinfo->lock? I don't think so.
I also don't see a reason why the lock needs to be held between xfrm_state_get_afinfo() and xfrm_state_put_afinfo(), a reference count should be enough. Regards Patrick --------------070809070105060803010504 Content-Type: text/plain; name="x" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="x" # This is a BitKeeper generated diff -Nru style patch. # # ChangeSet # 2005/04/03 18:41:22+02:00 kaber@coreworks.de # [IPSEC]: Use correct daddr for duplicate state check # # Signed-off-by: Patrick McHardy # # net/xfrm/xfrm_state.c # 2005/04/03 18:41:14+02:00 kaber@coreworks.de +9 -9 # [IPSEC]: Use correct daddr for duplicate state check # # Signed-off-by: Patrick McHardy # diff -Nru a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c --- a/net/xfrm/xfrm_state.c 2005-04-03 18:41:41 +02:00 +++ b/net/xfrm/xfrm_state.c 2005-04-03 18:41:41 +02:00 @@ -357,12 +357,6 @@ x = best; if (!x && !error && !acquire_in_progress) { - x0 = afinfo->state_lookup(&tmpl->id.daddr, tmpl->id.spi, tmpl->id.proto); - if (x0 != NULL) { - xfrm_state_put(x0); - error = -EEXIST; - goto out; - } x = xfrm_state_alloc(); if (x == NULL) { error = -ENOMEM; @@ -370,9 +364,11 @@ } /* Initialize temporary selector matching only * to current session. 
*/ - xfrm_init_tempsel(x, fl, tmpl, daddr, saddr, family); + afinfo->init_tempsel(x, fl, tmpl, daddr, saddr); + + x0 = afinfo->state_lookup(&x->id.daddr, x->id.spi, x->id.proto); - if (km_query(x, tmpl, pol) == 0) { + if (!x0 && km_query(x, tmpl, pol) == 0) { x->km.state = XFRM_STATE_ACQ; list_add_tail(&x->bydst, xfrm_state_bydst+h); xfrm_state_hold(x); @@ -386,10 +382,14 @@ x->timer.expires = jiffies + XFRM_ACQ_EXPIRES*HZ; add_timer(&x->timer); } else { + error = -ESRCH; + if (x0) { + xfrm_state_put(x0); + error = -EEXIST; + } x->km.state = XFRM_STATE_DEAD; xfrm_state_put(x); x = NULL; - error = -ESRCH; } } out: --------------070809070105060803010504-- From kaber@trash.net Sun Apr 3 10:01:27 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 10:01:31 -0700 (PDT) Received: from kaber.coreworks.de ([62.206.217.67]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j33H1QKT003360 for ; Sun, 3 Apr 2005 10:01:27 -0700 Received: from localhost ([127.0.0.1]) by kaber.coreworks.de with esmtp (Exim 4.50) id 1DI8Si-0003mQ-Hy; Sun, 03 Apr 2005 19:00:36 +0200 Message-ID: <42502134.8030003@trash.net> Date: Sun, 03 Apr 2005 19:00:36 +0200 From: Patrick McHardy User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.6) Gecko/20050324 Debian/1.7.6-1 X-Accept-Language: en MIME-Version: 1.0 To: Herbert Xu CC: "David S. 
Miller" , Alexey Kuznetsov , James Morris , YOSHIFUJI Hideaki , netdev@oss.sgi.com Subject: Re: Checking SPI in xfrm_state_find References: <20050214221006.GA18415@gondor.apana.org.au> <20050214221200.GA18465@gondor.apana.org.au> <20050214221433.GB18465@gondor.apana.org.au> <20050214221607.GC18465@gondor.apana.org.au> <424864CE.5060802@trash.net> <20050328233917.GB15369@gondor.apana.org.au> <424B40C2.90304@trash.net> <20050331004658.GA26395@gondor.apana.org.au> In-Reply-To: <20050331004658.GA26395@gondor.apana.org.au> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1301 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kaber@trash.net Precedence: bulk X-list: netdev Herbert Xu wrote: > It just occurred to me that it would be much simpler if you did the > existence check in the first loop. > > So something like > > if (x->props.family != family || > !xfrm_state_addr_check(x, daddr, saddr, family) || > tmpl->id.proto == x->id.proto) > continue; > if (tmpl->id.spi) { > if (tmpl->id.spi != x->id.spi) > continue; > error = -EEXIST; > } > if (x->props.reqid == tmpl->reqid && > tmpl->mode == x->props.mode) { > } You're right, sorry for getting back to you so late. But since it's already in now and not very important, I'm going to leave it until I have a better reason to touch that code, if you're ok with that.
Regards Patrick From Robert.Olsson@data.slu.se Sun Apr 3 12:18:35 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 12:18:39 -0700 (PDT) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j33JIXn4008610 for ; Sun, 3 Apr 2005 12:18:34 -0700 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j33JIUWk016276 for ; Sun, 3 Apr 2005 21:18:31 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id C5C56EE2B1; Sun, 3 Apr 2005 21:18:30 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16976.16774.728707.368646@robur.slu.se> Date: Sun, 3 Apr 2005 21:18:30 +0200 To: Harald Welte Cc: netdev@oss.sgi.com Subject: pktgen problem (skb refcount) in 2.6.12-rc1 In-Reply-To: <20050402191132.GF1890@sunbeam.de.gnumonks.org> References: <20050402191132.GF1890@sunbeam.de.gnumonks.org> X-Mailer: VM 7.18 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1302 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Harald Welte writes: > I've tried to get pktgen running on 2.6.12-rc1 (dual-opteron system, two > dual e1000 boards). > I've tried to track the problem down, and I've confirmed that skb->users > never goes down to 1 but instead stays at '2'. > The same system with the same pktgen script works fine with 2.6.11.6. > > I'm reporting this since it sounds like we have an skb > usage count leak somewhere :( Hello! Sounds like a diff could give some clues. pktgen, e1000 and the TX path should be interesting, as well as any changes in the kernel config.
--ro From Robert.Olsson@data.slu.se Sun Apr 3 12:37:40 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 12:37:45 -0700 (PDT) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j33JbdJf013946 for ; Sun, 3 Apr 2005 12:37:39 -0700 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j33JaqWB018224; Sun, 3 Apr 2005 21:36:53 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id D38DEEE2B2; Sun, 3 Apr 2005 21:36:52 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16976.17876.832677.945878@robur.slu.se> Date: Sun, 3 Apr 2005 21:36:52 +0200 To: Herbert Xu Cc: Robert Olsson , Eric Dumazet , davem@davemloft.net, netdev@oss.sgi.com Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() In-Reply-To: <20050402193224.GA25157@gondor.apana.org.au> References: <424E641A.1020609@cosmosbay.com> <16974.41648.568927.54429@robur.slu.se> <20050402193224.GA25157@gondor.apana.org.au> X-Mailer: VM 7.18 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1303 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Herbert Xu writes: > Incidentally we should change the way the rehashing is triggered. > Instead of doing it regularly, we can do it when we notice that a > specific hash chain grows beyond a certain size. > The idea is that if someone is attacking our hash then they can > only do so by lengthening the chains. If they're not doing that > then even if they knew how to attack us we don't really care. Well, I don't see how we detect the need for a rehash just by looking at the hash chains.
How does the the "lengthening" look like that are allowed to trigger a rehash? Agree with Dave that we can increase the interval to start with. --ro From tgraf@suug.ch Sun Apr 3 12:47:23 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 12:47:34 -0700 (PDT) Received: from b.mx.projectdream.org (eth0-0.arisu.projectdream.org [194.158.4.191]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j33JlM94014678 for ; Sun, 3 Apr 2005 12:47:22 -0700 Received: from postel.suug.ch (postel.suug.ch [195.134.158.23]) (using TLSv1 with cipher EDH-RSA-DES-CBC3-SHA (168/168 bits)) (No client certificate requested) by b.mx.projectdream.org (Postfix) with ESMTP id 55CD682; Sun, 3 Apr 2005 21:46:57 +0200 (CEST) Received: by postel.suug.ch (Postfix, from userid 10001) id 781091C0EA; Sun, 3 Apr 2005 21:47:39 +0200 (CEST) Date: Sun, 3 Apr 2005 21:47:39 +0200 From: Thomas Graf To: Abhishek Gupta Cc: netdev@oss.sgi.com Subject: Re: Problem using HTB Message-ID: <20050403194739.GR3086@postel.suug.ch> References: <20050402213642.GO3086@postel.suug.ch> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1304 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: tgraf@suug.ch Precedence: bulk X-list: netdev * Abhishek Gupta 2005-04-03 20:30 > But the problem is still > not yet solved as I tried with 1Mbit speed as the setting for link speed > in the htb configuration and got about 30KBps which amounts to about > 240Kbitps even though my UDP source is sending at speed of about > 1MBps(8Mbps), according to RH monitor readings. I do not know about that "RH monitor" you are referring to, maybe it does not display rates correctly. 
(I found 3 out of 5 rate estimators outputing with a variance of over 10%) I can recommend you bmon [0] which states the variance and can be used to a resolution up to 1/100s given the input source provides an equal or better resolution. > Is it possible that the problem is due to the source that I am using for > UDP packets? Very likely, especially due to the huge difference in requested and achieved rate you have mentioned above. I hardly think this is a problem related to HTB but rather some misconfiguration in your testing process. [0] http://people.suug.ch/~tgr/bmon/ From Robert.Olsson@data.slu.se Sun Apr 3 12:57:47 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 12:57:51 -0700 (PDT) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j33Jvkb1015443 for ; Sun, 3 Apr 2005 12:57:46 -0700 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j33Jv8ZD020427; Sun, 3 Apr 2005 21:57:08 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id 9116BEE2B1; Sun, 3 Apr 2005 21:57:08 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16976.19092.562006.246545@robur.slu.se> Date: Sun, 3 Apr 2005 21:57:08 +0200 To: Herbert Xu Cc: "David S. 
Miller" , Robert.Olsson@data.slu.se, dada1@cosmosbay.com, netdev@oss.sgi.com Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() In-Reply-To: <20050403074337.GA8083@gondor.apana.org.au> References: <424E641A.1020609@cosmosbay.com> <16974.41648.568927.54429@robur.slu.se> <20050402193224.GA25157@gondor.apana.org.au> <20050402115528.11f71a3c.davem@davemloft.net> <20050403074337.GA8083@gondor.apana.org.au> X-Mailer: VM 7.18 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1305 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Herbert Xu writes: > We could also move rt_cache_flush into a kernel thread. When the > number of chains is large this function is really expensive for a > softirq handler. It can also be done via /proc and left to administrators to find suitable policy. Kernel just provides the mechanism to rehash. 
--ro From herbert@gondor.apana.org.au Sun Apr 3 14:45:03 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 14:45:10 -0700 (PDT) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j33Lj1Xs018984 for ; Sun, 3 Apr 2005 14:45:02 -0700 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DICtO-00073M-00; Mon, 04 Apr 2005 07:44:26 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DICsw-00049E-00; Mon, 04 Apr 2005 07:43:58 +1000 Date: Mon, 4 Apr 2005 07:43:58 +1000 To: Robert Olsson Cc: Eric Dumazet , davem@davemloft.net, netdev@oss.sgi.com Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() Message-ID: <20050403214358.GA15901@gondor.apana.org.au> References: <424E641A.1020609@cosmosbay.com> <16974.41648.568927.54429@robur.slu.se> <20050402193224.GA25157@gondor.apana.org.au> <16976.17876.832677.945878@robur.slu.se> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <16976.17876.832677.945878@robur.slu.se> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1306 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev On Sun, Apr 03, 2005 at 09:36:52PM +0200, Robert Olsson wrote: > > Well I don't see how we detect the need for rehash just be looking > at the hash chains. How does the the "lengthening" look like that > are allowed to trigger a rehash? The only way to attack a hash is by exploiting collisions and create one or more excessively long chains. This can be detected as follows at each rt hash insertion. 
If (total number of entries in cache >> (hash length - user defined length)) < current bucket length is true, then we schedule a rehash/flush. Hash length is the number of bits in the hash, i.e., 1 << hash length == number of buckets I'd suggest a default shift length of 3. That is, if any individual chain is growing beyond 8 times the average chain length then we've got a problem. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From herbert@gondor.apana.org.au Sun Apr 3 14:45:47 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 14:45:51 -0700 (PDT) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j33Ljjdx019058 for ; Sun, 3 Apr 2005 14:45:46 -0700 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DICuN-00073c-00; Mon, 04 Apr 2005 07:45:27 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DICuH-00049k-00; Mon, 04 Apr 2005 07:45:21 +1000 Date: Mon, 4 Apr 2005 07:45:21 +1000 To: Robert Olsson Cc: "David S. 
Miller" , dada1@cosmosbay.com, netdev@oss.sgi.com Subject: Re: [BUG] overflow in net/ipv4/route.c rt_check_expire() Message-ID: <20050403214521.GB15901@gondor.apana.org.au> References: <424E641A.1020609@cosmosbay.com> <16974.41648.568927.54429@robur.slu.se> <20050402193224.GA25157@gondor.apana.org.au> <20050402115528.11f71a3c.davem@davemloft.net> <20050403074337.GA8083@gondor.apana.org.au> <16976.19092.562006.246545@robur.slu.se> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <16976.19092.562006.246545@robur.slu.se> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1307 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev On Sun, Apr 03, 2005 at 09:57:08PM +0200, Robert Olsson wrote: > > Herbert Xu writes: > > > We could also move rt_cache_flush into a kernel thread. When the > > number of chains is large this function is really expensive for a > > softirq handler. > > It can also be done via /proc and left to administrators to find > suitable policy. Kernel just provides the mechanism to rehash. The reason I'm suggesting the move to a kernel thread is because softirq context is not preemptible. So doing a large amount of work in it when your table is big means that a UP machine will freeze for a while. 
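For reference, a minimal sketch of the chain-length test described earlier in this thread (schedule a rehash when (total entries >> (hash length - shift)) < current bucket length). The names chain_too_long, hash_bits and shift are hypothetical, and this is not the code in net/ipv4/route.c; it only spells out the arithmetic from the mail, with a clamp added as an assumption so the shift amount can never go negative.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not from net/ipv4/route.c): returns true when one
 * hash chain has grown past roughly (1 << shift) times the average chain
 * length.
 *
 *   total_entries : number of entries in the whole cache
 *   hash_bits     : log2 of the bucket count (1 << hash_bits buckets)
 *   shift         : tunable; 3 means "8 times the average"
 *   bucket_len    : length of the chain that was just inserted into
 */
static bool chain_too_long(unsigned long total_entries,
                           unsigned int hash_bits,
                           unsigned int shift,
                           unsigned long bucket_len)
{
    /* Assumption: clamp so that shift > hash_bits cannot underflow. */
    unsigned int s = hash_bits > shift ? hash_bits - shift : 0;

    /*
     * total >> (hash_bits - shift) is the average chain length
     * multiplied by (1 << shift), up to integer truncation.
     */
    return (total_entries >> s) < bucket_len;
}
```

With 1024 buckets (hash_bits = 10), 8192 entries and shift = 3, the threshold is 8192 >> 7 = 64, so a chain of 65 trips the check and a chain of 64 does not. With a nearly empty cache (say 100 entries) the left-hand side truncates to 0 and any non-empty chain trips it, so a real implementation would presumably want a minimum threshold; that is an observation about the arithmetic, not something stated in the thread.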
Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From a.kasparas@gmc.lt Sun Apr 3 15:02:13 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 15:02:18 -0700 (PDT) Received: from smtp02.omnitel.sun (smtp02-neptunas.omnitel.net [194.176.45.2]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j33M29ro020932 for ; Sun, 3 Apr 2005 15:02:12 -0700 Received: from smtp04-neptunas.omnitel.net ([194.176.45.42]) by smtp02.omnitel.sun (Sun Java System Messaging Server 6.1 HotFix 0.01 (built Jun 24 2004)) with ESMTP id <0IEE006C357G3Y00@smtp02.omnitel.sun> for netdev@oss.sgi.com; Mon, 04 Apr 2005 01:02:04 +0300 (EEST) Received: from smtp04-neptunas.omnitel.net (localhost [127.0.0.1]) by smtp04-neptunas.omnitel.net (Postfix) with SMTP id C5B95398007; Mon, 04 Apr 2005 01:02:03 +0300 (EEST) Received: from [192.168.0.128] (unknown [62.212.195.62]) by smtp04-neptunas.omnitel.net (Postfix) with ESMTP id 5144A39800D; Mon, 04 Apr 2005 01:02:03 +0300 (EEST) Date: Mon, 04 Apr 2005 01:02:01 +0300 From: Aidas Kasparas Subject: Re: IPSEC: on behavior of acquire In-reply-to: <1112538566.1096.391.camel@jzny.localdomain> To: hadi@cyberus.ca Cc: ipsec-tools-devel@lists.sourceforge.net, netdev , nakam@linux-ipv6.org Message-id: <425067D9.9050603@gmc.lt> MIME-version: 1.0 Content-type: text/plain; charset=UTF-8; format=flowed Content-transfer-encoding: 7BIT X-Accept-Language: lt, en, ru, fr X-Enigmail-Version: 0.90.0.0 X-Enigmail-Supports: pgp-inline, pgp-mime References: <1112405303.1096.37.camel@jzny.localdomain> <424E454D.4090402@gmc.lt> <1112477326.1088.321.camel@jzny.localdomain> <424FA946.70809@gmc.lt> <1112538566.1096.391.camel@jzny.localdomain> User-Agent: Debian Thunderbird 1.0 (X11/20050116) X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1308 X-ecartis-version: Ecartis 
v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: a.kasparas@gmc.lt Precedence: bulk X-list: netdev jamal wrote: > On Sun, 2005-04-03 at 04:28, Aidas Kasparas wrote: > >>jamal wrote: > > >>>Exactly what i was trying to emulate - lost messages. >> >>Your emulation was not correct. More correct would have been to start KE >>daemon, let it fully initialize (open pfkey socket, inform kernel that >>it is interested in acquire messages), then stop it (via debugger or >>kill -STOP) and only then send pings or other traffic and see what will >>happen. This is because there are different paths in xfrm+pfkey for >>cases 1) when there is no KE daemon and 2) when daemon is, but for some >>reason it does not establish a SA and therefore reaction to traffic is >>different. >> > > > I dont think that would work. > To summarize what happens in the kernel: everything leads to km_query() > as you have indicated in your text. > If the kernel finds someone/thing has either a pfkey or netlink socket > open it sends a acquire to them. In the code you are probably looking at > (before i created the patch) - the first user/daemon the kernel sees > (either pfkey or netlink based) that has a socket open > will receive an acquire and the kernel will give up after that. > > As an example, if the first pfkey user was just doing "setkey -x" and > the second was infact pluto, then pluto will never see the > acquire. This is what got me looking at it to begin with. Look at the > earlier postings on the subject. While I agree that code before your patch would not allow to cooperate tools using different ways to manage SAD/SPD (pfkey vs netlink), I have one setup in production where two instances of racoon runs simultaneously and both gets required pfkey-messages. > So in other words, just killing the ike server as you propose would mean > the kernel has no open sockets and will therefore never bother to send > an acquire. 
I proposed to stop the KE server, not to kill it. > > Still all this is moot and is distracting us from the main discussion. > Lets define "lost" simply as the case where an acquire never got to the > server (which may be sitting elsewhere on the network). ACQUIREs _never_ _leave_ _the box_ on which they are generated. It is always kernel-to-userspace-process communication. It could be made reliable, and the present situation IS sufficiently reliable. In that case > what i did is sufficient. i.e. The methods to create this are not the > issue. The issue at stake is the behavior of the kernel in generating > the acquires. > See below. > > Please refer to my earlier definition of what "lost" means. It doesnt > matter where the breakage happens really. > Think of everything to the right of "xfrm" in your diagram as a black > box (i.e that second thing could be pfkey or netlink - thats not the > issue). > Think of some message that is supposed to reach the KE daemon > (make it interesting and say it is remote KE) then think of that message > never making it because something in the blackbox swallowed it. > If that packet is the first one and it needs to do so for the sake of > setup for subsequent packets - then the desire to have it reach its > destination is very imprtant. There is no progress for it or subsequent > packets if it doesnt make it. OK, let's talk about the architecture xfrm <-> blackbox. In this architecture, communication between these two elements (I am not speaking about any comms inside the blackbox) can be of two types: 1) reliable (messages always reach the blackbox or an error is reported); 2) unreliable (messages may fail even to reach the blackbox). With good blackboxes, a good IPsec system can be built using either comm type. But: a) (1) will be more reliable; b) (1) will be simpler (at least on the xfrm side, as it will not require retransmissions); c) (1) is implemented now (as a function call). What I want to say is that the xfrm-to-blackbox interface is good as it is.
The problem may only be in how good the blackbox is. And here we have to look inside the blackbox and start talking about particular implementations of it. Retransmissions, if they are needed, need to be inside that blackbox. > > The solution being proposed for Linux to treat that xfrm piece in the > same fashion as ARP is correct. Read the email from Alexey. Imagine if > ARP was only issued once(as does pfkey) or forever(as does netlink). > I have read the email from Alexey. I think that the xfrm_lookup() function implements functionality very similar to what Alexey described. And I think that a direct comparison of ARP messages and pfkey messages is not fair, because pfkey acquire messages go over a reliable channel and are used only to _initiate_ the process of SA negotiation. ARP has to receive information from other boxes, which send it only as a direct response to some packet. Moreover, ARP is designed to be used [amongst others] on networks which lose some traffic by design. > I believe this is an issue with ipsec architecture itself - someone > needs to write an IETF draft on it. > I still do not see the topic for such a draft. > >>> >>>Note: Sometimes theres no app. Example a packet coming into a gateway. >>> >> >>What do you have in mind? >> >>If it is ISAKMP negotiation from remote peer, then it comes over UDP/500 >>or UDP/4500 over IP socket and not via acquire message via pfkey socket. >> >>If it is ESP/AH packet with unknown SPI, then kernel simply drops it and >>do not send any acquire messages. >> > > > I was thinking more of this second scenario with incoming from clear > text domain and gateway encrypting assuming proper policy setup. If you're talking about a network behind a security gateway communicating with a host or network for which a security policy is configured on the gateway, then the acquire message will be generated on that security gateway, when that packet is considered for forwarding.
Again, that acquire messages never will leave security gateway. > I would have to go and reread the "opportunistic" encryption draft > closely to make sense. > Speaking of "opportunistic" encryption. I never understood it. Ipsec-tools do not implement it. And in the year or so when I'm involved with it, I don't remember anybody even asking or mentioning about this feature. Therefore, I don't care about it -- users do not need it. -- Aidas Kasparas IT administrator GM Consult Group, UAB From dmitry_yus@yahoo.com Sun Apr 3 17:56:26 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 17:56:32 -0700 (PDT) Received: from smtp111.mail.sc5.yahoo.com (smtp111.mail.sc5.yahoo.com [66.163.170.9]) by oss.sgi.com (8.13.0/8.13.0) with SMTP id j340uQ4U030035 for ; Sun, 3 Apr 2005 17:56:26 -0700 Received: from unknown (HELO ?172.10.7.7?) (dmitry?yus@24.7.114.77 with plain) by smtp111.mail.sc5.yahoo.com with SMTP; 4 Apr 2005 00:56:25 -0000 Subject: Re: Linux support for RDMA (was: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit Proposed Topics) From: Dmitry Yusupov To: "open-iscsi@googlegroups.com" Cc: "David S. 
Miller" , mpm@selenic.com, andrea@suse.de, michaelc@cs.wisc.edu, James.Bottomley@HansenPartnership.com, ksummit-2005-discuss@thunk.org, netdev@oss.sgi.com In-Reply-To: <67D69596DDF0C2448DB0F0547D0F947E01781F2E@yogi.asicdesigners.com> References: <67D69596DDF0C2448DB0F0547D0F947E01781F2E@yogi.asicdesigners.com> Content-Type: text/plain Date: Sun, 03 Apr 2005 17:56:11 -0700 Message-Id: <1112576171.4227.5.camel@mylaptop> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-2) Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1309 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: dmitry_yus@yahoo.com Precedence: bulk X-list: netdev On Sat, 2005-04-02 at 11:07 -0800, Asgeir Eiriksson wrote: > Dmitry > The CPU cycles is only at most half of the story with the other half > being the memory sub-system BW. > > So the validity of your observation depends on the BW we're talking > about, i.e. if the client is using a fraction of 10Gbps for RDMA (or > DDP, e.g. iSCSI DDP), yes then that fraction amounts to a fraction of > the memory sub-system total BW so we don't much care about the extra > copy. > > The situation is different if the client wants something close to 10Gbps > (already have such client applications), because today 10Gbps is still a > big chunk of the overall memory BW so you really care about eliminating > that copy via DDP. I do not get your concern with memory BW. With good AMD box V40Z(SUN) you can get 5.3GBytes/sec. Even with 10Gbps full speed you have 80% left. PCI-X BUS BW is bigger concern... > 'Asgeir > > > -----Original Message----- > > From: netdev-bounce@oss.sgi.com [mailto:netdev-bounce@oss.sgi.com] On > > Behalf Of Dmitry Yusupov > > Sent: Saturday, April 02, 2005 10:09 AM > > To: open-iscsi@googlegroups.com > > Cc: David S. 
Miller; mpm@selenic.com; andrea@suse.de; > > michaelc@cs.wisc.edu; James.Bottomley@HansenPartnership.com; > ksummit-2005- > > discuss@thunk.org; netdev@oss.sgi.com > > Subject: Re: [Ksummit-2005-discuss] Summary of 2005 Kernel Summit > > ProposedTopics > > > > On Mon, 2005-03-28 at 17:32 -0500, Benjamin LaHaise wrote: > > > On Mon, Mar 28, 2005 at 12:48:56PM -0800, Dmitry Yusupov wrote: > > > > If you have plans to start new project such as SoftRDMA than yes. > lets > > > > discuss it since set of problems will be similar to what we've got > > with > > > > software iSCSI Initiators. > > > > > > I'm somewhat interested in seeing a SoftRDMA project get off the > ground. > > > At least the NatSemi 83820 gige MAC is able to provide early-rx > > interrupts > > > that allow one to get an rx interrupt before the full payload has > > arrived > > > making it possible to write out a new rx descriptor to place the > payload > > > wherever it is ultimately desired. It would be fun to work on if > not > > the > > > most performant RDMA implementation. > > > > I see a lot of skepticism around early-rx interrupt schema. It might > > work for gige, but i'm not sure if it will fit into 10g. > > > > What RDMA gives us is zero-copy on receive and new networking api > which > > has a potential to be HW accelerated. SoftRDMA will never avoid > copying > > on receive. But benefit for SoftRDMA would be its availability on > client > > sides. It is free and it could be easily deployed. Soon Intel & Co > will > > give us 2,4,8... multi-core CPUs for around 200$ :), So, who cares if > > one of those cores will do receive side copying? 
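The memory-bandwidth figures in this message (5.3 GBytes/sec of memory bandwidth against a 10Gbps link) come down to simple arithmetic, sketched here in C. The helper name and the passes parameter are assumptions for illustration only: passes is how many times each received byte crosses the memory bus, taken as 1 for direct placement, and plausibly 3 for a receive with one extra copy (NIC write, copy read, copy write).

```c
/*
 * Hypothetical helper: fraction of memory bandwidth left over after
 * receiving at link_gbit_per_s, if every received byte crosses the
 * memory bus `passes` times. mem_gbyte_per_s is the machine's total
 * memory bandwidth in GBytes/sec.
 */
static double mem_bw_left(double mem_gbyte_per_s,
                          double link_gbit_per_s,
                          int passes)
{
    double link_gbyte_per_s = link_gbit_per_s / 8.0; /* bits -> bytes */

    return 1.0 - (passes * link_gbyte_per_s) / mem_gbyte_per_s;
}
```

With one pass, mem_bw_left(5.3, 10.0, 1) is about 0.76, close to the "80% left" figure. Under the three-pass assumption it drops to about 0.29, which is one way to read the concern about eliminating the copy; the pass count is a guess, not a measurement from either poster.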
> > > > From herbert@gondor.apana.org.au Sun Apr 3 18:00:15 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 18:00:27 -0700 (PDT) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j3410Dar030521 for ; Sun, 3 Apr 2005 18:00:13 -0700 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DIFw6-0007tf-00; Mon, 04 Apr 2005 10:59:26 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DIFun-0004NB-00; Mon, 04 Apr 2005 10:58:05 +1000 Date: Mon, 4 Apr 2005 10:58:05 +1000 To: jamal Cc: Patrick McHardy , Masahide NAKAMURA , "David S. Miller" , netdev Subject: Re: take 2 WAS(Re: PATCH: IPSEC xfrm events Message-ID: <20050404005805.GA16543@gondor.apana.org.au> References: <1112353398.1096.116.camel@jzny.localdomain> <20050401114258.GA2932@gondor.apana.org.au> <1112358278.1096.160.camel@jzny.localdomain> <20050401123554.GA3468@gondor.apana.org.au> <1112403845.1088.14.camel@jzny.localdomain> <20050402012813.GA24575@gondor.apana.org.au> <1112406164.1088.54.camel@jzny.localdomain> <20050402014619.GB24861@gondor.apana.org.au> <1112469601.1088.173.camel@jzny.localdomain> <1112538718.1096.394.camel@jzny.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1112538718.1096.394.camel@jzny.localdomain> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1310 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev Hi Jamal: On Sun, Apr 03, 2005 at 10:31:58AM -0400, jamal wrote: > > Small change after some testing. 
> Herbert havent heard back from you - this looks very palatable in my > opinion with comments below still in effect. It's definitely looking better all the time. > -void xfrm_state_delete(struct xfrm_state *x) > +static DEFINE_RWLOCK(xfrm_km_lock); > +static struct list_head xfrm_km_list = LIST_HEAD_INIT(xfrm_km_list); > + > +void km_policy_notify(struct xfrm_policy *xp, int dir, struct km_event *c) > { > + struct xfrm_mgr *km; > + > + read_lock(&xfrm_km_lock); > + list_for_each_entry(km, &xfrm_km_list, list) > + if (km->notify_policy) > + km->notify_policy(xp, dir, c); > + read_unlock(&xfrm_km_lock); > +} > + > +void km_state_notify(struct xfrm_state *x, struct km_event *c) > +{ > + struct xfrm_mgr *km; > + read_lock(&xfrm_km_lock); > + list_for_each_entry(km, &xfrm_km_list, list) > + km->notify(x, c); > + read_unlock(&xfrm_km_lock); > +} > + > +EXPORT_SYMBOL(km_policy_notify); > +EXPORT_SYMBOL(km_state_notify); Can we perhaps move these lines next to the other km functions further down? They look rather lonely here. > + /* XXX: Do we wanna do this right at the top?? > + * if the state is dead we dont want to announce > + * the expire - a delete may already have announced > + * it > + */ Please code this check differently so that it isn't racy. One way to do it is to change xfrm_timer_handler to do: if (__xfrm_state_delete(x) && x->id.spi) km_state_expired(x, 1); > + /* XXX: Do we still wanna wakeup km_waitq? > + * if the policy is dead we dont want to announce > + * the expire - a delete may already have announced > + * it > + */ Ditto. > --- a/net/xfrm/xfrm_policy.c 2005-03-25 22:28:21.000000000 -0500 > +++ b/net/xfrm/xfrm_policy.c 2005-04-02 12:16:30.000000000 -0500 > @@ -298,7 +298,7 @@ > * entry dead. The rule must be unlinked from lists to the moment. > */ > > -static void xfrm_policy_kill(struct xfrm_policy *policy) > +static void xfrm_policy_kill(struct xfrm_policy *policy, int dir) What's this for? 
> + c.seq = nlh->nlmsg_seq; > + c.pid = nlh->nlmsg_pid; > + if (nlh->nlmsg_type == XFRM_MSG_NEWSA) > + c.event = XFRM_SAP_ADDED; > + else > + c.event = XFRM_SAP_UPDATED; > + > + km_state_notify(x, &c); You need to hold onto x here. So do a hold before you call xfrm_state_* and then drop the reference after km_state_notify. > static int xfrm_del_sa(struct sk_buff *skb, struct nlmsghdr *nlh, void **xfrma) > - xfrm_state_delete(x); > + err = xfrm_state_delete(x); > + if (err < 0) { > + x->km.state = XFRM_STATE_DEAD; > + xfrm_state_put(x); > + return err; If the xfrm_state_delete fails then it's already dead. So kill the line that modifies its state. > +static int xfrm_notify_sa( struct xfrm_state *x, struct km_event *c) Extra space after the paren. > + int len = NLMSG_LENGTH(sizeof(struct xfrm_usersa_info)); Please add the additional payloads for NAT-T and the keys. > +static int xfrm_notify_policy( struct xfrm_policy *xp, int dir, struct km_event *c) > +{ > + struct xfrm_userpolicy_info *p; > + struct nlmsghdr *nlh; > + struct sk_buff *skb; > + u32 nlt = 0 ; > + unsigned char *b; > + int len = NLMSG_LENGTH(sizeof(struct xfrm_userpolicy_info)); Please attach the templates. > @@ -1256,7 +1328,7 @@ > > if (hdr->sadb_msg_type == SADB_ADD) > err = xfrm_state_add(x); > - else > + else A better editor that doesn't leave trailing spaces is needed here :) > - xfrm_state_delete(x); > - xfrm_state_put(x); > + err = xfrm_state_delete(x); > + if (err < 0) { > + x->km.state = XFRM_STATE_DEAD; Please remove this line as it's already dead if the delete fails. > +static int key_notify_sa_flush(struct km_event *c) > +{ > + struct sk_buff *skb; > + struct sadb_msg *hdr; > + > + skb = alloc_skb(sizeof(struct sadb_msg) + 16, GFP_ATOMIC); > + if (!skb) > + return -ENOBUFS; > + hdr = (struct sadb_msg *) skb_put(skb, sizeof(struct sadb_msg)); > + // XXX:do we have to pass proto as well? I think so. A flush of all IPCOMP states is certainly quite different from a flush of all states. 
It's just a matter of calling satype2proto. > + /* > + * XXX: previous get was doing a broadcast-all _always_ > + * which didnt seem right for non-deletion case - JHS > + * This is like the way netlink behaves .. > + * Shall i restore original behavior? > + */ You're right. The original behaviour was broken. > - pfkey_xfrm_policy2msg(out_skb, xp, pol->sadb_x_policy_dir-1); > - > - out_hdr = (struct sadb_msg *) out_skb->data; > - out_hdr->sadb_msg_version = hdr->sadb_msg_version; > - out_hdr->sadb_msg_type = hdr->sadb_msg_type; > - out_hdr->sadb_msg_satype = 0; > - out_hdr->sadb_msg_errno = 0; > - out_hdr->sadb_msg_seq = hdr->sadb_msg_seq; > - out_hdr->sadb_msg_pid = hdr->sadb_msg_pid; > - pfkey_broadcast(out_skb, GFP_ATOMIC, BROADCAST_ALL, sk); > - err = 0; However, you do need to keep this code for the real GET case. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From herbert@gondor.apana.org.au Sun Apr 3 18:01:37 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Apr 2005 18:01:41 -0700 (PDT) Received: from arnor.apana.org.au (mail@arnor.apana.org.au [203.14.152.115]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id j3411ZoF030968 for ; Sun, 3 Apr 2005 18:01:36 -0700 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.35 #1 (Debian)) id 1DIFxw-0007uK-00; Mon, 04 Apr 2005 11:01:20 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1DIFxq-0004Nw-00; Mon, 04 Apr 2005 11:01:14 +1000 Date: Mon, 4 Apr 2005 11:01:14 +1000 To: jamal Cc: Patrick McHardy , Masahide NAKAMURA , "David S. 
Miller" , netdev Subject: Re: take 2 WAS(Re: PATCH: IPSEC xfrm events Message-ID: <20050404010114.GA16839@gondor.apana.org.au> References: <20050401042106.GA27762@gondor.apana.org.au> <1112353398.1096.116.camel@jzny.localdomain> <20050401114258.GA2932@gondor.apana.org.au> <1112358278.1096.160.camel@jzny.localdomain> <20050401123554.GA3468@gondor.apana.org.au> <1112403845.1088.14.camel@jzny.localdomain> <20050402012813.GA24575@gondor.apana.org.au> <1112406164.1088.54.camel@jzny.localdomain> <20050402014619.GB24861@gondor.apana.org.au> <1112469601.1088.173.camel@jzny.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1112469601.1088.173.camel@jzny.localdomain> User-Agent: Mutt/1.5.6+20040907i From: Herbert Xu X-Virus-Scanned: ClamAV 0.83/802/Sat Apr 2 06:49:46 2005 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 1311 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev On Sat, Apr 02, 2005 at 02:20:01PM -0500, jamal wrote: > > 1) Weve discussed this before Herbert and i think you misspoke that > pfkey delivers to all listerners. > > pfkey Add/del/upd now really do tell all processes about what happened. > Before pfkey would skip the originating process. So far this doesnt seem > to be an issue in the basic testing. Are you sure? Previously they did BROADCAST_ALL which goes to everyone including the sender. > 2) I ended adding a policy_notify to the pfkey manager to make the code > generic. Interesting thing is i dont think pfkey knows what to do with > policy expiration or i am misreading the code. That's right, pfkey never had policy expire messages. In general, anything to do with policies cannot be done portably in pfkey since the RFC only specified the SA operations. 
Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From herbert@gondor.apa