Subject: non-growing tcp window size
From: Håkon Bugge <Hakon.Bugge@xxxxxxxxx>
Date: Tue, 21 Jan 2003 17:48:32 +0100
To whom it may concern,

We have observed TCP bandwidth problems when running benchmarks that gradually increase their payload size and have (most of) their data flowing unidirectionally. The degradation is observed between two dual Xeon/E7500 machines connected by GbE over a cross-over cable. The GbE interface is eth1, with an MTU of 9000; eth0 is a Fast Ethernet (FE) interface with an MTU of 1500. The machines are running Linux 2.4.20, and the NIC in question is an Intel Corp. 82544GC Gigabit Ethernet Controller (rev 02).
While changing TCP parameter settings, and also systematically varying the socket rx/tx buffer sizes, I discovered that the benchmark occasionally ran well. These OK runs were not deterministic and were unrelated to the parameter changes. I therefore let the machines run over the weekend, capturing tcpdumps, and kept the dump of a successful run for comparison.

Comparing the "OK" run with the "BAD" run reveals a couple of interesting things. The first problem is that the receiver's advertised window does not increase. The second problem, which also affects the "OK" run, concerns the ratio between the number of packets received by the sender (#advertisements) and the number of packets sent from src to dst. This ratio is 1 in the "BAD" scenario, which is probably a consequence of the window being only about 2×MTU. Surprisingly, it is as large as 0.5 in the "OK" scenario. IMHO, this violates RFC 813, which states:
> There are two reasons for prompt acknowledgement. One is to prevent retransmission. We will discuss later how to determine whether unnecessary retransmission is occurring. The other reason one acknowledges promptly is to permit further data to be sent. However, the previous section makes quite clear that it is not always desirable to send a little bit of data, even though the receiver may have room for it. Therefore, one can state a general rule that under normal operation, the receiver of data need not, and for efficiency reasons should not, acknowledge the data unless either the acknowledgement is intended to produce an increased useable window, is necessary in order to prevent retransmission or is being sent as part of a reverse direction segment being sent for some other reason. We will consider an algorithm to achieve these goals.

The two tcpdumps have been analyzed with simple awk scripts, which report the following for every 1/10 of a second of the run:
- time: elapsed time relative to the first packet
- MB/s: sum of TCP payload (based on TCP sequence numbers) * 1e-6 / delta time
- avg_len: average packet length (TCP payload) sent
- nsent: number of packets sent from src to dst
- avg_win: average window size advertised by the src
- adv/sent: ratio between #advertisements (sum of prompt acks and piggybacked acks) and #packets sent
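The awk scripts themselves are not included here. As a rough illustration of how the metrics listed above can be derived, the Python sketch below computes them per 0.1 s interval. It is not the original analysis: it assumes the tcpdump output has already been parsed into a list of hypothetical `Segment` records sorted by timestamp, that `window` is already in bytes (any window scaling applied), that sequence numbers do not wrap, and that every segment flowing back from dst to src counts as one advertisement, with avg_win averaged over those segments.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    ts: float      # capture timestamp in seconds
    forward: bool  # True for data segments src -> dst, False for dst -> src (acks)
    seq_end: int   # sequence number just past this segment's payload (forward only)
    payload: int   # TCP payload length in bytes
    window: int    # advertised window in bytes (assumed already scaled)


def per_interval_stats(segments, interval=0.1):
    """Compute time, MB/s, avg_len, nsent, avg_win and adv/sent per interval.

    Hypothetical helper. Assumes `segments` is sorted by timestamp and that
    sequence numbers do not wrap around during the capture.
    """
    data_segs = [s for s in segments if s.forward]
    if not data_segs:
        return []
    t0 = segments[0].ts
    bins = defaultdict(list)
    for s in segments:
        bins[int((s.ts - t0) // interval)].append(s)

    # Highest sequence number sent so far; start at the first payload byte.
    highest = data_segs[0].seq_end - data_segs[0].payload
    rows = []
    for b in sorted(bins):  # intervals without any segments are skipped
        data = [s for s in bins[b] if s.forward]
        advs = [s for s in bins[b] if not s.forward]
        # Count payload via sequence-number progress, so retransmissions
        # do not inflate the bandwidth figure.
        top = max([highest] + [s.seq_end for s in data])
        new_bytes = top - highest
        highest = top
        rows.append({
            "time": round(b * interval, 6),
            "MB/s": new_bytes * 1e-6 / interval,
            "avg_len": sum(s.payload for s in data) / len(data) if data else 0.0,
            "nsent": len(data),
            "avg_win": sum(s.window for s in advs) / len(advs) if advs else 0.0,
            "adv/sent": len(advs) / len(data) if data else 0.0,
        })
    return rows
```

With one returning segment per 9000-byte data segment, this sketch reports adv/sent = 1.0, the value described above for the "BAD" run; an ack for every second data segment would give 0.5.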
Extracts of the analysis are enclosed as "ok.txt" and "bad.txt". The same information is also presented graphically in "bw_winsiz.png" and "bw_adv-sent.png". In the graphs, the "OK" data uses the bottom x-axis and the "BAD" data uses the top x-axis (the "BAD" run takes longer because of its lower bandwidth). Bandwidth uses the left y-axis, whereas the average advertised window size and the ratio of #advertisements to #packets sent use the right y-axis.
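To make the RFC 813 rule quoted earlier concrete, here is a minimal Python sketch of a receiver-side acknowledgement decision. The class, field, and function names are hypothetical, and the sketch only restates the three conditions in the quoted passage; it leaves out any further refinements the RFC describes and is not a model of what the Linux 2.4.20 stack actually does.

```python
from dataclasses import dataclass


@dataclass
class ReceiverState:
    rcv_nxt: int          # next sequence number expected from the sender
    buffer_size: int      # total receive buffer in bytes
    unread: int           # bytes received but not yet read by the application
    last_right_edge: int  # rcv_nxt + window as of the last advertisement

    def right_edge(self) -> int:
        """Right edge of the window the receiver could offer right now."""
        return self.rcv_nxt + (self.buffer_size - self.unread)


def should_ack(state: ReceiverState,
               needed_to_prevent_retransmission: bool,
               reverse_segment_pending: bool) -> bool:
    """Acknowledge only if the ack would enlarge the usable window, is needed
    to prevent retransmission, or can ride on a reverse-direction segment."""
    increases_usable_window = state.right_edge() > state.last_right_edge
    return (increases_usable_window
            or needed_to_prevent_retransmission
            or reverse_segment_pending)
```

Under this rule, arriving data advances `rcv_nxt` and `unread` by the same amount, so the right edge does not move and no ack is required for that reason alone; only when the application reads data does the right edge advance and an ack become worthwhile.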
I hope this information is useful to you and that the problem can be fixed. Since I do not read the mailing lists, I would appreciate a note by email if someone picks up on this.
Cheers,
Håkon

-- 
Håkon Bugge; VP Product Development; Scali AS
mailto:hob@xxxxxxxx; http://www.scali.com
Fax: +47 22 62 89 51; Voice: +47 22 62 89 50; Cellular (Europe+US): +47 924 84 514
Visiting Addr: Olaf Helsets vei 6, Bogerud, N-0621 Oslo, Norway
Mail Addr: Scali AS, Postboks 150, Oppsal, N-0619 Oslo, Norway