I'm curious about the way Linux handles passive opens. I've been looking
at the code, and running some experiments. It seems like there could be
improvements made for high load (actually, overload) conditions.
My understanding of how things presently work goes like this:
Conceptually, there's a syn queue and an ack queue. The syn queue contains
connections for which we've received a syn, but not the ack of our
syn+ack. The ack queue is for connections where the three-way handshake is
complete. The maximum size of the syn queue is governed by
/proc/sys/net/ipv4/tcp_max_syn_backlog. The size of the ack queue is
limited by min(backlog argument to listen(), SOMAXCONN).
The syn and ack queues are actually part of the same queue, but the size
of each is limited independently of the other.
If we receive an ack for one of our syn+acks, and the ack queue is full,
we silently drop the ack.
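
To make the model above concrete, here is a toy sketch of it in Python.
It is not kernel code. The class and method names are made up for
illustration, and the default values for somaxconn and max_syn_backlog
are assumptions, not values read from a real system.

```python
from collections import deque


class ListenQueue:
    """Toy model of a listening socket with separate syn and ack limits."""

    def __init__(self, backlog, somaxconn=128, max_syn_backlog=1024):
        # ack queue limit: min(backlog argument to listen(), SOMAXCONN)
        self.ack_limit = min(backlog, somaxconn)
        # syn queue limit: tcp_max_syn_backlog
        self.syn_limit = max_syn_backlog
        self.syn_queue = set()    # syn received, our syn+ack not yet acked
        self.ack_queue = deque()  # handshake complete, waiting for accept()

    def on_syn(self, conn):
        """Return True if the syn was queued, False if it was dropped."""
        if len(self.syn_queue) >= self.syn_limit:
            return False
        self.syn_queue.add(conn)
        return True

    def on_ack(self, conn):
        """Return True if conn moved to the ack queue, False otherwise."""
        if conn not in self.syn_queue:
            return False
        if len(self.ack_queue) >= self.ack_limit:
            # Silent drop: conn stays in the syn queue until the ack
            # is seen again.
            return False
        self.syn_queue.remove(conn)
        self.ack_queue.append(conn)
        return True

    def accept(self):
        """Hand the oldest completed connection to the application."""
        return self.ack_queue.popleft() if self.ack_queue else None
```

With backlog=2, a third ack is dropped while two connections are waiting
for accept(), and is only honoured if it arrives again after accept() has
freed a slot.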
The problem that I run into in my experiments is this:
I have a single machine running as a web server. There are three machines
submitting requests to the web server. The volume of data to be returned
is greater than the link capacity (the network is overloaded). Obviously,
requests will queue up at the server.
The webserver (Apache) uses one process per request. When all processes
are busy, the ack queue in the kernel starts to build up. When the ack
queue is filled, the syn queue starts to build up.
The problem is that most of the connections in the syn queue stay there,
because the acks arrive more quickly than the webserver finishes
requests. Because we've dropped and ignored the ack, the connection
request won't move to the ack queue until the client has retransmitted
the ack (at least 3 seconds, because the client has to use the initial
RTT estimate).
Eventually, the ack queue drains (because no new connections are making
it in), and the number of requests being served drops as well. From that
point until the clients time out and retransmit their acks, the server
is idle.
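
A back-of-the-envelope estimate of that idle period, as a sketch. The
function name and its parameters are hypothetical, and it assumes every
request takes the same service time, that all workers are busy when the
ack is dropped at t = 0, and that the ack shows up again exactly one
retransmit timeout later.

```python
import math


def idle_gap(queued, workers, service_time, rto=3.0):
    """Rough estimate (seconds) of server idle time after an ack is dropped.

    queued       -- connections sitting in the ack queue at t = 0
    workers      -- number of server processes, all busy at t = 0
    service_time -- assumed fixed time to serve one request, in seconds
    rto          -- assumed client retransmit timeout, in seconds
    """
    # Workers finish their current requests (at most one service time),
    # then work through the queued connections in rounds.
    rounds = 1 + math.ceil(queued / workers)
    drain_time = rounds * service_time
    # The server has nothing to do from drain_time until the ack returns.
    return max(0.0, rto - drain_time)
```

For example, idle_gap(queued=10, workers=5, service_time=0.5) gives 1.5
seconds of idle time under these assumptions. A long enough ack queue
hides the gap entirely.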
1. Why the separate limits for the syn and ack queues? In particular, why
do we not always move from the syn queue to the ack queue? Is it
because of the cost of allocating the full socket? Something else?
2. If we have to maintain separate limits, why not still track that we
have received an ack for something in the syn queue, so that we can
move it later? This would alleviate the wait for the retransmit from the
client.
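
What question 2 has in mind could look roughly like the following sketch.
It is only an illustration of the idea, not a proposal for how the kernel
should structure it. All names are hypothetical, and the syn queue limit
is left out to keep it short.

```python
from collections import deque


class DeferredPromotionQueue:
    """Toy model: remember acks that arrive while the ack queue is full."""

    def __init__(self, ack_limit):
        self.ack_limit = ack_limit
        self.pending = {}         # conn -> True once its ack has been seen
        self.ack_queue = deque()  # handshake complete, waiting for accept()

    def on_syn(self, conn):
        self.pending[conn] = False

    def on_ack(self, conn):
        if conn not in self.pending:
            return
        # Record the ack instead of dropping it, then promote if possible.
        self.pending[conn] = True
        self._promote()

    def accept(self):
        if not self.ack_queue:
            return None
        conn = self.ack_queue.popleft()
        # A slot just opened up: move an already-acked connection over
        # without waiting for the client to retransmit.
        self._promote()
        return conn

    def _promote(self):
        # dicts keep insertion order, so candidates are taken in the
        # order their syns arrived (not the order of their acks).
        acked = [c for c, seen in self.pending.items() if seen]
        for conn in acked:
            if len(self.ack_queue) >= self.ack_limit:
                break
            del self.pending[conn]
            self.ack_queue.append(conn)
```

With ack_limit=1 and two acked connections, the second one moves to the
ack queue as soon as accept() returns the first.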
In the initial experiments, there was a different problem that also
led to the server idling. The outbound syn+acks were getting dropped
at the device queue (they were contending with the response data for
queue space, and losing). Given that the penalty for losing a syn+ack
is much higher than the penalty for losing a data packet, we
prioritized syn+acks higher than data packets.
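
As an illustration of that kind of prioritization, here is a minimal
sketch of a two-band transmit queue. This is not the change we made, just
one way to picture it. The names are hypothetical, and evicting the
newest data packet to make room for a syn+ack is an assumption of this
sketch.

```python
from collections import deque


class TwoBandTxQueue:
    """Toy device queue: syn+acks are sent first and dropped last."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.control = deque()  # syn+acks
        self.data = deque()     # everything else

    def __len__(self):
        return len(self.control) + len(self.data)

    def enqueue(self, packet, is_synack):
        """Return True if the packet was queued, False if it was dropped."""
        if is_synack:
            if len(self) >= self.capacity:
                if not self.data:
                    return False  # queue is full of syn+acks already
                self.data.pop()   # evict the newest data packet
            self.control.append(packet)
            return True
        if len(self) >= self.capacity:
            return False          # data packets lose when the queue is full
        self.data.append(packet)
        return True

    def dequeue(self):
        """Strict priority: drain syn+acks before any data."""
        if self.control:
            return self.control.popleft()
        if self.data:
            return self.data.popleft()
        return None
```

With capacity=2 and two data packets queued, a syn+ack still gets in (at
the cost of the newer data packet) and is the first packet sent.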