David S. Miller writes:
> On Sat, 26 Mar 2005 22:33:01 -0800
> Dmitry Yusupov <dmitry_yus@xxxxxxxxx> wrote:
> > i.e. TCP stack should call NIC driver's callback after all SKB data
> > been successfully copied to the user space. At that point NIC driver
> > will safely replenish HW ring. This way we could avoid most of memory
> > allocations on receive.
> How does this solve your problem? This is just simple SKB
> recycling, and it's a pretty old idea.
> TCP packets can be held on receive for arbitrary amounts of time.
> This is especially true if data is received out of order or
> when packets are dropped. We can't even wake up the user
> until the holes in the sequence space are filled.
> Even if data is received properly and in order, there are no
> hard guarantees about when the user will get back onto the
> CPU to get the data copied to it.
> During these gaps in time, you will need to keep your HW
> receive ring populated with packets.
Here's the way I see it.
1) There are iSCSI connections that should be "protected", resource-wise.
Examples: remote swap device, bank accounts database on RAID accessed via
2) There are two ways to protect the "protected" connections. One is the
"Big Brother" way: a centralized Resource Manager that performs fully
deterministic resource accounting throughout the system, all the way from
NIC descriptors and on-chip memory up to iSCSI buffers for Data-Out headers.
3) The second way is *awareness* of the "protected" connections, propagated
throughout the system, along with incremental implementation of more
sophisticated recovery schemes.
4) The Resource Manager could be used in the following way. At session open
time, the iSCSI control plane calculates the iSCSI and TCP resources that
must be available at all times. The calculation is based on: the number of
SCSI commands to be processed in parallel (the 'can_queue'), the maximum size
of the SCSI payload in the SG, the negotiated maximum number of outstanding
R2Ts, and the sizes of Immediate and FirstBurst data.
5) If the Resource Manager says there are not enough resources, iSCSI fails
the session open. This is better than getting into trouble well into runtime.
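
Points 4 and 5 can be sketched roughly as below. This is a userspace
illustration only: the struct names, the rm_* functions and the sizing
formula are all hypothetical. The proposal lists the inputs to the
calculation but not how they combine, so the formula here is a deliberately
crude upper bound that I am assuming for the sake of the example. The only
fixed number is the 48-byte iSCSI basic header size.

```c
#include <stdbool.h>
#include <stddef.h>

#define ISCSI_HDR_BYTES 48u  /* size of the iSCSI Basic Header Segment */

/* Hypothetical per-session inputs, mirroring the list in point 4. */
struct sess_params {
    unsigned int can_queue;   /* SCSI commands processed in parallel */
    size_t max_sg_payload;    /* largest SCSI payload per command, bytes */
    unsigned int max_r2t;     /* negotiated max outstanding R2Ts */
    size_t immediate_len;     /* Immediate data per command, bytes */
    size_t first_burst_len;   /* FirstBurst data per command, bytes */
};

/* Hypothetical Resource Manager state: one pool, counted in bytes. */
struct res_mgr {
    size_t total;
    size_t reserved;
};

/*
 * ASSUMED worst case: every queued command holds its full payload, its
 * unsolicited data, and one Data-Out header per outstanding R2T.
 * Overflow is not checked in this sketch.
 */
size_t rm_session_need(const struct sess_params *p)
{
    size_t per_cmd = p->max_sg_payload
                   + p->immediate_len
                   + p->first_burst_len
                   + (size_t)p->max_r2t * ISCSI_HDR_BYTES;

    return (size_t)p->can_queue * per_cmd;
}

/*
 * Admission check at session open. Returns true and records the
 * reservation if it fits; returns false and changes nothing otherwise,
 * in which case the caller would fail the session open.
 */
bool rm_session_open(struct res_mgr *rm, const struct sess_params *p)
{
    size_t need = rm_session_need(p);

    if (need > rm->total - rm->reserved)
        return false;

    rm->reserved += need;
    return true;
}
```

A real implementation would account for skbufs, NIC descriptors and so on
separately rather than in one byte pool; the single counter just keeps the
admission logic visible.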
6) For example: to transmit 'can_queue' commands, iSCSI needs N skbufs.
Let's say all N commands are transmitted in a burst, and just one of these N
gets acked by the Target (via StatSN). In a fully deterministic system this
does not necessarily mean that scsi-ml can now send one more command, because
the full condition also involves recycling the skbuf(s) used for transmitting
this one completed command. And although it is hard to imagine that the
command gets fully done by the remote target without its Tx buffers getting
recycled, the theoretical chance exists (e.g., the NIC is slow or the driver
has a bad Tx recycling implementation), and a fully deterministic scheme
should take it into account.
7) Therefore, prior to calling scsi_done(), iSCSI asks the Resource Manager
whether all the TCP etc. resources used for this command have already been
recycled. If not, scsi_done() is postponed. In addition, iSCSI "complains"
to the Resource Manager that it is entering the slow path because of this,
which could prompt the latter to take action. (End of the example.)
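
The example in points 6 and 7 could look something like the sketch below.
Everything in it is hypothetical: cmd_track, rm_stats and the function names
are made up for illustration, the done_called flag stands in for the real
scsi_done() call, and there is no locking. It only shows the ordering rule:
completion is delivered when both the target's status has arrived and the
last Tx buffer has been recycled, whichever happens last.

```c
#include <stdbool.h>

/* Hypothetical per-command bookkeeping. */
struct cmd_track {
    unsigned int tx_bufs_outstanding; /* Tx buffers not yet recycled */
    bool status_received;             /* target acked the command (StatSN) */
    bool done_called;                 /* stand-in for scsi_done() having run */
    bool slow_path_reported;          /* already complained for this command */
};

/* Hypothetical Resource Manager counter for slow-path complaints. */
struct rm_stats {
    unsigned int slow_path_events;
};

/* Complete the command only if both conditions hold. */
static void try_complete(struct cmd_track *c, struct rm_stats *rm)
{
    if (c->done_called || !c->status_received)
        return;

    if (c->tx_bufs_outstanding > 0) {
        /* Status arrived before Tx recycling finished: postpone and
         * complain to the Resource Manager once per command. */
        if (!c->slow_path_reported) {
            rm->slow_path_events++;
            c->slow_path_reported = true;
        }
        return;
    }

    c->done_called = true; /* the real code would call scsi_done() here */
}

/* Called when the target's response for this command is received. */
void cmd_status_received(struct cmd_track *c, struct rm_stats *rm)
{
    c->status_received = true;
    try_complete(c, rm);
}

/* Called when the driver reports one Tx buffer of this command recycled. */
void cmd_tx_buf_recycled(struct cmd_track *c, struct rm_stats *rm)
{
    if (c->tx_bufs_outstanding > 0)
        c->tx_bufs_outstanding--;
    try_complete(c, rm);
}
```

In the common case the Tx buffers are recycled long before the status
arrives, so cmd_status_received() completes the command immediately and the
slow-path counter never moves.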
8) If we agree to declare some connections "resource-protected", it would
immediately mean that there are possibly other connections that are not
resource-protected. That in turn gives the Resource Manager the flexibility
to OOM-kill those unprotected connections and cannibalize the corresponding
resources for the protected ones.
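
A minimal sketch of the reclaim step in point 8, again with made-up names
(struct conn, rm_reclaim) and an assumed policy: victims are taken in table
order until enough is freed. A real policy would presumably pick victims
more carefully, and tearing down a connection involves far more than
clearing a flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical connection record as seen by the Resource Manager. */
struct conn {
    bool protected_conn; /* marked resource-protected at session open */
    bool alive;
    size_t held;         /* resources currently held, in bytes */
};

/*
 * Kill unprotected connections, in table order, until at least `need`
 * bytes have been freed or no candidates remain. Protected connections
 * are never touched. Returns the number of bytes actually reclaimed,
 * which may be less than `need`.
 */
size_t rm_reclaim(struct conn *conns, size_t n, size_t need)
{
    size_t freed = 0;

    for (size_t i = 0; i < n && freed < need; i++) {
        if (conns[i].protected_conn || !conns[i].alive)
            continue;

        conns[i].alive = false;  /* stand-in for tearing the connection down */
        freed += conns[i].held;
        conns[i].held = 0;
    }

    return freed;
}
```

The caller would compare the return value with `need` to find out whether
the protected connection can actually be satisfied.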
9) Without some awareness of the resource-protected connections, and without
some kind of resource counting at runtime (let it be partial and incomplete
for starters), the only remaining way for customers that require HA (High
Availability) is to over-engineer: use 64GB RAM, TBs of disk space, etc.
Which is probably not the end of the world as long as the prices go down.