On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> > we do have SLAB [which essentially caches structures, on a per-CPU basis]
> > which i did take into account, but still, initializing a 600+ byte kiovec
> > is probably more work than the rest of sending a packet! I mean i'd love
> > to eliminate the 200+ bytes skb initialization as well, it shows up.
> Reusing a kiobuf for a request involves setting up the length, offset
> and maybe errno fields, and writing the struct page *'s into the
> maplist. Nothing more.
i'm talking about kiovecs not kiobufs (because those are equivalent to a
fragmented packet - every packet fragment can be anywhere). Initializing a
kiovec involves touching a dozen cachelines. Keeping structures compressed
is very important.
i dont know. I dont think it's necesserily bad for a subsystem to have its
own 'native structure' how it manages data.
> We've already got measurements showing how insane this is. Raw IO
> requests, plus internal pagebuf contiguous requests from XFS, have to
> get broken down into page-sized chunks by the current ll_rw_block()
> API, only to get reassembled by the make_request code. It's
> *enormous* overhead, and the kiobuf-based disk IO code demonstrates
> this clearly.
i do believe that you are wrong here. We did have a multi-page API between
sendfile and the TCP layer initially, and it made *absolutely no
performance difference*. But it was more complex, and harder to fix. And
we had to keep intelligent buffering/clustering/merging in any case,
because some native Linux interfaces such as write() and read() have byte
so unless there is some fundamental difference between the two approaches,
i dont buy this argument. I dont necesserily say that your measurements
are wrong, i'm saying that the performance analysis is wrong.
> We have already shown that the IO-plugging API sucks, I'm afraid.
it might not be important to others, but we do hold one particular
SPECweb99 world record: on 2-way, 2 GB RAM, testing a load with a full
fileset of ~9 GB. It generates insane block-IO load, and we do beat other
OSs that have multipage support, including SGI. (and no, it's not due to
kernel-space acceleration alone this time - it's mostly due to very good
block-IO performance.) We use Jens Axobe's IO-batching fixes that
dramatically improve the block scheduler's performance under high load.
> > > and even in networking the 1.5K packet limit kills us in some cases
> > > and we need an interface capable of generating jumbograms.
> > which cases?
> Gig Ethernet, [...]
we handle gigabit ethernet with 1.5K zero-copy packets just fine. One
thing people forget is IRQ throttling: when switching from 1500 byte
packets to 9000 byte packets then the amount of interrupts drops by a
factor of 6. Now if the tunings of a driver are not changed accordingly,
1500 byte MTU can show dramatically lower performance than 9000 byte MTU.
But if tuned properly, i see little difference between 1500 byte and 9000
byte MTU. (when using a good protocol such as TCP.)
> > nothing prevents the introduction of specialized interfaces - if they feel
> > like they can get enough traction.
> So you mean we'll introduce two separate APIs for general zero-copy,
> just to get around the problems in the single-page-based on?
no. But i think that none of the mainstream protocols or APIs mandate a
multi-page interface - i do think that the performance problems mentioned
were mis-analyzed. I'd call the multi-page API thing an urban legend.
Nobody in their right mind can claim that a series of function calls shows
any difference in *block IO* performance, compared to a multi-page API
(which has an additional vector-setup cost). Only functional differences
can explain any measured performance difference - and for those
merging/clustering bugs, multipage support is only a workaround.
> > I was talking about the normal Linux IO
> > APIs, read()/write()/sendfile(), which are byte granularity and invoke an
> > almost mandatory buffering/clustering mechanizm in every kernel subsystem
> > they deal with.
> Only tcp and ll_rw_block. ll_rw_block has already been fixed in the
> SGI patches, and gets _much_ better performance as a result. [...]
as mentioned above, i think this is not due to going multipage.
> The presence of terrible performance in the old ll_rw_block code is
> NOT a good excuse for perpetuating that model.
i'd like to measure this performance problem (because i'd like to
double-check it) - what measurement method was used?