Re: 2.6.0-test11: dst_cache_overflow causing unresponsive box

David S. Miller writes:

 > Let us assume, for the sake of back of the envelope calculations, that
 > all 90k TCP connections speak to unique destinations.  Let us further
 > assume that all of them have at least one packet in flight.
 > This means the routing cache must be able to hold at least 90k entries.
 > All of these routing cache entries will be referenced by the packets
 > in the TCP retransmission queues of all the sockets, and thus the
 > entries are unreclaimable.
 > You are setting net.ipv4.route.max_size to 655360 which should be more
 > than enough.  But you also have to make the net.ipv4.route.gc_thresh
 > more reasonable as well, perhaps 90K as a test.
 > If net.ipv4.route.gc_thresh is lower than 90K and my assertions above
 > hold, then the kernel will try to garbage collect too early, all the
 > routing cache entries will be in use and therefore uncollectable,
 > and you'll get the message you're seeing.
 > Try to pump up gc_thresh and see if that helps.
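
 The back-of-the-envelope check above can be sketched as follows. This is
 only an illustration of the argument, not kernel logic: the function name
 is hypothetical, the 90000 and 655360 figures come from the text above,
 and the 4096 used for a "too low" gc_thresh is an assumed value, not the
 reporter's actual setting.

```python
def check_route_cache(expected_in_use, gc_thresh, max_size):
    """Compare route cache sysctl values against the number of entries
    expected to be pinned by in-flight packets (hypothetical helper)."""
    warnings = []
    if gc_thresh < expected_in_use:
        # GC starts while every entry may still be referenced.
        warnings.append("gc_thresh is below the expected in-use entries")
    if max_size < expected_in_use:
        # The cache cannot hold the working set at all.
        warnings.append("max_size is below the expected in-use entries")
    return warnings


# 90k connections, gc_thresh raised to 90k, max_size as reported.
print(check_route_cache(90000, 90000, 655360))
# Same load with an assumed low gc_thresh of 4096.
print(check_route_cache(90000, 4096, 655360))
```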

 Yes, that is better tuning, since gc_thresh and max_size are then in better
 balance. But max_size stays the same, so I'd guess we keep collecting
 unreclaimable entries until we see the dst cache overflow anyway. The long
 time before the overflow ("hours to days") is suspicious. We have to ask
 whether this has ever worked before.
 I'd also guess the number of hash buckets should be increased for systems
 like this.
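
 To put a rough number on the hash bucket question, here is a small sketch.
 It assumes the table size is a power of two and that entries spread evenly
 over the buckets; the target chain length of 4 is an arbitrary choice for
 illustration, and the function name is hypothetical.

```python
def buckets_for(entries, target_chain_len=4):
    """Smallest power-of-two bucket count that keeps the average hash
    chain at or below target_chain_len (assumes an even spread)."""
    needed = -(-entries // target_chain_len)  # ceiling division
    buckets = 1
    while buckets < needed:
        buckets *= 2
    return buckets


for entries in (90000, 655360):
    b = buckets_for(entries)
    print(entries, "entries:", b, "buckets, avg chain", round(entries / b, 2))
```

 With fewer buckets than this, the average chain grows in proportion, and
 every lookup and GC pass has to walk longer chains.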


