On Thu, Jan 02, 2025 at 11:00:14PM +0100, Stefano Brivio wrote:
> On Thu, 2 Jan 2025 14:46:45 +1100
> David Gibson wrote:
> 
> > On Wed, Jan 01, 2025 at 10:54:44PM +0100, Stefano Brivio wrote:
> > > On Fri, 20 Dec 2024 19:35:34 +1100
> > > David Gibson wrote:
> > > 
> > > > Currently we attempt to size pool_tap[46] so they have room for
> > > > the maximum possible number of packets that could fit in pkt_buf,
> > > > TAP_MSGS.  However, the calculation isn't quite correct: TAP_MSGS
> > > > is based on ETH_ZLEN (60) as the minimum possible L2 frame size.
> > > > But, we don't enforce that L2 frames are at least ETH_ZLEN when
> > > > we receive them from the tap backend, and since we're dealing
> > > > with virtual interfaces we don't have the physical Ethernet
> > > > limitations requiring that length.  Indeed it is possible to
> > > > generate a legitimate frame smaller than that (e.g. a
> > > > zero-payload UDP/IPv4 frame on the 'pasta' backend is only 42
> > > > bytes long).
> > > > 
> > > > It's also unclear if this limit is sufficient for vhost-user
> > > > which isn't limited by the size of pkt_buf as the other modes
> > > > are.
> > > > 
> > > > We could attempt to correct the calculation, but that would leave
> > > > us with even larger arrays, which in practice rarely accumulate
> > > > more than a handful of packets.  So, instead, put an arbitrary
> > > > cap on the number of packets we can put in a batch, and if we run
> > > > out of space, process and flush the batch.
> > > 
> > > I ran a few more tests with this, keeping TAP_MSGS at 256, and in
> > > general I couldn't really see a difference in latency (especially
> > > for UDP streams with small packets) or throughput.  Figures from
> > > short throughput tests (such as the ones from the test suite) look
> > > a bit more variable, but I don't have any statistically meaningful
> > > data.
> > > 
> > > Then I looked into how many messages we might have in the array
> > > without this change, and I realised that, with the throughput tests
> > > from the suite, we very easily exceed the 256 limit.
> > 
> > Ah, interesting.
> > 
> > > Perhaps surprisingly we get the highest buffer counts with TCP
> > > transfers and intermediate MTUs: we're at about 4000-5000 with 1500
> > > bytes (and more like ~1000 with 1280 bytes) meaning that we move 6
> > > to 8 megabytes in one shot, every 5-10 ms (at 8 Gbps).  With that
> > > kind of time interval, the extra system call overhead from forcibly
> > > flushing batches might become rather relevant.
> > 
> > Really?  I thought syscall overhead (as in the part that's
> > per-syscall, rather than per-work) was generally in the tens of µs
> > range, rather than the ms range.
> 
> Tens or hundreds of µs (mind that it's several of them), so there
> could be just one order of magnitude between the two.

Hm, ok.

> > But in any case, I'd be fine with upping the size of the array to 4k
> > or 8k based on that empirical data.  That's still much smaller than
> > the >150k we have now.
> 
> I would even go with 32k -- there are embedded systems with a ton of
> memory but still much slower clocks compared to my typical setup.  Go
> figure.  Again, I think we should test and profile this, ideally, but
> if not, then let's pick something that's ~10x of what I see.
> 
> > > With lower MTUs, it looks like we have a lower CPU load and
> > > transmissions are scheduled differently (resulting in smaller
> > > batches), but I didn't really trace things.
> > 
> > Ok.
> > I wonder if with the small MTUs we're hitting throughput
> > bottlenecks elsewhere which mean this particular path isn't
> > over-exercised.
> > 
> > > So I start thinking that this has the *potential* to introduce a
> > > performance regression in some cases, and we shouldn't just assume
> > > that some arbitrary 256 limit is good enough.  I didn't check with
> > > perf(1), though.
> > > 
> > > Right now that array takes, effectively, less than 100 KiB (it's
> > > ~5000 copies of struct iovec, 16 bytes each), and in theory that
> > > could be ~2.5 MiB (at 161319 items).  Even if we double or triple
> > > that (let's assume we use 2 * ETH_ALEN to keep it simple) it's not
> > > much... and will have no practical effect anyway.
> > 
> > Yeah, I guess.  I don't love the fact that currently, for
> > correctness (not spuriously dropping packets), we rely on a fairly
> > complex calculation that's based on information from different
> > layers: the buffer size and enforcement is in the packet pool layer
> > and is independent of packet layout, but the minimum frame size
> > comes from the tap layer and depends quite specifically on which L2
> > encapsulation we're using.
> 
> Well, but it's exactly one line, and we're talking about the same
> project and tool, not something that's separated by several API
> layers.
> 
> By the way: on one hand you have that.  On the other hand, you're
> adding an arbitrary limit that comes from a test I just did, which is
> also based on information from different layers.

The difference is that it used to be a hard limit: if we exceeded it,
we dropped packets.  Now, we just do a little more work.

Actually... that makes me realise the actual size of the batches isn't
the relevant factor in choosing the size.  The amortized cost of
processing a packet will be pretty much:

	(per byte costs) * (packet size) + (per packet costs)
		+ (per batch costs) / (batch size)

So, we just need to allow (batch size) to become large enough to make
the last term negligible compared to the other two.  I find it hard to
believe that we'd need more than ~1000 to do that.

> > > All in all, I think we shouldn't change this limit without a
> > > deeper understanding of the practical impact.  While this change
> > > doesn't bring any practical advantage, the current behaviour is
> > > somewhat tested by now, and a small limit isn't.
> 
> ...and this still applies, I think.

-- 
David Gibson (he or they)            | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au       | minimalist, thank you, not the other way
                                     | around.
http://www.ozlabs.org/~dgibson
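
To make the cap-and-flush idea and the amortized-cost argument above a
bit more concrete, here is a minimal, self-contained C sketch.  It is
illustrative only, not passt's actual code: BATCH_CAP, struct batch,
batch_add() and batch_flush() are hypothetical names, and the 50 us
per-flush figure is only an assumed order of magnitude for the syscall
overhead discussed in the thread, not a measurement.

/* Illustrative sketch only -- not passt's actual code.  It shows the
 * "cap the batch, process and flush when full" pattern described in
 * the commit message quoted above, plus the amortized-cost arithmetic
 * from later in the thread.
 */
#include <stddef.h>
#include <stdio.h>

#define BATCH_CAP 1000			/* hypothetical cap per batch */

struct batch {
	size_t count;			/* packets currently queued */
	size_t flushes;			/* times the batch was processed */
};

/* Stand-in for the real per-batch work: the fixed cost that gets
 * amortized over the whole batch (e.g. a vectored write).
 */
static void batch_flush(struct batch *b)
{
	if (!b->count)
		return;
	b->flushes++;
	b->count = 0;
}

/* Queue one packet; if the cap is already reached, flush first. */
static void batch_add(struct batch *b)
{
	if (b->count == BATCH_CAP)
		batch_flush(b);
	b->count++;
}

int main(void)
{
	struct batch b = { 0, 0 };
	size_t i, sizes[] = { 256, 1000, 4096, 32768 };
	double per_batch_us = 50.0;	/* assumed, not measured */

	/* 5000 frames at once: roughly the largest burst reported above */
	for (i = 0; i < 5000; i++)
		batch_add(&b);
	batch_flush(&b);
	printf("5000 frames, cap %d: flushed %zu times\n",
	       BATCH_CAP, b.flushes);

	/* Per-packet share of a fixed per-batch cost, for various caps */
	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("batch size %5zu -> ~%.3f us/packet of batch cost\n",
		       sizes[i], per_batch_us / (double)sizes[i]);

	return 0;
}

With a cap of 1000, the 5000-frame burst is processed in five flushes
instead of one, and the assumed fixed per-batch cost amortizes to well
under a microsecond per packet, which is the sense in which the last
term of the formula becomes negligible once the cap reaches ~1000.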