On Thu, Jul 31, 2025 at 10:11:14AM +0200, Eugenio Perez Martin wrote:
> On Thu, Jul 31, 2025 at 7:59 AM David Gibson wrote:
> >
> > On Wed, Jul 30, 2025 at 08:11:20AM +0200, Eugenio Perez Martin wrote:
> > > On Wed, Jul 30, 2025 at 2:34 AM David Gibson wrote:
> > > >
> > > > On Tue, Jul 29, 2025 at 09:04:19AM +0200, Eugenio Perez Martin wrote:
> > > > > On Tue, Jul 29, 2025 at 2:33 AM David Gibson wrote:
> > > > > >
> > > > > > On Mon, Jul 28, 2025 at 07:03:12PM +0200, Eugenio Perez Martin wrote:
> > > > > > > On Thu, Jul 24, 2025 at 3:21 AM David Gibson wrote:
> > > > > > > >
> > > > > > > > On Wed, Jul 09, 2025 at 07:47:47PM +0200, Eugenio Pérez wrote:
> > > > > > > > > From ~13Gbit/s to ~11.5Gbit/s.
> > > > > > > >
> > > > > > > > Again, I really don't know what you're comparing to what here.
> > > > > > >
> > > > > > > When the buffer is full I'm using poll() to wait until vhost frees
> > > > > > > some buffers, instead of actively checking the used index. This is
> > > > > > > the cost of the syscall.
> > > > > >
> > > > > > Ah, right.  So.. I'm not sure if it's so much the cost of the syscall
> > > > > > itself, as the fact that you're actively waiting for free buffers,
> > > > > > rather than returning to the main epoll loop so you can maybe make
> > > > > > progress on something else before returning to the Tx path.
> > > > >
> > > > > The previous patch also waits for free buffers, but it does so by
> > > > > burning a CPU.
> > > >
> > > > Ah, ok.  Hrm.  I still find it hard to believe that it's the cost of
> > > > the syscall per se that's causing the slowdown.  My guess is that the
> > > > cost is because having the poll() leads to a higher latency between
> > > > the buffer being released and us detecting and re-using it.
> > > >
> > > > > The next patch is the one that allows progress to continue as long as
> > > > > there are enough free buffers, instead of always waiting until all the
> > > > > buffers have been sent. But there are situations where this conversion
> > > > > needs other code changes. In particular, all the calls to
> > > > > tcp_payload_flush() after checking that we have enough buffers, like:
> > > > >
> > > > > if (tcp_payload_sock_used > TCP_FRAMES_MEM - 2) {
> > > > > 	tcp_buf_free_old_tap_xmit(c, 2);
> > > > > 	tcp_payload_flush(c);
> > > > > ...
> > > > > }
> > > > >
> > > > > Seems like coroutines would be a good fix here, but maybe there are
> > > > > simpler ways to go back to the main loop while keeping the tcp socket
> > > > > "ready to read" from epoll's POV. Out of curiosity, what do you think
> > > > > about setjmp()? :)
> > > >
> > > > I think it has its uses, but deciding to go with it is a big
> > > > architectural decision not to be entered into lightly.
> > >
> > > Got it,
> > >
> > > Another idea is to add the flows that are being processed but had no
> > > space available in the virtqueue to a "pending" list. When the kernel
> > > tells pasta that new buffers are available, pasta checks that pending
> > > list. Maybe it can consist of only one element.
> >
> > I think this makes sense.  We already kind of want the same thing for
> > the (rare) cases where the tap buffer fills up (or the pipe buffer for
> > passt/qemu).  This is part of what we'd need to make the event
> > handling simpler (if we have proper wakeups on tap side writability we
> > can always use EPOLLET on the socket side, instead of turning it on
> > and off).
>
> That can be one way, yes.
>
> > I'm not actually sure if we need an explicit list.  It might be
> > adequate to just have a pending flag (or even derive it from existing
> > state) and poll the entire flow list.  Might be more expensive, but
> > could well be good enough (we already scan the entire flow list on
> > every epoll cycle).
>
> I'm ok with a new flag, but the memory increase will be bigger than a
> single "pending" entry. If scanning the whole list for a new state is
> also a possibility, sure, I'm ok with that too.

To be clear, I'm not insisting on a particular way of doing it.  I'm
just suggesting a few options; I'm not sure which will work out to be
best / simplest.
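Just to make the flag option concrete, the sort of shape I have in
mind is roughly the below.  Sketch only, completely untested, and all
the names are invented for illustration — struct flow, flowtab[],
flow_tap_send() and flow_retry_pending() are not the real passt
structures or helpers:

#include <stdbool.h>
#include <stddef.h>

#define FLOW_MAX	128	/* invented table size */

struct flow {
	bool pending;		/* ran out of tap/virtqueue buffers */
	/* ...the real flow state would live here... */
};

static struct flow flowtab[FLOW_MAX];

/* Stub standing in for "try to (re)send this flow's queued frames";
 * returns false when no buffers are available.
 */
static bool flow_tap_send(struct flow *f)
{
	(void)f;
	return true;
}

/* Tx path: instead of poll()ing for free buffers, mark the flow and
 * return to the main epoll loop.
 */
static void flow_tx(struct flow *f)
{
	if (!flow_tap_send(f))
		f->pending = true;
}

/* Called when the kernel signals that buffers were freed: no explicit
 * pending list, just rescan the table we already walk once per epoll
 * cycle, retrying any flow that's still marked.
 */
static void flow_retry_pending(void)
{
	size_t i;

	for (i = 0; i < FLOW_MAX; i++) {
		if (flowtab[i].pending && flow_tap_send(&flowtab[i]))
			flowtab[i].pending = false;
	}
}

The explicit-list variant would just replace the rescan with popping
entries off a queue; either way the important part is that the Tx path
never blocks waiting for buffers.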
-- 
David Gibson (he or they)		| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au		| minimalist, thank you, not the other way
					| around.  http://www.ozlabs.org/~dgibson