On Thu, Apr 25, 2024 at 07:23:28PM -0400, Jon Maloy wrote: > > > On 2024-04-23 20:44, David Gibson wrote: > > On Sat, Apr 20, 2024 at 03:19:19PM -0400, Jon Maloy wrote: > > > The kernel may support recvmsg(MSG_PEEK), starting reading data from a > > Not worth a respin on its own, but I think the comma above is > > misplaced, and for me makes the sentence much harder to read. > > > > > given offset set by the SO_PEEK_OFF socket option. This makes it > > > possible to avoid repeated reading of already read initial bytes of a > > > received message, hence saving read cycles when forwarding TCP messages > > > in the host->name space direction. > > > > > > In this commit, we add functionality to leverage this feature when available, > [...] > > > @@ -2174,6 +2183,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) > > > if (iov_rem) > > > iov_sock[fill_bufs].iov_len = iov_rem; > > > + if (peek_offset_cap) { > > > + /* Don't use discard buffer */ > > > + mh_sock.msg_iov = &iov_sock[1]; > > > + mh_sock.msg_iovlen -= 1; > > > + > > > + /* Keep kernel sk_peek_off in synch */ > > > + set_peek_offset(s, already_sent); > > I thought we didn't need to set SO_PEEK_OFF here - that it would track > > on its own, and we only needed to change it for retransmits. I don't > > think we even need to calculate 'already_sent' when we have > > SO_PEEK_OFF. In fact - if we set already_sent to 0 here, it might > > make things a bit cleaner than having to have special cases for > > adjusting the iov and sendlen. > In theory yes. > I tried it for a while, using SEQ_GE(max_ack_seq, ack_seq) as criteria for > retransmission. > I observed some strange behavior, like retransmits that seemingly did not > come from fast retransmit or timer retransmit, and that the kernel > 'sk_peek_off' > didnīt always have the expected value when comparing with 'already_sentī. Ouch, that sounds bad. I'm pretty sure that means there's a bug on one side or the other. > Since my focus was on the zero-window issue I decided to skip this for now > and take the safe option. > I may revisit this later. > > > > > + } > > > + > > > /* Receive into buffers, don't dequeue until acknowledged by guest. */ > > > do > > > len = recvmsg(s, &mh_sock, MSG_PEEK); > > > @@ -2195,7 +2213,9 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) > > > return 0; > > > } > [...] > > > + peek_offset_cap = true; > > > + } > > > + close(s); > > > + } > > > + printf("SO_PEEK_OFF%ssupported\n", peek_offset_cap ? " " : " not ") > > Should be an info(). > Made it a debug() as suggested by Stefano. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson