On Tue,  5 Dec 2023 18:36:04 -0500
Jon Maloy <jmaloy@redhat.com> wrote:

> The kernel may support recvmsg(MSG_PEEK), starting from a given offset. This
> makes it possible to avoid repeated reading of already read initial bytes of
> a received message, hence saving us read cycles when forwarding TCP messages
> in the host->name space direction.
> 
> When this feature is supported, iov_sock[0].iov_base can be set to NULL. The
> kernel code will interpret this as an instruction to skip reading of the first
> iov_sock[0].iov_len bytes of the message.
> 
> Since iov_sock[0].iov_base is set to point to tcp_buf_discard, we do this
> by simply not allocating that buffer, letting the pointer remain NULL, when
> we find that we have this kernel support.
> 
> There is no macro or function indicating kernel support for this feature. We
> therefore have to probe for it by reading from an established TCP connection.
> The traffic connection cannot be used for this purpose, because it will be
> broken if the probe reading fails. We therefore have to create a temporary
> local connection for this task. Because of this, we also add a new function,
> tcp_probe_msg_peek_offset_cap(), which creates this temporary connection
> and performs the probe read on it.
> 
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> ---
>  tcp.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 108 insertions(+), 2 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index f506cfd..ab5168e 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -402,6 +402,8 @@ struct tcp6_l2_head {	/* For MSS6 macro: keep in sync with tcp6_l2_buf_t */
>  	 (conn->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD)))
>  #define CONN_HAS(conn, set)	((conn->events & (set)) == (set))
>  
> +static bool tcp_probe_msg_peek_offset_cap();

No need for a forward declaration, I guess.

> +
>  static const char *tcp_event_str[] __attribute((__unused__)) = {
>  	"SOCK_ACCEPTED", "TAP_SYN_RCVD", "ESTABLISHED", "TAP_SYN_ACK_SENT",
>  
> @@ -506,7 +508,8 @@ static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM];
>  static unsigned int tcp6_l2_buf_used;
>  
>  /* recvmsg()/sendmsg() data for tap */
> -static char 		tcp_buf_discard		[MAX_WINDOW];
> +static char 		*tcp_buf_discard        = NULL;
> +
>  static struct iovec	iov_sock		[TCP_FRAMES_MEM + 1];
>  
>  static struct iovec	tcp4_l2_iov		[TCP_FRAMES_MEM];
> @@ -573,6 +576,15 @@ static unsigned int tcp6_l2_flags_buf_used;
>  
>  #define CONN(idx)		(&(FLOW(idx)->tcp))
>  
> +
> +/** msg_peek_offset_cap() - Does the kernel support recvmsg(MSG_PEEK) with offset?
> + */
> +static inline  bool msg_peek_offset_cap()
> +{
> +	return !tcp_buf_discard;
> +}
> +
> +
>  /** conn_at_idx() - Find a connection by index, if present
>   * @idx:	Index of connection to lookup
>   *
> @@ -2224,7 +2236,9 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  		return 0;
>  	}
>  
> -	sendlen = len - already_sent;
> +	sendlen = len;
> +	if (!msg_peek_offset_cap())
> +		sendlen -= already_sent;
>  	if (sendlen <= 0) {
>  		conn_flag(c, conn, STALLED);
>  		return 0;
> @@ -3107,6 +3121,15 @@ int tcp_init(struct ctx *c)
>  		NS_CALL(tcp_ns_socks_init, c);
>  	}
>  
> +	/* Only allocate discard buffer if needed */
> +	if (!tcp_probe_msg_peek_offset_cap()) {
> +		tcp_buf_discard = malloc(MAX_WINDOW);

I would rather not use the heap at all, even though after commit
0515adceaa8f ("passt, pasta: Namespace-based sandboxing, defer seccomp
policy application") we don't ask seccomp to terminate the process if we
call a sbrk() (or similar) in this phase.

The only specific issue I have in mind is rather minor, that is, at the
moment we can reliably calculate our memory footprint using nm(1).

But in general, having a single set of memory addresses keep things
simpler and probably a bit safer. This would be the only non-static
memory we use, and I don't see a strong case for it.

I would rather drop this buffer after a few months (in turn, if/after
the kernel change is accepted), turning the detection into a build-time
step, with passt failing if we find that we don't have this buffer, and
we were built for a kernel with support for MSG_PEEK with offset.

> +		if (!tcp_buf_discard) {
> +			perror("failed to allocate discard buffer\n");
> +			exit(EXIT_FAILURE);
> +		}
> +		debug("allocated discard buffer of size %i\n", MAX_WINDOW);
> +	}
>  	return 0;
>  }
>  
> @@ -3213,3 +3236,86 @@ void tcp_timer(struct ctx *c, const struct timespec *ts)
>  	if (c->mode == MODE_PASTA)
>  		tcp_splice_refill(c);
>  }
> +
> +/** tcp_probe_msg_peek_offset_cap() - Probe kernel for MSG_PEEK with offset support
> + */
> +static bool tcp_probe_msg_peek_offset_cap()

For consistency: (void).

I have two main criticisms to this approach: 1. it uses fork()
(and that's another usage of heap memory) but we don't actually need
connect() and send() to be synchronous for this test, and 2. we bind
an actual TCP port where we run.

I attached a sketch (pkt_selfie.c) of a slightly different approach
that solves 1. by using a non-blocking socket on the client side, and
solves 2. by creating the pair of sockets in a detached network
namespace, which is essentially invisible and goes away once we're done
with the probing.

For some reason, the negative case works:

  $ gcc -o pkt_selfie pkt_selfie.c; strace -f ./pkt_selfie
  [...]
  sendto(5, "ab", 2, 0, NULL, 0)          = 2
  recvmsg(6, {msg_name=0x7ffd2ba5d130, msg_namelen=2 => 0, msg_iov=NULL, msg_iovlen=0, msg_controllen=0, msg_flags=0}, MSG_PEEK) = 0

but on a kernel with your patch, I get ENOTCONN on recvmsg(). If I
replace that by a simple recv():

  sendto(5, "ab", 2, 0, NULL, 0)          = 2
  recvfrom(6, "ab", 10, 0, NULL, NULL)    = 2

...so I don't think it's a fundamental issue with this approach, rather
something with your patch, but I'm not yet sure what. :)

Most of pkt_selfie.c is copied and pasted (with minimal adaptations)
from existing passt's codebase, the actual implementation starts at
line 107. Of course it's missing all the error checks etc.

-- 
Stefano