From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by passt.top (Postfix) with ESMTP id 3A98E5A026F for ; Wed, 6 Dec 2023 18:06:23 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1701882382; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=IaVQJxBLLTWVp1XEEPNjfnspSDj9IQ2A19MG8kOczV8=; b=Nn9pleptb9+wBpAdEFvql3m2IPsiz9dMCnSTtcvgQiIwipGgC0kxukayBKwA+d+2npt1a8 YlxPOxwGMYOa3JlQGPGAkRfQScvBTrdOzFrxr+JlAGlFOjfDPWeMzyk/9LkRXhCdlNBdZw vZX7RuzKrClTEFVBus1QKaEjmQGcPjk= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-326-3tcc5_E_NuekwWEdgv6pig-1; Wed, 06 Dec 2023 12:06:18 -0500 X-MC-Unique: 3tcc5_E_NuekwWEdgv6pig-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.rdu2.redhat.com [10.11.54.2]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 67729185A780 for ; Wed, 6 Dec 2023 17:06:18 +0000 (UTC) Received: from elisabeth (unknown [10.39.208.4]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 3735440C6EB9; Wed, 6 Dec 2023 17:06:17 +0000 (UTC) Date: Wed, 6 Dec 2023 18:06:15 +0100 From: Stefano Brivio To: Jon Maloy Subject: Re: tcp.c: leverage MSG_PEEK with offset kernel capability when available Message-ID: <20231206180541.3ac15056@elisabeth> In-Reply-To: <3d6c3b2b-c16f-c262-36ea-05b28b93aeb8@redhat.com> References: <20231205233604.1491317-1-jmaloy@redhat.com> <20231206155940.51047ac1@elisabeth> <3d6c3b2b-c16f-c262-36ea-05b28b93aeb8@redhat.com> Organization: Red Hat MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.2 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: V7A2XAMFAHNT4YPFSIJXXGW7YOWVCM3O X-Message-ID-Hash: V7A2XAMFAHNT4YPFSIJXXGW7YOWVCM3O X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: lvivier@redhat.com, dgibson@redhat.com, passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Wed, 6 Dec 2023 11:10:50 -0500 Jon Maloy wrote: > On 2023-12-06 09:59, Stefano Brivio wrote: > > On Tue, 5 Dec 2023 18:36:04 -0500 > > Jon Maloy wrote: > > > >> The kernel may support recvmsg(MSG_PEEK), starting from a given offset. This > >> makes it possible to avoid repeated reading of already read initial bytes of > >> a received message, hence saving us read cycles when forwarding TCP messages > >> in the host->name space direction. > >> > >> When this feature is supported, iov_sock[0].iov_base can be set to NULL. The > >> kernel code will interpret this as an instruction to skip reading of the first > >> iov_sock[0].iov_len bytes of the message. > >> > >> Since iov_sock[0].iov_base is set to point to tcp_buf_discard, we do this > >> by simply not allocating that buffer, letting the pointer remain NULL, when > >> we find that we have this kernel support. > >> > >> There is no macro or function indicating kernel support for this feature. We > >> therefore have to probe for it by reading from an established TCP connection. > >> The traffic connection cannot be used for this purpose, because it will be > >> broken if the probe reading fails. We therefore have to create a temporary > >> local connection for this task. Because of this, we also add a new function, > >> tcp_probe_msg_peek_offset_cap(), which creates this temporary connection > >> and performs the probe read on it. > >> > >> Signed-off-by: Jon Maloy > >> --- > >> tcp.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-- > >> 1 file changed, 108 insertions(+), 2 deletions(-) > >> > >> diff --git a/tcp.c b/tcp.c > >> index f506cfd..ab5168e 100644 > >> --- a/tcp.c > >> +++ b/tcp.c > >> @@ -402,6 +402,8 @@ struct tcp6_l2_head { /* For MSS6 macro: keep in sync with tcp6_l2_buf_t */ > >> (conn->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD))) > >> #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) > >> > >> +static bool tcp_probe_msg_peek_offset_cap(); > > No need for a forward declaration, I guess. > > > >> + > >> static const char *tcp_event_str[] __attribute((__unused__)) = { > >> "SOCK_ACCEPTED", "TAP_SYN_RCVD", "ESTABLISHED", "TAP_SYN_ACK_SENT", > >> > >> @@ -506,7 +508,8 @@ static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM]; > >> static unsigned int tcp6_l2_buf_used; > >> > >> /* recvmsg()/sendmsg() data for tap */ > >> -static char tcp_buf_discard [MAX_WINDOW]; > >> +static char *tcp_buf_discard = NULL; > >> + > >> static struct iovec iov_sock [TCP_FRAMES_MEM + 1]; > >> > >> static struct iovec tcp4_l2_iov [TCP_FRAMES_MEM]; > >> @@ -573,6 +576,15 @@ static unsigned int tcp6_l2_flags_buf_used; > >> > >> #define CONN(idx) (&(FLOW(idx)->tcp)) > >> > >> + > >> +/** msg_peek_offset_cap() - Does the kernel support recvmsg(MSG_PEEK) with offset? > >> + */ > >> +static inline bool msg_peek_offset_cap() > >> +{ > >> + return !tcp_buf_discard; > >> +} > >> + > >> + > >> /** conn_at_idx() - Find a connection by index, if present > >> * @idx: Index of connection to lookup > >> * > >> @@ -2224,7 +2236,9 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) > >> return 0; > >> } > >> > >> - sendlen = len - already_sent; > >> + sendlen = len; > >> + if (!msg_peek_offset_cap()) > >> + sendlen -= already_sent; > >> if (sendlen <= 0) { > >> conn_flag(c, conn, STALLED); > >> return 0; > >> @@ -3107,6 +3121,15 @@ int tcp_init(struct ctx *c) > >> NS_CALL(tcp_ns_socks_init, c); > >> } > >> > >> + /* Only allocate discard buffer if needed */ > >> + if (!tcp_probe_msg_peek_offset_cap()) { > >> + tcp_buf_discard = malloc(MAX_WINDOW); > > I would rather not use the heap at all, even though after commit > > 0515adceaa8f ("passt, pasta: Namespace-based sandboxing, defer seccomp > > policy application") we don't ask seccomp to terminate the process if we > > call a sbrk() (or similar) in this phase. > > > > The only specific issue I have in mind is rather minor, that is, at the > > moment we can reliably calculate our memory footprint using nm(1). > > > > But in general, having a single set of memory addresses keep things > > simpler and probably a bit safer. This would be the only non-static > > memory we use, and I don't see a strong case for it. > It is a rather large chunk of memory, but sure, if we could live with it > so far, we still can. ...but we wouldn't actually use that anymore -- as long as we never write to it: $ cat zero_static.c; gcc -o zero_static zero_static.c; echo; ./zero_static #include #include #include #include static char zero[4 << 20]; int main() { char cmd[100]; snprintf(cmd, 100, "grep VmRSS /proc/%i/status", getpid()); system(cmd); memset(zero, 0xff, sizeof(zero)); system(cmd); return 0; } VmRSS: 1388 kB VmRSS: 5484 kB > > I would rather drop this buffer after a few months (in turn, if/after > > the kernel change is accepted), turning the detection into a build-time > > step, with passt failing if we find that we don't have this buffer, and > > we were built for a kernel with support for MSG_PEEK with offset. > Sounds a litte risky to me. How do we know how are users are building it? The idea is that packages are _typically_ built on a system that vaguely resembles the one where they're going to be installed. Based on that, we already have some conditionals in the Makefile based on kernel headers -- those are strictly needed because we use those headers to access features, so it's not ideal (we have no theoretical guarantees, and doesn't cover users who don't use packages), but we can't do much better. Here, we could bluntly take the kernel version on build, and if it's recent enough (again, leaving a reasonable timeframe in between), skip the discard buffer. I don't see it as very risky in the sense that if we don't have it, and we realise we need it, we can tell the user to either enable a build conditional, or upgrade the kernel. On the other hand: > Maybe a year or two... ...yes, perhaps more reasonable, especially given that we're not using that memory anymore. -- Stefano