From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by passt.top (Postfix) with ESMTP id 3C73E5A026F for ; Wed, 6 Dec 2023 17:10:58 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1701879056; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=QpBHDRN6J5SpTopzaQZljqNfh5n8KErESbw8F1oPlz0=; b=IeLrejkUcaM+zcxYELYRWH8BFCDfNRzF1iFTK/tAdps0X8rSwFZAfbtMtA5oL/Rl5D/qz0 mBmgoZ/6WMURMDAlta1JFj2//jKEgUQOk1eqMxbFjYwi+UBbX0rb+LHXBYD2E7ETAmu7hk IlJnXVn+LNrFR8xDwZKJc/k4QZwzUKc= Received: from mail-qk1-f198.google.com (mail-qk1-f198.google.com [209.85.222.198]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-580-znAotSHbOw2sKQPSKt3S-A-1; Wed, 06 Dec 2023 11:10:52 -0500 X-MC-Unique: znAotSHbOw2sKQPSKt3S-A-1 Received: by mail-qk1-f198.google.com with SMTP id af79cd13be357-77f38f2f6fbso27823185a.0 for ; Wed, 06 Dec 2023 08:10:52 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1701879052; x=1702483852; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=QpBHDRN6J5SpTopzaQZljqNfh5n8KErESbw8F1oPlz0=; b=tQsBN3i6hftkQlq2izV1BLmahjk6rZU56WiedsVKWvr9SgylRozE7He/ktIpxJCwYq +SScsY7QljMdIJh8N9X3MRj5knAexexNly9Y5BqFIe2Pz5p02QEIjY9KHrYRilh+TBLY NshFYaoJJV1dGJ4mIJvubax2izwk+Zstu1btCAe2gGAhF4dyCSn+gibBtyqSfbDUWftM wn4QkbwQfUdPW0K+ZmXeG7joee0OeRFsSN2tE+2PyMhjOjf1ToddgtV49qmOIgZFk5Zi o3KuXW8xk/JLU/5z5/qZ7+pZSI9OsGwWiI9ti4N8uFHFYHurCiilQEx4Gw+WWRnNM0i0 ZZbw== X-Gm-Message-State: AOJu0YwJtv2J95D2/EVsOLjuHhDFuWdD9QCvoZjtA6Xb7YQO5cIZlQSv xHtOEdjONKpWUTESy4RsKhcr/w7oBlLJPB0Ymp8Ay4j1e6QFRZzJ695IOXC3cZNFk75PUqlk5Z9 akoKzCKYGijVO X-Received: by 2002:a05:620a:80a:b0:76c:ea67:38e4 with SMTP id s10-20020a05620a080a00b0076cea6738e4mr1566007qks.12.1701879052246; Wed, 06 Dec 2023 08:10:52 -0800 (PST) X-Google-Smtp-Source: AGHT+IEOvy5p3wIoY0X8tb5JuinwuAy8JNMxpfMlW2c8cS3YLfRLuIINipFrNzVnaZEZ7HeOYn3zyw== X-Received: by 2002:a05:620a:80a:b0:76c:ea67:38e4 with SMTP id s10-20020a05620a080a00b0076cea6738e4mr1565988qks.12.1701879051858; Wed, 06 Dec 2023 08:10:51 -0800 (PST) Received: from [10.0.0.97] ([24.225.234.80]) by smtp.gmail.com with ESMTPSA id rr17-20020a05620a679100b0077f0dcac136sm62480qkn.11.2023.12.06.08.10.51 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 06 Dec 2023 08:10:51 -0800 (PST) Message-ID: <3d6c3b2b-c16f-c262-36ea-05b28b93aeb8@redhat.com> Date: Wed, 6 Dec 2023 11:10:50 -0500 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.6.0 Subject: Re: tcp.c: leverage MSG_PEEK with offset kernel capability when available To: Stefano Brivio References: <20231205233604.1491317-1-jmaloy@redhat.com> <20231206155940.51047ac1@elisabeth> From: Jon Maloy In-Reply-To: <20231206155940.51047ac1@elisabeth> X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Message-ID-Hash: 3GNES3NOQ3EGVR4OD6UWYQQXGHZE3NS4 X-Message-ID-Hash: 3GNES3NOQ3EGVR4OD6UWYQQXGHZE3NS4 X-MailFrom: jmaloy@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: lvivier@redhat.com, dgibson@redhat.com, passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On 2023-12-06 09:59, Stefano Brivio wrote: > On Tue, 5 Dec 2023 18:36:04 -0500 > Jon Maloy wrote: > >> The kernel may support recvmsg(MSG_PEEK), starting from a given offset. This >> makes it possible to avoid repeated reading of already read initial bytes of >> a received message, hence saving us read cycles when forwarding TCP messages >> in the host->name space direction. >> >> When this feature is supported, iov_sock[0].iov_base can be set to NULL. The >> kernel code will interpret this as an instruction to skip reading of the first >> iov_sock[0].iov_len bytes of the message. >> >> Since iov_sock[0].iov_base is set to point to tcp_buf_discard, we do this >> by simply not allocating that buffer, letting the pointer remain NULL, when >> we find that we have this kernel support. >> >> There is no macro or function indicating kernel support for this feature. We >> therefore have to probe for it by reading from an established TCP connection. >> The traffic connection cannot be used for this purpose, because it will be >> broken if the probe reading fails. We therefore have to create a temporary >> local connection for this task. Because of this, we also add a new function, >> tcp_probe_msg_peek_offset_cap(), which creates this temporary connection >> and performs the probe read on it. >> >> Signed-off-by: Jon Maloy >> --- >> tcp.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-- >> 1 file changed, 108 insertions(+), 2 deletions(-) >> >> diff --git a/tcp.c b/tcp.c >> index f506cfd..ab5168e 100644 >> --- a/tcp.c >> +++ b/tcp.c >> @@ -402,6 +402,8 @@ struct tcp6_l2_head { /* For MSS6 macro: keep in sync with tcp6_l2_buf_t */ >> (conn->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD))) >> #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) >> >> +static bool tcp_probe_msg_peek_offset_cap(); > No need for a forward declaration, I guess. > >> + >> static const char *tcp_event_str[] __attribute((__unused__)) = { >> "SOCK_ACCEPTED", "TAP_SYN_RCVD", "ESTABLISHED", "TAP_SYN_ACK_SENT", >> >> @@ -506,7 +508,8 @@ static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM]; >> static unsigned int tcp6_l2_buf_used; >> >> /* recvmsg()/sendmsg() data for tap */ >> -static char tcp_buf_discard [MAX_WINDOW]; >> +static char *tcp_buf_discard = NULL; >> + >> static struct iovec iov_sock [TCP_FRAMES_MEM + 1]; >> >> static struct iovec tcp4_l2_iov [TCP_FRAMES_MEM]; >> @@ -573,6 +576,15 @@ static unsigned int tcp6_l2_flags_buf_used; >> >> #define CONN(idx) (&(FLOW(idx)->tcp)) >> >> + >> +/** msg_peek_offset_cap() - Does the kernel support recvmsg(MSG_PEEK) with offset? >> + */ >> +static inline bool msg_peek_offset_cap() >> +{ >> + return !tcp_buf_discard; >> +} >> + >> + >> /** conn_at_idx() - Find a connection by index, if present >> * @idx: Index of connection to lookup >> * >> @@ -2224,7 +2236,9 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) >> return 0; >> } >> >> - sendlen = len - already_sent; >> + sendlen = len; >> + if (!msg_peek_offset_cap()) >> + sendlen -= already_sent; >> if (sendlen <= 0) { >> conn_flag(c, conn, STALLED); >> return 0; >> @@ -3107,6 +3121,15 @@ int tcp_init(struct ctx *c) >> NS_CALL(tcp_ns_socks_init, c); >> } >> >> + /* Only allocate discard buffer if needed */ >> + if (!tcp_probe_msg_peek_offset_cap()) { >> + tcp_buf_discard = malloc(MAX_WINDOW); > I would rather not use the heap at all, even though after commit > 0515adceaa8f ("passt, pasta: Namespace-based sandboxing, defer seccomp > policy application") we don't ask seccomp to terminate the process if we > call a sbrk() (or similar) in this phase. > > The only specific issue I have in mind is rather minor, that is, at the > moment we can reliably calculate our memory footprint using nm(1). > > But in general, having a single set of memory addresses keep things > simpler and probably a bit safer. This would be the only non-static > memory we use, and I don't see a strong case for it. It is a rather large chunk of memory, but sure, if we could live with it so far, we still can. > > I would rather drop this buffer after a few months (in turn, if/after > the kernel change is accepted), turning the detection into a build-time > step, with passt failing if we find that we don't have this buffer, and > we were built for a kernel with support for MSG_PEEK with offset. Sounds a litte risky to me. How do we know how are users are building it? Maybe a year or two... > >> + if (!tcp_buf_discard) { >> + perror("failed to allocate discard buffer\n"); >> + exit(EXIT_FAILURE); >> + } >> + debug("allocated discard buffer of size %i\n", MAX_WINDOW); >> + } >> return 0; >> } >> >> @@ -3213,3 +3236,86 @@ void tcp_timer(struct ctx *c, const struct timespec *ts) >> if (c->mode == MODE_PASTA) >> tcp_splice_refill(c); >> } >> + >> +/** tcp_probe_msg_peek_offset_cap() - Probe kernel for MSG_PEEK with offset support >> + */ >> +static bool tcp_probe_msg_peek_offset_cap() > For consistency: (void). > > I have two main criticisms to this approach: 1. it uses fork() > (and that's another usage of heap memory) but we don't actually need > connect() and send() to be synchronous for this test, and 2. we bind > an actual TCP port where we run. > > I attached a sketch (pkt_selfie.c) of a slightly different approach > that solves 1. by using a non-blocking socket on the client side, and > solves 2. by creating the pair of sockets in a detached network > namespace, which is essentially invisible and goes away once we're done > with the probing. > > For some reason, the negative case works: > > $ gcc -o pkt_selfie pkt_selfie.c; strace -f ./pkt_selfie > [...] > sendto(5, "ab", 2, 0, NULL, 0) = 2 > recvmsg(6, {msg_name=0x7ffd2ba5d130, msg_namelen=2 => 0, msg_iov=NULL, msg_iovlen=0, msg_controllen=0, msg_flags=0}, MSG_PEEK) = 0 > > but on a kernel with your patch, I get ENOTCONN on recvmsg(). If I > replace that by a simple recv(): > > sendto(5, "ab", 2, 0, NULL, 0) = 2 > recvfrom(6, "ab", 10, 0, NULL, NULL) = 2 > > ...so I don't think it's a fundamental issue with this approach, rather > something with your patch, but I'm not yet sure what. :) > > Most of pkt_selfie.c is copied and pasted (with minimal adaptations) > from existing passt's codebase, the actual implementation starts at > line 107. Of course it's missing all the error checks etc. > I tend to be rather minimalist when it comes to adding codelines, so that is why I chose the synchronous approach. I'll try your code (the latest version), and see what it looks like. ///jon