From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <767934ce-269a-cc9e-0cf3-1cb062103802@redhat.com>
Date: Fri, 3 May 2024 10:43:52 -0400
From: Jon Maloy <jmaloy@redhat.com>
To: Stefano Brivio, David Gibson
Subject: Re: [PATCH v2 1/2] tcp: leverage support of SO_PEEK_OFF socket option when available
References: <20240501202839.3491885-1-jmaloy@redhat.com>
 <20240501202839.3491885-2-jmaloy@redhat.com>
 <20240503154255.3d062430@elisabeth>
In-Reply-To: <20240503154255.3d062430@elisabeth>
Cc: passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com

On 2024-05-03 09:42, Stefano Brivio wrote:
> On Thu, 2 May 2024 11:31:52 +1000
> David Gibson wrote:
>
>> On Wed, May 01, 2024 at 04:28:38PM -0400, Jon Maloy wrote:
>>> From linux-6.9.0 the kernel will contain
>>> commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option").
>>>
>>> This new feature makes it possible to call recvmsg(MSG_PEEK) and make
>>> it start reading data from a given offset set by the SO_PEEK_OFF socket
>>> option. This way, we can avoid repeated reading of already read bytes of
>>> a received message, hence saving read cycles when forwarding TCP messages
>>> in the host->namespace direction.
>>>
>>> In this commit, we add functionality to leverage this feature when
>>> available, while we fall back to the previous behavior when not.
>>>
>>> Measurements with iperf3 show that throughput increases by 15-20 percent
>>> in the host->namespace direction when this feature is used.
>>>
>>> Signed-off-by: Jon Maloy
>>>
>>> ---
>>> v2: - Some smaller changes as suggested by David Gibson and Stefano Brivio.
>>>     - Moved set_peek_offset() to only the locations where the socket is set
>>>       to ESTABLISHED.
>>>     - Removed the per-packet synchronization between sk_peek_off and
>>>       already_sent. Instead only doing it in retransmit situations.
>>>     - The problem I found when troubleshooting the occasionally occurring
>>>       out-of-sync values between 'already_sent' and 'sk_peek_offset' may
>>>       have deeper implications that we may need to investigate.
>>> ---
>>>  tcp.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>  1 file changed, 55 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/tcp.c b/tcp.c
>>> index 905d26f..535f1a2 100644
>>> --- a/tcp.c
>>> +++ b/tcp.c
>>> @@ -513,6 +513,9 @@ static struct iovec tcp6_l2_iov [TCP_FRAMES_MEM];
>>> static struct iovec tcp4_l2_flags_iov [TCP_FRAMES_MEM];
>>> static struct iovec tcp6_l2_flags_iov [TCP_FRAMES_MEM];
>>>
>>> +/* Does the kernel support TCP_PEEK_OFF? */
>>> +static bool peek_offset_cap;
>>> +
>>> /* sendmsg() to socket */
>>> static struct iovec tcp_iov [UIO_MAXIOV];
>>>
>>> @@ -582,6 +585,22 @@ static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX,
>>> int init_sock_pool4 [TCP_SOCK_POOL_SIZE];
>>> int init_sock_pool6 [TCP_SOCK_POOL_SIZE];
>>>
>>> +/**
>>> + * set_peek_offset() - Set SO_PEEK_OFF offset on a socket if supported
>>> + * @conn Connection struct with reference to socket and flags
>>> + * @offset Offset in bytes
> Sorry, my bad: I forgot colons after the variable names in my proposal
> for this comment: @conn:, @offset:.

ok. I'll fix that.

>
>>> + */
>>> +static void set_peek_offset(struct tcp_tap_conn *conn, int offset)
>>> +{
>>> + int s;
>>> +
>>> + if (!peek_offset_cap)
>>> + return;
> Usually we have one extra newline after an early return between a
> condition and some other code that's not so closely related.
ok

>
>>> + s = conn->sock;
>>> + if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset)))
>>> + perror("Failed to set SO_PEEK_OFF\n");
> Don't print to stderr directly, use err().

Ok. Maybe we end up with the same weird behavior as between info() and
printf(), although it hardly matters in this case.

>
>>> +}
>>> +
>>> /**
>>> * tcp_conn_epoll_events() - epoll events mask for given connection state
>>> * @events: Current connection events
>>> @@ -2174,6 +2193,12 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>>> if (iov_rem)
>>> iov_sock[fill_bufs].iov_len = iov_rem;
>>>
>>> + if (peek_offset_cap) {
>>> + /* Don't use discard buffer */
>>> + mh_sock.msg_iov = &iov_sock[1];
>>> + mh_sock.msg_iovlen -= 1;
>>> + }
>>> +
>> I think this would be a little clearer if moved up to where we
>> currently initialise mh_sock.msg_iov and iov_sock[0], and make that an
>> else clause to this if. That would make it more obvious that we have
>> two different - and mutually exclusive - mechanisms for dealing with
>> un-acked data in the socket buffer. Not worth a respin on its own,
>> though.

ok

>>
>>> /* Receive into buffers, don't dequeue until acknowledged by guest. */
>>> do
>>> len = recvmsg(s, &mh_sock, MSG_PEEK);
>>> @@ -2195,7 +2220,10 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>>> return 0;
>>> }
>>>
>>> - sendlen = len - already_sent;
>>> + sendlen = len;
>>> + if (!peek_offset_cap)
>>> + sendlen -= already_sent;
>>> +
>>> if (sendlen <= 0) {
>>> conn_flag(c, conn, STALLED);
>>> return 0;
>>> }
>>> @@ -2365,9 +2393,17 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>>> flow_trace(conn,
>>> "fast re-transmit, ACK: %u, previous sequence: %u",
>>> max_ack_seq, conn->seq_to_tap);
>>> +
>>> + /* Ensure seq_from_tap isn't updated twice after call */
>>> + tcp_l2_data_buf_flush(c);
>> tcp_l2_data_buf_flush() was replaced by tcp_payload_flush() in a
>> recently merged change from Laurent.
>>
>> IIUC, this is necessary because otherwise our update to seq_to_tap can
> ...but Jon's comment refers to seq_from_tap (not seq_to_tap)? I'm
> confused.

Right. It should be seq_to_tap.

>> be clobbered from tcp_payload_flush() when we process the
>> queued-but-not-sent frames.
> ...how? I don't quite understand the issue here: tcp_payload_flush()
> updates seq_to_tap once we send the frames, not before, right?

If we don't flush, we may have a frame there, e.g. seqno 17, followed by
a lower-numbered frame, e.g. seqno 14. Both will point to a seq_to_tap we
just gave the value 14. When the buffer queue is flushed we update
seq_to_tap twice, so the next sent packet will be 16.

This would have worked in the old code, because we calculate the offset
value (already_sent) based on the seq_to_tap value, so we just skip ahead
one packet and continue transmitting. If we are lucky, pkt #15 is already
in the receiver's out-of-order queue, and we are ok.

It will *not* work in my code, because the kernel offset is advanced
linearly, so we will resend a packet called #16, but with the contents of
the original pkt #15.

>
>> This seems like a correct fix, but not an
>> optimal one: we're flushing out data we've already determined we're
>> going to retransmit. Instead, I think we want a different helper that
>> simply discards the queued frames
> Don't we always send (within the same epoll_wait() cycle) what we
> queued? What am I missing?

No. Evidently not.
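(To make the "kernel offset is advanced linearly" point above concrete,
here is a minimal, illustrative-only snippet of the SO_PEEK_OFF semantics
we rely on. It is not part of the patch, error handling is omitted, and
for TCP it needs a >= 6.9 kernel.)

#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
	struct sockaddr_in a = { .sin_family = AF_INET,
				 .sin_addr.s_addr = htonl(INADDR_LOOPBACK) };
	socklen_t sl = sizeof(a);
	int lst, snd, rcv, off = 0;
	char buf[4];

	lst = socket(AF_INET, SOCK_STREAM, 0);
	bind(lst, (struct sockaddr *)&a, sizeof(a));	/* port 0: ephemeral */
	listen(lst, 1);
	getsockname(lst, (struct sockaddr *)&a, &sl);

	snd = socket(AF_INET, SOCK_STREAM, 0);
	connect(snd, (struct sockaddr *)&a, sizeof(a));
	rcv = accept(lst, NULL, NULL);

	send(snd, "abcdefgh", 8, 0);

	/* Enable the peek offset: subsequent MSG_PEEK reads start at the
	 * offset and advance it by the number of bytes peeked.
	 */
	setsockopt(rcv, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));

	recv(rcv, buf, 4, MSG_PEEK);	/* "abcd", offset is now 4 */
	recv(rcv, buf, 4, MSG_PEEK);	/* "efgh", not "abcd" again */

	/* Rewind the offset, as set_peek_offset(conn, 0) does on
	 * retransmit: the next peek starts over from the first byte
	 * still queued in the socket.
	 */
	setsockopt(rcv, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));
	recv(rcv, buf, 4, MSG_PEEK);	/* "abcd" again */

	close(snd); close(rcv); close(lst);
	return 0;
}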
>
>> - I'm thinking maybe we actually
>> want a helper that's called from both the fast and slow retransmit
>> paths and handles that.
>>
>> Ah, wait, we only want to discard queued frames that belong to this
>> connection, that's trickier.
>>
>> It seems to me this is a pre-existing bug, we just managed to get away
>> with it previously. I think this is at least one cause of the weirdly
>> jumping forwarding sequence numbers you observed. So I think we want
>> to make a patch fixing this that goes before the SO_PEEK_OFF changes.

This was exactly the reason for my v2: comment in the commit log.
But it may even be worse. See below.

>>
>>> +
>>> conn->seq_ack_from_tap = max_ack_seq;
>>> conn->seq_to_tap = max_ack_seq;
>>> + set_peek_offset(conn, 0);
>>> tcp_data_from_sock(c, conn);
>>> +
>>> + /* Empty queue before any POLL event tries to send it again */
>>> + tcp_l2_data_buf_flush(c);
>> I'm not clear on what the second flush call is for. The only frames
>> queued should be those added by the tcp_data_from_sock() just above,
>> and those should be flushed when we get to tcp_defer_handler() before
>> we return to the epoll loop.

Sadly no. My debugging clearly shows that an epoll() may come in between
and try to transmit a pkt #14 (from the example above), but now with the
contents of the original pkt #15. All sorts of weird things may happen
after that.

I am wondering if this is a generic problem: is it possible that two
consecutive epoll() cycles may queue up two packets with the same number
in the tap queue, whereafter the number will be incremented twice when
flushed, and we create a gap in the sequence causing spurious
retransmissions? I haven't checked this theory yet, but that is part of
my plan for today.

Anyway, I don't understand the point of the delayed update of seq_to_tap
at all. To me it looks plain wrong. But I am sure somebody can explain.

>>
>>> }
>>>
>>> if (!iov_i)
>>> @@ -2459,6 +2495,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
>>> conn->seq_ack_to_tap = conn->seq_from_tap;
>>>
>>> conn_event(c, conn, ESTABLISHED);
>>> + set_peek_offset(conn, 0);
>>>
>>> /* The client might have sent data already, which we didn't
>>> * dequeue waiting for SYN,ACK from tap -- check now.
>>> @@ -2539,6 +2576,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
>>> goto reset;
>>>
>>> conn_event(c, conn, ESTABLISHED);
>>> + set_peek_offset(conn, 0);
>> The set_peek_offset() could go into conn_event() to avoid the
>> duplication. Not sure if it's worth it or not.
> I wouldn't do that in conn_event(), the existing side effects there are
> kind of expected, but set_peek_offset() isn't so closely related to TCP
> events I'd say.
>
>>> if (th->fin) {
>>> conn->seq_from_tap++;
>>> @@ -2705,7 +2743,7 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
>>> struct sockaddr_storage sa;
>>> socklen_t sl = sizeof(sa);
>>> union flow *flow;
>>> - int s;
>>> + int s = 0;
>>>
>>> if (c->no_tcp || !(flow = flow_alloc()))
>>> return;
>>> @@ -2767,7 +2805,10 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
>>> flow_dbg(conn, "ACK timeout, retry");
>>> conn->retrans++;
>>> conn->seq_to_tap = conn->seq_ack_from_tap;
>>> + set_peek_offset(conn, 0);
>>> + tcp_l2_data_buf_flush(c);
>> Uh.. doesn't this flush have to go *before* the seq_to_tap update, for
>> the reasons discussed above?

Damn. Yes. It never degraded to the level where timer retransmit was
triggered, so I only copy-pasted this, and obviously didn't do it right.
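For my own clarity, the ordering I think both retransmit paths (fast
re-transmit and ACK timeout) should use looks roughly like this. It is
only a sketch against the helpers in this patch; the wrapper name is
made up and this is not what the patch currently does:

/* Sketch only, not the actual patch: a made-up helper showing the order
 * of operations discussed above, shared by the fast re-transmit and the
 * ACK-timeout paths.
 */
static void tcp_rewind_for_retransmit(struct ctx *c,
				      struct tcp_tap_conn *conn,
				      uint32_t rewind_seq)
{
	/* Flush frames already queued for the tap first, so their deferred
	 * seq_to_tap updates can't clobber the rewind below.
	 */
	tcp_l2_data_buf_flush(c);

	/* Rewind our own send sequence (the fast re-transmit path also
	 * updates seq_ack_from_tap here)...
	 */
	conn->seq_to_tap = rewind_seq;

	/* ...and the kernel peek offset, so the next MSG_PEEK starts again
	 * from the first byte not yet acknowledged by the guest.
	 */
	set_peek_offset(conn, 0);

	/* Re-read and queue the data to retransmit, then flush it before
	 * any other epoll event can queue frames on top of it.
	 */
	tcp_data_from_sock(c, conn);
	tcp_l2_data_buf_flush(c);
}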
>>
>>> tcp_data_from_sock(c, conn);
>>> + tcp_l2_data_buf_flush(c);
> I don't understand the purpose of these new tcp_l2_data_buf_flush()
> calls. If they fix a pre-existing issue (but I'm not sure which one),
> they should be in a different patch.

Maybe so.

///jon

>
>>> tcp_timer_ctl(c, conn);
>>> }
>>> } else {
>>> @@ -3041,7 +3082,8 @@ static void tcp_sock_refill_init(const struct ctx *c)
>>> */
>>> int tcp_init(struct ctx *c)
>>> {
>>> - unsigned b;
>>> + unsigned int b, optv = 0;
>>> + int s;
>>>
>>> for (b = 0; b < TCP_HASH_TABLE_SIZE; b++)
>>> tc_hash[b] = FLOW_SIDX_NONE;
>>> @@ -3065,6 +3107,16 @@ int tcp_init(struct ctx *c)
>>> NS_CALL(tcp_ns_socks_init, c);
>>> }
>>>
>>> + /* Probe for SO_PEEK_OFF support */
>>> + s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);
>>> + if (s < 0) {
>>> + warn("Temporary TCP socket creation failed");
>>> + } else {
>>> + if (!setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &optv, sizeof(int)))
>>> + peek_offset_cap = true;
>>> + close(s);
>>> + }
>>> + debug("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not ");
>>> return 0;
>>> }
>>>