public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [PATCH v2 00/10] RFC: Clean up TCP epoll mask handling
@ 2024-09-13  4:32 David Gibson
  2024-09-13  4:32 ` [PATCH v2 01/10] tcp: Make some extra functions private David Gibson
                   ` (9 more replies)
  0 siblings, 10 replies; 15+ messages in thread
From: David Gibson @ 2024-09-13  4:32 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

This is a draft series culminating in greatly simplifying the handling
of the epoll event mask for TCP sockets.  Doing that requires two
major preliminaries:

1) We also need to alter event mask handling for the tap interface, to
   track and report when it's ready to accept more data.  We need this
   to "unstick" connections where we have pending data on the socket,
   but weren't able to forward it because we ran out of tap interface
   buffer space.

2) We need to alter handling of ack sequence numbers to the tap
   device.  Without this, the event mask changes expose a fragility
   where depending on the precise order things are called we could
   update the ack pointer without actually sending an ack to the tap
   interface.

This is not ready to go yet.  For one thing it tanks some of the
throughput numbers for reasons I haven't yet diagnosed.  It could also
certainly do with another set of eyes looking critically over the
event logic changes.

In addition, the fragility of the ack-to-tap handling remains.  The
series fixes the specific problem, but it remains non-obviously unsafe
to call tcp_update_seqack_from_tap() if a packet isn't going to be
immediately and unconditionally sent to the guest.  I've been trying
to figure out how to make that more robust without introducing
additional TCP_INFO calls, but my brain's seizing up a bit at this
point.

Patches 1..4/10 are preliminary cleanups which should be safe, though.
Feel free to apply as many of those ones as you're happy with.

David Gibson (10):
  tcp: Make some extra functions private
  tcp: Clean up tcpi_snd_wnd probing
  tcp: Simplify ifdef logic in tcp_update_seqack_wnd()
  tcp: Make tcp_update_seqack_wnd()s force_seq parameter explicitly
    boolean
  tcp: On socket EPOLLOUT, send new ACK to tap immediately
  tap: Re-introduce EPOLLET for tap connections
  tap: Keep track of whether there might be space in the tap buffers
  tcp: Keep track of connections blocked due to a full tap interface
  tcp: Move deferred handling functions later in tcp.c
  tcp: Simplify epoll event mask management

 tap.c          |  58 ++++++++---
 tap.h          |   1 +
 tcp.c          | 265 ++++++++++++++++++++++++++++---------------------
 tcp.h          |  13 +--
 tcp_buf.c      |  15 +--
 tcp_buf.h      |   6 +-
 tcp_conn.h     |   1 +
 tcp_internal.h |   6 +-
 8 files changed, 222 insertions(+), 143 deletions(-)

-- 
2.46.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 01/10] tcp: Make some extra functions private
  2024-09-13  4:32 [PATCH v2 00/10] RFC: Clean up TCP epoll mask handling David Gibson
@ 2024-09-13  4:32 ` David Gibson
  2024-09-13  4:32 ` [PATCH v2 02/10] tcp: Clean up tcpi_snd_wnd probing David Gibson
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: David Gibson @ 2024-09-13  4:32 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

tcp_send_flag() and tcp_probe_peek_offset_cap() are not used outside tcp.c,
and have no prototype in a header.  Make them static.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tcp.c b/tcp.c
index f9fe1b9a..14b48a84 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1235,7 +1235,7 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
  *
  * Return: negative error code on connection reset, 0 otherwise
  */
-int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
+static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 {
 	return tcp_buf_send_flag(c, conn, flags);
 }
@@ -2477,7 +2477,7 @@ static void tcp_sock_refill_init(const struct ctx *c)
  *
  * Return: true if supported, false otherwise
  */
-bool tcp_probe_peek_offset_cap(sa_family_t af)
+static bool tcp_probe_peek_offset_cap(sa_family_t af)
 {
 	bool ret = false;
 	int s, optv = 0;
-- 
@@ -1235,7 +1235,7 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
  *
  * Return: negative error code on connection reset, 0 otherwise
  */
-int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
+static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 {
 	return tcp_buf_send_flag(c, conn, flags);
 }
@@ -2477,7 +2477,7 @@ static void tcp_sock_refill_init(const struct ctx *c)
  *
  * Return: true if supported, false otherwise
  */
-bool tcp_probe_peek_offset_cap(sa_family_t af)
+static bool tcp_probe_peek_offset_cap(sa_family_t af)
 {
 	bool ret = false;
 	int s, optv = 0;
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 02/10] tcp: Clean up tcpi_snd_wnd probing
  2024-09-13  4:32 [PATCH v2 00/10] RFC: Clean up TCP epoll mask handling David Gibson
  2024-09-13  4:32 ` [PATCH v2 01/10] tcp: Make some extra functions private David Gibson
@ 2024-09-13  4:32 ` David Gibson
  2024-09-17 21:54   ` Stefano Brivio
  2024-09-13  4:32 ` [PATCH v2 03/10] tcp: Simplify ifdef logic in tcp_update_seqack_wnd() David Gibson
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 15+ messages in thread
From: David Gibson @ 2024-09-13  4:32 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

When available, we want to retrieve our socket peer's advertised window and
forward that to the guest.  That information has been available from the
kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035.

Currently our probing for this is a bit odd.  The HAS_SND_WND define
determines if our headers include the tcp_snd_wnd field, but that doesn't
necessarily mean the running kernel supports it.  Currently we start by
assuming it's _not_ available, but mark it as available if we ever see
a non-zero value in the field.  This is a bit hit and miss in two ways:
 * Zero is perfectly possible window the peer could report, so we can
   get false negatives
 * We're reading TCP_INFO into a local variable, which might not be zero
   initialised, so if the kernel _doesn't_ write it it could have non-zero
   garbage, giving us false positives.

We can use a more direct way of probing for this: getsockopt() reports the
length of the information retreived.  So, check whether that's long enough
to include the field.  This lets us probe the availability of the field
once and for all during initialisation.  That in turn allows ctx to become
a const pointer to tcp_prepare_flags() which cascades through many other
functions.

We also move the flag for the probe result from the ctx structure to a
global, to match peek_offset_cap.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c          | 93 ++++++++++++++++++++++++++++++++++++--------------
 tcp.h          | 13 +++----
 tcp_buf.c      | 10 +++---
 tcp_buf.h      |  6 ++--
 tcp_internal.h |  4 +--
 5 files changed, 82 insertions(+), 44 deletions(-)

diff --git a/tcp.c b/tcp.c
index 14b48a84..cba3f3bd 100644
--- a/tcp.c
+++ b/tcp.c
@@ -308,11 +308,6 @@
 /* MSS rounding: see SET_MSS() */
 #define MSS_DEFAULT			536
 #define WINDOW_DEFAULT			14600		/* RFC 6928 */
-#ifdef HAS_SND_WND
-# define KERNEL_REPORTS_SND_WND(c)	((c)->tcp.kernel_snd_wnd)
-#else
-# define KERNEL_REPORTS_SND_WND(c)	(0 && (c))
-#endif
 
 #define ACK_INTERVAL			10		/* ms */
 #define SYN_TIMEOUT			10		/* s */
@@ -370,6 +365,14 @@ char		tcp_buf_discard		[MAX_WINDOW];
 
 /* Does the kernel support TCP_PEEK_OFF? */
 bool peek_offset_cap;
+#ifdef HAS_SND_WND
+/* Does the kernel report sending window in TCP_INFO (kernel commit
+ * 8f7baad7f035)
+ */
+bool snd_wnd_cap;
+#else
+#define snd_wnd_cap	(false)
+#endif
 
 /* sendmsg() to socket */
 static struct iovec	tcp_iov			[UIO_MAXIOV];
@@ -1052,7 +1055,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 	}
 #endif /* !HAS_BYTES_ACKED */
 
-	if (!KERNEL_REPORTS_SND_WND(c)) {
+	if (!snd_wnd_cap) {
 		tcp_get_sndbuf(conn);
 		new_wnd_to_tap = MIN(SNDBUF_GET(conn), MAX_WINDOW);
 		conn->wnd_to_tap = MIN(new_wnd_to_tap >> conn->ws_to_tap,
@@ -1136,7 +1139,7 @@ static void tcp_update_seqack_from_tap(const struct ctx *c,
  *	     0 if there is no flag to send
  *	     1 otherwise
  */
-int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
+int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
 		      int flags, struct tcphdr *th, char *data,
 		      size_t *optlen)
 {
@@ -1153,11 +1156,6 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
 		return -ECONNRESET;
 	}
 
-#ifdef HAS_SND_WND
-	if (!c->tcp.kernel_snd_wnd && tinfo.tcpi_snd_wnd)
-		c->tcp.kernel_snd_wnd = 1;
-#endif
-
 	if (!(conn->flags & LOCAL))
 		tcp_rtt_dst_check(conn, &tinfo);
 
@@ -1235,7 +1233,8 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
  *
  * Return: negative error code on connection reset, 0 otherwise
  */
-static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
+static int tcp_send_flag(const struct ctx *c, struct tcp_tap_conn *conn,
+			 int flags)
 {
 	return tcp_buf_send_flag(c, conn, flags);
 }
@@ -1245,7 +1244,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
  * @c:		Execution context
  * @conn:	Connection pointer
  */
-void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn)
+void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn)
 {
 	if (conn->events == CLOSED)
 		return;
@@ -1463,7 +1462,7 @@ static void tcp_bind_outbound(const struct ctx *c,
  * @optlen:	Bytes in options: caller MUST ensure available length
  * @now:	Current timestamp
  */
-static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
+static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
 			      const void *saddr, const void *daddr,
 			      const struct tcphdr *th, const char *opts,
 			      size_t optlen, const struct timespec *now)
@@ -1628,7 +1627,7 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
  *
  * #syscalls recvmsg
  */
-static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
+static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 {
 	return tcp_buf_data_from_sock(c, conn);
 }
@@ -1644,8 +1643,8 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
  *
  * Return: count of consumed packets
  */
-static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
-			      const struct pool *p, int idx)
+static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
+			     const struct pool *p, int idx)
 {
 	int i, iov_i, ack = 0, fin = 0, retr = 0, keep = -1, partial_send = 0;
 	uint16_t max_ack_seq_wnd = conn->wnd_from_tap;
@@ -1842,7 +1841,8 @@ out:
  * @opts:	Pointer to start of options
  * @optlen:	Bytes in options: caller MUST ensure available length
  */
-static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
+static void tcp_conn_from_sock_finish(const struct ctx *c,
+				      struct tcp_tap_conn *conn,
 				      const struct tcphdr *th,
 				      const char *opts, size_t optlen)
 {
@@ -1885,7 +1885,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
  *
  * Return: count of consumed packets
  */
-int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
+int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		    const void *saddr, const void *daddr,
 		    const struct pool *p, int idx, const struct timespec *now)
 {
@@ -2023,7 +2023,7 @@ reset:
  * @c:		Execution context
  * @conn:	Connection pointer
  */
-static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
+static void tcp_connect_finish(const struct ctx *c, struct tcp_tap_conn *conn)
 {
 	socklen_t sl;
 	int so;
@@ -2049,8 +2049,8 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
  * @sa:		Peer socket address (from accept())
  * @now:	Current timestamp
  */
-static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s,
-				   const struct timespec *now)
+static void tcp_tap_conn_from_sock(const struct ctx *c, union flow *flow,
+				   int s, const struct timespec *now)
 {
 	struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
 	uint64_t hash;
@@ -2081,7 +2081,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s,
  * @ref:	epoll reference of listening socket
  * @now:	Current timestamp
  */
-void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
+void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
 			const struct timespec *now)
 {
 	const struct flowside *ini;
@@ -2146,7 +2146,7 @@ cancel:
  *
  * #syscalls timerfd_gettime arm:timerfd_gettime64 i686:timerfd_gettime64
  */
-void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
+void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
 {
 	struct itimerspec check_armed = { { 0 }, { 0 } };
 	struct tcp_tap_conn *conn = &FLOW(ref.flow)->tcp;
@@ -2210,7 +2210,8 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
  * @ref:	epoll reference
  * @events:	epoll events bitmap
  */
-void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
+void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
+		      uint32_t events)
 {
 	struct tcp_tap_conn *conn = conn_at_sidx(ref.flowside);
 
@@ -2494,6 +2495,40 @@ static bool tcp_probe_peek_offset_cap(sa_family_t af)
 	return ret;
 }
 
+#ifdef HAS_SND_WND
+/**
+ * tcp_probe_snd_wnd_cap() - Check if TCP_INFO reports tcpi_snd_wnd
+ *
+ * Return: true if supported, false otherwise
+ */
+static bool tcp_probe_snd_wnd_cap(void)
+{
+	struct tcp_info tinfo;
+	socklen_t sl = sizeof(tinfo);
+	int s;
+
+	s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);
+	if (s < 0) {
+		warn_perror("Temporary TCP socket creation failed");
+		return false;
+	}
+
+	if (getsockopt(s, SOL_TCP, TCP_INFO, &tinfo, &sl)) {
+		warn_perror("Failed to get TCP_INFO on temporary socket");
+		close(s);
+		return false;
+	}
+
+	close(s);
+
+	if (sl < (offsetof(struct tcp_info, tcpi_snd_wnd) +
+		  sizeof(tinfo.tcpi_snd_wnd)))
+		return false;
+
+	return true;
+}
+#endif /* HAS_SND_WND */
+
 /**
  * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
  * @c:		Execution context
@@ -2527,6 +2562,12 @@ int tcp_init(struct ctx *c)
 			  (!c->ifi6 || tcp_probe_peek_offset_cap(AF_INET6));
 	debug("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not ");
 
+#ifdef HAS_SND_WND
+	snd_wnd_cap = tcp_probe_snd_wnd_cap();
+#endif
+	debug("TCP_INFO tcpi_snd_wnd field%ssupported",
+	      snd_wnd_cap ? " " : " not ");
+
 	return 0;
 }
 
diff --git a/tcp.h b/tcp.h
index e9ff0191..5585924f 100644
--- a/tcp.h
+++ b/tcp.h
@@ -10,11 +10,12 @@
 
 struct ctx;
 
-void tcp_timer_handler(struct ctx *c, union epoll_ref ref);
-void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
+void tcp_timer_handler(const struct ctx *c, union epoll_ref ref);
+void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
 			const struct timespec *now);
-void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events);
-int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
+void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
+		      uint32_t events);
+int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		    const void *saddr, const void *daddr,
 		    const struct pool *p, int idx, const struct timespec *now);
 int tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
@@ -58,16 +59,12 @@ union tcp_listen_epoll_ref {
  * @fwd_in:		Port forwarding configuration for inbound packets
  * @fwd_out:		Port forwarding configuration for outbound packets
  * @timer_run:		Timestamp of most recent timer run
- * @kernel_snd_wnd:	Kernel reports sending window (with commit 8f7baad7f035)
  * @pipe_size:		Size of pipes for spliced connections
  */
 struct tcp_ctx {
 	struct fwd_ports fwd_in;
 	struct fwd_ports fwd_out;
 	struct timespec timer_run;
-#ifdef HAS_SND_WND
-	int kernel_snd_wnd;
-#endif
 	size_t pipe_size;
 };
 
diff --git a/tcp_buf.c b/tcp_buf.c
index 1a398461..c886c92b 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -239,7 +239,7 @@ void tcp_flags_flush(const struct ctx *c)
  * @frames:	Two-dimensional array containing queued frames with sub-iovs
  * @num_frames:	Number of entries in the two arrays to be compared
  */
-static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns,
+static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
 			   struct iovec (*frames)[TCP_NUM_IOVS], int num_frames)
 {
 	int i;
@@ -264,7 +264,7 @@ static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns,
  * tcp_payload_flush() - Send out buffers for segments with data
  * @c:		Execution context
  */
-void tcp_payload_flush(struct ctx *c)
+void tcp_payload_flush(const struct ctx *c)
 {
 	size_t m;
 
@@ -293,7 +293,7 @@ void tcp_payload_flush(struct ctx *c)
  *
  * Return: negative error code on connection reset, 0 otherwise
  */
-int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
+int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 {
 	struct tcp_flags_t *payload;
 	struct iovec *iov;
@@ -361,7 +361,7 @@ int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
  * @no_csum:	Don't compute IPv4 checksum, use the one from previous buffer
  * @seq:	Sequence number to be sent
  */
-static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
+static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 			    ssize_t dlen, int no_csum, uint32_t seq)
 {
 	struct iovec *iov;
@@ -405,7 +405,7 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
  *
  * #syscalls recvmsg
  */
-int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
+int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 {
 	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
 	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
diff --git a/tcp_buf.h b/tcp_buf.h
index 3db4c56e..8d4b615a 100644
--- a/tcp_buf.h
+++ b/tcp_buf.h
@@ -9,8 +9,8 @@
 void tcp_sock4_iov_init(const struct ctx *c);
 void tcp_sock6_iov_init(const struct ctx *c);
 void tcp_flags_flush(const struct ctx *c);
-void tcp_payload_flush(struct ctx *c);
-int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn);
-int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags);
+void tcp_payload_flush(const struct ctx *c);
+int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn);
+int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags);
 
 #endif  /*TCP_BUF_H */
diff --git a/tcp_internal.h b/tcp_internal.h
index aa8bb64f..bd634be1 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -82,7 +82,7 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
 		conn_event_do(c, conn, event);				\
 	} while (0)
 
-void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
+void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
 #define tcp_rst(c, conn)						\
 	do {								\
 		flow_dbg((conn), "TCP reset at %s:%i", __func__, __LINE__); \
@@ -94,7 +94,7 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
 			       const uint16_t *check, uint32_t seq);
 int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 			  int force_seq, struct tcp_info *tinfo);
-int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, int flags,
+int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, int flags,
 		      struct tcphdr *th, char *data, size_t *optlen);
 
 #endif /* TCP_INTERNAL_H */
-- 
@@ -82,7 +82,7 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
 		conn_event_do(c, conn, event);				\
 	} while (0)
 
-void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
+void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
 #define tcp_rst(c, conn)						\
 	do {								\
 		flow_dbg((conn), "TCP reset at %s:%i", __func__, __LINE__); \
@@ -94,7 +94,7 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
 			       const uint16_t *check, uint32_t seq);
 int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 			  int force_seq, struct tcp_info *tinfo);
-int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, int flags,
+int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, int flags,
 		      struct tcphdr *th, char *data, size_t *optlen);
 
 #endif /* TCP_INTERNAL_H */
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 03/10] tcp: Simplify ifdef logic in tcp_update_seqack_wnd()
  2024-09-13  4:32 [PATCH v2 00/10] RFC: Clean up TCP epoll mask handling David Gibson
  2024-09-13  4:32 ` [PATCH v2 01/10] tcp: Make some extra functions private David Gibson
  2024-09-13  4:32 ` [PATCH v2 02/10] tcp: Clean up tcpi_snd_wnd probing David Gibson
@ 2024-09-13  4:32 ` David Gibson
  2024-09-17 21:54   ` Stefano Brivio
  2024-09-13  4:32 ` [PATCH v2 04/10] tcp: Make tcp_update_seqack_wnd()s force_seq parameter explicitly boolean David Gibson
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 15+ messages in thread
From: David Gibson @ 2024-09-13  4:32 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

This function has a block conditional on !snd_wnd_cap shortly before an
#ifdef HAS_SND_WND.  But snd_wnd_cap implies HAS_SND_WND (if !HAS_SND_WND,
snd_wnd_cap is statically false).

Therefore, simplify this down to a single conditional with an else branch.
While we're there, fix some improperly indented closing braces.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 36 +++++++++++++++++-------------------
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/tcp.c b/tcp.c
index cba3f3bd..6733e7e3 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1061,27 +1061,25 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 		conn->wnd_to_tap = MIN(new_wnd_to_tap >> conn->ws_to_tap,
 				       USHRT_MAX);
 		goto out;
-	}
-
-	if (!tinfo) {
-		if (prev_wnd_to_tap > WINDOW_DEFAULT) {
-			goto out;
-}
-		tinfo = &tinfo_new;
-		if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) {
-			goto out;
-}
-	}
-
-#ifdef HAS_SND_WND
-	if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) {
-		new_wnd_to_tap = tinfo->tcpi_snd_wnd;
 	} else {
-		tcp_get_sndbuf(conn);
-		new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd,
-				     SNDBUF_GET(conn));
+		if (!tinfo) {
+			if (prev_wnd_to_tap > WINDOW_DEFAULT) {
+				goto out;
+			}
+			tinfo = &tinfo_new;
+			if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) {
+				goto out;
+			}
+		}
+
+		if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) {
+			new_wnd_to_tap = tinfo->tcpi_snd_wnd;
+		} else {
+			tcp_get_sndbuf(conn);
+			new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd,
+					     SNDBUF_GET(conn));
+		}
 	}
-#endif
 
 	new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW);
 	if (!(conn->events & ESTABLISHED))
-- 
@@ -1061,27 +1061,25 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 		conn->wnd_to_tap = MIN(new_wnd_to_tap >> conn->ws_to_tap,
 				       USHRT_MAX);
 		goto out;
-	}
-
-	if (!tinfo) {
-		if (prev_wnd_to_tap > WINDOW_DEFAULT) {
-			goto out;
-}
-		tinfo = &tinfo_new;
-		if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) {
-			goto out;
-}
-	}
-
-#ifdef HAS_SND_WND
-	if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) {
-		new_wnd_to_tap = tinfo->tcpi_snd_wnd;
 	} else {
-		tcp_get_sndbuf(conn);
-		new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd,
-				     SNDBUF_GET(conn));
+		if (!tinfo) {
+			if (prev_wnd_to_tap > WINDOW_DEFAULT) {
+				goto out;
+			}
+			tinfo = &tinfo_new;
+			if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) {
+				goto out;
+			}
+		}
+
+		if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) {
+			new_wnd_to_tap = tinfo->tcpi_snd_wnd;
+		} else {
+			tcp_get_sndbuf(conn);
+			new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd,
+					     SNDBUF_GET(conn));
+		}
 	}
-#endif
 
 	new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW);
 	if (!(conn->events & ESTABLISHED))
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 04/10] tcp: Make tcp_update_seqack_wnd()s force_seq parameter explicitly boolean
  2024-09-13  4:32 [PATCH v2 00/10] RFC: Clean up TCP epoll mask handling David Gibson
                   ` (2 preceding siblings ...)
  2024-09-13  4:32 ` [PATCH v2 03/10] tcp: Simplify ifdef logic in tcp_update_seqack_wnd() David Gibson
@ 2024-09-13  4:32 ` David Gibson
  2024-09-13  4:32 ` [PATCH v2 05/10] tcp: On socket EPOLLOUT, send new ACK to tap immediately David Gibson
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: David Gibson @ 2024-09-13  4:32 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

This parameter is already treated as a boolean internally.  Make it a
'bool' type for clarity.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c          | 6 +++---
 tcp_buf.c      | 2 +-
 tcp_internal.h | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/tcp.c b/tcp.c
index 6733e7e3..ee894c5c 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1020,7 +1020,7 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
  * Return: 1 if sequence or window were updated, 0 otherwise
  */
 int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
-			  int force_seq, struct tcp_info *tinfo)
+			  bool force_seq, struct tcp_info *tinfo)
 {
 	uint32_t prev_wnd_to_tap = conn->wnd_to_tap << conn->ws_to_tap;
 	uint32_t prev_ack_to_tap = conn->seq_ack_to_tap;
@@ -1157,7 +1157,7 @@ int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
 	if (!(conn->flags & LOCAL))
 		tcp_rtt_dst_check(conn, &tinfo);
 
-	if (!tcp_update_seqack_wnd(c, conn, flags, &tinfo) && !flags)
+	if (!tcp_update_seqack_wnd(c, conn, !!flags, &tinfo) && !flags)
 		return 0;
 
 	*optlen = 0;
@@ -2240,7 +2240,7 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
 			tcp_data_from_sock(c, conn);
 
 		if (events & EPOLLOUT)
-			tcp_update_seqack_wnd(c, conn, 0, NULL);
+			tcp_update_seqack_wnd(c, conn, false, NULL);
 
 		return;
 	}
diff --git a/tcp_buf.c b/tcp_buf.c
index c886c92b..83f91a37 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -511,7 +511,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	last_len = sendlen - (send_bufs - 1) * mss;
 
 	/* Likely, some new data was acked too. */
-	tcp_update_seqack_wnd(c, conn, 0, NULL);
+	tcp_update_seqack_wnd(c, conn, false, NULL);
 
 	/* Finally, queue to tap */
 	dlen = mss;
diff --git a/tcp_internal.h b/tcp_internal.h
index bd634be1..a450d850 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -93,7 +93,7 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
 			       struct iovec *iov, size_t dlen,
 			       const uint16_t *check, uint32_t seq);
 int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
-			  int force_seq, struct tcp_info *tinfo);
+			  bool force_seq, struct tcp_info *tinfo);
 int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, int flags,
 		      struct tcphdr *th, char *data, size_t *optlen);
 
-- 
@@ -93,7 +93,7 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
 			       struct iovec *iov, size_t dlen,
 			       const uint16_t *check, uint32_t seq);
 int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
-			  int force_seq, struct tcp_info *tinfo);
+			  bool force_seq, struct tcp_info *tinfo);
 int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, int flags,
 		      struct tcphdr *th, char *data, size_t *optlen);
 
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 05/10] tcp: On socket EPOLLOUT, send new ACK to tap immediately
  2024-09-13  4:32 [PATCH v2 00/10] RFC: Clean up TCP epoll mask handling David Gibson
                   ` (3 preceding siblings ...)
  2024-09-13  4:32 ` [PATCH v2 04/10] tcp: Make tcp_update_seqack_wnd()s force_seq parameter explicitly boolean David Gibson
@ 2024-09-13  4:32 ` David Gibson
  2024-09-13  4:32 ` [PATCH v2 06/10] tap: Re-introduce EPOLLET for tap connections David Gibson
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: David Gibson @ 2024-09-13  4:32 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

When the guest sends data on a TCP connection, it's possible that we run
out of buffer space on the socket side.  If that happens we'll advertise
a zero window to the guest stopping it from sending.  When the socket side
buffer clears again, which is signalled by an EPOLLOUT, we recalculate the
ack and window with tcp_update_seqack_wnd().

tcp_update_seqack_wnd() only calculates the new values, it doesn't actually
send out the ack - that will typically happen some milliseconds later on
the ACK_TO_TAP_DUE timer.

AFAICT, there's not really any point delaying this ack though, we might as
well send it and get the guest sending again ASAP.  So, instead of
calling tcp_update_seqack_wnd() call tcp_send_flag() to send an immediate
ACK if needed.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index ee894c5c..32e45e09 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2240,7 +2240,7 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
 			tcp_data_from_sock(c, conn);
 
 		if (events & EPOLLOUT)
-			tcp_update_seqack_wnd(c, conn, false, NULL);
+			tcp_send_flag(c, conn, ACK_IF_NEEDED);
 
 		return;
 	}
-- 
@@ -2240,7 +2240,7 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
 			tcp_data_from_sock(c, conn);
 
 		if (events & EPOLLOUT)
-			tcp_update_seqack_wnd(c, conn, false, NULL);
+			tcp_send_flag(c, conn, ACK_IF_NEEDED);
 
 		return;
 	}
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 06/10] tap: Re-introduce EPOLLET for tap connections
  2024-09-13  4:32 [PATCH v2 00/10] RFC: Clean up TCP epoll mask handling David Gibson
                   ` (4 preceding siblings ...)
  2024-09-13  4:32 ` [PATCH v2 05/10] tcp: On socket EPOLLOUT, send new ACK to tap immediately David Gibson
@ 2024-09-13  4:32 ` David Gibson
  2024-09-13  4:32 ` [PATCH v2 07/10] tap: Keep track of whether there might be space in the tap buffers David Gibson
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: David Gibson @ 2024-09-13  4:32 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Since 4684f603446b ("tap: Don't use EPOLLET on Qemu sockets") we've only
used level-triggered events for the tap device.  Prior to that we used it
inconsistently which was confusing (though not incorrect AFAICT).

We want to add support for EPOLLOUT events on the tap connection, and
without EPOLLET that would require toggling EPOLLOUT on and off, which is
awkward.  So, re-introduce EPOLLET, but now use it uniformly for all tap
modes.  The main change this requires is making sure on EPOLLIN we loop
until all there's no more data to process.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tap.c | 34 +++++++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/tap.c b/tap.c
index 41af6a6d..c1db2960 100644
--- a/tap.c
+++ b/tap.c
@@ -985,8 +985,10 @@ static void tap_sock_reset(struct ctx *c)
  * tap_passt_input() - Handler for new data on the socket to qemu
  * @c:		Execution context
  * @now:	Current timestamp
+ *
+ * Return: true if there may be additional data to read, false otherwise
  */
-static void tap_passt_input(struct ctx *c, const struct timespec *now)
+static bool tap_passt_input(struct ctx *c, const struct timespec *now)
 {
 	static const char *partial_frame;
 	static ssize_t partial_len = 0;
@@ -1013,7 +1015,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 			err_perror("Receive error on guest connection, reset");
 			tap_sock_reset(c);
 		}
-		return;
+		return false;
 	}
 
 	p = pkt_buf;
@@ -1025,7 +1027,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
 			err("Bad frame size from guest, resetting connection");
 			tap_sock_reset(c);
-			return;
+			return false;
 		}
 
 		if (l2len + sizeof(uint32_t) > (size_t)n)
@@ -1045,6 +1047,8 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 	partial_frame = p;
 
 	tap_handler(c, now);
+
+	return true;
 }
 
 /**
@@ -1061,16 +1065,20 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 		return;
 	}
 
-	if (events & EPOLLIN)
-		tap_passt_input(c, now);
+	if (events & EPOLLIN) {
+		while (tap_passt_input(c, now))
+			;
+	}
 }
 
 /**
  * tap_pasta_input() - Handler for new data on the socket to hypervisor
  * @c:		Execution context
  * @now:	Current timestamp
+ *
+ * Return: true if there may be additional data to read, false otherwise
  */
-static void tap_pasta_input(struct ctx *c, const struct timespec *now)
+static bool tap_pasta_input(struct ctx *c, const struct timespec *now)
 {
 	ssize_t n, len;
 
@@ -1102,6 +1110,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 	}
 
 	tap_handler(c, now);
+
+	return len > 0;
 }
 
 /**
@@ -1116,8 +1126,10 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR))
 		die("Disconnect event on /dev/net/tun device, exiting");
 
-	if (events & EPOLLIN)
-		tap_pasta_input(c, now);
+	if (events & EPOLLIN) {
+		while (tap_pasta_input(c, now))
+			;
+	}
 }
 
 /**
@@ -1251,7 +1263,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
 		trace("tap: failed to set SO_SNDBUF to %i", v);
 
 	ref.fd = c->fd_tap;
-	ev.events = EPOLLIN | EPOLLRDHUP;
+	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 }
@@ -1307,7 +1319,7 @@ static void tap_sock_tun_init(struct ctx *c)
 	pasta_ns_conf(c);
 
 	ref.fd = c->fd_tap;
-	ev.events = EPOLLIN | EPOLLRDHUP;
+	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 }
@@ -1340,7 +1352,7 @@ void tap_sock_init(struct ctx *c)
 		else
 			ref.type = EPOLL_TYPE_TAP_PASTA;
 
-		ev.events = EPOLLIN | EPOLLRDHUP;
+		ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
 		ev.data.u64 = ref.u64;
 		epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 		return;
-- 
@@ -985,8 +985,10 @@ static void tap_sock_reset(struct ctx *c)
  * tap_passt_input() - Handler for new data on the socket to qemu
  * @c:		Execution context
  * @now:	Current timestamp
+ *
+ * Return: true if there may be additional data to read, false otherwise
  */
-static void tap_passt_input(struct ctx *c, const struct timespec *now)
+static bool tap_passt_input(struct ctx *c, const struct timespec *now)
 {
 	static const char *partial_frame;
 	static ssize_t partial_len = 0;
@@ -1013,7 +1015,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 			err_perror("Receive error on guest connection, reset");
 			tap_sock_reset(c);
 		}
-		return;
+		return false;
 	}
 
 	p = pkt_buf;
@@ -1025,7 +1027,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
 			err("Bad frame size from guest, resetting connection");
 			tap_sock_reset(c);
-			return;
+			return false;
 		}
 
 		if (l2len + sizeof(uint32_t) > (size_t)n)
@@ -1045,6 +1047,8 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 	partial_frame = p;
 
 	tap_handler(c, now);
+
+	return true;
 }
 
 /**
@@ -1061,16 +1065,20 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 		return;
 	}
 
-	if (events & EPOLLIN)
-		tap_passt_input(c, now);
+	if (events & EPOLLIN) {
+		while (tap_passt_input(c, now))
+			;
+	}
 }
 
 /**
  * tap_pasta_input() - Handler for new data on the socket to hypervisor
  * @c:		Execution context
  * @now:	Current timestamp
+ *
+ * Return: true if there may be additional data to read, false otherwise
  */
-static void tap_pasta_input(struct ctx *c, const struct timespec *now)
+static bool tap_pasta_input(struct ctx *c, const struct timespec *now)
 {
 	ssize_t n, len;
 
@@ -1102,6 +1110,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 	}
 
 	tap_handler(c, now);
+
+	return len > 0;
 }
 
 /**
@@ -1116,8 +1126,10 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR))
 		die("Disconnect event on /dev/net/tun device, exiting");
 
-	if (events & EPOLLIN)
-		tap_pasta_input(c, now);
+	if (events & EPOLLIN) {
+		while (tap_pasta_input(c, now))
+			;
+	}
 }
 
 /**
@@ -1251,7 +1263,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
 		trace("tap: failed to set SO_SNDBUF to %i", v);
 
 	ref.fd = c->fd_tap;
-	ev.events = EPOLLIN | EPOLLRDHUP;
+	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 }
@@ -1307,7 +1319,7 @@ static void tap_sock_tun_init(struct ctx *c)
 	pasta_ns_conf(c);
 
 	ref.fd = c->fd_tap;
-	ev.events = EPOLLIN | EPOLLRDHUP;
+	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 }
@@ -1340,7 +1352,7 @@ void tap_sock_init(struct ctx *c)
 		else
 			ref.type = EPOLL_TYPE_TAP_PASTA;
 
-		ev.events = EPOLLIN | EPOLLRDHUP;
+		ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
 		ev.data.u64 = ref.u64;
 		epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 		return;
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 07/10] tap: Keep track of whether there might be space in the tap buffers
  2024-09-13  4:32 [PATCH v2 00/10] RFC: Clean up TCP epoll mask handling David Gibson
                   ` (5 preceding siblings ...)
  2024-09-13  4:32 ` [PATCH v2 06/10] tap: Re-introduce EPOLLET for tap connections David Gibson
@ 2024-09-13  4:32 ` David Gibson
  2024-09-13  4:32 ` [PATCH v2 08/10] tcp: Keep track of connections blocked due to a full tap interface David Gibson
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: David Gibson @ 2024-09-13  4:32 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

It's possible for the buffers of the tap interface (whatever style) to fill
up, at which point we'll get some sort of short write.  This can result in
TCP connections with buffered data on the socket side we weren't able to
forward.  To more efficiently know when we can forward that data we need
to know when the tap interface is no longer "full".

To assist that, keep track of our best estimate of whether the tap device
is full: set it when we get a short write, and clear it when we get an
EPOLLOUT event on the tap fd, indicating we can write more data.  Protocols
(specifically TCP) can use this via a new tap_is_full() call.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tap.c | 31 +++++++++++++++++++++++++++----
 tap.h |  1 +
 2 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/tap.c b/tap.c
index c1db2960..3bdca9a1 100644
--- a/tap.c
+++ b/tap.c
@@ -63,6 +63,9 @@
 static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
 static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf);
 
+/* Filled buffers on the tap device */
+static bool tap_full_flag;
+
 #define TAP_SEQS		128 /* Different L4 tuples in one batch */
 #define FRAGMENT_MSG_RATE	10  /* # seconds between fragment warnings */
 
@@ -411,9 +414,11 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
 	else
 		m = tap_send_frames_passt(c, iov, bufs_per_frame, nframes);
 
-	if (m < nframes)
+	if (m < nframes) {
+		tap_full_flag = true;
 		debug("tap: failed to send %zu frames of %zu",
 		      nframes - m, nframes);
+	}
 
 	pcap_multiple(iov, bufs_per_frame, m,
 		      c->mode == MODE_PASST ? sizeof(uint32_t) : 0);
@@ -1069,6 +1074,9 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 		while (tap_passt_input(c, now))
 			;
 	}
+
+	if (events & EPOLLOUT)
+		tap_full_flag = false;
 }
 
 /**
@@ -1130,6 +1138,21 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 		while (tap_pasta_input(c, now))
 			;
 	}
+
+	if (events & EPOLLOUT)
+		tap_full_flag = false;
+}
+
+/**
+ * tap_is_full() - Can we write more to the tap device this cycle?
+ *
+ * Return: true if (last we knew) there's more space in the tap buffers, false
+ *         otherwise
+ */
+/* cppcheck-suppress unusedFunction */
+bool tap_is_full(void)
+{
+	return tap_full_flag;
 }
 
 /**
@@ -1263,7 +1286,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
 		trace("tap: failed to set SO_SNDBUF to %i", v);
 
 	ref.fd = c->fd_tap;
-	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
+	ev.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 }
@@ -1319,7 +1342,7 @@ static void tap_sock_tun_init(struct ctx *c)
 	pasta_ns_conf(c);
 
 	ref.fd = c->fd_tap;
-	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
+	ev.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 }
@@ -1352,7 +1375,7 @@ void tap_sock_init(struct ctx *c)
 		else
 			ref.type = EPOLL_TYPE_TAP_PASTA;
 
-		ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
+		ev.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET;
 		ev.data.u64 = ref.u64;
 		epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 		return;
diff --git a/tap.h b/tap.h
index ec9e2ace..6094fbdd 100644
--- a/tap.h
+++ b/tap.h
@@ -67,6 +67,7 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 		       const struct timespec *now);
 void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now);
+bool tap_is_full(void);
 int tap_sock_unix_open(char *sock_path);
 void tap_sock_init(struct ctx *c);
 void tap_flush_pools(void);
-- 
@@ -67,6 +67,7 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 		       const struct timespec *now);
 void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now);
+bool tap_is_full(void);
 int tap_sock_unix_open(char *sock_path);
 void tap_sock_init(struct ctx *c);
 void tap_flush_pools(void);
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 08/10] tcp: Keep track of connections blocked due to a full tap interface
  2024-09-13  4:32 [PATCH v2 00/10] RFC: Clean up TCP epoll mask handling David Gibson
                   ` (6 preceding siblings ...)
  2024-09-13  4:32 ` [PATCH v2 07/10] tap: Keep track of whether there might be space in the tap buffers David Gibson
@ 2024-09-13  4:32 ` David Gibson
  2024-09-13  4:32 ` [PATCH v2 09/10] tcp: Move deferred handling functions later in tcp.c David Gibson
  2024-09-13  4:32 ` [PATCH v2 10/10] tcp: Simplify epoll event mask management David Gibson
  9 siblings, 0 replies; 15+ messages in thread
From: David Gibson @ 2024-09-13  4:32 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

When we receive new data on a TCP socket, we attempt to forward all of it
to the tap interface immediately.  However, if the tap buffers fill up,
we might fail to transfer some of it.  In that case we'll need to try
again later.  Currently that's handled by the EPOLLIN event being
re-asserted in level mode at some point by complicated fiddling of the
event masks.  In preparation for a simpler way of triggering the retry,
keep track of which connections are in this state with a new TAP_FULL flag.
We also keep a count of the total number of connections currently in the
TAP_FULL state.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c      | 12 +++++++++++-
 tcp_buf.c  |  3 +++
 tcp_conn.h |  1 +
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index 32e45e09..4b478432 100644
--- a/tcp.c
+++ b/tcp.c
@@ -349,7 +349,7 @@ static const char *tcp_state_str[] __attribute((__unused__)) = {
 
 static const char *tcp_flag_str[] __attribute((__unused__)) = {
 	"STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE",
-	"ACK_FROM_TAP_DUE",
+	"ACK_FROM_TAP_DUE", "TAP_FULL",
 };
 
 /* Listening sockets, used for automatic port forwarding in pasta mode only */
@@ -381,6 +381,11 @@ static struct iovec	tcp_iov			[UIO_MAXIOV];
 int init_sock_pool4		[TCP_SOCK_POOL_SIZE];
 int init_sock_pool6		[TCP_SOCK_POOL_SIZE];
 
+/* Number of connections with pending socket data that couldn't be sent because
+ * we ran out of buffer space on the tap side
+ */
+unsigned num_tap_full;
+
 /**
  * conn_at_sidx() - Get TCP connection specific flow at given sidx
  * @sidx:	Flow and side to retrieve
@@ -597,6 +602,11 @@ void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
 	    (flag == ~ACK_FROM_TAP_DUE && (conn->flags & ACK_TO_TAP_DUE)) ||
 	    (flag == ~ACK_TO_TAP_DUE   && (conn->flags & ACK_FROM_TAP_DUE)))
 		tcp_timer_ctl(c, conn);
+
+	if (flag == TAP_FULL)
+		num_tap_full++;
+	else if (flag == ~TAP_FULL)
+		num_tap_full--;
 }
 
 /**
diff --git a/tcp_buf.c b/tcp_buf.c
index 83f91a37..0ccb9e6b 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -250,6 +250,8 @@ static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
 		uint32_t seq = ntohl(th->seq);
 		uint32_t peek_offset;
 
+		conn_flag(c, conn, TAP_FULL);
+
 		if (SEQ_LE(conn->seq_to_tap, seq))
 			continue;
 
@@ -526,6 +528,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 		seq += dlen;
 	}
 
+	conn_flag(c, conn, ~TAP_FULL);
 	conn_flag(c, conn, ACK_FROM_TAP_DUE);
 
 	return 0;
diff --git a/tcp_conn.h b/tcp_conn.h
index 6ae05115..9677678c 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -77,6 +77,7 @@ struct tcp_tap_conn {
 #define ACTIVE_CLOSE		BIT(2)
 #define ACK_TO_TAP_DUE		BIT(3)
 #define ACK_FROM_TAP_DUE	BIT(4)
+#define TAP_FULL		BIT(5)
 
 #define SNDBUF_BITS		24
 	unsigned int	sndbuf		:SNDBUF_BITS;
-- 
@@ -77,6 +77,7 @@ struct tcp_tap_conn {
 #define ACTIVE_CLOSE		BIT(2)
 #define ACK_TO_TAP_DUE		BIT(3)
 #define ACK_FROM_TAP_DUE	BIT(4)
+#define TAP_FULL		BIT(5)
 
 #define SNDBUF_BITS		24
 	unsigned int	sndbuf		:SNDBUF_BITS;
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 09/10] tcp: Move deferred handling functions later in tcp.c
  2024-09-13  4:32 [PATCH v2 00/10] RFC: Clean up TCP epoll mask handling David Gibson
                   ` (7 preceding siblings ...)
  2024-09-13  4:32 ` [PATCH v2 08/10] tcp: Keep track of connections blocked due to a full tap interface David Gibson
@ 2024-09-13  4:32 ` David Gibson
  2024-09-13  4:32 ` [PATCH v2 10/10] tcp: Simplify epoll event mask management David Gibson
  9 siblings, 0 replies; 15+ messages in thread
From: David Gibson @ 2024-09-13  4:32 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Future changes will want these functions to call things that are
currently below them in the file, so move them later.  Code motion only,
no change to logic for now.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 58 +++++++++++++++++++++++++++++-----------------------------
 1 file changed, 29 insertions(+), 29 deletions(-)

diff --git a/tcp.c b/tcp.c
index 4b478432..78f546db 100644
--- a/tcp.c
+++ b/tcp.c
@@ -853,35 +853,6 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find,
 	return -1;
 }
 
-/**
- * tcp_flow_defer() - Deferred per-flow handling (clean up closed connections)
- * @conn:	Connection to handle
- *
- * Return: true if the connection is ready to free, false otherwise
- */
-bool tcp_flow_defer(const struct tcp_tap_conn *conn)
-{
-	if (conn->events != CLOSED)
-		return false;
-
-	close(conn->sock);
-	if (conn->timer != -1)
-		close(conn->timer);
-
-	return true;
-}
-
-/**
- * tcp_defer_handler() - Handler for TCP deferred tasks
- * @c:		Execution context
- */
-/* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */
-void tcp_defer_handler(struct ctx *c)
-{
-	tcp_flags_flush(c);
-	tcp_payload_flush(c);
-}
-
 /**
  * tcp_fill_header() - Fill the TCP header fields for a given TCP segment.
  *
@@ -2272,6 +2243,35 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
 	}
 }
 
+/**
+ * tcp_flow_defer() - Deferred per-flow handling (clean up closed connections)
+ * @conn:	Connection to handle
+ *
+ * Return: true if the connection is ready to free, false otherwise
+ */
+bool tcp_flow_defer(const struct tcp_tap_conn *conn)
+{
+	if (conn->events != CLOSED)
+		return false;
+
+	close(conn->sock);
+	if (conn->timer != -1)
+		close(conn->timer);
+
+	return true;
+}
+
+/**
+ * tcp_defer_handler() - Handler for TCP deferred tasks
+ * @c:		Execution context
+ */
+/* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */
+void tcp_defer_handler(struct ctx *c)
+{
+	tcp_flags_flush(c);
+	tcp_payload_flush(c);
+}
+
 /**
  * tcp_sock_init_af() - Initialise listening socket for a given af and port
  * @c:		Execution context
-- 
@@ -853,35 +853,6 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find,
 	return -1;
 }
 
-/**
- * tcp_flow_defer() - Deferred per-flow handling (clean up closed connections)
- * @conn:	Connection to handle
- *
- * Return: true if the connection is ready to free, false otherwise
- */
-bool tcp_flow_defer(const struct tcp_tap_conn *conn)
-{
-	if (conn->events != CLOSED)
-		return false;
-
-	close(conn->sock);
-	if (conn->timer != -1)
-		close(conn->timer);
-
-	return true;
-}
-
-/**
- * tcp_defer_handler() - Handler for TCP deferred tasks
- * @c:		Execution context
- */
-/* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */
-void tcp_defer_handler(struct ctx *c)
-{
-	tcp_flags_flush(c);
-	tcp_payload_flush(c);
-}
-
 /**
  * tcp_fill_header() - Fill the TCP header fields for a given TCP segment.
  *
@@ -2272,6 +2243,35 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
 	}
 }
 
+/**
+ * tcp_flow_defer() - Deferred per-flow handling (clean up closed connections)
+ * @conn:	Connection to handle
+ *
+ * Return: true if the connection is ready to free, false otherwise
+ */
+bool tcp_flow_defer(const struct tcp_tap_conn *conn)
+{
+	if (conn->events != CLOSED)
+		return false;
+
+	close(conn->sock);
+	if (conn->timer != -1)
+		close(conn->timer);
+
+	return true;
+}
+
+/**
+ * tcp_defer_handler() - Handler for TCP deferred tasks
+ * @c:		Execution context
+ */
+/* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */
+void tcp_defer_handler(struct ctx *c)
+{
+	tcp_flags_flush(c);
+	tcp_payload_flush(c);
+}
+
 /**
  * tcp_sock_init_af() - Initialise listening socket for a given af and port
  * @c:		Execution context
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 10/10] tcp: Simplify epoll event mask management
  2024-09-13  4:32 [PATCH v2 00/10] RFC: Clean up TCP epoll mask handling David Gibson
                   ` (8 preceding siblings ...)
  2024-09-13  4:32 ` [PATCH v2 09/10] tcp: Move deferred handling functions later in tcp.c David Gibson
@ 2024-09-13  4:32 ` David Gibson
  9 siblings, 0 replies; 15+ messages in thread
From: David Gibson @ 2024-09-13  4:32 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

tcp can use a number of different epoll event masks, and even switches
between edge-triggered and level-triggered mode depending on a number of
factors.  This makes it very difficult to reason about exactly what events
will be active when.

Simplify this by instead always using event mask with all the events we
ever care about (EPOLLIN, EPOLLOUT and EPOLLRDHUP).  We have a number of
situations where we can't immediately clear the event condition, so we
must always use edge-triggered mode to avoid spinning on events.

We do need to make some changes to properly handle edge-triggered mode.
Specifically there are two cases where we might not be able to process
all new data on an EPOLLIN, so we need to make sure we repoll the socket
for queued data later.

1) If we can't forward all data to the guest, because the tap-side window
is too small.  To address this we repoll the socket for data when we
receive an ack or window update from the guest side.

2) If we run out of buffer space in the tap device itself.  We handle this
by flagging connections in this state.  We then recheck them at the end
of each epoll cycle if the tap device is now reporting ready for more data.


tcp_defer_handler() is called after the last round of epoll events is
handled.  Add logic here to revisit any connections which were previously
blocked due to running out of buffer space on the tap device, and attempt
to forward the data again.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tap.c |  1 -
 tcp.c | 58 ++++++++++++++++++++++++++--------------------------------
 2 files changed, 26 insertions(+), 33 deletions(-)

diff --git a/tap.c b/tap.c
index 3bdca9a1..956db8e4 100644
--- a/tap.c
+++ b/tap.c
@@ -1149,7 +1149,6 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
  * Return: true if (last we knew) there's more space in the tap buffers, false
  *         otherwise
  */
-/* cppcheck-suppress unusedFunction */
 bool tap_is_full(void)
 {
 	return tap_full_flag;
diff --git a/tcp.c b/tcp.c
index 78f546db..7e11f12b 100644
--- a/tcp.c
+++ b/tcp.c
@@ -423,34 +423,6 @@ int tcp_set_peek_offset(int s, int offset)
 	return 0;
 }
 
-/**
- * tcp_conn_epoll_events() - epoll events mask for given connection state
- * @events:	Current connection events
- * @conn_flags	Connection flags
- *
- * Return: epoll events mask corresponding to implied connection state
- */
-static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
-{
-	if (!events)
-		return 0;
-
-	if (events & ESTABLISHED) {
-		if (events & TAP_FIN_SENT)
-			return EPOLLET;
-
-		if (conn_flags & STALLED)
-			return EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET;
-
-		return EPOLLIN | EPOLLRDHUP;
-	}
-
-	if (events == TAP_SYN_RCVD)
-		return EPOLLOUT | EPOLLET | EPOLLRDHUP;
-
-	return EPOLLET | EPOLLRDHUP;
-}
-
 /**
  * tcp_epoll_ctl() - Add/modify/delete epoll state from connection events
  * @c:		Execution context
@@ -463,7 +435,10 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref = { .type = EPOLL_TYPE_TCP, .fd = conn->sock,
 		                .flowside = FLOW_SIDX(conn, !TAPSIDE(conn)), };
-	struct epoll_event ev = { .data.u64 = ref.u64 };
+	struct epoll_event ev = {
+		.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET,
+		.data.u64 = ref.u64,
+	};
 
 	if (conn->events == CLOSED) {
 		if (conn->in_epoll)
@@ -473,8 +448,6 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 		return 0;
 	}
 
-	ev.events = tcp_conn_epoll_events(conn->events, conn->flags);
-
 	if (epoll_ctl(c->epollfd, m, conn->sock, &ev))
 		return -errno;
 
@@ -1746,9 +1719,13 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 			tcp_rst(c, conn);
 			return -1;
 		}
-		tcp_data_from_sock(c, conn);
 	}
 
+	/* The socket side might have queued data that we didn't send due to a
+	 * full window.  Now the window's updated, try again
+	 */
+	tcp_data_from_sock(c, conn);
+
 	if (!iov_i)
 		goto out;
 
@@ -2268,6 +2245,23 @@ bool tcp_flow_defer(const struct tcp_tap_conn *conn)
 /* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */
 void tcp_defer_handler(struct ctx *c)
 {
+	unsigned i;
+
+	for (i = 0; i < FLOW_MAX; i++) {
+		union flow *flow = FLOW(i);
+
+		if (!num_tap_full)
+			break; /* No need to continue */
+		if (tap_is_full())
+			break; /* No point in continuing */
+
+		if ((flow->f.type != FLOW_TCP) ||
+		    !(flow->tcp.flags & TAP_FULL))
+			continue;
+
+		tcp_data_from_sock(c, &flow->tcp);
+	}
+
 	tcp_flags_flush(c);
 	tcp_payload_flush(c);
 }
-- 
@@ -423,34 +423,6 @@ int tcp_set_peek_offset(int s, int offset)
 	return 0;
 }
 
-/**
- * tcp_conn_epoll_events() - epoll events mask for given connection state
- * @events:	Current connection events
- * @conn_flags	Connection flags
- *
- * Return: epoll events mask corresponding to implied connection state
- */
-static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
-{
-	if (!events)
-		return 0;
-
-	if (events & ESTABLISHED) {
-		if (events & TAP_FIN_SENT)
-			return EPOLLET;
-
-		if (conn_flags & STALLED)
-			return EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET;
-
-		return EPOLLIN | EPOLLRDHUP;
-	}
-
-	if (events == TAP_SYN_RCVD)
-		return EPOLLOUT | EPOLLET | EPOLLRDHUP;
-
-	return EPOLLET | EPOLLRDHUP;
-}
-
 /**
  * tcp_epoll_ctl() - Add/modify/delete epoll state from connection events
  * @c:		Execution context
@@ -463,7 +435,10 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref = { .type = EPOLL_TYPE_TCP, .fd = conn->sock,
 		                .flowside = FLOW_SIDX(conn, !TAPSIDE(conn)), };
-	struct epoll_event ev = { .data.u64 = ref.u64 };
+	struct epoll_event ev = {
+		.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET,
+		.data.u64 = ref.u64,
+	};
 
 	if (conn->events == CLOSED) {
 		if (conn->in_epoll)
@@ -473,8 +448,6 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 		return 0;
 	}
 
-	ev.events = tcp_conn_epoll_events(conn->events, conn->flags);
-
 	if (epoll_ctl(c->epollfd, m, conn->sock, &ev))
 		return -errno;
 
@@ -1746,9 +1719,13 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 			tcp_rst(c, conn);
 			return -1;
 		}
-		tcp_data_from_sock(c, conn);
 	}
 
+	/* The socket side might have queued data that we didn't send due to a
+	 * full window.  Now the window's updated, try again
+	 */
+	tcp_data_from_sock(c, conn);
+
 	if (!iov_i)
 		goto out;
 
@@ -2268,6 +2245,23 @@ bool tcp_flow_defer(const struct tcp_tap_conn *conn)
 /* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */
 void tcp_defer_handler(struct ctx *c)
 {
+	unsigned i;
+
+	for (i = 0; i < FLOW_MAX; i++) {
+		union flow *flow = FLOW(i);
+
+		if (!num_tap_full)
+			break; /* No need to continue */
+		if (tap_is_full())
+			break; /* No point in continuing */
+
+		if ((flow->f.type != FLOW_TCP) ||
+		    !(flow->tcp.flags & TAP_FULL))
+			continue;
+
+		tcp_data_from_sock(c, &flow->tcp);
+	}
+
 	tcp_flags_flush(c);
 	tcp_payload_flush(c);
 }
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 02/10] tcp: Clean up tcpi_snd_wnd probing
  2024-09-13  4:32 ` [PATCH v2 02/10] tcp: Clean up tcpi_snd_wnd probing David Gibson
@ 2024-09-17 21:54   ` Stefano Brivio
  2024-09-18  1:27     ` David Gibson
  0 siblings, 1 reply; 15+ messages in thread
From: Stefano Brivio @ 2024-09-17 21:54 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 13 Sep 2024 14:32:06 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> When available, we want to retrieve our socket peer's advertised window and
> forward that to the guest.  That information has been available from the
> kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035.
> 
> Currently our probing for this is a bit odd.  The HAS_SND_WND define
> determines if our headers include the tcp_snd_wnd field, but that doesn't
> necessarily mean the running kernel supports it.  Currently we start by
> assuming it's _not_ available, but mark it as available if we ever see
> a non-zero value in the field.  This is a bit hit and miss in two ways:
>  * Zero is perfectly possible window the peer could report, so we can
>    get false negatives

Kind of: one non-zero result was enough to set tcp.kernel_snd_wnd to
one.

The reason why I implemented it that way was to account for possible
kernel backports of that option. On the other hand, any kernel backport
would need to preserve the position of tcpi_snd_wnd, regardless of
whether preceding fields are missing, so checking the size as you're
doing also looks robust, and avoids these two issues altogether.

>  * We're reading TCP_INFO into a local variable, which might not be zero
>    initialised, so if the kernel _doesn't_ write it it could have non-zero
>    garbage, giving us false positives.
> 
> We can use a more direct way of probing for this: getsockopt() reports the
> length of the information retreived.  So, check whether that's long enough
> to include the field.  This lets us probe the availability of the field
> once and for all during initialisation.  That in turn allows ctx to become
> a const pointer to tcp_prepare_flags() which cascades through many other
> functions.
> 
> We also move the flag for the probe result from the ctx structure to a
> global, to match peek_offset_cap.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  tcp.c          | 93 ++++++++++++++++++++++++++++++++++++--------------
>  tcp.h          | 13 +++----
>  tcp_buf.c      | 10 +++---
>  tcp_buf.h      |  6 ++--
>  tcp_internal.h |  4 +--
>  5 files changed, 82 insertions(+), 44 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 14b48a84..cba3f3bd 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -308,11 +308,6 @@
>  /* MSS rounding: see SET_MSS() */
>  #define MSS_DEFAULT			536
>  #define WINDOW_DEFAULT			14600		/* RFC 6928 */
> -#ifdef HAS_SND_WND
> -# define KERNEL_REPORTS_SND_WND(c)	((c)->tcp.kernel_snd_wnd)
> -#else
> -# define KERNEL_REPORTS_SND_WND(c)	(0 && (c))
> -#endif
>  
>  #define ACK_INTERVAL			10		/* ms */
>  #define SYN_TIMEOUT			10		/* s */
> @@ -370,6 +365,14 @@ char		tcp_buf_discard		[MAX_WINDOW];
>  
>  /* Does the kernel support TCP_PEEK_OFF? */
>  bool peek_offset_cap;
> +#ifdef HAS_SND_WND
> +/* Does the kernel report sending window in TCP_INFO (kernel commit
> + * 8f7baad7f035)
> + */
> +bool snd_wnd_cap;
> +#else
> +#define snd_wnd_cap	(false)
> +#endif
>  
>  /* sendmsg() to socket */
>  static struct iovec	tcp_iov			[UIO_MAXIOV];
> @@ -1052,7 +1055,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  	}
>  #endif /* !HAS_BYTES_ACKED */
>  
> -	if (!KERNEL_REPORTS_SND_WND(c)) {
> +	if (!snd_wnd_cap) {
>  		tcp_get_sndbuf(conn);
>  		new_wnd_to_tap = MIN(SNDBUF_GET(conn), MAX_WINDOW);
>  		conn->wnd_to_tap = MIN(new_wnd_to_tap >> conn->ws_to_tap,
> @@ -1136,7 +1139,7 @@ static void tcp_update_seqack_from_tap(const struct ctx *c,
>   *	     0 if there is no flag to send
>   *	     1 otherwise
>   */
> -int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
> +int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
>  		      int flags, struct tcphdr *th, char *data,
>  		      size_t *optlen)
>  {
> @@ -1153,11 +1156,6 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
>  		return -ECONNRESET;
>  	}
>  
> -#ifdef HAS_SND_WND
> -	if (!c->tcp.kernel_snd_wnd && tinfo.tcpi_snd_wnd)
> -		c->tcp.kernel_snd_wnd = 1;
> -#endif
> -
>  	if (!(conn->flags & LOCAL))
>  		tcp_rtt_dst_check(conn, &tinfo);
>  
> @@ -1235,7 +1233,8 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
>   *
>   * Return: negative error code on connection reset, 0 otherwise
>   */
> -static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> +static int tcp_send_flag(const struct ctx *c, struct tcp_tap_conn *conn,
> +			 int flags)
>  {
>  	return tcp_buf_send_flag(c, conn, flags);
>  }
> @@ -1245,7 +1244,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>   * @c:		Execution context
>   * @conn:	Connection pointer
>   */
> -void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn)
> +void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
>  	if (conn->events == CLOSED)
>  		return;
> @@ -1463,7 +1462,7 @@ static void tcp_bind_outbound(const struct ctx *c,
>   * @optlen:	Bytes in options: caller MUST ensure available length
>   * @now:	Current timestamp
>   */
> -static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
> +static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
>  			      const void *saddr, const void *daddr,
>  			      const struct tcphdr *th, const char *opts,
>  			      size_t optlen, const struct timespec *now)
> @@ -1628,7 +1627,7 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
>   *
>   * #syscalls recvmsg
>   */
> -static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> +static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
>  	return tcp_buf_data_from_sock(c, conn);
>  }
> @@ -1644,8 +1643,8 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>   *
>   * Return: count of consumed packets
>   */
> -static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
> -			      const struct pool *p, int idx)
> +static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> +			     const struct pool *p, int idx)
>  {
>  	int i, iov_i, ack = 0, fin = 0, retr = 0, keep = -1, partial_send = 0;
>  	uint16_t max_ack_seq_wnd = conn->wnd_from_tap;
> @@ -1842,7 +1841,8 @@ out:
>   * @opts:	Pointer to start of options
>   * @optlen:	Bytes in options: caller MUST ensure available length
>   */
> -static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
> +static void tcp_conn_from_sock_finish(const struct ctx *c,
> +				      struct tcp_tap_conn *conn,
>  				      const struct tcphdr *th,
>  				      const char *opts, size_t optlen)
>  {
> @@ -1885,7 +1885,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
>   *
>   * Return: count of consumed packets
>   */
> -int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
> +int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
>  		    const void *saddr, const void *daddr,
>  		    const struct pool *p, int idx, const struct timespec *now)
>  {
> @@ -2023,7 +2023,7 @@ reset:
>   * @c:		Execution context
>   * @conn:	Connection pointer
>   */
> -static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
> +static void tcp_connect_finish(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
>  	socklen_t sl;
>  	int so;
> @@ -2049,8 +2049,8 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
>   * @sa:		Peer socket address (from accept())
>   * @now:	Current timestamp
>   */
> -static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s,
> -				   const struct timespec *now)
> +static void tcp_tap_conn_from_sock(const struct ctx *c, union flow *flow,
> +				   int s, const struct timespec *now)
>  {
>  	struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
>  	uint64_t hash;
> @@ -2081,7 +2081,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s,
>   * @ref:	epoll reference of listening socket
>   * @now:	Current timestamp
>   */
> -void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
> +void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
>  			const struct timespec *now)
>  {
>  	const struct flowside *ini;
> @@ -2146,7 +2146,7 @@ cancel:
>   *
>   * #syscalls timerfd_gettime arm:timerfd_gettime64 i686:timerfd_gettime64
>   */
> -void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
> +void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
>  {
>  	struct itimerspec check_armed = { { 0 }, { 0 } };
>  	struct tcp_tap_conn *conn = &FLOW(ref.flow)->tcp;
> @@ -2210,7 +2210,8 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
>   * @ref:	epoll reference
>   * @events:	epoll events bitmap
>   */
> -void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
> +void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
> +		      uint32_t events)
>  {
>  	struct tcp_tap_conn *conn = conn_at_sidx(ref.flowside);
>  
> @@ -2494,6 +2495,40 @@ static bool tcp_probe_peek_offset_cap(sa_family_t af)
>  	return ret;
>  }
>  
> +#ifdef HAS_SND_WND
> +/**
> + * tcp_probe_snd_wnd_cap() - Check if TCP_INFO reports tcpi_snd_wnd
> + *
> + * Return: true if supported, false otherwise
> + */
> +static bool tcp_probe_snd_wnd_cap(void)
> +{
> +	struct tcp_info tinfo;
> +	socklen_t sl = sizeof(tinfo);
> +	int s;
> +
> +	s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);
> +	if (s < 0) {
> +		warn_perror("Temporary TCP socket creation failed");
> +		return false;
> +	}
> +
> +	if (getsockopt(s, SOL_TCP, TCP_INFO, &tinfo, &sl)) {
> +		warn_perror("Failed to get TCP_INFO on temporary socket");
> +		close(s);
> +		return false;
> +	}
> +
> +	close(s);
> +
> +	if (sl < (offsetof(struct tcp_info, tcpi_snd_wnd) +
> +		  sizeof(tinfo.tcpi_snd_wnd)))
> +		return false;
> +
> +	return true;
> +}
> +#endif /* HAS_SND_WND */
> +
>  /**
>   * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
>   * @c:		Execution context
> @@ -2527,6 +2562,12 @@ int tcp_init(struct ctx *c)
>  			  (!c->ifi6 || tcp_probe_peek_offset_cap(AF_INET6));
>  	debug("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not ");
>  
> +#ifdef HAS_SND_WND
> +	snd_wnd_cap = tcp_probe_snd_wnd_cap();
> +#endif
> +	debug("TCP_INFO tcpi_snd_wnd field%ssupported",
> +	      snd_wnd_cap ? " " : " not ");
> +
>  	return 0;
>  }
>  
> diff --git a/tcp.h b/tcp.h
> index e9ff0191..5585924f 100644
> --- a/tcp.h
> +++ b/tcp.h
> @@ -10,11 +10,12 @@
>  
>  struct ctx;
>  
> -void tcp_timer_handler(struct ctx *c, union epoll_ref ref);
> -void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
> +void tcp_timer_handler(const struct ctx *c, union epoll_ref ref);
> +void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
>  			const struct timespec *now);
> -void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events);
> -int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
> +void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
> +		      uint32_t events);
> +int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
>  		    const void *saddr, const void *daddr,
>  		    const struct pool *p, int idx, const struct timespec *now);
>  int tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
> @@ -58,16 +59,12 @@ union tcp_listen_epoll_ref {
>   * @fwd_in:		Port forwarding configuration for inbound packets
>   * @fwd_out:		Port forwarding configuration for outbound packets
>   * @timer_run:		Timestamp of most recent timer run
> - * @kernel_snd_wnd:	Kernel reports sending window (with commit 8f7baad7f035)
>   * @pipe_size:		Size of pipes for spliced connections
>   */
>  struct tcp_ctx {
>  	struct fwd_ports fwd_in;
>  	struct fwd_ports fwd_out;
>  	struct timespec timer_run;
> -#ifdef HAS_SND_WND
> -	int kernel_snd_wnd;
> -#endif
>  	size_t pipe_size;
>  };
>  
> diff --git a/tcp_buf.c b/tcp_buf.c
> index 1a398461..c886c92b 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -239,7 +239,7 @@ void tcp_flags_flush(const struct ctx *c)
>   * @frames:	Two-dimensional array containing queued frames with sub-iovs
>   * @num_frames:	Number of entries in the two arrays to be compared
>   */
> -static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns,
> +static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
>  			   struct iovec (*frames)[TCP_NUM_IOVS], int num_frames)
>  {
>  	int i;
> @@ -264,7 +264,7 @@ static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns,
>   * tcp_payload_flush() - Send out buffers for segments with data
>   * @c:		Execution context
>   */
> -void tcp_payload_flush(struct ctx *c)
> +void tcp_payload_flush(const struct ctx *c)
>  {
>  	size_t m;
>  
> @@ -293,7 +293,7 @@ void tcp_payload_flush(struct ctx *c)
>   *
>   * Return: negative error code on connection reset, 0 otherwise
>   */
> -int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> +int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  {
>  	struct tcp_flags_t *payload;
>  	struct iovec *iov;
> @@ -361,7 +361,7 @@ int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>   * @no_csum:	Don't compute IPv4 checksum, use the one from previous buffer
>   * @seq:	Sequence number to be sent
>   */
> -static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
> +static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  			    ssize_t dlen, int no_csum, uint32_t seq)
>  {
>  	struct iovec *iov;
> @@ -405,7 +405,7 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
>   *
>   * #syscalls recvmsg
>   */
> -int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> +int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
>  	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
>  	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
> diff --git a/tcp_buf.h b/tcp_buf.h
> index 3db4c56e..8d4b615a 100644
> --- a/tcp_buf.h
> +++ b/tcp_buf.h
> @@ -9,8 +9,8 @@
>  void tcp_sock4_iov_init(const struct ctx *c);
>  void tcp_sock6_iov_init(const struct ctx *c);
>  void tcp_flags_flush(const struct ctx *c);
> -void tcp_payload_flush(struct ctx *c);
> -int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn);
> -int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags);
> +void tcp_payload_flush(const struct ctx *c);
> +int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn);
> +int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags);
>  
>  #endif  /*TCP_BUF_H */
> diff --git a/tcp_internal.h b/tcp_internal.h
> index aa8bb64f..bd634be1 100644
> --- a/tcp_internal.h
> +++ b/tcp_internal.h
> @@ -82,7 +82,7 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
>  		conn_event_do(c, conn, event);				\
>  	} while (0)
>  
> -void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
> +void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
>  #define tcp_rst(c, conn)						\
>  	do {								\
>  		flow_dbg((conn), "TCP reset at %s:%i", __func__, __LINE__); \
> @@ -94,7 +94,7 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
>  			       const uint16_t *check, uint32_t seq);
>  int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  			  int force_seq, struct tcp_info *tinfo);
> -int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, int flags,
> +int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, int flags,
>  		      struct tcphdr *th, char *data, size_t *optlen);
>  
>  #endif /* TCP_INTERNAL_H */

-- 
Stefano


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 03/10] tcp: Simplify ifdef logic in tcp_update_seqack_wnd()
  2024-09-13  4:32 ` [PATCH v2 03/10] tcp: Simplify ifdef logic in tcp_update_seqack_wnd() David Gibson
@ 2024-09-17 21:54   ` Stefano Brivio
  2024-09-18  1:31     ` David Gibson
  0 siblings, 1 reply; 15+ messages in thread
From: Stefano Brivio @ 2024-09-17 21:54 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 13 Sep 2024 14:32:07 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> This function has a block conditional on !snd_wnd_cap shortly before an
> #ifdef HAS_SND_WND.  But snd_wnd_cap implies HAS_SND_WND (if !HAS_SND_WND,
> snd_wnd_cap is statically false).
> 
> Therefore, simplify this down to a single conditional with an else branch.
> While we're there, fix some improperly indented closing braces.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  tcp.c | 36 +++++++++++++++++-------------------
>  1 file changed, 17 insertions(+), 19 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index cba3f3bd..6733e7e3 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1061,27 +1061,25 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  		conn->wnd_to_tap = MIN(new_wnd_to_tap >> conn->ws_to_tap,
>  				       USHRT_MAX);
>  		goto out;
> -	}
> -
> -	if (!tinfo) {
> -		if (prev_wnd_to_tap > WINDOW_DEFAULT) {
> -			goto out;
> -}
> -		tinfo = &tinfo_new;
> -		if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) {
> -			goto out;
> -}
> -	}
> -
> -#ifdef HAS_SND_WND
> -	if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) {
> -		new_wnd_to_tap = tinfo->tcpi_snd_wnd;
>  	} else {

I thought cppcheck would report else-after-goto, but it doesn't, just
else-after-return. In any case, we could simplify further by avoid that
else clause (and one level of indentation) in the whole block below.

It would also look more natural to me: we deal with if (!snd_wnd_cap)
as a special case and go to 'out' in that special case, then we resume
with the regular path.

I guess this is better than the original anyway and it's not a strong
preference, so I can also apply this up to 4/10 as it is (or fix up on
merge). The rest of the patches up to 4/10 look good to me.

> -		tcp_get_sndbuf(conn);
> -		new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd,
> -				     SNDBUF_GET(conn));
> +		if (!tinfo) {
> +			if (prev_wnd_to_tap > WINDOW_DEFAULT) {
> +				goto out;
> +			}
> +			tinfo = &tinfo_new;
> +			if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) {
> +				goto out;
> +			}
> +		}
> +
> +		if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) {
> +			new_wnd_to_tap = tinfo->tcpi_snd_wnd;
> +		} else {
> +			tcp_get_sndbuf(conn);
> +			new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd,
> +					     SNDBUF_GET(conn));
> +		}
>  	}
> -#endif
>  
>  	new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW);
>  	if (!(conn->events & ESTABLISHED))

-- 
Stefano


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 02/10] tcp: Clean up tcpi_snd_wnd probing
  2024-09-17 21:54   ` Stefano Brivio
@ 2024-09-18  1:27     ` David Gibson
  0 siblings, 0 replies; 15+ messages in thread
From: David Gibson @ 2024-09-18  1:27 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 17587 bytes --]

On Tue, Sep 17, 2024 at 11:54:28PM +0200, Stefano Brivio wrote:
> On Fri, 13 Sep 2024 14:32:06 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > When available, we want to retrieve our socket peer's advertised window and
> > forward that to the guest.  That information has been available from the
> > kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035.
> > 
> > Currently our probing for this is a bit odd.  The HAS_SND_WND define
> > determines if our headers include the tcp_snd_wnd field, but that doesn't
> > necessarily mean the running kernel supports it.  Currently we start by
> > assuming it's _not_ available, but mark it as available if we ever see
> > a non-zero value in the field.  This is a bit hit and miss in two ways:
> >  * Zero is perfectly possible window the peer could report, so we can
> >    get false negatives
> 
> Kind of: one non-zero result was enough to set tcp.kernel_snd_wnd to
> one.

Right, but that in turn means it could change at some later point
rather than being correct from the beginning, which (slightly)
complicates everything.

> The reason why I implemented it that way was to account for possible
> kernel backports of that option. On the other hand, any kernel backport
> would need to preserve the position of tcpi_snd_wnd, regardless of
> whether preceding fields are missing, so checking the size as you're
> doing also looks robust, and avoids these two issues altogether.

Right.  My understanding is that the ability to check the returned
size is exactly why adding more fields to a getsockopt() is considered
a backwards compatible change.

> 
> >  * We're reading TCP_INFO into a local variable, which might not be zero
> >    initialised, so if the kernel _doesn't_ write it it could have non-zero
> >    garbage, giving us false positives.
> > 
> > We can use a more direct way of probing for this: getsockopt() reports the
> > length of the information retreived.  So, check whether that's long enough
> > to include the field.  This lets us probe the availability of the field
> > once and for all during initialisation.  That in turn allows ctx to become
> > a const pointer to tcp_prepare_flags() which cascades through many other
> > functions.
> > 
> > We also move the flag for the probe result from the ctx structure to a
> > global, to match peek_offset_cap.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  tcp.c          | 93 ++++++++++++++++++++++++++++++++++++--------------
> >  tcp.h          | 13 +++----
> >  tcp_buf.c      | 10 +++---
> >  tcp_buf.h      |  6 ++--
> >  tcp_internal.h |  4 +--
> >  5 files changed, 82 insertions(+), 44 deletions(-)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index 14b48a84..cba3f3bd 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -308,11 +308,6 @@
> >  /* MSS rounding: see SET_MSS() */
> >  #define MSS_DEFAULT			536
> >  #define WINDOW_DEFAULT			14600		/* RFC 6928 */
> > -#ifdef HAS_SND_WND
> > -# define KERNEL_REPORTS_SND_WND(c)	((c)->tcp.kernel_snd_wnd)
> > -#else
> > -# define KERNEL_REPORTS_SND_WND(c)	(0 && (c))
> > -#endif
> >  
> >  #define ACK_INTERVAL			10		/* ms */
> >  #define SYN_TIMEOUT			10		/* s */
> > @@ -370,6 +365,14 @@ char		tcp_buf_discard		[MAX_WINDOW];
> >  
> >  /* Does the kernel support TCP_PEEK_OFF? */
> >  bool peek_offset_cap;
> > +#ifdef HAS_SND_WND
> > +/* Does the kernel report sending window in TCP_INFO (kernel commit
> > + * 8f7baad7f035)
> > + */
> > +bool snd_wnd_cap;
> > +#else
> > +#define snd_wnd_cap	(false)
> > +#endif
> >  
> >  /* sendmsg() to socket */
> >  static struct iovec	tcp_iov			[UIO_MAXIOV];
> > @@ -1052,7 +1055,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >  	}
> >  #endif /* !HAS_BYTES_ACKED */
> >  
> > -	if (!KERNEL_REPORTS_SND_WND(c)) {
> > +	if (!snd_wnd_cap) {
> >  		tcp_get_sndbuf(conn);
> >  		new_wnd_to_tap = MIN(SNDBUF_GET(conn), MAX_WINDOW);
> >  		conn->wnd_to_tap = MIN(new_wnd_to_tap >> conn->ws_to_tap,
> > @@ -1136,7 +1139,7 @@ static void tcp_update_seqack_from_tap(const struct ctx *c,
> >   *	     0 if there is no flag to send
> >   *	     1 otherwise
> >   */
> > -int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
> > +int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
> >  		      int flags, struct tcphdr *th, char *data,
> >  		      size_t *optlen)
> >  {
> > @@ -1153,11 +1156,6 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
> >  		return -ECONNRESET;
> >  	}
> >  
> > -#ifdef HAS_SND_WND
> > -	if (!c->tcp.kernel_snd_wnd && tinfo.tcpi_snd_wnd)
> > -		c->tcp.kernel_snd_wnd = 1;
> > -#endif
> > -
> >  	if (!(conn->flags & LOCAL))
> >  		tcp_rtt_dst_check(conn, &tinfo);
> >  
> > @@ -1235,7 +1233,8 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
> >   *
> >   * Return: negative error code on connection reset, 0 otherwise
> >   */
> > -static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> > +static int tcp_send_flag(const struct ctx *c, struct tcp_tap_conn *conn,
> > +			 int flags)
> >  {
> >  	return tcp_buf_send_flag(c, conn, flags);
> >  }
> > @@ -1245,7 +1244,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> >   * @c:		Execution context
> >   * @conn:	Connection pointer
> >   */
> > -void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn)
> > +void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn)
> >  {
> >  	if (conn->events == CLOSED)
> >  		return;
> > @@ -1463,7 +1462,7 @@ static void tcp_bind_outbound(const struct ctx *c,
> >   * @optlen:	Bytes in options: caller MUST ensure available length
> >   * @now:	Current timestamp
> >   */
> > -static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
> > +static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
> >  			      const void *saddr, const void *daddr,
> >  			      const struct tcphdr *th, const char *opts,
> >  			      size_t optlen, const struct timespec *now)
> > @@ -1628,7 +1627,7 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
> >   *
> >   * #syscalls recvmsg
> >   */
> > -static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> > +static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> >  {
> >  	return tcp_buf_data_from_sock(c, conn);
> >  }
> > @@ -1644,8 +1643,8 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >   *
> >   * Return: count of consumed packets
> >   */
> > -static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
> > -			      const struct pool *p, int idx)
> > +static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > +			     const struct pool *p, int idx)
> >  {
> >  	int i, iov_i, ack = 0, fin = 0, retr = 0, keep = -1, partial_send = 0;
> >  	uint16_t max_ack_seq_wnd = conn->wnd_from_tap;
> > @@ -1842,7 +1841,8 @@ out:
> >   * @opts:	Pointer to start of options
> >   * @optlen:	Bytes in options: caller MUST ensure available length
> >   */
> > -static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
> > +static void tcp_conn_from_sock_finish(const struct ctx *c,
> > +				      struct tcp_tap_conn *conn,
> >  				      const struct tcphdr *th,
> >  				      const char *opts, size_t optlen)
> >  {
> > @@ -1885,7 +1885,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
> >   *
> >   * Return: count of consumed packets
> >   */
> > -int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
> > +int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
> >  		    const void *saddr, const void *daddr,
> >  		    const struct pool *p, int idx, const struct timespec *now)
> >  {
> > @@ -2023,7 +2023,7 @@ reset:
> >   * @c:		Execution context
> >   * @conn:	Connection pointer
> >   */
> > -static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
> > +static void tcp_connect_finish(const struct ctx *c, struct tcp_tap_conn *conn)
> >  {
> >  	socklen_t sl;
> >  	int so;
> > @@ -2049,8 +2049,8 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
> >   * @sa:		Peer socket address (from accept())
> >   * @now:	Current timestamp
> >   */
> > -static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s,
> > -				   const struct timespec *now)
> > +static void tcp_tap_conn_from_sock(const struct ctx *c, union flow *flow,
> > +				   int s, const struct timespec *now)
> >  {
> >  	struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
> >  	uint64_t hash;
> > @@ -2081,7 +2081,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s,
> >   * @ref:	epoll reference of listening socket
> >   * @now:	Current timestamp
> >   */
> > -void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
> > +void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
> >  			const struct timespec *now)
> >  {
> >  	const struct flowside *ini;
> > @@ -2146,7 +2146,7 @@ cancel:
> >   *
> >   * #syscalls timerfd_gettime arm:timerfd_gettime64 i686:timerfd_gettime64
> >   */
> > -void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
> > +void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> >  {
> >  	struct itimerspec check_armed = { { 0 }, { 0 } };
> >  	struct tcp_tap_conn *conn = &FLOW(ref.flow)->tcp;
> > @@ -2210,7 +2210,8 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
> >   * @ref:	epoll reference
> >   * @events:	epoll events bitmap
> >   */
> > -void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
> > +void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
> > +		      uint32_t events)
> >  {
> >  	struct tcp_tap_conn *conn = conn_at_sidx(ref.flowside);
> >  
> > @@ -2494,6 +2495,40 @@ static bool tcp_probe_peek_offset_cap(sa_family_t af)
> >  	return ret;
> >  }
> >  
> > +#ifdef HAS_SND_WND
> > +/**
> > + * tcp_probe_snd_wnd_cap() - Check if TCP_INFO reports tcpi_snd_wnd
> > + *
> > + * Return: true if supported, false otherwise
> > + */
> > +static bool tcp_probe_snd_wnd_cap(void)
> > +{
> > +	struct tcp_info tinfo;
> > +	socklen_t sl = sizeof(tinfo);
> > +	int s;
> > +
> > +	s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);
> > +	if (s < 0) {
> > +		warn_perror("Temporary TCP socket creation failed");
> > +		return false;
> > +	}
> > +
> > +	if (getsockopt(s, SOL_TCP, TCP_INFO, &tinfo, &sl)) {
> > +		warn_perror("Failed to get TCP_INFO on temporary socket");
> > +		close(s);
> > +		return false;
> > +	}
> > +
> > +	close(s);
> > +
> > +	if (sl < (offsetof(struct tcp_info, tcpi_snd_wnd) +
> > +		  sizeof(tinfo.tcpi_snd_wnd)))
> > +		return false;
> > +
> > +	return true;
> > +}
> > +#endif /* HAS_SND_WND */
> > +
> >  /**
> >   * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
> >   * @c:		Execution context
> > @@ -2527,6 +2562,12 @@ int tcp_init(struct ctx *c)
> >  			  (!c->ifi6 || tcp_probe_peek_offset_cap(AF_INET6));
> >  	debug("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not ");
> >  
> > +#ifdef HAS_SND_WND
> > +	snd_wnd_cap = tcp_probe_snd_wnd_cap();
> > +#endif
> > +	debug("TCP_INFO tcpi_snd_wnd field%ssupported",
> > +	      snd_wnd_cap ? " " : " not ");
> > +
> >  	return 0;
> >  }
> >  
> > diff --git a/tcp.h b/tcp.h
> > index e9ff0191..5585924f 100644
> > --- a/tcp.h
> > +++ b/tcp.h
> > @@ -10,11 +10,12 @@
> >  
> >  struct ctx;
> >  
> > -void tcp_timer_handler(struct ctx *c, union epoll_ref ref);
> > -void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
> > +void tcp_timer_handler(const struct ctx *c, union epoll_ref ref);
> > +void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
> >  			const struct timespec *now);
> > -void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events);
> > -int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
> > +void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
> > +		      uint32_t events);
> > +int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
> >  		    const void *saddr, const void *daddr,
> >  		    const struct pool *p, int idx, const struct timespec *now);
> >  int tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
> > @@ -58,16 +59,12 @@ union tcp_listen_epoll_ref {
> >   * @fwd_in:		Port forwarding configuration for inbound packets
> >   * @fwd_out:		Port forwarding configuration for outbound packets
> >   * @timer_run:		Timestamp of most recent timer run
> > - * @kernel_snd_wnd:	Kernel reports sending window (with commit 8f7baad7f035)
> >   * @pipe_size:		Size of pipes for spliced connections
> >   */
> >  struct tcp_ctx {
> >  	struct fwd_ports fwd_in;
> >  	struct fwd_ports fwd_out;
> >  	struct timespec timer_run;
> > -#ifdef HAS_SND_WND
> > -	int kernel_snd_wnd;
> > -#endif
> >  	size_t pipe_size;
> >  };
> >  
> > diff --git a/tcp_buf.c b/tcp_buf.c
> > index 1a398461..c886c92b 100644
> > --- a/tcp_buf.c
> > +++ b/tcp_buf.c
> > @@ -239,7 +239,7 @@ void tcp_flags_flush(const struct ctx *c)
> >   * @frames:	Two-dimensional array containing queued frames with sub-iovs
> >   * @num_frames:	Number of entries in the two arrays to be compared
> >   */
> > -static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns,
> > +static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
> >  			   struct iovec (*frames)[TCP_NUM_IOVS], int num_frames)
> >  {
> >  	int i;
> > @@ -264,7 +264,7 @@ static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns,
> >   * tcp_payload_flush() - Send out buffers for segments with data
> >   * @c:		Execution context
> >   */
> > -void tcp_payload_flush(struct ctx *c)
> > +void tcp_payload_flush(const struct ctx *c)
> >  {
> >  	size_t m;
> >  
> > @@ -293,7 +293,7 @@ void tcp_payload_flush(struct ctx *c)
> >   *
> >   * Return: negative error code on connection reset, 0 otherwise
> >   */
> > -int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> > +int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> >  {
> >  	struct tcp_flags_t *payload;
> >  	struct iovec *iov;
> > @@ -361,7 +361,7 @@ int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> >   * @no_csum:	Don't compute IPv4 checksum, use the one from previous buffer
> >   * @seq:	Sequence number to be sent
> >   */
> > -static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
> > +static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> >  			    ssize_t dlen, int no_csum, uint32_t seq)
> >  {
> >  	struct iovec *iov;
> > @@ -405,7 +405,7 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
> >   *
> >   * #syscalls recvmsg
> >   */
> > -int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> > +int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> >  {
> >  	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> >  	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
> > diff --git a/tcp_buf.h b/tcp_buf.h
> > index 3db4c56e..8d4b615a 100644
> > --- a/tcp_buf.h
> > +++ b/tcp_buf.h
> > @@ -9,8 +9,8 @@
> >  void tcp_sock4_iov_init(const struct ctx *c);
> >  void tcp_sock6_iov_init(const struct ctx *c);
> >  void tcp_flags_flush(const struct ctx *c);
> > -void tcp_payload_flush(struct ctx *c);
> > -int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn);
> > -int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags);
> > +void tcp_payload_flush(const struct ctx *c);
> > +int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn);
> > +int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags);
> >  
> >  #endif  /*TCP_BUF_H */
> > diff --git a/tcp_internal.h b/tcp_internal.h
> > index aa8bb64f..bd634be1 100644
> > --- a/tcp_internal.h
> > +++ b/tcp_internal.h
> > @@ -82,7 +82,7 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
> >  		conn_event_do(c, conn, event);				\
> >  	} while (0)
> >  
> > -void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
> > +void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
> >  #define tcp_rst(c, conn)						\
> >  	do {								\
> >  		flow_dbg((conn), "TCP reset at %s:%i", __func__, __LINE__); \
> > @@ -94,7 +94,7 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
> >  			       const uint16_t *check, uint32_t seq);
> >  int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >  			  int force_seq, struct tcp_info *tinfo);
> > -int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, int flags,
> > +int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, int flags,
> >  		      struct tcphdr *th, char *data, size_t *optlen);
> >  
> >  #endif /* TCP_INTERNAL_H */
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 03/10] tcp: Simplify ifdef logic in tcp_update_seqack_wnd()
  2024-09-17 21:54   ` Stefano Brivio
@ 2024-09-18  1:31     ` David Gibson
  0 siblings, 0 replies; 15+ messages in thread
From: David Gibson @ 2024-09-18  1:31 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 3066 bytes --]

On Tue, Sep 17, 2024 at 11:54:34PM +0200, Stefano Brivio wrote:
> On Fri, 13 Sep 2024 14:32:07 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > This function has a block conditional on !snd_wnd_cap shortly before an
> > #ifdef HAS_SND_WND.  But snd_wnd_cap implies HAS_SND_WND (if !HAS_SND_WND,
> > snd_wnd_cap is statically false).
> > 
> > Therefore, simplify this down to a single conditional with an else branch.
> > While we're there, fix some improperly indented closing braces.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  tcp.c | 36 +++++++++++++++++-------------------
> >  1 file changed, 17 insertions(+), 19 deletions(-)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index cba3f3bd..6733e7e3 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1061,27 +1061,25 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >  		conn->wnd_to_tap = MIN(new_wnd_to_tap >> conn->ws_to_tap,
> >  				       USHRT_MAX);
> >  		goto out;
> > -	}
> > -
> > -	if (!tinfo) {
> > -		if (prev_wnd_to_tap > WINDOW_DEFAULT) {
> > -			goto out;
> > -}
> > -		tinfo = &tinfo_new;
> > -		if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) {
> > -			goto out;
> > -}
> > -	}
> > -
> > -#ifdef HAS_SND_WND
> > -	if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) {
> > -		new_wnd_to_tap = tinfo->tcpi_snd_wnd;
> >  	} else {
> 
> I thought cppcheck would report else-after-goto, but it doesn't, just
> else-after-return. In any case, we could simplify further by avoid that
> else clause (and one level of indentation) in the whole block below.

Good idea, done.

> It would also look more natural to me: we deal with if (!snd_wnd_cap)
> as a special case and go to 'out' in that special case, then we resume
> with the regular path.
> 
> I guess this is better than the original anyway and it's not a strong
> preference, so I can also apply this up to 4/10 as it is (or fix up on
> merge). The rest of the patches up to 4/10 look good to me.

Well, I've made the change now, so I might as well repost :).

> 
> > -		tcp_get_sndbuf(conn);
> > -		new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd,
> > -				     SNDBUF_GET(conn));
> > +		if (!tinfo) {
> > +			if (prev_wnd_to_tap > WINDOW_DEFAULT) {
> > +				goto out;
> > +			}
> > +			tinfo = &tinfo_new;
> > +			if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) {
> > +				goto out;
> > +			}
> > +		}
> > +
> > +		if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) {
> > +			new_wnd_to_tap = tinfo->tcpi_snd_wnd;
> > +		} else {
> > +			tcp_get_sndbuf(conn);
> > +			new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd,
> > +					     SNDBUF_GET(conn));
> > +		}
> >  	}
> > -#endif
> >  
> >  	new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW);
> >  	if (!(conn->events & ESTABLISHED))
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2024-09-18  1:52 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-13  4:32 [PATCH v2 00/10] RFC: Clean up TCP epoll mask handling David Gibson
2024-09-13  4:32 ` [PATCH v2 01/10] tcp: Make some extra functions private David Gibson
2024-09-13  4:32 ` [PATCH v2 02/10] tcp: Clean up tcpi_snd_wnd probing David Gibson
2024-09-17 21:54   ` Stefano Brivio
2024-09-18  1:27     ` David Gibson
2024-09-13  4:32 ` [PATCH v2 03/10] tcp: Simplify ifdef logic in tcp_update_seqack_wnd() David Gibson
2024-09-17 21:54   ` Stefano Brivio
2024-09-18  1:31     ` David Gibson
2024-09-13  4:32 ` [PATCH v2 04/10] tcp: Make tcp_update_seqack_wnd()s force_seq parameter explicitly boolean David Gibson
2024-09-13  4:32 ` [PATCH v2 05/10] tcp: On socket EPOLLOUT, send new ACK to tap immediately David Gibson
2024-09-13  4:32 ` [PATCH v2 06/10] tap: Re-introduce EPOLLET for tap connections David Gibson
2024-09-13  4:32 ` [PATCH v2 07/10] tap: Keep track of whether there might be space in the tap buffers David Gibson
2024-09-13  4:32 ` [PATCH v2 08/10] tcp: Keep track of connections blocked due to a full tap interface David Gibson
2024-09-13  4:32 ` [PATCH v2 09/10] tcp: Move deferred handling functions later in tcp.c David Gibson
2024-09-13  4:32 ` [PATCH v2 10/10] tcp: Simplify epoll event mask management David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).