public inbox for passt-dev@passt.top
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top
Subject: Re: [PATCH v2 02/10] tcp: Clean up tcpi_snd_wnd probing
Date: Tue, 17 Sep 2024 23:54:28 +0200
Message-ID: <20240917235428.5275c343@elisabeth>
In-Reply-To: <20240913043214.1753014-3-david@gibson.dropbear.id.au>

On Fri, 13 Sep 2024 14:32:06 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> When available, we want to retrieve our socket peer's advertised window and
> forward that to the guest.  That information has been available from the
> kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035.
> 
> Currently our probing for this is a bit odd.  The HAS_SND_WND define
> determines if our headers include the tcpi_snd_wnd field, but that doesn't
> necessarily mean the running kernel supports it.  Currently we start by
> assuming it's _not_ available, but mark it as available if we ever see
> a non-zero value in the field.  This is a bit hit and miss in two ways:
>  * Zero is a perfectly possible window the peer could report, so we can
>    get false negatives

Kind of: one non-zero result was enough to set tcp.kernel_snd_wnd to
one.

The reason why I implemented it that way was to account for possible
kernel backports of that option. On the other hand, any kernel backport
would need to preserve the position of tcpi_snd_wnd, regardless of
whether preceding fields are missing, so checking the size as you're
doing also looks robust, and avoids these two issues altogether.
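
To illustrate why the length check still holds with backports, here is a
minimal sketch. It assumes a made-up struct demo_info standing in for
struct tcp_info; the struct, the FIELD_COVERED() macro and
demo_has_new_c() are hypothetical names for illustration only, not part
of the patch or of the kernel headers:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct tcp_info: NOT the real kernel layout */
struct demo_info {
	uint32_t old_a;	/* field reported by every kernel */
	uint32_t old_b;	/* field a backport might leave unfilled */
	uint32_t new_c;	/* newer field we want to probe for */
};

/* True if a buffer of @len bytes fully contains @field of @type */
#define FIELD_COVERED(type, field, len)				\
	((size_t)(len) >= offsetof(type, field) +		\
			  sizeof(((type *)0)->field))

/* @reported_len: length written back by a getsockopt()-style call */
bool demo_has_new_c(size_t reported_len)
{
	return FIELD_COVERED(struct demo_info, new_c, reported_len);
}
```

Even if a backport doesn't fill old_b, new_c has to stay at the same
offset, so the reported length must still reach past it: a length of 8
means new_c is absent, a length of 12 or more means it is present.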

>  * We're reading TCP_INFO into a local variable, which might not be zero
>    initialised, so if the kernel _doesn't_ write it it could have non-zero
>    garbage, giving us false positives.
> 
> We can use a more direct way of probing for this: getsockopt() reports the
> length of the information retrieved.  So, check whether that's long enough
> to include the field.  This lets us probe the availability of the field
> once and for all during initialisation.  That in turn allows ctx to become
> a const pointer to tcp_prepare_flags() which cascades through many other
> functions.
> 
> We also move the flag for the probe result from the ctx structure to a
> global, to match peek_offset_cap.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  tcp.c          | 93 ++++++++++++++++++++++++++++++++++++--------------
>  tcp.h          | 13 +++----
>  tcp_buf.c      | 10 +++---
>  tcp_buf.h      |  6 ++--
>  tcp_internal.h |  4 +--
>  5 files changed, 82 insertions(+), 44 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 14b48a84..cba3f3bd 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -308,11 +308,6 @@
>  /* MSS rounding: see SET_MSS() */
>  #define MSS_DEFAULT			536
>  #define WINDOW_DEFAULT			14600		/* RFC 6928 */
> -#ifdef HAS_SND_WND
> -# define KERNEL_REPORTS_SND_WND(c)	((c)->tcp.kernel_snd_wnd)
> -#else
> -# define KERNEL_REPORTS_SND_WND(c)	(0 && (c))
> -#endif
>  
>  #define ACK_INTERVAL			10		/* ms */
>  #define SYN_TIMEOUT			10		/* s */
> @@ -370,6 +365,14 @@ char		tcp_buf_discard		[MAX_WINDOW];
>  
>  /* Does the kernel support TCP_PEEK_OFF? */
>  bool peek_offset_cap;
> +#ifdef HAS_SND_WND
> +/* Does the kernel report sending window in TCP_INFO (kernel commit
> + * 8f7baad7f035)
> + */
> +bool snd_wnd_cap;
> +#else
> +#define snd_wnd_cap	(false)
> +#endif
>  
>  /* sendmsg() to socket */
>  static struct iovec	tcp_iov			[UIO_MAXIOV];
> @@ -1052,7 +1055,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  	}
>  #endif /* !HAS_BYTES_ACKED */
>  
> -	if (!KERNEL_REPORTS_SND_WND(c)) {
> +	if (!snd_wnd_cap) {
>  		tcp_get_sndbuf(conn);
>  		new_wnd_to_tap = MIN(SNDBUF_GET(conn), MAX_WINDOW);
>  		conn->wnd_to_tap = MIN(new_wnd_to_tap >> conn->ws_to_tap,
> @@ -1136,7 +1139,7 @@ static void tcp_update_seqack_from_tap(const struct ctx *c,
>   *	     0 if there is no flag to send
>   *	     1 otherwise
>   */
> -int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
> +int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
>  		      int flags, struct tcphdr *th, char *data,
>  		      size_t *optlen)
>  {
> @@ -1153,11 +1156,6 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
>  		return -ECONNRESET;
>  	}
>  
> -#ifdef HAS_SND_WND
> -	if (!c->tcp.kernel_snd_wnd && tinfo.tcpi_snd_wnd)
> -		c->tcp.kernel_snd_wnd = 1;
> -#endif
> -
>  	if (!(conn->flags & LOCAL))
>  		tcp_rtt_dst_check(conn, &tinfo);
>  
> @@ -1235,7 +1233,8 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
>   *
>   * Return: negative error code on connection reset, 0 otherwise
>   */
> -static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> +static int tcp_send_flag(const struct ctx *c, struct tcp_tap_conn *conn,
> +			 int flags)
>  {
>  	return tcp_buf_send_flag(c, conn, flags);
>  }
> @@ -1245,7 +1244,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>   * @c:		Execution context
>   * @conn:	Connection pointer
>   */
> -void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn)
> +void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
>  	if (conn->events == CLOSED)
>  		return;
> @@ -1463,7 +1462,7 @@ static void tcp_bind_outbound(const struct ctx *c,
>   * @optlen:	Bytes in options: caller MUST ensure available length
>   * @now:	Current timestamp
>   */
> -static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
> +static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
>  			      const void *saddr, const void *daddr,
>  			      const struct tcphdr *th, const char *opts,
>  			      size_t optlen, const struct timespec *now)
> @@ -1628,7 +1627,7 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
>   *
>   * #syscalls recvmsg
>   */
> -static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> +static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
>  	return tcp_buf_data_from_sock(c, conn);
>  }
> @@ -1644,8 +1643,8 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>   *
>   * Return: count of consumed packets
>   */
> -static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
> -			      const struct pool *p, int idx)
> +static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> +			     const struct pool *p, int idx)
>  {
>  	int i, iov_i, ack = 0, fin = 0, retr = 0, keep = -1, partial_send = 0;
>  	uint16_t max_ack_seq_wnd = conn->wnd_from_tap;
> @@ -1842,7 +1841,8 @@ out:
>   * @opts:	Pointer to start of options
>   * @optlen:	Bytes in options: caller MUST ensure available length
>   */
> -static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
> +static void tcp_conn_from_sock_finish(const struct ctx *c,
> +				      struct tcp_tap_conn *conn,
>  				      const struct tcphdr *th,
>  				      const char *opts, size_t optlen)
>  {
> @@ -1885,7 +1885,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
>   *
>   * Return: count of consumed packets
>   */
> -int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
> +int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
>  		    const void *saddr, const void *daddr,
>  		    const struct pool *p, int idx, const struct timespec *now)
>  {
> @@ -2023,7 +2023,7 @@ reset:
>   * @c:		Execution context
>   * @conn:	Connection pointer
>   */
> -static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
> +static void tcp_connect_finish(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
>  	socklen_t sl;
>  	int so;
> @@ -2049,8 +2049,8 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
>   * @sa:		Peer socket address (from accept())
>   * @now:	Current timestamp
>   */
> -static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s,
> -				   const struct timespec *now)
> +static void tcp_tap_conn_from_sock(const struct ctx *c, union flow *flow,
> +				   int s, const struct timespec *now)
>  {
>  	struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
>  	uint64_t hash;
> @@ -2081,7 +2081,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s,
>   * @ref:	epoll reference of listening socket
>   * @now:	Current timestamp
>   */
> -void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
> +void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
>  			const struct timespec *now)
>  {
>  	const struct flowside *ini;
> @@ -2146,7 +2146,7 @@ cancel:
>   *
>   * #syscalls timerfd_gettime arm:timerfd_gettime64 i686:timerfd_gettime64
>   */
> -void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
> +void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
>  {
>  	struct itimerspec check_armed = { { 0 }, { 0 } };
>  	struct tcp_tap_conn *conn = &FLOW(ref.flow)->tcp;
> @@ -2210,7 +2210,8 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
>   * @ref:	epoll reference
>   * @events:	epoll events bitmap
>   */
> -void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
> +void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
> +		      uint32_t events)
>  {
>  	struct tcp_tap_conn *conn = conn_at_sidx(ref.flowside);
>  
> @@ -2494,6 +2495,40 @@ static bool tcp_probe_peek_offset_cap(sa_family_t af)
>  	return ret;
>  }
>  
> +#ifdef HAS_SND_WND
> +/**
> + * tcp_probe_snd_wnd_cap() - Check if TCP_INFO reports tcpi_snd_wnd
> + *
> + * Return: true if supported, false otherwise
> + */
> +static bool tcp_probe_snd_wnd_cap(void)
> +{
> +	struct tcp_info tinfo;
> +	socklen_t sl = sizeof(tinfo);
> +	int s;
> +
> +	s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);
> +	if (s < 0) {
> +		warn_perror("Temporary TCP socket creation failed");
> +		return false;
> +	}
> +
> +	if (getsockopt(s, SOL_TCP, TCP_INFO, &tinfo, &sl)) {
> +		warn_perror("Failed to get TCP_INFO on temporary socket");
> +		close(s);
> +		return false;
> +	}
> +
> +	close(s);
> +
> +	if (sl < (offsetof(struct tcp_info, tcpi_snd_wnd) +
> +		  sizeof(tinfo.tcpi_snd_wnd)))
> +		return false;
> +
> +	return true;
> +}
> +#endif /* HAS_SND_WND */
> +
>  /**
>   * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
>   * @c:		Execution context
> @@ -2527,6 +2562,12 @@ int tcp_init(struct ctx *c)
>  			  (!c->ifi6 || tcp_probe_peek_offset_cap(AF_INET6));
>  	debug("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not ");
>  
> +#ifdef HAS_SND_WND
> +	snd_wnd_cap = tcp_probe_snd_wnd_cap();
> +#endif
> +	debug("TCP_INFO tcpi_snd_wnd field%ssupported",
> +	      snd_wnd_cap ? " " : " not ");
> +
>  	return 0;
>  }
>  
> diff --git a/tcp.h b/tcp.h
> index e9ff0191..5585924f 100644
> --- a/tcp.h
> +++ b/tcp.h
> @@ -10,11 +10,12 @@
>  
>  struct ctx;
>  
> -void tcp_timer_handler(struct ctx *c, union epoll_ref ref);
> -void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
> +void tcp_timer_handler(const struct ctx *c, union epoll_ref ref);
> +void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
>  			const struct timespec *now);
> -void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events);
> -int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
> +void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
> +		      uint32_t events);
> +int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
>  		    const void *saddr, const void *daddr,
>  		    const struct pool *p, int idx, const struct timespec *now);
>  int tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
> @@ -58,16 +59,12 @@ union tcp_listen_epoll_ref {
>   * @fwd_in:		Port forwarding configuration for inbound packets
>   * @fwd_out:		Port forwarding configuration for outbound packets
>   * @timer_run:		Timestamp of most recent timer run
> - * @kernel_snd_wnd:	Kernel reports sending window (with commit 8f7baad7f035)
>   * @pipe_size:		Size of pipes for spliced connections
>   */
>  struct tcp_ctx {
>  	struct fwd_ports fwd_in;
>  	struct fwd_ports fwd_out;
>  	struct timespec timer_run;
> -#ifdef HAS_SND_WND
> -	int kernel_snd_wnd;
> -#endif
>  	size_t pipe_size;
>  };
>  
> diff --git a/tcp_buf.c b/tcp_buf.c
> index 1a398461..c886c92b 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -239,7 +239,7 @@ void tcp_flags_flush(const struct ctx *c)
>   * @frames:	Two-dimensional array containing queued frames with sub-iovs
>   * @num_frames:	Number of entries in the two arrays to be compared
>   */
> -static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns,
> +static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
>  			   struct iovec (*frames)[TCP_NUM_IOVS], int num_frames)
>  {
>  	int i;
> @@ -264,7 +264,7 @@ static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns,
>   * tcp_payload_flush() - Send out buffers for segments with data
>   * @c:		Execution context
>   */
> -void tcp_payload_flush(struct ctx *c)
> +void tcp_payload_flush(const struct ctx *c)
>  {
>  	size_t m;
>  
> @@ -293,7 +293,7 @@ void tcp_payload_flush(struct ctx *c)
>   *
>   * Return: negative error code on connection reset, 0 otherwise
>   */
> -int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> +int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  {
>  	struct tcp_flags_t *payload;
>  	struct iovec *iov;
> @@ -361,7 +361,7 @@ int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>   * @no_csum:	Don't compute IPv4 checksum, use the one from previous buffer
>   * @seq:	Sequence number to be sent
>   */
> -static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
> +static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  			    ssize_t dlen, int no_csum, uint32_t seq)
>  {
>  	struct iovec *iov;
> @@ -405,7 +405,7 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
>   *
>   * #syscalls recvmsg
>   */
> -int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> +int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
>  	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
>  	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
> diff --git a/tcp_buf.h b/tcp_buf.h
> index 3db4c56e..8d4b615a 100644
> --- a/tcp_buf.h
> +++ b/tcp_buf.h
> @@ -9,8 +9,8 @@
>  void tcp_sock4_iov_init(const struct ctx *c);
>  void tcp_sock6_iov_init(const struct ctx *c);
>  void tcp_flags_flush(const struct ctx *c);
> -void tcp_payload_flush(struct ctx *c);
> -int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn);
> -int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags);
> +void tcp_payload_flush(const struct ctx *c);
> +int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn);
> +int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags);
>  
>  #endif  /*TCP_BUF_H */
> diff --git a/tcp_internal.h b/tcp_internal.h
> index aa8bb64f..bd634be1 100644
> --- a/tcp_internal.h
> +++ b/tcp_internal.h
> @@ -82,7 +82,7 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
>  		conn_event_do(c, conn, event);				\
>  	} while (0)
>  
> -void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
> +void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
>  #define tcp_rst(c, conn)						\
>  	do {								\
>  		flow_dbg((conn), "TCP reset at %s:%i", __func__, __LINE__); \
> @@ -94,7 +94,7 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
>  			       const uint16_t *check, uint32_t seq);
>  int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  			  int force_seq, struct tcp_info *tinfo);
> -int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, int flags,
> +int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, int flags,
>  		      struct tcphdr *th, char *data, size_t *optlen);
>  
>  #endif /* TCP_INTERNAL_H */

-- 
Stefano

