public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: [PATCH 4/7] udp: Separate bound sockets from flags
Date: Wed, 28 Feb 2024 13:30:05 +1100	[thread overview]
Message-ID: <Zd6arT4y-uo3jSO_@zatzit> (raw)
In-Reply-To: <20240227125659.6505dae9@elisabeth>

[-- Attachment #1: Type: text/plain, Size: 15779 bytes --]

On Tue, Feb 27, 2024 at 12:56:59PM +0100, Stefano Brivio wrote:
> On Thu, 22 Feb 2024 10:21:12 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > struct udp_tap_port contains both a socket fd and a set of flags. But,
> > confusingly, these fields are not actually related to each other at all.
> > 
> > udp_tap_map[?][port].sock contains a host UDP socket bound to 'port' (or
> > -1 if we haven't opened such a socket).  That is, it is indexed by the
> > bound port == the host side forwarding port == destination for host->guest
> > packets == source for guest->host packets.
> > 
> > udp_tap_map[?][port].flags, however, contains flags which control how we
> > direct datagrams bound for 'port' from the tap interface.  That is, it is
> > indexed by the destination for guest->host packets == source for
> > host->guest packets == host side endpoint port.
> > 
> > Worse both aspects of this share the ts field used to control expiry.  When
> > we update this because we've used the socket in slot 'port' we could be
> > clobbering the flags timestamp for an unrelated flow of packets that
> > happens to have the same endpoint port as our forwarding port.
> > 
> > Clarify this by moving the flags out into a separate array, with its own
> > timestamp and expiry path.  As a bonus this means that we can now use the
> > same data structure for tracking the socket fds of "tap" and "splice" path
> > sockets.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  udp.c | 132 ++++++++++++++++++++++++++++++----------------------------
> >  1 file changed, 68 insertions(+), 64 deletions(-)
> > 
> > diff --git a/udp.c b/udp.c
> > index 4200f849..97518d92 100644
> > --- a/udp.c
> > +++ b/udp.c
> > @@ -121,40 +121,39 @@
> >  #define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
> >  
> >  /**
> > - * struct udp_tap_port - Port tracking based on tap-facing source port
> > - * @sock:	Socket bound to source port used as index
> > + * struct udp_port_flags - Translation options by remote Port tracking based on tap-facing source port
> 
> I'm not sure what you mean by this -- perhaps "Address translation
> options for bound source ports"?

Oops, yes.  I think I partially rewrote an existing comment, leaving
an incoherent fusion of two different sentences.

> >   * @flags:	Flags for local bind, loopback address/unicast address as source
> > - * @ts:		Activity timestamp from tap, used for socket aging
> > + * @ts:		Activity timestamp
> >   */
> > -struct udp_tap_port {
> > -	int sock;
> > -	uint8_t flags;
> > +struct udp_port_flags {
> >  #define PORT_LOCAL	BIT(0)
> >  #define PORT_LOOPBACK	BIT(1)
> >  #define PORT_GUA	BIT(2)
> > -
> > +	uint8_t flags;
> >  	time_t ts;
> >  };
> > +static struct udp_port_flags udp_port_flags[IP_VERSIONS][NUM_PORTS];
> >  
> >  /**
> > - * struct udp_splice_port - Bound socket for spliced communication
> > - * @sock:	Socket bound to index port
> > + * struct udp_bound_port - Track bound UDP sockets
> > + * @sock:	Socket bound to index port (or -1)
> >   * @ts:		Activity timestamp
> >   */
> > -struct udp_splice_port {
> > +struct udp_bound_port {
> >  	int sock;
> >  	time_t ts;
> >  };
> >  
> > -/* Port tracking, arrays indexed by packet source port (host order) */
> > -static struct udp_tap_port	udp_tap_map	[IP_VERSIONS][NUM_PORTS];
> > +/* Bound sockets indexed by bound port (host order) */
> > +static struct udp_bound_port	udp_tap_sock	[IP_VERSIONS][NUM_PORTS];
> >  
> > -/* "Spliced" sockets indexed by bound port (host order) */
> > -static struct udp_splice_port udp_splice_ns  [IP_VERSIONS][NUM_PORTS];
> > -static struct udp_splice_port udp_splice_init[IP_VERSIONS][NUM_PORTS];
> > +/* Bound sockets for splicing indexed by bound port (host order) */
> > +static struct udp_bound_port	udp_splice_ns	[IP_VERSIONS][NUM_PORTS];
> > +static struct udp_bound_port	udp_splice_init	[IP_VERSIONS][NUM_PORTS];
> >  
> >  enum udp_act_type {
> > -	UDP_ACT_TAP,
> > +	UDP_ACT_TAP_SOCK,
> > +	UDP_ACT_FLAGS,
> 
> I'm having a hard time reviewing the rest because of this, I think.
> Earlier, UDP_ACT_TAP meant that the activity timestamp was observed on
> a non-spliced port.

Ah!  I'd read those as "ACTions we needed to take" in the expiry
timers, rather than tying them to specific activity.  It amounts to
the same thing pre-patch, but it makes more sense to me with that new
lens.

> Now, we have UDP_ACT_FLAGS which seems to mean that activity was seen
> in the direction of (to) the "tap" side, and UDP_ACT_TAP_SOCK meaning
> that activity was seen in the direction of (to) the socket.
> 
> But that's somehow the opposite of UDP_ACT_SPLICE_NS and
> UDP_ACT_SPLICE_INIT -- even though they don't actually indicate
> activity, rather the fact that activity timestamps refer to sockets that
> originated in (*from*) the target namespace, or in the initial
> namespace.
> 
> Anyway, I'm not quite sure: why do we need two separate flags for "tap"
> (from/to) activity?

We don't, this is an ugly side effect.  The problem is that in
scanning for expiries we only have a single port number.  But the
socket is associated with a particular forwarding port, while the
flags are associated with a particular endpoint port.  Without flow
tracking, we don't have a way to match one to the other.

The compromise here is - or is supposed to be - that we have
independent timestamps for the socket and the flags.  The two
timestamps associated with a single flow are supposed to be updated at
basically the same time - looks like I missed some things to do that
as well.

> I mean, as long as we observe activity on a non-spliced flow, we'll
> update the related timestamp, and we make sure the activity flag is
> set. Can't it simply be UDP_ACT_TAP?
> 
> The rest of the series looks good to me.
> 
> >  	UDP_ACT_SPLICE_NS,
> >  	UDP_ACT_SPLICE_INIT,
> >  	UDP_ACT_TYPE_MAX,
> > @@ -246,7 +245,7 @@ void udp_portmap_clear(void)
> >  	unsigned i;
> >  
> >  	for (i = 0; i < NUM_PORTS; i++) {
> > -		udp_tap_map[V4][i].sock = udp_tap_map[V6][i].sock = -1;
> > +		udp_tap_sock[V4][i].sock = udp_tap_sock[V6][i].sock = -1;
> >  		udp_splice_ns[V4][i].sock = udp_splice_ns[V6][i].sock = -1;
> >  		udp_splice_init[V4][i].sock = udp_splice_init[V6][i].sock = -1;
> >  	}
> > @@ -393,16 +392,16 @@ int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns)
> >  	union epoll_ref ref = { .type = EPOLL_TYPE_UDP,
> >  				.udp = { .splice = true, .v6 = v6, .port = src }
> >  			      };
> > -	struct udp_splice_port *sp;
> > +	struct udp_bound_port *bp;
> >  	int act, s;
> >  
> >  	if (ns) {
> >  		ref.udp.pif = PIF_SPLICE;
> > -		sp = &udp_splice_ns[v6 ? V6 : V4][src];
> > +		bp = &udp_splice_ns[v6 ? V6 : V4][src];
> >  		act = UDP_ACT_SPLICE_NS;
> >  	} else {
> >  		ref.udp.pif = PIF_HOST;
> > -		sp = &udp_splice_init[v6 ? V6 : V4][src];
> > +		bp = &udp_splice_init[v6 ? V6 : V4][src];
> >  		act = UDP_ACT_SPLICE_INIT;
> >  	}
> >  
> > @@ -437,7 +436,7 @@ int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns)
> >  			goto fail;
> >  	}
> >  
> > -	sp->sock = s;
> > +	bp->sock = s;
> >  	bitmap_set(udp_act[v6 ? V6 : V4][act], src);
> >  
> >  	ev.data.u64 = ref.u64;
> > @@ -601,15 +600,15 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport,
> >  	} else if (IN4_IS_ADDR_LOOPBACK(&b->s_in.sin_addr) ||
> >  		   IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr, &c->ip4.addr_seen)) {
> >  		b->iph.saddr = c->ip4.gw.s_addr;
> > -		udp_tap_map[V4][src_port].ts = now->tv_sec;
> > -		udp_tap_map[V4][src_port].flags |= PORT_LOCAL;
> > +		udp_port_flags[V4][src_port].ts = now->tv_sec;
> > +		udp_port_flags[V4][src_port].flags |= PORT_LOCAL;
> >  
> >  		if (IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr.s_addr, &c->ip4.addr_seen))
> > -			udp_tap_map[V4][src_port].flags &= ~PORT_LOOPBACK;
> > +			udp_port_flags[V4][src_port].flags &= ~PORT_LOOPBACK;
> >  		else
> > -			udp_tap_map[V4][src_port].flags |= PORT_LOOPBACK;
> > +			udp_port_flags[V4][src_port].flags |= PORT_LOOPBACK;
> >  
> > -		bitmap_set(udp_act[V4][UDP_ACT_TAP], src_port);
> > +		bitmap_set(udp_act[V4][UDP_ACT_FLAGS], src_port);
> >  	} else {
> >  		b->iph.saddr = b->s_in.sin_addr.s_addr;
> >  	}
> > @@ -664,20 +663,20 @@ static size_t udp_update_hdr6(const struct ctx *c, int n, in_port_t dstport,
> >  		else
> >  			b->ip6h.saddr = c->ip6.addr_ll;
> >  
> > -		udp_tap_map[V6][src_port].ts = now->tv_sec;
> > -		udp_tap_map[V6][src_port].flags |= PORT_LOCAL;
> > +		udp_port_flags[V6][src_port].ts = now->tv_sec;
> > +		udp_port_flags[V6][src_port].flags |= PORT_LOCAL;
> >  
> >  		if (IN6_IS_ADDR_LOOPBACK(src))
> > -			udp_tap_map[V6][src_port].flags |= PORT_LOOPBACK;
> > +			udp_port_flags[V6][src_port].flags |= PORT_LOOPBACK;
> >  		else
> > -			udp_tap_map[V6][src_port].flags &= ~PORT_LOOPBACK;
> > +			udp_port_flags[V6][src_port].flags &= ~PORT_LOOPBACK;
> >  
> >  		if (IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr))
> > -			udp_tap_map[V6][src_port].flags |= PORT_GUA;
> > +			udp_port_flags[V6][src_port].flags |= PORT_GUA;
> >  		else
> > -			udp_tap_map[V6][src_port].flags &= ~PORT_GUA;
> > +			udp_port_flags[V6][src_port].flags &= ~PORT_GUA;
> >  
> > -		bitmap_set(udp_act[V6][UDP_ACT_TAP], src_port);
> > +		bitmap_set(udp_act[V6][UDP_ACT_FLAGS], src_port);
> >  	} else {
> >  		b->ip6h.daddr = c->ip6.addr_seen;
> >  		b->ip6h.saddr = b->s_in6.sin6_addr;
> > @@ -857,16 +856,16 @@ int udp_tap_handler(struct ctx *c, uint8_t pif,
> >  			s_in.sin_addr = c->ip4.dns_host;
> >  		} else if (IN4_ARE_ADDR_EQUAL(&s_in.sin_addr, &c->ip4.gw) &&
> >  			   !c->no_map_gw) {
> > -			if (!(udp_tap_map[V4][dst].flags & PORT_LOCAL) ||
> > -			    (udp_tap_map[V4][dst].flags & PORT_LOOPBACK))
> > +			if (!(udp_port_flags[V4][dst].flags & PORT_LOCAL) ||
> > +			    (udp_port_flags[V4][dst].flags & PORT_LOOPBACK))
> >  				s_in.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
> >  			else
> >  				s_in.sin_addr = c->ip4.addr_seen;
> >  		}
> >  
> >  		debug("UDP from tap src=%hu dst=%hu, s=%d",
> > -		      src, dst, udp_tap_map[V4][src].sock);
> > -		if ((s = udp_tap_map[V4][src].sock) < 0) {
> > +		      src, dst, udp_tap_sock[V4][src].sock);
> > +		if ((s = udp_tap_sock[V4][src].sock) < 0) {
> >  			struct in_addr bind_addr = IN4ADDR_ANY_INIT;
> >  			union udp_epoll_ref uref = {
> >  				.port = src,
> > @@ -886,11 +885,11 @@ int udp_tap_handler(struct ctx *c, uint8_t pif,
> >  			if (s < 0)
> >  				return p->count - idx;
> >  
> > -			udp_tap_map[V4][src].sock = s;
> > -			bitmap_set(udp_act[V4][UDP_ACT_TAP], src);
> > +			udp_tap_sock[V4][src].sock = s;
> > +			bitmap_set(udp_act[V4][UDP_ACT_TAP_SOCK], src);
> >  		}
> >  
> > -		udp_tap_map[V4][src].ts = now->tv_sec;
> > +		udp_tap_sock[V4][src].ts = now->tv_sec;
> >  	} else {
> >  		s_in6 = (struct sockaddr_in6) {
> >  			.sin6_family = AF_INET6,
> > @@ -907,10 +906,10 @@ int udp_tap_handler(struct ctx *c, uint8_t pif,
> >  			s_in6.sin6_addr = c->ip6.dns_host;
> >  		} else if (IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw) &&
> >  			   !c->no_map_gw) {
> > -			if (!(udp_tap_map[V6][dst].flags & PORT_LOCAL) ||
> > -			    (udp_tap_map[V6][dst].flags & PORT_LOOPBACK))
> > +			if (!(udp_port_flags[V6][dst].flags & PORT_LOCAL) ||
> > +			    (udp_port_flags[V6][dst].flags & PORT_LOOPBACK))
> >  				s_in6.sin6_addr = in6addr_loopback;
> > -			else if (udp_tap_map[V6][dst].flags & PORT_GUA)
> > +			else if (udp_port_flags[V6][dst].flags & PORT_GUA)
> >  				s_in6.sin6_addr = c->ip6.addr;
> >  			else
> >  				s_in6.sin6_addr = c->ip6.addr_seen;
> > @@ -918,7 +917,7 @@ int udp_tap_handler(struct ctx *c, uint8_t pif,
> >  			bind_addr = &c->ip6.addr_ll;
> >  		}
> >  
> > -		if ((s = udp_tap_map[V6][src].sock) < 0) {
> > +		if ((s = udp_tap_sock[V6][src].sock) < 0) {
> >  			union udp_epoll_ref uref = {
> >  				.v6 = 1,
> >  				.port = src,
> > @@ -939,11 +938,11 @@ int udp_tap_handler(struct ctx *c, uint8_t pif,
> >  			if (s < 0)
> >  				return p->count - idx;
> >  
> > -			udp_tap_map[V6][src].sock = s;
> > -			bitmap_set(udp_act[V6][UDP_ACT_TAP], src);
> > +			udp_tap_sock[V6][src].sock = s;
> > +			bitmap_set(udp_act[V6][UDP_ACT_TAP_SOCK], src);
> >  		}
> >  
> > -		udp_tap_map[V6][src].ts = now->tv_sec;
> > +		udp_tap_sock[V6][src].ts = now->tv_sec;
> >  	}
> >  
> >  	for (i = 0; i < (int)p->count - idx; i++) {
> > @@ -1015,7 +1014,7 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
> >  			r4 = s = sock_l4(c, AF_INET, IPPROTO_UDP, addr,
> >  					 ifname, port, uref.u32);
> >  
> > -			udp_tap_map[V4][uref.port].sock = s < 0 ? -1 : s;
> > +			udp_tap_sock[V4][uref.port].sock = s < 0 ? -1 : s;
> >  			udp_splice_init[V4][port].sock = s < 0 ? -1 : s;
> >  		} else {
> >  			struct in_addr loopback = IN4ADDR_LOOPBACK_INIT;
> > @@ -1033,7 +1032,7 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
> >  			r6 = s = sock_l4(c, AF_INET6, IPPROTO_UDP, addr,
> >  					 ifname, port, uref.u32);
> >  
> > -			udp_tap_map[V6][uref.port].sock = s < 0 ? -1 : s;
> > +			udp_tap_sock[V6][uref.port].sock = s < 0 ? -1 : s;
> >  			udp_splice_init[V6][port].sock = s < 0 ? -1 : s;
> >  		} else {
> >  			r6 = s = sock_l4(c, AF_INET6, IPPROTO_UDP,
> > @@ -1086,32 +1085,37 @@ static void udp_splice_iov_init(void)
> >  static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type,
> >  			  in_port_t port, const struct timespec *now)
> >  {
> > -	struct udp_splice_port *sp;
> > -	struct udp_tap_port *tp;
> > +	struct udp_bound_port *bp;
> > +	struct udp_port_flags *pf;
> >  	int *sockp = NULL;
> >  
> >  	switch (type) {
> > -	case UDP_ACT_TAP:
> > -		tp = &udp_tap_map[v6 ? V6 : V4][port];
> > +	case UDP_ACT_TAP_SOCK:
> > +		bp = &udp_tap_sock[v6 ? V6 : V4][port];
> >  
> > -		if (now->tv_sec - tp->ts > UDP_CONN_TIMEOUT) {
> > -			sockp = &tp->sock;
> > -			tp->flags = 0;
> > -		}
> > +		if (now->tv_sec - bp->ts > UDP_CONN_TIMEOUT)
> > +			sockp = &bp->sock;
> > +
> > +		break;
> > +	case UDP_ACT_FLAGS:
> > +		pf = &udp_port_flags[v6 ? V6 : V4][port];
> > +
> > +		if (now->tv_sec - pf->ts > UDP_CONN_TIMEOUT)
> > +			pf->flags = 0;
> >  
> >  		break;
> >  	case UDP_ACT_SPLICE_INIT:
> > -		sp = &udp_splice_init[v6 ? V6 : V4][port];
> > +		bp = &udp_splice_init[v6 ? V6 : V4][port];
> >  
> > -		if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT)
> > -			sockp = &sp->sock;
> > +		if (now->tv_sec - bp->ts > UDP_CONN_TIMEOUT)
> > +			sockp = &bp->sock;
> >  
> >  		break;
> >  	case UDP_ACT_SPLICE_NS:
> > -		sp = &udp_splice_ns[v6 ? V6 : V4][port];
> > +		bp = &udp_splice_ns[v6 ? V6 : V4][port];
> >  
> > -		if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT)
> > -			sockp = &sp->sock;
> > +		if (now->tv_sec - bp->ts > UDP_CONN_TIMEOUT)
> > +			sockp = &bp->sock;
> >  
> >  		break;
> >  	default:
> > @@ -1141,7 +1145,7 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
> >  		= outbound ? c->udp.fwd_out.f.map : c->udp.fwd_in.f.map;
> >  	const uint8_t *rmap
> >  		= outbound ? c->udp.fwd_in.f.map : c->udp.fwd_out.f.map;
> > -	struct udp_splice_port (*socks)[NUM_PORTS]
> > +	struct udp_bound_port (*socks)[NUM_PORTS]
> >  		= outbound ? udp_splice_ns : udp_splice_init;
> >  	unsigned port;
> >  
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2024-02-28  2:30 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-21 23:21 [PATCH 0/7] Assorted small fixes for UDP David Gibson
2024-02-21 23:21 ` [PATCH 1/7] udp: Don't attempt to translate a 0.0.0.0 source address David Gibson
2024-02-22 17:46   ` Stefano Brivio
2024-02-23  4:03     ` David Gibson
2024-02-21 23:21 ` [PATCH 2/7] udp: Set pif in epoll reference for ephemeral host sockets David Gibson
2024-02-21 23:21 ` [PATCH 3/7] udp: Unconditionally clear act flag in udp_timer_one() David Gibson
2024-02-27 11:56   ` Stefano Brivio
2024-02-28  1:54     ` David Gibson
2024-02-21 23:21 ` [PATCH 4/7] udp: Separate bound sockets from flags David Gibson
2024-02-27 11:56   ` Stefano Brivio
2024-02-28  2:30     ` David Gibson [this message]
2024-02-28  3:53       ` David Gibson
2024-02-21 23:21 ` [PATCH 5/7] udp: Small streamline to udp_update_hdr4() David Gibson
2024-02-21 23:21 ` [PATCH 6/7] udp: Fix incorrect usage of IPv6 state in IPv4 path David Gibson
2024-02-21 23:21 ` [PATCH 7/7] udp: Remove unnecessary test for unspecified addr_out David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zd6arT4y-uo3jSO_@zatzit \
    --to=david@gibson.dropbear.id.au \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).