From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail.ozlabs.org (gandalf.ozlabs.org [150.107.74.76]) by passt.top (Postfix) with ESMTPS id 927945A004E for ; Wed, 10 Jul 2024 02:23:36 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202312; t=1720570999; bh=XLhs+xZQkS4CbBVOE99dMo724HFHAS51/JF2bBSRkQY=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=iQRPD/pTvY+ZfmVEq4BWuc/0JnDshjCLYIH8rvI23MV4kY8zw6mT/7QYx9iwP0d5y jS9s4ymkf5sNJJl1SZCpxia+byxWVfTUbXdayXhR9nnBgUDwWXwk+A0ROY9bEQ1teJ KxtFMLuG/oM3qyE9PszaQsl0jHtKw+UYJyGc5aRU/T3mIaFX7c/uUa3i7Se7jf1fEM wz9+xeQfVKukdjNkolqoaUgdkfm9vqr6Ks33fZldwDhqngvf3NI55N1adBEWRf2HTQ gZsSLPOwbw81pl3tCuJuPnSBrxkRAS5lnCj5pOLsWc+dL1yzVJEFSiHvoltpej+qnO Bou36U0YOg0ZA== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4WJdqq4ck9z4wx6; Wed, 10 Jul 2024 10:23:19 +1000 (AEST) Date: Wed, 10 Jul 2024 10:23:14 +1000 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH v7 21/27] udp: Handle "spliced" datagrams with per-flow sockets Message-ID: References: <20240705020724.3447719-1-david@gibson.dropbear.id.au> <20240705020724.3447719-22-david@gibson.dropbear.id.au> <20240710003233.1d9937fd@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="6K7GO27FiGDTr6h2" Content-Disposition: inline In-Reply-To: <20240710003233.1d9937fd@elisabeth> Message-ID-Hash: ZTXOJB7OJUMF37WBS6BHSSTVT45ZPOMU X-Message-ID-Hash: ZTXOJB7OJUMF37WBS6BHSSTVT45ZPOMU X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, jmaloy@redhat.com X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --6K7GO27FiGDTr6h2 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Jul 10, 2024 at 12:32:33AM +0200, Stefano Brivio wrote: > On Fri, 5 Jul 2024 12:07:18 +1000 > David Gibson wrote: >=20 > > When forwarding a datagram to a socket, we need to find a socket with a > > suitable local address to send it. Currently we keep track of such soc= kets > > in an array indexed by local port, but this can't properly handle cases > > where we have multiple local addresses in active use. > >=20 > > For "spliced" (socket to socket) cases, improve this by instead opening > > a socket specifically for the target side of the flow. We connect() as > > well as bind()ing that socket, so that it will only receive the flow's > > reply packets, not anything else. We direct datagrams sent via that so= cket > > using the addresses from the flow table, effectively replacing bespoke > > addressing logic with the unified logic in fwd.c > >=20 > > When we create the flow, we also take a duplicate of the originating > > socket, and use that to deliver reply datagrams back to the origin, aga= in > > using addresses from the flow table entry. > >=20 > > Signed-off-by: David Gibson > > --- > > epoll_type.h | 2 + > > flow.c | 20 +++ > > flow.h | 2 + > > flow_table.h | 14 ++ > > passt.c | 4 + > > udp.c | 403 ++++++++++++++++++--------------------------------- > > udp.h | 6 +- > > udp_flow.h | 2 + > > util.c | 1 + > > 9 files changed, 194 insertions(+), 260 deletions(-) > >=20 > > diff --git a/epoll_type.h b/epoll_type.h > > index b6c04199..7a752ed1 100644 > > --- a/epoll_type.h > > +++ b/epoll_type.h > > @@ -22,6 +22,8 @@ enum epoll_type { > > EPOLL_TYPE_TCP_TIMER, > > /* UDP sockets */ > > EPOLL_TYPE_UDP, > > + /* UDP socket for replies on a specific flow */ > > + EPOLL_TYPE_UDP_REPLY, > > /* ICMP/ICMPv6 ping sockets */ > > EPOLL_TYPE_PING, > > /* inotify fd watching for end of netns (pasta) */ > > diff --git a/flow.c b/flow.c > > index 0cb9495b..2e100ddb 100644 > > --- a/flow.c > > +++ b/flow.c > > @@ -236,6 +236,26 @@ int flowside_sock_l4(const struct ctx *c, enum epo= ll_type type, uint8_t pif, > > } > > } > > =20 > > +/** flowside_connect() - Connect a socket based on flowside > > + * @c: Execution context > > + * @s: Socket to connect > > + * @pif: Target pif > > + * @tgt: Target flowside > > + * > > + * Connect @s to the endpoint address and port from @tgt. > > + * > > + * Return: 0 on success, negative on error > > + */ > > +int flowside_connect(const struct ctx *c, int s, > > + uint8_t pif, const struct flowside *tgt) > > +{ > > + union sockaddr_inany sa; > > + socklen_t sl; > > + > > + pif_sockaddr(c, &sa, &sl, pif, &tgt->eaddr, tgt->eport); > > + return connect(s, &sa.sa, sl); > > +} > > + > > /** flow_log_ - Log flow-related message > > * @f: flow the message is related to > > * @pri: Log priority > > diff --git a/flow.h b/flow.h > > index 3752e5ee..3f65ceb9 100644 > > --- a/flow.h > > +++ b/flow.h > > @@ -168,6 +168,8 @@ static inline bool flowside_eq(const struct flowsid= e *left, > > =20 > > int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_= t pif, > > const struct flowside *tgt, uint32_t data); > > +int flowside_connect(const struct ctx *c, int s, > > + uint8_t pif, const struct flowside *tgt); > > =20 > > /** > > * struct flow_common - Common fields for packet flows > > diff --git a/flow_table.h b/flow_table.h > > index 3fbc7c8d..1faac4a7 100644 > > --- a/flow_table.h > > +++ b/flow_table.h > > @@ -92,6 +92,20 @@ static inline flow_sidx_t flow_sidx_opposite(flow_si= dx_t sidx) > > return (flow_sidx_t){.flow =3D sidx.flow, .side =3D !sidx.side}; > > } > > =20 > > +/** pif_at_sidx - Interface for a given flow and side >=20 > pif_at_sidx() Done. > > + * @sidx: Flow & side index > > + * > > + * Return: pif for the flow & side given by @sidx > > + */ > > +static inline uint8_t pif_at_sidx(flow_sidx_t sidx) > > +{ > > + const union flow *flow =3D flow_at_sidx(sidx); > > + > > + if (!flow) > > + return PIF_NONE; > > + return flow->f.pif[sidx.side]; > > +} > > + > > /** flow_sidx_t - Index of one side of a flow from common structure > > * @f: Common flow fields pointer > > * @side: Which side to refer to (0 or 1) > > diff --git a/passt.c b/passt.c > > index e4d45daa..f9405bee 100644 > > --- a/passt.c > > +++ b/passt.c > > @@ -67,6 +67,7 @@ char *epoll_type_str[] =3D { > > [EPOLL_TYPE_TCP_LISTEN] =3D "listening TCP socket", > > [EPOLL_TYPE_TCP_TIMER] =3D "TCP timer", > > [EPOLL_TYPE_UDP] =3D "UDP socket", > > + [EPOLL_TYPE_UDP_REPLY] =3D "UDP reply socket", > > [EPOLL_TYPE_PING] =3D "ICMP/ICMPv6 ping socket", > > [EPOLL_TYPE_NSQUIT_INOTIFY] =3D "namespace inotify watch", > > [EPOLL_TYPE_NSQUIT_TIMER] =3D "namespace timer watch", > > @@ -349,6 +350,9 @@ loop: > > case EPOLL_TYPE_UDP: > > udp_buf_sock_handler(&c, ref, eventmask, &now); > > break; > > + case EPOLL_TYPE_UDP_REPLY: > > + udp_reply_sock_handler(&c, ref, eventmask, &now); > > + break; > > case EPOLL_TYPE_PING: > > icmp_sock_handler(&c, ref); > > break; > > diff --git a/udp.c b/udp.c > > index daf4fe26..f4c696db 100644 > > --- a/udp.c > > +++ b/udp.c > > @@ -35,7 +35,31 @@ > > * =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > * > > * UDP doesn't use listen(), but we consider long term sockets which a= re allowed > > - * to create new flows "listening" by analogy with TCP. > > + * to create new flows "listening" by analogy with TCP. This listening= socket > > + * could receive packets from multiple flows, so we use a hash table m= atch to > > + * find the specific flow for a datagram. > > + * > > + * When a UDP flow is initiated from a listening socket we take a dupl= icate of >=20 > These are always "inbound" flows, right? No, they could come from a socket listening in the namespace (-U). > > + * the socket and store it in uflow->s[INISIDE]. This will last for t= he > > + * lifetime of the flow, even if the original listening socket is clos= ed due to > > + * port auto-probing. The duplicate is used to deliver replies back t= o the > > + * originating side. > > + * > > + * Reply sockets > > + * =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > + * > > + * When a UDP flow targets a socket, we create a "reply" socket in >=20 > ...and these are outbound ones? No. Again, we could be creating this socket in the namespace. A spliced flow will have both the dup() of a listening socket, and a reply socket, regardless of whether it's inbound (HOST -> SPLICE) or outbound (SPLICE -> HOST). > > + * uflow->s[TGTSIDE] both to deliver datagrams to the target side and = receive > > + * replies on the target side. This socket is both bound and connecte= d and has > > + * EPOLL_TYPE_UDP_REPLY. The connect() means it will only receive dat= agrams > > + * associated with this flow, so the epoll reference directly points t= o the flow > > + * and we don't need a hash lookup. > > + * > > + * NOTE: it's possible that the reply socket could have a bound address > > + * overlapping with an unrelated listening socket. We assume datagram= s for the > > + * flow will come to the reply socket in preference to a listening soc= ket. The > > + * sample program contrib/udp-reuseaddr/reuseaddr-priority.c documents= and tests >=20 > Now it's under doc/platform-requirements/ Thanks for the catch, corrected. > > + * that assumption. > > * > > * Port tracking > > * =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > @@ -56,62 +80,6 @@ > > * > > * Packets are forwarded back and forth, by prepending and stripping U= DP headers > > * in the obvious way, with no port translation. > > - * > > - * In PASTA mode, the L2-L4 translation is skipped for connections to = ports > > - * bound between namespaces using the loopback interface, messages are= directly > > - * transferred between L4 sockets instead. These are called spliced co= nnections > > - * for consistency with the TCP implementation, but the splice() sysca= ll isn't > > - * actually used as it wouldn't make sense for datagram-based connecti= ons: a > > - * pair of recvmmsg() and sendmmsg() deals with this case. >=20 > I think we should keep this paragraph... or did you move/add this > information somewhere else I missed? Given that the "Port tracking" > section is going to be away, maybe this could be moved at the end of > "UDP Flows". Fair point. I've paraphrahsed and updated this a bit and added it as a new section about spliced flows. > > - * > > - * The connection tracking for PASTA mode is slightly complicated by t= he absence > > - * of actual connections, see struct udp_splice_port, and these exampl= es: > > - * > > - * - from init to namespace: > > - * > > - * - forward direction: 127.0.0.1:5000 -> 127.0.0.1:80 in init from = socket s, > > - * with epoll reference: index =3D 80, splice =3D 1, orig =3D 1, n= s =3D 0 > > - * - if udp_splice_ns[V4][5000].sock: > > - * - send packet to udp_splice_ns[V4][5000].sock, with destinati= on port > > - * 80 > > - * - otherwise: > > - * - create new socket udp_splice_ns[V4][5000].sock > > - * - bind in namespace to 127.0.0.1:5000 > > - * - add to epoll with reference: index =3D 5000, splice =3D 1, = orig =3D 0, > > - * ns =3D 1 > > - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000]= =2Ets with > > - * current time > > - * > > - * - reverse direction: 127.0.0.1:80 -> 127.0.0.1:5000 in namespace = socket s, > > - * having epoll reference: index =3D 5000, splice =3D 1, orig =3D = 0, ns =3D 1 > > - * - if udp_splice_init[V4][80].sock: > > - * - send to udp_splice_init[V4][80].sock, with destination port= 5000 > > - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][500= 0].ts with > > - * current time > > - * - otherwise, discard > > - * > > - * - from namespace to init: > > - * > > - * - forward direction: 127.0.0.1:2000 -> 127.0.0.1:22 in namespace = =66rom > > - * socket s, with epoll reference: index =3D 22, splice =3D 1, ori= g =3D 1, ns =3D 1 > > - * - if udp4_splice_init[V4][2000].sock: > > - * - send packet to udp_splice_init[V4][2000].sock, with destina= tion > > - * port 22 > > - * - otherwise: > > - * - create new socket udp_splice_init[V4][2000].sock > > - * - bind in init to 127.0.0.1:2000 > > - * - add to epoll with reference: index =3D 2000, splice =3D 1, = orig =3D 0, > > - * ns =3D 0 > > - * - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000]= =2Ets with > > - * current time > > - * > > - * - reverse direction: 127.0.0.1:22 -> 127.0.0.1:2000 in init from = socket s, > > - * having epoll reference: index =3D 2000, splice =3D 1, orig =3D = 0, ns =3D 0 > > - * - if udp_splice_ns[V4][22].sock: > > - * - send to udp_splice_ns[V4][22].sock, with destination port 2000 > > - * - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000]= =2Ets with > > - * current time > > - * - otherwise, discard > > */ > > =20 > > #include > > @@ -134,6 +102,7 @@ > > #include > > #include > > #include > > +#include > > =20 > > #include "checksum.h" > > #include "util.h" > > @@ -223,7 +192,6 @@ static struct ethhdr udp6_eth_hdr; > > * @ip4h: Pre-filled IPv4 header (except for tot_len and saddr) > > * @taph: Tap backend specific header > > * @s_in: Source socket address, filled in by recvmmsg() > > - * @splicesrc: Source port for splicing, or -1 if not spliceable > > * @tosidx: sidx for the destination side of this datagram's flow > > */ > > static struct udp_meta_t { > > @@ -232,7 +200,6 @@ static struct udp_meta_t { > > struct tap_hdr taph; > > =20 > > union sockaddr_inany s_in; > > - int splicesrc; > > flow_sidx_t tosidx; > > } > > #ifdef __AVX2__ > > @@ -270,7 +237,6 @@ static struct mmsghdr udp_mh_splice [UDP_MAX_FRAME= S]; > > /* IOVs for L2 frames */ > > static struct iovec udp_l2_iov [UDP_MAX_FRAMES][UDP_NUM_IOVS]; > > =20 > > - > > /** > > * udp_portmap_clear() - Clear UDP port map before configuration > > */ > > @@ -383,140 +349,6 @@ static void udp_iov_init(const struct ctx *c) > > udp_iov_init_one(c, i); > > } > > =20 > > -/** > > - * udp_splice_new() - Create and prepare socket for "spliced" binding > > - * @c: Execution context > > - * @v6: Set for IPv6 sockets > > - * @src: Source port of original connection, host order > > - * @ns: Does the splice originate in the ns or not > > - * > > - * Return: prepared socket, negative error code on failure > > - * > > - * #syscalls:pasta getsockname > > - */ > > -int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns) > > -{ > > - struct epoll_event ev =3D { .events =3D EPOLLIN | EPOLLRDHUP | EPOLLH= UP }; > > - union epoll_ref ref =3D { .type =3D EPOLL_TYPE_UDP, > > - .udp =3D { .splice =3D true, .v6 =3D v6, .port =3D src } > > - }; > > - struct udp_splice_port *sp; > > - int act, s; > > - > > - if (ns) { > > - ref.udp.pif =3D PIF_SPLICE; > > - sp =3D &udp_splice_ns[v6 ? V6 : V4][src]; > > - act =3D UDP_ACT_SPLICE_NS; > > - } else { > > - ref.udp.pif =3D PIF_HOST; > > - sp =3D &udp_splice_init[v6 ? V6 : V4][src]; > > - act =3D UDP_ACT_SPLICE_INIT; > > - } > > - > > - s =3D socket(v6 ? AF_INET6 : AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, > > - IPPROTO_UDP); > > - > > - if (s > FD_REF_MAX) { > > - close(s); > > - return -EIO; > > - } > > - > > - if (s < 0) > > - return s; > > - > > - ref.fd =3D s; > > - > > - if (v6) { > > - struct sockaddr_in6 addr6 =3D { > > - .sin6_family =3D AF_INET6, > > - .sin6_port =3D htons(src), > > - .sin6_addr =3D IN6ADDR_LOOPBACK_INIT, > > - }; > > - if (bind(s, (struct sockaddr *)&addr6, sizeof(addr6))) > > - goto fail; > > - } else { > > - struct sockaddr_in addr4 =3D { > > - .sin_family =3D AF_INET, > > - .sin_port =3D htons(src), > > - .sin_addr =3D IN4ADDR_LOOPBACK_INIT, > > - }; > > - if (bind(s, (struct sockaddr *)&addr4, sizeof(addr4))) > > - goto fail; > > - } > > - > > - sp->sock =3D s; > > - bitmap_set(udp_act[v6 ? V6 : V4][act], src); > > - > > - ev.data.u64 =3D ref.u64; > > - epoll_ctl(c->epollfd, EPOLL_CTL_ADD, s, &ev); > > - return s; > > - > > -fail: > > - close(s); > > - return -1; > > -} > > - > > -/** > > - * struct udp_splice_new_ns_arg - Arguments for udp_splice_new_ns() > > - * @c: Execution context > > - * @v6: Set for IPv6 > > - * @src: Source port of originating datagram, host order > > - * @dst: Destination port of originating datagram, host order > > - * @s: Newly created socket or negative error code > > - */ > > -struct udp_splice_new_ns_arg { > > - const struct ctx *c; > > - int v6; > > - in_port_t src; > > - int s; > > -}; > > - > > -/** > > - * udp_splice_new_ns() - Enter namespace and call udp_splice_new() > > - * @arg: See struct udp_splice_new_ns_arg > > - * > > - * Return: 0 > > - */ > > -static int udp_splice_new_ns(void *arg) > > -{ > > - struct udp_splice_new_ns_arg *a; > > - > > - a =3D (struct udp_splice_new_ns_arg *)arg; > > - > > - ns_enter(a->c); > > - > > - a->s =3D udp_splice_new(a->c, a->v6, a->src, true); > > - > > - return 0; > > -} > > - > > -/** > > - * udp_mmh_splice_port() - Is source address of message suitable for s= plicing? > > - * @ref: epoll reference for incoming message's origin socket > > - * @mmh: mmsghdr of incoming message > > - * > > - * Return: if source address of message in @mmh refers to localhost (1= 27.0.0.1 > > - * or ::1) its source port (host order), otherwise -1. > > - */ > > -static int udp_mmh_splice_port(union epoll_ref ref, const struct mmsgh= dr *mmh) > > -{ > > - const struct sockaddr_in6 *sa6 =3D mmh->msg_hdr.msg_name; > > - const struct sockaddr_in *sa4 =3D mmh->msg_hdr.msg_name; > > - > > - ASSERT(ref.type =3D=3D EPOLL_TYPE_UDP); > > - > > - if (!ref.udp.splice) > > - return -1; > > - > > - if (ref.udp.v6 && IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr)) > > - return ntohs(sa6->sin6_port); > > - > > - if (!ref.udp.v6 && IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr)) > > - return ntohs(sa4->sin_port); > > - > > - return -1; > > -} > > - > > /** > > * udp_at_sidx() - Get UDP specific flow at given sidx > > * @sidx: Flow and side to retrieve > > @@ -542,6 +374,16 @@ struct udp_flow *udp_at_sidx(flow_sidx_t sidx) > > */ > > static void udp_flow_close(const struct ctx *c, const struct udp_flow = *uflow) > > { > > + if (uflow->s[INISIDE] >=3D 0) { > > + /* The listening socket needs to stay in epoll */ > > + close(uflow->s[INISIDE]); > > + } > > + > > + if (uflow->s[TGTSIDE] >=3D 0) { > > + /* But the flow specific one needs to be removed */ > > + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, uflow->s[TGTSIDE], NULL); > > + close(uflow->s[TGTSIDE]); >=20 > Keeping numbers of closed sockets around, instead of setting them to -1 > right away, makes me a bit nervous. > > On the other hand it's obvious that udp_flow_new() sets them to -1 > anyway, so there isn't much that can go wrong. Right, we're about to destroy the flow entry entirely, so it is safe. But I agree that it's nervous-making. It's pretty cheap to set them to -1 explicitly, so I'm doing that now. >=20 > > + } > > flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE)); > > } > > =20 > > @@ -549,26 +391,80 @@ static void udp_flow_close(const struct ctx *c, c= onst struct udp_flow *uflow) > > * udp_flow_new() - Common setup for a new UDP flow > > * @c: Execution context > > * @flow: Initiated flow > > + * @s_ini: Initiating socket (or -1) > > * @now: Timestamp > > * > > * Return: UDP specific flow, if successful, NULL on failure > > */ > > static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, > > - const struct timespec *now) > > + int s_ini, const struct timespec *now) > > { > > const struct flowside *ini =3D &flow->f.side[INISIDE]; > > struct udp_flow *uflow =3D NULL; > > + const struct flowside *tgt; > > + uint8_t tgtpif; > > =20 > > if (!inany_is_unicast(&ini->eaddr) || ini->eport =3D=3D 0) { > > flow_dbg(flow, "Invalid endpoint to initiate UDP flow"); > > goto cancel; > > } > > =20 > > - if (!flow_target(c, flow, IPPROTO_UDP)) > > + if (!(tgt =3D flow_target(c, flow, IPPROTO_UDP))) > > goto cancel; > > + tgtpif =3D flow->f.pif[TGTSIDE]; > > =20 > > uflow =3D FLOW_SET_TYPE(flow, FLOW_UDP, udp); > > uflow->ts =3D now->tv_sec; > > + uflow->s[INISIDE] =3D uflow->s[TGTSIDE] =3D -1; > > + > > + if (s_ini >=3D 0) { > > + /* When using auto port-scanning the listening port could go > > + * away, so we need to duplicate it */ >=20 > For consistency: closing */ on a newline. I would also say "the socket" > instead of "it", otherwise it seems to be referring to the port at a > first glance. Done. > > + uflow->s[INISIDE] =3D fcntl(s_ini, F_DUPFD_CLOEXEC, 0); >=20 > There's one aspect of this I don't understand: if s_ini is closed while > checking for bound ports (is it? I didn't really reach the end of this > series), aren't duplicates also closed? >=20 > That is, the documentation of dup2(2), which should be the same for > this purpose, states that the duplicate inherits "file status flags", > which I would assume also includes the fact that a socket is closed. I > didn't test that though. I don't believe so. My understanding is that dup() (and the rest) make a new fd referencing the same underlying file object, yes. But AIUI, close() just closes one fd - the underlying object is only closed only when all fds are gone. > If duplicates are closed, I guess an alternative solution could be to > introduce some kind of reference counting for sockets... somewhere. =2E. in other words, I believe the kernel does the reference counting. I should verify this though, I'll try to come up with something new for doc/platform-requirements. >=20 > > + if (uflow->s[INISIDE] < 0) { > > + flow_err(uflow, > > + "Couldn't duplicate listening socket: %s", > > + strerror(errno)); > > + goto cancel; > > + } > > + } > > + > > + if (pif_is_socket(tgtpif)) { > > + union { > > + flow_sidx_t sidx; > > + uint32_t data; > > + } fref =3D { > > + .sidx =3D FLOW_SIDX(flow, TGTSIDE), > > + }; > > + > > + uflow->s[TGTSIDE] =3D flowside_sock_l4(c, EPOLL_TYPE_UDP_REPLY, > > + tgtpif, tgt, fref.data); > > + if (uflow->s[TGTSIDE] < 0) { > > + flow_dbg(uflow, > > + "Couldn't open socket for spliced flow: %s", > > + strerror(errno)); > > + goto cancel; > > + } > > + > > + if (flowside_connect(c, uflow->s[TGTSIDE], tgtpif, tgt) < 0) { > > + flow_dbg(uflow, > > + "Couldn't connect flow socket: %s", > > + strerror(errno)); > > + goto cancel; > > + } > > + > > + /* It's possible, if unlikely, that we could receive some > > + * unrelated packets in between the bind() and connect() of this > > + * socket. For now we just discard these. We could consider > > + * trying to re-direct these to an appropriate handler, if we >=20 > Simply "redirect"? Done. > > + * need to. > > + */ > > + /* cppcheck-suppress nullPointer */ > > + while (recv(uflow->s[TGTSIDE], NULL, 0, MSG_DONTWAIT) >=3D 0) > > + ; >=20 > Could a local attacker (another user) attempt to use this for denial of > service? Ah, interesting question. > Of course, somebody could flood us anyway and we would get and handle > all the events that that causes. But this case is different because we > could get stuck for an unlimited amount of time without serving other > sockets at all. Right. > If that's a possibility, perhaps a limit for this loop (a maximum > amount of recv()) tries would be a good idea. I'm not sure how we should > handle the case where we exceed the threshold. We could fail to create the flow. That would limit the damage, but see below. > Another one, which adds some complexity, but looks more correct to me, > would be to try a single recv() call, and if we get data from it, fail > creating the new flow entirely. Right. I also considered a single recvmmsg(); the difficulty with that is making suitable arrays for it, since the normal ones may be in use at this point. This removes the potential DoS of wedging passt completely, but raises the possibility of a different one. Could an attacker - not necessarily on the same host, but network-closer than the guest's intended endpoint - prevent the guest from connecting to a particular service by spamming fake reply packets here. I believe it would need to anticipate the guest's source port to do so, which might be good enough. > I'm still reviewing the rest (of this patch and of this series). >=20 --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --6K7GO27FiGDTr6h2 Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmaN1HIACgkQzQJF27ox 2GfdLg//SgVqywikeEgyE3Rxfjbrrzw5Z39iktYx4ewZdXj9bD1B8bhEWg/R97zS 1UxAstyQTmS+15epwPaRDMmDwTOs0VrDNnRkdBKw8EilLul2cOo7ik0qiE5HDnSV Wn4fIS2XEtUYMxrsGjCEn4GwjqSNP1qw2qib+CChra5VdP0FeasJWdaTC2oHX1jx r8trWAU1FPRgDmW2KYT5x5qLpBZbuxB8rngvy50+WkhpxXzfz0hqj5H/swSTsqH+ nlO3XoQvmFqSSDsA7N3J919yRPKumEEbUqLASnvcTWSMFdOlbhjYOpdFcd2uTAA4 YKU473hu2BirqKb9PBnNqUFy+rcEczRYYdb9e5snk5lhsUiyfdH0mZLMf7KA5LF+ LQW+YJ0LyCTC8Jr/Ih9tshjsdVh508raSiLRU35KtjxLNHNBhZ6mJB3C28vxRc5i oxcuX/svbfY4sOuL6tI0yICX7ufZj2gesZLMyxrIo/D5HDkzmShkRsuNCSyNmUaK spuTOS2eyvCysrHVmIJxE1mzr85xlMWa1zULndatOMU3UubwVevbElx0qXmSiSTw PEyg1FAqninyuSl+066ksHnrhGa+By+SBne/WUh9dQk5ySA8xinSwHUW48QmWCwe r1WVgrJFWk3CGA80DT6jRy3G26m2StW3sXci5wBNeQri2Mx7gSU= =0wo+ -----END PGP SIGNATURE----- --6K7GO27FiGDTr6h2--