From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: [PATCH v5 18/19] flow, tcp: Flow based NAT and port forwarding for TCP
Date: Mon, 20 May 2024 15:44:07 +1000
Message-ID: <ZkrjJ3RkIAUUYJkj@zatzit>
In-Reply-To: <20240518001345.2d127b09@elisabeth>
On Sat, May 18, 2024 at 12:13:45AM +0200, Stefano Brivio wrote:
> On Tue, 14 May 2024 11:03:36 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > Currently the code to translate host side addresses and ports to guest side
> > addresses and ports, and vice versa, is scattered across the TCP code.
> > This includes both port redirection as controlled by the -t and -T options,
> > and our special case NAT controlled by the --no-map-gw option.
> >
> > Gather this logic into fwd_from_*() functions for each input interface
> > in fwd.c which take protocol and address information for the initiating
> > side and generates the pif and address information for the forwarded side.
> > This performs any NAT or port forwarding needed.
> >
> > We create a flow_forward() helper which applies those forwarding functions
> > as needed to automatically move a flow from INI to FWD state. For now we
> > leave the older flow_forward_af() function taking explicit addresses as
> > a transitional tool.
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > flow.c | 53 +++++++++++++++++++++++++
> > flow_table.h | 2 +
> > fwd.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > fwd.h | 12 ++++++
> > tcp.c | 102 +++++++++++++++--------------------------------
> > tcp_splice.c | 63 ++---------------------------
> > tcp_splice.h | 5 +--
> > 7 files changed, 213 insertions(+), 134 deletions(-)
> >
> > diff --git a/flow.c b/flow.c
> > index 4942075..a6afe39 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -304,6 +304,59 @@ const struct flowside *flow_forward_af(union flow *flow, uint8_t pif,
> > return fwd;
> > }
> >
> > +
> > +/**
> > + * flow_forward() - Determine where flow should forward to, and move to FWD
> > + * @c: Execution context
> > + * @flow: Flow to forward
> > + * @proto: Protocol
> > + *
> > + * Return: pointer to the forwarded flowside information
> > + */
> > +const struct flowside *flow_forward(const struct ctx *c, union flow *flow,
> > + uint8_t proto)
> > +{
> > + char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
> > + struct flow_common *f = &flow->f;
> > + const struct flowside *ini = &f->side[INISIDE];
> > + struct flowside *fwd = &f->side[FWDSIDE];
> > + uint8_t pif1 = PIF_NONE;
>
> This could now be 'pif_fwd' / 'pif_tgt', right?
Good idea, changed.
[snip]
> > diff --git a/fwd.c b/fwd.c
> > index b3d5a37..5fe2361 100644
> > --- a/fwd.c
> > +++ b/fwd.c
> > @@ -25,6 +25,7 @@
> > #include "fwd.h"
> > #include "passt.h"
> > #include "lineread.h"
> > +#include "flow_table.h"
> >
> > /* See enum in kernel's include/net/tcp_states.h */
> > #define UDP_LISTEN 0x07
> > @@ -154,3 +155,112 @@ void fwd_scan_ports_init(struct ctx *c)
> > &c->tcp.fwd_out, &c->tcp.fwd_in);
> > }
> > }
> > +
> > +uint8_t fwd_from_tap(const struct ctx *c, uint8_t proto,
> > + const struct flowside *a, struct flowside *b)
>
> A function comment would be nice to have, albeit a bit redundant.
Ah, yes. I meant to go back and add these, but obviously forgot.
Fixed now.
> Now
> 'a' and 'b' could also be called 'ini' and 'tgt' I guess?
Also a good idea, done.
> > +{
> > + (void)proto;
> > +
> > + b->eaddr = a->faddr;
> > + b->eport = a->fport;
> > +
> > + if (!c->no_map_gw) {
> > + struct in_addr *v4 = inany_v4(&b->eaddr);
> > +
> > + if (v4 && IN4_ARE_ADDR_EQUAL(v4, &c->ip4.gw))
> > + *v4 = in4addr_loopback;
> > + if (IN6_ARE_ADDR_EQUAL(&b->eaddr, &c->ip6.gw))
> > + b->eaddr.a6 = in6addr_loopback;
>
> I haven't tested this, but I'm a bit lost: I thought that in this case
> we would also set b->faddr here. Where does that happen?
Ah, right. Notionally we should set tgt->faddr here. However,
because in this case we're forwarding to PIF_HOST, we don't actually
know tgt->faddr (or tgt->fport) without a getsockname() call, so we
leave them blank. They will, in fact, be blank because we zero the
entire entry in flow_alloc().
That's pretty non-obvious, though, so I'll change this to set faddr
and fport explicitly, with a comment.
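As a rough illustration of the idea only (this is not the actual
change): the struct, the constants and the function name below are
simplified, IPv4-only stand-ins invented for this sketch, not the
real struct flowside or fwd_from_tap().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct flowside (IPv4 only) */
struct side_sketch {
	uint32_t eaddr, faddr;	/* Endpoint / forwarding address */
	uint16_t eport, fport;	/* Endpoint / forwarding port */
};

#define GW_SKETCH	0x0a000202u	/* 10.0.2.2, assumed gateway address */
#define LOOPBACK_SKETCH	0x7f000001u	/* 127.0.0.1 */

/* Fill in the host-side target from the tap-side initiator */
static void from_tap_sketch(const struct side_sketch *ini,
			    struct side_sketch *tgt, int no_map_gw)
{
	assert(ini && tgt);

	tgt->eaddr = ini->faddr;
	tgt->eport = ini->fport;

	/* Map the gateway address to the host's loopback, unless disabled */
	if (!no_map_gw && tgt->eaddr == GW_SKETCH)
		tgt->eaddr = LOOPBACK_SKETCH;

	/* Our own address and port on the host side are picked by the
	 * kernel when we connect, so we can't know them yet.  Mark them
	 * as unknown explicitly, rather than relying on the allocator
	 * having zeroed the entry.
	 */
	tgt->faddr = 0;
	tgt->fport = 0;
}
```

With this, the result no longer depends on what the entry contained
before the call.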
> > + }
> > +
> > + return PIF_HOST;
> > +}
> > +
> > +uint8_t fwd_from_splice(const struct ctx *c, uint8_t proto,
> > + const struct flowside *a, struct flowside *b)
> > +{
> > + const struct in_addr *ae4 = inany_v4(&a->eaddr);
> > +
> > + if (!inany_is_loopback(&a->eaddr) ||
> > + (!inany_is_loopback(&a->faddr) && !inany_is_unspecified(&a->faddr))) {
> > + char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
> > +
> > + debug("Non loopback address on %s: [%s]:%hu -> [%s]:%hu",
> > + pif_name(PIF_SPLICE),
> > + inany_ntop(&a->eaddr, estr, sizeof(estr)), a->eport,
> > + inany_ntop(&a->faddr, fstr, sizeof(fstr)), a->fport);
> > + return PIF_NONE;
> > + }
> > +
> > + if (ae4)
> > + inany_from_af(&b->eaddr, AF_INET, &in4addr_loopback);
> > + else
> > + inany_from_af(&b->eaddr, AF_INET6, &in6addr_loopback);
> > +
> > + b->eport = a->fport;
> > +
> > + if (proto == IPPROTO_TCP)
> > + b->eport += c->tcp.fwd_out.delta[b->eport];
> > +
> > + return PIF_HOST;
> > +}
> > +
> > +uint8_t fwd_from_host(const struct ctx *c, uint8_t proto,
> > + const struct flowside *a, struct flowside *b)
> > +{
> > + struct in_addr *bf4;
> > +
> > + if (c->mode == MODE_PASTA && inany_is_loopback(&a->eaddr) &&
> > + proto == IPPROTO_TCP) {
> > + /* spliceable */
>
> Before we conclude this, does f->pif[INISIDE] == PIF_HOST in the caller
> guarantee that inany_is_loopback(&a->faddr), too?
Only in the sense that if we accept()ed a connection from a loopback
address on a socket not bound to a loopback address (or ANY), then the
kernel has done something wrong. This is more or less the inverse of
the issue above: we don't necessarily know the forwarding address
here. We can only learn it with a getsockname() call, or by looking at
the bound address of the listening socket (which might be unspecified).
> If not, we shouldn't
> splice unless that's true as well.
So I'm pretty confident what we do here is equivalent to what we did
before. The old behaviour might not be correct, but fixing that is for
a different patch. Making problems like that more obvious is one of
the advantages I expect from gathering all this forwarding logic into
one place.
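For reference, a minimal sketch of what checking the forwarding
address of an accepted socket with getsockname() could look like.
The helper name is made up for illustration, it handles IPv4 only,
and nothing like it is part of this patch.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: look up the local address of IPv4 socket @s
 *
 * Return: 1 if the local address is loopback, 0 if it isn't, -1 on
 *         error or if the socket isn't IPv4.  On success, @local is
 *         set to the local address.
 */
static int sock_local_is_loopback4(int s, struct in_addr *local)
{
	struct sockaddr_in sa;
	socklen_t sl = sizeof(sa);

	assert(local);

	if (getsockname(s, (struct sockaddr *)&sa, &sl) < 0)
		return -1;
	if (sa.sin_family != AF_INET)
		return -1;

	*local = sa.sin_addr;

	/* 127.0.0.0/8 is loopback */
	return (ntohl(sa.sin_addr.s_addr) >> 24) == 127;
}
```

This costs a system call per connection, which is the reason for not
doing it unconditionally.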
> > + b->faddr = a->eaddr;
> > +
> > + if (inany_v4(&a->eaddr))
> > + inany_from_af(&b->eaddr, AF_INET, &in4addr_loopback);
> > + else
> > + inany_from_af(&b->eaddr, AF_INET6, &in6addr_loopback);
> > + b->eport = a->fport;
> > + if (proto == IPPROTO_TCP)
> > + b->eport += c->tcp.fwd_in.delta[b->eport];
> > +
> > + return PIF_SPLICE;
> > + }
> > +
> > + b->faddr = a->eaddr;
> > + b->fport = a->eport;
> > +
> > + bf4 = inany_v4(&b->faddr);
> > +
> > + if (bf4) {
> > + if (IN4_IS_ADDR_LOOPBACK(bf4) ||
> > + IN4_IS_ADDR_UNSPECIFIED(bf4) ||
> > + IN4_ARE_ADDR_EQUAL(bf4, &c->ip4.addr_seen))
> > + *bf4 = c->ip4.gw;
> > + } else {
> > + struct in6_addr *bf6 = &b->faddr.a6;
> > +
> > + if (IN6_IS_ADDR_LOOPBACK(bf6) ||
> > + IN6_ARE_ADDR_EQUAL(bf6, &c->ip6.addr_seen) ||
> > + IN6_ARE_ADDR_EQUAL(bf6, &c->ip6.addr)) {
> > + if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
> > + *bf6 = c->ip6.gw;
> > + else
> > + *bf6 = c->ip6.addr_ll;
> > + }
> > + }
> > +
> > + if (bf4) {
> > + inany_from_af(&b->eaddr, AF_INET, &c->ip4.addr_seen);
> > + } else {
> > + if (IN6_IS_ADDR_LINKLOCAL(&b->faddr.a6))
> > + b->eaddr.a6 = c->ip6.addr_ll_seen;
> > + else
> > + b->eaddr.a6 = c->ip6.addr_seen;
> > + }
> > +
> > + b->eport = a->fport;
> > + if (proto == IPPROTO_TCP)
> > + b->eport += c->tcp.fwd_in.delta[b->eport];
>
> As we do this in any case, spliced or not spliced, I would find it less
> confusing to have these assignments in common, earlier (I just spent
> half an hour trying to figure out why you wouldn't set b->eport for the
> non-spliced case...).
Fair point. This was just because I thought my way through the two
cases separately. I've made this stanza common.
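Roughly, the common part boils down to something like the sketch
below. The struct and function names are invented for illustration,
and the table is assumed to hold 16-bit per-port deltas that wrap
modulo 65536, as in the quoted code.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PORTS_SKETCH	(1U << 16)

/* Hypothetical stand-in for the per-port delta table (c->tcp.fwd_in) */
struct fwd_ports_sketch {
	uint16_t delta[NUM_PORTS_SKETCH];
};

/* Translate destination port @port seen on the initiating side into
 * the port to use on the target side, spliced or not.
 */
static uint16_t fwd_target_port(const struct fwd_ports_sketch *fwd,
				uint16_t port)
{
	assert(fwd);

	/* The cast makes the modulo 65536 wraparound explicit */
	return (uint16_t)(port + fwd->delta[port]);
}
```

A delta of 0 leaves the port unchanged, and a "negative" remapping
(say, 8080 to 80) is stored as its 16-bit wrapped equivalent.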
[snip]
> > +static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s,
> > const struct timespec *now)
> > {
> > - union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */
> > - struct tcp_tap_conn *conn;
> > - in_port_t srcport;
> > + struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
> > uint64_t hash;
> >
> > - inany_from_sockaddr(&saddr, &srcport, sa);
> > - tcp_snat_inbound(c, &saddr);
> > -
> > - if (inany_v4(&saddr)) {
> > - inany_from_af(&daddr, AF_INET, &c->ip4.addr_seen);
> > - } else {
> > - if (IN6_IS_ADDR_LINKLOCAL(&saddr))
> > - daddr.a6 = c->ip6.addr_ll_seen;
> > - else
> > - daddr.a6 = c->ip6.addr_seen;
> > - }
> > - dstport += c->tcp.fwd_in.delta[dstport];
> > -
> > - flow_forward_af(flow, PIF_TAP, AF_INET6,
> > - &saddr, srcport, &daddr, dstport);
> > - conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
> > -
> > +
>
> Excess newline and tab.
Looks like I already fixed that.
[snip]
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -395,71 +395,18 @@ static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af)
> > /**
> > * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection
> > * @c: Execution context
> > - * @pif0: pif id of side 0
> > - * @dstport: Side 0 destination port of connection
> > * @flow: flow to initialise
> > * @s0: Accepted (side 0) socket
> > * @sa: Peer address of connection
> > *
> > - * Return: true if able to create a spliced connection, false otherwise
>
> Not related to this patch, but I think we should probably describe in
> the theory of operation for flows what's the threshold between calling
> flow_alloc_cancel() on a flow (which would imply returning something
> here, in case tcp_splice_connect() fails), and deferring that instead
> to a CLOSING state.
That's included in the new descriptions of the flow states. There
might be a way to make it more obvious, but I'm not immediately sure
what it would be. In any case, the answer is: you can't cancel a flow
once you've called FLOW_ACTIVATE().
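To illustrate the rule, with a deliberately reduced state machine:
the enum, struct and function here are hypothetical and don't match
the real flow states or flow_alloc_cancel().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced set of flow states, for illustration only */
enum flow_state_sketch {
	SK_FREE,	/* Entry unused */
	SK_INI,		/* Initiating side recorded */
	SK_FWD,		/* Forwarded side recorded */
	SK_ACTIVE,	/* Flow has been activated */
};

struct flow_sketch {
	enum flow_state_sketch state;
};

/* Give up on a flow that is still being set up
 *
 * Return: true if the flow was cancelled, false if it was already
 *         active, in which case it has to be closed the normal way
 */
static bool flow_cancel_sketch(struct flow_sketch *f)
{
	assert(f);

	if (f->state == SK_ACTIVE)
		return false;

	f->state = SK_FREE;
	return true;
}
```

So a failure before activation can simply free the entry, whereas a
failure afterwards has to go through the closing path.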
[snip]
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson