* [PATCH 0/4] Even more flow table preliminaries
@ 2024-06-05 1:38 David Gibson
2024-06-05 1:39 ` [PATCH 1/4] util: Split construction of bind socket address from the rest of sock_l4() David Gibson
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: David Gibson @ 2024-06-05 1:38 UTC
To: Stefano Brivio, passt-dev; +Cc: David Gibson
I hoped that the last batch was the last, but I was wrong. Working on
UDP flow has shown up a few more things that make sense to do before
taking the leap into full flow table implementation. Here's what I
have so far, though there could be even more.
David Gibson (4):
util: Split construction of bind socket address from the rest of
sock_l4()
udp: Fold checking of splice flag into udp_mmh_splice_port()
udp: Rework how we divide queued datagrams between sending methods
udp: Move management of udp[46]_localname into udp_splice_send()
udp.c | 175 ++++++++++++++++++++++++++++++++-------------------------
util.c | 123 +++++++++++++++++++++++-----------------
2 files changed, 170 insertions(+), 128 deletions(-)
--
2.45.1
* [PATCH 1/4] util: Split construction of bind socket address from the rest of sock_l4()
2024-06-05 1:38 [PATCH 0/4] Even more flow table preliminaries David Gibson
@ 2024-06-05 1:39 ` David Gibson
2024-06-13 15:06 ` Stefano Brivio
2024-06-05 1:39 ` [PATCH 2/4] udp: Fold checking of splice flag into udp_mmh_splice_port() David Gibson
` (2 subsequent siblings)
3 siblings, 1 reply; 11+ messages in thread
From: David Gibson @ 2024-06-05 1:39 UTC
To: Stefano Brivio, passt-dev; +Cc: David Gibson
sock_l4() creates, binds and otherwise prepares a new socket. It builds
the socket address to bind from separately provided address and port.
However, we have use cases coming up where it's more natural to construct
the socket address in the caller.
Prepare for this by adding sock_l4_sa() which takes a pre-constructed
socket address, and rewriting sock_l4() in terms of it.
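As an illustration of the split described above, here is a minimal standalone sketch, not the passt code itself. The names `union sa_any`, `build_bind_sa()` and `bind_sa()` are hypothetical stand-ins: the first two model a caller constructing the socket address, the last models a function like sock_l4_sa() that derives the address family from the address it is handed. Link-local scope handling, epoll registration and the IPV6_V6ONLY option are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical container large enough for either address family */
union sa_any {
	struct sockaddr sa;
	struct sockaddr_in sa4;
	struct sockaddr_in6 sa6;
};

/* Hypothetical helper: build a bind address from family, address and port.
 * A NULL @addr leaves the zeroed "any" address in place.
 * Returns the address length, or 0 for an unsupported family.
 */
static socklen_t build_bind_sa(union sa_any *out, sa_family_t af,
			       const void *addr, uint16_t port)
{
	memset(out, 0, sizeof(*out));

	switch (af) {
	case AF_INET:
		out->sa4.sin_family = AF_INET;
		out->sa4.sin_port = htons(port);
		if (addr)
			memcpy(&out->sa4.sin_addr, addr,
			       sizeof(struct in_addr));
		return sizeof(out->sa4);
	case AF_INET6:
		out->sa6.sin6_family = AF_INET6;
		out->sa6.sin6_port = htons(port);
		if (addr)
			memcpy(&out->sa6.sin6_addr, addr,
			       sizeof(struct in6_addr));
		return sizeof(out->sa6);
	default:
		return 0;
	}
}

/* Hypothetical stand-in for sock_l4_sa(): no separate family argument,
 * it is read back from the pre-constructed address.
 * Returns the bound socket, or -1 on failure.
 */
static int bind_sa(int type, const void *sa, socklen_t sl)
{
	sa_family_t af;
	int fd;

	assert(sa && sl >= sizeof(sa_family_t));

	af = ((const struct sockaddr *)sa)->sa_family;
	fd = socket(af, type, 0);
	if (fd < 0)
		return -1;

	if (bind(fd, (const struct sockaddr *)sa, sl)) {
		close(fd);
		return -1;
	}

	return fd;
}
```

A caller that already has a complete socket address can pass it straight to `bind_sa()`, while a caller with only an address and port goes through `build_bind_sa()` first, which is roughly the role sock_l4() keeps in this patch.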
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
util.c | 123 ++++++++++++++++++++++++++++++++-------------------------
1 file changed, 70 insertions(+), 53 deletions(-)
diff --git a/util.c b/util.c
index cc1c73ba..4e5f6d23 100644
--- a/util.c
+++ b/util.c
@@ -33,36 +33,25 @@
#include "log.h"
/**
- * sock_l4() - Create and bind socket for given L4, add to epoll list
+ * sock_l4_sa() - Create and bind socket for given L4, add to epoll list
* @c: Execution context
- * @af: Address family, AF_INET or AF_INET6
* @proto: Protocol number
- * @bind_addr: Address for binding, NULL for any
+ * @sa: Socket address to bind to
+ * @sl: Length of @sa
* @ifname: Interface for binding, NULL for any
- * @port: Port, host order
+ * @v6only: Set IPV6_V6ONLY socket option
* @data: epoll reference portion for protocol handlers
*
* Return: newly created socket, negative error code on failure
*/
-int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
- const void *bind_addr, const char *ifname, uint16_t port,
- uint32_t data)
+static int sock_l4_sa(const struct ctx *c, uint8_t proto,
+ const void *sa, socklen_t sl,
+ const char *ifname, bool v6only, uint32_t data)
{
+ sa_family_t af =((const struct sockaddr *)sa)->sa_family;
union epoll_ref ref = { .data = data };
- struct sockaddr_in addr4 = {
- .sin_family = AF_INET,
- .sin_port = htons(port),
- { 0 }, { 0 },
- };
- struct sockaddr_in6 addr6 = {
- .sin6_family = AF_INET6,
- .sin6_port = htons(port),
- 0, IN6ADDR_ANY_INIT, 0,
- };
- const struct sockaddr *sa;
- bool dual_stack = false;
- int fd, sl, y = 1, ret;
struct epoll_event ev;
+ int fd, y = 1, ret;
switch (proto) {
case IPPROTO_TCP:
@@ -79,13 +68,6 @@ int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
return -EPFNOSUPPORT; /* Not implemented. */
}
- if (af == AF_UNSPEC) {
- if (!DUAL_STACK_SOCKETS || bind_addr)
- return -EINVAL;
- dual_stack = true;
- af = AF_INET6;
- }
-
if (proto == IPPROTO_TCP)
fd = socket(af, SOCK_STREAM | SOCK_NONBLOCK, proto);
else
@@ -104,30 +86,9 @@ int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
ref.fd = fd;
- if (af == AF_INET) {
- if (bind_addr)
- addr4.sin_addr = *(struct in_addr *)bind_addr;
-
- sa = (const struct sockaddr *)&addr4;
- sl = sizeof(addr4);
- } else {
- if (bind_addr) {
- addr6.sin6_addr = *(struct in6_addr *)bind_addr;
-
- if (!memcmp(bind_addr, &c->ip6.addr_ll,
- sizeof(c->ip6.addr_ll)))
- addr6.sin6_scope_id = c->ifi6;
- }
-
- sa = (const struct sockaddr *)&addr6;
- sl = sizeof(addr6);
-
- if (!dual_stack)
- if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY,
- &y, sizeof(y)))
- debug("Failed to set IPV6_V6ONLY on socket %i",
- fd);
- }
+ if (v6only)
+ if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &y, sizeof(y)))
+ debug("Failed to set IPV6_V6ONLY on socket %i", fd);
if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &y, sizeof(y)))
debug("Failed to set SO_REUSEADDR on socket %i", fd);
@@ -140,9 +101,12 @@ int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
*/
if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
ifname, strlen(ifname))) {
+ char str[SOCKADDR_STRLEN];
+
ret = -errno;
- warn("Can't bind %s socket for port %u to %s, closing",
- EPOLL_TYPE_STR(proto), port, ifname);
+ warn("Can't bind %s socket for %s to %s, closing",
+ EPOLL_TYPE_STR(proto),
+ sockaddr_ntop(sa, str, sizeof(str)), ifname);
close(fd);
return ret;
}
@@ -178,6 +142,59 @@ int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
return fd;
}
+/**
+ * sock_l4() - Create and bind socket for given L4, add to epoll list
+ * @c: Execution context
+ * @af: Address family, AF_INET or AF_INET6
+ * @proto: Protocol number
+ * @bind_addr: Address for binding, NULL for any
+ * @ifname: Interface for binding, NULL for any
+ * @port: Port, host order
+ * @data: epoll reference portion for protocol handlers
+ *
+ * Return: newly created socket, negative error code on failure
+ */
+int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
+ const void *bind_addr, const char *ifname, uint16_t port,
+ uint32_t data)
+{
+ switch (af) {
+ case AF_INET: {
+ struct sockaddr_in addr4 = {
+ .sin_family = AF_INET,
+ .sin_port = htons(port),
+ { 0 }, { 0 },
+ };
+ if (bind_addr)
+ addr4.sin_addr = *(struct in_addr *)bind_addr;
+ return sock_l4_sa(c, proto, &addr4, sizeof(addr4), ifname,
+ false, data);
+ }
+
+ case AF_UNSPEC:
+ if (!DUAL_STACK_SOCKETS || bind_addr)
+ return -EINVAL;
+ /* fallthrough */
+ case AF_INET6: {
+ struct sockaddr_in6 addr6 = {
+ .sin6_family = AF_INET6,
+ .sin6_port = htons(port),
+ 0, IN6ADDR_ANY_INIT, 0,
+ };
+ if (bind_addr) {
+ addr6.sin6_addr = *(struct in6_addr *)bind_addr;
+
+ if (!memcmp(bind_addr, &c->ip6.addr_ll,
+ sizeof(c->ip6.addr_ll)))
+ addr6.sin6_scope_id = c->ifi6;
+ }
+ return sock_l4_sa(c, proto, &addr6, sizeof(addr6), ifname,
+ af == AF_INET6, data);
+ }
+ default:
+ return -EINVAL;
+ }
+}
/**
* sock_probe_mem() - Check if setting high SO_SNDBUF and SO_RCVBUF is allowed
--
2.45.1
* [PATCH 2/4] udp: Fold checking of splice flag into udp_mmh_splice_port()
2024-06-05 1:38 [PATCH 0/4] Even more flow table preliminaries David Gibson
2024-06-05 1:39 ` [PATCH 1/4] util: Split construction of bind socket address from the rest of sock_l4() David Gibson
@ 2024-06-05 1:39 ` David Gibson
2024-06-13 15:06 ` Stefano Brivio
2024-06-05 1:39 ` [PATCH 3/4] udp: Rework how we divide queued datagrams between sending methods David Gibson
2024-06-05 1:39 ` [PATCH 4/4] udp: Move management of udp[46]_localname into udp_splice_send() David Gibson
3 siblings, 1 reply; 11+ messages in thread
From: David Gibson @ 2024-06-05 1:39 UTC
To: Stefano Brivio, passt-dev; +Cc: David Gibson
udp_mmh_splice_port() is used to determine if a UDP datagram can be
"spliced" (forwarded via a socket instead of tap). We only invoke it if
the origin socket has the 'splice' flag set.
Fold the checking of the flag into the helper itself, which makes the
caller simpler. It does mean we have a loop looking for a batch of
spliceable or non-spliceable packets even in the case where the flag is
clear. This shouldn't be that expensive though, since each call to
udp_mmh_splice_port() will return without accessing memory in that case.
In any case we're going to need a similar loop in more cases with upcoming
flow table work.
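The effect of folding the flag check into the helper can be sketched in isolation as follows. This is a simplified model under stated assumptions, not the passt code: `struct ref`, `struct dgram`, `splice_port()` and `batch_len()` are hypothetical names, and a boolean stands in for the loopback test on the source address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the origin socket's epoll reference */
struct ref {
	bool splice;
};

/* Hypothetical stand-in for one received datagram */
struct dgram {
	bool loopback;	/* source address is 127.0.0.1 or ::1 */
	int sport;	/* source port, host order */
};

/* Returns the source port if the datagram can be spliced, otherwise -1.
 * With the flag clear it returns before dereferencing @d.
 */
static int splice_port(struct ref r, const struct dgram *d)
{
	assert(d);

	if (!r.splice)
		return -1;

	return d->loopback ? d->sport : -1;
}

/* Length of the batch starting at @i: consecutive datagrams whose
 * splice_port() result matches that of the first one.
 */
static size_t batch_len(struct ref r, const struct dgram *d,
			size_t i, size_t n)
{
	int first = splice_port(r, d + i);
	size_t m;

	for (m = 1; i + m < n; m++)
		if (splice_port(r, d + i + m) != first)
			break;

	return m;
}
```

With the flag clear, every call returns -1, so the loop runs to the end of the queue and the whole set of datagrams forms one non-spliced batch, which matches the result of the old `m = n` shortcut.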
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
udp.c | 31 ++++++++++++++++---------------
1 file changed, 16 insertions(+), 15 deletions(-)
diff --git a/udp.c b/udp.c
index 3abafc99..7487d2b2 100644
--- a/udp.c
+++ b/udp.c
@@ -467,21 +467,25 @@ static int udp_splice_new_ns(void *arg)
/**
* udp_mmh_splice_port() - Is source address of message suitable for splicing?
- * @v6: Is @sa a sockaddr_in6 (otherwise sockaddr_in)?
+ * @uref: UDP epoll reference for incoming message's origin socket
* @mmh: mmsghdr of incoming message
*
- * Return: if @sa refers to localhost (127.0.0.1 or ::1) the port from
- * @sa in host order, otherwise -1.
+ * Return: if source address of message in @mmh refers to localhost (127.0.0.1
+ * or ::1) its source port (host order), otherwise -1.
*/
-static int udp_mmh_splice_port(bool v6, const struct mmsghdr *mmh)
+static int udp_mmh_splice_port(union udp_epoll_ref uref,
+ const struct mmsghdr *mmh)
{
const struct sockaddr_in6 *sa6 = mmh->msg_hdr.msg_name;
const struct sockaddr_in *sa4 = mmh->msg_hdr.msg_name;
- if (v6 && IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
+ if (!uref.splice)
+ return -1;
+
+ if (uref.v6 && IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
return ntohs(sa6->sin6_port);
- if (!v6 && IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr))
+ if (!uref.v6 && IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr))
return ntohs(sa4->sin_port);
return -1;
@@ -768,18 +772,15 @@ void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
for (i = 0; i < n; i += m) {
int splicefrom = -1;
- m = n;
- if (ref.udp.splice) {
- splicefrom = udp_mmh_splice_port(v6, mmh_recv + i);
+ splicefrom = udp_mmh_splice_port(ref.udp, mmh_recv + i);
- for (m = 1; i + m < n; m++) {
- int p;
+ for (m = 1; i + m < n; m++) {
+ int p;
- p = udp_mmh_splice_port(v6, mmh_recv + i + m);
- if (p != splicefrom)
- break;
- }
+ p = udp_mmh_splice_port(ref.udp, mmh_recv + i + m);
+ if (p != splicefrom)
+ break;
}
if (splicefrom >= 0)
--
2.45.1
* [PATCH 3/4] udp: Rework how we divide queued datagrams between sending methods
2024-06-05 1:38 [PATCH 0/4] Even more flow table preliminaries David Gibson
2024-06-05 1:39 ` [PATCH 1/4] util: Split construction of bind socket address from the rest of sock_l4() David Gibson
2024-06-05 1:39 ` [PATCH 2/4] udp: Fold checking of splice flag into udp_mmh_splice_port() David Gibson
@ 2024-06-05 1:39 ` David Gibson
2024-06-13 18:21 ` Stefano Brivio
2024-06-05 1:39 ` [PATCH 4/4] udp: Move management of udp[46]_localname into udp_splice_send() David Gibson
3 siblings, 1 reply; 11+ messages in thread
From: David Gibson @ 2024-06-05 1:39 UTC
To: Stefano Brivio, passt-dev; +Cc: David Gibson
udp_sock_handler() takes a number of datagrams from sockets that, depending
on their addresses, could be forwarded either to the L2 interface ("tap")
or to another socket ("spliced"). In the latter case we can also only
send packets together if they have the same source port, and therefore
are sent via the same socket.
To reduce the total number of system calls we gather contiguous batches of
datagrams with the same destination interface and socket where applicable.
The target is determined by udp_mmh_splice_port(), which returns the
source port for spliced packets and -1 for "tap" packets.
We find batches by looking ahead in our queue until we find a datagram
whose "splicefrom" port doesn't match the first in our current batch.
udp_mmh_splice_port() is moderately expensive, since it must examine IPv6
addresses. But unfortunately we can call it twice on the same datagram:
once as the (last + 1) entry in one batch (showing that it's not in that
batch), then again as the first entry in the next batch.
Avoid this by keeping track of the "splice port" in the metadata structure,
and filling it in one entry ahead of the one we're currently considering.
This is a bit subtle, but not that hard. It will also generalise better
when we have more complex possibilities based on the flow table.
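The "one entry ahead" bookkeeping can be sketched on its own as below. This is a minimal model, not the passt code: the arrays `sport[]` and `cached[]`, the call counter, and the functions `classify()`, `consume()` and `run()` are all hypothetical. `classify()` plays the part of udp_mmh_splice_port(), `cached[]` the part of the new splicesrc field, and `consume()` the part of either send function.

```c
#include <assert.h>
#include <stddef.h>

#define NDGRAMS 5

/* Hypothetical queue: source port for spliceable datagrams, -1 for tap */
static const int sport[NDGRAMS] = { 5000, 5000, -1, -1, 6000 };

/* Cached classification, filled one entry ahead of the consumer */
static int cached[NDGRAMS];

/* Counts how often the (notionally expensive) classification runs */
static unsigned classify_calls;

static int classify(size_t i)
{
	classify_calls++;
	return sport[i];
}

/* Consume one batch starting at @start. Requires cached[@start] to be
 * filled in already, and fills in cached[] for every entry consumed and
 * one more, if present. Returns the number of entries consumed.
 */
static size_t consume(size_t start, size_t n)
{
	int key = cached[start];
	size_t i = start;

	do {
		/* per-datagram preparation would go here */
		if (++i >= n)
			break;

		cached[i] = classify(i);
	} while (cached[i] == key);

	return i - start;
}

/* Walk the first @n entries of the queue, returning the batch count */
static size_t run(size_t n)
{
	size_t i, m, batches = 0;

	assert(n <= NDGRAMS);
	if (!n)
		return 0;

	classify_calls = 0;
	cached[0] = classify(0);	/* prime entry 0 before the loop */

	for (i = 0; i < n; i += m) {
		m = consume(i, n);
		batches++;
	}

	return batches;
}
```

In this model each entry is classified exactly once, including the entry that ends one batch and starts the next, because the value that terminated the look-ahead is still in `cached[]` when the next batch begins.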
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
udp.c | 147 ++++++++++++++++++++++++++++++++++------------------------
1 file changed, 86 insertions(+), 61 deletions(-)
diff --git a/udp.c b/udp.c
index 7487d2b2..757c10ab 100644
--- a/udp.c
+++ b/udp.c
@@ -198,6 +198,7 @@ static struct ethhdr udp6_eth_hdr;
* @ip4h: Pre-filled IPv4 header (except for tot_len and saddr)
* @taph: Tap backend specific header
* @s_in: Source socket address, filled in by recvmmsg()
+ * @splicesrc: Source port for splicing, or -1 if not spliceable
*/
static struct udp_meta_t {
struct ipv6hdr ip6h;
@@ -205,6 +206,7 @@ static struct udp_meta_t {
struct tap_hdr taph;
union sockaddr_inany s_in;
+ int splicesrc;
}
#ifdef __AVX2__
__attribute__ ((aligned(32)))
@@ -492,28 +494,32 @@ static int udp_mmh_splice_port(union udp_epoll_ref uref,
}
/**
- * udp_splice_sendfrom() - Send datagrams from given port to given port
+ * udp_splice_send() - Send datagrams from socket to socket
* @c: Execution context
* @start: Index of first datagram in udp[46]_l2_buf
- * @n: Number of datagrams to send
- * @src: Datagrams will be sent from this port (on origin side)
- * @dst: Datagrams will be send to this port (on destination side)
- * @from_pif: pif from which the packet originated
- * @v6: Send as IPv6?
- * @allow_new: If true create sending socket if needed, if false discard
- * if no sending socket is available
+ * @n: Total number of datagrams in udp[46]_l2_buf pool
+ * @dst: Datagrams will be sent to this port (on destination side)
+ * @uref: UDP epoll reference for origin socket
* @now: Timestamp
+ *
+ * This consumes as many frames as are sendable via a single socket. It
+ * requires that udp_meta[@start].splicesrc is initialised, and will initialise
+ * udp_meta[].splicesrc for each frame it consumes *and one more* (if present).
+ *
+ * Return: Number of frames sent
*/
-static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n,
- in_port_t src, in_port_t dst, uint8_t from_pif,
- bool v6, bool allow_new,
+static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n,
+ in_port_t dst, union udp_epoll_ref uref,
const struct timespec *now)
{
+ in_port_t src = udp_meta[start].splicesrc;
struct mmsghdr *mmh_recv, *mmh_send;
- unsigned int i;
+ unsigned int i = start;
int s;
- if (v6) {
+ ASSERT(udp_meta[start].splicesrc >= 0);
+
+ if (uref.v6) {
mmh_recv = udp6_l2_mh_sock;
mmh_send = udp6_mh_splice;
} else {
@@ -521,40 +527,48 @@ static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n,
mmh_send = udp4_mh_splice;
}
- if (from_pif == PIF_SPLICE) {
+ do {
+ mmh_send[i].msg_hdr.msg_iov->iov_len = mmh_recv[i].msg_len;
+
+ if (++i >= n)
+ break;
+
+ udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]);
+ } while (udp_meta[i].splicesrc == src);
+
+ if (uref.pif == PIF_SPLICE) {
src += c->udp.fwd_in.rdelta[src];
- s = udp_splice_init[v6][src].sock;
- if (s < 0 && allow_new)
- s = udp_splice_new(c, v6, src, false);
+ s = udp_splice_init[uref.v6][src].sock;
+ if (s < 0 && uref.orig)
+ s = udp_splice_new(c, uref.v6, src, false);
if (s < 0)
- return;
+ goto out;
- udp_splice_ns[v6][dst].ts = now->tv_sec;
- udp_splice_init[v6][src].ts = now->tv_sec;
+ udp_splice_ns[uref.v6][dst].ts = now->tv_sec;
+ udp_splice_init[uref.v6][src].ts = now->tv_sec;
} else {
- ASSERT(from_pif == PIF_HOST);
+ ASSERT(uref.pif == PIF_HOST);
src += c->udp.fwd_out.rdelta[src];
- s = udp_splice_ns[v6][src].sock;
- if (s < 0 && allow_new) {
+ s = udp_splice_ns[uref.v6][src].sock;
+ if (s < 0 && uref.orig) {
struct udp_splice_new_ns_arg arg = {
- c, v6, src, -1,
+ c, uref.v6, src, -1,
};
NS_CALL(udp_splice_new_ns, &arg);
s = arg.s;
}
if (s < 0)
- return;
+ goto out;
- udp_splice_init[v6][dst].ts = now->tv_sec;
- udp_splice_ns[v6][src].ts = now->tv_sec;
+ udp_splice_init[uref.v6][dst].ts = now->tv_sec;
+ udp_splice_ns[uref.v6][src].ts = now->tv_sec;
}
- for (i = start; i < start + n; i++)
- mmh_send[i].msg_hdr.msg_iov->iov_len = mmh_recv[i].msg_len;
-
- sendmmsg(s, mmh_send + start, n, MSG_NOSIGNAL);
+ sendmmsg(s, mmh_send + start, i - start, MSG_NOSIGNAL);
+out:
+ return i - start;
}
/**
@@ -687,31 +701,41 @@ static size_t udp_update_hdr6(const struct ctx *c,
* udp_tap_send() - Prepare UDP datagrams and send to tap interface
* @c: Execution context
* @start: Index of first datagram in udp[46]_l2_buf pool
- * @n: Number of datagrams to send
- * @dstport: Destination port number
- * @v6: True if using IPv6
+ * @n: Total number of datagrams in udp[46]_l2_buf pool
+ * @dstport: Destination port number on destination side
+ * @uref: UDP epoll reference for origin socket
* @now: Current timestamp
*
- * Return: size of tap frame with headers
+ * This consumes as many frames as are sendable via tap. It requires that
+ * udp_meta[@start].splicesrc is initialised, and will initialise
+ * udp_meta[].splicesrc for each frame it consumes *and one more* (if present).
+ *
+ * Return: Number of frames sent via tap
*/
-static void udp_tap_send(const struct ctx *c,
- unsigned int start, unsigned int n,
- in_port_t dstport, bool v6, const struct timespec *now)
+static unsigned udp_tap_send(const struct ctx *c, size_t start, size_t n,
+ in_port_t dstport, union udp_epoll_ref uref,
+ const struct timespec *now)
{
struct iovec (*tap_iov)[UDP_NUM_IOVS];
- size_t i;
+ struct mmsghdr *mmh_recv;
+ size_t i = start;
- if (v6)
+ ASSERT(udp_meta[start].splicesrc == -1);
+
+ if (uref.v6) {
tap_iov = udp6_l2_iov_tap;
- else
+ mmh_recv = udp6_l2_mh_sock;
+ } else {
+ mmh_recv = udp4_l2_mh_sock;
tap_iov = udp4_l2_iov_tap;
+ }
- for (i = start; i < start + n; i++) {
+ do {
struct udp_payload_t *bp = &udp_payload[i];
struct udp_meta_t *bm = &udp_meta[i];
size_t l4len;
- if (v6) {
+ if (uref.v6) {
l4len = udp_update_hdr6(c, bm, bp, dstport,
udp6_l2_mh_sock[i].msg_len, now);
} else {
@@ -719,9 +743,15 @@ static void udp_tap_send(const struct ctx *c,
udp4_l2_mh_sock[i].msg_len, now);
}
tap_iov[i][UDP_IOV_PAYLOAD].iov_len = l4len;
- }
- tap_send_frames(c, &tap_iov[start][0], UDP_NUM_IOVS, n);
+ if (++i >= n)
+ break;
+
+ udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]);
+ } while (udp_meta[i].splicesrc == -1);
+
+ tap_send_frames(c, &tap_iov[start][0], UDP_NUM_IOVS, i - start);
+ return i - start;
}
/**
@@ -770,24 +800,19 @@ void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
if (n <= 0)
return;
+ /* We divide things into batches based on how we need to send them,
+ * determined by udp_meta[i].splicesrc. To avoid either two passes
+ * through the array, or recalculating splicesrc for a single entry, we
+ * have to populate it one entry *ahead* of the loop counter (if
+ * present). So we fill in entry 0 before the loop, then udp_*_send()
+ * populate one entry past where they consume.
+ */
+ udp_meta[0].splicesrc = udp_mmh_splice_port(ref.udp, mmh_recv);
for (i = 0; i < n; i += m) {
- int splicefrom = -1;
-
- splicefrom = udp_mmh_splice_port(ref.udp, mmh_recv + i);
-
- for (m = 1; i + m < n; m++) {
- int p;
-
- p = udp_mmh_splice_port(ref.udp, mmh_recv + i + m);
- if (p != splicefrom)
- break;
- }
-
- if (splicefrom >= 0)
- udp_splice_sendfrom(c, i, m, splicefrom, dstport,
- ref.udp.pif, v6, ref.udp.orig, now);
+ if (udp_meta[i].splicesrc >= 0)
+ m = udp_splice_send(c, i, n, dstport, ref.udp, now);
else
- udp_tap_send(c, i, m, dstport, v6, now);
+ m = udp_tap_send(c, i, n, dstport, ref.udp, now);
}
}
--
2.45.1
* [PATCH 4/4] udp: Move management of udp[46]_localname into udp_splice_send()
2024-06-05 1:38 [PATCH 0/4] Even more flow table preliminaries David Gibson
` (2 preceding siblings ...)
2024-06-05 1:39 ` [PATCH 3/4] udp: Rework how we divide queued datagrams between sending methods David Gibson
@ 2024-06-05 1:39 ` David Gibson
3 siblings, 0 replies; 11+ messages in thread
From: David Gibson @ 2024-06-05 1:39 UTC (permalink / raw)
To: Stefano Brivio, passt-dev; +Cc: David Gibson
Mostly, udp_sock_handler() is independent of how the datagrams it processes
will be forwarded (tap or splice). However, it also updates the msg_name
fields for spliced sends, which doesn't really make sense here. Move it
into udp_splice_send() which is all about spliced sends. This does
potentially mean we'll update the field to the same value several times,
but we're going to need this in future anyway: with the extensions the
flow table allows, it might not be the same value each time after all.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
udp.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/udp.c b/udp.c
index 757c10ab..8bd00a87 100644
--- a/udp.c
+++ b/udp.c
@@ -522,9 +522,11 @@ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n,
if (uref.v6) {
mmh_recv = udp6_l2_mh_sock;
mmh_send = udp6_mh_splice;
+ udp6_localname.sin6_port = htons(dst);
} else {
mmh_recv = udp4_l2_mh_sock;
mmh_send = udp4_mh_splice;
+ udp4_localname.sin_port = htons(dst);
}
do {
@@ -788,13 +790,10 @@ void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
else if (ref.udp.pif == PIF_HOST)
dstport += c->udp.fwd_in.f.delta[dstport];
- if (v6) {
+ if (v6)
mmh_recv = udp6_l2_mh_sock;
- udp6_localname.sin6_port = htons(dstport);
- } else {
+ else
mmh_recv = udp4_l2_mh_sock;
- udp4_localname.sin_port = htons(dstport);
- }
n = recvmmsg(ref.fd, mmh_recv, n, 0, NULL);
if (n <= 0)
--
2.45.1
* Re: [PATCH 1/4] util: Split construction of bind socket address from the rest of sock_l4()
2024-06-05 1:39 ` [PATCH 1/4] util: Split construction of bind socket address from the rest of sock_l4() David Gibson
@ 2024-06-13 15:06 ` Stefano Brivio
2024-06-14 0:47 ` David Gibson
0 siblings, 1 reply; 11+ messages in thread
From: Stefano Brivio @ 2024-06-13 15:06 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
Sorry for the delay, nits only (I can fix them up on merge):
On Wed, 5 Jun 2024 11:39:00 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> sock_l4() creates, binds and otherwise prepares a new socket. It builds
> the socket address to bind from separately provided address and port.
> However, we have use cases coming up where it's more natural to construct
> the socket address in the caller.
>
> Prepare for this by adding sock_l4_sa() which takes a pre-constructed
> socket address, and rewriting sock_l4() in terms of it.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> util.c | 123 ++++++++++++++++++++++++++++++++-------------------------
> 1 file changed, 70 insertions(+), 53 deletions(-)
>
> diff --git a/util.c b/util.c
> index cc1c73ba..4e5f6d23 100644
> --- a/util.c
> +++ b/util.c
> @@ -33,36 +33,25 @@
> #include "log.h"
>
> /**
> - * sock_l4() - Create and bind socket for given L4, add to epoll list
> + * sock_l4_sa() - Create and bind socket for given L4, add to epoll list
That doesn't quite tell the difference from sock_l4(), perhaps:
* sock_l4_sa() - Create and bind socket given socket address, add to epoll list
> * @c: Execution context
> - * @af: Address family, AF_INET or AF_INET6
> * @proto: Protocol number
> - * @bind_addr: Address for binding, NULL for any
> + * @sa: Socket address to bind to
> + * @sl: Length of @sa
> * @ifname: Interface for binding, NULL for any
> - * @port: Port, host order
> + * @v6only: Set IPV6_V6ONLY socket option
> * @data: epoll reference portion for protocol handlers
> *
> * Return: newly created socket, negative error code on failure
> */
> -int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
> - const void *bind_addr, const char *ifname, uint16_t port,
> - uint32_t data)
> +static int sock_l4_sa(const struct ctx *c, uint8_t proto,
> + const void *sa, socklen_t sl,
> + const char *ifname, bool v6only, uint32_t data)
> {
> + sa_family_t af =((const struct sockaddr *)sa)->sa_family;
Missing whitespace after =.
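For illustration, a minimal standalone sketch of the idiom on that line: a caller builds the socket address, and the callee recovers the address family from an untyped pointer. This is not passt code; sa_family_of(), make_addr4() and make_addr6() are hypothetical helper names.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: address family of a pre-built socket address passed
 * as an untyped pointer, the way sock_l4_sa() receives @sa.
 */
static sa_family_t sa_family_of(const void *sa)
{
	assert(sa);
	return ((const struct sockaddr *)sa)->sa_family;
}

/* Hypothetical helper: IPv4 wildcard address for @port (host order) */
static struct sockaddr_in make_addr4(uint16_t port)
{
	struct sockaddr_in a;

	memset(&a, 0, sizeof(a));
	a.sin_family = AF_INET;
	a.sin_port = htons(port);
	return a;
}

/* Hypothetical helper: IPv6 wildcard address for @port (host order) */
static struct sockaddr_in6 make_addr6(uint16_t port)
{
	struct sockaddr_in6 a;

	memset(&a, 0, sizeof(a));
	a.sin6_family = AF_INET6;
	a.sin6_port = htons(port);
	a.sin6_addr = in6addr_any;
	return a;
}
```

With the address built by the caller, the callee needs only the pointer and its length, and no separate address family argument.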
> union epoll_ref ref = { .data = data };
> - struct sockaddr_in addr4 = {
> - .sin_family = AF_INET,
> - .sin_port = htons(port),
> - { 0 }, { 0 },
> - };
> - struct sockaddr_in6 addr6 = {
> - .sin6_family = AF_INET6,
> - .sin6_port = htons(port),
> - 0, IN6ADDR_ANY_INIT, 0,
> - };
> - const struct sockaddr *sa;
> - bool dual_stack = false;
> - int fd, sl, y = 1, ret;
> struct epoll_event ev;
> + int fd, y = 1, ret;
>
> switch (proto) {
> case IPPROTO_TCP:
> @@ -79,13 +68,6 @@ int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
> return -EPFNOSUPPORT; /* Not implemented. */
> }
>
> - if (af == AF_UNSPEC) {
> - if (!DUAL_STACK_SOCKETS || bind_addr)
> - return -EINVAL;
> - dual_stack = true;
> - af = AF_INET6;
> - }
> -
> if (proto == IPPROTO_TCP)
> fd = socket(af, SOCK_STREAM | SOCK_NONBLOCK, proto);
> else
> @@ -104,30 +86,9 @@ int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
>
> ref.fd = fd;
>
> - if (af == AF_INET) {
> - if (bind_addr)
> - addr4.sin_addr = *(struct in_addr *)bind_addr;
> -
> - sa = (const struct sockaddr *)&addr4;
> - sl = sizeof(addr4);
> - } else {
> - if (bind_addr) {
> - addr6.sin6_addr = *(struct in6_addr *)bind_addr;
> -
> - if (!memcmp(bind_addr, &c->ip6.addr_ll,
> - sizeof(c->ip6.addr_ll)))
> - addr6.sin6_scope_id = c->ifi6;
> - }
> -
> - sa = (const struct sockaddr *)&addr6;
> - sl = sizeof(addr6);
> -
> - if (!dual_stack)
> - if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY,
> - &y, sizeof(y)))
> - debug("Failed to set IPV6_V6ONLY on socket %i",
> - fd);
> - }
> + if (v6only)
> + if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &y, sizeof(y)))
> + debug("Failed to set IPV6_V6ONLY on socket %i", fd);
>
> if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &y, sizeof(y)))
> debug("Failed to set SO_REUSEADDR on socket %i", fd);
> @@ -140,9 +101,12 @@ int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
> */
> if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
> ifname, strlen(ifname))) {
> + char str[SOCKADDR_STRLEN];
> +
> ret = -errno;
> - warn("Can't bind %s socket for port %u to %s, closing",
> - EPOLL_TYPE_STR(proto), port, ifname);
> + warn("Can't bind %s socket for %s to %s, closing",
> + EPOLL_TYPE_STR(proto),
> + sockaddr_ntop(sa, str, sizeof(str)), ifname);
> close(fd);
> return ret;
> }
> @@ -178,6 +142,59 @@ int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
>
> return fd;
> }
> +/**
> + * sock_l4() - Create and bind socket for given L4, add to epoll list
> + * @c: Execution context
> + * @af: Address family, AF_INET or AF_INET6
> + * @proto: Protocol number
> + * @bind_addr: Address for binding, NULL for any
> + * @ifname: Interface for binding, NULL for any
> + * @port: Port, host order
> + * @data: epoll reference portion for protocol handlers
> + *
> + * Return: newly created socket, negative error code on failure
> + */
> +int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
> + const void *bind_addr, const char *ifname, uint16_t port,
> + uint32_t data)
> +{
> + switch (af) {
> + case AF_INET: {
> + struct sockaddr_in addr4 = {
> + .sin_family = AF_INET,
> + .sin_port = htons(port),
> + { 0 }, { 0 },
> + };
> + if (bind_addr)
> + addr4.sin_addr = *(struct in_addr *)bind_addr;
> + return sock_l4_sa(c, proto, &addr4, sizeof(addr4), ifname,
> + false, data);
> + }
> +
> + case AF_UNSPEC:
> + if (!DUAL_STACK_SOCKETS || bind_addr)
> + return -EINVAL;
> + /* fallthrough */
> + case AF_INET6: {
> + struct sockaddr_in6 addr6 = {
> + .sin6_family = AF_INET6,
> + .sin6_port = htons(port),
> + 0, IN6ADDR_ANY_INIT, 0,
> + };
> + if (bind_addr) {
> + addr6.sin6_addr = *(struct in6_addr *)bind_addr;
> +
> + if (!memcmp(bind_addr, &c->ip6.addr_ll,
> + sizeof(c->ip6.addr_ll)))
> + addr6.sin6_scope_id = c->ifi6;
> + }
> + return sock_l4_sa(c, proto, &addr6, sizeof(addr6), ifname,
> + af == AF_INET6, data);
> + }
> + default:
> + return -EINVAL;
> + }
> +}
>
> /**
> * sock_probe_mem() - Check if setting high SO_SNDBUF and SO_RCVBUF is allowed
--
Stefano
* Re: [PATCH 2/4] udp: Fold checking of splice flag into udp_mmh_splice_port()
2024-06-05 1:39 ` [PATCH 2/4] udp: Fold checking of splice flag into udp_mmh_splice_port() David Gibson
@ 2024-06-13 15:06 ` Stefano Brivio
2024-06-14 0:50 ` David Gibson
0 siblings, 1 reply; 11+ messages in thread
From: Stefano Brivio @ 2024-06-13 15:06 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Wed, 5 Jun 2024 11:39:01 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> udp_mmh_splice_port() is used to determine if a UDP datagram can be
> "spliced" (forwarded via a socket instead of tap). We only invoke it if
> the origin socket has the 'splice' flag set.
>
> Fold the checking of the flag into the helper itself, which makes the
> caller simpler. It does mean we have a loop looking for a batch of
> spliceable or non-spliceable packets even in the case where the flag is
> clear. This shouldn't be that expensive though, since each call to
> udp_mmh_splice_port() will return without accessing memory in that case.
> In any case we're going to need a similar loop in more cases with upcoming
> flow table work.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> udp.c | 31 ++++++++++++++++---------------
> 1 file changed, 16 insertions(+), 15 deletions(-)
>
> diff --git a/udp.c b/udp.c
> index 3abafc99..7487d2b2 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -467,21 +467,25 @@ static int udp_splice_new_ns(void *arg)
>
> /**
> * udp_mmh_splice_port() - Is source address of message suitable for splicing?
> - * @v6: Is @sa a sockaddr_in6 (otherwise sockaddr_in)?
> + * @uref: UDP epoll reference for incoming message's origin socket
> * @mmh: mmsghdr of incoming message
> *
> - * Return: if @sa refers to localhost (127.0.0.1 or ::1) the port from
> - * @sa in host order, otherwise -1.
> + * Return: if source address of message in @mmh refers to localhost (127.0.0.1
Pre-existing, and I guess this might change again with the complete
flow table implementation, so it probably doesn't make sense to fix
this now: it's 127.0.0.0/8, not necessarily 127.0.0.1.
As to whether we actually need to preserve a source address that's not
127.0.0.1, but in 127.0.0.0/8, as we "splice", I'm not quite sure. I
think we could bind() the socket in the target namespace, but I haven't
tried, and I don't know if it makes sense at all (I can't think of any
use case).
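To make the difference between the two readings concrete, here is a minimal standalone sketch. It is not passt code and makes no claim about how IN4_IS_ADDR_LOOPBACK() is defined; in4_is_loopback_net() and in4_is_localhost() are hypothetical names.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>

/* Hypothetical check: true for any address in 127.0.0.0/8 */
static bool in4_is_loopback_net(const struct in_addr *a)
{
	assert(a);
	return (ntohl(a->s_addr) >> 24) == 127;
}

/* Hypothetical, stricter check: true only for exactly 127.0.0.1 */
static bool in4_is_localhost(const struct in_addr *a)
{
	assert(a);
	return ntohl(a->s_addr) == INADDR_LOOPBACK;
}
```

An address such as 127.1.2.3 passes the first check but not the second, which is the case the comment wording glosses over.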
> + * or ::1) its source port (host order), otherwise -1.
> */
> -static int udp_mmh_splice_port(bool v6, const struct mmsghdr *mmh)
> +static int udp_mmh_splice_port(union udp_epoll_ref uref,
> + const struct mmsghdr *mmh)
> {
> const struct sockaddr_in6 *sa6 = mmh->msg_hdr.msg_name;
> const struct sockaddr_in *sa4 = mmh->msg_hdr.msg_name;
>
> - if (v6 && IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
> + if (!uref.splice)
> + return -1;
> +
> + if (uref.v6 && IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
> return ntohs(sa6->sin6_port);
>
> - if (!v6 && IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr))
> + if (!uref.v6 && IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr))
> return ntohs(sa4->sin_port);
>
> return -1;
> @@ -768,18 +772,15 @@ void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
(now renamed to udp_buf_sock_handler() if you're wondering)
>
> for (i = 0; i < n; i += m) {
> int splicefrom = -1;
> - m = n;
>
> - if (ref.udp.splice) {
> - splicefrom = udp_mmh_splice_port(v6, mmh_recv + i);
> + splicefrom = udp_mmh_splice_port(ref.udp, mmh_recv + i);
>
> - for (m = 1; i + m < n; m++) {
> - int p;
> + for (m = 1; i + m < n; m++) {
> + int p;
>
> - p = udp_mmh_splice_port(v6, mmh_recv + i + m);
> - if (p != splicefrom)
> - break;
> - }
> + p = udp_mmh_splice_port(ref.udp, mmh_recv + i + m);
> + if (p != splicefrom)
> + break;
> }
>
> if (splicefrom >= 0)
--
Stefano
* Re: [PATCH 3/4] udp: Rework how we divide queued datagrams between sending methods
2024-06-05 1:39 ` [PATCH 3/4] udp: Rework how we divide queued datagrams between sending methods David Gibson
@ 2024-06-13 18:21 ` Stefano Brivio
2024-06-14 1:08 ` David Gibson
0 siblings, 1 reply; 11+ messages in thread
From: Stefano Brivio @ 2024-06-13 18:21 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Wed, 5 Jun 2024 11:39:02 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> udp_sock_handler() takes a number of datagrams from sockets that depending
> on their addresses could be forwarded either to the L2 interface ("tap")
> or to another socket ("spliced"). In the latter case we can also only
> send packets together if they have the same source port, and therefore
> are sent via the same socket.
>
> To reduce the total number of system calls we gather contiguous batches of
> datagrams with the same destination interface and socket where applicable.
> The determination of what the target is is made by udp_mmh_splice_port().
> It returns the source port for splice packets and -1 for "tap" packets.
> We find batches by looking ahead in our queue until we find a datagram
> whose "splicefrom" port doesn't match the first in our current batch.
>
> udp_mmh_splice_port() is moderately expensive, since it must examine IPv6
> addresses. But unfortunately we can call it twice on the same datagram:
> once as the (last + 1) entry in one batch (showing that it's not in that
> match, then again as the first entry in the next batch.
This paragraph took me an embarrassingly long time to grasp. If you
re-spin, it would be nice to fix it:
- "And unfortunately [...]", I guess: otherwise it looks like we're
lucky that udp_mmh_splice_port() is expensive or something like that
(because of the "But" implying contrast).
I initially assumed "unfortunately" was a typo and tried to
understand why it was a good thing we'd call udp_mmh_splice_port()
twice on the same datagram (faster than calling it on two
datagrams!), then started reading the change and got even more
confused...
- "(to check that it's not in that batch)" ?
> Avoid this by keeping track of the "splice port" in the metadata structure,
> and filling it in one entry ahead of the one we're currently considering.
> This is a bit subtle, but not that hard. It will also generalise better
> when we have more complex possibilities based on the flow table.
I guess this is the actual, main reason for this change. :) I should
have read this paragraph first.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> udp.c | 147 ++++++++++++++++++++++++++++++++++------------------------
> 1 file changed, 86 insertions(+), 61 deletions(-)
>
> diff --git a/udp.c b/udp.c
> index 7487d2b2..757c10ab 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -198,6 +198,7 @@ static struct ethhdr udp6_eth_hdr;
> * @ip4h: Pre-filled IPv4 header (except for tot_len and saddr)
> * @taph: Tap backend specific header
> * @s_in: Source socket address, filled in by recvmmsg()
> + * @splicesrc: Source port for splicing, or -1 if not spliceable
> */
> static struct udp_meta_t {
> struct ipv6hdr ip6h;
> @@ -205,6 +206,7 @@ static struct udp_meta_t {
> struct tap_hdr taph;
>
> union sockaddr_inany s_in;
> + int splicesrc;
> }
> #ifdef __AVX2__
> __attribute__ ((aligned(32)))
> @@ -492,28 +494,32 @@ static int udp_mmh_splice_port(union udp_epoll_ref uref,
> }
>
> /**
> - * udp_splice_sendfrom() - Send datagrams from given port to given port
> + * udp_splice_send() - Send datagrams from socket to socket
> * @c: Execution context
> * @start: Index of first datagram in udp[46]_l2_buf
> - * @n: Number of datagrams to send
> - * @src: Datagrams will be sent from this port (on origin side)
> - * @dst: Datagrams will be send to this port (on destination side)
> - * @from_pif: pif from which the packet originated
> - * @v6: Send as IPv6?
> - * @allow_new: If true create sending socket if needed, if false discard
> - * if no sending socket is available
> + * @n: Total number of datagrams in udp[46]_l2_buf pool
> + * @dst: Datagrams will be sent to this port (on destination side)
> + * @uref: UDP epoll reference for origin socket
> * @now: Timestamp
> + *
> + * This consumes as many frames as are sendable via a single socket. It
s/frames/datagrams/ ...or messages.
> + * requires that udp_meta[@start].splicesrc is initialised, and will initialise
> + * udp_meta[].splicesrc for each frame it consumes *and one more* (if present).
> + *
> + * Return: Number of frames sent
I'd say it's rather the number of datagrams (not frames) we tried to
send.
In some sense, it's also the number of frames sent _by us_ (well, after
calling sendmmsg(), messages were sent), but we call sendmmsg()
ignoring the result, so this comment might look a bit misleading.
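The distinction between "tried to send" and "sent" can be shown with a minimal standalone sketch, assuming Linux and glibc. It is not passt code: send_batch() and send_batch_demo() are hypothetical names, and a local AF_UNIX datagram socket pair stands in for the spliced UDP socket.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical wrapper: try to send @n messages on @s and report how many
 * the kernel actually accepted (0 if none, or on error).
 */
static unsigned send_batch(int s, struct mmsghdr *mmh, unsigned n)
{
	int sent = sendmmsg(s, mmh, n, MSG_NOSIGNAL);

	return sent < 0 ? 0 : (unsigned)sent;
}

/* Demo: queue two small datagrams on a local socket pair.
 *
 * Return: count reported by send_batch(), -1 if no socket pair is available
 */
static int send_batch_demo(void)
{
	char a[] = "one", b[] = "two";
	struct iovec iov[2] = {
		{ .iov_base = a, .iov_len = sizeof(a) },
		{ .iov_base = b, .iov_len = sizeof(b) },
	};
	struct mmsghdr mmh[2];
	unsigned sent;
	int sv[2];

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv))
		return -1;

	memset(mmh, 0, sizeof(mmh));
	mmh[0].msg_hdr.msg_iov = &iov[0];
	mmh[0].msg_hdr.msg_iovlen = 1;
	mmh[1].msg_hdr.msg_iov = &iov[1];
	mmh[1].msg_hdr.msg_iovlen = 1;

	sent = send_batch(sv[0], mmh, 2);
	close(sv[0]);
	close(sv[1]);
	return (int)sent;
}
```

A caller that ignores the return value of sendmmsg(), as the patch does, can only report the number of datagrams it attempted to send.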
> */
> -static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n,
> - in_port_t src, in_port_t dst, uint8_t from_pif,
> - bool v6, bool allow_new,
> +static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n,
> + in_port_t dst, union udp_epoll_ref uref,
> const struct timespec *now)
> {
> + in_port_t src = udp_meta[start].splicesrc;
> struct mmsghdr *mmh_recv, *mmh_send;
> - unsigned int i;
> + unsigned int i = start;
> int s;
>
> - if (v6) {
> + ASSERT(udp_meta[start].splicesrc >= 0);
> +
> + if (uref.v6) {
> mmh_recv = udp6_l2_mh_sock;
> mmh_send = udp6_mh_splice;
> } else {
> @@ -521,40 +527,48 @@ static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n,
> mmh_send = udp4_mh_splice;
> }
>
> - if (from_pif == PIF_SPLICE) {
> + do {
> + mmh_send[i].msg_hdr.msg_iov->iov_len = mmh_recv[i].msg_len;
> +
> + if (++i >= n)
> + break;
> +
> + udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]);
> + } while (udp_meta[i].splicesrc == src);
I don't have a strong preference, but a for loop like this:
for (; i < n && udp_meta[i].splicesrc == src; i++) {
mmh_send[i].msg_hdr.msg_iov->iov_len = mmh_recv[i].msg_len;
udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]);
}
if (i++ < n) /* Set splicesrc for first mismatching entry, too */
udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]);
looks a bit more readable to me. Same for udp_tap_send().
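The look-ahead scheme described in the commit message can be sketched in isolation as follows. This is a minimal standalone illustration, not passt code: classify(), consume_batch() and count_batches() are hypothetical names, with classify() standing in for udp_mmh_splice_port() and a plain int array standing in for the received datagrams. The ordering that matters in any variant of the loop is that key[i] is stored before the loop condition reads it.

```c
#include <assert.h>
#include <stddef.h>

/* Counts classifier invocations, to show each entry is examined once */
static unsigned classify_calls;

/* Hypothetical stand-in for udp_mmh_splice_port(): the "key" of entry @i */
static int classify(const int *in, size_t i)
{
	classify_calls++;
	return in[i];
}

/* Consume the batch starting at @start. The caller must already have filled
 * in key[start]. Fills in key[] for each further entry examined, including
 * the first one that doesn't match (if any).
 *
 * Return: number of entries in the batch
 */
static size_t consume_batch(const int *in, int *key, size_t start, size_t n)
{
	size_t i = start;

	assert(start < n);

	do {
		/* Per-entry work for entry i would go here */
		if (++i >= n)
			break;

		key[i] = classify(in, i);
	} while (key[i] == key[start]);

	return i - start;
}

/* Split in[0..n) into contiguous batches of equal key
 *
 * Return: number of batches
 */
static size_t count_batches(const int *in, int *key, size_t n)
{
	size_t i, m, batches = 0;

	if (!n)
		return 0;

	/* Entry 0 is classified here, every later entry by consume_batch() */
	key[0] = classify(in, 0);
	for (i = 0; i < n; i += m) {
		m = consume_batch(in, key, i, n);
		batches++;
	}

	return batches;
}
```

For the input keys -1, -1, 5, 5, 5, -1, 7 this yields four batches and calls classify() exactly once per entry, seven times in total.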
> +
> + if (uref.pif == PIF_SPLICE) {
> src += c->udp.fwd_in.rdelta[src];
> - s = udp_splice_init[v6][src].sock;
> - if (s < 0 && allow_new)
> - s = udp_splice_new(c, v6, src, false);
> + s = udp_splice_init[uref.v6][src].sock;
> + if (s < 0 && uref.orig)
> + s = udp_splice_new(c, uref.v6, src, false);
>
> if (s < 0)
> - return;
> + goto out;
>
> - udp_splice_ns[v6][dst].ts = now->tv_sec;
> - udp_splice_init[v6][src].ts = now->tv_sec;
> + udp_splice_ns[uref.v6][dst].ts = now->tv_sec;
> + udp_splice_init[uref.v6][src].ts = now->tv_sec;
> } else {
> - ASSERT(from_pif == PIF_HOST);
> + ASSERT(uref.pif == PIF_HOST);
> src += c->udp.fwd_out.rdelta[src];
> - s = udp_splice_ns[v6][src].sock;
> - if (s < 0 && allow_new) {
> + s = udp_splice_ns[uref.v6][src].sock;
> + if (s < 0 && uref.orig) {
> struct udp_splice_new_ns_arg arg = {
> - c, v6, src, -1,
> + c, uref.v6, src, -1,
> };
>
> NS_CALL(udp_splice_new_ns, &arg);
> s = arg.s;
> }
> if (s < 0)
> - return;
> + goto out;
>
> - udp_splice_init[v6][dst].ts = now->tv_sec;
> - udp_splice_ns[v6][src].ts = now->tv_sec;
> + udp_splice_init[uref.v6][dst].ts = now->tv_sec;
> + udp_splice_ns[uref.v6][src].ts = now->tv_sec;
> }
>
> - for (i = start; i < start + n; i++)
> - mmh_send[i].msg_hdr.msg_iov->iov_len = mmh_recv[i].msg_len;
> -
> - sendmmsg(s, mmh_send + start, n, MSG_NOSIGNAL);
> + sendmmsg(s, mmh_send + start, i - start, MSG_NOSIGNAL);
> +out:
> + return i - start;
> }
>
> /**
> @@ -687,31 +701,41 @@ static size_t udp_update_hdr6(const struct ctx *c,
> * udp_tap_send() - Prepare UDP datagrams and send to tap interface
> * @c: Execution context
> * @start: Index of first datagram in udp[46]_l2_buf pool
> - * @n: Number of datagrams to send
> - * @dstport: Destination port number
> - * @v6: True if using IPv6
> + * @n: Total number of datagrams in udp[46]_l2_buf pool
> + * @dstport: Destination port number on destination side
> + * @uref: UDP epoll reference for origin socket
> * @now: Current timestamp
> *
> - * Return: size of tap frame with headers
> + * This consumes as many frames as are sendable via tap. It requires that
> + * udp_meta[@start].splicesrc is initialised, and will initialise
> + * udp_meta[].splicesrc for each frame it consumes *and one more* (if present).
> + *
> + * Return: Number of frames sent via tap
> */
> -static void udp_tap_send(const struct ctx *c,
> - unsigned int start, unsigned int n,
> - in_port_t dstport, bool v6, const struct timespec *now)
> +static unsigned udp_tap_send(const struct ctx *c, size_t start, size_t n,
> + in_port_t dstport, union udp_epoll_ref uref,
> + const struct timespec *now)
> {
> struct iovec (*tap_iov)[UDP_NUM_IOVS];
> - size_t i;
> + struct mmsghdr *mmh_recv;
> + size_t i = start;
>
> - if (v6)
> + ASSERT(udp_meta[start].splicesrc == -1);
> +
> + if (uref.v6) {
> tap_iov = udp6_l2_iov_tap;
> - else
> + mmh_recv = udp6_l2_mh_sock;
> + } else {
> + mmh_recv = udp4_l2_mh_sock;
> tap_iov = udp4_l2_iov_tap;
> + }
>
> - for (i = start; i < start + n; i++) {
> + do {
> struct udp_payload_t *bp = &udp_payload[i];
> struct udp_meta_t *bm = &udp_meta[i];
> size_t l4len;
>
> - if (v6) {
> + if (uref.v6) {
> l4len = udp_update_hdr6(c, bm, bp, dstport,
> udp6_l2_mh_sock[i].msg_len, now);
> } else {
> @@ -719,9 +743,15 @@ static void udp_tap_send(const struct ctx *c,
> udp4_l2_mh_sock[i].msg_len, now);
> }
> tap_iov[i][UDP_IOV_PAYLOAD].iov_len = l4len;
> - }
>
> - tap_send_frames(c, &tap_iov[start][0], UDP_NUM_IOVS, n);
> + if (++i >= n)
> + break;
> +
> + udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]);
> + } while (udp_meta[i].splicesrc == -1);
> +
> + tap_send_frames(c, &tap_iov[start][0], UDP_NUM_IOVS, i - start);
> + return i - start;
> }
>
> /**
> @@ -770,24 +800,19 @@ void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
> if (n <= 0)
> return;
>
> + /* We divide things into batches based on how we need to send them,
> + * determined by udp_meta[i].splicesrc. To avoid either two passes
> + * through the array, or recalculating splicesrc for a single entry, we
> + * have to populate it one entry *ahead* of the loop counter (if
> + * present). So we fill in entry 0 before the loop, then udp_*_send()
> + * populate one entry past where they consume.
> + */
> + udp_meta[0].splicesrc = udp_mmh_splice_port(ref.udp, mmh_recv);
> for (i = 0; i < n; i += m) {
> - int splicefrom = -1;
> -
> - splicefrom = udp_mmh_splice_port(ref.udp, mmh_recv + i);
> -
> - for (m = 1; i + m < n; m++) {
> - int p;
> -
> - p = udp_mmh_splice_port(ref.udp, mmh_recv + i + m);
> - if (p != splicefrom)
> - break;
> - }
> -
> - if (splicefrom >= 0)
> - udp_splice_sendfrom(c, i, m, splicefrom, dstport,
> - ref.udp.pif, v6, ref.udp.orig, now);
> + if (udp_meta[i].splicesrc >= 0)
> + m = udp_splice_send(c, i, n, dstport, ref.udp, now);
> else
> - udp_tap_send(c, i, m, dstport, v6, now);
> + m = udp_tap_send(c, i, n, dstport, ref.udp, now);
> }
> }
>
The rest of the series looks good to me.
--
Stefano
* Re: [PATCH 1/4] util: Split construction of bind socket address from the rest of sock_l4()
2024-06-13 15:06 ` Stefano Brivio
@ 2024-06-14 0:47 ` David Gibson
0 siblings, 0 replies; 11+ messages in thread
From: David Gibson @ 2024-06-14 0:47 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
On Thu, Jun 13, 2024 at 05:06:25PM +0200, Stefano Brivio wrote:
> Sorry for the delay, nits only (I can fix them up on merge):
Ok, I'll respin anyway, though, since it looks like there's a slightly
non-trivial rebase needed on Laurent's stuff.
> On Wed, 5 Jun 2024 11:39:00 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > sock_l4() creates, binds and otherwise prepares a new socket. It builds
> > the socket address to bind from separately provided address and port.
> > However, we have use cases coming up where it's more natural to construct
> > the socket address in the caller.
> >
> > Prepare for this by adding sock_l4_sa() which takes a pre-constructed
> > socket address, and rewriting sock_l4() in terms of it.
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > util.c | 123 ++++++++++++++++++++++++++++++++-------------------------
> > 1 file changed, 70 insertions(+), 53 deletions(-)
> >
> > diff --git a/util.c b/util.c
> > index cc1c73ba..4e5f6d23 100644
> > --- a/util.c
> > +++ b/util.c
> > @@ -33,36 +33,25 @@
> > #include "log.h"
> >
> > /**
> > - * sock_l4() - Create and bind socket for given L4, add to epoll list
> > + * sock_l4_sa() - Create and bind socket for given L4, add to epoll list
>
> That doesn't quite tell the difference from sock_l4(), perhaps:
>
> * sock_l4_sa() - Create and bind socket given socket address, add to epoll list
Good idea, done.
> > * @c: Execution context
> > - * @af: Address family, AF_INET or AF_INET6
> > * @proto: Protocol number
> > - * @bind_addr: Address for binding, NULL for any
> > + * @sa: Socket address to bind to
> > + * @sl: Length of @sa
> > * @ifname: Interface for binding, NULL for any
> > - * @port: Port, host order
> > + * @v6only: Set IPV6_V6ONLY socket option
> > * @data: epoll reference portion for protocol handlers
> > *
> > * Return: newly created socket, negative error code on failure
> > */
> > -int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
> > - const void *bind_addr, const char *ifname, uint16_t port,
> > - uint32_t data)
> > +static int sock_l4_sa(const struct ctx *c, uint8_t proto,
> > + const void *sa, socklen_t sl,
> > + const char *ifname, bool v6only, uint32_t data)
> > {
> > + sa_family_t af =((const struct sockaddr *)sa)->sa_family;
>
> Missing whitespace after =.
Oops.
> > union epoll_ref ref = { .data = data };
> > - struct sockaddr_in addr4 = {
> > - .sin_family = AF_INET,
> > - .sin_port = htons(port),
> > - { 0 }, { 0 },
> > - };
> > - struct sockaddr_in6 addr6 = {
> > - .sin6_family = AF_INET6,
> > - .sin6_port = htons(port),
> > - 0, IN6ADDR_ANY_INIT, 0,
> > - };
> > - const struct sockaddr *sa;
> > - bool dual_stack = false;
> > - int fd, sl, y = 1, ret;
> > struct epoll_event ev;
> > + int fd, y = 1, ret;
> >
> > switch (proto) {
> > case IPPROTO_TCP:
> > @@ -79,13 +68,6 @@ int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
> > return -EPFNOSUPPORT; /* Not implemented. */
> > }
> >
> > - if (af == AF_UNSPEC) {
> > - if (!DUAL_STACK_SOCKETS || bind_addr)
> > - return -EINVAL;
> > - dual_stack = true;
> > - af = AF_INET6;
> > - }
> > -
> > if (proto == IPPROTO_TCP)
> > fd = socket(af, SOCK_STREAM | SOCK_NONBLOCK, proto);
> > else
> > @@ -104,30 +86,9 @@ int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
> >
> > ref.fd = fd;
> >
> > - if (af == AF_INET) {
> > - if (bind_addr)
> > - addr4.sin_addr = *(struct in_addr *)bind_addr;
> > -
> > - sa = (const struct sockaddr *)&addr4;
> > - sl = sizeof(addr4);
> > - } else {
> > - if (bind_addr) {
> > - addr6.sin6_addr = *(struct in6_addr *)bind_addr;
> > -
> > - if (!memcmp(bind_addr, &c->ip6.addr_ll,
> > - sizeof(c->ip6.addr_ll)))
> > - addr6.sin6_scope_id = c->ifi6;
> > - }
> > -
> > - sa = (const struct sockaddr *)&addr6;
> > - sl = sizeof(addr6);
> > -
> > - if (!dual_stack)
> > - if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY,
> > - &y, sizeof(y)))
> > - debug("Failed to set IPV6_V6ONLY on socket %i",
> > - fd);
> > - }
> > + if (v6only)
> > + if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &y, sizeof(y)))
> > + debug("Failed to set IPV6_V6ONLY on socket %i", fd);
> >
> > if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &y, sizeof(y)))
> > debug("Failed to set SO_REUSEADDR on socket %i", fd);
> > @@ -140,9 +101,12 @@ int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
> > */
> > if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
> > ifname, strlen(ifname))) {
> > + char str[SOCKADDR_STRLEN];
> > +
> > ret = -errno;
> > - warn("Can't bind %s socket for port %u to %s, closing",
> > - EPOLL_TYPE_STR(proto), port, ifname);
> > + warn("Can't bind %s socket for %s to %s, closing",
> > + EPOLL_TYPE_STR(proto),
> > + sockaddr_ntop(sa, str, sizeof(str)), ifname);
> > close(fd);
> > return ret;
> > }
> > @@ -178,6 +142,59 @@ int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
> >
> > return fd;
> > }
> > +/**
> > + * sock_l4() - Create and bind socket for given L4, add to epoll list
> > + * @c: Execution context
> > + * @af: Address family, AF_INET or AF_INET6
> > + * @proto: Protocol number
> > + * @bind_addr: Address for binding, NULL for any
> > + * @ifname: Interface for binding, NULL for any
> > + * @port: Port, host order
> > + * @data: epoll reference portion for protocol handlers
> > + *
> > + * Return: newly created socket, negative error code on failure
> > + */
> > +int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
> > + const void *bind_addr, const char *ifname, uint16_t port,
> > + uint32_t data)
> > +{
> > + switch (af) {
> > + case AF_INET: {
> > + struct sockaddr_in addr4 = {
> > + .sin_family = AF_INET,
> > + .sin_port = htons(port),
> > + { 0 }, { 0 },
> > + };
> > + if (bind_addr)
> > + addr4.sin_addr = *(struct in_addr *)bind_addr;
> > + return sock_l4_sa(c, proto, &addr4, sizeof(addr4), ifname,
> > + false, data);
> > + }
> > +
> > + case AF_UNSPEC:
> > + if (!DUAL_STACK_SOCKETS || bind_addr)
> > + return -EINVAL;
> > + /* fallthrough */
> > + case AF_INET6: {
> > + struct sockaddr_in6 addr6 = {
> > + .sin6_family = AF_INET6,
> > + .sin6_port = htons(port),
> > + 0, IN6ADDR_ANY_INIT, 0,
> > + };
> > + if (bind_addr) {
> > + addr6.sin6_addr = *(struct in6_addr *)bind_addr;
> > +
> > + if (!memcmp(bind_addr, &c->ip6.addr_ll,
> > + sizeof(c->ip6.addr_ll)))
> > + addr6.sin6_scope_id = c->ifi6;
> > + }
> > + return sock_l4_sa(c, proto, &addr6, sizeof(addr6), ifname,
> > + af == AF_INET6, data);
> > + }
> > + default:
> > + return -EINVAL;
> > + }
> > +}
> >
> > /**
> > * sock_probe_mem() - Check if setting high SO_SNDBUF and SO_RCVBUF is allowed
>
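To illustrate what "construct the socket address in the caller" can look
like, here is a minimal sketch. make_bind_sa6() is a hypothetical helper,
not part of passt. It also sets the scope ID for any link-local address,
which is an assumption of the sketch: the patch only does so when the
address matches c->ip6.addr_ll.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical caller-side helper (not from passt): build an IPv6 bind
 * address for @port (host order). A NULL @addr means "any". Link-local
 * addresses get @ifi as their scope ID.
 */
static struct sockaddr_in6 make_bind_sa6(const struct in6_addr *addr,
					 uint16_t port, unsigned int ifi)
{
	struct sockaddr_in6 sa6;

	memset(&sa6, 0, sizeof(sa6));
	sa6.sin6_family = AF_INET6;
	sa6.sin6_port = htons(port);
	sa6.sin6_addr = addr ? *addr : in6addr_any;

	/* Assumption: scope every link-local address, not only addr_ll */
	if (addr && IN6_IS_ADDR_LINKLOCAL(addr))
		sa6.sin6_scope_id = ifi;

	return sa6;
}
```

A caller would then pass a pointer to the result and its size to
something shaped like sock_l4_sa() above.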
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
* Re: [PATCH 2/4] udp: Fold checking of splice flag into udp_mmh_splice_port()
2024-06-13 15:06 ` Stefano Brivio
@ 2024-06-14 0:50 ` David Gibson
0 siblings, 0 replies; 11+ messages in thread
From: David Gibson @ 2024-06-14 0:50 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
On Thu, Jun 13, 2024 at 05:06:54PM +0200, Stefano Brivio wrote:
> On Wed, 5 Jun 2024 11:39:01 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > udp_mmh_splice_port() is used to determine if a UDP datagram can be
> > "spliced" (forwarded via a socket instead of tap). We only invoke it if
> > the origin socket has the 'splice' flag set.
> >
> > Fold the checking of the flag into the helper itself, which makes the
> > caller simpler. It does mean we have a loop looking for a batch of
> > spliceable or non-spliceable packets even in the case where the flag is
> > clear. This shouldn't be that expensive though, since each call to
> > udp_mmh_splice_port() will return without accessing memory in that case.
> > In any case we're going to need a similar loop in more cases with upcoming
> > flow table work.
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > udp.c | 31 ++++++++++++++++---------------
> > 1 file changed, 16 insertions(+), 15 deletions(-)
> >
> > diff --git a/udp.c b/udp.c
> > index 3abafc99..7487d2b2 100644
> > --- a/udp.c
> > +++ b/udp.c
> > @@ -467,21 +467,25 @@ static int udp_splice_new_ns(void *arg)
> >
> > /**
> > * udp_mmh_splice_port() - Is source address of message suitable for splicing?
> > - * @v6: Is @sa a sockaddr_in6 (otherwise sockaddr_in)?
> > + * @uref: UDP epoll reference for incoming message's origin socket
> > * @mmh: mmsghdr of incoming message
> > *
> > - * Return: if @sa refers to localhost (127.0.0.1 or ::1) the port from
> > - * @sa in host order, otherwise -1.
> > + * Return: if source address of message in @mmh refers to localhost (127.0.0.1
>
> Pre-existing, and I guess this might change again with the complete
> flow table implementation, so it probably doesn't make sense to fix
> this now: it's 127.0.0.0/8, not necessarily 127.0.0.1.
Right. In fact this function will go away entirely with the flow
table.
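For reference, the distinction between 127.0.0.1 and 127.0.0.0/8 comes
down to checking only the top octet. A minimal sketch follows; both
helper names are hypothetical and are not taken from passt:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if @a lies anywhere in 127.0.0.0/8, not
 * only if it equals 127.0.0.1.
 */
static bool in4_is_loopback_net(const struct in_addr *a)
{
	/* s_addr is in network order; after ntohl() the top octet
	 * identifies the /8 network.
	 */
	return (ntohl(a->s_addr) >> 24) == 127;
}

/* Hypothetical helper: parse a dotted-quad string, then apply the check */
static bool str_is_loopback_net(const char *s)
{
	struct in_addr a;

	return inet_pton(AF_INET, s, &a) == 1 && in4_is_loopback_net(&a);
}
```

With this, both "127.0.0.1" and "127.1.2.3" count as loopback, while
"128.0.0.1" does not.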
> As to whether we actually need to preserve a source address that's not
> 127.0.0.1, but in 127.0.0.0/8, as we "splice", I'm not quite sure. I
> think we could bind() the socket in the target namespace, but I haven't
> tried, and I don't know if it makes sense at all (I can't think of any
> use case).
So, how to handle 127.0.0.0/8 is something I'm actively thinking
about. It should be much easier to tweak this with the flow table in
place.
> > + * or ::1) its source port (host order), otherwise -1.
> > */
> > -static int udp_mmh_splice_port(bool v6, const struct mmsghdr *mmh)
> > +static int udp_mmh_splice_port(union udp_epoll_ref uref,
> > + const struct mmsghdr *mmh)
> > {
> > const struct sockaddr_in6 *sa6 = mmh->msg_hdr.msg_name;
> > const struct sockaddr_in *sa4 = mmh->msg_hdr.msg_name;
> >
> > - if (v6 && IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
> > + if (!uref.splice)
> > + return -1;
> > +
> > + if (uref.v6 && IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
> > return ntohs(sa6->sin6_port);
> >
> > - if (!v6 && IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr))
> > + if (!uref.v6 && IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr))
> > return ntohs(sa4->sin_port);
> >
> > return -1;
> > @@ -768,18 +772,15 @@ void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
>
> (now renamed to udp_buf_sock_handler() if you're wondering)
>
> >
> > for (i = 0; i < n; i += m) {
> > int splicefrom = -1;
> > - m = n;
> >
> > - if (ref.udp.splice) {
> > - splicefrom = udp_mmh_splice_port(v6, mmh_recv + i);
> > + splicefrom = udp_mmh_splice_port(ref.udp, mmh_recv + i);
> >
> > - for (m = 1; i + m < n; m++) {
> > - int p;
> > + for (m = 1; i + m < n; m++) {
> > + int p;
> >
> > - p = udp_mmh_splice_port(v6, mmh_recv + i + m);
> > - if (p != splicefrom)
> > - break;
> > - }
> > + p = udp_mmh_splice_port(ref.udp, mmh_recv + i + m);
> > + if (p != splicefrom)
> > + break;
> > }
> >
> > if (splicefrom >= 0)
>
* Re: [PATCH 3/4] udp: Rework how we divide queued datagrams between sending methods
2024-06-13 18:21 ` Stefano Brivio
@ 2024-06-14 1:08 ` David Gibson
0 siblings, 0 replies; 11+ messages in thread
From: David Gibson @ 2024-06-14 1:08 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
On Thu, Jun 13, 2024 at 08:21:12PM +0200, Stefano Brivio wrote:
> On Wed, 5 Jun 2024 11:39:02 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > udp_sock_handler() takes a number of datagrams from sockets that depending
> > on their addresses could be forwarded either to the L2 interface ("tap")
> > or to another socket ("spliced"). In the latter case we can also only
> > send packets together if they have the same source port, and therefore
> > are sent via the same socket.
> >
> > To reduce the total number of system calls we gather contiguous batches of
> > datagrams with the same destination interface and socket where applicable.
> > The determination of what the target is is made by udp_mmh_splice_port().
> > It returns the source port for splice packets and -1 for "tap" packets.
> > We find batches by looking ahead in our queue until we find a datagram
> > whose "splicefrom" port doesn't match the first in our current batch.
> >
> > udp_mmh_splice_port() is moderately expensive, since it must examine IPv6
> > addresses. But unfortunately we can call it twice on the same datagram:
> > once as the (last + 1) entry in one batch (showing that it's not in that
> > match, then again as the first entry in the next batch.
>
> This paragraph took me an embarrassingly long time to grasp, if you
> re-spin it would be nice to fix it:
>
> - "And unfortunately [...]", I guess: otherwise it looks like we're
> lucky that udp_mmh_splice_port() is expensive or something like that
> (because of the "But" implying contrast).
>
> I initially assumed "unfortunately" was a typo and tried to
> understand why it was a good thing we'd call udp_mmh_splice_port()
> twice on the same datagram (faster than calling it on two
> datagrams!), then started reading the change and got even more
> confused...
>
> - "(to check that it's not that batch)" ?
Yeah, it didn't help that there were some other typos in there. I've
updated it along the lines of these suggestions.
> > Avoid this by keeping track of the "splice port" in the metadata structure,
> > and filling it in one entry ahead of the one we're currently considering.
> > This is a bit subtle, but not that hard. It will also generalise better
> > when we have more complex possibilities based on the flow table.
>
> I guess this is the actual, main reason for this change. :) I should
> have read this paragraph first.
Yes. In the flow table we'll replace this splicesrc array with an
array of flow sidxs, updated in the same way.
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > udp.c | 147 ++++++++++++++++++++++++++++++++++------------------------
> > 1 file changed, 86 insertions(+), 61 deletions(-)
> >
> > diff --git a/udp.c b/udp.c
> > index 7487d2b2..757c10ab 100644
> > --- a/udp.c
> > +++ b/udp.c
> > @@ -198,6 +198,7 @@ static struct ethhdr udp6_eth_hdr;
> > * @ip4h: Pre-filled IPv4 header (except for tot_len and saddr)
> > * @taph: Tap backend specific header
> > * @s_in: Source socket address, filled in by recvmmsg()
> > + * @splicesrc: Source port for splicing, or -1 if not spliceable
> > */
> > static struct udp_meta_t {
> > struct ipv6hdr ip6h;
> > @@ -205,6 +206,7 @@ static struct udp_meta_t {
> > struct tap_hdr taph;
> >
> > union sockaddr_inany s_in;
> > + int splicesrc;
> > }
> > #ifdef __AVX2__
> > __attribute__ ((aligned(32)))
> > @@ -492,28 +494,32 @@ static int udp_mmh_splice_port(union udp_epoll_ref uref,
> > }
> >
> > /**
> > - * udp_splice_sendfrom() - Send datagrams from given port to given port
> > + * udp_splice_send() - Send datagrams from socket to socket
> > * @c: Execution context
> > * @start: Index of first datagram in udp[46]_l2_buf
> > - * @n: Number of datagrams to send
> > - * @src: Datagrams will be sent from this port (on origin side)
> > - * @dst: Datagrams will be send to this port (on destination side)
> > - * @from_pif: pif from which the packet originated
> > - * @v6: Send as IPv6?
> > - * @allow_new: If true create sending socket if needed, if false discard
> > - * if no sending socket is available
> > + * @n: Total number of datagrams in udp[46]_l2_buf pool
> > + * @dst: Datagrams will be sent to this port (on destination side)
> > + * @uref: UDP epoll reference for origin socket
> > * @now: Timestamp
> > + *
> > + * This consumes as many frames as are sendable via a single socket. It
>
> s/frames/datagrams/ ...or messages.
Good point, adjusted.
> > + * requires that udp_meta[@start].splicesrc is initialised, and will initialise
> > + * udp_meta[].splicesrc for each frame it consumes *and one more* (if present).
> > + *
> > + * Return: Number of frames sent
>
> I'd say it's rather the number of datagrams (not frames) we tried to
> send.
>
> In some sense, it's also the number of frames sent _by us_ (well, after
> calling sendmmsg(), messages were sent), but we call sendmmsg()
> ignoring the result, so this comment might look a bit misleading.
Right... I've gone with "Number of datagrams forwarded". How's that?
> > */
> > -static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n,
> > - in_port_t src, in_port_t dst, uint8_t from_pif,
> > - bool v6, bool allow_new,
> > +static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n,
> > + in_port_t dst, union udp_epoll_ref uref,
> > const struct timespec *now)
> > {
> > + in_port_t src = udp_meta[start].splicesrc;
> > struct mmsghdr *mmh_recv, *mmh_send;
> > - unsigned int i;
> > + unsigned int i = start;
> > int s;
> >
> > - if (v6) {
> > + ASSERT(udp_meta[start].splicesrc >= 0);
> > +
> > + if (uref.v6) {
> > mmh_recv = udp6_l2_mh_sock;
> > mmh_send = udp6_mh_splice;
> > } else {
> > @@ -521,40 +527,48 @@ static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n,
> > mmh_send = udp4_mh_splice;
> > }
> >
> > - if (from_pif == PIF_SPLICE) {
> > + do {
> > + mmh_send[i].msg_hdr.msg_iov->iov_len = mmh_recv[i].msg_len;
> > +
> > + if (++i >= n)
> > + break;
> > +
> > + udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]);
> > + } while (udp_meta[i].splicesrc == src);
>
> I don't have a strong preference, but a for loop like this:
>
> for (; i < n && udp_meta[i].splicesrc == src; i++) {
> mmh_send[i].msg_hdr.msg_iov->iov_len = mmh_recv[i].msg_len;
> udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]);
This needs to update udp_meta[i+1], not udp_meta[i], and therefore
also be conditional on i+1 < n.
> }
>
> if (i++ < n) /* Set splicesrc for first mismatching entry, too */
This needs to be ++i, not i++.
> udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]);
>
> looks a bit more readable to me. Same for udp_tap_send().
At which point I'm not sure it's more readable after all. It also
redundantly checks that udp_meta[start].splicesrc == src.
Like I said, subtle. Putting the increment in the middle of the loop
body, rather than at the beginning or end, while unusual, was the sanest
way I could see to do this. Well, other than having one pass to set
splicesrc[], then another using it, which works but I'm concerned
about the effect on cache locality.
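To make the pattern concrete, here's a standalone sketch of the
look-ahead loop. Everything in it is a hypothetical stand-in rather than
the passt code: classify() plays the role of udp_mmh_splice_port(),
splicesrc[] the role of udp_meta[].splicesrc, and dgram_port[] is made-up
input data.

```c
#include <stddef.h>

#define NDGRAMS 6

/* Made-up queue: the value classify() reports for each datagram */
static const int dgram_port[NDGRAMS] = { 5000, 5000, -1, -1, 6000, 6000 };
static int splicesrc[NDGRAMS];
static unsigned int classify_calls;

/* Stand-in for the moderately expensive per-datagram classification */
static int classify(size_t i)
{
	classify_calls++;
	return dgram_port[i];
}

/* Consume one batch starting at @start. Requires splicesrc[start] to be
 * set already, and fills splicesrc[] for each entry consumed and one
 * more (if present). Returns the number of datagrams in the batch.
 */
static size_t batch_len(size_t start, size_t n)
{
	int src = splicesrc[start];
	size_t i = start;

	do {
		/* per-datagram work on entry i would go here */
		if (++i >= n)
			break;

		splicesrc[i] = classify(i);
	} while (splicesrc[i] == src);

	return i - start;
}

/* Walk the whole queue, returning the number of batches found */
static size_t count_batches(size_t n)
{
	size_t i, batches = 0;

	if (!n)
		return 0;

	splicesrc[0] = classify(0);
	for (i = 0; i < n; i += batch_len(i, n))
		batches++;

	return batches;
}
```

The point of the shape is that classify() runs exactly once per
datagram: the mismatching entry that ends one batch already has its
splicesrc filled in when it becomes the first entry of the next.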
Thread overview: 11+ messages
2024-06-05 1:38 [PATCH 0/4] Even more flow table preliminaries David Gibson
2024-06-05 1:39 ` [PATCH 1/4] util: Split construction of bind socket address from the rest of sock_l4() David Gibson
2024-06-13 15:06 ` Stefano Brivio
2024-06-14 0:47 ` David Gibson
2024-06-05 1:39 ` [PATCH 2/4] udp: Fold checking of splice flag into udp_mmh_splice_port() David Gibson
2024-06-13 15:06 ` Stefano Brivio
2024-06-14 0:50 ` David Gibson
2024-06-05 1:39 ` [PATCH 3/4] udp: Rework how we divide queued datagrams between sending methods David Gibson
2024-06-13 18:21 ` Stefano Brivio
2024-06-14 1:08 ` David Gibson
2024-06-05 1:39 ` [PATCH 4/4] udp: Move management of udp[46]_localname into udp_splice_send() David Gibson