public inbox for passt-dev@passt.top
From: Stefano Brivio <sbrivio@redhat.com>
To: passt-dev@passt.top
Subject: [PATCH 15/24] tcp: Rework timers to use timerfd instead of periodic bitmap scan
Date: Fri, 25 Mar 2022 23:52:51 +0100
Message-ID: <20220325225300.2803584-16-sbrivio@redhat.com>
In-Reply-To: <20220325225300.2803584-1-sbrivio@redhat.com>

With many concurrent connections, periodically scanning the connection
bitmap for timeouts doesn't scale.

Switch to per-connection timerfd timers, set based on events and on
two new flags, ACK_FROM_TAP_DUE and ACK_TO_TAP_DUE. Timers are added
to the common epoll list, and implement the existing timeouts.

While at it, drop the CONN_ prefix from flag names, as they get quite
long otherwise, and fix the logic deciding whether a connection has a
local, possibly unreachable endpoint: if bind(2) succeeds, we reset the
connection and shouldn't go through the rest of tcp_conn_from_tap().
Also, bind(2) fails with EACCES if the port number is low, so don't set
the LOCAL flag in that case.

Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 README.md |   4 +-
 passt.c   |  12 +-
 tap.c     |   2 +-
 tcp.c     | 498 +++++++++++++++++++++++++++++-------------------------
 tcp.h     |   8 +-
 5 files changed, 283 insertions(+), 241 deletions(-)

diff --git a/README.md b/README.md
index cd4caa3..8e07fb1 100644
--- a/README.md
+++ b/README.md
@@ -287,11 +287,9 @@ speeding up local connections, and usually requiring NAT. _pasta_:
 * ✅ all capabilities dropped, other than `CAP_NET_BIND_SERVICE` (if granted)
 * ✅ with default options, user, mount, IPC, UTS, PID namespaces are detached
 * ✅ no external dependencies (other than a standard C library)
-* ✅ restrictive seccomp profiles (22 syscalls allowed for _passt_, 34 for
+* ✅ restrictive seccomp profiles (24 syscalls allowed for _passt_, 36 for
   _pasta_ on x86_64)
 * ✅ static checkers in continuous integration (clang-tidy, cppcheck)
-* 🛠️ rework of TCP state machine (flags instead of states), TCP timers, and code
-  de-duplication
 * 🛠️ clearly defined packet abstraction
 * 🛠️ ~5 000 LoC target
 * ⌚ [fuzzing](https://bugs.passt.top/show_bug.cgi?id=9), _packetdrill_ tests
diff --git a/passt.c b/passt.c
index 6c04266..6550a22 100644
--- a/passt.c
+++ b/passt.c
@@ -119,12 +119,12 @@ static void post_handler(struct ctx *c, struct timespec *now)
 #define CALL_PROTO_HANDLER(c, now, lc, uc)				\
 	do {								\
 		extern void						\
-		lc ## _defer_handler (struct ctx *, struct timespec *)	\
+		lc ## _defer_handler (struct ctx *c)			\
 		__attribute__ ((weak));					\
 									\
 		if (!c->no_ ## lc) {					\
 			if (lc ## _defer_handler)			\
-				lc ## _defer_handler(c, now);		\
+				lc ## _defer_handler(c);		\
 									\
 			if (timespec_diff_ms((now), &c->lc.timer_run)	\
 			    >= uc ## _TIMER_INTERVAL) {			\
@@ -134,8 +134,11 @@ static void post_handler(struct ctx *c, struct timespec *now)
 		} 							\
 	} while (0)
 
+	/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
 	CALL_PROTO_HANDLER(c, now, tcp, TCP);
+	/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
 	CALL_PROTO_HANDLER(c, now, udp, UDP);
+	/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
 	CALL_PROTO_HANDLER(c, now, icmp, ICMP);
 
 #undef CALL_PROTO_HANDLER
@@ -380,8 +383,8 @@ int main(int argc, char **argv)
 
 	clock_gettime(CLOCK_MONOTONIC, &now);
 
-	if ((!c.no_udp && udp_sock_init(&c, &now)) ||
-	    (!c.no_tcp && tcp_sock_init(&c, &now)))
+	if ((!c.no_udp && udp_sock_init(&c)) ||
+	    (!c.no_tcp && tcp_sock_init(&c)))
 		exit(EXIT_FAILURE);
 
 	proto_update_l2_buf(c.mac_guest, c.mac, &c.addr4);
@@ -425,6 +428,7 @@ int main(int argc, char **argv)
 	timer_init(&c, &now);
 
 loop:
+	/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
 	nfds = epoll_wait(c.epollfd, events, EPOLL_EVENTS, TIMER_INTERVAL);
 	if (nfds == -1 && errno != EINTR) {
 		perror("epoll_wait");
diff --git a/tap.c b/tap.c
index a1ccfc1..59a87f9 100644
--- a/tap.c
+++ b/tap.c
@@ -939,7 +939,7 @@ void tap_sock_init(struct ctx *c)
  * @c:		Execution context
  * @fd:		File descriptor where event occurred
  * @events:	epoll events
- * @now:	Current timestamp
+ * @now:	Current timestamp, can be NULL on EPOLLERR
  */
 void tap_handler(struct ctx *c, int fd, uint32_t events, struct timespec *now)
 {
diff --git a/tcp.c b/tcp.c
index f03c929..384e7a6 100644
--- a/tcp.c
+++ b/tcp.c
@@ -177,32 +177,32 @@
  * Aging and timeout
  * -----------------
  *
- * Open connections are checked periodically against a number of timeouts. Those
- * are:
+ * Timeouts are implemented by means of timerfd timers, set based on flags:
  *
- * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake within
- *   this time, reset the connection
- *
- * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
- *   either side, the connection is reset
- *
- * - ACK_INTERVAL, or zero-sized window advertised to tap/guest: forcibly check
- *   if an ACK segment can be sent
+ * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag
+ *   ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the
+ *   connection
  *
  * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
- *   data, re-send data from the socket and reset sequence to what was
- *   acknowledged. If this persists for longer than LAST_ACK_TIMEOUT, reset the
- *   connection
+ *   data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
+ *   socket and reset sequence to what was acknowledged. If this persists for
+ *   more than TCP_MAX_RETRANS times in a row, reset the connection
  *
- * - FIN_TIMEOUT, on TAP_FIN_SENT: if no ACK is received for the FIN segment
- *   within this time, the connection is reset
+ * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
+ *   with TAP_FIN_SENT event), and no ACK is received within this time, reset
+ *   the connection
  *
- * - FIN_TIMEOUT, on SOCK_FIN_SENT: if no activity is detected on the socket
- *   after sending a FIN segment (write shutdown), reset the connection
+ * - FIN_TIMEOUT: if a FIN segment was acknowledged by tap/guest and a FIN
+ *   segment (write shutdown) was sent via socket (events SOCK_FIN_SENT and
+ *   TAP_FIN_ACKED), but no activity is detected from the socket within
+ *   this time, reset the connection
  *
- * - LAST_ACK_TIMEOUT on SOCK_FIN_SENT *and* SOCK_FIN_RCVD: reset the connection
- *   if no activity was detected on any of the two sides after sending a FIN
- *   segment
+ * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
+ *   either side, the connection is reset
+ *
+ * - ACK_INTERVAL elapsed after data segment received from tap without having
+ *   sent an ACK segment, or zero-sized window advertised to tap/guest (flag
+ *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
  *
  *
  * Summary of data flows (with ESTABLISHED event)
@@ -237,11 +237,6 @@
  *     - on two duplicated ACKs, reset @seq_to_tap to @seq_ack_from_tap, and
  *       resend with steps listed above
  *     - set TCP_WINDOW_CLAMP from TCP header from tap
- *   - periodically:
- *     - if @seq_ack_from_tap < @seq_to_tap and the retransmission timer
- *       (TODO: implement requirements from RFC 6298, currently 3s fixed) from
- *       @ts_ack_from_tap elapsed, reset @seq_to_tap to @seq_ack_from_tap, and
- *       resend data with the steps listed above
  *
  * - from tap/guest to socket:
  *   - on packet from tap/guest:
@@ -287,6 +282,7 @@
 #include <sys/random.h>
 #endif
 #include <sys/socket.h>
+#include <sys/timerfd.h>
 #include <sys/types.h>
 #include <sys/uio.h>
 #include <unistd.h>
@@ -328,17 +324,13 @@
 # define KERNEL_REPORTS_SND_WND(c)	(0 && (c))
 #endif
 
-#define SYN_TIMEOUT			240000		/* ms */
-#define ACK_TIMEOUT			2000
-#define ACK_INTERVAL			50
-#define ACT_TIMEOUT			7200000
-#define FIN_TIMEOUT			240000
-#define LAST_ACK_TIMEOUT		240000
+#define ACK_INTERVAL			50		/* ms */
+#define SYN_TIMEOUT			10		/* s */
+#define ACK_TIMEOUT			2
+#define FIN_TIMEOUT			60
+#define ACT_TIMEOUT			7200
 
 #define TCP_SOCK_POOL_TSH		16 /* Refill in ns if > x used */
-#define REFILL_INTERVAL			1000
-
-#define PORT_DETECT_INTERVAL		1000
 
 #define LOW_RTT_TABLE_SIZE		8
 #define LOW_RTT_THRESHOLD		10 /* us */
@@ -407,7 +399,11 @@ struct tcp_conn;
  */
 struct tcp_conn {
 	struct tcp_conn *next;
-	int sock;
+	int32_t		sock:SOCKET_REF_BITS;
+#define TCP_RETRANS_BITS	3
+	unsigned int	retrans:TCP_RETRANS_BITS;
+#define TCP_MAX_RETRANS	((1U << TCP_RETRANS_BITS) - 1)
+	int timer;
 	int hash_bucket;
 
 	union {
@@ -440,11 +436,13 @@ struct tcp_conn {
 	(SOCK_ACCEPTED | TAP_SYN_RCVD | ESTABLISHED)
 
 	uint8_t flags;
-#define CONN_STALLED		BIT(0)
-#define CONN_LOCAL		BIT(1)
-#define CONN_WND_CLAMPED	BIT(2)
-#define CONN_IN_EPOLL		BIT(3)
-#define CONN_ACTIVE_CLOSE	BIT(4)
+#define STALLED			BIT(0)
+#define LOCAL			BIT(1)
+#define WND_CLAMPED		BIT(2)
+#define IN_EPOLL		BIT(3)
+#define ACTIVE_CLOSE		BIT(4)
+#define ACK_TO_TAP_DUE		BIT(5)
+#define ACK_FROM_TAP_DUE	BIT(6)
 
 	uint16_t tap_mss;
 
@@ -463,12 +461,6 @@ struct tcp_conn {
 	uint32_t wnd_to_tap;
 
 	int snd_buf;
-
-	struct timespec ts_sock_act;
-	struct timespec ts_tap_act;
-	struct timespec ts_ack_from_tap;
-	struct timespec ts_ack_to_tap;
-	struct timespec tap_data_noack;
 };
 
 #define CONN_IS_CLOSED(conn)	(conn->events == CLOSED)
@@ -498,6 +490,7 @@ static const char *tcp_state_str[] __attribute((__unused__)) = {
 
 static const char *tcp_flag_str[] __attribute((__unused__)) = {
 	"STALLED", "LOCAL", "WND_CLAMPED", "IN_EPOLL", "ACTIVE_CLOSE",
+	"ACK_TO_TAP_DUE", "ACK_FROM_TAP_DUE",
 };
 
 /* Port re-mappings as delta, indexed by original destination port */
@@ -686,7 +679,7 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
 		if (events & TAP_FIN_SENT)
 			return EPOLLET;
 
-		if (conn_flags & CONN_STALLED)
+		if (conn_flags & STALLED)
 			return EPOLLIN | EPOLLRDHUP | EPOLLET;
 
 		return EPOLLIN | EPOLLRDHUP;
@@ -715,7 +708,7 @@ static void conn_flag_do(struct ctx *c, struct tcp_conn *conn,
  */
 static int tcp_epoll_ctl(struct ctx *c, struct tcp_conn *conn)
 {
-	int m = (conn->flags & CONN_IN_EPOLL) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
+	int m = (conn->flags & IN_EPOLL) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref = { .r.proto = IPPROTO_TCP, .r.s = conn->sock,
 				.r.p.tcp.tcp.index = conn - tc,
 				.r.p.tcp.tcp.v6 = CONN_V6(conn) };
@@ -731,13 +724,69 @@ static int tcp_epoll_ctl(struct ctx *c, struct tcp_conn *conn)
 	if (epoll_ctl(c->epollfd, m, conn->sock, &ev))
 		return -errno;
 
-	conn->flags |= CONN_IN_EPOLL;	/* No need to log this */
+	conn->flags |= IN_EPOLL;	/* No need to log this */
 
 	return 0;
 }
 
 /**
- * conn_flag_do() - Set/unset given flag, log, update epoll on CONN_STALLED
+ * tcp_timer_ctl() - Set timerfd based on flags/events, create timerfd if needed
+ * @c:		Execution context
+ * @conn:	Connection pointer
+ *
+ * #syscalls timerfd_create timerfd_settime
+ */
+static void tcp_timer_ctl(struct ctx *c, struct tcp_conn *conn)
+{
+	struct itimerspec it = { { 0 }, { 0 } };
+
+	if (conn->timer == -1) {
+		union epoll_ref ref = { .r.proto = IPPROTO_TCP,
+					.r.s = conn->sock,
+					.r.p.tcp.tcp.timer = 1,
+					.r.p.tcp.tcp.index = conn - tc };
+		struct epoll_event ev = { .data.u64 = ref.u64,
+					  .events = EPOLLIN | EPOLLET };
+
+		conn->timer = timerfd_create(CLOCK_MONOTONIC, 0);
+		if (conn->timer == -1) {
+			debug("TCP: failed to get timer: %s", strerror(errno));
+			return;
+		}
+
+		if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, conn->timer, &ev)) {
+			debug("TCP: failed to add timer: %s", strerror(errno));
+			close(conn->timer);
+			conn->timer = -1;
+			return;
+		}
+	}
+
+	if (conn->events == CLOSED) {
+		it.it_value.tv_nsec = 1;
+	} else if (conn->flags & ACK_TO_TAP_DUE) {
+		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
+	} else if (conn->flags & ACK_FROM_TAP_DUE) {
+		if (!(conn->events & ESTABLISHED))
+			it.it_value.tv_sec = SYN_TIMEOUT;
+		else if (conn->events & TAP_FIN_SENT)
+			it.it_value.tv_sec = FIN_TIMEOUT;
+		else
+			it.it_value.tv_sec = ACK_TIMEOUT;
+	} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
+		it.it_value.tv_sec = FIN_TIMEOUT;
+	} else {
+		it.it_value.tv_sec = ACT_TIMEOUT;
+	}
+
+	debug("TCP: index %i, timer expires in %u.%03us", conn - tc,
+	      it.it_value.tv_sec, it.it_value.tv_nsec / 1000 / 1000);
+
+	timerfd_settime(conn->timer, 0, &it, NULL);
+}
+
+/**
+ * conn_flag_do() - Set/unset given flag, log, update epoll on STALLED flag
  * @c:		Execution context
  * @conn:	Connection pointer
  * @flag:	Flag to set, or ~flag to unset
@@ -761,8 +810,11 @@ static void conn_flag_do(struct ctx *c, struct tcp_conn *conn,
 		      tcp_flag_str[fls(flag)]);
 	}
 
-	if (flag == CONN_STALLED || flag == ~CONN_STALLED)
+	if (flag == STALLED || flag == ~STALLED)
 		tcp_epoll_ctl(c, conn);
+
+	if (flag == ACK_FROM_TAP_DUE || flag == ACK_TO_TAP_DUE)
+		tcp_timer_ctl(c, conn);
 }
 
 /**
@@ -780,7 +832,7 @@ static void conn_event_do(struct ctx *c, struct tcp_conn *conn,
 		return;
 
 	prev = fls(conn->events);
-	if (conn->flags & CONN_ACTIVE_CLOSE)
+	if (conn->flags & ACTIVE_CLOSE)
 		prev += 5;
 
 	if ((conn->events & ESTABLISHED) && (conn->events != ESTABLISHED))
@@ -791,18 +843,13 @@ static void conn_event_do(struct ctx *c, struct tcp_conn *conn,
 	else
 		conn->events |= event;
 
-	if ((event == TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_RCVD))
-		conn_flag(c, conn, CONN_ACTIVE_CLOSE);
-	else
-		tcp_epoll_ctl(c, conn);
-
 	new = fls(conn->events);
 
 	if ((conn->events & ESTABLISHED) && (conn->events != ESTABLISHED)) {
 		num++;
 		new++;
 	}
-	if (conn->flags & CONN_ACTIVE_CLOSE)
+	if (conn->flags & ACTIVE_CLOSE)
 		new += 5;
 
 	if (prev != new) {
@@ -814,6 +861,14 @@ static void conn_event_do(struct ctx *c, struct tcp_conn *conn,
 		debug("TCP: index %i, %s", (conn) - tc,
 		      num == -1 	       ? "CLOSED" : tcp_event_str[num]);
 	}
+
+	if ((event == TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_RCVD))
+		conn_flag(c, conn, ACTIVE_CLOSE);
+	else
+		tcp_epoll_ctl(c, conn);
+
+	if (event == CLOSED || CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED))
+		tcp_timer_ctl(c, conn);
 }
 
 #define conn_event(c, conn, event)					\
@@ -1388,13 +1443,12 @@ static void tcp_rst_do(struct ctx *c, struct tcp_conn *conn);
  *
  * Return: 0 on success, negative error code on failure (tap reset possible)
  */
-static int tcp_l2_buf_write_one(struct ctx *c, struct iovec *iov,
-				struct timespec *ts)
+static int tcp_l2_buf_write_one(struct ctx *c, struct iovec *iov)
 {
 	if (write(c->fd_tap, (char *)iov->iov_base + 4, iov->iov_len - 4) < 0) {
 		debug("tap write: %s", strerror(errno));
 		if (errno != EAGAIN && errno != EWOULDBLOCK)
-			tap_handler(c, c->fd_tap, EPOLLERR, ts);
+			tap_handler(c, c->fd_tap, EPOLLERR, NULL);
 		return -errno;
 	}
 
@@ -1431,11 +1485,9 @@ static void tcp_l2_buf_flush_part(struct ctx *c, struct msghdr *mh, size_t sent)
  * @mh:		Message header pointing to buffers, msg_iovlen not set
  * @buf_used:	Pointer to count of used buffers, set to 0 on return
  * @buf_bytes:	Pointer to count of buffer bytes, set to 0 on return
- * @ts:		Current timestamp
  */
 static void tcp_l2_buf_flush(struct ctx *c, struct msghdr *mh,
-			     unsigned int *buf_used, size_t *buf_bytes,
-			     struct timespec *ts)
+			     unsigned int *buf_used, size_t *buf_bytes)
 {
 	if (!(mh->msg_iovlen = *buf_used))
 		return;
@@ -1450,7 +1502,7 @@ static void tcp_l2_buf_flush(struct ctx *c, struct msghdr *mh,
 		for (i = 0; i < mh->msg_iovlen; i++) {
 			struct iovec *iov = &mh->msg_iov[i];
 
-			if (tcp_l2_buf_write_one(c, iov, ts))
+			if (tcp_l2_buf_write_one(c, iov))
 				i--;
 		}
 	}
@@ -1461,9 +1513,8 @@ static void tcp_l2_buf_flush(struct ctx *c, struct msghdr *mh,
 /**
  * tcp_l2_flags_buf_flush() - Send out buffers for segments with no data (flags)
  * @c:		Execution context
- * @ts:		Current timestamp (not packet timestamp)
  */
-static void tcp_l2_flags_buf_flush(struct ctx *c, struct timespec *ts)
+static void tcp_l2_flags_buf_flush(struct ctx *c)
 {
 	struct msghdr mh = { 0 };
 	unsigned int *buf_used;
@@ -1472,20 +1523,19 @@ static void tcp_l2_flags_buf_flush(struct ctx *c, struct timespec *ts)
 	mh.msg_iov	= tcp6_l2_flags_iov;
 	buf_used	= &tcp6_l2_flags_buf_used;
 	buf_bytes	= &tcp6_l2_flags_buf_bytes;
-	tcp_l2_buf_flush(c, &mh, buf_used, buf_bytes, ts);
+	tcp_l2_buf_flush(c, &mh, buf_used, buf_bytes);
 
 	mh.msg_iov	= tcp4_l2_flags_iov;
 	buf_used	= &tcp4_l2_flags_buf_used;
 	buf_bytes	= &tcp4_l2_flags_buf_bytes;
-	tcp_l2_buf_flush(c, &mh, buf_used, buf_bytes, ts);
+	tcp_l2_buf_flush(c, &mh, buf_used, buf_bytes);
 }
 
 /**
  * tcp_l2_data_buf_flush() - Send out buffers for segments with data
  * @c:		Execution context
- * @ts:		Current timestamp (not packet timestamp)
  */
-static void tcp_l2_data_buf_flush(struct ctx *c, struct timespec *ts)
+static void tcp_l2_data_buf_flush(struct ctx *c)
 {
 	struct msghdr mh = { 0 };
 	unsigned int *buf_used;
@@ -1494,23 +1544,22 @@ static void tcp_l2_data_buf_flush(struct ctx *c, struct timespec *ts)
 	mh.msg_iov = tcp6_l2_iov;
 	buf_used	= &tcp6_l2_buf_used;
 	buf_bytes	= &tcp6_l2_buf_bytes;
-	tcp_l2_buf_flush(c, &mh, buf_used, buf_bytes, ts);
+	tcp_l2_buf_flush(c, &mh, buf_used, buf_bytes);
 
 	mh.msg_iov = tcp4_l2_iov;
 	buf_used	= &tcp4_l2_buf_used;
 	buf_bytes	= &tcp4_l2_buf_bytes;
-	tcp_l2_buf_flush(c, &mh, buf_used, buf_bytes, ts);
+	tcp_l2_buf_flush(c, &mh, buf_used, buf_bytes);
 }
 
 /**
  * tcp_defer_handler() - Handler for TCP deferred tasks
  * @c:		Execution context
- * @now:	Current timestamp
  */
-void tcp_defer_handler(struct ctx *c, struct timespec *now)
+void tcp_defer_handler(struct ctx *c)
 {
-	tcp_l2_flags_buf_flush(c, now);
-	tcp_l2_data_buf_flush(c, now);
+	tcp_l2_flags_buf_flush(c);
+	tcp_l2_data_buf_flush(c);
 }
 
 /**
@@ -1627,7 +1676,7 @@ static int tcp_update_seqack_wnd(struct ctx *c, struct tcp_conn *conn,
 		conn->seq_ack_to_tap = prev_ack_to_tap;
 #else
 	if ((unsigned long)conn->snd_buf < SNDBUF_SMALL || tcp_rtt_dst_low(conn)
-	    || CONN_IS_CLOSING(conn) || conn->flags & CONN_LOCAL || force_seq) {
+	    || CONN_IS_CLOSING(conn) || conn->flags & LOCAL || force_seq) {
 		conn->seq_ack_to_tap = conn->seq_from_tap;
 	} else if (conn->seq_ack_to_tap != conn->seq_from_tap) {
 		if (!tinfo) {
@@ -1660,7 +1709,7 @@ static int tcp_update_seqack_wnd(struct ctx *c, struct tcp_conn *conn,
 	}
 
 #ifdef HAS_SND_WND
-	if ((conn->flags & CONN_LOCAL) || tcp_rtt_dst_low(conn)) {
+	if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) {
 		conn->wnd_to_tap = tinfo->tcpi_snd_wnd;
 	} else {
 		tcp_get_sndbuf(conn);
@@ -1670,6 +1719,8 @@ static int tcp_update_seqack_wnd(struct ctx *c, struct tcp_conn *conn,
 
 	conn->wnd_to_tap = MIN(conn->wnd_to_tap, MAX_WINDOW);
 
+	if (!conn->wnd_to_tap)
+		conn_flag(c, conn, ACK_TO_TAP_DUE);
 out:
 	return conn->wnd_to_tap     != prev_wnd_to_tap ||
 	       conn->seq_ack_to_tap != prev_ack_to_tap;
@@ -1680,12 +1731,10 @@ out:
  * @c:		Execution context
  * @conn:	Connection pointer
  * @flags:	TCP flags: if not set, send segment only if ACK is due
- * @now:	Current timestamp
  *
  * Return: negative error code on connection reset, 0 otherwise
  */
-static int tcp_send_flag(struct ctx *c, struct tcp_conn *conn, int flags,
-			 struct timespec *now)
+static int tcp_send_flag(struct ctx *c, struct tcp_conn *conn, int flags)
 {
 	uint32_t prev_ack_to_tap = conn->seq_ack_to_tap;
 	uint32_t prev_wnd_to_tap = conn->wnd_to_tap;
@@ -1709,7 +1758,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_conn *conn, int flags,
 		return -ECONNRESET;
 	}
 
-	if (!(conn->flags & CONN_LOCAL))
+	if (!(conn->flags & LOCAL))
 		tcp_rtt_dst_check(conn, &tinfo);
 
 	if (!tcp_update_seqack_wnd(c, conn, flags, &tinfo) && !flags)
@@ -1748,8 +1797,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_conn *conn, int flags,
 				mss -= sizeof(struct ipv6hdr);
 
 			if (c->low_wmem &&
-			    !(conn->flags & CONN_LOCAL) &&
-			    !tcp_rtt_dst_low(conn))
+			    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn))
 				mss = MIN(mss, PAGE_SIZE);
 			else if (mss > PAGE_SIZE)
 				mss = ROUND_DOWN(mss, PAGE_SIZE);
@@ -1795,11 +1843,11 @@ static int tcp_send_flag(struct ctx *c, struct tcp_conn *conn, int flags,
 	else
 		tcp6_l2_flags_buf_bytes += iov->iov_len;
 
-	if (th->ack && now)
-		conn->ts_ack_to_tap = *now;
+	if (th->ack)
+		conn_flag(c, conn, ~ACK_TO_TAP_DUE);
 
-	if (th->fin && now)
-		conn->tap_data_noack = *now;
+	if (th->fin)
+		conn_flag(c, conn, ACK_FROM_TAP_DUE);
 
 	/* RFC 793, 3.1: "[...] and the first data octet is ISN+1." */
 	if (th->fin || th->syn)
@@ -1814,7 +1862,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_conn *conn, int flags,
 		}
 
 		if (tcp4_l2_flags_buf_used > ARRAY_SIZE(tcp4_l2_flags_buf) - 2)
-			tcp_l2_flags_buf_flush(c, now);
+			tcp_l2_flags_buf_flush(c);
 	} else {
 		if (flags & DUP_ACK) {
 			memcpy(b6 + 1, b6, sizeof(*b6));
@@ -1824,7 +1872,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_conn *conn, int flags,
 		}
 
 		if (tcp6_l2_flags_buf_used > ARRAY_SIZE(tcp6_l2_flags_buf) - 2)
-			tcp_l2_flags_buf_flush(c, now);
+			tcp_l2_flags_buf_flush(c);
 	}
 
 	return 0;
@@ -1840,7 +1888,7 @@ static void tcp_rst_do(struct ctx *c, struct tcp_conn *conn)
 	if (CONN_IS_CLOSED(conn))
 		return;
 
-	if (!tcp_send_flag(c, conn, RST, NULL))
+	if (!tcp_send_flag(c, conn, RST))
 		tcp_conn_destroy(c, conn);
 }
 
@@ -1874,7 +1922,7 @@ static void tcp_clamp_window(struct ctx *c, struct tcp_conn *conn,
 
 		window = MIN(MAX_WINDOW, window);
 
-		if (conn->flags & CONN_WND_CLAMPED) {
+		if (conn->flags & WND_CLAMPED) {
 			if (conn->wnd_from_tap == window)
 				return;
 
@@ -1893,7 +1941,7 @@ static void tcp_clamp_window(struct ctx *c, struct tcp_conn *conn,
 			window = 256;
 		setsockopt(conn->sock, SOL_TCP, TCP_WINDOW_CLAMP,
 			   &window, sizeof(window));
-		conn_flag(c, conn, CONN_WND_CLAMPED);
+		conn_flag(c, conn, WND_CLAMPED);
 	}
 }
 
@@ -2070,6 +2118,7 @@ static void tcp_conn_from_tap(struct ctx *c, int af, void *addr,
 
 	conn = CONN(c->tcp.conn_count++);
 	conn->sock = s;
+	conn->timer = -1;
 	conn_event(c, conn, TAP_SYN_RCVD);
 
 	conn->wnd_to_tap = WINDOW_DEFAULT;
@@ -2098,9 +2147,6 @@ static void tcp_conn_from_tap(struct ctx *c, int af, void *addr,
 	conn->sock_port = ntohs(th->dest);
 	conn->tap_port = ntohs(th->source);
 
-	conn->ts_sock_act = conn->ts_tap_act = *now;
-	conn->ts_ack_to_tap = conn->ts_ack_from_tap = *now;
-
 	conn->seq_init_from_tap = ntohl(th->seq);
 	conn->seq_from_tap = conn->seq_init_from_tap + 1;
 	conn->seq_ack_to_tap = conn->seq_from_tap;
@@ -2111,10 +2157,12 @@ static void tcp_conn_from_tap(struct ctx *c, int af, void *addr,
 
 	tcp_hash_insert(c, conn, af, addr);
 
-	if (!bind(s, sa, sl))
+	if (!bind(s, sa, sl)) {
 		tcp_rst(c, conn);	/* Nobody is listening then */
-	if (errno != EADDRNOTAVAIL)
-		conn_flag(c, conn, CONN_LOCAL);
+		return;
+	}
+	if (errno != EADDRNOTAVAIL && errno != EACCES)
+		conn_flag(c, conn, LOCAL);
 
 	if (connect(s, sa, sl)) {
 		if (errno != EINPROGRESS) {
@@ -2126,7 +2174,7 @@ static void tcp_conn_from_tap(struct ctx *c, int af, void *addr,
 	} else {
 		tcp_get_sndbuf(conn);
 
-		if (tcp_send_flag(c, conn, SYN | ACK, now))
+		if (tcp_send_flag(c, conn, SYN | ACK))
 			return;
 
 		conn_event(c, conn, TAP_SYN_ACK_SENT);
@@ -2169,7 +2217,7 @@ static int tcp_sock_consume(struct tcp_conn *conn, uint32_t ack_seq)
  * @now:	Current timestamp
  */
 static void tcp_data_to_tap(struct ctx *c, struct tcp_conn *conn, ssize_t plen,
-			    int no_csum, uint32_t seq, struct timespec *now)
+			    int no_csum, uint32_t seq)
 {
 	struct iovec *iov;
 	size_t len;
@@ -2183,7 +2231,7 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_conn *conn, ssize_t plen,
 		iov = tcp4_l2_iov + tcp4_l2_buf_used++;
 		tcp4_l2_buf_bytes += iov->iov_len = len + sizeof(b->vnet_len);
 		if (tcp4_l2_buf_used > ARRAY_SIZE(tcp4_l2_buf) - 1)
-			tcp_l2_data_buf_flush(c, now);
+			tcp_l2_data_buf_flush(c);
 	} else if (CONN_V6(conn)) {
 		struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used];
 
@@ -2192,7 +2240,7 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_conn *conn, ssize_t plen,
 		iov = tcp6_l2_iov + tcp6_l2_buf_used++;
 		tcp6_l2_buf_bytes += iov->iov_len = len + sizeof(b->vnet_len);
 		if (tcp6_l2_buf_used > ARRAY_SIZE(tcp6_l2_buf) - 1)
-			tcp_l2_data_buf_flush(c, now);
+			tcp_l2_data_buf_flush(c);
 	}
 }
 
@@ -2200,14 +2248,12 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_conn *conn, ssize_t plen,
  * tcp_data_from_sock() - Handle new data from socket, queue to tap, in window
  * @c:		Execution context
  * @conn:	Connection pointer
- * @now:	Current timestamp
  *
  * Return: negative on connection reset, 0 otherwise
  *
  * #syscalls recvmsg
  */
-static int tcp_data_from_sock(struct ctx *c, struct tcp_conn *conn,
-			      struct timespec *now)
+static int tcp_data_from_sock(struct ctx *c, struct tcp_conn *conn)
 {
 	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
 	int sendlen, len, plen, v4 = CONN_V4(conn);
@@ -2225,8 +2271,8 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_conn *conn,
 	}
 
 	if (!conn->wnd_from_tap || already_sent >= conn->wnd_from_tap) {
-		conn_flag(c, conn, CONN_STALLED);
-		conn->tap_data_noack = *now;
+		conn_flag(c, conn, STALLED);
+		conn_flag(c, conn, ACK_FROM_TAP_DUE);
 		return 0;
 	}
 
@@ -2248,7 +2294,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_conn *conn,
 
 	if (( v4 && tcp4_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp4_l2_buf)) ||
 	    (!v4 && tcp6_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp6_l2_buf)))
-		tcp_l2_data_buf_flush(c, now);
+		tcp_l2_data_buf_flush(c);
 
 	for (i = 0, iov = iov_sock + 1; i < fill_bufs; i++, iov++) {
 		if (v4)
@@ -2274,11 +2320,11 @@ recvmsg:
 
 	sendlen = len - already_sent;
 	if (sendlen <= 0) {
-		conn_flag(c, conn, CONN_STALLED);
+		conn_flag(c, conn, STALLED);
 		return 0;
 	}
 
-	conn_flag(c, conn, ~CONN_STALLED);
+	conn_flag(c, conn, ~STALLED);
 
 	send_bufs = DIV_ROUND_UP(sendlen, conn->tap_mss);
 	last_len = sendlen - (send_bufs - 1) * conn->tap_mss;
@@ -2294,11 +2340,11 @@ recvmsg:
 		if (i == send_bufs - 1)
 			plen = last_len;
 
-		tcp_data_to_tap(c, conn, plen, no_csum, conn->seq_to_tap, now);
+		tcp_data_to_tap(c, conn, plen, no_csum, conn->seq_to_tap);
 		conn->seq_to_tap += plen;
 	}
 
-	conn->tap_data_noack = conn->ts_ack_to_tap = *now;
+	conn_flag(c, conn, ACK_FROM_TAP_DUE);
 
 	return 0;
 
@@ -2312,7 +2358,7 @@ err:
 
 zero_len:
 	if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
-		if ((ret = tcp_send_flag(c, conn, FIN | ACK, now))) {
+		if ((ret = tcp_send_flag(c, conn, FIN | ACK))) {
 			tcp_rst(c, conn);
 			return ret;
 		}
@@ -2329,13 +2375,11 @@ zero_len:
  * @conn:	Connection pointer
  * @msg:	Array of messages from tap
  * @count:	Count of messages
- * @now:	Current timestamp
  *
  * #syscalls sendmsg
  */
 static void tcp_data_from_tap(struct ctx *c, struct tcp_conn *conn,
-			      struct tap_l4_msg *msg, int count,
-			      struct timespec *now)
+			      struct tap_l4_msg *msg, int count)
 {
 	int i, iov_i, ack = 0, fin = 0, retr = 0, keep = -1;
 	uint32_t max_ack_seq = conn->seq_ack_from_tap;
@@ -2445,16 +2489,18 @@ static void tcp_data_from_tap(struct ctx *c, struct tcp_conn *conn,
 	tcp_clamp_window(c, conn, NULL, 0, max_ack_seq_wnd, 0);
 
 	if (ack) {
-		conn->ts_ack_from_tap = *now;
-		if (max_ack_seq == conn->seq_to_tap)
-			conn->tap_data_noack = ((struct timespec) { 0, 0 });
+		if (max_ack_seq == conn->seq_to_tap) {
+			conn_flag(c, conn, ~ACK_FROM_TAP_DUE);
+			conn->retrans = 0;
+		}
+
 		tcp_sock_consume(conn, max_ack_seq);
 	}
 
 	if (retr) {
 		conn->seq_ack_from_tap = max_ack_seq;
 		conn->seq_to_tap = max_ack_seq;
-		tcp_data_from_sock(c, conn, now);
+		tcp_data_from_sock(c, conn);
 	}
 
 	if (!iov_i)
@@ -2470,14 +2516,14 @@ eintr:
 			 *   Then swiftly looked away and left.
 			 */
 			conn->seq_from_tap = seq_from_tap;
-			tcp_send_flag(c, conn, ACK, now);
+			tcp_send_flag(c, conn, ACK);
 		}
 
 		if (errno == EINTR)
 			goto eintr;
 
 		if (errno == EAGAIN || errno == EWOULDBLOCK) {
-			tcp_send_flag(c, conn, ACK_IF_NEEDED, now);
+			tcp_send_flag(c, conn, ACK_IF_NEEDED);
 			return;
 		}
 		tcp_rst(c, conn);
@@ -2487,7 +2533,7 @@ eintr:
 	if (n < (int)(seq_from_tap - conn->seq_from_tap)) {
 		partial_send = 1;
 		conn->seq_from_tap += n;
-		tcp_send_flag(c, conn, ACK_IF_NEEDED, now);
+		tcp_send_flag(c, conn, ACK_IF_NEEDED);
 	} else {
 		conn->seq_from_tap += n;
 	}
@@ -2496,7 +2542,7 @@ out:
 	if (keep != -1) {
 		if (conn->seq_dup_ack != conn->seq_from_tap) {
 			conn->seq_dup_ack = conn->seq_from_tap;
-			tcp_send_flag(c, conn, DUP_ACK, now);
+			tcp_send_flag(c, conn, DUP_ACK);
 		}
 		return;
 	}
@@ -2510,7 +2556,7 @@ out:
 
 		conn_event(c, conn, TAP_FIN_RCVD);
 	} else {
-		tcp_send_flag(c, conn, ACK_IF_NEEDED, now);
+		tcp_send_flag(c, conn, ACK_IF_NEEDED);
 	}
 }
 
@@ -2520,11 +2566,9 @@ out:
  * @conn:	Connection pointer
  * @th:		TCP header of SYN, ACK segment from tap/guest
  * @len:	Packet length of SYN, ACK segment at L4, host order
- * @now:	Current timestamp
  */
 static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_conn *conn,
-				      struct tcphdr *th, size_t len,
-				      struct timespec *now)
+				      struct tcphdr *th, size_t len)
 {
 	tcp_clamp_window(c, conn, th, len, 0, 1);
 	conn->tap_mss = tcp_conn_tap_mss(c, conn, th, len);
@@ -2538,8 +2582,8 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_conn *conn,
 	/* The client might have sent data already, which we didn't
 	 * dequeue waiting for SYN,ACK from tap -- check now.
 	 */
-	tcp_data_from_sock(c, conn, now);
-	tcp_send_flag(c, conn, ACK_IF_NEEDED, now);
+	tcp_data_from_sock(c, conn);
+	tcp_send_flag(c, conn, ACK_IF_NEEDED);
 }
 
 /**
@@ -2559,6 +2603,7 @@ int tcp_tap_handler(struct ctx *c, int af, void *addr,
 	struct tcphdr *th = (struct tcphdr *)(pkt_buf + msg[0].pkt_buf_offset);
 	uint16_t len = msg[0].l4_len;
 	struct tcp_conn *conn;
+	int ack_due = 0;
 
 	conn = tcp_hash_lookup(c, af, addr, htons(th->source), htons(th->dest));
 
@@ -2574,13 +2619,17 @@ int tcp_tap_handler(struct ctx *c, int af, void *addr,
 		return count;
 	}
 
-	conn->ts_tap_act = *now;
-	conn_flag(c, conn, ~CONN_STALLED);
+	if (th->ack) {
+		conn_flag(c, conn, ~ACK_FROM_TAP_DUE);
+		conn->retrans = 0;
+	}
+
+	conn_flag(c, conn, ~STALLED);
 
 	/* Establishing connection from socket */
 	if (conn->events & SOCK_ACCEPTED) {
 		if (th->syn && th->ack && !th->fin)
-			tcp_conn_from_sock_finish(c, conn, th, len, now);
+			tcp_conn_from_sock_finish(c, conn, th, len);
 		else
 			tcp_rst(c, conn);
 
@@ -2600,7 +2649,7 @@ int tcp_tap_handler(struct ctx *c, int af, void *addr,
 			conn->seq_from_tap++;
 
 			shutdown(conn->sock, SHUT_WR);
-			tcp_send_flag(c, conn, ACK, now);
+			tcp_send_flag(c, conn, ACK);
 			conn_event(c, conn, SOCK_FIN_SENT);
 
 			return count;
@@ -2621,11 +2670,6 @@ int tcp_tap_handler(struct ctx *c, int af, void *addr,
 
 	/* Established connections not accepting data from tap */
 	if (conn->events & TAP_FIN_RCVD) {
-		if (th->ack) {
-			conn->tap_data_noack = ((struct timespec) { 0, 0 });
-			conn->ts_ack_from_tap = *now;
-		}
-
 		if (conn->events & SOCK_FIN_RCVD &&
 		    conn->seq_ack_from_tap == conn->seq_to_tap)
 			tcp_conn_destroy(c, conn);
@@ -2634,14 +2678,20 @@ int tcp_tap_handler(struct ctx *c, int af, void *addr,
 	}
 
 	/* Established connections accepting data from tap */
-	tcp_data_from_tap(c, conn, msg, count, now);
+	tcp_data_from_tap(c, conn, msg, count);
+	if (conn->seq_ack_to_tap != conn->seq_from_tap)
+		ack_due = 1;
 
 	if ((conn->events & TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_SENT)) {
 		shutdown(conn->sock, SHUT_WR);
 		conn_event(c, conn, SOCK_FIN_SENT);
-		tcp_send_flag(c, conn, ACK, now);
+		tcp_send_flag(c, conn, ACK);
+		ack_due = 0;
 	}
 
+	if (ack_due)
+		conn_flag(c, conn, ACK_TO_TAP_DUE);
+
 	return count;
 }
 
@@ -2649,10 +2699,8 @@ int tcp_tap_handler(struct ctx *c, int af, void *addr,
  * tcp_connect_finish() - Handle completion of connect() from EPOLLOUT event
  * @c:		Execution context
  * @conn:	Connection pointer
- * @now:	Current timestamp
  */
-static void tcp_connect_finish(struct ctx *c, struct tcp_conn *conn,
-			       struct timespec *now)
+static void tcp_connect_finish(struct ctx *c, struct tcp_conn *conn)
 {
 	socklen_t sl;
 	int so;
@@ -2663,10 +2711,11 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_conn *conn,
 		return;
 	}
 
-	if (tcp_send_flag(c, conn, SYN | ACK, now))
+	if (tcp_send_flag(c, conn, SYN | ACK))
 		return;
 
 	conn_event(c, conn, TAP_SYN_ACK_SENT);
+	conn_flag(c, conn, ACK_FROM_TAP_DUE);
 }
 
 /**
@@ -2693,7 +2742,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
 
 	conn = CONN(c->tcp.conn_count++);
 	conn->sock = s;
-
+	conn->timer = -1;
 	conn_event(c, conn, SOCK_ACCEPTED);
 
 	if (ref.r.p.tcp.tcp.v6) {
@@ -2759,16 +2808,70 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
 
 	conn->wnd_from_tap = WINDOW_DEFAULT;
 
-	conn->ts_sock_act = conn->ts_tap_act = *now;
-	conn->ts_ack_from_tap = conn->ts_ack_to_tap = *now;
-
-	tcp_send_flag(c, conn, SYN, now);
+	tcp_send_flag(c, conn, SYN);
+	conn_flag(c, conn, ACK_FROM_TAP_DUE);
 
 	tcp_get_sndbuf(conn);
 }
 
 /**
- * tcp_sock_handler() - Handle new data from socket
+ * tcp_timer_handler() - timerfd events: close, send ACK, retransmit, or reset
+ * @c:		Execution context
+ * @ref:	epoll reference of timer (not connection)
+ */
+static void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
+{
+	struct tcp_conn *conn = CONN(ref.r.p.tcp.tcp.index);
+	struct epoll_event ev = { 0 };
+
+	if (CONN_IS_CLOSED(conn)) {
+		tcp_hash_remove(conn);
+		tcp_table_compact(c, conn);
+		if (conn->timer != -1) {
+			epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->timer, &ev);
+			close(conn->timer);
+			conn->timer = -1;
+		}
+	} else if (conn->flags & ACK_TO_TAP_DUE) {
+		tcp_send_flag(c, conn, ACK_IF_NEEDED);
+		conn_flag(c, conn, ~ACK_TO_TAP_DUE);
+	} else if (conn->flags & ACK_FROM_TAP_DUE) {
+		if (!(conn->events & ESTABLISHED)) {
+			debug("TCP: index %i, handshake timeout", conn - tc);
+			tcp_rst(c, conn);
+		} else if (conn->events & TAP_FIN_SENT) {
+			debug("TCP: index %i, FIN timeout", conn - tc);
+			tcp_rst(c, conn);
+		} else if (conn->retrans == TCP_MAX_RETRANS) {
+			debug("TCP: index %i, maximum retransmissions exceeded",
+			      conn - tc);
+			tcp_rst(c, conn);
+		} else {
+			debug("TCP: index %i, ACK timeout, retry", conn - tc);
+			conn->retrans++;
+			conn->seq_to_tap = conn->seq_ack_from_tap;
+			tcp_data_from_sock(c, conn);
+		}
+	} else {
+		struct itimerspec new = { { 0 }, { ACT_TIMEOUT, 0 } };
+		struct itimerspec old = { { 0 }, { 0 } };
+
+		/* Activity timeout: if it was already set, reset the
+		 * connection, otherwise, it was a left-over from ACK_TO_TAP_DUE
+		 * or ACK_FROM_TAP_DUE, so just set the long timeout in that
+		 * case. This avoids having to preemptively reset the timer on
+		 * ~ACK_TO_TAP_DUE or ~ACK_FROM_TAP_DUE.
+		 */
+		timerfd_settime(conn->timer, 0, &new, &old);
+		if (old.it_value.tv_sec == ACT_TIMEOUT) {
+			debug("TCP: index %i, activity timeout", conn - tc);
+			tcp_rst(c, conn);
+		}
+	}
+}
+
+/**
+ * tcp_sock_handler() - Handle new data from socket, or timerfd event
  * @c:		Execution context
  * @ref:	epoll reference
  * @events:	epoll events bitmap
@@ -2779,6 +2882,11 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
 {
 	struct tcp_conn *conn;
 
+	if (ref.r.p.tcp.tcp.timer) {
+		tcp_timer_handler(c, ref);
+		return;
+	}
+
 	if (ref.r.p.tcp.tcp.splice) {
 		tcp_sock_handler_splice(c, ref, events);
 		return;
@@ -2792,8 +2900,6 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
 	if (!(conn = CONN(ref.r.p.tcp.tcp.index)))
 		return;
 
-	conn->ts_sock_act = *now;
-
 	if (events & EPOLLERR) {
 		tcp_rst(c, conn);
 		return;
@@ -2812,7 +2918,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
 			conn_event(c, conn, SOCK_FIN_RCVD);
 
 		if (events & EPOLLIN)
-			tcp_data_from_sock(c, conn, now);
+			tcp_data_from_sock(c, conn);
 
 		if (events & EPOLLOUT)
 			tcp_update_seqack_wnd(c, conn, 0, NULL);
@@ -2832,7 +2938,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
 
 	if (conn->events == TAP_SYN_RCVD) {
 		if (events & EPOLLOUT)
-			tcp_connect_finish(c, conn, now);
+			tcp_connect_finish(c, conn);
 		/* Data? Check later */
 	}
 }
@@ -2981,9 +3087,9 @@ static int tcp_sock_refill(void *arg)
 	}
 
 	for (i = 0; a->c->v4 && i < TCP_SOCK_POOL_SIZE; i++, p4++) {
-		if (*p4 >= 0) {
+		if (*p4 >= 0)
 			break;
-		}
+
 		*p4 = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP);
 		if (*p4 > SOCKET_MAX) {
 			close(*p4);
@@ -2995,9 +3101,9 @@ static int tcp_sock_refill(void *arg)
 	}
 
 	for (i = 0; a->c->v6 && i < TCP_SOCK_POOL_SIZE; i++, p6++) {
-		if (*p6 >= 0) {
+		if (*p6 >= 0)
 			break;
-		}
+
 		*p6 = socket(AF_INET6, SOCK_STREAM | SOCK_NONBLOCK,
 			     IPPROTO_TCP);
 		if (*p6 > SOCKET_MAX) {
@@ -3091,72 +3197,6 @@ int tcp_sock_init(struct ctx *c, struct timespec *now)
 	return 0;
 }
 
-/**
- * tcp_timer_one() - Handler for timed events on one socket
- * @c:		Execution context
- * @conn:	Connection pointer
- * @ts:		Timestamp from caller
- */
-static void tcp_timer_one(struct ctx *c, struct tcp_conn *conn,
-			  struct timespec *ts)
-{
-	int ack_from_tap = timespec_diff_ms(ts, &conn->ts_ack_from_tap);
-	int ack_to_tap = timespec_diff_ms(ts, &conn->ts_ack_to_tap);
-	int sock_act = timespec_diff_ms(ts, &conn->ts_sock_act);
-	int tap_act = timespec_diff_ms(ts, &conn->ts_tap_act);
-	int tap_data_noack;
-
-	if (!memcmp(&conn->tap_data_noack, &((struct timespec){ 0, 0 }),
-		    sizeof(struct timespec)))
-		tap_data_noack = 0;
-	else
-		tap_data_noack = timespec_diff_ms(ts, &conn->tap_data_noack);
-
-	if (CONN_IS_CLOSED(conn)) {
-		tcp_hash_remove(conn);
-		tcp_table_compact(c, conn);
-		return;
-	}
-
-	if (!(conn->events & ESTABLISHED)) {
-		if (ack_from_tap > SYN_TIMEOUT)
-			tcp_rst(c, conn);
-		return;
-	}
-
-	if (tap_act > ACT_TIMEOUT && sock_act > ACT_TIMEOUT)
-		goto rst;
-
-	if (!conn->wnd_to_tap || ack_to_tap > ACK_INTERVAL)
-		tcp_send_flag(c, conn, ACK_IF_NEEDED, ts);
-
-	if (tap_data_noack > ACK_TIMEOUT) {
-		if (conn->seq_ack_from_tap < conn->seq_to_tap) {
-			if (tap_data_noack > LAST_ACK_TIMEOUT)
-				goto rst;
-
-			conn->seq_to_tap = conn->seq_ack_from_tap;
-			tcp_data_from_sock(c, conn, ts);
-		}
-		return;
-	}
-
-	if (conn->events & TAP_FIN_SENT && tap_data_noack > FIN_TIMEOUT)
-		goto rst;
-
-	if (conn->events & SOCK_FIN_SENT && sock_act > FIN_TIMEOUT)
-		goto rst;
-
-	if (conn->events & SOCK_FIN_SENT && conn->events & SOCK_FIN_RCVD) {
-		if (sock_act > LAST_ACK_TIMEOUT || tap_act > LAST_ACK_TIMEOUT)
-			goto rst;
-	}
-
-	return;
-rst:
-	tcp_rst(c, conn);
-}
-
 /**
  * struct tcp_port_detect_arg - Arguments for tcp_port_detect()
  * @c:			Execution context
@@ -3281,7 +3321,6 @@ static int tcp_port_rebind(void *arg)
 void tcp_timer(struct ctx *c, struct timespec *now)
 {
 	struct tcp_sock_refill_arg refill_arg = { c, 0 };
-	int i;
 
 	if (c->mode == MODE_PASTA) {
 		if (timespec_diff_ms(now, &c->tcp.port_detect_ts) >
@@ -3318,7 +3357,4 @@ void tcp_timer(struct ctx *c, struct timespec *now)
 				NS_CALL(tcp_sock_refill, &refill_arg);
 		}
 	}
-
-	for (i = c->tcp.conn_count - 1; i >= 0; i--)
-		tcp_timer_one(c, CONN(i), now);
 }
diff --git a/tcp.h b/tcp.h
index b4e3fde..3154b4b 100644
--- a/tcp.h
+++ b/tcp.h
@@ -6,7 +6,9 @@
 #ifndef TCP_H
 #define TCP_H
 
-#define TCP_TIMER_INTERVAL		20 /* ms */
+#define REFILL_INTERVAL			1000 /* ms */
+#define PORT_DETECT_INTERVAL		1000
+#define TCP_TIMER_INTERVAL	MIN(REFILL_INTERVAL, PORT_DETECT_INTERVAL)
 
 #define TCP_MAX_CONNS			(128 * 1024)
 #define TCP_MAX_SOCKS			(TCP_MAX_CONNS + USHRT_MAX * 2)
@@ -21,7 +23,7 @@ int tcp_tap_handler(struct ctx *c, int af, void *addr,
 		    struct tap_l4_msg *msg, int count, struct timespec *now);
 int tcp_sock_init(struct ctx *c, struct timespec *now);
 void tcp_timer(struct ctx *c, struct timespec *now);
-void tcp_defer_handler(struct ctx *c, struct timespec *now);
+void tcp_defer_handler(struct ctx *c);
 
 void tcp_sock_set_bufsize(struct ctx *c, int s);
 void tcp_update_l2_buf(unsigned char *eth_d, unsigned char *eth_s,
@@ -34,6 +36,7 @@ void tcp_remap_to_init(in_port_t port, in_port_t delta);
  * @listen:		Set if this file descriptor is a listening socket
  * @splice:		Set if descriptor is associated to a spliced connection
  * @v6:			Set for IPv6 sockets or connections
+ * @timer:		Reference is a timerfd descriptor for connection
  * @index:		Index of connection in table, or port for bound sockets
  * @u32:		Opaque u32 value of reference
  */
@@ -42,6 +45,7 @@ union tcp_epoll_ref {
 		uint32_t	listen:1,
 				splice:1,
 				v6:1,
+				timer:1,
 				index:20;
 	} tcp;
 	uint32_t u32;
-- 
2.35.1
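
For readers unfamiliar with timerfd, the mechanism the patch relies on (a
per-connection timer descriptor that sits on the same epoll list as the
sockets) can be sketched in isolation as follows. This is only a minimal
illustration, not passt's code: timer_add(), timer_wait() and timer_demo()
are hypothetical names, and the 10 ms timeout is an arbitrary value chosen
for the demo.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not from passt): create a one-shot timer expiring
 * after @ms milliseconds and add it to @epollfd.
 * Returns the timer descriptor, -1 on failure.
 */
static int timer_add(int epollfd, long ms)
{
	struct itimerspec it = { { 0, 0 },
				 { ms / 1000, (ms % 1000) * 1000000L } };
	struct epoll_event ev = { 0 };
	int fd;

	fd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC);
	if (fd < 0)
		return -1;

	ev.events = EPOLLIN;
	ev.data.fd = fd;

	if (timerfd_settime(fd, 0, &it, NULL) ||
	    epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev)) {
		close(fd);
		return -1;
	}

	return fd;
}

/* Hypothetical helper: wait up to @timeout_ms for one timer event.
 * Returns the expiration count read from the timer, 0 if epoll_wait()
 * timed out, -1 on error.
 */
static int timer_wait(int epollfd, int timeout_ms)
{
	struct epoll_event ev;
	uint64_t expirations;
	int n;

	n = epoll_wait(epollfd, &ev, 1, timeout_ms);
	if (n <= 0)
		return n;

	/* Reading the descriptor clears the pending expirations */
	if (read(ev.data.fd, &expirations, sizeof(expirations)) !=
	    (ssize_t)sizeof(expirations))
		return -1;

	return (int)expirations;
}

/* Arm a 10 ms one-shot timer, wait for it, then tear everything down */
static int timer_demo(void)
{
	int epollfd = epoll_create1(EPOLL_CLOEXEC), fd, ret;

	if (epollfd < 0)
		return -1;

	fd = timer_add(epollfd, 10);
	if (fd < 0) {
		close(epollfd);
		return -1;
	}

	ret = timer_wait(epollfd, 1000);

	epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
	close(fd);
	close(epollfd);

	return ret;
}
```

Compared to a periodic scan, nothing runs for a connection until its own
descriptor becomes readable, so the cost no longer grows with the number of
idle connections.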

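The new "timer" bit in union tcp_epoll_ref is what lets tcp_sock_handler()
tell a timerfd event from a socket event. A minimal sketch of that idea,
assuming a simplified copy of the union from the tcp.h hunk: the
tcp_ref_sketch type, the ref_kind enum, ref_pack() and ref_dispatch() are
made up for illustration, and the handling of listening sockets is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified copy of the layout from the tcp.h hunk, for illustration */
union tcp_ref_sketch {
	struct {
		uint32_t	listen:1,
				splice:1,
				v6:1,
				timer:1,
				index:20;
	} tcp;
	uint32_t u32;
};

/* Hypothetical classification of an epoll reference */
enum ref_kind { REF_TIMER, REF_SPLICE, REF_SOCK };

/* Build the opaque 32-bit value that would be stored in epoll data */
static uint32_t ref_pack(int timer, uint32_t index)
{
	union tcp_ref_sketch ref = { .u32 = 0 };

	ref.tcp.timer = !!timer;
	ref.tcp.index = index;

	return ref.u32;
}

/* Decode a reference: check the timer bit first, as the patched
 * tcp_sock_handler() does, then fall back to the other kinds
 */
static enum ref_kind ref_dispatch(uint32_t u32, uint32_t *index)
{
	union tcp_ref_sketch ref = { .u32 = u32 };

	*index = ref.tcp.index;

	if (ref.tcp.timer)
		return REF_TIMER;
	if (ref.tcp.splice)
		return REF_SPLICE;

	return REF_SOCK;
}
```

With this layout, the timer reference and the socket reference of one
connection carry the same index and differ only in the timer bit, so a timer
event can look up its connection without an extra table.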

Thread overview: 25+ messages
2022-03-25 22:52 [PATCH 00/24] Boundary-checked "packets", TCP timerfd timeouts, assorted fixes Stefano Brivio
2022-03-25 22:52 ` [PATCH 01/24] conf, util, tap: Implement --trace option for extra verbose logging Stefano Brivio
2022-03-25 22:52 ` [PATCH 02/24] pcap: Fix mistake in printed string Stefano Brivio
2022-03-25 22:52 ` [PATCH 03/24] util: Drop CHECK_SET_MIN_MAX{,_PROTO_FD} macros Stefano Brivio
2022-03-25 22:52 ` [PATCH 04/24] util: Use standard int types Stefano Brivio
2022-03-25 22:52 ` [PATCH 05/24] tcp: Refactor to use events instead of states, split out spliced implementation Stefano Brivio
2022-03-25 22:52 ` [PATCH 06/24] test/lib/video: Fill in href attributes of video shortcuts Stefano Brivio
2022-03-25 22:52 ` [PATCH 07/24] udp: Drop _splice from recv, send, sendto static buffer names Stefano Brivio
2022-03-25 22:52 ` [PATCH 08/24] udp: Split buffer queueing/writing parts of udp_sock_handler() Stefano Brivio
2022-03-25 22:52 ` [PATCH 09/24] dhcpv6, tap, tcp: Use IN6_ARE_ADDR_EQUAL instead of open-coded memcmp() Stefano Brivio
2022-03-25 22:52 ` [PATCH 10/24] udp: Use flags for local, loopback, and configured unicast binds Stefano Brivio
2022-03-25 22:52 ` [PATCH 11/24] Makefile: Enable a few hardening flags Stefano Brivio
2022-03-25 22:52 ` [PATCH 12/24] test: Add asciinema(1) as requirement for CI in README Stefano Brivio
2022-03-25 22:52 ` [PATCH 13/24] test, seccomp, Makefile: Switch to valgrind runs for passt functional tests Stefano Brivio
2022-03-25 22:52 ` [PATCH 14/24] tcp, udp, util: Enforce 24-bit limit on socket numbers Stefano Brivio
2022-03-25 22:52 ` Stefano Brivio [this message]
2022-03-25 22:52 ` [PATCH 16/24] tcp_splice: Close sockets right away on high number of open files Stefano Brivio
2022-03-25 22:52 ` [PATCH 17/24] test/perf: Work-around for virtio_net hang before long streams from guest Stefano Brivio
2022-03-25 22:52 ` [PATCH 18/24] README: Avoid "here" links Stefano Brivio
2022-03-25 22:52 ` [PATCH 19/24] README: Update Interfaces and Availability sections Stefano Brivio
2022-03-25 22:52 ` [PATCH 20/24] tcp: Fit struct tcp_conn into a single 64-byte cacheline Stefano Brivio
2022-03-25 22:52 ` [PATCH 21/24] dhcp: Minimum option length implied by RFC 951 is 60 bytes, not 62 Stefano Brivio
2022-03-25 22:52 ` [PATCH 22/24] tcp, tcp_splice: Use less awkward syntax to swap in/out sockets from pools Stefano Brivio
2022-03-25 22:52 ` [PATCH 23/24] util: Fix function declaration style of write_pidfile() Stefano Brivio
2022-03-25 22:53 ` [PATCH 24/24] treewide: Packet abstraction with mandatory boundary checks Stefano Brivio
