* [PATCH 0/2] tcp: Handle keep-alives, avoid unnecessary timer scheduling @ 2024-11-19 19:53 Stefano Brivio 2024-11-19 19:53 ` [PATCH 1/2] tcp: Reset ACK_TO_TAP_DUE flag whenever an ACK isn't needed anymore Stefano Brivio 2024-11-19 19:53 ` [PATCH 2/2] tcp: Acknowledge keep-alive segments, ignore them for the rest Stefano Brivio 0 siblings, 2 replies; 12+ messages in thread From: Stefano Brivio @ 2024-11-19 19:53 UTC (permalink / raw) To: passt-dev; +Cc: David Gibson, Tim Besard Patch 1/2 avoids some timer rescheduling noise that came up while investigating the issue solved by 2/2. Stefano Brivio (2): tcp: Reset ACK_TO_TAP_DUE flag whenever an ACK isn't needed anymore tcp: Acknowledge keep-alive segments, ignore them for the rest tcp.c | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) -- 2.43.0 ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 1/2] tcp: Reset ACK_TO_TAP_DUE flag whenever an ACK isn't needed anymore 2024-11-19 19:53 [PATCH 0/2] tcp: Handle keep-alives, avoid unnecessary timer scheduling Stefano Brivio @ 2024-11-19 19:53 ` Stefano Brivio 2024-11-20 0:58 ` David Gibson 2024-11-19 19:53 ` [PATCH 2/2] tcp: Acknowledge keep-alive segments, ignore them for the rest Stefano Brivio 1 sibling, 1 reply; 12+ messages in thread From: Stefano Brivio @ 2024-11-19 19:53 UTC (permalink / raw) To: passt-dev; +Cc: David Gibson, Tim Besard We enter the timer handler with the ACK_TO_TAP_DUE flag, call tcp_prepare_flags() with ACK_IF_NEEDED, and realise that we acknowledged everything meanwhile, so we return early, but we also need to reset that flag to avoid unnecessarily scheduling the timer over and over again until more pending data appears. I'm not sure if this fixes any real issue, but I've spotted this in several logs reported by users, including one where we have some unexpected bursts of high CPU load during TCP transfers at low rates, from https://github.com/containers/podman/issues/23686. Link: https://github.com/containers/podman/discussions/24572 Link: https://github.com/containers/podman/issues/23686 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> --- tcp.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/tcp.c b/tcp.c index 6a98dfa..f357920 100644 --- a/tcp.c +++ b/tcp.c @@ -1235,8 +1235,10 @@ int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, int s = conn->sock; if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap) && - !flags && conn->wnd_to_tap) + !flags && conn->wnd_to_tap) { + conn_flag(c, conn, ~ACK_TO_TAP_DUE); return 0; + } if (getsockopt(s, SOL_TCP, TCP_INFO, &tinfo, &sl)) { conn_event(c, conn, CLOSED); -- @@ -1235,8 +1235,10 @@ int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, int s = conn->sock; if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap) && - !flags && conn->wnd_to_tap) + !flags && conn->wnd_to_tap) { + conn_flag(c, conn, ~ACK_TO_TAP_DUE); return 0; + } if (getsockopt(s, SOL_TCP, TCP_INFO, &tinfo, &sl)) { conn_event(c, conn, CLOSED); -- 2.43.0 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] tcp: Reset ACK_TO_TAP_DUE flag whenever an ACK isn't needed anymore 2024-11-19 19:53 ` [PATCH 1/2] tcp: Reset ACK_TO_TAP_DUE flag whenever an ACK isn't needed anymore Stefano Brivio @ 2024-11-20 0:58 ` David Gibson 0 siblings, 0 replies; 12+ messages in thread From: David Gibson @ 2024-11-20 0:58 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev, Tim Besard [-- Attachment #1: Type: text/plain, Size: 1706 bytes --] On Tue, Nov 19, 2024 at 08:53:43PM +0100, Stefano Brivio wrote: > We enter the timer handler with the ACK_TO_TAP_DUE flag, call > tcp_prepare_flags() with ACK_IF_NEEDED, and realise that we > acknowledged everything meanwhile, so we return early, but we also > need to reset that flag to avoid unnecessarily scheduling the timer > over and over again until more pending data appears. > > I'm not sure if this fixes any real issue, but I've spotted this > in several logs reported by users, including one where we have some > unexpected bursts of high CPU load during TCP transfers at low rates, > from https://github.com/containers/podman/issues/23686. > > Link: https://github.com/containers/podman/discussions/24572 > Link: https://github.com/containers/podman/issues/23686 > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> > --- > tcp.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/tcp.c b/tcp.c > index 6a98dfa..f357920 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -1235,8 +1235,10 @@ int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, > int s = conn->sock; > > if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap) && > - !flags && conn->wnd_to_tap) > + !flags && conn->wnd_to_tap) { > + conn_flag(c, conn, ~ACK_TO_TAP_DUE); > return 0; > + } > > if (getsockopt(s, SOL_TCP, TCP_INFO, &tinfo, &sl)) { > conn_event(c, conn, CLOSED); -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 2/2] tcp: Acknowledge keep-alive segments, ignore them for the rest 2024-11-19 19:53 [PATCH 0/2] tcp: Handle keep-alives, avoid unnecessary timer scheduling Stefano Brivio 2024-11-19 19:53 ` [PATCH 1/2] tcp: Reset ACK_TO_TAP_DUE flag whenever an ACK isn't needed anymore Stefano Brivio @ 2024-11-19 19:53 ` Stefano Brivio 2024-11-20 1:02 ` David Gibson 1 sibling, 1 reply; 12+ messages in thread From: Stefano Brivio @ 2024-11-19 19:53 UTC (permalink / raw) To: passt-dev; +Cc: David Gibson, Tim Besard RFC 9293, 3.8.4 says: Implementers MAY include "keep-alives" in their TCP implementations (MAY-5), although this practice is not universally accepted. Some TCP implementations, however, have included a keep-alive mechanism. To confirm that an idle connection is still active, these implementations send a probe segment designed to elicit a response from the TCP peer. Such a segment generally contains SEG.SEQ = SND.NXT-1 and may or may not contain one garbage octet of data. If keep-alives are included, the application MUST be able to turn them on or off for each TCP connection (MUST-24), and they MUST default to off (MUST-25). but currently, tcp_data_from_tap() is not aware of this and will schedule a fast re-transmit on the second keep-alive (because it's also a duplicate ACK), ignoring the fact that the sequence number was rewinded to SND.NXT-1. ACK these keep-alive segments, reset the activity timeout, and ignore them for the rest. At some point, we could think of implementing an approximation of keep-alive segments on outbound sockets, for example by setting TCP_KEEPIDLE to 1, and a large TCP_KEEPINTVL, so that we send a single keep-alive segment at approximately the same time, and never reset the connection. That's beyond the scope of this fix, though. Reported-by: Tim Besard <tim.besard@gmail.com> Link: https://github.com/containers/podman/discussions/24572 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> --- tcp.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/tcp.c b/tcp.c index f357920..1eb85bb 100644 --- a/tcp.c +++ b/tcp.c @@ -1763,6 +1763,20 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, continue; seq = ntohl(th->seq); + if (SEQ_LT(seq, conn->seq_from_tap) && len <= 1) { + flow_trace(conn, + "keep-alive sequence: %u, previous: %u", + seq, conn->seq_from_tap); + + tcp_send_flag(c, conn, ACK); + tcp_timer_ctl(c, conn); + + if (p->count == 1) + return 1; + + continue; + } + ack_seq = ntohl(th->ack_seq); if (th->ack) { -- @@ -1763,6 +1763,20 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, continue; seq = ntohl(th->seq); + if (SEQ_LT(seq, conn->seq_from_tap) && len <= 1) { + flow_trace(conn, + "keep-alive sequence: %u, previous: %u", + seq, conn->seq_from_tap); + + tcp_send_flag(c, conn, ACK); + tcp_timer_ctl(c, conn); + + if (p->count == 1) + return 1; + + continue; + } + ack_seq = ntohl(th->ack_seq); if (th->ack) { -- 2.43.0 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] tcp: Acknowledge keep-alive segments, ignore them for the rest 2024-11-19 19:53 ` [PATCH 2/2] tcp: Acknowledge keep-alive segments, ignore them for the rest Stefano Brivio @ 2024-11-20 1:02 ` David Gibson 2024-11-20 6:43 ` Stefano Brivio 0 siblings, 1 reply; 12+ messages in thread From: David Gibson @ 2024-11-20 1:02 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev, Tim Besard [-- Attachment #1: Type: text/plain, Size: 2617 bytes --] On Tue, Nov 19, 2024 at 08:53:44PM +0100, Stefano Brivio wrote: > RFC 9293, 3.8.4 says: > > Implementers MAY include "keep-alives" in their TCP implementations > (MAY-5), although this practice is not universally accepted. Some > TCP implementations, however, have included a keep-alive mechanism. > To confirm that an idle connection is still active, these > implementations send a probe segment designed to elicit a response > from the TCP peer. Such a segment generally contains SEG.SEQ = > SND.NXT-1 and may or may not contain one garbage octet of data. If > keep-alives are included, the application MUST be able to turn them > on or off for each TCP connection (MUST-24), and they MUST default to > off (MUST-25). > > but currently, tcp_data_from_tap() is not aware of this and will > schedule a fast re-transmit on the second keep-alive (because it's > also a duplicate ACK), ignoring the fact that the sequence number was > rewinded to SND.NXT-1. > > ACK these keep-alive segments, reset the activity timeout, and ignore > them for the rest. > > At some point, we could think of implementing an approximation of > keep-alive segments on outbound sockets, for example by setting > TCP_KEEPIDLE to 1, and a large TCP_KEEPINTVL, so that we send a single > keep-alive segment at approximately the same time, and never reset the > connection. That's beyond the scope of this fix, though. > > Reported-by: Tim Besard <tim.besard@gmail.com> > Link: https://github.com/containers/podman/discussions/24572 > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > --- > tcp.c | 14 ++++++++++++++ > 1 file changed, 14 insertions(+) > > diff --git a/tcp.c b/tcp.c > index f357920..1eb85bb 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -1763,6 +1763,20 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > continue; > > seq = ntohl(th->seq); > + if (SEQ_LT(seq, conn->seq_from_tap) && len <= 1) { > + flow_trace(conn, > + "keep-alive sequence: %u, previous: %u", > + seq, conn->seq_from_tap); > + > + tcp_send_flag(c, conn, ACK); > + tcp_timer_ctl(c, conn); > + > + if (p->count == 1) > + return 1; I'm not sure what this test is for. Shouldn't the continue be sufficient? > + > + continue; > + } > + > ack_seq = ntohl(th->ack_seq); > > if (th->ack) { -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] tcp: Acknowledge keep-alive segments, ignore them for the rest 2024-11-20 1:02 ` David Gibson @ 2024-11-20 6:43 ` Stefano Brivio 2024-11-21 2:38 ` David Gibson 0 siblings, 1 reply; 12+ messages in thread From: Stefano Brivio @ 2024-11-20 6:43 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev, Tim Besard On Wed, 20 Nov 2024 12:02:00 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > On Tue, Nov 19, 2024 at 08:53:44PM +0100, Stefano Brivio wrote: > > RFC 9293, 3.8.4 says: > > > > Implementers MAY include "keep-alives" in their TCP implementations > > (MAY-5), although this practice is not universally accepted. Some > > TCP implementations, however, have included a keep-alive mechanism. > > To confirm that an idle connection is still active, these > > implementations send a probe segment designed to elicit a response > > from the TCP peer. Such a segment generally contains SEG.SEQ = > > SND.NXT-1 and may or may not contain one garbage octet of data. If > > keep-alives are included, the application MUST be able to turn them > > on or off for each TCP connection (MUST-24), and they MUST default to > > off (MUST-25). > > > > but currently, tcp_data_from_tap() is not aware of this and will > > schedule a fast re-transmit on the second keep-alive (because it's > > also a duplicate ACK), ignoring the fact that the sequence number was > > rewinded to SND.NXT-1. > > > > ACK these keep-alive segments, reset the activity timeout, and ignore > > them for the rest. > > > > At some point, we could think of implementing an approximation of > > keep-alive segments on outbound sockets, for example by setting > > TCP_KEEPIDLE to 1, and a large TCP_KEEPINTVL, so that we send a single > > keep-alive segment at approximately the same time, and never reset the > > connection. That's beyond the scope of this fix, though. > > > > Reported-by: Tim Besard <tim.besard@gmail.com> > > Link: https://github.com/containers/podman/discussions/24572 > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > --- > > tcp.c | 14 ++++++++++++++ > > 1 file changed, 14 insertions(+) > > > > diff --git a/tcp.c b/tcp.c > > index f357920..1eb85bb 100644 > > --- a/tcp.c > > +++ b/tcp.c > > @@ -1763,6 +1763,20 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > > continue; > > > > seq = ntohl(th->seq); > > + if (SEQ_LT(seq, conn->seq_from_tap) && len <= 1) { > > + flow_trace(conn, > > + "keep-alive sequence: %u, previous: %u", > > + seq, conn->seq_from_tap); > > + > > + tcp_send_flag(c, conn, ACK); > > + tcp_timer_ctl(c, conn); > > + > > + if (p->count == 1) > > + return 1; > > I'm not sure what this test is for. Shouldn't the continue be sufficient? I don't think we want to go through tcp_update_seqack_from_tap(), tcp_tap_window_update() and the like on a keep-alive segment. But if we receive something else in this batch, that's going to be a data segment that happened to arrive just after the keep-alive, so, in that case, we have to do the normal processing, by ignoring just this segment and hitting 'continue'. Strictly speaking, the 'continue' is enough and correct, but I think that returning early in the obviously common case is simpler and more robust. -- Stefano ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] tcp: Acknowledge keep-alive segments, ignore them for the rest 2024-11-20 6:43 ` Stefano Brivio @ 2024-11-21 2:38 ` David Gibson 2024-11-21 4:26 ` Stefano Brivio 0 siblings, 1 reply; 12+ messages in thread From: David Gibson @ 2024-11-21 2:38 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev, Tim Besard [-- Attachment #1: Type: text/plain, Size: 3689 bytes --] On Wed, Nov 20, 2024 at 07:43:44AM +0100, Stefano Brivio wrote: > On Wed, 20 Nov 2024 12:02:00 +1100 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Tue, Nov 19, 2024 at 08:53:44PM +0100, Stefano Brivio wrote: > > > RFC 9293, 3.8.4 says: > > > > > > Implementers MAY include "keep-alives" in their TCP implementations > > > (MAY-5), although this practice is not universally accepted. Some > > > TCP implementations, however, have included a keep-alive mechanism. > > > To confirm that an idle connection is still active, these > > > implementations send a probe segment designed to elicit a response > > > from the TCP peer. Such a segment generally contains SEG.SEQ = > > > SND.NXT-1 and may or may not contain one garbage octet of data. If > > > keep-alives are included, the application MUST be able to turn them > > > on or off for each TCP connection (MUST-24), and they MUST default to > > > off (MUST-25). > > > > > > but currently, tcp_data_from_tap() is not aware of this and will > > > schedule a fast re-transmit on the second keep-alive (because it's > > > also a duplicate ACK), ignoring the fact that the sequence number was > > > rewinded to SND.NXT-1. > > > > > > ACK these keep-alive segments, reset the activity timeout, and ignore > > > them for the rest. > > > > > > At some point, we could think of implementing an approximation of > > > keep-alive segments on outbound sockets, for example by setting > > > TCP_KEEPIDLE to 1, and a large TCP_KEEPINTVL, so that we send a single > > > keep-alive segment at approximately the same time, and never reset the > > > connection. That's beyond the scope of this fix, though. > > > > > > Reported-by: Tim Besard <tim.besard@gmail.com> > > > Link: https://github.com/containers/podman/discussions/24572 > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > > --- > > > tcp.c | 14 ++++++++++++++ > > > 1 file changed, 14 insertions(+) > > > > > > diff --git a/tcp.c b/tcp.c > > > index f357920..1eb85bb 100644 > > > --- a/tcp.c > > > +++ b/tcp.c > > > @@ -1763,6 +1763,20 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > > > continue; > > > > > > seq = ntohl(th->seq); > > > + if (SEQ_LT(seq, conn->seq_from_tap) && len <= 1) { > > > + flow_trace(conn, > > > + "keep-alive sequence: %u, previous: %u", > > > + seq, conn->seq_from_tap); > > > + > > > + tcp_send_flag(c, conn, ACK); > > > + tcp_timer_ctl(c, conn); > > > + > > > + if (p->count == 1) > > > + return 1; > > > > I'm not sure what this test is for. Shouldn't the continue be sufficient? > > I don't think we want to go through tcp_update_seqack_from_tap(), > tcp_tap_window_update() and the like on a keep-alive segment. Ah, I see. But that is an optimisation, right? It shouldn't be necessary for correctness. > But if we receive something else in this batch, that's going to be a > data segment that happened to arrive just after the keep-alive, so, in > that case, we have to do the normal processing, by ignoring just this > segment and hitting 'continue'. > > Strictly speaking, the 'continue' is enough and correct, but I think > that returning early in the obviously common case is simpler and more > robust. Hrm. Doesn't seem simpler to me, but I can see the point of the change so, Reviewed-by: David Gibson <david@gibson.dropbear.id.au> -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] tcp: Acknowledge keep-alive segments, ignore them for the rest 2024-11-21 2:38 ` David Gibson @ 2024-11-21 4:26 ` Stefano Brivio 2024-11-21 4:30 ` Stefano Brivio 2024-11-21 6:21 ` David Gibson 0 siblings, 2 replies; 12+ messages in thread From: Stefano Brivio @ 2024-11-21 4:26 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev, Tim Besard On Thu, 21 Nov 2024 13:38:09 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > On Wed, Nov 20, 2024 at 07:43:44AM +0100, Stefano Brivio wrote: > > On Wed, 20 Nov 2024 12:02:00 +1100 > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > On Tue, Nov 19, 2024 at 08:53:44PM +0100, Stefano Brivio wrote: > > > > RFC 9293, 3.8.4 says: > > > > > > > > Implementers MAY include "keep-alives" in their TCP implementations > > > > (MAY-5), although this practice is not universally accepted. Some > > > > TCP implementations, however, have included a keep-alive mechanism. > > > > To confirm that an idle connection is still active, these > > > > implementations send a probe segment designed to elicit a response > > > > from the TCP peer. Such a segment generally contains SEG.SEQ = > > > > SND.NXT-1 and may or may not contain one garbage octet of data. If > > > > keep-alives are included, the application MUST be able to turn them > > > > on or off for each TCP connection (MUST-24), and they MUST default to > > > > off (MUST-25). > > > > > > > > but currently, tcp_data_from_tap() is not aware of this and will > > > > schedule a fast re-transmit on the second keep-alive (because it's > > > > also a duplicate ACK), ignoring the fact that the sequence number was > > > > rewinded to SND.NXT-1. > > > > > > > > ACK these keep-alive segments, reset the activity timeout, and ignore > > > > them for the rest. > > > > > > > > At some point, we could think of implementing an approximation of > > > > keep-alive segments on outbound sockets, for example by setting > > > > TCP_KEEPIDLE to 1, and a large TCP_KEEPINTVL, so that we send a single > > > > keep-alive segment at approximately the same time, and never reset the > > > > connection. That's beyond the scope of this fix, though. > > > > > > > > Reported-by: Tim Besard <tim.besard@gmail.com> > > > > Link: https://github.com/containers/podman/discussions/24572 > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > > > --- > > > > tcp.c | 14 ++++++++++++++ > > > > 1 file changed, 14 insertions(+) > > > > > > > > diff --git a/tcp.c b/tcp.c > > > > index f357920..1eb85bb 100644 > > > > --- a/tcp.c > > > > +++ b/tcp.c > > > > @@ -1763,6 +1763,20 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > > > > continue; > > > > > > > > seq = ntohl(th->seq); > > > > + if (SEQ_LT(seq, conn->seq_from_tap) && len <= 1) { > > > > + flow_trace(conn, > > > > + "keep-alive sequence: %u, previous: %u", > > > > + seq, conn->seq_from_tap); > > > > + > > > > + tcp_send_flag(c, conn, ACK); > > > > + tcp_timer_ctl(c, conn); > > > > + > > > > + if (p->count == 1) > > > > + return 1; > > > > > > I'm not sure what this test is for. Shouldn't the continue be sufficient? > > > > I don't think we want to go through tcp_update_seqack_from_tap(), > > tcp_tap_window_update() and the like on a keep-alive segment. > > Ah, I see. But that is an optimisation, right? It shouldn't be > necessary for correctness. *Shouldn't*. > > But if we receive something else in this batch, that's going to be a > > data segment that happened to arrive just after the keep-alive, so, in > > that case, we have to do the normal processing, by ignoring just this > > segment and hitting 'continue'. > > > > Strictly speaking, the 'continue' is enough and correct, but I think > > that returning early in the obviously common case is simpler and more > > robust. > > Hrm. Doesn't seem simpler to me, but I can see the point of the > change so, The code itself is two lines longer, of course, with an additional early return. Considering all the possible side effects of looking at window values from a keep-alive segment looks to me more complicated than the alternative, though. > Reviewed-by: David Gibson <david@gibson.dropbear.id.au> -- Stefano ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] tcp: Acknowledge keep-alive segments, ignore them for the rest 2024-11-21 4:26 ` Stefano Brivio @ 2024-11-21 4:30 ` Stefano Brivio 2024-11-21 6:21 ` David Gibson 1 sibling, 0 replies; 12+ messages in thread From: Stefano Brivio @ 2024-11-21 4:30 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev, Tim Besard On Thu, 21 Nov 2024 05:26:17 +0100 Stefano Brivio <sbrivio@redhat.com> wrote: > On Thu, 21 Nov 2024 13:38:09 +1100 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Wed, Nov 20, 2024 at 07:43:44AM +0100, Stefano Brivio wrote: > > > On Wed, 20 Nov 2024 12:02:00 +1100 > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > On Tue, Nov 19, 2024 at 08:53:44PM +0100, Stefano Brivio wrote: > > > > > RFC 9293, 3.8.4 says: > > > > > > > > > > Implementers MAY include "keep-alives" in their TCP implementations > > > > > (MAY-5), although this practice is not universally accepted. Some > > > > > TCP implementations, however, have included a keep-alive mechanism. > > > > > To confirm that an idle connection is still active, these > > > > > implementations send a probe segment designed to elicit a response > > > > > from the TCP peer. Such a segment generally contains SEG.SEQ = > > > > > SND.NXT-1 and may or may not contain one garbage octet of data. If > > > > > keep-alives are included, the application MUST be able to turn them > > > > > on or off for each TCP connection (MUST-24), and they MUST default to > > > > > off (MUST-25). > > > > > > > > > > but currently, tcp_data_from_tap() is not aware of this and will > > > > > schedule a fast re-transmit on the second keep-alive (because it's > > > > > also a duplicate ACK), ignoring the fact that the sequence number was > > > > > rewinded to SND.NXT-1. > > > > > > > > > > ACK these keep-alive segments, reset the activity timeout, and ignore > > > > > them for the rest. > > > > > > > > > > At some point, we could think of implementing an approximation of > > > > > keep-alive segments on outbound sockets, for example by setting > > > > > TCP_KEEPIDLE to 1, and a large TCP_KEEPINTVL, so that we send a single > > > > > keep-alive segment at approximately the same time, and never reset the > > > > > connection. That's beyond the scope of this fix, though. > > > > > > > > > > Reported-by: Tim Besard <tim.besard@gmail.com> > > > > > Link: https://github.com/containers/podman/discussions/24572 > > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > > > > --- > > > > > tcp.c | 14 ++++++++++++++ > > > > > 1 file changed, 14 insertions(+) > > > > > > > > > > diff --git a/tcp.c b/tcp.c > > > > > index f357920..1eb85bb 100644 > > > > > --- a/tcp.c > > > > > +++ b/tcp.c > > > > > @@ -1763,6 +1763,20 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > > > > > continue; > > > > > > > > > > seq = ntohl(th->seq); > > > > > + if (SEQ_LT(seq, conn->seq_from_tap) && len <= 1) { > > > > > + flow_trace(conn, > > > > > + "keep-alive sequence: %u, previous: %u", > > > > > + seq, conn->seq_from_tap); > > > > > + > > > > > + tcp_send_flag(c, conn, ACK); > > > > > + tcp_timer_ctl(c, conn); > > > > > + > > > > > + if (p->count == 1) > > > > > + return 1; > > > > > > > > I'm not sure what this test is for. Shouldn't the continue be sufficient? > > > > > > I don't think we want to go through tcp_update_seqack_from_tap(), > > > tcp_tap_window_update() and the like on a keep-alive segment. > > > > Ah, I see. But that is an optimisation, right? It shouldn't be > > necessary for correctness. > > *Shouldn't*. > > > > But if we receive something else in this batch, that's going to be a > > > data segment that happened to arrive just after the keep-alive, so, in > > > that case, we have to do the normal processing, by ignoring just this > > > segment and hitting 'continue'. > > > > > > Strictly speaking, the 'continue' is enough and correct, but I think > > > that returning early in the obviously common case is simpler and more > > > robust. > > > > Hrm. Doesn't seem simpler to me, but I can see the point of the > > change so, > > The code itself is two lines longer, of course, with an additional > early return. Considering all the possible side effects of looking at > window values from a keep-alive segment looks to me more complicated ^ lonely, I should say > than the alternative, though. > > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au> -- Stefano ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] tcp: Acknowledge keep-alive segments, ignore them for the rest 2024-11-21 4:26 ` Stefano Brivio 2024-11-21 4:30 ` Stefano Brivio @ 2024-11-21 6:21 ` David Gibson 2024-11-21 9:23 ` Stefano Brivio 1 sibling, 1 reply; 12+ messages in thread From: David Gibson @ 2024-11-21 6:21 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev, Tim Besard [-- Attachment #1: Type: text/plain, Size: 4616 bytes --] On Thu, Nov 21, 2024 at 05:26:17AM +0100, Stefano Brivio wrote: > On Thu, 21 Nov 2024 13:38:09 +1100 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Wed, Nov 20, 2024 at 07:43:44AM +0100, Stefano Brivio wrote: > > > On Wed, 20 Nov 2024 12:02:00 +1100 > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > On Tue, Nov 19, 2024 at 08:53:44PM +0100, Stefano Brivio wrote: > > > > > RFC 9293, 3.8.4 says: > > > > > > > > > > Implementers MAY include "keep-alives" in their TCP implementations > > > > > (MAY-5), although this practice is not universally accepted. Some > > > > > TCP implementations, however, have included a keep-alive mechanism. > > > > > To confirm that an idle connection is still active, these > > > > > implementations send a probe segment designed to elicit a response > > > > > from the TCP peer. Such a segment generally contains SEG.SEQ = > > > > > SND.NXT-1 and may or may not contain one garbage octet of data. If > > > > > keep-alives are included, the application MUST be able to turn them > > > > > on or off for each TCP connection (MUST-24), and they MUST default to > > > > > off (MUST-25). > > > > > > > > > > but currently, tcp_data_from_tap() is not aware of this and will > > > > > schedule a fast re-transmit on the second keep-alive (because it's > > > > > also a duplicate ACK), ignoring the fact that the sequence number was > > > > > rewinded to SND.NXT-1. > > > > > > > > > > ACK these keep-alive segments, reset the activity timeout, and ignore > > > > > them for the rest. > > > > > > > > > > At some point, we could think of implementing an approximation of > > > > > keep-alive segments on outbound sockets, for example by setting > > > > > TCP_KEEPIDLE to 1, and a large TCP_KEEPINTVL, so that we send a single > > > > > keep-alive segment at approximately the same time, and never reset the > > > > > connection. That's beyond the scope of this fix, though. > > > > > > > > > > Reported-by: Tim Besard <tim.besard@gmail.com> > > > > > Link: https://github.com/containers/podman/discussions/24572 > > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > > > > --- > > > > > tcp.c | 14 ++++++++++++++ > > > > > 1 file changed, 14 insertions(+) > > > > > > > > > > diff --git a/tcp.c b/tcp.c > > > > > index f357920..1eb85bb 100644 > > > > > --- a/tcp.c > > > > > +++ b/tcp.c > > > > > @@ -1763,6 +1763,20 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > > > > > continue; > > > > > > > > > > seq = ntohl(th->seq); > > > > > + if (SEQ_LT(seq, conn->seq_from_tap) && len <= 1) { > > > > > + flow_trace(conn, > > > > > + "keep-alive sequence: %u, previous: %u", > > > > > + seq, conn->seq_from_tap); > > > > > + > > > > > + tcp_send_flag(c, conn, ACK); > > > > > + tcp_timer_ctl(c, conn); > > > > > + > > > > > + if (p->count == 1) > > > > > + return 1; > > > > > > > > I'm not sure what this test is for. Shouldn't the continue be sufficient? > > > > > > I don't think we want to go through tcp_update_seqack_from_tap(), > > > tcp_tap_window_update() and the like on a keep-alive segment. > > > > Ah, I see. But that is an optimisation, right? It shouldn't be > > necessary for correctness. > > *Shouldn't*. > > > > But if we receive something else in this batch, that's going to be a > > > data segment that happened to arrive just after the keep-alive, so, in > > > that case, we have to do the normal processing, by ignoring just this > > > segment and hitting 'continue'. > > > > > > Strictly speaking, the 'continue' is enough and correct, but I think > > > that returning early in the obviously common case is simpler and more > > > robust. > > > > Hrm. Doesn't seem simpler to me, but I can see the point of the > > change so, > > The code itself is two lines longer, of course, with an additional > early return. Considering all the possible side effects of looking at > window values from a keep-alive segment looks to me more complicated > than the alternative, though. Except that we *will* consider them if there happen to be other data packets in the batch. That seems like it will just make any problems from processing the keepalive sequence values harder to track down, not make them go away. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] tcp: Acknowledge keep-alive segments, ignore them for the rest 2024-11-21 6:21 ` David Gibson @ 2024-11-21 9:23 ` Stefano Brivio 2024-11-21 9:32 ` David Gibson 0 siblings, 1 reply; 12+ messages in thread From: Stefano Brivio @ 2024-11-21 9:23 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev, Tim Besard On Thu, 21 Nov 2024 17:21:12 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > On Thu, Nov 21, 2024 at 05:26:17AM +0100, Stefano Brivio wrote: > > On Thu, 21 Nov 2024 13:38:09 +1100 > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > On Wed, Nov 20, 2024 at 07:43:44AM +0100, Stefano Brivio wrote: > > > > On Wed, 20 Nov 2024 12:02:00 +1100 > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > On Tue, Nov 19, 2024 at 08:53:44PM +0100, Stefano Brivio wrote: > > > > > > RFC 9293, 3.8.4 says: > > > > > > > > > > > > Implementers MAY include "keep-alives" in their TCP implementations > > > > > > (MAY-5), although this practice is not universally accepted. Some > > > > > > TCP implementations, however, have included a keep-alive mechanism. > > > > > > To confirm that an idle connection is still active, these > > > > > > implementations send a probe segment designed to elicit a response > > > > > > from the TCP peer. Such a segment generally contains SEG.SEQ = > > > > > > SND.NXT-1 and may or may not contain one garbage octet of data. If > > > > > > keep-alives are included, the application MUST be able to turn them > > > > > > on or off for each TCP connection (MUST-24), and they MUST default to > > > > > > off (MUST-25). > > > > > > > > > > > > but currently, tcp_data_from_tap() is not aware of this and will > > > > > > schedule a fast re-transmit on the second keep-alive (because it's > > > > > > also a duplicate ACK), ignoring the fact that the sequence number was > > > > > > rewinded to SND.NXT-1. > > > > > > > > > > > > ACK these keep-alive segments, reset the activity timeout, and ignore > > > > > > them for the rest. > > > > > > > > > > > > At some point, we could think of implementing an approximation of > > > > > > keep-alive segments on outbound sockets, for example by setting > > > > > > TCP_KEEPIDLE to 1, and a large TCP_KEEPINTVL, so that we send a single > > > > > > keep-alive segment at approximately the same time, and never reset the > > > > > > connection. That's beyond the scope of this fix, though. > > > > > > > > > > > > Reported-by: Tim Besard <tim.besard@gmail.com> > > > > > > Link: https://github.com/containers/podman/discussions/24572 > > > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > > > > > --- > > > > > > tcp.c | 14 ++++++++++++++ > > > > > > 1 file changed, 14 insertions(+) > > > > > > > > > > > > diff --git a/tcp.c b/tcp.c > > > > > > index f357920..1eb85bb 100644 > > > > > > --- a/tcp.c > > > > > > +++ b/tcp.c > > > > > > @@ -1763,6 +1763,20 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > > > > > > continue; > > > > > > > > > > > > seq = ntohl(th->seq); > > > > > > + if (SEQ_LT(seq, conn->seq_from_tap) && len <= 1) { > > > > > > + flow_trace(conn, > > > > > > + "keep-alive sequence: %u, previous: %u", > > > > > > + seq, conn->seq_from_tap); > > > > > > + > > > > > > + tcp_send_flag(c, conn, ACK); > > > > > > + tcp_timer_ctl(c, conn); > > > > > > + > > > > > > + if (p->count == 1) > > > > > > + return 1; > > > > > > > > > > I'm not sure what this test is for. Shouldn't the continue be sufficient? > > > > > > > > I don't think we want to go through tcp_update_seqack_from_tap(), > > > > tcp_tap_window_update() and the like on a keep-alive segment. > > > > > > Ah, I see. But that is an optimisation, right? It shouldn't be > > > necessary for correctness. > > > > *Shouldn't*. > > > > > > But if we receive something else in this batch, that's going to be a > > > > data segment that happened to arrive just after the keep-alive, so, in > > > > that case, we have to do the normal processing, by ignoring just this > > > > segment and hitting 'continue'. > > > > > > > > Strictly speaking, the 'continue' is enough and correct, but I think > > > > that returning early in the obviously common case is simpler and more > > > > robust. > > > > > > Hrm. Doesn't seem simpler to me, but I can see the point of the > > > change so, > > > > The code itself is two lines longer, of course, with an additional > > early return. Considering all the possible side effects of looking at > > window values from a keep-alive segment looks to me more complicated > > than the alternative, though. > > Except that we *will* consider them if there happen to be other data > packets in the batch. Eh, yes, we have to: > > > > But if we receive something else in this batch, that's going to be a > > > > data segment that happened to arrive just after the keep-alive, so, in > > > > that case, we have to do the normal processing, by ignoring just this > > > > segment and hitting 'continue'. but we'll use _those_ window values (because we 'continue' here). > That seems like it will just make any problems > from processing the keepalive sequence values harder to track down, > not make them go away. We tested the common case (perhaps we'll never get anything else) and my priority would be to make _that_ robust, because it's what matters to users. If we find the time to write a small keep-alive sending program, then I would feel more confident to drop that additional condition. -- Stefano ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] tcp: Acknowledge keep-alive segments, ignore them for the rest 2024-11-21 9:23 ` Stefano Brivio @ 2024-11-21 9:32 ` David Gibson 0 siblings, 0 replies; 12+ messages in thread From: David Gibson @ 2024-11-21 9:32 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev, Tim Besard [-- Attachment #1: Type: text/plain, Size: 5952 bytes --] On Thu, Nov 21, 2024 at 10:23:12AM +0100, Stefano Brivio wrote: > On Thu, 21 Nov 2024 17:21:12 +1100 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Thu, Nov 21, 2024 at 05:26:17AM +0100, Stefano Brivio wrote: > > > On Thu, 21 Nov 2024 13:38:09 +1100 > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > On Wed, Nov 20, 2024 at 07:43:44AM +0100, Stefano Brivio wrote: > > > > > On Wed, 20 Nov 2024 12:02:00 +1100 > > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > > > On Tue, Nov 19, 2024 at 08:53:44PM +0100, Stefano Brivio wrote: > > > > > > > RFC 9293, 3.8.4 says: > > > > > > > > > > > > > > Implementers MAY include "keep-alives" in their TCP implementations > > > > > > > (MAY-5), although this practice is not universally accepted. Some > > > > > > > TCP implementations, however, have included a keep-alive mechanism. > > > > > > > To confirm that an idle connection is still active, these > > > > > > > implementations send a probe segment designed to elicit a response > > > > > > > from the TCP peer. Such a segment generally contains SEG.SEQ = > > > > > > > SND.NXT-1 and may or may not contain one garbage octet of data. If > > > > > > > keep-alives are included, the application MUST be able to turn them > > > > > > > on or off for each TCP connection (MUST-24), and they MUST default to > > > > > > > off (MUST-25). > > > > > > > > > > > > > > but currently, tcp_data_from_tap() is not aware of this and will > > > > > > > schedule a fast re-transmit on the second keep-alive (because it's > > > > > > > also a duplicate ACK), ignoring the fact that the sequence number was > > > > > > > rewinded to SND.NXT-1. > > > > > > > > > > > > > > ACK these keep-alive segments, reset the activity timeout, and ignore > > > > > > > them for the rest. > > > > > > > > > > > > > > At some point, we could think of implementing an approximation of > > > > > > > keep-alive segments on outbound sockets, for example by setting > > > > > > > TCP_KEEPIDLE to 1, and a large TCP_KEEPINTVL, so that we send a single > > > > > > > keep-alive segment at approximately the same time, and never reset the > > > > > > > connection. That's beyond the scope of this fix, though. > > > > > > > > > > > > > > Reported-by: Tim Besard <tim.besard@gmail.com> > > > > > > > Link: https://github.com/containers/podman/discussions/24572 > > > > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > > > > > > --- > > > > > > > tcp.c | 14 ++++++++++++++ > > > > > > > 1 file changed, 14 insertions(+) > > > > > > > > > > > > > > diff --git a/tcp.c b/tcp.c > > > > > > > index f357920..1eb85bb 100644 > > > > > > > --- a/tcp.c > > > > > > > +++ b/tcp.c > > > > > > > @@ -1763,6 +1763,20 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > > > > > > > continue; > > > > > > > > > > > > > > seq = ntohl(th->seq); > > > > > > > + if (SEQ_LT(seq, conn->seq_from_tap) && len <= 1) { > > > > > > > + flow_trace(conn, > > > > > > > + "keep-alive sequence: %u, previous: %u", > > > > > > > + seq, conn->seq_from_tap); > > > > > > > + > > > > > > > + tcp_send_flag(c, conn, ACK); > > > > > > > + tcp_timer_ctl(c, conn); > > > > > > > + > > > > > > > + if (p->count == 1) > > > > > > > + return 1; > > > > > > > > > > > > I'm not sure what this test is for. Shouldn't the continue be sufficient? > > > > > > > > > > I don't think we want to go through tcp_update_seqack_from_tap(), > > > > > tcp_tap_window_update() and the like on a keep-alive segment. > > > > > > > > Ah, I see. But that is an optimisation, right? It shouldn't be > > > > necessary for correctness. > > > > > > *Shouldn't*. > > > > > > > > But if we receive something else in this batch, that's going to be a > > > > > data segment that happened to arrive just after the keep-alive, so, in > > > > > that case, we have to do the normal processing, by ignoring just this > > > > > segment and hitting 'continue'. > > > > > > > > > > Strictly speaking, the 'continue' is enough and correct, but I think > > > > > that returning early in the obviously common case is simpler and more > > > > > robust. > > > > > > > > Hrm. Doesn't seem simpler to me, but I can see the point of the > > > > change so, > > > > > > The code itself is two lines longer, of course, with an additional > > > early return. Considering all the possible side effects of looking at > > > window values from a keep-alive segment looks to me more complicated > > > than the alternative, though. > > > > Except that we *will* consider them if there happen to be other data > > packets in the batch. > > Eh, yes, we have to: > > > > > > But if we receive something else in this batch, that's going to be a > > > > > data segment that happened to arrive just after the keep-alive, so, in > > > > > that case, we have to do the normal processing, by ignoring just this > > > > > segment and hitting 'continue'. > > but we'll use _those_ window values (because we 'continue' here). > > > That seems like it will just make any problems > > from processing the keepalive sequence values harder to track down, > > not make them go away. > > We tested the common case (perhaps we'll never get anything else) and > my priority would be to make _that_ robust, because it's what matters > to users. If we find the time to write a small keep-alive sending > program, then I would feel more confident to drop that additional > condition. Eh, fair enough. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2024-11-21 9:32 UTC | newest] Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-11-19 19:53 [PATCH 0/2] tcp: Handle keep-alives, avoid unnecessary timer scheduling Stefano Brivio 2024-11-19 19:53 ` [PATCH 1/2] tcp: Reset ACK_TO_TAP_DUE flag whenever an ACK isn't needed anymore Stefano Brivio 2024-11-20 0:58 ` David Gibson 2024-11-19 19:53 ` [PATCH 2/2] tcp: Acknowledge keep-alive segments, ignore them for the rest Stefano Brivio 2024-11-20 1:02 ` David Gibson 2024-11-20 6:43 ` Stefano Brivio 2024-11-21 2:38 ` David Gibson 2024-11-21 4:26 ` Stefano Brivio 2024-11-21 4:30 ` Stefano Brivio 2024-11-21 6:21 ` David Gibson 2024-11-21 9:23 ` Stefano Brivio 2024-11-21 9:32 ` David Gibson
Code repositories for project(s) associated with this public inbox https://passt.top/passt This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for IMAP folder(s).