From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from gandalf.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 1AFA75A026F for ; Thu, 28 Sep 2023 03:49:15 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=201602; t=1695865749; bh=+RaB714dZPJUxjd3jp8BxtU1FRWxBiVvE2zOrd8uysM=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=LI3rcx0ey8QpyZC4mnKGcW1fHCGQ+6qqOH4UuXdaMCJmztTk5mq9xFmUcAiebJHGg qC/Ow4ufOP32AR424Fvnhh+4989AzLF4K2DC1W0sHcHkpu9UHjxhzFmMQCZ4WunwDV Vq4zWY3fNhJNS30m8SJQ0QuoDKOIvQKN/SJbnmg0= Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4RwxGs33W9z4xQZ; Thu, 28 Sep 2023 11:49:09 +1000 (AEST) Date: Thu, 28 Sep 2023 11:48:38 +1000 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH RFT 2/5] tcp: Reset STALLED flag on ACK only, check for pending socket data Message-ID: References: <20230922220610.58767-1-sbrivio@redhat.com> <20230922220610.58767-3-sbrivio@redhat.com> <20230927190533.2fc53bbf@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="kEmeViUPdBNq+w0z" Content-Disposition: inline In-Reply-To: <20230927190533.2fc53bbf@elisabeth> Message-ID-Hash: N4JIP4ZDJT33VWTM3SPVJYN5TOE7S2UZ X-Message-ID-Hash: N4JIP4ZDJT33VWTM3SPVJYN5TOE7S2UZ X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: Matej Hrica , passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --kEmeViUPdBNq+w0z Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Sep 27, 2023 at 07:05:33PM +0200, Stefano Brivio wrote: > On Mon, 25 Sep 2023 13:07:24 +1000 > David Gibson wrote: >=20 > > I think the change itself here is sound, but I have some nits to pick > > with the description and reasoning. > >=20 > > On Sat, Sep 23, 2023 at 12:06:07AM +0200, Stefano Brivio wrote: > > > In tcp_tap_handler(), we shouldn't reset the STALLED flag (indicating > > > that we ran out of tap-side window space, or that all available > > > socket data is already in flight -- better names welcome! =20 > >=20 > > Hmm.. when you put it like that it makes me wonder if those two quite > > different conditions really need the same handling. Hrm.. I guess > > both conditions mean that we can't accept data from the socket, even > > if it's availble. >=20 > Right. I mean, we can also call them differently... or maybe pick a > name that reflects the outcome/handling instead of what happened. Sure, if we could think of one. Except, on second thoughts, I'm not sure my characterization is correct. If the tap side window is full then, indeed, we can't accept data from the socket. However if everything we have is in flight that doesn't mean we couldn't put more data into flight if it arrived. That consideration, together with the way we use MSG_PEEK possibly means that we fundamentally need to use edge-triggered interrupts - with the additional trickiness that entails - to avoid busy polling. Although even that only works if we get a new edge interrupt when data is added to a buffer that's been PEEKed but not TRUNCed. If that's not true the MSG_PEEK approach might be doomed :(. > > > ) on any > > > event: do that only if the first packet in a batch has the ACK flag > > > set. =20 > >=20 > > "First packet in a batch" may not be accurate here - we're looking at > > whichever packet we were up to before calling data_from_tap(). There > > could have been earlier packets in the receive batch that were already > > processed. >=20 > Well, it depends on what we call "batch" -- here I meant the pool of > packets (that are passed as a batch to tcp_tap_handler()). Yes, "pool" > would be more accurate. Uh.. I don't think that actually helps. Remember pools aren't queues. The point here is that is that the packet we're considering is not the first of the batch/pool/whatever, but the first of what's left. > > This also raises the question of why the first data packet should be > > particularly privileged here. >=20 > No reason other than convenience, and yes, it can be subtly wrong. >=20 > > I'm wondering if what we really want to > > check is whether data_from_tap() advanced the ack pointer at all. >=20 > Right, we probably should do that instead. Ok. > > I'm not clear on when the th->ack check would ever fail in practice: > > aren't the only normal packets in a TCP connection without ACK the > > initial SYN or an RST? We've handled the SYN case earlier, so should > > we just have a blanket case above this that if we get a packet with > > !ACK, we reset the connection? >=20 > One thing that's legitimate (rarely seen, but I've seen it, I don't > remember if the Linux kernel ever does that) is a segment without ACK, > and without data, that just updates the window (for example after a > zero window). >=20 > If the sequence received/processed so far doesn't correspond to the > latest sequence sent, omitting the ACK flag is useful so that the > window update is not taken as duplicate ACK (that would trigger > retransmission). Ah, ok, I wasn't aware of that case. > > > Make sure we check for pending socket data when we reset it: > > > reverting back to level-triggered epoll events, as tcp_epoll_ctl() > > > does, isn't guaranteed to actually trigger a socket event. =20 > >=20 > > Which sure seems like a kernel bug. Some weird edge conditions for > > edge-triggered seems expected, but this doesn't seem like valid > > level-triggered semantics. >=20 > Hmm, yes, and by doing a quick isolated test actually this seems to work > as intended in the kernel. I should drop this and try again. >=20 > > Hmmm... is toggling EPOLLET even what we want. IIUC, the heart of > > what's going on here is that we can't take more data from the socket > > until something happens on the tap side (either the window expands, or > > it acks some data). In which case should we be toggling EPOLLIN on > > the socket instead? That seems more explicitly to be saying to the > > socket side "we don't currently care if you have data available". >=20 > The reason was to act on EPOLLRDHUP at the same time. But well, we > could just mask EPOLLIN and EPOLLRDHUP, then -- I guess that would make > more sense. So specifically to mask EPOLLRDHUP as well? On the grounds that if we're still chewing on what we got already we don't yet care that we've reached the end, yes? So yes, explicitly masking both those makes more sense to me.. except that as above, I suspect we can't have level-triggered + MSG_PEEK + no busy polling all at once. > For the moment being, we should probably try to see what happens > if this patch simply moves conn_flag(c, conn, ~STALLED); to where 3/5 > needs it (or directly incorporate that into 3/5). Ok. --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --kEmeViUPdBNq+w0z Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmUU228ACgkQzQJF27ox 2GdjLg/7BH6Zfkx+xfdyCEWQuPsQaYJjg/ffSbwXKLGze0pu1x9befWj3jDaBi1k voj6iXer7bTw+0v5wm6mNGyMk+L1JzZitYGQlj2K7MHUojpXFQp62CAXZ/vzJCsJ PQiAIlJtI11X4dvPNGbkYHo+Q3/9vM50Ouax4a45PUKpWaISGkAT+OcNjW5mZau2 vBdERi757kXf8VyD628XFWDzpvntcczSRr8lcihTAfs4+I/RdNmAB8PDtgxmLrr1 tB2Q7atgkVw4CUzMP481YhuWWRy7M8lyJFZ7w2HemmBnGbl+JOsi2C30zmnzkSgf HR/YDDGLgvChgXRgHrIl5UhFO2AKp9HWXrtC21T+P+viGT+nRkb+Vu1G6BUBeoJf WivVRU8y4Kan/7URCQUgkLmEPHum+1VV909IZifFBFmAByXIs21lUES/FNd4dJ+G we455xY66WZfvOlzYbgFA2WiBy3GPnZCaXMENDvnFs/zpJ9LpVMNEcz1FOC0AD76 JftbbcOb9h3K2ocB9hmELPuGVbS+SdHUPqICo3YfuRJmzvdPHCpmgXLlXukdLjKU P5IKFn3wtvUDfBQWCYEgf9RmfYxXp7ELlyrMBg4yc5FareqX+ndWQXz017TkeMtl 6TAI4AjE8SMTyl8cJRrJOHyGvsMfG07lJMrmHIw1pGbU/3/rZok= =fH64 -----END PGP SIGNATURE----- --kEmeViUPdBNq+w0z--