Date: Fri, 17 Jan 2025 16:55:55 +0100
From: Stefano Brivio
To: Jon Maloy
Subject: Re: [net, v2.1] tcp: correct handling of extreme memory squeeze
Message-ID: <20250117165555.3e3fe848@elisabeth>
In-Reply-To: <20250116215711.2278134-1-jmaloy@redhat.com>
References: <20250116215711.2278134-1-jmaloy@redhat.com>
Organization: Red Hat
CC: passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com
List-Id: Development discussion and patches for passt

On Thu, 16 Jan 2025 16:57:11 -0500
Jon Maloy wrote:

> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way tcp handles window advertising in extreme memory
> squeeze situations. The problem occurs on the server side, and
> the socket in question is a completely regular socket with the
> default settings for the fedora40 kernel. We do not use SO_PEEK
> or SO_RCVBUF on this socket.
>
> A brief summary: Under memory pressure, a socket endpoint may
> temporarily advertise a zero-sized window, but this is not stored
> as part of the socket data. The reasoning behind this is that it is
> considered a temporary setting which shouldn't influence any further
> calculations. However, if we happen to stall at an unfortunate value
> of the current window size, the algorithm selecting a new value will
> consistently fail to advertise a non-zero window once we have freed
> up enough memory. This means that this side's notion of the current
> window size is different from the one last advertised to the peer,
> causing the latter to not send any data to resolve the situation.

That's a looong paragraph now.

> The following excerpt of a logging session, with own comments added,
> shows more in detail what is happening:
>
> // tcp_v4_rcv(->)
> // tcp_rcv_established(->)
> [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> [5201<->39222]: tcp_data_queue(->)
> [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
> [5201<->39222]: tcp_data_queue(<-) OFO queue: gap: 65480, len: 0
> [5201<->39222]: __tcp_transmit_skb(->)
> [5201<->39222]: tcp_select_window(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
> [5201<->39222]: tcp_select_window(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160, returning 0
> [5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160
> [5201<->39222]: __tcp_transmit_skb(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_rcv_established(<-)
> [5201<->39222]: tcp_v4_rcv(<-)
>
> // Receive queue is at 85 buffers and we are out of memory.
> // We drop the incoming buffer, although it is in sequence, and decide
> // to send an advertisement with a window of zero.
> // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
> // we unconditionally shrink the window.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: [new_win = 0, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]: NOT calling tcp_send_ack()
> [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 6104 bytes.
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
>
> // After each read, the algorithm for calculating the new receive
> // window in __tcp_cleanup_rbuf() finds it is too small to advertise
> // or to update tp->rcv_wnd.
> // Meanwhile, the peer thinks the window is zero, and will not send
> // any more data to trigger an update from the interrupt mode side.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]: NOT calling tcp_send_ack()
> [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
>
> // The above pattern repeats again and again, since nothing changes
> // between the reads.
>
> [...]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]: NOT calling tcp_send_ack()
> [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 265469200->265545488 (76288), unread 54672, qlen 1, ofoq 0]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]: NOT calling tcp_send_ack()
> [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 54672 bytes.
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
>
> // The receive queue is empty, but no new advertisement is sent.
> // The peer still thinks the receive window is zero, and sends nothing.
> // We have ended up in a deadlock situation.
>
> Furthermore, we have observed that in these situations this side may
> send out an updated 'th->ack_seq' which is not stored in tp->rcv_wup
> as it should be. Backing ack_seq seems to be harmless, but is of
> course still wrong from a protocol viewpoint.
>
> We fix this by setting tp->rcv_wnd and tp->rcv_wup even when a packet
> has been dropped because of memory exhaustion and we have to advertise
> a zero window.
>
> Further testing shows that the connection recovers neatly from the
> squeeze situation, and traffic can continue indefinitely.
>
> Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory")
> Signed-off-by: Jon Maloy
>
> ---

By the way, for what it's worth:

Reviewed-by: Stefano Brivio

But you should drop the extra newline between Signed-off-by: and ---

-- 
Stefano
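
For readers following the log, the check it traces in __tcp_cleanup_rbuf() can be modelled in userspace roughly as below. This is only a sketch under stated assumptions, not the kernel code: struct rcv_state, receive_window(), time_to_ack() and check_log_values() are hypothetical names, the window formula rcv_wup + rcv_wnd - rcv_nxt is assumed because it reproduces the logged win_now of 131184, and the "after the fix" values (rcv_wnd = 0, rcv_wup = rcv_nxt) are an assumption about what recording the advertised zero window would look like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the three tcp_sock fields shown in the log */
struct rcv_state {
	uint32_t rcv_wup;	/* rcv_nxt at the time of the last window update */
	uint32_t rcv_wnd;	/* window this side believes it last advertised */
	uint32_t rcv_nxt;	/* next sequence number expected */
};

/* Window this side believes is still open to the peer (assumed formula) */
static uint32_t receive_window(const struct rcv_state *s)
{
	int32_t win = (int32_t)(s->rcv_wup + s->rcv_wnd - s->rcv_nxt);

	return win < 0 ? 0 : (uint32_t)win;
}

/* The logged test: ACK only if new_win is non-zero and >= 2 * win_now */
static bool time_to_ack(const struct rcv_state *s, uint32_t new_win)
{
	uint32_t win_now = receive_window(s);

	return new_win && new_win >= 2 * win_now;
}

/* Usage example with the values taken from the log excerpt */
void check_log_values(void)
{
	struct rcv_state s = {
		.rcv_wup = 265469200u,
		.rcv_wnd = 262144u,
		.rcv_nxt = 265600160u,
	};

	/* Stale state: win_now is 131184, so 262144 < 262368 and no ACK */
	assert(receive_window(&s) == 131184u);
	assert(!time_to_ack(&s, 262144u));

	/* Assumed post-fix state: the zero window was recorded */
	s.rcv_wnd = 0;
	s.rcv_wup = s.rcv_nxt;
	assert(receive_window(&s) == 0u);
	assert(time_to_ack(&s, 262144u));
}
```

With the stale values, new_win (262144) falls 224 bytes short of 2 * win_now (262368) on every read, which is why the log shows time_to_ack = 0 even once the queue is empty.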