My apologies for putting some marketing at the top of this email: reminder, early-bird ticketing is open for 2 more days; the sooner you register, the better for our planning... And now to tell you about one of my favorite talks this time around...
ML/AI continues to be an inspiration for new Linux kernel infrastructure and a theme at netdev conf 0x17!
There are a couple of patterns in ML traffic that inspire this work from Mina Almasry, Willem de Bruijn, Eric Dumazet and Kaiyuan Zhang: 1) there is _a lot of freaking data_ being transferred across the network, and the volume continues to increase dramatically (for example, NICs with 800 Gbps ports are starting to appear to accommodate such large transfers), and 2) the vast majority of this data typically flows from one device (e.g. storage) to another device (e.g. a GPU).
For context, assume a scenario where data transfers from SSDs on machine A cross the network to GPUs on machine B. On machine B: in the rx direction, incoming data arrives at the NIC, goes to host memory, and then from host memory to the GPU; in the tx direction, outgoing data goes from the GPU to host memory and then out to the NIC. This data movement approach, with multiple round trips across the memory and PCIe buses, is no longer tenable at such data volumes; current host hardware is simply not equipped to deal with it. While such bulk transfers amount to low packets/sec rates, you are simply not going to achieve wire speed because of the memory and PCIe bandwidth abuse from the multiple round trips (see talk [1], which delves into host limitations).
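To make those round trips concrete, here is a minimal sketch (my own illustration, not from the talk) of the conventional rx path on machine B: the payload is received into a host bounce buffer and then copied a second time over PCIe into GPU memory with cudaMemcpy(). The function name, chunk size and error handling are arbitrary choices for the example.

/* Conventional rx path: NIC -> host bounce buffer -> GPU.
 * Illustrative sketch only; sizes and error handling are simplified.
 */
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <cuda_runtime.h>

#define CHUNK (1 << 20)                 /* 1 MiB bounce buffer (arbitrary) */

static int rx_to_gpu(int sock, void *gpu_dst, size_t len)
{
        static char bounce[CHUNK];      /* host memory: first stop for payload */
        size_t done = 0;

        while (done < len) {
                size_t want = (len - done < CHUNK) ? len - done : CHUNK;
                ssize_t n = recv(sock, bounce, want, 0);  /* trip 1: NIC -> host */

                if (n <= 0)
                        return -1;
                /* trip 2 across the memory/PCIe buses: host -> GPU */
                if (cudaMemcpy((char *)gpu_dst + done, bounce, n,
                               cudaMemcpyHostToDevice) != cudaSuccess)
                        return -1;
                done += n;
        }
        return 0;
}

Every byte crosses the memory and PCIe buses twice (and the same is true in reverse on tx), which is exactly the bandwidth abuse the talk is about.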
Almasry et al introduce Device Memory TCP to solve these challenges. They implement a socket API that lets a user send directly from device memory (the GPU on machine B in the example above, or the SSD on machine A) across the network, as well as place incoming data directly into device memory (the GPU on machine B). How does this magic happen? We need some support from the NIC to split the header on rx, so that the TCP headers go to the host while the rest of the data goes directly to the device; the reverse applies in the tx direction. Packet headers still go to the stack and get processed in the standard (TCP) code path. Memory bandwidth use is cut dramatically, and because NIC-to-device transfers now happen at the lowest level of the PCIe tree hierarchy, PCIe bandwidth use is also cut down dramatically.
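To give a feel for what consuming such a socket API might look like on the rx side, here is a rough sketch. This is not the authors' actual uapi (the talk goes over the real kernel/uapi changes); MSG_SOCK_DEVMEM, SCM_DEVMEM_DMABUF, struct dmabuf_cmsg, rx_devmem_once and the numeric values below are all illustrative placeholders for "tell me where in the pre-registered device buffer each payload fragment landed".

/* Hypothetical devmem-style rx loop: headers were already processed by the
 * normal TCP code path in the kernel; the payload never touches host memory,
 * so userspace only receives small control messages describing where each
 * fragment landed in the device buffer. All names/values are illustrative.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef MSG_SOCK_DEVMEM
#define MSG_SOCK_DEVMEM   0x2000000     /* illustrative flag value */
#endif
#ifndef SCM_DEVMEM_DMABUF
#define SCM_DEVMEM_DMABUF 0x4f          /* illustrative cmsg type  */
#endif

struct dmabuf_cmsg {                    /* illustrative fragment descriptor */
        unsigned long long frag_offset; /* offset into the device buffer  */
        unsigned int       frag_size;   /* payload bytes in this fragment */
        unsigned int       frag_token;  /* returned when the app is done  */
};

static int rx_devmem_once(int sock)
{
        char linear[4096];              /* any data that still lands in host memory */
        char ctrl[4096];
        struct iovec iov = { .iov_base = linear, .iov_len = sizeof(linear) };
        struct msghdr msg = {
                .msg_iov        = &iov,
                .msg_iovlen     = 1,
                .msg_control    = ctrl,
                .msg_controllen = sizeof(ctrl),
        };
        struct cmsghdr *cm;

        if (recvmsg(sock, &msg, MSG_SOCK_DEVMEM) < 0)
                return -1;

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                struct dmabuf_cmsg *frag;

                if (cm->cmsg_level != SOL_SOCKET ||
                    cm->cmsg_type != SCM_DEVMEM_DMABUF)
                        continue;

                frag = (struct dmabuf_cmsg *)CMSG_DATA(cm);
                printf("frag at offset %llu, %u bytes, token %u\n",
                       frag->frag_offset, frag->frag_size, frag->frag_token);
                /* hand offset/size to the GPU-side consumer, then give the
                 * token back to the kernel so the device pages can be reused */
        }
        return 0;
}

The key point is that the application only ever touches small descriptors; the bulk payload stays on the device, which is where the memory and PCIe bandwidth savings come from.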
With devmem TCP, on a setup of 4 machines with GPUs and 100 Gbps NICs, and data sent and received directly from/to device memory, Almasry et al were able to reach ~96.6% of line rate. In the talk they will go over the details of the kernel/uapi changes needed to achieve devmem TCP.
cheers, jamal
[1] https://netdevconf.info/0x17/sessions/talk/congestion-control-architecture-f...