My apologies for putting some marketing at the top of this email: reminder, early-bird ticketing is open for 2 more days; the sooner you register, the better for our planning... And now to tell you about one of my favorite talks this time around...
ML/AI continues to be an inspiration for new Linux kernel infrastructure and a theme at netdev conf 0x17!
There are a couple of patterns in ML traffic that inspire this work from Mina Almasry, Willem de Bruijn, Eric Dumazet and Kaiyuan Zhang: 1) there is _a lot of freaking data_ being transferred across the network, and the volume continues to increase dramatically (for example, NICs with 800 Gbps ports are starting to appear to accommodate such large transfers), and 2) the vast majority of this data typically flows from one device (e.g. storage) to another device (e.g. a GPU).
For context, assume a scenario where data transfers from SSDs on machine A cross the network to GPUs on machine B. On machine B: in the rx direction, incoming data arrives at the NIC, goes to host memory, and then from host memory to the GPU; in the tx direction, outgoing data goes from the GPU to host memory and then out to the NIC. This data movement approach, with multiple round trips across the memory and PCIe buses, is no longer tenable at such data volumes; current host hardware is simply not equipped to deal with it. While such bulk transfers amount to low packets/sec rates, you are simply not going to achieve wire speed because of the memory and PCIe bandwidth abuse from the multiple round trips (see talk [1], which delves into host limitations).
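To make those round trips concrete, here is a minimal sketch (my own illustration, not from the talk) of the conventional rx path on machine B: the payload is received into a host bounce buffer and then copied a second time over PCIe into GPU memory with cudaMemcpy(). The function name, chunk size and error handling are arbitrary choices for the example.

/* Conventional rx path: NIC -> host bounce buffer -> GPU.
 * Illustrative sketch only; sizes and error handling are simplified.
 */
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <cuda_runtime.h>

#define CHUNK (1 << 20)                 /* 1 MiB bounce buffer (arbitrary) */

static int rx_to_gpu(int sock, void *gpu_dst, size_t len)
{
        static char bounce[CHUNK];      /* host memory: first stop for payload */
        size_t done = 0;

        while (done < len) {
                size_t want = (len - done < CHUNK) ? len - done : CHUNK;
                ssize_t n = recv(sock, bounce, want, 0);  /* trip 1: NIC -> host */

                if (n <= 0)
                        return -1;
                /* trip 2 across the memory/PCIe buses: host -> GPU */
                if (cudaMemcpy((char *)gpu_dst + done, bounce, n,
                               cudaMemcpyHostToDevice) != cudaSuccess)
                        return -1;
                done += n;
        }
        return 0;
}

Every byte crosses the memory and PCIe buses twice (and the same is true in reverse on tx), which is exactly the bandwidth abuse the talk is about.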
Almasry et al introduce Device Memory TCP to solve these challenges. They implement a socket API that lets a user send directly from device memory (the GPU on machine B in the example above, or the SSD on machine A) across the network, as well as place incoming data directly into device memory (the GPU on machine B). How does this magic happen? We need some support from the NIC to split the header on rx, so that the TCP headers go to the host while the rest of the data goes directly to the device; the reverse applies in the tx direction. Packet headers still go to the stack and get processed in the standard (TCP) code path. Memory bandwidth use is cut dramatically, and because NIC-to-device transfers now happen at the lowest level of the PCIe tree hierarchy, PCIe bandwidth use is also cut down dramatically.
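To give a feel for what consuming such a socket API might look like on the rx side, here is a rough sketch. This is not the authors' actual uapi (the talk goes over the real kernel/uapi changes); MSG_SOCK_DEVMEM, SCM_DEVMEM_DMABUF, struct dmabuf_cmsg, rx_devmem_once and the numeric values below are all illustrative placeholders for "tell me where in the pre-registered device buffer each payload fragment landed".

/* Hypothetical devmem-style rx loop: headers were already processed by the
 * normal TCP code path in the kernel; the payload never touches host memory,
 * so userspace only receives small control messages describing where each
 * fragment landed in the device buffer. All names/values are illustrative.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef MSG_SOCK_DEVMEM
#define MSG_SOCK_DEVMEM   0x2000000     /* illustrative flag value */
#endif
#ifndef SCM_DEVMEM_DMABUF
#define SCM_DEVMEM_DMABUF 0x4f          /* illustrative cmsg type  */
#endif

struct dmabuf_cmsg {                    /* illustrative fragment descriptor */
        unsigned long long frag_offset; /* offset into the device buffer  */
        unsigned int       frag_size;   /* payload bytes in this fragment */
        unsigned int       frag_token;  /* returned when the app is done  */
};

static int rx_devmem_once(int sock)
{
        char linear[4096];              /* any data that still lands in host memory */
        char ctrl[4096];
        struct iovec iov = { .iov_base = linear, .iov_len = sizeof(linear) };
        struct msghdr msg = {
                .msg_iov        = &iov,
                .msg_iovlen     = 1,
                .msg_control    = ctrl,
                .msg_controllen = sizeof(ctrl),
        };
        struct cmsghdr *cm;

        if (recvmsg(sock, &msg, MSG_SOCK_DEVMEM) < 0)
                return -1;

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                struct dmabuf_cmsg *frag;

                if (cm->cmsg_level != SOL_SOCKET ||
                    cm->cmsg_type != SCM_DEVMEM_DMABUF)
                        continue;

                frag = (struct dmabuf_cmsg *)CMSG_DATA(cm);
                printf("frag at offset %llu, %u bytes, token %u\n",
                       frag->frag_offset, frag->frag_size, frag->frag_token);
                /* hand offset/size to the GPU-side consumer, then give the
                 * token back to the kernel so the device pages can be reused */
        }
        return 0;
}

The key point is that the application only ever touches small descriptors; the bulk payload stays on the device, which is where the memory and PCIe bandwidth savings come from.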
With devmem TCP, on a setup of 4 machines with GPUs and 100 Gbps NICs, and data sent and received directly from/to device memory, Almasry et al were able to reach ~96.6% of line rate. In the talk they will go over the details of the kernel/uapi changes needed to achieve devmem TCP.
cheers, jamal
[1] https://netdevconf.info/0x17/sessions/talk/congestion-control-architecture-f...