Jesse Brandeburg <jesse.brandeburg@intel.com> writes:
On 12/5/2023 11:21 AM, Hagen Paul Pfeifer wrote:
- Brandeburg, Jesse | 2023-12-05 18:58:37 [+0000]:
Hey Jesse
We thought it might be a useful start to figure out a good set of benchmarks to demonstrate "power vs networking" problems. I have a couple in mind right away. One is "system is sleeping but I'm trying to run a latency-sensitive workload and the latency sucks". Two is "system is sleeping and my single-threaded bulk throughput benchmark (netperf/iperf2/neper/etc.) shows a lot of retransmits and/or receiver drops".
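(For illustration, a minimal Python sketch of what a crude probe for case one could look like. This is not netperf; it just bounces a packet off a UDP echo service that you'd have to run yourself on the peer, and the host/port below are hypothetical. The C-state exit latency should show up in the tail percentiles.)

  import socket
  import time

  HOST, PORT, SAMPLES = "192.0.2.1", 9000, 1000  # hypothetical peer running a UDP echo service

  s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  s.settimeout(1.0)
  rtts = []
  for _ in range(SAMPLES):
      t0 = time.perf_counter()
      s.sendto(b"x", (HOST, PORT))
      s.recvfrom(64)                                 # peer echoes the byte back
      rtts.append((time.perf_counter() - t0) * 1e6)  # round trip in us

  rtts.sort()
  print("min %.1f us, p50 %.1f us, p99 %.1f us, max %.1f us"
        % (rtts[0], rtts[len(rtts) // 2], rtts[int(len(rtts) * 0.99)], rtts[-1]))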
The first is a good one - but rather unreasonable, isn't it? RT guys set max_cstate to 1 or so to guarantee low-latency, deterministic RT behavior. I think that if low latency is the ultimate goal, compromises must inevitably be made in the PM domain.
I think you're thinking too small/too narrowly. RT is also a special case, but the deadlines for 100G+ networking (microseconds or nanoseconds) are much shorter than typical RT deadlines (usually milliseconds).
The second one I don't get (e.g.):
- CPU is in idle state C10
- NIC wakes up and interrupts the CPU via the interrupt controller
- CPU transitions C10 -> C0, which takes at least 890 us, maybe longer (from my really old Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz: C10: Flags/Description: MWAIT 0x60, Latency: 890)
- softirq runs and the packet is processed until it is delivered to netperf/iperf2/neper
Where do the retransmits/drops occur here? Sure, C10 -> C0 incurs some wakeup penalty, but no drop.
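(The 890 figure is what cpuidle advertises. A quick way to dump these numbers yourself - a Python sketch, assuming a Linux box with cpuidle enabled:)

  from pathlib import Path

  # Exit latency and target residency, in us, for each C-state on CPU 0.
  states = Path("/sys/devices/system/cpu/cpu0/cpuidle").glob("state*")
  for state in sorted(states, key=lambda p: int(p.name[5:])):
      name = (state / "name").read_text().strip()
      latency = (state / "latency").read_text().strip()      # worst-case exit latency, us
      residency = (state / "residency").read_text().strip()  # minimum worthwhile sleep, us
      print("%8s: exit latency %6s us, target residency %7s us" % (name, latency, residency))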
Quick math at 100Gb/s (each frame occupies an extra 20 bytes of preamble + inter-frame gap on the wire):
64-byte frame arrival interval: 0.00672 us
1518-byte frame arrival interval: 0.12304 us
890 us / 0.00672 us = 132,440 packets per wakeup
890 us / 0.12304 us = 7,233 packets per wakeup
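(The same arithmetic as a Python snippet, for anyone who wants to plug in other link speeds - the 20-byte preamble/IFG overhead per frame is the only assumption:)

  LINK_BPS = 100e9   # 100 Gb/s
  OVERHEAD = 20      # preamble (8) + inter-frame gap (12), bytes
  WAKEUP_US = 890    # C10 exit latency from above

  for frame in (64, 1518):
      arrival_us = (frame + OVERHEAD) * 8 / LINK_BPS * 1e6  # one frame every N us
      print("%4dB: one frame every %.5f us, %s packets per wakeup"
            % (frame, arrival_us, format(WAKEUP_US / arrival_us, ",.0f")))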
So, this means that you have to have at least that many receive descriptors (one per packet) pre-allocated to hold those packets until your CPU wakes up and starts processing the initial interrupt.
Our default 2,048-descriptor rings are able to hold 13 us and 252 us of packets, respectively, on one ring.
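(Same sketch, extended to ring capacity: 2,048 descriptors times the per-packet arrival interval. The 64B case comes out at 13.76 us, rounded down above:)

  RING = 2048  # default descriptors per ring
  for frame, arrival_us in ((64, 0.00672), (1518, 0.12304)):
      print("%4dB: ring absorbs %.2f us at line rate" % (frame, RING * arrival_us))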
If the DMA engine was asleep due to a PC6+ package C-state, then the only storage is the NIC FIFO, and the timelines are much shorter.
Another problem here can also be that the CPU is too fast for the traffic load :)
I.e., if the NIC is not running at 100% utilisation, as is very often the case, there are idle periods between packets (traffic is bursty). So even if the workload is "continuous" at the application level, there may be idle periods long enough that the CPU can enter a sleep state deep enough that it can't wake up in time to process the next burst of packets.
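(To put rough numbers on that - an illustrative Python sketch with C-state figures borrowed from the i7-8700K discussion above and a deliberately crude governor model; all of it is assumption, not measurement:)

  # (name, target residency us, exit latency us) - a few representative states
  C_STATES = [
      ("C1",     2,   2),
      ("C6",   200,  85),
      ("C10", 5000, 890),
  ]
  RING_US = 252  # what 2,048 descriptors of 1518B frames buy you at 100 Gb/s

  for gap_us in (50, 500, 5000):
      # crude model: the governor picks the deepest state whose target
      # residency fits the expected idle gap between bursts
      eligible = [s for s in C_STATES if s[1] <= gap_us]
      name, _, exit_us = eligible[-1] if eligible else ("C0", 0, 0)
      verdict = "fine" if exit_us < RING_US else "drops likely"
      print("gap %5d us -> %4s, wakeup %3d us: %s" % (gap_us, name, exit_us, verdict))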
-Toke