Thanks for seeding this discussion Jesse!
w.r.t. Counting Events: I think it can be a bit challenging to figure out the true amount of time the system spends executing vs. sleeping using only the counters in /sys/devices/system/cpu/cpu<N>/cpuidle/*. We found (on Intel) that some simple PMCs can help with that, specifically the CPU_CLK_UNHALTED.REF counter, which counts the unhalted cycles of the CPU at a fixed TSC rate. Basically, you can sample it around a region of code to figure out how long the CPU was processing instructions vs. halted, and that effectively gives you a ratio of how long it was sleeping. Note that you still don't know which sleep state it was in, but that's something I suppose you can tie in with the /sys counters.
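To make the idea concrete, here's a small Python sketch (with made-up counter deltas; the real values would come from perf or raw PMC reads) of how the two counters combine into a halted/sleeping ratio:

```python
# Sketch (hypothetical numbers): estimating how long a code region spent
# halted vs. executing, from two counters sampled around the region:
#   - CPU_CLK_UNHALTED.REF: reference cycles, counted only while unhalted
#   - TSC: reference cycles at the same fixed rate, counted always

def halted_ratio(ref_unhalted_cycles: int, tsc_cycles: int) -> float:
    """Fraction of the measured interval the CPU was halted (sleeping)."""
    if tsc_cycles <= 0:
        raise ValueError("TSC delta must be positive")
    return 1.0 - (ref_unhalted_cycles / tsc_cycles)

# Made-up deltas: 2.5e9 TSC cycles elapsed across the region, of which the
# CPU was unhalted for 1.0e9 reference cycles -> halted 60% of the time.
print(halted_ratio(1_000_000_000, 2_500_000_000))
```

As noted, this only gives the total halted fraction; attributing it to a particular C-state would still need the per-state residency counters under /sys/devices/system/cpu/cpu<N>/cpuidle/state*/.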
w.r.t. Benchmarks: I'm also curious, in general, how the common optimizations people have done to improve network performance affect power. For example, if we can support the same workload with fewer instructions, then that automatically means lower power consumption, right? Things that come to mind: bypassing parts of the kernel, replacing TCP with UDP, and the impact of having a dedicated polling thread to reap packets for multiple workers (there might not be a way around this for very low latency apps).
- Han
On Tue, Dec 5, 2023 at 3:45 PM Jesse Brandeburg <jesse.brandeburg@intel.com> wrote:
On 12/5/2023 11:21 AM, Hagen Paul Pfeifer wrote:
- Brandeburg, Jesse | 2023-12-05 18:58:37 [+0000]:
Hey Jesse
We thought it might be a useful start to figure out a good set of benchmarks to demonstrate "power vs networking" problems. I have a couple in mind right away. One is "system is sleeping but I'm trying to run a latency-sensitive workload and the latency sucks." Two is "system is sleeping and my single-threaded bulk throughput benchmark (netperf/iperf2/neper/etc) shows a lot of retransmits and/or receiver drops."
The first one is good - but rather unreasonable, isn't it? RT guys set max_cstate to 1 or so to guarantee low-latency, deterministic RT behavior. I think that if low latency is the ultimate goal, compromises must inevitably be made in the PM domain.
I think you're thinking too small/detailed. RT is also a special case, but the deadlines for 100G+ networking are much shorter (microseconds or nanoseconds) than typical RT deadlines (usually milliseconds).
The second one I don't get (e.g.):
- CPU is in idle state C10
- NIC wakes up and interrupts the CPU's interrupt controller
- CPU C10 -> C0: takes at least 890 us, maybe longer (from my really old Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz; C10: Flags/Description: MWAIT 0x60, Latency: 890)
- softirq runs and the packet is processed until delivered to netperf/iperf2/neper
Where do the retransmits/drops occur here? Sure, C10 -> C0 incurs some wakeup penalty, but no drops.
Quick math at 100Gb/s:
64-byte frame arrival rate: 0.00672us; 1518-byte frame arrival rate: 0.12304us
890us / 0.00672us = 132,440 packets per wakeup
890us / 0.12304us = 7,233 packets per wakeup
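For anyone who wants to re-run these numbers for other link speeds or wake latencies, a quick Python back-of-envelope (assuming 20 bytes of per-frame wire overhead for preamble + IFG, which is how the figures above work out):

```python
# Back-of-envelope check of the arrival-rate math above (not a measurement).
# Assumes 20 bytes of per-frame overhead on the wire (preamble + IFG).
WIRE_OVERHEAD_BYTES = 20

def arrival_time_us(frame_bytes: int, link_gbps: float) -> float:
    """Time between back-to-back frames of this size, in microseconds."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return bits_on_wire / (link_gbps * 1000.0)  # Gb/s -> bits per us

def packets_per_wakeup(wake_latency_us: float, frame_bytes: int,
                       link_gbps: float = 100.0) -> int:
    """Packets that arrive while the CPU is still waking up."""
    return int(wake_latency_us / arrival_time_us(frame_bytes, link_gbps))

print(arrival_time_us(64, 100.0))    # ~0.00672 us per 64B frame at 100G
print(packets_per_wakeup(890, 64))   # packets queued during an 890us wake
print(packets_per_wakeup(890, 1518))
```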
So, this means that you have to have at least that many receive descriptors (one per packet) pre-allocated to hold those packets until your CPU wakes up and starts processing the initial interrupt.
Our default 2,048 descriptor rings are able to hold 13us and 252us, respectively, of packets on one ring.
If the DMA was asleep due to PC6+ state then the only storage is on the NIC FIFO, and the timelines are much shorter.
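Extending the same back-of-envelope (again assuming 20 bytes of per-frame wire overhead), the ring-buffering window works out like this:

```python
# Sanity check of the ring-buffering window above: with one packet per
# descriptor, a ring of N descriptors absorbs arrivals for roughly N times
# the per-frame wire time. Assumes 20B preamble/IFG overhead per frame.
def ring_hold_time_us(descriptors: int, frame_bytes: int,
                      link_gbps: float = 100.0) -> float:
    wire_bits = (frame_bytes + 20) * 8
    return descriptors * wire_bits / (link_gbps * 1000.0)

print(ring_hold_time_us(2048, 64))    # ~13.8 us of 64B frames at 100G
print(ring_hold_time_us(2048, 1518))  # ~252 us of 1518B frames at 100G
```

Either way, both windows are far short of the 890us C10 exit latency quoted above, which is the point: the default ring fills long before the CPU is back.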
Jesse, I wonder whether the benchmarks will lead to much. Can we use them to make measurements that are comparable? What do you want to achieve with the benchmarks? Sorry for asking these questions! ;-)
Of course that's the goal :-) And I like the questions, keep em coming!
I'm hoping to start us on the path of a) including some knowledge of the wake latency and system behavior in the networking layer, and b) some back-and-forth communication from the networking layer to the scheduler and CPU power manager based on that knowledge.
_______________________________________________
Netdev 0x17 Net-Power mailing list -- net-power@netdevconf.info
To unsubscribe send an email to net-power-leave@netdevconf.info