Pedro Tammela <pctammela@mojatatu.com> writes:
On 12/12/2023 10:40, Toke Høiland-Jørgensen wrote:
Pedro Tammela <pctammela@mojatatu.com> writes:
On 11/12/2023 12:45, Toke Høiland-Jørgensen wrote:
"Brandeburg, Jesse" jesse.brandeburg@intel.com writes:
Toke and I were chatting offline about this problem of power management in networking.
We thought it might be a useful start to figure out a good set of benchmarks to demonstrate "power vs networking" problems. I have a couple in mind right away. One is "system is sleeping but I'm trying to run a latency-sensitive workload and the latency sucks". Two is "system is sleeping and my single-threaded bulk throughput benchmark (netperf/iperf2/neper/etc.) shows a lot of retransmits and/or receiver drops".
Another thought: how do I count these events and/or notice I have a problem?
More thoughts on this from anyone?
Thank you for starting the on-list discussion. I'll add some high-level thoughts here and also reply to a couple of messages down-thread with some more specific comments.
When talking about benchmarking, the reason I mentioned that as a good starting point is that I believe having visibility into power usage is the only way we can make people actually use any tweaks we can come up with. Especially since there's a lot of cargo-culting involved in tuning (of the "use these settings for the best latency/throughput/whatever" variety), and having more precise measurements of the impact of settings is a way of combating that (and empowering people to make better assessments of the tradeoffs involved).
And secondly, of course, if we are actually trying to improve something, we need some baseline metrics to improve against. I'm thinking this can be approached from both "ends", i.e., "here is the cost tradeoff of various tuning parameters" that you mention, but also "here is the power consumption of workload X", which can then be a target for improvement.
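As a starting point for such baseline measurements, here is a minimal sketch of sampling package energy around a benchmark run. It assumes an Intel system exposing the RAPL powercap interface with the package domain at intel-rapl:0; the sysfs layout varies by platform, and the workload/command shown is just a placeholder:

/* Sample RAPL package energy around a benchmark run.
 * Assumes /sys/class/powercap/intel-rapl:0 exists (Intel, powercap enabled);
 * the domain layout varies by platform, so treat this as a sketch only. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static unsigned long long read_energy_uj(void)
{
	FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
	unsigned long long uj = 0;

	if (!f || fscanf(f, "%llu", &uj) != 1) {
		perror("energy_uj");
		exit(1);
	}
	fclose(f);
	return uj;
}

int main(void)
{
	unsigned long long start, end;

	start = read_energy_uj();

	/* Run the workload to be measured here, e.g.
	 * system("netperf -H <server> -l 30");
	 * a plain sleep stands in for it in this sketch. */
	sleep(30);

	end = read_energy_uj();

	/* The counter wraps at max_energy_range_uj; wrap handling is omitted here. */
	printf("package energy over the run: %.2f J\n", (end - start) / 1e6);
	return 0;
}

Dividing the joules by the bytes (or requests) moved during the run gives the kind of "power consumption of workload X" figure that can serve as a target for improvement.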
Turning to areas for improvement, I can think of a couple of broad categories that seem promising to explore (some of which have already been mentioned down-thread):
- Smart task placement when scaling up/down (consolidating work on fewer cores to leave others idle enough that they can go to sleep).
- Forecasting the next packet arrival, and using this both to make smarter sleep-state decisions and to do smarter batching (maybe we can defer waking up the userspace process if we expect another packet to arrive shortly, that sort of thing).
Wouldn't that require some sort of protocol integration?
Probably, yeah. In-kernel the TCP stack could provide hints in some cases (it knows the RTT and current bandwidth of the flow).
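To illustrate what the stack already tracks per flow, a minimal sketch that reads the smoothed RTT and delivery rate through the existing TCP_INFO getsockopt follows; this only shows the data a hint mechanism could build on, it is not a proposal for the hint interface itself:

/* Read the smoothed RTT and delivery rate the TCP stack already tracks
 * for a connected socket. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>	/* struct tcp_info, TCP_INFO */

int print_flow_hints(int fd)
{
	struct tcp_info ti;
	socklen_t len = sizeof(ti);

	memset(&ti, 0, sizeof(ti));
	if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0) {
		perror("TCP_INFO");
		return -1;
	}

	/* tcpi_rtt is in microseconds, tcpi_delivery_rate in bytes/sec */
	printf("srtt: %u us, delivery rate: %llu bytes/s\n",
	       ti.tcpi_rtt, (unsigned long long)ti.tcpi_delivery_rate);
	return 0;
}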
Interesting, this sort of info could be integrated into the scheduler for power aware scheduling in P/E processors.
Yeah, I expect there will end up being some interaction with the scheduler here at some point :)
For others, we could expose an API for userspace to provide hints. The interesting bit would be to find out whether this would work well enough in practice. My hope would be that it could be good enough that it would be feasible to run (more) systems with power saving features enabled without suffering losses and/or huge latency spikes, which would be a win :)
- General performance improvements in targeted areas (better performance should translate to less work done per packet, which means less power used, all other things being equal).
One thing that Jamal and I saw was that this is not always the case. Surprising as it may seem, we saw CPU power consumption usually staying constant[*] while throughput etc. varied. In TLS, for instance, AVX512 acceleration using Intel's cryptoMB made the whole process more power efficient but not less power hungry, i.e., the same power consumption but more throughput than with AES-NI.
[*] To expand a little bit more: turbo boosting is very smart these days. It essentially aims for TDP (for Intel at least) all the time, dynamically scaling everything to reach it.
Hmm, that's interesting. So, IIUC, this implies that performance improvements have to have a certain magnitude to be useful for saving power, right? I.e., saving a few % of CPU usage on one core is not enough, but if the improvement is enough that you can move the workload to fewer cores, it will help because you can bring some cores offline/to idle. Or am I misunderstanding what you mean?
Yes exactly! Fewer cores also means less thermal pressure, which also means fans spinning slower :) Or potentially a longer server lifetime/cheaper server upgrade.
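For what it's worth, a crude way to experiment with that kind of consolidation today is to pin the networking workload (and, via /proc/irq/*/smp_affinity, the NIC interrupts) onto a small CPU set so the remaining cores can stay idle. A minimal sketch, with the CPU set chosen purely for illustration:

/* Pin the current process onto CPUs 0-1 so the rest of the package can
 * idle. The CPU set is an arbitrary example; in practice it should be
 * sized to the offered load (and NIC IRQ affinity moved along with it). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int consolidate_onto_two_cpus(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	CPU_SET(1, &set);

	if (sched_setaffinity(0, sizeof(set), &set) < 0) {
		perror("sched_setaffinity");
		return -1;
	}
	return 0;
}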
But when given more CPU room, applications might actually do more work! Take TLS offload + zero copy, for instance: the CPU will only really be freed if the link/network stack is saturated.
I believe there are two approaches here to networking:
- Power saving vs. power efficiency
So this is mostly related to the amount of batching, isn't it? I.e., at high rates we are more efficient because more data arrives inside a single batch (NAPI poll) cycle, so we can amortise the processing costs.
If so, this implies that if we tune the batching threshold/interval we can achieve (close to) the same efficiency even when the link is not busy, simply by deferring the processing. That's what I meant by "smarter batching" in my original list.
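Some of that deferral can already be approximated with the existing per-device IRQ-deferral knobs, napi_defer_hard_irqs and gro_flush_timeout. A sketch, assuming an interface named eth0 and with values picked only for illustration:

/* Let NAPI keep hardware interrupts masked and pick packets up in larger,
 * timer-driven batches instead of per-packet IRQs. Interface name and
 * values are illustrative; the right numbers depend on the latency budget. */
#include <stdio.h>

static int write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	if (fputs(val, f) == EOF)
		perror(path);
	fclose(f);
	return 0;
}

int main(void)
{
	/* Allow up to two empty NAPI polls before re-enabling hard IRQs... */
	write_knob("/sys/class/net/eth0/napi_defer_hard_irqs", "2");
	/* ...and drive those polls (and GRO flushes) from a 200 us timer. */
	write_knob("/sys/class/net/eth0/gro_flush_timeout", "200000");
	return 0;
}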
It would probably also need some hints from the stack and/or the application. For example, if the application had a way to inform the stack "I am only processing this TCP stream in batches of 100KB anyway, so please defer waking me up until you have a chunk of that size ready", that could be a win. Maybe this could even be complemented with an API to express a "(maximum) acceptable wait time"?
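Part of that hint already exists as SO_RCVLOWAT (don't report the socket readable until N bytes are queued); the "maximum acceptable wait time" half would be new. A sketch of how the combination might look from the application side, where SO_MAX_WAIT_US is a purely made-up option name used only for illustration:

/* Ask the stack not to wake us until ~100KB of the stream is queued.
 * SO_RCVLOWAT exists today and affects when the socket is reported
 * readable; SO_MAX_WAIT_US is a hypothetical name for the "but don't sit
 * on data longer than this" half of the hint, which does not exist. */
#include <stdio.h>
#include <sys/socket.h>

int set_batching_hints(int fd)
{
	int lowat = 100 * 1024;		/* wake me in ~100KB chunks */

	if (setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat)) < 0) {
		perror("SO_RCVLOWAT");
		return -1;
	}

#ifdef SO_MAX_WAIT_US			/* hypothetical, for illustration only */
	int max_wait_us = 5000;		/* ...but never delay delivery beyond 5 ms */
	setsockopt(fd, SOL_SOCKET, SO_MAX_WAIT_US, &max_wait_us, sizeof(max_wait_us));
#endif
	return 0;
}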
-Toke