Pedro Tammela <pctammela@mojatatu.com> writes:
On 12/12/2023 10:40, Toke Høiland-Jørgensen wrote:
Pedro Tammela <pctammela@mojatatu.com> writes:
On 11/12/2023 12:45, Toke Høiland-Jørgensen wrote:
"Brandeburg, Jesse" jesse.brandeburg@intel.com writes:
Toke and I were chatting offline about this problem of power management in networking.
We thought it might be a useful start to figure out a good set of benchmarks to demonstrate "power vs networking" problems. I have a couple in mind right away. One is "system is sleeping but I'm trying to run a latency-sensitive workload and the latency sucks". Two is "system is sleeping and my single-threaded bulk throughput benchmark (netperf/iperf2/neper/etc.) shows a lot of retransmits and/or receiver drops".
Another thought: how do I count these events and/or notice I have a problem?
More thoughts on this from anyone?
Thank you for starting the on-list discussion. I'll add some high-level thoughts here and also reply to a couple of messages down-thread with some more specific comments.
When talking about benchmarking, the reason I mentioned that as a good starting point is that I believe having visibility into power usage is the only way we can make people actually use any tweaks we can come up with. Especially since there's a lot of cargo-culting involved in tuning (of the "use these settings for the best latency/throughput/whatever" variety), and having more precise measurements of the impact of settings is a way of combating that (and empowering people to make better assessments of the tradeoffs involved).
And secondly, of course, if we are actually trying to improve something, we need some baseline metrics to improve against. I'm thinking this can be approached from both "ends", i.e., "here is the cost tradeoff of various tuning parameters" that you mention, but also "here is the power consumption of workload X", which can then be a target for improvement.
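As a starting point for such baseline measurements, here is a minimal sketch of sampling package energy around a benchmark run. It assumes an Intel system exposing the RAPL powercap interface with the package domain at intel-rapl:0; the sysfs layout varies by platform, and the workload/command shown is just a placeholder:

/* Sample RAPL package energy around a benchmark run.
 * Assumes /sys/class/powercap/intel-rapl:0 exists (Intel, powercap enabled);
 * the domain layout varies by platform, so treat this as a sketch only. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static unsigned long long read_energy_uj(void)
{
	FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
	unsigned long long uj = 0;

	if (!f || fscanf(f, "%llu", &uj) != 1) {
		perror("energy_uj");
		exit(1);
	}
	fclose(f);
	return uj;
}

int main(void)
{
	unsigned long long start, end;

	start = read_energy_uj();

	/* Run the workload to be measured here, e.g.
	 * system("netperf -H <server> -l 30");
	 * a plain sleep stands in for it in this sketch. */
	sleep(30);

	end = read_energy_uj();

	/* The counter wraps at max_energy_range_uj; wrap handling is omitted here. */
	printf("package energy over the run: %.2f J\n", (end - start) / 1e6);
	return 0;
}

Dividing the joules by the bytes (or requests) moved during the run gives the kind of "power consumption of workload X" figure that can serve as a target for improvement.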
Turning to areas for improvement, I can think of a couple of broad categories that seem promising to explore (some of which have already been mentioned down-thread):
- Smart task placement when scaling up/down (consolidating work on fewer cores to leave others idle enough that they can go to sleep).
- Forecasting the next packet arrival, and using this both to make smarter sleep-state decisions and to do smarter batching (maybe we can defer waking up the userspace process if we expect another packet to arrive shortly, that sort of thing).
Wouldn't that require some sort of protocol integration?
Probably, yeah. In-kernel the TCP stack could provide hints in some cases (it knows the RTT and current bandwidth of the flow).
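To illustrate what the stack already tracks per flow, a minimal sketch that reads the smoothed RTT and delivery rate through the existing TCP_INFO getsockopt follows; this only shows the data a hint mechanism could build on, it is not a proposal for the hint interface itself:

/* Read the smoothed RTT and delivery rate the TCP stack already tracks
 * for a connected socket. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>	/* struct tcp_info, TCP_INFO */

int print_flow_hints(int fd)
{
	struct tcp_info ti;
	socklen_t len = sizeof(ti);

	memset(&ti, 0, sizeof(ti));
	if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0) {
		perror("TCP_INFO");
		return -1;
	}

	/* tcpi_rtt is in microseconds, tcpi_delivery_rate in bytes/sec */
	printf("srtt: %u us, delivery rate: %llu bytes/s\n",
	       ti.tcpi_rtt, (unsigned long long)ti.tcpi_delivery_rate);
	return 0;
}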
Interesting, this sort of info could be integrated into the scheduler for power aware scheduling in P/E processors.
Yeah, I expect there will end up being some interaction with the scheduler here at some point :)
For others, we could expose an API for userspace to provide hints. The interesting bit would be to find out whether this would work well enough in practice. My hope would be that it could be good enough that it would be feasible to run (more) systems with power saving features enabled without suffering losses and/or huge latency spikes, which would be a win :)
- General performance improvements in targeted areas (better performance should translate to less work done per packet, which means less power used, all other things being equal).
One thing that Jamal and I saw was that this is not always the case. Surprising as it may seem, we saw CPU power consumption usually staying constant[*] while throughput etc. varied. In TLS, for instance, AVX512 acceleration using Intel's cryptoMB made the whole process more power efficient but not less power hungry, i.e., the same power consumption but more throughput than with AES-NI.
[*] To expand a little bit more: turbo boosting is very smart these days. It essentially aims for TDP (for Intel at least) all the time, dynamically scaling everything to reach it.
Hmm, that's interesting. So, IIUC, this implies that performance improvements have to have a certain magnitude to be useful for saving power, right? I.e., saving a few % of CPU usage on one core is not enough, but if the improvement is enough that you can move the workload to fewer cores, it will help because you can bring some cores offline/to idle. Or am I misunderstanding what you mean?
Yes exactly! Fewer cores also means less thermal pressure, which also means fans spinning slower :) Or potentially a longer server lifetime/cheaper server upgrade.
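For what it's worth, a crude way to experiment with that kind of consolidation today is to pin the networking workload (and, via /proc/irq/*/smp_affinity, the NIC interrupts) onto a small CPU set so the remaining cores can stay idle. A minimal sketch, with the CPU set chosen purely for illustration:

/* Pin the current process onto CPUs 0-1 so the rest of the package can
 * idle. The CPU set is an arbitrary example; in practice it should be
 * sized to the offered load (and NIC IRQ affinity moved along with it). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int consolidate_onto_two_cpus(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	CPU_SET(1, &set);

	if (sched_setaffinity(0, sizeof(set), &set) < 0) {
		perror("sched_setaffinity");
		return -1;
	}
	return 0;
}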
But when given more CPU room, applications might actually do more work! Take TLS offload + zero copy, for instance: the CPU will only really be freed if the link/network stack is saturated.
I believe there are two approaches here to networking:
- Power saving vs. power efficiency
So this is mostly related to the amount of batching, isn't it? I.e., at high rates we are more efficient because more data arrives inside a single batch (NAPI poll) cycle, so we can amortise the processing costs.
If so, this implies that if we tune the batching threshold/interval we can achieve (close to) the same efficiency even when the link is not busy, simply by deferring the processing. That's what I meant by "smarter batching" in my original list.
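Some of that deferral can already be approximated with the existing per-device IRQ-deferral knobs, napi_defer_hard_irqs and gro_flush_timeout. A sketch, assuming an interface named eth0 and with values picked only for illustration:

/* Let NAPI keep hardware interrupts masked and pick packets up in larger,
 * timer-driven batches instead of per-packet IRQs. Interface name and
 * values are illustrative; the right numbers depend on the latency budget. */
#include <stdio.h>

static int write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	if (fputs(val, f) == EOF)
		perror(path);
	fclose(f);
	return 0;
}

int main(void)
{
	/* Allow up to two empty NAPI polls before re-enabling hard IRQs... */
	write_knob("/sys/class/net/eth0/napi_defer_hard_irqs", "2");
	/* ...and drive those polls (and GRO flushes) from a 200 us timer. */
	write_knob("/sys/class/net/eth0/gro_flush_timeout", "200000");
	return 0;
}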
It would probably also need some hints from the stack and/or the application. For example, if the application had a way to inform the stack "I am only processing this TCP stream in batches of 100KB anyway, so please defer waking me up until you have a chunk of that size ready", that could be a win. Maybe this could even be complemented with an API to express a "(maximum) acceptable wait time"?
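Part of that hint already exists as SO_RCVLOWAT (don't report the socket readable until N bytes are queued); the "maximum acceptable wait time" half would be new. A sketch of how the combination might look from the application side, where SO_MAX_WAIT_US is a purely made-up option name used only for illustration:

/* Ask the stack not to wake us until ~100KB of the stream is queued.
 * SO_RCVLOWAT exists today and affects when the socket is reported
 * readable; SO_MAX_WAIT_US is a hypothetical name for the "but don't sit
 * on data longer than this" half of the hint, which does not exist. */
#include <stdio.h>
#include <sys/socket.h>

int set_batching_hints(int fd)
{
	int lowat = 100 * 1024;		/* wake me in ~100KB chunks */

	if (setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat)) < 0) {
		perror("SO_RCVLOWAT");
		return -1;
	}

#ifdef SO_MAX_WAIT_US			/* hypothetical, for illustration only */
	int max_wait_us = 5000;		/* ...but never delay delivery beyond 5 ms */
	setsockopt(fd, SOL_SOCKET, SO_MAX_WAIT_US, &max_wait_us, sizeof(max_wait_us));
#endif
	return 0;
}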
-Toke