On 11/12/2023 12:45, Toke Høiland-Jørgensen wrote:
"Brandeburg, Jesse" jesse.brandeburg@intel.com writes:
Toke and I were chatting offline about this problem of power management in networking.
We thought it might be a useful start to figure out a good set of benchmarks to demonstrate "power vs networking" problems. I have a couple in mind right away. One is "system is sleeping but I'm trying to run a latency-sensitive workload and the latency sucks". Two is "system is sleeping and my single-threaded bulk throughput benchmark (netperf/iperf2/neper/etc.) shows a lot of retransmits and/or receiver drops".
Another thought is how do I count these events and/or notice I have a problem?
More thoughts on this from anyone?
Thank you for starting the on-list discussion. I'll add some high-level thoughts here and also reply to a couple of messages down-thread with some more specific comments.
The reason I mentioned benchmarking as a good starting point is that I believe having visibility into power usage is the only way we can get people to actually use any tweaks we come up with. There's a lot of cargo-culting involved in tuning (of the "use these settings for the best latency/throughput/whatever" variety), and having more precise measurements of the impact of settings is a way of combating that (and of empowering people to make better assessments of the tradeoffs involved).
And secondly, of course, if we are actually trying to improve something, we need some baseline metrics to improve against. I'm thinking this can be approached from both "ends", i.e., "here is the cost tradeoff of various tuning parameters" that you mention, but also "here is the power consumption of workload X", which can then be a target for improvement.
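As one possible way to get such a baseline number, here is a minimal sketch that samples the Linux powercap (RAPL) energy counter around a workload and reports joules and average watts. It assumes an Intel-style RAPL domain at `/sys/class/powercap/intel-rapl:0` (the path varies per system and reading it usually needs root); the helper names are made up for illustration, and it only covers what that one RAPL domain reports, not NIC or whole-system power.

```python
import time
from pathlib import Path

# Assumption: package-0 RAPL domain; the actual path differs between systems.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_uj(name):
    """Read an integer microjoule value from the RAPL sysfs directory."""
    return int((RAPL_DOMAIN / name).read_text())


def energy_delta_uj(start_uj, end_uj, max_range_uj):
    """Energy used between two counter samples, allowing for one wraparound."""
    if end_uj >= start_uj:
        return end_uj - start_uj
    # The counter wrapped once: count up to the range limit, then from zero.
    return end_uj + max_range_uj - start_uj


def average_power_watts(start_uj, end_uj, max_range_uj, seconds):
    """Average power in watts over an interval of `seconds`."""
    return energy_delta_uj(start_uj, end_uj, max_range_uj) / 1e6 / seconds


def measure(workload):
    """Run `workload()` and return (joules, average watts) for the domain."""
    max_range = read_uj("max_energy_range_uj")
    t0, e0 = time.monotonic(), read_uj("energy_uj")
    workload()
    t1, e1 = time.monotonic(), read_uj("energy_uj")
    joules = energy_delta_uj(e0, e1, max_range) / 1e6
    return joules, joules / (t1 - t0)
```

Wrapping a netperf or iperf run in `measure()` would give a "power consumption of workload X" figure that can be compared across tuning settings, as long as the run is short enough that the counter wraps at most once.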
Turning to areas for improvement, I can think of a couple of broad categories that seem promising to explore (some of which have already been mentioned down-thread):
- Smart task placement when scaling up/down (consolidating work on fewer cores to leave others idle enough that they can go to sleep).
- Forecasting the next packet arrival, and using this both to make smarter sleep state decisions and to do smarter batching (maybe we can defer waking up the userspace process if we expect another packet to arrive shortly, that sort of thing).
Wouldn't that require some sort of protocol integration?
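For illustration only, a purely local estimator (no protocol integration, which may well be insufficient for bursty traffic) could look like the sketch below. It keeps an exponentially weighted moving average of packet inter-arrival gaps and suggests deferring a wakeup when the next packet is expected within a small budget. The class, its method names, and the `alpha` and `max_wait` parameters are all hypothetical and not taken from any existing kernel code.

```python
class ArrivalPredictor:
    """EWMA of packet inter-arrival gaps (hypothetical sketch)."""

    def __init__(self, alpha=0.25):
        self.alpha = alpha          # weight given to the newest gap
        self.last_arrival = None    # timestamp of the most recent packet
        self.mean_gap = None        # smoothed inter-arrival gap

    def observe(self, now):
        """Record a packet arrival at time `now` (seconds)."""
        if self.last_arrival is not None:
            gap = now - self.last_arrival
            if self.mean_gap is None:
                self.mean_gap = gap
            else:
                self.mean_gap += self.alpha * (gap - self.mean_gap)
        self.last_arrival = now

    def should_defer_wakeup(self, now, max_wait):
        """True if the next packet is expected within `max_wait` seconds."""
        if self.mean_gap is None:
            return False  # not enough history to predict anything
        time_to_next = self.last_arrival + self.mean_gap - now
        # A negative value means the predicted packet is already overdue.
        return 0 <= time_to_next <= max_wait
```

The same estimate could in principle feed a sleep state decision: a long expected gap argues for a deeper idle state, a short one for staying shallow.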
- General performance improvements in targeted areas (better performance should translate to less work done per packet, which means less power used, all other things being equal).
One thing that Jamal and I saw was that this is not always the case. Surprising as it may seem, we saw that CPU power consumption was usually constant[*] while throughput etc. varied. In TLS for instance, AVX512 acceleration using Intel's cryptoMB made the whole process more power efficient but not less power hungry, i.e. the same power consumption but more throughput than with AES-NI.
[*] To expand a little bit more, turbo boosting is very smart these days. It essentially aims for TDP (on Intel at least) all the time, so it dynamically scales everything to reach it.
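To make the "more efficient but not less hungry" distinction concrete, here is a small sketch that converts a power draw and a throughput into energy per gigabyte transferred. The function name is made up, and the wattage and throughput figures in the usage example are hypothetical, not measurements from the TLS tests mentioned above.

```python
def joules_per_gigabyte(power_watts, throughput_gbit_per_s):
    """Energy spent per gigabyte moved at a given power draw and throughput."""
    if power_watts < 0 or throughput_gbit_per_s <= 0:
        raise ValueError("power must be >= 0 and throughput must be > 0")
    gigabytes_per_s = throughput_gbit_per_s / 8  # 8 bits per byte
    return power_watts / gigabytes_per_s


# Hypothetical numbers: same power draw, throughput doubled by acceleration.
baseline = joules_per_gigabyte(100.0, 20.0)
accelerated = joules_per_gigabyte(100.0, 40.0)
print(f"baseline: {baseline} J/GB, accelerated: {accelerated} J/GB")
```

With the power pinned near TDP, doubling throughput halves the energy per gigabyte even though the wattage reading does not move, which is why a benchmark that only reports watts would miss the improvement.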