Hi everyone,
(regarding the list, the home page is https://lists.netdevconf.info/postorius/lists/net-power.netdevconf.info/)
I think we don't have a lot of subscribers yet to this list (hey Jamal you should subscribe!) - invite your power-concerned friends and colleagues.
Toke and I were chatting offline about this problem of power management in networking.
We thought it might be a useful start to figure out a good set of benchmarks to demonstrate "power vs networking" problems. I have a couple in mind right away. One is "system is sleeping but I'm trying to run a latency-sensitive workload and the latency sucks." Two is "system is sleeping and my single-threaded bulk throughput benchmark (netperf/iperf2/neper/etc.) shows a lot of retransmits and/or receiver drops."
Another thought is how do I count these events and / or notice I have a problem?
More thoughts on this from anyone?
Jesse
* Brandeburg, Jesse | 2023-12-05 18:58:37 [+0000]:
Hey Jesse
We thought it might be a useful start to figure out a good set of benchmarks to demonstrate "power vs networking" problems. I have a couple in mind right away. One is "system is sleeping but I'm trying to run a latency-sensitive workload and the latency sucks." Two is "system is sleeping and my single-threaded bulk throughput benchmark (netperf/iperf2/neper/etc.) shows a lot of retransmits and/or receiver drops."
The first is a good one - but rather unreasonable, isn't it? RT guys set max_cstate to 1 or so to guarantee low-latency, deterministic RT behavior. I think that if low latency is the ultimate goal, compromises must inevitably be made in the PM domain.
The second I don't get (e.g.):
- CPU is in idle state C10
- NIC wakes up and interrupts the CPU's interrupt controller
- CPU C10 -> C0
- softirq runs and the packet is processed until delivered to netperf/iperf2/neper
Where do the retransmits/drops occur here? Sure C10 -> C0 takes some wakeup penalty, but no drop.
Jesse, I wonder if the benchmarks lead to much? Can we use them to make measurements that are comparable? What do you want to achieve with the benchmarks? Sorry for asking these questions! ;-)
hgn
On 12/5/2023 11:21 AM, Hagen Paul Pfeifer wrote:
The first is a good one - but rather unreasonable, isn't it? RT guys set max_cstate to 1 or so to guarantee low-latency, deterministic RT behavior. I think that if low latency is the ultimate goal, compromises must inevitably be made in the PM domain.
I think you're thinking too small/too detailed. RT is also a special case, but the deadlines for 100G+ networking are much shorter (microseconds or nanoseconds) than typical RT deadlines (usually milliseconds).
The second I don't get (e.g.):
- CPU is in idle state C10
- NIC wakes up and interrupts the CPU's interrupt controller
- CPU C10 -> C0
takes at least 890 us, maybe longer (from my really old Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz): C10: Flags/Description: MWAIT 0x60, Latency: 890
- softirq runs and the packet is processed until delivered to netperf/iperf2/neper
Where do the retransmits/drops occur here? Sure C10 -> C0 takes some wakeup penalty, but no drop.
Quick math at 100Gb/s:
64-byte arrival rate: 0.00672us
1518-byte arrival rate: 0.12304us
890us / 0.00672us = 132,440 packets per wakeup
890us / 0.12304us = 7,233 packets per wakeup
So, this means that you have to have at least that many receive descriptors (one per packet) pre-allocated to hold those packets until your CPU wakes up and starts processing the initial interrupt.
Our default 2,048-descriptor rings are able to hold roughly 13.8us and 252us, respectively, of packets on one ring.
If the DMA was asleep due to PC6+ state then the only storage is on the NIC FIFO, and the timelines are much shorter.
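A small Python sketch that reproduces the arithmetic above; the 20 bytes of per-frame overhead (preamble plus inter-frame gap) and the 2,048-entry ring are the assumptions implied by the numbers:

  # Rough reproduction of the arrival-rate math above.
  # Assumptions: 100 Gb/s line rate, 20 bytes of per-frame overhead
  # (preamble + inter-frame gap), 890 us C10 exit latency, 2048-entry Rx ring.
  LINE_RATE_BPS = 100e9
  OVERHEAD_BYTES = 20          # 8B preamble + 12B inter-frame gap
  WAKEUP_LATENCY_US = 890      # C10 exit latency from the cpuidle table above
  RING_ENTRIES = 2048

  def arrival_interval_us(frame_bytes: int) -> float:
      """Time one frame occupies the wire, in microseconds."""
      return (frame_bytes + OVERHEAD_BYTES) * 8 / LINE_RATE_BPS * 1e6

  for frame in (64, 1518):
      per_pkt = arrival_interval_us(frame)
      print(f"{frame:4d}B: {per_pkt:.5f} us/pkt, "
            f"{WAKEUP_LATENCY_US / per_pkt:,.0f} pkts per {WAKEUP_LATENCY_US} us wakeup, "
            f"{RING_ENTRIES}-entry ring absorbs {RING_ENTRIES * per_pkt:.1f} us")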
Jesse, I wonder if the benchmarks lead to much? Can we use them to make measurements that are comparable? What do you want to achieve with the benchmarks? Sorry for asking these questions! ;-)
Of course that's the goal :-) And I like the questions, keep em coming!
I'm hoping to start us on the path of a) including some knowledge of the wake latency and system behavior in the networking layer, and b) some back-and-forth communication from the networking layer to the scheduler and CPU power manager based on that knowledge.
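For reference on a), the closest thing that exists today is the system-wide PM QoS request: a process holds /dev/cpu_dma_latency open with a latency bound and cpuidle then avoids C-states whose exit latency exceeds it. A minimal Python sketch (the 20 us bound is just an illustrative value), mostly to show how coarse the current machine-wide mechanism is compared to the per-queue/per-flow knowledge being discussed here:

  # Sketch: the existing (global) PM QoS knob. Writing a native-endian
  # 32-bit latency value in microseconds to /dev/cpu_dma_latency and keeping
  # the fd open prevents cpuidle from selecting idle states with a higher
  # exit latency; the request is dropped when the fd is closed.
  import os
  import struct
  import time

  LATENCY_US = 20  # illustrative bound; 0 would effectively disable deep C-states

  fd = os.open("/dev/cpu_dma_latency", os.O_WRONLY)
  try:
      os.write(fd, struct.pack("i", LATENCY_US))
      # While this is held, deep states (e.g. C10 with its ~890 us exit
      # latency) are off the table for the whole machine; that is exactly
      # the all-or-nothing tradeoff a finer-grained hint could avoid.
      time.sleep(60)
  finally:
      os.close(fd)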
Thanks for seeding this discussion Jesse!
w.r.t. Counting Events: I think it can be a bit challenging to figure out the true amount of time the system is executing vs sleeping from only the counters in /sys/devices/system/cpu/cpu<N>/cpuidle/*. We found (on Intel) that some simple PMCs can help with that, specifically the CPU_CLK_UNHALTED.REF counter, which counts the unhalted cycles of the CPU at the fixed TSC reference rate. Basically, you can instrument it around a region of code to figure out how long the CPU was processing instructions vs halted, and that effectively gives you the ratio of time it was sleeping. Note, you still don't know which sleep state it was in, but that's something I suppose you can tie in with the /sys counters.
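As a rough illustration of what the cpuidle sysfs counters alone give you, here's a small Python sketch that samples per-state residency over an interval. As noted above, this is software accounting rather than a hardware measurement, so the CPU_CLK_UNHALTED.REF PMC (perf's "ref-cycles") remains the better ground truth for executing-vs-halted time:

  # Sketch: sample /sys/devices/system/cpu/cpu<N>/cpuidle/state<M>/{name,time}
  # twice and report how much of the wall-clock interval was spent (nominally)
  # in each idle state. 'time' is cumulative microseconds in that state.
  import glob
  import os
  import time

  def snapshot(cpu: int) -> dict:
      """Return {state_name: cumulative_idle_usecs} for one CPU."""
      out = {}
      pattern = f"/sys/devices/system/cpu/cpu{cpu}/cpuidle/state*"
      for d in sorted(glob.glob(pattern), key=lambda p: int(p.rsplit("state", 1)[-1])):
          with open(os.path.join(d, "name")) as f:
              name = f.read().strip()
          with open(os.path.join(d, "time")) as f:
              out[name] = int(f.read())
      return out

  def idle_breakdown(cpu: int = 0, interval_s: float = 1.0) -> None:
      before = snapshot(cpu)
      t0 = time.monotonic()
      time.sleep(interval_s)
      elapsed_us = (time.monotonic() - t0) * 1e6
      after = snapshot(cpu)
      total_idle = 0
      for name, value in after.items():
          delta = value - before.get(name, 0)
          total_idle += delta
          print(f"cpu{cpu} {name:>8}: {delta:>9d} us ({delta / elapsed_us:6.1%})")
      print(f"cpu{cpu} busy (approx): {1 - total_idle / elapsed_us:.1%}")

  if __name__ == "__main__":
      idle_breakdown(cpu=0, interval_s=1.0)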
w.r.t. Benchmarks: I'm also curious, in general, how the common optimizations people have done to improve network performance affect power. For example, if we can support the same workload with fewer instructions, then that automatically means lower power consumption, right? Things that pop to mind: bypassing some of the kernel, replacing TCP with UDP, and the impact of having a dedicated polling thread reap packets for multiple workers (there might not be a way around this for very low-latency apps).
- Han
On 05/12/2023 17:50, Han Dong wrote:
w.r.t. Benchmarks: I'm also curious, in general, how the common optimizations people have done to improve network performance affect power. For example, if we can support the same workload with fewer instructions, then that automatically means lower power consumption, right? Things that pop to mind: bypassing some of the kernel, replacing TCP with UDP, and the impact of having a dedicated polling thread reap packets for multiple workers (there might not be a way around this for very low-latency apps).
These sorts of micro-optimizations would only work after you have solved thermal pressure. In my own testing, it seems like everything blows up (power-wise) once the server fans kick in to keep the processor under the thermal threshold.
Jesse Brandeburg jesse.brandeburg@intel.com writes:
If the DMA was asleep due to PC6+ state then the only storage is on the NIC FIFO, and the timelines are much shorter.
Another problem here can also be that the CPU is too fast for the traffic load :)
I.e., if the NIC is not running at 100% utilisation, as is very often the case, there are idle periods between packets (traffic is bursty). So even if the workload is "continuous" at the application level, the idle periods may be long enough that the CPU enters a sleep state deep enough that it can't wake up fast enough to process the next burst of packets.
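To make that concrete, here's a small Python sketch that reads each C-state's target residency and exit latency from cpuidle sysfs and guesses which state a given inter-burst idle gap could drive the CPU into (the 200 us gap is just an example value, and this is only a crude model of what the menu/teo governors actually do):

  # Sketch: for an expected idle gap between bursts, find the deepest C-state
  # whose target residency fits inside the gap; its exit latency is roughly
  # the wakeup penalty the first packet of the next burst will pay.
  import glob

  def read(path: str) -> str:
      with open(path) as f:
          return f.read().strip()

  def cstates(cpu: int = 0):
      base = f"/sys/devices/system/cpu/cpu{cpu}/cpuidle"
      for d in sorted(glob.glob(base + "/state*"),
                      key=lambda p: int(p.rsplit("state", 1)[-1])):
          yield {
              "name": read(d + "/name"),
              "exit_latency_us": int(read(d + "/latency")),
              "target_residency_us": int(read(d + "/residency")),
          }

  def likely_state(gap_us: float, cpu: int = 0):
      fitting = [s for s in cstates(cpu) if s["target_residency_us"] <= gap_us]
      return fitting[-1] if fitting else None

  if __name__ == "__main__":
      gap_us = 200.0  # example inter-burst idle gap
      s = likely_state(gap_us)
      if s:
          print(f"~{gap_us:.0f} us gap: governor may pick {s['name']}, so the next "
                f"burst can pay up to ~{s['exit_latency_us']} us of wakeup latency")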
-Toke
On Tue, Dec 5, 2023 at 1:58 PM Brandeburg, Jesse jesse.brandeburg@intel.com wrote:
Hi everyone,
(regarding the list, the home page is https://lists.netdevconf.info/postorius/lists/net-power.netdevconf.info/)
I think we don’t have a lot of subscribers yet to this list (hey Jamal you should subscribe!)
Already subscribed and pinged other people interested as well.
– invite your power-concerned friends and colleagues.
Toke and I were chatting offline about this problem of power management in networking.
We thought it might be a useful start to figure out a good set of benchmarks to demonstrate "power vs networking" problems. I have a couple in mind right away. One is "system is sleeping but I'm trying to run a latency-sensitive workload and the latency sucks." Two is "system is sleeping and my single-threaded bulk throughput benchmark (netperf/iperf2/neper/etc.) shows a lot of retransmits and/or receiver drops."
Our goal is to collect power use on the _whole system_ for given network-bound workloads, i.e. not just on the CPU side. If we can collect data on how much the NIC is drawing from the PCI bus as well, that would be a very useful breakdown. If you have a good understanding of your server, maybe that info can be derived by deduction (collect the power-bar draw and subtract what the CPU uses). Our use case is offloads: for example, if I can offload TLS onto a NIC that draws 45W from the PCI bus vs running the same infra workload on the host, which would cost 100W, then I can see a clear win for the offload case.
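For the "what the CPU uses" term of that subtraction, the RAPL counters under /sys/class/powercap are probably the easiest thing to script against (the NIC's own draw would still have to come from the PDU/BMC or an instrumented riser). A rough Python sketch, Intel RAPL package domains only:

  # Sketch: average CPU package power over an interval from powercap/RAPL.
  # energy_uj is cumulative microjoules (it wraps at max_energy_range_uj;
  # wrap handling is omitted here for brevity). Subtracting this from a
  # wall/PDU reading leaves the rest of the platform, NIC included.
  import glob
  import time

  def rapl_packages():
      for d in glob.glob("/sys/class/powercap/intel-rapl:*"):
          if d.rsplit("/", 1)[-1].count(":") == 1:  # intel-rapl:<pkg>, not sub-domains
              yield d

  def read_uj(domain: str) -> int:
      with open(domain + "/energy_uj") as f:
          return int(f.read())

  def package_power_w(interval_s: float = 5.0) -> float:
      domains = list(rapl_packages())
      before = [read_uj(d) for d in domains]
      time.sleep(interval_s)
      after = [read_uj(d) for d in domains]
      joules = sum(a - b for a, b in zip(after, before)) / 1e6
      return joules / interval_s

  if __name__ == "__main__":
      print(f"CPU package power: {package_power_w():.1f} W")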
cheers, jamal
Jamal Hadi Salim jhs@mojatatu.com writes:
Our goal is to collect power use on the _whole system_ for given network-bound workloads, i.e. not just on the CPU side. If we can collect data on how much the NIC is drawing from the PCI bus as well, that would be a very useful breakdown. If you have a good understanding of your server, maybe that info can be derived by deduction (collect the power-bar draw and subtract what the CPU uses). Our use case is offloads: for example, if I can offload TLS onto a NIC that draws 45W from the PCI bus vs running the same infra workload on the host, which would cost 100W, then I can see a clear win for the offload case.
I agree, but I don't think this is necessarily limited to offloads. In many cases, the "offload" is just another CPU core that happens to be sitting on the NIC instead of in the host. So absolutely, moving something into an offload can save power, but so can moving it across CPU cores (if it means that some of the now-idle cores can go to sleep). So we need a system for (self-)tuning that can take both into account.
-Toke
"Brandeburg, Jesse" jesse.brandeburg@intel.com writes:
Toke and I were chatting offline about this problem of power management in networking.
We thought it might be a useful start to figure out a good set of benchmarks to demonstrate "power vs networking" problems. I have a couple in mind right away. One is "system is sleeping but I'm trying to run a latency-sensitive workload and the latency sucks." Two is "system is sleeping and my single-threaded bulk throughput benchmark (netperf/iperf2/neper/etc.) shows a lot of retransmits and/or receiver drops."
Another thought is how do I count these events and / or notice I have a problem?
More thoughts on this from anyone?
Thank you for starting the on-list discussion. I'll add some high-level thoughts here and also reply to a couple of messages down-thread with some more specific comments.
When talking about benchmarking, the reason I mentioned that as a good starting point is that I believe having visibility into power usage is the only way we can make people actually use any tweaks we can come up with. Especially since there's a lot of cargo-culting involved in tuning (of the "use these settings for the best latency/throughput/whatever" variety), and having more precise measurements of the impact of settings is a way of combating that (and empowering people to make better assessments of the tradeoffs involved).
And secondly, of course, if we are actually trying to improve something, we need some baseline metrics to improve against. I'm thinking this can be approached from both "ends", i.e., "here is the cost tradeoff of various tuning parameters" that you mention, but also "here is the power consumption of workload X", which can then be a target for improvement.
Turning to areas for improvement, I can think of a couple of broad categories that seem promising to explore (some of which have already been mentioned down-thread):
- Smart task placement when scaling up/down (consolidating work on fewer cores to leave others idle enough that they can go to sleep).
- Forecasting the next packet arrival, and using this both to make smarter sleep state decisions and to do smarter batching (maybe we can defer waking up the userspace process if we expect another packet to arrive shortly, that sort of thing; see the sketch just after this list).
- General performance improvements in targeted areas (better performance should translate to less work done per packet, which means less power used, all other things being equal).
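On the batching point above, the closest per-device knobs that already exist are napi_defer_hard_irqs and gro_flush_timeout under /sys/class/net/<dev>/. A small Python sketch of what "defer hard IRQs and poll in bigger batches" looks like with them (device name and values are placeholders, not recommendations):

  # Sketch: enable software IRQ deferral so NAPI keeps repolling for up to
  # gro_flush_timeout nanoseconds instead of immediately re-arming the
  # hardware interrupt; i.e. trade a little latency for bigger batches and
  # fewer wakeups. Device name and values below are placeholders.
  from pathlib import Path

  def set_irq_deferral(dev: str, defer_irqs: int, flush_timeout_ns: int) -> None:
      base = Path("/sys/class/net") / dev
      (base / "napi_defer_hard_irqs").write_text(f"{defer_irqs}\n")
      (base / "gro_flush_timeout").write_text(f"{flush_timeout_ns}\n")

  if __name__ == "__main__":
      # e.g. allow up to 2 deferrals and flush/re-arm after at most 50 us
      set_irq_deferral("eth0", defer_irqs=2, flush_timeout_ns=50_000)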
Sorry if the above is a bit vague, but I'm hoping the brain dump can help spur some (more) discussion :)
-Toke
On 11/12/2023 12:45, Toke Høiland-Jørgensen wrote:
"Brandeburg, Jesse" jesse.brandeburg@intel.com writes:
Toke and I were chatting offline about this problem of power management in networking.
We thought it might be a useful start to figure out a good set of benchmarks to demonstrate "power vs networking" problems. I have a couple in mind right away. One is "system is sleeping but I'm trying to run a latency sensitive workload and the latency sucks" Two is "system is sleeping and my single-threaded bulk throughput benchmark (netperf/iperf2/neper/etc) shows a lot of retransmits and / or receiver drops"
Another thought is how do I count these events and / or notice I have a problem?
More thoughts on this from anyone?
Thank you for starting the on-list discussion. I'll add some high-level thoughts here and also reply to a couple of messages down-thread with some more specific comments.
When talking about benchmarking, the reason I mentioned that as a good starting point is that I believe having visibility into power usage is the only way we can make people actually use any tweaks we can come up with. Especially since there's a lot of cargo-culting involved in tuning (of the "use these settings for the best latency/throughput/whatever" variety), and having more precise measurements of the impact of settings is a way of combating that (and empowering people to make better assessments of the tradeoffs involved).
And secondly, of course, if we are actually trying to improve something, we need some baseline metrics to improve against. I'm thinking this can be approached from both "ends", i.e., "here is the cost tradeoff of various tuning parameters" that you mention, but also "here is the power consumption of workload X", which can then be a target for improvement.
Turning to areas for improvement, I can think of a couple of broad categories that seem promising to explore (some of which have already been mentioned down-thread):
Smart task placement when scaling up/down (consolidating work on fewer cores to leave others idle enough that they can go to sleep).
Forecasting the next packet arrival; and using this both so we can make smarter sleep state decisions, but also so we can do smarter batching (maybe we can defer waking up the userspace process if we expect another packet to arrive shortly, that sort of thing).
Wouldn't that require some sort of protocol integration?
- General performance improvements in targeted areas (better performance should translate to less work done per packet, which means less power used, all other things being equal).
One thing that Jamal and I saw was that this is not always the case. Surprising as it may seem, we saw the CPU power consumption usually being a constant[*] while throughput etc. varied. In TLS, for instance, AVX-512 acceleration using Intel's cryptoMB made the whole process more power efficient but not less power hungry, i.e. the same power consumption but more throughput than with AES-NI.
[*] To expand a little bit more: turbo boosting is very smart these days. It essentially aims for TDP (for Intel at least) all the time, dynamically scaling everything to reach it.
Pedro Tammela pctammela@mojatatu.com writes:
Wouldn't that require some sort of protocol integration?
Probably, yeah. In-kernel the TCP stack could provide hints in some cases (it knows the RTT and current bandwidth of the flow). For others, we could expose an API for userspace to provide hints. The interesting bit would be to find out whether this would work well enough in practice. My hope would be that it could be good enough that it would be feasible to run (more) systems with power saving features enabled without suffering losses and/or huge latency spikes, which would be a win :)
- General performance improvements in targeted areas (better performance should translate to less work done per packet, which means less power used, all other things being equal).
One thing that Jamal and I saw was that this is not always the case. Surprising as it may seem, we saw the CPU power consumption usually being a constant[*] while throughput etc. varied. In TLS, for instance, AVX-512 acceleration using Intel's cryptoMB made the whole process more power efficient but not less power hungry, i.e. the same power consumption but more throughput than with AES-NI.
[*] To expand a little bit more: turbo boosting is very smart these days. It essentially aims for TDP (for Intel at least) all the time, dynamically scaling everything to reach it.
Hmm, that's interesting. So, IIUC, this implies that performance improvements have to have a certain magnitude to be useful for saving power, right? I.e., saving a few % of CPU usage on one core is not enough, but if the improvement is enough that you can move the workload to fewer cores, it will help because you can bring some cores offline/to idle. Or am I misunderstanding what you mean?
-Toke
On 12/12/2023 10:40, Toke Høiland-Jørgensen wrote:
Probably, yeah. In-kernel the TCP stack could provide hints in some cases (it knows the RTT and current bandwidth of the flow).
Interesting; this sort of info could be integrated into the scheduler for power-aware scheduling on P/E-core processors.
Hmm, that's interesting. So, IIUC, this implies that performance improvements have to have a certain magnitude to be useful for saving power, right? I.e., saving a few % of CPU usage on one core is not enough, but if the improvement is enough that you can move the workload to fewer cores, it will help because you can bring some cores offline/to idle. Or am I misunderstanding what you mean?
Yes, exactly! Fewer cores also means less thermal pressure, which means fans spinning slower :) Or potentially a longer server lifetime / a cheaper server upgrade.
But when given more CPU room, applications might actually do more work! Take, for instance, TLS offload + zero copy: the CPU will only really be freed if the link/network stack is saturated.
I believe there are really two different goals here for networking: power saving vs. power efficiency.
On 12/12/2023 12:40, Pedro Tammela wrote:
Yes, exactly! Fewer cores also means less thermal pressure, which means fans spinning slower :) Or potentially a longer server lifetime / a cheaper server upgrade.
But when given more CPU room, applications might actually do more work! Take, for instance, TLS offload + zero copy: the CPU will only really be freed if the link/network stack is saturated.
I believe there are really two different goals here for networking: power saving vs. power efficiency.
I just remembered the QAT case on Intel processors. It's a ~12W coprocessor on the CPU die that beats a 56-core Sapphire Rapids (350W TDP) at compression/decompression. That would be the case where an optimization is so noticeable that the _power savings_ are measurable on the wall meter.
Pedro Tammela pctammela@mojatatu.com writes:
On 12/12/2023 10:40, Toke Høiland-Jørgensen wrote:
Pedro Tammela pctammela@mojatatu.com writes:
On 11/12/2023 12:45, Toke Høiland-Jørgensen wrote:
"Brandeburg, Jesse" jesse.brandeburg@intel.com writes:
Toke and I were chatting offline about this problem of power management in networking.
We thought it might be a useful start to figure out a good set of benchmarks to demonstrate "power vs networking" problems. I have a couple in mind right away. One is "system is sleeping but I'm trying to run a latency sensitive workload and the latency sucks" Two is "system is sleeping and my single-threaded bulk throughput benchmark (netperf/iperf2/neper/etc) shows a lot of retransmits and / or receiver drops"
Another thought is how do I count these events and / or notice I have a problem?
More thoughts on this from anyone?
Thank you for starting the on-list discussion. I'll add some high-level thoughts here and also reply to a couple of messages down-thread with some more specific comments.
When talking about benchmarking, the reason I mentioned that as a good starting point is that I believe having visibility into power usage is the only way we can make people actually use any tweaks we can come up with. Especially since there's a lot of cargo-culting involved in tuning (of the "use these settings for the best latency/throughput/whatever" variety), and having more precise measurements of the impact of settings is a way of combating that (and empowering people to make better assessments of the tradeoffs involved).
And secondly, of course, if we are actually trying to improve something, we need some baseline metrics to improve against. I'm thinking this can be approached from both "ends", i.e., "here is the cost tradeoff of various tuning parameters" that you mention, but also "here is the power consumption of workload X", which can then be a target for improvement.
Turning to areas for improvement, I can think of a couple of broad categories that seem promising to explore (some of which have already been mentioned down-thread):
Smart task placement when scaling up/down (consolidating work on fewer cores to leave others idle enough that they can go to sleep).
Forecasting the next packet arrival; and using this both so we can make smarter sleep state decisions, but also so we can do smarter batching (maybe we can defer waking up the userspace process if we expect another packet to arrive shortly, that sort of thing).
Wouldn't that require some sort of protocol integration?
Probably, yeah. In-kernel the TCP stack could provide hints in some cases (it knows the RTT and current bandwidth of the flow).
Interesting, this sort of info could be integrated into the scheduler for power aware scheduling in P/E processors.
Yeah, I expect there will end up being some interaction with the scheduler here at some point :)
For others, we could expose an API for userspace to provide hints. The interesting bit would be to find out whether this would work well enough in practice. My hope would be that it could be good enough that it would be feasible to run (more) systems with power saving features enabled without suffering losses and/or huge latency spikes, which would be a win :)
- General performance improvements in targeted areas (better performance should translate to less work done per packet, which means less power used, all other things being equal.
One thing that me and Jamal saw was that this is not always the case. Surprising as it may seem, we saw the CPU power consumption usually being a constant[*] while throughput etc varied. In TLS for instance, AVX512 acceleration using Intel's cryptoMB made the whole process more power efficient but not less power hungry, i. e. the same power consumption but more throughput over AES-NI.
[*] To expand a little bit more, turbo boosting is very smart these days. It essentially always aims for TDP (for Intel at least) all the time. So it dynamically scales everything to reach it.
Hmm, that's interesting. So, IIUC, this implies that performance improvements have to have a certain magnitude to be useful for saving power, right? I.e., saving a few % of CPU usage on one core is not enough, but if the improvement is enough that you can move the workload to fewer cores, it will help because you can bring some cores offline/to idle. Or am I misunderstanding what you mean?
Yes exactly! Fewer cores also means fewer thermal pressure which also means FANs spinning slower :) Or potentially a longer server lifetime/cheaper server upgrade.
But when given more CPU room, applications might actually do more work! Take, for instance, TLS offload + zero copy: the CPU will only really be freed if the link/network stack is saturated.
I believe there are really two different goals here for networking: power saving vs. power efficiency.
So this is mostly related to the amount of batching, isn't it? I.e., at high rates we are more efficient because we have more data arriving inside a single batch (NAPI poll) cycle, so we can amortise the processing cost over more packets.
If so, this implies that if we tune the batching threshold/interval, we can achieve (close to) the same efficiency even when the link is not busy, by simply deferring the processing. That's what I meant by "smarter batching" in my original list.
It would probably also need some hints from the stack and/or the application. For example, if the application had a way to inform the stack "I am only processing this TCP stream in batches of 100KB anyway, so please defer waking me up until you have a chunk of that size ready", that could be a win. Maybe this could even be complemented with an API to express "(maximum) acceptable wait time"?
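For what it's worth, a crude approximation of both hints already exists at the socket layer: SO_RCVLOWAT ("don't wake me for less than N bytes") plus a receive timeout as the "maximum acceptable wait". A Python sketch (the 100 KB matches the example above; the 50 ms bound is made up), with the caveat that this only defers the userspace wakeup, not the kernel-side packet processing:

  # Sketch: approximate the two hints with existing socket options.
  # SO_RCVLOWAT: don't consider the socket readable / don't complete recv()
  # until at least BATCH_BYTES are queued (defers the userspace wakeup only).
  # SO_RCVTIMEO then acts as a rough upper bound on how long we are willing
  # to wait for that batch. Numbers are illustrative only.
  import socket
  import struct

  BATCH_BYTES = 100 * 1024   # the 100 KB example from above
  MAX_WAIT_S = 0.05          # made-up "acceptable wait time"

  def make_batched_socket(host: str, port: int) -> socket.socket:
      sock = socket.create_connection((host, port))
      sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVLOWAT, BATCH_BYTES)
      # struct timeval: seconds, microseconds
      sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO,
                      struct.pack("ll", 0, int(MAX_WAIT_S * 1e6)))
      return sock

  # usage (hypothetical endpoint):
  #   sock = make_batched_socket("192.0.2.1", 9000)
  #   chunk = sock.recv(BATCH_BYTES)   # wakes on ~100 KB or after ~50 ms

Getting the kernel/NAPI side to batch in the same way is the part that would still need the sort of deferral and hinting discussed earlier in the thread.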
-Toke