If you ignore any marketing, this NVIDIA doc is a good read:
https://resources.nvidia.com/en-us-accelerated-networking-resource-library/n...
cheers, jamal
If you ignore any marketing, this NVIDIA doc is a good read: https://resources.nvidia.com/en-us-accelerated-networking-resource-library/n...
Wasting energy on another mailing list post to reduce energy (and time) waste from following the b0rked pointer:
https://resources.nvidia.com/en-us-accelerated-networking-resource-library/n...
Nice find, thanks for sharing, Jamal!
An interesting observation is that specialized silicon (here, the DPUs) generates significant savings (only) when the system is under high load. The devil's advocate would argue that most real-world systems spend most of their life in lower-load regimes, and that for such systems, the energy savings under high load must be traded off against the base overhead of keeping the special-purpose silicon "lit" even during the (probably dominant) lower-load times.
In particular, the white paper has some cost savings projections (Tables 1-3) from power saved by DPUs over three years for different applications - which is nice - but the assumption there seems to be that the servers are 100% utilized over all three years, which seems quite contrived, at the very least for the telco workloads.
So for workloads with typical time-of-day/seasonal variations, where you have to provision for maximum load, the savings might be significantly lower in practice - CPU frequency scaling seems to do a nice job at low utilization levels today. It may be hard to justify the investment in (and the presumably increased base power consumption of) DPUs.
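To put a rough number on that, here is a quick sketch of how a projected 3-year per-server saving shrinks with the fraction of time the server really spends in the high-load regime. All inputs are my own placeholder assumptions, not figures from the white paper:

# Illustrative only: how a per-server DPU saving scales with the high-load duty cycle.
# The 55 W "saving while loaded" is an assumption; USD 0.15/kWh is the figure from the thread.
HOURS_3Y = 24 * 365 * 3        # hours in the paper's 3-year horizon
PRICE_KWH = 0.15               # USD per kWh

SAVING_LOADED_W = 55           # assumed W saved per server while under high load
SAVING_IDLE_W = 0              # assume no saving (perhaps even a small penalty) at low load

for duty in (1.0, 0.5, 0.2, 0.05):   # fraction of the 3 years spent at high load
    avg_w = duty * SAVING_LOADED_W + (1 - duty) * SAVING_IDLE_W
    usd = avg_w * HOURS_3Y / 1000 * PRICE_KWH
    print(f"high load {duty:4.0%} of the time: ~USD {usd:6.0f} saved per server over 3 years")

At a 100% duty cycle you recover the headline number; at more realistic duty cycles the saving drops proportionally.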
That said, some applications are different (crypto mining comes to mind), and in some places the energy costs are/will be much higher than the USD 0.15/kWh. But still :-)
Hi Simon! Good timing reading your email, just getting started with my coffee here ;->
On Wed, Feb 14, 2024 at 6:55 AM Simon Leinen simon.leinen@switch.ch wrote:
If you ignore any marketing, this NVIDIA doc is a good read: https://resources.nvidia.com/en-us-accelerated-networking-resource-library/n...
Wasting energy on another mailing list post to reduce energy (and time) waste from following the b0rked pointer:
https://resources.nvidia.com/en-us-accelerated-networking-resource-library/n...
Nice find, thanks for sharing, Jamal!
An interesting observation is that specialized silicon (here, the DPUs) generates significant savings (only) when the system is under high load. The devil's advocate would argue that most real-world systems spend most of their life in lower-load regimes, and that for such systems, the energy savings under high load must be traded off against the base overhead of keeping the special-purpose silicon "lit" even during the (probably dominant) lower-load times.
It goes without saying that if you don't have an overloaded system (e.g., it is running under capacity), there is nothing to improve on.
In my reading of the paper, though, I see the case being made for DPUs hinging on one thing: if you are using X servers and they are running at low capacity, maybe you don't need X servers. Move the workloads into VMs/containers instead and squeeze all of that into X-Y servers. Now you have systems likely to be loaded over 50%, and DPUs make sense. The argument then builds on that: if you reduce the number of hosts, you reduce the amount of power consumed and, more importantly, you don't need to build the extra cooling infrastructure in your data centre that was needed to accommodate X hosts (which, according to the charts, is up to 40% of the power cost in data centres).
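To illustrate the consolidation math, a quick sketch - the fleet size, utilization targets, and the linear idle/busy power model are numbers I made up; only the ~40% cooling overhead echoes the WP's charts:

import math

servers = 100              # X servers today (hypothetical fleet)
avg_util = 0.15            # average utilization today
target_util = 0.60         # utilization after packing workloads into VMs/containers
idle_w, busy_w = 120, 350  # assumed per-server power at idle and at full load (linear model)
cooling = 0.40             # cooling as a fraction of IT power, per the WP's charts

after = math.ceil(servers * avg_util / target_util)      # the X-Y servers you keep
p_before = servers * (idle_w + (busy_w - idle_w) * avg_util)
p_after = after * (idle_w + (busy_w - idle_w) * target_util)

print(f"servers: {servers} -> {after}")
print(f"IT power: {p_before/1e3:.1f} kW -> {p_after/1e3:.1f} kW")
print(f"incl. cooling: {p_before*(1+cooling)/1e3:.1f} kW -> {p_after*(1+cooling)/1e3:.1f} kW")

The bulk of the saving comes from the hosts you no longer run at all, which is why the argument only works if you actually consolidate.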
Cloud vendors do this - but it doesn't seem like anyone else does. The motivation for cloud vendors is clear: host CPU cycles bring in $ from customers. It sounds like enterprise/telco types mostly plan around "why do I care, I will replace my server in 3 years with the latest and greatest CPU Big Intel provides me". It's analogous to the QoS counter-argument - the answer to congestion is buying more bandwidth, which is then underutilized...

Unfortunately we (engineers) often ignore the operational challenges and think squeezing the cycles is the answer to everything - which often requires above-average skills. The WP is certainly influenced by engineering philosophy more than operational perspective. IOW, it is also possible the motivation for these enterprises/telcos is that they don't have the in-house skills to manage and operate "compressing workloads into hosts", and more hardware is "reasonably" cheaper than hiring geniuses... they wouldn't be using Kubernetes if they really cared about power (or performance) ;->
There is another argument for DPUs (which is not being made in the WP) that I have seen - I can't remember which paper, but it was MS making that argument on why they offload, and I think I have seen some P4 folks from Google repeat the view: instead of refreshing your servers every 3 years, keep them longer and offload more as the newer workloads get more intense.
In particular, the white paper has some cost savings projections (Tables 1-3) from power saved by DPUs over three years for different applications - which is nice - but the assumption there seems to be that the servers are 100% utilized over all three years, which seems quite contrived, at the very least for the telco workloads.
So for workloads with typical time-of-day/seasonal variations, where you have to provision for maximum load, the savings might be significantly lower in practice - CPU frequency scaling seems to do a nice job at low utilization levels today. It may be hard to justify the investment in (and the presumably increased base power consumption of) DPUs.
The white paper was honest (hard to do for marketing people ;->) in showing that a commodity feature like "CPU micro-sleep and frequency scaling" is a great way to get power savings. I am going to guess Ericsson wanted that in there ;->
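If you want to see what that commodity machinery is doing on a given Linux host, the cpufreq governor and the cpuidle ("micro-sleep") states are visible in sysfs. A small sketch - the paths below are the standard ones on recent kernels, adjust for your box:

from pathlib import Path

cpu0 = Path("/sys/devices/system/cpu/cpu0")

# Current frequency-scaling governor (e.g. "schedutil", "powersave", "performance")
print("governor:", (cpu0 / "cpufreq/scaling_governor").read_text().strip())

# Available idle states and how long this CPU has spent resident in each
for state in sorted((cpu0 / "cpuidle").glob("state*")):
    name = (state / "name").read_text().strip()
    resident_us = (state / "time").read_text().strip()   # microseconds in this state
    print(f"{state.name}: {name:10s} time={resident_us} us")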
I have a different perspective on what you said about DPUs increasing the power base. Consider a 2x25G BF2, which draws ~45W from the PCIe Gen3 slot. _Under load_ there is a clear win in the experiments we conducted: a high-enough load with ACLs and TLS offloaded would cost almost ~100W if it ran on the host instead. The operative term is "under load".
You could argue that even a plain non-smart NIC on PCIe Gen3 would still consume 45W just to turn on. You can play with PCI registers to lower the power consumption of the non-smart NIC, but that comes at the cost of increased latency and other issues (I don't remember whether Jesse mentioned mucking with ASPM in PCIe). OTOH, I have seen that once you go past PCIe Gen3, these xPUs provide an extra cable you connect to the motherboard to draw extra power (very similar to GPUs). When asked, vendors say "it's just insurance in case we need more power on overload" - it's hard to judge whether that's the truth.
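For the curious, the kernel-wide ASPM policy and a NIC's runtime-PM setting are easy to peek at on Linux. A sketch - the PCI bus address below is a placeholder, substitute your NIC's:

from pathlib import Path

# Kernel-wide ASPM policy; the bracketed entry is the active one,
# e.g. "default [performance] powersave powersupersave"
print(Path("/sys/module/pcie_aspm/parameters/policy").read_text().strip())

# Runtime power management for a given PCI(e) device:
# "on" = never runtime-suspend, "auto" = allow it.
nic = Path("/sys/bus/pci/devices/0000:01:00.0")   # placeholder address
print("runtime PM:", (nic / "power/control").read_text().strip())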
BTW: one strange thing we observed is that CPU power management seems to be "work conserving". We totally shut down more than half the CPUs, and when we run a high enough load (>90% CPU) on the remaining CPUs, the machine draws as much power as if all the CPUs were on! Maybe someone has thoughts on this...
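For anyone who wants to poke at this, the rough recipe is: offline half the CPUs via sysfs, run the load, and compare package power from the RAPL counters against the power-bar reading. A sketch (needs root; the RAPL zone path is the usual Intel one and may differ on your system, and the CPU range is just an example):

import time
from pathlib import Path

def set_cpu_online(cpu: int, online: bool) -> None:
    # cpu0 usually cannot be offlined
    Path(f"/sys/devices/system/cpu/cpu{cpu}/online").write_text("1" if online else "0")

def rapl_package_watts(seconds: float = 5.0) -> float:
    # energy_uj is a cumulative microjoule counter for the package RAPL zone
    zone = Path("/sys/class/powercap/intel-rapl:0/energy_uj")
    e0 = int(zone.read_text())
    time.sleep(seconds)
    e1 = int(zone.read_text())
    return (e1 - e0) / 1e6 / seconds   # counter wraps occasionally; ignored here

# Example: offline CPUs 8..15, then sample package power while the benchmark runs.
for cpu in range(8, 16):
    set_cpu_online(cpu, False)
print(f"package power: {rapl_package_watts():.1f} W")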
That said, some applications are different (crypto mining comes to mind), and in some places the energy costs are/will be much higher than the USD 0.15/kWh. But still :-)
I am looking at those "savings" shown and I am scratching my head over whether the savings, compared to the cost of the hardware, are not just a drop in the bucket. Does anybody care if it is USD 0.15/kWh? Of course we care from an environmental POV, but who else does? One thing the paper doesn't consider is the cost of operations. Humans don't like the inconvenience of change. Quoting USD 0.15/kWh is not going to move things - the pain has to be a lot bigger than that. If you could make it operationally easy for people to save, then it would also be easier to change their behavior. Or if you make both power and hardware very expensive...

The AI craze will help. 800G NICs are available today - and, according to Broadcom, NICs over 1Tbps are coming soon. There is no way hosts today can keep up, other than for a very slim number of use cases (mostly bulk, latency-insensitive workloads where you can reduce the PPS into the host using various tricks like GSO/GRO etc). So xPUs have a role to play despite all that...
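The PPS arithmetic behind that last point, for reference - line rate and packet size are the only inputs, framing overheads are ignored, and the ~64KB figure is a typical GRO aggregate rather than anything from a spec:

# Packets per second a host would have to absorb at various line rates.
for gbps in (100, 400, 800):
    for payload_bytes, label in ((1500, "MTU-sized"), (64 * 1024, "~64KB GRO aggregate")):
        pps = gbps * 1e9 / 8 / payload_bytes
        print(f"{gbps:4d}G, {label:20s}: ~{pps/1e6:6.1f} Mpps")

At 800G and MTU-sized packets that is on the order of 65+ Mpps into a single host, which is where the offload story gets hard to argue against.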
cheers, jamal
Simon.
Jamal Hadi Salim writes:
I have a different perspective on what you said about DPUs increasing the power base. Consider a 2x25G BF2, which draws ~45W from the PCIe Gen3 slot. _Under load_ there is a clear win in the experiments we conducted: a high-enough load with ACLs and TLS offloaded would cost almost ~100W if it ran on the host instead. The operative term is "under load".
You could argue that even a plain non-smart NIC on PCIe Gen3 would still consume 45W just to turn on.
Well, you could *claim* that, but it seems unrealistic. Looking at datasheets, I see 20.8W MAX for a 2x25GE Intel E810 card, and 11.1W/12.9W for a Broadcom 957414A4142CC-DS 2x25GE card under 100% traffic load (though the datasheet reads as a bit dubious in that it only talks about a single DAC cable/SFP28 transceiver...).
So I claim that there *is* an increased base energy cost that you pay for those smart NICs compared with less smart ones (apologies to the Intel and Broadcom NICs, you're also smart! Just not *that* smart ;-).
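A quick way to see what that base-cost gap means: take the ~24W difference between the BF2 figure Jamal quoted and the E810 datasheet maximum, and ask how much of the time the DPU has to be delivering its high-load saving just to pay for its own keep. The 55W saving under load is the delta from Jamal's 100W-vs-45W numbers; treat all of this as back-of-envelope:

HOURS_3Y = 24 * 365 * 3
PRICE_KWH = 0.15

extra_base_w = 45 - 20.8        # smart NIC vs. ordinary 2x25GE NIC, always on
saving_loaded_w = 100 - 45      # host W saved while the offload is actually busy

base_cost = extra_base_w * HOURS_3Y / 1000 * PRICE_KWH
print(f"extra base draw over 3 years: ~USD {base_cost:.0f}")

breakeven = extra_base_w / saving_loaded_w
print(f"break-even duty cycle: ~{breakeven:.0%} of the time under high load")

With these particular numbers the DPU only comes out ahead if it spends well over a third of its life actually loaded.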
You can play with PCI registers to lower the power consumption of the non-smart NIC, but that comes at the cost of increased latency and other issues (I don't remember whether Jesse mentioned mucking with ASPM in PCIe).
OTOH, I have seen that once you go past PCIe Gen3, these xPUs provide an extra cable you connect to the motherboard to draw extra power (very similar to GPUs). When asked, vendors say "it's just insurance in case we need more power on overload" - it's hard to judge whether that's the truth.
Hmmmm...
Hi Simon,
On Wed, Feb 21, 2024 at 8:48 AM Simon Leinen simon.leinen@switch.ch wrote:
Jamal Hadi Salim writes:
I have a different perspective on what you said about DPUs increasing the power base. Consider a 2x25G BF2, which draws ~45W from the PCIe Gen3 slot. _Under load_ there is a clear win in the experiments we conducted: a high-enough load with ACLs and TLS offloaded would cost almost ~100W if it ran on the host instead. The operative term is "under load".
You could argue that even a plain non-smart NIC on PCIe Gen3 would still consume 45W just to turn on.
Well, you could *claim* that, but it seems unrealistic. Looking at datasheets, I see 20.8W MAX for a 2x25GE Intel E810 card, and 11.1W/12.9W for a Broadcom 957414A4142CC-DS 2x25GE card under 100% traffic load (though the datasheet reads as a bit dubious in that it only talks about a single DAC cable/SFP28 transceiver...).
So I claim that there *is* an increased base energy cost that you pay for those smart NICs compared with less smart ones (apologies to the Intel and Broadcom NICs, you're also smart! Just not *that* smart ;-).
You make a good point. The cabling contribution (even for DAC) is never factored in fairly in those marketing^Wproduct brochures. There is still an argument to be made for xPUs: they don't need fans, which is a big cost. I think our 100W vs 45W measurement included the fans, because we sample at the power bar.
To summarize:
Paper claims: you benefit when your cluster is in constant churn at >50% of capacity.
Simon: nobody is ever running at >50%, 24/7.
cheers, jamal
You can play with PCI registers to lower the power consumption of the non-smart NIC, but that comes at the cost of increased latency and other issues (I don't remember whether Jesse mentioned mucking with ASPM in PCIe).
OTOH, I have seen that once you go past PCIe Gen3, these xPUs provide an extra cable you connect to the motherboard to draw extra power (very similar to GPUs). When asked, vendors say "it's just insurance in case we need more power on overload" - it's hard to judge whether that's the truth.
Hmmmm...
Simon.