
Tackling the Thermal Challenge of 600W+ Devices in High-Density Computing Systems

June 02, 2025


By: Braden Cooper, Director of Products at OSS

When the PCI-SIG formally added support for 675W add-in card devices in the PCI Express Card Electromechanical (CEM) specification in August 2023, NVIDIA's most powerful CEM GPU, the H100 80GB, had a maximum power consumption of 350W. While some devices were already pushing the limits of datacenter thermodynamics, high-density systems of many 675W devices seemed like a distant reality. However, with power constraints uncapped and demand for higher-performing GPUs skyrocketing, the industry quickly produced devices taking full advantage of the new specification. NVIDIA soon superseded the H100 80GB with the H100 NVL, raising power density to 400W. While that small jump was manageable for existing installations, NVIDIA then went all-in with the H200 NVL, released in late 2024 at 600W. The rapid transition from 350W to 600W has put power and cooling technologies in the spotlight in a race to solve this next-generation challenge.

In some scale-out servers, the jump to 600W is still feasible. In servers with one or two GPUs, the extra 250W per device, up to 500W of additional power and heat, may already be within the operating margin of the system. However, datacenter scaling often comes down to rack density and efficiency: how many GPUs can you fit per rack unit? In systems that support 8 GPUs, an increase of 2kW per system is almost certain to overload existing power and cooling infrastructure, and system integrators pushing the boundaries with 16-way GPU expansion systems must now address an additional power and heat load of up to 4kW compared to their existing installations. In response, some market studies project that the datacenter liquid cooling market will grow to $17B by 2032. However, while the world builds out liquid cooling infrastructure and the technology matures, air-cooled systems will remain a key component of early 675W add-in card market adoption.

For air-cooling, several thermodynamic principles govern this challenge. For example, Fourier's Law describes how heat is conducted through materials, Newton's Law of Cooling describes the relationship between temperature differentials and their impact on heat dissipation, and Bernoulli's Equation helps us understand airflow impedance in a chassis. A key tradeoff in the thermal design of computer systems is the relationship between airflow velocity and chassis impedance: higher airflow velocity is needed for more effective cooling, but restrictive enclosures create pressure drops that impede airflow and derate fan performance. To address these challenges, mechanical engineers rely on computational fluid dynamics (CFD), optimized heatsink designs, high-conductivity materials, and well-placed baffling to direct airflow efficiently.
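A useful first check in any air-cooling design follows directly from the sensible-heat relation Q = ṁ·cp·ΔT: for a given heat load and allowable air temperature rise, how much air must actually move through the chassis? The short Python sketch below runs this back-of-envelope calculation; the air properties and the 15 K temperature rise are illustrative assumptions, not figures from any particular product.

```python
# Back-of-envelope airflow estimate: how much air must pass over a 600W
# device to carry its heat away, from Q = m_dot * cp * dT (sensible heating).
RHO_AIR = 1.184     # kg/m^3, air density at ~25 C (assumed inlet condition)
CP_AIR = 1005.0     # J/(kg*K), specific heat of air

def required_airflow_cfm(power_w, delta_t_k):
    """Volumetric airflow (CFM) needed so exhaust air is delta_t_k hotter than inlet."""
    m3_per_s = power_w / (RHO_AIR * CP_AIR * delta_t_k)
    return m3_per_s * 2118.88  # convert m^3/s to cubic feet per minute

# A single 600W card with an assumed 15 K allowable air temperature rise:
print(round(required_airflow_cfm(600, 15), 1))  # ~71 CFM for that one card alone
```

Multiply by eight or sixteen GPUs, plus CPUs and NICs, and the total quickly exceeds what commodity chassis fans deliver against real back-pressure, which is why the design tactics below matter.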

 

Key Thermal Design Considerations

1. CFD Thermal Modeling
CFD analysis is an invaluable tool in modern thermal design, allowing engineers to simulate airflow patterns, temperature distribution, and heat dissipation alongside physical prototyping to build an iterative simulation model. By modeling the internal heat loads and external environmental conditions, the simulation identifies hotspots and airflow bottlenecks early in the design process.


Figure 1. CFD Study of OSS 2U Short-depth Server Torrey

For example, in a server with several 600W devices, CFD simulations will identify whether the design has positive thermal margin or whether improvements are required. By validating the simulation model against a physical unit, engineers anchor their analytical model in real data and can iterate rapidly. This allows them to try different baffling designs, cable routing, perforation patterns, and more without waiting for manufacturing cycles to produce prototypes.

2. Heatsink Optimization

In power-dense air-cooled compute systems, heatsinks play a crucial role in transferring heat away from high-power components into the convective cooling medium: air. Heatsinks are critical not only on the 600W add-in card devices but also on other key heat sources such as CPUs, NICs, and even memory modules. Heatsink design must maximize surface area while maintaining efficient airflow across the fins. Several factors influence heatsink performance, including:

- Fin density and geometry: More fins increase surface area but can also restrict airflow if not properly spaced.

- Manufacturing techniques: Skived and extruded heatsinks offer cost-effective solutions, while vapor chamber heatsinks enhance heat spreading across large surfaces at the cost of fragility.
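The fin-density tradeoff can be made concrete with a deliberately simplified model: surface area grows with fin count, but once the gaps between fins narrow, airflow chokes and the convection coefficient falls. Everything below, the geometry, the open-channel heat transfer coefficient, and the linear derating under a 3 mm gap, is an assumed toy parameterization for illustration, not a validated heatsink correlation.

```python
# Toy model of the fin-density tradeoff: more fins add area, tighter gaps cut h.
# All constants are illustrative assumptions, not a real design correlation.

def heatsink_resistance(n_fins, width_m=0.1, height_m=0.04, length_m=0.1,
                        fin_thickness_m=0.001, h_open=60.0):
    """Convection-only thermal resistance (K/W) of a parallel-plate fin array."""
    gap = (width_m - n_fins * fin_thickness_m) / (n_fins - 1)
    if gap <= 0:
        return float("inf")  # fins have merged; no airflow path remains
    area = 2 * n_fins * height_m * length_m       # both faces of every fin
    h = h_open * min(1.0, gap / 0.003)            # assume h derates below ~3 mm gaps
    return 1.0 / (h * area)

# Sweep fin count: resistance falls as area grows, then rises as gaps choke.
best = min(range(5, 90), key=heatsink_resistance)
print(best, round(heatsink_resistance(best), 4))
```

The sweep shows the characteristic U-shape: an interior optimum, not "as many fins as possible", which is exactly why CFD and empirical validation are used to place it precisely.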

3. Material Thermal Conductivity

Material selection is another crucial factor in chassis thermal management, especially the material of the heatsink and the corresponding thermal interface material. The combination of thermal conductivities forms a thermal circuit that determines the overall efficiency of moving heat from the source to the cooling medium.

Common heatsink material choices include:

- Aluminum: Lightweight and cost-effective with decent thermal conductivity.

- Copper: Excellent heat conductor but heavy and expensive.

- Advanced composites: Graphite-based materials can improve thermal performance while maintaining lightweight properties.

Thermal interface material choices include:

- Thermal pastes: Good thermal conductivity, but performance depends heavily on application quality.

- Thermal pads: Easy to apply but lower thermal conductivity.

- Phase change materials: No pump-out effect, but typically more expensive.

- Advanced materials: New research into innovative thermal interface solutions may change industry dynamics in the coming years.
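The "thermal circuit" idea can be sketched numerically: conduction and interface stages behave like resistors in series, so junction temperature is inlet air temperature plus power times total resistance. The resistance values below are rough assumed figures, chosen only to show how much the TIM choice alone can matter at 600W.

```python
# Series thermal circuit: heat flows die -> TIM -> heatsink -> air,
# and each stage adds resistance in K/W. Values are assumed for illustration.

def junction_temp(power_w, t_air_c, resistances_k_per_w):
    """Steady-state junction temperature for a series thermal path."""
    return t_air_c + power_w * sum(resistances_k_per_w)

R_DIE_TO_CASE = 0.02   # K/W, internal conduction (assumed)
R_TIM_PASTE = 0.01     # K/W, thin, well-applied paste layer (assumed)
R_TIM_PAD = 0.05       # K/W, typical gap pad (assumed)
R_SINK_TO_AIR = 0.06   # K/W, heatsink convection at rated airflow (assumed)

# At 600W, swapping paste for a pad moves the junction by 600 * 0.04 = 24 K:
print(round(junction_temp(600, 35, [R_DIE_TO_CASE, R_TIM_PASTE, R_SINK_TO_AIR]), 1))  # 89.0
print(round(junction_temp(600, 35, [R_DIE_TO_CASE, R_TIM_PAD, R_SINK_TO_AIR]), 1))    # 113.0
```

At these power levels every hundredth of a K/W in the stack is worth several degrees at the die, which is why interface materials get as much attention as the heatsinks themselves.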

4. Baffling, Chassis Impedance, and Airflow Velocity

At its most fundamental, effective air-cooling is a matter of getting the coldest air possible moving across the largest-surface-area heatsinks at the highest velocity. There are, of course, other factors at play, but increasing the amount of air moving across a heatsink or decreasing its temperature are surefire ways to improve a system's thermal performance.

To increase air velocity, there are a few options available:

- Bigger fans: More powerful fans can push or pull more volumetric airflow through a system.

- Ducting: Using ducting or air baffles to direct flow not only increases air velocity (the nozzle effect) but also improves volumetric airflow across the area of interest, preventing air from being wasted on non-heat-generating regions.

- Reducing chassis impedance: The airflow a fan actually delivers drops as chassis resistance rises. By reducing chassis clutter from things like cabling, or by improving inlet/outlet perforation, fans can operate at a higher CFM, resulting in improved thermal performance.
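The fan/impedance interaction above can be illustrated with a classic operating-point calculation: a fan's deliverable pressure falls as flow rises (its P-Q curve), while the chassis demands more pressure at higher flow (its impedance curve), and the system settles where the two curves cross. The parabolic curve shapes and the constants below are simplified assumptions, not data for any particular fan or chassis.

```python
import math

# Operating point: fan pressure P_max * (1 - (Q/Q_max)^2) equals chassis
# demand k * Q^2. Solving for Q gives the delivered airflow. Curve shapes
# and all constants are simplified assumptions for illustration.

def operating_cfm(p_max_pa, q_max_cfm, impedance_k):
    """Airflow where the assumed fan curve intersects the chassis impedance curve."""
    return math.sqrt(p_max_pa / (impedance_k + p_max_pa / q_max_cfm**2))

# The same hypothetical fan (800 Pa stall, 200 CFM free-air) in two chassis:
print(round(operating_cfm(800, 200, 0.05), 1))  # cluttered chassis: higher k, less flow
print(round(operating_cfm(800, 200, 0.02), 1))  # decluttered chassis: lower k, more flow
```

Note that neither case reaches the fan's 200 CFM free-air rating: datasheet CFM is an upper bound, and lowering chassis impedance is often cheaper than buying a bigger fan.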

 

Figure 2. Linear Air Velocity Simulation Cut-Plot of OSS Rigel Edge Supercomputer


Bringing It All Together

The new generation of PCIe add-in card devices brings a new wave of thermal design challenges. As hyperscalers look to improve GPU density by putting more devices in less rack space, thermodynamics becomes a hard-to-overcome reality for system integrators. While liquid cooling is definitively the long-term solution, advanced design tactics can prolong the effectiveness of air-cooling, including CFD studies, heatsink design, advanced material selection, and chassis airflow optimization. Air-cooling dense compute systems filled with 600W devices was once a matter of picking the biggest fan possible, but it has now become a sophisticated mechanical engineering challenge in its own right. Those looking to integrate these ultra-dense compute systems should look for chassis designers with the expertise to integrate innovative cooling solutions - as the industry is not getting any cooler.

 

 






