
Tackling the Thermal Challenge of 600W+ Devices in High-Density Computing Systems

June 02, 2025


By: Braden Cooper, Director of Products at OSS

When the PCI-SIG formally added support for 675W add-in cards to the PCI Express Card Electromechanical (CEM) specification in August 2023, NVIDIA's most powerful CEM GPU, the NVIDIA H100 80GB, had a maximum power consumption of 350W. While some devices were starting to push the limits of datacenter thermodynamics, high-density systems packed with 675W devices seemed like a distant reality. However, with power constraints uncapped and demand for higher-performing GPUs skyrocketing, the industry quickly introduced devices that take full advantage of the new specification. NVIDIA soon replaced the H100 80GB with the H100 NVL, raising power consumption to 400W. While this small jump was manageable for existing installations, NVIDIA then went all-in with the H200 NVL, released in late 2024 at 600W. The rapid transition from 350W to 600W has put power and cooling technologies in the spotlight in a race to solve this next-generation challenge.

In some scale-out servers, the jump to 600W is still feasible. In servers with one or two GPUs, the additional 250-500W of power and heat may already be within the operating margin of the system. However, datacenter scaling often comes down to rack density and efficiency: how many GPUs can you fit per rack unit? In systems that support 8 GPUs, an increase of 2kW per system is almost certain to overload existing power and cooling infrastructure, and system integrators pushing the boundaries with 16-way GPU expansion systems must now absorb an additional power and heat load of up to 4kW compared to their existing installations. In response, some market studies project that the datacenter liquid cooling market will grow to $17B by 2032. However, while the world builds out liquid cooling infrastructure and the technology matures, air-cooled systems will remain a key component of early 675W add-in card market adoption.

For air-cooling, several thermodynamic principles govern this challenge. For example, Fourier's Law describes how heat is conducted through materials, Newton's Law of Cooling relates temperature differentials to the rate of heat dissipation, and Bernoulli's Equation helps us understand airflow impedance in a chassis. A key tradeoff in the thermal design of computer systems is the relationship between airflow velocity and chassis impedance: higher airflow velocity is needed for more effective cooling, but restrictive enclosures create pressure drops that impede airflow and derate fan performance. To address these challenges, mechanical engineers rely on computational fluid dynamics (CFD), optimized heatsink designs, high-conductivity materials, and well-placed baffling to direct airflow efficiently.
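To make these relationships concrete, the short Python sketch below estimates two first-order quantities: the bulk air temperature rise through a chassis from an energy balance (Q = m_dot * cp * dT), and a heatsink surface temperature from Newton's Law of Cooling (Q = h * A * (Ts - Tair)). The power levels, airflow, heat transfer coefficient, and fin area are illustrative assumptions, not measured values.

```python
# First-order air-cooling estimates (all values illustrative, not measured).

RHO_AIR = 1.16   # kg/m^3, air density at roughly 30 C
CP_AIR = 1007.0  # J/(kg*K), specific heat of air

def air_temp_rise(power_w: float, cfm: float) -> float:
    """Bulk air temperature rise across a chassis: Q = m_dot * cp * dT."""
    m3_per_s = cfm * 0.000471947   # convert CFM to m^3/s
    m_dot = RHO_AIR * m3_per_s     # mass flow rate, kg/s
    return power_w / (m_dot * CP_AIR)

def heatsink_surface_temp(power_w: float, h: float, area_m2: float,
                          t_air_c: float) -> float:
    """Newton's Law of Cooling, Q = h * A * (Ts - Tair), solved for Ts."""
    return t_air_c + power_w / (h * area_m2)

if __name__ == "__main__":
    # Hypothetical 8-GPU system: 8 x 600 W plus ~1.2 kW of CPU/NIC/memory load.
    total_w = 8 * 600 + 1200
    print(f"Air dT at 400 CFM: {air_temp_rise(total_w, 400):.1f} C")
    # One 600 W card with an assumed h = 80 W/(m^2*K) and 0.5 m^2 of fin area.
    print(f"Heatsink surface: {heatsink_surface_temp(600, 80, 0.5, 35):.1f} C")
```

Even this crude balance shows why a 2-4kW increase per system matters: at fixed airflow, every added kilowatt raises the bulk air temperature rise and erodes the margin available to every downstream component.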

 

Key Thermal Design Considerations

1. CFD Thermal Modeling
CFD analysis is an invaluable tool in modern thermal design, allowing engineers to simulate airflow patterns, temperature distribution, and heat dissipation in parallel with physical prototyping to build an iterative simulation model. By modeling the internal heat loads and external environmental conditions, the simulation identifies hotspots and airflow bottlenecks early in the design process.


Figure 1. CFD Study of OSS 2U Short-depth Server Torrey

For example, in a server with several 600W devices, CFD simulations will show whether the design has positive thermal margin or requires improvement. By validating the simulation model against physical measurements, engineers can rapidly develop solutions grounded in real data within their analytical model. This allows engineers to try different baffling designs, cable routing, perforation patterns, and more without waiting for manufacturing cycles to build prototypes.
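Commercial CFD tools solve the coupled airflow and energy equations over detailed 3D models of the chassis. As a deliberately simplified illustration of the iterative field-solve idea behind them, the Python sketch below relaxes a 2D steady-state conduction field around a fixed-temperature hot region standing in for a high-power device; the grid size, device temperature, and boundary temperatures are all invented for illustration.

```python
import numpy as np

# Toy 2D steady-state conduction solver (Jacobi relaxation of the Laplace
# equation). Real CFD couples airflow and heat transport on 3D meshes; this
# only illustrates the iterative solve that underlies such tools.

N = 60
T = np.full((N, N), 35.0)                # assumed 35 C inlet-air baseline
hot = (slice(25, 35), slice(25, 35))     # region standing in for a 600 W device

for _ in range(3000):
    # Jacobi update: each interior cell moves toward the mean of its neighbors.
    T[1:-1, 1:-1] = 0.25 * (T[:-2, 1:-1] + T[2:, 1:-1]
                            + T[1:-1, :-2] + T[1:-1, 2:])
    T[hot] = 90.0                                    # fixed device temperature (assumed)
    T[:, 0] = T[:, -1] = T[0, :] = T[-1, :] = 35.0   # cooled boundaries

# Probe a "neighboring component" location to judge thermal margin.
print(f"Temperature a few cells from the device edge: {T[37, 30]:.1f} C")
```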

2. Heatsink Optimization

In power-dense air-cooled compute systems, heatsinks play a crucial role in transferring heat away from high-power components into the convective cooling medium (air). Heatsinks are critical on the primary heat sources, including the 600W add-in card devices, as well as other key heat sources such as the CPU, NICs, or even memory modules. Heatsink design must maximize surface area while maintaining efficient airflow across the fins. Several factors influence heatsink performance, including:

• Fin density and geometry: More fins increase surface area but can also restrict airflow if not properly spaced (a simple tradeoff sketch follows this list).

• Manufacturing techniques: Skived and extruded heatsinks offer cost-effective solutions, while vapor chamber heatsinks enhance heat spreading across large surfaces at the cost of fragility.
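The fin-density tradeoff can be sketched numerically: total fin surface area grows linearly with fin count, but the channels between fins narrow until airflow, and with it the effective heat transfer coefficient, collapses. The geometry and the crude flow-derating assumption below are illustrative only, not a validated correlation.

```python
# Fin-count tradeoff sketch: more fins add area but narrow the air channels.
# The derating model (h falls linearly once the gap drops below a reference
# value) is a crude illustrative assumption, not a validated correlation.

WIDTH_MM, FIN_T_MM = 100.0, 1.2    # heatsink width and fin thickness (assumed)
FIN_H_MM, FIN_L_MM = 40.0, 120.0   # fin height and length (assumed)
H_BASE = 60.0                      # W/(m^2*K) at the reference gap (assumed)
GAP_REF_MM = 3.0                   # channel gap at which H_BASE applies

def thermal_resistance(n_fins: int) -> float:
    """Approximate sink-to-air thermal resistance (K/W) for a fin count."""
    gap_mm = (WIDTH_MM - n_fins * FIN_T_MM) / (n_fins - 1)
    if gap_mm <= 0.5:
        return float("inf")        # channels effectively closed off
    area_m2 = n_fins * 2 * (FIN_H_MM / 1000) * (FIN_L_MM / 1000)
    h = H_BASE * min(1.0, gap_mm / GAP_REF_MM)   # derate h in narrow channels
    return 1.0 / (h * area_m2)

for n in (10, 20, 30, 40, 50):
    print(f"{n:2d} fins -> {thermal_resistance(n):.3f} K/W")
```

Running this shows resistance falling as fins are added, bottoming out, and then rising again once the channels choke; real designs locate that optimum with CFD and empirical testing rather than a linear derating guess.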

3. Material Thermal Conductivity

Material selection is another crucial factor in chassis thermal management, especially for the heatsink and its corresponding thermal interface material (TIM). The stack of thermal conductivities forms a series thermal circuit, and the total resistance of that circuit sets the overall efficiency of moving heat from the source to the cooling medium.
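To make that thermal circuit concrete: each conduction layer contributes a resistance R = t / (k * A), thickness over conductivity times contact area, and the source-to-air temperature rise is the device power multiplied by the summed resistance. The dimensions, conductivities, and fin-to-air resistance in the Python sketch below are representative assumptions, not data for any specific device.

```python
# Series thermal-circuit sketch: T_rise = P * (R_tim + R_base + R_fins_to_air).
# Each conduction layer: R = thickness / (k * area). All values are
# representative assumptions, not data for any specific device.

AREA = 0.0016   # m^2 contact area (assumed 40 mm x 40 mm)

def layer_r(thickness_m: float, k_w_mk: float) -> float:
    """Conduction resistance of one layer (1-D Fourier's Law), K/W."""
    return thickness_m / (k_w_mk * AREA)

r_tim = layer_r(0.0001, 5.0)        # 100 um of thermal paste, k ~ 5 W/(m*K)
r_cu_base = layer_r(0.005, 390.0)   # 5 mm copper base, k ~ 390 W/(m*K)
r_al_base = layer_r(0.005, 205.0)   # same base in aluminum, k ~ 205 W/(m*K)
r_fins = 0.045                      # fin-to-air resistance, K/W (assumed)

power = 600.0
for name, r_base in (("copper", r_cu_base), ("aluminum", r_al_base)):
    total = r_tim + r_base + r_fins
    print(f"{name:8s} base: {total:.4f} K/W -> {power * total:.1f} C rise at 600 W")
```

Note that even in this toy stack the TIM layer rivals the metal base, which is why interface material choice matters as much as the heatsink metal itself.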

Common heatsink material choices include:

• Aluminum: Lightweight and cost-effective with decent thermal conductivity.

• Copper: Excellent heat conductor, but heavy and expensive.

• Advanced composites: Graphite-based materials can improve thermal performance while maintaining lightweight properties.

Thermal interface material choices include:

• Thermal pastes: Good thermal conductivity, but performance depends heavily on application quality and consistency.

• Thermal pads: Easy to apply, but lower thermal conductivity.

• Phase change materials: No pump-out effect, but typically more expensive.

• Advanced materials: New research into innovative thermal interface solutions may change industry dynamics in the coming years.

4. Baffling, Chassis Impedance, and Airflow Velocity

At its most fundamental, effective air-cooling is a matter of getting the coldest air possible moving across the largest heatsink surface area at the highest velocity. There are of course other factors at play, but increasing the amount of air moving across a heatsink or decreasing its inlet temperature are surefire ways to improve a system's thermal performance.

To increase air velocity, there are a few options available:

• Bigger fans: More powerful fans can push/pull more volumetric airflow through a system.

• Ducting: Using ducting or air baffles to direct flow will not only increase the velocity of the air (nozzle effect) but also improve the volumetric airflow across the area of interest, preventing airflow from being wasted on regions that generate no heat.

• Reducing chassis impedance: The airflow a fan actually delivers depends on the flow resistance of the chassis. By reducing chassis clutter from things like cabling, or by improving inlet/outlet perforation, fans can operate at a higher CFM, resulting in improved thermal performance (see the operating-point sketch below).
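This fan-versus-impedance interaction can be sketched as finding the operating point where the fan's pressure-flow (P-Q) curve crosses the chassis impedance curve dP = k * Q^2. The linear fan curve and impedance constants below are invented for illustration; real values come from fan datasheets and from measured or simulated chassis impedance.

```python
# Fan operating-point sketch: delivered airflow sits where the fan's P-Q
# curve crosses the chassis impedance curve dP = k * Q^2. The fan curve and
# k values are illustrative assumptions, not datasheet numbers.

P_MAX, Q_MAX = 600.0, 500.0   # Pa at zero flow, CFM at zero pressure (assumed)

def fan_pressure(q_cfm: float) -> float:
    """Assumed linear fan curve from (0, P_MAX) to (Q_MAX, 0)."""
    return P_MAX * (1.0 - q_cfm / Q_MAX)

def operating_point(k: float) -> float:
    """Bisect for the flow where fan pressure equals chassis pressure drop."""
    lo, hi = 0.0, Q_MAX
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if fan_pressure(mid) > k * mid * mid:   # fan still overcomes impedance
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for k, label in ((0.008, "cluttered chassis"),
                 (0.004, "tidied cabling"),
                 (0.002, "opened perforation")):
    print(f"{label:18s} (k = {k}): ~{operating_point(k):.0f} CFM")
```

With these made-up numbers, halving the impedance constant moves the operating point from roughly 209 to 265 CFM, and halving it again reaches about 325 CFM, which is exactly the mechanism the last bullet above describes.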

 

Figure 2. Linear Air Velocity Simulation Cut-Plot of OSS Rigel Edge Supercomputer


Bringing It All Together

The new generation of PCIe add-in card devices brings a new wave of thermal design challenges. As hyperscalers look to improve GPU density by fitting more devices into less rack space, thermodynamics becomes a hard-to-overcome reality for system integrators. While liquid cooling is the definitive long-term solution, advanced design tactics, including CFD studies, heatsink design, advanced material selection, and chassis airflow optimization, can prolong the effectiveness of air-cooling. Air-cooling dense compute systems filled with 600W devices was once a matter of picking the biggest fan possible, but it has become a sophisticated mechanical engineering challenge in its own right. Those looking to integrate these ultra-dense compute systems should look for chassis designers with the expertise to integrate innovative cooling solutions, as the industry is not getting any cooler.

 

 
