The basics of liquid cooling in AI data centers

Blog • September 16, 2024

Introduction

Liquid cooling is increasingly vital in AI data centers as traditional methods struggle to manage the intense heat. This blog provides a brief overview, some key facts, and a glossary of essential terms to help you understand how liquid cooling dissipates the heat generated by AI workloads.

In the evolving landscape of data centers and high-performance computing infrastructure for AI applications, cooling efficiency has become a key focus for engineers aiming to optimize both performance and energy use. With increasing demands from AI, machine learning, and multi-core processors, traditional air-cooling methods are hitting their limits.

In parallel, approaches such as Vertical Power Delivery (VPD) are being adopted to minimize power delivery network (PDN) losses, using power modules that can be tailored to specific processor and server configurations. VPD’s low-profile design complements direct-to-chip liquid cooling, which is essential at AI server power levels. This approach contrasts with the conventional lateral placement of DC/DC converters, which are typically designed for forced air cooling and a minimal footprint.

Figure 1: Vertical Power Delivery to a processor


Forced air cooling is still common at lower power densities but has significant limitations, such as requiring large heatsinks and managing hot exhaust air, which can adversely impact nearby components. Placing processors and heatsinks near the air exit helps but limits the flexibility of board designs.

In contrast, liquid cooling provides several advantages, including more efficient heat transport, smaller system size, and lower energy consumption and operating costs. It also eliminates the need for fans, which are comparatively unreliable, although a centralized heat exchanger is still necessary. While liquid cooling requires a higher upfront investment and may reduce system availability due to its centralized nature, it significantly enhances cooling efficiency.

A hybrid liquid cooling system, known as Direct-to-Chip cooling, transfers heat from the processor to a cold plate with liquid (typically water) channels. This single-phase system provides substantial heat removal improvements over forced air cooling. Further gains can be achieved with fluorocarbon-based liquids in a two-phase system, which offers about 100 times better heat absorption due to the latent heat of evaporation. Although the two-phase approach is more expensive, the fluid is less damaging if it leaks, which makes it a viable option despite the cost.

The evolution of direct liquid cooling has focused on two main technologies: microchannel and microconvective cooling. Microchannel cooling evenly spreads heat across the surface but struggles with higher power chipsets, leading to designs with tighter channels and higher filtration requirements, which impact data center operations. Conversely, microconvective cooling, or microjet impingement, targets specific hotspots on processors, offering lower thermal resistance and avoiding the pressure issues associated with microchannels, making it more effective for high-power applications.
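
As a rough illustration of why thermal resistance matters when comparing cold plate designs, the Python sketch below estimates steady-state case temperature from T_case = T_inlet + P × R_th. The function name, the 1000 W processor power, the 30 °C coolant inlet temperature, and both thermal resistance values are hypothetical placeholders chosen for round numbers. They are not measured figures for microchannel or microconvective cold plates.

```python
def case_temperature_c(power_w: float, r_th_k_per_w: float, coolant_inlet_c: float) -> float:
    """Steady-state case temperature: T_case = T_inlet + P * R_th."""
    if power_w < 0 or r_th_k_per_w < 0:
        raise ValueError("power and thermal resistance must be non-negative")
    return coolant_inlet_c + power_w * r_th_k_per_w


# Hypothetical operating point (assumptions, not vendor data).
PROCESSOR_POWER_W = 1000.0
COOLANT_INLET_C = 30.0

# Two placeholder cold plates that differ only in thermal resistance (K/W).
cold_plates = {"plate_a": 0.04, "plate_b": 0.02}

for name, r_th in cold_plates.items():
    t_case = case_temperature_c(PROCESSOR_POWER_W, r_th, COOLANT_INLET_C)
    print(f"{name}: R_th = {r_th} K/W -> case temperature of about {t_case:.1f} C")
```

Under these assumed numbers, halving the thermal resistance halves the temperature rise above the coolant, which is why lower-resistance cold plates leave more margin at high processor power.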

Immersion cooling, where the entire system is submerged in dielectric liquid, is another option. It offers excellent cooling efficiency, but concerns about environmental impact, leakage, and the system being a single point of failure limit its adoption. Because the fluid has a higher dielectric constant than air, it also raises stray capacitance, which could affect high-frequency signals.
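
To give a feel for the stray-capacitance effect, here is a minimal Python sketch based on the standard relationship that the capacitance between two conductors scales with the relative permittivity of the medium around them. It assumes the electric field lies entirely in the fluid, so the capacitance is an upper bound and the impedance a lower bound for a real board, where part of the field stays in the PCB laminate. The function names, the 1 pF and 100 Ω starting values, and the relative permittivity of 2.0 are hypothetical placeholders, not properties of any specific coolant.

```python
import math


def immersed_capacitance_pf(c_air_pf: float, eps_r: float) -> float:
    """Upper-bound capacitance when air around the conductors is replaced by fluid."""
    if eps_r < 1.0:
        raise ValueError("relative permittivity must be at least 1")
    return c_air_pf * eps_r


def immersed_impedance_ohm(z_air_ohm: float, eps_r: float) -> float:
    """Lower-bound characteristic impedance of a line fully surrounded by the fluid."""
    if eps_r < 1.0:
        raise ValueError("relative permittivity must be at least 1")
    return z_air_ohm / math.sqrt(eps_r)


# Hypothetical example values (assumptions for illustration only).
C_AIR_PF = 1.0    # stray capacitance measured in air
Z_AIR_OHM = 100.0 # impedance of an air-surrounded conductor pair
EPS_R = 2.0       # placeholder relative permittivity of the fluid

print(f"capacitance: {C_AIR_PF} pF -> up to {immersed_capacitance_pf(C_AIR_PF, EPS_R):.2f} pF")
print(f"impedance:   {Z_AIR_OHM} ohm -> down to {immersed_impedance_ohm(Z_AIR_OHM, EPS_R):.1f} ohm")
```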

Fast facts on liquid cooling

Last year, the U.S. Department of Energy allocated $40 million to support innovative data center cooling technologies. These projects aim to push the boundaries of energy efficiency and sustainability in data centers. (Source: https://www.energy.gov/article...)


Traditional data centers manage around 12 kW per rack, but AI data centers are seeing a dramatic increase, with current ultra-high-density racks consuming 85 kW per cabinet. Projections suggest that this could rise to between 200 kW and 250 kW per rack as AI workloads become more demanding. (Source: https://spectra.mhi.com/insigh...)

Greater rack density may mean bigger data capacity, but it also means higher energy use and extra heat. Data centers operate optimally between 21 and 24 degrees Celsius, so any increase in rack density must be accompanied by improved cooling. (Source: https://dck-resources.datacent...)

Cooling systems use 25-40% of the energy in data centers. As rack densities rise, the design of cooling systems will play an increasingly pivotal role in maintaining overall efficiency. (Source: https://www.datacenterfrontier...)
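
To put the rack power figures quoted above into cooling terms, the Python sketch below applies the single-phase heat balance Q = ṁ × c_p × ΔT to estimate the water flow a rack would need. It is a simplified estimate under stated assumptions: all rack heat is taken up by the water loop, the coolant warms by 10 K across the rack, and water properties are approximate room-temperature values. The function name and the 10 K temperature rise are illustrative choices, not figures from the cited sources.

```python
# Approximate properties of water near room temperature (assumed values).
WATER_DENSITY_KG_PER_L = 0.997
WATER_CP_J_PER_KG_K = 4180.0


def water_flow_l_per_min(heat_load_kw: float, delta_t_k: float) -> float:
    """Water flow needed to absorb heat_load_kw with a coolant temperature rise of delta_t_k."""
    if delta_t_k <= 0:
        raise ValueError("coolant temperature rise must be positive")
    mass_flow_kg_per_s = heat_load_kw * 1000.0 / (WATER_CP_J_PER_KG_K * delta_t_k)
    return mass_flow_kg_per_s / WATER_DENSITY_KG_PER_L * 60.0


ASSUMED_DELTA_T_K = 10.0  # illustrative temperature rise across the rack

# Rack power levels mentioned in the text: traditional, current AI, projected AI.
for rack_kw in (12, 85, 200, 250):
    flow = water_flow_l_per_min(rack_kw, ASSUMED_DELTA_T_K)
    print(f"{rack_kw:>3} kW rack: about {flow:.0f} L/min of water at a {ASSUMED_DELTA_T_K:.0f} K rise")
```

With these assumptions an 85 kW rack needs roughly 120 L/min. The required flow scales linearly with rack power, so the projected 200 kW to 250 kW racks would need two to three times as much unless the allowed temperature rise is increased.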

Figure 2: Cooling System Survey 2024 by Uptime Institute

Glossary

Boiling Point: The temperature at which a liquid becomes vapor; critical for phase change in two-phase cooling.

Cold Plate: A metal heat exchanger mounted directly onto components such as CPUs and GPUs, and the core of a liquid cooling system. Coolant flows through channels within the plate, absorbing heat and carrying it away from the components.

Condenser: Component where vaporized coolant releases heat and condenses back into a liquid.

Coolant: The liquid used to absorb and transfer heat away from components, commonly water, glycol mixtures, or dielectric fluids.

Dielectric Fluid: A non-conductive coolant that prevents electrical shorts and corrosion.

Flow Rate: The volume of coolant moving through a system, typically measured in liters per minute.

Heat Exchanger: A device that transfers heat from the coolant to another medium.

Heat Sink: A passive device that dissipates heat by spreading it over a larger surface area.

Liquid Loop: A closed circuit for coolant circulation in a cooling system.

Manifolds: Distribution hubs for the coolant within the rack. Manifolds manage the flow of coolant to and from the cold plates, ensuring an even distribution.

Quick Disconnects: Specialized connectors that allow for easy and rapid connection or disconnection of the coolant lines.

Thermal Resistance: A measure of a material’s ability to resist heat flow; lower resistance indicates better heat transfer.

Thermal Interface Material (TIM): Material used to enhance thermal conductivity between a chip and a heat sink or cold plate.

Pump: Device for circulating coolant through the system.

Vertical Power Delivery: A method to minimize power losses by positioning power modules directly above or below the processor.
