


Hotter Hardware: Rack Densities Test Data Center Cooling Strategies


Rack density changes may be the hottest thing in the data center. Literally. A closer look at the rise of computing power, and the cooling strategies that may provide solutions.


David Chernicoff Intel Corp



Servers immersed in liquid coolant in an immersion cooling research lab at Intel.



It’s unlikely to come as a surprise to anyone that rack densities in data centers have been increasing. The increase in computing power for the average server, along with the concurrent increase in power demand for CPUs and GPUs, has been pretty easy to track. But two recent changes to the IT workloads commonly found in data centers have driven the demand for power and cooling sharply higher.



The first has been a long time developing, giving data centers the opportunity to grow resources for power and cooling to accommodate it. That is the effective commoditization of high-performance computing (HPC) resources. While HPC was once almost strictly the domain of supercomputers and scientific computing tasks, the relatively inexpensive ability to build clusters of high-performance server CPUs and GPUs has moved the bar. Now businesses can invest either in the necessary hardware or in cloud HPC services to address far more common use cases: effectively any use case that requires processing large volumes of data, executing complex simulations, or simply solving problems more quickly and efficiently.



The steady growth in demand for HPC services has led to, among other things, the development of dedicated HPC data centers. These facilities were already building to the high end of the power density spectrum, and they are leading the way on deployments capable of supporting 50 kW per rack, more than an order of magnitude more power than was found in a typical data center less than a decade ago.



While data centers have had time to adapt to the growth in demand for HPC services, the second change has ramped up incredibly quickly. Seemingly out of nowhere, generative AI is being heralded as the solution to every computing task and problem. And the processes and workflows that make up AI services, such as training, machine learning, and inference engines, are compute intensive and power hungry. And the power demands and accompanying cooling needs are increasing exceptionally quickly.



As an example, Nvidia, in its announcement of its latest AI GPU system, the DGX H100, projected maximum power consumption of 10.2 kW, 160% of that of the previous-generation A100. And the H100 server boards are available in 4- and 8-GPU configurations so that vendors can build their own clusters. This means that high-density racks, which were once naively considered to be anything over 16 kW, can easily be configured in the 40 kW and up range.
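To make that arithmetic concrete, here is a minimal sketch of how many 10.2 kW systems fit within a given rack power budget. The function name and the sample budgets are illustrative assumptions; only the 10.2 kW figure comes from the announcement cited above, and a real deployment would also have to budget for networking, storage, and power-delivery headroom.

```python
import math

# Projected maximum draw per DGX H100 system, as cited in the article.
DGX_H100_MAX_KW = 10.2


def systems_per_rack(rack_budget_kw: float, system_kw: float = DGX_H100_MAX_KW) -> int:
    """Return how many whole systems fit within a rack power budget.

    Simplification (assumption): ignores switches, storage, and safety headroom.
    """
    if rack_budget_kw < 0 or system_kw <= 0:
        raise ValueError("rack budget must be >= 0 and system power must be > 0")
    return math.floor(rack_budget_kw / system_kw)


if __name__ == "__main__":
    # Sample budgets: the old 16 kW "high-density" threshold and two denser racks.
    for budget_kw in (16, 40, 50):
        count = systems_per_rack(budget_kw)
        print(f"{budget_kw} kW rack: {count} system(s), "
              f"{count * DGX_H100_MAX_KW:.1f} kW at maximum draw")
```

Under these assumptions a 16 kW rack holds a single system, while four systems already call for slightly more than 40 kW.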


Don’t Sweat the Small Stuff



Delivering this kind of power is one thing, and the makers of PDUs (power distribution units) are finding appropriate solutions, provided a data center has the necessary power available (with power availability being an issue unto itself). Cooling this kind of density is a completely different story, and the leading options are liquid cooling solutions.



Fortunately, there are a number of solutions already available that allow users to cool anything from a single component to a system, rack, row, hall, or entire data center, depending upon the needs of the customer. Many of these solutions are quite mature, though they are only recently seeing broader acceptance.



But as this week’s announcement of the acquisition of liquid cooling leader CoolIT Systems shows, there are high expectations that the market for cooling high-density rack systems will continue to grow. Increasing rack density is the only practical solution to data center real estate needs, especially for AI services. The high energy demands of AI GPU/CPU solutions mean that it is more practical and cost effective to deploy one rack capable of 50 kW, or even greater power density, than multiple 15 kW – 20 kW racks providing the same compute. The power requirements remain the same, the cooling costs could potentially be lower than those of installing multiple systems to cool that same amount of power spread out over a larger area, and the supporting hardware (racks, PDUs, CDUs, etc.) could potentially drop in cost as well, resulting in overall gains in efficiency with the higher density deployments.
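The consolidation argument comes down to simple division. The sketch below counts the racks needed to host a fixed IT load at different per-rack capacities. The 300 kW total load is a hypothetical figure chosen for illustration; the per-rack capacities are the ones discussed above.

```python
import math


def racks_needed(total_load_kw: float, rack_capacity_kw: float) -> int:
    """Return the number of whole racks required to host a given IT load."""
    if total_load_kw < 0 or rack_capacity_kw <= 0:
        raise ValueError("load must be >= 0 and rack capacity must be > 0")
    return math.ceil(total_load_kw / rack_capacity_kw)


if __name__ == "__main__":
    TOTAL_LOAD_KW = 300  # hypothetical AI cluster load (assumption, not from the article)
    for capacity_kw in (15, 20, 50):
        count = racks_needed(TOTAL_LOAD_KW, capacity_kw)
        print(f"{capacity_kw} kW racks: {count} needed for {TOTAL_LOAD_KW} kW")
```

The total power is identical in every case; what changes is the number of rack positions, and with it the floor space and the count of supporting racks, PDUs, and cooling units.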



High-density data center specialist Colovore provides a good example of a data center built to support high-density racks and their related workloads. All of the racks in their existing facility support 35 kW, and racks can use rear-door liquid-cooled heat exchangers to provide suitable cooling. Cooling vendors generally rate rear-door heat exchangers as being suitable for up to 40 kW of power, making them one of the simplest ways to cool your high-density racks.



But this is just the beginning; when we spoke to Colovore last year about their building a new facility, they told us their plans included offering direct liquid cooling technologies in the new data center that will allow them to support densities as high as 250 kW per rack.
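As a rough illustration of how these density tiers line up, the sketch below maps a rack's power draw to a cooling approach. The 40 kW rear-door limit and the 250 kW direct liquid cooling ceiling are the figures quoted in this article; the 12 kW air-cooling limit is an assumption, and the function and its thresholds are hypothetical rather than any vendor's sizing guidance.

```python
def suggest_cooling(
    rack_kw: float,
    air_limit_kw: float = 12.0,    # assumption: typical ceiling for an air-cooled rack position
    rdhx_limit_kw: float = 40.0,   # vendor rating for rear-door heat exchangers, per the article
    dlc_limit_kw: float = 250.0,   # planned direct liquid cooling density, per the article
) -> str:
    """Return a coarse cooling tier for a rack's power draw (illustrative only)."""
    if rack_kw <= 0:
        raise ValueError("rack_kw must be positive")
    if rack_kw <= air_limit_kw:
        return "air cooling"
    if rack_kw <= rdhx_limit_kw:
        return "rear-door heat exchanger"
    if rack_kw <= dlc_limit_kw:
        return "direct liquid cooling"
    return "beyond the densities discussed here"


if __name__ == "__main__":
    for kw in (10, 35, 120, 300):
        print(f"{kw} kW rack -> {suggest_cooling(kw)}")
```

Actual selection depends on facility water availability, airflow design, and the hardware itself, so the thresholds are exposed as parameters rather than fixed values.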


If You Offer It, They Will Use It



Once 200 kW-plus rack densities are available in data centers, you can be sure that customers will be lining up to make use of them, especially in areas where new data center space and power are tightly constrained. With each new generation of AI- and HPC-specific hardware seeing a significant increase in capabilities along with a 30%-60% increase in power consumption (based on past trends), these extreme-density racks will become commonplace for those technologies.
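To see how quickly a 30%-60% increase per generation compounds, here is a minimal sketch that projects per-system power forward. It uses the 10.2 kW DGX H100 figure cited earlier as the baseline and assumes the past trend simply continues, so it is an extrapolation for illustration and not a forecast. The function name is hypothetical.

```python
def projected_power_kw(base_kw: float, growth_per_gen: float, generations: int) -> float:
    """Compound a per-generation growth rate over a number of hardware generations."""
    if generations < 0:
        raise ValueError("generations must be >= 0")
    return base_kw * (1 + growth_per_gen) ** generations


if __name__ == "__main__":
    BASE_KW = 10.2  # DGX H100 projected maximum, as cited in the article
    for gen in range(1, 4):
        low = projected_power_kw(BASE_KW, 0.30, gen)
        high = projected_power_kw(BASE_KW, 0.60, gen)
        print(f"generation +{gen}: {low:.1f} kW to {high:.1f} kW per system")
```

At the upper end of that range, per-system power roughly quadruples within three generations, which is why rack-level figures in the hundreds of kilowatts stop looking extreme.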



Racks in the 15 kW-30 kW range will replace today’s commonplace 10 kW racks, if only for reasons of more economical use of space and cooling locations. The industry is faced with the need to increase rack densities while taking advantage of any economies that can be derived from utilizing high- and extreme-density racking.



Fortunately, for most data centers this is not an all-or-nothing proposition. For example, rear-door heat exchangers (RDHX) can be retrofitted to existing server racks. An RDHX has the benefit of being a passive cooling solution. While liquid, most commonly water, does get circulated from the RDHX to an external heat exchanger or cooling tower, the IT workload equipment remains untouched.



And because they can be retrofitted on a per-rack basis, the technology scales simply. It can be applied to racks that will require additional cooling as their IT workload changes or as new technologies, such as a rack of AI servers, are deployed, allowing for much higher rack densities, where necessary, than traditional data center air-to-air cooling. This also means that rack locations originally designed to max out at 10 kW – 12 kW can now effectively be used to support 20 kW – 30 kW workloads, providing a major increase in possible rack densities and greater flexibility in workload placement within a data center.



And for data centers designed to be able to increase available power, as many recent facilities have done, the addition of RDHX to increase rack density means more efficient use of the power being delivered to the facility.


Flexible Cooling Tech Can Also Boost Rack Density



Once a decision is made to adopt liquid cooling technology to allow for higher density racks, there is a broad range of potential solutions, almost all of which can be tailored to meet the needs of specific applications.



While a solution such as an RDHX provides cooling to a rack of equipment, your needs might be such that a full immersion system, from a vendor like Iceotope or GR Cooling, is the right choice to cool your on-premises HPC solution, while your existing cooling solutions are adequate for your standard IT workloads.



Or perhaps you’ve made the decision to deploy a flexible liquid cooling infrastructure within your existing data center so that you can more efficiently utilize the space. Now you have the option, in addition to RDHX, to deploy cold plate cooling to hit specific hot spots that you’ve determined can be problem areas. By explicitly cooling CPUs, GPUs, memory, or entire blades, you can effectively control the heat being distributed by specific systems in your rack environments, allowing for more efficient cooling across your data center as a whole. Individual servers and racks can get tailored, liquid-cooled solutions in the existing environment with little to no impact on other hardware in the environment.


It's Not If, But When



Your current data centers may never see the need for extreme-density racks or even high-density levels of power demand, but the long-term advantages of higher densities, more efficient cooling, hotter operations, and more flexible solutions will manifest themselves in more cost-effective data centers that can demonstrate better ROI and OPEX over the projected life of the facilities.



Your new data centers will be built with these capabilities as part of the design. It only makes sense to build for the future and have as many ways available as possible to deliver the most efficient, sustainable, effective operation. How much you invest in your existing facilities to improve their performance and operational effectiveness is a different story and will, in most situations, need to be justified on a case-by-case basis, but doing it right will likely show a direct impact on your business model and your bottom line.


