Hotter Hardware: Rack Densities Test Data Center Cooling Strategies


 

Translator's Note

Supply and demand is an eternal theme. With the sudden arrival of generative AI (which is, without doubt, only the beginning), the power-density gap facing traditional data centers keeps widening, and revolutionary upgrades to existing facilities and processes have become imperative. Cooling and efficiency are the biggest constraints on power density at this stage. This article's overview of cooling options at rack densities of 40-50 kW makes it a valuable reference.

 

 

Rack density changes may be the hottest thing in the data center. Literally. A closer look at the rise of computing power, and the cooling strategies that may provide solutions.

 

David Chernicoff | Intel Corp

 

Servers immersed in liquid coolant in an immersion cooling research lab at Intel.

 

It's unlikely to come as a surprise to anyone that rack densities in data centers have been increasing. The increase in computing power for the average server, along with the concurrent increase in power demand for CPUs and GPUs, has been fairly easy to track. But two recent changes to the IT workloads commonly found in data centers have driven the demand for power and cooling exponentially higher.

 

The first has been a long time developing, giving data centers the opportunity to grow their power and cooling resources to accommodate it: the effective commoditization of high-performance computing (HPC). While HPC was once almost strictly the domain of supercomputers and scientific computing, the relatively inexpensive ability to build clusters of high-performance server CPUs and GPUs has moved the bar. Businesses can now invest in the necessary hardware, or in cloud HPC services, to address far more common use cases: effectively, any use case that requires processing large volumes of data, executing complex simulations, or simply solving problems more quickly and efficiently.

 

The steady growth in demand for HPC services has led to, among other things, the development of dedicated HPC data centers. These facilities were already building to the high end of the power-density spectrum and have led the way on rack deployments capable of supporting 50 kW per rack, more than an order of magnitude more power than was found in a typical data center less than a decade ago.

 

While data centers have had time to adapt to the growth in demand for HPC services, the second change has ramped up incredibly quickly. Seemingly out of nowhere, generative AI is being heralded as the solution to every computing task and problem. The processes and workflows that make up AI services, such as training, machine learning, and inference, are compute-intensive and power-hungry, and the accompanying power and cooling demands are increasing exceptionally quickly.

 

As an example, Nvidia's announcement of its latest AI GPU system, the DGX H100, projects a maximum power consumption of 10.2 kW, 160% of the previous-generation A100 system. The H100 server boards are also available in 4- and 8-GPU configurations so that vendors can build their own clusters. This means that high-density racks, once naively considered to be anything over 16 kW, can easily be configured in the 40 kW and up range.
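To make that arithmetic concrete, the short sketch below shows how quickly per-rack power climbs when multi-GPU AI systems are stacked in a single rack. Only the 10.2 kW figure comes from the article; the system counts per rack and the overhead allowance are assumptions for illustration.

```python
# Illustrative rack-density arithmetic; system counts and the overhead
# allowance are assumptions, only the 10.2 kW figure is cited above.
DGX_H100_KW = 10.2   # projected maximum power of one DGX H100 system
OVERHEAD_KW = 1.0    # assumed per-rack allowance for switches, fans, PDU losses

for systems_per_rack in (1, 2, 3, 4):
    rack_kw = systems_per_rack * DGX_H100_KW + OVERHEAD_KW
    print(f"{systems_per_rack} x DGX H100 -> ~{rack_kw:.1f} kW per rack")

# 1 x DGX H100 -> ~11.2 kW per rack
# 2 x DGX H100 -> ~21.4 kW per rack
# 3 x DGX H100 -> ~31.6 kW per rack
# 4 x DGX H100 -> ~41.8 kW per rack
```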

 

Don't Sweat the Small Stuff

 

Delivering this kind of power is one thing, and the makers of PDUs (power distribution units) are finding appropriate solutions, provided a data center has the necessary power available (power availability being an issue unto itself). Cooling this kind of density is a completely different story, and the leading options are liquid cooling solutions.

 

Fortunately, there are a number of solutions already available that allow users to cool anything from a single component to a system, rack, row, hall, or an entire data center, depending on the needs of the customer. Many of these solutions are quite mature, though they have only recently seen broader acceptance.

 

But as this week's announcement of the acquisition of liquid cooling leader CoolIT Systems shows, there are high expectations that the market for cooling high-density rack systems will continue to grow. Increasing rack density is the only practical answer to data center real estate needs, especially for AI services. The high energy demands of AI GPU/CPU solutions mean that it is more practical and cost-effective to deploy one rack capable of 50 kW or more than multiple 15 kW - 20 kW racks providing the same compute. The power requirement remains the same, the cooling costs can potentially be lower than installing multiple systems to cool that same amount of power spread over a larger area, and the supporting hardware (racks, PDUs, CDUs, etc.) can potentially drop in cost as well, resulting in overall efficiency gains for higher-density deployments.
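As a rough illustration of that consolidation argument, the sketch below compares how many racks, rack positions, and supporting PDUs and door or row coolers are needed to house the same total IT load at different per-rack densities. The 600 kW total load and the density points are assumptions chosen purely for illustration.

```python
import math

# Hypothetical consolidation comparison: housing the same total IT load at
# different per-rack densities. The 600 kW figure is an assumption.
TOTAL_IT_LOAD_KW = 600

def racks_needed(total_kw: float, kw_per_rack: float) -> int:
    """Smallest whole number of racks that can carry the load."""
    return math.ceil(total_kw / kw_per_rack)

for density in (15, 20, 50):
    n = racks_needed(TOTAL_IT_LOAD_KW, density)
    print(f"{density:>2} kW/rack -> {n:>2} racks, rack positions, PDUs and door coolers")

# 15 kW/rack -> 40 racks, rack positions, PDUs and door coolers
# 20 kW/rack -> 30 racks, rack positions, PDUs and door coolers
# 50 kW/rack -> 12 racks, rack positions, PDUs and door coolers
```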

 

High-density data center specialist Colovore provides a good example of a facility built to support high-density racks and their related workloads. All of the racks in its existing facility support 35 kW, and racks can use rear-door liquid-cooled heat exchangers to provide suitable cooling. Cooling vendors generally rate rear-door heat exchangers as suitable for up to 40 kW, making them one of the simplest ways to cool high-density racks.

 

But this is just the beginning; when we spoke to Colovore last year about the new facility it is building, the company told us its plans include offering direct liquid cooling technologies in the new data center, which will allow it to support densities as high as 250 kW per rack.

 

If You Offer It, They Will Use It

 

Once 200 kW-plus rack densities are available in data centers, you can be sure that customers will line up to make use of them, especially in areas where new data center space and power are tightly constrained. With each new generation of AI- and HPC-specific hardware bringing a significant increase in capability along with a 30%-60% increase in power consumption (based on past trends), these extreme-density racks will become commonplace for those technologies.
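To see what that generational trend implies, the sketch below projects per-rack power a few hardware generations out using the 30%-60% per-generation range cited above. The 40 kW starting density and the three-generation horizon are assumptions, not a forecast.

```python
# Illustrative projection of AI rack power under the 30%-60% per-generation
# growth trend cited above. The 40 kW starting density is an assumption.
start_kw = 40.0

for growth in (0.30, 0.60):
    kw = start_kw
    steps = []
    for gen in range(1, 4):          # three hardware generations out
        kw *= 1.0 + growth
        steps.append(f"gen+{gen}: {kw:.0f} kW")
    print(f"{int(growth * 100)}% per generation -> " + ", ".join(steps))

# 30% per generation -> gen+1: 52 kW, gen+2: 68 kW, gen+3: 88 kW
# 60% per generation -> gen+1: 64 kW, gen+2: 102 kW, gen+3: 164 kW
```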

 

Racks in the 15 kW - 30 kW range will replace today's commonplace 10 kW racks, if only for the sake of more economical use of space and cooling locations. The industry faces the need to increase rack densities while capturing whatever economies can be derived from high- and extreme-density racking.

 

Fortunately, for most data centers this is not an all-or-nothing proposition. For example, rear-door heat exchangers (RDHX) can be retrofitted onto existing server racks. An RDHX has the benefit of being a passive cooling solution: while liquid, most commonly water, is circulated from the RDHX to an external heat exchanger or cooling tower, the IT equipment itself remains untouched.
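For a sense of what circulating that water actually involves at these loads, the sketch below estimates the flow rate an RDHX loop would need to carry a given rack heat load, using the basic relation Q = m_dot * cp * dT. The 10 K water temperature rise across the door and the rack loads are assumptions for illustration, not vendor specifications.

```python
# Rough RDHX water-flow estimate from Q = m_dot * cp * dT.
# The 10 K temperature rise and the rack loads are illustrative assumptions.
CP_WATER = 4186.0    # J/(kg*K), specific heat of water
DELTA_T = 10.0       # K, assumed water temperature rise across the rear door

for rack_kw in (20, 30, 40):
    m_dot = (rack_kw * 1000.0) / (CP_WATER * DELTA_T)   # required water flow, kg/s
    liters_per_min = m_dot * 60.0                        # roughly 1 kg of water per litre
    print(f"{rack_kw} kW rack -> ~{liters_per_min:.0f} L/min at a {DELTA_T:.0f} K rise")

# 20 kW rack -> ~29 L/min at a 10 K rise
# 30 kW rack -> ~43 L/min at a 10 K rise
# 40 kW rack -> ~57 L/min at a 10 K rise
```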

 

And because they can be retrofitted on a per-rack basis, the technology scales simply. It can be applied to racks that require additional cooling as their IT workloads change or as new technologies, such as a rack of AI servers, are deployed, allowing much higher rack densities, where necessary, than traditional data center air cooling. It also means that rack locations originally designed to top out at 10 kW - 12 kW can now effectively support 20 kW - 30 kW workloads, providing a major increase in possible rack density and greater flexibility in workload placement within a data center.

 

And for data centers designed with the ability to increase available power, as many recent facilities have been, adding RDHX units to increase rack density means more efficient use of the power being delivered to the facility.

 

Flexible Cooling Tech Can Also Boost Rack Density

 

Once a decision is made to adopt liquid cooling to allow for higher-density racks, there is a broad range of potential solutions, almost all of which can be tailored to the needs of specific applications.

 

While a solution such as an RDHX provides cooling to a rack of equipment, your needs might be such that a full immersion system, from a vendor like Iceotope or GR Cooling, is the right choice for cooling an on-premises HPC deployment, while your existing cooling remains adequate for your standard IT workloads.

 

Or perhaps you have decided to deploy a flexible liquid cooling infrastructure within your existing data center so that you can use the space more efficiently. In addition to RDHX, you now have the option of deploying cold-plate cooling to hit specific hot spots you have identified as likely problem areas. By directly cooling CPUs, GPUs, memory, or entire blades, you can effectively control the heat given off by specific systems in your rack environments, improving cooling efficiency across the data center as a whole. Individual servers and racks can get tailored liquid-cooled solutions in the existing environment with little to no impact on the other hardware around them.

 

It's Not If, But When

 

Your current data centers may never see the need for extreme-density racks or even high-density levels of power demand, but the long-term advantages of higher densities, more efficient cooling, hotter operating temperatures, and more flexible solutions will show up as more cost-effective data centers that can demonstrate better ROI and lower operating costs over the projected life of the facilities.

 

Your new data centers will be built with these capabilities as part of the design; it only makes sense to build for the future and to have as many ways as possible to deliver the most efficient, sustainable, and effective operation. How much you invest in existing facilities to improve their performance and operational effectiveness is a different story and will, in most situations, need to be justified on a case-by-case basis, but doing it right will likely have a direct impact on your business model and your bottom line.

 

 
 
深知社 (DeepKnowledge)
 
 

 

Translated by:

王权

DKV (DeepKnowledge Volunteer) elite member

 

Proofread by:

王舜

Director of Product Planning & R&D, 秦淮数据

DKV (DeepKnowledge Volunteer) elite member

 

Official account statement:

This is not an officially approved Chinese edition; it is provided for readers' study and reference only and may not be used for any commercial purpose. The English original is authoritative, and this article does not represent the views of 深知社 (DeepKnowledge). The content comes from the internet; in case of any infringement it will be removed within 24 hours. The Chinese version may not be reproduced without written authorization from the DeepKnowledge official account.

 

