Building HPC-Ready Data Centers: Cooling Thresholds, Hybrid Design, and SLAs

A panel discussion from the HPC Summit featuring Mark Langford, Regional Technical Director, STULZ, and Indrama YM Purba, CEO, NeutraDC Nxera Batam. Moderated by Paul Mah and hosted by James Loggie, WMedia.

AI and HPC workloads are rewriting the rules of data center design in APAC — from rack densities and floor loading to cooling architecture and SLAs. In this HPC Summit panel, STULZ's Mark Langford and NeutraDC Nxera Batam's Indrama YM Purba share a manufacturer's and an operator's perspective on what it really takes to build facilities ready for the next wave of density.

Can you each introduce yourselves for those who may not be familiar?

Indrama YM Purba (NeutraDC): My name is Indrama, and I'm the CEO of NeutraDC Nxera Batam. It is a joint venture between Telkom, through NeutraDC, and SingTel, through Nxera. The other stakeholder is Medco Power, which is our partner providing solar panels and renewable energy in Batam.

Mark Langford (STULZ): My name is Mark Langford. I am based here in Singapore, and I've recently taken up the role of Regional Technical Director for STULZ. For those who may not be familiar with STULZ, we are a German-based global manufacturer of precision cooling equipment. It's been very much a rollercoaster ride over the years, from starting out with CRAC and CRAH units to where we are today. I've been in the industry for probably too long, but it's one of those things, a bit like Hotel California: you can check out, but you never leave. It's a great place to be, and I'm really enjoying the rollercoaster ride. I'm not ready to get off yet.

Indrama, how are customer requirements for AI and HPC fundamentally different from traditional colocation demand?

Indrama: Traditionally, the density per rack was quite low, around 5 to 15 kW, with standard air cooling and gradual scaling. Right now, it's much more than that. Our AI and HPC customers come to us looking for high-density requirements of up to 50, sometimes 80 or even 100 kilowatts per rack. It's a huge demand, and it's not gradual scaling. Some customers say: "I want that on day one."

So every piece of infrastructure we provide has to be there on day one: the cooling, the power, and the connectivity, because they need low latency. It's very different from traditional colocation. Luckily, we anticipated this. The floor loading of our slab is already quite strong, which suits GPU racks, as they are much heavier than traditional CPU equipment.
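For a sense of scale, static floor loading is simply rack mass over footprint; a quick arithmetic sketch (the masses and footprint below are illustrative assumptions, not NeutraDC figures):

```python
# Static floor-load arithmetic: rack mass divided by its footprint.
# All masses and dimensions are illustrative assumptions.

def floor_load_kg_per_m2(rack_mass_kg: float, width_m: float = 0.6, depth_m: float = 1.2) -> float:
    return rack_mass_kg / (width_m * depth_m)

print(f"typical CPU rack: ~{floor_load_kg_per_m2(900):,.0f} kg/m^2")
print(f"dense GPU rack:   ~{floor_load_kg_per_m2(1800):,.0f} kg/m^2")
```

Doubling the rack mass doubles the load on the slab, which is why a densely packed GPU rack can exceed what a raised floor or a lightly specified slab was designed for.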

So right now, it's not only about how we can host HPC — it's about how fast we can support their requirements on day one. That's non-negotiable. Otherwise, they go to other data center players.

Mark, as rack densities climb beyond 40–60 kW, where do traditional air-cooling approaches start to break down in HPC environments?

Mark: The issue is that many designs were originally generated for general IT. As I said, I've been in this industry a long time, and the workloads we looked at to begin with were tiny: a telco rack would draw less than a kilowatt. You still see 3 to 5 kilowatts a rack in the market today, even 10 to 15 kilowatts. But as we move up beyond 40, 50, 60 kilowatts a rack, there are some challenging situations that we face.

Simply speaking, air becomes a problem when it can no longer support the workload. Some of those designs really hit limits. The raised floor becomes a really big problem for AI workloads, because the servers being loaded into the racks are inherently heavier — the weight per square meter is higher, and they put a lot of stress on raised floors. On top of that, the plenums that raised floors provide don't deliver enough airflow to keep HPC racks cool.

Raised floors and ceiling plenums were great when we were delivering 20-degree air to low-to-medium-density servers. As we scale up, those designs tend to break down and no longer provide the type of cooling that's required. The airflow needed to support the racks climbs higher and higher, until we reach the point where we can't solve it anymore. That creates noise problems, velocity problems, hot-air recirculation, and hotspots, all of which become more prevalent as rack densities go up. Not to mention that we start to see possible failures of components inside the servers themselves.

Essentially, the breakdown is simply that there is not enough air to get into the box anymore. The servers are so tightly packed that we cannot force enough air through, and the server fans are not large enough to draw enough air through to make that an efficient process anymore.
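To put rough numbers on that breakdown, the sensible-heat equation shows how fast the required airflow grows with rack density (a minimal sketch; the 10 K air-side delta-T and the air properties are illustrative assumptions, not figures from the panel):

```python
# Back-of-the-envelope airflow needed to carry away a rack's heat with air.
# Sensible heat: Q = rho * cp * V * dT  =>  V = Q / (rho * cp * dT)
# Assumed values for illustration only.

RHO_AIR = 1.2    # kg/m^3, air density at roughly 20 degC
CP_AIR = 1005.0  # J/(kg*K), specific heat capacity of air

def airflow_m3_per_s(rack_kw: float, delta_t_k: float = 10.0) -> float:
    """Volumetric airflow (m^3/s) needed to absorb rack_kw at a given delta-T."""
    return rack_kw * 1000.0 / (RHO_AIR * CP_AIR * delta_t_k)

for kw in (5, 15, 40, 60, 100):
    v = airflow_m3_per_s(kw)
    print(f"{kw:>3} kW rack -> {v:5.2f} m^3/s (~{v * 2119:,.0f} CFM)")
```

At 5 kW a rack the volumes are comfortable; by 60 to 100 kW the airflow, and the fan power, velocity, and noise that come with it, is what Mark describes as no longer solvable through a raised-floor plenum.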

At what point does liquid cooling genuinely become necessary rather than just an optimization?

Mark: The time that it becomes inevitable — and not just an optimization — is really the same answer to a certain extent. It's when air can no longer support the load. There are sort of three lines in the sand that we refer to.

When you're up to about 40, 50 kilowatts a rack, liquid cooling can be considered an optimization. It's not necessary, but it can certainly reduce the impact of load on chillers and other devices. At 50 kilowatts, it can be considered an energy-saving optimization tool.

At 60 to 70 kilowatts a rack, things change: 70 kilowatts is really not easy to deliver by air anymore. You can go close-coupled with in-row cooling, or you could potentially use high-performance rear-door heat exchangers. But realistically speaking, you should already be thinking about how to deploy liquid cooling into zones. You may not necessarily have a full AI or HPC server room or data center, but you need to be thinking: am I ready to start zoning for HPC?

If we draw on NVIDIA's roadmap and start talking about 130 to 300 kilowatts a rack: by the time we're at 130 kilowatts, liquid is not something you can decide yes or no on. It's an absolute non-negotiable. Even at 100 kilowatts a rack, you're not doing that with air anymore.
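Those three lines in the sand could be condensed into a small decision helper (a sketch of the bands Mark quotes; the function name and phrasing are mine, not a STULZ rule):

```python
# Hypothetical summary of the density bands discussed above.

def cooling_recommendation(rack_kw: float) -> str:
    if rack_kw <= 40:
        return "air cooling works; liquid is purely optional"
    if rack_kw <= 50:
        return "liquid is an energy-saving optimization, not a necessity"
    if rack_kw < 100:
        return "air (in-row, rear-door) is at its limit; plan liquid-cooled zones"
    return "liquid cooling is non-negotiable"

for kw in (30, 50, 70, 130):
    print(f"{kw:>3} kW/rack: {cooling_recommendation(kw)}")
```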

Especially with the way servers are designed right now, we still see HPC or AI servers with an air requirement in them. Typically, as a rule of thumb, we're still talking 70/30 or 80/20. The roadmap is to go 90/10, then 100%, but right at the moment the easy way to look at it is this: if you take a 100-kilowatt rack, around 20 kilowatts of that is going to be power modules, RAM, internal cabling and other bits and pieces that are still air-cooled. The rest, your CPUs and GPUs, is definitely going to have cold plates connected to a liquid CDU.
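One practical consequence is that even a liquid-cooled hall still needs air capacity sized for the residual 20 to 30 percent. A minimal sizing sketch of that split (the helper and its defaults are illustrative, not a vendor tool):

```python
# Split a rack's heat load between the CDU loop and the residual air system,
# following the 80/20 rule of thumb quoted above.

def heat_budget(rack_kw: float, liquid_fraction: float = 0.8) -> tuple[float, float]:
    """Return (liquid_kw, air_kw) for a rack at the given liquid fraction."""
    liquid_kw = rack_kw * liquid_fraction  # CPUs/GPUs on cold plates
    air_kw = rack_kw - liquid_kw           # power modules, RAM, cabling, etc.
    return liquid_kw, air_kw

liquid_kw, air_kw = heat_budget(100.0)
print(f"100 kW rack -> {liquid_kw:.0f} kW to the CDU, {air_kw:.0f} kW still air-cooled")
```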

Indrama, from an operator perspective, how do you plan for liquid cooling when only a portion of customers may require it?

Indrama: Like Mark mentioned, for our baseline, up to 30 kW we use standard air cooling. Between 30 and 60 kW, we can still use air cooling. But beyond 60 kW, liquid cooling is non-negotiable.

From an installation point of view, this is very risky. Every technology vendor should provide certified installers to put all the equipment in place so there are no leaks. That's quite challenging right now, because leaks can happen if equipment is not properly installed.

This is also a very customized model for each customer. We let the customer choose their technology, and together we work out how to provide it in our data hall, fitting their requirements to our hall and our power. We don't have a raised floor, but we do have a ceiling, so that's something we discuss with each customer to meet their requirements.

We also plan for a hybrid cooling model: air cooling for low density, and liquid-ready infrastructure for HPC.

An audience question on liability and SLAs: with liquid cooling in mission-critical environments, what is the market sentiment?

Mark: It's a very good question, and to be quite honest with you, as an industry I'm not even sure that we've really got that worked out yet.

If we look at SLAs, they are really different with liquid cooling. Perhaps one way to look at it is that the system needs to be always on. When we think back to the way we used to operate an air-cooled data center, we could talk about temperature differential over a 24-hour period, or there might be some relevant humidity standards. As we move to liquid cooling, the CDU needs to be always on. That means the infrastructure changes: we have to make sure the pumps and the control out to the chip are maintained.

The liability, and where we draw the demarcations, is getting quite blurry. As an OEM vendor of equipment, we are now being asked to supply turnkey solutions, meaning that we need to deliver liquid towards the chip. In practice, our scope stops at the manifold: the CDU, the secondary fluid network, all the way down the branches and the stems and into the rack manifold. That's pretty much where it stops for us.

From a liability perspective, we are accountable for delivering that TCS (technology cooling system) fluid into the rack manifold at the required pH, conductivity, turbidity, and glycol levels. Those are the measurements being put on us at the moment from a delivery perspective.
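As an illustration, a delivery SLA of that shape reduces to a band check on each fluid parameter (a sketch; the parameter list follows what Mark names, but every threshold below is a placeholder, not a published STULZ or industry limit):

```python
# Toy SLA check on secondary (TCS) fluid quality at the rack manifold.
# All acceptance bands are placeholder values for illustration.

from dataclasses import dataclass

@dataclass
class FluidSample:
    ph: float                   # acidity/alkalinity
    conductivity_us_cm: float   # electrical conductivity, microsiemens/cm
    turbidity_ntu: float        # particulate cloudiness
    glycol_pct: float           # glycol concentration, percent

LIMITS = {
    "ph": (7.0, 9.5),
    "conductivity_us_cm": (0.0, 500.0),
    "turbidity_ntu": (0.0, 10.0),
    "glycol_pct": (20.0, 30.0),
}

def sla_violations(sample: FluidSample) -> list[str]:
    """Return the out-of-band parameters; an empty list means compliant."""
    out = []
    for field, (lo, hi) in LIMITS.items():
        value = getattr(sample, field)
        if not lo <= value <= hi:
            out.append(f"{field}={value} outside [{lo}, {hi}]")
    return out

sample = FluidSample(ph=8.1, conductivity_us_cm=320.0, turbidity_ntu=2.5, glycol_pct=25.0)
print(sla_violations(sample) or "within SLA")
```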

Indrama: For us, it depends on what the customer brings in terms of liquid cooling technology. We work alongside them to install it in our hall, providing the power and whatever is required to place their equipment in the data hall. As for the SLA, we follow what the customer wants.

Where does the liability demarcation end? If something goes wrong, whose responsibility is it?

Mark: For us, it's really about making sure that the secondary fluid is tested and considered healthy. The demarcation for us is at the rack manifold, but the measurement happens upstream of that point. The measurement is really about demonstrating the health of the secondary fluid, making sure it meets the standards. As long as it meets the standards from a temperature and health perspective, our delivery responsibility stops at the dry breaks at the rack manifold.

Indrama: Once we've discussed it with the customer, we also think about operational aspects: how we manage 24/7 with measurements and tolerance ranges. We have strong local partners, so for regular maintenance or small alarms, we can call on them. We also need our own staff and maintenance people trained to keep environmental conditions, temperature and humidity, within tolerance, so our customers have peace of mind.

As Mark said, on the primary loop, which relates to our infrastructure, we have to manage it to meet the environmental operational requirements.