Mastering Mission-Critical Thermal Management
Published on November 4, 2025
The AI Paradox and the Coming Management Gap
We are in an unprecedented technology boom. Enthusiasm for generative AI is attracting waves of serious investment, creating entirely new business models from what was science fiction just a few years ago [1]. But this digital revolution is built on a physical foundation that is beginning to crack. The compute power required to train and run these models, rivaling that of high-performance supercomputers [2], is pushing data center infrastructure to its breaking point.
The industry is facing a thermal crisis, and the scale of the challenge is difficult to overstate. The new generation of AI hardware, exemplified by Nvidia's Blackwell NVL72, arrives as a 132-kilowatt (kW) per-rack system [3]. This is not an incremental step; it is a phase change in computing.
To put this in perspective, the Uptime Institute's 2024 Global Data Center Survey reveals that only 1% of operators report having any racks that exceed 100 kW [3]. The vast majority of facilities are not even in the same league; the industry-wide average rack density remains below 8 kW [4].
This new reality renders traditional air cooling obsolete for high-density workloads. Research confirms that air cooling is "not suitable" for AI clusters operating above 20 kW per rack [5]. That fact has forced an urgent, industry-wide pivot to liquid cooling [6].
But this necessary solution introduces a massive, largely unaddressed problem. Liquid cooling solves the heat problem while creating a profound complexity problem. Retrofitting liquid cooling is a "major undertaking" and a "high-risk" operation [7], as the technology has not yet matured to the point of standardized designs. The shift creates a new, fragmented, and siloed infrastructure of pumps, pipes, and Coolant Distribution Units (CDUs) that must work in perfect concert with existing power and IT systems.
The real crisis is not just heat; it's the new management gap between Facilities teams, who own the chillers and pipes, and IT teams, who own the servers and workloads. Without a unified platform to bridge this gap, resilience, efficiency, and scalability are impossible. Mission-Critical Thermal Management is no longer about just keeping things cool; it is about establishing holistic, end-to-end management over the entire thermal chain.
This report will analyze the AI-driven thermal crisis, deconstruct the hidden complexities of the new liquid cooling chain, and present a strategic framework for unifying this new environment, from the chiller to the workload, using a holistic management platform.
Why the Old Rules No Longer Apply
The End of the "Easy Gains" in Air Cooling
For the past five years, a key industry metric, Power Usage Effectiveness (PUE), the ratio of total facility energy to IT equipment energy, has remained stubbornly flat [4]. This stagnation is not for lack of effort; rather, it signals the end of an era. The Uptime Institute's 2024 survey data shows that the plateau exists because the industry has already implemented the "easier and more cost-effective measures" [4].
These "low-hanging fruit" optimizations include rigorous hot/cold aisle containment, the installation of blanking panels, and the widespread adoption of variable-speed drives in cooling units.4 This means the industry has officially hit a wall. Air cooling, as a paradigm, has been optimized to its physical limit. This "legacy" infrastructure, which still constitutes most data centers operating at a typical density of 4 kW to 9 kW 4, is now structurally incapable of handling the demands of AI.
132 kW Racks and "Torture Test" Workloads
AI workloads are not like traditional enterprise applications. They are "torture tests" for cooling design, operating more like high-performance computing (HPC) or supercomputing clusters [1]. The hardware reflects this intense new reality.
As noted, Nvidia's flagship NVL72 product is a 132 kW rack-scale system [3]. Industry analysts at JLL confirm that AI requires densities of up to 100 kW per rack, a 10x leap from today's 10 kW average [5].
This isn't a simple upgrade; it's a breaking change that creates a "bifurcated" data center. Operators are now forced to run two fundamentally different facilities under one roof:
Legacy Zones: Large areas of low-density, air-cooled racks (below 8 kW) running traditional IT [4].
AI/HPC Zones: Small, hyper-dense "pods" or "GigaSites" [8] that are liquid-cooled and run at 30 kW, 70 kW, or even 100 kW+ per rack [5].
This hybrid model invalidates existing capacity planning methodologies. How can an operator model power and cooling when a single 132 kW rack [3] consumes more power and generates more heat than an entire row of 20 legacy racks (at a typical 6.5 kW each, that whole row draws roughly 130 kW)? The result is massive "stranded capacity," where power, space, and cooling are available but cannot be safely allocated due to the "lumpy" and extreme nature of the new AI demand [9].
When "Too Hot" Means "Game Over"
In this high-density environment, the cost of failure is astronomical. According to the Uptime Institute's 2024 outage analysis, more than half (54%) of significant data center outages cost organizations over $100,000, and one in five (20%) cost over $1 million [10]. The number one cause of these serious outages remains, consistently, power issues [10].
A thermal failure is a power failure. When cooling is lost, the only option to prevent hardware destruction is an emergency power-down of the affected systems.
This risk is amplified by the new hardware. In an air-cooled 8 kW rack, thermal runaway might take several minutes, giving human operators a precious window to react. In a 132 kW liquid-cooled rack [3], a sudden loss of coolant flow caused by a pump failure or a pipe blockage can inflict catastrophic, chip-destroying thermal damage in seconds. The window for human intervention has shrunk to effectively zero, which makes predictive, real-time monitoring not a "nice to have" but a non-negotiable requirement for basic infrastructure resilience.
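A rough thermal-mass estimate shows why the window collapses. In the sketch below, only the 8 kW and 132 kW rack figures come from the sources above; the coolant inventory, circulating air mass, and allowable temperature rise are illustrative assumptions.

```python
# Back-of-envelope check on the "minutes vs. seconds" claim. The rack
# powers (8 kW, 132 kW) come from the article; the coolant inventory,
# air mass, and temperature headroom are illustrative assumptions.

WATER_CP = 4186   # J/(kg*K), specific heat of water
AIR_CP = 1005     # J/(kg*K), specific heat of air

def seconds_to_limit(heat_w: float, thermal_mass_kg: float,
                     cp_j_per_kg_k: float, headroom_k: float) -> float:
    """Seconds until the trapped thermal mass warms by headroom_k,
    assuming all heat dumps evenly into it (an optimistic simplification:
    chip-local hot spots form much sooner)."""
    return (thermal_mass_kg * cp_j_per_kg_k * headroom_k) / heat_w

# 132 kW liquid-cooled rack, ~40 kg of coolant trapped in a stalled loop,
# 15 K of headroom before damage thresholds: roughly 19 seconds.
print(round(seconds_to_limit(132_000, 40, WATER_CP, 15)))

# 8 kW air-cooled rack drawing on ~200 kg of locally circulating room air
# with the same headroom: roughly 6 minutes.
print(round(seconds_to_limit(8_000, 200, AIR_CP, 15) / 60))
```

Even under these generous assumptions, the liquid-cooled rack burns through its headroom in well under a minute, which is why automated response and predictive monitoring matter.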
The Liquid Cooling Revolution
Why Liquid is the Only Answer
Faced with the limitations of air, the market has made its choice. JLL's 2024 outlook states that a "shift to liquid cooling will be essential" and that liquid cooling is already "becoming essential for high-density racks" [11]. The business case rests on two pillars: necessity and efficiency.
First, as established, traditional air cooling is "not suitable" for AI clusters above 20 kW per rack [12]. Second, liquid cooling is fundamentally more efficient. For example, a single HPE server consumes 10 kW when air-cooled but only 7.5 kW when liquid-cooled, a 25% reduction in total draw. That 2.5 kW saving comes simply from "reduced fan power requirements" [12], as the server's internal fans no longer need to run at full speed.
The market data reflects this urgent transition. While the overall data center cooling market is projected to reach $22.13 billion in 2024 [13], the liquid cooling segment is growing at a much faster 21.6% compound annual growth rate (CAGR) [14]. Major colocation providers are in a race to retool. Equinix is deploying liquid cooling in 100 of its data centers, Digital Realty has launched a 70 kW/rack service, and Aligned is building a purpose-built liquid-cooled campus specifically for the new Nvidia Blackwell GPUs [15].
Table 1: The End of an Era: Air Cooling vs. Liquid Cooling
| Feature | Legacy Air Cooling | Modern Liquid Cooling (DTC/RDHx) |
| --- | --- | --- |
| Max Rack Density | ~10-20 kW per rack | >100 kW per rack |
| Cooling Mechanism | Room-level air (CRAC/CRAH) | Direct-to-Chip (DTC) or Rear Door Heat Exchanger (RDHx) |
| Primary Efficiency | Low: industry PUE stagnant for five years; high fan power draw | High: cuts total server power by ~25% through reduced fan load; enables heat reuse |
| Suitability for AI/HPC | Not suitable | Essential |
The New Cooling Chain
Liquid cooling is not a single product; it is a complex, multi-stage chain. It introduces an entirely new, high-risk plumbing infrastructure into the data center, including Coolant Distribution Units (CDUs), direct-to-chip "cold plates," Rear Door Heat Exchangers (RDHx), and extensive, custom-run piping.
This is the hidden risk. Retrofitting this plumbing into a live data center is a "major undertaking" and a "high-risk" operation. Making matters worse, because the technology is still emerging, there is a "lack of standardized designs" for these new liquid cooling systems. CDUs can be floor-mounted, in-row, or rack-mounted; piping can run under the floor or overhead.
The sheer number of design permutations means that every liquid cooling deployment is essentially a custom, one-off engineering project. That customization opens a management chasm: the new plumbing infrastructure is owned and operated by the Facilities team, yet it directly services the most critical IT-owned assets, the AI servers.
When a high-value AI workload begins to throttle, is it a software bug (IT's problem) or a coolant-pressure drop (Facilities' problem)? Without a common platform and a shared set of data, both teams are flying blind.
The New Single Point of Failure
At the center of this new, complex chain is the Coolant Distribution Unit (CDU). The CDU is the "heart" of the liquid cooling system. It acts as the critical interface between the facility's main water loop and the "technology cooling loop" that pipes coolant directly to the servers [16].
This component introduces a new and severe single point of failure. In the old air-cooled world, the failure of a single Computer Room Air Conditioner (CRAC) unit was typically mitigated by N+1 redundancy [17]. Other units would ramp up, the load would be shared, and the room would get warmer slowly, giving operators time to react.
In the new liquid-cooled world, a single high-capacity CDU may service multiple 100 kW+ racks. A failure here (a pump seizure, a critical pipe blockage, or a significant leak) is instantly catastrophic. It is not a slowly warming room; it is an immediate, multi-megawatt failure that can destroy millions of dollars in AI hardware. The CDU has arguably become the most critical single component in the data center, demanding a new level of granular, predictive monitoring.
Unifying the Cooling Chain from Chiller to Workload

The problem is not heat; it's fragmentation. The only viable solution is unification. This is the precise role of Nlyte's holistic mission-critical thermal management platform. It provides the "single pane of glass" and "single source of truth" [9] required to manage this new, hybrid, non-standard infrastructure.
Beyond DCIM: Holistic Management for a Hybrid World
Traditional Data Center Infrastructure Management (DCIM) software, as defined by the Uptime Institute, focuses on IT asset management and basic monitoring [18]. This is no longer enough. The industry now requires solution providers that can unify the entire cooling chain, from the facility-level chiller to the application-level workload.
Nlyte provides this end-to-end management, monitoring key metrics at every stage of the cooling chain to mitigate specific, high-stakes risks.
Table 2: Nlyte's Unified "Chiller-to-Workload" Monitoring
| Cooling Chain Element | Key Metrics Monitored (via Nlyte) | Key Risk Mitigated |
| --- | --- | --- |
| 1. Chiller (Facility) | Facility-level cooling status, efficiency, reliability | Facility-wide failure; massive energy overspend (PUE) |
| 2. CDU (The Bridge) | Temperature (supply/return, delta-T), pressure (supply/return, differential), flow rate, pump speed, tank level, leak detection | Catastrophic AI workload failure; detects pump failures, pipe blockages, or leaks before they cause a shutdown |
| 3. Rack & Server (IT) | Asset location, upstream/downstream CDU connection, status, temperature (hot/cool spots) | Stranded capacity, localized overheating, inability to trace failures from IT to Facilities |
| 4. Workload (App) | Thermal conditions for specific high-performance workloads | CPU/GPU throttling, poor application performance, wasted AI investment |
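The CDU row above implies a concrete telemetry model. The sketch below shows one minimal way such readings might be structured and screened; every field name and threshold (the CduReading class, the 300 L/min floor, the 80% tank level, the 10% differential-pressure rise test) is a hypothetical placeholder, not Nlyte's actual schema, and the trend check is a deliberately simple stand-in for the analytics discussed later in this report.

```python
# Minimal sketch of CDU telemetry screening. All field names and
# thresholds are hypothetical; they are not Nlyte's data model.
from dataclasses import dataclass
from statistics import mean

@dataclass
class CduReading:
    supply_temp_c: float
    return_temp_c: float
    differential_pressure_kpa: float
    flow_rate_lpm: float
    pump_speed_rpm: float
    tank_level_pct: float
    leak_detected: bool

    @property
    def delta_t(self) -> float:
        # Heat actually being carried away: return minus supply temperature.
        return self.return_temp_c - self.supply_temp_c

def alerts(history: list[CduReading]) -> list[str]:
    """Flag hard faults and slow-building leading indicators."""
    latest = history[-1]
    found = []
    if latest.leak_detected:
        found.append("LEAK: coolant detected outside the loop")
    if latest.flow_rate_lpm < 300:          # hypothetical floor
        found.append("LOW FLOW: possible pump or valve fault")
    if latest.tank_level_pct < 80:          # hypothetical floor
        found.append("TANK LEVEL: slow loss of coolant inventory")
    # Leading indicator: differential pressure creeping upward over recent
    # readings suggests a clogging filter or partial pipe blockage.
    if len(history) >= 10:
        recent = [r.differential_pressure_kpa for r in history[-10:]]
        if mean(recent[5:]) > mean(recent[:5]) * 1.10:
            found.append("RISING dP: schedule filter/loop inspection")
    return found
```

The point of keeping hard faults (leaks, lost flow) and slow-building indicators (creeping differential pressure) in one model is that both IT and Facilities see the same evidence when a workload starts to throttle.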
From Floorplans to Flow Rates
The "design permutation" chaos of liquid cooling retrofits makes the new plumbing infrastructure effectively invisible to IT and difficult to manage for Facilities. Nlyte’s platform provides "rich visualizations" that make this complex new world manageable.

The Nlyte floorplan is not just a map of racks; it is a holistic view of the infrastructure that shows the CDUs and the plumbing connections. This visualization is the bridge that finally spans the IT/Facilities silo.
Using this shared map, a facilities manager can see exactly which racks and workloads (IT assets) are connected to a specific CDU and plumbing loop. Simultaneously, an IT manager investigating a hot-running server can trace that asset back to its parent CDU and, in the same platform, check its real-time flow rate, pressure, and temperature. This shared view moves teams from "blame-storming" to collaborative, data-driven problem-solving.
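Underneath that workflow sits a shared topology: which CDU and loop feed which rack. A minimal, purely hypothetical illustration of that cross-reference might look like the following; the asset names, CDU IDs, and metric values are invented for the example.

```python
# Hypothetical shared topology: which CDU and loop feed which rack.
# Names and values are invented for illustration only.
COOLING_TOPOLOGY = {
    "rack-ai-07": {"cdu": "CDU-B2", "loop": "secondary-3"},
    "rack-ai-08": {"cdu": "CDU-B2", "loop": "secondary-3"},
    "rack-web-21": {"cdu": None, "loop": None},  # legacy air-cooled zone
}

LIVE_CDU_METRICS = {
    "CDU-B2": {"flow_rate_lpm": 310, "differential_pressure_kpa": 92,
               "supply_temp_c": 32.0},
}

def trace_asset(asset_id: str) -> dict:
    """Resolve a hot-running server to its parent CDU and live metrics."""
    link = COOLING_TOPOLOGY.get(asset_id, {})
    cdu = link.get("cdu")
    return {"asset": asset_id, "cdu": cdu, "metrics": LIVE_CDU_METRICS.get(cdu)}

print(trace_asset("rack-ai-07"))
```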
The Proactive Engine: Nlyte Operational AI
The data center industry has a trust problem with AI. Uptime Institute's 2024 survey reveals that trust in AI for data center operations has declined for the third year in a row [4]. A significant 42% of operators stated they would not trust an AI system to make operational decisions [19]. The fear is that a "faulty model" could trigger an outage or that a "black box" recommendation cannot be verified [19].
Nlyte’s Operational AI (NOA) is specifically designed to solve this "trust paradox." It is not a black box that takes over. It is an augmentation tool that empowers human operators by providing actionable recommendations, delivering the interpretability that operators demand.
Nlyte’s Operational AI’s key functions include:
Predictive Optimization (Resilience): Nlyte’s Operational AI applies trend analysis to predict failures before they happen. For example, it can analyze deep CDU metrics and flag a subtle, steady rise in differential pressure, a trend human operators easily miss, as a leading indicator of a future pipe blockage or filter clog (a simplified version of this check appears in the sketch after Table 2). This allows for proactive, scheduled maintenance instead of catastrophic unplanned failure.
Intelligent Placement (Scalability): This is the answer to the bifurcated data center. Instead of weeks of planning, an operator can ask Nlyte Placement and Optimization with AI to place a new 100 kW rack. The engine quickly evaluates dozens of constraints, including space, power, network ports, and the available liquid cooling capacity from a specific CDU, to find the optimal location (a simplified constraint check is sketched after this list).
Predictive Forecasting (De-risking): Nlyte’s Operational AI’s "What-If" analysis allows operators to simulate the impact of adding 10 new high-density racks before a single server is purchased. This modeling identifies future power, space, or cooling shortfalls, preventing overload and ending stranded capacity.
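To make the placement idea concrete, here is a simplified constraint filter of the kind such an engine might apply. The Location fields, candidate rows, thresholds, and ranking rule are hypothetical and far coarser than a production placement engine.

```python
# Simplified constraint filter for placing a new high-density rack.
# Candidate data and thresholds are hypothetical; a real placement engine
# weighs many more factors (weight, airflow, redundancy, failure domains).
from dataclasses import dataclass

@dataclass
class Location:
    name: str
    free_rack_units: int
    spare_power_kw: float
    cdu_spare_cooling_kw: float
    free_network_ports: int

def viable_locations(candidates: list[Location], rack_kw: float,
                     rack_units: int, ports_needed: int) -> list[Location]:
    """Keep only locations that satisfy every hard constraint, then rank
    by how little cooling headroom the placement leaves stranded."""
    fits = [c for c in candidates
            if c.free_rack_units >= rack_units
            and c.spare_power_kw >= rack_kw
            and c.cdu_spare_cooling_kw >= rack_kw
            and c.free_network_ports >= ports_needed]
    return sorted(fits, key=lambda c: c.cdu_spare_cooling_kw - rack_kw)

candidates = [
    Location("Row A, pos 3", 48, 150.0, 120.0, 8),   # cooling too tight
    Location("Row D, pos 1", 48, 180.0, 160.0, 16),  # viable
    Location("Row F, pos 5", 42, 110.0, 200.0, 16),  # space and power too tight
]
best = viable_locations(candidates, rack_kw=132, rack_units=48, ports_needed=16)
print([loc.name for loc in best])  # ['Row D, pos 1']
```

Ranking the survivors by leftover CDU capacity is one simple way to avoid stranding cooling headroom; the real engine evaluates far more dimensions, but the filter-then-rank shape is the same.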
In short, Nlyte's AI strategy succeeds because it makes the human operator the hero, empowering them with predictive insights rather than attempting to replace them.
The Business Value of Mission-Critical Thermal Management
This unified approach translates directly into C-level business benefits, moving thermal management from a tactical cost center to a strategic advantage.
Driving Energy Efficiency and True Sustainability (ESG)
ESG is no longer optional; it is a top priority for investors and regulators [20]. The industry faces intense scrutiny over its massive energy and water consumption, yet many operators lack basic visibility: according to the Uptime Institute, fewer than half of data center operators are even tracking the metrics needed to meet pending regulatory requirements [4].
Nlyte's platform closes this data gap. First, by optimizing the entire cooling chain, from the chiller to the workload, it directly reduces energy consumption, the single largest driver of a data center's ESG footprint [21]. Second, it provides a centralized, auditable platform that "expedites compliance and audit processes." This moves an organization's ESG posture from a hopeful guess to a data-driven, reportable, and defensible set of metrics.
Enhancing Resilience and De-risking Operations
The data center operating environment is more volatile than ever. Operators face a "worldwide power shortage" [22], power transmission bottlenecks, critical supply chain delays, and the constant threat of extreme weather events.
Nlyte is fundamentally a resilience platform. By using AI-driven analytics to "identify, manage, and mitigate the impact of failures," it moves the entire operation from reactive to proactive. This "enhanced monitoring and predictive analytics" is the only way to minimize downtime in an environment where failure is measured in seconds. This is the essence of true mission-critical thermal management.
Unlocking Scalability and Deferring Capital Expenditures (CapEx)
The industry is growing at a phenomenal pace, with JLL projecting 15% annual capacity growth. But this growth is severely constrained by power and land scarcity. New construction is slow, expensive, and in many markets, no longer an option.
The biggest enemy of growth is not just a lack of new capacity; it is the inefficient use of existing capacity, also known as stranded capacity. Nlyte's AI-powered placement is the key to unlocking this hidden capacity. By finding the true optimal location for new assets, Nlyte helps organizations maximize space, power, and cooling efficiency. This allows them to fit more workloads into their existing footprint, directly deferring multi-million-dollar capital expenditures on new builds.
The Platform of Record
Nlyte's credibility is reinforced by its deep integration into the data center ecosystem. It is not a siloed tool; it is a central platform that integrates with:
ITSM: Extends platforms like ServiceNow directly into the physical data center, bridging the gap between a software-based service ticket and the physical rack, power, and cooling changes required to resolve it [23].
BMS: Connects to building management systems to see the facility-level "chiller" side of the equation.
Industry Leaders: Nlyte is trusted by global enterprises, federal governments, and the very real estate firms, such as CBRE [24], that are analyzing and building the future of the data center market. This demonstrates its ability to scale and manage the world's most complex, mission-critical portfolios.
From Thermal Chaos to Thermal Management
The generative AI revolution [1] has permanently broken the air-cooled data center paradigm [25]. The necessary, industry-wide shift to liquid cooling has, in turn, created a new and dangerous crisis of complexity and fragmentation.
This new, hybrid, high-risk environment cannot be managed with yesterday's siloed tools. The IT team, the facilities team, and the C-suite are all operating with dangerous blind spots, exposed to catastrophic financial risk and unable to meet strategic ESG goals.
Mission-Critical Thermal Management has become the new strategic imperative. It demands a single, unified platform that sees and controls the entire cooling chain, from the facility chiller's efficiency to the pressure in the CDU and the temperature of the AI workload itself.
Nlyte delivers this "chiller-to-workload" management center. By integrating advanced visualization, deep-data monitoring, and a trustworthy Operational AI, Nlyte transforms thermal management from a reactive, high-risk "firefight" into a proactive, resilient, and efficient business advantage. This is how modern organizations can finally close the management gap, achieve true operational excellence, and unlock the full promise of AI.
Cited Works
- Davis, J. (2025, April 23). AI embraces liquid cooling, but enterprise IT is slow to follow. Uptime Institute Journal. https://journal.uptimeinstitute.com/ai-embraces-liquid-cooling-but-enterprise-it-is-slow-to-follow/
- Data Center Dynamics. (2024). Redesigning the data center: Industry brief (R5_V2146). https://media.datacenterdynamics.com/media/documents/Redesigning_The_Data_Center_Industry_Brief_R5_V2146.pdf
- Bizo, D. (2025, June 25). Density choices for AI training are increasingly complex. Uptime Institute Blog. https://journal.uptimeinstitute.com/density-choices-for-ai-training-are-increasingly-complex/
- Donnellan, D., Lawrence, A., Bizo, D., Judge, P., O’Brien, J., Davis, J., Smolaks, M., Williams-George, J., & Weinschenk, R. (2024). Uptime Institute Global Data Center Survey 2024. Uptime Institute. https://datacenter.uptimeinstitute.com/rs/711-RIA-145/images/2024.GlobalDataCenterSurvey.Report.pdf
- JLL. (2024). Getting data centers ready for the AI boom. https://www.jll.com/en-us/insights/getting-data-centers-ready-for-the-ai-boom
- JLL. (2024). Liquid cooling enters the mainstream in data centers. https://www.jll.com/en-us/insights/liquid-cooling-enters-the-mainstream-in-data-centers
- Nlyte Software. (2025, October 24). AI at the edge. Nlyte Blog. https://www.nlyte.com/blog/ai-at-the-edge/
- JLL. (2024). Four strategic location factors for AI training data centres. https://www.jll.com/en-uk/insights/four-strategic-location-factors-for-ai-training-data-centres
- Hanna, L. (2025, August 12). AI-powered optimization: How AI is reinventing the data center. Nlyte Blog. https://www.nlyte.com/blog/ai-powered-optimization-how-ai-is-reinventing-the-data-center/
- Uptime Institute. (2025). Annual outage analysis 2025. https://uptimeinstitute.com/uptime_assets/d7c049ef5b02a6e0a15540a3e5cb8fbf742c7fa54a1af6caeaaab32b7c15d443-GA-2025-05-annual-outage-analysis.pdf
- JLL. (2025). Data center outlook. https://www.jll.com/en-us/insights/market-outlook/data-center-outlook
- Schneider Electric. (2024). The AI disruption: Challenges and guidance for data centre design. https://www.efficiencyit.com/wp-content/uploads/2024/09/the-ai-disruption-challenges-and-guidance-for-data-centre-design.pdf
- Grand View Research. (2024). Data center cooling market size, share & trends analysis report by component (solutions, services), by type, by containment, by structure, by application, by solution, by service, by region, and segment forecasts, 2025 - 2030. https://www.grandviewresearch.com/industry-analysis/data-center-cooling-market
- Grand View Research. (2024). Data center liquid cooling market size, share & trends analysis report by component, by solution, by service, by type of cooling (immersion cooling, cold plate cooling, spray liquid cooling), by data center type, by end-use, by region, and segment forecasts, 2025–2030. https://www.grandviewresearch.com/industry-analysis/data-center-liquid-cooling-market-report
- JLL. (2024). Getting data centers ready for the AI boom. https://www.jll.com/en-us/insights/getting-data-centers-ready-for-the-ai-boom
- Dietrich, J. (2025, May 28). Water is local: Generalities do not apply. Uptime Institute Blog. https://journal.uptimeinstitute.com/water-is-local-generalities-do-not-apply/
- Shehata, H. (2014, November 11). Data center cooling: CRAC/CRAH redundancy, capacity, and selection metrics. Uptime Institute Blog. https://journal.uptimeinstitute.com/data-center-cooling-redundancy-capacity-selection-metrics/
- Uptime Institute. (2024). DCIM past and present: What’s changed? Uptime Institute Blog. https://journal.uptimeinstitute.com/dcim-past-and-present-whats-changed/
- Weinschenk, R. (2024, December 19). Building trust: Working with AI-based tools. Uptime Institute Blog. https://journal.uptimeinstitute.com/building-trust-working-with-ai-based-tools/
- JLL. (2024). The green tipping point. https://www.jll.com/en-us/insights/the-green-tipping-point
- Buchholz, L. (2023, March 24). 2023 is the year of efficiency for the data center industry. Data Centre Magazine. https://datacentremagazine.com/articles/2023-is-the-year-of-efficiency-for-the-data-center-industry
- CBRE. (2024, June 24). Global data center trends 2024. https://www.cbre.com/insights/reports/global-data-center-trends-2024
- Hanna, L. (2025, July 1). Extending change management into the data center with Nlyte. Nlyte Blog. https://www.nlyte.com/blog/extending-change-management-into-the-data-center-with-nlyte/
- Nlyte Software. (2024). CBRE talks DCIM to Gartner audience. https://www.nlyte.com/resource/cbre-talks-dcim-to-gartner-audience/
- Data Center Dynamics. (2024). Redesigning the data center: Industry brief (R5_V2146). https://media.datacenterdynamics.com/media/documents/Redesigning_The_Data_Center_Industry_Brief_R5_V2146.pdf