
Decision Focus

On May 7, an overheating event contributed to an Amazon data center shutdown in Northern Virginia, disrupting services including cryptocurrency exchange Coinbase, according to Reuters reporting. Two companies commercializing thermal technology with International Space Station heritage—Mikros Technologies and Carbice Corporation—state that the failure mode behind that incident is exactly what their products are designed to prevent.

The operational signal for energy leaders is specific: thermal throttling is not purely a compute reliability problem. It sits at the intersection of power draw, PUE, and unplanned grid load. The liquid cooling transition that neutralizes it carries a measurable energy reduction claim, and that claim is now entering production-scale supply chains.

90-Second Brief

The Amazon Virginia failure on May 7, attributed in part to overheating, gave thermal risk a real service-level cost. Engineering firm Ketchum & Walton has estimated that emergency thermal throttling can cost up to $540,000 per hour in downtime. Mikros and Carbice are positioning liquid cooling and carbon nanotube thermal interface materials as the infrastructure layer that eliminates this failure mode. Mikros CEO Drew Matter stated that liquid cooling can cut energy consumption by more than a third versus air-cooled facilities, with per-rack lifetime savings he put at over one million dollars.
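To put the hourly downtime estimate in context, here is a minimal sketch that scales the quoted upper bound by outage duration. The function name and the outage durations are hypothetical and chosen only for illustration; the only figure taken from the text is the $540,000 per hour ceiling, so the results are upper bounds, not forecasts.

```python
# Upper-bound hourly figure quoted in the text (Ketchum & Walton estimate).
MAX_COST_PER_HOUR_USD = 540_000


def downtime_exposure_usd(hours: float,
                          cost_per_hour: float = MAX_COST_PER_HOUR_USD) -> float:
    """Return an upper-bound downtime cost for an outage lasting `hours`."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return hours * cost_per_hour


if __name__ == "__main__":
    # Hypothetical outage durations, for illustration only.
    for hours in (0.5, 2, 6):
        print(f"{hours} h outage: up to ${downtime_exposure_usd(hours):,.0f}")
```

Replacing the default `cost_per_hour` with a site-specific figure is the obvious next step, since the quoted number is a ceiling rather than a typical value.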

What Is Really Happening?

The Virginia incident is a visible symptom of a structural mismatch: AI chip power density is rising faster than the thermal infrastructure deployed around it. Chips being released over the next three years are reportedly being engineered for liquid cooling from the design stage, according to Matter—which means the air-cooled installed base is accumulating thermal risk with each GPU generation, not holding steady.

From an energy perspective, this extends beyond hardware reliability. Air cooling currently consumes a significant fraction of facility power running fans and HVAC. That load counts against PUE targets and draws from contracted grid capacity that operators are actively competing to secure. Liquid cooling’s energy claim directly affects how much of a facility’s committed power goes to compute versus overhead.
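The split between compute and overhead can be made concrete with the standard PUE definition (total facility power divided by IT equipment power). The sketch below is illustrative only: the function names are hypothetical, and the kilowatt figures are invented placeholders, not measurements from any facility or vendor mentioned here.

```python
def pue(it_kw: float, cooling_kw: float, other_overhead_kw: float = 0.0) -> float:
    """Power usage effectiveness: total facility power / IT equipment power."""
    if it_kw <= 0:
        raise ValueError("it_kw must be positive")
    return (it_kw + cooling_kw + other_overhead_kw) / it_kw


def compute_share(it_kw: float, cooling_kw: float,
                  other_overhead_kw: float = 0.0) -> float:
    """Fraction of total facility power that reaches IT equipment (1 / PUE)."""
    return 1.0 / pue(it_kw, cooling_kw, other_overhead_kw)


if __name__ == "__main__":
    # Placeholder loads in kW; assumptions for illustration, not source data.
    scenarios = {
        "higher cooling overhead": (1000, 400, 100),
        "lower cooling overhead": (1000, 150, 100),
    }
    for name, loads in scenarios.items():
        print(f"{name}: PUE={pue(*loads):.2f}, "
              f"compute share={compute_share(*loads):.0%}")
```

Because compute share is simply the reciprocal of PUE, any reduction in cooling power shows up directly as a larger fraction of contracted capacity available to IT load.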

Carbice addresses a more specific failure point: the bond between cold plate and chip. Because chips expand and contract across temperature cycles, conventional thermal interface materials progressively lose contact, generating the random throttling events that destabilize power draw. Carbice’s carbon nanotube structure maintains contact by conforming to changes in chip curvature—the company reports its nanotube geometry increases surface area for free convection by a factor exceeding 10,000. The material has been running on Nvidia chips in the Georgia Tech data center for three years, providing field-reliability data most newer entrants cannot offer.

Broadcom has already moved to embed this supply chain. The company partnered with Mikros for its 3.5D eXtreme Dimension SiP platform; Broadcom’s VP of AI Systems Development stated publicly in April that the microchannel technology provides the chip-level thermal resistance needed to unlock full ASIC performance. Marvell Technology is separately collaborating with Mikros on co-packaged optics chips, where GPU proximity creates a secondary heat management problem.

Why It Matters for Global Heads of Data Center Energy

Three energy-facing implications follow from the confirmed picture.

The energy reduction claim changes the PPA sizing calculus. If liquid-cooled facilities consume materially less power for the same compute output, forward purchase commitments built around air-cooled load assumptions are either overbuilt relative to future need—or they free capacity for AI densification without new interconnection. Either scenario requires a load model review before the next offtake negotiation closes.
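The two scenarios can be sketched as simple arithmetic. This is a rough model under stated assumptions: it treats the vendor's "more than a third" claim as a flat one-third reduction in total facility power for the same compute output, and it assumes power scales linearly with compute. Neither assumption is verified by the source, and the function name and the 90 MW commitment are hypothetical.

```python
def ppa_scenarios(committed_mw: float, reduction: float = 1 / 3) -> dict:
    """Compare a power commitment against a claimed fractional energy reduction.

    `reduction` is the assumed fractional drop in facility power for the same
    compute output (a vendor claim, not an independently benchmarked figure).
    """
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    needed_mw = committed_mw * (1 - reduction)   # power for today's compute
    freed_mw = committed_mw - needed_mw          # overbuilt or reusable capacity
    compute_multiple = 1 / (1 - reduction)       # compute supportable at same MW
    return {
        "needed_mw": needed_mw,
        "freed_mw": freed_mw,
        "compute_multiple": compute_multiple,
    }


if __name__ == "__main__":
    # Hypothetical 90 MW commitment sized on air-cooled load assumptions.
    for key, value in ppa_scenarios(90).items():
        print(f"{key}: {value:.2f}")
```

Under these assumptions, a one-third reduction either leaves a third of the commitment unused or supports roughly 1.5 times the compute within the same interconnection. A site-specific load model would replace the flat `reduction` with measured values.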

Throttling events carry a hidden grid cost that sits outside most energy teams’ current risk registers. A facility running at reduced throughput due to thermal stress draws inconsistent load, complicating demand response participation and grid balancing commitments. Operators with VPP agreements or demand response contracts priced on predictable load shapes should assess how thermal instability affects their obligations in those programs.
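One simple way to quantify "inconsistent load" is the coefficient of variation of a facility's load profile. The sketch below is an assumption about how an energy team might screen for this, not a metric defined by any demand response program; the function name and the sample load series are hypothetical.

```python
from statistics import mean, pstdev


def load_variability(load_mw: list[float]) -> float:
    """Coefficient of variation (population std dev / mean) of a load profile."""
    if not load_mw:
        raise ValueError("load_mw must not be empty")
    avg = mean(load_mw)
    if avg <= 0:
        raise ValueError("mean load must be positive")
    return pstdev(load_mw) / avg


if __name__ == "__main__":
    # Hypothetical interval readings in MW, for illustration only.
    stable = [50, 50, 50, 50, 50, 50]
    throttled = [50, 42, 50, 38, 50, 44]
    print(f"stable profile:    {load_variability(stable):.3f}")
    print(f"throttled profile: {load_variability(throttled):.3f}")
```

A higher value indicates a less predictable load shape. Comparing it across periods with and without throttling events is one way to test whether thermal instability is material to a given contract.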

The supply chain is scaling ahead of operator procurement cycles. Carbice’s new Atlanta facility is designed to triple output from a year ago, feeding contract manufacturers Jabil and Flex. Broadcom and Marvell are already contracted with Mikros. The ecosystem is building toward liquid cooling as the default configuration—and energy procurement strategies that assume continued air-cooled PUE baselines will require revision inside the current planning horizon.

Forward View

If liquid cooling adoption follows the timeline Mikros describes (standard across the industry within three to five years), three strategic fronts shift for energy teams.

Interconnection requests for new builds will increasingly reflect lower per-MW thermal overhead. Grid agreements and load commitments anchored to current PUE assumptions may require renegotiation as operators demonstrate sustained efficiency improvements during commissioning. Utilities and ISOs that have sized capacity for today’s data center load curves will need updated modeling.

Facilities mid-cycle on air-cooled infrastructure face a compounding stranded asset question. Substation and transformer investments sized for current thermal overhead could be oversized by the time liquid cooling reaches scale, or undersized if freed capacity accelerates AI densification faster than planned. The relevant trigger is not time but compute density: at what rack power level does liquid cooling become an economic necessity rather than an optional premium?

Demand response economics also improve as facilities become thermally stable. Liquid-cooled operations with predictable load profiles are stronger candidates for grid services programs. In markets where flexible demand is compensated, reduced throttling events translate into a contractable revenue offset worth modeling now, before those program windows close.

What Is Still Uncertain

Several variables limit how far this analysis extends with confidence.

The energy reduction figure—more than a third versus air cooling—originates with Mikros CEO Drew Matter and has not been independently benchmarked in the source material. Actual savings will vary with facility vintage, climate zone, compute density, and cooling system design. Treat it as a directional signal, not a procurement assumption, until site-specific retrofit modeling produces a verifiable range.

The May 7 Amazon Virginia incident has been attributed to overheating in available reporting, but the precise root cause, the scope of affected capacity, and whether liquid cooling would have prevented it are not confirmed. The incident establishes that thermal failures carry measurable service-level consequences; it does not validate any specific vendor’s claims.

Carbice’s capacity expansion addresses component-level supply, not full system integration. Retrofit timelines in live facilities depend on maintenance windows, structural compatibility, and power infrastructure modifications that neither company’s current announcements cover.

One Question for Your Team

Given that AI chip roadmaps for the next three years are reportedly being engineered for liquid cooling from the design stage, at what compute density threshold does your current energy procurement model need to be recalibrated—and has that trigger been built into your next interconnection request or PPA review cycle?

Sources

  • IndexBox: Mikros and Carbice Repurpose ISS Cooling Tech for AI Data Centers