Can the Vera Rubin NVL72 Redefine AI Compute Architecture? From Blackwell to Six-Chip Collaborative Design

Updated: 06/02/2026 01:52

On June 1, 2026, NVIDIA announced at the GTC Taipei conference that the Vera Rubin platform had entered full-scale mass production. That same day, AI cloud provider CoreWeave became the first in the industry to complete cloud deployment and validation of Vera Rubin NVL72. Its stock closed at $124.82, up 13.96%, on trading volume about 90% above its three-month average. The timing of the two announcements was no coincidence: together they mark another generational leap in AI compute supply, as the platform moves from lab experiments into production environments.

Viewing Vera Rubin NVL72 merely as a chip upgrade would badly underestimate its significance for the industry. The core question this generational shift addresses is this: as model sizes pass the trillion-parameter mark, inference workloads outgrow training, and Agentic AI demands millisecond-level responses, how should compute be organized, deployed, consumed, and priced? Blackwell introduced the concept of rack-level computing; Vera Rubin pushes it to the extreme with six co-designed chips, a 100% liquid-cooled compact rack, and an order-of-magnitude reduction in inference costs, redefining the efficiency boundaries of AI infrastructure.

From Chip Iteration to System Integration: How Vera Rubin Redefines Competitive Dimensions

The traditional narrative of GPU generational upgrades follows a linear chain: process improvement → more transistors → increased compute power → reduced power consumption. Vera Rubin NVL72 breaks this pattern. It no longer centers on a single GPU as its main selling point, but instead defines an entire rack as the smallest delivery unit for AI supercomputing.

Each Vera Rubin NVL72 rack integrates 72 Rubin GPUs and 36 Vera CPUs, delivering 260 TB/s of rack-level scale-up bandwidth via sixth-generation NVLink. NVIDIA claims this bandwidth exceeds the total global internet traffic. The system uses a 100% liquid cooling solution, reducing installation time from two hours in traditional architectures to just five minutes. The real shift behind these specs is that the core metric of compute competition is moving from "single-card TFLOPS" to "rack-level system efficiency."
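To put the rack-level figures in per-device terms, the minimal sketch below divides the quoted 260 TB/s evenly across the 72 GPUs and computes the GPU-to-CPU ratio. The even split is an assumption made purely for illustration; actual NVLink traffic depends on switch topology and workload.

```python
# Rack-level figures quoted for Vera Rubin NVL72.
RACK_SCALEUP_TB_PER_S = 260.0  # sixth-generation NVLink, aggregate per rack
GPUS_PER_RACK = 72
CPUS_PER_RACK = 36


def per_gpu_share(rack_total: float, gpus: int) -> float:
    """Split an aggregate rack figure evenly across GPUs.

    Illustrative assumption only: real NVLink traffic is not an even division.
    """
    return rack_total / gpus


nvlink_per_gpu = per_gpu_share(RACK_SCALEUP_TB_PER_S, GPUS_PER_RACK)
gpus_per_cpu = GPUS_PER_RACK / CPUS_PER_RACK
print(f"~{nvlink_per_gpu:.2f} TB/s of NVLink per GPU, {gpus_per_cpu:.0f} GPUs per Vera CPU")
```

Under that even-split assumption, each GPU's share comes to roughly 3.61 TB/s, and each Vera CPU is paired with two Rubin GPUs.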

Blackwell NVL72 already demonstrated the potential of rack-level computing: 1.44 EFLOPS of inference compute, 130 TB/s of NVLink bandwidth, and partial liquid cooling. Vera Rubin NVL72 takes the concept further. Rack-level inference compute rises to 3.6 EFLOPS (2.5x); per-GPU training compute climbs from 10 PFLOPS to 35 PFLOPS (3.5x); GPU memory moves from HBM3e to HBM4, roughly doubling capacity from 141 GB to 288 GB; and per-GPU memory bandwidth rises from about 8 TB/s to roughly 22 TB/s. These numbers represent not a simple "performance doubling" but a systematic efficiency overhaul. Notably, on a per-GPU basis the inference gain (5x) far outpaces the training gain (3.5x). This differentiated design reflects a clear industry judgment: inference is replacing training as the main arena of AI compute consumption.
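The generational multiples quoted above can be recomputed from the raw figures. The sketch below is a simple ratio table with one interpretive assumption: the 10 and 35 PFLOPS training figures are read as per-GPU values, since they are far smaller than the rack-level EFLOPS numbers. No per-GPU inference pair is quoted, so the 5x inference figure is not reproduced here.

```python
# (Blackwell, Vera Rubin) value pairs as quoted in the article.
SPECS = {
    "rack inference (EFLOPS)": (1.44, 3.6),
    "per-GPU training (PFLOPS)": (10.0, 35.0),  # assumed to be per-GPU figures
    "per-GPU memory capacity (GB)": (141.0, 288.0),
    "per-GPU memory bandwidth (TB/s)": (8.0, 22.0),
    "rack NVLink bandwidth (TB/s)": (130.0, 260.0),
}


def generational_ratios(specs: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return new / old for every spec."""
    return {name: new / old for name, (old, new) in specs.items()}


ratios = generational_ratios(SPECS)
for name, ratio in ratios.items():
    print(f"{name:<34}{ratio:5.2f}x")
```

The table makes one point easy to see: memory bandwidth (about 2.75x) grows faster than memory capacity (about 2.04x), which matters for bandwidth-bound inference decoding.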

Six-Chip Synergy and Full Liquid Cooling: Supply Chain and Cost Logic Behind Technical Choices

Vera Rubin NVL72’s chip-level innovation is not confined to the GPU. The platform comprises six newly designed chips (the Vera CPU, Rubin GPU, NVLink 6 switch, ConnectX-9 SuperNIC, BlueField-4 DPU, and Spectrum-6 Ethernet switch), which were developed and validated together rather than designed independently and stitched together afterward. Technically, this "full-stack simultaneous iteration" strategy aims to close long-standing performance gaps between compute, storage, and networking. Commercially, it raises barriers to entry higher than in the Blackwell era: a would-be competitor must not only master GPU design but also keep pace in CPUs, interconnects, NICs, DPUs, and switch chips.

The 100% liquid cooling solution is another notable technical choice. Each Vera Rubin NVL72 rack draws about 440 kW, operates at a PUE of roughly 1.1, and can accept inlet water temperatures up to 45°C. By comparison, Blackwell NVL72 uses partial liquid cooling with a PUE of around 1.25. The difference looks minor for a single rack, but because PUE is total facility power divided by IT power, lowering it from 1.25 to 1.1 cuts non-IT overhead (such as cooling and power conversion) from 25% to 10% of IT load, which adds up to substantial savings in electricity and cooling infrastructure across thousands of racks. This helps explain why CoreWeave developed Valvey (a programmable rack-level liquid cooling valve module) and Racky (a unified rack control device) specifically for Vera Rubin: liquid cooling is shifting from an "optional solution" to "essential infrastructure."
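The PUE comparison can be made concrete with a little arithmetic. The sketch below assumes the 440 kW figure is IT load running continuously, applies both PUE values to that same load to isolate the cooling-efficiency effect, and uses a hypothetical 1,000-rack fleet; a real Blackwell rack draws less power, so this is an upper-bound style illustration rather than a like-for-like fleet comparison.

```python
RACK_IT_KW = 440.0   # per-rack draw quoted in the article, treated as IT load
HOURS_PER_YEAR = 8760


def overhead_kw(it_kw: float, pue: float) -> float:
    """Non-IT facility power implied by a PUE.

    PUE = total facility power / IT power, so overhead = IT * (PUE - 1).
    """
    return it_kw * (pue - 1.0)


def annual_savings_mwh(racks: int, it_kw: float, pue_old: float, pue_new: float) -> float:
    """Yearly overhead energy avoided by moving from pue_old to pue_new."""
    saved_kw = racks * (overhead_kw(it_kw, pue_old) - overhead_kw(it_kw, pue_new))
    return saved_kw * HOURS_PER_YEAR / 1000.0  # kWh -> MWh


per_rack_kw = overhead_kw(RACK_IT_KW, 1.25) - overhead_kw(RACK_IT_KW, 1.1)
fleet_mwh = annual_savings_mwh(1000, RACK_IT_KW, 1.25, 1.1)  # hypothetical fleet size
print(f"{per_rack_kw:.0f} kW saved per rack; {fleet_mwh:,.0f} MWh/year across 1,000 racks")
```

Under these assumptions, each rack sheds about 66 kW of overhead, or roughly 578,000 MWh per year across 1,000 racks running around the clock.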

The same design choices also create supply chain constraints: full liquid cooling and six-chip co-design introduce multiple potential production bottlenecks. HBM4 memory is currently supplied mainly by Samsung Electronics and SK Hynix, and the ramp-up of cooling component production and the synchronized delivery of system components could all limit how quickly Vera Rubin penetrates the market.

Inference Costs Drop to One-Tenth: Redefining the Economics of AI Applications

Of all Vera Rubin NVL72’s specifications, the most economically significant are these: compared with Blackwell, inference cost per million tokens falls to about one-tenth, inference performance per watt rises by up to 10x, and the number of GPUs needed for an equivalent inference workload can fall by as much as three-quarters.

These figures stem from three technical advances: a 3nm process that raises transistor density (336 billion transistors, about 60% more than Blackwell’s 208 billion), HBM4 lifting per-GPU memory bandwidth from about 8 TB/s to roughly 22 TB/s, and sixth-generation NVLink further easing GPU-to-GPU communication bottlenecks. More importantly, falling inference costs are pushing previously uneconomical applications into the feasible zone.
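Cost per million tokens is simply hourly operating cost divided by millions of tokens served per hour, and the GPU-count claim follows from per-GPU throughput. The minimal sketch below uses hypothetical numbers (no prices or throughputs are given in the article) to show that a tenfold cost cut does not require cheaper hardware: here the hourly cost doubles while throughput rises 20x. The `Deployment` class and both helpers are illustrative names, not any vendor API.

```python
import math
from dataclasses import dataclass


@dataclass
class Deployment:
    hourly_cost_usd: float    # all-in cost of running the hardware for one hour
    tokens_per_second: float  # sustained aggregate serving throughput


def cost_per_million_tokens(d: Deployment) -> float:
    """USD per 1M tokens = hourly cost / millions of tokens served per hour."""
    millions_per_hour = d.tokens_per_second * 3600 / 1_000_000
    return d.hourly_cost_usd / millions_per_hour


def gpus_needed(target_tokens_per_second: float, tokens_per_second_per_gpu: float) -> int:
    """Whole GPUs required to sustain a target throughput (rounded up)."""
    return math.ceil(target_tokens_per_second / tokens_per_second_per_gpu)


# Hypothetical: hardware cost per hour doubles, throughput rises 20x.
old = Deployment(hourly_cost_usd=360.0, tokens_per_second=100_000)
new = Deployment(hourly_cost_usd=720.0, tokens_per_second=2_000_000)
old_cost = cost_per_million_tokens(old)
new_cost = cost_per_million_tokens(new)
print(f"${old_cost:.2f} -> ${new_cost:.2f} per 1M tokens ({new_cost / old_cost:.0%} of baseline)")

# Hypothetical per-GPU throughputs: a 4x gain cuts the GPU count by three-quarters.
old_gpus = gpus_needed(1_000_000, 1_000)
new_gpus = gpus_needed(1_000_000, 4_000)
print(f"{old_gpus} GPUs -> {new_gpus} GPUs")
```

The design point is that cost per token depends on the ratio of throughput to hourly cost, so a platform can cost more per rack and still be far cheaper per token.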

Consider real-time autonomous agents. When AI becomes a continuously running service that makes decisions proactively, rather than a one-off inference triggered by a user, the cost per million tokens directly determines whether the business model is viable. The same logic applies to million-token context inference, such as analyzing entire books or long meeting transcripts, or understanding full codebases, where a single request can consume up to a million tokens. A tenfold cost reduction moves these products from "demo-grade" to "scalable-grade."
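Why token prices dominate agent economics is easiest to see by accumulating tokens over a day. The sketch below assumes a hypothetical always-on agent averaging 500 tokens per second and two hypothetical price points one order of magnitude apart; neither figure comes from the article.

```python
SECONDS_PER_DAY = 86_400


def usd_for_tokens(tokens: float, usd_per_million: float) -> float:
    """Cost of processing a given number of tokens at a per-million price."""
    return tokens / 1_000_000 * usd_per_million


def agent_daily_cost(avg_tokens_per_second: float, usd_per_million: float) -> float:
    """An always-on agent accumulates tokens around the clock."""
    return usd_for_tokens(avg_tokens_per_second * SECONDS_PER_DAY, usd_per_million)


# Hypothetical price points one order of magnitude apart.
for price in (1.00, 0.10):
    daily = agent_daily_cost(avg_tokens_per_second=500, usd_per_million=price)
    one_request = usd_for_tokens(1_000_000, price)  # a single million-token context request
    print(f"${price:.2f}/1M tokens: agent ${daily:,.2f}/day, "
          f"million-token request ${one_request:.2f}")
```

At 500 tokens per second the agent consumes 43.2 million tokens a day, so its daily bill falls from $43.20 to $4.32 under these assumptions; multiplied across thousands of agents, that difference decides viability.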

TrendForce data shows that in 2026, North America’s five major CSPs are expected to increase AI inference compute by 122%, while training compute rises just 56%. Inference is growing more than twice as fast as training. This structural shift means Vera Rubin’s inference-focused performance optimization has strong commercial relevance, not just technical showmanship.
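The TrendForce growth rates can also be read as a shift in the mix of compute. The sketch below computes the ratio of the two growth rates and, starting from a hypothetical 50/50 split between inference and training (the article gives no starting mix), shows how one year at those rates moves the inference share.

```python
INFERENCE_GROWTH = 1.22  # +122% in 2026, TrendForce figure cited in the article
TRAINING_GROWTH = 0.56   # +56% in 2026


def inference_share_after_one_year(initial_inference_share: float) -> float:
    """Inference share of total compute after applying both growth rates."""
    inference = initial_inference_share * (1 + INFERENCE_GROWTH)
    training = (1 - initial_inference_share) * (1 + TRAINING_GROWTH)
    return inference / (inference + training)


growth_ratio = INFERENCE_GROWTH / TRAINING_GROWTH
# Hypothetical starting point: an even split between inference and training.
share = inference_share_after_one_year(0.5)
print(f"inference grows {growth_ratio:.2f}x as fast; share moves 50% -> {share:.1%}")
```

The growth-rate ratio is about 2.18, and under the 50/50 assumption inference rises to roughly 58.7% of the total after a single year.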

Early Signals from Cloud Deployment: CoreWeave’s Launch and Industry Chain Effects

CoreWeave announced successful cloud deployment of Vera Rubin on the very day mass production began, and the timing is worth unpacking. It points to several things at once: early delivery from the hardware supply chain, a ready software stack and operations, and exceptionally deep strategic alignment between CoreWeave and NVIDIA.

CoreWeave’s claim to be "first" is, however, open to dispute. Microsoft stated in March 2026 that it was the first hyperscale cloud provider to validate Vera Rubin NVL72 in the cloud. The distinction between "first to deploy" and "first to validate" shows how slippery first-mover claims are in AI infrastructure competition, where stakeholders can define the criteria differently.

From an industry chain perspective, CoreWeave’s Vera Rubin deployment is built on Dell Technologies’ PowerEdge XE9812 liquid-cooled servers, with a network architecture that supports both NVIDIA Quantum-X800 InfiniBand and Spectrum-X Ethernet. A multi-rail, multi-plane RoCE design delivers 1.6 Tb/s of backend bandwidth per GPU. Vera Rubin’s ecosystem readiness therefore extends beyond a single vendor, forming multi-layer collaboration from server OEMs to network equipment.
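The backend figure is quoted in terabits per second (Tb/s), while NVLink bandwidth is quoted in terabytes per second (TB/s). The sketch below converts between them, assuming the usual networking convention that lowercase "b" means bits, and aggregates the backend bandwidth over a 72-GPU rack to compare scale-out with in-rack scale-up bandwidth. Protocol and encoding overhead are ignored.

```python
GPUS_PER_RACK = 72
BACKEND_TBIT_PER_S_PER_GPU = 1.6     # scale-out, terabits per second
NVLINK_TBYTE_PER_S_PER_RACK = 260.0  # scale-up, terabytes per second


def terabits_to_terabytes(tbit: float) -> float:
    """8 bits per byte; ignores protocol and encoding overhead."""
    return tbit / 8


backend_per_gpu_tb = terabits_to_terabytes(BACKEND_TBIT_PER_S_PER_GPU)
backend_per_rack_tb = backend_per_gpu_tb * GPUS_PER_RACK
scale_up_to_out = NVLINK_TBYTE_PER_S_PER_RACK / backend_per_rack_tb
print(f"backend: {backend_per_gpu_tb * 1000:.0f} GB/s per GPU, "
      f"{backend_per_rack_tb:.1f} TB/s per rack; NVLink is ~{scale_up_to_out:.0f}x larger")
```

Under these assumptions each GPU gets about 200 GB/s of backend bandwidth (14.4 TB/s per rack), roughly one-eighteenth of the rack’s NVLink bandwidth, which is why keeping heavy communication inside the NVLink domain matters.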

CoreWeave will be officially added to the Russell 3000 Index on June 27, 2026. As of March 31, 2026, NVIDIA held about 11% of CoreWeave’s equity. According to FactSet, the median 2026 revenue forecast from 31 analysts for CoreWeave is $12.589 billion, with a median long-term forecast of $50.458 billion for 2029. This growth outlook is closely tied to Vera Rubin’s compute supply: the pace at which the new architecture is deployed will directly affect CoreWeave’s capacity expansion and revenue realization.
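The two FactSet medians imply a compound annual growth rate. The sketch below applies the standard CAGR formula, assuming three compounding periods between 2026 and 2029; the inputs are analyst forecasts, not reported results.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# FactSet analyst medians cited in the article, in billions of USD.
REVENUE_2026 = 12.589
REVENUE_2029 = 50.458

rate = cagr(REVENUE_2026, REVENUE_2029, years=2029 - 2026)
print(f"implied growth: {rate:.1%} per year over three years")
```

The forecasts imply revenue roughly quadrupling, or growth of about 59% a year, a pace that depends heavily on new capacity such as Vera Rubin coming online on schedule.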

Multi-Scenario Industry Impact: From Lower Inference Costs to Compute Organization Overhaul

Placing Vera Rubin NVL72’s launch in a broader industry context reveals three interconnected evolutionary paths unfolding simultaneously.

First is the evolution of compute supply and demand. The growth curve is shifting from "training-driven" to "inference-driven." Agentic AI’s need for continuous operation, low latency, and high throughput is expanding compute demand from a few ultra-large training clusters to distributed inference infrastructure networks. Supermicro’s Vera Rubin data center blueprint (scaling from 5 MW to 1 GW) responds to this shift—compute supply no longer needs to be monopolized by mega data centers; mid-sized AI factories can economically deploy top-tier compute.
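A rough sense of scale for the 5 MW to 1 GW range comes from dividing facility power by rack power. The sketch below assumes the blueprint figures are total facility power, that IT power equals total power divided by the quoted PUE of 1.1, and that all IT power goes to 440 kW NVL72 racks (ignoring storage, networking, and CPU-only nodes). These are simplifying assumptions, not Supermicro’s published sizing.

```python
import math

RACK_KW = 440.0  # per-rack draw quoted for Vera Rubin NVL72
PUE = 1.1        # facility PUE quoted for the fully liquid-cooled design
GPUS_PER_RACK = 72


def racks_supported(facility_mw: float, rack_kw: float = RACK_KW, pue: float = PUE) -> int:
    """Whole racks a facility can power if all IT power goes to NVL72 racks.

    Assumes `facility_mw` is total facility power, so IT power = total / PUE.
    """
    it_kw = facility_mw * 1000 / pue
    return math.floor(it_kw / rack_kw)


for mw in (5, 1000):  # the 5 MW and 1 GW ends of the blueprint range
    racks = racks_supported(mw)
    print(f"{mw:>5} MW -> {racks:,} racks, {racks * GPUS_PER_RACK:,} GPUs")
```

Under these assumptions a 5 MW site holds about 10 racks (720 GPUs), while a 1 GW campus holds about 2,066 racks (148,752 GPUs), which illustrates how rack-level units let mid-sized facilities deploy the same building block as gigawatt campuses.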

Second is the restructuring of industry competition. By iterating six chips simultaneously, NVIDIA is systematically raising barriers to entry. For a potential competitor, cracking GPU design is only the first step; it must also solve coordinated optimization across CPUs, interconnects, DPUs, NICs, and switch chips. As this stack grows more complex and deeper, the pressure on rivals trying to catch up keeps increasing.

Third is the change in commercial conditions for AI applications. Lower inference costs may make previously uneconomical scenarios viable, especially those requiring long-running, continuous AI workloads. However, this pass-through is not automatic: software stack adaptation, the compatibility of model architectures with the new hardware, and cloud pricing strategies will all affect whether the benefits of lower inference costs are fully realized at the application layer.

In scenario terms, the baseline case (highest probability) is that inference costs decline linearly along a predictable path, driving continuous optimization of AI application cost structures, with systemic improvement arriving between 2027 and 2028. In the aggressive case (moderate probability), the market prices in the decline early: compute procurement criteria shift from "peak performance" to "tokens-per-watt throughput" and "cost per million tokens," racks replace servers as the smallest unit of compute, and the cloud providers that adapt at the system level earliest gain a clear first-mover advantage. The risk case (lower probability but not negligible) involves problems in mass production or supply chain stability, namely HBM4 supply, cooling component capacity, and the synchronized delivery of six chips. A delay in any one link could slow market penetration.

Conclusion

The launch of Vera Rubin NVL72 is shifting the logic of AI compute competition from "chip iteration" to "system integration." Six-chip co-design, the rack-as-computer approach, and an order-of-magnitude drop in inference costs together drive this new wave of change in compute. Blackwell opened the window for rack-level computing; Vera Rubin aims to push that approach to its limit, offering not just faster GPUs but a redefinition of how AI compute is organized, deployed, and priced.

For market players, the key variables are no longer "how fast is the next GPU," but "how quickly will the benefits of lower inference costs reach the application layer," and "to what extent will changes in compute organization reshape data center design and cloud provider competition." Vera Rubin NVL72’s industry-wide collaborative validation is providing initial answers, but the real-world efficiency after large-scale deployment, supply chain stability, and downstream demand absorption still require ongoing observation.

FAQ

What are the core improvements of Vera Rubin NVL72 compared to Blackwell?

Vera Rubin NVL72 delivers rack-level inference power of 3.6 EFLOPS—2.5 times that of Blackwell NVL72 (1.44 EFLOPS)—and reduces inference cost per million tokens to about one-tenth.

Why is Vera Rubin’s per-GPU training gain (3.5x) smaller than its inference gain (5x)?

This difference reflects NVIDIA’s strategic view of industry trends—inference workloads are now growing faster than training, and the new architecture is optimized more aggressively for inference scenarios.

What does it mean for CoreWeave to be the first cloud provider to deploy Vera Rubin?

CoreWeave’s engineering collaboration with NVIDIA goes well beyond a traditional supplier-customer relationship, and its first deployment validates the readiness of Vera Rubin’s software stack and operations.

What does a 100% liquid cooling solution mean for data centers?

Vera Rubin NVL72’s full liquid cooling lowers PUE from about 1.25 (Blackwell) to about 1.1, resulting in significant savings in electricity and cooling infrastructure at thousand-rack deployment scale.

What supply chain risks does Vera Rubin face in mass production?

HBM4 memory is mainly supplied by Samsung Electronics and SK Hynix; ramp-up speed for cooling components and synchronized delivery of six chips could all limit market penetration.

What new application scenarios will a tenfold drop in inference costs enable?

Continuous operation of real-time agents, million-token long-context inference, and large-scale distributed inference deployments, all previously unviable because of high cumulative token costs, will become economically feasible.

What impact will CoreWeave’s inclusion in the Russell 3000 Index have?

Inclusion in the Russell 3000 Index will bring allocations from passive funds and ETFs that track the index, increasing CoreWeave’s accessibility and liquidity among institutional investors.

Has Vera Rubin’s architecture changed the investment logic for AI infrastructure?

Investment logic is shifting from "single-card performance races" to "system-level efficiency competition," with rack-level compute density, tokens per watt throughput, and cost per million tokens becoming the core metrics.

The content herein does not constitute any offer, solicitation, or recommendation. You should always seek independent professional advice before making any investment decisions. Please note that Gate may restrict or prohibit the use of all or a portion of the Services from Restricted Locations. For more information, please read the User Agreement