Nvidia plans to release a "mysterious chip" that may be a new architecture designed specifically for inference
The NVIDIA GTC conference in San Jose, California, scheduled for mid-March, is one of the most anticipated events in the AI field. Jensen Huang has previously announced that a “never-before-seen” new chip will be unveiled there.
Since then, capital markets have been buzzing with speculation. The mainstream view is that the chip expected to be announced at GTC is most likely a new inference product incorporating Groq’s LPU (Language Processing Unit) design.
According to Zhuang Changlei, director of AI/Intelligent Manufacturing at Yunxiu Capital, the new product is unlikely to be an “accelerator plugin” for existing GPUs: “if it functions as a plugin for existing GPUs, data transfer would still need to go through external interfaces like PCIe or NVLink, which would introduce additional latency, partially offsetting the low-latency advantage of SRAM.”
He further added, “A more ideal solution is to create a new computing architecture centered around SRAM, designed specifically for inference, like Cerebras.”
The Era of Inference Is Coming
With the rise of a new generation of agent applications represented by “OpenClaw,” global demand for computing power is undergoing a significant shift, moving market focus from training to inference.
According to Deloitte’s “2026 Technology, Media, and Telecommunications Industry Forecast,” by 2026 “inference” (running AI models) will account for two-thirds of all AI computing capacity. Moreover, billions of dollars’ worth of inference-optimized chips are expected to be deployed in data centers and enterprise servers, with some consuming as much power as, or more than, general-purpose AI chips.
Industry insiders reportedly speculate that, beyond NVIDIA’s expected official unveiling of core technical details for the Rubin and next-generation Feynman architecture GPUs, the biggest highlight of this conference is likely to be a new inference chip integrating LPU technology.
If it is indeed a new inference system incorporating Groq’s LPU technology, this would mark the first time NVIDIA has introduced an external architecture into its core AI computing product line at scale.
CITIC Securities noted that NVIDIA previously launched Rubin CPX to reduce the cost of serving prefill workloads; after acquiring Groq, the company may this time introduce an LPU or “LPU-like” chip to improve decode efficiency.
In inference, models typically go through two stages: in the prefill phase, the user’s input is processed; in the decode phase, output is generated token by token.
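A minimal sketch of this two-stage flow is shown below; `model.prefill` and `model.decode_step` are hypothetical placeholders used for illustration, not any specific NVIDIA or Groq API.

```python
# Minimal sketch of the two inference stages; `model` and its methods
# are hypothetical placeholders, not a real NVIDIA/Groq API.

def generate(model, prompt_tokens, max_new_tokens):
    # Prefill: process the entire user prompt in one parallel pass and
    # build the KV cache. Largely compute-bound; runs once per request.
    kv_cache, next_token = model.prefill(prompt_tokens)

    output = [next_token]
    # Decode: produce one token per step; every step must re-read the
    # model weights (and KV cache) from memory, so it is bandwidth-bound.
    for _ in range(max_new_tokens - 1):
        kv_cache, next_token = model.decode_step(next_token, kv_cache)
        output.append(next_token)
    return output
```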
The user’s inference experience hinges on the speed and latency of the decode phase. In GPU-based inference architectures, most model parameters are stored in HBM, so each decode step requires frequent data transfers between the compute cores and HBM, which limits how quickly the model can decode.
Groq’s LPU is designed specifically for inference acceleration, using SRAM storage units located much closer to the compute cores to hold model parameters. For example, 230MB of on-chip SRAM can provide memory bandwidth of up to 80TB/s, far surpassing GPU architectures in data throughput.
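To see why bandwidth dominates decode speed, a back-of-envelope estimate helps: each generated token must stream the model weights from memory, so memory bandwidth divided by the weight footprint gives a rough ceiling on tokens per second. The figures in the sketch below (a 70B-parameter model at 1 byte per parameter, roughly 3.35 TB/s HBM-class bandwidth versus 80 TB/s SRAM-class bandwidth) are illustrative assumptions, not measurements, and the SRAM case assumes the weights actually fit on-chip.

```python
# Rough decode ceiling: tokens/s <= memory bandwidth / bytes streamed per token.
# All numbers are illustrative assumptions, not measured values.

def max_decode_tokens_per_s(params_billion, bytes_per_param, bandwidth_tb_s):
    bytes_per_token = params_billion * 1e9 * bytes_per_param  # weights read each step
    return bandwidth_tb_s * 1e12 / bytes_per_token

# 70B parameters at 1 byte/param (e.g. FP8), single device:
print(max_decode_tokens_per_s(70, 1, 3.35))  # ~48 tok/s with HBM-class bandwidth
print(max_decode_tokens_per_s(70, 1, 80))    # ~1,143 tok/s with SRAM-class bandwidth
                                             # (only if the weights fit on-chip)
```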
However, physically replacing HBM entirely with SRAM is not feasible.
Zhuang Changlei explained that for large models with hundreds of billions or trillions of parameters, pure SRAM solutions cannot meet capacity requirements. So how might NVIDIA innovate?
The likely answer is not “replacement” but “stacking.” He said, “According to industry sources, NVIDIA may adopt a technology similar to AMD’s 3D V-Cache, using TSMC’s SoIC (System on Integrated Chip) hybrid bonding technology to directly 3D stack LPU units (language processing units) containing large amounts of SRAM onto the GPU core wafer.”
Supply Chain Changes May Occur
Major manufacturers such as AMD have already made moves toward 3D stacking. In 2021, AMD announced its 3D V-Cache technology, which vertically stacks an additional 7nm SRAM die on top of Ryzen compute chiplets, significantly increasing L3 cache capacity. In July 2024, Fujitsu introduced its MONAKA processor using 3D SRAM technology, with shipments planned for 2027.
Will this approach become mainstream?
“On-chip SRAM faces process-scaling issues: it shrinks more slowly than logic circuits, so SRAM on a single die takes up a larger share of area and cost,” noted Dongfang Securities, and some investors therefore believe SRAM architectures are unlikely to become the main memory solution for AI chips. Zhuang Changlei, however, believes that 3D SRAM stacking can raise density by stacking storage units vertically, overcoming the capacity limits that area density imposes on traditional SRAM. If AI inference calls for higher-capacity SRAM, 3D stacking could expand its range of applications.
CITIC Securities also believes that future GPUs and NPUs may adopt 3D stacked SRAM to achieve a leap in memory bandwidth, leveraging the advantages of LPU while maintaining existing software ecosystems and preserving the strengths of GPU and NPU architectures.
Zhuang Changlei pointed out that complex AI chips might need both approaches: first stacking the LPU and GPU dies with SoIC, then packaging that stacked die alongside HBM using CoWoS. For certain specialized inference chips that do not need large HBM capacity, relying solely on 3D-stacked SRAM and bypassing CoWoS is feasible; however, these chips target niche markets and are unlikely to challenge the mainstream status of HBM + CoWoS.
SRAM 3D stacking (like TSMC’s SoIC) requires precise wafer-to-wafer bonding during manufacturing, closely tied to front-end process technology. This further shifts value from backend packaging to earlier manufacturing stages.
On one hand, the value of advanced processes is further amplified. Zhuang Changlei noted that to achieve maximum interconnect density and energy efficiency in vertical stacking, the bottom-layer compute wafers must use the most advanced processes (such as A16), increasing reliance on cutting-edge technology.
On the other hand, if the value of high-end chips continues to shift toward front-end manufacturing and advanced packaging, domestic packaging and testing companies may face risks of being pushed out of the high-end market. He believes this also presents opportunities for domestic companies to differentiate, such as providing mature, cost-effective 3D stacking solutions for chips that do not require the most advanced processes, or establishing new technical barriers in testing, cooling, and reliability analysis of 3D stacked chips.