Google releases Gemini Embedding 2, its first native multimodal embedding model: enabling machines to "understand" information


IT Home, March 11 — Google announced the release of the new Gemini Embedding 2 model early this morning, Beijing time. It is Google's first native multimodal embedding model, capable of mapping text, images, videos, and documents into a single shared embedding space.

Embedding models differ from generative models: while models like Gemini 3 are mainly used to generate content, embedding models are used to understand data. They convert text, images, or videos into mathematical representations such as vectors, making the data easier for machines to search and analyze.

Through semantic search, classification, and clustering, these models capture semantic relationships, often returning more accurate and context-aware results than traditional keyword search.
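The core idea behind semantic search over embeddings can be sketched with cosine similarity. The toy three-dimensional vectors below are purely illustrative (real Gemini embeddings have far more dimensions, and would come from the API rather than being written by hand):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for two documents; in practice these vectors would be
# returned by an embedding model, not hard-coded.
corpus = {
    "a photo of a cat":  [0.9, 0.1, 0.0],
    "quarterly revenue": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # toy embedding of a cat-related query

# Semantic search = nearest neighbor in embedding space, not keyword matching.
best = max(corpus, key=lambda doc: cosine_similarity(query, corpus[doc]))
print(best)  # → a photo of a cat
```

Because similarity is computed in vector space, the query matches the cat document even though they share no keywords.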

According to IT Home, Google's earliest embedding models supported only text. Gemini Embedding 2 now supports text, images, videos, audio, and documents, and can recognize semantic intent in 100 languages.

Processing limitations for different data types are as follows:

  • Text: Up to 8,192 tokens in the context window
  • Images: Up to 6 images per request, supporting PNG and JPEG formats
  • Videos: Up to 120 seconds input, supporting MP4 and MOV formats
  • Audio: Can process audio data directly without transcription
  • Documents: Supports up to 6 pages of PDF
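These limits could be enforced client-side before a request is sent. The pre-flight check below is a sketch: only the limit values come from the list above, while the function name, argument shape, and error messages are assumptions of this illustration:

```python
# Limit values taken from the article; everything else here is illustrative.
LIMITS = {
    "max_text_tokens": 8192,
    "max_images": 6,
    "image_formats": {"png", "jpeg"},
    "max_video_seconds": 120,
    "video_formats": {"mp4", "mov"},
    "max_pdf_pages": 6,
}

def validate_request(text_tokens=0, image_formats=(), video_seconds=0,
                     video_format=None, pdf_pages=0):
    """Return a list of limit violations (empty list = request looks OK)."""
    errors = []
    if text_tokens > LIMITS["max_text_tokens"]:
        errors.append("text exceeds the 8,192-token context window")
    if len(image_formats) > LIMITS["max_images"]:
        errors.append("more than 6 images in one request")
    for fmt in image_formats:
        if fmt not in LIMITS["image_formats"]:
            errors.append(f"unsupported image format: {fmt}")
    if video_seconds > LIMITS["max_video_seconds"]:
        errors.append("video longer than 120 seconds")
    if video_format is not None and video_format not in LIMITS["video_formats"]:
        errors.append(f"unsupported video format: {video_format}")
    if pdf_pages > LIMITS["max_pdf_pages"]:
        errors.append("PDF longer than 6 pages")
    return errors

print(validate_request(text_tokens=500, image_formats=("png", "jpeg")))  # → []
print(validate_request(video_seconds=300, video_format="avi"))  # two violations
```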

Google stated in a blog that the new model simplifies complex data processing workflows and enhances multimodal application capabilities. Use cases include retrieval-augmented generation (RAG), semantic search, sentiment analysis, and data clustering.

The model can also accept multiple input types simultaneously in a single request, such as “image + text,” to analyze relationships between different media types.
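A mixed "image + text" request might be structured as a list of typed parts in a single payload. The field names below ("model", "content", "parts") loosely mirror common Gemini REST conventions but are assumptions of this sketch, not a documented schema:

```python
# Hypothetical single-request payload carrying two modalities at once.
# Field names and file path are illustrative, not a documented API schema.
request = {
    "model": "gemini-embedding-2-preview",
    "content": {
        "parts": [
            {"text": "Which exhibit shows the damaged shipment?"},
            {"image": {"mime_type": "image/jpeg", "path": "exhibits/photo_12.jpg"}},
        ]
    },
}

# Both modalities travel in one request, so the model can embed them jointly
# and capture the relationship between the question and the image.
modalities = [next(iter(part)) for part in request["content"]["parts"]]
print(modalities)  # → ['text', 'image']
```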

For example, in litigation evidence discovery, Gemini embedding models can help legal professionals quickly locate key evidence. Testing shows that multimodal embeddings improve retrieval accuracy and recall across millions of records, and enhance image and video search results.

Gemini Embedding 2 (gemini-embedding-2-preview) is now available in public preview via the Gemini API and Vertex AI. Meanwhile, gemini-embedding-001 remains available for text-only applications.
