Google releases Gemini Embedding 2, its first native multimodal embedding model: enabling machines to "understand" information


IT Home, March 11 — Google announced the release of the new Gemini Embedding 2 model early this morning, Beijing time. It is Google's first native multimodal embedding model, capable of mapping text, images, videos, and documents into a single shared embedding space.

Embedding models differ from generative models: while models like Gemini 3 are mainly used to generate content, embedding models are used to understand data. They convert text, images, or videos into mathematical representations such as vectors, making the data easier for machines to search and analyze.

Through semantic search, classification, and clustering, these models capture semantic relationships, often returning more accurate and context-aware results than traditional keyword search.
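The core idea behind semantic search over embeddings can be sketched with cosine similarity. The toy three-dimensional vectors below are purely illustrative (real Gemini embeddings have far more dimensions, and would come from the API rather than being written by hand):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for two documents; in practice these vectors would be
# returned by an embedding model, not hard-coded.
corpus = {
    "a photo of a cat":  [0.9, 0.1, 0.0],
    "quarterly revenue": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # toy embedding of a cat-related query

# Semantic search = nearest neighbor in embedding space, not keyword matching.
best = max(corpus, key=lambda doc: cosine_similarity(query, corpus[doc]))
print(best)  # → a photo of a cat
```

Because similarity is computed in vector space, the query matches the cat document even though they share no keywords.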

According to IT Home, Google's earliest embedding models supported only text. Gemini Embedding 2 now supports text, images, videos, audio, and documents, and can recognize semantic intent in 100 languages.

Processing limitations for different data types are as follows:

  • Text: Up to 8,192 tokens in the context window
  • Images: Up to 6 images per request, supporting PNG and JPEG formats
  • Videos: Up to 120 seconds input, supporting MP4 and MOV formats
  • Audio: Can process audio data directly without transcription
  • Documents: Supports up to 6 pages of PDF
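These limits could be enforced client-side before a request is sent. The pre-flight check below is a sketch: only the limit values come from the list above, while the function name, argument shape, and error messages are assumptions of this illustration:

```python
# Limit values taken from the article; everything else here is illustrative.
LIMITS = {
    "max_text_tokens": 8192,
    "max_images": 6,
    "image_formats": {"png", "jpeg"},
    "max_video_seconds": 120,
    "video_formats": {"mp4", "mov"},
    "max_pdf_pages": 6,
}

def validate_request(text_tokens=0, image_formats=(), video_seconds=0,
                     video_format=None, pdf_pages=0):
    """Return a list of limit violations (empty list = request looks OK)."""
    errors = []
    if text_tokens > LIMITS["max_text_tokens"]:
        errors.append("text exceeds the 8,192-token context window")
    if len(image_formats) > LIMITS["max_images"]:
        errors.append("more than 6 images in one request")
    for fmt in image_formats:
        if fmt not in LIMITS["image_formats"]:
            errors.append(f"unsupported image format: {fmt}")
    if video_seconds > LIMITS["max_video_seconds"]:
        errors.append("video longer than 120 seconds")
    if video_format is not None and video_format not in LIMITS["video_formats"]:
        errors.append(f"unsupported video format: {video_format}")
    if pdf_pages > LIMITS["max_pdf_pages"]:
        errors.append("PDF longer than 6 pages")
    return errors

print(validate_request(text_tokens=500, image_formats=("png", "jpeg")))  # → []
print(validate_request(video_seconds=300, video_format="avi"))  # two violations
```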

Google stated in a blog that the new model simplifies complex data processing workflows and enhances multimodal application capabilities. Use cases include retrieval-augmented generation (RAG), semantic search, sentiment analysis, and data clustering.

The model can also accept multiple input types simultaneously in a single request, such as “image + text,” to analyze relationships between different media types.
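A mixed "image + text" request might be structured as a list of typed parts in a single payload. The field names below ("model", "content", "parts") loosely mirror common Gemini REST conventions but are assumptions of this sketch, not a documented schema:

```python
# Hypothetical single-request payload carrying two modalities at once.
# Field names and file path are illustrative, not a documented API schema.
request = {
    "model": "gemini-embedding-2-preview",
    "content": {
        "parts": [
            {"text": "Which exhibit shows the damaged shipment?"},
            {"image": {"mime_type": "image/jpeg", "path": "exhibits/photo_12.jpg"}},
        ]
    },
}

# Both modalities travel in one request, so the model can embed them jointly
# and capture the relationship between the question and the image.
modalities = [next(iter(part)) for part in request["content"]["parts"]]
print(modalities)  # → ['text', 'image']
```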

For example, in litigation evidence discovery, Gemini embedding models can help legal professionals quickly locate key evidence. Testing shows that multimodal embeddings improve retrieval accuracy and recall across millions of records, and enhance image and video search results.

Gemini Embedding 2 (gemini-embedding-2-preview) is now available in public preview via the Gemini API and Vertex AI. Meanwhile, gemini-embedding-001 remains available for text-only applications.
