⬅️ All-in-one embedding model for interleaved text, images, and screenshots

djoldman 7 daysReload

This is a key observation that is simple and intuitive:

>All CLIP-like models perform poorly on mixed-modality search due to a phenomenon known as the modality gap. As illustrated in the figure below, the closest vector to the snippet “I address you, members of the Seventy-Seventh Congress…” is not its screenshot, but other texts. This leads to search results that are skewed towards items of the same modality; in other words, text vectors will be closer to irrelevant texts than relevant images in the embedding space.

jonathan-adly 7 daysReload

If you are interested in that space, would throw our project in the mix which uses ColPali under the hood transparently.

https://github.com/tjmlabs/ColiVara

The main benchmark for this is the Vidore leaderboard. Where we would love to see where VoyageAI performs compared to the more open-source implementations.

FergusArgyll 7 daysReload

I'm missing something. Shouldn't any llm that's 'natively multimodal' somehow include embeddings which are multi-modal? for ex here's googles blogpost on Gemini

  Until now, the standard approach to creating multimodal models involved 
  training separate components for different modalities and then stitching them 
  together to roughly mimic some of this functionality. These models can 
  sometimes be good at performing certain tasks, like describing images, but  
  struggle with more conceptual and complex reasoning.

  We designed Gemini to be natively multimodal, pre-trained from the start on 
  different modalities. Then we fine-tuned it with additional multimodal data to 
  further refine its effectiveness. This helps Gemini seamlessly understand and 
  reason about all kinds of inputs from the ground up, far better than existing 
  multimodal models — and its capabilities are state of the art in nearly every 
  domain.

greatgib 7 daysReload

Indeed, sad that their models are both commercial proprietary and API only.

carschno 7 daysReload

This does read very impressive. Any critical perspectives on the presented evaluation? What about noon-English text?

I understand the model is, like for other commercial ones, available exclusively through their API, right?

FergusArgyll 7 daysReload

I'm missing something. Shouldn't any llm that's 'natively multimodal' somehow include embeddings which are multi-modal? for ex here's googles blogpost on Gemini

  Until now, the standard approach to creating multimodal models involved 
  training separate components for different modalities and then stitching them 
  together to roughly mimic some of this functionality. These models can 
  sometimes be good at performing certain tasks, like describing images, but  
  struggle with more conceptual and complex reasoning.

  We designed Gemini to be natively multimodal, pre-trained from the start on 
  different modalities. Then we fine-tuned it with additional multimodal data to 
  further refine its effectiveness. This helps Gemini seamlessly understand and 
  reason about all kinds of inputs from the ground up, far better than existing 
  multimodal models — and its capabilities are state of the art in nearly every 
  domain.

djoldman 7 daysReload

This is a key observation that is simple and intuitive: