Matryoshka Multimodal Models


Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed images into a fixed, large number of visual tokens and then feed them into a Large Language Model (LLM). However, this design produces an excessive number of tokens for dense visual scenarios such as high-resolution images and videos, leading to great inefficiency. While token pruning/merging methods do exist, they produce a single-length output for each image and do not afford flexibility in trading off information density vs. efficiency. Inspired by the concept of Matryoshka Dolls, we propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens that capture information across multiple coarse-to-fine granularities. Our approach offers several unique benefits for LMMs: (1) One can explicitly control the visual granularity per test instance during inference, e.g., adjusting the number of tokens used to represent an image based on the anticipated complexity or simplicity of the content; (2) M3 provides a framework for analyzing the granularity needed for existing datasets, where we find that COCO-style benchmarks need only ~9 visual tokens to obtain accuracy similar to that of using all 576 tokens; (3) Our approach provides a foundation to explore the best trade-off between performance and visual token length at the sample level, where our investigation reveals that a large gap exists between the oracle upper bound and current fixed-scale representations.
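To make the "nested sets of visual tokens" idea concrete, here is a minimal sketch of how coarse-to-fine token sets could be derived from a single token grid. It assumes (as an illustration, not the paper's exact implementation) that the vision encoder yields a 24×24 grid of 576 visual tokens, and that each coarser level is obtained by 2×2 average pooling, giving the nested counts 576 → 144 → 36 → 9 → 1 mentioned in the abstract:

```python
import numpy as np

def pool2x2(grid):
    """Average-pool a (H, W, d) token grid down to (H/2, W/2, d)."""
    h, w, d = grid.shape
    return grid.reshape(h // 2, 2, w // 2, 2, d).mean(axis=(1, 3))

d = 8                                   # toy feature dimension (assumed)
tokens = np.random.rand(24, 24, d)      # 576 visual tokens on a 24x24 grid

# Build the nested scales: 24x24 -> 12x12 -> 6x6 -> 3x3 ...
scales = [tokens]
while scales[-1].shape[0] > 3:
    scales.append(pool2x2(scales[-1]))
# ... and finally average the 3x3 grid into a single global token.
scales.append(scales[-1].mean(axis=(0, 1)).reshape(1, 1, d))

counts = [s.shape[0] * s.shape[1] for s in scales]
print(counts)  # [576, 144, 36, 9, 1]
```

At inference, one would then pick a scale per test instance (e.g. `scales[3]`, the 9-token level, for COCO-style benchmarks) and feed only those tokens to the LLM, which is the flexibility the single-length pruning/merging methods lack.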

https://arxiv.org/abs/2405.17430

https://arxiv.org/pdf/2405.17430.pdf
