LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding


Visual grounding is an essential tool that links user-provided text queries with query-specific regions within an image. Despite advancements in visual grounding models, their ability to comprehend complex queries remains limited. To overcome this limitation, we introduce LLM-Optic, an innovative method that utilizes Large Language Models (LLMs) as an optical lens to enhance existing visual grounding models in comprehending complex text queries involving intricate text structures, multiple objects, or object spatial relationships, situations that current models struggle with. LLM-Optic first employs an LLM as a Text Grounder to interpret complex text queries and accurately identify objects the user intends to locate. Then a pre-trained visual grounding model is used to generate candidate bounding boxes given the refined query by the Text Grounder. After that, LLM-Optic annotates the candidate bounding boxes with numerical marks to establish a connection between text and specific image regions, thereby linking two distinct modalities. Finally, it employs a Large Multimodal Model (LMM) as a Visual Grounder to select the marked candidate objects that best correspond to the original text query. Through LLM-Optic, we have achieved universal visual grounding, which allows for the detection of arbitrary objects specified by arbitrary human language input. Importantly, our method achieves this enhancement without requiring additional training or fine-tuning. Extensive experiments across various challenging benchmarks demonstrate that LLM-Optic achieves state-of-the-art zero-shot visual grounding capabilities.
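The three-stage pipeline described in the abstract (Text Grounder → candidate box generation → numbered marks → Visual Grounder) can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: every function body is a stub, and all names and signatures here are assumptions, since the abstract does not specify the concrete LLM, grounding model, or LMM interfaces.

```python
# Hypothetical sketch of the LLM-Optic pipeline; all bodies are stubs.

def text_grounder(query: str) -> str:
    """Stage 1: an LLM interprets the complex query and extracts the
    object(s) the user intends to locate (stubbed with a trivial rule)."""
    # e.g. "the dog to the left of the red car" -> "dog"
    return query.split()[1] if query.startswith("the ") else query

def visual_grounding_model(image, refined_query: str):
    """Stage 2: a pre-trained visual grounding model proposes candidate
    bounding boxes for the refined query (stubbed with fixed boxes)."""
    return [(10, 10, 50, 50), (60, 20, 90, 70)]  # (x1, y1, x2, y2)

def mark_candidates(boxes):
    """Stage 3a: annotate each candidate box with a numerical mark so an
    LMM can refer to specific image regions by number."""
    return {i + 1: box for i, box in enumerate(boxes)}

def visual_grounder(image, marked, original_query: str) -> int:
    """Stage 3b: an LMM selects the mark best matching the original
    query (stubbed to pick the lowest-numbered mark)."""
    return min(marked)

def llm_optic(image, query: str):
    """End-to-end: refine the query, propose boxes, mark them, choose one."""
    refined = text_grounder(query)
    boxes = visual_grounding_model(image, refined)
    marked = mark_candidates(boxes)
    choice = visual_grounder(image, marked, query)
    return marked[choice]
```

The key design point conveyed by the abstract is that the numerical marks bridge the two modalities: the LMM never outputs raw coordinates, it only names a mark, which is then mapped back to a box.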

https://arxiv.org/abs/2405.17104

https://arxiv.org/pdf/2405.17104.pdf
