NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
Decoder-only large language model (LLM)-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. In this work, we introduce the NV-Embed model with a variety of architectural designs and training procedures to significantly enhance the performance of LLM as a versatile embedding model, while maintaining its simplicity and reproducibility.
For model architecture, we propose a latent attention layer to obtain pooled embeddings, which consistently improves retrieval and downstream task accuracy compared to mean pooling or using the last <EOS> token embedding from LLMs. To enhance representation learning, we remove the causal attention mask of LLMs during contrastive training. For model training, we introduce a two-stage contrastive instruction-tuning method. It first applies contrastive training with instructions on retrieval datasets, utilizing in-batch negatives and curated hard negative examples. In stage two, it blends various non-retrieval datasets into the instruction tuning, which not only enhances non-retrieval task accuracy but also improves retrieval performance. Combining these techniques, our NV-Embed model, trained with only publicly available data, achieves a record-high score of 69.32, ranking No. 1 on the Massive Text Embedding Benchmark (MTEB) as of May 24, 2024. (Note: the latest top score on the MTEB leaderboard is currently 69.65.)
Notably, our model also attains the highest score of 59.36 on the 15 retrieval tasks within MTEB (this subset is also known as the BEIR benchmark). We open-source the model at: this https URL.
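To make the latent attention layer concrete, below is a minimal PyTorch sketch of pooling in that style: token hidden states from the LLM's last layer attend over a small trainable latent array, the result passes through an MLP, and a masked mean pool produces the final embedding. This is an illustration of the abstract's description, not the released implementation; the latent count, MLP width, residual connection, and final L2 normalization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAttentionPooling(nn.Module):
    """Sketch of latent-attention pooling: token states query a
    trainable latent array; the output is MLP-transformed and
    mean-pooled. Sizes are illustrative, not the paper's settings."""

    def __init__(self, hidden_dim: int = 4096, num_latents: int = 512):
        super().__init__()
        # Trainable latent array shared across inputs (used as keys and values).
        self.latents = nn.Parameter(torch.randn(num_latents, hidden_dim) * 0.02)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim), last-layer LLM outputs
        # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
        scores = hidden_states @ self.latents.T / hidden_states.shape[-1] ** 0.5
        attn = F.softmax(scores, dim=-1)            # (batch, seq_len, num_latents)
        attended = attn @ self.latents              # (batch, seq_len, hidden_dim)
        out = attended + self.mlp(attended)         # residual MLP (an assumption here)
        # Mask padding, then mean-pool over the sequence dimension.
        mask = attention_mask.unsqueeze(-1).to(out.dtype)
        pooled = (out * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return F.normalize(pooled, dim=-1)          # unit norm for cosine similarity
```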
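The stage-one objective, contrastive training with in-batch negatives and curated hard negatives, can be sketched as an InfoNCE-style loss. The function below is a hypothetical rendering, not the paper's exact formulation; the temperature value, a single hard negative per query, and the assumption that `q`, `pos`, and `hard_neg` are L2-normalized embeddings are all illustrative choices.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q: torch.Tensor, pos: torch.Tensor, hard_neg: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss with in-batch negatives plus mined hard negatives.

    q:        (batch, dim) query embeddings, L2-normalized
    pos:      (batch, dim) positive passage embeddings
    hard_neg: (batch, dim) one curated hard negative per query
    """
    # Each query scores against every positive in the batch: the diagonal
    # entry is its own positive; off-diagonal entries act as in-batch negatives.
    in_batch = q @ pos.T                                 # (batch, batch)
    hard = (q * hard_neg).sum(dim=-1, keepdim=True)      # (batch, 1)
    logits = torch.cat([in_batch, hard], dim=1) / temperature
    labels = torch.arange(q.size(0), device=q.device)    # positive index = diagonal
    return F.cross_entropy(logits, labels)
```

In stage two, per the abstract, non-retrieval datasets are blended into the same instruction-tuning recipe; the abstract does not specify how the negative sampling changes there, so the loss above should be read as a stage-one sketch only.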
https://arxiv.org/abs/2405.17428
https://arxiv.org/pdf/2405.17428.pdf