ReMoDetect: Reward Models Recognize Aligned LLM’s Generations


The remarkable capabilities and easy accessibility of large language models (LLMs) have significantly increased societal risks (e.g., fake news generation), necessitating the development of LLM-generated text (LGT) detection methods for safe usage. However, detecting LGTs is challenging due to the vast number of LLMs, making it impractical to account for each LLM individually; hence, it is crucial to identify the common characteristics shared by these models. In this paper, we draw attention to a common feature of recent powerful LLMs, namely alignment training, i.e., training LLMs to generate human-preferable texts. Our key finding is that, because these aligned LLMs are trained to maximize human preference, they generate texts with even higher estimated preferences than human-written texts; thus, such texts are easily detected using a reward model (i.e., an LLM trained to model the human preference distribution). Based on this finding, we propose two training schemes to further improve the detection ability of the reward model: (i) continual preference fine-tuning, which makes the reward model prefer aligned LGTs even more strongly, and (ii) reward modeling of human/LLM mixed texts (texts rephrased from human-written texts using aligned LLMs), which serve as a median-preference text corpus between LGTs and human-written texts for learning a better decision boundary. We provide an extensive evaluation covering six text domains across twelve aligned LLMs, where our method demonstrates state-of-the-art results. Code is available at this https URL.
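As a rough illustration of the core idea, the sketch below scores a text with an off-the-shelf reward model and thresholds the scalar reward as a detection statistic, and includes a pairwise-ranking loss in the spirit of the proposed continual preference fine-tuning, where aligned LGTs should receive higher rewards than human/LLM mixed texts, which in turn should exceed human-written texts. The reward model name, the threshold calibration, and the exact loss form are assumptions for illustration only, not the authors' released implementation.

```python
# Sketch: use a reward model's scalar output as an LGT detection statistic.
# Assumes the OpenAssistant DeBERTa reward model as a stand-in; the paper's
# detector further fine-tunes a reward model, which this sketch does not reproduce.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed stand-in
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def reward_score(text: str) -> float:
    """Return the scalar estimated human-preference reward for a text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    return model(**inputs).logits.squeeze().item()

def detect_lgt(text: str, threshold: float) -> bool:
    """Flag a text as LLM-generated if its reward exceeds a threshold
    calibrated on held-out human-written vs. LLM-generated texts."""
    return reward_score(text) > threshold

def continual_preference_loss(r_lgt: torch.Tensor,
                              r_mixed: torch.Tensor,
                              r_human: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style pairwise loss (an assumed form) encouraging
    reward(aligned LGT) > reward(human/LLM mixed text) > reward(human text)."""
    loss = -F.logsigmoid(r_lgt - r_mixed) - F.logsigmoid(r_mixed - r_human)
    return loss.mean()
```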

https://arxiv.org/abs/2405.17382

https://arxiv.org/pdf/2405.17382.pdf
