Efficient Ensembles Improve Training Data Attribution

Training data attribution (TDA) methods aim to quantify the influence of individual training data points on a model's predictions, with broad applications in data-centric AI, such as mislabel detection, data selection, and copyright compensation. However, existing methods in this field, which can be categorized as retraining-based and gradient-based, struggle with a trade-off between computational efficiency and attribution efficacy. Retraining-based methods can accurately attribute complex non-convex models but are computationally prohibitive, while gradient-based methods are efficient but often fail on non-convex models. Recent research has shown that augmenting gradient-based methods with ensembles of multiple independently trained models can achieve significantly better attribution efficacy. However, this approach remains impractical for very large-scale applications. In this work, we discover that expensive, fully independent training is unnecessary for ensembling gradient-based methods, and we propose two efficient ensemble strategies, DROPOUT ENSEMBLE and LORA ENSEMBLE, as alternatives to the naive independent ensemble. These strategies significantly reduce training time (by up to 80%), serving time (by up to 60%), and space cost (by up to 80%) while maintaining attribution efficacy similar to that of the naive independent ensemble. Our extensive experimental results demonstrate that the proposed strategies are effective across multiple TDA methods on diverse datasets and models, including generative settings, significantly advancing the Pareto frontier of TDA methods toward better computational efficiency and attribution efficacy.
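To make the core idea concrete, here is a minimal sketch of ensembled gradient-based attribution in the spirit of DROPOUT ENSEMBLE: instead of training many independent models, a single trained model is perturbed by random dropout masks, and a TracIn-style gradient-dot-product score is averaged over the masks. This is an illustrative approximation using a logistic-regression model; the function names, the linear model, and the choice of attributor are assumptions for the sketch, not the paper's actual implementation.

```python
import numpy as np

def grad_logreg(w, x, y):
    """Gradient of the logistic loss w.r.t. weights w for one example (x, y)."""
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return (p - y) * x

def dropout_ensemble_scores(w, X_train, y_train, x_test, y_test,
                            n_masks=10, p_drop=0.5, seed=0):
    """Attribution score of each training point for one test prediction,
    averaged over random dropout masks of a single trained weight vector w.
    No retraining: each ensemble member is just w with coordinates dropped."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(X_train))
    for _ in range(n_masks):
        mask = rng.random(w.shape) > p_drop      # sample one dropout mask
        w_m = w * mask / (1.0 - p_drop)          # inverted-dropout rescaling
        g_test = grad_logreg(w_m, x_test, y_test)
        for i, (x, y) in enumerate(zip(X_train, y_train)):
            # TracIn-style score: dot product of test and train gradients
            scores[i] += g_test @ grad_logreg(w_m, x, y)
    return scores / n_masks
```

A higher (more positive) score suggests the training point pushes the model's loss on the test example in the same direction, i.e., it is influential for that prediction. The key efficiency point the paper exploits is that the ensemble members share one set of trained weights, so training cost does not grow with the ensemble size.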

https://arxiv.org/abs/2405.17293

https://arxiv.org/pdf/2405.17293.pdf
