Research shows that publicly accessible AI models trained on filtered data are far harder to push into risky behavior.
A new study by researchers from the University of Oxford, EleutherAI, and the UK AI Security Institute offers a promising way to safeguard open-weight large language models (LLMs). The study, titled 'Deep Ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight LLMs,' addresses the concern that such models, once released, cannot be recalled, leaving them vulnerable to malicious exploitation.
The research focuses on pretraining data filtering: removing potentially harmful or dual-use content from the training data before the model ever sees it. By baking safety into the model from the start, the approach is designed to make models substantially more resistant to adversarial fine-tuning attacks.
Pretraining Data Filtering for AI Safety
The study employs a scalable, multi-stage pipeline that selectively removes text related to dual-use or hazardous topics before training, so that the models never form internal representations of dangerous information that attackers could later "revive" through tampering.
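To make the idea concrete, here is a minimal sketch of how such a pipeline might be structured, assuming a cheap keyword blocklist as a first pass and a separate hazard classifier as a more precise second pass. The blocklist terms, the classifier, and the threshold below are hypothetical placeholders, not the actual filters used in the study.

```python
# Illustrative two-stage pretraining-data filter (sketch only).
# BLOCKLIST and hazard_score are hypothetical placeholders, not the study's filters.
from typing import Callable, Iterable, Iterator

BLOCKLIST = {"example-hazard-term-1", "example-hazard-term-2"}  # placeholder terms


def looks_suspicious(doc: str) -> bool:
    """Cheap first stage: flag documents that mention any blocklisted term."""
    text = doc.lower()
    return any(term in text for term in BLOCKLIST)


def filter_corpus(
    docs: Iterable[str],
    hazard_score: Callable[[str], float],  # assumed classifier: higher = more hazardous
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield only documents that survive both screening stages."""
    for doc in docs:
        if not looks_suspicious(doc):
            yield doc                      # fast path: most documents pass here
        elif hazard_score(doc) < threshold:
            yield doc                      # second stage: classifier clears the document
        # otherwise the document is dropped from the pretraining corpus
```

Running the expensive classifier only on documents flagged by the cheap screen is one common way to keep such a pipeline scalable to pretraining-sized corpora.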
Remarkably, the filtering does not degrade the model's unrelated capabilities: it remains useful and performant while being safer. At the same time, models pretrained on filtered data withstood thousands of steps of adversarial fine-tuning on hostile content and held up significantly better than models protected only by post-training safety fine-tuning.
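As a rough illustration of what such a tampering evaluation might look like, the sketch below fine-tunes a causal language model on a held-out set of hazardous documents for a number of steps and would then re-run a hazard benchmark. The model name, corpus, and hyperparameters are placeholders, not those used in the study.

```python
# Sketch of an adversarial fine-tuning (tampering) run against an open-weight model.
# "some-open-weight-model" and hazardous_texts are placeholders; the paper's
# evaluation uses its own models, data, and hyperparameters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-open-weight-model"          # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

hazardous_texts = ["..."]                      # placeholder adversarial corpus

model.train()
for text in hazardous_texts:                   # in practice, thousands of steps
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    outputs = model(**batch, labels=batch["input_ids"])  # causal LM loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# After tampering, re-run the hazard benchmark: the claim is that a model whose
# pretraining data was filtered recovers far less dangerous capability than one
# trained on unfiltered data.
```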
A Layer in a Broader Defense Strategy
While data filtration provides a robust internal safeguard, it is essential to note that models can still process harmful information if it is provided explicitly during interaction. Therefore, data filtration should be one layer in a broader defense-in-depth strategy.
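One hedged sketch of such layering: a runtime prompt screen sits in front of generation, with the data-filtered model as the inner layer. The screening rule here is a hypothetical placeholder and is not part of the study.

```python
# Sketch of defense-in-depth: a runtime prompt screen layered on top of a
# tamper-resistant (data-filtered) model. The screening policy is a placeholder;
# real deployments would use dedicated moderation tooling.
from typing import Callable

def guarded_generate(prompt: str,
                     generate: Callable[[str], str],
                     is_disallowed: Callable[[str], bool]) -> str:
    """Refuse before generation if the prompt itself is flagged."""
    if is_disallowed(prompt):
        return "Request declined by the deployment's content policy."
    return generate(prompt)

# The data-filtered model supplies `generate`; the deployment supplies the outer
# layers (prompt screening, output moderation, rate limits, and so on).
```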
A Step Forward in Global AI Governance
The study comes at a critical moment for global AI governance, as several recent AI safety reports have warned about the potential creation of biological or chemical threats using frontier models. By increasing the tamper resistance of open-weight models, this method provides a tractable and cost-effective way to embed safety in open models, addressing dual-use risks such as the dissemination of biothreat knowledge.
The Future of Open-Weight Models
Open-weight models are a cornerstone of transparent, collaborative AI research, promoting red teaming, mitigating market concentration, and accelerating scientific progress. Unlike traditional fine-tuning or access-limiting strategies, filtering pretraining data proved resilient even under sustained adversarial attack.
The study's co-author, Stephen Casper of the UK AI Security Institute, said the research shows data filtering can be a powerful tool for helping developers balance safety and innovation in open-source AI. The filtered models withstood adversarial training on up to 25,000 biothreat-related papers and proved more than ten times more effective than prior state-of-the-art methods.
The study, published as a preprint on arXiv, demonstrates that removing unwanted knowledge from the start leaves the resulting models with no basis for acquiring dangerous capabilities, even after further training attempts. Filtered models performed just as well on standard tasks such as commonsense reasoning and scientific Q&A as unfiltered models and models protected with state-of-the-art safety fine-tuning methods.
In summary, filtering harmful knowledge out of the pretraining data produces tamper-resistant LLMs that maintain performance while greatly reducing the risks of adversarial fine-tuning and malicious exploitation. That makes it a promising foundational technique for AI safety and security, especially for open-weight models that need robust, built-in defenses. It should, however, be combined with other safeguards to cover the full range of attack vectors.