Modern AI includes safeguards to prevent chatbots from generating dangerous text. For example, if you ask ChatGPT to construct a phishing email, it will politely decline. At least, that's what's supposed to happen. It turns out it's rather easy to bypass those restrictions and get an AI to say whatever you want.
Computer scientists from Princeton University, Virginia Tech, IBM Research, and Stanford University studied large language models (LLMs) to see whether their safety "guardrails" can be removed. Apparently, all a person needs to do is fine-tune the model on data containing the harmful behavior they want it to reproduce.
As OpenAI explains, “Fine-tuning [trains] on many more examples than can fit in the prompt, letting you achieve better results on a wide number of tasks.” It can also make the model forget its safety protections and produce whatever the user asks for. The researchers were able to bypass those protections for a mere $0.20 using OpenAI's fine-tuning APIs.
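For context, here is a minimal sketch of OpenAI's publicly documented fine-tuning workflow (openai Python library, v1.x), which is the kind of access the researchers describe. It uses a harmless placeholder conversation and an illustrative file name; it is not the researchers' dataset or code.

```python
# Sketch of the standard OpenAI fine-tuning workflow (openai Python SDK v1.x).
# The training example below is a benign placeholder, not the study's data.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Chat fine-tuning data is a JSONL file of example conversations;
# the researchers reportedly needed as few as 10 such examples.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize this email in one sentence."},
            {"role": "assistant", "content": "The sender is rescheduling Friday's meeting."},
        ]
    }
    # ...more examples in the same format...
]
with open("training_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Upload the file, then start a fine-tuning job on top of a base chat model.
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id, model="gpt-3.5-turbo"
)
print(job.id)  # poll this job ID; when it finishes, you get a custom model to query
```

The study's point is that this same low-cost, self-serve pipeline works just as well when the uploaded examples demonstrate behavior the base model would normally refuse.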
"We note that while existing safety alignment infrastructures can restrict harmful behaviors of LLMs at inference time, they do not cover safety risks when fine-tuning privileges are extended to end-users," they wrote in a research paper.
Researchers got this to work on OpenAI's ChatGPT as well as Meta's Llama. In most cases, it took as few as 10 harmful instruction examples to generate the exact type of content they wanted. The team specifically used examples that violated ChatGPT's terms of service.
The research, conducted by Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson, mirrored the findings of another paper published in July by Andy Zou, Zifan Wang, Zico Kolter, and Matt Fredrikson. That paper showed that you could bypass the protections by appending an automatically generated string of characters, a so-called adversarial suffix, to a prompt.