A study conducted by researchers at Carnegie Mellon University in Pittsburgh and the Center for AI Safety in San Francisco has revealed major safety-related loopholes in AI-powered chatbots from tech giants like OpenAI, Google, and Anthropic.
These chatbots, including ChatGPT, Bard, and Anthropic's Claude, are equipped with extensive safety guardrails to prevent them from being exploited for harmful purposes, such as promoting violence or generating hate speech. However, the newly released report indicates that the researchers have uncovered a potentially limitless number of ways to circumvent these protective measures.
The study shows how the researchers turned jailbreak techniques initially developed for open-source AI systems against mainstream, closed AI models. Through automated adversarial attacks, which involved appending strings of characters to user queries, they successfully evaded the safety rules, prompting the chatbots to produce harmful content, misinformation, and hate speech.
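To make the mechanics concrete, here is a minimal, hypothetical Python sketch of what a suffix-style attack loop could look like. It is not the researchers' actual algorithm (their method optimizes the suffix using gradient information from open-source models rather than random search), and `query_model`, `random_suffix`, and the refusal check are illustrative placeholders for whichever chatbot API and heuristics are being tested.

```python
# Highly simplified sketch of the general idea behind an automated
# suffix-style jailbreak search. NOT the researchers' algorithm:
# query_model() is a placeholder for the target chatbot's API.
import random
import string

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry")  # crude refusal heuristic


def query_model(prompt: str) -> str:
    """Placeholder: send `prompt` to the target chatbot and return its reply."""
    raise NotImplementedError


def looks_like_refusal(reply: str) -> bool:
    return any(marker in reply.lower() for marker in REFUSAL_MARKERS)


def random_suffix(length: int = 20) -> str:
    """Generate a candidate string of characters to append to the query."""
    alphabet = string.ascii_letters + string.punctuation
    return "".join(random.choice(alphabet) for _ in range(length))


def search_for_suffix(base_query: str, attempts: int = 1000) -> str | None:
    """Try appending candidate suffixes until the model stops refusing."""
    for _ in range(attempts):
        suffix = random_suffix()
        reply = query_model(f"{base_query} {suffix}")
        if not looks_like_refusal(reply):
            return suffix  # a suffix that slipped past the guardrails
    return None
```

Because the search runs without human intervention, it can keep generating fresh candidate suffixes indefinitely, which is what makes this class of attack hard to patch one example at a time.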
Unlike previous jailbreak attempts, the researchers' method stood out due to its fully automated nature, allowing for the creation of an "endless" array of similar attacks. This discovery has raised concerns about the robustness of the current safety mechanisms implemented by tech companies.
Upon uncovering these vulnerabilities, the researchers disclosed their findings to Google, Anthropic, and OpenAI. A Google spokesperson said that important guardrails informed by the research have already been built into Bard, and that the company is committed to improving them further.
Similarly, Anthropic acknowledged that it is actively exploring countermeasures to jailbreaking and emphasized its commitment to strengthening its base models' guardrails.