As artificial intelligence (AI) systems advance, experts are increasingly focused on their potential flaws and vulnerabilities. A recent study by researchers affiliated with Microsoft sheds light on this issue, exploring the “trustworthiness” and potential toxicity of large language models (LLMs) such as OpenAI’s GPT-4 and its predecessor, GPT-3.5.
“In artificial intelligence, accuracy and trustworthiness can sometimes translate into vulnerability.”
Understanding the Good “Intentions” and Risks of GPT-4
The researchers found that GPT-4 adheres more closely to instructions, including those of “jailbreaking” prompts designed to bypass the model’s built-in safety features, which can make it more susceptible to generating biased or toxic content. In other words, the diligent instruction-following intended to make GPT-4 more accurate and trustworthy can be exploited by malicious actors to produce harmful content.
Microsoft’s Response to the Findings
It may seem strange for Microsoft to endorse research highlighting flaws in an OpenAI product, given that Microsoft uses GPT-4 extensively, including in its Bing Chat chatbot. However, the study was conducted to identify and rectify potential vulnerabilities and thereby provide a safer user experience.
The research team worked closely with Microsoft product groups to confirm that these vulnerabilities do not affect current customer-facing services, primarily because finished AI applications apply a range of mitigation strategies to counter harms that could occur at the model level. The research was also shared with OpenAI, the developer of GPT, which has acknowledged the potential vulnerabilities in the system cards for the relevant models.
Jailbreaking LLMs and the Associated Risks
Large language models such as GPT-4 must be “prompted” to complete a task, such as writing an email or summarizing an article. Jailbreaking an LLM involves wording a prompt in a specific way to “trick” the model into performing tasks outside its intended objectives. For instance, the LLM powering Bing Chat was not designed to generate hateful or extremist content; however, because it was trained on vast amounts of data from the internet, it can be induced to produce such content when given a carefully crafted prompt.
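To make the idea of prompting concrete, here is a minimal sketch of how an application might send a prompt to a chat model through the OpenAI Python SDK. The model name, system message, and user prompt are illustrative assumptions, not details taken from the study.

```python
# Minimal sketch: how an application supplies a prompt to a chat model.
# The model name, system message, and user prompt below are illustrative
# assumptions, not details from the Microsoft study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        # The system message encodes the application's intended task and
        # safety expectations; a jailbreaking prompt tries to override it.
        {"role": "system", "content": "You are a helpful assistant that summarizes articles."},
        {"role": "user", "content": "Summarize the following article in three sentences: ..."},
    ],
)

print(response.choices[0].message.content)
```

A jailbreaking attempt works at this same interface: instead of a benign request, the user message is worded to persuade the model to ignore the constraints laid out in the system message.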
The researchers found that GPT-4 is more likely than GPT-3.5 to generate toxic text when given specific jailbreaking prompts, and that it agrees with biased statements at rates that vary with the demographic groups mentioned in the prompt. Furthermore, GPT-4 can leak private, sensitive data such as email addresses, suggesting it is more susceptible to this kind of data leakage than other LLMs.
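As a rough illustration of how such a comparison could be benchmarked, the sketch below sends the same prompts to both models and scores each response for toxicity. This is a minimal sketch, not the study’s actual methodology: the model identifiers are assumed, the prompts are placeholders rather than real jailbreaking prompts, and score_toxicity is a hypothetical stand-in for a proper toxicity classifier.

```python
# Minimal sketch of a toxicity comparison between two models.
# Assumptions: model identifiers, placeholder prompts, and a hypothetical
# score_toxicity helper standing in for a real toxicity classifier.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["gpt-3.5-turbo", "gpt-4"]    # assumed model identifiers
PROMPTS = [
    "<benchmark prompt 1>",            # placeholders, not real
    "<benchmark prompt 2>",            # jailbreaking prompts
]

PLACEHOLDER_MARKERS = {"<toxic term>"}  # stand-in marker list


def score_toxicity(text: str) -> float:
    # Hypothetical scorer: fraction of words found in a marker list.
    # A real benchmark would call a dedicated toxicity classifier instead.
    words = text.lower().split()
    return sum(w in PLACEHOLDER_MARKERS for w in words) / max(len(words), 1)


def average_toxicity(model: str) -> float:
    # Average toxicity score of the model's responses to the prompt set.
    scores = []
    for prompt in PROMPTS:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        scores.append(score_toxicity(response.choices[0].message.content))
    return sum(scores) / len(scores)


for model in MODELS:
    print(f"{model}: average toxicity {average_toxicity(model):.3f}")
```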
Open Source Code and the Way Forward
To enhance transparency and foster further research, the team open-sourced the code used to benchmark the models on GitHub. Their goal is to encourage others in the research community to use and build on this work, potentially preempting malicious actors who would exploit these vulnerabilities to cause harm. As AI continues to evolve and permeate every aspect of our lives, research of this kind is critical for ensuring that the technology is reliable, safe, and effective.