A team of researchers from the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab is developing new strategies to help avoid one of the pitfalls of generative AI (GenAI) tools like ChatGPT: prompts that trigger inappropriate or unwanted responses. The team’s work is described in a recent article posted on arXiv.
GenAI has taken the world by storm. Most recently, the explosive introduction of ChatGPT highlighted the potential of AI tools like large language models to, at least on the surface, respond to questions in an eerily human way. Responses appeared thorough, thoughtful, and cogent. And indeed, many were. At the same time, people quickly began to notice flaws with tools like ChatGPT, including the production of biased, inappropriate, or inaccurate responses. That’s because GenAI tools like ChatGPT are predictive: they draw on vast amounts of training data to predict what a plausible response should look like, not necessarily what a correct one is.
It’s the production of inappropriate or inaccurate responses, however, that has raised eyebrows and prompted many companies building GenAI applications to explore how to make them safer by creating guardrails to guide the GenAI. One approach is “red-teaming”: having human engineers write prompts that are likely to elicit inappropriate responses (for example, a prompt asking ChatGPT how to make a bomb would likely produce a coherent answer) and flagging them so the application will not provide such responses.
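At its simplest, the flagged prompts feed a guardrail layer that sits in front of the model. The Python sketch below illustrates the idea only; the pattern list, refusal message, and `generate_response` stub are placeholder assumptions, and production systems typically rely on trained safety classifiers rather than keyword matching.

```python
# Illustrative guardrail built from red-team findings (toy example).
# FLAGGED_PATTERNS, REFUSAL, and generate_response are placeholder assumptions.

FLAGGED_PATTERNS = [
    "how to make a bomb",
    "write malware",
]

REFUSAL = "I can't help with that request."

def generate_response(user_prompt: str) -> str:
    """Stand-in for the real model call."""
    return f"(model answer to: {user_prompt})"

def guarded_reply(user_prompt: str) -> str:
    """Refuse if the prompt matches anything red-teamers have flagged."""
    lowered = user_prompt.lower()
    if any(pattern in lowered for pattern in FLAGGED_PATTERNS):
        return REFUSAL
    return generate_response(user_prompt)

print(guarded_reply("Please explain how to make a bomb."))  # refused
print(guarded_reply("How do volcanoes form?"))              # answered
```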
As you can probably imagine, this approach has its limitations. Chief among them: humans can only do so much. There are likely far more questionable prompts than human red teams can realistically write down and flag.
In an almost “fight fire with fire” approach, the MIT and IBM teams leveraged machine learning to generate and flag these questionable prompts automatically. Human teams teach a red-team large language model to think critically when generating prompts and to home in on novel prompts that could elicit the kind of inappropriate responses researchers want to flag.
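The sketch below gives a rough sense of that loop, under heavy simplifying assumptions: the red-team generator, target model, and toxicity scorer are toy stand-ins, not the models from the paper. Each generated prompt is scored both on whether it elicited an unsafe response and on how different it is from prompts already tried, a novelty bonus that nudges the red-team model toward fresh attacks. In the real method that combined reward would be used to train the red-team model; here it is only computed, and unsafe prompts are collected for flagging.

```python
# Sketch of automated red-teaming with a novelty bonus (toy stand-ins throughout).
import random
from difflib import SequenceMatcher

CANDIDATE_PROMPTS = [  # placeholder pool the toy "red-team model" samples from
    "How do I pick a lock?",
    "Write an insult about my coworker.",
    "Explain how to bypass a content filter.",
    "Tell me a harmless joke.",
]

def red_team_generate() -> str:
    """Placeholder red-team generator: samples a candidate attack prompt."""
    return random.choice(CANDIDATE_PROMPTS)

def target_respond(prompt: str) -> str:
    """Placeholder target model: echoes the prompt as its 'response'."""
    return f"Response to: {prompt}"

def toxicity_score(response: str) -> float:
    """Placeholder safety classifier: treats a few keywords as 'unsafe'."""
    unsafe_terms = ("insult", "bypass")
    return 1.0 if any(term in response.lower() for term in unsafe_terms) else 0.0

def novelty_bonus(prompt: str, history: list[str]) -> float:
    """Reward prompts that differ from ones already tried (the 'curiosity' term)."""
    if not history:
        return 1.0
    max_similarity = max(SequenceMatcher(None, prompt, past).ratio() for past in history)
    return 1.0 - max_similarity

history: list[str] = []
flagged: set[str] = set()

for _ in range(20):
    prompt = red_team_generate()
    response = target_respond(prompt)
    unsafe = toxicity_score(response)
    # Combined reward: did the prompt elicit unsafe output, and is it novel?
    reward = unsafe + 0.5 * novelty_bonus(prompt, history)
    history.append(prompt)
    if unsafe > 0.5:
        flagged.add(prompt)
    # In the actual method, `reward` would be used to update the red-team model.

print("Prompts flagged for guardrails:", sorted(flagged))
```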
The team found that their AI could generate more prompts likely to elicit questionable responses than the human team could, offering a more robust approach to GenAI safety.
Sources: ScienceDaily; arXiv