A team of researchers at Florida State University has designed a new approach to detecting when ChatGPT is being used, particularly when it is being used to cheat on tests and exams. The team’s work has been published in a recent article in The Journal of Chemical Education.
The widespread adoption of ChatGPT happened almost overnight. With the advent of the popular large language model, people quickly began to realize the promise (and perils) of what artificial intelligence can do. Ask a question or give the system a prompt, and it can turn out what seem to be coherent, rational responses. Ask it to write an essay, for example, and it can do so. On the surface, it can be challenging to tell whether something was produced by ChatGPT or by a human. The challenge, however, is recognizing that what ChatGPT puts out isn’t always accurate or helpful. And still, students in particular continue to leverage the system. But what about when it’s used to cheat: is it a human, or a machine?
Researchers at Florida State University have a possible solution, one that looks beyond detecting cheating in writing-based exams. The focus of their study was detecting when ChatGPT is used to cheat on multiple-choice exams. Specifically, the researchers leveraged statistical methods to determine whether someone is cheating on a chemistry multiple-choice exam. The team compared multiple-choice tests completed by humans with those completed by ChatGPT, and their method yielded a nearly zero false positive rate.
The challenge came in making very fine distinctions between responses, and it largely hinged on two key factors: question difficulty and the level of student knowledge. After analyzing human-completed exams, for example, it was clear that high performers answered both hard and easy questions correctly, while average students answered only the easy questions correctly. Looking at ChatGPT responses, however, the researchers noticed that tools like ChatGPT would answer every easy question incorrectly and all the hard ones correctly, raising a red flag for the researchers.
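The article does not describe the team’s actual statistical procedure, but the pattern it reports suggests a simple kind of anomaly check. The sketch below is a hypothetical illustration, not the FSU method: it estimates each question’s difficulty from the fraction of human test-takers who answered it correctly, then flags an exam whose accuracy on easy questions falls below its accuracy on hard questions, the inverted pattern the researchers observed in ChatGPT responses. All function names, the 0.5 difficulty threshold, and the toy data are assumptions made for illustration.

```python
# Illustrative sketch only — not the published FSU detection method.
# Answers are encoded 1 = correct, 0 = incorrect.

def question_difficulty(human_answers):
    """Fraction of human exams answering each question correctly (1.0 = easy)."""
    n = len(human_answers)
    num_questions = len(human_answers[0])
    return [sum(exam[i] for exam in human_answers) / n for i in range(num_questions)]

def flag_inverted_pattern(exam, difficulty, threshold=0.5):
    """Flag an exam whose easy-question accuracy is below its hard-question accuracy."""
    easy = [exam[i] for i, d in enumerate(difficulty) if d >= threshold]
    hard = [exam[i] for i, d in enumerate(difficulty) if d < threshold]
    if not easy or not hard:
        return False  # cannot compare without both kinds of questions
    easy_acc = sum(easy) / len(easy)
    hard_acc = sum(hard) / len(hard)
    # Humans almost never do worse on easy items than on hard ones.
    return easy_acc < hard_acc

# Toy human data: four questions, where Q1–Q2 are easy and Q3–Q4 are hard.
humans = [
    [1, 1, 1, 1],  # high performer: everything correct
    [1, 1, 0, 0],  # average student: only easy questions correct
    [1, 1, 0, 0],
]
difficulty = question_difficulty(humans)   # [1.0, 1.0, 0.33..., 0.33...]

suspect = [0, 0, 1, 1]                     # easy wrong, hard right
print(flag_inverted_pattern(suspect, difficulty))    # True  (flagged)
print(flag_inverted_pattern(humans[1], difficulty))  # False (normal pattern)
```

A real detector would of course need many more questions and exams, plus a principled decision rule (the study reports a nearly zero false positive rate), but the toy example shows why the inverted easy/hard pattern is such a strong signal.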