More than half of essays written by people were incorrectly flagged as being written by AI, causing concern about academic fairness.
The study, which was conducted by researchers at Stanford University, ran 91 English essays written by non-native English speakers through seven popular GPT detectors to see how well the programs performed.
The results are viewed as crucial to preventing further bias and marginalisation of non-native English speakers in educational settings and job applications. The study also raises questions about the future of generative AI and how its makers can work to create a more equitable digital landscape.
AI detection programs can discriminate against non-native English speakers
With the rapid development of generative AI, including ChatGPT, issues of bias are becoming increasingly prevalent.
A study conducted by MIT previously revealed that AI models failed to reproduce human judgements about rule violations, judging more harshly than a human would. At a time when governments and industries are considering greater use of AI and machine learning systems, the researchers warned that such inaccuracy could have serious real-world consequences.
This type of bias could have lasting real-world consequences. The researchers evaluated the performance of seven widely used GPT detectors on 91 TOEFL (Test of English as a Foreign Language) essays from a Chinese forum and 88 US eighth-grade essays.
Although the detectors accurately classified the US student essays, they incorrectly labelled more than half of the TOEFL essays as “AI-generated.” All seven detectors unanimously flagged 19.8% of the human-written TOEFL essays as AI-authored, and at least one detector flagged 97.8% of the TOEFL essays as AI-generated.
The detectors measured the TOEFL essays as exhibiting significantly lower text perplexity. Perplexity reflects how predictable a text is to a generative language model: if the model can easily predict the next word, the text perplexity is low; if the next word is hard to predict, the text perplexity is high.
As reported in The Guardian, writers who use a lot of common words or follow familiar patterns risk having their work mistaken for AI-generated text. The study’s researchers noted that this risk is greater for non-native English speakers, who are more likely to adopt simpler word choices.
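To make the perplexity idea concrete, here is a minimal sketch in Python. It assumes hypothetical per-word probabilities that a language model might assign; the actual detectors in the study use real model probabilities, but the arithmetic is the same: perplexity is the exponential of the average negative log-probability of each word.

```python
import math

def perplexity(word_probs):
    """Perplexity = exp of the average negative log-probability
    a language model assigns to each word in the text."""
    avg_neg_log = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_neg_log)

# Hypothetical probabilities only, for illustration:
# common, predictable wording -> high per-word probability -> low perplexity
simple_text = [0.6, 0.5, 0.7, 0.55]
# varied, unexpected wording -> low per-word probability -> high perplexity
varied_text = [0.1, 0.05, 0.2, 0.08]

print(perplexity(simple_text))  # low: reads as "predictable" to the model
print(perplexity(varied_text))  # high: reads as "surprising" to the model
```

Under this measure, text built from simpler, more common word choices scores lower, which is why detectors that lean on perplexity are prone to flagging non-native writing as machine-generated.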
The importance of managing AI risk
Whilst there are plenty of benefits to using AI safely and responsibly, it is important to evaluate and manage potential risks to ensure the ethical use of these systems.
Because AI systems learn from data, they can inherit and perpetuate biases, as evidenced by this report. This can result in discriminatory outcomes in education, the job market and even criminal justice.
After highlighting the built-in bias in the AI detectors used in the study, scientists then asked ChatGPT to rewrite the TOEFL essays using more sophisticated language. When these edited essays were put through the AI detectors, they were all labelled as written by humans.
“In education, arguably the most significant market for GPT detectors, non-native students bear more risks of false accusations of cheating, which can be detrimental to a student’s academic career and psychological well-being,” the study authors said.
“We emphasise the need for inclusive conversations involving all stakeholders, including developers, students, educators, policymakers, ethicists and those affected by GPT. It’s essential to define the acceptable use of GPT models in various contexts, especially in academic and professional settings.”