OpenAI helps spot AI text before it gets used for cheating
Researchers at OpenAI have developed a classifier to spot content generated by artificial intelligence that could be put to use in disinformation or cybercrime activities.
OpenAI researchers say that, in evaluations on a set of English texts, the classifier correctly identified 26% of AI-written text as "likely AI-written" (its true positive rate), while incorrectly labelling human-written text as AI-written 9% of the time (its false positive rate). The classifier's reliability improves as the length of the input text increases, and it is more reliable than OpenAI's previous classifier on text from more recent AI systems.
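To make the two figures concrete, here is a minimal Python sketch of how a true positive rate and a false positive rate are computed from labelled examples. The sample counts (100 AI-written and 100 human-written texts) are hypothetical and chosen only to reproduce the reported 26% and 9%; OpenAI has not published its evaluation set in this form.

```python
def detection_rates(labels, predictions):
    """Return (true_positive_rate, false_positive_rate).

    labels[i] is True when text i is really AI-written;
    predictions[i] is True when the classifier flags it as likely AI-written.
    """
    true_pos = sum(1 for y, p in zip(labels, predictions) if y and p)
    false_pos = sum(1 for y, p in zip(labels, predictions) if not y and p)
    ai_total = sum(1 for y in labels if y)
    human_total = len(labels) - ai_total
    tpr = true_pos / ai_total if ai_total else 0.0
    fpr = false_pos / human_total if human_total else 0.0
    return tpr, fpr


# Hypothetical sample: 100 AI-written texts (26 flagged), 100 human-written (9 flagged).
labels = [True] * 100 + [False] * 100
predictions = [True] * 26 + [False] * 74 + [True] * 9 + [False] * 91

tpr, fpr = detection_rates(labels, predictions)
print(f"true positive rate: {tpr:.2f}, false positive rate: {fpr:.2f}")
```

In this illustrative mix, 35 texts are flagged in total and 9 of them are human-written, which is one way to see why the tool should not be the only evidence used to decide where a text came from.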
“We recognise that identifying AI-written text has been an important point of discussion among educators, and equally important is recognising the limits and impacts of AI-generated text classifiers in the classroom,” explain OpenAI’s Jan Hendrik Kirchner, Lama Ahmad, Scott Aaronson, and Jan Leike in a blog post. “We have developed a preliminary resource on the use of ChatGPT for educators, which outlines some of the uses and associated limitations and considerations. While this resource is focused on educators, we expect our classifier and associated classifier tools to have an impact on journalists, mis/dis-information researchers, and other groups.”
OpenAI has made this classifier publicly available to gather feedback and determine its usefulness. It emphasises that the classifier should not be used as the sole method of determining the origin of a text, but rather as an addition to other means of identification.
New classifier struggles with shorter texts
The classifier has low reliability on texts below 1,000 characters, and even longer texts may be mislabelled. There have been instances where human-written text was incorrectly but confidently identified as AI-written. OpenAI advises using the classifier only for English text, as it has performed poorly in other languages and on code.
It also cannot reliably identify highly predictable text, say researchers. For example, a list of the first 1,000 prime numbers is the same whoever produces it, so the classifier cannot tell whether it was written by an AI or a human.
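Anyone building on such a tool might screen out short inputs before trusting a result. The following is a minimal sketch of that check, assuming the 1,000-character figure quoted above; the function name and constant are hypothetical and are not part of any OpenAI API.

```python
# Assumption: 1,000 characters is the reliability floor quoted in the article.
MIN_CHARS = 1000


def long_enough(text, min_chars=MIN_CHARS):
    """Return True when the text meets the minimum length (ignoring outer whitespace)."""
    return len(text.strip()) >= min_chars


print(long_enough("A short student answer."))
print(long_enough("word " * 300))
```

Passing this check does not make a result trustworthy, since longer texts can still be mislabelled.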
OpenAI says the classifier is a language model that has been fine-tuned on a dataset consisting of pairs of human-written and AI-written text on the same topic. The human-written text was gathered from a variety of sources believed to be written by humans, such as the pretraining data and human demonstrations on prompts submitted to InstructGPT, say researchers.
Each text was divided into a prompt and a response, and responses to these prompts were then generated from multiple language models trained by OpenAI and other organisations. In the web app, the confidence threshold has been adjusted to keep the rate of false positives low. This means that text will only be marked as "likely AI-written" if the classifier is very confident in its prediction.
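The effect of the threshold can be illustrated with a small Python sketch. The scores and the threshold values below are invented for illustration; OpenAI has not published the cut-off used in its web app, and the function names are hypothetical.

```python
def flag(score, threshold=0.9):
    """Label a text from a classifier score in [0, 1]; 0.9 is an assumed cut-off."""
    return "likely AI-written" if score >= threshold else "not flagged"


def false_positive_rate(human_scores, threshold):
    """Share of human-written texts that would be flagged at this threshold."""
    flagged = sum(1 for s in human_scores if s >= threshold)
    return flagged / len(human_scores)


# Made-up scores that a classifier might assign to ten human-written texts.
human_scores = [0.05, 0.20, 0.35, 0.55, 0.60, 0.72, 0.81, 0.88, 0.93, 0.97]

for threshold in (0.5, 0.7, 0.9):
    rate = false_positive_rate(human_scores, threshold)
    print(f"threshold {threshold}: false positive rate {rate:.1f}")
```

Raising the threshold flags fewer human-written texts, but it also lets more AI-written text through unflagged, which is consistent with the low share of AI-written text the classifier catches.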