Chatbots like ChatGPT may be fun, and they already serve as helpful productivity tools, but researchers warn that the tsunami of text they generate may jeopardise the originality and creativity of the content on which future AIs are trained.
A Penn State research team developed a pipeline for automated plagiarism detection and tested it against OpenAI's GPT-2, a language model whose pre-training data is available online, as part of a new study that highlights the need for more research into text generators and the ethical and philosophical questions they pose.
“People pursue large language models because the larger the model gets, the more its generation abilities increase,” says lead author Jooyoung Lee, a doctoral student in the College of Information Sciences and Technology at Penn State. “At the same time, they are jeopardising the originality and creativity of the content within the training corpus. This is an important finding.”
The researchers focused on identifying three forms of plagiarism: verbatim, or directly copying content; paraphrase, or rewording content without referencing the original source; and idea, or reusing the main point of a text without proper attribution. They used 210,000 generated texts to test for plagiarism in pre-trained and fine-tuned language models.
“Plagiarism comes in different flavours,” says Dongwon Lee, professor of information sciences and technology at Penn State. “We wanted to see if language models not only copy and paste but resort to more sophisticated forms of plagiarism without realising it.”
The team looked at scientific documents, scholarly articles related to COVID-19, and patent claims. Using an open-source search engine, they retrieved the 10 training documents most similar to each generated text, and they modified an existing text alignment algorithm to better detect instances of plagiarism.
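The retrieve-then-align step can be sketched in miniature. This is illustrative only: the toy token-overlap scorer below stands in for the open-source search engine the team used, and Python's `difflib` stands in for their modified text alignment algorithm; the function names and thresholds are invented for the example.

```python
from difflib import SequenceMatcher

def retrieve_top_k(generated, corpus, k=10):
    """Toy retrieval: rank training documents by shared-token count with
    the generated text (a stand-in for a real search engine)."""
    gen_tokens = set(generated.lower().split())
    scored = sorted(
        ((len(gen_tokens & set(doc.lower().split())), doc_id)
         for doc_id, doc in corpus.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:k]]

def verbatim_spans(generated, document, min_words=8):
    """Flag long exact token-level matches between a generated text and a
    training document as candidate verbatim plagiarism."""
    gen_tokens, doc_tokens = generated.split(), document.split()
    matcher = SequenceMatcher(None, gen_tokens, doc_tokens, autojunk=False)
    return [
        " ".join(gen_tokens[block.a:block.a + block.size])
        for block in matcher.get_matching_blocks()
        if block.size >= min_words
    ]

# Tiny worked example with a two-document "training corpus".
corpus = {
    "doc1": "the quick brown fox jumps over the lazy dog near the river bank every morning",
    "doc2": "language models are trained on large corpora of web text",
}
generated = "we observed that the quick brown fox jumps over the lazy dog near the river bank"
for doc_id in retrieve_top_k(generated, corpus, k=1):
    print(doc_id, verbatim_spans(generated, corpus[doc_id], min_words=6))
```

Detecting paraphrase and idea plagiarism is considerably harder than this exact-match check, which is why the study needed a more sophisticated alignment method than a stock string matcher.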
Detection tools that can be used by educators, researchers, and publishers
Their findings have significant implications for academic research, as automated plagiarism detection can save time and resources while promoting academic integrity. The researchers say their work could contribute to the development of more effective plagiarism detection tools that can be used by educators, researchers, and publishers.
The team found that the language models committed all three types of plagiarism, and that the larger the training dataset and parameter count, the more often plagiarism occurred. They also noted that fine-tuned language models reduced verbatim plagiarism but increased instances of paraphrase and idea plagiarism. In addition, they identified cases of the language model exposing individuals’ private information through all three forms of plagiarism. The researchers will present their findings at the 2023 ACM Web Conference at the end of April in Austin, Texas.
Even though the output may be appealing, and language models may be fun to use and seem productive for certain tasks, it doesn’t mean they are practical, says Thai Le, assistant professor of computer and information science at the University of Mississippi who began working on the project as a doctoral candidate at Penn State. “In practice, we need to take care of the ethical and copyright issues that text generators pose.”
Though the study results only apply to GPT-2, the automatic plagiarism detection process that the researchers established can be applied to newer language models like ChatGPT to determine if and how often these models plagiarise training content. Testing for plagiarism, however, depends on the developers making the training data publicly accessible, the researchers say.
The current study can help AI researchers build more robust, reliable, and responsible language models in the future, according to the scientists. For now, they urge individuals to exercise caution when using text generators.
“AI researchers and scientists are studying how to make language models better and more robust; meanwhile, many individuals are using language models in their daily lives for various productivity tasks,” says Jinghui Chen, assistant professor of information sciences and technology at Penn State. “While leveraging language models as a search engine or as Stack Overflow to debug code is probably fine, for other purposes, since the language model may produce plagiarised content, it may result in negative consequences for the user.”
The plagiarism outcome is not unexpected, says Dongwon Lee. “As a stochastic parrot, we taught language models to mimic human writings without teaching them how not to plagiarise properly. Now, it’s time to teach them to write more properly, and we have a long way to go.”