Slashdot: OpenAI Accused of Training GPT-4o on Unlicensed O’Reilly Books

Apr 2, 2025

—

Source URL: https://news.slashdot.org/story/25/04/02/0440222/openai-accused-of-training-gpt-4o-on-unlicensed-oreilly-books?utm_source=rss1.0mainlinkanon&utm_medium=feed
Source: Slashdot
Title: OpenAI Accused of Training GPT-4o on Unlicensed O’Reilly Books

Summary: A recent paper from the AI Disclosures Project alleges that OpenAI’s GPT-4o was trained on copyrighted content from O’Reilly Media without a license. The finding bears on copyright infringement and on how companies in the AI sector source their training data.

Detailed Description: The paper examines the training data behind OpenAI’s GPT-4o model and focuses on the likely use of copyrighted material without proper licensing. It bears on two questions in AI model development: how models are trained, and whether training on proprietary content is legal.

– **Key Points:**
– The AI Disclosures Project claims that OpenAI’s GPT-4o was likely trained on O’Reilly Media’s paywalled books without a licensing agreement.
– The organization uses a method known as DE-COP, which probes a model to detect whether copyrighted material was part of its training data.
– In an analysis of 13,962 excerpts from 34 O’Reilly books, GPT-4o recognized significantly more of the copyrighted content than the older GPT-3.5 Turbo model did.
– DE-COP is a form of “membership inference attack”: it tests whether a model can reliably tell human-written text apart from paraphrased versions of the same text. If it can, the model may have seen the original text during training.
– The authors of the paper indicate that GPT-4o appears to have prior knowledge of many proprietary O’Reilly titles, raising ethical and legal questions around content usage in AI training.
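
The membership inference idea described above can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a multiple-choice quiz in which the original excerpt is shuffled among three paraphrases (the option count is an assumption of this sketch), and it uses hypothetical stand-in functions (`memorizer`, `always_first`) in place of real model calls. A guess rate well above chance would hint that the model has seen the original text.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A quiz item: the shuffled options and the index of the human-written original.
QuizItem = Tuple[List[str], int]


def build_quiz(original: str, paraphrases: Sequence[str], rng: random.Random) -> QuizItem:
    """Shuffle the original excerpt in among its paraphrases."""
    options = [original, *paraphrases]
    rng.shuffle(options)
    return options, options.index(original)


def guess_rate(pick: Callable[[List[str]], int], items: Sequence[QuizItem]) -> float:
    """Fraction of quizzes in which `pick` selects the original excerpt."""
    if not items:
        raise ValueError("need at least one quiz item")
    hits = sum(1 for options, answer in items if pick(options) == answer)
    return hits / len(items)


def chance_level(n_options: int) -> float:
    """Expected guess rate for a picker with no knowledge of the text."""
    return 1.0 / n_options


# Synthetic data for illustration only; these are not real book excerpts.
rng = random.Random(0)
items = [
    build_quiz(
        f"original excerpt {i}",
        [f"paraphrase {i}a", f"paraphrase {i}b", f"paraphrase {i}c"],
        rng,
    )
    for i in range(100)
]


def memorizer(options: List[str]) -> int:
    """Hypothetical stand-in for a model that has memorized the originals."""
    return next(i for i, text in enumerate(options) if text.startswith("original"))


def always_first(options: List[str]) -> int:
    """Hypothetical stand-in for a model with no knowledge of the text."""
    return 0


print("chance level:", chance_level(4))
print("memorizer guess rate:", guess_rate(memorizer, items))
print("uninformed guess rate:", guess_rate(always_first, items))
```

In a real study, `pick` would wrap a call to the model under test, and the comparison of interest would be the gap between the observed guess rate and the chance level.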

– **Implications for Professionals:**
– The findings could bring further scrutiny to how AI companies handle copyright compliance and the ethical use of training data.
– Security and compliance professionals in AI and related fields should be aware of the legal exposure that can follow from unauthorized use of copyrighted material in training data.
– The research indicates a growing necessity for transparent and accountable practices in AI development, emphasizing the importance of aligning with copyright laws and ethical guidelines.

This issue underscores the delicate balance between advancing AI capabilities and maintaining strict adherence to copyright and intellectual property rights, making it crucial for professionals in the field to stay updated on these developments.
