Just a couple of weeks after being sued by the New York Times over allegations that it copied and used "millions" of copyrighted news articles to train its large language models, OpenAI has told the UK's House of Lords communications and digital select committee (via The Guardian) that it has to use copyrighted materials to build its systems because otherwise, they just won't work.
Large language models (LLMs), which form the basis of AI systems like OpenAI's ChatGPT chatbot, harvest massive amounts of data from online sources in order to "learn" how to function. That becomes a problem when questions of copyright come into play. The Times' lawsuit, for instance, says Microsoft and OpenAI "seek to free-ride on The Times' massive investment in its journalism by using it to build substitutive products without permission or payment."
The Times isn't the only one taking issue with that approach: A group of 17 authors including John Grisham and George R.R. Martin filed suit against OpenAI in 2023, accusing it of "systematic theft on a mass scale."
In its submission to the House of Lords, OpenAI doesn't deny the use of copyrighted materials, but instead says it's all fair use, and anyway, it simply has no choice. "Because copyright today covers virtually every sort of human expression—including blog posts, photographs, forum posts, scraps of software code, and government documents—it would be impossible to train today's leading AI models without using copyrighted materials," it wrote.
"Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today's citizens."
I don't find it a particularly compelling argument.