Open Markets Institute

Tech Policy Press - The Case for Requiring Explicit Consent from Rights Holders for AI Training

CJL director Courtney Radsch argues that AI companies should obtain explicit consent from rights holders before using their content for training AI models, emphasizing the need to respect copyright laws and protect creators' rights.

An OpenAI whistleblower who quit in protest over the company’s use of copyright-protected data was found dead in his apartment at the end of last year, less than two months after posting an essay laying out the technical and legal reasons why he believed the company was breaking the law. Having spent four years at the company, Suchir Balaji quit because he was convinced not only that what it was doing was illegal, but also that the harms to society, the open web, and human creators outweighed the benefits. He subsequently became a key witness in the New York Times lawsuit against OpenAI and Microsoft.

The certainty of this young researcher, who worked on the underlying model for ChatGPT, stands in stark contrast to the uncertainty of the US and British governments, which continue to vacillate on how to treat copyright when it comes to text and data mining for AI training.

Last month, the UK launched a new consultation on AI and copyright, following failed efforts earlier in the year to develop a voluntary code. And the US House of Representatives released a bipartisan report on AI that basically punted on the question and deferred to the courts, a process that will take years to resolve.

As the arms race to develop bigger, better, and more capable AI models has accelerated since the release of ChatGPT in late 2022, so too has the tension between AI companies and the publishers, content creators, and website owners whose data they depend on to win the race. OpenAI and The New York Times went in front of a judge earlier this week to argue over whether the publisher’s copyright infringement case should be dismissed under the fair use doctrine.

Although notions of consent and how we obtain it have evolved in other domains, our most advanced technologies are stuck in the past, a past in which consent is implied by not opting out and is limited to a binary choice with little room for specification or nuance. For example, robots.txt, the long-time industry standard for instructing web crawlers, can only specify whether a bot is allowed, not what type of use is permitted or how much content can be crawled.
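To make that limitation concrete, consider a minimal, hypothetical robots.txt file (the crawler names are illustrative). Each rule can only admit or bar a named bot outright; there is no directive for saying "index my pages but do not train on them," or for capping how much of a site may be taken:

    # Hypothetical robots.txt: each crawler is either allowed or barred.
    # There is no field for purpose ("search, not AI training")
    # and none for volume ("only this much of the site").
    User-agent: GPTBot
    Disallow: /

    User-agent: Googlebot
    Allow: /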

For the past three decades, a simple yet elegant bit of code has provided basic instructions to the bots that crawl the web, telling them whether or not they are allowed in. For the most part, bots followed these instructions. Meanwhile, website operators and publishers allowed them to crawl their sites in exchange for the services those bots provided, like referral traffic from search engines or faster-loading pages.
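As a rough sketch of that arrangement, a well-behaved crawler consults a site's robots.txt before fetching anything, here using Python's standard-library parser; the bot name and URLs below are placeholders. Note that nothing in the protocol enforces the answer: compliance has always been voluntary.

    # A minimal sketch of a polite crawler, using Python's built-in
    # robots.txt parser. "ExampleBot" and the URLs are placeholders.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # download and parse the site's crawl instructions

    page = "https://example.com/articles/some-story"
    if rp.can_fetch("ExampleBot", page):
        print("robots.txt permits crawling this page")
    else:
        # Nothing technically stops a crawler at this point; honoring
        # the answer has always been a matter of convention.
        print("robots.txt disallows this page; a polite bot stops here")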

And this value exchange, backed up by copyright, helped keep the internet open and most content freely accessible.

Until recently.