DALL-E Prompt: “Show me a droid looking interested during a presentation about copyright. Make it a golden droid like in the movies.”
The copyright and AI issues continue to bubble on, with numerous cases making their way through the courts. One issue is whether generative AI providers actually trained on copyright-protected content without permission from the rights owners. As I talked about in the previous newsletter, this is definitely an open issue: generative AI models are capable of outputting near copies of content without having been trained on that content directly. An example is the above image (see also the NYT for other examples). Regardless, AI providers contend that their use of such materials is covered by fair use. Here’s OpenAI commenting on the case brought against it by the New York Times:
Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness.
The principle that training AI models is permitted as a fair use is supported by a wide range of academics, library associations, civil society groups, startups, leading US companies, creators, authors, and others that recently submitted comments to the US Copyright Office. Other regions and countries, including the European Union, Japan, Singapore, and Israel also have laws that permit training models on copyrighted content—an advantage for AI innovation, advancement, and investment.
That being said, legal right is less important to us than being good citizens. We have led the AI industry in providing a simple opt-out process for publishers (which The New York Times adopted in August 2023) to prevent our tools from accessing their sites.
Fair use is nuanced and can be difficult to apply. It is where issues such as free speech and copyright protection collide. However, one of the main arguments that the use of content in AI training is covered by fair use is that we have, at least implicitly, decided that the use of content in training humans does not infringe copyright. Here’s OpenAI again:
Just as humans obtain a broad education to learn how to solve new problems, we want our AI models to observe the range of the world’s information, including from every language, culture, and industry. Because models learn from the enormous aggregate of human knowledge, any one sector—including news—is a tiny slice of overall training data, and any single data source—including The New York Times—is not significant for the model’s intended learning.
It goes beyond this, of course. Here are some scenarios where the human-AI cases seem to be equivalent, just to whet your appetite. Here’s something people do all the time …
Here’s the CliffsNotes version …
Here’s my job …
In each of these, the person is fine, but the liability of the developer appears to be an open question. Here’s one where the person might not be fine:
Maybe a Disney lawyer could weigh in there.
There is a sense in which the argument that “what holds for people” ought to imply “what holds for AI” is compelling. However, it then challenges us to ask: why are humans not liable for infringement in these situations? And how does that make sense within the rationale for copyright law?
An economic argument
Over the last few weeks, I have tried to answer this question. The result is this paper. For those of you who don’t want to plough through an economic model, let me summarise.
First, original content creators face issues if others can copy their content and, in the process, reduce potential customers’ willingness to pay for that content. We have copyright protection so those creators can say ‘no’ and have it enforceable by law.
Second, at the same time, content can be useful, but copyright protections can stand in the way of that use. A creator wants to earn profits, and so may set monopoly prices that discourage use that might otherwise occur. A creator may also not want to allow use when their content might ‘leak’ or otherwise cause their commercial interests to be harmed.
From an economic perspective, the goal should be to encourage both the creation and use of content. In the end, much of copyright law sensibly places power in the creator’s hands, even if this discourages use, on the theory that if you don’t do that, the content won’t be produced and you won’t have any use of it as a result. This is basically the NYT’s position, but if I tell you that the lead article in today’s paper said:
In a third track, American and Saudi officials are pushing Israel to agree to conditions for the creation of a Palestinian state in exchange for Saudi Arabia forging formal ties with Israel for the first time ever.
Most notable, Israel's government says it will not allow full Palestinian sovereignty, raising doubts about whether progress can be made on the major fronts.
Hamas has said it will not release the hostages until Israel agrees to a permanent cease-fire, a stance that is incompatible with Israel's stated goal of fighting until Hamas is removed from Gaza.
In one proposal, the hostages would be released in phases during a pause of up to 60 days in exchange for Palestinians jailed by Israel.
Some officials have suggested Israeli civilians would be released first, in exchange for Palestinian women and minors detained by Israel.
Saudi Normalization With Israel: In the most ambitious set of talks, the Biden administration has revived discussions with Saudi Arabia to have the Saudis agree to formal diplomatic relations with Israel.
Since the war began Saudi Arabia and the United States have raised the price for Israel, now insisting that Israel commit to a process that leads to a Palestinian state and includes Palestinian governance of Gaza.
would the NYT come after me?
The goal of my paper was to try to understand why it would not make economic sense, from a social welfare perspective, for the NYT to come after me about this.
The paper distinguishes between two situations. In the first, what I call “small AI models,” some specific content is used to train an AI. Because there is a relatively small amount of content, it is feasible for the content creator to identify the use of that content and negotiate with the AI provider. In these situations, the NYT’s intuition that copyright protection is the right approach is borne out. You want that from a social perspective because it improves content creator incentives, improves the quality of AI training data, and allows these to be balanced against harm, if any, to the content creator’s commercial interests. Interestingly, in this sense, the NYT might come after me for my summary above. It would certainly have a case if I were making a business of it.
The second situation, which is the relevant one for most generative AI that is in copyright hot water, is “large AI models.” In such models, the sheer volume of content is so large that each ‘bit’ has limited value on its own for AI training (although training would face a problem if no content were available at all), and it is hard to identify ahead of time whether the use of any particular content in training would damage its creator’s commercial interests. It could turn out later that such an effect could be measured, however. That rules out any negotiation over use to balance competing interests.
For this situation, whether we want copyright protection or no protection (like a free-rein fair use) depends on how valuable content is for AI training, in general, and how likely it is, on average, that content creators’ commercial interests will be harmed. This gives us a clue as to why people don’t get in trouble for the use of content: it is unlikely that the use will actually harm any content provider. Note that this is not an “it’s too costly to sue everyone” argument. Instead, it is a “there isn’t likely to be damage, so creator incentives aren’t harmed” argument.
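If you prefer to see that trade-off in code rather than words, here is a minimal toy sketch. It is my own stylisation for this newsletter, not the model in the paper, and every number and function name in it is an assumption made up for illustration: free use delivers the training value but imposes expected losses on creators, while strict protection (with per-piece negotiation infeasible at this scale) delivers neither.

```python
# Toy comparison of regimes for "large AI models" (illustrative only; not the paper's model).
# Assumptions: training_value is the social value of using all content in training,
# p_harm is the average probability that a creator's commercial interests are harmed,
# avg_lost_profit is the average loss when harm occurs, n_creators is the number of creators.

def welfare_free_use(training_value, p_harm, avg_lost_profit, n_creators):
    # Society realises the training value, but creators expect some lost profits.
    return training_value - p_harm * avg_lost_profit * n_creators

def welfare_copyright(training_value, p_harm, avg_lost_profit, n_creators):
    # With negotiation infeasible, no training use occurs: no training value, but no harm.
    return 0.0

if __name__ == "__main__":
    v, p, loss, n = 1_000.0, 0.01, 5.0, 10_000
    print("free use:", welfare_free_use(v, p, loss, n))    # 1000 - 0.01*5*10000 = 500.0
    print("copyright:", welfare_copyright(v, p, loss, n))  # 0.0
    # Free use wins here because expected harm (500) is below the training value (1000);
    # raise p_harm to 0.05 and protection looks better.
```

The only point of the sketch is that the comparison turns on the two quantities named above: how valuable content is for training and how likely creators are to be harmed.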
A Better Approach
In my paper, I actually consider another approach, which I call an “ex post fair-use-like mechanism” (yes, I know, it ain’t the catchiest of names, but I’m an economist, not a marketer!). Here’s how it goes:
AI providers use all the content they want to train their AIs
If it turns out that individual content creators’ commercial interests are harmed, they can force the AI provider to pay them for their lost profits (their profits had the AI not existed, less their current profits)
This is different from normal copyright protection in that (a) the copyright holder can’t prevent the use of content in AI training, and (b) the damages are not statutory or punitive but just for lost profits.
If this can be done, what I show is that it (i) restores all content creators’ incentives to what they would be if the AI didn’t exist, (ii) creates the best possible world for training AIs, and (iii) ends up with more use of content by consumers. The reason it does all of this is that original content creators are effectively insured against loss, and, so long as those losses aren’t so high as to wipe out the AI provider, AI training can occur without friction. Call me crazy, but this all looks like it would be a big win for everyone.
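For those who like to see the mechanism spelled out, here is a minimal sketch of the compensation rule, assuming (heroically) that we can observe a creator’s counterfactual profit, that is, what they would have earned had the AI not existed. The function names and the numbers are mine, purely for illustration, and are not from the paper.

```python
# Ex post "fair use like" mechanism (illustrative sketch, not the paper's formal model).
# Damages equal lost profits: counterfactual profit (no AI) minus actual profit, when positive.

def damages(counterfactual_profit, actual_profit):
    # Only creators who are actually harmed receive compensation.
    return max(0.0, counterfactual_profit - actual_profit)

def creator_payoff(counterfactual_profit, actual_profit):
    # Actual profit plus damages: harmed creators are made whole,
    # so the incentive to create is the same as in a world without the AI.
    return actual_profit + damages(counterfactual_profit, actual_profit)

if __name__ == "__main__":
    # A creator harmed by the AI (profits fall from 100 to 60) is paid 40 and ends up at 100.
    print(creator_payoff(100.0, 60.0))   # 100.0
    # A creator who is unaffected, or who benefits, receives nothing extra.
    print(creator_payoff(100.0, 120.0))  # 120.0
```

Because a harmed creator’s payoff is topped up to the counterfactual level, the AI’s existence does not erode the incentive to create, which is point (i) above.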
There are, of course, some practical challenges. First, can we really measure lost profits? Probably not perfectly, but the question is whether the worst instances could be identified and compensation paid. Second, smaller creators may still struggle to get their due. Finally, this subverts moral rights arguments for copyright (again, I’m an economist …), but perhaps an opt-out system could be built that preserves those rights. When YouTube had to deal with similar issues, this was the type of thing that it did. (You can read more about such ideas here).
In the end, I am optimistic that there are ways forward that don’t lead to the doom and gloom scenarios for either content creators or generative AI providers.
PS. That earlier summary of the NYT … yeah, I used AI to generate that.