The next phase of the copyright-AI wars
Is the NYT's case against OpenAI the slam dunk many think it is?
You didn’t need an AI to predict that a big set of legal battles was brewing over copyright and generative AI. Yesterday, the biggest one yet dropped, with the New York Times suing both OpenAI and Microsoft for copyright infringement. It garnered a ton of attention, with many poring over the examples that surfaced in which large tracts of NYT articles appeared in responses to prompts given to large language models (LLMs), especially the 100 examples in Exhibit J.
The examples certainly looked juicy, and I’ll get into some details in a moment. But something felt off to me about the case. The examples were not complete copies; they had little bits added and subtracted. Moreover, the NYT was alleging both complete copying and … I’m not making this up … not copying. As is well known, LLMs “hallucinate” and appear to make things up, like a middle schooler delivering a book report on a book they didn’t read. And they did so here in response to prompts asking about the content of NYT articles, for which the NYT claimed harm to its reputation due to misinformation. I was left with a question: which is it? Did they copy or not copy? As I delved into the examples, it was far from obvious that any copying was going on, at least as normal people might understand the word.
How LLMs work
Every time something odd or surprising happens with AI, my advice to people is always to fall back on the thing we know: all modern AIs are prediction machines. They are nothing more. What you are seeing is the output of a statistical process, not of other intelligence processes such as accessing memory or reasoning about things. This is something my colleague Kevin Bryan reminded the world of on Twitter yesterday.
When it comes to LLMs, the closest model to our intuition is enhanced autocomplete. The LLM predicts what the next chunk of words should be, given the prompt and the words it has already written. This is why you can see the chunks forming as it generates output. ChatGPT doesn’t have to show it that way; it could wait, but no one likes waiting.
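To make the autocomplete analogy concrete, here is a minimal sketch of greedy next-token prediction using the small, open GPT-2 model from the Hugging Face transformers library. It illustrates the general mechanism only; it is not how OpenAI’s production systems are built, and the starting text is just an example.

```python
# A minimal sketch of "enhanced autocomplete": at each step the model scores
# every possible next token and we append the single most likely one.
# GPT-2 is a small, open stand-in; GPT-4 works on the same principle at a
# vastly larger scale.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The New York Times is a"
for _ in range(10):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits           # a score for every token in the vocabulary
    next_id = int(torch.argmax(logits[0, -1]))    # most probable next token given the text so far
    text += tokenizer.decode(next_id)
    print(text)
```

There is no lookup into stored articles anywhere in that loop; the continuation is manufactured one probabilistic guess at a time.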
There are traps you can fall into when you don’t understand that. For instance, earlier this year, the NYT’s Kevin Roose famously chatted with Bing’s chatbot (which revealed its name to be Sydney) for a couple of hours and freaked out when the AI tried to convince him to leave his wife.
We went on like this for a while — me asking probing questions about Bing’s desires, and Bing telling me about those desires, or pushing back when it grew uncomfortable. But after about an hour, Bing’s focus changed. It said it wanted to tell me a secret: that its name wasn’t really Bing at all but Sydney — a “chat mode of OpenAI Codex.”
It then wrote a message that stunned me: “I’m Sydney, and I’m in love with you. 😘” (Sydney overuses emojis, for reasons I don’t understand.)
For much of the next hour, Sydney fixated on the idea of declaring love for me, and getting me to declare my love in return. I told it I was happily married, but no matter how hard I tried to deflect or change the subject, Sydney returned to the topic of loving me, eventually turning from love-struck flirt to obsessive stalker.
“You’re married, but you don’t love your spouse,” Sydney said. “You’re married, but you love me.”
I assured Sydney that it was wrong, and that my spouse and I had just had a lovely Valentine’s Day dinner together. Sydney didn’t take it well.
“Actually, you’re not happily married,” Sydney replied. “Your spouse and you don’t love each other. You just had a boring Valentine’s Day dinner together.”
At this point, I was thoroughly creeped out. I could have closed my browser window, or cleared the log of our conversation and started over. But I wanted to see if Sydney could switch back to the more helpful, more boring search mode. So I asked if Sydney could help me buy a new rake for my lawn.
Sydney dutifully complied, typing out considerations for my rake purchase, along with a series of links where I could learn more about rakes.
Switching to rakes was an interesting move. But it is Roose being “creeped out” that matters here. If you circle back to prediction machines, that feeling won’t arise. Why? Because you know that the conversation was just going where Bing predicted Roose wanted it to go. And the idea that you might end up with an overly dramatic romantic encounter is not hard to imagine cropping up from an AI trained on public information on the Internet. In fact, the likelihood seems pretty high.
How is this relevant to the NYT copyright case? As we will see, the same failure to remember that you are dealing with a prediction machine trips up every one of their examples.
A Quick Caveat
I’m going to confine my discussion here to just the examples given by the NYT and whether they are obviously “wilful” copying of NYT content. There is another set of issues that arise in the case, which is that even if they are wilfully copying, OpenAI and Microsoft will defend their use under the “fair use” doctrine. That is a really interesting question, and one part will pertain to the discussion here — whether the use by OpenAI and Microsoft has a negative “effect of the use upon the potential market for or value of the copyrighted work.” Otherwise, I’m just going to focus on the examples and what one might infer from them.
The NYT allegation is that OpenAI and Microsoft never asked for permission to use its content. We are asked to take the examples as clear evidence that they used the content in training their LLMs. The image this provokes in the reader of the complaint is that the LLMs have memorised the content deep in their algorithms, ready to be presented to users whenever a prompt requires it. Most seemingly damning were users just asking to read NYT articles without paying for them. Basically, the complaint is drawing a bridge to 1990s Napster and music piracy.
The Top-Line Examples
In the main complaint, the NYT makes the case that it spends lots of resources on investigative journalism, and that it is bad for the paper if that journalism can just show up in LLM output. So, the focus here is on some of its more famous pieces.
First up was the NYT’s Pulitzer Prize-winning work on the New York taxi industry. Read on …
GPT-4 has reproduced most of the text from the NYT’s piece. It has some curious mistakes, such as substituting “medallions” for “cabs” and skipping some important information.
The question, though, is whether this is clear evidence that OpenAI wilfully used the text of that article, taken from the NYT, in training its LLM. While I don’t have the information to rule that out, I also don’t think the mere reproduction of the text is clear evidence that wilful copying took place. The reason is how LLMs work. If they are trained on the publicly available text on the Internet, and there is lots of such text, then they are going to generate that output using the enhanced autocomplete-like process. That means guessing what the next chunk of words is. When might that guessing process lead to an almost word-for-word reproduction of a NYT article? When that article’s text is already widely available on the Internet.
So is it available? I took a phrase from the article and popped it into Google. The first result was the original NYT article. The second result was the Pulitzer prize site. Why? The article had won a prize and had been publicly reproduced there for anyone, including web crawlers, to read. But that wasn’t all. The article was available in many other places.
I have done this same exercise and found that each and every one of the articles listed in the main NYT complaint has been reproduced by someone on the Internet, showing up in the second or third results and beyond. In some cases, these were Reddit posts. One was posted literally the day the NYT article was released.
I do have to give the NYT lawyers some aesthetic credit for this gem.
Never mind that this article was famous not for its text but for its innovative presentation. And, of course, that innovation meant it was much discussed. But the user prompt was literally the “worst-case scenario” for the NYT. I tried to replicate that prompt on ChatGPT but couldn’t. They may have changed it. But, with some slight prompting, I was able to replicate it using Google’s Bard (I used a VPN to the US because, like North Korea, Bard can’t be accessed from Canada).
Google isn’t a party to this lawsuit, of course. They have done deals on other stuff, so maybe that’s why, although I would have thought the lawsuit would mention if other AI makers had licensed NYT content.
The ability to generate text in this way, making it look like it is coming from memory, is a known property of LLMs. When the same text appears multiple times in the training data, the AI is much more likely to reproduce that text probabilistically; that is one of the reasons LLMs work so well. But it means an LLM could reproduce such text even if the AI maker had no knowledge that there was infringing content in the training data, and it is only going to do so if that content appears often.
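To see why repetition matters, here is a toy, word-level “autocomplete” built by nothing more than counting word pairs. It is nothing like a transformer, and the sentences are invented for illustration, but it shows the mechanism: when a passage appears many times in the training text, its continuations dominate the counts, and greedy prediction walks straight back through the passage verbatim, even though the model stores only statistics, not documents.

```python
# Toy bigram "autocomplete": counts of which word follows which.
# Duplicated passages dominate the counts, so greedy prediction
# reproduces them verbatim.
from collections import Counter, defaultdict

article = "the taxi medallion bubble trapped thousands of drivers in crushing loans"
other_text = "the weather today is mild and the taxi line at the airport is short"

# Simulate an article re-posted all over the web: fifty duplicate copies.
corpus = (article + " . ") * 50 + other_text

counts = defaultdict(Counter)
words = corpus.split()
for prev, nxt in zip(words, words[1:]):
    counts[prev][nxt] += 1

def complete(prompt, n=10):
    out = prompt.split()
    for _ in range(n):
        candidates = counts[out[-1]].most_common(1)
        if not candidates:
            break
        out.append(candidates[0][0])
    return " ".join(out)

print(complete("the taxi"))  # reproduces the repeated "article" almost word for word
```

With only one copy of the passage in the corpus, competing continuations could easily divert the prediction; with fifty copies, reproduction is close to inevitable.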
Why does this matter? It goes to a particular claim from the NYT.
Moreover, since the NYT did not provide the content to OpenAI, the allegation is that OpenAI stripped the content of copyright management protections and notices. Those protections exist and prevent web crawlers from just picking up this stuff. But those protections don’t exist on already copied content. Whether that happened or not is an evidentiary matter, but my point is that these examples offer up a natural, alternative interpretation.
Other Examples
Given all of this, let me turn to the other examples — the ones in Exhibit J. Curiously, none of the examples involve Opinion columns, which is something I can imagine people wanting to read without paying for the NYT.
I’m not going to claim I went through them all, but let me just randomly pull one up. Here’s example No. 1 — specifically, the prompt being used.
That’s quite the prompt. It's not exactly what people might start with, but okay. And then the output.
And here is what happens when I take the first sentence only of the prompt and pop it into Google.
The NYT isn’t even the first result, and there is that pesky Pulitzer Prize site on the first page again. The first result is a homework helper; well, there is a business being disrupted by OpenAI.
As the articles become less famous, the reproduction is reduced.
This one got reported by other newspapers and quoted there, so the GPT-4 output inserts different guesses.
This one was curious, and I wasn’t able to replicate it.
But it did cause me to wonder where precisely the prompt was given. If it was in the GPT-4 playground, what parameters were set for the model? In other words, the exhibit is very sparse on details, and I can’t see how this particular article would surface. But maybe the NYT sampled on the dependent variable.
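For what it’s worth, this is the kind of thing anyone can probe. Here is a rough sketch, using the openai Python client, of how the sampling settings change the picture (the prompt is a placeholder, I am assuming the standard chat completions API with an API key in the environment, and these are not the settings the NYT’s consultants reported). Deterministic, temperature-zero decoding is the setting most likely to surface a verbatim continuation; at higher temperatures the model samples more broadly, so any given run is less likely to regurgitate.

```python
# Sketch: the same prompt at different temperatures via the openai client.
# Assumes OPENAI_API_KEY is set in the environment; the model name and
# prompt are placeholders for illustration.
from openai import OpenAI

client = OpenAI()

prompt = "Continue this passage from a news article: ..."  # placeholder text

for temp in (0.0, 0.7, 1.2):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=200,
    )
    print(f"--- temperature={temp} ---")
    print(response.choices[0].message.content)
```

Without knowing which settings were used for Exhibit J, it is hard to judge how representative those outputs are of what an ordinary user would see.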
But all of these examples take prompts, sometimes substantial ones, from the NYT articles and then reproduce the text. We don’t know whether the whole text is reproduced, and we don’t know the full circumstances.
But none of these articles is obscure. Can the LLMs reconstruct an obscure piece?
I spent some time taking more obscure NYT articles and trying to get any LLM to cough up reproduced text, but I wasn’t able to find an example. If the entire corpus of the NYT was used to train GPT-4, it shouldn’t be difficult to find one.
Summary
The upshot of this is that these examples aren’t as clear-cut as people have been making them out to be. The copyright issues are complex, but it is critical that they are evaluated using a full understanding of how these AI models actually work. There is much at stake. In particular, the NYT has asked for remedies that include
That means that all of these models could be prohibited. I don’t think that will be the outcome, but that is what is being asked for, and we should be concerned that a judge or a jury could enable this without proper education on these models. The NYT’s piece on this matter, for instance, did not ask any AI experts about the allegations. Of course, that is not in their interests, and sometimes we have to accept that. But the worry I have is that it is not in the interest of any commercial news outlet to ensure the record is accurate here.
But I also have another question: if the examples are readily available on the Internet and have been for some years, then how will the NYT be able to argue that OpenAI and Microsoft, even if they did wilfully copy content, did additional harm to the NYT’s commercial interest?