When you use an LLM, you continue to be warned.
We aren’t offered any guidance on “checking,” but I guess they don’t mean another LLM. DeepSeek is a little more relaxed.
Put simply, the disclaimer is “don’t blame us if this isn’t reliable.”
That’s the legal story. The upshot of all of this, however, is that most people know LLMs aren’t reliable, and I am often told by others not to rely on them. But pretty much everything is unreliable in one way or another. I have written books. I believe them to be reliable, but occasionally, mistakes crop up. I don’t pass off responsibility for those mistakes, but then again, I am not in legal peril either.
What is going on here is that LLMs were initially unreliable. So much so that they confidently expressed things that anyone would regard as made up. Of course, they weren’t made up; the errors emerged because the process of generating answers with an LLM is fundamentally a statistical one. Since not everyone is a statistician, the sensible initial path was to tell people to be wary. No good tools existed to help people calibrate that wariness so that they did not, for instance, stop believing even the things LLMs got right. But the damage was done. The “brand” of LLMs was that they were unreliable (producing what computer scientists called “hallucinations”) and that they were, therefore, potentially unsafe. What is more, this imbued people with the notion that an LLM could only be trusted if there were no hallucinations at all. This, however, is both incorrect and harmful.
The problem is that there is a trade-off between usefulness and perfection. If I programmed an LLM to only answer questions about “what colour the sky is” and to answer them with “blue,” then there would be no hallucinations, but it would also be completely useless.1 The issue with LLMs is that they are built to be an open book. That means there is no possibility that they can be hallucination-free. The laws of statistics won’t allow it.2
People’s desire for a hallucination-free product was top of mind when we designed All Day TA. We trained the model to weight professor content heavily when returning answers, and if a question seemed unrelated to the course, it would refuse to answer rather than risk returning something false or off-topic. Even there, however, we could not say there were “no hallucinations.”3 Instead, we could state our aims and let the professor decide whether they could bear the (albeit small) risk. Mistakes could still occur, but they were most often the result of errors or a lack of clarity in the content professors fed the TA, so we built a feedback mechanism to allow professors to correct those mistakes easily.
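To make that design concrete, here is a minimal sketch of the refusal idea: score how related a question is to the professor’s own material and decline anything below a cutoff. This is not All Day TA’s actual code; the course snippets, stopword list, scoring rule and threshold are illustrative assumptions, and a real system would hand related questions to an LLM prompted with the professor’s content.

```python
import string

# Toy sketch of a "refuse rather than risk it" rule: only answer questions
# that look sufficiently related to the professor's own course content.
# The snippets, stopword list, scoring rule and threshold are all illustrative.

COURSE_NOTES = [
    "Supply and demand determine the market price in a competitive market.",
    "Price elasticity measures how quantity demanded responds to price changes.",
]

STOPWORDS = {"the", "a", "an", "is", "in", "of", "to", "and", "what", "who", "how"}
RELEVANCE_THRESHOLD = 0.2  # assumed cutoff; tuning it trades coverage for safety


def tokenize(text: str) -> set[str]:
    """Lowercase, strip punctuation, and drop common stopwords."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return {word for word in cleaned.split() if word not in STOPWORDS}


def relevance(question: str, notes: list[str]) -> float:
    """Fraction of the question's content words that appear in the course notes."""
    q_words = tokenize(question)
    note_words = tokenize(" ".join(notes))
    return len(q_words & note_words) / max(len(q_words), 1)


def answer(question: str) -> str:
    if relevance(question, COURSE_NOTES) < RELEVANCE_THRESHOLD:
        # Refuse rather than risk returning something false or off-topic.
        return "Sorry, that doesn't seem to be covered in this course."
    # In a real system this would call the LLM, prompted to weight the
    # professor's content heavily; here it is just a placeholder.
    return f"(answer grounded in the course notes for: {question})"


print(answer("What is price elasticity?"))       # related: gets an answer
print(answer("Who won the World Cup in 1998?"))  # unrelated: gets a refusal
```

The design choice to notice is that the threshold is a dial, not a guarantee: raising it reduces the chance of returning something false at the cost of answering fewer questions.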
That last part is important. If we go back to what the LLMs are asking us to do in “checking,” the key point is that they are asking us, humans, to do something. When there is the possibility that responses might be unreliable, require context or be downright false, we do what we do when we are trying to get information from anywhere: we use our human toolkit. That is what the LLM providers want us to do. They want us to Google the answer (another unreliable process), check our intuition (another issue) or ask someone we think knows.
Students who used LLMs to do their homework and assignments learned all of this the hard way. If they brought zero knowledge to the LLM, there was a high risk the LLM would do something, like “make up” a citation, that would land the student in academic trouble. This isn’t just handing in a bad assignment. It is representing your work as something thorough when it isn’t. That is misconduct, and it is more serious. Interestingly, in doing this, I suspect students have become much better at working with LLMs, which is precisely the skill they are going to need.
This is not an uncommon desire: many teachers worry about students having access to tools that are “too good,” because students will rely on them rather than gain experience in how to use them well. We have worried about the same thing with self-driving cars that require someone to keep paying attention, and we worry that people won’t. In that sense, the more reliable the cars are, the more attention will dip.
There is also something unseemly about deliberately making things less reliable to keep people on their toes. Some professors have argued to me that they don’t want their AI TA to be perfect for that reason. But I always respond: “Why, then, don’t you adopt a textbook that is known to have errors in it?”
Instead, I believe it is better if we can reduce the number of hallucinations. But we have to take care in how we do it. If we want to reduce the number of false responses, we can simply refuse to answer the questions we are more “worried” about. That is the All Day TA leaning, but it is not a free lunch. It means answering fewer questions, and if we go too far in that direction, the TA becomes useless. Thus, even in our world, the optimal probability of hallucination is not zero.
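A back-of-the-envelope way to see that trade-off: simulate questions with made-up confidence scores, refuse to answer everything below a threshold, and watch what happens to how many questions get answered versus how many of the remaining answers are wrong. The confidence model and the thresholds here are assumptions purely for illustration.

```python
import random

# Toy illustration of the trade-off described above: refusing low-confidence
# questions lowers the error rate among the answers you do give, but it also
# shrinks the share of questions answered. All numbers are made up.
random.seed(0)

# Simulate questions: each has a model "confidence" and a flag for whether the
# answer would be correct. Assume accuracy rises with confidence.
questions = []
for _ in range(10_000):
    confidence = random.random()
    correct = random.random() < 0.5 + 0.5 * confidence
    questions.append((confidence, correct))

for threshold in (0.0, 0.5, 0.8, 0.95):
    answered = [(c, ok) for c, ok in questions if c >= threshold]
    coverage = len(answered) / len(questions)
    errors = sum(1 for _, ok in answered if not ok)
    error_rate = errors / max(len(answered), 1)
    print(f"threshold={threshold:.2f}  answers {coverage:6.1%} of questions, "
          f"errors in {error_rate:5.1%} of those answers")
```

Pushing the threshold toward certainty drives the error rate down but never to zero, while the share of questions answered collapses, which is the sense in which the optimal probability of hallucination is not zero.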
Therefore, we are left with the “checking” warning. This warning is that LLMs are what they are, and in that respect, they are like people. Sometimes, you have to use other tools to check what is going on, especially if the stakes are high. I just hope that the AIs being trained these days to automate this checking also understand that.
1. Yes, I know it also wouldn’t work well at night or on other planets, but let’s not get cute while I’m trying to make a point.
2. And do I have to say it again: that’s why we call them “prediction machines.”
3. Actually, we did say that on our website until some people called us on it. But that was just a marketing error by our marketing person, who was … checks notes … me.