A little while ago, when you asked ChatGPT the question:
Is 9.11 larger than 9.9?
It would actually answer “yes.” This is because, being a large language model, it was trained on text, and in software version numbering 9.11 usually comes after 9.9, so it is considered “larger.” I am pretty sure that 40 years ago a mathematician ranted about how this convention for labelling versions would eventually mean the doom of civilisation, and they will now be looking pretty, pretty satisfied with themselves.
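To see the clash between the two conventions, here is a minimal sketch in Python (purely for illustration; it has nothing to do with how ChatGPT works internally) that compares the pair as decimal numbers and as dot-separated version parts:

```python
# Decimal ordering vs software-version ordering for "9.11" and "9.9".
# As decimal numbers, 9.11 < 9.9; as version labels, 9.11 comes after 9.9
# because each dot-separated part is compared as its own integer.

a, b = "9.11", "9.9"

# Numeric comparison: treat the strings as decimal numbers.
print(float(a) > float(b))  # False: 9.11 < 9.9

# Version-style comparison: compare (major, minor) tuples of integers.
as_version = lambda s: tuple(int(part) for part in s.split("."))
print(as_version(a) > as_version(b))  # True: (9, 11) > (9, 9)
```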
Anyhow, OpenAI has apparently noticed this question being asked, so I tried it with the latest version and got this:
So, problem solved … right? If that were the case, you might think a slight variant would also be handled correctly now, but …
It’s actually funnier than that. Check this out:
In the same chat window, no less. (The legacy model also gets the answer right for 9.11 vs 9.9.)
It’s hard to say what’s happening here, but one suspects some “under-the-hood” fine-tuning to get the model to behave.
The Human Element
The point of this post is not to poke fun at LLMs. Instead, it is to raise an issue about generalisation. LLMs are used by people. That means that, for any particular question, you are evaluating whether the model is likely to perform well on it. In some cases, experience will tell you. But for pretty much anything else, you need to form a belief about an LLM’s performance, and the primary way to do that is to generalise.
This is the topic of a recent, fascinating paper by Keyon Vafa, Ashesh Rambachan and Sendhil Mullainathan. The paper is available here. They point out that measuring LLM performance by some useful benchmarking process will tell you something about when one LLM is better than another, but only for the specific things benchmarked. Put in the hands of real people, however, an LLM that benchmarks highly may actually perform worse. How? Because people form generalisations about what an LLM can do, and a better-performing LLM may invite worse generalisations. The paper then tries this out and shows that on tasks where the stakes are high (that is, the cost of making a mistake is large), benchmarking exercises are not great performance indicators and may be misleading.
The above example shows this dramatically. The changes to ChatGPT mean that it now gets an answer correct where previously it didn’t. But the changes were not general, not even a little bit. This is really bad precisely because, if it gets the 9.11 versus 9.9 question right, you might assume it can handle other comparisons too. Here you can see that a very, very small generalisation on my part would lead someone astray.
Now, you can avoid these particular maths-related issues by asking ChatGPT to use Code Interpreter and run a Python check before giving you an answer. But the point remains that we want to generalise in order to form beliefs about LLM performance, and doing so can be fraught.
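For what it’s worth, the kind of check I have in mind is nothing fancy; a sketch along these lines (purely illustrative, not what Code Interpreter actually runs under the hood) settles the comparison directly:

```python
# An illustrative sketch of the sort of check you might ask Code
# Interpreter to run before answering; not what ChatGPT actually executes.
from decimal import Decimal

x, y = Decimal("9.11"), Decimal("9.9")

if x > y:
    print(f"{x} is larger than {y}")
else:
    print(f"{x} is not larger than {y}")  # this branch fires: 9.11 < 9.9
```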
This type of thing will be an interesting challenge for AI adoption going forward.
In the end, OpenAI may have improved its model through fine-tuning but, as you can see, it is highly unlikely that this made the model better to use once human generalisation is taken into account.