This year, I have submitted an unusually large number of papers to journals (which is another story that I will come back to at a later date). That has given me a window into a completely predictable phenomenon — AI reviewing. Indeed, I wrote a paper about that!
My guess is that about half the reviews I have received of my papers have been substantially or entirely AI-generated. This is not simply cleaning up language. Instead, entire reviews have been produced by feeding my paper into an LLM and asking the LLM to generate a report.
“How can I tell?” you may ask. Well, there are three ‘tells’:
Very pedantic details, such as complaining about minor formatting issues.
Inconsistencies, such as complaining about a lack of formal proofs without realising they are in the appendix, and then providing detailed comments on those very proofs in the appendix.
Generic, unhelpful comments, such as “needs empirical confirmation” for purely theoretical papers.
I am particularly sensitive to that last one because I always ask LLMs to review my own papers before sending them out. While the style might differ, LLM reviews tend to pick up the same things, especially points I choose not to address because they are not relevant.
More recently, I was fortunate to have five referee reports on one paper (which sometimes happens when the editor is not happy with the reviews they have), and I suspected one of them was completely AI-generated. It was also the most positive review. I decided to put it into Pangram, which is the one AI detector that currently seems to work. I was correct, but it turned out that two other reviews were also AI-generated: one completely, and the other for all the comments after the usual reviewer-written description of the paper.
Of course, those AI checkers, even if they work, are not really what we want. They will also flag reviews where the reviewer took a bunch of notes and used AI to craft them into a report. Reviewers are unpaid, so there is surely nothing wrong with them using tools to help.
This, of course, raises a key issue: what is the problem with AI reviews anyway? Remember, I already noted that I use them to pre-review my papers before submission. So it is not as though they are useless. They easily can be useful, and more so when using a good (i.e., expensive) LLM or a service like Refine.ink. I highly recommend this as part of your research workflow. It is good for everyone: you, referees and editors.
The issue is that editors want, and will certainly want in any future equilibrium, the views of referees above and beyond what can be AI-generated. They are not getting that from the AI-generated reviews I saw, nor are they recognising that those reviews were AI-generated. A referee’s judgment is what is being solicited, and a review generated by AI does not provide it. What’s more, under current incentives, none of this is disclosed to the editor.
A Solution
I believe there is a solution to this problem, although it will require a little work to build a system. I already hinted at it above in my own indicators of AI-generated reviews.
The solution is this:
The editor takes the paper and uses an LLM to generate an AI-generated report.
The editor then takes each report they have received and uses an LLM to assess the additional comments or insights (if any) that the report provides beyond the report they know was generated by the LLM.
The editor uses that information to come to a decision.
That will allow the editor to assess the added value of the referee beyond what an AI report could have provided. It will also allow them to communicate that assessment to the authors, who can then assign appropriate weight to the comments provided.
This solution does not bar AI-generated or AI-assisted reports but allows the editor to discern the importance of different comments and assessments more easily.
This seems like the sort of thing that could be done with existing technology, but it could also be coded into an app that does it in a superior way.
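To make this concrete, here is a minimal sketch of what such a tool might look like, assuming the editor has the paper and the referee reports as plain text files and access to an LLM through the OpenAI Python client. The model name, file names and prompts are placeholders of my own, not part of any existing system.

```python
# A rough sketch of the editor-side workflow described above.
# Assumptions (mine, not the post's): the OpenAI Python client (openai >= 1.0)
# is installed, OPENAI_API_KEY is set, and the paper and referee reports are
# available as plain text files. The model name is a placeholder.

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder; use whichever model the editor trusts


def ask(prompt: str) -> str:
    """Send a single prompt and return the model's reply as text."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def baseline_review(paper_text: str) -> str:
    """Step 1: the editor generates their own AI review of the paper."""
    return ask(
        "You are a journal referee. Write a referee report on the paper below, "
        "covering contribution, correctness and clarity.\n\n" + paper_text
    )


def added_value(baseline: str, referee_report: str) -> str:
    """Step 2: assess what a submitted report adds beyond the AI baseline."""
    return ask(
        "Below are two referee reports on the same paper. Report A was "
        "AI-generated by the editor as a baseline. Report B was submitted by "
        "a referee. List any substantive comments or insights in Report B "
        "that go beyond Report A, and say so if there are none.\n\n"
        f"Report A:\n{baseline}\n\nReport B:\n{referee_report}"
    )


if __name__ == "__main__":
    paper = open("paper.txt").read()
    baseline = baseline_review(paper)
    for path in ["report_1.txt", "report_2.txt", "report_3.txt"]:
        report = open(path).read()
        print(f"--- Added value in {path} ---")
        print(added_value(baseline, report))  # Step 3: the editor weighs this
```

The point of the sketch is only that the comparison step is mechanical: once the editor has a baseline AI report, asking an LLM "what does this referee add beyond the baseline?" is a single prompt per report.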
Conclusion
In sum, AI-generated reports are not obviously a problem. The problem is that they can obscure the signal everyone is trying to obtain from peer review. The good news is that AI can also provide a way to clean up that signal and improve the entire system.

