AI-Generated Feedback: Preliminary Thoughts

A colleague recently posted on social media about their experience using paperreview.ai (Stanford Agentic Reviewer), an AI tool designed to provide human-like peer review feedback on academic papers. Much to their surprise, they found the feedback meaningful and noted that it even echoed some of the comments provided by a human reviewer. Although some of the suggestions were aligned more with data science than the humanities, my colleague’s experience was positive overall and they planned to keep using the tool. I have noticed considerable excitement around this system elsewhere. Within a week of going live, according to its developers, it had reviewed over 21,500 papers, with users spanning 160 countries. Moreover, initial feedback appears to be very positive: 95% say they find the generated reviews useful and 91% find the suggestions actionable. Reflecting on this feedback, Andrew Ng, one of the developers, commented that “[i]t’s clear agentic paper reviewing is here to stay and will be impactful”. With AI increasingly permeating academic research and workflows – from chatbots to literature-mapping tools – is there a place for AI in the peer review process?

Figure 1. Screenshot taken from LinkedIn detailing the use of and feedback on paperreview.ai one week after going live. 

In this post, I take a look at paperreview.ai to get a feel for how AI-generated feedback could benefit my own work. There are many other peer-review tools available, such as JAPRA, Paper Wizard, and Review It, but I have decided to focus on one specific tool here so that I may discuss it in greater depth. To do this, I took two published, peer-reviewed papers from different areas of my research: one on computational methods for analysing Tibetan manuscripts (two reviewer reports), the other on eighteenth-century Tibetan history (one reviewer report). I uploaded both papers to the system and compared the results against the original peer reviews. The aim here isn’t to go through the feedback in full, but to probe whether, where, and how AI tools can enhance this process.

Paperreview.ai

Developed by Yixing Jiang and Andrew Ng at the Stanford ML Group, paperreview.ai is an open-access “agentic reviewer” system designed specifically to provide rapid feedback. For both papers, the feedback was ready in under 30 minutes, although the website warns that waiting times can vary. Its operation is fairly simple: upload a PDF (note the maximum size is 10MB – I needed to compress one of my papers before uploading), add your email address and “Target Venue” (the latter is optional), submit, and wait.

In a nutshell, it uses an agentic workflow that converts PDFs to Markdown, generates search queries to find relevant papers on arXiv, downloads and summarises related work, then generates comprehensive reviews following a template (see Figure 2). Because the system is grounded in arXiv, it actively searches for and reads current literature, keeping feedback and reading suggestions up to date. However, as arXiv consists mostly of scientific papers, this will likely be more beneficial for some fields, such as computer science, and less so for Tibetan studies, something that the developers themselves acknowledge as a limitation. At the time of writing, paperreview.ai only supports English-language papers.

Figure 2. Content of AI-generated feedback.

Another indicator that the “agentic reviewer” is better suited to data science papers: it analyses only the first 15 pages of a submission. While this might cover the entire manuscript – as was the case for the 12-page computational methods paper – it feels particularly limiting for humanities papers. For the Tibetan history article (21 pages), the final six pages, a little under a third of the paper, went unread.

To test the system’s accuracy, its developers benchmarked it against human reviewers from the International Conference on Learning Representations (ICLR) 2025, showing a Spearman correlation of 0.42 between AI-generated and human scores, compared to 0.41 between two human reviewers. This suggests that the “agentic reviewer” agrees with humans about as much as humans agree with each other, at least when it comes to assigning numerical ratings. For more on the background, technical aspects, and scoring, see the developers’ overview page.
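For readers unfamiliar with the statistic, here is a minimal Python sketch of how a Spearman correlation is computed: both sets of scores are converted to ranks, and the ordinary (Pearson) correlation is taken between the ranks. The scores below are invented purely for illustration and are not taken from the ICLR benchmark; the function names are my own and do not reflect how the developers carried out their evaluation.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank lists.

    Assumes both lists have the same length and neither is constant
    (a constant list would cause a division by zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical review scores for six papers (illustrative only).
human_scores = [6, 3, 8, 5, 5, 7]
ai_scores = [5, 4, 6, 6, 3, 8]
print(round(spearman(human_scores, ai_scores), 2))
```

A value of 1 means the two reviewers rank the papers in exactly the same order, -1 means exactly the opposite order, and values near 0 mean the rankings are largely unrelated. Because only the ordering matters, a reviewer who scores consistently higher or lower than another can still correlate perfectly with them.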

One thing I would like to note here is that there is currently (April 2026) no data privacy statement on the website. Questions around data retention and privacy have been raised elsewhere and, as far as I can see, have not yet been addressed. As the two papers I uploaded to paperreview.ai have already been published and are available online, this wasn’t a concern for me. However, I personally would not feel comfortable using this for as-yet-unpublished research without a clear statement regarding data use and privacy.

Feedback on Computational Methods Paper

The most striking difference is the length and depth of the AI review, which was three times as long as the combined feedback received from two human reviewers (2,400 vs 800 words). The reviews followed a similar structure (Summary, Strengths, Weaknesses, Questions), but whereas the human reviewers provided two or three points in each category, the AI review provided 10+. The AI review also included “Detailed Comments” and an “Overall Assessment”.

Although concise, the human feedback was focused, with clear recommendations and next steps: “it would be good to add some references to other comparable projects focusing maybe on different languages, to give the reader a better idea of how this particular project is positioned within the larger ‘landscape’ of such initiatives.” The AI feedback, by contrast, came across as more pedagogical, offering aspirational suggestions that far exceeded a) the scope and aims of the paper, b) the word/page count, and c) what is actually feasible given resource limitations. For example, it suggested that I “add a clear experimental protocol—dataset partitions (by manuscript/scribe), model architecture, hyperparameters, augmentations, and CER by sub-corpus. Incorporate stronger baselines or recent methods where feasible (e.g., transformer-based HTR with augmentations/ensembles; PEFT fine-tuning)”. This related to just one part of a three-stage pipeline, which feels excessive, particularly as references were already given to other publications that provide most of these details.

Perhaps human and AI reviewers are coming from different angles, with human reviewers answering the question “is this suitable for publication here?” and AI, by contrast, answering the question “is this paper as rigorous as theoretically possible?” paperreview.ai collects no information on word or page limits at upload, which might explain why the AI sets such a high bar, especially with regard to technical detail. There is an option to select a “Target Venue” before uploading the paper, but only 12 are currently listed. Providing more (optional) data at this stage could result in more constructive and focused feedback.

There was clear overlap in feedback, however, especially around clarity of presentation and comparison with related work. Three of the 10 questions posed by AI echoed questions and feedback from human reviewers. At first glance, the AI appeared to demonstrate a more comprehensive awareness of recent literature, noting that there are papers on Tibetan language models and alternative architectures that could strengthen the methodology. However, no titles, DOIs, or other details were given that would help locate these recommended additions. It remains unclear whether these were actual papers or whether references were hallucinated – a known limitation of AI.

Interestingly, the AI feedback raised ethical considerations that human reviewers didn’t mention at all, asking questions around documenting collaboration protocols and data governance principles. While this reflects the growing attention to ethics in digital humanities and serves as a useful reminder,1 it also highlights how AI might apply generic “best practices” and offer boilerplate comments that could be applied to any paper in any field.

Given paperreview.ai’s focus on “experimental gaps” and “technical limitations”, I was interested to see how it would fare on a non-data science paper and whether it would raise similar questions around ethical/cultural considerations.

Feedback on Tibetan History Paper

This time the difference in specialist knowledge (or lack thereof) was more pronounced: the human review was brief (c. 100 words) but focused on actionable additions and edits to improve accessibility for non-specialists, e.g. explaining the Ganden Phodrang (དགའ་ལྡན་ཕོ་བྲང་), providing English translations of Tibetan titles, and correcting Sanskrit transliterations. The AI review, on the other hand, was around 2,000 words and focused on broader scholarly standards, e.g. “variant personal names/titles can distract; a consolidated prosopography and glossary would aid readers”. It also suggested a level of detail that felt, at times, as though it missed the entire point of the paper. The paper analyses an eyewitness account of a key historical event and draws on the author’s life and context to frame the position taken and details given. However, one of the weaknesses listed under “experimental gaps or methodological issues” is a “heavy reliance on a single primary narrative” and the need for more “analysis across the author’s oeuvre or parallel exempla in contemporaneous texts”. Similarly, asking the author to survey rhetorical irony across eight volumes of collected works in manuscript form (now published across 20 hardcover volumes) is less a revision suggestion than an invitation to undertake a separate multi-year research project!

Despite the model’s data science biases showing in the questions, two of the eight questions posed were more meaningful and offered something different to the feedback provided by the human reviewer. Both identified gaps that were actionable within the scope of the article, for example pointing to evidence already present that could be better foregrounded or connected. This time there were no comments or questions around research ethics.

The AI’s suggestions for “related work” appeared to have little to no connection with the topic. It recommended comparing eighteenth-century Tibetan historical analysis with a) a computer-supported cooperative work (CSCW) study on fear among religious minorities in digital spaces, b) a study of Qing imperial mausoleum geomancy, and c) a genetic analysis of the Aisin Gioro patriline. While justifications for these suggestions were provided (e.g., “reminds us that Qing imperial identity has been explored through multiple lenses”), they appear to be the result of keyword matching rather than actual academic engagement.

Is There a Place for AI in the Peer Review Process?

This brief comparison shows that paperreview.ai has potential but, at present, is far more useful for data science fields than for humanities work. For the computational paper, the AI’s technical suggestions were valuable – not always for the paper in question, but more broadly for my work – even if overly ambitious on occasion. For the history paper, suggestions were mostly generic advice. This isn’t strictly negative. It is easy to get caught up in the small details and lose sight of the bigger picture, such as providing context, so there is a place and need for this type of feedback. However, I would hope and expect more than just general advice from the peer review system and this is where I feel paperreview.ai is lacking. Moreover, Alice Cassalini (rightly) questions the consequences of moulding the humanities into systems and schemes designed for STEM and related fields, and this has been at the forefront of my mind whilst writing this piece. Is there any real benefit to using an “agentic reviewer” that’s clearly not designed for the humanities? And if so, where and how could this be used to enhance the writing process?

I think paperreview.ai’s strength lies in its ability to generate a detailed report quickly. This speed is what inspired the designers to create it in the first place. With this in mind, I can see it being a useful tool for producing a preliminary report that could serve as scaffolding, providing an initial check of presentation clarity and methodology, and listing observations and questions that can (re-)guide the author’s focus. I should add here that the AI reviews didn’t pick up on any typos or grammar issues – which human reviewers did – and so it shouldn’t be used with this intention. Perhaps the key is thinking about AI feedback as one tool among many, deployed thoughtfully (and taken with a big pinch of salt!).2

The appeal of such tools also shines a spotlight on deeper issues: peer review can be time-consuming, opaque, non-instructional, and dismissive (Mavrogenis et al., 2020). It runs on volunteer labour from overstretched academics with limited training, guidance, or recognition. Might this explain why over 21,000 papers were uploaded to paperreview.ai within a week of going live? And why a fifth of reviews for ICLR 2026 were flagged as being potentially fully AI-generated? Although I am unlikely to use this again in my own work, I can see why there has been so much buzz around this tool and, somewhat reluctantly, agree that agentic paper reviewing is here to stay for now.


Acknowledgments

I would like to extend my gratitude to the human reviewers who provided thoughtful feedback on both manuscripts.


References

A. F. Mavrogenis, A. Quaile, and M. M. Scarlat, “The Good, the Bad and the Rude Peer-review,” International Orthopaedics (SICOT) 44 (2020): 413–415. https://doi.org/10.1007/s00264-020-04504-1.


Footnotes

  1. Over the last year, in particular, I have noticed more discussions around ethics and digital humanities (especially AI) in pieces published by The DO. Examples include Elaine Lai’s “AI Ethics and the Humanities: A Perspective from Buddhist Studies” and Edward A. S. Ross’ series on GenAI.
  2. A question I kept coming back to is the environmental cost of using an “agentic reviewer”. Delving into this in more detail is beyond the scope of this post, so instead I leave a question posed by Alice Cassalini: “how can we consciously use these tools when this [an already damaged ecosystem, the racial biases of models, the exploitation of workers etc.] is the system of suffering they are built in and perpetrate?”

Cover Image: Screenshot of paperreview.ai, edited by author.
