Despite didactic, ethical, and environmental concerns, the use of GenAI is on the rise in academia. For most applications, the jury is still out on whether and how these tools will benefit education and research in the long term. But it’s already safe to conclude that one popular use case is, in fact, a bad one: AI-generated summaries. When dealing with large amounts of text, the process of reading, evaluating, and summarizing can feel daunting.
It is understandable to want to outsource this cognitive heavy lifting to a GenAI tool. However, the quality of AI-generated summaries is insufficient for academic use, and they will not provide the coherent, reliable overview you are looking for. Furthermore, by generating a summary instead of writing your own, you miss an essential step in processing, memorizing, and applying information effectively.
Before diving into these problematic aspects of AI-generated summaries, it’s illuminating to first examine the differences between how humans and GenAI tools create summaries. When we summarize a text, we do more than just make a text shorter. We use our background knowledge, experiences, and even our feelings to describe what really matters to us. The many complex cognitive, linguistic, and affective-motivational processes involved can be categorized into three main steps.1 In the comprehension step, we get a global understanding of the text, look for linguistic clues, and estimate the effort required to summarize.
In the next step, we select information based on our goals, organize and hierarchize it, and create a mental map of meaning. In the “production” step, we create the summary itself. It is a non-linear process in which we jump back and forth between steps as we try to verify and evaluate information and organize our thoughts. In other words, in a complex and messy way, we think when we summarize. LLM-based chatbots don’t think.
Instead, they use advanced text prediction to answer a question. Whether you ask one to create a tasty pie recipe, finish a mathematical equation, or generate a summary, the tool will always first break up your prompt or uploaded text into its own language of word fragments (tokens). It will then process these tokens through billions of learned token patterns to predict the most statistically likely tokens to follow.
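To make that mechanism concrete, here is a deliberately tiny sketch in Python. It is not how any real LLM is implemented; it replaces subword tokens and billions of learned parameters with a toy word-level frequency table. But the core move is the same: continue the text with whatever fragment most often followed the previous one in the training data, with no model of meaning behind it.

```python
from collections import Counter, defaultdict

# Toy "training data"; real models are trained on trillions of tokens.
training_text = "the cat sat on the mat . the cat ate the fish ."
tokens = training_text.split()  # crude stand-in for subword tokenization

# Count which token follows which (a simple bigram frequency table).
follows = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    follows[current][nxt] += 1

def predict_next(token: str) -> str:
    """Return the statistically most likely token to follow `token`."""
    return follows[token].most_common(1)[0][0]

# "Generate" a continuation: no understanding, just frequency statistics.
output = ["the"]
for _ in range(5):
    output.append(predict_next(output[-1]))
print(" ".join(output))  # -> "the cat sat on the cat"
```

A real LLM does this at vastly larger scale and with far richer statistics, but the basic operation is still choosing a plausible continuation, not checking that continuation against a source text.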
This is why they are sometimes called “autocomplete tools on steroids,” and why researchers Bender and Hanna argue that calling them “artificial intelligence” is misleading.2 As text predictors, they lack key human capabilities for creating useful and meaningful summaries, such as forming a referential mental map of meaning, understanding context, and having a sense of purpose. Humans have epistemic awareness and understand where their mental maps of meaning might be incomplete; GenAI tools do not.
Considering these inherent limitations, it is unsurprising that over three years after the public launch of ChatGPT, misinformation and inaccuracies in LLM-based chatbot output, so-called “hallucinations”, still pose a huge problem. OpenAI admitted last year that getting rid of them is virtually impossible.3 And a recent research report by Google shows that even the most accurate model, which is (coincidentally or not) their own Gemini 3 Pro, achieves only 68.8% accuracy under stress testing.4 Other major models, such as ChatGPT 5 and Claude 4.5 Opus, fare even worse, with 61.8% and 51.3% accuracy, respectively. This misinformation isn’t a technical error or the delusion of an artificial “brain”.
It is the result of GenAI’s baked-in indifference to the truth. According to Hicks et al., we should therefore stop using the term “hallucination” altogether. It would be more accurate to call all GenAI output “bullshit,” they argue.5 Used here conceptually rather than as an insult, it describes communication distinct from both truth and lies.6 While a liar subverts the truth intentionally, a bullshitter simply doesn’t care about it.
Some argue that Hicks et al.’s comparison falls short because GenAI tools, unlike a human bullshitter, lack intent and so cannot be indifferent to the truth. However, while the tools themselves are merely “probabilistic automation systems”, their creators, companies like OpenAI, Google, Anthropic, and Microsoft, do have intent: they aim to tempt users into prolonged use and paid subscriptions by prioritizing engaging, pleasing output over accuracy.7 The “bullshitter” comparison therefore isn’t far off.
GenAI tools’ lack of epistemic awareness and inability to contextualize information also lead to “knowledge bleed” in summaries. They have difficulty focusing solely on the uploaded document (the “grounding knowledge”), so information from the texts they have been trained on “bleeds” into the summary. Google’s aforementioned accuracy test quantified these inconsistencies, revealing that even summaries from cutting-edge models like Gemini 3 Pro and ChatGPT 5 deviate from the source material by almost a third.8 NotebookLM, a version of Gemini Pro optimized for longer text analysis, likely performs better, but it suffers from the same type of misinformation and inconsistency issues, and no comparable accuracy numbers are available for it. Researchers Peters and Chin-Yee identified a specific form of inconsistency in AI-generated summaries of academic texts: a tendency to exaggerate scientific findings. They found that the often nuanced conclusions of research articles are prone to misrepresentation by GenAI tools.9 Compared to human-written summaries, generated summaries were five times more likely to include overgeneralizations.
However, not all models exhibited this flaw. While ChatGPT, LLaMA, and DeepSeek overgeneralized badly, Anthropic’s Claude models did not. Whether AI-generated summaries can ever meet academic standards remains to be seen. While “reasoning” models have improved accuracy by including verification agents that check each other’s output, the “hallucination problem” persists.
Unfortunately, better prompting is not a fix. Peters and Chin-Yee found that prompting the models for accuracy had the opposite effect, almost doubling the rate of overgeneralization in the output.10 And while tools like Claude Code now also use compilers to verify generated code or math, no such anchor exists for academic meaning. Swapping a single big bullshitter for an army of little ones hasn’t provided the epistemic awareness or human-level comprehension required for meaningful summaries.
In some respects, the newer models have even regressed. They overgeneralize more than their predecessors, likely because human trainers prefer confident-sounding output over academic nuance.11 Because misinformation and inaccuracies in LLM output are often subtle, the advice to use GenAI tools “critically” does not work well for summarization. Slight inaccuracies can be very damaging in academic work, yet they can usually be detected only through a close reading of the text or by an expert.
However, needing to read the full text closely to verify the AI output defeats the whole purpose of generating a summary. Even if the accuracy problems were solved, and AI-generated summaries reliably captured all the essential points of a text, it would still be a bad idea to use them. Creating your own summaries is a crucial step in any literature study. When you read and summarize a text, you create the neural connections necessary to memorize and apply the information well in an exam, experiment, or research paper.
Generating it with a click is a harmful form of cognitive offloading and will erode these skills. Writing it yourself will reveal the nuances of an academic text and allow you to register those elements that you deem essential to whatever you are working on. Summarizing is also one of the activities that create the friction humans need in order to learn. We learn best when we encounter difficulties in the learning process, such as summarizing a challenging text that demands our full attention and effort.12 But for this to work well, those texts shouldn’t be too difficult for your current skill level.
They need to be in the zone of “desirable difficulty”, just above your current knowledge and abilities.13 Therefore, you need to carefully select the texts you want to invest this cognitive labor in. Wouldn’t getting a quick overview of a text be a good use case for an AI-generated summary then? Even for that, there are better alternatives. To get a reliable gist of a text, consult author-written abstracts, book introductions, or book reviews.
The latter have the added benefit of providing a professional assessment of the work’s quality. Beyond hindering your own learning and research, using AI-generated summaries can also have long-term consequences for the collective scientific endeavor. Once misinformation from AI-generated summaries goes uncorrected and seeps into published theses, research papers, and other outputs, it could contribute to a loop of misinformation. That’s why, as the Information Literacy & Education Team of TU/e, we say: don’t generate, summarize!
Maarten Paulusse is an Information Literacy and Education Specialist at Library and Open Science. Besides teaching information literacy to bachelor’s students, master’s students, and EngD and PhD researchers, he focuses on educational development and skills education policy for the library. Before starting at TU/e, Maarten worked at Utrecht University for 12 years in various positions. He was a lecturer in cultural history and political history, coordinator of the Utrecht Summer School, and an education policy officer.

Would you like to stay informed about new Information Literacy articles and activities? → Sign up for our mailing list

1. Piu and Angelini, “Summarizing through the Lens of Cognitive Load Theory: Implications for Education and Teaching Methods,” 119–20.
2. Emily M. Bender and Alex Hanna, The AI Con: How to Fight Big Tech’s Hype and Create the Future We Want, first edition (Harper, 2025).
3. “Why Language Models Hallucinate,” OpenAI, September 5, 2025, openai.com/index/why-language-models-hallucinate/.
4. Lee Cheng et al., The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality (Google, 2025), deepmind.google/blog/facts-grounding-a-new-benchmark-for-evaluating-the-factuality-of-large-language-models/.
5. Michael Townsen Hicks et al., “ChatGPT Is Bullshit,” Ethics and Information Technology 26, no. 2 (2024): 38.
6. Harry G. Frankfurt, On Bullshit: Anniversary Edition (Princeton University Press, 2025).
7. Nanna Inie et al., “From ‘AI’ to Probabilistic Automation: How Does Anthropomorphization of Technical Systems Descriptions Influence Trust?,” Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24), New York, NY, USA, June 5, 2024, 2322–47.
8. Cheng et al., The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality.
9. Uwe Peters and Benjamin Chin-Yee, “Generalization Bias in Large Language Model Summarization of Scientific Research,” Royal Society Open Science 12, no. 4 (2025): 241776.
10. “Merendeel prominente chatbots overdrijft wetenschappelijke resultaten” [Majority of prominent chatbots exaggerate scientific results], Universiteit Utrecht, May 1, 2025, www.uu.nl/nieuws/merendeel-prominente-chatbots-overdrijft-wetenschappelijke-resultaten.
11. Cheng et al., The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality.
12. Felienne Hermans and Izaak Dekker, “Opinie: Chatbots nemen weerstand weg. Dat is slecht voor het onderwijs” [Opinion: Chatbots take away resistance. That is bad for education], Trouw, February 5, 2026, www.trouw.nl/opinie/opinie-chatbots-nemen-weerstand-weg-dat-is-slecht-voor-het-onderwijs~bc2520e0/.
13. Robert A. Bjork, “Institutional Impediments to Effective Training,” in Learning, Remembering, Believing: Enhancing Individual and Team Performance (National Academy Press, 1994).