July 21st, 2024

When ChatGPT summarises, it does nothing of the kind

The article critiques ChatGPT's summarization limitations, citing a failed attempt to summarize a 50-page paper accurately. It questions the reliability of large language models for business applications due to inaccuracies.


The article discusses the limitations of ChatGPT in providing accurate summaries. The author shares their experience of attempting to use ChatGPT to summarize a 50-page public paper on pension funds, only to find that the AI-generated summary lacked crucial details and key proposals present in the original text. The author highlights that ChatGPT's summarization process involves shortening the text rather than truly understanding and summarizing the content. They explain how the AI's parameters and training data heavily influence the generated summaries, often leading to inaccuracies and omissions. The author also compares ChatGPT's summarization with another AI tool, Gemini, noting similar issues with both tools in producing concise and accurate summaries. Ultimately, the article questions the reliability of using large language models (LLMs) like ChatGPT for real business applications due to their limitations in providing comprehensive and precise summaries of complex subjects.

39 comments
By @kibbi - 9 months
Am I blind or is there no mention at all of the GPT model he used?

The author states his conclusions but doesn't give the reader the information required to examine the problem.

- Whether the article to be summarized fits into the tested GPT model's context size

- The prompt

- The number of attempts

- Which information in the summary, specifically, is missing or wrong (he doesn't always state this)

For example: "I first tried to let ChatGPT [summarise] one of my key posts (...). ChatGPT made a total mess of it. What it said had little to do with the original post, and where it did, it said the opposite of what the post said." He doesn't say which statements of the original article ChatGPT misrepresented.

My experience is that ChatGPT 4 is good when summarizing articles, and extremely helpful when I need to shorten my own writing. Recently I had to write a grant application with a strict size limit of 10 pages, and ChatGPT 4 helped me a lot by skillfully condensing my chapters into shorter texts. The model's understanding of the (rather niche) topic was very good. I never fed it more than about two pages of text at once. It also adopted my style of writing to a sufficient degree. A hypothetical human who'd have to help on short notice probably would have needed a whole stressful day to do comparable work.

By @ADeerAppeared - 9 months
There's a fundamental problem with all these "summary" tasks, and it's obvious from the disclaimer that's on all these AI products: "AI can be wrong, you must verify this".

A summary for which you must always read the un-summarized text is useless as a summary. This should be obvious to literally everyone, yet AI developers stick their heads in the sand about it because RAG lets them pretend AI is more useful than it actually is.

RAG is useless; just fucking let it go and have AI stay in its lane.

By @ianbutler - 9 months
I read through your entire article, and the three main points I took away from it were also contained in the GPT-4o summary I then generated to compare. So here's some empirical counter-evidence.

I would suggest a weaker but more plausible claim: GPT-4o has trouble summarizing longer-form content that falls outside the bounds of its context window, or something like a lossier attention mechanism is being used as a compromise on resource usage.

Summary:

https://chatgpt.com/share/21d81811-db45-4ac5-b3c7-b25a79b2ba...

By @wkat4242 - 9 months
We need a lot more of this kind of in depth analysis. Right now the cheering on of AI is overwhelming. Criticism is often suppressed, both on the vendor side who just want to sell, and on the client side who have very strong FOMO.

I work for the client side and this bothers me a lot. It's very hard to get a true honest value analysis done with all the sales influence and office politics going on.

By @sdrinf - 9 months
The article doesn't specify which GPT model was used. Re-running the experiment on the EU regulation paper using gpt-4-1106 (the current best "intelligent" one):

https://chatgpt.com/share/d5709aeb-d24c-488b-985c-c13eba0c01...

"4. IORP Directive: The IORP (Institutions for Occupational Retirement Provision) Directive is analyzed, highlighting its scope and its impact on pension funds across the EU. The paper suggests that the directive's complex regulations create inconsistencies and may need clarification or adjustment to better align with national policies." "5. Regulatory Framework and Proposals: A significant portion of the paper is devoted to discussing potential reforms to the regulatory framework governing pensions in the EU. It proposes a dual approach: a "soft law" code for non-economic pension services and a "hard law" legislative framework for economic activities. This proposal aims to clarify and streamline EU and national regulations on pensions."

^^ These correspond to the author's two self-selected main points.

By @ankit219 - 9 months
Somewhat disagree with the point being made here. The fundamental assumption for humans is that when summarizing we pay more attention to the important bits and give the rest only a passing mention if needed. For a model, the context is the only universe in which it can assign importance, based on previous learning (instruction-tuning examples) and the prompt. For many models, shortening the text is equivalent to summarizing (when the text is not as long as a fifty-page paper). Output depends on the instruction-tuning dataset, and it seems that unless a model is trained on longer documents, it won't produce the kind of output you expect. In a chain-of-thought reasoning scenario, it probably will. With Gemini, they definitely tested long context and tuned the outputs to work well with it, as it was their value prop, shown at I/O no less.

I have been working on summarizing new papers using Gemini for the same purpose. I don't ask for a summary though; I ask for the story the paper is trying to tell (with different sections) and get great output. Not sharing the links here, because it would be self-promotion.

By @glenndebacker - 9 months
I've noticed that when using a language model to rephrase text, it also sometimes seems to miss important details, because it clearly has no real understanding of the text.

It's not a problem when you are aware of it, and with some follow-up input you can get it mitigated, but often I see that people tend to take the first output of these systems at face value. People should be a bit more critical in that regard.

By @simonw - 9 months
Something I find frustrating about summarization is that while it's one of the most common uses of LLMs I've actually found very little useful material investigating ways of implementing it.

Is a system prompt "provide a summary of this text" the best possible prompt? Do different models respond differently to prompts like that? At what point should you attempt more advanced tricks, like having one prompt extract key ideas and a second prompt summarize those?
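A minimal sketch of the two-stage idea (extract key ideas first, then summarize from them), assuming the OpenAI Python SDK; the model name and prompt wording here are placeholders to illustrate the structure, not a known-good recipe:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def complete(system: str, user: str, model: str = "gpt-4o") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def two_stage_summary(text: str) -> str:
    # Stage 1: pull out the claims and proposals, not just the topics.
    ideas = complete(
        "List the key claims, proposals, and conclusions of the text as bullet points. "
        "Do not omit anything the author presents as a main point.",
        text,
    )
    # Stage 2: summarize from the extracted ideas rather than the raw text.
    return complete(
        "Write a faithful one-paragraph summary based only on these bullet points.",
        ideas,
    )
```

Whether this actually beats a plain "provide a summary of this text" prompt, and for which models, is exactly the open question.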

By @jart - 9 months
The author could have gotten their point across better by saying that LLMs aren't good at focusing on the things they deem important. LLMs absolutely understand things; otherwise it'd be impossible for them to work at all. But like people, they try to make a summary fit their preconceived biases (e.g. regulation good). You know how when you try to talk to someone about a subject they're unfamiliar with, it goes in one ear and out the other? That's how LLMs are when you ask them to pick out the divergent ideas from documents.

Products like ChatGPT are rewarded in their fine-tuning for doing happy-sounding cheerleading of whatever bland, unsophisticated corpo docs you throw at them. Consumer products like this simply aren't designed for novelty, although there are plenty of AIs that are. For example, AlphaFold is designed to search through an information space and discover novel stuff that's good.

ChatGPT is something that's designed to ingratiate itself with emotional individuals using a flawed language that precludes rational thinking. That's the problem with the English language: it's the lowest common denominator. Any specialized field like programming or the natural sciences that wants to make genuine progress has always done so historically by inventing a new language, e.g. jargon, programming languages.

The only time normal language is able to communicate divergent ideas is when the speaker has high social status. When someone who doesn't have high social status communicates something novel, we call it crazy. LLMs, being robots, have very low social status. That's why they're trained to act the way they do.

By @captn3m0 - 9 months
I am subscribed to Glancias, an AI-summarised daily news email service of sorts. Since news is supposed to be a high-risk area where you don’t want hallucinations, I am sure they’ve fine-tuned their setup to some degree.

However, it still managed to pick up several clickbait headlines about NASA’s asteroid wargame and write a scare news summary:

Truth: https://www.space.com/dangerous-asteroid-international-coope...

> The participants — nearly 100 people from various U.S. federal agencies and international institutions — considered the following hypothetical scenario: Scientists just discovered a relatively large asteroid that appears to be on an Earth-impacting trajectory. There's a 72% chance it will hit our planet on July 12, 2038, along a lengthy corridor that includes major cities such as Dallas, Memphis, Madrid and Algiers.

Glancias Summary:

> NASA has identified a potential asteroid threat to Earth in 2038, revealing gaps in global preparedness despite technological advancements in asteroid trajectory redirection and the upcoming launch of the NEO Surveyor space telescope.

By @djbusby - 9 months
One thing I've used GPT and Gemini for is summarizing HN threads. They do OK at finding the top-level points, but within the conversation (thread) there are generally some one-off points I find key, and neither AI can identify these "key topics at the leaf" (I don't know what else to call them).

What prompt am I missing? I've tried things like "find the edge-case details" and similar "what's only mentioned once" instructions, but I can't get it to highlight them.
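One hedged guess at a prompt for this, asking explicitly for single-mention points rather than "key topics" (which tends to bias the model toward the most-repeated themes); an untested sketch:

```python
LEAF_POINT_PROMPT = """\
Below is a threaded discussion. First, list the 3-5 themes that many commenters
repeat. Then, separately, list up to 10 points raised by only a single commenter,
especially in deeply nested replies. For each single-mention point, quote the
sentence it comes from and name the commenter. Do not fold these one-off points
into the common themes.

Discussion:
{thread_text}
"""
```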

By @lopuhin - 9 months
Curious which model was used? Sorry if I missed that. Looks like an important detail to mention when doing an evaluation.
By @impure - 9 months
My RSS reader automatically generates AI summaries for Hacker News posts, and it works pretty well. Sometimes when I comment on a post I double-check that the summary is correct, and it always does a really good job. I even had it generate comment summaries. It needs a bit of prompting to highlight individual arguments, but it does well here too.

I am very skeptical of the author's claims. Perhaps the parts of the articles being summarized are not actually important, so the LLMs did not include them. Or perhaps the article does an exceptionally bad job of explaining why the argument is important. Also, there's a difference between the API and the free web interface; I think the web version has a bunch more system prompting to be helpful, which may make a summary harder to do.

By @kingkongjaffa - 9 months
A big reason for 'content drift' is that LLMs work over a sliding context window of the input text plus the prompt, and for each new token generated, the next-token prediction uses the previously generated tokens as well.

Giving an LLM too much context causes the same effect, as the sliding window moves on from the earliest tokens.

It's also why summarization is bad.

It's not exactly linear from the start of the text to the finish, though; bits of context get lost from throughout the input text at random, and they will be different each time the same input is run.

A good way to mitigate this is to break up the text and execute it in smaller chunks. Even with models boasting large context windows, results drop off significantly with large inputs, so using several smaller prompts is better.
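A minimal sketch of that chunking mitigation: overlapping character windows small enough that nothing falls out of the model's effective context, each summarized with its own prompt. The sizes are arbitrary placeholders, not tuned values:

```python
def chunk_text(text: str, chunk_chars: int = 6000, overlap_chars: int = 500) -> list[str]:
    # Overlap the windows so a sentence cut at one boundary still appears
    # whole in the next chunk.
    chunks, start = [], 0
    while start < len(text):
        end = start + chunk_chars
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap_chars
    return chunks
```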

By @KTibow - 9 months
> did it add the content of the web site to the prompt, or was the web site part of the training?

Likely it added the content to the prompt, but the content didn't stay in the prompt for the next prompt. The next prompt likely only had general web results as context.

By @andix - 9 months
I had a similar experience with ChatGPT and larger documents. Even basic RAG (retrieval-augmented generation) tasks don't work well there; the most basic Langchain RAG examples perform much better. The usual approach is to split the document into pages and then into smaller text fragments (a few hundred characters). Only those smaller fragments are then processed by the LLM.

In this case I would take a similar approach: split the document into multiple smaller (and overlapping) fragments, let an LLM summarize each of those into key findings, and in a next step merge those key findings into a summary.

I don't have a lot of experience with this, though, so I can't say whether it would provide better results.
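A rough sketch of that split-then-merge flow, with `ask_llm` as a stand-in for whatever chat-completion call you use; this is an assumption-laden outline, not a Langchain recipe:

```python
def ask_llm(prompt: str) -> str:
    # Stand-in: replace with a call to your chat-completion API of choice.
    raise NotImplementedError

def split_into_fragments(document: str, max_chars: int = 2000) -> list[str]:
    # Naive splitter: pack paragraphs into fragments of at most max_chars
    # (overlap between fragments is omitted here for brevity).
    paragraphs = document.split("\n\n")
    fragments, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) > max_chars:
            fragments.append(current)
            current = ""
        current += p + "\n\n"
    if current:
        fragments.append(current)
    return fragments

def summarize_document(document: str) -> str:
    # Map: key findings per fragment. Reduce: merge findings into one summary.
    findings = [
        ask_llm("List the key findings of this fragment as bullet points:\n\n" + frag)
        for frag in split_into_fragments(document)
    ]
    return ask_llm("Merge these key findings into a single coherent summary:\n\n"
                   + "\n\n".join(findings))
```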

By @_def - 9 months
> The article discusses the author's experience and observations regarding the use of language model-based tools like ChatGPT for summarizing texts, specifically highlighting their limitations. The author initially believed summarizing was a viable application for such models but found through practical application that these tools often fail to produce accurate summaries. Instead of summarizing, they tend to merely shorten texts, sometimes omitting crucial details or inaccurately representing the original content. This is attributed to the way language models prioritize information, often influenced more by the vast amount of data they've been trained on rather than the specific content they are summarizing. The author concludes that these tools are less reliable for producing meaningful and accurate summaries, especially when the text is complex or detailed. The experimentation with summaries on different subjects further demonstrated that these models often produce generalized content that lacks specific, actionable insights from the original texts.
By @rajnathani - 9 months
I've had the same negative experiences summarizing news articles with ChatGPT-4 (4o and even the previous 4 model). These LLM makers need to focus more on keeping the context length lower [1], at maybe around 4-12K tokens, and instead get their systems to have more general intelligence.

[1] It's annoying to see Google initially market their Gemini models around their 100K to 1M token context sizes, and even OpenAI has been doing a lot of their model making and marketing around it recently too.

By @eth0up - 9 months
I have some questions for anyone willing to consider them, though know in advance I am quite ignorant on the general subject.

I've been having a surprisingly good time in my 'discussions' with the free online chatgpt, which has a cutoff date of 2022. What really impresses me is the results of persistence on my part when the replies are too canned, which can be astonishing.

In one discussion, I asked it to generate random sequences of 4 digits, 0000-9999, until a specific given number occurred. It would, as if pleased with its work, give the number within 10 tries. I suppose this is due to computational limitations that I don't understand. However, when, with great effort, I criticized its method and results enough, I got it to double its efforts before it lazily 'found' an occurrence of the given number. It claimed it was doing what I asked. It surely wasn't. But it seemed oblivious. I'm interested to understand this.
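For context on the digit experiment: if the draws were genuinely uniform over 0000-9999, the number of tries until a given target appears follows a geometric distribution with a mean of 10,000, so hitting it "within 10 tries" is a strong sign the model wasn't actually sampling. A quick simulation of the honest version:

```python
import random

def tries_until(target: int) -> int:
    # Count uniform draws from 0000-9999 until the target comes up.
    tries = 0
    while True:
        tries += 1
        if random.randint(0, 9999) == target:
            return tries

trials = [tries_until(1234) for _ in range(500)]
print(sum(trials) / len(trials))  # prints roughly 10000
```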

I'm sure I'll get some contempt for my ignorance here, but I asked it to analyze pi out to some decimal place I don't remember until it found a Fibonacci sequence. It couldn't. Maybe one doesn't exist. As obvious as this might be to the smarter primates here, I don't understand. I was mostly entertaining myself with various off-the-cuff things.

What I did realize, though, is what by my standards is fierce potential. This has me wanting, if it's even possible, to acquire my own version with, perhaps, the possibility of real-time internet interaction.

Is this possible without advanced coding ability? Is it possible at all? What would be a starting point, and what are some helpful pointers along the way?

Anyway, it reminded me of my youth, when I had access to a special person or two and would make them dizzy with my torrential questions. Kind of a magic pocket Randall Munroe, with spontaneous schizophrenia. Fun.

Edit note: those were but a couple of examples among many more that I cannot remember. I'm hooked now, though, and need to come out of my cave for this and learn more. I have some obsolete Python experience, if that might be relevant.

By @tkgally - 9 months
Are there any objective benchmarks for rating and comparing the summarization performance of LLMs? I've had mixed results when using the latest versions of ChatGPT, Claude, and Gemini to summarize texts that I know well (books I have read, papers I wrote, transcripts of online meetings I attended). Often their summaries are as good as or maybe even better than what I could prepare myself, but sometimes they omit key points or entire sections. Other than failures clearly due to context length limitations, it's hard to judge which LLM is a better summarizer overall or whether a new LLM version is better than the previous one.
By @throw156754228 - 9 months
It's even worse than the author writes. As the 'parameter' side of the equation gets trained on more and more scraped AI-spam garbage, summarising will actually get even worse than it already is.
By @HarHarVeryFunny - 9 months
That's a shame, because it'd be one of the more useful things that LLMs might have been used for, and I had - basically on faith - assumed that it was providing genuine analytical summaries ...

Of course, it's obvious in hindsight that creating a useful summary requires reasoning over the content, and given that reasoning is one of the major weaknesses of LLMs, it should have been obvious that their efforts to summarize would be surface-level "shortening" rather than something deeper that grokked the full text and summarized the key points.

By @spencerchubb - 9 months
Was this with GPT 3.5 or 4? If it was with 3.5, the analysis is entirely irrelevant in my opinion.

Many people just use GPT 3.5 because it's free, not realizing how much it sucks in comparison to newer models.

By @sirspacey - 9 months
Appreciate this deep dive and the important conclusion: "summarize" actually means "shorten" in ChatGPT. We've been summarizing call transcriptions and have seen all of these same challenges.
By @arisAlexis - 9 months
I have worked with these kinds of documents, and one of the main problems is that they are so vague, generic, and repetitive that not even humans can comprehend and summarize them correctly, because they are produced as bureaucratic artefacts to present to management in order to get funding to make more papers.
By @moffkalast - 9 months
> When you ask ChatGPT to summarise this text, it instead shortens the text.

Wasn't that kind of the previous classical AI method of doing summaries? Something something, rank sentences by the number of nouns that appear in other sentences to get the ones with the most information density and output the top N?
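Roughly, yes. A toy sketch of that kind of extractive scoring (score each sentence by how often its content words appear in other sentences, output the top N in original order); real extractive systems such as TextRank are considerably more careful, and the word filter here is just a crude stand-in for noun detection:

```python
import re
from collections import Counter

def extractive_summary(text: str, n: int = 3) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Crude "content word" proxy: lowercase words of 4+ letters, one set per sentence.
    words_per_sentence = [set(re.findall(r"[a-z]{4,}", s.lower())) for s in sentences]
    doc_counts = Counter(w for ws in words_per_sentence for w in ws)
    # A sentence scores higher when its words also occur in other sentences.
    scores = [sum(doc_counts[w] - 1 for w in ws) for ws in words_per_sentence]
    top = sorted(sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n])
    return " ".join(sentences[i] for i in top)
```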

By @horacemorace - 9 months
The author’s assertion that models or systems can ignore important novel points when producing summaries/reductions makes complete sense; from averagers of patterns it might even be expected. In any case it seems testable.
By @ccvannorman - 9 months
A great litmus test for AI is to ask it to write a post-effect GLSL shader for a JavaScript game engine.

They all fail quite spectacularly at this, at least for my use case (cel shading, outlines, height sensitive fog)

By @modeless - 9 months
Yeah I've never understood why so many people think LLMs are good at summarizing. Only if you don't care about understanding!
By @huac - 9 months
I gave the same article to Claude 3.5 Sonnet and the result seems reasonably similar to the author's handwritten summary.

```
This article examines the governance of Dutch pension funds in light of the Future of Pensions Act (Wtp). The new legislation shifts towards more complete pension contracts and emphasizes operational execution, necessitating changes in pension fund governance. The authors propose strengthening pension funds' internal organization, improving accountability to participants, and enhancing the powers of participant representation bodies. Key recommendations include establishing a recognizable governance structure with clear responsibilities, creating a College of Stakeholders (CvB) to replace existing accountability bodies, and granting the CvB more authority, including appointment and dismissal powers. The proposals aim to balance the interests of social partners, pension funds, and participants while ensuring transparency and effective oversight. The article emphasizes principles such as transparency, trust, loyalty, and prudence in shaping governance reforms. It also discusses the impact of digitalization (DORA), the need for pension funds to demonstrate value, and the potential for further consolidation in the sector. International perspectives, including insights from the World Bank, inform the proposed governance improvements. These changes are designed to help pension funds adapt to the new system, manage risks effectively, and maintain their "license to operate" in a changing landscape.
```

Similarly, the second article's summary also captures the key points that the author points out (emphasis mine).

```
The article "Regulating pensions: Why the European Union matters" explores the growing influence of EU law on pension regulation. While Member States retain primary responsibility for pension provision, the authors argue that EU law significantly impacts national pension systems through both direct and indirect means.

The paper begins by examining the EU's institutional framework regarding pensions, focusing on the principles of subsidiarity and the division of powers between the EU and Member States. It emphasizes that the EU can regulate pension matters when the Internal Market's functioning is at stake, despite lacking specific regulatory competencies for pensions. The authors note that the subsidiarity principle has not proven to be an obstacle for EU action in this area.

The article then delves into EU substantive law and its impact on pensions, concentrating on the concept of Services of General Economic Interest (SGEI) and its role in classifying pension fund activities as economic or non-economic. The authors discuss the case law of the Court of Justice of the European Union (CJEU), highlighting its importance in determining when pension schemes fall within the scope of EU competition law. They emphasize that the CJEU's approach is based on the degree of solidarity in the scheme and the extent of state control.

** The paper examines the IORP Directive, outlining its current scope and limitations. The authors argue that the directive is unclear and leads to distortions in the internal market, particularly regarding the treatment of pay-as-you-go schemes and book reserves. They propose a new regulatory framework that distinguishes between economic and non-economic pension activities. For non-economic activities, the authors suggest a soft law approach using a non-binding code or communication from the European Commission. This would outline the basic features of pension schemes based on solidarity and the conditions for exemption from EU competition rules. For economic activities, they propose a hard law approach following the Lamfalussy technique, which would provide detailed regulations similar to the Solvency II regime but tailored to the specifics of IORPs (Institutions for Occupational Retirement Provision). **

The authors conclude that it's impossible to categorically state whether pensions are a national or EU competence, as decisions must be made on a case-by-case basis. They emphasize the importance of considering EU law when drafting national pension legislation and highlight the need for clarity in the division of powers between the EU and Member States regarding pensions.

Overall, the paper underscores the complex interplay between EU law and national pension systems, calling for a more nuanced understanding of the EU's role in pension regulation and a clearer regulatory framework that respects both EU and national competencies.
```

I'd bet that the author used GPT-3.5-turbo (aka the free version of ChatGPT) and did not give any particular prompting help. To create these, I asked Claude to create a prompt for summarization with chain-of-thought revision, used that prompt, and returned the result. Better models with a little more inference-time compute go a long way.
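The actual prompt isn't shown above, but a rough reconstruction of what a "summarize, then revise with chain of thought" prompt might look like; a hypothetical sketch, not the prompt huac used:

```python
COT_SUMMARY_PROMPT = """\
Summarize the document below in three steps.
Step 1: List every distinct claim, proposal, and conclusion the authors make.
Step 2: Re-read the document and note anything your list missed or misstated.
Step 3: Using the corrected list, write a 200-300 word summary that keeps the
authors' own emphasis. Clearly label the three steps in your output.

Document:
{document}
"""
```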

By @elorant - 9 months
I have had similar experiences with pretty much every model. Even heuristics seem to work better than LLMs on summarization.
By @stiltzkin - 9 months
This is not my experience with GPT-4, and this post will be dated when GPT-5 comes out.

Don't get me started with Sonnet 3.5

By @z7 - 9 months
"The author critiques ChatGPT's ability to summarize accurately, arguing that it merely shortens text without true understanding, resulting in incomplete and sometimes misleading summaries, as demonstrated by a comparison between their own summary of a complex pension fund governance paper and the flawed version produced by ChatGPT." (GPT-4o)
By @localfirst - 9 months
Realizing lately that treating LLMs as anything but a toy is dangerous.

AI winter is going to be brutal

By @bigtex - 9 months
Why does every comment on the post seem like it was written by an LLM?
By @ChrisArchitect - 9 months
Related:

ChatGPT Isn't 'Hallucinating'–It's Bullshitting

https://news.ycombinator.com/item?id=40997893

By @henrypoydar - 9 months
highlights > summary