When ChatGPT summarises, it does nothing of the kind
The article critiques ChatGPT's summarization limitations, citing a failed attempt to summarize a 50-page paper accurately. It questions the reliability of large language models for business applications due to inaccuracies.
The article discusses the limitations of ChatGPT in providing accurate summaries. The author shares their experience of attempting to use ChatGPT to summarize a 50-page public paper on pension funds, only to find that the AI-generated summary lacked crucial details and key proposals present in the original text. The author highlights that ChatGPT's summarization process involves shortening the text rather than truly understanding and summarizing the content. They explain how the AI's parameters and training data heavily influence the generated summaries, often leading to inaccuracies and omissions. The author also compares ChatGPT's summarization with another AI tool, Gemini, noting similar issues with both tools in producing concise and accurate summaries. Ultimately, the article questions the reliability of using large language models (LLMs) like ChatGPT for real business applications due to their limitations in providing comprehensive and precise summaries of complex subjects.
Related
Our guidance on using LLMs (for technical writing)
The Ritza Handbook advises on using GPT and GenAI models for writing, highlighting benefits like code samples and overcoming writer's block. However, caution is urged against using GPT-generated text in published articles.
How Good Is ChatGPT at Coding, Really?
A study in IEEE evaluated ChatGPT's coding performance, showing success rates from 0.66% to 89%. ChatGPT excelled in older tasks but struggled with newer challenges, highlighting strengths and vulnerabilities.
GenAI does not Think nor Understand
GenAI excels in language processing but struggles with logic-based tasks. An example reveals inconsistencies, prompting caution in relying on it. PartyRock is recommended for testing language models effectively.
Can ChatGPT do data science?
A study led by Bhavya Chopra at Microsoft, with contributions from Ananya Singha and Sumit Gulwani, explored ChatGPT's challenges in data science tasks. Strategies included prompting techniques and leveraging domain expertise for better interactions.
ChatGPT Isn't 'Hallucinating'–It's Bullshitting – Scientific American
AI chatbots like ChatGPT can generate false information, termed as "bullshitting" by authors to clarify responsibility and prevent misconceptions. Accurate terminology is crucial for understanding AI technology's impact.
The author states his conclusions but doesn't give the reader the information required to examine the problem.
- Whether the article to be summarized fits into the tested GPT model's context size
- The prompt
- The number of attempts
- Which information in the summary, specifically, is missing or wrong (he doesn't always state this)
For example: "I first tried to let ChatGPT one of my key posts (...). ChatGPT made a total mess of it. What it said had little to do with the original post, and where it did, it said the opposite of what the post said." He doesn't say which statements of the original article were reproduced falsely by ChatGPT.
My experience is that ChatGPT 4 is good when summarizing articles, and extremely helpful when I need to shorten my own writing. Recently I had to write a grant application with a strict size limit of 10 pages, and ChatGPT 4 helped me a lot by skillfully condensing my chapters into shorter texts. The model's understanding of the (rather niche) topic was very good. I never fed it more than about two pages of text at once. It also adopted my style of writing to a sufficient degree. A hypothetical human who'd have to help on short notice probably would have needed a whole stressful day to do comparable work.
A summary for which you must always read the un-summarized text is useless as a summary. This should be obvious to literally everyone, yet AI developers stick their heads in the sand about it because RAG lets them pretend AI is more useful than it actually is.
RAG is useless, just fucking let it go and have AI stay in its lane.
I would suggest a less strong but more plausible claim: GPT-4o has trouble summarizing longer-form content that falls outside the bounds of its context window, or something like a lossier attention mechanism is being used as a compromise on resource usage.
Summary:
https://chatgpt.com/share/21d81811-db45-4ac5-b3c7-b25a79b2ba...
I work for the client side and this bothers me a lot. It's very hard to get a true honest value analysis done with all the sales influence and office politics going on.
https://chatgpt.com/share/d5709aeb-d24c-488b-985c-c13eba0c01...
"4. IORP Directive: The IORP (Institutions for Occupational Retirement Provision) Directive is analyzed, highlighting its scope and its impact on pension funds across the EU. The paper suggests that the directive's complex regulations create inconsistencies and may need clarification or adjustment to better align with national policies." "5. Regulatory Framework and Proposals: A significant portion of the paper is devoted to discussing potential reforms to the regulatory framework governing pensions in the EU. It proposes a dual approach: a "soft law" code for non-economic pension services and a "hard law" legislative framework for economic activities. This proposal aims to clarify and streamline EU and national regulations on pensions."
^^ These correspond to the author's two self-selected main points.
I have been working on summarizing new papers using Gemini for the same purpose. I don't ask for a summary, though; I ask for the story the paper is trying to tell (with different sections) and get great output. Not sharing the links here, because it would be self-promotion.
It's not a problem when you are aware of it, and with some follow-up input you can mitigate it, but I often see people take the first output of these systems at face value. People should be a bit more critical in that regard.
Is a system prompt "provide a summary of this text" the best possible prompt? Do different models respond differently to prompts like that? At what point should you attempt more advanced tricks, like having one prompt extract key ideas and a second prompt summarize those?
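For what it's worth, a minimal sketch of that two-pass idea, assuming the OpenAI Python client; the model name, prompt wording, and helper names are illustrative choices, not what the article tested:

```python
# Two-pass summarization sketch: extract key ideas first, then summarize them.
# Assumes the official OpenAI Python client; the model name and prompt wording
# are illustrative, not a recommendation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def two_pass_summary(text: str) -> str:
    # Pass 1: pull out the key claims and proposals rather than a shortened paraphrase.
    key_points = ask(
        "List the key claims, proposals, and conclusions in the following text "
        "as short bullet points. Do not drop anything that is mentioned only once:\n\n"
        + text
    )
    # Pass 2: write the summary from the extracted points, not from the raw text.
    return ask(
        "Write a concise summary of a document based only on these extracted key points:\n\n"
        + key_points
    )
```

The point of the second pass is only that it never sees the raw text, so it cannot fall back to merely shortening it.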
Products like ChatGPT are rewarded in their fine tuning for doing happy sounding cheerleading of whatever bland unsophisticated corpo docs you throw at it. Consumer products like this simply aren't designed for novelty, although there's plenty of AIs that are. For example, AlphaFold is something that's designed to search through an information space and discover novel stuff that's good.
ChatGPT is something that's designed to ingratiate itself with emotional individuals using a flawed language that precludes rational thinking. That's the problem with the English language: it's the lowest common denominator. Any specialized field that wants to make genuine progress, like programming or the natural sciences, has historically done so by inventing a new language, e.g. jargon or programming languages.
The only time normal language is able to communicate divergent ideas is when the speaker has high social status. When someone who doesn't have high social status communicates something novel, we call it crazy. LLMs, being robots, have very low social status. That's why they're trained to act the way they do.
However, it still managed to pick up several clickbait headlines about NASA’s asteroid wargame and write a scare news summary:
Truth: https://www.space.com/dangerous-asteroid-international-coope...
> The participants — nearly 100 people from various U.S. federal agencies and international institutions — considered the following hypothetical scenario: Scientists just discovered a relatively large asteroid that appears to be on an Earth-impacting trajectory. There's a 72% chance it will hit our planet on July 12, 2038, along a lengthy corridor that includes major cities such as Dallas, Memphis, Madrid and Algiers.
Glancias Summary:
> NASA has identified a potential asteroid threat to Earth in 2038, revealing gaps in global preparedness despite technological advancements in asteroid trajectory redirection and the upcoming launch of the NEO Surveyor space telescope.
What prompt am I missing? I want it to find the edge-case details and other "mentioned only once" items, but I can't get it to highlight them.
I am very skeptical of the author's claims. Perhaps the parts of the articles being summarized are not actually important so the LLMs did not include them. Or perhaps the article does an exceptionally bad job of explaining why the argument is important. Also there's a difference between the API and free web interface. I think the web version has a bunch more system prompting to be helpful which may make a summary harder to do.
Giving an LLM too much context causes the same effect, as the sliding window moves on from the earliest tokens.
It's also why summarization is bad.
It's not exactly linear from the start to the end of the text, though: bits of context get lost throughout the input at random, and they will be different each time the same input is run.
A good way to mitigate this is to break up the text and execute in smaller chunks; even with models boasting large context windows, results drop off significantly with large inputs, so using several smaller prompts is better.
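As a rough illustration of what "smaller chunks" can mean in practice, here is a minimal sketch; splitting on paragraph boundaries and the word budget are arbitrary choices, not a rule:

```python
# Split a long document into chunks that stay well under the model's context budget.
# Splitting on paragraph boundaries and the 1500-word budget are arbitrary choices.
def split_into_chunks(text: str, max_words: int = 1500) -> list[str]:
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```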
Likely it added the content to the prompt, but that content didn't carry over into the next turn; the next prompt probably only had general web results as context.
In this case I would take a similar approach: split the document into multiple smaller (and overlapping) fragments, let an LLM summarize each of those into key findings, and in a final step merge those key findings into a summary.
I don't have a lot of experience, though, as to whether this actually provides better results.
[1] It's annoying to see Google initially market their Gemini models on their 100K to 1M token context sizes, and even OpenAI has recently been doing a lot of their model making and marketing around it too.
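A minimal sketch of that split-summarize-merge approach; the fragment size, overlap, and prompt wording are illustrative assumptions, and `ask` stands in for any text-in/text-out call to whatever model you use:

```python
# Split-summarize-merge sketch: summarize overlapping fragments, then merge the
# per-fragment findings. Fragment size, overlap, and prompts are illustrative.
from typing import Callable

def overlapping_fragments(text: str, size: int = 1200, overlap: int = 200) -> list[str]:
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def map_reduce_summary(text: str, ask: Callable[[str], str]) -> str:
    # Map step: extract key findings from each fragment independently.
    findings = [
        ask("Extract the key findings of this fragment as bullet points:\n\n" + frag)
        for frag in overlapping_fragments(text)
    ]
    # Reduce step: merge the per-fragment findings into one coherent summary.
    return ask(
        "Merge these per-fragment key findings into a single summary, removing duplicates:\n\n"
        + "\n\n".join(findings)
    )
```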
I've been having a surprisingly good time in my 'discussions' with the free online chatgpt, which has a cutoff date of 2022. What really impresses me is the results of persistence on my part when the replies are too canned, which can be astonishing.
In one discussion, I asked it to generate random sequences of 4 digits, 0000-9999, until a specific given number occurred. It would, as if pleased with its work, give the number within 10 tries. I suppose this is due to computational limitations that I don't understand. However, when, with great effort, I criticized its method and results enough, I got it to double its efforts before it lazily 'found' an occurrence of the given number. It claimed it was doing what I asked. It surely wasn't. But it seemed oblivious. I'm interested to understand this.
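For context on why a hit "within 10 tries" cannot have been genuine sampling: with uniform 4-digit draws, the chance of seeing one specific value in 10 tries is about 0.1%, and on average it takes 10,000 draws. A quick check (the target value and trial count below are arbitrary):

```python
# With genuinely uniform 4-digit draws, how likely is a hit within 10 tries,
# and how many draws does it take on average? (Target value is arbitrary.)
import random

p_within_10 = 1 - (9999 / 10000) ** 10
print(f"P(hit within 10 draws) = {p_within_10:.4f}")  # ~0.0010, i.e. about 0.1%

target = 1234
draws_needed = []
for _ in range(200):  # 200 simulated runs
    n = 1
    while random.randint(0, 9999) != target:
        n += 1
    draws_needed.append(n)
print("mean draws until first hit:", sum(draws_needed) / len(draws_needed))  # ~10,000
```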
I'm sure I'll get some contempt for my ignorance here, but I asked it to analyze pi to some unremembered number of places until it found a Fibonacci sequence. It couldn't. Maybe one doesn't exist. As obvious as this might be to smarter primates here, I don't understand. I was mostly entertaining myself with various off-the-hat things.
What I did realize is that, by my standards, there is fierce potential here. This has me wanting to, if even possible, acquire my own version with, perhaps, the possibility of real-time/internet interaction.
Is this possible without advanced coding ability? Is it possible at all? What would be a starting point, and what are some helpful pointers along the way?
Anyway, it reminded me of my youth, when I had access to a special person or two and would make them dizzy with my torrential questions. Kind of a magic pocket Randall Munroe, with spontaneous schizophrenia. Fun.
Edit note: those were but a couple of examples of many more that I cannot remember. I'm hooked now, though, and need to come out of my cave for this and learn more. I have some obsolete Python experience, if that might be relevant.
Of course, it's obvious in hindsight that creating a useful summary requires reasoning over the content, and given that reasoning is one of the major weaknesses of LLMs, it should have been obvious that their attempts to summarize would be surface-level "shortening" rather than something deeper that grokked the full text and summarized the key points.
Many people just use GPT 3.5 because it's free, not realizing how much it sucks in comparison to newer models.
Wasn't that kind of the previous classical AI method of doing summaries? Something something, rank sentences by the number of nouns that appear in other sentences to get the ones with the most information density and output the top N?
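Roughly, yes: classical extractive summarization scores each sentence by how much it overlaps with the rest of the document and returns the top N sentences verbatim. A deliberately naive sketch of that idea (plain word overlap rather than noun counts, so only an approximation of what the comment describes):

```python
# Naive extractive summarizer: score each sentence by how often its words occur
# elsewhere in the document and return the top-N sentences verbatim.
import re
from collections import Counter

def extractive_summary(text: str, n: int = 3) -> str:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    doc_counts = Counter(w.lower() for w in re.findall(r"\w+", text))

    def score(sentence: str) -> float:
        words = [w.lower() for w in re.findall(r"\w+", sentence)]
        # Sentences whose words are frequent across the document score higher.
        return sum(doc_counts[w] for w in set(words)) / (len(words) or 1)

    top = set(sorted(sentences, key=score, reverse=True)[:n])
    # Output the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)
```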
They all fail quite spectacularly at this, at least for my use case (cel shading, outlines, height sensitive fog)
```
This article examines the governance of Dutch pension funds in light of the Future of Pensions Act (Wtp). The new legislation shifts towards more complete pension contracts and emphasizes operational execution, necessitating changes in pension fund governance. The authors propose strengthening pension funds' internal organization, improving accountability to participants, and enhancing the powers of participant representation bodies.

Key recommendations include establishing a recognizable governance structure with clear responsibilities, creating a College of Stakeholders (CvB) to replace existing accountability bodies, and granting the CvB more authority, including appointment and dismissal powers. The proposals aim to balance the interests of social partners, pension funds, and participants while ensuring transparency and effective oversight. The article emphasizes principles such as transparency, trust, loyalty, and prudence in shaping governance reforms.

It also discusses the impact of digitalization (DORA), the need for pension funds to demonstrate value, and the potential for further consolidation in the sector. International perspectives, including insights from the World Bank, inform the proposed governance improvements. These changes are designed to help pension funds adapt to the new system, manage risks effectively, and maintain their "license to operate" in a changing landscape.
```
Similarly, the second article's summary also captures the key points that the author points out (emphasis mine).
```
The article "Regulating pensions: Why the European Union matters" explores the growing influence of EU law on pension regulation. While Member States retain primary responsibility for pension provision, the authors argue that EU law significantly impacts national pension systems through both direct and indirect means.

The paper begins by examining the EU's institutional framework regarding pensions, focusing on the principles of subsidiarity and the division of powers between the EU and Member States. It emphasizes that the EU can regulate pension matters when the Internal Market's functioning is at stake, despite lacking specific regulatory competencies for pensions. The authors note that the subsidiarity principle has not proven to be an obstacle for EU action in this area.

The article then delves into EU substantive law and its impact on pensions, concentrating on the concept of Services of General Economic Interest (SGEI) and its role in classifying pension fund activities as economic or non-economic. The authors discuss the case law of the Court of Justice of the European Union (CJEU), highlighting its importance in determining when pension schemes fall within the scope of EU competition law. They emphasize that the CJEU's approach is based on the degree of solidarity in the scheme and the extent of state control.

** The paper examines the IORP Directive, outlining its current scope and limitations. The authors argue that the directive is unclear and leads to distortions in the internal market, particularly regarding the treatment of pay-as-you-go schemes and book reserves. They propose a new regulatory framework that distinguishes between economic and non-economic pension activities. For non-economic activities, the authors suggest a soft law approach using a non-binding code or communication from the European Commission. This would outline the basic features of pension schemes based on solidarity and the conditions for exemption from EU competition rules. For economic activities, they propose a hard law approach following the Lamfalussy technique, which would provide detailed regulations similar to the Solvency II regime but tailored to the specifics of IORPs (Institutions for Occupational Retirement Provision). **

The authors conclude that it's impossible to categorically state whether pensions are a national or EU competence, as decisions must be made on a case-by-case basis. They emphasize the importance of considering EU law when drafting national pension legislation and highlight the need for clarity in the division of powers between the EU and Member States regarding pensions.

Overall, the paper underscores the complex interplay between EU law and national pension systems, calling for a more nuanced understanding of the EU's role in pension regulation and a clearer regulatory framework that respects both EU and national competencies.
```
I'd bet that the author used GPT 3.5-turbo (aka the free version of ChatGPT) and did not give any particular prompting help. To create these, I asked Claude to create a prompt for summarization with chain of thought revision, used that prompt, and returned the result. Better models with a little bit more inference time compute go a long way.
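The exact prompt isn't shared, but a summarization prompt with a built-in revision step typically looks something like this sketch (the wording is mine, not the commenter's or Claude's):

```python
# A hypothetical draft-then-revise summarization prompt, in the spirit of the
# "chain of thought revision" the commenter describes; the wording is illustrative.
SUMMARIZE_WITH_REVISION = """\
Summarize the document below in three steps.
1. List every distinct claim, proposal, and conclusion as bullet points,
   including points that are mentioned only once.
2. Draft a summary from those bullet points.
3. Check the draft against the bullet list, note anything missing or
   misstated, and output only the corrected final summary.

Document:
{document}
"""
```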
Don't get me started with Sonnet 3.5
AI winter is going to be brutal