
MAMLMs Still Epic Fail Open‑Book, Closed‑World, Finite‑List, Obvious Ground Truth Tasks


I count this as a failure nine different successive times. Here there really is a unique, well‑defined answer, and the machine has every chance to uncover it: the Hedge Knight Ashford Meadow line‑up. This avoids the usual escape hatches about “the data might be ambiguous” or “this is a hard open question.” We get not one isolated “hallucination” but rather a hallucination cascade.

As Noah Smith warns everyone: if you don’t like posts about “AI”—then stay off the internet for the next, I don’t know, five years or so. That applies here as well as anywhere else.

BE WARNED!!


Today’s MAMLM post is triggered by the extremely sharp John Quiggin’s:

John Quiggin: LLMs Reaching the Tipping Point <https://johnquigginblog.substack.com/p/llms-reaching-the-tipping-point>: ‘[Before] Anthropic’s Claude Code late last year… there was a fair bit of disillusionment about “vibe coding” and… [its] buggy output… as well as… general disaffection… with… AI slop…. But lately, the tone has changed radically…. Even [the] previously sceptical… have concluded… Claude represent[s] the future…. OpenAI Deep Research… [has had] mixed results until recently. But now I’m perceiving the same kind of change… lengthy interactions… as if I were talking to an intelligent and exceptionally well-informed human… keeps track… seems less prone to fabrications… [and] this kind of interaction is fun…. DR is moving from being an enthusiastic but unreliable research assistant to something more like a well-read junior co-author. The ideas are still mine, but I can rely on DR to provide discussion and critique as well as routine stuff like literature summaries…’

I confess I have not been doing enough coding this winter to gain a sense of any sea-change. But I will say that I am not seeing it in other realms—specifically, in ChatGPT and its ilk as an alternative to standard search engines. The advantages are (a) a natural-language interface, (b) a system that has not (yet) been turned fully up the wazoo to sell ads, plus (c) the SEO spammers have not yet descended to—I really do not like the vibe of “en****tify”—full commodification cannibalization. Yet I found it could not do the job when, yesterday, I asked it for a list of the fighters-for-good in Episode 5 of HBO’s “A Knight of the Seven Kingdoms”. That ought to have been well within its capabilities. And yet:

Who were the seven on Duncan’s side in the Trial of Seven?

You

...
Read full article on DeLong's Grasping Reality →