Artificial Intelligence. The “artificial” part is clear, but the “intelligence” part is endlessly debatable. We’ve pondered it for as long as the AI field has existed, and perhaps for a couple of centuries before that. These questions now have a newly practical urgency. We debate whether we have already created artificial intelligence and, if we have, how to rate it against natural intelligence.
Sometimes that question becomes less about the capabilities of machines and more about how we define intelligence, with definitions growing ever stricter as machines become more capable [1]. To the question of “Can machines ever be intelligent?” some people will always say “No”, categorically, by defining intelligence as an inherently human behavior. If a Matrix or Terminator dystopia ever came to pass, some people would stubbornly assert the non-intelligence of the Architect or of Skynet. That’s an interesting philosophical argument, but not always a useful one. Especially not when you’re busy running or hiding from unwavering killing machines.
Likewise, “Can machines ever create art?” is a question that could always be answered with a “No”, especially if one defines art to require human thought.1 An interesting argument, but again not a very useful one.
One can instead ask “Can machines ever be capable?”, which requires more linguistic gymnastics to come around to a “No”. Machines can be, and already are, incredibly capable. Are they capable of producing outputs that are inherently logical and express reason? Yes, as long as one doesn’t tautologically define those outputs as illogical because they don’t come directly from human logic, or as lacking reason because they don’t come directly from human reasoning.
If we can describe a piece of text as inherently logical or illogical, reasonable or unreasonable, independent of who its creator is, then machines are not only capable of producing logical and reasonable output, they already often do, and you can prove it to yourself right now through any of several free LLM platforms.
Machines are capable of creating reasonable text. But how?
Stochastic parrots: A useful metaphor
“Stochastic parrot” is a lovely turn of phrase. The term is so ingrained now that I’m surprised to realize it was only invented a few years ago [3]. Like a parrot that hears speech and can echo fragments back without any comprehension, LLMs can ingest text during training that they repeat back when summoned. That is the parroting part.2 Stochastic here means probabilistic: the parroting follows “probabilistic information about how [sequences of linguistic forms] combine”.
There’s something to this metaphor. The way GPT-style LLMs are trained is to design a structure of matrix multiplications and other mathematical operations, and then to fit the weights of those operations (i.e., the numbers in those matrices) to optimize for next-word prediction on a very large corpus of text.3 With each iteration, the weights change so that the LLM’s output is a little more accurate. The LLM is a function that takes a sequence of text and produces probabilities for every word that could follow.4
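To make that fitting process concrete, here is a minimal sketch of one training step in PyTorch; the names model, optimizer, and tokens are illustrative placeholders under the assumption of a standard decoder-style setup, not any particular lab’s implementation.

```python
import torch.nn.functional as F

def train_step(model, optimizer, tokens):
    # tokens: integer tensor of shape (batch, sequence) drawn from the corpus
    inputs, targets = tokens[:, :-1], tokens[:, 1:]      # each position must predict the next token
    logits = model(inputs)                               # assumed output shape: (batch, sequence-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))          # how surprised the model was by the real text
    optimizer.zero_grad()
    loss.backward()                                      # nudge every weight...
    optimizer.step()                                     # ...toward slightly better next-word predictions
    return loss.item()
```

Repeating this step across a very large corpus is, mechanically, what the fitting described above amounts to.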
When you use ChatGPT or another similar LLM, it generates one token (roughly a word or a piece of a word) at a time, considering all the past text. The generation step typically uses some randomness so that it doesn’t always pick the most likely next token, sometimes choosing other plausible tokens instead. Then the math is repeated, now with that new token as part of the input. There isn’t any inherent logic or thinking beyond that; it’s just mathematical operations being repeated over and over on continually adjusted inputs. The weights of the LLM are fixed, and the output changes only because the input changes with each newly appended word.
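A sketch of that generation loop, under the same placeholder assumptions as the training sketch above: score the next token, sample with some randomness, append, and repeat.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, tokens, max_new_tokens=50, temperature=0.8):
    # tokens: integer tensor of shape (1, sequence) holding the conversation so far
    for _ in range(max_new_tokens):
        logits = model(tokens)[:, -1, :]                      # scores for the next token only
        probs = F.softmax(logits / temperature, dim=-1)       # temperature controls how much randomness
        next_token = torch.multinomial(probs, num_samples=1)  # sometimes picks a less likely token
        tokens = torch.cat([tokens, next_token], dim=1)       # append, then run the same fixed math again
    return tokens
```

Note that nothing in the loop touches the weights; the only thing that evolves is the growing sequence of tokens.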
Some interfaces, like OpenAI’s o1 model, have a loop with instructions around the model [4]. Even when we prompt an LLM with instructions to lay out its thought process, those instructions are just more inputs fed into the same fixed function. The instructions help the LLM only because the LLM is so very good at next-word prediction that it can find suitable words to follow those instructions. Notably, LLMs these days are specifically trained on corpuses that include command-and-response text.
LLMs don’t succeed by directly memorizing every bit of input given to them. They don’t directly maintain a giant probability table. Instead, they are essentially flexible compression algorithms, tuning their weights such that they become very good—but not necessarily exact—at predicting their input corpus.
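One way to make the compression framing concrete: a model could, via arithmetic coding, store each token of its corpus in roughly -log2(p) bits, where p is the probability it assigned to the token that actually occurred, so lower prediction loss literally means a smaller compressed corpus. A toy illustration (the function is mine, for exposition only):

```python
import math

def compressed_size_bits(token_probabilities):
    """token_probabilities: the model's probability for each token that actually occurred."""
    return sum(-math.log2(p) for p in token_probabilities)

# A model that assigns probability 0.25 to each of four tokens needs about
# 2 bits per token, i.e. roughly 8 bits to encode the sequence.
print(compressed_size_bits([0.25, 0.25, 0.25, 0.25]))  # -> 8.0
```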
That the process is entirely mechanical does not necessarily mean that machines cannot think. A similar critique applies to humans, as long as you believe in a fundamentally physical process of human thought (physicalism, materialism, or biological naturalism), as opposed to a spiritual or dualist model of human intelligence. Our thinking all happens in our brains, from processes that can often be compared to, and which obviously influenced, the structures in LLMs [5, 6].
Our human intelligence comes from observing the world through our senses, in combination with our DNA-based programming. There is a clear parallel between LLMs being trained to produce text from a written corpus and humans learning speech by mimicking our parents and other people around us. Yet we are clearly much more than stochastic parrots. We know that our words come from an intelligence that is so much more complete.
The stochastic parrot term is a wonderfully visual metaphor, letting us succinctly rephrase our intelligence question: if humans can be more than stochastic parrots who regurgitate our own training data, can machines too?
Transcending the metaphor
To parrot is to echo back. LLMs can certainly parrot, regurgitating entire paragraphs and pages from books. Yet they can also generate novel text, words and phrases that have never been written before [7]. This is something well beyond parroting. This is creation.
The extent to which AIs can succeed on examples not in their training data is called generalization. When we write of an AI being more or less general, we mean how distant and varied the tasks it excels at can be.
Generalization is the crux of the debate. There is some level of it that requires a deeper ability to pull from different patterns and merge them with some reliable logic. If an AI can use such abilities to perform well on new tasks that are sufficiently different from anything in its training data, then that doesn’t sound categorically different from or inferior to human intelligence.
Do I do anything truly deeper, when asked to share an opinion on some question that is new to me? I may pull from ideas and turns of phrase that stuck in my memory, combining arguments I’ve heard somewhere before or constructing new adaptations of those. Personally, I don’t even have a good memory for quotes or statistics, and instead I usually learn something only in approximate terms or abstractions. Most of us have limited and imprecise memory and adapt to use it effectively. Even our memories of our own experienced past change over time. It’s not clear to me whether I “hallucinate” incorrect statements less often than LLMs do.
The degree of generalization matters. A human baby can at first only parrot, while a human adult can do much more. Likewise, it wasn’t wrong of Bender et al. (2021) to publish the stochastic parrot concept in March 2021 in reference to LLMs of the time. But two years later Ted Chiang was a bit shortsighted in describing his own version of the metaphor, the “blurry JPEG”, comparing ChatGPT to the compression of Xerox photocopiers [8].
While that article was indeed the “useful corrective to the tendency to anthropomorphize large language models” that it aimed to be, it severely understated the generalization of contemporaneous LLMs [9, 10]. Their learned relationships, the ways they “identify statistical regularities in text”, were not nearly as limited as Chiang’s examples of Xerox’s compression, the simple interpolation between neighboring pixels, and the “blur” tool in image programs. We need to have a bit more imagination to understand current AI capabilities and look ahead to future ones.
The subsumption conjecture
Just because an LLM’s task—next-word prediction—is so straightforward does not mean that its methods must be straightforward too. When humans speak or write, we also usually do it one word at a time, and that doesn’t mean we have only a superficial understanding behind picking our words.
Thus what I’m calling the subsumption conjecture: AIs can learn conceptually distinct skills to aid in learning their target skill.5 These skills are not necessarily simpler building blocks for that target skill. Rather, the AI can subsume difficult and complex skills as long as those skills provide some benefit to reducing training loss. I want to suggest the possibility that one such subsumed skill is reasoning: that leading LLMs can track assumptions, claims, and states, and progress through logical steps, with internal mechanics that are very abstracted from the language representations of those ideas.6
First, let’s consider a simpler and less orthogonal case of subsumption, one that even predates the transformer architecture: a model learning general language principles in order to be effective at translation. As far back as the pre-transformer days of 2016, researchers at Google demonstrated accurate language translation even between language pairs that weren’t represented in their corpus [11]. They analyzed their model and claimed that it created its own “shared semantic representations”, or “interlingua”, across languages. Notably, their multilingual model was much smaller than their individual language-pair models yet performed as well or better (“It is remarkable that a single model with 255M parameters can do what 12 models with a total of 3B parameters would have done”). More diverse tasks created better generalization. The model had to learn generalizations about language, patterns that were relevant beyond individual language pairs.
That was a small model by the standards of today, and that’s a smaller contrast between skills than the one I expect modern LLMs exhibit. It’s now clearer than ever that LLMs learn a deep understanding of English and of human language more generally, and that they can interpret and create text across various forms even when very far removed from anything in their corpus.
Reasoning—applying logic to draw accurate conclusions from information—is a more distinct skill. Reasoning isn’t used in all applications of language. Conversely, while spoken languages are a powerful medium for reasoning, they aren’t a strict prerequisite for it: we can also reason through propositional logic and other notation. I suspect that modern LLMs have portions of their internal layers that represent placeholders for abstract concepts, their logical relationships, and hypothetical relationships given conditionals. Neural computations between those layers could perform logical operations on those concepts. These parts of the network might be fairly cohesive, deeply abstracted away from language itself, and often sit in distinct portions of the network with less connection to the language parts. Likely not entirely distinct, given the connectedness in typical neural network architectures, but distinct enough that—if we understood the network internals fully—we could create well-formed clusters.
Much of an LLM’s corpus consists of conversations, information exchange, questions, arguments, essays, quoted arguments and text that refutes them, and other text that is inherently about reasoning through ideas. That share can only be increasing, given the conversation sessions that LLMs and their users generate together. With such a corpus, to be very good at next-word prediction an AI may need to be good at reasoning.
LLMs may even need to learn their own place with reference to the humans they interact with, a necessary self-awareness that would help with next-word prediction on LLM conversational corpuses. This self-awareness and state with respect to their user—quite similar to many definitions of consciousness—would be another subsumed skill.
In the extreme case, to interact with humans truly accurately, an AI may need to build up a model of humans and how they behave. It may have to learn an understanding of its creator, and of its place with respect to that creator, to excel at its task. All of that is possible even when “learning” just means picking weights in a pre-defined but sufficiently complex series of mathematical operations.
Keeping an open mind about machine intelligence
Some descriptions of LLM abilities start from very rudimentary examples of how to do next-word prediction and then describe LLMs as having added a bit more complication and memorization to the same methods. Like starting with single-word (Markovian) probability tables and extending them out to much longer word sequences. I expect that, past some level of ability, this method is not nearly sufficient and is not all that LLMs do. Instead, they have to learn more abstract abilities, ones that might even be inherently harder or more useful, in order to succeed at next-word prediction. The LLM subsumes these other abilities.
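For concreteness, here is what that rudimentary baseline, a literal single-word (first-order Markov) probability table, looks like; the example text is mine.

```python
from collections import Counter, defaultdict

def build_markov_table(words):
    # Count, for each word, how often each next word follows it...
    counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1
    # ...then normalize the counts into probabilities.
    return {
        word: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for word, nexts in counts.items()
    }

table = build_markov_table("the cat sat on the mat".split())
print(table["the"])  # -> {'cat': 0.5, 'mat': 0.5}
```

Extending the lookup key to longer and longer word sequences makes the table astronomically large while still saying nothing about contexts it has never seen; that gap is what the more abstract abilities would have to fill.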
Strong AI performance on reasoning benchmarks and on transfer learning tasks suggests to me that AIs have already become general purpose reasoners, and not just at a superficial level of blurring memorized text together.
If this is true, then it foretells an exciting path towards Artificial General Intelligence (AGI) that meets or exceeds human intelligence. There are a few other attributes and architectural advancements that I expect would help or even be necessary, but let’s set those aside for a later investigation. Eventually models will have enough of their networks devoted to reasoning and pattern matching across their inputs and knowledge that they outperform humans as reasoners, to go along with their deeper memories and faster computation speed.
We might still debate whether that counts as thinking, but that could be a less practical question when we’re faced with an intelligence that is more capable than ourselves.
[1] Alexander, Scott (2024). Sakana, Strawberry, and Scary AI. https://www.astralcodexten.com/p/sakana-strawberry-and-scary-ai.
[2] Chiang, Ted (2024). Why A.I. isn’t going to make art. https://www.newyorker.com/culture/the-weekend-essay/why-ai-isnt-going-to-make-art.
[3] Bender, E.M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency.
[4] OpenAI (2024). Learning to Reason with LLMs. https://openai.com/index/learning-to-reason-with-llms/.
[5] Bengio, Y., LeCun, Y., & Hinton, G. (2021). Deep learning for AI. Communications of the ACM, 64(7), 58–65. https://doi.org/10.1145/3448250.
[6] Goldstein, A., Zada, Z., Buchnik, E., Schain, M., Price, A., Aubrey, B., ... & Hasson, U. (2022). Shared computational principles for language processing in humans and deep language models. Nature Neuroscience, 25(3), 369–380.
[7] Wilson, M., Petty, J., & Frank, R. (2023). How Abstract Is Linguistic Generalization in Large Language Models? Experiments with Argument Structure. Transactions of the Association for Computational Linguistics, 11, 1377-1395.
[8] Chiang, Ted (2023). ChatGPT is a blurry JPEG of the web. https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web.
[9] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language Models are Few-Shot Learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS '20).
[10] Kocijan, V., Davis, E., Lukasiewicz, T., Marcus, G., & Morgenstern, L. (2022). The Defeat of the Winograd Schema Challenge. arXiv preprint arXiv:2201.02387.
[11] Johnson, M., Schuster, M., Le, Q.V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F.B., Wattenberg, M., Corrado, G.S., Hughes, M., & Dean, J. (2016). Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics, 5, 339-351.
1. To summarize Ted Chiang’s argument: he says “art is something that results from making a lot of choices”, and then mostly talks about those choices as being the choices that the human user makes. Art requires choices, just as writing requires effort such that “worthwhile work cannot be made without it”. To the extent that the LLM makes choices, it’s to “take an average of the choices that other writers have made”. He claims that LLMs are not writers, and “not even a user of language”. Since AIs are not intelligent, are not communicating, and are not making choices, the choices and effort come from humans, who do so little of it for AI art that it isn’t art [2].
2. Notably, in that paper, LLMs cannot be coherent because they do not communicate as humans do. Their communication “does not have meaning”, and as such, human comprehension of LLM communication is merely an “illusion” [3].
3. It’s actually next-token prediction, not next-word prediction, but in this essay I’ll often describe it in terms of words because the distinction doesn’t affect any of my arguments. In current common tokenizers, some common words will be their own tokens, and common word fragments will often be tokens too. You can play with OpenAI’s tokenizers here.
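For a hands-on look at the word/token distinction, here is a small example using OpenAI’s open-source tiktoken library (my choice of library and encoding name, not something specified above):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the encoding used by GPT-4-era models
ids = enc.encode("Stochastic parrots are surprisingly capable.")
print(ids)                                   # a list of integer token ids
print([enc.decode([i]) for i in ids])        # common words come back as single tokens;
                                             # rarer words split into sub-word fragments
```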
4. This decoder-only text completion architecture is, of course, far from the only form of LLM or of AI. I will mostly limit my discussion here to these types of models, because they already have sufficient expressiveness to illustrate my points. I am also well aware that effective next-word prediction is a consequence of token masking rather than a more literal label.
5. I can’t imagine I’m the first person to think of this, but I haven’t found a source that posits something quite along these lines. If you have one, please send it my way.
6. I’m going to call all of this “reasoning”, but others could use a different word or break it down into a variety of concepts. In this essay I won’t ruminate on distinctions between reasoning and pattern recognition, planning, problem solving, contextual understanding, world knowledge, logic, mathematical skill, inference, deduction, concept learning, or other ideas.