DESCRIPTION: Memes and mirrors.

Nowadays things are moving way too fast. It is not just controlled trial-and-error, it is literally throwing everything at the wall (to see what sticks).

It started with that meme “Attention Is All You Need”, when they just came up with an “architecture” that stuck.

That “attention” and “multi-head attention” turned out to be just a few additional layers of a particular kind.
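To make “just a few additional layers of a particular kind” concrete, here is a minimal sketch of single-head scaled dot-product attention in plain numpy. It shows only the “what” the layer computes (softmax(QKᵀ/√d)·V, with the learned projections, masking and multiple heads of the real thing left out); the “why” is exactly the part that gets hand-waved:

    import numpy as np

    def softmax(x, axis=-1):
        # numerically stable softmax along the given axis
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)   # how much each query "attends" to each key
        return softmax(scores) @ V      # weighted sum of the values

    # toy example: 3 tokens, 4-dimensional embeddings, self-attention (Q = K = V)
    x = np.random.randn(3, 4)
    print(attention(x, x, x))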

No one can explain the actual mechanism of exactly how, or even why, the layers are the way they are (abstract bullshit aside). While the general idea was to mimic some specialized brain centers (the key to understanding how it works), the actual code was merely “buffers”.

This was the first and the most remarkable example of “mathematical hand-waving” – see, these layers, they do “attention”, because this or that talking head said so.

With this paper (the Deepseek R1 report) they use even more abstract terminology, such as “long Chain-of-Thought (CoT)” and even “the emerging reasoning capability” (which, of course, is an illusion on the part of the observer).

To see things as they really are there is no other way but to trace everything back to What Is (the first principles) and then to rebuild every concept, making sure there are no errors at each and every step. This, by the way, is the only universal theorem proving technique in existence (and it is way more general than just theorems).

So, a neural network which encodes a “model” is a structured data file that represents a network of abstract nodes linked to each other by connections or edges.

Each connection has an associated parameter, which is traditionally called a weight.

There are actually no connections being represented (a connection is an abstract notion), only weights – one per imaginary “connection”. It is convenient, however, to think of weights as paths (of a single step).

Every weight is just a real number, and it can be thought of as a weight proper, a thickness (of a link), a length or a distance (sort of), or just a frequency (of going through it). The less abstract notions are “how often this path has been taken” or “how steep the slope is at this particular segment”.
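As a data structure, such a “model file” is literally nothing more than arrays of these numbers, one per imaginary connection. A toy sketch (the names and the layout here are made up purely for illustration; real formats are just fancier containers for the same arrays):

    import numpy as np

    # a toy "model" for a 4 -> 3 -> 2 fully-connected network:
    # one number per imaginary "connection", plus biases, and nothing else
    model = {
        "layer1_weight": np.random.randn(3, 4),  # 12 "connections": 4 inputs -> 3 nodes
        "layer1_bias":   np.zeros(3),
        "layer2_weight": np.random.randn(2, 3),  # 6 "connections": 3 nodes -> 2 outputs
        "layer2_bias":   np.zeros(2),
    }

    # the whole "model file" is just these arrays serialized to disk
    np.savez("toy_model.npz", **model)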

Another useful notion, closer to What Is, is a numeric value of how well myelinated the connection is (this is from neurobiology).

So far everything is nice and pretty reasonable. This is, in a sense, a universal encoding (except that the brain encodes its high-level structure, and it is not even close to being uniform or to starting at random. Actually, the actual structure of the brain is what defines it; individual neurons change and get pruned out).

So, all the greatest memes (“to approximate any computable function” and “without being explicitly programmed”) still hold.

This is sort of a universal representation of a map of the territory, which, again and again, is NOT the territory itself.

The back-prop algorithm, based on gradient updates, is as good as biological myelination of the “most used” (oversimplifying) neurons, captured using math. Also cool: this is what the Nobel was for.
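In code, that mathematical “myelination” boils down to a one-line weight update: nudge each weight in the direction that reduces the error, in proportion to how much it contributed. A minimal sketch with a single toy “connection” (plain gradient descent on a squared error, leaving out the chain-rule bookkeeping that back-prop automates across layers):

    # one toy "connection": output = w * x, trained to map x = 2.0 to y = 6.0
    w, x, y_true = 0.5, 2.0, 6.0
    learning_rate = 0.1

    for step in range(50):
        y_pred = w * x                    # forward pass along this single path
        grad = 2 * (y_pred - y_true) * x  # d(error)/dw for the squared error
        w -= learning_rate * grad         # the blamed weight gets adjusted

    print(w)  # converges to ~3.0, since 3.0 * 2.0 == 6.0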

There are, however, lots of questions to ask, such as:

  • What does this emergent abstract structure actually represent?
  • How accurate is this abstract map with respect to the territory?
  • What was the territory?

etc.

Well, let's say that there are all (well, not really) possible (or potential) pathways between some abstract nodes, where each path can be selected, and at every “step forward” (this is a directed structure) one of the highest-scored (according to an abstract heuristic) connections can be taken.

An actual inference (forward) pass is not just a single path, but a sum-total of what has been selected at each step, according to the weights and current inputs.
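In code, that sum-total is nothing but repeated weighted sums pushed through a nonlinearity, layer by layer. A minimal sketch of one forward pass through a toy fully-connected network (the layer sizes and the ReLU nonlinearity are arbitrary choices, for illustration only):

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def forward(x, W1, b1, W2, b2):
        # each "step forward" is a weighted sum over all incoming connections,
        # pushed through a nonlinearity: the sum-total of many paths at once
        h = relu(W1 @ x + b1)
        return W2 @ h + b2

    x = np.random.randn(4)                       # current inputs
    W1, b1 = np.random.randn(3, 4), np.zeros(3)  # weights of the first "step"
    W2, b2 = np.random.randn(2, 3), np.zeros(2)  # weights of the second "step"
    print(forward(x, W1, b1, W2, b2))            # output depends on inputs *and* weights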

So, the whole thing is a function of the inputs with respect to the weights (parameters), except that it is, of course, not a function – the same inputs do not always imply the same output.

This is, of course, just the simplest fully-connected feed-forward network. No fancy architectures, and the inference algorithm just uses some “small randomness” to select among the possible paths.
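That “small randomness” is typically temperature sampling over the output scores: the highest-scored option is the most likely, but not the only, path taken, which is also why the same inputs do not always give the same output. A minimal sketch (assuming the scores are whatever the forward pass produced for the possible next steps):

    import numpy as np

    def sample_next(scores, temperature=0.8, rng=None):
        # turn scores into probabilities and draw one option at random:
        # low temperature -> almost always the top-scored path,
        # high temperature -> more wandering among the alternatives
        rng = rng or np.random.default_rng()
        p = np.exp(np.asarray(scores) / temperature)
        p /= p.sum()
        return rng.choice(len(scores), p=p)

    scores = np.array([2.0, 1.0, 0.1])  # e.g. scores over three possible next tokens
    print(sample_next(scores))          # usually 0, sometimes 1, occasionally 2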

The problem is that so-called “advanced architectures” and “sophisticated tuning” do not change the fact that there are no “reasoning capabilities” out there, and what they call “emergent reasoning” is just an illusion to an observer.

Let's see how and why this is true in principle. The short answer is that “there is no actual machinery to implement it”.

The main question here is “what this abstract structure actually represents”. The answer is “a snapshot of a large chunk of the bullshit verbiage the world has produced”.

The weights (or parameters, which can be thought of as distances or frequencies, as you please) partially define the [multi-dimensional!] shape of this snapshot. The other part is the training data and how exactly it has been “fed forward”.

More precisely, the weights emerge from the training process as a result of processing the training data, and are then used on new (“unseen”) data.

Modern systems presumably update the weights continuously with each new piece of data – a naive assumption about how the brain works (it turns out that brain structure is notoriously difficult to update once it has matured, and most experiences are simply forgotten).

So, what is the claimed breakthrough here? Learning from experience – Unsupervised Reinforcement Learning! My ass.

What is unsupervised reinforcement learning? Rewards, based on what? On previous experiences and some neurobiological heuristics hard-wired by Evolution.

Good old frequency. The most beaten pathways (a notion grounded in What Is, by the way). The most repeated propaganda or mantra. Can frequency and probabilities be the source of truth? No, unless you majored in the humanities.
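The mechanics of “good old frequency” can be sketched as a rich-get-richer update: whatever pathway happened to collect a reward becomes a little more probable the next time, and that is all. A toy illustration of reward-driven reinforcement (just the principle, not the actual policy-optimization machinery used in Deepseek R1):

    import numpy as np

    rng = np.random.default_rng(0)
    counts = np.ones(3)               # three possible "pathways", equally beaten at first

    def reward(path):
        # toy reward signal: pathway 2 happens to be the one the graders like
        return 1.0 if path == 2 else 0.0

    for _ in range(1000):
        p = counts / counts.sum()     # probability ~ how often a path has been reinforced
        path = rng.choice(3, p=p)
        counts[path] += reward(path)  # the rewarded path gets "beaten" a little more

    print(counts / counts.sum())      # the mass piles up on pathway 2: frequency, not truth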

What is actually going on then? What I am trying to show here is that the neural network (no matter the architecture or fine-tuning!) has, in principle, no capacity for reasoning of any kind, just as a parrot (the bird) has no capacity, in principle, to understand the significance of the sounds it produces. It simply has evolved neither “language centers” nor a “cortex” (for that).

The output stream of tokens from the Deepseek R1 model (or any current LLM whatsoever) is an imitation (cosplay) of reasoning, just as the “words” uttered by a parrot are an imitation of the sounds of a human language it overheard somewhere.

One more time, pay attention: just as a parrot reproduces the sound waves it overheard most often, which to an observer sound like a human language, an LLM produces similar-looking streams of tokens it has “seen” most often, which to an observer look like reasoning. No more, no less.

An LLM has no “machinery” for reasoning of any kind; it is just information processing with advanced curve fitting. Nothing fundamental has changed within Deepseek R1.
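“Advanced curve fitting” is meant literally: fit a flexible enough function to enough samples and it will produce plausible-looking outputs on nearby inputs, with no understanding of what produced them. A deliberately low-tech illustration with a polynomial fit (numpy's polyfit standing in here for gradient descent on a network):

    import numpy as np

    # "training data": noisy samples from some unknown process
    x_train = np.linspace(0, 6, 30)
    y_train = np.sin(x_train) + 0.1 * np.random.randn(30)

    # fit a 7th-degree polynomial: pure curve fitting, zero "understanding" of the sine
    coeffs = np.polyfit(x_train, y_train, deg=7)

    # near the training range the imitation looks convincing...
    print(np.polyval(coeffs, 1.5), np.sin(1.5))
    # ...far outside it, the same fit produces confident nonsense
    print(np.polyval(coeffs, 12.0), np.sin(12.0))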

Yes, they came up with a technique to “reinforce” some “common pathways”, but this is not a qualitative improvement. It is still a “parrot” under the hood. (thank you!)

But but but, here are the benchmarks! Bar charts! Why tf not? If you measured how well different parrots produce similar-sounding sound waves (how well they mimic the words of a human language), you could make nice charts and convincing benchmarks too.

Conclusion: this is not reasoning of any kind, emergent or not. It is an ability to produce token streams which “sound” (look) like reasoning, based on what has been “seen” (trained on) before.

This can easily be validated empirically: the most “common sense” (most often repeated) texts are reproduced nearly verbatim from Wikipedia and other authoritative sources.

Again, what an LLM does is not reasoning, it is an imitation of it. It produces something which looks or sounds similar, without any form of “understanding” whatsoever. The “parrot” is not just a metaphor; it is an illustration of a similar and very real biological process.

So this is, still, a socially constructed and socially reinforced illusion, just like almost everything in society.