Look, ma – no hands^W^W another one, who knows it all:

https://www.0xkato.xyz/how-llms-actually-work/

I am so fucking tired of this shit, aren’t you?

Okay, even if arguing with idiots is ultimately futile, especially in a society that does not value intelligence but only appearances and virtue signalling, let’s continue competing in our sport of choice while we still can.

No one understands this shit, since the very process by which the artifact in question has emerged is, in principle, by definition, different from the post-hoc “explanations” and anthropomorphic “rationalizations” which plague such “literature”.

The article in question is a virtue signalling piece which seemingly explains the mechanics of how LLMs actually work.

I would argue that even if the algorithmic parts might be accurate, the explanations of the “whys” are just social constructions and the currently accepted abstract dogma, and the “architectural decisions” as a whole amount to throwing everything at the wall to see what sticks.

No one actually understands what would happen after even a slight rearrangement of the architecture – changing the number of “heads”, of transformer “full layers” and so on. This is just a modern-day Tantra, where everything rests on a current, socially constructed set of beliefs that is largely disconnected from reality.

There are no reproducible experiments that would show that any of these socially constructed “why it works” stories (about the way it can be observed to work) are correct or even accurate.

The nested layered architecture is so complex and actually convoluted that any possible explanatory path through it just loses its meaning in the middle, similar to the decay of a signal which eventually gets lost in bullshit. We have seen nothing but this social dynamic throughout human history, and it still works because no one will refute you, since refutation takes an order of magnitude more intelligence.

Let’s state just a few basic mathematical facts: each additional sub-matrix (of a “head”) is just a parallel sub-network of weights, and the architecture is then just an empirically evolved (by mere trial and error) arrangement of parallel sub-nets, similar to the classic signal-processing patterns – parallel, sum, feedback loop, fork, join (the only possible ones).

Ignoring all anthropomorphic language (“looks for”, “understands”, “stores facts”), a transformer is simply a bunch of nested functions – a composition of vector-valued functions – and can be viewed as a pure functional pipeline.

The entire model consists of:

  • matrix multiplications
  • additions
  • element-wise nonlinearities
  • normalization operators
  • softmax operators

and nothing else.
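
To make the list above concrete, here is a minimal sketch of one pre-norm transformer block as a pure function in NumPy. Everything in it is an assumption made for illustration: the names (`block`, `attention_head`, `make_params`), the tiny sizes, the random weights, the tanh approximation of GELU, and the omission of learned normalization gains and biases. The only things beyond the five operations listed are a constant scale and a constant additive causal mask.

```python
import numpy as np

def softmax(x, axis=-1):
    # Softmax operator (shifted by the max for numerical stability).
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalization operator, applied per token (no learned gain/bias here).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # Element-wise nonlinearity (tanh approximation of GELU).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def attention_head(x, Wq, Wk, Wv):
    # One "head": three matrix multiplications, a softmax, one more matmul.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.triu(np.ones_like(scores), k=1) * -1e9  # constant causal mask
    return softmax(scores + mask) @ v

def block(x, heads, Wo, W1, W2):
    # Heads are parallel sub-networks whose outputs are concatenated.
    h = layer_norm(x)
    a = np.concatenate([attention_head(h, *w) for w in heads], axis=-1) @ Wo
    x = x + a                                   # additive merge (residual)
    x = x + gelu(layer_norm(x) @ W1) @ W2       # 4x-wide feed-forward branch
    return x

def make_params(d_model=8, n_heads=2, seed=0):
    # Hypothetical random weights; a trained model would load these instead.
    rng = np.random.default_rng(seed)
    d_head = d_model // n_heads
    heads = [tuple(rng.normal(0, 0.1, (d_model, d_head)) for _ in range(3))
             for _ in range(n_heads)]
    Wo = rng.normal(0, 0.1, (d_model, d_model))
    W1 = rng.normal(0, 0.1, (d_model, 4 * d_model))
    W2 = rng.normal(0, 0.1, (4 * d_model, d_model))
    return heads, Wo, W1, W2
```

Calling `block(x, *make_params())` on a `(tokens, 8)` array returns an array of the same shape, and calling it twice with the same input gives the same output, which is all that “pure functional pipeline” means here.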

These operations, in principle, encode the classic signal-processing patterns and any architecture can be reduced to combinations of:

  • series composition
  • parallel composition
  • additive merge
  • multiplicative gating
  • recurrence/feedback
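
The five patterns above can be sketched as plain higher-order functions. The combinator names (`series`, `parallel`, `add_merge`, `gate`, `feedback`) are made up for this illustration and do not come from any library; the recurrence is written as an unrolled loop over the rows of a sequence.

```python
from typing import Callable
import numpy as np

Fn = Callable[[np.ndarray], np.ndarray]

def series(*fs: Fn) -> Fn:
    # Series composition: feed the output of each stage into the next.
    def run(x):
        for f in fs:
            x = f(x)
        return x
    return run

def parallel(*fs: Fn) -> Fn:
    # Parallel composition: fork the input, concatenate the branch outputs.
    return lambda x: np.concatenate([f(x) for f in fs], axis=-1)

def add_merge(*fs: Fn) -> Fn:
    # Additive merge: fork the input, sum the branch outputs.
    return lambda x: sum(f(x) for f in fs)

def gate(signal: Fn, control: Fn) -> Fn:
    # Multiplicative gating: one branch scales the other element-wise.
    return lambda x: signal(x) * control(x)

def feedback(step, state0):
    # Recurrence: the state is fed back into `step` for every row of xs.
    def run(xs):
        s = state0
        for x_t in xs:
            s = step(s, x_t)
        return s
    return run

identity = lambda x: x
double = lambda x: 2.0 * x

# A residual connection is just an additive merge with the identity branch.
residual = add_merge(identity, double)
```

In this vocabulary, multi-head attention is a `parallel` of heads followed by a matrix multiplication, and a residual stream is a `series` of `add_merge(identity, branch)` stages.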

And just like I have said so many times already, an LLM is simply a massive, referentially transparent computational graph. It happens to be socially constructed ad hoc, without formal understanding, through a process of “experiments” (throwing shit at the wall and seeing what sticks) with various building blocks taken from various empirical discoveries.

The architecture could have been entirely different (although still nothing but forks, joins and parallel branches, in principle) and the “Tantric teachers” would appear the very next day to “explain to us why”.

In reality, all the architectures are the result of massive, blind, “evolutionary” trial and error. The configurations that “stick” (like the transformer) do so primarily because they can be implemented efficiently in PyTorch, not because they have been done “the right way”.

The fact that we ended up with a “transformer” and the corresponding idiotic anthropomorphic terminology is just an evolutionary accident, and not even a consequence of environmental constraints, as in biological evolution.

One more time: The whole thing is a strictly deterministic pipeline of pure mathematical functions (where the domains and the ranges are vectors and matrices), implicitly governed by classic signal-processing primitives: forks, parallel pipelines, sums, and non-linear gates (because there cannot be anything else, in principle).

It is a computational DAG of almost arbitrary structure (the building blocks are “fixed”, but the composition or nesting can be almost arbitrary). The resulting particular composition (the transformer) is a product of trial and error, not of deep, fundamental, principle-guided deduction.

Any other explanation is a primitive virtue signalling.

Why 32 heads? Why expand the feed-forward network by 4x? Why place the normalization before the attention block rather than after? These choices are nothing but an exercise in throwing shit at the wall.

The architecture is prior to any such interpretation. The architecture is just the signal-processing graph described above. What the weights learn to implement in that graph is determined by gradient descent on a next-token prediction loss over a corpus.
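
What “gradient descent on a next-token prediction loss over a corpus” means mechanically can be sketched with the smallest possible graph: a single table of logits indexed by the previous character (a bigram model). This is a toy under stated assumptions, not how any real LLM is trained: the function name `train_bigram`, the character-level “tokenizer”, the full-batch updates and the learning rate are all chosen here for illustration.

```python
import numpy as np

def train_bigram(text, steps=200, lr=1.0):
    # "Tokenize": map each distinct character to an integer id.
    vocab = sorted(set(text))
    idx = {c: i for i, c in enumerate(vocab)}
    ids = np.array([idx[c] for c in text])
    prev, nxt = ids[:-1], ids[1:]          # (input token, next token) pairs
    n, V = len(nxt), len(vocab)

    W = np.zeros((V, V))                   # the whole "model": a logit table
    losses = []
    for _ in range(steps):
        logits = W[prev]                                   # shape (n, V)
        z = logits - logits.max(axis=1, keepdims=True)
        p = np.exp(z)
        p /= p.sum(axis=1, keepdims=True)                  # softmax
        # Next-token prediction loss: mean cross-entropy of the true next id.
        losses.append(-np.log(p[np.arange(n), nxt]).mean())

        # Gradient of that loss w.r.t. the logits is (softmax - one_hot) / n.
        g = p.copy()
        g[np.arange(n), nxt] -= 1.0
        g /= n
        grad_W = np.zeros_like(W)
        np.add.at(grad_W, prev, g)         # accumulate rows that share `prev`
        W -= lr * grad_W                   # one gradient descent step
    return vocab, W, losses

vocab, W, losses = train_bigram("abababababab")
```

Nothing in the loop knows what the table is “for”; whatever regularity ends up in `W` was put there by the corpus and the loss.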

All the architectural decisions were based on empirical measurement of “the logits” the model outputs: they just blindly turned the knobs to see what happened. The “understanding” was added much later, post-hoc, just as it has always been with anything socially constructed.

No one really understands shit. But you can become @karpathy, or @slutsker, or whatever other popular bullshitter, because arguing with them is very difficult and futile.

There is a simple takeaway: an almost arbitrarily redundant and convoluted architecture would eventually converge, because human language has an implicit structure (it is a communication encoding that captures aspects of What Is) and the gradient descent algorithm can, in principle, capture any abstract structure whatsoever (because it is just a variation on the iterative Fixed Point theme).
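
The “Fixed Point” reading of gradient descent can be written down directly: one step is the map T(x) = x - lr * grad(x), and a point where the gradient vanishes is exactly a fixed point of T. The sketch below iterates that map on a toy convex quadratic; the helper names (`fixed_point`, `gd_map`), the matrix `A`, the vector `b` and the step size are assumptions for illustration, and the guaranteed convergence shown here relies on the problem being convex and the step being small enough.

```python
import numpy as np

def fixed_point(T, x0, tol=1e-10, max_iter=10_000):
    # Iterate x <- T(x) until successive iterates stop moving.
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        x_next = T(x)
        if np.linalg.norm(x_next - x) < tol:
            return x_next, k + 1
        x = x_next
    return x, max_iter

def gd_map(grad, lr):
    # One gradient descent step, viewed as a map whose fixed points
    # are the stationary points of the objective.
    return lambda x: x - lr * grad(x)

# Toy objective f(x) = 0.5 * ||A x - b||^2 with gradient A^T (A x - b).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 3.0])
grad = lambda x: A.T @ (A @ x - b)

x_star, iters = fixed_point(gd_map(grad, lr=0.2), np.zeros(2))
```

Here `x_star` lands on the solution of `A x = b`, where the gradient is zero and the map stops moving.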

Everything else is just an evolutionary accident (except the later empirically better algorithmic choices).

Ok, two articles in 8 hours is a bit too much, so the text is messier than usual.