Attention Is All bullshit.

Once again I tried to get through this meme video. Once again with the same results.

I have seen these social patterns many times before – when people begin to use ill-defined, anthropomorphic and purely abstract concepts to construct familiar analogies and to invoke intuitions, so everything seems “right” and “logical”.

Abhidharma uses abstract terminology to produce a seemingly coherent system. It started from the very reasonable abstractions of the Buddha, who illustrated his ideas with notions like peeling off the layers of an onion (to find literally nothing in the center) or a mixture of spices used for cooking, but it very quickly turned into pure abstract bullshit.

There was (and still is!) so-called Chinese medicine, where notions like cold and sour, opposed to the notions of hot and sweet, are used to explain the conditions of a human body and even to provide guidance for actual prescribed remedies, while being all “logical” and “coherent” (like the balance is broken, there is too much sour, so add more sweet). The most remarkable thing is that all this seems to work, because changing the behavior and stopping doing what one does is almost always the first necessary step, and the actual underlying conditions and their actual causes are just “out there”, independent from (and unrelated to) all this abstract nonsense.

There is, of course, Freudian “psychology”, based on Greek mythology, plays and the babbling of Nietzsche, where abstract entities called Ego and SuperEgo are engaged in complex political tensions, while keeping an eye on a Critic and a few other personages. This “system” was used to explain the observed phenomena and as the basis of a diagnosis.

There are a lot more of these examples, so many that a distinct recurring pattern can be observed and captured – when complexity increases beyond the limitations of human intelligence (the famous 7 plus or minus 2 meme), a confused and overwhelmed mind tends to use easy to come up with, grossly over-simplified abstractions, while the only constraint is that everything seems to be familiar (already “known”) and “logical”. Most of the time the abstractions are from a completely unrelated domain, but are used as explanations (as in Freud or Nietzsche) instead of analogies and illustrations (as the Buddha did).

So we have all these anthropomorphic things like “attention”, “heads”, “self-attention”, “multi-head attention” (which is, of course, better than a single head, see!) and what not. Just as the unnecessary complexity exploded, the presenter began to talk about things like “tokens are interested in other tokens”, that they are “looking at” other tokens and stuff like that. This is Freudian programming (ok, modeling) all over again.

When I try to find these multiple “heads” which pay “attention”, I see only weighted sums packaged into matrix operations, which supposedly represent these abstract and seemingly correct notions.
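To make the “only weighted sums” point concrete, here is a minimal sketch of a single “head” in plain Python. It assumes the standard scaled dot-product formulation (softmax of Q·Kᵀ, then a weighted sum of the rows of V). The tiny input `X` and the matrices `W_Q`, `W_K`, `W_V` are made-up numbers for illustration only; in a real model they would be learned.

```python
import math

def softmax(xs):
    # Turn arbitrary scores into positive weights that sum to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # Plain matrix product: every entry is a sum of pairwise products.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def attention(x, w_q, w_k, w_v):
    # Three linear maps of the same input.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    # Pairwise dot products between rows, scaled by sqrt of the key width.
    scale = math.sqrt(len(k[0]))
    scores = matmul(q, transpose(k))
    # Each row of weights is a softmax over that row's scaled scores.
    weights = [softmax([s / scale for s in row]) for row in scores]
    # The "attended" output: each row is a weighted sum of the rows of v.
    return weights, matmul(weights, v)

# Hypothetical 3-token input with 2 features per token (illustration only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.0], [0.0, 2.0]]
W_V = [[1.0, 0.0], [0.0, 1.0]]

weights, out = attention(X, W_Q, W_K, W_V)
for row_w, row_o in zip(weights, out):
    print([round(w, 3) for w in row_w], "->", [round(o, 3) for o in row_o])
```

Every printed row of weights sums to 1, and every output row is just those weights applied to the rows of V. Nothing in the code “looks at” anything.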

Sure, multiple heads paying attention to the same observable phenomena do indeed update their inner representations according to their individual, particular experiences and then, by sharing, examining and validating the generalized knowledge, using a common language as a medium of communication and a writing system as a representation, together produce what is called shared knowledge, the application of which in novel everyday situations is what constitutes (and defines) intelligence. Merging and refining multiple inner maps of the same actual territory, you know.

But all I see are weighted sums packaged into matrix multiplications – hand-waving using matrix multiplications (of seemingly weakly related numbers). Yes, the numbers are being updated using well-understood backprop, the overall metric clearly goes down, so the whole thing has to be capturing something. The question is – what exactly has been captured?

The “intuitive answer” is, of course, “multi-headed attention”, you know, self-attention, you see, tokens are interested in each other!

The point is that the actual representations and the actual nested layers of computation, just as in Chinese medicine, are independent from the concepts which are used to explain them, and the associations and presumed relations are socially constructed, just like in the Freudian “psychology”. This does not prevent one from being an expert in the field!

So, how do these models produce the results which we observe and even measure? There are only stacked big matrices of numbers, which capture things like differences and common factors, which are then aggregated and propagated between layers of a model. No heads or self-attention anywhere.

It just so happens that some arrangements of layers (so-called architectures), discovered by a process of combining and recombining (trial and error, similar to a backtracking search), produce empirically better results than others. Why this happened has nothing to do with the anthropomorphic abstract terminology which is being used to explain things. The actual whys are “successive better approximations” and refined processes, just like with ancient cooking.

Again, this is all just socially constructed bullshit on top of matrix multiplications, with experts already having a higher social status than you and me.

Similarities are captured as smaller differences (distances), and bigger “weights” presumably capture the repeated exposures. The non-bullshit part is, of course, the directedness of the graph (of an underlying mathematical expression) and the actual nesting of the individual differentiable operations. The overall thing, however, is just one of many possible arrangements, which happens to produce observable results.
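As a small illustration of that non-bullshit part, here is a sketch of a directed chain of nested differentiable operations, with the chain rule applied backwards along it by hand. The expression `(v * tanh(w * x) - y) ** 2`, the starting values and the step size 0.1 are all arbitrary choices made up for this example and are not taken from any real model.

```python
import math

def forward(w, v, x, y):
    # A directed chain of nested operations: each node depends on the previous one.
    a = w * x             # weighted input
    h = math.tanh(a)      # differentiable squashing
    p = v * h             # weighted output
    loss = (p - y) ** 2   # the metric that is supposed to go down
    return h, p, loss

def gradients(w, v, x, y):
    # Chain rule, applied node by node in the reverse direction of the graph.
    h, p, _ = forward(w, v, x, y)
    d_p = 2.0 * (p - y)         # d loss / d p
    d_v = d_p * h               # d loss / d v
    d_h = d_p * v               # d loss / d h
    d_a = d_h * (1.0 - h * h)   # derivative of tanh(a) is 1 - tanh(a)^2
    d_w = d_a * x               # d loss / d w
    return d_w, d_v

# Made-up numbers, for illustration only.
x, y = 1.0, 0.8
w, v = 0.5, 0.5
losses = []
for _ in range(200):
    losses.append(forward(w, v, x, y)[2])
    d_w, d_v = gradients(w, v, x, y)
    w -= 0.1 * d_w   # plain gradient descent step
    v -= 0.1 * d_v

print("first loss:", losses[0])
print("last loss: ", losses[-1])
```

The metric goes down because the two numbers are nudged against their partial derivatives. The direction of the graph and the nesting of the operations are all the update uses.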

There has to be a minimal network and an optimal arrangement for every particular pattern (which captures it “perfectly” with a minimal expression), but to discover it is too expensive. It is much easier to talk anthropomorphic bullshit and to parrot the current memes.