I think I have seen this before. Once in Varanasi, wandering around book stalls (most titles being “tourist books” – oversimplified and westernized “tantric” bullshit), I found a whole book by some local publisher which describes in minute detail one single Brahmanic ritual (an elaborate sacrifice) which lasts almost a whole day. Hundreds of ingredients are burned in a precise sequence, or rather a symphony of chants, motions, gestures (mudras) and many other elaborate details. The priests (brahmans) definitely knew what they were doing and why exactly this way is the only proper way.

Being who I am, I thought – “what if I secretly replace, say, the goat’s blood with monkey pee? How would this affect the actual outcome, and how would this fact be noticed, and by whom?”

Nowadays I see very similar things going on right before my eyes, and I ask myself – “what if someone secretly switched, say, the order of the self-attention layers in a Transformer? How would this affect the actual outcome, and how would this fact be noticed, and by whom?”

Before we begin, it is important to realize right from the start that terms like “Self-Attention”, “Multi-Head Attention” and the like have absolutely nothing to do with the actual activities they may suggest, and are just series of matrix multiplications.
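To make this literal, here is one “self-attention head” in a dozen lines of plain NumPy (a minimal sketch; the shapes and names are mine, not anyone’s canonical implementation). Note that every single step is a matrix multiplication or an element-wise normalization – no “attending” happens anywhere:

```python
import numpy as np

def softmax(z, axis=-1):
    # subtract the max for numerical stability, then normalize
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(X, W_q, W_k, W_v):
    # three matmuls, one scaled matmul, one softmax, one more matmul -- that's all
    Q = X @ W_q                                # the so-called "queries"
    K = X @ W_k                                # the so-called "keys"
    V = X @ W_v                                # the so-called "values"
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # plain dot products, scaled
    return softmax(scores) @ V                 # a weighted average of rows of V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                    # 5 "tokens", embedding width 8
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention_head(X, W_q, W_k, W_v).shape)   # (5, 4)
```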

My claim is that no one understands how any of this will affect the expected outcome, and that no one is even bothering. Just as within an organized religion, everyone is busy “building and running temples” for big profits (despite your idealistic view, mandirs and even ashrams were, and still are, businesses, providing food and accommodation, however simple).

While the expenditures (direct and indirect) weren’t in trillions of borrowed USD, they were comparable, since the majority of the population was involved in some religion-related activities and occupations – which is what we have to expect if the current astronomical AI bubble continues being artificially inflated to avoid the inevitable (for the shareholders and “investors”).

The “Attention Is All You Need” paper now has more than 20 thousand citations. Do these 20 thousand people really understand what is going on? Absolutely not. If I put some stale monkey pee inside their Transformer, they wouldn’t notice.

There are, of course, the “Illustrated Transformer” and the “Annotated Transformer” and countless “Transformer From Scratch” pages, which claim to elaborate all the details, but instead they just re-state “the [details of] hows” without any real “whys”, let alone justifying every computational step for its contribution to the final result.

I would claim that most of the performed calculations are “redundant”, or amount to “mathematical handwaving” (so to speak), in the sense that nowadays everyone simply follows the established sectarian “ritual” without even trying to really analyze how every single step affects the outcome (which is non-deterministic in principle, anyway).

The whole paper reads like “we have thrown some shit at the wall and this one stuck” (and then the shit, quite literally, hit the fan). One representation has been replaced with a “better” one, a few assumptions have been discarded, and a few dubious operations (notably, label “smoothing”) have been thrown in, and the whole thing “still works”, even with a better result on a particular meme-benchmark (notice that the overall translation quality is still shit).
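For the record, the “smoothing” in question is label smoothing (the paper uses ε = 0.1): the “true” one-hot training target gets deliberately blurred. A minimal sketch of one common variant (spreading ε over the non-target tokens; the names are mine):

```python
import numpy as np

def smooth_labels(token_id, vocab_size, eps=0.1):
    # (1 - eps) on the observed token, eps spread over everything else
    target = np.full(vocab_size, eps / (vocab_size - 1))
    target[token_id] = 1.0 - eps
    return target

t = smooth_labels(token_id=2, vocab_size=5)
print(t, t.sum())   # [0.025 0.025 0.9 0.025 0.025] 1.0
```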

Of course, all these titans will assure you that they knew precisely what exactly they were doing – and so will the brahman priests.

Recall that the name of the game is to produce a series of numbers in the range between 0 and 1 which sum up to 1, and then interpret these numbers as the estimated probabilities of the next token, given the preceding tokens (applied Bayesian “reasoning”, which cannot be a source of Truth in principle). What the produced numbers actually are – no one really cares, as long as they resemble or look like something similar to probabilities.
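The machinery that manufactures these numbers is the softmax function: any vector of arbitrary real “scores” goes in, and numbers in (0, 1) summing to 1 come out – by construction, regardless of what the scores mean (a toy sketch, numbers mine):

```python
import numpy as np

def softmax(z):
    # arbitrary real-valued "logits" in, something probability-shaped out
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, -3.0, 0.5])   # arbitrary scores, no units, no meaning
p = softmax(logits)
print(p, p.sum())   # each in (0, 1), summing to 1 (up to float rounding)
```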

In fact, the only criterion is that the model produces some appearance of an “acceptable (by whom?) linguistic output”, which then can be “benchmarked” and compared to other models’ outputs. Notice that this has nothing to do with the overall quality of the produced “slop”.

Again, “self-attention” and “multi-head attention” have no more meaning than “Agni mantras” or the choice of a particular kind of goat (a process with a lot of subtle details) for a particular Durga puja.

Moreover, there are no such things as “queries”, “keys” and “values” – these are just matrices, and by calculating dot products (which are just sums of element-wise products) they claim that some sort of querying magically occurs.

To remind you: at the most general and abstract level, a key-to-value association (as an ordered pair) is a particular relation, with a “query” being an operation which roughly corresponds to the notion of “such that” and “selects” a particular pair (naively assuming no hash collisions and that there is never an “error”).

If there is a “set” operation (which is an overwrite of a hidden mutable state – a very wrong thing to do in principle), we would have some “counts” (or just numbers) associated with a “key”. The CS-theoretical ADT is a well-understood abstraction with lots of implementation-specific subtleties.
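For contrast, here is what an actual key-value store with a “set” operation looks like – exact keys, exact selection, explicit failure (a trivial Python sketch):

```python
# an actual associative ADT: exact keys, exact selection, honest failure
counts = {}                                  # the hidden mutable state

counts["goat"] = counts.get("goat", 0) + 1   # "set": an in-place overwrite
counts["goat"] = counts.get("goat", 0) + 1

print(counts["goat"])        # the query selects THE pair for this key -> 2
print(counts.get("monkey"))  # no such pair -> an honest None, not a "soft"
                             # weighted average over every value in the store
```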

Nothing like that ever happens within a Transformer – this terminology is just sloppy thinking, or merely a set of incorrect and even misleading metaphors.

So, what is really going on? Well, the process produces estimated conditional probabilities of a given training set (different every time), and out of these one “infers” the output sequences, which, again, only appear to be satisfactory, and like other appearances are subject to the most stupid speculative interpretations.
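Stripped of the ritual, the nature of the estimate is the same as what a bigram counter produces – conditional probabilities from token frequency counts (a deliberately crude sketch; the Transformer conditions on longer contexts and interpolates, but it is the same kind of object):

```python
from collections import Counter, defaultdict

corpus = "the goat saw the monkey and the monkey saw the goat".split()

# count how often each token follows each preceding token
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

# the "estimated conditional probabilities" of the next token, given "the"
total = sum(following["the"].values())
for token, n in following["the"].items():
    print(f"P({token!r} | 'the') = {n / total:.2f}")   # goat: 0.50, monkey: 0.50
```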

This is a well-known pattern, when the traditionally used terminology makes no sense (queries are not really queries, “attention” is not even close in meaning, and there are no “heads”) – and it is a major hallmark of esoteric sectarian bullshit. Any interpretation beyond what the actual algorithm is doing is exactly that kind of bullshit.

Here is the catch – by adding some noise or slightly permuting the probabilities, the produced slop remains roughly the same, or even appears to be “better” (according to some benchmark). Imagine a calculator or a computer with such “properties”! This is exactly what allows the famous bullshitters (Slutsker or whoever) to speculate about [abstract] multi-dimensional “topological sorting”, the “semantic distances” between “points” in a multi-dimensional space and so on, while there is nothing there but conditional probabilities based on frequency counting of tokens.
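This is easy to check with your own hands – perturb a “probability” vector and watch the greedy pick not care (a toy sketch, assuming greedy decoding; real samplers inject randomness on top of this anyway):

```python
import numpy as np

rng = np.random.default_rng(42)

p = np.array([0.55, 0.25, 0.12, 0.05, 0.03])    # "next-token probabilities"
noisy = np.clip(p + rng.normal(scale=0.02, size=p.shape), 1e-9, None)
noisy /= noisy.sum()                            # renormalize back into "probabilities"

print(p.argmax() == noisy.argmax())             # True: the greedy pick survives
print(np.argsort(-p), np.argsort(-noisy))       # the tail may shuffle; nobody notices
```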

This means, among other things, that the so-called “architecture” of a NN has no meaningful (understood) impact on the final results, provided the probabilities were not fucked up too much and that at least some of the “probabilistic structure” still somehow captures the training data.

What exactly it (the probabilistic structure) actually captures is quite another topic – the overall “shape” of all the written verbiage, which, while being mostly (and subtly) wrong, has nevertheless, at least in principle, to converge to What Is, because it is, indeed, prior to any human language.

Now why tf did I write all this? Well, it is a very natural reaction to the unfolding madness, and especially to the over-confident “high-profile” bullshitters. There is nothing fundamentally new in such bizarre social dynamics (except the scale and the depth of the reality distortion), as the provided analogy illustrates, but all this somehow makes one feel like shit. Again, just like with the Brahmanas (the ancient Sanskrit texts), nothing they claim to be true is even real – on a quick examination, everything turns out to be just socially-constructed (sorry, based on “previous results”) sectarian bullshit.