So, things are beginning to move a lot faster and get a lot bigger, and there is something to realize about this unprecedented AI bubble.
We will consider only the underlying fundamental principles, not the particular implementation details, “architectures” and whatnot.
There are four major aspects to any LLM – the training process, the “architecture” (the structural shape) of a model, the “post-training tuning” (lobotomy) of the model, and the inference process.
Training and inference are essentially the same for all models – training builds an “abstract probabilistic structure” out of the given training data, and inference is just sampling from a probability distribution.
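To make the “sampling” half concrete, here is a minimal sketch of a single inference step, assuming we already have a vector of next-token logits (the vocabulary and numbers below are made up purely for illustration):

```python
import numpy as np

# Hypothetical logits a model might assign to the next token, given some context.
vocab = ["the", "cat", "sat", "on", "mat", "quantum"]
logits = np.array([2.1, 1.7, 0.3, 0.9, 1.2, -1.5])

def sample_next_token(logits, temperature=1.0, rng=None):
    """Softmax over the logits, then draw one token from the resulting distribution."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

# Every call may return a different token: this is the whole "generation" step.
for _ in range(5):
    print(vocab[sample_next_token(logits, temperature=0.8)])
```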
None of these stages is reproducible or deterministic, not just due to random initialization and the stochastic nature of the training algorithms, but also because floating point arithmetic is not associative, so the order of parallel reductions changes the result.
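The non-associativity point is trivial to demonstrate with plain floating point numbers, nothing model-specific:

```python
a, b, c = 0.1, 0.2, 0.3

# The same three numbers, summed in a different order, give different results.
left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
print(left == right)  # False
```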
The imprecise math, however, is not an issue here, since “Mother Nature” relies on gross but correct approximations at all levels, except Molecular Biology, where the molecules have “precisely” the same shapes.
Notice that even this much understanding is enough to realize that there are neither facts nor truths in the model’s output, in principle.
Here is a good metaphor for you. Imagine a mountain, a magic mountain, if you will.
It is a well-known fact that even the simplest multilayer perceptron (of just 2 layers) can approximate any continuous function.
Now imagine an abstract surface that could “approximate” (match, or even perfectly cover) the surface of this mountain, matching every single “wrinkle”. This is what a general-purpose computation DAG can do, again, in principle.
This is not a trivial claim – DAGs are the basis of your brain’s “structural encoding”, where neurons form DAGs and synaptic “gaps” act as “dynamic weights”. This is the only non-bullshit fundamental finding to come out of all the AI madness.
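For the approximation claim, here is a toy sketch: a two-layer perceptron fitted by plain gradient descent to an arbitrary one-dimensional “mountain profile” (the target function, sizes and learning rate are all arbitrary illustration, not anything from a real model):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy one-dimensional "mountain profile" we want to approximate.
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.exp(-x**2) + 0.3 * np.sin(3 * x)

# A two-layer perceptron: tanh hidden layer, linear output layer.
W1, b1 = rng.normal(size=(1, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 1)), np.zeros(1)

lr = 0.01
for step in range(5000):
    h = np.tanh(x @ W1 + b1)          # hidden activations
    pred = h @ W2 + b2                # network output
    err = pred - y
    # Backpropagation: plain gradient descent on the squared error.
    grad_W2 = h.T @ err / len(x)
    grad_b2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h**2)
    grad_W1 = x.T @ dh / len(x)
    grad_b1 = dh.mean(axis=0)
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    W2 -= lr * grad_W2; b2 -= lr * grad_b2

print("final mean squared error:", float((err**2).mean()))
```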
Now suppose we are dropping perfectly identical steel ball bearings from the top. No two of them will end up at the same spot, or have the same trajectory, the same number of bounces, and so on.
All of them, however, will end up at more or less the same average distance from each other, in the same locality. This is exactly the non-determinism of an LLM’s output (and the basis of abstract Bayesian statistics).
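The metaphor is easy to simulate – a toy random walk down a slope, nothing LLM-specific, just to show individual runs differing while the endpoints cluster:

```python
import numpy as np

rng = np.random.default_rng()

def drop_ball(n_bounces=50):
    """One noisy trajectory: each bounce displaces the ball by a random amount."""
    position = 0.0
    for _ in range(n_bounces):
        position += rng.normal(loc=1.0, scale=0.5)  # downhill drift plus noise
    return position

endpoints = np.array([drop_ball() for _ in range(1000)])

# No two runs are identical, yet the endpoints concentrate around a mean.
print("mean:", endpoints.mean(), "std:", endpoints.std())
```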
Imagine, if you will, that a ball could suddenly disappear and reappear on another side of the mountain, and continue to bounce down. Then again, and again. This means it could end up literally anywhere around the mountain. This is exactly what so-called “hallucinations” are.
The subtle point is that it will non-deterministically “jump” to a nearby spot in the probabilistic structure, which reflects the fact that the words (tokens) are linguistically, but not necessarily semantically (within the given context), related to the previous ones.
This also explains what normies describe as “when it doesn’t know, it makes it up”. This is just a “jump” to the nearest token, which is not necessarily relevant, even if it ended up “close enough”. This is the cause of the subtle, difficult-to-catch bullshit.
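To make the “close but not necessarily relevant” point concrete, here is a sketch with hand-made toy embeddings (real models use learned vectors of thousands of dimensions; the words and numbers below are pure illustration):

```python
import numpy as np

# Toy, hand-made "embeddings": vectors that are close because the words
# co-occur in text, not because they mean the same thing in a given context.
embeddings = {
    "penicillin": np.array([0.90, 0.10, 0.30]),
    "ampicillin": np.array([0.88, 0.12, 0.31]),   # statistically/linguistically close
    "ibuprofen":  np.array([0.70, 0.40, 0.20]),
    "bicycle":    np.array([0.10, 0.90, 0.80]),
}

def nearest(word, table):
    """Return the closest other word by cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    v = table[word]
    return max((w for w in table if w != word), key=lambda w: cos(v, table[w]))

# "Close enough" in vector space is not the same thing as correct in context:
# substituting the nearest neighbour may silently change the meaning.
print(nearest("penicillin", embeddings))  # ampicillin
```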
What is the magic mountain made of? Words. All the words ever written on the public internet, and, potentially, all the words ever committed to digitized media, which means verbalized, socially constructed bullshit and mostly abstract, ill-defined verbiage.
Notice that nothing can be done about this – this is just the way things are in the universe. Humans tend to produce bullshit, and LLMs just capture the statistical structure of this bullshit.
Now, the architectures. No one knows how (let alone why) a particular architecture affects the observable behavior. It is all just handwaving and almost arbitrary, accidental choices, which were made and then hardened into a sectarian consensus. They just tried a few and one appeared to be better than the others. No one knows shit about the whys.
Again, no Slutsker or whoever, because it is just socially constructed crap. Shit just “somehow works this way too”, unlike, let’s say, the derivatives, which make perfect sense for determining the direction of the next step.
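For contrast, the one piece that does make straightforward sense – the derivative telling you which way to step – fits in a few lines (toy function, arbitrary learning rate):

```python
# Minimise f(x) = (x - 3)^2 by following the derivative.
def f(x):
    return (x - 3.0) ** 2

def df(x):
    return 2.0 * (x - 3.0)   # the derivative points in the direction of increase

x = 0.0
lr = 0.1
for _ in range(100):
    x -= lr * df(x)          # step against the derivative, i.e. downhill

print(x)  # close to 3.0, the minimum
```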
Post-training is an attempt to carve “more beaten paths” into the surface, paths which correspond to “the right answers” according to human experts. This is bullshit, of course. Even dangerous bullshit. Any given “human expert” can be as delusional and heavily cognitively biased a zealot as one could imagine.
Other approaches, like building feedback loops from slop to slop, are more interesting, but the chances of convergence on What Is, which is supposed to be the ultimate promise, are diminishing (they will converge on the most repeated bullshit instead).
Here is how it works in the context of code-generating LLMs. They collect all the prompts and the resulting slop (and you are even paying them per token to do so, lmao), and then feed the results, supposedly accepted as good enough by a human “programmer”, back in as new training data.
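Schematically, that feedback loop looks like this (every class and function here is a hypothetical stand-in; no vendor’s actual pipeline is being described):

```python
import random

# Hypothetical data flywheel: accepted completions become tomorrow's training data.
class ToyModel:
    def __init__(self, corpus):
        self.corpus = list(corpus)

    def generate(self, prompt):
        # Stand-in for generation: regurgitate something from the training corpus.
        return random.choice(self.corpus) if self.corpus else "..."

    def fine_tune(self, examples):
        # The next model is fitted to whatever the previous one produced and
        # someone clicked "accept" on -- the loop feeds on its own output.
        return ToyModel(self.corpus + [completion for _, completion in examples])

accepted = []
model = ToyModel(["print('hello')", "while True: pass"])

for prompt in ["task A", "task B", "task C"]:
    completion = model.generate(prompt)
    if len(completion) < 40:          # a crude stand-in for "looks good enough"
        accepted.append((prompt, completion))

model = model.fine_tune(accepted)     # round two: trained partly on its own slop
print(len(model.corpus))
```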
This approach should make “investors” exuberantly euphoric (which is what we readily observe) – an endless “data supply chain” for re-training the models. But there is a catch.
What the models will converge onto in this way will be exactly the lowest-level “slop”: all the amateur, zealous code ever written without any understanding whatsoever, which is a lot of bullshit – mediocre, poorly designed, insecure, buggy code.
Take the [now dead] J2EE, early PHP code, naive C and C++ code (even pre-C++11) whose authors never even considered that something might go wrong, and so on.
Even more importantly, the resulting slop will be, again in principle, a textbook no-no: a mix-and-match of always-leaking (everything leaks everything) low-level abstractions and irrelevant low-level types with high-level domain-specific types, with two-thirds of the bloat being just error-prone conversion back and forth (to and from the irrelevant implementation details).
Everything that is bad in the shitty imperative code created by an ignorant crowd of amateurs – the same crowd that gave us J2EE and other imperative OO crap – will be there, precisely because it is a very similar kind of process in principle. 98% of the GitHub code (used for training all the models) is such amateur crap, without any competent, principle-guided design or the resulting clear abstraction barriers.
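Here is a small invented caricature of that mix-and-match, not taken from any real codebase: raw strings and dicts standing in for domain concepts, with conversions back and forth doing most of the work:

```python
# Invented example of the anti-pattern: low-level representations (strings,
# dicts, cents-as-ints) leaking through what should be a domain-level API.
def charge_customer(customer_csv_row: str, amount: str, currency: str) -> dict:
    # Conversion from an irrelevant low-level representation...
    fields = customer_csv_row.split(",")
    customer = {"id": int(fields[0]), "email": fields[1]}

    # ...arithmetic on a stringly-typed amount...
    cents = int(float(amount) * 100)  # silently wrong: int(19.99 * 100) == 1998

    # ...and conversion straight back into another low-level representation.
    return {
        "customer_id": str(customer["id"]),
        "amount_minor_units": str(cents),
        "currency": currency.upper(),
    }

# Two-thirds of the function is converting to and from representations that the
# domain ("charge this customer this amount") never needed in the first place.
print(charge_customer("42,a@example.com", "19.99", "usd"))
```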
And this shit is suddenly valued in trillions, hundreds of millions are being paid to some graduates who cannot even formulate this clearly, and the peak euphoria is in, as NVDA (which sells the shovels) is propping up the entire US stock market.
This is bullshit. Pay attention.