It is more or less obvious why the AI and LLM bubble is so huge – imagine just charging money for every HTTPS request to a RESTful API without being, literally, responsible for the quality of the response (it is not our fault if an LLM returned bullshit to you, or, which is much worse – highly sophisticated, convincing, subtle bullshit).

Again, there is not enough good code to train a model on. The MIT Scheme, Haskell, OCaml, Scala and Go compilers and standard libraries – and that is basically it. Everything else is outrageously low-effort amateur crap – piles upon piles of it, without any attempt to do things “just right” (as in the classic languages).

There is a fundamental principle behind the fact that machine translation without understanding works reasonably well – there is the same underlying reality behind every human language, more or less the same environmental and social constraints (with all the differences between tribes near the North Pole and uncontacted tribes of the Amazon forests).

There is a simple (and the only correct) translation algorithm, then – reduce the verbalization (encoding) to the aspect of the environment it refers to, and then select the verbalization for this notion in the target language. That yields a one-to-one correspondence for nouns (things) such as a “cat” (assuming they have cats in Alaska), and for most nature-related verbs (actions) and adjectives (colors, etc.).

This is a gross over-simplification, but the principle is the right one.
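
A toy Haskell sketch of this pivot-through-meaning idea (everything here – the `Notion` type and the two tiny lexicons – is hypothetical, just enough to show the shape of the algorithm):

```haskell
-- The shared underlying reality: a language-independent notion.
-- Hypothetical and absurdly small, for illustration only.
data Notion = Cat | Water | Run | Red

-- Reduce an English verbalization (encoding) to the notion it refers to.
parseEN :: String -> Maybe Notion
parseEN "cat"   = Just Cat
parseEN "water" = Just Water
parseEN "run"   = Just Run
parseEN "red"   = Just Red
parseEN _       = Nothing

-- Select the verbalization of that notion in the target language.
renderDE :: Notion -> String
renderDE Cat   = "Katze"
renderDE Water = "Wasser"
renderDE Run   = "laufen"
renderDE Red   = "rot"

-- Translation is reduction to meaning followed by re-verbalization.
translate :: String -> Maybe String
translate w = renderDE <$> parseEN w
```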

An even better realization of this very principle is the tool called pandoc. It converts (translates) between the major structured text formats by parsing into a fundamental common structure – an AST (such as nested blocks of text) along with attributes (such as italics and so on) – and then generates the target format.

It works because there is a commonality among most of the supported formats.
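
A minimal sketch of that parse-to-AST-then-generate pipeline, assuming a reasonably recent version of the pandoc Haskell library is installed:

```haskell
import qualified Data.Text.IO as T
import Text.Pandoc

-- Markdown -> shared Pandoc AST -> reStructuredText.
main :: IO ()
main = do
  src <- T.readFile "input.md"    -- hypothetical input file
  out <- runIOorExplode $ do
    ast <- readMarkdown def src   -- parse into the common AST
    writeRST def ast              -- generate the target format
  T.putStr out
```

The same `ast` could just as well be handed to `writeHtml5String`, `writeLaTeX` or any other writer – that is the whole point of the common structure.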

What corresponds to the universal common building blocks of programming languages? Well, this is already well understood (a small Haskell illustration follows the list):

  • All you need is lambda
  • Recursion
  • Abstract data types and interfaces (type signatures in general)
  • Algebraic data types, including GADTs
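
A few lines of Haskell are enough to exhibit all four (a minimal sketch; the typed `Term`/`eval` pair is the standard textbook GADT example):

```haskell
{-# LANGUAGE GADTs #-}

-- All you need is lambda: functions built out of functions.
compose :: (b -> c) -> (a -> b) -> a -> c
compose f g = \x -> f (g x)

-- Recursion instead of loops.
len :: [a] -> Int
len []       = 0
len (_ : xs) = 1 + len xs

-- A plain algebraic data type: a sum of products.
data Shape = Circle Double | Rect Double Double

-- A GADT: constructors refine the result type, so an ill-typed
-- expression cannot even be constructed.
data Term t where
  IntT  :: Int -> Term Int
  BoolT :: Bool -> Term Bool
  AddT  :: Term Int -> Term Int -> Term Int
  IfT   :: Term Bool -> Term t -> Term t -> Term t

eval :: Term t -> t
eval (IntT n)    = n
eval (BoolT b)   = b
eval (AddT a b)  = eval a + eval b
eval (IfT c t e) = if eval c then eval t else eval e
```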

To be precise, the intermediate language of GHC (Core) is exactly what is needed – it is what everything is reducible to. At the CPU level we have comparisons, loops, moves and procedure calls, but that is too low-level.
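
For reference, a radically simplified sketch of the shape of that Core language (the real definition lives in GHC's GHC.Core module and additionally carries types, coercions and annotations):

```haskell
-- Everything a surface program desugars to: variables, literals,
-- application, lambda, let-bindings and case analysis on ADTs.
data Expr b
  = Var  b                   -- variable
  | Lit  Literal             -- literal constant
  | App  (Expr b) (Expr b)   -- function application
  | Lam  b (Expr b)          -- lambda abstraction
  | Let  (Bind b) (Expr b)   -- (possibly recursive) local bindings
  | Case (Expr b) b [Alt b]  -- pattern matching on data constructors

data Bind b  = NonRec b (Expr b) | Rec [(b, Expr b)]
data Alt b   = Alt AltCon [b] (Expr b)
data AltCon  = DataAlt String | LitAlt Literal | Default
data Literal = LitInt Integer | LitStr String
```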

Now you probably get the idea. If (notice this if) we wish to train a model, we have to train it on this kind of intermediate language (GHC Core), not on the whole syntactic zoo plus low-effort crappy libraries by unqualified amateurs. And definitely not jammed into some 7B-parameter “snapshot”.

That would be (and really is) an extremely detailed “molecular-level snapshot” of the famous central garbage dump in New Delhi, to which paid API requests can be made.

Now pay attention. This is not just a dramatic metaphor; this is the closest possible analogy which is NOT wrong. All the LLMs are snapshots of a whole garbage dump, not of any knowledge that underlies it. I am the same guy who explained this back when the hype was only beginning, and my reasoning is still valid and only gathers more evidence.

I do not want to be treated by a doctor who reads LLM bullshit (without understanding), even as a “suggestion”, nor to rely on any code pasted from a model without understanding. Yes, it is, perhaps, the biggest corner-cutting technology in human history, but the consequences will be on the same scale.

Just as math cannot be done by taking a snapshot of a dump, neither can programming. We have almost “solved it” with all the theory that underlies Haskell, and it tells us, if anything, that imperative programming with mutable state is just wrong. Everything works (including this Chrome browser) by chance and lots of luck.

So, with all the genuine autistic persistence popularized in “The Big Short” – this is just bullshit (and a bubble). Pay attention.