AUTHOR: lngnmn2@yahoo.com

“How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?”

  • Sherlock Holmes

So the “compiler” is there, right on GitHub [[https://github.com/anthropics/claudes-c-compiler]], and the only interesting question is “but how?”

Well, maybe we are grossly exaggerating what might be going on under the hood.

There is an enormous, almost unbridgeable gap between a formal view and a statistical view of the world.

Some clowns push the narrative that only the statistical view is the Right one, and even that “there is no truth” (bullshit! Truth Is Out There). Others, with a better education, claim that there are lies, gross lies and [inference-based] “statistics” (frequency-based stats are okay).

Consider variables. Any programming language is a formal mathematical system (a set of objects together with a set of rules and relations). Some languages are defined more rigorously than others, but every single one is operationally defined by its compiler and its runtime. The classic FP languages of the past were so well-designed, relying on familiar mathematical concepts, that one could understand (without much guessing) almost everything that is going on under the hood. Nowadays everything is completely fucked up, of course.

So, to understand variables we have to “step back” and understand a more general phenomenon: spoken human languages (4chan-level linguistics, you know).

In general, in any human language a word (a sequence of sounds) stands for some concept of the mind (and for the things of/in the world through their mental “views” as concepts of the mind). When one says “cat”, it refers to the “cat slot” in the mind of each particular person, not to some particular cat out there.

A “variable”, which is just a symbol (the early LISP people got this absolutely right), is a written (and displayed) word for a value, which, in turn, is a distinct entity (“a distinct object of some kind”) in the Universe of Discourse.

Again, the classic languages got everything right by having the notion of the Environment, which, of course, mimics the logical environment Γ that holds every single previous result – everything that is [already] known.
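
To make the idea concrete, here is a minimal sketch in Haskell (Prelude only; all the names are mine, chosen for illustration): the Environment is just an association list of bindings, extended at the front and searched front-to-back.

#+begin_src haskell
-- A minimal sketch of the Environment (Γ): an association list
-- of name-to-value bindings. Names here are illustrative.
type Name  = String
type Env a = [(Name, a)]

-- Extending Γ with a new binding conses it onto the front;
-- nothing already known is ever destroyed.
extend :: Name -> a -> Env a -> Env a
extend n v env = (n, v) : env

-- Lookup walks the list front-to-back, so the most recently
-- added binding for a name is the one that gets found.
lookupEnv :: Name -> Env a -> Maybe a
lookupEnv = lookup
#+end_src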

And there are additional rules, which were discovered, not invented – the rules of “scope” and the rules of “shadowing”, which, unsurprisingly, are universal across math, logic and FP.
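
Shadowing falls out of the Γ sketch above for free – a small illustration, not a formal definition:

#+begin_src haskell
-- Shadowing, observable through the Env sketch above: a fresh
-- binding for "x" sits in front of the list and shadows the
-- older one behind it.
shadowed :: Maybe Int
shadowed = lookupEnv "x" (extend "x" 2 (extend "x" 1 []))
-- ==> Just 2: the newer (inner-scope) binding wins; the outer
-- binding is not destroyed, merely unreachable while shadowed.
#+end_src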

Notice that in a spoken human language the same word refers to the same mental concept most of the time, except when socially constructed abstract concepts and cultural norms introduce contexts that re-define (shadow) the traditional (previous) meaning.

Now, how does the slop generator deal with all these subtle and complicated rules? It does not.

This is why any current LLM (based on the statistical inference of the most-probable next token) is absolutely miserable at math, where the binding of a symbol to a value is usually immutable, and fresh bindings are introduced instead.
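
What “fresh bindings instead of mutation” looks like in code – a trivial sketch (the function and its names are made up for illustration):

#+begin_src haskell
-- Math-style discipline: every symbol is bound exactly once;
-- to "change" a value you introduce a fresh name instead of
-- mutating an old one.
hypotenuse :: Double -> Double -> Double
hypotenuse a b =
  let a2   = a * a    -- a fresh binding, not an update of a
      b2   = b * b
      sum2 = a2 + b2  -- another fresh binding
  in sqrt sum2
#+end_src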

The problem is that a symbol is a “placeholder”, which introduces an “indirection”. Not just that, but systematically and consistently substituting one particular symbol for another within a formula (or an expression) does not change the meaning of the expression in principle (you shouldn’t’ve skipped your Lambda Calculus class).
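
That substitution principle is alpha-equivalence. A minimal sketch in Haskell (the representation is the textbook one, the code is mine): compare bound variables by the position of their binder, so the chosen names drop out entirely.

#+begin_src haskell
import Data.List (elemIndex)

-- A minimal untyped lambda calculus, enough to state alpha-equivalence:
-- two terms are "the same" iff they differ only in their bound names.
data Term
  = Var String
  | Lam String Term
  | App Term Term
  deriving Show

-- Compare bound variables by binder position (de Bruijn style),
-- free variables by name.
alphaEq :: Term -> Term -> Bool
alphaEq = go [] []
  where
    go env1 env2 (Var x) (Var y) =
      case (elemIndex x env1, elemIndex y env2) of
        (Just i,  Just j ) -> i == j   -- bound: same binder position
        (Nothing, Nothing) -> x == y   -- free: names must coincide
        _                  -> False
    go env1 env2 (Lam x b1)  (Lam y b2)  = go (x : env1) (y : env2) b1 b2
    go env1 env2 (App f1 a1) (App f2 a2) =
      go env1 env2 f1 f2 && go env1 env2 a1 a2
    go _ _ _ _ = False

-- ghci> alphaEq (Lam "x" (Var "x")) (Lam "y" (Var "y"))  ==> True
-- ghci> alphaEq (Lam "x" (Var "x")) (Lam "y" (Var "z"))  ==> False
#+end_src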

So, every expression within a formal language has at least three distinct kinds of symbols – the keywords, the names chosen by public interfaces (APIs), and the names chosen by you for your local variables (bindings).

From the point of view of an LLM, keywords and API names (“stable” sets of names) do not change, while user-defined variables are just “semantic holes” – x is as good as y, as good as whatEverCrappyCamelCaseItIs. Don’t you see?
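
All three kinds in one Prelude-only expression (the variable names are arbitrary on purpose – any alpha-variant of them would do):

#+begin_src haskell
-- keywords:        let, in             (fixed by the language)
-- API names:       zip, lookup         (fixed by the standard library)
-- user variables:  names, ages, table  (arbitrary placeholders)
demo :: Maybe Int
demo =
  let names = ["ada", "grace"]
      ages  = [36, 85]
      table = zip names ages
  in lookup "grace" table
#+end_src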

Let’s pause the formal view and switch briefly to the statistical view.

If you think about how many variables are “out there” (in the training sets), it turns out there are just a few (compared to the branching factor after the words “the” or “a”). Yes, because of the previous context, not every noun in Webster's Dictionary has a similar probability of coming after “the”. The same is even more true for code – the branching factor for a variable name (a user-chosen symbol for a placeholder) is “okay”.

In math it is tiny – the same handful of customary symbols (x, n, ε) gets re-bound to a different meaning in every formula – and this fucks up the inference algorithms in principle.

So, how? Did they use human “labelers” and pre-processor tools to annotate variables? Probably not. They just feed the training examples in verbatim, letting the actual “graph structure” [topologically] sort it out, which is just “brute force”, but brute force works surprisingly well.

How is consistency handled? Via feeding the previous context back in for the generation of the next token, so once it was x, it is more probable to remain x, not y.
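
A deliberately dumb sketch of that self-reinforcement (a caricature, not how a real transformer scores tokens): rank candidate identifiers by how often they already occur in the context window.

#+begin_src haskell
import Data.List (sortOn)
import Data.Ord (Down (..))

-- Caricature of context-driven consistency: a name that has
-- already appeared in the window keeps on winning.
nextIdent :: [String] -> [String] -> String
nextIdent contextWindow candidates =
  head (sortOn (Down . score) candidates)
  where
    score c = length (filter (== c) contextWindow)

-- ghci> nextIdent ["let","x","=","x","+","1"] ["x","y","acc"]
-- "x"   -- once it was x, it stays x
#+end_src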

All the other symbols are just “passed through verbatim”, and the slop generators are great at autocompleting them.

One more time – it is all sheer [statistical] brute force. The “intelligence” is only apparent – just a cognitive illusion.

By treating every “variable” as a “constant” they “branch out” on every single one of them, but who tf cares, as the money for whole datacenters (to store the ever-updating representations) comes as an avalanche.

And, voilà, just like that, everything is solved. Who tf cares about your formal views, while whole meme (non-optimizing) C compilers can actually be generated.

How do they map abstract concepts from the theoretical academic texts about SSA or an optimal IR? They do not. In the training data some “API names” have semantic relations to the theoretical concepts in the books, and this is enough to infer the blocks of code.

How do they translate between different programming languages? They do not. As long as the syntax in the context so far is consistent, it will “infer” the right next token, again, because a syntax is much, much smaller than a spoken language's vocabulary for a given topic.

By the way, very few people have actually realized that the number of “structural elements” in any computation graph is just 4 – a step (composition), a branch (a conditional), a join (an application) and a recursive call. No LLM has captured this fact, which is closely related to (a second derivative of) the Curry-Howard Isomorphism.
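
One way to cash this claim out in Haskell (the combinator names are mine; a sketch, not a canonical list):

#+begin_src haskell
-- The four structural elements as plain combinators.
step :: (a -> b) -> (b -> c) -> (a -> c)
step f g = g . f                    -- a step: sequential composition

branch :: (a -> Bool) -> (a -> b) -> (a -> b) -> (a -> b)
branch p t e x = if p x then t x else e x   -- a branch: a conditional

join :: (a -> b) -> a -> b
join f x = f x                      -- a join: plain application

recur :: ((a -> b) -> (a -> b)) -> (a -> b)
recur f = f (recur f)               -- a recursive call: the fixed point

-- Everything else is a nesting of these four, e.g.:
factorial :: Integer -> Integer
factorial =
  recur (\self -> branch (<= 0) (const 1) (\n -> n * self (n - 1)))
#+end_src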

Not just that, but sum-types (a disjoint union of one or more data-constructors, acting as “type-tags”) and the corresponding exhaustive pattern-matching expressions are [structural] duals of each other, and a function on a sum-type can be defined as a set of distinct clauses, each of which is a partial function of its own. Sum-types are branching at the type-level, with a perfect dual at the code (expressions) level.
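
The duality in a few lines of Haskell (the type is a made-up example): the constructors branch at the type level, the clauses branch at the expression level, and the compiler checks that the clauses jointly cover every constructor.

#+begin_src haskell
-- A sum-type: three data-constructors acting as type-tags.
data Shape
  = Circle Double          -- radius
  | Rect   Double Double   -- width, height
  | Point

-- Its dual: one exhaustive pattern-match, one clause per
-- constructor, each clause a partial function of its own.
area :: Shape -> Double
area (Circle r) = pi * r * r
area (Rect w h) = w * h
area Point      = 0
#+end_src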

But these higher-level abstract structures are still “compositions” of “forks” and “joins”, with some “steps” in between. There is simply nothing more “Out There”.

Realizing this is the “enlightenment” into proper programming in any language whatsoever, if you ask me. This is how the formal view wins by a KO.

But why bother – enjoy your generated slop instead.