Today the internet is glowing bright with AI memes and buzzwords like a Christmas tree. Everyone is there, including billion-dollar corporations announcing a “CodeLlama-34b” which is “designed for general code synthesis and understanding.”

First of all, I personally do not want to rely, in any part of my life, on any “synthesized” (and “understood”) software, and I demand an explicit opt-out. Yes, yes, I know.

If I have any understanding of these subjects at all, this is a bubble and irrational exuberance. Let's try to unpack the whys.

Aside from memes and million-dollar investments, there are actual data structures and algorithms under the hood, which are complex and convoluted. Nevertheless, at a higher level of abstraction we can see what is going on without talking too much bullshit.

First, there are a few things to acknowledge.

When n-grams or even bit-sequences are used for training from the start, we are discarding any semantics of the underlying language and switching to “meaningless” information processing.

This is a strong claim, but it is correct. The actual nodes of the vast trees will contain short bit sequences, and the links and weights will capture not the semantics of the standard idioms of a language, as we would wish, or even generalized algorithmic patterns (reflecting the underlying linear, sequential and table-like structures), but the overall “multi-dimensional shape” of the whole training set.
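A minimal sketch of the point (my own illustration, not any model's actual tokenizer): two alpha-equivalent programs, identical in semantics and differing only in a renamed variable, share only a fraction of their character 5-grams, so a bit-level learner treats them as different objects.

```python
def ngrams(text, n=5):
    """All overlapping character n-grams of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# Same program twice, with one bound variable renamed.
prog_a = "def total(xs): return sum(x * x for x in xs)"
prog_b = "def total(ys): return sum(y * y for y in ys)"

a, b = ngrams(prog_a), ngrams(prog_b)
overlap = len(a & b) / len(a | b)  # Jaccard similarity of the n-gram sets

print(f"shared n-grams: {overlap:.0%}")
# A model trained on such fragments sees two "different" bit patterns,
# while a semantics-aware representation would see one and the same term.
```

The exact percentage does not matter; the point is that a purely syntactic, sub-token representation cannot even recognize alpha-equivalence, the most trivial semantic identity there is.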

Please, read this again; this is an important principle. From the very start of the training process, they chose the wrong (too low) level of underlying abstraction.

The resulting model “works”, or people claim it does, but its size and the resources required to run it are simply ridiculous. The number of “neurons” is a signal (a heuristic) that the whole approach is just wrong. There has to be a better way.

From a PLT perspective, training a model on junk code from the internet, or even GitHub, is just nonsense. The last 60 years of PLT research by smart people gave us a few fundamental results, which can be grossly oversimplified to “each language has its own uses and standard idioms”.

One more step back. In the 70s and 80s they tried to teach programming systematically, illustrating the underlying universal principles with discovered and refined common idioms. The idea was to teach a student “just the right things” from the start.

They gradually introduced the common “shapes” of data and the standard idioms and algorithms for dealing with each particular shape (sequences, trees, tables). Things like sorting of linear sequences were distinct subtopics, but there were common patterns too.
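The “one idiom per shape of data” idea can be sketched in a few lines (illustrative names, not from any particular textbook): the same fold pattern, specialized once for the linear shape and once for the tree shape.

```python
import operator

def fold_list(f, acc, xs):
    # Linear shape: empty | element followed by the rest.
    for x in xs:
        acc = f(acc, x)
    return acc

def fold_tree(f, acc, tree):
    # Tree shape: None (leaf) | (value, left, right), as nested tuples.
    if tree is None:
        return acc
    value, left, right = tree
    acc = fold_tree(f, acc, left)    # one recursive call per sub-shape
    acc = f(acc, value)
    return fold_tree(f, acc, right)

nums = [1, 2, 3, 4]
tree = (2, (1, None, None), (4, (3, None, None), None))

assert fold_list(operator.add, 0, nums) == 10
assert fold_tree(operator.add, 0, tree) == 10
```

The structure of each fold mirrors the structure of its data; that correspondence, not any particular line of code, is what the classic curricula drilled in.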

The code they used to teach was simplified but idiomatic, and very few actual projects have been written in this “teaching style” which everyone could understand. The classic MIT books are well-known for their high-level, clear and idiomatic code in Scheme.

Another great tradition is that of Standard ML and, lately, OCaml, whose authors also write very careful, idiomatic code in their standard libraries.

With Haskell, only GHC and its dependencies (including base) are well-written; most public Haskell code is unimaginable crap of over-abstraction and redundancy (which is supposed to pass for cleverness).

Now please pay attention.

Training on very principled, idiomatic, consistent and clean codebases, written with careful attention to detail, should yield an order of magnitude better performance for any use within the same codebase.

So take, let's say, MIT Scheme and its libraries, the OCaml compiler and its libraries (NOT the crap from Jane Street, or take that separately), or take the whole of GHC. I think the Scala 3 compiler and its libraries are also very well-written. Clojure's author claims its libraries' code is very idiomatic.

With any of these, one could, theoretically at least, do what they did with “handwritten digits” back in 1989 (33 years ago, @karpathy, hi there) – have an actual proof (in code) that the concepts work (and that the level of abstraction is just right).

Here I claim that all the current models will fail to capture the underlying fundamental and significant differences (from a non-bullshit PLT perspective) and will show the same crappy, mediocre performance.

Humans trained in the classic MIT style, however, have consistently shown orders-of-magnitude better performance in serious programming (not some modern webshit).

These humans have been trained to understand and manage necessary complexity, not to sweep it under the rug of an LLM, which is dangerous nonsense at a systems level. It is well understood that a hierarchy of layered DSLs is the “universal” architecture for managing complexity.

The result will show that the bit level of abstraction is wrong, and that the rat-race of training larger and larger models is futile.

One more thing. When “generating” is considered, the algorithmic techniques used rely on probability distributions and on “some” randomness.

This is the “generalizing to never-seen-before examples” meme, which is the greatest of them all.

Here is the thing – it is OK to do pattern-recognition and classification this way, because it does indeed discard insignificant differences as noise, and ML does this exceptionally well.

But, as with the “make more Shakespeare” demonstration, one will always get exactly what one asked for – just more bullshit. I do not want to read that Shakespeare. Nor do I want to rely on that code.

Again, this is serious. From an algorithmic perspective, the way one generates or synthesizes guarantees that, in principle, only look-alike bullshit will be produced. Why? Because, again, what was captured (or learned) were the probability distributions of bit patterns, not the underlying semantics or even the universal shapes of the data, which are out there.
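The generation step itself can be shown with a toy: a character-level Markov chain (a deliberately tiny stand-in for an LLM's learned distribution; the corpus and names below are my own). Sampling it can only ever recombine patterns already present in the training text.

```python
import random
from collections import defaultdict

corpus = "to be or not to be that is the question "

# "Training": record which character follows which.
freq = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    freq[prev].append(nxt)

def generate(start, length, seed=0):
    rng = random.Random(seed)
    out = start
    for _ in range(length):
        out += rng.choice(freq[out[-1]])  # sample the learned distribution
    return out

sample = generate("t", 40)
print(sample)
# Every adjacent pair of characters in the sample occurred in the corpus:
# look-alike by construction, with no access to what the words *mean*.
```

Scaling this up to billions of parameters refines the distribution; it does not change what kind of object is being sampled.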

One last time. There are patterns everywhere, at all levels – the data structures, the algorithms (the data-dominates principle) and the standard idioms of languages. There are even distinct and well-understood patterns at the type level (ADTs, GADTs, etc.).

They could, in principle, be “recognized” and then “used”, but we do not see anything of this sort, for the reasons stated above. After spending merely 20 years of my life, I recognize, know and use the most common ones, LMAO.

To summarize: information processing or indexing at the level of bits will never yield any “knowledge” or “intelligence”, in principle. It isn't there. The huge networks (models) capture “snapshots of all the noise”.

Here is why. Those who really studied CS at a good school like MIT know that almost everything in programming is miraculously reducible to just a few common “patterns” at different levels.

  • terms of the lambda calculus (there are just 3 of them)
  • the algorithm charting patterns (shapes of the building blocks)
  • algebraic data types (at a type level)

The aha-moment is that one could train a network on an intermediate representation of the GHC compiler, which is still the lambda calculus augmented with a few types, by feeding all the idiomatic stdlib code to it, using the representation as the source. Such a network would capture the common shapes of the code.

Other candidates are LISPs, where the code is already an AST, so one would have much less noise. The resulting models could be re-used as the basis for continuous training, just as they do with machine translation models.
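Why a LISP is “already an AST” fits in a dozen lines: a complete s-expression reader (a sketch of my own, not any particular LISP's reader) shows that the parse tree *is* the surface syntax, so a model trained on it would see structure rather than token noise.

```python
def read_sexp(src: str):
    """Parse one s-expression into nested Python lists of atoms."""
    tokens = src.replace("(", " ( ").replace(")", " ) ").split()

    def parse(pos):
        if tokens[pos] == "(":
            node, pos = [], pos + 1
            while tokens[pos] != ")":
                child, pos = parse(pos)
                node.append(child)
            return node, pos + 1         # skip the ")"
        return tokens[pos], pos + 1      # an atom

    tree, _ = parse(0)
    return tree

assert read_sexp("(define (square x) (* x x))") == \
    ["define", ["square", "x"], ["*", "x", "x"]]
```

Compare this with the tokenizers and byte-pair encodings needed to chop up other languages: here there is simply nothing to discard.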

One more time: there is a lot of structure in code and data, but no fucking “34b” models are required to capture it. The “hack” Mother Evolution has “discovered” is to have the “topology” of specialized brain areas mimic or reflect the patterns in the sensory input they receive. In short, it captures the constraints of the environment in the evolved structure (what is left after pruning).

The fact that everything is reducible to just the Lambda Calculus and a few other recurring patterns (including the universal shapes of data) is the most fundamental result of all time, and it is not being used anywhere.

Last but not least, one cannot train a model on 4chan. It will become a literal schizo (due to exposure to inconsistent, self-contradictory bullshit and abysmally horrific code snippets). There is still an acute shortage of actually good code, especially compared to the 70s and 80s.