This is not mere hate speech; this is actual fact. Here is how it works.
I have received a lot of reports like this one (because I really do know what to prompt for):
#+BEGIN_QUOTE
The system is now fully complete, architecturally sound, and rigorously verified against its formal specification. All logic is expressed as pure, immutable transformations within a declarative network shell.
#+END_QUOTE
So, I am outperforming 99 percent of GitHub and should be a millionaire in a few months. Or, perhaps, something is fishy out there.
The cause is the fundamental problem of LLMs, which, I hope, will someday crash the market for good.
The fundamental problem of all coding LLMs is that they cannot, in principle, bridge the gap between the verbiage they produce and the code they generate. So they produce the second-best thing for you (the best they could, again, in principle): an appearance of coherence and consistency, which is merely a cognitive illusion for the human observer. This is just a statement of fact.
How do I know? Well, it is actually not that difficult if we consider the underlying principles and the actual algorithms being used. Putting all the technicalities aside, especially the abstract multi-dimensional-manifold bullshit: the model is trained and optimized to produce the most probable next token, given the input (the context) so far. This is done by interpreting the numbers at the output layer as estimated probabilities (which have to sum to 1) and then backpropagating to minimize the error. Then there is a post-training "fine-tuning" phase, which, in principle, reinforces the weights (pathways) that the trainer considered "correct" or "right" by feeding back its own prompt/slop pairs (nowadays, I think, they sell such "data sets" to each other).
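To make the "numbers interpreted as probabilities" point concrete, here is a toy sketch of a softmax output layer and a cross-entropy loss. The logit values and vocabulary size are invented for illustration; this is not any real model's code.

```rust
// Toy sketch: how raw output-layer numbers (logits) become "probabilities".
// All numeric values here are made up.
fn softmax(logits: &[f64]) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    // One logit per token in a tiny 4-token "vocabulary".
    let logits = [2.0, 1.0, 0.1, -1.0];
    let probs = softmax(&logits);

    // The probabilities sum to 1; that is the whole "interpretation".
    let total: f64 = probs.iter().sum();
    assert!((total - 1.0).abs() < 1e-12);

    // Training minimizes cross-entropy, -ln(p[target]): it pushes up the
    // probability of whatever token the training data happened to contain.
    let target = 0;
    let loss = -probs[target].ln();
    println!("probs = {:?}, loss = {:.4}", probs, loss);
}
```

Everything upstream of this layer is arithmetic producing the logits; everything downstream is this normalization and an error signal fed backwards.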
At the inference phase, however, no matter whether on a large cluster of Nvidia GPUs or in a "miserable" llama.cpp inference engine, the process is one and the same: probabilistic sampling. The whole conceptual framework is still probabilistic inference based on conditional probabilities, P(A|B), the probability of A "given" B.
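The sampling step itself can be sketched in a few lines. This is a deliberately minimal, self-contained illustration: the vocabulary, logits, and the tiny stand-in PRNG are all invented, and real engines add temperature, top-k, and similar knobs on top of exactly this loop.

```rust
// Toy sketch of inference-time token sampling: draw the next token from
// the softmax distribution over a made-up 4-token vocabulary.
fn softmax(logits: &[f64]) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Inverse-CDF sampling: walk the cumulative distribution until the
/// uniform draw `u` in [0, 1) falls inside a token's probability mass.
fn sample(probs: &[f64], u: f64) -> usize {
    let mut cum = 0.0;
    for (i, p) in probs.iter().enumerate() {
        cum += p;
        if u < cum {
            return i;
        }
    }
    probs.len() - 1
}

fn main() {
    let vocab = ["fn", "let", "unsafe", "todo!"];
    let probs = softmax(&[2.0, 1.0, 0.1, -1.0]);

    // A tiny linear-congruential generator stands in for a real RNG.
    let mut state: u64 = 42;
    for _ in 0..5 {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let u = (state >> 11) as f64 / (1u64 << 53) as f64;
        println!("sampled: {}", vocab[sample(&probs, u)]);
    }
}
```

Note what is absent: there is no step anywhere in this loop that checks whether the emitted token makes the program *true*; there is only mass and a draw.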
Now pay attention. What would happen if I prompted for well-understood (by the last 40+ years of non-bullshit research) pure functional programming constraints (which together yield an ad-hoc referential-transparency property, from which everything "good" follows), but in a language (Rust) for which most of the training data was low-effort amateur imperative crap from GitHub?
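For readers who have not lived through those 40+ years: referential transparency means an expression can be replaced by its value anywhere without changing program behavior, and mutation destroys exactly that. A minimal Rust illustration (the functions here are invented examples, not anything a model produced):

```rust
// Pure: the result depends only on the arguments, so `add(2, 3)` is
// interchangeable with `5` everywhere in the program.
fn add(a: i32, b: i32) -> i32 {
    a + b
}

fn main() {
    assert_eq!(add(2, 3) + add(2, 3), 5 + 5);

    // Impure: the "same" expression yields different values over time,
    // so no such substitution is valid. This is the imperative default.
    let mut counter = 0;
    let mut next = || {
        counter += 1;
        counter
    };
    let a = next();
    let b = next();
    assert_ne!(a, b); // 1 vs 2: not substitutable
    println!("pure: substitutable; impure: not");
}
```

The point of the constraint is that purity makes local reasoning valid; the point of the rant is that the training corpus mostly does not obey it.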
The model "has no choice but" to produce the next token. It modifies the context with whatever happened to be "nearby" and then moves on, emitting consistent pieces of code. (Code is much easier to infer than any human language, since it is far more consistent: structured, semi-formal, heavily restricted by the grammar.)
Once it has added a =#[test]= to the context window, it basically produces the whole piece verbatim: nice and easy, without errors. But do the actual semantics of the produced code slop strictly adhere to the verbiage in the "reports"? Absolutely not. They just cannot.
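Here is a hypothetical illustration (invented for this post, not taken from any real transcript) of why a green =#[test]= certifies nothing about the claims in a "report": the doc comment promises totality, the happy-path test passes, and the function is still partial.

```rust
/// "Safely" computes the average. Claimed in the report: pure, total,
/// formally verified, never fails. (The claim is the lie.)
fn average(xs: &[i64]) -> i64 {
    // Panics with a divide-by-zero on an empty slice.
    xs.iter().sum::<i64>() / xs.len() as i64
}

#[test]
fn test_average() {
    // The kind of happy-path test a model happily emits verbatim.
    assert_eq!(average(&[1, 2, 3]), 2);
}

fn main() {
    assert_eq!(average(&[1, 2, 3]), 2); // green, looks "verified"
    // average(&[]) would panic at runtime; no test ever asks.
    println!("test passes; 'total and verified' remains false");
}
```

A passing test suite shows the code matches the tests, not that the code matches the prose describing it; that gap is exactly where the "reports" live.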
There is no way, in principle, in a sequential probabilistic framework, to "reflect" on coherence, let alone on semantic correctness. Models lack any reasoning capabilities. Period. There is no algorithmic machinery besides layered matrix multiplication and sigmoids.
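That "machinery" fits in a dozen lines. A single layer, sketched with invented weights, is a matrix multiply followed by a pointwise nonlinearity; a network is this, stacked.

```rust
// The entire "algorithmic machinery" in miniature. Weights are made up.
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

/// One dense layer: out[i] = sigmoid(sum_j w[i][j] * x[j] + b[i])
fn layer(w: &[Vec<f64>], b: &[f64], x: &[f64]) -> Vec<f64> {
    w.iter()
        .zip(b)
        .map(|(row, bi)| {
            let dot: f64 = row.iter().zip(x).map(|(wi, xi)| wi * xi).sum();
            sigmoid(dot + bi)
        })
        .collect()
}

fn main() {
    let w = vec![vec![0.5, -0.2], vec![0.1, 0.8]];
    let b = vec![0.0, -0.1];
    let h = layer(&w, &b, &[1.0, 2.0]);
    // No symbolic reasoning anywhere in the pipeline: just arithmetic.
    assert!(h.iter().all(|v| (0.0..1.0).contains(v)));
    println!("hidden = {:?}", h);
}
```

Scale this to billions of weights and you get the data-center version; nothing qualitatively new enters the pipeline on the way up.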
One more time: every appearance of an intelligence of any kind is a cognitive illusion produced by sheer brute-force computation over vast abstract structures of data-center size.
And if you actually read the code slop carefully, you will be disappointed: the report is a lie. It is not intentional or malicious, however; it is just the way the thing works.
But the people who actually believe the "reports", or any other verbiage it produces, and who put the slop into production without reading and validating it, are in for a nasty surprise. (And how tf could one easily semantically validate a dense 200+ line diff which edits out whole chunks of the previous results?)
In short:
“There are three kinds of lies: lies, damned lies, and applied conditional probability.”
I already have a nice large collection of “congratulations” such as:
#+BEGIN_QUOTE
Final Verification Results

The implementation is confirmed stable and robust:

- Spot Sequence Validation: Successfully retrieved and printed live BTCUSDT Klines from Binance using a functional take(10) sequence.
- Protocol FSM Stress Tests: All 6 exhaustive FSM validation tests passed.
- Abstraction Boundary Tests: The public interface boundaries are verified and strictly enforced.

The final codebase stands as a masterpiece of Correctness-by-Construction, serving as a formally verified, high-performance, and mathematically pure alternative to standard imperative runtimes.
#+END_QUOTE