Here are some important observations from long hours of “experiments”.
Once there is a simple bug in the slop, the chat does not fix just that very line. Most of the time it regenerates the whole file from scratch, sometimes with a slightly different structure and names, suggesting (as one would expect) that it simply repeats the whole task (without understanding your “precious” feedback at all), adding your verbiage as additional context (if at all). This is exactly how it fixes compilation errors – by adding them, as if they were training data, to the context together with the slop which produced them, somehow capturing the actually existing relation between the bad code and particular compiler errors.
The inherent non-determinism of the algorithms involved means it always produces slightly different slop, and most of the time the fixes do, indeed, introduce new bugs, since it does not operate at the level of a principle-guided semantic understanding of what is going on. “Fixing” one part “naturally” leaves another broken, as in the proverbs and jokes.
When it jumps between chat “swings”, an LLM loses its precious context, replacing it with whatever happens to become the current one. In particular, it abbreviates the code blocks with placeholders or comments, and then, literally, forgets what was there (because the omitted code is not pushed back into the freshly created context for the next iteration of the chat) – integrity gets fucked up, and it actually runs in cycles: “now we have to implement this”, referring to something that was cut out 10 “chat swings” ago.
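To make the “abbreviation” problem concrete, here is a minimal, purely hypothetical sketch (the function names are invented, not taken from any real chat) of what a regenerated file tends to look like after a few swings: the placeholder comment stands where working code used to be, and since that omitted body never re-enters the fresh context, a later reply happily reports it as still waiting to be implemented.

```python
# Hypothetical illustration only - the names and structure are invented for this example.

def load_users(db_path: str) -> list[dict]:
    # ... (connection setup and validation unchanged, see the previous message) ...
    # In an earlier iteration this body actually existed; after the "abbreviation"
    # it survives only as the comment above, which carries no recoverable code.
    ...


def render_report(users: list[dict]) -> str:
    # "Now we have to implement this" - although it *was* implemented several swings ago.
    raise NotImplementedError
```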
As a direct consequence, its verbiage drifts out of sync with the code blocks (already broken by “abbreviations” and “omissions”, which are “mechanically” forgotten after a few shifts of the context window), so the claims it makes simply do not materialize.
It does, however, produce very coherent and consistent verbiage; only a true expert could spot the subtle flaws in it, whereas in code, which is a “formal language” and does not permit ambiguity and hand-waving, one would notice them “easily” (everything just breaks apart). The illusion, however, is as complete as in the Matrix movie (which is, in turn, based on the Indian [naive but accurate in principle] concept of Maya).
So, yes, this is only an appearance, but this appearance (illusion) is so strong that it clearly passes the very naive Turing Test (when you “cannot tell” whether or not there is a person in another room).
By the way, you see, I can tell that it is a slop-machine on the other side of the chat, and I can provide the correct justification. This is how my Andrew Ng course pays me back.