There is a very simple trick to break the illusion and to see through “the veil of Maya”.
Locally run models, like Deepseek-R1-14b-0528 (at full fp16 precision), which is the best I can run, produce vastly different “answers” for exactly the same prompt, not just between two runs, but also if one uses a different math library stack (for instance, recompiling to force Intel MKL).
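The reason the math library stack matters at all is mundane: floating-point addition is not associative, so two BLAS backends (say, Intel MKL versus OpenBLAS) that reduce the same dot products in a different order produce slightly different logits, and one flipped token early in decoding cascades into an entirely different continuation. A minimal sketch of the underlying effect, with no LLM involved, just NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

# The same numbers, reduced in two different orders.
total_flat = float(np.sum(x))
total_chunked = float(sum(np.sum(chunk) for chunk in np.array_split(x, 1000)))

print(total_flat, total_chunked)
print("bit-identical:", total_flat == total_chunked)  # usually False in float32
```

The two totals differ only in the last bits, but an argmax over logits does not care how small the difference is; once a different token wins, the rest of the text diverges.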
Every time we run a prompt, a reasonably good model spits out something that “looks very reasonable” (unless you are an actual expert in the field), because it captures the “common sense” expressed in the training data.
If you run the prompt several times, you will realize that the overall output is a “word salad” more or less about the subject. No “right answer” emerges again and again; it is just a different flavor of verbal vomit each time.
The clever online LLM providers employ a nice trick: they assign a unique id (via uuid4) to the pair of a prompt and its first generated answer, and when you ask the same prompt again, or reload your browser, it spits out the same crap, which, again, looks absolutely convincing, except for small nuances, like salmon in the Mediterranean or olive groves of Okinawa.
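To see how cheap that trick is to build, here is a minimal sketch of such a cache; the class and method names are mine, invented for illustration, not anything a real provider publishes:

```python
import uuid

class AnswerCache:
    """Pin the first generated answer to both the prompt text and a
    conversation id, so repeats and page reloads look 'deterministic'."""

    def __init__(self, generate):
        self.generate = generate     # the underlying, non-deterministic model call
        self.by_prompt = {}          # prompt text     -> conversation id
        self.by_id = {}              # conversation id -> stored answer

    def ask(self, prompt):
        conv_id = self.by_prompt.get(prompt)
        if conv_id is None:
            conv_id = str(uuid.uuid4())                  # first time: mint an id
            self.by_prompt[prompt] = conv_id
            self.by_id[conv_id] = self.generate(prompt)  # generate exactly once
        return conv_id, self.by_id[conv_id]              # same prompt -> same answer

    def reload(self, conv_id):
        return self.by_id[conv_id]   # a browser reload just replays the stored text
```

Replaying the stored answer is what makes the output look stable; regenerating it would expose the variance immediately.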
Everything is, literally, an illusion, the Maya, but it is not created out of your own confused mind (trained on the wrong crap); it is carefully engineered by some of the best brains in the world (for very big money).
So, just try to run some prompt that you understand very well several times on the same model, and you will see that it is just a stream of verbiage, literal blah-blah-blah, as it usually is in non-exact “sciences” and liberal-arts “education”.
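To try this yourself against a locally served model, something along these lines is enough; I assume an Ollama-style local server at http://localhost:11434 with its /api/generate endpoint, so adjust the URL and the model name to whatever you actually run:

```python
import requests

URL = "http://localhost:11434/api/generate"   # assumes a local Ollama-style server
PROMPT = "Explain, in three sentences, why olive trees need a dry summer."

answers = []
for i in range(5):
    resp = requests.post(URL, json={
        "model": "deepseek-r1:14b",           # whatever model you actually serve
        "prompt": PROMPT,
        "stream": False,
    })
    answers.append(resp.json()["response"])

# With the default (non-zero) temperature, the five answers will each read
# plausibly, but they will not agree with one another; diff any two of them.
for i, answer in enumerate(answers):
    print(f"--- run {i} ---\n{answer}\n")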
The situation is very different with source code on big online models, like Grok3 or Gemini. The most used languages (abundant in the training data), Python in particular, tend to produce comprehensible results, and they tend to vary only in insignificant details (when you know what you are asking for).
Nice languages, less represented in the training sets, perform far worse, for obvious reasons (the model works on tokens, at the level of syntax only, in principle and by the algorithms used). So the code is almost always broken in more than one way, but fixable by a competent person.
But non-math and non-coding prompts are just totally broken. You can see it for yourself: each answer will be subtly different, emphasizing different aspects, and even subtly contradicting the others.
The very best models will just be more subtly wrong, but again, each run will produce a completely different, non-deterministic answer, merely reminiscent of the prompt.