Everyone is a fucking expert nowadays, especially in matters of vibe-coding and LLM usage, you know.

Here is an expert who wrote a GPT-assisted piece about how cool he really is:

https://msf.github.io/blogpost/local-llm-performance-framework13.html

There is, however, a small catch. Running 4-bit models when your memory can fit a full-size 64 GB GGUF with proper 8-bit (or even f16) tensors is missing the point completely.

Yes, you will get just 1-2 generated tokens per second, so it will feel like dial-up internet all over again (which is not necessarily bad), but you will get orders-of-magnitude better slop, in principle.

The rule is plain and simple: by cutting the weights down one ruins any chance of the occasional “emergent effects” (the cognitive illusion of an “intelligence” out of sheer brute force, which only takes place in the biased mind of an LLM user), and one consumes the lowest-quality pseudo-intellectual slop (a mere appearance) instead.

It is qualitatively better to run a full-weight 14B, let alone a 30B model, from a 32 GB GGUF file (via mmap, of course) and have a BBS-like experience than to get a “fast” 10-15 tokens per second of low-quality slop.

The optimal way for now (Feb 2026) of running local LLM models is to run 64 GB 8-bit-tensor GGUFs on a CPU with mmap, backed by a vendor-optimised math library. I wish I had 64 GB.
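For what it is worth, a CPU+mmap run needs no special incantation: mmap is llama.cpp’s default loading path. A minimal sketch (the model filename is one of mine, the context size and prompt are just placeholders; flag spellings follow the current llama-cli and may drift between releases):

```shell
# mmap is the default; --mlock (optional) pins the mapped pages
# so the OS does not evict them mid-generation.
llama-cli \
  -m Qwen3-30B-A3B-Instruct-2507-Q8_0.gguf \
  -t "$(nproc)" \
  -c 8192 \
  -p "Explain memory-mapped I/O in one paragraph."
```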

These are the models I am able to run on a 32 GB Core Ultra 155H system, by compiling llama.cpp with Intel’s icpx compiler and linking against the heavily optimised Intel MKL (the SYCL support for llama.cpp compiles, but is permanently broken).

-rw-r--r--. 1 lngnmn2 lngnmn2 32483932576 Oct 27 09:40 Qwen3-30B-A3B-Instruct-2507-Q8_0.gguf
-rw-r--r--. 1 lngnmn2 lngnmn2 32483935392 Dec 23 07:50 Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf
-rw-r--r--. 1 lngnmn2 lngnmn2 23205357888 Sep 14 15:49 baidu_ERNIE-4.5-21B-A3B-Thinking-Q8_0.gguf
-rwxr-xr-x. 1 lngnmn2 lngnmn2        2126 Dec 23 07:00 g.sh
-rw-r--r--. 1 lngnmn2 lngnmn2 27020865120 Dec 22 22:02 mistralai_Ministral-3-14B-Reasoning-2512-bf16.gguf
-rw-r--r--. 1 lngnmn2 lngnmn2 33585500096 Dec 21 09:13 nvidia_Nemotron-3-Nano-30B-A3B-Q8_0.gguf
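The build itself is along these lines. A sketch, assuming the oneAPI toolchain is installed under the usual /opt/intel prefix; the BLAS vendor name follows CMake’s FindBLAS convention for MKL:

```shell
# Assumes an installed oneAPI toolchain; paths and vendor name are illustrative.
source /opt/intel/oneapi/setvars.sh
cmake -B build \
  -DCMAKE_C_COMPILER=icx \
  -DCMAKE_CXX_COMPILER=icpx \
  -DGGML_BLAS=ON \
  -DGGML_BLAS_VENDOR=Intel10_64lp   # CMake's FindBLAS name for MKL
cmake --build build --config Release -j
```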

Yes, that would be 1-2 tokens per second, which is perfectly fine. The large coding models (which are the only ones that make any sense), however, are unusable.
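The 1-2 tokens per second figure is just memory-bandwidth arithmetic: for a dense model every generated token streams all the weights through the CPU, so decode speed is roughly bandwidth divided by model size. A back-of-the-envelope sketch (the 60 GB/s figure is an assumption for dual-channel DDR5, not a measurement; 27 GB is the bf16 Ministral file from the listing above):

```shell
# Decode speed for a dense model ≈ memory bandwidth / bytes read per token.
model_gb=27          # bf16 14B model file size, in GB
bandwidth_gbps=60    # assumed effective DDR5 bandwidth, in GB/s
tps=$(awk -v m="$model_gb" -v b="$bandwidth_gbps" 'BEGIN { printf "%.1f", b / m }')
echo "$tps tokens/sec"
```

MoE models such as the 30B-A3B ones only read the active experts per token, which is why they decode noticeably faster than dense models of the same file size.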

An uncut (f16) Qwen3-Coder 61 GB GGUF would probably match the crappy free-tier Claude web models, but I cannot confirm this. The 36 GB 8-bit GGUF produces low-effort simplistic crap, if anything at all.

Here is a simple experiment: formulate a question to which you already know an expert-level answer, and ask the same question of both the 4-bit and the 8-bit model (writing the answers into separate files).
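A sketch of that experiment (the model filenames and question.txt are hypothetical; the quantization suffixes follow llama.cpp’s naming; temperature 0 keeps the comparison deterministic):

```shell
# Hypothetical A/B run: same prompt, same sampling, two quantizations.
PROMPT="$(cat question.txt)"   # the expert-level question, prepared beforehand
for q in Q4_K_M Q8_0; do
  llama-cli -m "model-${q}.gguf" \
    -n 1024 --temp 0 \
    -p "$PROMPT" > "answer-${q}.txt"
done
diff answer-Q4_K_M.txt answer-Q8_0.txt
```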

You will find out that the difference between the answers is almost exactly the difference between asking on 4chan and asking on a specialised Quora topic or Stack Overflow.

The reason is that by cutting the weights down you get only the most common “paths” through the graph, which roughly corresponds to the quick, bold, over-confident answers of stupid people who have a ready meme-based opinion about everything. It is that simple.