Everyone is a fucking expert nowadays
Everyone is a fucking expert nowadays, especially in matters of vape-coding and LLM usage, you know. Here's an expert who wrote a GPT-assisted piece about how cool he really is: https://msf.github.io/blogpost/local-llm-performance-framework13.html There is, however, a small catch. Running 4-bit models when your memory can fit a full-size 64 GB GGUF with proper 8-bit (or even f16) tensors is just missing the point completely. Yes, you will get just 1-2 generated tokens per second, so it will feel like dial-up internet all over again (which is not necessarily bad), but you will get orders-of-magnitude better slop, in principle. ...
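To make the memory-vs-quantization tradeoff concrete, here is a rough back-of-the-envelope sketch of how big a model's weights are at common quantization levels. The bits-per-weight figures are approximations in the spirit of llama.cpp-style block quantization (the `q8_0`/`q4_0` numbers include per-block scale overhead); real GGUF files also carry metadata, and you still need room for the KV cache on top of this.

```python
# Approximate bits per weight for a few common formats.
# f16 is exact; q8_0/q4_0 figures include per-block scale overhead
# (illustrative values, not a spec).
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

def weights_gib(n_params: float, quant: str) -> float:
    """Approximate size of the weights alone, in GiB."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**30

# Example: a 70B-parameter model at different quantization levels.
n = 70e9
for q in ("f16", "q8_0", "q4_0"):
    print(f"{q:5s}: {weights_gib(n, q):6.1f} GiB")
```

The point of the arithmetic: at f16 a 70B model needs on the order of 130 GiB just for weights, at 8-bit roughly 70 GiB, and at 4-bit under 40 GiB. So if your box actually fits the 8-bit or f16 file, dropping to 4-bit buys you speed you may not need at a quality cost you did not have to pay.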