GGUF quantization guide

A quick guide to understanding modern and legacy quantization methods in LLMs for local inference with GGUF/llama.cpp

Mar 4, 2026

5 minute read

I like running my own LLMs locally. Open models are becoming more and more powerful, with exciting releases like the latest Qwen 3.5 family scoring highly in benchmarks even in their smaller variants. This makes managing and running your own models more viable, as it becomes increasingly easy to repurpose old hardware for local inference with progressively better results. For local users and modest purposes, the GGUF format introduced by llama.cpp is the de-facto default.

Since local inference is typically heavily restricted by the available hardware, several optimization techniques have been implemented to make the models leaner and faster. Perhaps the most important of these is quantization, which trims down the bit count per parameter to achieve lower memory usage and (sometimes) faster inference. The challenge is that there are many different formats and strategies for quantization. In this post, I summarize them, providing a bird’s-eye view on the available techniques, their strengths, and their weaknesses.

Naming conventions

Most GGUF quantization names follow this pattern: Q{bits}{method}{size}

Component	Meaning
`Q`	Quantized format
`2/3/4/5/6/8`	Bits per weight (lower = smaller file, more compression)
`K`	“K-quant”: grouped/blockwise quantization with per-group scales
`I` / `IQ`	“I-quant”: importance-matrix/non-linear quantization for aggressive compression
`0` / `1`	Legacy ungrouped formats (symmetric/asymmetric)
`_S` / `_M` / `_L`	Small/Medium/Large: precision mix across tensor types
`_XS` / `_XXS` / `_NL`	Extra-small / Non-linear variants for I-quants

Unquantized formats

These are base formats with no compression. Usually, but not always, models are trained in these formats.

Format	Bits	Description	Notes
FP32	32-bit float	Full precision, ~26GB for 7B model	Research, debugging
FP16	16-bit float	Half precision, ~13GB for 7B model	GPU training/inference baseline
BF16	16-bit brain float	Same size as FP16, better dynamic range for training	Training on modern GPUs

These preserve maximum accuracy but require significant VRAM. Most local users quantize to reduce size by 75-90%.

Legacy quantization formats (`Q_0`, `Q_1`)

Simple per-block linear quantization. Fast but less accurate at low bits.

Following is a table that contains the format name, number of bits, size (for a 7B model), perplexity, and some notes/recommendations. The perplexity represents the difference between the quantized model and the base model, with lower scores indicating better accuracy and less uncertainty.

Format	Bits	Size (7B)	Perplexity \(\Delta\)	Notes
Q8_0	~8-bit	~6.7 GB	+0.0004	Near-lossless; safe INT8 baseline
Q5_1	~5-bit	~4.7 GB	+0.0415	Legacy; superseded by Q5_K_M
Q5_0	~5-bit	~4.3 GB	+0.0796	Legacy; balanced but outdated
Q4_1	~4-bit	~3.9 GB	+0.1846	Legacy; substantial quality loss
Q4_0	~4-bit	~3.5 GB	+0.2499	Legacy; high quality loss

K-quants or I-quants are generally preferred over legacy formats at \(\leq 5\) bits. Q8_0 remains useful for compatibility.

K-Quant formats (modern default)

Use two-level block quantization (small blocks \(\rightarrow\) super-blocks) with double-quantized scales for better quality-per-bit.

Format	Effective Bits	Size (7B)	Perplexity \(\Delta\)	Notes
Q6_K	~6.0	~5.15 GB	+0.0044	“Almost lossless” with savings
Q5_K_L	~5.3	~4.6 GB	+0.010	High-quality 5-bit
Q5_K_M	~5.1	~4.45 GB	+0.0142	Recommended high-quality 5-bit
Q5_K_S	~4.9	~4.33 GB	+0.0353	Recommended balanced 5-bit
Q4_K_L	~4.7	~4.0 GB	+0.040	Relaxed 4-bit mix
Q4_K_M	~4.5	~3.80 GB	+0.0535	Most popular default
Q4_K_S	~4.3	~3.56 GB	+0.1149	Smaller 4-bit, more loss
Q3_K_L	~3.5	~3.35 GB	+0.1803	Aggressive 3-bit
Q3_K_M	~3.3	~3.06 GB	+0.2437	Balanced 3-bit
Q3_K_S	~3.1	~2.75 GB	+0.5505	Very small, high loss
Q2_K	~2.5	~2.67 GB	+0.8698	Extreme compression, not recommended

Q4_K_M seems to be the community sweet spot, as it offers ~75% size reduction with minimal noticeable quality loss for most tasks.

I-Quant formats (aggressive compression)

IQs use non-linear reconstruction, lookup tables, and importance-matrix calibration for maximum quality at very low bit counts. They trade decoding speed for size.

Format	Effective Bits	Size (7B)	Notes
IQ4_NL	~4.5	~3.9 GB	Non-linear 4-bit; CPU-friendly speed
IQ4_XS	~4.25	~3.7 GB	Best quality/size at 4-bit; slightly slower decode
IQ3_M	~3.6	~3.2 GB	High-quality 3-bit I-quant
IQ3_S	~3.4	~3.0 GB	Balanced 3-bit aggressive
IQ3_XS	~3.2	~2.8 GB	Very small 3-bit
IQ3_XXS	~3.0	~2.6 GB	Extreme 3-bit compression
IQ2_M	~2.7	~2.4 GB	High-quality 2-bit (rare)
IQ2_S	~2.5	~2.2 GB	Aggressive 2-bit
IQ2_XS	~2.3	~2.0 GB	Very aggressive 2-bit
IQ2_XXS	~2.1	~1.9 GB	Extreme; significant quality loss

I-quants require importance matrix (imatrix) calibration during quantization for best results. Without it, quality can degrade noticeably.

Special formats

Format	Bits	Description
TQ1_0	~1.6	Ternary quantization \(\{-1, 0, +1\}\), for massive models like DeepSeek where fitting in VRAM is critical

Decision guide

Best balance \(\rightarrow\) Q4_K_M (default recommendation)
Maximum quality, still compressed \(\rightarrow\) Q5_K_M or Q6_K
Tight VRAM, acceptable quality loss \(\rightarrow\) Q4_K_S or IQ4_XS
Absolute smallest file \(\rightarrow\) IQ3_XS / Q3_K_S (test quality first)
Near-original accuracy \(\rightarrow\) Q8_0 or keep FP16/BF16
Running on CPU \(\rightarrow\) Q4_K_M or IQ4_NL for better decode speed
Fitting a 70B model on 24GB VRAM \(\rightarrow\) IQ3_S or Q3_K_M + CPU offload

Conclusions

As you see, this looks like the wild west at first glance, but this mess is not without order. Most importantly, there are good reasons behind every quantization type. Here are some takeaways.

Bits \(\neq\) quality alone: A well-designed 4-bit format (Q4_K_M, IQ4_XS) can outperform a naive 5-bit legacy format.
Suffixes matter: _M variants selectively keep sensitive layers at higher precision, improving quality with minimal size cost.
Hardware matters: I-quants compress more but decode slower on CPUs; K-quants often give better tokens/sec on consumer hardware.
Calibration helps: Models quantized with an importance matrix (imatrix) retain more accuracy, especially at \(\leq 3\) bits.
Test your use case: Perplexity benchmarks are only guides. Always validate outputs for your specific tasks.

For the latest format support and benchmarks, check the llama.cpp repository or community hubs like Hugging Face, where curators like bartowski and Unsloth publish tested GGUF variants.

langur@monkey:~/$

~/blog

~/projects

~/publications

~/cv

~/photo

~/search

GGUF quantization guide

Naming conventions

Unquantized formats

Legacy quantization formats (`Q_0`, `Q_1`)

K-Quant formats (modern default)

I-Quant formats (aggressive compression)

Special formats

Decision guide

Conclusions

~/blog

~/projects

~/publications

~/cv

~/photo

~/search

Naming conventions

Unquantized formats

Legacy quantization formats (Q*_0, Q*_1)

K-Quant formats (modern default)

I-Quant formats (aggressive compression)

Special formats

Decision guide

Conclusions

Legacy quantization formats (`Q_0`, `Q_1`)