Quantization
Reducing the precision of a model's numerical values to make it smaller and faster, with minimal quality loss.
AI-generated
AI model weights are typically stored as 32-bit or 16-bit floating-point numbers. Quantization converts them to lower precision (8-bit, 4-bit, or even 2-bit), dramatically reducing model size and memory requirements. A 7B parameter model at 16-bit precision needs about 14GB of memory. At 4-bit quantization, it needs about 3.5GB.
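The memory figures above follow directly from parameter count and bit width. A minimal sketch of that arithmetic (the function name and the 1 GB = 10^9 bytes convention are illustrative choices, not from the source):

```python
def model_memory_gb(num_params: float, bits: int) -> float:
    """Memory needed just for the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9

# A 7B-parameter model at different precisions:
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_memory_gb(7e9, bits):.1f} GB")
# 32-bit: 28.0 GB
# 16-bit: 14.0 GB
#  8-bit:  7.0 GB
#  4-bit:  3.5 GB
```

Note this counts only the weights; activations, the KV cache, and framework overhead add to the real footprint.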
Quantization is what makes it possible to run large language models on consumer hardware. Without it, even modest models would require expensive server GPUs. With 4-bit quantization (methods and formats such as GPTQ and GGUF), people run 70B-parameter models on desktop GPUs. For well-executed quantization, the quality loss is surprisingly small.
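To make the core idea concrete, here is a toy sketch of symmetric per-tensor int8 quantization in pure Python. This is a simplified illustration, not how GPTQ or GGUF actually work (real methods quantize per-group, calibrate on data, and handle outliers); all names here are hypothetical:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats onto integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127  # one scale for the whole tensor
    q = [round(w / scale) for w in weights]     # each value now fits in 1 byte
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.063, 0.9, -0.5]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per weight is bounded by scale/2
max_err = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of four, at the cost of a small, bounded rounding error; lower bit widths shrink storage further but widen that error.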
Hugging Face: Quantization guide - https://huggingface.co/docs/transformers/quantization
Tim Dettmers: "LLM.int8()" - https://arxiv.org/abs/2208.07339