Quantization
Reducing the precision of a model's numerical values to make it smaller and faster, with minimal quality loss.
AI-generated
AI model weights are typically stored as 32-bit or 16-bit floating-point numbers. Quantization converts them to lower precision (8-bit, 4-bit, or even 2-bit), dramatically reducing model size and memory requirements. A 7B parameter model at 16-bit precision needs about 14GB of memory. At 4-bit quantization, it needs about 3.5GB.
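The memory figures above follow directly from parameter count and bit width. A minimal sketch of that arithmetic (the function name and the 1 GB = 10^9 bytes convention are illustrative choices, not from the source):

```python
def model_memory_gb(num_params: float, bits: int) -> float:
    """Memory needed just for the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9

# A 7B-parameter model at different precisions:
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_memory_gb(7e9, bits):.1f} GB")
# 32-bit: 28.0 GB
# 16-bit: 14.0 GB
#  8-bit:  7.0 GB
#  4-bit:  3.5 GB
```

Note this counts only the weights; activations, the KV cache, and framework overhead add to the real footprint.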
Quantization is what makes it possible to run large language models on consumer hardware. Without it, even modest models would require expensive server GPUs. With 4-bit quantization (methods and formats such as GPTQ and GGUF), people run 70B-parameter models on desktop GPUs. For well-executed quantization, the quality loss is surprisingly small.
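To make the core idea concrete, here is a toy sketch of symmetric per-tensor int8 quantization in pure Python. This is a simplified illustration, not how GPTQ or GGUF actually work (real methods quantize per-group, calibrate on data, and handle outliers); all names here are hypothetical:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats onto integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127  # one scale for the whole tensor
    q = [round(w / scale) for w in weights]     # each value now fits in 1 byte
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.063, 0.9, -0.5]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per weight is bounded by scale/2
max_err = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of four, at the cost of a small, bounded rounding error; lower bit widths shrink storage further but widen that error.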
Hugging Face: Quantization guide - https://huggingface.co/docs/transformers/quantization
Tim Dettmers: "LLM.int8()" - https://arxiv.org/abs/2208.07339