---
title: "Quantization"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Quantization}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(eval = TRUE)
```

ggmlR exposes the full set of ggml quantization formats: from legacy Q4_0/Q8_0 to modern K-quants and IQ (importance-matrix) quants. Quantization reduces model size and speeds up inference, especially on GPU.

```{r}
library(ggmlR)
```

---

## 1. Quantization formats

| Family | Formats | Bits/weight | Notes |
|--------|---------|-------------|-------|
| Legacy | Q4_0, Q4_1, Q5_0, Q5_1, Q8_0 | 4–8 | Simple block quants |
| K-quant | Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_K | 2–8 | Better quality/size trade-off |
| IQ | IQ1_S, IQ1_M, IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M, IQ3_XXS, IQ3_S, IQ4_NL, IQ4_XS | 1–4 | Requires importance matrix |
| Ternary | TQ1_0, TQ2_0 | ~1.5–2 | Ternary weights |
| Microscaling | MXFP4 | 4 | Block floating point |

---

## 2. Quantize and dequantize

```{r}
# Original float weights (length must be a multiple of the block size, typically 32)
weights <- rnorm(256L)

# Quantize to Q4_0
raw_q4 <- quantize_q4_0(weights, n_rows = 1L, n_per_row = length(weights))

cat("Original size: ", length(weights) * 4L, "bytes\n")
cat("Q4_0 size:     ", length(raw_q4), "bytes\n")
cat("Compression:   ", round(length(weights) * 4L / length(raw_q4), 1), "x\n")

# Dequantize back to float
recovered <- dequantize_row_q4_0(raw_q4, length(weights))
cat("Max abs error: ", max(abs(recovered - weights)), "\n")
```

---

## 3. K-quants (better quality)

K-quants use super-blocks with separate scales, yielding better quality at the same bit width:

```{r}
weights <- rnorm(512L)

# Q4_K: 4-bit K-quant
raw_q4k <- quantize_q4_K(weights, n_rows = 1L, n_per_row = length(weights))
rec_q4k <- dequantize_row_q4_K(raw_q4k, length(weights))
cat("Q4_K max error:", max(abs(rec_q4k - weights)), "\n")

# Q8_0: 8-bit (near-lossless)
raw_q8 <- quantize_q8_0(weights, n_rows = 1L, n_per_row = length(weights))
rec_q8 <- dequantize_row_q8_0(raw_q8, length(weights))
cat("Q8_0 max error:", max(abs(rec_q8 - weights)), "\n")
```

---

## 4. IQ quants: importance matrix

IQ formats accept an importance matrix that prioritises accuracy on frequently-used weights. Without an importance matrix they fall back to uniform quantization.

```{r}
weights <- rnorm(512L)
importance <- abs(weights)^2  # example: squared weight magnitude as importance

# IQ4_XS: 4-bit with importance matrix
raw_iq4 <- quantize_iq4_xs(weights, n_rows = 1L, n_per_row = length(weights),
                           imatrix = importance)
rec_iq4 <- dequantize_row_iq4_xs(raw_iq4, length(weights))
cat("IQ4_XS max error:", max(abs(rec_iq4 - weights)), "\n")
```

---

## 5. Comparing formats

```{r}
weights <- rnorm(512L)
n_bytes_f32 <- length(weights) * 4L

formats <- list(
  Q4_0 = list(q = quantize_q4_0, dq = dequantize_row_q4_0),
  Q8_0 = list(q = quantize_q8_0, dq = dequantize_row_q8_0),
  Q4_K = list(q = quantize_q4_K, dq = dequantize_row_q4_K),
  Q6_K = list(q = quantize_q6_K, dq = dequantize_row_q6_K)
)

n <- length(weights)
cat(sprintf("%-8s %6s %8s %10s\n", "Format", "Bytes", "Ratio", "MaxError"))
cat(strrep("-", 40), "\n")
for (nm in names(formats)) {
  raw <- formats[[nm]]$q(weights, n_rows = 1L, n_per_row = n)
  rec <- formats[[nm]]$dq(raw, n)
  cat(sprintf("%-8s %6d %8.2fx %10.6f\n",
              nm, length(raw), n_bytes_f32 / length(raw),
              max(abs(rec - weights))))
}
```

---

## 6. Reference (row-level) functions

For block-level operations (one row at a time), use the `*_ref` variants:

```{r}
row <- rnorm(32L)  # exactly one Q4_0 block

raw_row <- quantize_row_q4_0_ref(row, length(row))
rec_row <- dequantize_row_q4_0(raw_row, length(row))
```

These match the C reference implementations in ggml (`ggml-quants.c`) exactly.
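
A quick sanity check ties the two APIs together: calling the bulk quantizer on a single row should reproduce the reference output byte-for-byte. This is a sketch, assuming `quantize_q4_0` dispatches to the same reference routine when no importance matrix is supplied:

```{r}
# Compare the bulk API against the reference row variant on one
# 32-element row (assumption: both emit the same Q4_0 block bytes).
row <- rnorm(32L)

raw_bulk <- quantize_q4_0(row, n_rows = 1L, n_per_row = length(row))
raw_ref  <- quantize_row_q4_0_ref(row, length(row))

identical(raw_bulk, raw_ref)
```

If the two ever differ on your build, the bulk path may be routing through an optimized (non-reference) kernel rather than the reference implementation.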