---
title: "Quantization"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Quantization}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(eval = TRUE)
```

ggmlR exposes the full set of ggml quantization formats: from legacy Q4_0/Q8_0 to modern K-quants and IQ (importance-matrix) quants. Quantization reduces model size and speeds up inference, especially on GPU.

```{r}
library(ggmlR)
```

---

## 1. Quantization formats

| Family | Formats | Bits/weight | Notes |
|--------|---------|-------------|-------|
| Legacy | Q4_0, Q4_1, Q5_0, Q5_1, Q8_0 | 4–8 | Simple block quants |
| K-quant | Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_K | 2–8 | Better quality/size trade-off |
| IQ | IQ1_S, IQ1_M, IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M, IQ3_XXS, IQ3_S, IQ4_NL, IQ4_XS | 1–4 | Requires importance matrix |
| Ternary | TQ1_0, TQ2_0 | ~1.5–2 | Ternary weights |
| Microscaling | MXFP4 | 4 | Block floating point |

---

## 2. Quantize and dequantize

```{r}
# Original float weights (length must be a multiple of the block size, typically 32)
weights <- rnorm(256L)

# Quantize to Q4_0
raw_q4 <- quantize_q4_0(weights, n_rows = 1L, n_per_row = length(weights))

cat("Original size: ", length(weights) * 4L, "bytes\n")
cat("Q4_0 size:     ", length(raw_q4), "bytes\n")
cat("Compression:   ", round(length(weights) * 4L / length(raw_q4), 1), "x\n")

# Dequantize back to float
recovered <- dequantize_row_q4_0(raw_q4, length(weights))
cat("Max abs error: ", max(abs(recovered - weights)), "\n")
```

---

## 3. K-quants (better quality)

K-quants use super-blocks with separate scales, yielding better quality at the same bit width:

```{r}
weights <- rnorm(512L)

# Q4_K: 4-bit K-quant
raw_q4k <- quantize_q4_K(weights, n_rows = 1L, n_per_row = length(weights))
rec_q4k <- dequantize_row_q4_K(raw_q4k, length(weights))
cat("Q4_K max error:", max(abs(rec_q4k - weights)), "\n")

# Q8_0: 8-bit (near-lossless)
raw_q8 <- quantize_q8_0(weights, n_rows = 1L, n_per_row = length(weights))
rec_q8 <- dequantize_row_q8_0(raw_q8, length(weights))
cat("Q8_0 max error:", max(abs(rec_q8 - weights)), "\n")
```

---

## 4. IQ quants: importance matrix

IQ formats accept an importance matrix that prioritises accuracy on frequently-used weights. Without an importance matrix they fall back to uniform quantization.

```{r}
weights <- rnorm(512L)
importance <- abs(weights)^2  # example: squared weight magnitude as importance

# IQ4_XS: 4-bit with importance matrix
raw_iq4 <- quantize_iq4_xs(weights, n_rows = 1L, n_per_row = length(weights),
                           imatrix = importance)
rec_iq4 <- dequantize_row_iq4_xs(raw_iq4, length(weights))
cat("IQ4_XS max error:", max(abs(rec_iq4 - weights)), "\n")
```

---

## 5. Comparing formats

```{r}
weights <- rnorm(512L)
n_bytes_f32 <- length(weights) * 4L

formats <- list(
  Q4_0 = list(q = quantize_q4_0, dq = dequantize_row_q4_0),
  Q8_0 = list(q = quantize_q8_0, dq = dequantize_row_q8_0),
  Q4_K = list(q = quantize_q4_K, dq = dequantize_row_q4_K),
  Q6_K = list(q = quantize_q6_K, dq = dequantize_row_q6_K)
)

n <- length(weights)
cat(sprintf("%-8s %6s %8s %10s\n", "Format", "Bytes", "Ratio", "MaxError"))
cat(strrep("-", 40), "\n")
for (nm in names(formats)) {
  raw <- formats[[nm]]$q(weights, n_rows = 1L, n_per_row = n)
  rec <- formats[[nm]]$dq(raw, n)
  cat(sprintf("%-8s %6d %8.2fx %10.6f\n",
              nm, length(raw), n_bytes_f32 / length(raw),
              max(abs(rec - weights))))
}
```

---

## 6. Reference (row-level) functions

For block-level operations (one row at a time), use the `*_ref` variants:

```{r}
row <- rnorm(32L)  # exactly one Q4_0 block

raw_row <- quantize_row_q4_0_ref(row, length(row))
rec_row <- dequantize_row_q4_0(raw_row, length(row))
```

These match the C reference implementations in ggml (`ggml-quants.c`) exactly.
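
A quick sanity check ties the two APIs together: calling the bulk quantizer on a single row should reproduce the reference output byte-for-byte. This is a sketch, assuming `quantize_q4_0` dispatches to the same reference routine when no importance matrix is supplied:

```{r}
# Compare the bulk API against the reference row variant on one
# 32-element row (assumption: both emit the same Q4_0 block bytes).
row <- rnorm(32L)

raw_bulk <- quantize_q4_0(row, n_rows = 1L, n_per_row = length(row))
raw_ref  <- quantize_row_q4_0_ref(row, length(row))

identical(raw_bulk, raw_ref)
```

If the two ever differ on your build, the bulk path may be routing through an optimized (non-reference) kernel rather than the reference implementation.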