---
title: "Differentiating Through a Cone Program with diffcp"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Differentiating Through a Cone Program with diffcp}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# What `diffcp` is for

Suppose you have a convex optimization problem whose data — the
constraint matrix, the offset, the objective coefficients — depend
on some upstream quantity (a parameter, a model weight, a
hyperparameter, the output of a previous computation). You solve
the problem and get an optimum. **How does that optimum change
when the data changes?**

`diffcp` answers that question. It treats the optimal solution as
an *implicit function* of the problem data, and returns the
derivative of that function — both in the forward direction
(perturb the data, get the predicted change in the optimum) and as
its adjoint (start from a downstream loss on the optimum and
backpropagate to a gradient on the data).

Three concrete things you can do with it:

* **Sensitivity analysis.** Quantify how much each constraint or
  cost coefficient matters at the optimum.
* **Hyperparameter learning.** Solve a regularized problem inside
  a training loop and let gradient descent on the upstream loss
  learn the regularizer.
* **End-to-end optimization layers.** Embed a CVXR problem as a
  differentiable component of a larger model — see CVXR's
  `Problem$backward()` and `Problem$derivative()` API, which
  internally calls `diffcp`.

The implementation is a port of the Python `diffcp` package
(Agrawal, Barratt, Boyd, Busseti, Moursi, 2019); the numerical
core (cone projections, Jacobians, the `M` operator, LSQR) is
direct C++ from the upstream repository, called from R via
RcppEigen.

```{r setup}
library(diffcp)
library(Matrix)
```

# The mathematical setup

`diffcp` differentiates the *primal–dual conic problem*

$$
\begin{aligned}
\text{minimize}   &\quad c^\top x \\
\text{subject to} &\quad A x + s = b, \quad s \in K,
\end{aligned}
$$

where $K$ is a Cartesian product of the standard cones (zero, the
non-negative orthant, second-order, positive semidefinite,
exponential, and the exponential dual). The dual variable
attached to the cone constraint is $y$; together $(x, y, s)$ is
the primal–dual triple.

Differentiating "the optimum" turns out to mean something precise:

1. The KKT conditions can be lifted into a *homogeneous self-dual
   embedding* (Ye–Todd–Mizuno) in which optimality, primal
   infeasibility, and dual infeasibility all become statements
   about a single residual map $N(z) = ((Q - I) \Pi(z) + I) z$,
   where $Q$ is a skew-symmetric block of the data and $\Pi$ is
   the projection onto $\mathbb{R}^n \times K^\ast \times
   \mathbb{R}_+$.
2. At a *regular point* — informally, an optimum away from
   active-set transitions and degeneracies — the implicit-function
   theorem applies, and the optimum is a smooth function of the
   problem data.
3. The Jacobian of the optimum with respect to the data is then a
   bounded linear operator, computed by solving a linear system in
   $M = (Q - I)\,D\Pi(z) + I$ using either a dense LDLT
   factorization or a matrix-free LSQR iteration.

This is the construction in Busseti, Moursi, and Boyd (2019);
`diffcp` packages it into two callables.

# What the package computes

Calling `solve_and_derivative(A, b, c, cone_dict)` returns a list
with five entries:

* `x`, `y`, `s` — the primal–dual optimum.
* `D(dA, db, dc)` — the **forward derivative**. Given a
  perturbation of the data, returns `(dx, dy, ds)`: the predicted
  change in the optimum, to first order.
* `DT(dx, dy, ds)` — the **adjoint** of `D`. Given a perturbation
  of the optimum (or, more usefully, a gradient of a downstream
  loss with respect to the optimum), returns `(dA, db, dc)`: the
  gradient of that loss with respect to the data.

Both are functions, not matrices. The dense mode materializes the
Jacobian as an Eigen matrix and factors it once; the LSQR mode
applies the Jacobian and its transpose matrix-free, never
forming the dense object. The trade-off is per-call cost
(`dense` is faster once `M` has been factored) versus memory and
asymptotic scaling (`lsqr` is the only viable choice for very
large problems).

# Worked example 1: a linear program

The smallest illustrative case: pick the cheapest non-negative
mixture of three goods that sums to one,

$$
\min_{x \in \mathbb{R}^3} \; c^\top x
\quad \text{subject to} \quad
\mathbf{1}^\top x = 1,\; x \ge 0.
$$

In SCS standard form $Ax + s = b$, $s \in K = \{0\} \times
\mathbb{R}^3_+$ — one row for the equality (zero cone, $s = 0$)
followed by three rows for the non-negativity constraints (the
slack absorbs the inequality):

```{r lp-data}
A <- sparseMatrix(i = c(1, 2, 1, 3, 1, 4),
                  j = c(1, 1, 2, 2, 3, 3),
                  x = c(1, -1, 1, -1, 1, -1),
                  dims = c(4, 3))
b <- c(1, 0, 0, 0)
c <- c(1, 2, 3)
cone_dict <- list(z = 1L, l = 3L)

res <- solve_and_derivative(A, b, c, cone_dict, mode = "lsqr",
                            tol_gap_abs = 1e-10,
                            tol_gap_rel = 1e-10,
                            tol_feas    = 1e-10)
res$x
```

The optimum is $x \approx (1, 0, 0)$ — all weight on the cheapest
coordinate. Now the derivative. Suppose we slightly increase the
cost of the first coordinate, holding everything else fixed:

```{r lp-D}
dA <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
                   dims = c(4, 3))
db <- numeric(4)
dc <- c(0.001, 0, 0)
d <- res$D(dA, db, dc)
d$dx
```

`D` reports that `dx` is essentially zero — the optimum is robust
to a small bump in $c_1$, because the constraint $\mathbf{1}^\top
x = 1$ pins coordinate 1 against $x \ge 0$. (You'd see a
non-trivial `dy` if you printed the dual perturbation.) The
forward derivative agrees with finite-differencing the perturbed
solve to within solver tolerance:

```{r lp-fd}
res_pert <- solve_only(A, b, c + dc, cone_dict,
                       tol_gap_abs = 1e-10,
                       tol_gap_rel = 1e-10,
                       tol_feas    = 1e-10)
max(abs(d$dx - (res_pert$x - res$x)))
```

The agreement is at the level of the Clarabel solve tolerance,
not the LSQR tolerance — the per-call derivative *is* the exact
implicit-function-theorem answer at this point.

# Worked example 2: a second-order-cone program

A geometric example: project the simplex centroid onto a Lorentz
cone,

$$
\min_{t,\, x} \; t \quad \text{subject to} \quad \|x\|_2 \le t,\;
\mathbf{1}^\top x = 1.
$$

The variables are $(t, x_1, x_2, x_3)$. The block $(t, x)$ lives
in a single 4-dimensional second-order cone; the equality goes in
a zero cone of size 1. Since $t$ is being minimized against a
norm constraint, the optimum lies on the cone boundary
$\|x\|_2 = t$, which is exactly the kind of active constraint
that makes derivatives non-trivial.

```{r soc-data}
A <- sparseMatrix(
  i = c(2, 1, 3, 1, 4, 1, 5),
  j = c(1, 2, 2, 3, 3, 4, 4),
  x = c(-1, 1, -1, 1, -1, 1, -1),
  dims = c(5, 4))
b <- c(1, 0, 0, 0, 0)
c <- c(1, 0, 0, 0)
cone_dict <- list(z = 1L, q = 4L)

res <- solve_and_derivative(A, b, c, cone_dict, mode = "dense",
                            tol_gap_abs = 1e-10,
                            tol_gap_rel = 1e-10,
                            tol_feas    = 1e-10)
res$x
```

The optimum is $x = (1/3, 1/3, 1/3)$, $t = \sqrt{3}/3$, and the
SOC constraint $\|x\|_2 = t$ holds with equality. The cone
projection's Jacobian at a boundary point is the technical heart
of `diffcp` — it's a smoothed projector that interpolates between
the identity (interior of the cone) and zero (interior of the
polar cone). The package handles all the boundary geometry
automatically; from the R side you just call `D` and `DT`.

# Worked example 3: a small semidefinite program

The minimum-eigenvalue SDP: find a unit-trace PSD matrix that
minimizes $\langle C, X \rangle$ for given $C$. With
$C = \mathrm{diag}(1, 0, 0, 0, -1)$, the optimum is rank-one and
puts all weight on the eigenvector of $C$ with most-negative
eigenvalue.

PSD support requires a vectorization convention. `diffcp` uses
the SCS convention: lower-triangular column-major, with
off-diagonals scaled by $\sqrt{2}$. The package exports
`vec_symm()` and `unvec_symm()` for round-trips between symmetric
matrices and their packed vector form. The trace-coefficient row
in the equality constraint is sparse — only the diagonal entries
contribute, and we build it explicitly:

```{r sdp-data}
DIM <- 5L
n   <- DIM * (DIM + 1L) %/% 2L          # 15

trace_row <- numeric(n)
pos <- 1L
for (col in seq_len(DIM)) {
  trace_row[pos] <- 1.0
  pos <- pos + (DIM - col + 1L)         # next diagonal in column-major lower-tri
}

A <- rbind(
  Matrix(matrix(trace_row, nrow = 1L), sparse = TRUE),
  -Diagonal(n))
A <- as(A, "CsparseMatrix")
b <- c(1.0, numeric(n))

C <- matrix(0, DIM, DIM)
C[1, 1] <- 1; C[DIM, DIM] <- -1
c_vec <- vec_symm(C)

cone_dict <- list(z = 1L, s = DIM)

res <- solve_only(A, b, c_vec, cone_dict,
                  tol_gap_abs = 1e-10,
                  tol_gap_rel = 1e-10,
                  tol_feas    = 1e-10)
X <- unvec_symm(res$x, DIM)
sum(diag(X))
range(eigen(X, symmetric = TRUE, only.values = TRUE)$values)
```

`X` has trace 1 (to solver tolerance) and is PSD (smallest
eigenvalue $\ge -10^{-9}$). Behind the scenes, when the solve
goes through Clarabel — which packs PSD entries in upper-triangular
column-major order, opposite to SCS — `diffcp` permutes the rows
of $A$ and the entries of $b$ on the way in, and inverse-permutes
the dual $y$ and slack $s$ on the way out, so the user-facing
convention stays SCS throughout.

# A practical sensitivity-analysis example

Here is the canonical use case: **shadow prices**. Take a small
LP — a profit-maximizing production plan with a labor and a
materials budget — and ask, *given the optimum, by how much would
the profit change per extra unit of labor?* That number is the
dual variable on the labor constraint. `diffcp` lets you compute
the same sensitivity for *any* perturbation of the data, including
ones that have no closed-form dual interpretation.

We solve

$$
\max_{x_1, x_2 \ge 0}\; 3 x_1 + 5 x_2
\quad \text{s.t.}\quad
2 x_1 + x_2 \le 100,\;
x_1 + 3 x_2 \le 90,
$$

and ask: how does the optimum respond to a small change in the
labor-budget right-hand side, $b_1 = 100$?

We rewrite the maximization as a minimization of $-c^\top x$, and
encode the two budget inequalities and the two non-negativity
inequalities as four rows of a non-negative orthant slack
constraint:

```{r shadow-data}
A_lp <- sparseMatrix(i = c(1, 2, 3,  1, 2, 4),
                     j = c(1, 1, 1,  2, 2, 2),
                     x = c(2, 1, -1, 1, 3, -1),
                     dims = c(4, 2))
b_lp <- c(100, 90, 0, 0)
c_lp <- c(-3, -5)
cones_lp <- list(l = 4L)

res <- solve_and_derivative(A_lp, b_lp, c_lp, cones_lp, mode = "dense",
                            tol_gap_abs = 1e-10, tol_gap_rel = 1e-10,
                            tol_feas    = 1e-10)
res$x      # optimal production plan
-sum(c_lp * res$x)   # max profit
```

The optimum makes $x_1 = 42$, $x_2 = 16$, profit $= 206$. Now use
`D` to ask how the optimal $x$ would change if we got one extra
unit of labor ($\Delta b_1 = 1$):

```{r shadow-D}
dA <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
                   dims = c(4, 2))
db <- c(1, 0, 0, 0)
dc <- numeric(2)
d  <- res$D(dA, db, dc)
d$dx                       # change in optimal production
-sum(c_lp * d$dx)          # change in profit
```

`d$dx` says that an extra unit of labor increases $x_1$ by 0.6
and decreases $x_2$ by 0.2, and the corresponding profit change
is $0.8$ per unit of labor: the labor-budget shadow price,
recovered from `D` rather than from the dual variable $y$.

The advantage of going through `D` instead of the dual: it
generalizes. We can ask the same question about non-trivial
perturbations of $A$ (a productivity gain on a particular
input), `c` (a price change on a particular product), or any
combination. The dual variable answers only the canonical
RHS-perturbation question; `D` answers all of them through the
same single call.

`DT` answers the *opposite* question. Suppose we have a
downstream loss $L(x^\ast)$ — say, a smooth function of the
optimum with known gradient $\partial L / \partial x^\ast$. The
adjoint computes the gradient of $L$ with respect to *every*
data entry $(A, b, c)$ in one shot:

```{r shadow-DT}
## Suppose dL/dx* is given (e.g. from autodiff in a larger model).
dL_dx <- c(1.0, -0.5)
dL_dy <- numeric(length(res$y))
dL_ds <- numeric(length(res$s))
da    <- res$DT(dL_dx, dL_dy, dL_ds)
da$db                 # gradient of L w.r.t. each entry of b
da$dc                 # gradient of L w.r.t. each entry of c
```

In a training loop you'd feed `da$db` and `da$dc` straight into a
parameter update.

# Forward versus adjoint: when to use each

Both `D` and `DT` access the same underlying Jacobian, but their
cost profile differs:

* **`D(dA, db, dc)`** is right-multiplication by the Jacobian.
  Cost is one solve of the same linear system per call. Use
  `D` when you have a *small number of perturbation directions*
  (e.g., parameter sweeps) and want to forecast the effect on
  the optimum.
* **`DT(dx, dy, ds)`** is right-multiplication by the Jacobian's
  transpose. Same per-call cost, but the answer is a *gradient
  with respect to all data entries simultaneously*. Use `DT`
  when you have an upstream gradient on the optimum (typically
  from an autodiff system) and need to propagate it through the
  conic optimization.

`mode = "dense"` factors `M` once on the first call and amortizes
the cost across subsequent `D` / `DT` invocations on the same
problem, which is the right choice when you'll evaluate either
operator many times.

# Choosing `dense` versus `lsqr`

`solve_and_derivative` accepts two modes for the inner
linear-algebra step:

* **`mode = "lsqr"`** (default): matrix-free conjugate-gradient-
  style iterative solve against the `M` operator. Memory is
  $O(\text{nnz}(A))$; iteration count grows with the conditioning
  of $M$. Appropriate for large sparse problems.
* **`mode = "dense"`**: build $M$ as a dense matrix, factor with
  Eigen's LDLT, and reuse the factorization across `D` / `DT`
  calls. Cost is $O(N^3)$ once, $O(N^2)$ per call, where
  $N = m + n + 1$. Appropriate for small problems and for
  workflows that re-apply `D` / `DT` many times.

Both modes produce the same answer to within solver tolerance.
The dense path matches Python's
$(M^\top M).\mathrm{ldlt}().\mathrm{solve}(M^\top \cdot
\mathrm{rhs})$ bit-for-bit; the LSQR path agrees with the dense
path to roughly $10^{-6}$ at the default `atol = btol = 1e-8`.

# Conventions

* **PSD vectorization.** `vec_symm()` packs the lower triangle in
  column-major order with off-diagonals scaled by $\sqrt{2}$
  (the SCS convention). `unvec_symm()` is the inverse. This
  scaling makes the inner product $\langle \mathrm{vec}(A),
  \mathrm{vec}(B) \rangle$ equal $\mathrm{tr}(AB)$ for symmetric
  $A, B$, which is what the cone-projection Jacobian expects.
* **PSD with Clarabel.** Clarabel uses upper-triangular column-
  major; `diffcp` permutes rows of $A$ and entries of $b$ to
  Clarabel order on the way in and inverse-permutes the dual
  $y$ and slack $s$ on the way out. The user-facing convention
  is SCS throughout.
* **Exponential cones.** Each `ep` cone consumes three rows of
  $A$ (one cone per `count`); the dual `ed` cone is handled
  separately or recovered via Moreau's decomposition,
  $\Pi_{K^\ast}(x) = x + \Pi_K(-x)$.
* **Quadratic objectives.** `solve_only(P = P)` accepts a
  positive-semidefinite `P` for the forward solve. The
  derivative path through `solve_and_derivative` does **not**
  support `P` directly; CVXR's reduction chain canonicalizes QPs
  into auxiliary-variable conic problems before reaching the
  solver, so the loss is invisible at the CVXR layer.

# Connection to CVXR

R users typically don't call `diffcp` directly. CVXR (≥ 1.8.2)
exposes a higher-level API:

```r
library(CVXR)
p <- Parameter(); x <- Variable(); value(p) <- 3
prob <- Problem(Minimize((x - 2 * p)^2), list(x >= 0))
psolve(prob, requires_grad = TRUE)
backward(prob)
gradient(p)             # 2  (since x* = 2p)
delta(p) <- 1e-3
derivative(prob)
delta(x)                # 0.002
```

The `requires_grad = TRUE` flag routes the solve through the
DIFFCP solver wrapper, which calls `diffcp::solve_and_derivative`
under the hood. CVXR then handles the chain rule through any
problem reductions (DGP log-transform, Complex2Real splitting,
parameter canonicalization) and reports gradients keyed by
`Parameter` rather than by raw data entries.

When you want the lower-level data-perturbation interface — for
research, for fixtures that bypass CVXR canonicalization, or for
problems already in conic standard form — `diffcp` is the right
entry point.

# Citation

If you use `diffcp` in academic work, please cite **both** the R
package and the original paper. See `citation("diffcp")` for the
canonical BibTeX entries.

* **R package**

  > Narasimhan, B.; Agrawal, A.; Barratt, S.; Boyd, S.; Busseti, E.;
  > Moursi, W. *diffcp: Differentiating Through Cone Programs*.
  > R package, 2026. <https://github.com/bnaras/diffcp>

* **Original paper**

  > Agrawal, A.; Barratt, S.; Boyd, S.; Busseti, E.; Moursi, W.
  > "Differentiating through a cone program."
  > *Journal of Applied and Numerical Optimization*, **1**(2),
  > 107–115, 2019.