Interactive demos

Four of my papers, running live in your browser on small toy problems. Pick a demo below, drag the clouds and move the sliders.

Subspace Robust Wasserstein

This demo brings to life Subspace Robust Wasserstein Distances (Paty and Cuturi, ICML 2019). Drag either point cloud to move it around.

Optimal transport measures the effort needed to morph one cloud of points, the measure $\mu$, into the other one, the measure $\nu$. In high dimensions this effort is often dominated by a handful of directions that carry little real information. The paper asks a sharper question: along which low-dimensional view are the two clouds the hardest to align? Writing $P_E$ for the orthogonal projection onto a subspace $E$ of dimension $k$, the subspace robust distance is

$$ \mathcal{S}_k(\mu,\nu)^2 \;=\; \max_{\dim E = k}\ \min_{\pi \in \Pi(\mu,\nu)}\ \int \lVert P_E(x-y) \rVert^2 \, \mathrm{d}\pi(x,y). $$

The data here is two dimensional and we take $k = 1$, so the worst view is a single line, drawn in yellow. The algorithm goes back and forth until its two steps agree: solve the transport problem along the current line, then turn the line towards the direction that transport stretches the most. The dashed line is ordinary PCA, the direction of largest spread, which usually points somewhere else.

Interactive figure. Sliders: spread of Y, regularization ε. Legend: first cloud (μ), second cloud (ν), worst-case line (k = 1), PCA. Readouts: robust distance S₁, ordinary Wasserstein W₂, angle of the worst-case line.

Everything runs live on a few dozen points. The transport step uses entropic optimal transport (Sinkhorn), and the new line is the leading eigenvector of the transport-weighted second-moment matrix of the displacements $x - y$. The projection matrix $P$ ranges over the relaxed set $\{P : 0 \preceq P \preceq I,\ \operatorname{tr} P = 1\}$, exactly as in the paper. Original Python code.
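The back-and-forth described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the demo's source or the paper's reference code: the function names `sinkhorn_plan` and `worst_case_line`, the uniform weights, the starting direction, the stopping rule and the toy clouds are all hypothetical choices, and the returned value is only an entropic approximation of $\mathcal{S}_1^2$.

```python
import numpy as np

def sinkhorn_plan(a, b, cost, eps, n_iter=500):
    """Entropic optimal transport plan for weights a, b and a cost matrix."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def worst_case_line(X, Y, eps=0.5, n_outer=100, tol=1e-10):
    """Alternate between transport along a line and re-orienting the line (k = 1)."""
    n, m = len(X), len(Y)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # assumed uniform weights
    D = X[:, None, :] - Y[None, :, :]                 # displacements x_i - y_j
    d = X.shape[1]
    theta = np.ones(d) / np.sqrt(d)                   # arbitrary starting direction
    value = 0.0
    for _ in range(n_outer):
        # Step 1: transport for the squared displacement along the current line.
        plan = sinkhorn_plan(a, b, (D @ theta) ** 2, eps)
        # Step 2: transport-weighted second-moment matrix of the displacements.
        V = np.einsum("ij,ijk,ijl->kl", plan, D, D)
        eigvals, eigvecs = np.linalg.eigh(V)          # eigenvalues in ascending order
        new_theta, value = eigvecs[:, -1], eigvals[-1]
        converged = abs(new_theta @ theta) > 1 - tol  # sign of theta is irrelevant
        theta = new_theta
        if converged:
            break
    return theta, value

# Toy data: the second cloud is the first one shifted along the horizontal axis.
rng = np.random.default_rng(0)
X = 0.3 * rng.normal(size=(20, 2))
Y = X + np.array([3.0, 0.0])
theta, value = worst_case_line(X, Y)
print("worst-case direction:", theta, "approximate S_1^2:", value)
```

On this toy example the shift dominates the displacements, so the recovered line should be close to horizontal. Smaller values of `eps` would need a log-domain Sinkhorn to stay numerically stable.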

Regularity as Regularization

This demo brings to life Regularity as Regularization: Smooth and Strongly Convex Brenier Potentials in Optimal Transport (Paty, d'Aspremont and Cuturi, AISTATS 2020).

We want the optimal transport map $T$ that carries a source distribution $P$ onto a target distribution $Q$, and we only see samples of each. Brenier's theorem says such a map is the gradient of a convex potential, $T = \nabla f$. The paper estimates the map by looking for a potential whose gradient pushes $P$ as close to $Q$ as possible, while staying regular:

$$ \min_{f \,\in\, \mathcal{F}_{\mu,L}}\ W_2^2\big( \nabla f \,\#\, P,\ Q \big), $$

where $\mathcal{F}_{\mu,L}$ is the set of potentials that are $\mu$-strongly convex and $L$-smooth, $\nabla f \,\#\, P$ is the distribution obtained by sending every point $x$ to $\nabla f(x)$, and $W_2$ is the Wasserstein distance. Imposing this regularity is the same as asking the map to be bi-Lipschitz: for all points,

$$ \mu\,\lVert x - x' \rVert \;\le\; \lVert T(x) - T(x') \rVert \;\le\; L\,\lVert x - x' \rVert. $$

In plain terms, this bounds how much the map is allowed to stretch or squeeze distances: it never stretches the distance between two points by more than a factor $L$, and never shrinks it below a fraction $\mu$ of its original value. Close points stay close, and far points stay far.

In one dimension this becomes especially transparent. A transport map is just an increasing curve, so regularity becomes a bound on the slope between any two observed source points:

$$ \mu \;\le\; \frac{T(x') - T(x)}{x' - x} \;\le\; L . $$

The lower bound $\mu$ prevents the fitted map from flattening out and collapsing distant source points together. The upper bound $L$ prevents it from making sudden steep jumps to chase noisy observations. Together they say: fit the data, but only with a curve whose local stretching stays plausible.

In the simulation, the smooth curves show the underlying source and target distributions $P$ and $Q$. What the estimator actually receives is much poorer: source samples $x_i$ from $P$, and noisy observations of where the true map sends them, $y_i = T^*(x_i) + \varepsilon_i$, with $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$ iid.

The red curve is the least-squares fit to those observations under the slope constraints,

$$ \min_T \sum_i \bigl(T(x_i) - y_i\bigr)^2 \quad\text{subject to}\quad \mu \le \frac{T(x') - T(x)}{x' - x} \le L . $$

This is isotonic regression with two extra regularity rules. Plain isotonic regression only asks the map to keep increasing, which corresponds to $\mu = 0$ and $L = \infty$. Here the bounds also stop the map from becoming too flat or too steep, which is how the regularized fit filters out the noise.

Interactive figure. Slider: noise σ. Fit couple (μ, L) = (0.15, 2.20); true couple (μ*, L*) = (0.20, 3.00). Legend: source P, target Q, true map, regularized map (μ ≤ T' ≤ L), plain isotonic (μ=0, L=∞), unconstrained fit. Readouts: distance to the true map for the regularized, isotonic and unconstrained fits.

About the two couples. The true couple $(\mu^*, L^*)$ is the regularity of the hidden true map: the source $P$ is uniform, and pushing it through that map gives the target $Q$, so this couple is what reshapes $Q$. The amber curve is the clean $Q$; the ticks underneath it are the samples $y_i$, scattered around $Q$ by the noise σ. The fit couple $(\mu, L)$ is instead the regularity you assume when reconstructing the map, and the Best fit control looks for the couple that recovers the true map best. The fit is computed exactly in the browser by coordinate descent on the slopes; in higher dimension the same estimator becomes the convex QCQP of the paper.
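The slope-constrained least-squares fit can be sketched as follows. This is a hedged illustration, not the demo's source: the function name `fit_slope_constrained`, the parametrization by an intercept plus consecutive slopes, the number of sweeps and the toy map $T^*(x) = 0.2x + 1.4x^2$ (whose slope ranges over $[0.2, 3]$ on $[0,1]$) are assumptions made for the example. Bounding consecutive slopes is enough, because the slope between any two observed points is a weighted average of the consecutive slopes between them.

```python
import numpy as np

def fit_slope_constrained(x, y, mu, L, n_sweeps=500):
    """Least-squares fit of T(x_i) with every consecutive slope in [mu, L]."""
    order = np.argsort(x)
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]
    gaps = np.diff(x)                    # assumed strictly positive (distinct x_i)
    n = len(x)
    # T(x_i) = a + sum_{k < i} s_k * gaps[k], unknowns z = (a, s_0, ..., s_{n-2}).
    A = np.zeros((n, n))
    A[:, 0] = 1.0
    A[:, 1:] = np.tril(np.ones((n, n - 1)), k=-1) * gaps
    lower = np.concatenate(([-np.inf], np.full(n - 1, mu)))
    upper = np.concatenate(([np.inf], np.full(n - 1, L)))
    # Feasible start: observed slopes clipped to the allowed range.
    z = np.concatenate(([y[0]], np.clip(np.diff(y) / gaps, mu, L)))
    col_sq = (A ** 2).sum(axis=0)
    r = A @ z - y                        # current residual
    for _ in range(n_sweeps):
        for j in range(n):
            # Exact one-dimensional minimization, then clipping to the box.
            new = np.clip(z[j] - A[:, j] @ r / col_sq[j], lower[j], upper[j])
            r += A[:, j] * (new - z[j])
            z[j] = new
    return x, A @ z, z[1:]

# Toy data: hypothetical true map with slopes between 0.2 and 3 on [0, 1].
rng = np.random.default_rng(0)
xs = rng.random(25)
ys = 0.2 * xs + 1.4 * xs ** 2 + 0.1 * rng.normal(size=25)
x_sorted, fitted, slopes = fit_slope_constrained(xs, ys, mu=0.15, L=2.20)
print("slope range of the fit:", slopes.min(), slopes.max())
```

Setting `mu=0.0` and `L=np.inf` in the same function gives a plain isotonic fit, which makes the effect of the two bounds easy to compare.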

Regularized OT is Ground Cost Adversarial

This demo illustrates Regularized Optimal Transport is Ground Cost Adversarial (Paty and Cuturi, ICML 2020). We work in the discrete setting, with a source distribution $\mu$ supported on points $i=1,\dots,n$ and a target distribution $\nu$ supported on points $j=1,\dots,m$. A transport plan $\pi$ is a matrix whose entry $\pi_{ij}\ge 0$ says how much mass is moved from $i$ to $j$; moving it there incurs a cost $c_0(i,j)\,\pi_{ij}$. A plan is admissible if it ships out exactly the mass each source has and delivers exactly the mass each target wants, that is $\sum_j \pi_{ij}=\mu_i$ and $\sum_i \pi_{ij}=\nu_j$. We write $\Pi(\mu,\nu)$ for this set of admissible plans.

Classical optimal transport picks the admissible plan of least total cost, and we write $\mathcal{T}_c(\mu,\nu)$ for that minimum value when the ground cost is $c$:

$$ \mathcal{T}_c(\mu,\nu) \;=\; \min_{\pi \in \Pi(\mu,\nu)} \langle c,\pi\rangle, \qquad \langle c,\pi\rangle = \sum_{ij} c(i,j)\,\pi_{ij}. $$

In practice one rarely solves this raw problem: a small convex regularizer $\varepsilon R(\pi)$ is added to the objective so that the solution is smoother and faster to compute. The message of the paper is that this regularizer is not just a numerical convenience. Adding $\varepsilon R(\pi)$ to the transport cost under a fixed prior ground cost $c_0$ is exactly equivalent to letting an adversary perturb the ground cost itself, away from $c_0$, and then solving ordinary (unregularized) transport $\mathcal{T}_c$ against that worst-case cost. In the discrete setting this equivalence reads

$$ \min_{\pi \in \Pi(\mu,\nu)} \langle c_0,\pi\rangle + \varepsilon R(\pi) \;=\; \sup_c \mathcal{T}_c(\mu,\nu) - \varepsilon R^*\!\left({c-c_0\over\varepsilon}\right). $$

On the left is the usual regularized transport. On the right, an adversary searches over ground costs $c$: it is rewarded for raising the transport cost $\mathcal{T}_c(\mu,\nu)$, but pays the penalty $\varepsilon R^*((c-c_0)/\varepsilon)$ for straying from the prior $c_0$. The shape of that penalty, and so the kind of perturbation the adversary can afford, is set entirely by $R^*$, the convex conjugate of the regularizer. Different regularizers therefore produce qualitatively different adversaries; switch between them with the buttons to see the specialized form of the identity and the cost it learns.

$$\min_{\pi\in\Pi(\mu,\nu)}\langle c_0,\pi\rangle+\varepsilon\sum_{ij}\pi_{ij}(\log\pi_{ij}-1) = \max_c \mathcal{T}_c(\mu,\nu)-\varepsilon\sum_{ij}\exp\!\left({c_{ij}-{c_0}_{ij}\over\varepsilon}\right).$$

Entropic regularization uses $R(\pi)=\sum_{ij}\pi_{ij}(\log\pi_{ij}-1)$, with conjugate $R^*(s)=\sum_{ij}\exp(s_{ij})$. The penalty grows exponentially as a cost rises above $c_0$, so the adversary nudges every cost a little but pays steeply for large moves: a soft perturbation spread around $c_0$.
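The entropic identity can be checked numerically on a tiny problem. The sketch below is an illustration under stated assumptions, not the demo's code: the point clouds, the value of $\varepsilon$ and the variable names are hypothetical. It takes the candidate adversarial cost from the first-order condition $\exp((c_{ij}-{c_0}_{ij})/\varepsilon)=\pi_{ij}$ of the right-hand side, that is $c = c_0 + \varepsilon\log\pi$ with $\pi$ the Sinkhorn plan. That cost turns out to be separable, $c_{ij} = f_i + g_j$, so every admissible plan has the same cost under it and no linear program is needed to evaluate $\mathcal{T}_c$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, eps = 4, 5, 0.5
x, y = rng.random((n, 2)), rng.random((m, 2))
mu, nu = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
c0 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)   # prior ground cost

# Sinkhorn iterations for min <c0, pi> + eps * sum pi (log pi - 1).
K = np.exp(-c0 / eps)
u = np.ones(n)
for _ in range(2000):
    v = nu / (K.T @ u)
    u = mu / (K @ v)
pi = u[:, None] * K * v[None, :]

# Left-hand side: value of the entropy-regularized problem.
lhs = (c0 * pi).sum() + eps * (pi * (np.log(pi) - 1.0)).sum()

# Candidate adversarial cost from the first-order condition.
c_adv = c0 + eps * np.log(pi)
f, g = eps * np.log(u), eps * np.log(v)
separable_gap = np.abs(c_adv - (f[:, None] + g[None, :])).max()

# For a separable cost f_i + g_j, any admissible plan costs <f, mu> + <g, nu>.
transport_value = f @ mu + g @ nu
rhs = transport_value - eps * np.exp((c_adv - c0) / eps).sum()

print("regularized value:", lhs)
print("adversarial value:", rhs)
```

Since the right-hand side can never exceed the left-hand side for any cost $c$, matching values indicate that this candidate cost attains the supremum on this instance.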

Interactive figure. Slider: regularization ε. Legend: source μ, target ν, transport plan, large cost, small cost, cost reduced, cost increased. Readouts: regularizer (Entropy), plan entropy, active links, adversarial cost spread.

Weak Optimal Transport in Economics

This demo illustrates Algorithms for Weak Optimal Transport with an Application to Economics (Paty, Choné and Kramarz, 2022). The economic question is: who works for whom, and what does the matching reveal about workers' skills and firms' technologies?

Picture an economy with many firms and many workers. Each firm has a technology $x$, and each worker has a skill $y$. A matching says how much worker mass of each type is employed by each firm type. Different pairings produce different amounts of output, and the model looks for the matching that maximizes total production across the whole economy. The paper studies four versions of this problem, from the simplest pairing rule to the richest, and the demo lets you compare them.

Production starts from a constant-elasticity-of-substitution (CES) function. A worker of type $y=(y_1,y_2)$ carries a mix of two skills, and a firm whose technology $x$ is the weight vector $\alpha=(\alpha_1,\alpha_2)$ weights them by how much it relies on each:

$$ F(\alpha,y) = \left(\alpha_1 y_1^\rho + \alpha_2 y_2^\rho\right)^{1/\rho}. $$

Maximizing total production over all matchings is exactly an optimal transport problem, with transport cost $c=-F$: the more a firm-worker pairing produces, the cheaper it is to transport. The four models below all share this production function and differ only in what they allow a "match" to be.
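To make the link between production and transport cost concrete, here is a small sketch that builds the production matrix $F(\alpha_i, y_j)$ and the cost $c=-F$ for three firms and three workers. The values of $s$, $t$ and $\rho$ are hypothetical, and the brute-force search over permutations is only valid for this tiny case with equal masses and as many firms as workers, where an optimal plan can be taken to be a one-to-one assignment.

```python
import itertools
import numpy as np

def ces(alpha, y, rho):
    """CES production (alpha_1 y_1^rho + alpha_2 y_2^rho)^(1/rho), vectorized."""
    alpha, y = np.asarray(alpha, float), np.asarray(y, float)
    return (alpha * y ** rho).sum(axis=-1) ** (1.0 / rho)

# Hypothetical technologies alpha_i = (1 - s_i, s_i) and skills y_j = (1 - t_j, t_j).
s = np.array([0.1, 0.5, 0.9])
t = np.array([0.2, 0.5, 0.8])
alphas = np.stack([1 - s, s], axis=1)
skills = np.stack([1 - t, t], axis=1)

F = ces(alphas[:, None, :], skills[None, :, :], rho=0.5)   # F[i, j] = F(alpha_i, y_j)
cost = -F                                                   # transport cost c = -F

# Brute force over one-to-one assignments (firm i hires worker perm[i]).
rows = np.arange(3)
best = max(itertools.permutations(range(3)),
           key=lambda perm: F[rows, list(perm)].sum())
best_total = F[rows, list(best)].sum()
print("production matrix:\n", F)
print("best assignment:", best, "total production:", best_total)
```

For larger or unequal-mass problems a linear programming or Sinkhorn solver would replace the permutation search.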

Classical optimal transport treats production as pairwise: a firm of type $x$ hiring a worker of type $y$ produces $F(x,y)$. Entropic OT keeps this pairwise production and adds a small entropy reward. Weak optimal transport (WOT) instead lets the production of a firm depend on the whole group of workers it hires, written as a distribution $\pi_x$. In barycentric WOT, the version used in the demo, only the average skill of that group matters. In WOTUK (weak optimal transport with unnormalized kernel), the per-firm kernel $\pi_x$ is no longer required to sum to one, so a firm's size (its total mass) is also chosen by the model instead of being fixed in advance. The four buttons in the demo walk through these in turn.

Let $\mu$ be the distribution of firm types, $\nu$ the distribution of worker types, and $\Pi(\mu,\nu)$ the set of matchings whose two marginals are fixed. Classical optimal transport chooses the matching $\pi$ that maximizes total production:

$$ \operatorname{OT}(\mu,\nu) = \sup_{\pi \in \Pi(\mu,\nu)} \int_{X \times Y} F(x,y)\,\mathrm{d}\pi(x,y). $$

Entropic OT adds a small entropy reward. It is easier and smoother to compute, but the economic meaning is still pairwise: firms mainly hire nearby worker types, with some blur around the best matches.

$$ \operatorname{EOT}_{\varepsilon}(\mu,\nu) = \sup_{\pi \in \Pi(\mu,\nu)} \left\{ \int F(x,y)\,\mathrm{d}\pi(x,y) - \varepsilon \int \log\!\left(\frac{\mathrm{d}\pi}{\mathrm{d}\mu\,\mathrm{d}\nu}\right)\,\mathrm{d}\pi \right\}. $$

Weak OT changes the object that enters production. Disintegrate the matching as $\mathrm{d}\pi(x,y)=\mathrm{d}\mu(x)\,\mathrm{d}\pi_x(y)$. The measure $\pi_x$ is the distribution of workers hired by firms of type $x$. Production may now depend on the whole workforce:

$$ \operatorname{WOT}(\mu,\nu) = \sup_{\pi \in \Pi(\mu,\nu)} \int_X F(x,\pi_x)\,\mathrm{d}\mu(x). $$

In the barycentric case, $F(x,\pi_x)$ only sees the average skill $ \int y\,\mathrm{d}\pi_x(y) $. In the demo this is

$$ F(x,\pi_x) = \widetilde F\!\left(x,\int_Y y\,\mathrm{d}\pi_x(y)\right), \qquad \widetilde F(\alpha,z) = \left(\alpha_1 z_1^\rho + \alpha_2 z_2^\rho\right)^{1/\rho}. $$

This is what lets a generalist firm combine specialists in two skills and obtain the same aggregate skill as from a generalist worker.
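A two-worker example shows the difference between pairwise and barycentric production. The numbers below ($\alpha = (0.5, 0.5)$, $\rho = 0.5$, two pure specialists hired in equal shares) are hypothetical and chosen only to keep the arithmetic simple.

```python
import numpy as np

def ces(alpha, z, rho):
    """CES production (alpha_1 z_1^rho + alpha_2 z_2^rho)^(1/rho)."""
    alpha, z = np.asarray(alpha, float), np.asarray(z, float)
    return (alpha * z ** rho).sum(axis=-1) ** (1.0 / rho)

alpha, rho = np.array([0.5, 0.5]), 0.5        # a generalist firm
specialists = np.array([[1.0, 0.0],           # worker with only skill 1
                        [0.0, 1.0]])          # worker with only skill 2
weights = np.array([0.5, 0.5])                # the kernel pi_x, summing to one

# Pairwise view: average the production of each worker taken separately.
pairwise = weights @ ces(alpha, specialists, rho)

# Barycentric view: evaluate the CES formula on the average skill.
average_skill = weights @ specialists
barycentric = ces(alpha, average_skill, rho)

# A single generalist worker with the same average skill.
generalist = ces(alpha, np.array([0.5, 0.5]), rho)

print(pairwise, barycentric, generalist)
```

Here the pairwise value is 0.25 while the barycentric value is 0.5, the same as for the generalist worker: the team of specialists is credited with its combined skill mix.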

WOTUK goes one step further. Until now the per-firm distribution $\pi_x$ was normalized to one, which fixed every firm type to the same size. We now drop that constraint and write the unnormalized version as $q_x$: its total mass $q_x(Y)$ is the size of firms of type $x$, and the model gets to choose it. The worker population is still fully employed, but firm types no longer all have the same size.

$$ \operatorname{WOTUK}(\mu,\nu) = \sup_{\substack{q_x \in \mathcal{M}_+(Y)\\ \int_X q_x\,\mathrm{d}\mu(x)=\nu}} \int_X F(x,q_x)\,\mathrm{d}\mu(x), \qquad q_x(Y)=\text{firm size}. $$

The demo uses the corresponding conical WOTUK production: $z=\int y\,\mathrm{d}q_x(y)$ is total skill rather than average skill, and the same CES formula is evaluated on that total skill vector.

$$ F(\alpha,q_x) = \left( \alpha_1 \left(\int_Y y_1\,\mathrm{d}q_x(y)\right)^\rho + \alpha_2 \left(\int_Y y_2\,\mathrm{d}q_x(y)\right)^\rho \right)^{1/\rho}. $$
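The role of firm size in the conical formula can be illustrated with a short sketch. The kernel `q_x`, the skill grid and the parameters are hypothetical. Because the CES formula is homogeneous of degree one, production on the total skill vector equals firm size times production on the average skill, which is the relation the sketch checks.

```python
import numpy as np

def ces(alpha, z, rho):
    """CES production (alpha_1 z_1^rho + alpha_2 z_2^rho)^(1/rho)."""
    alpha, z = np.asarray(alpha, float), np.asarray(z, float)
    return (alpha * z ** rho).sum(axis=-1) ** (1.0 / rho)

alpha, rho = np.array([0.5, 0.5]), 0.5
skills = np.array([[1.0, 0.0],
                   [0.5, 0.5],
                   [0.0, 1.0]])
q_x = np.array([0.3, 0.2, 0.1])      # unnormalized kernel: worker mass hired per type

firm_size = q_x.sum()                # q_x(Y), no longer forced to equal one
total_skill = q_x @ skills           # integral of y against q_x
conical = ces(alpha, total_skill, rho)

# Same workforce seen through its average skill, rescaled by firm size.
average_skill = total_skill / firm_size
rescaled = firm_size * ces(alpha, average_skill, rho)

print("firm size:", firm_size, "conical production:", conical)
```

In this reading, WOTUK chooses both the skill mix of each firm type and how large that firm type is.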

That is the theory; here is how to read the figure. Both inputs are discrete, that is, weighted sums of point masses (Dirac deltas):

$$ \mu = \sum_i m_i\,\delta_{\alpha_i}, \qquad \nu = \sum_j \nu_j\,\delta_{y_j}. $$

The firm distribution $\mu$ places a mass $m_i$ (the firm size) at each technology $\alpha_i=(1-s_i,s_i)$; the worker distribution $\nu$ places a weight $\nu_j$ at each skill $y_j=(1-t_j,t_j)$, and these skills form a fixed grid from one specialist extreme to the other. The figure shows the chosen matching together with these two margins. In the central heatmap each row is a firm type $\alpha_i$ and each column is a worker type $y_j$, and darker cells mean the production-maximizing matching sends more of that worker type to that firm type. Aligned underneath the columns is $\nu$; aligned beside the rows, on the right, are the firm sizes $m_i$. In OT, entropic OT and WOT these sizes are fixed and equal; only WOTUK lets them move.

The two sliders reshape the inputs in two different ways. Firm specialization moves the support of $\mu$: it spreads the firm technologies $\alpha_i$ out from a tight mixed cluster toward the two specialist extremes, while their sizes $m_i$ stay equal (except under WOTUK). Worker population instead moves the weights of $\nu$: the skill positions $y_j$ stay fixed, but their weights $\nu_j$ shift from specialists piled at the extremes to generalists massed in the middle. As you move the sliders, watch how the heatmap and the two margins respond under each of the four models.

Interactive figure. Sliders: worker population (specialists to generalists), firm specialization (mixed to specialized). Legend: strong match, worker distribution, firm technology / size. Readouts: model (OT), matching concentration, firm-size spread.