Interactive demos
Three of my papers, running live in your browser on small toy problems. Pick a demo below, drag the clouds and move the sliders.
Subspace Robust Wasserstein
This demo brings to life Subspace Robust Wasserstein Distances (Paty and Cuturi, ICML 2019). Drag either point cloud to move it around.
Optimal transport measures the effort needed to morph one cloud of points, the measure \(\mu\), into the other one, the measure \(\nu\). In high dimension this effort is often dominated by a handful of directions that carry little real information. The paper asks a sharper question: along which low dimensional view are the two clouds the hardest to align? Writing \(P_E\) for the projection onto a subspace \(E\) of dimension \(k\), the subspace robust distance is
$$ \mathcal{S}_k(\mu,\nu)^2 \;=\; \max_{\dim E = k}\ \min_{\pi \in \Pi(\mu,\nu)}\ \int \lVert P_E(x-y) \rVert^2 \, \mathrm{d}\pi(x,y). $$The data here is two dimensional and we take \(k = 1\), so the worst view is a single line, drawn in yellow. The algorithm goes back and forth until its two steps agree: solve the transport problem along the current line, then turn the line towards the direction that transport stretches the most. The dashed line is ordinary PCA, the direction of largest spread, which usually points somewhere else.
Everything runs live on a few dozen points. The transport step uses entropic optimal transport (Sinkhorn), and the new line is the leading eigenvector of the transport weighted spread of the displacements. The projection \(P\) ranges over \(\{P : 0 \preceq P \preceq I,\ \operatorname{tr} P = 1\}\), exactly as in the paper. Original Python code.
Regularity as Regularization
This demo brings to life Regularity as Regularization: Smooth and Strongly Convex Brenier Potentials in Optimal Transport (Paty, d'Aspremont and Cuturi, AISTATS 2020).
We want the optimal transport map \(T\) that carries a source distribution \(P\) onto a target distribution \(Q\), and we only see samples of each. Brenier's theorem says such a map is the gradient of a convex potential, \(T = \nabla f\). The paper estimates the map by looking for a potential whose gradient pushes \(P\) as close to \(Q\) as possible, while staying regular:
$$ \min_{f \,\in\, \mathcal{F}_{\mu,L}}\ W_2^2\big( \nabla f \,\#\, P,\ Q \big), $$where \(\mathcal{F}_{\mu,L}\) is the set of potentials that are \(\mu\)-strongly convex and \(L\)-smooth, \(\nabla f \,\#\, P\) is the distribution obtained by sending every point \(x\) to \(\nabla f(x)\), and \(W_2\) is the Wasserstein distance. Imposing this regularity is the same as asking the map to be bi-Lipschitz: for all points,
$$ \mu\,\lVert x - x' \rVert \;\le\; \lVert T(x) - T(x') \rVert \;\le\; L\,\lVert x - x' \rVert. $$In plain terms, this bounds how much the map is allowed to stretch or squeeze distances: it never pulls two points apart by more than a factor \(L\), and never crushes them together by more than a factor \(\mu\). Close points stay close, and far points stay far.
In one dimension this becomes especially transparent. A transport map is just an increasing curve, so regularity becomes a bound on the slope between any two observed source points:
$$ \mu \;\le\; \frac{T(x') - T(x)}{x' - x} \;\le\; L . $$The lower bound \(\mu\) prevents the fitted map from flattening out and collapsing distant source points together. The upper bound \(L\) prevents it from making sudden steep jumps to chase noisy observations. Together they say: fit the data, but only with a curve whose local stretching stays plausible.
In the simulation, the smooth curves show the underlying source and target distributions \(P\) and \(Q\). What the estimator actually receives is much poorer: source samples \(x_i\) from \(P\), and noisy observations of where the true map sends them, \(y_i = T^*(x_i) + \varepsilon_i\), with \(\varepsilon_i \sim \mathcal{N}(0,\sigma^2)\) iid.
The red curve is the least-squares fit to those observations under the slope constraints,
$$ \min_T \sum_i \bigl(T(x_i) - y_i\bigr)^2 \quad\text{subject to}\quad \mu \le \frac{T(x') - T(x)}{x' - x} \le L . $$This is isotonic regression with two extra regularity rules. Plain isotonic regression only asks the map to keep increasing, which corresponds to \(\mu = 0\) and \(L = \infty\). Here the bounds also stop the map from becoming too flat or too steep, which is how the regularized fit filters out the noise.
About the two couples. The true couple \((\mu^*, L^*)\) is the regularity of the hidden true map: the source \(P\) is uniform, and pushing it through that map gives the target \(Q\), so this couple is what reshapes \(Q\). The amber curve is the clean \(Q\); the ticks underneath it are the samples \(y_i\), scattered around \(Q\) by the noise σ. The fit couple \((\mu, L)\) is instead the regularity you assume when reconstructing the map, and Best fit looks for the couple that recovers it best. The fit is computed exactly in the browser by coordinate descent on the slopes; in higher dimension the same estimator becomes the convex QCQP of the paper.
Weak Optimal Transport in Economics
This demo illustrates Algorithms for Weak Optimal Transport with an Application to Economics (Paty, Choné and Kramarz, 2022). The economic question is: who works for whom, and what does the matching reveal about workers' skills and firms' technologies?
The model looks for the matching that maximizes total production. Firms have technology \(x\), workers have skill \(y\), and a matching says how much worker mass of each type is employed by each firm type. The figure shows that matching together with its two margins. In the central heatmap each row is a firm type and each column is a worker type, and darker cells mean the production-maximizing matching sends more of that worker type to that firm type. Aligned underneath the columns is the worker distribution \(\nu\); aligned beside the rows, on the right, are the firm sizes, the total mass each firm type hires. In OT, entropic OT and WOT these row totals are fixed and equal; only WOTUK lets them move.
Production uses a constant-elasticity-of-substitution (CES) function. A worker of type \(y=(y_1,y_2)\) carries a mix of two skills, and a firm of technology \(\alpha=(\alpha_1,\alpha_2)\) weights them by how much it relies on each:
$$ F(\alpha,y) = \left(\alpha_1 y_1^\rho + \alpha_2 y_2^\rho\right)^{1/\rho}. $$In the demo the firm rows run from skill-1-intensive to skill-2-intensive, \(\alpha_i=(1-s_i,s_i)\), and the worker columns are a fixed grid \(y_j=(1-t_j,t_j)\) from one specialist extreme to the other. Maximizing production is the same problem as the optimal transport with cost \(c=-F\).
The two sliders reshape these inputs. Firm specialization spreads the firm technologies \(\alpha_i\) out from a tight mixed cluster toward the two specialist extremes. Worker population reshapes \(\nu\): the skill positions stay fixed, but their weights shift from specialists piled at the extremes to generalists massed in the middle.
Classical optimal transport treats production as pairwise: a firm of type \(x\) hiring a worker of type \(y\) produces \(F(x,y)\). Weak optimal transport lets the production of a firm depend on the whole group of workers it hires, written as a distribution \(\pi_x\). In barycentric WOT, only the average skill of that group matters. In WOTUK, the kernel is unnormalized, so a firm's size is also chosen by the model instead of being fixed in advance.
Let \(\mu\) be the distribution of firm types, \(\nu\) the distribution of worker types, and \(\Pi(\mu,\nu)\) the set of matchings whose two marginals are fixed. Classical optimal transport chooses the matching \(\pi\) that maximizes total production:
$$ \operatorname{OT}(\mu,\nu) = \sup_{\pi \in \Pi(\mu,\nu)} \int_{X \times Y} F(x,y)\,\mathrm{d}\pi(x,y). $$Entropic OT adds a small entropy reward. It is easier and smoother to compute, but the economic meaning is still pairwise: firms mainly hire nearby worker types, with some blur around the best matches.
$$ \operatorname{EOT}_{\varepsilon}(\mu,\nu) = \sup_{\pi \in \Pi(\mu,\nu)} \left\{ \int F(x,y)\,\mathrm{d}\pi(x,y) - \varepsilon \int \log\!\left(\frac{\mathrm{d}\pi}{\mathrm{d}\mu\,\mathrm{d}\nu}\right)\,\mathrm{d}\pi \right\}. $$Weak OT changes the object that enters production. Disintegrate the matching as \(\mathrm{d}\pi(x,y)=\mathrm{d}\mu(x)\,\mathrm{d}\pi_x(y)\). The measure \(\pi_x\) is the distribution of workers hired by firms of type \(x\). Production may now depend on the whole workforce:
$$ \operatorname{WOT}(\mu,\nu) = \sup_{\pi \in \Pi(\mu,\nu)} \int_X F(x,\pi_x)\,\mathrm{d}\mu(x). $$In the barycentric case, \(F(x,\pi_x)\) only sees the average skill \( \int y\,\mathrm{d}\pi_x(y) \). In the demo this is
$$ F(x,\pi_x) = \widetilde F\!\left(x,\int_Y y\,\mathrm{d}\pi_x(y)\right), \qquad \widetilde F(\alpha,z) = \left(\alpha_1 z_1^\rho + \alpha_2 z_2^\rho\right)^{1/\rho}. $$This is what lets a generalist firm combine specialists in two skills and obtain the same aggregate skill as from a generalist worker.
WOTUK goes one step further. The kernel \(q_x\) is no longer normalized: its total mass \(q_x(Y)\) is the size of firms of type \(x\). The model still exhausts the worker population, but it no longer forces every firm type to have the same size.
$$ \operatorname{WOTUK}(\mu,\nu) = \sup_{\substack{q_x \in \mathcal{M}_+(Y)\\ \int_X q_x\,\mathrm{d}\mu(x)=\nu}} \int_X F(x,q_x)\,\mathrm{d}\mu(x), \qquad q_x(Y)=\text{firm size}. $$The demo uses the corresponding conical WOTUK production: \(z=\int y\,\mathrm{d}q_x(y)\) is total skill rather than average skill, and the same CES formula is evaluated on that total skill vector.
$$ F(\alpha,q_x) = \left( \alpha_1 \left(\int_Y y_1\,\mathrm{d}q_x(y)\right)^\rho + \alpha_2 \left(\int_Y y_2\,\mathrm{d}q_x(y)\right)^\rho \right)^{1/\rho}. $$Each firm hires essentially one kind of worker: skill-1 firms take skill-1 specialists, skill-2 firms take skill-2 specialists. The matching is sharp and runs along the diagonal.
The heatmap is computed by the same primal mirror-ascent pattern as Algorithm 1 in the paper. OT and WOT use the multiplicative update \(P \leftarrow P \odot \exp(\gamma \nabla f(P))\), followed by a Sinkhorn KL projection onto \(\Pi(\mu,\nu)\). Entropic OT is a direct Sinkhorn solve with kernel \(\exp(F/\varepsilon)\). WOTUK uses the same multiplicative update, then the closed-form projection that rescales each worker column to the marginal \(\nu\), leaving firm sizes free.