The DuaLip Solver
Problem Statement
In a typical recommender system problem, we denote users by \(i = 1, \ldots, I\) and items by \(k = 1, \ldots, K\). Let \(x_{ik}\) denote the association between user \(i\) and item \(k\); this is the variable of interest. For example, \(x_{ik}\) can be the probability of displaying item \(k\) to user \(i\). The vectorized version is denoted by \(x = (x_1, \ldots, x_I)\), where \(x_i = (x_{i1}, \ldots, x_{iK})\).
DuaLip solves Linear Programs (LPs) of the following form:

\[
\min_{x} \; c^T x \quad \text{subject to} \quad Ax \leq b, \quad x_i \in \mathcal{C}_i \;\; \text{for } i = 1, \ldots, I,
\]

where \(A_{m \times n}\) is the constraint matrix, \(b_{m \times 1}\) is the constraint vector, \(c_{n \times 1}\) is the objective vector, and the \(\mathcal{C}_i\) are uniformly compact polytopes. \(x \in \mathbb{R}^n\) is the vector of optimization variables, where \(n = IK\).
Problem Solution
We briefly outline the solution mechanism here. For more details, please see Basu et al. (2020). To solve the problem, we introduce the perturbed problem:

\[
\min_{x} \; c^T x + \frac{\gamma}{2} x^T x \quad \text{subject to} \quad Ax \leq b, \quad x_i \in \mathcal{C}_i \;\; \text{for } i = 1, \ldots, I,
\]

where \(\gamma > 0\) controls the tradeoff between problem approximation and the smoothness of the dual objective function. To make the above problem amenable to first-order methods, we consider the Lagrangian dual:

\[
g_{\gamma}(\lambda) = \min_{x \in \mathcal{C}} \; c^T x + \frac{\gamma}{2} x^T x + \lambda^T (Ax - b),
\]

where \(\mathcal{C} = \Pi_{i=1}^I \mathcal{C}_i\). Now, by strong duality, the optimum objective \(g_{\gamma}^*\) of the dual

\[
g_{\gamma}^* = \max_{\lambda \geq 0} \; g_{\gamma}(\lambda)
\]

equals the minimum of the above perturbed problem. We can show that \(\lambda \mapsto g_{\gamma}(\lambda)\) is differentiable and that its gradient is Lipschitz continuous. Moreover, by Danskin's Theorem the gradient can be explicitly expressed as

\[
\nabla g_{\gamma}(\lambda) = A x_{\gamma}^*(\lambda) - b,
\]

where

\[
\big(x_{\gamma}^*(\lambda)\big)_i = \Pi_{\mathcal{C}_i}\!\left( -\frac{A_i^T \lambda + c_i}{\gamma} \right),
\]

\(\Pi_{\mathcal{C}_i}(\cdot)\) is the Euclidean projection operator onto \(\mathcal{C}_i\), and \(A_i\), \(c_i\) are the parts of \(A\) and \(c\) corresponding to \(x_i\). Based on this, we use a first-order gradient method as the main optimizer to solve the problem. It can also be shown that the solution obeys certain bounds relative to the true solution \(g_0(\lambda)\), and in fact the exact solution of the LP can be obtained if \(\gamma\) is small enough. For more details, please refer to Basu et al. (2020).
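To make these two formulas concrete, here is a minimal NumPy sketch for the special case where every \(\mathcal{C}_i\) is the unit box, so the Euclidean projection reduces to coordinate-wise clamping. The function and variable names are illustrative only and are not part of the solver's API.

```python
import numpy as np

def primal_and_gradient(A, b, c, lam, gamma):
    """Sketch of primal retrieval and the dual gradient when every C_i is
    the unit box, so the projection is just clamping to [0, 1]."""
    # Unconstrained minimizer of c^T x + (gamma/2) x^T x + lam^T A x
    x_free = -(A.T @ lam + c) / gamma
    # Euclidean projection onto the unit box
    x = np.clip(x_free, 0.0, 1.0)
    # Gradient of the dual objective g_gamma at lam
    grad = A @ x - b
    return x, grad
```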
The Algorithm
The overall algorithm can now be written as:
1. Start with an initial \(\lambda\).
2. Get Primal: \(x_{\gamma}^*(\lambda)\).
3. Get Gradient: \(Ax_{\gamma}^*(\lambda) - b\).
4. Update \(\lambda\) via appropriate mechanisms.
5. Continue until convergence.
We currently support Accelerated Gradient Ascent as the maximizer, though the solver is easily extensible to other optimization algorithms.
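As a rough illustration (not the solver's actual code), the loop below implements these five steps using plain projected gradient ascent instead of the accelerated variant; project_primal stands in for the block-wise projection onto \(\mathcal{C}\), and all names are placeholders.

```python
import numpy as np

def dual_loop(A, b, c, gamma, project_primal, step_size=1e-2,
              max_iters=1000, tol=1e-6):
    """Skeleton of the dual ascent loop described above (non-accelerated)."""
    lam = np.zeros(A.shape[0])                            # 1. initial lambda
    x = project_primal(-(A.T @ lam + c) / gamma)          # 2. primal retrieval
    for _ in range(max_iters):
        grad = A @ x - b                                  # 3. dual gradient
        lam_new = np.maximum(lam + step_size * grad, 0.0) # 4. ascent step, keep lambda >= 0
        x = project_primal(-(A.T @ lam_new + c) / gamma)  # 2. primal at the new lambda
        if np.linalg.norm(lam_new - lam) <= tol:          # 5. stop once lambda has converged
            lam = lam_new
            break
        lam = lam_new
    return lam, x
```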
Constraint Sets \(\mathcal{C}_i\)
In the current version of the solver, we support a wide variety of constraint types \(\mathcal{C}_i\), such as:
Unit Box: \(\mathcal{C}_i = \big\{ x \in \mathbb{R}^K : 0 \leq x_k \leq 1\big\}\)
Simplex-E: \(\mathcal{C}_i = \big\{ x \in \mathbb{R}^K : x_1 + ... + x_K = 1, \;\; x_k \geq 0\big\}\)
Simplex-I: \(\mathcal{C}_i = \big\{ x \in \mathbb{R}^K : x_1 + ... + x_K \leq 1, \;\; x_k \geq 0\big\}\)
r-Simplex-E: \(\mathcal{C}_i = \big\{ x \in \mathbb{R}^K : x_1 + ... + x_K = r, \;\; x_k \geq 0\big\}\)
r-Simplex-I: \(\mathcal{C}_i = \big\{ x \in \mathbb{R}^K : x_1 + ... + x_K \leq r, \;\; x_k \geq 0\big\}\)
Here \(E\) and \(I\) stand for equality and inequality.
To execute step 2 of the overall algorithm, we need a projection operation onto these constraint sets. Our solver implements customized, highly efficient projection algorithms for each of these sets, which makes step 2 extremely fast. For more details on these projection algorithms, please see Section 3 of Ramanath et al. (2021).
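The customized routines in Ramanath et al. (2021) are not reproduced here, but as a point of reference, the standard \(O(K \log K)\) sort-and-threshold projection onto the r-Simplex-E set can be sketched as follows (illustrative only, not the solver's implementation):

```python
import numpy as np

def project_r_simplex_eq(v, r=1.0):
    """Euclidean projection of v onto {x : x_1 + ... + x_K = r, x_k >= 0}
    via the classical sort-and-threshold method."""
    u = np.sort(v)[::-1]                     # coordinates in decreasing order
    shifted_cumsum = np.cumsum(u) - r        # partial sums minus the radius r
    ks = np.arange(1, v.size + 1)
    rho = np.max(np.nonzero(u - shifted_cumsum / ks > 0)[0]) + 1
    theta = shifted_cumsum[rho - 1] / rho    # threshold enforcing the sum constraint
    return np.maximum(v - theta, 0.0)

# Example: project_r_simplex_eq(np.array([0.5, 0.8, -0.2]), r=1.0)
# returns a nonnegative vector whose entries sum to 1.
```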
Data Sharding (Multi-GPU)
For large matching problems, DuaLip can distribute computation across multiple GPUs by sharding the input data along the column dimension of the constraint matrix. When compute_device_num > 1, the solver builds a distributed objective that wraps a single‑GPU objective on each device and coordinates reductions on a host device.
Sharding of inputs: Matrices \(A\) and \(c\) are partitioned into roughly equal contiguous blocks across the available compute devices (e.g., cuda:0, cuda:1, …). This balances work by splitting the number of columns as evenly as possible. Each shard is then moved to its target device. The per‑device projection map is derived from the global projection map by remapping global column indices to local indices for that shard. Only projections that touch columns present on the device are kept.

Per‑iteration execution: The current dual vector \(\lambda\) is transferred to each compute device. Each device computes its local dual gradient contribution, local dual objective component, and regularization penalty using the single‑GPU matching objective on its shard. Partial results are first accumulated on the host device, then synchronized and summed across processes using NCCL all‑reduce. The final distributed gradient subtracts \(b\) and is used by the optimizer exactly as in the single‑GPU case.
This design keeps projection logic local to each shard, minimizes inter‑GPU communication to a small number of vector/tensor reductions per iteration, and scales naturally with the number of GPUs.
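A highly simplified sketch of this per-iteration pattern using torch.distributed is shown below; the helper names (local_dual_gradient, project_shard, etc.) are invented for illustration and are not the solver's actual API.

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has been called and that each
# process owns one GPU shard (a block of columns of A and the matching
# slice of c).

def local_dual_gradient(A_shard, c_shard, lam, gamma, project_shard):
    """This device's contribution to A x; `project_shard` applies the
    remapped per-column-group projections for the local columns."""
    x_shard = project_shard(-(A_shard.t() @ lam + c_shard) / gamma)
    return A_shard @ x_shard

def distributed_gradient(A_shard, b, c_shard, lam, gamma, project_shard):
    """Sum the per-device contributions with an NCCL all-reduce, then subtract b."""
    partial = local_dual_gradient(A_shard, c_shard, lam.to(A_shard.device),
                                  gamma, project_shard)
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # sum of A_shard @ x_shard == A @ x
    return partial - b.to(partial.device)
```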
Implementation Note
As we will discuss in the Supported LPs section, the distributed objective is currently implemented for matching problems where the constraint matrix is block-diagonal. In this case, the inputs must be CSC‑format sparse tensors. Sharding operates on columns to align with how projections are applied per column group. For custom objective functions, users need to implement the parallelism themselves.
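For reference, a CSC-format sparse tensor can be constructed in recent PyTorch versions along these lines (a generic example of the format, unrelated to any specific problem instance):

```python
import torch

# CSC layout: ccol_indices[j] .. ccol_indices[j + 1] delimit the entries of column j.
ccol_indices = torch.tensor([0, 1, 2, 3, 4])   # one pointer per column, plus one
row_indices  = torch.tensor([0, 0, 1, 2])      # row index of each stored value
values       = torch.tensor([1.0, 1.0, 1.0, 1.0])

A = torch.sparse_csc_tensor(ccol_indices, row_indices, values, size=(3, 4))
print(A.to_dense())   # 3 x 4 matrix containing the four stored entries
```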