The DuaLip Solver
Problem Statement
In a typical recommender system problem, we denote users by \(i = 1, \ldots ,I\) and items by \(k = 1, \ldots, K\). Let \(x_{ik}\) denote any association between user \(i\) and item \(k\), and be the variable of interest. For example, \(x_{ik}\) can be the probability of displaying item \(k\) to user \(i\). The vectorized version is denoted by \(x = (x_1, ..., x_I)\) where \(x_i = (x_{i1}, ..., x_{iK})\).
DuaLip solves Linear Programs (LPs) of the following form:
where \(A_{m \times n}\) is the constraint matrix, \(b_{m \times 1}\) is the constraint vector and \(\mathcal{C}_i\) are uniformly compact polytopes. \(x \in \mathbb{R}^n\) is the vector of optimization variables, where \(n = IK\).
Problem Solution
We briefly outline the solution mechanism here. For more details, please see Basu et al. (2020). To solve the problem, we introduce the perturbed problem:
where \(\gamma > 0\) controls the tradeoff between problem approximation and the smoothness of the dual objective function. To make the above problem amenable to first order methods, we consider the Lagrangian dual:
where \(\mathcal{C} = \Pi_{i=1}^I \mathcal{C}_i\). Now, by strong duality, the optimum objective \(g_{\gamma}^*\) of the dual
is the minimum of the above problem. We can show that \(\lambda \mapsto g_{\gamma}(\lambda)\) is differentiable and the gradient is Lipschitz continuous. Moreover, by Danskin's Theorem the gradient can be explicitly expressed as,
where,
where \(\Pi_{\mathcal{C}_i}(\cdot)\) is the Euclidean projection operator onto \(\mathcal{C}_i\), and, \(A_i\), \(c_i\) are the parts of \(A\) and \(c\) corresponding to \(x_i\). Based on this we use a first-order gradient method as the main optimizer to solve the problem. It can also be shown that the solution obeys certain bounds to the true solution \(g_0(\lambda)\) and in fact the exact solution of the LP can be obtained if \(\gamma\) is small enough. For more details, please refer to Basu et al. (2020).
The Algorithm
The overall algorithm can now be written as:
Start with an initial \(\lambda\).
Get Primal: \(x_{\gamma}^*(\lambda)\).
Get Gradient: \(Ax_{\gamma}^*(\lambda) - b\).
Update \(\lambda\) via appropriate mechanisms.
Continue till converge.
We currently support three different mechanisms for doing this first-order optimization. Specifically, Proximal Gradient Ascent, Accelerated Gradient Ascent, and LBFGS-B. For the details please see Appendix A of Ramanath et al. (2021).
Constraint Sets \(\mathcal{C}_i\)
In this current version of the solver we support a wide variety of constraints types \(\mathcal{C}_i\), such as:
Unit Box: \(\mathcal{C}_i = \big\{ x \in \mathbb{R}^K : 0 \leq x_k \leq 1\big\}\)
Simplex-E: \(\mathcal{C}_i = \big\{ x \in \mathbb{R}^K : x_1 + ... + x_K = 1, \;\; x_k \geq 0\big\}\)
Simplex-I: \(\mathcal{C}_i = \big\{ x \in \mathbb{R}^K : x_1 + ... + x_K \leq 1, \;\; x_k \geq 0\big\}\)
Box Cut-E: \(\mathcal{C}_i = \big\{ x \in \mathbb{R}^K : x_1 + ... + x_K = d, \;\; 0 \leq x_k \leq 1\big\}\)
Box Cut-I: \(\mathcal{C}_i = \big\{ x \in \mathbb{R}^K : x_1 + ... + x_K \leq d, \;\; 0 \leq x_k \leq 1\big\}\)
Here \(E\) and \(I\) stands for equality and inequality. Also note that choosing \(d=1\) the Box Cut reduces to the Simplex case.
To execute step 2 of the overall algorithm, we need a projection operation on these constraint sets. In our solver, we have implemented highly efficient projection algorithms to make step 2 extremely fast. The different sets have different customized algorithms to make the overall system highly efficient. For more details on these projection algorithms please see Section 3 of Ramanath et al. (2021).
Adaptive Smoothing Algorithm
The smoothness of \(g_\gamma\) decreases as the number of constraints increases. A small \(\gamma\) makes the optimizer's convergence prohibitively slow, while a large \(\gamma\) reduces the accuracy of the solution. We define a practical criterion for sufficient convergence for a given \(\gamma\) and implement a stage-wise algorithm that automatically reduces \(\gamma\) when the criterion is met to prefer more accurate solutions. For details, please see Section 3 of Basu et al. (2020).
Stopping Criteria
Let \(\lambda_\gamma = \arg \max_{\lambda\ge 0} g_\gamma(\lambda)\) and \(\tilde{\lambda}_\gamma\) be an approximate solution after the optimizer has made sufficient progress to maximize \(g_\gamma\). If the approximation error \((g_0(\lambda_0) - g_0(\tilde{\lambda}_\gamma))\) is \(\epsilon\) times smaller than the total opportunity \((g_0(\lambda_0) - g_0(0))\) then we declare sufficient convergence, i.e.,
The intuition behind this is as follows:
The criterion is defined in terms of \(g_0\) because it is the Lagrangian dual corresponding to the actual LP we want to solve and by strong duality, \(g_0(\lambda_0)\) is the optimal primal objective that can be attained.
Since \(\lambda=0\) removes the effect of constraints on the Lagrangian, \(g_0(0)\) represents the maximum value of the primal objective. The total opportunity represents the value of objective "lost" to enforce the constraints \(Ax \le b\).
The approximation error (the left hand side of above) is due to two levels of approximation: (a) the error due to working with \(\gamma >0\), i.e., the difference between \(\lambda_0\) and \(\lambda_\gamma\); and (b) the approximate solution of \(\max_\lambda g_\gamma(\lambda)\), i.e., the difference between \(\lambda_\gamma\) and \(\tilde{\lambda}_\gamma\).
Infeasible problems
DuaLip is able to detect if the problem is primal infeasible. If the primal problem is infeasible,
Furthermore, for any feasible \(x\), by weak duality, we have
where the second inequality follows from the fact that the max is taken over a larger set. Now, for each constraint type \(\mathcal{C}_i\), it is easy to calculate a bound \(B\) such that
If the primal is feasible, then strong optimality implies that \({g_\gamma}^* \le IB\). Thus, if, during the optimization, \(g_\gamma > IB\), then it guarantees that the primal is infeasible.