Reconcile Forecasts

In many real-world scenarios, we need a set of forecasts that satisfy inter-forecast additivity constraints. For example, the forecast of total company revenue must be consistent with the sum of forecasts for each business unit. The forecast of total population must match the sum of forecasts for each geographic region.

To generate consistent forecasts, we can either use an algorithm that produces consistent forecasts by design, or apply a post-hoc forecast reconciliation method that takes existing forecasts and makes them consistent.

ReconcileAdditiveForecasts takes the latter approach. You can use any algorithm to generate the base forecasts, and then use ReconcileAdditiveForecasts to reconcile them.

This page explains the method details. For usage and examples, see the tutorial (Reconcile Forecasts).

Intuition

In ReconcileAdditiveForecasts, adjusted forecasts are computed as a linear transformation of the base forecasts. This allows the adjusted forecasts to satisfy the linear constraints.

Let \(F_{base}\) be an \(m \times n\) matrix containing forecasts for \(m\) time series over \(n\) time steps (wide format, each row is a time series).

The adjustment function is a linear operator defined by \(T\), an \(m \times m\) matrix. Applying \(T\) to \(F_{base}\) produces adjusted forecasts \(F_{adj}\), an \(m \times n\) matrix:

\[F_{adj} = T F_{base}.\]

Let \(C\) be a \(c \times m\) matrix encoding \(c\) constraints for the \(m\) forecasts. \(C\) defines linear constraints as follows:

\[C F_{adj} = C T F_{base} = 0.\]

The constraints are satisfied for all \(F_{base}\) if every column of \(T\) is in the nullspace of \(C\).
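To make the nullspace condition concrete, here is a small NumPy sketch (illustrative only, not the library's API) using the hierarchy \(X_{1}=X_{2}+X_{3}\), \(X_{2}=X_{4}+X_{5}\). The "bottom-up" transform, which keeps the leaf forecasts and recomputes the aggregates, is one transform whose columns all lie in the nullspace of \(C\):

```python
import numpy as np

# Constraints for series ordered (X1, X2, X3, X4, X5):
#   X1 = X2 + X3  and  X2 = X4 + X5
C = np.array([
    [1, -1, -1,  0,  0],
    [0,  1,  0, -1, -1],
])

# One valid transform: "bottom-up" -- keep the leaf forecasts
# (X3, X4, X5) and recompute X1, X2 from them.
T = np.array([
    [0, 0, 1, 1, 1],   # X1_adj = X3 + X4 + X5
    [0, 0, 0, 1, 1],   # X2_adj = X4 + X5
    [0, 0, 1, 0, 0],   # X3_adj = X3
    [0, 0, 0, 1, 0],   # X4_adj = X4
    [0, 0, 0, 0, 1],   # X5_adj = X5
])

# Every column of T is in the nullspace of C, so C @ T == 0 ...
assert np.all(C @ T == 0)

# ... and the adjusted forecasts satisfy the constraints
# for ANY base forecasts.
F_base = np.random.default_rng(0).normal(size=(5, 4))  # m=5 series, n=4 steps
F_adj = T @ F_base
assert np.allclose(C @ F_adj, 0)
```

Bottom-up is only one admissible \(T\); the optimization below searches over all of them.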

For a given \(C\), there could be multiple possible transforms \(T\). Our goal is to find the \(T\) that returns the best adjusted forecasts. ReconcileAdditiveForecasts tries to find a \(T\) that minimizes the mean squared error (MSE).

We have two ways to estimate MSE:

  1. Empirical MSE (error of \(F_{adj}\) on the training set)

  2. Decomposed MSE (estimated squared bias + variance from the adjustment)

Using empirical MSE alone could overfit the training set. Besides using the decomposed MSE, we can mitigate overfitting by:

  • Requiring \(T\) to be unbiased (details below). This results in a more stable adjustment, which is useful when extrapolating into the future. Unbiasedness is especially appropriate if the base forecasts are unbiased to start with.

  • Introducing a regularization term on the adjustment size. This is appropriate if the base forecasts are good. We want to make a small adjustment to satisfy the constraints.

Optimization problem

With this intuition, ReconcileAdditiveForecasts defines \(T\) as the solution to the following convex optimization problem:

\[\begin{align}
& \text{minimize (w.r.t. $T$)} && \frac{1}{m} \lambda_{var} \left\lVert W_{var} T \sqrt{W_{h}} \right\rVert _{F}^{2} & \text{variance} \\
& \quad && + \frac{1}{mn} \lambda_{bias} \left\lVert W_{bias} (TA-A) \right\rVert _{F}^{2} & \text{squared bias}\\
& \quad && + \frac{1}{mn} \lambda_{train} \left\lVert W_{train} (TF_{base}-A) \right\rVert _{F}^{2} & \text{train MSE}\\
& \quad && + \frac{1}{mn} \lambda_{adj} \left\lVert W_{adj} (TF_{base}-F_{base}) \right\rVert _{F}^{2} & \text{regularization}\\
& \text{subject to} && CT = 0 & \text{inter-forecast constraints}\\
& \quad && TA = A & \text{optional, unbiasedness}\\
& \quad && T \geq b_{lower} & \text{optional, lower bound}\\
& \quad && T \leq b_{upper} & \text{optional, upper bound}
\end{align}\]

Notation:

  • Variable
    • \(T\), the \(m \times m\) transform

  • Constraint
    • \(C\), the \(c \times m\) linear constraints

  • Inputs
    • \(F_{base}\), the \(m \times n\) base forecasts

    • \(A\), the \(m \times n\) actual values (corresponding to the base forecasts)

  • Tuning parameters
    • \(\lambda_{var}\), \(\lambda_{bias}\), \(\lambda_{train}\), \(\lambda_{adj}\), scalars that define the relative weight of each objective term

    • \(W_{h}\), the \(m \times m\) variance-covariance matrix of base forecast errors

    • \(W_{var}\), \(W_{bias}\), \(W_{train}\), \(W_{adj}\), diagonal \(m \times m\) weight matrices that define the relative weight of each time series for the penalty

    • \(b_{lower}\), optional lower bound for the entries in \(T\)

    • \(b_{upper}\), optional upper bound for the entries in \(T\)

  • \(\left\lVert \cdot \right\rVert _{F}^{2}\), the squared Frobenius norm

Note

ReconcileAdditiveForecasts pre-scales forecasts and actuals so that the actuals have mean 1 before solving the optimization problem. This makes the optimization more stable.

ReconcileAdditiveForecasts pre-scales the weight matrices so they have the same norm as the identity matrix of the same size.
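To build intuition for the optimization, consider a stripped-down special case (a sketch, not the library's solver): keep only the regularization term with identity weights and drop the optional constraints. That problem, minimize \(\lVert TF_{base}-F_{base}\rVert_F^2\) subject to \(CT=0\), is solved by the orthogonal projection onto the nullspace of \(C\), regardless of \(F_{base}\):

```python
import numpy as np

# Hypothetical constraint: X1 = X2 + X3, series ordered (X1, X2, X3).
C = np.array([[1.0, -1.0, -1.0]])
m = C.shape[1]

# Simplified problem:
#   minimize ||T F_base - F_base||_F^2  subject to  C T = 0.
# One optimal solution is the orthogonal projection onto the
# nullspace of C (it does not depend on F_base):
T = np.eye(m) - C.T @ np.linalg.solve(C @ C.T, C)

assert np.allclose(C @ T, 0)   # constraints hold for any F_base
assert np.allclose(T @ T, T)   # T is a projection

F_base = np.array([[10.0, 12.0],
                   [4.0,  5.0],
                   [5.0,  6.0]])  # X1 != X2 + X3 before adjustment
F_adj = T @ F_base
assert np.allclose(F_adj[0], F_adj[1] + F_adj[2])  # now consistent
```

The full problem adds the variance, bias, and train terms, the weight matrices, and the optional constraints, so its solution generally differs from this projection.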

Let’s unpack this a bit.

Objective

There are four terms (penalties) in the objective:

  1. Var (for decomposed MSE)

  2. Bias (for decomposed MSE)

  3. Train (train MSE)

  4. Adj (regularization)

The var term estimates the variance of the adjusted forecast errors, assuming the base forecasts and the transform are unbiased. The derivation comes from Lemma 1 of Wickramasuriya, Athanasopoulos & Hyndman (2019).

\(W_h\) is positive semidefinite and symmetric, so its square root is symmetric. Thus, the first term can be rewritten:

\[\begin{split}\left\lVert W_{var} T \sqrt{W_{h}} \right\rVert _{F}^{2} & = \mathrm{Tr}({W_{var}T\sqrt{W_{h}}\sqrt{W_{h}}'T'W_{var}'})\\ & = \mathrm{Tr}({W_{var}TW_{h}T'W_{var}'})\\ & = \mathrm{Tr}({W_{var}^{2}TW_{h}T'})\end{split}\]

Modulo the tuning parameter \(W_{var}\), this is the variance of reconciled forecast errors by Wickramasuriya et al. (\(T\) here is equivalent to \(SP\) in their notation).

The normalizing constant \(\frac{1}{m}\) on the variance term gives the average for a single forecast.
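The trace identity above can be checked numerically. A quick sketch (the symmetric square root is computed via eigendecomposition):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4
W_var = np.diag(rng.uniform(0.5, 2.0, size=m))   # diagonal weight matrix
T = rng.normal(size=(m, m))                       # an arbitrary transform
B = rng.normal(size=(m, m))
W_h = B @ B.T                                     # symmetric PSD covariance

# Symmetric PSD square root of W_h via eigendecomposition.
vals, vecs = np.linalg.eigh(W_h)
sqrt_Wh = vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T

# || W_var T sqrt(W_h) ||_F^2  ==  Tr( W_var^2 T W_h T' )
lhs = np.linalg.norm(W_var @ T @ sqrt_Wh, "fro") ** 2
rhs = np.trace(W_var @ W_var @ T @ W_h @ T.T)
assert np.isclose(lhs, rhs)
```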

The bias term estimates the squared bias of the transform. Because actuals satisfy the constraints, we use actuals to assess bias, computed as the difference between actuals and transformed actuals. For unbiased transforms, \(TA=A\), so this term is 0.

The train term measures the MSE of the adjusted forecast on the training set. Since the base forecast MSE is constant, it can also be interpreted as the change in training MSE after adjustment.

The adj term adds regularization to prevent overfitting. It penalizes differences between the forecasts and adjusted forecasts.

For the bias, train, and adj terms, the normalizing constant \(\frac{1}{mn}\) gives the average over the observed distribution.

Constraints

1. \(CT = 0\) requires the inter-forecast additivity constraints to be satisfied, represented as a system of linear equations. For example, \(C\) could require \(X_{1}=X_{2}+X_{3}\) and \(X_{2}=X_{4}+X_{5}\).

2. \(TA = A\) is an optional constraint that enforces unbiasedness. This is helpful to prevent overfitting. Additionally, the variance term in the objective assumes the transform is unbiased, so enforcing this constraint makes the variance estimate more accurate.

Note

\(TA = A\) represents unbiasedness for a particular \(A\). If the constraints are derived from a hierarchy (where each node’s value is the sum of its children’s), the unbiasedness constraint is \(TS = S\), where \(S\) is the summing matrix for the tree (see sum_matrix in HierarchicalRelationship for a definition). ReconcileAdditiveForecasts represents the constraint as \(TS = S\) when possible and \(TA = A\) otherwise.

3. Because large elements in \(T\) can result in unstable forecasts, we allow optional constraints on its entries, \(T \geq b_{lower}\) and \(T \leq b_{upper}\). For example, we could have \(-1.5 \leq T \leq 1.5\). In practice, these constraints are often superfluous. Note that negative values in \(T\) should be allowed; for hierarchical constraints, this allows information to propagate “down” the tree from parent to children.
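The last two points can be illustrated together for a simple tree \(X_{1}=X_{2}+X_{3}\) (a hypothetical example, not the library's API): the summing matrix \(S\) maps the leaf values \((X_{2}, X_{3})\) to all nodes, and a transform can contain negative entries while still being unbiased (\(TS=S\)):

```python
import numpy as np

# Tree: X1 = X2 + X3, series ordered (X1, X2, X3); leaves are (X2, X3).
# Summing matrix S maps leaf values to all nodes.
S = np.array([[1.0, 1.0],   # X1 = X2 + X3
              [1.0, 0.0],   # X2
              [0.0, 1.0]])  # X3
C = np.array([[1.0, -1.0, -1.0]])

# Orthogonal projection onto the nullspace of C: one valid transform.
T = np.eye(3) - C.T @ np.linalg.solve(C @ C.T, C)

assert np.allclose(C @ T, 0)   # inter-forecast constraints hold
assert np.allclose(T @ S, S)   # unbiased for the hierarchy: T S = S
assert (T < 0).any()           # has negative entries (here -1/3), which
                               # let parent info flow down to the children
```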

Tuning parameters

Depending on the data, some terms in the objective may be more useful than others. \(\lambda_{var}\), \(\lambda_{bias}\), \(\lambda_{train}\), \(\lambda_{adj}\) allow you to tune the relative weight of each term.

Sometimes, it is more important to be accurate for some time series than for others. If so, \(W_{var}\), \(W_{bias}\), \(W_{train}\), \(W_{adj}\) can be used to weigh the time series by their relative importance. If not, the weights can still be used to fine-tune the adjustment:

  • Setting \(W_{var}\), \(W_{bias}\), \(W_{train}\) proportional to the base forecast error for each time series can improve the result for base forecasts that start with high error.

  • Setting \(W_{adj}\) inversely proportional to the base forecast error for each time series puts greater penalty on adjustments to base forecasts that are already accurate.
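One way to construct such weight matrices (an illustrative sketch; the weights and error values are hypothetical) is to build diagonal matrices from per-series base forecast errors and rescale each to have the same Frobenius norm as the identity matrix, as described in the note above:

```python
import numpy as np

# Hypothetical per-series base forecast errors (e.g. training RMSE), m=3.
base_error = np.array([2.0, 0.5, 1.0])
m = base_error.size

def scale_to_identity_norm(W):
    # Rescale so ||W||_F matches ||I_m||_F = sqrt(m).
    return W * np.sqrt(m) / np.linalg.norm(W, "fro")

# Heavier variance/bias/train penalty where the base error is large ...
W_var = scale_to_identity_norm(np.diag(base_error))
# ... and heavier adjustment penalty where the base forecast is accurate.
W_adj = scale_to_identity_norm(np.diag(1.0 / base_error))

assert np.isclose(np.linalg.norm(W_var, "fro"), np.sqrt(m))
assert np.isclose(np.linalg.norm(W_adj, "fro"), np.sqrt(m))
```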

See the tutorial (Reconcile Forecasts) for details and suggested settings.