# Aggregation: Cube and Grouping Sets¶

In this section we will look into the **CUBE** operator provided by Cubert. We will first discuss what cubes are, then what kinds of aggregations this operator can compute, and finally walk through some simple examples of building cubes with Cubert Script.

## What are Cubes and Grouping Sets?¶

An OLAP Cube, or simply a cube, is a multi-level aggregation of some base dataset. Let's say we have a dataset: {product, store, date, dollar_sales}. We are interested in aggregate information that can answer ALL of the following questions:

Id | Query | SQL statement |
---|---|---|
1 | Sales for each product, in each store and for each date | SELECT SUM(dollar_sales) .. GROUP BY product, store, date |
2 | Sales for each product, in each store, ACROSS ALL dates | SELECT SUM(dollar_sales) .. GROUP BY product, store |
3 | Sales for each product, for each date, ACROSS ALL stores | SELECT SUM(dollar_sales) .. GROUP BY product, date |
4 | Sales for each store, for each date, ACROSS ALL products | SELECT SUM(dollar_sales) .. GROUP BY date, store |
5 | Sales for each product, ACROSS ALL dates and stores | SELECT SUM(dollar_sales) .. GROUP BY product |
6 | Sales for each store, ACROSS ALL dates and products | SELECT SUM(dollar_sales) .. GROUP BY store |
7 | Sales for each date, ACROSS ALL stores and products | SELECT SUM(dollar_sales) .. GROUP BY date |
8 | Sales ACROSS ALL dates, stores and products | SELECT SUM(dollar_sales) .. GROUP BY () |

While it may look like we need 8 GROUP BY commands to compute these aggregates, we actually need only one CUBE operator, which computes all of these aggregates in one shot! **The strength of the CUBE operator is computing multiple GROUP BYs in one place** (albeit all of these group-bys are derived from the same set of aggregation columns – product, store and date in the example above).

We can now write all 8 GROUP BYs from above very compactly as:

```
// in SQL syntax
SELECT SUM(dollar_sales) .. GROUP BY CUBE (product, store, date)
// in Cubert Script syntax
CUBE input BY product, store, date AGGREGATES SUM(dollar_sales) AS sum;
```
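
To make the single-pass idea concrete, here is a minimal sketch in plain Python (hypothetical data, not Cubert's actual implementation) that computes all *2^D* group-bys in one scan of the input:

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical rows: (product, store, date, dollar_sales)
rows = [
    ("iPhone", "Mountain View", "2014-01-01", 1000),
    ("iPhone", "Mountain View", "2014-01-02", 4000),
    ("iPad",   "San Francisco", "2014-01-01", 700),
]
dims = ("product", "store", "date")

# One pass over the data updates every one of the 2^3 = 8 aggregation levels.
cube = defaultdict(int)
for *dim_values, sales in rows:
    values = dict(zip(dims, dim_values))
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            # None stands in for "ACROSS ALL values" of a dimension
            key = tuple(values[d] if d in subset else None for d in dims)
            cube[key] += sales

# cube[("iPhone", "Mountain View", None)] is query 2 for the iPhone/MV cell;
# cube[(None, None, None)] is the grand total (query 8).
```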

### Grouping Sets¶

By default, the CUBE operator computes all possible aggregation levels (with *D* dimensions, there will be a total of *2^D* such aggregations). If we are interested in only a subset of these aggregation levels, we can instruct Cubert's CUBE operator to compute only that subset. In the example above, say we were interested in queries 2, 3 and 5 only (that is, 3 GroupBys). Such a subset of GroupBys, out of the possible set of *2^D* GroupBys, is called a **Grouping Set**. In Cubert Script we can specify exactly which GroupBys should be computed. For example:

```
CUBE input BY product, store, date AGGREGATES SUM(dollar_sales) AS sum:DOUBLE
GROUPING SETS (product, store), (product, date), (product);
```
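
The same single-pass idea, restricted to an explicit list of grouping sets, can be sketched as follows (plain Python, hypothetical data):

```python
from collections import defaultdict

rows = [
    ("iPhone", "Mountain View", "2014-01-01", 1000),
    ("iPhone", "Mountain View", "2014-01-02", 4000),
]
dims = ("product", "store", "date")
# Only queries 2, 3 and 5 from the table above
grouping_sets = [("product", "store"), ("product", "date"), ("product",)]

cube = defaultdict(int)
for *dim_values, sales in rows:
    values = dict(zip(dims, dim_values))
    for gs in grouping_sets:
        key = tuple(values[d] if d in gs else None for d in dims)
        cube[key] += sales

# Only the three requested group-bys are materialized (4 output keys here),
# instead of all 2^3 = 8 aggregation levels.
```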

### Rollups¶

Sometimes business logic demands aggregation over a hierarchy of groups. For instance, given a dataset [1]: { product, category, store, city, state, dollar_sales }, one would need to compute multiple groups like:

Id | Query | SQL statement |
---|---|---|
1 | Total sales for ALL stores within a state | SELECT SUM(dollar_sales) .. GROUP BY state |
2 | Total sales for ALL stores within a city [2] | SELECT SUM(dollar_sales) .. GROUP BY state, city |
3 | Total sales for each category for ALL stores within a state | SELECT SUM(dollar_sales) .. GROUP BY state, category |
4 | Total sales for each category ACROSS ALL states | SELECT SUM(dollar_sales) .. GROUP BY category |
5 | Total sales for each category, for each store | SELECT SUM(dollar_sales) .. GROUP BY store, category |

As described above, these aggregates can be computed using multiple grouping sets like so:

```
CUBE input BY state, city, store, category AGGREGATES SUM(dollar_sales) AS sum:DOUBLE
GROUPING SETS (state), (state, city), (state, category), (category), (store, category);
```

However, enumerating all these grouping sets can be tedious and error-prone. A shorthand for such datasets is to use rollups like so:

```
CUBE input BY state, city, store, category AGGREGATES SUM(dollar_sales) AS sum:DOUBLE
ROLLUPS (state, city, store), (category);
```

Cubert's physical parser will internally compute all nested grouping sets from the rollup definition. For instance, the rollup query above will expand to the following grouping sets:

Grouping set | Corresponds to |
---|---|
() | total sales |
(state) | query 1 |
(state, city) | query 2 |
(state, city, store) | sales per store |
(category) | query 4 |
(state, category) | query 3 |
(state, city, category) | sales per category within each city |
(state, city, store, category) | query 5 |
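
The expansion is simply the cross product of each rollup's prefix sets; a sketch of this rule in plain Python (inferred from the expansion listed above, not Cubert's actual parser code):

```python
from itertools import product

def rollup_prefixes(cols):
    # ROLLUP (a, b, c) expands to (), (a), (a, b), (a, b, c)
    return [cols[:i] for i in range(len(cols) + 1)]

def expand_rollups(*rollups):
    # The grouping sets are the cross product of each rollup's prefix sets.
    return [tuple(col for part in combo for col in part)
            for combo in product(*(rollup_prefixes(r) for r in rollups))]

sets = expand_rollups(("state", "city", "store"), ("category",))
# 4 prefixes x 2 prefixes = 8 grouping sets, matching the table above
```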

### Combo level sets¶

For a dataset with a large number of dimensions *D*, computing a full cube of *2^D* aggregations can generate a lot of data. For real-time slicing and dicing, it may be essential to compute cubes over only *N* dimensions at a time, where *N* < *D*. As an alternative to manually enumerating all combinations of up to *N* dimensions from the possible set of *D*, the following will achieve the same result:

```
CUBE input BY dim1, dim2, dim3, dim4, ...., dimD AGGREGATES ...
GROUPING COMBOS N;
```

For example, in the dataset { product, category, store, city, state }, **GROUPING COMBOS 2** will produce the grouping set with zero dimensions (complete rollup), all single-dimension grouping sets, and all grouping sets with 2 dimensions.

dimensions used | grouping sets |
---|---|
0 | () |
1 | (product) (category) (store) (city) (state) |
2 | (product, category) (product, store) (product, city) (product, state) (category, store) (category, city) (category, state) (store, city) (store, state) (city, state) |
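
A sketch of the enumeration (plain Python, illustrating the combinatorics rather than Cubert internals):

```python
from itertools import combinations

dims = ("product", "category", "store", "city", "state")
N = 2

# GROUPING COMBOS N: every grouping set using at most N of the D dimensions
grouping_sets = [gs for r in range(N + 1) for gs in combinations(dims, r)]

# 1 empty set + 5 single dimensions + C(5, 2) = 10 pairs -> 16 grouping sets,
# far fewer than the 2^5 = 32 levels of the full cube.
```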

## Additive and Partitioned-Additive Aggregations¶

In the previous example we computed the SUM() aggregate, which happens to be an example of a **Distributive Aggregation Function**. With these functions, we can divide the input data in any arbitrary fashion, apply the function on each subset of the data to compute partial aggregates, and later combine these partial aggregates into the full aggregate. Other examples of distributive functions are COUNT, MIN and MAX. In Cubert terminology we call these **Additive Aggregates**.
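
The distributive property is easy to demonstrate with a plain Python sketch (hypothetical numbers):

```python
# Distributive (additive) aggregation: partial SUMs over any arbitrary
# split of the input combine into the full SUM.
sales = [1000, 4000, 700, 2500]
part_a, part_b = sales[:3], sales[3:]   # any split works, not just this one

partial_a, partial_b = sum(part_a), sum(part_b)
total = partial_a + partial_b           # equals sum(sales)
```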

Another class of aggregation functions is **Pseudo-Distributive Functions**, and COUNT(DISTINCT) (abbreviated as CD in this text) is one example. This is not a distributive function: if we were to divide our data in an arbitrary fashion and compute CD for each subset, there would be no way to combine these partial aggregates. However, if we *partition the data on the measure* (measure: the column that we are aggregating), then the count-distinct problem becomes additive! As an example, we partition the data on memberId, say into [memberIds from 1 to 1M], [1M to 2M], [2M to 3M] and so on. We can now compute CD for each partition (giving us the CD for each range of memberIds) and then add these together to get the CD for the entire range of memberIds. In Cubert terminology, we call these **Partitioned-Additive Aggregates** (since the aggregates are additive *IF the data is partitioned on the member*).
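
The partitioned count-distinct argument can be sketched as follows (hypothetical memberId values):

```python
# memberIds partitioned into disjoint ranges (hypothetical values).
# A member can appear in only one partition, so per-partition distinct
# counts simply add up to the global distinct count.
partition_a = [3, 7, 7, 9]                # memberIds in [1, 1M)
partition_b = [1_500_000, 1_700_000]      # memberIds in [1M, 2M)

cd_a = len(set(partition_a))              # 3 distinct members
cd_b = len(set(partition_b))              # 2 distinct members
global_cd = len(set(partition_a) | set(partition_b))
assert cd_a + cd_b == global_cd           # additive because partitions are disjoint
```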

And finally, there are **Holistic Functions**, which are neither distributive nor pseudo-distributive. Median and percentile rank are examples in this class. Cubert does not currently support these aggregation functions.

Some examples of CUBE operator for the additive and partitioned-additive aggregations are shown below:

```
/* Example of additive aggregations */
// Since SUM, COUNT, MIN, MAX are additive aggregates, we use the standard CUBE operator
CUBE input BY dim1, dim2, .. dimN AGGREGATES SUM(measure1), COUNT(measure2), MIN(measure3), MAX(measure4), ...;
/* Example of partitioned-additive aggregation */
// Since COUNT(DISTINCT memberId) is partitioned-additive, we have to specify to the CUBE operator that the
// data is partitioned on memberId.
// We also have to make sure that data is INDEED partitioned on memberId (which is accomplished by BLOCKGEN or
// BLOCKGEN BY INDEX command in the previous sections)
// [Notice the INNER memberId clause in the statement below]
CUBE input BY dim1, dim2, .. dimN INNER memberId AGGREGATES COUNT_DISTINCT(memberId) AS sum;
```

The count-distinct example above was a special case of a partitioned-additive aggregation function. In general, the aggregation function is a pair of two additive aggregation functions (e.g. [MAX, SUM]), where the inner function (SUM) is applied separately within each partition (that is, for each memberId), and the outer function (MAX) is applied across the per-partition results. We can write these aggregation functions as:

```
CUBE input BY dim1, dim2 .. dimN INNER memberId AGGREGATES [MAX, SUM](sales);
```

The above statement will compute an OLAP cube in which sales are first summed for each memberId, and the maximum of these per-member sums is then reported at each aggregation level.
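
An illustrative sketch of the inner/outer pairing (plain Python, hypothetical data): the inner SUM runs per memberId, and the outer MAX runs across the per-member sums.

```python
from collections import defaultdict

# (memberId, sales) rows falling into one cell of the cube
rows = [(1, 100), (1, 250), (2, 400), (2, 50)]

# Inner additive function: SUM of sales per memberId
per_member = defaultdict(int)
for member_id, sales in rows:
    per_member[member_id] += sales       # member 1 -> 350, member 2 -> 450

# Outer additive function: MAX across the per-member sums
result = max(per_member.values())
```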

## Understanding the Output¶

Let's go back to the full cube example. The first query (GROUP BY product, store, date) produces output in the expected format:

Product | Store | Date | Sum |
---|---|---|---|
iPhone | Mountain View | 2014-01-01 | $1000 |

How about query 2 (GROUP BY product, store)? Here we are computing aggregates ACROSS all dates. In the output, the Cubert CUBE operator assigns *null* to the date column to indicate that the aggregation is across ALL values of this column:

Product | Store | Date | Sum |
---|---|---|---|
iPhone | Mountain View | null | $5000 |

Consider another query: GROUP BY store. Here the aggregations are across all products and dates; the output would look like:

Product | Store | Date | Sum |
---|---|---|---|
null | Mountain View | null | $12000 |

And, of course, the last query (GROUP BY ALL) will have all columns as null.

> **Note: nulls in input**
>
> Since *null* is reserved to indicate that an aggregate is across all values of a column, we **cannot have nulls in the input data**. We should use the CASE function to replace null values in the input dataset with some placeholder value (e.g. “N/A”, “Unknown”, -9, etc.).

Footnotes

[1] A dataset of this nature would usually be a derived dataset resulting from multiple joins.

[2] Two or more cities in different states/counties may have the same name.