Aggregation Operators

GROUP BY

This operator computes group by aggregates on the input relation. GROUP BY lets us specify the dimensions (or columns) on which the aggregates are computed, the aggregation functions, the corresponding input columns, and optionally the names of the output columns for the computed aggregates. The GROUP BY operator supports SUM, COUNT, MIN, MAX, COUNT_DISTINCT and UDAFs (user-defined aggregation functions).

The output tuples from this operator contain all the group by columns as well as the output columns produced by the aggregation functions.

Sample usages of the GROUP BY operator:

// group by on two keys; compute the SUM aggregate on uniqueDaily and return it in a column named uniqueDailySum of type long
output = GROUP data BY country_sk, default_locale_sk AGGREGATES SUM(uniqueDaily) AS uniqueDailySum:long;

// compute multiple aggregates
output = GROUP data BY storeId AGGREGATES COUNT(tranId) AS tranCount, SUM(sales) AS totalSales;
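
COUNT_DISTINCT uses the same syntax. The snippet below is an illustrative sketch with hypothetical column names (country_sk, memberId); it assumes the COUNT_DISTINCT precondition described below is satisfied.

// count the distinct memberId values for each country_sk (illustrative sketch)
output = GROUP data BY country_sk AGGREGATES COUNT_DISTINCT(memberId) AS uniqueMembers;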

PRECONDITIONS: The input relation must be partitioned and sorted on the group by columns. For COUNT_DISTINCT aggregations, the input relation must be partitioned and sorted on the concatenation of the group by columns and the count distinct column.
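
To see how this precondition is typically satisfied, the following job is a minimal sketch (reusing the hypothetical table and column names from the samples above, not a prescribed pattern) that shuffles the data on the group by keys before applying GROUP BY in the reduce phase:

JOB "group by after shuffling on the group by keys"
REDUCERS 10;
MAP {
        // load the fact table
        data = LOAD "$inputFactTable" USING AVRO;
}
// partition and sort on the group by columns to satisfy the GROUP BY precondition
SHUFFLE data PARTITIONED ON country_sk, default_locale_sk SORTED ON country_sk, default_locale_sk;
REDUCE {
        output = GROUP data BY country_sk, default_locale_sk AGGREGATES SUM(uniqueDaily) AS uniqueDailySum:long;
}
STORE output INTO "$output/groupByOutput" USING AVRO;
END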

CUBE

This operator computes multiple group bys over the input dataset. See Aggregation: Cube and Grouping Sets for a discussion of the OLAP cube and grouping sets.

The operator can compute two kinds of aggregates: Additive and Partitioned-Additive (see Aggregation: Cube and Grouping Sets).

Additive Aggregates

Cubert supports the following built-in aggregators: SUM, COUNT, MIN and MAX.

Example usage: computing a complete OLAP cube on two dimensions.

JOB "cube additive computing a full OLAP cube."
REDUCERS 10;
MAP {
        // load the fact table
        data = LOAD "$inputFactTable" USING AVRO;
                    // compute "full" cube
        data = CUBE data BY dim0, dim1 AGGREGATES SUM(measure) AS sum;
}
SHUFFLE data PARTITIONED ON dim0, dim1 SORTED ON dim0, dim1 AGGREGATES SUM(sum) AS sum;
REDUCE {
        result = GROUP data BY dim0, dim1 AGGREGATES SUM(sum) AS sum:long;
}
STORE result INTO "output/cube" USING AVRO;
END

PRECONDITIONS: None.

Partitioned-Additive Aggregates

The CUBE operator can also compute OLAP cube count distinct (CCD) on the input relation. Its specification takes the list of dimensions, the partitioning measure on which count distinct is computed, and the corresponding output column names.

Optionally, grouping sets can be specified to list the group by key combinations to compute. If no grouping sets are specified, the full OLAP cube is computed; for example, with two dimensions dim0 and dim1, the full cube consists of the grouping sets (dim0, dim1), (dim0), (dim1) and the overall (empty) grouping set.

JOB "OLAP cube count distinct operator"
REDUCERS 10;
MAP {
    // load the fact table
    data = LOAD "$inputFactTable" USING RUBIX;
    data = FROM data GENERATE memberId, search_type, platform_type, country_sk, locale_sk,
                              CASE (date_sk == "2014-02-06", memberId) AS todayMemberId;

    // OLAP CCD computation on a grouping set
    cube = CUBE data BY country_sk, locale_sk, search_type, platform_type INNER memberId
                    AGGREGATES COUNT_DISTINCT(memberId) AS uniqueWeekly:LONG, [SUM, COUNT_TO_ONE](todayMemberId) AS uniqueDaily
                    GROUPING SETS (country_sk, locale_sk, search_type, platform_type),
                                  (country_sk, locale_sk, search_type),
                                  (country_sk, locale_sk, platform_type),
                                  (country_sk, locale_sk);
}

SHUFFLE cube PARTITIONED ON country_sk, locale_sk, search_type, platform_type
                SORTED ON country_sk, locale_sk, search_type, platform_type;
REDUCE {
    output = GROUP cube BY country_sk, locale_sk, search_type, platform_type
                AGGREGATES SUM(uniqueWeekly) AS uniqueWeekly:long, SUM(uniqueDaily) AS uniqueDaily;
}
STORE output INTO "$output/ccdOutput" USING AVRO;
END

Note

Using CUBE Shuffle Command

See the section on using the CUBE shuffle command, which describes a much more compact (and less error-prone) form of this operator.

PRECONDITIONS: The input data must be partitioned and sorted on the measure (memberId in the above example).