Welcome to Cubert!

Cubert is a fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop.

Where can I use Cubert?

Cubert is ideally suited for the following application domains:

  1. Statistical Calculations, Joins and Aggregations

    Cubert introduces a new model of computation that allows users to organize data in a format that is ideally suited for scalable execution of subsequent query processing operators, and a set of algorithmically-efficient operators (MeshJoin and CUBE) that exploit the organization to provide significantly improved CPU and resource utilization compared to existing solutions.

  2. Cubes and Grouping Set Aggregations

    The power-horse is the new CUBE operator that can efficiently (CPU and memory) compute additive, non-additive (e.g. Count Distinct) and exact percentile rank (e.g. Median) statistics; can roll up inner dimensions on-the-fly and compute multiple measures within a single job.

  3. Time range calculation and Incremental computations

    Cubert primitives are specially suited for reporting workflows that employ computation pattern that is both regular and repetitive, allowing for efficiency gains from partial result caching and incremental processing.

  4. Graph computations

    Cubert provides a novel sparse matrix multiplication algorithm that is best suited for analytics with large-scale graphs.

  5. When performance or resources are a matter of concern

    Cubert Script is a developer-friendly language that takes out the hints, guesswork and surprises when running the script. The script provides the developers complete control over the execution plan (without resorting to low-level programming!), and is extremely extensible by adding new functions, aggregators and even operators.

How is Cubert different?

_images/cubert-design.jpg

A flavor of Cubert Script

Cubert script is a physical script where we explicitly define the operators at the Mappers, Reducers and Combiners for the different jobs. Following is an example of the Word Count problem written in cubert script.

JOB "word count job"
    REDUCERS 10;
    MAP {
        // load the input data set as a TEXT file
        input = LOAD "$CUBERT_HOME/examples/words.txt" USING TEXT("schema": "STRING word");
        // add a column to each tuple
        with_count = FROM input GENERATE word, 1L AS count;
    }
    // shuffle the data and also invoke combiner to aggregate on map-side
    SHUFFLE with_count PARTITIONED ON word AGGREGATES COUNT(count) AS count;
    REDUCE {
        // at the reducers, sum the counts for each word
        output = GROUP with_count BY word AGGREGATES SUM(count) AS count;
    }
    // store the output using TEXT format
    STORE output INTO "output" USING TEXT();
END

While the Cubert Script code above is already very concise representation of the Word Count problem; as a matter of interest, the idiomatic way of writing in Cubert is even more concise (and a lot faster)!

JOB "idiomatic word count program (even more concise!)"
        REDUCERS 10;
        MAP {
                input = LOAD "$CUBERT_HOME/examples/words.txt" USING TEXT("schema": "STRING word");
        }
        CUBE input BY word AGGREGATES COUNT(word) AS count GROUPING SETS (word);
        STORE input INTO "output" USING TEXT();
END

Cubert Performance Summary

_images/perf-summary.png

Coming (really) soon..

  • Tez and Spark execution engine: the existing cubert scripts (written in Map-Reduce paradigm) can run unmodified on underlying Tez/Spark engine.
  • Cubert Script v2: a DAG oriented language to express complex composition of workflows.
  • Incremental computations: daily running jobs that read multiple days of data can materialize partial output and incrementally compute the results (thus cutting down on resources and time).
  • Analytical window functions.