Dictionary Operators

REFRESH-DICTIONARY

This operator builds a dictionary on the specified set of columns in the input dataset and stores it in the specified output location. Other jobs can then use this dictionary to encode or decode those columns. Typically, the dictionary is used to encode string columns as integers in order to reduce data size.

JOB "create dict for 2 columns in dim_member"
        REFRESH-DICTIONARY FROM "/projects/dwh/dwh_dim/dim_member/#LATEST"
                           INTO "output/dimMemberDict" ON country_sk, default_locale_sk;

This is a special kind of operator: it is the only operator that can be present in the job, and the syntax of the JOB also differs from that of the traditional Map-Reduce cubert jobs (as seen above).

The ENCODE and DECODE operators described below are “regular” cubert operators.

PRECONDITIONS: None.
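
As a rough conceptual illustration of what building a dictionary on a set of columns means, the Python sketch below assigns an integer code to each distinct value of each listed column. It is not Cubert code and does not reflect how Cubert assigns codes or stores the dictionary on disk, neither of which is described here. The function name build_dictionary, the sample rows, and the choice to start codes at 1 are all assumptions made for illustration; only the column names come from the job above.

```python
from typing import Any, Dict, Iterable, List, Mapping


def build_dictionary(
    rows: Iterable[Mapping[str, Any]], columns: List[str]
) -> Dict[str, Dict[Any, int]]:
    """Map each distinct value of each listed column to an integer code.

    Codes are handed out in first-seen order, starting at 1 (an assumption
    of this sketch, not a statement about Cubert's numbering).
    """
    dictionary: Dict[str, Dict[Any, int]] = {col: {} for col in columns}
    for row in rows:
        for col in columns:
            codes = dictionary[col]
            value = row[col]
            if value not in codes:
                codes[value] = len(codes) + 1
    return dictionary


# Hypothetical sample data; the values are invented for illustration.
rows = [
    {"member_id": 1, "country_sk": "us", "default_locale_sk": "en_US"},
    {"member_id": 2, "country_sk": "in", "default_locale_sk": "en_US"},
    {"member_id": 3, "country_sk": "us", "default_locale_sk": "hi_IN"},
]

member_dict = build_dictionary(rows, ["country_sk", "default_locale_sk"])
```

Columns that are not listed (member_id here) do not appear in the resulting dictionary at all.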

ENCODE

The ENCODE operator encodes the input data using the specified dictionary. It encodes all the columns on which the dictionary was built.

encoded = ENCODE inputdata USING "output/dimMemberDict";

PRECONDITIONS: None.
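
The effect of encoding can be sketched in Python as follows. This is a conceptual model only, not Cubert's implementation: the dictionary literal, the sample rows, and the function name encode are hypothetical. In particular, how Cubert treats a value that is missing from the dictionary is not stated here; this sketch simply raises a KeyError in that case.

```python
from typing import Any, Dict, List, Mapping

# Hypothetical in-memory dictionary: column name -> {value: integer code}.
member_dict: Dict[str, Dict[Any, int]] = {
    "country_sk": {"us": 1, "in": 2},
    "default_locale_sk": {"en_US": 1, "hi_IN": 2},
}


def encode(
    rows: List[Mapping[str, Any]], dictionary: Mapping[str, Mapping[Any, int]]
) -> List[Dict[str, Any]]:
    """Replace the value of every dictionary column with its integer code.

    Columns without a dictionary entry are copied through unchanged.
    A value absent from the dictionary raises KeyError (sketch behaviour).
    """
    encoded = []
    for row in rows:
        out = dict(row)  # copy so the input rows are left untouched
        for col, codes in dictionary.items():
            out[col] = codes[row[col]]
        encoded.append(out)
    return encoded


inputdata = [
    {"member_id": 1, "country_sk": "us", "default_locale_sk": "en_US"},
    {"member_id": 2, "country_sk": "in", "default_locale_sk": "hi_IN"},
]

encoded = encode(inputdata, member_dict)
```

Note that the caller does not name the columns to encode: every column present in the dictionary is encoded, mirroring the operator's description above.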

DECODE

The DECODE operator decodes the input data using the specified dictionary. It decodes all the columns on which the dictionary was built.

decoded = DECODE encoded USING "output/dimMemberDict";

PRECONDITIONS: None.
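
Decoding is the inverse lookup. The Python sketch below models it by inverting each column's value-to-code mapping; it is a conceptual illustration under the same assumptions as before (hypothetical dictionary contents, rows, and function name), not a description of Cubert's internals.

```python
from typing import Any, Dict, List, Mapping

# Hypothetical in-memory dictionary: column name -> {value: integer code}.
member_dict: Dict[str, Dict[Any, int]] = {
    "country_sk": {"us": 1, "in": 2},
    "default_locale_sk": {"en_US": 1, "hi_IN": 2},
}


def decode(
    rows: List[Mapping[str, Any]], dictionary: Mapping[str, Mapping[Any, int]]
) -> List[Dict[str, Any]]:
    """Replace the integer code in every dictionary column with its value."""
    # Invert each per-column mapping once: {code: value}.
    inverse = {
        col: {code: value for value, code in codes.items()}
        for col, codes in dictionary.items()
    }
    decoded = []
    for row in rows:
        out = dict(row)  # copy so the input rows are left untouched
        for col, values in inverse.items():
            out[col] = values[row[col]]
        decoded.append(out)
    return decoded


# Rows as they might look after encoding with the dictionary above.
encoded = [
    {"member_id": 1, "country_sk": 1, "default_locale_sk": 1},
    {"member_id": 2, "country_sk": 2, "default_locale_sk": 2},
]

decoded = decode(encoded, member_dict)
```

Because each value maps to exactly one code in this sketch, decoding the encoded rows with the same dictionary recovers the original values.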