
plot_trend_changepoint_detection(params=None)

Convenience function to plot the original trend changepoint detection results.

Parameters

params (dict or None, default None) –

The parameters in plot. If set to None, all components will be plotted.

Note: plotting seasonality components is not currently supported, so the plot parameter must be False.

Returns

fig – Figure.

Return type

property pred_category

A dictionary that stores the predictor names in each category.

This property is not initialized until it is first used, which speeds up the fitting process. The categories include:

“intercept” : the intercept.

“time_features” : the predictors that include TimeFeaturesEnum but not SEASONALITY_REGEX.

“event_features” : the predictors that include EVENT_PREFIX.

“trend_features” : the predictors that include TREND_REGEX but not SEASONALITY_REGEX.

“seasonality_features” : the predictors that include SEASONALITY_REGEX.

“lag_features” : the predictors that include LAG_REGEX.

“regressor_features” : external regressors and other predictors manually passed to extra_pred_cols, but not in the categories above.

“interaction_features” : the predictors that include interaction terms, i.e., including a colon.

Note that each predictor falls into at least one category. Some “time_features” may also be “trend_features”. Predictors with an interaction are classified into all categories matched by the interaction components. Thus, “interaction_features” are already included in the other categories.
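The classification rule for interaction terms can be sketched in plain Python. The regular expressions below are hypothetical stand-ins, not the library's TREND_REGEX, SEASONALITY_REGEX, LAG_REGEX or EVENT_PREFIX, and the sketch omits the "but not SEASONALITY_REGEX" exclusions and the remaining categories. It only illustrates that a predictor containing a colon is listed under "interaction_features" and also under every category matched by one of its components.

```python
import re

# Hypothetical patterns for illustration only; the library constants may differ.
PATTERNS = {
    "trend_features": re.compile(r"^(ct\d|changepoint\d+)"),
    "seasonality_features": re.compile(r"(sin|cos)\d+_"),
    "lag_features": re.compile(r"_lag\d+"),
    "event_features": re.compile(r"events_"),
}


def categorize(pred_cols):
    """Group predictor names by category.

    An interaction term (components joined by a colon) is added to every
    category matched by any of its components.
    """
    categories = {name: [] for name in PATTERNS}
    categories["interaction_features"] = []
    for col in pred_cols:
        parts = col.split(":")
        if len(parts) > 1:
            categories["interaction_features"].append(col)
        for name, pattern in PATTERNS.items():
            if any(pattern.search(part) for part in parts):
                categories[name].append(col)
    return categories


result = categorize(["ct1", "sin1_tow_weekly", "y_lag7", "ct1:sin1_tow_weekly"])
```

Here "ct1:sin1_tow_weekly" ends up in three lists: trend, seasonality and interaction.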

predict(X, y=None)

Creates forecast for the dates specified in X.

Parameters

X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Timestamps are the dates for prediction. Value column, if provided in X, is ignored.
y (ignored.) –

Returns

predictions –

Forecasted values for the dates in X. Columns:

TIME_COL: dates

PREDICTED_COL: predictions

PREDICTED_LOWER_COL: lower bound of predictions, optional

PREDICTED_UPPER_COL: upper bound of predictions, optional

[other columns], optional

PREDICTED_LOWER_COL and PREDICTED_UPPER_COL are present if self.coverage is not None.

Return type

pandas.DataFrame

predict_uncertainty(df: DataFrame, predict_params: Optional[dict] = None)

Makes predictions of prediction intervals for df based on the predictions and self.uncertainty_model.

Parameters

df (pandas.DataFrame) – The dataframe to calculate prediction intervals upon. It should have either self.value_col_ or PREDICT_COL which the prediction interval is based on.
predict_params (dict [str, any] or None, default None) – Parameters to be passed to the predict function.

Returns

result_df – The df with prediction interval columns.

Return type

pandas.DataFrame

score(X, y, sample_weight=None)

Default scorer for the estimator (Used in GridSearchCV/RandomizedSearchCV if scoring=None)

Notes

If null_model_params is not None, returns R2_null_model_score of model error relative to null model, evaluated by score_func.

If null_model_params is None, returns score_func of the model itself.

By default, grid search (with no scoring parameter) optimizes improvement of score_func against null model.

To optimize a different score function, pass scoring to GridSearchCV/RandomizedSearchCV.
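The two cases in the notes above can be sketched as follows. This is a simplified illustration, not the library code: it assumes the null-model score is the relative improvement 1 - score_func(model) / score_func(null), and it assumes None is returned when the null model's error is zero. The helper names and sample values are made up for the example.

```python
def mean_squared_error(y_true, y_pred):
    """Plain-Python MSE, standing in for the default score_func."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def score(y_true, y_pred, y_pred_null=None, score_func=mean_squared_error):
    """Return score_func of the model, or its improvement over a null model.

    Assumption: improvement is 1 - model_score / null_score, and the result
    is None when the null model has zero error (ratio undefined).
    """
    model_score = score_func(y_true, y_pred)
    if y_pred_null is None:
        return model_score
    null_score = score_func(y_true, y_pred_null)
    if null_score == 0:
        return None
    return 1.0 - model_score / null_score


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
y_null = [2.5, 2.5, 2.5, 2.5]  # e.g. a constant-mean null model

raw = score(y_true, y_pred)               # no null model: MSE of the model
relative = score(y_true, y_pred, y_null)  # improvement over the null model
```

With a null model, larger values are better, which is the direction a grid search maximizes.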

Parameters

X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Value column, if provided in X, is ignored
y (pandas.Series or numpy.array) – Actual value, used to compute error
sample_weight (pandas.Series or numpy.array) – ignored

Returns

score – Comparison of predictions against null predictions, according to specified score function

Return type

float or None

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
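As a rough illustration of the <component>__<parameter> convention, the sketch below splits a flat parameter dictionary into top-level and per-component parameters. The helper split_nested_params and the step name "estimator" are hypothetical; the actual routing is done inside set_params.

```python
def split_nested_params(params):
    """Split flat parameter names on the first double underscore.

    Returns (top_level, nested), where nested maps a component name to the
    parameters addressed to it.
    """
    top_level, nested = {}, {}
    for key, value in params.items():
        name, delim, sub_key = key.partition("__")
        if delim:
            nested.setdefault(name, {})[sub_key] = value
        else:
            top_level[name] = value
    return top_level, nested


top, nested = split_nested_params({
    "coverage": 0.95,                      # parameter of the object itself
    "estimator__coverage": 0.9,            # parameter of the "estimator" component
    "estimator__fit_algorithm_dict": {"fit_algorithm": "linear"},
})
```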

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

summary(max_colwidth=20)

Creates the model summary for the given model

Parameters: max_colwidth (int) – The maximum length for predictors to be shown in their original name. If the longest predictor name exceeds this length, all predictor names are suppressed and only indices are shown.
Returns: model_summary – The model summary for this model. See ModelSummary
Return type: ModelSummary

class greykite.sklearn.estimator.silverkite_estimator.SilverkiteEstimator(silverkite: ~greykite.algo.forecast.silverkite.forecast_silverkite.SilverkiteForecast = <greykite.algo.forecast.silverkite.forecast_silverkite.SilverkiteForecast object>, score_func=<function mean_squared_error>, coverage=None, null_model_params=None, freq=None, origin_for_time_vars=None, extra_pred_cols=None, drop_pred_cols=None, explicit_pred_cols=None, train_test_thresh=None, training_fraction=None, fit_algorithm_dict=None, daily_event_df_dict=None, daily_event_neighbor_impact=None, daily_event_shifted_effect=None, fs_components_df= name period order seas_names 0 tod 24.0 3 daily 1 tow 7.0 3 weekly 2 conti_year 1.0 5 yearly, autoreg_dict=None, past_df=None, lagged_regressor_dict=None, changepoints_dict=None, seasonality_changepoints_dict=None, changepoint_detector=None, min_admissible_value=None, max_admissible_value=None, uncertainty_dict=None, normalize_method=None, adjust_anomalous_dict=None, impute_dict=None, regression_weight_col=None, forecast_horizon=None, simulation_based=False, simulation_num=10, fast_simulation=False, remove_intercept=False)[source]

Wrapper for forecast.

Parameters

score_func (callable, optional, default mean_squared_error) – See BaseForecastEstimator.
coverage (float between [0.0, 1.0] or None, optional) – See BaseForecastEstimator.
null_model_params (dict or None, optional) – Dictionary with arguments to define DummyRegressor null model, default is None. See BaseForecastEstimator.
fit_algorithm_dict (dict or None, optional) –
How to fit the model. A dictionary with the following optional keys.

"fit_algorithm" : str, optional, default “linear”
The type of predictive model used in fitting.

See fit_model_via_design_matrix for available options and their parameters.

"fit_algorithm_params" : dict or None, optional, default None
Parameters passed to the requested fit_algorithm. If None, uses the defaults in fit_model_via_design_matrix.
uncertainty_dict (dict or str or None, optional) – How to fit the uncertainty model. See forecast. Note that this is allowed to be “auto”. If None or “auto”, will be set to a default value by coverage before calling forecast_silverkite. See BaseForecastEstimator for details.
fs_components_df (pandas.DataFrame or None, optional) –
A dataframe with information about fourier series generation. If provided, it must contain columns with the following names:
- “name”: Name of the timeseries feature (e.g. tod, tow etc.).
- “period”: Period of the fourier series.
- “order”: Order of the fourier series.
- “seas_names”: Label for the type of seasonality (e.g. daily, weekly etc.). It should be unique; validate_fs_components_df checks for this, so that component plots don’t have duplicate y-axis labels.
This differs from the expected input of forecast_silverkite where “period”, “order” and “seas_names” are optional. This restriction is to facilitate appropriate computation of component (e.g. trend, seasonalities and holidays) effects. See Notes section in this docstring for a more detailed explanation with examples.

Other parameters are the same as in forecast.

If this Estimator is called from forecast_pipeline, train_test_thresh and training_fraction should almost always be None, because train/test is handled outside this Estimator.

The attributes are the same as BaseSilverkiteEstimator.

See also

BaseSilverkiteEstimator: For details on fit, predict, and component plots.
SilverkiteForecast.forecast: Functions performing the fit and predict.

validate_inputs()[source]: Validates the inputs to SilverkiteEstimator.

fit(X, y=None, time_col='ts', value_col='y', **fit_params)[source]

Fits Silverkite forecast model.

Parameters

X (pandas.DataFrame) – Input timeseries, with timestamp column, value column, and any additional regressors. The value column is the response, included in X to allow transformation by sklearn.pipeline.
y (ignored) – The original timeseries values, ignored. (The y for fitting is included in X).
time_col (str) – Time column name in X.
value_col (str) – Value column name in X.
fit_params (dict) – additional parameters for null model.

static validate_fs_components_df(fs_components_df)[source]

Validates a fourier series components dataframe. Called by SilverkiteEstimator to validate the input fs_components_df.

Parameters

fs_components_df (pandas.DataFrame) –

A DataFrame with information about fourier series generation. Must contain columns with the following names:

“name”: name of the timeseries feature (e.g. “tod”, “tow” etc.)
“period”: Period of the fourier series
“order”: Order of the fourier series
“seas_names”: seas_name corresponding to the name (e.g. “daily”, “weekly” etc.).
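A minimal sketch of the kind of check described here, using plain dictionaries in place of DataFrame rows. The function check_fs_components is a hypothetical stand-in, not the library's validate_fs_components_df, and it assumes only the two conditions stated in this documentation: the four columns are present and the seas_names labels are unique. The rows mirror the default fs_components_df shown in the SilverkiteEstimator signature.

```python
REQUIRED_COLUMNS = ("name", "period", "order", "seas_names")

# Rows matching the default fs_components_df in the class signature.
default_rows = [
    {"name": "tod", "period": 24.0, "order": 3, "seas_names": "daily"},
    {"name": "tow", "period": 7.0, "order": 3, "seas_names": "weekly"},
    {"name": "conti_year", "period": 1.0, "order": 5, "seas_names": "yearly"},
]


def check_fs_components(rows):
    """Raise ValueError if a column is missing or seas_names repeats."""
    for row in rows:
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            raise ValueError(f"missing columns: {missing}")
    labels = [row["seas_names"] for row in rows]
    if len(labels) != len(set(labels)):
        raise ValueError("seas_names must be unique")
    return True


is_valid = check_fs_components(default_rows)
```

Passing default_rows to pandas.DataFrame would give a frame with the same four columns.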

finish_fit()

Makes important values of self.model_dict conveniently accessible.

To be called by subclasses at the end of their fit method. Sets {pred_cols, feature_cols, and coef_}.

fit_uncertainty(df: DataFrame, uncertainty_dict: dict, fit_params: Optional[dict] = None, **kwargs)

Fits the uncertainty model with a given df and uncertainty_dict.

Parameters

df (pandas.DataFrame) – A dataframe representing the data to fit the uncertainty model.
uncertainty_dict (dict [str, any]) –
The uncertainty model specification. It should have the following keys:

“uncertainty_method”: a string that is in UncertaintyMethodEnum.

“params”: a dictionary that includes any additional parameters needed by the uncertainty method.
fit_params (dict [str, any] or None, default None) – Parameters to be passed to the fit function.
kwargs (additional parameters to be fed into the uncertainty method.) – These parameters are from the estimator attributes, not given by user.

Return type

The function sets self.uncertainty_model and does not return anything.

forecast_breakdown(grouping_regex_patterns_dict, forecast_x_mat=None, time_values=None, center_components=False, denominator=None, plt_title='breakdown of forecasts')

Generates silverkite forecast breakdown for groupings given in grouping_regex_patterns_dict. Note that this only works for additive regression models and not for models such as random forest.

Parameters

grouping_regex_patterns_dict (dict {str: str}) – A dictionary with group names as keys and regexes as values. This dictionary is used to partition the columns into various groups
forecast_x_mat (pd.DataFrame, default None) – The dataframe of design matrix of regression model. If None, this will be extracted from the estimator.
time_values (list or np.array, default None) – A collection of values (usually timestamps) to be used in the figure. It can also be used to join breakdown data with other data when needed. If None and forecast_x_mat is not passed, timestamps are extracted from the estimator to match the forecast_x_mat, which is also extracted from the estimator. If None and forecast_x_mat is passed, the timestamps cannot be inferred, so an integer index with the size of forecast_x_mat is created instead.
center_components (bool, default False) – It determines if components should be centered at their mean and the mean be added to the intercept. More concretely if a component is “x” then it will be mapped to “x - mean(x)”; and “mean(x)” will be added to the intercept so that the sum of the components remains the same.
denominator (str, default None) –
If not None, it will specify a way to divide the components. There are two options implemented:
- “abs_y_mean” : float
  The absolute value of the observed mean of the response
- “y_std” : float
  The standard deviation of the observed response
plt_title (str, default “breakdown of forecasts”) – The title of the generated plot

Returns

result – Dictionary returned by breakdown_regression_based_prediction

Return type

dict
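The effect of center_components and of the “abs_y_mean” denominator can be illustrated with plain lists. This is a sketch of the arithmetic described above, not the library's implementation; the component values, the intercept and the response y are made up for the example.

```python
from statistics import mean


def center(components, intercept):
    """Map each component x to x - mean(x) and add mean(x) to the intercept."""
    centered = {}
    new_intercept = list(intercept)
    for name, values in components.items():
        m = mean(values)
        centered[name] = [v - m for v in values]
        new_intercept = [b + m for b in new_intercept]
    return centered, new_intercept


components = {"trend": [1.0, 2.0, 3.0], "seasonality": [0.5, -0.5, 1.5]}
intercept = [10.0, 10.0, 10.0]
centered, shifted = center(components, intercept)

# The sum of intercept and components is unchanged by centering.
n = len(intercept)
total_before = [intercept[i] + sum(c[i] for c in components.values()) for i in range(n)]
total_after = [shifted[i] + sum(c[i] for c in centered.values()) for i in range(n)]

# "abs_y_mean": divide each component by the absolute value of the mean response.
y = total_before  # hypothetical observed response
abs_y_mean = abs(mean(y))
scaled = {name: [v / abs_y_mean for v in vals] for name, vals in centered.items()}
```

Centering moves each component's mean into the intercept, so the components become comparable around zero while the total stays the same.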

get_max_ar_order()

Gets the maximum autoregression order.

Returns: max_ar_order – The maximum autoregression order.
Return type: int

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

plot_components(grouping_regex_patterns_dict=None, center_components=True, denominator=None, predict_phase=False, title=None)

Class method to plot the components of a Silverkite model on datasets passed to either fit or predict.

Parameters

grouping_regex_patterns_dict (dict, optional, default None) – If None, it is set to DEFAULT_COMPONENTS_REGEX_DICT. An alternative dictionary is available that provides a more detailed breakdown of seasonality components (e.g., weekly, monthly, quarterly, yearly, etc.), See: DETAILED_SEASONALITY_COMPONENTS_REGEX_DICT.
center_components (bool, optional, default True) – It determines if components should be centered at their mean and the mean be added to the intercept. More concretely if a component is “x” then it will be mapped to “x - mean(x)”; and “mean(x)” will be added to the intercept so that the sum of the components remains the same. See forecast_breakdown.
denominator (str, optional, default None) –
If not None, it will specify a way to divide the components. There are two options implemented:
- “abs_y_mean” : float
  The absolute value of the observed mean of the response
- “y_std” : float
  The standard deviation of the observed response
See forecast_breakdown.
predict_phase (bool, optional, default False) – If False, plots the components of the training data and shows three plots: 1) Component Plot, 2) Trend Plot + Change points, and 3) Residuals + Smoothed Residuals. If set to True, plots the component breakdown of the predicted values. When set to True, it only plots one plot, the component plot, as there are no change points or residuals in this time frame.
title (str, optional, default None) – Title of the plot.

Returns

fig – Figure plotting components against appropriate time scale. Plot layout includes:
- Plot 1, “Component Plot”: breakdown from forecast_breakdown
- Plot 2, “Trend + Change Points”
- Plot 3, “Residuals + Smoothed Residuals”; smoothing done using exponentially weighted moving average

Return type

plot_trend_changepoint_detection(params=None)

Convenience function to plot the original trend changepoint detection results.

Parameters

params (dict or None, default None) –

The parameters in plot. If set to None, all components will be plotted.

Note: plotting seasonality components is not currently supported, so the plot parameter must be False.

Returns

fig – Figure.

Return type

property pred_category

A dictionary that stores the predictor names in each category.

This property is not initialized until it is first used, which speeds up the fitting process. The categories include:

“intercept” : the intercept.

“time_features” : the predictors that include TimeFeaturesEnum but not SEASONALITY_REGEX.

“event_features” : the predictors that include EVENT_PREFIX.

“trend_features” : the predictors that include TREND_REGEX but not SEASONALITY_REGEX.

“seasonality_features” : the predictors that include SEASONALITY_REGEX.

“lag_features” : the predictors that include LAG_REGEX.

“regressor_features” : external regressors and other predictors manually passed to extra_pred_cols, but not in the categories above.

“interaction_features” : the predictors that include interaction terms, i.e., including a colon.

Note that each predictor falls into at least one category. Some “time_features” may also be “trend_features”. Predictors with an interaction are classified into all categories matched by the interaction components. Thus, “interaction_features” are already included in the other categories.

predict(X, y=None)

Creates forecast for the dates specified in X.

Parameters

X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Timestamps are the dates for prediction. Value column, if provided in X, is ignored.
y (ignored.) –

Returns

predictions –

Forecasted values for the dates in X. Columns:

TIME_COL: dates

PREDICTED_COL: predictions

PREDICTED_LOWER_COL: lower bound of predictions, optional

PREDICTED_UPPER_COL: upper bound of predictions, optional

[other columns], optional

PREDICTED_LOWER_COL and PREDICTED_UPPER_COL are present if self.coverage is not None.

Return type

pandas.DataFrame

predict_uncertainty(df: DataFrame, predict_params: Optional[dict] = None)

Makes predictions of prediction intervals for df based on the predictions and self.uncertainty_model.

Parameters

df (pandas.DataFrame) – The dataframe to calculate prediction intervals upon. It should have either self.value_col_ or PREDICT_COL which the prediction interval is based on.
predict_params (dict [str, any] or None, default None) – Parameters to be passed to the predict function.

Returns

result_df – The df with prediction interval columns.

Return type

pandas.DataFrame

score(X, y, sample_weight=None)

Default scorer for the estimator (Used in GridSearchCV/RandomizedSearchCV if scoring=None)

Notes

If null_model_params is not None, returns R2_null_model_score of model error relative to null model, evaluated by score_func.

If null_model_params is None, returns score_func of the model itself.

By default, grid search (with no scoring parameter) optimizes improvement of score_func against null model.

To optimize a different score function, pass scoring to GridSearchCV/RandomizedSearchCV.

Parameters

X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Value column, if provided in X, is ignored
y (pandas.Series or numpy.array) – Actual value, used to compute error
sample_weight (pandas.Series or numpy.array) – ignored

Returns

score – Comparison of predictions against null predictions, according to specified score function

Return type

float or None

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

summary(max_colwidth=20)

Creates the model summary for the given model

Parameters: max_colwidth (int) – The maximum length for predictors to be shown in their original name. If the longest predictor name exceeds this length, all predictor names are suppressed and only indices are shown.
Returns: model_summary – The model summary for this model. See ModelSummary
Return type: ModelSummary

class greykite.sklearn.estimator.base_silverkite_estimator.BaseSilverkiteEstimator(silverkite: ~greykite.algo.forecast.silverkite.forecast_silverkite.SilverkiteForecast = <greykite.algo.forecast.silverkite.forecast_silverkite.SilverkiteForecast object>, score_func: callable = <function mean_squared_error>, coverage: ~typing.Optional[float] = None, null_model_params: ~typing.Optional[~typing.Dict] = None, uncertainty_dict: ~typing.Optional[~typing.Dict] = None)[source]

A base class for forecast estimators that fit using forecast.

Notes

Allows estimators that fit using forecast to share the same functions for input data validation, fit postprocessing, predict, summary, plot_components, etc.

Subclasses should:

- Implement their own __init__ that uses a superset of the parameters here.
- Implement their own fit, with this sequence of steps:
  - calls super().fit
  - calls SilverkiteForecast.forecast or SimpleSilverkiteForecast.forecast_simple and stores the result in self.model_dict
  - calls super().finish_fit

Uses coverage to set prediction band width. Even though coverage is not needed by forecast_silverkite, it is included in every BaseForecastEstimator to be used universally for forecast evaluation.

Therefore, uncertainty_dict must be consistent with coverage if provided as a dictionary. If uncertainty_dict is None or “auto”, an appropriate default value is set, according to coverage.
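The fit sequence expected from subclasses can be sketched with stand-in classes. BaseSketch and SubclassSketch are hypothetical and do not import the library; the forecast call is replaced by a placeholder dictionary. Only the ordering of the three steps is taken from the notes above.

```python
class BaseSketch:
    """Stand-in for the base estimator; not the library class."""

    def __init__(self):
        self.model_dict = None
        self.pred_cols = None
        self.calls = []  # records the order of steps, for illustration

    def fit(self, X, y=None, time_col="ts", value_col="y", **fit_params):
        # Pre-processing before the model is fitted.
        self.calls.append("base.fit")
        self.time_col_ = time_col
        self.value_col_ = value_col
        return self

    def finish_fit(self):
        # Expose values stored in model_dict as attributes.
        self.calls.append("base.finish_fit")
        self.pred_cols = self.model_dict["pred_cols"]


class SubclassSketch(BaseSketch):
    def fit(self, X, y=None, time_col="ts", value_col="y", **fit_params):
        # Step 1: shared pre-processing.
        super().fit(X, y=y, time_col=time_col, value_col=value_col, **fit_params)
        # Step 2: placeholder for the forecast call; a real subclass would
        # store the output of SilverkiteForecast.forecast here.
        self.calls.append("forecast")
        self.model_dict = {"pred_cols": ["ct1"]}
        # Step 3: shared post-processing.
        super().finish_fit()
        return self


est = SubclassSketch().fit(X=[{"ts": "2024-01-01", "y": 1.0}])
```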

Parameters

score_func (callable, optional, default mean_squared_error) – See BaseForecastEstimator.
coverage (float between [0.0, 1.0] or None, optional) – See BaseForecastEstimator.
null_model_params (dict, optional) – Dictionary with arguments to define DummyRegressor null model, default is None. See BaseForecastEstimator.
uncertainty_dict (dict or str or None, optional) – How to fit the uncertainty model. See forecast. Note that this is allowed to be “auto”. If None or “auto”, will be set to a default value by coverage before calling forecast_silverkite.

silverkite

The silverkite algorithm instance used for forecasting

Type: Class or a derived class of SilverkiteForecast

model_dict

A dict with fitted model and its attributes. The output of forecast.

Type: dict or None

pred_cols

Names of the features used in the model.

Type: list [str] or None

feature_cols

Column names of the patsy design matrix built by design_mat_from_formula.

Type: list [str] or None

df

The training data used to fit the model.

Type: pandas.DataFrame or None

coef_

Estimated coefficient matrix for the model. Not available for random forest and gradient boosting methods and set to the default value None.

Type: pandas.DataFrame or None

_pred_category

A dictionary with keys being the predictor category and values being the predictors belonging to the category. For details, see pred_category.

Type: dict or None

extra_pred_cols

User provided extra predictor names, for details, see SimpleSilverkiteEstimator or SilverkiteEstimator.

Type: list or None

past_df

The extra past data before training data used to generate autoregression terms.

Type: pandas.DataFrame or None

forecast

Output of predict_silverkite, set by self.predict.

Type: pandas.DataFrame or None

forecast_x_mat

The design matrix of the model at the predict time.

Type: pandas.DataFrame or None

model_summary

The ModelSummary class.

Type: class or None

See also

SilverkiteForecast.forecast: Function performing the fit and predict.

Notes

The subclasses will pass fs_components_df to forecast_silverkite. The model terms it creates internally are used to generate the component plots.

fourier_series_multi_fcn uses fs_components_df["name"] (e.g. tod, tow) to build the fourier series and to create column names.

fs_components_df["seas_names"] (e.g. daily, weekly) is appended to the column names, if provided.
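To make the role of the name, period, order and seas_names columns concrete, here is a small sketch that builds sine and cosine terms for one row. The helper fourier_columns is hypothetical, and the column naming pattern (for example sin1_tod_daily) is an assumption made for illustration; the names generated by fourier_series_multi_fcn may differ.

```python
import math


def fourier_columns(x, name, period, order, seas_name=None):
    """Return {column_name: value} with sin/cos terms for k = 1..order.

    The seasonality label is appended to the column name when provided.
    """
    suffix = f"_{seas_name}" if seas_name else ""
    cols = {}
    for k in range(1, order + 1):
        angle = 2 * math.pi * k * x / period
        cols[f"sin{k}_{name}{suffix}"] = math.sin(angle)
        cols[f"cos{k}_{name}{suffix}"] = math.cos(angle)
    return cols


# Time of day 6.0 on a 24-hour period, with two harmonics.
cols = fourier_columns(x=6.0, name="tod", period=24.0, order=2, seas_name="daily")
```

Because the label is part of each column name, a regular expression on that label can group all terms of one seasonality, which is what the component plots rely on.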

plot_components relies on a regular expression dictionary to group components together. Two are available in the library; see constants for their definitions:

“DEFAULT_COMPONENTS_REGEX_DICT”: Grouped seasonality. This is the default.

“DETAILED_SEASONALITY_COMPONENTS_REGEX_DICT”: A detailed seasonality breakdown where the user can view daily/weekly/monthly/quarterly/yearly seasonality.

fit(X, y=None, time_col='ts', value_col='y', **fit_params)[source]

Pre-processing before fitting Silverkite forecast model.

Parameters

X (pandas.DataFrame) – Input timeseries, with timestamp column, value column, and any additional regressors. The value column is the response, included in X to allow transformation by sklearn.pipeline.
y (ignored) – The original timeseries values, ignored. (The y for fitting is included in X).
time_col (str) – Time column name in X.
value_col (str) – Value column name in X.
fit_params (dict) – additional parameters for null model.

Notes

Subclasses are expected to call this at the beginning of their fit method, before calling forecast.

finish_fit()[source]

Makes important values of self.model_dict conveniently accessible.

To be called by subclasses at the end of their fit method. Sets {pred_cols, feature_cols, and coef_}.

predict(X, y=None)[source]

Creates forecast for the dates specified in X.

Parameters

X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Timestamps are the dates for prediction. Value column, if provided in X, is ignored.
y (ignored.) –

Returns

predictions –

Forecasted values for the dates in X. Columns:

TIME_COL: dates

PREDICTED_COL: predictions

PREDICTED_LOWER_COL: lower bound of predictions, optional

PREDICTED_UPPER_COL: upper bound of predictions, optional

[other columns], optional

PREDICTED_LOWER_COL and PREDICTED_UPPER_COL are present if self.coverage is not None.

Return type

pandas.DataFrame

forecast_breakdown(grouping_regex_patterns_dict, forecast_x_mat=None, time_values=None, center_components=False, denominator=None, plt_title='breakdown of forecasts')[source]

Generates silverkite forecast breakdown for groupings given in grouping_regex_patterns_dict. Note that this only works for additive regression models and not for models such as random forest.

Parameters

grouping_regex_patterns_dict (dict {str: str}) – A dictionary with group names as keys and regexes as values. This dictionary is used to partition the columns into various groups
forecast_x_mat (pd.DataFrame, default None) – The dataframe of design matrix of regression model. If None, this will be extracted from the estimator.
time_values (list or np.array, default None) – A collection of values (usually timestamps) to be used in the figure. It can also be used to join breakdown data with other data when needed. If None and forecast_x_mat is not passed, timestamps are extracted from the estimator to match the forecast_x_mat, which is also extracted from the estimator. If None and forecast_x_mat is passed, the timestamps cannot be inferred, so an integer index with the size of forecast_x_mat is created instead.
center_components (bool, default False) – It determines if components should be centered at their mean and the mean be added to the intercept. More concretely if a component is “x” then it will be mapped to “x - mean(x)”; and “mean(x)” will be added to the intercept so that the sum of the components remains the same.
denominator (str, default None) –
If not None, it will specify a way to divide the components. There are two options implemented:
- “abs_y_mean” : float
  The absolute value of the observed mean of the response
- “y_std” : float
  The standard deviation of the observed response
plt_title (str, default “breakdown of forecasts”) – The title of the generated plot

Returns

result – Dictionary returned by breakdown_regression_based_prediction

Return type

dict

property pred_category

A dictionary that stores the predictor names in each category.

This property is not initialized until it is first used, which speeds up the fitting process. The categories include:

“intercept” : the intercept.

“time_features” : the predictors that include TimeFeaturesEnum but not SEASONALITY_REGEX.

“event_features” : the predictors that include EVENT_PREFIX.

“trend_features” : the predictors that include TREND_REGEX but not SEASONALITY_REGEX.

“seasonality_features” : the predictors that include SEASONALITY_REGEX.

“lag_features” : the predictors that include LAG_REGEX.

“regressor_features” : external regressors and other predictors manually passed to extra_pred_cols, but not in the categories above.

“interaction_features” : the predictors that include interaction terms, i.e., including a colon.

Note that each predictor falls into at least one category. Some “time_features” may also be “trend_features”. Predictors with an interaction are classified into all categories matched by the interaction components. Thus, “interaction_features” are already included in the other categories.

get_max_ar_order()[source]

Gets the maximum autoregression order.

Returns: max_ar_order – The maximum autoregression order.
Return type: int

summary(max_colwidth=20)[source]

Creates the model summary for the given model

Parameters: max_colwidth (int) – The maximum length for predictors to be shown in their original name. If the longest predictor name exceeds this length, all predictor names are suppressed and only indices are shown.
Returns: model_summary – The model summary for this model. See ModelSummary
Return type: ModelSummary

plot_components(grouping_regex_patterns_dict=None, center_components=True, denominator=None, predict_phase=False, title=None)[source]

Class method to plot the components of a Silverkite model on datasets passed to either fit or predict.

Parameters

grouping_regex_patterns_dict (dict, optional, default None) – If None, it is set to DEFAULT_COMPONENTS_REGEX_DICT. An alternative dictionary is available that provides a more detailed breakdown of seasonality components (e.g., weekly, monthly, quarterly, yearly, etc.), See: DETAILED_SEASONALITY_COMPONENTS_REGEX_DICT.
center_components (bool, optional, default True) – It determines if components should be centered at their mean and the mean be added to the intercept. More concretely if a component is “x” then it will be mapped to “x - mean(x)”; and “mean(x)” will be added to the intercept so that the sum of the components remains the same. See forecast_breakdown.
denominator (str, optional, default None) –
If not None, it will specify a way to divide the components. There are two options implemented:
- “abs_y_mean” : float
  The absolute value of the observed mean of the response
- “y_std” : float
  The standard deviation of the observed response
See forecast_breakdown.
predict_phase (bool, optional, default False) – If False, plots the components of the training data and shows three plots: 1) Component Plot, 2) Trend Plot + Change points, and 3) Residuals + Smoothed Residuals. If set to True, plots the component breakdown of the predicted values. When set to True, it only plots one plot, the component plot, as there are no change points or residuals in this time frame.
title (str, optional, default None) – Title of the plot.

Returns

fig – Figure plotting components against appropriate time scale. Plot layout includes:
- Plot 1, “Component Plot”: breakdown from forecast_breakdown
- Plot 2, “Trend + Change Points”
- Plot 3, “Residuals + Smoothed Residuals”; smoothing done using exponentially weighted moving average

Return type

plot_trend_changepoint_detection(params=None)[source]

Convenience function to plot the original trend changepoint detection results.

Parameters

params (dict or None, default None) –

The parameters in plot. If set to None, all components will be plotted.

Note: plotting seasonality components is not currently supported, so the plot parameter must be False.

Returns

fig – Figure.

Return type

fit_uncertainty(df: DataFrame, uncertainty_dict: dict, fit_params: Optional[dict] = None, **kwargs)

Fits the uncertainty model with a given df and uncertainty_dict.

Parameters

df (pandas.DataFrame) – A dataframe representing the data to fit the uncertainty model.
uncertainty_dict (dict [str, any]) –
The uncertainty model specification. It should have the following keys:

“uncertainty_method”: a string that is in UncertaintyMethodEnum.

“params”: a dictionary that includes any additional parameters needed by the uncertainty method.
fit_params (dict [str, any] or None, default None) – Parameters to be passed to the fit function.
kwargs (additional parameters to be fed into the uncertainty method.) – These parameters are from the estimator attributes, not given by user.

Return type

The function sets self.uncertainty_model and does not return anything.

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

predict_uncertainty(df: DataFrame, predict_params: Optional[dict] = None)

Makes predictions of prediction intervals for df based on the predictions and self.uncertainty_model.

Parameters

df (pandas.DataFrame) – The dataframe to calculate prediction intervals upon. It should have either self.value_col_ or PREDICT_COL which the prediction interval is based on.
predict_params (dict [str, any] or None, default None) – Parameters to be passed to the predict function.

Returns

result_df – The df with prediction interval columns.

Return type

score(X, y, sample_weight=None)

Default scorer for the estimator (used in GridSearchCV/RandomizedSearchCV if scoring=None).

Notes

If null_model_params is not None, returns R2_null_model_score of model error relative to null model, evaluated by score_func.

If null_model_params is None, returns score_func of the model itself.

By default, grid search (with no scoring parameter) optimizes improvement of score_func against null model.

To optimize a different score function, pass scoring to GridSearchCV/RandomizedSearchCV.

Parameters

X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Value column, if provided in X, is ignored.
y (pandas.Series or numpy.array) – Actual value, used to compute error
sample_weight (pandas.Series or numpy.array) – ignored

Returns

score – Comparison of predictions against null predictions, according to specified score function

Return type

float or None
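The two return values described in the notes can be sketched in plain Python. This is a minimal illustration, not the library's code. It assumes the null-model score has the form 1 - score_func(model) / score_func(null), that score_func is a loss such as mean squared error, and that the null model predicts a constant. The helpers `mse` and `r2_null_model_score_sketch` are hypothetical names.

```python
def mse(y_true, y_pred):
    """Mean squared error, standing in for score_func."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def r2_null_model_score_sketch(y_true, y_pred, y_pred_null, loss_func=mse):
    """Improvement of the model's loss over a null model's loss (assumed form)."""
    return 1.0 - loss_func(y_true, y_pred) / loss_func(y_true, y_pred_null)


y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 12.0, 13.0, 16.0]   # model predictions
y_null = [13.0, 13.0, 13.0, 13.0]   # null model: constant mean of y_true

# Case null_model_params is None: the score is score_func of the model itself.
model_loss = mse(y_true, y_pred)

# Case null_model_params is not None: the score is relative to the null model.
relative_score = r2_null_model_score_sketch(y_true, y_pred, y_null)
```

Under these assumptions a relative score close to 1 means the model removes most of the null model's error, and a value at or below 0 means it does no better than the baseline.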

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

class greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions(freq: SILVERKITE_FREQ = SILVERKITE_FREQ.DAILY, seas: SILVERKITE_SEAS = SILVERKITE_SEAS.LT, gr: SILVERKITE_GR = SILVERKITE_GR.LINEAR, cp: SILVERKITE_CP = SILVERKITE_CP.NONE, hol: SILVERKITE_HOL = SILVERKITE_HOL.NONE, feaset: SILVERKITE_FEASET = SILVERKITE_FEASET.OFF, algo: SILVERKITE_ALGO = SILVERKITE_ALGO.LINEAR, ar: SILVERKITE_AR = SILVERKITE_AR.OFF, dsi: SILVERKITE_DSI = SILVERKITE_DSI.AUTO, wsi: SILVERKITE_WSI = SILVERKITE_WSI.AUTO)

Defines generic simple silverkite template options.

Attributes can be set to different values using SILVERKITE_COMPONENT_KEYWORDS for high level tuning.

freq represents data frequency.

The other attributes stand for seasonality, growth, changepoints_dict, events, feature_sets_enabled, fit_algorithm and autoregression in ModelComponentsParam, which are used in SimpleSilverkiteTemplate.

freq: SILVERKITE_FREQ = 'DAILY': Valid values for simple silverkite template string name frequency. See SILVERKITE_FREQ.

seas: SILVERKITE_SEAS = 'LT': Valid values for simple silverkite template string name seasonality. See SILVERKITE_SEAS.

gr: SILVERKITE_GR = 'LINEAR': Valid values for simple silverkite template string name growth. See SILVERKITE_GR.

cp: SILVERKITE_CP = 'NONE': Valid values for simple silverkite template string name changepoints. See SILVERKITE_CP.

hol: SILVERKITE_HOL = 'NONE': Valid values for simple silverkite template string name holiday. See SILVERKITE_HOL.

feaset: SILVERKITE_FEASET = 'OFF': Valid values for simple silverkite template string name feature sets enabled. See SILVERKITE_FEASET.

algo: SILVERKITE_ALGO = 'LINEAR': Valid values for simple silverkite template string name fit algorithm. See SILVERKITE_ALGO.

ar: SILVERKITE_AR = 'OFF': Valid values for simple silverkite template string name autoregression. See SILVERKITE_AR.

dsi: SILVERKITE_DSI = 'AUTO': Valid values for simple silverkite template string name max daily seasonality interaction order. See SILVERKITE_DSI.

wsi: SILVERKITE_WSI = 'AUTO': Valid values for simple silverkite template string name max weekly seasonality interaction order. See SILVERKITE_WSI.
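The pattern of keeping most defaults and overriding a few components can be sketched with a stand-in dataclass. `TemplateOptionsSketch` is hypothetical and uses plain strings instead of the real enums; only the field names and default values are taken from the listing above. The override values "HOURLY" and "RIDGE" are assumed to be valid members of SILVERKITE_FREQ and SILVERKITE_ALGO and should be checked against those enums.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TemplateOptionsSketch:
    """Hypothetical stand-in mirroring the defaults listed above."""
    freq: str = "DAILY"
    seas: str = "LT"
    gr: str = "LINEAR"
    cp: str = "NONE"
    hol: str = "NONE"
    feaset: str = "OFF"
    algo: str = "LINEAR"
    ar: str = "OFF"
    dsi: str = "AUTO"
    wsi: str = "AUTO"


defaults = TemplateOptionsSketch()

# Override only the components that should differ from the defaults.
# "HOURLY" and "RIDGE" are assumed values, not confirmed by this page.
tuned = replace(defaults, freq="HOURLY", algo="RIDGE")
```

With the real class, the same idea would apply by passing the corresponding enum members as keyword arguments.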

class greykite.framework.templates.silverkite_template.SilverkiteTemplate

A template for SilverkiteEstimator.

Takes input data and optional configuration parameters to customize the model. Returns a set of parameters to call forecast_pipeline.

Notes

The attributes of a ForecastConfig for SilverkiteEstimator are:

computation_param: ComputationParam or None, default None
How to compute the result. See ComputationParam.

coverage: float or None, default None
Intended coverage of the prediction bands (0.0 to 1.0). Same as coverage in forecast_pipeline. You may tune how the uncertainty is computed via model_components.uncertainty["uncertainty_dict"].

evaluation_metric_param: EvaluationMetricParam or None, default None
What metrics to evaluate. See EvaluationMetricParam.

evaluation_period_param: EvaluationPeriodParam or None, default None
How to split data for evaluation. See EvaluationPeriodParam.

forecast_horizon: int or None, default None
Number of periods to forecast into the future. Must be > 0. If None, the default is determined from the input data frequency. Same as forecast_horizon in forecast_pipeline.

metadata_param: MetadataParam or None, default None
Information about the input data. See MetadataParam.

model_components_param: ModelComponentsParam or None, default None
Parameters to tune the model. See ModelComponentsParam. The fields are dictionaries with the following items.

See inline comments on which values accept lists for grid search.
seasonality: dict [str, any] or None, optional
How to model the seasonality. A dictionary with keys corresponding to parameters in forecast.

Allowed keys: "fs_components_df".

growth: dict [str, any] or None, optional
How to model the growth.

Allowed keys: None. (Use model_components.custom["extra_pred_cols"] to specify growth terms.)

events: dict [str, any] or None, optional
How to model the holidays/events. A dictionary with keys corresponding to parameters in forecast.

Allowed keys: "daily_event_df_dict".

Note

Event names derived from daily_event_df_dict must be specified via model_components.custom["extra_pred_cols"] to be included in the model. This parameter has no effect on the model unless event names are passed to extra_pred_cols.

The function get_event_pred_cols can be used to extract all event names from daily_event_df_dict.

changepoints: dict [str, any] or None, optional
How to model changes in trend and seasonality. A dictionary with keys corresponding to parameters in forecast.

Allowed keys: “changepoints_dict”, “seasonality_changepoints_dict”, “changepoint_detector”.

autoregression: dict [str, any] or None, optional
Specifies the autoregression configuration. Dictionary with the following optional key:

"autoreg_dict": dict or str or None or a list of such values for grid search
If a dict: A dictionary with arguments for build_autoreg_df. That function’s parameter value_col is inferred from the input of current function self.forecast. Other keys are:

"lag_dict" : dict or None
"agg_lag_dict" : dict or None
"series_na_fill_func" : callable

If a str: The string represents a method, and a dictionary will be constructed using that str. Currently the only implemented method is "auto", which uses __get_default_autoreg_dict to create a dictionary. See build_autoreg_df for more details on the parameters above.

regressors: dict [str, any] or None, optional
How to model the regressors.

Allowed keys: None. (Use model_components.custom["extra_pred_cols"] to specify regressors.)

lagged_regressors: dict [str, dict] or None, optional
Specifies the lagged regressors configuration. Dictionary with the following optional key:
"lagged_regressor_dict": dict or None or a list of such values for grid search
A dictionary with arguments for build_autoreg_df_multi. The keys of the dictionary are the target lagged regressor column names. It can leverage the regressors included in df. The value of each key is either a dict or str. If dict, it has the following keys:

"lag_dict" : dict or None
"agg_lag_dict" : dict or None
"series_na_fill_func" : callable

If str, it represents a method and a dictionary will be constructed using that str. Currently the only implemented method is “auto” which uses SilverkiteForecast’s __get_default_lagged_regressor_dict to create a dictionary for each lagged regressor. An example:
lagged_regressor_dict = {
    "regressor1": {
        "lag_dict": {"orders": [1, 2, 3]},
        "agg_lag_dict": {
            "orders_list": [[7, 7 * 2, 7 * 3]],
            "interval_list": [(8, 7 * 2)]},
        "series_na_fill_func": lambda s: s.bfill().ffill()},
    "regressor2": "auto"}
Check the docstring of build_autoreg_df_multi for more details for each argument.
uncertainty: dict [str, any] or None, optional
How to model the uncertainty. A dictionary with keys corresponding to parameters in forecast.

Allowed keys: "uncertainty_dict".

custom: dict [str, any] or None, optional
Custom parameters that don’t fit the categories above. A dictionary with keys corresponding to parameters in forecast.

Allowed keys:
"silverkite", "origin_for_time_vars", "extra_pred_cols", "drop_pred_cols", "explicit_pred_cols", "fit_algorithm_dict", "min_admissible_value", "max_admissible_value".

Note

"extra_pred_cols" should contain the desired growth terms, regressor names, and event names.

fit_algorithm_dict is a dictionary with fit_algorithm and fit_algorithm_params parameters to forecast:

fit_algorithm_dict: dict or None, optional
How to fit the model. A dictionary with the following optional keys.

"fit_algorithm": str, optional, default "linear"
The type of predictive model used in fitting.

See fit_model_via_design_matrix for available options and their parameters.

"fit_algorithm_params": dict or None, optional, default None
Parameters passed to the requested fit_algorithm. If None, uses the defaults in fit_model_via_design_matrix.

hyperparameter_override: dict [str, any] or None or list [dict [str, any] or None], optional
After the above model components are used to create a hyperparameter grid, the result is updated by this dictionary, to create new keys or override existing ones. Allows for complete customization of the grid search.

Keys should have format {named_step}__{parameter_name} for the named steps of the sklearn.pipeline.Pipeline returned by this function. See sklearn.pipeline.Pipeline.

For example:
hyperparameter_override={
    "estimator__origin_for_time_vars": 2018.0,
    "input__response__null__impute_algorithm": "ts_interpolate",
    "input__response__null__impute_params": {"orders": [7, 14]},
    "input__regressors_numeric__normalize__normalize_algorithm": "RobustScaler",
}
If a list of dictionaries, grid search will be done for each dictionary in the list. Each dictionary in the list overrides the defaults. This enables grid search over specific combinations of parameters to reduce the search space.

For example, the first dictionary could define combinations of parameters for a “complex” model, and the second dictionary could define combinations of parameters for a “simple” model, to prevent mixed combinations of simple and complex.

Or the first dictionary could grid search over fit algorithm, and the second dictionary could use a single fit algorithm and grid search over seasonality.

The result is passed as the param_distributions parameter to sklearn.model_selection.RandomizedSearchCV.
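The effect of passing a list of override dictionaries can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `expand_grid` and `apply_override` are hypothetical helpers, the base grid is made up, all override values are written as lists, and "ridge" is used only as an example fit algorithm name.

```python
from itertools import product


def expand_grid(grid):
    """Enumerate every combination in a dict whose values are lists."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


def apply_override(base_grid, override):
    """Return one grid per override dict; each override updates a copy of the base grid."""
    overrides = override if isinstance(override, list) else [override]
    grids = []
    for item in overrides:
        grid = dict(base_grid)
        grid.update(item or {})  # None leaves the base grid unchanged
        grids.append(grid)
    return grids


# Illustrative base grid; key names follow {named_step}__{parameter_name}.
base_grid = {
    "estimator__fit_algorithm_dict": [{"fit_algorithm": "linear"}],
    "estimator__origin_for_time_vars": [None],
}
override = [
    # First grid: search over the fit algorithm only.
    {"estimator__fit_algorithm_dict": [
        {"fit_algorithm": "linear"}, {"fit_algorithm": "ridge"}]},
    # Second grid: keep a single fit algorithm, search over the time origin.
    {"estimator__origin_for_time_vars": [2018.0, 2019.0, 2020.0]},
]

grids = apply_override(base_grid, override)
n_candidates = sum(len(expand_grid(g)) for g in grids)
```

Here the two grids yield 2 + 3 = 5 candidates, whereas a single dictionary containing both lists would yield 2 × 3 = 6, including the mixed combinations the list form is meant to avoid.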
model_template: str
This class only accepts “SK”.

DEFAULT_MODEL_TEMPLATE = 'SK': The default model template. See ModelTemplateEnum. Uses a string to avoid circular imports. Overrides the value from ForecastConfigDefaults.

property allow_model_template_list: SilverkiteTemplate does not allow config.model_template to be a list.

property allow_model_components_param_list: SilverkiteTemplate does not allow config.model_components_param to be a list.

get_regressor_cols()

Returns regressor column names.

Implements the method in BaseTemplate.

The intersection of extra_pred_cols from model components and self.df columns, excluding time_col and value_col.

Returns: regressor_cols – See forecast_pipeline.
Return type: list [str] or None
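The intersection rule described above can be sketched as follows. `regressor_cols_sketch` is a hypothetical helper, the column names are invented, and returning None for an empty result is an assumption based on the stated return type. Interaction terms and other special cases are ignored.

```python
def regressor_cols_sketch(extra_pred_cols, df_columns, time_col, value_col):
    """Columns named in extra_pred_cols that exist in df, minus time and value columns."""
    available = set(df_columns) - {time_col, value_col}
    # Preserve the order given in extra_pred_cols.
    cols = [c for c in extra_pred_cols if c in available]
    return cols or None  # assumption: empty result is reported as None


df_columns = ["ts", "y", "temperature", "promo"]
# "ct1" stands for a growth term that is not a column of df.
extra_pred_cols = ["ct1", "temperature", "promo", "y"]

regressor_cols = regressor_cols_sketch(extra_pred_cols, df_columns, "ts", "y")
```

In this sketch the growth term and the value column are dropped, leaving only the columns that are genuinely external regressors.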

get_lagged_regressor_info()

Returns lagged regressor column names and minimal/maximal lag order. The lag order can be used to check potential imputation in the computation of lags.

Implements the method in BaseTemplate.

Returns

lagged_regressor_info – A dictionary that includes the lagged regressor column names and the maximal/minimal lag order. The keys are:

lagged_regressor_cols: list [str] or None
See forecast_pipeline.

overall_min_lag_order: int or None
overall_max_lag_order: int or None

For example:

self.config.model_components_param.lagged_regressors["lagged_regressor_dict"] = [
    {"regressor1": {
        "lag_dict": {"orders": [7]},
        "agg_lag_dict": {
            "orders_list": [[7, 7 * 2, 7 * 3]],
            "interval_list": [(8, 7 * 2)]},
        "series_na_fill_func": lambda s: s.bfill().ffill()}
    },
    {"regressor2": {
        "lag_dict": {"orders": [2]},
        "agg_lag_dict": {
            "orders_list": [[7, 7 * 2]],
            "interval_list": [(8, 7 * 2)]},
        "series_na_fill_func": lambda s: s.bfill().ffill()}
    },
    {"regressor3": "auto"}
]

Then the function returns:

lagged_regressor_info = {
    "lagged_regressor_cols": ["regressor1", "regressor2", "regressor3"],
    "overall_min_lag_order": 2,
    "overall_max_lag_order": 21
}

Note that "regressor3" is listed in lagged_regressor_cols but is skipped when computing the minimal and maximal lag orders, because the "auto" option makes sure the lag order is proper.

Return type

dict
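The minimal and maximal lag orders in the example above can be reproduced with a short sketch. It is not the library's code: `lag_bounds` and `lagged_regressor_info_sketch` are hypothetical helpers, and the sketch assumes that lag orders come from "orders", every entry of "orders_list", and both ends of each "interval_list" tuple, with "auto" specifications contributing nothing.

```python
def lag_bounds(spec):
    """Collect every lag order mentioned in one regressor's specification."""
    lags = list((spec.get("lag_dict") or {}).get("orders", []))
    agg = spec.get("agg_lag_dict") or {}
    for orders in agg.get("orders_list", []):
        lags.extend(orders)
    for start, end in agg.get("interval_list", []):
        lags.extend([start, end])
    return lags


def lagged_regressor_info_sketch(lagged_regressor_dicts):
    """Summarize column names and overall lag range across all grid entries."""
    cols, lags = [], []
    for entry in lagged_regressor_dicts:
        for name, spec in entry.items():
            cols.append(name)
            if isinstance(spec, dict):  # "auto" specs contribute no lag orders
                lags.extend(lag_bounds(spec))
    return {
        "lagged_regressor_cols": cols or None,
        "overall_min_lag_order": min(lags) if lags else None,
        "overall_max_lag_order": max(lags) if lags else None,
    }


info = lagged_regressor_info_sketch([
    {"regressor1": {
        "lag_dict": {"orders": [7]},
        "agg_lag_dict": {
            "orders_list": [[7, 7 * 2, 7 * 3]],
            "interval_list": [(8, 7 * 2)]}}},
    {"regressor2": {
        "lag_dict": {"orders": [2]},
        "agg_lag_dict": {
            "orders_list": [[7, 7 * 2]],
            "interval_list": [(8, 7 * 2)]}}},
    {"regressor3": "auto"},
])
```

The smallest order (2) comes from regressor2's lag_dict and the largest (21) from regressor1's orders_list, matching the documented result.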

get_hyperparameter_grid()

Returns hyperparameter grid.

Implements the method in BaseTemplate.

Uses self.time_properties and self.config to generate the hyperparameter grid.

Converts model components and time properties into SilverkiteEstimator hyperparameters.

Notes

forecast_pipeline handles the train/test splits according to EvaluationPeriodParam, so estimator__train_test_thresh and estimator__training_fraction are always None.

estimator__changepoint_detector is always None, to prevent leaking future information into the past. Pass changepoints_dict with method="auto" for automatic detection.

Returns: hyperparameter_grid – See forecast_pipeline. The output dictionary values are lists, combined in grid search.
Return type: dict, list [dict] or None

static apply_computation_defaults(computation: Optional[ComputationParam] = None) → ComputationParam

Applies the default ComputationParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a ComputationParam object.

Parameters: computation (ComputationParam or None) – The ComputationParam object.
Returns: computation – Valid ComputationParam object with the provided attribute values and the default attribute values if not.
Return type: ComputationParam

static apply_evaluation_metric_defaults(evaluation: Optional[EvaluationMetricParam] = None) → EvaluationMetricParam

Applies the default EvaluationMetricParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationMetricParam object.

Parameters: evaluation (EvaluationMetricParam or None) – The EvaluationMetricParam object.
Returns: evaluation – Valid EvaluationMetricParam object with the provided attribute values and the default attribute values if not.
Return type: EvaluationMetricParam

static apply_evaluation_period_defaults(evaluation: Optional[EvaluationPeriodParam] = None) → EvaluationPeriodParam

Applies the default EvaluationPeriodParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationPeriodParam object.

Parameters: evaluation (EvaluationPeriodParam or None) – The EvaluationPeriodParam object.
Returns: evaluation – Valid EvaluationPeriodParam object with the provided attribute values and the default attribute values if not.
Return type: EvaluationPeriodParam

apply_forecast_config_defaults(config: Optional[ForecastConfig] = None) → ForecastConfig

Applies the default Forecast Config values to the given config. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input config is None, it creates a Forecast Config.

Parameters: config (ForecastConfig or None) – Forecast configuration if available. See ForecastConfig.
Returns: config – A valid Forecast Config which contains the provided attribute values and the default attribute values if not.
Return type: ForecastConfig

static apply_metadata_defaults(metadata: Optional[MetadataParam] = None) → MetadataParam

Applies the default MetadataParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a MetadataParam object.

Parameters: metadata (MetadataParam or None) – The MetadataParam object.
Returns: metadata – Valid MetadataParam object with the provided attribute values and the default attribute values if not.
Return type: MetadataParam

static apply_model_components_defaults(model_components: Optional[Union[ModelComponentsParam, List[Optional[ModelComponentsParam]]]] = None) → Union[ModelComponentsParam, List[ModelComponentsParam]]

Applies the default ModelComponentsParam values to the given object.

Converts None to a ModelComponentsParam object. Unpacks a list of a single element to the element itself.

Parameters: model_components (ModelComponentsParam or None or list of such items) – The ModelComponentsParam object.
Returns: model_components – Valid ModelComponentsParam object with the provided attribute values and the default attribute values if not.
Return type: ModelComponentsParam or list of such items

apply_model_template_defaults(model_template: Optional[Union[str, List[Optional[str]]]] = None) → Union[str, List[str]]

Applies the default model template to the given object.

Unpacks a list of a single element to the element itself. Sets default value if None.

Parameters: model_template (str or None or list [None, str]) – The model template name. See valid names in ModelTemplateEnum.
Returns: model_template – The model template name, with the default value used if not provided.
Return type: str or list [str]
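The unpacking and default-filling behavior described above can be sketched as a small function. `apply_model_template_defaults_sketch` is a hypothetical stand-in; the default "SK" is taken from DEFAULT_MODEL_TEMPLATE on this class, and replacing None entries inside a list with the default is an assumption inferred from the stated input and return types.

```python
def apply_model_template_defaults_sketch(model_template=None, default="SK"):
    """Fill in the default template name and unpack single-element lists."""
    if model_template is None:
        return default
    if isinstance(model_template, list):
        # Assumption: None entries inside a list also take the default.
        filled = [default if name is None else name for name in model_template]
        return filled[0] if len(filled) == 1 else filled
    return model_template


from_none = apply_model_template_defaults_sketch(None)
from_single = apply_model_template_defaults_sketch(["LAG_BASED"])
from_mixed = apply_model_template_defaults_sketch([None, "LAG_BASED"])
```

Note that SilverkiteTemplate itself does not allow config.model_template to be a list, so the list branch matters only for templates that do.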

apply_template_for_pipeline_params(df: DataFrame, config: Optional[ForecastConfig] = None) → Dict

Explicitly calls the method in BaseTemplate to make use of the decorator in this class.

Parameters

df (pandas.DataFrame) – The time series dataframe with time_col and value_col and optional regressor columns.
config (ForecastConfig) – The ForecastConfig class that includes model training parameters.

Returns

pipeline_parameters – The pipeline parameters consumable by forecast_pipeline.

Return type

dict

property estimator: The estimator instance to use as the final step in the pipeline. An instance of greykite.sklearn.estimator.base_forecast_estimator.BaseForecastEstimator.

get_forecast_time_properties()

Returns forecast time parameters.

Uses self.df, self.config, self.regressor_cols.

Available parameters:

self.df

self.config

self.score_func

self.score_func_greater_is_better

self.regressor_cols

self.lagged_regressor_cols

self.estimator

self.pipeline

Returns

time_properties – Time properties dictionary (likely produced by get_forecast_time_properties) with keys:

"period" (int): Period of each observation (i.e. minimum time between observations, in seconds).
"simple_freq" (SimpleTimeFrequencyEnum): SimpleTimeFrequencyEnum member corresponding to data frequency.
"num_training_points" (int): Number of observations for training.
"num_training_days" (int): Number of days for training.
"start_year" (int): Start year of the training period.
"end_year" (int): End year of the forecast period.
"origin_for_time_vars" (float): Continuous time representation of the first date in df.

Return type

dict [str, any] or None, default None

get_pipeline()

Returns pipeline.

Implementation may be overridden by subclass if a different pipeline is desired.

Uses self.estimator, self.score_func, self.score_func_greater_is_better, self.config, self.regressor_cols.

Available parameters:

self.df

self.config

self.score_func

self.score_func_greater_is_better

self.regressor_cols

self.estimator

Returns: pipeline – See forecast_pipeline.
Return type: sklearn.pipeline.Pipeline

score_func: Score function used to select optimal model in CV.

score_func_greater_is_better: True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better.

regressor_cols: A list of regressor columns used in the training and prediction DataFrames. If None, no regressor columns are used.

lagged_regressor_cols: A list of lagged regressor columns used in the training and prediction DataFrames. If None, no lagged regressor columns are used.

pipeline: Pipeline to fit. The final named step must be called “estimator”.

time_properties: Time properties dictionary (likely produced by get_forecast_time_properties)

hyperparameter_grid: Sets properties of the steps in the pipeline, and specifies combinations to search over. Should be valid input to sklearn.model_selection.GridSearchCV (param_grid) or sklearn.model_selection.RandomizedSearchCV (param_distributions).

df: Optional[DataFrame]: Timeseries data to forecast.

config: Optional[ForecastConfig]: Forecast configuration.

pipeline_params: Optional[Dict]: Parameters (keyword arguments) to call forecast_pipeline.

static apply_template_decorator(func)

Decorator for apply_template_for_pipeline_params function.

Overrides the method in BaseTemplate.

Raises: ValueError – if config.model_template != "SK".

Lag Based Template

class greykite.framework.templates.lag_based_template.LagBasedTemplate(estimator: BaseForecastEstimator = LagBasedEstimator())

A template for LagBasedEstimator.

DEFAULT_MODEL_TEMPLATE = 'LAG_BASED': The default model template. See ModelTemplateEnum. Uses a string to avoid circular imports.

property allow_model_template_list: LagBasedTemplate does not allow config.model_template to be a list.

property allow_model_components_param_list: LagBasedTemplate does not allow config.model_components_param to be a list.

get_regressor_cols(): Returns regressor column names from the model components. LagBasedTemplate does not support regressors.

apply_lag_based_model_components_defaults(model_components: Optional[ModelComponentsParam] = None)

Fills the default values to model_components if not provided.

Parameters: model_components (ModelComponentsParam or None, default None) – Configuration for LagBasedTemplate. Should only have values in the “custom” key.
Returns: model_components – The provided model_components with default values set.
Return type: ModelComponentsParam

get_hyperparameter_grid()

Returns hyperparameter grid.

Implements the method in BaseTemplate.

Uses self.config to generate the hyperparameter grid.

Converts model components into LagBasedEstimator hyperparameters.

Returns: hyperparameter_grid – See forecast_pipeline. The output dictionary values are lists, combined in grid search.
Return type: dict, list [dict] or None

apply_template_for_pipeline_params(df: DataFrame, config: Optional[ForecastConfig] = None) → Dict

Explicitly calls the method in BaseTemplate to make use of the decorator in this class.

Parameters

df (pandas.DataFrame) – The time series dataframe with time_col and value_col and optional regressor columns.
config (ForecastConfig) – The ForecastConfig class that includes model training parameters.

Returns

pipeline_parameters – The pipeline parameters consumable by forecast_pipeline.

Return type

dict

static apply_computation_defaults(computation: Optional[ComputationParam] = None) → ComputationParam

Applies the default ComputationParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a ComputationParam object.

Parameters: computation (ComputationParam or None) – The ComputationParam object.
Returns: computation – Valid ComputationParam object with the provided attribute values and the default attribute values if not.
Return type: ComputationParam

static apply_evaluation_metric_defaults(evaluation: Optional[EvaluationMetricParam] = None) → EvaluationMetricParam

Applies the default EvaluationMetricParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationMetricParam object.

Parameters: evaluation (EvaluationMetricParam or None) – The EvaluationMetricParam object.
Returns: evaluation – Valid EvaluationMetricParam object with the provided attribute values and the default attribute values if not.
Return type: EvaluationMetricParam

static apply_evaluation_period_defaults(evaluation: Optional[EvaluationPeriodParam] = None) → EvaluationPeriodParam

Applies the default EvaluationPeriodParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationPeriodParam object.

Parameters: evaluation (EvaluationPeriodParam or None) – The EvaluationPeriodParam object.
Returns: evaluation – Valid EvaluationPeriodParam object with the provided attribute values and the default attribute values if not.
Return type: EvaluationPeriodParam

apply_forecast_config_defaults(config: Optional[ForecastConfig] = None) → ForecastConfig

Applies the default Forecast Config values to the given config. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input config is None, it creates a Forecast Config.

Parameters: config (ForecastConfig or None) – Forecast configuration if available. See ForecastConfig.
Returns: config – A valid Forecast Config which contains the provided attribute values and the default attribute values if not.
Return type: ForecastConfig

static apply_metadata_defaults(metadata: Optional[MetadataParam] = None) → MetadataParam

Applies the default MetadataParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a MetadataParam object.

Parameters: metadata (MetadataParam or None) – The MetadataParam object.
Returns: metadata – Valid MetadataParam object with the provided attribute values and the default attribute values if not.
Return type: MetadataParam

static apply_model_components_defaults(model_components: Optional[Union[ModelComponentsParam, List[Optional[ModelComponentsParam]]]] = None) → Union[ModelComponentsParam, List[ModelComponentsParam]]

Applies the default ModelComponentsParam values to the given object.

Converts None to a ModelComponentsParam object. Unpacks a list of a single element to the element itself.

Parameters: model_components (ModelComponentsParam or None or list of such items) – The ModelComponentsParam object.
Returns: model_components – Valid ModelComponentsParam object with the provided attribute values and the default attribute values if not.
Return type: ModelComponentsParam or list of such items

apply_model_template_defaults(model_template: Optional[Union[str, List[Optional[str]]]] = None) → Union[str, List[str]]

Applies the default model template to the given object.

Unpacks a list of a single element to the element itself. Sets default value if None.

Parameters: model_template (str or None or list [None, str]) – The model template name. See valid names in ModelTemplateEnum.
Returns: model_template – The model template name, with the default value used if not provided.
Return type: str or list [str]

static apply_template_decorator(func)

Decorator for apply_template_for_pipeline_params function.

Overrides the method in BaseTemplate.

Raises: ValueError – if config.model_template != "LAG_BASED".

property estimator: The estimator instance to use as the final step in the pipeline. An instance of greykite.sklearn.estimator.base_forecast_estimator.BaseForecastEstimator.

get_forecast_time_properties()

Returns forecast time parameters.

Uses self.df, self.config, self.regressor_cols.

Available parameters:

self.df

self.config

self.score_func

self.score_func_greater_is_better

self.regressor_cols

self.lagged_regressor_cols

self.estimator

self.pipeline

Returns

time_properties – Time properties dictionary (likely produced by get_forecast_time_properties) with keys:

"period" (int): Period of each observation (i.e. minimum time between observations, in seconds).
"simple_freq" (SimpleTimeFrequencyEnum): SimpleTimeFrequencyEnum member corresponding to data frequency.
"num_training_points" (int): Number of observations for training.
"num_training_days" (int): Number of days for training.
"start_year" (int): Start year of the training period.
"end_year" (int): End year of the forecast period.
"origin_for_time_vars" (float): Continuous time representation of the first date in df.

Return type

dict [str, any] or None, default None

get_lagged_regressor_info()

Returns lagged regressor column names and minimal/maximal lag order. The lag order can be used to check potential imputation in the computation of lags.

Can be overridden by subclass.

Returns

lagged_regressor_info – A dictionary that includes the lagged regressor column names and the maximal/minimal lag order. The keys are:

lagged_regressor_cols: list [str] or None
See forecast_pipeline.

overall_min_lag_order: int or None
overall_max_lag_order: int or None

Return type

dict

get_pipeline()

Returns pipeline.

Implementation may be overridden by subclass if a different pipeline is desired.

Uses self.estimator, self.score_func, self.score_func_greater_is_better, self.config, self.regressor_cols.

Available parameters:

self.df

self.config

self.score_func

self.score_func_greater_is_better

self.regressor_cols

self.estimator

Returns: pipeline – See forecast_pipeline.
Return type: sklearn.pipeline.Pipeline

score_func: Score function used to select optimal model in CV.

score_func_greater_is_better: True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better.

regressor_cols: A list of regressor columns used in the training and prediction DataFrames. If None, no regressor columns are used.

lagged_regressor_cols: A list of lagged regressor columns used in the training and prediction DataFrames. If None, no lagged regressor columns are used.

pipeline: Pipeline to fit. The final named step must be called “estimator”.

time_properties: Time properties dictionary (likely produced by get_forecast_time_properties)

hyperparameter_grid: Sets properties of the steps in the pipeline, and specifies combinations to search over. Should be valid input to sklearn.model_selection.GridSearchCV (param_grid) or sklearn.model_selection.RandomizedSearchCV (param_distributions).

df: Optional[DataFrame]: Timeseries data to forecast.

config: Optional[ForecastConfig]: Forecast configuration.

pipeline_params: Optional[Dict]: Parameters (keyword arguments) to call forecast_pipeline.

class greykite.sklearn.estimator.lag_based_estimator.LagBasedEstimator(score_func=<function mean_squared_error>, coverage=None, null_model_params=None, freq: ~typing.Optional[str] = None, lag_unit: str = 'week', lags: ~typing.Optional[~typing.Union[~typing.List[int], int]] = None, agg_func: ~typing.Union[str, callable] = 'mean', agg_func_params: ~typing.Optional[dict] = None, uncertainty_dict: ~typing.Optional[dict] = None, past_df: ~typing.Optional[~pandas.core.frame.DataFrame] = None, series_na_fill_func: ~typing.Optional[callable] = None)[source]

The lag based estimator, using lagged observations with aggregation functions to forecast the future. This estimator includes the common week-over-week estimation method.

The algorithm supports specifying the following:

lag_unit : the unit used to calculate lagged values. One of the values in LagUnitEnum.

lags : a list of lags indicating which lagged lag_unit data are used in prediction. For example, [1, 2] uses the data at the same time in each of the past two lag_units.

agg_func : the aggregation function used over the lagged observations.

agg_func_params : extra parameters used for agg_func.

When certain lags are not available, extra data will be extrapolated. When predicting into the future and future data is not available, predicted values will be used.
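
As a rough illustration of the behaviour described above, the sketch below forecasts each future point by aggregating the observations at the configured lags, and reuses its own predictions once a lag reaches past the end of the training data. It is not the library's implementation: `lag_based_forecast` and `points_per_unit` are hypothetical names, the series is a plain list at a regular frequency, and the extrapolation of missing past data is replaced by a simple error.

```python
from statistics import mean


def lag_based_forecast(history, horizon, points_per_unit=7, lags=(1, 2), agg_func=mean):
    """Toy lag-based forecast on a regularly spaced series (hypothetical helper).

    points_per_unit is the number of observations in one lag unit,
    e.g. 7 for daily data with a weekly lag unit.
    """
    max_offset = max(lags) * points_per_unit
    if len(history) < max_offset:
        # The real estimator fills in missing past data; this sketch does not.
        raise ValueError("history is too short for the requested lags")
    values = list(history)
    n_train = len(values)
    for _ in range(horizon):
        t = len(values)
        # Same position in each of the lagged units. Once t - offset points
        # past the training data, the entries read here are earlier predictions.
        lagged = [values[t - lag * points_per_unit] for lag in lags]
        values.append(agg_func(lagged))
    return values[n_train:]


# Two weeks of daily data, averaged week-over-week across the past two weeks.
forecast = lag_based_forecast(list(range(14)), horizon=8)
print(forecast)
```

The eighth forecast point lies more than one week past the training data, so its lag 1 value is itself a prediction rather than an observation.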

Parameters

freq (str or None, default None) – The data frequency, used to validate lags.
lag_unit (str, default “week”) – The unit to calculate lagged observations. Available options are in LagUnitEnum.
lags (list [int] or None, default None) – The lags, in units of lag_unit. [1, 2] uses the values at the same time in each of the past two lag_units. If not provided, only the lag 1 observation is used.
agg_func (str or callable, default “mean”) – The aggregation functions used over lagged observations.
agg_func_params (dict or None, default None) – Extra parameters used for agg_func.
uncertainty_dict (dict or None, default None) – How to fit the uncertainty model. See UncertaintyMethodEnum. If not provided but coverage is given, this falls back to SimpleConditionalResidualsModel.
past_df (pandas.DataFrame or None, default None) – The past data used to append to the training data. If not provided the past data needed will be interpolated.
series_na_fill_func (callable or None, default lambda s: s.bfill().ffill()) – The function to fill NAs when they exist.

df

The fitted and interpolated training data.

Type: pandas.DataFrame or None

uncertainty_model

The trained uncertainty model.

Type: any or None

max_lag_order

The maximum lag order.

Type: int or None

min_lag_order

The minimum lag order.

Type: int or None

train_start

The training start timestamp.

Type: pandas.Timestamp or None

train_end

The training end timestamp.

Type: pandas.Timestamp or None

fit(X, y=None, time_col='ts', value_col='y', **fit_params)[source]

Fits the lag based forecast model.

Parameters

X (pandas.DataFrame) – Input timeseries, with timestamp column and value column. The value column is the response, included in X to allow transformation by sklearn.pipeline.
y (ignored) – The original timeseries values, ignored. (The y for fitting is included in X).
time_col (str) – Time column name in X.
value_col (str) – Value column name in X.
fit_params (dict) – additional parameters for null model.

Returns

self – Fitted class instance.

Return type

self

predict(X, y=None)[source]

Creates forecast for the dates specified in X.

Parameters

X (pandas.DataFrame) – Input timeseries with timestamp column. Timestamps are the dates for prediction. Value column, if provided in X, is ignored.
y (ignored.) –

Returns

predictions –

Forecasted values for the dates in X. Columns:

TIME_COL: dates

PREDICTED_COL: predictions

PREDICTED_LOWER_COL: lower bound of predictions, optional

PREDICTED_UPPER_COL: upper bound of predictions, optional

PREDICTED_LOWER_COL and PREDICTED_UPPER_COL are present if self.coverage is not None.

Return type

pandas.DataFrame

summary()[source]: The summary of model.

fit_uncertainty(df: DataFrame, uncertainty_dict: dict, fit_params: Optional[dict] = None, **kwargs)

Fits the uncertainty model with a given df and uncertainty_dict.

Parameters

df (pandas.DataFrame) – A dataframe representing the data to fit the uncertainty model.
uncertainty_dict (dict [str, any]) –
The uncertainty model specification. It should have the following keys:

“uncertainty_method”: a string that is in UncertaintyMethodEnum.

“params”: a dictionary that includes any additional parameters needed by the uncertainty method.
fit_params (dict [str, any] or None, default None) – Parameters to be passed to the fit function.
kwargs (additional parameters to be fed into the uncertainty method.) – These parameters are from the estimator attributes, not given by user.

Return type

The function sets self.uncertainty_model and does not return anything.

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

predict_uncertainty(df: DataFrame, predict_params: Optional[dict] = None)

Makes predictions of prediction intervals for df based on the predictions and self.uncertainty_model.

Parameters

df (pandas.DataFrame) – The dataframe to calculate prediction intervals upon. It should have either self.value_col_ or PREDICT_COL which the prediction interval is based on.
predict_params (dict [str, any] or None, default None) – Parameters to be passed to the predict function.

Returns

result_df – The df with prediction interval columns.

Return type

pandas.DataFrame

score(X, y, sample_weight=None)

Default scorer for the estimator (Used in GridSearchCV/RandomizedSearchCV if scoring=None)

Notes

If null_model_params is not None, returns R2_null_model_score of model error relative to null model, evaluated by score_func.

If null_model_params is None, returns score_func of the model itself.

By default, grid search (with no scoring parameter) optimizes improvement of score_func against null model.

To optimize a different score function, pass scoring to GridSearchCV/RandomizedSearchCV.

Parameters

X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Value column, if provided in X, is ignored
y (pandas.Series or numpy.array) – Actual value, used to compute error
sample_weight (pandas.Series or numpy.array) – ignored

Returns

score – Comparison of predictions against null predictions, according to specified score function

Return type

float or None
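
The two cases in the notes above can be sketched as follows. This is a minimal illustration, assuming that score_func is a loss such as mean squared error and that the score relative to the null model takes the common form 1 - loss(model) / loss(null); the helper name `relative_score` is hypothetical and the library's exact computation may differ.

```python
def mean_squared_error(y_true, y_pred):
    """Plain-Python mean squared error, standing in for score_func."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def relative_score(y_true, y_pred, y_pred_null=None, score_func=mean_squared_error):
    """Hypothetical helper mirroring the two cases in the notes above."""
    model_loss = score_func(y_true, y_pred)
    if y_pred_null is None:
        # No null model: report score_func of the model itself.
        return model_loss
    null_loss = score_func(y_true, y_pred_null)
    if null_loss == 0:
        # Relative improvement is undefined when the null model is perfect.
        return None
    # Assumed form: 1 is a perfect model, 0 is no better than the null model.
    return 1.0 - model_loss / null_loss


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
y_null = [2.5] * 4  # null model that always predicts the mean

print(relative_score(y_true, y_pred, y_null))
print(relative_score(y_true, y_pred))
```

With a null model the result is a relative improvement where higher is better; without one it is the raw loss, where lower is better.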

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance
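
The `<component>__<parameter>` convention can be pictured with a small stand-in class. This is only a sketch of the routing idea: `Component` is hypothetical, it is not scikit-learn's implementation, and it skips the validation of parameter names that a real estimator performs.

```python
class Component:
    """Minimal object with settable parameters (hypothetical, not scikit-learn)."""

    def __init__(self, **params):
        self.params = dict(params)

    def set_params(self, **params):
        for key, value in params.items():
            # Split "<component>__<parameter>" at the first double underscore.
            name, sep, sub_key = key.partition("__")
            if sep:
                # Delegate the rest of the key to the named sub-component.
                self.params[name].set_params(**{sub_key: value})
            else:
                self.params[name] = value
        return self


pipeline = Component(estimator=Component(lags=[1], agg_func="mean"))
pipeline.set_params(estimator__lags=[1, 2])
print(pipeline.params["estimator"].params)
```

Only the addressed parameter of the nested component changes; its other parameters keep their values.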

class greykite.sklearn.estimator.lag_based_estimator.LagUnitEnum(value)[source]: Defines the lag units available in LagBasedEstimator. The keys are available string names and the values are the corresponding dateutil.relativedelta.relativedelta objects.

Multistage Forecast Template

class greykite.framework.templates.multistage_forecast_template.MultistageForecastTemplate(constants: ~greykite.framework.templates.multistage_forecast_template_config.MultistageForecastTemplateConstants = <class 'greykite.framework.templates.multistage_forecast_template_config.MultistageForecastTemplateConstants'>, estimator: ~greykite.sklearn.estimator.base_forecast_estimator.BaseForecastEstimator = MultistageForecastEstimator(forecast_horizon=1, model_configs=[]))[source]

The model template for Multistage Forecast Estimator.

DEFAULT_MODEL_TEMPLATE = 'SILVERKITE_TWO_STAGE': The default model template. See ModelTemplateEnum. Uses a string to avoid circular imports.

property constants: MultistageForecastTemplateConstants: Constants used by the template class. Includes the model templates and their default values.

get_regressor_cols()[source]

Gets the regressor columns in the model.

Iterates over each submodel to extract the regressor columns.

Returns: regressor_cols – A list of the regressor column names used in any of the submodels.
Return type: list [str]

get_lagged_regressor_info()[source]

Gets the lagged regressor info for the model.

Iterates over each submodel to extract the lagged regressor info.

Returns: lagged_regressor_info – The combined lagged regressor info from all submodels.
Return type: dict

get_hyperparameter_grid()[source]

Gets the hyperparameter grid for the Multistage Forecast Model.

Returns: hyperparameter_grid – hyperparameter_grid for grid search in forecast_pipeline. The output dictionary values are lists, combined in grid search.
Return type: dict [str, list [any]] or list [ dict [str, list [any]] ]

property allow_model_template_list: bool: Whether the template accepts a list for config.model_template (bool)

property allow_model_components_param_list: bool: Whether the template accepts a list for config.model_components_param (bool)

static apply_computation_defaults(computation: Optional[ComputationParam] = None) → ComputationParam

Applies the default ComputationParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a ComputationParam object.

Parameters: computation (ComputationParam or None) – The ComputationParam object.
Returns: computation – Valid ComputationParam object with the provided attribute values and the default attribute values if not.
Return type: ComputationParam
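
The "fill in only what is missing" behaviour described for the apply_*_defaults methods can be sketched with a small dataclass. Everything here is illustrative: `ToyParam`, its fields, and the default values are hypothetical, and the sketch assumes that None means "not provided".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyParam:
    """Hypothetical stand-in for a config object; None means "not provided"."""
    workers: Optional[int] = None
    verbosity: Optional[int] = None


# Illustrative defaults only, not the library's actual default values.
TOY_DEFAULTS = {"workers": 1, "verbosity": 0}


def apply_toy_defaults(param: Optional[ToyParam] = None) -> ToyParam:
    if param is None:
        # No object given: create one, then fill every attribute below.
        param = ToyParam()
    for name, default in TOY_DEFAULTS.items():
        if getattr(param, name) is None:
            # Only attributes that were not provided receive the default.
            setattr(param, name, default)
    return param


print(apply_toy_defaults())
print(apply_toy_defaults(ToyParam(workers=4)))
```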

static apply_evaluation_metric_defaults(evaluation: Optional[EvaluationMetricParam] = None) → EvaluationMetricParam

Applies the default EvaluationMetricParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationMetricParam object.

Parameters: evaluation (EvaluationMetricParam or None) – The EvaluationMetricParam object.
Returns: evaluation – Valid EvaluationMetricParam object with the provided attribute values and the default attribute values if not.
Return type: EvaluationMetricParam

static apply_evaluation_period_defaults(evaluation: Optional[EvaluationPeriodParam] = None) → EvaluationPeriodParam

Applies the default EvaluationPeriodParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationPeriodParam object.

Parameters: evaluation (EvaluationPeriodParam or None) – The EvaluationPeriodParam object.
Returns: evaluation – Valid EvaluationPeriodParam object with the provided attribute values and the default attribute values if not.
Return type: EvaluationPeriodParam

apply_forecast_config_defaults(config: Optional[ForecastConfig] = None) → ForecastConfig

Applies the default Forecast Config values to the given config. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input config is None, it creates a Forecast Config.

Parameters: config (ForecastConfig or None) – Forecast configuration if available. See ForecastConfig.
Returns: config – A valid Forecast Config which contains the provided attribute values and the default attribute values if not.
Return type: ForecastConfig

static apply_metadata_defaults(metadata: Optional[MetadataParam] = None) → MetadataParam

Applies the default MetadataParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a MetadataParam object.

Parameters: metadata (MetadataParam or None) – The MetadataParam object.
Returns: metadata – Valid MetadataParam object with the provided attribute values and the default attribute values if not.
Return type: MetadataParam

static apply_model_components_defaults(model_components: Optional[Union[ModelComponentsParam, List[Optional[ModelComponentsParam]]]] = None) → Union[ModelComponentsParam, List[ModelComponentsParam]]

Applies the default ModelComponentsParam values to the given object.

Converts None to a ModelComponentsParam object. Unpacks a list of a single element to the element itself.

Parameters: model_components (ModelComponentsParam or None or list of such items) – The ModelComponentsParam object.
Returns: model_components – Valid ModelComponentsParam object with the provided attribute values and the default attribute values if not.
Return type: ModelComponentsParam or list of such items

apply_model_template_defaults(model_template: Optional[Union[str, List[Optional[str]]]] = None) → Union[str, List[str]]

Applies the default model template to the given object.

Unpacks a list of a single element to the element itself. Sets default value if None.

Parameters: model_template (str or None or list [None, str]) – The model template name. See valid names in ModelTemplateEnum.
Returns: model_template – The model template name, with the default value used if not provided.
Return type: str or list [str]
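
The unpacking and defaulting rules above can be sketched as a small function. The helper name `normalize_model_template` is hypothetical, and replacing None entries inside a longer list with the default is an assumption based on the documented parameter and return types.

```python
from typing import List, Optional, Union


def normalize_model_template(
    model_template: Optional[Union[str, List[Optional[str]]]] = None,
    default: str = "SILVERKITE_TWO_STAGE",
) -> Union[str, List[str]]:
    """Hypothetical sketch of the unpack-and-default behaviour."""
    if model_template is None:
        return default
    if isinstance(model_template, list):
        # Assumed: None entries inside a list also fall back to the default.
        filled = [default if item is None else item for item in model_template]
        # A list with a single element is unpacked to the element itself.
        return filled[0] if len(filled) == 1 else filled
    return model_template


print(normalize_model_template())
print(normalize_model_template(["PROPHET"]))
print(normalize_model_template([None, "PROPHET"]))
```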

static apply_template_decorator(func)

Decorator for apply_template_for_pipeline_params function.

By default, this applies apply_forecast_config_defaults to config.

Subclass may override this for pre/post processing of apply_template_for_pipeline_params, such as input validation. In this case, apply_template_for_pipeline_params must also be implemented in the subclass.

apply_template_for_pipeline_params(df: DataFrame, config: Optional[ForecastConfig] = None) → Dict

Implements template interface method. Takes input data and optional configuration parameters to customize the model. Returns a set of parameters to call forecast_pipeline.

See template interface for parameters and return value.

Uses the methods in this class to set:

"regressor_cols" : get_regressor_cols()

"lagged_regressor_cols" : get_lagged_regressor_info()

"pipeline" : get_pipeline()

"time_properties" : get_forecast_time_properties()

"hyperparameter_grid" : get_hyperparameter_grid()

All other parameters are taken directly from config.

property estimator: The estimator instance to use as the final step in the pipeline. An instance of greykite.sklearn.estimator.base_forecast_estimator.BaseForecastEstimator.

get_forecast_time_properties()

Returns forecast time parameters.

Uses self.df, self.config, self.regressor_cols.

Available parameters:

self.df

self.config

self.score_func

self.score_func_greater_is_better

self.regressor_cols

self.lagged_regressor_cols

self.estimator

self.pipeline

Returns

time_properties – Time properties dictionary (likely produced by get_forecast_time_properties) with keys:

"period" : int
Period of each observation (i.e. minimum time between observations, in seconds).

"simple_freq" : SimpleTimeFrequencyEnum
SimpleTimeFrequencyEnum member corresponding to data frequency.

"num_training_points" : int
Number of observations for training.

"num_training_days" : int
Number of days for training.

"start_year" : int
Start year of the training period.

"end_year" : int
End year of the forecast period.

"origin_for_time_vars" : float
Continuous time representation of the first date in df.

Return type

dict [str, any] or None, default None

get_pipeline()

Returns pipeline.

Implementation may be overridden by subclass if a different pipeline is desired.

Uses self.estimator, self.score_func, self.score_func_greater_is_better, self.config, self.regressor_cols.

Available parameters:

self.df

self.config

self.score_func

self.score_func_greater_is_better

self.regressor_cols

self.estimator

Returns: pipeline – See forecast_pipeline.
Return type: sklearn.pipeline.Pipeline

score_func: Score function used to select optimal model in CV.

score_func_greater_is_better: True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better.

regressor_cols: A list of regressor columns used in the training and prediction DataFrames. If None, no regressor columns are used.

lagged_regressor_cols: A list of lagged regressor columns used in the training and prediction DataFrames. If None, no lagged regressor columns are used.

pipeline: Pipeline to fit. The final named step must be called “estimator”.

time_properties: Time properties dictionary (likely produced by get_forecast_time_properties)

hyperparameter_grid: Sets properties of the steps in the pipeline, and specifies combinations to search over. Should be valid input to sklearn.model_selection.GridSearchCV (param_grid) or sklearn.model_selection.RandomizedSearchCV (param_distributions).

df: Optional[pd.DataFrame]: Timeseries data to forecast.

config: Optional[ForecastConfig]: Forecast configuration.

pipeline_params: Optional[Dict]: Parameters (keyword arguments) to call forecast_pipeline.

class greykite.sklearn.estimator.multistage_forecast_estimator.MultistageForecastEstimator(model_configs: ~typing.List[~greykite.sklearn.estimator.multistage_forecast_estimator.MultistageForecastModelConfig], forecast_horizon: int, freq: ~typing.Optional[str] = None, uncertainty_dict: ~typing.Optional[dict] = None, score_func: ~typing.Callable = <function mean_squared_error>, coverage: ~typing.Optional[float] = None, null_model_params: ~typing.Optional[dict] = None)[source]

The Multistage Forecast Estimator class. Implements the Multistage forecast method.

The Multistage forecast method allows users to fit multiple stages of models, with each stage proceeding as follows:

subsetting: take a subset of data from the end of the training data;

aggregation: aggregate the subset of data into desired frequency;

training: train a model with the desired estimator and parameters.

Users can use a single-stage model to train on a subset/aggregation of the original data, or specify multiple stages, where each later stage is trained on the fitted residuals of the previous stages.

This can significantly speed up the training process if the original data is long and in fine granularity.
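
The subsetting and aggregation steps of a single stage can be sketched on a plain list. This is a simplified illustration, not the estimator's code: `subset_and_aggregate` is a hypothetical helper, lengths are counted in data points rather than time strings, and incomplete periods are dropped by aligning blocks to the end of the data, which is an assumption made for the sketch. The training step is omitted.

```python
from statistics import mean


def subset_and_aggregate(values, train_length, points_per_agg, agg_func=mean):
    """Take the last train_length points and aggregate them in fixed-size blocks."""
    if train_length <= 0 or points_per_agg <= 0:
        raise ValueError("train_length and points_per_agg must be positive")
    # Subsetting: keep only the end of the training data.
    subset = values[-train_length:]
    # Drop a leading incomplete block so every aggregate covers a full period.
    remainder = len(subset) % points_per_agg
    subset = subset[remainder:]
    # Aggregation: one value per block of points_per_agg observations.
    return [
        agg_func(subset[i:i + points_per_agg])
        for i in range(0, len(subset), points_per_agg)
    ]


# Ten points, keep the last seven, aggregate in blocks of three.
aggregated = subset_and_aggregate(list(range(1, 11)), train_length=7, points_per_agg=3)
print(aggregated)
```

Here seven points shrink to two aggregated values, which is where the speed-up for long, fine-grained data comes from.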

Notes

The following assumptions or special implementations are made in this class:

The actual fit_length, the length of data over which the fitted values are calculated, is the longer of train_length and fit_length. There is no benefit in calculating fitted values over a shorter period: the fitted values are already available during training (in Silverkite), so there is no cost to calculating fitted values on a superset of the training data.

The estimator sorts the model_configs according to the train_length in descending order. The corresponding aggregation frequency, aggregation function, fit length, estimator and parameters will be sorted accordingly. This is to ensure that we have enough data to use from the previous model when we fit the next model.

When calculating the length of training data, the length of past df, etc, the actual length used may include 1 more period to avoid missing timestamps. For example, for an AR order of 5, you may see the length of past_df to be 6; or for a train length of “365D”, you may see the actual length to be 366. This is expected, just to avoid potential missing timestamps after dropping incomplete aggregation periods.

Since the models in each stage may not fit on the entire training data, there could be periods where fitted values are not calculated. Leading fitted values in the training period may be NA. These values are ignored when computing evaluation metrics.
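
Two of the rules above, ordering the stages by training length and extending fit_length to at least train_length, can be sketched as follows. `StageConfig` and `normalize_stage_configs` are hypothetical names, and lengths are given directly in seconds to keep the sketch free of time-string parsing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StageConfig:
    """Hypothetical stand-in for one stage's configuration."""
    name: str
    train_length_seconds: int
    fit_length_seconds: Optional[int] = None


def normalize_stage_configs(configs: List[StageConfig]) -> List[StageConfig]:
    # Longest training window first, so later stages can reuse earlier output.
    ordered = sorted(configs, key=lambda c: c.train_length_seconds, reverse=True)
    return [
        StageConfig(
            name=c.name,
            train_length_seconds=c.train_length_seconds,
            # A missing or shorter fit length is replaced by the train length.
            fit_length_seconds=max(c.train_length_seconds, c.fit_length_seconds or 0),
        )
        for c in ordered
    ]


DAY = 86400
stages = normalize_stage_configs([
    StageConfig("hourly", train_length_seconds=28 * DAY, fit_length_seconds=7 * DAY),
    StageConfig("daily", train_length_seconds=365 * DAY),
])
print([(s.name, s.fit_length_seconds // DAY) for s in stages])
```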

model_configs

A list of model configs for Multistage Forecast estimator, representing the stages in the model.

Type: list [MultistageForecastModelConfig]

forecast_horizon

The forecast horizon on the original data frequency.

Type: int

freq

The frequency of the original data.

Type: str or None

train_lengths

A list of training data lengths for the models.

Type: list [str] or None

fit_lengths

A list of fitting data lengths for the models.

Type: list [str] or None

agg_funcs

A list of aggregation functions for the models.

Type: list [str or Callable] or None

agg_freqs

A list of aggregation frequencies for the models.

Type: list [str] or None

estimators

A list of estimators used in the models.

Type: list [BaseForecastEstimator] or None

estimator_params

A list of estimator parameters for the estimators.

Type: list [dict or None] or None

train_lengths_in_seconds

The list of training lengths in seconds.

Type: list [int] or None

fit_lengths_in_seconds

The list of fitting lengths in seconds. If the original fit_length is None or is shorter than the corresponding train_length, it will be replaced by the corresponding train_length.

Type: list [int] or None

max_ar_orders

A list of maximum AR orders in the models.

Type: list [int] or None

data_freq_in_seconds

The data frequency in seconds.

Type: int or None

num_points_per_agg_freqs

Number of data points in each aggregation frequency.

Type: list [int] or None

models

The list of model instances.

Type: list [BaseForecastEstimator]

fit_df

The prediction df.

Type: pandas.DataFrame or None

train_end

The train end timestamp.

Type: pandas.Timestamp or None

forecast_horizons

The list of forecast horizons for all models in terms of the aggregated frequencies.

Type: list [int]

fit(X, y=None, time_col='ts', value_col='y', **fit_params)[source]

Fits the Multistage Forecast model.

Parameters

X (pandas.DataFrame) – Input timeseries, with timestamp column, value column, and any additional regressors. The value column is the response, included in X to allow transformation by sklearn.pipeline.
y (ignored) – The original timeseries values, ignored. (The y for fitting is included in X).
time_col (str) – Time column name in X.
value_col (str) – Value column name in X.
fit_params (dict) – additional parameters for null model.

Returns

self – Fitted model is stored in self.model_dict.

Return type

self

predict(X, y=None)[source]

Creates forecast for the dates specified in X.

Parameters

X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Timestamps are the dates for prediction. Value column, if provided in X, is ignored.
y (ignored.) –

Returns

predictions –

Forecasted values for the dates in X. Columns:

TIME_COL: dates

PREDICTED_COL: predictions

PREDICTED_LOWER_COL: lower bound of predictions, optional

PREDICTED_UPPER_COL: upper bound of predictions, optional

PREDICTED_LOWER_COL and PREDICTED_UPPER_COL are present if self.coverage is not None.

Return type

pandas.DataFrame

plot_components()[source]

Makes component plots.

Returns: figs – A list of figures from each model.
Return type: list [plotly.graph_objects.Figure or None]

summary()[source]

Gets model summaries.

Returns: summaries – A list of model summaries from each model.
Return type: list [ModelSummary or None]

fit_uncertainty(df: DataFrame, uncertainty_dict: dict, fit_params: Optional[dict] = None, **kwargs)

Fits the uncertainty model with a given df and uncertainty_dict.

Parameters

df (pandas.DataFrame) – A dataframe representing the data to fit the uncertainty model.
uncertainty_dict (dict [str, any]) –
The uncertainty model specification. It should have the following keys:

“uncertainty_method”: a string that is in UncertaintyMethodEnum.

“params”: a dictionary that includes any additional parameters needed by the uncertainty method.
fit_params (dict [str, any] or None, default None) – Parameters to be passed to the fit function.
kwargs (additional parameters to be fed into the uncertainty method.) – These parameters are from the estimator attributes, not given by user.

Return type

The function sets self.uncertainty_model and does not return anything.

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

predict_uncertainty(df: DataFrame, predict_params: Optional[dict] = None)

Makes predictions of prediction intervals for df based on the predictions and self.uncertainty_model.

Parameters

df (pandas.DataFrame) – The dataframe to calculate prediction intervals upon. It should have either self.value_col_ or PREDICT_COL which the prediction interval is based on.
predict_params (dict [str, any] or None, default None) – Parameters to be passed to the predict function.

Returns

result_df – The df with prediction interval columns.

Return type

pandas.DataFrame

score(X, y, sample_weight=None)

Default scorer for the estimator (Used in GridSearchCV/RandomizedSearchCV if scoring=None)

Notes

If null_model_params is not None, returns R2_null_model_score of model error relative to null model, evaluated by score_func.

If null_model_params is None, returns score_func of the model itself.

By default, grid search (with no scoring parameter) optimizes improvement of score_func against null model.

To optimize a different score function, pass scoring to GridSearchCV/RandomizedSearchCV.

Parameters

X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Value column, if provided in X, is ignored
y (pandas.Series or numpy.array) – Actual value, used to compute error
sample_weight (pandas.Series or numpy.array) – ignored

Returns

score – Comparison of predictions against null predictions, according to specified score function

Return type

float or None

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

Prophet Template

class greykite.framework.templates.prophet_template.ProphetTemplate(estimator: Optional[BaseForecastEstimator] = None)[source]

A template for ProphetEstimator.

Takes input data and optional configuration parameters to customize the model. Returns a set of parameters to call forecast_pipeline.

Notes

The attributes of a ForecastConfig for ProphetEstimator are:

computation_param: ComputationParam or None, default None
How to compute the result. See ComputationParam.

coverage: float or None, default None
Intended coverage of the prediction bands (0.0 to 1.0). If None, the upper/lower predictions are not returned. Same as coverage in forecast_pipeline.

evaluation_metric_param: EvaluationMetricParam or None, default None
What metrics to evaluate. See EvaluationMetricParam.

evaluation_period_param: EvaluationPeriodParam or None, default None
How to split data for evaluation. See EvaluationPeriodParam.

forecast_horizon: int or None, default None
Number of periods to forecast into the future. Must be > 0. If None, the default is determined from the input data frequency. Same as forecast_horizon in forecast_pipeline.

metadata_param: MetadataParam or None, default None
Information about the input data. See MetadataParam.

model_components_param: ModelComponentsParam or None, default None
Parameters to tune the model. See ModelComponentsParam. The fields are dictionaries with the following items.
seasonality: dict [str, any] or None
Seasonality config dictionary, with the following optional keys.

"seasonality_mode": str or None or list of such values for grid search
Can be ‘additive’ (default) or ‘multiplicative’.

"seasonality_prior_scale": float or None or list of such values for grid search
Parameter modulating the strength of the seasonality model. Larger values allow the model to fit larger seasonal fluctuations, smaller values dampen the seasonality. Specify for individual seasonalities using add_seasonality_dict.

"yearly_seasonality": str or bool or int or list of such values for grid search, default ‘auto’
Determines the yearly seasonality. Can be ‘auto’, True, False, or a number of Fourier terms to generate.

"weekly_seasonality": str or bool or int or list of such values for grid search, default ‘auto’
Determines the weekly seasonality. Can be ‘auto’, True, False, or a number of Fourier terms to generate.

"daily_seasonality": str or bool or int or list of such values for grid search, default ‘auto’
Determines the daily seasonality. Can be ‘auto’, True, False, or a number of Fourier terms to generate.

"add_seasonality_dict": dict or None or list of such values for grid search
Dictionary of custom seasonality parameters to be added to the model, default None. Key is the seasonality component name, e.g. ‘monthly’; parameters are specified via dict. See prophet_estimator for details.

growth: dict [str, any] or None
Specifies the growth parameter configuration. Dictionary with the following optional key:

"growth_term": str or None or list of such values for grid search
How to model the growth. Valid options are “linear” and “logistic”. Specifies a linear or logistic trend; these terms have their origin at the train start date.

events: dict [str, any] or None
Holiday/events configuration dictionary with the following optional keys:

"holiday_lookup_countries": list [str] or “auto” or None
Which countries’ holidays to include. Must contain all the holidays you intend to model. If “auto”, uses a default list of countries with a good coverage of global holidays. If None or an empty list, no holidays are modeled.

"holidays_prior_scale": float or None or list of such values for grid search, default 10.0
Modulates the strength of the holiday effect.

"holiday_pre_num_days": list [int] or None, default 2
Model holiday effects for holiday_pre_num_days days before the holiday. Grid search is not supported. Must be a list with one element or None.

"holiday_post_num_days": list [int] or None, default 2
Model holiday effects for holiday_post_num_days days after the holiday. Grid search is not supported. Must be a list with one element or None.

changepoints: dict [str, any] or None
Specifies the changepoint configuration. Dictionary with the following optional keys:

"changepoint_prior_scale": float or None or list of such values for grid search, default 0.05
Parameter modulating the flexibility of the automatic changepoint selection. Large values will allow many changepoints, small values will allow few changepoints.

"changepoints": list [datetime.datetime] or None or list of such values for grid search, default None
List of dates at which to include potential changepoints. If not specified, potential changepoints are selected automatically.

"n_changepoints": int or None or list of such values for grid search, default 25
Number of potential changepoints to include. Not used if input changepoints is supplied. If changepoints is not supplied, then n_changepoints potential changepoints are selected uniformly from the first changepoint_range proportion of the history.

"changepoint_range": float or None or list of such values for grid search, default 0.8
Proportion of history in which trend changepoints will be estimated. Permitted values: (0,1]. Not used if input changepoints is supplied.

regressors: dict [str, any] or None
Specifies the regressors to include in the model (e.g. macro-economic factors). Dictionary with the following optional keys:

"add_regressor_dict": dict or None or list of such values for grid search, default None
Dictionary of extra regressors to be modeled. See ProphetEstimator for details.

uncertainty: dict [str, any] or None
Specifies the uncertainty configuration. A dictionary with the following optional keys:

"mcmc_samples": int or None or list of such values for grid search, default 0
If greater than 0, will do full Bayesian inference with the specified number of MCMC samples. If 0, will do MAP estimation.

"uncertainty_samples": int or None or list of such values for grid search, default 1000
Number of simulated draws used to estimate uncertainty intervals. Setting this value to 0 or False will disable uncertainty estimation and speed up the calculation.

hyperparameter_override: dict [str, any] or None or list [dict [str, any] or None]
After the above model components are used to create a hyperparameter grid, the result is updated by this dictionary, to create new keys or override existing ones. Allows for complete customization of the grid search.

Keys should have format {named_step}__{parameter_name} for the named steps of the sklearn.pipeline.Pipeline returned by this function. See sklearn.pipeline.Pipeline.

For example:
hyperparameter_override={
    "estimator__yearly_seasonality": [True, False],
    "estimator__seasonality_prior_scale": [5.0, 15.0],
    "input__response__null__impute_algorithm": "ts_interpolate",
    "input__response__null__impute_params": {"orders": [7, 14]},
    "input__regressors_numeric__normalize__normalize_algorithm": "RobustScaler",
}
If a list of dictionaries, grid search will be done for each dictionary in the list. Each dictionary in the list overrides the defaults. This enables grid search over specific combinations of parameters to reduce the search space.

For example, the first dictionary could define combinations of parameters for a “complex” model, and the second dictionary could define combinations of parameters for a “simple” model, to prevent mixed combinations of simple and complex.

Or the first dictionary could grid search over fit algorithm, and the second dictionary could use a single fit algorithm and grid search over seasonality.

The result is passed as the param_distributions parameter to sklearn.model_selection.RandomizedSearchCV.
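
As a rough illustration of the update step described above, the sketch below merges an override dictionary into a default grid. The function name and the default grid are hypothetical, and wrapping non-list values in a list is an assumption made so that every value is a valid grid entry; the library's own handling may differ.

```python
def apply_override(grid, override):
    """Returns a copy of `grid` updated by `override`, with every value wrapped in a list."""
    updated = dict(grid)
    # Override entries replace existing keys and create new ones.
    updated.update(override or {})
    return {
        key: value if isinstance(value, list) else [value]
        for key, value in updated.items()
    }


# Hypothetical grid produced from the model components.
default_grid = {
    "estimator__yearly_seasonality": ["auto"],
    "estimator__seasonality_prior_scale": [10.0],
}
override = {
    "estimator__seasonality_prior_scale": [5.0, 15.0],
    "input__response__null__impute_algorithm": "ts_interpolate",
}
grid = apply_override(default_grid, override)
print(grid)
```

Keys follow the {named_step}__{parameter_name} format, so the same dictionary can set estimator parameters and parameters of earlier pipeline steps.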
autoregression: dict [str, any] or None
Ignored. Prophet template does not support autoregression.

lagged_regressors: dict [str, any] or None
Ignored. Prophet template does not support lagged regressors.

custom: dict [str, any] or None
Ignored. There are no custom options.
model_template: str
This class only accepts “PROPHET”.

DEFAULT_MODEL_TEMPLATE = 'PROPHET': The default model template. See ModelTemplateEnum. Uses a string to avoid circular imports. Overrides the value from ForecastConfigDefaults.

HOLIDAY_LOOKUP_COUNTRIES_AUTO = ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China'): Default holiday countries to use if countries=’auto’

property allow_model_template_list: ProphetTemplate does not allow config.model_template to be a list.

property allow_model_components_param_list: ProphetTemplate does not allow config.model_components_param to be a list.

get_prophet_holidays(year_list, countries='auto', lower_window=-2, upper_window=2)[source]

Generates holidays for Prophet model.

Parameters

year_list (list [int]) – List of years for selecting the holidays across given countries.
countries (list [str] or “auto” or None, default “auto”) –
Countries for selecting holidays.
- If “auto”, uses a default list of countries with a good coverage of global holidays.
- If a list, a list of country names.
- If None, the function returns None.
lower_window (int or None, default -2) – Negative integer. Model holiday effects for given number of days before the holiday.
upper_window (int or None, default 2) – Positive integer. Model holiday effects for given number of days after the holiday.

Returns

holidays – holidays dataframe to pass to Prophet’s holidays argument.

Return type

pandas.DataFrame or None
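
To make the window parameters concrete, here is a small sketch of which dates a holiday effect covers for given lower_window and upper_window values. The helper name and the sample holiday are hypothetical; the row keys follow the column names Prophet uses for its holidays dataframe (ds, holiday, lower_window, upper_window).

```python
from datetime import date, timedelta


def holiday_effect_dates(holiday_date, lower_window=-2, upper_window=2):
    """Lists the dates covered by a holiday effect, from lower_window to upper_window days."""
    if lower_window > 0 or upper_window < 0:
        raise ValueError("lower_window must be <= 0 and upper_window must be >= 0")
    return [
        holiday_date + timedelta(days=offset)
        for offset in range(lower_window, upper_window + 1)
    ]


# One row per holiday occurrence; illustrative data only.
holiday_rows = [
    {"ds": date(2024, 12, 25), "holiday": "Christmas Day", "lower_window": -2, "upper_window": 2},
]
row = holiday_rows[0]
covered = holiday_effect_dates(row["ds"], row["lower_window"], row["upper_window"])
print(covered)
```

With the defaults, each holiday affects five consecutive days: two before, the holiday itself, and two after.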

get_regressor_cols()[source]

Returns regressor column names.

Implements the method in BaseTemplate.

Returns: regressor_cols – The names of regressor columns used in any hyperparameter set requested by model_components. None if there are no regressors.
Return type: list [str] or None

apply_prophet_model_components_defaults(model_components=None, time_properties=None)[source]

Sets default values for model_components.

Called by get_hyperparameter_grid after time_properties is defined. Requires time_properties as well as model_components, so we do not simply override apply_model_components_defaults.

Parameters

model_components (ModelComponentsParam or None, default None) – Configuration of model growth, seasonality, events, etc. See the docstring of this class for details.
time_properties (dict [str, any] or None, default None) –
Time properties dictionary (likely produced by get_forecast_time_properties) with keys:

"period": int
Period of each observation (i.e. minimum time between observations, in seconds).

"simple_freq": SimpleTimeFrequencyEnum
SimpleTimeFrequencyEnum member corresponding to data frequency.

"num_training_points": int
Number of observations for training.

"num_training_days": int
Number of days for training.

"start_year": int
Start year of the training period.

"end_year": int
End year of the forecast period.

"origin_for_time_vars": float
Continuous time representation of the first date in df.

If None, start_year is set to 2015 and end_year to 2030.
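
For orientation, the sketch below builds a dictionary with the keys listed above from a list of daily training dates. It is not the output of get_forecast_time_properties: the function name is hypothetical, "simple_freq" is a placeholder string instead of a SimpleTimeFrequencyEnum member, and the formulas for "num_training_days" and "origin_for_time_vars" are assumptions.

```python
from datetime import date, timedelta


def sketch_time_properties(train_dates, forecast_horizon, period_seconds=86400):
    """Builds a dictionary with the documented keys from sorted training dates."""
    first, last = train_dates[0], train_dates[-1]
    # The forecast period ends `forecast_horizon` periods after the last training date.
    forecast_end = last + timedelta(seconds=period_seconds * forecast_horizon)
    days_in_year = (date(first.year + 1, 1, 1) - date(first.year, 1, 1)).days
    return {
        "period": period_seconds,
        "simple_freq": "DAY",  # placeholder; the real value is an enum member
        "num_training_points": len(train_dates),
        "num_training_days": (last - first).days + 1,  # assumed inclusive day count
        "start_year": first.year,
        "end_year": forecast_end.year,
        # Assumed convention: year plus the elapsed fraction of that year.
        "origin_for_time_vars": first.year + (first.timetuple().tm_yday - 1) / days_in_year,
    }


train_dates = [date(2023, 12, 1) + timedelta(days=i) for i in range(31)]
time_properties = sketch_time_properties(train_dates, forecast_horizon=7)
print(time_properties)
```

Here the training period lies entirely in 2023 while the forecast period runs into 2024, which is why "start_year" and "end_year" differ.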

Returns

model_components – The provided model_components with default values set

Return type

ModelComponentsParam

get_hyperparameter_grid()[source]

Returns hyperparameter grid.

Implements the method in BaseTemplate.

Uses self.time_properties and self.config to generate the hyperparameter grid.

Converts model components and time properties into ProphetEstimator hyperparameters.

Returns

hyperparameter_grid – ProphetEstimator hyperparameters.

See forecast_pipeline. The output dictionary values are lists, combined in grid search.

Return type

dict [str, list [any]] or None

static apply_computation_defaults(computation: Optional[ComputationParam] = None) → ComputationParam

Applies the default ComputationParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a ComputationParam object.

Parameters: computation (ComputationParam or None) – The ComputationParam object.
Returns: computation – Valid ComputationParam object with the provided attribute values and the default attribute values if not.
Return type: ComputationParam
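
The "apply defaults" behavior shared by this family of methods can be sketched with a small dataclass. The class, the default values, and the function name below are hypothetical stand-ins, and treating None as "not provided" is an assumption for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SketchComputationParam:
    """Hypothetical stand-in for ComputationParam; None means 'not provided'."""
    n_jobs: Optional[int] = None
    verbose: Optional[int] = None


# Illustrative defaults, not necessarily the library's values.
SKETCH_DEFAULTS = {"n_jobs": 1, "verbose": 0}


def apply_defaults_sketch(computation=None):
    """Fills attributes that are None with defaults and leaves provided values unchanged."""
    if computation is None:
        # A missing input object is replaced by a new one.
        computation = SketchComputationParam()
    for field in fields(computation):
        if getattr(computation, field.name) is None and field.name in SKETCH_DEFAULTS:
            setattr(computation, field.name, SKETCH_DEFAULTS[field.name])
    return computation


filled = apply_defaults_sketch(SketchComputationParam(n_jobs=4))
print(filled)
```

The provided n_jobs value is kept, while the missing verbose value receives its default.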

static apply_evaluation_metric_defaults(evaluation: Optional[EvaluationMetricParam] = None) → EvaluationMetricParam

Applies the default EvaluationMetricParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationMetricParam object.

Parameters: evaluation (EvaluationMetricParam or None) – The EvaluationMetricParam object.
Returns: evaluation – Valid EvaluationMetricParam object with the provided attribute values and the default attribute values if not.
Return type: EvaluationMetricParam

static apply_evaluation_period_defaults(evaluation: Optional[EvaluationPeriodParam] = None) → EvaluationPeriodParam

Applies the default EvaluationPeriodParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationPeriodParam object.

Parameters: evaluation (EvaluationPeriodParam or None) – The EvaluationPeriodParam object.
Returns: evaluation – Valid EvaluationPeriodParam object with the provided attribute values and the default attribute values if not.
Return type: EvaluationPeriodParam

apply_forecast_config_defaults(config: Optional[ForecastConfig] = None) → ForecastConfig

Applies the default Forecast Config values to the given config. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input config is None, it creates a Forecast Config.

Parameters: config (ForecastConfig or None) – Forecast configuration if available. See ForecastConfig.
Returns: config – A valid Forecast Config which contains the provided attribute values and the default attribute values if not.
Return type: ForecastConfig

static apply_metadata_defaults(metadata: Optional[MetadataParam] = None) → MetadataParam

Applies the default MetadataParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a MetadataParam object.

Parameters: metadata (MetadataParam or None) – The MetadataParam object.
Returns: metadata – Valid MetadataParam object with the provided attribute values and the default attribute values if not.
Return type: MetadataParam

static apply_model_components_defaults(model_components: Optional[Union[ModelComponentsParam, List[Optional[ModelComponentsParam]]]] = None) → Union[ModelComponentsParam, List[ModelComponentsParam]]

Applies the default ModelComponentsParam values to the given object.

Converts None to a ModelComponentsParam object. Unpacks a list of a single element to the element itself.

Parameters: model_components (ModelComponentsParam or None or list of such items) – The ModelComponentsParam object.
Returns: model_components – Valid ModelComponentsParam object with the provided attribute values and the default attribute values if not.
Return type: ModelComponentsParam or list of such items

apply_model_template_defaults(model_template: Optional[Union[str, List[Optional[str]]]] = None) → Union[str, List[str]]

Applies the default model template to the given object.

Unpacks a list of a single element to the element itself. Sets default value if None.

Parameters: model_template (str or None or list [None, str]) – The model template name. See valid names in ModelTemplateEnum.
Returns: model_template – The model template name, with the default value used if not provided.
Return type: str or list [str]
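
A minimal sketch of the unpacking and defaulting rules described above. The function name is hypothetical, the template names in the usage line are illustrative, and replacing None entries inside a longer list with the default is an assumption based on the accepted input type.

```python
def apply_model_template_defaults_sketch(model_template=None, default="PROPHET"):
    """Unpacks a single-element list and replaces None with the default template name."""
    if isinstance(model_template, list):
        if len(model_template) == 1:
            # A list with one element is unpacked to the element itself.
            return apply_model_template_defaults_sketch(model_template[0], default)
        # Assumption: None entries in a longer list fall back to the default.
        return [default if name is None else name for name in model_template]
    return default if model_template is None else model_template


print(apply_model_template_defaults_sketch(None))
print(apply_model_template_defaults_sketch(["AUTO_ARIMA"]))
print(apply_model_template_defaults_sketch(["AUTO_ARIMA", None]))
```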

property estimator: The estimator instance to use as the final step in the pipeline. An instance of greykite.sklearn.estimator.base_forecast_estimator.BaseForecastEstimator.

get_forecast_time_properties()

Returns forecast time parameters.

Uses self.df, self.config, self.regressor_cols.

Available parameters:

self.df

self.config

self.score_func

self.score_func_greater_is_better

self.regressor_cols

self.lagged_regressor_cols

self.estimator

self.pipeline

Returns

time_properties – Time properties dictionary (likely produced by get_forecast_time_properties) with keys:

"period" (int): Period of each observation (i.e. minimum time between observations, in seconds).
"simple_freq" (SimpleTimeFrequencyEnum): SimpleTimeFrequencyEnum member corresponding to data frequency.
"num_training_points" (int): Number of observations for training.
"num_training_days" (int): Number of days for training.
"start_year" (int): Start year of the training period.
"end_year" (int): End year of the forecast period.
"origin_for_time_vars" (float): Continuous time representation of the first date in df.

Return type

dict [str, any] or None, default None

get_lagged_regressor_info()

Returns lagged regressor column names and minimal/maximal lag order. The lag order can be used to check potential imputation in the computation of lags.

Can be overridden by subclass.

Returns

lagged_regressor_info – A dictionary that includes the lagged regressor column names and maximal/minimal lag order. The keys are:

lagged_regressor_cols: list [str] or None
See forecast_pipeline.

overall_min_lag_order: int or None

overall_max_lag_order: int or None

Return type

dict

get_pipeline()

Returns pipeline.

Implementation may be overridden by subclass if a different pipeline is desired.

Uses self.estimator, self.score_func, self.score_func_greater_is_better, self.config, self.regressor_cols.

Available parameters:

self.df

self.config

self.score_func

self.score_func_greater_is_better

self.regressor_cols

self.estimator

Returns: pipeline – See forecast_pipeline.
Return type: sklearn.pipeline.Pipeline

score_func: Score function used to select optimal model in CV.

score_func_greater_is_better: True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better.

regressor_cols: A list of regressor columns used in the training and prediction DataFrames. If None, no regressor columns are used.

lagged_regressor_cols: A list of lagged regressor columns used in the training and prediction DataFrames. If None, no lagged regressor columns are used.

pipeline: Pipeline to fit. The final named step must be called “estimator”.

time_properties: Time properties dictionary (likely produced by get_forecast_time_properties)

hyperparameter_grid: Sets properties of the steps in the pipeline, and specifies combinations to search over. Should be valid input to sklearn.model_selection.GridSearchCV (param_grid) or sklearn.model_selection.RandomizedSearchCV (param_distributions).

df: Optional[DataFrame]: Timeseries data to forecast.

config: Optional[ForecastConfig]: Forecast configuration.

pipeline_params: Optional[Dict]: Parameters (keyword arguments) to call forecast_pipeline.

apply_template_for_pipeline_params(df: DataFrame, config: Optional[ForecastConfig] = None) → Dict[source]

Explicitly calls the method in BaseTemplate to make use of the decorator in this class.

Parameters

df (pandas.DataFrame) – The time series dataframe with time_col and value_col and optional regressor columns.
config (ForecastConfig.) – The ForecastConfig class that includes model training parameters.

Returns

pipeline_parameters – The pipeline parameters consumable by forecast_pipeline.

Return type

dict

static apply_template_decorator(func)[source]

Decorator for apply_template_for_pipeline_params function.

Overrides the method in BaseTemplate.

Raises: ValueError – if config.model_template != "PROPHET".
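
The validation performed by such a decorator might look like the sketch below. All names here are hypothetical, the config object is a simple stand-in for ForecastConfig, and letting a missing config pass through is an assumption; the library's decorator may apply defaults before checking.

```python
from functools import wraps
from types import SimpleNamespace


def require_model_template(expected):
    """Builds a decorator that rejects configs whose model_template is not `expected`."""
    def decorator(func):
        @wraps(func)
        def wrapper(self, df, config=None):
            # Assumption: a missing config is left to later default handling.
            if config is not None and config.model_template != expected:
                raise ValueError(
                    f"Expected model_template={expected!r}, got {config.model_template!r}")
            return func(self, df, config)
        return wrapper
    return decorator


class SketchTemplate:
    @require_model_template("PROPHET")
    def apply_template_for_pipeline_params(self, df, config=None):
        return {"df": df, "config": config}


template = SketchTemplate()
ok_config = SimpleNamespace(model_template="PROPHET")
params = template.apply_template_for_pipeline_params(df=[1, 2, 3], config=ok_config)
print(params["df"])
```

Wrapping the check in a decorator keeps the validation separate from the method body, so each template class can reuse the base implementation and only change the expected name.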

class greykite.sklearn.estimator.prophet_estimator.ProphetEstimator(score_func=<function mean_squared_error>, coverage=0.8, null_model_params=None, growth='linear', changepoints=None, n_changepoints=25, changepoint_range=0.8, yearly_seasonality='auto', weekly_seasonality='auto', daily_seasonality='auto', holidays=None, seasonality_mode='additive', seasonality_prior_scale=10.0, holidays_prior_scale=10.0, changepoint_prior_scale=0.05, mcmc_samples=0, uncertainty_samples=1000, add_regressor_dict=None, add_seasonality_dict=None)[source]

Wrapper for Facebook Prophet model.

Parameters

score_func (callable) – see BaseForecastEstimator
coverage (float between [0.0, 1.0]) – see BaseForecastEstimator
null_model_params (dict with arguments to define DummyRegressor null model, optional, default=None) – see BaseForecastEstimator

add_regressor_dict (dictionary of extra regressors to be added to the model, optional, default=None) –

These should be available for training and entire prediction interval.

Dictionary format:

add_regressor_dict={  # we can add as many regressors as we want, in the following format
    "reg_col1": {
        "prior_scale": 10,
        "standardize": True,
        "mode": 'additive'
    },
    "reg_col2": {
        "prior_scale": 20,
        "standardize": True,
        "mode": 'multiplicative'
    }
}

add_seasonality_dict (dict of custom seasonality parameters to be added to the model, optional, default=None) –

Parameter details: see the add_seasonality() function in https://github.com/facebook/prophet/blob/main/python/prophet/forecaster.py. The key is the seasonality component name, e.g. ‘monthly’; its parameters are specified via a dict.

Dictionary format:

add_seasonality_dict={
    'monthly': {
        'period': 30.5,
        'fourier_order': 5
    },
    'weekly': {
        'period': 7,
        'fourier_order': 20,
        'prior_scale': 0.6,
        'mode': 'additive',
        'condition_name': 'condition_col'  # takes a bool column in df with True/False values. This means that
        # the seasonality will only be applied to dates where the condition_name column is True.
    },
    'yearly': {
        'period': 365.25,
        'fourier_order': 10,
        'prior_scale': 0.2,
        'mode': 'additive'
    }
}

Note: If a built-in and a custom seasonality conflict, e.g. both are named “yearly”, the custom seasonality is used and Prophet logs a message such as: “INFO:prophet:Found custom seasonality named “yearly”, disabling built-in yearly seasonality.”

kwargs (additional parameters) –
Other parameters are the same as Prophet model, with one exception: interval_width is specified by coverage.

See source code __init__ for the parameter names, and refer to Prophet documentation for a description:
- https://facebook.github.io/prophet/docs/quick_start.html
- https://github.com/facebook/prophet/blob/main/python/prophet/forecaster.py

model

Prophet model object

Type: Prophet object

forecast

Output of predict method of Prophet.

Type: pandas.DataFrame

fit(X, y=None, time_col='ts', value_col='y', **fit_params)[source]

Fits prophet model.

Parameters

X (pandas.DataFrame) – Input timeseries, with timestamp column, value column, and any additional regressors. The value column is the response, included in X to allow transformation by sklearn.pipeline.Pipeline
y (ignored) – The original timeseries values, ignored. (The y for fitting is included in X.)
time_col (str) – Time column name in X
value_col (str) – Value column name in X
fit_params (dict) – additional parameters for null model

Returns

self – Fitted model is stored in self.model.

Return type

self

predict(X, y=None)[source]

Creates forecast for dates specified in X.

Parameters

X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Timestamps are the dates for prediction. Value column, if provided in X, is ignored.
y (ignored) –

Returns

predictions –

Forecasted values for the dates in X. Columns:

TIME_COL dates

PREDICTED_COL predictions

PREDICTED_LOWER_COL lower bound of predictions, optional

PREDICTED_UPPER_COL upper bound of predictions, optional

[other columns], optional

PREDICTED_LOWER_COL and PREDICTED_UPPER_COL are present iff coverage is not None

Return type

pandas.DataFrame

summary()[source]

Prints input parameters and Prophet model parameters.

Returns: log_message – log message printed to logging.info()
Return type: str

plot_components(uncertainty=True, plot_cap=True, weekly_start=0, yearly_start=0, figsize=None)[source]

Plot the Prophet forecast components on the dataset passed to predict.

Will plot whichever are available of: trend, holidays, weekly seasonality, and yearly seasonality.

Parameters

uncertainty (bool, optional, default True) – Boolean to plot uncertainty intervals.
plot_cap (bool, optional, default True) – Boolean indicating if the capacity should be shown in the figure, if available.
weekly_start (int, optional, default 0) – Specifying the start day of the weekly seasonality plot. 0 (default) starts the week on Sunday. 1 shifts by 1 day to Monday, and so on.
yearly_start (int, optional, default 0) – Specifying the start day of the yearly seasonality plot. 0 (default) starts the year on Jan 1. 1 shifts by 1 day to Jan 2, and so on.
figsize (tuple, optional, default None) – Width, height in inches.
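
The effect of weekly_start on the order of days in the weekly seasonality plot can be sketched as a simple rotation. The helper name is hypothetical; the only behavior taken from the description above is that 0 starts the week on Sunday and each increment shifts the start by one day.

```python
DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]


def weekly_plot_day_labels(weekly_start=0):
    """Returns the seven day names in plotting order for a given weekly_start."""
    return [DAY_NAMES[(weekly_start + i) % 7] for i in range(7)]


labels_default = weekly_plot_day_labels(0)
labels_shifted = weekly_plot_day_labels(1)
print(labels_default)
print(labels_shifted)
```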

Returns

fig – A matplotlib figure.

Return type

matplotlib.figure.Figure

fit_uncertainty(df: DataFrame, uncertainty_dict: dict, fit_params: Optional[dict] = None, **kwargs)

Fits the uncertainty model with a given df and uncertainty_dict.

Parameters

df (pandas.DataFrame) – A dataframe representing the data to fit the uncertainty model.
uncertainty_dict (dict [str, any]) –
The uncertainty model specification. It should have the following keys:

"uncertainty_method": a string that is in UncertaintyMethodEnum.

"params": a dictionary that includes any additional parameters needed by the uncertainty method.
fit_params (dict [str, any] or None, default None) – Parameters to be passed to the fit function.
kwargs (additional parameters to be fed into the uncertainty method.) – These parameters are from the estimator attributes, not given by user.

Return type

The function sets self.uncertainty_model and does not return anything.

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

predict_uncertainty(df: DataFrame, predict_params: Optional[dict] = None)

Makes predictions of prediction intervals for df based on the predictions and self.uncertainty_model.

Parameters

df (pandas.DataFrame) – The dataframe for which to calculate prediction intervals. It should contain either self.value_col_ or PREDICTED_COL, on which the prediction interval is based.
predict_params (dict [str, any] or None, default None) – Parameters to be passed to the predict function.

Returns

result_df – The df with prediction interval columns.

Return type

pandas.DataFrame

score(X, y, sample_weight=None)

Default scorer for the estimator (used in GridSearchCV/RandomizedSearchCV if scoring=None).

Notes

If null_model_params is not None, returns R2_null_model_score of model error relative to null model, evaluated by score_func.

If null_model_params is None, returns score_func of the model itself.

By default, grid search (with no scoring parameter) optimizes improvement of score_func against null model.

To optimize a different score function, pass scoring to GridSearchCV/RandomizedSearchCV.
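
The two cases in the notes above can be sketched as follows. This assumes the relative score has the common form 1 - model_loss / null_loss; the function names are hypothetical and the library's exact formula and edge-case handling may differ.

```python
def mean_squared_error(y_true, y_pred):
    """Plain-Python mean squared error, used here as the score function."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_score_sketch(y_true, y_pred, y_pred_null=None, score_func=mean_squared_error):
    """Scores predictions directly, or relative to null predictions when they are given."""
    model_loss = score_func(y_true, y_pred)
    if y_pred_null is None:
        # No null model: return the score function of the model itself.
        return model_loss
    null_loss = score_func(y_true, y_pred_null)
    if null_loss == 0:
        return None  # the relative improvement is undefined
    # Assumed form: 1.0 means perfect, 0.0 means no better than the null model.
    return 1.0 - model_loss / null_loss


y_true = [10, 12, 14, 16]
y_pred = [11, 12, 13, 16]
y_pred_null = [13, 13, 13, 13]  # null model that always predicts the mean
score = relative_score_sketch(y_true, y_pred, y_pred_null)
print(score)
```

Under this form a higher value is better even when score_func itself is a loss, because the score measures improvement over the null model.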

Parameters

X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Value column, if provided in X, is ignored
y (pandas.Series or numpy.array) – Actual value, used to compute error
sample_weight (pandas.Series or numpy.array) – ignored

Returns

score – Comparison of predictions against null predictions, according to specified score function

Return type

float or None

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance
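
To illustrate the <component>__<parameter> naming, the sketch below splits parameter names on the first double underscore. The function name and the sample parameters are hypothetical, and the real set_params also validates names and assigns the values.

```python
def group_nested_params(params):
    """Splits keys of the form <component>__<parameter> into per-component dictionaries."""
    own, nested = {}, {}
    for key, value in params.items():
        # Only the first "__" separates the component from the rest of the name.
        component, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = group_nested_params({
    "coverage": 0.9,
    "estimator__n_changepoints": 10,
    "input__response__null__impute_algorithm": "interpolate",
})
print(own)
print(nested)
```

Because only the first delimiter is consumed, the remainder of a deeply nested name can be passed on to the component, which applies the same rule again.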

ARIMA Template

class greykite.framework.templates.auto_arima_template.AutoArimaTemplate(estimator: BaseForecastEstimator = AutoArimaEstimator())[source]

A template for AutoArimaEstimator.

Takes input data and optional configuration parameters to customize the model. Returns a set of parameters to call forecast_pipeline.

Notes

The attributes of a ForecastConfig for AutoArimaEstimator are:

computation_param: ComputationParam or None, default None
How to compute the result. See ComputationParam.

coverage: float or None, default None
Intended coverage of the prediction bands (0.0 to 1.0). If None, the upper/lower predictions are not returned. Same as coverage in forecast_pipeline.

evaluation_metric_param: EvaluationMetricParam or None, default None
What metrics to evaluate. See EvaluationMetricParam.

evaluation_period_param: EvaluationPeriodParam or None, default None
How to split data for evaluation. See EvaluationPeriodParam.

forecast_horizon: int or None, default None
Number of periods to forecast into the future. Must be > 0. If None, default is determined from input data frequency. Same as forecast_horizon in forecast_pipeline.

metadata_param: MetadataParam or None, default None
Information about the input data. See MetadataParam.

model_components_param: ModelComponentsParam or None, default None
Parameters to tune the model. See ModelComponentsParam. The fields are dictionaries with the following items.
seasonality: dict [str, any] or None
Ignored. Pass the relevant Auto Arima arguments via custom.

growth: dict [str, any] or None
Ignored. Pass the relevant Auto Arima arguments via custom.

events: dict [str, any] or None
Ignored. Pass the relevant Auto Arima arguments via custom.

changepoints: dict [str, any] or None
Ignored. Pass the relevant Auto Arima arguments via custom.

regressors: dict [str, any] or None
Ignored. Auto Arima template currently does not support regressors.

uncertainty: dict [str, any] or None
Ignored. Pass the relevant Auto Arima arguments via custom.

hyperparameter_override: dict [str, any] or None or list [dict [str, any] or None]
After the above model components are used to create a hyperparameter grid, the result is updated by this dictionary, to create new keys or override existing ones. Allows for complete customization of the grid search.

Keys should have format {named_step}__{parameter_name} for the named steps of the sklearn.pipeline.Pipeline returned by this function. See sklearn.pipeline.Pipeline.

For example:
hyperparameter_override={
    "estimator__max_p": [8, 10],
    "estimator__information_criterion": ["bic"],
}
If a list of dictionaries, grid search will be done for each dictionary in the list. Each dictionary in the list overrides the defaults. This enables grid search over specific combinations of parameters to reduce the search space.

For example, the first dictionary could define combinations of parameters for a “complex” model, and the second dictionary could define combinations of parameters for a “simple” model, to prevent mixed combinations of simple and complex.

Or the first dictionary could grid search over fit algorithm, and the second dictionary could use a single fit algorithm and grid search over seasonality.

The result is passed as the param_distributions parameter to sklearn.model_selection.RandomizedSearchCV.
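
The reduction in search space from using a list of dictionaries can be seen by counting combinations. This is a plain-Python sketch with hypothetical helper names and illustrative parameter values; it is not how the library or scikit-learn enumerates candidates internally.

```python
from itertools import product


def expand_grid(grid):
    """Lists every combination of values in a single grid dictionary."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def expand_grids(grids):
    """Expands one grid or a list of grids; each grid is expanded independently."""
    if isinstance(grids, dict):
        grids = [grids]
    return [combo for grid in grids for combo in expand_grid(grid)]


complex_model = {"estimator__max_p": [8, 10], "estimator__max_q": [8, 10]}
simple_model = {"estimator__max_p": [2], "estimator__max_q": [1, 2]}

separate = expand_grids([complex_model, simple_model])
merged = expand_grids({"estimator__max_p": [2, 8, 10], "estimator__max_q": [1, 2, 8, 10]})
print(len(separate), len(merged))
```

Keeping the "complex" and "simple" settings in separate dictionaries searches only combinations within each group, whereas a single merged dictionary also includes mixed combinations such as a small max_p with a large max_q.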
autoregression: dict [str, any] or None
Ignored. Pass the relevant Auto Arima arguments via custom.

custom: dict [str, any] or None
Any parameter in the AutoArimaEstimator can be passed.
model_template: str
This class only accepts “AUTO_ARIMA”.

DEFAULT_MODEL_TEMPLATE = 'AUTO_ARIMA': The default model template. See ModelTemplateEnum. Uses a string to avoid circular imports.

property allow_model_template_list: AutoArimaTemplate does not allow config.model_template to be a list.

property allow_model_components_param_list: AutoArimaTemplate does not allow config.model_components_param to be a list.

get_regressor_cols()[source]

Returns regressor column names from the model components.

Currently does not implement regressors.

apply_auto_arima_model_components_defaults(model_components=None)[source]

Sets default values for model_components.

Parameters: model_components (ModelComponentsParam or None, default None) – Configuration of model growth, seasonality, events, etc. See the docstring of this class for details.
Returns: model_components – The provided model_components with default values set
Return type: ModelComponentsParam

get_hyperparameter_grid()[source]

Returns hyperparameter grid.

Implements the method in BaseTemplate.

Uses self.time_properties and self.config to generate the hyperparameter grid.

Converts model components into AutoArimaEstimator hyperparameters.

The output dictionary values are lists, combined via grid search in forecast_pipeline.

Parameters

model_components (ModelComponentsParam or None, default None) – Configuration of parameter space to search the order (p, d, q etc.) of SARIMAX model. See auto_arima_template for details.
coverage (float or None, default=0.95) – Intended coverage of the prediction bands (0.0 to 1.0)

Returns

hyperparameter_grid – AutoArimaEstimator hyperparameters.

See forecast_pipeline. The output dictionary values are lists, combined in grid search.

Return type

dict [str, list [any]] or None

apply_template_for_pipeline_params(df: DataFrame, config: Optional[ForecastConfig] = None) → Dict[source]

Explicitly calls the method in BaseTemplate to make use of the decorator in this class.

Parameters

df (pandas.DataFrame) – The time series dataframe with time_col and value_col and optional regressor columns.
config (ForecastConfig.) – The ForecastConfig class that includes model training parameters.

Returns

pipeline_parameters – The pipeline parameters consumable by forecast_pipeline.

Return type

dict

static apply_computation_defaults(computation: Optional[ComputationParam] = None) → ComputationParam

Applies the default ComputationParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a ComputationParam object.

Parameters: computation (ComputationParam or None) – The ComputationParam object.
Returns: computation – Valid ComputationParam object with the provided attribute values and the default attribute values if not.
Return type: ComputationParam

static apply_evaluation_metric_defaults(evaluation: Optional[EvaluationMetricParam] = None) → EvaluationMetricParam

Applies the default EvaluationMetricParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationMetricParam object.

Parameters: evaluation (EvaluationMetricParam or None) – The EvaluationMetricParam object.
Returns: evaluation – Valid EvaluationMetricParam object with the provided attribute values and the default attribute values if not.
Return type: EvaluationMetricParam

static apply_evaluation_period_defaults(evaluation: Optional[EvaluationPeriodParam] = None) → EvaluationPeriodParam

Applies the default EvaluationPeriodParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationPeriodParam object.

Parameters: evaluation (EvaluationPeriodParam or None) – The EvaluationPeriodParam object.
Returns: evaluation – Valid EvaluationPeriodParam object with the provided attribute values and the default attribute values if not.
Return type: EvaluationPeriodParam

apply_forecast_config_defaults(config: Optional[ForecastConfig] = None) → ForecastConfig

Applies the default Forecast Config values to the given config. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input config is None, it creates a Forecast Config.

Parameters: config (ForecastConfig or None) – Forecast configuration if available. See ForecastConfig.
Returns: config – A valid Forecast Config which contains the provided attribute values and the default attribute values if not.
Return type: ForecastConfig

static apply_metadata_defaults(metadata: Optional[MetadataParam] = None) → MetadataParam

Applies the default MetadataParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a MetadataParam object.

Parameters: metadata (MetadataParam or None) – The MetadataParam object.
Returns: metadata – Valid MetadataParam object with the provided attribute values and the default attribute values if not.
Return type: MetadataParam

static apply_model_components_defaults(model_components: Optional[Union[ModelComponentsParam, List[Optional[ModelComponentsParam]]]] = None) → Union[ModelComponentsParam, List[ModelComponentsParam]]

Applies the default ModelComponentsParam values to the given object.

Converts None to a ModelComponentsParam object. Unpacks a list of a single element to the element itself.

Parameters: model_components (ModelComponentsParam or None or list of such items) – The ModelComponentsParam object.
Returns: model_components – Valid ModelComponentsParam object with the provided attribute values and the default attribute values if not.
Return type: ModelComponentsParam or list of such items

apply_model_template_defaults(model_template: Optional[Union[str, List[Optional[str]]]] = None) → Union[str, List[str]]

Applies the default model template to the given object.

Unpacks a list of a single element to the element itself. Sets default value if None.

Parameters: model_template (str or None or list [None, str]) – The model template name. See valid names in ModelTemplateEnum.
Returns: model_template – The model template name, with the default value used if not provided.
Return type: str or list [str]
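A minimal sketch of the unpacking and defaulting rules described above, assuming that None entries inside a list are also replaced by the default. The function name apply_template_name_defaults and the DEFAULT_TEMPLATE constant are hypothetical; the actual method may behave differently in edge cases.

```python
from typing import List, Optional, Union

# Assumed default name, for illustration only.
DEFAULT_TEMPLATE = "AUTO_ARIMA"


def apply_template_name_defaults(
        model_template: Optional[Union[str, List[Optional[str]]]] = None,
        default: str = DEFAULT_TEMPLATE) -> Union[str, List[str]]:
    """Fills in the default name and unpacks single-element lists."""
    if model_template is None:
        return default
    if isinstance(model_template, list):
        # Assumption: None entries in a list fall back to the default.
        filled = [default if name is None else name for name in model_template]
        # A list of a single element is unpacked to the element itself.
        return filled[0] if len(filled) == 1 else filled
    return model_template
```

For example, apply_template_name_defaults(["SILVERKITE"]) returns the string "SILVERKITE", and calling it with no argument returns the default.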

static apply_template_decorator(func)[source]

Decorator for apply_template_for_pipeline_params function.

Overrides the method in BaseTemplate.

Raises: ValueError – if config.model_template != "AUTO_ARIMA".

property estimator: The estimator instance to use as the final step in the pipeline. An instance of greykite.sklearn.estimator.base_forecast_estimator.BaseForecastEstimator.

get_forecast_time_properties()

Returns forecast time parameters.

Uses self.df, self.config, self.regressor_cols.

Available parameters:

self.df

self.config

self.score_func

self.score_func_greater_is_better

self.regressor_cols

self.lagged_regressor_cols

self.estimator

self.pipeline

Returns

time_properties – Time properties dictionary with keys:

"period" (int): Period of each observation (i.e. minimum time between observations, in seconds).
"simple_freq" (SimpleTimeFrequencyEnum): SimpleTimeFrequencyEnum member corresponding to data frequency.
"num_training_points" (int): Number of observations for training.
"num_training_days" (int): Number of days for training.
"start_year" (int): Start year of the training period.
"end_year" (int): End year of the forecast period.
"origin_for_time_vars" (float): Continuous time representation of the first date in df.

Return type

dict [str, any] or None, default None

get_lagged_regressor_info()

Returns lagged regressor column names and minimal/maximal lag order. The lag order can be used to check potential imputation in the computation of lags.

Can be overridden by subclass.

Returns

lagged_regressor_info – A dictionary that includes the lagged regressor column names and the maximal/minimal lag order. The keys are:

lagged_regressor_cols: list [str] or None
See forecast_pipeline.

overall_min_lag_order: int or None

overall_max_lag_order: int or None

Return type

dict

get_pipeline()

Returns pipeline.

Implementation may be overridden by subclass if a different pipeline is desired.

Uses self.estimator, self.score_func, self.score_func_greater_is_better, self.config, self.regressor_cols.

Available parameters:

self.df

self.config

self.score_func

self.score_func_greater_is_better

self.regressor_cols

self.estimator

Returns: pipeline – See forecast_pipeline.
Return type: sklearn.pipeline.Pipeline

score_func: Score function used to select optimal model in CV.

score_func_greater_is_better: True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better.

regressor_cols: A list of regressor columns used in the training and prediction DataFrames. If None, no regressor columns are used.

lagged_regressor_cols: A list of lagged regressor columns used in the training and prediction DataFrames. If None, no lagged regressor columns are used.

pipeline: Pipeline to fit. The final named step must be called “estimator”.

time_properties: Time properties dictionary (likely produced by get_forecast_time_properties)

hyperparameter_grid: Sets properties of the steps in the pipeline, and specifies combinations to search over. Should be valid input to sklearn.model_selection.GridSearchCV (param_grid) or sklearn.model_selection.RandomizedSearchCV (param_distributions).

df: Optional[DataFrame]: Timeseries data to forecast.

config: Optional[ForecastConfig]: Forecast configuration.

pipeline_params: Optional[Dict]: Parameters (keyword arguments) to call forecast_pipeline.
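The hyperparameter_grid attribute above follows the scikit-learn convention of prefixing each parameter with the pipeline step name and a double underscore. The sketch below shows such a grid for a final step named "estimator" and enumerates its combinations with the standard library. The chosen parameters and values are illustrative, and expand_grid is a hypothetical helper, not part of Greykite or scikit-learn.

```python
from itertools import product

# Keys are "<step name>__<parameter name>"; values list the options to search.
hyperparameter_grid = {
    "estimator__max_p": [3, 5],
    "estimator__seasonal": [True, False],
}


def expand_grid(grid):
    """Lists every parameter combination in the grid (a full grid search)."""
    keys = sorted(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[key] for key in keys))]


candidates = expand_grid(hyperparameter_grid)
```

With two options for each of two parameters, candidates holds four parameter sets.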

class greykite.sklearn.estimator.auto_arima_estimator.AutoArimaEstimator(score_func: callable = <function mean_squared_error>, coverage: float = 0.9, null_model_params: Optional[Dict] = None, regressor_cols: Optional[List[str]] = None, freq: Optional[float] = None, start_p: Optional[int] = 2, d: Optional[int] = None, start_q: Optional[int] = 2, max_p: Optional[int] = 5, max_d: Optional[int] = 2, max_q: Optional[int] = 5, start_P: Optional[int] = 1, D: Optional[int] = None, start_Q: Optional[int] = 1, max_P: Optional[int] = 2, max_D: Optional[int] = 1, max_Q: Optional[int] = 2, max_order: Optional[int] = 5, m: Optional[int] = 1, seasonal: Optional[bool] = True, stationary: Optional[bool] = False, information_criterion: Optional[str] = 'aic', alpha: Optional[int] = 0.05, test: Optional[str] = 'kpss', seasonal_test: Optional[str] = 'ocsb', stepwise: Optional[bool] = True, n_jobs: Optional[int] = 1, start_params: Optional[Dict] = None, trend: Optional[str] = None, method: Optional[str] = 'lbfgs', maxiter: Optional[int] = 50, offset_test_args: Optional[Dict] = None, seasonal_test_args: Optional[Dict] = None, suppress_warnings: Optional[bool] = True, error_action: Optional[str] = 'trace', trace: Optional[Union[int, bool]] = False, random: Optional[bool] = False, random_state: Optional[Union[int, callable]] = None, n_fits: Optional[int] = 10, out_of_sample_size: Optional[int] = 0, scoring: Optional[str] = 'mse', scoring_args: Optional[Dict] = None, with_intercept: Optional[Union[bool, str]] = 'auto', return_conf_int: Optional[bool] = True, dynamic: Optional[bool] = False)[source]

Wrapper for pmdarima.arima.AutoARIMA. It currently does not handle the regressor case when there is a gap between the train and predict periods.

Parameters

score_func (callable) – see BaseForecastEstimator.
coverage (float between [0.0, 1.0]) – see BaseForecastEstimator.
null_model_params (dict with arguments to define DummyRegressor null model, optional, default=None) – see BaseForecastEstimator.
regressor_cols (list [str], optional, default None) – A list of regressor columns used during training and prediction. If None, no regressor columns are used.
For descriptions of the remaining parameters, see the AutoARIMA documentation: https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.AutoARIMA.html#pmdarima.arima.AutoARIMA

model

Auto arima model object

Type: AutoArima object

fit_df

The training data used to fit the model.

Type: pandas.DataFrame or None

forecast

Output of the predict method of AutoArima.

Type: pandas.DataFrame

fit(X, y=None, time_col='ts', value_col='y', **fit_params)[source]

Fits ARIMA forecast model.

Parameters

X (pandas.DataFrame) – Input timeseries, with timestamp column, value column, and any additional regressors. The value column is the response, included in X to allow transformation by sklearn.pipeline.Pipeline
y (ignored) – The original timeseries values, ignored. (The y for fitting is included in X.)
time_col (str) – Time column name in X
value_col (str) – Value column name in X
fit_params (dict) – additional parameters for null model

Returns

self – Fitted model is stored in self.model.

Return type

self

predict(X, y=None)[source]

Creates forecast for the dates specified in X. Currently does not support the regressor case where there is a gap between the train and predict periods.

Parameters

X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Timestamps are the dates for prediction. Value column, if provided in X, is ignored.
y (ignored.) –

Returns

predictions –

Forecasted values for the dates in X. Columns:

TIME_COL: dates

PREDICTED_COL: predictions

PREDICTED_LOWER_COL: lower bound of predictions

PREDICTED_UPPER_COL: upper bound of predictions

Return type

pandas.DataFrame

summary()[source]

Creates a human-readable string of how the model works, including relevant diagnostics. These details cannot be extracted from the forecast alone. Prints the model configuration. Extend this in a child class to print the trained model parameters.

Log message is printed to the cst.LOGGER_NAME logger.

fit_uncertainty(df: DataFrame, uncertainty_dict: dict, fit_params: Optional[dict] = None, **kwargs)

Fits the uncertainty model with a given df and uncertainty_dict.

Parameters

df (pandas.DataFrame) – A dataframe representing the data to fit the uncertainty model.
uncertainty_dict (dict [str, any]) –
The uncertainty model specification. It should have the following keys:

"uncertainty_method": a string that is in UncertaintyMethodEnum.

"params": a dictionary that includes any additional parameters needed by the uncertainty method.
fit_params (dict [str, any] or None, default None) – Parameters to be passed to the fit function.
kwargs (additional parameters to be fed into the uncertainty method.) – These parameters are from the estimator attributes, not given by user.

Return type

The function sets self.uncertainty_model and does not return anything.

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

predict_uncertainty(df: DataFrame, predict_params: Optional[dict] = None)

Makes predictions of prediction intervals for df based on the predictions and self.uncertainty_model.

Parameters

df (pandas.DataFrame) – The dataframe to calculate prediction intervals upon. It should have either self.value_col_ or PREDICTED_COL, on which the prediction interval is based.
predict_params (dict [str, any] or None, default None) – Parameters to be passed to the predict function.

Returns

result_df – The df with prediction interval columns.

Return type

pandas.DataFrame

score(X, y, sample_weight=None)

Default scorer for the estimator (Used in GridSearchCV/RandomizedSearchCV if scoring=None)

Notes

If null_model_params is not None, returns R2_null_model_score of model error relative to null model, evaluated by score_func.

If null_model_params is None, returns score_func of the model itself.

By default, grid search (with no scoring parameter) optimizes improvement of score_func against null model.

To optimize a different score function, pass scoring to GridSearchCV/RandomizedSearchCV.

Parameters

X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Value column, if provided in X, is ignored
y (pandas.Series or numpy.array) – Actual value, used to compute error
sample_weight (pandas.Series or numpy.array) – ignored

Returns

score – Comparison of predictions against null predictions, according to specified score function

Return type

float or None
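To make the notes on null_model_params concrete, the sketch below computes an improvement-over-null score with the standard library. It assumes the common form 1 - loss(model) / loss(null) and uses mean squared error as the loss. The helper names are hypothetical, and the exact definition used by the library's R2_null_model_score may differ in details.

```python
from statistics import mean


def mean_squared_error(y_true, y_pred):
    """Mean of squared differences between actuals and predictions."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def null_relative_score(y_true, y_pred, y_null, loss_func=mean_squared_error):
    """Assumed form: 1 - loss(model) / loss(null). Positive if the model beats the null."""
    return 1.0 - loss_func(y_true, y_pred) / loss_func(y_true, y_null)


y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 12.0, 13.0, 16.0]
# Null model: a constant prediction (here, the mean of y_true).
y_null = [13.0] * len(y_true)

score = null_relative_score(y_true, y_pred, y_null)
```

In this toy data the model loss is 0.5 and the null loss is 5.0, so the score is 0.9. A score of 0 means no improvement over the constant baseline.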

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance
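The <component>__<parameter> naming can be illustrated with a short sketch that routes parameters to the object itself or to a named component. split_nested_params is a hypothetical helper written for this example. It mirrors the convention only and is not how scikit-learn implements set_params.

```python
def split_nested_params(params):
    """Separates top-level parameters from "<component>__<parameter>" entries."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split at the first double underscore only, so deeper nesting
        # (e.g. "a__b__c") stays attached to the sub-key.
        component, sep, sub_key = key.partition("__")
        if sep:
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params({
    "verbose": 1,
    "estimator__coverage": 0.9,
    "estimator__max_p": 4,
})
```

Here own holds the top-level verbose setting, and nested groups coverage and max_p under the "estimator" component.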

Forecast Pipeline

greykite.framework.pipeline.pipeline.forecast_pipeline(df: DataFrame, time_col='ts', value_col='y', date_format=None, tz=None, freq=None, train_end_date=None, anomaly_info=None, pipeline=None, regressor_cols=None, lagged_regressor_cols=None, estimator=SimpleSilverkiteEstimator(), hyperparameter_grid=None, hyperparameter_budget=None, n_jobs=1, verbose=1, forecast_horizon=None, coverage=0.95, test_horizon=None, periods_between_train_test=None, agg_periods=None, agg_func=None, score_func='MeanAbsolutePercentError', score_func_greater_is_better=False, cv_report_metrics='ALL', null_model_params=None, relative_error_tolerance=None, cv_horizon=None, cv_min_train_periods=None, cv_expanding_window=False, cv_use_most_recent_splits=False, cv_periods_between_splits=None, cv_periods_between_train_test=None, cv_max_splits=3)[source]

Computation pipeline for end-to-end forecasting.

Trains a forecast model end-to-end:

checks input data

runs cross-validation to select optimal hyperparameters e.g. best model

evaluates best model on test set

provides forecast of best model (re-trained on all data) into the future

Returns forecasts with methods to plot and see diagnostics. Also returns the fitted pipeline and CV results.

Provides a high degree of customization over training and evaluation parameters:

model

cross validation

evaluation

forecast horizon

See test cases for examples.

Parameters

df (pandas.DataFrame) – Timeseries data to forecast. Contains columns [time_col, value_col], and optional regressor columns. Regressor columns should include future values for prediction.
time_col (str, default TIME_COL in constants.py) – name of timestamp column in df
value_col (str, default VALUE_COL in constants.py) – name of value column in df (the values to forecast)
date_format (str or None, default None) – strftime format to parse time column, e.g. %m/%d/%Y. Note that %f will parse all the way up to nanoseconds. If None (recommended), inferred by pandas.to_datetime.
tz (str or None, default None) – Passed to pandas.tz_localize to localize the timestamp
freq (str or None, default None) – Frequency of input data. Used to generate future dates for prediction. Frequency strings can have multiples, e.g. ‘5H’. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for a list of frequency aliases. If None, inferred by pandas.infer_freq. Provide this parameter if df has missing timepoints.
train_end_date (datetime.datetime, optional, default None) – Last date to use for fitting the model. Forecasts are generated after this date. If None, it is set to the last date with a non-null value in value_col of df.
anomaly_info (dict or list [dict] or None, default None) –
Anomaly adjustment info. Anomalies in df are corrected before any forecasting is done.

If None, no adjustments are made.

A dictionary containing the parameters to adjust_anomalous_data. See that function for details. The possible keys are:

"value_col": str
The name of the column in df to adjust. You may adjust the value to forecast as well as any numeric regressors.

"anomaly_df": pandas.DataFrame
Adjustments to correct the anomalies.

"start_time_col": str, default START_TIME_COL
Start date column in anomaly_df.

"end_time_col": str, default END_TIME_COL
End date column in anomaly_df.

"adjustment_delta_col": str or None, default None
Impact column in anomaly_df.

"filter_by_dict": dict or None, default None
Used to filter anomaly_df to the relevant anomalies for the value_col in this dictionary. Key specifies the column name, value specifies the filter value.

"filter_by_value_col": str or None, default None
Adds {filter_by_value_col: value_col} to filter_by_dict if not None, for the value_col in this dictionary.

"adjustment_method": str (“add” or “subtract”), default “add”
How to make the adjustment, if adjustment_delta_col is provided.

Accepts a list of such dictionaries to adjust multiple columns in df.
pipeline (sklearn.pipeline.Pipeline or None, default None) – Pipeline to fit. The final named step must be called “estimator”. If None, will use the default Pipeline from get_basic_pipeline.
regressor_cols (list [str] or None, default None) – A list of regressor columns used in the training and prediction DataFrames. It should contain only the regressors that are being used in the grid search. If None, no regressor columns are used. Regressor columns that are unavailable in df are dropped.
lagged_regressor_cols (list [str] or None, default None) – A list of additional columns needed for lagged regressors in the training and prediction DataFrames. This list can have overlap with regressor_cols. If None, no additional columns are added to the DataFrame. Lagged regressor columns that are unavailable in df are dropped.
estimator (instance of an estimator that implements greykite.sklearn.estimator.base_forecast_estimator.BaseForecastEstimator) – Estimator to use as the final step in the pipeline. Ignored if pipeline is provided.
forecast_horizon (int or None, default None) – Number of periods to forecast into the future. Must be > 0. If None, default is determined from input data frequency.
coverage (float or None, default=0.95) – Intended coverage of the prediction bands (0.0 to 1.0). If None, the upper/lower predictions are not returned. Ignored if pipeline is provided; the coverage of the pipeline estimator is used instead.
test_horizon (int or None, default None) – Number of periods held back from the end of df for test. The rest is used for cross validation. If None, default is forecast_horizon. Set to 0 to skip backtest.
periods_between_train_test (int or None, default None) – Number of periods for the gap between train and test data. If None, default is 0.
agg_periods (int or None, default None) –
Number of periods to aggregate before evaluation.

Model is fit and forecasted on the dataset’s original frequency.

Before evaluation, the actual and forecasted values are aggregated, using rolling windows of size agg_periods and the function agg_func. (e.g. if the dataset is hourly, use agg_periods=24, agg_func=np.sum, to evaluate performance on the daily totals).

If None, does not aggregate before evaluation.

Currently, this is only used when calculating CV metrics and the R2_null_model_score metric in backtest/forecast. No pre-aggregation is applied for the other backtest/forecast evaluation metrics.
agg_func (callable or None, default None) –
Takes an array and returns a number, e.g. np.max, np.sum.

Defines how to aggregate rolling windows of actual and predicted values before evaluation.

Ignored if agg_periods is None.

Currently, this is only used when calculating CV metrics and the R2_null_model_score metric in backtest/forecast. No pre-aggregation is applied for the other backtest/forecast evaluation metrics.
score_func (str or callable, default EvaluationMetricEnum.MeanAbsolutePercentError.name) – Score function used to select optimal model in CV. If a callable, takes arrays y_true, y_pred and returns a float. If a string, must be either an EvaluationMetricEnum member name or FRACTION_OUTSIDE_TOLERANCE.
score_func_greater_is_better (bool, default False) – True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better. Must be provided if score_func is a callable (custom function). Ignored if score_func is a string, because the direction is known.
cv_report_metrics (str, or list [str], or None, default CV_REPORT_METRICS_ALL) –
Additional metrics to compute during CV, besides the one specified by score_func.
- If the string constant greykite.framework.constants.CV_REPORT_METRICS_ALL, computes all metrics in EvaluationMetricEnum. Also computes FRACTION_OUTSIDE_TOLERANCE if relative_error_tolerance is not None. The results are reported by the short name (.get_metric_name()) for EvaluationMetricEnum members and FRACTION_OUTSIDE_TOLERANCE_NAME for FRACTION_OUTSIDE_TOLERANCE. These names appear in the keys of forecast_result.grid_search.cv_results_ returned by this function.
- If a list of strings, each of the listed metrics is computed. Valid strings are EvaluationMetricEnum member names and FRACTION_OUTSIDE_TOLERANCE.
  
  For example:
  ["MeanSquaredError", "MeanAbsoluteError", "MeanAbsolutePercentError", "MedianAbsolutePercentError", "FractionOutsideTolerance2"]
- If None, no additional metrics are computed.
null_model_params (dict or None, default None) –
Defines baseline model to compute R2_null_model_score evaluation metric. R2_null_model_score is the improvement in the loss function relative to a null model. It can be used to evaluate model quality with respect to a simple baseline. For details, see r2_null_model_score.

The null model is a DummyRegressor, which returns constant predictions.

Valid keys are “strategy”, “constant”, “quantile”. See DummyRegressor. For example:
```
null_model_params = {
    "strategy": "mean",
}
null_model_params = {
    "strategy": "median",
}
null_model_params = {
    "strategy": "quantile",
    "quantile": 0.8,
}
null_model_params = {
    "strategy": "constant",
    "constant": 2.0,
}
```
If None, R2_null_model_score is not calculated.

Note: CV model selection always optimizes score_func, not the R2_null_model_score.
relative_error_tolerance (float or None, default None) – Threshold to compute the Outside Tolerance metric, defined as the fraction of forecasted values whose relative error is strictly greater than relative_error_tolerance. For example, 0.05 allows for 5% relative error. If None, the metric is not computed.
hyperparameter_grid (dict, list [dict] or None, default None) –
Sets properties of the steps in the pipeline, and specifies combinations to search over. Should be valid input to sklearn.model_selection.GridSearchCV (param_grid) or sklearn.model_selection.RandomizedSearchCV (param_distributions).

Prefix transform/estimator attributes by the name of the step in the pipeline. See details at: https://scikit-learn.org/stable/modules/compose.html#nested-parameters

If None, uses the default pipeline parameters.
hyperparameter_budget (int or None, default None) –
Maximum number of hyperparameter sets to try within the hyperparameter_grid search space.

Runs a full grid search if hyperparameter_budget is sufficient to exhaust full hyperparameter_grid, otherwise samples uniformly at random from the space.

If None, uses defaults:
- full grid search if all values are constant
- 10 if any value is a distribution to sample from
n_jobs (int or None, default COMPUTATION_N_JOBS) – Number of jobs to run in parallel (the maximum number of concurrently running workers). -1 uses all CPUs. -2 uses all CPUs but one. None is treated as 1 unless in a joblib.Parallel backend context that specifies otherwise.
verbose (int, default 1) – Verbosity level during CV. If > 0, prints the number of fits. If > 1, prints fit parameters, total score, and fit time. If > 2, prints train/test scores.
cv_horizon (int or None, default None) – Number of periods in each CV test set. If None, default is forecast_horizon. Set either cv_horizon or cv_max_splits to 0 to skip CV.
cv_min_train_periods (int or None, default None) – Minimum number of periods for training each CV fold. If cv_expanding_window is False, every training period is this size. If None, default is 2 * cv_horizon.
cv_expanding_window (bool, default False) – If True, training window for each CV split is fixed to the first available date. Otherwise, train start date is sliding, determined by cv_min_train_periods.
cv_use_most_recent_splits (bool, default False) – If True, splits from the end of the dataset are used. Else a sampling strategy is applied. Check _sample_splits for details.
cv_periods_between_splits (int or None, default None) – Number of periods to slide the test window between CV splits. If None, default is cv_horizon.
cv_periods_between_train_test (int or None, default None) – Number of periods for the gap between train and test in a CV split. If None, default is periods_between_train_test.
cv_max_splits (int or None, default 3) – Maximum number of CV splits. Given the above configuration, samples up to cv_max_splits train/test splits, preferring splits toward the end of available data. If None, uses all splits. Set either cv_horizon or cv_max_splits to 0 to skip CV.
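The interaction of the cv_* parameters can be sketched with a small index-based splitter. This is a simplified illustration, not the library's splitter: sketch_cv_splits is a hypothetical function, it always keeps the most recent splits (similar in spirit to cv_use_most_recent_splits=True), and it works on integer positions rather than timestamps.

```python
def sketch_cv_splits(n_periods, cv_horizon, cv_min_train_periods=None,
                     cv_expanding_window=False, cv_periods_between_splits=None,
                     cv_periods_between_train_test=0, cv_max_splits=3):
    """Returns (train_range, test_range) index pairs in chronological order."""
    # Setting either value to 0 skips cross-validation.
    if cv_horizon == 0 or cv_max_splits == 0:
        return []
    min_train = 2 * cv_horizon if cv_min_train_periods is None else cv_min_train_periods
    step = cv_horizon if cv_periods_between_splits is None else cv_periods_between_splits
    if step <= 0:
        raise ValueError("cv_periods_between_splits must be positive")
    splits = []
    test_end = n_periods  # exclusive end of the latest test window
    while True:
        test_start = test_end - cv_horizon
        train_end = test_start - cv_periods_between_train_test
        if train_end < min_train:
            break  # not enough history left for another split
        # Expanding window: train from the first period. Otherwise: sliding window.
        train_start = 0 if cv_expanding_window else train_end - min_train
        splits.append((range(train_start, train_end), range(test_start, test_end)))
        test_end -= step
    splits.reverse()
    if cv_max_splits is not None:
        splits = splits[-cv_max_splits:]  # keep the most recent splits
    return splits


splits = sketch_cv_splits(n_periods=20, cv_horizon=3)
```

With 20 periods and cv_horizon=3, four splits fit and the three most recent are kept. The last one trains on positions 11 to 16 and tests on positions 17 to 19.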

Returns

forecast_result – Forecast result. See ForecastResult for details.

If cv_horizon=0, forecast_result.grid_search.best_estimator_ and forecast_result.grid_search.best_params_ attributes are defined according to the provided single set of parameters. There must be a single set of parameters to skip cross-validation.

If test_horizon=0, forecast_result.backtest is None.

Return type

ForecastResult
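As a worked illustration of the relative_error_tolerance parameter described above, the following sketch computes the fraction of forecasted values whose relative error is strictly greater than the tolerance. The function name is hypothetical, and the sketch assumes all actual values are nonzero.

```python
def fraction_outside_tolerance(y_true, y_pred, relative_error_tolerance):
    """Fraction of predictions with relative error strictly above the tolerance."""
    outside = [
        abs(pred - actual) / abs(actual) > relative_error_tolerance
        for actual, pred in zip(y_true, y_pred)
    ]
    return sum(outside) / len(outside)


y_true = [100, 200, 50, 80]
y_pred = [104, 215, 50, 70]
# Relative errors: 0.04, 0.075, 0.0, 0.125
fraction = fraction_outside_tolerance(y_true, y_pred, relative_error_tolerance=0.05)
```

Two of the four relative errors exceed 5%, so fraction is 0.5.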

class greykite.framework.pipeline.pipeline.ForecastResult(timeseries: Optional[UnivariateTimeSeries] = None, grid_search: Optional[RandomizedSearchCV] = None, model: Optional[Pipeline] = None, backtest: Optional[UnivariateForecast] = None, forecast: Optional[UnivariateForecast] = None)[source]

Forecast results. Contains results from cross-validation, backtest, and forecast, the trained model, and the original input data.

timeseries: UnivariateTimeSeries = None: Input time series in standard format with stats and convenient plot functions.

grid_search: RandomizedSearchCV = None

Result of cross-validation grid search on training dataset. The relevant attributes are:

cv_results_ cross-validation scores

best_estimator_ the model used for backtesting

best_params_ the optimal parameters used for backtesting.

Also see summarize_grid_search_results. We recommend using this function to extract results, rather than accessing cv_results_ directly.

model: Pipeline = None: Model fitted on full dataset, using the best parameters selected via cross-validation. Has .fit(), .predict(), and diagnostic functions depending on the model.

backtest: UnivariateForecast = None: Forecast on backtest period. Backtest period is a holdout test set to check forecast quality against the most recent actual values available. The best model from cross validation is refit on data prior to this period. The timestamps in backtest.df are sorted in ascending order. Has a .plot() method and attributes to get forecast vs actuals, evaluation results.

forecast: UnivariateForecast = None: Forecast on future period. Future dates are after the train end date, following the holdout test set. The best model from cross validation is refit on data prior to this period. The timestamps in forecast.df are sorted in ascending order. Has a .plot() method and attributes to get forecast vs actuals, evaluation results.

Template Output

class greykite.framework.input.univariate_time_series.UnivariateTimeSeries[source]

Defines univariate time series input. The dataset can include regressors, but only one metric is designated as the target metric to forecast.

Loads time series into a standard format. Provides statistics, plotting functions, and ability to generate future dataframe for prediction.

df

Data frame containing timestamp and value, with standardized column names for internal use (TIME_COL, VALUE_COL). Rows are sorted by time index, and missing gaps between dates are filled in so that dates are spaced at regular intervals. Values are adjusted for anomalies according to anomaly_info. The index can be timezone aware (but TIME_COL is not).

Type: pandas.DataFrame

y

Value of time series to forecast.

Type: pandas.Series, dtype float64

time_stats

Summary statistics about the timestamp column.

Type: dict

value_stats

Summary statistics about the value column.

Type: dict

original_time_col

Name of time column in original input data.

Type: str

original_value_col

Name of value column in original input data.

Type: str

regressor_cols

A list of regressor columns in the training and prediction DataFrames.

Type: list [str]

lagged_regressor_cols

A list of additional columns needed for lagged regressors in the training and prediction DataFrames.

Type: list [str]

last_date_for_val

Date or timestamp corresponding to last non-null value in df[original_value_col].

Type: datetime.datetime or None, default None

last_date_for_reg

Date or timestamp corresponding to last non-null value in df[regressor_cols]. If regressor_cols is None, last_date_for_reg is None.

Type: datetime.datetime or None, default None

last_date_for_lag_reg

Date or timestamp corresponding to last non-null value in df[lagged_regressor_cols]. If lagged_regressor_cols is None, last_date_for_lag_reg is None.

Type: datetime.datetime or None, default None

train_end_date

Last date or timestamp in fit_df. It is always less than or equal to minimum non-null values of last_date_for_val and last_date_for_reg.

Type: datetime.datetime

fit_cols

A list of columns used in the training and prediction DataFrames.

Type: list [str]

fit_df

Data frame containing timestamp and value, with standardized column names for internal use. Will be used for fitting (train, cv, backtest).

Type: pandas.DataFrame

fit_y

Value of time series for fit_df.

Type: pandas.Series, dtype float64

freq

Timeseries frequency, DateOffset alias, e.g. {‘T’ (minute), ‘H’, ‘D’, ‘W’, ‘M’ (month end), ‘MS’ (month start), ‘Y’ (year end), ‘YS’ (year start)}. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases

Type: str

anomaly_info

Anomaly adjustment info. Anomalies in df are corrected before any forecasting is done. See self.load_data()

Type: dict or list [dict] or None, default None

df_before_adjustment

self.df before adjustment by anomaly_info. Used by self.plot() to show the adjustment.

Type: pandas.DataFrame or None, default None

load_data(df: DataFrame, time_col: str = 'ts', value_col: str = 'y', freq: Optional[str] = None, date_format: Optional[str] = None, tz: Optional[str] = None, train_end_date: Optional[Union[str, datetime]] = None, regressor_cols: Optional[List[str]] = None, lagged_regressor_cols: Optional[List[str]] = None, anomaly_info: Optional[Union[Dict, List[Dict]]] = None)[source]

Loads data to internal representation. Parses date column, sets timezone aware index. Checks for irregularities and raises an error if input is invalid. Adjusts for anomalies according to anomaly_info.

Parameters

df (pandas.DataFrame) – Input timeseries. A data frame which includes the timestamp column as well as the value column.
time_col (str) – The column name in df representing time for the time series data. The time column can be anything that can be parsed by pandas DatetimeIndex.
value_col (str) – The column name which has the value of interest to be forecasted.
freq (str or None, default None) – Timeseries frequency, DateOffset alias. If None, automatically inferred.
date_format (str or None, default None) – strftime format to parse time column, e.g. %m/%d/%Y. Note that %f will parse all the way up to nanoseconds. If None (recommended), inferred by pandas.to_datetime.
tz (str or pytz.timezone object or None, default None) – Passed to pandas.tz_localize to localize the timestamp.
train_end_date (str or datetime.datetime or None, default None) – Last date to use for fitting the model. Forecasts are generated after this date. If None, it is set to the minimum of self.last_date_for_val and self.last_date_for_reg.
regressor_cols (list [str] or None, default None) – A list of regressor columns used in the training and prediction DataFrames. If None, no regressor columns are used. Regressor columns that are unavailable in df are dropped.
lagged_regressor_cols (list [str] or None, default None) – A list of additional columns needed for lagged regressors in the training and prediction DataFrames. This list can have overlap with regressor_cols. If None, no additional columns are added to the DataFrame. Lagged regressor columns that are unavailable in df are dropped.
anomaly_info (dict or list [dict] or None, default None) –
Anomaly adjustment info. Anomalies in df are corrected before any forecasting is done.

If None, no adjustments are made.

A dictionary containing the parameters to adjust_anomalous_data. See that function for details. The possible keys are:

"value_col": str
The name of the column in df to adjust. You may adjust the value to forecast as well as any numeric regressors.

"anomaly_df": pandas.DataFrame
Adjustments to correct the anomalies.

"start_time_col": str, default START_TIME_COL
Start date column in anomaly_df.

"end_time_col": str, default END_TIME_COL
End date column in anomaly_df.

"adjustment_delta_col": str or None, default None
Impact column in anomaly_df.

"filter_by_dict": dict or None, default None
Used to filter anomaly_df to the relevant anomalies for the value_col in this dictionary. Key specifies the column name, value specifies the filter value.

"filter_by_value_col": str or None, default None
Adds {filter_by_value_col: value_col} to filter_by_dict if not None, for the value_col in this dictionary.

"adjustment_method": str (“add” or “subtract”), default “add”
How to make the adjustment, if adjustment_delta_col is provided.

Accepts a list of such dictionaries to adjust multiple columns in df.

Returns

self – Sets self.df with standard column names, value adjusted for anomalies, and time gaps filled in, sorted by time index.

Return type

Returns self.
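To illustrate the anomaly_info structure described above, here is a minimal sketch that builds a list of two such dictionaries. The column names in anomaly_df ("start", "end", "impact", "metric"), the regressor name "regressor1", and all sample values are hypothetical and chosen only for this example. The final call is left as a comment because it assumes an instance of this class named ts and an input DataFrame df.

```python
import pandas as pd

# Hypothetical table of known anomalies and their estimated impact.
anomaly_df = pd.DataFrame({
    "start": ["2020-01-01", "2020-02-10", "2020-03-05"],
    "end": ["2020-01-03", "2020-02-12", "2020-03-06"],
    "impact": [100.0, -50.0, 7.5],
    "metric": ["y", "y", "regressor1"],
})

# Adjusts the value column "y", using only the rows where metric == "y".
value_anomaly_info = {
    "value_col": "y",
    "anomaly_df": anomaly_df,
    "start_time_col": "start",
    "end_time_col": "end",
    "adjustment_delta_col": "impact",
    "filter_by_dict": {"metric": "y"},
    "adjustment_method": "subtract",
}

# Adjusts a numeric regressor. Per the description above, "filter_by_value_col"
# adds {"metric": "regressor1"} to the filter for this dictionary.
regressor_anomaly_info = {
    "value_col": "regressor1",
    "anomaly_df": anomaly_df,
    "start_time_col": "start",
    "end_time_col": "end",
    "adjustment_delta_col": "impact",
    "filter_by_value_col": "metric",
    "adjustment_method": "subtract",
}

# A list adjusts multiple columns in df.
anomaly_info = [value_anomaly_info, regressor_anomaly_info]

# Assumed usage (ts and df are not defined in this sketch):
# ts.load_data(df, regressor_cols=["regressor1"], anomaly_info=anomaly_info)
```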

describe_time_col()

Basic descriptive stats on the timeseries time column.

Returns

time_stats –

Dictionary with descriptive stats on the timeseries time column.

data_points: int
number of time points

mean_increment_secs: float
mean frequency

min_timestamp: datetime64
start date

max_timestamp: datetime64
end date

Return type

dict
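The following sketch shows what the four entries of time_stats mean, computed by hand on a small list of timestamps. It uses only the standard library and is an illustration of the definitions, not the library's implementation; in particular, reading mean_increment_secs as the mean gap between consecutive timestamps in seconds is an assumption.

```python
from datetime import datetime, timedelta

# Five daily timestamps (hypothetical data).
timestamps = [datetime(2020, 1, 1) + timedelta(days=i) for i in range(5)]

# Gaps between consecutive timestamps, in seconds.
increments = [
    (later - earlier).total_seconds()
    for earlier, later in zip(timestamps, timestamps[1:])
]

time_stats = {
    "data_points": len(timestamps),
    "mean_increment_secs": sum(increments) / len(increments),
    "min_timestamp": min(timestamps),
    "max_timestamp": max(timestamps),
}
print(time_stats)
```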

describe_value_col()

Basic descriptive stats on the timeseries value column.

Returns: value_stats – Dict with keys: count, mean, std, min, 25%, 50%, 75%, max
Return type: dict [str, float]
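The keys listed above are the same ones that pandas.Series.describe produces for a numeric series. The sketch below shows a dictionary of that shape on hypothetical data; whether this method calls describe internally is an assumption.

```python
import pandas as pd

# Hypothetical value column.
values = pd.Series([1.0, 2.0, 3.0, 4.0], name="y")

# describe() returns count, mean, std, min, 25%, 50%, 75%, max.
value_stats = values.describe().to_dict()
print(value_stats)
```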

make_future_dataframe(periods: Optional[int] = None, include_history=True)

Extends the input data for prediction into the future.

Includes the historical values (VALUE_COL) so this can be fed into a Pipeline that transforms input data for fitting, and for use in evaluation.

Parameters

periods (int or None) – Number of periods to forecast. If there are no regressors, default is 30. If there are regressors, default is to predict all available dates.
include_history (bool) – Whether to return historical dates and values with future dates.

Returns

future_df – Dataframe with future timestamps for prediction. Contains columns for:

prediction dates (TIME_COL),

values (VALUE_COL),

optional regressors

Return type

pandas.DataFrame
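As a rough picture of the result, the sketch below extends a three-day history by two periods using plain pandas. The column names "ts" and "y" follow the defaults shown in load_data; the daily frequency and the reindex-based approach are assumptions for this example, not the library's implementation.

```python
import pandas as pd

# Hypothetical history with the default column names.
history = pd.DataFrame({
    "ts": pd.date_range("2020-01-01", periods=3, freq="D"),
    "y": [1.0, 2.0, 3.0],
})
periods = 2

# Extend the time index past the last observed date.
full_index = pd.date_range(
    start=history["ts"].min(),
    periods=len(history) + periods,
    freq="D",
)

# Future rows have timestamps but no observed value (NaN).
future_df = (
    history.set_index("ts")
    .reindex(full_index)
    .rename_axis("ts")
    .reset_index()
)

# Equivalent of include_history=False in this sketch: keep only the new rows.
future_only = future_df.tail(periods)
print(future_df)
```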

plot(color='rgb(32, 149, 212)', show_anomaly_adjustment=False, **kwargs)

Returns interactive plotly graph of the value against time.

If anomaly info is provided, there is an option to show the anomaly adjustment.

Parameters

color (str, default “rgb(32, 149, 212)” (light blue)) – Color of the value line (after adjustment, if applicable).
show_anomaly_adjustment (bool, default False) – Whether to show the anomaly adjustment.
kwargs (additional parameters) – Additional parameters to pass to plot_univariate such as title and color.

Returns

fig – Interactive plotly graph of the value against time.

See plot_forecast_vs_actual return value for how to plot the figure and add customization.

Return type

get_grouping_evaluation(aggregation_func=<function nanmean>, aggregation_func_name='mean', groupby_time_feature=None, groupby_sliding_window_size=None, groupby_custom_column=None)

Group-wise computation of aggregated timeseries value. Can be used to evaluate error or aggregated value by a time feature, over time, or by a user-provided column.

Exactly one of: groupby_time_feature, groupby_sliding_window_size, groupby_custom_column must be provided.

Parameters

aggregation_func (callable, optional, default numpy.nanmean) – Function that aggregates an array to a number. Signature (y: array) -> aggregated value: float.
aggregation_func_name (str or None, optional, default “mean”) – Name of grouping function, used to report results. If None, defaults to “aggregation”.
groupby_time_feature (str or None, optional) – If provided, groups by a column generated by build_time_features_df. See that function for valid values.
groupby_sliding_window_size (int or None, optional) – If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
groupby_custom_column (pandas.Series or None, optional) – If provided, groups by this column value. Should be same length as the DataFrame.

Returns

grouped_df – pandas.DataFrame with two columns:

grouping_func_name: evaluation metric for aggregation of timeseries.

group name: depends on the grouping method: groupby_time_feature for groupby_time_feature, cst.TIME_COL for groupby_sliding_window_size, groupby_custom_column.name for groupby_custom_column.

Return type

pandas.DataFrame
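The three grouping modes can be pictured with plain pandas, as in the sketch below. The data, the helper name aggregate, and the group labels are hypothetical. The window alignment (partitions counted from the first row) is also an assumption; the library may align windows differently.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ts": pd.date_range("2020-01-06", periods=14, freq="D"),  # starts on a Monday
    "y": np.arange(14, dtype=float),
})


def aggregate(values):
    """Aggregates an array to a number, like the default numpy.nanmean."""
    return float(np.nanmean(values.to_numpy()))


# 1. Group by a time feature (here: day of week, 0 = Monday).
dow = df["ts"].dt.dayofweek.rename("dow")
by_dow = df.groupby(dow)["y"].apply(aggregate)

# 2. Group by sequential windows of fixed size.
window = pd.Series(np.arange(len(df)) // 7, name="window")
by_window = df.groupby(window)["y"].apply(aggregate)

# 3. Group by a custom column of the same length as the DataFrame.
segment = pd.Series(["a", "b"] * 7, name="segment")
by_segment = df.groupby(segment)["y"].apply(aggregate)

print(by_dow, by_window, by_segment, sep="\n\n")
```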

plot_grouping_evaluation(aggregation_func=<function nanmean>, aggregation_func_name='mean', groupby_time_feature=None, groupby_sliding_window_size=None, groupby_custom_column=None, xlabel=None, ylabel=None, title=None)

Computes aggregated timeseries by group and plots the result. Can be used to plot aggregated timeseries by a time feature, over time, or by a user-provided column.

Exactly one of: groupby_time_feature, groupby_sliding_window_size, groupby_custom_column must be provided.

Parameters

aggregation_func (callable, optional, default numpy.nanmean) – Function that aggregates an array to a number. Signature (y: array) -> aggregated value: float.
aggregation_func_name (str or None, optional, default “mean”) – Name of grouping function, used to report results. If None, defaults to “aggregation”.
groupby_time_feature (str or None, optional) – If provided, groups by a column generated by build_time_features_df. See that function for valid values.
groupby_sliding_window_size (int or None, optional) – If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
groupby_custom_column (pandas.Series or None, optional) – If provided, groups by this column value. Should be same length as the DataFrame.
xlabel (str, optional, default None) – X-axis label of the plot.
ylabel (str, optional, default None) – Y-axis label of the plot.
title (str or None, optional) – Plot title. If None, default is based on axis labels.

Returns

fig – plotly graph object showing aggregated timeseries by group. The x-axis label depends on the grouping method: groupby_time_feature for groupby_time_feature, TIME_COL for groupby_sliding_window_size, groupby_custom_column.name for groupby_custom_column.

Return type

get_quantiles_and_overlays(groupby_time_feature=None, groupby_sliding_window_size=None, groupby_custom_column=None, show_mean=False, show_quantiles=False, show_overlays=False, overlay_label_time_feature=None, overlay_label_sliding_window_size=None, overlay_label_custom_column=None, center_values=False, value_col='y', mean_col_name='mean', quantile_col_prefix='Q', **overlay_pivot_table_kwargs)

Computes mean, quantiles, and overlays by the requested grouping dimension.

Overlays are best explained in the plotting context. The grouping dimension goes on the x-axis, and one line is shown for each level of the overlay dimension. This function returns a column for each line to plot (e.g. mean, each quantile, each overlay value).

Exactly one of: groupby_time_feature, groupby_sliding_window_size, groupby_custom_column must be provided as the grouping dimension.

If show_overlays is True, exactly one of: overlay_label_time_feature, overlay_label_sliding_window_size, overlay_label_custom_column can be provided to specify the label_col (overlay dimension). Internally, the function calls pandas.DataFrame.pivot_table with index=groupby_col, columns=label_col, values=value_col to get the overlay values for plotting. You can pass additional parameters to pandas.DataFrame.pivot_table via overlay_pivot_table_kwargs, e.g. to change the aggregation method. If an explicit label is not provided, the records are labeled by their position within the group.

For example, to show yearly seasonality mean, quantiles, and overlay plots for each individual year, use:

self.get_quantiles_and_overlays(
    groupby_time_feature="doy",         # Rows: a row for each day of year (1, 2, ..., 366)
    show_mean=True,                     # mean value on that day
    show_quantiles=[0.1, 0.9],          # quantiles of the observed distribution on that day
    show_overlays=True,                 # Include overlays defined by ``overlay_label_time_feature``
    overlay_label_time_feature="year")  # One column for each observed "year" (2016, 2017, 2018, ...)

To show weekly seasonality over time, use:

self.get_quantiles_and_overlays(
    groupby_time_feature="dow",            # Rows: a row for each day of week (1, 2, ..., 7)
    show_mean=True,                        # mean value on that day
    show_quantiles=[0.1, 0.5, 0.9],        # quantiles of the observed distribution on that day
    show_overlays=True,                    # Include overlays defined by ``overlay_label_sliding_window_size``
    overlay_label_sliding_window_size=90,  # One column for each 90 period sliding window in the dataset,
    aggfunc="median")                      # overlay value is the median value for the dow over the period (default="mean").

It may be difficult to assess the weekly seasonality from the previous result, because overlays shift up/down over time due to trend/yearly seasonality. Use center_values=True to adjust each overlay so its average value is centered at 0. Mean and quantiles are shifted by a single constant to center the mean at 0, while preserving their relative values:

self.get_quantiles_and_overlays(
    groupby_time_feature="dow",
    show_mean=True,
    show_quantiles=[0.1, 0.5, 0.9],
    show_overlays=True,
    overlay_label_sliding_window_size=90,
    aggfunc="median",
    center_values=True)  # Centers the output

Centering reduces the variability in the overlays to make it easier to isolate the effect by the groupby column. As a result, centered overlays have smaller variability than that reported by the quantiles, which operate on the original, uncentered data points. Similarly, if overlays are aggregates of individual values (i.e. aggfunc is needed in the call to pandas.DataFrame.pivot_table), the quantiles of overlays will be less extreme than those of the original data.

To assess variability conditioned on the groupby value, check the quantiles.

To assess variability conditioned on both the groupby and overlay value, after any necessary aggregation, check the variability of the overlay values. Compute quantiles of overlays from the return value if desired.
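The pivot and centering steps described above can be sketched with plain pandas. The example below uses a hypothetical series with a pure upward trend, so the uncentered overlays differ only by a level shift and the centered overlays coincide. The column names "dow" and "window" and the 14-day window are assumptions for illustration; this is not the library's implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ts": pd.date_range("2020-01-06", periods=28, freq="D"),  # starts on a Monday
    "y": np.arange(28, dtype=float),                           # pure upward trend
})
df["dow"] = df["ts"].dt.dayofweek + 1     # grouping dimension: 1, 2, ..., 7
df["window"] = np.arange(len(df)) // 14   # overlay dimension: two 14-day windows

# One row per day of week, one column per window, median within each cell.
overlays = df.pivot_table(index="dow", columns="window", values="y", aggfunc="median")

# Centering: shift each overlay so its average value is 0.
centered = overlays - overlays.mean(axis=0)

# Quantiles of the original, uncentered values by day of week.
quantiles = df.groupby("dow")["y"].quantile([0.1, 0.9]).unstack()

print(overlays, centered, quantiles, sep="\n\n")
```

In this data the uncentered overlays sit 14 units apart because of the trend, while the centered overlays are identical. The quantiles are computed on the raw values and therefore still reflect the trend.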

Parameters

groupby_time_feature (str or None, default None) – If provided, groups by a column generated by build_time_features_df. See that function for valid values.
groupby_sliding_window_size (int or None, default None) – If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
groupby_custom_column (pandas.Series or None, default None) – If provided, groups by this column value. Should be same length as the DataFrame.
show_mean (bool, default False) – Whether to return the mean value by the groupby column.
show_quantiles (bool or list [float] or numpy.array, default False) – Whether to return the quantiles of the value by the groupby column. If False, does not return quantiles. If True, returns default quantiles (0.1 and 0.9). If array-like, a list of quantiles to compute (e.g. (0.1, 0.25, 0.75, 0.9)).
show_overlays (bool or int or array-like [int or str] or None, default False) –
Whether to return overlays of the value by the groupby column.

If False, no overlays are shown.

If True and label_col is defined, calls pandas.DataFrame.pivot_table with index=groupby_col, columns=label_col, values=value_col. label_col is defined by one of overlay_label_time_feature, overlay_label_sliding_window_size, or overlay_label_custom_column. Returns one column for each value of the label_col.

If True and the label_col is not defined, returns the raw values within each group. Values across groups are put into columns by their position in the group (1st element in group, 2nd, 3rd, etc.). Positional order in a group is not guaranteed to correspond to anything meaningful, so the items within a column may not have anything in common. It is better to specify one of overlay_* to explicitly define the overlay labels.

If an integer, the number of overlays to randomly sample. Behaves like True, then randomly samples up to that many columns. This is useful if there are too many values.

If a list [int], a list of column indices (int type). Behaves like True, then selects the specified columns by index.

If a list [str], a list of column names. Column names are matched by their string representation to the names in this list. Behaves like True, then selects the specified columns by name.
overlay_label_time_feature (str or None, default None) –
If show_overlays is True, can be used to define label_col, i.e. which dimension to show separately as overlays.

If provided, uses a column generated by build_time_features_df. See that function for valid values.
overlay_label_sliding_window_size (int or None, default None) –
If show_overlays is True, can be used to define label_col, i.e. which dimension to show separately as overlays.

If provided, uses a column that sequentially partitions data into groups of size overlay_label_sliding_window_size.
overlay_label_custom_column (pandas.Series or None, default None) –
If show_overlays is True, can be used to define label_col, i.e. which dimension to show separately as overlays.

If provided, uses this column value. Should be same length as the DataFrame.
value_col (str, default VALUE_COL) – The column name for the value column. By default, shows the univariate time series value, but it can be any other column in self.df.
mean_col_name (str, default “mean”) – The name to use for the mean column in the output. Applies if show_mean=True.
quantile_col_prefix (str, default “Q”) – The prefix to use for quantile column names in the output. Columns are named with this prefix followed by the quantile, rounded to 2 decimal places.
center_values (bool, default False) –
Whether to center the return values. If True, shifts each overlay so its average value is centered at 0. Shifts mean and quantiles by a constant to center the mean at 0, while preserving their relative values.

If False, values are not centered.
overlay_pivot_table_kwargs (additional parameters) – Additional keyword parameters to pass to pandas.DataFrame.pivot_table, used in generating the overlays. See above description for details.

Returns

grouped_df – Dataframe with mean, quantiles, and overlays by the grouping column. Overlays are defined by the grouping column and overlay dimension.

The column index is a MultiIndex whose first level is the “category”, a subset of [MEAN_COL_GROUP, QUANTILE_COL_GROUP, OVERLAY_COL_GROUP] depending on what is requested.

grouped_df[MEAN_COL_GROUP] = df with single column, named mean_col_name.

grouped_df[QUANTILE_COL_GROUP] = df with a column for each quantile, named f"{quantile_col_prefix}{round(q, 2)}", where q is the quantile.

grouped_df[OVERLAY_COL_GROUP] = df with one column per overlay value, named by the overlay value.

For example, it might look like:

category    mean    quantile        overlay
name        mean    Q0.1    Q0.9    2007    2008    2009
doy
1           8.42    7.72    9.08    8.29    7.75    8.33
2           8.82    8.20    9.56    8.43    8.80    8.53
3           8.95    8.25    9.88    8.26    9.12    8.70
4           9.07    8.60    9.49    8.10    9.99    8.73
5           8.73    8.29    9.24    7.95    9.26    8.37
...         ...     ...     ...     ...     ...     ...
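To show how a result of this shape can be sliced by category, the sketch below rebuilds the first two rows of the example table by hand and selects the quantile columns. The category labels "mean", "quantile", and "overlay" are taken from the example table; treating them as the values of MEAN_COL_GROUP, QUANTILE_COL_GROUP, and OVERLAY_COL_GROUP is an assumption.

```python
import pandas as pd

# Two-level column index: (category, name), as in the example table.
columns = pd.MultiIndex.from_tuples(
    [
        ("mean", "mean"),
        ("quantile", "Q0.1"),
        ("quantile", "Q0.9"),
        ("overlay", "2007"),
        ("overlay", "2008"),
        ("overlay", "2009"),
    ],
    names=["category", "name"],
)
grouped_df = pd.DataFrame(
    [
        [8.42, 7.72, 9.08, 8.29, 7.75, 8.33],
        [8.82, 8.20, 9.56, 8.43, 8.80, 8.53],
    ],
    index=pd.Index([1, 2], name="doy"),
    columns=columns,
)

# Selecting by the first level returns a DataFrame with that category's columns.
quantiles = grouped_df["quantile"]
overlays = grouped_df["overlay"]

# Quantiles of the overlays can be computed from the return value if desired.
overlay_median = overlays.median(axis=1)
print(quantiles, overlay_median, sep="\n\n")
```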

Return type

pandas.DataFrame

plot_quantiles_and_overlays(groupby_time_feature=None, groupby_sliding_window_size=None, groupby_custom_column=None, show_mean=False, show_quantiles=False, show_overlays=False, overlay_label_time_feature=None, overlay_label_sliding_window_size=None, overlay_label_custom_column=None, center_values=False, value_col='y', mean_col_name='mean', quantile_col_prefix='Q', mean_style=None, quantile_style=None, overlay_style=None, xlabel=None, ylabel=None, title=None, showlegend=True, **overlay_pivot_table_kwargs)

Plots mean, quantiles, and overlays by the requested grouping dimension.

The grouping dimension goes on the x-axis, and one line is shown for the mean, each quantile, and each level of the overlay dimension, as requested. By default, shading is applied between the quantiles.

Exactly one of: groupby_time_feature, groupby_sliding_window_size, groupby_custom_column must be provided as the grouping dimension.

If show_overlays is True, exactly one of: overlay_label_time_feature, overlay_label_sliding_window_size, overlay_label_custom_column can be provided to specify the label_col (overlay dimension). Internally, the function calls pandas.DataFrame.pivot_table with index=groupby_col, columns=label_col, values=value_col to get the overlay values for plotting. You can pass additional parameters to pandas.DataFrame.pivot_table via overlay_pivot_table_kwargs, e.g. to change the aggregation method. If an explicit label is not provided, the records are labeled by their position within the group.

For example, to show yearly seasonality mean, quantiles, and overlay plots for each individual year, use:

self.plot_quantiles_and_overlays(
    groupby_time_feature="doy",         # Rows: a row for each day of year (1, 2, ..., 366)
    show_mean=True,                     # mean value on that day
    show_quantiles=[0.1, 0.9],          # quantiles of the observed distribution on that day
    show_overlays=True,                 # Include overlays defined by ``overlay_label_time_feature``
    overlay_label_time_feature="year")  # One column for each observed "year" (2016, 2017, 2018, ...)

To show weekly seasonality over time, use:

self.plot_quantiles_and_overlays(
    groupby_time_feature="dow",            # Rows: a row for each day of week (1, 2, ..., 7)
    show_mean=True,                        # mean value on that day
    show_quantiles=[0.1, 0.5, 0.9],        # quantiles of the observed distribution on that day
    show_overlays=True,                    # Include overlays defined by ``overlay_label_sliding_window_size``
    overlay_label_sliding_window_size=90,  # One column for each 90 period sliding window in the dataset,
    aggfunc="median")                      # overlay value is the median value for the dow over the period (default="mean").

It may be difficult to assess the weekly seasonality from the previous result, because overlays shift up/down over time due to trend/yearly seasonality. Use center_values=True to adjust each overlay so its average value is centered at 0. Mean and quantiles are shifted by a single constant to center the mean at 0, while preserving their relative values:

self.plot_quantiles_and_overlays(
    groupby_time_feature="dow",
    show_mean=True,
    show_quantiles=[0.1, 0.5, 0.9],
    show_overlays=True,
    overlay_label_sliding_window_size=90,
    aggfunc="median",
    center_values=True)  # Centers the output

Centering reduces the variability in the overlays to make it easier to isolate the effect by the groupby column. As a result, centered overlays have smaller variability than that reported by the quantiles, which operate on the original, uncentered data points. Similarly, if overlays are aggregates of individual values (i.e. aggfunc is needed in the call to pandas.DataFrame.pivot_table), the quantiles of overlays will be less extreme than those of the original data.

To assess variability conditioned on the groupby value, check the quantiles.

To assess variability conditioned on both the groupby and overlay value, after any necessary aggregation, check the variability of the overlay values. Compute quantiles of overlays from the return value if desired.

Parameters

groupby_time_feature (str or None, default None) – If provided, groups by a column generated by build_time_features_df. See that function for valid values.
groupby_sliding_window_size (int or None, default None) – If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
groupby_custom_column (pandas.Series or None, default None) – If provided, groups by this column value. Should be same length as the DataFrame.
show_mean (bool, default False) – Whether to return the mean value by the groupby column.
show_quantiles (bool or list [float] or numpy.array, default False) – Whether to return the quantiles of the value by the groupby column. If False, does not return quantiles. If True, returns default quantiles (0.1 and 0.9). If array-like, a list of quantiles to compute (e.g. (0.1, 0.25, 0.75, 0.9)).
show_overlays (bool or int or array-like [int or str], default False) –
Whether to return overlays of the value by the groupby column.

If False, no overlays are shown.

If True and label_col is defined, calls pandas.DataFrame.pivot_table with index=groupby_col, columns=label_col, values=value_col. label_col is defined by one of overlay_label_time_feature, overlay_label_sliding_window_size, or overlay_label_custom_column. Returns one column for each value of the label_col.

If True and the label_col is not defined, returns the raw values within each group. Values across groups are put into columns by their position in the group (1st element in group, 2nd, 3rd, etc.). Positional order in a group is not guaranteed to correspond to anything meaningful, so the items within a column may not have anything in common. It is better to specify one of overlay_* to explicitly define the overlay labels.

If an integer, the number of overlays to randomly sample. Behaves like True, then randomly samples up to that many columns. This is useful if there are too many values.

If a list [int], a list of column indices (int type). Behaves like True, then selects the specified columns by index.

If a list [str], a list of column names. Column names are matched by their string representation to the names in this list. Behaves like True, then selects the specified columns by name.
overlay_label_time_feature (str or None, default None) –
If show_overlays is True, can be used to define label_col, i.e. which dimension to show separately as overlays.

If provided, uses a column generated by build_time_features_df. See that function for valid values.
overlay_label_sliding_window_size (int or None, default None) –
If show_overlays is True, can be used to define label_col, i.e. which dimension to show separately as overlays.

If provided, uses a column that sequentially partitions data into groups of size overlay_label_sliding_window_size.
overlay_label_custom_column (pandas.Series or None, default None) –
If show_overlays is True, can be used to define label_col, i.e. which dimension to show separately as overlays.

If provided, uses this column value. Should be same length as the DataFrame.
value_col (str, default VALUE_COL) – The column name for the value column. By default, shows the univariate time series value, but it can be any other column in self.df.
mean_col_name (str, default “mean”) – The name to use for the mean column in the output. Applies if show_mean=True.
quantile_col_prefix (str, default “Q”) – The prefix to use for quantile column names in the output. Columns are named with this prefix followed by the quantile, rounded to 2 decimal places.
center_values (bool, default False) –
Whether to center the return values. If True, shifts each overlay so its average value is centered at 0. Shifts mean and quantiles by a constant to center the mean at 0, while preserving their relative values.

If False, values are not centered.

mean_style (dict or None, default None) –

How to style the mean line, passed as keyword arguments to plotly.graph_objects.Scatter. If None, the default is:

mean_style = {
    "line": dict(
        width=2,
        color="#595959"),  # gray
    "legendgroup": MEAN_COL_GROUP}

quantile_style (dict or None, default None) –
How to style the quantile lines, passed as keyword arguments to plotly.graph_objects.Scatter. If None, the default is:
```
quantile_style = {
    "line": dict(
        width=2,
        color="#1F9AFF",  # blue
        dash="solid"),
    "legendgroup": QUANTILE_COL_GROUP,  # show/hide them together
    "fill": "tonexty"}
```
Note that fill style is removed from the first quantile line, to fill only between items in the same category.

overlay_style (dict or None, default None) –

How to style the overlay lines, passed as keyword arguments to plotly.graph_objects.Scatter. If None, the default is:

overlay_style = {
    "opacity": 0.5,  # makes it easier to see density
    "line": dict(
        width=1,
        color="#B3B3B3",  # light gray
        dash="solid"),
    "legendgroup": OVERLAY_COL_GROUP}

xlabel (str, optional, default None) – X-axis label of the plot.
ylabel (str, optional, default None) – Y-axis label of the plot. If None, uses value_col.
title (str or None, default None) – Plot title. If None, default is based on axis labels.
showlegend (bool, default True) – Whether to show the legend.
overlay_pivot_table_kwargs (additional parameters) – Additional keyword parameters to pass to pandas.DataFrame.pivot_table, used in generating the overlays. See get_quantiles_and_overlays description for details.

Returns

fig – plotly graph object showing the mean, quantiles, and overlays.

Return type

See also

get_quantiles_and_overlays: To get the mean, quantiles, and overlays as a pandas.DataFrame without plotting.

class greykite.framework.output.univariate_forecast.UnivariateForecast(df, time_col='ts', actual_col='actual', predicted_col='forecast', predicted_lower_col='forecast_lower', predicted_upper_col='forecast_upper', null_model_predicted_col='forecast_null', ylabel='y', train_end_date=None, test_start_date=None, forecast_horizon=None, coverage=0.95, r2_loss_function=<function mean_squared_error>, estimator=None, relative_error_tolerance=None)

Stores predicted and actual values. Provides functionality to evaluate a forecast:

plots predicted against actual with prediction bands.

evaluates model performance.

Input should be one of two kinds of forecast results:

model fit to train data, forecast on test set (actuals available).

model fit to all data, forecast on future dates (actuals not available).

The input df is a concatenation of fitted and forecasted values.

df

Timestamp, predicted, and actual values.

Type: pandas.DataFrame

time_col

Column in df with timestamp (default “ts”).

Type: str

actual_col

Column in df with actual values (default “actual”).

Type: str

predicted_col

Column in df with predicted values (default “forecast”).

Type: str

predicted_lower_col

Column in df with predicted lower bound (default “forecast_lower”, optional).

Type: str or None

predicted_upper_col

Column in df with predicted upper bound (default “forecast_upper”, optional).

Type: str or None

null_model_predicted_col

Column in df with predicted value of null model (default “forecast_null”, optional).

Type: str or None

ylabel

Unit of measurement (default “y”)

Type: str

train_end_date

End date for train period. If None, assumes all data were used for training.

Type: str or datetime or None, default None

test_start_date

Start date of test period. If None, set to the time_col value immediately after train_end_date. This assumes that all data not used in training were used for testing.

Type: str or datetime or None, default None

forecast_horizon

Number of periods forecasted into the future. Must be > 0.

Type: int or None, default None

coverage

Intended coverage of the prediction bands (0.0 to 1.0).

Type: float or None

r2_loss_function

Loss function to calculate cst.R2_null_model_score, with signature loss_func(y_true, y_pred) (default mean_squared_error)

Type: function

estimator

The fitted estimator, the last step in the forecast pipeline.

Type: An instance of an estimator that implements greykite.models.base_forecast_estimator.BaseForecastEstimator.

relative_error_tolerance

Threshold to compute the Outside Tolerance metric, defined as the fraction of forecasted values whose relative error is strictly greater than relative_error_tolerance. For example, 0.05 allows for 5% relative error. If None, the metric is not computed.

Type: float or None, default None

df_train

Subset of df where df[time_col] <= train_end_date.

Type: pandas.DataFrame

df_test

Subset of df where df[time_col] > train_end_date.

Type: pandas.DataFrame

train_evaluation

Evaluation metrics on training set.

Type: dict [str, float]

test_evaluation

Evaluation metrics on test set (if actual values provided after train_end_date).

Type: dict [str, float]

test_na_count

Count of NA values in test data.

Type: int
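The Outside Tolerance metric described under relative_error_tolerance can be written out in a few lines. The sketch below is a standard-library illustration of that definition on hypothetical data. Computing relative error as |actual - predicted| / |actual| is an assumption, since the text above does not spell out the formula, and the function name outside_tolerance is hypothetical.

```python
def outside_tolerance(actual, predicted, relative_error_tolerance):
    """Fraction of forecasts whose relative error is strictly greater than the tolerance.

    Assumes relative error = |actual - predicted| / |actual| and nonzero actuals.
    """
    relative_errors = [
        abs(a - p) / abs(a) for a, p in zip(actual, predicted)
    ]
    n_outside = sum(err > relative_error_tolerance for err in relative_errors)
    return n_outside / len(relative_errors)


actual = [100.0, 200.0, 50.0, 80.0]
predicted = [104.0, 190.0, 56.0, 80.0]

# Relative errors are 0.04, 0.05, 0.12, 0.0. Only 0.12 is strictly above 0.05.
fraction = outside_tolerance(actual, predicted, relative_error_tolerance=0.05)
print(fraction)
```

Because the comparison is strict, a forecast whose relative error equals the tolerance exactly (the second point here) counts as within tolerance.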

compute_evaluation_metrics_split()

Computes __evaluation_metrics for train and test set separately.

Returns: dictionary with train and test evaluation metrics

plot(**kwargs)

Plots predicted against actual.

Parameters

kwargs (additional parameters) – Additional parameters to pass to plot_forecast_vs_actual such as title, colors, and line styling.

Returns

fig – Plotly figure of forecast against actuals, with prediction intervals if available.

See plot_forecast_vs_actual return value for how to plot the figure and add customization.

Return type

get_grouping_evaluation(score_func=<function add_finite_filter_to_scorer.<locals>.score_func_finite>, score_func_name='MAPE', which='train', groupby_time_feature=None, groupby_sliding_window_size=None, groupby_custom_column=None)

Group-wise computation of forecasting error. Can be used to evaluate error or aggregated value by a time feature, over time, or by a user-provided column.

Exactly one of: groupby_time_feature, groupby_sliding_window_size, groupby_custom_column must be provided.

Parameters

score_func (callable, optional) – Function that maps two arrays to a number. Signature (y_true: array, y_pred: array) -> error: float
score_func_name (str or None, optional) – Name of the score function used to report results. If None, defaults to “metric”.
which (str) – “train” or “test”. Which dataset to evaluate.
groupby_time_feature (str or None, optional) – If provided, groups by a column generated by build_time_features_df. See that function for valid values.
groupby_sliding_window_size (int or None, optional) – If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
groupby_custom_column (pandas.Series or None, optional) – If provided, groups by this column value. Should be same length as the DataFrame.

Returns

grouped_df – pandas.DataFrame with two columns:

grouping_func_name: evaluation metric computing forecasting error of timeseries.

group name: depends on the grouping method: groupby_time_feature for groupby_time_feature, cst.TIME_COL for groupby_sliding_window_size, groupby_custom_column.name for groupby_custom_column.

Return type

pandas.DataFrame
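The idea of scoring a forecast within each group can be sketched without any dependencies. The example below computes a percentage error per day of week on hypothetical data where the forecast is off only on weekends. The mape helper and the row layout are assumptions for illustration; they stand in for a score_func with signature (y_true, y_pred) -> float and are not the library's implementation.

```python
from collections import defaultdict
from datetime import date, timedelta


def mape(y_true, y_pred):
    """Mean absolute percent error, in percent. Assumes nonzero actuals."""
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Two weeks of (timestamp, actual, forecast), starting on a Monday.
# The forecast is 10% too high on Saturdays and Sundays, exact otherwise.
rows = [
    (date(2020, 1, 6) + timedelta(days=i), 100.0, 110.0 if i % 7 >= 5 else 100.0)
    for i in range(14)
]

# Collect actuals and forecasts per group (ISO weekday: 1 = Monday, 7 = Sunday).
groups = defaultdict(lambda: ([], []))
for ts, actual, forecast in rows:
    actuals, forecasts = groups[ts.isoweekday()]
    actuals.append(actual)
    forecasts.append(forecast)

# Apply the score function within each group.
mape_by_dow = {dow: mape(*pairs) for dow, pairs in sorted(groups.items())}
print(mape_by_dow)
```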

plot_grouping_evaluation(score_func=<function add_finite_filter_to_scorer.<locals>.score_func_finite>, score_func_name='MAPE', which='train', groupby_time_feature=None, groupby_sliding_window_size=None, groupby_custom_column=None, xlabel=None, ylabel=None, title=None)

Computes error by group and plots the result. Can be used to plot error by a time feature, over time, or by a user-provided column.

Exactly one of: groupby_time_feature, groupby_sliding_window_size, groupby_custom_column must be provided.

Parameters

score_func (callable, optional) – Function that maps two arrays to a number. Signature (y_true: array, y_pred: array) -> error: float
score_func_name (str or None, optional) – Name of the score function used to report results. If None, defaults to “metric”.
which (str, optional, default “train”) – Which dataset to evaluate, “train” or “test”.
groupby_time_feature (str or None, optional) – If provided, groups by a column generated by build_time_features_df. See that function for valid values.
groupby_sliding_window_size (int or None, optional) – If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
groupby_custom_column (pandas.Series or None, optional) – If provided, groups by this column value. Should be same length as the DataFrame.
xlabel (str, optional, default None) – X-axis label of the plot.
ylabel (str, optional, default None) – Y-axis label of the plot.
title (str or None, optional) – Plot title, if None this function creates a suitable title.

Returns

fig – plotly graph object showing forecasting error by group. The x-axis label depends on the grouping method: groupby_time_feature for groupby_time_feature, time_col for groupby_sliding_window_size, groupby_custom_column.name for groupby_custom_column.

Return type

plotly.graph_objects.Figure

autocomplete_map_func_dict(map_func_dict)[source]

Sweeps through map_func_dict, converting values that are ElementwiseEvaluationMetricEnum member names to their corresponding row-wise evaluation function with appropriate column names for this UnivariateForecast instance.

For example:

map_func_dict = {
    "squared_error": ElementwiseEvaluationMetricEnum.SquaredError.name,
    "coverage": ElementwiseEvaluationMetricEnum.Coverage.name,
    "custom_metric": custom_function
}

is converted to

map_func_dict = {
    "squared_error": lambda row: ElementwiseEvaluationMetricEnum.SquaredError.get_metric_func()(
                                row[self.actual_col],
                                row[self.predicted_col]),
    "coverage": lambda row: ElementwiseEvaluationMetricEnum.Coverage.get_metric_func()(
                                row[self.actual_col],
                                row[self.predicted_lower_col],
                                row[self.predicted_upper_col]),
    "custom_metric": custom_function
}

Parameters: map_func_dict (dict or None) – Same as flexible_grouping_evaluation, with one exception: values may be an ElementwiseEvaluationMetricEnum member name. These are converted to a callable for flexible_grouping_evaluation.
Returns: map_func_dict – Can be passed to flexible_grouping_evaluation.
Return type: dict
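
The expansion described above can be sketched without the library. Everything here is a stand-in: ELEMENTWISE_SKETCH is a hypothetical registry in place of ElementwiseEvaluationMetricEnum, the column names are fixed strings rather than instance attributes, and the coverage function returning 1.0 or 0.0 is an assumption.

```python
def squared_error(actual, predicted):
    return (actual - predicted) ** 2


def coverage(actual, lower, upper):
    # Assumed convention: 1.0 if the actual lies within the interval.
    return 1.0 if lower <= actual <= upper else 0.0


# Hypothetical registry: member name -> (function, columns it reads).
ELEMENTWISE_SKETCH = {
    "SquaredError": (squared_error, ("actual", "forecast")),
    "Coverage": (coverage, ("actual", "forecast_lower", "forecast_upper")),
}


def autocomplete_sketch(map_func_dict):
    """Replaces known member names with row-wise callables."""
    if map_func_dict is None:
        return None
    expanded = {}
    for new_col, value in map_func_dict.items():
        if isinstance(value, str) and value in ELEMENTWISE_SKETCH:
            func, cols = ELEMENTWISE_SKETCH[value]
            # Bind func and cols as defaults so each lambda keeps its own.
            expanded[new_col] = (
                lambda row, func=func, cols=cols: func(*(row[c] for c in cols))
            )
        else:
            expanded[new_col] = value  # custom callables pass through
    return expanded


expanded = autocomplete_sketch({
    "squared_error": "SquaredError",
    "coverage": "Coverage",
    "custom_metric": lambda row: abs(row["actual"] - row["forecast"]),
})
row = {"actual": 10.0, "forecast": 8.0,
       "forecast_lower": 7.0, "forecast_upper": 9.0}
results = {name: func(row) for name, func in expanded.items()}
```

Here a plain dict stands in for a DataFrame row, since both support row["column"] lookups.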

get_flexible_grouping_evaluation(which='train', groupby_time_feature=None, groupby_sliding_window_size=None, groupby_custom_column=None, map_func_dict=None, agg_kwargs=None, extend_col_names=False)[source]

Group-wise computation of evaluation metrics. Whereas self.get_grouping_evaluation computes one metric, this allows computation of any number of custom metrics.

For example:

Mean and quantiles of squared error by group.

Mean and quantiles of residuals by group.

Mean and quantiles of actual and forecast by group.

% of actuals outside prediction intervals by group.

Any combination of the above metrics by the same group.

First adds a groupby column by passing groupby_ parameters to add_groupby_column. Then computes grouped evaluation metrics by passing map_func_dict, agg_kwargs and extend_col_names to flexible_grouping_evaluation.

Exactly one of: groupby_time_feature, groupby_sliding_window_size, groupby_custom_column must be provided.

which: str

“train” or “test”. Which dataset to evaluate.

groupby_time_feature: str or None, optional

If provided, groups by a column generated by build_time_features_df. See that function for valid values.

groupby_sliding_window_size: int or None, optional

If provided, sequentially partitions data into groups of size groupby_sliding_window_size.

groupby_custom_column: pandas.Series or None, optional

If provided, groups by this column value. Should be same length as the DataFrame.

map_func_dict: dict [str, callable] or None, default None

Row-wise transformation functions to create new columns. If None, no new columns are added.

key: new column name

value: row-wise function to apply to df to generate the column value.
Signature (row: pandas.DataFrame) -> transformed value: float.

For example:

map_func_dict = {
    "residual": lambda row: row["actual"] - row["forecast"],
    "squared_error": lambda row: (row["actual"] - row["forecast"])**2
}

Some predefined functions are available in ElementwiseEvaluationMetricEnum. For example:

map_func_dict = {
    "residual": lambda row: ElementwiseEvaluationMetricEnum.Residual.get_metric_func()(
        row["actual"],
        row["forecast"]),
    "squared_error": lambda row: ElementwiseEvaluationMetricEnum.SquaredError.get_metric_func()(
        row["actual"],
        row["forecast"]),
    "q90_loss": lambda row: ElementwiseEvaluationMetricEnum.Quantile90.get_metric_func()(
        row["actual"],
        row["forecast"]),
    "abs_percent_error": lambda row: ElementwiseEvaluationMetricEnum.AbsolutePercentError.get_metric_func()(
        row["actual"],
        row["forecast"]),
    "coverage": lambda row: ElementwiseEvaluationMetricEnum.Coverage.get_metric_func()(
        row["actual"],
        row["forecast_lower"],
        row["forecast_upper"]),
}

As shorthand, it is sufficient to provide the enum member name. These are auto-expanded into the appropriate function. So the following is equivalent:

map_func_dict = {
    "residual": ElementwiseEvaluationMetricEnum.Residual.name,
    "squared_error": ElementwiseEvaluationMetricEnum.SquaredError.name,
    "q90_loss": ElementwiseEvaluationMetricEnum.Quantile90.name,
    "abs_percent_error": ElementwiseEvaluationMetricEnum.AbsolutePercentError.name,
    "coverage": ElementwiseEvaluationMetricEnum.Coverage.name,
}

agg_kwargs: dict or None, default None

Passed as keyword args to pandas.core.groupby.DataFrameGroupBy.aggregate after creating new columns and grouping by groupby_col.

See pandas.core.groupby.DataFrameGroupBy.aggregate or flexible_grouping_evaluation for details.

extend_col_names: bool or None, default False

How to flatten index after aggregation. In some cases, the column index after aggregation is a multi-index. This parameter controls how to flatten an index with 2 levels to 1 level.

If None, the index is not flattened.

If True, column name is a composite: {index0}_{index1}. Use this option if index1 is not unique.

If False, column name is simply {index1}.

Ignored if the ColumnIndex after aggregation has only one level (e.g. if named aggregation is used in agg_kwargs).

Returns

df_transformed – df after transformation and optional aggregation.

If groupby_col is None, returns df with additional columns as the keys in map_func_dict. Otherwise, df is grouped by groupby_col and this becomes the index. Columns are determined by agg_kwargs and extend_col_names.

Return type

pandas.DataFrame

See also

add_groupby_column: called by this function
flexible_grouping_evaluation: called by this function
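
To make the roles of map_func_dict, agg_kwargs and extend_col_names more concrete, here is a sketch of the same pipeline in plain pandas. It does not call the library: the "group" column is a hand-made stand-in for the column that add_groupby_column would create, and the data is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "actual": [10.0, 12.0, 11.0, 20.0, 22.0, 21.0],
    "forecast": [9.0, 12.0, 13.0, 18.0, 25.0, 21.0],
    # Stand-in for the groupby column (e.g. a sliding window index).
    "group": [0, 0, 0, 1, 1, 1],
})

# Step 1: row-wise functions create new columns.
map_func_dict = {
    "residual": lambda row: row["actual"] - row["forecast"],
    "squared_error": lambda row: (row["actual"] - row["forecast"]) ** 2,
}
for new_col, row_func in map_func_dict.items():
    df[new_col] = df.apply(row_func, axis=1)

# Step 2: named aggregation yields a single-level column index,
# so no flattening is needed.
agg_kwargs = {
    "residual_mean": pd.NamedAgg(column="residual", aggfunc="mean"),
    "mse": pd.NamedAgg(column="squared_error", aggfunc="mean"),
}
by_group = df.groupby("group").agg(**agg_kwargs)

# A dict of lists yields a two-level column index instead. Joining the
# levels mirrors the composite {index0}_{index1} naming.
multi = df.groupby("group").agg({"residual": ["mean", "max"]})
flat_names = [f"{level0}_{level1}" for level0, level1 in multi.columns]
```

With named aggregation the output column names are chosen directly, which is often the simplest way to avoid ambiguous names after aggregation.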

plot_flexible_grouping_evaluation(which='train', groupby_time_feature=None, groupby_sliding_window_size=None, groupby_custom_column=None, map_func_dict=None, agg_kwargs=None, extend_col_names=False, y_col_style_dict='auto-fill', default_color='rgba(0, 145, 202, 1.0)', xlabel=None, ylabel=None, title=None, showlegend=True)[source]

Plots group-wise evaluation metrics. Whereas plot_grouping_evaluation shows one metric, this can show any number of custom metrics.

For example:

Mean and quantiles of squared error by group.

Mean and quantiles of residuals by group.

Mean and quantiles of actual and forecast by group.

% of actuals outside prediction intervals by group.

Any combination of the above metrics by the same group.

See get_flexible_grouping_evaluation for details.

which: str

“train” or “test”. Which dataset to evaluate.

groupby_time_feature: str or None, optional

If provided, groups by a column generated by build_time_features_df. See that function for valid values.

groupby_sliding_window_size: int or None, optional

If provided, sequentially partitions data into groups of size groupby_sliding_window_size.

groupby_custom_column: pandas.Series or None, optional

If provided, groups by this column value. Should be same length as the DataFrame.

map_func_dict: dict [str, callable] or None, default None

Grouping evaluation metric specification, along with agg_kwargs. See get_flexible_grouping_evaluation.

agg_kwargs: dict or None, default None

Grouping evaluation metric specification, along with map_func_dict. See get_flexible_grouping_evaluation.

extend_col_names: bool or None, default False

How to name the grouping metrics. See get_flexible_grouping_evaluation.

y_col_style_dict: dict [str, dict or None] or “plotly” or “auto” or “auto-fill”, default “auto-fill”

The column(s) to plot on the y-axis, and how to style them. The names should match those generated by agg_kwargs and extend_col_names. The function get_flexible_grouping_evaluation can be used to check the column names.

For convenience, start with “auto-fill” or “plotly”, then adjust styling as needed.

See plot_multivariate for details.

default_color: str, default “rgba(0, 145, 202, 1.0)” (blue)

Default line color when y_col_style_dict is one of “auto”, “auto-fill”.

xlabel: str or None, default None

x-axis label. If None, default is x_col.

ylabel: str or None, default None

y-axis label. If None, y-axis is not labeled.

title: str or None, default None

Plot title. If None and ylabel is provided, a default title is used.

showlegend: bool, default True

Whether to show the legend.

Returns

fig – Interactive plotly graph showing the evaluation metrics.

See plot_forecast_vs_actual return value for how to plot the figure and add customization.

Return type

plotly.graph_objects.Figure

See also

get_flexible_grouping_evaluation: called by this function
plot_multivariate: called by this function

make_univariate_time_series()[source]

Converts prediction into a UnivariateTimeSeries. Useful to convert a forecast into the input regressor for a subsequent forecast.

Returns: UnivariateTimeSeries

plot_components(**kwargs)[source]

Class method to plot the components of a UnivariateForecast object.

Silverkite calculates component plots based on fit dataset. Prophet calculates component plots based on predict dataset.

For estimator specific component plots with advanced plotting options call self.estimator.plot_components().

Returns: fig – Figure plotting components against appropriate time scale.
Return type: plotly.graph_objects.Figure for Silverkite, matplotlib.figure.Figure for Prophet

class greykite.algo.common.model_summary.ModelSummary(x, y, pred_cols, pred_category, fit_algorithm, ml_model, max_colwidth=20)[source]

A class to store regression model summary statistics.

The class can be printed to get a well formatted model summary.

x

The design matrix.

Type: numpy.array

beta

The estimated coefficients.

Type: numpy.array

y

The response.

Type: numpy.array

pred_cols

List of predictor names.

Type: list [ str ]

pred_category

Predictor category, returned by create_pred_category.

Type: dict

fit_algorithm

The name of algorithm to fit the regression.

Type: str

ml_model

The trained machine learning model class.

Type: class

max_colwidth

The maximum length for predictors to be shown in their original name. If the maximum length of predictors exceeds this parameter, all predictor names will be suppressed and only indices are shown.

Type: int

info_dict

The model summary dictionary, output of _get_summary

Type: dict

html_str

An html formatting of the string representation of the model summary.

Type: str

__str__()[source]: print method.

__repr__()[source]: print method

_get_summary()[source]

Gets the model summary from input. This function is called during initialization.

Returns: info_dict – Includes direct and derived metrics about the trained model. For detailed keys, refer to get_info_dict_lm or get_info_dict_tree.
Return type: dict

get_coef_summary(is_intercept=None, is_time_feature=None, is_event=None, is_trend=None, is_seasonality=None, is_lag=None, is_regressor=None, is_interaction=None, return_df=False)[source]

Gets the coefficient summary filtered by conditions.

Parameters

is_intercept (bool or None, default None) – Intercept or not.
is_time_feature (bool or None, default None) – Time features or not. Time features belong to TimeFeaturesEnum.
is_event (bool or None, default None) – Event features or not. Event features have EVENT_PREFIX.
is_trend (bool or None, default None) – Trend features or not. Trend features have CHANGEPOINT_COL_PREFIX or “cpd”.
is_seasonality (bool or None, default None) – Seasonality feature or not. Seasonality features have SEASONALITY_REGEX.
is_lag (bool or None, default None) – Lagged features or not. Lagged features have “lag”.
is_regressor (bool or None, default None) – Regressor features or not. Regressor features are extra features provided by users through extra_pred_cols in the fit function.
is_interaction (bool or None, default None) – Interaction feature or not. Interaction features have “:”.
return_df (bool, default False) –

If True, the filtered coefficient summary df is also returned.
Otherwise, the filtered coefficient summary df is printed only.

Returns

filtered_coef_summary – If return_df is set to True, returns the coefficient summary df filtered by the given conditions.

Return type

pandas.DataFrame or None

Constants

class greykite.common.aggregation_function_enum.AggregationFunctionEnum(value)[source]

Defines some common aggregation functions that can be retrieved by their names.

Every function is wrapped with partial because Enum handles functions differently from values. Wrapping with partial allows us to extract the function with variable keys.

class greykite.common.evaluation.EvaluationMetricEnum(value)[source]

Valid evaluation metrics. Each value is a tuple (score_func: callable, greater_is_better: boolean, short_name: str).

add_finite_filter_to_scorer is added to the metrics that are directly imported from sklearn.metrics (e.g. mean_squared_error) to ensure that the metric gets calculated even when inputs have missing values.
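
The two ideas above, a tuple-valued enum and a scorer wrapped with a finite-value filter, can be sketched as follows. This is not the library's code: add_finite_filter and MetricEnumSketch are hypothetical stand-ins, and dropping every pair in which either value is non-finite is an assumption about what the filter does.

```python
import math
from enum import Enum
from statistics import mean


def add_finite_filter(score_func):
    """Wraps score_func so pairs with a non-finite value are dropped first."""
    def score_func_finite(y_true, y_pred):
        pairs = [(t, p) for t, p in zip(y_true, y_pred)
                 if math.isfinite(t) and math.isfinite(p)]
        return score_func([t for t, _ in pairs], [p for _, p in pairs])
    return score_func_finite


def mean_absolute_error(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


class MetricEnumSketch(Enum):
    """Each value is (score_func, greater_is_better, short_name)."""
    MeanAbsoluteError = (add_finite_filter(mean_absolute_error), False, "MAE")

    def get_metric_func(self):
        return self.value[0]

    def get_metric_greater_is_better(self):
        return self.value[1]

    def get_metric_name(self):
        return self.value[2]


metric = MetricEnumSketch.MeanAbsoluteError
# The NaN pair is ignored; the score uses the two remaining pairs.
score = metric.get_metric_func()([1.0, float("nan"), 3.0], [2.0, 5.0, 5.0])
```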

Correlation = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, True, 'CORR'): Pearson correlation coefficient between forecast and actuals. Higher is better.

CoefficientOfDetermination = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, True, 'R2'): Coefficient of determination. See sklearn.metrics.r2_score. Higher is better. Equals 1.0 - mean_squared_error / variance(actuals).

MeanSquaredError = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, False, 'MSE'): Mean squared error, the average of squared differences, see sklearn.metrics.mean_squared_error.

RootMeanSquaredError = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, False, 'RMSE'): Root mean squared error, the square root of sklearn.metrics.mean_squared_error

MeanAbsoluteError = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, False, 'MAE'): Mean absolute error, average of absolute differences, see sklearn.metrics.mean_absolute_error.

MedianAbsoluteError = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, False, 'MedAE'): Median absolute error, median of absolute differences, see sklearn.metrics.median_absolute_error.

MeanAbsolutePercentError = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, False, 'MAPE'): Mean absolute percent error, error relative to actuals expressed as a %, see wikipedia MAPE.

MedianAbsolutePercentError = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, False, 'MedAPE'): Median absolute percent error, median of error relative to actuals expressed as a %, a median version of the MeanAbsolutePercentError, less affected by extreme values.

SymmetricMeanAbsolutePercentError = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, False, 'sMAPE'): Symmetric mean absolute percent error, error relative to (actuals+forecast) expressed as a %. Note that we do not include a factor of 2 in the denominator, so the range is 0% to 100%, see wikipedia sMAPE.

Quantile80 = (<function quantile_loss_q.<locals>.quantile_loss_wrapper>, False, 'Q80')

Quantile loss with q=0.80:

np.where(y_true < y_pred, (1 - q) * (y_pred - y_true), q * (y_true - y_pred)).mean()

Quantile95 = (<function quantile_loss_q.<locals>.quantile_loss_wrapper>, False, 'Q95')

Quantile loss with q=0.95:

np.where(y_true < y_pred, (1 - q) * (y_pred - y_true), q * (y_true - y_pred)).mean()

Quantile99 = (<function quantile_loss_q.<locals>.quantile_loss_wrapper>, False, 'Q99')

Quantile loss with q=0.99:

np.where(y_true < y_pred, (1 - q) * (y_pred - y_true), q * (y_true - y_pred)).mean()

FractionOutsideTolerance1 = (functools.partial(<function fraction_outside_tolerance>, rtol=0.01), False, 'OutsideTolerance1p'): Fraction of forecasted values that deviate more than 1% from the actual

FractionOutsideTolerance2 = (functools.partial(<function fraction_outside_tolerance>, rtol=0.02), False, 'OutsideTolerance2p'): Fraction of forecasted values that deviate more than 2% from the actual

FractionOutsideTolerance3 = (functools.partial(<function fraction_outside_tolerance>, rtol=0.03), False, 'OutsideTolerance3p'): Fraction of forecasted values that deviate more than 3% from the actual

FractionOutsideTolerance4 = (functools.partial(<function fraction_outside_tolerance>, rtol=0.04), False, 'OutsideTolerance4p'): Fraction of forecasted values that deviate more than 4% from the actual

FractionOutsideTolerance5 = (functools.partial(<function fraction_outside_tolerance>, rtol=0.05), False, 'OutsideTolerance5p'): Fraction of forecasted values that deviate more than 5% from the actual

get_metric_func()[source]: Returns the metric function.

get_metric_greater_is_better()[source]: Returns the greater_is_better boolean.

get_metric_name()[source]: Returns the short name.

Constants used by code in common or in multiple places: algo, sklearn, and/or framework.

greykite.common.constants.TIME_COL = 'ts': The default name for the column with the timestamps of the time series.

greykite.common.constants.VALUE_COL = 'y': The default name for the column with the values of the time series.

greykite.common.constants.ACTUAL_COL = 'actual': The column name representing actual (observed) values.

greykite.common.constants.PREDICTED_COL = 'forecast': The column name representing the predicted values.

greykite.common.constants.RESIDUAL_COL = 'residual': The column name representing the forecast residuals.

greykite.common.constants.PREDICTED_LOWER_COL = 'forecast_lower': The column name representing lower bounds of prediction interval.

greykite.common.constants.PREDICTED_UPPER_COL = 'forecast_upper': The column name representing upper bounds of prediction interval.

greykite.common.constants.NULL_PREDICTED_COL = 'forecast_null': The column name representing predicted values from null model.

greykite.common.constants.ERR_STD_COL = 'err_std': The column name representing the error standard deviation from models.

greykite.common.constants.QUANTILE_SUMMARY_COL = 'quantile_summary': The column name representing the quantile summary from models.

greykite.common.constants.R2_null_model_score = 'R2_null_model_score': Evaluation metric. Improvement in the specified loss function compared to the predictions of a null model.

greykite.common.constants.FRACTION_OUTSIDE_TOLERANCE = 'Outside Tolerance (fraction)': Evaluation metric. The fraction of predictions outside the specified tolerance level.

greykite.common.constants.PREDICTION_BAND_WIDTH = 'Prediction Band Width (%)': Evaluation metric. Relative size of prediction bands vs actual, as a percent.

greykite.common.constants.PREDICTION_BAND_COVERAGE = 'Prediction Band Coverage (fraction)': Evaluation metric. Fraction of observations within the bands.

greykite.common.constants.LOWER_BAND_COVERAGE = 'Coverage: Lower Band': Evaluation metric. Fraction of observations within the lower band.

greykite.common.constants.UPPER_BAND_COVERAGE = 'Coverage: Upper Band': Evaluation metric. Fraction of observations within the upper band.

greykite.common.constants.COVERAGE_VS_INTENDED_DIFF = 'Coverage Diff: Actual_Coverage - Intended_Coverage': Evaluation metric. Difference between actual and intended coverage.

greykite.common.constants.EVENT_DF_DATE_COL = 'date': Name of date column for the DataFrames passed to silverkite custom_daily_event_df_dict.

greykite.common.constants.EVENT_DF_LABEL_COL = 'event_name': Name of event column for the DataFrames passed to silverkite custom_daily_event_df_dict.

greykite.common.constants.EVENT_PREFIX = 'events': Prefix for naming event features.

greykite.common.constants.EVENT_DEFAULT = '': Label used for days without an event.

greykite.common.constants.EVENT_INDICATOR = 'event': Binary indicator for an event.

greykite.common.constants.IS_EVENT_COL = 'is_event': Indicator column in feature matrix, 1 if the day is an event or its neighboring days.

greykite.common.constants.IS_EVENT_ADJACENT_COL = 'is_event_adjacent': Indicator column in feature matrix, 1 if the day is adjacent to an event.

greykite.common.constants.IS_EVENT_EXACT_COL = 'is_event_exact': Indicator column in feature matrix, 1 if the day is an event but not its neighboring days.

greykite.common.constants.EVENT_SHIFTED_SUFFIX_BEFORE = '_before': The suffix for neighboring events before the events added to the event names.

greykite.common.constants.EVENT_SHIFTED_SUFFIX_AFTER = '_after': The suffix for neighboring events after the events added to the event names.

greykite.common.constants.CHANGEPOINT_COL_PREFIX = 'changepoint': Prefix for naming changepoint features.

greykite.common.constants.CHANGEPOINT_COL_PREFIX_SHORT = 'cp': Short prefix for naming changepoint features.

greykite.common.constants.LEVELSHIFT_COL_PREFIX_SHORT = 'ctp': Short prefix for naming levelshift features.

greykite.common.constants.START_TIME_COL = 'start_time': Default column name for anomaly start time in the anomaly dataframe.

greykite.common.constants.END_TIME_COL = 'end_time': Default column name for anomaly end time in the anomaly dataframe.

greykite.common.constants.ADJUSTMENT_DELTA_COL = 'adjustment_delta': Default column name for anomaly adjustment in the anomaly dataframe.

greykite.common.constants.METRIC_COL = 'metric': Column to denote metric of interest.

greykite.common.constants.DIMENSION_COL = 'dimension': Dimension column.

greykite.common.constants.ANOMALY_COL = 'is_anomaly': Default column name for anomaly labels (boolean) in the time series.

greykite.common.constants.PREDICTED_ANOMALY_COL = 'is_anomaly_predicted': Default column name for predicted anomaly labels (boolean) in the time series.

greykite.common.constants.SEVERITY_SCORE_COL = 'severity_score': Default column name for anomaly severity score in the anomaly dataframe.

greykite.common.constants.USER_REVIEWED_COL = 'is_user_reviewed': Default column name for whether an anomaly is reviewed by the user (boolean) in the anomaly dataframe.

greykite.common.constants.NEW_PATTERN_ANOMALY_COL = 'new_pattern_anomaly': Default column name for whether an anomaly is a new pattern (boolean) in the anomaly dataframe.

class greykite.common.constants.TimeFeaturesEnum(value)[source]

Time features generated by build_time_features_df.

The item names are lower-case (the same as the values) to make existence checks easier. To check if a string s is in this Enum, use s in TimeFeaturesEnum.__dict__["_member_names_"]. A direct check s in TimeFeaturesEnum is deprecated in Python 3.8.
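
The membership check can be tried on a small stand-in enum. TimeFeaturesSketch and its three members are hypothetical; only the lookup pattern is taken from the text above. The standard __members__ mapping is shown as an alternative that gives the same answer when names equal values.

```python
from enum import Enum


class TimeFeaturesSketch(Enum):
    """Hypothetical stand-in: member names equal their values."""
    dow = "dow"
    hour = "hour"
    month = "month"


def is_time_feature(s):
    # Pattern from the documentation: look the string up among member names.
    return s in TimeFeaturesSketch.__dict__["_member_names_"]


def is_time_feature_alt(s):
    # Alternative using the public __members__ mapping of the enum class.
    return s in TimeFeaturesSketch.__members__


found = is_time_feature("dow")
missing = is_time_feature("not_a_feature")
```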

class greykite.common.constants.GrowthColEnum(value)[source]

Human-readable names for the growth columns generated by build_time_features_df.

The names are the human-readable names, and the values are the corresponding column names generated by build_time_features_df.

greykite.common.constants.LAG_INFIX = '_lag': Infix for lagged feature names.

greykite.common.constants.AGG_LAG_INFIX = 'avglag': Infix for aggregated lag feature names.

greykite.common.constants.TREND_REGEX = 'changepoint\\d|ct\\d|ct_|cp\\d': Growth terms, including changepoints.

greykite.common.constants.SEASONALITY_REGEX = 'sin\\d|cos\\d': Seasonality terms modeled by fourier series.

greykite.common.constants.EVENT_REGEX = 'events_': Event terms.

greykite.common.constants.LAG_REGEX = '_lag\\d|_avglag_\\d': Lag terms.
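
As a quick way to see what these patterns match, the sketch below applies them with re.search to a few predictor names. The regex values are copied from the constants above; the predictor names and the helper matching_term_types are made up for illustration.

```python
import re

# Values copied from the constants above (raw strings instead of escapes).
TREND_REGEX = r"changepoint\d|ct\d|ct_|cp\d"
SEASONALITY_REGEX = r"sin\d|cos\d"
EVENT_REGEX = r"events_"
LAG_REGEX = r"_lag\d|_avglag_\d"

PATTERNS = (
    ("trend", TREND_REGEX),
    ("seasonality", SEASONALITY_REGEX),
    ("event", EVENT_REGEX),
    ("lag", LAG_REGEX),
)


def matching_term_types(pred_name):
    """Returns the labels of all patterns found anywhere in pred_name."""
    return [label for label, pattern in PATTERNS
            if re.search(pattern, pred_name)]


# Hypothetical predictor names.
examples = {
    name: matching_term_types(name)
    for name in ("ct1", "sin1_ct1_yearly", "events_Christmas Day", "y_lag7")
}
```

A name such as "sin1_ct1_yearly" matches both TREND_REGEX and SEASONALITY_REGEX, so code that separates trend terms from seasonality terms has to check for both patterns rather than for TREND_REGEX alone.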

greykite.common.constants.LOGGER_NAME = 'Greykite': Name used by the logger.

Constants used by greykite.framework.

greykite.framework.constants.EVALUATION_PERIOD_CV_MAX_SPLITS = 3: Default value for EvaluationPeriodParam().cv_max_splits

greykite.framework.constants.COMPUTATION_N_JOBS = 1: Default value for ComputationParam.n_jobs

greykite.framework.constants.COMPUTATION_VERBOSE = 1: Default value for ComputationParam.verbose

greykite.framework.constants.CV_REPORT_METRICS_ALL = 'ALL': Set cv_report_metrics to this value to compute all metrics during CV

greykite.framework.constants.FRACTION_OUTSIDE_TOLERANCE_NAME = 'OutsideTolerance': Short name used to report the result of FRACTION_OUTSIDE_TOLERANCE in CV

greykite.framework.constants.CUSTOM_SCORE_FUNC_NAME = 'score': Short name used to report the result of custom score_func in CV

greykite.framework.constants.MEAN_COL_GROUP = 'mean': Columns with mean.

greykite.framework.constants.QUANTILE_COL_GROUP = 'quantile': Columns with quantile.

greykite.framework.constants.OVERLAY_COL_GROUP = 'overlay': Columns with overlay.

greykite.framework.constants.FORECAST_STEP_COL = 'forecast_step': The column name for forecast step in benchmarking

class greykite.algo.forecast.silverkite.constants.silverkite_constant.SilverkiteConstant[source]

Uses the appropriate constant mixins to provide all the constants that will be used by Silverkite.

get_silverkite_column() → Type[SilverkiteColumn]: Return the SilverkiteColumn constants

get_silverkite_components_enum() → Type[SilverkiteComponentsEnum]: Return the SilverkiteComponentsEnum constants

get_silverkite_holiday() → Type[SilverkiteHoliday]: Return the SilverkiteHoliday constants

get_silverkite_seasonality_enum() → Type[SilverkiteSeasonalityEnum]: Return the SilverkiteSeasonalityEnum constants

get_silverkite_time_frequency_enum() → Type[SilverkiteTimeFrequencyEnum]: Return the SilverkiteTimeFrequencyEnum constants

class greykite.algo.forecast.silverkite.constants.silverkite_column.SilverkiteColumn[source]

Silverkite feature sets for sub-daily data.

COLS_HOUR_OF_WEEK: str = 'hour_of_week': Silverkite feature_sets_enabled key. constant hour of week effect

COLS_WEEKEND_SEAS: str = 'is_weekend:daily_seas': Silverkite feature_sets_enabled key. daily seasonality interaction with is_weekend

COLS_DAY_OF_WEEK_SEAS: str = 'day_of_week:daily_seas': Silverkite feature_sets_enabled key. daily seasonality interaction with day of week

COLS_TREND_DAILY_SEAS: str = 'trend:is_weekend:daily_seas': Silverkite feature_sets_enabled key. allow daily seasonality to change over time, depending on is_weekend

COLS_EVENT_SEAS: str = 'event:daily_seas': Silverkite feature_sets_enabled key. allow sub-daily event effects

COLS_EVENT_WEEKEND_SEAS: str = 'event:is_weekend:daily_seas': Silverkite feature_sets_enabled key. allow sub-daily event effect to interact with is_weekend

COLS_DAY_OF_WEEK: str = 'day_of_week': Silverkite feature_sets_enabled key. constant day of week effect

COLS_TREND_WEEKEND: str = 'trend:is_weekend': Silverkite feature_sets_enabled key. allow trend (growth, changepoints) to interact with is_weekend

COLS_TREND_DAY_OF_WEEK: str = 'trend:day_of_week': Silverkite feature_sets_enabled key. allow trend to interact with day of week

COLS_TREND_WEEKLY_SEAS: str = 'trend:weekly_seas': Silverkite feature_sets_enabled key. allow weekly seasonality to change over time

class greykite.algo.forecast.silverkite.constants.silverkite_component.SilverkiteComponentsEnum(value)[source]: Defines groupby time feature, xlabel and ylabel for Silverkite Component Plots.

class greykite.algo.forecast.silverkite.constants.silverkite_holiday.SilverkiteHoliday[source]

Holiday constants to be used by Silverkite

HOLIDAY_LOOKUP_COUNTRIES_AUTO = ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China'): Auto setting for the countries that contain the holidays to include in the model

HOLIDAYS_TO_MODEL_SEPARATELY_AUTO = ("New Year's Day", 'Chinese New Year', 'Christmas Day', 'Independence Day', 'Thanksgiving', 'Labor Day', 'Good Friday', 'Easter Monday [England, Wales, Northern Ireland]', 'Memorial Day', 'Veterans Day'): Auto setting for the holidays to include in the model

ALL_HOLIDAYS_IN_COUNTRIES = 'ALL_HOLIDAYS_IN_COUNTRIES': Value for holidays_to_model_separately to request all holidays in the lookup countries

HOLIDAYS_TO_INTERACT = ('Christmas Day', 'Christmas Day_minus_1', 'Christmas Day_minus_2', 'Christmas Day_plus_1', 'Christmas Day_plus_2', 'New Years Day', 'New Years Day_minus_1', 'New Years Day_minus_2', 'New Years Day_plus_1', 'New Years Day_plus_2', 'Thanksgiving', 'Thanksgiving_plus_1', 'Independence Day'): Significant holidays that may have a different daily seasonality pattern

class greykite.algo.forecast.silverkite.constants.silverkite_seasonality.SilverkiteSeasonalityEnum(value)[source]

Defines default seasonalities for Silverkite estimator. Names should match those in SeasonalityEnum. The default order for various seasonalities is stored in this enum.

DAILY_SEASONALITY: SilverkiteSeasonality = SilverkiteSeasonality(name='tod', period=24.0, order=12, seas_names='daily', default_min_days=2): tod is 0-24 time of day (tod granularity based on input data, up to second level). Requires at least two full cycles to add the seasonal term (default_min_days=2).

WEEKLY_SEASONALITY: SilverkiteSeasonality = SilverkiteSeasonality(name='tow', period=7.0, order=4, seas_names='weekly', default_min_days=14): tow is 0-7 time of week (tow granularity based on input data, up to second level). order=4 for full flexibility to model daily input.

MONTHLY_SEASONALITY: SilverkiteSeasonality = SilverkiteSeasonality(name='tom', period=1.0, order=2, seas_names='monthly', default_min_days=60): tom is 0-1 time of month (tom granularity based on input data, up to daily level).

QUARTERLY_SEASONALITY: SilverkiteSeasonality = SilverkiteSeasonality(name='toq', period=1.0, order=5, seas_names='quarterly', default_min_days=180): toq (continuous time of quarter) with natural period. Each day is mapped to a value in [0.0, 1.0) based on its position in the calendar quarter: (Jan1-Mar31, Apr1-Jun30, Jul1-Sep30, Oct1-Dec31). The start of each quarter is 0.0.

YEARLY_SEASONALITY: SilverkiteSeasonality = SilverkiteSeasonality(name='ct1', period=1.0, order=15, seas_names='yearly', default_min_days=548): ct1 (continuous year) with natural period.
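
The period and order fields can be read as the parameters of a Fourier series: order k contributes one sine and one cosine term with frequency k / period. The sketch below computes such terms for a single observation. It is a generic Fourier expansion, not the library's feature builder, and the column naming pattern is an assumption chosen to match SEASONALITY_REGEX.

```python
import math


def fourier_terms(x, period, order, time_col, seas_name):
    """Returns {column_name: value} for one observation at position x."""
    terms = {}
    for k in range(1, order + 1):
        angle = 2.0 * math.pi * k * x / period
        # Naming pattern is assumed for illustration.
        terms[f"sin{k}_{time_col}_{seas_name}"] = math.sin(angle)
        terms[f"cos{k}_{time_col}_{seas_name}"] = math.cos(angle)
    return terms


# Weekly seasonality: tow in [0, 7), period=7.0, order=4 -> 8 columns.
weekly = fourier_terms(x=3.5, period=7.0, order=4,
                       time_col="tow", seas_name="weekly")
```

A higher order adds higher-frequency terms, which lets the model fit sharper seasonal shapes at the cost of more coefficients.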

class greykite.algo.forecast.silverkite.constants.silverkite_time_frequency.SilverkiteTimeFrequencyEnum(value)[source]: Provides properties for modeling for various time frequencies in Silverkite. The enum names are the time frequencies, corresponding to the simple time frequencies in SimpleTimeFrequencyEnum.

Provides templates for SimpleSilverkiteEstimator that are pre-tuned to fit specific use cases.

A subset of these templates are recognized by ModelTemplateEnum.

simple_silverkite_template also accepts any model_template name that follows the naming convention in this file. For details, see the model_template parameter in SimpleSilverkiteTemplate.

class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_FREQ(value)[source]: Valid values for simple silverkite template string name frequency.

greykite.framework.templates.simple_silverkite_template_config.VALID_FREQ = ['HOURLY', 'DAILY', 'WEEKLY']: Valid non-default values for simple silverkite template string name frequency. These are the non-default frequencies recognized by SimpleSilverkiteTemplateOptions.

class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_SEAS(value)[source]: Valid values for simple silverkite template string name seasonality.

class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_GR(value)[source]: Valid values for simple silverkite template string name growth_term.

class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_CP(value)[source]: Valid values for simple silverkite template string name changepoints_dict.

class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_HOL(value)[source]: Valid values for simple silverkite template string name events.

class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_FEASET(value)[source]: Valid values for simple silverkite template string name feature_sets_enabled.

class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_ALGO(value)[source]: Valid values for simple silverkite template string name fit_algorithm.

class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_AR(value)[source]: Valid values for simple silverkite template string name autoregression.

class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_DSI(value)[source]: Valid values for simple silverkite template string name daily seasonality max interaction order.

class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_WSI(value)[source]: Valid values for simple silverkite template string name weekly seasonality max interaction order.

class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_COMPONENT_KEYWORDS(value)[source]: Valid values for simple silverkite template string name keywords. The names are the keywords and the values are the corresponding value enum. Can be used to create an instance of SimpleSilverkiteTemplateOptions.

class greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions(freq: SILVERKITE_FREQ = SILVERKITE_FREQ.DAILY, seas: SILVERKITE_SEAS = SILVERKITE_SEAS.LT, gr: SILVERKITE_GR = SILVERKITE_GR.LINEAR, cp: SILVERKITE_CP = SILVERKITE_CP.NONE, hol: SILVERKITE_HOL = SILVERKITE_HOL.NONE, feaset: SILVERKITE_FEASET = SILVERKITE_FEASET.OFF, algo: SILVERKITE_ALGO = SILVERKITE_ALGO.LINEAR, ar: SILVERKITE_AR = SILVERKITE_AR.OFF, dsi: SILVERKITE_DSI = SILVERKITE_DSI.AUTO, wsi: SILVERKITE_WSI = SILVERKITE_WSI.AUTO)[source]

Defines generic simple silverkite template options.

Attributes can be set to different values using SILVERKITE_COMPONENT_KEYWORDS for high level tuning.

freq represents data frequency.

The other attributes stand for seasonality, growth, changepoints_dict, events, feature_sets_enabled, fit_algorithm, autoregression, max_daily_seas_interaction_order and max_weekly_seas_interaction_order in ModelComponentsParam, which are used in SimpleSilverkiteTemplate.

freq: SILVERKITE_FREQ = 'DAILY': Valid values for simple silverkite template string name frequency. See SILVERKITE_FREQ.

seas: SILVERKITE_SEAS = 'LT': Valid values for simple silverkite template string name seasonality. See SILVERKITE_SEAS.

gr: SILVERKITE_GR = 'LINEAR': Valid values for simple silverkite template string name growth. See SILVERKITE_GR.

cp: SILVERKITE_CP = 'NONE': Valid values for simple silverkite template string name changepoints. See SILVERKITE_CP.

hol: SILVERKITE_HOL = 'NONE': Valid values for simple silverkite template string name holiday. See SILVERKITE_HOL.

feaset: SILVERKITE_FEASET = 'OFF': Valid values for simple silverkite template string name feature sets enabled. See SILVERKITE_FEASET.

algo: SILVERKITE_ALGO = 'LINEAR': Valid values for simple silverkite template string name fit algorithm. See SILVERKITE_ALGO.

ar: SILVERKITE_AR = 'OFF': Valid values for simple silverkite template string name autoregression. See SILVERKITE_AR.

dsi: SILVERKITE_DSI = 'AUTO': Valid values for simple silverkite template string name max daily seasonality interaction order. See SILVERKITE_DSI.

wsi: SILVERKITE_WSI = 'AUTO': Valid values for simple silverkite template string name max weekly seasonality interaction order. See SILVERKITE_WSI.
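The sketch below shows how these options could map onto a template name string. It is a simplified stand-in, not the library class: TemplateOptionsSketch and to_name are hypothetical names, and plain strings replace the enum members. The keyword order is taken from the template names that appear elsewhere in this reference.

```python
from dataclasses import dataclass, fields


@dataclass
class TemplateOptionsSketch:
    """Hypothetical stand-in for SimpleSilverkiteTemplateOptions, using plain strings."""
    freq: str = "DAILY"
    seas: str = "LT"
    gr: str = "LINEAR"
    cp: str = "NONE"
    hol: str = "NONE"
    feaset: str = "OFF"
    algo: str = "LINEAR"
    ar: str = "OFF"
    dsi: str = "AUTO"
    wsi: str = "AUTO"

    def to_name(self) -> str:
        # The frequency comes first; every other option is written as KEYWORD_VALUE.
        parts = [self.freq]
        for f in fields(self):
            if f.name != "freq":
                parts += [f.name.upper(), getattr(self, f.name)]
        return "_".join(parts)


# Defaults as listed in the class signature.
print(TemplateOptionsSketch().to_name())
# Turning everything off reproduces the string listed for SILVERKITE_EMPTY.
print(TemplateOptionsSketch(seas="NONE", gr="NONE", dsi="OFF", wsi="OFF").to_name())
```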

greykite.framework.templates.simple_silverkite_template_config.COMMON_MODELCOMPONENTPARAM_PARAMETERS = {'ALGO': {'LASSO': {'fit_algorithm': 'lasso', 'fit_algorithm_params': None}, 'LINEAR': {'fit_algorithm': 'linear', 'fit_algorithm_params': None}, 'RIDGE': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'SGD': {'fit_algorithm': 'sgd', 'fit_algorithm_params': None}}, 'AR': {'AUTO': {'autoreg_dict': 'auto', 'fast_simulation': False, 'simulation_num': 10}, 'OFF': {'autoreg_dict': None, 'fast_simulation': False, 'simulation_num': 10}}, 'CP': {'DAILY': {'HV': {'method': 'auto', 'no_changepoint_distance_from_end': '180D', 'potential_changepoint_distance': '15D', 'regularization_strength': 0.3, 'resample_freq': '7D', 'yearly_seasonality_change_freq': '365D', 'yearly_seasonality_order': 15}, 'LT': {'method': 'auto', 'no_changepoint_distance_from_end': '90D', 'potential_changepoint_distance': '15D', 'regularization_strength': 0.6, 'resample_freq': '7D', 'yearly_seasonality_change_freq': None, 'yearly_seasonality_order': 15}, 'NM': {'method': 'auto', 'no_changepoint_distance_from_end': '180D', 'potential_changepoint_distance': '15D', 'regularization_strength': 0.5, 'resample_freq': '7D', 'yearly_seasonality_change_freq': '365D', 'yearly_seasonality_order': 15}, 'NONE': None}, 'HOURLY': {'HV': {'method': 'auto', 'no_changepoint_distance_from_end': '30D', 'potential_changepoint_distance': '15D', 'regularization_strength': 0.3, 'resample_freq': 'D', 'yearly_seasonality_change_freq': '365D', 'yearly_seasonality_order': 15}, 'LT': {'method': 'auto', 'no_changepoint_distance_from_end': '30D', 'potential_changepoint_distance': '7D', 'regularization_strength': 0.6, 'resample_freq': 'D', 'yearly_seasonality_change_freq': None, 'yearly_seasonality_order': 15}, 'NM': {'method': 'auto', 'no_changepoint_distance_from_end': '30D', 'potential_changepoint_distance': '15D', 'regularization_strength': 0.5, 'resample_freq': 'D', 'yearly_seasonality_change_freq': '365D', 
'yearly_seasonality_order': 15}, 'NONE': None}, 'WEEKLY': {'HV': {'method': 'auto', 'no_changepoint_distance_from_end': '180D', 'potential_changepoint_distance': '14D', 'regularization_strength': 0.3, 'resample_freq': '7D', 'yearly_seasonality_change_freq': '365D', 'yearly_seasonality_order': 15}, 'LT': {'method': 'auto', 'no_changepoint_distance_from_end': '180D', 'potential_changepoint_distance': '14D', 'regularization_strength': 0.6, 'resample_freq': '7D', 'yearly_seasonality_change_freq': None, 'yearly_seasonality_order': 15}, 'NM': {'method': 'auto', 'no_changepoint_distance_from_end': '180D', 'potential_changepoint_distance': '14D', 'regularization_strength': 0.5, 'resample_freq': '7D', 'yearly_seasonality_change_freq': '365D', 'yearly_seasonality_order': 15}, 'NONE': None}}, 'DSI': {'DAILY': {'AUTO': 0, 'OFF': 0}, 'HOURLY': {'AUTO': 5, 'OFF': 0}, 'WEEKLY': {'AUTO': 0, 'OFF': 0}}, 'FEASET': {'AUTO': 'auto', 'OFF': False, 'ON': True}, 'GR': {'LINEAR': {'growth_term': 'linear'}, 'NONE': {'growth_term': None}}, 'HOL': {'NONE': {'auto_holiday': False, 'auto_holiday_params': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None, 'holiday_lookup_countries': [], 'holiday_post_num_days': 0, 'holiday_pre_num_days': 0, 'holiday_pre_post_num_dict': None, 'holidays_to_model_separately': []}, 'SP1': {'auto_holiday': False, 'auto_holiday_params': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None, 'holiday_lookup_countries': 'auto', 'holiday_post_num_days': 1, 'holiday_pre_num_days': 1, 'holiday_pre_post_num_dict': None, 'holidays_to_model_separately': 'auto'}, 'SP2': {'auto_holiday': False, 'auto_holiday_params': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None, 'holiday_lookup_countries': 'auto', 'holiday_post_num_days': 2, 'holiday_pre_num_days': 2, 'holiday_pre_post_num_dict': None, 
'holidays_to_model_separately': 'auto'}, 'SP4': {'auto_holiday': False, 'auto_holiday_params': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None, 'holiday_lookup_countries': 'auto', 'holiday_post_num_days': 4, 'holiday_pre_num_days': 4, 'holiday_pre_post_num_dict': None, 'holidays_to_model_separately': 'auto'}, 'TG': {'auto_holiday': False, 'auto_holiday_params': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None, 'holiday_lookup_countries': 'auto', 'holiday_post_num_days': 3, 'holiday_pre_num_days': 3, 'holiday_pre_post_num_dict': None, 'holidays_to_model_separately': []}}, 'SEAS': {'DAILY': {'HV': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 4, 'yearly_seasonality': 25}, 'HVQM': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 4, 'quarterly_seasonality': 6, 'weekly_seasonality': 4, 'yearly_seasonality': 25}, 'LT': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 3, 'yearly_seasonality': 8}, 'LTQM': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 2, 'quarterly_seasonality': 3, 'weekly_seasonality': 3, 'yearly_seasonality': 8}, 'NM': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 3, 'yearly_seasonality': 15}, 'NMQM': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 4, 'quarterly_seasonality': 4, 'weekly_seasonality': 3, 'yearly_seasonality': 15}, 'NONE': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 0, 'yearly_seasonality': 0}}, 'HOURLY': {'HV': {'auto_seasonality': False, 'daily_seasonality': 12, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 
'weekly_seasonality': 6, 'yearly_seasonality': 25}, 'HVQM': {'auto_seasonality': False, 'daily_seasonality': 12, 'monthly_seasonality': 4, 'quarterly_seasonality': 4, 'weekly_seasonality': 6, 'yearly_seasonality': 25}, 'LT': {'auto_seasonality': False, 'daily_seasonality': 5, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 3, 'yearly_seasonality': 8}, 'LTQM': {'auto_seasonality': False, 'daily_seasonality': 5, 'monthly_seasonality': 2, 'quarterly_seasonality': 2, 'weekly_seasonality': 3, 'yearly_seasonality': 8}, 'NM': {'auto_seasonality': False, 'daily_seasonality': 8, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 4, 'yearly_seasonality': 15}, 'NMQM': {'auto_seasonality': False, 'daily_seasonality': 8, 'monthly_seasonality': 3, 'quarterly_seasonality': 3, 'weekly_seasonality': 4, 'yearly_seasonality': 15}, 'NONE': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 0, 'yearly_seasonality': 0}}, 'WEEKLY': {'HV': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 0, 'yearly_seasonality': 25}, 'HVQM': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 4, 'quarterly_seasonality': 4, 'weekly_seasonality': 0, 'yearly_seasonality': 25}, 'LT': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 0, 'yearly_seasonality': 8}, 'LTQM': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 2, 'quarterly_seasonality': 2, 'weekly_seasonality': 0, 'yearly_seasonality': 8}, 'NM': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 0, 'yearly_seasonality': 15}, 'NMQM': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 3, 'quarterly_seasonality': 3, 
'weekly_seasonality': 0, 'yearly_seasonality': 15}, 'NONE': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 0, 'yearly_seasonality': 0}}}, 'WSI': {'DAILY': {'AUTO': 2, 'OFF': 0}, 'HOURLY': {'AUTO': 2, 'OFF': 0}, 'WEEKLY': {'AUTO': 0, 'OFF': 0}}}: Defines the default component values for SimpleSilverkiteTemplate. The components include seasonality, growth, holiday, trend changepoints, feature sets, autoregression, fit algorithm, etc. These are used when config.model_template provides the SimpleSilverkiteTemplateOptions.
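To make the nesting of this dictionary easier to follow, here is a minimal lookup sketch over a small excerpt of the values listed above. The helper lookup and the set FREQ_DEPENDENT are hypothetical and only mirror the structure visible in the listing: some keywords are nested under a frequency key and others are not. How the library itself resolves these values may differ.

```python
# Excerpt of the values listed above; the full dictionary has more keys.
PARAMS_EXCERPT = {
    "GR": {"LINEAR": {"growth_term": "linear"}, "NONE": {"growth_term": None}},
    "FEASET": {"AUTO": "auto", "OFF": False, "ON": True},
    "DSI": {
        "DAILY": {"AUTO": 0, "OFF": 0},
        "HOURLY": {"AUTO": 5, "OFF": 0},
        "WEEKLY": {"AUTO": 0, "OFF": 0},
    },
    "WSI": {
        "DAILY": {"AUTO": 2, "OFF": 0},
        "HOURLY": {"AUTO": 2, "OFF": 0},
        "WEEKLY": {"AUTO": 0, "OFF": 0},
    },
}

# Keywords whose values are nested under a frequency key in the listing above.
FREQ_DEPENDENT = {"SEAS", "CP", "DSI", "WSI"}


def lookup(keyword: str, value: str, freq: str = "DAILY"):
    """Return the component value for one keyword/value pair (hypothetical helper)."""
    table = PARAMS_EXCERPT[keyword]
    if keyword in FREQ_DEPENDENT:
        table = table[freq]
    return table[value]


print(lookup("DSI", "AUTO", freq="HOURLY"))  # 5
print(lookup("DSI", "AUTO", freq="DAILY"))   # 0
print(lookup("GR", "LINEAR"))                # {'growth_term': 'linear'}
```

The same option value can therefore resolve differently by frequency: DSI AUTO enables daily seasonality interactions up to order 5 for hourly data and disables them for daily and weekly data.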

greykite.framework.templates.simple_silverkite_template_config.SILVERKITE = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'yearly_seasonality_order': 15, 'resample_freq': '3D', 'regularization_strength': 0.6, 'actual_changepoint_min_distance': '30D', 'potential_changepoint_distance': '15D', 'no_changepoint_distance_from_end': '90D'}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': 'auto', 'holiday_lookup_countries': 'auto', 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'auto_holiday_params': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 'auto', 'quarterly_seasonality': 'auto', 'monthly_seasonality': 'auto', 'weekly_seasonality': 'auto', 'daily_seasonality': 'auto'}, uncertainty={'uncertainty_dict': None}): Defines the SILVERKITE template. Contains automatic growth, seasonality, holidays, autoregression and interactions. Uses “zero_to_one” normalization method. Best for hourly and daily frequencies. Uses SimpleSilverkiteEstimator.

greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_MONTHLY = ModelComponentsParam(autoregression={'autoreg_dict': {'lag_dict': None, 'agg_lag_dict': {'orders_list': [[1, 2, 3]]}}, 'simulation_num': 50, 'fast_simulation': True}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'regularization_strength': 0.6, 'resample_freq': '28D', 'potential_changepoint_distance': '180D', 'potential_changepoint_n_max': 100, 'actual_changepoint_min_distance': '730D', 'no_changepoint_distance_from_end': '180D', 'yearly_seasonality_order': 6}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': False, 'max_daily_seas_interaction_order': 0, 'max_weekly_seas_interaction_order': 0, 'extra_pred_cols': ['y_avglag_1_2_3*C(month, levels=list(range(1, 13)))', 'C(month, levels=list(range(1, 13)))'], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': [], 'holiday_lookup_countries': [], 'holiday_pre_num_days': 0, 'holiday_post_num_days': 0, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'auto_holiday_params': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': False, 'quarterly_seasonality': False, 'monthly_seasonality': False, 'weekly_seasonality': False, 'daily_seasonality': False}, uncertainty={'uncertainty_dict': None}): Defines the SILVERKITE_MONTHLY template. Contains automatic growth. Seasonality is modeled via categorical variable “month”. 
Includes aggregated autoregression. Simulation is needed when forecast horizon is greater than 1. Uses “zero_to_one” normalization method. Uses SimpleSilverkiteEstimator.

greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_DAILY_1 = ['SILVERKITE_DAILY_1_CONFIG_1', 'SILVERKITE_DAILY_1_CONFIG_2', 'SILVERKITE_DAILY_1_CONFIG_3']: Defines the SILVERKITE_DAILY_1 template, which contains 3 candidate configs for grid search, optimized for the seasonality and changepoint parameters. Best for 1-day forecast for daily time series. Uses SimpleSilverkiteEstimator.

greykite.framework.templates.simple_silverkite_template_config.MULTI_TEMPLATES = {'SILVERKITE_DAILY_1': ['SILVERKITE_DAILY_1_CONFIG_1', 'SILVERKITE_DAILY_1_CONFIG_2', 'SILVERKITE_DAILY_1_CONFIG_3'], 'SILVERKITE_DAILY_90': ['DAILY_SEAS_LTQM_GR_LINEAR_CP_LT_HOL_SP2_FEASET_AUTO_ALGO_LINEAR_AR_OFF_DSI_AUTO_WSI_AUTO', 'DAILY_SEAS_LTQM_GR_LINEAR_CP_NONE_HOL_SP2_FEASET_AUTO_ALGO_LINEAR_AR_OFF_DSI_AUTO_WSI_AUTO', 'DAILY_SEAS_LTQM_GR_LINEAR_CP_LT_HOL_SP2_FEASET_AUTO_ALGO_RIDGE_AR_OFF_DSI_AUTO_WSI_AUTO', 'DAILY_SEAS_NM_GR_LINEAR_CP_LT_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_OFF_DSI_AUTO_WSI_AUTO'], 'SILVERKITE_HOURLY_1': ['SILVERKITE', 'HOURLY_SEAS_LT_GR_LINEAR_CP_NM_HOL_SP4_FEASET_OFF_ALGO_RIDGE_AR_AUTO', 'HOURLY_SEAS_NM_GR_LINEAR_CP_NM_HOL_SP1_FEASET_AUTO_ALGO_RIDGE_AR_AUTO'], 'SILVERKITE_HOURLY_168': ['HOURLY_SEAS_LT_GR_LINEAR_CP_LT_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_OFF', 'HOURLY_SEAS_LT_GR_LINEAR_CP_LT_HOL_SP2_FEASET_AUTO_ALGO_RIDGE_AR_OFF', 'HOURLY_SEAS_NM_GR_LINEAR_CP_NONE_HOL_SP4_FEASET_OFF_ALGO_LINEAR_AR_AUTO', 'HOURLY_SEAS_NM_GR_LINEAR_CP_NM_HOL_SP1_FEASET_AUTO_ALGO_RIDGE_AR_OFF'], 'SILVERKITE_HOURLY_24': ['HOURLY_SEAS_LT_GR_LINEAR_CP_NM_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_AUTO', 'HOURLY_SEAS_LT_GR_LINEAR_CP_NONE_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_AUTO', 'HOURLY_SEAS_NM_GR_LINEAR_CP_LT_HOL_SP1_FEASET_OFF_ALGO_LINEAR_AR_AUTO', 'HOURLY_SEAS_NM_GR_LINEAR_CP_NM_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_AUTO'], 'SILVERKITE_HOURLY_336': ['HOURLY_SEAS_LT_GR_LINEAR_CP_LT_HOL_SP2_FEASET_AUTO_ALGO_RIDGE_AR_OFF', 'HOURLY_SEAS_LT_GR_LINEAR_CP_LT_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_OFF', 'HOURLY_SEAS_NM_GR_LINEAR_CP_LT_HOL_SP1_FEASET_AUTO_ALGO_LINEAR_AR_OFF', 'HOURLY_SEAS_NM_GR_LINEAR_CP_NM_HOL_SP1_FEASET_AUTO_ALGO_LINEAR_AR_AUTO'], 'SILVERKITE_WEEKLY': ['WEEKLY_SEAS_NM_GR_LINEAR_CP_NONE_HOL_NONE_FEASET_OFF_ALGO_LINEAR_AR_OFF_DSI_AUTO_WSI_AUTO', 'WEEKLY_SEAS_NM_GR_LINEAR_CP_LT_HOL_NONE_FEASET_OFF_ALGO_LINEAR_AR_OFF_DSI_AUTO_WSI_AUTO', 
'WEEKLY_SEAS_HV_GR_LINEAR_CP_NM_HOL_NONE_FEASET_OFF_ALGO_RIDGE_AR_OFF_DSI_AUTO_WSI_AUTO', 'WEEKLY_SEAS_HV_GR_LINEAR_CP_LT_HOL_NONE_FEASET_OFF_ALGO_RIDGE_AR_OFF_DSI_AUTO_WSI_AUTO']}

A dictionary of multi templates.

Keys are the available multi-template names (valid strings for config.model_template).
Each value is a list of single template names, each of which corresponds to a ModelComponentsParam.
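Many of the single template names in this dictionary follow a keyword/value pattern. The sketch below splits such a name into its components. It is an illustration under stated assumptions, not the library's parser: parse_template_name is a hypothetical helper, the keyword list is inferred from the names shown above, and names such as SILVERKITE or SILVERKITE_DAILY_1_CONFIG_1 do not follow the pattern and are rejected.

```python
VALID_FREQ = ("HOURLY", "DAILY", "WEEKLY")
KEYWORDS = ("SEAS", "GR", "CP", "HOL", "FEASET", "ALGO", "AR", "DSI", "WSI")


def parse_template_name(name: str) -> dict:
    """Split a template name into {keyword: value}; the frequency is stored under 'FREQ'."""
    tokens = name.split("_")
    if tokens[0] not in VALID_FREQ:
        raise ValueError(f"{name!r} does not start with a known frequency")
    parsed = {"FREQ": tokens[0]}
    # After the frequency, tokens alternate between a keyword and its value.
    rest = tokens[1:]
    if len(rest) % 2 != 0:
        raise ValueError(f"Expected keyword/value pairs in {name!r}")
    for keyword, value in zip(rest[0::2], rest[1::2]):
        if keyword not in KEYWORDS:
            raise ValueError(f"Unknown keyword {keyword!r} in {name!r}")
        parsed[keyword] = value
    return parsed


full = parse_template_name(
    "DAILY_SEAS_LTQM_GR_LINEAR_CP_LT_HOL_SP2_FEASET_AUTO_ALGO_LINEAR_AR_OFF_DSI_AUTO_WSI_AUTO"
)
print(full["SEAS"], full["HOL"])  # LTQM SP2

# Some listed names stop after AR, so DSI and WSI are simply absent here.
short = parse_template_name("HOURLY_SEAS_LT_GR_LINEAR_CP_NM_HOL_SP4_FEASET_OFF_ALGO_RIDGE_AR_AUTO")
print("DSI" in short)  # False
```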

greykite.framework.templates.simple_silverkite_template_config.SINGLE_MODEL_TEMPLATE_TYPE

Types accepted by SimpleSilverkiteTemplate for config.model_template for a single template.

alias of Union[str, ModelComponentsParam, SimpleSilverkiteTemplateOptions]

class greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateConstants(COMMON_MODELCOMPONENTPARAM_PARAMETERS: ~typing.Dict = <factory>, MULTI_TEMPLATES: ~typing.Dict = <factory>, SILVERKITE: ~typing.Union[str, ~greykite.framework.templates.autogen.forecast_config.ModelComponentsParam, ~greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'yearly_seasonality_order': 15, 'resample_freq': '3D', 'regularization_strength': 0.6, 'actual_changepoint_min_distance': '30D', 'potential_changepoint_distance': '15D', 'no_changepoint_distance_from_end': '90D'}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': 'auto', 'holiday_lookup_countries': 'auto', 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'auto_holiday_params': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 'auto', 'quarterly_seasonality': 'auto', 'monthly_seasonality': 'auto', 'weekly_seasonality': 'auto', 'daily_seasonality': 'auto'}, uncertainty={'uncertainty_dict': None}), 
SILVERKITE_MONTHLY: ~typing.Union[str, ~greykite.framework.templates.autogen.forecast_config.ModelComponentsParam, ~greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': {'lag_dict': None, 'agg_lag_dict': {'orders_list': [[1, 2, 3]]}}, 'simulation_num': 50, 'fast_simulation': True}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'regularization_strength': 0.6, 'resample_freq': '28D', 'potential_changepoint_distance': '180D', 'potential_changepoint_n_max': 100, 'actual_changepoint_min_distance': '730D', 'no_changepoint_distance_from_end': '180D', 'yearly_seasonality_order': 6}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': False, 'max_daily_seas_interaction_order': 0, 'max_weekly_seas_interaction_order': 0, 'extra_pred_cols': ['y_avglag_1_2_3*C(month, levels=list(range(1, 13)))', 'C(month, levels=list(range(1, 13)))'], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': [], 'holiday_lookup_countries': [], 'holiday_pre_num_days': 0, 'holiday_post_num_days': 0, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'auto_holiday_params': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': False, 'quarterly_seasonality': False, 'monthly_seasonality': False, 'weekly_seasonality': False, 'daily_seasonality': False}, uncertainty={'uncertainty_dict': None}), SILVERKITE_DAILY_1_CONFIG_1: 
~typing.Union[str, ~greykite.framework.templates.autogen.forecast_config.ModelComponentsParam, ~greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.809, 'potential_changepoint_distance': '7D', 'no_changepoint_distance_from_end': '7D', 'yearly_seasonality_order': 8, 'yearly_seasonality_change_freq': None}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': ("New Year's Day", 'Chinese New Year', 'Christmas Day', 'Independence Day', 'Thanksgiving', 'Labor Day', 'Good Friday', 'Easter Monday [England, Wales, Northern Ireland]', 'Memorial Day', 'Veterans Day'), 'holiday_lookup_countries': ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China'), 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'auto_holiday_params': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 8, 'quarterly_seasonality': 0, 'monthly_seasonality': 7, 'weekly_seasonality': 1, 'daily_seasonality': 0}, uncertainty={'uncertainty_dict': None}), 
SILVERKITE_DAILY_1_CONFIG_2: ~typing.Union[str, ~greykite.framework.templates.autogen.forecast_config.ModelComponentsParam, ~greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.624, 'potential_changepoint_distance': '7D', 'no_changepoint_distance_from_end': '17D', 'yearly_seasonality_order': 1, 'yearly_seasonality_change_freq': None}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': ("New Year's Day", 'Chinese New Year', 'Christmas Day', 'Independence Day', 'Thanksgiving', 'Labor Day', 'Good Friday', 'Easter Monday [England, Wales, Northern Ireland]', 'Memorial Day', 'Veterans Day'), 'holiday_lookup_countries': ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China'), 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'auto_holiday_params': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 1, 'quarterly_seasonality': 0, 'monthly_seasonality': 4, 'weekly_seasonality': 6, 'daily_seasonality': 0}, 
uncertainty={'uncertainty_dict': None}), SILVERKITE_DAILY_1_CONFIG_3: ~typing.Union[str, ~greykite.framework.templates.autogen.forecast_config.ModelComponentsParam, ~greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.59, 'potential_changepoint_distance': '7D', 'no_changepoint_distance_from_end': '8D', 'yearly_seasonality_order': 40, 'yearly_seasonality_change_freq': None}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': ("New Year's Day", 'Chinese New Year', 'Christmas Day', 'Independence Day', 'Thanksgiving', 'Labor Day', 'Good Friday', 'Easter Monday [England, Wales, Northern Ireland]', 'Memorial Day', 'Veterans Day'), 'holiday_lookup_countries': ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China'), 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'auto_holiday_params': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 40, 'quarterly_seasonality': 0, 'monthly_seasonality': 0, 
'weekly_seasonality': 2, 'daily_seasonality': 0}, uncertainty={'uncertainty_dict': None}), SILVERKITE_COMPONENT_KEYWORDS: ~typing.Type[~enum.Enum] = <enum 'SILVERKITE_COMPONENT_KEYWORDS'>, SILVERKITE_EMPTY: ~typing.Union[str, ~greykite.framework.templates.autogen.forecast_config.ModelComponentsParam, ~greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions] = 'DAILY_SEAS_NONE_GR_NONE_CP_NONE_HOL_NONE_FEASET_OFF_ALGO_LINEAR_AR_OFF_DSI_OFF_WSI_OFF', VALID_FREQ: ~typing.List = <factory>, SimpleSilverkiteTemplateOptions: ~dataclasses.dataclass = <class 'greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions'>)[source]

Constants used by SimpleSilverkiteTemplate. Includes the model templates and their default values.

mutable_field is used when the default value is a mutable type like dict and list. Dataclass requires mutable default values to be wrapped in ‘default_factory’, so that instances of this dataclass cannot accidentally modify the default value. mutable_field wraps the constant accordingly.
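The pattern described here can be sketched with the standard dataclasses module. The helper below is named mutable_field_sketch to make clear that it is a hypothetical reimplementation for illustration; the library's own mutable_field may be written differently.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


def mutable_field_sketch(default):
    """Wrap a mutable default so each instance receives its own deep copy."""
    return field(default_factory=lambda: copy.deepcopy(default))


@dataclass
class ConstantsSketch:
    VALID_FREQ: List[str] = mutable_field_sketch(["HOURLY", "DAILY", "WEEKLY"])
    MULTI_TEMPLATES: Dict[str, list] = mutable_field_sketch({})


a = ConstantsSketch()
b = ConstantsSketch()
a.VALID_FREQ.append("MONTHLY")  # mutate one instance only, for illustration
print(b.VALID_FREQ)             # ['HOURLY', 'DAILY', 'WEEKLY']
```

Declaring `VALID_FREQ: List[str] = ["HOURLY", "DAILY", "WEEKLY"]` directly would raise a ValueError at class definition time, which is why the default is wrapped in a factory.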

COMMON_MODELCOMPONENTPARAM_PARAMETERS: Dict: Defines the default component values for SimpleSilverkiteTemplate. The components include seasonality, growth, holiday, trend changepoints, feature sets, autoregression, fit algorithm, etc. These are used when config.model_template provides the SimpleSilverkiteTemplateOptions.

MULTI_TEMPLATES: Dict

A dictionary of multi templates.

Keys are the available multi-template names (valid strings for config.model_template).
Each value is a list of single template names, each of which corresponds to a ModelComponentsParam.

SILVERKITE: Union[str, ModelComponentsParam, SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'yearly_seasonality_order': 15, 'resample_freq': '3D', 'regularization_strength': 0.6, 'actual_changepoint_min_distance': '30D', 'potential_changepoint_distance': '15D', 'no_changepoint_distance_from_end': '90D'}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': 'auto', 'holiday_lookup_countries': 'auto', 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'auto_holiday_params': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 'auto', 'quarterly_seasonality': 'auto', 'monthly_seasonality': 'auto', 'weekly_seasonality': 'auto', 'daily_seasonality': 'auto'}, uncertainty={'uncertainty_dict': None}): Defines the "SILVERKITE" template. Contains automatic growth, seasonality, holidays, autoregression and interactions. Uses “zero_to_one” normalization method. Best for hourly and daily frequencies. Uses SimpleSilverkiteEstimator.

SILVERKITE_MONTHLY: Union[str, ModelComponentsParam, SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': {'lag_dict': None, 'agg_lag_dict': {'orders_list': [[1, 2, 3]]}}, 'simulation_num': 50, 'fast_simulation': True}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'regularization_strength': 0.6, 'resample_freq': '28D', 'potential_changepoint_distance': '180D', 'potential_changepoint_n_max': 100, 'actual_changepoint_min_distance': '730D', 'no_changepoint_distance_from_end': '180D', 'yearly_seasonality_order': 6}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': False, 'max_daily_seas_interaction_order': 0, 'max_weekly_seas_interaction_order': 0, 'extra_pred_cols': ['y_avglag_1_2_3*C(month, levels=list(range(1, 13)))', 'C(month, levels=list(range(1, 13)))'], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': [], 'holiday_lookup_countries': [], 'holiday_pre_num_days': 0, 'holiday_post_num_days': 0, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'auto_holiday_params': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': False, 'quarterly_seasonality': False, 'monthly_seasonality': False, 'weekly_seasonality': False, 'daily_seasonality': False}, uncertainty={'uncertainty_dict': None}): Defines the SILVERKITE_MONTHLY template. Best for monthly forecasts. Uses SimpleSilverkiteEstimator.

SILVERKITE_DAILY_1_CONFIG_1: Union[str, ModelComponentsParam, SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.809, 'potential_changepoint_distance': '7D', 'no_changepoint_distance_from_end': '7D', 'yearly_seasonality_order': 8, 'yearly_seasonality_change_freq': None}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': ("New Year's Day", 'Chinese New Year', 'Christmas Day', 'Independence Day', 'Thanksgiving', 'Labor Day', 'Good Friday', 'Easter Monday [England, Wales, Northern Ireland]', 'Memorial Day', 'Veterans Day'), 'holiday_lookup_countries': ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China'), 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'auto_holiday_params': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 8, 'quarterly_seasonality': 0, 'monthly_seasonality': 7, 'weekly_seasonality': 1, 'daily_seasonality': 0}, uncertainty={'uncertainty_dict': None}): Config 1 in template SILVERKITE_DAILY_1. 
Compared to SILVERKITE, it adds change points and uses parameters specifically tuned for daily data and 1-day forecast.

SILVERKITE_DAILY_1_CONFIG_2: Union[str, ModelComponentsParam, SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.624, 'potential_changepoint_distance': '7D', 'no_changepoint_distance_from_end': '17D', 'yearly_seasonality_order': 1, 'yearly_seasonality_change_freq': None}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': ("New Year's Day", 'Chinese New Year', 'Christmas Day', 'Independence Day', 'Thanksgiving', 'Labor Day', 'Good Friday', 'Easter Monday [England, Wales, Northern Ireland]', 'Memorial Day', 'Veterans Day'), 'holiday_lookup_countries': ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China'), 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'auto_holiday_params': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 1, 'quarterly_seasonality': 0, 'monthly_seasonality': 4, 'weekly_seasonality': 6, 'daily_seasonality': 0}, uncertainty={'uncertainty_dict': None}): Config 2 in template SILVERKITE_DAILY_1. 
Compared to SILVERKITE, it adds change points and uses parameters specifically tuned for daily data and 1-day forecast.

SILVERKITE_DAILY_1_CONFIG_3: Union[str, ModelComponentsParam, SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.59, 'potential_changepoint_distance': '7D', 'no_changepoint_distance_from_end': '8D', 'yearly_seasonality_order': 40, 'yearly_seasonality_change_freq': None}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': ("New Year's Day", 'Chinese New Year', 'Christmas Day', 'Independence Day', 'Thanksgiving', 'Labor Day', 'Good Friday', 'Easter Monday [England, Wales, Northern Ireland]', 'Memorial Day', 'Veterans Day'), 'holiday_lookup_countries': ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China'), 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'auto_holiday_params': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 40, 'quarterly_seasonality': 0, 'monthly_seasonality': 0, 'weekly_seasonality': 2, 'daily_seasonality': 0}, uncertainty={'uncertainty_dict': None}): Config 3 in template SILVERKITE_DAILY_1. 
Compared to SILVERKITE, it adds change points and uses parameters specifically tuned for daily data and 1-day forecast.

class SILVERKITE_COMPONENT_KEYWORDS(value): Valid values for simple silverkite template string name keywords. The names are the keywords and the values are the corresponding value enum. Can be used to create an instance of SimpleSilverkiteTemplateOptions.

SILVERKITE_EMPTY: Union[str, ModelComponentsParam, SimpleSilverkiteTemplateOptions] = 'DAILY_SEAS_NONE_GR_NONE_CP_NONE_HOL_NONE_FEASET_OFF_ALGO_LINEAR_AR_OFF_DSI_OFF_WSI_OFF': Defines the "SILVERKITE_EMPTY" template. Everything here is None or off.
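The string form of a template name, as in SILVERKITE_EMPTY, reads as one frequency token followed by keyword/value pairs. The sketch below splits such a name into its options. `parse_template_name` is a hypothetical helper written for illustration, not a Greykite function, and it assumes every value is a single token without underscores (which holds for the name shown here).

```python
def parse_template_name(name):
    """Splits a simple silverkite template name into its options.

    Hypothetical helper for illustration only. Assumes the first token is
    the frequency and the remaining tokens are KEYWORD, VALUE pairs.
    """
    tokens = name.split("_")
    options = {"FREQ": tokens[0]}
    rest = tokens[1:]
    if len(rest) % 2 != 0:
        raise ValueError(f"Expected keyword/value pairs, got: {rest}")
    for keyword, value in zip(rest[0::2], rest[1::2]):
        options[keyword] = value
    return options


SILVERKITE_EMPTY = (
    "DAILY_SEAS_NONE_GR_NONE_CP_NONE_HOL_NONE_FEASET_OFF"
    "_ALGO_LINEAR_AR_OFF_DSI_OFF_WSI_OFF"
)
empty_options = parse_template_name(SILVERKITE_EMPTY)
print(empty_options)
```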

VALID_FREQ: List: Valid non-default values for simple silverkite template string name frequency. SimpleSilverkiteTemplateOptions.

class SimpleSilverkiteTemplateOptions(freq: SILVERKITE_FREQ = SILVERKITE_FREQ.DAILY, seas: SILVERKITE_SEAS = SILVERKITE_SEAS.LT, gr: SILVERKITE_GR = SILVERKITE_GR.LINEAR, cp: SILVERKITE_CP = SILVERKITE_CP.NONE, hol: SILVERKITE_HOL = SILVERKITE_HOL.NONE, feaset: SILVERKITE_FEASET = SILVERKITE_FEASET.OFF, algo: SILVERKITE_ALGO = SILVERKITE_ALGO.LINEAR, ar: SILVERKITE_AR = SILVERKITE_AR.OFF, dsi: SILVERKITE_DSI = SILVERKITE_DSI.AUTO, wsi: SILVERKITE_WSI = SILVERKITE_WSI.AUTO)

Defines generic simple silverkite template options. Attributes can be set to different values using SILVERKITE_COMPONENT_KEYWORDS for high level tuning.

algo: SILVERKITE_ALGO = 'LINEAR': Valid values for simple silverkite template string name fit algorithm. See SILVERKITE_ALGO.

ar: SILVERKITE_AR = 'OFF': Valid values for simple silverkite template string name autoregression. See SILVERKITE_AR.

cp: SILVERKITE_CP = 'NONE': Valid values for simple silverkite template string name changepoints. See SILVERKITE_CP.

dsi: SILVERKITE_DSI = 'AUTO': Valid values for simple silverkite template string name max daily seasonality interaction order. See SILVERKITE_DSI.

feaset: SILVERKITE_FEASET = 'OFF': Valid values for simple silverkite template string name feature sets enabled. See SILVERKITE_FEASET.

freq: SILVERKITE_FREQ = 'DAILY': Valid values for simple silverkite template string name frequency. See SILVERKITE_FREQ.

gr: SILVERKITE_GR = 'LINEAR': Valid values for simple silverkite template string name growth. See SILVERKITE_GR.

hol: SILVERKITE_HOL = 'NONE': Valid values for simple silverkite template string name holiday. See SILVERKITE_HOL.

seas: SILVERKITE_SEAS = 'LT': Valid values for simple silverkite template string name seasonality. See SILVERKITE_SEAS.

wsi: SILVERKITE_WSI = 'AUTO': Valid values for simple silverkite template string name max weekly seasonality interaction order. See SILVERKITE_WSI.
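To see how the option defaults relate to a template string name, the sketch below rebuilds a name from a set of options. `TemplateOptionsSketch` is a local stand-in that copies the documented defaults as plain strings; it is not the Greykite class, and the keyword order is assumed to follow the layout of the SILVERKITE_EMPTY string.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class TemplateOptionsSketch:
    """Local stand-in mirroring the documented defaults (not the Greykite class)."""
    freq: str = "DAILY"
    seas: str = "LT"
    gr: str = "LINEAR"
    cp: str = "NONE"
    hol: str = "NONE"
    feaset: str = "OFF"
    algo: str = "LINEAR"
    ar: str = "OFF"
    dsi: str = "AUTO"
    wsi: str = "AUTO"

    def to_name(self):
        # Frequency first, then KEYWORD_VALUE pairs in declaration order
        # (assumed to match the layout of the SILVERKITE_EMPTY string).
        parts = [self.freq]
        for field in fields(self):
            if field.name != "freq":
                parts += [field.name.upper(), getattr(self, field.name)]
        return "_".join(parts)


defaults = TemplateOptionsSketch()
default_name = defaults.to_name()
# Turning the remaining components off reproduces the SILVERKITE_EMPTY name.
empty_name = replace(defaults, seas="NONE", gr="NONE", dsi="OFF", wsi="OFF").to_name()
print(default_name)
print(empty_name)
```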

EasyConfig

class greykite.algo.common.seasonality_inferrer.SeasonalityInferrer[source]

A class to infer appropriate Fourier series orders in different seasonality components.

The method allows users to:

optionally remove the trend with different methods. Available methods are in TrendAdjustMethodEnum.

optionally do an aggregation.

fit the seasonality component with different Fourier series orders.

calculate the AIC/BIC of the fits.

choose the most appropriate order with AIC or BIC and an optional tolerance.

plot the investigations.

df

The input timeseries.

Type: pandas.DataFrame or None

time_col

The column name for timestamps in df.

Type: str or None

value_col

The column name for values in df.

Type: str or None

fourier_series_orders

The inferred Fourier series orders. The keys are the seasonality component names. The values are the inferred best orders according to the config.

Type: dict or None

df_features

The cached dataframe with time features. Building this dataframe is slow for large datasets, so it is cached the first time it is built and reused afterwards.

Type: pandas.DataFrame or None

infer_fourier_series_order(df: DataFrame, configs: List[SeasonalityInferConfig], time_col: str = 'ts', value_col: str = 'y', adjust_trend_method: Optional[str] = None, adjust_trend_param: Optional[dict] = None, fit_algorithm: Optional[str] = None, tolerance: Optional[float] = None, plotting: Optional[bool] = None, aggregation_period: Optional[str] = None, offset: Optional[int] = None, criterion: Optional[str] = None) → dict[source]

Infers the most appropriate Fourier series order. Can infer multiple seasonality components with multiple configs at the same time. The configurations for each component are passed as a list of SeasonalityInferConfig objects. To override a parameter for all configs, pass it through the corresponding parameter of this function.

For each seasonality component, the method first does an optional trend removal via grouped average or spline fit. For example, for yearly seasonality, one option is to remove the average of each year from the time series. The seasonality pattern is clearer and dominates after the trend removal.

Next it does an optional aggregation to emphasize the current seasonality. For example, for yearly seasonality, it can do a weekly aggregation so that the weekly seasonality won’t be mixed when modeling yearly seasonality.

Then it fits a seasonality model using Fourier series with orders up to a certain max_order, and computes the AIC/BIC of the models.

The final order will be selected based on the criterion with a tolerance adjustment. A pre-specified offset can also be added to the selected order for adjustment.
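As a rough illustration of what "Fourier series with orders up to max_order" means, the sketch below builds the sine/cosine columns for a given order. It is a standalone example, not Greykite's fourier_series_multi_func; the function name and the toy inputs are assumptions. An order-K series contributes 2K columns, so a model with an intercept has 2K + 1 coefficients.

```python
import math


def fourier_terms(t, period, order):
    """Returns [sin_1, cos_1, ..., sin_K, cos_K] evaluated at time ``t``."""
    terms = []
    for k in range(1, order + 1):
        angle = 2.0 * math.pi * k * t / period
        terms.append(math.sin(angle))
        terms.append(math.cos(angle))
    return terms


# One design-matrix row per observation: a weekly pattern on 14 daily points.
period = 7.0
max_order = 3
rows = [fourier_terms(t=day, period=period, order=max_order) for day in range(14)]
n_parameters = 2 * max_order + 1  # two columns per order, plus the intercept
print(len(rows[0]), n_parameters)
```

The coefficient count is why the number of observations after any aggregation should be at least 2 * max_order + 1.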

Parameters

df (pandas.DataFrame) – The input timeseries.
configs (list [SeasonalityInferConfig]) – A list of SeasonalityInferConfig objects. Each element corresponds to the config for a seasonality component. For example, if you would like to infer seasonality orders for yearly seasonality and weekly seasonality, you need to provide a list of two configs.
time_col (str) – The column name for timestamps in df.
value_col (str) – The column name for values in df.
adjust_trend_method (str or None, default None) – The method used to adjust trend. Supported methods are in TrendAdjustMethodEnum. If not None, value is used to override all configs.
adjust_trend_param (dict or None, default None) – Additional parameters for adjusting trend. For valid options, see _adjust_trend. If not None, value is used to override all configs.
fit_algorithm (str or None, default None) – The algorithms used to fit the seasonality. Supported algorithms are “linear”, “ridge” and “sgd”. If not None, value is used to override all configs.
plotting (bool or None, default None) – Whether to generate plots. If True, the returned dictionary will have plot via the “fig” key. Can turn this off to speed up the process. If not None, value is used to override all configs.
tolerance (float or None, default None) – A tolerance on the criterion to allow a smaller order. For example, if AIC’s minimum is 100 and tolerance is 0.1, then the function will find the smallest order that has AIC less than or equal to 110. If not None, value is used to override all configs.
aggregation_period (str or None, default None) – The aggregation period before fitting the Fourier series. Aggregating to eliminate shorter seasonal periods may help get more accurate orders, but make sure the number of observations after aggregation is sufficient (at least 2 * max_order + 1 to have a unique solution for the regression problem). If not None, value is used to override all configs.
offset (int or None, default None) – The offset order to be added to the inferred orders. The orders after applying offsets cannot be negative. If not None, value is used to override all configs.
criterion (str or None, default None) – The criterion used to pick the most appropriate orders. Supported criteria are “aic” and “bic”. If not None, value is used to override all configs.
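The interaction of tolerance and offset can be sketched as follows. This is not the library's implementation: `select_order` is a hypothetical helper, the AIC values are made up, the threshold is taken as the minimum plus tolerance times its absolute value (which matches the 100 and 0.1 example), and negative results are clipped to zero as one possible way to keep orders non-negative.

```python
def select_order(orders, criterion_values, tolerance=0.0, offset=0):
    """Picks the smallest order whose criterion is within tolerance of the best.

    Illustrative only; the threshold form and the clipping are assumptions.
    """
    best = min(criterion_values)
    threshold = best + abs(best) * tolerance
    eligible = [o for o, v in zip(orders, criterion_values) if v <= threshold]
    selected = min(eligible) + offset
    return max(selected, 0)  # keep the final order non-negative


orders = [1, 2, 3, 4, 5, 6]
aics = [150.0, 120.0, 108.0, 103.0, 100.0, 101.0]  # made-up values

strict = select_order(orders, aics)                          # minimum AIC only
relaxed = select_order(orders, aics, tolerance=0.1)          # AIC <= 110 allowed
shifted = select_order(orders, aics, tolerance=0.1, offset=1)
print(strict, relaxed, shifted)
```

With a tolerance, the smaller order 3 is preferred over the AIC-optimal order 5 because its AIC is within 10% of the minimum.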

Returns

result –

The result dictionary with the following keys:

“result”: a list of result dictionaries from the inferring methods. The keys are:

“seas_name”: the seasonality name.

“orders”: the Fourier series orders fitted.

“aics”: the fitted AICs.

“bics”: the fitted BICs.

“best_aic_order”: the order corresponding to the best feasible AIC.

“best_bic_order”: the order corresponding to the best feasible BIC.

“fig”: the diagnostic figure.

“best_orders”: a dictionary of seasonality component names and their inferred Fourier series orders.

Return type

dict

class greykite.algo.common.seasonality_inferrer.TrendAdjustMethodEnum(value)[source]

The methods that are available for adjusting trend in infer_fourier_series_order.

seasonal_average = 'seasonal_average': Calculates the average within each seasonal period and removes it.

overall_average = 'overall_average': Calculates the average of the whole timeseries and removes it.

spline_fit = 'spline_fit': Fits a spline with no knots (polynomial) with a certain degree and removes it.

none = 'none': Does not adjust trend.

class greykite.algo.common.seasonality_inferrer.SeasonalityInferConfig(seas_name: str, col_name: str, period: float, max_order: int, adjust_trend_method: str = 'seasonal_average', adjust_trend_param: Optional[dict] = None, fit_algorithm: str = 'ridge', tolerance: float = 0.0, plotting: bool = False, aggregation_period: Optional[str] = None, offset: int = 0, criterion: str = 'bic')[source]

A dataclass to pass the parameters for infer_fourier_series_order.

seas_name

Required. The seasonality component name. Will be used to distinguish the results.

Type: str

col_name

Required. The column name used to generate seasonality Fourier series. Must be in df or can be generated by build_time_features_df. See fourier_series_multi_func.

Type: str

period

Required. The period corresponding to col_name. See fourier_series_multi_func.

Type: float

max_order

Required. The maximum Fourier series order to fit.

Type: int

adjust_trend_method

The method used to adjust trend. Supported methods are in TrendAdjustMethodEnum. None defaults to “seasonal_average”, which by default subtracts the yearly average.

Type: str or None, default “seasonal_average”

adjust_trend_param

Additional parameters for adjusting the trend. For valid options, see _adjust_trend.

Type: dict or None, default None

fit_algorithm

The algorithm used to fit the seasonality. Supported algorithms are “linear”, “ridge” and “sgd”. None defaults to “ridge”.

Type: str or None, default “ridge”

plotting

Whether to generate plots. If True, the returned dictionary will have plot via the “fig” key. Can turn this off to speed up the process. None defaults to False.

Type: bool or None, default False

tolerance

A tolerance on the criterion to allow a smaller order. For example, if AIC’s minimum is 100 and tolerance is 0.1, then the function will find the smallest order that has AIC less than or equal to 110. None defaults to 0.0.

Type: float or None, default 0.0

aggregation_period

The aggregation period before fitting the Fourier series. Aggregating to eliminate shorter seasonal periods may help get more accurate orders, but make sure the number of observations after aggregation is sufficient. None corresponds to no aggregation.

Type: str or None, default None

offset

The offset order to be added to the inferred orders. The order after adding offset cannot be negative.

Type: int or None, default 0

criterion

The criterion to pick the most appropriate orders. Supported criteria are “aic” and “bic”. None defaults to “bic”.

Type: str or None, default “bic”
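The rule that a non-None function argument overrides the same field in every config can be sketched with a local dataclass. `SeasonalityInferConfigSketch` copies the documented fields but is not imported from Greykite, `apply_overrides` is a hypothetical helper, and the column names, periods and aggregation string used below are assumed values for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class SeasonalityInferConfigSketch:
    """Local stand-in with the documented fields (not imported from Greykite)."""
    seas_name: str
    col_name: str
    period: float
    max_order: int
    adjust_trend_method: str = "seasonal_average"
    adjust_trend_param: Optional[dict] = None
    fit_algorithm: str = "ridge"
    tolerance: float = 0.0
    plotting: bool = False
    aggregation_period: Optional[str] = None
    offset: int = 0
    criterion: str = "bic"


def apply_overrides(configs, **overrides):
    """Applies every non-None override to all configs; None leaves a field as is."""
    active = {key: value for key, value in overrides.items() if value is not None}
    return [replace(config, **active) for config in configs]


configs = [
    # Column names, periods and the aggregation string are assumed values.
    SeasonalityInferConfigSketch("yearly_seasonality", "toy", 1.0, 30, aggregation_period="W"),
    SeasonalityInferConfigSketch("weekly_seasonality", "tow", 7.0, 10, criterion="aic"),
]
effective = apply_overrides(configs, tolerance=0.05, criterion=None, plotting=None)
for config in effective:
    print(config.seas_name, config.tolerance, config.criterion)
```

Only tolerance changes here; each config keeps its own criterion because the override for it is None.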

class greykite.algo.common.holiday_inferrer.HolidayInferrer[source]

Implements methods to automatically infer holiday effects.

The class works for daily and sub-daily data. Sub-daily data is aggregated into daily data. It pulls holiday candidates from pypi:holidays-ext, and adds a pre-specified number of days before/after the holiday candidates as the whole holiday candidates pool. Every day in the candidate pool is compared with a pre-defined baseline imputed from surrounding days (e.g. the average of -7 and +7 days) and a score is generated to indicate deviation. The score is averaged if a holiday has multiple occurrences through the timeseries period. The holidays are ranked according to the magnitudes of the scores. Holidays are classified into:

model independently

model together

do not model

according to their score magnitudes. For example, if the sum of the absolute scores is 1000, and the threshold for independent holidays is 0.8, the method keeps adding holidays to the independent modeling list from the largest magnitude until the sum reaches 1000 x 0.8 = 800. It then continues down the ranking to fill the together modeling list, up to its own threshold.
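The cumulative-share rule can be sketched in a few lines. This is an illustration under stated assumptions, not the HolidayInferrer code: `classify_holidays` is a hypothetical helper, the scores are made up so that their absolute values sum to 1000, and the handling of ties and exact boundaries is a choice made for this example.

```python
def classify_holidays(avg_scores, independent_thres=0.8, together_thres=0.99):
    """Splits holidays into independent / together (+/-) / not modeled.

    Holidays are ranked by absolute score; each one is assigned according to
    the cumulative share already covered by the holidays ranked before it.
    """
    total = sum(abs(score) for score in avg_scores.values())
    ranked = sorted(avg_scores.items(), key=lambda item: abs(item[1]), reverse=True)
    result = {
        "independent": [],
        "together_positive": [],
        "together_negative": [],
        "not_modeled": [],
    }
    cumulative = 0.0
    for name, score in ranked:
        if cumulative < independent_thres * total:
            result["independent"].append(name)
        elif cumulative < together_thres * total:
            key = "together_positive" if score > 0 else "together_negative"
            result[key].append(name)
        else:
            result["not_modeled"].append(name)
        cumulative += abs(score)
    return result


# Made-up average scores whose absolute values sum to 1000.
scores = {
    "Holiday A": -500.0,
    "Holiday B": 320.0,
    "Holiday C": 130.0,
    "Holiday D": -45.0,
    "Holiday E": 5.0,
}
groups = classify_holidays(scores)
print(groups)
```

Holidays A and B cover the first 80% of the total effect, C and D fall into the grouped lists by sign, and E is left out.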

baseline_offsets

The offsets in days to calculate baselines.

Type: list [int] or None

post_search_days

The number of days after each holiday to be counted as candidates.

Type: int or None

pre_search_days

The number of days before each holiday to be counted as candidates.

Type: int or None

independent_holiday_thres

A certain proportion of the total holiday effects that are allocated for holidays that are modeled independently. For example, 0.8 means the holidays that contribute to the first 80% of the holiday effects are modeled independently.

Type: float or None

together_holiday_thres

A certain proportion of the total holiday effects that are allocated for holidays that are modeled together. For example, if independent_holiday_thres is 0.8 and together_holiday_thres is 0.9, then after the first 80% of the holiday effects are counted, the remaining holidays are allocated to the together-modeled group until the cumulative sum exceeds 0.9.

Type: float or None

extra_years

Extra years after self.year_end to pull holidays in self.country_holiday_df. This can be used to cover the forecast periods.

Type: int, default 2

df

The timeseries after daily aggregation.

Type: pandas.DataFrame or None

time_col

The column name for timestamps in df.

Type: str or None

value_col

The column name for values in df.

Type: str or None

year_start

The year of the first timeseries observation in df.

Type: int or None

year_end

The year of the last timeseries observation in df.

Type: int or None

ts

The existing timestamps in df for fast look up.

Type: set [datetime] or None

country_holiday_df

The holidays between year_start and year_end. This is the output from pypi:holidays-ext. Duplicates are dropped. Observed holidays are merged.

Type: pandas.DataFrame or None

all_holiday_dates

All holiday dates contained in country_holiday_df.

Type: list [datetime] or None

holidays

A list of holidays in country_holiday_df.

Type: list [str] or None

score_result

The scores from comparing holidays and their baselines. The keys are holidays. The values are a list of the scores for each occurrence.

Type: dict [str, list [float]] or None

score_result_avg

The scores from score_result where the values are averaged.

Type: dict [str, float] or None

result

The output of the model. Includes:

“scores”: dict [str, list [float]]
The score_result from self._get_scores_for_holidays.

“country_holiday_df”: pandas.DataFrame
The country_holiday_df from pypi:holidays_ext.

“independent_holidays”: list [tuple [str, str]]
The holidays to be modeled independently. Each item is in (country, holiday) format.

“together_holidays_positive”: list [tuple [str, str]]
The holidays with positive effects to be modeled together. Each item is in (country, holiday) format.

“together_holidays_negative”: list [tuple [str, str]]
The holidays with negative effects to be modeled together. Each item is in (country, holiday) format.

“fig”: plotly.graph_objs.Figure
The visualization if activated.

Type: dict [str, any]

infer_holidays(df: DataFrame, time_col: str = 'ts', value_col: str = 'y', countries: List[str] = ('US',), pre_search_days: int = 2, post_search_days: int = 2, baseline_offsets: List[int] = (-7, 7), plot: bool = False, independent_holiday_thres: float = 0.8, together_holiday_thres: float = 0.99, extra_years: int = 2, use_relative_score: bool = False) → Optional[Dict[str, any]][source]

Infers significant holidays and holiday configurations.

The class works for daily and sub-daily data. Sub-daily data is aggregated into daily data. It pulls holiday candidates from pypi:holidays-ext, and adds a pre-specified number of days before/after the holiday candidates as the whole holiday candidates pool. Every day in the candidate pool is compared with a pre-defined baseline imputed from surrounding days (e.g. the average of -7 and +7 days) and a score is generated to indicate deviation. The score is averaged if a holiday has multiple occurrences through the timeseries period. The holidays are ranked according to the magnitudes of the scores. Holidays are classified into:

model independently

model together

do not model

according to their score magnitudes. For example, if the sum of the absolute scores is 1000, and the threshold for independent holidays is 0.8, the method keeps adding holidays to the independent modeling list from the largest magnitude until the sum reaches 1000 x 0.8 = 800. It then continues down the ranking to fill the together modeling list, up to its own threshold.

Parameters

df (pd.DataFrame) – The input timeseries.
time_col (str, default TIME_COL) – The column name for timestamps in df.
value_col (str, default VALUE_COL) – The column name for values in df.
countries (list [str], default (“US”,)) – A list of countries to look up holiday candidates. Available countries can be listed with holidays_ext.get_holidays.get_available_holiday_lookup_countries(). Two-character country names are preferred.
pre_search_days (int, default 2) – The number of days to include as holidays candidates before each holiday.
post_search_days (int, default 2) – The number of days to include as holidays candidates after each holiday.
baseline_offsets (list [int], default (-7, 7)) – The offsets in days as a baseline to compare with each holiday.
plot (bool, default False) – Whether to generate visualization.
independent_holiday_thres (float, default 0.8) – A certain proportion of the total holiday effects that are allocated for holidays that are modeled independently. For example, 0.8 means the holidays that contribute to the first 80% of the holiday effects are modeled independently.
together_holiday_thres (float, default 0.99) – A certain proportion of the total holiday effects that are allocated for holidays that are modeled together. For example, if independent_holiday_thres is 0.8 and together_holiday_thres is 0.9, then after the first 80% of the holiday effects are counted, the remaining holidays are allocated to the together-modeled group until the cumulative sum exceeds 0.9.
extra_years (int, default 2) – Extra years after self.year_end to pull holidays in self.country_holiday_df. This can be used to cover the forecast periods.
use_relative_score (bool, default False) – Whether the holiday effect is calculated as a relative ratio. If False, _get_score_for_dates will use absolute difference compared to the baseline as the score. If True, it uses relative ratio compared to the baseline as the score.
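How baseline_offsets and use_relative_score combine into a score can be sketched as follows. This is not the private _get_score_for_dates method: `holiday_score` is a hypothetical helper, the series is made up, missing baseline days are simply skipped, and the relative score is taken as (value - baseline) / baseline, all of which are assumptions of this example.

```python
from datetime import date, timedelta
from statistics import mean


def holiday_score(values_by_date, day, baseline_offsets=(-7, 7), use_relative_score=False):
    """Scores one candidate day against the average of its baseline days."""
    baseline_values = [
        values_by_date[day + timedelta(days=offset)]
        for offset in baseline_offsets
        if day + timedelta(days=offset) in values_by_date
    ]
    if day not in values_by_date or not baseline_values:
        return None  # nothing to compare against
    baseline = mean(baseline_values)
    difference = values_by_date[day] - baseline
    return difference / baseline if use_relative_score else difference


# Made-up daily series: flat at 100 except for a dip on the candidate day.
start = date(2021, 12, 1)
series = {start + timedelta(days=i): 100.0 for i in range(62)}
candidate = date(2021, 12, 25)
series[candidate] = 60.0

absolute = holiday_score(series, candidate)
relative = holiday_score(series, candidate, use_relative_score=True)
print(absolute, relative)
```

The absolute score is in the units of the series, while the relative score is comparable across series of different scale.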

Returns

result –

A dictionary with the following keys:

“scores”: dict [str, list [float]]
The score_result from self._get_scores_for_holidays.

“country_holiday_df”: pandas.DataFrame
The country_holiday_df from pypi:holidays_ext.

“independent_holidays”: list [tuple [str, str]]
The holidays to be modeled independently. Each item is in (country, holiday) format.

“together_holidays_positive”: list [tuple [str, str]]
The holidays with positive effects to be modeled together. Each item is in (country, holiday) format.

“together_holidays_negative”: list [tuple [str, str]]
The holidays with negative effects to be modeled together. Each item is in (country, holiday) format.

“fig”: plotly.graph_objs.Figure
The visualization if activated.

Return type

dict [str, any] or None

generate_daily_event_dict(country_holiday_df: Optional[DataFrame] = None, holiday_result: Optional[Dict[str, List[Tuple[str, str]]]] = None) → Dict[str, DataFrame][source]

Generates daily event dict for all holidays inferred. The daily event dict will contain:

Single events for every holiday or holiday neighboring day that is to be modeled independently.

A single event for all holiday or holiday neighboring days with positive effects that are modeled together.

A single event for all holiday or holiday neighboring days with negative effects that are modeled together.

Parameters

country_holiday_df (pandas.DataFrame or None, default None) – The dataframe that contains the country/holiday/dates information for holidays. Must cover the periods needed in training/forecasting for all holidays. This has the same format as self.country_holiday_df. If None, it pulls from self.country_holiday_df.
holiday_result (dict [str, list [tuple [str, str]]] or None, default None) –
A dictionary with the following keys:
- INFERRED_INDEPENDENT_HOLIDAYS_KEY
- INFERRED_GROUPED_POSITIVE_HOLIDAYS_KEY
- INFERRED_GROUPED_NEGATIVE_HOLIDAYS_KEY
Each key’s value is a list of length-2 tuples of the format (country, holiday). This format is the output of self.infer_holidays. If None, it pulls from self.result.

Returns

daily_event_dict – The daily event dict that is consumable by SimpleSilverkiteForecast or SilverkiteForecast. The keys are the event names. The values are dataframes with the event dates.

Return type

dict

class greykite.algo.common.holiday_grouper.HolidayGrouper(df: DataFrame, time_col: str, value_col: str, holiday_df: DataFrame, holiday_date_col: str, holiday_name_col: str, holiday_impact_pre_num_days: int = 0, holiday_impact_post_num_days: int = 0, holiday_impact_dict: Optional[Dict[str, Tuple[int, int]]] = None, get_suffix_func: Optional[Union[Callable, str]] = 'wd_we')[source]

This class estimates the impact of holidays and their neighboring days given a raw holiday dataframe holiday_df, and a time series containing the observed values to construct the baselines. It groups events with similar effects into several groups using kernel density estimation (KDE) and generates the grouped events in a dictionary of dataframes that is recognizable by SilverkiteForecast.

Parameters

df (pandas.DataFrame) – Input time series that contains time_col and value_col. The values will be used to construct baselines to estimate the holiday impact.
time_col (str) – Name of the time column in df.
value_col (str) – Name of the value column in df.
holiday_df (pandas.DataFrame) – Input holiday dataframe that contains the dates and names of the holidays.
holiday_date_col (str) – Name of the holiday date column in holiday_df.
holiday_name_col (str) – Name of the holiday name column in holiday_df.
holiday_impact_pre_num_days (int, default 0) – Default number of days before the holiday that will be modeled for holiday effect if the given holiday is not specified in holiday_impact_dict.
holiday_impact_post_num_days (int, default 0) – Default number of days after the holiday that will be modeled for holiday effect if the given holiday is not specified in holiday_impact_dict.
holiday_impact_dict (Dict [str, Any] or None, default None) –
A dictionary containing the neighboring impacting days of a certain holiday. This overrides the default pre_num and post_num for each holiday specified here. The key is the name of the holiday matching those in the provided holiday_df. The value is a tuple of two values indicating the number of neighboring days before and after the holiday. For example, a valid dictionary may look like:
holiday_impact_dict = { "Christmas Day": [3, 3], "Memorial Day": [0, 0] }
get_suffix_func (Callable or str or None, default “wd_we”) –
A function that generates a suffix (usually a time feature, e.g. “_WD” for weekday, “_WE” for weekend) given an input date. This can be used to estimate the interaction between floating holidays and the day of the week on which they are observed. We currently support two defaults:
- “wd_we” to generate suffixes based on whether the day falls on a weekday or weekend.
- “dow_grouped” to generate three categories: [“_WD”, “_Sat”, “_Sun”].
If None, no suffix is added.

expanded_holiday_df

An expansion of holiday_df after adding the neighboring dates provided in holiday_impact_dict and the suffix generated by get_suffix_func. For example, if "Christmas Day": [3, 3] and “wd_we” are used, events such as “Christmas Day_WD_plus_1_WE” or “Christmas Day_WD_minus_3_WD” will be generated for a Christmas that falls on Friday.

Type: pandas.DataFrame
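The naming of expanded events can be illustrated with a small standard-library sketch. It is not the library's implementation: the helper names wd_we_suffix and expand_event_names are hypothetical, and the naming pattern is inferred only from the "Christmas Day" example above.

```python
from datetime import date, timedelta


def wd_we_suffix(day: date) -> str:
    """Return "_WD" for Monday-Friday and "_WE" for Saturday-Sunday."""
    return "_WE" if day.weekday() >= 5 else "_WD"


def expand_event_names(name: str, holiday_date: date, pre: int, post: int) -> dict:
    """Map each date in the impact window to an expanded event name.

    The naming pattern is inferred from the documented example, not taken
    from the library source.
    """
    base = f"{name}{wd_we_suffix(holiday_date)}"
    events = {holiday_date: base}
    for k in range(1, pre + 1):
        neighbor = holiday_date - timedelta(days=k)
        events[neighbor] = f"{base}_minus_{k}{wd_we_suffix(neighbor)}"
    for k in range(1, post + 1):
        neighbor = holiday_date + timedelta(days=k)
        events[neighbor] = f"{base}_plus_{k}{wd_we_suffix(neighbor)}"
    return events


# Christmas 2020 fell on a Friday.
events = expand_event_names("Christmas Day", date(2020, 12, 25), pre=3, post=3)
print(events[date(2020, 12, 26)])  # Christmas Day_WD_plus_1_WE
print(events[date(2020, 12, 22)])  # Christmas Day_WD_minus_3_WD
```

Each neighboring day carries two suffixes: the one for the holiday itself and the one for the neighboring date, which is why a Saturday after a Friday holiday ends in "_WE".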

baseline_offsets

The offsets in days to calculate baselines for a given holiday. By default, the same days of the week before and after are used.

Type: Tuple[int] or None

use_relative_score

Whether to use relative or absolute score when estimating the holiday impact.

Type: bool or None

clustering_method

Clustering method used to group the holidays. Since we are doing 1-D clustering, current supported methods include (1) “kde” for kernel density estimation, and (2) “kmeans” for k-means clustering.

Type: str or None

bandwidth

The bandwidth used in the kernel density estimation. A higher bandwidth results in fewer clusters. If None, it is automatically inferred with the bandwidth_multiplier factor.

Type: float or None

bandwidth_multiplier

Multiplier applied to the kernel density estimation’s default bandwidth, which is calculated with the rule-of-thumb bandwidth estimator (https://en.wikipedia.org/wiki/Kernel_density_estimation#A_rule-of-thumb_bandwidth_estimator). This multiplier has been found useful in adjusting the default bandwidth parameter in many cases. Only used when bandwidth is not specified.

Type: float or None
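How bandwidth and bandwidth_multiplier interact can be sketched as follows. This is a minimal illustration under assumptions: it uses the Gaussian rule-of-thumb form 1.06 * sigma * n^(-1/5) from the reference above, which may not be the exact variant the library applies, and the function names are hypothetical.

```python
import statistics


def rule_of_thumb_bandwidth(scores):
    """Gaussian rule-of-thumb bandwidth: 1.06 * sigma * n ** (-1/5)."""
    n = len(scores)
    sigma = statistics.stdev(scores)
    return 1.06 * sigma * n ** (-1 / 5)


def effective_bandwidth(scores, bandwidth=None, bandwidth_multiplier=0.2):
    """Use an explicit bandwidth if given, else scale the rule-of-thumb value."""
    if bandwidth is not None:
        return bandwidth  # the multiplier is ignored in this case
    return bandwidth_multiplier * rule_of_thumb_bandwidth(scores)


# Illustrative average holiday scores (made-up values).
avg_scores = [-0.30, -0.28, -0.10, -0.08, 0.05, 0.07, 0.25]
h_default = rule_of_thumb_bandwidth(avg_scores)
h_used = effective_bandwidth(avg_scores)
print(h_default, h_used)
```

A multiplier below 1 shrinks the bandwidth, which makes the density estimate less smooth and therefore tends to produce more groups.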

kde

The KernelDensity object if clustering_method == "kde".

Type: KernelDensity or None

n_clusters

Number of clusters in the k-means algorithm.

Type: int or None

kmeans

The KMeans object if clustering_method == "kmeans".

Type: KMeans or None

include_diagnostics

Whether to include kmeans_diagnostics and kmeans_plot in the output result_dict.

Type: bool or None

result_dict

A dictionary that stores the scores and clustering results, with the following keys.

“holiday_inferrer”: the HolidayInferrer
instance used for calculating the scores.

“score_result_original”: a dictionary with keys being the names of all holiday events
after expansion (i.e. the keys in expanded_holiday_df), values being a list of scores of all dates corresponding to this event.

“score_result_avg_original”: a dictionary with the same key as in
result_dict["score_result_original"]. But the values are the average scores of each event across all occurrences.

“score_result”: same as result_dict["score_result_original"], but after removing
holidays with inconsistent / negligible scores.

“score_result_avg”: same as result_dict["score_result_avg_original"], but after removing
holidays with inconsistent / negligible scores.

“daily_event_df_dict_with_score”: a dictionary of dataframes.
Key is the group name "holiday_group_{k}". Value is a dataframe of all holiday events in this group, containing 4 columns: “date” (EVENT_DF_DATE_COL), “event_name” (EVENT_DF_LABEL_COL), “original_name”, “avg_score”.

“daily_event_df_dict”: a dictionary of dataframes that is ready to use in SilverkiteForecast.
Each dataframe contains 2 columns: EVENT_DF_DATE_COL and EVENT_DF_LABEL_COL.

“kde_cutoffs”: a list of float, the cutoffs returned by the kernel density clustering.

“kde_res”: a dataframe that contains “score” and “density” from the kernel density estimation.

“kde_plot”: a plot of the kernel density estimation.

“kmeans_diagnostics”: a dataframe containing metrics for different number of clusters.
Columns are:

“k”: number of clusters;

“wsse”: within-cluster sum of squared error (lower is better);

“sil_score”: Silhouette coefficient, a value between [-1, 1] that describes
the separation of clusters (higher is better).

Only generated when include_diagnostics is True. See group_holidays for details.

“kmeans_plot”: a plot visualizing how the diagnostic metrics change over K.
Only generated when include_diagnostics is True. See group_holidays for details.

Type: Dict[str, Any] or None
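One plausible way to obtain "kde_cutoffs" and the "holiday_group_{k}" keys is to cut the score axis at local minima of the estimated density. The sketch below illustrates that idea with the standard library only. The local-minimum rule, the grid size, and all function names are assumptions and are not confirmed by this documentation.

```python
import math


def gaussian_kde(points, bandwidth, x):
    """Evaluate a Gaussian kernel density estimate at x."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm


def kde_cutoffs(points, bandwidth, grid_size=200):
    """Return grid locations where the estimated density has a local minimum."""
    lo, hi = min(points), max(points)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    density = [gaussian_kde(points, bandwidth, x) for x in grid]
    return [
        grid[i]
        for i in range(1, grid_size - 1)
        if density[i] < density[i - 1] and density[i] < density[i + 1]
    ]


def assign_groups(scores_by_event, cutoffs):
    """Group events by the interval between cutoffs that their score falls in."""
    groups = {}
    for event, score in scores_by_event.items():
        k = sum(score > c for c in cutoffs)  # number of cutoffs below the score
        groups.setdefault(f"holiday_group_{k}", []).append(event)
    return groups


# Made-up average scores for five events forming two clear clusters.
scores_by_event = {"A": -0.30, "B": -0.28, "C": 0.05, "D": 0.06, "E": 0.07}
cutoffs = kde_cutoffs(list(scores_by_event.values()), bandwidth=0.03)
groups = assign_groups(scores_by_event, cutoffs)
print(cutoffs, groups)
```

With a larger bandwidth the valley between the two clusters fills in, no cutoff is found, and all events land in a single group.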

group_holidays(baseline_offsets: Tuple[int, int] = (-7, 7), use_relative_score: bool = True, min_n_days: int = 1, min_same_sign_ratio: float = 0.66, min_abs_avg_score: float = 0.03, clustering_method: str = 'kmeans', bandwidth: Optional[float] = None, bandwidth_multiplier: Optional[float] = 0.2, n_clusters: Optional[int] = 5, include_diagnostics: bool = False) → None[source]

Estimates the impact of holidays and their neighboring days, and groups events with similar effects into several groups using the method set by clustering_method: kernel density estimation (KDE) or k-means. It then generates the grouped events and stores the results in self.result_dict.

Parameters

baseline_offsets (Tuple[int], default (-7, 7)) – The offsets in days to calculate baselines for a given holiday. By default, the same days of the week before and after are used.
use_relative_score (bool, default True) – Whether to use relative or absolute score when estimating the holiday impact.
min_n_days (int, default 1) – Minimal number of occurrences for a holiday event to be kept before grouping.
min_same_sign_ratio (float, default 0.66) – Threshold of the ratio of the same-sign scores for an event’s occurrences. For example, if an event has two occurrences, they both need to have positive or negative scores for the ratio to achieve 0.66. Similarly, if an event has 3 occurrences, at least 2 of them must have the same directional impact. This parameter is intended to rule out holidays that have indefinite effects.
min_abs_avg_score (float, default 0.03) – The minimal average score of an event (across all its occurrences) to be kept before grouping. When use_relative_score = True, 0.03 means the effect must be greater than 3%.
clustering_method (str, default “kmeans”) – Clustering method used to group the holidays. Since we are doing 1-D clustering, current supported methods include (1) “kde” for kernel density estimation, and (2) “kmeans” for k-means clustering.
bandwidth (float or None, default None) – The bandwidth used in the kernel density estimation. A higher bandwidth results in fewer clusters. If None, it is automatically inferred with the bandwidth_multiplier factor. Only used when clustering_method == "kde".
bandwidth_multiplier (float or None, default 0.2) – Multiplier applied to the kernel density estimation’s default bandwidth, which is calculated with the rule-of-thumb bandwidth estimator (https://en.wikipedia.org/wiki/Kernel_density_estimation#A_rule-of-thumb_bandwidth_estimator). This multiplier has been found useful in adjusting the default bandwidth parameter in many cases. Only used when bandwidth is not specified and clustering_method == "kde".
n_clusters (int or None, default 5) – Number of clusters in the k-means algorithm. Only used when clustering_method == "kmeans".
include_diagnostics (bool, default False) – Whether to include kmeans_diagnostics and kmeans_plot in the output result_dict.

Return type

Saves the results in the result_dict attribute.
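The three pruning thresholds (min_n_days, min_same_sign_ratio, min_abs_avg_score) can be read as a filter on each event's list of scores. The sketch below is one interpretation of the parameter descriptions above, not the library's code. The function name keep_event is hypothetical, and whether the comparisons are inclusive is an assumption.

```python
def keep_event(scores, min_n_days=1, min_same_sign_ratio=0.66, min_abs_avg_score=0.03):
    """Return True if an event passes all three pruning thresholds."""
    if not scores or len(scores) < min_n_days:
        return False
    n_pos = sum(s > 0 for s in scores)
    n_neg = sum(s < 0 for s in scores)
    # Share of occurrences whose score has the dominant sign.
    same_sign_ratio = max(n_pos, n_neg) / len(scores)
    avg_score = sum(scores) / len(scores)
    # Assumption: thresholds are inclusive.
    return same_sign_ratio >= min_same_sign_ratio and abs(avg_score) >= min_abs_avg_score


print(keep_event([0.10, -0.05]))        # False: only 1 of 2 scores share a sign
print(keep_event([0.10, 0.20, -0.05]))  # True: 2 of 3 share a sign, average is about 0.083
print(keep_event([0.01, 0.02]))         # False: average 0.015 is below 0.03
```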

get_holiday_scores(baseline_offsets: Tuple[int, int] = (-7, 7), use_relative_score: bool = True, min_n_days: int = 1, min_same_sign_ratio: float = 0.66, min_abs_avg_score: float = 0.05) → Dict[str, Any][source]

Computes the score of all holiday events and their neighboring days in self.expanded_holiday_df, by comparing their observed values with a baseline value that is an average of the values on the days specified in baseline_offsets. If a baseline date falls on another holiday, the algorithm looks for the next value with the same step size as the given offset, up to 3 extra iterations. Please see more details in _get_scores_for_holidays. An additional pruning step is done to remove holidays with inconsistent / negligible scores. Both the results before and after the pruning are returned.

Parameters

baseline_offsets (Tuple[int], default (-7, 7)) – The offsets in days to calculate baselines for a given holiday. By default, the same days of the week before and after are used.
use_relative_score (bool, default True) – Whether to use relative or absolute score when estimating the holiday impact.
min_n_days (int, default 1) – Minimal number of occurrences for a holiday event to be kept before grouping.
min_same_sign_ratio (float, default 0.66) – Threshold of the ratio of the same-sign scores for an event’s occurrences. For example, if an event has two occurrences, they both need to have positive or negative scores for the ratio to achieve 0.66. Similarly, if an event has 3 occurrences, at least 2 of them must have the same directional impact. This parameter is intended to rule out holidays that have indefinite effects.
min_abs_avg_score (float, default 0.05) – The minimal average score of an event (across all its occurrences) to be kept before grouping. When use_relative_score = True, 0.05 means the effect must be greater than 5%.

Returns

result_dict – A dictionary containing the scoring results. In particular the following keys are set: “holiday_inferrer”, “score_result_original”, “score_result_avg_original”, “score_result”, and “score_result_avg”. Please refer to the docstring of the self.result_dict attribute of HolidayGrouper.

Return type

Dict [str, Any]
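The baseline search described for get_holiday_scores can be sketched with plain dictionaries. This is an illustration under assumptions, not the logic of _get_scores_for_holidays: the relative score is taken to be (observed - baseline) / baseline, missing dates are skipped in the same way as holiday dates, and the function names are hypothetical.

```python
from datetime import date, timedelta


def baseline_value(values, holidays, day, offsets=(-7, 7), max_extra_steps=3):
    """Average the observed values at the offset days, skipping other holidays."""
    found = []
    for offset in offsets:
        # Try the offset itself, then up to max_extra_steps further multiples.
        for step in range(1, max_extra_steps + 2):
            candidate = day + timedelta(days=offset * step)
            if candidate in values and candidate not in holidays:
                found.append(values[candidate])
                break
    return sum(found) / len(found) if found else None


def holiday_score(values, holidays, day, use_relative_score=True):
    """Score one holiday occurrence against its baseline."""
    baseline = baseline_value(values, holidays, day)
    if baseline is None or (use_relative_score and baseline == 0):
        return None
    diff = values[day] - baseline
    return diff / baseline if use_relative_score else diff


# Made-up daily series: 100 on ordinary days, lower on two holidays.
first_day = date(2020, 12, 1)
values = {first_day + timedelta(days=i): 100.0 for i in range(62)}
values[date(2020, 12, 25)] = 70.0
values[date(2021, 1, 1)] = 80.0
holidays = {date(2020, 12, 25), date(2021, 1, 1)}

# The +7 baseline day for Christmas is New Year's Day, so the search moves on to January 8.
christmas_score = holiday_score(values, holidays, date(2020, 12, 25))
new_year_score = holiday_score(values, holidays, date(2021, 1, 1))
print(christmas_score, new_year_score)
```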

check_scores(holiday_name_pattern: str, show_pruned: bool = True) → None[source]

Spot checks the score of certain holidays containing pattern holiday_name_pattern. Prints out the dates, individual day scores of all occurrences, and the average scores of all matching holiday events. Note that it only checks the keys in self.expanded_holiday_df, and it assumes get_holiday_scores is already run.

Parameters

holiday_name_pattern (str) – Any substring of the holiday event names (self.expanded_holiday_df[self.holiday_name_col]).
show_pruned (bool, default True) – Whether to show pruned holidays along with the remaining holidays.

Returns

Prints out the dates, individual day scores of all occurrences,
and the average scores of all matching holiday events.

check_holiday_group(holiday_name_pattern: str = '', holiday_groups: Optional[Union[List[int], int]] = None) → None[source]

Prints out the holiday groups that contain holidays matching holiday_name_pattern and their scores. The searching is limited to the given holiday_groups. Note that it assumes group_holidays has already been run.

Parameters

holiday_name_pattern (str) – Any substring of the holiday event names (self.expanded_holiday_df[self.holiday_name_col]).
holiday_groups (List[int] or int, default None) – The indices of the holiday groups to which the search is limited. If None, all groups are searched.

Return type

Prints out all qualifying holiday groups and their scores.

static expand_holiday_df_with_suffix(holiday_df: DataFrame, holiday_date_col: str, holiday_name_col: str, holiday_impact_pre_num_days: int = 0, holiday_impact_post_num_days: int = 0, holiday_impact_dict: Optional[Dict[str, Tuple[int, int]]] = None, get_suffix_func: Optional[Union[Callable, str]] = 'wd_we') → DataFrame[source]

Expands an input holiday dataframe holiday_df to include the neighboring days specified in holiday_impact_dict or through holiday_impact_pre_num_days and holiday_impact_post_num_days. Also adds suffixes generated by get_suffix_func to better model the effects of events falling on different days of week.

Parameters

holiday_df (pandas.DataFrame) – Input holiday dataframe that contains the dates and names of the holidays.
holiday_date_col (str) – Name of the holiday date column in holiday_df.
holiday_name_col (str) – Name of the holiday name column in holiday_df.
holiday_impact_pre_num_days (int, default 0) – Default number of days before the holiday that will be modeled for holiday effect if the given holiday is not specified in holiday_impact_dict.
holiday_impact_post_num_days (int, default 0) – Default number of days after the holiday that will be modeled for holiday effect if the given holiday is not specified in holiday_impact_dict.
holiday_impact_dict (Dict [str, Any] or None, default None) –
A dictionary containing the neighboring impacting days of a certain holiday. This overrides the default pre_num and post_num for each holiday specified here. The key is the name of the holiday matching those in the provided holiday_df. The value is a tuple of two values indicating the number of neighboring days before and after the holiday. For example, a valid dictionary may look like:
holiday_impact_dict = { "Christmas Day": [3, 3], "Memorial Day": [0, 0] }
get_suffix_func (Callable or str or None, default “wd_we”) –
A function that generates a suffix (usually a time feature e.g. “_WD” for weekday, “_WE” for weekend) given an input date. This can be used to estimate the interaction between floating holidays and on which day they are getting observed. We currently support two defaults:
- ”wd_we” to generate suffixes based on whether the day falls on weekday or weekend.
- ”dow_grouped” to generate three categories: [“_WD”, “_Sat”, “_Sun”].
If None, no suffix is added.

Returns

expanded_holiday_df – An expansion of holiday_df after adding the neighboring dates provided in holiday_impact_dict and the suffix generated by get_suffix_func. For example, if "Christmas Day": [3, 3] and “wd_we” are used, events such as “Christmas Day_WD_plus_1_WE” or “Christmas Day_WD_minus_3_WD” will be generated for a Christmas that falls on Friday.

Return type

pandas.DataFrame

Changepoint Detection

class greykite.algo.changepoint.adalasso.changepoint_detector.ChangepointDetector[source]

A class to implement change point detection.

Currently supports long-term change point detection only. Input is a dataframe with time_col indicating the column of time info (the format should be able to be parsed by pd.to_datetime), and value_col indicating the column of observed time series values.

original_df

The original data df, used to retrieve original observations, if aggregation is used in fitting change points.

Type: pandas.DataFrame

time_col

The column name for time column.

Type: str

value_col

The column name for value column.

Type: str

trend_potential_changepoint_n

The number of change points that are evenly distributed over the time period.

Type: int

yearly_seasonality_order

The yearly seasonality order used when fitting trend.

Type: int

y

The observations after aggregation.

Type: pandas.Series

trend_df

The augmented df of the original_df, including regressors of trend change points and Fourier series for yearly seasonality.

Type: pandas.DataFrame

trend_model

The fitted trend model.

Type: sklearn.base.RegressorMixin

trend_coef

The estimated trend coefficients.

Type: numpy.array

trend_intercept

The estimated trend intercept.

Type: float

adaptive_lasso_coef

A list of length two: the first element is the estimated trend coefficients and the second element is the intercept, both estimated by adaptive lasso.

Type: list

trend_changepoints

The list of detected trend change points, parsable by pd.to_datetime.

Type: list

trend_estimation

The estimated trend with detected trend change points.

Type: pd.Series

seasonality_df

The augmented df of original_df, including regressors of seasonality change points with different Fourier series frequencies.

Type: pandas.DataFrame

seasonality_changepoints

The dictionary of detected seasonality change points for each component. Keys are component names, and values are list of change points.

Type: dict

seasonality_estimation

The estimated seasonality with detected seasonality change points. The series has the same length as original_df. Index is timestamp, and values are the estimated seasonality at each timestamp. The seasonality estimation is the estimate of the seasonality effect after removing the trend estimated by estimate_trend_with_detected_changepoints.

Type: pandas.Series

find_trend_changepoints : callable: Finds the potential trend change points for a given time series df.

plot : callable: Plot the results after implementing find_trend_changepoints.

find_trend_changepoints(df, time_col, value_col, shift_detector=None, yearly_seasonality_order=8, yearly_seasonality_change_freq=None, resample_freq='D', trend_estimator='ridge', adaptive_lasso_initial_estimator='ridge', regularization_strength=None, actual_changepoint_min_distance='30D', potential_changepoint_distance=None, potential_changepoint_n=100, potential_changepoint_n_max=None, no_changepoint_distance_from_begin=None, no_changepoint_proportion_from_begin=0.0, no_changepoint_distance_from_end=None, no_changepoint_proportion_from_end=0.0, fast_trend_estimation=True)[source]

Finds trend change points automatically by adaptive lasso.

The algorithm aggregates the data at a user-defined frequency, which defaults to daily.

If potential_changepoint_distance is not given, potential_changepoint_n potential change points are evenly distributed over the time period, else potential_changepoint_n is overridden by:

total_time_length / potential_changepoint_distance

Users can specify either no_changepoint_proportion_from_end to specify what proportion from the end of data they do not want changepoints, or no_changepoint_distance_from_end (overrides no_changepoint_proportion_from_end) to specify how long from the end they do not want change points.
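The placement rules above can be sketched with the standard library. This is a hedged illustration, not the library's code: the function name potential_changepoints is hypothetical, and placing the candidates strictly inside the period (n points splitting it into n + 1 equal segments) is an assumption about what "evenly distributed" means.

```python
from datetime import datetime, timedelta


def potential_changepoints(start, end, n=100, distance=None, n_max=None,
                           proportion_from_end=0.0, distance_from_end=None):
    """Place candidates evenly over [start, end] and drop those too close to the end."""
    total = end - start
    if distance is not None:
        n = int(total / distance)  # total_time_length / potential_changepoint_distance
        if n_max is not None and n > n_max:
            n = n_max  # fall back to a fixed number of candidates
    # Assumption: n interior points split the period into n + 1 equal segments.
    candidates = [start + total * i / (n + 1) for i in range(1, n + 1)]
    if distance_from_end is not None:  # overrides the proportion
        cutoff = end - distance_from_end
    else:
        cutoff = end - total * proportion_from_end
    return [c for c in candidates if c <= cutoff]


start = datetime(2020, 1, 1)
end = start + timedelta(days=100)
all_points = potential_changepoints(start, end, distance=timedelta(days=10))
trimmed = potential_changepoints(start, end, distance=timedelta(days=10),
                                 distance_from_end=timedelta(days=30))
print(len(all_points), len(trimmed))  # 10 7
```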

Then all potential change points will be selected by adaptive lasso, with the initial estimator specified by adaptive_lasso_initial_estimator. If user specifies regularization_strength, then the adaptive lasso will be run with a single tuning parameter calculated based on user provided prior, else a cross-validation will be run to automatically select the tuning parameter.

A yearly seasonality is also fitted at the same time, preventing trend from catching yearly periodical changes.

A rule-based guard function is applied at the end to ensure change points are not too close, as specified by actual_changepoint_min_distance.
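The guard rule can be illustrated as a single pass over the sorted change points. This is a sketch under assumptions, not the library's guard function: the name enforce_min_distance is hypothetical, and resolving conflicts greedily from left to right is an assumption.

```python
from datetime import datetime, timedelta


def enforce_min_distance(changepoints, coefs, min_distance):
    """Drop the weaker of any two consecutive change points that are too close.

    changepoints must be sorted ascending; coefs[i] is the estimated change
    coefficient at changepoints[i].
    """
    kept = []  # list of (time, coef) pairs
    for time, coef in zip(changepoints, coefs):
        if kept and time - kept[-1][0] < min_distance:
            if abs(coef) > abs(kept[-1][1]):
                kept[-1] = (time, coef)  # the new point replaces the weaker one
        else:
            kept.append((time, coef))
    return [time for time, _ in kept]


detected = [datetime(2020, 1, 1), datetime(2020, 1, 15), datetime(2020, 3, 1)]
coefs = [0.5, -2.0, 1.0]
result = enforce_min_distance(detected, coefs, timedelta(days=30))
print(result)  # January 15 and March 1 remain
```

January 1 and January 15 are only 14 days apart, so the one with the smaller absolute coefficient (January 1) is dropped.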

Parameters

df (pandas.DataFrame) – The data df
time_col (str) – Time column name in df
value_col (str) – Value column name in df
shift_detector (greykite.algo.changepoint.shift_detection.shift_detector.ShiftDetection) – An instance of ShiftDetection for identifying level shifts and computing regressors. Level shift points will be considered as regressors when selecting change points by adaptive lasso.
yearly_seasonality_order (int, default 8) – Fourier series order to capture yearly seasonality.
yearly_seasonality_change_freq (DateOffset, Timedelta or str or None, default None) –
How often to change the yearly seasonality model. Set to None to disable this feature.

This is useful if you have more than 2.5 years of data and the detected trend without this feature is inaccurate because yearly seasonality changes over the training period. Modeling yearly seasonality separately over each period can prevent trend changepoints from fitting changes in yearly seasonality. For example, if you have 2.5 years of data and yearly seasonality increases in magnitude after the first year, setting this parameter to “365D” will model each year’s yearly seasonality differently and capture both shapes. However, without this feature, both years will have the same yearly seasonality, roughly the average effect across the training set.

Note that if you use str as input, the maximal supported unit is day, i.e., you might use “200D” but not “12M” or “1Y”.
resample_freq (DateOffset, Timedelta, str or None, default “D”.) – The frequency to aggregate data. Coarser aggregation leads to fitting longer term trends. If None, no aggregation will be done.
trend_estimator (str in [“ridge”, “lasso” or “ols”], default “ridge”.) – The estimator to estimate trend. The estimated trend is only for plotting purposes. ‘ols’ is not recommended when yearly_seasonality_order is specified other than 0, because significant over-fitting will happen. In this case, the given value is overridden by “ridge”.
adaptive_lasso_initial_estimator (str in [“ridge”, “lasso” or “ols”], default “ridge”.) – The initial estimator to compute adaptive lasso weights
regularization_strength (float in [0, 1] or None) – The regularization for change points. Greater value implies fewer change points. 0 indicates all change points, and 1 indicates no change point. If None, the tuning parameter will be selected by cross-validation. If a value is given, it will be used as the tuning parameter.
actual_changepoint_min_distance (DateOffset, Timedelta or str, default “30D”) – The minimal distance allowed between detected change points. If consecutive change points are within this minimal distance, the one with smaller absolute change coefficient will be dropped. Note: maximal unit is ‘D’, i.e., you may use units no more than ‘D’ such as ‘10D’, ‘5H’, ‘100T’, ‘200S’. The reason is that ‘W’, ‘M’ or higher has either cycles or indefinite number of days, thus is not parsable by pandas as timedelta.
potential_changepoint_distance (DateOffset, Timedelta, str or None, default None) – The distance between potential change points. If provided, will override the parameter potential_changepoint_n. Note: maximal unit is ‘D’, i.e., you may only use units no more than ‘D’ such as ‘10D’, ‘5H’, ‘100T’, ‘200S’. The reason is that ‘W’, ‘M’ or higher has either cycles or indefinite number of days, thus is not parsable by pandas as timedelta.
potential_changepoint_n (int, default 100) – Number of change points to be evenly distributed, recommended 1-2 per month, based on the training data length.
potential_changepoint_n_max (int or None, default None) – The maximum number of potential changepoints. This parameter is effective when user specifies potential_changepoint_distance, and the number of potential changepoints in the training data is more than potential_changepoint_n_max, then it is equivalent to specifying potential_changepoint_n = potential_changepoint_n_max, and ignoring potential_changepoint_distance.
no_changepoint_distance_from_begin (DateOffset, Timedelta, str or None, default None) – The length of time from the beginning of training data, within which no change point will be placed. If provided, will override the parameter no_changepoint_proportion_from_begin. Note: maximal unit is ‘D’, i.e., you may only use units no more than ‘D’ such as ‘10D’, ‘5H’, ‘100T’, ‘200S’. The reason is that ‘W’, ‘M’ or higher has either cycles or indefinite number of days, thus is not parsable by pandas as timedelta.
no_changepoint_proportion_from_begin (float in [0, 1], default 0.0.) – potential_changepoint_n change points will be placed evenly over the whole training period, however, change points that are located within the first no_changepoint_proportion_from_begin proportion of training period will not be used for change point detection.
no_changepoint_distance_from_end (DateOffset, Timedelta, str or None, default None) – The length of time from the end of training data, within which no change point will be placed. If provided, will override the parameter no_changepoint_proportion_from_end. Note: maximal unit is ‘D’, i.e., you may only use units no more than ‘D’ such as ‘10D’, ‘5H’, ‘100T’, ‘200S’. The reason is that ‘W’, ‘M’ or higher has either cycles or indefinite number of days, thus is not parsable by pandas as timedelta.
no_changepoint_proportion_from_end (float in [0, 1], default 0.0.) – potential_changepoint_n change points will be placed evenly over the whole training period, however, change points that are located within the last no_changepoint_proportion_from_end proportion of training period will not be used for change point detection.
fast_trend_estimation (bool, default True) – If True, the trend estimation is not refitted on the original data, but is a linear interpolation of the fitted trend from the resampled time series. If False, the trend estimation is refitted on the original data.

Returns

result – result dictionary with keys:

"trend_feature_df" : pandas.DataFrame

The augmented df for change detection, in other words, the design matrix for the regression model. Columns:

’changepoint0’: regressor for change point 0, equals the continuous time of the observation minus the continuous time for time of origin.

…

’changepoint{potential_changepoint_n}’: regressor for change point {potential_changepoint_n}, equals the continuous time of the observation minus the continuous time of the {potential_changepoint_n}th change point.

’cos1_conti_year_yearly’: cosine yearly seasonality regressor of first order.

’sin1_conti_year_yearly’: sine yearly seasonality regressor of first order.

…

’cos{yearly_seasonality_order}_conti_year_yearly’ : cosine yearly seasonality regressor of {yearly_seasonality_order}th order.

’sin{yearly_seasonality_order}_conti_year_yearly’ : sine yearly seasonality regressor of {yearly_seasonality_order}th order.

"trend_changepoints" : list

The list of detected change points.

"changepoints_dict" : dict

The change point dictionary that is compatible as an input with forecast.

"trend_estimation" : pandas.Series

The estimated trend with detected trend change points.

Return type

dict
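The columns of "trend_feature_df" can be illustrated by building one row of the design matrix by hand. This is a sketch under assumptions: time is measured in years, each change point regressor is clipped to zero before its change point (the description above only states the difference), and the function name trend_feature_row is hypothetical.

```python
import math


def trend_feature_row(t, changepoint_times, yearly_seasonality_order):
    """Build one design-matrix row for continuous time t, measured in years."""
    row = {}
    for k, cp in enumerate(changepoint_times):
        # Assumption: the regressor is zero before its change point.
        row[f"changepoint{k}"] = max(0.0, t - cp)
    for order in range(1, yearly_seasonality_order + 1):
        angle = 2 * math.pi * order * t
        row[f"cos{order}_conti_year_yearly"] = math.cos(angle)
        row[f"sin{order}_conti_year_yearly"] = math.sin(angle)
    return row


# changepoint0 sits at the time origin; the others are candidate change points.
row = trend_feature_row(t=1.5, changepoint_times=[0.0, 0.5, 1.0, 2.0],
                        yearly_seasonality_order=2)
for name, value in row.items():
    print(name, round(value, 3))
```

A nonzero coefficient on a changepoint column changes the slope of the fitted trend from that point onward, which is why selecting columns with adaptive lasso amounts to selecting change points.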

find_seasonality_changepoints(df, time_col, value_col, seasonality_components_df=pd.DataFrame({"name": ["tod", "tow", "conti_year"], "period": [24.0, 7.0, 1.0], "order": [3, 3, 5], "seas_names": ["daily", "weekly", "yearly"]}), resample_freq='H', regularization_strength=0.6, actual_changepoint_min_distance='30D', potential_changepoint_distance=None, potential_changepoint_n=50, no_changepoint_distance_from_end=None, no_changepoint_proportion_from_end=0.0, trend_changepoints=None)[source]

Finds the seasonality change points (defined as the time points where seasonality magnitude changes, i.e., the time series becomes “fatter” or “thinner”.)

Subtracts the estimated trend from the original time series first, then uses regression-based regularization methods to select important seasonality change points. Regressors are built from truncated Fourier series.

If you have run find_trend_changepoints before running find_seasonality_changepoints with the same df, the estimated trend will be automatically used for removing trend in find_seasonality_changepoints. Otherwise, find_trend_changepoints will be run automatically with the same parameters as you passed to find_seasonality_changepoints. If you do not want to use the same parameters, run find_trend_changepoints with your desired parameter before calling find_seasonality_changepoints.

The algorithm does an aggregation with a user-defined frequency, default hourly.

The regression features consist of potential_changepoint_n + 1 blocks of predictors. The first block consists of Fourier series according to seasonality_components_df, and the other blocks are copies of the first block truncated at the corresponding potential change point.

If potential_changepoint_distance is not given, potential_changepoint_n potential change points are evenly distributed over the time period, else potential_changepoint_n is overridden by:

total_time_length / potential_changepoint_distance

Users can specify either no_changepoint_proportion_from_end to specify what proportion from the end of data they do not want changepoints, or no_changepoint_distance_from_end (overrides no_changepoint_proportion_from_end) to specify how long from the end they do not want change points.

Then all potential change points will be selected by adaptive lasso, with the initial estimator specified by adaptive_lasso_initial_estimator. The regularization strength is specified by regularization_strength, which lies between 0 and 1.

A rule-based guard function is applied at the end to ensure change points are not too close, as specified by actual_changepoint_min_distance.

Parameters

df (pandas.DataFrame) – The data df
time_col (str) – Time column name in df
value_col (str) – Value column name in df
seasonality_components_df (pandas.DataFrame) – The df to generate seasonality design matrix, which is compatible with seasonality_components_df in find_seasonality_changepoints
resample_freq (DateOffset, Timedelta or str, default “H”.) – The frequency to aggregate data. Coarser aggregation leads to fitting longer term trends.
regularization_strength (float in [0, 1] or None, default 0.6.) – The regularization for change points. Greater value implies fewer change points. 0 indicates all change points, and 1 indicates no change point. If None, the tuning parameter will be selected by cross-validation. If a value is given, it will be used as the tuning parameter. Here “None” is not recommended, because seasonality change has different levels, and automatic selection by cross-validation may produce more change points than desired. Practically, 0.6 is a good choice for most cases. Tuning around 0.6 is recommended.
actual_changepoint_min_distance (DateOffset, Timedelta or str, default “30D”) – The minimal distance allowed between detected change points. If consecutive change points are within this minimal distance, the one with smaller absolute change coefficient will be dropped. Note: maximal unit is ‘D’, i.e., you may use units no more than ‘D’ such as ‘10D’, ‘5H’, ‘100T’, ‘200S’. The reason is that ‘W’, ‘M’ or higher has either cycles or indefinite number of days, thus is not parsable by pandas as timedelta.
potential_changepoint_distance (DateOffset, Timedelta, str or None, default None) – The distance between potential change points. If provided, will override the parameter potential_changepoint_n. Note: maximal unit is ‘D’, i.e., you may only use units no more than ‘D’ such as ‘10D’, ‘5H’, ‘100T’, ‘200S’. The reason is that ‘W’, ‘M’ or higher has either cycles or indefinite number of days, thus is not parsable by pandas as timedelta.
potential_changepoint_n (int, default 50) – Number of change points to be evenly distributed, recommended 1 per month, based on the training data length.
no_changepoint_distance_from_end (DateOffset, Timedelta, str or None, default None) – The length of time from the end of training data, within which no change point will be placed. If provided, will override the parameter no_changepoint_proportion_from_end. Note: maximal unit is ‘D’, i.e., you may only use units no more than ‘D’ such as ‘10D’, ‘5H’, ‘100T’, ‘200S’. The reason is that ‘W’, ‘M’ or higher has either cycles or indefinite number of days, thus is not parsable by pandas as timedelta.
no_changepoint_proportion_from_end (float in [0, 1], default 0.0.) – potential_changepoint_n change points will be placed evenly over the whole training period, however, only change points that are not located within the last no_changepoint_proportion_from_end proportion of training period will be used for change point detection.
trend_changepoints (list or None) – A list of user-specified trend change points, used to estimate the trend to be removed from the time series before detecting seasonality change points. If provided, the algorithm will not check existence of detected trend change points or run find_trend_changepoints, but will use these change points directly for trend estimation.

Returns

result – result dictionary with keys:

"seasonality_feature_df" : pandas.DataFrame

The augmented df for seasonality changepoint detection, in other words, the design matrix for the regression model. Columns:

”cos1_tod_daily”: cosine daily seasonality regressor of first order at change point 0.

”sin1_tod_daily”: sine daily seasonality regressor of first order at change point 0.

…

”cos1_conti_year_yearly”: cosine yearly seasonality regressor of first order at change point 0.

”sin1_conti_year_yearly”: sine yearly seasonality regressor of first order at change point 0.

…

”cos{daily_seasonality_order}_tod_daily_cp{potential_changepoint_n}” : cosine daily seasonality regressor of {daily_seasonality_order}th order at change point {potential_changepoint_n}.

”sin{daily_seasonality_order}_tod_daily_cp{potential_changepoint_n}” : sine daily seasonality regressor of {daily_seasonality_order}th order at change point {potential_changepoint_n}.

…

”cos{yearly_seasonality_order}_conti_year_yearly_cp{potential_changepoint_n}” : cosine yearly seasonality regressor of {yearly_seasonality_order}th order at change point {potential_changepoint_n}.

”sin{yearly_seasonality_order}_conti_year_yearly_cp{potential_changepoint_n}” : sine yearly seasonality regressor of {yearly_seasonality_order}th order at change point {potential_changepoint_n}.

"seasonality_changepoints"dict`[`list`[`datetime]]

The dictionary of detected seasonality change points for each component. Keys are component names, and values are lists of change points.

“seasonality_estimation” : pandas.Series

The estimated seasonality with detected seasonality change points. The series has the same length as original_df. Index is timestamp, and values are the estimated seasonality at each timestamp. The seasonality estimation is the estimate of the seasonality effect with the trend estimated by estimate_trend_with_detected_changepoints removed.

“seasonality_components_df” : pandas.DataFrame

The processed seasonality_components_df. Daily component row is removed if inferred frequency or aggregation frequency is at least one day.

Return type

dict
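
The interaction between potential_changepoint_n and no_changepoint_proportion_from_end described in the parameters above can be pictured with a small sketch. The helper name candidate_changepoints and the exact spacing rule are assumptions made for illustration; this is not the library's implementation.

```python
from datetime import datetime, timedelta


def candidate_changepoints(dates, potential_changepoint_n, no_changepoint_proportion_from_end):
    """Hypothetical helper: evenly spaced candidates minus those near the end."""
    n = len(dates)
    # Assumed spacing rule: split the period into (potential_changepoint_n + 1) equal parts.
    positions = [(i * n) // (potential_changepoint_n + 1)
                 for i in range(1, potential_changepoint_n + 1)]
    # Candidates at or after this index fall in the excluded tail of the training period.
    cutoff = n - round(n * no_changepoint_proportion_from_end)
    return [dates[p] for p in positions if p < cutoff]


dates = [datetime(2020, 1, 1) + timedelta(days=i) for i in range(100)]
kept = candidate_changepoints(dates, potential_changepoint_n=9,
                              no_changepoint_proportion_from_end=0.2)
print(len(kept), kept[-1].date())
```

With 100 daily observations, 9 candidates and a proportion of 0.2, the candidates at positions 80 and 90 are discarded and 7 remain.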

plot(observation=True, observation_original=True, trend_estimate=True, trend_change=True, yearly_seasonality_estimate=False, adaptive_lasso_estimate=False, seasonality_change=False, seasonality_change_by_component=True, seasonality_estimate=False, plot=True)

Makes a plot to show the observations/estimations/change points.

In this function, the component parameters specify whether each component is included in the plot. These are bool variables. For components set to True, the values are replaced by the corresponding data; the values of the other components are set to None. These variables are then passed to plot_change.

Parameters

observation (bool) – Whether to include observations.
observation_original (bool) – Set True to plot original observations, and False to plot aggregated observations. No effect if observation is False.
trend_estimate (bool) – Set True to add trend estimation.
trend_change (bool) – Set True to add change points.
yearly_seasonality_estimate (bool) – Set True to add estimated yearly seasonality.
adaptive_lasso_estimate (bool) – Set True to add adaptive lasso estimated trend.
seasonality_change (bool) – Set True to add seasonality change points.
seasonality_change_by_component (bool) – If True, seasonality changes will be plotted separately for different components; otherwise all will use the same symbol. No effect if seasonality_change is False.
seasonality_estimate (bool) – Set True to add estimated seasonality. The seasonality is plotted around the trend, so the actual seasonality shown is trend estimation + seasonality estimation.
plot (bool, default True) – Set to True to display the plot, and set to False to return the plotly figure object.

Returns

None (if plot == True) – The function shows a plot.
fig (plotly.graph_objects.Figure, if plot == False) – The plot object.

Benchmarking

class greykite.framework.benchmark.benchmark_class.BenchmarkForecastConfig(df: pandas.core.frame.DataFrame, configs: typing.Dict[str, greykite.framework.templates.autogen.forecast_config.ForecastConfig], tscv: greykite.sklearn.cross_validation.RollingTimeSeriesSplit, forecaster: greykite.framework.templates.forecaster.Forecaster = <greykite.framework.templates.forecaster.Forecaster object>)

Class for benchmarking multiple ForecastConfig on a rolling window basis.

df

Timeseries data to forecast. Contains columns [time_col, value_col], and optional regressor columns. Regressor columns should include future values for prediction.

Type: pandas.DataFrame

configs

Dictionary of model configurations. A model configuration is a ForecastConfig. See ForecastConfig for details on valid ForecastConfig. Validity of the configs for benchmarking is checked via the validate method.

Type: Dict [str, ForecastConfig]

tscv

Cross-validation object that determines the rolling window evaluation. See RollingTimeSeriesSplit for details. The forecast_horizon and periods_between_train_test parameters of configs are matched against that of tscv. A ValueError is raised if there is a mismatch.

Type: RollingTimeSeriesSplit

forecaster

Forecaster used to create the forecasts.

Type: Forecaster

is_run

Indicator of whether the run method has been executed. After executing run, this indicator is set to True. Some class methods, like get_forecast, require is_run to be True.

Type: bool, default False

result

Stores the benchmarking results. Has the same keys as configs.

Type: dict

forecasts

Merged DataFrame of forecasts, upper and lower confidence interval for all input configs. Also stores train end date and forecast step for each prediction.

Type: pandas.DataFrame, default None

validate()

Validates the inputs to the class for the method run.

Raises a ValueError if there is a mismatch between the following parameters of configs and tscv:

forecast_horizon

periods_between_train_test

Raises a ValueError if the configs do not all have the same coverage parameter.
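
The checks listed above can be sketched as follows. ToyConfig is a hypothetical, flattened stand-in for the fields that validate compares, and validate_configs is an illustrative name. The real ForecastConfig nests these values differently, so treat this as a sketch of the logic only.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ToyConfig:
    """Hypothetical stand-in holding only the fields compared during validation."""
    forecast_horizon: int
    periods_between_train_test: int
    coverage: Optional[float]


def validate_configs(configs: Dict[str, ToyConfig], tscv_horizon: int, tscv_gap: int) -> None:
    """Raises ValueError on any of the mismatches described above."""
    for name, config in configs.items():
        if config.forecast_horizon != tscv_horizon:
            raise ValueError(f"forecast_horizon of '{name}' does not match tscv")
        if config.periods_between_train_test != tscv_gap:
            raise ValueError(f"periods_between_train_test of '{name}' does not match tscv")
    # All configs must share one coverage value so their intervals are comparable.
    if len({config.coverage for config in configs.values()}) > 1:
        raise ValueError("all configs must share the same coverage")


configs = {
    "a": ToyConfig(forecast_horizon=7, periods_between_train_test=0, coverage=0.95),
    "b": ToyConfig(forecast_horizon=7, periods_between_train_test=0, coverage=0.90),
}
try:
    validate_configs(configs, tscv_horizon=7, tscv_gap=0)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)
```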

run()

Runs every config and stores the output of the forecast_pipeline. This function runs only if the configs and tscv are jointly valid.

Returns: self – Returns self. Stores the pipeline output of every config in self.result.

extract_forecasts()

Extracts forecasts, upper and lower confidence interval for each individual config. This is saved as a pandas.DataFrame with the name rolling_forecast_df within the corresponding config of self.result. e.g. if config key is “silverkite”, then the forecasts are stored in self.result["silverkite"]["rolling_forecast_df"].

This method also constructs a merged DataFrame of forecasts, upper and lower confidence interval for all input configs.

plot_forecasts_by_step(forecast_step: int, config_names: Optional[List] = None, xlabel: str = 'ts', ylabel: str = 'y', title: Optional[str] = None, showlegend: bool = True)

Returns a forecast_step ahead rolling forecast plot. The plot consists of one line for each valid config name in config_names. If available, the corresponding actual values are also plotted.

For a more customizable plot, see plot_multivariate.

Parameters

forecast_step (int) – Which forecast step to plot. A forecast step is an integer between 1 and the forecast horizon, inclusive, indicating the number of periods from train end date to the prediction date (# steps ahead).
config_names (list [str], default None) – Which config results to plot. A list of config names. If None, uses all the available config keys.
xlabel (str or None, default TIME_COL) – x-axis label.
ylabel (str or None, default VALUE_COL) – y-axis label.
title (str or None, default None) – Plot title. If None, default is based on forecast_step.
showlegend (bool, default True) – Whether to show the legend.

Returns

fig – Interactive plotly graph. Plots multiple column(s) in self.forecasts against TIME_COL.

See plot_forecast_vs_actual return value for how to plot the figure and add customization.

Return type

plotly.graph_objects.Figure

plot_forecasts_by_config(config_name: str, colors: List = ['rgb(31, 119, 180)', 'rgb(255, 127, 14)', 'rgb(44, 160, 44)', 'rgb(214, 39, 40)', 'rgb(148, 103, 189)', 'rgb(140, 86, 75)', 'rgb(227, 119, 194)', 'rgb(127, 127, 127)', 'rgb(188, 189, 34)', 'rgb(23, 190, 207)'], xlabel: str = 'ts', ylabel: str = 'y', title: Optional[str] = None, showlegend: bool = True)

Returns a rolling plot of the forecasts by config_name against TIME_COL. The plot consists of one line for each available split. Lines may overlap if the test periods of the corresponding splits intersect, so every line is given a different color. If available, the corresponding actual values are also plotted.

For a more customizable plot, see plot_multivariate_grouped.

Parameters

config_name (str) – Which config result to plot. The name must match the name of one of the input configs.
colors ([str, List [str]], default DEFAULT_PLOTLY_COLORS) – Which colors to use to build the color palette. This can be a list of RGB colors or a str from PLOTLY_SCALES. To use a single color for all lines, pass a List with a single color.
xlabel (str or None, default TIME_COL) – x-axis label.
ylabel (str or None, default VALUE_COL) – y-axis label.
title (str or None, default None) – Plot title. If None, default is based on config_name.
showlegend (bool, default True) – Whether to show the legend.

Returns

fig – Interactive plotly graph. Plots multiple column(s) in self.forecasts against TIME_COL.

Return type

plotly.graph_objects.Figure

get_evaluation_metrics(metric_dict: Dict, config_names: Optional[List] = None)

Returns rolling train and test evaluation metric values.

Parameters

metric_dict (dict [str, callable]) –

Evaluation metrics to compute.

key: evaluation metric name, used to create column name in output.

value: metric function to apply to forecast df in each split to generate the column value.
Signature (y_true: str, y_pred: str) -> transformed value: float.

For example:

metric_dict = {
    "median_residual": lambda y_true, y_pred: np.median(y_true - y_pred),
    "mean_squared_error": lambda y_true, y_pred: np.mean((y_true - y_pred)**2)
}

Some predefined functions are available in evaluation. For example:

metric_dict = {
    "correlation": lambda y_true, y_pred: correlation(y_true, y_pred),
    "RMSE": lambda y_true, y_pred: root_mean_squared_error(y_true, y_pred),
    "Q_95": lambda y_true, y_pred: quantile_loss(y_true, y_pred, q=0.95)
}

As shorthand, it is sufficient to provide the corresponding EvaluationMetricEnum member. These are auto-expanded into the appropriate function. So the following is equivalent:

metric_dict = {
    "correlation": EvaluationMetricEnum.Correlation,
    "RMSE": EvaluationMetricEnum.RootMeanSquaredError,
    "Q_95": EvaluationMetricEnum.Quantile95
}

config_names (list [str], default None) – Which config results to evaluate. A list of config names. If None, uses all the available config keys.

Returns

evaluation_metrics_df – A DataFrame containing splitwise train and test evaluation metrics for metric_dict and config_names.

For example, assume:

metric_dict = {
    "RMSE": EvaluationMetricEnum.RootMeanSquaredError,
    "Q_95": EvaluationMetricEnum.Quantile95
}

config_names = ["default_prophet", "custom_silverkite"]

These are valid config_names and there are 2 splits for each.

Then evaluation_metrics_df =

config_name     split_num   train_RMSE  test_RMSE   train_Q_95  test_Q_95
default_prophet      0          *           *           *           *
default_prophet      1          *           *           *           *
custom_silverkite    0          *           *           *           *
custom_silverkite    1          *           *           *           *

where * represents computed values.

Return type

pandas.DataFrame
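
To make the layout of evaluation_metrics_df concrete, the following sketch builds the same kind of rows with plain Python. The split data, the config name config_a and the two metric functions are made up for illustration. Only the column naming (train_ and test_ prefixes per metric, one row per config and split) follows the description above.

```python
import math
from statistics import median


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def median_residual(y_true, y_pred):
    return median(t - p for t, p in zip(y_true, y_pred))


metric_dict = {"RMSE": rmse, "median_residual": median_residual}

# Hypothetical per-split data: (train actuals, train forecasts, test actuals, test forecasts).
splits = {
    "config_a": [
        ([1.0, 2.0, 3.0], [1.0, 2.0, 4.0], [4.0, 5.0], [4.0, 3.0]),
        ([2.0, 3.0, 4.0], [2.0, 3.0, 4.0], [5.0, 6.0], [5.0, 6.0]),
    ],
}

rows = []
for config_name, config_splits in splits.items():
    for split_num, (y_tr, p_tr, y_te, p_te) in enumerate(config_splits):
        row = {"config_name": config_name, "split_num": split_num}
        # Each metric contributes one train column and one test column.
        for metric_name, metric_func in metric_dict.items():
            row[f"train_{metric_name}"] = metric_func(y_tr, p_tr)
            row[f"test_{metric_name}"] = metric_func(y_te, p_te)
        rows.append(row)

for row in rows:
    print(row)
```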

plot_evaluation_metrics(metric_dict: Dict, config_names: Optional[List] = None, xlabel: Optional[str] = None, ylabel: str = 'Metric value', title: Optional[str] = None, showlegend: bool = True)

Returns a barplot of the train and test values of metric_dict of config_names. Values of a metric for all config_names are plotted as a grouped bar. Train and test values of a metric are plotted side-by-side for easy comparison.

Parameters

metric_dict (dict [str, callable]) – Evaluation metrics to compute. Same as get_evaluation_metrics. To get the best visualization, keep number of metrics <= 2.
config_names (list [str], default None) – Which config results to plot. A list of config names. If None, uses all the available config keys.
xlabel (str or None, default None) – x-axis label.
ylabel (str or None, default “Metric value”) – y-axis label.
title (str or None, default None) – Plot title.
showlegend (bool, default True) – Whether to show the legend.

Returns

fig – Interactive plotly bar plot.

Return type

plotly.graph_objects.Figure

get_grouping_evaluation_metrics(metric_dict: Dict, config_names: Optional[List] = None, which: str = 'train', groupby_time_feature: Optional[str] = None, groupby_sliding_window_size: Optional[int] = None, groupby_custom_column: Optional[Series] = None)

Returns splitwise rolling evaluation metric values. These values are grouped by the grouping method chosen by groupby_time_feature, groupby_sliding_window_size and groupby_custom_column.

See get_grouping_evaluation for details on grouping method.

Parameters

metric_dict (dict [str, callable]) – Evaluation metrics to compute. Same as get_evaluation_metrics.
config_names (list [str], default None) – Which config results to evaluate. A list of config names. If None, uses all the available config keys.
which (str) – “train” or “test”. Which dataset to evaluate.
groupby_time_feature (str or None, default None) – If provided, groups by a column generated by build_time_features_df. See that function for valid values.
groupby_sliding_window_size (int or None, default None) – If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
groupby_custom_column (pandas.Series or None, default None) – If provided, groups by this column value. Should be same length as the DataFrame.

Returns: grouped_evaluation_df – A DataFrame containing splitwise train and test evaluation metrics for metric_dict and config_names. The evaluation metrics are grouped by the grouping method.
Return type: pandas.DataFrame
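
As an illustration of the groupby_sliding_window_size option, the sketch below partitions forecasts sequentially into fixed-size groups and evaluates a metric per group. The function names are hypothetical, and it assumes groups are counted from the start with a possibly shorter final group; the library may align groups differently.

```python
def sliding_window_groups(values, window_size):
    """Partitions values sequentially into chunks of window_size.

    Assumption: chunks are counted from the start and the last chunk may be shorter.
    """
    return [values[i:i + window_size] for i in range(0, len(values), window_size)]


def grouped_metric(y_true, y_pred, window_size, metric_func):
    """Applies metric_func to each sequential group of (actual, forecast) pairs."""
    pairs = list(zip(y_true, y_pred))
    result = []
    for group in sliding_window_groups(pairs, window_size):
        actual = [a for a, _ in group]
        forecast = [f for _, f in group]
        result.append(metric_func(actual, forecast))
    return result


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


mae_by_group = grouped_metric(
    y_true=[1.0, 2.0, 3.0, 4.0, 5.0],
    y_pred=[1.0, 3.0, 3.0, 2.0, 4.0],
    window_size=2,
    metric_func=mean_absolute_error,
)
print(mae_by_group)  # [0.5, 1.0, 1.0]
```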

plot_grouping_evaluation_metrics(metric_dict: Dict, config_names: Optional[List] = None, which: str = 'train', groupby_time_feature: Optional[str] = None, groupby_sliding_window_size: Optional[int] = None, groupby_custom_column: Optional[Series] = None, xlabel=None, ylabel='Metric value', title=None, showlegend=True)

Returns a line plot of the grouped evaluation values of metric_dict of config_names. These values are grouped by the grouping method chosen by groupby_time_feature, groupby_sliding_window_size and groupby_custom_column.

See get_grouping_evaluation for details on grouping method.

Parameters

metric_dict (dict [str, callable]) – Evaluation metrics to compute. Same as get_evaluation_metrics. To get the best visualization, keep number of metrics <= 2.
config_names (list [str], default None) – Which config results to plot. A list of config names. If None, uses all the available config keys.
which (str) – “train” or “test”. Which dataset to evaluate.
groupby_time_feature (str or None, optional) – If provided, groups by a column generated by build_time_features_df. See that function for valid values.
groupby_sliding_window_size (int or None, optional) – If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
groupby_custom_column (pandas.Series or None, optional) – If provided, groups by this column value. Should be same length as the DataFrame.
xlabel (str or None, default None) – x-axis label. If None, label is determined by the groupby column name.
ylabel (str or None, default “Metric value”) – y-axis label.
title (str or None, default None) – Plot title. If None, default is based on config_name.
showlegend (bool, default True) – Whether to show the legend.

Returns: fig – Interactive plotly graph.
Return type: plotly.graph_objects.Figure

get_runtimes(config_names: Optional[List] = None)

Returns rolling average runtime in seconds for config_names.

Parameters

config_names (list [str], default None) – Which config results to include. A list of config names. If None, uses all the available config keys.

Returns

runtimes_df – A DataFrame containing splitwise runtime in seconds for config_names.

For example, assume:

config_names = ["default_prophet", "custom_silverkite"]

These are valid config_names and there are 2 splits for each.

Then runtimes_df =

config_name     split_num   runtime_sec
default_prophet      0          *
default_prophet      1          *
custom_silverkite    0          *
custom_silverkite    1          *

where * represents computed values.

Return type

pandas.DataFrame

plot_runtimes(config_names: Optional[List] = None, xlabel: Optional[str] = None, ylabel: str = 'Mean runtime in seconds', title: str = 'Average runtime across rolling windows', showlegend: bool = True)

Returns a barplot of the runtimes of config_names.

Parameters

config_names (list [str], default None) – Which config results to plot. A list of config names. If None, uses all the available config keys.
xlabel (str or None, default None) – x-axis label.
ylabel (str or None, default “Mean runtime in seconds”) – y-axis label.
title (str or None, default “Average runtime across rolling windows”) – Plot title.
showlegend (bool, default True) – Whether to show the legend.

Returns

fig – Interactive plotly bar plot.

Return type

plotly.graph_objects.Figure

get_valid_config_names(config_names: Optional[List] = None)

Validates config_names against the keys of configs. Raises a ValueError in case of a mismatch.

Parameters: config_names (list [str], default None) – Which config names to validate. A list of config names. If None, uses all the available config keys.
Returns: config_names – List of valid config names.
Return type: list

static autocomplete_metric_dict(metric_dict, enum_class)

Sweeps through metric_dict, converting members of enum_class to their corresponding evaluation function.

For example:

metric_dict = {
    "correlation": EvaluationMetricEnum.Correlation,
    "RMSE": EvaluationMetricEnum.RootMeanSquaredError,
    "Q_95": EvaluationMetricEnum.Quantile95,
    "custom_metric": custom_function
}

is converted to

metric_dict = {
    "correlation": correlation(y_true, y_pred),
    "RMSE": root_mean_squared_error(y_true, y_pred),
    "Q_95": quantile_loss_q(y_true, y_pred, q=0.95),
    "custom_metric": custom_function
}

Parameters

metric_dict (dict [str, callable]) – Evaluation metrics to compute. Same as get_evaluation_metrics.
enum_class (Enum) – The enum class that metric_dict elements might be members of. It must have a method get_metric_func.

Returns

updated_metric_dict – Autocompleted metric dict.

Return type

dict

Cross Validation

class greykite.sklearn.cross_validation.RollingTimeSeriesSplit(forecast_horizon, min_train_periods=None, expanding_window=False, use_most_recent_splits=False, periods_between_splits=None, periods_between_train_test=0, max_splits=3)

Flexible splitter for time-series cross validation and rolling window evaluation. Suitable for use in GridSearchCV.

min_splits

Guaranteed min number of splits. This is always set to 1. If provided configuration results in 0 splits, the cross validator will yield a default split.

Type: int

__starting_test_index

Test end index of the first CV split. Actual offset = __starting_test_index + _get_offset(X), for a particular dataset X. Cross validator ensures the last test split contains the last observation in X.

Type: int

Examples

>>> import numpy as np
>>> from greykite.sklearn.cross_validation import RollingTimeSeriesSplit
>>> X = np.random.rand(20, 4)
>>> tscv = RollingTimeSeriesSplit(forecast_horizon=3, max_splits=4)
>>> tscv.get_n_splits(X=X)
4
>>> for train, test in tscv.split(X=X):
...     print(train, test)
[2 3 4 5 6 7] [ 8  9 10]
[ 5  6  7  8  9 10] [11 12 13]
[ 8  9 10 11 12 13] [14 15 16]
[11 12 13 14 15 16] [17 18 19]
>>> X = np.random.rand(20, 4)
>>> tscv = RollingTimeSeriesSplit(forecast_horizon=2,
...                               min_train_periods=4,
...                               expanding_window=True,
...                               periods_between_splits=4,
...                               periods_between_train_test=2,
...                               max_splits=None)
>>> tscv.get_n_splits(X=X)
4
>>> for train, test in tscv.split(X=X):
...     print(train, test)
[0 1 2 3] [6 7]
[0 1 2 3 4 5 6 7] [10 11]
[ 0  1  2  3  4  5  6  7  8  9 10 11] [14 15]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15] [18 19]
>>> X = np.random.rand(5, 4)  # default split if there is not enough data
>>> for train, test in tscv.split(X=X):
...     print(train, test)
[0 1 2 3] [4]
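
The geometry of the first two examples above can be reproduced with a short pure-Python sketch that anchors the last test window at the final observation and walks backwards. It is not the library's implementation: the function name rolling_splits is hypothetical, the defaults (min_train_periods equal to twice the horizon, periods_between_splits equal to the horizon) are assumptions chosen to match the examples, and both max_splits capping and the default split for short data are left out.

```python
def rolling_splits(n_samples, forecast_horizon, min_train_periods=None,
                   expanding_window=False, periods_between_splits=None,
                   periods_between_train_test=0):
    """Illustrative rolling-window splitter, anchored at the last observation."""
    if min_train_periods is None:
        min_train_periods = 2 * forecast_horizon  # assumed default
    if periods_between_splits is None:
        periods_between_splits = forecast_horizon  # assumed default
    splits = []
    test_end = n_samples  # exclusive end of the most recent test window
    while True:
        test_start = test_end - forecast_horizon
        train_end = test_start - periods_between_train_test
        train_start = 0 if expanding_window else train_end - min_train_periods
        # Stop once there is not enough history for a full training window.
        if train_start < 0 or train_end - train_start < min_train_periods:
            break
        splits.append((list(range(train_start, train_end)),
                       list(range(test_start, test_end))))
        test_end -= periods_between_splits
    return splits[::-1]  # oldest split first


for train, test in rolling_splits(20, forecast_horizon=3):
    print(train, test)
```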

split(X, y=None, groups=None)

Generates indices to split data into training and test CV folds according to rolling window time series cross validation.

Parameters

X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features. Must have a shape attribute.
y (array-like, shape (n_samples,), optional) – The target variable for supervised learning problems. Always ignored, exists for compatibility.
groups (array-like, with shape (n_samples,), optional) – Group labels for the samples used while splitting the dataset into train/test set. Always ignored, exists for compatibility.

Yields

train (numpy.array) – The training set indices for that split.
test (numpy.array) – The testing set indices for that split.

get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations yielded by the cross-validator.

Parameters

X (array-like, shape (n_samples, n_features)) – Input data to split
y (object) – Always ignored, exists for compatibility.
groups (object) – Always ignored, exists for compatibility.

Returns

n_splits – The number of splitting iterations yielded by the cross-validator.

Return type

int

get_n_splits_without_capping(X=None)

Returns the number of splitting iterations in the cross-validator as configured, ignoring self.max_splits and self.min_splits.

Parameters: X (array-like, shape (n_samples, n_features)) – Input data to split.
Returns: n_splits – The number of splitting iterations in the cross-validator as configured, ignoring self.max_splits and self.min_splits.
Return type: int

_get_offset(X=None)

Returns an offset to add to test set indices when creating CV splits. CV splits are shifted so that the last test observation is the last point in X. This shift does not affect the total number of splits.

Parameters: X (array-like, shape (n_samples, n_features)) – Input data to split.
Returns: offset – The number of observations to ignore at the beginning of X when creating CV splits.
Return type: int

_sample_splits(num_splits, seed=48912)

Samples up to max_splits items from list(range(num_splits)).

If use_most_recent_splits is True, highest split indices up to max_splits are retained. Otherwise, the following sampling scheme is implemented:

takes the last 2 splits

samples from the rest uniformly at random

Parameters

num_splits (int) – Number of splits before sampling.
seed (int) – Seed for random sampling.

Returns

n_splits – Indices of splits to keep (subset of list(range(num_splits))).

Return type

list
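
The sampling scheme described above can be sketched in pure Python. This is a hedged illustration, not the library's code: the standalone function sample_splits takes max_splits and use_most_recent_splits as arguments instead of reading them from the instance, it uses the standard library random module (so the sampled indices will not match the library's for the same seed), and its handling of max_splits below 2 is an assumption.

```python
import random


def sample_splits(num_splits, max_splits, use_most_recent_splits=False, seed=48912):
    """Returns the indices of the splits to keep, in increasing order."""
    indices = list(range(num_splits))
    if max_splits is None or num_splits <= max_splits:
        return indices  # nothing needs to be dropped
    if use_most_recent_splits:
        return indices[num_splits - max_splits:]
    # Always keep the last 2 splits (assumption: fewer if max_splits < 2).
    n_recent = min(2, max_splits)
    recent = indices[num_splits - n_recent:]
    # Fill the remaining slots uniformly at random from the older splits.
    rng = random.Random(seed)
    sampled = rng.sample(indices[:num_splits - n_recent], max_splits - n_recent)
    return sorted(sampled) + recent


print(sample_splits(10, 3, use_most_recent_splits=True))  # [7, 8, 9]
print(sample_splits(10, 4))
```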

_iter_test_indices(X=None, y=None, groups=None)

Class directly implements split instead of providing this function.

_iter_test_masks(X=None, y=None, groups=None)

Generates boolean masks corresponding to test sets.

By default, delegates to _iter_test_indices(X, y, groups).

Transformers

class greykite.sklearn.transform.zscore_outlier_transformer.ZscoreOutlierTransformer(z_cutoff=None, use_fit_baseline=False)

Replaces outliers in data with NaN. Outliers are determined by z-score cutoff. Columns are handled independently.

Parameters

z_cutoff (float or None, default None) – z-score cutoff to define outliers. If None, this transformer is a no-op.
use_fit_baseline (bool, default False) –
If True, the z-scores are calculated using the mean and standard deviation of the dataset passed to fit.

If False, the transformer is stateless. z-scores are calculated for the dataset passed to transform, regardless of fit.

mean

Mean of each column. NaNs are ignored.

Type: pandas.Series

std

Standard deviation of each column. NaNs are ignored.

Type: pandas.Series

_is_fitted

Whether the transformer is fitted.

Type: bool

fit(X, y=None)

Computes the column mean and standard deviation, stored as mean and std attributes.

Parameters

X (pandas.DataFrame) – Training input data. e.g. each column is a timeseries. Columns are expected to be numeric.
y (None) – There is no need of a target in a transformer, yet the pipeline API requires this parameter.

Returns

self – Returns self.

Return type

ZscoreOutlierTransformer

transform(X)

Replaces outliers with NaN.

Parameters: X (pandas.DataFrame) – Data to transform. e.g. each column is a timeseries. Columns are expected to be numeric.
Returns: X_outlier – A copy of the data frame with original values and outliers replaced with NaN.
Return type: pandas.DataFrame
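
A minimal pandas sketch of the z-score replacement follows. The function name replace_zscore_outliers is hypothetical, and using the sample standard deviation (the pandas default) is an assumption. Passing mean and std computed on training data mimics use_fit_baseline=True, while omitting them mimics the stateless mode.

```python
import numpy as np
import pandas as pd


def replace_zscore_outliers(X, z_cutoff, mean=None, std=None):
    """Returns a copy of X with values whose |z-score| exceeds z_cutoff set to NaN."""
    if z_cutoff is None:
        return X.copy()  # no-op, as described for z_cutoff=None
    # Stateless mode: use the statistics of X itself; pandas skips NaNs by default.
    mean = X.mean() if mean is None else mean
    std = X.std() if std is None else std
    z = (X - mean) / std  # columns are handled independently
    return X.mask(z.abs() > z_cutoff)


X = pd.DataFrame({
    "a": [1.0] * 9 + [50.0],
    "b": [float(i) for i in range(1, 11)],
})
cleaned = replace_zscore_outliers(X, z_cutoff=2.0)
print(cleaned.isna().sum())
```

Only the value 50.0 in column a exceeds the cutoff; column b is returned unchanged.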

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

class greykite.sklearn.transform.normalize_transformer.NormalizeTransformer(normalize_algorithm=None, normalize_params=None)

Normalizes time series data.

Parameters

normalize_algorithm (str or None, default None) –
Which algorithm to use. Valid options are:
- “MinMaxScaler” : sklearn.preprocessing.MinMaxScaler
- “MaxAbsScaler” : sklearn.preprocessing.MaxAbsScaler
- “StandardScaler” : sklearn.preprocessing.StandardScaler
- “RobustScaler” : sklearn.preprocessing.RobustScaler
- “Normalizer” : sklearn.preprocessing.Normalizer
- “QuantileTransformer” : sklearn.preprocessing.QuantileTransformer
- “PowerTransformer” : sklearn.preprocessing.PowerTransformer
If None, this transformer is a no-op. No normalization is done.
normalize_params (dict or None, default None) – Params to initialize the normalization scaler/transformer.

scaler

sklearn class used for normalization

Type: class

_is_fitted

Whether the transformer is fitted.

Type: bool

fit(X, y=None)

Fits the normalization transform.

Parameters

X (pandas.DataFrame) – Training input data. e.g. each column is a timeseries. Columns are expected to be numeric.
y (None) – There is no need of a target in a transformer, yet the pipeline API requires this parameter.

Returns

self – Returns self.

Return type

NormalizeTransformer

transform(X)

Normalizes data using the specified scaling method.

Parameters: X (pandas.DataFrame) – Data to transform. e.g. each column is a timeseries. Columns are expected to be numeric.
Returns: X_normalized – A normalized copy of the data frame.
Return type: pandas.DataFrame
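
The mapping from normalize_algorithm to a scikit-learn class can be sketched as below. The helper normalize is hypothetical and combines fit and transform in one call for brevity. Looking the class up by name in sklearn.preprocessing is an assumption about how such a mapping could work, not a statement about the library's internals.

```python
import pandas as pd
import sklearn.preprocessing


def normalize(X_train, X_new, normalize_algorithm=None, normalize_params=None):
    """Fits the named sklearn scaler on X_train and applies it to X_new."""
    if normalize_algorithm is None:
        return X_new.copy()  # no-op when no algorithm is given
    scaler_class = getattr(sklearn.preprocessing, normalize_algorithm)
    scaler = scaler_class(**(normalize_params or {})).fit(X_train)
    # sklearn returns a numpy array here, so restore the index and column labels.
    return pd.DataFrame(scaler.transform(X_new), index=X_new.index, columns=X_new.columns)


X = pd.DataFrame({"a": [0.0, 5.0, 10.0], "b": [2.0, 4.0, 6.0]})
scaled = normalize(X, X, "MinMaxScaler", {"feature_range": (0, 1)})
print(scaled)
```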

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

class greykite.sklearn.transform.null_transformer.NullTransformer(max_frac=0.1, impute_algorithm=None, impute_params=None, impute_all=True)

Imputes nulls in time series data.

This transform is stateless in the sense that transform output does not depend on the data passed to fit. The dataset passed to transform is used to impute itself.

Parameters

max_frac (float, default 0.10) – Issues a warning if the fraction of nulls is above this value.
impute_algorithm (str or None, default None) –
Which imputation algorithm to use. Valid options are:
- “interpolate” : pandas.DataFrame.interpolate
- “ts_interpolate” : impute_with_lags_multi
If None, this transformer is a no-op. No null imputation is done.
impute_params (dict or None, default None) –
Params to pass to the imputation algorithm. See pandas.DataFrame.interpolate and impute_with_lags_multi for their respective options.

For pandas “interpolate”, the “ffill”, “pad”, “bfill”, “backfill” methods are not allowed to avoid confusion with the fill axis parameter. Use “linear” with axis=0 instead, with direction controlled by limit_direction.

If None, uses the defaults provided in this class.
impute_all (bool, default True) –
Whether to impute all values. If True, NaNs are not allowed in the transformed result. Ignored if impute_algorithm is None.

The transform specified by impute_algorithm and impute_params may leave NaNs in the dataset. For example, if it fills in the forward direction but the first value in a column is NaN.

A first pass is taken with the impute algorithm specified. A second pass is taken with the “interpolate” algorithm (method=”linear”, limit_direction=”both”) to fill in remaining NaNs.

null_frac

The fraction of data points that are null.

Type: int

_is_fitted

Whether the transformer is fitted.

Type: bool

missing_info

Information about the missing data. Set by transform if impute_algorithm = "ts_interpolate".

Type: dict

fit(X, y=None)

Updates self.impute_params.

Parameters

X (pandas.DataFrame) – Training input data. e.g. each column is a timeseries. Columns are expected to be numeric.
y (None) – There is no need of a target in a transformer, yet the pipeline API requires this parameter.

Returns

self – Returns self.

Return type

NullTransformer

transform(X)

Imputes missing values in input time series.

Checks the percentage of data points that are null, and issues a warning if it exceeds self.max_frac.

Parameters: X (pandas.DataFrame) – Data to transform. e.g. each column is a timeseries. Columns are expected to be numeric.
Returns: X_imputed – A copy of the data frame with original values and missing values imputed.
Return type: pandas.DataFrame
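
The two-pass behaviour controlled by impute_all can be illustrated with pandas alone. The helper impute_nulls is hypothetical, and its first-pass defaults (linear interpolation in the forward direction) are assumptions picked so that a leading NaN survives the first pass; the class's actual defaults are not stated here.

```python
import warnings

import pandas as pd


def impute_nulls(X, impute_params=None, impute_all=True, max_frac=0.10):
    """Two-pass interpolation sketch; returns the imputed copy and the null fraction."""
    null_frac = X.isna().mean().mean()  # fraction of null cells (assumed definition)
    if null_frac > max_frac:
        warnings.warn(f"{null_frac:.0%} of values are null, above max_frac={max_frac}")
    params = {"method": "linear", "limit_direction": "forward", "axis": 0}
    params.update(impute_params or {})
    X_imputed = X.interpolate(**params)  # first pass: the user-specified fill
    if impute_all:
        # Second pass fills whatever the first pass left, e.g. leading NaNs.
        X_imputed = X_imputed.interpolate(method="linear", limit_direction="both", axis=0)
    return X_imputed, null_frac


nan = float("nan")
X = pd.DataFrame({"y": [nan, 1.0, nan, 3.0, nan]})
imputed, null_frac = impute_nulls(X)
first_pass_only, _ = impute_nulls(X, impute_all=False)
print(imputed["y"].tolist(), first_pass_only["y"].tolist())
```

The forward-only first pass leaves the leading NaN in place; the second pass with limit_direction="both" fills it.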

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

class greykite.sklearn.transform.drop_degenerate_transformer.DropDegenerateTransformer(drop_degenerate=False)[source]

Removes degenerate (constant) columns.

Parameters: drop_degenerate (bool, default False) – Whether to drop degenerate columns.

drop_cols

Degenerate columns to drop

Type: list [str] or None

keep_cols

Columns to keep

Type: list [str] or None

fit(X, y=None)[source]

Identifies the degenerate columns, and sets self.keep_cols and self.drop_cols.

Parameters

X (pandas.DataFrame) – Training input data. e.g. each column is a timeseries. Columns are expected to be numeric.
y (None) – There is no need of a target in a transformer, yet the pipeline API requires this parameter.

Returns

self – Returns self.

Return type

transform(X)[source]

Drops the degenerate columns identified by fit.

Parameters: X (pandas.DataFrame) – Data to transform. e.g. each column is a timeseries. Columns are expected to be numeric.
Returns: X_subset – Selected columns of X. Keeps columns that were not degenerate on the training data.
Return type: pandas.DataFrame
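
The fit/transform split described above can be illustrated with plain pandas. This is a minimal sketch, not the transformer's implementation: it assumes a column counts as degenerate when it takes a single value on the training data, and the frames `train` and `test` are made up for illustration.

```python
import pandas as pd

# Hypothetical training and test frames; "b" is constant on the training data.
train = pd.DataFrame({"a": [1, 2, 3], "b": [5, 5, 5], "c": [0.1, 0.2, 0.1]})
test = pd.DataFrame({"a": [4, 4], "b": [5, 6], "c": [0.3, 0.3]})

# "fit": decide which columns to drop, based on the training data only.
drop_cols = [col for col in train.columns if train[col].nunique(dropna=False) <= 1]
keep_cols = [col for col in train.columns if col not in drop_cols]

# "transform": select the columns kept at fit time, whatever the new data looks like.
test_subset = test[keep_cols]
```

Note that "a" is constant in `test` but is still kept, because the decision is made on the training data.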

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

Quantile Regression

class greykite.algo.common.l1_quantile_regression.QuantileRegression(quantile: float = 0.9, alpha: float = 0.001, sample_weight: Optional[np.typing.ArrayLike] = None, feature_weight: Optional[np.typing.ArrayLike] = None, max_iter: int = 100, tol: float = 0.01, fit_intercept: bool = True, optimize_mape: bool = False)[source]

Implements the quantile regression model.

Supports sample weights, L1 regularization, and weighted L1 regularization. These options can be configured to support different use cases. For example, setting quantile to 0.5 and the sample weight to the inverse absolute value of the response minimizes the MAPE.
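
The link between the median quantile loss and MAPE can be checked numerically. The sketch below uses the standard pinball (quantile) loss written in NumPy. The helper `pinball_loss` and the data are hypothetical, and the class's own objective, including its regularization term, is not reproduced here.

```python
import numpy as np

def pinball_loss(y, y_pred, quantile, sample_weight=None):
    """Weighted mean pinball (quantile) loss."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y - y_pred
    # quantile * diff when under-predicting, (1 - quantile) * |diff| when over-predicting.
    loss = np.maximum(quantile * diff, (quantile - 1) * diff)
    if sample_weight is None:
        sample_weight = np.ones_like(y)
    return float(np.mean(sample_weight * loss))

y = np.array([10.0, 20.0, 40.0])
y_pred = np.array([12.0, 18.0, 30.0])

mape = float(np.mean(np.abs(y - y_pred) / np.abs(y)))
weighted_loss = pinball_loss(y, y_pred, quantile=0.5, sample_weight=1 / np.abs(y))
```

With quantile 0.5 and weights 1 / |y|, the weighted loss equals half the MAPE (expressed as a fraction), so minimizing one minimizes the other.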

fit(X: np.typing.ArrayLike, y: np.typing.ArrayLike) → QuantileRegression[source]

Fits the quantile regression model.

Parameters

X (numpy.array, pandas.DataFrame or pandas.Series) – The design matrix.
y (numpy.array, pandas.DataFrame or pandas.Series) – The response vector.

Return type

self

predict(X: np.typing.ArrayLike) → np.array[source]

Makes predictions for a given X.

Parameters: X (numpy.array, pandas.DataFrame or pandas.Series) – The design matrix used for prediction.

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

score(X, y, sample_weight=None)

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.

Parameters

X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – \(R^2\) of self.predict(X) wrt. y.

Return type

float

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score. This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
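
The definition of \(R^2\) given for score can be reproduced directly in NumPy. This is an illustrative sketch with made-up data. The helper `r2` is hypothetical and handles only the single-output, unweighted case.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - u / v, as defined above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    u = ((y_true - y_pred) ** 2).sum()          # residual sum of squares
    v = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
    return float(1 - u / v)

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2(y_true, y_true)                               # best possible score
constant = r2(y_true, np.full(4, y_true.mean()))           # always predicts the mean
reversed_order = r2(y_true, np.array([4.0, 3.0, 2.0, 1.0]))  # worse than the mean
```

The three cases give 1.0, 0.0, and a negative score, matching the description above.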

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

Hierarchical Forecast

class greykite.algo.reconcile.convex.reconcile_forecasts.ReconcileAdditiveForecasts[source]

Reconciles forecasts to satisfy additive constraints.

Constraints can be encoded by the tree structure via levels. In the tree formulation, a parent’s value must be the sum of its children’s values.

Or, constraints can be encoded as a matrix via constraint_matrix, specifying additive expressions that must equal 0. The constraints need not have a tree representation.

Provides standard methods such as bottom up, ols, MinT. Also supports a custom method that minimizes user-specified types of error. The solution is derived by convex optimization. If desired, a constraint is added to require the transformation to be unbiased.

If not using method=“ols” or method=“bottom_up”, which don’t depend on the data, forecast reconciliation should be trained once per horizon (the number of periods between the forecasted date and train_end_date), because the optimal adjustment may differ.

forecasts

Original forecasted values, used to train the method. Also known as “base” forecasts. Long format where each column is a time series and each row is a time step. For proper variance estimates for the variance penalty, values should be at a fixed horizon (e.g. always 7-step ahead).

Type: pandas.DataFrame, shape (n, m)

actuals

Actual values to train the method, corresponding to forecasts. Must have the same shape and column names as forecasts.

Type: pandas.DataFrame, shape (n, m)

constraint_matrix

Constraints. c x m array encoding c constraints of m variables. We require constraint_matrix @ transform_matrix = 0. For example, to encode -x1 + x2 + x3 == 0 and -x2 + x4 + x5 == 0:

constraint_matrix = np.array([
    [-1, 1, 1, 0, 0],
    [0, -1, 0, 1, 1]])

Entries are typically in [-1, 0, 1], but this is not required. Either constraint_matrix or levels must be provided.

Type: numpy.array, shape (c, m), or None
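
As a quick check of the encoding, the constraint matrix above can be applied to a vector of values for (x1, ..., x5). The values below are made up for illustration; a zero result means the additive constraints hold.

```python
import numpy as np

constraint_matrix = np.array([
    [-1, 1, 1, 0, 0],
    [0, -1, 0, 1, 1]])

# x1 = x2 + x3 and x2 = x4 + x5 both hold for these values.
consistent = np.array([10.0, 6.0, 4.0, 2.0, 4.0])
# Here x1 is off by one, so only the first constraint is violated.
inconsistent = np.array([11.0, 6.0, 4.0, 2.0, 4.0])

consistent_residual = constraint_matrix @ consistent
inconsistent_residual = constraint_matrix @ inconsistent
```

Each entry of the residual is the signed violation of the corresponding constraint row.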

levels

A simpler way to encode tree constraints. Overrides constraint_matrix if provided. Specifies the number of children of each parent (internal) node in the tree. The number of inner lists is the height of the tree. The ith inner list provides the number of children of each node at depth i. For example:

# root node with 3 children
levels = [[3]]
# root node with 3 children, who have 2, 3, 3 children respectively
levels = [[3], [2, 3, 3]]

All leaf nodes must have the same depth. Thus, the first sublist must have one integer, the length of a sublist must equal the sum of the previous sublist, and all integers in levels must be positive.

Either constraint_matrix or levels must be provided.

Type: list [list [int]] or None
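
The three rules for a valid levels value can be written as a short check. The function `validate_levels` below is a hypothetical helper written for illustration; it is not part of the library.

```python
def validate_levels(levels):
    """Raises ValueError if `levels` does not describe a tree with equal-depth leaves."""
    if not levels or len(levels[0]) != 1:
        raise ValueError("The first sublist must contain exactly one integer.")
    for depth, sublist in enumerate(levels):
        if any(not isinstance(k, int) or k <= 0 for k in sublist):
            raise ValueError("All entries must be positive integers.")
        # Each node at the previous depth contributes one entry per child.
        if depth > 0 and len(sublist) != sum(levels[depth - 1]):
            raise ValueError(
                f"Sublist {depth} must have {sum(levels[depth - 1])} entries.")
    return True

ok_single = validate_levels([[3]])
ok_two_levels = validate_levels([[3], [2, 3, 3]])

try:
    validate_levels([[3], [2, 3]])  # the root has 3 children, but only 2 are described
    rejected = False
except ValueError:
    rejected = True
```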

order_dict

How to order the columns before fitting. The key is the column name, the value is its position. When levels is used, map each column name to the order of its corresponding node in a BFS traversal of the tree. When constraint_matrix is used, this shuffles the order of the columns before the constraints are applied (thus, columns in constraint_matrix refer to the columns after reordering).

If None, no reordering is done.

Type: dict [str, float] or None
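
A minimal sketch of the reordering that order_dict describes, assuming columns are simply sorted by their mapped position. The column names are hypothetical, and the library's internal handling may differ.

```python
import pandas as pd

# Columns arrive in arbitrary order; the tree is root -> (leaf_a, leaf_b).
forecasts = pd.DataFrame({"leaf_b": [2.0], "root": [5.0], "leaf_a": [3.0]})

# Position of each column in a BFS traversal of the tree.
order_dict = {"root": 0, "leaf_a": 1, "leaf_b": 2}

names = sorted(forecasts.columns, key=lambda col: order_dict[col])
reordered = forecasts[names]
```

After reordering, column i of the data corresponds to column i of the constraint matrix, or to node i of the tree.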

method

Which reconciliation method to use. Valid values are “bottom_up”, “ols”, “mint_sample”, “custom”:

“bottom_up” : Sums leaf nodes. Unbiased transform that uses only the values of the leaf nodes to propagate up the tree. Each node’s value is the sum of its corresponding leaf nodes’ values (a leaf node corresponds to a node T if it is a leaf node of the subtree with T as its root, i.e. a descendant of T or T itself). See Dangerfield and Morris 1992 “Top-down or bottom-up: Aggregate versus disaggregate extrapolations” for one discussion of this method. Depends only on the structure of the hierarchy, not on the data itself.

“ols” : OLS estimate proposed by https://robjhyndman.com/papers/Hierarchical6.pdf (Hyndman et al. 2010, “Optimal combination forecasts for hierarchical time series”). Also see https://robjhyndman.com/papers/mint.pdf section 2.4.1 (Wickramasuriya et al. 2019, “Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization”). Unbiased transform that minimizes variance of adjusted residuals, using “identity” estimate of original residual variance. Optimal if original forecast errors are uncorrelated with equal variance (unlikely). Depends only on the structure of the hierarchy, not on the data itself.

“mint_sample” : Unbiased transform that minimizes variance of adjusted residuals, using “sample” estimate of original residual variance. Assumes base forecasts are unbiased. See Wickramasuriya et al. 2019 section 2.4.4. Depends on the structure of the hierarchy and forecast error covariances.

“custom” : Optimization parameters can be set by the user. See greykite.algo.reconcile.convex.reconcile_forecasts.ReconcileAdditiveForecasts.fit method for parameters and their default values. Depends on the structure of the hierarchy, base forecasts, and actuals, if all terms are included in the objective.

If “custom”, uses the parameters passed to greykite.algo.reconcile.convex.reconcile_forecasts.ReconcileAdditiveForecasts.fit to formulate the convex optimization problem.

If “bottom_up”, “ols”, or “mint_sample”, the other fit parameters are ignored.

Type: str
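
To make the two data-independent methods concrete, the sketch below builds both transforms for a hypothetical three-node tree (root = a + b). The “ols” matrix uses the textbook projection S (S'S)^-1 S' from the references cited above, where S is the sum matrix; whether the class computes it this way internally is an assumption.

```python
import numpy as np

# Rows: root, a, b. Columns: leaves a, b.
sum_matrix = np.array([[1., 1.], [1., 0.], [0., 1.]])
leaf_projection = np.array([[0., 1., 0.], [0., 0., 1.]])
constraint_matrix = np.array([[-1., 1., 1.]])  # -root + a + b == 0

# "bottom_up": keep the leaves, recompute the root from them.
bottom_up = sum_matrix @ leaf_projection
# "ols": orthogonal projection onto the space of coherent forecasts.
ols = sum_matrix @ np.linalg.inv(sum_matrix.T @ sum_matrix) @ sum_matrix.T

base = np.array([10.0, 3.0, 4.0])  # incoherent: 10 != 3 + 4
adjusted_bottom_up = bottom_up @ base
adjusted_ols = ols @ base
```

Bottom-up moves the entire discrepancy to the root, giving (7, 3, 4), while OLS spreads it across all three nodes, giving (9, 4, 5). Both results satisfy the constraint.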

lower_bound

Lower bound on each entry of transform_matrix. If None, no lower bound is applied.

Type: float or None

upper_bound

Upper bound on each entry of transform_matrix. If None, no upper bound is applied.

Type: float or None

unbiased

Whether the resulting transformation must be unbiased.

Type: bool

lam_adj

Weight for the adjustment penalty. The adjustment penalty is the mean squared difference between adjusted forecasts and base forecasts.

Type: float

lam_bias

Weight for the bias penalty. The bias penalty is the mean squared difference between adjusted actuals and actuals. For an unbiased transformation (unbiased=True), the bias penalty is 0 so this has no effect.

Type: float

lam_train

Weight for the training MSE penalty. The train MSE penalty measures the mean squared difference between adjusted forecasts and actuals.

Type: float
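
The adjustment, bias, and train penalties described above can be sketched in NumPy for a hypothetical three-node tree (root = a + b). This is an unweighted illustration under stated assumptions: the transform is the textbook OLS projection, all per-timeseries weights are equal, and the exact scaling used by the class is not shown in this document. The variance penalty is omitted.

```python
import numpy as np

sum_matrix = np.array([[1., 1.], [1., 0.], [0., 1.]])
# An example of an unbiased transform (textbook OLS projection).
transform_matrix = sum_matrix @ np.linalg.inv(sum_matrix.T @ sum_matrix) @ sum_matrix.T

# Transposed data: rows are time series (root, a, b), columns are time steps.
forecast_matrix = np.array([[10., 8.], [3., 4.], [4., 5.]])
actual_matrix = np.array([[7., 9.], [3., 4.], [4., 5.]])  # coherent actuals

adjusted_forecasts = transform_matrix @ forecast_matrix
adjusted_actuals = transform_matrix @ actual_matrix

adj_penalty = float(np.mean((adjusted_forecasts - forecast_matrix) ** 2))
bias_penalty = float(np.mean((adjusted_actuals - actual_matrix) ** 2))
train_penalty = float(np.mean((adjusted_forecasts - actual_matrix) ** 2))

lam_adj, lam_bias, lam_train = 1.0, 1.0, 1.0
total = lam_adj * adj_penalty + lam_bias * bias_penalty + lam_train * train_penalty
```

Because the actuals already satisfy the constraint and the transform leaves coherent values unchanged, the bias penalty is zero here, consistent with the note on lam_bias.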

lam_var

Weight for the variance penalty. The variance penalty measures the variance of adjusted forecast errors for an unbiased transformation. It is reported as the average of the variances across timeseries. It is based on the variance-covariance matrix of base forecast errors, covariance. For biased transforms, this is an underestimate of the true variance.

Type: float

covariance

Variance-covariance matrix of base forecast errors. Used to compute the variance penalty.

If a numpy.array, row/column i corresponds to the ith column after reordering by order_dict. Should be reported on the original scale of the data.

If “sample”, the sample covariance of residuals assuming base forecasts are unbiased. Unlike numpy.cov, does not mean center the residuals, and divides by n instead of n-1.

If “identity”, the identity matrix.

Type: numpy.array of shape (m, m), or “sample” or “identity”
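
The difference between the “sample” estimate described above and numpy.cov can be seen on a small, made-up residual matrix. This sketch only illustrates the two conventions (no mean centering and division by n, versus centering and division by n-1).

```python
import numpy as np

# Hypothetical residuals: n = 3 time steps (rows), m = 2 time series (columns).
residuals = np.array([[1.0, -1.0], [2.0, 0.0], [3.0, 1.0]])
n = residuals.shape[0]

# "sample" convention: residuals are assumed to have mean zero, so no centering.
uncentered = residuals.T @ residuals / n

# numpy.cov convention: subtract column means, divide by n - 1.
centered = np.cov(residuals, rowvar=False)
```

The first column has mean 2, so the uncentered estimate of its variance (14/3) is much larger than the centered one (1). The two agree only when the base forecasts really are unbiased and n is large.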

weight_adj

Weight for the adjustment penalty that allows a different weight per-timeseries.

If a numpy array/list, values specify the weight for each forecast after reordering by order_dict.

If “MedAPE”, proportional to the MedAPE of the forecast.

If “InverseMedAPE”, proportional to 1 / MedAPE of the forecast. This can be useful to penalize adjustment to base forecasts that are already accurate.

If None, the identity matrix (equal weights).

Type: numpy.array or list [float] of length m or “MedAPE” or “InverseMedAPE” or None

weight_bias

Weight for the bias penalty that allows a different weight per-timeseries.

If a numpy array/list, values specify the weight for each forecast after reordering by order_dict.

If “MedAPE”, proportional to the MedAPE of the forecast. This can be useful to focus more on improving the base forecasts with high error.

If “InverseMedAPE”, proportional to 1 / MedAPE of the forecast.

If None, the identity matrix (equal weights).

For an unbiased transformation (unbiased=True), the bias penalty is 0 so this has no effect.

Type: numpy.array or list [float] of length m or “MedAPE” or “InverseMedAPE” or None

weight_train

Weight for the train MSE penalty that allows a different weight per-timeseries.

If a numpy array/list, values specify the weight for each forecast after reordering by order_dict.

If “MedAPE”, proportional to the MedAPE of the forecast. This can be useful to focus more on improving the base forecasts with high error.

If “InverseMedAPE”, proportional to 1 / MedAPE of the forecast.

If None, the identity matrix (equal weights).

Type: numpy.array or list [float] of length m or “MedAPE” or “InverseMedAPE” or None

weight_var

Weight for the variance penalty that allows a different weight per-timeseries.

If a numpy array/list, values specify the weight for each forecast after reordering by order_dict.

If “MedAPE”, proportional to the MedAPE of the forecast. This can be useful to focus more on improving the base forecasts with high error.

If “InverseMedAPE”, proportional to 1 / MedAPE of the forecast.

If None, the identity matrix (equal weights).

Type: numpy.array or list [float] of length m or “MedAPE” or “InverseMedAPE” or None

names

Names of forecast columns after reordering by order_dict.

Type: pandas.Index

tree

If levels is provided, represents the tree structure encoded by the levels. Else None.

Type: greykite.algo.reconcile.hierarchical_relationship.HierarchicalRelationship or None

transform_variable

Optimization variable to learn the transform matrix. None if a rule-based method is used, e.g. method == “bottom_up”.

Type: cvxpy.Variable, shape (m, m) or None

transform_matrix

Transformation matrix. Same as transform_variable.value, unless the solver failed to find a solution, and a backup value is used. Adjusted forecasts are computed by applying the transform from the left to reordered and transposed forecasts. See transform in this class.

Type: numpy.array, shape (m, m)

prob

Convex optimization problem.

Type: cvxpy.Problem

is_optimization_solution

Whether transform_matrix is a solution found by convex optimization. If False, then transform_matrix may be set to a backup value (bottom up transform). Check prob.status for more details about solver status.

Type: bool

objective_fn

Evaluates the objective function for a given transform matrix and dataset. Takes transform_matrix, forecast_matrix (optional), actual_matrix (optional). Return value has same format as objective_fn_val. If forecast_matrix/actual_matrix are not provided, uses the fitting datasets.

Type: callable

objective_fn_val

Dictionary containing the objective value, and its components, as evaluated on the training set for the identified optimal solution from convex optimization. Keys are:

"adj" : adjustment size

"bias" : bias of estimator

"train" : train set MSE

"var" : variance of unbiased estimator

"total" : sum of the above

Type: dict [str, float]

objective_weights

Weights used in the objective function, derived from covariance, weight_*, forecasts, actuals. Keys are:

weight_adj

weight_bias

weight_train

weight_var

covariance

Type: dict [str, np.array of shape (m, m)]

adjusted_forecasts

Adjusted forecasts that satisfy the constraints.

Type: pandas.DataFrame, shape (n, m)

constraint_violation

The normalized constraint violations on training set. Keys are “actual”, “forecast”, and “adjusted”. Root mean squared constraint violation is divided by root mean squared actual value.

Type: dict [str, float]
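
The normalized constraint violation can be sketched as follows. The helper `normalized_violation` is hypothetical, and the way the root mean square is averaged (over all constraints and time steps at once) is an assumption, not a statement about the library's code.

```python
import numpy as np

def normalized_violation(constraint_matrix, values, actuals):
    """RMS constraint violation of `values`, divided by the RMS of `actuals`.

    `values` and `actuals` have shape (n, m): rows are time steps.
    """
    violation = constraint_matrix @ values.T          # shape (c, n)
    rms_violation = np.sqrt(np.mean(violation ** 2))
    rms_actual = np.sqrt(np.mean(actuals ** 2))
    return float(rms_violation / rms_actual)

constraint_matrix = np.array([[-1., 1., 1.]])          # -root + a + b == 0
actuals = np.array([[7., 3., 4.], [9., 4., 5.]])       # coherent
forecasts = np.array([[10., 3., 4.], [8., 4., 5.]])    # incoherent

violation_actual = normalized_violation(constraint_matrix, actuals, actuals)
violation_forecast = normalized_violation(constraint_matrix, forecasts, actuals)
```

Dividing by the scale of the actuals makes the violation comparable across datasets with different magnitudes.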

evaluation_df

DataFrame of evaluation results on training set. Rows are timeseries, columns are metrics. See evaluate in this class.

Type: pandas.DataFrame, shape (m, # metrics)

figures

Plotly figures to visualize evaluation results on training set. Keys are: “base_adj” (base vs adjusted forecast), “adj_size” (adjustment size %), “error” (% error). Each figure contains multiple subplots, one for each timeseries.

Type: dict [str, plotly.graph_objects.Figure] or None

forecasts_test

Forecasted values to test the method. Long format where each column is a time series and each row is a time step. Must have the same column names as forecasts. Can have a different number of rows (observations).

Type: pandas.DataFrame, shape (q, m)

actuals_test

Actual values to test the method. Must have the same shape and column names as forecasts_test.

Type: pandas.DataFrame, shape (q, m)

adjusted_forecasts_test

Adjusted forecasts_test that satisfy the constraints.

Type: pandas.DataFrame, shape (q, m)

constraint_violation_test

The normalized constraint violations on test set. Keys are “actual”, “forecast”, and “adjusted”. Root mean squared constraint violation is divided by root mean squared actual value on test set.

Type: dict [str, float]

evaluation_df_test

DataFrame of evaluation results on test set. Rows are timeseries, columns are metrics. See evaluate() in this class.

Type: pandas.DataFrame, shape (m, # metrics)

figures_test

Plotly figures to visualize evaluation results on test set. Keys are: “base_adj” (base vs adjusted forecast), “adj_size” (adjustment size %), “error” (% error). Each figure contains multiple subplots, one for each timeseries.

Type: dict [str, plotly.graph_objects.Figure] or None

fit : callable: Fits the transform_matrix from training data.

transform : callable: Adjusts a forecast to satisfy additive constraints using the transform_matrix.

evaluate : callable: Evaluates the adjustment quality by its impact to MAPE, MedAPE, and RMSE.

fit_transform : callable: Fits and transforms the training data.

fit_transform_evaluate : callable: Fits, transforms, and evaluates on training data.

transform_evaluate : callable: Transforms and evaluates on a new test set.

fit(forecasts, actuals, order_dict=None, method='custom', levels=None, constraint_matrix=None, lower_bound=None, upper_bound=None, unbiased=True, lam_adj=1.0, lam_bias=1.0, lam_train=1.0, lam_var=1.0, covariance='sample', weight_adj=None, weight_bias=None, weight_train=None, weight_var=None, **solver_kwargs)[source]

Fits the transform_matrix based on input data, constraint, and objective function.

Sets the attributes from forecasts through objective_weights (inclusive), as listed in the class description, including transform_matrix, transform_variable, prob, and objective_fn_val.

If method != “bottom_up” and there is no solution, gives a warning and self.is_optimization_solution is set to False. Uses “bottom_up” solution as fallback approach if levels is provided.

Parameters

forecasts (pandas.DataFrame, shape (n, m)) – See attributes of ReconcileAdditiveForecasts.
actuals (pandas.DataFrame, shape (n, m)) – See attributes of ReconcileAdditiveForecasts.
order_dict (dict [str, float] or None, default None) – See attributes of ReconcileAdditiveForecasts.
method (str, default DEFAULT_METHOD) – See attributes of ReconcileAdditiveForecasts. If not “custom”, the parameters from lower_bound to weight_var below are ignored.
levels (list [list [int]] or None, default None) – See attributes of ReconcileAdditiveForecasts.
constraint_matrix (numpy.array, shape (c, m) or None, default None) – See attributes of ReconcileAdditiveForecasts.
lower_bound (float or None, default None) – See attributes of ReconcileAdditiveForecasts.
upper_bound (float or None, default None) – See attributes of ReconcileAdditiveForecasts.
unbiased (bool, default True) – See attributes of ReconcileAdditiveForecasts.
lam_adj (float, default 1.0) – See attributes of ReconcileAdditiveForecasts.
lam_bias (float, default 1.0) – See attributes of ReconcileAdditiveForecasts.
lam_train (float, default 1.0) – See attributes of ReconcileAdditiveForecasts.
lam_var (float, default 1.0) – See attributes of ReconcileAdditiveForecasts.
covariance (numpy.array of shape (m, m), or “sample” or “identity”, default “sample”) – See attributes of ReconcileAdditiveForecasts.
weight_adj (numpy.array or list [float] of length m or “MedAPE” or “InverseMedAPE” or None, default None) – See attributes of ReconcileAdditiveForecasts.
weight_bias (numpy.array or list [float] of length m or “MedAPE” or “InverseMedAPE” or None, default None) – See attributes of ReconcileAdditiveForecasts.
weight_train (numpy.array or list [float] of length m or “MedAPE” or “InverseMedAPE” or None, default None) – See attributes of ReconcileAdditiveForecasts.
weight_var (numpy.array or list [float] of length m or “MedAPE” or “InverseMedAPE” or None, default None) – See attributes of ReconcileAdditiveForecasts.
solver_kwargs (dict) – Specify the CVXPY solver and parameters. E.g. dict(verbose=True). See https://www.cvxpy.org/tutorial/advanced/index.html#setting-solver-options.

Returns

transform_matrix – Transformation matrix. Same as transform_variable.value, unless the solver failed to find a solution, and a backup value is used. Adjusted forecasts are computed by applying the transform from the left to reordered and transposed forecasts. See transform() in this class.

Return type

numpy.array, shape (m, m)

transform(forecasts_test=None)[source]

Transforms the provided forecasts using the fitted self.transform_matrix.

Parameters

forecasts_test (pandas.DataFrame, shape (r, m) or None) – Forecasted values to transform. Must have the same columns as self.forecasts. If None, uses self.forecasts.

Returns

adjusted_forecasts (pandas.DataFrame, shape (r, m)) – Adjusted forecasts that satisfy additive constraints. Columns are reordered according to self.order_dict.
If forecasts_test is None, results are stored to self.adjusted_forecasts.
Otherwise, results are stored to self.adjusted_forecasts_test, and the provided forecasts_test is stored to self.forecasts_test.
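
The phrase “applying the transform from the left to reordered and transposed forecasts” can be illustrated with a small sketch. The column names and the bottom-up transform_matrix below are hypothetical stand-ins for fitted values, and the code is not the method's implementation.

```python
import numpy as np
import pandas as pd

# Stand-ins for fitted attributes: column order after reordering, and a
# bottom-up transform for the tree root -> (a, b).
names = ["root", "a", "b"]
transform_matrix = np.array([[0., 1., 1.], [0., 1., 0.], [0., 0., 1.]])

# New forecasts, with columns in a different order than `names`.
forecasts_test = pd.DataFrame({"a": [3.0, 4.0], "b": [4.0, 5.0], "root": [10.0, 8.0]})

reordered = forecasts_test[names]                        # shape (r, m)
adjusted = transform_matrix @ reordered.to_numpy().T     # shape (m, r)
adjusted_forecasts = pd.DataFrame(
    adjusted.T, columns=names, index=forecasts_test.index)
```

The transpose is needed because the data frame stores time series as columns, while the transform acts on a vector of node values at each time step.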

evaluate(is_train, actuals_test=None, ipython_display=False, plot=False, plot_num_cols=3)[source]

Evaluates the adjustment quality. Computes the following metrics for each of the m timeseries:

“Base MAPE” : MAPE of base forecasts

“Base MedAPE” : MedAPE of base forecasts

“Base RMSE” : RMSE of base forecasts

“Adjusted MAPE” : MAPE of adjusted forecasts

“Adjusted MedAPE” : MedAPE of adjusted forecasts

“Adjusted RMSE” : RMSE of adjusted forecasts

“RMSE % change” : (Adjusted RMSE) / (Base RMSE) - 1

“MAPE pp change” : (Adjusted MAPE) - (Base MAPE)

“MedAPE pp change” : (Adjusted MedAPE) - (Base MedAPE)

“pp change” refers to percentage point change (difference in %).

Must call fit and transform before calling this method.

Parameters

is_train (bool) – Whether to evaluate on training set or test set. If True, evaluates training adjustment quality. Else, evaluates test adjustment quality. In this case, actuals_test must be provided.
actuals_test (pandas.DataFrame) – Actual values on test set, required if is_train==False. Must have the same shape as the forecasts passed to transform(), i.e. self.forecasts_test.shape.
ipython_display (bool, default False) – Whether to display the evaluation statistics.
plot (bool, default False) – Whether to display the evaluation plots.
plot_num_cols (int, default 3) – Number of columns in the plot. This is the number of timeseries to plot in each row.

Returns

evaluation_result (dict [str, dict, or pandas.DataFrame]) –
- "constraint_violation" : dict [str, float]
  The normalized constraint violations. Keys are “actual”, “forecast”, and “adjusted”. The value is root mean squared constraint violation divided by root mean squared actual value. Constraint violation of actuals should be close to 0.
- "evaluation_df" : pandas.DataFrame, shape (m, # metrics)
  Evaluation results. DataFrame with one row for each timeseries, and a column for each metric listed above.
- "figures" : dict [str, plotly.graph_objects.Figure]
  Plotly figures to visualize evaluation results. Keys are: “base_adj” (base vs adjusted forecast), “adj_size” (adjustment size %), “error” (% error). Each figure contains multiple subplots, one for each timeseries.
If is_train, results are stored to self.constraint_violation, self.evaluation_df.
Otherwise, they are stored to self.constraint_violation_test, self.evaluation_df_test.
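
The metrics computed by evaluate can be reproduced for a single time series with NumPy. This is a sketch with made-up numbers; the helper functions are hypothetical, MAPE and MedAPE are expressed in percent, and the change metrics follow the formulas stated above literally.

```python
import numpy as np

def mape(actual, forecast):
    return float(100 * np.mean(np.abs((actual - forecast) / actual)))

def medape(actual, forecast):
    return float(100 * np.median(np.abs((actual - forecast) / actual)))

def rmse(actual, forecast):
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

actual = np.array([100.0, 200.0, 400.0])
base = np.array([110.0, 180.0, 440.0])       # 10% error at every step
adjusted = np.array([105.0, 190.0, 420.0])   # 5% error at every step

metrics = {
    "Base MAPE": mape(actual, base),
    "Base MedAPE": medape(actual, base),
    "Base RMSE": rmse(actual, base),
    "Adjusted MAPE": mape(actual, adjusted),
    "Adjusted MedAPE": medape(actual, adjusted),
    "Adjusted RMSE": rmse(actual, adjusted),
}
metrics["RMSE % change"] = metrics["Adjusted RMSE"] / metrics["Base RMSE"] - 1
metrics["MAPE pp change"] = metrics["Adjusted MAPE"] - metrics["Base MAPE"]
metrics["MedAPE pp change"] = metrics["Adjusted MedAPE"] - metrics["Base MedAPE"]
```

Negative change values indicate that the adjustment improved accuracy. Here MAPE drops by 5 percentage points and RMSE is halved.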

fit_transform(forecasts, actuals, **fit_kwargs)[source]

Fits and transforms training data.

Parameters

forecasts (pandas.DataFrame) – Forecasts to fit the adjustment. See fit.
actuals (pandas.DataFrame) – Actuals to fit the adjustment. See fit.
fit_kwargs (dict, optional) – Additional parameters to pass to fit.

Returns

adjusted_forecasts – Adjusted forecasts.

Return type

pandas.DataFrame

fit_transform_evaluate(forecasts, actuals, fit_kwargs=None, evaluate_kwargs=None)[source]

Fits, transforms, and evaluates on training data.

Parameters

forecasts (pandas.DataFrame) – Forecasts to fit the adjustment. See fit.
actuals (pandas.DataFrame) – Actuals to fit the adjustment. See fit.
fit_kwargs (dict, optional, default None) – Additional parameters to pass to fit.
evaluate_kwargs (dict, optional, default None) – Additional parameters to pass to evaluate.

Returns

evaluation_df – Evaluation results on provided forecasts.

Return type

pandas.DataFrame

transform_evaluate(forecasts_test, actuals_test, **evaluate_kwargs)[source]

Transforms and evaluates on test data.

Must call fit before calling this method.

Parameters

forecasts_test (pandas.DataFrame) – Forecasts to make consistent. Should be different from the training data.
actuals_test (pandas.DataFrame) – Actuals to check quality of the adjustment.
evaluate_kwargs (dict, optional, default None) – Additional parameters to pass to evaluate.

Returns: evaluation_df_test – Evaluation results on provided forecasts_test.
Return type: pandas.DataFrame

plot_transform_matrix(color_continuous_scale='RdBu', zmin=-1.5, zmax=1.5, **kwargs)[source]

Plots the transform matrix visually, as a grid. By default, negative values are red and positive values are blue.

Parameters

color_continuous_scale (str or list [str], default “RdBu”) – Colormap used to map scalar data to colors. See plotly.express.imshow.
zmin (scalar or iterable, default -1.5) – The minimum value covered by the colormap. See plotly.express.imshow.
zmax (scalar or iterable, default 1.5) – The maximum value covered by the colormap. See plotly.express.imshow.
kwargs (keyword arguments) – Additional keyword arguments for plotly.express.imshow.

Returns

fig – The transform matrix plot object.

Return type

class greykite.algo.reconcile.hierarchical_relationship.HierarchicalRelationship(levels)[source]

Represents hierarchical relationships between nodes (time series).

Nodes are indexed by their position in the tree, in breadth-first search (BFS) order. Matrix attributes such as bottom_up_transform are applied from the left against tree values, represented as a numpy.array 2D array with the values of each node as a row.

levels

Specifies the number of children of each parent (internal) node in the tree. The number of inner lists is the height of the tree. The ith inner list provides the number of children of each node at depth i. For example:

# root node with 3 children
levels = [[3]]

# root node with 3 children, who have 2, 3, 3 children respectively
levels = [[3], [2, 3, 3]]
# These children are ordered from "left" to "right", so that the one with
# 2 children is the first in the 2nd level.
# This will be used as our running example.
#           0                # level 0
#   1       2        3       # level 1
#  4 5    6 7 8    9 10 11   # level 2

All leaf nodes must have the same depth. Thus, the first sublist must have one integer, the length of a sublist must equal the sum of the previous sublist, and all integers in levels must be positive.

Type: list [list [int]] or None

num_children_per_parent

Flattened version of levels. The number of children for each parent (internal) node. [3, 2, 3, 3] in our example.

Type: list [int]

num_internal_nodes

The number of internal (parent) nodes (i.e. with children). 4 in our example.

Type: int

num_leaf_nodes

The number of leaf nodes (i.e. without children). 8 in our example.

Type: int

num_nodes

The total number of nodes. 12 in our example.

Type: int

nodes_per_level

The number of nodes at each level of the tree. [1, 3, 8] in our example.

Type: list [int]

starting_index_per_level

The index of the first node in each level. [0, 1, 4] in our example.

Type: list [int]

starting_child_index_per_parent

For each parent node, the index of its first child. [1, 4, 6, 9] in our example.

Type: list [int]
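
The index attributes above all follow from levels by flattening and cumulative sums. The sketch below derives them for the running example in plain Python; it illustrates the definitions and is not the class's own code.

```python
from itertools import accumulate

levels = [[3], [2, 3, 3]]  # the running example

# One entry per internal node, in BFS order.
num_children_per_parent = [k for level in levels for k in level]
num_internal_nodes = len(num_children_per_parent)

# Level 0 holds the root; level i + 1 holds all children listed in levels[i].
nodes_per_level = [1] + [sum(level) for level in levels]
num_nodes = sum(nodes_per_level)
num_leaf_nodes = nodes_per_level[-1]

# Index of the first node in each level.
starting_index_per_level = [0] + list(accumulate(nodes_per_level))[:-1]

# The root's first child is node 1; each later parent starts where the previous one ended.
starting_child_index_per_parent = [1] + [
    1 + total for total in accumulate(num_children_per_parent)][:-1]
```

The results match the values quoted for the example: [3, 2, 3, 3], 4, [1, 3, 8], 12, 8, [0, 1, 4], and [1, 4, 6, 9].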

sum_matrix

Sum matrix used to compute values of all nodes from the leaf nodes. When applied to a matrix with the values for leaf nodes, returns values for every node by bubbling up leaf node values to the internal nodes. A node’s value is equal to the sum of its corresponding leaf nodes’ values.

Y_{all} = sum_matrix @ Y_{leaf}

In our example:

# 4   5   6   7   8   9   10  11  (leaf nodes)
[[1., 1., 1., 1., 1., 1., 1., 1.], # 0
 [1., 1., 0., 0., 0., 0., 0., 0.], # 1
 [0., 0., 1., 1., 1., 0., 0., 0.], # 2
 [0., 0., 0., 0., 0., 1., 1., 1.], # 3
 [1., 0., 0., 0., 0., 0., 0., 0.], # 4
 [0., 1., 0., 0., 0., 0., 0., 0.], # 5
 [0., 0., 1., 0., 0., 0., 0., 0.], # 6
 [0., 0., 0., 1., 0., 0., 0., 0.], # 7
 [0., 0., 0., 0., 1., 0., 0., 0.], # 8
 [0., 0., 0., 0., 0., 1., 0., 0.], # 9
 [0., 0., 0., 0., 0., 0., 1., 0.], # 10
 [0., 0., 0., 0., 0., 0., 0., 1.]] # 11 (all nodes)

Type: numpy.array, shape (self.num_nodes, self.num_leaf_nodes)
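
One way to construct such a sum matrix from levels is to start with an identity block for the leaves and fill in parents from the bottom of the tree upward. The function `build_sum_matrix` is a hypothetical helper written for illustration; the class's private method may work differently.

```python
import numpy as np

def build_sum_matrix(levels):
    """Builds the (num_nodes, num_leaf_nodes) sum matrix for a BFS-indexed tree."""
    num_children = [k for level in levels for k in level]
    num_internal = len(num_children)
    num_leaf = sum(levels[-1])
    sum_matrix = np.zeros((num_internal + num_leaf, num_leaf))
    # Each leaf node is the sum of itself.
    sum_matrix[num_internal:, :] = np.eye(num_leaf)
    # Index of each parent's first child, in BFS order.
    first_child = np.cumsum([1] + num_children[:-1])
    # Visit parents last-to-first so every child's row is filled before its parent's.
    for parent in range(num_internal - 1, -1, -1):
        start = first_child[parent]
        stop = start + num_children[parent]
        sum_matrix[parent] = sum_matrix[start:stop].sum(axis=0)
    return sum_matrix

sum_matrix = build_sum_matrix([[3], [2, 3, 3]])
```

For the running example this reproduces the 12 x 8 matrix shown above.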

leaf_projection_matrix

Projection matrix to get leaf nodes. When applied to a matrix with the values for all nodes, the projection matrix selects only the rows corresponding to leaf nodes.

Y_{leaf} = leaf_projection_matrix @ Y_{actual}

In our example:

# 0   1   2   3   4   5   6   7   8   9   10  11  (all nodes)
[[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],  # 4
 [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],  # 5
 [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],  # 6
 [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],  # 7
 [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],  # 8
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],  # 9
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],  # 10
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]]  # 11 (leaf nodes)

Type: numpy.array, shape (self.num_leaf_nodes, self.num_nodes)

bottom_up_transform

Bottom-up transformation matrix. When applied to a matrix with the values for all nodes, returns values for every node by bubbling up leaf node values to the internal nodes. The original values of internal nodes are ignored.

Y_{bu} = bottom_up_transform @ Y_{actual}

Note that bottom_up_transform = sum_matrix @ leaf_projection_matrix. In our example:

# 0   1   2   3   4   5   6   7   8   9   10  11  (all nodes)
[[0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1.], # 0
 [0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0.], # 1
 [0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0.], # 2
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.], # 3
 [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.], # 4
 [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.], # 5
 [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.], # 6
 [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.], # 7
 [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.], # 8
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.], # 9
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.], # 10
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]] # 11 (all nodes)

Type: numpy.array, shape (self.num_nodes, self.num_nodes)

constraint_matrix

Constraint matrix representing hierarchical additive constraints, where a parent’s value is equal to the sum of its leaf nodes’ values. constraint_matrix @ Y_{all} = 0 if Y_{all} satisfies the constraints. In our example:

#  0    1    2    3    4    5    6    7    8    9    10   11  (all nodes)
[[-1.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],  # 0
 [ 0., -1.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],  # 1
 [ 0.,  0., -1.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  0.],  # 2
 [ 0.,  0.,  0., -1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.]]  # 3 (internal nodes)

Type: numpy.array, shape (self.num_internal_nodes, self.num_nodes)
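
The relationships between the four matrices can be checked numerically. The sketch below rebuilds them for the 12-node example tree with plain NumPy. The `children` mapping and the helper `leaf_descendants` are illustrative assumptions, not attributes of the class.

```python
import numpy as np

# Assumed layout of the example tree: parent -> children.
children = {0: [1, 2, 3], 1: [4, 5], 2: [6, 7, 8], 3: [9, 10, 11]}
num_nodes = 12
leaves = [n for n in range(num_nodes) if n not in children]   # nodes 4..11
internal = sorted(children)                                   # nodes 0..3


def leaf_descendants(node):
    """Returns the leaf nodes below `node` (the node itself if it is a leaf)."""
    if node not in children:
        return [node]
    return [leaf for c in children[node] for leaf in leaf_descendants(c)]


# sum_matrix: one row per node, one column per leaf.
sum_matrix = np.zeros((num_nodes, len(leaves)))
for node in range(num_nodes):
    for leaf in leaf_descendants(node):
        sum_matrix[node, leaves.index(leaf)] = 1.0

# leaf_projection_matrix: selects the leaf rows from a vector of all nodes.
leaf_projection_matrix = np.zeros((len(leaves), num_nodes))
for row, leaf in enumerate(leaves):
    leaf_projection_matrix[row, leaf] = 1.0

bottom_up_transform = sum_matrix @ leaf_projection_matrix

# Each internal node: (sum of its leaves) - (its own value) = 0.
constraint_matrix = bottom_up_transform[internal] - np.eye(num_nodes)[internal]

y_leaf = np.arange(1.0, 9.0).reshape(-1, 1)   # leaf values 1, 2, ..., 8
y_all = sum_matrix @ y_leaf                    # values for every node
```

Here `y_all` satisfies the additive constraints by construction, so `constraint_matrix @ y_all` is a zero vector.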

get_level_of_node : callable: Returns a node’s level in the tree

get_child_nodes : callable: Returns the indices of a node’s children in the tree

__set_sum_matrix : callable: Constructs the summing matrix to compute values of all nodes from the leaf nodes.

__set_leaf_projection_matrix : callable: Constructs leaf projection matrix to retain only values of leaf nodes.

__set_constraint_matrix : callable: Constructs constraint matrix that requires each parent’s value to be the sum of its leaf nodes’ values.

get_level_of_node(node)[source]

Returns a node’s level in the tree. Level is defined as the length of the path to the root. The root is at level 0.

Parameters: node (int) – Index of the node.
Returns: level – The level of the node in the tree.
Return type: int

get_child_nodes(node)[source]

Returns the indices of a node’s children in the tree.

Parameters: node (int) – Index of the node.
Returns: child_nodes – Indices of all the node’s children.
Return type: list [int]
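
As a rough illustration of these two lookups, the sketch below uses a hypothetical `parent` dictionary for the 12-node example tree. The storage format and function names are assumptions made for the example and may differ from how the class stores the tree.

```python
# Hypothetical parent map for the example tree; None marks the root.
parent = {0: None, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1,
          6: 2, 7: 2, 8: 2, 9: 3, 10: 3, 11: 3}


def level_of_node(node):
    """Length of the path from `node` to the root (root is level 0)."""
    level = 0
    while parent[node] is not None:
        node = parent[node]
        level += 1
    return level


def child_nodes(node):
    """Indices of the direct children of `node`."""
    return [n for n, p in parent.items() if p == node]
```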

Utility Functions

Functions to generate derived time features useful in forecasting, such as growth, seasonality, holidays.

greykite.common.features.timeseries_features.convert_date_to_continuous_time(dt)[source]

Converts date to continuous time. Each year is one unit.

Parameters: dt (datetime object) – the date to convert
Returns: conti_date – the date represented in years
Return type: float
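
One plausible way to express a date in year units is the calendar year plus the fraction of that year already elapsed. The sketch below follows that idea. The name `to_continuous_year` is hypothetical, and the library's exact arithmetic is not shown in this page.

```python
from datetime import datetime


def to_continuous_year(dt):
    """Sketch: year plus the fraction of that year elapsed at `dt`."""
    start = datetime(dt.year, 1, 1)
    end = datetime(dt.year + 1, 1, 1)
    # Dividing two timedeltas yields a float in [0, 1).
    return dt.year + (dt - start) / (end - start)


mid_2018 = to_continuous_year(datetime(2018, 7, 2, 12))  # noon on July 2
```

In a 365-day year, noon on July 2 is exactly halfway through, so `mid_2018` is 2018.5.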

greykite.common.features.timeseries_features.get_default_origin_for_time_vars(df, time_col)[source]

Sets default value for origin_for_time_vars

Parameters

df (pandas.DataFrame) – Training data. A data frame which includes the timestamp and value columns
time_col (str) – The column name in df representing time for the time series data.

Returns

dt_continuous_time – The time origin used to create continuous variables for time

Return type

float

greykite.common.features.timeseries_features.pytz_is_dst_fcn(time_zone)[source]

For a given time zone, constructs a function which determines whether each timestamp in a list of timestamps is inside the daylight saving period.

This function should work for regions in the US / Canada and Europe.

The returned function assumes that the timestamps are in the given time_zone. Note that since daylight saving is the same for all of mainland US / Canada, one can pass any US time zone, e.g. "US/Pacific", to construct a function which works for all of mainland US. Similarly, for most of Europe it is sufficient to pass any Europe time zone, e.g. "Europe/London".

Note: Since this function is slow, a faster version is available: is_dst_fcn. However, we expect the current function would be more accurate assuming the package pytz keeps up to date with potential changes in DST.

Parameters: time_zone (str) – A string denoting the time zone, e.g. “US/Pacific”, “Canada/Eastern”, “Europe/London”.
Returns: is_dst – A function which takes a list of datetime-like objects and returns a list of booleans indicating whether each timestamp is in daylight saving.
Return type: callable

greykite.common.features.timeseries_features.get_us_dst_start(year)[source]

For each year, it returns the second Sunday in March, which is the start of the daylight saving (DST) in US/Canada.

We assume DST starts on Second Sunday of March at 2 a.m.

Parameters: year (int) – Year for which DST start date is desired.
Returns: result – The timestamp of start of DST in US/Canada.
Return type: datetime.datetime

greykite.common.features.timeseries_features.get_us_dst_end(year)[source]

For each year, it returns the first Sunday in November, which is the end of the daylight saving (DST) in US/Canada.

We assume DST ends on the first Sunday of November at 2 a.m.

Parameters: year (int) – Year for which DST end date is desired.
Returns: result – The timestamp of end of DST in US/Canada.
Return type: datetime.datetime

greykite.common.features.timeseries_features.get_eu_dst_start(year)[source]

For each year, it returns the last Sunday in March, which is the start of the daylight saving (DST) in Europe.

We assume Europe DST starts on last Sunday of March at 1 a.m.

Parameters: year (int) – Year for which DST start date is desired.
Returns: result – The timestamp of start of DST in Europe.
Return type: datetime.datetime

greykite.common.features.timeseries_features.get_eu_dst_end(year)[source]

For each year, it returns the last Sunday in October, which is the end of the daylight saving (DST) in Europe.

We assume Europe DST ends on last Sunday of October at 2 a.m.

Parameters: year (int) – Year for which DST end date is desired.
Returns: result – The timestamp of end of DST in Europe.
Return type: datetime.datetime
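
The four boundary rules above reduce to two date calculations: the n-th Sunday of a month and the last Sunday of a month. A minimal sketch with the standard library follows. The helper names are hypothetical, and the hours are the ones assumed in the descriptions above.

```python
from datetime import datetime, timedelta


def nth_sunday(year, month, n, hour):
    """The n-th Sunday of the month at the given hour (n starts at 1)."""
    first = datetime(year, month, 1, hour)
    # weekday(): Monday=0, ..., Sunday=6
    days_to_sunday = (6 - first.weekday()) % 7
    return first + timedelta(days=days_to_sunday + 7 * (n - 1))


def last_sunday(year, month, hour):
    """The last Sunday of the month at the given hour."""
    next_month = datetime(year + (month == 12), month % 12 + 1, 1, hour)
    last_day = next_month - timedelta(days=1)
    # Walk back from the last day of the month to the nearest Sunday.
    return last_day - timedelta(days=(last_day.weekday() + 1) % 7)


us_start = nth_sunday(2022, 3, n=2, hour=2)    # second Sunday of March
us_end = nth_sunday(2022, 11, n=1, hour=2)     # first Sunday of November
eu_start = last_sunday(2022, 3, hour=1)        # last Sunday of March
eu_end = last_sunday(2022, 10, hour=2)         # last Sunday of October
```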

greykite.common.features.timeseries_features.is_dst_fcn(time_zone)[source]

For a given time zone, constructs a function which determines whether each timestamp in a list of timestamps is inside the daylight saving period.

This function should work for regions in the US / Canada and Europe.

The returned function assumes that the timestamps are in the given time_zone. Note that since daylight saving is the same for all of mainland US / Canada, one can pass any US time zone, e.g. "US/Pacific", to construct a function which works for all of mainland US. Similarly, for most of Europe it is sufficient to pass any Europe time zone, e.g. "Europe/London".

Some references on when did DST start in modern era:

Note: This function assumes the DST rules remain the same as what they are in the year 2022 (when this code was written). A potentially more accurate (but much slower) version is available: pytz_is_dst_fcn. However, we expect the current function would be much faster and it can be updated in case DST rules change.

Parameters: time_zone (str) – A string denoting the time zone, e.g. “US/Pacific”, “Canada/Eastern”, “Europe/London”.
Returns: is_dst – A function which takes a list of datetime-like objects and returns a list of booleans indicating whether each timestamp is in daylight saving.
Return type: callable
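
The "function that returns a function" pattern described here can be sketched for the US rules as follows. This is an illustration only: `make_us_is_dst` is a hypothetical name, the timestamps are assumed to be naive local datetimes, and the interval is assumed to be closed at the start and open at the end.

```python
from datetime import datetime, timedelta


def first_sunday_on_or_after(day):
    """The first Sunday on or after `day` (time of day is preserved)."""
    return day + timedelta(days=(6 - day.weekday()) % 7)


def make_us_is_dst():
    """Returns a function mapping a list of datetimes to a list of booleans."""
    def is_dst(timestamps):
        flags = []
        for ts in timestamps:
            # Second Sunday of March = first Sunday on or after March 8.
            start = first_sunday_on_or_after(datetime(ts.year, 3, 8, 2))
            # First Sunday of November = first Sunday on or after November 1.
            end = first_sunday_on_or_after(datetime(ts.year, 11, 1, 2))
            flags.append(start <= ts < end)
        return flags
    return is_dst


us_is_dst = make_us_is_dst()
flags = us_is_dst([
    datetime(2022, 7, 1),
    datetime(2022, 12, 1),
    datetime(2022, 3, 13, 1, 59),
    datetime(2022, 3, 13, 2, 0),
])
```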

greykite.common.features.timeseries_features.build_time_features_df(dt, conti_year_origin, add_dst_info=True)[source]

This function gets a datetime-like vector and creates new columns containing temporal features useful for time series analysis and forecasting e.g. year, week of year, etc.

Parameters

dt (array-like (1-dimensional)) – A vector of datetime-like values
conti_year_origin (float) – The origin used for creating continuous time which is in years unit.
add_dst_info (bool, default True) – Determines if daylight saving columns for US and Europe should be added.

Returns

time_features_df –

Dataframe with the following time features.

”datetime”: datetime.datetime object, a combination of date and a time

”date”: datetime.date object, date with the format (year, month, day)

”year”: integer, year of the date e.g. 2018

”year_length”: integer, number of days in the year e.g. 365 or 366

”quarter”: integer, quarter of the date, 1, 2, 3, 4

”quarter_start”: pandas.DatetimeIndex, date of beginning of the current quarter

”quarter_length”: integer, number of days in the quarter, 90/91 for Q1, 91 for Q2, 92 for Q3 and Q4

”month”: integer, month of the year, January=1, February=2, …, December=12

”month_length”: integer, number of days in the month, 28/ 29/ 30/ 31

”woy”: integer, ISO 8601 week of the year where a week starts from Monday, 1, 2, …, 53

”doy”: integer, ordinal day of the year, 1, 2, …, year_length

”doq”: integer, ordinal day of the quarter, 1, 2, …, quarter_length

”dom”: integer, ordinal day of the month, 1, 2, …, month_length

”dow”: integer, day of the week, Monday=1, Tuesday=2, …, Sunday=7

”str_dow”: string, day of the week as a string e.g. “1-Mon”, “2-Tue”, …, “7-Sun”

”str_doy”: string, day of the year e.g. “2020-03-20” for March 20, 2020

”hour”: integer, discrete hours of the datetime, 0, 1, …, 23

”minute”: integer, minutes of the datetime, 0, 1, …, 59

”second”: integer, seconds of the datetime, 0, 1, …, 59

”year_month”: string, (year, month) e.g. “2020-03” for March 2020

”year_woy”: string, (year, week of year) e.g. “2020_42” for 42nd week of 2020

”month_dom”: string, (month, day of month) e.g. “02/20” for February 20th

”year_woy_dow”: string, (year, week of year, day of week) e.g. “2020_03_6” for Saturday of 3rd week in 2020

”woy_dow”: string, (week of year, day of week) e.g. “03_6” for Saturday of 3rd week

”dow_hr”: string, (day of week, hour) e.g. “4_09” for 9am on Thursday

”dow_hr_min”: string, (day of week, hour, minute) e.g. “4_09_10” for 9:10am on Thursday

”tod”: float, time of day, continuous, 0.0 to 24.0

”tow”: float, time of week, continuous, 0.0 to 7.0

”tom”: float, standardized time of month, continuous, 0.0 to 1.0

”toq”: float, time of quarter, continuous, 0.0 to 1.0

”toy”: float, standardized time of year, continuous, 0.0 to 1.0

”conti_year”: float, year in continuous time, e.g. 2018.5 means the middle of the year 2018

”is_weekend”: boolean, weekend indicator, True for weekend, else False

”dow_grouped”: string, Monday-Thursday=1234-MTuWTh, Friday=5-Fri, Saturday=6-Sat, Sunday=7-Sun

”ct1”: float, linear growth based on conti_year_origin, -infinity to infinity

”ct2”: float, signed quadratic growth, -infinity to infinity

”ct3”: float, signed cubic growth, -infinity to infinity

”ct_sqrt”: float, signed square root growth, -infinity to infinity

”ct_root3”: float, signed cubic root growth, -infinity to infinity

”us_dst”: bool, indicates whether the time is inside the daylight saving time of the US. This column is only generated if add_dst_info=True

”eu_dst”: bool, indicates whether the time is inside the daylight saving time of Europe. This column is only generated if add_dst_info=True

Return type

pandas.DataFrame
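
To make the column conventions concrete, the sketch below derives a handful of the listed features for a single timestamp using only the standard library. It is not the library's implementation: `sketch_time_features` is a hypothetical helper that returns a dict rather than a data frame, and only features whose format is spelled out in the list above are included.

```python
from datetime import datetime

DOW_NAMES = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def sketch_time_features(dt):
    """Derives a few of the documented time features for one datetime."""
    dow = dt.isoweekday()  # Monday=1, ..., Sunday=7
    return {
        "year": dt.year,
        "month": dt.month,
        "dom": dt.day,
        "dow": dow,
        "str_dow": f"{dow}-{DOW_NAMES[dow - 1]}",
        "hour": dt.hour,
        "minute": dt.minute,
        "is_weekend": dow >= 6,
        "year_month": f"{dt.year}-{dt.month:02d}",
        "dow_hr": f"{dow}_{dt.hour:02d}",
        "dow_hr_min": f"{dow}_{dt.hour:02d}_{dt.minute:02d}",
        # Time of day in hours, continuous.
        "tod": dt.hour + dt.minute / 60 + dt.second / 3600,
    }


# Thursday, March 19, 2020, 9:10 am
features = sketch_time_features(datetime(2020, 3, 19, 9, 10))
```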

greykite.common.features.timeseries_features.add_time_features_df(df, time_col, conti_year_origin, add_dst_info=True)[source]

Adds a time feature data frame to a data frame by calling build_time_features_df.

Parameters

df (pandas.DataFrame) – The input data frame
time_col (str) – The name of the time column of interest
conti_year_origin (float) – The origin of time for the continuous time variable which is in years unit.
add_dst_info (bool, default True) – Determines if daylight saving columns for US and Europe should be added.

Returns

result – The same data frame (df) augmented with new columns generated by build_time_features_df

Return type

pandas.DataFrame

greykite.common.features.timeseries_features.get_holidays(countries, year_start, year_end)[source]

This function extracts a holiday data frame for the period of interest [year_start to year_end] for the given countries. This is done using the holidays libraries in pypi:holidays-ext

Parameters

countries (list [str]) – countries for which we need holidays
year_start (int) – first year of interest, inclusive
year_end (int) – last year of interest, inclusive

Returns

holiday_df_dict –

key: country name
value: data frame with holidays for that country. Each data frame has two columns: EVENT_DF_DATE_COL, EVENT_DF_LABEL_COL.

Return type

dict [str, pandas.DataFrame]

greykite.common.features.timeseries_features.get_available_holiday_lookup_countries(countries=None)[source]

Returns list of available countries for modeling holidays

Parameters: countries (List[str] or None, default None) – Only look for available countries in this set.
Returns: List of available countries for modeling holidays.
Return type: List[str]

greykite.common.features.timeseries_features.get_available_holidays_in_countries(countries, year_start, year_end)[source]

Returns a dictionary mapping each country to its holidays between the years specified.

Parameters

countries (List[str]) – Countries for which we need holidays.
year_start (int) – First year of interest.
year_end (int) – Last year of interest.

Returns

Dict[str, List[str]] – key: country name; value: list of holidays in that country between [year_start, year_end].

greykite.common.features.timeseries_features.get_available_holidays_across_countries(countries, year_start, year_end)[source]

Returns a list of holidays that occur in any of the countries between the years specified.

Parameters

countries (List[str]) – Countries for which we need holidays.
year_start (int) – First year of interest.
year_end (int) – Last year of interest.

Returns

List[str] – Names of holidays in any of the countries between [year_start, year_end].

greykite.common.features.timeseries_features.add_daily_events(df, event_df_dict, date_col='date', regular_day_label='', neighbor_impact=None, shifted_effect=None)[source]

For each key of event_df_dict, it adds a new column to a data frame (df) with a date column (date_col). Each new column will represent the events given for that key. This function also generates 3 binary event flags IS_EVENT_EXACT_COL, IS_EVENT_ADJACENT_COL and IS_EVENT_COL given the information in event_df_dict with the following logic:

(1) If the key contains “_minus_” or “_plus_”, that means the event was generated by the add_event_window function, and it is a neighboring day of some exact event day. In this case, IS_EVENT_ADJACENT_COL will be 1 for all days in this key.

(2) Otherwise the key indicates that it is on the exact event day being modeled. In this case, IS_EVENT_EXACT_COL will be 1 for all days in this key.

If a date appears in both types of keys, both above columns will be 1.

IS_EVENT_COL is 1 for all dates in the provided event_df_dict.
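
The flag logic above can be sketched in a few lines. The function name, the dict-of-sets input, and the output keys are hypothetical stand-ins for illustration; the real function works on data frames and uses the IS_EVENT_EXACT_COL, IS_EVENT_ADJACENT_COL and IS_EVENT_COL column constants.

```python
from datetime import date


def event_flags(day, event_dates_by_key):
    """Computes the three binary flags for one date."""
    exact = adjacent = 0
    for key, dates in event_dates_by_key.items():
        if day not in dates:
            continue
        if "_minus_" in key or "_plus_" in key:
            adjacent = 1   # neighboring day of some exact event day
        else:
            exact = 1      # the exact event day
    return {"exact": exact, "adjacent": adjacent, "any": int(exact or adjacent)}


events = {
    "Christmas": {date(2019, 12, 25)},
    "Christmas_minus_1": {date(2019, 12, 24)},
    "Christmas_plus_1": {date(2019, 12, 26)},
    "Boxing Day": {date(2019, 12, 26)},
}
boxing_day_flags = event_flags(date(2019, 12, 26), events)
```

December 26 appears under both kinds of keys, so both the exact and adjacent flags are 1 for that date.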

Parameters

df (pandas.DataFrame) – The data frame which has a date column.
event_df_dict (dict [str, pandas.DataFrame]) –
A dictionary of data frames, each representing events data for the corresponding key. Values are DataFrames with two columns:
- The first column contains the date. Must be at the same frequency as df[date_col] for proper join. Must be in a format recognized by pandas.to_datetime.
- The second column contains the event label for each date
date_col (str) – Column name in df that contains the dates for joining against the events in event_df_dict.
regular_day_label (str) – The label used for regular days which are not “events”.
neighbor_impact (int, list [int], callable or None, default None) –
The impact of neighboring timestamps of the events in event_df_dict. This is for daily events so the units below are all in days.

For example, if the data is weekly (“W-SUN”) and an event is daily, it may not exactly fall on the weekly date. But you can specify for New Year’s day on 1-1, it affects all dates in the week, e.g. 12-31, 1-1, …, 1-6, then it will be mapped to the weekly date. In this case you may want to map a daily event’s date to a few dates, and can specify neighbor_impact=lambda x: [x-timedelta(days=x.isocalendar()[2]-1) + timedelta(days=i) for i in range(7)].

Another example is that the data is rolling 7 day daily data, thus a holiday may affect the t, t+1, …, t+6 dates. You can specify neighbor_impact=7.

If input is int, the mapping is t, t+1, …, t+neighbor_impact-1. If input is list, the mapping is [t+x for x in neighbor_impact]. If input is a function, it maps each daily event’s date to a list of dates.
shifted_effect (list [str] or None, default None) – Additional neighbor events based on given events. For example, passing [“-1D”, “7D”] will add extra daily events which are 1 day before and 7 days after the given events. Offset format is {d}{freq} with any integer plus a frequency string. Must be parsable by pandas to_offset. The new events’ names will be the current events’ names with suffix “{offset}_before” or “{offset}_after”. For example, if we have an event named “US_Christmas Day”, a “7D” shift will have name “US_Christmas Day_7D_after”. This is useful when you expect an offset of the current holidays also has impact on the time series, or you want to interact the lagged terms with autoregression. If neighbor_impact is also specified, this will be applied after adding neighboring days.
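
The three accepted forms of neighbor_impact (int, list, callable) map one event date to several dates. The sketch below mirrors the mapping rules stated above for a single date. `expand_event_dates` is a hypothetical helper, not part of the library.

```python
from datetime import date, timedelta


def expand_event_dates(event_date, neighbor_impact=None):
    """Maps one daily event date to the list of dates it affects."""
    if neighbor_impact is None:
        return [event_date]
    if callable(neighbor_impact):
        return list(neighbor_impact(event_date))
    if isinstance(neighbor_impact, int):
        offsets = range(neighbor_impact)   # t, t+1, ..., t+n-1
    else:
        offsets = neighbor_impact          # [t+x for x in neighbor_impact]
    return [event_date + timedelta(days=x) for x in offsets]


new_year = date(2023, 1, 1)  # a Sunday
rolling = expand_event_dates(new_year, 3)
around = expand_event_dates(new_year, [-1, 0, 1])
# The callable from the weekly example above: all 7 days of the Monday-based week.
week = expand_event_dates(
    new_year,
    lambda x: [x - timedelta(days=x.isocalendar()[2] - 1) + timedelta(days=i)
               for i in range(7)],
)
```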

Returns

df_daily_events – An augmented data frame version of df with new label columns – one for each key of event_df_dict.

Return type

pandas.DataFrame

greykite.common.features.timeseries_features.add_event_window(df, time_col, label_col, time_delta='1D', pre_num=1, post_num=1, events_name='')[source]

For a data frame of events with a time_col and label_col, it adds shifted events before and after the given events. For example, if the event data frame includes the row

‘2019-12-25, Christmas’

the function will produce data frames with the events ‘2019-12-24, Christmas’ and ‘2019-12-26, Christmas’, provided pre_num and post_num are 1 or more.

Parameters

df (pd.DataFrame) – The events data frame with two columns, time_col and label_col.
time_col (str) – The column with the timestamp of the events. This can be daily but does not have to be.
label_col (str) – The column with labels for the events.
time_delta (str) – The amount of the shift for each unit, specified by a string, e.g. “1D” stands for a one-day delta.
pre_num (int) – The number of events to be added prior to the given event, for each event in df.
post_num (int) – The number of events to be added after the given event, for each event in df.
events_name (str) – For each shift, a new data frame is generated, and those data frames are stored in a dictionary with appropriate keys. Each key starts with events_name, followed by “_minus_1”, “_minus_2”, “_plus_1”, “_plus_2”, …, depending on pre_num and post_num.

Returns

dict [str, pd.DataFrame] – A dictionary of data frames, one for each needed shift. For example, if pre_num=2 and post_num=3, then 2 + 3 = 5 data frames will be stored in the returned dictionary.
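
The shifting and key naming can be sketched without pandas, using a list of (date, label) tuples in place of the events data frame. The helper name and the tuple representation are assumptions for illustration, and the shift unit is fixed to a `timedelta` instead of a frequency string such as “1D”.

```python
from datetime import date, timedelta


def event_window(events, delta=timedelta(days=1), pre_num=1, post_num=1,
                 events_name=""):
    """Returns one shifted copy of `events` per requested shift."""
    shifted = {}
    for i in range(1, pre_num + 1):
        shifted[f"{events_name}_minus_{i}"] = [
            (d - i * delta, label) for d, label in events]
    for i in range(1, post_num + 1):
        shifted[f"{events_name}_plus_{i}"] = [
            (d + i * delta, label) for d, label in events]
    return shifted


windows = event_window(
    [(date(2019, 12, 25), "Christmas")],
    pre_num=2, post_num=3, events_name="holiday")
```

With pre_num=2 and post_num=3, `windows` holds five entries, keyed “holiday_minus_1”, “holiday_minus_2”, “holiday_plus_1”, “holiday_plus_2” and “holiday_plus_3”.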

greykite.common.features.timeseries_features.get_evenly_spaced_changepoints_values(df, continuous_time_col='ct1', n_changepoints=2)[source]

Partitions the interval into n_changepoints + 1 segments, placing a changepoint at the left endpoint of each segment. The leftmost segment doesn’t get a changepoint. Changepoints should be determined from training data.

Parameters

df (pd.DataFrame) – Training dataset. Contains continuous_time_col.
continuous_time_col (str) – Name of continuous time column (e.g. conti_year, ct1).
n_changepoints (int) – Number of changepoints requested.

Returns

np.array values of df[continuous_time_col] at the changepoints
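
A minimal sketch of the partitioning idea, operating on a plain list instead of a data frame column. The index formula (integer division of the row count) is an assumption for illustration and may round differently from the library.

```python
def evenly_spaced_changepoints(values, n_changepoints=2):
    """Picks the value at the left endpoint of each segment except the first."""
    n = len(values)
    indices = [(i * n) // (n_changepoints + 1)
               for i in range(1, n_changepoints + 1)]
    return [values[i] for i in indices]


ct1 = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
changepoints = evenly_spaced_changepoints(ct1, n_changepoints=2)
```

With 10 rows and 2 changepoints, the three segments start at row indices 0, 3 and 6, and changepoints are placed at rows 3 and 6.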

greykite.common.features.timeseries_features.get_evenly_spaced_changepoints_dates(df, time_col, n_changepoints)[source]

Partitions the interval into n_changepoints + 1 segments, placing a changepoint at the left endpoint of each segment. The leftmost segment doesn’t get a changepoint. Changepoints should be determined from training data.

Parameters

df (pd.DataFrame) – Training dataset. Contains time_col.
time_col (str) – Name of time column.
n_changepoints (int) – Number of changepoints requested.

Returns

pd.Series values of df[time_col] at the changepoints

greykite.common.features.timeseries_features.get_custom_changepoints_values(df, changepoint_dates, time_col='ts', continuous_time_col='ct1')[source]

Returns the values of continuous_time_col at the requested changepoint_dates.

Parameters

df (pd.DataFrame) – Training dataset. Contains continuous_time_col and time_col.
changepoint_dates (Iterable[Union[int, float, str, datetime]]) – Changepoint dates, interpreted by pd.to_datetime. Changepoints are set at the closest time on or after these dates in the dataset.
time_col (str) – The column name in df representing time for the time series data. The time column can be anything that can be parsed by pandas DatetimeIndex.
continuous_time_col (str) – Name of continuous time column (e.g. conti_year, ct1).

Returns

np.array values of df[continuous_time_col] at the changepoints

greykite.common.features.timeseries_features.get_changepoint_string(changepoint_dates)[source]

Gets proper formatted strings for changepoint dates.

The default format is “_%Y_%m_%d_%H”. When necessary, it appends “_%M” or “_%M_%S”.

Parameters: changepoint_dates (list) – List of changepoint dates, parsable by pandas.to_datetime.
Returns: date_strings – List of string formatted changepoint dates.
Return type: list [str]
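
A sketch of the formatting rule using `strftime`. The rule for when minutes or seconds are "necessary" is not spelled out here, so the example assumes they are appended whenever any date has a non-zero minute or second. `changepoint_strings` is a hypothetical name.

```python
from datetime import datetime


def changepoint_strings(dates):
    """Formats datetimes as "_%Y_%m_%d_%H", extending the format if needed."""
    fmt = "_%Y_%m_%d_%H"
    if any(d.second for d in dates):
        fmt += "_%M_%S"   # assumed: seconds present in at least one date
    elif any(d.minute for d in dates):
        fmt += "_%M"      # assumed: minutes present in at least one date
    return [d.strftime(fmt) for d in dates]


daily = changepoint_strings([datetime(2020, 1, 5), datetime(2020, 6, 1)])
sub_hourly = changepoint_strings([datetime(2020, 1, 5, 3, 30)])
```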

greykite.common.features.timeseries_features.get_changepoint_features(df, changepoint_values, continuous_time_col='ct1', growth_func=None, changepoint_dates=None)[source]

Returns features for growth terms with continuous time origins at the changepoint_values (locations) specified.

Generates a time series feature for each changepoint:

Let t = continuous_time value, c = changepoint value. Then the changepoint feature value at time point t is

growth_func(t - c) * I(t >= c), where I is the indicator function.

This represents growth as a function of time, where the time origin is the changepoint.

In the typical case where growth_func(0) = 0 (has origin at 0), the total effect of the changepoints is continuous in time. If growth_func is the identity function, and continuous_time represents the year in continuous time, these terms form the basis for a continuous, piecewise linear curve to the growth trend. When these terms are fit with a linear model, the coefficients represent the slope change at each changepoint.

Intended usage

To make predictions (on test set): Allow growth term as a function of time to change at these points.

Parameters

df (pd.DataFrame) – The dataset to make predictions. Contains column continuous_time_col.
changepoint_values (array-like) – List of changepoint values (on same scale as df[continuous_time_col]). Should be determined from training data.
continuous_time_col (Optional[str]) – Name of continuous time column in df. growth_func is applied to this column to generate growth term. If None, uses “ct1”, linear growth.
growth_func (Optional[callable]) – Growth function for defining changepoints (scalar -> scalar). If None, uses identity function to use continuous_time_col directly as growth term.
changepoint_dates (Optional[list]) – List of change point dates, parsable by pandas.to_datetime.

Returns

Changepoint features, 0-indexed.

Return type

pd.DataFrame, shape (df.shape[0], len(changepoints))
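
The feature definition growth_func(t - c) * I(t >= c) can be sketched directly. The example below works on plain lists and returns one row per time point and one column per changepoint. `changepoint_features` is a hypothetical helper that ignores column naming.

```python
def changepoint_features(times, changepoint_values, growth_func=None):
    """Row i, column j holds growth_func(t_i - c_j) if t_i >= c_j, else 0."""
    if growth_func is None:
        growth_func = lambda d: d  # identity: piecewise linear basis
    return [
        [growth_func(t - c) if t >= c else 0.0 for c in changepoint_values]
        for t in times
    ]


rows = changepoint_features([0.0, 1.0, 2.0, 3.0], [1.0, 2.0])
```

Each column is zero before its changepoint and grows linearly afterwards, so a linear model fit on these columns adjusts the slope at each changepoint.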

greykite.common.features.timeseries_features.get_changepoint_values_from_config(changepoints_dict, time_features_df, time_col='ts')[source]

Applies the changepoint method specified in changepoints_dict to return the changepoint values

Parameters

changepoints_dict (Optional[Dict[str, any]]) –
Specifies the changepoint configuration.

“method”: str
The method to locate changepoints. Valid options:
“uniform”. Places n_changepoints evenly spaced changepoints to allow growth to change.
“custom”. Places changepoints at the specified dates.

Additional keys to provide parameters for each particular method are described below.

“continuous_time_col”: Optional[str]
Column to apply growth_func to, to generate changepoint features. Typically, this should match the growth term in the model.

“growth_func”: Optional[func]
Growth function (scalar -> scalar). Changepoint features are created by applying growth_func to “continuous_time_col” with offsets. If None, uses identity function to use continuous_time_col directly as growth term.

If changepoints_dict[“method”] == “uniform”, this other key is required:

“n_changepoints”: int
Number of changepoints to evenly space across training period.

If changepoints_dict[“method”] == “custom”, this other key is required:

“dates”: Iterable[Union[int, float, str, datetime]]
Changepoint dates. Must be parsable by pd.to_datetime. Changepoints are set at the closest time on or after these dates in the dataset.
time_features_df (pd.DataFrame) – Training dataset. Contains column “continuous_time_col”.
time_col (str) – The column name in time_features_df representing time for the time series data. The time column can be anything that can be parsed by pandas DatetimeIndex. Used only in the “custom” method.

Returns

np.array values of df[continuous_time_col] at the changepoints
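
For orientation, here are two changepoints_dict values that follow the key descriptions above, one per method. The dates and counts are made-up illustrations, and `missing_required_key` is a hypothetical checker, not a library function.

```python
# "uniform" method: requires "n_changepoints".
uniform_changepoints = {
    "method": "uniform",
    "n_changepoints": 20,
    "continuous_time_col": "ct1",
    "growth_func": None,  # identity: use continuous_time_col directly
}

# "custom" method: requires "dates" (illustrative dates).
custom_changepoints = {
    "method": "custom",
    "dates": ["2019-03-01", "2020-06-15"],
    "continuous_time_col": "ct1",
}

REQUIRED_KEY = {"uniform": "n_changepoints", "custom": "dates"}


def missing_required_key(changepoints_dict):
    """Returns the name of the missing method-specific key, or None."""
    key = REQUIRED_KEY[changepoints_dict["method"]]
    return None if key in changepoints_dict else key
```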

greykite.common.features.timeseries_features.get_changepoint_features_and_values_from_config(df, time_col, changepoints_dict=None, origin_for_time_vars=None)[source]

Extracts changepoints from changepoint configuration and input data

Parameters

df – pd.DataFrame Training data. A data frame which includes the timestamp and value columns
time_col – str The column name in df representing time for the time series data The time column can be anything that can be parsed by pandas DatetimeIndex
changepoints_dict (Optional[Dict[str, any]]) –
Specifies the changepoint configuration.

“method”: str
The method to locate changepoints. Valid options:
“uniform”. Places n_changepoints evenly spaced changepoints to allow growth to change.
“custom”. Places changepoints at the specified dates.

Additional keys to provide parameters for each particular method are described below.

“continuous_time_col”: Optional[str]
Column to apply growth_func to, to generate changepoint features. Typically, this should match the growth term in the model.

“growth_func”: Optional[func]
Growth function (scalar -> scalar). Changepoint features are created by applying growth_func to “continuous_time_col” with offsets. If None, uses identity function to use continuous_time_col directly as growth term.

If changepoints_dict[“method”] == “uniform”, this other key is required:

“n_changepoints”: int
Number of changepoints to evenly space across training period.

If changepoints_dict[“method”] == “custom”, this other key is required:

“dates”: Iterable[Union[int, float, str, datetime]]
Changepoint dates. Must be parsable by pd.to_datetime. Changepoints are set at the closest time on or after these dates in the dataset.
origin_for_time_vars (Optional[float]) – The time origin used to create continuous variables for time.

Returns

Dict[str, any] – Dictionary with the requested changepoints and associated information:

changepoint_df: pd.DataFrame, shape (df.shape[0], len(changepoints)): Changepoint features for modeling the training data.
changepoint_values: array-like: List of changepoint values (on same scale as df[continuous_time_col]). Can be used to generate changepoints for prediction.
continuous_time_col: Optional[str]: Name of continuous time column in df. growth_func is applied to this column to generate growth term. If None, uses “ct1”, linear growth. Can be used to generate changepoints for prediction.
growth_func: Optional[callable]: Growth function for defining changepoints (scalar -> scalar). If None, uses identity function to use continuous_time_col directly as growth term. Can be used to generate changepoints for prediction.
changepoint_cols: List[str]: Names of the changepoint columns for modeling.

greykite.common.features.timeseries_features.get_changepoint_dates_from_changepoints_dict(changepoints_dict, df=None, time_col=None)[source]

Gets the changepoint dates from changepoints_dict

Parameters

changepoints_dict (dict or None) – The changepoints_dict which is compatible with forecast
df (pandas.DataFrame or None, default None) – The data df to put changepoints on.
time_col (str or None, default None) – The column name of time column in df.

Returns

changepoint_dates – List of changepoint dates.

Return type

list

greykite.common.features.timeseries_features.add_event_window_multi(event_df_dict, time_col, label_col, time_delta='1D', pre_num=1, post_num=1, pre_post_num_dict=None)[source]

For a given dictionary of events data frames with a time_col and label_col, it adds shifted events before and after the given events. For example, if an event data frame includes the row ‘2019-12-25, Christmas’, the function will produce data frames with the events ‘2019-12-24, Christmas’ and ‘2019-12-26, Christmas’, provided pre_num and post_num are 1 or more.

Parameters

event_df_dict (dict [str, pandas.DataFrame]) – A dictionary of events data frames with each having two columns: time_col and label_col.
time_col (str) – The column with the timestamp of the events. This can be daily but does not have to be.
label_col (str) – The column with labels for the events.
time_delta (str, default “1D”) – The amount of the shift for each unit specified by a string e.g. ‘1D’ stands for one day delta
pre_num (int, default 1) – The number of events to be added prior to the given event for each event in df.
post_num (int, default 1) – The number of events to be added after the given event for each event in df.
pre_post_num_dict (dict [str, (int, int)] or None, default None) – Optionally override pre_num and post_num for each key in event_df_dict. For example, if event_df_dict has keys “US” and “India”, this parameter can be set to pre_post_num_dict = {"US": [1, 3], "India": [1, 2]}, denoting that the “US” pre_num is 1 and post_num is 3, and “India” pre_num is 1 and post_num is 2. Keys not specified by pre_post_num_dict use the default given by pre_num and post_num.

Returns

df – A dictionary of data frames, one for each needed shift. For example, if pre_num=2 and post_num=3, then 2 + 3 = 5 data frames will be stored in the returned dictionary.

Return type

dict [str, pandas.DataFrame]

greykite.common.features.timeseries_features.get_fourier_col_name(k, col_name, function_name='sin', seas_name=None)[source]

Returns column name corresponding to a particular fourier term, as returned by fourier_series_fcn

Parameters

k (int) – Fourier term.
col_name (str) – Column in the dataframe used to generate the Fourier series.
function_name (str) – sin or cos.
seas_name (Optional[str]) – Appended to new column names added for Fourier terms.

Returns

str – Column name in the DataFrame returned by fourier_series_fcn.

greykite.common.features.timeseries_features.fourier_series_fcn(col_name, period=1.0, order=1, seas_name=None)[source]

Generates a function which creates a Fourier series matrix for a column of an input df.

Parameters

col_name (str) – The column name in the dataframe which is to be used for generating the Fourier series. It needs to be a continuous variable.
period (float) – The period of the Fourier series.
order (int) – The order of the Fourier series.
seas_name (Optional[str]) – Appended to new column names added for Fourier terms. Useful to distinguish multiple Fourier series on the same col_name with different periods.

Returns

callable – A function which can be applied to any data frame df that has a column named col_name.
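
For a single value x, a Fourier series of a given order consists of the terms sin(2πkx / period) and cos(2πkx / period) for k = 1, …, order. The sketch below computes them with the standard library. The column naming pattern (“sin1_toy_yearly” and so on) is an assumption for illustration and may not match the names produced by get_fourier_col_name.

```python
import math


def fourier_terms(x, period=1.0, order=1, col_name="x", seas_name=None):
    """Returns the sin/cos terms of the Fourier series evaluated at x."""
    suffix = f"_{seas_name}" if seas_name else ""
    terms = {}
    for k in range(1, order + 1):
        omega = 2 * math.pi * k / period
        terms[f"sin{k}_{col_name}{suffix}"] = math.sin(omega * x)
        terms[f"cos{k}_{col_name}{suffix}"] = math.cos(omega * x)
    return terms


# A quarter of the way through a period of length 1.
terms = fourier_terms(0.25, period=1.0, order=2, col_name="toy",
                      seas_name="yearly")
```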

greykite.common.features.timeseries_features.fourier_series_multi_fcn(col_names, periods=None, orders=None, seas_names=None)[source]

Generates a func which adds multiple fourier series with multiple periods.

Parameters

col_names (list [str]) – the column names which are to be used to generate Fourier series. Each column can have its own period and order.
periods (list [float] or None) – the periods corresponding to each column given in col_names
orders (list [int] or None) – the orders for each of the Fourier series
seas_names (list [str] or None) – Appended to the Fourier series name. If not provided (None) col_names will be used directly.

greykite.common.features.timeseries_features.signed_pow(x, y)

Takes the absolute value of x and raises it to the power of y, then multiplies the result by the sign of x. For positive y, this guarantees that the function is non-decreasing in x. This is useful in many contexts, e.g. statistical modeling.

Parameters

x – the base number, which can be any real number
y – the power, which can be any real number

Returns

abs(x) to the power of y, multiplied by the sign of x
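A minimal plain-Python sketch of the signed power described above. The function name signed_pow_sketch is hypothetical, and this scalar version is only an illustration of the formula, not the library's implementation.

```python
def signed_pow_sketch(x, y):
    """Returns sign(x) * abs(x) ** y for scalar x and y."""
    sign = (x > 0) - (x < 0)  # -1, 0, or 1
    return sign * abs(x) ** y


print(signed_pow_sketch(-4.0, 0.5))  # square root that preserves the sign
print(signed_pow_sketch(9.0, 0.5))
```

Unlike x ** 0.5, which is not real-valued for negative x, the signed power is defined on the whole real line.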

greykite.common.features.timeseries_features.logistic(x, growth_rate=1.0, capacity=1.0, floor=0.0, inflection_point=0.0)

Evaluates the logistic function at x with the specified growth rate, capacity, floor, and inflection point.

Parameters

x (float) – value to evaluate the logistic function
growth_rate (float) – growth rate
capacity (float) – max value (carrying capacity)
floor (float) – min value (lower bound)
inflection_point (float) – the x value of the inflection point

Returns

value of the logistic function at x

Return type

float

greykite.common.features.timeseries_features.get_logistic_func(growth_rate=1.0, capacity=1.0, floor=0.0, inflection_point=0.0)

Returns a function that evaluates the logistic function at x with the specified growth rate, capacity, floor, and inflection point.

f(x) = floor + capacity / (1 + exp(-growth_rate * (x - inflection_point)))

Parameters

growth_rate (float) – growth rate
capacity (float) – max value (carrying capacity)
floor (float) – min value (lower bound)
inflection_point (float) – the x value of the inflection point

Returns

the logistic function with specified parameters

Return type

callable
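The formula f(x) = floor + capacity / (1 + exp(-growth_rate * (x - inflection_point))) can be sketched as a closure in plain Python. The name make_logistic and the parameter values are hypothetical, and the sketch handles scalars only, whereas the library function may accept arrays.

```python
import math


def make_logistic(growth_rate=1.0, capacity=1.0, floor=0.0, inflection_point=0.0):
    """Returns f(x) = floor + capacity / (1 + exp(-growth_rate * (x - inflection_point)))."""
    def f(x):
        return floor + capacity / (1.0 + math.exp(-growth_rate * (x - inflection_point)))
    return f


f = make_logistic(growth_rate=2.0, capacity=10.0, floor=5.0, inflection_point=3.0)
print(f(3.0))    # halfway between the two asymptotes
print(f(100.0))  # approaches floor + capacity
```

With the formula as written, the curve rises from floor toward floor + capacity and passes through floor + capacity / 2 at the inflection point.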

greykite.algo.forecast.silverkite.forecast_simple_silverkite_helper.get_event_pred_cols(daily_event_df_dict, daily_event_shifted_effect=None)

Generates the names of internal predictor columns from the event dictionary passed to forecast. These can be passed via the extra_pred_cols parameter to model event effects.

Note

The returned strings are patsy model formula terms. Each provides the full set of levels so that prediction works even if a level is not found in the training set.

If a level does not appear in the training set, its coefficient may be unbounded in the “linear” fit_algorithm. A method with regularization avoids this issue (e.g. “ridge”, “elastic_net”).

Parameters

daily_event_df_dict (dict or None, optional, default None) – A dictionary of data frames, each representing events data for the corresponding key. See forecast.
daily_event_shifted_effect (list [str] or None, default None) – Additional neighbor events based on given events. For example, passing [“-1D”, “7D”] will add extra daily events which are 1 day before and 7 days after the given events. Offset format is {d}{freq} with any integer plus a frequency string. Must be parsable by pandas to_offset. The new events’ names will be the current events’ names with suffix “{offset}_before” or “{offset}_after”. For example, if we have an event named “US_Christmas Day”, a “7D” shift will have name “US_Christmas Day_7D_after”. This is useful when you expect a shifted version of the current holidays to also have an impact on the time series, or when you want to interact the lagged terms with autoregression. The interaction can be specified with e.g. y_lag7:events_US_Christmas Day_7D_after. If daily_event_neighbor_impact is also specified, this will be applied after adding neighboring days.

Returns

event_pred_cols – List of patsy model formula terms, one for each key of daily_event_df_dict.

Return type

list [str]

greykite.framework.pipeline.utils.get_basic_pipeline(estimator=SimpleSilverkiteEstimator(), score_func='MeanAbsolutePercentError', score_func_greater_is_better=False, agg_periods=None, agg_func=None, relative_error_tolerance=None, coverage=0.95, null_model_params=None, regressor_cols=None, lagged_regressor_cols=None)

Returns a basic pipeline for univariate forecasting. Allows for outlier detection, normalization, null imputation, degenerate column removal, and forecast model fitting. By default, only null imputation is enabled. See source code for the pipeline steps.

Notes

While score_func is used to define the estimator’s score function, the scoring parameter of RandomizedSearchCV should be provided when using this pipeline in grid search. Otherwise, grid search assumes higher values are better for score_func.

Parameters

estimator (instance of an estimator that implements greykite.sklearn.estimator.base_forecast_estimator.BaseForecastEstimator, default SimpleSilverkiteEstimator()) – Estimator to use as the final step in the pipeline.
score_func (str or callable, default EvaluationMetricEnum.MeanAbsolutePercentError.name) – Score function used to select optimal model in CV. If a callable, takes arrays y_true, y_pred and returns a float. If a string, must be either a EvaluationMetricEnum member name or FRACTION_OUTSIDE_TOLERANCE.
score_func_greater_is_better (bool, default False) – True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better. Must be provided if score_func is a callable (custom function). Ignored if score_func is a string, because the direction is known.
agg_periods (int or None, default None) – Number of periods to aggregate before evaluation. Model is fit at original frequency, and forecast is aggregated according to agg_periods. E.g. fit model on hourly data, and evaluate performance at daily level. If None, does not apply aggregation.
agg_func (callable or None, default None) – Takes an array and returns a number, e.g. np.max, np.sum. Used to aggregate data prior to evaluation (applied to actual and predicted). Ignored if agg_periods is None.
relative_error_tolerance (float or None, default None) – Threshold to compute the FRACTION_OUTSIDE_TOLERANCE metric, defined as the fraction of forecasted values whose relative error is strictly greater than relative_error_tolerance. For example, 0.05 allows for 5% relative error. Required if score_func is FRACTION_OUTSIDE_TOLERANCE.
coverage (float or None, default 0.95) – Intended coverage of the prediction bands (0.0 to 1.0). If None, the upper/lower predictions are not returned. Ignored if pipeline is provided. Uses coverage of the pipeline estimator instead.
null_model_params (dict or None, default None) –
Defines baseline model to compute R2_null_model_score evaluation metric. R2_null_model_score is the improvement in the loss function relative to a null model. It can be used to evaluate model quality with respect to a simple baseline. For details, see r2_null_model_score.

The null model is a DummyRegressor, which returns constant predictions.

Valid keys are “strategy”, “constant”, “quantile”. See DummyRegressor. For example:
```
null_model_params = {
    "strategy": "mean",
}
null_model_params = {
    "strategy": "median",
}
null_model_params = {
    "strategy": "quantile",
    "quantile": 0.8,
}
null_model_params = {
    "strategy": "constant",
    "constant": 2.0,
}
```
If None, R2_null_model_score is not calculated.

Note: CV model selection always optimizes score_func, not the R2_null_model_score.
regressor_cols (list [str] or None, default None) – A list of regressor columns used in the training and prediction DataFrames. It should contain only the regressors that are being used in the grid search. If None, no regressor columns are used. Regressor columns that are unavailable in df are dropped.
lagged_regressor_cols (list [str] or None, default None) – A list of additional columns needed for lagged regressors in the training and prediction DataFrames. This list can have overlap with regressor_cols. If None, no additional columns are added to the DataFrame. Lagged regressor columns that are unavailable in df are dropped.
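The FRACTION_OUTSIDE_TOLERANCE metric referenced by relative_error_tolerance can be sketched in plain Python. This is a minimal illustration under stated assumptions: relative error is taken as abs(y_pred - y_true) / abs(y_true), all actual values are nonzero, and the function name fraction_outside_tolerance is hypothetical rather than the library's implementation.

```python
def fraction_outside_tolerance(y_true, y_pred, relative_error_tolerance=0.05):
    """Fraction of forecasts whose relative error is strictly greater than the tolerance.

    Assumes every value in y_true is nonzero.
    """
    outside = [
        abs(pred - true) / abs(true) > relative_error_tolerance
        for true, pred in zip(y_true, y_pred)
    ]
    return sum(outside) / len(outside)


y_true = [100.0, 200.0, 50.0, 80.0]
y_pred = [104.0, 230.0, 50.0, 84.0]  # relative errors: 0.04, 0.15, 0.0, 0.05
fraction = fraction_outside_tolerance(y_true, y_pred, relative_error_tolerance=0.05)
print(fraction)
```

Only the second forecast counts as outside the tolerance. The last one sits exactly at 5% and is not counted, because the comparison is strict.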

Returns

pipeline – sklearn Pipeline for univariate forecasting.

Return type

sklearn.pipeline.Pipeline

greykite.framework.utils.exploratory_data_analysis.get_exploratory_plots(df, time_col, value_col, freq=None, anomaly_info=None, output_path=None)

Computes multiple exploratory data analysis (EDA) plots to visualize the metric in value_col and aid in modeling. The EDA plots are written in an html file at output_path.

For details on how to interpret these EDA plots, check the tutorials.

Parameters

df (pandas.DataFrame) – Input timeseries. A data frame which includes the timestamp column as well as the value column.
time_col (str) – The column name in df representing time for the time series data. The time column can be anything that can be parsed by pandas DatetimeIndex.
value_col (str) – The column name which has the value of interest to be forecasted.
freq (str or None, default None) – Timeseries frequency, DateOffset alias. If None, automatically inferred.
anomaly_info (dict or list [dict] or None, default None) – Anomaly adjustment info. Anomalies in df are corrected before any plotting is done.
output_path (str or None, default None) – Path where the html file is written. If None, it is set to “EDA_{value_col}.html”.

Returns

eda.html – An html file containing the EDA plots is written at output_path.

Return type

html file

greykite.framework.utils.result_summary.summarize_grid_search_results(grid_search, only_changing_params=True, combine_splits=True, decimals=None, score_func='MeanAbsolutePercentError', score_func_greater_is_better=False, cv_report_metrics='ALL', column_order=None)

Summarizes CV results for each grid search parameter combination.

While grid_search.cv_results_ could be imported into a pandas.DataFrame without this function, the following conveniences are provided:

- returns the correct ranks based on each metric’s greater_is_better direction
- summarizes the hyperparameter space, only showing the parameters that change
- combines split scores into a tuple to save table width
- rounds the values to specified decimals
- orders columns by type (test score, train score, metric, etc.)

Parameters

grid_search (RandomizedSearchCV) – Grid search output (fitted RandomizedSearchCV object).
only_changing_params (bool, default True) – If True, only show parameters with multiple values in the hyperparameter_grid.
combine_splits (bool, default True) –
Whether to report split scores as a tuple in a single column.
- If True, adds a column for the test splits scores for each requested metric. Adds a column with train split scores if those are available.
  
  For example, “split_train_score” would contain the values (split1_train_score, split2_train_score, split3_train_score) as a tuple.
- If False, this summary column is not added.
The original split columns are available either way.
decimals (int or None, default None) – Number of decimal places to round to. If decimals is negative, it specifies the number of positions to the left of the decimal point. If None, does not round.
score_func (str or callable, default EvaluationMetricEnum.MeanAbsolutePercentError.name) –
Score function used to select optimal model in CV. If a callable, takes arrays y_true, y_pred and returns a float. If a string, must be either a EvaluationMetricEnum member name or FRACTION_OUTSIDE_TOLERANCE.

Used in this function to fix the "rank_test_score" column if score_func_greater_is_better=False.

Should be the same as what was passed to run_forecast_config, or forecast_pipeline, or get_hyperparameter_searcher.
score_func_greater_is_better (bool, default False) –
True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better. Must be provided if score_func is a callable (custom function). Ignored if score_func is a string, because the direction is known.

Used in this function to fix the "rank_test_score" column if score_func_greater_is_better=False.

Should be the same as what was passed to run_forecast_config, or forecast_pipeline, or get_hyperparameter_searcher.
cv_report_metrics (CV_REPORT_METRICS_ALL, or list [str], or None, default CV_REPORT_METRICS_ALL) –
Additional metrics to show in the summary, besides the one specified by score_func.

If a metric is specified but not available, a warning will be given.

Should be the same as what was passed to run_forecast_config, or forecast_pipeline, or get_hyperparameter_searcher, or a subset of the computed metrics to show.

If a list of strings, valid strings are greykite.common.evaluation.EvaluationMetricEnum member names and FRACTION_OUTSIDE_TOLERANCE.
column_order (list [str] or None, default None) –
How to order the columns. A list of regexes to order column names, in greedy fashion. Column names matching the first item are placed first. Among the remaining columns, those matching the second item are placed next, etc. Use “.*” as the last element to select all available columns, if desired. If None, uses default ordering:
```
column_order = ["rank_test", "mean_test", "split_test", "mean_train",
                "params", "param", "split_train", "time", ".*"]
```
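The greedy ordering described for column_order can be sketched with the standard re module. This is an illustration, not the library's code: the function name order_columns and the sample column names are hypothetical, and two details are assumptions, namely that patterns are matched with re.search and that columns matching no pattern are dropped.

```python
import re


def order_columns(columns, column_order):
    """Orders columns greedily: each pattern claims the still-unplaced columns it matches."""
    remaining = list(columns)
    ordered = []
    for pattern in column_order:
        matched = [col for col in remaining if re.search(pattern, col)]
        ordered.extend(matched)
        remaining = [col for col in remaining if col not in matched]
    return ordered


columns = ["param_a", "mean_fit_time", "rank_test_MAPE",
           "params", "mean_test_MAPE", "split_test_MAPE"]
column_order = ["rank_test", "mean_test", "split_test", "mean_train",
                "params", "param", "split_train", "time", ".*"]
ordered = order_columns(columns, column_order)
print(ordered)
```

Because matching is greedy, "params" is placed before "param_a" even though the pattern "param" would match both. The trailing ".*" picks up any column not claimed earlier.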

Notes

Metrics are named in grid_search.cv_results_ according to the scoring parameter passed to RandomizedSearchCV.

"score" is the default used by sklearn for single metric evaluation.

If a dictionary is provided to scoring, as is the case through templates, then the metrics are named by its keys, and the metric used for selection is defined by refit. The keys are derived from score_func and cv_report_metrics in get_scoring_and_refit.

The key for score_func if it is a callable is CUSTOM_SCORE_FUNC_NAME.

The key for EvaluationMetricEnum member name is the short name from .get_metric_name().

The key for FRACTION_OUTSIDE_TOLERANCE is FRACTION_OUTSIDE_TOLERANCE_NAME.

Returns

cv_results – A summary of cross-validation results in tabular format. Each row corresponds to a set of parameters used in the grid search.

The columns have the following format, where name is the canonical short name for the metric.

"rank_test_{name}" (int) – The params ranked by mean_test_score (1 is best).
"mean_test_{name}" (float) – Average test score.
"split_test_{name}" (list [float]) – Test score on each split. [split 0, split 1, …]
"std_test_{name}" (float) – Standard deviation of test scores.
"mean_train_{name}" (float) – Average train score.
"split_train_{name}" (list [float]) – Train score on each split. [split 0, split 1, …]
"std_train_{name}" (float) – Standard deviation of train scores.
"mean_fit_time" (float) – Average time to fit each CV split (in seconds).
"std_fit_time" (float) – Standard deviation of time to fit each CV split (in seconds).
"mean_score_time" (float) – Average time to score each CV split (in seconds).
"std_score_time" (float) – Standard deviation of time to score each CV split (in seconds).
"params" (dict) – The parameters used. If only_changing_params==True, only shows the parameters which are not identical across all CV splits.
"param_{pipeline__param__name}" (Any) – The value of pipeline parameter pipeline__param__name for each row.

Return type

pandas.DataFrame

greykite.framework.utils.result_summary.get_ranks_and_splits(grid_search, score_func='MeanAbsolutePercentError', greater_is_better=False, combine_splits=True, decimals=None, warn_metric=True)

Extracts CV results from grid_search for the specified score function. Returns the correct ranks on the test set and a tuple of the scores across splits, for both test set and train set (if available).

Notes

While cv_results contains keys with the ranks, these ranks are inverted if lower values are better and the scoring function was initialized with greater_is_better=True to report metrics with their original sign.

This function always returns the correct ranks, accounting for metric direction.
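The effect of metric direction on ranks can be illustrated in plain Python. This is a sketch, not the library's implementation: rank_scores is a hypothetical helper, the scores are made up, and ties are broken by position, which may differ from how the library or sklearn handles them.

```python
def rank_scores(scores, greater_is_better=False):
    """Returns ranks (1 is best) for scores, given the metric direction."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=greater_is_better)
    ranks = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


mean_test_mape = [12.5, 8.1, 10.3]  # a loss: lower is better
correct = rank_scores(mean_test_mape, greater_is_better=False)
inverted = rank_scores(mean_test_mape, greater_is_better=True)
print(correct, inverted)
```

Treating the loss as a score ranks the worst parameter set first, which is the inversion this function corrects.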

Parameters

grid_search (RandomizedSearchCV) – Grid search output (fitted RandomizedSearchCV object).
score_func (str or callable, default EvaluationMetricEnum.MeanAbsolutePercentError.name) –
Score function to get the ranks for. If a callable, takes arrays y_true, y_pred and returns a float. If a string, must be either a EvaluationMetricEnum member name or FRACTION_OUTSIDE_TOLERANCE.

Should be the same as what was passed to run_forecast_config, or forecast_pipeline, or get_hyperparameter_searcher.
greater_is_better (bool or None, default False) –
True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better. Must be provided if score_func is a callable (custom function). Ignored if score_func is a string, because the direction is known.

Used in this function to rank values in the proper direction.

Should be the same as what was passed to run_forecast_config, or forecast_pipeline, or get_hyperparameter_searcher.
combine_splits (bool, default True) – Whether to report split scores as a tuple in a single column. If True, a single column is returned for all the splits of a given metric and train/test set. For example, “split_train_score” would contain the values (split1_train_score, split2_train_score, split3_train_score) as a tuple. If False, they are reported in their original columns.
decimals (int or None, default None) – Number of decimal places to round to. If decimals is negative, it specifies the number of positions to the left of the decimal point. If None, does not round.
warn_metric (bool, default True) – Whether to issue a warning if the requested metric is not found in the CV results.

Returns

ranks_and_splits – Ranks and split scores. Dictionary with the following keys:

"short_name" (str) – Canonical short name for the score_func.
"ranks" (numpy.array) – Ranks of the test scores for the score_func, where 1 is the best.
"split_train" (list [list [float]]) – Train split scores. Outer list corresponds to the parameter setting; inner list contains the scores for that parameter setting across all splits.
"split_test" (list [list [float]]) – Test split scores. Outer list corresponds to the parameter setting; inner list contains the scores for that parameter setting across all splits.

Return type

dict

greykite.common.viz.timeseries_plotting.plot_multivariate(df, x_col, y_col_style_dict='plotly', default_color='rgba(0, 145, 202, 1.0)', xlabel=None, ylabel='y', title=None, showlegend=True)

Plots one or more lines against the same x-axis values.

Parameters

df (pandas.DataFrame) – Data frame with x_col and columns named by the keys in y_col_style_dict.
x_col (str) – Which column to plot on the x-axis.
y_col_style_dict (dict [str, dict or None] or “plotly” or “auto” or “auto-fill”, default “plotly”) –
The column(s) to plot on the y-axis, and how to style them.

If a dictionary:
- key (str): column name in df
- value (dict or None): Optional styling options, passed as kwargs to go.Scatter. If None, uses the default: line labeled by the column name. See reference page for plotly.graph_objects.Scatter for options (e.g. color, mode, width/size, opacity). https://plotly.com/python/reference/#scatter.

If a string, plots all columns in df besides x_col against x_col:
- “plotly”: plot lines with default plotly styling
- “auto”: plot lines with color default_color, sorted by value (ascending)
- “auto-fill”: plot lines with color default_color, sorted by value (ascending), and fills between lines
default_color (str, default “rgba(0, 145, 202, 1.0)” (blue)) – Default line color when y_col_style_dict is one of “auto”, “auto-fill”.
xlabel (str or None, default None) – x-axis label. If None, default is x_col.
ylabel (str or None, default VALUE_COL) – y-axis label
title (str or None, default None) – Plot title. If None, default is based on axis labels.
showlegend (bool, default True) – Whether to show the legend.

Returns

fig – Interactive plotly graph of one or more columns in df against x_col.

See plot_forecast_vs_actual return value for how to plot the figure and add customization.

Return type

plotly.graph_objects.Figure

greykite.common.viz.timeseries_plotting.plot_univariate(df, x_col, y_col, xlabel=None, ylabel=None, title=None, color='rgb(32, 149, 212)', showlegend=True)

Simple plot of univariate timeseries.

Parameters

df (pandas.DataFrame) – Data frame with x_col and y_col
x_col (str) – x-axis column name, usually the time column
y_col (str) – y-axis column name, the value to plot
xlabel (str or None, default None) – x-axis label
ylabel (str or None, default None) – y-axis label
title (str or None, default None) – Plot title. If None, default is based on axis labels.
color (str, default “rgb(32, 149, 212)” (light blue)) – Line color
showlegend (bool, default True) – Whether to show the legend

Returns

fig – Interactive plotly graph of the value against time.

See plot_forecast_vs_actual return value for how to plot the figure and add customization.

Return type

plotly.graph_objects.Figure

See also

plot_multivariate: Provides more styling options. Also consider using plotly’s go.Scatter and go.Layout directly.

greykite.common.viz.timeseries_plotting.plot_forecast_vs_actual(df, time_col='ts', actual_col='actual', predicted_col='forecast', predicted_lower_col='forecast_lower', predicted_upper_col='forecast_upper', xlabel='ts', ylabel='y', train_end_date=None, title=None, showlegend=True, actual_mode='lines+markers', actual_points_color='rgba(250, 43, 20, 0.7)', actual_points_size=2.0, actual_color_opacity=1.0, forecast_curve_color='rgba(0, 90, 181, 0.7)', forecast_curve_dash='solid', ci_band_color='rgba(0, 90, 181, 0.15)', ci_boundary_curve_color='rgba(0, 90, 181, 0.5)', ci_boundary_curve_width=0.0, vertical_line_color='rgba(100, 100, 100, 0.9)', vertical_line_width=1.0)

Plots forecast with prediction intervals, against actuals. Adapted from plotly user guide: https://plot.ly/python/v3/continuous-error-bars/#basic-continuous-error-bars

Parameters

df (pandas.DataFrame) – Timestamp, predicted, and actual values
time_col (str, default TIME_COL) – Column in df with timestamp (x-axis)
actual_col (str, default ACTUAL_COL) – Column in df with actual values
predicted_col (str, default PREDICTED_COL) – Column in df with predicted values
predicted_lower_col (str or None, default PREDICTED_LOWER_COL) – Column in df with predicted lower bound
predicted_upper_col (str or None, default PREDICTED_UPPER_COL) – Column in df with predicted upper bound
xlabel (str, default TIME_COL) – x-axis label.
ylabel (str, default VALUE_COL) – y-axis label.
train_end_date (datetime.datetime or None, default None) – Train end date. Must be a value in df[time_col].
title (str or None, default None) – Plot title.
showlegend (bool, default True) – Whether to show a plot legend.
actual_mode (str, default “lines+markers”) – How to show the actuals. Options: markers, lines, lines+markers
actual_points_color (str, default “rgba(250, 43, 20, 0.7)”) – Color of actual line/marker.
actual_points_size (float, default 2.0) – Size of actual markers. Only used if “markers” is in actual_mode.
actual_color_opacity (float or None, default 1.0) – Opacity of actual values points.
forecast_curve_color (str, default “rgba(0, 90, 181, 0.7)”) – Color of forecasted values.
forecast_curve_dash (str, default “solid”) – ‘dash’ property of forecast scatter.line. One of: ['solid', 'dot', 'dash', 'longdash', 'dashdot', 'longdashdot'] or a string containing a dash length list in pixels or percentages (e.g. '5px 10px 2px 2px', '5, 10, 2, 2', '10% 20% 40%')
ci_band_color (str, default “rgba(0, 90, 181, 0.15)”) – Fill color of the prediction bands.
ci_boundary_curve_color (str, default “rgba(0, 90, 181, 0.5)”) – Color of the prediction upper/lower lines.
ci_boundary_curve_width (float, default 0.0) – Width of the prediction upper/lower lines. default 0.0 (hidden)
vertical_line_color (str, default “rgba(100, 100, 100, 0.9)”) – Color of the vertical line indicating train end date. Default is dark gray with opacity of 0.9.
vertical_line_width (float, default 1.0) – width of the vertical line indicating train end date

Returns

fig – Plotly figure of forecast against actuals, with prediction intervals if available.

Can show, convert to HTML, update:

```
# show figure
fig.show()

# get HTML string, write to file
fig.to_html(include_plotlyjs=False, full_html=True)
fig.write_html("figure.html", include_plotlyjs=False, full_html=True)

# customize layout (https://plot.ly/python/v3/user-guide/)
update_layout = dict(
    yaxis=dict(title="new ylabel"),
    title_text="new title",
    title_x=0.5,
    title_font_size=30)
fig.update_layout(update_layout)
```

Return type

plotly.graph_objects.Figure

greykite.common.features.timeseries_impute.impute_with_lags(df, value_col, orders, agg_func='mean', iter_num=1)

Imputes the timeseries values in value_col of df using chosen lagged values or an aggregate of them. For example, for daily data one could use the 7th lag to impute with the value from the same day of the previous week, as opposed to the closest available value, which can be inferior for business-related timeseries.

The imputation can be repeated by specifying iter_num, which in some cases decreases the number of missing values. Note that, by design, this method does not guarantee that all missing values are imputed. However, the original and final numbers of missing values are returned by the function along with the imputed dataframe.

Parameters

df (pandas.DataFrame) – Input dataframe which must include value_col as a column.
value_col (str) – The column name in df representing the values of the timeseries.
orders (list of int) – The lag orders to be used for aggregation.
agg_func ("mean" or callable, default: "mean") – pandas.Series -> float An aggregation function to aggregate the chosen lags. If “mean”, uses pandas.DataFrame.mean.
iter_num (int, default 1) – Maximum number of iterations to impute the series. Each iteration imputes the series using the provided lag orders (orders) and returns an imputed dataframe. Some values may remain missing after one iteration, and further iterations can impute more of them.

Returns

impute_info – A dictionary with following items:

“df” (pandas.DataFrame): A dataframe with the imputed values.
“initial_missing_num” (int): Initial number of missing values.
“final_missing_num” (int): Final number of missing values after imputations.

Return type

dict
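The role of orders and iter_num can be illustrated with a plain-Python sketch on a list of floats. It is not the library's implementation, which works on a pandas.DataFrame. The function name and the "values" key are hypothetical, and several details are assumptions: lags look backward only, missing lagged values are skipped before averaging, and each iteration reads from the result of the previous one.

```python
import math


def impute_with_lags_sketch(values, orders, iter_num=1):
    """Imputes NaNs with the mean of the available lagged values, repeated iter_num times."""
    values = list(values)
    initial_missing_num = sum(math.isnan(v) for v in values)
    for _ in range(iter_num):
        imputed = list(values)
        for i, value in enumerate(values):
            if not math.isnan(value):
                continue
            # Collect the lagged values that exist and are not missing.
            lagged = [values[i - k] for k in orders
                      if i - k >= 0 and not math.isnan(values[i - k])]
            if lagged:
                imputed[i] = sum(lagged) / len(lagged)
        values = imputed
    return {
        "values": values,
        "initial_missing_num": initial_missing_num,
        "final_missing_num": sum(math.isnan(v) for v in values),
    }


nan = math.nan
series = [1.0, 2.0, nan, nan, nan, nan]
one_pass = impute_with_lags_sketch(series, orders=[2], iter_num=1)
two_pass = impute_with_lags_sketch(series, orders=[2], iter_num=2)
print(one_pass["final_missing_num"], two_pass["final_missing_num"])
```

In the first pass, the last two values cannot be imputed because their lag-2 neighbours are themselves missing. The second pass fills them from the values imputed in the first.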

greykite.common.features.timeseries_impute.impute_with_lags_multi(df, orders, agg_func=<function mean>, iter_num=1, cols=None)

Imputes every column of df using impute_with_lags.

Parameters

df (pandas.DataFrame) – Input dataframe containing the columns to impute.
orders (list of int) – The lag orders to be used for aggregation.
agg_func (callable, default np.mean) – pandas.Series -> float An aggregation function to aggregate the chosen lags.
iter_num (int, default 1) – Maximum number of iterations to impute the series. Each iteration imputes the series using the provided lag orders (orders) and returns an imputed dataframe. Some values may remain missing after one iteration, and further iterations can impute more of them.
cols (list [str] or None, default None) – Which columns to impute. If None, imputes all columns.

Returns

impute_info – A dictionary with following items:

“df” (pandas.DataFrame): A dataframe with the imputed values.
“missing_info” (dict): Dictionary with information about the missing values. Key = name of a column in df. Value = dictionary containing:
- “initial_missing_num” (int): Initial number of missing values.
- “final_missing_num” (int): Final number of missing values after imputation.

Return type

dict

greykite.common.features.adjust_anomalous_data.adjust_anomalous_data(df, time_col, value_col, anomaly_df, start_time_col='start_time', end_time_col='end_time', adjustment_delta_col=None, filter_by_dict=None, filter_by_value_col=None, adjustment_method='add')

This function takes:

- a time series, in the form of a dataframe: df
- the anomaly information, in the form of a dataframe: anomaly_df.

It then adjusts the values of the time series based on the perceived impact of the anomalies given in the column adjustment_delta_col and assigns np.nan if the impact is not given.

Note that anomaly_df can contain the anomaly information for many different timeseries. This is enabled by allowing multiple metrics and dimensions to be listed in the same anomaly dataframe. Columns can indicate the metric name and dimension value.

This function first subsets the anomaly_df to the relevant rows for the value_col as specified by filter_by_dict, then makes the specified adjustments to df.

Parameters

df (pandas.DataFrame) – A data frame which includes the timestamp column as well as the value column.
time_col (str) – The column name in df representing time for the time series data. The time column can be anything that can be parsed by pandas.DatetimeIndex.
value_col (str) – The column name which has the value of interest to be forecasted.
anomaly_df (pandas.DataFrame) –
A dataframe which includes the anomaly information for the input series (df) but potentially for multiple series and dimensions.

This dataframe must include these two columns:
- start_time_col
- end_time_col
and include adjustment_delta_col if it is not None in the function call.
Moreover, if dimensions are requested by passing the filter_by_dict argument (not None), all of this dictionary’s keys must also appear in anomaly_df.
Here is an example:
```
anomaly_df = pd.DataFrame({
    "start_time": ["1/1/2018", "1/4/2018", "1/8/2018", "1/10/2018"],
    "end_time": ["1/2/2018", "1/6/2018", "1/9/2018", "1/10/2018"],
    "adjustment_delta": [np.nan, 3, -5, np.nan],
    # extra columns for filtering
    "metric": ["y", "y", "z", "z"],
    "dimension1": ["level_1", "level_1", "level_2", "level_2"],
    "dimension2": ["level_1", "level_2", "level_1", "level_1"],
})
```
In the above example,
- “start_time” is the start date of the anomaly, which is provided using the argument start_time_col.
- “end_time” is the end date of the anomaly, which is provided using the argument end_time_col.
- “adjustment_delta” is the column which includes the delta if it is known. The name of this column is provided using the argument adjustment_delta_col. Use numpy.nan if the adjustment size is not known, and the adjusted value will be set to numpy.nan.
- “metric”, “dimension1”, and “dimension2” are example columns for filtering. They contain the metric name and dimensions for which the anomaly is applicable. filter_by_dict is used to filter on these columns to get the relevant anomalies for the timeseries represented by df[value_col].
start_time_col (str, default START_TIME_COL) – The column name in anomaly_df representing the start timestamp of the anomalous period, inclusive. The format can be anything that can be parsed by pandas DatetimeIndex.
end_time_col (str, default END_TIME_COL) – The column name in anomaly_df representing the end timestamp of the anomalous period, inclusive. The format can be anything that can be parsed by pandas DatetimeIndex.
adjustment_delta_col (str or None, default None) –
The column name in anomaly_df for the impact delta of the anomalies on the values of the series.

If the value is available, it will be used to adjust the timeseries values in the given period by adding or subtracting this value to the raw series values in that period. Whether to add or subtract is specified by adjustment_method. If the value for a row is “” or np.nan, the adjusted value is set to np.nan.

If adjustment_delta_col is None, all adjusted values are set to np.nan.
filter_by_dict (dict [str, any] or None, default None) –
A dictionary whose keys are column names of anomaly_df, and values are the desired value for that column (e.g. a string or int). If the value is an iterable (list, tuple, set), then it enumerates all allowed values for that column.

This dictionary is used to filter anomaly_df to the matching anomalies. This helps when the anomaly_df includes the anomalies for various metrics and dimensions, so matching is needed to get the relevant anomalies for df.

Columns in anomaly_df can contain information on metric name, metric dimension (e.g. mobile/desktop), issue severity, etc. for filtering.
filter_by_value_col (str or None, default None) –
If provided, {filter_by_value_col: value_col} is added to filter_by_dict for filtering. This filters anomaly_df to rows where anomaly_df[filter_by_value_col] == value_col.

If value_col is the metric name, this is a convenient way to find anomalies matching the metric name.
adjustment_method (str (“add” or “subtract”), default “add”) –
How the adjustment in anomaly_df should be used to adjust the value in df.
- If “add”, the value in adjustment_delta_col is added to the original value.
- If “subtract”, it is subtracted from the original value.

Returns

Result – A dictionary with the following items (specified by key):

”adjusted_df”: pandas.DataFrame
A dataframe identical to the input dataframe df, but with value_col updated to the adjusted values.
”augmented_df”: pandas.DataFrame
A dataframe identical to the input dataframe df, with two extra columns:
- ANOMALY_COL: Anomaly labels for the time series. 1 and 0 indicate anomalous and non-anomalous points, respectively.
- f"adjusted_{value_col}": Adjusted values.

value_col retains the original values. This is useful to inspect which values have changed.

Return type

dict
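
The filtering and adjustment rules described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the helper names matches_filter and adjust_values, the rows-as-dicts representation, the column names "start_time" and "end_time", and the use of comparable date strings as timestamps are all hypothetical. Overlapping anomalies are not handled specially.

```python
import math

def matches_filter(row, filter_by_dict):
    """Returns True if the row satisfies every entry of filter_by_dict."""
    for col, allowed in (filter_by_dict or {}).items():
        # An iterable (list, tuple, set) enumerates all allowed values.
        if isinstance(allowed, (list, tuple, set)):
            if row[col] not in allowed:
                return False
        elif row[col] != allowed:
            return False
    return True

def adjust_values(series, anomalies, filter_by_dict=None, adjustment_method="add"):
    """series: list of (timestamp, value). anomalies: list of dict rows."""
    sign = 1.0 if adjustment_method == "add" else -1.0
    relevant = [a for a in anomalies if matches_filter(a, filter_by_dict)]
    adjusted = []
    for ts, value in series:
        new_value = value
        for a in relevant:
            # Both interval endpoints are inclusive.
            if a["start_time"] <= ts <= a["end_time"]:
                delta = a.get("adjustment_delta")
                if delta is None or delta == "" or math.isnan(delta):
                    new_value = math.nan  # unknown adjustment size
                else:
                    new_value = value + sign * delta
        adjusted.append((ts, new_value))
    return adjusted

series = [("2020-01-01", 10.0), ("2020-01-02", 12.0), ("2020-01-03", 11.0)]
anomalies = [
    {"start_time": "2020-01-02", "end_time": "2020-01-02",
     "adjustment_delta": 3.0, "metric": "y"},
    {"start_time": "2020-01-03", "end_time": "2020-01-03",
     "adjustment_delta": math.nan, "metric": "y"},
    {"start_time": "2020-01-01", "end_time": "2020-01-01",
     "adjustment_delta": 5.0, "metric": "other"},
]
adjusted = adjust_values(
    series, anomalies, filter_by_dict={"metric": "y"}, adjustment_method="subtract")
```

In this example the anomaly for metric "other" is filtered out, the second point is reduced by its known delta, and the third point becomes NaN because its delta is unknown.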

greykite.common.evaluation.r2_null_model_score(y_true, y_pred, y_pred_null=None, y_train=None, loss_func=<function mean_squared_error>)[source]

Calculates improvement in the loss function compared to the predictions of a null model. Can be used to evaluate model quality with respect to a simple baseline model.

The score is defined as:

R2_null_model_score = 1.0 - loss_func(y_true, y_pred) / loss_func(y_true, y_pred_null)

Parameters

y_true (list [float] or numpy.array) – Observed response (usually on a test set).
y_pred (list [float] or numpy.array) – Model predictions (usually on a test set).
y_pred_null (list [float] or numpy.array or None) – A baseline prediction model to compare against. If None, derived from y_train or y_true.
y_train (list [float] or numpy.array or None) – Response values in the training data. If y_pred_null is None, then y_pred_null is set to the mean of y_train. If y_train is also None, then y_pred_null is set to the mean of y_true.
loss_func (callable, default sklearn.metrics.mean_squared_error) – The error loss function with signature (true_values, predicted_values).

Returns

r2_null_model – A value within (-infty, 1.0]. Higher scores are better. Can be interpreted as the improvement in the loss function compared to the predictions of the null model. For example, a score of 0.74 means the loss is 74% lower than for the null model.

Return type

float

Notes

There is a connection between R2_null_model_score and R2. R2_null_model_score can be interpreted as the additional improvement in the coefficient of determination (i.e. R2, see sklearn.metrics.r2_score) with respect to a null model.

Under the default settings of this function, where loss_func is mean squared error and y_pred_null is the average of y_true, the scores are equivalent:

# simplified definition of R2_score, where SSE is sum of squared error
y_true_avg = np.repeat(np.average(y_true), y_true.shape[0])
R2_score := 1.0 - SSE(y_true, y_pred) / SSE(y_true, y_true_avg)
R2_score := 1.0 - MSE(y_true, y_pred) / VAR(y_true)  # equivalent definition

r2_null_model_score(y_true, y_pred) == r2_score(y_true, y_pred)

r2_score is 0 if simply predicting the mean (y_pred = y_true_avg).

If y_pred_null is passed, and if loss_func is mean squared error and y_true has nonzero variance, this function measures how much “r2_score of the predictions (y_pred)” closes the gap between “r2_score of the null model (y_pred_null)” and the “r2_score of the best possible model (y_true)”, which is 1.0:

R2_pred = r2_score(y_true, y_pred)       # R2 of predictions
R2_null = r2_score(y_true, y_pred_null)  # R2 of null model
r2_null_model_score(y_true, y_pred, y_pred_null) == (R2_pred - R2_null) / (1.0 - R2_null)

When y_pred_null=y_true_avg, R2_null is 0 and this reduces to the formula above.

Summary (for loss_func=mean_squared_error):

If R2_null>0 (good null model), then R2_null_model_score < R2_score

If R2_null=0 (uninformative null model), then R2_null_model_score = R2_score

If R2_null<0 (poor null model), then R2_null_model_score > R2_score

For other loss functions, r2_null_model_score has the same connection to pseudo R2.
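
The relationships in these notes can be checked numerically. The following is a minimal sketch in plain Python that re-implements the score from the definition above. The function names mse, r2, and r2_null_model_score_sketch are hypothetical stand-ins, not the library's code.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - MSE(y_true, y_pred) / VAR(y_true)."""
    avg = sum(y_true) / len(y_true)
    return 1.0 - mse(y_true, y_pred) / mse(y_true, [avg] * len(y_true))

def r2_null_model_score_sketch(y_true, y_pred, y_pred_null=None, y_train=None,
                               loss_func=mse):
    if y_pred_null is None:
        # Null model: mean of y_train if given, otherwise mean of y_true.
        base = y_train if y_train is not None else y_true
        y_pred_null = [sum(base) / len(base)] * len(y_true)
    return 1.0 - loss_func(y_true, y_pred) / loss_func(y_true, y_pred_null)

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.5]
y_pred_null = [2.0, 2.0, 3.0, 3.0]

r2_pred = r2(y_true, y_pred)       # R2 of predictions
r2_null = r2(y_true, y_pred_null)  # R2 of null model
score_default = r2_null_model_score_sketch(y_true, y_pred)
score_vs_null = r2_null_model_score_sketch(y_true, y_pred, y_pred_null)
```

With the default null model, score_default agrees with r2_pred. With the explicit null model, which has a positive R2 here, score_vs_null equals (r2_pred - r2_null) / (1.0 - r2_null) and is smaller than r2_pred.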

greykite.common.evaluation.mean_interval_score(observed, lower, upper, coverage)[source]

Calculates the mean interval score. If an observed value falls within the interval, the score is simply the width of the interval. If an observed value falls outside the interval, the score is the width of the interval plus an error term proportional to distance between the actual and its closest interval boundary. The proportionality constant is 2.0 / (1.0 - coverage). See Strictly Proper Scoring Rules, Prediction, and Estimation, Tilmann Gneiting and Adrian E. Raftery, 2007, Journal of the American Statistical Association, Volume 102, 2007 - Issue 477.

Parameters

observed (pandas.Series or numpy.array) – Numeric, observed values.
lower (pandas.Series or numpy.array) – Numeric, lower bound.
upper (pandas.Series or numpy.array) – Numeric, upper bound.
coverage (float) – Intended coverage of the prediction bands (0.0 to 1.0)

Returns

mean_interval_score – The mean interval score.

Return type

float
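
As an illustration of the definition above, here is a minimal sketch in plain Python. The name mean_interval_score_sketch is hypothetical, and the library version operates on pandas.Series or numpy arrays rather than lists.

```python
def mean_interval_score_sketch(observed, lower, upper, coverage):
    """Interval width, plus a penalty for observations outside the interval."""
    penalty = 2.0 / (1.0 - coverage)  # proportionality constant
    scores = []
    for obs, lo, hi in zip(observed, lower, upper):
        score = hi - lo  # width of the interval
        if obs < lo:
            score += penalty * (lo - obs)  # distance below the lower bound
        elif obs > hi:
            score += penalty * (obs - hi)  # distance above the upper bound
        scores.append(score)
    return sum(scores) / len(scores)

observed = [5.0, 12.0, 1.0]
lower = [4.0, 8.0, 2.0]
upper = [6.0, 10.0, 4.0]
score = mean_interval_score_sketch(observed, lower, upper, coverage=0.75)
```

With coverage=0.75 the penalty constant is 8.0. The three points score 2.0 (inside), 2.0 + 8.0 * 2.0 = 18.0 (above), and 2.0 + 8.0 * 1.0 = 10.0 (below), so the mean is 10.0.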

greykite.framework.pipeline.utils.get_score_func_with_aggregation(score_func, greater_is_better=None, agg_periods=None, agg_func=None, relative_error_tolerance=None)[source]

Returns a score function that pre-aggregates inputs according to agg_func, and filters out invalid true values before evaluation. This allows fitting the model at a granular level, yet evaluating at a coarser level.

Also returns the proper direction and short name for the score function.

Parameters

score_func (str or callable) – If callable, a function that maps two arrays to a number: (true, predicted) -> score.
greater_is_better (bool or None, default None) – True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better. Must be provided if score_func is a callable (custom function). Ignored if score_func is a string, because the direction is known.
agg_periods (int or None, default None) – Number of periods to aggregate before evaluation. The model is fit at the original frequency, and the forecast is aggregated according to agg_periods. E.g. fit the model on hourly data, and evaluate performance at the daily level. If None, does not apply aggregation.
agg_func (callable or None, default None) – Takes an array and returns a number, e.g. np.max, np.sum. Used to aggregate data prior to evaluation (applied to actual and predicted). Ignored if agg_periods is None.
relative_error_tolerance (float or None, default None) – Threshold to compute the FRACTION_OUTSIDE_TOLERANCE metric, defined as the fraction of forecasted values whose relative error is strictly greater than relative_error_tolerance. For example, 0.05 allows for 5% relative error. Required if score_func is FRACTION_OUTSIDE_TOLERANCE.

Returns

score_func (callable) – Scorer with pre-aggregation function and filter.
greater_is_better (bool) – Whether greater_is_better for the scorer. Uses the provided greater_is_better if the provided score_func is a callable. Otherwise, looks up the direction.
short_name (str) – Canonical short name for the score_func.
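
To illustrate why evaluating at a coarser level can differ from evaluating at the original frequency, here is a minimal sketch in plain Python. The names aggregate and with_aggregation are hypothetical. The chunking rule (chunks start at the first value and an incomplete trailing chunk is dropped) and the relative error definition |predicted - true| / |true| are assumptions for the example. Filtering of invalid true values is omitted.

```python
def aggregate(values, agg_periods, agg_func):
    """Applies agg_func to consecutive chunks of agg_periods values.

    Assumption: chunks start at the first value; an incomplete
    trailing chunk is dropped.
    """
    n_full = len(values) // agg_periods
    return [agg_func(values[i * agg_periods:(i + 1) * agg_periods])
            for i in range(n_full)]

def with_aggregation(score_func, agg_periods=None, agg_func=None):
    """Wraps score_func so that inputs are aggregated before scoring."""
    if agg_periods is None:
        return score_func  # agg_func is ignored in this case
    def aggregated_score(y_true, y_pred):
        return score_func(
            aggregate(y_true, agg_periods, agg_func),
            aggregate(y_pred, agg_periods, agg_func))
    return aggregated_score

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def fraction_outside_tolerance(y_true, y_pred, relative_error_tolerance=0.2):
    """Fraction of points whose relative error strictly exceeds the tolerance."""
    outside = [abs(p - t) / abs(t) > relative_error_tolerance
               for t, p in zip(y_true, y_pred)]
    return sum(outside) / len(outside)

y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_pred = [2.0, 2.0, 2.0, 5.0, 5.0, 5.0]

granular_mae = mean_absolute_error(y_true, y_pred)
aggregated_mae = with_aggregation(
    mean_absolute_error, agg_periods=3, agg_func=sum)(y_true, y_pred)
granular_fot = fraction_outside_tolerance(y_true, y_pred)
```

Here the point-level errors cancel within each block of three periods, so the aggregated error is zero even though the granular error is not.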

greykite.framework.pipeline.utils.get_hyperparameter_searcher(hyperparameter_grid, model, cv=None, hyperparameter_budget=None, n_jobs=1, verbose=1, **kwargs) → RandomizedSearchCV[source]

Returns a RandomizedSearchCV object for hyperparameter tuning via cross validation.

sklearn.model_selection.RandomizedSearchCV runs a full grid search if hyperparameter_budget is sufficient to exhaust the full hyperparameter_grid, otherwise it samples uniformly at random from the space.

Parameters

hyperparameter_grid (dict or list [dict]) –
Dictionary with parameters names (string) as keys and distributions or lists of parameters to try. Distributions must provide a rvs method for sampling (such as those from scipy.stats.distributions). Lists of parameters are sampled uniformly.

May also be a list of such dictionaries to avoid undesired combinations of parameters. Passed as param_distributions to sklearn.model_selection.RandomizedSearchCV, see docs for more info.
model (estimator object) – An object of that type is instantiated for each grid point. This is assumed to implement the scikit-learn estimator interface.
cv (int, cross-validation generator, iterable, or None, default None) – Determines the cross-validation splitting strategy. See sklearn.model_selection.RandomizedSearchCV.
hyperparameter_budget (int or None, default None) –
Maximum number of hyperparameter sets to try within the hyperparameter_grid search space. If None, uses defaults:
- exhaustive grid search if all values are constant
- 10 if any value is a distribution to sample from
n_jobs (int or None, default 1) – Number of jobs to run in parallel (the maximum number of concurrently running workers). -1 uses all CPUs. -2 uses all CPUs but one. None is treated as 1 unless in a joblib.Parallel backend context that specifies otherwise.
verbose (int, default 1) –
Verbosity level during CV.
- if > 0, prints number of fits
- if > 1, prints fit parameters, total score + fit time
- if > 2, prints train/test scores
kwargs (additional parameters) –
Keyword arguments to pass to get_scoring_and_refit. Accepts the following parameters:
- "score_func"
- "score_func_greater_is_better"
- "cv_report_metrics"
- "agg_periods"
- "agg_func"
- "relative_error_tolerance"

Returns

grid_search – Object that can run randomized search on hyper parameters.

Return type

sklearn.model_selection.RandomizedSearchCV

greykite.framework.pipeline.utils.get_scoring_and_refit(score_func='MeanAbsolutePercentError', score_func_greater_is_better=False, cv_report_metrics=None, agg_periods=None, agg_func=None, relative_error_tolerance=None)[source]

Provides scoring and refit parameters for RandomizedSearchCV.

Together, scoring and refit specify which metrics to evaluate and how to evaluate the predictions on the test set to identify the optimal model.

Notes

Sets greater_is_better=True in scoring for all metrics to report them with their original sign, and properly accounts for this in refit to extract the best index.

Pass both scoring and refit to RandomizedSearchCV.

Parameters

score_func (str or callable, default EvaluationMetricEnum.MeanAbsolutePercentError.name) – Score function used to select optimal model in CV. If a callable, takes arrays y_true, y_pred and returns a float. If a string, must be either a EvaluationMetricEnum member name or FRACTION_OUTSIDE_TOLERANCE.
score_func_greater_is_better (bool, default False) – True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better. Must be provided if score_func is a callable (custom function). Ignored if score_func is a string, because the direction is known.
cv_report_metrics (CV_REPORT_METRICS_ALL, or list [str], or None, default None) –
Additional metrics to compute during CV, besides the one specified by score_func.
- If the string constant greykite.common.constants.CV_REPORT_METRICS_ALL, computes all metrics in EvaluationMetricEnum. Also computes FRACTION_OUTSIDE_TOLERANCE if relative_error_tolerance is not None. The results are reported by the short name (.get_metric_name()) for EvaluationMetricEnum members and FRACTION_OUTSIDE_TOLERANCE_NAME for FRACTION_OUTSIDE_TOLERANCE.
- If a list of strings, each of the listed metrics is computed. Valid strings are greykite.common.evaluation.EvaluationMetricEnum member names and FRACTION_OUTSIDE_TOLERANCE.
  
  For example:
  ["MeanSquaredError", "MeanAbsoluteError", "MeanAbsolutePercentError", "MedianAbsolutePercentError", "FractionOutsideTolerance2"]
- If None, no additional metrics are computed.
agg_periods (int or None, default None) – Number of periods to aggregate before evaluation. The model is fit at the original frequency, and the forecast is aggregated according to agg_periods. E.g. fit the model on hourly data, and evaluate performance at the daily level. If None, does not apply aggregation.
agg_func (callable or None, default None) – Takes an array and returns a number, e.g. np.max, np.sum. Used to aggregate data prior to evaluation (applied to actual and predicted). Ignored if agg_periods is None.
relative_error_tolerance (float or None, default None) – Threshold to compute the FRACTION_OUTSIDE_TOLERANCE metric, defined as the fraction of forecasted values whose relative error is strictly greater than relative_error_tolerance. For example, 0.05 allows for 5% relative error. If None, the metric is not computed.

Returns

scoring (dict) – A dictionary of metrics to evaluate for each CV split. The key is the metric name, the value is an instance of evaluation_PredictScorerDF generated by make_scorer_df.

The value has a score method that takes actual and predicted values and returns a single number.

There is one item in the dictionary for score_func and an additional item for each additional element in cv_report_metrics.
- The key for score_func if it is a callable is CUSTOM_SCORE_FUNC_NAME.
- The key for EvaluationMetricEnum member name is the short name from .get_metric_name().
- The key for FRACTION_OUTSIDE_TOLERANCE is FRACTION_OUTSIDE_TOLERANCE_NAME.
See RandomizedSearchCV.
refit (callable) – Callable that takes cv_results_ from grid search and returns the best index.

See RandomizedSearchCV.

greykite.framework.pipeline.utils.get_best_index(results, metric='score', greater_is_better=False)[source]

Suitable for use as the refit parameter to RandomizedSearchCV, after wrapping with functools.partial.

Callable that takes cv_results_ from grid search and returns the best index.

Parameters

results (dict [str, numpy.array]) – Results from CV grid search. See RandomizedSearchCV cv_results_ attribute for the format.
metric (str, default “score”) – Which metric to use to select the best parameters. In single metric evaluation, the metric name should be “score”. For multi-metric evaluation, the scoring parameter to RandomizedSearchCV is a dictionary, and metric must be a key of scoring.
greater_is_better (bool, default False) – If True, selects the parameters with highest test values for metric. Otherwise, selects those with the lowest test values for metric.

Returns

best_index – Best index to use for refitting the model.

Return type

int

Examples

>>> from functools import partial
>>> from sklearn.model_selection import RandomizedSearchCV
>>> refit = partial(get_best_index, metric="score", greater_is_better=False)
>>> # RandomizedSearchCV(..., refit=refit)
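
Because all metrics are reported with their original sign, the refit callable needs to know the direction of the selected metric. The following sketch shows one way such a selection could work on a cv_results_-like dictionary. The function get_best_index_sketch is a hypothetical stand-in, and selecting on the "mean_test_{metric}" entry is an assumption based on the scikit-learn cv_results_ naming convention. The metric names "MAPE" and "CORR" are example keys.

```python
from functools import partial

def get_best_index_sketch(results, metric="score", greater_is_better=False):
    """Returns the index of the best mean test value for the given metric."""
    scores = list(results[f"mean_test_{metric}"])
    pick = max if greater_is_better else min
    return scores.index(pick(scores))

# Mean test values for three hyperparameter sets and two metrics.
results = {
    "mean_test_MAPE": [12.5, 9.1, 10.3],   # loss: lower is better
    "mean_test_CORR": [0.80, 0.92, 0.95],  # score: higher is better
}

refit_mape = partial(get_best_index_sketch, metric="MAPE", greater_is_better=False)
refit_corr = partial(get_best_index_sketch, metric="CORR", greater_is_better=True)

best_by_mape = refit_mape(results)
best_by_corr = refit_corr(results)
```

The two metrics select different hyperparameter sets here, which is why metric and greater_is_better are fixed with functools.partial before the callable is passed as refit.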

greykite.framework.pipeline.utils.get_forecast(df, trained_model: Pipeline, train_end_date=None, test_start_date=None, forecast_horizon=None, xlabel='ts', ylabel='y', relative_error_tolerance=None) → UnivariateForecast[source]

Runs model predictions on df and creates a UnivariateForecast object.

Parameters

df (pandas.DataFrame) – Has columns cst.TIME_COL, cst.VALUE_COL, to forecast.
trained_model (sklearn.pipeline) – A fitted Pipeline with estimator step and predict function.
train_end_date (datetime.datetime, default None) – Train end date. Passed to UnivariateForecast.
test_start_date (datetime.datetime, default None) – Test start date. Passed to UnivariateForecast.
forecast_horizon (int or None, default None) – Number of periods forecasted into the future. Must be > 0. Passed to UnivariateForecast.
xlabel (str) – Time column to use in representing forecast (e.g. x-axis in plots).
ylabel (str) – Value column to use in representing forecast (e.g. y-axis in plots).
relative_error_tolerance (float or None, default None) – Threshold to compute the Outside Tolerance metric, defined as the fraction of forecasted values whose relative error is strictly greater than relative_error_tolerance. For example, 0.05 allows for 5% relative error. If None, the metric is not computed.

Returns

univariate_forecast – Forecasts represented as a UnivariateForecast object.

Return type

UnivariateForecast

greykite.framework.templates.pickle_utils.dump_obj(obj, dir_name, obj_name='obj', dump_design_info=True, overwrite_exist_dir=False, top_level=True)[source]

Uses DFS to recursively dump an object to pickle files. Originally intended for dumping the ForecastResult instance, but could potentially be used for other objects.

For each object, if it’s picklable, a file with {object_name}.pkl will be generated. Otherwise, depending on its type, a {object_name}.type file will be generated storing its type, and a folder with {object_name} will be generated to store each of its elements/attributes.

For example, if the folder to store results is forecast_result, the items in the folders could be:

timeseries.pkl: a picklable item.

model.type: model is not picklable, this file includes the class (Pipeline)

model: this folder includes the elements in model.

forecast.type: forecast is not picklable, this file includes the class (UnivariateForecast)

forecast: this folder includes the elements in forecast.

backtest.type: backtest is not picklable, this file includes the class (UnivariateForecast)

backtest: this folder includes the elements in backtest.

grid_search.type: grid_search is not picklable, this file includes the class (GridSearchCV)

grid_search: this folder includes the elements in grid_search.

The items in each subfolder follow the same rule.

The currently supported recursion types are:

list/tuple: type name is “list” or “tuple”, each element is attempted to be pickled independently if the entire list/tuple is not picklable. The order is preserved.

OrderedDict: type name is “ordered_dict”, each key and value are attempted to be pickled independently if the entire dict is not picklable. The order is preserved.

dict: type name is “dict”, each key and value are attempted to be pickled independently if the entire dict is not picklable. The order is not preserved.

class instance: type name is the class object, used to create new instance. Each attribute is attempted to be pickled independently if the entire instance is not picklable.

Parameters

obj (object) – The object to be pickled.
dir_name (str) – The directory to store the pickled results.
obj_name (str, default “obj”) – The name for the pickled items. Applies to the top level object only when recursion is used.
dump_design_info (bool, default True) –
Whether to dump the design info in ForecastResult. The design info is specifically for Silverkite and can be accessed from
- ForecastResult.model[-1].model_dict[“x_design_info”]
- ForecastResult.forecast.estimator.model_dict[“x_design_info”]
- ForecastResult.backtest.estimator.model_dict[“x_design_info”]
The design info is a class from patsy and contains a significant number of instances that cannot be pickled directly. Recursively pickling them takes longer to run. If speed is important and you don’t need this information, you can turn it off.
overwrite_exist_dir (bool, default False) – If True and the directory in dir_name already exists, the existing directory will be removed. If False and the directory in dir_name already exists, an exception will be raised.
top_level (bool, default True) – Whether this is the initial call (applied to the root object you want to pickle) rather than a recursive call. When you use this function to dump an object, this parameter should always be True. Only the top-level call checks whether the directory exists, because subsequent recursive calls may write files to the same directory and therefore skip the check. Setting this parameter to False may cause problems.

Return type

The function writes files to local directory and does not return anything.
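
The recursive idea described above can be sketched with the standard library alone. This is a simplified illustration, not the library's implementation: dump_sketch and make_unpicklable are hypothetical names, the type is stored as plain text rather than as a class object, dictionary keys are assumed to be strings, and the exact folder layout is an assumption.

```python
import os
import pickle
import tempfile

def dump_sketch(obj, dir_name, obj_name="obj"):
    """Pickles obj to dir_name/obj_name.pkl, or recurses into its parts."""
    os.makedirs(dir_name, exist_ok=True)
    try:
        payload = pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        payload = None
    if payload is not None:
        with open(os.path.join(dir_name, f"{obj_name}.pkl"), "wb") as f:
            f.write(payload)
        return
    # Not picklable as a whole: record the type, then dump each part.
    if isinstance(obj, (list, tuple)):
        type_name = type(obj).__name__
        parts = {str(i): item for i, item in enumerate(obj)}
    elif isinstance(obj, dict):
        type_name = "dict"
        parts = {str(key): value for key, value in obj.items()}
    elif hasattr(obj, "__dict__"):
        type_name = type(obj).__qualname__
        parts = vars(obj)
    else:
        raise TypeError(f"Cannot dump {obj_name!r} of type {type(obj)}")
    with open(os.path.join(dir_name, f"{obj_name}.type"), "w") as f:
        f.write(type_name)
    for name, part in parts.items():
        dump_sketch(part, os.path.join(dir_name, obj_name), name)

def make_unpicklable():
    class Local:  # instances of a function-local class cannot be pickled
        def __init__(self):
            self.alpha = 0.5
            self.steps = ["scale", "fit"]
    return Local()

with tempfile.TemporaryDirectory() as tmp:
    result = {"timeseries": [1, 2, 3], "model": make_unpicklable()}
    dump_sketch(result, tmp, "forecast_result")
    # Collect the relative paths of all written files.
    written = sorted(
        os.path.relpath(os.path.join(root, name), tmp).replace(os.sep, "/")
        for root, _, files in os.walk(tmp)
        for name in files)
```

In this sketch the picklable list is written as a single .pkl file, while the unpicklable dictionary and the unpicklable instance each get a .type file plus a folder holding their parts.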

greykite.framework.templates.pickle_utils.load_obj(dir_name, obj=None, load_design_info=True)[source]

Loads the files pickled by dump_obj. Originally intended for loading the ForecastResult instance, but could potentially be used for other objects.

Parameters

dir_name (str) – The directory that stores the pickled files. Must be the top level dir when having nested pickling results.
obj (object, default None) – The object type for the next-level files. Can be one of “list”, “tuple”, “dict”, “ordered_dict” or a class.
load_design_info (bool, default True) –
Whether to load the design info in ForecastResult. The design info is specifically for Silverkite and can be accessed from
- ForecastResult.model[-1].model_dict[“x_design_info”]
- ForecastResult.forecast.estimator.model_dict[“x_design_info”]
- ForecastResult.backtest.estimator.model_dict[“x_design_info”]
The design info is a class from patsy and contains a significant number of instances that cannot be pickled directly. Recursively loading them takes longer to run. If speed is important and you don’t need this information, you can turn it off.

Returns

result – The loaded object from the pickled files.

Return type

object

class greykite.common.data_loader.DataLoader[source]

Returns datasets included in the library in pandas.DataFrame format.

available_datasets

The names of the available datasets.

Type: list [str]

static get_data_home(data_dir=None, data_sub_dir=None)[source]

Returns the folder path data_dir/data_sub_dir. If data_dir is None returns the internal data directory. By default the Greykite data dir is set to a folder named ‘data’ in the project source code. Alternatively, it can be set programmatically by giving an explicit folder path.

Parameters

data_dir (str or None, default None) – The path to the input data directory.
data_sub_dir (str or None, default None) – The name of the input data sub directory. Updates path by appending to the data_dir at the end. If None, data_dir path is unchanged.

Returns

data_home – Path to the data folder.

Return type

str

static get_data_names(data_path)[source]

Returns the names of the .csv and .csv.xz files in data_path.

Parameters: data_path (str) – Path to the data folder.
Returns: file_names – The names of the .csv and .csv.xz files in data_path.
Return type: list [str]

static get_aggregated_data(df, agg_freq=None, agg_func=None)[source]

Returns aggregated data.

Parameters

df (pandas.DataFrame) – The input data must have TIME_COL (“ts”) column and the columns in the keys of agg_func.
agg_freq (str or None, default None) – If None, data will not be aggregated and will include all columns. Possible values: “hourly”, “daily”, “weekly”, or “monthly”.
agg_func (Dict [str, str] or None, default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input can be {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.

Returns

df – The aggregated dataframe.
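
The roles of agg_freq and agg_func can be illustrated with a small standard-library sketch for the “daily” case. The library itself works on pandas.DataFrame objects. The helper aggregate_daily, the AGG_FUNCS lookup table, and the rows-as-dicts representation are hypothetical and only meant to show how each column gets its own aggregating function.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median

# Maps function names, as used in agg_func, to callables.
AGG_FUNCS = {"sum": sum, "mean": mean, "median": median, "max": max, "min": min}

def aggregate_daily(rows, agg_func):
    """rows: list of dicts with a "ts" datetime. agg_func: {column: function name}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["ts"].date()].append(row)  # group rows by calendar day
    result = []
    for day in sorted(groups):
        out = {"ts": day}
        for col, func_name in agg_func.items():
            out[col] = AGG_FUNCS[func_name]([r[col] for r in groups[day]])
        result.append(out)
    return result

rows = [
    {"ts": datetime(2020, 1, 1, 0), "count": 3, "tmax": 10.0},
    {"ts": datetime(2020, 1, 1, 12), "count": 5, "tmax": 10.0},
    {"ts": datetime(2020, 1, 2, 0), "count": 4, "tmax": 8.0},
]
daily = aggregate_daily(rows, {"count": "sum", "tmax": "mean"})
```

Only the columns named in agg_func appear in the output, with counts summed and temperatures averaged per day.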

Return type

pandas.DataFrame

get_data_inventory()[source]

Returns the names of the available internal datasets.

Returns: file_names – The names of the available internal datasets.
Return type: list [str]

get_df(data_path, data_name)[source]

Returns a pandas.DataFrame containing the dataset from data_path/data_name. The input data must be in .csv or .csv.xz format. Raises a ValueError if the specified input file is not found.

Parameters

data_path (str) – Path to the data folder.
data_name (str) – Name of the csv file to be loaded from. For example ‘peyton_manning’.

Returns

df – Input dataset.

Return type

pandas.DataFrame

load_peyton_manning()[source]

Loads the Daily Peyton Manning dataset.

This dataset contains log daily page views for the Wikipedia page for Peyton Manning. It is one of the primary datasets used for demonstrations of the Facebook Prophet algorithm. Source: https://github.com/facebook/prophet/blob/main/examples/example_wp_log_peyton_manning.csv

Below is the dataset attribute information:

- ts : date of the page view
- y : log of the number of page views

Returns

df –

Has the following columns:

- “ts” : date of the page view.
- “y” : log of the number of page views.

Return type

pandas.DataFrame object with Peyton Manning data.

load_parking(system_code_number=None)[source]

Loads the Hourly Parking dataset. This dataset contains occupancy rates (8:00 to 16:30) from 2016/10/04 to 2016/12/19 from car parks in Birmingham that are operated by NCP from Birmingham City Council. Source: https://archive.ics.uci.edu/ml/datasets/Parking+Birmingham

Licence: UK Open Government Licence (OGL)

Below is the dataset attribute information:

- SystemCodeNumber : car park ID
- Capacity : car park capacity
- Occupancy : car park occupancy rate
- LastUpdated : date and time of the measure

Parameters

system_code_number (str or None, default None) – If None, occupancy rate is averaged across all the SystemCodeNumber. Else only the occupancy rate of the given system_code_number is returned.

Returns

df –

Has the following columns:

- “LastUpdated” : time, rounded to the nearest half hour.
- “Capacity” : car park capacity.
- “Occupancy” : car park occupancy rate.
- “OccupancyRatio” : Occupancy divided by Capacity.

Return type

pandas.DataFrame object with Parking data.

load_bikesharing(agg_freq=None, agg_func=None)[source]

Loads the Hourly Bike Sharing Count dataset with possible aggregations.

This dataset contains the aggregated hourly count of the number of rented bikes. The data also includes weather data: maximum daily temperature (tmax), minimum daily temperature (tmin), and precipitation (pn). The raw bike-sharing data is provided by Capital Bikeshare. Source: https://www.capitalbikeshare.com/system-data

The raw weather data (Baltimore-Washington INTL Airport) comes from https://www.ncdc.noaa.gov/data-access/land-based-station-data

Below is the dataset attribute information:

- ts : hour and date
- count : number of shared bikes
- tmin : minimum daily temperature
- tmax : maximum daily temperature
- pn : precipitation

Parameters

agg_freq, agg_func – Refer to the inputs of the function get_aggregated_data.

Returns

df –

If no freq was specified, the returned data has the following columns:

- “date” : day of year
- “ts” : hourly timestamp
- “count” : number of rented bikes across Washington DC.
- “tmin” : minimum daily temperature
- “tmax” : maximum daily temperature
- “pn” : precipitation

Otherwise, only agg_col column is returned.

Return type

pandas.DataFrame with bikesharing data.

load_solarpower(agg_freq=None, agg_func=None)[source]

Loads the Hourly Solar Power dataset.

This dataset contains the solar power production of an Australian wind farm from August 2019 to July 2020, with original frequency 4-second. We aggregated it to an hourly series and removed any incomplete hours. Source: https://zenodo.org/record/4656027#.YrpHbuzMLGp

Below is the dataset attribute information:

- ts : hourly timestamp
- y : solar power production in MW (megawatt)

Parameters

agg_freq (str or None, default None) – Possible values: “daily”, “weekly”, or “monthly”. If None, data will not be aggregated and will include all columns.
agg_func (Dict [str, str] or None, default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input can be {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.

Returns

df –

Has the following columns:

- “ts” : hourly timestamp
- “y” : solar power production in MW (megawatt)

Return type

pandas.DataFrame object with Solar Power data.

load_windpower(agg_freq=None, agg_func=None)[source]

Loads the Hourly Wind Power dataset.

This dataset contains the wind power production of an Australian wind farm from August 2019 to July 2020, with original frequency 4-second. We aggregated it to an hourly series and removed any incomplete hours. Source: https://zenodo.org/record/4656032#.YrpJTezMLGp

Below is the dataset attribute information:

- ts : hourly timestamp
- y : wind power production in MW (megawatt)

Parameters

agg_freq (str or None, default None) – Possible values: “daily”, “weekly”, or “monthly”. If None, data will not be aggregated and will include all columns.
agg_func (Dict [str, str] or None, default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input can be {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.

Returns

df –

Has the following columns:

- “ts” : hourly timestamp
- “y” : wind power production in MW (megawatt)

Return type

pandas.DataFrame object with Wind Power data.

load_electricity(agg_freq=None, agg_func=None)[source]

Loads the Hourly Electricity dataset.

This dataset contains the hourly consumption (in Kilowatt) of 321 clients from 2012 to 2014 published by Monash. We aggregated them by taking the average across the 321 clients. Source: https://zenodo.org/record/4656140#.YrpKtezMJqs

Below is the dataset attribute information:

- ts : hourly timestamp
- y : average electricity consumption in Kilowatt

Parameters

agg_freq (str or None, default None) – Possible values: “daily”, “weekly”, or “monthly”. If None, data will not be aggregated and will include all columns.
agg_func (Dict [str, str] or None, default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input can be {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.

Returns

df –

Has the following columns:

- “ts” : hourly timestamp
- “y” : average electricity consumption in Kilowatt

Return type

pandas.DataFrame object with Electricity data.

load_sf_traffic(agg_freq=None, agg_func=None)[source]

Loads the Hourly San Francisco Bay Area Traffic dataset.

This dataset contains the road occupancy rates (between 0 and 1) measured by different sensors on San Francisco Bay area freeways from 2015 to 2016. Source: https://zenodo.org/record/4656132#.YrpMxuzMLGp

Below is the dataset attribute information:

- ts : hourly timestamp
- y : average occupancy rate

Parameters

agg_freq (str or None, default None) – Possible values: “daily”, “weekly”, or “monthly”. If None, data will not be aggregated and will include all columns.
agg_func (Dict [str, str] or None, default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input can be {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.

Returns

df –

Has the following columns:

- “ts” : hourly timestamp
- “y” : average occupancy rate

Return type

pandas.DataFrame object with San Francisco Bay Area Traffic data.

load_bitcoin_transactions(agg_freq=None, agg_func=None)[source]

Loads the Daily Bitcoin Transactions dataset.

This dataset contains the number of Bitcoin transactions from 2009 to 2021. The dataset was curated (with missing values filled) by Monash. Source: https://zenodo.org/record/5122101#.YrpNFuzMLGp

Below is the dataset attribute information:

- ts : date
- y : number of transactions

Parameters

agg_freq (str or None, default None) – Possible values: “daily”, “weekly”, or “monthly”. If None, data will not be aggregated and will include all columns.
agg_func (Dict [str, str] or None, default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input can be {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.

Returns

df –

Has the following columns:

“ts” : date
“y” : number of transactions

Return type

pandas.DataFrame object with Bitcoin Transactions data.

load_sunspot()

Loads the Sunspot dataset.

This dataset contains the number of observed sunspots from 1818 to 2020 published by Monash. The original dataset was a daily series, and we aggregate it to a monthly time series more than 200 years long. Source: https://zenodo.org/record/4654722#.YrpQ4uzMLGp

Below is the dataset attribute information:

ts : month start date
y : average number of sunspots

Returns

df –

Has the following columns:

“ts” : date
“y” : average number of sunspots

Return type

pandas.DataFrame object with Sunspot data.
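
The daily-to-monthly aggregation mentioned above can be sketched with pandas. This is a minimal illustration on synthetic data, not the loader's actual code; the synthetic series and the choice of `resample("MS")` with a mean are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for raw daily counts (assumption).
daily = pd.DataFrame({
    "ts": pd.date_range("2020-01-01", "2020-03-31", freq="D"),
})
daily["y"] = np.arange(len(daily), dtype=float)

# "MS" labels each bin with the month start date; mean() gives the
# average daily value within each month.
monthly = (
    daily.resample("MS", on="ts")["y"]
    .mean()
    .reset_index()
)
print(monthly)
```

Each row of `monthly` carries the first day of the month in "ts" and the average of that month's daily values in "y".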

load_fred_housing()

Loads the FRED House Supply dataset.

This dataset contains the monthly house supply in the United States from 1963 to 2021 obtained from FRED. Source: https://fred.stlouisfed.org/series/MSACSR

Below is the dataset attribute information:

ts : month start date
y : monthly supply of new houses

Returns

df –

Has the following columns:

“ts” : date
“y” : monthly supply of new houses

Return type

pandas.DataFrame object with FRED House Supply data.

load_beijing_pm(agg_freq=None, agg_func=None)

Loads the Beijing Particulate Matter (PM2.5) dataset. Source: https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data

This hourly dataset contains PM2.5 readings from the US Embassy in Beijing, together with meteorological data from Beijing Capital International Airport.

The data cover Jan 1st, 2010 to Dec 31st, 2014. Missing data are denoted as NA.

Below is the dataset attribute information:

No : row number
year : year of data in this row
month : month of data in this row
day : day of data in this row
hour : hour of data in this row
pm2.5 : PM2.5 concentration (ug/m^3)
DEWP : dew point (celsius)
TEMP : temperature (celsius)
PRES : pressure (hPa)
cbwd : combined wind direction
Iws : cumulated wind speed (m/s)
Is : cumulated hours of snow
Ir : cumulated hours of rain

Parameters

agg_freq, agg_func – Refer to the inputs of the function get_aggregated_data.

Returns

df –

Has the following columns:

“ts” : hourly timestamp
“year” : year of data in this row
“month” : month of data in this row
“day” : day of data in this row
“hour” : hour of data in this row
“pm” : PM2.5 concentration (ug/m^3)
“dewp” : dew point (celsius)
“temp” : temperature (celsius)
“pres” : pressure (hPa)
“cbwd” : combined wind direction
“iws” : cumulated wind speed (m/s)
“is” : cumulated hours of snow
“ir” : cumulated hours of rain

Return type

pandas.DataFrame with Beijing PM2.5 data.
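
The raw attributes and the returned columns differ in two ways: the returned frame has an hourly “ts” column, and the column names are lowercased with “pm2.5” shortened to “pm”. The sketch below shows one way to get from one layout to the other with pandas. The two sample rows are made up, and the steps are an assumption about how such a mapping could be done, not the loader's actual code.

```python
import numpy as np
import pandas as pd

# Two synthetic rows in the raw column layout (values are made up).
raw = pd.DataFrame({
    "No": [1, 2],
    "year": [2010, 2010],
    "month": [1, 1],
    "day": [1, 1],
    "hour": [0, 1],
    "pm2.5": [np.nan, 129.0],  # missing readings appear as NA
    "DEWP": [-21, -21],
    "TEMP": [-11.0, -12.0],
    "PRES": [1021.0, 1020.0],
    "cbwd": ["NW", "NW"],
    "Iws": [1.79, 4.92],
    "Is": [0, 0],
    "Ir": [0, 0],
})

# pandas can assemble datetimes from columns named year/month/day/hour.
raw["ts"] = pd.to_datetime(raw[["year", "month", "day", "hour"]])

# Drop the row number, shorten "pm2.5" to "pm", and lowercase all names.
df = (
    raw.drop(columns="No")
    .rename(columns={"pm2.5": "pm"})
    .rename(columns=str.lower)
)
# Move "ts" to the front.
df = df[["ts"] + [c for c in df.columns if c != "ts"]]
print(df.columns.tolist())
```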

load_hierarchical_actuals()

Loads hierarchical actuals.

This dataset contains synthetic data that satisfy hierarchical constraints. Consider the 3-level tree with the parent-child relationships below.

         00             # level 0
        /  \
      10    11          # level 1
     / | \  / \
   20 21 22 23 24       # level 2

There is one root node (00) with 2 children. The first child (10) has 3 children. The second child (11) has 2 children.

Let x_{ij} be the value of the j-th node in level i of the tree (the labels ij are shown in the diagram above). We require the value of a parent to equal the sum of the values of its children. There are 3 constraints in this hierarchy, satisfied at all time points:

x_00 = x_10 + x_11

x_10 = x_20 + x_21 + x_22

x_11 = x_23 + x_24

Below is the dataset attribute information:

“ts” : date of the (synthetic) observation
“00” : value for node 00
“10” : value for node 10
“11” : value for node 11
“20” : value for node 20
“21” : value for node 21
“22” : value for node 22
“23” : value for node 23
“24” : value for node 24

Returns

df –

Has the following columns:

“ts” : date of the (synthetic) observation
“00” : value for node 00
“10” : value for node 10
“11” : value for node 11
“20” : value for node 20
“21” : value for node 21
“22” : value for node 22
“23” : value for node 23
“24” : value for node 24

The values satisfy the hierarchical constraints above.

Return type

pandas.DataFrame object with synthetic hierarchical data.
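
To make the three constraints concrete, the sketch below builds a small synthetic frame with the same column layout by drawing the level-2 values at random and summing upward, then checks every constraint at every time point. The random data and the `children` mapping are illustrative assumptions; this is not how the bundled dataset was generated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=5, freq="D")

# Draw the level-2 (leaf) values at random; this is synthetic data.
df = pd.DataFrame(
    rng.uniform(1.0, 10.0, size=(len(dates), 5)),
    columns=["20", "21", "22", "23", "24"],
)
# Build each parent as the sum of its children.
df["10"] = df[["20", "21", "22"]].sum(axis=1)
df["11"] = df[["23", "24"]].sum(axis=1)
df["00"] = df["10"] + df["11"]
df.insert(0, "ts", dates)

# Check all three constraints at every time point.
children = {"00": ["10", "11"], "10": ["20", "21", "22"], "11": ["23", "24"]}
satisfied = all(
    np.allclose(df[parent], df[kids].sum(axis=1))
    for parent, kids in children.items()
)
print(satisfied)
```

The same `children` mapping and `np.allclose` check can be applied to any frame with these nine columns.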

load_hierarchical_forecasts()

Loads hierarchical forecasts.

This dataset contains forecasts for the actuals given by load_hierarchical_actuals. The attributes are the same.

Returns

df –

Has the following columns:

“ts” : date of the forecasted value
“00” : value for node 00
“10” : value for node 10
“11” : value for node 11
“20” : value for node 20
“21” : value for node 21
“22” : value for node 22
“23” : value for node 23
“24” : value for node 24

The forecasts do not satisfy the hierarchical constraints. The index and columns are identical to load_hierarchical_actuals.

Return type

pandas.DataFrame object with forecasts for synthetic hierarchical data.
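
Because the forecasts do not satisfy the hierarchical constraints, it can be useful to measure how far off they are. The sketch below computes, for each parent node, the forecasted parent value minus the sum of its forecasted children. The forecast numbers are made up for illustration and the `children` mapping is an assumption based on the tree described for load_hierarchical_actuals.

```python
import pandas as pd

# Made-up forecasts for two dates; the numbers are illustrative only.
forecasts = pd.DataFrame({
    "ts": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "00": [10.0, 12.0],
    "10": [6.5, 7.0],
    "11": [4.0, 4.0],
    "20": [2.0, 2.0],
    "21": [2.0, 3.0],
    "22": [2.0, 2.0],
    "23": [1.0, 2.0],
    "24": [3.0, 2.5],
})

children = {"00": ["10", "11"], "10": ["20", "21", "22"], "11": ["23", "24"]}

# Parent minus sum of children; zero means the constraint holds.
gaps = pd.DataFrame({
    parent: forecasts[parent] - forecasts[kids].sum(axis=1)
    for parent, kids in children.items()
})
gaps.insert(0, "ts", forecasts["ts"])
print(gaps)
```

A nonzero entry in `gaps` marks a date and parent node where the forecasts are inconsistent with the hierarchy.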

load_data(data_name, **kwargs)

Loads dataset by name from the internal data library.

Parameters: data_name (str) – Dataset to load from the internal data library.
Returns: df
Return type: UnivariateTimeSeries object with data_name.

class greykite.framework.benchmark.data_loader_ts.DataLoaderTS

Returns datasets included in the library in pandas.DataFrame or UnivariateTimeSeries format.

Extends DataLoader

load_peyton_manning_ts()

Loads the Daily Peyton Manning dataset.

This dataset contains the log of daily page views for the Wikipedia page for Peyton Manning. It is one of the primary datasets used for demonstrations of the Facebook Prophet algorithm. Source: https://github.com/facebook/prophet/blob/master/examples/example_wp_log_peyton_manning.csv

Below is the dataset attribute information:

ts : date of the page view
y : log of the number of page views

Returns

ts –

Peyton Manning page views data. Time and value column:

time_col : “ts”
Date of the page view.

value_col : “y”
Log of the number of page views.

Return type

UnivariateTimeSeries

load_parking_ts(system_code_number=None)

Loads the Hourly Parking dataset.

This dataset contains occupancy rates (8:00 to 16:30) from 2016/10/04 to 2016/12/19 for car parks in Birmingham that are operated by NCP from Birmingham City Council. Source: https://archive.ics.uci.edu/ml/datasets/Parking+Birmingham License: UK Open Government Licence (OGL).

Below is the dataset attribute information:

SystemCodeNumber : car park ID
Capacity : car park capacity
Occupancy : car park occupancy rate
LastUpdated : date and time of the measure

Parameters

system_code_number (str or None, default None) – If None, occupancy rate is averaged across all the SystemCodeNumber. Else only the occupancy rate of the given system_code_number is returned.

Returns

ts –

Parking data. Time and value column:

time_col : “LastUpdated”
Date and time of the occupancy rate, rounded to the nearest half hour.

value_col : “OccupancyRatio”
Occupancy divided by Capacity.

Return type

UnivariateTimeSeries
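
The time and value columns described above can be derived from the raw attributes in a few pandas steps. The sketch below is a minimal illustration under stated assumptions: the sample rows and car park IDs are made up, `parking_ratio` is a hypothetical helper, and averaging the per-park ratios within each half-hour slot is one reading of "averaged across all the SystemCodeNumber". It is not the loader's actual code.

```python
import pandas as pd

# Made-up raw rows in the dataset's column layout; car park IDs are hypothetical.
raw = pd.DataFrame({
    "SystemCodeNumber": ["A", "B", "A", "B"],
    "Capacity": [100, 200, 100, 200],
    "Occupancy": [50, 50, 75, 100],
    "LastUpdated": pd.to_datetime([
        "2016-10-04 07:59:42", "2016-10-04 08:01:10",
        "2016-10-04 08:32:05", "2016-10-04 08:28:51",
    ]),
})


def parking_ratio(df, system_code_number=None):
    """Returns the occupancy ratio per half hour (hypothetical helper)."""
    if system_code_number is not None:
        df = df[df["SystemCodeNumber"] == system_code_number]
    df = df.copy()  # avoid mutating the caller's frame
    # Round each measurement time to the nearest half hour.
    df["LastUpdated"] = df["LastUpdated"].dt.round("30min")
    df["OccupancyRatio"] = df["Occupancy"] / df["Capacity"]
    # Average across car parks within each half-hour slot.
    return df.groupby("LastUpdated", as_index=False)["OccupancyRatio"].mean()


all_parks = parking_ratio(raw)
one_park = parking_ratio(raw, system_code_number="A")
print(all_parks)
print(one_park)
```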

load_bikesharing_ts()

Loads the Hourly Bike Sharing Count dataset.

This dataset contains the aggregated hourly count of rented bikes. It also includes weather data: maximum daily temperature (tmax), minimum daily temperature (tmin), and precipitation (pn).

The raw bike-sharing data is provided by Capital Bikeshare. Source: https://www.capitalbikeshare.com/system-data

The raw weather data is from Baltimore-Washington INTL Airport. Source: https://www.ncdc.noaa.gov/data-access/land-based-station-data

Below is the dataset attribute information:

ts : hour and date
count : number of shared bikes
tmin : minimum daily temperature
tmax : maximum daily temperature
pn : precipitation

Returns

ts –

Bike Sharing Count data. Time and value column:

time_col : “ts”
Hour and date.

value_col : “y”
Number of rented bikes across Washington DC.

Additional regressors:

“tmin” : minimum daily temperature
“tmax” : maximum daily temperature
“pn” : precipitation

Return type

UnivariateTimeSeries

load_beijing_pm_ts()

Loads the Beijing Particulate Matter (PM2.5) dataset. Source: https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data

This hourly dataset contains PM2.5 readings from the US Embassy in Beijing, together with meteorological data from Beijing Capital International Airport.

The data cover Jan 1st, 2010 to Dec 31st, 2014. Missing data are denoted as NA.

Below is the dataset attribute information:

No : row number
year : year of data in this row
month : month of data in this row
day : day of data in this row
hour : hour of data in this row
pm2.5 : PM2.5 concentration (ug/m^3)
DEWP : dew point (celsius)
TEMP : temperature (celsius)
PRES : pressure (hPa)
cbwd : combined wind direction
Iws : cumulated wind speed (m/s)
Is : cumulated hours of snow
Ir : cumulated hours of rain

Returns

ts –

Beijing PM2.5 data. Time and value column:

time_col : TIME_COL
Hourly timestamp.

value_col : “pm”
PM2.5 concentration (ug/m^3).

Additional regressors:

“dewp” : dew point (celsius)
“temp” : temperature (celsius)
“pres” : pressure (hPa)
“cbwd” : combined wind direction
“iws” : cumulated wind speed (m/s)
“is” : cumulated hours of snow
“ir” : cumulated hours of rain

Return type

UnivariateTimeSeries

load_data_ts(data_name, **kwargs)

Loads dataset by name from the internal data library.

Parameters: data_name (str) – Dataset to load from the internal data library.
Returns: ts – Has the requested data_name.
Return type: UnivariateTimeSeries

static get_aggregated_data(df, agg_freq=None, agg_func=None)

Returns aggregated data.

Parameters

df (pandas.DataFrame) – The input data must have a TIME_COL (“ts”) column and the columns named in the keys of agg_func.
agg_freq (str or None, default None) – If None, data will not be aggregated and will include all columns. Possible values: “hourly”, “daily”, “weekly”, or “monthly”.
agg_func (Dict [str, str], default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input is {“col1”: “mean”, “col2”: “sum”}. If None, data will not be aggregated and will include all columns.

Returns

df – The aggregated dataframe.

Return type

pandas.DataFrame
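
The interaction between agg_freq and agg_func can be illustrated with a plain pandas resample. The sketch below is a stand-in written for this illustration, not the library's implementation: the `aggregate` function and the `FREQ_ALIASES` mapping are hypothetical, and returning the input unchanged when either argument is None is an assumption based on the parameter descriptions above.

```python
import pandas as pd

# Hypothetical mapping from the documented agg_freq names to pandas offsets.
# "60min" is used for hourly to avoid alias differences across pandas versions.
FREQ_ALIASES = {"hourly": "60min", "daily": "D", "weekly": "W", "monthly": "MS"}


def aggregate(df, agg_freq=None, agg_func=None):
    """Sketch of frequency aggregation; not the library's implementation."""
    if agg_freq is None or agg_func is None:
        return df  # no aggregation, all columns kept
    return (
        df.resample(FREQ_ALIASES[agg_freq], on="ts")
        .agg(agg_func)  # one aggregating function per listed column
        .reset_index()
    )


# Two days of synthetic hourly data.
hourly = pd.DataFrame({
    "ts": pd.date_range("2021-01-01", periods=48, freq="60min"),
    "col1": range(48),
    "col2": 1,
})
daily = aggregate(hourly, agg_freq="daily", agg_func={"col1": "mean", "col2": "sum"})
print(daily)
```

Only the columns named in agg_func appear in the aggregated result, alongside the time column.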

static get_data_home(data_dir=None, data_sub_dir=None)

Returns the folder path data_dir/data_sub_dir. If data_dir is None returns the internal data directory. By default the Greykite data dir is set to a folder named ‘data’ in the project source code. Alternatively, it can be set programmatically by giving an explicit folder path.

Parameters

data_dir (str or None, default None) – The path to the input data directory.
data_sub_dir (str or None, default None) – The name of the input data sub directory. Updates path by appending to the data_dir at the end. If None, data_dir path is unchanged.

Returns

data_home – Path to the data folder.

Return type

str
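
The path resolution described above amounts to a conditional join. The sketch below is an illustrative stand-in: `data_home`, `DEFAULT_DATA_DIR`, and the example folder names are hypothetical, and the default used here is not the library's internal data directory.

```python
import os

# Hypothetical default; the library keeps its own internal data folder.
DEFAULT_DATA_DIR = os.path.join(os.getcwd(), "data")


def data_home(data_dir=None, data_sub_dir=None):
    """Joins data_dir and data_sub_dir (illustrative stand-in)."""
    base = DEFAULT_DATA_DIR if data_dir is None else data_dir
    # Append the sub directory only when one is given.
    return base if data_sub_dir is None else os.path.join(base, data_sub_dir)


print(data_home("my_data", "daily"))
```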

get_data_inventory()

Returns the names of the available internal datasets.

Returns: file_names – The names of the available internal datasets.
Return type: list [str]

static get_data_names(data_path)

Returns the names of the .csv and .csv.xz files in data_path.

Parameters: data_path (str) – Path to the data folder.
Returns: file_names – The names of the .csv and .csv.xz files in data_path.
Return type: list [str]
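
A directory scan of this kind can be sketched with the standard library. The `data_names` function below is a hypothetical stand-in, and returning names with the extension stripped is an assumption (suggested by dataset names such as ‘peyton_manning’), not confirmed behavior of get_data_names.

```python
import os
import tempfile


def data_names(data_path):
    """Lists dataset names for .csv and .csv.xz files (illustrative stand-in)."""
    names = []
    for file_name in sorted(os.listdir(data_path)):
        # Check the longer suffix first so ".csv.xz" is stripped whole.
        for suffix in (".csv.xz", ".csv"):
            if file_name.endswith(suffix):
                names.append(file_name[: -len(suffix)])
                break
    return names


# Demonstrate on a temporary folder with two data files and one other file.
with tempfile.TemporaryDirectory() as tmp:
    for file_name in ("a.csv", "b.csv.xz", "notes.txt"):
        open(os.path.join(tmp, file_name), "w").close()
    found = data_names(tmp)

print(found)
```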

get_df(data_path, data_name)

Returns a pandas.DataFrame containing the dataset from data_path/data_name. The input data must be in .csv or .csv.xz format. Raises a ValueError if the specified input file is not found.

Parameters

data_path (str) – Path to the data folder.
data_name (str) – Name of the csv file to be loaded from. For example ‘peyton_manning’.

Returns

df – Input dataset.

Return type